The only guide you need to master classification metrics in machine learning
26 min read · May 15
Supervised machine learning can be divided into two groups of problems: classification and regression. This article aims to be the definitive guide to classification metrics, so if you're an aspiring or junior data scientist, be sure to read it.
First, you might also want to read my guide on the 5 metrics you need to know to master a regression problem.
Secondly, I would like to tell you what you will find here using a table of contents:
Table of contents:
What is a classification problem?
Dealing with class imbalance
What a classification algorithm actually does
Accuracy
Precision and recall
F1 score
The confusion matrix
Sensitivity and specificity
Log loss (cross-entropy)
Categorical cross-entropy
AUC/ROC curve
Precision-recall curve
BONUS: KDE and learning curves
As usual, you'll find Python examples to put the theory into practice.
In a classification problem, data is divided into classes: in other words, our label values represent the class to which the data points belong.
There are two types of classification problems:
- Binary classification problems: In this case, the target values are labeled with a 0 or a 1.
- Multiclass classification problems: In this case, the label takes multiple values (0, 1, 2, 3, etc.) depending on the number of classes.
Let's visualize them. First, let's create a binary classification dataset as follows:
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Generate data: points inside a circle around the origin get label 1
num_samples = 1000
X = np.random.rand(num_samples, 2) * 10 - 5
y = np.zeros(num_samples)
y[np.sum(X ** 2, axis=1) < 5] = 1

# Plot data (both axes are features; the color encodes the label)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Binary Classification Dataset')
plt.show()
So this is an example of a dataset with binary classification: some data points belong to the blue class, others to the red class. Now it doesn't matter what these classes represent. It can be apples or oranges, cars or trains. That does not matter. What is important now is that we have visualized a binary classification problem.
Now let's visualize a multiclass problem:
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Generate data
num_samples = 1000
X = np.random.rand(num_samples, 2) * 10 - 5
y = np.zeros(num_samples, dtype=int)
y[np.sum(X ** 2, axis=1) < 2.5] = 1
y[np.logical_and(X[:, 0] > 2, np.abs(X[:, 1]) < 1)] = 2
y[np.logical_and(X[:, 0] < -2, np.abs(X[:, 1]) < 1)] = 3

# Plot data, one scatter call per class
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 1')
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 2')
plt.scatter(X[y==2, 0], X[y==2, 1], c='green', label='Class 3')
plt.scatter(X[y==3, 0], X[y==3, 1], c='purple', label='Class 4')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Multiclass Classification Dataset')
plt.legend()
plt.show()
So here we have created a classification problem with data points belonging to 4 classes.
One problem with multiclass classification problems is understanding whether all classes are important. Let's see what we mean in the next section.
NOTE: In the case of a binary classification, the classes can be labeled 0 and 1, but they can also be labeled 1 and 2. There is no convention that tells us to start at 0.
This also applies in the case of multiple classes: the classes can be labeled 0, 1, 2, 3 or 1, 2, 3, 4.
Consider the following dataset:
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Class 1: blue
mean1 = [0, 0]
cov1 = [[1, 0], [0, 1]]
num_points1 = 7000
X1 = np.random.multivariate_normal(mean1, cov1, num_points1)

# Class 2: green
mean2 = [3, 3]
cov2 = [[0.5, 0], [0, 0.5]]
num_points2 = 2700
X2 = np.random.multivariate_normal(mean2, cov2, num_points2)

# Class 3: red
mean3 = [-3, 3]
cov3 = [[0.5, 0], [0, 0.5]]
num_points3 = 300
X3 = np.random.multivariate_normal(mean3, cov3, num_points3)

# Plot the data
plt.scatter(X1[:, 0], X1[:, 1], color='blue', s=1, label='Class 1')
plt.scatter(X2[:, 0], X2[:, 1], color='green', s=1, label='Class 2')
plt.scatter(X3[:, 0], X3[:, 1], color='red', s=1, label='Class 3')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Imbalanced Multiclass Classification Dataset')
plt.legend()
plt.show()
As we can see, we have a lot of blue points and also a lot of green points. The red points, on the other hand, are very rare compared to the others.
The question is: should we consider the red points? In other words, can we do our ML analysis after removing the red points because there are too few of them?
The answer is... it depends!
In general, we can ignore the values of one (or more) class(es) with fewer observations than the others. But in certain cases we shouldn't do that! And this is where domain knowledge comes into play.
For example, if we were examining fraud detection in a banking transaction, we would assume that fraudulent transactions are rare compared to standard transactions. This gives us an unbalanced dataset, which means: we can't remove the values belonging to the class with fewer observations!
The same is true when we study something in the medical field. When it comes to rare diseases, we assume that they are … rare! So we expect an unbalanced data set.
In any case, we intentionally created the above datasets for educational purposes. In general it is very difficult to visualize the data points, since we usually have more than two features. One way to assess class imbalance is to plot a histogram of the labels.
Before proceeding...if you don't know the difference between a histogram and a bar chart, read the following article I wrote:
Here's what we can do. Let's create a data set with three labels like the following:
import pandas as pd
import numpy as np

# Make a list of labels
labels = ['1', '2', '3']

# Make a list of features
features = ['feature_1', 'feature_2', 'feature_3']

# Set the number of samples
num_samples = 1000

# Create an empty pandas DataFrame to store the data
data = pd.DataFrame()

# Add the features to the DataFrame
for feature in features:
    data[feature] = np.random.rand(num_samples)

# Add the labels to the DataFrame (drawn uniformly at random)
data['label'] = np.random.choice(labels, num_samples)
Although this dataframe was created on purpose, it is tabular (meaning we can manipulate it with pandas), so it resembles a real-world case. If we show the head we get:
To understand whether or not our data set may be unbalanced, let's plot a histogram like this:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot histogram
sns.histplot(data=data, x='label')

# Write title and axis labels
plt.title('FREQUENCY OF CLASSES', fontsize=14)  # plot title
plt.xlabel('Our labels (our classes)', fontsize=12)  # x-axis label
plt.ylabel('Frequency of the three classes', fontsize=12)  # y-axis label
In this case, the three classes have roughly the same frequency. So the dataset is well balanced and we need to include all labels in our analyses.
A class imbalance, instead, shows up in a histogram like this:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Make a list of labels with class imbalance
labels = ['1'] * 500 + ['2'] * 450 + ['3'] * 50

# Make a list of features
features = ['feature_1', 'feature_2', 'feature_3']

# Shuffle the labels in place
np.random.shuffle(labels)

# Create an empty pandas DataFrame to store the data
data = pd.DataFrame()

# Add the features to the DataFrame
for feature in features:
    data[feature] = np.random.rand(len(labels))

# Add the labels to the DataFrame
data['label'] = labels

# Plot histogram
sns.histplot(data=data, x='label')

# Write title and axis labels
plt.title('FREQUENCY OF CLASSES', fontsize=14)  # plot title
plt.xlabel('Our labels (our classes)', fontsize=12)  # x-axis label
plt.ylabel('Frequency of the three classes', fontsize=12)  # y-axis label
So in a case like this we need to decide, based on the problem, whether class 3 must be kept (we are investigating "rare situations") or can be dropped along with all its associated values (we are investigating "non-rare situations").
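Alongside the histogram, the imbalance can also be read off as plain numbers. The following is a minimal sketch, assuming a pandas DataFrame with a `label` column like the one built above; the toy labels are recreated here only so that the snippet runs on its own.

```python
import pandas as pd

# Toy labels with the same 500/450/50 split used above (recreated to run standalone)
labels = ['1'] * 500 + ['2'] * 450 + ['3'] * 50
data = pd.DataFrame({'label': labels})

# Absolute and relative frequency of each class
counts = data['label'].value_counts()
shares = data['label'].value_counts(normalize=True)

print(counts)
print(shares)
```

A class whose share is only a few percent, like class 3 here, is the one to think about carefully before dropping it.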
Before we delve into the metrics we need to evaluate a classification problem, we need to understand what a classification algorithm actually does.
As we know, we use machine learning to make predictions. This means that we train an ML model on the available data, expecting the predictions to be as close as possible to the actual data.
If you don't know what "training an ML model" actually means, you can read my article here:
So let's look at a binary classification problem. Our ML model takes the features as input and predicts whether the data points belong to class 1 or class 2. If the predictions are "perfect", it means that our model tells us exactly which of the available data are class 1 and which are class 2, with 0 errors. Therefore, all real points belonging to class 1 are predicted to belong to class 1 by our ML model.
Of course, as you can imagine, 0% error is not possible, so we need some metrics to evaluate our ML models.
So, before we get into the metrics, we need some nomenclature:
- We define a true positive (TP) as a data point that belongs to a class and is predicted to belong to that class. For example, if the model predicts that an email is spam and it is actually spam, then that's a true positive.
- We define a true negative (TN) as a data point that does not belong to a class and is predicted not to belong to that class. For example, if the model predicts that an email is not spam and it is in fact not spam, then that is a true negative.
- We define a false positive (FP) as a data point that does not belong to a class but is predicted to belong to it. For example, if the model predicts that an email is spam but it is not actually spam, it is a false positive.
- We define a false negative (FN) as a data point that belongs to a class but is predicted not to belong to it. For example, if the model predicts that an email is not spam but it is, then it is a false negative.
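The four definitions above can be turned into a simple count. This is a small sketch on made-up labels (the arrays `y_true` and `y_pred` are hypothetical, with 1 standing for spam), not data from this article.

```python
import numpy as np

# Hypothetical labels: 1 = spam (positive class), 0 = not spam (negative class)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # spam predicted as spam
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # not spam predicted as not spam
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # not spam predicted as spam
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # spam predicted as not spam

print(tp, tn, fp, fn)
```

Every sample falls into exactly one of the four groups, so the counts always add up to the number of samples.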
As you might imagine, in general we want to minimize the number of false positives and false negatives while maximizing the true positives and true negatives to make the model as accurate as possible. This means that our ML model makes accurate predictions.
But what does "accurate" mean? To understand this, we need to look at our first classification metric.
The first measure we consider is accuracy. Let's look at the formula:
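In terms of the four counts defined above (TP, TN, FP, FN), the standard definition of accuracy is:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```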
So, accuracy is a measure of how often our ML model is correct in its predictions.
Suppose we have a record of emails marked as spam or not-spam. ML allows us to predict whether new emails are spam or not. If the model correctly predicts that 80 out of 100 emails are spam and correctly predicts that 90 out of 100 emails are not spam, then its accuracy would be:
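Plugging the numbers of this example into the definition of accuracy gives the following quick check (the variable names are only illustrative):

```python
# Counts from the example: 100 spam emails and 100 non-spam emails
tp = 80  # spam emails correctly predicted as spam
tn = 90  # non-spam emails correctly predicted as not spam
total = 100 + 100

accuracy = (tp + tn) / total
print(accuracy)
```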
This means that our model can correctly predict the class of an email 85% of the time. A high accuracy score (near 1) indicates the model is performing well, while a low accuracy score (near 0) indicates the model needs improvement. However, accuracy alone is not always the best metric to evaluate a model's performance, especially for imbalanced data sets.
This is because the dominant class has "more data": a model can reach a high accuracy just by predicting the dominant class well. In other words, our model could be biased toward the dominant class.
Let's do an example in Python and create a dataset for it:
import numpy as np
import pandas as pd

# Random seed for reproducibility
np.random.seed(42)

# Create samples
n_samples = 1000
fraud_percentage = 0.05  # share of fraudulent transactions

# Create features and classes
X = np.random.rand(n_samples, 10)
y = np.random.binomial(n=1, p=fraud_percentage, size=n_samples)

# Create data frame
df = pd.DataFrame(X)
df['fraud'] = y
We've created a simple dataframe with 1000 samples that could represent, for example, credit card transactions. The fraudulent class accounts for roughly 5% of all observations, so this dataset is clearly imbalanced.
If our model turns out to be accurate, it may simply be because it is biased by the 95% of observations belonging to the non-fraudulent class. So let's split the dataset, make predictions using a logistic regression model, and print the accuracy:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit logistic regression model to the train set
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

>>>

Accuracy: 0.95
So our model is 95% accurate: Hooray! Now let's define the other metrics and see what they tell us about this data set.
Precision measures a classifier's ability not to label a negative sample as positive. In other words, it measures the proportion of true positives out of all positive predictions. Simply put, precision is how accurate the positive predictions of our model are. This is the formula:
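Using the counts defined earlier (TP and FP), the standard definition of precision is:

```latex
\text{Precision} = \frac{TP}{TP + FP}
```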
For an email spam classification problem, precision measures how many of the emails the model classified as spam are actually spam.
Let's use it in our unbalanced dataset:
from sklearn.metrics import precision_score

# Calculate and print the precision
precision = precision_score(y_test, y_pred)
print('Precision:', precision)

>>>

Precision: 0.0
Ouch! 95% accuracy and 0% precision: what does that mean? It means that the model predicts every sample as negative, that is, non-fraudulent, which of course is not correct. A high precision value, instead, would indicate that a high proportion of the transactions the model flags as fraudulent really are fraudulent.
Then we have recall, a metric that measures the proportion of true positives out of all actual positives. In other words, it measures how many of the actual positives are correctly predicted. Simply put, recall tells us how well our model can find all the positive events in our data. Here is the formula:
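Using the counts defined earlier (TP and FN), the standard definition of recall is:

```latex
\text{Recall} = \frac{TP}{TP + FN}
```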
For an email spam classification problem, the recall measures how many of the actual spam emails in the data set are correctly identified as spam emails by our ML classification.
Let's say we have a dataset of 1000 emails, 200 of which are spam and the rest are legitimate. We train a machine learning model to classify emails as spam or non-spam and it predicts that 100 of the emails are spam.
Precision tells us how many of those 100 predicted spam emails are actually spam. For example, if 90 out of 100 predicted spam emails are actually spam, then the precision is 90%. This means that 90% of all emails that the model predicted as spam are actually spam.
Recall, on the other hand, tells us how many of the actual spam emails the model correctly identified as spam. For example, if the model correctly identified 150 emails as spam out of 200 actual spam emails, the recall rate would be 75%. This means that the model correctly identified 75% of all actual spam emails as spam.
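The two figures can be checked with a couple of lines of Python. Note that the precision and recall examples above are two separate "for example" scenarios, so the sketch treats them independently rather than as one consistent set of predictions.

```python
# Precision scenario: 100 emails predicted as spam, 90 of them really are spam
precision = 90 / 100

# Recall scenario: 200 actual spam emails, 150 of them identified as spam
recall = 150 / 200

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
```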
Now let's use recall on our imbalanced dataset:
from sklearn.metrics import recall_score

# Calculate and print recall
recall = recall_score(y_test, y_pred)
print('Recall:', recall)

>>>

Recall: 0.0
Again, we have 95% accuracy and 0% recall. What does that mean? This still means that the model does not correctly identify fraudulent transactions and instead predicts that all transactions are non-fraudulent. In fact, a high recall value would indicate that the model correctly identifies a high proportion of fraudulent transactions among all actual fraudulent transactions.
So in practice, depending on the problem being studied, we want to achieve a balance between precision and recall. To do this, we often use two other tools that take both into account: the confusion matrix and the F1 score. Let's take a look.
The F1 score is an evaluation metric in machine learning that combines precision and recall into a single value in the 0-1 range. If the F1 score is 0, our ML model is performing poorly. If the F1 score is 1, our ML model is performing well.
This metric balances precision and recall by calculating their harmonic mean. This is a type of mean that is more sensitive to low values, making this metric particularly useful for imbalanced datasets.
Let's look at the formula:
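As the harmonic mean of precision and recall, the F1 score is written as:

```latex
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```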
By now we already know the result we will get for our imbalanced dataset (the F1 score is 0). But let's see how to compute it in Python:
from sklearn.metrics import f1_score

# Calculate and print the F1 score
f1 = f1_score(y_test, y_pred)
print('F1 Score:', f1)

>>>

F1 Score: 0.0
For the purposes of spam classification, let's say we have a dataset of 1,000 emails, 200 of which are spam and the rest are legitimate. We train a machine learning model to classify emails as spam or non-spam and it predicts that 100 of the emails are spam.
To calculate the spam classifier's F1 score, we first need to calculate its precision and recall. Let's assume that out of 100 predicted spam emails, 80 are actually spam. So the precision is 80%. Let's also assume that out of 200 actual spam emails, the model correctly identified 150 as spam. So the recall is 75%.
Now we can calculate the f1 score:
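With a precision of 80% and a recall of 75%, the harmonic mean works out as in this short sketch (variable names are only illustrative):

```python
precision = 0.80
recall = 0.75

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 score: {f1:.3f}")
```

The result is roughly 0.77.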
That's a pretty good result because we're close to 1.
The Confusion Matrix is a table that summarizes the performance of a classification model by showing the number of true positives, false positives, true negatives, and false negatives.
For a binary classification problem, the confusion matrix has two rows and two columns and is represented as follows:
Suppose our model predicted 100 emails as spam, of which 80 were actually spam, and 900 emails as non-spam, of which 20 were actually spam.
The confusion matrix for this example looks like this:
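As a sketch, the numbers of this example can be arranged in a 2x2 array. The layout below is an assumption: it follows scikit-learn's convention (rows are actual classes, columns are predicted classes, negative class first), which may differ from the layout of the original figure.

```python
import numpy as np

# 100 emails predicted as spam, 80 of them actually spam -> TP = 80, FP = 20
# 900 emails predicted as non-spam, 20 of them actually spam -> FN = 20, TN = 880
#               predicted non-spam, predicted spam
cm = np.array([[880, 20],   # actual non-spam: TN, FP
               [20, 80]])   # actual spam:     FN, TP

tn, fp, fn, tp = cm.ravel()

# Precision and recall read directly off the matrix
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(cm)
print(precision, recall)
```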
This is a very useful classification visualization tool for two reasons:
- It helps us calculate precision and recall by making the underlying counts visible.
- It tells us at a glance how the model is doing, without any calculation. In fact, given a classification problem, we want TN and TP to be as high as possible while FP and FN are as low as possible (as close to 0 as possible). So if the values on the main diagonal are high and the values in the other positions are low, then our ML model performs well.
This is why I love the confusion matrix: we only need to see the main diagonal (top left to bottom right) and off-diagonal values to assess the performance of an ML classifier.
With our imbalanced dataset, we got 0 for both precision and recall, and we said this means the model does not correctly identify fraudulent transactions and instead predicts all transactions as non-fraudulent.
From the precision and recall formulas alone, this can be hard to picture, because we have to keep the formulas in mind. Since I don't find that easy myself, let's apply the confusion matrix to our example and see what happens:
from sklearn.metrics import confusion_matrix

# Calculate and print the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)

>>>

Confusion Matrix:
[[285 0]
[ 15 0]]
Look what happened?! We can clearly see that our model is not faring well because it captures 285 TNs but 0 TPs! This is the visual power of the Confusion Matrix!
There is also another way to represent the confusion matrix and I really love it because it improves the visualization experience. Here is the code:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()
This type of visualization is very useful for multi-class classification problems. Let's see an example:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Generate random data with 3 classes
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_clusters_per_class=1, n_informative=5,
                           class_sep=0.5, random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train a logistic regression model on the training data
clf = LogisticRegression(random_state=42).fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=['Class 0', 'Class 1', 'Class 2'])
disp.plot()
In these cases, it is not easy to tell which values are the TPs, TNs, etc., since we have three classes. Still, we can simply compare the values on the main diagonal with those off the diagonal. Here the main diagonal holds 49, 52, and 44, which are much higher than the off-diagonal values, telling us that this model performs well (note, too, that we calculated the confusion matrix on the test set!).
There are two more metrics that I personally think are more appropriate in certain cases: sensitivity and specificity. Let me introduce them first, and then we'll discuss where they apply.
Sensitivity is the ability of a classifier to find all positive samples:
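In terms of the counts defined earlier, sensitivity is written as:

```latex
\text{Sensitivity} = \frac{TP}{TP + FN}
```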
Wait a moment! But isn't that the recall?!?
Yes, it is. It's not a mistake. That's why I say that these metrics are more appropriate for certain cases. But let me continue.
We define specificity as a classifier's ability to find all negative samples:
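In terms of the counts defined earlier, specificity is written as:

```latex
\text{Specificity} = \frac{TN}{TN + FP}
```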
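So both describe the "precision" of a test: sensitivity is the probability that the test is positive when the condition is really present, and specificity is the probability that the test is negative when the condition is absent.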
In my experience, these metrics are more appropriate for classifications in medicine, biology, etc.
For example, consider a COVID test. Consider this approach (which could be considered Bayesian, but let's skip that): you take a COVID test and the result is positive. Q: What is the probability of a positive test? And what is the probability of a negative test?
In other words, what are the sensitivity and specificity of the tool you used to get the result?
Well, you may be wondering: what kind of questions are you asking, Federico?
Let me give you an example of something I lived through last summer.
Here in Italy, a positive COVID test had to be certified (let's skip the reasons) by someone: usually a hospital or pharmacy. So here, when we had symptoms, we usually did a COVID test at home (COVID test for €3-5) and then went to a pharmacy to confirm it (COVID test for €15).
I developed symptoms last July after my wife and daughters tested positive. So I tested at home and the result was positive. Then immediately went to the pharmacy for confirmation, and ... negative result!
How is it possible? Simple: the tool I used for the COVID test at home was more sensitive than the pharmacist's (or the test the pharmacist used was more specific than the one I used).
Therefore, in my experience, these metrics are particularly useful for measuring devices of any kind (mechanical, electrical, etc.) and/or in certain specific fields (e.g. biology, medicine, etc.). Also note that these metrics use TP, TN, FP, and FN, just like precision and recall: this again underscores the fact that they are more appropriate in the case of a binary classification problem.
Of course, I'm not telling you that sensitivity and specificity must only be used in the above cases. In my experience they are simply better suited to them.
Log loss - also called cross entropy - is an important measure in classification and is based on probability. This value compares the predicted probability of each class to the actual class designations.
Let's look at the formula:
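For the binary case, the standard form of log loss, written with the symbols N, i, y, p, and ln used in this section, is:

```latex
\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]
```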
Where:
- N is the total number of observations and i is a single observation.
- y is the actual value.
- p is the predicted probability.
- ln is the natural logarithm.
To calculate the predicted probability p, we need to use an ML model that can actually compute probabilities, such as logistic regression. In this case we must use the predict_proba() method, like so:
from sklearn.linear_model import LogisticRegression

# Instantiate the logistic regression model
model = LogisticRegression()

# Fit the model on the train set
model.fit(X_train, y_train)

# Calculate the predicted probabilities (one column per class)
y_prob = model.predict_proba(X_test)
Suppose we have a binary classification problem, calculate the probability using the logistic regression model, and assume the following table shows our results:
The calculation we would perform to determine the log loss is as follows:
And that leads to a value close to 0, which we can be satisfied with, which means that our logistic regression model predicts the labels for each class fairly well. In fact, a log-loss with a value of 0 represents the best possible fit. In other words, a model with a log-loss of 0 predicts the probability of each observation as the true value.
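To make this kind of calculation concrete, here is a minimal sketch that computes the binary log loss by hand and compares it with scikit-learn's `log_loss`. The labels and probabilities are hypothetical values chosen for illustration, not the ones from the table in this article.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical actual labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.7, 0.2])

# Binary log loss computed by hand: mean negative log-likelihood
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# The same quantity computed by scikit-learn
from_sklearn = log_loss(y_true, p)

print(manual, from_sklearn)
```

The two values agree, and they shrink toward 0 as the predicted probabilities get closer to the actual labels.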
But don't worry, we don't need to calculate the log loss by hand. Luckily for us, sklearn comes to the rescue. So let's go back to our imbalanced dataset. To calculate the log loss in Python, we write:
from sklearn.metrics import log_loss

# Calculate and print the log loss
# Note: hard class predictions are passed here; log_loss is usually
# given the probabilities returned by predict_proba
log_loss_score = log_loss(y_test, y_pred)
print("Log Loss Score:", log_loss_score)

>>>

Log Loss Score: 1.726938819745535
Again, we got a poor value on the test set, confirming all the points above.
Finally, one last consideration: log loss lends itself to binary classification problems. What about multiclass problems?
The categorical cross-entropy metric represents the generalization of log loss to multi-class cases.
This metric is particularly useful for unbalanced data sets because it takes into account the probability of the predicted class. This is important when we have an unbalanced data set, as the relative abundance of the classes can affect the model's ability to correctly predict the "minority" classes.
Here we have:
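In its standard form, categorical cross-entropy sums over classes as well as observations. The class index j and the number of classes M are symbols introduced here for illustration; y_ij is 1 when observation i belongs to class j and 0 otherwise, and p_ij is the predicted probability of class j for observation i:

```latex
\text{CCE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \ln(p_{ij})
```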
The nomenclature is the same as in the case of log loss.
Finally, we use it in Python in the same way as log loss, that is, by importing it with from sklearn.metrics import log_loss. The only point to stress in this discussion is that the formula differs slightly between the binary and the multiclass case.
ROC stands for "Receiver Operating Characteristic" and is a graphical way of evaluating a classifier by plotting the true positive rate (TPR) against the false positive rate (FPR) at different thresholds.
AUC stands for "Area Under the Curve" and represents the area under the ROC curve. It is an overall measure of performance ranging from 0 to 1 (where 1 means the classifier predicts 100% of the labels correctly), and it is well suited for comparing different classifiers.
First, let's define TPR and FPR:
- TPR is the sensitivity (which, as mentioned, is also called recall).
- FPR is defined as 1 - specificity.
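Written out with the counts defined earlier, the two rates are:

```latex
\text{TPR} = \frac{TP}{TP + FN}
\qquad
\text{FPR} = 1 - \text{Specificity} = \frac{FP}{FP + TN}
```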
Note that AUC/ROC is appropriate in the case of a binary classification problem. In the case of a multiclass classifier, TPR and FPR must be redefined. This requires some work, so my advice is to use it only for binary classification problems.
Now let's see how to implement this in Python:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Generate a random dataset with binary classification
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=42)

# Fit a logistic regression model to the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict probabilities for the test data
probs = model.predict_proba(X_test)

# Calculate ROC curve and AUC score (column 1 = probability of the positive class)
fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
auc_score = roc_auc_score(y_test, probs[:, 1])

# Plot the ROC curve
plt.plot(fpr, tpr, label='AUC = {:.2f}'.format(auc_score))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()
The dashed line represents a purely random classifier, which is equivalent to randomly guessing one class or the other. Since this is a binary classification problem, that means a 50% chance of guessing correctly, which corresponds to an AUC of 0.5. The further our curve is above that line, the better our model is. Ideally, our curve should stay as close as possible to the top left corner, meaning a low false positive rate and a high true positive rate.
Because of this, this chart is good for comparing models: Better models have curves in the upper-left corner of the chart. Let's look at an example: we'll use the same dataset as before, but fit the data to three different ML models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Generate a random dataset with binary classification
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Define three different classifiers
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(n_estimators=100)
clf3 = KNeighborsClassifier(n_neighbors=5)
clfs = [clf1, clf2, clf3]

# Fit each classifier, predict probabilities for the test data,
# and draw its ROC curve with the AUC in the legend
plt.figure(figsize=(8, 6))
for clf in clfs:
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)
    fpr, tpr, _ = roc_curve(y_test, probs[:, 1])
    auc_score = roc_auc_score(y_test, probs[:, 1])
    plt.plot(fpr, tpr, label='{} (AUC = {:.2f})'.format(clf.__class__.__name__,
                                                         auc_score))

# Add the random-classifier diagonal, axis limits, and labels
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.show()
So in this case, the random forest classifier is the one that best predicts our data, because its curve sits closer to the upper left corner than those of the other models.
To conclude this section: at the beginning we said that ROC plots TPR vs. FPR at different thresholds, but we didn't explain what these thresholds are. Let's do that in the next section.
Consider a binary classification problem. We apply the data to a classifier and this assigns each predicted value to class 1 or class 0: what criteria are used for the assignment?
Stop reading for a moment and try to think about it.
Yes, you guessed it right: in classification problems, a classifier assigns each sample a score between 0 and 1. This gives the probability that the sample belongs to the positive class.
Therefore, our ML models use a threshold to convert the probability values into class predictions. In other words, any sample with a probability value above the threshold is predicted as positive.
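A minimal sketch of this thresholding step, using hypothetical probabilities rather than the output of a fitted model:

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class
probs = np.array([0.05, 0.35, 0.50, 0.65, 0.95])

# The same scores turn into different class predictions as the threshold moves
predictions = {}
for threshold in (0.3, 0.5, 0.7):
    predictions[threshold] = (probs > threshold).astype(int)
    print(threshold, predictions[threshold])
```

Lowering the threshold labels more samples as positive, and raising it labels fewer.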
This is of course also true in the case of a multi-class classification problem: we simply used the binary classification case to simplify our argument.
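To make the thresholding step concrete, here is a minimal sketch that applies it by hand. It assumes the same synthetic dataset and logistic regression model used in this article. The helper `predict_with_threshold` and the three threshold values are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic binary dataset used in the article's examples
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Probability of the positive class (column 1) for each test sample
probs = clf.predict_proba(X_test)[:, 1]


def predict_with_threshold(probabilities, threshold):
    """Hypothetical helper: predict 1 when the probability reaches the threshold."""
    return (probabilities >= threshold).astype(int)


# Illustrative thresholds: a higher threshold labels fewer samples as positive
for threshold in (0.3, 0.5, 0.7):
    y_pred = predict_with_threshold(probs, threshold)
    print(f"threshold={threshold:.1f}: {y_pred.sum()} samples predicted positive")
```

Raising the threshold can only keep or reduce the number of samples predicted as positive, which is why every metric built on those predictions depends on the threshold you pick.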
Therefore, ROC curves are useful because they show how an ML model's performance varies at different thresholds.
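As a rough sketch of what "performance at different thresholds" means, the code below computes TPR and FPR by hand for a few thresholds and then checks the same calculation against the points returned by `roc_curve`. The helper `tpr_fpr` and the threshold list are hypothetical. The dataset and model are assumed to be the ones used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]


def tpr_fpr(y_true, probabilities, threshold):
    """Hypothetical helper: TPR and FPR when predicting 1 for probabilities >= threshold."""
    y_pred = probabilities >= threshold
    positives = y_true == 1
    negatives = y_true == 0
    tpr = np.sum(y_pred & positives) / np.sum(positives)
    fpr = np.sum(y_pred & negatives) / np.sum(negatives)
    return tpr, fpr


# Illustrative thresholds: each one gives one point of the ROC curve
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
points = [tpr_fpr(y_test, probs, t) for t in thresholds]
for t, (tpr, fpr) in zip(thresholds, points):
    print(f"threshold={t:.1f}: TPR={tpr:.2f}, FPR={fpr:.2f}")

# Cross-check: recompute the points at the thresholds scikit-learn selected
fpr_sk, tpr_sk, thr_sk = roc_curve(y_test, probs)
manual = np.array([tpr_fpr(y_test, probs, t) for t in thr_sk])
matches = np.allclose(manual[:, 0], tpr_sk) and np.allclose(manual[:, 1], fpr_sk)
print("manual points match roc_curve:", matches)
```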
However, the fact that a classifier assigns each sample to a class based on a threshold tells us that there is a trade-off between precision and recall (just as there is between bias and variance).
In addition, we can plot the precision-recall curve. Let's see how we can do this using the same dataset we used for the AUC/ROC curve:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Generate a random dataset with binary classification
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
# Fit a logistic regression model to the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Predict probabilities for the test data and compute the precision-recall curve
probs = clf.predict_proba(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, probs[:, 1])
# Draw the precision-recall curve
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
So above we see that the precision stays at 1 up to a recall of about 0.5, then drops sharply. We therefore want to choose a compromise between precision and recall before this point. Let's say at 0.4 recall.
Another good way to illustrate this trade-off is to plot precision and recall as the threshold varies. Here's what happens with the same dataset:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Generate a random dataset with binary classification
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
# Fit a logistic regression model to the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Predict probabilities for the test data and compute the precision-recall curve
probs = clf.predict_proba(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, probs[:, 1])
# Plot precision and recall as the threshold changes
# (both arrays have one more element than thresholds, so drop the last one)
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.xlabel('Threshold')
plt.ylabel('Precision and recall')
plt.legend()
plt.title('Precision and recall as the threshold changes')
plt.show()
So the graph above confirms that the threshold balancing the trade-off between precision and recall is around 0.4 in this case.
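If you prefer a number to reading the value off a plot, one option is to look for the threshold where precision and recall are closest to each other. Treating "balance" as this crossing point is an assumption made for this sketch, and other criteria (such as the maximum F1 score) are just as reasonable. The variable names `gap` and `best` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, probs)

# precision and recall have one more element than thresholds,
# so drop the last point before comparing them threshold by threshold
gap = np.abs(precision[:-1] - recall[:-1])
best = int(np.argmin(gap))

print(f"threshold where precision and recall are closest: {thresholds[best]:.2f}")
print(f"precision={precision[best]:.2f}, recall={recall[best]:.2f}")
```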
So when someone tells you they have found an ML model with 95% precision, ask: "At what recall?"
Since they use closely related metrics, you may be wondering when to use the AUC/ROC curve and when to use the precision-recall curve. Quoting from reference 1 (p. 92):
As a rule of thumb, if the positive class is rare, or if you care more about the false positives than the false negatives, you should prefer the Precision-Recall curve, otherwise the ROC curve.
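To explore the "rare positive class" part of this rule, here is a small sketch on an imbalanced dataset. Everything in it is an assumption for illustration: the 95/5 class split, the sample size, and the use of `average_precision_score` as a single-number summary of the precision-recall curve. The point is that the two summaries are read against different baselines: a random classifier scores about 0.5 in ROC AUC, while its average precision is roughly the share of positive samples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: roughly 5% of samples are positive
X, y = make_classification(n_samples=5000, n_features=10, n_classes=2,
                           weights=[0.95, 0.05], random_state=42)
# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, probs)
avg_precision = average_precision_score(y_test, probs)
prevalence = y_test.mean()  # share of positive samples in the test set

print(f"ROC AUC:           {roc_auc:.3f} (random baseline: 0.5)")
print(f"Average precision: {avg_precision:.3f} (random baseline: about {prevalence:.3f})")
```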
Beyond the methods and metrics we have seen above, which are specific to classification, there are two that are transversal. This means that they can be used to evaluate both classification and regression problems.
These are KDE plots and learning curves. I've written about them in previous articles, so I'll link them below:
For what a KDE is and how to use it, see point 3 of the "Graphical Methods for Validating Your ML Model" section of the following article:
You can read what learning curves are and how to use them here:
So far we have seen many metrics and methods to validate a classification algorithm. If you're wondering which one to use, I always say that while it's good practice to try them all (especially to compare them), it's difficult to answer the question for many reasons. It's often just a matter of taste.
Also, as a rule of thumb, a single metric is not enough to evaluate an ML model.
If you have read my other articles, you know that I personally like to use at least one analytical method and one graphical method. For classification problems I generally use the confusion matrix and KDE.
But again, it's a matter of personal taste. My advice here is to practice with them and decide which one you like. Remember, you need more than one to make accurate decisions about your ML models.
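As a minimal sketch of looking at more than one metric at a time, the code below collects the confusion matrix and a few of the scores covered in this guide for a single model. The selection of metrics and the `metrics` dictionary are hypothetical choices for illustration, not a recommended standard. The dataset and model are assumed to be the ones used earlier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
clf = LogisticRegression().fit(X_train, y_train)

y_pred = clf.predict(X_test)
probs = clf.predict_proba(X_test)[:, 1]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Several metrics side by side for the same predictions
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, probs),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```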
Bibliography and references:
- [1] Hands-On Machine Learning with Scikit-Learn & TensorFlow - Aurélien Géron
- [2] Machine Learning with PyTorch and Scikit-Learn - Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili