Now That You Have a Machine Learning Model, It’s Time to Evaluate Your Security Classifier

This is the third installment in a three-part series about machine learning. Be sure to read part one and part two for more context on how to choose the right artificial intelligence solution for your security problems.

As we move into this third part, we hope we have helped our readers better identify an artificial intelligence (AI) solution and select the right algorithm to address their organization’s security needs. Now, it’s time to evaluate the effectiveness of the machine learning (ML) model being used. But with so many metrics and systems available to measure security success, where does one begin?

Classification or Regression? Which Can Get Better Insights From Your Data?

By this time, you may have selected an algorithm of choice to use with your machine learning solution. It could fall into one of two categories in general, classification or regression. Here is a reminder of the main difference: From a security standpoint, these two types of algorithms tend to solve different problems. For example, a classifier might be used as an anomaly detector, which is often the basis of the new generation of intrusion detection and prevention systems. Meanwhile, a regression algorithm might be better at things such as detecting denial-of-service attacks (DoS) because these problems tend to involve numbers rather than nominal labels.

At first look, the difference between Classification and Regression might seem complicated, but it really isn’t. It just comes down to what type of value our target variable, also called our dependent variable, contains. In that sense, the main difference between the two is that the output variable in Regression is numerical while he output for Classification is categorical/discrete.

For our purposes in this blog, we’ll focus on metrics that are used to evaluate algorithms applied to supervised ML. For reference, supervised machine learning is the form of learning where we have complete labels and a ground truth. For example, we know that the data can be divided into class1 and class2, and each of our training, validation, and testing samples is labeled as belonging to class1 or class2.

Classification Algorithms – or Classifiers

To have ML work with data, we can select a security classifier, which is an algorithm with a non-numeric class value. We want this algorithm to look at data and classify it into predefined data “classes.” These are usually two or more categorical, dependent variables.

For example, we might try to classify something as an attack or not an attack. We would create two labels, one for each of those classes. A classifier then takes the training set and tries to learn a “decision boundary” between the two classes. There could be more than two classes, and in some cases only one class. For example, the Modified National Institute of Standards and Technology (MNIST) database demo tries to classify an image as one of the ten possible digits from hand-written samples. This demo is often used to show the abilities of deep learning, as the deep net can output probabilities for each digit rather than one single decision. Typically, the digit with the highest probability is chosen as the answer.

A Regression Algorithm – or Regressor

A Regression algorithm, or regressor, is used when the target variable is a number. Think of a function in math: there are numbers that go into the function and there is a number that comes out of it. The task in Regression is to find what this function is. Consider the following example:

Y = 3x+9

We will now find ‘Y’ for various values of ‘X’. Therefore:

X = 1 -> y = 12

X = 2 -> y = 15

X = 3 -> y = 18

The regressor’s job is to figure out what the function is by relying on the values of X and Y. If we give the algorithm enough X and Y values, it will hopefully find the function 3x+9.

We might want to do this in cases where we need to calculate the probability of an event being malicious. Here, we do not want a classification, as the results are not fine-grained enough. Instead, we want a confidence or probability score. So, for example, the algorithm might provide the answer that “there is a 47 percent probability that this sample is malicious.”

In the next section, we will be looking at the various metrics for each, Classification, and Regression, which can help us determine the efficacy of our security posture by using our chosen ML model.

Metrics for Classification

Before we dive into common classification metrics, let’s define some key terms:

  • Ground truth is a set of known labels or descriptions of which class or target variable represents the correct solution. In a binary classification problem, for instance, each example in the ground truth is labeled with the correct classification. This mirrors the training set, where we have known labels for each example.
  • Predicted labels represent the classifications that the algorithm believes are correct. That is, the output of the algorithm.

Now let’s take a closer look at some of the most useful metrics against which we can choose to measure the success of our machine learning deployment.

True Positive Rate

This is the ratio of correctly predicted positive examples to the total number of examples in the ground truth. If there are 100 examples in the ground truth and the model correctly predicts 65 of them as positive, then the true positive rate (TPR) is 65 percent, sometimes written as 0.65.

False Positive Rate

The false positive rate (FPR) is the number of incorrectly predicted examples that are labeled as positive by the algorithm but are actually negative in the ground truth. If we have 100 examples and 15 of them are incorrectly predicted as positive, then the false positive rate would be 15 percent, sometimes written as 0.15.

True Negative Rate

The true negative rate (TNR) is the number of correctly predicted negative examples divided by the number of examples in the ground truth. Let us say that in the scenario of 100 examples that another 15 of these examples were correctly predicted as negative. Therefore, the true negative rate (TNR) is 15 percent, also written as 0.15. Notice here that there were 15 false positives and 15 true negatives. This makes for a total of 30 negative examples.

False Negative Rate

The false negative rate (FNR) is the ratio of examples predicted incorrectly as belonging to the negative class over the number of examples in the ground truth. Continuing with the aforementioned case, let’s say that out of 100 examples in the ground truth, the algorithm correctly predicted 65 as positive. We also know that 15 were predicted as false positives and 15 were predicted as true negatives. This leaves us with 5 examples unaccounted for, so our false negative rate is 5 percent, or 0.05. The false negative rate is the complement to the true positive rate, so the sum of the two metrics should be 70 percent (0.7), as 70 examples actually belong to the positive class.


Accuracy measures the proportion of correct predictions, both positive and negative, to the total number of examples in the ground truth. This metric can often be misleading if, for instance, there is a large proportion of positive examples in the ground truth compared to the number of negative examples. Similarly, if the model predicts only the positive class correctly, accuracy will not give you a sense of how well the model does with negative predictions versus negative examples in the ground truth even though the accuracy could be quite high because the positive examples were predicted.

Accuracy = (TP+TN)/(TP+TN+FP+FN)


Before we explore the precision metric, it’s important to define a few more terms:

  • TP is the raw number of true positives (in the above example, the TP is 65).
  • FP is the raw number of false positives (15 in the above example).
  • TN is the raw number of true negatives (15 in the above example).
  • FN is the raw number of false negatives (5 in the above example).

Precision, sometimes known as the positive predictive value, is the proportion of true positives predicted by the algorithm over the sum of all examples predicted as positive. That is, precision=TP/(TP+FP).

In our example, there were 65 positives in the ground truth that the algorithm correctly labeled as positive. However, it also labeled 15 examples as positive when they were actually negative.

These false positives go into the denominator of the precision calculation. So, we get 65/(65+15), which yields a precision of 0.81.

What does this mean? In brief, high precision means that the algorithm returned far more true positives than false positives. In other words, it is a qualitative measure. The higher the precision, the better job the algorithm did of predicting true positives while rejecting false positives.


Recall, also known as sensitivity, is the ratio of true positives to true positives plus false negatives: TP/(TP+FN).

In our example, there were 65 true positives and 5 false negatives, giving us a recall of 65/(65+5) = 0.93. Recall is a quantitative measure; in a classification task, it is a measure of how well the algorithm “memorized” the training data.

Note that there is often a trade-off between precision and recall. In other words, it’s possible to optimize one metric at the expense of the other. In a security context, we may often want to optimize recall over precision because there are circumstances where we must predict all the possible positives with a high degree of certainty.

For example, in the world of automotive security, where kinetic harm may occur, it is often heard that false positives are annoying, but false negatives can get you killed. That is a dramatic example, but it can apply to other situations as well. In intrusion prevention, for instance, a false positive on a ransomware sample is a minor nuisance, while a false negative could cause catastrophic data loss.

However, there are cases that call for optimizing precision. If you are constructing a virus encyclopedia, for example, higher precision might be preferred when analyzing one sample since the missing information will presumably be acquired from another sample.


An F-measure (or F1 score) is defined as the harmonic mean of precision and recall. There is a generic F-measure, which includes a variable beta that causes the harmonic mean of precision and recall to be weighted.

Typically, the evaluation of an algorithm is done using the F1 score, meaning that beta is 1 and therefore the harmonic mean of precision and recall is unweighted. The term F-measure is used as a synonym for F1 score unless beta is specified.

The F1 score is a value between 0 and 1 where the ideal score is 1, and is calculated as 2 * Precision * Recall/(Precision+Recall), or the harmonic mean. This metric typically lies between precision and recall. If both are 1, then the F-measure equals 1 as well. The F1 score has no intuitive meaning per se; it is simply a way to represent both precision and recall in one metric.

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC), sometimes written as Phi, is a representation of all four values — TP, FP, TN and FN. Unlike precision and recall, the MCC takes true negatives into account, which means it handles imbalanced classes better than other metrics. It is defined as:


If the value is 1, then the classifier and ground truth are in perfect agreement. If the value is 0, then the result of the classifier is no better than random chance. If the result is -1, the classifier and the ground truth are in perfect disagreement. If this coefficient seems low (below 0.5), then you should consider using a different algorithm or fine-tuning your current one.

Youden’s Index

Also known as Youden’s J statistic, Youden’s index is the binary case of the general form of the statistic known as ‘informedness’, which applies to multiclass problems. It is calculated as (sensitivity + specificity–1) and can be seen as the probability of an informed decision verses a random guess. In other words, it takes all four predictors into account.

Remember from our examples that recall=TP/(FP+FN) and specificity, or TNR, is also the complement of the FPR. Therefore, the Youden index incorporates all measures of predictors. If the value of Youden’s index is 0, then the probability of the decision actually being informed is no better than random chance. If it is 1, then both false positives and false negatives are 0.

Area Under the Receiver Operator Characteristic Curve

This metric, usually abbreviated as AUC or ROC, measures the area under the curve plotted with true positives on the Y-axis and false positives on the X-axis. This metric can be useful because it provides a single number that lets you compare models of different types. An AUC value of 0.5 means the result of the test is essentially a coin flip. You want the AUC to be as close to 1 as possible because this enables researchers to make comparisons across experiments.

Area Under the Precision Recall Curve

Area under the precision recall curve (AUPRC) is a measurement that, like MCC, accounts for imbalanced class distributions. If there are far more negative examples than positive examples, you might want to use AUPRC as your metric and visual plot. The curve is precision plotted against recall. The closer to 1, the better. Note that since this metric/plot works best when there are more negative predictions than positive predictions, you might have to invert your labels for testing.

Average Log Loss

Average log loss represents the penalty of wrong prediction. It is the difference between the probability distributions of the actual and predicted models.

In deep learning, this is sometimes known as the cross-entropy loss, which is used when the result of a classifier such as a deep learning model is a probability rather than a binary label. Cross-entropy loss is therefore the divergence of the predicted probability from the actual probability in the ground truth. This is useful in multiclass problems but is also applicable to the simplified case of binary classification.

By using these metrics to evaluate your ML model, and tailoring them to your specific needs, you could fine-tune the output from the data and essentially get more certain results, thus detecting more issues/threats, and optimizing controls as needed.

Metrics for Regression

For regression, the goal is to determine the amount of errors produced by the ML algorithm. The model is considered good if the error value between the predicted and observed value is small.

Let’s take a closer look at some of the metrics used for evaluating regression models.

Mean Absolute Error

Mean absolute error (MAE) is the closeness of the predicted result to the actual result. You can think of this as the average of the differences between the predicted value and the ground truth value. As we proceed along each test example when evaluating against the ground truth, we subtract the actual value reported in the ground truth from the predicted value from the regression algorithm and take the absolute value. We can then calculate the arithmetic mean of these values.

While the interpretation of this metric is well-defined, because it is an arithmetic mean, it could be affected by very large, or very small differences. Note that this value is scale-dependent, meaning that the error is on the same scale as the data. Because of this, you cannot compare two MAE values across datasets.

Root Mean Squared Error

Root mean squared error (RMSE) attempts to represent all error across moments in time in one value. This is often the metric that optimization algorithms seek to minimize in regression problems. When an optimization algorithm is tuning so-called hyperparameters, it seeks to make RMSE as small as possible.

Consider, however, that like MAE, RMSE is both sensitive to large and small outliers and is scale-dependent. Therefore, you have to be careful and examine your residuals to look for outliers — values that are significantly above or below the rest of the residuals. Also, like MAE, it is improper to compare RMSE across datasets unless the scaling translations have been accounted for, because data scaling, whether by normalization or standardization, is dependent upon the data values.

For example, in Standardization, the scale from -1 to 1 is determined by subtracting the mean from each value and dividing the value by the standard deviation. This gives the normal distribution. If, on the other hand, the data is normalized, the scaling is done by taking the current value and subtracting the minimum value, then dividing this by the quantity (maximum value – minimum value). These are completely different scales, and as a result, one cannot compare the RMSE between these two data sets.

Relative Absolute Error

Relative absolute error (RAE) is the mean difference divided by the arithmetic mean of the values in the ground truth. Note that this value can be compared across scales because it has been normalized.

Relative Squared Error

Relative squared error (RSE) is the total squared error of the predicted values divided by the total squared error of the observed values. This also normalizes the error measurement so that it can be compared across datasets.

Machine Learning Can Revolutionize Your Organization’s Security

Machine learning is integral to the enhancement of cybersecurity today and it will only become more critical as the security community embraces cognitive platforms.

In this three-part series, we covered various algorithms and their security context, from cutting-edge technologies such as generative adversarial networks to more traditional algorithms that are still very powerful.

We also explored how to select the appropriate security classifier or regressor for your task, and, finally, how to evaluate the effectiveness of a classifier to help our readers better gauge the impact of optimization. With a better idea about these basics, you’re ready to examine and implement your own algorithms and to move toward revolutionizing your security program with machine learning.

The post Now That You Have a Machine Learning Model, It’s Time to Evaluate Your Security Classifier appeared first on Security Intelligence.