In this blog post, we discuss how to calibrate anomaly scores so that they can be trusted as indicators of the severity of the anomalies they signal. To provide some context: we are developing algorithms for a product that detects anomalous user activities in IT systems based on, for example, screen content, command usage or mouse movement. Several algorithms are used to find these anomalies, and each of them scores activities between 0 and 100. Low scores mean usual activities, while high scores imply unusual events. Security analysts make decisions based on these scores: if the score is high (above 90), the related activity should be investigated; if the score is low, the event is presumably normal for the given user and requires no attention.

## Score ranking

In many data science applications, only the ranking of the scores matters. In marketing campaign predictions, for example, a list of the top (let’s say) 200 potential customers is generated, and the sole criterion for customer acquisition efforts is whether a customer made it into the top 200. In our use case, the ranking by itself doesn’t say much; it is the absolute value of a score that should trigger an action.

Our data scientists work toward building the best possible machine learning model for a given task. In previous blog posts (Part1, Part2, Part3), we have already described how we defined the correctness of the methods we used. As a reminder, the first, and maybe the most important, measure is the AUC, which describes how well the usual and unusual activities of users are ranked by the model. Our aim is always to reach a reliably high AUC. This may take several iterations, following the standard data science model building process: collect more data, create more features, find better parameters for the model or use another machine learning algorithm. This field is widely published and well known to data scientists. See more in this blog.

A model with a good separation ability (good AUC) ensures only that the ranking of the scores is reliable, that is, unusual activities, on average, score higher than usual ones. If the concrete values of the scores also count, then proper calibration might be needed. In IT security, where our product is used, it must be ensured that most of the scores of usual activities concentrate around 0 and the majority of the unusual activities get scores close to 100. This criterion helps security staff to make fast decisions. For example, scores above 90 indicate that further investigation is required, whereas activities below 50 are supposed to be usual activities so no further analysis is needed. Measures like pAUC and RP distance help to evaluate the algorithms precisely from this point of view. For more information on AUC, pAUC and RP distance, see earlier blog posts: AUC and pAUC, RP distance.
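To make the difference between ranking and absolute values concrete, here is a minimal sketch. The scores are hypothetical and the `auc` helper is our own illustration rather than the product's implementation. It computes the AUC as the probability that a randomly chosen unusual activity scores higher than a randomly chosen usual one, and shows that rescaling the scores leaves that value untouched.

```python
import numpy as np

def auc(usual, unusual):
    """Probability that a random unusual score exceeds a random usual score.

    Ties are counted as one half, as in the rank-based definition of the AUC.
    """
    usual = np.asarray(usual, dtype=float)
    unusual = np.asarray(unusual, dtype=float)
    # Compare every unusual score with every usual score.
    greater = (unusual[:, None] > usual[None, :]).sum()
    ties = (unusual[:, None] == usual[None, :]).sum()
    return (greater + 0.5 * ties) / (usual.size * unusual.size)

# Hypothetical scores: well ranked, but the usual scores are far from 0.
usual_scores = np.array([60.0, 62.0, 64.0, 66.0, 71.0])
unusual_scores = np.array([69.0, 78.0, 80.0, 83.0, 85.0])

original_auc = auc(usual_scores, unusual_scores)
# Halving every score is strictly increasing, so the ranking is unchanged.
rescaled_auc = auc(usual_scores / 2, unusual_scores / 2)
print(original_auc, rescaled_auc)
```

Both calls return the same value, although the second set of scores would never cross a threshold of 90. This is why a good AUC alone does not tell us whether the score values are usable for decisions.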

For the rest of this blog post, suppose that we have built a machine learning model that provides a score distribution with good AUC, so we can be sure that rankings are reliable. In this blog post, we would like to show how these scores can be calibrated to get proper score values and consequently support fast and/or automated security decisions.

**Figure 1. Example score distribution**

Figure 1. shows an example score distribution. The model can separate the scores of usual and unusual activities well, so the ranking is good (AUC = 0.931). However, additional score calibration is needed.

As you can see, the scores of usual activities are mainly concentrated around 65 and those of unusual activities around 80. Using a cut-off value of 70, we could separate the labels quite well, but the usual scores are too high in general.

## Baseline regularization

We can do a number of things to try and improve the calibration of our scores. One of the methods we are going to use is called baseline regularization. The following paper suggests methods for interpreting and unifying outlier scores: http://www.dbs.ifi.lmu.de/~zimek/publications/SDM2011/SDM11-outlier-preprint.pdf

The baseline regularization method described in the publication solves one part of our problem. It ensures that the scores for usual activities are concentrated around 0. Let’s see how we can get there.

Let \(r_{\text{usual}}\) be a reference value of the usual scores’ distribution. The reference value could be an aggregated value of the score distribution, such as the mean or the median of the scores. In the paper, it is referred to as the expected inlier value, but we use the median (50th percentile).

The idea behind regularization is to decrease the observed values by the reference value. Since the scores may be smaller than \(r_{\text{usual}}\), we need trimming to keep scores non-negative after transformation:

$$\text{regularized}(\text{score}) = \max{ ( 0, \text{score} - r_{\text{usual}} ) }$$

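It is easy to see that this regularization is ranking-stable: the transformation is monotonically non-decreasing, so it never reverses the order of two scores. It only creates ties among the scores at or below \(r_{\text{usual}}\), which all become 0, so the AUC won’t change significantly.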

Visualization might help to understand what is happening: we subtract \(r_{\text{usual}}\) (the median of the usual scores) from all the scores, and increase the value to 0 if it was negative.
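The same step can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical scores; the function name `baseline_regularize` is ours, and the median is used as the reference value, as described above.

```python
import numpy as np

def baseline_regularize(scores, usual_scores):
    """Subtract the median of the usual scores and trim negative results to 0."""
    r_usual = np.median(usual_scores)
    return np.maximum(0.0, np.asarray(scores, dtype=float) - r_usual)

# Hypothetical raw scores; the median of the usual scores is 62.
usual_scores = np.array([55.0, 60.0, 62.0, 65.0, 70.0])
unusual_scores = np.array([68.0, 78.0, 80.0, 84.0])

regularized_usual = baseline_regularize(usual_scores, usual_scores)
regularized_unusual = baseline_regularize(unusual_scores, usual_scores)
print(regularized_usual)    # usual scores at or below the median become 0
print(regularized_unusual)  # unusual scores shrink by the same amount
```

In this toy example the unusual scores move from around 80 to around 20, so they end up even further from 100 than before.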

In the example distribution shown in Figure 1., the reference \(r_{\text{usual}}\) of the usual scores is 62.45. So we subtract this value from all of the scores and raise the result to 0 if needed. The resulting distribution can be seen in Figure 2.

**Figure 2. Example score distribution following baseline regularization**

With the help of baseline regularization, we always transform at least half of the usual scores to 0, which partially matches our goals, but at the same time the scores for unusual activities also become smaller so they are not concentrated around 100 (note that pAUC and RP distance values did not improve at all).

## Event score transformation by empirical reference

In the next stage, our idea is to extend the regularization described above so that it decreases the scores for usual events and, at the same time, increases the scores for unusual ones. If we could transform the majority of the usual scores to 0, then why couldn’t we similarly transform the unusual scores to 100? To do so, we pick a reference value of the unusual scores as well, and transform all the scores higher than that reference value to 100. Let’s use the following definitions and formula:

\(r_{\text{usual}}\) : 50th percentile of usual scores – the reference value for the score distribution of usual activities

\(r_{\text{unusual}}\): 50th percentile of unusual scores – the reference value for the score distribution of unusual activities

$$ \text{calibrated}(\text{score}) = \min{\Big( 100, \max{\big( 0, 100 \cdot (\text{score} - r_{\text{usual}}) / ( r_{\text{unusual}} - r_{\text{usual}}) \big) } \Big)}$$

Using this formula, we convert the scores originally not greater than \(r_{\text{usual}}\) to 0, and scores originally not smaller than \(r_{\text{unusual}}\) to 100. The result is a score distribution in which the majority of “usual” scores are around 0 and most of the “unusual” scores are around 100. Both pAUC and RP distance show the improvement. Figure 3. shows the example distribution after the transformation.
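A minimal sketch of this transformation is shown below. The function names `fit_references` and `calibrate` and the sample scores are hypothetical. The quotient is scaled by 100 so that \(r_{\text{unusual}}\) lands exactly on 100, as the text describes, and the percentiles are parameters so that values other than the median can be tried.

```python
import numpy as np

def fit_references(usual_scores, unusual_scores, usual_pct=50, unusual_pct=50):
    """Return (r_usual, r_unusual) as percentiles of the two score samples."""
    r_usual = np.percentile(usual_scores, usual_pct)
    r_unusual = np.percentile(unusual_scores, unusual_pct)
    if not r_usual < r_unusual:
        # The formula divides by (r_unusual - r_usual), which must be positive.
        raise ValueError("r_usual must be smaller than r_unusual")
    return r_usual, r_unusual

def calibrate(scores, r_usual, r_unusual):
    """Map r_usual to 0 and r_unusual to 100 linearly, clipping outside that range."""
    scores = np.asarray(scores, dtype=float)
    scaled = 100.0 * (scores - r_usual) / (r_unusual - r_usual)
    return np.clip(scaled, 0.0, 100.0)

# Hypothetical samples: medians are 62 (usual) and 80 (unusual).
usual_scores = np.array([55.0, 60.0, 62.0, 65.0, 70.0])
unusual_scores = np.array([68.0, 78.0, 80.0, 84.0, 90.0])

r_usual, r_unusual = fit_references(usual_scores, unusual_scores)
calibrated = calibrate([50, 62, 71, 80, 95], r_usual, r_unusual)
print(calibrated)
```

Scores at or below 62 end up at 0, scores at or above 80 end up at 100, and a score of 71, halfway between the two reference values, is mapped to 50.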

**Figure 3. Example score distribution following event score transformation by empirical reference**

### Prerequisites

There are two main prerequisites for success when using this method:

1. Good AUC, that is, the original model is able to separate usual and unusual activities well enough. A model with poor AUC might end up with a high ratio of false negatives and/or false positives. Furthermore, good AUC ensures that \(r_{\text{usual}} < r_{\text{unusual}}\); otherwise the denominator of the formula is zero or negative and the transformation breaks down.

2. A model that is not overfitted, and/or separate datasets for training the model, for calibrating the scores and for testing performance. The score distribution used during calibration should be very similar to the score distribution we get in the production environment. The safest way to ensure this is to use one dataset for building the model and another, separate dataset for calibrating the scores. This way, miscalibration resulting from overfitting can be avoided.
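The second prerequisite can be sketched as follows. This is an illustration under stated assumptions, not our production pipeline: the raw scores are synthetic draws from normal distributions standing in for model output on data that was not used for training, and the `split` and `calibrate` helpers are hypothetical. The reference values are fitted on a calibration set only, and the effect is then checked on held-out scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for raw model scores on data NOT used to train the model.
usual = rng.normal(65, 5, size=2000)
unusual = rng.normal(80, 5, size=2000)

def split(values, calib_fraction=0.5):
    """Shuffle the scores and split them into a calibration part and a test part."""
    shuffled = rng.permutation(values)
    cut = int(len(shuffled) * calib_fraction)
    return shuffled[:cut], shuffled[cut:]

def calibrate(scores, r_usual, r_unusual):
    """Linear rescaling between the two reference values, clipped to [0, 100]."""
    scaled = 100.0 * (scores - r_usual) / (r_unusual - r_usual)
    return np.clip(scaled, 0.0, 100.0)

usual_cal, usual_test = split(usual)
unusual_cal, unusual_test = split(unusual)

# Fit the reference values on the calibration sets only.
r_usual = np.median(usual_cal)
r_unusual = np.median(unusual_cal)

# Check on held-out scores that the calibration behaves as intended.
share_usual_at_0 = np.mean(calibrate(usual_test, r_usual, r_unusual) == 0.0)
share_unusual_at_100 = np.mean(calibrate(unusual_test, r_usual, r_unusual) == 100.0)
print(f"held-out usual scores mapped to 0:     {share_usual_at_0:.2f}")
print(f"held-out unusual scores mapped to 100: {share_unusual_at_100:.2f}")
```

If the calibration and test distributions match, both shares should be close to one half when the medians are used as reference values. A large deviation on held-out data would hint that the calibration scores are not representative of production.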

## Controlling the false positive ratio

Converting all scores above \(r_{\text{unusual}}\) to 100 might result in false positives, that is, some usual activities get scores around 100. A small proportion of high scores for usual activities could be acceptable, as there may be some noise in the data, such as people using shared accounts, IT maintenance or other situations in which an activity was executed by someone other than the owner of the account. These kinds of activities are executed in the name of the original user, but they should get high scores as they might be risky from a security point of view.

In our case, it is very important to keep the false positive ratio low. A high number of false positives would generate lots of useless work for security analysts and would therefore erode their trust in the system. At the same time, the false negative ratio (unusual activities with scores around 0) should also be kept low, otherwise a data breach or the misuse of data might go unobserved. So for practical reasons, we have to strive to keep the false positive ratio under a certain threshold.

### Fine-tuning the calibration

Having an AUC value that is less than 1 means that usual and unusual scores cannot be separated perfectly, so there are either false positive or false negative scores, or both. We can fine-tune the above calibration by modifying the reference value definition to keep the false positive ratio under a certain threshold. So far the median (the 50th percentile) was used both for the usual and the unusual scores. But this can be flexibly set to other percentiles to get an acceptable ratio of false positives. It is easy to see that using a higher percentile value in \(r_{\text{unusual}}\) may decrease the number of false positives.
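To make the effect of the percentile explicit, the relation can be worked out for the alert threshold of 90 mentioned in the introduction. We take the calibrated score as the quotient scaled to the 0 to 100 range and assume \(r_{\text{usual}} < r_{\text{unusual}}\). An activity then reaches a calibrated score of at least 90 exactly when its raw score satisfies

$$ 100 \cdot \frac{\text{score} - r_{\text{usual}}}{r_{\text{unusual}} - r_{\text{usual}}} \geq 90 \iff \text{score} \geq r_{\text{usual}} + 0.9 \, ( r_{\text{unusual}} - r_{\text{usual}} ) $$

As a purely hypothetical example, with \(r_{\text{usual}} = 62\) and \(r_{\text{unusual}} = 80\) a usual activity needs a raw score of at least \(62 + 0.9 \cdot 18 = 78.2\) to be flagged. If a higher percentile moves \(r_{\text{unusual}}\) to 86, the bar rises to \(62 + 0.9 \cdot 24 = 83.6\), so fewer usual activities can reach it.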

### Example for fine-tuning

In the next couple of figures, we show how to fine-tune the calibration by adjusting the percentiles in the definition of the reference values. Figure 4. shows the original distribution of scores in an example, which we are going to use for illustration purposes.

**Figure 4. Example – Original distribution of scores**

As a first step, we did some calibration using the event score transformation by empirical reference method described above, with the following reference values:

\(r_{\text{usual}}\) : 50th percentile of usual scores

\(r_{\text{unusual}}\): 50th percentile of unusual scores

Figure 5. shows the result of calibration.

**Figure 5. Calibration with the 50th percentiles of usual and unusual scores (the red circle showing false positives)**

The blue bars on the right side are false positives, that is, usual activities with high scores. If their ratio seems to be too high, the percentile value in the definition of \(r_{\text{unusual}}\) should be increased.

What happens when we increase the percentile in the unusual activities’ reference value? The number of usual activities with high scores should decrease. Figure 6. shows the calibration we did using the following reference values:

\(r_{\text{usual}}\) : 50th percentile of usual scores

\(r_{\text{unusual}}\): 60th percentile of unusual scores

**Figure 6. Calibration with the 50th percentile of usual and the 60th percentile of unusual scores**

Increasing the percentile in the unusual reference value further should result in a decreasing number of usual activities with high scores. Figure 7. shows the calibration done with the following reference values:

\(r_{\text{usual}}\): 50th percentile of usual scores

\(r_{\text{unusual}}\): 70th percentile of unusual scores

**Figure 7. Calibration with the 50th percentile of usual and the 70th percentile of unusual scores**

Finally Figure 8. shows the calibration done with the following reference values:

\(r_{\text{usual}}\): 50th percentile of usual scores

\(r_{\text{unusual}}\): 80th percentile of unusual scores

**Figure 8. Calibration with the 50th percentile of usual and the 80th percentile of unusual scores**

As you can see in Figure 8, all usual activities get scores lower than 90 if we use the 80th percentile of the unusual scores as \(r_{\text{unusual}}\).
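The sweep over percentiles shown in Figures 5 to 8 can be reproduced in spirit with a short sketch. The data here are synthetic draws from normal distributions, not the scores behind the figures, and `false_positive_ratio` is a hypothetical helper. It reports the share of usual activities whose calibrated score reaches the alert threshold of 90.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic raw scores standing in for the usual and unusual activities.
usual = rng.normal(65, 6, size=5000)
unusual = rng.normal(80, 6, size=5000)

def false_positive_ratio(usual, unusual, usual_pct, unusual_pct, alert_threshold=90.0):
    """Share of usual activities whose calibrated score is at least alert_threshold."""
    r_usual = np.percentile(usual, usual_pct)
    r_unusual = np.percentile(unusual, unusual_pct)
    calibrated = np.clip(100.0 * (usual - r_usual) / (r_unusual - r_usual), 0.0, 100.0)
    return float(np.mean(calibrated >= alert_threshold))

# Keep r_usual at the median and raise the percentile that defines r_unusual.
ratios = {pct: false_positive_ratio(usual, unusual, 50, pct) for pct in (50, 60, 70, 80)}
for pct, ratio in ratios.items():
    print(f"unusual percentile {pct}: false positive ratio {ratio:.4f}")
```

Because a higher percentile can only move \(r_{\text{unusual}}\) upward, the ratio never increases along the sweep. The sweep can be stopped at the first percentile that brings the ratio under the acceptable threshold.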

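If increasing the percentile value in \(r_{\text{unusual}}\) is not enough, the percentile value in \(r_{\text{usual}}\) can be adjusted as well. Raising it lowers every calibrated score below 100 and so reduces false positives further, whereas lowering it raises those scores and reduces false negatives instead.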

## Conclusion

The calibration method presented above is used in our system to set up proper score values. The disadvantage of the method is that we need unusual activities, and we need to calculate their scores against the given user’s baseline. We can use samples from other users’ activities as unusual ones, but selecting a random sample from a huge dataset requires lots of resources. This means that to build the baseline for a user, we need not only that user’s activities but all activities from all users, from which we then select a sample. In large systems, this might be slow and inefficient.

Once we have found an algorithm with good AUC, we can use the calibration method to get a score distribution that fulfills our needs without losing the separation ability of the model. This way we are able to support automated decision-making processes. Also, aggregating scores coming from different algorithms becomes much more straightforward because the algorithms produce scores with similar distributions. From the point of view of the algorithms, the method is simple and robust, and the scoring phase is very fast.