UNSUPERVISED

A New Look at an Old Friend: how to visualize and generalize ROC-AUC

Published on 07 February 2018

In the data science world, everybody and their mom knows about the Area under the ROC Curve, or simply: AUC. It is a widely adopted metric for measuring the performance of machine learning algorithms.

The ROC curve itself is unmistakable from a mile away, appearing in the performance evaluation reports made by data scientists. You’ve seen the likes of Figure 1 a thousand times, haven’t you?

Figure 1. ROC curve

We at Balabit also make use of AUC when assessing our algorithms that are searching for anomalies in IT security data, as presented in a previous blog post. We like to visualize the performance results as well — in our reports we include both common visualizations of common metrics (such as ROC curves), and special figures tailored to our own performance metrics, such as the RP curve described in this post.

Since AUC is, by definition, rooted in the ROC curve, it sounds a bit odd to ask, “Are there ways to visualize AUC other than the ROC curve?” We are happy to share the good news: the answer turned out to be yes.

In this blog post, you can check out our plot that illustrates AUC in a completely different way than a ROC curve does. And there is more! A new look at a problem often sheds light on new solutions. That is why, in the second half of this post, you can also read about our ideas on extending AUC, which were inspired by our own visualization efforts.

The evaluation method behind the plot

We attempt to evaluate our anomaly scoring algorithms from as many meaningful aspects as possible. Evaluating without labels is hard — the story about why we ended up in this situation and what measures we take to tackle this problem is here.

To follow the rest of the discussion, you need a grasp of the process we call cross-scoring, so let’s recap it briefly. The following explanation stays in the domain of user behavior analysis in IT security; however, the presented logic is applicable beyond this field. Animation 1 illustrates the process.

Animation 1. Cross-scoring

The process of cross-scoring comprises the following phases:

  • The input is a bunch of activities, each of them attributed to a user account. These activities describe what users are doing in an IT system. For example, Alice logged in from her computer to a certain database server via remote desktop.
  • The collection of activities is split into two sets: training and test sets.
  • In the training phase, we build a so-called baseline for each user, which captures how that user usually behaves. Each baseline is built using information only from the training set.
  • In the testing phase, all activities from the test set (or a reasonable sample of them) are compared against all of the baselines. Considering every comparison, the test activity and the baseline sometimes belong to the same user, but most of the time they belong to different users.
  • After each comparison, the anomaly detector provides an anomaly score that shows how far the compared test case and the baseline are from each other.
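To make the phases above a bit more tangible, here is a minimal Python sketch of cross-scoring. Everything in it is an assumption made for illustration: the "baseline" is just the set of hosts a user touched in the training set, and `toy_score` is a made-up scorer that returns either 0 or 100. It is not meant to reflect any real detector.

```python
from collections import defaultdict


def build_baselines(training_activities):
    """Toy baseline (assumption): the set of hosts each user touched in training."""
    baselines = defaultdict(set)
    for user, host in training_activities:
        baselines[user].add(host)
    return dict(baselines)


def toy_score(host, baseline):
    """Hypothetical anomaly score: 0 if the host is familiar, 100 otherwise."""
    return 0 if host in baseline else 100


def cross_score(test_activities, baselines, score=toy_score):
    """Compare every test activity against every user's baseline."""
    results = []
    for actor, host in test_activities:
        for owner, baseline in baselines.items():
            results.append({
                "actor": actor,
                "baseline_owner": owner,
                "own": actor == owner,  # same user, or a foreign baseline?
                "score": score(host, baseline),
            })
    return results


# Activities are (user, host) pairs; the names are invented.
training = [("alice", "db01"), ("alice", "web01"), ("bob", "mail01")]
test = [("alice", "db01"), ("bob", "mail01")]

baselines = build_baselines(training)
scores = cross_score(test, baselines)
for row in scores:
    print(row)
```

With two test activities and two baselines, the sketch produces four comparisons: two against the user's own baseline and two against a foreign one.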

LACI: Lightweight Activity Comparison Illustration

The aim of our Lightweight Activity Comparison Illustration, or LACI plot, is to give an overview of the test activities from the perspective of anomaly scores assigned to them.

This visualization is built upon the output of the cross-scoring process described above, that is, the anomaly scores of a test set of activities. Each test activity is compared both to the baseline of the user to whom that activity originally belongs and to the baseline of another user. Following each comparison, an anomaly score is assigned to the activity by an anomaly detector.

We are now ready to specify what a LACI plot is:

  • It is a scatter plot in which each point represents a test activity.
  • The X axis shows the anomaly score assigned to an activity when it was compared to the baseline of the user who truly carried out that activity (a.k.a. the “own baseline”).
  • The Y axis shows the anomaly score assigned to the same activity when it was compared to the baseline of another user (a.k.a. the “foreign baseline”).
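The points of such a plot could be assembled roughly as follows. This is only a sketch under stated assumptions: the function name `laci_points`, the layout of the `scores` dictionary, and the choice of picking the foreign baseline at random (with a fixed seed) are all ours for this example.

```python
import random


def laci_points(activities, scores, users, seed=0):
    """Return one (own_score, foreign_score) point per test activity.

    activities: list of (activity_id, true_user) pairs
    scores:     dict mapping (activity_id, baseline_owner) -> anomaly score
    users:      all users that have a baseline
    """
    rng = random.Random(seed)
    points = []
    for activity_id, true_user in activities:
        foreign_users = [u for u in users if u != true_user]
        # Assumption: one foreign baseline is picked at random per activity.
        foreign_user = rng.choice(foreign_users)
        x = scores[(activity_id, true_user)]     # X axis: own baseline
        y = scores[(activity_id, foreign_user)]  # Y axis: foreign baseline
        points.append((x, y))
    return points


# Invented users, activities and scores.
users = ["alice", "bob", "carol"]
activities = [("a1", "alice"), ("a2", "bob")]
scores = {
    ("a1", "alice"): 10, ("a1", "bob"): 90, ("a1", "carol"): 70,
    ("a2", "alice"): 30, ("a2", "bob"): 20, ("a2", "carol"): 55,
}
points = laci_points(activities, scores, users)
print(points)
```

The resulting list of (x, y) pairs can be handed to any scatter plot routine.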

It is high time to finally take a look at a LACI plot. The first example is shown in Figure 2.

Figure 2. LACI plot of a decent anomaly detector. Most of the points are above the diagonal

There is an element in the visualization we have not yet mentioned. It is the diagonal, indicated by the red dashed line.

The diagonal is an important point of reference. A data point in a scatter plot is above the diagonal if its Y coordinate is larger than its X coordinate. Translating this to our domain, an activity is above the diagonal if the anomaly score it receives when compared to a foreign baseline is higher than the anomaly score it gets when compared to the user’s own baseline (that is, the user who performed that activity).

Each point above the diagonal tells a small story, in which a metaphorical Bob working with Alice’s user account (thus, compared to Alice’s baseline behavior) receives a higher anomaly score than Bob doing the same activity using his own account (thus, compared to his own baseline behavior). This is a sign of good scoring, a job well done by the anomaly detector, since a possible account theft obtains a higher anomaly score than “business as usual.” That is why we want to see data points concentrate above the red line, as they do in Figure 2.

Therefore, we can say that Figure 2 reveals satisfying performance. It is visible that most of the data points are above the diagonal. Their percentage is actually 90% — this is also shown in the lower right corner of the figure.
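The percentage in the corner of the plot is simple to compute. A minimal sketch follows, assuming the points are available as (own score, foreign score) pairs; the function name and the sample points are made up, and a point lying exactly on the diagonal is counted as not above it.

```python
def fraction_above_diagonal(points):
    """Share of points whose Y (foreign) score exceeds their X (own) score."""
    if not points:
        raise ValueError("at least one point is required")
    above = sum(1 for x, y in points if y > x)
    return above / len(points)


# Invented (own_score, foreign_score) pairs.
points = [(10, 90), (20, 30), (40, 35), (5, 60), (50, 50)]
print(fraction_above_diagonal(points))  # 0.6
```

Three of the five sample points are above the diagonal; (40, 35) is below it and (50, 50) sits exactly on it.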

We are not so lucky with the LACI plot in Figure 3, though. Sadly, there are almost as many points above the diagonal as below it. In fact, only 59% of data points are above it. What this result means is that the anomaly scoring algorithm being evaluated can barely distinguish between Bob using his own account and Bob using Alice’s account.

Figure 3. Points scattered over the whole LACI plot indicate poor anomaly scoring performance

Where are the weak spots of this anomaly detector? If you dare to look at the mistakes, focus on the data points below the diagonal, as those are the cases in which an activity receives a higher anomaly score when compared to the user’s own baseline than when compared to a foreign one. The bolder you are, the closer you should venture to the lower right corner. By examining those activities more thoroughly, you can learn a lot about the workings of the detector and why it failed to output proper scores in some cases.

The connection with AUC

At first we thought of the LACI plot only as a simple illustration that aids visual performance evaluation by making the abilities of the evaluated anomaly scorers easier to understand.

Then we noticed that by quantifying the proportion of the data points above the diagonal, we also attached a performance metric to the plot.

But the greatest surprise comes with the realization that the proportion of the data points above the diagonal in a LACI plot equals the AUC value of the anomaly scorer.

How come?

To understand this equivalence, first recall the definition of AUC in Fawcett’s paper: “[The] AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.”

Now let us find the analogies between this definition and our plot.

A “positive instance” for us is an event when Bob uses Alice’s account, that is, a virtual account theft. Likewise, a “negative instance” is an event when Bob uses his own account, that is, business as usual.

By the definition of the LACI plot, a point in the plot is a pair of positive and negative instances. “Randomly choosing” a positive instance and a negative instance is, then, equivalent to randomly choosing a point in the plot.

A randomly chosen positive instance is “ranked higher” by a “classifier” (that is, anomaly scorer) than a randomly chosen negative instance if and only if the anomaly score of the former is higher than that of the latter. This, in the context of the plot, is equivalent to a data point that is above the diagonal.

Finally, the “probability” that a randomly chosen point in a scatter plot is above the diagonal is equal to the proportion of the points above the diagonal.

Now that every phrase in the definition has been translated into the language of the plot, we hope our statement that the LACI plot provides a visual explanation of the ROC-AUC metric sounds plausible to you.
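The translation can also be spelled out in code. The sketch below is an illustration under our own assumptions: the function names and sample points are invented, and a tie is counted as "not ranked higher" (common AUC implementations typically give ties half credit). `paired_estimate` follows the plot, where each positive instance is compared with the negative instance of the same activity. `all_pairs_estimate` applies the quoted definition to every positive/negative combination.

```python
def paired_estimate(points):
    """Fraction of LACI points whose foreign score exceeds the own score."""
    return sum(1 for own, foreign in points if foreign > own) / len(points)


def all_pairs_estimate(positives, negatives):
    """P(score(pos) > score(neg)) over every positive/negative combination."""
    wins = sum(1 for p in positives for n in negatives if p > n)
    return wins / (len(positives) * len(negatives))


# Invented (own_score, foreign_score) pairs.
points = [(10, 90), (20, 30), (40, 35), (5, 60)]
negatives = [own for own, _ in points]          # business as usual
positives = [foreign for _, foreign in points]  # virtual account theft

print(paired_estimate(points))                   # 0.75
print(all_pairs_estimate(positives, negatives))  # 0.875
```

On a small sample like this one the two numbers need not coincide, because the plot only looks at the pairs it actually draws. The figure shown in the corner of a LACI plot corresponds to the paired estimate.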

Paradigm shift

By now you understand why we are happy to see most of the data points above the diagonal in a LACI plot. However, you might also have a feeling that there are different grades of being “above the diagonal” and that also influences how good an anomaly scorer is.

Let us take points “A” and “B” in Figure 4 as examples. “A” is at position (10, 90) while “B” is at (20, 30). Both are above the diagonal, but “A” is above it by 80 units, while “B” is above it by only 10.

Figure 4. Point “A” is further from the diagonal than point “B”

This means that the anomaly score that activity “A” receives when compared to a foreign baseline is greater by 80 than the score it gets when compared to its own baseline. This number is only 10 in the case of “B”.

It is safe to say that the anomaly scorer we investigate is much better at highlighting the anomalousness of the virtual account theft that point “A” stands for. How about point “B”? Unfortunately, the anomaly detector does not excel in emphasizing the anomalousness of account switch “B”.

Now suppose we got up on the wrong side of the bed. We declare a test scoring successful only if the anomaly score compared to a foreign baseline is higher than the score compared to the own baseline by at least 20. If this condition does not hold for a point, then that point will testify about a job badly done by the anomaly detector.

This can be expressed in the LACI plot by shifting the dashed red line, which separates good scorings and bad scorings, upwards by 20 units as seen in Figure 5. For reference, we kept the diagonal in the figure as a grey line.

Figure 5. A stricter condition of good scoring changes the evaluation of point “B”

Point “B” is now below the red line, expressing that we are not satisfied with the scores our algorithm assigned to the activity it represents. The percentage of points above the red line, that is, the percentage of scorings we are satisfied with is 78%. We are stricter now but this still seems to be a nice result.
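The arithmetic for points “A” and “B” can be checked with a few lines of Python. The helper names are hypothetical, and the sketch uses a strict inequality to match the formula given later in this post; if you read “at least 20” literally, `>=` would be the comparison to use.

```python
def margin(point):
    """Vertical distance of a LACI point from the diagonal (foreign - own)."""
    own_score, foreign_score = point
    return foreign_score - own_score


def is_good_scoring(point, shift=0):
    """True if the point lies above the diagonal shifted upwards by `shift`."""
    return margin(point) > shift


A = (10, 90)
B = (20, 30)

for name, point in (("A", A), ("B", B)):
    print(name, margin(point), is_good_scoring(point), is_good_scoring(point, shift=20))
# A 80 True True
# B 10 True False
```

Both points pass against the plain diagonal, but only “A” survives the shift by 20 units.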

Generalizing AUC with discriminative constant

How exactly can we formalize shifting the red line? Let us recap the definition of AUC (for the last time, I promise):

[The] AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

Suppose that we translate the above definition of AUC into the following formula:

AUC = P(score(act+) > score(act-))

where

  • act+ is a randomly chosen positive example (in our case, an unusual activity),
  • act- is a randomly chosen negative example (in our case, a usual activity),
  • score(act) is a function that assigns an anomaly score to activity act.

In order to formalize the line shift, we introduce a new parameter of AUC, namely the discriminative constant disc, in the following way:

AUC(disc) = P(score(act+) > score(act-) + disc)

Using the discriminative constant, we can set how strict we want to be when measuring the separating power of an anomaly detection algorithm through AUC. disc is the margin by which we require the score of a test activity compared to a foreign baseline to exceed its score compared to the own baseline.
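In terms of LACI points, the formula above boils down to a one-line count. Here is a minimal sketch, assuming the points are (own score, foreign score) pairs; the function name `auc_with_disc` and the sample data are our own inventions.

```python
def auc_with_disc(points, disc=0.0):
    """Estimate AUC(disc) = P(score(act+) > score(act-) + disc) from LACI points.

    Each point is (own_score, foreign_score): the own score plays the role of
    score(act-), the foreign score the role of score(act+).
    """
    if not points:
        raise ValueError("at least one point is required")
    hits = sum(1 for own, foreign in points if foreign > own + disc)
    return hits / len(points)


# Invented (own_score, foreign_score) pairs.
points = [(10, 90), (20, 30), (40, 35), (5, 60), (50, 75)]
print(auc_with_disc(points))           # 0.8, the classic case with disc=0
print(auc_with_disc(points, disc=20))  # 0.6
```

With disc=20, the point (20, 30) no longer counts as a success, so the value drops from 0.8 to 0.6.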

Benefits of the generalization

You can learn a couple of things about the performance of your anomaly detector if you calculate AUC with various disc parameters. A large AUC value computed with a large discriminative constant is certainly a good sign of an anomaly scorer working very well.

You can even cover the whole [0, 100] range when picking a value for this parameter. Sweeping through that range is equivalent to shifting the red line further and further upwards until it leaves the LACI plot at the top. This is illustrated in Animation 2.

Animation 2. Calculating AUC with many possible values of the discriminative constant
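The sweep shown in the animation can be imitated numerically. The following sketch assumes anomaly scores between 0 and 100 and LACI points stored as (own score, foreign score) pairs; the helper names and the data are invented for the example.

```python
def auc_with_disc(points, disc):
    """Share of points with foreign score > own score + disc."""
    return sum(1 for own, foreign in points if foreign > own + disc) / len(points)


def auc_curve(points, discs):
    """AUC(disc) for each discriminative constant in `discs`."""
    return [(disc, auc_with_disc(points, disc)) for disc in discs]


# Invented (own_score, foreign_score) pairs with scores in [0, 100].
points = [(10, 90), (20, 30), (40, 35), (5, 60), (50, 75)]

curve = auc_curve(points, range(0, 101, 20))
for disc, value in curve:
    print(disc, value)
# 0 0.8
# 20 0.6
# 40 0.4
# 60 0.2
# 80 0.0
# 100 0.0
```

The values can only decrease or stay the same as disc grows, and they reach zero once the line has passed every point.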

With the discriminative constant, it is also easier to choose between candidate algorithms. Suppose that the classic AUC (that is, AUC with disc=0) is the same for two anomaly detectors. By calculating AUC with a higher disc, we may be able to find out which of them separates positive and negative examples more distinctly: it is the one for which the AUC with the higher disc is larger.
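Here is a toy comparison along these lines. The two "detectors" are just hand-made lists of (own score, foreign score) points, chosen so that their classic AUC is identical; nothing here comes from real measurements.

```python
def auc_with_disc(points, disc=0.0):
    """Share of points with foreign score > own score + disc."""
    return sum(1 for own, foreign in points if foreign > own + disc) / len(points)


# Invented LACI points for two hypothetical detectors.
detector_a = [(10, 90), (20, 80), (30, 85), (60, 40)]
detector_b = [(10, 15), (20, 30), (30, 45), (60, 40)]

for name, points in (("A", detector_a), ("B", detector_b)):
    print(name, auc_with_disc(points), auc_with_disc(points, disc=20))
# A 0.75 0.75
# B 0.75 0.0
```

Both lists give 0.75 with disc=0, but only detector A keeps that value at disc=20, so it is the one that separates the two kinds of instances by a wider margin on this toy data.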

Conclusion

In this post, a new kind of performance evaluation method was introduced.

A LACI plot is a simple scatter plot that we use for visually evaluating our anomaly detection algorithms, but it can be applied to other problems as well. It is a special view on the anomaly scores assigned to test activities. It also turns out to be a visual interpretation of the well-known ROC-AUC.

Based on the ideas behind the plot, the definition of AUC can be extended. We propose a new parameter called the discriminative constant to plug into the formula of AUC, which expresses our strictness when evaluating the separating power of an anomaly detector.

We hope that you find these methods as useful as we do, and we’d be interested to learn whether you have applied them, or similar ones, to your own problems.

by Arpad Fulop

Árpád is a data scientist at Balabit working on Privileged Account Analytics, part of Balabit's PAM solution. He applies machine learning and other analytical methods to computer network data in order to detect anomalies and discover security issues.
