Our data science team develops the anomaly detectors that operate in Blindspotter, Balabit’s UBA (User Behavior Analytics) product. You can read about a bunch of these algorithms in this blog post. Today we are addressing a related problem, that is, how can we be confident in the correctness of our methods, both during product implementation and in production? There are well-established measures for evaluating machine learning methods, although most of them were devised for situations where labeled data are available — which is not usually the case with UBA setups.

In this post, we will describe this problem in more detail, and also provide possible approaches of applying standard performance measures to situations involving unlabeled data.

UBA, a case for unsupervised anomaly detection

In large enterprises, security professionals find it almost impossible to keep track of what the potentially thousands of employees are doing in the enterprise IT systems. The hardest part for them is decide whether certain user actions pose threat or not. UBA products aim to detect unusual activities that might be the signs of misuses or attacks, and send alerts, in real-time, to security analysts because these anomalies are the types of events these experts want to focus on.

If we knew exactly what anomalies were of interest to them, it would be straightforward to develop a solution alerting on an activity matching pre-defined rules of “being suspicious.” Unfortunately, threats can be of many kinds. External attackers or malicious insiders can hijack the account of an innocent user in a lot of sophisticated ways. Attacks can be so complex that manually establishing rules to identify them is not really feasible. Moreover, any manual ruleset easily becomes outdated in no time because attackers constantly think of new attack methods that might not yet have occurred to security professionals.

Instead of keeping on fighting this losing battle, we decided to take a different approach and use machine learning methods to analyze the data in IT systems and find irregularities within them. This basically means performing the classic data mining task of anomaly detection or outlier detection within this domain of digital user “fingerprints” with the aim of uncovering security issues.

One of the more challenging aspects of using machine learning in UBA is the fact that most of the data sets collected and analyzed are not labeled. In this case, the ideal label would tell whether the observed user activity is a security incident or something benign. As attacks are relatively rare and such data sets are practically never published, getting labeled data of this kind is next to impossible. An acceptable step down would be to at least get labels about whether an activity is unusual, and thus, probably worthy of being investigated. However, “unusualness” is typically not measured and unusual events are rare by definition, so acquiring or creating a sufficiently large labeled data set is not realistic. The only assumption we can make is that, in general, most events are typical and harmless.

Within these constraints we have no choice but to rely on unsupervised machine learning algorithms. Our algorithms have to look at the data and create a good definition of what normal user behavior looks like. In other words, they need to build a baseline (i.e., reference) of normal behavior. After this step they can detect potentially risky events, without having a priori knowledge of what malicious and benign behavior looks like, by checking if a new event is dissimilar enough from the baseline.

Measuring performance without labels

There are several excellent unsupervised anomaly detection algorithms that we can apply to the problem described. However, it is essential to measure their performance, i.e., how well they can recognize unusual events. For this we would normally need labels that tell whether an event is in fact unusual. But, as we stated before, the data sets we are using do not have labels. How can we still estimate performance? We mention two approaches briefly, then a third one in more details.

First, there are measures that do not require labels to calculate. For example, this paper by Goix shows certain criteria that are suitable for comparing unsupervised algorithms without labels and in this sense are comparable to ROC or Precision-Recall based criteria, which are standard metrics of evaluation with labels. The demonstrated measures have some limitations, such as working only with continuous features (attributes). As our feature set typically includes a handful of categorial features we decided to look for other options.

The second and third possible approaches follow the idea, “If you do not have label, make your own.”

The second is about evaluation using artificially synthesized events instead of events that actually happened. We sample random events from the entire feature space of user activities (i.e., every possible combination of all observed attributes). To each of these sampled events can be assigned a probability of being sampled from the historical distribution of the data. If we labeled the probable ones as “normal”, while the improbable ones “abnormal”, the anomaly detectors could be evaluated on how well they identify the abnormal events within the entire sample. However, calculating those probabilities decently is far from trivial. Before getting lost in its details we now turn our attention to the third road that is easier to follow.

Cross-scoring, a simple and effective way

The concept behind this approach that we like to use is that when measuring how well an algorithm learned the normal behavior of a user, we test its knowledge against activities made by other users (as well as activities made by that very user).

In other words, to see if we would recognize if Alice did something unusual or if someone hijacked her account, we check what would happen if Bob took her seat and started doing what he himself usually does, but this time using her account instead of his own. If Bob’s session is found unusual compared to Alice’s baseline that implies that the algorithm has the capability of finding potentially risky activities.

For this to work we have to assume that in general, people tend to act differently from each other in the observed dimensions (so what is normal for Bob is at least somewhat unusual for Alice). We found this to be true theoretically and our empirical evidence supported this.

Let us see this idea illustrated to make it clearer. In the picture below you can see the activities (resembled by the different shapes) of 3 users over a timeline.



We pick a point in time. From all of the activities before that, we build a baseline for each user that captures the essence of the behavior of the user. The baselines are denoted by the large rectangles. The activities after the picked point in time will serve as the test cases.


We have now 3 baselines and a couple of test activities per user. We take the test cases and compare all of them against all of the baselines. For example, in the illustration below, 30 comparisons occur. For each comparison, the anomaly detector to be tested provides a 0-100 score that shows how far were the compared test case and the baseline from each other.

We call this method cross-scoring, but many other names apply as well. The first occurrence of such a trick that we know about in the security context is in the paper “Computer Intrusion: Detecting Masquerades” by Schonlau et al. from 2001 that investigates how to find intruders in IT systems based on the usage of Unix commands.

Having constructed the test cases this way, we now have labeled data. The label of each test case when it was compared against the baseline of someone else is “abnormal”, and in the opposite case it is “normal”. This resembles the well-known one-vs-rest concept for multiclass classification. With the anomaly scores and labels at our hands, now the standard measures of performance can be calculated on this data set.

Getting philosophical

To end this chain of thoughts, we ponder about a concern regarding evaluation via cross-scoring.

Earlier we touched on the idea that unsupervised algorithms are in a sense better matches to the UBA problem than supervised ones. If we had trained a supervised model with a one-vs-rest scheme using data assembled by mixing activities of different users, the model would have learnt how to separate the activities of users only from the activities of rest of the users present in the training set but not from the actions of other, unknown actors. (In contrast, the objective of an unsupervised model is to detect any irregular activities regardless of who the actor is.)

A similar concern can be raised as we evaluate our unsupervised algorithms using the cross-scoring approach. The performance measures obtained this way only reflect the ability of the detector to distinguish activities of a certain user from the activities of others in our database, but not certainly all possible abnormal activities. By using these performance metrics to fine-tune our algorithms we are actually training them to be the best at distinguishing users within the system and not to tell insiders from outsiders.

It is crucial to notice, however, that this feedback loop is not closed computationally and automatically and we are not suddenly turning our unsupervised model into a supervised one just because we evaluate its performance this way. In some sense, supervision does happen through the usage of the information gathered via this evaluation method, but its extent is, and has to be controlled carefully by the designer of the algorithm.

How can we control the extent of this effect?

Suppose there is an adjustment of a hyperparameter of an algorithm that apparently increases its performance measured on a specific data set. First, we can often derive from domain knowledge whether this adjustment is too specific to be generalizable beyond the current set of users.

Moreover, classic machine learning techniques developed to avoid overfitting also help. For example we can use separate validation and test data sets. The performance measures gained from the validation set should be used for adjusting the hyperparameters of the algorithm. But after we measure the performance on a new, independent test set we must hold ourselves back from additional fine-tunes, or else the test set performance after further adjustments will be estimated overly optimistically.

In the next episode

In a following blog post we will show label-based measures that we find useful for performance measurement — both standard and special techniques.