“The temptation to form premature theories upon insufficient data is the bane of our profession.”  – Sherlock Holmes

Everyone working in the field of IT security should give heed to Sherlock Holmes’ advice, but it is especially important for those of us who develop data analytics solutions, such as a User Behavior Analytics (UBA) tool. “Garbage in, garbage out” is a well-known saying in the software industry: no matter how accurate a program’s logic is, the results will be wrong if the input is invalid.

Large enterprises hope that UBA can be a silver bullet that stops external attackers and malicious insiders while at the same time making their SOCs more efficient – at least, that is what the marketing messages of UBA vendors try to prove. The problem is that as security departments race to provide an impenetrable defense against malicious insiders and outsiders, they don’t always have the high-quality data they need. They have lots of data – operating system logs, business application logs, network flows and so on – but how much of it is relevant? Most enterprises don’t know who their real cyber adversaries are. They lack the right data, the right people to analyze it, and the right tools to analyze it with.

What is high-quality data?

“War is ninety percent information.” – Napoleon Bonaparte


Although almost a century and a half elapsed between the death of Napoleon and the birth of cybersecurity, his quote holds perfectly true for our industry as well. Security experts need to know their systems and users very well if they want to provide an effective defense against attackers. The most important thing they need is high-quality data. But what can we classify as high-quality data, and what does it mean in the field of analytics?

First, high-quality data means complete data – you should be able to capture everything related to the behavior of the user, avoiding any blind spots. All datasets have missing values, and good models can impute most of them, but if your dataset is as full of holes as an Emmentaler cheese, the result of your analysis will be about as useful as a fortune-teller’s prophecy on a late-night cable TV show. In practice this means that when you build a user’s behavior profile, you need to know not only her physical location and login-logout times, but her other characteristics as well, from her list of favorite applications to her idiosyncratic keyboard usage patterns.
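
To make this concrete, here is a minimal sketch in Python of the kind of completeness check you might run before modeling; the field names are purely illustrative assumptions, not a prescribed schema.

    # A minimal completeness check on user-behavior records (illustrative fields).
    REQUIRED_FIELDS = ["user", "location", "login_time", "logout_time",
                       "applications", "keystroke_profile"]

    def completeness_report(events):
        """Return the fraction of events missing each required field."""
        missing = {field: 0 for field in REQUIRED_FIELDS}
        for event in events:
            for field in REQUIRED_FIELDS:
                if event.get(field) in (None, "", []):
                    missing[field] += 1
        total = len(events) or 1
        return {field: count / total for field, count in missing.items()}

    events = [
        {"user": "alice", "location": "HQ", "login_time": "08:55",
         "logout_time": "17:10", "applications": ["mail", "crm"],
         "keystroke_profile": None},
        {"user": "bob", "location": "", "login_time": "09:02",
         "logout_time": None, "applications": [], "keystroke_profile": None},
    ]
    print(completeness_report(events))
    # keystroke_profile is missing everywhere and location half the time:
    # fields with ratios close to 1.0 are the holes no model can reliably fill.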

 “Not everything that can be counted counts, and not everything that counts can be counted.” – Albert Einstein

Second, although you should capture everything, you don’t want to feed your analytics solution with everything – you need to sort out the really relevant information. Every data analytics solution has a maximum processing capacity, and the key to maximum effectiveness is loading only the truly relevant data. This means you should be able to filter, transform and enrich the data with the most relevant information before feeding it into your analytics solution. The two most important aspects of relevancy are:

  • The data must be connected to a single user
  • The data must describe the behavior of a single user directly

For example, network data – such as NetFlow – is a plentiful source, but it reflects only the low-level side effects of user activity, not the users’ exact actions. Because it generates so much data, the signal-to-noise ratio is unfavorable, and analyzing it creates more problems than it solves.
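
The sketch below illustrates this kind of pre-ingestion filtering and enrichment under some simplifying assumptions: the event fields, event types and the directory lookup are invented for the example and do not reflect any particular product’s API.

    # Keep only events that are tied to exactly one known user and that describe
    # that user's own activity, then enrich them before analytics ingestion.
    USER_DIRECTORY = {"alice": {"department": "finance", "title": "controller"}}

    def is_relevant(event):
        if not event.get("user"):          # must be attributable to a single user
            return False
        # ...and must describe that user's action directly, not background traffic
        return event.get("type") in {"login", "file_access", "app_launch"}

    def enrich(event):
        enriched = dict(event)
        enriched.update(USER_DIRECTORY.get(event["user"], {}))
        return enriched

    raw_events = [
        {"type": "netflow", "src_ip": "10.0.0.12", "bytes": 48211},   # indirect, noisy
        {"type": "file_access", "user": "alice", "path": "/finance/q3.xlsx"},
    ]
    relevant = [enrich(e) for e in raw_events if is_relevant(e)]
    print(relevant)   # only the file_access event survives, now carrying HR context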

“If you can’t explain it simply, you don’t understand it well enough.” – Albert Einstein

Third, high-quality data can’t be aggregated information. Owning granular data about the behavior of users is not just nice to have – it is a must-have. Aggregated data is useful in many situations, and it is also easier to store, but granular data is always needed for incident investigation, which is a typical use case for every User Behavior Analytics solution.
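
The difference is easy to demonstrate with a toy example (the events below are invented): the aggregated summary is compact, but only the granular events can answer the questions an incident responder actually asks.

    # Granular login events versus their aggregated daily summary.
    logins = [
        {"user": "alice", "time": "2016-05-10T08:57", "host": "ws-042", "result": "success"},
        {"user": "alice", "time": "2016-05-10T23:41", "host": "srv-db-01", "result": "success"},
        {"user": "alice", "time": "2016-05-10T23:42", "host": "srv-db-02", "result": "failure"},
    ]

    # Aggregated view: fine for a dashboard, useless for an investigation.
    print({"user": "alice", "date": "2016-05-10", "logins": len(logins)})

    # Granular view: reveals the off-hours access to database servers that the
    # bare count hides.
    print([e for e in logins if e["time"].split("T")[1] >= "22:00"])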

Typical failures of data analytics projects

  • Identity resolution
    When a user accesses several different services under different usernames, you must be able to connect all of them to the proper persona instead of creating redundant ones (see the identity-resolution sketch after this list).
  • Understanding the data
    You must know the characteristics of your data before you start to analyze it – for example, you should never try to compute an average of users’ ID numbers.
  • Continuous baseline-building
    Your representation of reality must closely follow changes in reality – update your baselines as frequently as possible to avoid a growing number of false positives and false negatives. This is especially important when a user moves to a new position or another significant factor changes (see the baseline sketch after this list).
  • Inconsistent data formats
    Consistency is crucial in data science. If the format of your dataset varies, your model won’t be able to interpret it and produce reliable results.
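
To illustrate the identity-resolution point, here is a minimal sketch; the alias table is a hypothetical stand-in for what you would normally derive from an HR system or a directory service.

    # Map account names seen in different log sources back to one persona,
    # so a user's activity is not split across redundant profiles.
    ALIASES = {
        "jsmith": "john.smith",
        "john.smith@corp.example": "john.smith",
        "CORP\\jsmith-admin": "john.smith",
    }

    def resolve(account):
        """Return the canonical persona for an account, or the account itself."""
        return ALIASES.get(account, account)

    events = [
        {"account": "jsmith", "source": "vpn"},
        {"account": "john.smith@corp.example", "source": "mail"},
        {"account": "CORP\\jsmith-admin", "source": "domain_controller"},
    ]
    for event in events:
        event["persona"] = resolve(event["account"])

    print({e["persona"] for e in events})   # {'john.smith'}: one persona, three accounts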
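
For continuous baseline-building, here is a simple sketch using an exponentially weighted moving average: the metric and the smoothing factor are illustrative choices, not recommendations.

    # Update a per-user baseline with every new observation, so recent behavior
    # weighs more than stale history; compare each day to the previous baseline.
    ALPHA = 0.1   # a higher alpha makes the baseline adapt faster to change

    def update_baseline(baseline, observation, alpha=ALPHA):
        if baseline is None:
            return observation
        return alpha * observation + (1 - alpha) * baseline

    baseline = None
    for daily_upload_mb in [120, 135, 110, 140, 125, 900]:   # last value is anomalous
        ratio = None if baseline is None else round(daily_upload_mb / baseline, 2)
        baseline = update_baseline(baseline, daily_upload_mb)
        print(round(baseline, 1), ratio)
    # The baseline drifts with normal variation, while the 900 MB day stands out
    # as a large multiple of it; after a role change the baseline may need a reset.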

Any analytics project can be successful only if it respects these aspects of high-quality data. A quote from the famous pollster Arthur C. Nielsen perfectly captures the reward of such projects: “The price of light is less than the cost of darkness.”