Taming DNS Data: Stacking Machine Learning Algorithms to Find DGA Malware Activities
DNS traffic is the backbone of the internet. It performs the essential function of resolving the domain names in user-requested URLs to the IP addresses hosting them. Without it, the World Wide Web would not work. Unfortunately, because it is so critical and ever-present, it also provides cyber criminals with the perfect environment to hide. DNS traffic cannot be blocked: it is everywhere, it is as common as the internet itself, and it is extremely noisy. In a typical large enterprise, users generate more than 30 million DNS requests per hour, requesting over a million different domains!
Cyber criminals hijack the DNS protocol and use it for their own purposes: beaconing and establishing initial communication channels, command and control, and even data exfiltration. They use a wide variety of tools to accomplish this, from readily available commodity malware to custom-made software.
At Sqrrl we recognize the critical need for security teams to have a set of defensive tools that gives them full visibility into and control over the DNS traffic in their organization. To fulfill this need, we developed a series of advanced analysis algorithms designed to detect a wide range of tactics, techniques, and procedures (TTPs) in DNS traffic. In today’s blog we describe one of these algorithms, which detects instances of DGA malware activity.
The use of a Domain Generation Algorithm (DGA) is a widespread malware technique for establishing communication from inside a compromised network to one of the domains controlled by the threat actors. To avoid being blacklisted and disrupted, the perpetrators use a range of short-lived domains, thus presenting a “moving” target. The DGA malware generates a set of domain names (often a new set each day) and tries to resolve them, searching for an active domain that is ready to start communication. This technique gives cyber criminals a reliable means of communication that is difficult to detect and disrupt.
When designing a new analysis algorithm, our first step is to define the operational characteristics it should achieve and to understand real-life operational conditions. SOCs generally have limited time and human resources for investigation. An effective algorithm should have a low false-positive rate (ideally fewer than 10 per day) and high detection efficacy (ideally 80-90% of DGA malware activity should be detected). Given the typical level of noise (~30 million DNS requests per hour!), these are extremely stringent requirements. Yet this is what is required; otherwise the algorithm would be essentially useless!
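To get a feel for how stringent this is, the figures quoted above can be turned into a per-request budget. This is only back-of-the-envelope arithmetic in Python: it treats every DNS request as an independent chance for a false alarm, which is a simplification, and the variable names are ours for illustration.

```python
# Figures taken from the text: ~30 million DNS requests per hour and a
# budget of about 10 false positives per day (used here as a ceiling).
REQUESTS_PER_HOUR = 30_000_000
MAX_FALSE_POSITIVES_PER_DAY = 10

requests_per_day = REQUESTS_PER_HOUR * 24
# Simplifying assumption: each request is an independent chance for a false alarm.
max_false_positive_rate = MAX_FALSE_POSITIVES_PER_DAY / requests_per_day
requests_per_false_positive = requests_per_day // MAX_FALSE_POSITIVES_PER_DAY

print(f"requests per day: {requests_per_day:,}")
print(f"tolerable false-positive rate per request: {max_false_positive_rate:.2e}")
print(f"one false positive allowed per {requests_per_false_positive:,} requests")
```

Under these assumptions the budget is roughly one false alarm for every 72 million requests, which is why no single noisy indicator is likely to be enough on its own.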
Machine learning algorithms, while very powerful, are not a magic pill. The role of data science is to formulate a mathematical description of the problem, determine the set of most informative features, and finally build an end-to-end analysis flow using the most suitable machine learning algorithms. Ultimately, it is not the tools themselves but how they are used that makes the difference between failure and success.
To satisfy the stringent operational requirements, we had to find a way to suppress the DNS noise and pick out just the DNS requests that fit the patterns of DGA activity. First, we realized that DGA malware activity is always a sequence of DNS requests that emanate from a single compromised end-point (host or IP address) in a relatively short period of time. The number of requests and the interval between them may vary, but it is essential to search for the DGA signal as a series of requests. On the one hand, this increases confidence in the detection (roughly in proportion to the number of suspicious requests uncovered); on the other, it provides the analyst with a complete picture of the activity.
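As a rough illustration of treating the signal as a series of requests, the sketch below groups requests by end-point and starts a new sequence whenever the gap between consecutive requests exceeds a threshold. The function name, the tuple layout, the sample data, and the 5-minute gap are all assumptions made for this example; the post does not state how the grouping is actually implemented.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical threshold: requests from one end-point separated by more than
# this gap start a new sequence. The real value is not given in the text.
MAX_GAP = timedelta(minutes=5)

def group_into_sequences(requests, max_gap=MAX_GAP):
    """Group (timestamp, endpoint, domain) tuples into per-endpoint sequences."""
    by_endpoint = defaultdict(list)
    for ts, endpoint, domain in sorted(requests):  # sorted by timestamp first
        by_endpoint[endpoint].append((ts, domain))

    sequences = []
    for endpoint, items in by_endpoint.items():
        current = [items[0]]
        for prev, item in zip(items, items[1:]):
            if item[0] - prev[0] > max_gap:
                # Gap too large: close the current sequence and start a new one.
                sequences.append((endpoint, current))
                current = []
            current.append(item)
        sequences.append((endpoint, current))
    return sequences

# Made-up sample data using reserved .example names.
t0 = datetime(2017, 1, 1, 9, 0, 0)
requests = [
    (t0, "10.0.0.5", "qwhzkxv.example"),
    (t0 + timedelta(seconds=20), "10.0.0.5", "pldkfjq.example"),
    (t0 + timedelta(seconds=30), "10.0.0.7", "news.example"),
    (t0 + timedelta(hours=2), "10.0.0.5", "mail.example"),
]
sequences = group_into_sequences(requests)
for endpoint, seq in sequences:
    print(endpoint, [domain for _, domain in seq])
```

Here the two closely spaced requests from 10.0.0.5 end up in one sequence, while its request two hours later forms a separate one.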
Another realization was that the information relevant for identifying DGA requests is contained at different levels: properties of domain name strings, features of individual DNS requests, and finally characteristics of grouped sequences of DNS requests. This type of problem requires a deep learning architecture in which the data at each level is processed by an appropriate machine learning algorithm whose output is used at the next level of analysis. Such a multi-layered architecture is reminiscent of convolutional neural networks (CNNs), one of the architectures with which the term deep learning is associated. In the case of DNS data, deep learning requires stacking different kinds of machine learning algorithms.
The very first layer deals with the information contained in the structure of the domain name string. Depending on the DGA malware, the generated domain names may be random alphanumeric strings or concatenations of English words. In either case, because they are generated, they tend to differ from the names commonly used for domains. One can train natural language processing (NLP) algorithms on a large corpus of domain names requested by normal users and use them to assign a string information score that reflects how likely a domain name is to be anomalous. That string information score is assigned to every DNS request and used as a classification feature at the next level.
In addition to having unusual names, the domains used by adversaries exhibit historical resolution patterns that differ from those of most benign domains. They are resolved by a small number of infected end-points and tend to have a high fraction of NXDOMAIN response codes (signifying that the requested domain does not exist, for example because it is not registered). Such features cannot be extracted from a single DNS request; they require stateful aggregation and storage. Thankfully, Sqrrl’s graph database model aggregates all DNS log records from the enterprise network and can be queried to extract these features.
The string information score, the stateful features, and other features of the DNS request are processed by a multivariate machine learning algorithm that treats each DNS request as a numeric vector and produces a score reflecting how likely the request is to be suspicious.
At the final layer, the processed DNS requests are grouped and arranged into sequences called DGA sessions. Each session is characterized by features such as the number of requests to distinct domains that returned an NXDOMAIN response code, the hour of the day, the day of the week, and others.
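The session-level features named above could be computed along the following lines. This is a minimal sketch, not the production feature set: the function name, the tuple layout, and the sample session are assumptions, and we interpret the NXDOMAIN feature as the count of distinct domains that returned NXDOMAIN within the session.

```python
from datetime import datetime

def session_features(session):
    """Summarise one session given as a list of (timestamp, domain, rcode) tuples."""
    start = min(ts for ts, _, _ in session)
    # Distinct domains whose lookup came back NXDOMAIN (repeats counted once).
    nx_domains = {domain for _, domain, rcode in session if rcode == "NXDOMAIN"}
    return {
        "distinct_nxdomain_domains": len(nx_domains),
        "hour_of_day": start.hour,
        "day_of_week": start.weekday(),  # Monday is 0, Sunday is 6
    }

# Made-up session from a single end-point, using reserved .example names.
session = [
    (datetime(2017, 3, 6, 14, 0, 5), "qwhzkxv.example", "NXDOMAIN"),
    (datetime(2017, 3, 6, 14, 0, 9), "pldkfjq.example", "NXDOMAIN"),
    (datetime(2017, 3, 6, 14, 0, 9), "pldkfjq.example", "NXDOMAIN"),
    (datetime(2017, 3, 6, 14, 0, 15), "rtyuvbn.example", "NOERROR"),
]
features = session_features(session)
print(features)
```

The resulting dictionary is the kind of small numeric vector that a session-level model could consume; the other features mentioned in the text would be added in the same way.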
Each DGA session is processed by the final multivariate machine learning algorithm, and a combined detection score is assigned that includes contributions from all levels. The full analysis architecture is shown in the diagram below.
The full DGA detection algorithm consists of multiple machine learning layers. In the first layer, an NLP algorithm is used to classify the domain name of every DNS request and convert it into a string information score. In the second layer, a multivariate machine learning algorithm is applied to assign a score to every DNS request based on the vector of its features. Some of the features are extracted from the aggregated data stored in the Sqrrl Graph Model. In the third layer, DNS requests are grouped into DGA sessions based on the end-point and closeness in time. The DGA session features are analyzed by another multivariate machine learning algorithm. The final output is the detection statistic score assigned to DGA sessions, which includes contributions from all other layers. Such an architecture is reminiscent of convolutional neural networks.
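To make the first layer more concrete, here is one simple way a string information score could be computed. The post does not say which NLP model is used; the character-bigram model with add-one smoothing below is a stand-in chosen for brevity, and the tiny corpus, function names, and test strings are purely illustrative. A real model would be trained on a large corpus of domain names requested by normal users.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"
START, END = "^", "$"  # markers for the beginning and end of a name

def train_bigram_model(names):
    """Count character bigrams over a corpus of domain name labels."""
    pair_counts, first_counts = Counter(), Counter()
    for name in names:
        padded = START + name.lower() + END
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts

def string_score(name, model):
    """Average negative log-probability per bigram; higher means more unusual."""
    pair_counts, first_counts = model
    vocab = len(ALPHABET) + 1  # possible next symbols: alphabet plus END
    padded = START + name.lower() + END
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + vocab)
        total -= math.log(p)
    return total / (len(padded) - 1)

# Tiny illustrative corpus; far too small for real use.
corpus = ["mail", "news", "weather", "shopping", "search",
          "video", "music", "online", "travel", "banking"]
model = train_bigram_model(corpus)

for candidate in ["newsonline", "xqzjkvwpty"]:
    print(candidate, round(string_score(candidate, model), 2))
```

With this toy corpus the random-looking string scores higher than the word-like one, but the margin is small because smoothing dominates when so little training data is available. DGA names built from concatenated English words would also score low here, which is one reason the later layers are needed.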
In our experiments we found that the information contained at any single level is not enough to achieve the required operational characteristics on realistic data. Taken individually, each machine learning algorithm produces too many false positives. This is because DNS data is very noisy: there are too many requests to benign domains with weird names, too many individual DNS requests for rare or new domains that are benign, and too many DNS requests that return an NXDOMAIN response code and tend to cluster into groups. Only by combining all of these features in a deep learning architecture were we able to achieve the desired performance.
When detected, the DGA activity is presented to the analyst in Sqrrl, enriched with contextual information that helps them investigate and respond to the incident. Below is a screenshot showing an example of a detected DGA.
DGA detection provides the analyst with rich contextual information about malware activity. In one view, it shows the IP address of the compromised end-point, the list of domains the malware attempted to reach, and basic characteristics such as the start and end time of the activity, total duration, etc.
The explore view of the detected DGA activity shows its graph representation. The analyst can use it to perform a quick, in-depth investigation, identify compromised assets, and determine the extent of the compromise.
By now our DGA algorithm has been tested and experimentally verified in multiple large enterprise networks. It produces a very small number of false positives (only a few per day for the whole network) and reliably finds a variety of DGA malware activity, even when the malware generates only a small number of requests (tens). A paper describing the results of our research is in preparation.
We would like to highlight that this was the result of rigorous data science research, and that the need for this type of multi-layered, stacked machine learning architecture in the analysis of cyber security data is the rule rather than the exception. Challenges like this are what inspire our data science team to continue on the path of innovation and bring more and more advanced tools to help security analysts find cyber criminals.
For more information on Sqrrl and Threat Hunting, watch these short clips:
- Leveraging DNS and Data Science for Surface Attacker Activity
- How Sqrrl Applies Data Science to Threat Hunting
- How to Overcome Common Threat Hunting Challenges
- How Alert-driven Investigations Differ from Threat Hunting