Behavioral Analysis in the Malware Detection Task

Malware has long been one of the main threats in the field of information security. Approaches to analyzing and protecting against such attacks differ; in general, two are distinguished: static and dynamic analysis.

The task of static analysis is to search a file or process memory for patterns of malicious content. These can be strings, fragments of encoded or compressed data, or sequences of compiled code. Not only individual patterns can be matched, but also combinations of them with additional conditions (for example, conditions tied to the location of a signature, or checks on the relative distance between matches).
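
As an illustration, a minimal sketch of such a combined condition; the byte patterns and the distance threshold are invented for the example:
Code:
# A toy static check: both patterns must be present and close to each other.
PATTERNS = [b"cmd.exe /c", b"VirtualAlloc"]   # hypothetical signatures
MAX_DISTANCE = 512                            # hypothetical distance condition

def match_signature(data: bytes) -> bool:
    offsets = []
    for pattern in PATTERNS:
        offset = data.find(pattern)
        if offset == -1:
            return False             # an individual pattern is missing
        offsets.append(offset)
    # combined condition: the matches must lie within MAX_DISTANCE bytes
    return abs(offsets[0] - offsets[1]) <= MAX_DISTANCE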

Dynamic analysis is the analysis of program behavior. A program can be launched in a so-called emulated mode, where its actions are interpreted safely, without causing damage to the operating system. Another way is to launch the program in a virtual environment (a sandbox). In this case the actions are actually performed on the system, and the calls are recorded. The level of logging detail is a balance between the depth of observation and the performance of the analyzing system. The output is a log of the program's actions in the operating system (a behavior trace), which can then be analyzed.

Dynamic, or behavioral, analysis provides a key advantage: no matter how the program code is obfuscated and however hard the intruder tries to hide their intentions from the virus analyst, the malicious effect will be recorded. Reducing malware detection to the analysis of actions supports the hypothesis that a behavior-based detection algorithm will be robust. And the reproducibility of behavior, thanks to an identical initial state of the analysis environment (a snapshot of the virtual server state), simplifies the task of classifying behavior as legitimate or malicious.

Behavioral analysis approaches are often based on rule sets. Expert knowledge is transferred into signatures, on the basis of which the detection tool draws its conclusions about files and malware. However, a problem arises here: only attacks that strictly match the written rules are caught, while attacks that do not meet those conditions, but are still malicious, can be missed. The same problem appears when the same malware is modified. This can be solved either by using softer triggering criteria, i.e. writing a more general rule, or by using a large number of rules for each piece of malware. In the first scenario we risk many false positives; the second requires a lot of time, which can delay the necessary updates.
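
To make the trade-off concrete, here is a sketch of the two options on a call trace; the call names and matching logic are invented for the example:
Code:
STRICT_SEQUENCE = ["NtOpenProcess", "NtWriteVirtualMemory", "NtCreateThreadEx"]

def strict_rule(trace):
    # fires only on the exact contiguous sequence: precise, but easy to evade
    n = len(STRICT_SEQUENCE)
    return any(trace[i:i + n] == STRICT_SEQUENCE
               for i in range(len(trace) - n + 1))

def general_rule(trace):
    # fires if the calls appear in order with any gaps in between:
    # catches modified variants, but risks more false positives
    it = iter(trace)
    return all(call in it for call in STRICT_SEQUENCE)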

There is a need to extend existing knowledge to other, similar cases: ones we have not encountered before and have not covered with rules, but which, by the similarity of certain signs, we can conclude may be malicious. This is where machine learning algorithms come to the rescue.

ML models, when trained correctly, have the ability to generalize. This means that the trained model has not simply memorized all the examples it was trained on, but is able to make decisions for new examples based on patterns from the training sample.

However, for the generalizing ability to work, two main factors must be taken into account at the learning stage:
  • The set of features should be as complete as possible (so that the model can see as many patterns as possible, and therefore better extend its knowledge to new examples), but not redundant (so as not to store and process features that do not carry useful information for the model).
  • The data set must be representative, balanced and regularly updated.

Since we had the opportunity to collect the required amount of data and had a hypothesis that machine learning could expand the existing solution, we decided to take on this research: to form a set of features, train a model on them, and achieve accuracy that would allow us to trust the model’s conclusions about the maliciousness of files.

How Expertise Is Transferred to Machine Learning Models​

In the context of malware analysis, the source data is the files themselves, and the intermediate data is the auxiliary processes they create. The processes, in turn, make system calls. The sequences of such calls are the data that we need to transform into a set of features.

Dataset compilation began on the expert side. The features that, in the experts' opinion, should be significant for malware detection were selected; all of them could be reduced to the form of n-grams over system calls. Then, using the model, we assessed which features contribute most to detection, discarded the unnecessary ones, and obtained the final version of the dataset.

Initial data:
[CODE{"count":1,"PID":"764","Method":"NtQuerySystemInformation","unixtime":"1639557419.628073","TID":"788","plugin":"syscall","PPID":"416","Others":"REST: ,Module=\"nt\",vCPU=1,CR3=0x174DB000,Syscall=51,NArgs=4,SystemInformationClass=0x53,SystemInformation=0x23BAD0,SystemInformationLength=0x10,ReturnLength=0x0","ProcessName":"windows\\system32\\svchost.exe"}
{"Key":"\\registry\\machine","GraphKey":"\\REGISTRY\\MACHINE","count":1,"plugin":"regmon","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe","PPID":"1324","PID":"616"}
{"count":1,"PID":"616","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","plugin":"syscall","PPID":"1324","Others":"REST: ,Module=\"nt\",vCPU=0,CR3=0x4B7BF000,Syscall=19,NArgs=5,KeyHandle=0x1F8,KeyInformationClass=0x7,KeyInformation=0x20CD88,Length=0x4,ResultLength=0x20CD98","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe"}[/CODE]

Intermediate data (sequences):
Code:
syscall_NtQuerySystemInformation*regmon_NtQueryKey*syscall_NtQueryKey

The feature vector (bigram features and their counts):
Code:
syscall_NtQuerySystemInformation*regmon_NtQueryKey     1
regmon_NtQueryKey*syscall_NtQueryKey                   1
syscall_NtQuerySystemInformation*syscall_NtQueryKey    0
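
A minimal sketch of this transformation, assuming the log format shown above (the plugin, Method and unixtime fields):
Code:
import json

# Build the plugin_Method sequence from raw JSON events, then count n-grams.
def trace_to_sequence(json_lines):
    events = [json.loads(line) for line in json_lines]
    events.sort(key=lambda e: float(e["unixtime"]))   # restore call order
    return ["{}_{}".format(e["plugin"], e["Method"]) for e in events]

def ngram_features(sequence, n=2):
    features = {}
    for i in range(len(sequence) - n + 1):
        key = "*".join(sequence[i:i + n])
        features[key] = features.get(key, 0) + 1
    return features

seq = ["syscall_NtQuerySystemInformation", "regmon_NtQueryKey", "syscall_NtQueryKey"]
print(ngram_features(seq))
# {'syscall_NtQuerySystemInformation*regmon_NtQueryKey': 1,
#  'regmon_NtQueryKey*syscall_NtQueryKey': 1}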

How the model's knowledge was accumulated, how this process changed, why it is important to stop data accumulation in time​

As stated above, the main requirements for data are representativeness, balance and regular updating. Let us explain all three points in the context of behavioral analysis of malicious files:
  1. Representativeness. The distribution of data by features should be close to the distribution in real life.
  2. Balance. The initial data for training comes with a "legitimate" or "malicious" label, and this information is passed to the model; we aim for a setting in which the number of malicious examples is close to the number of clean ones.
  3. Regular update. Largely related to representativeness. Since trends in malicious files are constantly changing, it is necessary to regularly update the model's knowledge.

Taking all of the above requirements into account, the following data accumulation process was built (a sketch of the resulting composition logic follows the list):
  1. The data is divided into two types: the main data stream and reference examples. Reference examples are checked manually by experts, so the correctness of their labels is guaranteed; they are used to validate the model and to manage the training sample. The main stream is labeled by rules and automated checks; it is needed to enrich the sample with diverse examples from real life.
  2. All reference examples are immediately added to the training set.
  3. In addition, an initial portion of data from the stream is added to reach the required training volume, i.e. a volume at which the training sample is sufficiently complete (in terms of data diversity) and representative. Since reference examples are checked manually by experts, it is impossible to collect several tens of thousands of examples from references alone, hence the need to draw diversity from the stream.
  4. Periodically, the model is tested on new data from the stream.
  5. Accuracy must be guaranteed first of all on the reference examples; if contradictions arise, preference is given to the reference data, which is retained in any case.
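
A minimal sketch of this composition logic; the target volume, the (vector, label) representation, and the exact-match conflict check are assumptions for the example:
Code:
TARGET_SIZE = 50_000   # hypothetical required training volume

def build_training_set(references, stream):
    # references, stream: iterables of (feature_vector, label) pairs;
    # reference labels are expert-verified and always win on conflict
    ref_labels = {tuple(vec): label for vec, label in references}
    training = [(vec, label) for vec, label in references]
    for vec, label in stream:
        if len(training) >= TARGET_SIZE:
            break
        ref_label = ref_labels.get(tuple(vec))
        if ref_label is not None and ref_label != label:
            continue   # contradicts a reference; the reference is kept instead
        training.append((vec, label))
    return training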

Over time, a lot of data accumulated from the stream, and there was a need to move away from automated, error-driven accumulation in favor of a more controlled training sample:
  1. The training sample accumulated up to the current moment is frozen;
  2. Data from the stream is now used only for testing the model; no instances are added to the training set;
  3. The training set is updated only when the set of reference examples is updated.

In this way we were able to achieve the following:
  1. We made sure that the trained and frozen model is sufficiently robust to data drift;
  2. We control each new example added to the training set (reference examples are checked manually by experts);
  3. We can track every change and guarantee accuracy on a reference data set.

How to ensure that the quality of the model improves with each update​

After the described process of data accumulation, a completely natural question may arise: why are we so sure that each update of the model actually improves it?

The answer is the same reference sample. We consider it the most reliable because its examples are checked and labeled manually by experts, and with each update the first thing we verify is that we still guarantee 100% accuracy on this sample. Testing in the wild confirms that accuracy is improving.
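
In code, this check can be thought of roughly as a release gate; model.predict and the accuracy bookkeeping are stand-ins for whatever inference interface is actually used:
Code:
# A model update is accepted only with zero errors on the reference sample
# and no degradation on fresh data from the stream.
def accept_update(model, references, stream_sample, prev_accuracy):
    # hard requirement: 100% accuracy on the expert-verified references
    if any(model.predict(vec) != label for vec, label in references):
        return False
    # soft requirement: accuracy on new stream data must not drop
    hits = sum(model.predict(vec) == label for vec, label in stream_sample)
    return hits / len(stream_sample) >= prev_accuracy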

This is achieved by cleaning the training sample of data that contradicts the references. By contradictory data we mean examples accumulated from the stream that are close enough in vector distance to traces from the reference sample, but carry the opposite label.

Our experiments showed that such examples are outliers even from the point of view of the stream data: after we removed them from the training set in order to increase accuracy on the reference set, accuracy on the stream also increased.
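
A minimal sketch of that cleaning step; the numpy arrays and the distance threshold are illustrative assumptions:
Code:
import numpy as np

RADIUS = 0.5   # hypothetical "close enough" threshold in feature space

def remove_contradictions(stream_X, stream_y, ref_X, ref_y):
    # drop stream examples that sit next to a reference but disagree with it
    keep = []
    for i in range(len(stream_X)):
        distances = np.linalg.norm(ref_X - stream_X[i], axis=1)
        nearest = int(np.argmin(distances))
        if distances[nearest] <= RADIUS and ref_y[nearest] != stream_y[i]:
            continue   # contradictory outlier: exclude from the training set
        keep.append(i)
    return stream_X[keep], stream_y[keep]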

Complementarity of the ML approach and behavioral detections in the form of correlations​

The ML model has shown itself to work very well in combination with behavioral detections in the form of correlations. It is important that it is precisely a combination: the model's generalizing ability is strong where the solution needs to be extended to detect similar, closely related incidents, but not where a detection must follow a clear, rule-based understanding of what constitutes malware.

Examples where the ML approach was able to truly expand the solution were:
  • Anomalous chains of subprocesses. A large number of branched chains is in itself a legitimate phenomenon, but the model notices anomalies in the number of nodes, the degree of nesting, and the repetition (or non-repetition) of specific process names, things an analyst is unlikely to anticipate in advance as harmful (see the sketch after this list).
  • Non-standard values of default call parameters. In most cases the analyst is interested in the significant parameters of the functions in which malware is being looked for; the remaining parameters are, roughly speaking, default values of no particular interest. But at some point, instead of, say, the five usual default values, a sixth appears. The analyst might not have assumed this was possible, but the model noticed.
  • Atypical sequences of function calls. Each function individually does nothing malicious, and neither does the combination of them. But it so happens that their particular sequence is not found in legitimate software. An analyst would need an enormous amount of experience to notice such a pattern on their own, but the model notices (and more than one), solving the classification problem in an unconventional way, based on a feature that was never intended as an indicator of maliciousness.
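
As an illustration of the first point, a sketch of simple features over a subprocess tree; the (name, children) representation is an assumption for the example:
Code:
from collections import Counter

# Count nodes, nesting depth and repeated process names in a process tree.
def tree_features(node, depth=0):
    name, children = node
    nodes, max_depth = 1, depth
    names = Counter([name])
    for child in children:
        n, d, c = tree_features(child, depth + 1)
        nodes += n
        max_depth = max(max_depth, d)
        names += c
    return nodes, max_depth, names

tree = ("wscript.exe", [("powershell.exe", [("cmd.exe", []), ("cmd.exe", [])])])
nodes, depth, names = tree_features(tree)
print(nodes, depth, names.most_common(1))   # 4 2 [('cmd.exe', 2)]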

Examples where signature behavioral analysis is important:
  • Use of a specific component, via a single call, for a malicious action. The system uses hundreds of objects in different variations and to different degrees; detecting the use of one against the background of a million others is unlikely, since the granularity of the anomaly is too low.
  • Proactive detection based on a threat model. Suppose it has been decided that some action on some object in the system is unacceptable even once. The model may not recognize the first time that this is a significant event, leaving a chance of an error or an uncertain decision when classifying something similar.
  • Obfuscation of the sequence of actions. For example, it may be known that 3-4 actions need to be performed in a certain order, and it does not matter what happens between them. Throwing garbage actions in between the key ones will confuse the model, and the decision will be made incorrectly. At the same time, the dimensionality of the feature space does not allow such obfuscations to be countered by storing all combinations of call sequences rather than just their total counts (see the sketch below).
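
A small illustration of the last point: a single garbage call between two key calls removes the bigrams the model relied on (the call names are invented):
Code:
def bigrams(sequence):
    return {"{}*{}".format(a, b) for a, b in zip(sequence, sequence[1:])}

key_chain = ["NtOpenProcess", "NtWriteVirtualMemory", "NtCreateThreadEx"]
obfuscated = ["NtOpenProcess", "NtQueryKey", "NtWriteVirtualMemory",
              "NtDelayExecution", "NtCreateThreadEx"]

print(bigrams(key_chain) & bigrams(obfuscated))   # set(): no shared bigrams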

Source
 