Applying Data Science to Malware — Part 3

Now we will build a machine learning detector. To do that, we need to extract a substantial number of features from our software binaries, and not just from malware, because the point of the detector is to determine whether a software binary is malicious or benign.

For now I’m only using the strings feature; I plan to add more features in the future.

Strings feature

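import re
import numpy

def get_string_features(path, hasher):
    # Match runs of printable ASCII characters (space through tilde), 5 or more long
    chars = r" -~"
    min_length = 5
    string_regexp = "[%s]{%d,}" % (chars, min_length)
    # Read the binary as bytes; latin-1 maps every byte to one character
    with open(path, "rb") as file_object:
        data = file_object.read().decode("latin-1")
    pattern = re.compile(string_regexp)
    strings = pattern.findall(data)
    # Mark each distinct string as present (value 1)
    string_features = {}
    for string in strings:
        string_features[string] = 1
    # The hasher takes a list of dictionaries and returns a sparse matrix
    hashed_features = hasher.transform([string_features])
    hashed_features = hashed_features.todense()
    hashed_features = numpy.asarray(hashed_features)
    hashed_features = hashed_features[0]
    print("Extracted {0} strings from {1}".format(len(string_features), path))
    return hashed_features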

We start off by defining our function, which has 2 parameters: the path and a hasher. A hasher is a feature of the sklearn library.

Sklearn

Sklearn is short for Scikit-learn, which is a highly popular open-source machine learning package. You can learn more about it here: https://scikit-learn.org/stable/getting_started.html

The hashing library allows us to compress an enormous number of features down to a smaller chunk, so that your hardware can handle the amount of data being processed. Working with 4,000 compressed features instead of 1 million makes a big difference.
We then want to extract all the strings from the file passed in, but only strings that are at least 5 characters long.
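
To make the extraction rule concrete, here is a minimal sketch of the same regular expression applied to a made-up byte string. The sample bytes and the helper name `extract_strings` are assumptions for illustration, not part of the detector itself.

```python
import re

# Printable ASCII runs (space through tilde) of at least 5 characters,
# the same rule used in get_string_features.
MIN_LENGTH = 5
STRING_PATTERN = re.compile(r"[ -~]{%d,}" % MIN_LENGTH)


def extract_strings(data):
    """Return every printable run of MIN_LENGTH or more characters (hypothetical helper)."""
    # latin-1 maps each byte to one character, so decoding never fails
    return STRING_PATTERN.findall(data.decode("latin-1"))


# Made-up bytes standing in for the contents of a binary file
sample = b"\x00\x01kernel32.dll\x00abc\x00\xffLoadLibraryA\x00\x7f"
found = extract_strings(sample)
print(found)  # ['kernel32.dll', 'LoadLibraryA']
```

The run "abc" is dropped because it is shorter than 5 characters, and the non-printable bytes act as separators between strings.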
We then loop over all the strings extracted by the rule above and store each one in our “string_features” dictionary with a value of “1” to say that the feature is present in the software binary. The sklearn hasher requires a list of dictionaries, which is why the dictionary is wrapped in a list:

hashed_features = hasher.transform([string_features])
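
As a rough sketch of what the hasher does, the example below assumes it is scikit-learn’s `FeatureHasher`; this section does not show how the hasher is created, so both that class and the size of 4000 columns are assumptions chosen to echo the numbers above.

```python
from sklearn.feature_extraction import FeatureHasher

# Assumption: the hasher is a FeatureHasher; 4000 output columns is an
# illustrative size, not necessarily the one used for the detector.
hasher = FeatureHasher(n_features=4000)

# One dictionary per binary: each extracted string maps to 1 ("present")
string_features = {"kernel32.dll": 1, "LoadLibraryA": 1, "GetProcAddress": 1}

# transform() takes a list of dictionaries and returns a sparse matrix
hashed = hasher.transform([string_features])
vector = hashed.toarray()[0]

print(hashed.shape)  # one row per dictionary, 4000 columns
print(len(vector))
```

However many distinct strings a binary contains, the resulting vector always has the same fixed length, which is what keeps memory use under control.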

Train detector

Now that we have built the strings feature, we can now build our function to extract the data from the software binaries we pass and train our detector.

import os
import pickle
from sklearn import ensemble, tree

def train_detector(benign_path, malicious_path, hasher):
    def get_training_paths(directory):
        # Build the full path of every file in the directory
        targets = []
        for path in os.listdir(directory):
            targets.append(os.path.join(directory, path))
        return targets

    malicious_paths = get_training_paths(malicious_path)
    benign_paths = get_training_paths(benign_path)
    # X holds one feature vector per binary; y holds the matching labels
    # (1 = malicious, 0 = benign)
    X = [get_string_features(path, hasher) for path in malicious_paths + benign_paths]
    y = [1 for i in range(len(malicious_paths))] + [0 for i in range(len(benign_paths))]
    # classifier = tree.DecisionTreeClassifier()
    classifier = ensemble.RandomForestClassifier(64)
    classifier.fit(X, y)
    # Pickle data is binary, so open the file in binary write mode
    with open("saved_detector.pkl", "wb") as saved_detector:
        pickle.dump((classifier, hasher), saved_detector)

Our function here takes 3 parameters: the first is the directory of “non-malicious” binaries, the second is the directory of malware, and the third is our hasher (which is defined later).

Like before, we need to build the full file path for each file within the directory we supply; these will be our targets. This time it is a sub-function (helper function) of the “train_detector” function.
Straight after, you can see we use it to get the file paths for both the “benign_path” and “malicious_path”.
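
Here is a small standalone sketch of that helper run against a throwaway directory. The file names are made up, and the temporary directory only stands in for a real folder of samples.

```python
import os
import tempfile


def get_training_paths(directory):
    """Return the full path of every entry in the directory."""
    return [os.path.join(directory, name) for name in os.listdir(directory)]


# Create a throwaway directory with two placeholder files (made-up names)
sample_dir = tempfile.mkdtemp()
for name in ("sample_a.bin", "sample_b.bin"):
    with open(os.path.join(sample_dir, name), "wb") as handle:
        handle.write(b"placeholder")

# os.listdir returns entries in arbitrary order, so sort for display
paths = sorted(get_training_paths(sample_dir))
print(paths)
```

Note that `os.listdir` also returns subdirectories, so the sample folders should contain only files.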

Now we can extract the features for each supplied path to build our feature vectors (X), and create the matching label vector (y).

Vectors

A vector in machine learning (ML) is an array of numbers where each index corresponds to a single feature. An example from above is an extracted string set to “1” to say it exists; in another scenario we could set it to “0” to say it does not exist.

In our example we have two vectors, X and y. X holds the feature vectors (the features returned from “get_string_features”, one per binary), and y is the label vector, which labels each binary in X as either malware (1) or benign (0).
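
A simplified sketch of how X and y line up, using made-up vectors of four features each (the real hashed vectors are far longer):

```python
# Made-up feature vectors: each position stands for one string feature,
# 1 if the binary contains it and 0 if not.
malicious_vectors = [[1, 0, 1, 1], [1, 1, 0, 1]]
benign_vectors = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 1, 0]]

# X stacks the feature vectors, malicious first and benign second
X = malicious_vectors + benign_vectors

# y labels each row of X in the same order: 1 = malicious, 0 = benign
y = [1] * len(malicious_vectors) + [0] * len(benign_vectors)

for features, label in zip(X, y):
    print(features, "->", "malicious" if label == 1 else "benign")
```

The order matters: row i of X must describe the same binary that entry i of y labels.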

Next, we build our decision tree. I won’t delve into how decision trees work, but know that they can be used for detection. A decision tree asks a series of questions. For example: does the binary contain 50%+ strings that match our known malware strings? If yes, follow this path of questions; otherwise, follow that path. That is how I look at it for now, to keep it simple.
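
To see the “series of questions” idea in practice, the sketch below trains a scikit-learn decision tree on made-up data and prints the questions it learned. The two features are hypothetical and far simpler than real string features.

```python
from sklearn import tree

# Made-up data: column 0 = "contains suspicious string A",
# column 1 = "contains common string B" (hypothetical features)
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]  # 1 = malicious, 0 = benign

classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(X, y)

# export_text renders the learned questions as nested rules
rules = tree.export_text(classifier)
print(rules)
print(classifier.predict([[1, 0]]))
```

Because only column 0 separates the two classes here, the tree needs a single question about that feature to decide.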

#classifier = tree.DecisionTreeClassifier()
classifier = ensemble.RandomForestClassifier(64)

Once we have decided which classifier we want to use (a random forest is a collection of many decision trees), we pass X and y into it, and that trains it.

classifier.fit(X,y)
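
A minimal sketch of what fitting the forest produces, using made-up data. The `random_state` argument is an addition for repeatable results and is not part of the detector code.

```python
from sklearn import ensemble

# Made-up feature vectors and labels (1 = malicious, 0 = benign)
X = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 0]]
y = [1, 1, 0, 0]

# 64 is the number of decision trees in the forest
classifier = ensemble.RandomForestClassifier(n_estimators=64, random_state=0)
classifier.fit(X, y)

# After fit(), the forest holds one trained decision tree per estimator
print(len(classifier.estimators_))
print(type(classifier.estimators_[0]).__name__)
```

Each tree is trained on a random resampling of the data, and the forest combines their votes when predicting.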

And we’ll save our detector and hasher using the Python pickle module.

pickle.dump((classifier, hasher), open("saved_detector.pkl", "wb"))
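
The save and load round trip can be sketched with the standard library alone. The two dictionaries below are stand-ins for the real classifier and hasher, and the temporary path is an assumption to keep the example from touching your working directory.

```python
import os
import pickle
import tempfile

# Stand-ins for the real (classifier, hasher) pair, so the sketch has
# no dependencies; any picklable objects behave the same way.
detector = ({"kind": "classifier"}, {"kind": "hasher"})

path = os.path.join(tempfile.mkdtemp(), "saved_detector.pkl")

# Pickle data is bytes, so write it in binary mode ("wb") ...
with open(path, "wb") as handle:
    pickle.dump(detector, handle)

# ... and read it back in binary mode ("rb")
with open(path, "rb") as handle:
    classifier, hasher = pickle.load(handle)

print(classifier, hasher)
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.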

Scan file

Now we’ll write a function that takes a binary and checks it against our trained detector to see whether it is malicious or benign.

import os
import pickle
import sys

def scan_file(path):
    if not os.path.exists("saved_detector.pkl"):
        print("Train a detector before scanning files.")
        sys.exit(1)
    # Load the trained classifier and the hasher it was trained with
    with open("saved_detector.pkl", "rb") as saved_detector:
        classifier, hasher = pickle.load(saved_detector)
    features = get_string_features(path, hasher)
    # Column 1 of predict_proba is the probability of class 1 (malicious)
    result_proba = classifier.predict_proba([features])[:, 1]
    if result_proba > 0.5:
        print("It appears this file is malicious!", result_proba)
    else:
        print("It appears this file is benign.", result_proba)

Most of it is self-explanatory, but the main bit I want to cover is:

classifier, hasher = pickle.load(saved_detector)
features = get_string_features(path, hasher)
result_proba = classifier.predict_proba([features])[:, 1]

We load our classifier and hasher from the trained detector using the pickle.load method.
We run “get_string_features” against the new binary with our original hasher and store the result in the local features variable.
We pass that to our classifier (the random forest) to predict the probability of the binary being malware.
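
The shape of what `predict_proba` returns is easy to get wrong, so here is a small sketch on made-up data. The training vectors and `random_state` are assumptions for illustration only.

```python
from sklearn import ensemble

# Made-up training data: 1 = malicious, 0 = benign
X = [[1, 1, 0]] * 4 + [[0, 0, 1]] * 4
y = [1] * 4 + [0] * 4

classifier = ensemble.RandomForestClassifier(n_estimators=64, random_state=0)
classifier.fit(X, y)

# predict_proba expects a list of feature vectors and returns one row per
# input, with one column per class in classifier.classes_
features = [1, 1, 0]
probabilities = classifier.predict_proba([features])
malicious_proba = probabilities[0, 1]

print(classifier.classes_)
print(probabilities.shape)
print("malicious" if malicious_proba > 0.5 else "benign", malicious_proba)
```

Wrapping the single feature vector in a list is what makes the input two-dimensional; column 1 is then the probability of the class labelled 1.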

Put to the test

Now let’s test the detector out. I passed in the same malware samples from my previous posts and some random benign binaries that were on my machine.

That trains our detector to tell the difference between the two supplied paths. Now let’s pass in a malware sample to see the result:

………….Hmm………

