
Applying Data Science to Malware - Part 1

 With Malware exploding in numbers, I decided to learn and apply Data Science to Malware.

So first I needed a number of Malware samples, which I obtained from https://github.com/fabrimagic72/malware-samples

These techniques can work on any set of Malware. If you’re a business or organization that is being targeted, or you’ve been following a certain group of Malware authors, you may want to see how their Malware is connected: whether the samples share resources, hosts, code, and so on. That would yield some interesting data and start to paint a picture.

Unfortunately, I don’t have access to those sets of Malware, but that doesn’t mean we can’t apply the techniques to Malware collected from honeypots.

Ransomware samples

From the Malware samples, the Ransomware folder looks to have a number of samples we could apply the techniques on.

Step one: unzip all the Malware within that dir:

find . -name "*.zip" | while read -r filename; do 7z x "$filename" -pinfected -aou; done

Step two: start building the script

I won’t post the whole script here; I’ll add a link at the bottom once I put it up on GitHub.

So let’s take a look at the interesting stuff:

# excerpt: assumes os, pprint and pefile are imported, and that args, network
# and find_hostname are defined earlier in the script
# NOTE: all malware samples were extracted into one folder beforehand
for root, dirs, files in os.walk(args.target_path):
    for path in files:
        fullpath = os.path.join(root, path)
        # try opening the file with pefile to see if it's really a PE file
        try:
            pe = pefile.PE(fullpath)
        except pefile.PEFormatError:
            continue
        # extract printable strings from the target sample
        strings = os.popen("strings '{0}'".format(fullpath)).read()
        # use find_hostname (not shown here) to find hostnames in those strings
        hostnames = find_hostname(strings)
        if len(hostnames):
            # add the nodes and edges for the bipartite network
            network.add_node(path, label=path[:32], color='black', penwidth=5, bipartite=0)
        for hostname in hostnames:
            network.add_node(hostname, label=hostname, color='blue', penwidth=10, bipartite=1)
            network.add_edge(hostname, path, penwidth=2)
        if hostnames:
            print("extracted hostnames from: {0}".format(path))
            pprint.pprint(hostnames)

This walks through each file in the given directory and checks whether it has a PE header. If it does, we run the “strings” program on it and pass the resulting strings to a function called “find_hostname”. I haven’t posted that function here, but it uses a regex to clean up the input and then checks each candidate against a list of domain suffixes; a string that ends in one of those suffixes is accepted as a domain.
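To make those two steps more concrete, here is a minimal sketch of what they could look like in pure Python. It is not the script's actual code: `extract_strings`, `find_hostname_sketch`, the tiny `DOMAIN_SUFFIXES` set and the sample bytes are all made up for illustration, and the real `find_hostname` may work quite differently.

```python
import re

# Hypothetical, deliberately tiny suffix list; a real run would need a much
# fuller list of domain suffixes.
DOMAIN_SUFFIXES = {"com", "net", "org", "onion", "ru"}

# Runs of 4 or more printable ASCII bytes, roughly what `strings` reports by default.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Candidate hostnames: dot-separated labels ending in an alphabetic suffix.
HOSTNAME_CANDIDATE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b"
)


def extract_strings(data):
    """Return the printable ASCII strings found in a bytes object."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def find_hostname_sketch(text, suffixes=DOMAIN_SUFFIXES):
    """Return sorted, unique hostname-like tokens whose last label is a known suffix."""
    found = set()
    for candidate in HOSTNAME_CANDIDATE.findall(text.lower()):
        if candidate.rsplit(".", 1)[1] in suffixes:
            found.add(candidate)
    return sorted(found)


# Made-up bytes standing in for a sample's contents.
sample = b"\x00\x01MZ\x90\x00connect to example-c2.onion now\x00\xffkernel32.dll\x00ab\x00"
strings = "\n".join(extract_strings(sample))
hostnames = find_hostname_sketch(strings)
```

The suffix check is what keeps things like "kernel32.dll" out of the results: it looks like a hostname to the regex, but "dll" is not in the suffix list.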

Then we create our network.

If we find at least one hostname, we create a node for that malware sample.

network.add_node(path, label=path[:32], color='black', penwidth=5, bipartite=0)

Now we’ll start to create nodes and edges for each hostname we find that is connected to that malware.

network.add_node(hostname, label=hostname, color='blue', penwidth=10, bipartite=1)
network.add_edge(hostname, path, penwidth=2)

And then we print the hostnames to the screen.
The results are, to me anyway, interesting. We can see the following hostnames:

extracted hostnames from smb-b4tq2hti.bin
['mnses7xf743znk7.onion',
 'r5x6sdidz4q7f6q.onion',
 'sw7xmbs2ivmt5og.onion',

Note: I have removed some characters from the hostnames, safety first :)

Now let’s save everything to a “.dot” file so that we can convert the network into a visual graph.

# write the dot file to disk
write_dot(network, args.output_file)
malware = set(n for n, d in network.nodes(data=True) if d['bipartite'] == 0)
hostname = set(network) - malware
# use networkX's bipartite network projection function to produce the malware and hostname projections
malware_network = bipartite.projected_graph(network, malware)
hostname_network = bipartite.projected_graph(network, hostname)
# write the projected networks to disk as specified by the user
write_dot(malware_network, args.malware_projection)
write_dot(hostname_network, args.hostname_projection)
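If the idea of a projection is new, the following plain-Python sketch shows roughly what it computes: two nodes on one side of the bipartite network get linked whenever they share a neighbour on the other side. This is an illustration only, not networkX's implementation; the `project` helper and the sample and hostname names are made up, and unlike `projected_graph` it returns only the links, not the unconnected nodes.

```python
from collections import defaultdict
from itertools import combinations

# Made-up (malware, hostname) edges, purely for illustration.
edges = [
    ("sample_a.bin", "host1.example"),
    ("sample_b.bin", "host1.example"),
    ("sample_b.bin", "host2.example"),
    ("sample_c.bin", "host3.example"),
]


def project(edges, onto="malware"):
    """Link two nodes of one side whenever they share a neighbour on the other side."""
    neighbours = defaultdict(set)
    for malware, hostname in edges:
        if onto == "malware":
            # group malware samples by the hostname they reference
            neighbours[hostname].add(malware)
        else:
            # group hostnames by the malware sample that references them
            neighbours[malware].add(hostname)
    projected = set()
    for group in neighbours.values():
        for a, b in combinations(sorted(group), 2):
            projected.add((a, b))
    return sorted(projected)


malware_links = project(edges, onto="malware")
hostname_links = project(edges, onto="hostname")
```

Here `sample_a.bin` and `sample_b.bin` end up linked because both reference `host1.example`, while `sample_c.bin` shares nothing and stays unlinked.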

So we will have 3 files: the first is the whole network, the second shows the connections between the Malware samples, and the third shows the connections between the hostnames.

python ransomewareMalwareNetwork.py /home/osboxes/myAnalysis/malware-samples/Ransomware/extracted ./orginal.dot ./malwareProjection.dot ./hostname_projection.dot

We then use fdp (one of many choices but it is suited for a small network) to convert the “.dot” files into images:

fdp orginal.dot -T png -o orignal_ransomeware_image.png
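The same command applies to the two projection files. As a rough sketch, a small Python helper could render all three in one go; the `fdp_command` and `render_all` names are hypothetical, and the output naming (same base name with a ".png" extension) is an assumption rather than what was used for the images in this post.

```python
import os
import subprocess


def fdp_command(dot_path, png_path):
    """Build the fdp argument list for rendering one .dot file to PNG."""
    return ["fdp", dot_path, "-T", "png", "-o", png_path]


def render_all(dot_files, dry_run=False):
    """Render each .dot file to a .png with the same base name; return the commands."""
    commands = []
    for dot_path in dot_files:
        png_path = os.path.splitext(dot_path)[0] + ".png"
        command = fdp_command(dot_path, png_path)
        if not dry_run:
            # requires Graphviz (fdp) to be installed and on the PATH
            subprocess.run(command, check=True)
        commands.append(command)
    return commands


# dry_run=True only builds the commands, so nothing is executed here.
commands = render_all(
    ["orginal.dot", "malwareProjection.dot", "hostname_projection.dot"],
    dry_run=True,
)
```

Calling `render_all` without `dry_run=True` actually invokes fdp for each file.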

And now let’s view the results:

Original network, showing the connection between the Malware and hostnames:

A network of Ransomware and hostnames

The blue circles represent hostnames and the black circles represent the Malware. Granted, it’s quite hard to actually see the connections via Medium, but this is what I was hoping to see.
The cluster on the left is from the “Wannacry” folder, and we would expect to see those files and hostnames linked.

Next, let’s view just the Malware connections:

Malware projection

And the hostname projection:

Hostname projection

Visually, the hostname projection doesn’t tell us much, so it will take me some time to adjust that network to make it more readable and useful.

I won’t do a review of my findings, as this was just an exercise in applying what I learned to some real-world Malware. I’m quite happy with the results and itching to see how else we can feed the data we learn from Malware analysis into these methods.

I also tried to build a graph based on shared images for the ransomware samples, which is done by extracting the images embedded in the malware. The results were far less “exciting”, but that could be because the malware doesn’t use images or because it is obfuscated.
Either way, we can use the same methods on different samples to see what they yield.

Everything I learned and applied in the above is from the book “Malware Data Science”. I highly recommend it.

https://github.com/cchaq/MalwareDataScience
