May 29, 2022

Malware GAN, ML & Microsoft ATP: Detailed Introduction

Atlan Team

Introduction:

When conducting Red Team operations, especially in EDR (and increasingly XDR) environments, it is important to step back and evaluate how you approach implant design.

Hypotheses are just as important as brute-force technical skill or human cunning alone. As a partner at KPMG said when leading an internal Red Team call: mindset is critical!

I have read about, implemented and tested many approaches: some extremely technical, some procedural, some cunningly human, and all extremely effective when combined. This insight seeks to highlight how, with a balanced set of skills, you can design implants that address the challenges Endpoint Detection & Response solutions present to consultants.

Taking an adversarial mindset means looking at the problem set from above and considering what it is we are actually attacking: an EDR is a Machine Learning model. More precisely, it is basic logical assumptions, then more challenging statistical mathematics, and finally vast data warehousing and engineering.

Where, for example, some researchers delve extremely deeply into the world of image load events for dynamic evasion, others focus on LLVM obfuscation for static evasion, or on a wide range of techniques for unhooking AMSI, Event Tracing and the like, I personally believe that the result achieved per unit of effort invested in these endeavours is not always worth the labour.

This introduction to how we are approaching building our GAN will focus on Microsoft Advanced Threat Protection (ATP), now Microsoft Defender for Endpoint (I will use the terms interchangeably despite the name change), under an E5 licence, for two reasons:

  1. Anyone can get access to ATP E5 for as little as £40–60 per month
  2. Microsoft, with its entire E5 technology stack, has created a hugely expensive, extensive, and in my view one of the most highly trained, tested and constantly evolving sets of ML models around, especially given the sheer volume of labelled data it receives as endpoint telemetry, from both userland and kernel-space events, across the entire Windows empire. Testing should therefore be done against the best-in-class EDR, or what will, in my view, undoubtedly win the race.

Having delved into ML models (K-means clustering and classification) when co-developing SharpML, and having personally done some logical auditing of a Coronavirus ML modelling algorithm with our former, and now current again, Saudi partners (Fiduciam Global Consulting) at the start of Covid in 2020, the lessons I learnt along the way have informed my implant design ever since.

Technical Approach

The technical approach we are taking in developing our GAN will be broken down into a few sections. Each is discussed in limited detail, using varying approaches to solve each problem set, so that we are armed with the knowledge to create our machine learning model and the associated labs and tools.

Part 0. Primer on ML clustering & classification

While the two primary areas of ML I focused on in 2020 were K-means and hierarchical clustering, I am by no means a complete expert on Machine Learning.

While my old research manager Marco was profoundly excellent in this domain, naturally I was more focused on an adversarial mindset, wanting to understand how to break ML models.

(NOTE: I had an interesting chat on LinkedIn with someone from NCC Group about hacking and physics, and he indicated that areas such as black-box ML testing are niche. Nothing could be further from the truth: phishing, malware development, evading network forensics and more are, on a technical level, pretty much black-box ML testing these days, so ML hacking is very much mainstream whether you realise it or not.)

What I will do first is briefly explain what hierarchical clustering and K-means clustering are. It doesn’t matter if you don’t see the relevance to malware development straight away, but it was this mindset and a wide knowledge base that allowed me to develop evasive malware when previously working at KPMG, to meet the demands of the firm’s clients and my manager at the time.

So let’s start with my own explanation of each:

K-means (Unsupervised):

K-means clustering is used in a wide range of fields, including genetics and cancer research (flow cytometry), and is extremely effective at finding correlations between datapoints.

What do I mean? Well, let’s keep it relevant to hacking and do away with apples & oranges, or cells and chromosomes.

Let’s say for example that you are looking at millions of events happening inside the Windows operating system. These will consist of processes & threads being created, file operations, network connections being made and much more. 

Example of events plotted on a diagram:

Graph of events on a Windows System

In the image above you can see things like file operations, network connections being made, and processes & threads being created. Clearly EDRs take labelled data as input, and a multi-million-pound ML model doesn’t produce output as simple as the above, but that is essentially what is happening conceptually: data is fed in, run through the model, and correlations are made.

So where does K-means come in? As consultants we don’t actually need to understand the maths behind it (in fact neither do most ML engineers, who mostly use libraries and make logical assumptions), so there is little point in explaining it beyond presenting it visually.

If we take those random events above, the ML model will try to understand which datapoints are correlated, even events that don't share the same process. So let’s draw a circle around a few events taking place. Again, I would emphasise that this is very watered down; I am merely trying to convey the concepts for the later discussion of malware development and the subsequent posts on how we built our GAN.

Correlated Events

The shape itself is calculated using various mathematical formulas, but also depends on the assumptions you are making. Let’s say that circle of events represents a new binary being run (MS Paint.exe), with the bottom axis representing time: processes and threads are created, file operations happen, network connections are made, and so on.

A computer doesn’t think like us. We see MS Paint.exe open; the ML model sees that within a process various WinAPI calls are being made, network connections opened, and file operations performed when you save the image. It doesn’t care or know what MS Paint.exe is, and for the malware analysis it is built around, the model is looking for correlated events even across remote processes, threads and files.

It is important to note that K-means clustering is computationally demanding and requires more resources than hierarchical clustering, yet I believe it is more effective at making these correlations where time is involved. This is why I believe this kind of approach would be favoured when looking at dynamic events, and why I suspect it sometimes takes a while for an EDR or ATP to flag certain events as suspicious after the fact. Again, I am not an expert, but attacking black-box models requires consultants to make some kind of assumptions, and we’ve certainly made a few in the preparation phase.

That’s a brief introduction to K-means clustering and how we conceptualise one aspect of the machine learning model we are up against. If you want to read the real deal about K-means, be my guest; you can learn all about centroids, clusters and more to your heart’s content.
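To make the concept concrete, here is a minimal sketch in Python using scikit-learn. The feature encoding and the events themselves are invented purely for illustration; a real EDR pipeline is vastly richer, but the grouping behaviour is the point.

```python
# Minimal illustrative sketch only: the feature encoding and events are invented,
# and a real EDR model is far more complex than a single K-means pass.
import numpy as np
from sklearn.cluster import KMeans

# Each row is a hypothetical endpoint event encoded as numbers:
# [seconds since process start, API call id, bytes written, network connection?]
events = np.array([
    [0.1, 12,    0, 0],   # thread created shortly after process start
    [0.2, 33, 4096, 0],   # file write in the same burst
    [0.4, 71,    0, 1],   # outbound network connection
    [30.0, 5,  512, 0],   # unrelated activity much later
    [31.2, 5,  256, 0],
])

# Ask K-means for two clusters; events close together in feature space
# (including the time axis) end up grouped together.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(events)
print(labels)                  # the early burst falls into one cluster, the later events into another
print(model.cluster_centers_)  # the centroids mentioned above
```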

Hierarchical Clustering (Unsupervised):

Next up is hierarchical clustering. Let’s take for example this image:

This is what is called a dendrogram. The axis on the left is similarity, and each of the little rectangles is a cluster of items most similar to one another.

While I may be wrong, I personally conceptualise that the static detection element of an EDR model would likely use hierarchical clustering to build clusters of properties: to understand what constitutes the properties of a single binary and how those properties might match the properties of malware. What do I mean?

Well, if a typical binary has a certain size, a certain type of code base, metadata, strings and other properties, then you can begin to cluster what a normal binary looks like. Except, and this is the beauty of it for malware development, there is no normal binary.

It is much easier to develop clusters of what malicious binaries look like:

  • Cobalt Strike shellcode being roughly 400KB in size
  • An implementation of an AES or XOR algorithm
  • Some kind of syscall functions (NtCreateThreadEx and the like)

When we surveyed GitHub and the blog posts that individuals (including myself!) were frequently borrowing from, there were some extremely similar properties between them all (transplanted into Nim, Rust, C++ and so on), and that is a piece of cake for this kind of ML model to correlate.
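As a toy illustration of that clustering, the sketch below uses SciPy to hierarchically cluster a handful of invented binaries by made-up static properties. The numbers are fabricated; only the grouping behaviour matters.

```python
# Toy sketch only: the binaries and their properties are invented to show how
# hierarchical clustering groups similar static profiles together.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical static features per binary:
# [file size in KB, entropy of largest section, count of crypto-related imports]
binaries = np.array([
    [350.0, 7.8, 4],    # packed/encrypted payload, roughly Cobalt Strike sized
    [410.0, 7.6, 5],    # another very similar payload
    [120.0, 5.1, 0],    # small benign-looking utility
    [2048.0, 6.0, 1],   # large ordinary application
])

# Ward linkage builds the tree a dendrogram would draw (see the figure above)
Z = linkage(binaries, method="ward")

# Cut the tree into two clusters; the two high-entropy payloads group together
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```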

ML Conclusion:

Now that we have an understanding of what type of machine learning might be implemented in engines like Advanced Threat Protection (this is blackbox testing after all), we can develop a hypothesis as to how we can circumvent and evade them!

What is relevant for me is that this hypothesis and assumption has paid dividends when developing malware on engagements. Manually testing the incremental progress of ATP and other EDRs like SentinelOne, Carbon Black and Cylance has allowed me to keep a mental database of which puts more weight on certain correlations of events versus the static properties of a binary.

This is also why ATP, with its ability to monitor the kernel (alongside all its telemetry to evolve the model), will dominate, and why it is interesting to watch the leading EDR organisations pivot into managed services and consulting. I don't believe that, in the long term, this is merely diversification; some of their core R&D folks seem to have moved on to other pastures. Especially with Microsoft's investment into OpenAI (or more like ClosedAI now), we can only expect ATP to develop at an even more rapid pace.

Part 1. Static Analysis

Now let’s first look at what an ML model might look at when using hierarchical clustering to evaluate the properties of a binary.

While there are many more, I am going to focus on three main areas: entropy, strings and file analysis. You can certainly do research elsewhere, but I found these paid the most dividends for me personally.

Let’s start.

Part 1.A Entropy Analysis

Real malware design takes time and effort, and is usually incredibly difficult to do under tight time limitations; you are wrestling with the entirety of an ML model that is constantly learning and evolving from hundreds of millions of telemetry events. You can, however, speed this process up by applying some concepts from maths and physics. So here goes: entropy.

Let’s start by discussing entropy and what it is.

Entropy is the foundation upon which all cryptographic functions operate. Entropy, in cyber security, is a measure of the randomness or diversity of a data-generating function. Data with full entropy is completely random and no meaningful patterns can be found.

What does this mean and why is it relevant? Well, when you encrypt some shellcode, its entropy is going to increase.

XOR encoding increases the entropy somewhat, but is easy for an EDR to brute-force, especially as you will need to have the key inside the binary to decode, even if you do environmental keying. AES encryption increases the entropy further, and so on as the randomness grows.
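A quick way to see that progression is to measure Shannon entropy directly. In this hedged sketch the "shellcode" is just a placeholder string, the XOR uses a short repeating key, and os.urandom() stands in for AES ciphertext, since well-encrypted data is statistically close to random:

```python
# Illustration only: placeholder "shellcode", a repeating-key XOR, and
# os.urandom() standing in for AES output.
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

payload = b"This is stand-in shellcode, not the real thing. " * 400   # low-entropy plaintext
key = b"\x13\x37\xc0\xde"
xored = bytes(b ^ key[i % len(key)] for i, b in enumerate(payload))   # entropy rises somewhat
random_bytes = os.urandom(len(payload))                               # ~8.0 bits/byte, like ciphertext

print(f"plaintext : {shannon_entropy(payload):.2f} bits/byte")
print(f"xor'd     : {shannon_entropy(xored):.2f} bits/byte")
print(f"'aes'     : {shannon_entropy(random_bytes):.2f} bits/byte")
```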

As the ML model will use entropy as a data point, then binaries with high levels of entropy are likely to be clustered with other binaries that have high entropy — encryption is used when you want to hide something!

Once you then take a binary that is not code signed, is roughly the same size, and whose encrypted/encoded section is around 400KB (Cobalt Strike shellcode), it is not very hard even for the untrained eye to cluster these similarities together, let alone a frighteningly fast ML model with vast computational resources.

Part 1.B Entropy Evasion

Let’s discuss how we can measure and evaluate how much entropy certain sections of your binary have. I am not going to go into the full details, as this has already been done here, where you can find a Python script on @oxPat’s blog.

You can run that Python script against your binary and evaluate the level of entropy in the sections of your PE file.
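If you just want the core idea rather than the full script, a minimal sketch using the pefile library (not the script referenced above) looks something like this:

```python
# Minimal sketch, not the referenced script: print the Shannon entropy of each
# section of a PE file using pefile's built-in per-section entropy helper.
import sys
import pefile   # pip install pefile

pe = pefile.PE(sys.argv[1])
for section in pe.sections:
    name = section.Name.rstrip(b"\x00").decode(errors="replace")
    print(f"{name:10s} size={section.SizeOfRawData:8d} entropy={section.get_entropy():.2f}")
```

A section sitting near 7.5 to 8.0 bits per byte is exactly the sort of datapoint that stands out.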

Ideally you want a situation whereby the ML model has full insight into what it is seeing, there is no obvious key to brute-force inside the binary, and entropy is low.

Lots of teams have moved to steganography for this very reason. ATP (or the ML model) will note an image and (I’m assuming) may not even be concerned about the entropy of an image file, especially not a PNG/bitmap image (PNG uses lossless compression and bitmap is uncompressed) [read more here about image entropy]. Always think back to the image I showed around hierarchical clustering and the associations being made, then imagine all the binaries with favicons, logos, images and more floating around on Windows machines globally, and ask whether it would make sense, from a computational perspective (think £), to be considering the entropy of images.

While I am sure the next step for the team at Microsoft is to begin analysing images for shellcode, which is already being done manually (see stego-toolkit), a later insight that I will be writing about, Next-Generation Environmental Keying, will address this problem and present an implementation in code. That particular idea was not mine, but came from a senior manager at KPMG.

Beyond the brief sketches above, I have not shown full code here, because I want to present other aspects related to the computational requirements of each approach, and why particular approaches make more sense in particular instances, given that your binary is going to be assessed (I hypothesise) by both a time-based K-means model and a static-analysis hierarchical model along the execution lifetime of your malware.

Part 1.C String Analysis

Next up is string analysis. If you again keep in mind the correlative model of what a normal and a malicious binary might look like, you can begin to see why so many payloads fail. It is no use merely fine-tuning your injection methodologies, or ensuring that you are encrypting and decrypting shellcode in memory, or whatever low-level abuse you are focusing on, if the model can correlate your activity to an initial vector, be it a DLL or similar, that is clustered in with other malware.
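To get a feel for the raw material a string-analysis pass has to work with, a rough Python equivalent of the classic strings utility is enough. Nothing here is ATP's actual logic; it simply shows what can be extracted and clustered on:

```python
# Rough equivalent of the classic `strings` utility: pull out runs of printable
# ASCII so you can see what a string-analysis pass has to work with.
import re
import sys

MIN_LEN = 6   # ignore very short runs of printable bytes

with open(sys.argv[1], "rb") as f:
    data = f.read()

for match in re.finditer(rb"[\x20-\x7e]{%d,}" % MIN_LEN, data):
    print(match.group().decode("ascii"))
```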

Hackers are not developers, but we are required to develop in multiple languages doing complex operations that even seasoned developers find challenging.

However, it is important to bear in mind that typical binaries are, essentially, all the code that isn’t malware. There are mathematical, graphical, game and a whole range of other functions that normal developers use. That is why junk code is not to be underestimated, and why restructuring shellcode in memory to blend in can be more effective than merely preventing the model from understanding what it is doing.

Part 1.D String Obfuscation

Many malware developers use some very smart approaches to hide what they are doing from EDR, including compile-time string obfuscation, control-flow flattening, and randomising syscall function names, whether using something like this, or AsStrongAsFuck, or any number of C# NuGet packages. I disagree with these approaches unless you are sure that reverse engineers are going to be looking at your implants: in the commercial space this is not a worry for us, but for nation states, where attribution is important, it is relevant. Then again, foreign-language code comments, (Russian?) timezone beaconing times and other methods may prove better than code-level measures at putting off analysts.

A very quick approach is to take an existing application (simply type “C# GUI app GitHub” into Google), download a solution, embed your malicious code into new classes inside the application, and divert the flow of the main function to your own functions, renaming, for example, NtCreateThreadEx to something similar to what already exists inside the application you are embedding into.

I will not go into this in more detail now, but what would stop you from running strings on the actual Microsoft Teams binary (for example, in the past, when powrprof.dll was used with Teams for proxying) and then using the same strings inside the actual DLL itself to name your functions? [NOTE: when I previously read about ManagedExports and attempting to export DllMain, I came across this post by Mandiant (Evan Pena, Ruben Boonen, Brett Hawkins) on abusing DLL misconfigurations, which seems to support my hypothesis around string/code analysis.] How many binaries in the wild have randomly generated function names? Not many: we want to fit in, not stick out. Furthermore, what stops you from generating a macaronic language based on the naming conventions of official code across the entire Windows OS?
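As a hedged sketch of that "macaronic" idea, the snippet below harvests identifier-looking strings from a legitimate binary you point it at and recombines their fragments into plausible-looking names. It illustrates the naming concept only; it is not a tool we ship, and the regexes are deliberately crude:

```python
# Hedged sketch of the "blend in" naming idea: harvest CamelCase identifiers from
# a legitimate binary and recombine their fragments into plausible new names,
# instead of shipping obviously randomised function names.
import random
import re
import sys

with open(sys.argv[1], "rb") as f:   # e.g. a DLL you intend to embed into
    data = f.read()

# Grab CamelCase-looking identifiers such as CreateCompatibleBitmap, GetDeviceCaps...
identifiers = set(re.findall(rb"[A-Z][a-z]+(?:[A-Z][a-z]+){1,4}", data))

# Split them into fragments and recombine the fragments into new names
fragments = sorted({frag for ident in identifiers for frag in re.findall(rb"[A-Z][a-z]+", ident)})
if len(fragments) < 3:
    raise SystemExit("not enough identifier fragments found in that file")

def blend_in_name() -> str:
    return b"".join(random.sample(fragments, 3)).decode("ascii")

print([blend_in_name() for _ in range(5)])   # plausible-looking, locally sourced names
```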

I know this part of my hypothesis works because, in an afternoon’s work at KPMG, I was able to develop a compile-time implant generator that bypassed Cylance and Carbon Black (it did require further debugging later with my teammate Mike to map syscalls to other OS versions [in 2021]). The free tier of Defender also failed to analyse it, but ATP E5 caught it because of some mistakes I made, and it took me a little longer to hypothesise and experiment as to what ATP was doing.

Let’s think back to what I believe the ML is doing, because the ML model is not just looking at your binary. ATP is at once (using cloud submission) looking at the entire Windows estate, and if your binary looks more like other binaries that are not malware, then when the hierarchical clustering (as I hypothesise) is performed, your binary is going to blend in rather than be correlated with malware.

Part 1.E File Analysis

File analysis looks at the various sections of a PE file, whether it is code signed, and its file size in order to evaluate the binary. I’m sure it considers other areas too, and in our GAN adventure we will report exactly what it looks at.
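For illustration only, and not a claim about what ATP actually parses, a few of those surface properties can be pulled straight out with the pefile library:

```python
# Sketch of the kind of surface properties a file-analysis pass could consider:
# overall size, section layout, and whether an Authenticode signature is present.
import os
import sys
import pefile   # pip install pefile

path = sys.argv[1]
pe = pefile.PE(path)

print("file size:", os.path.getsize(path), "bytes")
print("sections: ", [s.Name.rstrip(b"\x00").decode(errors="replace") for s in pe.sections])

# Data directory index 4 (IMAGE_DIRECTORY_ENTRY_SECURITY) holds the Authenticode
# signature, if the binary carries one.
sec_dir = pe.OPTIONAL_HEADER.DATA_DIRECTORY[4]
print("code signed:", sec_dir.VirtualAddress != 0 and sec_dir.Size != 0)
```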

I won't belabour this one too much. There are some great resources on the PE file format, on how to spoof code signing and similar, but unless you work for an intelligence agency, good luck making sure this part of your binary is right (this is addressed below).

It would be my bet, given how closely Microsoft works with the US government, that they (NSA et al.) can borrow code-signing certificates, so those of us in the commercial space will have to make do with less.

Throughout our GAN posts we will present in detail what we have managed to coax out of ATP about exactly what it looks for.

Part 1.F File Evasion

We could discuss the relative merits of LLVM obfuscation (good post here) if you are programming in C, C++ or even Nim, or debate managed versus unmanaged code in C#, but keeping hierarchical clustering in mind, I believe these techniques do not serve your best interests; going for managed code is, in my opinion, your best bet, if of course your requirements allow.

Allowing the ML models to inspect every element of your code is what is going to serve you better; at least, that is what we are doing in our GAN. Using a programming language that allows for embedding resources is even better, which is why I typically favour .NET.

With regard to code-signing certificates, you can visit this post, which presents a list of certificate authorities you can buy a valid code-signing certificate from. You will likely need to incorporate a company to do so, which dedicated consultants could manage.

Part 2. Dynamic Analysis

Next up is a brief overview of dynamic analysis. While this aspect is even more relevant to the initial vector, getting your implant in, and first-stage code execution, there are some extremely interesting cases where, without the need for raw technical skill, you can throw off the K-means algorithm by doing things slightly differently, and cases where hiding/encryption is merited.

Part 2.A Hooking & Kernel Callbacks

Anyone who has worked in malware development knows how EDRs without access to the kernel will typically hook into your processes and reroute functions to study them further. So much research has been done in this area that I don’t believe it will be relevant in a few years’ time, because ATP is monitoring the kernel (to which no other EDR has the same level of access), and MDSec released a fantastic blog post some time ago around Bypassing Image Load Kernel Callbacks. The race to the low-level dark corners of the OS will no doubt continue, but Microsoft will see everything eventually.

Part 2.B Event Tracing

Next, let’s discuss Event Tracing. Again, much research has been done in this area, starting with Countercept (here) and a blog post by MDSec on hiding your .NET, so I will not go to lengths to repeat what they have said.

Part 2.C Time Correlation

Next up is time correlation. Clearly the K-means model and cluster I showed above correlate datapoints, and I have made the assumption that the bottom axis is time.

So multiple suspicious events taking place in quick succession are, again, likely to be clustered together as malicious, irrespective of whether they are in the same process space, thread, etc.
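Conceptually, the counter is simply to stop handing the model a tight burst of related events. A toy sketch of jittered staging, with invented stage names and timings, looks like this:

```python
# Toy illustration only: spread the stages of an implant out with randomised
# delays so related events don't land in one tight time cluster.
import random
import time

def jittered_delay(base_seconds: float, jitter: float = 0.5) -> float:
    """Return base_seconds plus or minus up to jitter * base_seconds."""
    return base_seconds * (1 + random.uniform(-jitter, jitter))

stages = ["resolve_config", "unpack_resources", "establish_channel"]   # invented names
for stage in stages:
    delay = jittered_delay(base_seconds=300)    # minutes apart, not milliseconds
    print(f"{stage}: waiting {delay:.0f}s before the next step")
    time.sleep(delay)   # shrink or stub this out if you actually run the sketch
```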

Part 2.D Memory Analysis

Shellcode is merely code - obviously.

However, given that shellcode is exactly that, shellcode, the interpretation and correlations the ML model makes must now be made on shellcode in memory when the hierarchical clustering is done. This can be approached in a similar fashion to static analysis evasion; at least, that is our approach. Rather than encrypt, we are focusing on modifying the shellcode itself to blend in.

We are redeveloping MovFuscator to enable x86-64 .NET compilation in order to better understand the work we need to do, and will later focus on how we can apply some of these methodologies to adapting in-memory shellcode in a similar way to static analysis. Stay tuned.

Part 3. Bringing Us to Our Next Step: the GAN

In a series of blog posts over the next four months we will highlight how we built the lab in the image below (an earlier draft topology, for your perusal) to systematically reverse engineer the model behind ATP, represent the different features of malware mathematically, and subsequently build an automated malware Generative Adversarial Network to produce malware that evades Microsoft's ATP 100% of the time:

Remember: ATP needs to be 100% effective to stop us; we only need to get one past it, and we can continuously retrain.

Given that you can repeat this process, with ATP acting as the discriminator, we envisage a fairly comprehensive and practical way to defeat their models.
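To show only the shape of that loop, here is a heavily simplified Python caricature. The toy_discriminator is a stand-in for ATP's verdict, the "generator" is a random mutation over two invented static features, and none of this is our actual pipeline or a real GAN training step:

```python
# Heavily simplified caricature of the feedback loop. In the real lab the verdict
# comes from ATP itself, not from this toy function, and the generator is a model
# rather than a random mutation.
import random

def toy_discriminator(features: dict) -> bool:
    """Stand-in verdict: flag high entropy combined with a ~400KB payload."""
    return features["entropy"] > 7.0 and 300 <= features["payload_kb"] <= 500

def mutate(features: dict) -> dict:
    """'Generator' step: randomly perturb the candidate's static properties."""
    return {
        "entropy":    max(0.0, min(8.0, features["entropy"] + random.uniform(-0.5, 0.5))),
        "payload_kb": max(50, features["payload_kb"] + random.randint(-100, 100)),
    }

candidate = {"entropy": 7.8, "payload_kb": 400}   # looks like a typical packed payload
for generation in range(100):
    candidate = mutate(candidate)
    if not toy_discriminator(candidate):
        print(f"generation {generation}: candidate evades the toy model: {candidate}")
        break
```

In the real pipeline the verdicts recorded in the lab become the labels the generator is retrained against.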

Stay tuned for more. 
