Title: Improving the Understanding of Malware using Machine Learning

Date: Friday, November 17th 2023

Time: 11:30 AM -- 1:00 PM EST

Location: Coda C0903 Ansley

Zoom link: https://gatech.zoom.us/j/98812589751?pwd=ZkxPUGVHWmVNTi8raFc2UlJGY3kzZz09

 

Evan Downing

Ph.D. Candidate in Computer Science

School of Cybersecurity and Privacy

Georgia Institute of Technology

 

Committee:

Dr. Wenke Lee (advisor), School of Cybersecurity and Privacy, Georgia Institute of Technology

Dr. Mustaque Ahamad, School of Cybersecurity and Privacy, Georgia Institute of Technology

Dr. Brendan Saltaformaggio, School of Cybersecurity and Privacy, Georgia Institute of Technology

Dr. Fabian Monrose, School of Electrical and Computer Engineering, Georgia Institute of Technology

Dr. Frank Li, School of Cybersecurity and Privacy, Georgia Institute of Technology

 

Abstract:

Malicious software continues to threaten users who rely on computational devices. From destruction to the monetization of their victims’ information, malware authors seek to cause harm for their personal gain. Over the past few decades, automated solutions have been developed to catch malicious code and prevent it from infecting and spreading throughout cyberspace. These solutions often rely on statistical properties that distinguish malware from goodware. However, these solutions are also seen as black boxes, forcing malware analysts to trust the models’ verdicts without allowing them to provide feedback from their own domain knowledge and expertise.

 

To address these challenges, I propose using human-in-the-loop design with Machine Learning (ML), which combines the best of both worlds by allowing expert analysts to both learn new insights from the results of malware detection models and provide feedback to improve the results of those models. This leads to a partnership, rather than a competition, between humans and algorithms. I first introduce DeepReflect, a deep learning system that statically identifies malicious functionality within malware binaries, allowing analysts to label clusters of similar functionality in a semi-supervised approach. DeepReflect increases the Area Under the Curve (AUC) value by 6-10% compared to four state-of-the-art approaches on a dataset of 36k unique, unpacked malware binaries. This helps analysts understand what a malware sample is capable of doing before they execute it. Next, I introduce BCRAFTY, a system that automatically creates combinations of dynamic analysis behaviors to improve malware detection, increasing the True Positive Rate (TPR) by 7.5% while keeping the False Positive Rate (FPR) near 0.3% compared to using analyst-defined behaviors alone. The system allows analysts to learn new behaviors not previously considered, increasing their understanding of how to improve malware detection, and to give feedback by accepting or rejecting suggested behavior combinations for the model to use.