Title: Improving the Understanding of Malware using Machine Learning
Date: Friday, November 17th, 2023
Time: 11:30 AM -- 1:00 PM EST
Location: Coda C0903 Ansley
Zoom link: https://gatech.zoom.us/j/98812589751?pwd=ZkxPUGVHWmVNTi8raFc2UlJGY3kzZz09
Evan Downing
Ph.D. Candidate in Computer Science
School of Cybersecurity and Privacy
Georgia Institute of Technology
Committee:
Dr. Wenke Lee (advisor), School of Cybersecurity and Privacy, Georgia Institute of Technology
Dr. Mustaque Ahamad, School of Cybersecurity and Privacy, Georgia Institute of Technology
Dr. Brendan Saltaformaggio, School of Cybersecurity and Privacy, Georgia Institute of Technology
Dr. Fabian Monrose, School of Electrical and Computer Engineering, Georgia Institute of Technology
Dr. Frank Li, School of Cybersecurity and Privacy, Georgia Institute of Technology
Abstract:
Malicious software continues to threaten users who rely on computational devices.
From destruction to the monetization of their victims’ information, malware authors seek
to cause harm for their personal gain. Over the past few decades, automated solutions have
been developed to catch malicious code and prevent it from infecting and spreading throughout
cyberspace. These solutions often rely on statistical properties that distinguish
malware from goodware. However, these solutions are also seen as black boxes, forcing malware
analysts to trust the models’ verdicts without allowing them to provide feedback from
their own domain knowledge and expertise.
To address these challenges, I propose using a human-in-the-loop design with Machine
Learning (ML), which combines the best of both worlds by allowing expert analysts to both
learn new insights from the results of malware detection models and provide feedback to
improve the results of those models. This leads to a partnership, rather than a competition,
between humans and algorithms. I first introduce DeepReflect, a deep learning
system which statically identifies malicious functionality within malware binaries,
allowing analysts to label clusters of similar functionality in a semi-supervised approach.
DeepReflect increases the Area Under the Curve (AUC) value by 6-10% compared to
four state-of-the-art approaches on a dataset of 36k unique, unpacked malware binaries.
This helps analysts understand what a malware sample is capable of doing before they execute it.
Next, I introduce BCRAFTY, a system which automatically creates combinations of dynamic
analysis behaviors to improve malware detection, increasing the True Positive Rate (TPR)
by 7.5% while keeping the False Positive Rate (FPR) near 0.3% compared to using analyst-defined
behaviors alone. The system allows analysts to learn new behaviors not previously
considered, increasing their understanding of how to improve malware detection, and to give
feedback by accepting or rejecting suggested behavior combinations for the model to use.