More than two billion active devices run on Google’s Android platform. Unfortunately, the popularity of the operating system has made it the target of malicious software, or malware—around 3.2 million new Android malware samples were identified by the end of the third quarter of 2018, according to German security software company G Data.
Cybersecurity experts have developed defenses against some of these bad actors, including machine learning and artificial intelligence tools that can recognize suspicious applications. However, most existing methods require expert analysis to predetermine specific malware features—an approach only possible for known cybersecurity threats.
“As malware keeps on evolving, any predetermined features will soon become outdated, but manually defining new features takes time and is not easy. Also, updating the batch learning-based classifiers requires retraining the malware detection model with new malware samples and all previous training samples, which is slow and resource intensive,” said Dr. Li Zhang, former Research Scientist at A*STAR’s Institute for Infocomm Research (I2R), who is now with ST Engineering.
To overcome these limitations, Zhang and colleagues combined two techniques, n-gram analysis and online classifiers, to create a machine learning model for more efficient discovery of Android malware. The method uses part of an application’s code to generate n-grams, the equivalent of a fingerprint containing detailed information about the application.
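As a rough illustration of the n-gram idea, the sketch below slides a window of n tokens over a sequence and counts each window. It is a minimal sketch, not the researchers' implementation: the token list, the choice of n = 3, and the function names `ngrams` and `fingerprint` are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fingerprint(tokens, n=3):
    """Count how often each n-gram occurs; the counts act as the fingerprint."""
    return Counter(ngrams(tokens, n))


# Hypothetical token sequence standing in for part of an application's code.
opcodes = ["const", "invoke", "move", "const", "invoke", "move", "return"]
fp = fingerprint(opcodes, n=3)

for gram, count in fp.items():
    print(gram, count)
```

Each distinct n-gram plays the role of one sub-fingerprint, and its count is the value a classifier would later weigh.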
A classifier algorithm then automatically assigns a weight, or score, to the component parts of the fingerprint (sub-fingerprints) according to how closely each sub-fingerprint resembles malware. “A dedicated classifier is used to handle a specific category of information in the Android application. Such a design helps further improve the classification accuracy and reduce the model training time,” Zhang explained. Importantly, their model can adapt itself based on new training samples without forgetting knowledge obtained from prior datasets—what is known as incremental learning.
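The incremental learning idea can be sketched with a simple online perceptron, which keeps one weight per sub-fingerprint and adjusts it one sample at a time. This is a stand-in chosen for brevity, not the classifier used in the study; the class name `OnlinePerceptron`, the feature names, and the sample data are hypothetical.

```python
class OnlinePerceptron:
    """A linear classifier that learns from one sample at a time."""

    def __init__(self):
        self.weights = {}  # one weight per sub-fingerprint (n-gram)

    def score(self, features):
        # Weighted sum over the n-grams present in the sample.
        return sum(self.weights.get(f, 0.0) * v for f, v in features.items())

    def predict(self, features):
        # +1 means "malware", -1 means "benign".
        return 1 if self.score(features) > 0 else -1

    def update(self, features, label):
        # Adjust weights only when the current model gets the sample wrong.
        if self.predict(features) != label:
            for f, v in features.items():
                self.weights[f] = self.weights.get(f, 0.0) + label * v


# Hypothetical samples: n-gram counts with made-up feature names.
old_malware = {"send_sms|get_id|connect": 2, "exec|chmod|write": 1}
benign = {"draw|layout|measure": 3}
new_malware = {"load_dex|decrypt|exec": 1}

model = OnlinePerceptron()
for features, label in [(old_malware, 1), (benign, -1)]:
    model.update(features, label)

# Later, a new sample arrives; only this sample is processed.
model.update(new_malware, 1)

print(model.predict(old_malware), model.predict(new_malware), model.predict(benign))
```

The last update touches only the weights of the new sample's n-grams, so the earlier samples never need to be revisited. A batch-trained model, by contrast, would be refit on the full dataset.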
Applying their approach to a benchmark dataset of more than 10,000 application samples, the researchers achieved a malware detection accuracy of 99.2 percent. Tested on a real-world dataset containing more than 70,000 samples, the model performed with 86.2 percent accuracy. Furthermore, when classifying malware, the technique obtained an accuracy of 98.8 percent on the top 23 malware families of the Drebin dataset, a well-annotated library of Android malware.
“Our framework can help security analysts or antivirus developers better cope with fast-evolving malware. Besides, the underlying model is linear and lightweight, which can even be deployed on phones to achieve real-time protection of Android users,” said Zhang.
His team is now expanding the framework to also consider the runtime behaviors of Android applications, which they expect will further improve malware classification accuracy.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).