Android-Malware-Detection-System-Using-Machine-Learning

Predicting the malicious/benign nature of apps based on their requested permissions, with Machine Learning as a tool



Machine Learning Final Project

Purpose:

Project at IIITD for the course CSE343: Machine Learning, under the guidance of Professor Jainendra Shukla

Contributors:

Motivation:

With the continuing boom in the Android market, there is a constant rise in apps with malicious activity. According to ZDNet, 10-24% of the apps on the Play Store could be malicious. On the surface, these apps look like any other standard app, but they exploit the user's system in various harmful ways. Current methods of malware detection are both resource-heavy and exhaustive, yet they fail to keep pace with new malware.

What can help us overcome these challenges?

Introduction:

Despite the growing number of malware applications, there is not yet an effective and robust method to detect them. Given the increasing applicability of Machine Learning across domains, we believe the problem of malware detection can be addressed with Machine Learning techniques. Our project aims at a detailed and systematic study of malware detection using machine learning, and at building an efficient ML model that classifies apps as benign (0) or malware (1) based on their requested permissions.

Dataset Description:

Details:

Preprocessing, Visualization and Analysis: Data is read from a CSV file into a DataFrame for easy use, and the required attributes are filtered out of the dataset: all columns carrying information other than permissions are removed, and app names are mapped to the index so each app's record is easy to access. The data is checked for null/missing values, which are replaced by the mean of their column. The data is then analysed through the distribution of malware and benign applications in various settings, and several plots are made to visualise the results. Matplotlib and Seaborn are used for plotting and visualization.
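A minimal sketch of this preprocessing step, assuming a hypothetical `apps.csv` whose columns are an `App` name, one column per permission, and a binary `Class` label (these names are placeholders, not the project's actual schema):

```python
import pandas as pd

# Read the raw dataset into a DataFrame (file and column names are assumptions).
df = pd.read_csv("apps.csv")

# Map app names to the index so each app's record can be accessed by name.
df = df.set_index("App")

# Keep only the permission columns and the label; drop everything else.
permission_cols = [c for c in df.columns if c.startswith("android.permission.")]
df = df[permission_cols + ["Class"]]

# Replace null/missing values with the mean of their column.
df = df.fillna(df.mean())
```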

Plots:

Figures: Unsampled, Undersampled, and Oversampled Class Distributions; Classification of Apps using Categories.
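For reference, a class-distribution plot like these can be produced along the following lines (continuing the sketch above; the `Class` column name remains an assumption):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar plot of benign (0) vs. malware (1) counts in the dataset.
sns.countplot(x="Class", data=df)
plt.title("Class Distribution")
plt.xlabel("0 = benign, 1 = malware")
plt.show()
```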

Methodology:

After preprocessing, the data is split into training and testing sets in an 8:2 ratio. We applied under- and over-sampling to the dataset; however, the outcomes did not appear promising. Following the sampling, we tried different classifiers, including logistic regression, decision trees, and Naive Bayes, but the results were unsatisfactory. On inspecting the dataset, we found the data to be highly multivariate, so we applied PCA and plotted the variance percentage of the components. Based on this, we chose to use the inverse transform and then applied the classifiers to the resulting dataset. First we used Random Forest, which brought a considerable improvement in accuracy. We then used boosting to further increase prediction accuracy, applying it both to the unsampled dataset and to one obtained after selecting reliable features; the results show the model improving. Finally, we applied SVM and MLP to the final dataset and obtained our best results. Comparing the results obtained after feature selection, we can see clear progress and better accuracy. A sketch of this pipeline follows.
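A condensed sketch of the pipeline with scikit-learn, continuing the preprocessing sketch above. The 8:2 split and the model choices follow the text; the PCA variance threshold, the AdaBoost variant of boosting, and any unstated parameters are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X = df.drop(columns=["Class"])
y = df["Class"]

# Split into training and testing sets in an 8:2 ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# PCA over the multivariate permission features; keeping 95% of the
# variance is an assumption, not a value from the report.
pca = PCA(n_components=0.95)
X_train_inv = pca.inverse_transform(pca.fit_transform(X_train))
X_test_inv = pca.inverse_transform(pca.transform(X_test))

# Random Forest, which gave a considerable improvement in accuracy.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
rf.fit(X_train_inv, y_train)

# Boosting to further improve prediction accuracy.
boost = AdaBoostClassifier()
boost.fit(X_train_inv, y_train)
```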


Figure: PCA features vs. Variance Percentage
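A plot like the one above can be generated roughly as follows (a sketch, reusing `X_train` from the pipeline above):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA with all components and plot how much variance each one explains.
pca_full = PCA().fit(X_train)
variance_pct = pca_full.explained_variance_ratio_ * 100
plt.plot(range(1, len(variance_pct) + 1), variance_pct)
plt.xlabel("PCA features")
plt.ylabel("Variance Percentage")
plt.show()
```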

Libraries Used:

pandas, Matplotlib, Seaborn, scikit-learn

Results and Analysis:

On Basic Models

| Model | Metric | Unsampled | Oversampled | Undersampled |
|---|---|---|---|---|
| Logistic Regression | Training Accuracy | 0.69 | 0.63 | 0.63 |
| Logistic Regression | Test Accuracy | 0.68 | 0.62 | 0.63 |
| Logistic Regression | Recall Score | 0.95 | 0.66 | 0.67 |
| Logistic Regression | ROC Score | 0.53 | 0.61 | 0.62 |
| Naive Bayes | Training Accuracy | 0.68 | 0.53 | 0.53 |
| Naive Bayes | Test Accuracy | 0.67 | 0.53 | 0.53 |
| Naive Bayes | Recall Score | 0.97 | 0.98 | 0.99 |
| Naive Bayes | ROC Score | 0.52 | 0.51 | 0.50 |
| Decision Tree | Training Accuracy | 0.67 | 0.57 | 0.55 |
| Decision Tree | Test Accuracy | 0.67 | 0.55 | 0.56 |
| Decision Tree | Recall Score | 0.99 | 0.68 | 0.79 |
| Decision Tree | ROC Score | 0.51 | 0.54 | 0.55 |

As we can see, sampling is not effective in our case, so we move forward with unsampled data only.
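For illustration, the baseline figures above could be computed along these lines, assuming imbalanced-learn's random resamplers were used for the over-/under-sampling (the exact samplers and settings are assumptions; the loop shows the unsampled case):

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Oversampled and undersampled variants of the training split.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

for model in (LogisticRegression(max_iter=1000),  # max_iter is an assumption
              GaussianNB(),
              DecisionTreeClassifier()):
    model.fit(X_train, y_train)  # unsampled case; repeat with the resampled splits
    pred = model.predict(X_test)
    print(type(model).__name__,
          "train acc:", model.score(X_train, y_train),
          "test acc:", accuracy_score(y_test, pred),
          "recall:", recall_score(y_test, pred),
          "roc:", roc_auc_score(y_test, pred))
```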

| Model | Optimal Parameters | Training Accuracy | Test Accuracy | Recall | ROC |
|---|---|---|---|---|---|
| SVM | default | 0.85 | 0.85 | 0.94 | 0.80 |
| Random Forest | `n_estimators=200, n_jobs=-1` | 0.87 | 0.86 | 0.93 | 0.81 |
| MLP | `random_state=42, max_iter=300` | 0.85 | 0.85 | 0.95 | 0.80 |

Looking at the results, all three models perform more or less the same, with Random Forest achieving the best test accuracy of 86%. As the tabulation shows, accuracy follows the order: Random Forest > MLP > SVM.
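For reference, a sketch of the three final models with the parameters reported in the table (everything not listed is left at scikit-learn defaults, and the data splits come from the pipeline sketch above):

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

models = {
    "SVM": SVC(),  # default parameters
    "Random Forest": RandomForestClassifier(n_estimators=200, n_jobs=-1),
    "MLP": MLPClassifier(random_state=42, max_iter=300),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```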

Conclusion:

References: