Abstract
Malware analysis is an active research area, with an emphasis on machine learning algorithms that help automate the task. One of the leading portals that helps researchers obtain datasets is Virus Total, which provides free academic accounts with hundreds of thousands of malware samples and their metadata. This work contributes an analysis of 429058 malware samples from Virus Total, aimed at overcoming the problem of inconsistent labeling in the antivirus scan results from different vendors. Two methods were used, LSA and LDA, both with automatic calibration of parameters, to find the optimal number of clusters; both yielded 5. The clusters were visualized using k-means clustering in two-dimensional space. Additional analysis of the most informative words in each cluster showed 4 similar classes across the two methods, and one cluster per method (LSA and LDA) that was not related, by the meaning of its words, to any cluster in the other method. The results offer a sound approach to malware data analysis when dealing with an inconsistently labeled dataset.
Keywords
References
I (we), the author(s), hereby declare under full moral, financial and criminal liability that the manuscript submitted for publication to the Journal of Computer and Forensic Sciences
a) is the result of my (our) own original research and that I (we) hold the right to publish it;
b) does not infringe any copyright or other third-party proprietary rights;
c) complies with the Journal’s research and publishing ethics standards;
d) has not been published elsewhere, under this or any other title;
e) is not under consideration by another publication, under this or any other title.
I (we) also declare under full moral, financial and criminal liability:
f) that all conflicts of interest that may directly or potentially influence or impart bias on the work have been disclosed in the manuscript;
g) that if the article is accepted for publication I (we) will transfer all copyright ownership of the manuscript to the University of Criminal Investigation and Police Studies in Belgrade.
Signed by the Corresponding Author on behalf of all the other authors.