
Feature Extraction Methods and Classification for Malware Incident News

EasyChair Preprint no. 11119

6 pages. Date: October 23, 2023


Data mining has attracted considerable interest recently, including for unstructured data. One commonly discussed task is automatic classification using machine learning methods. The large volume of data is the main obstacle to manual classification, yet many practitioners still have difficulty choosing the right combination of feature extraction and classification methods. We therefore suggest combinations of methods that can produce better accuracy in text classification. This research compares several feature extraction methods: Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word2Vec, with a focus on the Skip-gram model. It also evaluates several classification methods: Support Vector Machine (SVM), Decision Tree, Logistic Regression, Gaussian Naive Bayes, K-Nearest Neighbor, Neural Network, Random Forest, and Doc2Vec. The dataset consists of 200 articles crawled from several web blogs, manually labeled and split into two classes: malware incident news and non-malware incident news. Dataset quality was also measured using the open-source Python library "cleanlab".
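
To make the difference between the first two feature extraction methods concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The three toy headlines, the whitespace tokenizer, the helper names, and the unsmoothed IDF variant log(N / df) are all assumptions chosen for illustration; the abstract does not specify preprocessing or weighting details.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the paper's preprocessing is not stated.
    return text.lower().split()


def bag_of_words(docs):
    """Return (vocabulary, count vectors): one raw term-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, weight vectors) using tf * log(N / df), an unsmoothed variant."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weights = [[row[j] * idf[j] for j in range(len(vocab))] for row in counts]
    return vocab, weights


# Hypothetical headlines, invented for this example (not taken from the dataset).
docs = [
    "ransomware attack hits hospital",
    "new phone released today",
    "ransomware hits city network",
]

vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
```

BoW treats every term equally, whereas TF-IDF gives a lower weight to a term such as "ransomware" that appears in several documents than to a term that appears in only one. Either set of vectors could then be passed to any of the classifiers listed above.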

Keyphrases: Document Embedding, malware incident, text classification, text mining, web crawling

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
  author = {Gugum Gumilar and Eka Budiarto and Maulahikmah Galinium and Charles Lim},
  title = {Feature Extraction Methods and Classification for Malware Incident News},
  howpublished = {EasyChair Preprint no. 11119},

  year = {EasyChair, 2023}}