Comparative Analysis of Machine Learning Models for Email Spam Detection

https://doi.org/10.47194/ijgor.v6i3.392

Authors

  • Mugi Lestari
  • Yasir Salih, Department of Mathematics, Faculty of Education, Red Sea University, Sudan
  • Alim Jaizul, Research Collaboration Community, Bandung, Indonesia

Abstract

The growth of information technology has driven a significant increase in the use of email as a primary communication tool across sectors. Spam email has become a serious problem that disrupts productivity and threatens data security and user privacy. Conventional rule-based spam filters are no longer considered effective against increasingly sophisticated and adaptive spam patterns, so a more dynamic and accurate approach based on Machine Learning is required. This study analyzes and compares the performance of four Machine Learning algorithms for spam detection: Extra Trees Classifier, Random Forest, Support Vector Machine (SVM) with an RBF kernel, and CatBoost. The methodology comprises data acquisition from the SMS Spam Collection Dataset, preprocessing through text cleaning, feature extraction using Term Frequency–Inverse Document Frequency (TF-IDF), and model training and evaluation using Accuracy, F1 Score, and ROC AUC. The Extra Trees Classifier achieved the best overall performance, with an Accuracy of 97.29%, an F1 Score of 0.8814, and a ROC AUC of 0.9868. Tree-based ensemble models, particularly Extra Trees and Random Forest, were better at maintaining a balance between precision and recall. The SVM (RBF) recorded the highest AUC value, but at the cost of a higher number of False Negatives. These findings can serve as a reference for developing more adaptive and effective Machine Learning–based spam detection systems.
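
The feature extraction and evaluation steps named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the abstract does not state which tooling or TF-IDF variant was used, so the weighting below (term frequency times log(N / document frequency)) is one common form chosen as an assumption, and the helper names (tokenize, tfidf, binary_metrics, roc_auc) and all sample messages, labels, and scores are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Stand-in for "text cleaning": lowercase, replace non-letters with spaces.
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()


def tfidf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores, positive=1):
    """ROC AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up messages, for illustration only.
vectors = tfidf(["Win a free prize now", "Are we meeting now", "Free prize inside"])

# Made-up labels (1 = spam), hard predictions, and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

In this toy example a single False Negative among three spam messages leaves Accuracy at 0.90 while the F1 Score drops to 0.80, which shows why Accuracy and F1 can diverge when spam is the minority class. ROC AUC is computed from the ranking of the scores rather than from thresholded predictions, so a model can rank well and still produce many False Negatives at a given decision threshold.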

Published

2025-08-25