Abstract:
Email has become an increasingly important and widespread method of communication because it is fast and saves time. The number of emails received per day can range from tens for a regular user to thousands for companies. Spam is considered one of the biggest challenges for the Internet. Email classification, which aims to correctly separate legitimate emails from spam, has therefore become an important topic for both industry and academia. To achieve this goal, machine learning approaches, especially supervised machine learning algorithms, have been extensively applied to this field. Several studies in the literature report that supervised machine learning (SML) suffers from limitations such as performance fluctuation, and many works have consequently focused on designing more complex algorithms. This study developed two efficient supervised models and compared their performance on new emails. Term weighting with TF-IDF (Term Frequency-Inverse Document Frequency) was applied as a feature selection method to eliminate redundant and less relevant words/tokens. The two algorithms used, Naïve Bayes and Support Vector Machine, are described together with their applicability to the problem of email classification. Models using Naïve Bayes and Support Vector Machine classifiers were developed, and their performance was evaluated using Precision, Recall, F-Score and Accuracy on the datasets used for the two algorithms. The models were implemented in the R statistical programming language. The results showed that the Support Vector Machine outperformed the Naïve Bayes classifier in the classification task.
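
For reference, the term weighting and the four evaluation measures named above are commonly written as follows. This is a sketch under stated assumptions, not the exact formulation used in the study: the symbols (tf, df, N, TP, FP, TN, FN) are introduced here for illustration, spam is assumed to be the positive class, the F-Score is assumed to be the balanced F1 form, and the TF-IDF variant shown is the plain unsmoothed one (the study may have used a smoothed or normalized variant).

```latex
% Assumed notation (not from the original text):
%   tf(t,d) = number of occurrences of term t in email d
%   df(t)   = number of emails containing term t
%   N       = total number of emails in the collection
%   TP, FP, TN, FN = confusion-matrix counts with spam as the positive class
\begin{align*}
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)} \\
\mathrm{Precision} &= \frac{TP}{TP + FP} \\
\mathrm{Recall} &= \frac{TP}{TP + FN} \\
\text{F-Score} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \\
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{align*}
```

Under this weighting, a term that appears in every email has df(t) = N and therefore weight zero, which is why TF-IDF can be used to discard tokens that do not help separate spam from legitimate mail.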