Comparative Evaluation of Logistic Regression and Naïve Bayes for Fake News Detection Using NLP Techniques
Abstract
Detecting fake news is an important task in the digital era, in which misinformation spreads widely. This paper uses the WELFake dataset, which contains 72,134 news articles, of which 35,028 are labeled real and 37,106 are labeled fake. The dataset aggregates four well-known sources (Kaggle, McIntire, Reuters, and BuzzFeed Political) with the aim of making classifiers more robust and preventing overfitting. Each entry consists of a serial number, the news title, the article text, and a label marking the news as real (1) or fake (0). Although the original CSV file includes 78,098 entries, the data frame used here is restricted to 72,134 entries to maintain data quality and relevance for machine learning tasks. This study implements two machine learning algorithms: Logistic Regression and Naive Bayes. Logistic Regression performed best, with an accuracy of 94.53% and balanced precision, recall, and F1-scores of 0.95 for both the real and fake news classes. The Naive Bayes classifier reached an accuracy of 84.72% but had poor F1-scores, since it operates under the assumption of feature independence. Preprocessing consisted of cleaning, tokenization, and lemmatization, followed by feature extraction with TF-IDF. Although the results are good, this baseline points to the advantage of Logistic Regression in high-dimensional feature spaces. This study has been published in IEEE Transactions on Computational Social Systems and underlines the synergy of machine learning and deep text analysis in fighting fake news. It also outlines future improvements in classification methods and real-time detection systems, toward a reliable and trustworthy information ecosystem for further research and applications.
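The pipeline described in the abstract (cleaning, tokenization, TF-IDF features, then Logistic Regression and Naive Bayes) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The toy headlines, all function names, the smoothed IDF formula, the choice of a multinomial Naive Bayes variant, and the gradient-descent settings are assumptions made for illustration, and lemmatization is omitted because it would require an external NLP library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Clean and tokenize: lowercase, keep letters only, split on whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()

def fit_tfidf(token_docs):
    """Learn a vocabulary index and smoothed IDF weights from tokenized documents."""
    n = len(token_docs)
    df = Counter(term for doc in token_docs for term in set(doc))
    terms = sorted(df)
    vocab = {term: i for i, term in enumerate(terms)}
    idf = [math.log((1 + n) / (1 + df[term])) + 1.0 for term in terms]
    return vocab, idf

def tfidf_vector(tokens, vocab, idf):
    """Map one tokenized document to an L2-normalized TF-IDF vector."""
    vec = [0.0] * len(vocab)
    for term, count in Counter(tokens).items():
        if term in vocab:  # out-of-vocabulary terms are ignored
            vec[vocab[term]] = count * idf[vocab[term]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def train_logreg(X, y, lr=1.0, epochs=300):
    """Binary logistic regression fitted by batch gradient descent (unregularized)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - target  # predicted probability minus label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_logreg(x, model):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_nb(X, y, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        totals = [sum(col) + alpha for col in zip(*rows)]
        z = sum(totals)
        model[label] = (math.log(len(rows) / len(X)), [math.log(t / z) for t in totals])
    return model

def predict_nb(x, model):
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(xi * lw for xi, lw in zip(x, log_likelihood))
    return max(model, key=score)

# Hypothetical toy headlines (not taken from WELFake).
# Labels follow the convention stated in the abstract: real = 1, fake = 0.
texts = [
    "Government announces new budget for schools",
    "Central bank reports steady inflation figures",
    "Parliament passes budget after long debate",
    "Shocking miracle cure doctors hate revealed",
    "Aliens secretly control weather shocking proof",
    "Miracle pill melts fat overnight shocking",
]
labels = [1, 1, 1, 0, 0, 0]

token_docs = [tokenize(t) for t in texts]
vocab, idf = fit_tfidf(token_docs)
X = [tfidf_vector(doc, vocab, idf) for doc in token_docs]
logreg_model = train_logreg(X, labels)
nb_model = train_nb(X, labels)

def classify(text):
    """Return (Logistic Regression label, Naive Bayes label) for a raw string."""
    x = tfidf_vector(tokenize(text), vocab, idf)
    return predict_logreg(x, logreg_model), predict_nb(x, nb_model)

print(classify("Shocking proof aliens control miracle cure"))
```

On six toy headlines both classifiers agree, so the sketch only shows how the stages fit together. Reproducing the reported accuracy gap would require the full WELFake data, a held-out test split, and a lemmatizer, and in practice a library implementation of TF-IDF and both classifiers would likely replace these hand-written functions.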