Automated Classification and Trend Analysis of Large Language Model Survey Papers Using Machine Learning and Natural Language Processing Techniques

Meherunnesa Tania

Abstract

This study investigates the application of machine learning (ML) and natural language processing (NLP) techniques to classify academic survey papers into predefined taxonomy categories. The dataset, consisting of paper titles, summaries, release dates, taxonomy labels, and categories, was analyzed to uncover trends and patterns in the publication of research papers. Exploratory data analysis (EDA) revealed important insights through visualizations, such as publication trends over time, the distribution of taxonomy categories, and the most common terms used in paper summaries. Key NLP techniques, including Term Frequency-Inverse Document Frequency (TF-IDF), were employed to transform the textual data into numerical features, while one-hot encoding was applied to the categorical data. A Random Forest Classifier was trained on the extracted feature matrix to predict the taxonomy category of each paper. The model achieved promising accuracy, effectively capturing patterns in the dataset. The study also identified areas for future improvement, including addressing class imbalance and exploring more sophisticated models. These findings demonstrate the potential of ML and NLP for automating the classification of academic papers, providing a scalable solution for managing large collections of research literature while offering insights into publication dynamics and trends.
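
The feature-construction step named in the abstract (TF-IDF for text, one-hot encoding for categorical fields) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the toy records, the field names `summary` and `category`, the whitespace tokenizer, and the particular TF-IDF variant (raw term count times log(N / df), without smoothing or normalization). The Random Forest step is omitted; in practice the resulting matrix would be passed to a classifier.

```python
import math
from collections import Counter

# Hypothetical toy records; the paper's actual dataset is not reproduced here.
papers = [
    {"summary": "survey of prompt tuning methods", "category": "cs.CL"},
    {"summary": "survey of model evaluation benchmarks", "category": "cs.LG"},
    {"summary": "prompt engineering for code models", "category": "cs.CL"},
]


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def tfidf_matrix(docs):
    """Return (vocabulary, rows) with tf = raw count and idf = log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


def one_hot(values):
    """Return (levels, rows) with one indicator column per distinct value."""
    levels = sorted(set(values))
    rows = [[1.0 if v == lvl else 0.0 for lvl in levels] for v in values]
    return levels, rows


vocab, text_features = tfidf_matrix([p["summary"] for p in papers])
levels, cat_features = one_hot([p["category"] for p in papers])

# Concatenate text and categorical features into one row per paper.
feature_matrix = [t + c for t, c in zip(text_features, cat_features)]
```

Each row of `feature_matrix` holds the TF-IDF weights for the summary followed by the indicator columns for the category. Terms that occur in every document receive weight zero under this unsmoothed variant, which is one reason library implementations usually apply smoothing.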

Version published to 10.31219/osf.io/fsrhy on OSF Preprints
Oct 2, 2024
