Exploring Trends and Taxonomies in Survey Papers on Large Language Models through Data Science
Abstract
This report analyzes a dataset of survey papers on large language models using data science techniques. The goal is to explore, manipulate, and evaluate the data to uncover publication trends and the distribution of taxonomy categories among these surveys. The exploration phase began with a time-series analysis of survey releases to visualize trends over time. Taxonomy distributions were then examined with bar and pie charts to identify the most common categories. In the manipulation phase, a feature matrix was built by applying TF-IDF vectorization to the textual fields (titles and summaries) and one-hot encoding to the categorical variables. These features were normalized and split into training and testing sets for model evaluation. In the evaluation phase, a Random Forest classifier was trained to predict the taxonomy of each survey from the extracted features. Performance was measured with accuracy and precision, and the model achieved an accuracy of 56.89%. Other models, including LinearSVC and Logistic Regression, were also evaluated and reached approximately the same accuracy as the Random Forest classifier. While this result leaves significant room for improvement, it highlights the potential of machine learning to automate the classification of survey papers based on their content. The analysis demonstrates how data science methods, including natural language processing (NLP) and machine learning, can be used to discern trends, engineer features, and assess models on survey data. Future research may focus on more sophisticated models and feature selection strategies to improve predictive accuracy.
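The manipulation and evaluation steps described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and SciPy; the toy records, the field layout (title, summary, category, taxonomy label), the choice of MaxAbsScaler for normalization, and all hyperparameters are hypothetical and are not taken from the report. The scores it prints on this toy data say nothing about the reported 56.89% accuracy.

```python
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

# Hypothetical toy records: (title, summary, category, taxonomy label).
records = [
    ("A survey of prompting", "We review prompt design for language models.", "cs.CL", "Prompting"),
    ("Prompt engineering overview", "Techniques for writing effective prompts.", "cs.CL", "Prompting"),
    ("Chain-of-thought prompting survey", "Reasoning prompts and their variants.", "cs.AI", "Prompting"),
    ("In-context learning: a review", "How prompts with examples steer models.", "cs.CL", "Prompting"),
    ("Evaluating language models", "Benchmarks and metrics for model evaluation.", "cs.CL", "Evaluation"),
    ("A survey of LLM benchmarks", "We catalog evaluation benchmarks and metrics.", "cs.LG", "Evaluation"),
    ("Measuring hallucination", "Evaluation metrics for factual errors.", "cs.CL", "Evaluation"),
    ("Robustness evaluation survey", "Benchmarks for testing model robustness.", "cs.LG", "Evaluation"),
    ("Efficient fine-tuning survey", "Adapters and low-rank training methods.", "cs.LG", "Training"),
    ("Pretraining data: a review", "Training corpora and data curation.", "cs.CL", "Training"),
    ("Instruction tuning overview", "Training models to follow instructions.", "cs.CL", "Training"),
    ("Scaling training of LLMs", "Distributed training and optimization methods.", "cs.LG", "Training"),
]

# Text input is title plus summary; category is the categorical variable.
texts = [f"{title} {summary}" for title, summary, _, _ in records]
cats = [[category] for _, _, category, _ in records]
labels = [label for *_, label in records]

# Split first so the vectorizer, encoder, and scaler are fit on training data only.
(texts_train, texts_test, cats_train, cats_test,
 y_train, y_test) = train_test_split(
    texts, cats, labels, test_size=0.25, stratify=labels, random_state=0
)

tfidf = TfidfVectorizer()
onehot = OneHotEncoder(handle_unknown="ignore")
scaler = MaxAbsScaler()  # keeps the sparse matrix sparse while rescaling

# Feature matrix: TF-IDF columns followed by one-hot category columns.
X_train = hstack([tfidf.fit_transform(texts_train),
                  onehot.fit_transform(cats_train)]).tocsr()
X_test = hstack([tfidf.transform(texts_test),
                 onehot.transform(cats_test)]).tocsr()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.2f}, macro precision={precision:.2f}")
```

Swapping RandomForestClassifier for sklearn.svm.LinearSVC or sklearn.linear_model.LogisticRegression leaves the rest of the sketch unchanged, which is one way the model comparison mentioned in the abstract could be set up.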