LLM Survey Analysis Using Random Forest
Abstract
This project applies a Random Forest classifier to metadata from survey papers on large language models (LLMs), a rapidly growing area of AI. The goal is to help new researchers by surfacing trends and patterns in LLM survey publications. A structured workflow of data loading, exploration, manipulation, and visualization was used to analyze key attributes such as release dates, categories, and taxonomies. TF-IDF vectorization, one-hot encoding, and feature scaling were used to build the feature matrix, and the classifier's hyperparameters were tuned with grid search. The model reached perfect training accuracy but a test accuracy of only 0.39, and the best cross-validation score was 0.26. This gap indicates overfitting, which is likely caused by dataset imbalance. Future work will focus on addressing the imbalance, improving feature engineering, and exploring alternative models. The project highlights trends in LLM research and suggests ways to improve model accuracy.
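The abstract names the techniques but not the implementation. The following is a minimal sketch of how such a pipeline could be assembled, assuming scikit-learn and pandas. The toy data, the column names (`title`, `category`, `release_year`, `taxonomy`), and the parameter grid are hypothetical and are not taken from the project.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy metadata standing in for the survey-paper dataset.
topics = {
    "Evaluation": "benchmark evaluation metrics for language models",
    "Alignment": "alignment with human feedback and preference tuning",
    "Efficiency": "efficient inference compression and quantization",
}
labels = list(topics)
rows = []
for i in range(30):
    label = labels[i % 3]
    rows.append({
        "title": f"A survey of {topics[label]} part {i}",
        "category": ["cs.CL", "cs.LG", "cs.AI"][(i // 3) % 3],
        "release_year": 2020 + (i % 5),
        "taxonomy": label,  # assumed prediction target
    })
df = pd.DataFrame(rows)

X = df[["title", "category", "release_year"]]
y = df["taxonomy"]

# Build the feature matrix: TF-IDF for text, one-hot for categories,
# standard scaling for the numeric column.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "title"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ("num", StandardScaler(), ["release_year"]),
])
pipeline = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Grid search over a small, illustrative hyperparameter grid.
param_grid = {"clf__n_estimators": [50, 100], "clf__max_depth": [None, 5]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```

Comparing the training accuracy with the cross-validation and test accuracies, as in the last lines, is the kind of check that exposes the overfitting gap reported above. Fitting the transformers inside the pipeline keeps the TF-IDF vocabulary and scaling statistics from leaking out of the held-out folds.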