Educational Video Transcript Analysis with LLMs: Improving Entity Recognition and Qualitative Insights
Abstract
Digital educational media platforms such as YouTube have become invaluable sources of qualitative data for educational researchers, who increasingly rely on video content to examine classroom interactions, instructional practices, and student engagement at scale. Yet large-scale qualitative analysis of video content remains hampered by transcription inaccuracies and by the difficulty of extracting structured information. This study introduces an integrated methodological pipeline that combines Automatic Speech Recognition (ASR), Large Language Models (LLMs) for coreference resolution, and Named Entity Recognition (NER) correction to enhance transcript fidelity and analytical utility. We analyzed 48 episodes from CrashCourse US History, applying multiple ASR systems and LLM-based transcript enhancement to assess large-scale trends in transcription quality, entity extraction, and topic modeling. For evaluation, four episodes were selected for detailed manual annotation, serving as gold-standard benchmarks for validating the NER and coreference improvements introduced by the LLM-powered pipeline. Results show that LLM-assisted coreference resolution and NER correction significantly improve accuracy, recall, and precision in recognizing key historical entities, especially complex event, organization, and law entities. Topic modeling analyses further reveal that LLM-cleaned transcripts yield more coherent and semantically distinct topics, both at the corpus level and in focused case studies, such as the “Reagan Revolution” episode. By comparing traditional ASR pipelines with the proposed LLM-enhanced workflow, we show the value of combining automated language technologies with qualitative research goals.
The findings highlight the potential of LLMs as an artificial intelligence tool to advance educational data mining and qualitative inquiry, enabling researchers to increase the reliability of entity recognition in educational videos and to facilitate thematic mapping and comparative analyses of teaching practices, classroom interactions, or policy enactment across diverse educational settings. The code and data are available at https://github.com/wwang93/JEDM-Paper-Pipeline.git
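The evaluation described above compares extracted entities against manually annotated gold-standard episodes. As a rough illustration of how precision and recall could be computed for such a comparison, the sketch below scores a predicted entity list against a gold list using exact matching on case-normalized (text, label) pairs. The matching rule, the function names, and the toy entities are assumptions made for illustration; they are not taken from the study or its repository, which may use a different matching criterion.

```python
from typing import Dict, Iterable, Set, Tuple

# An entity is a (surface text, label) pair, e.g. ("Congress", "ORG").
Entity = Tuple[str, str]


def normalize(entities: Iterable[Entity]) -> Set[Entity]:
    """Lower-case the text and upper-case the label so matching ignores case."""
    return {(text.strip().lower(), label.strip().upper()) for text, label in entities}


def score_entities(predicted: Iterable[Entity], gold: Iterable[Entity]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over sets of unique entities.

    Assumption: an entity counts as correct only if both its normalized
    text and its label appear in the gold annotation.
    """
    pred_set = normalize(predicted)
    gold_set = normalize(gold)
    true_positives = len(pred_set & gold_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data for illustration only (not from the paper's annotations).
gold = [
    ("Ronald Reagan", "PERSON"),
    ("Congress", "ORG"),
    ("Economic Recovery Tax Act", "LAW"),
]
# A raw ASR transcript might misspell a name and miss a multi-word law.
raw = [("Ronald Regan", "PERSON"), ("Congress", "ORG")]
# A corrected transcript might recover all three entities.
cleaned = [
    ("Ronald Reagan", "PERSON"),
    ("Congress", "ORG"),
    ("Economic Recovery Tax Act", "LAW"),
]

if __name__ == "__main__":
    print("raw:    ", score_entities(raw, gold))
    print("cleaned:", score_entities(cleaned, gold))
```

Exact matching is the strictest choice and penalizes a misspelled name as both a false positive and a false negative; a partial-match or span-overlap criterion would give more lenient scores.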