Exploring Teacher-Student Interaction through Multimodal Large Language Models: An Empirical Investigation

Abstract

Teacher–student interaction is central to classroom learning, yet traditional observation and machine-learning approaches often remain inefficient and subjective. This study explores the use of multimodal large language models (MLLMs) for systematic analysis of classroom dynamics. We fine-tuned VisualGLM-6B on 2,380 annotated images from 30 classroom videos, covering five interaction types: guided, collaborative, questioning, independent, and exhibitive. LoRA-based fine-tuning combined with prompt engineering was employed to enhance interpretability and domain-specific accuracy. Model performance was assessed through confusion matrices, BERTScore, and expert comparisons. The fine-tuned model achieved 82% overall accuracy, performing best on guided, independent, and exhibitive interactions, while collaborative and questioning types remained challenging. Compared with expert annotation, the model provided more structured and interpretable outputs, though occasional misclassifications and hallucinations persisted. These findings demonstrate the feasibility of applying MLLMs for efficient, objective analysis of teacher–student interactions and highlight future directions such as incorporating audio inputs and larger datasets to further advance educational research methodologies.
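As a concrete illustration of the fine-tuning approach described in the abstract, the sketch below shows how LoRA adapters might be attached to VisualGLM-6B using the Hugging Face peft library. The hyperparameters (rank, scaling, dropout) and the choice of target modules are illustrative assumptions, not the values reported in the study.

```python
# Minimal sketch of LoRA-based fine-tuning of VisualGLM-6B, assuming the
# Hugging Face `transformers` and `peft` libraries. Hyperparameters below
# are illustrative, not those used in the paper.
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["query_key_value"],   # fused attention projection in GLM blocks
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are updated
```

The BERTScore comparison between model outputs and expert annotations could likewise be computed with the bert-score package; the example captions here are hypothetical stand-ins for the study's data.

```python
# Minimal sketch of BERTScore evaluation, assuming the `bert-score` package.
# `model_outputs` and `expert_annotations` are hypothetical examples.
from bert_score import score

model_outputs = ["The teacher guides students through a worked example."]
expert_annotations = ["Guided interaction: the teacher demonstrates a solution step by step."]

P, R, F1 = score(model_outputs, expert_annotations, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```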