Design of a Future-Oriented Intelligent Multi-modal Spoken English Platform: A Study on Deep Integration of CNN-DNN-LSTM and Self-Supervised Mechanisms
Abstract
With the growing demand for proficient spoken English in global communication, academic testing, and remote education, conventional language learning platforms face limitations in providing adaptive, real-time feedback across multiple modalities. To address these challenges, an intelligent multimodal English speaking platform (IMSEP) has been developed that integrates speech and text data through a hybrid CNN-DNN-LSTM deep fusion architecture enhanced by a self-supervised learning mechanism. The proposed platform extracts local textual features, models temporal speech dynamics, and performs nonlinear multimodal alignment. Furthermore, self-supervised pretraining enables the model to exploit large-scale unlabeled data, while a reinforcement learning-based adaptive feedback mechanism provides personalized learning recommendations. Experimental validation demonstrates that, compared with baseline systems, the platform improves pronunciation accuracy (from 84% to 92%), user learning efficiency (by 20%), and overall satisfaction (by 15%). These results indicate the effectiveness of the proposed method in supporting data-driven, scalable, and intelligent English language education.
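To make the described fusion concrete, the following is a minimal sketch of a CNN-DNN-LSTM multimodal model of the kind summarized above, assuming PyTorch, illustrative layer sizes, token-ID text input, and MFCC-style speech frames; these specifics are assumptions for illustration and are not taken from the paper, which may use a different configuration.

import torch
import torch.nn as nn

class CNNDNNLSTMFusion(nn.Module):
    """Illustrative CNN-DNN-LSTM multimodal fusion sketch (hypothetical sizes)."""

    def __init__(self, text_vocab=10000, text_emb=128, speech_feat=40, hidden=256):
        super().__init__()
        # CNN branch: extracts local textual features from token embeddings.
        self.embed = nn.Embedding(text_vocab, text_emb)
        self.text_cnn = nn.Sequential(
            nn.Conv1d(text_emb, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # LSTM branch: models temporal dynamics of frame-level speech features.
        self.speech_lstm = nn.LSTM(speech_feat, hidden, batch_first=True)
        # DNN head: nonlinear alignment/fusion of the two modality vectors,
        # producing e.g. a pronunciation-quality score.
        self.fusion_dnn = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_ids, speech_frames):
        # text_ids: (batch, tokens); speech_frames: (batch, frames, speech_feat)
        t = self.embed(text_ids).transpose(1, 2)           # (batch, emb, tokens)
        t = self.text_cnn(t).squeeze(-1)                   # (batch, hidden)
        _, (h, _) = self.speech_lstm(speech_frames)        # h: (1, batch, hidden)
        s = h[-1]                                          # (batch, hidden)
        return self.fusion_dnn(torch.cat([t, s], dim=-1))  # (batch, 1)

# Example forward pass with random data.
model = CNNDNNLSTMFusion()
score = model(torch.randint(0, 10000, (2, 20)), torch.randn(2, 100, 40))
print(score.shape)  # torch.Size([2, 1])

In this sketch the CNN captures local n-gram-like text patterns, the LSTM summarizes the speech sequence, and the dense head performs the nonlinear multimodal alignment; the self-supervised pretraining and reinforcement-learning feedback components mentioned in the abstract are not shown here.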