MLDP: A Multimodal Learning Framework for Robust and Accurate Information Extraction from Tobacco Box Labels

Abstract

With the construction of smart re-baking factories and the accelerating digital transformation of the agricultural industry, tobacco re-baking, as the core raw-material link of the cigarette production process, urgently needs to move from traditional manual workflows to intelligent automated systems for product quality traceability and information management: manual auditing can no longer meet the modern tobacco industry's high standards of efficiency, accuracy, and traceability. This paper proposes MLDP (Multimodal Learning for Document Processing), a novel multimodal learning framework that addresses the efficiency, accuracy, and robustness bottlenecks of traditional cigarette box label information extraction. The framework combines the strengths of the text and image modalities. For the text modality, it integrates several DeBERTa variants (BiLSTM-DeBERTa, Multi-Sample Dropout-DeBERTa, and Distil-DeBERTa) and balances model diversity against computational efficiency through an Optuna-driven weighted ensemble strategy. For the image modality, PaddleOCR performs character recognition, and a text correction mechanism based on Levenshtein distance reduces digit recognition errors. To strengthen cross-modal synergy, MLDP introduces a dynamic weighted fusion decision mechanism and constructs a modal error correction matrix to refine the extraction results, improving robustness and accuracy in complex scenarios.
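The Levenshtein-based correction step can be illustrated with a minimal sketch: compute the edit distance from an OCR token to a lexicon of known label values and snap to the nearest candidate when it is close enough. The field values and threshold below are hypothetical, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def correct_ocr(token: str, lexicon: list[str], max_dist: int = 2) -> str:
    """Snap an OCR token to the closest known label value, if close enough."""
    best = min(lexicon, key=lambda cand: levenshtein(token, cand))
    return best if levenshtein(token, best) <= max_dist else token
```

For example, an OCR output missing a hyphen, `correct_ocr("GradeB1", ["Grade-B1", "Grade-C2"])`, resolves to `"Grade-B1"` (distance 1), while a token far from every candidate is left unchanged rather than mis-corrected.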
Experiments on the multimodal cigarette box label dataset provided by the re-baking factory show that MLDP improves the average F1 score by 4.34% over the text-only model (MLDP-T) and by 5.73% over the image-only model (MLDP-V), and achieves a 1.18% to 1.55% improvement in recognizing non-textual elements on the DocBank complex-document task, verifying its strong generalization capability. Comparative experiments against state-of-the-art models, including LayoutLMv3, UDOP, and Donut, further demonstrate the superiority of MLDP, which achieves an average F1 score of 96.51% and outperforms these strong baselines by 1.3% to 3.7%. The results validate the effectiveness of the dynamic multimodal fusion strategy, especially in handling noisy documents and recognizing numeric fields. In addition, we construct and open-source an industrial-grade multimodal cigarette box label dataset containing 9,719 image-text-aligned samples, filling the gap in high-quality benchmark data for this field.
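One plausible reading of the dynamic weighted fusion mechanism is a per-field confidence-weighted vote between the two modalities. The sketch below is an illustrative assumption, not the paper's implementation: the function name, weights, and agreement bonus are all hypothetical, with the weights standing in for values that could be tuned on validation data (e.g. by an Optuna search).

```python
def fuse_field(text_pred: tuple[str, float],
               image_pred: tuple[str, float],
               w_text: float = 0.6,
               w_image: float = 0.4) -> tuple[str, float]:
    """Fuse one field's predictions from the text and image modalities.

    Each prediction is a (value, confidence) pair; the modality with the
    higher weighted confidence wins, and agreement boosts the final score.
    """
    t_val, t_conf = text_pred
    i_val, i_conf = image_pred
    if t_val == i_val:  # modalities agree: keep value, combine confidences
        return t_val, min(1.0, w_text * t_conf + w_image * i_conf)
    if w_text * t_conf >= w_image * i_conf:
        return t_val, w_text * t_conf
    return i_val, w_image * i_conf
```

Under this scheme a high-confidence OCR reading can override a weaker text-model prediction for numeric fields, which matches the abstract's observation that fusion helps most on noisy documents and digits.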
