MLDP: A Multimodal Learning Framework for Robust and Accurate Information Extraction from Tobacco Box Labels

Abstract

With the construction of smart re-baking factories and the accelerating digital transformation of the agricultural industry, tobacco re-baking, as the core raw-material link of the cigarette production process, urgently needs to move from traditional manual workflows to intelligent automated systems for product quality traceability and information management: manual auditing can no longer meet the modern tobacco industry's high standards of efficiency, accuracy, and traceability. This paper proposes MLDP (Multimodal Learning for Document Processing), a novel multimodal learning framework that addresses the efficiency, accuracy, and robustness bottlenecks of traditional cigarette box label information extraction. The framework combines the strengths of the text and image modalities. For the text modality, it integrates several DeBERTa variants (BiLSTM-DeBERTa, Multi-Sample Dropout-DeBERTa, and Distil-DeBERTa) and balances model diversity against computational efficiency through an Optuna-driven weighted ensemble strategy. For the image modality, PaddleOCR performs character recognition, and a text correction mechanism based on Levenshtein distance reduces digit recognition errors. To strengthen cross-modal synergy, MLDP introduces a dynamic weighted fusion decision mechanism and constructs a modal error correction matrix to refine the extraction results, improving robustness and accuracy in complex scenarios.
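The Levenshtein-based correction step can be illustrated with a minimal sketch: compute the edit distance from an OCR token to a lexicon of known label values and snap to the nearest candidate when it is close enough. The field values and threshold below are hypothetical, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def correct_ocr(token: str, lexicon: list[str], max_dist: int = 2) -> str:
    """Snap an OCR token to the closest known label value, if close enough."""
    best = min(lexicon, key=lambda cand: levenshtein(token, cand))
    return best if levenshtein(token, best) <= max_dist else token
```

For example, an OCR output missing a hyphen, `correct_ocr("GradeB1", ["Grade-B1", "Grade-C2"])`, resolves to `"Grade-B1"` (distance 1), while a token far from every candidate is left unchanged rather than mis-corrected.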
Experiments on the multimodal cigarette box label dataset provided by the re-baking factory show that MLDP improves the average F1 score by 4.34% over the text-only model (MLDP-T) and by 5.73% over the image-only model (MLDP-V), and achieves a 1.18% to 1.55% improvement in recognizing non-textual elements on the DocBank complex-document task, verifying its strong generalization capability. Comparative experiments against state-of-the-art models, including LayoutLMv3, UDOP, and Donut, further demonstrate the superiority of MLDP, which achieves an average F1 score of 96.51% and outperforms these strong baselines by 1.3% to 3.7%. The results validate the effectiveness of the dynamic multimodal fusion strategy, especially in handling noisy documents and recognizing numeric fields. In addition, we construct and open-source an industrial-grade multimodal cigarette box label dataset containing 9,719 image-text-aligned samples, filling the gap in high-quality benchmark data for this field.
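One plausible reading of the dynamic weighted fusion mechanism is a per-field confidence-weighted vote between the two modalities. The sketch below is an illustrative assumption, not the paper's implementation: the function name, weights, and agreement bonus are all hypothetical, with the weights standing in for values that could be tuned on validation data (e.g. by an Optuna search).

```python
def fuse_field(text_pred: tuple[str, float],
               image_pred: tuple[str, float],
               w_text: float = 0.6,
               w_image: float = 0.4) -> tuple[str, float]:
    """Fuse one field's predictions from the text and image modalities.

    Each prediction is a (value, confidence) pair; the modality with the
    higher weighted confidence wins, and agreement boosts the final score.
    """
    t_val, t_conf = text_pred
    i_val, i_conf = image_pred
    if t_val == i_val:  # modalities agree: keep value, combine confidences
        return t_val, min(1.0, w_text * t_conf + w_image * i_conf)
    if w_text * t_conf >= w_image * i_conf:
        return t_val, w_text * t_conf
    return i_val, w_image * i_conf
```

Under this scheme a high-confidence OCR reading can override a weaker text-model prediction for numeric fields, which matches the abstract's observation that fusion helps most on noisy documents and digits.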
