Unified Donut Transformer Meets Oriented Object Detection: A Hybrid Framework for Structured Parsing of 2D Engineering Drawings
Abstract
The precise and structured extraction of critical technical specifications embedded in complex 2D engineering drawings is a cornerstone of high-fidelity manufacturing. Manual extraction is inefficient, error-prone, and ill-suited to intricate layouts, while conventional Optical Character Recognition (OCR) systems frequently falter on overlapping symbols, non-standard glyphs, and densely annotated regions, yielding unstructured and often unreliable textual output. To address these challenges, this work introduces a hybrid deep learning framework for robust structured information extraction. The core contribution is the integration of an Oriented Bounding Box (OBB) detector, implemented with YOLOv11, and a transformer-based Document Understanding Transformer (Donut) model. A curated in-house dataset was annotated to train the YOLOv11-OBB model to localize nine key annotation categories: Geometric Dimensioning and Tolerancing (GD&T), General Tolerances, Measures, Materials, Notes, Radii, Surface Roughness, Threads, and Title Blocks. Regions identified by the OBB detector are cropped into individual image patches, which, together with their structured JSON-formatted ground-truth labels, are used to fine-tune the Donut model to map visual inputs directly to structured data representations. Two fine-tuning strategies were evaluated: a unified model trained across all nine categories, and an ensemble of nine specialized models, each dedicated to a single category. Experimental results show that the unified model consistently outperforms the category-specific ensemble across all evaluation metrics, achieving higher precision (94.77% for GD&T), near-perfect recall (100% for most categories), a higher aggregate F1 score (97.3%), and a lower hallucination rate (5.23%). The proposed framework improves extraction accuracy, substantially reduces manual interpretation effort, and offers a scalable solution for deployment in precision-critical industrial sectors that depend on accurate drawing interpretation.
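For illustration, the two-stage inference flow summarized above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the authors' released implementation: the checkpoint names ("yolo11-obb-drawings.pt", "donut-drawings-unified"), the task prompt token "<s_drawing>", and the input file "drawing.png" are hypothetical placeholders, and only publicly documented Ultralytics and Hugging Face Transformers calls are used.

```python
# Minimal sketch of the detection-then-parsing pipeline described in the abstract.
# Checkpoint names, the task prompt, and the input file are illustrative assumptions.
import re

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel
from ultralytics import YOLO

# Stage 1: YOLOv11-OBB localizes the nine annotation categories with oriented boxes.
detector = YOLO("yolo11-obb-drawings.pt")      # hypothetical fine-tuned OBB checkpoint
result = detector("drawing.png")[0]

# Stage 2: a fine-tuned Donut model parses each cropped region into structured JSON.
processor = DonutProcessor.from_pretrained("donut-drawings-unified")
parser = VisionEncoderDecoderModel.from_pretrained("donut-drawings-unified")
parser.eval()

drawing = Image.open("drawing.png").convert("RGB")
task_prompt = "<s_drawing>"                    # assumed task start token from fine-tuning
prompt_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

records = []
for box, cls_id in zip(result.obb.xyxy, result.obb.cls):
    # The axis-aligned envelope of each oriented box is used for a simple rectangular crop.
    x1, y1, x2, y2 = map(int, box.tolist())
    patch = drawing.crop((x1, y1, x2, y2))

    pixel_values = processor(patch, return_tensors="pt").pixel_values
    with torch.no_grad():
        output_ids = parser.generate(
            pixel_values,
            decoder_input_ids=prompt_ids,
            max_length=512,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
        )

    # Decode, strip special tokens, and convert the field markup into a JSON-like dict.
    sequence = processor.batch_decode(output_ids)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "")
    sequence = sequence.replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task start token
    records.append({
        "category": detector.names[int(cls_id)],  # e.g. "GD&T", "Surface Roughness", ...
        "fields": processor.token2json(sequence),
    })

print(records)
```

In the category-specific ensemble variant compared in the abstract, the Donut checkpoint would simply be selected per detected category instead of sharing one unified model; the surrounding detection and cropping logic would remain the same.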