Unified Donut Transformer Meets Oriented Object Detection: A Hybrid Framework for Structured Parsing of 2D Engineering Drawings
Abstract
The precise and structured extraction of critical technical specifications embedded in complex 2D engineering drawings is a cornerstone of high-fidelity manufacturing. Manual extraction is inefficient, error-prone, and ill-suited to intricate layouts, while conventional Optical Character Recognition (OCR) systems frequently falter on overlapping symbols, non-standard glyphs, and densely annotated regions, yielding unstructured and often unreliable textual output. To address these challenges, this work introduces a hybrid deep learning framework for robust structured information extraction. The core contribution is the integration of an Oriented Bounding Box (OBB) detector, implemented with YOLOv11, and a transformer-based Document Understanding Transformer (Donut) model. A curated in-house dataset was annotated to train the YOLOv11-OBB model to localize nine key annotation categories: Geometric Dimensioning and Tolerancing (GD&T), General Tolerances, Measures, Materials, Notes, Radii, Surface Roughness, Threads, and Title Blocks. Regions identified by the OBB detector are cropped into individual image patches, which, together with their structured JSON-formatted ground-truth labels, are used to fine-tune the Donut model to map visual inputs directly to structured data representations. Two fine-tuning strategies were evaluated: a unified model trained across all nine categories, and an ensemble of nine specialized models, each dedicated to a single category. Experimental results show that the unified model consistently outperforms the category-specific ensemble across all evaluation metrics, achieving higher precision (94.77% for GD&T), near-perfect recall (100% for most categories), a higher aggregate F1 score (97.3%), and a lower hallucination rate (5.23%). The proposed framework improves extraction accuracy, substantially reduces manual interpretation effort, and offers a scalable solution for deployment in precision-critical industrial sectors that depend on accurate drawing interpretation.
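For illustration, the two-stage inference flow summarized above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the authors' released implementation: the checkpoint names ("yolo11-obb-drawings.pt", "donut-drawings-unified"), the task prompt token "<s_drawing>", and the input file "drawing.png" are hypothetical placeholders, and only publicly documented Ultralytics and Hugging Face Transformers calls are used.

```python
# Minimal sketch of the detection-then-parsing pipeline described in the abstract.
# Checkpoint names, the task prompt, and the input file are illustrative assumptions.
import re

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel
from ultralytics import YOLO

# Stage 1: YOLOv11-OBB localizes the nine annotation categories with oriented boxes.
detector = YOLO("yolo11-obb-drawings.pt")      # hypothetical fine-tuned OBB checkpoint
result = detector("drawing.png")[0]

# Stage 2: a fine-tuned Donut model parses each cropped region into structured JSON.
processor = DonutProcessor.from_pretrained("donut-drawings-unified")
parser = VisionEncoderDecoderModel.from_pretrained("donut-drawings-unified")
parser.eval()

drawing = Image.open("drawing.png").convert("RGB")
task_prompt = "<s_drawing>"                    # assumed task start token from fine-tuning
prompt_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

records = []
for box, cls_id in zip(result.obb.xyxy, result.obb.cls):
    # The axis-aligned envelope of each oriented box is used for a simple rectangular crop.
    x1, y1, x2, y2 = map(int, box.tolist())
    patch = drawing.crop((x1, y1, x2, y2))

    pixel_values = processor(patch, return_tensors="pt").pixel_values
    with torch.no_grad():
        output_ids = parser.generate(
            pixel_values,
            decoder_input_ids=prompt_ids,
            max_length=512,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
        )

    # Decode, strip special tokens, and convert the field markup into a JSON-like dict.
    sequence = processor.batch_decode(output_ids)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "")
    sequence = sequence.replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task start token
    records.append({
        "category": detector.names[int(cls_id)],  # e.g. "GD&T", "Surface Roughness", ...
        "fields": processor.token2json(sequence),
    })

print(records)
```

In the category-specific ensemble variant compared in the abstract, the Donut checkpoint would simply be selected per detected category instead of sharing one unified model; the surrounding detection and cropping logic would remain the same.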