AI Image Content Extraction Survey

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

With Artificial Intelligence (AI) rapidly increasing in popularity and presence in everyday life, new applications utilizing AI are being explored throughout virtually all domains, from banking and healthcare to cybersecurity to generative AI for images, voice, and video content creation. With that trend comes an inherent need for increased AI capabilities. One cornerstone of AI applications is the ability of Generative AI to consume documents and utilize their content to answer questions, generate new content, correlate it with other data sources, and more. No longer constrained to text alone, we now leverage multimodal AI models to help us understand visual elements within documents, such as images, tables, figures, and charts. Within this realm, capabilities have expanded exponentially from traditional Optical Character Recognition (OCR) approaches towards increasingly utilizing complex AI models for visual content analysis and understanding. Modern approaches, especially those leveraging AI, are now focusing on interpreting more complex diagrams such as flowcharts, block diagrams, electrical schematics, and timing diagrams. These diagrams combine text, symbols, and structured layout, making them challenging to parse and comprehend using conventional techniques. This paper presents a historical analysis and comprehensive survey of scientific literature exploring this domain of visual understanding of complex technical illustrations and diagrams. We explore the use of deep learning models, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based architectures. These models, along with OCR, enable the extraction of both textual and structural information from visually complex sources. Despite these advancements, numerous challenges remain, however. These range from hallucinations, where OCR systems produce outputs not grounded in the source image, leading to misinterpretations, to a lack of contextual understanding of diagrammatic elements, such as arrows, grouping, and spatial hierarchy. This survey focuses on four key diagram types: flowcharts, block diagrams, electrical schematics, and timing diagrams. It evaluates the effectiveness, limitations, and practical solutions — both traditional and AI-driven — that aim to enable the extraction of accurate and meaningful information from complex diagrams in a way that is trustworthy and suitable for real-world, high-accuracy AI applications. This survey reveals that virtually all approaches struggle with accurately extracting technical diagram information, and it illustrates a path forward. Pursuing research to improve their accuracy further is crucial for various applications, including complex document question answering and Retrieval Augmented Generation, document-driven AI agents, accessibility applications, and automation.

Article activity feed