A Transformer-Driven Clustering Framework for Image-Based Document Segregation of OCR-Extracted Data
Abstract
The rapid growth of image-based documents in industries such as healthcare, law, and government underscores the need for efficient techniques to organize unstructured datasets and extract meaningful insights from them. Traditional methods, including manual sorting and rule-based clustering, do not handle large-scale, noisy, and heterogeneous datasets effectively, which leaves a significant research gap. To address this, we propose the Enhancing Document Segregation (EDS) model, a framework that clusters image-based datasets by combining Optical Character Recognition (OCR), semantic analysis, and clustering algorithms. The EDS pipeline extracts text from images via OCR, preprocesses the text to remove noise, and generates embeddings with transformer-based models to capture semantic relationships. These embeddings are then clustered with K-means, DBSCAN, Gaussian Mixture Models, and agglomerative clustering so that results can be compared across varied data. Empirical analysis demonstrates the robustness of the EDS model in improving clustering accuracy and efficiency, particularly on noisy and complex datasets. By integrating theoretical foundations with practical clustering methodologies, the EDS model offers a scalable solution for real-world challenges, enhancing document organization and retrieval in critical domains.
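The pipeline described in the abstract (OCR text, noise removal, embedding, clustering) can be illustrated with a minimal sketch. The abstract does not name an OCR engine, a transformer model, or any parameters, so everything below is an assumption made for illustration: the input strings are hypothetical OCR output, a normalised bag-of-words vector stands in for the transformer embeddings, and only K-means is shown, written in plain Python so the example runs without extra libraries. The function names `preprocess`, `embed`, and `kmeans` are hypothetical and not part of the EDS model.

```python
import math
import re
from collections import Counter

# Hypothetical OCR output: two medical and two legal snippets with typical debris.
docs = [
    "Patient  diagnosis: hypertension; prescribed medication 5mg ##",
    "The patient was prescribed medication after diagnosis |||",
    "Court ruling: the defendant filed an appeal ~~",
    "Appeal filed by defendant; court ruling pending 12/03",
]

def preprocess(text):
    """Lowercase and keep alphabetic tokens of 3+ letters, dropping OCR debris."""
    return re.findall(r"[a-z]{3,}", text.lower())

def embed(tokens, vocab):
    """Stand-in for a transformer embedding: L2-normalised bag-of-words vector."""
    counts = Counter(tokens)
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next centroid: the vector farthest from its nearest existing centroid.
        centroids.append(
            max(vectors, key=lambda v: min(dist2(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: dist2(v, centroids[j])) for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

token_lists = [preprocess(d) for d in docs]
vocab = sorted({tok for toks in token_lists for tok in toks})
vectors = [embed(toks, vocab) for toks in token_lists]
labels = kmeans(vectors, k=2)

for doc, label in zip(docs, labels):
    print(label, doc)
```

On this toy input the two medical snippets end up in one cluster and the two legal snippets in the other. In a setting closer to the one the abstract describes, `embed` would presumably be replaced by a transformer encoder, and `kmeans` could be swapped for DBSCAN, a Gaussian Mixture Model, or agglomerative clustering to compare their behaviour on the same embeddings.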