FLACON: An Information-Theoretic Approach to Flag-Aware Contextual Clustering for Large-Scale Document Organization
Abstract
Enterprise document management faces a persistent challenge: traditional clustering methods group documents solely by content similarity and ignore organizational context such as priority, workflow status, and temporal relevance. This paper introduces FLACON (Flag-Aware Context-sensitive Clustering), an information-theoretic approach that captures multi-dimensional document context through a six-dimensional flag system covering the Type, Domain, Priority, Status, Relationship, and Temporal dimensions. FLACON formalizes document clustering as an entropy minimization problem in which the objective is to group documents with similar contextual characteristics. The approach combines a composite distance function, which integrates semantic content, contextual flags, and temporal factors, with adaptive hierarchical clustering and efficient incremental updates. This design addresses key limitations of existing solutions, including context-aware systems that lack domain-specific intelligence and LLM-based methods that require prohibitive computational resources. Evaluation across nine dataset variations shows a 7.8-fold improvement in clustering quality over traditional methods (Silhouette Score: 0.311 vs. 0.040) and performance comparable to GPT-4, reaching 89% of its clustering quality while running about 7× faster (60 s vs. 420 s for 10,000 documents). FLACON achieves O(m log n) complexity for incremental updates affecting m documents and behaves deterministically, which makes it suitable for compliance requirements. Consistent performance across business emails, technical discussions, and financial news supports the practical viability of this approach for large-scale enterprise document organization.
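As an illustration of the composite distance idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a weighted sum of three terms: cosine distance between content embeddings, the fraction of mismatched flags across the six dimensions, and an exponential function of the time gap. The `Document` class, the function names, the weights (0.5, 0.3, 0.2), and the 30-day time scale are all hypothetical choices made for this example; the abstract does not specify them.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

# The six flag dimensions named in the abstract.
FLAG_DIMENSIONS = ("type", "domain", "priority", "status", "relationship", "temporal")


@dataclass
class Document:
    """Hypothetical document record used only for this sketch."""
    embedding: List[float]   # semantic content vector
    flags: Dict[str, str]    # one value per flag dimension
    timestamp_days: float    # document time, in days since some epoch


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def flag_distance(f1: Dict[str, str], f2: Dict[str, str]) -> float:
    """Fraction of flag dimensions on which the two documents disagree."""
    mismatches = sum(1 for dim in FLAG_DIMENSIONS if f1.get(dim) != f2.get(dim))
    return mismatches / len(FLAG_DIMENSIONS)


def temporal_distance(t1: float, t2: float, scale_days: float = 30.0) -> float:
    """Grows from 0 toward 1 as the time gap increases (assumed form)."""
    return 1.0 - math.exp(-abs(t1 - t2) / scale_days)


def composite_distance(d1: Document, d2: Document,
                       w_content: float = 0.5,
                       w_flags: float = 0.3,
                       w_time: float = 0.2) -> float:
    """Weighted sum of content, flag, and temporal distances (assumed weights)."""
    return (w_content * cosine_distance(d1.embedding, d2.embedding)
            + w_flags * flag_distance(d1.flags, d2.flags)
            + w_time * temporal_distance(d1.timestamp_days, d2.timestamp_days))
```

A pairwise matrix built from such a function could be passed to any hierarchical clustering routine that accepts precomputed distances. Under this sketch, two documents with identical text but different Priority or Status flags end up farther apart than a content-only metric would place them, which is the behavior the abstract motivates.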