A large-scale, granular topic classification system for scientific documents
Abstract
Knowledge Organisation Systems (KOSs) are crucial for the search, retrieval, and analysis of the vast volumes of academic research, but they are challenging to scale in a way that preserves granularity, breadth, and quality. Topics constructed with data-driven algorithms are a key type of KOS in scientometrics, and developing high-quality systems for topic construction remains a cornerstone of scientometrics research. We present a topic construction and classification system that advances the state of the art in breadth and granularity: it consists of over 29,000 topics organised into a four-level hierarchy, while achieving high quality as measured both quantitatively and via expert judgment. The paper makes three key contributions that address clear gaps in the current state of the art: first, documenting our approach to building a broad and granular topic solution; second, demonstrating that a supervised classifier can be trained successfully for a large number of topics and used to assign topics to new documents at scale; third, introducing a new evaluation measure for assessing topic coherence at scale. The paper exemplifies how citation clustering and Natural Language Processing (NLP) can be flexibly combined within scientometrics to advance the state of the art.