LLM-Augmented Innovation Regime Classification: A Hybrid Framework for Patentometric Foresight
Abstract
Patent-based scientometrics has developed rich longitudinal indicators for characterizing technology innovation dynamics, yet translating continuous indicator vectors into discrete, interpretable regime categories (the vocabulary of technology foresight) remains methodologically underdeveloped. Existing rule-based classifiers generate high rates of unclassifiable observations and systematically neglect the qualitative semantic content embedded in patent text. This paper introduces a three-layer hybrid framework that integrates quantitative patentometric indicators, a purpose-built ontological classification scheme, and large language model (LLM) semantic calibration to address this gap. Rather than deploying LLMs as standalone classifiers, the framework formalizes their role as structured calibration agents within indicator-based scientometric workflows. The framework is evaluated on 41,475 green hydrogen patents across three Cooperative Patent Classification subdomains (C25B, H01M, Y02E) spanning 2005–2024. The first layer computes seven patentometric indicators across 54 rolling three-year windows; the second layer maps indicator profiles to the Minimal Foresight Ontology (MFO v1.0), an eight-regime categorical scheme with percentile-anchored threshold conditions; and the third layer employs Qwen2.5-3B-Instruct to adjudicate structurally ambiguous observations under a conservative dual-condition asymmetric overwrite rule. Calibrated regime sequences are then subjected to first-order Markov chain analysis and predictive validity testing. LLM calibration resolves the 38.9% of observations left unclassified by the rule-based layer and increases regime label diversity by ΔH = +0.298 bits. Divergence cluster analysis reveals that epistemic misalignment between text-based and indicator-based signals concentrates in periods of rapid structural change.
Markov analysis identifies Emerging Trajectory as the dominant long-run attractor (π = 0.433), Volatile Expansion as the most self-persistent regime (E[T] = 2.50 windows), and current regime labels as significant predictors of next-window Shannon entropy, semantic drift, and patent volume. The proposed framework contributes a replicable pipeline for LLM-augmented patent foresight and establishes the first empirical Markov characterization of innovation regime transition dynamics in a calibrated patent corpus.
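The Markov quantities reported above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy regime sequence, and the two-state matrix are hypothetical. It assumes that π denotes the stationary distribution of the estimated transition matrix and that E[T] is the expected sojourn time 1 / (1 − p_ii) of a first-order chain. Under that reading, a self-transition probability of 0.6 gives E[T] = 2.5 windows.

```python
from collections import Counter


def transition_matrix(sequence, states):
    """Row-normalized first-order transition frequencies for a label sequence."""
    counts = Counter(zip(sequence[:-1], sequence[1:]))
    matrix = []
    for src in states:
        total = sum(counts[(src, dst)] for dst in states)
        # A state with no observed outgoing transition keeps an all-zero row.
        matrix.append([counts[(src, dst)] / total if total else 0.0 for dst in states])
    return matrix


def stationary_distribution(P, iterations=10_000, tol=1e-12):
    """Approximate pi with pi = pi P by power iteration.

    Assumes an irreducible, aperiodic chain; otherwise the iteration
    may not converge to a unique distribution.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def expected_sojourn(P):
    """Expected consecutive windows spent in each state: 1 / (1 - p_ii)."""
    return [
        float("inf") if row[i] >= 1.0 else 1.0 / (1.0 - row[i])
        for i, row in enumerate(P)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequence of window labels, not data from the paper.
    states = ["Emerging Trajectory", "Volatile Expansion"]
    sequence = [
        "Emerging Trajectory", "Emerging Trajectory", "Volatile Expansion",
        "Volatile Expansion", "Volatile Expansion", "Emerging Trajectory",
        "Emerging Trajectory", "Volatile Expansion", "Emerging Trajectory",
    ]
    P = transition_matrix(sequence, states)
    print("P =", P)
    print("pi =", stationary_distribution(P))
    print("E[T] =", expected_sojourn(P))
```

Power iteration is used here only to keep the sketch dependency-free; an eigenvector solver would serve equally well for a matrix over eight regimes.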