Measuring Relatedness from Co-Occurrence Data: Item- and Context-Size Corrections in Co-Classification and Co-Location Data


Abstract

Measures of technological relatedness based on co-occurrence data are widely used in scientometrics and economic geography, yet raw co-occurrence counts are biased by differences in item frequency and in the size of the contexts in which items appear. Despite this, normalization practices often lack grounding in the underlying data-generating process (DGP), which risks mismeasuring relatedness and misinterpreting economic dynamics. This paper argues that the appropriate normalization depends on the DGP: co-classification data reflect deliberate, expert-curated assignments and therefore call for set-theoretic normalizations and, where needed, a correction for context-size differences, whereas co-location data partly arise mechanically from finite contexts and thus require probabilistic normalizations that explicitly adjust for item and context size. Using patent data at the CPC 4-digit level, I compare representative normalization measures across co-classification and co-location datasets and assess their ability to predict regional diversification patterns. Results show modest differences in sparse co-classification data but substantial performance variation in denser co-location data, where context-weighted probabilistic measures, such as a context-sensitive PMI and the Ellison–Glaeser index, outperform alternatives. The findings demonstrate that normalization is not a technical afterthought but must be aligned with the DGP, with implications for empirical studies relying on co-occurrence data across scientific and technological domains.

JEL Classification: C18, R12.
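
To make the contrast between raw counts, set-theoretic normalization, and probabilistic normalization concrete, the following is a minimal sketch using textbook definitions of the Jaccard index and pointwise mutual information (PMI). It is not the paper's implementation: the function name `cooccurrence_measures`, the toy contexts, and the choice of Jaccard as the set-theoretic example are assumptions made for illustration, and the context-size weighting discussed in the abstract is deliberately left out.

```python
import math
from itertools import combinations


def cooccurrence_measures(contexts):
    """Raw count, Jaccard index and plain PMI for every co-occurring item pair.

    `contexts` is a list of sets of item labels (hypothetical input format),
    e.g. the CPC classes on one patent or the classes present in one region.
    """
    n_contexts = len(contexts)
    item_count = {}   # number of contexts in which each item appears
    pair_count = {}   # number of contexts in which both items appear

    for items in contexts:
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for a, b in combinations(sorted(items), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    results = {}
    for (a, b), c_ab in pair_count.items():
        # Set-theoretic: shared contexts relative to the union of contexts.
        jaccard = c_ab / (item_count[a] + item_count[b] - c_ab)
        # Probabilistic: observed joint frequency relative to the frequency
        # expected if the two items occurred independently.
        pmi = math.log(c_ab * n_contexts / (item_count[a] * item_count[b]))
        results[(a, b)] = {"count": c_ab, "jaccard": jaccard, "pmi": pmi}
    return results


# Toy data (assumed for illustration): H04L is frequent, A01C is rare.
contexts = [
    {"A01B", "A01C"},
    {"A01B", "A01C", "H04L"},
    {"H04L", "G06F"},
    {"H04L", "G06F"},
    {"H04L", "A01B"},
    {"H04L"},
]
measures = cooccurrence_measures(contexts)
for pair, values in sorted(measures.items()):
    print(pair, values)
```

In this toy example the pairs (A01B, A01C) and (A01B, H04L) have the same raw co-occurrence count, yet both normalizations rank the first pair higher, because H04L appears in most contexts and co-occurs with A01B less often than independence would predict. Neither measure here accounts for how many items a context contains, which is the additional correction the abstract argues for.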
