Navigating the Maze of Measurement: Large Language Models for objective instrument selection
Abstract
Background: The proliferation of psychological measures and constructs has led to substantial conceptual fragmentation, complicating instrument selection and undermining content validity. Traditional expert-based procedures are often resource-intensive and subjective, underscoring the need for objective, scalable assessment methods.

Aims and Methods: This study evaluates the capability of Large Language Models to perform scalable content assessments, comparing embedding-based semantic similarity with prompt-based generative approaches. The methodology was validated across 13 scales for Internet Gaming Disorder (IGD) and applied to 7 established measures of depression, benchmarking results against theoretical criteria and expert consensus.

Results: Models demonstrated strong capability across two key tasks. Generative models achieved high classification accuracy (Cohen's κ up to 0.90) in mapping scale items to their theoretically intended symptoms. Furthermore, aggregated semantic similarity derived from embedding models correlated strongly with overall expert rankings of the scales' content validity (ρ = 0.89), validating their use for objective instrument triage. Importantly, the open-source model (intfloat/e5-large-v2) successfully replicated expert consensus in the depression application. However, models struggled to reliably replicate fine-grained symptom-level quality assessments.

Conclusion: Both embedding and generative models offer a powerful, scalable, and theory-referenced heuristic for psychometric triage. By providing an objective and cost-effective way to rank item-construct alignment, this approach helps researchers efficiently select the measurement tools with the highest content validity for a predefined construct.
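To give a concrete sense of the embedding-based approach, the sketch below scores example scale items against example symptom descriptions with the open-source model named in the abstract (intfloat/e5-large-v2). It is a minimal illustration under assumptions, not the authors' exact pipeline: the use of the sentence-transformers library, the placeholder items and symptom texts, and the max/mean aggregation are all illustrative choices.

```python
# Illustrative sketch (not the authors' exact pipeline): align scale items with
# theory-derived symptom descriptions via embedding similarity, using the
# open-source model named in the abstract (intfloat/e5-large-v2).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

# Hypothetical construct definition: short descriptions of intended symptoms.
# E5 models are typically used with "query:"/"passage:" prefixes, as below.
symptoms = [
    "passage: Preoccupation with gaming",
    "passage: Loss of interest in previously enjoyed activities",
]

# Hypothetical scale items to be triaged against the construct.
items = [
    "query: I spend much of my day thinking about playing games.",
    "query: I have given up hobbies I used to enjoy in order to play.",
]

# Embed both sets with L2-normalized vectors.
item_emb = model.encode(items, normalize_embeddings=True)
symptom_emb = model.encode(symptoms, normalize_embeddings=True)

# Cosine-similarity matrix: rows = items, columns = symptoms.
sim = util.cos_sim(item_emb, symptom_emb)

# Item-level task: map each item to its best-matching symptom.
best_match = sim.argmax(dim=1)

# Scale-level task: aggregate similarities into one content-alignment score.
scale_score = sim.max(dim=1).values.mean()

print(best_match.tolist(), float(scale_score))
```

Read against the abstract, the item-level matches correspond to the classification task evaluated with Cohen's κ, while the aggregated scale-level score corresponds to the kind of summary used to rank scales against expert consensus; the specific aggregation shown here is only one plausible choice.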