Using Large Language Models to Obtain Quantitative Data
Abstract
This study evaluated the capability of Claude 3.5 Sonnet (a large language model, LLM) to generate quantitatively analyzable data. Two datasets were generated: biological measurements for 327 land animals and simulated questionnaire responses for 202 famous people. For the biological data, Claude estimated 29 characteristics (e.g., size, speed, aggressiveness) as z-scores relative to all land animals. For the psychometric data, Claude provided predicted responses to 60 Likert-style questions from PhillyMatch.org for each famous person. Validation of the biological estimates against the AnAge database showed a strong positive correlation for animal size (r=0.85) and a strong negative correlation for metabolic rate (r=-0.74). Network analysis of the biological data revealed meaningful taxonomic clustering, with mammals distinctly separated from other classes and interpretable exceptions (e.g., the golden poison frog clustering with reptiles due to toxin resistance). Factor analysis of the psychometric data yielded four interpretable factors: Conservative Traditionalism, Impulsive Escapism, Pro-Family Orientation, and Sport-Focused Conventionalism. Network analysis of the famous individuals showed meaningful clustering beyond simple professional categories, such as grouping individuals with violent histories regardless of their primary occupation. Notably, the factor structure Claude predicted directly, without quantitative data, differed from the structure recovered by analyzing its generated data, suggesting distinct implicit and explicit knowledge representations. While these results demonstrate LLMs' potential for generating analyzable quantitative data, further validation against real-world data is needed before application in high-stakes domains.
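To make the analysis pipeline described above concrete, the sketch below illustrates how LLM-generated trait estimates might be validated against a reference database and how simulated Likert responses might be submitted to factor analysis. It is a minimal illustration only, not the study's actual code: the file names, column names (e.g., size_z, anage_body_mass_log), and the choice of the factor_analyzer package are assumptions introduced for the example.

```python
# Illustrative sketch (hypothetical file and column names, not the study's code):
# (1) validate LLM-estimated z-scores against AnAge reference values,
# (2) extract a four-factor varimax-rotated solution from simulated Likert data.
import pandas as pd
from scipy.stats import pearsonr
from factor_analyzer import FactorAnalyzer

# --- Validation of biological estimates ---
# llm_animals.csv: one row per animal with LLM-estimated z-scores (e.g., size_z)
# anage.csv: reference measurements keyed by species name
llm = pd.read_csv("llm_animals.csv")
anage = pd.read_csv("anage.csv")
merged = llm.merge(anage, on="species", how="inner")

r_size, p_size = pearsonr(merged["size_z"], merged["anage_body_mass_log"])
print(f"Body size: r={r_size:.2f} (p={p_size:.3g})")

# --- Factor analysis of simulated questionnaire responses ---
# likert.csv: one row per famous person, 60 Likert-style item columns
likert = pd.read_csv("likert.csv")
fa = FactorAnalyzer(n_factors=4, rotation="varimax")
fa.fit(likert)

# Loadings matrix: rows are items, columns are the four extracted factors
loadings = pd.DataFrame(fa.loadings_, index=likert.columns)
print(loadings.round(2))
```

The same merged table could also feed a similarity network (e.g., correlating animals' trait vectors and linking highly similar pairs) to reproduce the clustering analyses summarized in the abstract; that step is omitted here for brevity.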