Measuring complex psychological and sociological constructs in large-scale text
Abstract
In recent years, there has been an increasing exchange between social science and machine learning. In principle, natural language processing enables social scientists to systematically process large amounts of text, while rich domain knowledge helps machine learning scholars build valid models of social phenomena. However, clear guidelines for constructing valid and reliable mixed-methods approaches are lacking; such guidelines could increase the rigor and comparability of computational social science research. We provide a set of guidelines for leveraging human data annotation and automatic text classification at scale in five stages: (1) classification scheme development, (2) data labeling, (3) model selection, (4) model training and performance improvement, and (5) statistical analysis. Using examples from our own research on countering online hate, we outline potential problems and their respective solutions. We demonstrate how consistently integrating expertise from social science and machine learning can enhance the study of diverse social phenomena.