Bridging Cultures in the Era of Big Data: A Cross-Language Equivalence Framework in Machine Learning Research with Social Media Texts
Abstract
Past research on cross-cultural equivalence has focused on statistical procedures and techniques for ensuring measurement equivalence in tests and surveys. With the rise of big data and machine learning (ML), particularly natural language processing (NLP), researchers now have powerful tools to study culture using large-scale, organic language data from social media. However, the lack of methodological guidance on how to establish cross-language equivalence in cross-cultural studies, especially with multilingual or culturally diverse text data, poses a major challenge. To address this gap, this paper proposes a framework that raises awareness of key equivalence challenges and offers practical guidance for reducing measurement biases when applying ML techniques to social media language data. The framework outlines five types of equivalence following the ML pipeline from data collection to evaluation: source equivalence, sample equivalence, input equivalence, psychological ground truth equivalence, and model performance equivalence. We also draw parallels with survey-based research to highlight shared conceptual challenges and identify future directions for advancing cross-cultural research with big data and computational linguistic methods.