Using large language models to track national identities
Abstract
National identity drives a range of political action, from civic engagement to intergroup violence. We present a novel and accessible approach that uses large language models (LLMs) to label texts for expressions of positive (national identification, patriotism) and defensive (nationalism, national narcissism) national identities. We test four popular LLMs (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro, and Llama 4 Maverick) across 13 million words sourced from social media, surveys, and political speeches in 25 languages. Our findings reveal that LLMs label texts reliably and in a theoretically consistent manner, producing judgments that closely match those of domain experts and outperform both dictionary-based approaches and crowdworkers, while reducing the cost by a factor of 1,000 compared to the latter. An analysis of US presidential addresses using the best-performing LLM (GPT-4o) reveals that expressions of national identities doubled over the twentieth century. Further studies reveal differences between Republicans and Democrats: defensive national identities were five times more prevalent in Republicans' social media posts than in Democrats'. Such identities were also frequent in the speeches of populist leaders around the globe. Together, these findings demonstrate that LLMs offer a reliable, valid, accessible, and cost-effective approach to labeling texts for nuanced expressions of national identity, enabling new insights into its role in contemporary and historical social trends.