CCF Database: A Machine-Learning-Annotated Corpus of 266,271 Canadian Climate Articles (1978–2024)

Abstract

Climate discourse does not merely mirror public debate around climate change, but actively shapes how societies perceive and respond to the crisis. Yet large-scale, fine-grained analysis of climate discourse remains hindered by fragmented corpora and ad hoc coding schemes. The Canadian Climate Framing (CCF) project, introduced in this article, releases a large-scale annotated database, a generalizable annotation framework, and accompanying training data to overcome these limitations. The CCF comprises 266,271 articles from 20 Canadian newspapers (1978–2024), processed into 9.2 million bilingual sentences (82.9% English, 17.1% French) with 65 hierarchical annotations: eight thematic frames, nine actor types, eight event categories, solution strategies, emotional tone, geographic focus, and named entities. Construction relied on transformer classifiers (BERT/CamemBERT) trained through human-in-the-loop iteration on 4,000+ expert-coded sentences, validated against a gold standard (F1=0.866) with confirmed intercoder reliability. Four analytical applications demonstrate the database’s research potential: quantifying associations between political actors and science skepticism, characterizing the structure of discursive polarization, modeling editorial prioritization of frames, and reconstructing networks of epistemic authority. The CCF, its underlying framework, and its training data are all designed for replication across national and linguistic contexts.
