Social-RAG: A Retrieval-Augmented Generation Pipeline for Computational Social Science Research on Telegram

Leonardo F. Nascimento
Eric Brasil
Ruan Arthur Lima Santos
Gabriel Andrade
Ricardo Sodré Andrade
Tarssio Barreto

Abstract

Digital trace data have expanded empirical opportunities in the social sciences while intensifying the methodological challenge of scale: researchers increasingly face corpora too large and fast-moving to read exhaustively without sacrificing interpretive rigor. This article presents Social-RAG, a modular Retrieval-Augmented Generation (RAG) architecture designed to support scalable qualitative inquiry over large text corpora while preserving evidence traceability, auditability, and researcher control. Our empirical basis consists of messages from public Telegram groups and channels, organized into two thematic subsets: vaccine-related discourse and debates surrounding Brazil’s Lei Rouanet cultural funding policy. We detail key design decisions, including a “one post = one chunk” indexing strategy, semantic retrieval over vector embeddings with efficient approximate nearest neighbor (ANN) search, an Adaptive-K dynamic cutoff for context selection, Maximal Marginal Relevance (MMR) re-ranking for diversity, and structured analytical instructions that constrain generation to retrieved evidence. We evaluate system behavior using two complementary question blocks, hermeneutic (narrative) and factual, and compare outputs across three language models with distinct deployment profiles (a local open-weight model, a cloud open-weight model, and a commercial closed model), using an LLM-as-judge protocol with explicit qualitative criteria. Results show consistent behavior across both thematic corpora and highlight a key trade-off: the two larger models (the cloud open-weight model and the commercial closed model) perform similarly and robustly in both narrative and factual tasks when evidential discipline is maintained, whereas the smaller local model remains useful for exploratory narrative synthesis but is less reliable for strict factual extraction and attribution. We conclude by discussing methodological implications, limitations, and future directions, with a focus on scalability and extensibility to new data types and analytical problems.
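
To make two of the retrieval steps named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each post is already embedded as a plain list of floats (one post = one chunk). The function names, the "largest score gap" rule used for the adaptive cutoff, the default `lam` weight, and the order in which the two steps are chained are all illustrative assumptions; the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def adaptive_k(scores, k_min=1, k_max=10):
    """Choose how many results to keep by cutting at the largest drop
    between consecutive similarity scores (an assumed heuristic)."""
    ranked = sorted(scores, reverse=True)[:k_max]
    if len(ranked) <= k_min:
        return len(ranked)
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Cutting after position i keeps i + 1 items; never keep fewer than k_min.
    cut = max(range(k_min - 1, len(gaps)), key=lambda i: gaps[i])
    return cut + 1


def mmr(query, docs, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick documents that are
    relevant to the query but not redundant with earlier picks.
    Returns the indices of the selected documents, in selection order."""
    relevance = [cosine(query, d) for d in docs]
    selected = []
    candidates = list(range(len(docs)))

    def score(i):
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
        return lam * relevance[i] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D "embeddings": posts 0 and 1 are near-duplicates of each other.
query = [1.0, 0.3]
posts = [[1.0, 0.0], [0.99, -0.1], [0.6, 0.8], [0.0, 1.0]]

scores = [cosine(query, p) for p in posts]
k = adaptive_k(scores)                  # cutoff derived from the score gaps
context = mmr(query, posts, k, lam=0.7) # diverse subset of size k
print(k, context)
```

With `lam=1.0` the selection reduces to a plain relevance ranking; lowering `lam` pushes the near-duplicate post further down the list in favor of a more distinct one.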

Version published to 10.31235/osf.io/wmc2q_v1 on OSF Preprints
Feb 19, 2026

Related articles

Synthetic Participants Generated by Large Language Models: A Systematic Literature Review

This article has 3 authors:
1. Eduard Kuric
2. Peter Demcak
3. Matus Krajcovic
This article has no evaluations
Latest version Mar 10, 2026

Multilingual Rag Agents For Localized Knowledge: Adaptive Indexing For Under-Represented Languages

This article has 2 authors:
1. Nnaemeka Kingsley Ugwumba
2. Kelechi Ernest Okechukwu
This article has no evaluations
Latest version Jan 29, 2026

Restructuring scientific papers for human and AI readers

This article has 1 author:
1. Zhicheng Lin
This article has no evaluations
Latest version Jan 21, 2026