Linguistic Polarity and Decision Architecture in Large Language Model–Based Abstract Screening in the Dental Field
Abstract
Large language models (LLMs) are increasingly investigated for abstract screening in systematic reviews, yet it remains unclear whether screening errors attributed to linguistic complexity reflect intrinsic semantic limitations or the decision architecture in which the model is embedded. We investigated how five polarity variants of logically equivalent eligibility criteria—affirmative inclusion, antonymic exclusion, predicate negation, verb-level negation, and double negation—affect screening outcomes in a controlled biomedical task. Using 1,000 abstracts derived from a reconstructed Cochrane review corpus (50 eligible TARGET studies; 950 non-targets), we implemented four abstract-visible criteria within a sequential hard-gated pipeline, where failure at any step triggered irreversible exclusion. Under hard gating, linguistic polarity alone produced substantial sensitivity shifts. For GPT-5.1, recall ranged from 0.72 to 0.32 despite identical logical predicates and input data. Replication with GPT-3.5 Turbo yielded a similar polarity-dependent divergence (recall range 0.92–0.18), confirming that the effect generalizes across model generations. TARGET losses were highly concentrated at criteria frequently satisfied but inconsistently reported in abstracts, consistent with conservative exclusion under evidential underspecification. To assess whether this effect was semantic or architectural, we reimplemented screening using a scoring-based evidence-accumulation framework in which each criterion contributed graded support (YES/NO/UNCLEAR) and inclusion was determined by a tunable score threshold. Scoring substantially reduced polarity-driven recall divergence and transformed it into an explicit precision–recall trade-off.
These findings indicate that negation sensitivity in LLM screening is strongly mediated by decision architecture: irreversible Boolean gating amplifies linguistic asymmetries under uncertainty, whereas cumulative scoring preserves uncertainty and enables controllable operating points.
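The contrast between the two decision architectures can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's implementation: the per-vote scores and the 0.75 threshold are assumed values chosen for the example, and the vote labels follow the YES/NO/UNCLEAR scheme described in the abstract.

```python
# Illustrative sketch of hard-gated vs. scoring-based screening.
# VOTE_SCORE values and the threshold are assumptions for this example.
VOTE_SCORE = {"YES": 1.0, "UNCLEAR": 0.5, "NO": 0.0}

def hard_gated(votes):
    """Sequential Boolean gating: any non-YES verdict on a criterion
    triggers irreversible exclusion (UNCLEAR is treated as failure)."""
    return all(v == "YES" for v in votes)

def scoring(votes, threshold=0.75):
    """Evidence accumulation: each criterion contributes graded support;
    inclusion is decided by a tunable score threshold."""
    score = sum(VOTE_SCORE[v] for v in votes) / len(votes)
    return score >= threshold

# A TARGET abstract satisfying all criteria, one of which is under-reported:
votes = ["YES", "UNCLEAR", "YES", "YES"]
hard_gated(votes)  # False: the single UNCLEAR causes irreversible exclusion
scoring(votes)     # True: 0.875 clears the threshold, preserving recall
```

Lowering or raising `threshold` moves the operating point along the precision–recall curve, which is the controllability that hard gating lacks.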