Harnessing Human Uncertainty to Train More Accurate and Aligned AI Systems

Abstract

AI-augmented decision-making (AIADM) aims to leverage the computational power of machine learning (ML) models to assist humans in their decision-making processes. In many such systems, especially for complex tasks like medical image classification, ML models are trained on large datasets annotated by humans. Failing to account for human decision-making biases when constructing these labeled datasets can introduce bias into the data, and models trained on such data can inherit that bias. We propose a novel approach to developing AIADM systems that aims to overcome these challenges by harnessing human uncertainty. Our approach has three elements: we collect subjective judgments from human annotators, we calibrate those judgments, and we use the recalibrated judgments to create probabilistic (i.e., soft) labels on which the AI decision aid is then trained. We evaluate our methods through two studies using data from DiagnosUs, a crowdsourcing platform for medical image annotation. Across multiple training datasets, we assess how our proposed methods affect three key properties of AI decision aids: accuracy, calibration, and alignment with human uncertainty. We refer to these properties as the AIADM tri-criteria. Our results show that ML models trained on recalibrated soft labels are more accurate and better aligned with expert judgments. We also observe a tradeoff between ML calibration and alignment with human uncertainty. These findings highlight the value of capturing and correcting human uncertainty in ML training data and the need to consider the tri-criteria when developing AI systems.
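To make the pipeline concrete, the sketch below shows one plausible way to (1) recalibrate annotators' subjective probabilities and (2) train a classifier on the resulting soft labels. It is a minimal illustration under assumptions, not the authors' implementation: the logistic recalibration curve, the binary-classification setup, and all variable names are hypothetical, and the toy data stands in for the DiagnosUs annotations.

# Minimal sketch (illustrative assumptions, not the paper's method):
# recalibrate annotator probabilities, then train on soft labels.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def recalibrate(raw_probs, a=1.5, b=0.0):
    # Pass raw annotator probabilities through a logistic recalibration curve.
    # In practice (a, b) would be fit on items with known ground truth;
    # fixed values are used here purely for illustration.
    logits = np.log(raw_probs / (1.0 - raw_probs))
    return 1.0 / (1.0 + np.exp(-(a * logits + b)))

# Toy data: 100 items with 8-dim features and an annotator probability
# for class 1 per item (stand-in for aggregated subjective judgments).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)).astype(np.float32)
raw = rng.uniform(0.05, 0.95, size=100)
p1 = recalibrate(raw)                                   # recalibrated P(class = 1)
soft_labels = np.stack([1.0 - p1, p1], axis=1).astype(np.float32)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

X_t = torch.from_numpy(X)
y_t = torch.from_numpy(soft_labels)
for _ in range(200):
    opt.zero_grad()
    # Cross-entropy against soft (probabilistic) targets instead of hard labels.
    loss = -(y_t * F.log_softmax(model(X_t), dim=1)).sum(dim=1).mean()
    loss.backward()
    opt.step()

Training against soft targets in this way lets the model's predicted probabilities track the (recalibrated) distribution of human judgments rather than collapsing each item to a single hard label, which is the property the tri-criteria evaluation of calibration and alignment is meant to probe.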
