Dataset Documentation for Responsible AI: Analysis of Suitability and Usage for Health Datasets

Anna Heinke
LingLing Huang
Kyongmi U. Simpkins
Fritz Gerald P. Kalaw
Apoorva Karsolia
Kiratjit Singh
Sanjay Soundarajan
Camille Nebeker
Sally L. Baxter
Cecilia S. Lee
Aaron Y. Lee
Bhavesh Patel

Abstract

Artificial Intelligence (AI) is rapidly transforming healthcare, but it is also raising concerns about algorithmic biases, which stem largely from the training data. Transparent dataset documentation is widely regarded as key to enabling responsible AI development. Several standardized dataset documentation approaches have been established, such as Datasheet, Dataset Nutrition Label, Accountability Documentation, Healthsheet, and Data Card. However, their suitability and usage for health datasets remain unclear. In this work, we compared all five approaches and evaluated their alignment with the STANDING Together Recommendations for Documentation of Health Datasets. We also investigated their real-world usage and gathered insights from generators and consumers of health datasets. Our findings reveal that none of these documentation approaches is widely used or fully suited for health datasets. We recommend developing a standard documentation approach for health datasets, along with clear guidelines and automation tools to support adoption.

Version published to 10.1101/2025.11.18.689064 on bioRxiv
Nov 19, 2025
