Dataset Documentation for Responsible AI: Analysis of Suitability and Usage for Health Datasets

doi:10.1101/2025.11.18.689064

2025 · doi:10.1101/2025.11.18.689064

preprint OA: closed

Abstract

Artificial Intelligence (AI) is rapidly transforming healthcare, but it is also raising concerns about algorithmic biases that mostly stem from the training data. It is widely accepted that transparent dataset documentation is key to enabling responsible AI development. Several standardized dataset documentation approaches have been established, such as Datasheet, Dataset Nutrition Label, Accountability Documentation, Healthsheet, and Data Card. However, their suitability and usage for health datasets remain unclear. In this work, we compared all five approaches and evaluated their alignment with the STANDING Together Recommendations for Documentation of Health Datasets. We also investigated their real-world usage and gathered insights from generators and consumers of health datasets. Our findings reveal that none of these documentation approaches is widely used or fully suited for health datasets. We recommend developing a standard documentation approach for health datasets, along with clear guidelines and automation tools to support adoption.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

Data availability

The data associated with this manuscript consist of several Excel files (mentioned in the Methods and Results section). Since no FAIR guidelines were found for structuring such data, we structured them according to the SPARC Data Structure (SDS), which provides a broad data and metadata structure to organize biomedical research data in line with the FAIR principles [27]. The SPARC data curation software SODA for SPARC was used to organize the data and prepare the metadata files [28,29]. The dataset is maintained in a GitHub repository called “dataset-documentation-paper-data” in the AI-READI GitHub organization, and the version associated with this manuscript (v1.0.0) is also archived on Zenodo [30]. The data are shared under the permissive Creative Commons Attribution 4.0 International (CC-BY) license.
