Data Quality Verification Metrics in Medicine: Experiments and Evaluations from the Perspectives of Safety and Utility

doi:10.21203/rs.3.rs-7152856/v1

2025 · doi:10.21203/rs.3.rs-7152856/v1

preprint OA: closed

Research Article · Research Square

Authors: Minsu Lee (Konyang University); Pildong Hwang, Seunghee Lee, Sung Ryul Shim, Jong-Yeup Kim, and Jieun Shin (Konyang University Hospital). Corresponding author: Jieun Shin.

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-7152856/v1

This work is licensed under a CC BY 4.0 License (https://creativecommons.org/licenses/by/4.0/).

Status: Posted. Version 1, posted August 5th, 2025.

Abstract

Medical data is crucial not only for research but also for clinical decision support systems (CDSS). However, its use is often limited by strict privacy concerns and data scarcity, particularly for fields with small patient cohorts like rare diseases. Synthetic data is emerging as a promising solution, yet universal and standardized quality metrics are still lacking.

This study reviews and categorizes a range of metrics to evaluate the quality of medical synthetic data, followed by experimental validation using MIMIC-III admissions data (categorical) and AI-Hub dementia data (continuous). Safety was evaluated based on the risk of re-identification and membership inference attacks, while utility was assessed by measuring distributional similarity and the consistency of analytical results.

The synthetic categorical data (Admissions) demonstrated high utility and safety across most metrics. However, a low Nearest Neighbor Adversarial Accuracy (NNAA) score suggested a significant risk of the model overfitting to the original data. Conversely, the continuous data (Dementia) exhibited low utility and safety, confirming that generation methods must be tailored to data characteristics to preserve quality.

Ultimately, this study proposes a structured framework for evaluating medical synthetic data and highlights the critical need to select metrics appropriate for specific data types to ensure a reliable quality assessment.

Keywords: medical synthetic data, synthetic data quality, evaluation metrics, synthetic data safety, synthetic data utility

Additional Declarations: No competing interests reported.
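
The abstract reads a low NNAA score as a sign of overfitting. The sketch below illustrates why, using the commonly cited nearest-neighbor formulation of adversarial accuracy; the excerpt does not give the paper's exact definition, distance measure, or preprocessing, so the Euclidean distance, the function names `nearest` and `nnaa`, and the toy records are all assumptions made for illustration. Under this formulation a score near 0.5 means real and synthetic records are hard to tell apart, a score near 0 means synthetic records sit closer to real records than real records sit to each other (consistent with copying or overfitting), and a score near 1 means the two sets are easily separated.

```python
from math import dist, inf


def nearest(point, others, skip_index=None):
    """Distance from `point` to its nearest neighbor in `others`.

    `skip_index` excludes the point itself when `others` is its own set.
    """
    return min(
        (dist(point, o) for j, o in enumerate(others) if j != skip_index),
        default=inf,
    )


def nnaa(real, synth):
    """Nearest-neighbor adversarial accuracy (illustrative formulation).

    Assumption: records are numeric tuples compared by Euclidean distance.
    """
    # Share of real records whose nearest synthetic record is farther away
    # than their nearest other real record.
    real_term = sum(
        nearest(r, synth) > nearest(r, real, i) for i, r in enumerate(real)
    ) / len(real)
    # Same comparison from the synthetic side.
    synth_term = sum(
        nearest(s, real) > nearest(s, synth, i) for i, s in enumerate(synth)
    ) / len(synth)
    return 0.5 * (real_term + synth_term)


# Hypothetical toy records, not taken from the paper's datasets.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
copied = list(real)                          # generator memorized the data
shifted = [(x + 100.0, y + 100.0) for x, y in real]  # generator missed it

print(nnaa(real, copied))   # low score: synthetic duplicates real records
print(nnaa(real, shifted))  # high score: sets are trivially separable
```

Categorical tables such as the admissions data would need an encoding or a different distance (for example Hamming distance) before a comparison like this makes sense; that choice is outside what the excerpt specifies.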

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00