On the inescapable bias in random forests: sources, manifestations, and corrections | Research Square

Research Article

Matthew Berkowitz, Rachel MacKay Altman, Thomas M. Loughin

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9431439/v1
This work is licensed under a CC BY 4.0 License
Status: Posted. Version 1 posted. You are reading this latest preprint version.

Abstract

In regression settings, random forests (RFs) often produce unavoidably biased estimates and predictions. We explain sources of bias in terms of RF-based estimated conditional distribution functions (ECDFs). For given covariate values, the RF ECDF is typically based on observations that are not identically distributed, which can produce bias in the ECDF and in mean or quantile estimates. Bias is especially pronounced in sparsely populated regions and when tail quantiles are estimated, as with prediction intervals. We distinguish distal and proximal sources of bias, show how they manifest differently, and explain how tuning parameters and data complexity contribute to ECDF bias. We propose a two-stage bias-correction procedure to reduce bias in the ECDF and in estimates derived from it, including means and quantiles. Using an estimate of the relationship between the RF ECDF and the covariates, we develop a bias adjustment for the entire ECDF and derived estimates. Compared with other procedures, ours was, in the settings considered, more effective at reducing conditional bias in 0.5-quantile estimates while maintaining or reducing MSE. We also show settings where its conditional bias adjustment yields prediction intervals valid over a larger region and/or with less coverage error than other methods.

Keywords: random forest, quantile regression forest, bias correction, quantile estimation

Additional Declarations: No competing interests reported.
Author affiliations: Matthew Berkowitz (corresponding author), Rachel MacKay Altman, and Thomas M. Loughin, Simon Fraser University
Posted: April 22nd, 2026
© Research Square 2026 | ISSN 2693-5015 (online)