Optimizing Insurance Fraud Claim Detection through Machine Learning: A Comprehensive Approach for Improved Fraud Detection

Research Square
Research Article
Aayush .

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4109015/v1
This work is licensed under a CC BY 4.0 License
Status: Posted. Version 1 posted.

Abstract

Insurance fraud is a growing concern, prompting proactive measures through advanced machine learning techniques. This research focuses on constructing a predictive model for distinguishing genuine and fraudulent auto insurance claims. The dataset, comprising 1,000 instances and 40 attributes, covers customer demographics, policy details, incidents, and financial data. Early fraud detection is crucial for financial loss mitigation and maintaining insurance system integrity.

The study employs data preprocessing to handle missing values, and it uses XGBoost feature importance, variance thresholding, and correlation analysis to select features and improve model interpretability. The machine learning model integrates nine algorithms, and a hard-voting ensemble of Logistic Regression and XGBoost demonstrates competitive accuracy, reaching 83.0%.

Results highlight Linear Discriminant Analysis as the leading classifier, achieving 84% accuracy. The ensemble approach achieves 83.0% accuracy with a notable precision of 91%, showcasing the strength of combining diverse models.

The study emphasizes the significance of preprocessing, feature selection, and ensemble learning for fraud detection optimization. The refined model achieves a Brier loss of 0.00054, indicating minimal discrepancy between predicted probabilities and actual outcomes in binary classification. Exploration of principal component analysis (PCA) with multiple linear regression reveals a trade-off between model simplicity and performance. Retaining 32 components preserves 95% of the variance and yields a value of 0.7967, while keeping 35 components reaches the highest value of 0.9991, showcasing dimensionality reduction's potential to capture nearly all the data variance.

Keywords: Support Vector Machine (SVM), Decision Tree, Random Forest, XGBoost, Naive Bayes, K Nearest Neighbors (KNN), Linear Regression, AdaBoost, Linear Discriminant Analysis (LDA), Ensemble model

Additional Declarations
No competing interests reported.
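
The abstract names two quantities that are easy to state concretely: a hard-voting ensemble and the Brier loss. The sketch below is only an illustration under stated assumptions, not the paper's implementation, which this page does not show. It takes the Brier loss to be the usual mean squared difference between the predicted fraud probability p_i and the 0/1 outcome y_i, that is (1/N) * sum((p_i - y_i)^2), and it takes hard voting to be a per-sample majority vote over model labels. The function names, the tie-breaking rule, and the toy inputs are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Majority vote over per-model label lists (one inner sequence per model).

    Ties go to the smallest label; this tie rule is an assumption of the
    sketch and matters when only two models vote.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every model must predict the same number of samples")
    voted = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Among the labels with the most votes, pick the smallest one.
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted P(fraud) and the 0/1 outcome."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy labels (1 = fraud, 0 = genuine); not data from the paper.
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
ensemble = hard_vote([model_a, model_b])

# Hypothetical predicted fraud probabilities for four claims.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.1]
score = brier_score(y_true, y_prob)
```

A Brier score of 0 corresponds to probabilities that match every outcome exactly, and lower is better. Because hard voting outputs labels rather than probabilities, a Brier score has to be computed from some probability estimate, for example one model's predicted probabilities; which estimate the paper used is not stated here.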

Corresponding author: Aayush ., Christ Deemed To Be University
Posted: March 19th, 2024
Research Square | ISSN 2693-5015 (online)