Optimizing Seminal Quality Prediction Using Machine Learning with Data Preprocessing and Feature Selection

Authors:
Aamir Farooq (Nanjing University of Science and Technology)
Zhengrong Xiang (Nanjing University of Science and Technology)
Musaed Alhussein (King Saud University)
Muhammad Shahzad (Muhammad Nawaz Sharif University of Engineering and Technology; corresponding author)
Muhammad Farhan (Government College University)
Khursheed Aurangzeb (King Saud University)

This is a preprint; it has not been peer reviewed by a journal.
DOI: https://doi.org/10.21203/rs.3.rs-5930473/v1
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Platform: Research Square, ISSN 2693-5015 (online)
Status: Under Review at Scientific Reports. Version 1 posted April 9th, 2025 (latest preprint version).

Editorial timeline:
Editorial decision: Revision requested, 11 Apr, 2025
Reviews received at journal, 09 Apr, 2025
Reviews received at journal, 09 Apr, 2025
Reviewers agreed at journal, 08 Apr, 2025
Reviewers agreed at journal, 04 Apr, 2025
Reviewers invited by journal, 04 Apr, 2025
Submission checks completed at journal, 04 Apr, 2025
First submitted to journal, 31 Mar, 2025

Subject areas: Health sciences/Medical research; Physical sciences/Engineering; Physical sciences/Mathematics and computing/Computer science
Keywords: Medical Disease Diagnosis, Data Normalization, Imbalanced Datasets, SMOTE Algorithm, Feature Selection, Machine Learning Models, Predictive Accuracy
Additional Declarations: No competing interests reported.
Supplementary Files: Supp.docx, newone.docx

Abstract

Due to the increasing prevalence of medical diseases, accurately diagnosing patients has become a significant challenge. Medical data is often raw and unstructured, requiring normalization to convert it into a suitable format for disease prediction. Even once data is appropriately formatted, additional challenges remain, such as handling imbalanced datasets, selecting effective features, and choosing suitable machine learning algorithms to achieve reliable predictive accuracy. This research focuses on predicting the seminal quality of men, addressing these challenges through a series of methodologies. The study utilizes the Fertility Dataset and employs preprocessing techniques to convert categorical values into normalized domain values based on WHO 2010 criteria. To handle class imbalance, the SMOTE algorithm is applied. Feature selection is optimized using CFS-Subset Evaluator and Best-First Search techniques to identify the most relevant features. Several machine learning models, including Naïve Bayes and Multi-layer Perceptron (non-ensemble), and ensemble methods like Bagging, Random Forest, and XG-Boost, are evaluated. Both percentage split and 10-fold cross-validation methods are employed for model validation. The highest accuracy achieved in this study is 96.2%.
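The abstract names SMOTE as the class-imbalance step without describing it. As a rough illustration only, the sketch below shows the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority-class neighbours. This is not the authors' implementation; the function name `smote_oversample`, its parameters, and the toy data are all hypothetical, and the sketch uses only the Python standard library.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples
    and their k nearest minority-class neighbours (illustrative sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy minority-class feature vectors (not taken from the Fertility Dataset).
minority = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.3, 0.3]]
new_points = smote_oversample(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies within the range already spanned by that class. A common practice is to oversample only the training portion of each split so that synthetic points do not leak into evaluation; the abstract does not say whether the study did this.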
