A comparative ML approach to classify Lupinus species using VIS-NIR spectral data from entire seeds and various data transformation techniques and resampling methods

Research Article | Research Square

Josefa Díaz-Álvarez, Francisco A. Galea-Gragera, Francisco Chávez de la O, and 2 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7240896/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

The increasing interest in the cultivation and utilization of Lupinus species is driven by their nutritional value and potential for sustainable agriculture.
This study evaluates five machine learning algorithms for classifying seven Lupinus species using visible and near-infrared (VIS-NIR) spectral data (reflectance and absorbance) from seeds of the official active collection at the CICYTEX Germplasm Bank, characterized by class imbalance. Both raw data and four hybrid data transformation techniques were analyzed. To address class imbalance, six resampling methods were applied alongside the original dataset. Two validation approaches were employed: a simple split (80% training, 20% testing) and stratified K-fold cross-validation (K=5). Random Forest and Support Vector Classification algorithms achieved the highest F1-score (>94%) and AUC (>97%) across all techniques. Logistic regression also performed well with hybrid transformation methods. Cross-validation confirmed model robustness and generalization. These findings demonstrate that combining non-destructive spectral analysis with machine learning is effective for taxonomic identification and genetic resource management in germplasm collections.

Furthermore, this approach may facilitate rapid, objective, and cost-effective selection of Lupinus ecotypes in breeding programs, as well as enhance traceability and conservation for sustainable agriculture. Increasing minority species representation and validating models in external environments are recommended to maximize applicability.

Subject areas: Agronomy; Artificial Intelligence and Machine Learning

Keywords: VIS-NIR spectroscopy, machine learning, multi-class classification, data preprocessing, germplasm identification, legume seeds

Additional Declarations: The authors declare no competing interests.

Supplementary Files: supplementarymaterial.pdf
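The stratified K-fold scheme (K=5) and the F1-score named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the function names `stratified_kfold` and `macro_f1`, the three made-up class labels, and the choice of macro averaging for F1 are all assumptions made for illustration, since the abstract does not state how F1 was averaged.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            j for f, fold in enumerate(folds) if f != i for j in fold
        )
        yield train_idx, test_idx


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging; minority classes count equally)."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced labels: three made-up classes, not the paper's data.
labels = ["A"] * 50 + ["B"] * 20 + ["C"] * 10
splits = list(stratified_kfold(labels, k=5))
```

In this toy setup every test fold holds the same 10:4:2 class mix as the full label list, which is the property that makes stratification useful when some species have few samples. Any resampling step would normally be applied to the training indices only, so that synthetic or duplicated samples never leak into the test fold.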
Authors and affiliations

Josefa Díaz-Álvarez (corresponding author), Universidad de Extremadura, Centro Universitario de Mérida. ORCID: https://orcid.org/0000-0003-2105-3905
Francisco A. Galea-Gragera, Pasture and Forage Crops Area, “Finca La Orden-Valdesequera” Agricultural Research Institute, Extremadura Scientific and Technological Research Centre (CICYTEX). ORCID: https://orcid.org/0000-0001-5670-8014
Francisco Chávez de la O, Universidad de Extremadura, Centro Universitario de Mérida. ORCID: https://orcid.org/0000-0002-9565-743X
Pedro A. Salguero-López, Universidad de Extremadura, Centro Universitario de Mérida. ORCID: https://orcid.org/0009-0003-9143-7063
Fernando Llera Cid, Pasture and Forage Crops Area, “Finca La Orden-Valdesequera” Agricultural Research Institute, Extremadura Scientific and Technological Research Centre (CICYTEX)

Posted: July 30th, 2025 (Version 1)
Files: mainlupinusclassification.pdf (manuscript), supplementarymaterial.pdf (supplement)
Research Square, ISSN 2693-5015 (online)