Toward AI‑Driven IoT Cybersecurity: A Preprocessing Framework for Benchmark Datasets

Research Article

Virginia Martinez-Fuentes, Angel Arroyo, Diego Granados López, and 1 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8213296/v1
This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 26 Feb, 2026. Read the published version in International Journal of Information Security. You are reading the latest preprint version (Version 1).

Abstract

The rapid expansion of Internet of Things (IoT) systems, found in environments such as smart homes, poses growing cybersecurity challenges. In response, research has examined the role of artificial intelligence, particularly machine learning, in enhancing IoT security. To support this effort, machine learning models have been developed and evaluated on benchmark datasets.
However, preparing datasets for machine learning requires preprocessing techniques that are tailored to the specific characteristics of the data. In this context, exploratory data analysis provides insights into dataset structure and distribution, thereby supporting informed preprocessing decisions prior to modeling. Accordingly, this study introduces a reproducible five‑step preprocessing framework for IoT cybersecurity datasets and demonstrates its application to the NF‑ToN‑IoT V1 dataset. The proposed framework is organized into two phases: an exploratory data analysis phase consisting of (1) dataset overview and identification of categorical and numerical features, (2) analysis of missing and zero values, (3) assessment of categorical feature distributions, and (4) assessment of numerical feature distributions; and a preprocessing phase consisting of (5) proportional stratified random downsampling to produce a reduced dataset that preserves the original class distribution. By establishing a systematic, data-driven framework, this study contributes to the preparation of structured datasets for attack detection in IoT environments, with potential applications in smart homes.

Keywords: Cybersecurity, IoT, Dataset preprocessing, Exploratory data analysis, Downsampling, Machine learning

Additional Declarations: No competing interests reported.
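Step (5) of the framework in the abstract, proportional stratified random downsampling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stratified_downsample`, the keys `flow_id` and `attack`, the toy class labels, and the choice to keep at least one row per class are all assumptions made for illustration, and the data is synthetic rather than drawn from NF‑ToN‑IoT V1.

```python
import random
from collections import Counter, defaultdict


def stratified_downsample(rows, label_key, fraction, seed=0):
    """Return a random subset of `rows` that keeps each class's share.

    Each class contributes round(fraction * class_size) rows. As an
    illustrative assumption, every class keeps at least one row so that
    rare classes are not dropped entirely.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    sample = []
    for label in sorted(by_class):  # sorted for a deterministic class order
        members = by_class[label]
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))  # sample without replacement
    return sample


# Synthetic records standing in for network flows (hypothetical labels).
rows = (
    [{"flow_id": i, "attack": "benign"} for i in range(700)]
    + [{"flow_id": i, "attack": "ddos"} for i in range(700, 900)]
    + [{"flow_id": i, "attack": "scanning"} for i in range(900, 1000)]
)
reduced = stratified_downsample(rows, "attack", fraction=0.1, seed=42)
counts = Counter(r["attack"] for r in reduced)
```

With a 10% fraction, the 70/20/10 class split of the toy data carries over to the reduced set, which is the property the abstract describes as preserving the original class distribution. On a real dataset, class sizes that do not divide evenly would make the preserved proportions approximate rather than exact.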
Version 1 editorial history:
Editorial decision: Accepted, 16 Feb, 2026
Reviews received at journal: 22 Jan, 2026
Reviewers agreed at journal: 16 Jan, 2026
Reviews received at journal: 15 Jan, 2026
Reviewers agreed at journal: 14 Jan, 2026
Reviewers invited by journal: 14 Jan, 2026
Editor assigned by journal: 02 Dec, 2025
Submission checks completed at journal: 02 Dec, 2025
First submitted to journal: 26 Nov, 2025
Authors: Virginia Martinez-Fuentes (corresponding author), Angel Arroyo, Diego Granados López, Alvaro Herrero; all University of Burgos.
Version 1 posted: January 16th, 2026
Published version: https://doi.org/10.1007/s10207-026-01235-z (International Journal of Information Security, published February 26th, 2026)
License: CC BY 4.0, https://creativecommons.org/licenses/by/4.0/