Toward AI‑Driven IoT Cybersecurity: A Preprocessing Framework for Benchmark Datasets

preprint OA: closed CC-BY-4.0
Research Article

Virginia Martinez-Fuentes, Angel Arroyo, Diego Granados López, and 1 more

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-8213296/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 26 Feb, 2026 in International Journal of Information Security.

Abstract

The rapid expansion of Internet of Things (IoT) systems, found in environments such as smart homes, poses growing cybersecurity challenges. In response, research has examined the role of artificial intelligence, particularly machine learning, in enhancing IoT security. To support this effort, machine learning models have been developed and evaluated on benchmark datasets.
However, preparing datasets for machine learning requires preprocessing techniques that are tailored to the specific characteristics of the data. In this context, exploratory data analysis provides insights into dataset structure and distribution, thereby supporting informed preprocessing decisions prior to modeling. Accordingly, this study introduces a reproducible five‑step preprocessing framework for IoT cybersecurity datasets and demonstrates its application to the NF‑ToN‑IoT V1 dataset. The proposed framework is organized into two phases: an exploratory data analysis phase consisting of (1) dataset overview and identification of categorical and numerical features, (2) analysis of missing and zero values, (3) assessment of categorical feature distributions, and (4) assessment of numerical feature distributions; and a preprocessing phase consisting of (5) proportional stratified random downsampling to produce a reduced dataset that preserves the original class distribution. By establishing a systematic, data-driven framework, this study contributes to the preparation of structured datasets for attack detection in IoT environments, with potential applications in smart homes.

Keywords: Cybersecurity, IoT, Dataset preprocessing, Exploratory data analysis, Downsampling, Machine learning

Additional Declarations: No competing interests reported.
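Step (5) of the framework, proportional stratified random downsampling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `Attack` label field, the toy class counts, the sampling fraction, and the rule of keeping at least one row per class are all assumptions made for illustration. The idea is to draw the same fraction of rows at random from every class, so the reduced dataset keeps approximately the original class distribution.

```python
import random
from collections import Counter, defaultdict


def stratified_downsample(rows, label_key, fraction, seed=0):
    """Return a random subset of `rows` that keeps each class's share.

    rows      -- list of dicts, one per network flow (hypothetical layout)
    label_key -- name of the field holding the class label
    fraction  -- share of each class to keep, 0 < fraction <= 1
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, row in enumerate(rows):
        by_class[row[label_key]].append(i)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    keep = []
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        # Same fraction for every class; keep at least one row so that
        # rare classes do not vanish (an assumption of this sketch).
        n_keep = max(1, round(len(indices) * fraction))
        keep.extend(rng.sample(indices, n_keep))

    keep.sort()  # preserve the original row order
    return [rows[i] for i in keep]


def class_shares(rows, label_key):
    """Relative frequency of each class label."""
    counts = Counter(row[label_key] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy data: labels and counts are made up, not taken from NF-ToN-IoT V1.
flows = (
    [{"Attack": "Benign"} for _ in range(700)]
    + [{"Attack": "ddos"} for _ in range(200)]
    + [{"Attack": "scanning"} for _ in range(100)]
)
reduced = stratified_downsample(flows, "Attack", fraction=0.1, seed=42)
print(len(reduced))
print(class_shares(flows, "Attack"))
print(class_shares(reduced, "Attack"))
```

Comparing `class_shares` before and after sampling is a quick check that the reduced dataset preserves the class proportions. With very small classes, rounding and the one-row minimum can shift the shares slightly.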
Editorial timeline (International Journal of Information Security)
Editorial decision: Accepted, 16 Feb, 2026
Reviews received at journal: 22 Jan, 2026
Reviewers agreed at journal: 16 Jan, 2026
Reviews received at journal: 15 Jan, 2026
Reviewers agreed at journal: 14 Jan, 2026
Reviewers invited by journal: 14 Jan, 2026
Editor assigned by journal: 02 Dec, 2025
Submission checks completed at journal: 02 Dec, 2025
First submitted to journal: 26 Nov, 2025
Article metadata
Authors: Virginia Martinez-Fuentes (corresponding author), Angel Arroyo, Diego Granados López, Alvaro Herrero
Institution: University of Burgos
Article type: Research Article
Journal: International Journal of Information Security
Preprint DOI: https://doi.org/10.21203/rs.3.rs-8213296/v1
Preprint posted: January 16th, 2026
Published version: https://doi.org/10.1007/s10207-026-01235-z (published February 26th, 2026)


Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-20T11:00:21.680559+00:00
License: CC-BY-4.0