OLTunes: Online learning-based Auto-tuning System for DL Inference in Heterogeneous GPU Cluster

Research Article
Seoyoung Kim, Jiwon Ha, Yoonhee Kim
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-5342517/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review (Version 1)

Abstract
With rapid AI advancements, GPU accelerator technology is also evolving, increasing the number of heterogeneous computing nodes in datacenters. This requires schedulers to recognize and optimally manage diverse resources to meet application needs dynamically. For latency-sensitive tasks like deep learning inference, a lack of precise GPU scheduling leads to resource interference, degrading both application performance and overall GPU utilization. The rise of NLP and LLMs has intensified the focus on models that balance throughput and latency, but dynamic loads on specific resources can degrade performance through head-of-line blocking. Thus, proactive resource management is essential for reducing costs while ensuring QoS and maintaining energy efficiency. This paper introduces OLTunes, a cluster-level scheduling system for deep learning inference models, which combines streaming and batch methods to efficiently handle online and offline models. Leveraging FM-FTML, an online learning technique, OLTunes optimizes runtime environments and resource selection to meet user SLAs via prediction and optimization. It forms co-running groups based on job characteristics and model variants to reduce interference, ensuring complementary affinities between tasks. OLTunes also automatically tunes resources and settings to enhance performance and reduce resource fragmentation. Performance experiments on a heterogeneous GPU cluster showed an average GPU utilization improvement of 58%, a reduction in P99 tail latency of up to 49%, and a throughput increase of 61%. It also achieved approximately 84.6% energy savings with a maximum accuracy loss of 4% and reduced latency-sensitive SLO violations by up to 92% compared to other baselines, ensuring end-to-end QoS.

Keywords: Heterogeneous GPU Cluster, Online-learning, Machine Learning, Deep Learning Inference, Resource Scheduling

Additional Declarations: No competing interests reported.
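The latency and SLO figures in the abstract can be read against the usual definitions of those metrics. The following is a minimal sketch, not the authors' evaluation code: it assumes per-request latencies in milliseconds, uses the nearest-rank percentile convention (the paper may use a different one), and all function names and sample values are hypothetical.

```python
import math


def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (0 < p <= 100) by the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers p percent of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def slo_violation_rate(samples, slo_ms):
    """Fraction of requests whose latency exceeds the SLO target."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(1 for s in samples if s > slo_ms) / len(samples)


def relative_reduction(baseline, improved):
    """Relative improvement of a lower-is-better metric over a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical latencies (ms); not data from the paper.
    latencies_ms = list(range(1, 101))
    p99 = percentile_nearest_rank(latencies_ms, 99)
    rate = slo_violation_rate(latencies_ms, slo_ms=90)
    print(f"P99 latency: {p99} ms, SLO violation rate: {rate:.2%}")
```

Under this reading, a statement such as "reduced P99 tail latency by up to 49%" corresponds to `relative_reduction(baseline_p99, new_p99)` reaching 0.49 in the best case among the compared baselines.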
Editorial history (Version 1, posted November 7th, 2024)
Editorial decision: Revision requested, 30 Dec, 2024
Reviews received at journal: 10 Nov, 2024
Reviewers agreed at journal: 04 Nov, 2024
Reviewers agreed at journal: 04 Nov, 2024
Reviewers invited by journal: 03 Nov, 2024
Editor assigned by journal: 29 Oct, 2024
Submission checks completed at journal: 28 Oct, 2024
First submitted to journal: 27 Oct, 2024

Journal: Cluster Computing (https://www.springer.com/journal/10586)

Author affiliations
Seoyoung Kim, Sookmyung Women's University
Jiwon Ha, Seoul National University
Yoonhee Kim (corresponding author), Sookmyung Women's University

Manuscript PDF: https://assets-eu.researchsquare.com/files/rs-5342517/v1_covered_bed0f374-44f5-4a6b-b0c4-680286f3b6df.pdf
