LinGen-Uni: A Universal Linear-Complexity Framework for High-Resolution Minute-Length Text-to-Video Generation

Research Article | Research Square

Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Xiaoliang Dai, Niraj K. Jha

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7159495/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted July 21, 2025.

Abstract

Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to videos of only 10-20 seconds in length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels.
For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces self-attention, the computationally dominant block with quadratic complexity, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. We further propose a distillation framework that quickly transfers the self-attention layers in pre-trained DiTs to our proposed MATE layers by reusing the self-attention weights to initialize 90% of the weights of the MATE layers. As a result, LinGen can be deployed on any pre-trained DiT through light distillation. We refer to LinGen equipped with this distillation framework as LinGen-Uni. Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15x and latency by up to 11.5x. Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% against Gen-3, LumaLabs, and Kling, respectively). This paves the way for hour-length movie generation and real-time interactive video generation. Finally, distillation results indicate that LinGen-Uni maintains the quality of Wan2.1-T2V-1.3B after distillation while achieving up to a 30.7x speedup in inference latency, and that it significantly outperforms LTX-Video-2B in video quality and text-video alignment.
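
The quadratic-versus-linear scaling contrast described above can be illustrated with a rough back-of-the-envelope sketch. It is not the paper's FLOP accounting: the function names, the token count, the feature dimension, and the constant factors are all hypothetical placeholders, and it assumes the number of tokens is proportional to the number of pixels. Only the growth rate in the token count N is meant to carry over.

```python
def attention_cost(num_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for the two N x N matrix products in self-attention."""
    # Q @ K^T costs about N * N * d, and (attention weights) @ V costs about N * N * d.
    return 2 * num_tokens * num_tokens * dim


def linear_cost(num_tokens: int, dim: int, per_token_factor: int = 1) -> int:
    """Approximate cost of a block whose work per token does not depend on N.

    per_token_factor is a hypothetical constant standing in for block details.
    """
    return per_token_factor * num_tokens * dim


def growth(cost_fn, base_tokens: int, scale: int, dim: int = 64) -> float:
    """How much the cost grows when the token count is multiplied by `scale`."""
    return cost_fn(base_tokens * scale, dim) / cost_fn(base_tokens, dim)


if __name__ == "__main__":
    base = 1_000  # hypothetical token count for a short clip
    for scale in (2, 4, 8):
        quad = growth(attention_cost, base, scale)
        lin = growth(linear_cost, base, scale)
        print(f"{scale}x tokens: attention cost x{quad:.0f}, linear block cost x{lin:.0f}")
```

Under these assumptions, a video with 4 times as many tokens costs 16 times as much in the attention products but only 4 times as much in a linear-complexity block, which is why the gap widens as videos get longer or higher in resolution.
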
More minute-length video examples can be found at our project website: https://lineargen.github.io/. The complete code of the distillation framework will be released soon after acceptance.

Subject area: Artificial Intelligence and Machine Learning
Keywords: video generation, diffusion models, linear complexity, state space models, architecture distillation

Additional Declarations

The authors declare potential competing interests as follows: The early conference version of this work was supported in part by NSF under Grant No. CCF-2203399 and in part by a Meta summer internship. All extensions beyond the conference paper were conducted entirely at Princeton University with support from NSF under Grant No. CCF-2203399.