SGVLM: Depth-Integrated Semantic Scene Graph Fusion for Enhanced Autonomous Driving Decision-Making

doi:10.22541/au.175774544.49992943/v1

2025 · preprint · open access: closed

Authorea preprint, Computational Intelligence collection. Version 1 (latest version), 13 September 2025. This is a preprint and has not been peer reviewed. Data may be preliminary.

Version of record: published in Computational Intelligence, 13 April 2026.

Authors: Yiming Han, Yiran Tao, and Xiang Cui (College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics); Tinglun Song (Chery Automobile Co Ltd).

Keywords: autonomous driving; graph attention network; semantic scene graph.

License: Non-Exclusive No Reuse License.

Supplementary material: sgvlm.docx (5.33 MB).

Article usage: 314 views, 151 downloads.

Cite as: Yiming Han, Yiran Tao, Xiang Cui, Tinglun Song. SGVLM: Depth-Integrated Semantic Scene Graph Fusion for Enhanced Autonomous Driving Decision-Making. Authorea. 13 September 2025. DOI: https://doi.org/10.22541/au.175774544.49992943/v1

Abstract

Autonomous driving decision-making requires a deep semantic understanding of traffic scenes. In this paper, we propose SGVLM (Semantic Graph Vision-Language Model), a vision-language model that enhances autonomous driving decision-making through depth-integrated semantic scene graph fusion. Key objects are represented as nodes carrying category and state attributes, and the spatial-semantic relations between them as edges. The graph is enriched with pixel-wise depth estimates from Depth-Anything-V2 so that inter-object distances are captured accurately. The graph features are aggregated by a two-layer Graph Attention Network (GAT) and projected into the feature space of FastVLM's FastViTHD vision encoder. A cross-modal triplet fusion layer then jointly integrates the graph embeddings, visual features, and natural-language queries. With Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, SGVLM_7B achieves relative improvements of 25.9% in BLEU-4 and 18.6% in ROUGE-L over the InternVL4Drive-v2 baseline on the DriveLM-nuScenes benchmark. It also reaches 94.56% accuracy on collision-warning decision tasks in our safety-critical TTSG-data scenarios. These results show that depth-integrated semantic scene graph fusion substantially improves the model's ability to generate actionable driving decisions under complex traffic conditions.
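The abstract gives only the graph's ingredients. Nodes carry category and state, edges carry spatial-semantic relations, and Depth-Anything-V2 depth is used to measure inter-object distances. It does not describe how the graph is built. The Python sketch below shows one plausible way to attach depth to such a graph under three assumptions:

- the depth map is metric, in metres (Depth-Anything-V2 also ships relative-depth checkpoints whose output is not in metres);
- the pinhole intrinsics (fx, fy, cx, cy) are known;
- per-object bounding boxes come from a detector.

The names `Node`, `Edge` and `build_scene_graph`, the `max_dist` cut-off, and the front/behind relation rule are hypothetical illustrations, not the authors' code.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass
class Node:
    category: str                      # e.g. "car", "pedestrian"
    state: str                         # e.g. "moving", "stopped"
    box: tuple                         # pixel box (x0, y0, x1, y1), x1/y1 exclusive
    position: tuple = (0.0, 0.0, 0.0)  # camera-frame 3D centre in metres, set below


@dataclass
class Edge:
    src: int
    dst: int
    relation: str      # hypothetical relation label
    distance_m: float  # Euclidean distance between the two object centres


def object_depth(depth_map, box):
    """Median depth inside a box; robust to a few background pixels."""
    x0, y0, x1, y1 = box
    return median([depth_map[v][u] for v in range(y0, y1) for u in range(x0, x1)])


def back_project(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z into camera coordinates."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def build_scene_graph(nodes, depth_map, intrinsics, max_dist=30.0):
    """Set node positions from depth (in place), then link pairs closer than max_dist.

    Assumes depth_map is a list of rows holding metric depth in metres.
    """
    fx, fy, cx, cy = intrinsics
    for n in nodes:
        x0, y0, x1, y1 = n.box
        z = object_depth(depth_map, n.box)
        n.position = back_project((x0 + x1) / 2, (y0 + y1) / 2, z, fx, fy, cx, cy)
    edges = []
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes):
            if i == j:
                continue
            d = math.dist(a.position, b.position)
            if d <= max_dist:
                # Hypothetical rule: compare depths to label the pairwise relation.
                za, zb = a.position[2], b.position[2]
                rel = "in_front_of" if za < zb else "behind" if za > zb else "beside"
                edges.append(Edge(i, j, rel, d))
    return edges
```

Taking the median depth inside each box, rather than the depth at the box centre, keeps a few background pixels from skewing an object's distance.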
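To make the aggregation step concrete, here is a minimal single-head Graph Attention Network in plain Python. It follows the standard GAT formulation:

- attention scores e_ij = LeakyReLU(a^T [W h_i || W h_j]);
- a softmax over each node's neighborhood, including a self-loop;
- an attention-weighted sum of the projected neighbor features.

The abstract says only that SGVLM uses two GAT layers. It does not state the number of heads, the hidden sizes, the activations, or how edge attributes such as depth-derived distances enter the attention. The choices here are therefore assumptions: one head, ELU after the first layer, a linear second layer, and edge attributes ignored.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def gat_layer(H, neighbors, W, a, activation=elu):
    """One single-head GAT layer.

    H: node feature vectors; neighbors[i]: indices of nodes adjacent to node i.
    W: weight matrix (out_dim x in_dim); a: attention vector of length 2 * out_dim.
    Returns the updated features and, per node, a {neighbor: attention weight} dict.
    """
    Z = [matvec(W, h) for h in H]                         # W h_j for every node
    out, attn = [], []
    for i, zi in enumerate(Z):
        nbrs = [i] + [j for j in neighbors[i] if j != i]  # include a self-loop
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]); list "+" concatenates the two vectors
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, zi + Z[j]))) for j in nbrs]
        m = max(scores)                                   # shift for a stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        agg = [sum(al * Z[j][k] for al, j in zip(alpha, nbrs)) for k in range(len(zi))]
        out.append([activation(v) for v in agg])
        attn.append(dict(zip(nbrs, alpha)))
    return out, attn


def two_layer_gat(H, neighbors, W1, a1, W2, a2):
    """Two stacked GAT layers: ELU after the first, linear output after the second."""
    H1, _ = gat_layer(H, neighbors, W1, a1)
    return gat_layer(H1, neighbors, W2, a2, activation=lambda x: x)
```

Stacking two layers lets each node's final embedding reflect objects up to two relations away, for example a pedestrian near a car that is itself near the ego vehicle.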
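The abstract names two further steps without giving their internals: a projection of graph features into the FastViTHD feature space, and a "cross-modal triplet fusion layer". As a hedged illustration only, the sketch below does the following:

- a linear map projects the graph embeddings;
- each text-query token attends over the concatenated graph and visual tokens, using single-head scaled dot-product attention;
- a residual connection adds the attended context back to the query.

The function names, the identity query/key/value maps, and this particular attention pattern are assumptions chosen for brevity. The authors' layer may differ substantially.

```python
import math


def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def project_graph(graph_emb, W_proj, b_proj):
    """Map each graph-node embedding to the (assumed) visual-token dimension."""
    return [linear(W_proj, b_proj, g) for g in graph_emb]


def triplet_fusion(text_tokens, visual_tokens, graph_tokens):
    """Hypothetical fusion: each text token attends over [graph; visual] tokens.

    Single-head scaled dot-product attention with identity query/key/value maps and a
    residual connection. All tokens must already share the same dimension d.
    """
    memory = graph_tokens + visual_tokens
    d = len(text_tokens[0])
    fused = []
    for q in text_tokens:
        weights = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in memory])
        context = [sum(w * tok[c] for w, tok in zip(weights, memory)) for c in range(d)]
        fused.append([qi + ci for qi, ci in zip(q, context)])  # residual connection
    return fused
```

The graph embeddings are projected to the visual-token dimension first so that all three token types can share one attention space.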
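LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update. The effective weight is W + (alpha / r) B A, where B has shape d_out x r and A has shape r x d_in. Initializing B to zero makes the adapted layer identical to the pretrained one at the start of fine-tuning. The sketch below is a plain-Python illustration of that update. The abstract does not give the rank, the scaling, or which SGVLM layers receive adapters, so the class name `LoRALinear` and every value in it are hypothetical.

```python
import random


def matmul(A, B):
    """Product of matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                                     # frozen: never modified below
        # A: r x d_in, small Gaussian init; B: d_out x r, zero init so W' == W at start
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def effective_weight(self):
        delta = matmul(self.B, self.A)                 # d_out x d_in, rank at most r
        return [[w + self.scale * dw for w, dw in zip(wr, dr)] for wr, dr in zip(self.W, delta)]

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters in A and B: r * (d_in + d_out)."""
    return r * (d_in + d_out)
```

Take a single 4096 x 4096 projection with r = 8 (an illustrative size, not a figure from the paper). The adapter trains r(d_in + d_out) = 65,536 parameters instead of 16,777,216, about 0.39% of the full matrix.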
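The reported 25.9% (BLEU-4) and 18.6% (ROUGE-L) gains are relative improvements, (SGVLM score - baseline score) / baseline score, not absolute percentage points. The abstract does not give the baseline scores, so the baseline of 0.40 below is purely hypothetical. It only shows how a relative gain translates into an absolute score.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline` as a fraction (0.259 means 25.9%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


def score_from_relative_gain(baseline, gain):
    """Score implied by a relative gain over a baseline: baseline * (1 + gain)."""
    return baseline * (1.0 + gain)


# Hypothetical baseline BLEU-4 of 0.40 (not reported in the abstract).
implied = score_from_relative_gain(0.40, 0.259)
print(round(implied, 4), round(implied - 0.40, 4))
```

With that hypothetical baseline, a 25.9% relative gain corresponds to a BLEU-4 of about 0.504, an absolute increase of about 0.104.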