HAR-Agent: Multilingual Multimodal Activity Recognition via Knowledge-Distilled LLM Reasoning

doi:10.22541/au.177499021.11495152/v1

2026 · doi:10.22541/au.177499021.11495152/v1

preprint OA: closed

This is a preprint and has not been peer reviewed. Data may be preliminary.

31 March 2026 · V1 (latest version)

Authors: Khashayar Ghamati (ORCID 0009-0002-6416-3127), Mohammad Reza Shahabian Alashti, Ali Fallahirahmatabadi, and Abolfazl Zaraki

https://doi.org/10.22541/au.177499021.11495152/v1

202 views · 75 downloads

Abstract

We present HAR-Agent, a multilingual multimodal human activity recognition system that unifies visual, audio, and text perception through a common textual representation, with a large language model serving as the central decision-maker. We build a multimodal agent architecture and systematically compare 17 model configurations, varying parameter count from 1.5 billion to 72 billion and training method between instruction tuning and supervised fine-tuning, to determine which produces the most effective activity recognition agent. Instruction tuning outperforms supervised fine-tuning by 17 to 28 percentage points at every scale, and a striking inverse scaling phenomenon emerges whereby the smallest 1.5-billion-parameter student achieves the highest student accuracy at 41.4%, outperforming all larger students whilst requiring under one gigabyte of GPU memory. Knowledge distillation from a 72-billion-parameter teacher achieves moderate agreement at 48 times compression. A multilingual audio benchmark across five languages yields 89.2% accuracy with sub-second latency, and cross-domain evaluation on the Toyota Smarthome dataset confirms that instruction-tuned agents generalise better to unseen real-home environments. These findings provide practical guidelines for building and deploying large-language-model-based activity recognition agents on consumer hardware.

Supplementary Material

File (man-main.pdf), 651.23 KB

Version history

V1: 31 March 2026

Copyright

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Keywords

agent ai, agentic ai, ai agent, artificial intelligence, fine tuning, human activity recognition, instruction tuning, knowledge distillation, large language models, machine learning, multilingual speech, multimodal agents, robotics
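The abstract describes two ideas that a small sketch may make more concrete: perception outputs from several modalities are merged into one textual representation for a language model to reason over, and a 72-billion-parameter teacher is distilled into a 1.5-billion-parameter student at 48 times compression. The Python below is only an illustration under stated assumptions. The Observation type, the build_prompt and compression_ratio functions, the prompt wording, and the example observations and activity labels are all hypothetical and are not taken from the paper; no language model is called.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One perception result already converted to text (hypothetical type)."""
    modality: str      # e.g. "visual", "audio", "text"
    description: str   # free-text description produced by a perception module


def build_prompt(observations, labels):
    """Merge per-modality text into a single prompt for an LLM classifier.

    The prompt layout is an assumption for illustration, not the paper's format.
    """
    lines = [f"[{o.modality}] {o.description}" for o in observations]
    return (
        "Observations:\n" + "\n".join(lines)
        + "\nChoose one activity from: " + ", ".join(labels)
    )


def compression_ratio(teacher_params, student_params):
    """Ratio of teacher to student parameter counts."""
    return teacher_params / student_params


# Made-up example inputs, not data from the paper.
obs = [
    Observation("visual", "a person stands at a kitchen counter"),
    Observation("audio", "sound of running water"),
]
prompt = build_prompt(obs, ["cooking", "washing dishes", "watching TV"])

# 72 billion teacher parameters over 1.5 billion student parameters.
ratio = compression_ratio(72e9, 1.5e9)

print(prompt)
print(ratio)
```

The ratio computed here matches the 48 times compression stated in the abstract, which suggests that figure refers to parameter count rather than memory footprint, although the abstract does not say so explicitly.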

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00