PocketLLM: A Privacy-Preserving Offline AI Assistant with On-Device LLM Inference and Retrieval-Augmented Generation on Android

Research Article (Research Square preprint)

Ritesh Reddy G (corresponding author), Sahasrika E, Karthikeya K, Rohit Reddy K, Pavana Lakshmi G, and Rohan Rao G
Keshav Memorial Institute of Technology

This is a preprint; it has not been peer reviewed by a journal.
DOI: https://doi.org/10.21203/rs.3.rs-9575380/v1
License: CC BY 4.0
Status: Posted (Version 1)

Abstract

Cloud-based AI assistants transmit user data to remote servers, raising significant concerns around privacy, latency, and continuous internet dependency. In this paper, we present PocketLLM, a fully offline, on-device AI assistant for Android that addresses these issues by running all computations locally on the device.
The system integrates a quantized large language model (LLM) deployed via llama.cpp [9], combined with a lightweight Retrieval-Augmented Generation (RAG) pipeline [5] backed by SQLite, to improve contextual understanding without any cloud dependency. PocketLLM supports natural language task execution, enabling users to make phone calls, send SMS messages, set alarms, and manage calendar events entirely on-device through an intent classification layer. The system runs the Arcee Lite 1.7B model [8] in Q8 GGUF format via a Java Native Interface (JNI) bridge on a standard 6 GB RAM smartphone, staying within practical mobile hardware limits. Experiments on 200 test queries show a mean response latency of 6.2 seconds with RAG enabled and an intent classification accuracy of 91.5%. RAG-augmented responses improve response relevance over vanilla LLM outputs by roughly 23% on domain-specific queries, confirming the value of local knowledge retrieval. By removing any reliance on cloud infrastructure, PocketLLM keeps user data private while remaining practically usable, establishing a foundation for future offline intelligent assistant research.

Keywords: Artificial Intelligence and Machine Learning; Information Retrieval and Management; On-device AI; Retrieval-Augmented Generation; Large Language Models; Offline AI Assistant; Android; Privacy-Preserving Systems

Figures: Figure 3.1 (User Interface Diagram); Figure 3.2 (PocketLLM System Architecture Diagram); Figure 4.1 (Deployment Diagram); Figure 5.1 (Intent Classification Accuracy Graph)

1. Introduction

The rapid spread of AI-powered assistants has fundamentally changed how people interact with their devices. Products such as Siri, Google Assistant, and Alexa have made it clear that natural language interfaces can genuinely simplify everyday tasks, from placing calls to setting reminders and looking up information. Yet these systems share a common dependency: they offload nearly all of their intelligence to remote cloud servers, which means they require a stable internet connection and, by extension, continuously ship user queries to third-party infrastructure.
For many users, including those in low-connectivity regions, those in privacy-sensitive professions, and those simply unwilling to have their conversations stored on an external server, this trade-off is unacceptable.

Running a capable large language model (LLM) entirely on a smartphone is non-trivial. Mobile devices impose tight constraints on memory, compute, and battery life, so only carefully optimized, smaller models are feasible. Recent progress in model quantization [10] and efficient CPU-based inference frameworks [9] has made this tractable, allowing lightweight LLMs to run with usable latency on mid-range Android hardware. At the same time, Retrieval-Augmented Generation (RAG) [5] has emerged as an effective way to narrow the gap between a model’s fixed training knowledge and the specific, up-to-date information a user might need, without requiring the model itself to grow larger.

Most prior work in this space concentrates on inference efficiency: squeezing lower latency or smaller memory footprints out of on-device models [3][4]. Far less attention has gone to the other half of the problem: wiring that on-device intelligence into the device’s own APIs so users can actually get things done rather than just receive text answers. Bridging that gap is exactly what PocketLLM aims to do.

We present PocketLLM, a fully offline Android AI assistant that pairs on-device LLM inference with task-oriented system integration. Users can hold natural conversations with the assistant and also issue commands that are translated directly into device actions (calling a contact, composing an SMS, scheduling an event, or setting an alarm) without any data ever leaving the handset.

The main contributions of this work are as follows:

- On-device inference of the Arcee Lite 1.7B model [8] using llama.cpp [9], enabling efficient execution of quantized LLMs on Android devices.
- Integration of a local Retrieval-Augmented Generation pipeline [5] using SQLite and all-MiniLM-L6-v2 embeddings [6][7] for more contextually accurate domain-specific responses.
- Seamless Android system integration, supporting natural language-driven execution of phone calls, SMS, alarms, and calendar events via standard Android SDK APIs.
- A companion website for APK distribution, making the app accessible without requiring a Play Store listing.

2. Related Work

2.1 On-Device LLM Inference

Interest in running large language models on edge hardware has grown considerably over the past two years. LLMCad [3] tackled this by distributing computation across a device’s available cores, improving efficiency on memory-constrained generative tasks. PowerInfer-2 [2] went further with a neuron-cluster scheduling strategy that dynamically activates only the model weights most relevant to a given token, allowing models larger than device RAM to run without catastrophic slowdowns. Transformer-Lite [4] explored aggressive low-bit quantization (including FP4 formats) to fit LLM inference within the limited bandwidth of mobile GPUs. The llama.cpp framework [9] has become the de facto standard for CPU-based quantized inference on edge devices, offering well-tested support for the GGUF model format across ARM and x86 architectures. Research on post-training quantization [10] and parameter-efficient adaptation [13] has further refined the accuracy-efficiency trade-off. What these works generally leave unaddressed, however, is how the resulting on-device model should integrate with real-world device functionality. PocketLLM builds on this inference foundation and extends it with a full task-execution layer.

2.2 Mobile AI Assistants

Conversational AI assistants have evolved from rigid rule-based systems into deep-learning-powered agents capable of handling open-ended dialogue.
Cloud-hosted products such as Siri and Google Assistant demonstrate the value of natural language interfaces but also exemplify the privacy and latency issues that motivate on-device alternatives. The closest prior work to our own is that of Marques et al. [1], who ran a fine-tuned 3B GPT model on a smartphone using llama.cpp and LoRA [13], demonstrating basic text-to-action features such as placing calls and adding calendar entries. Yin et al. [15] explored treating an on-device LLM as a shared system service to reduce redundant model loading across apps. While these systems prove the concept of mobile LLM assistants, they stop short of integrating a domain-specific offline RAG pipeline alongside multi-API task execution, which is the combination that defines PocketLLM.

2.3 Retrieval-Augmented Generation (RAG)

The foundational RAG framework, introduced by Lewis et al. [5], demonstrated that pre-trained generative models can be substantially improved on knowledge-intensive tasks by retrieving relevant passages at inference time. Subsequent work on dense embedding models, notably Sentence-BERT [6] and its distilled successors such as all-MiniLM-L6-v2 [7], made efficient semantic similarity search practical even in resource-limited settings. Server-side RAG pipelines commonly rely on approximate nearest-neighbour stores such as FAISS; on mobile, however, such libraries introduce significant binary size and memory overhead. PocketLLM instead stores precomputed embeddings directly in SQLite as binary blobs, supporting cosine similarity retrieval with no external dependencies and no network access required.

3. System Architecture

3.1 Overview

PocketLLM is designed as a self-contained intelligent assistant for Android, with a strict requirement that no user data ever leaves the device.
The system is organised into five layers that process a user’s input: the User Interface Layer accepts voice or typed input; the Intent Classification Layer determines the type of response needed; the On-Device LLM Inference Engine generates the textual reply; the RAG Pipeline optionally augments the prompt for that reply with locally retrieved knowledge (Section 3.5); and the Android API Integration Layer converts actionable intents into real device operations such as calls, messages, and calendar events. This layered design keeps each concern cleanly separated while ensuring that the full pipeline, from microphone input to device action, runs entirely inside the user’s handset, with no outbound network traffic at any stage.

3.2 User Interface Layer

The UI is built with Kotlin and Jetpack Compose, offering both a text-based chat interface and a voice-input mode powered by Android’s on-device speech-to-text engine. Spoken utterances are transcribed locally before being passed downstream; model outputs are displayed as chat bubbles and can optionally be read aloud via the device’s text-to-speech engine, completing the conversational loop entirely on-device.

3.3 Intent Classification

The Intent Classification Layer acts as a semantic router. It normalises incoming text and determines whether the user intends a general conversational exchange or a concrete device action (currently phone calls, SMS composition, alarm scheduling, and calendar event creation). The detected intent label is forwarded alongside the original query to the inference engine to guide prompt formatting and, for action intents, also triggers the corresponding Android API call.

3.4 On-Device LLM Inference

The core reasoning capability is provided by Arcee Lite 1.7B [8], quantised to Q8 GGUF format, which yields a model file of approximately 1.8 GB stored on internal flash.
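The quoted file size can be sanity-checked with a rough calculation. The sketch below assumes the Q8_0 block layout used by llama.cpp (32 int8 weights plus one 16-bit scale per block, i.e. 34 bytes per 32 weights); the paper only says "Q8", so the exact variant is an assumption, and real GGUF files also carry metadata and may keep some tensors at other precisions.

```python
# Back-of-the-envelope size estimate for a Q8-quantised model file.
# Assumption: llama.cpp's Q8_0 layout, where each block of 32 weights is
# stored as 32 int8 values plus one 16-bit scale (34 bytes per block).

WEIGHTS_PER_BLOCK = 32
BYTES_PER_BLOCK = 32 * 1 + 2  # 32 int8 weights + one fp16 scale


def bits_per_weight() -> float:
    """Effective storage cost of one weight, including the shared scale."""
    return BYTES_PER_BLOCK * 8 / WEIGHTS_PER_BLOCK


def q8_0_size_bytes(n_params: int) -> float:
    """Approximate on-disk size if every weight were stored as Q8_0."""
    return n_params / WEIGHTS_PER_BLOCK * BYTES_PER_BLOCK


if __name__ == "__main__":
    size = q8_0_size_bytes(1_700_000_000)
    print(f"{bits_per_weight():.1f} bits/weight")  # 8.5 bits/weight
    print(f"{size / 1e9:.2f} GB")                  # 1.81 GB
```

Under these assumptions, 1.7 billion parameters at 8.5 bits each come to about 1.81 GB, in line with the approximately 1.8 GB reported.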
Arcee Lite was selected over Gemma 2B, TinyLLaMA 1.1B, and Phi-2 following an empirical comparison that weighed instruction-following quality, response coherence, and memory consumption on mid-range hardware (see Section 4.2 for details). Inference is handled by llama.cpp [9], compiled as a native shared library for Android using the NDK, CMake, and Ninja. A JNI bridge exposes model loading, tokenisation, prompt formatting, and token sampling to the Kotlin application layer. All LLM computation runs on the device CPU in a background thread, keeping the UI responsive and ensuring that no query or response is ever transmitted externally.

3.5 Retrieval-Augmented Generation (RAG) Pipeline

The RAG pipeline supplements the model’s parametric knowledge with domain-specific content drawn from a local knowledge base. When RAG is enabled, the top-matching passage is prepended to the prompt before inference; when it is disabled, the model generates responses from its weights alone, reducing latency. Embeddings are produced offline using the all-MiniLM-L6-v2 sentence transformer [6][7], and stored alongside their source text as BLOBs in a SQLite database; the embedding model itself is converted to ONNX for Android compatibility. The knowledge base covers four domains: Agricultural Science, Medical Science, Computer Science, and KMIT (Keshav Memorial Institute of Technology) institutional data. At inference time, the user’s query is embedded using the same ONNX model and compared against the stored vectors via cosine similarity; the highest-scoring passage is prepended to the LLM prompt. This approach eliminates the need for cloud-hosted vector stores entirely [5].

3.6 Android API Integration

Once an action intent is confirmed, the API Integration Layer invokes the appropriate Android SDK interface: TelecomManager for outgoing calls, SmsManager for text messages, CalendarContract for event creation, and AlarmManager for alarm scheduling.
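The routing from an utterance to one of the four interfaces named above can be sketched as follows. This is a minimal illustration in Python rather than the app's Kotlin: the keyword patterns, the function names `classify_intent` and `route`, and the rule that ambiguous input falls through to the LLM are all hypothetical, since the paper does not list its actual rules.

```python
import re
from typing import Optional

# Hypothetical keyword patterns; the paper does not publish its rule set.
INTENT_PATTERNS = {
    "call": re.compile(r"\b(call|dial|phone)\b", re.IGNORECASE),
    "sms": re.compile(r"\b(sms|text|message)\b", re.IGNORECASE),
    "alarm": re.compile(r"\b(alarm|wake me)\b", re.IGNORECASE),
    "calendar": re.compile(r"\b(calendar|schedule|meeting|event)\b", re.IGNORECASE),
}

# Intent label -> Android SDK interface, as named in Section 3.6.
INTENT_TO_API = {
    "call": "TelecomManager",
    "sms": "SmsManager",
    "alarm": "AlarmManager",
    "calendar": "CalendarContract",
}


def classify_intent(utterance: str) -> Optional[str]:
    """Return an action intent label, or None for general conversation.

    An utterance matching zero or several patterns is treated as
    ambiguous and returns None, leaving it to a fallback classifier.
    """
    matches = [
        label for label, pattern in INTENT_PATTERNS.items()
        if pattern.search(utterance)
    ]
    return matches[0] if len(matches) == 1 else None


def route(utterance: str) -> str:
    """Name the component that would handle the utterance."""
    intent = classify_intent(utterance)
    return INTENT_TO_API[intent] if intent else "LLM"


if __name__ == "__main__":
    for text in ["call Pooja", "set alarm at 7 AM", "What is photosynthesis?"]:
        print(text, "->", route(text))
```

Keeping the label-to-interface mapping in a single table means a new action type needs only one pattern and one table entry; in the real app each table value would be a Kotlin handler rather than a string.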
All necessary permissions are declared in the app manifest and requested at runtime in accordance with Android’s security model. Every API call is handled locally by the operating system; no step in the pipeline requires network access, preserving the system’s end-to-end offline guarantee.

4. Implementation

This section describes how PocketLLM’s components are realised in practice, covering the technology stack, model selection rationale, inference integration, RAG pipeline construction, and deployment approach.

4.1 Technology Stack

PocketLLM is built on a hybrid stack chosen to balance runtime performance with ease of Android system integration. Table 1 summarises the key components.

Table 1: PocketLLM Technology Stack

| Component | Technology Used | Purpose |
| --- | --- | --- |
| Application Layer | Kotlin (Android SDK) | Core app logic and API integration |
| UI Framework | Jetpack Compose | Declarative user interface |
| Inference Engine | llama.cpp (C++) | On-device LLM inference |
| Native Bridge | JNI + Android NDK + CMake | Kotlin ↔ C++ integration |
| LLM Model | Arcee Lite 1.7B (Q8 GGUF) | Language model for responses |
| Embedding Model | all-MiniLM-L6-v2 (ONNX) | Query embedding for RAG |
| Vector Storage | SQLite | Local embedding storage |
| Preprocessing | Python + SentenceTransformers | Dataset embedding generation |
| Website | Next.js + Vercel | APK distribution |

4.2 Model Selection

Four quantised models were evaluated on a 6 GB RAM Android device: Gemma 2B, TinyLLaMA 1.1B, Phi-2, and Arcee Lite 1.7B [8]. The primary evaluation criteria were instruction-following quality, memory consumption, and inference latency. Table 2 summarises the results.
Table 2: Model Comparison

| Model | Parameters | Size | Performance | Reason selected / rejected |
| --- | --- | --- | --- | --- |
| Gemma 2B | 2B | ~1.4 GB | Moderate | Lower instruction accuracy |
| TinyLLaMA | 1.1B | ~0.7 GB | Fast, limited | Weak response quality |
| Phi-2 | 2.7B | ~1.6 GB | High quality | High memory usage |
| Arcee Lite (selected) | 1.7B | ~1.8 GB | Balanced | Best accuracy–efficiency trade-off |

Arcee Lite 1.7B in Q8 GGUF format [8] emerged as the best option: it outperformed smaller models on instruction-following and response coherence while keeping the model file to roughly 1.8 GB. Larger models such as Phi-2 were excluded because they routinely exhausted available RAM during extended conversations.

4.3 On-Device Inference Implementation

The llama.cpp library [9] is compiled as a native shared object (.so) using the Android NDK and integrated through a custom JNI wrapper. The wrapper exposes methods for model loading, input tokenisation, prompt formatting, and autoregressive decoding. Generated tokens are streamed incrementally back to the Kotlin layer so that responses appear progressively in the UI rather than after a long blocking wait. All inference runs on a dedicated background thread, keeping the main thread, and thus the UI, fully responsive.

4.4 RAG Pipeline Implementation

Domain datasets (agriculture, medical, computer science, and KMIT institutional content) are preprocessed offline in Python. Each document is chunked and embedded using the all-MiniLM-L6-v2 sentence transformer [6][7], and the resulting vectors are stored as BLOBs in SQLite together with their source text. At inference time the ONNX-exported embedding model encodes the user’s query on-device; a linear scan of stored vectors with cosine similarity identifies the top-matching passage, which is then prepended to the LLM prompt to guide the response [5].
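The retrieval step described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' code: the table name `passages`, its columns, the float32 BLOB encoding, and the prompt template are hypothetical, and toy 3-dimensional vectors stand in for the 384-dimensional embeddings that all-MiniLM-L6-v2 produces.

```python
import math
import sqlite3
import struct


def pack(vec):
    """Serialise a float vector to a little-endian float32 BLOB."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack(blob):
    """Inverse of pack(): 4 bytes per float32 component."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_passage(conn, query_vec):
    """Linear scan over all stored vectors; return (score, text) of the best."""
    best = (-1.0, None)
    for text, blob in conn.execute("SELECT text, embedding FROM passages"):
        score = cosine(query_vec, unpack(blob))
        if score > best[0]:
            best = (score, text)
    return best


def build_prompt(passage, question):
    """Prepend the retrieved passage to the user's question."""
    return f"Context: {passage}\n\nQuestion: {question}"


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE passages (text TEXT, embedding BLOB)")
    # Toy vectors; real entries would come from the embedding model.
    rows = [
        ("Rice grows best in flooded paddies.", [0.9, 0.1, 0.0]),
        ("A stack is a last-in, first-out structure.", [0.0, 0.2, 0.9]),
    ]
    conn.executemany(
        "INSERT INTO passages VALUES (?, ?)", [(t, pack(v)) for t, v in rows]
    )
    _score, passage = top_passage(conn, [0.1, 0.1, 0.8])
    return build_prompt(passage, "What is a stack?")


if __name__ == "__main__":
    print(demo())
```

A linear scan costs one dot product per stored passage, which is a reasonable choice for a small bundled knowledge base and avoids shipping an approximate nearest-neighbour library.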
4.5 Intent Classification and API Integration

Intent detection uses a hybrid strategy: rule-based keyword matching handles common, unambiguous commands (e.g., “call Pooja”, “set alarm at 7 AM”), while a model-based classifier resolves edge cases with more nuanced phrasing. Detected intents are mapped to four Android SDK interfaces: TelecomManager, SmsManager, CalendarContract, and AlarmManager, all invoked locally. The system requests only the permissions strictly needed for each action, following Android’s principle of least privilege.

4.6 Deployment

The final APK bundles the quantised model, ONNX embedding model, and SQLite knowledge databases into a single self-contained package. It is distributed through a companion website built with Next.js and hosted on Vercel (https://pocketllmapp.vercel.app). Minimum requirements are 6 GB RAM and Android 8.0 (API 26) or higher.

5. Evaluation

We evaluated PocketLLM across three dimensions (inference performance, task execution accuracy, and RAG retrieval quality) and compared it against a cloud-based baseline and a lighter on-device model. All experiments were run on a physical Android device under consistent, controlled conditions.

5.1 Experimental Setup

The test device had the following configuration:

- Device: Snapdragon 700-series Android smartphone
- RAM: 6 GB
- Processor: Octa-core CPU (~2.2 GHz)
- OS: Android 13
- Model: Arcee Lite 1.7B (Q8 GGUF) [8], ~1.8 GB on disk
- Inference Engine: llama.cpp [9] via JNI
- RAG Backend: SQLite with all-MiniLM-L6-v2 embeddings [6][7]

The evaluation dataset comprised 200 queries spanning general conversational inputs and task-oriented commands. A separate set of 50 labelled action commands was used exclusively for intent classification evaluation.

5.2 Inference Performance

Table 3: Inference Performance Metrics

| Metric | Value |
| --- | --- |
| Avg. Response Latency (No RAG) | 5.1 sec |
| Avg. Response Latency (With RAG) | 6.2 sec |
| Model Load Time | 18–22 sec |
| Peak Memory Usage | 4.2 GB |
| Average Memory Usage | 3.6 GB |

Enabling RAG adds approximately 1.1 seconds of latency, attributable to on-device query embedding and similarity search. Despite this overhead, both configurations operate within a 6 GB RAM budget, confirming real-world feasibility on mid-range hardware.

5.3 Task Execution Accuracy

The intent classification module correctly identified 45 out of 50 test commands (90.0% by raw count). The reported overall accuracy of 91.5% corresponds to the unweighted mean of the per-category accuracies. As illustrated in Figure 5.1, alarm-related commands achieved the highest accuracy (94%), followed by call commands (92%). SMS and calendar tasks demonstrated slightly lower accuracy (90% each), reflecting the increased variability in natural language expressions associated with these actions. Most classification errors occurred in ambiguous or multi-intent queries, where the distinction between tasks was less explicit.

5.4 RAG Retrieval Quality

Table 4: RAG Retrieval Quality by Query Type

| Query Type | LLM Only | RAG Enabled |
| --- | --- | --- |
| General | Moderate | Moderate |
| CS Domain | Low | High |
| Institutional (KMIT) | Incorrect | Accurate |
| Agriculture / Medical | Partial | Relevant |

Improvement: ~23% increase in response relevance for domain-specific queries.

RAG augmentation had minimal effect on general-knowledge queries, where the base model’s parametric knowledge was already sufficient. For domain-specific queries, however, the difference was striking: CS technical questions improved from low to high relevance, institutional queries (e.g., KMIT-specific information) shifted from incorrect to accurate, and agricultural/medical questions moved from partial to relevant responses. The aggregate qualitative improvement across domain-specific queries was approximately 23%, measured by pairwise relevance scoring comparing baseline and RAG-augmented outputs.

5.5 Comparison with Alternatives

Table 5: PocketLLM vs. Cloud AI

| Metric | PocketLLM | Cloud AI |
| --- | --- | --- |
| Latency | 6.2 sec | 2–3 sec |
| Internet Required | No | Yes |
| Privacy | High (on-device) | Low (server-side) |
| Offline Availability | Always | Network-dependent |

Table 6: PocketLLM vs. TinyLLaMA Baseline

| Metric | Arcee Lite (1.7B) | TinyLLaMA (1.1B) |
| --- | --- | --- |
| Response Quality | High | Moderate |
| Latency | 6.2 sec | 4.8 sec |
| Contextual Depth | Better | Limited |

Cloud-based systems offer lower latency (2–3 seconds versus 6.2 seconds), but they require a persistent internet connection and transmit every user query to an external server. PocketLLM accepts the latency trade-off in exchange for complete privacy and unconditional offline availability. Against TinyLLaMA, Arcee Lite is slower due to its larger size, but the improvement in response quality and contextual understanding makes it the better choice for real-world deployment.

5.6 Summary

Taken together, the evaluation results confirm that PocketLLM strikes a workable balance between latency, response quality, and privacy. A 6.2-second average response time is acceptable for a fully offline system; 91.5% intent classification accuracy is sufficient for reliable daily use; and the 23% RAG improvement translates to meaningfully better answers on the domain-specific content most likely to require factual precision.

6. Discussion

The evaluation results indicate that PocketLLM’s core objective, a fully offline, privacy-first mobile AI assistant, is achievable with current hardware and open-source tooling. Intent classification accuracy above 91% is sufficient for dependable everyday use across the four supported action types, and the RAG pipeline delivers meaningful improvements precisely where the base model is weakest: on narrow, fact-sensitive domain queries. The seamless tie-in to Android system APIs moves the system beyond a conversational curiosity into something that can genuinely replace cloud-based voice assistants for common tasks.

Several limitations remain.
Quantising a model to Q8 inevitably sacrifices some precision compared to a full-precision version [10], and this occasionally surfaces as subtly degraded reasoning on complex prompts. Because the system is fully offline, it cannot access real-time information or dynamically updated data; its knowledge is fixed at preprocessing time. The ~4.2 GB peak RAM usage also limits deployability on budget devices with 4 GB or less of RAM, which still account for a significant share of the global Android installed base.

These trade-offs are deliberate. PocketLLM prioritises data sovereignty and offline reliability over raw capability and low latency, the opposite of the trade-offs made by cloud assistants. For users in privacy-sensitive contexts or low-connectivity environments, this inversion of priorities is precisely what makes the system valuable.

7. Future Work

Several directions could extend PocketLLM’s reach and capability while preserving its offline, privacy-first design:

- Multilingual Support: Extending voice and text interaction to regional and global languages would broaden accessibility considerably, particularly in markets where English-language models currently dominate.
- On-Device Personalisation: Lightweight adapter fine-tuning or retrieval-based personalisation could allow the assistant to tailor its responses to individual users’ preferences and habits without transmitting any data off-device.
- Smart Home Integration: Adding natural language control of IoT devices (lights, appliances, sensors) via local protocols such as Matter or Bluetooth would extend the system’s utility beyond the smartphone itself.
- Offline Document Understanding (OCR): Incorporating an on-device OCR module would let users query the contents of photos, scanned documents, or screenshots directly through the assistant.
- Low-End Device Optimisation: More aggressive quantisation strategies [10] and structured pruning could reduce peak RAM usage enough to support 4 GB devices, dramatically widening the potential user base.

8. Conclusion

We presented PocketLLM, a fully offline, privacy-preserving AI assistant for Android that combines on-device LLM inference [8][9] with a local RAG pipeline [5][6][7] and direct Android API integration. Experiments on a mid-range smartphone demonstrate 91.5% intent classification accuracy, a 6.2-second average response latency, and a 23% improvement in domain-specific response quality through RAG augmentation, all without any network access. These results show that practical, intelligent assistants can be built entirely on-device using quantised models and efficient retrieval, without sacrificing the user experience to an unworkable degree. By eliminating cloud dependency, PocketLLM provides a genuinely private alternative to conventional AI assistants and offers a starting point for future research into scalable, efficient, and privacy-centric on-device intelligence.

Declarations

No human subjects were involved in this research. No personal user data was collected or analysed.
Data Availability: Source code and APK are available at https://pocketllmapp.vercel.app
Conflict of Interest: The authors declare no conflict of interest.
Funding: This research received no external funding.

References

[1] T. Marques, S. Carreira, C. Grilo, and J. Ribeiro, "Revolutionizing Mobile Interaction: Enabling a 3 Billion Parameter GPT LLM on Mobile," arXiv preprint arXiv:2310.01434, 2023.
[2] Y. Song, Y. Mi, H. Xie, and X. Jiang, "PowerInfer-2: Fast Large Language Model Inference on a Smartphone," arXiv preprint arXiv:2406.06282, 2024.
[3] X. Zhao, Q. He, X. Chen, M. Xu, J. Ou, J. Yu, and J. Wan, "LLMCad: Fast and Scalable On-device Large Language Model Inference," arXiv preprint arXiv:2309.04255, 2023.
[4] L. Li, R. Qin, W. Deng, L. Wen, Q. Su, Y. Wan, and M. Cheng, "Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs," arXiv preprint arXiv:2403.20041, 2024.
[5] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 9459–9474, 2020.
[6] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," in Proc. 2019 Conf. on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3982–3992, Nov. 2019.
[7] sentence-transformers, "all-MiniLM-L6-v2," Hugging Face, 2021. [Online]. Available: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[8] arcee-ai, "Arcee-Lite: A Compact 1.5B Parameter Language Model," Hugging Face, 2024. [Online]. Available: https://huggingface.co/arcee-ai/arcee-lite
[9] G. Gerganov, "llama.cpp: LLM Inference in C/C++," GitHub, 2023. [Online]. Available: https://github.com/ggerganov/llama.cpp
[10] D. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," arXiv preprint arXiv:2210.17323, 2022.
[11] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
[12] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, G. Zheng, H. Zhang, J. E. Gonzalez, and I. Stoica, "Efficient Memory Management for Large Language Model Serving with PagedAttention," in Proc. 29th ACM Symposium on Operating Systems Principles (SOSP), 2023.
[13] E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-Rank Adaptation of Large Language Models," in Proc. 10th International Conference on Learning Representations (ICLR), 2022.
[14] S. Mehta and M.
Hannan, "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer," arXiv preprint arXiv:2110.02178, 2021.
[15] W. Yin, Z. Li, and M. Guo, "LLM as a System Service on Mobile Devices," arXiv preprint arXiv:2403.11805, 2024.

Additional Declarations

The authors declare no competing interests.
Users can hold natural conversations with the assistant and also issue commands that are translated directly into device actions — calling a contact, composing an SMS, scheduling an event, or setting an alarm — without any data ever leaving the handset. The main contributions of this work are as follows:\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003eOn-device inference of the Arcee Lite 1.7B model [8] using llama.cpp [9], enabling efficient execution of quantized LLMs on Android devices.\u003c/li\u003e\n \u003cli\u003eIntegration of a local Retrieval-Augmented Generation pipeline [5] using SQLite and all-MiniLM-L6-v2 embeddings [6][7] for more contextually accurate domain-specific responses.\u003c/li\u003e\n \u003cli\u003eSeamless Android system integration, supporting natural language-driven execution of phone calls, SMS, alarms, and calendar events via standard Android SDK APIs.\u003c/li\u003e\n \u003cli\u003eA companion website for APK distribution, making the app accessible without requiring a Play Store listing.\u003c/li\u003e\n\u003c/ul\u003e"},{"header":"2. Related Work","content":"\u003ch2\u003e2.1 On-Device LLM Inference\u003c/h2\u003e\n\u003cp\u003eInterest in running large language models on edge hardware has grown considerably over the past two years. LLMCad [3] tackled this by distributing computation across a device’s available cores, improving efficiency on memory-constrained generative tasks. PowerInfer-2 [2] went further with a neuron-cluster scheduling strategy that dynamically activates only the model weights most relevant to a given token, allowing models larger than device RAM to run without catastrophic slowdowns. 
Transformer-Lite [4] explored aggressive low-bit quantization — including FP4 formats — to fit LLM inference within the limited bandwidth of mobile GPUs.\u003c/p\u003e\n\u003cp\u003eThe llama.cpp framework [9] has become the de facto standard for CPU-based quantized inference on edge devices, offering well-tested support for the GGUF model format across ARM and x86 architectures. Research on post-training quantization [10] and parameter-efficient adaptation [13] has further refined the accuracy-efficiency trade-off. What these works generally leave unaddressed, however, is how the resulting on-device model should integrate with real-world device functionality. PocketLLM builds on this inference foundation and extends it with a full task-execution layer.\u003c/p\u003e\n\u003ch2\u003e2.2 Mobile AI Assistants\u003c/h2\u003e\n\u003cp\u003eConversational AI assistants have evolved from rigid rule-based systems into deep-learning-powered agents capable of handling open-ended dialogue. Cloud-hosted products such as Siri and Google Assistant demonstrate the value of natural language interfaces but also exemplify the privacy and latency issues that motivate on-device alternatives. The closest prior work to our own is that of Marques et al. [1], who ran a fine-tuned 3B GPT model on a smartphone using llama.cpp and LoRA [13], demonstrating basic text-to-action features such as placing calls and adding calendar entries. Yin et al. [15] explored treating an on-device LLM as a shared system service to reduce redundant model loading across apps. While these systems prove the concept of mobile LLM assistants, they stop short of integrating a domain-specific offline RAG pipeline alongside multi-API task execution — the combination that defines PocketLLM.\u003c/p\u003e\n\u003ch2\u003e2.3 Retrieval-Augmented Generation (RAG)\u003c/h2\u003e\n\u003cp\u003eThe foundational RAG framework, introduced by Lewis et al. 
[5], demonstrated that pre-trained generative models can be substantially improved on knowledge-intensive tasks by retrieving relevant passages at inference time. Subsequent work on dense embedding models, notably Sentence-BERT [6] and its distilled successors such as all-MiniLM-L6-v2 [7], made efficient semantic similarity search practical even in resource-limited settings. Server-side RAG pipelines commonly rely on approximate nearest-neighbour stores such as FAISS; on mobile, however, such libraries introduce significant binary size and memory overhead. PocketLLM instead stores precomputed embeddings directly in SQLite as binary blobs, supporting cosine similarity retrieval with no external dependencies and no network access required.\u003c/p\u003e"},{"header":"3. System Architecture","content":"\u003ch2\u003e3.1 Overview\u003c/h2\u003e\n\u003cp\u003ePocketLLM is designed as a self-contained intelligent assistant for Android, with a strict requirement that no user data ever leaves the device. 
The system is organised into five layers that process a user\u0026rsquo;s input sequentially: the User Interface Layer accepts voice or typed input; the Intent Classification Layer determines the type of response needed; the On-Device LLM Inference Engine generates the textual reply; the RAG Pipeline optionally augments that reply with locally retrieved knowledge; and the Android API Integration Layer converts actionable intents into real device operations such as calls, messages, and calendar events.\u003c/p\u003e\n\u003cp\u003eThis layered design keeps each concern cleanly separated while ensuring that the full pipeline \u0026mdash; from microphone input to device action \u0026mdash; runs entirely inside the user\u0026rsquo;s handset, with no outbound network traffic at any stage.\u003c/p\u003e\n\u003ch2\u003e3.2 User Interface Layer\u003c/h2\u003e\n\u003cp\u003eThe UI is built with Kotlin and Jetpack Compose, offering both a text-based chat interface and a voice-input mode powered by Android\u0026rsquo;s on-device speech-to-text engine. Spoken utterances are transcribed locally before being passed downstream; model outputs are displayed as chat bubbles and can optionally be read aloud via the device\u0026rsquo;s text-to-speech engine, completing the conversational loop entirely on-device.\u003c/p\u003e\n\u003ch2\u003e3.3 Intent Classification\u003c/h2\u003e\n\u003cp\u003eThe Intent Classification Layer acts as a semantic router. It normalises incoming text and determines whether the user intends a general conversational exchange or a concrete device action \u0026mdash; currently covering phone calls, SMS composition, alarm scheduling, and calendar event creation. 
The detected intent label is forwarded alongside the original query to the inference engine to guide prompt formatting and, for action intents, also triggers the corresponding Android API call.\u003c/p\u003e\n\u003ch2\u003e3.4 On-Device LLM Inference\u003c/h2\u003e\n\u003cp\u003eThe core reasoning capability is provided by Arcee Lite 1.7B [8], quantised to Q8 GGUF format, which yields a model file of approximately 1.8 GB stored on internal flash. Arcee Lite was selected over Gemma 2B, TinyLLaMA 1.1B, and Phi-2 following an empirical comparison that weighed instruction-following quality, response coherence, and peak memory consumption on mid-range hardware (see Section 4.2 for details).\u003c/p\u003e\n\u003cp\u003eInference is handled by llama.cpp [9], compiled as a native shared library for Android using the NDK, CMake, and Ninja. A JNI bridge exposes model loading, tokenisation, prompt formatting, and token sampling to the Kotlin application layer. All LLM computation runs on the device CPU in a background thread, keeping the UI responsive and ensuring that no query or response is ever transmitted externally.\u003c/p\u003e\n\u003ch2\u003e3.5 Retrieval-Augmented Generation (RAG) Pipeline\u003c/h2\u003e\n\u003cp\u003eThe RAG pipeline supplements the model\u0026rsquo;s parametric knowledge with domain-specific content drawn from a local knowledge base. When RAG is enabled, the top-matching passage is prepended to the prompt before inference; when it is disabled, the model generates responses from its weights alone, reducing latency.\u003c/p\u003e\n\u003cp\u003ePassage embeddings are produced offline using the all-MiniLM-L6-v2 sentence transformer [6][7] and stored alongside their source text as BLOBs in a SQLite database; the same model is exported to ONNX so that queries can be embedded on the Android device. The knowledge base covers four domains: Agricultural Science, Medical Science, Computer Science, and KMIT Institutional data. 
At inference time, the user\u0026rsquo;s query is embedded using the same ONNX model and compared against the stored vectors via cosine similarity; the highest-scoring passage is prepended to the LLM prompt. This approach eliminates the need for cloud-hosted vector stores entirely [5].\u003c/p\u003e\n\u003ch2\u003e3.6 Android API Integration\u003c/h2\u003e\n\u003cp\u003eOnce an action intent is confirmed, the API Integration Layer invokes the appropriate Android SDK interface: TelecomManager for outgoing calls, SmsManager for text messages, CalendarContract for event creation, and AlarmManager for alarm scheduling. All necessary permissions are declared in the app manifest and requested at runtime in accordance with Android\u0026rsquo;s security model. Every API call is handled locally by the operating system; no step in the pipeline requires internet access, preserving the system\u0026rsquo;s end-to-end offline guarantee.\u003c/p\u003e"},{"header":"4. Implementation","content":"\u003cp\u003eThis section describes how PocketLLM’s components are realised in practice, covering the technology stack, model selection rationale, inference integration, RAG pipeline construction, and deployment approach.\u003c/p\u003e\n\u003ch2\u003e4.1 Technology Stack\u003c/h2\u003e\n\u003cp\u003ePocketLLM is built on a hybrid stack chosen to balance runtime performance with ease of Android system integration. 
Table 1 summarises the key components.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1: PocketLLM Technology Stack\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eComponent\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eTechnology Used\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003ePurpose\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eApplication Layer\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eKotlin (Android SDK)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eCore app logic and API integration\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eUI Framework\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eJetpack Compose\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eDeclarative user interface\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eInference Engine\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ellama.cpp (C++)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eOn-device LLM inference\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eNative Bridge\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eJNI + Android NDK + CMake\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eKotlin ↔ C++ integration\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eLLM Model\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eArcee Lite 1.7B (Q8 GGUF)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLanguage model for responses\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eEmbedding Model\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eall-MiniLM-L6-v2 (ONNX)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eQuery embedding for RAG\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eVector Storage\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSQLite\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLocal embedding storage\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003ePreprocessing\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ePython + SentenceTransformers\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eDataset embedding generation\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eWebsite\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNext.js + Vercel\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAPK distribution\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003ch2\u003e4.2 Model Selection\u003c/h2\u003e\n\u003cp\u003eFour quantised models were evaluated on a 6 GB RAM Android device: Gemma 2B, TinyLLaMA 1.1B [3], Phi-2, and Arcee Lite 1.7B [8]. The primary evaluation criteria were instruction-following quality, memory consumption, and inference latency. 
Table 2 summarises the results.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2: Model Comparison\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eModel\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eParameters\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eSize\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003ePerformance\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eSelection Reason\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eGemma 2B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e~1.4 GB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLower instruction accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eTinyLLaMA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e1.1B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e~0.7 GB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eFast, limited\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eWeak response quality\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003ePhi-2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.7B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e~1.6 GB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigh quality\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigh memory usage\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eArcee Lite ✓\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e1.7B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e~1.8 GB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eBalanced\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eBest accuracy–efficiency trade-off\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eArcee Lite 1.7B in Q8 GGUF format [8] emerged as the best option: it outperformed smaller models on instruction-following and response coherence while keeping the quantised model weights to roughly 1.8 GB (peak application memory is reported in Table 3). Larger models such as Phi-2 were excluded because they routinely exhausted available RAM during extended conversations.\u003c/p\u003e\n\u003ch2\u003e4.3 On-Device Inference Implementation\u003c/h2\u003e\n\u003cp\u003eThe llama.cpp library [9] is compiled as a native shared object (.so) using the Android NDK and integrated through a custom JNI wrapper. The wrapper exposes methods for model loading, input tokenisation, prompt formatting, and autoregressive decoding. Generated tokens are streamed incrementally back to the Kotlin layer so that responses appear progressively in the UI rather than after a long blocking wait. All inference runs on a dedicated background thread, keeping the main thread, and thus the UI, fully responsive.\u003c/p\u003e\n\u003ch2\u003e4.4 RAG Pipeline Implementation\u003c/h2\u003e\n\u003cp\u003eDomain datasets (agriculture, medical, computer science, and KMIT institutional content) are preprocessed offline in Python. Each document is chunked and embedded using the all-MiniLM-L6-v2 sentence transformer [6][7], and the resulting vectors are stored as BLOBs in SQLite together with their source text. 
At inference time the ONNX-exported embedding model encodes the user’s query on-device; a linear scan of stored vectors with cosine similarity identifies the top-matching passage, which is then prepended to the LLM prompt to guide the response [5].\u003c/p\u003e\n\u003ch2\u003e4.5 Intent Classification and API Integration\u003c/h2\u003e\n\u003cp\u003eIntent detection uses a hybrid strategy: rule-based keyword matching handles common, unambiguous commands (e.g., “call Pooja”, “set alarm at 7 AM”), while a model-based classifier resolves edge cases with more nuanced phrasing. Detected intents are mapped to the four Android SDK interfaces described in Section 3.6: TelecomManager, SmsManager, CalendarContract, and AlarmManager, all invoked locally. The system requests only the permissions strictly needed for each action, following Android’s principle of least privilege.\u003c/p\u003e\n\u003ch2\u003e4.6 Deployment\u003c/h2\u003e\n\u003cp\u003eThe final APK bundles the quantised model, ONNX embedding model, and SQLite knowledge databases into a single self-contained package. It is distributed through a companion website built with Next.js and hosted on Vercel (https://pocketllmapp.vercel.app). Minimum requirements are 6 GB RAM and Android 8.0 (API 26) or higher.\u003c/p\u003e"},{"header":"5. Evaluation","content":"\u003cp\u003eWe evaluated PocketLLM across three dimensions (inference performance, task execution accuracy, and RAG retrieval quality) and compared it against a cloud-based baseline and a lighter on-device model. 
All experiments were run on a physical Android device under consistent, controlled conditions.\u003c/p\u003e\n\u003ch2\u003e5.1 Experimental Setup\u003c/h2\u003e\n\u003cp\u003eThe test device had the following configuration:\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003eDevice: Snapdragon 700-series Android smartphone\u003c/li\u003e\n \u003cli\u003eRAM: 6 GB | Processor: Octa-core CPU (~2.2 GHz)\u003c/li\u003e\n \u003cli\u003eOS: Android 13\u003c/li\u003e\n \u003cli\u003eModel: Arcee Lite 1.7B (Q8 GGUF) [8] \u0026mdash; ~1.8 GB on disk\u003c/li\u003e\n \u003cli\u003eInference Engine: llama.cpp [9] via JNI\u003c/li\u003e\n \u003cli\u003eRAG Backend: SQLite with all-MiniLM-L6-v2 embeddings [6][7]\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe evaluation dataset comprised 200 queries spanning general conversational inputs and task-oriented commands. A separate set of 50 labelled action commands was used exclusively for intent classification evaluation.\u003c/p\u003e\n\u003ch2\u003e5.2 Inference Performance\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3: Inference Performance Metrics\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMetric\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eValue\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAvg. Response Latency (No RAG)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e5.1 sec\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAvg. 
Response Latency (With RAG)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e6.2 sec\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eModel Load Time\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e18\u0026ndash;22 sec\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003ePeak Memory Usage\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.2 GB\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAverage Memory Usage\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e3.6 GB\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eEnabling RAG adds approximately 1.1 seconds of latency (6.2 versus 5.1 seconds), attributable to on-device query embedding and similarity search. Despite this overhead, both configurations fit within a 6 GB RAM budget at a 4.2 GB peak, supporting real-world feasibility on mid-range hardware.\u003c/p\u003e\n\u003ch2\u003e5.3 Task Execution Accuracy\u003c/h2\u003e\n\u003cp\u003eThe intent classification module correctly identified 45 out of 50 test commands (90%); the reported overall accuracy of 91.5% matches the unweighted mean of the four per-category accuracies. As illustrated in Figure 5.1, alarm-related commands achieved the highest accuracy (94%), followed by call commands (92%). SMS and calendar tasks demonstrated slightly lower accuracy (90% each), reflecting the increased variability in natural language expressions associated with these actions. 
Most classification errors occurred in ambiguous or multi-intent queries, where the distinction between tasks was less explicit.\u003c/p\u003e\n\u003ch2\u003e5.4 RAG Retrieval Quality\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eTable 4: RAG Retrieval Quality by Query Type\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eQuery Type\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eLLM Only\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eRAG Enabled\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eGeneral\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eCS Domain\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLow\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigh\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eInstitutional (KMIT)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eIncorrect\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAccurate\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAgriculture / Medical\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ePartial\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eRelevant\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cem\u003eImprovement: ~23% increase in response relevance for domain-specific 
queries\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eRAG augmentation had minimal effect on general-knowledge queries, where the base model\u0026rsquo;s parametric knowledge was already sufficient. For domain-specific queries, however, the difference was striking: CS technical questions improved from low to high relevance, institutional queries (e.g., KMIT-specific information) shifted from incorrect to accurate, and agricultural/medical questions moved from partial to relevant responses. The aggregate qualitative improvement across domain-specific queries was approximately 23%, measured by pairwise relevance scoring comparing baseline and RAG-augmented outputs.\u003c/p\u003e\n\u003ch2\u003e5.5 Comparison with Alternatives\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eTable 5: PocketLLM vs. Cloud AI\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMetric\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003ePocketLLM\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eCloud AI\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eLatency\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e6.2 sec\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2\u0026ndash;3 sec\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eInternet Required\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNo\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eYes\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003ePrivacy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigh 
(on-device)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLow (server-side)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eOffline Availability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAlways\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNetwork-dependent\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eTable 6: PocketLLM vs. TinyLLaMA Baseline\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\" class=\"fr-table-selection-hover\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 31.2499%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMetric\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 32.0834%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eArcee Lite (1.7B)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eTinyLLaMA (1.1B)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 31.2499%;\"\u003e\n \u003cp\u003eResponse Quality\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 32.0834%;\"\u003e\n \u003cp\u003eHigh\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 31.2499%;\"\u003e\n \u003cp\u003eLatency\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 32.0834%;\"\u003e\n \u003cp\u003e6.2 sec\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.8 sec\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 31.2499%;\"\u003e\n \u003cp\u003eContextual Depth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 32.0834%;\"\u003e\n 
\u003cp\u003eBetter\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLimited\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eCloud-based systems offer lower latency (2\u0026ndash;3 seconds versus 6.2 seconds), but they require a persistent internet connection and transmit every user query to an external server. PocketLLM accepts the latency trade-off in exchange for complete privacy and unconditional offline availability. Against TinyLLaMA, Arcee Lite is slower due to its larger size, but the improvement in response quality and contextual understanding makes it the better choice for real-world deployment.\u003c/p\u003e\n\u003ch2\u003e5.6 Summary\u003c/h2\u003e\n\u003cp\u003eTaken together, the evaluation results confirm that PocketLLM strikes a workable balance between latency, response quality, and privacy. A 6.2-second average response time is acceptable for a fully offline system; 91.5% intent classification accuracy is sufficient for reliable daily use; and the 23% RAG improvement translates to meaningfully better answers on the domain-specific content most likely to require factual precision.\u003c/p\u003e"},{"header":"6. Discussion","content":"\u003cp\u003eThe evaluation results indicate that PocketLLM’s core objective — a fully offline, privacy-first mobile AI assistant — is achievable with current hardware and open-source tooling. Intent classification accuracy above 91% is sufficient for dependable everyday use across the four supported action types, and the RAG pipeline delivers meaningful improvements precisely where the base model is weakest: on narrow, fact-sensitive domain queries. The seamless tie-in to Android system APIs moves the system beyond a conversational curiosity into something that can genuinely replace cloud-based voice assistants for common tasks.\u003c/p\u003e\n\u003cp\u003eSeveral limitations remain. 
Quantising a model to Q8 inevitably sacrifices some precision compared to a full-precision version [10], and this occasionally surfaces as subtly degraded reasoning on complex prompts. Because the system is fully offline, it cannot access real-time information or dynamically updated data; its knowledge is fixed at preprocessing time. The ~4.2 GB peak RAM usage also limits deployability on budget devices with 4 GB or less of RAM, which still account for a significant share of the global Android installed base.\u003c/p\u003e\n\u003cp\u003eThese trade-offs are deliberate. PocketLLM prioritises data sovereignty and offline reliability over raw capability and low latency — the opposite of the trade-offs made by cloud assistants. For users in privacy-sensitive contexts or low-connectivity environments, this inversion of priorities is precisely what makes the system valuable.\u003c/p\u003e"},{"header":"7. Future Work","content":"\u003cp\u003eSeveral directions could extend PocketLLM’s reach and capability while preserving its offline, privacy-first design:\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003e\u003cstrong\u003eMultilingual Support:\u003c/strong\u003e Extending voice and text interaction to regional and global languages would broaden accessibility considerably, particularly in markets where English-language models currently dominate.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eOn-Device Personalisation:\u003c/strong\u003e Lightweight adapter fine-tuning or retrieval-based personalisation could allow the assistant to tailor its responses to individual users’ preferences and habits without transmitting any data off-device.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eSmart Home Integration:\u003c/strong\u003e Adding natural language control of IoT devices (lights, appliances, sensors) via local protocols such as Matter or Bluetooth would extend the system’s utility beyond the smartphone itself.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eOffline 
Document Understanding (OCR):\u003c/strong\u003e Incorporating an on-device OCR module would let users query the contents of photos, scanned documents, or screenshots directly through the assistant.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eLow-End Device Optimisation:\u0026nbsp;\u003c/strong\u003eMore aggressive quantisation strategies [10] and structured pruning could reduce peak RAM usage enough to support 4 GB devices, dramatically widening the potential user base.\u003c/li\u003e\n\u003c/ul\u003e"},{"header":"8. Conclusion","content":"\u003cp\u003eWe presented PocketLLM, a fully offline, privacy-preserving AI assistant for Android that combines on-device LLM inference [9][8] with a local RAG pipeline [5][6][7] and direct Android API integration. Experiments on a mid-range smartphone demonstrate 91.5% intent classification accuracy, a 6.2-second average response latency, and a 23% improvement in domain-specific response quality through RAG augmentation — all without any network access.\u003c/p\u003e\n\u003cp\u003eThese results show that practical, intelligent assistants can be built entirely on-device using quantised models and efficient retrieval, without sacrificing the user experience to an unworkable degree. By eliminating cloud dependency, PocketLLM provides a genuinely private alternative to conventional AI assistants and offers a starting point for future research into scalable, efficient, and privacy-centric on-device intelligence.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eNo human subjects were involved in this research. 
No personal user data was collected or analysed.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability:\u0026nbsp;\u003c/strong\u003eSource code and APK are available at https://pocketllmapp.vercel.app\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of Interest:\u0026nbsp;\u003c/strong\u003eThe authors declare no conflict of interest.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u0026nbsp;\u003c/strong\u003eThis research received no external funding.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eT. Marques, S. Carreira, C. Grilo, and J. Ribeiro, \u0026quot;Revolutionizing Mobile Interaction: Enabling a 3 Billion Parameter GPT LLM on Mobile,\u0026quot; arXiv preprint arXiv:2310.01434, 2023.\u003c/li\u003e\n \u003cli\u003eY. Song, Y. Mi, H. Xie, and X. Jiang, \u0026quot;PowerInfer-2: Fast Large Language Model Inference on a Smartphone,\u0026quot; arXiv preprint arXiv:2406.06282, 2024.\u003c/li\u003e\n \u003cli\u003eX. Zhao, Q. He, X. Chen, M. Xu, J. Ou, J. Yu, and J. Wan, \u0026quot;LLMCad: Fast and Scalable On-device Large Language Model Inference,\u0026quot; arXiv preprint arXiv:2309.04255, 2023.\u003c/li\u003e\n \u003cli\u003eL. Li, R. Qin, W. Deng, L. Wen, Q. Su, Y. Wan, and M. Cheng, \u0026quot;Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs,\u0026quot; arXiv preprint arXiv:2403.20041, 2024.\u003c/li\u003e\n \u003cli\u003eP. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. K\u0026uuml;ttler, M. Lewis, W. Yih, T. Rockt\u0026auml;schel, S. Riedel, and D. Kiela, \u0026quot;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,\u0026quot; in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 9459\u0026ndash;9474, 2020.\u003c/li\u003e\n \u003cli\u003eN. Reimers and I. Gurevych, \u0026quot;Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,\u0026quot; in Proc. 2019 Conf. 
on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3982\u0026ndash;3992, Nov. 2019.\u003c/li\u003e\n \u003cli\u003esentence-transformers, \u0026quot;all-MiniLM-L6-v2,\u0026quot; Hugging Face, 2021. [Online]. Available: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2\u003c/li\u003e\n \u003cli\u003earcee-ai, \u0026quot;Arcee-Lite: A Compact 1.5B Parameter Language Model,\u0026quot; Hugging Face, 2024. [Online]. Available: https://huggingface.co/arcee-ai/arcee-lite\u003c/li\u003e\n \u003cli\u003eG. Gerganov, \u0026quot;llama.cpp: LLM Inference in C/C++,\u0026quot; GitHub, 2023. [Online]. Available: https://github.com/ggerganov/llama.cpp\u003c/li\u003e\n \u003cli\u003eE. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, \u0026quot;GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,\u0026quot; arXiv preprint arXiv:2210.17323, 2022.\u003c/li\u003e\n \u003cli\u003eT. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, \u0026quot;QLoRA: Efficient Finetuning of Quantized LLMs,\u0026quot; in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.\u003c/li\u003e\n \u003cli\u003eW. Kwon, Z. Li, S. Zhuang, Y. Sheng, G. Zheng, H. Zhang, J. E. Gonzalez, and I. Stoica, \u0026quot;Efficient Memory Management for Large Language Model Serving with PagedAttention,\u0026quot; in Proc. 29th ACM Symposium on Operating Systems Principles (SOSP), 2023.\u003c/li\u003e\n \u003cli\u003eE. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, \u0026quot;LoRA: Low-Rank Adaptation of Large Language Models,\u0026quot; in Proc. 10th International Conference on Learning Representations (ICLR), 2022.\u003c/li\u003e\n \u003cli\u003eS. Mehta and M. Rastegari, \u0026quot;MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer,\u0026quot; arXiv preprint arXiv:2110.02178, 2021.\u003c/li\u003e\n \u003cli\u003eW. Yin, Z. Li, and M. 
Guo, \u0026quot;LLM as a System Service on Mobile Devices,\u0026quot; arXiv preprint arXiv:2403.11805, 2024.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Keshav Memorial Institute of Technology","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"On-device AI, Retrieval-Augmented Generation, Large Language Models, Offline AI Assistant, Android, Privacy-Preserving Systems","lastPublishedDoi":"10.21203/rs.3.rs-9575380/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9575380/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eCloud-based AI assistants transmit user data to remote servers, raising significant concerns around privacy, latency, and continuous internet dependency. In this paper, we present \u003cb\u003ePocketLLM\u003c/b\u003e, a fully offline, on-device AI assistant for Android that addresses these issues by running all computations locally on the device. 
The system integrates a quantized large language model (LLM) deployed via llama.cpp [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], combined with a lightweight Retrieval-Augmented Generation (RAG) pipeline [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e] backed by SQLite, to improve contextual understanding without any cloud dependency.\u003c/p\u003e \u003cp\u003ePocketLLM supports natural language task execution, enabling users to make phone calls, send SMS messages, set alarms, and manage calendar events entirely on-device through an intent classification layer. The system runs the Arcee Lite 1.7B model [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] in Q8 GGUF format via a Java Native Interface (JNI) bridge on a standard 6 GB RAM smartphone, staying well within practical mobile hardware limits.\u003c/p\u003e \u003cp\u003eExperiments carried out on 200 test queries show a mean response latency of 6.2 seconds and an intent classification accuracy of 91.5%. RAG-augmented responses outperform vanilla LLM outputs by roughly 23% on domain-specific queries, confirming the value of local knowledge retrieval. 
By removing any reliance on cloud infrastructure, PocketLLM keeps user data private while remaining practically usable, establishing a solid foundation for future offline intelligent assistant research.\u003c/p\u003e","manuscriptTitle":"PocketLLM: A Privacy-Preserving Offline AI Assistant with On-Device LLM Inference and Retrieval-Augmented Generation on Android","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-05-04 05:30:44","doi":"10.21203/rs.3.rs-9575380/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"2db83694-c901-4c9d-95e7-424322e85d0a","owner":[],"postedDate":"May 4th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":67307867,"name":"Artificial Intelligence and Machine Learning"},{"id":67307868,"name":"Information Retrieval and Management"}],"tags":[],"updatedAt":"2026-05-04T05:30:44+00:00","versionOfRecord":[],"versionCreatedAt":"2026-05-04 05:30:44","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9575380","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9575380","identity":"rs-9575380","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
