Running LLMs on-device with Qualcomm Snapdragon 8 Elite
Why on-device LLM inference is changing automotive AI
Large Language Models have traditionally lived in the cloud - massive GPU clusters serving billions of requests through APIs. But for industries like automotive, cloud dependency is not always acceptable. Connectivity fails in tunnels, rural areas, and underground parking structures. Latency spikes are unacceptable for safety-critical driver interactions. And sending private conversation data to external servers raises serious compliance and data sovereignty concerns. This is precisely where on-device LLM inference and edge AI become not just attractive, but essential.
What if the LLM could run directly on a chip inside the vehicle?
We set out to prove this is not only possible, but production-ready. Using Qualcomm’s Snapdragon 8 Elite platform with its dual Hexagon NPU cores, we deployed multiple on-device LLM variants on an Android device and built a fully functional edge AI inference server requiring zero cloud connectivity.
Platform and software stack for on-device AI inference
Our setup relies on the following components:
- Chipset: Qualcomm Snapdragon 8 Elite with dual Hexagon NPU cores
- Operating system: Android
- Inference runtime: Qualcomm Genie - a lightweight on-device engine optimized for autoregressive LLM inference on Hexagon NPU
- Model compilation: Qualcomm AI Hub - a cloud service that compiles and optimizes models for specific Snapdragon chipsets
- Application: Custom-built Android service based on Qualcomm's AI Hub Apps reference architecture, extended with an HTTP server interface
The Genie runtime is the critical piece. It takes QNN context binaries - precompiled, quantized model graphs - and executes them on the NPU with minimal CPU overhead. The model never touches a GPU. All heavy computation runs on dedicated AI silicon.
Llama models for edge AI: 3B vs. 8B on-device parameter comparison
We worked with two models from Meta's Llama family, chosen to represent different points on the size-versus-quality spectrum:
Llama 3.2 3B Instruct - a compact 3-billion parameter model. Quantized to W4A16 precision (4-bit weights, 16-bit activations), it fits comfortably in device memory and delivers fast, responsive inference. Ideal for quick interactions - voice command interpretation, short summaries, simple question answering.
Llama 3.1 8B Instruct - a larger 8-billion parameter model with a 2048-token context window. Also quantized to W4A16, it produces noticeably higher quality responses with better reasoning, longer coherent outputs, and more nuanced instruction following. This model represents a sweet spot between capability and on-device feasibility.
Both models were exported through Qualcomm AI Hub as QNN context binaries, each split across five parts for efficient loading and memory management.
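To make the effect of W4A16 quantization on memory concrete, here is a back-of-the-envelope sketch. It uses the nominal parameter counts (3 billion and 8 billion), which is an assumption, since the exact counts are slightly higher. It counts weights only and ignores the KV cache, activations, and any tensors kept at higher precision, so real on-device usage will be larger.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts (assumption: rounded figures from the model names).
MODELS = {
    "Llama 3.2 3B Instruct": 3e9,
    "Llama 3.1 8B Instruct": 8e9,
}

for name, params in MODELS.items():
    fp16_gb = weight_footprint_gb(params, 16)  # unquantized 16-bit weights
    w4_gb = weight_footprint_gb(params, 4)     # 4-bit weights, as in W4A16
    print(f"{name}: ~{fp16_gb:.1f} GB at 16-bit, ~{w4_gb:.1f} GB at 4-bit")
```

By this estimate, 4-bit weights take a quarter of the space of 16-bit weights, which is what brings an 8B model within reach of a mobile device's memory.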
Compiling LLMs for Snapdragon 8 Elite with Qualcomm AI Hub
The path from open-source model weights to on-device NPU execution is streamlined through Qualcomm's toolchain. The qai_hub_models Python package provides export scripts for supported models. A single command handles quantization, optimization, and compilation:
python -m qai_hub_models.models.llama_v3_1_8b_instruct.export --chipset qualcomm-snapdragon-8-elite --skip-profiling --output-dir genie_bundle
The entire export process - from Hugging Face model to device-ready binaries - takes minutes, not days. This dramatically lowers the barrier to experimenting with different models on-device.
Building a self-contained on-device LLM inference server
Rather than building a traditional Android chat application, we took a more versatile approach. We transformed the device into a network-accessible LLM inference server.

The device runs a lightweight HTTP server as an Android foreground service. It can be reached over USB or Ethernet - any machine on the local network can send prompts and receive streaming responses. In an automotive context, this means the LLM service can be accessed by the vehicle's head unit, a diagnostic tool, or any connected system - without requiring a custom client application.
The server exposes a simple REST API:
- GET /health - service health check
- GET /models - list available models on device
- POST /models/load - load a model into NPU memory
- POST /models/unload - unload a model and free NPU memory
- POST /generate - send a prompt, receive streaming SSE response
The /generate endpoint uses Server-Sent Events for real-time token streaming. Each token is pushed to the client the moment it is generated, creating a responsive, conversational experience.
The server also hosts a self-contained web interface at the root path. Opening the device's IP address in any browser presents a chat interface where users can select a model, type a prompt, and watch the response appear token by token. No app installation required on the client side. Just a browser and a network connection.
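For clients other than the browser, consuming the /generate stream described above could look roughly like the sketch below. Several details are assumptions, not the server's documented schema: the JSON request field "prompt", the base URL passed in by the caller, and the idea that each event carries its token text in a plain data: field. The parser follows the basic Server-Sent Events framing (data lines, blank-line event boundaries, comment lines starting with a colon) and ignores other fields.

```python
import json
import urllib.request
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete Server-Sent Event."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line marks the end of the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the payload.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event not terminated by a blank line is discarded.


def stream_generate(base_url: str, prompt: str) -> Iterator[str]:
    """POST a prompt to /generate and yield streamed payloads as they arrive."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")  # assumed request schema
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        text_lines = (raw.decode("utf-8") for raw in response)
        yield from iter_sse_data(text_lines)
```

A caller would iterate over stream_generate with the device's address and a prompt, printing each payload as it arrives. Keeping the parser separate from the network call makes it easy to test against recorded streams.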


On-device LLM performance: Tokens per second on Snapdragon 8 Elite
The results demonstrate that on-device LLM inference on Snapdragon 8 Elite is not just a tech demo - it delivers genuinely usable performance.
Llama 3.2 3B Instruct runs at approximately 10 tokens per second. At this speed, responses feel fluid and interactive. A typical short answer (50–80 tokens) completes in roughly 5 to 8 seconds. For voice assistant scenarios - where the user asks a question and expects a spoken answer - this is more than sufficient. Because tokens stream as they are generated, the answer starts arriving well before the full response is complete.
Llama 3.1 8B Instruct runs at approximately 5 tokens per second. While slower, this is still fast enough for many practical applications. A 100-token response completes in about 20 seconds. More importantly, the quality improvement over the 3B model is substantial - longer coherent reasoning chains, better instruction adherence, and more nuanced responses. For tasks like summarizing a vehicle manual section, explaining a dashboard warning, or having a multi-turn conversation about navigation options, the 8B model's quality advantage justifies the speed tradeoff.
To put these numbers in perspective: 5 tokens per second from an 8-billion parameter model running entirely on a mobile chipset, with zero cloud dependency, zero network latency, and complete data privacy. Two years ago, this would have required a server rack.
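The response times quoted above follow directly from dividing response length by decode speed. The short sketch below reproduces that arithmetic using the approximate throughput figures from our tests. It models token generation only and assumes the time spent processing the prompt before the first token is negligible, which will not hold for long prompts.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate (prompt processing excluded)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Approximate decode rates reported in this article.
MEASURED_TPS = {
    "Llama 3.2 3B Instruct": 10.0,
    "Llama 3.1 8B Instruct": 5.0,
}

for name, tps in MEASURED_TPS.items():
    for tokens in (50, 80, 100):
        seconds = generation_seconds(tokens, tps)
        print(f"{name}: {tokens} tokens in ~{seconds:.0f} s")
```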
Optimizing on-device AI: Genie runtime configuration explained
Genie's behavior is tuned through two configuration files that ship alongside each model.
The Genie config (genie_config.json) controls core runtime parameters: context length in tokens, CPU thread count, CPU core affinity via bit mask, and memory mapping settings.
The HTP (Hexagon Tensor Processor - NPU) backend config (htp_backend_ext_config.json) controls NPU-specific settings: which Hexagon NPU core to use (Snapdragon 8 Elite has two), and the performance profile - "burst" for maximum throughput or "sustained_high_performance" for thermal stability during extended sessions.
These configurations provide fine-grained control over the performance-power-thermal tradeoff - essential in automotive environments where sustained operation and thermal management are critical constraints.
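The CPU core affinity bit mask mentioned above is the least self-explanatory of these settings, so here is a minimal sketch of how such a mask is built: bit N is set when core N is allowed. The core indices used below are hypothetical; which cores are the fastest, and how the mask is spelled in genie_config.json, depends on the device and SDK version and should be checked against Qualcomm's documentation.

```python
from typing import Iterable, List


def cpu_affinity_mask(core_indices: Iterable[int]) -> int:
    """Build an affinity bit mask with bit N set for each allowed core N."""
    mask = 0
    for core in core_indices:
        if core < 0:
            raise ValueError("core index must be non-negative")
        mask |= 1 << core
    return mask


def cores_from_mask(mask: int) -> List[int]:
    """Decode a bit mask back into the sorted list of core indices it allows."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Hypothetical example: pin the runtime's CPU threads to cores 6 and 7.
mask = cpu_affinity_mask([6, 7])
print(hex(mask))              # 0xc0
print(cores_from_mask(mask))  # [6, 7]
```

Decoding an existing mask with cores_from_mask is a quick way to check what a shipped configuration actually pins before changing it.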
Automotive AI use cases enabled by on-device LLM inference
Running an LLM directly in the vehicle opens up application scenarios that are impossible or impractical with cloud-based AI:
Offline voice assistant - a natural language interface that works in tunnels, parking garages, rural areas, and anywhere without cellular coverage. The driver asks a question; the answer comes from the device, not the cloud.
On-board vehicle manual - instead of flipping through a 500-page PDF, the driver or passenger asks "What does the yellow triangle warning light mean?" and gets an immediate, contextual answer.
Predictive maintenance dialogue - the vehicle detects an anomaly and the LLM explains it in natural language: "Your tire pressure is 15% below recommended. This is likely due to the temperature drop overnight. I recommend checking the pressure at your next stop."
Multi-language support - a single quantized multilingual model can serve drivers in any language without requiring separate language packs or cloud translation services.
Privacy-first personal assistant - calendar, contacts, and preferences stay on-device. No conversation transcripts leave the vehicle. This architecture supports GDPR compliance by design.
Diagnostic and service support - technicians can query the vehicle's state in natural language during service appointments, without needing specialized diagnostic software.
Key lessons from deploying LLMs on edge hardware
The export toolchain is mature enough for production exploration. Qualcomm AI Hub abstracts away enormous complexity - quantization, graph optimization, NPU code generation - behind a single export command. Going from a Hugging Face model to running inference on a phone takes less than an hour.
Configuration matters. CPU core affinity, thread count, NPU core assignment, memory mapping, and performance profiles all affect throughput, latency, and thermal behavior. There is no universal "best" configuration - it depends on the model, the workload pattern, and the thermal constraints of the deployment environment.
The hardware is ready. Snapdragon 8 Elite's dual Hexagon NPU cores deliver real, practical LLM inference performance. An 8-billion parameter model generating coherent, high-quality text at 5 tokens per second on a mobile chipset - with headroom for optimization - is a remarkable engineering achievement.
The form factor is transformative. An Android device the size of a credit card, drawing a few watts, running a full LLM inference stack accessible over HTTP. For automotive, industrial IoT, and edge computing, this changes what is possible without cloud infrastructure.
Conclusion: On-device LLM inference is ready for production
We set out to answer a simple question: can you run a real, useful LLM on a Snapdragon 8 Elite, entirely on-device, and make it accessible as a service?
The answer is yes.
With Qualcomm's Genie runtime and AI Hub toolchain, we deployed both a 3-billion and an 8-billion parameter Llama model on an Android device. We built an HTTP server interface that turns the device into a self-contained LLM appliance - accessible from any browser, any connected system, with zero cloud dependency.
The 3B model at 10 tokens per second is fast enough for real-time voice assistant interactions. The 8B model at 5 tokens per second delivers meaningfully better response quality while remaining practical for conversational use. Both run entirely on the NPU, leaving the CPU and GPU free for other tasks.
For the automotive industry - where privacy, reliability, and offline capability are non-negotiable - on-device LLM inference is no longer a future promise. It is a present reality. The hardware exists. The toolchain exists. The performance is there.
The question is no longer "Can we run an LLM in the car?" It is "What will we build with it?"
FAQ: On-device LLM inference
What is on-device LLM inference?
On-device LLM inference means running a large language model directly on local hardware - such as a mobile chipset or embedded processor - without sending data to cloud servers. This approach delivers lower latency, complete offline capability, and full data privacy, since no prompts or responses ever leave the device.
Why is Qualcomm Snapdragon 8 Elite suited for LLM inference?
The Snapdragon 8 Elite features dual Hexagon NPU (Neural Processing Unit) cores specifically designed for AI workloads. These dedicated cores execute quantized model graphs with minimal CPU and GPU overhead, enabling models with billions of parameters to run at practical speeds entirely on a mobile chipset.
What is the Qualcomm Genie runtime?
Qualcomm Genie is a lightweight on-device inference engine optimized for autoregressive LLM inference on Hexagon NPU hardware. It takes precompiled, quantized QNN context binaries and executes them efficiently on the NPU, keeping the CPU and GPU free for other vehicle system tasks.
How fast do LLMs run on Snapdragon 8 Elite?
In our testing, Llama 3.2 3B Instruct achieved approximately 10 tokens per second, and the larger Llama 3.1 8B Instruct model ran at approximately 5 tokens per second. Both deliver practical, interactive performance suitable for real-world applications such as voice assistants and on-board advisory systems.
What is W4A16 quantization and why does it matter for on-device AI?
W4A16 quantization uses 4-bit precision for model weights and 16-bit precision for activations. This significantly reduces the memory footprint of large language models without a major loss in output quality - making multi-billion parameter models feasible for deployment on mobile hardware and automotive-grade chips.
Can LLMs run in a vehicle without internet connectivity?
Yes. On-device LLM inference runs entirely on the vehicle's local chipset, requiring no network connection. This is critical for automotive applications where connectivity is intermittent - such as tunnels, rural roads, or underground parking - and where real-time, reliable response is non-negotiable.
What is Qualcomm AI Hub?
Qualcomm AI Hub is a cloud-based compilation service that converts open-source model weights into optimized QNN context binaries targeting specific Snapdragon chipsets. It handles quantization, graph optimization, and NPU code generation automatically, reducing the path from a Hugging Face model checkpoint to device-ready binaries to under one hour.
Is on-device AI inference GDPR compliant?
On-device AI inference processes all data locally, meaning no conversation transcripts, voice recordings, or personal data are transmitted to external servers. This architecture supports GDPR compliance by design and is especially relevant for in-vehicle personal assistants, predictive maintenance systems, and any application involving sensitive user data.
