We ran a local LLM on a single-board computer. Here’s how we did it

Date
March 16, 2026
Hot topics 🔥
AI & Tech, How-to Guides
Contributor
Dmitry Ermakov

Most AI projects start with the same assumption: there will be internet. A cloud API call, a hosted model endpoint, a reliable connection somewhere in the pipeline. It’s such a default that most architecture guides don’t even mention it.

When Crisis Cognition came to us, that assumption was off the table entirely. Their team builds decision-support tools for humanitarian responders, people working in disaster zones, conflict areas, and infrastructure-damaged regions where connectivity is the first thing to fail. They needed an AI assistant that would keep running when everything else went down.

What followed was one of the more instructive builds we’ve done. This article walks through the core architectural decisions we made, why we made them, and what engineers considering their own offline or edge AI deployment should think through before writing a line of code.

What offline AI actually means (and why it’s harder than it sounds)

Offline AI, or on-device AI, means running the full inference stack (model, runtime, and interface) locally, with no dependency on cloud services or an external network. No API calls. No fallback.

That distinction matters because most “edge AI” guides still assume some connectivity exists, even if reduced. True offline-first AI removes that safety net entirely. The model has to fit on your hardware. The runtime has to execute without reaching out. And the system has to remain useful within those constraints.

It’s a growing space. According to Grand View Research, the global edge AI market sits at $24.91 billion in 2025 and is projected to reach $118.69 billion by 2033, driven by demand for real-time local processing. But market growth doesn’t mean the engineering is straightforward. Here’s the core tension most teams hit:

| | Offline AI | Cloud-connected AI |
|---|---|---|
| Connectivity required | None | Reliable internet |
| Inference location | On-device | Remote server |
| Model size | Constrained by hardware | Effectively unlimited |
| Data privacy | Full local control | Dependent on provider |
| Latency | Low (local) | Variable (network-dependent) |
| Update mechanism | Manual / sync when connected | Continuous |

The tradeoffs are real, and understanding them upfront is what shapes every decision that follows.

Decision 1: Choosing your hardware

Hardware selection is the constraint that determines everything else. Get it wrong and no amount of model optimisation will save you.

For the Crisis Cognition prototype, we needed something compact, low-power, and capable of running language model inference without a GPU. After evaluating the options, we selected the Orange Pi 5 Max, built around the Rockchip RK3588 processor. The deciding factor was the RK3588’s integrated NPU (neural processing unit), dedicated silicon for AI workloads that dramatically outperforms CPU-only inference for the model sizes we were targeting.

This is the question we’d push every team to answer first: does your target chip have an NPU, and is it well-supported by available runtimes? It changes the performance ceiling considerably. As the Edge AI and Vision Alliance notes, quantisation, compression, and pruning help reduce model sizes, but dedicated hardware acceleration is what makes viable on-device inference possible at scale.

Before committing to hardware, ask:

  • What is the thermal and power envelope? (Critical for field deployment)
  • Does the chip have a supported NPU with an active runtime ecosystem?
  • How much RAM is available, and is it shared with the OS?
  • What storage options exist for model files?
  • Can the device act as a network access point if needed?
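
Several of these questions can be answered directly on a candidate board before committing. Here is a minimal sketch, assuming a Linux system; the candidate device paths are illustrative examples (they vary by chip and driver), not a definitive list:

```python
import os

def board_report(npu_paths=("/dev/rknpu", "/dev/dri/renderD128")):
    """Collect basic facts relevant to on-device inference.

    npu_paths: candidate device nodes to probe. These differ per chip
    and kernel driver, so treat them as examples to adapt.
    """
    report = {"total_ram_mb": None, "npu_device": None}
    # Total RAM: on single-board computers this is shared with the OS,
    # so the usable model budget is considerably smaller than the total.
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    report["total_ram_mb"] = int(line.split()[1]) // 1024
                    break
    except OSError:
        pass  # not a Linux system, or /proc unavailable
    # NPU presence: an existing device node is a necessary (though not
    # sufficient) sign that a runtime can actually reach the accelerator.
    for path in npu_paths:
        if os.path.exists(path):
            report["npu_device"] = path
            break
    return report

print(board_report())
```

Running this on the target board during evaluation, rather than trusting the datasheet, catches surprises like RAM reserved by the GPU or a missing NPU driver early.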

We added a Micro SD card for local storage of model files and operational data. The whole unit runs on Armbian OS, a lightweight Linux distribution optimised for ARM boards that gives us stability, predictable performance, and a clean base for customisation.

Decision 2: Selecting and compressing your model

Once hardware is fixed, model selection becomes a constrained optimisation problem: find the highest-capability model that fits within your memory budget and meets your latency requirements.

For this build, we selected Qwen2.5-3B-Instruct, optimised for the RK3588. A 3-billion parameter model is modest by current standards, but it hits the right balance for constrained hardware: accurate enough for structured decision-support queries, fast enough to respond without frustrating users in the field.

The key enabling technique here is quantisation. Quantisation reduces model precision by converting weights from 32-bit or 16-bit floating-point values to lower-precision formats like 8-bit integers, dramatically shrinking memory footprint and speeding up inference. We used a w8a8 configuration (8-bit weights, 8-bit activations) with a hybrid ratio that preserves accuracy where it matters most.
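
The mechanics can be illustrated with a toy symmetric int8 quantiser. This is a sketch of the general idea only, not the rkllm w8a8 implementation, which additionally mixes precisions per layer:

```python
def quantize_int8(weights):
    """Symmetric int8 quantisation: map floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.05, 0.4, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now costs 1 byte instead of 4 (float32): a 4x saving.
# The reconstruction error is bounded by half a quantisation step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2
```

The same bound explains why quantisation degrades gracefully: error is proportional to the largest weight in the group being quantised, which is why real schemes quantise per-channel or per-block rather than over the whole tensor.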

The Edge AI and Vision Alliance reports that more than 75% of large-scale AI models published in 2024 feature fewer than 100 billion parameters, and compression techniques are making smaller models increasingly competitive. The practical upshot: you have more viable options for on-device deployment than you might expect.

Our rule of thumb: start with the smallest model that meets your accuracy floor for the specific task, then optimise from there. Don’t try to fit a 70B model onto embedded hardware. Work with what the hardware supports and tune the model to the use case.
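
A quick back-of-the-envelope check, before downloading anything, is to compare the quantised weight footprint against available RAM. The 20% overhead factor below is our own working assumption for KV cache and runtime buffers; it grows with context length:

```python
def model_footprint_gb(params_billion, bits_per_weight, overhead=0.20):
    """Approximate RAM needed for quantised weights plus runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead)

# A 3B model at 8-bit weights: ~3 GB of weights plus overhead, which
# fits comfortably on an 8 GB board with room left for the OS.
print(round(model_footprint_gb(3, 8), 2))

# A 70B model even at 4-bit needs ~42 GB: hopeless on embedded hardware.
print(round(model_footprint_gb(70, 4), 2))
```

It is a crude estimate, but it reliably separates "worth benchmarking" from "will not even load" before you spend time on conversion.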

Hugging Face is the best starting point for finding quantised model variants. For teams not targeting a specific NPU, llama.cpp supports GGUF-format quantised models across a wide range of hardware with minimal dependencies, and has become the de facto standard for local LLM deployment.

Decision 3: The inference runtime

The runtime is the layer between your model and your hardware. It’s easy to overlook, but choosing the wrong one for your chip will leave significant performance on the table.

For the RK3588 NPU, we used rkllm, a runtime built specifically for this chip, and extended it in our own repository to fit the prototype’s requirements. The NPU-native runtime gave us the inference speed we needed. Running the same model through a generic CPU-based runtime would have been noticeably slower.

The general principle: always match the runtime to your chip architecture. Generic runtimes are portable; purpose-built ones are faster. The right choice depends on whether portability or performance is the priority.

Here’s a simplified comparison for teams evaluating their options:

| Runtime | Best for | Quantisation support | Hardware flexibility |
|---|---|---|---|
| rkllm | Rockchip RK3588 NPU | Yes (chip-native) | RK3588 only |
| llama.cpp | CPU/GPU, broad hardware | 1.5-bit to 8-bit (GGUF) | High; runs almost anywhere |
| ONNX Runtime | Cross-platform portability | INT8, FP16 | High; CPU, CUDA, TensorRT |

For most teams not targeting a specific NPU, llama.cpp is the right starting point. For production deployments on specific edge silicon, investigate whether a chip-native runtime exists first.

Decision 4: Making it usable in the field

A working model is not a usable system. This is the step that gets underestimated.

Crisis Cognition’s users are field responders in high-pressure situations. They’re not engineers. The UI layer couldn’t require any setup, configuration, or technical knowledge. It had to just work.

Our solution had two parts. First, we configured the Orange Pi 5 Max as a Wi-Fi access point, so phones, tablets, and laptops can connect directly to the device without any external infrastructure. Second, we implemented a captive portal that automatically routes any connected device to the Open WebUI interface the moment it joins the network. Open WebUI runs in a standard browser, no app install, no login friction, no instructions needed.
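
As a rough sketch of the access-point side, a typical hostapd plus dnsmasq configuration looks like the following. The interface name, SSID, passphrase, and addresses are placeholders, and the captive-portal redirect to the web UI itself needs an additional firewall rule or web-server rewrite on top of this:

```
# /etc/hostapd/hostapd.conf -- turn the board's wlan0 into an access point
interface=wlan0
ssid=field-assistant
hw_mode=g
channel=6
wpa=2
wpa_passphrase=change-me
wpa_key_mgmt=WPA-PSK

# /etc/dnsmasq.conf -- hand out addresses, and answer every DNS query
# with the board's own IP so any URL a client opens lands on the portal
interface=wlan0
dhcp-range=192.168.4.10,192.168.4.100,12h
address=/#/192.168.4.1
```

The wildcard `address=/#/` line is what makes the portal "captive": clients resolve every hostname to the board, so the first page they load is the interface.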

This architecture is worth borrowing for any offline AI deployment where non-technical users are in the picture. The model capability is irrelevant if the people who need it can’t access it quickly under pressure.

What we proved, and what the tradeoffs are

The prototype confirmed what we set out to demonstrate: LLMs can run reliably on the RK3588 NPU, offline inference is viable for crisis environments, and a fully self-contained AI assistant can operate without cloud services or external infrastructure. Non-technical users can access it through a browser within seconds of connecting.

The honest tradeoffs are worth naming. A 3B parameter model has a capability ceiling; it handles structured, task-specific queries well, but it’s not a replacement for a frontier model on open-ended reasoning. There’s no live data. And keeping the model updated requires a manual sync process when connectivity is available.

For teams weighing fully offline against a hybrid approach, research published in early 2025 found that hybrid edge/cloud setups for AI workloads can achieve energy savings of up to 75% and cost reductions exceeding 80% compared to pure cloud processing. If your use case allows for intermittent connectivity, a hybrid model (local inference for real-time decisions, cloud sync for updates and heavier reasoning) may be the better long-term architecture. For Crisis Cognition’s context, fully offline was the only viable option.

Key takeaways

  • Hardware first. NPU availability and runtime support should determine your hardware choice before anything else.
  • Constrain the model to the task. The smallest model that meets your accuracy floor is the right model. Don’t optimise for capability at the expense of deployability.
  • Match runtime to silicon. A chip-native runtime outperforms a generic one. Always check whether one exists for your target hardware.
  • Design for the end user, not the engineer. The access layer, how people actually interact with the system in the field, is as important as the inference stack.

If you’re exploring an offline or edge AI deployment and want to talk through the architecture, get in touch with our team. You can also read the full Crisis Cognition case study for a detailed breakdown of the technical implementation.


Dmitry Ermakov

Dmitry is our Head of Engineering. He's been with WeAreBrain since the company's inception, bringing solid experience in software development as well as project management.