llamafu vs llama.rn vs flutter_llama_cpp: On-Device LLM for Flutter in 2026

A practical comparison of llamafu, llama.rn, and flutter_llama_cpp for running LLMs on Flutter — what each does well, what each doesn't, and which to pick.

The question

Should I use llamafu, llama.rn, or flutter_llama_cpp to run an LLM in my Flutter app?

The Flutter-on-device-LLM ecosystem has matured in 2025–2026. There are now three serious contenders, and the differences are real but not always obvious. This post is the comparison we wished we had when we started llamafu.

The 60-second version: llamafu is the research-instrumented runtime; llama.rn is the React-Native bridge; flutter_llama_cpp is the community-maintained FFI binding. They share the same inference engine (llama.cpp) but differ in measurement, ergonomics, and Flutter-native integration.

What each project is

llamafu is a Flutter runtime built around llama.cpp with a measurement-first methodology. Every inference is instrumented: tokens/s per device, memory ceilings, KV-cache utilisation, and quantisation quality trade-offs are all first-class outputs. The goal is to publish the measurements so the community can compare deployments honestly.
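To make "instrumented" concrete, here is a minimal sketch of how tokens/s can be derived from any streamed generation. It is plain Dart and does not use llamafu's API; `ThroughputSample`, `measureThroughput`, and `fakeTokenStream` are hypothetical names invented for this illustration, and the fake stream stands in for a real model.

```dart
/// Result of timing one streamed generation.
class ThroughputSample {
  ThroughputSample(this.tokenCount, this.elapsed);

  final int tokenCount;
  final Duration elapsed;

  /// Tokens per second; 0.0 when no measurable time has elapsed.
  double get tokensPerSecond => elapsed.inMicroseconds == 0
      ? 0.0
      : tokenCount * 1e6 / elapsed.inMicroseconds;
}

/// Counts the items in [tokens] and times the whole stream.
Future<ThroughputSample> measureThroughput(Stream<String> tokens) async {
  final stopwatch = Stopwatch()..start();
  var count = 0;
  await for (final _ in tokens) {
    count++;
  }
  stopwatch.stop();
  return ThroughputSample(count, stopwatch.elapsed);
}

/// Hypothetical stand-in for a model: emits [n] tokens, [delay] apart.
Stream<String> fakeTokenStream(int n, Duration delay) async* {
  for (var i = 0; i < n; i++) {
    await Future<void>.delayed(delay);
    yield 'tok$i';
  }
}

Future<void> main() async {
  final sample = await measureThroughput(
    fakeTokenStream(20, const Duration(milliseconds: 5)),
  );
  print('${sample.tokenCount} tokens in ${sample.elapsed.inMilliseconds} ms '
      '(${sample.tokensPerSecond.toStringAsFixed(1)} tokens/s)');
}
```

This times the whole stream, so it folds prompt processing and generation into one number. A real benchmark would likely report time-to-first-token separately.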

llama.rn is the React Native equivalent — a JS-callable binding to llama.cpp, with a smaller API surface focused on getting models running quickly. It is the de-facto choice for cross-platform mobile (iOS, Android) when the app is RN-based.

flutter_llama_cpp is a community-maintained FFI binding that exposes llama.cpp to Dart directly. It is the most Dart-native of the three and the most configurable at the engine level (KV cache, sampling, batching).

Feature comparison

| Dimension | llamafu | llama.rn | flutter_llama_cpp |
|---|---|---|---|
| Primary goal | Research-instrumented runtime | Production mobile inference | Dart-native binding |
| Engine | llama.cpp | llama.cpp | llama.cpp |
| Flutter integration | First-class via LlamafuEngine widget | Via platform channels | Direct FFI |
| Default model size | 7B Q4_K_M | 3-7B Q4 | 3-13B configurable |
| Memory measurements | First-class (publishes per-device numbers) | No | No |
| Quantisation benchmarks | Published (Q4/Q5/Q8 quality delta) | No | No |
| Tool calling | Yes | Yes | Limited |
| Streaming | Yes | Yes | Yes |
| Multi-model (hot-swap) | Yes | Limited | Yes |
| Embedding | Yes | No | No |
| iOS support | Yes | Yes | Yes |
| Android support | Yes | Yes | Yes |
| macOS / Windows / Linux | Yes | No (mobile-first) | Yes |
| Active development | Yes | Yes | Yes |
| License | MIT | MIT | MIT |
| Production users | (early) | Many | Many |
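The model-size row matters mostly because of memory, so a back-of-the-envelope estimate helps when deciding whether a given model fits on a device. The sketch below uses two standard approximations: weight size is parameter count times bits per weight, and the KV cache holds one key and one value vector per layer per context position. The specific inputs are assumptions for illustration, not figures from any of the three projects: roughly 4.85 bits per weight for Q4_K_M, and a Llama-style 7B shape (32 layers, 32 KV heads, head dimension 128).

```dart
/// Approximate size of quantised weights in bytes.
double weightBytes(double parameterCount, double bitsPerWeight) =>
    parameterCount * bitsPerWeight / 8;

/// Size of the key/value cache in bytes for a full context window.
/// The leading 2 accounts for storing both keys and values.
int kvCacheBytes({
  required int layers,
  required int contextSize,
  required int kvHeads,
  required int headDim,
  int bytesPerElement = 2, // assumes 16-bit cache entries
}) =>
    2 * layers * contextSize * kvHeads * headDim * bytesPerElement;

double toGiB(num bytes) => bytes / (1024 * 1024 * 1024);

void main() {
  // Illustrative inputs only (see the assumptions above).
  final weights = weightBytes(7e9, 4.85);
  final kv = kvCacheBytes(
    layers: 32,
    contextSize: 2048,
    kvHeads: 32,
    headDim: 128,
  );

  print('Weights:  ${toGiB(weights).toStringAsFixed(2)} GiB');
  print('KV cache: ${toGiB(kv).toStringAsFixed(2)} GiB');
  print('Total:    ${toGiB(weights + kv).toStringAsFixed(2)} GiB');
}
```

The estimate ignores compute buffers and runtime overhead, so treat it as a lower bound. Measured peak memory on the target device is the number that counts.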

When to use which

Use llamafu when:

  • You are doing research on on-device LLM performance and need the measurements, not just the inference.
  • You need to publish a benchmark (paper, blog, internal report) and want the numbers to be defensible.
  • You want a single API that works on iOS, Android, macOS, Windows, and Linux without per-platform code paths.
  • You are building an agent on top of the runtime (tool calls, memory) and want a coherent API surface.

Use llama.rn when:

  • Your app is React Native, not Flutter.
  • You are shipping fast and the default model parameters are good enough.
  • llama.rn is what your team already knows.

Use flutter_llama_cpp when:

  • You are pure-Dart and want the lowest-level FFI access to llama.cpp.
  • You want to customise the engine (sampling, KV cache, batching) at the C-API level.
  • You are migrating an existing llama.cpp C++ project to Flutter.

What llamafu does that the others don’t

Three things, each of them a different kind of value:

  1. Published measurements. llamafu’s GitHub releases include per-device token/s numbers for every supported model and quantisation level. The README contains a benchmark table; docs.llamafu.dev has the full data. This means you can defend your choice of model + device to a stakeholder with data, not vibes.

  2. Cross-platform API surface. llamafu has a single Dart API that runs on iOS, Android, macOS, Windows, and Linux. The LlamafuEngine widget is Flutter-native; the underlying bindings are the same code path everywhere. The same model file works on every platform.

  3. First-class agent support. llamafu is designed to be the inference substrate for ukkin (our on-device mobile AI agent). Tool calls, structured output, and reasoning traces are part of the API, not bolted on.
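For readers new to tool calling, the general pattern is: the model emits a structured request, the app parses it, and the app runs the matching function. A minimal sketch in plain Dart follows. The JSON shape (`tool` and `arguments` keys), the `ToolCall` class, and the `get_time` tool are all hypothetical and are not llamafu's actual wire format or API.

```dart
import 'dart:convert';

/// A parsed tool request (hypothetical shape, for illustration only).
class ToolCall {
  ToolCall(this.name, this.arguments);

  final String name;
  final Map<String, dynamic> arguments;
}

/// Returns a [ToolCall] if [modelOutput] is a JSON object of the assumed
/// shape {"tool": "...", "arguments": {...}}, otherwise null.
ToolCall? parseToolCall(String modelOutput) {
  try {
    final decoded = jsonDecode(modelOutput.trim());
    if (decoded is! Map<String, dynamic>) return null;
    final name = decoded['tool'];
    final args = decoded['arguments'];
    if (name is! String || args is! Map<String, dynamic>) return null;
    return ToolCall(name, args);
  } on FormatException {
    // Not JSON at all: treat the output as ordinary text.
    return null;
  }
}

void main() {
  // Registry mapping tool names to handlers.
  final tools = <String, String Function(Map<String, dynamic>)>{
    'get_time': (args) => 'It is noon in ${args['zone']}.',
  };

  const output = '{"tool": "get_time", "arguments": {"zone": "UTC"}}';
  final call = parseToolCall(output);
  final handler = call == null ? null : tools[call.name];

  if (call != null && handler != null) {
    print(handler(call.arguments));
  } else {
    print('No tool call; show the text to the user: $output');
  }
}
```

Returning null for anything that does not match keeps plain-text answers and malformed requests on the same fallback path.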

What llamafu does NOT do

  • It is not built to win on raw tokens/s. If you only care about throughput and don’t need the measurements or the agent support, llama.rn gives you comparable speed with a smaller API surface.
  • It is not the most flexible at the C-API level. If you need to call llama.cpp internals directly, flutter_llama_cpp gives you more control.
  • It does not have a managed model hub. You bring your own GGUF files. This is intentional (we believe in local models, not managed ones) but worth knowing.

A realistic 30-minute eval

# 1. Add llamafu to your Flutter app
flutter pub add llamafu

# 2. Drop a GGUF model in your assets
mkdir -p assets/models
curl -L -o assets/models/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf

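// 3. Wire it up in Dart. List assets/models/ under flutter: assets: in
//    pubspec.yaml first, or the model file will not be bundled.
import 'package:llamafu/llamafu.dart';

// Call this from your app, for example from a button handler.
Future<void> runEval() async {
  final engine = LlamafuEngine.fromAsset(
    'assets/models/qwen2.5-1.5b-instruct-q4_k_m.gguf',
    contextSize: 2048,
  );
  final response = await engine.generate('Hello in three words.');
  print(response.text);
  print('Tokens/s: ${response.metrics.tokensPerSecond}');
  print('Memory peak: ${response.metrics.peakMemoryMb} MB');
}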

If you don’t have a GGUF file yet, llamafu will also load from a URL with LlamafuEngine.fromUrl(...).
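A single run is a noisy basis for a decision, so it is worth repeating the generation a few times and summarising. The sketch below is plain Dart and takes tokens/s readings as plain numbers. The sample values are made up for illustration, and dropping the first run as a warm-up is an assumption about typical behaviour, not a documented llamafu recommendation.

```dart
/// Median of [values]; throws if the list is empty.
double median(List<double> values) {
  if (values.isEmpty) {
    throw ArgumentError('values must not be empty');
  }
  final sorted = [...values]..sort();
  final mid = sorted.length ~/ 2;
  return sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
}

/// Drops the first [warmUpRuns] readings, which often include one-off
/// costs such as loading the model into memory.
List<double> withoutWarmUp(List<double> readings, {int warmUpRuns = 1}) =>
    readings.skip(warmUpRuns).toList();

void main() {
  // Made-up tokens/s readings from six runs of the same prompt.
  final readings = <double>[11.2, 18.4, 19.1, 18.7, 17.9, 18.8];
  final steady = withoutWarmUp(readings);

  final lowest = steady.reduce((a, b) => a < b ? a : b);
  final highest = steady.reduce((a, b) => a > b ? a : b);

  print('Runs kept: ${steady.length}');
  print('Median tokens/s: ${median(steady).toStringAsFixed(1)}');
  print('Range: ${lowest.toStringAsFixed(1)} to ${highest.toStringAsFixed(1)}');
}
```

Reporting the median with the range is harder to skew with one slow run than reporting a mean.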