mullama vs Ollama vs vLLM: Choosing a Local LLM Server in 2026
A practical comparison of mullama, Ollama, vLLM, LocalAI, and LM Studio for local LLM serving — when to use which, and why you might pick the research-instrumented option.
The question
Should I use Ollama, vLLM, LocalAI, LM Studio, or mullama to serve LLMs locally?
Local LLM serving has split into distinct categories in 2025–2026. The right answer depends on whether you are prototyping, shipping to production, or doing research. This post is the comparison we wished for when we started mullama.
The 60-second version: Ollama is the default for getting started; vLLM is the production-grade GPU server; LM Studio is the desktop GUI; LocalAI is the OpenAI-compatible alternative; mullama is the research-focused alternative that exposes llama.cpp internals for instrumentation.
What each option is
mullama is a Python application that binds to llama.cpp's C API through ctypes. It exposes model lifecycle management, inference scheduling, and an OpenAI-compatible API, with a pluggable scheduler layer and instrumented internals.
Ollama is a high-level wrapper around llama.cpp with a download-and-run UX (ollama run llama3), an OpenAI-compatible REST API, and a Go-based model manager. The de-facto choice for getting started.
vLLM is a production-grade GPU inference server that originated at UC Berkeley. It combines PagedAttention (paged KV cache memory), continuous batching, and tensor parallelism to reach high throughput. The right choice for high-QPS production serving.
LocalAI is an OpenAI-compatible REST API gateway that supports llama.cpp, vLLM, and other backends. The right choice for drop-in OpenAI API replacement.
LM Studio is a desktop GUI for running LLMs locally. Closed source, but excellent for non-developers. The right choice for exploration and prototyping.
The fifteen dimensions
| Dimension | mullama | Ollama | vLLM | LocalAI | LM Studio |
|---|---|---|---|---|---|
| Primary goal | Research-instrumentation | Developer experience | Production GPU serving | OpenAI-compatible API | Desktop GUI |
| Engine | llama.cpp (ctypes) | llama.cpp | Custom (PagedAttention) | Multiple backends | llama.cpp / MLX |
| API | OpenAI-compatible + extensions | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible | GUI first; optional OpenAI-compatible local server |
| Default platform | CPU + GPU | CPU + GPU | GPU (NVIDIA) | CPU + GPU | CPU + GPU + MLX |
| Model management | Pluggable | Built-in (Ollama registry) | Bring your own | Bring your own | Built-in hub |
| Scheduler | Pluggable (round-robin, priority, preemptive) | Internal | PagedAttention + continuous batching | Backend-dependent | Internal |
| KV cache management | Exposed (research hook) | Internal | PagedAttention | Internal | Internal |
| Quantisation hot-swap | Yes | No | No | No | No |
| Multi-model serving | Yes (with hot-swap) | Yes (load on demand) | Yes (LoRA adapters) | Yes | No (one at a time) |
| Observability | First-class (hooks at every layer) | Logs + metrics | Prometheus metrics | Logs | GUI |
| Memory profiling | Yes (per-model, per-request) | No | No | No | No |
| Production users | (early) | Many | Many | Many | (consumer) |
| License | MIT | MIT | Apache-2.0 | MIT | Proprietary |
| Maintenance | Skelf | Ollama Inc. | vLLM team | LocalAI | LM Studio |
| Language | Python | Go | Python + CUDA | Go | TypeScript |
When to use which
Use mullama when:
- You are doing research on inference scheduling, KV cache, or model lifecycle, and you need the internals to be visible.
- You want to compare scheduling strategies (round-robin, priority, preemptive) under controlled load.
- You want a Python codebase you can read, instrument, and modify.
- You are running a mixed fleet of CPU + GPU nodes and want a single API surface that does the right thing on each.
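To make "pluggable scheduler" concrete, here is a minimal sketch of what swapping strategies behind one interface could look like. Every name in it (Request, Scheduler, RoundRobinScheduler, PriorityScheduler, drain) is hypothetical and chosen for illustration; it is not mullama's actual API. It covers round-robin and priority only, since a preemptive strategy would also need a hook into a running generation, which this sketch does not model.

```python
from __future__ import annotations

import heapq
import itertools
from collections import deque
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Request:
    """Hypothetical request record; not mullama's actual type."""
    request_id: str
    client: str
    priority: int = 0  # lower value is served first


class Scheduler(Protocol):
    """The interface a pluggable strategy would have to satisfy."""
    def submit(self, request: Request) -> None: ...
    def next(self) -> Request | None: ...


class RoundRobinScheduler:
    """Cycles across clients so one busy client cannot starve the others."""

    def __init__(self) -> None:
        self._queues: dict[str, deque[Request]] = {}
        self._order: deque[str] = deque()

    def submit(self, request: Request) -> None:
        if request.client not in self._queues:
            self._queues[request.client] = deque()
            self._order.append(request.client)
        self._queues[request.client].append(request)

    def next(self) -> Request | None:
        # Visit each client at most once, starting with whoever is due.
        for _ in range(len(self._order)):
            client = self._order[0]
            self._order.rotate(-1)
            if self._queues[client]:
                return self._queues[client].popleft()
        return None


class PriorityScheduler:
    """Serves the lowest priority value first; ties keep arrival order."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, Request]] = []
        self._arrival = itertools.count()

    def submit(self, request: Request) -> None:
        heapq.heappush(self._heap, (request.priority, next(self._arrival), request))

    def next(self) -> Request | None:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def drain(scheduler: Scheduler, requests: list[Request]) -> list[str]:
    """Submit a fixed workload and record the order the strategy serves it in."""
    for request in requests:
        scheduler.submit(request)
    served = []
    while (request := scheduler.next()) is not None:
        served.append(request.request_id)
    return served
```

Feeding the same workload to both strategies through drain is the smallest version of a controlled comparison: the input is fixed and only the serving order changes.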
Use Ollama when:
- You want the easiest path from ollama run llama3 to a working API.
- You are prototyping, and the default model parameters are good enough.
- You need a wide model registry with pre-quantised downloads.
Use vLLM when:
- You are serving a high-QPS production workload on NVIDIA GPUs.
- You need PagedAttention, continuous batching, and tensor parallelism.
- You have an SRE team that can run a Python service in production.
Use LocalAI when:
- You need a drop-in OpenAI-compatible API and want to choose the backend per-model.
- You are migrating from OpenAI and want to keep your client code unchanged.
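Because the servers compared here speak the same chat-completions shape, "keeping your client code unchanged" mostly means changing a base URL. The sketch below uses only the Python standard library. The mullama address is the one used in this post's curl examples; the other addresses are commonly documented defaults and should be treated as assumptions to check against your own setup. The helper names build_chat_request and chat are illustrative.

```python
import json
import urllib.request

# Assumed default endpoints; verify against your own configuration.
BASE_URLS = {
    "mullama": "http://localhost:8080/v1",
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
    "localai": "http://localhost:8080/v1",
}


def build_chat_request(server: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{BASE_URLS[server]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(server: str, model: str, prompt: str) -> str:
    """Send the request and return the first choice's text."""
    request = build_chat_request(server, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

Switching backends is then a one-argument change, for example chat("ollama", "llama3", "Hello") versus chat("mullama", "qwen2.5-7b-instruct", "Hello"), provided the model name matches what that server has loaded.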
Use LM Studio when:
- You are exploring, not shipping.
- You are a non-developer who wants a GUI.
- You are on macOS and want the MLX backend (Apple Silicon).
Why might you pick the research-instrumented option?
The honest answer: most teams shouldn’t. If you are shipping a production LLM service, vLLM is the right answer. If you are prototyping, Ollama is the right answer. The research-instrumented option is for a specific audience:
- Research engineers studying inference scheduling, KV cache strategies, or model lifecycle as research objects. The internals need to be visible and modifiable.
- Platform teams building production services on top of llama.cpp (not vLLM) who need first-class observability and hot-swap. Ollama keeps its scheduling internal, and vLLM's default build targets GPUs rather than CPUs.
- Multi-model serving scenarios where you need to swap between models in < 1 second without dropping requests. Ollama’s load-on-demand model takes seconds; mullama’s hot-swap is sub-second.
If none of those describe you, you probably want Ollama or vLLM.
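The post does not describe how mullama implements hot-swap, so the following is only one way a swap that avoids dropping requests could be structured: load the replacement first, switch a shared reference under a lock, and let requests already running finish on the old model before it is freed. ModelHandle and HotSwapSlot are hypothetical names, and the handle here is a stand-in that owns no real llama.cpp resources.

```python
import threading
from contextlib import contextmanager


class ModelHandle:
    """Stand-in for a loaded model; a real one would own llama.cpp resources."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.in_flight = 0    # requests currently using this model
        self.retired = False  # True once a newer model has replaced it
        self.closed = False

    def close(self) -> None:
        # A real handle would free weights and KV cache here.
        self.closed = True


class HotSwapSlot:
    """Holds the current model and swaps it without interrupting running requests."""

    def __init__(self, initial: ModelHandle) -> None:
        self._lock = threading.Lock()
        self._current = initial

    @contextmanager
    def acquire(self):
        # Pin whichever model is current when the request starts.
        with self._lock:
            handle = self._current
            handle.in_flight += 1
        try:
            yield handle
        finally:
            with self._lock:
                handle.in_flight -= 1
                should_close = handle.retired and handle.in_flight == 0
            if should_close:
                handle.close()

    def swap(self, replacement: ModelHandle) -> None:
        # The replacement is assumed to be fully loaded before this call,
        # so the swap itself is only a reference change.
        with self._lock:
            old, self._current = self._current, replacement
            old.retired = True
            should_close = old.in_flight == 0
        if should_close:
            old.close()
```

In this design the visible swap time is the cost of a pointer change, while the expensive load happens beforehand and the old model's memory is released only when its last request completes. The trade-off is that both models are resident at once during the overlap.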
A 10-minute mullama eval
# Install
pip install mullama
# Start the server
mullama serve --model Qwen/Qwen2.5-7B-Instruct-GGUF --quant Q4_K_M
# Use the OpenAI-compatible API
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false
  }'
# Inspect the scheduler state
curl http://localhost:8080/internal/scheduler | jq
# Profile a request
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Profile: true" \
  -d '{"model": "qwen2.5-7b-instruct", "messages": [{"role": "user", "content": "Hello"}], "stream": false}' \
  | jq '.profile' # {prompt_tokens, completion_tokens, kv_cache_hit, kv_cache_miss, kv_cache_evict, time_to_first_token_ms, total_ms}
The X-Profile: true header is what makes mullama different: every request sent with it returns a per-request profile you can graph. Use it to find your prompt-cache hit rate, your eviction pressure, and your TTFT distribution.
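Once a batch of profiles has been collected, the three numbers mentioned above can be computed in a few lines. This is a sketch under stated assumptions: it uses the field names from the jq comment above, and it assumes kv_cache_hit, kv_cache_miss, and kv_cache_evict are counts, which the post does not spell out. The helper names percentile and summarise are illustrative.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise(profiles: list[dict]) -> dict:
    """Aggregate per-request profiles into hit rate, eviction pressure, and TTFT."""
    if not profiles:
        raise ValueError("need at least one profile")
    # Assumption: the kv_cache_* fields are per-request counts.
    hits = sum(p["kv_cache_hit"] for p in profiles)
    misses = sum(p["kv_cache_miss"] for p in profiles)
    evictions = sum(p["kv_cache_evict"] for p in profiles)
    ttft = [p["time_to_first_token_ms"] for p in profiles]
    lookups = hits + misses
    return {
        "requests": len(profiles),
        "kv_hit_rate": hits / lookups if lookups else 0.0,
        "evictions_per_request": evictions / len(profiles),
        "ttft_p50_ms": percentile(ttft, 50),
        "ttft_p95_ms": percentile(ttft, 95),
    }
```

Reporting TTFT as percentiles rather than a mean keeps a handful of slow cold-cache requests from hiding behind many fast ones.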
What to read next
- Building mullama: What We Learned Replacing Ollama from Scratch — the full mullama post-mortem
- Intelligent LLM Routing: Spending Compute Where It Matters — the routing layer that sits in front
- mullama repository
- Ollama
- vLLM