Building mullama: What We Learned Replacing Ollama from Scratch
A post-mortem on building a local LLM serving layer — llama.cpp integration, model management, and where existing tools constrain research.
We built mullama because Ollama was not designed for the things we needed to study. That sounds like a criticism, but it is not — Ollama is good software that solves a real problem. It just solves a different problem than ours.
This post is a post-mortem on building a local LLM serving layer from first principles. We will cover why we started, what we built, what worked, what did not, and what we would do differently.
Why Not Ollama
Ollama makes it easy to run open-weight models locally. Download a model, run ollama serve, hit the API. For most use cases, this is exactly right.
Our research required three things that Ollama’s architecture made difficult:
Fine-grained control over inference scheduling. We needed to study how different scheduling strategies (round-robin, priority-based, preemptive) affect throughput and latency under concurrent load. Ollama’s scheduler is internal and not designed to be swapped out.
Direct access to llama.cpp internals. We wanted to experiment with KV cache management, context window strategies, and quantisation at a level below what Ollama’s API exposes. Ollama wraps llama.cpp and presents a clean abstraction over it — which is the right design choice for a user-facing tool, but the wrong one for a research instrument.
Model lifecycle as a study object. We wanted to study model loading, unloading, and hot-swapping as first-class operations. How long does it take to load a 7B model into GPU memory? What happens when you need to swap between two 13B models on a single 24GB GPU? Ollama handles this internally, and its model management is optimised for the common case, not for instrumentation.
We evaluated alternatives — vLLM, text-generation-inference, LocalAI — and found similar constraints. Each tool makes reasonable assumptions about its users. None of those assumptions matched what we needed for research.
So we built mullama.
Architecture
mullama is a Python application that wraps llama.cpp via its C API using ctypes bindings. The architecture has three layers:
┌─────────────────────────────────────┐
│           HTTP API Layer            │
│  (OpenAI-compatible + extensions)   │
├─────────────────────────────────────┤
│          Scheduling Layer           │
│  (pluggable scheduling strategies)  │
├─────────────────────────────────────┤
│            Engine Layer             │
│  (llama.cpp bindings, KV cache,     │
│   model loading, quantisation)      │
└─────────────────────────────────────┘
The Engine Layer is a thin wrapper around llama.cpp. We use ctypes rather than the Python bindings (llama-cpp-python) because we needed to call functions that the Python bindings do not expose. The engine handles model loading, tokenisation, sampling, and KV cache management. It exposes these as Python objects with explicit lifecycle methods — load(), unload(), generate(), get_kv_state(), etc.
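To make the lifecycle discipline concrete, here is a minimal sketch of an engine-layer model object with explicit load()/unload()/generate() methods. The actual ctypes calls into llama.cpp are stubbed out (the comments name illustrative llama.cpp functions, whose exact names have changed across versions), so only the state-machine logic is shown.

```python
from enum import Enum, auto

class ModelState(Enum):
    UNLOADED = auto()
    LOADED = auto()

class EngineModel:
    """Sketch of the engine's explicit lifecycle discipline.

    The real engine calls into llama.cpp through ctypes at the marked
    points; those calls are stubbed here so the logic stands alone.
    """

    def __init__(self, path: str):
        self.path = path
        self.state = ModelState.UNLOADED
        self._ctx = None  # would hold a ctypes pointer into llama.cpp

    def load(self) -> None:
        if self.state is ModelState.LOADED:
            return  # idempotent: loading twice must not leak a context
        self._ctx = object()  # stand-in for the llama.cpp model/context init
        self.state = ModelState.LOADED

    def unload(self) -> None:
        if self.state is ModelState.UNLOADED:
            return
        self._ctx = None  # stand-in for the llama.cpp free call
        self.state = ModelState.UNLOADED

    def generate(self, prompt: str) -> str:
        if self.state is not ModelState.LOADED:
            raise RuntimeError(f"{self.path}: generate() before load()")
        return f"<completion for {prompt!r}>"  # stand-in for the decode loop
```

Making the states explicit is what allows the scheduler and the instrumentation to observe transitions rather than infer them.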
The Scheduling Layer is where the research happens. It receives inference requests and decides how to dispatch them to loaded models. The scheduler is a pluggable interface:
from concurrent.futures import Future
from typing import Protocol

class Scheduler(Protocol):
    def submit(self, request: InferenceRequest) -> Future[InferenceResult]:
        ...

    def on_model_loaded(self, model: LoadedModel) -> None:
        ...

    def on_model_unloaded(self, model_id: str) -> None:
        ...
We have implemented several schedulers: a simple FIFO queue, a priority scheduler that respects request deadlines, and a preemptive scheduler that can interrupt long-running generations to service higher-priority requests. Swapping between them is a configuration change, not a code change.
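The FIFO variant can be sketched in a few dozen lines: one queue, one worker thread, strict arrival order. The `run_inference` callback and the request/result dataclasses below are simplified stand-ins for the engine-layer types, not mullama's actual definitions.

```python
import queue
import threading
from concurrent.futures import Future
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    model_id: str
    prompt: str

@dataclass
class InferenceResult:
    text: str

class FifoScheduler:
    """Sketch of the simplest scheduler: one queue, one worker, strict FIFO.

    `run_inference` is a stand-in for dispatching to a loaded model in
    the engine layer.
    """

    def __init__(self, run_inference):
        self._run = run_inference
        self._queue: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def submit(self, request: InferenceRequest) -> Future:
        fut: Future = Future()
        self._queue.put((request, fut))
        return fut

    def _loop(self) -> None:
        while True:
            request, fut = self._queue.get()
            try:
                fut.set_result(self._run(request))
            except Exception as exc:  # surface engine errors to the caller
                fut.set_exception(exc)
```

The priority and preemptive schedulers replace the plain queue with a priority queue and add interruption hooks, but present the same `submit` surface.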
The HTTP API Layer implements the OpenAI chat completions API. This was a deliberate choice — compatibility with existing client libraries reduces the barrier to using mullama as a drop-in replacement for other serving layers. We also expose extension endpoints for research operations: model loading/unloading, KV cache inspection, scheduler metrics, and health checks.
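For reference, the API layer's non-streaming responses follow the standard chat completions shape. This sketch builds only the fields most clients read; the real endpoint also handles streaming chunks and stop sequences.

```python
import time
import uuid

def chat_completion_response(model: str, content: str,
                             prompt_tokens: int,
                             completion_tokens: int) -> dict:
    """Build a minimal OpenAI-style chat completion payload."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```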
llama.cpp Integration: The Hard Parts
Working directly with llama.cpp via ctypes was the most time-consuming part of the project. A few specific challenges:
Memory management across the FFI boundary. llama.cpp manages its own memory for model weights, KV caches, and scratch buffers. Coordinating this with Python’s garbage collector requires careful reference counting. We had several bugs where Python objects were garbage-collected while llama.cpp still held pointers into their memory. The fix was a ContextManager that explicitly tracks all allocated resources and releases them in the correct order.
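The core of that fix is small enough to sketch: register every FFI allocation alongside its release function, then free everything in reverse allocation order, since a llama.cpp context holds pointers into its model. The class and method names here are illustrative, not mullama's actual API.

```python
class ResourceTracker:
    """Sketch of explicit FFI resource tracking: release in reverse
    allocation order (contexts before models, models before the backend),
    rather than trusting Python's garbage collector with the ordering."""

    def __init__(self):
        self._resources = []  # (name, release_fn), in allocation order

    def register(self, name: str, release_fn) -> None:
        self._resources.append((name, release_fn))

    def release_all(self) -> list:
        released = []
        # Reverse order: a context holds pointers into its model's
        # memory, so the context must be freed first.
        for name, release_fn in reversed(self._resources):
            release_fn()
            released.append(name)
        self._resources.clear()
        return released
```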
Threading model mismatch. llama.cpp uses OpenMP for parallelism within a single inference call. Python has the GIL. Running multiple concurrent inferences means running multiple llama.cpp contexts, each potentially saturating CPU cores. We settled on a process-per-model architecture for CPU inference and a single-process, serialised-access model for GPU inference (since GPU memory is shared and llama.cpp’s CUDA backend is not designed for concurrent contexts).
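The serialised GPU path reduces to a single lock around the generation call: concurrent HTTP handlers queue on it rather than sharing a CUDA context. A minimal sketch, with `decode` standing in for the engine call and a counter included only to make the serialisation observable:

```python
import threading

class SerialisedGpuEngine:
    """Sketch of single-process, serialised GPU access: one lock guards
    the (shared) GPU context, so requests queue instead of overlapping."""

    def __init__(self, decode):
        self._decode = decode
        self._lock = threading.Lock()
        self._active = 0
        self.max_concurrent = 0  # instrumentation only

    def generate(self, prompt: str) -> str:
        with self._lock:  # one generation touches the GPU at a time
            self._active += 1
            self.max_concurrent = max(self.max_concurrent, self._active)
            try:
                return self._decode(prompt)
            finally:
                self._active -= 1
```

The CPU path, by contrast, sidesteps both the GIL and OpenMP contention by giving each model its own process.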
GGUF model format parsing. We needed to read model metadata (architecture, layer count, quantisation scheme, context length) from GGUF files without loading the full model. llama.cpp provides functions for this, but the API is not well-documented and has changed across versions. We ended up writing a standalone GGUF header parser in Python that reads the metadata directly from the file, bypassing llama.cpp entirely for the discovery phase.
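The fixed-size part of that parser is straightforward, since the GGUF header layout is stable: a 4-byte magic, then (little-endian) a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count. A sketch of just that prefix; the real parser continues past it to decode the metadata pairs (architecture, context length, quantisation, and so on):

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header without loading the model."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```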
Version churn. llama.cpp is under active development. The C API has changed significantly across releases, sometimes without deprecation warnings. We pin to a specific commit and update deliberately, but each update requires reviewing the API changes and updating our bindings. This is the ongoing maintenance cost of building on a rapidly evolving dependency.
Model Lifecycle Management
One of our research questions was: what does the model lifecycle actually look like in a multi-model serving environment?
The lifecycle is more complex than “load model, serve requests, unload model.” In practice:
- Discovery. Scan local storage for available models. Parse GGUF metadata to determine architecture, parameter count, quantisation, and context length.
- Loading. Allocate memory, read weights, initialise the KV cache. On GPU, this involves VRAM allocation and weight transfer.
- Warm-up. The first few inferences after loading are slower due to cache effects. We run a warm-up sequence before marking a model as ready.
- Serving. The steady state. The model handles requests via the scheduler.
- Hot-swapping. When a request arrives for a model that is not loaded and GPU memory is full, we need to unload the current model and load the requested one. This takes seconds for small models and minutes for large ones.
- Unloading. Release all resources. This needs to be orderly — in-flight requests must complete or be cancelled before the model’s memory is freed.
The hot-swapping case is the interesting one. On a 24GB GPU, you can hold one 13B model or two 7B models (at Q4 quantisation) comfortably. If your workload alternates between models, you spend a significant fraction of time loading and unloading. We found that even a simple predictive prefetching strategy — loading the next likely model based on recent request history — reduced swap latency by 40% in our benchmarks.
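One simple way to implement such a predictor is a first-order transition table: count which model tends to follow which, and prefetch the most likely successor of the model just served. This is an illustrative sketch of the general shape, not mullama's exact strategy.

```python
from collections import Counter, defaultdict

class PrefetchPredictor:
    """Sketch of history-based prefetching: track model-to-model
    transitions and predict the most likely next model."""

    def __init__(self):
        # model_id -> Counter of the models that followed it
        self._transitions = defaultdict(Counter)
        self._last = None

    def observe(self, model_id: str) -> None:
        """Record a served request for model_id."""
        if self._last is not None:
            self._transitions[self._last][model_id] += 1
        self._last = model_id

    def predict_next(self):
        """Return the model to prefetch, or None if there is no signal."""
        if self._last is None or not self._transitions[self._last]:
            return None
        return self._transitions[self._last].most_common(1)[0][0]
```

A workload that alternates between two models makes the prediction trivially accurate; the interesting cases are mixed workloads, where a wrong guess costs an extra swap.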
What Worked
The pluggable scheduler abstraction. This was the best design decision we made. It let us run controlled experiments comparing scheduling strategies without modifying the rest of the system. The scheduler interface is small enough that implementing a new strategy takes an afternoon, not a week.
OpenAI API compatibility. Being a drop-in replacement for the OpenAI API meant we could use mullama with existing applications, evaluation harnesses, and client libraries immediately. We did not need to build a custom ecosystem.
Explicit model lifecycle management. Treating model loading and unloading as instrumented, observable operations gave us data we could not have obtained from Ollama’s opaque model management. The loading time distributions, memory usage profiles, and swap frequency data directly informed our research.
What Did Not Work
ctypes bindings. If we were starting over, we would use pybind11 or write a dedicated C extension. ctypes works, but the lack of type safety across the FFI boundary caused subtle bugs that were difficult to diagnose. Every pointer type is just c_void_p, every buffer is just c_char_p. The compiler cannot help you.
Python for the hot path. The HTTP server and scheduler are fine in Python. The token-by-token generation loop is not. Even with the actual inference running in C (via llama.cpp), the per-token Python overhead — callback invocation, streaming response construction, scheduling checks — adds measurable latency. For single-user interactive use, it is negligible. Under concurrent load, it adds up.
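The effect is easy to reproduce with a micro-benchmark that does only the pure-Python per-token work (a callback invocation plus a streaming-chunk dict), with no inference at all. This is an illustrative measurement, not mullama's instrumentation; absolute numbers vary by machine.

```python
import time

def measure_per_token_overhead(n_tokens: int = 100_000) -> float:
    """Return the mean per-token Python overhead, in microseconds, of a
    callback invocation plus streaming-chunk construction."""
    chunks = []

    def on_token(tok: str) -> None:
        chunks.append({"choices": [{"delta": {"content": tok}}]})

    start = time.perf_counter()
    for _ in range(n_tokens):
        on_token("x")
    elapsed = time.perf_counter() - start
    return elapsed / n_tokens * 1e6
```

A few microseconds per token is invisible in a chat session; multiplied across dozens of concurrent streams, it competes with the inference itself for the interpreter.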
Trying to support every GGUF variant. Early on, we attempted to support every quantisation scheme and model architecture that llama.cpp supports. This was a mistake. The long tail of architectures and quantisation formats is large, and each one has edge cases. We eventually narrowed our focus to the architectures and quantisation levels relevant to our research (Llama, Mistral, and Phi architectures; Q4_K_M and Q5_K_M quantisation).
Trade-offs We Accept
mullama is a research tool, not a production serving layer. We explicitly accept trade-offs that would be unacceptable in production software:
- No multi-GPU support. Our research focuses on single-GPU and CPU scenarios. Tensor parallelism across GPUs is a different problem.
- No continuous batching. We serialise requests to a given model. This limits throughput but simplifies scheduling analysis.
- No speculative decoding. We may add this later, but it complicates the scheduling model we are studying.
These are deliberate scope limitations, not missing features. mullama exists to answer specific research questions, and its design reflects those questions.
Lessons
Building mullama taught us that the distance between “run a model locally” and “understand what happens when you run a model locally” is larger than it appears. The existing tools are good at the first thing. They are intentionally opaque about the second, because opacity is what makes them easy to use.
If you need a local LLM serving layer for application development, use Ollama. It is better software than mullama for that purpose. If you need to study what happens inside a local LLM serving layer — scheduling, memory management, model lifecycle, inference timing — that is the gap mullama fills.