Autonomous Mobile Agents: ukkin's Architecture for On-Device AI

Building AI agents that browse, observe, and automate tasks entirely on-device — the autonomy-safety spectrum on mobile.

What Does an Agent Look Like on a Phone?

The word “agent” in AI has come to mean many things. In the context of ukkin, we mean something specific: a software system running on a mobile device that can observe its environment, reason about what to do, and take actions — all without sending data to a remote server. The environment is the phone itself: its screen contents, its notifications, its apps, its sensors. The actions are the things a user might do: tapping buttons, entering text, navigating between apps, adjusting settings.

This is a different proposition from cloud-based AI agents. A cloud agent has access to powerful models, vast memory, and essentially unlimited compute. It also requires the user to send their data — screen contents, personal messages, browsing history — to a remote server. An on-device agent operates under severe resource constraints but keeps everything local. The trade-off is between capability and privacy, and ukkin is our exploration of where that trade-off lands in practice.

Architecture Overview

ukkin is structured as three layers, each with a distinct responsibility.

The Observation Layer

The observation layer is responsible for understanding what is happening on the device. It captures and interprets the current state of the screen, active notifications, and relevant system state (connectivity, battery level, time of day).

Screen understanding is the most challenging component. ukkin uses a combination of accessibility APIs and lightweight vision models to extract structured information from the screen. The accessibility APIs provide the semantic structure — button labels, text fields, list items, navigation elements. The vision component handles cases where the accessibility tree is incomplete or absent, which is unfortunately common.

The output of the observation layer is a structured representation of the current device state: what app is active, what content is displayed, what actions are available (which buttons can be tapped, which fields can be filled), and what has changed since the last observation. This representation is compact enough to fit within the context window of the on-device language model.
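A minimal sketch of what such a structured observation could look like. All names here (Observation, UIElement, to_prompt) are illustrative, not ukkin's actual types; the point is that the representation is flat, small, and trivially serialisable into a prompt for a small model.

```python
from dataclasses import dataclass, field

@dataclass
class UIElement:
    """One actionable element extracted from the accessibility tree."""
    element_id: str
    role: str          # "button", "text_field", "list_item", ...
    label: str
    editable: bool = False

@dataclass
class Observation:
    """Compact device-state snapshot handed to the reasoning layer."""
    active_app: str
    visible_text: list[str] = field(default_factory=list)
    actionable: list[UIElement] = field(default_factory=list)
    changed_since_last: bool = True

    def to_prompt(self, max_elements: int = 20) -> str:
        """Serialise to a compact form that fits a small context window."""
        lines = [f"app: {self.active_app}"]
        for el in self.actionable[:max_elements]:
            lines.append(f"[{el.element_id}] {el.role}: {el.label}")
        return "\n".join(lines)

obs = Observation(
    active_app="Messages",
    actionable=[
        UIElement("e1", "text_field", "Message", editable=True),
        UIElement("e2", "button", "Send"),
    ],
)
print(obs.to_prompt())
# app: Messages
# [e1] text_field: Message
# [e2] button: Send
```

Capping the number of serialised elements is one way the representation stays compact enough for a 2-3K-token context window.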

The Reasoning Layer

The reasoning layer takes the structured observation and decides what to do. This is where ukkin integrates with llamafu for local inference.

The reasoning layer receives three inputs: the current observation, the user’s goal (expressed in natural language), and a memory of previous actions and their outcomes. It produces one of three outputs: a specific action to take, a request for clarification from the user, or a determination that the goal has been achieved or cannot be achieved.

The reasoning process uses a structured prompt format that constrains the language model’s output to valid actions. Rather than allowing the model to generate free-form text and then parsing it for action instructions, ukkin provides the model with an explicit action schema — a list of available actions with their parameters — and requires the model to select from this schema. This eliminates a large class of parsing failures and ensures that every model output maps to a concrete, executable action.
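The schema-constrained approach can be sketched as follows. The action names and parameters are hypothetical, and a production system would enforce the schema at decode time (e.g. with grammar-constrained sampling); this sketch only shows the validation contract, in which any output that is not a valid schema instance is a hard error rather than something to salvage by parsing.

```python
import json

# Hypothetical action schema; ukkin's real schema is not published.
ACTION_SCHEMA = {
    "tap":    {"params": ["element_id"]},
    "type":   {"params": ["element_id", "text"]},
    "scroll": {"params": ["direction"]},
    "done":   {"params": []},
}

def parse_action(model_output: str) -> dict:
    """Accept only a valid instance of the action schema.

    The model is prompted to emit JSON like
    {"action": "tap", "params": {"element_id": "e2"}}.
    """
    try:
        candidate = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}")
    if not isinstance(candidate, dict):
        raise ValueError("output is not a JSON object")
    name = candidate.get("action")
    if name not in ACTION_SCHEMA:
        raise ValueError(f"unknown action: {name!r}")
    expected = set(ACTION_SCHEMA[name]["params"])
    provided = set(candidate.get("params", {}))
    if expected != provided:
        raise ValueError(f"{name} expects {expected}, got {provided}")
    return candidate

# A well-formed output passes; anything else is rejected before execution.
action = parse_action('{"action": "tap", "params": {"element_id": "e2"}}')
```

Because every accepted output names a known action with exactly the expected parameters, the action layer never has to guess what the model meant.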

The trade-off is that reasoning happens on-device, which means it uses a smaller, quantised model rather than a frontier model. The reasoning is less sophisticated than what a cloud-based agent could produce. ukkin compensates for this by decomposing complex tasks into simpler steps and by using structured prompting to keep each individual reasoning step within the capabilities of the local model.

The Action Layer

The action layer executes the actions determined by the reasoning layer. It translates abstract actions (“tap the Send button,” “type ‘hello’ into the message field,” “scroll down”) into platform-specific gestures and inputs.

On Android, this uses the Accessibility Service API to inject taps, swipes, and text input. On iOS, the mechanisms are more constrained due to platform restrictions, and ukkin’s capabilities are correspondingly narrower.

The action layer also handles verification: after executing an action, it triggers a new observation to confirm that the expected change occurred. If the screen state does not match expectations (the button was not found, the app crashed, a dialog appeared), the action layer reports the discrepancy to the reasoning layer, which can adjust its plan.
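The act-then-verify cycle can be sketched as a small loop. The three callables stand in for the platform-specific pieces (gesture injection, screen capture, and a predicate over the new observation); none of this is ukkin's actual interface.

```python
from typing import Any, Callable

def execute_and_verify(action: dict,
                       do_action: Callable[[dict], None],
                       observe: Callable[[], Any],
                       expectation: Callable[[Any], bool],
                       retries: int = 1) -> dict:
    """Run an action, then re-observe to confirm the expected change.

    On failure, a structured discrepancy report goes back to the
    reasoning layer instead of blindly continuing the workflow.
    """
    for _ in range(retries + 1):
        do_action(action)
        new_state = observe()
        if expectation(new_state):
            return {"ok": True, "state": new_state}
    return {"ok": False, "state": new_state,
            "discrepancy": "expected change not observed"}

# Toy usage: a simulated screen where tapping OK opens a dialog.
screen = {"dialog_open": False}
result = execute_and_verify(
    action={"action": "tap", "params": {"element_id": "ok"}},
    do_action=lambda a: screen.update(dialog_open=True),
    observe=lambda: dict(screen),
    expectation=lambda s: s["dialog_open"],
)
```

The key design choice is that verification is the action layer's job: the reasoning layer only sees the outcome, so a failed tap and a crashed app both surface as the same kind of discrepancy report.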

The Autonomy-Safety Spectrum

The central design question for any agent is: how much should it be allowed to do without asking?

At one extreme, the agent does nothing without explicit approval for every action. This is safe but not useful — it reduces the agent to a suggestion engine that still requires the user to perform every tap.

At the other extreme, the agent acts fully autonomously, making purchases, sending messages, and modifying settings without confirmation. This is useful in theory but dangerous in practice. A misunderstood instruction or a reasoning error could send a message to the wrong person, make an unintended purchase, or delete data.

ukkin adopts a tiered permission model that places actions on a spectrum from low-risk to high-risk.

Tier 1: Observe only. The agent can look at the screen and provide suggestions but cannot interact with it. This is the default mode for new tasks and unfamiliar apps.

Tier 2: Reversible actions. The agent can perform actions that are easily undone: scrolling, navigating between screens, opening apps, performing searches. These actions change the view but do not change data.

Tier 3: Data-modifying actions. The agent can fill forms, compose text, and stage changes, but requires explicit user confirmation before submitting, sending, or saving. The user sees what the agent has prepared and approves or modifies it.

Tier 4: Autonomous execution. For specific, pre-approved workflows, the agent can act without per-action confirmation. The user defines the workflow and its boundaries in advance. For example: “Every morning, check my calendar and send me a summary notification.” Within this defined scope, the agent operates autonomously. Outside it, it falls back to Tier 3.

This model is conservative by design. We would rather have users occasionally frustrated by confirmation prompts than have the agent take an irreversible action based on a misunderstanding. The tiers can be adjusted per app, per action type, and per user preference.
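The tier logic above reduces to a small policy check. The mapping from action kinds to required tiers is illustrative (ukkin's real policy is per app and per user), but the decision shape is the interesting part: an action either executes, falls back to explicit confirmation, or is denied outright.

```python
from enum import IntEnum

class Tier(IntEnum):
    OBSERVE = 1      # look and suggest only
    REVERSIBLE = 2   # scroll, navigate, open apps, search
    STAGE = 3        # fill and compose, but confirm before submitting
    AUTONOMOUS = 4   # pre-approved workflow, no per-action prompt

# Hypothetical mapping from action kind to the tier it requires.
REQUIRED_TIER = {
    "scroll":   Tier.REVERSIBLE,
    "open_app": Tier.REVERSIBLE,
    "type":     Tier.STAGE,
    "submit":   Tier.AUTONOMOUS,  # needs confirmation below Tier 4
}

def decide(action_kind: str, granted: Tier) -> str:
    """Return 'execute', 'confirm', or 'deny' for an action."""
    # Unknown action kinds default to the strictest requirement.
    needed = REQUIRED_TIER.get(action_kind, Tier.AUTONOMOUS)
    if granted >= needed:
        return "execute"
    # Tier 3 can stage a submit but must ask before committing it.
    if needed == Tier.AUTONOMOUS and granted >= Tier.STAGE:
        return "confirm"
    return "deny"
```

Defaulting unknown action kinds to the highest tier matches the conservative stance described above: anything the policy has not classified requires the strongest grant.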

Integration with llamafu

ukkin’s reasoning layer depends on local LLM inference, which is provided by llamafu. This integration creates a tight coupling between the agent’s cognitive capabilities and the hardware constraints documented in our llamafu experiment report.

The practical implications are significant. The reasoning model is limited to what can run on-device — currently, quantised models in the 3B-7B parameter range. This is sufficient for interpreting screen state, following simple multi-step instructions, and selecting appropriate actions from a constrained action space. It is not sufficient for complex planning, nuanced understanding of ambiguous instructions, or reasoning about the consequences of actions several steps ahead.

ukkin manages this limitation through task decomposition. Rather than asking the model to plan an entire multi-step workflow at once, ukkin breaks tasks into individual steps. At each step, the model only needs to understand the current screen state and select the next action. This reduces the cognitive demand on any single inference call and keeps each step within the model’s reliable operating range.
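The decomposition described above amounts to a loop in which each iteration makes exactly one observation, one reasoning call, and one action. A sketch, with `reason` standing in for the local model call (the callables and the toy world below are illustrative, not ukkin's interfaces):

```python
def run_task(goal: str, observe, reason, act, max_steps: int = 15) -> dict:
    """One observation -> one reasoning call -> one action, repeated.

    `reason` sees only the current state, the goal, and a short
    history, so each inference call stays within the capabilities
    of a small quantised model.
    """
    history = []
    for _ in range(max_steps):
        state = observe()
        step = reason(goal=goal, state=state, history=history)
        if step["action"] == "done":
            return {"achieved": True, "steps": history}
        act(step)
        history.append(step)
    return {"achieved": False, "steps": history}

# Toy usage: a "model" that taps until a counter reaches three.
world = {"count": 0}
def fake_reason(goal, state, history):
    return {"action": "done"} if state["count"] >= 3 else {"action": "tap"}

result = run_task(
    "tap three times",
    observe=lambda: dict(world),
    reason=fake_reason,
    act=lambda step: world.update(count=world["count"] + 1),
)
# result["achieved"] is True after 3 steps
```

The hard cap on steps is a deliberate safety valve: a small model that has lost the plot should stop and report failure rather than loop indefinitely.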

The memory system is also constrained by context length. On most devices, the effective context window is 2048-3072 tokens. ukkin maintains a rolling memory that includes the current observation, the user’s goal, and a compressed summary of recent actions. Older actions are summarised rather than retained verbatim. This works for short workflows (5-15 steps) but becomes lossy for longer sequences.
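A rough sketch of this rolling-memory compaction. Two stand-ins to note: the token count here is a crude whitespace split where a real implementation would use the model's own tokenizer, and the one-line summary stands in for a model-generated summary of the older steps.

```python
def compact_memory(actions: list, budget_tokens: int,
                   keep_verbatim: int = 3) -> str:
    """Keep the most recent actions verbatim; compress the rest.

    Older actions collapse into a single summary line, and if the
    result still exceeds the budget, the oldest verbatim entries
    are dropped next. This is lossy by design.
    """
    recent = list(actions[-keep_verbatim:])
    older = actions[:-keep_verbatim]
    lines = []
    if older:
        # Stand-in for a model-generated summary of the older steps.
        lines.append(f"(summary: {len(older)} earlier actions)")
    lines += recent
    # Crude token count: whitespace split, not a real tokenizer.
    while len(lines) > 1 and sum(len(l.split()) for l in lines) > budget_tokens:
        lines.pop(1 if older else 0)  # drop the oldest verbatim entry
    return "\n".join(lines)

history = [f"tapped button {i}" for i in range(10)]
memory = compact_memory(history, budget_tokens=50)
```

For a 5-15 step workflow the verbatim tail plus a summary line fits comfortably; for longer sequences the summary absorbs more and more steps, which is exactly the lossiness the text describes.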

Comparison with Cloud-Based Agents

Cloud-based agents have clear advantages: larger models, longer context, more sophisticated reasoning, access to web APIs and external tools. For complex tasks that require deep understanding, long-horizon planning, or access to information beyond the device, cloud agents are superior.

ukkin’s advantages are in a different dimension. Privacy is absolute — no screen contents, no personal data, no browsing history leaves the device. Latency for individual actions is low because there is no network round-trip. Availability is unconditional — the agent works offline, in airplane mode, in areas with no connectivity. And the marginal cost per inference is zero — there are no API charges, no usage limits, no subscription fees.

The practical upshot is that ukkin and cloud-based agents serve different use cases. ukkin is well-suited for routine, privacy-sensitive, on-device automation: managing settings, navigating familiar apps, performing repetitive tasks, summarising on-screen content. Cloud agents are better for open-ended tasks that require world knowledge, complex reasoning, or interaction with web services.

We do not view this as a permanent division. As on-device models improve, the range of tasks that ukkin can handle will expand. But we are honest about where the boundaries are today, and we design ukkin’s permission model and user experience around those boundaries rather than around aspirational future capabilities.

Current Limitations

Beyond the reasoning constraints imposed by model size, ukkin faces several practical limitations.

App compatibility. Not all apps expose useful accessibility information. Some apps use custom rendering that makes screen understanding difficult. Games, heavily customised UIs, and apps with canvas-based rendering are largely opaque to the observation layer.

Platform restrictions. iOS imposes significant limitations on background operation and accessibility access for third-party apps. ukkin’s capabilities on iOS are narrower than on Android as a result.

Error recovery. When the agent makes a mistake mid-workflow — taps the wrong button, navigates to the wrong screen — recovery is not always smooth. The reasoning model can detect that something went wrong but does not always find the correct recovery path, particularly if the mistake has taken it to an unfamiliar screen.

Speed. Each action cycle (observe, reason, act, verify) takes 3-8 seconds on current hardware. For simple tasks, this is slower than a human performing the same actions directly. The value proposition is not speed but automation — the agent performs tasks that the user does not want to do, not tasks that the user cannot do fast enough.

Where This Is Going

ukkin is an early-stage exploration. We are sharing it not because it is a finished product but because the architecture and the trade-offs it embodies are instructive. On-device agents are coming, driven by improving hardware and increasingly capable small models. The design decisions being made now — about permission models, about the balance between autonomy and safety, about how to handle the limitations of local inference — will shape what these agents look like when they mature.

We believe the tiered permission model is the right foundation. We believe local-first inference is the right default for personal devices. And we believe that honest communication about what the agent can and cannot do is more valuable than impressive demos that obscure the limitations. ukkin is built on these principles, and we will continue developing it accordingly.