
Chapter 11: The Turn Anatomy — How Every LLM Request Is Assembled

In Chapter 10: Semantic Search, we saw how kiro-cli retrieves knowledge locally using vector embeddings. But we've been studying individual organs — the agent loop (Chapter 6), the tool system (Chapter 7), MCP servers (Chapter 8), code intelligence (Chapter 9), semantic search (Chapter 10) — without ever watching the whole body move.

When you type a question into kiro-cli, a surprising amount of machinery activates before the LLM sees a single token. This chapter shows how all those pieces fuse into one API call. Think of it as the "assembly line" chapter: raw materials go in on the left, one carefully packaged request comes out on the right.


The Assembly Pipeline

Here's the big picture. Every turn follows this flow:

sequenceDiagram
    participant U as User Input
    participant A as Agent
    participant CTX as Context Builder
    participant TS as Tool Spec Builder
    participant RTS as RTS API

    U->>A: user types a message
    A->>CTX: format_user_context_message()
    Note over CTX: 7 layers fused into<br/>one synthetic User message
    CTX-->>A: [user_msg, assistant_ack]
    A->>TS: make_tool_spec()
    Note over TS: built-in tools + MCP tools<br/>+ skill-aware filtering
    TS-->>A: Vec<ToolSpec>
    A->>RTS: stream(messages, tool_specs, None)
    Note over RTS: system_prompt param<br/>is ignored by RTS impl
    RTS-->>A: streamed content blocks

Three inputs converge: the context messages (a synthetic User+Assistant pair carrying all background), the tool specifications (a separate parameter), and the conversation history (prior turns). The RTS API receives all three and forwards them to the LLM provider.

Let's examine each piece.


1. No System Prompt: Everything Is a User Message

Most LLM APIs have a dedicated system role for instructions. kiro-cli's production path — the RTS (Runtime Service) — does not support it.

The Model trait accepts a system_prompt parameter:

// crates/agent/src/agent/agent_loop/model.rs:30
fn stream(
    &self,
    messages: Vec<Message>,
    tool_specs: Option<Vec<ToolSpec>>,
    system_prompt: Option<String>,
    cancel_token: CancellationToken,
) -> ...;

But the RTS implementation ignores it — note the underscore prefix:

// crates/chat-cli-v2/src/agent/rts/mod.rs:410
_system_prompt: Option<String>,  // unused

The comment at the injection site says it plainly:

// crates/agent/src/agent/mod.rs:3130
/// We use context messages since the API does not allow
/// any system prompt parameterization.

So how does the agent get its instructions to the LLM? By packing everything into a synthetic User message at the start of the conversation. The LLM sees this role layout:

| Position | Role | Content |
|---|---|---|
| 1 | User | Giant synthetic message with all context layers |
| 2 | Assistant | Canned acknowledgment |
| 3 | User | First real user message |
| 4 | Assistant | LLM's first response |
| ... | ... | Conversation continues |

The LLM never knows the difference — it just sees a very thorough "first question" followed by a cooperative assistant reply. This is the foundation everything else builds on.
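As a rough sketch, with hypothetical Role and Message types standing in for the real ones in the agent crate, the start of a conversation looks something like this:

// Hypothetical types for illustration only; the real definitions live in
// the agent crate. Note there is no System variant anywhere.
enum Role { User, Assistant }

struct Message { role: Role, content: String }

fn first_turn_layout(context_block: String, user_question: String) -> Vec<Message> {
    vec![
        // Position 1: all seven context layers fused into one synthetic message
        Message { role: Role::User, content: context_block },
        // Position 2: the canned acknowledgment
        Message {
            role: Role::Assistant,
            content: "I will fully incorporate this information...".into(),
        },
        // Position 3: the user's actual first question
        Message { role: Role::User, content: user_question },
    ]
}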


2. The 7-Layer Context Injection

The synthetic User message is the heart of every request. It's assembled by format_user_context_message() at crates/agent/src/agent/mod.rs:3174, and it has seven distinct layers stacked in order:

┌─────────────────────────────────────────────┐
│  Layer 1: Conversation Summary              │
│  Layer 2: Knowledge Base Listing            │
│  Layer 3: Task Context                      │
│  Layer 4: Agent Spawn Hooks                 │
│  Layer 5: Resource Files (file://)          │
│  Layer 6: Skills Metadata (skill://)        │
│  Layer 7: Agent Prompt (raw, no delimiters) │
└─────────────────────────────────────────────┘
  One giant User-role message

Each layer (except layer 7) is wrapped in delimiters defined in crates/agent/src/agent/consts.rs:39-40:

pub const CONTEXT_ENTRY_START_HEADER: &str =
    "--- CONTEXT ENTRY BEGIN ---\n";
pub const CONTEXT_ENTRY_END_HEADER: &str =
    "--- CONTEXT ENTRY END ---\n\n";

Here's what each layer carries:

| Layer | Source Lines | What It Contains |
|---|---|---|
| 1. Conversation summary | mod.rs:3194–3200 | Compressed history from prior turns (if any) |
| 2. Knowledge base listing | mod.rs:3203–3207 | Indexed knowledge contexts for RAG (Chapter 10) |
| 3. Task context | mod.rs:3209–3212 | Active task/spec state |
| 4. Agent spawn hooks | mod.rs:3214–3222 | Output from AgentSpawn lifecycle hooks |
| 5. Resource files | mod.rs:3224–3229 | Full content of file:// resources from agent config |
| 6. Skills metadata | mod.rs:3232–3240 | Name + description hints for skill:// resources |
| 7. Agent prompt | mod.rs:3244–3246 | The agent's main instruction, prefixed with "Follow this instruction: " |

Layer 7 is special

The agent prompt is appended raw — it is NOT wrapped in CONTEXT_ENTRY delimiters like the other six layers. This gives it a distinct visual position at the end of the message, making it the last thing the LLM reads before the conversation begins.
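Putting the delimiter constants and the raw layer 7 together, the assembly has roughly this shape. This is a sketch, not the real format_user_context_message(); the helper name wrap_layer and the argument types are assumptions.

const CONTEXT_ENTRY_START_HEADER: &str = "--- CONTEXT ENTRY BEGIN ---\n";
const CONTEXT_ENTRY_END_HEADER: &str = "--- CONTEXT ENTRY END ---\n\n";

// Sketch only: wrap layers 1-6 in the delimiters, then append the agent
// prompt raw with its fixed prefix.
fn wrap_layer(body: &str) -> String {
    format!("{CONTEXT_ENTRY_START_HEADER}{body}\n{CONTEXT_ENTRY_END_HEADER}")
}

fn assemble_context_message(layers: &[String], agent_prompt: &str) -> String {
    let mut out = String::new();
    for layer in layers.iter().filter(|l| !l.is_empty()) {
        out.push_str(&wrap_layer(layer)); // layers 1-6, each delimited
    }
    out.push_str("Follow this instruction: "); // layer 7: no delimiters
    out.push_str(agent_prompt);
    out
}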

After the User message, the agent appends a canned Assistant acknowledgment (crates/agent/src/agent/mod.rs:3165–3168):

"I will fully incorporate this information when generating my responses, and explicitly acknowledge relevant parts of the summary when answering questions."

The caller create_context_messages() at line 3131 returns both as a pair:

sequenceDiagram
    participant Agent
    participant Fmt as format_user_context_message()
    participant Ctx as create_context_messages()

    Agent->>Ctx: build context for this turn
    Ctx->>Fmt: assemble 7 layers
    Fmt-->>Ctx: single User message string
    Ctx-->>Agent: vec![user_msg, assistant_ack]

These two messages are prepended to the conversation history on every turn. The LLM always sees the full context — there's no persistent memory between calls.


3. Tool Specs: The Other Channel

Context messages carry the agent's instructions. But tool definitions travel through a separate API parameter: tools: [...].

The function make_tool_spec() at crates/agent/src/agent/mod.rs:2591 assembles the full tool list:

// Simplified from mod.rs:2591
async fn make_tool_spec(&mut self) -> Vec<ToolSpec> {
    // 1. Gather MCP tool specs from all launched servers
    // 2. Merge with built-in tool definitions
    // 3. Filter by agent config's allowed tools
    // 4. Sanitize names and return
}

It queries each running MCP server for its tool specs, merges them with built-in tools (like fs_read, shell, code), and filters the result against the agent's tools and allowedTools configuration.
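The shape of that merge-and-filter step, sketched with a simplified ToolSpec and a hypothetical allowed-set taken from the agent config (the real function is async and talks to the McpManager):

use std::collections::HashSet;

// Simplified sketch; the real ToolSpec also carries a JSON parameter schema.
#[derive(Debug, Clone)]
struct ToolSpec {
    name: String,
    description: String,
}

fn assemble_tool_specs(
    builtin: Vec<ToolSpec>,
    mcp: Vec<ToolSpec>,
    allowed: &HashSet<String>,
) -> Vec<ToolSpec> {
    builtin
        .into_iter()
        .chain(mcp)                                   // built-in tools first, then MCP tools
        .filter(|t| allowed.contains(&t.name))        // drop tools the agent config excludes
        .map(|mut t| {
            t.name = t.name.replace(['/', ' '], "_"); // crude name sanitization for the provider
            t
        })
        .collect()
}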

Tool specs count against the context window

A common misconception: because tool specs are a separate parameter, they're "free." They are not. The LLM provider serializes them into the context window alongside messages. The budget is:

context messages + tool specs + conversation history + generation
≤ model context window

An agent with 50+ MCP tools can consume thousands of tokens in tool specs alone, leaving less room for conversation history.
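A back-of-the-envelope way to reason about that budget, using a coarse ~4-characters-per-token heuristic (this is not from the codebase, just a way to make the arithmetic concrete):

// Coarse heuristic: roughly 4 characters per token for English-like text.
fn approx_tokens(text: &str) -> usize {
    text.len() / 4
}

fn remaining_for_history(
    context_window: usize,        // e.g. 200_000 tokens
    context_message: &str,        // the 7-layer synthetic User message
    serialized_tool_specs: &str,  // every tool name, description, and schema
    reserved_for_generation: usize,
) -> usize {
    let used = approx_tokens(context_message)
        + approx_tokens(serialized_tool_specs)
        + reserved_for_generation;
    context_window.saturating_sub(used) // what's left for conversation history
}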


4. Skills: Metadata Pre-Injection, Content On-Demand

Chapter 5 introduced agent configuration with file:// and skill:// resource URIs. Here's the critical difference in how they're injected.

file:// resources are read in full and injected as Layer 5 — every byte goes into the context message. Good for small, always-relevant files like steering docs.

skill:// resources are treated differently. Imagine an agent with 100 skill files. Injecting all of them at full content would blow the token budget before the user even asks a question. Instead, kiro-cli injects only metadata — a one-line hint per skill.

The function format_skill_hint() at crates/agent/src/agent/mod.rs:3423 extracts the YAML frontmatter and produces:

{name}: {description} (file: {filepath})

For example:

frontend-design: Best practices for React component architecture (file: .kiro/skills/frontend-design/SKILL.md)
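A naive sketch of how such a hint could be produced from a SKILL.md file. The real format_skill_hint() parses the YAML frontmatter properly; this version just scans for two keys, and the function signature is an assumption:

// Illustrative only: extract `name:` and `description:` from the
// frontmatter block between the leading `---` markers.
fn skill_hint(skill_file_contents: &str, filepath: &str) -> Option<String> {
    let frontmatter = skill_file_contents.strip_prefix("---")?.split("---").next()?;
    let field = |key: &str| {
        frontmatter.lines().find_map(|line| {
            line.trim()
                .strip_prefix(key)
                .map(|value| value.trim().to_string())
        })
    };
    let name = field("name:")?;
    let description = field("description:")?;
    Some(format!("{name}: {description} (file: {filepath})"))
}

Fed the frontend-design skill file, this would yield exactly the one-line hint shown above.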

All hints are grouped under a header defined at crates/agent/src/agent/consts.rs:41:

The following file entries contain: name, filepath,
and description. You SHOULD decide when to read the
full file using the filepath based on its description:

This is lazy loading for LLM context. The LLM sees a menu of available skills and uses fs_read to pull in the ones it needs. The default skill paths (crates/chat-cli-v2/src/util/paths.rs:60-61) scan two locations:

skill://.kiro/skills/*/SKILL.md      (project-local)
skill://~/.kiro/skills/*/SKILL.md    (user-global)
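A sketch of that scan using the glob crate; the crate choice and the $HOME-based expansion of ~ are assumptions about how one could resolve these patterns, not how paths.rs actually does it:

use std::path::PathBuf;
use glob::glob;

fn discover_skill_files() -> Vec<PathBuf> {
    let home = std::env::var("HOME").unwrap_or_default();
    let patterns = [
        ".kiro/skills/*/SKILL.md".to_string(),      // project-local
        format!("{home}/.kiro/skills/*/SKILL.md"),  // user-global
    ];
    patterns
        .iter()
        .filter_map(|p| glob(p).ok()) // skip malformed patterns
        .flatten()                    // iterate matched paths
        .filter_map(Result::ok)       // skip unreadable entries
        .collect()
}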

Skills depend on fs_read

If your agent config disables the fs_read tool, skills become a menu the LLM can read but never order from. The metadata hints will still appear in the context, but the LLM won't be able to fetch the full content.


5. MCP: Launch at Init, Frozen After

Chapter 8 covered MCP server management in detail. The key architectural constraint for turn assembly is: the set of MCP servers is frozen after initialization.

Look at the McpManagerRequest enum at crates/agent/src/agent/mcp/mod.rs:511–537:

pub enum McpManagerRequest {
    LaunchServer { server_name, config },
    GetToolSpecs { server_name },
    GetPrompts { server_name },
    GetPrompt { server_name, name, arguments },
    ExecuteTool { server_name, tool_name, args },
    Terminate,
}

Notice what's missing: there is no AddServer or RemoveServer variant. You can launch servers and terminate the entire manager, but you cannot hot-add or hot-remove individual servers mid-conversation.

launch_mcp_servers() is called in exactly two places:

  1. Agent initialization (crates/agent/src/agent/mod.rs:699) — the normal startup path
  2. Agent swap (crates/agent/src/agent/mod.rs:1391) — which terminates ALL existing servers, creates a fresh McpManager, and relaunches from the new agent's config

Attempting to launch a server that's already running hits the ServerAlreadyLaunched error at line 561.
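The guard behind that error can be sketched like this, assuming a manager that tracks live servers by name (the struct and method shapes are illustrative, not the real crates/agent types):

use std::collections::HashMap;

#[derive(Debug)]
enum McpError {
    ServerAlreadyLaunched(String),
}

struct ServerHandle; // stands in for a live server process/connection

struct McpManager {
    running: HashMap<String, ServerHandle>,
}

impl McpManager {
    fn launch_server(&mut self, name: &str) -> Result<(), McpError> {
        // No hot-add or hot-remove: once a name is claimed it stays claimed
        // until the whole manager is terminated on agent swap or shutdown.
        if self.running.contains_key(name) {
            return Err(McpError::ServerAlreadyLaunched(name.to_string()));
        }
        self.running.insert(name.to_string(), ServerHandle);
        Ok(())
    }
}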

sequenceDiagram
    participant User
    participant Agent
    participant MCP as McpManager

    User->>Agent: /agent swap new-agent
    Agent->>MCP: Terminate (kill all servers)
    Agent->>Agent: create fresh McpManager
    Agent->>MCP: LaunchServer (server A)
    Agent->>MCP: LaunchServer (server B)
    Note over MCP: Frozen until next<br/>swap or shutdown

This is a deliberate tradeoff: a frozen server set keeps tool resolution deterministic within a conversation. The agent always knows exactly which tools are available. The cost is that adding a new MCP server mid-session requires a full /agent swap.


Putting It Together — One Round Trip

Here's a complete turn, from keypress to response, showing how all five pillars combine:

sequenceDiagram
    participant User
    participant TUI
    participant ACP as ACP Server
    participant Agent
    participant RTS as RTS API
    participant LLM
    participant Tool

    User->>TUI: types a question
    TUI->>ACP: session/prompt
    ACP->>Agent: new turn

    Note over Agent: Build context
    Agent->>Agent: create_context_messages()<br/>(7-layer User msg + Assistant ack)
    Agent->>Agent: make_tool_spec()<br/>(built-in + MCP tools)
    Agent->>Agent: append conversation history

    Agent->>RTS: stream(messages, tool_specs, None)
    RTS->>LLM: forward to provider

    LLM-->>RTS: text block
    RTS-->>Agent: agent_message_chunk
    Agent-->>ACP: kiro.dev/session/update
    ACP-->>TUI: render text

    LLM-->>RTS: tool_use block
    RTS-->>Agent: tool_call
    Agent->>Tool: dispatch tool
    Tool-->>Agent: tool_result
    Agent-->>ACP: tool_call_update

    Note over Agent: New turn with tool_result
    Agent->>RTS: stream(messages + tool_result, ...)
    RTS->>LLM: continue
    LLM-->>RTS: final text
    RTS-->>Agent: agent_message_chunk
    Agent-->>ACP: kiro.dev/session/update
    ACP-->>TUI: render final answer

The TUI receives updates via the kiro.dev/session/update notification method (defined at packages/tui/src/acp-client.ts:52). The current TUI handles five session update variants:

| Variant | Purpose |
|---|---|
| user_message_chunk | Echoes user input during session replay |
| agent_message_chunk | Streams LLM text to the terminal |
| tool_call | Announces a tool invocation |
| tool_call_update | Reports tool completion or failure |
| available_commands_update | Refreshes the command palette |

Plus one extension notification: tool_call_chunk, which streams incremental tool output.
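In Rust terms, the variant set the TUI handles could be modeled as a tagged enum like the sketch below. Only the variant names come from the table above; the tag field name and the payload shapes are assumptions, not the actual ACP schema:

use serde::Deserialize;

// Sketch only: field names and payload shapes are invented for illustration.
#[derive(Debug, Deserialize)]
#[serde(tag = "sessionUpdate", rename_all = "snake_case")]
enum SessionUpdate {
    UserMessageChunk { content: String },
    AgentMessageChunk { content: String },
    ToolCall { tool_call_id: String, title: String },
    ToolCallUpdate { tool_call_id: String, status: String },
    ToolCallChunk { tool_call_id: String, content: String },
    AvailableCommandsUpdate { commands: Vec<String> },
}

fn handle(update: SessionUpdate) {
    match update {
        SessionUpdate::AgentMessageChunk { content } => print!("{content}"),
        // A real client renders each variant; anything outside this set is
        // logged as unhandled and discarded, as the current TUI does.
        other => eprintln!("(not rendered in this sketch) {other:?}"),
    }
}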

Limited variant set

The TUI's convertAcpUpdateToEvent() handler only processes the six variants above. Other session update types (if any exist in the ACP SDK) are logged as "Unhandled session update type" and discarded. This is the current TUI's scope — not the full ACP protocol surface.


Why This Design?

The "everything in User messages, tools on the side, no system prompt" architecture isn't accidental. Here's the reasoning:

  • Provider agnosticism. Not all LLM providers support a system role. By using only User and Assistant messages, kiro-cli works with any provider behind RTS without protocol translation.
  • Client-owned history. The agent rebuilds the full context on every turn. The LLM is stateless — it receives the complete conversation each time. This means the client controls exactly what the model sees, with no hidden server-side state.
  • Token efficiency via lazy-loading. Skills inject metadata only (~1 line each). The LLM pulls full content on demand via fs_read. An agent with 50 skills pays ~50 lines of context instead of ~50 files.
  • Deterministic tool resolution. Freezing MCP servers after init means make_tool_spec() returns the same set throughout a conversation. No mid-turn surprises where a tool appears or vanishes.
  • Privacy. Context assembly happens locally. Steering files, skill metadata, and resource content are composed on your machine. Only the final assembled messages cross the network to the LLM provider.

Practical Implications

If you're building against kiro-cli — whether that's a Kanvas session host, an A2A bridge, or a custom agent — keep these in mind:

  • Dynamic MCP requires an agent swap. There is no hot-add API. If your workflow needs to register a new MCP server mid-conversation, you must trigger a /agent swap, which terminates all existing servers and relaunches from the new config.
  • Skills compose with fs_read. The skill system's lazy-loading depends on the LLM being able to call fs_read. If your agent config removes or blocks fs_read, skill hints become inert text — the LLM sees the menu but can't order anything.
  • The context message can be large. The 7-layer synthetic User message can easily reach 10,000+ tokens with multiple steering files, hooks, and resource files. Monitor your token budget, especially when combining many file:// resources with a long conversation history.
  • Tool specs are not free. Each tool definition (name, description, parameter schema) consumes tokens from the same context window as messages. Agents with many MCP servers can hit the ceiling faster than expected.
  • The canned Assistant ack is always present. Every conversation starts with the same two synthetic messages. If you're analyzing token usage or debugging prompt behavior, account for this fixed overhead.

The Analogy

The LLM is a surgeon who's called in for one operation at a time. Each time, you hand them a fresh folder with the patient's full history (conversation summary), current vitals (task context, hooks), allowed instruments (tool specs), reference manuals (skills metadata), and the specific question (user message). They operate, hand back results, and forget everything.

Next time, you hand them the folder again — updated with the results of the last operation. They never remember the previous call. This is why the "folder" (the 7-layer context message) has to be so carefully organized: it's the surgeon's entire world for the duration of one turn.


Conclusion

This chapter unified the five pillars that previous chapters introduced separately:

| Pillar | Chapter | Role in Turn Assembly |
|---|---|---|
| Agent Configuration | 5 | Defines which resources, skills, tools, and MCP servers to load |
| Agent Loop | 6 | Orchestrates the turn cycle: build context → call LLM → dispatch tools → repeat |
| Tool System | 7 | Provides built-in tool specs and executes tool calls |
| MCP Integration | 8 | Supplies external tool specs from the frozen server set |
| Code Intelligence | 9 | Powers the code-aware tools that appear in the tool spec list |

Every turn follows the same assembly line: pack seven layers into a synthetic User message, gather tool specs from built-in and MCP sources, prepend the context pair to conversation history, and stream it all to the LLM. The model responds with text and tool calls, the agent dispatches tools, and the cycle repeats.

Now you've seen the full picture — from the first keypress in the TUI to the last token streamed back. Happy building.

