# Chapter 11: The Turn Anatomy — How Every LLM Request Is Assembled
In Chapter 10: Semantic Search, we saw how kiro-cli retrieves knowledge locally using vector embeddings. But we've been studying individual organs — the agent loop (Chapter 6), the tool system (Chapter 7), MCP servers (Chapter 8), code intelligence (Chapter 9), semantic search (Chapter 10) — without ever watching the whole body move.
When you type a question into kiro-cli, a surprising amount of machinery activates before the LLM sees a single token. This chapter shows how all those pieces fuse into one API call. Think of it as the "assembly line" chapter: raw materials go in on the left, one carefully packaged request comes out on the right.
## The Assembly Pipeline
Here's the big picture. Every turn follows this flow:
```mermaid
sequenceDiagram
    participant U as User Input
    participant A as Agent
    participant CTX as Context Builder
    participant TS as Tool Spec Builder
    participant RTS as RTS API
    U->>A: user types a message
    A->>CTX: format_user_context_message()
    Note over CTX: 7 layers fused into<br/>one synthetic User message
    CTX-->>A: [user_msg, assistant_ack]
    A->>TS: make_tool_spec()
    Note over TS: built-in tools + MCP tools<br/>+ skill-aware filtering
    TS-->>A: Vec<ToolSpec>
    A->>RTS: stream(messages, tool_specs, None)
    Note over RTS: system_prompt param<br/>is ignored by RTS impl
    RTS-->>A: streamed content blocks
```
Three inputs converge: the context messages (a synthetic User+Assistant pair carrying all background), the tool specifications (a separate parameter), and the conversation history (prior turns). The RTS API receives all three and forwards them to the LLM provider.
Let's examine each piece.
## 1. No System Prompt: Everything Is a User Message
Most LLM APIs have a dedicated system role for instructions. kiro-cli's production path — the RTS (Runtime Service) — does not support it.
The Model trait accepts a system_prompt parameter:
```rust
// crates/agent/src/agent/agent_loop/model.rs:30
fn stream(
    &self,
    messages: Vec<Message>,
    tool_specs: Option<Vec<ToolSpec>>,
    system_prompt: Option<String>,
    cancel_token: CancellationToken,
) -> ...;
```
But the RTS implementation ignores it: the parameter is bound with an underscore prefix, Rust's convention for a deliberately unused argument. The comment at the injection site says it plainly:
```rust
// crates/agent/src/agent/mod.rs:3130
/// We use context messages since the API does not allow
/// any system prompt parameterization.
```
So how does the agent get its instructions to the LLM? By packing everything into a synthetic User message at the start of the conversation. The LLM sees this role layout:
| Position | Role | Content |
|---|---|---|
| 1 | User | Giant synthetic message with all context layers |
| 2 | Assistant | Canned acknowledgment |
| 3 | User | First real user message |
| 4 | Assistant | LLM's first response |
| ... | ... | Conversation continues |
The LLM never knows the difference — it just sees a very thorough "first question" followed by a cooperative assistant reply. This is the foundation everything else builds on.
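The two-message scaffold can be sketched with stand-in types (`Role` and `Message` here are illustrative, not kiro-cli's actual structs; the acknowledgment text is abbreviated):

```rust
#[derive(Debug, PartialEq)]
enum Role {
    User,
    Assistant,
}

struct Message {
    role: Role,
    content: String,
}

// Prepend the synthetic context pair to the real conversation history,
// producing the role layout in the table above.
fn assemble_turn(context: String, history: Vec<Message>) -> Vec<Message> {
    let mut messages = vec![
        Message { role: Role::User, content: context },
        Message {
            role: Role::Assistant,
            content: "I will fully incorporate this information...".to_string(),
        },
    ];
    messages.extend(history);
    messages
}
```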
## 2. The 7-Layer Context Injection
The synthetic User message is the heart of every request. It's assembled by format_user_context_message() at crates/agent/src/agent/mod.rs:3174, and it has seven distinct layers stacked in order:
```text
┌─────────────────────────────────────────────┐
│ Layer 1: Conversation Summary               │
│ Layer 2: Knowledge Base Listing             │
│ Layer 3: Task Context                       │
│ Layer 4: Agent Spawn Hooks                  │
│ Layer 5: Resource Files (file://)           │
│ Layer 6: Skills Metadata (skill://)         │
│ Layer 7: Agent Prompt (raw, no delimiters)  │
└─────────────────────────────────────────────┘
                      ↓
         One giant User-role message
```
Each layer (except layer 7) is wrapped in delimiters defined in crates/agent/src/agent/consts.rs:39-40:
```rust
pub const CONTEXT_ENTRY_START_HEADER: &str =
    "--- CONTEXT ENTRY BEGIN ---\n";
pub const CONTEXT_ENTRY_END_HEADER: &str =
    "--- CONTEXT ENTRY END ---\n\n";
```
Here's what each layer carries:
| Layer | Source Lines | What It Contains |
|---|---|---|
| 1. Conversation summary | `mod.rs:3194–3200` | Compressed history from prior turns (if any) |
| 2. Knowledge base listing | `mod.rs:3203–3207` | Indexed knowledge contexts for RAG (Chapter 10) |
| 3. Task context | `mod.rs:3209–3212` | Active task/spec state |
| 4. Agent spawn hooks | `mod.rs:3214–3222` | Output from AgentSpawn lifecycle hooks |
| 5. Resource files | `mod.rs:3224–3229` | Full content of `file://` resources from agent config |
| 6. Skills metadata | `mod.rs:3232–3240` | Name + description hints for `skill://` resources |
| 7. Agent prompt | `mod.rs:3244–3246` | The agent's main instruction, prefixed with "Follow this instruction: " |
**Layer 7 is special.** The agent prompt is appended raw — it is NOT wrapped in CONTEXT_ENTRY delimiters like the other six layers. This gives it a distinct visual position at the end of the message, making it the last thing the LLM reads before the conversation begins.
After the User message, the agent appends a canned Assistant acknowledgment (crates/agent/src/agent/mod.rs:3165–3168):
"I will fully incorporate this information when generating my responses, and explicitly acknowledge relevant parts of the summary when answering questions."
The caller create_context_messages() at line 3131 returns both as a pair:
```mermaid
sequenceDiagram
    participant Agent
    participant Fmt as format_user_context_message()
    participant Ctx as create_context_messages()
    Agent->>Ctx: build context for this turn
    Ctx->>Fmt: assemble 7 layers
    Fmt-->>Ctx: single User message string
    Ctx-->>Agent: vec![user_msg, assistant_ack]
```
These two messages are prepended to the conversation history on every turn. The LLM always sees the full context — there's no persistent memory between calls.
## 3. Tool Specs: The Other Channel
Context messages carry the agent's instructions. But tool definitions travel through a separate API parameter: tools: [...].
The function make_tool_spec() at crates/agent/src/agent/mod.rs:2591 assembles the full tool list:
```rust
// Simplified from mod.rs:2591
async fn make_tool_spec(&mut self) -> Vec<ToolSpec> {
    // 1. Gather MCP tool specs from all launched servers
    // 2. Merge with built-in tool definitions
    // 3. Filter by agent config's allowed tools
    // 4. Sanitize names and return
}
```
It queries each running MCP server for its tool specs, merges them with built-in tools (like fs_read, shell, code), and filters the result against the agent's tools and allowedTools configuration.
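The merge-and-filter step can be sketched as follows (`ToolSpec`'s fields and the function name `merge_and_filter` are illustrative stand-ins, not kiro-cli's actual definitions):

```rust
use std::collections::HashSet;

// Illustrative stand-in for kiro-cli's ToolSpec; real fields differ.
#[derive(Debug, Clone, PartialEq)]
struct ToolSpec {
    name: String,
    description: String,
}

// Built-in specs first, then MCP specs, then keep only the names the
// agent config allows.
fn merge_and_filter(
    builtin: Vec<ToolSpec>,
    mcp: Vec<ToolSpec>,
    allowed: &HashSet<String>,
) -> Vec<ToolSpec> {
    builtin
        .into_iter()
        .chain(mcp)
        .filter(|spec| allowed.contains(&spec.name))
        .collect()
}
```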
**Tool specs count against the context window.** A common misconception: because tool specs are a separate parameter, they're "free." They are not. The LLM provider serializes them into the context window alongside messages, so tool specs and conversation history share a single token budget.
An agent with 50+ MCP tools can consume thousands of tokens in tool specs alone, leaving less room for conversation history.
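As a ballpark sanity check, assuming the common (and rough) ~4-characters-per-token heuristic — real tokenizers differ:

```rust
// Rough heuristic: ~4 characters per token. Only for ballpark budgeting.
fn approx_tokens(text: &str) -> usize {
    text.chars().count().div_ceil(4)
}

// Sum the approximate token cost of already-serialized tool specs.
fn tool_spec_overhead(serialized_specs: &[String]) -> usize {
    serialized_specs.iter().map(|s| approx_tokens(s)).sum()
}
```

Even fifty small specs of ~45 serialized characters each cost on the order of 600 tokens before a single message is counted, and real schemas are much larger.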
## 4. Skills: Metadata Pre-Injection, Content On-Demand
Chapter 5 introduced agent configuration with file:// and skill:// resource URIs. Here's the critical difference in how they're injected.
file:// resources are read in full and injected as Layer 5 — every byte goes into the context message. Good for small, always-relevant files like steering docs.
skill:// resources are treated differently. Imagine an agent with 100 skill files. Injecting all of them at full content would blow the token budget before the user even asks a question. Instead, kiro-cli injects only metadata — a one-line hint per skill.
The function format_skill_hint() at crates/agent/src/agent/mod.rs:3423 extracts the YAML frontmatter and produces a one-line hint. For example:

```text
frontend-design: Best practices for React component architecture (file: .kiro/skills/frontend-design/SKILL.md)
```
All hints are grouped under a header defined at crates/agent/src/agent/consts.rs:41:
```text
The following file entries contain: name, filepath,
and description. You SHOULD decide when to read the
full file using the filepath based on its description:
```
This is lazy loading for LLM context. The LLM sees a menu of available skills and uses fs_read to pull in the ones it needs. The default skill paths (crates/chat-cli-v2/src/util/paths.rs:60-61) define two scan locations.
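The hint extraction can be sketched like this, assuming frontmatter with simple one-line `name:` and `description:` keys (the real format_skill_hint() may parse frontmatter more robustly):

```rust
// Extract `name` and `description` from simple one-line-per-key YAML
// frontmatter and format a one-line hint. A sketch only; real YAML
// parsing handles quoting, nesting, and multi-line values.
fn format_skill_hint(frontmatter: &str, path: &str) -> Option<String> {
    let mut name = None;
    let mut description = None;
    for line in frontmatter.lines() {
        if let Some(v) = line.strip_prefix("name:") {
            name = Some(v.trim());
        } else if let Some(v) = line.strip_prefix("description:") {
            description = Some(v.trim());
        }
    }
    // Both keys are required; otherwise there is no usable hint.
    Some(format!("{}: {} (file: {})", name?, description?, path))
}
```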
**Skills depend on `fs_read`.** If your agent config disables the `fs_read` tool, skills become a menu the LLM can read but never order from. The metadata hints will still appear in the context, but the LLM won't be able to fetch the full content.
## 5. MCP: Launch at Init, Frozen After
Chapter 8 covered MCP server management in detail. The key architectural constraint for turn assembly is: the set of MCP servers is frozen after initialization.
Look at the McpManagerRequest enum at crates/agent/src/agent/mcp/mod.rs:511–537:
```rust
pub enum McpManagerRequest {
    LaunchServer { server_name, config },
    GetToolSpecs { server_name },
    GetPrompts { server_name },
    GetPrompt { server_name, name, arguments },
    ExecuteTool { server_name, tool_name, args },
    Terminate,
}
```
Notice what's missing: there is no AddServer or RemoveServer variant. You can launch servers and terminate the entire manager, but you cannot hot-add or hot-remove individual servers mid-conversation.
launch_mcp_servers() is called in exactly two places:
- **Agent initialization** (crates/agent/src/agent/mod.rs:699) — the normal startup path
- **Agent swap** (crates/agent/src/agent/mod.rs:1391) — which terminates ALL existing servers, creates a fresh `McpManager`, and relaunches from the new agent's config
Attempting to launch a server that's already running hits the ServerAlreadyLaunched error at line 561.
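The launch-only behavior can be sketched as a registry that only grows (types here are illustrative; the real manager runs async over channel requests):

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum McpError {
    ServerAlreadyLaunched,
}

// Launch-only registry: servers can be added at init and never removed
// individually; tearing down means dropping the whole manager.
struct McpManager {
    servers: HashMap<String, String>, // name -> config
}

impl McpManager {
    fn launch_server(&mut self, name: &str, config: &str) -> Result<(), McpError> {
        if self.servers.contains_key(name) {
            return Err(McpError::ServerAlreadyLaunched);
        }
        self.servers.insert(name.to_string(), config.to_string());
        Ok(())
    }
}
```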
```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant MCP as McpManager
    User->>Agent: /agent swap new-agent
    Agent->>MCP: Terminate (kill all servers)
    Agent->>Agent: create fresh McpManager
    Agent->>MCP: LaunchServer (server A)
    Agent->>MCP: LaunchServer (server B)
    Note over MCP: Frozen until next<br/>swap or shutdown
```
This is a deliberate tradeoff: a frozen server set keeps tool resolution deterministic within a conversation. The agent always knows exactly which tools are available. The cost is that adding a new MCP server mid-session requires a full /agent swap.
## Putting It Together — One Round Trip
Here's a complete turn, from keypress to response, showing how all five pillars combine:
```mermaid
sequenceDiagram
    participant User
    participant TUI
    participant ACP as ACP Server
    participant Agent
    participant RTS as RTS API
    participant LLM
    participant Tool
    User->>TUI: types a question
    TUI->>ACP: session/prompt
    ACP->>Agent: new turn
    Note over Agent: Build context
    Agent->>Agent: create_context_messages()<br/>(7-layer User msg + Assistant ack)
    Agent->>Agent: make_tool_spec()<br/>(built-in + MCP tools)
    Agent->>Agent: append conversation history
    Agent->>RTS: stream(messages, tool_specs, None)
    RTS->>LLM: forward to provider
    LLM-->>RTS: text block
    RTS-->>Agent: agent_message_chunk
    Agent-->>ACP: kiro.dev/session/update
    ACP-->>TUI: render text
    LLM-->>RTS: tool_use block
    RTS-->>Agent: tool_call
    Agent->>Tool: dispatch tool
    Tool-->>Agent: tool_result
    Agent-->>ACP: tool_call_update
    Note over Agent: New turn with tool_result
    Agent->>RTS: stream(messages + tool_result, ...)
    RTS->>LLM: continue
    LLM-->>RTS: final text
    RTS-->>Agent: agent_message_chunk
    Agent-->>ACP: kiro.dev/session/update
    ACP-->>TUI: render final answer
```
The TUI receives updates via the kiro.dev/session/update notification method (defined at packages/tui/src/acp-client.ts:52). The current TUI handles five session update variants:
| Variant | Purpose |
|---|---|
| `user_message_chunk` | Echoes user input during session replay |
| `agent_message_chunk` | Streams LLM text to the terminal |
| `tool_call` | Announces a tool invocation |
| `tool_call_update` | Reports tool completion or failure |
| `available_commands_update` | Refreshes the command palette |
Plus one extension notification: tool_call_chunk, which streams incremental tool output.
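The dispatch shape can be sketched as follows (the variant strings come from the table above; the event names on the right are my own placeholders, not the TUI's actual event types):

```rust
// Map an ACP session-update variant to a TUI-side event kind.
// Unhandled variants fall through to None and are logged and dropped.
fn convert_update(variant: &str) -> Option<&'static str> {
    match variant {
        "user_message_chunk" => Some("echo-user-input"),
        "agent_message_chunk" => Some("stream-assistant-text"),
        "tool_call" => Some("announce-tool-call"),
        "tool_call_update" => Some("update-tool-status"),
        "available_commands_update" => Some("refresh-command-palette"),
        "tool_call_chunk" => Some("stream-tool-output"), // extension
        _ => None,
    }
}
```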
**Limited variant set.** The TUI's `convertAcpUpdateToEvent()` handler only processes the six variants above. Other session update types (if any exist in the ACP SDK) are logged as "Unhandled session update type" and discarded. This is the current TUI's scope — not the full ACP protocol surface.
## Why This Design?
The "everything in User messages, tools on the side, no system prompt" architecture isn't accidental. Here's the reasoning:
- **Provider agnosticism.** Not all LLM providers support a `system` role. By using only `User` and `Assistant` messages, kiro-cli works with any provider behind RTS without protocol translation.
- **Client-owned history.** The agent rebuilds the full context on every turn. The LLM is stateless — it receives the complete conversation each time. This means the client controls exactly what the model sees, with no hidden server-side state.
- **Token efficiency via lazy loading.** Skills inject metadata only (~1 line each). The LLM pulls full content on demand via `fs_read`. An agent with 50 skills pays ~50 lines of context instead of ~50 files.
- **Deterministic tool resolution.** Freezing MCP servers after init means `make_tool_spec()` returns the same set throughout a conversation. No mid-turn surprises where a tool appears or vanishes.
- **Privacy.** Context assembly happens locally. Steering files, skill metadata, and resource content are composed on your machine. Only the final assembled messages cross the network to the LLM provider.
## Practical Implications
If you're building against kiro-cli — whether that's a Kanvas session host, an A2A bridge, or a custom agent — keep these in mind:
- **Dynamic MCP requires an agent swap.** There is no hot-add API. If your workflow needs to register a new MCP server mid-conversation, you must trigger an `/agent` swap, which terminates all existing servers and relaunches from the new config.
- **Skills compose with `fs_read`.** The skill system's lazy loading depends on the LLM being able to call `fs_read`. If your agent config removes or blocks `fs_read`, skill hints become inert text — the LLM sees the menu but can't order anything.
- **The context message can be large.** The 7-layer synthetic User message can easily reach 10,000+ tokens with multiple steering files, hooks, and resource files. Monitor your token budget, especially when combining many `file://` resources with a long conversation history.
- **Tool specs are not free.** Each tool definition (name, description, parameter schema) consumes tokens from the same context window as messages. Agents with many MCP servers can hit the ceiling faster than expected.
- **The canned Assistant ack is always present.** Every conversation starts with the same two synthetic messages. If you're analyzing token usage or debugging prompt behavior, account for this fixed overhead.
## The Analogy
The LLM is a surgeon who's called in for one operation at a time. Each time, you hand them a fresh folder with the patient's full history (conversation summary), current vitals (task context, hooks), allowed instruments (tool specs), reference manuals (skills metadata), and the specific question (user message). They operate, hand back results, and forget everything.
Next time, you hand them the folder again — updated with the results of the last operation. They never remember the previous call. This is why the "folder" (the 7-layer context message) has to be so carefully organized: it's the surgeon's entire world for the duration of one turn.
## Conclusion
This chapter unified the five pillars that previous chapters introduced separately:
| Pillar | Chapter | Role in Turn Assembly |
|---|---|---|
| Agent Configuration | 5 | Defines which resources, skills, tools, and MCP servers to load |
| Agent Loop | 6 | Orchestrates the turn cycle: build context → call LLM → dispatch tools → repeat |
| Tool System | 7 | Provides built-in tool specs and executes tool calls |
| MCP Integration | 8 | Supplies external tool specs from frozen server set |
| Code Intelligence | 9 | Powers code-aware tools that appear in the tool spec list |
Every turn follows the same assembly line: pack seven layers into a synthetic User message, gather tool specs from built-in and MCP sources, prepend the context pair to conversation history, and stream it all to the LLM. The model responds with text and tool calls, the agent dispatches tools, and the cycle repeats.
Now you've seen the full picture — from the first keypress in the TUI to the last token streamed back. Happy building.