Chapter 10: Semantic Search & Knowledge¶
In Chapter 9: Code Intelligence, we explored how kiro-cli understands code structurally — parsing ASTs, resolving symbols, navigating definitions. That's powerful when you know what you're looking for. But sometimes you don't have a symbol name. You have a question: "How does our authentication architecture work?" or "Which design doc covers rate limiting?"
Structured search is like scanning a library's card catalog by call number. Semantic search is like walking up to the librarian and describing what you need in plain English — and having them pull the three most relevant books off the shelf.
That's what this chapter is about: local vector embeddings that let kiro-cli understand meaning, not just text.
Motivation¶
Why run embeddings locally instead of calling a cloud API?
- Privacy — Your design docs, internal architecture notes, and proprietary code never leave your machine.
- Offline — Works on a plane, in a bunker, behind an air-gapped network. No API key required.
- Speed — No network round-trip. Embedding a query takes milliseconds on CPU.
- RAG integration — The agent can retrieve relevant context before calling the LLM, grounding its answers in your actual knowledge base instead of hallucinating.
kiro-cli ships with Candle, a lightweight ML framework written in Rust. It loads a pre-trained sentence-transformer model (all-MiniLM-L6-v2) and runs inference entirely on your CPU — no Python, no CUDA, no external runtime.
Use Case¶
Imagine you have a folder of design documents:
~/docs/architecture/
├── auth-design.md
├── rate-limiting.md
├── data-pipeline.md
└── caching-strategy.md
You index this folder as a knowledge context. Later, you (or the agent) ask:
"How do we handle token refresh for service-to-service auth?"
The semantic search system:
- Embeds your query into a 384-dimensional vector
- Compares it against every pre-embedded chunk from your docs
- Returns the top-k most similar chunks — likely paragraphs from auth-design.md about token lifecycle
The agent then feeds those chunks into the LLM prompt as grounding context. This is Retrieval-Augmented Generation (RAG) — and it all happens locally.
Key Concepts¶
Embeddings¶
An embedding is a fixed-length vector of floats that captures the meaning of a piece of text. Texts with similar meanings produce vectors that are close together in high-dimensional space.
"token refresh flow" → [0.12, -0.34, 0.56, ..., 0.08] (384 floats)
"JWT renewal process" → [0.11, -0.33, 0.55, ..., 0.09] (very close!)
"database sharding" → [-0.45, 0.22, -0.18, ..., 0.67] (far away)
kiro-cli uses the all-MiniLM-L6-v2 model — a 6-layer BERT variant that produces 384-dimensional embeddings. It's small (roughly 23 million parameters), fast, and well-suited for semantic similarity.
Candle ML Framework¶
Candle is Hugging Face's Rust-native ML framework. kiro-cli uses it to load the BERT model weights (.safetensors format), tokenize input text, run forward inference, and extract embeddings — all without leaving Rust.
// crates/semantic-search-client/src/embedding/candle.rs (simplified)
pub struct CandleTextEmbedder {
model: BertModel,
tokenizer: Tokenizer,
device: Device, // CPU — always
config: ModelConfig,
}
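Loading the model is a one-time cost at startup. Below is a minimal, illustrative loader, assuming the candle-core, candle-nn, candle-transformers, and tokenizers APIs roughly as used in Candle's upstream BERT example; the real constructor also handles model download and pulls its Config from the hard-coded definitions in candle_models.rs.

```rust
use std::path::Path;

use anyhow::Result;
use candle_core::{DType, Device};
use candle_nn::VarBuilder;
use candle_transformers::models::bert::{BertModel, Config};
use tokenizers::Tokenizer;

/// Illustrative loader: builds a CPU-only BERT model from a local model
/// directory. The Config (hidden size, layer count, ...) would come from the
/// model definitions in candle_models.rs.
fn load_bert(model_dir: &Path, config: &Config) -> Result<(BertModel, Tokenizer)> {
    let device = Device::Cpu; // CPU only, no CUDA required

    // WordPiece tokenizer shipped alongside the weights.
    let tokenizer = Tokenizer::from_file(model_dir.join("tokenizer.json"))
        .map_err(anyhow::Error::msg)?;

    // Memory-map the safetensors weights and instantiate the model.
    let vb = unsafe {
        VarBuilder::from_mmaped_safetensors(
            &[model_dir.join("model.safetensors")],
            DType::F32,
            &device,
        )?
    };
    let model = BertModel::load(vb, config)?;

    Ok((model, tokenizer))
}
```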
Chunking¶
Documents are split into overlapping chunks before embedding. This is necessary because the model has a 512-token context window, and a 20-page design doc won't fit in one pass.
// crates/semantic-search-client/src/processing/text_chunker.rs
pub fn chunk_text(text: &str, chunk_size: Option<usize>, overlap: Option<usize>) -> Vec<String> {
let config = config::get_config();
let chunk_size = chunk_size.unwrap_or(config.chunk_size); // default: 512 words
let overlap = overlap.unwrap_or(config.chunk_overlap); // default: 128 words
// ... sliding window over whitespace-split words
}
Overlap ensures that a concept spanning a chunk boundary still appears in at least one chunk.
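A minimal sketch of that sliding window, assuming plain whitespace splitting (the real chunk_text also pulls the configured defaults shown above):

```rust
/// Illustrative sliding-window chunker: `chunk_size` words per chunk, with
/// the last `overlap` words repeated at the start of the next chunk.
fn chunk_words(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let step = chunk_size.saturating_sub(overlap).max(1); // 512 - 128 = 384 by default

    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break;
        }
        start += step;
    }
    chunks
}
```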
Vector Storage (HNSW Index)¶
Embedded chunks are stored in an HNSW (Hierarchical Navigable Small World) index — a data structure optimized for fast approximate nearest-neighbor search. kiro-cli uses the hnsw_rs crate:
// crates/semantic-search-client/src/index/vector_index.rs (simplified)
pub struct VectorIndex {
index: RwLock<Hnsw<'static, f32, DistCosine>>,
count: AtomicUsize,
}
The index supports insert(vector, id) and search(query_vector, top_k). It uses cosine distance — two vectors pointing in the same direction (similar meaning) have a distance near 0.
Similarity Search (Cosine Distance)¶
Cosine similarity measures the angle between two vectors, ignoring magnitude, and is the standard metric for sentence embeddings. The index works with the complementary cosine distance (1 − similarity):
- 0.0 = identical meaning
- 1.0 = orthogonal vectors, essentially unrelated
- Values in between = degrees of relatedness
The HNSW index returns results sorted by ascending cosine distance — closest matches first.
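To make the metric concrete, here is a small self-contained sketch of cosine distance plus a brute-force top-k search. The HNSW index answers the same question approximately, without scanning every stored vector:

```rust
/// Cosine distance = 1 - cosine similarity. 0.0 means the vectors point the
/// same way (same meaning); 1.0 means they are orthogonal (unrelated).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// What the vector index does conceptually: return the ids of the top_k
/// closest chunks, sorted by ascending distance (closest first).
fn brute_force_search(query: &[f32], chunks: &[Vec<f32>], top_k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = chunks
        .iter()
        .enumerate()
        .map(|(id, v)| (id, cosine_distance(query, v)))
        .collect();
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(top_k);
    scored
}
```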
How It Works: End-to-End Flow¶
sequenceDiagram
participant User
participant Agent
participant Embedder
participant VectorDB
participant LLM
User->>Agent: "How does token refresh work?"
Agent->>Embedder: embed(query)
Embedder-->>Agent: query_vector [384 floats]
Agent->>VectorDB: search(query_vector, top_k=5)
VectorDB-->>Agent: top-k chunk results
Agent->>LLM: prompt + retrieved chunks
LLM-->>User: grounded answer
- The user's question reaches the agent
- The agent embeds the query using CandleTextEmbedder
- The resulting vector is searched against the HNSW index
- The top-k most similar document chunks are returned with their metadata (file path, text)
- The agent injects those chunks into the LLM prompt as context, producing a grounded answer (sketched below)
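That last step, turning search hits into grounding context, is plain string assembly. A simplified sketch follows; the RetrievedChunk type and its fields are illustrative stand-ins for the real result type in types.rs:

```rust
/// Illustrative search-result shape; the real type lives in types.rs.
struct RetrievedChunk {
    file_path: String,
    text: String,
    distance: f32,
}

/// Build the grounding block that gets prepended to the LLM prompt.
fn build_grounded_prompt(question: &str, hits: &[RetrievedChunk]) -> String {
    let mut prompt = String::from("Use the following project knowledge to answer.\n\n");
    for (i, hit) in hits.iter().enumerate() {
        prompt.push_str(&format!(
            "[{}] {} (distance {:.2})\n{}\n\n",
            i + 1,
            hit.file_path,
            hit.distance,
            hit.text
        ));
    }
    prompt.push_str(&format!("Question: {question}\n"));
    prompt
}
```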
Internal Implementation¶
The Embedding Trait¶
All embedding backends implement a common trait, making the system pluggable:
// crates/semantic-search-client/src/embedding/trait_def.rs
pub trait TextEmbedderTrait: Send + Sync {
fn embed(&self, text: &str) -> Result<Vec<f32>>;
fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>>;
}
kiro-cli ships two implementations:
| Type | Backend | Use Case |
|---|---|---|
| Best | Candle (all-MiniLM-L6-v2) | High-quality semantic similarity |
| Fast | BM25 (keyword scoring) | Quick lexical search, no model needed |
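Because both backends sit behind the same trait, calling code never needs to know which one is active. A minimal sketch, using anyhow::Result in place of the crate's own Result alias and a purely illustrative stub backend:

```rust
use anyhow::Result;

// Repeated from trait_def.rs above so this example is self-contained.
pub trait TextEmbedderTrait: Send + Sync {
    fn embed(&self, text: &str) -> Result<Vec<f32>>;
    fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>>;
}

/// Purely illustrative backend: returns zero vectors. It exists only to show
/// how calling code consumes the trait object.
struct StubEmbedder;

impl TextEmbedderTrait for StubEmbedder {
    fn embed(&self, _text: &str) -> Result<Vec<f32>> {
        Ok(vec![0.0; 384])
    }
    fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>> {
        texts.iter().map(|t| self.embed(t)).collect()
    }
}

/// The client only ever holds a Box<dyn TextEmbedderTrait>, so swapping Candle
/// for BM25 (or a test stub) is a construction-time decision.
fn embed_query(embedder: &dyn TextEmbedderTrait, query: &str) -> Result<Vec<f32>> {
    embedder.embed(query)
}
```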
Embedding with Candle¶
The CandleTextEmbedder loads the model once and reuses it for all queries:
// crates/semantic-search-client/src/embedding/candle.rs (simplified flow)
impl CandleTextEmbedder {
pub fn embed(&self, text: &str) -> Result<Vec<f32>> {
// 1. Tokenize text
// 2. Convert tokens → tensors
// 3. Run BERT forward pass
// 4. Mean-pool over token dimension
// 5. L2-normalize the result
// → 384-dimensional unit vector
}
}
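Steps 4 and 5 are simple vector math. The sketch below shows them on plain Vec<f32> values rather than Candle tensors, purely to make the operations concrete:

```rust
/// Mean-pool per-token embeddings (one Vec<f32> per token) into a single
/// sentence vector, then L2-normalize it so cosine similarity reduces to a
/// plain dot product.
fn mean_pool_and_normalize(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let dim = token_embeddings[0].len(); // 384 for all-MiniLM-L6-v2
    let n = token_embeddings.len() as f32;

    // Average each dimension across all tokens.
    let mut pooled = vec![0.0f32; dim];
    for token in token_embeddings {
        for (acc, value) in pooled.iter_mut().zip(token) {
            *acc += value / n;
        }
    }

    // L2-normalize: divide by the vector's Euclidean length.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    for x in &mut pooled {
        *x /= norm;
    }
    pooled
}
```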
Batch embedding uses Rayon for parallel processing — multiple chunks are embedded concurrently across CPU cores.
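A sketch of what that parallel path can look like with Rayon, reusing TextEmbedderTrait from above (illustrative, not the exact embed_batch implementation):

```rust
use rayon::prelude::*;

/// Illustrative parallel batch embedding: each chunk is embedded on its own
/// CPU core, and the first error (if any) aborts the whole batch.
fn embed_batch_parallel(
    embedder: &dyn TextEmbedderTrait, // Send + Sync, so it can be shared across threads
    texts: &[String],
) -> anyhow::Result<Vec<Vec<f32>>> {
    texts
        .par_iter()
        .map(|text| embedder.embed(text))
        .collect() // collecting into Result short-circuits on the first failure
}
```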
Knowledge Contexts¶
A knowledge context is a named, searchable collection of embedded chunks. The SemanticSearchClient manages them:
// crates/semantic-search-client/src/client/implementation.rs (simplified)
pub struct SemanticSearchClient {
base_dir: PathBuf,
volatile_contexts: ContextMap, // in-memory only
persistent_contexts: HashMap<ContextId, KnowledgeContext>, // saved to disk
embedder: Box<dyn TextEmbedderTrait>,
config: SemanticSearchConfig,
}
Contexts can be persistent (survive restarts, stored as JSON + HNSW files on disk) or volatile (in-memory, discarded on exit). Each context tracks its source path, include/exclude patterns, and embedding type.
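As a rough mental model, a context's metadata looks something like the struct below; the field names are illustrative, not the exact definitions in types.rs:

```rust
use std::path::PathBuf;

/// Illustrative shape of a knowledge context's metadata. The real
/// KnowledgeContext in types.rs tracks similar information.
struct ContextMetadata {
    id: String,                    // stable identifier for the context
    name: String,                  // human-readable label
    source_path: PathBuf,          // the directory that was indexed
    include_patterns: Vec<String>, // e.g. "**/*.md"
    exclude_patterns: Vec<String>, // e.g. "**/target/**"
    embedding_type: EmbeddingType, // Best (Candle) or Fast (BM25)
    persistent: bool,              // saved to disk vs. in-memory only
}

/// Mirrors the two embedding backends described earlier.
enum EmbeddingType {
    Best,
    Fast,
}
```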
Indexing Pipeline¶
When you add a directory as a knowledge context:
- File discovery — Walk the directory, filter by patterns, classify file types
- Processing — Read each file, detect type (Markdown, code, text, etc.)
- Chunking — Split content into overlapping word-windows (default: 512 words, 128 overlap)
- Embedding — Batch-embed all chunks through Candle
- Indexing — Insert each (vector, id) pair into the HNSW index
- Persistence — Save the data points as JSON and the HNSW graph to disk (the sketch after this list ties the steps together)
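Stitched together for a single file, the pipeline is roughly the sketch below, reusing the chunker and embedder interfaces from earlier sketches (file-type detection, progress tracking, and persistence omitted):

```rust
use std::path::Path;

/// Illustrative end-to-end indexing of one file: read, chunk, embed, insert.
/// Uses chunk_words and TextEmbedderTrait from the sketches above; the real
/// pipeline also classifies file types, reports progress, and can be cancelled.
fn index_file(
    path: &Path,
    embedder: &dyn TextEmbedderTrait,
    store: &mut Vec<(usize, Vec<f32>, String)>, // stand-in for the HNSW index + data points
) -> anyhow::Result<()> {
    let text = std::fs::read_to_string(path)?;

    // Sliding-window chunking (defaults: 512-word chunks, 128-word overlap).
    let chunks = chunk_words(&text, 512, 128);

    // Batch-embed every chunk, then insert each (vector, id) pair.
    let vectors = embedder.embed_batch(&chunks)?;
    for (chunk, vector) in chunks.into_iter().zip(vectors) {
        let id = store.len();
        store.push((id, vector, chunk));
    }
    Ok(())
}
```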
The AsyncSemanticSearchClient runs this pipeline in a background worker with progress tracking and cancellation support.
Hybrid Search: BM25 + Vectors¶
For the Fast embedding type, kiro-cli uses a BM25 index — a classic keyword-scoring algorithm that doesn't need a neural model:
// crates/semantic-search-client/src/index/bm25_index.rs (simplified)
pub struct BM25Index {
engine: RwLock<SearchEngine<usize>>,
// ...
}
BM25 excels at exact keyword matches ("ProvisionedThroughputExceededException") while vector search excels at semantic matches ("DynamoDB throttling"). The ContextManager can maintain both index types for the same context, giving the agent the best of both worlds.
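As a toy illustration of why keeping both pays off (not the actual ContextManager merging logic), you can run both searches and interleave the ranked results:

```rust
use std::collections::HashSet;

/// Toy rank-based fusion of lexical (BM25) and semantic (vector) results:
/// alternate between the two ranked lists, skipping duplicate chunk ids.
fn merge_by_rank(bm25_hits: &[usize], vector_hits: &[usize], top_k: usize) -> Vec<usize> {
    let mut merged = Vec::new();
    let mut seen = HashSet::new();

    for (lexical, semantic) in bm25_hits.iter().zip(vector_hits) {
        for &id in [lexical, semantic] {
            if seen.insert(id) {
                merged.push(id);
            }
        }
    }
    merged.truncate(top_k);
    merged
}
```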
Configuration¶
Defaults live in SemanticSearchConfig:
// crates/semantic-search-client/src/config.rs
impl Default for SemanticSearchConfig {
fn default() -> Self {
Self {
chunk_size: 512,
chunk_overlap: 128,
default_results: 5,
model_name: "all-MiniLM-L6-v2".to_string(),
max_files: 10000,
embedding_type: EmbeddingType::Best, // Candle by default
// ...
}
}
}
The config is stored at ~/.semantic_search/semantic_search_config.json and loaded once at startup via a thread-safe OnceLock.
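The load-once pattern looks roughly like the sketch below, assuming serde and serde_json for the JSON file and using a reduced copy of the config struct:

```rust
use std::path::PathBuf;
use std::sync::OnceLock;

use serde::Deserialize;

/// Reduced copy of the config for illustration; the real struct has more fields.
#[derive(Deserialize)]
struct SemanticSearchConfig {
    chunk_size: usize,
    chunk_overlap: usize,
    default_results: usize,
    model_name: String,
}

static CONFIG: OnceLock<SemanticSearchConfig> = OnceLock::new();

/// The first caller reads ~/.semantic_search/semantic_search_config.json
/// (falling back to defaults on any error); later callers get the cached value.
fn get_config() -> &'static SemanticSearchConfig {
    CONFIG.get_or_init(|| {
        let path = PathBuf::from(std::env::var("HOME").unwrap_or_default())
            .join(".semantic_search/semantic_search_config.json");
        std::fs::read_to_string(&path)
            .ok()
            .and_then(|raw| serde_json::from_str(&raw).ok())
            .unwrap_or(SemanticSearchConfig {
                chunk_size: 512,
                chunk_overlap: 128,
                default_results: 5,
                model_name: "all-MiniLM-L6-v2".to_string(),
            })
    })
}
```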
Model Management¶
Models are stored locally under ~/.semantic_search/models/all-MiniLM-L6-v2/. The model downloader fetches model.safetensors and tokenizer.json on first use. SHA verification ensures integrity:
~/.semantic_search/
├── models/
│ └── all-MiniLM-L6-v2/
│ ├── model.safetensors (~23MB)
│ └── tokenizer.json
├── contexts.json (metadata for all persistent contexts)
└── semantic_search_config.json
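The integrity check itself is a straightforward hash comparison. A sketch assuming the sha2 and hex crates, with the expected digest supplied alongside the model definition:

```rust
use std::path::Path;

use sha2::{Digest, Sha256};

/// Hash the downloaded file and compare it against the expected hex digest.
/// On a mismatch the downloader would discard the file and retry.
fn verify_sha256(path: &Path, expected_hex: &str) -> anyhow::Result<bool> {
    let bytes = std::fs::read(path)?;
    let digest = Sha256::digest(&bytes);
    Ok(hex::encode(&digest[..]) == expected_hex.to_lowercase())
}
```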
Key Files at a Glance¶
| File | Purpose |
|---|---|
| embedding/candle.rs | BERT inference via Candle — tokenize, forward pass, pool, normalize |
| embedding/trait_def.rs | TextEmbedderTrait — common interface for all embedding backends |
| embedding/candle_models.rs | Model configs (MiniLM-L6-v2, MiniLM-L12-v2) and BERT hyperparameters |
| index/vector_index.rs | HNSW index — insert, search, save/load to disk |
| index/bm25_index.rs | BM25 keyword index — lexical search fallback |
| processing/text_chunker.rs | Sliding-window text chunking with configurable size and overlap |
| processing/file_processor.rs | File type detection and content extraction |
| client/implementation.rs | SemanticSearchClient — context CRUD, search orchestration |
| client/context/semantic_context.rs | Per-context data points + vector index |
| client/context/context_manager.rs | Manages volatile + persistent + BM25 contexts |
| config.rs | Global config with chunk size, model name, defaults |
| types.rs | Core types: DataPoint, SearchResult, KnowledgeContext, EmbeddingType |
Conclusion¶
Over the course of these ten chapters, we've traced the full path of a user interaction through kiro-cli — from the first keypress to a grounded, knowledge-aware response:
| Chapter | What We Learned |
|---|---|
| 1. TUI | How the terminal interface captures input and renders output |
| 2. Twinki | The message bus that connects UI, agent, and tools |
| 3. ACP | The Agent Communication Protocol that bridges CLI and LLM |
| 4. Session Manager | How sessions are created, persisted, and resumed |
| 5. Agent Configuration | Loading system prompts, tools, and MCP servers |
| 6. Agent Loop | The turn-by-turn orchestration between user, model, and tools |
| 7. Tool System | How tool calls are dispatched, executed, and returned |
| 8. MCP Integration | Connecting external tool servers via the Model Context Protocol |
| 9. Code Intelligence | AST-based code understanding — symbols, definitions, references |
| 10. Semantic Search | Local vector embeddings for meaning-based retrieval and RAG |
Each layer builds on the ones below it. The TUI captures your question. The session manager loads your history. The agent loop orchestrates the conversation. The tool system executes actions. And semantic search grounds the LLM's answers in your actual knowledge — all running locally, all in Rust.
For the full table of contents, see the index.
Thank you for following along. Whether you're contributing to kiro-cli, building your own agent framework, or just curious about how the pieces fit together — we hope this walkthrough gave you a clear map of the territory. Happy building.