
Chapter 10: Semantic Search & Knowledge

In Chapter 9: Code Intelligence, we explored how kiro-cli understands code structurally — parsing ASTs, resolving symbols, navigating definitions. That's powerful when you know what you're looking for. But sometimes you don't have a symbol name. You have a question: "How does our authentication architecture work?" or "Which design doc covers rate limiting?"

Structured search is like scanning a library's card catalog by call number. Semantic search is like walking up to the librarian and describing what you need in plain English — and having them pull the three most relevant books off the shelf.

That's what this chapter is about: local vector embeddings that let kiro-cli understand meaning, not just text.


Motivation

Why run embeddings locally instead of calling a cloud API?

  1. Privacy — Your design docs, internal architecture notes, and proprietary code never leave your machine.
  2. Offline — Works on a plane, in a bunker, behind an air-gapped network. No API key required.
  3. Speed — No network round-trip. Embedding a query takes milliseconds on CPU.
  4. RAG integration — The agent can retrieve relevant context before calling the LLM, grounding its answers in your actual knowledge base instead of hallucinating.

kiro-cli ships with Candle, a lightweight ML framework written in Rust. It loads a pre-trained sentence-transformer model (all-MiniLM-L6-v2) and runs inference entirely on your CPU — no Python, no CUDA, no external runtime.


Use Case

Imagine you have a folder of design documents:

~/docs/architecture/
├── auth-design.md
├── rate-limiting.md
├── data-pipeline.md
└── caching-strategy.md

You index this folder as a knowledge context. Later, you (or the agent) ask:

"How do we handle token refresh for service-to-service auth?"

The semantic search system:

  1. Embeds your query into a 384-dimensional vector
  2. Compares it against every pre-embedded chunk from your docs
  3. Returns the top-k most similar chunks — likely paragraphs from auth-design.md about token lifecycle

The agent then feeds those chunks into the LLM prompt as grounding context. This is Retrieval-Augmented Generation (RAG) — and it all happens locally.


Key Concepts

Embeddings

An embedding is a fixed-length vector of floats that captures the meaning of a piece of text. Texts with similar meanings produce vectors that are close together in high-dimensional space.

"token refresh flow"  → [0.12, -0.34, 0.56, ..., 0.08]  (384 floats)
"JWT renewal process" → [0.11, -0.33, 0.55, ..., 0.09]  (very close!)
"database sharding"   → [-0.45, 0.22, -0.18, ..., 0.67] (far away)

kiro-cli uses the all-MiniLM-L6-v2 model — a 6-layer BERT variant that produces 384-dimensional embeddings. It's small (~23MB), fast, and well-suited for semantic similarity.
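"Close together" can be made concrete with cosine similarity, the measure used throughout this chapter. A minimal sketch (the three-element vectors in the usage below are illustrative, not real model output):

```rust
/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Ranges from -1.0 to 1.0; higher means the vectors point in
/// more similar directions, i.e. more similar meaning.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```

Applied to the toy vectors above, the first two score near 1.0 while the third scores far lower.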

Candle ML Framework

Candle is Hugging Face's Rust-native ML framework. kiro-cli uses it to load the BERT model weights (.safetensors format), tokenize input text, run forward inference, and extract embeddings — all without leaving Rust.

// crates/semantic-search-client/src/embedding/candle.rs (simplified)
pub struct CandleTextEmbedder {
    model: BertModel,
    tokenizer: Tokenizer,
    device: Device,       // CPU — always
    config: ModelConfig,
}

Chunking

Documents are split into overlapping chunks before embedding. This is necessary because the model has a 512-token context window, and a 20-page design doc won't fit in one pass.

// crates/semantic-search-client/src/processing/text_chunker.rs
pub fn chunk_text(text: &str, chunk_size: Option<usize>, overlap: Option<usize>) -> Vec<String> {
    let config = config::get_config();
    let chunk_size = chunk_size.unwrap_or(config.chunk_size);   // default: 512 words
    let overlap = overlap.unwrap_or(config.chunk_overlap);       // default: 128 words
    // ... sliding window over whitespace-split words
}

Overlap ensures that a concept spanning a chunk boundary still appears in at least one chunk.
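The sliding window itself is only a few lines. A self-contained sketch of the idea, where the step calculation and end-of-text handling are assumptions rather than the exact production logic:

```rust
/// Split text into overlapping word-window chunks.
/// With chunk_size = 512 and overlap = 128, each window starts
/// 384 words after the previous one.
fn chunk_text(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return Vec::new();
    }
    let step = chunk_size.saturating_sub(overlap).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // final window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With a 10-word input, chunk_size 4, and overlap 2, this yields windows starting every 2 words: words 1-4, 3-6, 5-8, 7-10.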

Vector Storage (HNSW Index)

Embedded chunks are stored in an HNSW (Hierarchical Navigable Small World) index — a data structure optimized for fast approximate nearest-neighbor search. kiro-cli uses the hnsw_rs crate:

// crates/semantic-search-client/src/index/vector_index.rs (simplified)
pub struct VectorIndex {
    index: RwLock<Hnsw<'static, f32, DistCosine>>,
    count: AtomicUsize,
}

The index supports insert(vector, id) and search(query_vector, top_k). It uses cosine distance — two vectors pointing in the same direction (similar meaning) have a distance near 0.

Similarity Search (Cosine Distance)

Cosine similarity measures the angle between two vectors, ignoring magnitude; cosine distance is simply 1 − similarity. It's the standard metric for sentence embeddings. In distance terms:

  • 0.0 = identical meaning
  • ~1.0 = unrelated
  • Values in between = degrees of relatedness (the theoretical maximum is 2.0, for vectors pointing in opposite directions)

The HNSW index returns results sorted by ascending cosine distance — closest matches first.
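HNSW approximates a ranking that is easy to state exactly. A brute-force version of the same search makes a useful mental model (search_exact is a hypothetical helper for illustration, not the hnsw_rs API):

```rust
/// Cosine distance: 1.0 - cosine similarity. 0.0 = same direction.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

/// Exact top-k search: the ranking HNSW approximates. Returns
/// (id, distance) pairs sorted by ascending distance, closest first.
fn search_exact(
    query: &[f32],
    vectors: &[(usize, Vec<f32>)],
    top_k: usize,
) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .map(|(id, v)| (*id, cosine_distance(query, v)))
        .collect();
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(top_k);
    scored
}
```

This is O(n) per query; HNSW's graph structure brings that down to roughly logarithmic at the cost of occasionally missing a true nearest neighbor.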


How It Works: End-to-End Flow

sequenceDiagram
    participant User
    participant Agent
    participant Embedder
    participant VectorDB
    participant LLM

    User->>Agent: "How does token refresh work?"
    Agent->>Embedder: embed(query)
    Embedder-->>Agent: query_vector [384 floats]
    Agent->>VectorDB: search(query_vector, top_k=5)
    VectorDB-->>Agent: top-k chunk results
    Agent->>LLM: prompt + retrieved chunks
    LLM-->>User: grounded answer

In words:

  1. The user's question reaches the agent
  2. The agent embeds the query using CandleTextEmbedder
  3. The resulting vector is searched against the HNSW index
  4. The top-k most similar document chunks are returned with their metadata (file path, text)
  5. The agent injects those chunks into the LLM prompt as context, producing a grounded answer
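Step 5 amounts to prompt assembly. A hedged sketch, where the SearchHit shape and the prompt wording are illustrative rather than kiro-cli's actual format:

```rust
/// One retrieved chunk with its source metadata (illustrative shape).
struct SearchHit {
    file_path: String,
    text: String,
}

/// Build a grounded prompt: retrieved chunks first, then the question.
fn build_rag_prompt(question: &str, hits: &[SearchHit]) -> String {
    let mut prompt = String::from("Answer using only the context below.\n\n");
    for (i, hit) in hits.iter().enumerate() {
        // Cite the source file so the answer can point back to it.
        prompt.push_str(&format!("[{}] {}\n{}\n\n", i + 1, hit.file_path, hit.text));
    }
    prompt.push_str(&format!("Question: {question}\n"));
    prompt
}
```

Because the chunks carry file paths, the LLM can cite which document its answer came from.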

Internal Implementation

The Embedding Trait

All embedding backends implement a common trait, making the system pluggable:

// crates/semantic-search-client/src/embedding/trait_def.rs
pub trait TextEmbedderTrait: Send + Sync {
    fn embed(&self, text: &str) -> Result<Vec<f32>>;
    fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>>;
}
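Pluggability means a new backend only needs these two methods. A toy backend that hashes words into buckets is enough to exercise indexing and search code without loading a model (the boxed-error Result alias and the defaulted embed_batch are simplifications for this sketch, not the real trait definition):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Simplified error type; the real crate has its own Result alias.
type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

trait TextEmbedderTrait: Send + Sync {
    fn embed(&self, text: &str) -> Result<Vec<f32>>;
    // Default body for brevity; the real trait requires an implementation.
    fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>> {
        texts.iter().map(|t| self.embed(t)).collect()
    }
}

/// Deterministic stub: counts words into hash buckets. No semantics,
/// but cheap and reproducible for tests.
struct HashEmbedder {
    dims: usize,
}

impl TextEmbedderTrait for HashEmbedder {
    fn embed(&self, text: &str) -> Result<Vec<f32>> {
        let mut v = vec![0.0f32; self.dims];
        for word in text.split_whitespace() {
            let mut h = DefaultHasher::new();
            word.hash(&mut h);
            v[(h.finish() as usize) % self.dims] += 1.0;
        }
        Ok(v)
    }
}
```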

kiro-cli ships two implementations:

| Type | Backend                   | Use Case                              |
|------|---------------------------|---------------------------------------|
| Best | Candle (all-MiniLM-L6-v2) | High-quality semantic similarity      |
| Fast | BM25 (keyword scoring)    | Quick lexical search, no model needed |

Embedding with Candle

The CandleTextEmbedder loads the model once and reuses it for all queries:

// crates/semantic-search-client/src/embedding/candle.rs (simplified flow)
impl CandleTextEmbedder {
    pub fn embed(&self, text: &str) -> Result<Vec<f32>> {
        // 1. Tokenize text
        // 2. Convert tokens → tensors
        // 3. Run BERT forward pass
        // 4. Mean-pool over token dimension
        // 5. L2-normalize the result
        // → 384-dimensional unit vector
    }
}
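Steps 4 and 5 are plain vector math. A sketch over a tokens-by-hidden matrix of plain Vec<f32> rows (the real code operates on Candle tensors, and the function name here is an assumption):

```rust
/// Mean-pool per-token embeddings into one sentence vector, then
/// L2-normalize so cosine similarity reduces to a dot product.
/// Assumes at least one token row.
fn pool_and_normalize(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let rows = token_embeddings.len();
    let dims = token_embeddings[0].len();
    let mut pooled = vec![0.0f32; dims];
    for token in token_embeddings {
        for (p, x) in pooled.iter_mut().zip(token) {
            *p += x / rows as f32; // running mean over the token dimension
        }
    }
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    pooled.iter().map(|x| x / norm).collect()
}
```

Normalization is what makes the "unit vector" claim in step 5 true: after it, every embedding has length 1.0.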

Batch embedding uses Rayon for parallel processing — multiple chunks are embedded concurrently across CPU cores.
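The fan-out shape looks like this. With Rayon it is essentially texts.par_iter().map(...).collect(); the sketch below uses only std scoped threads, spawning one thread per chunk for simplicity (Rayon would instead schedule work across a fixed pool):

```rust
use std::thread;

/// Embed chunks in parallel, preserving input order.
/// A std-only stand-in for Rayon's par_iter().map().collect().
fn embed_batch_parallel(texts: &[String], embed: fn(&str) -> Vec<f32>) -> Vec<Vec<f32>> {
    thread::scope(|s| {
        // Spawn one worker per chunk (a real pool would cap this).
        let handles: Vec<_> = texts
            .iter()
            .map(|t| s.spawn(move || embed(t)))
            .collect();
        // Join in spawn order so results line up with inputs.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```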

Knowledge Contexts

A knowledge context is a named, searchable collection of embedded chunks. The SemanticSearchClient manages them:

// crates/semantic-search-client/src/client/implementation.rs (simplified)
pub struct SemanticSearchClient {
    base_dir: PathBuf,
    volatile_contexts: ContextMap,           // in-memory only
    persistent_contexts: HashMap<ContextId, KnowledgeContext>,  // saved to disk
    embedder: Box<dyn TextEmbedderTrait>,
    config: SemanticSearchConfig,
}

Contexts can be persistent (survive restarts, stored as JSON + HNSW files on disk) or volatile (in-memory, discarded on exit). Each context tracks its source path, include/exclude patterns, and embedding type.

Indexing Pipeline

When you add a directory as a knowledge context:

  1. File discovery — Walk the directory, filter by patterns, classify file types
  2. Processing — Read each file, detect type (Markdown, code, text, etc.)
  3. Chunking — Split content into overlapping word-windows (default: 512 words, 128 overlap)
  4. Embedding — Batch-embed all chunks through Candle
  5. Indexing — Insert each (vector, id) pair into the HNSW index
  6. Persistence — Save data points as JSON, HNSW graph to disk

The AsyncSemanticSearchClient runs this pipeline in a background worker with progress tracking and cancellation support.

Hybrid Search: BM25 + Vectors

For the Fast embedding type, kiro-cli uses a BM25 index — a classic keyword-scoring algorithm that doesn't need a neural model:

// crates/semantic-search-client/src/index/bm25_index.rs (simplified)
pub struct BM25Index {
    engine: RwLock<SearchEngine<usize>>,
    // ...
}

BM25 excels at exact keyword matches ("ProvisionedThroughputExceededException") while vector search excels at semantic matches ("DynamoDB throttling"). The ContextManager can maintain both index types for the same context, giving the agent the best of both worlds.
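BM25 itself is a short formula over term frequencies. A self-contained scoring sketch using the common k1 = 1.2, b = 0.75 parameters (this illustrates the algorithm, not the internals of the search-engine crate kiro-cli wraps):

```rust
/// Score each document against a query with BM25.
/// idf(t) = ln(1 + (N - df + 0.5) / (df + 0.5)); term frequency is
/// saturated, so repeating a keyword has diminishing returns, and
/// longer documents are penalized relative to the average length.
fn bm25_scores(query: &[&str], docs: &[Vec<&str>]) -> Vec<f32> {
    let (k1, b) = (1.2f32, 0.75f32);
    let n = docs.len() as f32;
    let avgdl = docs.iter().map(|d| d.len() as f32).sum::<f32>() / n;
    let mut scores = vec![0.0f32; docs.len()];
    for term in query {
        let df = docs.iter().filter(|d| d.contains(term)).count() as f32;
        let idf = (1.0 + (n - df + 0.5) / (df + 0.5)).ln();
        for (i, doc) in docs.iter().enumerate() {
            let tf = doc.iter().filter(|w| w == &term).count() as f32;
            let dl = doc.len() as f32;
            scores[i] += idf * tf * (k1 + 1.0)
                / (tf + k1 * (1.0 - b + b * dl / avgdl));
        }
    }
    scores
}
```

A document containing none of the query terms scores exactly zero, which is why BM25 alone misses semantic paraphrases that vector search catches.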

Configuration

Defaults live in SemanticSearchConfig:

// crates/semantic-search-client/src/config.rs
impl Default for SemanticSearchConfig {
    fn default() -> Self {
        Self {
            chunk_size: 512,
            chunk_overlap: 128,
            default_results: 5,
            model_name: "all-MiniLM-L6-v2".to_string(),
            max_files: 10000,
            embedding_type: EmbeddingType::Best,  // Candle by default
            // ...
        }
    }
}

The config is stored at ~/.semantic_search/semantic_search_config.json and loaded once at startup via a thread-safe OnceLock.
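The load-once pattern with OnceLock looks roughly like this (field set trimmed and the JSON file read omitted; an illustrative sketch, not the actual config.rs):

```rust
use std::sync::OnceLock;

#[derive(Debug)]
struct SemanticSearchConfig {
    chunk_size: usize,
    chunk_overlap: usize,
    default_results: usize,
}

// OnceLock gives thread-safe, lazy, write-once initialization.
static CONFIG: OnceLock<SemanticSearchConfig> = OnceLock::new();

/// Return the global config, initializing it exactly once.
/// The real implementation reads semantic_search_config.json here
/// and falls back to defaults if the file is missing.
fn get_config() -> &'static SemanticSearchConfig {
    CONFIG.get_or_init(|| SemanticSearchConfig {
        chunk_size: 512,
        chunk_overlap: 128,
        default_results: 5,
    })
}
```

Every subsequent call returns a reference to the same instance, so the JSON file is parsed at most once per process.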

Model Management

Models are stored locally under ~/.semantic_search/models/all-MiniLM-L6-v2/. The model downloader fetches model.safetensors and tokenizer.json on first use. SHA verification ensures integrity:

~/.semantic_search/
├── models/
│   └── all-MiniLM-L6-v2/
│       ├── model.safetensors   (~23MB)
│       └── tokenizer.json
├── contexts.json               (metadata for all persistent contexts)
└── semantic_search_config.json

Key Files at a Glance

| File | Purpose |
|------|---------|
| embedding/candle.rs | BERT inference via Candle — tokenize, forward pass, pool, normalize |
| embedding/trait_def.rs | TextEmbedderTrait — common interface for all embedding backends |
| embedding/candle_models.rs | Model configs (MiniLM-L6-v2, MiniLM-L12-v2) and BERT hyperparameters |
| index/vector_index.rs | HNSW index — insert, search, save/load to disk |
| index/bm25_index.rs | BM25 keyword index — lexical search fallback |
| processing/text_chunker.rs | Sliding-window text chunking with configurable size and overlap |
| processing/file_processor.rs | File type detection and content extraction |
| client/implementation.rs | SemanticSearchClient — context CRUD, search orchestration |
| client/context/semantic_context.rs | Per-context data points + vector index |
| client/context/context_manager.rs | Manages volatile + persistent + BM25 contexts |
| config.rs | Global config with chunk size, model name, defaults |
| types.rs | Core types: DataPoint, SearchResult, KnowledgeContext, EmbeddingType |

Conclusion

Over the course of these ten chapters, we've traced the full path of a user interaction through kiro-cli — from the first keypress to a grounded, knowledge-aware response:

| Chapter | What We Learned |
|---------|-----------------|
| 1. TUI | How the terminal interface captures input and renders output |
| 2. Twinki | The message bus that connects UI, agent, and tools |
| 3. ACP | The Agent Communication Protocol that bridges CLI and LLM |
| 4. Session Manager | How sessions are created, persisted, and resumed |
| 5. Agent Configuration | Loading system prompts, tools, and MCP servers |
| 6. Agent Loop | The turn-by-turn orchestration between user, model, and tools |
| 7. Tool System | How tool calls are dispatched, executed, and returned |
| 8. MCP Integration | Connecting external tool servers via the Model Context Protocol |
| 9. Code Intelligence | AST-based code understanding — symbols, definitions, references |
| 10. Semantic Search | Local vector embeddings for meaning-based retrieval and RAG |

Each layer builds on the ones below it. The TUI captures your question. The session manager loads your history. The agent loop orchestrates the conversation. The tool system executes actions. And semantic search grounds the LLM's answers in your actual knowledge — all running locally, all in Rust.

For the full table of contents, see the index.


Thank you for following along. Whether you're contributing to kiro-cli, building your own agent framework, or just curious about how the pieces fit together — we hope this walkthrough gave you a clear map of the territory. Happy building.