---
id: "semantic-search"
date: "2026-02-01"
title: "Search by meaning, not just keywords"
summary: "Daneel crawls and indexes entire websites, then lets you search them with natural language — matching concepts, not strings, with results in under 5 milliseconds."
image: "/medias/semantic.search.png"
header: "Feature"
tags: ["feature", "search", "rag", "indexing", "privacy"]
---

## Why keyword search falls short

You search a documentation site for "how to handle errors" and get zero results because the docs say "exception management." You look for "deploy to production" and miss the page titled "release workflow." Keyword search only finds what you already know how to phrase.

Semantic search works differently. It understands that "handle errors" and "exception management" mean the same thing. It matches intent, not strings.

## How Daneel indexes a site

Open Daneel on any page, switch to Site mode, and hit Index. Here is what happens:

1. **Sitemap discovery** — Daneel looks for the site's `sitemap.xml` and follows nested sitemaps up to three levels deep. If there is no sitemap, it indexes the current page.
2. **Content extraction** — Each page is converted to clean Markdown using a three-strategy pipeline: Readability, CSS-aware Turndown, or plain-text fallback.
3. **Chunking** — The text is split into overlapping segments along natural boundaries (paragraphs, then sentences), preserving context at the edges.
4. **Embedding** — Each chunk is transformed into a 384-dimensional vector by an embedding model running on your GPU via WebGPU. No cloud calls.
5. **Storage** — Embeddings are persisted in IndexedDB, partitioned by domain. They survive browser restarts.
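
As a rough illustration of step 1, here is a minimal sketch of sitemap traversal with a depth limit. The names `parseSitemap`, `discoverPages`, and `fetchText` are hypothetical and not taken from Daneel's code. The regex-based `<loc>` extraction is a simplification (a browser implementation would more likely use `DOMParser`), and counting the root sitemap as level 1 is an assumption about what "three levels deep" means.

```typescript
// Hypothetical sketch, not Daneel's actual implementation.
type SitemapEntries = { sitemaps: string[]; pages: string[] };

// Pull every <loc> value out of a sitemap document. A <sitemapindex>
// lists nested sitemaps; a <urlset> lists page URLs.
function parseSitemap(xml: string): SitemapEntries {
  const re = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const locs: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) locs.push(m[1]);
  const isIndex = /<sitemapindex[\s>]/.test(xml);
  return isIndex ? { sitemaps: locs, pages: [] } : { sitemaps: [], pages: locs };
}

// Walk nested sitemaps level by level, stopping after maxDepth levels.
// fetchText is injected by the caller and returns null when a document
// is missing or unreadable.
async function discoverPages(
  rootSitemapUrl: string,
  currentPageUrl: string,
  fetchText: (url: string) => Promise<string | null>,
  maxDepth = 3, // assumption: the root sitemap counts as level 1
): Promise<string[]> {
  const pages = new Set<string>();
  const seen = new Set<string>();
  let frontier = [rootSitemapUrl];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (seen.has(url)) continue; // skip sitemaps already visited
      seen.add(url);
      const xml = await fetchText(url);
      if (xml === null) continue;
      const entries = parseSitemap(xml);
      entries.pages.forEach((p) => pages.add(p));
      next.push(...entries.sitemaps);
    }
    frontier = next;
  }
  // No sitemap (or an empty one): fall back to the current page.
  return pages.size > 0 ? Array.from(pages) : [currentPageUrl];
}
```

Injecting `fetchText` keeps the traversal logic independent of how documents are actually downloaded, which also makes it easy to exercise with canned XML.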

The entire pipeline runs in your browser. Your browsing data never leaves your machine.
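
The chunking step can be sketched along these lines. This is an illustrative approximation, not Daneel's actual chunker: the function name `chunkText`, the character budget, and the one-unit overlap are all assumptions. It splits on paragraphs first, falls back to a naive sentence split for oversized paragraphs, and repeats the tail of each chunk at the start of the next.

```typescript
// Hypothetical sketch: split text into overlapping chunks of roughly
// maxChars characters, preferring paragraph and then sentence boundaries.
function chunkText(text: string, maxChars = 800, overlapUnits = 1): string[] {
  // 1. Build units: whole paragraphs, or sentences for long paragraphs.
  const units: string[] = [];
  for (const para of text.split(/\n\s*\n/)) {
    const p = para.trim();
    if (p.length === 0) continue;
    if (p.length <= maxChars) {
      units.push(p);
    } else {
      // Naive sentence split; abbreviations like "e.g." will be cut too.
      const sentences = p.match(/[^.!?]+(?:[.!?]+|$)/g) ?? [p];
      for (const s of sentences) {
        const trimmed = s.trim();
        if (trimmed.length > 0) units.push(trimmed);
      }
    }
  }

  // 2. Pack units into chunks, carrying the last unit(s) over as overlap.
  const chunks: string[] = [];
  let current: string[] = [];
  let size = 0;
  for (const unit of units) {
    if (current.length > 0 && size + unit.length > maxChars) {
      chunks.push(current.join("\n"));
      current = overlapUnits > 0 ? current.slice(-overlapUnits) : [];
      size = current.reduce((n, u) => n + u.length, 0);
    }
    current.push(unit);
    size += unit.length;
  }
  if (current.length > 0) chunks.push(current.join("\n"));
  return chunks;
}
```

The size limit is approximate: separators are not counted, and a chunk can exceed `maxChars` when the carried-over overlap plus a new unit is larger than the budget. A production chunker would likely budget in tokens rather than characters.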

## How search works

When you type a question, Daneel embeds your query with the same model used for indexing and runs a cosine-similarity search across all stored chunks. On modern hardware with WebGPU, the search takes under 5 milliseconds, even across tens of thousands of chunks.
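
To make the idea concrete, here is a minimal brute-force sketch of cosine-similarity ranking. It is a plain CPU loop for illustration only and does not use WebGPU. The `StoredChunk` shape and the function names are assumptions, not Daneel's actual data model.

```typescript
// Hypothetical shape of an indexed chunk.
type StoredChunk = { url: string; text: string; embedding: Float32Array };

// Cosine similarity: dot product divided by the product of the norms.
// Returns 0 if either vector has zero length.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query and keep the k best matches.
function topK(
  query: Float32Array,
  chunks: StoredChunk[],
  k = 5,
): { chunk: StoredChunk; score: number }[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

If embeddings are normalized to unit length at indexing time, the similarity reduces to a dot product, which is one common way to keep this kind of scan fast.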

Results are ranked by semantic similarity, with a keyword boost on top: if your exact terms appear in a chunk, that chunk gets a small score bonus. This hybrid approach catches both conceptual matches and precise terminology.
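
One way such a boost could work is sketched below. The bonus values (`perTerm`, `cap`), the three-character minimum term length, and the function names are all hypothetical; the source only says the bonus is small. Matching is case-insensitive substring matching on ASCII terms, which is a simplification.

```typescript
// Hypothetical sketch: small additive bonus for each query term that
// appears verbatim in the chunk text, capped so semantics still dominate.
function keywordBonus(query: string, text: string, perTerm = 0.02, cap = 0.1): number {
  const haystack = text.toLowerCase();
  // Unique lowercase ASCII terms of at least 3 characters.
  const terms = new Set(
    query.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length >= 3),
  );
  let hits = 0;
  terms.forEach((t) => {
    if (haystack.indexOf(t) !== -1) hits++;
  });
  return Math.min(cap, hits * perTerm);
}

// Final score = semantic similarity + keyword bonus.
function hybridScore(semanticScore: number, query: string, text: string): number {
  return semanticScore + keywordBonus(query, text);
}
```

With these assumed values, a chunk containing both terms of a two-word query gains 0.04, enough to break ties between near-equal semantic matches without overriding a clearly better one.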

The top results, with their source URLs, are assembled into context for the LLM, which answers your question grounded in the actual site content.
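
The assembly step might look roughly like this. The `RankedChunk` shape, the character budget, the separator, and the prompt wording are illustrative assumptions rather than Daneel's actual prompt format.

```typescript
// Hypothetical shape of a ranked search result.
type RankedChunk = { url: string; text: string; score: number };

// Concatenate the best chunks, each labeled with its source URL,
// stopping before the character budget would be exceeded.
function buildContext(results: RankedChunk[], maxChars = 6000): string {
  const parts: string[] = [];
  let used = 0;
  for (const r of results) {
    const block = `Source: ${r.url}\n${r.text}`;
    if (used + block.length > maxChars) break;
    parts.push(block);
    used += block.length;
  }
  return parts.join("\n\n---\n\n");
}

// Wrap the context and the user's question into a single prompt string.
function buildPrompt(question: string, context: string): string {
  return [
    "Answer the question using only the site content below.",
    "Cite the source URL for each fact you use.",
    "",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Keeping the source URL next to each chunk is what lets the model point back to the page an answer came from.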

## What is coming next

Sitemaps cover most well-structured sites, but they do not always list every page. We are building a true crawler that follows links breadth-first, discovering pages the sitemap missed. Stay tuned.

---

[Read on site](https://daneel.injen.io/news/semantic-search.html?utm_source=extension_news_reader&utm_medium=extension_settings&utm_campaign=extension)
