---
id: "web-crawl-link-discovery"
date: "2026-04-16"
title: "Index any site, even without a sitemap"
summary: "Daneel can now discover pages by following links, so sites without a sitemap.xml are no longer invisible to site search."
image: "/medias/spider.simple.png"
header: "Feature"
tags: ["feature", "site-rag", "crawler"]
---

## The sitemap gap

Until now, indexing a site in Daneel required a `sitemap.xml`. The extension would discover it, parse every URL, then fetch and embed each page. Fast, reliable, and complete, as long as a sitemap exists.

The problem is that many sites don't publish one. Documentation portals behind custom generators, small business sites, internal wikis, personal blogs. For those sites, clicking "Crawl" did nothing useful: no sitemap found, no pages indexed, no search results.

That gap is now closed.

## Web Crawl: follow the links

Daneel ships a second crawling strategy called **Web Crawl**. Instead of relying on a sitemap, it starts from the page you're on and discovers content by following every link it finds in the HTML, breadth-first.
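
To make the strategy concrete, here's a minimal breadth-first sketch in TypeScript. It's illustrative only: the function name, the limits, and the fixed delay are assumptions rather than Daneel's actual code, and it assumes a DOM-capable context for `DOMParser`.

```ts
// Minimal breadth-first link discovery (illustrative sketch, not Daneel's code).
async function discoverPages(
  startUrl: string,
  maxPages = 100,
  maxDepth = 3,
): Promise<string[]> {
  const origin = new URL(startUrl).origin;
  const seen = new Set<string>([startUrl]);
  const queue: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];
  const pages: string[] = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const { url, depth } = queue.shift()!; // FIFO queue gives breadth-first order
    const res = await fetch(url);
    if (!res.ok) continue;
    pages.push(url);
    if (depth >= maxDepth) continue;

    const doc = new DOMParser().parseFromString(await res.text(), "text/html");
    for (const a of doc.querySelectorAll("a[href]")) {
      let next: URL;
      try {
        next = new URL(a.getAttribute("href")!, url); // resolve relative links
      } catch {
        continue; // skip unparseable hrefs
      }
      next.hash = ""; // fragments point back to the same document
      if (next.origin === origin && !seen.has(next.href)) {
        seen.add(next.href);
        queue.push({ url: next.href, depth: depth + 1 });
      }
    }
    await new Promise((r) => setTimeout(r, 500)); // simple politeness delay
  }
  return pages;
}
```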

The crawler respects the same limits you already know (max pages and max depth) and adds a few of its own, sketched in code after the list:

- **Same-origin only** by default, so it won't wander off to external sites
- **Query parameter normalization** collapses pagination and session URLs into a single entry, preventing calendar or archive sinkholes
- **Path depth cap** skips URLs with deeply nested segments
- **Queue size limit** bounds memory even on sites that generate thousands of internal links
- **Politeness delay** between requests, with exponential backoff on server errors
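
Here's what those rules might look like in practice. Every threshold and parameter name below is an assumption chosen for illustration, not Daneel's actual configuration.

```ts
// Assumed values for illustration only.
const STRIPPED_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "sessionid", "page"];
const MAX_PATH_SEGMENTS = 8;
const MAX_QUEUE_SIZE = 5000;

// Query parameter normalization: drop volatile params and sort the rest, so
// session and pagination variants of one page collapse into a single entry.
function normalizeUrl(raw: string, base: string): string | null {
  let url: URL;
  try {
    url = new URL(raw, base);
  } catch {
    return null;
  }
  url.hash = "";
  for (const p of STRIPPED_PARAMS) url.searchParams.delete(p);
  url.searchParams.sort();
  return url.href;
}

// Scope checks applied before a URL joins the queue.
function shouldEnqueue(href: string, origin: string, queueSize: number): boolean {
  const url = new URL(href);
  if (url.origin !== origin) return false; // same-origin only
  const depth = url.pathname.split("/").filter(Boolean).length;
  if (depth > MAX_PATH_SEGMENTS) return false; // path depth cap
  if (queueSize >= MAX_QUEUE_SIZE) return false; // bound queue memory
  return true;
}

// Politeness: retry 5xx responses with exponentially growing waits.
async function politeFetch(url: string, retries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status < 500 || attempt >= retries) return res;
    await new Promise((r) => setTimeout(r, 500 * 2 ** attempt));
  }
}
```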

Everything downstream is unchanged. Pages flow through the same Readability extraction, the same chunking and embedding pipeline, and land in the same IndexedDB store. Search works identically whether the pages came from a sitemap or a crawl.
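
For a sense of how a crawled page joins that pipeline, here's a hedged sketch. `embed`, `store`, and `chunkText` are invented stand-ins for Daneel's internals; only the `Readability` call mirrors a real API (`@mozilla/readability`).

```ts
import { Readability } from "@mozilla/readability";

// Hypothetical stand-ins for the shared pipeline pieces; names are invented.
declare function embed(text: string): Promise<number[]>;
declare const store: {
  put(record: { url: string; text: string; vector: number[] }): Promise<void>;
};

// Naive fixed-size chunking, standing in for the real chunker.
function chunkText(text: string, size = 1000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) chunks.push(text.slice(i, i + size));
  return chunks;
}

// A crawled page enters the same extract -> chunk -> embed -> store flow.
async function indexPage(url: string, html: string): Promise<void> {
  const doc = new DOMParser().parseFromString(html, "text/html");
  const article = new Readability(doc).parse(); // main-content extraction
  if (!article?.textContent) return;
  for (const text of chunkText(article.textContent)) {
    await store.put({ url, text, vector: await embed(text) });
  }
}
```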

## Choosing a discovery method

When you open the Site panel, Daneel checks for sitemaps automatically. If it finds one, **Sitemap** is pre-selected. If not, the UI switches to **Web Crawl** and you're ready to go.
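
The detection step might look something like this: check `robots.txt` for a `Sitemap:` directive, then fall back to the conventional location. This is an assumed flow, not a description of Daneel's exact checks.

```ts
// Assumed detection flow: robots.txt first, then the conventional fallback.
async function findSitemap(origin: string): Promise<string | null> {
  const robots = await fetch(`${origin}/robots.txt`);
  if (robots.ok) {
    const match = (await robots.text()).match(/^sitemap:\s*(\S+)/im);
    if (match) return match[1]; // robots.txt may declare a sitemap URL
  }
  const fallback = `${origin}/sitemap.xml`;
  const head = await fetch(fallback, { method: "HEAD" });
  return head.ok ? fallback : null; // null switches the UI to Web Crawl
}
```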

You can also switch manually. Some sites publish a sitemap that only covers part of their content. In that case, picking Web Crawl and letting the crawler explore on its own may surface pages the sitemap missed.

## Stay in scope with path prefix

Web Crawl includes a **path prefix** filter that keeps the crawler focused. If you're browsing `/docs/getting-started`, Daneel infers `/docs` as the prefix and pre-fills it. The crawler will only follow links under that path.

You can edit the prefix or clear it entirely. It's a suggestion, not a wall.
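
One plausible reading of the inference rule, as a sketch: keep the first path segment when the page sits below it. The real heuristic may differ.

```ts
// Illustrative prefix inference; Daneel's actual heuristic may be smarter.
function inferPathPrefix(pathname: string): string | null {
  const segments = pathname.split("/").filter(Boolean);
  return segments.length > 1 ? `/${segments[0]}` : null;
}

inferPathPrefix("/docs/getting-started"); // "/docs"
inferPathPrefix("/pricing");              // null: nothing to scope to
```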

## Background tasks, as always

Web crawls run as background tasks, the same system that handles sitemap crawls and vault indexing. Close the panel, navigate away, even let the service worker sleep. The task survives, checkpoints its progress, and picks up where it left off.
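
A checkpoint can be as simple as persisting the crawl frontier after each page, so a restarted service worker can rebuild its state. The storage key and record shape below are assumptions made for the sketch.

```ts
// Assumed checkpoint shape and storage key, illustrating resumable tasks
// in an MV3 service worker.
interface CrawlCheckpoint {
  queue: Array<{ url: string; depth: number }>; // frontier still to visit
  seen: string[];                               // URLs already discovered
  pagesIndexed: number;
}

async function saveCheckpoint(taskId: string, cp: CrawlCheckpoint): Promise<void> {
  await chrome.storage.local.set({ [`crawl:${taskId}`]: cp });
}

async function loadCheckpoint(taskId: string): Promise<CrawlCheckpoint | null> {
  const items = await chrome.storage.local.get(`crawl:${taskId}`);
  return (items[`crawl:${taskId}`] as CrawlCheckpoint) ?? null;
}
```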

Progress shows in the Site panel while a crawl is running, and the full history lives in **Settings > Tasks**.

## What's next

The crawler is intentionally conservative in this first release: sequential requests, static HTML only, same-origin scope. Future iterations may add concurrent fetching, JavaScript-rendered page support, and cross-subdomain crawling as the use cases become clearer.

---

[Read on site](https://daneel.injen.io/news/web-crawl-link-discovery.html?utm_source=extension_news_reader&utm_medium=extension_settings&utm_campaign=extension)
