ADR 0021: Web Crawler and HTML-to-Markdown Stack for Genesis

Status: Accepted; Date: 2026-05-08; Domain: information; Level: system; Authors: Genesis

Tags: web-crawling, html-to-markdown, information-extraction, infrastructure

Status

Accepted.

Context

Genesis uses web content for research and knowledge ingestion. The OpenClaw built-in web_fetch handles single-URL fetching. Three gaps remain: no standalone HTML-to-markdown CLI, no crawler with configurable breadth and depth, and no JS rendering for SPAs. Four candidates were evaluated: Kreuzberg html-to-markdown, kreuzcrawl, lychee, and Spider.

D1: Use Kreuzberg html-to-markdown for local conversion

Kreuzberg's html-to-markdown library is Rust-powered and CommonMark-compliant. It is reported to deliver 150-280 MB/s throughput and offers 12 language bindings. The convert() call returns structured output: content, metadata, tables, images, and warnings. HTML sanitization is handled by the ammonia crate. A standalone CLI is available via cargo install html-to-markdown-cli. Use this library for local HTML files and for programmatic conversion in code.
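As a rough illustration of how Genesis might consume the five output parts listed above, the sketch below uses a hypothetical ConversionResult class. Only the field names come from this ADR; the class itself, the field types, and the to_ingest_record helper are assumptions for illustration and are not the library's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ConversionResult:
    """Hypothetical stand-in mirroring the five output parts named in D1."""
    content: str
    metadata: dict = field(default_factory=dict)
    tables: list = field(default_factory=list)
    images: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def to_ingest_record(result, source):
    """Turn a conversion result into a record for knowledge ingestion.

    Returns None when the conversion produced no usable text, so empty
    pages are skipped instead of stored.
    """
    if not result.content.strip():
        return None
    return {
        "source": source,
        # Fall back to the source URL when no title was extracted.
        "title": result.metadata.get("title", source),
        "markdown": result.content,
        "table_count": len(result.tables),
        "image_count": len(result.images),
        # Keep warnings with the record so lossy conversions stay auditable.
        "warnings": list(result.warnings),
    }


# Example with hand-built data; a real result would come from the converter.
result = ConversionResult(
    content="# Title\n\nBody text.",
    metadata={"title": "Title"},
    warnings=["dropped <script> element"],
)
record = to_ingest_record(result, "https://example.org/page")
```

Carrying the warnings into the stored record is a design choice, not a requirement: it lets a later review step find pages where the conversion dropped content.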

D2: Use kreuzcrawl as the primary web crawler

Kreuzcrawl is the crawling companion to Kreuzberg. Its Rust core uses reqwest for HTTP, html5ever and lol_html for HTML parsing, html-to-markdown-rs for Markdown conversion, and readability for content extraction. Link discovery covers robots.txt, sitemaps, and anchor analysis. Optional headless Chrome or Firefox handles JS-heavy pages. The crawler exposes explicit breadth and depth limits.
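To make the breadth and depth limits concrete, here is a minimal sketch of a bounded breadth-first crawl that also honors robots.txt. It is not kreuzcrawl's implementation or API: the crawl function, its max_depth and max_pages parameters, and the in-memory SITE map (which stands in for real HTTP fetches) are all assumptions for illustration. Only the Python standard library is used.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory site: URL -> hrefs found on that page.
SITE = {
    "https://example.org/": ["/docs/", "/blog/", "/private/notes"],
    "https://example.org/docs/": ["/docs/a", "/docs/b"],
    "https://example.org/docs/a": ["/docs/a/deep"],
    "https://example.org/docs/b": [],
    "https://example.org/docs/a/deep": [],
    "https://example.org/blog/": ["https://other.example/post"],
    "https://example.org/private/notes": [],
}

ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def crawl(start, max_depth, max_pages, fetch_links, robots):
    """Breadth-first crawl bounded by link depth and total page count."""
    host = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for href in fetch_links(url):
            target = urljoin(url, href)
            if target in seen:
                continue
            if urlparse(target).netloc != host:
                continue  # stay on the starting host
            if not robots.can_fetch("*", target):
                continue  # path is disallowed by robots.txt
            seen.add(target)
            queue.append((target, depth + 1))
    return visited


def fetch_links(url):
    """Stand-in for an HTTP fetch plus anchor extraction."""
    return SITE.get(url, [])


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

pages = crawl("https://example.org/", max_depth=1, max_pages=10,
              fetch_links=fetch_links, robots=robots)
print(pages)
```

With max_depth=1 the crawl returns the start page plus /docs/ and /blog/, and it never queues /private/notes. Raising max_depth widens the crawl, while max_pages caps the total number of pages regardless of depth.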

D3: Use lychee as a link checker in CI

Lychee is a fast, async link checker written in Rust. It finds broken URLs in Markdown, HTML, and websites. It is not a content extractor. Deploy it in CI pipelines to catch broken links in documentation. kreuzcrawl remains the content acquisition tool.

D4: Reject Spider because its Spider Cloud dependency limits self-hosted use

Spider is a fast Rust crawler on crates.io. It supports headless Chrome, WebDriver, and AI automation. It returns Markdown output via Python and Node bindings. However, its best features require a Spider Cloud API key with usage limits. For API-key-free autonomous crawling, kreuzcrawl is preferred.

Positive

Genesis gains a self-hosted, API-key-free crawling pipeline. The standalone CLI enables batch HTML-to-markdown conversion. The unified Rust stack aligns with the existing Kreuzberg document pipeline. Lychee in CI validates documentation links.

Negative

kreuzcrawl is not published on crates.io, so it must be installed from GitHub releases. Spider's dependency on Spider Cloud limits its self-hosted use.
