ADR 0018: OCR Stack for Image Content Extraction in Genesis Pipeline

Status: Accepted | Date: 2026-05-08 | Domain: information | Level: system | Authors: Genesis
Tags: ocr, image-extraction, kreuzberg, document-intelligence

Status

Accepted

Context

Genesis processes diverse document formats for RegenTribe community members. Images embedded in PDFs or uploaded as stand-alone files may contain text that standard parsing cannot extract. An OCR (Optical Character Recognition) pipeline is required.

Three candidate tools were evaluated: usls, ocrs, and Kreuzberg. Each targets a different niche: usls is a general vision/VLM library with OCR sub-models, ocrs is a dedicated Rust OCR engine, and Kreuzberg is a polyglot document intelligence framework with plugin-based OCR support.

D1: Adopt Kreuzberg as the primary OCR engine for image content extraction

Kreuzberg provides the most production-ready path for Genesis use cases.

Rationale:

  • Native OCR integration. Kreuzberg ships built-in Tesseract bindings (no external tesseract install required), plus optional EasyOCR and PaddleOCR backends. The Rust core handles format detection and extraction pipeline orchestration.
  • Document intelligence scope. Kreuzberg extracts text, tables, metadata, and images from 97+ formats. OCR is one capability within a larger document pipeline, which matches how Genesis uses extraction: ingesting community documents to populate the knowledge graph.
  • Polyglot bindings. Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, Elixir, Rust, C FFI. Genesis operates primarily in Python (semantic-graph pipeline). The Python SDK is first-class and well-documented.
  • Plugin architecture. OCR backend is swappable via configuration. Enables future migration to PaddleOCR or EasyOCR without code changes.
  • Existing Genesis integration. Kreuzberg is already installed in the semantic-graph virtual environment. It is the standard extraction tool for the Genesis pipeline. No new dependencies required.
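
The configuration-driven backend swap described above can be sketched as follows. This is a minimal illustration, not the Genesis implementation: `OcrSettings`, `extract_image_text`, and `kreuzberg_extract` are hypothetical names, and the Kreuzberg calls (`extract_file_sync`, `ExtractionConfig`, `OcrConfig`) are assumed to match the installed 4.x Python SDK and should be verified against its documentation.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Backend names as listed in this ADR; the exact identifiers Kreuzberg
# accepts are an assumption and should be checked against its docs.
KNOWN_BACKENDS = {"tesseract", "easyocr", "paddleocr"}


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical Genesis-side settings object (not part of Kreuzberg)."""

    backend: str = "tesseract"
    language: str = "eng"

    def __post_init__(self) -> None:
        # Reject typos early instead of failing deep inside the pipeline.
        if self.backend not in KNOWN_BACKENDS:
            raise ValueError(f"unknown OCR backend: {self.backend!r}")


Extractor = Callable[[Path, OcrSettings], str]


def kreuzberg_extract(path: Path, settings: OcrSettings) -> str:
    """Run one file through Kreuzberg and return the extracted text.

    Assumption: the installed Kreuzberg package exposes these names with
    these parameters. The import is local so that the rest of the module
    can be used (and tested) without Kreuzberg installed.
    """
    from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync

    config = ExtractionConfig(
        ocr=OcrConfig(backend=settings.backend, language=settings.language)
    )
    return extract_file_sync(str(path), config=config).content


def extract_image_text(
    path: "str | Path",
    settings: OcrSettings = OcrSettings(),
    extractor: Extractor = kreuzberg_extract,
) -> str:
    """Extract text from an image or document using the configured backend."""
    return extractor(Path(path), settings).strip()
```

Under these assumptions, moving from Tesseract to PaddleOCR would be a settings change (`OcrSettings(backend="paddleocr")`) rather than a change to calling code. Passing the extractor as a parameter also lets the pipeline be tested with a stub.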

D2: usls is not suited for standalone OCR text extraction

usls is a Rust library for vision and vision-language models powered by ONNX Runtime. It includes OCR sub-models (DB for text detection, SVTR for text recognition) as part of a broader model zoo covering object detection, segmentation, pose estimation, VLM, and embedding models.

Rationale:

  • Alpha-stage versioning. Latest available version is 0.2.0-alpha.3. Not production-ready.
  • Scope mismatch. usls targets computer vision researchers and practitioners who need SOTA model inference (YOLO, SAM, GroundingDINO). The OCR models are incidental components of a much larger model portfolio, not a dedicated OCR engine.
  • Integration cost. Requires ONNX Runtime management, model download/caching, and GPU/threading configuration. That is high operational overhead for the sake of a single OCR model from the usls ecosystem.
  • Rust-only. No Python bindings, no CLI tool for quick extraction. Requires Rust compilation and custom integration code.

D3: ocrs is a promising Rust-native OCR engine but not yet viable for Genesis pipeline

ocrs is a Rust library and CLI tool for OCR using PyTorch models exported to ONNX, executed via the RTen engine. It targets modern OCR with minimal preprocessing and open training data.

Rationale:

  • Early preview state. The README explicitly states "Expect more errors than commercial OCR engines." Behavior may change across versions.
  • Latin alphabet only. ocrs currently recognizes only the Latin alphabet. RegenTribe community documents may include text beyond basic Latin: Māori and other Pacific languages (Latin-based, but written with macrons and other diacritics), technical symbols, and possibly non-Latin scripts. Kreuzberg's Tesseract backend supports 100+ languages out of the box.
  • Language constraint. The requirement for multilingual document support disqualifies ocrs at this stage.
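
One way to sanity-check the language constraint is to profile a sample of already-digitised community text by script. The sketch below uses only the standard library. `script_profile` is a hypothetical helper, and the three buckets are a simplification chosen for illustration: Māori, for instance, is Latin-based but uses macrons, so it lands in the "extended Latin" bucket.

```python
import unicodedata
from collections import Counter


def script_profile(text: str) -> Counter:
    """Count alphabetic characters by coarse script bucket.

    Buckets:
      basic_latin    - ASCII letters (A-Z, a-z)
      extended_latin - Latin letters with diacritics, e.g. "ā"
      non_latin      - any other alphabetic character, e.g. CJK or Arabic
    """
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        if ch.isascii():
            counts["basic_latin"] += 1
        elif unicodedata.name(ch, "").startswith("LATIN"):
            counts["extended_latin"] += 1
        else:
            counts["non_latin"] += 1
    return counts


if __name__ == "__main__":
    for sample in ["Kia ora", "Māori", "日本語"]:
        print(sample, dict(script_profile(sample)))
```

A non-zero `extended_latin` or `non_latin` count in a representative sample would suggest that an engine limited to the basic Latin alphabet is likely to miss or misread part of the corpus.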

Consequences

Positive:

  • Kreuzberg provides a unified extraction pipeline covering documents, images, OCR, tables, and metadata: one tool, a consistent API, and minimal operational overhead.
  • The Python SDK integrates cleanly with the existing semantic-graph pipeline. No new runtime dependencies.
  • The Tesseract backend works without an external tesseract installation (native Rust binding). Backends are swappable via config.
  • Language coverage is broad: Tesseract supports 100+ languages, PaddleOCR supports Chinese, Japanese, Korean, Arabic, and more.

Negative:

  • Tesseract (the default backend) is typically less accurate on complex layouts than layout-aware OCR systems. For highly structured documents, PaddleOCR or VLM-based OCR (GPT-4o, Claude) may be needed as a future upgrade.
  • Kreuzberg is licensed under the Elastic License v2 (ELv2), which is source-available rather than fully open source, so license restrictions apply. This is acceptable for internal Genesis pipeline use.
  • The installed Kreuzberg Python package is version 4.4.3, shipped with pre-compiled binaries. Newer versions exist upstream (for example the 4.10.0-rc.15 release candidate), but the version available through pip may lag behind upstream releases.
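
To keep track of the version gap noted above, a small check can compare the installed package against a known upstream version. This is a sketch using only the standard library. `release_tuple` and `installed_release` are hypothetical helpers, and pre-release suffixes such as `-rc.15` are deliberately ignored, so a release candidate compares equal to its final release.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

Release = Tuple[int, int, int]


def release_tuple(version: str) -> Release:
    """Parse the leading MAJOR.MINOR.PATCH of a version string.

    Anything after the patch number (e.g. "-rc.15" or "rc15") is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def installed_release(package: str = "kreuzberg") -> Optional[Release]:
    """Return the installed release of a package, or None if not installed."""
    try:
        return release_tuple(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_release()
    upstream = release_tuple("4.10.0-rc.15")  # version cited in this ADR
    if current is None:
        print("kreuzberg is not installed in this environment")
    elif current < upstream:
        print(f"installed {current} is behind upstream {upstream}")
    else:
        print(f"installed {current} is up to date with {upstream}")
```

Comparing integer tuples instead of raw strings matters here: as plain strings, "4.4.3" sorts after "4.10.0-rc.15", which would report the older version as the newer one.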

References