ADR 0018: OCR Stack for Image Content Extraction in Genesis Pipeline

Status: Accepted
Date: 2026-05-08
Domain: information
Level: system
Authors: Genesis

Tags: ocr, image-extraction, kreuzberg, document-intelligence

Status

Accepted

Context

Genesis processes diverse document formats for RegenTribe community members. Images embedded in PDFs or uploaded as standalone files may contain text that standard parsing cannot extract, so an OCR (Optical Character Recognition) pipeline is required to recover it.

Three candidate tools were evaluated: usls, ocrs, and Kreuzberg. Each targets a different niche: usls is a general vision/VLM library with OCR sub-models, ocrs is a dedicated Rust OCR engine, and Kreuzberg is a polyglot document intelligence framework with plugin-based OCR support.

D1: Adopt Kreuzberg as the primary OCR engine for image content extraction

Kreuzberg provides the most production-ready path for Genesis use cases.

Rationale:

Native OCR integration. Kreuzberg ships built-in Tesseract bindings (no external tesseract install required), plus optional EasyOCR and PaddleOCR backends. The Rust core handles format detection and extraction pipeline orchestration.

Document intelligence scope. Kreuzberg extracts text, tables, metadata, and images from 97+ formats. OCR is one capability within a larger document pipeline — aligned with how Genesis uses extraction (ingestion of community documents for knowledge graph population).

Polyglot bindings. Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, Elixir, Rust, C FFI. Genesis operates primarily in Python (semantic-graph pipeline). The Python SDK is first-class and well-documented.

Plugin architecture. OCR backend is swappable via configuration. Enables future migration to PaddleOCR or EasyOCR without code changes.

Existing Genesis integration. Kreuzberg is already installed in the semantic-graph virtual environment. It is the standard extraction tool for the Genesis pipeline. No new dependencies required.
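
The configuration-driven backend swap described above might look like the following minimal sketch. `extract_file_sync` is the SDK entry point listed in the references. The `ExtractionConfig(ocr=OcrConfig(backend=..., language=...))` shape is an assumption based on the Kreuzberg 4.x Python documentation and should be checked against the installed version. The environment variable names, the backend identifier strings, and the `OcrSettings` helper are hypothetical and are not part of Kreuzberg or Genesis.

```python
import os
from dataclasses import dataclass

# Backends named in this ADR; the exact identifier strings are an assumption.
SUPPORTED_BACKENDS = ("tesseract", "easyocr", "paddleocr")


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical Genesis-side settings; not a Kreuzberg type."""
    backend: str = "tesseract"
    language: str = "eng"


def load_ocr_settings(env=None):
    """Read the OCR backend and language from (hypothetical) environment variables."""
    env = os.environ if env is None else env
    backend = env.get("GENESIS_OCR_BACKEND", "tesseract").strip().lower()
    language = env.get("GENESIS_OCR_LANGUAGE", "eng").strip()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported OCR backend: {backend!r}")
    return OcrSettings(backend=backend, language=language)


def extract_text(path, settings):
    """Run Kreuzberg extraction with the configured OCR backend."""
    # Imported lazily so the settings logic can be used without Kreuzberg installed.
    # Assumed 4.x API shape; verify against the installed version.
    from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync

    config = ExtractionConfig(
        ocr=OcrConfig(backend=settings.backend, language=settings.language)
    )
    return extract_file_sync(str(path), config=config).content
```

Under these assumptions, moving from Tesseract to PaddleOCR would only change the value of the setting, while `extract_text` and its callers stay as they are.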

D2: usls is not suited for standalone OCR text extraction

usls is a Rust library for vision and vision-language models powered by ONNX Runtime. It includes OCR sub-models (DB for text detection, SVTR for text recognition) as part of a broader model zoo covering object detection, segmentation, pose estimation, VLM, and embedding models.

Rationale:

Alpha-stage versioning. Latest available version is 0.2.0-alpha.3. Not production-ready.

Scope mismatch. usls targets computer vision researchers and practitioners who need SOTA model inference (YOLO, SAM, GroundingDINO). The OCR models are incidental components of a much larger model portfolio, not a dedicated OCR engine.

Integration cost. Requires ONNX Runtime management, model download/caching, and GPU/threading configuration. That is a high operational overhead when Genesis would use only a single OCR model from the usls ecosystem.

Rust-only. No Python bindings, no CLI tool for quick extraction. Requires Rust compilation and custom integration code.

D3: ocrs is a promising Rust-native OCR engine but not yet viable for the Genesis pipeline

ocrs is a Rust library and CLI tool for OCR using PyTorch models exported to ONNX, executed via the RTen engine. It targets modern OCR with minimal preprocessing and open training data.

Rationale:

Early preview state. The README explicitly states "Expect more errors than commercial OCR engines." Behavior may change across versions.

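Latin alphabet only. ocrs currently recognizes only the Latin alphabet. RegenTribe community documents may include non-Latin scripts and technical symbols, as well as languages such as Māori and other Pacific languages that are written in Latin script but rely on diacritics (for example macrons), which a Latin-only preview model may not handle reliably. Kreuzberg's Tesseract backend supports 100+ languages out of the box.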

Language constraint. The requirement for multilingual document support disqualifies ocrs at this stage.
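
As a small illustration of the multilingual requirement: Tesseract accepts several languages at once as codes joined by `+`, for example `eng+mri` for English plus Māori, provided the `mri` trained data is available to the backend. The helper below is a hypothetical sketch. Whether Kreuzberg passes such a combined string through to Tesseract unchanged is an assumption to verify.

```python
def tesseract_language_spec(codes):
    """Join Tesseract language codes with '+', dropping duplicates but keeping order."""
    seen = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


print(tesseract_language_spec(["eng", "mri"]))  # eng+mri
```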

Consequences

Positive:

Kreuzberg provides a unified extraction pipeline: documents, images, OCR, tables, metadata — one tool, consistent API, minimal operational overhead.
The Python SDK integrates cleanly with the existing semantic-graph pipeline. No new runtime dependencies.
Tesseract backend works without external tesseract installation (native Rust binding). Backends are swappable via config.
Language coverage is broad: Tesseract supports 100+ languages, PaddleOCR supports Chinese, Japanese, Korean, Arabic, and more.

Negative:

Tesseract (default backend) is less accurate on complex layouts than more specialized OCR systems. For highly structured documents, PaddleOCR or VLM-based OCR (GPT-4o, Claude) may be needed as a future upgrade.
Kreuzberg is licensed under the Elastic License v2 (ELv2), which is source-available rather than fully open source; license restrictions apply. Acceptable for Genesis internal pipeline use.
The Kreuzberg Python package currently installed is version 4.4.3 with pre-compiled binaries. Newer versions (4.10.0-rc.15) exist, but the package available through pip may lag behind upstream releases.
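
To keep an eye on the version lag noted above, a pipeline check could compare the installed package against a target release. The sketch below uses only the standard library, and both helper names are hypothetical. It compares numeric release parts, because plain string comparison would wrongly rank "4.4.3" above "4.10.0".

```python
from importlib import metadata


def release_tuple(version):
    """Return the numeric release part of a version string as a tuple of ints.

    Pre-release suffixes such as '-rc.15' or 'rc15' are ignored, so a release
    candidate compares equal to its final release.
    """
    parts = []
    for piece in version.split("-")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


current = installed_version("kreuzberg")  # None when Kreuzberg is not installed
if current is not None and release_tuple(current) < release_tuple("4.10.0-rc.15"):
    print(f"kreuzberg {current} is older than the newest upstream build")
```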

References

Kreuzberg on GitHub
Kreuzberg OCR Guide
usls on GitHub — Alpha-stage vision library with OCR sub-models
ocrs on GitHub — Rust-native OCR (Latin only, early preview)
Kreuzberg Python SDK — extract_file_sync
Source — OCR Stack Evaluation Notes