Core Concepts¶
This guide explains the key concepts and architecture of databrew.
Architecture Overview¶
Databrew has a clean separation of concerns:
Orchestrator (coordinates everything)
├── Store (SQLite URL queue + Parquet item storage)
├── Fetcher (HTTP with rate limiting)
├── Extractor (HTML/JSON parsing)
└── Policy (retry rules, stopping conditions)
At the package level, these concerns map to component packages:
- databrew.core: policy, stats, strict config models, and module-loading utilities
- fetchkit: HTTP/browser fetchers, request pacer, fetcher registry
- extractkit: HTML/JSON extractors and parser registry
- itemstore: Parquet item storage and storage metadata contracts
- databrew.state: URL queue and unified state store
The main databrew package composes these components and provides the CLI,
config loading, orchestrator, middleware, and hooks.
URL Types¶
Databrew distinguishes between two types of URLs:
Pagination URLs¶
Listing pages that contain links to other pages and items:
- Search result pages
- Category listings
- Index pages
Pagination URLs are always followed to discover new content.
Item URLs¶
Detail pages that contain the actual data to extract:
- Product pages
- Article pages
- Profile pages
Item URLs are checked against storage before fetching (deduplication).
Why This Matters¶
The separation enables smart incremental crawling:
- Pagination pages are re-crawled each run to find new items
- Item pages are fetched only once; on later runs they are skipped when the item already exists in storage
- When all items on a pagination page already exist, that branch stops
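The fetch-or-skip decision above can be sketched in a few lines. This is an illustrative sketch, not databrew's actual API: the `should_fetch` function, `has_item` method, and `FakeStore` class are hypothetical names invented for the example.

```python
# Illustrative sketch of the per-URL decision; `has_item` is a
# hypothetical store method, not databrew's real interface.

def should_fetch(url: str, url_type: str, store) -> bool:
    """Pagination URLs are always followed; item URLs are skipped
    when the item is already in storage."""
    if url_type == "pagination":
        return True                   # re-crawled every run to find new items
    return not store.has_item(url)    # item URLs: fetch only if missing


class FakeStore:
    def __init__(self, known):
        self.known = set(known)

    def has_item(self, url):
        return url in self.known


store = FakeStore({"https://example.com/item/1"})
print(should_fetch("https://example.com/page/2", "pagination", store))  # True
print(should_fetch("https://example.com/item/1", "item", store))        # False
print(should_fetch("https://example.com/item/2", "item", store))        # True
```

This is why re-runs are cheap: only pagination pages and genuinely new items cost a request.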
Data Flow¶
Config (TOML) → create_components() → Orchestrator.run()
                                              ↓
Store.get_next_url() → Fetcher.fetch() → Extractor.extract()
        ↑                                     ↓
        └────────── Store.add_*_urls() ←──────┘
                                              ↓
                                       Store.save_item()
- Config Loading: TOML is parsed into typed configuration objects
- Component Creation: Store, fetcher, and extractor are initialized
- URL Queue: The orchestrator pulls URLs from the queue
- Fetching: The fetcher retrieves page content
- Extraction: The extractor parses items and discovers links
- Storage: Items are saved, new links are added to the queue
The Orchestrator¶
The orchestrator is the main crawl loop that coordinates everything:
async def run(self):
    while True:
        # Get batch of URLs
        tasks = self.store.get_pending_urls(limit=self.policy.concurrency)
        if not tasks:
            break  # Nothing left to process

        # Process concurrently
        results = await asyncio.gather(*[
            self._process_url(task) for task in tasks
        ])

        # Check stopping conditions
        if self.policy.should_stop(...):
            break
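The per-URL step inside that loop follows the data flow shown earlier: fetch, extract, enqueue discovered links, save items. The following is a self-contained sketch of that shape with stub components; the class and function names (`StubFetcher`, `process_url`, and so on) are invented for illustration and databrew's real `_process_url` differs in its details (error handling, hooks, retries).

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical stand-ins for databrew's components; names are illustrative.

@dataclass
class ExtractResult:
    items: list
    item_urls: list
    pagination_urls: list

class StubFetcher:
    async def fetch(self, url):
        return f"<html for {url}>"

class StubExtractor:
    def extract(self, response):
        return ExtractResult(
            items=[{"title": "demo"}],
            item_urls=["https://example.com/item/1"],
            pagination_urls=["https://example.com/page/2"],
        )

@dataclass
class StubStore:
    queued: list = field(default_factory=list)
    items: list = field(default_factory=list)
    def add_pagination_urls(self, urls): self.queued.extend(urls)
    def add_item_urls(self, urls): self.queued.extend(urls)
    def save_item(self, item): self.items.append(item)

async def process_url(url, fetcher, extractor, store):
    # fetch -> extract -> enqueue discovered links -> save items
    response = await fetcher.fetch(url)
    result = extractor.extract(response)
    store.add_pagination_urls(result.pagination_urls)
    store.add_item_urls(result.item_urls)
    for item in result.items:
        store.save_item(item)

store = StubStore()
asyncio.run(process_url("https://example.com/page/1",
                        StubFetcher(), StubExtractor(), store))
print(store.items)   # one saved item
print(store.queued)  # two discovered URLs, back in the queue
```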
Key features:
- Concurrent processing: Fetches multiple URLs in parallel
- Automatic retries: Failed URLs are retried with exponential backoff
- Incremental stopping: Each pagination branch stops independently
- Lifecycle hooks: Shell commands at key points (start, failure, complete) for automated recovery
- Progress tracking: Statistics are updated in real-time
Extractors¶
Extractors parse page content and return structured data.
HTML Extractor¶
Uses CSS selectors to extract data from HTML:
[extract]
type = "html"
[extract.items]
selector = ".product"
[extract.items.fields]
title = "h2"
price = { selector = ".price", parser = "parse_price" }
JSON Extractor¶
Uses dot-notation paths to extract data from JSON:
[extract]
type = "json"
[extract.items]
path = "data.products"
[extract.items.fields]
title = "name"
price = { path = "pricing.amount", parser = "parse_float" }
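A dot-notation path like `data.products` simply walks nested keys in the decoded JSON. The sketch below shows the basic idea; databrew's actual extractor may handle more cases (list indexing, defaults), and `resolve_path` is a name invented for this example.

```python
# Minimal sketch of dot-path resolution against a decoded JSON document.

def resolve_path(doc, path: str):
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None   # missing key anywhere along the path
        doc = doc[key]
    return doc

payload = {"data": {"products": [{"name": "Widget",
                                  "pricing": {"amount": "9.50"}}]}}

products = resolve_path(payload, "data.products")
print(products[0]["name"])                          # Widget
print(resolve_path(products[0], "pricing.amount"))  # 9.50
```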
Fetchers¶
Fetchers retrieve page content:
HTTP Fetcher (httpx)¶
The default fetcher uses httpx for fast HTTP requests:
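A minimal config for the HTTP fetcher might look like the following. The exact key names shown here (`timeout`, `headers`) are illustrative assumptions; see the Configuration Guide for the real schema.

```toml
[fetch]
type = "http"
timeout = 30.0

[fetch.headers]
User-Agent = "databrew/0.1"
```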
Browser Fetcher (pydoll)¶
For JavaScript-heavy sites, use browser rendering:
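Selecting the browser fetcher might look like this; again, keys other than `type` are illustrative assumptions rather than the documented schema.

```toml
[fetch]
type = "browser"
headless = true
```

Browser rendering is slower and heavier than plain HTTP, so prefer the default fetcher unless the page content is built by JavaScript.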
Policy¶
The policy controls crawl behavior:
[policy]
max_retries = 3 # Retry failed requests
max_requests = 1000 # Stop after N requests
concurrency = 5 # Parallel requests
delay = 1.0 # Delay between batches
jitter = 0.2 # Random delay (anti-fingerprinting)
max_consecutive_failures = 50 # Stop on too many failures
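The settings above translate into two kinds of waiting: exponential backoff for retries and a jittered delay between batches. The formulas below are a plausible sketch, not databrew's exact implementation, and the function names are invented for illustration.

```python
import random

# Sketch of how policy settings could map to wait times. Illustrative only.

def retry_backoff(attempt: int, base: float = 1.0) -> float:
    """Exponential backoff: 1s, 2s, 4s, ... for attempts 0, 1, 2, ..."""
    return base * (2 ** attempt)

def batch_delay(delay: float = 1.0, jitter: float = 0.2) -> float:
    """Configured delay plus a random component (anti-fingerprinting)."""
    return delay + random.uniform(0, jitter)

print([retry_backoff(a) for a in range(3)])  # [1.0, 2.0, 4.0]
print(round(batch_delay(), 3))               # varies per call, in [1.0, 1.2]
```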
Storage¶
Databrew uses a dual-layer storage architecture:
data/mysite/
├── .state.db           # URL queue/retry state (ephemeral, gitignored)
├── .failures.db        # Durable failure tracking (local, gitignored)
├── _failed_urls.json   # Portable failure snapshot (committed/synced)
├── .index.db           # Storage dedupe/index catalog (ephemeral, gitignored)
└── items/
    ├── part_000001.parquet   # Rolling part files (compressed)
    ├── part_000002.parquet
    └── ...
- .state.db: SQLite database for URL queue/retry state. This is local-only.
- .failures.db: Durable failure tracking in a separate SQLite file. Survives .state.db deletion.
- _failed_urls.json: Portable failure snapshot exported at run end. Safe for cross-machine sync (e.g. via git).
- .index.db: SQLite storage catalog for dedupe and item metadata. This is local-only and auto-rebuilt from Parquet files on startup.
- items/*.parquet: Rolling Parquet part files containing the actual extracted items. These are the source of truth and should be synced across machines.
- Deduplication: Items are deduplicated by ID field (if configured)
Storage location is configured in the config:
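For example (the `[storage]` table and `path` key shown here are illustrative; check the Configuration Guide for the actual key names):

```toml
[storage]
path = "data/mysite"
```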
The path is relative to CWD (where you run databrew), not the config file location.
See Working with Extracted Data for how to query and use the Parquet files.
Incremental Crawling¶
Databrew supports smart incremental updates:
Per-Branch Stopping¶
Each pagination chain stops independently when it encounters a page where all items already exist:
Seed URL 1 → Page 1 → Page 2 → Page 3 (all items exist, STOP)
Seed URL 2 → Page 1 → Page 2 (new items found, continue...)
This is automatic for re-runs (when items already exist in storage).
Cross-Run Retry¶
Item URLs that fail are automatically retried on subsequent runs:
Run 1: Item fails → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails → failed_runs=2
Run 3: Retry again → fails → status='permanently_failed'
After 3 failed runs, the URL is marked permanently failed.
Failures are tracked durably in .failures.db and exported to _failed_urls.json at
run end, so they survive .state.db deletion and can be synced across machines.
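The bookkeeping above can be sketched as a small state machine over a per-URL record. The record shape and function name below are hypothetical; only the statuses and the three-run limit come from the text.

```python
# Sketch of cross-run failure tracking; field names mirror the statuses
# described above, but the record shape is invented for illustration.

MAX_FAILED_RUNS = 3

def record_run_failure(record: dict) -> dict:
    """Called when an item URL fails during a run."""
    record["failed_runs"] = record.get("failed_runs", 0) + 1
    if record["failed_runs"] >= MAX_FAILED_RUNS:
        record["status"] = "permanently_failed"  # no further retries
    else:
        record["status"] = "failed"              # reset to pending next run
    return record

rec = {"url": "https://example.com/item/9"}
for run in range(3):
    rec = record_run_failure(rec)
print(rec)  # status='permanently_failed', failed_runs=3
```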
Config Composition¶
Configs can inherit from a base config:
# mysite.toml
extends = "base.toml"
name = "mysite"
start_urls = ["https://example.com"]
# ... site-specific config
Merge behavior:
- Dicts: merge recursively (child overrides base)
- Lists: replace entirely (no concatenation)
- Scalars: replace entirely
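The three rules above amount to a recursive merge where only dicts recurse and everything else is replaced by the child value. A minimal sketch (`merge` is an illustrative name, not databrew's internal function):

```python
# Sketch of the merge rules: dicts merge recursively; lists and
# scalars from the child config replace the base value outright.

def merge(base, child):
    if isinstance(base, dict) and isinstance(child, dict):
        merged = dict(base)
        for key, value in child.items():
            merged[key] = merge(base[key], value) if key in base else value
        return merged
    return child  # lists and scalars: child wins wholesale

base = {"policy": {"concurrency": 5, "delay": 1.0}, "start_urls": ["a"]}
site = {"policy": {"concurrency": 2}, "start_urls": ["b", "c"]}
print(merge(base, site))
# {'policy': {'concurrency': 2, 'delay': 1.0}, 'start_urls': ['b', 'c']}
```

Note that the child's `start_urls` list replaces the base list entirely; there is no concatenation.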
Next Steps¶
- CLI Reference - All available commands
- Configuration Guide - Complete config reference
- HTML Extraction - CSS selector patterns