Core Concepts¶
This guide explains the key concepts and architecture of databrew.
Architecture Overview¶
Databrew has a clean separation of concerns:
Orchestrator (coordinates everything)
├── Store (SQLite URL queue + Parquet item storage)
├── Fetcher (HTTP with rate limiting)
├── Extractor (HTML/JSON parsing)
└── Policy (retry rules, stopping conditions)
At the package level, these concerns map to component packages:
- databrew.core: policy, stats, strict config models, and module-loading utilities
- fetchkit: HTTP/browser fetchers, request pacer, fetcher registry
- extractkit: HTML/JSON extractors and parser registry
- itemstore: Parquet item storage and storage metadata contracts
- databrew.state: URL queue and unified state store
The main databrew package composes these components and provides the CLI,
config loading, orchestrator, middleware, and hooks.
URL Types¶
Databrew distinguishes between two types of URLs:
Pagination URLs¶
Listing pages that contain links to other pages and items:
- Search result pages
- Category listings
- Index pages
Pagination URLs are always followed to discover new content.
Item URLs¶
Detail pages that contain the actual data to extract:
- Product pages
- Article pages
- Profile pages
Item URLs are checked against storage before fetching (deduplication).
Why This Matters¶
The separation enables smart incremental crawling:
- Pagination pages are re-crawled each run to find new items
- Item pages are fetched only once; on later runs they are skipped when the item already exists in storage
- When all items on a pagination page already exist, that branch stops
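The fetch-or-skip decision above can be sketched in a few lines. This is an illustrative sketch, not databrew's actual API: the `should_fetch` function, `has_item` method, and `FakeStore` class are hypothetical names invented for the example.

```python
# Illustrative sketch of the per-URL decision; `has_item` is a
# hypothetical store method, not databrew's real interface.

def should_fetch(url: str, url_type: str, store) -> bool:
    """Pagination URLs are always followed; item URLs are skipped
    when the item is already in storage."""
    if url_type == "pagination":
        return True                   # re-crawled every run to find new items
    return not store.has_item(url)    # item URLs: fetch only if missing


class FakeStore:
    def __init__(self, known):
        self.known = set(known)

    def has_item(self, url):
        return url in self.known


store = FakeStore({"https://example.com/item/1"})
print(should_fetch("https://example.com/page/2", "pagination", store))  # True
print(should_fetch("https://example.com/item/1", "item", store))        # False
print(should_fetch("https://example.com/item/2", "item", store))        # True
```

This is why re-runs are cheap: only pagination pages and genuinely new items cost a request.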
Data Flow¶
Config (TOML) → create_components() → Orchestrator.run()
                                              ↓
Store.get_next_url() → Fetcher.fetch() → Extractor.extract()
        ↑                                     ↓
        └────────── Store.add_*_urls() ←──────┘
                                              ↓
                                       Store.save_item()
- Config Loading: TOML is parsed into typed configuration objects
- Component Creation: Store, fetcher, and extractor are initialized
- URL Queue: The orchestrator pulls URLs from the queue
- Fetching: The fetcher retrieves page content
- Extraction: The extractor parses items and discovers links
- Storage: Items are saved, new links are added to the queue
The Orchestrator¶
The orchestrator is the main crawl loop that coordinates everything:
async def run(self):
    while True:
        # Get batch of URLs
        tasks = self.store.get_pending_urls(limit=self.policy.concurrency)
        if not tasks:
            break  # Nothing left to process

        # Process concurrently
        results = await asyncio.gather(*[
            self._process_url(task) for task in tasks
        ])

        # Check stopping conditions
        if self.policy.should_stop(...):
            break
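The per-URL step inside that loop follows the data flow shown earlier: fetch, extract, enqueue discovered links, save items. The following is a self-contained sketch of that shape with stub components; the class and function names (`StubFetcher`, `process_url`, and so on) are invented for illustration and databrew's real `_process_url` differs in its details (error handling, hooks, retries).

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical stand-ins for databrew's components; names are illustrative.

@dataclass
class ExtractResult:
    items: list
    item_urls: list
    pagination_urls: list

class StubFetcher:
    async def fetch(self, url):
        return f"<html for {url}>"

class StubExtractor:
    def extract(self, response):
        return ExtractResult(
            items=[{"title": "demo"}],
            item_urls=["https://example.com/item/1"],
            pagination_urls=["https://example.com/page/2"],
        )

@dataclass
class StubStore:
    queued: list = field(default_factory=list)
    items: list = field(default_factory=list)
    def add_pagination_urls(self, urls): self.queued.extend(urls)
    def add_item_urls(self, urls): self.queued.extend(urls)
    def save_item(self, item): self.items.append(item)

async def process_url(url, fetcher, extractor, store):
    # fetch -> extract -> enqueue discovered links -> save items
    response = await fetcher.fetch(url)
    result = extractor.extract(response)
    store.add_pagination_urls(result.pagination_urls)
    store.add_item_urls(result.item_urls)
    for item in result.items:
        store.save_item(item)

store = StubStore()
asyncio.run(process_url("https://example.com/page/1",
                        StubFetcher(), StubExtractor(), store))
print(store.items)   # one saved item
print(store.queued)  # two discovered URLs, back in the queue
```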
Key features:
- Concurrent processing: Fetches multiple URLs in parallel
- Automatic retries: Failed URLs are retried with exponential backoff
- Incremental stopping: Each pagination branch stops independently
- Lifecycle hooks: Shell commands at key points (start, failure, complete) for automated recovery
- Progress tracking: Statistics are updated in real-time
Extractors¶
Extractors parse page content and return structured data.
HTML Extractor¶
Uses CSS selectors to extract data from HTML:
[extract]
type = "html"
[extract.items]
selector = ".product"
[extract.items.fields]
title = "h2"
price = { selector = ".price", parser = "parse_price" }
JSON Extractor¶
Uses dot-notation paths to extract data from JSON:
[extract]
type = "json"
[extract.items]
path = "data.products"
[extract.items.fields]
title = "name"
price = { path = "pricing.amount", parser = "parse_float" }
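A dot-notation path like `data.products` simply walks nested keys in the decoded JSON. The sketch below shows the basic idea; databrew's actual extractor may handle more cases (list indexing, defaults), and `resolve_path` is a name invented for this example.

```python
# Minimal sketch of dot-path resolution against a decoded JSON document.

def resolve_path(doc, path: str):
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None   # missing key anywhere along the path
        doc = doc[key]
    return doc

payload = {"data": {"products": [{"name": "Widget",
                                  "pricing": {"amount": "9.50"}}]}}

products = resolve_path(payload, "data.products")
print(products[0]["name"])                          # Widget
print(resolve_path(products[0], "pricing.amount"))  # 9.50
```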
Fetchers¶
Fetchers retrieve page content:
HTTP Fetcher (httpx)¶
The default fetcher uses httpx for fast HTTP requests:
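A minimal config for the HTTP fetcher might look like the following. The exact key names shown here (`timeout`, `headers`) are illustrative assumptions; see the Configuration Guide for the real schema.

```toml
[fetch]
type = "http"
timeout = 30.0

[fetch.headers]
User-Agent = "databrew/0.1"
```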
Browser Fetcher (pydoll)¶
For JavaScript-heavy sites, use browser rendering:
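Selecting the browser fetcher might look like this; again, keys other than `type` are illustrative assumptions rather than the documented schema.

```toml
[fetch]
type = "browser"
headless = true
```

Browser rendering is slower and heavier than plain HTTP, so prefer the default fetcher unless the page content is built by JavaScript.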
Policy¶
The policy controls crawl behavior:
[policy]
max_retries = 3 # Retry failed requests
max_requests = 1000 # Stop after N requests
concurrency = 5 # Parallel requests
delay = 1.0 # Delay between batches
jitter = 0.2 # Random delay (anti-fingerprinting)
max_consecutive_failures = 50 # Stop on too many failures
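The settings above translate into two kinds of waiting: exponential backoff for retries and a jittered delay between batches. The formulas below are a plausible sketch, not databrew's exact implementation, and the function names are invented for illustration.

```python
import random

# Sketch of how policy settings could map to wait times. Illustrative only.

def retry_backoff(attempt: int, base: float = 1.0) -> float:
    """Exponential backoff: 1s, 2s, 4s, ... for attempts 0, 1, 2, ..."""
    return base * (2 ** attempt)

def batch_delay(delay: float = 1.0, jitter: float = 0.2) -> float:
    """Configured delay plus a random component (anti-fingerprinting)."""
    return delay + random.uniform(0, jitter)

print([retry_backoff(a) for a in range(3)])  # [1.0, 2.0, 4.0]
print(round(batch_delay(), 3))               # varies per call, in [1.0, 1.2]
```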
Storage¶
Databrew uses a dual-layer storage architecture:
data/mysite/
├── .state.db           # URL queue/retry state (ephemeral, gitignored)
├── .failures.db        # Durable failure tracking (local, gitignored)
├── _failed_urls.json   # Portable failure snapshot (committed/synced)
├── .index.db           # Storage dedupe/index catalog (ephemeral, gitignored)
└── items/
    ├── part_000001.parquet   # Rolling part files (compressed)
    ├── part_000002.parquet
    └── ...
- .state.db: SQLite database for URL queue/retry state. This is local-only.
- .failures.db: Durable failure tracking in a separate SQLite file. Survives .state.db deletion.
- _failed_urls.json: Portable failure snapshot exported at run end. Safe for cross-machine sync (e.g. via git).
- .index.db: SQLite storage catalog for dedupe and item metadata. This is local-only and auto-rebuilt from Parquet files on startup.
- items/*.parquet: Rolling Parquet part files containing the actual extracted items. These are the source of truth and should be synced across machines.
- Deduplication: Items are deduplicated by ID field (if configured)
Storage location is configured in the config:
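For example (the `[storage]` table and `path` key shown here are illustrative; check the Configuration Guide for the actual key names):

```toml
[storage]
path = "data/mysite"
```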
The path is relative to CWD (where you run databrew), not the config file location.
See Working with Extracted Data for how to query and use the Parquet files.
Incremental Crawling¶
Databrew supports smart incremental updates:
Per-Branch Stopping¶
Each pagination chain stops independently when it encounters a page where all items already exist:
Seed URL 1 → Page 1 → Page 2 → Page 3 (all items exist, STOP)
Seed URL 2 → Page 1 → Page 2 (new items found, continue...)
This is automatic for re-runs (when items already exist in storage).
Cross-Run Retry¶
Item URLs that fail are automatically retried on subsequent runs:
Run 1: Item fails → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails → failed_runs=2
Run 3: Retry again → fails → status='permanently_failed'
After 3 failed runs, the URL is marked permanently failed.
Failures are tracked durably in .failures.db and exported to _failed_urls.json at
run end, so they survive .state.db deletion and can be synced across machines.
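The bookkeeping above can be sketched as a small state machine over a per-URL record. The record shape and function name below are hypothetical; only the statuses and the three-run limit come from the text.

```python
# Sketch of cross-run failure tracking; field names mirror the statuses
# described above, but the record shape is invented for illustration.

MAX_FAILED_RUNS = 3

def record_run_failure(record: dict) -> dict:
    """Called when an item URL fails during a run."""
    record["failed_runs"] = record.get("failed_runs", 0) + 1
    if record["failed_runs"] >= MAX_FAILED_RUNS:
        record["status"] = "permanently_failed"  # no further retries
    else:
        record["status"] = "failed"              # reset to pending next run
    return record

rec = {"url": "https://example.com/item/9"}
for run in range(3):
    rec = record_run_failure(rec)
print(rec)  # status='permanently_failed', failed_runs=3
```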
Config Composition¶
Configs can inherit from a base config:
# mysite.toml
extends = "base.toml"
name = "mysite"
start_urls = ["https://example.com"]
# ... site-specific config
Merge behavior:
- Dicts: merge recursively (child overrides base)
- Lists: replace entirely (no concatenation)
- Scalars: replace entirely
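The three rules above amount to a recursive merge where only dicts recurse and everything else is replaced by the child value. A minimal sketch (`merge` is an illustrative name, not databrew's internal function):

```python
# Sketch of the merge rules: dicts merge recursively; lists and
# scalars from the child config replace the base value outright.

def merge(base, child):
    if isinstance(base, dict) and isinstance(child, dict):
        merged = dict(base)
        for key, value in child.items():
            merged[key] = merge(base[key], value) if key in base else value
        return merged
    return child  # lists and scalars: child wins wholesale

base = {"policy": {"concurrency": 5, "delay": 1.0}, "start_urls": ["a"]}
site = {"policy": {"concurrency": 2}, "start_urls": ["b", "c"]}
print(merge(base, site))
# {'policy': {'concurrency': 2, 'delay': 1.0}, 'start_urls': ['b', 'c']}
```

Note that the child's `start_urls` list replaces the base list entirely; there is no concatenation.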
Next Steps¶
- CLI Reference - All available commands
- Configuration Guide - Complete config reference
- HTML Extraction - CSS selector patterns