CLI Reference¶
Databrew provides a command-line interface for running crawls, exporting data, and managing state.
Commands Overview¶
```
databrew [-h] {run,status,export,reset,list,import,check,init} ...
```

| Command | Description |
|---|---|
| run | Run extraction from config |
| status | Show crawl status |
| export | Export data to JSON/JSONL |
| reset | Reset crawl state |
| list | List available configs |
| import | Import data from JSON/JSONL |
| check | Validate config file |
| init | Generate a starter config file |
run¶
Run extraction from one or more config files.
Options¶
| Option | Description |
|---|---|
| -v, --verbose | Verbose output |
| -n, --limit LIMIT | Max requests to process |
| -c, --concurrency N | Concurrent requests (default: 5) |
| --start-urls FILE | File with additional start URLs |
| --full-crawl | Ignore caught-up detection |
| --dry-run | Validate without fetching |
| --fresh | Force fresh crawl |
Examples¶
```
# Basic run
databrew run mysite.toml

# Limit to 50 requests
databrew run mysite.toml -n 50

# With custom concurrency
databrew run mysite.toml -c 10

# Force fresh crawl
databrew run mysite.toml --fresh

# Dry run (preview)
databrew run mysite.toml --dry-run

# Multiple configs (batch mode)
databrew run examples/*.toml -n 100
```
Resume Behavior¶
By default, databrew automatically resumes interrupted crawls:
- If the URL queue has pending URLs → resume, continuing from where the previous run stopped
- If the queue is empty → fresh start (seeds the queue with start_urls)

Use --fresh to force a fresh crawl even when URLs are still pending.
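The resume decision above can be sketched as follows. This is a minimal illustration of the documented rules, not databrew's actual internals; the function name `plan_crawl` and its signature are hypothetical.

```python
def plan_crawl(pending_urls, start_urls, fresh=False):
    """Decide which URLs the next run should process.

    Mirrors the documented behavior: --fresh (or an empty queue)
    seeds a fresh crawl from start_urls; otherwise pending work
    is resumed.
    """
    if fresh or not pending_urls:
        return list(start_urls)   # fresh start: seed the queue
    return list(pending_urls)     # resume where the last run stopped
```

The point of the sketch is that resumption is driven purely by queue state, which is why interrupting a run with Ctrl-C and re-running the same command picks up where it left off.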
export¶
Export extracted data to various formats.
Options¶
| Option | Description |
|---|---|
| -o, --output PATH | Output file/directory path |
| -f, --format FORMAT | Output format: jsonl, json, individual, parquet |
| --no-meta | Exclude _source_url and _extracted_at |
| --url-type TYPE | Filter by URL type: item, pagination, all |
| --since TIMESTAMP | Only export items after this timestamp |
Format Detection¶
Format is auto-detected from the output file extension:
- .jsonl → JSONL (one JSON object per line)
- .json → JSON array
- .parquet → Parquet (requires the analytics extra)
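Extension-based detection of this kind can be sketched as below. This is illustrative only; the function name, the `-f/--format` override behavior, and the JSONL fallback are assumptions, not databrew's source.

```python
from pathlib import Path

# Maps output-file extensions to export formats, per the list above.
EXTENSION_FORMATS = {
    ".jsonl": "jsonl",
    ".json": "json",
    ".parquet": "parquet",
}

def detect_format(output_path, explicit_format=None):
    """Pick an export format: an explicit -f/--format wins,
    otherwise fall back to the file extension."""
    if explicit_format:
        return explicit_format
    return EXTENSION_FORMATS.get(Path(output_path).suffix.lower(), "jsonl")
```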
Examples¶
```
# Export to JSONL
databrew export mysite.toml -o data.jsonl

# Export to Parquet (fast, compressed)
databrew export mysite.toml -o data.parquet

# Incremental export (items since date)
databrew export mysite.toml -o data.jsonl --since "2026-01-20"
```
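Exported JSONL is one JSON object per line, with _source_url and _extracted_at attached unless --no-meta was used. A minimal sketch of consuming such a file, assuming _extracted_at holds ISO-8601 timestamps (which compare correctly as strings); the helper names are hypothetical:

```python
import json

def read_jsonl(path):
    """Load a JSONL export: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def items_since(records, timestamp):
    """Keep records extracted at or after the given ISO timestamp.

    _extracted_at is present unless the export used --no-meta;
    records without it are dropped.
    """
    return [r for r in records if r.get("_extracted_at", "") >= timestamp]
```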
status¶
Show crawl status and statistics.
Example Output¶
check¶
Validate a config file without running a crawl.
Examples¶
init¶
Generate a starter config file.
Options¶
| Option | Description |
|---|---|
| --url URL | Start URL for the site |
| --type TYPE | Extraction type: html or json |
| -o, --output PATH | Output file path |
Examples¶
```
# Generate HTML config
databrew init mysite --url "https://example.com" --type html

# Generate JSON API config
databrew init myapi --url "https://api.example.com/items" --type json
```
list¶
List available config files in a directory.
reset¶
Reset crawl state (URL queue and optionally items).
Options¶

| Option | Description |
|---|---|
| --all | Delete entire database including items |
import¶
Import data from JSON/JSONL files to repopulate state.
Exit Codes¶
| Code | Description |
|---|---|
| 0 | Success |
| 1 | Error (invalid config, crawl failure, etc.) |