CLI Reference¶

Databrew provides a command-line interface for running crawls, exporting data, and managing state.

Commands Overview¶

databrew [-h] {run,status,export,reset,list,import,check,init} ...

Commands:
  run       Run extraction from config
  status    Show crawl status
  export    Export data to JSON/JSONL
  reset     Reset crawl state
  list      List available configs
  import    Import data from JSON/JSONL
  check     Validate config file
  init      Generate a starter config file

run¶

Run extraction from one or more config files.

databrew run [OPTIONS] CONFIG [CONFIG ...]

Options¶

Option	Description
`-v, --verbose`	Verbose output
`-n, --limit LIMIT`	Max requests to process
`-c, --concurrency N`	Concurrent requests (default: 5)
`--start-urls FILE`	File with additional start URLs
`--full-crawl`	Ignore caught-up detection
`--dry-run`	Validate without fetching
`--fresh`	Force fresh crawl

Examples¶

# Basic run
databrew run mysite.toml

# Limit to 50 requests
databrew run mysite.toml -n 50

# With custom concurrency
databrew run mysite.toml -c 10

# Force fresh crawl
databrew run mysite.toml --fresh

# Dry run (preview)
databrew run mysite.toml --dry-run

# Multiple configs (batch mode)
databrew run examples/*.toml -n 100

Resume Behavior¶

By default, databrew automatically resumes interrupted crawls:

If the URL queue has pending URLs → resume (continues from where it stopped)
If the queue is empty → fresh start (adds start_urls)

Use --fresh to force a fresh crawl even if URLs are pending.

export¶

Export extracted data to various formats.

databrew export [OPTIONS] CONFIG

Options¶

Option	Description
`-o, --output PATH`	Output file/directory path
`-f, --format FORMAT`	Output format: jsonl, json, individual, parquet
`--no-meta`	Exclude _source_url and _extracted_at
`--url-type TYPE`	Filter by URL type: item, pagination, all
`--since TIMESTAMP`	Only export items after this timestamp

Format Detection¶

Format is auto-detected from the output file extension:

.jsonl → JSONL (one JSON object per line)
.json → JSON array
.parquet → Parquet (requires analytics extra)

Examples¶

# Export to JSONL
databrew export mysite.toml -o data.jsonl

# Export to Parquet (fast, compressed)
databrew export mysite.toml -o data.parquet

# Incremental export (items since date)
databrew export mysite.toml -o data.jsonl --since "2026-01-20"

status¶

Show crawl status and statistics.

databrew status CONFIG

Example Output¶

Status for: realethio
  Items stored: 1763
  URLs pending: 0
  URLs completed: 1996
  URLs failed: 0

check¶

Validate a config file without running.

databrew check [-v] CONFIG

Examples¶

# Basic validation
databrew check mysite.toml

# Verbose output
databrew check mysite.toml -v

init¶

Generate a starter config file.

databrew init [OPTIONS] [NAME]

Options¶

Option	Description
`--url URL`	Start URL for the site
`--type TYPE`	Extraction type: html or json
`-o, --output PATH`	Output file path

Examples¶

# Generate HTML config
databrew init mysite --url "https://example.com" --type html

# Generate JSON API config
databrew init myapi --url "https://api.example.com/items" --type json

list¶

List available config files in a directory.

databrew list [DIR]

reset¶

Reset crawl state (URL queue and optionally items).

databrew reset [--all] CONFIG

Option	Description
`--all`	Delete entire database including items

import¶

Import data from JSON/JSONL files to repopulate state.

databrew import [OPTIONS] CONFIG INPUT

Exit Codes¶

Code	Description
0	Success
1	Error (invalid config, crawl failure, etc.)