Skip to content

CLI Reference

Databrew provides a command-line interface for running crawls and managing state.

Commands Overview

databrew [-h] {run,status,list,check,init} ...

Commands:
  run       Run extraction from config
  status    Show crawl status
  list      List available configs
  check     Validate config file
  init      Generate a starter config file

run

Run extraction from one or more config files.

databrew run [OPTIONS] CONFIG [CONFIG ...]

Options

Option Description
-v, --verbose Verbose output
-n, --limit LIMIT Max requests to process
-c, --concurrency N Concurrent requests (default: 5)
--start-urls FILE File with additional start URLs
--full-crawl Ignore caught-up detection
--dry-run Validate without fetching
--fresh Force fresh crawl
--retry-failed Retry only failed URLs from previous runs
--on-start CMD Shell command before crawl (exit non-zero to abort)
--on-failure CMD Shell command on consecutive failure stop (exit 0 to resume)
--on-complete CMD Shell command after crawl finishes

Examples

# Basic run
databrew run mysite.toml

# Limit to 50 requests
databrew run mysite.toml -n 50

# With custom concurrency
databrew run mysite.toml -c 10

# Force fresh crawl
databrew run mysite.toml --fresh

# Retry only failed URLs
databrew run mysite.toml --retry-failed

# Dry run (preview)
databrew run mysite.toml --dry-run

# Multiple configs (batch mode)
databrew run examples/*.toml -n 100

# With lifecycle hooks
databrew run mysite.toml --on-failure "python scripts/recover.py {name}"
databrew run mysite.toml --on-start "echo start" --on-complete "echo done"

Hook CLI flags override values from the [hooks] config section. See Lifecycle Hooks for details.

Resume Behavior

By default, databrew automatically resumes interrupted crawls:

  • If the URL queue has pending URLs → resume (continues from where it stopped)
  • If the queue is empty → fresh start (adds start_urls)

Use --fresh to force a fresh crawl even if URLs are pending.

Use --retry-failed to process only URLs currently in failed state. This skips start URL seeding and temporarily defers pending/in-progress URLs.

status

Show crawl status and statistics.

databrew status CONFIG

Example Output

Status for: realethio
  Storage: data/realethio
  Items stored: 1763
  Storage files: 3
    - part_000001.parquet
    - part_000002.parquet
    - part_000003.parquet
  URLs pending: 0
  URLs completed: 1996
  URLs failed: 0

check

Validate a config file without running.

databrew check [-v] CONFIG

Examples

# Basic validation
databrew check mysite.toml

# Verbose output
databrew check mysite.toml -v

init

Generate a starter config file.

databrew init [OPTIONS] [NAME]

Options

Option Description
--url URL Start URL for the site
--type TYPE Extraction type: html or json
-o, --output PATH Output file path

Examples

# Generate HTML config
databrew init mysite --url "https://example.com" --type html

# Generate JSON API config
databrew init myapi --url "https://api.example.com/items" --type json

list

List available config files in a directory.

databrew list [DIR]

Exit Codes

Code Description
0 Success
1 Error (invalid config, crawl failure, etc.)