Lifecycle Hooks¶
Lifecycle hooks let you run shell commands at key points during a crawl. This enables automated recovery from transient failures (expired cookies, blocked IPs, rate limits) without manual intervention or wrapper scripts.
Overview¶
| Hook | When | Use Cases |
|---|---|---|
| `on_start` | Before crawl begins | Preflight checks, login scripts, cache warming |
| `on_failure` | On consecutive-failure stop | Refresh cookies, rotate proxies, restart VPN |
| `on_complete` | After crawl finishes | Notifications, data export, cleanup |
Quick Start¶
When the crawl hits `max_consecutive_failures`, databrew runs your `on_failure` script instead of stopping. If the script exits 0, the crawl resets its failure counter and resumes.
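A minimal sketch of that setup (the exact location of `max_consecutive_failures` in the config depends on your layout; the script path is illustrative):

```toml
# Illustrative placement; adjust to your config layout.
max_consecutive_failures = 10

[hooks]
on_failure = "python scripts/recover.py {name}"
```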
Configuration¶
TOML Config¶
```toml
[hooks]
on_start = "echo Starting {name}"
on_failure = "python scripts/recover.py {name}"
on_complete = "python scripts/notify.py {name} {items}"
max_hook_retries = 3   # Max times on_failure can fire (default: 3)
hook_timeout = 300.0   # Timeout per hook in seconds (default: 300)
```
CLI Overrides¶
CLI flags override config values:
```bash
# Override on_failure from config
databrew run mysite.toml --on-failure "python scripts/recover.py {name}"

# Add hooks to a config that has none
databrew run mysite.toml --on-start "echo start" --on-complete "echo done"

# Override just one hook
databrew run mysite.toml --on-failure "echo recovery attempt"
```
Template Variables¶
Hook commands support these template variables:
| Variable | Description |
|---|---|
| `{name}` | Site name from config |
| `{failures}` | Current consecutive failure count |
| `{items}` | Total items extracted so far |
| `{requests}` | Total requests processed so far |
```toml
[hooks]
on_failure = "echo '{name} failed {failures} times after {requests} requests'"
on_complete = "echo '{name}: {items} items in {requests} requests'"
```
Hook Behavior¶
on_start¶
Runs before the crawl begins. If the command exits non-zero, the crawl aborts.
Use cases:
- Validate that credentials are fresh
- Check that the target site is reachable
- Warm up caches or sessions
on_failure¶
Runs when the crawl hits `max_consecutive_failures`. This is the primary recovery hook.
Recovery flow:
```
Crawl running → 10 consecutive failures → on_failure fires
                         ↓
                Exit 0 (success)?
                ├── Yes: Reset failure counter, reload config, resume crawl
                └── No:  Stop crawl (same as no hook)
```
After recovery, databrew:
- Resets the consecutive failure counter to 0
- Reloads config and recreates the fetcher — so your recovery script can update headers, cookies, or proxy settings in the config file and they take effect immediately
- Resumes the crawl loop
The hook fires up to max_hook_retries times per crawl. After that, the crawl stops normally.
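The flow above can be sketched in plain Python. This is a toy model of the loop, not databrew internals; the function and parameter names are illustrative:

```python
def crawl_loop(batches, max_consecutive_failures=10,
               on_failure=None, max_hook_retries=3):
    """Toy model of the failure/recovery loop described above.

    `batches` is an iterable of (ok, item) pairs standing in for
    fetch results.  Returns the items collected before stopping.
    """
    consecutive_failures = 0
    hook_calls = 0
    items = []
    for ok, item in batches:
        if ok:
            consecutive_failures = 0   # any success resets the counter
            items.append(item)
            continue
        consecutive_failures += 1
        if consecutive_failures >= max_consecutive_failures:
            # Fire on_failure, but only up to max_hook_retries times.
            if on_failure and hook_calls < max_hook_retries and on_failure():
                hook_calls += 1
                consecutive_failures = 0   # recovered: reset and resume
            else:
                break                      # stop, same as having no hook
    return items
```

The key behaviors from the list above are visible here: a successful hook resets the counter and resumes the loop, a failing hook (or an exhausted retry budget) stops the crawl.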
on_complete¶
Runs after the crawl finishes, regardless of how it stopped (success, failure, or abort). The CrawlResult.stopped_reason tells you what happened.
Use cases:
- Send notifications (Slack, email)
- Trigger data pipelines
- Upload results to cloud storage
- Log crawl metrics
Examples¶
Cookie Refresh¶
When a site requires session cookies that expire during long crawls:
```toml
[hooks]
on_failure = "python scripts/refresh_cookies.py {name}"
max_hook_retries = 5
hook_timeout = 60.0
```
```python
# scripts/refresh_cookies.py
import sys

name = sys.argv[1] if len(sys.argv) > 1 else ""

# Your cookie refresh logic here
# (selenium login, API auth, etc.)
new_cookie = get_fresh_cookie()  # placeholder for your auth flow

# Update the config file with the new cookie.
# (databrew reloads the config after a successful recovery,
# so changes written here take effect immediately.)
config_path = f"configs/{name}.toml"
# ... update headers in config ...
```
Proxy Rotation¶
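The overview lists proxy rotation as an `on_failure` use case. One hedged pattern is a recovery script that cycles through a proxy pool and writes the next proxy back into the config, which databrew reloads after a successful recovery. The script name, pool, and helper below are illustrative, not part of databrew:

```python
# scripts/rotate_proxy.py (illustrative)
PROXIES = ["http://p1:8080", "http://p2:8080", "http://p3:8080"]

def next_proxy(current: str) -> str:
    """Return the proxy after `current`, wrapping around the pool.

    Unknown values fall back to the first proxy, so the script is
    safe to run even when the config has no proxy set yet.
    """
    i = PROXIES.index(current) if current in PROXIES else -1
    return PROXIES[(i + 1) % len(PROXIES)]
```

Wired up with `on_failure = "python scripts/rotate_proxy.py {name}"` in `[hooks]`, the script would read the current proxy from the config, compute `next_proxy`, and write it back before exiting 0.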
Slack Notification¶
```toml
[hooks]
on_complete = "curl -X POST -d '{\"text\": \"{name}: {items} items extracted\"}' $SLACK_WEBHOOK"
```
Simple Logging¶
```bash
databrew run mysite.toml \
  --on-start "echo 'Starting crawl for {name}'" \
  --on-failure "echo 'Recovery needed for {name} after {failures} failures'" \
  --on-complete "echo 'Done: {name} extracted {items} items'"
```
Dry Run¶
Use --dry-run to verify hooks are configured correctly:
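A minimal invocation (the output format may vary by version):

```bash
databrew run mysite.toml --dry-run
```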
The output includes a Hooks section showing the resolved commands.
Programmatic Usage¶
Hooks are implemented as async callbacks on the Orchestrator. You can use them directly in Python:
```python
import asyncio

from databrew import Orchestrator, load_config, create_components, HookContext, run_hook

config = load_config("mysite.toml")
components = create_components(config)

async def on_failure():
    ctx = HookContext(name=config.name, failures=10)
    return await run_hook("python scripts/recover.py {name}", ctx)

async def on_complete(result):
    print(f"Done: {result.stats.items_extracted} items")

orchestrator = Orchestrator(
    store=components.store,
    fetcher=components.fetcher,
    extractor=components.extractor,
    policy=components.policy,
    on_failure=on_failure,
    on_complete=on_complete,
    max_hook_retries=3,
)

result = asyncio.run(orchestrator.run())
```
The orchestrator accepts these callbacks:
| Parameter | Signature | Description |
|---|---|---|
| `on_start` | `() -> bool` | Return `False` to abort |
| `on_failure` | `() -> bool` | Return `True` to resume |
| `on_complete` | `(CrawlResult) -> None` | Called after crawl ends |
| `on_recover` | `() -> None` | Called after successful `on_failure` (e.g., recreate fetcher) |
| `max_hook_retries` | `int` | Max `on_failure` invocations per crawl |
Best Practices¶
1. Keep hooks fast¶
Hooks run between batches, blocking the crawl. Use hook_timeout to prevent hangs:
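For example, a recovery script that should finish quickly can get a much tighter limit than the 300-second default:

```toml
[hooks]
hook_timeout = 60.0  # kill a hung hook after 60 seconds
```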
2. Make recovery scripts idempotent¶
The on_failure hook may fire multiple times. Ensure your recovery script handles being run repeatedly.
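One pattern (a sketch; the freshness check and field names are assumptions about your auth flow, not a databrew API) is to skip the refresh when the current credentials still look valid, so repeated invocations do the expensive work at most once:

```python
import time

def refresh_if_stale(state: dict, max_age_seconds: float = 600.0) -> bool:
    """Refresh only when the stored credential is missing or old.

    `state` holds 'cookie' and 'refreshed_at'; returns True if a
    refresh ran.  Safe to call repeatedly: back-to-back calls are
    no-ops while the credential is fresh.
    """
    age = time.time() - state.get("refreshed_at", 0.0)
    if state.get("cookie") and age < max_age_seconds:
        return False  # still fresh; nothing to do
    state["cookie"] = "new-session-cookie"  # placeholder for real auth
    state["refreshed_at"] = time.time()
    return True
```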
3. Test with dry-run first¶
4. Use config reload for recovery¶
Since databrew reloads config after on_failure succeeds, your recovery script can modify the TOML file (e.g., update cookies in [fetch.headers]) and the changes take effect immediately.
5. Start with conservative retries¶
If your recovery script fails consistently, the crawl should stop so you can investigate.
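A conservative sketch:

```toml
[hooks]
max_hook_retries = 2  # let on_failure fire at most twice, then stop
```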