Lifecycle Hooks¶
Lifecycle hooks let you run shell commands at key points during a crawl. This enables automated recovery from transient failures (expired cookies, blocked IPs, rate limits) without manual intervention or wrapper scripts.
Overview¶
| Hook | When | Use Cases |
|---|---|---|
| `on_start` | Before crawl begins | Preflight checks, login scripts, cache warming |
| `on_failure` | On consecutive-failure stop | Refresh cookies, rotate proxies, restart VPN |
| `on_complete` | After crawl finishes | Notifications, data export, cleanup |
Quick Start¶
When the crawl hits `max_consecutive_failures`, databrew runs your `on_failure` script instead of stopping. If the script exits 0, the crawl resets its failure counter and resumes.
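A minimal sketch of that setup (the exact location of `max_consecutive_failures` in the config depends on your layout; the script path is illustrative):

```toml
# Illustrative placement; adjust to your config layout.
max_consecutive_failures = 10

[hooks]
on_failure = "python scripts/recover.py {name}"
```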
Configuration¶
TOML Config¶
```toml
[hooks]
on_start = "echo Starting {name}"
on_failure = "python scripts/recover.py {name}"
on_complete = "python scripts/notify.py {name} {items}"
max_hook_retries = 3   # Max times on_failure can fire (default: 3)
hook_timeout = 300.0   # Timeout per hook in seconds (default: 300)
```
CLI Overrides¶
CLI flags override config values:
```bash
# Override on_failure from config
databrew run mysite.toml --on-failure "python scripts/recover.py {name}"

# Add hooks to a config that has none
databrew run mysite.toml --on-start "echo start" --on-complete "echo done"

# Override just one hook
databrew run mysite.toml --on-failure "echo recovery attempt"
```
Template Variables¶
Hook commands support these template variables:
| Variable | Description |
|---|---|
| `{name}` | Site name from config |
| `{failures}` | Current consecutive failure count |
| `{items}` | Total items extracted so far |
| `{requests}` | Total requests processed so far |
```toml
[hooks]
on_failure = "echo '{name} failed {failures} times after {requests} requests'"
on_complete = "echo '{name}: {items} items in {requests} requests'"
```
Hook Behavior¶
on_start¶
Runs before the crawl begins. If the command exits non-zero, the crawl aborts.
Use cases:
- Validate that credentials are fresh
- Check that the target site is reachable
- Warm up caches or sessions
on_failure¶
Runs when the crawl hits `max_consecutive_failures`. This is the primary recovery hook.
Recovery flow:
```
Crawl running → 10 consecutive failures → on_failure fires
                         ↓
                Exit 0 (success)?
                ├── Yes: Reset failure counter, reload config, resume crawl
                └── No:  Stop crawl (same as no hook)
```
After recovery, databrew:
- Resets the consecutive failure counter to 0
- Reloads config and recreates the fetcher — so your recovery script can update headers, cookies, or proxy settings in the config file and they take effect immediately
- Resumes the crawl loop
The hook fires up to max_hook_retries times per crawl. After that, the crawl stops normally.
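The flow above can be sketched in plain Python. This is a toy model of the loop, not databrew internals; the function and parameter names are illustrative:

```python
def crawl_loop(batches, max_consecutive_failures=10,
               on_failure=None, max_hook_retries=3):
    """Toy model of the failure/recovery loop described above.

    `batches` is an iterable of (ok, item) pairs standing in for
    fetch results.  Returns the items collected before stopping.
    """
    consecutive_failures = 0
    hook_calls = 0
    items = []
    for ok, item in batches:
        if ok:
            consecutive_failures = 0   # any success resets the counter
            items.append(item)
            continue
        consecutive_failures += 1
        if consecutive_failures >= max_consecutive_failures:
            # Fire on_failure, but only up to max_hook_retries times.
            if on_failure and hook_calls < max_hook_retries and on_failure():
                hook_calls += 1
                consecutive_failures = 0   # recovered: reset and resume
            else:
                break                      # stop, same as having no hook
    return items
```

The key behaviors from the list above are visible here: a successful hook resets the counter and resumes the loop, a failing hook (or an exhausted retry budget) stops the crawl.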
on_complete¶
Runs after the crawl finishes, regardless of how it stopped (success, failure, or abort). The CrawlResult.stopped_reason tells you what happened.
Use cases:
- Send notifications (Slack, email)
- Trigger data pipelines
- Upload results to cloud storage
- Log crawl metrics
Examples¶
Cookie Refresh¶
When a site requires session cookies that expire during long crawls:
```toml
[hooks]
on_failure = "python scripts/refresh_cookies.py {name}"
max_hook_retries = 5
hook_timeout = 60.0
```
```python
# scripts/refresh_cookies.py
import sys

name = sys.argv[1] if len(sys.argv) > 1 else ""

# Your cookie refresh logic here
# (selenium login, API auth, etc.)
new_cookie = get_fresh_cookie()  # placeholder for your auth flow

# Update the config file with the new cookie.
# (databrew reloads the config after a successful recovery,
# so changes written here take effect immediately.)
config_path = f"configs/{name}.toml"
# ... update headers in config ...
```
Proxy Rotation¶
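The overview lists proxy rotation as an `on_failure` use case. One hedged pattern is a recovery script that cycles through a proxy pool and writes the next proxy back into the config, which databrew reloads after a successful recovery. The script name, pool, and helper below are illustrative, not part of databrew:

```python
# scripts/rotate_proxy.py (illustrative)
PROXIES = ["http://p1:8080", "http://p2:8080", "http://p3:8080"]

def next_proxy(current: str) -> str:
    """Return the proxy after `current`, wrapping around the pool.

    Unknown values fall back to the first proxy, so the script is
    safe to run even when the config has no proxy set yet.
    """
    i = PROXIES.index(current) if current in PROXIES else -1
    return PROXIES[(i + 1) % len(PROXIES)]
```

Wired up with `on_failure = "python scripts/rotate_proxy.py {name}"` in `[hooks]`, the script would read the current proxy from the config, compute `next_proxy`, and write it back before exiting 0.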
Slack Notification¶
```toml
[hooks]
on_complete = "curl -X POST -d '{\"text\": \"{name}: {items} items extracted\"}' $SLACK_WEBHOOK"
```
Simple Logging¶
```bash
databrew run mysite.toml \
  --on-start "echo 'Starting crawl for {name}'" \
  --on-failure "echo 'Recovery needed for {name} after {failures} failures'" \
  --on-complete "echo 'Done: {name} extracted {items} items'"
```
Dry Run¶
Use --dry-run to verify hooks are configured correctly:
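A minimal invocation (the output format may vary by version):

```bash
databrew run mysite.toml --dry-run
```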
The output includes a Hooks section showing the resolved commands.
Programmatic Usage¶
Hooks are implemented as async callbacks on the Orchestrator. You can use them directly in Python:
```python
import asyncio

from databrew import Orchestrator, load_config, create_components, HookContext, run_hook

config = load_config("mysite.toml")
components = create_components(config)

async def on_failure():
    ctx = HookContext(name=config.name, failures=10)
    return await run_hook("python scripts/recover.py {name}", ctx)

async def on_complete(result):
    print(f"Done: {result.stats.items_extracted} items")

orchestrator = Orchestrator(
    store=components.store,
    fetcher=components.fetcher,
    extractor=components.extractor,
    policy=components.policy,
    on_failure=on_failure,
    on_complete=on_complete,
    max_hook_retries=3,
)

result = asyncio.run(orchestrator.run())
```
The orchestrator accepts these callbacks:
| Parameter | Signature | Description |
|---|---|---|
| `on_start` | `() -> bool` | Return `False` to abort |
| `on_failure` | `() -> bool` | Return `True` to resume |
| `on_complete` | `(CrawlResult) -> None` | Called after crawl ends |
| `on_recover` | `() -> None` | Called after successful `on_failure` (e.g., recreate fetcher) |
| `max_hook_retries` | `int` | Max `on_failure` invocations per crawl |
Best Practices¶
1. Keep hooks fast¶
Hooks run between batches, blocking the crawl. Use hook_timeout to prevent hangs:
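For example, a recovery script that should finish quickly can get a much tighter limit than the 300-second default:

```toml
[hooks]
hook_timeout = 60.0  # kill a hung hook after 60 seconds
```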
2. Make recovery scripts idempotent¶
The on_failure hook may fire multiple times. Ensure your recovery script handles being run repeatedly.
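One pattern (a sketch; the freshness check and field names are assumptions about your auth flow, not a databrew API) is to skip the refresh when the current credentials still look valid, so repeated invocations do the expensive work at most once:

```python
import time

def refresh_if_stale(state: dict, max_age_seconds: float = 600.0) -> bool:
    """Refresh only when the stored credential is missing or old.

    `state` holds 'cookie' and 'refreshed_at'; returns True if a
    refresh ran.  Safe to call repeatedly: back-to-back calls are
    no-ops while the credential is fresh.
    """
    age = time.time() - state.get("refreshed_at", 0.0)
    if state.get("cookie") and age < max_age_seconds:
        return False  # still fresh; nothing to do
    state["cookie"] = "new-session-cookie"  # placeholder for real auth
    state["refreshed_at"] = time.time()
    return True
```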
3. Test with dry-run first¶
4. Use config reload for recovery¶
Since databrew reloads config after on_failure succeeds, your recovery script can modify the TOML file (e.g., update cookies in [fetch.headers]) and the changes take effect immediately.
5. Start with conservative retries¶
If your recovery script fails consistently, the crawl should stop so you can investigate.
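A conservative sketch:

```toml
[hooks]
max_hook_retries = 2  # let on_failure fire at most twice, then stop
```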