Incremental Crawling¶
Databrew is designed for incremental crawling—efficiently updating your data by only fetching what's new.
Resume Support¶
Automatic Resume¶
Databrew automatically tracks progress. If a crawl is interrupted, simply run again:
# First run (interrupted at 500 items)
databrew run mysite.toml
# Crawl interrupted (Ctrl+C, network error, etc.)
# Resume automatically
databrew run mysite.toml
# Resuming: 234 URLs pending
No special flags needed—databrew detects pending URLs and continues.
Fresh Start¶
To force a fresh crawl (re-add start URLs):
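The exact option name depends on your databrew version; the flag below is a hypothetical placeholder, so check the CLI help for the real one:
# Re-add the start URLs (--fresh is a hypothetical flag name)
databrew run mysite.toml --fresh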
This adds start_urls again but keeps existing data.
Reset Everything¶
To start completely over:
# Reset URL queue only (keep items)
databrew reset mysite.toml
# Delete everything (queue + items)
databrew reset mysite.toml --all
Per-Branch Incremental Stopping¶
When re-running a crawl over existing data, databrew stops each pagination branch on its own, as soon as that branch has caught up with previously stored items.
How It Works¶
Each pagination chain stops independently when it encounters a page where all item links already exist in storage:
Seed URL 1 (Category A)
→ Page 1: 5 new items, continue...
→ Page 2: 3 new items, continue...
→ Page 3: 0 new items (all exist), STOP this branch
Seed URL 2 (Category B)
→ Page 1: 10 new items, continue...
→ Page 2: 8 new items, continue...
→ ... continues independently
Multi-Seed Crawls¶
This is particularly useful for multi-seed crawls (e.g., 100+ category URLs):
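As an illustrative sketch (the URLs are made up; start_urls is the key referenced earlier on this page, though its exact placement in the file may differ):
# mysite.toml (excerpt); URLs are illustrative only
start_urls = [
  "https://example.com/category/books",
  "https://example.com/category/music",
  "https://example.com/category/games",
  # ...one seed URL per category, 100+ entries in large crawls
]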
Each category is processed as its own branch. Categories with no new items stop quickly, while active categories continue to full depth.
Detection Logic¶
A pagination page triggers the "caught up" stop for its branch when all of the following hold:
- The crawl is incremental (items already exist in storage)
- The page is a pagination-type URL
- The page has item links (it is not empty)
- All of those item links already exist in storage
Fresh vs. Incremental¶
- Fresh crawl: All pagination is followed (caught-up detection disabled)
- Incremental crawl: Per-branch stopping is active
The mode is determined automatically based on whether items exist in storage.
Cross-Run Retry¶
Item URLs that fail (after exhausting retries) are automatically retried on subsequent runs.
Why Only Item URLs?¶
- Pagination pages hold dynamic data: the same page number lists different items over time, so retrying a stale pagination URL is not meaningful
- Item pages hold static data: an item's content does not change, so a failed fetch is worth retrying later
Retry Progression¶
Run 1: Item URL fails after 3 retries → status='failed', failed_runs=1
Run 2: Reset to pending, retry → fails again → failed_runs=2
Run 3: Reset to pending, retry → fails again → status='permanently_failed'
After 3 failed runs, the URL is marked permanently_failed and won't be retried.
Checking Failed URLs¶
databrew status mysite.toml
# URLs failed: 5 # Will retry next run
# URLs permanently failed: 2 # Exhausted all retries
Full Crawl Mode¶
To crawl all pagination regardless of existing data:
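The flag below is a hypothetical stand-in for whatever your databrew version calls this mode:
# Follow all pagination (--full is a hypothetical flag name)
databrew run mysite.toml --full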
This disables caught-up detection but still skips existing items.
Use this when:
- You want to verify no items were missed
- The site structure changed
- You need to re-crawl all pages for updated content
Incremental Exports¶
Export only items extracted since a specific time:
# Export items since a date
databrew export mysite.toml -o new_items.jsonl --since "2026-01-20"
# Export items since a timestamp
databrew export mysite.toml -o new_items.jsonl --since "2026-01-20T14:30:00"
This is useful for:
- Daily/weekly data syncs
- Streaming new items to another system
- Generating delta files
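For example, a daily delta job can combine the --since flag shown above with the shell's date command (GNU date assumed for -d; filenames illustrative):
# Daily delta export, e.g. from cron (requires GNU date for -d)
databrew export mysite.toml -o "delta_$(date +%F).jsonl" --since "$(date -d yesterday +%F)"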
State Management¶
State File Location¶
State is stored in state.db in the output directory:
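Assuming the output directory is named output/ (the actual name comes from your config), the layout looks like:
output/
└── state.db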
Backing Up State¶
The state file is a standard SQLite database. Back it up like any file:
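A plain copy works while no crawl is running; for a consistent copy of a database that may be in use, sqlite3's .backup command is the standard tool (paths illustrative):
# Plain copy; stop the crawl first
cp output/state.db backups/state-$(date +%F).db
# Online backup via sqlite3, safe while the database is in use
sqlite3 output/state.db ".backup backups/state-backup.db"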
Importing Data¶
Repopulate state from exported data:
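Whether an import subcommand exists, and what it is called, depends on your databrew version; the invocation below is a hypothetical sketch only:
# Hypothetical subcommand and arguments; check the CLI help
databrew import mysite.toml exported_items.jsonl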
Best Practices¶
1. Use Item IDs¶
Always configure an ID field for proper deduplication:
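How the ID field is declared depends on your config schema; the key name below is a hypothetical placeholder, illustrating that some stable per-item field must be mapped as the unique ID:
# mysite.toml (excerpt); id_field is a hypothetical key name
id_field = "sku"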
2. Start Small¶
Test with limited requests before full crawl:
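The limiting flag is a hypothetical name here; the idea is to cap the first run at a small number of requests:
# --max-requests is a hypothetical flag name; check the CLI help
databrew run mysite.toml --max-requests 50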
3. Monitor Progress¶
Check status regularly:
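# Reports counts such as pending, failed, and permanently failed URLs
databrew status mysite.toml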
4. Export Regularly¶
Export data periodically to avoid losing work:
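Using the export command shown above (output filename illustrative):
# Periodic full export
databrew export mysite.toml -o backup.jsonl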
5. Handle Failures¶
If too many URLs fail:
- Check the site for changes
- Review your config selectors
- Consider increasing retries or delays, as sketched after this list
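If your config exposes retry and delay settings, the tuning could look like this; both key names are hypothetical:
# mysite.toml (excerpt); max_retries and delay_seconds are hypothetical key names
max_retries = 5        # in-run retries per URL (the examples above assume 3)
delay_seconds = 2.0    # pause between requests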