
Performance Tuning

This guide helps you make crawls faster while keeping results stable and respectful to target sites.

Quick Tuning Checklist

  1. Set concurrency to match site tolerance and your machine limits.
  2. Keep delay small but non-zero for fragile sites.
  3. Use items_from = "item" when listing pages are only for discovery.
  4. Configure id in [extract.items] for reliable deduplication.
  5. Start with -n limits and scale gradually.
  6. Tune [storage].max_pending_items to reduce tiny parquet parts.
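The checklist above maps onto a config sketch like the following. The key names come from this guide, but the section placement of items_from and all values shown are illustrative assumptions; check them against your schema:

```toml
[policy]
concurrency = 5      # match site tolerance and machine limits
delay = 0.2          # small but non-zero
jitter = 0.1

[extract]
items_from = "item"  # listing pages used for discovery only (section placement assumed)

[extract.items]
id = "url"           # stable field for deduplication (field name illustrative)

[storage]
max_pending_items = 500  # larger buffers mean fewer tiny parquet parts (value illustrative)
```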

Small/fragile sites

[policy]
concurrency = 2
delay = 1.0
jitter = 0.2
max_retries = 3

Typical production crawls

[policy]
concurrency = 5
delay = 0.2
jitter = 0.1
max_retries = 3

High-throughput internal/API crawls

[policy]
concurrency = 10
delay = 0.0
jitter = 0.0
max_retries = 2

Use high-throughput settings only when the upstream service allows it.
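To see how delay and jitter in the [policy] blocks above interact, here is a minimal sketch. The exact jitter semantics are an assumption (modeled as an absolute ± range around delay); Databrew's implementation may differ:

```python
import random


def next_delay(delay: float, jitter: float) -> float:
    """Sketch: per-request pause with random jitter.

    Assumes jitter is an absolute +/- range around delay,
    clamped so the pause never goes negative.
    """
    low = max(0.0, delay - jitter)
    high = delay + jitter
    return random.uniform(low, high)


# With delay = 1.0 and jitter = 0.2, requests pause between 0.8s and 1.2s,
# which spreads load instead of hitting the site on a fixed rhythm.
```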

Where Time Usually Goes

  • Network wait: server latency, throttling, timeouts.
  • Extractor cost: heavy selectors, parser complexity, large responses.
  • Storage cost: item serialization + Parquet writes.
  • Queue churn: large bursts of discovered links.

Databrew batches queue inserts and writes Parquet as append-only part files, which helps both queue and storage throughput at scale. Items are first persisted to a SQLite-backed pending write-ahead log (WAL) and only then flushed to Parquet, so larger buffers do not lose unflushed data on restart.
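The batching described above is not Databrew's internal code, but the general pattern can be sketched with SQLite (table and column names hypothetical):

```python
import sqlite3


def flush_batch(conn: sqlite3.Connection, urls: list[str]) -> int:
    """One transaction plus executemany per batch of discovered links,
    instead of one INSERT per link."""
    with conn:  # commits the whole batch atomically
        conn.executemany(
            "INSERT OR IGNORE INTO queue(url) VALUES (?)",
            [(u,) for u in urls],
        )
    return conn.execute("SELECT COUNT(*) FROM queue").fetchone()[0]


# In-memory DB for the demo; a real crawler would use a file-backed DB
# with PRAGMA journal_mode=WAL so readers don't block the batch writer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue(url TEXT PRIMARY KEY)")
total = flush_batch(conn, ["https://a/1", "https://a/2", "https://a/1"])
# the duplicate link is dropped by the PRIMARY KEY, so total == 2
```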

Parquet Compaction

Databrew writes append-only part files during crawling. To merge many small files into fewer larger files, use the itemstore.compact_storage API while the crawler is stopped:

from pathlib import Path
from itemstore import compact_storage

# Preview only
result = compact_storage(storage_path=Path("data/mysite"), dry_run=True)

# Compact all part files
result = compact_storage(storage_path=Path("data/mysite"))

# Compact with custom compression and target size
result = compact_storage(
    storage_path=Path("data/mysite"),
    compression="zstd",
    target_max_file_mb=90,
)

Use target_max_file_mb to keep compacted files under your Git safety threshold (for example, below GitHub's 100 MB per-file limit).

Practical Workflow

  1. Validate config on a small run: databrew run mysite.toml -n 100 --dry-run
  2. Run a controlled crawl: databrew run mysite.toml -n 1000 -c 5
  3. Check status: databrew status mysite.toml
  4. Increase concurrency gradually while watching failure rate and 429s.

Browser Fetching Notes

Browser mode (fetch.type = "pydoll") is slower and heavier than httpx.

  • Keep concurrency lower (often 2-4).
  • Use wait_for_selector only when needed.
  • Avoid large wait_after_load unless site behavior requires it.
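A browser-mode config sketch following the notes above. The key names are taken from this page, but the selector and values are illustrative assumptions:

```toml
[fetch]
type = "pydoll"
wait_for_selector = "#content"   # only when the page hydrates late (selector hypothetical)
# wait_after_load = 2.0          # avoid unless the site needs it

[policy]
concurrency = 3                  # browsers are heavy; stay in the 2-4 range
```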

Safety and Stability

Fast crawls are only useful when they are correct and repeatable.

  • Keep retries enabled for transient failures.
  • Monitor urls_failed and error_rate.
  • Prefer incremental mode for recurring crawls.
  • Sync items/*.parquet; treat .state.db and .index.db as local ephemeral state.
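The sync guidance above can be expressed as a .gitignore sketch, assuming the state databases live alongside the project:

```
# keep items/*.parquet under version control; ignore local ephemeral state
*.state.db
*.index.db
```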