Quick Start¶
This guide walks you through creating a config file and running your first extraction.
Generate a Starter Config¶
Use the init command to generate a starter config:
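For example, assuming the site name and start URL can be passed directly (the --url flag shown here is an assumption):
# --url is an assumption; use the interactive form below if your version differs
databrew init mysite --url https://example.com/products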
This creates mysite.toml with a template you can customize.
Alternatively, run databrew init without arguments for interactive prompts:
databrew init
# Site name: mysite
# Start URL: https://example.com/products
# Created config: mysite.toml
Understanding the Config¶
A minimal config defines the site name and start URLs, plus four main sections: what to extract, the item extraction rules, which links to follow, and the crawl policy:
# Site identifier and starting point
name = "mysite"
start_urls = ["https://example.com/products"]
# What to extract
[extract]
type = "html" # or "json" for APIs
# Item extraction rules
[extract.items]
selector = ".product-card" # CSS selector for item containers
[extract.items.fields]
title = "h2.title" # Simple selector
price = { selector = ".price", parser = "parse_price" }
url = { selector = "a", attribute = "href" }
# Links to follow
[extract.links]
pagination = ["a.next-page"] # Next page links
items = [".product-card a"] # Detail page links
# Crawl behavior
[policy]
concurrency = 5
max_retries = 3
Validate Your Config¶
Before running, validate the config:
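Assuming a validate subcommand that takes the config path, like the other commands in this guide:
# the validate subcommand name is assumed
databrew validate mysite.toml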
Add -v for more details:
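databrew validate mysite.toml -v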
Run a Test Crawl¶
Start with a limited crawl to test your extraction rules:
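If the crawl command is called run (an assumption used for the rest of this guide):
# the run subcommand name is assumed
databrew run mysite.toml -n 10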
The -n 10 flag limits the crawl to 10 requests.
Dry Run¶
Preview what would happen without fetching:
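If the same command accepts a --dry-run flag:
# --dry-run is an assumption
databrew run mysite.toml --dry-run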
Check Status¶
See the crawl progress:
databrew status mysite.toml
# Status for: mysite
# Items stored: 42
# URLs pending: 15
# URLs completed: 53
# URLs failed: 2
Export Data¶
Export extracted items to different formats:
# JSONL (one JSON object per line)
databrew export mysite.toml -o products.jsonl
# JSON array
databrew export mysite.toml -o products.json
# Parquet (requires analytics extra)
databrew export mysite.toml -o products.parquet
Resume an Interrupted Crawl¶
Databrew automatically tracks progress. If a crawl is interrupted, simply run the same command again:
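# picks up the pending URLs from the stored state (run subcommand assumed, as above)
databrew run mysite.toml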
To force a fresh start:
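For example, with a hypothetical --fresh flag that clears the stored state:
# --fresh is a placeholder; check your version for the actual option
databrew run mysite.toml --fresh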
Example: Real Estate Site¶
Here's a complete example for a real estate listing site:
name = "realestate"
start_urls = ["https://example.com/listings?page=1"]
[extract]
type = "html"
[extract.items]
selector = "" # Empty = whole page is one item (detail pages)
id = "property_id" # Field for deduplication
[extract.items.fields]
title = ".listing-title"
price = { selector = ".price", parser = "parse_price", required = true }
address = ".address"
bedrooms = { selector = ".beds", parser = "parse_int" }
bathrooms = { selector = ".baths", parser = "parse_int" }
description = ".description"
images = { selector = ".gallery img", attribute = "src", multiple = true }
# Key-value extraction for property details
details = { selector = ".details li", keys = "strong", values = "span" }
[extract.links]
pagination = [".pagination a.next"]
items = [".listing-card a.view-details"]
# Derived fields from nested data
[extract.derived]
property_id = "details.Property ID"
lot_size = { path = "details.Lot Size", parser = "squish" }
[policy]
concurrency = 3
delay = 1.0
jitter = 0.2
max_retries = 3
[storage]
path = "data/realestate"
Run it:
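Assuming the config was saved as realestate.toml:
# filename and run subcommand are assumptions
databrew run realestate.toml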
Next Steps¶
- Core Concepts - Understand URL types, extractors, and the crawl lifecycle
- HTML Extraction Guide - Deep dive into CSS selectors and field extraction
- CLI Reference - All available commands and options