Examples¶
This page shows complete examples from the databrew repository. These configs are used in production and demonstrate common patterns.
HTML Extraction: Real Estate Site¶
A real estate listing site with detail pages. Demonstrates:
- Key-value pair extraction
- Derived fields
- Multiple value extraction
- JSON-LD parsing
- Per-page item extraction
name = "realethio"
start_urls = ["https://realethio.com/search/for-sale"]
[extract]
type = "html"
[extract.items]
selector = "" # Whole page is one item (detail pages)
id = "property_id"
[extract.items.fields]
title = ".listing-title"
price = { selector = ".price", parser = "parse_price", required = true }
address = ".address"
description = { selector = ".description", parser = "squish" }
images = { selector = ".gallery img", attribute = "src", multiple = true }
# Key-value extraction
details = { selector = ".detail-wrap li", keys = "strong", values = "span" }
# JSON-LD
date_published = { selector = "script[type='application/ld+json']", parser = "ldjson:datePublished" }
[extract.links]
pagination = [".pagination a.next"]
items = [".listing-card a"]
[extract.derived]
property_id = "details.Property ID"
bedrooms = { path = "details.Bedrooms", parser = "parse_int" }
bathrooms = { path = "details.Bathrooms", parser = "parse_int" }
[policy]
concurrency = 3
delay = 1.0
jitter = 0.2
[storage]
path = "data/realethio"
Key Patterns¶
Empty selector for detail pages:
[extract.items]
selector = "" # Whole page is one item (detail pages)
Key-value extraction:
fields.details = { selector = ".detail-wrap li", keys = "strong", values = "span" }
# Extracts: {"Property ID": "12345", "Bedrooms": "3", ...}
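The keys/values pairing amounts to matching both sub-selectors inside each list item and zipping their texts into a dict. A minimal Python sketch of the idea, using a regex stand-in for real CSS matching and hypothetical markup for the `.detail-wrap li` rows:

```python
import re

# Hypothetical markup for the .detail-wrap li rows.
html = """
<li><strong>Property ID</strong><span>12345</span></li>
<li><strong>Bedrooms</strong><span>3</span></li>
<li><strong>Bathrooms</strong><span>2</span></li>
"""

# Pair each key element's text with its sibling value element's text.
details = dict(re.findall(r"<strong>(.*?)</strong><span>(.*?)</span>", html))
# details == {"Property ID": "12345", "Bedrooms": "3", "Bathrooms": "2"}
```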
Derived fields from nested data:
[extract.derived]
property_id = "details.Property ID"
bedrooms = { path = "details.Bedrooms", parser = "parse_int" }
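A derived-field path like `details.Property ID` descends one dictionary level per dot-separated segment, so a segment itself may contain spaces. A minimal sketch of that lookup (assumed semantics, not databrew's actual implementation):

```python
def resolve(obj, path):
    """Walk nested dicts one dot-separated segment at a time."""
    for segment in path.split("."):
        if not isinstance(obj, dict):
            return None
        obj = obj.get(segment)
    return obj

# Item shape produced by the key-value extraction above.
item = {"details": {"Property ID": "12345", "Bedrooms": "3"}}

property_id = resolve(item, "details.Property ID")   # "12345"
bedrooms = int(resolve(item, "details.Bedrooms"))    # 3, parse_int-style conversion
```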
JSON-LD extraction:
fields.date_published = { selector = "script[type='application/ld+json']", parser = "ldjson:datePublished" }
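The `ldjson:datePublished` parser presumably loads the matched script tag's JSON body and reads the named key. Roughly, with a hypothetical page snippet and only the standard library:

```python
import json
import re

# Hypothetical JSON-LD block as it might appear in a listing page.
html = """
<script type="application/ld+json">
{"@type": "RealEstateListing", "datePublished": "2024-05-01"}
</script>
"""

# Grab the script body, parse it as JSON, then read one key.
match = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
date_published = json.loads(match.group(1))["datePublished"]
# date_published == "2024-05-01"
```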
JSON API Extraction: Classifieds¶
A JSON API with listing and detail endpoints. Demonstrates:
- List + detail page pattern
- Loading URLs from file
- Building item URLs from IDs
- Full response export
- Custom headers
name = "engocha"
start_urls = { file = "examples/engocha_categories.txt" }
[extract]
type = "json"
items_from = "item" # Only save from detail pages
[extract.items]
path = "" # Full JSON response is the item
id = "listing.ListingID"
[extract.links]
pagination = ["listings.next_page_url"]
items_path = "listings.data"
items_id = "ListingID"
items_url = "https://engocha.com/api/v1/classifieds/{id}"
[policy]
concurrency = 8
delay = 1.0
jitter = 0.2
[fetch.headers]
Content-Type = "application/json"
User-Agent = "MyApp/1.0"
[storage]
path = "data/engocha"
Key Patterns¶
Loading URLs from file:
start_urls = { file = "examples/engocha_categories.txt" }
Full response export (no field mapping):
[extract.items]
path = "" # Full JSON response is the item
id = "listing.ListingID" # Path to ID in the response
Building item URLs from listing data:
[extract.links]
items_path = "listings.data" # Array of items in listing response
items_id = "ListingID" # ID field in each item
items_url = "https://engocha.com/api/v1/classifieds/{id}" # URL template
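Taken together, the three keys presumably mean: take the array found at `items_path`, read the `items_id` field from each entry, and substitute it into the `{id}` placeholder of `items_url`. A sketch with made-up listing data:

```python
# Hypothetical listing-page response, shaped like listings.data above.
listing_page = {"listings": {"data": [{"ListingID": 101}, {"ListingID": 102}]}}

template = "https://engocha.com/api/v1/classifieds/{id}"
urls = [template.format(id=row["ListingID"]) for row in listing_page["listings"]["data"]]
# urls == ["https://engocha.com/api/v1/classifieds/101",
#          "https://engocha.com/api/v1/classifieds/102"]
```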
Browser-Based Extraction: JavaScript Site¶
A JavaScript-heavy site requiring browser rendering. Demonstrates:
- Pydoll browser fetcher
- Wait strategies for dynamic content
- Lower concurrency for browser resource management
name = "ethiopiapropertycentre"
start_urls = ["https://ethiopiapropertycentre.com/for-sale"]
[extract]
type = "html"
[extract.items]
selector = ".property-card"
[extract.items.fields]
title = ".title"
price = { selector = ".price", parser = "parse_price" }
address = ".location"
url = { selector = "a", attribute = "href" }
# Multiple key-value extractions
additional_info = { selector = "ul.aux-info li", keys = "span[itemprop=name]", values = "span[itemprop=value]" }
[extract.links]
pagination = [".pagination a.next"]
items = [".property-card a"]
[fetch]
type = "pydoll"
[fetch.browser]
headless = false # Visible browser for debugging
page_load_timeout = 60.0 # Long timeout for slow sites
wait_for_network_idle = true # Wait for AJAX to complete
network_idle_time = 5.0 # How long to wait
[policy]
concurrency = 2 # Fewer tabs for browser
delay = 2.0
jitter = 0.5
[storage]
path = "data/ethiopiapropertycentre"
Key Patterns¶
Browser fetcher with wait strategies:
[fetch]
type = "pydoll"
[fetch.browser]
headless = false # Visible browser
page_load_timeout = 60.0 # Long timeout for slow sites
wait_for_network_idle = true # Wait for AJAX to complete
network_idle_time = 5.0 # How long to wait
Multiple key-value extractions:
[extract.items.fields.additional_info]
selector = "ul.aux-info li"
keys = "span[itemprop=name]"
values = "span[itemprop=value]"
Running the Examples¶
Test with Limited Requests¶
# Run with limit for testing
databrew run examples/realethio.toml -n 10
# Dry run to preview
databrew run examples/realethio.toml --dry-run
Check Status¶
Export Data¶
Tips for Writing Configs¶
- Start with databrew init - generates a template to customize
- Test with -n 10 - limit requests while developing
- Use --dry-run - preview without fetching
- Check with databrew check -v - validate before running
- Run with -v - verbose output helps debug extraction
- Use an id field - enables deduplication and incremental updates
- Inspect the HTML - use browser dev tools to find selectors