Examples¶
This page shows complete examples from the databrew repository. These configs are used in production and demonstrate common patterns.
HTML Extraction: Real Estate Site¶
A real estate listing site with detail pages. Demonstrates:
- Key-value pair extraction
- Derived fields
- Multiple value extraction
- JSON-LD parsing
- Per-page item extraction
name = "realethio"
start_urls = ["https://realethio.com/search/for-sale"]
[extract]
type = "html"
[extract.items]
selector = "" # Whole page is one item (detail pages)
id = "property_id"
[extract.items.fields]
title = ".listing-title"
price = { selector = ".price", parser = "parse_price", required = true }
address = ".address"
description = { selector = ".description", parser = "squish" }
images = { selector = ".gallery img", attribute = "src", multiple = true }
# Key-value extraction
details = { selector = ".detail-wrap li", keys = "strong", values = "span" }
# JSON-LD
date_published = { selector = "script[type='application/ld+json']", parser = "ldjson:datePublished" }
[extract.links]
pagination = [".pagination a.next"]
items = [".listing-card a"]
[extract.derived]
property_id = "details.Property ID"
bedrooms = { path = "details.Bedrooms", parser = "parse_int" }
bathrooms = { path = "details.Bathrooms", parser = "parse_int" }
[policy]
concurrency = 3
delay = 1.0
jitter = 0.2
[storage]
path = "data/realethio"
Key Patterns¶
Empty selector for detail pages:
[extract.items]
selector = "" # Whole page is one item (detail pages)
Key-value extraction:
fields.details = { selector = ".detail-wrap li", keys = "strong", values = "span" }
# Extracts: {"Property ID": "12345", "Bedrooms": "3", ...}
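The keys/values pairing amounts to matching both sub-selectors inside each list item and zipping their texts into a dict. A minimal Python sketch of the idea, using a regex stand-in for real CSS matching and hypothetical markup for the `.detail-wrap li` rows:

```python
import re

# Hypothetical markup for the .detail-wrap li rows.
html = """
<li><strong>Property ID</strong><span>12345</span></li>
<li><strong>Bedrooms</strong><span>3</span></li>
<li><strong>Bathrooms</strong><span>2</span></li>
"""

# Pair each key element's text with its sibling value element's text.
details = dict(re.findall(r"<strong>(.*?)</strong><span>(.*?)</span>", html))
# details == {"Property ID": "12345", "Bedrooms": "3", "Bathrooms": "2"}
```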
Derived fields from nested data:
[extract.derived]
property_id = "details.Property ID"
bedrooms = { path = "details.Bedrooms", parser = "parse_int" }
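A derived-field path like `details.Property ID` descends one dictionary level per dot-separated segment, so a segment itself may contain spaces. A minimal sketch of that lookup (assumed semantics, not databrew's actual implementation):

```python
def resolve(obj, path):
    """Walk nested dicts one dot-separated segment at a time."""
    for segment in path.split("."):
        if not isinstance(obj, dict):
            return None
        obj = obj.get(segment)
    return obj

# Item shape produced by the key-value extraction above.
item = {"details": {"Property ID": "12345", "Bedrooms": "3"}}

property_id = resolve(item, "details.Property ID")   # "12345"
bedrooms = int(resolve(item, "details.Bedrooms"))    # 3, parse_int-style conversion
```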
JSON-LD extraction:
fields.date_published = { selector = "script[type='application/ld+json']", parser = "ldjson:datePublished" }
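The `ldjson:datePublished` parser presumably loads the matched script tag's JSON body and reads the named key. Roughly, with a hypothetical page snippet and only the standard library:

```python
import json
import re

# Hypothetical JSON-LD block as it might appear in a listing page.
html = """
<script type="application/ld+json">
{"@type": "RealEstateListing", "datePublished": "2024-05-01"}
</script>
"""

# Grab the script body, parse it as JSON, then read one key.
match = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
date_published = json.loads(match.group(1))["datePublished"]
# date_published == "2024-05-01"
```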
JSON API Extraction: Classifieds¶
A JSON API with listing and detail endpoints. Demonstrates:
- List + detail page pattern
- Loading URLs from file
- Building item URLs from IDs
- Full response export
- Custom headers
name = "engocha"
start_urls = { file = "examples/engocha_categories.txt" }
[extract]
type = "json"
items_from = "item" # Only save from detail pages
[extract.items]
path = "" # Full JSON response is the item
id = "listing.ListingID"
[extract.links]
pagination = ["listings.next_page_url"]
items_path = "listings.data"
items_id = "ListingID"
items_url = "https://engocha.com/api/v1/classifieds/{id}"
[policy]
concurrency = 8
delay = 1.0
jitter = 0.2
[fetch.headers]
Content-Type = "application/json"
User-Agent = "MyApp/1.0"
[storage]
path = "data/engocha"
Key Patterns¶
Loading URLs from file:
start_urls = { file = "examples/engocha_categories.txt" }
Full response export (no field mapping):
[extract.items]
path = "" # Full JSON response is the item
id = "listing.ListingID" # Path to ID in the response
Building item URLs from listing data:
[extract.links]
items_path = "listings.data" # Array of items in listing response
items_id = "ListingID" # ID field in each item
items_url = "https://engocha.com/api/v1/classifieds/{id}" # URL template
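Taken together, the three keys presumably mean: take the array found at `items_path`, read the `items_id` field from each entry, and substitute it into the `{id}` placeholder of `items_url`. A sketch with made-up listing data:

```python
# Hypothetical listing-page response, shaped like listings.data above.
listing_page = {"listings": {"data": [{"ListingID": 101}, {"ListingID": 102}]}}

template = "https://engocha.com/api/v1/classifieds/{id}"
urls = [template.format(id=row["ListingID"]) for row in listing_page["listings"]["data"]]
# urls == ["https://engocha.com/api/v1/classifieds/101",
#          "https://engocha.com/api/v1/classifieds/102"]
```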
Browser-Based Extraction: JavaScript Site¶
A JavaScript-heavy site requiring browser rendering. Demonstrates:
- Pydoll browser fetcher
- Wait strategies for dynamic content
- Lower concurrency for browser resource management
name = "ethiopiapropertycentre"
start_urls = ["https://ethiopiapropertycentre.com/for-sale"]
[extract]
type = "html"
[extract.items]
selector = ".property-card"
[extract.items.fields]
title = ".title"
price = { selector = ".price", parser = "parse_price" }
address = ".location"
url = { selector = "a", attribute = "href" }
# Multiple key-value extractions
additional_info = { selector = "ul.aux-info li", keys = "span[itemprop=name]", values = "span[itemprop=value]" }
[extract.links]
pagination = [".pagination a.next"]
items = [".property-card a"]
[fetch]
type = "pydoll"
[fetch.browser]
headless = false # Visible browser for debugging
page_load_timeout = 60.0 # Long timeout for slow sites
wait_for_network_idle = true # Wait for AJAX to complete
network_idle_time = 5.0 # How long to wait
[policy]
concurrency = 2 # Fewer tabs for browser
delay = 2.0
jitter = 0.5
[storage]
path = "data/ethiopiapropertycentre"
Key Patterns¶
Browser fetcher with wait strategies:
[fetch]
type = "pydoll"
[fetch.browser]
headless = false # Visible browser
page_load_timeout = 60.0 # Long timeout for slow sites
wait_for_network_idle = true # Wait for AJAX to complete
network_idle_time = 5.0 # How long to wait
Multiple key-value extractions:
[extract.items.fields.additional_info]
selector = "ul.aux-info li"
keys = "span[itemprop=name]"
values = "span[itemprop=value]"
Running the Examples¶
Test with Limited Requests¶
# Run with limit for testing
databrew run examples/realethio.toml -n 10
# Dry run to preview
databrew run examples/realethio.toml --dry-run
Check Status¶
Export Data¶
Tips for Writing Configs¶
- Start with databrew init - generates a template to customize
- Test with -n 10 - limit requests while developing
- Use --dry-run - preview without fetching
- Check with databrew check -v - validate before running
- Run with -v - verbose output helps debug extraction
- Use an id field - enables deduplication and incremental updates
- Inspect the HTML - use browser dev tools to find selectors