HTML Extraction
This guide covers HTML extraction using CSS selectors.
Basic Field Extraction
Simple Selector
The simplest form is a CSS selector string, which extracts the text content of the first matching element.
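As a minimal sketch of the string form, the fragment below maps one field to a selector. The field name and selector are illustrative, reused from the other examples in this guide.

```toml
[extract.items.fields]
# Shorthand: the value is the CSS selector itself
title = "h1.title"
```

This should behave like the table form title = { selector = "h1.title" } with every other option left at its default.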
Full Field Config
For more control, use a table:
[extract.items.fields]
title = { selector = "h1.title" }
price = { selector = ".price", parser = "parse_price" }
image = { selector = "img.main", attribute = "src" }
Field Options
| Option | Type | Default | Description |
|---|---|---|---|
| selector | string | required | CSS selector to find the element |
| attribute | string | null | Attribute to extract (null = text content) |
| parser | string | null | Parser function to transform the value |
| required | bool | false | Skip the item if the field is missing |
| multiple | bool | false | Extract all matches as a list |
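The options can be combined on a single field. The sketch below is illustrative only: the field names and selectors are borrowed from other examples in this guide, and pairing attribute with required is an assumption based on how the other options are combined there.

```toml
[extract.items.fields]
# Attribute value that must be present (assumed combination of attribute + required)
url = { selector = "a.product-link", attribute = "href", required = true }
# Parsed text value that must be present
price = { selector = ".price", parser = "parse_price", required = true }
# All matching attribute values as a list
images = { selector = ".gallery img", attribute = "src", multiple = true }
```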
Extracting Attributes
Use the attribute option to extract element attributes instead of text:
[extract.items.fields]
# Link URL
url = { selector = "a.product-link", attribute = "href" }
# Image source
image = { selector = "img.thumbnail", attribute = "src" }
# Data attributes
product_id = { selector = ".product", attribute = "data-id" }
coordinates = { selector = "#map", attribute = "data-coords" }
Multiple Values
Set multiple = true to extract all matching elements as a list:
[extract.items.fields]
# All image URLs
images = { selector = ".gallery img", attribute = "src", multiple = true }
# All tags
tags = { selector = ".tag-list a", multiple = true }
# All feature items
features = { selector = ".features li", multiple = true }
Required Fields
Mark fields as required to skip items where the field is missing:
[extract.items.fields]
title = { selector = "h1", required = true }
price = { selector = ".price", required = true }
description = ".description" # Optional
If a required field is missing, the entire item is skipped with a warning.
Key-Value Pair Extraction
For structured data in key-value format (like property details), use keys and values:
[extract.items.fields]
# Extract from <dt>/<dd> pairs
details = { keys = "dt", values = "dd" }
# Extract from custom structure
# <li><strong>Bedrooms:</strong> <span>3</span></li>
specs = { selector = ".specs li", keys = "strong", values = "span" }
With Container Selector
When key-value pairs are in containers:
# Each <li> contains a key-value pair
details = { selector = ".detail-item", keys = ".label", values = ".value" }
With Units
For values with separate unit elements:
# <div class="spec">
# <span class="key">Area</span>
# <span class="value">1500</span>
# <span class="unit">sqft</span>
# </div>
specs = { selector = ".spec", keys = ".key", values = ".value", units = ".unit" }
# Result: {"Area": "1500 sqft"}
Using Parsers
Parsers transform extracted values:
[extract.items.fields]
# Parse price with currency
price = { selector = ".price", parser = "parse_price" }
# "ETB 1,500,000" → {"amount": 1500000.0, "currency": "ETB", "raw": "ETB 1,500,000"}
# Parse integer
bedrooms = { selector = ".beds", parser = "parse_int" }
# "3 Beds" → 3
# Parse float
rating = { selector = ".rating", parser = "parse_float" }
# "4.5 stars" → 4.5
# Collapse whitespace
description = { selector = ".desc", parser = "squish" }
# " Multiple spaces " → "Multiple spaces"
# Parse coordinates
location = { selector = "#map", attribute = "data-coords", parser = "parse_coordinates" }
# '{"lat": 9.03, "lng": 38.74}' → "9.03,38.74"
See Built-in Parsers for all available parsers.
JSON-LD Extraction
Extract data from JSON-LD scripts using the ldjson: parser prefix:
[extract.items.fields]
# Extract specific field from JSON-LD
date_published = { selector = "script[type='application/ld+json']", parser = "ldjson:datePublished" }
date_modified = { selector = "script[type='application/ld+json']", parser = "ldjson:dateModified" }
# Extract entire JSON-LD as dict
structured_data = { selector = "script[type='application/ld+json']", parser = "parse_ldjson" }
The ldjson: prefix automatically handles @graph structures.
Item Containers
Multiple Items Per Page
When a page has multiple items (e.g., a listing page):
[extract.items]
selector = ".product-card" # Each card is one item
[extract.items.fields]
title = "h2" # Relative to the container
price = ".price"
url = { selector = "a", attribute = "href" }
Fields are extracted relative to each container.
Whole Page as Item
For detail pages where the whole page is one item:
[extract.items]
selector = "" # Empty string = whole page
[extract.items.fields]
title = "h1.page-title"
description = "#content .description"
# Selectors work on the entire document
Derived Fields
Derived fields extract values from already-extracted nested data (like key-value pairs):
[extract.items.fields]
# Extract key-value pairs as a dict
details = { selector = ".details li", keys = "strong", values = "span" }
# Result: {"Property ID": "12345", "Bedrooms": "3", "Bathrooms": "2"}
[extract.derived]
# Pull specific values out to top-level fields
property_id = "details.Property ID"
bedrooms = { path = "details.Bedrooms", parser = "parse_int" }
bathrooms = { path = "details.Bathrooms", parser = "parse_int" }
Derived Field Options
| Option | Type | Default | Description |
|---|---|---|---|
| path | string | required | Dot-notation path to the value |
| parser | string | null | Parser to transform the value |
| remove_source | bool | true | Remove the key from the source dict |
Shorthand Syntax
[extract.derived]
# Shorthand (just the path)
property_id = "details.Property ID"
# Equivalent full form (commented out, since a TOML table cannot define the same key twice)
# property_id = { path = "details.Property ID" }
Keeping Source Fields
By default, derived keys are removed from the source dict. To keep them, set remove_source = false on the derived field.
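A minimal sketch, assuming the details field from the Derived Fields example and the remove_source option listed in the options table:

```toml
[extract.derived]
# Copy the value to a top-level field and leave it in details as well
property_id = { path = "details.Property ID", remove_source = false }
bedrooms = { path = "details.Bedrooms", parser = "parse_int", remove_source = false }
```

With these settings, details should still contain the "Property ID" and "Bedrooms" keys after the derived fields are created.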
Link Extraction
Pagination Links
Pagination selectors match links to more listing pages. They go in the pagination list under [extract.links].
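A minimal sketch of a pagination entry, reusing the selector from the complete example at the end of this guide. It assumes pagination takes a list of CSS selectors, as that example shows.

```toml
[extract.links]
# Each selector matches links to further listing pages
pagination = [".pagination a.next"]
```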
Item Links
Item selectors match links to detail pages. They go in the items list under [extract.links].
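A minimal sketch of an item-link entry, again reusing the selector from the complete example at the end of this guide and assuming items takes a list of CSS selectors:

```toml
[extract.links]
# Each selector matches links from a listing page to a detail page
items = [".listing-card a.view-details"]
```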
Link Attribute
By default, links are extracted from href. For other attributes:
Complete Example
name = "realestate"
start_urls = ["https://example.com/listings"]
[extract]
type = "html"
base_url = "https://example.com"
[extract.items]
selector = "" # Detail pages
id = "property_id"
[extract.items.fields]
# Basic fields
title = ".listing-title"
price = { selector = ".price", parser = "parse_price", required = true }
address = ".address"
description = { selector = ".description", parser = "squish" }
# Numeric fields with parsers
bedrooms = { selector = ".beds span", parser = "parse_int" }
bathrooms = { selector = ".baths span", parser = "parse_int" }
sqft = { selector = ".sqft span", parser = "parse_int" }
# Attributes
url = { selector = "link[rel='canonical']", attribute = "href" }
images = { selector = ".gallery img", attribute = "src", multiple = true }
# Key-value extraction
details = { selector = ".property-details li", keys = "strong", values = "span" }
features = { selector = ".features li", multiple = true }
# JSON-LD
date_listed = { selector = "script[type='application/ld+json']", parser = "ldjson:datePosted" }
[extract.links]
pagination = [".pagination a.next"]
items = [".listing-card a.view-details"]
[extract.derived]
property_id = "details.Property ID"
year_built = { path = "details.Year Built", parser = "parse_int" }
lot_size = "details.Lot Size"
[storage]
path = "data/realestate"