Built-in Parsers
Parsers transform extracted values. They can be used with any field.
Text Parsers
strip
Strip leading and trailing whitespace.
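As a minimal sketch, assuming strip behaves like Python's built-in str.strip(), its effect can be illustrated as follows. The name strip_sketch is hypothetical and is not the library's own function.

```python
def strip_sketch(text: str) -> str:
    """Hypothetical stand-in for the strip parser (assumed to match str.strip())."""
    # With no arguments, str.strip() removes leading and trailing whitespace,
    # including tabs and newlines, and leaves interior whitespace untouched.
    return text.strip()


print(repr(strip_sketch("  Hello World  ")))    # 'Hello World'
print(repr(strip_sketch("\n\tkeep  inner\n")))  # 'keep  inner'
```

Interior whitespace survives, which is the difference from squish.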
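Usage in config: set parser = "strip" on the field (see Using Parsers in Config).
squish
Collapse each run of whitespace (spaces, tabs, newlines) into a single space. As the last example shows, leading and trailing whitespace is removed as well.
>>> squish("Multiple    spaces")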
'Multiple spaces'
>>> squish("Line\n\nbreaks")
'Line breaks'
>>> squish(" Mixed \t whitespace ")
'Mixed whitespace'
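The behavior in these examples can be reproduced with a one-line idiom. This is only a sketch under the assumption that squish treats all whitespace alike; squish_sketch is a hypothetical name, not the library's implementation.

```python
def squish_sketch(text: str) -> str:
    """Hypothetical stand-in for squish: collapse whitespace runs and trim the ends."""
    # str.split() with no arguments splits on any run of whitespace and
    # discards empty strings at the start and end, so joining with a single
    # space both collapses and trims.
    return " ".join(text.split())


print(repr(squish_sketch("Line\n\nbreaks")))           # 'Line breaks'
print(repr(squish_sketch(" Mixed \t whitespace ")))    # 'Mixed whitespace'
```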
Usage in config: set parser = "squish" on the field (see Using Parsers in Config).
Numeric Parsers
parse_int
Extract an integer from text, stripping non-numeric characters. Empty input returns None.
>>> parse_int("3 Bedrooms")
3
>>> parse_int("$1,500")
1500
>>> parse_int("Rating: 5")
5
>>> parse_int("")
None
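One way to get the results above is to drop every non-digit character and convert what is left. The sketch below assumes exactly that; parse_int_sketch is a hypothetical name, and the real parser may treat signs or decimal points differently.

```python
import re
from typing import Optional


def parse_int_sketch(text: str) -> Optional[int]:
    """Hypothetical stand-in for parse_int: keep digits only, then convert."""
    # \D matches any character that is not a digit, so currency symbols,
    # commas, letters, and spaces are all removed.
    digits = re.sub(r"\D", "", text)
    # No digits at all (for example an empty string) yields None.
    return int(digits) if digits else None


print(parse_int_sketch("3 Bedrooms"))  # 3
print(parse_int_sketch("$1,500"))      # 1500
print(parse_int_sketch(""))            # None
```

Note that under this assumption a decimal point is removed like any other non-digit, so text such as "4.5" would become 45; parse_float is the better fit for such values.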
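Usage in config: set parser = "parse_int" on the field (see Using Parsers in Config).
parse_float
Extract a float from text, handling currency symbols, thousands separators, and percent signs. Empty input returns None.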
>>> parse_float("4.5 stars")
4.5
>>> parse_float("$1,234.56")
1234.56
>>> parse_float("99%")
99.0
>>> parse_float("")
None
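A rough sketch of how such extraction could work: find the first number in the text, remove thousands separators, and convert. The regular expression and the name parse_float_sketch are assumptions made for illustration, not the library's actual code.

```python
import re
from typing import Optional

# First run of digits, optionally with comma separators and a decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_float_sketch(text: str) -> Optional[float]:
    """Hypothetical stand-in for parse_float."""
    match = _NUMBER.search(text)
    if match is None:
        return None  # no number found, e.g. empty input
    # Commas are treated as thousands separators and dropped before conversion.
    return float(match.group().replace(",", ""))


print(parse_float_sketch("4.5 stars"))   # 4.5
print(parse_float_sketch("$1,234.56"))   # 1234.56
print(parse_float_sketch("99%"))         # 99.0
```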
Usage in config: set parser = "parse_float" on the field (see Using Parsers in Config).
parse_price
Parse a price into a structured object with the amount, the currency code, and the raw input value.
>>> parse_price("ETB 1,500,000")
{'amount': 1500000.0, 'currency': 'ETB', 'raw': 'ETB 1,500,000'}
>>> parse_price("$99.99")
{'amount': 99.99, 'currency': 'USD', 'raw': '$99.99'}
>>> parse_price("€199")
{'amount': 199.0, 'currency': 'EUR', 'raw': '€199'}
Recognized currencies: ETB (Ethiopian Birr), USD, EUR.
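The examples suggest that "$" maps to USD and "€" maps to EUR, while ETB is matched by its code. A minimal sketch under those assumptions follows; the marker table, the None result for text without a number, and the name parse_price_sketch are all hypothetical.

```python
import re
from typing import Optional

# Assumed mapping from a marker found in the text to a currency code.
_CURRENCY_MARKERS = {"$": "USD", "€": "EUR", "ETB": "ETB"}
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price_sketch(text: str) -> Optional[dict]:
    """Hypothetical stand-in for parse_price."""
    match = _NUMBER.search(text)
    if match is None:
        return None  # assumption: no number means no price
    amount = float(match.group().replace(",", ""))
    # Take the first known marker that appears anywhere in the text.
    currency = next(
        (code for marker, code in _CURRENCY_MARKERS.items() if marker in text),
        None,
    )
    return {"amount": amount, "currency": currency, "raw": text}


print(parse_price_sketch("$99.99"))
# {'amount': 99.99, 'currency': 'USD', 'raw': '$99.99'}
```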
Usage in config: set parser = "parse_price" on the field (see Using Parsers in Config).
Structured Data Parsers
parse_json
Parse a JSON string into a Python object. Invalid JSON returns None.
>>> parse_json('{"key": "value"}')
{'key': 'value'}
>>> parse_json('[1, 2, 3]')
[1, 2, 3]
>>> parse_json("invalid")
None
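The None result for invalid input matches a common pattern: call the standard library's json.loads and swallow the decoding error. The sketch below assumes that pattern; parse_json_sketch is a hypothetical name.

```python
import json
from typing import Any


def parse_json_sketch(text: str) -> Any:
    """Hypothetical stand-in for parse_json: decode, or return None on failure."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError covers malformed JSON; TypeError covers
        # non-string input such as None.
        return None


print(parse_json_sketch('{"key": "value"}'))  # {'key': 'value'}
print(parse_json_sketch("invalid"))           # None
```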
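Usage in config: set parser = "parse_json" on the field (see Using Parsers in Config).
parse_ldjson
Extract the first object in @graph from JSON-LD text. When there is no @graph, the top-level object is returned.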
>>> # With @graph structure
>>> ld_json = '{"@graph": [{"@type": "Product", "name": "Item"}]}'
>>> parse_ldjson(ld_json)
{'@type': 'Product', 'name': 'Item'}
>>> # Without @graph
>>> simple = '{"@type": "Product", "name": "Simple Item"}'
>>> parse_ldjson(simple)
{'@type': 'Product', 'name': 'Simple Item'}
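Both cases can be sketched in a few lines. This assumes the parser decodes the text as JSON, returns the first @graph entry when one exists, and otherwise returns the decoded object unchanged; parse_ldjson_sketch is a hypothetical name, and the None result for invalid JSON is an assumption.

```python
import json
from typing import Any


def parse_ldjson_sketch(text: str) -> Any:
    """Hypothetical stand-in for parse_ldjson."""
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None  # assumption: invalid JSON-LD yields None
    if isinstance(data, dict):
        graph = data.get("@graph")
        # Use the first entry of a non-empty @graph list.
        if isinstance(graph, list) and graph:
            return graph[0]
    # No usable @graph: return the decoded document as-is.
    return data


doc = '{"@graph": [{"@type": "Product", "name": "Item"}]}'
print(parse_ldjson_sketch(doc))  # {'@type': 'Product', 'name': 'Item'}
```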
Usage in config: set parser = "parse_ldjson" on the field (see Using Parsers in Config).
parse_coordinates
Extract latitude and longitude from JSON or free text, returning them as a "lat,long" string.
>>> parse_coordinates('{"latitude": 9.03, "longitude": 38.74}')
'9.03,38.74'
>>> parse_coordinates('lat: 9.03, lng: 38.74')
'9.03,38.74'
>>> parse_coordinates('"Latitude": "9.03", "Long": "38.74"')
'9.03,38.74'
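The three inputs above share a shape: a label (lat, latitude, lng, long, longitude), an optional quote, a colon, and a number. One possible approach, sketched here with assumed regular expressions and the hypothetical name parse_coordinates_sketch, is to search for each label independently of the surrounding format.

```python
import re
from typing import Optional

_NUM = r"(-?\d+(?:\.\d+)?)"
# Label, optional closing quote, ':' or '=', optional opening quote, number.
_LAT = re.compile(r'lat(?:itude)?"?\s*[:=]\s*"?' + _NUM, re.IGNORECASE)
_LNG = re.compile(r'(?:lng|long(?:itude)?|lon)"?\s*[:=]\s*"?' + _NUM, re.IGNORECASE)


def parse_coordinates_sketch(text: str) -> Optional[str]:
    """Hypothetical stand-in for parse_coordinates."""
    lat = _LAT.search(text)
    lng = _LNG.search(text)
    if lat is None or lng is None:
        return None  # assumption: both values are required
    # Keep the numbers as written in the source text.
    return f"{lat.group(1)},{lng.group(1)}"


print(parse_coordinates_sketch('lat: 9.03, lng: 38.74'))  # 9.03,38.74
```

A pattern-based search like this tolerates JSON fragments that would not decode on their own, such as the third example.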
Usage in config: set parser = "parse_coordinates" on the field (see Using Parsers in Config).
JSON-LD Field Extraction
Use the ldjson: prefix to extract specific fields from JSON-LD:
[extract.items.fields]
date_published = { selector = "script[type='application/ld+json']", parser = "ldjson:datePublished" }
author_name = { selector = "script[type='application/ld+json']", parser = "ldjson:author.name" }
The path after ldjson: uses dot-notation:
- ldjson:datePublished → extracts datePublished
- ldjson:author.name → extracts author.name
- ldjson:offers.0.price → extracts first offer's price
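Dot-notation lookup of this kind can be sketched as a walk over nested dictionaries and lists, where a numeric segment such as 0 indexes into a list. The function name resolve_path_sketch, the sample data, and the None result for a missing segment are assumptions for illustration only.

```python
from typing import Any


def resolve_path_sketch(data: Any, path: str) -> Any:
    """Hypothetical dot-path lookup: 'offers.0.price' -> data['offers'][0]['price']."""
    current = data
    for key in path.split("."):
        if isinstance(current, list):
            # Numeric segments index into lists.
            if not key.isdigit() or int(key) >= len(current):
                return None
            current = current[int(key)]
        elif isinstance(current, dict):
            if key not in current:
                return None
            current = current[key]
        else:
            return None  # cannot descend into a scalar
    return current


# Made-up JSON-LD content, already decoded.
sample = {
    "datePublished": "2024-01-15",
    "author": {"name": "Jane Doe"},
    "offers": [{"price": "19.99"}],
}

# Split a parser spec into its prefix and path.
prefix, _, path = "ldjson:offers.0.price".partition(":")
print(prefix, resolve_path_sketch(sample, path))  # ldjson 19.99
```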
Available Built-in Parsers
| Parser | Description |
|---|---|
| strip | Strip leading/trailing whitespace |
| squish | Collapse multiple whitespace to single spaces |
| parse_int | Extract integer from text |
| parse_float | Extract float from text |
| parse_price | Parse price with currency detection |
| parse_json | Parse JSON string to object |
| parse_ldjson | Extract from JSON-LD (handles @graph) |
| parse_coordinates | Extract lat/long from various formats |
Using Parsers in Config
In Field Config
[extract.items.fields]
price = { selector = ".price", parser = "parse_price" }
count = { selector = ".count", parser = "parse_int" }
In Derived Fields
[extract.derived]
bedrooms = { path = "details.Bedrooms", parser = "parse_int" }
price_amount = { path = "price.amount", parser = "parse_float" }
Error Handling
If a parser fails:
- Returns None for the field
- Logs a warning (visible with -v)
- Does not fail the item (unless the field is required)
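The three rules above can be sketched as a small wrapper around any parser function. Everything named here (apply_parser_sketch, RequiredFieldError, the logger name) is hypothetical and only illustrates the described behavior; it is not the tool's actual code.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("parsers_sketch")


class RequiredFieldError(Exception):
    """Hypothetical error raised when a required field ends up empty."""


def apply_parser_sketch(
    field: str,
    parser: Callable[[Any], Any],
    raw: Any,
    required: bool = False,
) -> Any:
    """Apply a parser; on failure log a warning and fall back to None."""
    try:
        value = parser(raw)
    except Exception as exc:
        # A failing parser does not abort the item; it only produces None.
        logger.warning("parser failed for field %r: %s", field, exc)
        value = None
    if value is None and required:
        # Assumption: only required fields turn a failure into an error.
        raise RequiredFieldError(field)
    return value


print(apply_parser_sketch("count", int, "42"))   # 42
print(apply_parser_sketch("count", int, "abc"))  # None (a warning is logged)
```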
Custom Parsers
See Custom Parsers for creating your own parsers.