Working with Extracted Data¶
Databrew stores extracted items in rolling Parquet part files. No separate export step is needed—your data is ready to use immediately after crawling.
Storage Layout¶
data/mysite/
├── .state.db # URL queue/retry state (ephemeral, gitignored)
├── .failures.db # Durable failure tracking (local, gitignored)
├── _failed_urls.json # Portable failure snapshot (committed/synced)
├── .index.db # Storage dedupe/index catalog (ephemeral, gitignored)
└── items/
    ├── part_000001.parquet # Rolling part files (compressed)
    ├── part_000002.parquet
    └── ...
- `.state.db` is local-only queue state
- `.failures.db` tracks failed URLs durably, surviving `.state.db` deletion
- `_failed_urls.json` is a portable failure snapshot exported at run end for cross-machine sync
- `.index.db` is a local-only storage catalog, auto-rebuilt from the Parquet files on startup
- `items/*.parquet` files are the source of truth; sync these across machines
- Parquet writes are append-only; new parts are created rather than rewriting existing files
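Because parts are append-only, you can see how a crawl has grown by inspecting the files themselves. A minimal sketch, assuming pyarrow is installed and the layout shown above:
import pyarrow.parquet as pq
from pathlib import Path
# List each part file with its row and column counts
items_dir = Path("data/mysite/items")
for part in sorted(items_dir.glob("part_*.parquet")):
    meta = pq.ParquetFile(part).metadata
    print(f"{part.name}: {meta.num_rows} rows, {meta.num_columns} columns")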
Reading Data¶
Python (DuckDB)¶
DuckDB can query the Parquet files in place, with no import or copy step:
import duckdb
# Read all items
df = duckdb.sql("SELECT * FROM 'data/mysite/items/*.parquet'").df()
# Filter by date (uses partition pruning)
df = duckdb.sql("""
SELECT * FROM 'data/mysite/items/*.parquet'
WHERE _extracted_at >= '2026-01-20'
""").df()
# Aggregations
duckdb.sql("""
SELECT category, COUNT(*) as count, AVG(price) as avg_price
FROM 'data/mysite/items/*.parquet'
GROUP BY category
""").show()
Python (Pandas)¶
import pandas as pd
# Read all Parquet files
df = pd.read_parquet("data/mysite/items/")
# Read a specific part file
df = pd.read_parquet("data/mysite/items/part_000001.parquet")
Python (Polars)¶
import polars as pl
# Read all items (lazy scan for large datasets)
df = pl.scan_parquet("data/mysite/items/*.parquet").collect()
# With filtering (pushed down to Parquet)
df = (
pl.scan_parquet("data/mysite/items/*.parquet")
.filter(pl.col("price") > 100)
.collect()
)
Command Line (DuckDB CLI)¶
# Query directly
duckdb -c "SELECT COUNT(*) FROM 'data/mysite/items/*.parquet'"
# Convert to JSONL
duckdb -c "COPY (SELECT * FROM 'data/mysite/items/*.parquet') TO 'data.jsonl' (FORMAT JSON, ARRAY false)"
# Convert to CSV
duckdb -c "COPY (SELECT * FROM 'data/mysite/items/*.parquet') TO 'data.csv' (HEADER)"
# Filter and export
duckdb -c "COPY (SELECT * FROM 'data/mysite/items/*.parquet' WHERE price > 100) TO 'filtered.jsonl' (FORMAT JSON, ARRAY false)"
Command Line (jq + DuckDB)¶
# Export to JSONL then filter with jq
duckdb -c "COPY (SELECT * FROM 'data/mysite/items/*.parquet') TO '/dev/stdout' (FORMAT JSON, ARRAY false)" | jq 'select(.price > 100)'
Metadata Fields¶
Each stored item includes metadata fields:
| Field | Description |
|---|---|
| `_source_url` | URL the item was extracted from |
| `_extracted_at` | ISO timestamp when the item was extracted |
These fields are stored alongside your extracted data in the Parquet files.
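They are handy for auditing a crawl. For example, a sketch of a DuckDB query (following the same pattern as above) that counts items per source page:
import duckdb
# Items per source URL, plus the most recent extraction time for each
duckdb.sql("""
SELECT _source_url, COUNT(*) AS items, MAX(_extracted_at) AS last_extracted
FROM 'data/mysite/items/*.parquet'
GROUP BY _source_url
ORDER BY items DESC
LIMIT 10
""").show()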
Cross-Machine Sync¶
The storage architecture makes syncing simple. Share items/*.parquet and the failure snapshot:
# Machine A: crawl and commit
databrew run mysite.toml
git add data/mysite/items/ data/mysite/_failed_urls.json
git commit -m "Crawled items"
git push
# Machine B: pull and continue
git pull
databrew run mysite.toml
# Auto-rebuilds .index.db catalog from Parquet files
# Merges failure snapshot, continues crawling new ones
The local SQLite files (`.state.db`, `.failures.db`, and `.index.db`) should be gitignored.
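For example, entries along these lines (paths follow the layout above; adjust to your own data directory):
# Local-only crawl state; rebuilt or recreated per machine
data/*/.state.db
data/*/.failures.db
data/*/.index.db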
Working with Large Datasets¶
For datasets with hundreds of thousands of items:
- Use DuckDB or Polars for efficient querying (they read Parquet natively without loading everything into memory)
- Filter by `_extracted_at` to scope queries to specific time ranges
- Use lazy evaluation (Polars `scan_parquet`, DuckDB) to push filters down to the file level; see the sketch below
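Combining these, a sketch that scopes a large dataset to a recent window and a few columns (assuming `_extracted_at` is stored as an ISO string, as in the DuckDB example above; `category` and `price` stand in for your own fields):
import polars as pl
# Lazy scan: the date filter and column selection are pushed down to the
# Parquet reader, so only matching rows and the selected columns are loaded.
recent = (
    pl.scan_parquet("data/mysite/items/*.parquet")
    .filter(pl.col("_extracted_at") >= "2026-01-20")
    .select(["_source_url", "category", "price"])
    .collect()
)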