Working with Extracted Data

Databrew stores extracted items in rolling Parquet part files. No separate export step is needed—your data is ready to use immediately after crawling.

Storage Layout

data/mysite/
├── .state.db             # URL queue/retry state (ephemeral, gitignored)
├── .failures.db          # Durable failure tracking (local, gitignored)
├── _failed_urls.json     # Portable failure snapshot (committed/synced)
├── .index.db             # Storage dedupe/index catalog (ephemeral, gitignored)
└── items/
    ├── part_000001.parquet   # Rolling part files (compressed)
    ├── part_000002.parquet
    └── ...
  • .state.db is local-only queue state
  • .failures.db tracks failed URLs durably, surviving .state.db deletion
  • _failed_urls.json is a portable failure snapshot exported at run end for cross-machine sync
  • .index.db is a local-only storage catalog, auto-rebuilt from the Parquet files on startup
  • items/*.parquet are the source of truth—sync these across machines
  • Parquet writes are append-only; new parts are created rather than rewriting existing files
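
Because the part files are self-describing, you can sanity-check the layout directly. A minimal sketch using pyarrow (an extra dependency, not required by Databrew) that counts rows per part without loading any item data:

import pyarrow.parquet as pq
from pathlib import Path

# Row counts live in each Parquet file's footer, so this only reads metadata.
total = 0
for part in sorted(Path("data/mysite/items").glob("part_*.parquet")):
    num_rows = pq.ParquetFile(part).metadata.num_rows
    print(f"{part.name}: {num_rows} rows")
    total += num_rows
print(f"total: {total} rows")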

Reading Data

Python (DuckDB)

DuckDB can query the Parquet files in place, with no import or load step:

import duckdb

# Read all items
df = duckdb.sql("SELECT * FROM 'data/mysite/items/*.parquet'").df()

# Filter by extraction date (pushed down via Parquet statistics)
df = duckdb.sql("""
    SELECT * FROM 'data/mysite/items/*.parquet'
    WHERE _extracted_at >= '2026-01-20'
""").df()

# Aggregations
duckdb.sql("""
    SELECT category, COUNT(*) as count, AVG(price) as avg_price
    FROM 'data/mysite/items/*.parquet'
    GROUP BY category
""").show()

Python (Pandas)

import pandas as pd

# Read all Parquet files
df = pd.read_parquet("data/mysite/items/")

# Read a specific part file
df = pd.read_parquet("data/mysite/items/part_000001.parquet")

Python (Polars)

import polars as pl

# Read all items (lazy scan for large datasets)
df = pl.scan_parquet("data/mysite/items/*.parquet").collect()

# With filtering (pushed down to Parquet)
df = (
    pl.scan_parquet("data/mysite/items/*.parquet")
    .filter(pl.col("price") > 100)
    .collect()
)

Command Line (DuckDB CLI)

# Query directly
duckdb -c "SELECT COUNT(*) FROM 'data/mysite/items/*.parquet'"

# Convert to JSONL
duckdb -c "COPY (SELECT * FROM 'data/mysite/items/*.parquet') TO 'data.jsonl' (FORMAT JSON, ARRAY false)"

# Convert to CSV
duckdb -c "COPY (SELECT * FROM 'data/mysite/items/*.parquet') TO 'data.csv' (HEADER)"

# Filter and export
duckdb -c "COPY (SELECT * FROM 'data/mysite/items/*.parquet' WHERE price > 100) TO 'filtered.jsonl' (FORMAT JSON, ARRAY false)"

Command Line (jq + DuckDB)

# Export to JSONL then filter with jq
duckdb -c "COPY (SELECT * FROM 'data/mysite/items/*.parquet') TO '/dev/stdout' (FORMAT JSON, ARRAY false)" | jq 'select(.price > 100)'

Metadata Fields

Each stored item includes metadata fields:

Field          Description
_source_url    URL the item was extracted from
_extracted_at  ISO timestamp when the item was extracted

These fields are stored alongside your extracted data in the Parquet files.
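
Whether re-crawled pages produce duplicate rows depends on your crawl setup, since writes are append-only. Assuming the same URL can appear in multiple part files, a DuckDB sketch that keeps only the newest extraction per URL using these metadata fields:

import duckdb

# Keep the most recent row per _source_url; older extractions are dropped.
latest = duckdb.sql("""
    SELECT * EXCLUDE (rn)
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY _source_url
            ORDER BY _extracted_at DESC
        ) AS rn
        FROM 'data/mysite/items/*.parquet'
    )
    WHERE rn = 1
""").df()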

Cross-Machine Sync

The storage architecture makes syncing simple. Share items/*.parquet and the failure snapshot:

# Machine A: crawl and commit
databrew run mysite.toml
git add data/mysite/items/ data/mysite/_failed_urls.json
git commit -m "Crawled items"
git push

# Machine B: pull and continue
git pull
databrew run mysite.toml
# Auto-rebuilds .index.db catalog from Parquet files
# Merges the failure snapshot and continues crawling remaining URLs

The local SQLite files should be gitignored:

data/*/.state.db
data/*/.failures.db
data/*/.index.db

Working with Large Datasets

For datasets with hundreds of thousands of items:

  • Use DuckDB or Polars for efficient querying (they read Parquet natively without loading everything into memory)
  • Filter by _extracted_at to scope queries to specific time ranges
  • Use lazy evaluation (Polars scan_parquet, DuckDB) to push filters down to the file level, as in the example below

import duckdb

# Efficient: scans all parts, pushing the price filter down to Parquet
result = duckdb.sql("""
    SELECT title, price
    FROM 'data/mysite/items/*.parquet'
    WHERE price BETWEEN 100 AND 500
    ORDER BY price DESC
    LIMIT 100
""").df()