
Browser-Based Fetching

Some websites require JavaScript execution to render content. Databrew supports browser-based fetching using pydoll, a Python browser automation library.

Installation

Install databrew with the browser extra:

pip install "databrew[browser]"
# or
uv add databrew --extra browser
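
To confirm that the extra actually pulled in the browser dependency, a quick check along these lines may help. It is only a sketch: it assumes pydoll is importable under the module name `pydoll`, and the helper name `browser_extra_available` is made up for this example.

```python
import importlib.util


def browser_extra_available(module_name: str = "pydoll") -> bool:
    """Return True if the module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if browser_extra_available():
        print("pydoll found: browser fetching should be available")
    else:
        print('pydoll missing: try `pip install "databrew[browser]"`')
```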

Basic Configuration

Enable browser fetching in your config:

[fetch]
type = "pydoll"

That's it! Databrew will use a headless Chrome browser to render pages.

Browser Settings

Configure browser behavior in [fetch.browser]:

[fetch]
type = "pydoll"

[fetch.browser]
headless = true              # Run without GUI (default: true)
page_load_timeout = 30.0     # Page load timeout in seconds
viewport_width = 1920        # Browser window width
viewport_height = 1080       # Browser window height
user_agent = "Mozilla/5.0 ..." # Custom user agent

Wait Strategies

Dynamic content may keep arriving after the initial page load finishes. Configure one or more wait strategies so that extraction runs only once that content is present:

Wait for Selector

Wait for a specific element to appear:

[fetch.browser]
wait_for_selector = ".content-loaded"  # CSS selector
selector_timeout = 10.0                 # Timeout in seconds
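
Conceptually, waiting for a selector is a poll-until-timeout loop. The sketch below illustrates that general pattern only; it is not Databrew's or pydoll's implementation, and `wait_for_condition`, `check`, and `poll_interval` are names invented for this example. In a real browser, `check` would ask the page whether the selector matches anything.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_for_condition(
    check: Callable[[], Awaitable[bool]],
    timeout: float,
    poll_interval: float = 0.1,
) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if await check():
            return True   # condition met, e.g. the element appeared
        if time.monotonic() >= deadline:
            return False  # gave up, analogous to hitting selector_timeout
        await asyncio.sleep(poll_interval)
```

A short `selector_timeout` keeps failures fast, while a long one tolerates slow pages at the cost of tying up a tab for longer.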

Wait for Network Idle

Wait for network activity to settle:

[fetch.browser]
wait_for_network_idle = true
network_idle_time = 2.0      # Seconds to wait after network settles

Additional Delay

Add a fixed delay after the page appears ready:

[fetch.browser]
wait_after_load = 1.0        # Additional seconds to wait

Combined Wait Strategy

You can combine multiple strategies:

[fetch.browser]
wait_for_selector = ".products-loaded"
selector_timeout = 15.0
wait_after_load = 0.5
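
When combining strategies, it can help to estimate how long a single page might hold a tab in the worst case. The sketch below assumes the waits run one after another (page load, then selector, then fixed delay), which this page does not state explicitly; `worst_case_wait` is a hypothetical helper, and the 30.0 second page load timeout is taken from the Browser Settings example.

```python
def worst_case_wait(
    page_load_timeout: float,
    selector_timeout: float = 0.0,
    wait_after_load: float = 0.0,
) -> float:
    """Upper bound in seconds, assuming the waits are applied sequentially."""
    return page_load_timeout + selector_timeout + wait_after_load


# Values from the combined example, plus page_load_timeout = 30.0
print(worst_case_wait(30.0, selector_timeout=15.0, wait_after_load=0.5))  # prints 45.5
```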

Tab Pooling

The browser fetcher serves concurrent requests from a pool of tabs. The pool size follows the concurrency setting:

[policy]
concurrency = 5              # Number of browser tabs

Each concurrent request uses its own tab. Tabs are reused between requests.
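
The general idea of a tab pool can be sketched with an `asyncio.Queue` of idle tabs. This is an illustration of the pattern, not Databrew's actual code: `FakeTab`, `TabPool`, and `crawl` are invented for the example, and the fake tab only sleeps instead of driving a browser.

```python
import asyncio


class FakeTab:
    """Stand-in for a browser tab; a real tab would navigate and render."""

    def __init__(self, tab_id: int) -> None:
        self.tab_id = tab_id
        self.pages_fetched = 0

    async def fetch(self, url: str) -> str:
        await asyncio.sleep(0.01)  # pretend to load and render the page
        self.pages_fetched += 1
        return f"<html>{url} via tab {self.tab_id}</html>"


class TabPool:
    """Hand out a fixed set of tabs; callers wait while all tabs are busy."""

    def __init__(self, size: int) -> None:
        self.tabs = [FakeTab(i) for i in range(size)]
        self._idle: asyncio.Queue = asyncio.Queue()
        for tab in self.tabs:
            self._idle.put_nowait(tab)

    async def fetch(self, url: str) -> str:
        tab = await self._idle.get()    # waits until a tab is free
        try:
            return await tab.fetch(url)
        finally:
            self._idle.put_nowait(tab)  # return the tab for reuse


async def crawl(urls: list[str], concurrency: int) -> TabPool:
    """Fetch every URL through a pool of `concurrency` tabs."""
    pool = TabPool(concurrency)
    await asyncio.gather(*(pool.fetch(url) for url in urls))
    return pool


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    pool = asyncio.run(crawl(urls, concurrency=3))
    for tab in pool.tabs:
        print(f"tab {tab.tab_id} fetched {tab.pages_fetched} pages")
```

Reusing tabs avoids paying the tab start-up cost for every request, and the fixed pool size caps how many pages render at once.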

When to Use Browser Fetching

Use browser fetching when:

  • Page content is loaded via JavaScript (React, Vue, etc.)
  • Content appears after user interactions or AJAX calls
  • The site blocks simple HTTP requests
  • You need to see the fully rendered DOM

Use HTTP fetching (type = "httpx") when:

  • Content is in the initial HTML response
  • Speed is important (HTTP is much faster)
  • You're hitting an API endpoint
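
One way to decide is to look at the raw HTML a plain HTTP request returns and check whether the elements you want are already there. The sketch below does this for a CSS class using only the standard library; `present_in_raw_html` is a hypothetical helper, and the HTML strings are made-up samples rather than output from a real site.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record whether any element carries the given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None for valueless attributes
        classes = dict(attrs).get("class") or ""
        if self.css_class in classes.split():
            self.found = True


def present_in_raw_html(html: str, css_class: str) -> bool:
    """True if the class appears in the unrendered HTML."""
    finder = ClassFinder(css_class)
    finder.feed(html)
    return finder.found


server_rendered = '<div class="property-card featured">Listing</div>'
client_rendered = '<div id="root"></div><script src="app.js"></script>'

print(present_in_raw_html(server_rendered, "property-card"))  # prints True
print(present_in_raw_html(client_rendered, "property-card"))  # prints False
```

If the class is already present in the raw response, HTTP fetching is probably enough; if it only appears after scripts run, browser fetching is the likelier fit.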

Performance Considerations

Browser fetching is slower than HTTP fetching:

Fetcher    Requests/sec (typical)
httpx      10-50
pydoll     1-5
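
To get a feel for what these rates mean for a whole crawl, a rough estimate like the following can help. It uses only the typical ranges from the table; the 10,000-page crawl size and the helper `estimated_minutes` are made up for illustration, and real throughput depends on the site and your wait settings.

```python
def estimated_minutes(pages: int, requests_per_sec: float) -> float:
    """Rough crawl duration in minutes at a steady request rate."""
    return pages / requests_per_sec / 60


PAGES = 10_000  # hypothetical crawl size

for fetcher, (slow, fast) in {"httpx": (10, 50), "pydoll": (1, 5)}.items():
    best = estimated_minutes(PAGES, fast)
    worst = estimated_minutes(PAGES, slow)
    print(f"{fetcher}: {best:.1f} to {worst:.1f} minutes")
```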

Tips for better performance:

  1. Lower concurrency: Browser tabs are resource-intensive

    [policy]
    concurrency = 3

  2. Minimize wait times: Only wait as long as necessary

    [fetch.browser]
    wait_for_selector = ".content"
    selector_timeout = 5.0

  3. Use headless mode: Always use headless unless debugging

    [fetch.browser]
    headless = true

Debugging

Non-Headless Mode

To see the browser in action:

[fetch.browser]
headless = false

Verbose Logging

Run with verbose output:

databrew run config.toml -v

Complete Example

Config for a JavaScript-heavy real estate site:

name = "jsrealestate"
start_urls = ["https://example.com/listings"]

[extract]
type = "html"

[extract.items]
selector = ".property-card"

[extract.items.fields]
title = ".title"
price = { selector = ".price", parser = "parse_price" }
address = ".address"
url = { selector = "a", attribute = "href" }

[extract.links]
pagination = [".load-more-btn"]
items = [".property-card a"]

[fetch]
type = "pydoll"

[fetch.browser]
headless = true
page_load_timeout = 45.0
wait_for_selector = ".property-card"
selector_timeout = 15.0
wait_after_load = 0.5
viewport_width = 1920
viewport_height = 1080

[policy]
concurrency = 3          # Fewer concurrent tabs
delay = 1.0              # Delay between batches
jitter = 0.3             # Random delay
max_retries = 3

[storage]
path = "data/jsrealestate"

Troubleshooting

Browser Not Starting

Ensure Chrome/Chromium is installed on your system. Pydoll uses your system's Chrome installation.
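
A quick way to look for a Chrome or Chromium binary on your PATH is sketched below. This is a heuristic only: how pydoll itself locates Chrome is not covered here, the candidate names are common Linux binary names rather than an exhaustive list, and on macOS and Windows Chrome is usually installed outside PATH, so a `None` result does not prove Chrome is missing. `find_chrome` is a hypothetical helper.

```python
import shutil
from typing import Optional

# Common binary names; an assumption, not a list taken from pydoll
CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
    "chrome",
)


def find_chrome(candidates=CANDIDATES) -> Optional[str]:
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_chrome() or "No Chrome/Chromium binary found on PATH")
```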

Timeout Errors

Increase the timeouts if pages take a long time to load:

[fetch.browser]
page_load_timeout = 60.0
selector_timeout = 20.0

Missing Content

If content is still missing:

  1. Increase wait_after_load
  2. Use wait_for_network_idle = true
  3. Try running in non-headless mode to debug

Memory Issues

Every open browser tab consumes memory. If memory runs short, reduce concurrency:

[policy]
concurrency = 2