Browser-Based Fetching¶
Some websites require JavaScript execution to render content. Databrew supports browser-based fetching using pydoll, a Python browser automation library.
Installation¶
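Install databrew with the browser extra. With pip this is typically `pip install "databrew[browser]"`, assuming the package is published as `databrew` and the extra is named `browser`.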
Basic Configuration¶
Enable browser fetching in your config:
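Judging by the `[fetch]` table used in the settings and in the complete example on this page, the smallest configuration that does this appears to be:

```toml
[fetch]
type = "pydoll"   # use the browser fetcher instead of plain HTTP
```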
That's it! Databrew will use a headless Chrome browser to render pages.
Browser Settings¶
Configure browser behavior in [fetch.browser]:
[fetch]
type = "pydoll"
[fetch.browser]
headless = true # Run without GUI (default: true)
page_load_timeout = 30.0 # Page load timeout in seconds
viewport_width = 1920 # Browser window width
viewport_height = 1080 # Browser window height
user_agent = "Mozilla/5.0 ..." # Custom user agent
Wait Strategies¶
Dynamic content may need time to load after the initial page load. Configure wait strategies:
Wait for Selector¶
Wait for a specific element to appear:
[fetch.browser]
wait_for_selector = ".content-loaded" # CSS selector
selector_timeout = 10.0 # Timeout in seconds
Wait for Network Idle¶
Wait for network activity to settle:
[fetch.browser]
wait_for_network_idle = true
network_idle_time = 2.0 # Seconds to wait after network settles
Additional Delay¶
Add a fixed delay after the page appears ready:
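A minimal sketch, using the `wait_after_load` key that appears in the combined example on this page. The value is only illustrative, and the unit is assumed to be seconds, like the other timeouts:

```toml
[fetch.browser]
wait_after_load = 0.5   # assumed to be seconds, applied after the page looks ready
```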
Combined Wait Strategy¶
You can combine multiple strategies:
[fetch.browser]
wait_for_selector = ".products-loaded"
selector_timeout = 15.0
wait_after_load = 0.5
Tab Pooling¶
The browser fetcher uses tab pooling for concurrent requests:
Each concurrent request uses its own tab. Tabs are reused between requests.
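As a sketch, assuming the number of pooled tabs follows `concurrency` in `[policy]` (the complete example on this page annotates that key with "Fewer concurrent tabs"):

```toml
[policy]
concurrency = 3   # assumption: at most three tabs are open at once
```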
When to Use Browser Fetching¶
Use browser fetching when:
- Page content is loaded via JavaScript (React, Vue, etc.)
- Content appears after user interactions or AJAX calls
- The site blocks simple HTTP requests
- You need to see the fully rendered DOM
Use HTTP fetching (type = "httpx") when:
- Content is in the initial HTML response
- Speed is important (HTTP is much faster)
- You're hitting an API endpoint
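Switching between the two fetchers should be a one-line change, assuming the HTTP fetcher needs no settings beyond its type:

```toml
[fetch]
type = "httpx"   # plain HTTP fetching, no browser involved
```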
Performance Considerations¶
Browser fetching is slower than HTTP fetching:
| Fetcher | Requests/sec (typical) |
|---|---|
| httpx | 10-50 |
| pydoll | 1-5 |
Tips for better performance:
- Lower concurrency: Browser tabs are resource-intensive
- Minimize wait times: Only wait as long as necessary
- Use headless mode: Always use headless unless debugging
Debugging¶
Non-Headless Mode¶
To see the browser in action:
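A minimal sketch, flipping the `headless` setting listed under Browser Settings (its default is `true`):

```toml
[fetch.browser]
headless = false   # show the browser window while debugging
```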
Verbose Logging¶
Run with verbose output:
Complete Example¶
Config for a JavaScript-heavy real estate site:
name = "jsrealestate"
start_urls = ["https://example.com/listings"]
[extract]
type = "html"
[extract.items]
selector = ".property-card"
[extract.items.fields]
title = ".title"
price = { selector = ".price", parser = "parse_price" }
address = ".address"
url = { selector = "a", attribute = "href" }
[extract.links]
pagination = [".load-more-btn"]
items = [".property-card a"]
[fetch]
type = "pydoll"
[fetch.browser]
headless = true
page_load_timeout = 45.0
wait_for_selector = ".property-card"
selector_timeout = 15.0
wait_after_load = 0.5
viewport_width = 1920
viewport_height = 1080
[policy]
concurrency = 3 # Fewer concurrent tabs
delay = 1.0 # Delay between batches
jitter = 0.3 # Random delay
max_retries = 3
[storage]
path = "data/jsrealestate"
Troubleshooting¶
Browser Not Starting¶
Ensure Chrome/Chromium is installed on your system. Pydoll uses your system's Chrome installation.
Timeout Errors¶
Increase the timeouts if pages take a long time to load:
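For example, using the two timeout keys shown earlier on this page. The values are illustrative, not recommendations, and `selector_timeout` presumably only matters when `wait_for_selector` is set:

```toml
[fetch.browser]
page_load_timeout = 60.0   # seconds allowed for the initial page load
selector_timeout = 30.0    # seconds allowed for wait_for_selector to match
```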
Missing Content¶
If content is still missing:
- Increase wait_after_load
- Use wait_for_network_idle = true
- Try running in non-headless mode to debug
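Put together, a debugging configuration along these lines might look as follows. All keys come from earlier sections of this page, the values are illustrative, and the way databrew orders the individual waits is assumed rather than confirmed:

```toml
[fetch.browser]
headless = false               # watch the page render
wait_for_network_idle = true   # wait for network activity to settle
network_idle_time = 2.0        # seconds of quiet before continuing
wait_after_load = 2.0          # extra fixed delay, raised from 0.5
```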
Memory Issues¶
Browser tabs consume memory. Reduce concurrency:
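A sketch using the `concurrency` key from the complete example on this page. The value is illustrative, and the assumption is that each unit of concurrency corresponds to one open tab:

```toml
[policy]
concurrency = 2   # fewer tabs open at once, so less memory in use
```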