How to Scrape Dynamic Websites Without Headless Browsers

Thomas Shultz
12 min read

Most tutorials about scraping dynamic pages jump straight to Playwright or Puppeteer. Spin up a browser, wait for the DOM, extract the data. Problem solved, except now you're managing a Chrome fleet, debugging memory leaks, and paying 5-10x more per page than you need to.

The reality is that most "dynamic" websites don't require a headless browser to scrape. They require you to think about why the content loads dynamically, and then pick the cheapest method that actually works. This guide walks through that decision process, the available approaches, and when to escalate.

Why Headless Browsers Are the Wrong Default

A headless browser solves the rendering problem by doing what a real browser does: execute JavaScript, wait for API calls to resolve, update the DOM, and return the final HTML. That works. It's also slow, expensive, and operationally painful at scale.

The problem is that most dynamic pages fall into one of two categories:

  • The site uses a JavaScript framework (React, Vue, Angular) but the underlying data comes from an accessible JSON or GraphQL API

  • The initial HTML contains more useful content than it appears to: embedded JSON blobs, <noscript> content, or data-* attributes with everything you need

In both cases, rendering the page in a browser is unnecessary work. You're paying for computation you don't need.

Even when rendering is genuinely required, you don't have to run it yourself. The operational cost of maintaining a headless browser fleet (Chrome updates, scaling, memory management, anti-bot countermeasures) is a significant burden for a problem that specialized APIs already solve.

The Decision Tree: What to Try First

Before writing a single line of scraping code, spend five minutes in DevTools. This usually tells you which approach to take.

Check the network tab first. Filter by Fetch/XHR. Reload the page. If you see requests returning JSON payloads with the data you want, you probably don't need to render anything. You just need to reproduce those API calls directly.

Inspect the page source. Open view-source: or curl the URL raw. Many sites that appear dynamic serve pre-rendered HTML for crawlers. Look for JSON blobs embedded in <script> tags, data-* attributes, or <noscript> fallbacks. If the data is there, you're scraping a static page that just looks dynamic in a browser.

Check the HTML shell size. If the initial HTML is tiny (a root <div id="app"></div> and a bundle of script tags), you're dealing with a true client-side rendering setup. That's when you actually need rendering.
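You can run the last two checks without opening a browser at all by fetching the raw HTML and seeing what the server actually sends. A minimal sketch, assuming a hypothetical product page URL; the markers and size heuristic are just starting points:

import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://example.com/products",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},
    timeout=15,
)
html = resp.text
soup = BeautifulSoup(html, "lxml")

# A tiny HTML shell with little visible text usually means true client-side rendering
print("HTML size:", len(html), "bytes")
print("Visible text length:", len(soup.get_text(strip=True)))

# Embedded JSON blobs mean the data is already in the source (see Approach 2 below)
for marker in ("__NEXT_DATA__", "window.__INITIAL_STATE__", "application/ld+json"):
    print(marker, "present:", marker in html)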

The order of operations:

| Priority | Approach | When It Works |
|---|---|---|
| 1 | Direct API/XHR calls | Site fetches data from accessible JSON endpoints |
| 2 | Static HTML parsing | Content is in the source HTML, not JS-rendered |
| 3 | Managed rendering API | True SPA with no accessible backend API |
| 4 | Self-hosted headless browser | When you need full control over complex interaction flows |

Only go down the list when the approach above fails.

Approach 1: Reverse-Engineer the Underlying API

This is the method that makes the rest unnecessary most of the time.

Open DevTools → Network → Filter by Fetch/XHR. Trigger whatever action loads the data you want (page load, scroll, button click). Look for requests that return JSON. Click the request and check Preview: if you see the data you need, you've found your target.

What you're looking for:

  • REST endpoints like https://api.example.com/v1/products?page=1

  • GraphQL requests to /graphql with a query field in the body

  • Pagination patterns in the URL or request body

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/json",
    "Referer": "https://example.com/products",
    "X-Requested-With": "XMLHttpRequest"
})

# Call the underlying API directly; no browser needed
resp = session.get("https://api.example.com/v1/products?category=123&page=1")
resp.raise_for_status()
data = resp.json()
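The pagination patterns you spotted in the Network tab usually translate into a loop over the same endpoint. A hedged sketch that continues the session above; the items key and the empty-page stop condition are assumptions you'd adapt to the real payload:

all_items = []
page = 1
while True:
    resp = session.get(
        "https://api.example.com/v1/products",
        params={"category": 123, "page": page},
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])  # assumed key; check the actual response shape
    if not items:
        break  # no more pages
    all_items.extend(items)
    page += 1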

For GraphQL:

payload = {
    "operationName": "ProductList",
    "variables": {"categoryId": 123, "page": 1},
    "query": "query ProductList($categoryId: ID!, $page: Int!) { products(categoryId: $categoryId, page: $page) { id name price } }"
}
resp = session.post("https://example.com/graphql", json=payload)
data = resp.json()["data"]["products"]

This approach is fast, cheap, and produces already-structured data. The main limitation is authentication: some APIs require tokens that are generated client-side, or are short-lived and tied to browser sessions. When you hit those walls, move to the next approach.

Approach 2: Parse Embedded JSON from the Source HTML

Before concluding a site requires rendering, check the raw HTML carefully. Many sites, particularly e-commerce platforms and news sites, embed the initial data state directly in the HTML as a JSON blob, often inside a <script> tag:

<script id="__NEXT_DATA__" type="application/json">
  {"props": {"pageProps": {"products": [...]}}}
</script>

Or in older patterns:

<script>window.__INITIAL_STATE__ = {"products": [...]};</script>

You can extract this with BeautifulSoup and the json module, no browser required:

import requests
import json
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products").text
soup = BeautifulSoup(html, "lxml")

# Common Next.js pattern
script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
if script_tag:
    data = json.loads(script_tag.string)
    products = data["props"]["pageProps"]["products"]

This catches a surprisingly large number of React and Next.js sites. Worth checking before spending credits on browser rendering.
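The older window.__INITIAL_STATE__ pattern has no convenient id attribute to target, so a regex over the raw HTML is the usual fallback. A minimal sketch, reusing the html variable from above and assuming the assignment ends in "};" (brittle if that sequence appears inside a string value):

import re
import json

match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
if match:
    state = json.loads(match.group(1))
    products = state.get("products", [])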

Approach 3: Use a Managed Rendering API

When the page genuinely requires JavaScript execution, the right move is to offload the browser to a managed API rather than run one yourself. Tools like ScrapeBadger, Firecrawl, and Browserless all operate browser infrastructure at scale and expose it via simple HTTP endpoints.

The economics are straightforward: you pay per-render instead of running your own Chrome fleet with its associated memory overhead, maintenance burden, and anti-bot failures.

How ScrapeBadger's Engine System Works

ScrapeBadger uses a tiered engine approach that avoids unnecessary rendering by default:

| Engine Tier | Description | Cost |
|---|---|---|
| HTTP | Fast HTTP request with Chrome TLS fingerprint | 1 credit |
| Browser | Full headless browser with JS rendering | 5 credits |
| Premium Browser | Real browser with advanced fingerprinting | 10 credits |

Setting engine: "auto" (the default) lets ScrapeBadger decide which tier to use. It starts with the cheapest method that can return the content, and only escalates when necessary. You only pay for the method that actually succeeds; costs are not cumulative.

Detecting Protection Before Scraping

Before scraping a target site, it's worth running a detection check to understand what you're dealing with. ScrapeBadger's POST /v1/web/detect endpoint scans a URL for active anti-bot systems before you spend credits on a scrape attempt:

curl -X POST "https://scrapebadger.com/v1/web/detect" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://target-site.com"}'

The response tells you exactly what you're up against:

{
  "antibot_systems": [
    {
      "system": "cloudflare_turnstile",
      "confidence": 0.92,
      "details": "Turnstile widget detected in page HTML"
    }
  ],
  "is_blocked": true,
  "blocking_type": "cloudflare",
  "recommendation": "Use browser engine with anti_bot enabled",
  "credits_used": 1
}

Detection results are cached per domain for 5 minutes, and cached lookups cost 0 credits, so running this check upfront is cheap. It tells you whether you need render_js: true or anti_bot: true before you commit to a more expensive scrape.

The Scrape Endpoint

The core endpoint is POST /v1/web/scrape. Here's the recommended workflow for scraping a dynamic page without defaulting to a browser:

# Let ScrapeBadger decide the right engine; don't force browser rendering
curl -X POST "https://scrapebadger.com/v1/web/scrape" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/js-heavy-page",
    "format": "markdown",
    "engine": "auto",
    "escalate": true,
    "retry_on_block": true
  }'

Key parameters worth knowing:

| Parameter | What It Does |
|---|---|
| engine: "auto" | Tries HTTP first, only escalates to browser if needed |
| escalate: true | Enables automatic escalation if HTTP gets blocked |
| render_js: false | Keeps you on the HTTP tier; set true only if JS is confirmed necessary |
| anti_bot: true | Adds anti-bot solving (+5 credits); avoids needing a manual browser setup |
| retry_on_block: true | Automatically retries if blocked, without you spinning up anything |
| max_cost: 10 | Credit budget cap; prevents unexpected escalation to expensive tiers |
| wait_for: ".product-list" | CSS selector to wait for; only activates a browser if needed |

The response includes an engine_used field that tells you which tier was actually used. If it returns "http", you never needed a browser at all, and you paid 1 credit instead of 5 or 10.

1. POST /v1/web/detect  →  Identify protection level (0 credits if cached)
2. POST /v1/web/scrape  →  engine: "auto" + escalate: true + anti_bot: true
3. Check engine_used    →  If "http", you never needed a browser at all

This workflow means you're always using the minimum necessary tier, and you're never managing browser infrastructure.
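In Python, that workflow is a short script. A sketch using only the endpoints and parameters described above; the response field names (is_blocked, engine_used) follow the examples earlier on this page:

import requests

API_KEY = "YOUR_API_KEY"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}
target = "https://example.com/js-heavy-page"

# 1. Detect what protection is in front of the target (cached results cost 0 credits)
detect = requests.post(
    "https://scrapebadger.com/v1/web/detect",
    headers=HEADERS,
    json={"url": target},
).json()

# 2. Scrape with auto engine selection, escalation, and a credit cap
scrape = requests.post(
    "https://scrapebadger.com/v1/web/scrape",
    headers=HEADERS,
    json={
        "url": target,
        "format": "markdown",
        "engine": "auto",
        "escalate": True,
        "anti_bot": detect.get("is_blocked", False),
        "retry_on_block": True,
        "max_cost": 10,
    },
).json()

# 3. engine_used tells you whether a browser was ever involved
print("engine used:", scrape.get("engine_used"))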

Handling Anti-Bot Protection Without a Browser

The most common reason teams reach for headless browsers is Cloudflare or similar protection. The assumption is that if you can't pass the challenge without a browser, you need to run one. That's increasingly not true.

Modern anti-bot APIs handle fingerprint spoofing, CAPTCHA solving, and challenge bypass internally: you just set anti_bot: true and the API handles the rest. For a detailed breakdown of how this works against specific systems, see our guide on how to bypass Cloudflare anti-bot protection.

What you do need to think about:

  • Throttle your requests. Even with anti-bot bypass, aggressive request patterns raise flags. Add delays between requests for high-protection targets.

  • Use geo-targeted proxies when needed. The country parameter in ScrapeBadger's scrape endpoint handles this: pass the ISO code for the target market.

  • Cache aggressively. Don't re-scrape unchanged pages. Hash the content on first fetch, store the result, and only re-scrape when you detect a change (see the sketch below).
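A minimal sketch of that caching idea, hashing page content to decide whether anything changed (an in-memory dict stands in for whatever store you actually use):

import hashlib

seen_hashes = {}  # url -> sha256 of the last content we processed

def has_changed(url: str, content: str) -> bool:
    # Return True if this URL's content differs from what we last saw
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if seen_hashes.get(url) == digest:
        return False  # unchanged; skip re-processing
    seen_hashes[url] = digest
    return True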

If you're building ongoing monitoring rather than one-off scraping, pairing this with how to monitor website changes automatically will save you significant cost and complexity.

What Each Approach Actually Costs

The cost difference between these approaches is significant at scale:

| Method | Per-page cost | Infra overhead | Anti-bot handling |
|---|---|---|---|
| Direct API/XHR calls | Minimal (just requests) | None | Handled by headers/cookies |
| Static HTML parsing | Minimal | None | Basic |
| Managed API (HTTP tier) | ~$0.01 | None (outsourced) | Built-in |
| Managed API (Browser tier) | ~$0.05 | None (outsourced) | Built-in |
| Self-hosted headless browser | Low per-page | High (Chrome fleet) | DIY |

At 10,000 pages per day, the difference between HTTP tier and browser tier is roughly $40/month vs $400/month. If 80% of your target pages don't need rendering, that's a significant waste if you default everything to browser mode.

When You Actually Need a Headless Browser

There are legitimate cases where managed rendering APIs won't get you there and you need full browser control:

  • Complex multi-step interaction flows: filling forms, navigating wizard-style UIs, triggering specific user events in sequence

  • Sites with obfuscated tokens, where auth tokens are generated by client-side JavaScript in ways that can't be reverse-engineered

  • Testing pipelines, where you're validating real user behavior, not just extracting data

In those cases, Playwright is the current standard. But even then, you can outsource the browser infrastructure to services like Browserless rather than running your own fleet, and integrate them into your Scrapy or Python pipeline via their /content API endpoint.
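As a rough illustration, Browserless exposes rendered HTML over plain HTTP, so even the full-browser case can stay out of your infrastructure. A hedged sketch of its /content endpoint; the host, token handling, and payload shape should be checked against their current docs:

import requests

BROWSERLESS_TOKEN = "YOUR_TOKEN"  # placeholder

resp = requests.post(
    f"https://chrome.browserless.io/content?token={BROWSERLESS_TOKEN}",
    json={"url": "https://example.com/js-heavy-page"},
    timeout=60,
)
resp.raise_for_status()
rendered_html = resp.text  # fully rendered HTML, ready for BeautifulSoup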

FAQ

Do I always need a browser to scrape a React or Next.js site?

Not at all. Many React and Next.js sites embed their initial data in a <script id="__NEXT_DATA__"> tag in the raw HTML. BeautifulSoup can parse that without any rendering. Even when they don't, the data often comes from an accessible JSON API. Check the Network tab before assuming you need rendering.

What's the difference between render_js: true and escalate: true?

render_js: true forces browser rendering immediately, regardless of whether it's necessary. escalate: true with engine: "auto" lets ScrapeBadger try HTTP first and only escalate to a browser if the HTTP request fails or gets blocked. If your goal is to avoid unnecessary browser costs, use escalate: true rather than forcing render_js: true.

How do I know if a site needs JS rendering or if I can use direct API calls?

Open DevTools → Network → filter by Fetch/XHR → reload the page. If you see requests returning JSON with the data you want, you can call those APIs directly. If the DOM is built entirely by JavaScript with no accessible data endpoints, you need rendering.

Is using a managed rendering API cheaper than running Puppeteer myself?

At small scale, self-hosted is cheaper on a pure per-page basis. But the operational costs (Chrome container management, memory limits, anti-bot updates, scaling) add up quickly. At 50k+ pages per day, managing your own fleet typically requires dedicated engineering time. Managed APIs remove that overhead entirely.

What happens if escalate: true keeps escalating to the most expensive tier?

Use the max_cost parameter to set a credit budget cap. If a page would require more credits than your cap to scrape, the request will fail rather than escalate. This prevents unexpected charges on sites that require Premium Browser rendering.

Can I scrape pages behind login without a headless browser?

Sometimes. If the site uses cookie-based sessions with standard authentication, you can replicate the login request, capture the session cookie, and pass it in subsequent requests. For pages with more complex auth flows (OAuth, device fingerprinting, JS-generated tokens), you may need browser-level handling. See our guide on scraping websites behind login for a more detailed breakdown.
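For the simple cookie-based case, a hedged sketch; the login URL and form field names are assumptions you'd copy from the actual login request in DevTools:

import requests

session = requests.Session()

# Replicate the login form POST captured in the Network tab
login = session.post(
    "https://example.com/login",
    data={"username": "you@example.com", "password": "..."},
)
login.raise_for_status()

# The session cookie is stored on the Session object and sent automatically afterwards
resp = session.get("https://example.com/account/orders")
print(resp.status_code)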

Why does engine_used matter in the response?

It's the feedback loop that tells you whether you're over-specifying. If you set engine: "auto" and consistently get engine_used: "http" back, you know you're paying 1 credit per page and no browser is involved. If you see "browser" coming back for pages you expected to be simple, that's a signal to investigate the target site's protection setup, or to add those URLs to a different scraping strategy.

Written by Thomas Shultz

Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.
