How to Scrape Websites Without Getting Blocked (The Complete 2026 Guide)

Getting blocked is the most common — and most frustrating — problem in web scraping. You write a scraper, it works for a few minutes, then you're staring at a 403 Forbidden, an empty response, or a Cloudflare challenge page. You rotate your IP. You get blocked again. You add delays. Still blocked.
The reason most scrapers fail isn't bad code. It's an incomplete understanding of what's detecting them. Modern anti-bot systems don't block you because you sent a request — they block you because dozens of small signals, combined, revealed that you're not a human browsing in a real browser.
At ScrapeBadger, we've spent years building infrastructure that handles this at scale across thousands of sites, including Zillow, Rightmove, LinkedIn, and protected e-commerce platforms. This guide shares exactly what we've learned — the real detection layers, the techniques that work, and when to stop building and start using production-ready infrastructure.
What's Actually Detecting You: The Six Layers of Anti-Bot Systems
Before you can avoid being blocked, you need to understand what you're up against. Most developers think about IP blocking — but modern systems like Cloudflare, Imperva, DataDome, and PerimeterX run six distinct detection layers simultaneously. You can fix one and fail on another.
Layer 1: IP Reputation
The first thing every anti-bot system checks is your IP address. Datacenter IPs — addresses allocated to AWS, Google Cloud, Hetzner, OVH — are pre-flagged on most serious sites. Anti-bot vendors maintain constantly updated databases of ASN ranges associated with hosting providers. Your Python requests call from a DigitalOcean server gets blocked before a single line of your scraping logic even runs.
Residential IPs, assigned by ISPs to real home users, carry far more trust. They come with natural network characteristics — variable latency, jitter, diverse TCP behaviours — that make detection far harder. This is why the choice between datacenter and residential proxies isn't just a cost question; it's a question of whether your requests reach their destination at all.
Layer 2: TLS Fingerprinting (JA3/JA4)
This is the detection layer most developers don't know about — and it's why rotating proxies alone often isn't enough.
When your scraper makes an HTTPS request, it sends a ClientHello message to the server as part of the TLS handshake. This message contains cipher suites, TLS extensions, elliptic curve parameters, and protocol version. Together, these are hashed into a 32-character signature called a JA3 fingerprint.
The JA3 fingerprint for Python's requests library is 8d9f7747675e24454cd9b7ed35c58707. Every major anti-bot system knows this signature. Your request is flagged as automated before any application data is even exchanged — before headers, before cookies, before any content.
A newer standard, JA4, extends this analysis to capture more handshake details and is harder to spoof. The practical consequence: even with perfect headers and residential proxies, your scraper will be blocked if its TLS fingerprint doesn't match a real browser.
The fix is to use an HTTP client that mimics a real browser's TLS handshake. The curl_cffi library in Python is the current best-in-class solution for this:
```python
from curl_cffi import requests

session = requests.Session()
response = session.get(
    "https://target-site.com/data",
    impersonate="chrome120",  # Matches Chrome 120's exact TLS profile
)
print(response.status_code)  # 200, not 403
```

ScrapeBadger handles TLS fingerprinting at the infrastructure level — every request is sent with an authentic browser-matching TLS profile, automatically. You don't configure it; it just works. See the ScrapeBadger documentation for details on what's handled under the hood.
Layer 3: HTTP Header Analysis
Even with the right TLS fingerprint and a clean residential IP, your headers can expose you. Anti-bot systems check:
Header ordering — real browsers send headers in a specific sequence; automated clients often don't
User-Agent consistency — claiming to be Chrome 120 while sending headers that Chrome 120 would never send is an immediate flag
Missing headers — real browser requests include Accept-Language, Accept-Encoding, Sec-Ch-Ua, Sec-Fetch-Site, and others; raw HTTP clients omit most of these
Inconsistent values — a Windows Chrome User-Agent combined with a macOS-specific header value is a contradiction that signals automation
A real Chrome 120 request looks like this:
```http
GET /data HTTP/2
Host: example.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Sec-Ch-Ua: "Google Chrome";v="120", "Chromium";v="120", "Not_A Brand";v="24"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
```

A requests.get("https://example.com") sends about five of these twenty-something headers. The gap is obvious to any half-decent anti-bot system.
Layer 4: Browser Fingerprinting (JavaScript)
For sites that load JavaScript — which is most modern sites — fingerprinting extends deep into the browser environment. Anti-bot scripts collect hundreds of signals:
navigator.webdriver — set to true in vanilla headless Playwright/Selenium; real browsers never expose this
WebGL renderer — headless Chrome returns a software renderer; real Chrome returns a GPU string like ANGLE (NVIDIA, NVIDIA GeForce RTX 3080 Direct3D11 vs_5_0 ps_5_0, D3D11)
Canvas fingerprint — each browser/OS combination renders a canvas slightly differently; the hash is a reliable device identifier
Screen dimensions — a 1920×1080 screen with realistic window chrome is plausible on a real device; headless Chrome's default 800×600 window is not
Installed fonts, audio context, battery API, media devices — each adds signal to the fingerprint puzzle
Standard headless browsers leak all of these. Tools like playwright-stealth, Camoufox, and undetected-chromedriver patch the most obvious leaks — but the patching is an ongoing arms race. Every update to Cloudflare or DataDome can invalidate patches that worked last month.
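To illustrate what these stealth tools patch, here is the single most basic fix — hiding navigator.webdriver via Playwright's add_init_script. This is a sketch only: real anti-bot scripts check far more than this one property, and the patch below is just the first of dozens that tools like playwright-stealth apply.

```python
# JS run in every new page before the site's own scripts execute:
# it makes navigator.webdriver read as undefined, as in a real,
# non-automated Chrome. Stealth tools patch many more properties.
WEBDRIVER_PATCH = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

def apply_basic_stealth(context):
    """Apply the patch to a Playwright BrowserContext.

    `context` is the object returned by browser.new_context();
    add_init_script schedules the JS ahead of any page script.
    """
    context.add_init_script(WEBDRIVER_PATCH)
```

Because the arms race moves constantly, treat hand-rolled patches like this as a starting point, not a durable solution.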
Layer 5: Behavioural Analysis
Anti-bot systems don't just look at individual requests — they analyse sessions over time. Red flags include:
Requests arriving at perfectly regular intervals (real users don't browse at exactly 2.0 seconds per page)
Navigation patterns that skip content or jump straight to high-value pages
Mouse movements that follow perfectly straight lines or don't exist at all
Forms filled in at typing speeds no human achieves
Sessions that access 500 product pages in ten minutes without pausing to read any of them
DataDome, in particular, runs ML models that track session-level behaviour and build intent profiles. By 2026, their system processes over 5 trillion signals per day and responds in under 2 milliseconds. Intent-based detection — not "is this a bot?" but "what is this visitor trying to accomplish?" — is the current frontier.
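When you do drive a real browser, even simple curved mouse paths help against trajectory analysis. The sketch below generates jittered points along a quadratic Bezier curve instead of jumping straight to a target — the curve shape and jitter ranges are illustrative choices, not a known-good evasion recipe:

```python
import random

def human_mouse_path(start, end, steps=25):
    """Generate intermediate points along a slightly curved, jittered
    path between two screen coordinates. Straight, instantaneous mouse
    jumps are a classic automation tell."""
    (x0, y0), (x1, y1) = start, end
    # A random control point bows the path so it is never a straight line
    cx = (x0 + x1) / 2 + random.uniform(-100, 100)
    cy = (y0 + y1) / 2 + random.uniform(-100, 100)
    points = []
    for i in range(1, steps + 1):
        t = i / steps
        # Quadratic Bezier interpolation with small per-point jitter
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        points.append((x + random.uniform(-2, 2), y + random.uniform(-2, 2)))
    return points

# With Playwright, replay the path instead of jumping to the element:
# for x, y in human_mouse_path((200, 300), (640, 480)):
#     page.mouse.move(x, y)
#     page.wait_for_timeout(random.uniform(5, 20))
```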
Layer 6: Honeypot Traps
A simpler but still common technique: sites embed invisible links or form fields that only a scraper would ever touch. A real human can't see a white link on a white background — but a scraper that follows every <a href> on the page will navigate to it, immediately flagging the session.
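A simple defence is to filter out links a human could not see before following them. Here is a stdlib sketch using inline-style heuristics — production code would evaluate computed styles in a real browser, since honeypots are usually hidden via external CSS:

```python
from html.parser import HTMLParser

# Inline-style values that make a link invisible to a human reader.
HIDDEN_MARKERS = ("display:none", "visibility:hidden")

class VisibleLinkExtractor(HTMLParser):
    """Collect hrefs while skipping links hidden via inline styles."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").lower().replace(" ", "")
        if any(marker in style for marker in HIDDEN_MARKERS):
            return  # invisible to a human: likely a honeypot
        if attrs.get("href"):
            self.links.append(attrs["href"])

page = '<a href="/products">Products</a><a href="/trap" style="display: none">x</a>'
extractor = VisibleLinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/products']
```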
The Techniques That Actually Work
Understanding the detection layers tells you what to fix. Here's how to fix each one.
1. Use Residential Proxies — And Rotate Them Correctly
Residential proxies route your requests through IP addresses assigned by ISPs to real home users. Anti-bot systems trust them significantly more than datacenter IPs. For any site with meaningful bot protection, residential proxies are not optional — they're the baseline requirement.
But rotation strategy matters as much as proxy type. Session stickiness — keeping the same IP for several related requests — often works better than rotating on every request. A real user doesn't change their IP between clicking a product listing and adding it to a cart. Rotating too aggressively can itself be a detection signal.
Use residential proxies with session-based stickiness for authenticated flows and multi-step navigation. Reserve aggressive rotation for bulk, stateless scraping.
Mobile proxies (IPs from cellular networks) offer the highest trust level of all — mobile towers share and recycle IPs across hundreds of users, making it extremely costly for anti-bot systems to block mobile IPs without massive false-positive rates.
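Session stickiness is usually configured through the proxy username. The sketch below shows the pattern with requests — the user-session-&lt;id&gt; username format and proxy host are hypothetical placeholders; every provider documents its own syntax:

```python
import random
import string

import requests

def sticky_proxy(session_id, host="proxy.example-provider.com", port=10000):
    """Build a proxy config that pins all requests carrying the same
    session_id to one residential exit IP. The username format here
    is a hypothetical example -- check your provider's docs."""
    proxy = f"http://user-session-{session_id}:password@{host}:{port}"
    return {"http": proxy, "https": proxy}

# One sticky session for a multi-step flow (listing -> product -> cart)
session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
session = requests.Session()
session.proxies.update(sticky_proxy(session_id))
# session.get("https://target-site.com/product/123")  # all steps share one IP
```

Generate a fresh session_id for each logical browsing session, not each request — that is exactly the stickiness the section above describes.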
2. Fix Your TLS Fingerprint
As explained above, Python's requests has a known-bad JA3 fingerprint. The two solutions:
Option A: curl_cffi for HTTP scraping
```python
from curl_cffi import requests

# Impersonate Chrome's exact TLS profile
response = requests.get(
    "https://example.com",
    impersonate="chrome120",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
        "Accept-Language": "en-US,en;q=0.9",
    },
)
```

Option B: Full browser automation with Playwright
For JavaScript-heavy sites, use Playwright with a real browser instance. Playwright's Chromium uses the same TLS stack as real Chrome, so fingerprinting matches automatically:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = context.new_page()
    page.goto("https://example.com")
    data = page.inner_text(".product-price")
```

3. Send Complete, Consistent Browser Headers
Replicate the full header set a real browser sends. The key headers to include:
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Sec-Ch-Ua": '"Google Chrome";v="120", "Chromium";v="120", "Not_A Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Upgrade-Insecure-Requests": "1",
    "Connection": "keep-alive",
}
```

Make sure User-Agent, Sec-Ch-Ua, and platform values are internally consistent — claiming Chrome 120 on Windows with a macOS platform string is caught immediately.
4. Add Realistic Request Timing
Bots move at machine speed. Humans don't. Add randomised delays that mimic real reading and navigation behaviour:
```python
import time
import random

def human_delay(min_seconds=1.0, max_seconds=4.0):
    """Simulate human reading/browsing time between requests."""
    delay = random.uniform(min_seconds, max_seconds)
    # Occasional longer pauses (simulating reading content)
    if random.random() < 0.1:  # 10% chance of a longer pause
        delay += random.uniform(2.0, 8.0)
    time.sleep(delay)

for url in product_urls:
    response = session.get(url, headers=headers)
    process_page(response)
    human_delay()  # Never scrape at machine speed
```

For production pipelines, this means accepting that scraping takes longer than it theoretically could. That's the correct trade-off. A 3-second delay per page that reliably returns data beats a 0.1-second delay that gets blocked after 50 requests.
5. Maintain Session State
Stateless scraping — creating a new session for every request — looks nothing like real browsing. Real users accumulate cookies over time, have browser history, and maintain authenticated sessions.
Always use a Session object in Python's requests, or a persistent browser context in Playwright. This ensures cookies set by the site on early requests (including anti-bot tracking cookies) are carried on subsequent ones — exactly as a real browser would:
```python
import requests

session = requests.Session()
session.headers.update(headers)  # Your full browser header set

# First request sets tracking cookies
session.get("https://example.com/")

# Subsequent requests carry those cookies automatically
response = session.get("https://example.com/products/")
```

Saving and reusing sessions across runs is even better — it builds a history that looks like a returning user. See the session-based scraping guide on the ScrapeBadger blog for full implementation patterns including cookie persistence and session refresh handling.
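Cross-run persistence can be sketched with requests and pickle — the file name is arbitrary, and production code would also handle expired or invalidated cookies:

```python
import pickle
from pathlib import Path

import requests

COOKIE_FILE = Path("session_cookies.pkl")

def load_session(headers):
    """Restore a previous run's cookies so the site sees a returning user."""
    session = requests.Session()
    session.headers.update(headers)
    if COOKIE_FILE.exists():
        with COOKIE_FILE.open("rb") as f:
            session.cookies.update(pickle.load(f))
    return session

def save_session(session):
    """Persist cookies (including anti-bot tracking cookies) for the next run."""
    with COOKIE_FILE.open("wb") as f:
        pickle.dump(session.cookies, f)
```

Call load_session at startup and save_session after each run; the accumulated cookie jar is what makes the session look like a browser with history.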
6. Respect robots.txt and Rate Limits
This isn't just an ethical point — it's a practical one. Sites that detect high-volume scraping respond with increasingly aggressive countermeasures: CAPTCHAs, honeypots, behavioural scoring. Sites that never feel scraped never escalate their defences.
Check robots.txt before building any scraper. Honour Crawl-delay directives. Don't scrape the same URL more frequently than it changes. For most business use cases — competitor pricing, market research, lead generation — data that's an hour old is just as useful as data that's a minute old, and hourly scraping at polite rates is orders of magnitude more sustainable than minute-by-minute scraping that triggers blocks.
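Python's standard library can parse robots.txt and expose both the allow rules and Crawl-delay. A sketch using urllib.robotparser on a robots.txt body you have already fetched (the example rules are illustrative):

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt, base_url):
    """Parse a robots.txt body fetched earlier (e.g. with your session)."""
    rp = RobotFileParser(url=base_url + "/robots.txt")
    rp.parse(robots_txt.splitlines())
    return rp

robots = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""
rp = build_parser(robots, "https://example.com")
print(rp.can_fetch("my-scraper", "https://example.com/private/page"))  # False
print(rp.crawl_delay("my-scraper"))  # 10
```

Gate every URL through can_fetch before requesting it, and use crawl_delay as the floor for your randomised delays.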
The Anti-Bot Systems You'll Encounter (And How They Differ)
Not all anti-bot protection is equal. The specific system protecting a site changes your approach.
Cloudflare is the most common. It uses JavaScript challenges (Turnstile), TLS fingerprinting, behavioural scoring, and browser fingerprinting. Cloudflare's weakest point is that it's widely deployed and must accommodate legitimate automated access (search crawlers, monitoring tools), so it has relatively high false-positive sensitivity. Good residential proxies plus correct TLS fingerprinting gets you past standard Cloudflare protection on most sites.
Imperva (Incapsula) is what protects Zillow, Glassdoor, and many financial platforms. It combines IP reputation, TLS fingerprinting, and sophisticated JavaScript challenges. As of 2026, Imperva has significantly expanded its IP reputation databases, making datacenter proxy detection near-certain. Residential or mobile proxies are mandatory. Our real estate scraping guide covers Zillow's Imperva protection in detail.
DataDome is the hardest to bypass manually. It runs ML models that process over 5 trillion signals per day and update detection in real time. Intent-based detection means even a perfect TLS fingerprint and residential IP can be flagged if your session's navigational patterns don't match how humans browse that particular site. DataDome-protected sites (Leboncoin, Tripadvisor, many ticketing platforms) require either very sophisticated browser automation with genuine behavioural simulation, or infrastructure-level bypass.
PerimeterX (HUMAN Security) is common on retail and e-commerce sites. It uses biometric behavioural analysis — mouse trajectories, typing rhythms, scroll patterns — to distinguish humans from bots. Standard rate limiting and proxy rotation do little against it. It requires real browser automation with genuine interaction patterns.
For a deeper treatment of how these systems compare and what each requires to bypass, the ScrapeBadger blog covers each major anti-bot system and the infrastructure required to handle them reliably.
The Decision Point: Build vs. Use Infrastructure
The techniques above work. But there's a practical ceiling to how far you can go with DIY bypass before the engineering cost exceeds the value of the data.
Building a production-grade scraper for an Imperva-protected site requires:
A residential proxy pool with session stickiness (ongoing cost)
Correct TLS fingerprinting via curl_cffi or Playwright
Complete browser header sets, consistently maintained
Behavioural simulation (delays, mouse movement, scroll patterns)
CAPTCHA solving integration
Session management with expiry detection
Monitoring for when any of the above breaks (it will)
Maintaining that stack as anti-bot vendors update their detection — which happens continuously — is itself a part-time engineering job.
At some point the question isn't "can I build this?" but "should I?"
ScrapeBadger handles all of this at the infrastructure layer. Every request routes through residential proxies with correct TLS fingerprinting, complete browser headers, behavioural simulation, and automatic CAPTCHA handling. When DataDome or Imperva push an update, our infrastructure adapts — you don't. Your code stays the same; the bypass keeps working.
The ScrapeBadger API documentation shows exactly how to integrate: send a URL, get back clean data. All the complexity described in this guide — proxy selection, TLS fingerprinting, header management, behavioural simulation — happens automatically between your request and the response.
For teams running AI agents that need live web data, the MCP integration connects ScrapeBadger's infrastructure directly to any MCP-compatible agent (Claude, Cursor, Windsurf) — no code required. See the MCP documentation for setup instructions.
Quick Reference: Which Technique Fixes Which Block
| Symptom | Likely Cause | Fix |
|---|---|---|
| Immediate 403 or empty response | TLS fingerprint blocked | Use curl_cffi impersonation or Playwright |
| Blocked after 10–20 requests | IP rate limiting | Add delays + residential proxy rotation |
| CAPTCHA on every request | Datacenter IP flagged | Switch to residential/mobile proxies |
| Blocked despite residential proxies | Browser fingerprint detected | Playwright with stealth patches |
| Works once, blocked on return visits | Session not maintained | Use a persistent Session with cookies |
| Random blocks at irregular intervals | Behavioural analysis | Randomise delays, add human-like patterns |
| Some pages work, specific pages don't | Honeypot | Review all links before following |
Is This Legal?
Scraping publicly visible data is generally lawful in the US and EU. The landmark hiQ v. LinkedIn ruling (affirmed by the Ninth Circuit in 2022) confirmed that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act. The web scraping market reached $1.03 billion in 2026 — this is established business practice, not a grey area for most publicly visible data.
The practical rules: scrape only data that's publicly visible without logging in, honour rate limits and robots.txt directives, don't resell raw data commercially without proper agreements, and apply GDPR/CCPA standards to any personal data you collect. For login-protected content, the session-based scraping guide covers the legal landscape in more detail.
The Complete Anti-Block Checklist
Before deploying any scraper, run through this list:
Infrastructure:
Residential or mobile proxies (not datacenter)
Session-based IP stickiness for multi-step flows
TLS fingerprint matching a real browser (curl_cffi or Playwright)
Headers:
Complete browser header set (15+ headers)
Internally consistent User-Agent, platform, and browser version
HTTP/2 transport (not HTTP/1.1)
Behaviour:
Randomised delays between requests (1–5 seconds baseline)
Occasional longer pauses simulating reading time
Session cookies persisted across requests
Validation:
Verified robots.txt before starting
Tested against the target site's specific anti-bot system
Error handling that detects blocks and backs off (don't retry immediately)
Monitoring for success rate degradation over time
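The "detects blocks and backs off" item can be sketched as a small helper pair. The status codes and block markers below are illustrative starting points; tune them to the challenge pages your target site actually serves:

```python
import random

BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")

def looks_blocked(response):
    """Heuristic block detection: hard status codes plus challenge-page text."""
    if response.status_code in (403, 429, 503):
        return True
    return any(marker in response.text.lower() for marker in BLOCK_MARKERS)

def backoff_delay(attempt, base=5.0, cap=300.0):
    """Exponential backoff with jitter: never hammer a site that just
    blocked you -- immediate retries confirm you're a bot."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.5)
```

On a detected block, sleep for backoff_delay(attempt), rotate the proxy session, and only then retry.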
Hitting every point on this list will get you through most commercial-grade anti-bot protection. For DataDome and Kasada on heavily protected sites, move to production infrastructure — the manual bypass complexity isn't worth the maintenance burden.
Web scraping in 2026 is absolutely viable. The tools and techniques exist to get clean data from almost any public website. The difference between scrapers that work and scrapers that fail is almost always understanding of the detection layers — not the complexity of the scraping logic itself.
Fix the layer that's detecting you. Use residential proxies. Match TLS fingerprints. Send real headers. Behave like a human. And when the target site is one of the ones where manual bypass isn't worth the engineering overhead, ScrapeBadger's infrastructure handles it for you.

Written by
Thomas Shultz
Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.