
Python Web Scraping Tutorial: Complete Guide with Code Examples (2026)

Thomas Shultz
26 min read

Web scraping with Python is one of the most powerful skills a developer can learn. Whether you are building a competitive intelligence dashboard, training a machine learning model, monitoring prices, or automating lead generation, Python provides the most mature and versatile ecosystem for extracting data from the web at scale.

However, the web has changed significantly in recent years. Traditional scraping methods that worked reliably in 2020 often fail today due to heavy JavaScript rendering, aggressive anti-bot protections, and increasingly complex authentication flows. This tutorial is written for 2026, covering the tools and techniques that actually work in production.

By the end of this guide, you will understand not just how to write a scraper, but which tools to choose for different scenarios, how to handle the most common failure modes, and when it makes sense to offload the heavy lifting to a dedicated web scraping API like ScrapeBadger.

Table of Contents

  1. Why Use Python for Web Scraping?

  2. How Web Scraping Works: The Basics

  3. Python Web Scraping Libraries Compared

  4. Step 1: Fetching a Page with Requests

  5. Step 2: Parsing HTML with BeautifulSoup

  6. Step 3: Advanced Parsing with lxml and XPath

  7. Step 4: Handling Dynamic Pages with Playwright

  8. Step 5: Building a Large-Scale Crawler with Scrapy

  9. Step 6: Scraping at Scale with Asyncio and HTTPX

  10. Step 7: A Complete Real-World Scraper (Pagination + Export)

  11. Step 8: Handling Anti-Bot Protection

  12. Step 9: AI-Powered Extraction

  13. When to Use a Scraping API Instead of DIY

  14. Common Errors and How to Fix Them

  15. Frequently Asked Questions

1. Why Use Python for Web Scraping?

Python is the undisputed king of web scraping, and for good reason. Its clean, readable syntax allows you to build functional scrapers in minutes rather than hours. Its massive ecosystem provides a library for every scraping challenge, from simple HTTP requests to full browser automation. And once you have extracted the data, Python's data science stack — Pandas, NumPy, SQLAlchemy — makes it trivial to clean, analyze, and store your findings.

Here is a summary of why Python dominates this space:

| Advantage | Why It Matters |
| --- | --- |
| Simple syntax | Readable code means faster debugging and easier maintenance |
| Massive library ecosystem | A purpose-built library exists for every scraping challenge |
| Data processing power | Pandas, NumPy, and SQLAlchemy integrate seamlessly with scraped data |
| Community & documentation | Virtually every scraping problem has been solved and documented online |
| Cross-platform | Works identically on Windows, macOS, and Linux |

While languages like Node.js and Go have their place in web scraping, Python remains the most accessible and versatile choice for both beginners and experienced engineers.

2. How Web Scraping Works: The Basics

Before writing any code, it is crucial to understand the fundamental mechanics of web scraping. At its core, the process involves three steps:

Step 1 — Requesting Data: Your scraper sends an HTTP request to a web server, asking for the content at a specific URL. The server processes this request and sends back a response.

Step 2 — Parsing Data: The server's response (usually HTML) is parsed by your scraper to locate the specific data points you need. This is done by navigating the HTML tree structure and targeting elements by their tags, classes, or IDs.

Step 3 — Storing Data: The extracted data is cleaned and saved in a structured format such as a CSV file, a JSON document, or a relational database.
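The three steps can be sketched end to end before we introduce any scraping libraries. The snippet below is a minimal preview using only the standard library, with a hardcoded HTML string standing in for the HTTP response of Step 1 (in practice that step is a `requests.get()` call, covered next):

```python
import csv
import io
from html.parser import HTMLParser

# Step 2 - a tiny parser that collects the text inside <h3> tags.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3 and data.strip():
            self.titles.append(data.strip())

def parse(html: str) -> list[dict]:
    parser = TitleParser()
    parser.feed(html)
    return [{"title": t} for t in parser.titles]

def store(rows: list[dict]) -> str:
    # Step 3 - serialise the rows as CSV (here into a string buffer).
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Step 1 would normally fetch this HTML from a server.
sample_html = "<html><body><h3>A Light in the Attic</h3><h3>Tipping the Velvet</h3></body></html>"
rows = parse(sample_html)
print(store(rows))
```

The rest of the tutorial replaces each of these hand-rolled pieces with a proper library: requests for fetching, BeautifulSoup or lxml for parsing, and csv/json for storage.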

Static vs. Dynamic Websites

The most important distinction in modern web scraping is between static and dynamic websites.

Static websites return the complete HTML document in the initial server response. All the data you see on the page is present in the source code from the very first request. These sites are fast and easy to scrape using lightweight libraries like requests and BeautifulSoup.

Dynamic websites return a minimal HTML shell along with JavaScript files. The browser must execute this JavaScript to fetch additional data from backend APIs and render the final content on the page. If you send a plain HTTP request to a dynamic site, you will receive the empty shell — not the data you see in your browser. To scrape these sites, you need a tool that can execute JavaScript, such as Playwright or Selenium.

To determine whether a site is static or dynamic, open it in your browser, right-click, and select "View Page Source." If the data you want to scrape is visible in the raw source code, the site is static. If the source code is mostly empty <div> containers, the site is dynamic.
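The same check can be automated. This is a small sketch: the pure function tests whether text you can see in the rendered page appears in the raw source, and the commented lines show how you would feed it a real response fetched with requests:

```python
def is_in_raw_source(raw_html: str, visible_text: str) -> bool:
    # If text you can see in the rendered page is missing from the raw
    # HTML, the content is built by JavaScript and the site is dynamic.
    return visible_text in raw_html

# Programmatic "View Page Source" check (requires the requests library):
# import requests
# raw = requests.get("https://books.toscrape.com/", timeout=10).text
# print(is_in_raw_source(raw, "A Light in the Attic"))
```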

Using Browser Developer Tools

Your browser's Developer Tools (DevTools) are your most important instrument when building a scraper. They allow you to inspect the HTML structure of a page, identify the CSS classes and IDs that contain your target data, and monitor the network requests the browser makes.

To open DevTools in Chrome or Firefox, right-click anywhere on a webpage and select Inspect, or press Ctrl+Shift+I on Windows/Linux or Cmd+Option+I on macOS. The Elements tab shows the live HTML structure, while the Network tab lets you observe every HTTP request the browser makes — including the hidden API calls that dynamic sites use to load their data.

3. Python Web Scraping Libraries Compared

The Python ecosystem offers a wide array of scraping libraries. Choosing the right tool for the job is critical for building efficient and reliable scrapers. Here is a comprehensive comparison of the most important options in 2026:

| Library | Primary Use Case | Difficulty | Speed | JavaScript Support |
| --- | --- | --- | --- | --- |
| Requests | Simple HTTP requests | Beginner | Fast | No |
| HTTPX | Modern async HTTP client | Intermediate | Very Fast | No |
| BeautifulSoup | HTML parsing and data extraction | Beginner | Fast | No |
| lxml | High-performance HTML/XML parsing with XPath | Intermediate | Very Fast | No |
| Playwright | Headless browser automation (modern) | Advanced | Slow | Yes |
| Selenium | Headless browser automation (legacy) | Advanced | Slow | Yes |
| Scrapy | Large-scale concurrent web crawling | Advanced | Very Fast | No (via plugins) |

When to Use Which Tool

The right choice depends entirely on your target website and the scale of your project:

  • For beginners and static sites: Use requests to fetch the page and BeautifulSoup to parse the HTML. This is the easiest stack to learn and works perfectly for simple websites.

  • For dynamic sites with JavaScript: Use Playwright. It is faster, more reliable, and has a cleaner API than the older Selenium library.

  • For large-scale crawling: Use Scrapy. It provides a structured framework for managing concurrent requests, handling pagination, and exporting data at scale.

  • For production environments with anti-bot protection: Use a scraping API like ScrapeBadger. It handles proxy rotation, JavaScript rendering, and anti-bot bypass automatically, saving you countless hours of infrastructure maintenance.

4. Step 1: Fetching a Page with Requests

Let's start by building a simple scraper using the requests library. Our target throughout this tutorial will be Books to Scrape, a public sandbox designed specifically for practicing web scraping. It has pagination, prices, ratings, and categories — making it a perfect real-world stand-in.

First, install the requests library:

pip install requests

Now, let's write a script to fetch the homepage:

import requests

url = "https://books.toscrape.com/"

response = requests.get(url)

if response.status_code == 200:
    print("Successfully fetched the page!")
    print(f"Content length: {len(response.text)} characters")
else:
    print(f"Failed. Status code: {response.status_code}")

Understanding HTTP Headers

When your browser visits a website, it sends additional metadata called HTTP headers alongside the request. These headers tell the server about your browser, operating system, and accepted content types. Many websites block requests that lack standard headers, as this is a clear indicator of automated bot traffic.

The most important header to include is the User-Agent, which identifies your client as a legitimate web browser:

import requests

url = "https://books.toscrape.com/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64 ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

response = requests.get(url, headers=headers)
print(f"Status Code: {response.status_code}")

Handling Sessions and Cookies

For websites that require login or maintain state between requests (such as shopping carts), you should use requests.Session(). A session object automatically stores cookies returned by the server and sends them with every subsequent request, perfectly replicating browser behaviour.

import requests

session = requests.Session()

# Set default headers for all requests in this session
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})

# The session will automatically handle cookies
response = session.get("https://books.toscrape.com/")
print(f"Cookies received: {dict(session.cookies)}")
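For a site that requires login, the same session posts the credentials once and then reuses the stored cookie on every later request. The endpoint URL and form-field names below are hypothetical placeholders; inspect the real login form in DevTools to find the actual action URL and input names:

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

# Hypothetical endpoint and field names - copy the real ones from the
# login <form> in DevTools (the action URL and the input name attributes).
login_payload = {"username": "your_user", "password": "your_pass"}
# session.post("https://example.com/login", data=login_payload)

# Once logged in, the session cookie is sent automatically:
# profile_page = session.get("https://example.com/account")
```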

5. Step 2: Parsing HTML with BeautifulSoup

Now that we have the raw HTML, we need to extract specific data from it. BeautifulSoup creates a parse tree from the HTML document, allowing you to navigate and search it using Python.

pip install beautifulsoup4 lxml

Inspecting the Books to Scrape homepage with DevTools reveals that each book is contained within an <article> tag with the class product_pod. Let's extract the title, price, and star rating for every book on the page:

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")

books = soup.find_all("article", class_="product_pod")
print(f"Found {len(books)} books on the page.\n")

for book in books:
    # The full title is in the 'title' attribute of the <a> tag inside <h3>
    title = book.find("h3").find("a")["title"]
    
    # The price is in a <p> tag with class 'price_color'
    price = book.find("p", class_="price_color").text.strip()
    
    # The star rating is encoded as a class name: 'One', 'Two', 'Three', 'Four', 'Five'
    rating_word = book.find("p", class_="star-rating")["class"][1]
    
    print(f"Title: {title}")
    print(f"Price: {price} | Rating: {rating_word}")
    print("-" * 50)

CSS Selectors vs. find_all()

BeautifulSoup provides two main approaches for locating elements. The find() and find_all() methods are intuitive for simple tag and class lookups. The select() and select_one() methods accept standard CSS selector syntax, which is often more concise for complex queries.

Here is the same extraction using CSS selectors:

# Using CSS selectors — more concise for complex queries
books = soup.select("article.product_pod")

for book in books:
    title = book.select_one("h3 a")["title"]
    price = book.select_one("p.price_color").text.strip()
    rating = book.select_one("p.star-rating")["class"][1]
    print(f"{title} | {price} | {rating}")

The choice between the two approaches is largely a matter of preference. CSS selectors tend to be more compact, while find_all() is more explicit and easier to read for beginners.

6. Step 3: Advanced Parsing with lxml and XPath

For more complex HTML documents or when you need maximum parsing speed, the lxml library with XPath expressions is the right tool. XPath is a query language for navigating tree-structured documents, and it is significantly more powerful than CSS selectors for certain use cases.

XPath expressions use path syntax to navigate the document tree. Here are the most important patterns:

| XPath Expression | Meaning |
| --- | --- |
| //div | Select all <div> elements anywhere in the document |
| //div[@class='product'] | Select all <div> elements with class="product" |
| //h3/a | Select all <a> elements that are direct children of <h3> |
| //a/@href | Select the href attribute of all <a> elements |
| //p[contains(@class, 'price')] | Select <p> elements whose class contains 'price' |

Here is how to use lxml with XPath to extract the same book data:

import requests
from lxml import html

url = "https://books.toscrape.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(url, headers=headers)

# Parse the HTML into an lxml tree
tree = html.fromstring(response.content)

# Use XPath to select all book containers
books = tree.xpath("//article[contains(@class, 'product_pod')]")

for book in books:
    # Extract the title attribute from the <a> tag inside <h3>
    title = book.xpath(".//h3/a/@title")[0]
    
    # Extract the price text
    price = book.xpath(".//p[contains(@class, 'price_color')]/text()")[0].strip()
    
    print(f"{title} — {price}")

XPath is particularly powerful when you need to traverse the document in directions that CSS selectors cannot handle, such as selecting a parent element based on a child's content, or navigating to sibling elements.
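A short sketch of those two axes, run against an inline HTML snippet so the behaviour is easy to verify without a network call:

```python
from lxml import html

# An inline document: each <li> pairs a library name with its kind.
doc = html.fromstring("""
<ul>
  <li><span class="name">requests</span><span class="kind">HTTP client</span></li>
  <li><span class="name">lxml</span><span class="kind">parser</span></li>
</ul>
""")

# Sibling axis: the kind of the library whose name is 'lxml'
kind = doc.xpath("//span[@class='name' and text()='lxml']/following-sibling::span/text()")

# Parent axis: select the <li> based on the text of one of its children
item = doc.xpath("//span[text()='requests']/parent::li")

print(kind)       # text of the sibling <span>
print(len(item))  # number of matching <li> elements
```

Neither query is expressible in plain CSS selectors, which can only descend the tree.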

7. Step 4: Handling Dynamic Pages with Playwright

The requests + BeautifulSoup combination works perfectly for static sites like our bookstore sandbox. However, if a website relies on JavaScript to load its content, requests will return only the initial, empty HTML shell — not the rendered data.

To scrape dynamic websites, you need a headless browser that can execute JavaScript. Playwright is the recommended choice for modern Python web scraping. Developed by Microsoft, it offers a cleaner API, better auto-waiting capabilities, and superior performance compared to the older Selenium library.

Install Playwright and download the browser binaries:

pip install playwright
playwright install chromium

Basic Playwright Scraper

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_playwright(url):
    with sync_playwright() as p:
        # Launch a headless Chromium browser
        browser = p.chromium.launch(headless=True)
        
        # Create a new browser context (like a fresh browser tab)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        )
        
        page = context.new_page()
        page.goto(url)
        
        # Wait for a specific element to appear before extracting data
        # This ensures JavaScript has finished loading the content
        page.wait_for_selector("article.product_pod")
        
        # Get the fully rendered HTML
        html_content = page.content()
        
        browser.close()
        
        # Parse with BeautifulSoup as usual
        soup = BeautifulSoup(html_content, "lxml")
        books = soup.select("article.product_pod")
        
        for book in books:
            title = book.select_one("h3 a")["title"]
            price = book.select_one("p.price_color").text.strip()
            print(f"{title} — {price}")

scrape_with_playwright("https://books.toscrape.com/")

Simulating User Interactions

Playwright's real power lies in its ability to simulate user interactions. You can click buttons, fill out forms, scroll down pages to trigger infinite loading, and even handle file downloads.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    
    page.goto("https://books.toscrape.com/")
    
    # Click on a category link
    page.click("a[href='catalogue/category/books/mystery_3/index.html']")
    
    # Wait for the new page to load
    page.wait_for_load_state("networkidle")
    
    # Take a screenshot to verify the result
    page.screenshot(path="mystery_books.png")
    
    print(f"Current URL: {page.url}")
    
    browser.close()

Capturing XHR/API Requests

Many modern websites load their data through background API calls (XHR requests) rather than embedding it in the HTML. Playwright allows you to intercept these network requests and extract the structured JSON data directly, which is far more efficient than parsing HTML.

from playwright.sync_api import sync_playwright
import json

captured_responses = []

def handle_response(response):
    # Capture only JSON API responses
    if "application/json" in response.headers.get("content-type", ""):
        try:
            data = response.json()
            captured_responses.append(data)
        except Exception:
            pass

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    
    # Listen for all network responses
    page.on("response", handle_response)
    
    page.goto("https://example-api-driven-site.com")
    page.wait_for_load_state("networkidle")
    
    browser.close()

print(f"Captured {len(captured_responses)} API responses")

8. Step 5: Building a Large-Scale Crawler with Scrapy

For scraping projects that involve hundreds or thousands of pages, Scrapy is the right tool. It is a complete, asynchronous web crawling framework that handles concurrent requests, manages pipelines for data processing, and provides a structured way to organise your scraping logic.

Install Scrapy:

pip install scrapy

Creating a Scrapy Project

scrapy startproject bookstore_scraper
cd bookstore_scraper

Writing a Scrapy Spider

Create a new file at bookstore_scraper/spiders/books_spider.py:

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]
    
    def parse(self, response):
        # Extract data from each book on the current page
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get().strip(),
                "rating": book.css("p.star-rating::attr(class)").get().split()[1],
                "availability": book.css("p.instock.availability::text").getall()[1].strip(),
            }
        
        # Follow the "next" pagination link automatically
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run the spider and export the results directly to a CSV file:

scrapy crawl books -o books.csv

Scrapy's response.follow() method automatically handles relative URLs, making pagination trivially easy. The framework also provides built-in support for rate limiting, retry logic, and middleware for rotating proxies and User-Agents.
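Those built-in features are switched on in the project's settings.py. A sketch of polite defaults, with illustrative values you should tune to the target site:

```python
# bookstore_scraper/settings.py - illustrative throttling and retry settings

DOWNLOAD_DELAY = 0.5                 # seconds to wait between requests to a domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap on parallel requests per domain
RETRY_ENABLED = True
RETRY_TIMES = 3                      # retry each failed request up to 3 times
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed server latency
ROBOTSTXT_OBEY = True                # honour the site's robots.txt
```

With AUTOTHROTTLE_ENABLED, Scrapy adjusts the delay dynamically, slowing down when the server responds slowly and speeding up when it is healthy.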

9. Step 6: Scraping at Scale with Asyncio and HTTPX

When you need to scrape hundreds of pages quickly without the overhead of a full Scrapy project, Python's asyncio library combined with the httpx HTTP client is the ideal solution. This approach allows you to send multiple requests simultaneously, dramatically reducing total scrape time.

Install httpx:

pip install httpx

Here is a complete example that scrapes the first 10 pages of the bookstore concurrently:

import asyncio
import httpx
from bs4 import BeautifulSoup
import time

# Generate the list of page URLs to scrape
urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]

async def fetch_page(client: httpx.AsyncClient, url: str) -> str | None:
    """Fetch a single page asynchronously with error handling."""
    try:
        response = await client.get(url)
        response.raise_for_status()
        return response.text
    except httpx.HTTPStatusError as e:
        print(f"HTTP error fetching {url}: {e.response.status_code}")
        return None
    except httpx.RequestError as e:
        print(f"Request error fetching {url}: {e}")
        return None

async def parse_books(html: str) -> list[dict]:
    """Parse book data from a page's HTML."""
    soup = BeautifulSoup(html, "lxml")
    books = []
    for book in soup.select("article.product_pod"):
        books.append({
            "title": book.select_one("h3 a")["title"],
            "price": book.select_one("p.price_color").text.strip(),
            "rating": book.select_one("p.star-rating")["class"][1],
        })
    return books

async def main():
    start_time = time.time()
    all_books = []
    
    # Configure the async HTTP client with connection limits and timeouts
    limits = httpx.Limits(max_connections=10, max_keepalive_connections=5)
    timeout = httpx.Timeout(10.0)
    
    async with httpx.AsyncClient(limits=limits, timeout=timeout) as client:
        # Fetch all pages concurrently
        tasks = [fetch_page(client, url) for url in urls]
        html_pages = await asyncio.gather(*tasks)
        
        # Parse each page
        for html in html_pages:
            if html:
                books = await parse_books(html)
                all_books.extend(books)
    
    elapsed = time.time() - start_time
    print(f"Scraped {len(all_books)} books from {len(urls)} pages in {elapsed:.2f} seconds")
    return all_books

if __name__ == "__main__":
    books = asyncio.run(main())

The key advantage of this approach over sequential scraping is speed. Scraping 10 pages sequentially with a 1-second delay between each request takes at least 10 seconds. The async version completes all 10 requests in roughly the time it takes to complete a single request.
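An unbounded gather over hundreds of URLs can overwhelm the target server (and trip rate limits). A common refinement is to cap concurrency with a semaphore; in this sketch, asyncio.sleep stands in for the real HTTP call so the pattern is easy to run and verify:

```python
import asyncio

async def fetch(sem: asyncio.Semaphore, url: str) -> str:
    # The semaphore caps how many coroutines run this body at once.
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        return f"html-for-{url}"

async def crawl(urls: list[str], max_concurrency: int = 5) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    # gather preserves input order, even though completion order varies.
    return await asyncio.gather(*(fetch(sem, u) for u in urls))

pages = asyncio.run(crawl([f"page-{i}" for i in range(20)]))
print(len(pages))
```

To use this for real scraping, replace the sleep with `await client.get(url)` from the httpx example above; the semaphore then limits in-flight requests regardless of how many tasks you create.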


10. Step 7: A Complete Real-World Scraper (Pagination + Export)

Now let's build a complete, production-ready scraper that combines everything we have learned. This script scrapes all 50 pages of the Books to Scrape catalogue, handles errors gracefully, cleans the data, and exports the results to both CSV and JSON formats.

import requests
from bs4 import BeautifulSoup
import csv
import json
import time
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Rating word-to-number mapping
RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"
TOTAL_PAGES = 50

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

def scrape_page(page_num: int) -> list[dict]:
    """Scrape a single page and return a list of book dictionaries."""
    url = BASE_URL.format(page_num)
    
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        logger.error(f"Failed to fetch page {page_num}: {e}")
        return []
    
    soup = BeautifulSoup(response.text, "lxml")
    books = []
    
    for book in soup.select("article.product_pod"):
        title = book.select_one("h3 a")["title"]
        
        # Clean price: strip the '£' symbol (and any mojibake bytes), then convert to float
        price_str = book.select_one("p.price_color").text.strip()
        price = float(price_str.replace("£", "").replace("Â", ""))
        
        # Convert rating word to integer
        rating_word = book.select_one("p.star-rating")["class"][1]
        rating = RATING_MAP.get(rating_word, 0)
        
        # Clean availability text
        availability = book.select_one("p.instock.availability").text.strip()
        
        books.append({
            "title": title,
            "price_gbp": price,
            "rating": rating,
            "availability": availability,
        })
    
    return books

def export_to_csv(data: list[dict], filename: str):
    """Export a list of dictionaries to a CSV file."""
    if not data:
        logger.warning("No data to export.")
        return
    
    with open(filename, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    
    logger.info(f"Exported {len(data)} records to {filename}")

def export_to_json(data: list[dict], filename: str):
    """Export a list of dictionaries to a JSON file."""
    with open(filename, mode="w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    
    logger.info(f"Exported {len(data)} records to {filename}")

def main():
    all_books = []
    
    for page_num in range(1, TOTAL_PAGES + 1):
        logger.info(f"Scraping page {page_num}/{TOTAL_PAGES}...")
        books = scrape_page(page_num)
        all_books.extend(books)
        
        # Polite delay between requests
        time.sleep(0.5)
    
    logger.info(f"Total books scraped: {len(all_books)}")
    
    # Export results
    export_to_csv(all_books, "books.csv")
    export_to_json(all_books, "books.json")
    
    # Print a quick summary
    avg_price = sum(b["price_gbp"] for b in all_books) / len(all_books)
    avg_rating = sum(b["rating"] for b in all_books) / len(all_books)
    
    print(f"\n--- Summary ---")
    print(f"Total books: {len(all_books)}")
    print(f"Average price: £{avg_price:.2f}")
    print(f"Average rating: {avg_rating:.1f}/5")

if __name__ == "__main__":
    main()

This script demonstrates a production-quality workflow: structured logging, error handling with try/except, data type conversion, a polite crawl delay, and dual-format export.

11. Step 8: Handling Anti-Bot Protection

Scraping real-world websites is often a battle against anti-bot systems. Websites use a variety of techniques to detect and block automated traffic. Understanding these mechanisms is essential for building scrapers that work reliably in production.

The Most Common Anti-Bot Mechanisms

IP Blocking and Rate Limiting is the most basic defence. If too many requests arrive from a single IP address in a short time window, the server blocks that IP and returns a 429 Too Many Requests or 403 Forbidden error. The solution is to distribute your requests across a pool of rotating residential proxies.

User-Agent Detection is trivial to implement on the server side. Any request without a standard browser User-Agent header is immediately flagged. The solution is to include realistic headers and rotate your User-Agent strings.
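A minimal sketch of header rotation. The User-Agent strings below are ordinary desktop browser examples, not values from any particular source; refresh them periodically so they track current browser versions:

```python
import random

# Example desktop User-Agent strings - rotate and refresh these over time.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers() -> dict:
    # Build a fresh, realistic header set for each request.
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

# Usage: requests.get(url, headers=random_headers())
```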

Browser Fingerprinting is used by advanced anti-bot systems like Cloudflare, Datadome, and PerimeterX. These systems analyse dozens of browser characteristics — TLS fingerprint, WebGL renderer, canvas hash, JavaScript engine behaviour — to determine whether the client is a real browser or a headless automation tool. Simple header spoofing does not defeat fingerprinting. You need specialised tools like playwright-stealth or a scraping API that handles fingerprinting at the infrastructure level.

CAPTCHAs are presented when the system suspects automated traffic. Solving them programmatically requires either a third-party CAPTCHA-solving service or a scraping API with built-in CAPTCHA handling.

Respecting robots.txt

Before scraping any website, always check its robots.txt file at https://example.com/robots.txt. This file specifies which pages crawlers are permitted to access and often includes a requested crawl delay. While robots.txt is not legally binding, respecting it is a fundamental principle of ethical web scraping and helps you avoid unnecessary blocks.

Python's standard library includes a parser for robots.txt:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://books.toscrape.com/robots.txt")
rp.read()

# Check if scraping a specific URL is allowed
url_to_check = "https://books.toscrape.com/catalogue/page-1.html"
is_allowed = rp.can_fetch("*", url_to_check)

print(f"Scraping {url_to_check} is {'allowed' if is_allowed else 'disallowed'}")

# Check the requested crawl delay
crawl_delay = rp.crawl_delay("*")
print(f"Requested crawl delay: {crawl_delay} seconds")

12. Step 9: AI-Powered Extraction

Writing CSS selectors and XPath expressions is tedious, especially when a website frequently changes its layout. A modern alternative is AI-powered extraction, which uses Large Language Models to understand the structure of a webpage and extract structured data based on natural language instructions.

Instead of writing brittle parsing logic that breaks every time the site updates its CSS classes, you describe the data you want in plain English. This approach is significantly more resilient to layout changes and dramatically reduces the time required to build scrapers for new websites.

ScrapeBadger includes an AI extraction mode that you can invoke directly from your Python code. Rather than parsing HTML manually, you describe the fields you want and let the AI handle the extraction:

import requests
import json

# ScrapeBadger API endpoint
api_url = "https://api.scrapebadger.com/v1/scrape"
api_key = "YOUR_SCRAPEBADGER_API_KEY"

# Target a product page
target_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

payload = {
    "url": target_url,
    "ai_extract": True,
    "ai_prompt": "Extract the following fields from this book product page: title, price (as a number), star rating (as a number out of 5), availability status, product description, and UPC code. Return as a JSON object."
}

response = requests.post(
    api_url,
    json=payload,
    headers={"Authorization": f"Bearer {api_key}"}
)

if response.status_code == 200:
    data = response.json()
    print(json.dumps(data, indent=2))

AI extraction is particularly valuable when you are scraping dozens of different websites with varying HTML structures, or when you need to maintain scrapers over long periods where the target site's layout may change.

13. When to Use a Scraping API Instead of DIY

Building your own scraper using requests and BeautifulSoup is a great learning experience and works well for small, simple projects. However, as your scraping needs grow, the infrastructure overhead becomes a significant burden. Here is a clear decision framework:

Scenario

DIY Scraper

Scraping API

Scraping a public, static site once

āœ…

Overkill

Scraping 100–1,000 pages per day

āœ…

Optional

Scraping 10,000+ pages per day

Complex

āœ…

Target site uses Cloudflare/PerimeterX

Very hard

āœ…

Target site requires JavaScript rendering

Playwright needed

āœ…

Scraping from multiple geographic locations

Proxy setup required

āœ…

Maintaining scrapers across 10+ different sites

High maintenance

āœ…

You should strongly consider a dedicated web scraping API when:

  1. You are constantly getting blocked. Sourcing reliable residential proxies, rotating User-Agents, and bypassing advanced fingerprinting systems is a full-time engineering job.

  2. The target site uses heavy JavaScript. Running headless browsers like Playwright at scale is expensive and resource-intensive.

  3. You need to scrape from specific geographic locations. A scraping API with geo-targeting handles this transparently.

  4. The website layout changes frequently. AI-powered extraction eliminates the need to maintain brittle CSS selectors.

ScrapeBadger Integration

ScrapeBadger handles all the complexity of modern web scraping through a single API endpoint. It manages proxy rotation, JavaScript rendering, anti-bot bypass, and AI-powered data extraction automatically.

The integration is straightforward — you simply wrap your target URL with the ScrapeBadger API endpoint:

import requests

api_key = "YOUR_SCRAPEBADGER_API_KEY"
target_url = "https://target-website.com/products"

# Basic scrape — ScrapeBadger handles proxy rotation automatically
response = requests.get(
    "https://api.scrapebadger.com/v1/scrape",
    params={
        "api_key": api_key,
        "url": target_url,
    }
)

print(response.text)  # Clean HTML ready for parsing

For dynamic sites that require JavaScript rendering:

# Enable JavaScript rendering for dynamic content
response = requests.get(
    "https://api.scrapebadger.com/v1/scrape",
    params={
        "api_key": api_key,
        "url": target_url,
        "render_js": "true",        # Enable headless browser rendering
        "anti_bot": "true",         # Enable anti-bot bypass (Cloudflare, etc.)
    }
)

ScrapeBadger uses smart billing — you are only charged for features the system actually uses. If JavaScript rendering is enabled but the target page turns out to be static, you are not charged for it. Failed requests are never billed.

For a detailed integration guide, see the ScrapeBadger documentation.

14. Common Errors and How to Fix Them

When building web scrapers, you will inevitably encounter errors. Here is a comprehensive guide to the most common issues and their solutions:

| Error | Cause | Solution |
| --- | --- | --- |
| 403 Forbidden | Scraper detected and blocked | Rotate IPs via proxies, add realistic headers, use a scraping API |
| 404 Not Found | URL does not exist | Check for relative vs. absolute URLs, verify the URL structure |
| 429 Too Many Requests | Rate limit exceeded | Add time.sleep() delays, use exponential backoff, rotate IPs |
| 500 Internal Server Error | Server-side issue | Implement retry logic with a delay |
| Empty HTML / Missing Data | JavaScript-rendered content | Switch to Playwright or enable JS rendering in your scraping API |
| AttributeError: 'NoneType' | CSS selector found no match | Verify the selector in DevTools, check for None before accessing attributes |
| ConnectionError | Network issue or IP blocked | Check your internet connection, rotate the proxy, implement retries |
| Encoding Issues | Non-UTF-8 characters in response | Set response.encoding = 'utf-8' or use response.apparent_encoding |

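The AttributeError: 'NoneType' case is the most common parsing failure: select_one() and find() return None when nothing matches, and the crash happens one line later when you access the missing element. A minimal sketch of defensive extraction, using an invented product snippet where the price element is deliberately missing:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet — note there is no span.price element
html = '<div class="product"><h2>Widget</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# select_one() returns None on no match, so guard every attribute access
title_el = soup.select_one("div.product h2")
price_el = soup.select_one("div.product span.price")

title = title_el.get_text(strip=True) if title_el else None
price = price_el.get_text(strip=True) if price_el else None

print(title, price)  # Widget None
```

Returning None for missing fields (rather than crashing) lets the rest of your pipeline decide whether a partial record is still worth keeping.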
Implementing Retry Logic

Robust scrapers should automatically retry failed requests with exponential backoff:

import requests
import time

def fetch_with_retry(url: str, max_retries: int = 3, backoff_factor: float = 2.0) -> requests.Response | None:
    """Fetch a URL with automatic retry and exponential backoff."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)

            if response.status_code == 200:
                return response

            # Back off before every retry, not just after a 429
            wait_time = backoff_factor ** attempt
            if response.status_code == 429:
                print(f"Rate limited. Waiting {wait_time}s before retry {attempt + 1}/{max_retries}...")
            else:
                print(f"HTTP {response.status_code} on attempt {attempt + 1}. Waiting {wait_time}s...")
            time.sleep(wait_time)

        except requests.exceptions.RequestException as e:
            print(f"Request failed on attempt {attempt + 1}: {e}")
            time.sleep(backoff_factor ** attempt)

    print(f"All {max_retries} attempts failed for {url}")
    return None
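
If you prefer not to hand-roll the loop, requests can delegate retries to urllib3's Retry helper mounted on a Session. A sketch, assuming urllib3 ≥ 1.26 (which renamed method_whitelist to allowed_methods) and that you only want to retry idempotent GET requests:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(
    total=3,                                     # up to 3 retries per request
    backoff_factor=2,                            # exponential wait between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # retry only these status codes
    allowed_methods=["GET"],                     # never retry non-idempotent methods
)
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Every request made through this session now retries automatically:
# response = session.get("https://example.com", timeout=10)
```

The trade-off: the adapter approach is less code and battle-tested, while the explicit loop gives you custom logging and per-status behavior.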

15. Frequently Asked Questions

Is Python good for web scraping? Yes, Python is widely considered the best programming language for web scraping. Its simple syntax, mature ecosystem of libraries (BeautifulSoup, Playwright, Scrapy), and powerful data processing capabilities make it the industry standard. The combination of ease of use and raw capability is unmatched by any other language.

Is web scraping legal? The legality of web scraping depends on the jurisdiction, the nature of the data, and the website's Terms of Service. Generally, scraping publicly available data is legal, but scraping personal data or data behind a login requires careful consideration of laws like the GDPR and CCPA. The landmark hiQ v. LinkedIn ruling (2022) affirmed that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act in the United States. Always consult legal counsel for commercial projects.

Which is better: Selenium or Playwright? For new projects, Playwright is the better choice. It has a cleaner, more modern API, better auto-waiting capabilities, built-in support for async/await, and is actively maintained by Microsoft. Selenium is a mature, battle-tested tool with a larger community, but its API is older and more verbose. If you are maintaining an existing Selenium project, there is no urgent reason to migrate.

How do I scrape a website that requires a login? Use requests.Session() to handle cookies and maintain a logged-in state. For complex authentication flows involving JavaScript (such as OAuth or two-factor authentication), use Playwright to automate the login process in a headless browser, then extract the session cookies to use in subsequent requests. See our dedicated session-based scraping guide for full code examples.

How do I avoid getting my IP blocked while scraping? Distribute your requests across multiple IP addresses using a residential proxy pool. Rotate your User-Agent headers, implement random delays between requests (using time.sleep(random.uniform(1, 3))), and avoid aggressive crawling patterns. For sites with advanced anti-bot protection, use a scraping API that handles IP rotation and fingerprinting bypass at the infrastructure level.
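
Those two habits — rotating User-Agents and randomizing delays — fit in a few lines. A minimal sketch (the User-Agent strings are illustrative placeholders; use current, realistic ones in production):

```python
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_headers() -> dict:
    """Pick a random User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Sleep a random 1–3 seconds between requests to avoid a robotic cadence
time.sleep(random.uniform(1, 3))
```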

What is the difference between find() and select() in BeautifulSoup? find() and find_all() use BeautifulSoup's own search API, where you specify tag names, attributes, and class names as Python arguments. select() and select_one() accept standard CSS selector strings, which are often more concise for complex queries. Both approaches produce identical results — the choice is a matter of preference and familiarity.
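
To make the equivalence concrete, here is a toy comparison (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="card"><a href="/x">Link</a></div>', "html.parser")

# find() takes tag names and attributes as Python arguments
div_find = soup.find("div", class_="card")
a_find = soup.find("a", href="/x")

# select_one() takes a CSS selector string
div_select = soup.select_one("div.card")
a_select = soup.select_one('a[href="/x"]')

print(div_find == div_select, a_find == a_select)  # True True
```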

How do I handle pagination in a web scraper? Look for a "Next" button or pagination link in the HTML. Extract its href attribute and construct the next page URL. Use a loop or, in Scrapy, response.follow() to automatically crawl through all pages. Always implement a stopping condition (e.g., when no "Next" link is found) to prevent infinite loops.
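
A minimal offline sketch of that loop — the pages dictionary stands in for live requests.get() calls, and the a.next selector is an assumption about the target site's markup:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Simulated responses keyed by URL; page 2 has no "Next" link, so it is the last page
pages = {
    "https://example.com/products?page=1": '<a class="next" href="?page=2">Next</a><span class="item">A</span>',
    "https://example.com/products?page=2": '<span class="item">B</span>',
}

url = "https://example.com/products?page=1"
items = []
while url:
    soup = BeautifulSoup(pages[url], "html.parser")
    items += [el.get_text() for el in soup.select("span.item")]
    next_link = soup.select_one("a.next")
    # Resolve the (possibly relative) href against the current URL; stop when absent
    url = urljoin(url, next_link["href"]) if next_link else None

print(items)  # ['A', 'B']
```

urljoin() handles relative hrefs correctly, which is the usual source of 404s in pagination loops.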

When should I use Scrapy instead of requests + BeautifulSoup? Use Scrapy when you need to scrape more than a few hundred pages, when you need concurrent requests for speed, when you need a structured pipeline for data cleaning and export, or when you are building a long-running crawler that needs to handle retries, redirects, and middleware automatically. For quick, one-off scripts, requests + BeautifulSoup is simpler and faster to set up.

Conclusion

Web scraping with Python is a deep and rewarding skill. The tools and techniques in this guide cover the full spectrum of modern scraping challenges — from fetching a simple static page with requests to automating a headless browser with Playwright, building a concurrent crawler with asyncio, and leveraging AI-powered extraction to eliminate brittle CSS selectors.

The key takeaway is that no single tool is right for every situation. Start with requests + BeautifulSoup for simple, static sites. Move to Playwright when you encounter JavaScript rendering. Use Scrapy when you need to crawl at scale. And when you are spending more time fighting anti-bot systems than building your actual product, that is the signal to use a dedicated scraping API.

ScrapeBadger is built for exactly that moment — when your scraping needs outgrow what a DIY solution can reliably deliver. It handles proxy rotation, JavaScript rendering, anti-bot bypass, and AI extraction through a single API endpoint, so you can focus on the data rather than the infrastructure.
Ready to start? Get your free ScrapeBadger API key and run your first scrape in under five minutes.

Written by Thomas Shultz

Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.

