
How to Scrape Any E-Commerce Site: A Platform-by-Platform Intelligence Guide

Thomas Shultz

The global e-commerce market is on track to hit $8.3 trillion by 2025. Behind every dollar of that is a product listing, a price, a review, an inventory signal — data that companies are collecting, analysing, and competing on right now. The businesses winning on price intelligence, trend detection, and competitive strategy aren't guessing. They're scraping.

But here's what most scraping tutorials miss: there is no single technique that works across every e-commerce site. Amazon and a boutique Shopify store have almost nothing in common technically. Zalando and ASOS handle anti-bot differently from each other. Even two WooCommerce stores can behave differently depending on what plugins they're running.

What actually works is understanding the architecture behind each platform — what framework it's built on, how it renders data, what anti-bot system it deploys, and where the cleanest data lives. Get that right, and any e-commerce site becomes scrapeable. Get it wrong, and you'll spend weeks fighting blocks that have nothing to do with your code.

This is the guide we wish existed when we started building ScrapeBadger's e-commerce infrastructure. Let's get into it.

The Four Technical Categories Every E-Commerce Site Falls Into

Before you write a single line of code, identify which of four categories your target site belongs to. This alone determines your entire approach.

Category 1: Static HTML stores. Older stores, basic WooCommerce or Magento installations, and many small independent retailers render product data directly in the initial HTML response. BeautifulSoup and requests work. Simple, fast, maintainable.

Category 2: JavaScript-rendered stores. Modern storefronts built on React, Vue, or Next.js load product data dynamically after the initial page load. Shopify Plus, Zalando, ASOS, and most major retail platforms fall here. Your scraper needs to execute JavaScript — which means Playwright or a scraping API that handles rendering.

Category 3: API-backed stores. Many modern e-commerce frontends separate their display layer from their data layer. The page makes XHR/Fetch calls to internal JSON APIs as it loads. If you can find and call these APIs directly, you skip HTML parsing entirely and get cleaner data faster. We covered the technique for finding these in detail in the ScrapeBadger session-based scraping guide.

Category 4: Marketplace platforms. Amazon, eBay, Walmart, Etsy — massive platforms with sophisticated anti-bot infrastructure, frequent layout changes, and legal complexity. These require dedicated approaches.

Identifying the category takes five minutes: open DevTools, reload the target page, and check the Network tab. Static HTML means all your product data is in the page source. API calls in the XHR/Fetch tab mean Category 3. A page that loads a shell and then populates with data means Category 2.
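
If you want to script that first pass, a rough heuristic is to fetch the raw HTML without executing JavaScript and check whether product markup is already present. This is a sketch, not a classifier; DevTools remains the authoritative check, and the 500-character threshold is an arbitrary assumption:

python

import requests
from bs4 import BeautifulSoup

def rough_category_check(url: str) -> str:
    """First pass: does product data arrive in the raw HTML response?"""
    html = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},
        timeout=15,
    ).text
    soup = BeautifulSoup(html, "html.parser")

    # schema.org Product markup in the initial response points to Category 1
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        if script.string and '"Product"' in script.string:
            return "likely static HTML (Category 1)"

    # A near-empty <body> behind large JS bundles points to Category 2 or 3
    body_text = soup.body.get_text(strip=True) if soup.body else ""
    if len(body_text) < 500:
        return "likely JS-rendered (Category 2 or 3): check the Network tab"

    return "inconclusive: inspect DevTools manually"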

Platform-by-Platform: What You're Actually Up Against

Shopify

Shopify powers roughly 10% of all e-commerce sites globally — from single-product dropshippers to enterprise brands doing nine-figure revenue. The good news: Shopify stores share common architecture. The better news: many Shopify stores expose their own JSON API by default.

The hidden Shopify JSON API

Every Shopify store exposes product data in JSON format by appending .json to product URLs and /products.json to collection URLs:

https://store.example.com/products.json
https://store.example.com/products/product-name.json
https://store.example.com/collections/shoes/products.json?limit=250&page=1

This isn't a bug — Shopify enables this by design for third-party integrations. The response is clean, structured, and complete:

python

import time

import requests

def scrape_shopify_products(domain: str, collection: str = "all") -> list:
    """
    Scrape all products from a Shopify store using the built-in JSON API.
    Works on most Shopify stores without any browser automation.
    """
    products = []
    page = 1

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "application/json",
    }

    while True:
        url = f"https://{domain}/collections/{collection}/products.json"
        params = {"limit": 250, "page": page}

        response = requests.get(url, headers=headers, params=params, timeout=15)

        if response.status_code != 200:
            break

        data = response.json()
        batch = data.get("products", [])

        if not batch:
            break

        products.extend(batch)
        page += 1

        # Be respectful — don't hammer the server
        time.sleep(0.5)

    return products

# Usage
products = scrape_shopify_products("www.allbirds.com")
print(f"Found {len(products)} products")

for p in products[:3]:
    print(f"\n{p['title']}")
    for variant in p.get("variants", []):
        print(f"  {variant['title']}: ${variant['price']} — {'In stock' if variant['available'] else 'Out of stock'}")

The JSON response gives you everything: product title, description, tags, all variants with their SKUs, prices, and availability status, images, created/updated timestamps, and more. No HTML parsing required.
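
Because variant-level fields matter for the analysis later in this guide, it's worth flattening the payload into one row per variant as soon as you have it. A small helper using the field names from the products.json response:

python

def flatten_variants(products: list) -> list[dict]:
    """One row per variant: the granularity price intelligence actually needs."""
    rows = []
    for p in products:
        for v in p.get("variants", []):
            rows.append({
                "product_id": p.get("id"),
                "product_title": p.get("title"),
                "variant_title": v.get("title"),
                "sku": v.get("sku"),
                "price": v.get("price"),
                "compare_at_price": v.get("compare_at_price"),
                "available": v.get("available"),
                "updated_at": v.get("updated_at"),
            })
    return rows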

When the JSON API is disabled

Some stores — particularly those with aggressive SEO setups, or those using apps that specifically block the JSON endpoints — disable this. When that happens, fall back to HTML scraping. Shopify's HTML structure is relatively consistent across stores. Product prices usually sit in elements with data-price attributes or standard schema.org markup:

python

from bs4 import BeautifulSoup
import requests
import json

def scrape_shopify_html(url: str) -> dict:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(url, headers=headers, timeout=15)
    soup = BeautifulSoup(response.text, "html.parser")

    # Schema.org structured data — most Shopify stores include this.
    # The first ld+json block is often Organization or WebSite rather than
    # Product, so scan all of them and pick out the Product entry.
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string)
        except (json.JSONDecodeError, TypeError):
            continue
        if not isinstance(data, dict) or data.get("@type") != "Product":
            continue
        offers = data.get("offers", {})
        if isinstance(offers, list):
            offers = offers[0] if offers else {}
        return {
            "name": data.get("name"),
            "price": offers.get("price"),
            "availability": offers.get("availability"),
            "description": data.get("description"),
        }
    return {}

Anti-bot protection on Shopify. Shopify's platform-level Cloudflare integration protects all stores by default. For most scraping use cases — product data, pricing, catalogue monitoring — Shopify's protection doesn't trigger unless you're making requests at high frequency or hitting checkout/cart endpoints. Polite scraping of product pages at reasonable rates rarely encounters blocks. For stores with additional custom protection or very high-frequency monitoring, route through ScrapeBadger, which handles Cloudflare bypass transparently.

WooCommerce

WooCommerce runs on WordPress, powers approximately 29% of all e-commerce stores, and is far more variable in technical setup than Shopify. One WooCommerce store can look radically different from another depending on its theme, plugins, and configuration.

The WooCommerce REST API

Many WooCommerce stores expose the official WooCommerce REST API — a full-featured endpoint at /wp-json/wc/v3/products. Whether it's accessible without authentication depends on the store configuration:

python

import requests

def try_woocommerce_api(domain: str) -> list | None:
    """
    Attempt to access WooCommerce REST API.
    Returns products if accessible without auth, None if auth required.
    """
    url = f"https://{domain}/wp-json/wc/v3/products"
    params = {"per_page": 100, "page": 1}

    response = requests.get(url, params=params, timeout=10)

    if response.status_code == 200:
        return response.json()
    elif response.status_code == 401:
        print("API requires authentication")
        return None
    else:
        print(f"API not available: {response.status_code}")
        return None

Most public stores lock this endpoint — authentication is required. In that case, fall back to scraping the HTML product pages.

Scraping WooCommerce HTML

WooCommerce stores almost always include schema.org structured data in product pages. This is more reliable than scraping CSS classes (which vary by theme):

python

from bs4 import BeautifulSoup
import requests
import json

def scrape_woocommerce_product(url: str) -> dict:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-GB,en;q=0.9",
    }

    session = requests.Session()
    session.headers.update(headers)
    response = session.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    result = {}

    # Method 1: schema.org JSON-LD (most reliable)
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string)
            if isinstance(data, list):
                data = data[0]
            if data.get("@type") in ["Product", "ItemPage"]:
                product = data if data.get("@type") == "Product" else data.get("mainEntity", {})
                result = {
                    "name": product.get("name"),
                    "price": product.get("offers", {}).get("price"),
                    "currency": product.get("offers", {}).get("priceCurrency"),
                    "availability": product.get("offers", {}).get("availability", "").split("/")[-1],
                    "sku": product.get("sku"),
                    "description": product.get("description", "")[:500],
                    "rating": product.get("aggregateRating", {}).get("ratingValue"),
                    "review_count": product.get("aggregateRating", {}).get("reviewCount"),
                }
        except (json.JSONDecodeError, AttributeError, TypeError):
            # script.string can be None; json.loads(None) raises TypeError
            continue

    # Method 2: WooCommerce-specific data attributes as fallback
    if not result:
        price_el = soup.select_one(".price .woocommerce-Price-amount")
        title_el = soup.select_one(".product_title")
        result = {
            "name": title_el.get_text(strip=True) if title_el else None,
            "price": price_el.get_text(strip=True) if price_el else None,
        }

    return result

The schema.org approach is the right default for any WooCommerce store — it's consistent across themes, it's more semantic than CSS class scraping, and it's what Google's own crawlers use to understand product data.
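
Putting the two WooCommerce paths together: probe the REST API first, and drop to page-level HTML scraping only when it's locked. The store domain and product URL below are hypothetical placeholders:

python

# Hypothetical store used for illustration
products = try_woocommerce_api("example-store.com")

if products is not None:
    print(f"API open: {len(products)} products in first page")
else:
    # API locked: scrape individual product pages instead
    product = scrape_woocommerce_product("https://example-store.com/product/sample-item/")
    print(product)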

Amazon

Amazon is the hardest legitimate scraping target in e-commerce. Not the hardest overall — that distinction goes to some financial platforms — but the hardest among retail targets that large numbers of teams need data from.

Amazon runs custom anti-bot infrastructure that combines IP reputation, TLS fingerprinting, behavioural analysis, and browser environment fingerprinting. Their detection is aggressive enough that a significant fraction of residential proxy IPs are blocked. Response times with cold proxies are slow. And Amazon's HTML structure changes frequently enough that selector-based scrapers require regular maintenance.

The data worth scraping from Amazon

Despite the difficulty, the data Amazon exposes publicly is extremely valuable:

  • Product prices across all sellers (including third-party marketplace)

  • BSR (Best Sellers Rank) — a real-time sales velocity proxy

  • Review count and average rating

  • "Frequently bought together" and recommendation relationships

  • Q&A content

  • Price history signals (when combined with the Keepa API for historical data)

  • Availability indicators ("Only 3 left in stock"; parsed into a number in the sketch below)
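
That last signal arrives as free text, so it pays to normalise it at extraction time. A small parser for the common phrasing; the exact wording varies by locale and category, so treat the regex as a starting point:

python

import re

def parse_stock_quantity(availability_text: str | None) -> int | None:
    """Turn 'Only 3 left in stock (more on the way).' into the integer 3."""
    if not availability_text:
        return None
    match = re.search(r"only\s+(\d+)\s+left", availability_text, re.IGNORECASE)
    return int(match.group(1)) if match else None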

Technical approach

Amazon requires residential proxies and proper TLS fingerprinting as a baseline. For basic product data, a curl_cffi approach with residential proxy routing gets you most of the way there:

python

from curl_cffi import requests
from bs4 import BeautifulSoup
import os

def scrape_amazon_product(asin: str, country_code: str = "com") -> dict:
    """
    Scrape Amazon product page for a given ASIN.
    Requires residential proxy for reliable results.
    """
    url = f"https://www.amazon.{country_code}/dp/{asin}"

    # Proxy configuration (residential required)
    proxies = {
        "http": os.environ.get("RESIDENTIAL_PROXY_URL"),
        "https": os.environ.get("RESIDENTIAL_PROXY_URL"),
    }

    response = requests.get(
        url,
        impersonate="chrome120",
        proxies=proxies,
        headers={
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        }
    )

    if response.status_code != 200:
        return {"error": f"Status {response.status_code}"}

    soup = BeautifulSoup(response.text, "html.parser")

    def text(selector: str):
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else None

    # Amazon's product data is relatively stable in these selectors
    return {
        "asin": asin,
        "title": text("#productTitle"),
        "price": text(".a-price .a-offscreen"),
        "rating": text(".a-icon-star-small .a-icon-alt"),
        "review_count": text("#acrCustomerReviewText"),
        "availability": text("#availability"),
    }

For anything beyond basic product data or at any meaningful scale, the DIY Amazon scraper quickly becomes a maintenance burden. Amazon rotates its HTML structure regularly and updates anti-bot rules frequently. At ScrapeBadger, we handle Amazon scraping at the infrastructure level — the same approach we use for all heavily-protected e-commerce platforms. The ScrapeBadger documentation covers the Amazon endpoint configuration.

Zalando, ASOS, and Major Fashion Retailers

European fashion retailers like Zalando and UK retailers like ASOS present a specific challenge that's become increasingly common in 2025: they're React/Next.js SPAs with aggressive Cloudflare protection, but their data is beautifully structured in the JavaScript bundle and internal API calls.

Finding the internal API

Open DevTools on any Zalando or ASOS product page. In the Network tab, filter by XHR/Fetch. You'll see API calls like:

https://www.zalando.co.uk/api/graphql
https://api.asos.com/product/catalogue/v3/products/{id}

ASOS exposes a relatively clean internal catalogue API. Zalando uses GraphQL. Both require session cookies and specific headers to return data, but calling them directly is dramatically faster and more reliable than scraping rendered HTML.

python

import requests

def scrape_asos_product(product_id: int) -> dict:
    """
    Call ASOS's internal catalogue API directly.
    Headers extracted from browser DevTools session.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "application/json",
        "Referer": "https://www.asos.com/",
        "Store": "ROW",
        "Country": "GB",
        "Currency": "GBP",
        "Accept-Language": "en-GB,en;q=0.9",
    }

    response = requests.get(
        f"https://api.asos.com/product/catalogue/v3/products/{product_id}",
        headers=headers,
        params={"store": "ROW", "currency": "GBP", "lang": "en-GB"}
    )

    if response.status_code != 200:
        return {}

    data = response.json()
    return {
        "id": data.get("id"),
        "name": data.get("name"),
        "brand": data.get("brand", {}).get("name"),
        "price": data.get("price", {}).get("current", {}).get("value"),
        "original_price": data.get("price", {}).get("rrp", {}).get("value"),
        "is_in_sale": data.get("price", {}).get("isMarkedDown"),
        "colour": data.get("colour"),
        "category": data.get("productType", {}).get("name"),
    }
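
To feed that function you need product IDs. ASOS product URLs typically carry the ID in a /prd/ path segment; the pattern below is an assumption worth verifying against live URLs before relying on it:

python

import re

def asos_product_id(url: str) -> int | None:
    """Extract the numeric product ID from an ASOS product URL.
    Assumes the common /prd/<id> URL pattern; verify against current site URLs."""
    match = re.search(r"/prd/(\d+)", url)
    return int(match.group(1)) if match else None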

The internal API approach for fashion retailers yields richer data than HTML scraping and is less fragile — internal APIs change less frequently than visual layouts. For details on extracting and maintaining session cookies for authenticated API calls, the session-based scraping guide on the ScrapeBadger blog has the full technical walkthrough.

eBay and Etsy

eBay and Etsy both offer official APIs — the eBay Browse API and the Etsy Open API v3 — which are worth using when they cover your data requirements. Both require API key registration, and both have rate limits, but for many use cases they're more reliable than scraping.
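
For a sense of the official route, this is roughly what an Etsy Open API v3 call for active listings looks like. The endpoint and x-api-key header follow Etsy's v3 documentation, but confirm parameter names against the current docs before building on this:

python

import os
import requests

def etsy_search_listings(keywords: str, limit: int = 25) -> list:
    """Query Etsy's Open API v3 for active listings (requires a free API key)."""
    response = requests.get(
        "https://openapi.etsy.com/v3/application/listings/active",
        headers={"x-api-key": os.environ["ETSY_API_KEY"]},
        params={"keywords": keywords, "limit": limit},
        timeout=15,
    )
    response.raise_for_status()
    return response.json().get("results", [])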

When the official API doesn't cover what you need (competitor seller analysis, secondary market pricing trends, inventory depth), scraping is the supplement. eBay's HTML structure is relatively stable, and its search results pages can be scraped directly:

python

import requests
from urllib.parse import quote
from bs4 import BeautifulSoup

def scrape_ebay_search(query: str, max_results: int = 50) -> list:
    """Scrape eBay search results for a product query."""

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }

    url = f"https://www.ebay.com/sch/i.html?_nkw={quote(query)}&_ipg=50"
    response = requests.get(url, headers=headers, timeout=15)
    soup = BeautifulSoup(response.text, "html.parser")

    results = []
    # Note: the first .s-item card is sometimes a "Shop on eBay" placeholder
    listings = soup.select(".s-item")

    for listing in listings[:max_results]:
        title = listing.select_one(".s-item__title")
        price = listing.select_one(".s-item__price")
        link = listing.select_one(".s-item__link")
        condition = listing.select_one(".SECONDARY_INFO")
        sold_count = listing.select_one(".s-item__hotness")

        if title and price:
            results.append({
                "title": title.get_text(strip=True),
                "price": price.get_text(strip=True),
                "url": link.get("href") if link else None,
                "condition": condition.get_text(strip=True) if condition else None,
                "sold_indicator": sold_count.get_text(strip=True) if sold_count else None,
            })

    return results

The Data That Actually Drives Decisions

Knowing how to extract product pages is one thing. Knowing which fields drive real business decisions is another. After building e-commerce data pipelines across hundreds of use cases, these are the fields that matter most:

Price + original price together. Current price alone is incomplete. The discount percentage, derived from current vs. original price, is how consumers evaluate value — and it's how competitor pricing strategies become legible. A competitor reducing from £89 to £71 is doing something different from one who's always sold at £71.
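
It's worth pinning that down as arithmetic, since "discount" is ambiguous until you fix the denominator:

python

def discount_pct(current: float, original: float | None) -> float | None:
    """Discount depth as shoppers read it: (original - current) / original."""
    if not original or original <= current:
        return None
    return round((original - current) / original * 100, 1)

# The example above: £89 reduced to £71
print(discount_pct(71, 89))  # 20.2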

Availability status + quantity signals. "In stock" is binary. "Only 3 left in stock" is a velocity signal. When a competitor's inventory runs low on a specific variant or size, that's an opportunity. When a product goes out of stock across multiple competitors simultaneously, that's a supply chain signal.

Review count + rating over time. A product with 4.2 stars and 8,000 reviews is a different competitive reality from one with 4.7 stars and 12 reviews. Track both fields historically and you can see products gaining or losing momentum before it shows up in pricing or availability data.
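
A minimal way to make that comparison concrete, assuming you store dated snapshots of both fields:

python

def review_momentum(prev: dict, curr: dict, days: float) -> dict:
    """Compare two {'rating': float, 'review_count': int} snapshots taken `days` apart."""
    new_reviews = curr["review_count"] - prev["review_count"]
    return {
        "reviews_per_day": new_reviews / days if days > 0 else 0.0,
        "rating_shift": round(curr["rating"] - prev["rating"], 2),
    }

print(review_momentum({"rating": 4.5, "review_count": 7400},
                      {"rating": 4.2, "review_count": 8000}, days=30))
# {'reviews_per_day': 20.0, 'rating_shift': -0.3}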

Variant-level data. Product-level price tracking misses the real picture on fashion, electronics, and anything with meaningful variants. A shoe at £79 that's only available in size 11 has different market dynamics from one with full size availability. Scraping variant-level data — SKU, size, colour, availability per variant, price per variant — is what separates surface-level monitoring from genuine intelligence.

Position in category / search rank. Where a product appears in category listings is a signal of both SEO performance and merchandising priority. When a product moves from page 3 to page 1 of a category, something changed — either their content, their sales velocity, or the platform's algorithm. Tracking position over time reveals these shifts.
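
Tracking that requires nothing more than two crawls and a diff. A sketch, assuming you record each product's 1-based position per crawl:

python

def rank_shifts(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Positive values mean the product moved up the listing since the last crawl."""
    return {
        url: previous[url] - position
        for url, position in current.items()
        if url in previous and previous[url] != position
    }

print(rank_shifts({"/p/shoe-a": 34, "/p/shoe-b": 5},
                  {"/p/shoe-a": 8, "/p/shoe-b": 5}))
# {'/p/shoe-a': 26}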

Building a Production E-Commerce Scraping Pipeline

Getting a scraper to work once is engineering. Getting it to work reliably every day, at scale, without breaking when sites update, is a pipeline.

The architecture that works at production for e-commerce data has four components:

1. URL management. You need a maintained list of target URLs — product pages, category pages, search results pages. These lists change: products go out of stock, new products launch, categories reorganise. Your pipeline needs to handle URL discovery (crawling category pages to find new products) as well as URL maintenance (detecting 404s and removing dead pages).

2. Extraction with schema validation. Every field you extract should be validated before it enters your database. A price field that returns None or a string when you expect a float is a signal that something changed. Schema validation on extraction — using something like Pydantic — catches these failures before they corrupt your dataset. We covered the data quality principles behind this in the trusted data article on the blog. A minimal validation sketch follows this list.

3. Change detection. You're not interested in extracting the same data repeatedly — you're interested in detecting when data changes. Build a comparison layer that reads the previous value for each field and flags changes: price dropped by more than 5%, item went out of stock, new variant appeared. This is what turns a scraping pipeline into an intelligence system.

4. Delivery and alerting. Scraped data needs to reach the systems where it's used. That might be a Google Sheet for a small team, a Slack webhook for price alerts, a database feeding a BI dashboard, or a webhook pushing to a CRM. Build the delivery layer as part of the pipeline, not as an afterthought.
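
For component 2, here's a minimal validation sketch using Pydantic v2; the field names mirror the extraction examples earlier and are otherwise illustrative:

python

from pydantic import BaseModel, field_validator

class ProductRecord(BaseModel):
    """Validated shape for one scraped product row."""
    product_id: str
    title: str
    price: float  # None or an unparseable string here fails validation loudly
    availability: str

    @field_validator("price", mode="before")
    @classmethod
    def coerce_price(cls, value):
        # Accept "£71.00"-style strings; let float() raise on anything else
        if isinstance(value, str):
            value = value.replace("£", "").replace("$", "").replace(",", "").strip()
        return float(value)

The monitoring class below then covers component 3, change detection: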

python

import json
from datetime import datetime, timezone

class ProductMonitor:
    """
    Simple production-grade product monitoring pipeline.
    Detects price changes and availability shifts.
    """

    def __init__(self, storage_path: str = "products.json"):
        self.storage_path = storage_path
        self.data = self._load()

    def _load(self) -> dict:
        try:
            with open(self.storage_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save(self):
        with open(self.storage_path, "w") as f:
            json.dump(self.data, f, indent=2)

    def update(self, product_id: str, current: dict) -> list[str]:
        """
        Update product data and return list of change alerts.
        """
        alerts = []
        previous = self.data.get(product_id)

        if previous:
            # Detect price change (price may arrive as a string, or be missing)
            prev_price = float(previous.get("price") or 0)
            curr_price = float(current.get("price") or 0)
            if curr_price and prev_price:
                change_pct = (curr_price - prev_price) / prev_price * 100
                if abs(change_pct) >= 1:
                    direction = "dropped" if change_pct < 0 else "increased"
                    alerts.append(
                        f"Price {direction} {abs(change_pct):.1f}%: "
                        f"£{prev_price:.2f} → £{curr_price:.2f}"
                    )

            # Detect availability change
            if previous.get("availability") != current.get("availability"):
                alerts.append(
                    f"Availability changed: "
                    f"{previous.get('availability')} → {current.get('availability')}"
                )

        # Update stored data
        current["last_updated"] = datetime.utcnow().isoformat()
        self.data[product_id] = current
        self._save()

        return alerts


# Usage
monitor = ProductMonitor()
product = scrape_shopify_products("competitor-store.com")[0]

product_id = product.get("id")
alerts = monitor.update(str(product_id), {
    "price": product["variants"][0]["price"],
    "availability": str(product["variants"][0]["available"]),
    "title": product["title"],
})

for alert in alerts:
    print(f"🔔 {alert}")
    # Here you'd send to Slack/email/webhook

When to Use ScrapeBadger Instead of Building Your Own

The patterns in this guide work across the e-commerce landscape. Shopify JSON API endpoints, WooCommerce schema.org parsing, direct ASOS API calls, eBay HTML scraping — all of these can be built and maintained by a developer with reasonable experience.

The cases where it makes more sense to use ScrapeBadger's infrastructure instead:

Amazon at any meaningful scale. Amazon's anti-bot is sophisticated enough that maintaining a reliable DIY scraper is close to a full-time job. Residential proxies, TLS fingerprinting, session management, HTML structure maintenance — every layer requires ongoing attention. ScrapeBadger handles all of it.

Any site with DataDome or PerimeterX. Major fashion retailers and sporting goods stores in particular. These systems do behavioural analysis that naive scraping cannot pass. As detailed in our complete guide to scraping without getting blocked, these require infrastructure-level bypass that's impractical to build and maintain in-house.

Cross-platform price intelligence at scale. Monitoring 50,000 SKUs across 20 competitor websites simultaneously isn't a scraper — it's a platform. The proxy management, rate limiting, retry logic, and maintenance overhead of that scale is substantial. ScrapeBadger's batch processing and scheduling handles this without infrastructure investment.

AI agents that need live product data. If you're building an AI agent that makes pricing or inventory decisions, connecting it to a ScrapeBadger-powered data feed through the MCP integration is faster and more reliable than building a custom data layer. The MCP documentation has everything you need to connect Claude, Cursor, or any MCP-compatible agent to live e-commerce data.

The global e-commerce market is growing at a rate that makes competitive data a prerequisite for strategic decision-making, not a nice-to-have. The infrastructure to collect it reliably is available — whether you build it, use a scraping API, or combine both. The important thing is that the data flows.


Written by

Thomas Shultz

Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.
