
How to Scrape Websites Behind Login: Session-Based Scraping Guide

Thomas Shultz
16 min read

You have built a web scraper. It works perfectly on public pages. It extracts the data, formats it cleanly, and runs on a schedule. Then, you point it at a URL that requires authentication. Instead of the data you need, your scraper returns the HTML for a login page. Every single time.

This is the wall every developer hits when moving from basic data extraction to production-grade scraping. The underlying concept is simple: websites use sessions to remember who you are between requests. Your browser handles this invisibly, storing cookies and attaching them to every request. Your scraper does not, unless you explicitly build that functionality into it.

This guide covers three escalating levels of authentication complexity, providing working Python code for each:

  1. Simple username/password form login (using Python requests)

  2. JavaScript-heavy login with browser automation (using Playwright)

  3. Token-based and OAuth authentication (handling JWTs)

We will also cover how to persist sessions across runs, handle CSRF tokens, deal with session expiry, and when to use an infrastructure-level solution like ScrapeBadger to skip building this entirely.

(Not a developer? Check out our no-code scraping guide: tools like Browse AI handle authenticated scraping visually without any of the code below.)

1. How Website Authentication Works (What Your Scraper Needs to Know)

Before writing a single line of code, you must understand the mental model of how a website knows who you are. If you do not understand the mechanism, your scraper will fail silently.

When you log into a website, the server verifies your credentials. If they are correct, the server creates a session record in its database, generates a unique session ID, and sends that ID back to your browser inside a Set-Cookie HTTP header.

Your browser stores this cookie. On every subsequent request you make to that domain, your browser automatically includes that session ID in the Cookie header. The server reads the header, looks up the session ID, and knows you are authenticated.

Your scraper needs to replicate this exact process. It is not enough to simply send a POST request to the login endpoint; your scraper must capture the resulting session cookie and carry it on every subsequent request, exactly as a browser would.
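This round trip can be seen without touching a real server. A requests.Session() stores cookies and attaches them to every request it prepares; here the server's Set-Cookie step is simulated by setting the cookie by hand (example.com and sessionid are placeholders):

```python
import requests

session = requests.Session()

# Simulate the server's login response: Set-Cookie: sessionid=abc123
session.cookies.set("sessionid", "abc123", domain="example.com")

# Every request prepared from this session now carries the Cookie header,
# exactly as a browser would
req = session.prepare_request(
    requests.Request("GET", "https://example.com/dashboard")
)
print(req.headers["Cookie"])  # the stored session cookie travels automatically
```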

The Four Authentication Types You Will Encounter

Understanding which type of authentication you are dealing with saves hours of debugging. Opening your browser's DevTools (F12), navigating to the Network tab, and watching what happens when you click "Login" tells you everything you need to know.

| Type | How it works | Tools needed |
| --- | --- | --- |
| Form-based (username/password) | POST credentials, receive session cookie | requests.Session() |
| CSRF-protected form | Same, plus a hidden token must be submitted | requests + HTML parsing |
| JavaScript-dependent login | Login form renders/submits via JS | Playwright or Selenium |
| Token-based (JWT/OAuth) | Bearer token in Authorization header | requests with token refresh logic |
2. Method 1: Simple Form Login with Python Requests

When this works: The login form submits via a standard HTML POST request, and no JavaScript is required to render the form or execute the submission. This method works on roughly 40–50% of sites, especially older platforms, admin portals, and simple CMS backends.

Step 1: Inspect the Login Form

Open DevTools (F12), go to the Network tab, filter by XHR/Fetch or Doc, and attempt a login. Find the POST request that carries your credentials. You need to note three things:

  1. The form action URL (where the credentials are sent).

  2. The exact field names (e.g., username, email, password, or site-specific variations like user_login).

  3. Any hidden fields being submitted alongside the credentials.

Step 2: Create a Session and Login

We use requests.Session() rather than standard requests.post(). A Session object automatically persists cookies across all requests made from it.

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

login_url = "https://example.com/login"
credentials = {
    "username": "your_username",
    "password": "your_password"
}

response = session.post(login_url, data=credentials)

Step 3: Verify the Login Worked

Never assume a POST request succeeded just because it returned a 200 OK status code. A failed login often returns a 200 OK with the HTML of the login page displaying an error message.

# Method 1: check the URL; successful logins usually redirect away from /login
if "/login" not in response.url:
    print("Login successful")
else:
    print("Login failed: still on login page")

# Method 2: Check for known authenticated content
if "dashboard" in response.text.lower() or "welcome" in response.text.lower():
    print("Authenticated")

Step 4: Scrape Authenticated Pages

Because we used requests.Session(), the session object now carries your authentication cookies automatically on every subsequent request.

protected_page = session.get("https://example.com/dashboard")
soup = BeautifulSoup(protected_page.text, "html.parser")
data = soup.find("div", class_="user-data").text
print(data)

Step 5: Save and Reuse Cookies

Logging in on every single script run is slow, inefficient, and highly suspicious to anti-bot systems. The best practice is to log in once, save the cookies to a file, and reuse them until they expire.

import pickle

# Save after successful login
with open("cookies.pkl", "wb") as f:
    pickle.dump(session.cookies, f)

# Restore on the next script run
session = requests.Session()
with open("cookies.pkl", "rb") as f:
    session.cookies.update(pickle.load(f))

# Test if the restored session is still valid
response = session.get("https://example.com/dashboard")
if response.url.endswith("/login"):
    print("Session expired; need to re-login")
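The save, restore, and validity-check steps above can be folded into one helper that logs in only when the saved session is missing or expired. A sketch: the login step is passed in as a placeholder function, and example.com stands in for the real site.

```python
import os
import pickle
import requests

COOKIE_FILE = "cookies.pkl"

def load_session(cookie_file=COOKIE_FILE):
    """Return a Session with saved cookies restored, or a fresh one."""
    session = requests.Session()
    if os.path.exists(cookie_file):
        with open(cookie_file, "rb") as f:
            session.cookies.update(pickle.load(f))
    return session

def save_session(session, cookie_file=COOKIE_FILE):
    """Persist the session's cookies for the next run."""
    with open(cookie_file, "wb") as f:
        pickle.dump(session.cookies, f)

def get_authenticated_session(login_func,
                              check_url="https://example.com/dashboard"):
    """Restore a saved session; re-login via login_func if it has expired.

    login_func(session) performs the POST from Step 2 -- a placeholder here.
    """
    session = load_session()
    response = session.get(check_url)
    if response.url.endswith("/login"):  # expired: re-authenticate
        login_func(session)
        save_session(session)
    return session
```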

Limitation of this method: If the login form submits via JavaScript, or if the site checks for browser fingerprinting, requests will fail. If this happens, you must move to Method 2.

3. Handling CSRF Tokens (The Most Common Complication)

This is the single most common failure mode for Method 1. Most modern sites include a CSRF (Cross-Site Request Forgery) token in their login forms. This is a hidden field that must be submitted alongside your credentials. If you do not include it, the server will silently reject the login attempt.

How to Detect a CSRF Token

In DevTools, look at the login POST payload. If you see a field called token, csrftoken, authenticity_token, or something similar alongside your username and password, you are dealing with a CSRF token.

How to Extract and Submit It

The token is unique per page load and is embedded directly in the HTML. You must fetch the login page first, extract the token, and then submit it with your POST request.

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

# Step 1: GET the login page to capture the CSRF token
login_page = session.get("https://example.com/login")
soup = BeautifulSoup(login_page.text, "html.parser")

# Step 2: Extract the token (try multiple common field names)
csrf_token = (
    soup.find("input", {"name": "csrf_token"}) or
    soup.find("input", {"name": "_token"}) or
    soup.find("input", {"name": "authenticity_token"}) or
    soup.find("meta", {"name": "csrf-token"})
)

if csrf_token is None:
    raise RuntimeError("CSRF token not found; check the field names in DevTools")
token_value = csrf_token.get("value") or csrf_token.get("content")

# Step 3: Submit with the token
credentials = {
    "username": "your_username",
    "password": "your_password",
    "_token": token_value  # The field name must match what the site expects
}

response = session.post("https://example.com/login", data=credentials)

Important: Using requests.Session() is absolutely essential here. The session maintains the same cookies across the initial GET request (to fetch the login page) and the subsequent POST request (to submit the login). If the cookies do not match, the server will reject the CSRF token.

4. Method 2: JavaScript-Heavy Login with Playwright

When this works: The login form renders dynamically, submits via JavaScript, or the site checks for real browser behaviour (such as mouse movements, JavaScript execution, or TLS fingerprinting). This covers most modern single-page applications (SPAs), React/Vue frontends, and heavily protected platforms.

Why requests Fails on JS-Dependent Login

The requests library makes raw HTTP requests; it does not execute JavaScript. If the login button triggers a JavaScript event rather than submitting a standard HTML form, the POST request never happens. Playwright launches a real Chromium browser, ensuring everything executes exactly as it would for a human user.

Step 1: Install Playwright

pip install playwright
playwright install chromium

Step 2: Automate the Login

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
    )
    page = context.new_page()

    # Navigate to the login page
    page.goto("https://example.com/login")

    # Fill in credentials
    page.fill("#username", "your_username")
    page.fill("#password", "your_password")
    page.click("button[type='submit']")

    # Wait for navigation to confirm login
    page.wait_for_url("**/dashboard**", timeout=10000)
    print("Login successful")

Step 3: Save Session State (Login Once, Reuse Forever)

This is Playwright's most powerful session management feature. The storage_state method captures cookies and localStorage in a single JSON file.

    # After successful login, save the entire session state
    context.storage_state(path="auth_state.json")
    browser.close()

On subsequent runs, you can skip the login process entirely by loading this state file:

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # Restore saved session -- no login required
    context = browser.new_context(storage_state="auth_state.json")
    page = context.new_page()

    # Go straight to protected content
    page.goto("https://example.com/dashboard")
    data = page.inner_text(".protected-data")
    print(data)

    browser.close()

IMPORTANT: The auth_state.json file contains your live session tokens. Add it to your .gitignore file immediately and never commit it to version control.

Step 4: Check for Session Expiry

Sessions expire. You must build a guard that detects expiry and re-authenticates automatically.

def is_logged_in(page) -> bool:
    return "/login" not in page.url and "Sign in" not in page.title()

page.goto("https://example.com/dashboard")
if not is_logged_in(page):
    # Re-run the login flow and save fresh state
    login_and_save_state(page, context)
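The login_and_save_state helper used in this guard is not defined above; a minimal version, reusing the selectors and URLs from Step 2 (all placeholders for the real site), could look like this. It takes the page and context as arguments, so it works with whatever browser setup you already have:

```python
def login_and_save_state(page, context,
                         login_url="https://example.com/login",
                         state_path="auth_state.json"):
    """Re-run the login flow from Step 2 and persist fresh session state.

    Selectors (#username, #password) and URLs are placeholders -- adjust
    them to match the target site.
    """
    page.goto(login_url)
    page.fill("#username", "your_username")
    page.fill("#password", "your_password")
    page.click("button[type='submit']")
    page.wait_for_url("**/dashboard**", timeout=10000)
    # Overwrite the stale state file with the new session
    context.storage_state(path=state_path)
```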

Step 5: Now Scrape Authenticated Pages

With a live, authenticated Playwright session, you can navigate to and scrape any protected page on the platform.

    # Navigate and extract data
    page.goto("https://example.com/members/data")
    page.wait_for_selector(".data-table")  # Wait for content to load

    rows = page.query_selector_all("table tbody tr")
    for row in rows:
        cells = row.query_selector_all("td")
        print([cell.inner_text() for cell in cells])

5. Method 3: Token-Based Authentication (JWT and OAuth)

When this applies: Modern APIs and Single Page Applications (SPAs) that use JSON Web Tokens (JWTs) stored in localStorage, or sites with Google, Facebook, or GitHub OAuth login. You will not find a traditional form POST; the authentication flow happens differently.

How to Identify Token-Based Auth

In DevTools, go to the Network tab and look at the authenticated requests made after login. If you see an Authorization: Bearer eyJ... header, or if the login response returns a JSON object containing a token or access_token field, you are dealing with token-based authentication.

Direct API Token Authentication

If the site exposes a direct API endpoint for login, you can authenticate and receive the token using requests.

import requests

# Step 1: Authenticate and receive token
auth_response = requests.post(
    "https://example.com/api/auth/login",
    json={"email": "user@example.com", "password": "secret"}
)
token = auth_response.json()["access_token"]

# Step 2: Use token on all subsequent requests
headers = {"Authorization": f"Bearer {token}"}

data_response = requests.get(
    "https://example.com/api/protected/data",
    headers=headers
)
print(data_response.json())

Extracting Tokens from a Browser Session

When the site stores the JWT in localStorage (which is common in React applications), use Playwright to extract it after completing a browser-based login.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # Headful is often needed for OAuth flows
    page = browser.new_page()

    # Complete the browser-based login
    page.goto("https://example.com/login")
    page.fill("#email", "user@example.com")
    page.fill("#password", "secret")
    page.click("button[type='submit']")
    page.wait_for_url("**/dashboard")

    # Extract JWT from localStorage
    token = page.evaluate("() => localStorage.getItem('authToken')")
    print(f"Token: {token}")

    # Now use the token in requests (faster than browser for bulk scraping)
    import requests
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get("https://example.com/api/data", headers=headers)
    print(response.json())

Handling Token Expiry

JWTs typically expire after 15 minutes to 24 hours. You must build refresh logic into your scraper to handle this automatically.

import time

def get_fresh_token(email, password):
    response = requests.post(
        "https://example.com/api/auth/login",
        json={"email": email, "password": password}
    )
    data = response.json()
    return data["access_token"], data.get("expires_in", 3600)

token, expires_in = get_fresh_token("user@example.com", "secret")
token_expiry = time.time() + expires_in

def get_data(url):
    global token, token_expiry
    if time.time() > token_expiry - 60:  # Refresh 60 seconds before expiry
        token, expires_in = get_fresh_token("user@example.com", "secret")
        token_expiry = time.time() + expires_in
    return requests.get(url, headers={"Authorization": f"Bearer {token}"})
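When the login response does not include expires_in, the expiry can usually be read from the JWT itself: the payload segment is plain base64-encoded JSON containing a standard exp claim. A sketch, with no signature verification (fine for scheduling refreshes, never for trusting a token):

```python
import base64
import json

def jwt_expiry(token):
    """Return the `exp` claim (a Unix timestamp) from a JWT payload.

    Decodes without verifying the signature -- only use this to decide
    when to refresh, not to validate the token.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"]
```

Compare jwt_expiry(token) against time.time() in place of the expires_in bookkeeping above.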

6. Common Failure Modes and How to Fix Them

This section covers the most frequent issues developers encounter when building session-based scrapers.

"My login POST returns 200 but I'm not authenticated"

Check for a CSRF token (see Section 3). Also, check whether the site sets cookies via JavaScript after the login response. If so, requests will not capture them. You must switch to Playwright.

"Session expires mid-scrape"

Sessions on authenticated sites typically last between 1 and 24 hours. Build expiry detection (e.g., checking for a redirect to /login) and automatic re-authentication into your scraping loop. For long-running jobs, use Playwright's storageState refresh pattern from Section 4.
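That detect-and-re-login pattern can be sketched as a small wrapper around the scraping loop. Here fetch and reauthenticate are placeholders for the Method 1 or Method 2 code above:

```python
def scrape_with_reauth(urls, fetch, reauthenticate, max_retries=2):
    """Fetch each URL, re-authenticating whenever the session has expired.

    fetch(url) returns a response object; reauthenticate() refreshes the
    session in place. Both are placeholders for your own session code.
    """
    results = []
    for url in urls:
        for attempt in range(max_retries + 1):
            response = fetch(url)
            expired = (response.status_code == 401
                       or response.url.endswith("/login"))
            if not expired:
                results.append(response)
                break
            reauthenticate()  # session died mid-scrape: log in again
        else:
            raise RuntimeError(f"Could not re-authenticate for {url}")
    return results
```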

"The login page has a CAPTCHA"

For recurring scraping, create a dedicated account and check whether the platform allows CAPTCHAs to be disabled for test or automation accounts. Otherwise, use a CAPTCHA-solving service integrated with Playwright. Alternatively, log in manually once, save the session state, and reuse it until it expires. See the ScrapeBadger documentation for built-in CAPTCHA handling.

"The site detects headless Playwright"

Sites with aggressive anti-bot systems (like Imperva or PerimeterX) detect headless browsers via TLS fingerprinting, not just the User-Agent string. Use playwright-stealth patches or route your traffic through ScrapeBadger, which handles fingerprinting at the infrastructure level.

"Login works locally but fails on my server"

Datacenter IPs (cloud servers) are often blocked from authentication endpoints by default. Use residential proxies or a scraping API that routes through residential IPs. Read more in our web scraping tools for beginners guide.

"The site uses 2FA"

Disable 2FA for the dedicated scraping account if the platform allows it. If not, use cookie-based authentication: complete the 2FA manually once in a real browser, export the session state, and reuse it in your scraper until it expires.

7. Legal and Ethical Considerations for Scraping Behind a Login

Scraping public data and scraping data behind a login are legally and ethically different in important ways.

When you log in to a site, you are typically bound by its Terms of Service (ToS). Most ToS explicitly prohibit automated access, scraping, or data extraction. Violating ToS is a civil matter, not a criminal one, but it can result in account termination, IP bans, and in some cases, legal action.

The key distinction courts have generally recognised: scraping your own data from a platform (your own profile, your own purchase history, your own content) is generally acceptable. Scraping other users' personal data through your authenticated account carries significantly higher legal risk.

Practical principles for authenticated scraping:

  • Use a dedicated account created specifically for scraping, never your personal account.

  • Scrape at respectful rates that do not degrade service for other users.

  • Do not store or redistribute personal data about other users.

  • Check the platform's ToS and robots.txt before building anything commercial.

  • If in doubt, contact the platform and ask. Many B2B platforms will provide API access or data exports upon request.

For a deeper treatment of scraping legality, see our web scraping business use cases article, which covers GDPR, CCPA, and the hiQ v. LinkedIn ruling in detail.

8. When to Use a Scraping API Instead of Building Your Own Session Management

Building session management from scratch is viable for simple cases. However, at some point, the engineering overhead tips against it. Here is when that point arrives:

  • The site uses Imperva or PerimeterX on its login endpoint. Your requests and Playwright sessions will get blocked regardless of how well they are built.

  • You need to scrape from multiple geographic locations, requiring different login sessions, different IPs, and different cookies per region.

  • The session expiry cycle is shorter than your scrape cycle, meaning you are spending more time re-authenticating than collecting data.

  • You are scraping dozens of different login-protected sites, and maintaining custom authentication logic per site has become a maintenance burden.

ScrapeBadger handles authenticated scraping at the infrastructure level. Session management, residential proxy routing, and anti-bot bypass are built in. You pass your credentials or session configuration once, and the API handles the rest across your entire scrape volume.

See the ScrapeBadger API tutorial for how to integrate it into an existing pipeline, or explore the ScrapeBadger MCP for advanced integrations.

9. Frequently Asked Questions

Q: Can I scrape a website that requires login?

Yes. You must build your scraper to authenticate, capture the session cookie or token, and include it in all subsequent requests. This can be done using Python requests for simple forms or Playwright for JavaScript-heavy logins.

Q: How do I handle session cookies in Python web scraping?

Use requests.Session(). It automatically stores cookies received from the login response and attaches them to all future requests made using that session object, perfectly replicating browser behaviour.

Q: What is the difference between cookie-based and token-based authentication for scraping?

Cookie-based authentication relies on the server setting a session ID in your browser's cookies. Token-based authentication (like JWT) requires you to extract a token from the login response and manually inject it into the Authorization header of subsequent requests.

Q: How do I stop my scraper from being logged out mid-run?

Build session expiry detection into your scraping loop. Check if the response URL redirects to /login or if the status code is 401 Unauthorized. If detected, trigger a function to re-authenticate and update the session state before continuing.

Q: Is scraping behind a login legal?

It depends on the Terms of Service and the data being scraped. Scraping your own data is generally safer than scraping other users' personal data. Violating ToS can lead to account bans or civil action. Always consult legal counsel for commercial projects.

Q: How do I scrape a website that uses Google login (OAuth)?

Use Playwright in headful mode to automate the Google login flow, wait for the redirect back to the target site, and then extract the resulting session cookie or JWT from localStorage to use in your scraper.

Q: What's the fastest way to reuse a login session across multiple scraping runs?

If using Playwright, use context.storage_state(path="auth.json") to save the entire session (cookies and local storage) to a file. On the next run, load it using browser.new_context(storage_state="auth.json") to skip login entirely.

Conclusion

The patterns in this guide cover the vast majority of authenticated scraping scenarios. The underlying concept is always the same: replicate what a browser does during and after login, persist the session state, and carry it on every subsequent request.

  • Simple form login, no JS: Use requests.Session() + CSRF extraction (Sections 2–3).

  • JavaScript-dependent login: Use Playwright with storageState (Section 4).

  • Token/JWT/OAuth: Extract the token, inject it as a header, and refresh on expiry (Section 5).

  • Anti-bot protected login, multi-geo, or production scale: Use ScrapeBadger.

Need authenticated scraping without the infrastructure overhead? Start with ScrapeBadger. Get your free API key here.

Written by

Thomas Shultz

Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.

