How to Scrape Job Listings with Python in 2026

Thomas Shultz · 10 min read

Most job scraping tutorials show you a BeautifulSoup script that worked once in 2021. You run it, get back empty results or a 403, and spend the next two hours debugging something that was never going to work in the first place.

The reality in 2026 is that major job boards (Indeed, LinkedIn, Glassdoor, ZipRecruiter) have invested heavily in anti-bot infrastructure. Basic HTTP requests against these sites have a 5–10% success rate. Playwright with residential proxies gets you to 70–85%. A proper scraping API gets you to 95%+ without managing any of that infrastructure yourself.

This guide covers the full landscape: what actually works per target, where the failure modes are, and how to build a pipeline you'd actually trust to run on a schedule.

Why Job Boards Are Harder Than Most Sites

Job boards have a specific combination of properties that makes them difficult:

  • Dynamic rendering. Greenhouse, Lever, Workday, and most modern ATS platforms are single-page apps. The job cards don't exist in the initial HTML; they load after JavaScript executes. BeautifulSoup sees nothing.

  • Aggressive bot detection. LinkedIn uses account-based detection and rate limiting. Indeed runs Cloudflare. ZipRecruiter adds behavioral analysis. Hitting these with a plain requests.get() will get you blocked within a handful of requests.

  • Selector instability. Job boards update their HTML structure regularly. A scraper tied to specific class names breaks silently: the script completes, the CSV has headers, you assume everything is fine.

The practical consequence: you need to match your tool to the target. A requests + BeautifulSoup setup works fine on a small company's static careers page. It fails completely on Workday.

The Four Approaches (Ranked by Production Readiness)

Static Scraping with requests + BeautifulSoup

Works for: simple, server-rendered HTML pages such as small company career pages and job boards that don't use JavaScript-heavy rendering.

import requests
from bs4 import BeautifulSoup

url = "https://example-company.com/careers"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

jobs = []
for card in soup.select(".job-card"):  # Adjust selector per site
    title = card.select_one(".job-title")
    location = card.select_one(".job-location")
    jobs.append({
        "title": title.text.strip() if title else None,
        "location": location.text.strip() if location else None,
    })

print(jobs)

The failure mode here is silent: if the page uses JavaScript rendering or returns a bot challenge, response.text contains a challenge page or blank content rather than an error. Always print the first 500 characters of the response when debugging.
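A small heuristic helper for that debugging step; the marker strings and the length threshold are illustrative signals, not an exhaustive or authoritative list:

```python
def looks_blocked(html: str) -> bool:
    """Heuristic check for challenge pages or suspiciously empty responses."""
    markers = ("just a moment", "captcha", "cf-chl", "challenge-platform")
    lowered = html.lower()
    if any(marker in lowered for marker in markers):
        return True
    # A rendered careers page is rarely this small; near-empty HTML usually
    # means the listings load via JavaScript after the initial response.
    return len(html.strip()) < 500

# Usage while debugging:
# print(response.text[:500])
# if looks_blocked(response.text):
#     print("Likely blocked or JS-rendered; switch approach")
```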

Browser Automation with Playwright

Works for: JavaScript-heavy job boards where you can't use static scraping, and where you have time to manage proxies.

Playwright is meaningfully better than Selenium for this use case โ€” faster, better async support, and cleaner stealth options. But it still requires residential proxies at any meaningful scale, and it breaks when sites change their DOM structure.

import asyncio
from playwright.async_api import async_playwright

async def scrape_jobs(url: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": "http://user:pass@residential-proxy:port"}
        )
        page = await browser.new_page()
        await page.goto(url)
        await page.wait_for_selector(".job-card", timeout=15000)

        jobs = []
        cards = await page.query_selector_all(".job-card")
        for card in cards:
            title = await card.query_selector(".job-title")
            jobs.append({
                "title": await title.inner_text() if title else None,
            })

        await browser.close()
        return jobs

jobs = asyncio.run(scrape_jobs("https://jobs.example.com"))
print(jobs)

The problem with this approach at scale: you're paying for proxy bandwidth, managing browser instances, handling CAPTCHA, and maintaining selectors that change weekly. It works, but it's a real operational burden.

JobSpy (Open-Source Multi-Site Library)

Works for: quickly pulling data from Indeed, LinkedIn, ZipRecruiter, Glassdoor, and a few others without building per-site scrapers.

from jobspy import scrape_jobs

jobs = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
    search_term="Python Developer",
    location="New York, NY",
    results_wanted=50,
    hours_old=48,
)
jobs.to_csv("jobs.csv", index=False)
print(f"Saved {len(jobs)} listings to jobs.csv")

Useful for fast prototyping and personal projects. The limitations: it requires proxy configuration for any real volume, LinkedIn support breaks periodically, and you're still responsible for maintenance when upstream sites change. For production pipelines where reliability matters, it's a starting point, not an endpoint.

Scraping API (Production-Ready)

Works for: anything you want to run reliably on a schedule without maintaining infrastructure.

This is where ScrapeBadger's web scraping endpoint fits. You POST a URL, configure a few parameters, and get back clean content; the anti-bot handling, proxy rotation, and browser rendering are handled on their side.

Building a Reliable Job Scraper with ScrapeBadger

The ScrapeBadger web scrape endpoint takes a URL and returns structured content. For job scraping, the parameters that matter most are:

| Parameter | What it does for job scraping |
| --- | --- |
| render_js | Required for ATS platforms (Greenhouse, Workday, Lever) |
| wait_for | Waits for job cards to load before extracting |
| anti_bot | Bypasses Cloudflare and similar protections |
| escalate | Auto-upgrades to a premium engine for heavily protected sites |
| ai_extract + ai_prompt | Returns structured job data instead of raw HTML |
| format: "markdown" | Cleaner output than raw HTML for parsing |
| retry_on_block | Handles intermittent blocks automatically |

Step 1: Scraping a Static Career Page

For a company careers page that doesn't require JavaScript rendering:

import requests

response = requests.post(
    "https://scrapebadger.com/v1/web/scrape",
    headers={
        "Content-Type": "application/json",
        "x-api-key": "YOUR_API_KEY"
    },
    json={
        "url": "https://example-company.com/careers",
        "format": "markdown",
        "anti_bot": True,
        "retry_on_block": True
    }
)

data = response.json()
print(data["content"])

Cost: 1 credit per request (HTTP engine).

Step 2: Scraping a JavaScript-Rendered Job Board

For Greenhouse, Lever, Workday, or any React/Vue-based careers page:

import requests

response = requests.post(
    "https://scrapebadger.com/v1/web/scrape",
    headers={
        "Content-Type": "application/json",
        "x-api-key": "YOUR_API_KEY"
    },
    json={
        "url": "https://boards.greenhouse.io/yourcompany",
        "render_js": True,
        "wait_for": "#app_body",
        "format": "markdown",
        "anti_bot": True,
        "retry_on_block": True
    }
)

data = response.json()
print(data["content"])

Cost: 5 credits per request (browser engine). For heavily protected pages, escalate: True kicks in automatically at 10 credits.

Step 3: Using AI Extraction to Skip the Parsing Step

The most useful feature for job scraping: instead of writing CSS selectors and normalization logic, you pass an ai_prompt and get structured data back directly.

import requests

response = requests.post(
    "https://scrapebadger.com/v1/web/scrape",
    headers={
        "Content-Type": "application/json",
        "x-api-key": "YOUR_API_KEY"
    },
    json={
        "url": "https://boards.greenhouse.io/yourcompany",
        "render_js": True,
        "wait_for": "#app_body",
        "anti_bot": True,
        "ai_extract": True,
        "ai_prompt": "Extract all job listings. For each, return: job_title, department, location, employment_type, and application_url.",
        "retry_on_block": True
    }
)

data = response.json()
extracted = data.get("ai_extraction", {})
print(extracted)

This returns a structured object: no HTML parsing, no brittle selectors, no normalization code. When the page structure changes, the AI handles it.

Step 4: Pulling from Google Jobs

If you want broad coverage across job boards without scraping each one individually, the Google Jobs search endpoint aggregates listings across sources:

import requests

response = requests.get(
    "https://scrapebadger.com/v1/google/jobs/search",
    headers={"x-api-key": "YOUR_API_KEY"},
    params={
        "q": "Python developer",
        "location": "New York",
        "gl": "us",
        "job_type": "fulltime",
        "date_posted": "week"
    }
)

jobs = response.json()
for job in jobs.get("jobs_results", []):
    print(job.get("title"), "|", job.get("company_name"), "|", job.get("location"))

This is the fastest path to multi-source job data. Google has already aggregated listings from Indeed, LinkedIn, company pages, and dozens of other sources, and you get all of it in a single request.

Tool Comparison

| Approach | Success Rate | Setup Time | Maintenance | Best For |
| --- | --- | --- | --- | --- |
| requests + BeautifulSoup | 5–10% on major boards | Low | High (selector drift) | Static company career pages only |
| Playwright + proxies | 70–85% | High | High (proxy mgmt + selectors) | Custom control over browser behavior |
| JobSpy | 50–70% | Low | Medium (upstream breakage) | Fast prototypes, personal projects |
| ScrapeBadger web scrape | 95%+ | Low | Low (managed infrastructure) | Production pipelines, any job board |
| ScrapeBadger Google Jobs | 99%+ | Very low | None | Multi-source aggregation |

Common Failure Modes

Empty results with no error. The most common issue with static scraping on JS-rendered pages. Fix: add render_js: True and a wait_for selector. If you're unsure what to wait for, use wait_after_load with a value like 2000 to give the page extra time.

Selector breaks after a site update. Selectors tied to class names or DOM structure break silently. Fix: prefer data-testid or data-automation attributes where available; sites change class names far more often than data attributes. Better yet, use AI extraction and skip selectors entirely.
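For example, selecting on a data attribute instead of a generated class name (the attribute values and class names here are hypothetical):

```python
from bs4 import BeautifulSoup

html = """
<div data-testid="job-card" class="css-1a2b3c">
  <span data-testid="job-title" class="css-9z8y7x">Data Engineer</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The attribute selector survives the class-name churn that breaks ".css-9z8y7x"
titles = [el.get_text(strip=True) for el in soup.select('[data-testid="job-title"]')]
print(titles)  # ['Data Engineer']
```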

Rate limiting after a few requests. Happens when you're hitting the same target repeatedly from the same IP. Fix: use session_id to maintain a session, add delays between requests, or use the country parameter to rotate proxy geography.
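One way to add those delays is exponential backoff with jitter; the base and cap values here are illustrative, not a recommendation for any specific site:

```python
import random

def polite_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Return a backoff delay in seconds: base * 2**attempt, capped, with jitter."""
    delay = min(cap, base * (2 ** attempt))
    # +/-50% jitter so repeated clients don't retry in lockstep
    return delay * random.uniform(0.5, 1.5)

# Between requests to the same target:
# time.sleep(polite_delay(attempt))
```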

Cloudflare or bot challenge pages. The response looks like HTML but contains a challenge, not content. Fix: anti_bot: True handles most cases. For heavily protected targets, add escalate: True.

What a Production Pipeline Looks Like

For a repeatable job collection pipeline (say, monitoring hiring activity across 50 company careers pages daily), the structure is:

  1. Maintain a list of target URLs (company careers pages, ATS board URLs)

  2. For each URL, POST to the scrape endpoint with render_js, anti_bot, ai_extract, and your extraction prompt

  3. Parse the AI-extracted structured data into a consistent schema

  4. Deduplicate by job URL or a composite key (company + title + location)

  5. Write new listings to a database or CSV

  6. Alert on new entries
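Steps 2 through 5 can be sketched as follows, under illustrative assumptions: the fetch callable wraps whichever scraping approach you chose, and an in-memory seen set stands in for a real database of previously stored keys:

```python
from typing import Callable, Iterable

def dedup_key(job: dict) -> str:
    """Prefer the application URL; fall back to company + title + location."""
    if job.get("application_url"):
        return job["application_url"]
    return "|".join(
        str(job.get(k, "")).lower() for k in ("company", "job_title", "location")
    )

def run_pipeline(
    fetch: Callable[[str], list],
    urls: Iterable[str],
    seen: set,
) -> list:
    """Fetch each target, skip listings already seen, and return only new ones."""
    new_jobs = []
    for url in urls:
        for job in fetch(url):
            key = dedup_key(job)
            if key not in seen:
                seen.add(key)
                new_jobs.append(job)
    return new_jobs
```

Swapping the in-memory set for a database table of seen keys makes this safe to run on a schedule.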

The pipeline you'd build is similar to what's described in how to build a price tracking bot for e-commerce websites: the core pattern of fetch → normalize → deduplicate → store applies here too.

If you're newer to web scraping in general, the Python web scraping tutorial covers the foundational tools and patterns before you get into job-specific logic.

Treat your job schema as a contract. Decide on your fields upfront (job_title, company, location, employment_type, date_posted, application_url) and enforce them with safe defaults. The AI extraction prompt should mirror these fields exactly so the output is consistent across different site structures.
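A minimal sketch of enforcing that contract, using the field names above: every record is projected onto the fixed schema, missing fields default to None, and unexpected keys are dropped.

```python
JOB_FIELDS = (
    "job_title", "company", "location",
    "employment_type", "date_posted", "application_url",
)

def normalize_job(raw: dict) -> dict:
    """Project a raw extracted record onto the fixed schema, defaulting to None."""
    return {field: raw.get(field) for field in JOB_FIELDS}
```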

FAQ

Why does requests + BeautifulSoup fail on most job boards?

Most modern job boards render their listings via JavaScript after the initial page load. When you use requests.get(), you receive the raw HTML before any JS executes, which means the job cards haven't been inserted into the DOM yet. BeautifulSoup parses that raw HTML and finds nothing. You need either a browser automation tool or a scraping API with render_js: True to get the rendered content.

What's the difference between scraping Indeed directly vs. using the Google Jobs endpoint?

Scraping Indeed directly means fighting Cloudflare, rate limits, and selector changes. The Google Jobs endpoint returns structured job listings aggregated across multiple sources, including Indeed, company pages, and other boards, in a single API call with no anti-bot friction. For most use cases, Google Jobs is faster and more reliable, though it covers what Google has indexed rather than giving you real-time or volume-based control over specific sites.

How do I handle ATS platforms like Workday or Greenhouse?

Greenhouse exposes clean JSON endpoints at boards-api.greenhouse.io/v1/boards/{company}/jobs; you can often call these directly without a browser at all. Workday is a single-page app and requires render_js: True plus a wait_for selector pointing to the job listing container. Lever has a similar JSON API at api.lever.co/v0/postings/{company}. Identifying which ATS a company uses before deciding your approach saves a lot of time.
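As a sketch, calling the Greenhouse board endpoint mentioned above directly; the response shape assumed here (a top-level "jobs" list with "title" and "absolute_url" fields) matches Greenhouse's public board API, but verify it against your target board:

```python
import requests

def greenhouse_jobs_url(company: str) -> str:
    """Public Greenhouse board API endpoint for a company's job list."""
    return f"https://boards-api.greenhouse.io/v1/boards/{company}/jobs"

def fetch_greenhouse_jobs(company: str) -> list:
    """Fetch the JSON job list; no browser or anti-bot handling needed."""
    response = requests.get(greenhouse_jobs_url(company), timeout=30)
    response.raise_for_status()
    return response.json().get("jobs", [])

# jobs = fetch_greenhouse_jobs("yourcompany")
# for job in jobs:
#     print(job.get("title"), job.get("absolute_url"))
```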

What fields should I extract from job listings?

At minimum: job_title, company, location, date_posted, and application_url. Depending on the use case, add employment_type (full-time/contract/remote), salary_range, department, and a short description snippet. Keep the schema stable โ€” it's a contract for everything downstream.

How do I avoid scraping the same job listing twice?

Use the application URL or a composite key of company + job_title + location as your primary deduplication key. Store seen keys in a database or a simple set, and check before writing new records. If you're running incremental jobs, the date_posted filter on the Google Jobs endpoint (date_posted: "day" or "week") limits results to recent listings and reduces the volume you need to deduplicate against.

Is it legal to scrape job listings?

Public job listings are generally fair game โ€” they're posted specifically to be read. That said, terms of service vary by platform, and some (LinkedIn in particular) explicitly prohibit automated scraping. The practical answer: check the ToS for sites you're targeting, don't store or redistribute data in ways that violate those terms, and use scraping for analysis rather than replication of the full dataset. When in doubt, use the Google Jobs endpoint, which operates through Google's indexing rather than directly against the job board.

Written by Thomas Shultz

Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.
