Most job scraping tutorials show you a BeautifulSoup script that worked once in 2021. You run it, get back empty results or a 403, and spend the next two hours debugging something that was never going to work in the first place.
The reality in 2026 is that major job boards (Indeed, LinkedIn, Glassdoor, ZipRecruiter) have invested heavily in anti-bot infrastructure. Basic HTTP requests against these sites have a 5–10% success rate. Playwright with residential proxies gets you to 70–85%. A proper scraping API gets you to 95%+ without managing any of that infrastructure yourself.
This guide covers the full landscape: what actually works per target, where the failure modes are, and how to build a pipeline you'd actually trust to run on a schedule.
Why Job Boards Are Harder Than Most Sites
Job boards have a specific combination of properties that makes them difficult:
Dynamic rendering. Greenhouse, Lever, Workday, and most modern ATS platforms are single-page apps. The job cards don't exist in the initial HTML; they load after JavaScript executes. BeautifulSoup sees nothing.
Aggressive bot detection. LinkedIn uses account-based detection and rate limiting. Indeed runs Cloudflare. ZipRecruiter adds behavioral analysis. Hitting these with a plain requests.get() will get you blocked within a handful of requests.
Selector instability. Job boards update their HTML structure regularly. A scraper tied to specific class names breaks silently: the script completes, the CSV has headers, you assume everything is fine.
The practical consequence: you need to match your tool to the target. A requests + BeautifulSoup setup works fine on a small company's static careers page. It fails completely on Workday.
The Four Approaches (Ranked by Production Readiness)
Static Scraping with requests + BeautifulSoup
Works for: simple, server-rendered HTML pages such as small company career pages and job boards that don't use JavaScript-heavy rendering.
```python
import requests
from bs4 import BeautifulSoup

url = "https://example-company.com/careers"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

jobs = []
for card in soup.select(".job-card"):  # Adjust selector per site
    title = card.select_one(".job-title")
    location = card.select_one(".job-location")
    jobs.append({
        "title": title.text.strip() if title else None,
        "location": location.text.strip() if location else None,
    })

print(jobs)
```
The failure mode here is silent: if the page uses JavaScript rendering or returns a bot challenge, response.text contains a challenge page or blank content, not an error. Always print the first 500 characters of the response when debugging.
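A quick way to check, building on the block above (the challenge markers below are common examples, not an exhaustive list):

```python
# Minimal sketch: inspect what actually came back before blaming your selectors.
snippet = response.text[:500]
print(snippet)

# Common signs you got a bot challenge or an empty JS shell instead of listings.
challenge_markers = ["cf-challenge", "just a moment", "captcha", "access denied"]
if any(marker in snippet.lower() for marker in challenge_markers):
    print("Likely blocked: the response is a challenge page, not job content")
elif not soup.select(".job-card"):
    print("No job cards found: the page probably renders its listings with JavaScript")
```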
Browser Automation with Playwright
Works for: JavaScript-heavy job boards where you can't use static scraping, and where you have time to manage proxies.
Playwright is meaningfully better than Selenium for this use case: faster, better async support, and cleaner stealth options. But it still requires residential proxies at any meaningful scale, and it breaks when sites change their DOM structure.
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_jobs(url: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            # Playwright expects proxy credentials as separate fields
            proxy={
                "server": "http://residential-proxy:port",
                "username": "user",
                "password": "pass",
            },
        )
        page = await browser.new_page()
        await page.goto(url)
        await page.wait_for_selector(".job-card", timeout=15000)

        jobs = []
        cards = await page.query_selector_all(".job-card")
        for card in cards:
            title = await card.query_selector(".job-title")
            jobs.append({
                "title": await title.inner_text() if title else None,
            })

        await browser.close()
        return jobs

print(asyncio.run(scrape_jobs("https://jobs.example.com")))
```
The problem with this approach at scale: you're paying for proxy bandwidth, managing browser instances, handling CAPTCHA, and maintaining selectors that change weekly. It works, but it's a real operational burden.
JobSpy (Open-Source Multi-Site Library)
Works for: quickly pulling data from Indeed, LinkedIn, ZipRecruiter, Glassdoor, and a few others without building per-site scrapers.
```python
from jobspy import scrape_jobs

jobs = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
    search_term="Python Developer",
    location="New York, NY",
    results_wanted=50,
    hours_old=48,
)

# scrape_jobs returns a pandas DataFrame; to_csv writes the file (it returns None)
jobs.to_csv("jobs.csv", index=False)
print(jobs.head())
```
Useful for fast prototyping and personal projects. The limitations: it requires proxy configuration for any real volume, LinkedIn support breaks periodically, and you're still responsible for maintenance when upstream sites change. For production pipelines where reliability matters, it's a starting point, not an endpoint.
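Since scrape_jobs returns a pandas DataFrame, light cleanup is straightforward. A sketch, assuming column names (job_url, title, company, date_posted) that recent JobSpy releases use; verify them against your installed version before relying on them:

```python
# Sketch: basic cleanup on the DataFrame returned above.
# Column names are assumptions; check jobs.columns for your JobSpy version.
clean = jobs.drop_duplicates(subset=["job_url"])       # same posting found on multiple boards
clean = clean.dropna(subset=["title", "company"])      # drop rows missing core fields
clean = clean.sort_values("date_posted", ascending=False)
clean.to_csv("jobs_clean.csv", index=False)
print(f"{len(clean)} unique listings after cleanup")
```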
Scraping API (Production-Ready)
Works for: anything you want to run reliably on a schedule without maintaining infrastructure.
This is where ScrapeBadger's web scraping endpoint fits. You POST a URL, configure a few parameters, and get back clean content; the anti-bot handling, proxy rotation, and browser rendering are handled on their side.
Building a Reliable Job Scraper with ScrapeBadger
The ScrapeBadger web scrape endpoint takes a URL and returns structured content. For job scraping, the parameters that matter most are:
| Parameter | What it does for job scraping |
|---|---|
| `render_js` | Required for ATS platforms (Greenhouse, Workday, Lever) |
| `wait_for` | Waits for job cards to load before extracting |
| `anti_bot` | Bypasses Cloudflare and similar protections |
| `escalate` | Auto-upgrades to premium engine for heavily protected sites |
| `ai_extract` + `ai_prompt` | Returns structured job data instead of raw HTML |
| `format: "markdown"` | Cleaner output than raw HTML for parsing |
| `retry_on_block` | Handles intermittent blocks automatically |
Step 1: Scraping a Static Career Page
For a company careers page that doesn't require JavaScript rendering:
```python
import requests

response = requests.post(
    "https://scrapebadger.com/v1/web/scrape",
    headers={
        "Content-Type": "application/json",
        "x-api-key": "YOUR_API_KEY",
    },
    json={
        "url": "https://example-company.com/careers",
        "format": "markdown",
        "anti_bot": True,
        "retry_on_block": True,
    },
)

data = response.json()
print(data["content"])
```
Cost: 1 credit per request (HTTP engine).
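If this runs on a schedule, it's worth failing loudly on anything unexpected rather than silently writing an empty file. A minimal check, assuming the success response carries a content field as in the example above:

```python
# Sketch: fail loudly instead of silently producing empty output downstream.
response.raise_for_status()                  # surface HTTP-level errors
data = response.json()
content = data.get("content", "")
if not content.strip():
    raise RuntimeError("Scrape returned no content; check the URL or enable render_js")
```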
Step 2: Scraping a JavaScript-Rendered Job Board
For Greenhouse, Lever, Workday, or any React/Vue-based careers page:
```python
import requests

response = requests.post(
    "https://scrapebadger.com/v1/web/scrape",
    headers={
        "Content-Type": "application/json",
        "x-api-key": "YOUR_API_KEY",
    },
    json={
        "url": "https://boards.greenhouse.io/yourcompany",
        "render_js": True,
        "wait_for": "#app_body",
        "format": "markdown",
        "anti_bot": True,
        "retry_on_block": True,
    },
)

data = response.json()
print(data["content"])
```
Cost: 5 credits per request (browser engine). For heavily protected pages, setting escalate: True upgrades to the premium engine automatically at 10 credits.
Step 3: Using AI Extraction to Skip the Parsing Step
The most useful feature for job scraping: instead of writing CSS selectors and normalization logic, you pass an ai_prompt and get structured data back directly.
```python
import requests

response = requests.post(
    "https://scrapebadger.com/v1/web/scrape",
    headers={
        "Content-Type": "application/json",
        "x-api-key": "YOUR_API_KEY",
    },
    json={
        "url": "https://boards.greenhouse.io/yourcompany",
        "render_js": True,
        "wait_for": "#app_body",
        "anti_bot": True,
        "ai_extract": True,
        "ai_prompt": "Extract all job listings. For each, return: job_title, department, location, employment_type, and application_url.",
        "retry_on_block": True,
    },
)

data = response.json()
extracted = data.get("ai_extraction", {})
print(extracted)
```
This returns a structured object: no HTML parsing, no brittle selectors, no normalization code. When the page structure changes, the AI handles it.
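A small normalization step keeps downstream code stable regardless of how the prompt phrases things. A sketch, assuming the listings come back under a key like job_listings; the exact shape depends on your prompt, so adjust the key and field names to what your extraction actually returns:

```python
# Sketch: map the AI extraction onto a fixed schema.
# "job_listings" and the field names are assumptions based on the prompt above.
FIELDS = ["job_title", "department", "location", "employment_type", "application_url"]

rows = [{field: item.get(field) for field in FIELDS}      # missing fields become None
        for item in extracted.get("job_listings", [])]
print(f"Extracted {len(rows)} listings")
```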
Step 4: Pulling from Google Jobs
If you want broad coverage across job boards without scraping each one individually, the Google Jobs search endpoint aggregates listings across sources:
```python
import requests

response = requests.get(
    "https://scrapebadger.com/v1/google/jobs/search",
    headers={"x-api-key": "YOUR_API_KEY"},
    params={
        "q": "Python developer",
        "location": "New York",
        "gl": "us",
        "job_type": "fulltime",
        "date_posted": "week",
    },
)

jobs = response.json()
for job in jobs.get("jobs_results", []):
    print(job.get("title"), "|", job.get("company_name"), "|", job.get("location"))
```
This is the fastest path to multi-source job data. Google has already aggregated listings from Indeed, LinkedIn, company pages, and dozens of other sources; you get all of it in a single request.
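Persisting those results takes a few more lines. A sketch using only the keys shown in the loop above (title, company_name, location); anything missing is written as blank:

```python
import csv

# Sketch: write the aggregated results to CSV with a fixed set of columns.
columns = ["title", "company_name", "location"]
with open("google_jobs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=columns)
    writer.writeheader()
    for job in jobs.get("jobs_results", []):
        writer.writerow({col: job.get(col, "") for col in columns})
```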
Tool Comparison
| Approach | Success Rate | Setup Time | Maintenance | Best For |
|---|---|---|---|---|
| requests + BeautifulSoup | 5–10% on major boards | Low | High (selector drift) | Static company career pages only |
| Playwright + proxies | 70–85% | High | High (proxy mgmt + selectors) | Custom control over browser behavior |
| JobSpy | 50–70% | Low | Medium (upstream breakage) | Fast prototypes, personal projects |
| ScrapeBadger web scrape | 95%+ | Low | Low (managed infrastructure) | Production pipelines, any job board |
| ScrapeBadger Google Jobs | 99%+ | Very low | None | Multi-source aggregation |
Common Failure Modes
Empty results with no error. The most common issue with static scraping on JS-rendered pages. Fix: add render_js: True and a wait_for selector. If you're unsure what to wait for, use wait_after_load with a value like 2000 to give the page extra time.
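As a concrete payload, the fix is one of two additions: wait for a selector you expect to appear, or fall back to a fixed delay when no reliable selector exists.

```python
# Sketch: the same scrape request with JS rendering and an explicit wait.
payload = {
    "url": "https://jobs.example.com",
    "render_js": True,
    "wait_for": ".job-card",       # preferred: wait for a selector you expect to appear
    # "wait_after_load": 2000,     # fallback: fixed delay in ms when no reliable selector exists
    "anti_bot": True,
}
```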
Selector breaks after a site update. Selectors tied to class names or DOM structure break silently. Fix: prefer data-testid or data-automation attributes where available; sites change class names far more often than data attributes. Better yet, use AI extraction and skip selectors entirely.
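For example, a selector bound to a data attribute survives most cosmetic redesigns where a class-based one would not (the attribute values here are illustrative, not from any specific site):

```python
# Sketch: bind selectors to data attributes rather than styling classes.
cards = soup.select('[data-testid="job-card"]')
titles = [card.select_one('[data-automation="job-title"]') for card in cards]
```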
Rate limiting after a few requests. Happens when you're hitting the same target repeatedly from the same IP. Fix: use session_id to maintain a session, add delays between requests, or use the country parameter to rotate proxy geography.
Cloudflare or bot challenge pages. The response looks like HTML but contains a challenge, not content. Fix: anti_bot: True handles most cases. For heavily protected targets, add escalate: True.
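Both of those fixes live in the same request payload. A sketch combining them, using the parameters referenced above with illustrative values:

```python
# Sketch: one payload covering rate limits and challenge pages.
payload = {
    "url": "https://www.example-jobboard.com/search?q=python",
    "anti_bot": True,              # handles most Cloudflare-style challenges
    "escalate": True,              # auto-upgrade to the premium engine if still blocked
    "retry_on_block": True,
    "session_id": "jobs-run-42",   # reuse one session across a paginated crawl
    "country": "us",               # switch proxy geography if one region gets throttled
}
```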
What a Production Pipeline Looks Like
For a repeatable job collection pipeline (say, monitoring hiring activity across 50 company careers pages daily), the structure is:
1. Maintain a list of target URLs (company careers pages, ATS board URLs)
2. For each URL, POST to the scrape endpoint with render_js, anti_bot, ai_extract, and your extraction prompt (see the sketch after this list)
3. Parse the AI-extracted structured data into a consistent schema
4. Deduplicate by job URL or a composite key (company + title + location)
5. Write new listings to a database or CSV
6. Alert on new entries
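A minimal sketch of that loop, reusing the AI-extraction call from Step 3. The API key, the job_listings key, and the in-memory seen_keys set are placeholders and assumptions you'd adapt to your own prompt, schema, and storage:

```python
import csv
import requests

TARGETS = ["https://boards.greenhouse.io/companya", "https://example-co.com/careers"]
FIELDS = ["job_title", "company", "location", "employment_type", "application_url"]
seen_keys = set()  # in production this lives in a database, not in memory

def scrape(url: str) -> list:
    """Fetch one careers page via the scrape endpoint and return extracted listings."""
    resp = requests.post(
        "https://scrapebadger.com/v1/web/scrape",
        headers={"Content-Type": "application/json", "x-api-key": "YOUR_API_KEY"},
        json={
            "url": url,
            "render_js": True,
            "anti_bot": True,
            "retry_on_block": True,
            "ai_extract": True,
            "ai_prompt": "Extract all job listings. For each, return: job_title, "
                         "company, location, employment_type, and application_url.",
        },
        timeout=120,
    )
    resp.raise_for_status()
    # The key the listings come back under depends on your prompt; adjust as needed.
    return resp.json().get("ai_extraction", {}).get("job_listings", [])

new_rows = []
for url in TARGETS:
    for job in scrape(url):
        key = (job.get("company"), job.get("job_title"), job.get("location"))
        if key in seen_keys:
            continue
        seen_keys.add(key)
        new_rows.append({field: job.get(field) for field in FIELDS})

with open("new_jobs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(new_rows)

print(f"{len(new_rows)} new listings")  # alerting (step 6) would hook in here
```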
The pipeline you'd build is similar to what's described in how to build a price tracking bot for e-commerce websites; the core pattern of fetch → normalize → deduplicate → store applies here too.
If you're newer to web scraping in general, the Python web scraping tutorial covers the foundational tools and patterns before you get into job-specific logic.
Treat your job schema as a contract. Decide on your fields upfront (job_title, company, location, employment_type, date_posted, application_url) and enforce them with safe defaults. The AI extraction prompt should mirror these fields exactly so the output is consistent across different site structures.
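One lightweight way to enforce that contract is a dataclass with safe defaults, so a missing field never breaks downstream code. A sketch using the fields listed above:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class JobListing:
    """The schema contract: every record downstream has exactly these fields."""
    job_title: Optional[str] = None
    company: Optional[str] = None
    location: Optional[str] = None
    employment_type: Optional[str] = None
    date_posted: Optional[str] = None
    application_url: Optional[str] = None

def to_record(raw: dict) -> dict:
    # Unknown keys are dropped; missing keys default to None.
    known = {k: raw.get(k) for k in JobListing.__dataclass_fields__}
    return asdict(JobListing(**known))
```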
FAQ
Why does requests + BeautifulSoup fail on most job boards?
Most modern job boards render their listings via JavaScript after the initial page load. When you use requests.get(), you receive the raw HTML before any JS executes, which means the job cards haven't been inserted into the DOM yet. BeautifulSoup parses that raw HTML and finds nothing. You need either a browser automation tool or a scraping API with render_js: True to get the rendered content.
What's the difference between scraping Indeed directly vs. using the Google Jobs endpoint?
Scraping Indeed directly means fighting Cloudflare, rate limits, and selector changes. The Google Jobs endpoint returns structured job listings aggregated across multiple sources (including Indeed, company pages, and other boards) in a single API call with no anti-bot friction. For most use cases, Google Jobs is faster and more reliable, though it covers what Google has indexed rather than giving you real-time or volume-based control over specific sites.
How do I handle ATS platforms like Workday or Greenhouse?
Greenhouse exposes clean JSON endpoints at boards-api.greenhouse.io/v1/boards/{company}/jobs; you can often call these directly without a browser at all. Workday is a single-page app and requires render_js: True plus a wait_for selector pointing to the job listing container. Lever has a similar JSON API available at api.lever.co/v0/postings/{company}. Identifying which ATS a company uses before deciding your approach saves a lot of time.
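As an illustration, the Greenhouse board API can be called with plain requests; "yourcompany" is a placeholder board token, and the field names reflect the shape that API usually returns:

```python
import requests

# Sketch: Greenhouse's public board API, no browser or rendering needed.
resp = requests.get("https://boards-api.greenhouse.io/v1/boards/yourcompany/jobs", timeout=30)
resp.raise_for_status()

for job in resp.json().get("jobs", []):
    print(job.get("title"), "|", job.get("location", {}).get("name"), "|", job.get("absolute_url"))
```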
What fields should I extract from job listings?
At minimum: job_title, company, location, date_posted, and application_url. Depending on the use case, add employment_type (full-time/contract/remote), salary_range, department, and a short description snippet. Keep the schema stable; it's a contract for everything downstream.
How do I avoid scraping the same job listing twice?
Use the application URL or a composite key of company + job_title + location as your primary deduplication key. Store seen keys in a database or a simple set, and check before writing new records. If you're running incremental jobs, the date_posted filter on the Google Jobs endpoint (date_posted: "day" or "week") limits results to recent listings and reduces the volume you need to deduplicate against.
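In code, the composite-key approach is a few lines; normalizing case and whitespace keeps trivial differences from creating duplicates (a sketch, using the schema fields from earlier):

```python
def deduplicate(jobs: list) -> list:
    """Keep the first occurrence of each (company, title, location) combination."""
    def norm(value):
        return (value or "").strip().lower()   # ignore casing and stray whitespace

    seen, unique = set(), []
    for job in jobs:
        key = (norm(job.get("company")), norm(job.get("job_title")), norm(job.get("location")))
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique
```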
Is it legal to scrape job listings?
Public job listings are generally fair game; they're posted specifically to be read. That said, terms of service vary by platform, and some (LinkedIn in particular) explicitly prohibit automated scraping. The practical answer: check the ToS for sites you're targeting, don't store or redistribute data in ways that violate those terms, and use scraping for analysis rather than replication of the full dataset. When in doubt, use the Google Jobs endpoint, which operates through Google's indexing rather than directly against the job board.

Written by
Thomas Shultz
Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.