65% of companies are already using web scraping to feed their AI projects. The global AI training dataset market is projected to grow from $4.44 billion in 2026 to $23.18 billion by 2034, a CAGR of 22.90%. The growth is being driven by a fundamental constraint: models need more data than exists in ready-made public datasets, and the specific domain data that makes fine-tuned models genuinely useful rarely comes from anywhere except the web.
This guide covers the complete pipeline — from choosing what to scrape based on your training objective, through collection strategies for different content types, to the cleaning and formatting steps that determine whether your scraped data improves or degrades your model. Every section produces working code you can run.
Start With the Training Objective, Not the Scraping Tool
The most common mistake in building AI training datasets is starting with "what can I easily scrape" rather than "what does my model need to learn." The scraping strategy flows from the training objective. Get this backwards and you end up with a lot of data that doesn't improve your model.
The four training objectives that drive most custom dataset collection, and what each requires from scraped data:
Pre-training and continued pre-training — teaching a model general domain knowledge by training on large volumes of in-domain text. The target is breadth and volume. Quality filtering matters more than structure. You want diverse coverage of a topic area at scale, cleaned to remove navigation menus, headers, cookie notices, and other boilerplate.
Supervised fine-tuning (SFT) — teaching a model how to respond to instructions in a specific format or domain. The target is instruction-response pairs. Scraped raw text needs to be transformed into the {"instruction": "...", "response": "..."} format through either existing Q&A structure (forum threads, documentation pages) or post-processing with a language model.
Retrieval-Augmented Generation (RAG) — building a knowledge base an agent retrieves from at inference time. The target is high-quality, structured documents with clear chunking boundaries. Freshness matters more than volume — stale documents in a RAG corpus produce stale answers.
Reinforcement Learning from Human Feedback (RLHF) / DPO — training a reward model or preference dataset. The target is comparative pairs showing preferred and rejected responses. Scraped data for this purpose usually comes from review systems, upvote/downvote signals, and structured feedback content.
Each objective changes what you scrape, how you clean it, and what format it needs to be in.
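As a rough illustration of how the objective dictates the output format, the sketch below shows one plausible JSONL record shape per objective, plus a small validator. The field names (`text`, `instruction`, `response`, `chunk`, `prompt`, `chosen`, `rejected`), the sample contents, and the helper `validate_record` are assumptions chosen for illustration; the schema your training framework expects may differ.

```python
import json

# Hypothetical example records, one per training objective.
# Field names follow common conventions and are assumptions, not a fixed standard.
EXAMPLE_RECORDS = {
    "pretraining": {"text": "# Indemnification clauses\n\nAn indemnification clause shifts liability."},
    "sft": {
        "instruction": "What does an indemnification clause do?",
        "response": "It shifts liability for specified losses from one party to another.",
    },
    "rag": {
        "chunk": "An indemnification clause shifts liability for specified losses.",
        "metadata": {"source_url": "https://example.com/contracts", "scraped_at": "2026-01-01"},
    },
    "preference": {
        "prompt": "What does an indemnification clause do?",
        "chosen": "It shifts liability for specified losses from one party to another.",
        "rejected": "It is a type of contract.",
    },
}

# Minimum fields each objective needs before a record is worth keeping.
REQUIRED_FIELDS = {
    "pretraining": {"text"},
    "sft": {"instruction", "response"},
    "rag": {"chunk", "metadata"},
    "preference": {"prompt", "chosen", "rejected"},
}


def validate_record(objective: str, record: dict) -> bool:
    """Return True if the record has every required field with a non-empty value."""
    return all(record.get(field) for field in REQUIRED_FIELDS[objective])


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line (without the trailing newline)."""
    return json.dumps(record, ensure_ascii=False)
```

Running a check like this at write time catches records that were scraped for one objective but are being written into a dataset meant for another.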
The Data Format Question: Always Target Markdown
Not all data formats are equally useful for LLM training. Markdown is as lightweight as plain text yet carries structure the way HTML does, which makes it the sweet spot.
Raw HTML is a terrible training input. It contains tags, attributes, CSS class names, inline styles, tracking pixels, navigation chrome, footer content, and cookie banners. The signal-to-noise ratio is poor. Training on raw HTML teaches your model HTML structure, not domain knowledge.
Markdown preserves the structural signals that matter — headings, lists, code blocks, emphasis — without the HTML noise. The workflow for converting scraped HTML to training-ready Markdown:
python
import re
from typing import Optional

import markdownify
from bs4 import BeautifulSoup


def html_to_training_text(
    html: str,
    remove_nav: bool = True,
    remove_footer: bool = True,
    min_length: int = 200,
) -> Optional[str]:
    """
    Convert scraped HTML to clean Markdown suitable for LLM training.
    Returns None if the content doesn't meet quality thresholds.
    """
    soup = BeautifulSoup(html, "lxml")

    # Boilerplate that is always removed before conversion
    boilerplate_selectors = [
        "aside",
        ".cookie-banner", ".ad", ".advertisement",
        ".sidebar", ".breadcrumb", ".pagination",
        "script", "style", "noscript",
        "[aria-hidden='true']",
    ]
    # Navigation and footer removal is controlled by the flags
    if remove_nav:
        boilerplate_selectors.extend(["nav", "header"])
    if remove_footer:
        boilerplate_selectors.append("footer")

    for selector in boilerplate_selectors:
        for element in soup.select(selector):
            element.decompose()

    # Try to find the main content area
    main_content = (
        soup.find("article")
        or soup.find("main")
        or soup.find(id="content")
        or soup.find(class_="content")
        or soup.find(role="main")
        or soup.body
    )
    if not main_content:
        return None

    # Convert to Markdown
    markdown = markdownify.markdownify(
        str(main_content),
        heading_style="ATX",
        bullets="-",
        # Use the first class on <pre> (e.g. "language-python") as the fence language
        code_language_callback=lambda el: (el.get("class") or [""])[0].replace("language-", ""),
    )

    # Clean up excess whitespace
    markdown = re.sub(r"\n{3,}", "\n\n", markdown)
    # Collapse runs of spaces only outside fenced code blocks so code indentation survives
    parts = re.split(r"(```.*?```)", markdown, flags=re.DOTALL)
    markdown = "".join(
        part if part.startswith("```") else re.sub(r" {2,}", " ", part)
        for part in parts
    )
    markdown = markdown.strip()

    # Quality gate: reject if too short or mostly links
    if len(markdown) < min_length:
        return None

    # Rough link density check: mostly links = navigation page
    link_count = len(re.findall(r"\[.+?\]\(.+?\)", markdown))
    line_count = len(markdown.split("\n"))
    if line_count > 0 and link_count / line_count > 0.5:
        return None  # Probably a sitemap or link aggregator

    return markdown

# Install: pip install markdownify beautifulsoup4 lxml

For documentation sites, technical blogs, and knowledge bases (the most valuable fine-tuning sources), markdownify preserves code blocks, headings, and lists in a format that training pipelines can consume directly.
Strategy 1: Domain-Specific Pre-Training Corpus
The highest-value scraping use case for AI training is collecting domain-specific text for continued pre-training. A general model that receives continued pre-training on 10 billion tokens of legal documents, medical literature, or financial filings understands that domain in ways that prompt engineering can't fully compensate for.
The sources that produce the cleanest domain-specific corpora:
Documentation sites — technical documentation is exceptionally clean training data. Well-structured, authoritative, domain-specific, minimal boilerplate. A language model fine-tuned on a product's own documentation becomes genuinely useful for that product's support use cases.
Academic and professional publications — open access journals, preprints on arXiv, professional association publications. Dense with domain knowledge, citable, and generally higher quality than blog content.
Wikipedia category trees — structured, encyclopedic coverage of a domain, with clear article boundaries and consistent style. Easy to scrape; less domain-specific than primary sources but useful as a foundation.
Community knowledge bases — Stack Overflow, Stack Exchange, Reddit (technical subreddits), Quora answers. As covered in the ScrapeBadger Reddit Scraper guide, community Q&A content has a natural instruction-response structure that reduces post-processing.
python
import asyncio
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

import httpx

# Assumes html_to_training_text() from the previous section is in scope.


class DomainCorpusBuilder:
    """
    Builds domain-specific training corpus from a seed URL list.
    Outputs JSONL format compatible with most training pipelines.
    """

    def __init__(
        self,
        output_path: str,
        api_key: str,
        min_tokens_estimate: int = 150,  # Records estimated below this are dropped
        max_concurrent: int = 10,
    ):
        self.output_path = Path(output_path)
        self.api_key = api_key
        self.min_tokens = min_tokens_estimate
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.seen_hashes = set()  # Simple deduplication
        self._records_written = 0
        self._duplicates_skipped = 0

    def _content_hash(self, text: str) -> str:
        """Hash the normalized first 1,000 characters for cheap duplicate detection."""
        # Normalize before hashing to remove case and whitespace variation.
        # This only catches documents with an identical normalized opening;
        # it is not true near-duplicate detection.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized[:1000].encode()).hexdigest()

    def _is_duplicate(self, text: str) -> bool:
        h = self._content_hash(text)
        if h in self.seen_hashes:
            return True
        self.seen_hashes.add(h)
        return False

    async def _fetch_and_process(
        self,
        client: httpx.AsyncClient,
        url: str,
        metadata: Optional[dict] = None,
    ) -> Optional[dict]:
        """Fetch URL via ScrapeBadger and process to training record."""
        async with self.semaphore:
            try:
                response = await client.get(
                    "https://api.scrapebadger.com/v1/scrape",
                    params={"url": url, "render_js": True},
                    timeout=30.0,
                )
                response.raise_for_status()
                data = response.json()
                html = data.get("html", "")
                if not html:
                    return None

                # Convert to clean Markdown
                text = html_to_training_text(html, min_length=200)
                if not text:
                    return None

                # Rough token estimate (~1.3 tokens per word); drop short records
                token_estimate = int(len(text.split()) * 1.3)
                if token_estimate < self.min_tokens:
                    return None

                # Deduplication check
                if self._is_duplicate(text):
                    self._duplicates_skipped += 1
                    return None

                record = {
                    "text": text,
                    "source": url,
                    "scraped_at": datetime.now(timezone.utc).isoformat(),
                    "token_estimate": token_estimate,
                }
                if metadata:
                    record.update(metadata)
                return record
            except Exception:
                # Skip URLs that fail to fetch or parse; add logging here in production
                return None

    async def build(
        self,
        urls: list[str],
        metadata_list: Optional[list[dict]] = None,
    ) -> dict:
        """
        Build corpus from URL list.
        Returns summary stats.
        """
        if metadata_list is None:
            metadata_list = [{}] * len(urls)

        headers = {"X-API-Key": self.api_key}
        async with httpx.AsyncClient(headers=headers) as client:
            tasks = [
                self._fetch_and_process(client, url, meta)
                for url, meta in zip(urls, metadata_list)
            ]
            results = await asyncio.gather(*tasks)

        with open(self.output_path, "w", encoding="utf-8") as f:
            for record in results:
                if record:
                    f.write(json.dumps(record, ensure_ascii=False) + "\n")
                    self._records_written += 1

        total_tokens = sum(r["token_estimate"] for r in results if r)
        return {
            "records_written": self._records_written,
            "duplicates_skipped": self._duplicates_skipped,
            "total_urls": len(urls),
            "estimated_tokens": int(total_tokens),
            "output_path": str(self.output_path),
        }


# Usage
builder = DomainCorpusBuilder(
    output_path="legal_corpus.jsonl",
    api_key="your_scrapebadger_key",
    max_concurrent=10,
)
urls = [
    "https://legaldomain.com/contracts/overview",
    "https://legaldomain.com/employment-law",
    # ... hundreds more
]
stats = asyncio.run(builder.build(urls))
print(f"Built corpus: {stats['records_written']} documents, "
      f"~{stats['estimated_tokens']:,} tokens")

Strategy 2: Instruction-Response Pairs from Q&A Structures
Fine-tuning for instruction following requires data in the {"instruction": "...", "response": "..."} or {"messages": [...]} format. The most efficient source is content that already has this structure: Q&A forums, documentation pages with examples, technical blogs with "how to do X" structure.
Stack Exchange sites, Reddit's technical communities, and documentation platforms all produce natural instruction-response structure. The extraction approach:
python
from bs4 import BeautifulSoup


def extract_qa_pairs(html: str, url: str) -> list[dict]:
    """
    Extract instruction-response pairs from Q&A page HTML.
    Handles Stack Overflow, Stack Exchange, Reddit thread structures.
    The CSS selectors are site-specific and may need updating when markup changes.
    """
    soup = BeautifulSoup(html, "lxml")
    pairs = []

    # --- Stack Exchange / Stack Overflow pattern ---
    question_el = soup.select_one(".question-body, .s-prose.js-post-body")
    accepted_answer = soup.select_one(".accepted-answer .answercell .s-prose")
    if question_el and accepted_answer:
        q_text = question_el.get_text(separator=" ", strip=True)
        a_text = accepted_answer.get_text(separator=" ", strip=True)
        # Quality gates
        if len(q_text) > 50 and len(a_text) > 100:
            pairs.append({
                "instruction": q_text,
                "response": a_text,
                "source": url,
                "format": "stack_exchange",
            })

    # --- Reddit thread pattern (top-level Q + top comment A) ---
    post_content = soup.select_one("[data-testid='post-container']")
    top_comment = soup.select_one("[data-testid='comment']")
    if post_content and top_comment:
        q_text = post_content.get_text(separator=" ", strip=True)
        a_text = top_comment.get_text(separator=" ", strip=True)
        if len(q_text) > 50 and len(a_text) > 100:
            pairs.append({
                "instruction": q_text,
                "response": a_text,
                "source": url,
                "format": "reddit_thread",
            })

    # --- Generic FAQ / Documentation pattern ---
    faq_items = soup.select(".faq-item, [itemtype*='FAQPage'] [itemscope]")
    for item in faq_items:
        q = item.select_one(".faq-question, [itemprop='name']")
        a = item.select_one(".faq-answer, [itemprop='acceptedAnswer'] [itemprop='text']")
        if q and a:
            q_text = q.get_text(strip=True)
            a_text = a.get_text(separator=" ", strip=True)
            if len(q_text) > 20 and len(a_text) > 50:
                pairs.append({
                    "instruction": q_text,
                    "response": a_text,
                    "source": url,
                    "format": "faq",
                })

    return pairs


def convert_to_chat_format(pairs: list[dict]) -> list[dict]:
    """
    Convert instruction-response pairs to messages format
    (OpenAI chat format, compatible with most fine-tuning frameworks).
    """
    return [
        {
            "messages": [
                {"role": "user", "content": pair["instruction"]},
                {"role": "assistant", "content": pair["response"]},
            ],
            "source": pair.get("source", ""),
        }
        for pair in pairs
        if pair["instruction"] and pair["response"]
    ]

Strategy 3: Fresh Corpus for RAG Knowledge Bases
RAG pipelines retrieve documents at inference time rather than baking knowledge into weights. This means data freshness matters more than volume — a stale document in your retrieval corpus produces stale answers.
The scraping strategy for RAG knowledge bases is different from pre-training corpora:
Chunk-aware extraction — RAG systems split documents into chunks for embedding. Extracting content in a way that respects natural chunk boundaries (sections, paragraphs, code blocks) produces better retrieval performance than blindly splitting by character count.
Metadata preservation — every RAG document should carry source URL, title, publication date, and content type. These fields feed retrieval filters and answer attribution.
Freshness scheduling — RAG corpora should be re-scraped on a schedule that matches the target domain's update frequency. News and pricing data needs daily refresh. Documentation might need weekly. Legal references might need monthly.
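The freshness schedule described above can be sketched as a small function that decides which URLs are due for a re-scrape. The interval table simply encodes the rough daily, weekly, and monthly guidance from the list; the names `REFRESH_INTERVALS` and `urls_due_for_refresh`, the weekly default for unknown content types, and the index shape (`source_url`, `content_type`, `scraped_at` as an ISO 8601 string) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed refresh intervals per content type, following the rough guidance above.
REFRESH_INTERVALS = {
    "news": timedelta(days=1),
    "pricing": timedelta(days=1),
    "documentation": timedelta(weeks=1),
    "legal": timedelta(days=30),
}
DEFAULT_INTERVAL = timedelta(weeks=1)  # Assumption for unknown content types


def urls_due_for_refresh(
    index: list[dict],
    now: Optional[datetime] = None,
) -> list[str]:
    """
    Return source URLs whose last scrape is at least as old as the
    refresh interval for their content type.
    """
    now = now or datetime.now(timezone.utc)
    due = []
    for entry in index:
        interval = REFRESH_INTERVALS.get(entry["content_type"], DEFAULT_INTERVAL)
        last_scraped = datetime.fromisoformat(entry["scraped_at"])
        if last_scraped.tzinfo is None:
            # Treat naive timestamps as UTC
            last_scraped = last_scraped.replace(tzinfo=timezone.utc)
        if now - last_scraped >= interval:
            due.append(entry["source_url"])
    return due
```

Feeding the returned list back into the scraper on a daily cron job keeps fast-moving sources current without re-fetching stable ones every day.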
python
import asyncio
import hashlib
import json
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

import httpx
from bs4 import BeautifulSoup

# Assumes html_to_training_text() from the Markdown conversion section is in scope.


@dataclass
class RAGDocument:
    doc_id: str
    title: str
    content: str
    chunks: list[str]
    source_url: str
    scraped_at: str
    content_type: str
    word_count: int


def chunk_markdown(
    text: str,
    max_chunk_size: int = 512,
    overlap: int = 50,
) -> list[str]:
    """
    Split Markdown into chunks that respect natural boundaries.
    Prefers splitting at headers and paragraph breaks.
    Sizes are counted in whitespace-separated words, not tokens.
    """
    # Split on headers first (most natural boundary)
    sections = re.split(r"\n(?=#{1,3} )", text)
    chunks = []
    for section in sections:
        words = section.split()
        if len(words) <= max_chunk_size:
            if section.strip():
                chunks.append(section.strip())
        else:
            # Section is too long: split by paragraph
            paragraphs = section.split("\n\n")
            current_chunk = []
            current_size = 0
            for para in paragraphs:
                para_words = para.split()
                if current_size + len(para_words) > max_chunk_size and current_chunk:
                    chunks.append("\n\n".join(current_chunk).strip())
                    # Keep overlap from end of previous chunk
                    # (guard against overlap=0, where [-0:] would copy everything)
                    previous_words = " ".join(current_chunk).split()
                    overlap_text = " ".join(previous_words[-overlap:]) if overlap > 0 else ""
                    current_chunk = [overlap_text, para] if overlap_text else [para]
                    current_size = len(overlap_text.split()) + len(para_words)
                else:
                    current_chunk.append(para)
                    current_size += len(para_words)
            if current_chunk:
                chunks.append("\n\n".join(current_chunk).strip())
    return [c for c in chunks if len(c.split()) > 20]  # Drop tiny fragments


async def build_rag_document(
    client: httpx.AsyncClient,
    url: str,
    content_type: str = "documentation",
) -> Optional[RAGDocument]:
    """Build a RAG-ready document from a URL."""
    try:
        response = await client.get(
            "https://api.scrapebadger.com/v1/scrape",
            params={"url": url, "render_js": True, "wait_for": "networkidle"},
            timeout=30.0,
        )
        response.raise_for_status()
        data = response.json()
        html = data.get("html", "")
        if not html:
            return None

        soup = BeautifulSoup(html, "lxml")

        # Extract title
        title = ""
        title_el = soup.find("h1") or soup.find("title")
        if title_el:
            title = title_el.get_text(strip=True)

        # Get clean Markdown content
        text = html_to_training_text(html)
        if not text:
            return None

        # Generate chunks
        chunks = chunk_markdown(text, max_chunk_size=512, overlap=50)
        if not chunks:
            return None

        # MD5 is used only as a stable identifier here, not for security
        doc_id = hashlib.md5(url.encode()).hexdigest()
        return RAGDocument(
            doc_id=doc_id,
            title=title,
            content=text,
            chunks=chunks,
            source_url=url,
            scraped_at=datetime.now(timezone.utc).isoformat(),
            content_type=content_type,
            word_count=len(text.split()),
        )
    except Exception:
        # Skip URLs that fail to fetch or parse; add logging here in production
        return None


def save_for_vector_db(
    documents: list[RAGDocument],
    output_path: str,
) -> int:
    """
    Save RAG documents in format ready for ingestion into
    Chroma, Pinecone, Weaviate, or any vector database.
    Each chunk becomes a separate record with parent document metadata.
    """
    records = []
    for doc in documents:
        for i, chunk in enumerate(doc.chunks):
            records.append({
                "id": f"{doc.doc_id}_chunk_{i}",
                "text": chunk,
                "metadata": {
                    "doc_id": doc.doc_id,
                    "title": doc.title,
                    "source_url": doc.source_url,
                    "scraped_at": doc.scraped_at,
                    "content_type": doc.content_type,
                    "chunk_index": i,
                    "total_chunks": len(doc.chunks),
                },
            })
    with open(output_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)

The Data Cleaning Pipeline
Raw scraped content is never ready for training without cleaning. The specific issues that degrade scraped training data:
Boilerplate and navigation text — cookie notices, subscription popups, navigation menus, footers. These appear repeatedly across documents from the same domain and teach the model to reproduce boilerplate rather than domain knowledge.
Near-duplicate content — product description syndication, press release republication, scraped content farms. Training on duplicated content wastes compute and biases the model toward overrepresented content.
Low-quality signals — short snippets, mostly-link pages, machine-translated content, auto-generated thin content. These add noise without adding signal.
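Exact hashing only catches identical copies, so syndicated or lightly edited text needs a similarity measure. The sketch below is one simple way to do that, using word shingles and Jaccard similarity; it is an illustrative technique, not the method of any particular library. The function names, the 5-word shingle size, and the 0.8 threshold are assumptions to tune on your own data. It compares every document against every kept document, so it suits thousands of records; at millions, MinHash with locality-sensitive hashing is the usual replacement.

```python
import re


def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word shingles of the lowercased text, ignoring punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts: list[str], threshold: float = 0.8) -> list[str]:
    """
    Keep the first document of each near-duplicate group.
    A document is dropped when its shingle set has Jaccard similarity
    >= threshold with any document already kept.
    """
    kept = []
    kept_shingles = []
    for text in texts:
        current = shingles(text)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(current)
    return kept
```

Because the first occurrence wins, sorting the input so that the most authoritative source comes first decides which copy survives.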
python
import hashlib
import json
import re
from collections import Counter
from typing import Optional


class TrainingDataQualityFilter:
    """
    Multi-stage quality filter for scraped training data.
    Apply before adding records to a training corpus.
    """

    def __init__(
        self,
        min_words: int = 100,
        max_duplicate_ngram_ratio: float = 0.3,
        min_alpha_ratio: float = 0.7,
        max_symbol_ratio: float = 0.1,
    ):
        self.min_words = min_words
        self.max_duplicate_ngram_ratio = max_duplicate_ngram_ratio
        self.min_alpha_ratio = min_alpha_ratio
        self.max_symbol_ratio = max_symbol_ratio
        self._seen_hashes = set()

    def check_length(self, text: str) -> tuple[bool, str]:
        word_count = len(text.split())
        if word_count < self.min_words:
            return False, f"too_short ({word_count} words)"
        return True, "ok"

    def check_character_quality(self, text: str) -> tuple[bool, str]:
        if not text:
            return False, "empty"
        # Ratios are computed over all characters, including whitespace
        alpha_count = sum(1 for c in text if c.isalpha())
        symbol_count = sum(1 for c in text if not c.isalnum() and not c.isspace())
        alpha_ratio = alpha_count / len(text)
        symbol_ratio = symbol_count / len(text)
        if alpha_ratio < self.min_alpha_ratio:
            return False, f"low_alpha_ratio ({alpha_ratio:.2f})"
        if symbol_ratio > self.max_symbol_ratio:
            return False, f"high_symbol_ratio ({symbol_ratio:.2f})"
        return True, "ok"

    def check_repetition(self, text: str) -> tuple[bool, str]:
        """Detect repetitive content using n-gram duplication ratio."""
        words = text.lower().split()
        if len(words) < 10:
            return True, "ok"  # Too short to check meaningfully
        # Build 5-grams
        ngrams = [
            tuple(words[i:i + 5])
            for i in range(len(words) - 4)
        ]
        counts = Counter(ngrams)
        duplicate_ngrams = sum(
            count - 1 for count in counts.values() if count > 1
        )
        dup_ratio = duplicate_ngrams / len(ngrams) if ngrams else 0
        if dup_ratio > self.max_duplicate_ngram_ratio:
            return False, f"repetitive_content ({dup_ratio:.2f})"
        return True, "ok"

    def check_exact_duplicate(self, text: str) -> tuple[bool, str]:
        """Exact deduplication using content hash."""
        normalized = " ".join(text.lower().split())
        content_hash = hashlib.sha256(normalized.encode()).hexdigest()
        if content_hash in self._seen_hashes:
            return False, "exact_duplicate"
        self._seen_hashes.add(content_hash)
        return True, "ok"

    def filter(self, text: str) -> tuple[bool, str]:
        """Run all quality checks. Returns (passed, reason)."""
        for check in [
            self.check_length,
            self.check_character_quality,
            self.check_repetition,
            self.check_exact_duplicate,
        ]:
            passed, reason = check(text)
            if not passed:
                return False, reason
        return True, "passed"


# Usage
quality_filter = TrainingDataQualityFilter(min_words=100)
clean_records = []
rejected = Counter()
with open("raw_corpus.jsonl", encoding="utf-8") as f_in:
    for line in f_in:
        record = json.loads(line)
        text = record.get("text", "")
        passed, reason = quality_filter.filter(text)
        if passed:
            clean_records.append(record)
        else:
            rejected[reason] += 1

print(f"Kept: {len(clean_records)}")
print(f"Rejected: {sum(rejected.values())}")
for reason, count in rejected.most_common():
    print(f"  {reason}: {count}")

High-Value Source Categories for Specific Domains
Not all scraped data is equally valuable. The sources that produce the highest-quality fine-tuning data by domain:
Legal and compliance: court opinions, regulatory filings, contract templates, law review articles. All publicly accessible. Dense with domain-specific language patterns that make fine-tuned models genuinely useful for legal tasks. ScrapeBadger's Google Scholar endpoint surfaces relevant case law and academic legal literature.
Financial and economic — SEC filings (EDGAR has a public API), central bank publications, earnings call transcripts, analyst reports in the public domain. Combined with ScrapeBadger's Google Finance data, this produces models that understand both qualitative and quantitative financial language.
Medical and scientific — PubMed Central (open access), arXiv, clinical guidelines from public health organisations, pharmacy databases. Be especially careful about quality filtering here — medical training data is high-stakes and low-quality medical content can produce dangerous outputs.
Technical documentation — software documentation, API references, developer guides. Particularly valuable because the information is precise, structured, and the instruction-response pattern is natural. A model trained on a product's documentation is genuinely useful for support automation on that product.
Community knowledge — as covered in the ScrapeBadger Reddit guide, Reddit technical communities and Stack Exchange produce natural Q&A pairs with community quality signals (upvotes) that can be used as weak supervision for quality filtering.
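One way to turn the upvote signal mentioned above into weak supervision is to build preference pairs from answers to the same question. The sketch below is a minimal illustration under stated assumptions: the function name `build_preference_pairs`, the input shape (`text` and `score` per answer), the `prompt`/`chosen`/`rejected` output fields, and the minimum score gap of 5 are all hypothetical choices, not a standard.

```python
from itertools import combinations


def build_preference_pairs(
    question: str,
    answers: list[dict],
    min_score_gap: int = 5,
) -> list[dict]:
    """
    Build chosen/rejected pairs from community-scored answers.

    answers: list of {"text": str, "score": int} for one question.
    A pair is emitted only when the score gap is at least min_score_gap,
    so near-ties (which carry little preference signal) are skipped.
    """
    # Sort best-first so each combination is ordered (higher score, lower score)
    ranked = sorted(answers, key=lambda a: a["score"], reverse=True)
    pairs = []
    for better, worse in combinations(ranked, 2):
        if better["score"] - worse["score"] >= min_score_gap:
            pairs.append({
                "prompt": question,
                "chosen": better["text"],
                "rejected": worse["text"],
            })
    return pairs
```

Vote counts reflect popularity and posting time as well as quality, so pairs built this way are best treated as noisy labels and spot-checked before training.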
The Legal and Ethical Landscape in 2026
All-rights-reserved content such as NYT articles, and platform data such as Stack Overflow snapshots, has generally needed a license for training use since 2024. A common practical recipe for general-purpose corpora starts from Common Crawl filtered through quality classifiers, mixes in code, math, and Wikipedia, and adds Stack Exchange and books for instruction tuning.
For custom fine-tuning data collected via scraping, the relevant considerations:
Copyright on scraped content — publicly accessible doesn't mean copyright-free. Using scraped content for AI training is an active legal question in multiple jurisdictions. The EU's text and data mining exception allows research use but is less clear on commercial model training. In the US, fair use analysis applies but has not been definitively settled for training purposes.
Terms of Service — Reddit, LinkedIn, and many other platforms explicitly prohibit scraping for AI training purposes in their current ToS. This doesn't make the underlying data uncopyable, but it adds contractual exposure. Review ToS before building large-scale training corpora from any major platform.
Privacy and personal data: scraped content that includes personal information (names, emails, user posts) may be subject to GDPR, which requires a specific legal basis for training use. Apply data minimisation: collect only what the training objective requires, and pseudonymise personal identifiers where possible.
Attribution and citations — for RAG systems where the model attributes answers to sources, scraping the original source URL and metadata alongside content lets you preserve citation chains. This is both ethically sound and practically useful — cited answers are more trustworthy to end users.
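The data minimisation and pseudonymisation step described above can be sketched as a scrubbing pass that runs before records are written. This is a minimal illustration: the `author` field, the `[EMAIL]` placeholder, the `user_` prefix, and the function names are assumptions, and the email pattern is a simple heuristic that will miss unusual addresses. A salted hash is pseudonymisation rather than anonymisation, since anyone holding the salt can re-derive the mapping.

```python
import hashlib
import re

# Simple heuristic for email addresses; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise_author(author: str, salt: str) -> str:
    """Replace a username with a stable salted hash so threads stay linkable."""
    digest = hashlib.sha256((salt + author).encode("utf-8")).hexdigest()
    return f"user_{digest[:12]}"


def scrub_record(record: dict, salt: str) -> dict:
    """
    Return a copy of the record with email addresses masked in the text
    and the (hypothetical) "author" field pseudonymised.
    """
    cleaned = dict(record)
    cleaned["text"] = EMAIL_RE.sub("[EMAIL]", record.get("text", ""))
    if "author" in cleaned:
        cleaned["author"] = pseudonymise_author(cleaned["author"], salt)
    return cleaned
```

Keeping the salt outside the dataset, and dropping the `author` field entirely when the training objective does not need it, follows the minimisation principle more closely than hashing alone.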
The safest approach for commercial training data: public domain content (US government publications, works published before 1931 as of 2026, CC0 licensed material), openly licensed datasets (CC-BY), and proprietary data you own or have licensed. Supplement with scraped data where the copyright situation is less clear, but get legal review before commercially releasing a model trained on scraped content.
Connecting the Pipeline to ScrapeBadger
The full pipeline described above uses ScrapeBadger at the collection layer for all sources that require anti-bot bypass or JavaScript rendering. This covers the majority of high-value training data sources — technical documentation, news archives, forum content, and any platform with meaningful bot protection.
AI-related data projects at leading scraping providers are growing at 400% year-over-year, with average deal values three times higher than standard scraping projects (source: ScrapingBee). The volume requirements for training datasets are the primary driver: pre-training corpora need millions of documents, which means collection infrastructure that can sustain high throughput without degrading as anti-bot systems adapt.
ScrapeBadger's content validation layer matters specifically for training data: a block page that passes a status-code check but fails content validation is never billed and never enters your corpus. As covered in the data quality article on the ScrapeBadger blog, paying for garbage and then training on it is a cost that compounds — you pay for the request, waste the storage, and then waste compute training on noise. ScrapeBadger only charges for successful retrievals.
The MCP integration enables a particularly useful pattern for AI dataset work: an agent that discovers and evaluates candidate URLs for inclusion in a training corpus, calling ScrapeBadger to fetch and preview content before adding it to the collection queue. Setup at docs.scrapebadger.com/mcp/overview.
Full documentation for all endpoints — including the Google Scholar, News, and Search endpoints that are particularly useful for research and literature datasets — at docs.scrapebadger.com.

Written by
Thomas Shultz
Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.
