Most news aggregator tutorials show you how to scrape one site. Then you try to add a second source and realize every site has a completely different HTML structure, pagination pattern, and JavaScript rendering behavior. By the time you've written custom parsers for five sources, you're spending more time maintaining scrapers than reading the news they produce.
This guide covers how to build a news aggregator that actually holds up: the architecture decisions that matter, where traditional scraping breaks down, and how to structure a pipeline that doesn't need constant patching.
What a News Aggregator Actually Needs
Before writing code, it helps to be clear about what you're building. A news aggregator is a data pipeline with four responsibilities:
| Component | What It Does |
|---|---|
| Collection | Fetch articles from multiple sources on a schedule |
| Normalization | Flatten different source formats into a consistent schema |
| Deduplication | Prevent the same article from appearing multiple times |
| Storage + Delivery | Store articles and surface them in a usable format |
Each component has failure modes. Collection breaks when sites change their HTML or add bot protection. Normalization breaks when fields are missing or structured differently. Deduplication fails without a stable unique key. Storage becomes a problem when your schema doesn't account for optional fields.
The mistake most people make is treating this as a scraping problem. It's a pipeline design problem. Scraping is just the input layer.
The Three Data Source Patterns
News sites fall into three categories, and each requires a different approach.
RSS Feeds
About 80% of established news sites still publish RSS feeds. They're structured XML, they don't require parsing HTML, and they're stable. If a source has an RSS feed, use it. It's the lowest-maintenance option by a wide margin.
Libraries like feedparser (Python) handle parsing automatically. You get titles, links, publication dates, and summaries without writing a single CSS selector.
```python
import feedparser

FEEDS = {
    "Reuters": "https://feeds.reuters.com/reuters/topNews",
    "BBC": "http://feeds.bbci.co.uk/news/rss.xml",
    "Ars Technica": "https://feeds.arstechnica.com/arstechnica/index",
}

def collect_from_rss(feeds: dict) -> list[dict]:
    articles = []
    for source, url in feeds.items():
        feed = feedparser.parse(url)
        for entry in feed.entries:
            articles.append({
                "title": entry.get("title", ""),
                "url": entry.get("link", ""),
                "summary": entry.get("summary", ""),
                "published": entry.get("published", ""),
                "source": source,
            })
    return articles
```
The problem: not every site has an RSS feed, and some feeds are deliberately incomplete, with truncated summaries, missing content, and no body text. For those, you need to scrape.
Static HTML Sites
Sites that render content server-side are straightforward to scrape with requests + BeautifulSoup. The issue is that every site requires its own CSS selectors, and those selectors break when the site redesigns. A scraper for TechCrunch looks nothing like a scraper for The Guardian, which looks nothing like Hacker News.
This is the core maintenance problem with traditional news scraping. You end up with a collection of site-specific scrapers, each of which needs updating every few months.
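To make that concrete, here's roughly what one of those per-site scrapers looks like. This is a minimal sketch; the selectors are hypothetical and would need to be discovered by inspecting each site's HTML:

```python
import requests
from bs4 import BeautifulSoup

def scrape_site_listing(url: str) -> list[dict]:
    # Fetch the server-rendered HTML
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    articles = []
    # Hypothetical selectors -- every site needs its own set,
    # and they break whenever the site redesigns
    for link in soup.select("article h2 a"):
        articles.append({
            "title": link.get_text(strip=True),
            "url": link.get("href", ""),
        })
    return articles
```

Multiply this by every source you add, and the maintenance burden compounds.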
JavaScript-Rendered Sites
A significant portion of modern news sites render content client-side via JavaScript. requests will return an empty shell: the article content never appears in the initial HTML response. For these, you need a headless browser like Playwright or a scraping API that handles rendering for you.
Playwright works, but it's significantly slower and more resource-intensive than HTTP-based scraping. It also gets detected more frequently by anti-bot systems. The tradeoff is worth it only when there's no other option.
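If you do go the self-hosted route, here's a minimal sketch using Playwright's sync API (assuming pip install playwright followed by playwright install chromium):

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    # Launch a headless browser -- far heavier than a plain HTTP request
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so client-side content loads
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```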
Why Traditional Scraping Breaks at Scale
The failure modes become obvious once you're aggregating more than a handful of sources:
Selector rot. News sites redesign regularly. A CSS class like .post-block__title__link that works today will return nothing after the next frontend deploy. With ten sources, you're fixing broken selectors on a near-monthly basis.
JavaScript rendering gaps. About 60% of news sites load content dynamically. A plain HTTP request misses the actual articles.
Bot detection. Major publishers run Cloudflare, Akamai, or custom anti-bot systems. Roughly 68% of news sites use some form of bot protection. A vanilla requests call gets blocked immediately on many of them.
Schema variance. Even when scraping works, the data structure is inconsistent. One site puts the author in a <span class="byline">, another puts it in a <meta> tag, another doesn't publish it at all. Writing defensive normalization for every possible variation is tedious and fragile.
If you're building a personal aggregator with three to five sources you control, traditional scraping is fine. If you're building something that needs to reliably cover dozens of sources with minimal maintenance, you need a different approach for the collection layer.
A More Reliable Collection Layer: Scraping APIs
The practical alternative to maintaining your own scraper fleet is using a scraping API that handles rendering, bot detection, and proxy rotation internally. You pass it a URL, it returns the content. When a site changes its anti-bot setup, the API provider deals with it, not you.
ScrapeBadger's web scraping endpoint is built for exactly this. A single POST /v1/web/scrape request handles static and dynamic sites, with configurable rendering and AI extraction baked in.
```python
import requests
import os

SCRAPEBADGER_API_KEY = os.getenv("SCRAPEBADGER_API_KEY")

def scrape_article(url: str, render_js: bool = False) -> dict:
    response = requests.post(
        "https://scrapebadger.com/v1/web/scrape",
        headers={
            "x-api-key": SCRAPEBADGER_API_KEY,
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "format": "markdown",    # Clean readable text
            "render_js": render_js,  # True for JS-heavy sites
            "anti_bot": True,        # Handle bot protection
            "retry_count": 3,        # Auto-retry on failure
        },
    )
    response.raise_for_status()
    data = response.json()
    return {
        "content": data.get("content", ""),
        "engine_used": data.get("engine_used"),
        "credits_used": data.get("credits_used"),
    }
```
The format: "markdown" parameter is worth calling out specifically. Rather than returning raw HTML that you have to parse, it returns clean article text. For a news aggregator, that's usually what you want; you don't need the nav, ads, and footer markup.
AI Extraction for Structured Fields
If you need structured fields rather than raw content (headline, author, publish date, summary), the ai_extract parameter handles that without you writing a custom parser:
```python
def scrape_article_structured(url: str) -> dict:
    response = requests.post(
        "https://scrapebadger.com/v1/web/scrape",
        headers={
            "x-api-key": SCRAPEBADGER_API_KEY,
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "format": "markdown",
            "render_js": False,
            "anti_bot": True,
            "ai_extract": True,
            "ai_prompt": "Extract: headline, author, publish_date, summary (2-3 sentences), main_topic",
        },
    )
    response.raise_for_status()
    data = response.json()
    # Structured fields returned under ai_extraction
    return data.get("ai_extraction", {})
```
The result is a consistent JSON object regardless of how different sources structure their HTML. One parser works everywhere.
Engine Costs to Plan Around
| Engine | Credit Cost | When It's Used |
|---|---|---|
| HTTP | 1 credit | Most static sites |
| Browser | 5 credits | JS-heavy or dynamic sites |
| Escalated | 10 credits | Heavily protected sites (escalate: true) |
For a news aggregator checking 500 articles per day, you're looking at 500–2,500 credits depending on how many sources require browser rendering. Plan your source list accordingly.
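A quick back-of-envelope helper makes the planning concrete. The 30% browser share below is an illustrative assumption, not a measured figure:

```python
def estimate_daily_credits(articles_per_day: int,
                           browser_share: float,
                           escalated_share: float = 0.0) -> float:
    # Per-article engine costs from the table above: 1, 5, and 10 credits
    http_share = 1.0 - browser_share - escalated_share
    return articles_per_day * (http_share * 1 + browser_share * 5 + escalated_share * 10)

# 500 articles/day with 30% needing browser rendering:
# 500 * (0.7 * 1 + 0.3 * 5) = 1,100 credits/day
print(estimate_daily_credits(500, browser_share=0.3))
```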
Keyword-Based News Collection
For aggregators organized around topics rather than specific sources, ScrapeBadger's Google News endpoints are more efficient than scraping individual publisher pages.
The Google News search endpoint returns structured results for any search query:
```python
def fetch_news_by_keyword(query: str, max_results: int = 20) -> list[dict]:
    response = requests.get(
        "https://scrapebadger.com/v1/google/news/search",
        headers={"x-api-key": SCRAPEBADGER_API_KEY},
        params={
            "q": query,
            "hl": "en",
            "gl": "US",
            "max_results": max_results,
        },
    )
    response.raise_for_status()
    return response.json().get("results", [])
```
For topic-based feeds (Technology, Business, Sports), the News by Topic endpoint does the categorization for you:
```python
def fetch_news_by_topic(topic: str, max_results: int = 25) -> list[dict]:
    response = requests.get(
        "https://scrapebadger.com/v1/google/news/topics",
        headers={"x-api-key": SCRAPEBADGER_API_KEY},
        params={
            "topic": topic,  # "technology", "business", "sports", etc.
            "hl": "en",
            "gl": "US",
            "max_results": max_results,
        },
    )
    response.raise_for_status()
    return response.json().get("results", [])
```
This approach sidesteps the "maintain scrapers per publisher" problem entirely for keyword-driven use cases. You get normalized results from Google News across hundreds of sources without managing any of them directly.
Normalization and Deduplication
Regardless of which collection method you use, the output schema needs to be consistent. A news article should always produce the same set of fields, with safe defaults when something is missing.
```python
def normalize_article(raw: dict, source: str = "") -> dict:
    return {
        "url": str(raw.get("url") or raw.get("link") or ""),
        "title": str(raw.get("title") or raw.get("headline") or ""),
        "summary": str(raw.get("summary") or raw.get("description") or ""),
        "author": str(raw.get("author") or ""),
        "published": str(raw.get("published") or raw.get("publish_date") or ""),
        "source": str(raw.get("source") or source),
        "content": str(raw.get("content") or ""),
    }
```
Deduplication uses the URL as the primary key. URLs are stable identifiers for news articles: the same article will always have the same URL, provided you strip tracking parameters like utm_source that vary per referral. Store every URL you've processed in a lookup table and check against it before writing new records.
```python
import sqlite3

def setup_db(db_path: str = "news.db"):
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS articles (
            url TEXT PRIMARY KEY,
            title TEXT,
            summary TEXT,
            author TEXT,
            published TEXT,
            source TEXT,
            content TEXT,
            fetched_at TEXT DEFAULT (datetime('now'))
        )
    """)
    con.commit()
    return con

def save_articles(con, articles: list[dict]) -> int:
    saved = 0
    for article in articles:
        if not article["url"]:
            continue
        try:
            con.execute("""
                INSERT INTO articles (url, title, summary, author, published, source, content)
                VALUES (:url, :title, :summary, :author, :published, :source, :content)
            """, article)
            saved += 1
        except sqlite3.IntegrityError:
            pass  # Already in the database
    con.commit()
    return saved
```
The IntegrityError on PRIMARY KEY violation is intentional deduplication; SQLite enforces uniqueness automatically. This means you can run the collector multiple times without duplicating articles, even if sources overlap.
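Putting the pieces together, a minimal end-to-end run using the functions defined above might look like this:

```python
def run_pipeline() -> None:
    con = setup_db()
    # Collect from RSS, normalize to the shared schema, then store;
    # the URL primary key silently drops anything already saved
    raw_articles = collect_from_rss(FEEDS)
    normalized = [normalize_article(a) for a in raw_articles]
    saved = save_articles(con, normalized)
    print(f"Saved {saved} new articles")
    con.close()

if __name__ == "__main__":
    run_pipeline()
```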
Scheduling and Noise Filtering
A few decisions here matter more than the code itself.
Scheduling frequency vs. source type. Breaking news sources warrant 15-minute polling intervals. Weekly newsletters or slower-moving publications can run hourly or daily. Running everything at the same frequency wastes credits and creates noise.
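One lightweight way to implement per-source intervals, sketched with the standard library and the RSS collector from earlier (the interval values are illustrative):

```python
import time

# Hypothetical per-source polling intervals, in seconds
SCHEDULE = {
    "Reuters": 15 * 60,       # breaking news: every 15 minutes
    "Ars Technica": 60 * 60,  # slower-moving: hourly
}

last_run: dict[str, float] = {}

def poll_due_sources(con) -> None:
    # Fetch, normalize, and save only the sources whose interval has elapsed
    now = time.time()
    for source, interval in SCHEDULE.items():
        if now - last_run.get(source, 0) < interval:
            continue
        raw = collect_from_rss({source: FEEDS[source]})
        save_articles(con, [normalize_article(a) for a in raw])
        last_run[source] = time.time()

# Call poll_due_sources(con) once a minute from a loop or cron job
```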
Noise filtering upfront. Broad topic keywords return a lot of irrelevant content. Add negative keyword filters before storing, set a minimum title length (very short titles are usually navigation elements, not articles), and drop wire-service content that gets republished verbatim across dozens of outlets.
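A minimal filter along those lines; the keyword list and length threshold are placeholders to tune for your own topics:

```python
# Placeholder rules -- adjust per topic and source mix
NEGATIVE_KEYWORDS = {"sponsored", "giveaway", "horoscope"}
MIN_TITLE_LENGTH = 20

def is_noise(article: dict) -> bool:
    title = article["title"].strip()
    # Very short titles are usually navigation elements, not articles
    if len(title) < MIN_TITLE_LENGTH:
        return True
    # Drop anything matching a negative keyword
    lowered = title.lower()
    return any(kw in lowered for kw in NEGATIVE_KEYWORDS)

# articles = [a for a in articles if not is_noise(a)]
```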
Store raw before you process. Keep the original response alongside the normalized fields. Requirements change: what you want to extract six months from now might differ from today, and you want to be able to reprocess historical records without re-fetching them.
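One way to do that, assuming you add a raw_json TEXT column to the articles table and its INSERT statement:

```python
import json

def normalize_with_raw(raw: dict, source: str = "") -> dict:
    # Keep the untouched source record next to the normalized fields,
    # so historical articles can be reprocessed without re-fetching
    record = normalize_article(raw, source)
    record["raw_json"] = json.dumps(raw, default=str)
    return record
```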
If you're also building adjacent tools, like a price tracking bot or a broader web monitoring system, the same pipeline patterns apply: collection, normalization, deduplication, scheduled jobs, storage.
Legal and Ethical Considerations
This comes up in every news scraping project and deserves a direct answer rather than a disclaimer.
Check robots.txt before scraping any site. Most news publishers specify crawl rules there. Respecting those rules is both legally safer and practically smarter; sites that block aggressive scrapers will block yours too.
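Python's standard library makes that check a one-function job. A minimal sketch, with a placeholder user agent string:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, user_agent: str = "my-news-aggregator") -> bool:
    # Fetch and parse the site's robots.txt, then check this URL against it
    parsed = urlparse(url)
    robots = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)

# if is_allowed(article_url): scrape_article(article_url)
```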
RSS feeds exist specifically for machine consumption. When a site offers one, use it instead of scraping the HTML. It's what the feed is for.
The riskier territory is commercial use and republication. Scraping headlines for personal monitoring is different from republishing full article content in a product. The former is generally fine; the latter starts implicating copyright law in most jurisdictions. When in doubt, link to the original rather than reproducing content. You can read more about how web scraping works as a data method and where the boundaries are.
Treat your output schema as a contract and your data sources as borrowed: cite them, link back, and don't republish more than you need to.
FAQ
What's the most reliable way to collect news from multiple sites?
A combination of RSS feeds (for sites that offer them) and a scraping API (for everything else). RSS feeds are stable, structured, and low-maintenance. For sites without feeds, a scraping API that handles rendering and bot protection is more reliable than maintaining your own per-site scrapers.
How do I deduplicate articles across sources?
Use the article URL as the primary key in your database. Wire services like AP and Reuters get republished across many outlets verbatim; the same URL won't appear twice, but the same article content can. For catching near-duplicates across different URLs, sentence embeddings and cosine similarity work well, though they add infrastructure complexity.
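If you want to try the embedding approach, a small sketch using the sentence-transformers library; the model choice and similarity threshold are illustrative:

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast general-purpose model

def near_duplicate_pairs(titles: list[str], threshold: float = 0.9) -> list[tuple[int, int]]:
    # Embed every title, then flag pairs above the cosine similarity threshold
    embeddings = model.encode(titles, convert_to_tensor=True)
    scores = util.cos_sim(embeddings, embeddings)
    return [(i, j)
            for i in range(len(titles))
            for j in range(i + 1, len(titles))
            if scores[i][j] >= threshold]
```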
How often should I poll sources for new articles?
Match frequency to how the source actually updates. Major news sites update continuously; 15–30 minute intervals make sense. Slower-moving blogs or industry publications can run hourly or daily. Running every source at maximum frequency wastes API credits and makes noise filtering harder.
What do I do when a news site blocks my scraper?
First, check if the site has an RSS feed; that's the cleanest solution. If scraping is necessary, a scraping API with anti_bot: true and retry_on_block: true handles most common bot protection setups. For heavily protected sites (Cloudflare Enterprise, Akamai Bot Manager), the escalate: true parameter routes the request through a premium browser session.
How do I extract structured fields like author and publish date reliably?
AI extraction (ai_extract: true with a custom ai_prompt) returns consistent structured fields regardless of how different sites organize their HTML. This is more maintainable than writing site-specific CSS selectors that break when publishers update their templates.
Do I need to render JavaScript for news sites?
About 60% of news sites load content dynamically. For those, set render_js: true in the scrape request or use the browser engine explicitly. For static sites, the default HTTP engine works and costs fewer credits. The auto engine will pick the right approach automatically if you're not sure.
What's the difference between scraping Google News and scraping publishers directly?
Scraping Google News via the search or topic endpoints gives you structured results normalized across hundreds of sources (titles, URLs, publication dates) without managing any publisher-specific scraping logic. The tradeoff is you get headlines and metadata, not full article content. For full article text, you still need to scrape the individual article URLs. The two approaches are complementary, not mutually exclusive.

Written by
Thomas Shultz
Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.
