Most guides on building a Twitter dataset stop at "call the API and save some JSON." That's not a dataset. That's a pile of raw text with inconsistent fields, duplicate tweets, and no label structure. When you try to train something on it, you spend more time fixing the data than building the model.
This guide covers the full pipeline: what to collect, how to structure it, how to clean it, and what the resulting dataset should actually look like before you hand it to a training loop. Whether you're building a sentiment classifier, a topic model, or a fine-tuned LLM, the decisions you make during collection determine whether the dataset is usable or a debugging nightmare.
Why Twitter Data Is Valuable for ML
Twitter sits in a useful position for ML training data. Conversations are public, short-form, and topic-dense. The signal-to-noise ratio is manageable if you filter well. And the volume is high enough that you can build labeled corpora at scale without manual annotation for every row.
Specific use cases where Twitter data consistently performs well:
Sentiment analysis — product feedback, brand reaction, public opinion on events
Text classification — topic labeling, intent detection, spam filtering
Named entity recognition — extracting people, places, products from informal text
Language modeling — fine-tuning on informal, conversational English (or other languages)
Trend detection — training early-signal classifiers on time-series tweet volume
The problem isn't that Twitter data is hard to get. The problem is that most collection pipelines produce data that looks clean but isn't. Duplicate IDs, truncated text, missing timestamps, inconsistent schemas — any of these will cause quiet failures downstream.
Step 1: Define the Dataset Before You Collect It
The most common mistake is starting the scraper before deciding what the dataset should look like. Collection decisions are hard to undo once you've run a 10-hour job.
Answer these questions before writing any code:
What's the task? Sentiment classification needs labels (positive/negative/neutral). Topic modeling doesn't. A named entity dataset needs spans and entity types. Know what columns your training loop expects.
What's the collection window? A dataset for trend detection needs timestamps and sequential ordering. A general-purpose sentiment corpus doesn't care about time. Set your window and stick to it.
What's the keyword or account strategy? Keyword-based collection picks up broader conversation but requires more noise filtering. Account-based collection is more targeted but biases toward specific voices. Most real datasets mix both.
What's the minimum size? Fine-tuning a small model on tweet-length text usually requires at least 10K–50K examples for stable performance. Topic classifiers with 5+ categories need more. Set a floor before you start.
Treat your dataset schema as a contract. Define the columns first, then build collection logic around them. Changing the schema after collection means reprocessing everything.
Step 2: Design Your Collection Strategy
There are three practical ways to collect Twitter data for ML in 2026. Each has different trade-offs.
Strategy | Best For | Scale | Engineering Effort |
|---|---|---|---|
Keyword search (recent tweets) | Topic-specific corpora, sentiment datasets | Medium | Low |
Account timeline collection | Author-based datasets, domain-specific corpora | Medium | Low |
Historical backfill | Time-series datasets, event-based corpora | High | Medium |
Filter rules (real-time stream) | Continuous collection, trend datasets | High | Medium–High |
For most ML datasets, keyword search is where you start. It's the fastest way to get topic-relevant volume without building streaming infrastructure.
The keyword strategy matters more than people realize. Broad keywords (e.g., "AI") produce enormous volume with low relevance. Narrow keywords (e.g., "GPT-4 hallucination bug") produce low volume with high relevance. For labeled datasets, narrow is better — you get cleaner signal per row. For unsupervised tasks, broad is fine if you filter aggressively downstream.
Practical decision rule: If you're building a classification dataset with defined categories, use one keyword cluster per category. If you're building a general corpus, use 3–5 broad keywords and filter by engagement floor.
Step 3: Set Up the Collection Pipeline
Here's the environment setup. Keep it isolated — dataset projects have specific dependency versions and you don't want cross-contamination.
mkdir twitter-ml-dataset
cd twitter-ml-dataset
python3 -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows
pip install scrapebadger pandas
pip freeze > requirements.txt
Set your API key as an environment variable:
export SCRAPEBADGER_API_KEY="YOUR_API_KEY"
Project structure:
twitter-ml-dataset/
collect.py
clean.py
output/
raw/
clean/
Keep raw and clean outputs separate. You'll want to reprocess raw data against updated cleaning logic without re-collecting.
Step 4: Collect the Raw Data
Start with the smallest version that proves the pipeline works. Don't run the full job until you've validated the response shape.
# collect.py
import asyncio
import json
import os
from scrapebadger import ScrapeBadger
async def collect_tweets(query: str, max_items: int, out_path: str):
api_key = os.getenv("SCRAPEBADGER_API_KEY")
if not api_key:
raise RuntimeError("Missing SCRAPEBADGER_API_KEY")
results = []
async with ScrapeBadger(api_key=api_key) as client:
stream = client.twitter.tweets.search_all(query, max_items=max_items)
async for tweet in stream:
if not isinstance(tweet, dict):
tweet = getattr(tweet, "model_dump", lambda: dict(tweet))()
results.append(tweet)
# Write raw JSON — one object per line (JSONL format)
with open(out_path, "w", encoding="utf-8") as f:
for tweet in results:
f.write(json.dumps(tweet, ensure_ascii=False) + "\n")
print(f"Collected {len(results)} tweets → {out_path}")
if __name__ == "__main__":
asyncio.run(collect_tweets(
query="machine learning -is:retweet lang:en",
max_items=5000,
out_path="output/raw/ml_tweets.jsonl",
))
A few notes on why this is structured this way:
JSONL format (one JSON object per line) is standard for ML datasets. It's streamable, appendable, and most ML frameworks read it natively.
-is:retweetin the query excludes retweets. For ML datasets, retweets are almost always noise — they duplicate text and skew label distributions.lang:enkeeps the dataset monolingual. If you need multilingual data, collect separate files per language and label them.Raw output stays untouched. The clean pipeline runs separately.
Step 5: Normalize and Clean the Data
Raw tweet payloads have inconsistent field presence, nested objects, and occasional null values that will crash training scripts without warning. The normalization step converts raw JSON into a flat, stable schema.
# clean.py
import json
import csv
import re
import os
# Final schema — this is the contract
COLUMNS = [
"tweet_id",
"created_at",
"username",
"text",
"clean_text",
"like_count",
"retweet_count",
"reply_count",
"follower_count",
"lang",
]
def clean_text(text: str) -> str:
"""
Normalizes tweet text for ML consumption.
Strips URLs, @mentions, extra whitespace.
Preserves hashtags (often informative for topic models).
"""
text = re.sub(r"http\S+", "", text) # Remove URLs
text = re.sub(r"@\w+", "", text) # Remove @mentions
text = re.sub(r"\s+", " ", text).strip() # Normalize whitespace
return text
def normalize(tweet: dict) -> dict | None:
metrics = tweet.get("public_metrics") or {}
user = tweet.get("user") or {}
raw_text = str(tweet.get("text") or "")
# Skip empty text
if not raw_text.strip():
return None
return {
"tweet_id": str(tweet.get("id") or ""),
"created_at": str(tweet.get("created_at") or ""),
"username": str(user.get("username") or ""),
"text": raw_text,
"clean_text": clean_text(raw_text),
"like_count": int(metrics.get("like_count") or 0),
"retweet_count": int(metrics.get("retweet_count") or 0),
"reply_count": int(metrics.get("reply_count") or 0),
"follower_count": int(user.get("followers_count") or 0),
"lang": str(tweet.get("lang") or ""),
}
def process(in_path: str, out_path: str, min_chars: int = 20):
seen_ids = set()
rows = []
with open(in_path, "r", encoding="utf-8") as f:
for line in f:
try:
tweet = json.loads(line)
except json.JSONDecodeError:
continue
row = normalize(tweet)
if row is None:
continue
# Deduplicate
if row["tweet_id"] in seen_ids or not row["tweet_id"]:
continue
seen_ids.add(row["tweet_id"])
# Minimum text length filter
if len(row["clean_text"]) < min_chars:
continue
rows.append(row)
# Write CSV
with open(out_path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(f"Cleaned: {len(rows)} rows → {out_path}")
print(f"Dropped: duplicates + short text filtered out")
if __name__ == "__main__":
process(
in_path="output/raw/ml_tweets.jsonl",
out_path="output/clean/ml_tweets.csv",
min_chars=20,
)
The clean_text field is the one you pass to tokenizers. The text field is the original — keep it for audit purposes and for re-cleaning if your preprocessing logic changes.
The min_chars=20 filter removes very short tweets that don't carry enough signal for classification. Adjust based on your task — for some NER or intent tasks, short tweets are fine.
Step 6: What a Good Dataset Looks Like
Before you hand this to any training pipeline, sanity check the output.
What to verify:
No duplicate
tweet_idvaluesclean_texthas no URLs or @mentionsTimestamps are present and parseable
like_countandretweet_countare integers, not stringsNo rows where
clean_textis empty or under your minimum length
Engagement-based filtering for quality:
For supervised tasks, high-engagement tweets are sometimes better training examples — they've been validated by actual humans as interesting or informative. A floor of like_count >= 2 reduces spam and bot-generated content meaningfully without cutting too much volume.
# Optional: filter by engagement
rows = [r for r in rows if r["like_count"] >= 2 or r["retweet_count"] >= 1]
Don't over-filter. A dataset with <5K rows from a narrow query will have distribution problems. Run the filter, check row count, and relax the threshold if you're losing too much volume.
Step 7: Adding Labels (for Supervised Tasks)
Raw collected data isn't labeled. For sentiment classification or topic models, you need to add a label column.
Three practical approaches, in order of cost:
Approach | Quality | Cost | Best For |
|---|---|---|---|
Keyword-based proxy labels | Low–Medium | Free | Bootstrapping, weak supervision |
Zero-shot LLM labeling (e.g., GPT-4o) | Medium–High | Low–Medium | General sentiment, topic labels |
Manual annotation | High | High | Small high-quality gold sets |
Keyword proxy labels are the fastest. If you collected tweets matching "love this product" and "this product is broken", you already have soft positive/negative signal. Use the query itself as a weak label.
LLM-based labeling is practical for most teams now. Run clean_text through a batch classification prompt, output a label per row. For sentiment, this reliably produces usable training data at scale with minimal cost.
Manual annotation remains the gold standard for evaluation sets, even if you use LLMs for bulk labeling. Label 200–500 rows manually and use that as your test set.
Scaling Up: Multi-Keyword Collection
For larger datasets, you'll want to collect across multiple keyword clusters and merge them:
QUERIES = [
("machine learning -is:retweet lang:en", "output/raw/ml.jsonl"),
("deep learning tutorial -is:retweet lang:en", "output/raw/dl.jsonl"),
("neural network training -is:retweet lang:en", "output/raw/nn.jsonl"),
]
async def collect_all():
for query, path in QUERIES:
await collect_tweets(query, max_items=3000, out_path=path)
await asyncio.sleep(2) # Small buffer between jobs
After collecting, run all files through the same process() function. The deduplication step handles any overlap between query results — you'll inevitably get some tweets that match multiple queries.
For a multi-category classifier, structure one query cluster per category. Keep them in separate files until after cleaning, then merge with a category column added during normalization.
Common Dataset Problems (and Fixes)
Dataset is dominated by one time window. If you run one big job, you'll over-represent whatever was trending that day. Collect in smaller batches across multiple days for better temporal diversity.
Bot-generated content skews the distribution. Filter by follower_count >= 10 and like_count >= 1. Bots typically have zero engagement and very low follower counts. Not perfect, but catches most of it.
Text after cleaning is too short to be useful. This often means your query is too narrow and returning very short replies. Widen the query or increase min_chars.
Labels are imbalanced. Most keyword-based collections end up skewed toward neutral or slightly-positive sentiment. If you're building a classifier, explicitly over-collect for underrepresented categories or apply class weights during training.
If you're planning to collect at scale or want a full real-time data feed for ongoing dataset expansion, the real-time Twitter monitoring pipeline guide covers streaming infrastructure in detail.
Dataset Format Reference
The final output schema, with field descriptions:
Field | Type | Notes |
|---|---|---|
tweet_id | string | Unique key. Primary dedup identifier. |
created_at | string | ISO timestamp. Required for time-series tasks. |
username | string | Author handle. Useful for author-level analysis. |
text | string | Original tweet text. Keep for re-cleaning. |
clean_text | string | Preprocessed text. Feed this to tokenizers. |
like_count | integer | Engagement signal. Use for quality filtering. |
retweet_count | integer | Amplification signal. |
reply_count | integer | Conversation signal. High reply count = contested. |
follower_count | integer | Author reach. Useful for weighting. |
lang | string | Language code. Filter before tokenizing. |
This schema is deliberately minimal. Add columns as your task requires — label, category, sentiment_score — but don't add fields you don't have a use for. Wide schemas with mostly-empty columns cause problems in most ML data loaders.
The Twitter lead generation post covers a different angle on structured tweet collection — worth reading if your dataset needs to include intent signals alongside content.
FAQ
How many tweets do I need to train a sentiment classifier?
It depends on the number of classes and the model size. For fine-tuning a small transformer (BERT-base, DistilBERT), 10K–50K labeled examples is usually enough for stable performance with 2–3 classes. For more classes or noisier labels, aim for 50K+. For few-shot prompting on GPT-4o-class models, even 500–2K high-quality examples can be effective.
Should I include retweets in my dataset?
Almost never. Retweets duplicate text, inflate certain authors' representation, and don't add linguistic diversity. Add -is:retweet to every collection query and filter them out during normalization. The only exception is if you're explicitly studying retweet behavior itself.
How do I handle multilingual tweets?
Collect with lang:XX in the query for each language you want, keep them in separate files, and process them independently through language-specific cleaning pipelines. Mixing languages without language tags will produce a dataset that confuses most tokenizers unless you're building a multilingual model intentionally.
What's the best format for storing a Twitter ML dataset?
JSONL for raw storage (streamable, appendable, one record per line). CSV or Parquet for cleaned, flat datasets. Parquet is preferred if the dataset exceeds 100K rows — it compresses better and loads faster in pandas and PyTorch. For annotation workflows, CSV is simpler to work with in spreadsheet tools.
How do I deal with duplicate tweets across multiple collection runs?
Store all processed tweet_id values in a lookup file or SQLite table after each run. On subsequent runs, load that table and skip any ID already present. This is more reliable than in-memory deduplication, which resets every time the script restarts.
Is it legal to use Twitter data for ML training?
This is genuinely complicated. Public tweet content is publicly accessible, but X's Terms of Service have restrictions on how data can be used, stored, and redistributed. Using data to train a model for internal use is generally different from redistributing the dataset publicly. Review X's developer policy, your jurisdiction's applicable laws, and consult legal counsel before publishing or distributing any dataset. Don't just copy what other people are doing — their risk tolerance isn't yours.
How often should I refresh a dataset used for ongoing model training?
For production models monitoring evolving language (slang, emerging topics, brand sentiment), refresh at minimum monthly. For static research corpora, once is often enough. For trend detection models, data freshness matters a lot — consider a continuous collection pipeline that incrementally adds new rows rather than periodic full re-collection.

Written by
Thomas Shultz
Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.
Ready to get started?
Join thousands of developers using ScrapeBadger for their data needs.
