How to Build a Twitter Dataset for Machine Learning with ScrapeBadger

Most guides on building a Twitter dataset stop at "call the API and save some JSON." That's not a dataset. That's a pile of raw text with inconsistent fields, duplicate tweets, and no label structure. When you try to train something on it, you spend more time fixing the data than building the model.

This guide covers the full pipeline: what to collect, how to structure it, how to clean it, and what the resulting dataset should actually look like before you hand it to a training loop. Whether you're building a sentiment classifier, a topic model, or a fine-tuned LLM, the decisions you make during collection determine whether the dataset is usable or a debugging nightmare.

Why Twitter Data Is Valuable for ML

Twitter sits in a useful position for ML training data. Conversations are public, short-form, and topic-dense. The signal-to-noise ratio is manageable if you filter well. And the volume is high enough that you can build labeled corpora at scale without manual annotation for every row.

Specific use cases where Twitter data consistently performs well:

Sentiment analysis — product feedback, brand reaction, public opinion on events
Text classification — topic labeling, intent detection, spam filtering
Named entity recognition — extracting people, places, products from informal text
Language modeling — fine-tuning on informal, conversational English (or other languages)
Trend detection — training early-signal classifiers on time-series tweet volume

The problem isn't that Twitter data is hard to get. The problem is that most collection pipelines produce data that looks clean but isn't. Duplicate IDs, truncated text, missing timestamps, inconsistent schemas — any of these will cause quiet failures downstream.

Step 1: Define the Dataset Before You Collect It

The most common mistake is starting the scraper before deciding what the dataset should look like. Collection decisions are hard to undo once you've run a 10-hour job.

Answer these questions before writing any code:

What's the task? Sentiment classification needs labels (positive/negative/neutral). Topic modeling doesn't. A named entity dataset needs spans and entity types. Know what columns your training loop expects.

What's the collection window? A dataset for trend detection needs timestamps and sequential ordering. A general-purpose sentiment corpus doesn't care about time. Set your window and stick to it.

What's the keyword or account strategy? Keyword-based collection picks up broader conversation but requires more noise filtering. Account-based collection is more targeted but biases toward specific voices. Most real datasets mix both.

What's the minimum size? Fine-tuning a small model on tweet-length text usually requires at least 10K–50K examples for stable performance. Topic classifiers with 5+ categories need more. Set a floor before you start.

Treat your dataset schema as a contract. Define the columns first, then build collection logic around them. Changing the schema after collection means reprocessing everything.

Step 2: Design Your Collection Strategy

There are three practical ways to collect Twitter data for ML in 2026. Each has different trade-offs.

Strategy	Best For	Scale	Engineering Effort
Keyword search (recent tweets)	Topic-specific corpora, sentiment datasets	Medium	Low
Account timeline collection	Author-based datasets, domain-specific corpora	Medium	Low
Historical backfill	Time-series datasets, event-based corpora	High	Medium
Filter rules (real-time stream)	Continuous collection, trend datasets	High	Medium–High

For most ML datasets, keyword search is where you start. It's the fastest way to get topic-relevant volume without building streaming infrastructure.

The keyword strategy matters more than people realize. Broad keywords (e.g., "AI") produce enormous volume with low relevance. Narrow keywords (e.g., "GPT-4 hallucination bug") produce low volume with high relevance. For labeled datasets, narrow is better — you get cleaner signal per row. For unsupervised tasks, broad is fine if you filter aggressively downstream.

Practical decision rule: If you're building a classification dataset with defined categories, use one keyword cluster per category. If you're building a general corpus, use 3–5 broad keywords and filter by engagement floor.

Step 3: Set Up the Collection Pipeline

Here's the environment setup. Keep it isolated — dataset projects have specific dependency versions and you don't want cross-contamination.

mkdir twitter-ml-dataset
cd twitter-ml-dataset
python3 -m venv .venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows

pip install scrapebadger pandas
pip freeze > requirements.txt

Set your API key as an environment variable:

export SCRAPEBADGER_API_KEY="YOUR_API_KEY"

Project structure:

twitter-ml-dataset/
  collect.py
  clean.py
  output/
    raw/
    clean/

Keep raw and clean outputs separate. You'll want to reprocess raw data against updated cleaning logic without re-collecting.

Step 4: Collect the Raw Data

Start with the smallest version that proves the pipeline works. Don't run the full job until you've validated the response shape.

# collect.py
import asyncio
import json
import os
from scrapebadger import ScrapeBadger

async def collect_tweets(query: str, max_items: int, out_path: str):
    api_key = os.getenv("SCRAPEBADGER_API_KEY")
    if not api_key:
        raise RuntimeError("Missing SCRAPEBADGER_API_KEY")

    results = []

    async with ScrapeBadger(api_key=api_key) as client:
        stream = client.twitter.tweets.search_all(query, max_items=max_items)
        async for tweet in stream:
            if not isinstance(tweet, dict):
                tweet = getattr(tweet, "model_dump", lambda: dict(tweet))()
            results.append(tweet)

    # Write raw JSON — one object per line (JSONL format)
    with open(out_path, "w", encoding="utf-8") as f:
        for tweet in results:
            f.write(json.dumps(tweet, ensure_ascii=False) + "\n")

    print(f"Collected {len(results)} tweets → {out_path}")

if __name__ == "__main__":
    asyncio.run(collect_tweets(
        query="machine learning -is:retweet lang:en",
        max_items=5000,
        out_path="output/raw/ml_tweets.jsonl",
    ))

A few notes on why this is structured this way:

JSONL format (one JSON object per line) is standard for ML datasets. It's streamable, appendable, and most ML frameworks read it natively.
-is:retweet in the query excludes retweets. For ML datasets, retweets are almost always noise — they duplicate text and skew label distributions.
lang:en keeps the dataset monolingual. If you need multilingual data, collect separate files per language and label them.
Raw output stays untouched. The clean pipeline runs separately.

Step 5: Normalize and Clean the Data

Raw tweet payloads have inconsistent field presence, nested objects, and occasional null values that will crash training scripts without warning. The normalization step converts raw JSON into a flat, stable schema.

# clean.py
import json
import csv
import re
import os

# Final schema — this is the contract
COLUMNS = [
    "tweet_id",
    "created_at",
    "username",
    "text",
    "clean_text",
    "like_count",
    "retweet_count",
    "reply_count",
    "follower_count",
    "lang",
]

def clean_text(text: str) -> str:
    """
    Normalizes tweet text for ML consumption.
    Strips URLs, @mentions, extra whitespace.
    Preserves hashtags (often informative for topic models).
    """
    text = re.sub(r"http\S+", "", text)           # Remove URLs
    text = re.sub(r"@\w+", "", text)              # Remove @mentions
    text = re.sub(r"\s+", " ", text).strip()      # Normalize whitespace
    return text

def normalize(tweet: dict) -> dict | None:
    metrics = tweet.get("public_metrics") or {}
    user = tweet.get("user") or {}
    raw_text = str(tweet.get("text") or "")

    # Skip empty text
    if not raw_text.strip():
        return None

    return {
        "tweet_id":      str(tweet.get("id") or ""),
        "created_at":    str(tweet.get("created_at") or ""),
        "username":      str(user.get("username") or ""),
        "text":          raw_text,
        "clean_text":    clean_text(raw_text),
        "like_count":    int(metrics.get("like_count") or 0),
        "retweet_count": int(metrics.get("retweet_count") or 0),
        "reply_count":   int(metrics.get("reply_count") or 0),
        "follower_count": int(user.get("followers_count") or 0),
        "lang":          str(tweet.get("lang") or ""),
    }

def process(in_path: str, out_path: str, min_chars: int = 20):
    seen_ids = set()
    rows = []

    with open(in_path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue

            row = normalize(tweet)
            if row is None:
                continue

            # Deduplicate
            if row["tweet_id"] in seen_ids or not row["tweet_id"]:
                continue
            seen_ids.add(row["tweet_id"])

            # Minimum text length filter
            if len(row["clean_text"]) < min_chars:
                continue

            rows.append(row)

    # Write CSV
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)

    print(f"Cleaned: {len(rows)} rows → {out_path}")
    print(f"Dropped: duplicates + short text filtered out")

if __name__ == "__main__":
    process(
        in_path="output/raw/ml_tweets.jsonl",
        out_path="output/clean/ml_tweets.csv",
        min_chars=20,
    )

The clean_text field is the one you pass to tokenizers. The text field is the original — keep it for audit purposes and for re-cleaning if your preprocessing logic changes.

The min_chars=20 filter removes very short tweets that don't carry enough signal for classification. Adjust based on your task — for some NER or intent tasks, short tweets are fine.

Step 6: What a Good Dataset Looks Like

Before you hand this to any training pipeline, sanity check the output.

What to verify:

No duplicate tweet_id values
clean_text has no URLs or @mentions
Timestamps are present and parseable
like_count and retweet_count are integers, not strings
No rows where clean_text is empty or under your minimum length

Engagement-based filtering for quality:

For supervised tasks, high-engagement tweets are sometimes better training examples — they've been validated by actual humans as interesting or informative. A floor of like_count >= 2 reduces spam and bot-generated content meaningfully without cutting too much volume.

# Optional: filter by engagement
rows = [r for r in rows if r["like_count"] >= 2 or r["retweet_count"] >= 1]

Don't over-filter. A dataset with <5K rows from a narrow query will have distribution problems. Run the filter, check row count, and relax the threshold if you're losing too much volume.

Step 7: Adding Labels (for Supervised Tasks)

Raw collected data isn't labeled. For sentiment classification or topic models, you need to add a label column.

Three practical approaches, in order of cost:

Approach	Quality	Cost	Best For
Keyword-based proxy labels	Low–Medium	Free	Bootstrapping, weak supervision
Zero-shot LLM labeling (e.g., GPT-4o)	Medium–High	Low–Medium	General sentiment, topic labels
Manual annotation	High	High	Small high-quality gold sets

Keyword proxy labels are the fastest. If you collected tweets matching "love this product" and "this product is broken", you already have soft positive/negative signal. Use the query itself as a weak label.

LLM-based labeling is practical for most teams now. Run clean_text through a batch classification prompt, output a label per row. For sentiment, this reliably produces usable training data at scale with minimal cost.

Manual annotation remains the gold standard for evaluation sets, even if you use LLMs for bulk labeling. Label 200–500 rows manually and use that as your test set.

Scaling Up: Multi-Keyword Collection

For larger datasets, you'll want to collect across multiple keyword clusters and merge them:

QUERIES = [
    ("machine learning -is:retweet lang:en", "output/raw/ml.jsonl"),
    ("deep learning tutorial -is:retweet lang:en", "output/raw/dl.jsonl"),
    ("neural network training -is:retweet lang:en", "output/raw/nn.jsonl"),
]

async def collect_all():
    for query, path in QUERIES:
        await collect_tweets(query, max_items=3000, out_path=path)
        await asyncio.sleep(2)  # Small buffer between jobs

After collecting, run all files through the same process() function. The deduplication step handles any overlap between query results — you'll inevitably get some tweets that match multiple queries.

For a multi-category classifier, structure one query cluster per category. Keep them in separate files until after cleaning, then merge with a category column added during normalization.

Common Dataset Problems (and Fixes)

Dataset is dominated by one time window. If you run one big job, you'll over-represent whatever was trending that day. Collect in smaller batches across multiple days for better temporal diversity.

Bot-generated content skews the distribution. Filter by follower_count >= 10 and like_count >= 1. Bots typically have zero engagement and very low follower counts. Not perfect, but catches most of it.

Text after cleaning is too short to be useful. This often means your query is too narrow and returning very short replies. Widen the query or increase min_chars.

Labels are imbalanced. Most keyword-based collections end up skewed toward neutral or slightly-positive sentiment. If you're building a classifier, explicitly over-collect for underrepresented categories or apply class weights during training.

If you're planning to collect at scale or want a full real-time data feed for ongoing dataset expansion, the real-time Twitter monitoring pipeline guide covers streaming infrastructure in detail.

Dataset Format Reference

The final output schema, with field descriptions:

Field	Type	Notes
tweet_id	string	Unique key. Primary dedup identifier.
created_at	string	ISO timestamp. Required for time-series tasks.
username	string	Author handle. Useful for author-level analysis.
text	string	Original tweet text. Keep for re-cleaning.
clean_text	string	Preprocessed text. Feed this to tokenizers.
like_count	integer	Engagement signal. Use for quality filtering.
retweet_count	integer	Amplification signal.
reply_count	integer	Conversation signal. High reply count = contested.
follower_count	integer	Author reach. Useful for weighting.
lang	string	Language code. Filter before tokenizing.

This schema is deliberately minimal. Add columns as your task requires — label, category, sentiment_score — but don't add fields you don't have a use for. Wide schemas with mostly-empty columns cause problems in most ML data loaders.

The Twitter lead generation post covers a different angle on structured tweet collection — worth reading if your dataset needs to include intent signals alongside content.

FAQ

How many tweets do I need to train a sentiment classifier?

It depends on the number of classes and the model size. For fine-tuning a small transformer (BERT-base, DistilBERT), 10K–50K labeled examples is usually enough for stable performance with 2–3 classes. For more classes or noisier labels, aim for 50K+. For few-shot prompting on GPT-4o-class models, even 500–2K high-quality examples can be effective.

Should I include retweets in my dataset?

Almost never. Retweets duplicate text, inflate certain authors' representation, and don't add linguistic diversity. Add -is:retweet to every collection query and filter them out during normalization. The only exception is if you're explicitly studying retweet behavior itself.

How do I handle multilingual tweets?

Collect with lang:XX in the query for each language you want, keep them in separate files, and process them independently through language-specific cleaning pipelines. Mixing languages without language tags will produce a dataset that confuses most tokenizers unless you're building a multilingual model intentionally.

What's the best format for storing a Twitter ML dataset?

JSONL for raw storage (streamable, appendable, one record per line). CSV or Parquet for cleaned, flat datasets. Parquet is preferred if the dataset exceeds 100K rows — it compresses better and loads faster in pandas and PyTorch. For annotation workflows, CSV is simpler to work with in spreadsheet tools.

How do I deal with duplicate tweets across multiple collection runs?

Store all processed tweet_id values in a lookup file or SQLite table after each run. On subsequent runs, load that table and skip any ID already present. This is more reliable than in-memory deduplication, which resets every time the script restarts.

Is it legal to use Twitter data for ML training?

This is genuinely complicated. Public tweet content is publicly accessible, but X's Terms of Service have restrictions on how data can be used, stored, and redistributed. Using data to train a model for internal use is generally different from redistributing the dataset publicly. Review X's developer policy, your jurisdiction's applicable laws, and consult legal counsel before publishing or distributing any dataset. Don't just copy what other people are doing — their risk tolerance isn't yours.

How often should I refresh a dataset used for ongoing model training?

For production models monitoring evolving language (slang, emerging topics, brand sentiment), refresh at minimum monthly. For static research corpora, once is often enough. For trend detection models, data freshness matters a lot — consider a continuous collection pipeline that incrementally adds new rows rather than periodic full re-collection.

How to Build a Twitter Dataset for Machine Learning