
How to Scrape Twitter User Timelines Automatically

Thomas Shultz

Most tutorials about scraping Twitter focus on keyword search — collect tweets matching a query, export to CSV, done. That's useful, but it misses a whole category of data: the full posting history of specific accounts.

Scraping user timelines is a different problem. You're not hunting for mentions of a topic. You're tracking what a particular person or organization has said over time. That means dealing with pagination over potentially thousands of tweets, handling sparse periods and burst periods, and keeping your dataset current as new posts arrive.

This guide walks through how to do it correctly — starting from a working minimal script, building up to a scheduled pipeline that runs unattended.

Why User Timeline Scraping Is Useful

Before the implementation details, it's worth being clear about what this is actually good for.

Tracking a competitor's product account lets you see what they announce, how often they post, which posts get traction, and what topics they've stopped talking about. That's a meaningful dataset for any startup that wants to understand where a market is moving.

Following key voices in a domain — analysts, researchers, practitioners — gives you a structured feed of signal without the noise of platform timelines. You can process it programmatically: filter by engagement, cluster by topic, or export to a doc for team review.

Building training data for a model that understands how a particular account or persona communicates requires a clean, timestamped history of their posts. You can't get that from keyword search.

And if you're running lead generation through Twitter monitoring, timelines fill in context keyword search misses — you see the full picture of who someone is and what they care about before you decide whether to engage.

The Core Challenge: Pagination

The thing that makes timeline scraping harder than it looks is pagination. A user with 5,000 tweets isn't going to give you all 5,000 in one response. You'll get a page of results and a cursor, and you need to follow that cursor page by page until you've collected what you need.

The problems that show up in practice:

  • Gaps when cursor logic breaks or you hit a timeout mid-run

  • Duplicates when you restart a job and overlap with the previous run's final page

  • Inconsistent response shapes — some fields appear on some tweets and not others

  • Silent failures where pagination stops early and the script reports success

None of these are fatal problems. But they require explicit handling. A script that just calls an endpoint and collects whatever it gets until it stops will produce a dataset you can't trust.
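
The handling these problems call for can be sketched with a generic cursor loop. This is a minimal illustration, not ScrapeBadger's API: `fetch_page` is a hypothetical callable that takes a cursor and returns a page of tweet dicts plus the next cursor, and the fake three-page timeline exists only for the demonstration.

```python
from typing import Callable, Optional

# Hypothetical page fetcher: takes a cursor (None for the first page) and
# returns (list of tweet dicts, next cursor or None when exhausted).
FetchPage = Callable[[Optional[str]], tuple[list[dict], Optional[str]]]

def collect_all(fetch_page: FetchPage, limit: int, max_pages: int = 1000) -> list[dict]:
    seen_ids: set[str] = set()
    collected: list[dict] = []
    cursor: Optional[str] = None

    for _ in range(max_pages):
        items, next_cursor = fetch_page(cursor)
        for item in items:
            tweet_id = str(item.get("id") or "")
            if not tweet_id or tweet_id in seen_ids:
                continue  # skip malformed rows and overlap between pages
            seen_ids.add(tweet_id)
            collected.append(item)
            if len(collected) >= limit:
                return collected
        # Stop when there is no next cursor, or the cursor did not advance
        if not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor
    return collected


# Fake three-page timeline with one overlapping tweet, for demonstration only.
_PAGES = {
    None: ([{"id": "3"}, {"id": "2"}], "c1"),
    "c1": ([{"id": "2"}, {"id": "1"}], "c2"),
    "c2": ([], None),
}

def fake_fetch(cursor):
    return _PAGES[cursor]

tweets = collect_all(fake_fetch, limit=10)
print([t["id"] for t in tweets])
```

The overlapping tweet on the second page is dropped by the ID check, and the `max_pages` cap keeps a misbehaving cursor from looping forever. The SDK used later in this guide hides this loop behind an async stream, but the same safeguards still matter.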

What You'll Build

A Python script that:

  • Fetches all tweets from a specified account up to a configurable limit

  • Handles pagination internally

  • Normalizes each tweet into a consistent schema

  • Deduplicates by tweet ID

  • Exports to CSV with atomic writes (no half-written files)

  • Can be scheduled to run incrementally

Step 1: Set Up the Environment

mkdir twitter-timeline-scraper
cd twitter-timeline-scraper
python3 -m venv .venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows

Install dependencies:

pip install scrapebadger
pip freeze > requirements.txt

Set your API key as an environment variable:

export SCRAPEBADGER_API_KEY="YOUR_API_KEY"

Create the output directory:

mkdir -p output

Project structure:

twitter-timeline-scraper/
  scrape_timeline.py
  output/

Step 2: Fetch a Timeline (Minimal Working Version)

Start small. Prove the request works and inspect the response shape before building the full pipeline.

import asyncio
import os
from scrapebadger import ScrapeBadger

async def fetch_timeline(username: str, limit: int = 20):
    api_key = os.getenv("SCRAPEBADGER_API_KEY")
    if not api_key:
        raise RuntimeError("Missing SCRAPEBADGER_API_KEY environment variable")

    async with ScrapeBadger(api_key=api_key) as client:
        stream = client.twitter.users.latest_tweets(username, max_items=limit)
        async for tweet in stream:
            print({
                "id": tweet.get("id"),
                "created_at": tweet.get("created_at"),
                "text": (tweet.get("text") or "")[:100],
                "likes": (tweet.get("public_metrics") or {}).get("like_count"),
            })

if __name__ == "__main__":
    asyncio.run(fetch_timeline("openai", limit=20))

Run it:

python scrape_timeline.py

What you're checking:

  • Does id return a stable string you can use as a unique key?

  • Are timestamps consistent and parseable?

  • Is the public_metrics block present? What fields does it contain?

  • Does the text ever look truncated?

This informs the normalization schema you'll build in the next step.
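
One way to answer these questions systematically is to count how often each field actually appears across a sample. The helper below is a rough sketch that assumes each tweet has already been converted to a plain dict. `field_coverage` is a name introduced here for illustration, and the sample payloads are made up.

```python
from collections import Counter

def field_coverage(tweets: list[dict]) -> dict[str, float]:
    """Fraction of sampled tweets in which each top-level key is present and not None."""
    counts: Counter = Counter()
    for tweet in tweets:
        for key, value in tweet.items():
            if value is not None:
                counts[key] += 1
    total = len(tweets) or 1
    return {key: counts[key] / total for key in sorted(counts)}

# Made-up sample: the second tweet has a null public_metrics block
sample = [
    {"id": "1", "created_at": "2024-01-01T00:00:00Z", "text": "hi",
     "public_metrics": {"like_count": 3}},
    {"id": "2", "created_at": "2024-01-02T00:00:00Z", "text": "yo",
     "public_metrics": None},
]
coverage = field_coverage(sample)
print(coverage)
```

Any field with coverage below 1.0 needs a safe default in the normalizer.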

Step 3: Normalize and Export to CSV

Raw tweet payloads are inconsistent. Fields appear and disappear depending on the tweet type, account settings, and API response shape. The normalization step converts that chaos into a predictable schema.

Define the schema first:

CSV_COLUMNS = [
    "tweet_id",
    "created_at",
    "text",
    "like_count",
    "retweet_count",
    "reply_count",
    "quote_count",
]

Write the normalizer with safe defaults — never assume a field is present:

def normalize(tweet: dict) -> dict:
    metrics = tweet.get("public_metrics") or {}
    return {
        "tweet_id":     str(tweet.get("id") or ""),
        "created_at":   str(tweet.get("created_at") or ""),
        "text":         str(tweet.get("text") or ""),
        "like_count":   int(metrics.get("like_count") or 0),
        "retweet_count": int(metrics.get("retweet_count") or 0),
        "reply_count":  int(metrics.get("reply_count") or 0),
        "quote_count":  int(metrics.get("quote_count") or 0),
    }

The or "" and or 0 patterns matter. If a field is present but None, str(None) would write the literal string "None" into the CSV and int(None) would raise a TypeError. If the field is missing entirely, .get() returns None and the same default applies. Either way, the row is safe to write.
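
To see exactly what the defaults guard against, here is a small standalone check on a made-up tweet payload. It uses only built-in behavior, with no API calls.

```python
tweet = {"id": 123, "text": None, "public_metrics": {"like_count": None}}
metrics = tweet.get("public_metrics") or {}

# Without the default: str() does not crash, it silently writes "None"
unsafe_text = str(tweet.get("text"))

# Without the default: int(None) raises TypeError
try:
    int(metrics.get("like_count"))
    int_failed = False
except TypeError:
    int_failed = True

# With the defaults: None and missing keys both collapse to an empty value
safe_text = str(tweet.get("text") or "")
safe_likes = int(metrics.get("like_count") or 0)    # present but None
safe_quotes = int(metrics.get("quote_count") or 0)  # missing entirely

print(repr(unsafe_text), int_failed, repr(safe_text), safe_likes, safe_quotes)
```

One tradeoff: a real zero and a missing value both end up as 0, so the CSV alone cannot tell you which one it was.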

Step 4: The Full Export Script

This version handles the complete pipeline: fetch → paginate → normalize → deduplicate → write atomically.

import asyncio
import csv
import os
import time
from scrapebadger import ScrapeBadger

CSV_COLUMNS = [
    "tweet_id",
    "created_at",
    "text",
    "like_count",
    "retweet_count",
    "reply_count",
    "quote_count",
]

def normalize(tweet: dict) -> dict:
    metrics = tweet.get("public_metrics") or {}
    return {
        "tweet_id":      str(tweet.get("id") or ""),
        "created_at":    str(tweet.get("created_at") or ""),
        "text":          str(tweet.get("text") or ""),
        "like_count":    int(metrics.get("like_count") or 0),
        "retweet_count": int(metrics.get("retweet_count") or 0),
        "reply_count":   int(metrics.get("reply_count") or 0),
        "quote_count":   int(metrics.get("quote_count") or 0),
    }

async def export_timeline_to_csv(
    username: str,
    max_items: int,
    out_path: str,
    hard_timeout_seconds: int = 900,
):
    api_key = os.getenv("SCRAPEBADGER_API_KEY")
    if not api_key:
        raise RuntimeError("Missing SCRAPEBADGER_API_KEY environment variable")

    started = time.time()
    seen_ids: set[str] = set()

    async with ScrapeBadger(api_key=api_key) as client:
        stream = client.twitter.users.latest_tweets(username, max_items=max_items)

        tmp_path = out_path + ".tmp"
        with open(tmp_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS)
            writer.writeheader()

            async for tweet in stream:
                if time.time() - started > hard_timeout_seconds:
                    print(f"Hard timeout reached after {hard_timeout_seconds}s")
                    break

                if not isinstance(tweet, dict):
                    tweet = getattr(tweet, "model_dump", lambda: dict(tweet))()

                row = normalize(tweet)

                if not row["tweet_id"]:
                    continue
                if row["tweet_id"] in seen_ids:
                    continue

                seen_ids.add(row["tweet_id"])
                writer.writerow(row)

    # Atomic replace — the file either exists and is complete, or doesn't exist
    os.replace(tmp_path, out_path)
    print(f"Exported {len(seen_ids)} tweets to {out_path}")

if __name__ == "__main__":
    asyncio.run(export_timeline_to_csv(
        username="openai",
        max_items=500,
        out_path="output/openai_timeline.csv",
        hard_timeout_seconds=900,
    ))

Run it:

python scrape_timeline.py

The .tmp + os.replace() pattern is worth noting. If the script crashes mid-run, you end up with a .tmp file, not a corrupt CSV. The final file is always either complete or absent, never half-written.

Step 5: Incremental Runs (Collecting Only New Tweets)

A full re-fetch every time is wasteful and unnecessary once you have a baseline. The better pattern is to track the highest tweet ID you've seen and only fetch newer posts.

SQLite makes this easy — it's a single file, no server needed, and Python's standard library supports it directly.

import sqlite3
from pathlib import Path

DB_FILE = Path("output/timeline.db")

def setup_db():
    con = sqlite3.connect(DB_FILE)
    cur = con.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            tweet_id     TEXT PRIMARY KEY,
            username     TEXT NOT NULL,
            created_at   TEXT,
            text         TEXT,
            like_count   INTEGER DEFAULT 0,
            retweet_count INTEGER DEFAULT 0,
            reply_count  INTEGER DEFAULT 0,
            quote_count  INTEGER DEFAULT 0
        )
    """)
    con.commit()
    con.close()

def save_tweets(rows: list[dict], username: str) -> int:
    """Returns count of newly inserted rows."""
    con = sqlite3.connect(DB_FILE)
    cur = con.cursor()
    new_count = 0

    for row in rows:
        try:
            cur.execute("""
                INSERT INTO tweets
                    (tweet_id, username, created_at, text, like_count,
                     retweet_count, reply_count, quote_count)
                VALUES
                    (:tweet_id, :username, :created_at, :text, :like_count,
                     :retweet_count, :reply_count, :quote_count)
            """, {**row, "username": username})
            new_count += 1
        except sqlite3.IntegrityError:
            pass  # Already stored — expected on incremental runs

    con.commit()
    con.close()
    return new_count

On each run, IntegrityError is the normal success path for tweets you've already seen. The tweet_id primary key handles deduplication automatically.
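
The primary key removes duplicates after the fact, but it does not stop the job from fetching tweets it already has. A possible sketch of the "highest tweet ID" idea follows. It assumes tweet IDs are numeric and increase over time, and that the stream arrives newest-first. Both are assumptions to verify against your own data. `highest_stored_id` and `take_until_known` are illustrative names, not part of any SDK.

```python
import sqlite3
from typing import Optional

def highest_stored_id(con: sqlite3.Connection, username: str) -> Optional[int]:
    """Largest tweet ID stored for this account, or None if nothing is stored."""
    # tweet_id is TEXT in the schema, so cast before taking MAX to avoid
    # lexicographic comparison ("9" > "10" as strings).
    row = con.execute(
        "SELECT MAX(CAST(tweet_id AS INTEGER)) FROM tweets WHERE username = ?",
        (username,),
    ).fetchone()
    return row[0]

def take_until_known(rows, since_id: Optional[int]):
    """Yield rows from a newest-first stream until reaching an already-stored ID."""
    for row in rows:
        if since_id is not None and int(row["tweet_id"]) <= since_id:
            break
        yield row

# In-memory demo with a cut-down version of the tweets table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tweets (tweet_id TEXT PRIMARY KEY, username TEXT NOT NULL)")
con.executemany("INSERT INTO tweets VALUES (?, ?)", [("9", "openai"), ("10", "openai")])

since_id = highest_stored_id(con, "openai")
incoming = [{"tweet_id": "12"}, {"tweet_id": "11"}, {"tweet_id": "10"}, {"tweet_id": "9"}]
new_rows = list(take_until_known(incoming, since_id))
print(since_id, [r["tweet_id"] for r in new_rows])
```

If the stream can return items out of order (a pinned tweet at the top, for example), breaking on the first known ID would stop too early. In that case, skip known IDs instead of breaking and rely on max_items to bound the run.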

Step 6: Schedule the Job with Cron

Once the script works, scheduling is straightforward. On Linux/macOS, use cron:

crontab -e

Add a daily run at 8am for a single account:

# Fetch latest tweets from @openai daily at 8am
0 8 * * * /path/to/.venv/bin/python /path/to/scrape_timeline.py >> /path/to/timeline.log 2>&1

For multiple accounts, the cleanest pattern is to pass the username as a command-line argument:

import sys

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python scrape_timeline.py <username> [max_items]")
        sys.exit(1)
    username = sys.argv[1]
    max_items = int(sys.argv[2]) if len(sys.argv) > 2 else 200
    asyncio.run(export_timeline_to_csv(username, max_items, f"output/{username}.csv"))

Then in cron:

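0 8 * * * /path/to/.venv/bin/python /path/to/scrape_timeline.py openai 200 >> /path/to/timeline.log 2>&1
0 8 * * * /path/to/.venv/bin/python /path/to/scrape_timeline.py anthropic 200 >> /path/to/timeline.log 2>&1
0 8 * * * /path/to/.venv/bin/python /path/to/scrape_timeline.py googledeepmind 200 >> /path/to/timeline.log 2>&1

Each account runs as a separate job. If one fails, the others aren't affected. Use absolute paths for the log file, because cron does not run from your project directory. Cron also does not load your shell profile, so SCRAPEBADGER_API_KEY must be defined in the crontab itself or in a wrapper script.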

Common Failure Modes

These are the problems that actually show up when you run this in production.

Empty output — CSV has only headers

Check first: is the username correct? Twitter handles are case-insensitive but typos matter. Second: does the account exist and is it public? Private accounts return no data. Third: is authentication working — run the minimal script to confirm you're getting a response at all.

Duplicate tweet IDs across runs

Cause: overlapping pagination on restart. Fix: the seen_ids set handles this within a single run, and the SQLite PRIMARY KEY constraint handles it across runs. On a second run over the same range, every tweet from the first run raises IntegrityError inside save_tweets, which catches it silently, so the function returns 0 new rows. That's correct behavior.

Truncated text

Some tweet objects return shortened text. If you're seeing an ellipsis (…) at the end of tweets, check whether you need to request the full text explicitly via query parameters; the endpoint may need a tweet.fields=text parameter or similar, depending on the API version.

Job completes too fast with suspiciously few results

This usually means pagination stopped early — either the account genuinely has fewer tweets than you expected, or the stream ended unexpectedly. Add logging around the row count and compare against the account's visible tweet count as a sanity check.
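
A sanity check along these lines might look like the sketch below. `visible_count` is a number you supply yourself (for example, read off the account's profile), and the 0.8 tolerance is an arbitrary starting point rather than a recommended value.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("timeline")

def run_looks_complete(collected: int, requested: int, visible_count: int,
                       tolerance: float = 0.8) -> bool:
    """Return False when a run collected far fewer tweets than it plausibly should have."""
    # The most we could expect is whichever is smaller: what we asked for,
    # or what the account appears to have.
    target = min(requested, visible_count)
    complete = collected >= tolerance * target
    level = logging.INFO if complete else logging.WARNING
    logger.log(level, "collected=%d requested=%d visible=%d",
               collected, requested, visible_count)
    return complete

# 37 tweets from an account with thousands of posts is worth a warning
print(run_looks_complete(collected=37, requested=200, visible_count=5000))
```

Visible tweet counts usually include retweets and replies, so treat a failed check as a prompt to investigate rather than proof of a bug.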

Choosing How Many Tweets to Collect

The right max_items value depends on what you're doing with the data.

| Use Case | Recommended max_items | Rationale |
|---|---|---|
| Initial baseline for a competitor account | 1,000–3,000 | Get enough history to identify posting patterns |
| Daily incremental update | 50–200 | Covers a day's activity with room for burst periods |
| Training data collection | 5,000+ | More data improves model quality |
| Trend analysis over a specific period | 500–1,000 | Cap by count, then filter by date range |
| Quick account audit | 100–200 | Fast enough to run interactively |

Start with a smaller number and confirm output quality before scaling up. A run that fetches 200 tweets and validates well is more useful than a run that fetches 5,000 and silently drops 40% of them.

What Good Output Looks Like

Before you trust the data downstream, check:

  • Headers are present and consistent on every run

  • No duplicate tweet_id values (run sort -u on the column if checking manually)

  • created_at timestamps are present and parseable — not empty strings or None

  • text fields aren't uniformly truncated

  • Engagement counts are numeric, not empty or None

If all of these pass, the dataset is safe to hand to analysis, a model, or a downstream automation.
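
These checks are easy to automate. The sketch below reads rows the way csv.DictReader would return them from the exported file and reports any problems. `validate_rows` is an illustrative helper, and the inline CSV is deliberately broken sample data. It checks that created_at is present but does not parse it, since the timestamp format depends on the API.

```python
import csv
import io

COUNT_COLUMNS = ["like_count", "retweet_count", "reply_count", "quote_count"]

def validate_rows(rows: list[dict]) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems: list[str] = []

    ids = [row.get("tweet_id") or "" for row in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate tweet_id values")

    stamps = [(row.get("created_at") or "").strip() for row in rows]
    if any(stamp in ("", "None") for stamp in stamps):
        problems.append("missing created_at")

    for column in COUNT_COLUMNS:
        if any(not (row.get(column) or "").isdigit() for row in rows):
            problems.append(f"non-numeric {column}")

    texts = [row.get("text") or "" for row in rows]
    if texts and all(text.endswith("…") for text in texts):
        problems.append("every text ends with an ellipsis (possible truncation)")

    return problems

# Deliberately broken sample: duplicate ID, empty timestamp, "None" as a count
sample_csv = """tweet_id,created_at,text,like_count,retweet_count,reply_count,quote_count
1,2024-01-01T00:00:00Z,hello,3,0,1,0
1,,world,None,0,0,0
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = validate_rows(rows)
print(problems)
```

To run it against a real export, open the file with newline="" and encoding="utf-8" and pass the DictReader rows to validate_rows.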

Production Considerations

A few things worth thinking about before you schedule this to run unattended for weeks.

Alert on zero output. If a job runs and returns zero tweets for an account that normally returns 100+, something is wrong. Instrument your jobs to log row counts per run and alert if a run produces nothing.

Watch for schema changes. If the public_metrics block changes shape or a field gets renamed, your normalizer will silently write zeros instead of real values. Periodic manual spot-checks catch this before it poisons weeks of data.
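
Part of that spot-check can be automated by measuring how many raw tweets in a batch lack the metrics the normalizer expects. This is a sketch under the assumption that raw tweets are plain dicts. The "drifted" payload shape is invented purely to show what a rename would look like.

```python
def metrics_missing_ratio(raw_tweets: list[dict], field: str = "like_count") -> float:
    """Fraction of raw tweets whose public_metrics block lacks the given field."""
    if not raw_tweets:
        return 0.0
    missing = 0
    for tweet in raw_tweets:
        metrics = tweet.get("public_metrics")
        if not isinstance(metrics, dict) or metrics.get(field) is None:
            missing += 1
    return missing / len(raw_tweets)

healthy = [{"id": "1", "public_metrics": {"like_count": 4}},
           {"id": "2", "public_metrics": {"like_count": 0}}]
# Hypothetical drift: the provider renames the block and its fields
drifted = [{"id": "3", "metrics": {"likes": 4}},
           {"id": "4", "metrics": {"likes": 9}}]

print(metrics_missing_ratio(healthy), metrics_missing_ratio(drifted))
```

A ratio that jumps from near zero to near one between runs is a strong hint that the response shape changed, not that engagement collapsed.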

Rate your requests sensibly. If you're collecting timelines for 20 accounts per day, that's fine. If you're collecting 200 accounts every hour, think about whether that's actually necessary or whether you're generating noise you'll never look at.

Store raw responses. Keep the original tweet objects in a separate table alongside the normalized rows. When you need to extract a field you didn't think to include originally, you want to be able to reprocess history without re-fetching.
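
A minimal version of that raw table could look like this. The table name `raw_tweets` and its columns are choices made for this sketch, and it assumes each tweet is a JSON-serializable dict (SDK objects may need converting first).

```python
import json
import sqlite3

def setup_raw_table(con: sqlite3.Connection) -> None:
    con.execute("""
        CREATE TABLE IF NOT EXISTS raw_tweets (
            tweet_id   TEXT PRIMARY KEY,
            fetched_at TEXT NOT NULL,
            payload    TEXT NOT NULL
        )
    """)

def save_raw(con: sqlite3.Connection, tweet: dict, fetched_at: str) -> None:
    """Store the untouched payload as JSON; keep the first copy of each tweet."""
    tweet_id = str(tweet.get("id") or "")
    if not tweet_id:
        return
    con.execute(
        "INSERT OR IGNORE INTO raw_tweets (tweet_id, fetched_at, payload) VALUES (?, ?, ?)",
        (tweet_id, fetched_at, json.dumps(tweet, ensure_ascii=False)),
    )

con = sqlite3.connect(":memory:")
setup_raw_table(con)
tweet = {"id": "42", "text": "hello", "lang": "en"}
save_raw(con, tweet, "2024-01-01T08:00:00Z")
save_raw(con, tweet, "2024-01-02T08:00:00Z")  # same ID, ignored

# Later: recover a field the normalized schema never included
payload = con.execute(
    "SELECT payload FROM raw_tweets WHERE tweet_id = ?", ("42",)
).fetchone()[0]
recovered_lang = json.loads(payload)["lang"]
row_count = con.execute("SELECT COUNT(*) FROM raw_tweets").fetchone()[0]
print(row_count, recovered_lang)
```

Keeping the first copy means engagement counts in the raw table reflect the time of first fetch. If you want the latest snapshot instead, swap INSERT OR IGNORE for INSERT OR REPLACE.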

For a more in-depth look at real-time versus scheduled pipeline architectures, the guide on building a real-time Twitter monitoring pipeline covers the tradeoffs in detail.

Practical Applications

Tracking what a product account posts over time. Build a weekly digest of what specific accounts in your market have shipped, announced, or complained about. This is far more useful than a one-time search.

Building a training corpus. If you're fine-tuning a model on a specific domain or communication style, a cleaned timeline export gives you a timestamped, structured dataset. The normalization step matters more here — garbage in, garbage out.

Analyzing posting patterns. Frequency histograms, engagement-per-post trends, and topic clustering all become possible once you have a consistent schema across multiple accounts over time. You can answer questions like "does posting frequency correlate with follower growth?" or "which content formats drive the most replies?"

Content gap analysis. Look at what topics a competitor stopped talking about. That's often more informative than what they're currently posting.

ScrapeBadger supports user timeline collection alongside keyword search, follower data, and engagement metrics — all through the same API and SDK, so you can mix and match without managing multiple integrations.


FAQ

What's the difference between user timeline scraping and keyword search?

Keyword search finds all tweets matching a query, regardless of who posted them. Timeline scraping fetches all posts from a specific account. They answer different questions. Use keyword search when you want to know what the internet is saying about a topic. Use timeline scraping when you want to know what a specific account has said over time.

How far back can I go with user timeline data?

This depends on the API provider and the account. Some providers limit retrieval to the most recent 3,200 tweets (which matches the historical behavior of Twitter's own API). Accounts that post frequently may have a shorter effective history than accounts that post rarely. Set your max_items conservatively and check the earliest created_at timestamp in your output to see how far back you actually got.

Can I scrape private Twitter accounts?

No. Private accounts only show their posts to approved followers. Any scraping API that returns data from private accounts is doing something that will eventually break and may create legal exposure. Stick to public accounts.

How do I avoid collecting retweets in my timeline data?

Add a filter in your normalization step: check whether tweet.get("text", "").startswith("RT @") and skip those rows if you don't want retweets. A cleaner approach is to filter on a referenced_tweets field if the API returns it — that gives you explicit control over original posts versus retweets versus quote tweets.
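
Both approaches can be combined in one helper. The `referenced_tweets` shape used here (a list of dicts whose `type` is "retweeted" for retweets) follows the Twitter API v2 convention. Whether your provider returns the field in that form is an assumption to check against a real response.

```python
def is_retweet(tweet: dict) -> bool:
    """Prefer the structured field when present; fall back to the text prefix."""
    refs = tweet.get("referenced_tweets")
    if refs:
        return any((ref or {}).get("type") == "retweeted" for ref in refs)
    return (tweet.get("text") or "").startswith("RT @")

samples = [
    {"id": "1", "text": "Shipping a new release today"},
    {"id": "2", "text": "RT @someone: big news"},
    {"id": "3", "text": "Worth a read", "referenced_tweets": [{"type": "quoted", "id": "9"}]},
    {"id": "4", "text": "RT @x: hi", "referenced_tweets": [{"type": "retweeted", "id": "8"}]},
]
originals = [t["id"] for t in samples if not is_retweet(t)]
print(originals)
```

Quote tweets are kept here because they carry the account's own commentary. Adjust the type check if you want them excluded as well.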

What's the right polling frequency for incremental timeline updates?

It depends on how active the account is and how fresh you need the data. For most accounts, once per day is sufficient. For very active accounts during a product launch or live event, you might run every hour. Running more frequently than the account actually posts wastes credits and adds noise to your logs. A reasonable rule: if the last three runs all returned fewer than 5 new tweets, reduce the frequency.
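
That rule of thumb is simple enough to encode. In this sketch, `recent_new_counts` is assumed to be the list of new-tweet counts from past runs, oldest first, such as the values returned by save_tweets.

```python
def should_reduce_frequency(recent_new_counts: list[int],
                            threshold: int = 5, window: int = 3) -> bool:
    """True when each of the last `window` runs found fewer than `threshold` new tweets."""
    if len(recent_new_counts) < window:
        return False  # not enough history to decide
    return all(count < threshold for count in recent_new_counts[-window:])

print(should_reduce_frequency([12, 3, 0, 4]))  # last three runs: 3, 0, 4
print(should_reduce_frequency([3, 0, 9]))      # the latest run was busy
```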

How do I handle accounts that change their username?

Twitter account IDs are stable even when usernames change. If you're storing data long-term, consider storing the user ID alongside the username in your database. If a lookup by username fails, the account may have been suspended, deleted, or renamed — each case needs a different response.

Is scraping user timelines legal?

The short answer: it depends on your jurisdiction, how you use the data, and who you're targeting. Public account data is generally accessible, but platform terms of service vary and national data protection laws (GDPR, CCPA, etc.) impose constraints on storage and processing. Always review the relevant ToS and applicable law for your specific situation before collecting at scale or sharing the data externally.


Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.
