Scrape Reddit Comments and Reconstruct Thread Trees | ScrapeBadger

Comments are where Reddit's real intelligence lives. The original post states a problem or asks a question. The comment thread is where the community diagnoses it, disagrees with each other, offers solutions, validates those solutions through upvotes, and builds a conversation structure that is richer than any individual contribution within it.

Most Reddit scraping tutorials collect posts. They either skip comments entirely or return them as a flat list that discards the threading structure — the parent-child relationships that show which comment is a reply to which, what the community thinks of each response relative to others, and how the conversation developed over time. A flat list of 200 comments from a 200-comment thread is not the same thing as the conversation. It is the conversation's raw materials with the structure removed.

The threading structure matters for three specific use cases where Reddit data has commercial value.

For brand intelligence, a complaint thread is not just a complaint — it is a conversation where the community diagnosed the problem, proposed solutions, and validated the most plausible explanation through upvotes. The highest-upvoted comment in a negative brand thread is the community's consensus view of what went wrong. A flat comment list cannot distinguish this consensus from the long tail of low-engagement responses.

For AI training data, conversational thread structure provides the dialogue patterns that make fine-tuned models genuinely conversational rather than question-answering machines. A nested comment tree shows how a community builds on previous points, challenges assertions, and reaches emergent consensus — training signals that a flat corpus cannot provide.

For product research, a thread about a product failure or a comparison request contains a hierarchical conversation where different users advocate for different positions and the community evaluates those positions through voting. Reconstructing this hierarchy shows which arguments won, which lost, and what the underlying debate was actually about.

This guide builds a complete Reddit comment scraping and thread reconstruction pipeline using ScrapeBadger's Reddit Scraper, from individual comment collection through full nested tree reconstruction and structured export.

Understanding Reddit's Comment Data Model

Before building the pipeline, understand what Reddit's comment structure actually contains. Every comment in a Reddit thread has:

id: unique comment identifier
parent_id: the ID of the comment or post this is a reply to. Prefixed with t1_ for comment parents, t3_ for post parents (top-level comments)
body: the comment text
author: username of the commenter
score: upvote score (net upvotes minus downvotes)
created_utc: creation timestamp
depth: nesting depth in the tree (0 = top-level, 1 = reply to top-level, etc.)
distinguished: whether the commenter is a moderator or special account
is_submitter: whether the commenter is the original poster (OP)

The parent_id field is the key to tree reconstruction. Every comment knows its parent. From a flat list of comments, you can reconstruct the full tree by following parent-child relationships recursively.

The complication: Reddit truncates deep threads. A post with 500 comments does not return all 500 comments in a single response. Deep threads are replaced with "MoreComments" objects — placeholders that require additional API calls to expand. Most tools either return an incomplete thread silently or fail to handle these stubs at all.

ScrapeBadger's Reddit comment endpoint handles the expansion of MoreComments stubs automatically, returning the complete comment tree structure for a post rather than the truncated version.

Step 1: Fetching Comments for a Post

python

# reddit_comment_collector.py
import httpx
import asyncio
import os
from typing import Optional
from datetime import datetime

API_KEY = os.environ["SCRAPEBADGER_API_KEY"]
BASE_URL = "https://api.scrapebadger.com/v1"
HEADERS = {"X-API-Key": API_KEY}


async def fetch_post_comments(
    client: httpx.AsyncClient,
    post_id: str,
    subreddit: str = None,
    sort: str = "top",          # top, best, new, controversial, old, qa
    limit: int = 500,
    depth: int = None,          # None = full depth
) -> dict:
    """
    Fetch all comments for a Reddit post.
    Returns post metadata + full comment list with parent IDs.
    
    sort='top' returns highest-voted comments first — best for
    brand intelligence and consensus analysis.
    sort='new' returns most recent — best for monitoring ongoing discussions.
    """
    try:
        params = {
            "sort": sort,
            "limit": limit,
        }
        if depth is not None:
            params["depth"] = depth
        if subreddit:
            params["subreddit"] = subreddit
        
        response = await client.get(
            f"{BASE_URL}/reddit/post/{post_id}/comments",
            params=params,
            timeout=45.0,  # Comment collection can be slower than post fetch
        )
        response.raise_for_status()
        data = response.json()
        
        post = data.get("post", {})
        comments = data.get("comments", [])
        
        print(f"Post '{post.get('title', post_id)[:60]}...'")
        print(f"  Fetched {len(comments)} comments (sort={sort})")
        
        return {
            "post_id": post_id,
            "post": post,
            "comments": comments,
            "fetched_at": datetime.utcnow().isoformat(),
        }
    
    except httpx.HTTPStatusError as e:
        print(f"HTTP error for post {post_id}: {e.response.status_code}")
        return {"post_id": post_id, "post": {}, "comments": [], "error": str(e)}
    except Exception as e:
        print(f"Error fetching comments for {post_id}: {e}")
        return {"post_id": post_id, "post": {}, "comments": [], "error": str(e)}

Step 2: The Thread Tree Reconstruction Algorithm

The reconstruction converts a flat list of comment objects — each with a parent_id — into a nested tree structure. This is a classic tree-building problem: build a dictionary keyed by ID, then attach children to their parents.

python

# thread_tree.py
from typing import Optional
from dataclasses import dataclass, field


@dataclass
class CommentNode:
    """A comment in a thread tree."""
    id: str
    parent_id: str          # t1_xxx for comment, t3_xxx for post
    body: str
    author: str
    score: int
    created_utc: str
    depth: int = 0
    is_submitter: bool = False
    distinguished: Optional[str] = None
    children: list = field(default_factory=list)
    
    @property
    def is_deleted(self) -> bool:
        return self.body in ("[deleted]", "[removed]", "")
    
    @property
    def is_top_level(self) -> bool:
        """True if this is a direct reply to the post (not to another comment)."""
        return self.parent_id.startswith("t3_")
    
    @property
    def subtree_size(self) -> int:
        """Total number of comments in this node's subtree."""
        return 1 + sum(child.subtree_size for child in self.children)
    
    @property
    def subtree_max_score(self) -> int:
        """Highest score in this node's subtree."""
        scores = [self.score] + [child.subtree_max_score for child in self.children]
        return max(scores)
    
    def to_dict(self, include_children: bool = True) -> dict:
        result = {
            "id": self.id,
            "parent_id": self.parent_id,
            "body": self.body,
            "author": self.author,
            "score": self.score,
            "created_utc": self.created_utc,
            "depth": self.depth,
            "is_submitter": self.is_submitter,
            "is_top_level": self.is_top_level,
            "subtree_size": self.subtree_size,
        }
        if include_children:
            result["children"] = [c.to_dict() for c in self.children]
        return result


def build_comment_tree(
    comments: list[dict],
    post_id: str,
    exclude_deleted: bool = False,
) -> list[CommentNode]:
    """
    Convert a flat list of comment dicts into a nested tree.
    Returns a list of top-level CommentNode objects with children populated.
    
    This is a O(n) algorithm:
    1. Build a lookup dict by comment ID
    2. For each comment, attach it to its parent
    3. Return only root-level comments (those whose parent is the post)
    """
    
    # Build lookup dict
    nodes: dict[str, CommentNode] = {}
    
    for comment in comments:
        comment_id = comment.get("id", "")
        if not comment_id:
            continue
        
        body = comment.get("body", "")
        
        if exclude_deleted and body in ("[deleted]", "[removed]", ""):
            continue
        
        node = CommentNode(
            id=comment_id,
            parent_id=comment.get("parent_id", f"t3_{post_id}"),
            body=body,
            author=comment.get("author", "[deleted]"),
            score=comment.get("score", 0),
            created_utc=comment.get("created_utc", ""),
            depth=comment.get("depth", 0),
            is_submitter=comment.get("is_submitter", False),
            distinguished=comment.get("distinguished"),
        )
        nodes[comment_id] = node
    
    # Attach children to parents
    roots = []
    for node in nodes.values():
        if node.is_top_level:
            roots.append(node)
        else:
            # Strip the t1_ prefix to get parent comment ID
            parent_comment_id = node.parent_id.replace("t1_", "")
            if parent_comment_id in nodes:
                nodes[parent_comment_id].children.append(node)
            else:
                # Parent not in dataset (truncated thread) — treat as root
                roots.append(node)
    
    # Sort roots by score descending (highest-quality first)
    roots.sort(key=lambda n: n.score, reverse=True)
    
    # Sort children at each level too
    def sort_children(node: CommentNode):
        node.children.sort(key=lambda n: n.score, reverse=True)
        for child in node.children:
            sort_children(child)
    
    for root in roots:
        sort_children(root)
    
    return roots

Step 3: Extracting Intelligence From the Tree

A reconstructed thread tree enables analyses that are impossible on flat comment lists.

python

# thread_analysis.py
from thread_tree import CommentNode
from collections import Counter
import re


def extract_thread_insights(
    roots: list[CommentNode],
    post_title: str = "",
    min_score: int = 5,
) -> dict:
    """
    Extract structured intelligence from a reconstructed comment tree.
    Returns consensus views, key themes, and community dynamics.
    """
    
    # Flatten all comments for aggregate analysis
    all_comments = []
    
    def collect_all(node: CommentNode):
        all_comments.append(node)
        for child in node.children:
            collect_all(child)
    
    for root in roots:
        collect_all(root)
    
    # Filter out deleted and low-score comments
    quality_comments = [
        c for c in all_comments
        if not c.is_deleted and c.score >= min_score
    ]
    
    # Top-level consensus: highest-scored root comments
    top_consensus = [
        {
            "body": root.body[:500],
            "score": root.score,
            "author": root.author,
            "reply_count": root.subtree_size - 1,
            "depth_of_discussion": _max_depth(root),
        }
        for root in roots[:5]
        if not root.is_deleted
    ]
    
    # OP engagement: did original poster respond and what did they say?
    op_responses = [
        {
            "body": c.body[:300],
            "score": c.score,
            "depth": c.depth,
        }
        for c in quality_comments
        if c.is_submitter and c.depth > 0
    ]
    
    # Most contested comments: low score but high reply count
    # (people disagreed enough to reply but not upvote)
    contested = sorted(
        [c for c in all_comments if c.score < 2 and c.subtree_size > 5],
        key=lambda c: c.subtree_size,
        reverse=True,
    )[:3]
    
    # Score distribution
    scores = [c.score for c in all_comments if not c.is_deleted]
    
    # Key themes via simple word frequency on high-score comments
    high_quality_text = " ".join(
        c.body for c in quality_comments if len(c.body) > 50
    ).lower()
    
    # Remove common stop words
    stop_words = {
        "the", "a", "an", "and", "or", "but", "in", "on", "at", "to",
        "for", "of", "with", "this", "that", "it", "is", "was", "are",
        "have", "has", "been", "be", "i", "you", "we", "they", "not",
        "reddit", "comment", "post", "thread", "edit",
    }
    
    words = re.findall(r'\b[a-z]{4,}\b', high_quality_text)
    word_freq = Counter(w for w in words if w not in stop_words)
    top_themes = word_freq.most_common(15)
    
    return {
        "total_comments": len(all_comments),
        "quality_comments": len(quality_comments),
        "deleted_count": len([c for c in all_comments if c.is_deleted]),
        "top_consensus": top_consensus,
        "op_engaged": len(op_responses) > 0,
        "op_response_count": len(op_responses),
        "op_responses": op_responses[:3],
        "contested_comments": [
            {"body": c.body[:200], "score": c.score, "replies": c.subtree_size}
            for c in contested
        ],
        "score_summary": {
            "max": max(scores) if scores else 0,
            "median": sorted(scores)[len(scores)//2] if scores else 0,
            "positive_pct": round(
                100 * len([s for s in scores if s > 0]) / len(scores), 1
            ) if scores else 0,
        },
        "top_themes": [(word, count) for word, count in top_themes],
    }


def _max_depth(node: CommentNode, current: int = 0) -> int:
    """Find the maximum depth of a comment subtree."""
    if not node.children:
        return current
    return max(_max_depth(child, current + 1) for child in node.children)


def thread_to_training_format(
    roots: list[CommentNode],
    min_exchange_score: int = 10,
) -> list[dict]:
    """
    Convert a thread tree to instruction-response pairs for AI training.
    Each quality parent-child comment pair becomes a training example.
    The parent is the instruction; the highest-scored child is the response.
    """
    pairs = []
    
    def extract_pairs(node: CommentNode):
        # Skip deleted or low-quality nodes
        if node.is_deleted or node.score < min_exchange_score:
            for child in node.children:
                extract_pairs(child)
            return
        
        # Find the best-scored non-deleted child
        quality_children = [
            c for c in node.children
            if not c.is_deleted and c.score >= min_exchange_score
        ]
        
        if quality_children:
            best_child = max(quality_children, key=lambda c: c.score)
            
            # Only include if both parent and child are substantive
            if len(node.body) > 50 and len(best_child.body) > 50:
                pairs.append({
                    "instruction": node.body,
                    "response": best_child.body,
                    "metadata": {
                        "parent_score": node.score,
                        "response_score": best_child.score,
                        "parent_depth": node.depth,
                        "response_type": "top_upvoted_reply",
                    }
                })
        
        for child in node.children:
            extract_pairs(child)
    
    for root in roots:
        extract_pairs(root)
    
    return pairs

Step 4: The Complete Pipeline

python

# main_reddit_comments.py
import asyncio
import httpx
import json
import os
from reddit_comment_collector import fetch_post_comments
from thread_tree import build_comment_tree
from thread_analysis import extract_thread_insights, thread_to_training_format

API_KEY = os.environ["SCRAPEBADGER_API_KEY"]
HEADERS = {"X-API-Key": API_KEY}


async def process_post(
    post_id: str,
    subreddit: str = None,
    purpose: str = "intelligence",  # "intelligence" or "training"
    output_dir: str = "output",
) -> dict:
    """
    Complete pipeline: fetch → reconstruct → analyse → export.
    """
    os.makedirs(output_dir, exist_ok=True)
    
    async with httpx.AsyncClient(headers=HEADERS) as client:
        # Step 1: Fetch all comments
        raw = await fetch_post_comments(
            client, post_id, subreddit, sort="top"
        )
        
        if raw.get("error") or not raw.get("comments"):
            print(f"Failed to fetch comments for {post_id}")
            return raw
        
        # Step 2: Reconstruct tree
        roots = build_comment_tree(
            comments=raw["comments"],
            post_id=post_id,
            exclude_deleted=(purpose == "training"),
        )
        
        print(f"  Tree built: {len(roots)} top-level threads")
        
        # Step 3: Analyse
        post_title = raw["post"].get("title", "")
        insights = extract_thread_insights(roots, post_title)
        
        result = {
            "post_id": post_id,
            "post_title": post_title,
            "subreddit": raw["post"].get("subreddit", ""),
            "total_comments": insights["total_comments"],
            "quality_comments": insights["quality_comments"],
            "top_consensus": insights["top_consensus"],
            "top_themes": insights["top_themes"],
            "op_engaged": insights["op_engaged"],
            "score_summary": insights["score_summary"],
        }
        
        # Step 4: Export
        if purpose == "intelligence":
            out_path = f"{output_dir}/{post_id}_intelligence.json"
            with open(out_path, "w") as f:
                json.dump(result, f, indent=2)
            print(f"  Intelligence saved: {out_path}")
        
        elif purpose == "training":
            pairs = thread_to_training_format(roots, min_exchange_score=5)
            out_path = f"{output_dir}/{post_id}_training.jsonl"
            with open(out_path, "w") as f:
                for pair in pairs:
                    f.write(json.dumps(pair) + "\n")
            print(f"  Training pairs saved: {len(pairs)} pairs → {out_path}")
            result["training_pairs"] = len(pairs)
        
        return result


async def batch_process(
    post_ids: list[str],
    purpose: str = "intelligence",
    max_concurrent: int = 5,
) -> list[dict]:
    """Process multiple posts concurrently."""
    import random
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def bounded_process(post_id: str) -> dict:
        async with semaphore:
            await asyncio.sleep(random.uniform(0.5, 1.5))
            return await process_post(post_id, purpose=purpose)
    
    results = await asyncio.gather(
        *[bounded_process(pid) for pid in post_ids]
    )
    return list(results)


if __name__ == "__main__":
    import sys
    
    if len(sys.argv) < 2:
        print("Usage: python main_reddit_comments.py <post_id1> [post_id2 ...]")
        print("Example: python main_reddit_comments.py abc123 def456")
        sys.exit(1)
    
    post_ids = sys.argv[1:]
    purpose = "intelligence"  # or "training"
    
    results = asyncio.run(batch_process(post_ids, purpose=purpose))
    
    print(f"\n{'='*50}")
    print(f"Processed {len(results)} posts:")
    for r in results:
        print(f"  {r.get('post_id')}: {r.get('total_comments', 0)} comments, "
              f"{len(r.get('top_themes', []))} themes identified")

What Reconstructed Thread Trees Enable

For Brand Intelligence

A product complaint thread reconstructed as a tree shows which concern the community found most credible (highest top-level score), how deeply the discussion went into diagnosis (subtree depth), whether the original poster engaged and what they said (OP response tracking), and what the community consensus answer was. This is categorically different from reading 200 comments in sequence.

For AI Training Data

As covered in the ScrapeBadger AI training datasets guide, Reddit produces natural conversational pairs from comment structure. The tree reconstruction separates high-quality community-validated exchanges (high-score parent-child pairs) from the long tail of low-engagement comments. The thread_to_training_format() function above extracts these pairs in a format directly compatible with fine-tuning pipelines.

For Quantitative Research

The score distribution, OP engagement rate, contest frequency, and theme clustering that the analysis pipeline produces are quantitative signals extractable from comment trees but not from flat comment lists. Researchers tracking how community sentiment around a topic evolves over time need the full thread structure to measure things like average top-level score, discussion depth distribution, and OP reply rate — none of which are derivable from flat data.

The ScrapeBadger Reddit Scraper documentation covers the comment endpoint parameters including sort options, depth control, and response field reference. Free trial at scrapebadger.com/reddit-scraper — 1,000 credits, no credit card.

How to Scrape Reddit Comments and Reconstruct Thread Trees With ScrapeBadger

Understanding Reddit's Comment Data Model

Step 1: Fetching Comments for a Post

Step 2: The Thread Tree Reconstruction Algorithm

Step 3: Extracting Intelligence From the Tree

Step 4: The Complete Pipeline

What Reconstructed Thread Trees Enable

For Brand Intelligence

For AI Training Data

For Quantitative Research

Thomas Shultz

Ready to get started?