How to Scrape Reddit Comments and Reconstruct Thread Trees With ScrapeBadger

Comments are where Reddit's real intelligence lives. The original post states a problem or asks a question. The comment thread is where the community diagnoses it, disagrees with each other, offers solutions, validates those solutions through upvotes, and builds a conversation structure that is richer than any individual contribution within it.
Most Reddit scraping tutorials collect posts. They either skip comments entirely or return them as a flat list that discards the threading structure — the parent-child relationships that show which comment is a reply to which, what the community thinks of each response relative to others, and how the conversation developed over time. A flat list of 200 comments from a 200-comment thread is not the same thing as the conversation. It is the conversation's raw materials with the structure removed.
The threading structure matters for three specific use cases where Reddit data has commercial value.
For brand intelligence, a complaint thread is not just a complaint — it is a conversation where the community diagnosed the problem, proposed solutions, and validated the most plausible explanation through upvotes. The highest-upvoted comment in a negative brand thread is the community's consensus view of what went wrong. A flat comment list cannot distinguish this consensus from the long tail of low-engagement responses.
For AI training data, conversational thread structure provides the dialogue patterns that make fine-tuned models genuinely conversational rather than question-answering machines. A nested comment tree shows how a community builds on previous points, challenges assertions, and reaches emergent consensus — training signals that a flat corpus cannot provide.
For product research, a thread about a product failure or a comparison request contains a hierarchical conversation where different users advocate for different positions and the community evaluates those positions through voting. Reconstructing this hierarchy shows which arguments won, which lost, and what the underlying debate was actually about.
This guide builds a complete Reddit comment scraping and thread reconstruction pipeline using ScrapeBadger's Reddit Scraper, from individual comment collection through full nested tree reconstruction and structured export.
Understanding Reddit's Comment Data Model
Before building the pipeline, understand what Reddit's comment structure actually contains. Every comment in a Reddit thread has:
id: unique comment identifierparent_id: the ID of the comment or post this is a reply to. Prefixed witht1_for comment parents,t3_for post parents (top-level comments)body: the comment textauthor: username of the commenterscore: upvote score (net upvotes minus downvotes)created_utc: creation timestampdepth: nesting depth in the tree (0 = top-level, 1 = reply to top-level, etc.)distinguished: whether the commenter is a moderator or special accountis_submitter: whether the commenter is the original poster (OP)
The parent_id field is the key to tree reconstruction. Every comment knows its parent. From a flat list of comments, you can reconstruct the full tree by following parent-child relationships recursively.
The complication: Reddit truncates deep threads. A post with 500 comments does not return all 500 comments in a single response. Deep threads are replaced with "MoreComments" objects — placeholders that require additional API calls to expand. Most tools either return an incomplete thread silently or fail to handle these stubs at all.
ScrapeBadger's Reddit comment endpoint handles the expansion of MoreComments stubs automatically, returning the complete comment tree structure for a post rather than the truncated version.
Step 1: Fetching Comments for a Post
python
# reddit_comment_collector.py
import httpx
import asyncio
import os
from typing import Optional
from datetime import datetime
API_KEY = os.environ["SCRAPEBADGER_API_KEY"]
BASE_URL = "https://api.scrapebadger.com/v1"
HEADERS = {"X-API-Key": API_KEY}
async def fetch_post_comments(
client: httpx.AsyncClient,
post_id: str,
subreddit: str = None,
sort: str = "top", # top, best, new, controversial, old, qa
limit: int = 500,
depth: int = None, # None = full depth
) -> dict:
"""
Fetch all comments for a Reddit post.
Returns post metadata + full comment list with parent IDs.
sort='top' returns highest-voted comments first — best for
brand intelligence and consensus analysis.
sort='new' returns most recent — best for monitoring ongoing discussions.
"""
try:
params = {
"sort": sort,
"limit": limit,
}
if depth is not None:
params["depth"] = depth
if subreddit:
params["subreddit"] = subreddit
response = await client.get(
f"{BASE_URL}/reddit/post/{post_id}/comments",
params=params,
timeout=45.0, # Comment collection can be slower than post fetch
)
response.raise_for_status()
data = response.json()
post = data.get("post", {})
comments = data.get("comments", [])
print(f"Post '{post.get('title', post_id)[:60]}...'")
print(f" Fetched {len(comments)} comments (sort={sort})")
return {
"post_id": post_id,
"post": post,
"comments": comments,
"fetched_at": datetime.utcnow().isoformat(),
}
except httpx.HTTPStatusError as e:
print(f"HTTP error for post {post_id}: {e.response.status_code}")
return {"post_id": post_id, "post": {}, "comments": [], "error": str(e)}
except Exception as e:
print(f"Error fetching comments for {post_id}: {e}")
return {"post_id": post_id, "post": {}, "comments": [], "error": str(e)}Step 2: The Thread Tree Reconstruction Algorithm
The reconstruction converts a flat list of comment objects — each with a parent_id — into a nested tree structure. This is a classic tree-building problem: build a dictionary keyed by ID, then attach children to their parents.
python
# thread_tree.py
from typing import Optional
from dataclasses import dataclass, field
@dataclass
class CommentNode:
"""A comment in a thread tree."""
id: str
parent_id: str # t1_xxx for comment, t3_xxx for post
body: str
author: str
score: int
created_utc: str
depth: int = 0
is_submitter: bool = False
distinguished: Optional[str] = None
children: list = field(default_factory=list)
@property
def is_deleted(self) -> bool:
return self.body in ("[deleted]", "[removed]", "")
@property
def is_top_level(self) -> bool:
"""True if this is a direct reply to the post (not to another comment)."""
return self.parent_id.startswith("t3_")
@property
def subtree_size(self) -> int:
"""Total number of comments in this node's subtree."""
return 1 + sum(child.subtree_size for child in self.children)
@property
def subtree_max_score(self) -> int:
"""Highest score in this node's subtree."""
scores = [self.score] + [child.subtree_max_score for child in self.children]
return max(scores)
def to_dict(self, include_children: bool = True) -> dict:
result = {
"id": self.id,
"parent_id": self.parent_id,
"body": self.body,
"author": self.author,
"score": self.score,
"created_utc": self.created_utc,
"depth": self.depth,
"is_submitter": self.is_submitter,
"is_top_level": self.is_top_level,
"subtree_size": self.subtree_size,
}
if include_children:
result["children"] = [c.to_dict() for c in self.children]
return result
def build_comment_tree(
comments: list[dict],
post_id: str,
exclude_deleted: bool = False,
) -> list[CommentNode]:
"""
Convert a flat list of comment dicts into a nested tree.
Returns a list of top-level CommentNode objects with children populated.
This is a O(n) algorithm:
1. Build a lookup dict by comment ID
2. For each comment, attach it to its parent
3. Return only root-level comments (those whose parent is the post)
"""
# Build lookup dict
nodes: dict[str, CommentNode] = {}
for comment in comments:
comment_id = comment.get("id", "")
if not comment_id:
continue
body = comment.get("body", "")
if exclude_deleted and body in ("[deleted]", "[removed]", ""):
continue
node = CommentNode(
id=comment_id,
parent_id=comment.get("parent_id", f"t3_{post_id}"),
body=body,
author=comment.get("author", "[deleted]"),
score=comment.get("score", 0),
created_utc=comment.get("created_utc", ""),
depth=comment.get("depth", 0),
is_submitter=comment.get("is_submitter", False),
distinguished=comment.get("distinguished"),
)
nodes[comment_id] = node
# Attach children to parents
roots = []
for node in nodes.values():
if node.is_top_level:
roots.append(node)
else:
# Strip the t1_ prefix to get parent comment ID
parent_comment_id = node.parent_id.replace("t1_", "")
if parent_comment_id in nodes:
nodes[parent_comment_id].children.append(node)
else:
# Parent not in dataset (truncated thread) — treat as root
roots.append(node)
# Sort roots by score descending (highest-quality first)
roots.sort(key=lambda n: n.score, reverse=True)
# Sort children at each level too
def sort_children(node: CommentNode):
node.children.sort(key=lambda n: n.score, reverse=True)
for child in node.children:
sort_children(child)
for root in roots:
sort_children(root)
return rootsStep 3: Extracting Intelligence From the Tree
A reconstructed thread tree enables analyses that are impossible on flat comment lists.
python
# thread_analysis.py
from thread_tree import CommentNode
from collections import Counter
import re
def extract_thread_insights(
roots: list[CommentNode],
post_title: str = "",
min_score: int = 5,
) -> dict:
"""
Extract structured intelligence from a reconstructed comment tree.
Returns consensus views, key themes, and community dynamics.
"""
# Flatten all comments for aggregate analysis
all_comments = []
def collect_all(node: CommentNode):
all_comments.append(node)
for child in node.children:
collect_all(child)
for root in roots:
collect_all(root)
# Filter out deleted and low-score comments
quality_comments = [
c for c in all_comments
if not c.is_deleted and c.score >= min_score
]
# Top-level consensus: highest-scored root comments
top_consensus = [
{
"body": root.body[:500],
"score": root.score,
"author": root.author,
"reply_count": root.subtree_size - 1,
"depth_of_discussion": _max_depth(root),
}
for root in roots[:5]
if not root.is_deleted
]
# OP engagement: did original poster respond and what did they say?
op_responses = [
{
"body": c.body[:300],
"score": c.score,
"depth": c.depth,
}
for c in quality_comments
if c.is_submitter and c.depth > 0
]
# Most contested comments: low score but high reply count
# (people disagreed enough to reply but not upvote)
contested = sorted(
[c for c in all_comments if c.score < 2 and c.subtree_size > 5],
key=lambda c: c.subtree_size,
reverse=True,
)[:3]
# Score distribution
scores = [c.score for c in all_comments if not c.is_deleted]
# Key themes via simple word frequency on high-score comments
high_quality_text = " ".join(
c.body for c in quality_comments if len(c.body) > 50
).lower()
# Remove common stop words
stop_words = {
"the", "a", "an", "and", "or", "but", "in", "on", "at", "to",
"for", "of", "with", "this", "that", "it", "is", "was", "are",
"have", "has", "been", "be", "i", "you", "we", "they", "not",
"reddit", "comment", "post", "thread", "edit",
}
words = re.findall(r'\b[a-z]{4,}\b', high_quality_text)
word_freq = Counter(w for w in words if w not in stop_words)
top_themes = word_freq.most_common(15)
return {
"total_comments": len(all_comments),
"quality_comments": len(quality_comments),
"deleted_count": len([c for c in all_comments if c.is_deleted]),
"top_consensus": top_consensus,
"op_engaged": len(op_responses) > 0,
"op_response_count": len(op_responses),
"op_responses": op_responses[:3],
"contested_comments": [
{"body": c.body[:200], "score": c.score, "replies": c.subtree_size}
for c in contested
],
"score_summary": {
"max": max(scores) if scores else 0,
"median": sorted(scores)[len(scores)//2] if scores else 0,
"positive_pct": round(
100 * len([s for s in scores if s > 0]) / len(scores), 1
) if scores else 0,
},
"top_themes": [(word, count) for word, count in top_themes],
}
def _max_depth(node: CommentNode, current: int = 0) -> int:
"""Find the maximum depth of a comment subtree."""
if not node.children:
return current
return max(_max_depth(child, current + 1) for child in node.children)
def thread_to_training_format(
roots: list[CommentNode],
min_exchange_score: int = 10,
) -> list[dict]:
"""
Convert a thread tree to instruction-response pairs for AI training.
Each quality parent-child comment pair becomes a training example.
The parent is the instruction; the highest-scored child is the response.
"""
pairs = []
def extract_pairs(node: CommentNode):
# Skip deleted or low-quality nodes
if node.is_deleted or node.score < min_exchange_score:
for child in node.children:
extract_pairs(child)
return
# Find the best-scored non-deleted child
quality_children = [
c for c in node.children
if not c.is_deleted and c.score >= min_exchange_score
]
if quality_children:
best_child = max(quality_children, key=lambda c: c.score)
# Only include if both parent and child are substantive
if len(node.body) > 50 and len(best_child.body) > 50:
pairs.append({
"instruction": node.body,
"response": best_child.body,
"metadata": {
"parent_score": node.score,
"response_score": best_child.score,
"parent_depth": node.depth,
"response_type": "top_upvoted_reply",
}
})
for child in node.children:
extract_pairs(child)
for root in roots:
extract_pairs(root)
return pairsStep 4: The Complete Pipeline
python
# main_reddit_comments.py
import asyncio
import httpx
import json
import os
from reddit_comment_collector import fetch_post_comments
from thread_tree import build_comment_tree
from thread_analysis import extract_thread_insights, thread_to_training_format
API_KEY = os.environ["SCRAPEBADGER_API_KEY"]
HEADERS = {"X-API-Key": API_KEY}
async def process_post(
post_id: str,
subreddit: str = None,
purpose: str = "intelligence", # "intelligence" or "training"
output_dir: str = "output",
) -> dict:
"""
Complete pipeline: fetch → reconstruct → analyse → export.
"""
os.makedirs(output_dir, exist_ok=True)
async with httpx.AsyncClient(headers=HEADERS) as client:
# Step 1: Fetch all comments
raw = await fetch_post_comments(
client, post_id, subreddit, sort="top"
)
if raw.get("error") or not raw.get("comments"):
print(f"Failed to fetch comments for {post_id}")
return raw
# Step 2: Reconstruct tree
roots = build_comment_tree(
comments=raw["comments"],
post_id=post_id,
exclude_deleted=(purpose == "training"),
)
print(f" Tree built: {len(roots)} top-level threads")
# Step 3: Analyse
post_title = raw["post"].get("title", "")
insights = extract_thread_insights(roots, post_title)
result = {
"post_id": post_id,
"post_title": post_title,
"subreddit": raw["post"].get("subreddit", ""),
"total_comments": insights["total_comments"],
"quality_comments": insights["quality_comments"],
"top_consensus": insights["top_consensus"],
"top_themes": insights["top_themes"],
"op_engaged": insights["op_engaged"],
"score_summary": insights["score_summary"],
}
# Step 4: Export
if purpose == "intelligence":
out_path = f"{output_dir}/{post_id}_intelligence.json"
with open(out_path, "w") as f:
json.dump(result, f, indent=2)
print(f" Intelligence saved: {out_path}")
elif purpose == "training":
pairs = thread_to_training_format(roots, min_exchange_score=5)
out_path = f"{output_dir}/{post_id}_training.jsonl"
with open(out_path, "w") as f:
for pair in pairs:
f.write(json.dumps(pair) + "\n")
print(f" Training pairs saved: {len(pairs)} pairs → {out_path}")
result["training_pairs"] = len(pairs)
return result
async def batch_process(
post_ids: list[str],
purpose: str = "intelligence",
max_concurrent: int = 5,
) -> list[dict]:
"""Process multiple posts concurrently."""
import random
semaphore = asyncio.Semaphore(max_concurrent)
async def bounded_process(post_id: str) -> dict:
async with semaphore:
await asyncio.sleep(random.uniform(0.5, 1.5))
return await process_post(post_id, purpose=purpose)
results = await asyncio.gather(
*[bounded_process(pid) for pid in post_ids]
)
return list(results)
if __name__ == "__main__":
import sys
if len(sys.argv) < 2:
print("Usage: python main_reddit_comments.py <post_id1> [post_id2 ...]")
print("Example: python main_reddit_comments.py abc123 def456")
sys.exit(1)
post_ids = sys.argv[1:]
purpose = "intelligence" # or "training"
results = asyncio.run(batch_process(post_ids, purpose=purpose))
print(f"\n{'='*50}")
print(f"Processed {len(results)} posts:")
for r in results:
print(f" {r.get('post_id')}: {r.get('total_comments', 0)} comments, "
f"{len(r.get('top_themes', []))} themes identified")What Reconstructed Thread Trees Enable
For Brand Intelligence
A product complaint thread reconstructed as a tree shows which concern the community found most credible (highest top-level score), how deeply the discussion went into diagnosis (subtree depth), whether the original poster engaged and what they said (OP response tracking), and what the community consensus answer was. This is categorically different from reading 200 comments in sequence.
For AI Training Data
As covered in the ScrapeBadger AI training datasets guide, Reddit produces natural conversational pairs from comment structure. The tree reconstruction separates high-quality community-validated exchanges (high-score parent-child pairs) from the long tail of low-engagement comments. The thread_to_training_format() function above extracts these pairs in a format directly compatible with fine-tuning pipelines.
For Quantitative Research
The score distribution, OP engagement rate, contest frequency, and theme clustering that the analysis pipeline produces are quantitative signals extractable from comment trees but not from flat comment lists. Researchers tracking how community sentiment around a topic evolves over time need the full thread structure to measure things like average top-level score, discussion depth distribution, and OP reply rate — none of which are derivable from flat data.
The ScrapeBadger Reddit Scraper documentation covers the comment endpoint parameters including sort options, depth control, and response field reference. Free trial at scrapebadger.com/reddit-scraper — 1,000 credits, no credit card.

Written by
Thomas Shultz
Thomas Shultz is the Head of Data at ScrapeBadger, working on public web data, scraping infrastructure, and data reliability. He writes about real-world scraping, data pipelines, and turning unstructured web data into usable signals.
Ready to get started?
Join thousands of developers using ScrapeBadger for their data needs.