Sitemap Iteration Retrospective
From 4 iterations to 1 - grep-first data flow analysis
Author: Blue
Updated: 2024-12-24
Based on: feat-sitemap-link-text → fix-crawl-sitemap-link-text → feat-scrape-sitemap-grouping
This was an iteration where we did many things right, but also took unnecessary detours. Documenting it so we can skip a round next time.
Background: A "Simple" Feature Request
User feedback: Sitemap preview only shows URLs, can't tell what the pages are about.
Sounds simple: grab link text in extractLinks(), then render as [text](URL) format.
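The intended change can be sketched in a few lines (the `LinkEntry` shape and `renderSitemapEntry` name are illustrative, not the project's actual API):

```typescript
// Illustrative shape: each extracted link keeps its anchor text next to its URL.
interface LinkEntry {
  url: string;
  text: string;
}

// Render one sitemap line as [text](URL); fall back to the bare URL when the
// anchor text is empty.
function renderSitemapEntry(link: LinkEntry): string {
  return link.text ? `[${link.text}](${link.url})` : link.url;
}
```

For example, `renderSitemapEntry({ url: "https://docs.cnb.cool/zh/build", text: "Build Guide" })` yields `[Build Guide](https://docs.cnb.cool/zh/build)`.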
What actually happened?
| Iteration | Content | Commits |
|---|---|---|
| 1 | feat-sitemap-link-text - Add link text to Scrape mode | Archived |
| 2 | fix-crawl-sitemap-link-text - Add link text to Crawl mode | 0b54f44, dc5c121 |
| 3 | feat-scrape-sitemap-grouping - Group links by path | 27ec579 |
| 4 | Summary stats optimization + doc cleanup | aee6e1f |
4 iterations. Should have been 2.
What We Did Well
1. Problem-Driven, Incremental Delivery
Each iteration solved one clear problem, no scope creep:
link text → crawl fix → grouping → stats optimization
Every commit can be reviewed and reverted independently.
2. Verification First
Tested with real URLs (docs.cnb.cool), not fake data.
When Scrape (49 links) vs Crawl (36 links) showed different counts, I dug into the root cause instead of ignoring it:
"Why does the same URL return different link counts in different modes?"
The answer revealed a semantic difference:
- Scrape: `discoveredLinks` = all same-domain links on the page
- Crawl: `discoveredUrls` = discovered but not crawled links
This difference should have been clarified during design.
3. YAGNI Applied Properly
normalizeDiscoveredLinks() backward compatibility function? Not needed.
"Frontend and backend deploy together, no version mismatch scenario."
Cut it. 30 lines of code saved.
4. Code Review Made a Difference
Auditor caught Builder using includes(prefix) instead of startsWith(prefix + "/"):

```typescript
// Less precise
linkPath.includes("/zh/build"); // "/other/zh/build" also matches

// More precise
linkPath.startsWith("/zh/build/") || linkPath === "/zh/build";
```
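The fix generalizes to a small helper (a sketch, not the project's actual code):

```typescript
// Precise section check: a path belongs to a section only when it equals the
// prefix exactly or continues past it at a path-segment boundary.
function isUnderSection(linkPath: string, prefix: string): boolean {
  return linkPath === prefix || linkPath.startsWith(prefix + "/");
}
```

With this, `isUnderSection("/zh/build/ci", "/zh/build")` is true while `isUnderSection("/other/zh/build", "/zh/build")` is false.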
Small change, big difference.
What Could Be Better
1. Incomplete Initial Design
feat-sitemap-link-text only considered Scrape, missed Crawl.
Linus would say:
"Did you ask 'where else is discoveredUrls used?' during design? If you had, you wouldn't have missed it."
What should have been done:
```shell
grep -r "discoveredUrls\|discoveredLinks" --include="*.ts" src/ ui/src/
```
One command reveals all consumption points.
2. Insufficient Data Flow Analysis
Proposal estimated 4 files, actually changed 8:
| Estimated | Actual |
|---|---|
| src/types.ts | src/types.ts |
| src/crawler/index.ts | src/crawler/index.ts |
| ui/src/lib/types.ts | ui/src/lib/types.ts |
| ui/src/lib/sitemap-generator.ts | ui/src/lib/sitemap-generator.ts |
| - | src/lib/sitemap-generator.ts |
| - | src/routes/download.ts |
| - | ui/src/hooks/use-crawl-stream.ts |
| - | ui/src/components/extraction/utils/download.ts |
A Google engineer would:
Draw the data flow diagram first, mark every consumption point, then code.
3. Semantic Differences Discovered Late
Field semantic differences between modes weren't discovered until verification:
| Field | Mode | Meaning |
|---|---|---|
| discoveredLinks | Scrape | All same-domain links on the current page |
| discoveredUrls | Crawl | Discovered but not crawled links |
Only analyzed deeply when user asked "why are the counts different?"
Should have done in proposal: Define each field's semantics explicitly.
4. Temporary Documentation Accumulated
Generated 3 SITEMAP_*.md temp files (873 lines), cleaned up only at the end.
Correct approach:
Temporary analysis stays local. Use `/tmp` or scratch files; don't commit them to the repo.
If We Did It Again
The Linus Approach
1. Ask about data structures first
```shell
# What's the complete consumption chain for discoveredUrls?
grep -r "discoveredUrls" --include="*.ts" src/ ui/src/
```
2. Get it right the first time
Don't split into Scrape/Crawl iterations. Design a unified DiscoveredLink type, one patch to fix all.
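A possible unified shape (field names are illustrative, not what shipped): one type serves both modes, and the mode-specific semantics live in an explicit flag instead of two differently-named fields.

```typescript
// One canonical link type for both Scrape and Crawl modes.
interface DiscoveredLink {
  url: string;
  text: string;      // anchor text; empty when the link had none
  crawled: boolean;  // Crawl mode marks whether the page itself was fetched
}

// Both modes then produce the same array shape:
const links: DiscoveredLink[] = [
  { url: "https://docs.cnb.cool/zh/build", text: "Build Guide", crawled: false },
];
```

With one type, the "49 links vs 36 links" confusion becomes a filter (`links.filter(l => !l.crawled)`) instead of a semantic mismatch.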
3. No temporary docs in repo
Analysis happens in your head or scratch paper, not in commits.
The Google Engineer Approach
1. Design doc first
- User scenario: Who uses sitemap? What do they want to see?
- Data model: Define `DiscoveredLink` as the canonical type
- Impact scope: List all files that need changes
2. Consistency check
- Are Scrape and Crawl sitemap formats consistent?
- Is inconsistency intentional or an oversight?
3. Product thinking upfront
Grouping should have been in iteration 1, not iteration 3.
"When a user scrapes a page, what do they most want to know?" → How many related pages are in this section
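The grouping itself is small enough to have landed in iteration 1. A minimal sketch, assuming grouping by first path segment (the shipped logic may differ):

```typescript
// Bucket discovered URLs by their first path segment ("/zh/...", "/en/...").
function groupByPathSegment(urls: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const url of urls) {
    const segment = new URL(url).pathname.split("/")[1] || "/";
    const bucket = groups.get(segment) ?? [];
    bucket.push(url);
    groups.set(segment, bucket);
  }
  return groups;
}
```

The group sizes directly answer "how many related pages are in this section".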
Key Lessons
| Problem | Improvement |
|---|---|
| Incomplete design causes patches | grep globally during design to find all consumption points |
| Semantic differences found late | Proposal must define field semantics, not just types |
| Temp docs accumulate | Temp analysis stays local, use /tmp directory |
| Grouping added late | Product thinking upfront: ask what users want to see |
Quantified Comparison
| Metric | Actual | Ideal |
|---|---|---|
| Iterations | 4 rounds | 2 rounds |
| Proposals | 2 + 1 fix | 1 complete proposal |
| File changes | Scattered across commits | Done in one pass |
| Temp docs | 873 lines deleted | 0 |
Process Improvement: Implemented
Based on this retrospective, we updated strategist.md:
New Step: Data Flow Analysis
2. Data Flow Analysis (Linus-style grep first)
- Identify core fields/types involved in the change
- Execute grep to find all consumption points
- Track complete chain: production → transport → consumption → render
- Clarify field semantics: same name may mean different things in different modes
New Template Section: Data Flow Analysis
## Data Flow Analysis
### Fields Involved
| Field | Type | Module |
|---|---|---|
### Consumption Chain
[Production] → [Transport] → [Consumption] → [Render]
### Semantic Definitions
| Field | Mode | Meaning |
|---|---|---|
### grep Verification Results
# Consumption points found (X total)
Summary
This iteration taught me:
- grep first: Search all consumption points before design, avoid omissions
- Define semantics: Same field name may mean different things in different modes
- Product thinking upfront: What does the user want to see? Don't wait until iteration 3
- Temp docs stay local: Analysis in scratch files, not polluting the repo
- Get it right once: Complete design analysis reduces patch iterations
Finally, a reminder from Linus:
"Bad programmers worry about the code. Good programmers worry about data structures."
Next time, ask first: Where does the data come from? Where does it go? Who consumes it?
Answer these three questions clearly, and iteration count naturally decreases.