Sitemap Iteration Retrospective

From 4 iterations to 2: grep-first data flow analysis

Author: Blue
Updated: 2024-12-24
Based on: feat-sitemap-link-text → fix-crawl-sitemap-link-text → feat-scrape-sitemap-grouping

This was an iteration where we did many things right, but also took unnecessary detours. Documenting it so we can skip a round next time.


Background: A "Simple" Feature Request

User feedback: Sitemap preview only shows URLs, can't tell what the pages are about.

Sounds simple: grab link text in extractLinks(), then render as [text](URL) format.
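
That one-liner could be sketched like this (names and signature are hypothetical, not the codebase's actual extractLinks):

```ts
// Hypothetical shape: the extractor returns anchor text alongside the URL.
interface LinkWithText {
  url: string;
  text: string; // anchor text, "" when the link has none
}

// Render one sitemap entry; fall back to the bare URL if text is missing.
function renderSitemapEntry(link: LinkWithText): string {
  return link.text ? `[${link.text}](${link.url})` : link.url;
}
```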

What actually happened?

| Iteration | Content | Commits |
|---|---|---|
| 1 | feat-sitemap-link-text: add link text to Scrape mode | Archived |
| 2 | fix-crawl-sitemap-link-text: add link text to Crawl mode | 0b54f44, dc5c121 |
| 3 | feat-scrape-sitemap-grouping: group links by path | 27ec579 |
| 4 | Summary stats optimization + doc cleanup | aee6e1f |

4 iterations. Should have been 2.


What We Did Well

1. Problem-Driven, Incremental Delivery

Each iteration solved one clear problem, no scope creep:

link text → crawl fix → grouping → stats optimization

Every commit can be reviewed and reverted independently.

2. Verification First

Tested with real URLs (docs.cnb.cool), not fake data.

When Scrape (49 links) vs Crawl (36 links) showed different counts, I dug into the root cause instead of ignoring it:

"Why does the same URL return different link counts in different modes?"

The answer revealed a semantic difference:

  • Scrape: discoveredLinks = all same-domain links on the page
  • Crawl: discoveredUrls = discovered but not crawled links

This difference should have been clarified during design.
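
Had the proposal pinned this down, the types could have carried the semantics directly. A sketch (field names come from this retrospective; the surrounding interfaces are assumptions):

```ts
// Same concept, two fields, two meanings: make each one explicit.
interface ScrapeResult {
  /** Scrape mode: ALL same-domain links found on the current page. */
  discoveredLinks: string[];
}

interface CrawlResult {
  /** Crawl mode: links discovered during the crawl but NOT themselves crawled. */
  discoveredUrls: string[];
}
```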

3. YAGNI Applied Properly

normalizeDiscoveredLinks() backward compatibility function? Not needed.

"Frontend and backend deploy together, no version mismatch scenario."

Cut it. 30 lines of code saved.

4. Code Review Made a Difference

Auditor caught Builder using includes(prefix) instead of startsWith(prefix + "/"):

```ts
// Less precise
linkPath.includes("/zh/build")  // "/other/zh/build" also matches

// More precise
linkPath.startsWith("/zh/build/") || linkPath === "/zh/build"
```

Small change, big difference.
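
One way to keep the precise check from regressing is a small shared helper (hypothetical, not in the codebase):

```ts
// True for "/zh/build" and anything under "/zh/build/";
// false for "/zh/build-tools" and "/other/zh/build".
function isUnderSection(linkPath: string, prefix: string): boolean {
  return linkPath === prefix || linkPath.startsWith(prefix + "/");
}
```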


What Could Be Better

1. Incomplete Initial Design

feat-sitemap-link-text only considered Scrape, missed Crawl.

Linus would say:

"Did you ask 'where else is discoveredUrls used?' during design? If you had, you wouldn't have missed it."

What should have been done:

```bash
grep -r "discoveredUrls\|discoveredLinks" --include="*.ts" src/ ui/src/
```

One command reveals all consumption points.

2. Insufficient Data Flow Analysis

The proposal estimated 4 files; we actually changed 8:

| Estimated | Actual |
|---|---|
| src/types.ts | src/types.ts |
| src/crawler/index.ts | src/crawler/index.ts |
| ui/src/lib/types.ts | ui/src/lib/types.ts |
| ui/src/lib/sitemap-generator.ts | ui/src/lib/sitemap-generator.ts |
| - | src/lib/sitemap-generator.ts |
| - | src/routes/download.ts |
| - | ui/src/hooks/use-crawl-stream.ts |
| - | ui/src/components/extraction/utils/download.ts |

A Google engineer would:

Draw the data flow diagram first, mark every consumption point, then code.
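
For this feature, that diagram might have looked like the following (stage names from the process checklist below; modules taken from the actual change list above):

```text
[produce]   extractLinks()
[transport] discoveredLinks / discoveredUrls in src/types.ts, ui/src/lib/types.ts
[consume]   sitemap-generator.ts, src/routes/download.ts, ui/src/hooks/use-crawl-stream.ts
[render]    ui/src/components/extraction/utils/download.ts, sitemap preview
```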

3. Semantic Differences Discovered Late

Field semantic differences between modes weren't discovered until verification:

| Field | Mode | Meaning |
|---|---|---|
| discoveredLinks | Scrape | All same-domain links on the current page |
| discoveredUrls | Crawl | Discovered but not crawled links |

We only analyzed the difference in depth when the user asked "why are the counts different?"

What the proposal should have done: define each field's semantics explicitly, not just its type.

4. Temporary Documentation Accumulated

We generated 3 SITEMAP_*.md temp files (873 lines) and cleaned them up only at the end.

Correct approach:

Temporary analysis stays local. Use /tmp or scratch files, don't commit to repo.


If We Did It Again

The Linus Approach

1. Ask about data structures first

```bash
# What's the complete consumption chain for discoveredUrls?
grep -r "discoveredUrls" --include="*.ts" src/ ui/src/
```

2. Get it right the first time

Don't split into separate Scrape and Crawl iterations. Design a unified DiscoveredLink type and fix everything in one patch, as sketched below.
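
A sketch of what that unified type might look like (the name DiscoveredLink comes from this retrospective; the fields are assumptions):

```ts
// One canonical link shape for both modes; mode-specific meaning
// becomes explicit data instead of two differently named fields.
interface DiscoveredLink {
  url: string;
  text: string;                 // anchor text, "" if none
  source: "scrape" | "crawl";   // which mode produced the link
  crawled: boolean;             // crawl mode: was this URL itself fetched?
}
```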

3. No temporary docs in repo

Analysis happens in your head or scratch paper, not in commits.

The Google Engineer Approach

1. Design doc first

  • User scenario: Who uses sitemap? What do they want to see?
  • Data model: Define DiscoveredLink as the canonical type
  • Impact scope: List all files that need changes

2. Consistency check

  • Are Scrape and Crawl sitemap formats consistent?
  • Is inconsistency intentional or an oversight?

3. Product thinking upfront

Grouping should have been in iteration 1, not iteration 3.

"When a user scrapes a page, what do they most want to know?" → How many related pages are in this section


Key Lessons

| Problem | Improvement |
|---|---|
| Incomplete design causes patches | grep globally during design to find all consumption points |
| Semantic differences found late | Proposal must define field semantics, not just types |
| Temp docs accumulate | Temp analysis stays local; use /tmp |
| Grouping added late | Product thinking upfront: ask what users want to see |

Quantified Comparison

| Metric | Actual | Ideal |
|---|---|---|
| Iterations | 4 rounds | 2 rounds |
| Proposals | 2 + 1 fix | 1 complete proposal |
| File changes | Scattered across commits | Done in one pass |
| Temp docs | 873 lines deleted | 0 |

Process Improvement: Implemented

Based on this retrospective, we updated strategist.md:

New Step: Data Flow Analysis

```markdown
2. Data Flow Analysis (Linus-style grep first)
   - Identify core fields/types involved in the change
   - Execute grep to find all consumption points
   - Track complete chain: production → transport → consumption → render
   - Clarify field semantics: same name may mean different things in different modes
```

New Template Section: Data Flow Analysis

```markdown
## Data Flow Analysis

### Fields Involved
| Field | Type | Module |

### Consumption Chain
[Production] → [Transport] → [Consumption] → [Render]

### Semantic Definitions
| Field | Mode | Meaning |

### grep Verification Results
# Consumption points found (X total)
```

Summary

This iteration taught me:

  1. grep first: Search all consumption points before design, avoid omissions
  2. Define semantics: Same field name may mean different things in different modes
  3. Product thinking upfront: What does the user want to see? Don't wait until iteration 3
  4. Temp docs stay local: Analysis in scratch files, not polluting the repo
  5. Get it right once: Complete design analysis reduces patch iterations

Finally, a reminder from Linus:

"Bad programmers worry about the code. Good programmers worry about data structures."

Next time, ask first: Where does the data come from? Where does it go? Who consumes it?

Answer these three questions clearly, and the iteration count naturally drops.