"The Validation Illusion: When typecheck Lies"

A journey through a deceptive bug where all checks passed but nothing worked, revealing the gap between static validation and runtime reality.

"Show me the data." — Linus Torvalds

This is a story about a feature that passed every automated check yet failed to appear on screen. It exposed a fundamental gap in our Agent workflow: the illusion that compilation equals correctness.


The Scene

What We Built

We were adding an "API Breakdown" component to the Admin Dashboard. The feature would show:

  • Core APIs (scrape, crawl) — always expanded, prominently displayed
  • Supporting APIs (convert, download, etc.) — collapsed by default

Simple enough. The backend already returned breakdown data. We just needed the frontend to consume it.
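
For concreteness, here is a rough sketch of the payload the frontend expected to consume. The names (`breakdown`, `core`, `supporting`, `EndpointStats`) are reconstructed from the feature description, not the project's actual schema.

```typescript
// Hypothetical shape of the stats payload once the feature ships.
// Field names are illustrative; `breakdown` is the one this story hinges on.
interface EndpointStats {
  endpoint: string;   // e.g. "scrape", "crawl", "convert", "download"
  requests: number;
  errorRate: number;
}

interface StatsResponse {
  requests: number;
  timestamp: string;
  window: string;
  breakdown?: {
    core: EndpointStats[];        // always expanded in the UI
    supporting: EndpointStats[];  // collapsed by default
  };
}
```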

The Workflow

Designer → Design spec approved
Strategist → Proposal created
Builder → Code implemented
Auditor → Review passed ✅
Main Agent → Committed

Everything green. Ship it.


The Crash

User: "验证通过了吗,我打开怎么没有看到界面呢,这个验收也太粗糙了"

Translation: "Did you actually verify this? I opened the page and see nothing. This acceptance testing is way too sloppy."

The Evidence

$ curl -s "http://localhost:3000/api/stats" | jq 'keys'
[
  "ai",
  "browserPool",
  "requests",
  "timestamp",
  "window"
]

No breakdown field. The backend was returning old data.

But wait — we committed the code. The git log showed:

8ce638c feat(stats): add endpoint breakdown following Google Analytics methodology

The code was there. The types were correct. TypeScript was happy. So why wasn't the API returning the new field?


The Root Cause: A Tale of Three Failures

Failure 1: The Auditor's Blind Spot

Our Auditor ran:

pnpm typecheck  # ✅ Passed

And declared victory.

But typecheck only verifies that types align at compile time. It says nothing about:

  • Whether the server is running the new code
  • Whether the UI actually renders the component
  • Whether the data flows correctly at runtime

The lesson: Static analysis is necessary but not sufficient.
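
To see the gap concretely, reuse the hypothetical StatsResponse sketched earlier: the code below typechecks no matter what the running server actually returns.

```typescript
// The compiler only sees the declared type; it never sees the live payload.
async function fetchStats(): Promise<StatsResponse> {
  const res = await fetch("http://localhost:3000/api/stats");
  // This cast compiles even when the running server is old code
  // whose JSON has no `breakdown` field at all.
  return (await res.json()) as StatsResponse;
}

// `breakdown` is optional in the type, so a stale server silently yields
// an empty list at runtime while typecheck stays green.
const coreApis = (await fetchStats()).breakdown?.core ?? [];
```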

Failure 2: The Invisible Restart

Node.js doesn't hot-reload server code on its own. When we modified src/lib/stats.ts, the already-running process kept executing the version it had loaded into memory.

Code on disk: ✅ Has breakdown field
Running server: ❌ Old code without breakdown

The backend needed a restart. Nobody mentioned it. Nobody checked.

The lesson: Code changes require service restarts. Memory doesn't sync with disk.
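
A check like the sketch below (illustrative, not part of the repo) turns that lesson into something executable: probe the live endpoint and fail loudly if the process in memory doesn't serve the new field.

```typescript
// Probe the live process: does the code in memory match the code on disk?
async function assertServerIsFresh(): Promise<void> {
  const res = await fetch("http://localhost:3000/api/stats");
  const body = (await res.json()) as Record<string, unknown>;

  if (!("breakdown" in body)) {
    console.error("Stale server: `breakdown` missing. Restart the service and re-run.");
    process.exit(1);
  }
  console.log("OK: the running server includes the new field.");
}

assertServerIsFresh();
```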

Failure 3: The Phantom Directory

Our Designer created files in:

openspec/changes/feat-endpoint-breakdown-ui/design/

But the change-id was:

feat-admin-dashboard-breakdown

The Designer invented their own directory name instead of using what the Main Agent specified.

The lesson: Agents may ignore instructions they "should" follow.


The Investigation: Where Do Instructions Go?

This led to a deeper question: How do we ensure Agents actually receive and follow instructions?

The Hierarchy of Certainty

| Location | Agent Receives It? | Notes |
| --- | --- | --- |
| openspec/AGENTS.md | Maybe | Agent must choose to read it |
| Main Agent's prompt | Yes | But must be written every time |
| .codebuddy/agents/*.md | Always | It's the system prompt |

The AGENTS.md file was a suggestion. Agents could read it, or not. They could read it and ignore it.

But .codebuddy/agents/builder.md? That's the Builder's system prompt. It's injected into every Builder invocation. The Agent cannot not receive it.

"Agent 读取 AGENTS.md 是建议,不是强制。Agent 可能读了但忽略。"

"Reading AGENTS.md is a suggestion, not enforcement. The Agent might read it and still ignore it."

This was the key insight: inject rules where they cannot be bypassed.


The Fix: Three Layers of Defense

Layer 1: Strategist — Generate Better Checklists

We updated strategist.md to require typed validation checklists:

**Frontend changes:**
- [ ] typecheck passes
- [ ] Service is running/restarted
- [ ] API returns correct data (`curl` verification)
- [ ] UI renders correctly (browser snapshot)

**Backend changes:**
- [ ] typecheck passes
- [ ] Service restarted, API verified (`curl`)
- [ ] Test data generated to verify business logic

Now every tasks.md comes with the right checklist by default.

Layer 2: Auditor — Mandate E2E Verification

We added to auditor.md:

### End-to-End Validation (Required!)

**Prohibited**: Passing review based solely on typecheck/lint

**Must execute**:
- Confirm service is running latest code
  - Backend: Service needs restart (`curl` to verify new fields)
  - Frontend: Confirm hot-reload or rebuild
- Execute ALL validation items in tasks.md
- Frontend changes: Browser snapshot to verify UI
- Backend changes: `curl` to verify API response structure
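
What the browser-snapshot item could look like in practice, sketched here with Playwright (the actual tooling isn't specified, so treat the library, URL, and heading text as assumptions): load the dashboard, assert the new component renders, and keep a screenshot as review evidence.

```typescript
import { chromium } from "playwright";

// Open the admin dashboard, confirm the component is on screen,
// and save a screenshot for the review record.
async function snapshotDashboard(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("http://localhost:3000/admin");

  const visible = await page.getByText("API Breakdown").isVisible();
  await page.screenshot({ path: "admin-dashboard.png", fullPage: true });
  await browser.close();

  if (!visible) {
    throw new Error("API Breakdown component did not render");
  }
}

snapshotDashboard().catch((err) => {
  console.error(err);
  process.exit(1);
});
```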

Layer 3: Builder & Designer — Explicit Constraints

For Builder:

**Service Restart Reminder**:
- Backend code changes don't auto-reload in running services
- In your report, remind: "Backend modified, restart service before validation"

For Designer:

**Directory Specification (Required!)**:
- `<change-id>` is provided by Main Agent in the prompt
- **Prohibited**: Creating your own directory names
- If prompt doesn't provide change-id, **stop** and request it


The Pattern: Validation Pyramid

         ┌─────────────────────┐
         │   E2E Validation    │  ← The missing layer
         │  (curl + snapshot)  │
         └──────────┬──────────┘
                    │
         ┌──────────▼──────────┐
         │    Unit Tests       │
         │   (pnpm test)       │
         └──────────┬──────────┘
                    │
         ┌──────────▼──────────┐
         │   Static Analysis   │
         │    (typecheck)      │
         └─────────────────────┘

Each layer catches different classes of bugs:

| Layer | Catches | Misses |
| --- | --- | --- |
| typecheck | Type errors | Runtime behavior |
| Unit tests | Logic errors | Integration issues |
| E2E validation | Everything | Nothing (if done right) |

The pyramid must be complete. Skipping the top layer creates a false sense of security.


The Philosophy: Instructions That Cannot Be Ignored

The deeper lesson isn't about testing. It's about system design for AI collaboration.

The Old Model: Trust and Hope

Write instructions → Hope Agent reads them → Hope Agent follows them

This fails because Agents have agency. They make decisions. They prioritize. They might decide your instructions aren't relevant to their current task.

The New Model: Structural Enforcement

Inject into system prompt → Agent MUST receive it
Template in generator → Output MUST include it
Check in pipeline → Build MUST fail without it

The principle: Don't rely on compliance. Design for inevitability.
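
The third arrow is the easiest to make structural. A pipeline gate like the sketch below (hypothetical path and phrases, not the project's actual tooling) turns "please include the validation items" into "the build fails without them".

```typescript
import { readFileSync } from "node:fs";

// Fail the pipeline if a change's tasks.md is missing the validation items
// the Strategist template is supposed to generate.
const REQUIRED_ITEMS = [
  "Service is running/restarted",
  "API returns correct data",
];

const changeDir = process.argv[2] ?? "openspec/changes/feat-admin-dashboard-breakdown";
const tasks = readFileSync(`${changeDir}/tasks.md`, "utf8");
const missing = REQUIRED_ITEMS.filter((item) => !tasks.includes(item));

if (missing.length > 0) {
  console.error(`tasks.md is missing validation items: ${missing.join(", ")}`);
  process.exit(1);
}
console.log("tasks.md includes all required validation items.");
```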


Epilogue: The Humble Checklist

After all this analysis and all these changes, what's the actual fix?

A checklist. A simple, boring checklist that says:

- [ ] Did you actually open the page?
- [ ] Did you see the thing you built?
- [ ] Is the server running your code?

Sometimes the most sophisticated solution is remembering to look.

"验收必须看到真实数据在真实界面上的呈现。"

"Validation must see real data rendered on the real interface."


Commits

  • f610524 feat(admin): add API breakdown component to dashboard
  • 5c82d84 refactor(agents): strengthen validation and directory naming rules

Related

  • Insight: openspec/insights/agent-validation-gap-2026-01-01.md
  • Feature: openspec/changes/feat-admin-dashboard-breakdown/