CSS Selector Side Effects & Diagnosis Methods
Wildcard selector pitfalls, git bisect tracing, and PDF style independence
Author: Blue
Updated: 2024-12-27
Based on: VuePress blank page bug → Angular misdetection → PDF code truncation
This is a diagnostic journey from "blank page" to "root cause identification". It records the core insights so that similar issues can be located quickly next time.
Background: VuePress Page Scraping Returns Blank
User feedback: after scraping VuePress docs, the frontend shows a blank page with only a "0%" progress bar.
Backend logs showed "success", but returned content was nearly empty:
{
  "content": "\n\n0%",
  "contentLength": 5
}
Diagnosis Process: Three-Step Localization
Step 1: curl Isolation Test
curl -X POST http://localhost:3456/api/extract \
  -H "Content-Type: application/json" \
  -d '{"url":"https://vuepress.vuejs.org/guide/"}'
Conclusion: the backend itself returns empty content, so this is not a frontend rendering issue.
Step 2: git log -S to Trace Changes
# Find commit that introduced [class*="sidebar"]
git log -S '[class*="sidebar"]' --oneline -- src/lib/clean-html.ts
Located commit d664233.
Step 3: git show to View Specific Changes
git show d664233 -- src/lib/clean-html.ts
Found three new wildcard selectors:
'[class*="sidebar"]',
'[class*="toc"]',
'[class*="catalog"]',
Root Cause Analysis
Bug 1: Sidebar Selector Over-matching
/* Problematic code */
'[class*="sidebar"]'
/* Expected matches */
class="sidebar"
class="left-sidebar"
/* Unintended match - VuePress main content area! */
class="vp-content has-sidebar"
VuePress uses has-sidebar as a layout flag on the main content container, not on an actual sidebar. Because has-sidebar contains the substring sidebar, the wildcard selector matched that container and removed the entire main content.
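To make the over-match concrete, here is a minimal sketch that mimics the two matching rules on a raw class attribute string. The helper names matchesSubstring and matchesToken are hypothetical and are not part of clean-html.ts; they only model how [class*="sidebar"] and .sidebar decide a match.

```typescript
// Models [class*="needle"]: true when the raw class attribute
// contains the needle anywhere as a substring.
function matchesSubstring(classAttr: string, needle: string): boolean {
  return classAttr.indexOf(needle) !== -1;
}

// Models .needle (equivalently [class~="needle"]): true only when one
// whitespace-separated class token equals the needle exactly.
function matchesToken(classAttr: string, needle: string): boolean {
  const tokens = classAttr.split(/\s+/).filter((t) => t.length > 0);
  return tokens.indexOf(needle) !== -1;
}

// The three class attributes discussed above.
const samples: string[] = ["sidebar", "left-sidebar", "vp-content has-sidebar"];

for (const cls of samples) {
  console.log(cls, matchesSubstring(cls, "sidebar"), matchesToken(cls, "sidebar"));
}
```

The substring rule hits all three samples, including the VuePress content container. The token rule hits only the plain sidebar class, so variants such as left-sidebar have to be listed explicitly when they are wanted.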
Bug 2: Angular Misdetection
// Problematic code
$("[class*='ng-']").length > 0
// Expected matches
class="ng-scope" // Angular class
// Unintended matches
class="lang-yaml" // contains 'ng-y' (the longer form "language-yaml" contains 'ge-y' and would not match)
class="reading-time" // contains 'ng-t'
This caused regular static pages to be misidentified as Angular SPAs, triggering unnecessary wait logic.
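A stricter check could compare whole class tokens and attribute names instead of substrings. The sketch below is an assumption about how such a check might look, not the code in detector.ts: the ElementInfo shape and looksLikeAngular are hypothetical, and the marker lists are the Angular and AngularJS markers I am aware of, not an exhaustive set. CSS attribute selectors match on attribute values rather than on attribute-name prefixes, so the sketch inspects attribute names directly.

```typescript
// Hypothetical shape: the class tokens and attribute names found on one element.
interface ElementInfo {
  classes: string[];
  attributes: string[];
}

// Class names that Angular or AngularJS add as whole tokens.
const ANGULAR_CLASS_TOKENS: string[] = ["ng-scope", "ng-binding", "ng-star-inserted"];

// Attribute names that indicate Angular, matched exactly or by name prefix.
const ANGULAR_EXACT_ATTRS: string[] = ["ng-version", "ng-app"];
const ANGULAR_ATTR_PREFIXES: string[] = ["ng-reflect-", "_ngcontent-", "_nghost-"];

function looksLikeAngular(el: ElementInfo): boolean {
  // Whole-token comparison: "reading-time" never equals "ng-scope".
  const classHit = el.classes.some((c) => ANGULAR_CLASS_TOKENS.indexOf(c) !== -1);
  // Attribute names are compared exactly or by prefix, never by substring.
  const attrHit = el.attributes.some(
    (name) =>
      ANGULAR_EXACT_ATTRS.indexOf(name) !== -1 ||
      ANGULAR_ATTR_PREFIXES.some((p) => name.indexOf(p) === 0)
  );
  return classHit || attrHit;
}

console.log(looksLikeAngular({ classes: ["reading-time"], attributes: [] }));
console.log(looksLikeAngular({ classes: ["ng-scope"], attributes: [] }));
```

With this approach, class names that merely contain ng- no longer count as an Angular signal.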
Bug 3: PDF Code Block Truncation
PDF generation uses independent style files; frontend @media print has no effect.
/* Before - scrollable on screen, truncated in PDF */
pre {
  overflow: auto;
  white-space: pre;
}
/* Fixed - long code auto-wraps */
pre {
  white-space: pre-wrap;
  word-wrap: break-word;
  overflow-wrap: break-word;
}
Fix Strategy: Conservative Fix vs Complete Rollback
| Option | Pros | Cons |
|---|---|---|
| Complete rollback | Simple and direct | Loses valuable precise selectors |
| Conservative fix | Keeps good parts | Requires case-by-case evaluation |
Choice: the conservative fix. Only the dangerous wildcards are removed; the precise selectors stay.
Newly added .toc, .doc-nav, .breadcrumb and other precise selectors help extraction quality and don't need rollback.
Core Insights
Insight 1: CSS Selector Safety Levels
| Level | Selector Type | Safety | Example |
|---|---|---|---|
| Safe | Exact class selector | ✅ | .sidebar |
| Safe | ID selector | ✅ | #sidebar |
| Safe | Tag selector | ✅ | nav, aside |
| Dangerous | Attribute contains selector | ❌ | [class*="sidebar"] |
| Risky | Attribute starts-with selector | ⚠️ | [class^="nav-"] |
Rule: Prefer precise selectors; wildcard selectors must be carefully evaluated for side effects.
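The table can be turned into a rough automated check. The following sketch assumes a hypothetical classifySelector helper that labels a selector by the attribute operators it contains; it is a regex heuristic for review purposes, not a CSS parser, and it can mislabel selectors that carry these operators inside quoted values.

```typescript
type Risk = "safe" | "risky" | "dangerous";

// Classifies a single CSS selector by the attribute operators it uses.
function classifySelector(selector: string): Risk {
  // [attr*="x"] matches any substring of the attribute value.
  if (/\[[^\]]*\*=/.test(selector)) return "dangerous";
  // [attr^="x"] and [attr$="x"] match a prefix or suffix of the whole value.
  if (/\[[^\]]*[\^$]=/.test(selector)) return "risky";
  // Exact class, ID, and tag selectors fall through to here.
  return "safe";
}

const examples: string[] = [".sidebar", "#sidebar", "nav", '[class*="sidebar"]', '[class^="nav-"]'];

for (const sel of examples) {
  console.log(sel, classifySelector(sel));
}
```

A check like this could run over the removal list in a unit test so that any new wildcard has to be justified before it ships.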
Insight 2: Linus's Three Questions for Debugging
- What actually happened? Page blank, only "0%"
- What was expected? Display complete page content
- Where's the difference? Content extraction phase deleted main content
Insight 3: PDF Styles Are Independent of Frontend
Backend PDF generation uses Puppeteer + independent style files:
- src/converter/styles/github.ts
- src/converter/styles/minimal.ts
Frontend globals.css and @media print have no effect on PDF.
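The reason becomes clearer when looking at what the headless browser is actually given. The sketch below assumes the style modules export a CSS string that gets inlined into a standalone HTML document; pdfCodeBlockCss and buildPdfHtml are hypothetical names, and the real modules may be organized differently.

```typescript
// Hypothetical stand-in for the CSS string a backend style module exports.
const pdfCodeBlockCss: string = `
pre {
  white-space: pre-wrap;
  overflow-wrap: break-word;
}
`;

// Builds the standalone HTML document that would be rendered to PDF.
// Only the CSS passed in here is present; no frontend stylesheet is loaded.
function buildPdfHtml(bodyHtml: string, css: string): string {
  return [
    "<!DOCTYPE html>",
    "<html>",
    "<head>",
    '<meta charset="utf-8">',
    `<style>${css}</style>`,
    "</head>",
    `<body>${bodyHtml}</body>`,
    "</html>",
  ].join("\n");
}

const pdfHtml: string = buildPdfHtml("<pre>a very long line of code</pre>", pdfCodeBlockCss);
console.log(pdfHtml);
```

Since the document is assembled on the backend, a rule has to live in the CSS string passed to this step to affect the PDF.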
Fix Checklist
| File | Change | Reason |
|---|---|---|
| src/lib/clean-html.ts | Remove [class*="sidebar"], [class*="toc"], [class*="catalog"] | Wildcard over-matching |
| src/scraper/detector.ts | [class*='ng-'] → [ng-reflect-] | Angular misdetection |
| src/converter/styles/github.ts | Add pre-wrap properties | PDF code wrapping |
| src/converter/styles/minimal.ts | Add pre-wrap properties | PDF code wrapping |
Reusable Patterns
Pattern 1: Scraper Debugging Flow
1. curl direct API test → Frontend or backend issue?
2. Check response content field → Extraction logic issue?
3. git log -S '<key code>' → When was it introduced?
4. git show <commit> → What exactly changed?
5. Create minimal reproduction → Verify fix
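Step 2 of the flow can be automated so that a "success" log with an empty payload is caught early. This is a minimal sketch based on the response shape shown in the background section; looksEmpty and its minChars threshold are hypothetical and chosen only for illustration.

```typescript
// Shape of the extraction response shown in the background section.
interface ExtractResponse {
  content: string;
  contentLength: number;
}

// Flags responses whose visible text is suspiciously short.
// minChars is an arbitrary threshold chosen for illustration.
function looksEmpty(res: ExtractResponse, minChars: number = 50): boolean {
  // Whitespace is ignored so that "\n\n0%" counts as two characters.
  const visible = res.content.replace(/\s+/g, "");
  return visible.length < minChars;
}

// The payload from the blank-page report.
const sample: ExtractResponse = { content: "\n\n0%", contentLength: 5 };
console.log(looksEmpty(sample));
```

A check of this kind could turn the misleading "success" log entry into a warning whenever extraction returns almost nothing.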
Pattern 2: Print/PDF Style Separation
/* Screen: horizontal scroll, preserve code format */
pre {
  overflow: auto;
  white-space: pre;
}
/* PDF/Print: auto-wrap, prevent truncation */
@media print {
  pre {
    white-space: pre-wrap;
    word-wrap: break-word;
  }
}
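Note: Puppeteer's page.pdf() renders with print media by default, so @media print rules do apply, but only to the CSS in the page Puppeteer actually loads (and not at all if the pipeline calls page.emulateMediaType('screen') first). When the PDF is rendered from separate backend style files, as it is here, the style source has to be modified directly.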
Follow-up Actions
- Add CSS selector unit tests covering VuePress, Docusaurus, GitBook, and other common documentation site generators
- Consider adding "conservative mode" for clean-html that only uses tag selectors
- PDF style test cases: verify long code line wrapping
Raw Evidence
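Backend logs:
"signals":["rich_static_content(-20)","angular_component(40)","bundled_scripts(25)"]
VuePress was misdetected as Angular because [class*='ng-'] matches ordinary class names that merely contain ng-: lang-yaml contains ng-y (the longer form language-yaml does not), and reading-time contains ng-t.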
commit d664233 introduced problematic selectors:
+ '[class*="sidebar"]',
+ '[class*="toc"]',
+ '[class*="catalog"]',
Summary
This debugging session taught me:
- Wildcard selectors are a side-effect minefield - [class*="xxx"] matches any class containing that substring
- git log -S is a tracing powerhouse - Quickly locate when code changes were introduced
- PDF styles are independent of frontend - Understand the complete PDF generation pipeline before making changes
- Conservative fixes beat complete rollbacks - Evaluate the value of each change, keep the good parts
Finally, looking at the problem from a data structure perspective:
Selectors aren't "matching rules", they're "data filters". Filters that are too broad will accidentally delete valid data.
Next time writing selectors, ask first: What other class names might this match?