CSS Selector Side Effects & Diagnosis Methods

Wildcard selector pitfalls, git log -S tracing, and PDF style independence

Author: Blue
Updated: 2024-12-27
Based on: VuePress blank page bug → Angular misdetection → PDF code truncation

This is a diagnostic journey from a blank page to the root cause. It records the core insights so that similar issues can be located quickly next time.


Background: VuePress Page Scraping Returns Blank

User feedback: after scraping the VuePress docs, the frontend shows a blank page with only a "0%" progress bar.

The backend logs reported "success", but the returned content was nearly empty:

{
  "content": "\n\n0%",
  "contentLength": 5
}

Diagnosis Process: Three-Step Localization

Step 1: curl Isolation Test

curl -X POST http://localhost:3456/api/extract \
  -H "Content-Type: application/json" \
  -d '{"url":"https://vuepress.vuejs.org/guide/"}'

Conclusion: the backend itself returns empty content, so this is not a frontend rendering issue.

Step 2: git log -S to Trace Changes

# Find commit that introduced [class*="sidebar"]
git log -S '[class*="sidebar"]' --oneline -- src/lib/clean-html.ts

Located commit d664233.

Step 3: git show to View Specific Changes

git show d664233 -- src/lib/clean-html.ts

Found three new wildcard selectors:

'[class*="sidebar"]',
'[class*="toc"]',
'[class*="catalog"]',

Root Cause Analysis

Bug 1: Sidebar Selector Over-matching

/* Problematic code */
'[class*="sidebar"]'

/* Expected matches */
class="sidebar"
class="left-sidebar"

/* Unintended match - VuePress main content area! */
class="vp-content has-sidebar"

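VuePress uses has-sidebar as a layout flag on the main content container, not on an actual sidebar. Because [class*="sidebar"] matches any class attribute that contains the substring "sidebar", the cleaner removed the entire main content.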
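
The over-match can be reproduced without a DOM. The following is a minimal sketch, assuming class attributes are available as plain strings; the helper names containsMatch and tokenMatch are hypothetical and do not come from clean-html.ts. It contrasts the substring test performed by [class*="sidebar"] with the whole-token test performed by .sidebar.

```typescript
// Hypothetical helpers that mimic CSS matching on a raw class attribute value.

// [class*="needle"]: substring test against the whole attribute string.
function containsMatch(classAttr: string, needle: string): boolean {
  return needle.length > 0 && classAttr.includes(needle);
}

// .needle: true only if one whitespace-separated class token equals needle.
function tokenMatch(classAttr: string, needle: string): boolean {
  return classAttr.split(/\s+/).filter(Boolean).includes(needle);
}

const samples = ["sidebar", "left-sidebar", "vp-content has-sidebar"];
for (const cls of samples) {
  // Prints the class value, then the substring result, then the token result.
  console.log(cls, containsMatch(cls, "sidebar"), tokenMatch(cls, "sidebar"));
}
```

Only the substring test hits "vp-content has-sidebar". The token test also skips "left-sidebar", so a precise rule set would have to list .left-sidebar explicitly, which is the price of avoiding accidental matches.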

Bug 2: Angular Misdetection

// Problematic code
$("[class*='ng-']").length > 0

// Expected matches
class="ng-scope"  // Angular class

// Unintended matches (any class containing the substring 'ng-')
class="reading-time"   // contains 'ng-t'
class="lang-yaml"      // contains 'ng-y' (note: 'language-yaml' does not contain 'ng-')

This caused regular static pages to be misidentified as Angular SPAs, triggering unnecessary wait logic.
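
One way to avoid this false positive is to inspect attribute names instead of class substrings. The sketch below assumes the detector can collect the attribute names of the elements on a page; the function names and the marker list are illustrative and are not the project's actual detector.ts. It relies on attributes Angular is known to emit: ng-version on the root component, _nghost-* and _ngcontent-* for emulated style encapsulation, and ng-reflect-* in development builds. CSS attribute selectors cannot wildcard an attribute name, so a plain string check is used.

```typescript
// Illustrative sketch: decide "Angular or not" from attribute names,
// not from class substrings. The marker list is an assumption.
const ANGULAR_ATTR_PREFIXES = ["ng-reflect-", "_nghost-", "_ngcontent-"];

function looksLikeAngular(attributeNames: string[]): boolean {
  return attributeNames.some(
    (attr) =>
      attr === "ng-version" ||
      ANGULAR_ATTR_PREFIXES.some((prefix) => attr.startsWith(prefix)),
  );
}

// The old check in spirit: any class attribute containing "ng-".
function oldClassCheck(classAttr: string): boolean {
  return classAttr.includes("ng-");
}

console.log(oldClassCheck("reading-time")); // true: false positive on a static page
console.log(looksLikeAngular(["class", "id", "href"])); // false
console.log(looksLikeAngular(["class", "ng-version"])); // true
```

Because production Angular builds may not emit ng-reflect-* attributes, checking several markers is likely more robust than relying on a single one.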

Bug 3: PDF Code Block Truncation

PDF generation uses its own style files, so the frontend's @media print rules have no effect on the PDF.

/* Before - scrollable on screen, truncated in PDF */
pre {
  overflow: auto;
  white-space: pre;
}

/* Fixed - long code auto-wraps */
pre {
  white-space: pre-wrap;
  word-wrap: break-word;
  overflow-wrap: break-word;
}

Fix Strategy: Conservative Fix vs Complete Rollback

Option            | Pros              | Cons
Complete rollback | Simple and direct | Loses valuable precise selectors
Conservative fix  | Keeps good parts  | Requires case-by-case evaluation

Choice: the conservative fix. Only remove the dangerous wildcards and keep the precise selectors.

The newly added precise selectors (.toc, .doc-nav, .breadcrumb, and others) improve extraction quality and do not need to be rolled back.


Core Insights

Insight 1: CSS Selector Safety Levels

Safety    | Selector type                  | Example
Safe      | Exact class selector           | .sidebar
Safe      | ID selector                    | #sidebar
Safe      | Tag selector                   | nav, aside
Dangerous | Attribute contains selector    | [class*="sidebar"]
Dangerous | Attribute starts-with selector | [class^="nav-"]

Rule: Prefer precise selectors; wildcard selectors must be carefully evaluated for side effects.
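
The rule above can be partly enforced with a small lint step over the removal list. This is a minimal sketch under stated assumptions: classifySelector and the regular expression are hypothetical, and treating ends-with matching ([attr$=...]) as dangerous is an extension beyond the two cases listed in the table.

```typescript
// Hypothetical lint helper: flag attribute selectors that do partial matching.
type Risk = "safe" | "dangerous";

// Matches [attr*=...], [attr^=...], [attr$=...] (contains, starts-with, ends-with).
const PARTIAL_ATTR_MATCH = /\[\s*[\w-]+\s*[*^$]=/;

function classifySelector(selector: string): Risk {
  return PARTIAL_ATTR_MATCH.test(selector) ? "dangerous" : "safe";
}

const removalSelectors = [
  ".sidebar",
  "#sidebar",
  "nav",
  '[class*="sidebar"]',
  '[class^="nav-"]',
];
for (const s of removalSelectors) {
  console.log(s, "=>", classifySelector(s));
}
```

A check like this does not prove a selector is harmless; it only forces a deliberate review whenever a partial-match selector is added.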

Insight 2: Linus's Three Questions for Debugging

  1. What actually happened? Page blank, only "0%"
  2. What was expected? Display complete page content
  3. Where's the difference? Content extraction phase deleted main content

Insight 3: PDF Styles Are Independent of Frontend

Backend PDF generation uses Puppeteer + independent style files:

  • src/converter/styles/github.ts
  • src/converter/styles/minimal.ts

Frontend globals.css and @media print have no effect on PDF.
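
Because the PDF is rendered from HTML that the backend assembles itself, the fix has to live in the CSS the backend injects. The following is a rough sketch of that idea, assuming a style module exports a CSS string and the converter inlines it into a standalone HTML document. PDF_CSS and buildPdfHtml are hypothetical names; the actual contents of github.ts and minimal.ts are not shown in this note.

```typescript
// Hypothetical stand-in for what a style module such as styles/github.ts might export.
const PDF_CSS = `
pre {
  white-space: pre-wrap;      /* wrap long lines instead of clipping them */
  overflow-wrap: break-word;  /* allow breaks inside long unbroken tokens */
}
`;

// Build the standalone HTML document handed to the PDF renderer.
// Only the CSS passed in here reaches the PDF; frontend stylesheets are never loaded.
function buildPdfHtml(bodyHtml: string, css: string): string {
  return [
    "<!doctype html>",
    "<html>",
    `<head><meta charset="utf-8"><style>${css}</style></head>`,
    `<body>${bodyHtml}</body>`,
    "</html>",
  ].join("\n");
}

const pdfHtml = buildPdfHtml("<pre>a very long line of code ...</pre>", PDF_CSS);
console.log(pdfHtml.includes("white-space: pre-wrap")); // true
```

In a Puppeteer-based pipeline, a string like this would typically be passed to page.setContent() before calling page.pdf(), which is why editing globals.css changes nothing in the output.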


Fix Checklist

File                            | Change                                                         | Reason
src/lib/clean-html.ts           | Remove [class*="sidebar"], [class*="toc"], [class*="catalog"] | Wildcard over-matching
src/scraper/detector.ts         | [class*='ng-'] → [ng-reflect-]                                 | Angular misdetection
src/converter/styles/github.ts  | Add pre-wrap properties                                        | PDF code wrapping
src/converter/styles/minimal.ts | Add pre-wrap properties                                        | PDF code wrapping

Reusable Patterns

Pattern 1: Scraper Debugging Flow

1. curl direct API test → Frontend or backend issue?
2. Check response content field → Extraction logic issue?
3. git log -S '<key code>' → When was it introduced?
4. git show <commit> → What exactly changed?
5. Create minimal reproduction → Verify fix
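
Step 2 of this flow can be automated so that a "success" with no real content is flagged early. This is a minimal sketch, assuming the API response carries the content field shown in the Background section; the function name, the 200-character threshold, and the progress-bar pattern are illustrative guesses, not values from the project.

```typescript
// Shape assumed from the response shown earlier in this note.
interface ExtractResult {
  content: string;
}

// Hypothetical sanity check: "success" with almost no text is treated as a failure.
function isSuspiciouslyEmpty(result: ExtractResult, minChars = 200): boolean {
  const text = result.content.replace(/\s+/g, " ").trim();
  // Text that is only a percentage (e.g. a leftover progress bar) carries no content.
  const onlyProgress = /^\d{1,3}%$/.test(text);
  return onlyProgress || text.length < minChars;
}

console.log(isSuspiciouslyEmpty({ content: "\n\n0%" })); // true
```

Logging a warning when this check fires would have surfaced the problem in the backend logs instead of leaving it to user feedback.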

Pattern 2: Print/PDF Style Separation

/* Screen: horizontal scroll, preserve code format */
pre {
  overflow: auto;
  white-space: pre;
}

/* PDF/Print: auto-wrap, prevent truncation */
@media print {
  pre {
    white-space: pre-wrap;
    word-wrap: break-word;
  }
}

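But note: Puppeteer's page.pdf() renders with print media by default, so an @media print rule applies only if the stylesheet containing it is actually loaded into the page being printed and the pipeline has not switched to screen media with page.emulateMediaType('screen'). In this project the PDF is built from separate style files, so the style source has to be modified directly.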


Follow-up Actions

  • Add CSS selector unit tests covering VuePress, Docusaurus, GitBook, and other common documentation site generators
  • Consider adding "conservative mode" for clean-html that only uses tag selectors
  • PDF style test cases: verify long code line wrapping

Raw Evidence

Backend logs:

"signals":["rich_static_content(-20)","angular_component(40)","bundled_scripts(25)"]

VuePress was misdetected as Angular because ordinary class names contain the substring ng- (for example, reading-time contains ng-t).

commit d664233 introduced problematic selectors:

+  '[class*="sidebar"]',
+  '[class*="toc"]',
+  '[class*="catalog"]',

Summary

This debugging session taught me:

  1. Wildcard selectors are a side-effect minefield - [class*="xxx"] matches any class containing that substring
  2. git log -S is a tracing powerhouse - Quickly locate when code changes were introduced
  3. PDF styles are independent of frontend - Understand the complete PDF generation pipeline before making changes
  4. Conservative fixes beat complete rollbacks - Evaluate the value of each change, keep the good parts

Finally, looking at the problem from a data structure perspective:

Selectors aren't "matching rules", they're "data filters". Filters that are too broad will accidentally delete valid data.

Next time writing selectors, ask first: What other class names might this match?