PDF Optimization Journey: From 30 Seconds to 1 Second

A chronicle of optimizing Firecrawl Lite's PDF export feature, covering problem diagnosis, Linus-style thinking, and the "user presence" design philosophy.

"The best code is no code." — Linus Torvalds

This is a story about using first-principles thinking to optimize PDF export from 30 seconds to near-instant. More importantly, it demonstrates how distinguishing between "user present" and "user absent" scenarios leads to the simplest solution.


The Problem

Initial State

Users reported two severe issues with PDF export:

  1. Single-page PDF export took 20-30 seconds (compared to 3-5 seconds for Markdown)
  2. Batch PDF export crashed the server (first file errored, subsequent requests hung)

Error Logs

Error processing https://docs.example.com/page1
Crawl completed with 5 successes and 1 failures

After that, all new requests stalled, and the server entered a deadlock state.


Phase 1: Surface-Level Fix

Problem Diagnosis

Examining the code revealed the issues:

// src/converter/pdf.ts
const page = await pool.acquire(); // Get browser instance
await page.goto(url, { 
  waitUntil: 'networkidle0',  // Issue 1: Wait for ALL network requests to complete
  timeout: 60000              // Issue 2: 60-second timeout is too long
});

Root Causes:

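┌─────────────────┬──────────────────────────────┬──────────────────────────────────┐
│ Config          │ Problem                      │ Impact                           │
├─────────────────┼──────────────────────────────┼──────────────────────────────────┤
│ networkidle0    │ Wait for 0 network requests  │ Slow pages never finish loading  │
├─────────────────┼──────────────────────────────┼──────────────────────────────────┤
│ timeout: 60000  │ 60-second timeout            │ One failure takes 1 minute       │
├─────────────────┼──────────────────────────────┼──────────────────────────────────┤
│ No tryAcquire() │ Block when pool is exhausted │ Cascade failure, server deadlock │
└─────────────────┴──────────────────────────────┴──────────────────────────────────┘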
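
To see why the timeout and the blocking pool compound each other, here is a rough worst-case estimate. It is only a sketch: the function name and the pool size of 2 are assumptions for illustration (the article does not state the real pool size), and it assumes every page in the batch stalls until its navigation timeout.

```typescript
// Worst-case wall time for a batch when every page stalls until the timeout.
// Pages are processed in "rounds" of at most `poolSize` concurrent navigations.
function worstCaseBatchSeconds(
  pages: number,
  poolSize: number,
  timeoutSeconds: number,
): number {
  if (pages < 0 || poolSize < 1 || timeoutSeconds < 0) {
    throw new RangeError("pages >= 0, poolSize >= 1, timeoutSeconds >= 0 required");
  }
  const rounds = Math.ceil(pages / poolSize);
  return rounds * timeoutSeconds;
}

// Six pages (as in the log above) with a hypothetical pool of 2 browsers:
const with60sTimeout = worstCaseBatchSeconds(6, 2, 60); // 180 seconds
const with30sTimeout = worstCaseBatchSeconds(6, 2, 30); // 90 seconds
```

While those slots are held, any other request that blocks on the pool waits just as long, which is how a single slow batch can stall the whole server.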

Quick Fix

// After modification: acquire first, without blocking
const page = pool.tryAcquire();
if (!page) {
  // Graceful degradation: fail fast instead of waiting on a full pool
  return res.status(503).json({ 
    error: 'Server busy',
    retryAfter: 5 
  });
}

await page.goto(url, { 
  waitUntil: 'networkidle2',  // Done once at most 2 connections stay open for 500 ms
  timeout: 30000              // Reduce to 30 seconds
});

Result: Batch export no longer crashes, and single-page PDF time dropped from 30s to 10-15s.
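
The article does not show the pool itself, so the following is only a minimal sketch of what a non-blocking acquire could look like. `SimplePool`, `withPooled`, and `Outcome` are hypothetical names, not Firecrawl Lite's actual code; the point is that `tryAcquire()` returns immediately and the slot is always released.

```typescript
// Hypothetical minimal pool; the real pool implementation is not shown in the article.
class SimplePool<T> {
  private readonly idle: T[];

  constructor(resources: T[]) {
    this.idle = [...resources];
  }

  // Non-blocking: returns null immediately when nothing is idle.
  tryAcquire(): T | null {
    return this.idle.pop() ?? null;
  }

  // Hand a resource back so later requests can use it.
  release(resource: T): void {
    this.idle.push(resource);
  }

  get available(): number {
    return this.idle.length;
  }
}

type Outcome<R> =
  | { ok: true; value: R }
  | { ok: false; status: 503; retryAfter: number };

// Runs `task` with a pooled resource, or reports "busy" without waiting.
async function withPooled<T, R>(
  pool: SimplePool<T>,
  task: (resource: T) => Promise<R>,
): Promise<Outcome<R>> {
  const resource = pool.tryAcquire();
  if (resource === null) {
    return { ok: false, status: 503, retryAfter: 5 };
  }
  try {
    return { ok: true, value: await task(resource) };
  } finally {
    pool.release(resource); // always return the slot, even if the task throws
  }
}
```

A request handler would map the `ok: false` case to the 503 response shown above. The `finally` block matters as much as `tryAcquire()`: a slot that is never released shrinks the pool permanently.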


Phase 2: Linus-Style Thinking

The Essence of the Problem

The fix worked, but I still wasn't satisfied: 10-15 seconds is too slow. I asked myself:

"You're optimizing a problem that might not need to exist. First ask: does this code need to exist?"

Let's analyze user scenarios:

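┌────────────────┬─────────┬────────────────────┬───────────────────────┐
│ Scenario       │ User    │ Trigger            │ Current Solution      │
├────────────────┼─────────┼────────────────────┼───────────────────────┤
│ UI Interaction │ Human   │ Click "Export PDF" │ Server-side Puppeteer │
├────────────────┼─────────┼────────────────────┼───────────────────────┤
│ API Call       │ Program │ format=pdf         │ Server-side Puppeteer │
└────────────────┴─────────┴────────────────────┴───────────────────────┘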

Both scenarios use the same solution—server-side Puppeteer. But do they really need to be the same?

Google's Wisdom

Consider how Google Docs handles this:

  • User clicks "Print" → Calls window.print(), handled natively by browser
  • Program calls export API → Server-side generation

Core Principle: Use client-side when user is present, use server-side when user is absent.

Scenario Analysis

┌────────────────┬─────────┬────────────────┬───────────────┐
│ Scenario       │ User    │ Trigger        │ Best Solution │
├────────────────┼─────────┼────────────────┼───────────────┤
│ UI Interaction │ Human   │ Click "Export" │ Client-side   │
├────────────────┼─────────┼────────────────┼───────────────┤
│ API Call       │ Program │ format=pdf     │ Server-side   │
└────────────────┴─────────┴────────────────┴───────────────┘
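
The split in the table can be written down as a tiny decision function. This is a sketch only: the `Trigger`, `ExportRequest`, and `choosePdfStrategy` names are invented for illustration and do not come from the Firecrawl Lite codebase.

```typescript
// Who initiated the export: a person in the UI or a program via the API.
type Trigger = "ui-click" | "api-call";
type PdfStrategy = "client-side" | "server-side";

interface ExportRequest {
  trigger: Trigger;
  format: "pdf" | "markdown";
}

// "User present" exports run in the user's browser;
// "user absent" exports fall back to server-side rendering.
function choosePdfStrategy(request: ExportRequest): PdfStrategy | null {
  if (request.format !== "pdf") {
    return null; // not a PDF export, so no PDF strategy applies
  }
  return request.trigger === "ui-click" ? "client-side" : "server-side";
}
```

In practice the decision is made by where the code runs (the UI never calls the server for PDFs), but spelling it out makes the rule easy to check.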

Phase 3: Client-Side PDF

The Simplest Solution

If UI scenarios use client-side PDF, the whole implementation can shrink to a single button:

// Minimal implementation
function ExportPdfButton() {
  return (
    <button onClick={() => window.print()}>
      Export PDF
    </button>
  );
}

With @media print styles:

@media print {
  header, nav, .sidebar, .toolbar { display: none; }
  .markdown-content {
    width: 100%;
    margin: 0;
    padding: 20mm;
  }
}

Impact Assessment

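┌────────────────────────┬──────────────────────────┬───────────────────────┐
│ Metric                 │ Before (All Server-side) │ After (Separated)     │
├────────────────────────┼──────────────────────────┼───────────────────────┤
│ UI PDF Export Time     │ 10-30s                   │ < 1s                  │
├────────────────────────┼──────────────────────────┼───────────────────────┤
│ Server CPU Usage       │ High                     │ Reduced 80%+          │
├────────────────────────┼──────────────────────────┼───────────────────────┤
│ Browser Pool Pressure  │ High                     │ Significantly Reduced │
├────────────────────────┼──────────────────────────┼───────────────────────┤
│ Maintenance Complexity │ High (queues/monitoring) │ Simplified            │
└────────────────────────┴──────────────────────────┴───────────────────────┘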

Problems and Iteration

The window.print() approach had two issues:

  1. Print scope too large: Included entire page, not just Markdown content
  2. Extra click required: User needs to click again in print dialog

Improvement: html2pdf.js

Using the html2pdf.js library allows:

  • Selecting only specific DOM elements (markdown rendering area)
  • Directly generating and downloading the PDF, with no print dialog in between

// ui/src/hooks/use-client-pdf.ts
import html2pdf from 'html2pdf.js';

export function useClientPdf({ filename }: { filename: string }) {
  const generatePdf = async (element: HTMLElement) => {
    // Clone element to avoid modifying original DOM
    const clone = element.cloneNode(true) as HTMLElement;

    // Handle CORS issues with external images
    const images = clone.querySelectorAll('img');
    images.forEach((img) => {
      // img.src is always resolved to an absolute URL, so check the raw attribute
      const isLocalImage = (img.getAttribute('src') ?? '').startsWith('/api/images/');
      if (!isLocalImage) {
        // Replace external images with placeholders
        const placeholder = document.createElement('span');
        placeholder.textContent = `[Image: ${img.alt}]`;
        img.replaceWith(placeholder);
      }
    });

    // Illustrative settings; the values used in production are not shown here
    const options = {
      margin: 10, // in the jsPDF unit below (mm)
      filename,
      html2canvas: { scale: 2 },
      jsPDF: { unit: 'mm', format: 'a4', orientation: 'portrait' },
    };

    await html2pdf().set(options).from(clone).save();
  };

  return { generatePdf };
}

CORS Limitations

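html2pdf.js renders the DOM through html2canvas, which cannot reliably draw cross-origin images unless the image host sends CORS headers. As a result, most external images fail to appear in the PDF.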

Solution:

  • Local proxy images (via /api/images/ path) display normally
  • External images show placeholder [Image: alt_text]
  • Prompt users to enable "Download Images" option for complete PDF
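
The local-versus-external decision in the list above can be isolated into two small pure functions, which makes it testable without a browser. This is a sketch under assumptions: `isProxiedImage`, `placeholderFor`, and the example URLs are hypothetical, and only the `/api/images/` prefix and the `[Image: alt_text]` format come from the article.

```typescript
const PROXY_PREFIX = "/api/images/";

// True when `src` resolves to the page's own origin and sits under the proxy path.
function isProxiedImage(src: string, pageUrl: string): boolean {
  let resolved: URL;
  try {
    resolved = new URL(src, pageUrl); // resolves relative paths against the page
  } catch {
    return false; // unparsable URL: treat as external
  }
  return (
    resolved.origin === new URL(pageUrl).origin &&
    resolved.pathname.startsWith(PROXY_PREFIX)
  );
}

// Placeholder text shown instead of an image that cannot be rendered.
function placeholderFor(alt: string | null): string {
  const label = alt && alt.trim() !== "" ? alt.trim() : "no description";
  return `[Image: ${label}]`;
}
```

Resolving the URL first means the check gives the same answer for a relative `src` attribute and for the absolute URL a browser reports through `img.src`.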

Architecture Evolution

Final Architecture

UI User → Click "Export to PDF" → html2pdf.js client-side → Direct download
                                  ↓
                              Zero server cost

API/MCP → POST /api/convert → Puppeteer server-side → Return base64
                              ↓
                          Only for automation scenarios
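
An automation client calling the server-side path has to cope with the 503 "Server busy" response introduced in Phase 1. The helper below is a hedged sketch of that client-side decision: `nextRetryDelayMs`, `BusyBody`, and the limit of 3 attempts are assumptions for illustration, while the `retryAfter` field (in seconds) mirrors the response body shown earlier.

```typescript
// Shape of the 503 body returned by the server when the browser pool is exhausted.
interface BusyBody {
  error: string;
  retryAfter?: number; // seconds
}

// Returns how long to wait before retrying, or null when the caller should stop.
function nextRetryDelayMs(
  status: number,
  body: BusyBody | null,
  attempt: number,
  maxAttempts: number = 3,
): number | null {
  if (status !== 503) {
    return null; // only "server busy" is treated as retryable here
  }
  if (attempt >= maxAttempts) {
    return null; // give up after the configured number of attempts
  }
  const seconds = body?.retryAfter ?? 5; // fall back to 5 s if the field is missing
  return seconds * 1000;
}
```

A caller would loop over attempts, sleep for the returned delay, and surface the error once the function returns null.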

Decision Comparison

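┌────────────────────────────────────┬───────────┬───────────────────────────┐
│ Option                             │ Effort    │ Impact                    │
├────────────────────────────────────┼───────────┼───────────────────────────┤
│ A. Continue optimizing server-side │ 1-2 weeks │ Improve server PDF        │
├────────────────────────────────────┼───────────┼───────────────────────────┤
│ B. Do client-side PDF first        │ 1 day     │ Eliminate 90% of problems │
└────────────────────────────────────┴───────────┴───────────────────────────┘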

Key Insights

Applying Linus's Principle

"The best code is no code"

Client-side PDF isn't a "new feature"—it's "removing a feature"—removing UI scenarios' dependency on server-side PDF.

Thinking Framework

  1. First ask "is it needed?", then ask "how to do it?"
  2. Distinguish scenarios: User present vs. user absent
  3. Simple beats complex: Delete code rather than add code
  4. 80/20 Rule: Use the simplest solution for the common case (here about 90% of requests) and keep the complex solution for the remaining 10% of special cases

Priority Adjustment

┌──────────┬────────────────────────────┬─────────────────────────────┐
│ Priority │ Content                    │ Status                      │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ ~~P0~~   │ ~~Server-side PDF opt~~    │ **Done** (short-term fix)   │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ **P0**   │ **Client-side PDF**        │ **Done**                    │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ P1       │ Observability (general)    │ Retained                    │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ ~~P2~~   │ ~~Heavy browser pool opt~~ │ **Wait and see**            │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ ~~P3~~   │ ~~Request queue~~          │ **Probably not needed**     │
└──────────┴────────────────────────────┴─────────────────────────────┘

Conclusion

This optimization taught me a profound lesson: Good architecture is subtraction, not addition.

We initially planned to spend one to two weeks building complex queue systems, monitoring infrastructure, and browser pool optimizations. But by returning to first principles, we eliminated 90% of the problems with one day of client-side PDF implementation.

For the remaining 10% of API/MCP scenarios, the existing short-term optimizations (networkidle2 + 30-second timeout + 503 graceful degradation) are sufficient.

This is the power of Linus-style thinking: First ask if the problem needs to exist, then decide how to solve it.