"PDF Optimization Journey: From 30 Seconds to 1 Second"
A chronicle of optimizing Firecrawl Lite's PDF export feature, covering problem diagnosis, Linus-style thinking, and the "user presence" design philosophy.
"The best code is no code." — Linus Torvalds
This is a story about using first-principles thinking to optimize PDF export from 30 seconds to near-instant. More importantly, it demonstrates how distinguishing between "user present" and "user absent" scenarios leads to the simplest solution.
The Problem
Initial State
Users reported two severe issues with PDF export:
- A single-page PDF export took 20-30 seconds (versus 3-5 seconds for Markdown)
- Batch PDF export crashed the server (the first file errored, and all subsequent requests hung)
Error Logs
Error processing https://docs.example.com/page1
Crawl completed with 5 successes and 1 failures
After that, all new requests stalled, and the server entered a deadlock state.
Phase 1: Surface-Level Fix
Problem Diagnosis
Examining the code revealed the issues:
// src/converter/pdf.ts
const page = await pool.acquire(); // Get browser instance
await page.goto(url, {
waitUntil: 'networkidle0', // Issue 1: Wait for ALL network requests to complete
timeout: 60000 // Issue 2: 60-second timeout is too long
});
Root Causes:
| Config | Problem | Impact |
|---|---|---|
| `networkidle0` | Waits until there are zero in-flight network requests | Slow pages never finish loading |
| `timeout: 60000` | 60-second timeout | A single failure takes a full minute |
| No `tryAcquire()` | Blocks when the pool is exhausted | Cascading failure, server deadlock |
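The third row is the one that deadlocked the server. A minimal sketch of the non-blocking pool idea (the class and names here are illustrative, not Firecrawl Lite's actual pool implementation):

```typescript
// Illustrative non-blocking resource pool; not the project's real code.
class PagePool<T> {
  private available: T[];

  constructor(resources: T[]) {
    this.available = [...resources];
  }

  // Non-blocking: returns null immediately when exhausted, so the caller
  // can answer 503 instead of queueing forever behind a stuck request.
  tryAcquire(): T | null {
    return this.available.pop() ?? null;
  }

  release(resource: T): void {
    this.available.push(resource);
  }
}
```

A blocking `acquire()` under load produces exactly the failure mode in the error logs: every waiter holds a request slot while waiting for a browser page that never comes back.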
Quick Fix
// After modification
await page.goto(url, {
waitUntil: 'networkidle2', // Done once no more than 2 network connections remain
timeout: 30000 // Reduce timeout to 30 seconds
});
// Add graceful degradation
const page = pool.tryAcquire();
if (!page) {
return res.status(503).json({
error: 'Server busy',
retryAfter: 5
});
}
Result: batch export no longer crashes, and single-page PDF time drops from ~30s to 10-15s.
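On the client side, the 503 + `retryAfter` contract can be honored with a small backoff loop. A sketch with the transport injected so the logic stays testable (all names here are assumptions, not the project's actual client):

```typescript
// Hypothetical client-side retry honoring the server's 503 + retryAfter hint.
type ConvertResponse = { status: number; retryAfter?: number };

async function convertWithRetry(
  send: () => Promise<ConvertResponse>, // injected transport (e.g. wraps fetch)
  sleep: (ms: number) => Promise<void>, // injected delay, mockable in tests
  maxAttempts = 3,
): Promise<ConvertResponse> {
  let last: ConvertResponse = { status: 503 };
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = await send();
    if (last.status !== 503) return last; // success or hard error: stop retrying
    await sleep((last.retryAfter ?? 5) * 1000); // default to 5s if no hint
  }
  return last; // still busy after all attempts
}
```

The key point is that a 503 with a retry hint is a contract: the server refuses quickly instead of hanging, and the client decides whether waiting is worth it.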
Phase 2: Linus-Style Thinking
The Essence of the Problem
Still not satisfied after the fix. 10-15 seconds is still too slow. I asked myself:
"You're optimizing a problem that might not need to exist. First ask: does this code need to exist?"
Let's analyze user scenarios:
| Scenario | User | Trigger | Current Solution |
|---|---|---|---|
| UI Interaction | Human | Click "Export PDF" | Server-side Puppeteer |
| API Call | Program | `format=pdf` | Server-side Puppeteer |
Both scenarios use the same solution—server-side Puppeteer. But do they really need to be the same?
Google's Wisdom
Observing how Google Docs handles this:
- User clicks "Print" → calls `window.print()`, handled natively by the browser
- Program calls the export API → server-side generation
Core Principle: Use client-side when user is present, use server-side when user is absent.
Scenario Analysis
┌──────────────────┬───────┬─────────────────┬──────────────┐
│ Scenario │ User │ Trigger │ Best Solution│
├──────────────────┼───────┼─────────────────┼──────────────┤
│ **UI Interaction**│ Human │ Click "Export" │ Client-side │
├──────────────────┼───────┼─────────────────┼──────────────┤
│ **API Call** │ Program│ `format=pdf` │ Server-side │
└──────────────────┴───────┴─────────────────┴──────────────┘
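The table above reduces to a one-line routing decision. A sketch of that rule (names are illustrative, not project code):

```typescript
// Illustrative routing rule: a human-triggered export stays in the browser;
// only unattended automation pays the server-side rendering cost.
type PdfStrategy = 'client' | 'server';

function choosePdfStrategy(trigger: 'ui-click' | 'api-call'): PdfStrategy {
  return trigger === 'ui-click' ? 'client' : 'server';
}
```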
Phase 3: Client-Side PDF
The Simplest Solution
If UI scenarios use client-side PDF, we can:
// Minimal implementation
function ExportPdfButton() {
return (
<button onClick={() => window.print()}>
Export PDF
</button>
);
}
With @media print styles:
@media print {
header, nav, .sidebar, .toolbar { display: none; }
.markdown-content {
width: 100%;
margin: 0;
padding: 20mm;
}
}
Impact Assessment
| Metric | Before (All Server-side) | After (Separated) |
|---|---|---|
| UI PDF Export Time | 10-30s | < 1s |
| Server CPU Usage | High | Reduced 80%+ |
| Browser Pool Pressure | High | Significantly Reduced |
| Maintenance Complexity | High (queues/monitoring) | Simplified |
Problems and Iteration
The `window.print()` approach had two issues:
- Print scope too broad: the dialog captured the entire page, not just the Markdown content
- Extra click required: the user still had to confirm in the browser's print dialog
Improvement: html2pdf.js
Using the html2pdf.js library allows:
- Selecting only specific DOM elements (markdown rendering area)
- Directly generating PDF and downloading, no user interaction needed
// ui/src/hooks/use-client-pdf.ts
import html2pdf from 'html2pdf.js';

interface UseClientPdfOptions {
  filename: string;
}

export function useClientPdf({ filename }: UseClientPdfOptions) {
  const generatePdf = async (element: HTMLElement) => {
    // Clone the element to avoid modifying the original DOM
    const clone = element.cloneNode(true) as HTMLElement;

    // Handle CORS issues with external images
    const images = clone.querySelectorAll('img');
    images.forEach((img) => {
      const isLocalImage = img.src.startsWith('/api/images/');
      if (!isLocalImage) {
        // Replace external images with text placeholders
        const placeholder = document.createElement('span');
        placeholder.textContent = `[Image: ${img.alt}]`;
        img.replaceWith(placeholder);
      }
    });

    const options = {
      filename,
      margin: 10, // page margin in millimeters
      html2canvas: { useCORS: true },
      jsPDF: { unit: 'mm', format: 'a4' },
    };

    await html2pdf().set(options).from(clone).save();
  };

  return { generatePdf };
}
CORS Limitations
html2pdf.js renders the DOM through html2canvas, and cross-origin images that lack CORS headers cannot be drawn onto the export canvas.
Solution:
- Local proxied images (served via `/api/images/path`) display normally
- External images are replaced with a `[Image: alt_text]` placeholder
- Users are prompted to enable the "Download Images" option for a complete PDF
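The image-handling rule above can be isolated into a pure helper. A sketch (the function name is an assumption; only the `/api/images/` prefix and placeholder format come from the project):

```typescript
// Decide how the client-side exporter treats each <img> before rendering.
type ImageDecision = { keep: true } | { keep: false; placeholder: string };

function classifyImageSrc(src: string, alt: string): ImageDecision {
  // Images served through the local proxy are same-origin and render normally.
  if (src.startsWith('/api/images/')) {
    return { keep: true };
  }
  // Cross-origin images would fail to render; swap in a text placeholder.
  return { keep: false, placeholder: `[Image: ${alt}]` };
}
```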
Architecture Evolution
Final Architecture
UI user  → click "Export to PDF" → html2pdf.js (client-side) → direct download
                                            ↓
                                     zero server cost

API/MCP  → POST /api/convert → Puppeteer (server-side) → return base64
                                            ↓
                                only for automation scenarios
Decision Comparison
| Option | Effort | Impact |
|---|---|---|
| A. Continue optimizing server-side | 1-2 weeks | Improve server PDF |
| B. Do client-side PDF first | 1 day | Eliminate 90% of problems |
Key Insights
Applying Linus's Principle
"The best code is no code"
Client-side PDF isn't a "new feature"; it is the removal of one: UI scenarios no longer depend on server-side PDF at all.
Thinking Framework
- First ask "is it needed?", then ask "how to do it?"
- Distinguish scenarios: User present vs. user absent
- Simple beats complex: Delete code rather than add code
- 80/20 Rule: use the simplest solution for the ~90% of common requests; keep the complex solution for the ~10% of special cases
Priority Adjustment
┌──────────┬────────────────────────────┬─────────────────────────────┐
│ Priority │ Content │ Status │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ ~~P0~~ │ ~~Server-side PDF opt~~ │ **Done** (short-term fix) │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ **P0** │ **Client-side PDF** │ **Done** │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ P1 │ Observability (general) │ Retained │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ ~~P2~~ │ ~~Heavy browser pool opt~~ │ **Wait and see** │
├──────────┼────────────────────────────┼─────────────────────────────┤
│ ~~P3~~ │ ~~Request queue~~ │ **Probably not needed** │
└──────────┴────────────────────────────┴─────────────────────────────┘
Conclusion
This optimization taught me a profound lesson: Good architecture is subtraction, not addition.
We initially planned to spend one to two weeks building complex queue systems, monitoring infrastructure, and browser pool optimizations. But by returning to first principles, we eliminated 90% of the problems with one day of client-side PDF implementation.
For the remaining 10% of API/MCP scenarios, the existing short-term optimizations (networkidle2 + 30-second timeout + 503 graceful degradation) are sufficient.
This is the power of Linus-style thinking: First ask if the problem needs to exist, then decide how to solve it.