"From Distributed Illusion to Resource Isolation: Architecture Evolution"
Reflections on how Firecrawl Lite evolved from an over-engineered distributed solution to a minimalist resource isolation approach.
"Talk is cheap. Show me the data." — Linus Torvalds
This document records how Firecrawl Lite, when it hit a computing-power bottleneck, used first-principles thinking to converge from an "over-engineered" distributed solution to a "minimalist" resource-isolation approach.
1. Background: The 2C2G Dilemma
Firecrawl Lite runs on a 2C2G (2-core, 2 GB) server in Singapore, which hosts not only the scraper API but also other core services.
Core Conflict:
- Main Services: Require low latency, high stability, and a small, stable memory footprint.
- Scraper Service: Puppeteer is a "memory monster"; a single instance takes ~500 MB+, and peak concurrency easily triggers OOM, taking the main services down with it.
2. The Wrong Path: The Temptation of Distributed Architecture
Facing "insufficient computing power" and the temptation of "CNB free computing power", we initially designed a classic Distributed Worker Architecture (feat-distributed-workers):
- Master: Responsible for queue management and task distribution.
- Worker: Deployed in CNB containers, actively pulling tasks via HTTP long polling (to traverse NAT).
- Components: Redis queue, heartbeat detection, auto-scaling, SSRF protection, authentication...
It looked elegant, but it did not survive first-principles scrutiny:
- Complexity Explosion: To solve memory contention, a full set of distributed system complexity (service discovery, state synchronization, fault tolerance) was introduced.
- Asynchronous Trap: A scrape usually needs to return within 3-10 s. Switching to an async queue plus polling makes the user experience drop precipitously.
- YAGNI: Do we really need 10+ concurrency? Or do we just want "no crashes"?
3. Back to Basics: Resource Isolation
Revisiting through first principles, we found the essence of the problem is not "Scalability" but "Resource Isolation".
We don't need infinite computing power; we just need to kick the memory-hungry Puppeteer off the 2C2G server.
Solution Evolution
| Solution | Core Idea | Complexity | Cost | Latency | Evaluation |
|---|---|---|---|---|---|
| A. 503 Throttling | Admit limited resources, reject when pool is full | Very Low | $0 | Low | Current best stopgap, but isolation remains unsolved. |
| B. CNB Worker | Async queue + long polling | High | $0 | High | Over-engineered. Poor async experience. |
| C. Add Server | Physical isolation | Low | $$ | Low | Simplest but costly. Goes against "saving money". |
| D. Cloudflare Browser Rendering | Serverless Browser | Low | $0-5 | Low | High Potential. Limited to 10min/day, paid expansion needed. |
| E. Browserless on CNB | Cloud Native Dev Env + Heartbeat | Medium | $0 | Low | Innovative Solution. Use CNB features for free isolation. |
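Solution A from the table above can be sketched as a tiny in-process concurrency gate. This is a minimal sketch, not the actual implementation; `MAX_BROWSERS` and the acquire/release API are assumptions:

```typescript
// Minimal sketch of Solution A: fail fast with HTTP 503 when the browser pool is
// full, instead of letting Puppeteer OOM the whole 2C2G box.
const MAX_BROWSERS = 2; // assumed budget for a 2C2G server
let active = 0;

// Returns an HTTP-style status: 200 when a browser slot was acquired, 503 when full.
function tryAcquire(): number {
  if (active >= MAX_BROWSERS) return 503; // pool exhausted: reject, don't queue
  active += 1;
  return 200;
}

// Called when a scrape finishes and its browser slot is returned to the pool.
function release(): void {
  active = Math.max(0, active - 1);
}
```

A real handler would call `tryAcquire()` before launching a browser and `release()` in a `finally` block, so a crashed scrape can never leak a slot.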
4. Final Exploration: Browserless on CNB
We discovered that CNB's "Cloud Native Development Environment" has two key features:
- Port Preview: Provides a public address of the form `xxx-port.cnb.run`.
- Recycling Mechanism: Recycled after 10 minutes of inactivity, but can run for up to 18 hours.
This inspired a Hybrid Architecture:
```mermaid
graph TD
    User[User] --> Gateway[2C2G Server]
    subgraph "Resource Isolation Strategy"
        Gateway --"1. Try First"--> Remote[CNB Browserless]
        Gateway --"2. Fallback"--> Local[Local Puppeteer]
    end
    Remote --"WebSocket"--> Gateway
    subgraph "CNB Cloud Native Env"
        Browserless[Browserless Docker]
        KeepAlive[Heartbeat Script]
    end
    KeepAlive --"Every 5min"--> Browserless
    Browserless --"Register Address"--> Gateway
```
Core Advantages
- Zero Cold Start: Use heartbeat script to keep container online (up to 18h).
- Zero Queue: Connect to remote browser directly via WebSocket, returning results synchronously.
- High Availability: When CNB is unavailable, automatically fall back to local Puppeteer (with 503 throttling).
- Minimal Code: Just change `puppeteer.launch()` to `puppeteer.connect()`.
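The try-first/fallback flow is small enough to show inline. In this minimal sketch the two browser acquirers are injected so the control flow stands on its own; in the real service they would be `puppeteer.connect({ browserWSEndpoint })` and `puppeteer.launch()`, and the endpoint itself is an assumption:

```typescript
// Sketch of the hybrid strategy: try the remote CNB browser first, then fall back
// to local Puppeteer. Acquirers are injected so this is testable without a browser.
async function getBrowser<T>(
  connectRemote: () => Promise<T>, // e.g. () => puppeteer.connect({ browserWSEndpoint })
  launchLocal: () => Promise<T>,   // e.g. () => puppeteer.launch()
): Promise<T> {
  try {
    return await connectRemote(); // 1. Try First: remote Browserless over WebSocket
  } catch {
    return await launchLocal();   // 2. Fallback: local Puppeteer (behind 503 throttling)
  }
}
```

Falling back inside one function keeps the caller synchronous: the user still gets a result in a single request, whichever browser served it.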
5. Architectural Philosophy Summary
- Ask "Is it?" before "How to do it": Confirming it's a resource problem rather than a scalability problem avoided building a huge distributed system.
- Prefer the simplest solution that works: sync call > async queue > auto-scaling.
- Leverage existing infrastructure: CNB's development environment is itself a high-quality container runtime; we don't necessarily need Pipeline Triggers.
- Graceful degradation is the system's airbag: No matter how unstable external services are, as long as there is a local fallback, the system is robust.
Next Steps
- feat-remote-browser - Remote Browser Integration Plan (TBD)