API Reference
Simple, predictable web scraping API. No hidden concurrency. No timeouts you can't control.
Core Endpoints
POST /api/scrape - Convert URL to Markdown
Convert any URL to clean, Markdown-formatted text.
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"engine": "auto",
"format": "markdown",
"enableAI": false
}'
Response:
{
"success": true,
"content": "# Page Title\n\nPage content here...",
"markdown": "# Page Title\n\n...",
"metadata": {
"title": "Page Title",
"author": "Author Name",
"description": "Page description",
"images": {
"download": false,
"items": [],
"totalBytes": 0,
"warnings": []
}
},
"engine": "http",
"format": "markdown"
}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | required | URL to scrape |
| engine | string | "auto" | Engine selection: "auto" (smart detection), "http" (fast), "browser" (JS rendering) |
| format | string | "markdown" | Output format: "markdown", "clean.html", "raw.html" |
| includeMetadata | boolean | true | Include YAML frontmatter with metadata |
| enableAI | boolean | false | AI-powered content extraction (experimental) |
| downloadImages | boolean | false | Download images from content to local storage |
Engine Selection
- "auto" (recommended): Tries HTTP first (~170ms), automatically upgrades to browser if content is empty or JavaScript is required
- "http": Lightweight HTTP + HTML parsing. Fast (~170ms), great for static content
- "browser": Full Puppeteer browser rendering. Slower (~5s avg), required for JavaScript-heavy sites
Output Formats
- "markdown": Clean Markdown with YAML frontmatter metadata (recommended)
- "clean.html": Cleaned HTML without scripts/styles for custom processing
- "raw.html": Original unprocessed HTML for debugging
Image Download
When downloadImages: true:
- Images are downloaded from the main content area
- Stored in IMAGE_DOWNLOAD_DIR (default: ./data/images)
- Metadata exposes image information:
  - originalUrl: Source URL
  - localPath: Downloaded file path
  - totalBytes: Total bytes downloaded
  - warnings: Any limit hits (e.g., max images reached)
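An illustrative metadata.images object with one downloaded image (values are examples only; the exact local file naming scheme is an assumption):
"images": {
"download": true,
"items": [
{
"originalUrl": "https://example.com/hero.png",
"localPath": "./data/images/hero.png"
}
],
"totalBytes": 48213,
"warnings": []
}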
POST /api/crawl - Recursive Site Crawl
Systematically crawl and extract content from multiple pages starting from a single URL.
curl -X POST http://localhost:3000/api/crawl \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com",
"maxDepth": 2,
"maxPages": 50,
"pathPrefix": "/docs/",
"engine": "auto",
"format": "markdown"
}'
Response (JSON mode):
{
"success": true,
"pages": [
{
"url": "https://docs.example.com/guide",
"content": "# Guide\n...",
"markdown": "# Guide\n...",
"metadata": { ... }
}
],
"stats": {
"total": 10,
"completed": 10,
"failed": 0
}
}
Real-Time Progress (SSE)
Send the Accept: text/event-stream header to stream progress events:
curl -X POST http://localhost:3000/api/crawl \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"url": "https://docs.example.com",
"maxPages": 50
}'
Server streams:
data: {"type":"progress","completed":1,"total":50,"currentUrl":"https://..."}
data: {"type":"progress","completed":2,"total":50,"currentUrl":"https://..."}
...
data: {"type":"complete","completed":50,"total":50}
data: {"type":"result","data":{"success":true,"pages":[...]}}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | required | Starting URL for crawl |
| maxDepth | number | 3 | Max link depth (1 = only direct links) |
| maxPages | number | 50 | Max total pages to crawl |
| pathPrefix | string | - | Only crawl URLs matching this path (e.g., /docs/) |
| engine | string | "auto" | Engine selection (auto, http, browser) |
| format | string | "markdown" | Output format |
| includeMetadata | boolean | true | Include metadata |
| downloadImages | boolean | false | Download images for each page |
POST /api/convert - Convert HTML to Markdown
Convert HTML files to Markdown with optional custom CSS selectors.
curl -X POST http://localhost:3000/api/convert \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@page.html" \
-F 'selectors={"title":".article-title","content":".article-body"}'
Response:
{
"success": true,
"content": "# Title\n\nContent here...",
"html": "<h1>Title</h1><p>Content here...</p>",
"title": "Article Title"
}
Custom Selectors
Extract specific page sections using CSS selectors:
{
"title": ".article-title",
"content": ".article-body"
}
Common patterns:
- .classname - Extract by class
- #idname - Extract by ID
- article > div - Child combinators
- div[data-post] - Attribute selectors
Find selectors:
- Open HTML in browser
- Right-click element → Inspect
- Copy the class or id attribute
- Use it in the API call
GET /api/health - Health Check
Simple health check endpoint.
curl http://localhost:3000/api/health
Response:
{
"status": "ok",
"version": "1.9.4"
}
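A common use is waiting for the service to become ready before sending requests (minimal sketch):
# Poll the health endpoint until the service responds
until curl -sf http://localhost:3000/api/health > /dev/null; do
  sleep 1
done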
Request/Response Patterns
Authentication
Set the API_KEY environment variable to enable authentication:
API_KEY=your-secret-key pnpm dev
Then include the Authorization header:
curl -H "Authorization: Bearer your-secret-key" \
-X POST http://localhost:3000/api/scrape ...
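A complete authenticated request looks like this (sketch; substitute your own key and URL):
curl -X POST http://localhost:3000/api/scrape \
-H "Authorization: Bearer your-secret-key" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'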
Error Handling
All endpoints return success: false on error:
{
"success": false,
"error": "Error message",
"statusCode": 400
}
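A minimal shell sketch for handling errors in a script, checking the documented success flag:
response=$(curl -s -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}')
if [ "$(echo "$response" | jq -r '.success')" != "true" ]; then
  echo "Scrape failed: $(echo "$response" | jq -r '.error')" >&2
fi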
Content Validation
Markdown output includes YAML frontmatter:
---
title: Page Title
author: Author Name
description: Page Description
created: 2025-01-01T12:00:00Z
---
# Page Title
Content here...
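If you only want the body, a small sketch for stripping the frontmatter (assuming the frontmatter block appears at the top of the content field when includeMetadata is true):
curl -s -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}' \
| jq -r '.content' \
| awk '/^---$/ { n++; next } n >= 2'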
Performance Guidelines
| Scenario | Engine | Speed | Best For |
|---|---|---|---|
| Static HTML | HTTP | ~170ms | Documentation, blogs |
| JavaScript-heavy | Browser | ~5s | SPAs, dynamic content |
| Mixed sites | Auto | ~170ms-5s | Automatic, recommended |
Hybrid Crawling Strategy
For large documentation sites with JavaScript navigation:
- Phase 1: Single browser render to extract link map (~5s)
- Phase 2: Batch HTTP crawl for all links (~170ms each)
- Result: 10x faster than full JS crawl with 100% coverage
# Extract all links via a single browser render (1 page, ~5s);
# the content is Markdown, so pull the link targets from [text](url) syntax
curl -s -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://docs.example.com", "engine": "browser"}' \
| jq -r '.content' | grep -oP '\]\(\K[^)]+' | sort -u > links.txt
# Batch crawl all links (N pages × ~170ms)
while read -r link; do
curl -s -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d "{\"url\":\"$link\"}"
done < links.txt
Configuration
Environment Variables
PORT=3000 # Server port
MAX_BROWSERS=5 # Concurrent browser instances
API_KEY= # Optional auth (leave empty to disable)
IMAGE_DOWNLOAD_DIR=./data/images # Image storage location
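For example, to run with a smaller browser pool and authentication enabled (sketch; the port and key values are placeholders):
PORT=8080 MAX_BROWSERS=2 API_KEY=your-secret-key pnpm dev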
Per-Request Limits
These settings apply per request:
- maxDepth (crawl): Default 3, max recommended 10
- maxPages (crawl): Default 50, max recommended 500
- Image download: Limited by storage space and bandwidth
Examples
Extract Article from Blog
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://blog.example.com/article",
"format": "markdown",
"includeMetadata": true
}' | jq '.content'
Crawl Documentation Site
curl -X POST http://localhost:3000/api/crawl \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com/guide/",
"pathPrefix": "/guide/",
"maxPages": 100,
"engine": "auto"
}' | jq '.pages | length'
Download Images with Content
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"downloadImages": true
}' | jq '.metadata.images'
Convert HTML File
curl -X POST http://localhost:3000/api/convert \
-F "file=@article.html" \
-F 'selectors={"content":".post-body"}' \
| jq '.content'
Troubleshooting
| Problem | Solution |
|---|---|
| Slow responses | Use engine: "http" (170ms vs 5s) |
| Missing content | Try engine: "browser" |
| Empty responses | Check URL is accessible, might need browser engine |
| Large responses | Use custom selectors to extract only needed content |
| Memory growing | Restart service daily, reduce MAX_BROWSERS |
Rate Limits
No built-in rate limiting. Configure at reverse proxy level (e.g., Caddy, Nginx).
Recommended limits:
- Per IP: 100 requests/minute
- Crawl requests: 10 active crawls/IP
- Total: Scale based on MAX_BROWSERS and memory
See Also
- Getting Started - Basic usage
- Crawling - UI-based crawling
- Deployment Guide - Production setup