API Reference

Simple, predictable web scraping API. No hidden concurrency. No timeouts you can't control.

Core Endpoints

POST /api/scrape - Convert URL to Markdown

Convert any URL to clean, Markdown-formatted text.

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "engine": "auto",
    "format": "markdown",
    "enableAI": false
  }'

Response:

{
  "success": true,
  "content": "# Page Title\n\nPage content here...",
  "markdown": "# Page Title\n\n...",
  "metadata": {
    "title": "Page Title",
    "author": "Author Name",
    "description": "Page description",
    "images": {
      "download": false,
      "items": [],
      "totalBytes": 0,
      "warnings": []
    }
  },
  "engine": "http",
  "format": "markdown"
}
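
For programmatic use, a minimal TypeScript client for this endpoint might look like the sketch below (Node 18+ ES module, so the global fetch and top-level await are available; the response interface mirrors the fields shown above):

interface ScrapeResponse {
  success: boolean;
  content: string;
  markdown?: string;
  metadata?: Record<string, unknown>;
  engine?: string;
  format?: string;
  error?: string;
}

// POST a URL to /api/scrape and return the parsed JSON body.
async function scrape(url: string, options: Record<string, unknown> = {}): Promise<ScrapeResponse> {
  const res = await fetch("http://localhost:3000/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, engine: "auto", format: "markdown", ...options }),
  });
  return (await res.json()) as ScrapeResponse;
}

// Usage
const page = await scrape("https://example.com");
console.log(page.markdown ?? page.content);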

Parameters

Parameter        Type     Default      Description
url              string   required     URL to scrape
engine           string   "auto"       Engine selection: "auto" (smart detection), "http" (fast), "browser" (JS rendering)
format           string   "markdown"   Output format: "markdown", "clean.html", "raw.html"
includeMetadata  boolean  true         Include YAML frontmatter with metadata
enableAI         boolean  false        AI-powered content extraction (experimental)
downloadImages   boolean  false        Download images from content to local storage

Engine Selection

  • "auto" (recommended): Tries HTTP first (~170ms), automatically upgrades to browser if content is empty or JavaScript is required
  • "http": Lightweight HTTP + HTML parsing. Fast (~170ms), great for static content
  • "browser": Full Puppeteer browser rendering. Slower (~5s avg), required for JavaScript-heavy sites

Output Formats

  • "markdown": Clean Markdown with YAML frontmatter metadata (recommended)
  • "clean.html": Cleaned HTML without scripts/styles for custom processing
  • "raw.html": Original unprocessed HTML for debugging

Image Download

When downloadImages: true:

  • Images are downloaded from the main content area
  • Stored in IMAGE_DOWNLOAD_DIR (default: ./data/images)
  • Metadata exposes image information:
    • originalUrl: Source URL
    • localPath: Downloaded file path
    • totalBytes: Total bytes downloaded
    • warnings: Any limit hits (e.g., max images reached)
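
A quick TypeScript sketch of reading those fields from a response (Node 18+ ES module; field names follow the list above):

// Scrape with image download enabled and report what was stored.
const res = await fetch("http://localhost:3000/api/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com", downloadImages: true }),
});
const { metadata } = await res.json();
const images = metadata?.images ?? { items: [], totalBytes: 0, warnings: [] };

console.log(`Downloaded ${images.items.length} images, ${images.totalBytes} bytes`);
for (const item of images.items) console.log(item.originalUrl, "->", item.localPath);
for (const warning of images.warnings) console.warn("limit warning:", warning);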

POST /api/crawl - Recursive Site Crawl

Systematically crawl and extract content from multiple pages starting from a single URL.

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "maxPages": 50,
    "pathPrefix": "/docs/",
    "engine": "auto",
    "format": "markdown"
  }'

Response (JSON mode):

{
  "success": true,
  "pages": [
    {
      "url": "https://docs.example.com/guide",
      "content": "# Guide\n...",
      "markdown": "# Guide\n...",
      "metadata": { ... }
    }
  ],
  "stats": {
    "total": 10,
    "completed": 10,
    "failed": 0
  }
}
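
A minimal TypeScript sketch of consuming the JSON-mode response and saving each page to disk (Node 18+ ES module; the output filename scheme is purely illustrative):

import { mkdir, writeFile } from "node:fs/promises";

// Crawl a docs site in JSON mode and write each page's Markdown to ./out.
const res = await fetch("http://localhost:3000/api/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://docs.example.com",
    maxDepth: 2,
    maxPages: 50,
    pathPrefix: "/docs/",
  }),
});
const crawl = await res.json();

await mkdir("out", { recursive: true });
for (const page of crawl.pages ?? []) {
  // Illustrative filename scheme: URL path with non-word characters collapsed to dashes
  const slug = new URL(page.url).pathname.replace(/\W+/g, "-").replace(/^-+|-+$/g, "") || "index";
  await writeFile(`out/${slug}.md`, page.markdown ?? page.content);
}
console.log(`Saved ${crawl.stats?.completed ?? 0} of ${crawl.stats?.total ?? 0} pages`);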

Real-Time Progress (SSE)

Send the Accept: text/event-stream header to stream progress events:

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "url": "https://docs.example.com",
    "maxPages": 50
  }'

The server streams progress events, followed by the final result:

data: {"type":"progress","completed":1,"total":50,"currentUrl":"https://..."}
data: {"type":"progress","completed":2,"total":50,"currentUrl":"https://..."}
...
data: {"type":"complete","completed":50,"total":50}
data: {"type":"result","data":{"success":true,"pages":[...]}}

Parameters

Parameter        Type     Default      Description
url              string   required     Starting URL for crawl
maxDepth         number   3            Max link depth (1 = only direct links)
maxPages         number   50           Max total pages to crawl
pathPrefix       string   -            Only crawl URLs matching this path (e.g., /docs/)
engine           string   "auto"       Engine selection ("auto", "http", "browser")
format           string   "markdown"   Output format
includeMetadata  boolean  true         Include metadata
downloadImages   boolean  false        Download images for each page

POST /api/convert - Convert HTML to Markdown

Convert HTML files to Markdown with optional custom CSS selectors.

curl -X POST http://localhost:3000/api/convert \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@page.html" \
  -F 'selectors={"title":".article-title","content":".article-body"}'

Response:

{
  "success": true,
  "content": "# Title\n\nContent here...",
  "html": "<h1>Title</h1><p>Content here...</p>",
  "title": "Article Title"
}
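
The same upload from TypeScript, as a rough sketch (Node 18+ ES module for the global fetch, FormData, and Blob; selectors and field names follow the curl example above):

import { readFile } from "node:fs/promises";

// Upload an HTML file to /api/convert with custom selectors.
const html = await readFile("page.html");

const form = new FormData();
form.append("file", new Blob([html], { type: "text/html" }), "page.html");
form.append("selectors", JSON.stringify({ title: ".article-title", content: ".article-body" }));

const res = await fetch("http://localhost:3000/api/convert", {
  method: "POST",
  headers: { Authorization: "Bearer YOUR_API_KEY" }, // omit if API_KEY is not set
  body: form, // fetch sets the multipart boundary itself
});
const converted = await res.json();
console.log(converted.title);
console.log(converted.content);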

Custom Selectors

Extract specific page sections using CSS selectors:

{
  "title": ".article-title",
  "content": ".article-body"
}

Common patterns:

  • .classname - Extract by class
  • #idname - Extract by ID
  • article > div - Child combinators
  • div[data-post] - Attribute selectors

To find selectors:

  1. Open the HTML in a browser
  2. Right-click the element → Inspect
  3. Copy its class or id attribute
  4. Use it in the API call

GET /api/health - Health Check

Simple health check endpoint.

curl http://localhost:3000/api/health

Response:

{
  "status": "ok",
  "version": "1.9.4"
}

Request/Response Patterns

Authentication

Set the API_KEY environment variable to enable authentication:

API_KEY=your-secret-key pnpm dev

Then include the Authorization header:

curl -H "Authorization: Bearer your-secret-key" \
  -X POST http://localhost:3000/api/scrape ...

Error Handling

All endpoints return success: false on error:

{
  "success": false,
  "error": "Error message",
  "statusCode": 400
}
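
A caller can branch on the success flag; a small TypeScript sketch based on the error shape above (Node 18+ ES module):

// Scrape with basic error handling.
const res = await fetch("http://localhost:3000/api/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com" }),
});
const body = await res.json();

if (!body.success) {
  // error and statusCode come from the JSON error body shown above
  throw new Error(`Scrape failed (${body.statusCode ?? res.status}): ${body.error}`);
}
console.log(body.content);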

Content Validation

When includeMetadata is true (the default), Markdown output includes YAML frontmatter:

---
title: Page Title
author: Author Name
description: Page Description
created: 2025-01-01T12:00:00Z
---

# Page Title

Content here...
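
If you need the metadata separately from the body, the frontmatter can be split off in a few lines; a simple TypeScript sketch that handles the flat key: value fields shown above (use a real YAML parser for anything nested):

// Split YAML frontmatter from Markdown output and parse flat key: value fields.
function splitFrontmatter(markdown: string): { meta: Record<string, string>; body: string } {
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { meta: {}, body: markdown };

  const meta: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const colon = line.indexOf(":");
    if (colon > 0) meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }
  return { meta, body: markdown.slice(match[0].length) };
}

// Usage
const { meta, body } = splitFrontmatter("---\ntitle: Page Title\n---\n\n# Page Title\n");
console.log(meta.title); // "Page Title"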

Performance Guidelines

Scenario          Engine   Speed           Best For
Static HTML       HTTP     ~170ms          Documentation, blogs
JavaScript-heavy  Browser  ~5s             SPAs, dynamic content
Mixed sites       Auto     ~170ms to ~5s   Automatic, recommended

Hybrid Crawling Strategy

For large documentation sites with JavaScript navigation:

  1. Phase 1: Single browser render to extract the site's link map (~5s)
  2. Phase 2: Batch HTTP scrapes of every discovered link (~170ms each)
  3. Result: roughly 10x faster than a full browser crawl, with the same page coverage

# Extract all links via browser (1 page, ~5s); clean.html keeps the href attributes
curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "engine": "browser", "format": "clean.html"}' \
  | jq -r '.content' | grep -oP 'href="\K[^"]+' | sort -u > links.txt

# Batch scrape every link (N pages × ~170ms)
# Note: relative links must be resolved against the base URL before this step
while read -r link; do
  curl -s -X POST http://localhost:3000/api/scrape \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"$link\"}"
done < links.txt

Configuration

Environment Variables

PORT=3000                         # Server port
MAX_BROWSERS=5                    # Concurrent browser instances
API_KEY=                          # Optional auth (leave empty to disable)
IMAGE_DOWNLOAD_DIR=./data/images  # Image storage location

Per-Request Limits

These settings apply per request:

  • maxDepth (crawl): Default 3, max recommended 10
  • maxPages (crawl): Default 50, max recommended 500
  • Image download: Limited by storage space and bandwidth

Examples

Extract Article from Blog

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://blog.example.com/article",
    "format": "markdown",
    "includeMetadata": true
  }' | jq '.content'

Crawl Documentation Site

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/guide/",
    "pathPrefix": "/guide/",
    "maxPages": 100,
    "engine": "auto"
  }' | jq '.pages | length'

Download Images with Content

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "downloadImages": true
  }' | jq '.metadata.images'

Convert HTML File

curl -X POST http://localhost:3000/api/convert \
  -F "file=@article.html" \
  -F 'selectors={"content":".post-body"}' \
  | jq '.content'

Troubleshooting

Problem          Solution
Slow responses   Use engine: "http" (~170ms vs ~5s)
Missing content  Try engine: "browser"
Empty responses  Check that the URL is accessible; the page might need the browser engine
Large responses  Use custom selectors to extract only the needed content
Memory growing   Restart the service daily, or reduce MAX_BROWSERS

Rate Limits

No built-in rate limiting. Configure at reverse proxy level (e.g., Caddy, Nginx).

Recommended limits:

  • Per IP: 100 requests/minute
  • Crawl requests: 10 active crawls/IP
  • Total: Scale based on MAX_BROWSERS and memory

See Also