API Reference

Simple, predictable web scraping API. No hidden concurrency. No timeouts you can't control.

Core Endpoints

POST /api/scrape - Convert URL to Markdown

Convert any URL to clean, Markdown-formatted text.

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "engine": "auto",
    "format": "markdown",
    "enableAI": false
  }'

Response:

{
  "success": true,
  "content": "# Page Title\n\nPage content here...",
  "markdown": "# Page Title\n\n...",
  "metadata": {
    "title": "Page Title",
    "author": "Author Name",
    "description": "Page description",
    "images": {
      "download": false,
      "items": [],
      "totalBytes": 0,
      "warnings": []
    }
  },
  "engine": "http",
  "format": "markdown"
}
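
For programmatic use, a minimal TypeScript client for this endpoint might look like the sketch below (Node 18+ ES module, so the global fetch and top-level await are available; the response interface mirrors the fields shown above):

interface ScrapeResponse {
  success: boolean;
  content: string;
  markdown?: string;
  metadata?: Record<string, unknown>;
  engine?: string;
  format?: string;
  error?: string;
}

// POST a URL to /api/scrape and return the parsed JSON body.
async function scrape(url: string, options: Record<string, unknown> = {}): Promise<ScrapeResponse> {
  const res = await fetch("http://localhost:3000/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, engine: "auto", format: "markdown", ...options }),
  });
  return (await res.json()) as ScrapeResponse;
}

// Usage
const page = await scrape("https://example.com");
console.log(page.markdown ?? page.content);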

Parameters

Parameter        Type     Default      Description
url              string   required     URL to scrape
engine           string   "auto"       Engine selection: "auto" (smart detection), "http" (fast), "browser" (JS rendering)
format           string   "markdown"   Output format: "markdown", "clean.html", "raw.html"
includeMetadata  boolean  true         Include YAML frontmatter with metadata
enableAI         boolean  false        AI-powered content extraction (experimental)
downloadImages   boolean  false        Download images from content to local storage

Engine Selection

  • "auto" (recommended): Tries HTTP first (~170ms), automatically upgrades to browser if content is empty or JavaScript is required
  • "http": Lightweight HTTP + HTML parsing. Fast (~170ms), great for static content
  • "browser": Full Puppeteer browser rendering. Slower (~5s avg), required for JavaScript-heavy sites

Output Formats

  • "markdown": Clean Markdown with YAML frontmatter metadata (recommended)
  • "clean.html": Cleaned HTML without scripts/styles for custom processing
  • "raw.html": Original unprocessed HTML for debugging

Image Download

When downloadImages: true:

  • Images are downloaded from the main content area
  • Stored in IMAGE_DOWNLOAD_DIR (default: ./data/images)
  • Metadata exposes image information:
    • originalUrl: Source URL
    • localPath: Downloaded file path
    • totalBytes: Total bytes downloaded
    • warnings: Any limit hits (e.g., max images reached)
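
A quick TypeScript sketch of reading those fields from a response (Node 18+ ES module; field names follow the list above):

// Scrape with image download enabled and report what was stored.
const res = await fetch("http://localhost:3000/api/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com", downloadImages: true }),
});
const { metadata } = await res.json();
const images = metadata?.images ?? { items: [], totalBytes: 0, warnings: [] };

console.log(`Downloaded ${images.items.length} images, ${images.totalBytes} bytes`);
for (const item of images.items) console.log(item.originalUrl, "->", item.localPath);
for (const warning of images.warnings) console.warn("limit warning:", warning);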

POST /api/crawl - Recursive Site Crawl

Systematically crawl and extract content from multiple pages starting from a single URL.

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "maxPages": 50,
    "pathPrefix": "/docs/",
    "engine": "auto",
    "format": "markdown"
  }'

Response (JSON mode):

{
  "success": true,
  "pages": [
    {
      "url": "https://docs.example.com/guide",
      "content": "# Guide\n...",
      "markdown": "# Guide\n...",
      "metadata": { ... }
    }
  ],
  "stats": {
    "total": 10,
    "completed": 10,
    "failed": 0
  }
}
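
A minimal TypeScript sketch of consuming the JSON-mode response and saving each page to disk (Node 18+ ES module; the output filename scheme is purely illustrative):

import { mkdir, writeFile } from "node:fs/promises";

// Crawl a docs site in JSON mode and write each page's Markdown to ./out.
const res = await fetch("http://localhost:3000/api/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://docs.example.com",
    maxDepth: 2,
    maxPages: 50,
    pathPrefix: "/docs/",
  }),
});
const crawl = await res.json();

await mkdir("out", { recursive: true });
for (const page of crawl.pages ?? []) {
  // Illustrative filename scheme: URL path with non-word characters collapsed to dashes
  const slug = new URL(page.url).pathname.replace(/\W+/g, "-").replace(/^-+|-+$/g, "") || "index";
  await writeFile(`out/${slug}.md`, page.markdown ?? page.content);
}
console.log(`Saved ${crawl.stats?.completed ?? 0} of ${crawl.stats?.total ?? 0} pages`);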

Real-Time Progress (SSE)

Send the Accept: text/event-stream header to stream progress events:

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "url": "https://docs.example.com",
    "maxPages": 50
  }'

The server streams progress events, followed by the final result:

data: {"type":"progress","completed":1,"total":50,"currentUrl":"https://..."}
data: {"type":"progress","completed":2,"total":50,"currentUrl":"https://..."}
...
data: {"type":"complete","completed":50,"total":50}
data: {"type":"result","data":{"success":true,"pages":[...]}}

Parameters

Parameter        Type     Default      Description
url              string   required     Starting URL for crawl
maxDepth         number   3            Max link depth (1 = only direct links)
maxPages         number   50           Max total pages to crawl
pathPrefix       string   -            Only crawl URLs matching this path (e.g., /docs/)
engine           string   "auto"       Engine selection ("auto", "http", "browser")
format           string   "markdown"   Output format
includeMetadata  boolean  true         Include metadata
downloadImages   boolean  false        Download images for each page

POST /api/convert - Convert HTML to Markdown

Convert HTML files to Markdown with optional custom CSS selectors.

curl -X POST http://localhost:3000/api/convert \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@page.html" \
  -F 'selectors={"title":".article-title","content":".article-body"}'

Response:

{
  "success": true,
  "content": "# Title\n\nContent here...",
  "html": "<h1>Title</h1><p>Content here...</p>",
  "title": "Article Title"
}
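
The same upload from TypeScript, as a rough sketch (Node 18+ ES module for the global fetch, FormData, and Blob; selectors and field names follow the curl example above):

import { readFile } from "node:fs/promises";

// Upload an HTML file to /api/convert with custom selectors.
const html = await readFile("page.html");

const form = new FormData();
form.append("file", new Blob([html], { type: "text/html" }), "page.html");
form.append("selectors", JSON.stringify({ title: ".article-title", content: ".article-body" }));

const res = await fetch("http://localhost:3000/api/convert", {
  method: "POST",
  headers: { Authorization: "Bearer YOUR_API_KEY" }, // omit if API_KEY is not set
  body: form, // fetch sets the multipart boundary itself
});
const converted = await res.json();
console.log(converted.title);
console.log(converted.content);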

Custom Selectors

Extract specific page sections using CSS selectors:

{
  "title": ".article-title",
  "content": ".article-body"
}

Common patterns:

  • .classname - Extract by class
  • #idname - Extract by ID
  • article > div - Child combinators
  • div[data-post] - Attribute selectors

To find selectors:

  1. Open the HTML in a browser
  2. Right-click the element → Inspect
  3. Copy its class or id attribute
  4. Use it in the API call

GET /api/health - Health Check

Simple health check endpoint.

curl http://localhost:3000/api/health

Response:

{
  "status": "ok",
  "version": "1.9.4"
}

Request/Response Patterns

Authentication

Set the API_KEY environment variable to enable authentication:

API_KEY=your-secret-key pnpm dev

Then include the Authorization header:

curl -H "Authorization: Bearer your-secret-key" \
  -X POST http://localhost:3000/api/scrape ...

Error Handling

All endpoints return success: false on error:

{
  "success": false,
  "error": "Error message",
  "statusCode": 400
}
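
A caller can branch on the success flag; a small TypeScript sketch based on the error shape above (Node 18+ ES module):

// Scrape with basic error handling.
const res = await fetch("http://localhost:3000/api/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com" }),
});
const body = await res.json();

if (!body.success) {
  // error and statusCode come from the JSON error body shown above
  throw new Error(`Scrape failed (${body.statusCode ?? res.status}): ${body.error}`);
}
console.log(body.content);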

Content Validation

When includeMetadata is true (the default), Markdown output includes YAML frontmatter:

---
title: Page Title
author: Author Name
description: Page Description
created: 2025-01-01T12:00:00Z
---

# Page Title

Content here...
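
If you need the metadata separately from the body, the frontmatter can be split off in a few lines; a simple TypeScript sketch that handles the flat key: value fields shown above (use a real YAML parser for anything nested):

// Split YAML frontmatter from Markdown output and parse flat key: value fields.
function splitFrontmatter(markdown: string): { meta: Record<string, string>; body: string } {
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { meta: {}, body: markdown };

  const meta: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const colon = line.indexOf(":");
    if (colon > 0) meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }
  return { meta, body: markdown.slice(match[0].length) };
}

// Usage
const { meta, body } = splitFrontmatter("---\ntitle: Page Title\n---\n\n# Page Title\n");
console.log(meta.title); // "Page Title"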

Performance Guidelines

Scenario          Engine   Speed           Best For
Static HTML       HTTP     ~170ms          Documentation, blogs
JavaScript-heavy  Browser  ~5s             SPAs, dynamic content
Mixed sites       Auto     ~170ms to ~5s   Automatic, recommended

Hybrid Crawling Strategy

For large documentation sites with JavaScript navigation:

  1. Phase 1: Single browser render to extract the site's link map (~5s)
  2. Phase 2: Batch HTTP scrapes of every discovered link (~170ms each)
  3. Result: roughly 10x faster than a full browser crawl, with the same page coverage

# Extract all links via browser (1 page, ~5s); clean.html keeps the href attributes
curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "engine": "browser", "format": "clean.html"}' \
  | jq -r '.content' | grep -oP 'href="\K[^"]+' | sort -u > links.txt

# Batch scrape every link (N pages × ~170ms)
# Note: relative links must be resolved against the base URL before this step
while read -r link; do
  curl -s -X POST http://localhost:3000/api/scrape \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"$link\"}"
done < links.txt

Configuration

Environment Variables

PORT=3000                         # Server port
MAX_BROWSERS=5                    # Concurrent browser instances
API_KEY=                          # Optional auth (leave empty to disable)
IMAGE_DOWNLOAD_DIR=./data/images  # Image storage location

Per-Request Limits

These settings apply per request:

  • maxDepth (crawl): Default 3, max recommended 10
  • maxPages (crawl): Default 50, max recommended 500
  • Image download: Limited by storage space and bandwidth

Examples

Extract Article from Blog

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://blog.example.com/article",
    "format": "markdown",
    "includeMetadata": true
  }' | jq '.content'

Crawl Documentation Site

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/guide/",
    "pathPrefix": "/guide/",
    "maxPages": 100,
    "engine": "auto"
  }' | jq '.pages | length'

Download Images with Content

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "downloadImages": true
  }' | jq '.metadata.images'

Convert HTML File

curl -X POST http://localhost:3000/api/convert \
  -F "file=@article.html" \
  -F 'selectors={"content":".post-body"}' \
  | jq '.content'

Troubleshooting

Problem          Solution
Slow responses   Use engine: "http" (~170ms vs ~5s)
Missing content  Try engine: "browser"
Empty responses  Check that the URL is accessible; the page might need the browser engine
Large responses  Use custom selectors to extract only the needed content
Memory growing   Restart the service daily, or reduce MAX_BROWSERS

Rate Limits

No built-in rate limiting. Configure at reverse proxy level (e.g., Caddy, Nginx).

Recommended limits:

  • Per IP: 100 requests/minute
  • Crawl requests: 10 active crawls/IP
  • Total: Scale based on MAX_BROWSERS and memory

See Also