API Reference

Complete API documentation for frontend integration

POST /api/scrape

Scrape a single URL and convert it to Markdown, HTML, or PDF.

Request

{
  url: string                                    // Required
  engine?: 'auto' | 'http' | 'browser'           // Default: 'auto'
  format?: 'raw.html' | 'clean.html' | 'markdown' | 'pdf' // Default: 'markdown'
  downloadImages?: boolean                       // Default: false (Fetches images to local storage)
  includeMetadata?: boolean                      // Default: true
  timeout?: number                               // Default: 30000ms
  enableAI?: boolean                             // Default: false (Extracts main content and removes clutter)
  selectors?: { title?: string, content?: string }
}

Response

{
  success: true
  url: string
  format: 'raw.html' | 'clean.html' | 'markdown' | 'pdf'
  engine: 'http' | 'browser'
  
  content: string      // Scraped data (Markdown, HTML, or PDF base64)
  rawHtml?: string     // Included when format='markdown'
  title?: string
  
  metadata?: {
    convertedAt: string
    timing: { fetchDurationMs, cleanDurationMs, totalDurationMs }
    stats: { rawHtmlSizeBytes, markdownSizeBytes, compressionRatio }
    images?: {         // When downloadImages=true
      download: boolean
      totalBytes: number
      items: { originalUrl, localPath }[]
    }
  }
  
  discoveredLinks?: {  // Internal links discovered on the page
    url: string
    text?: string
  }[]
}

Example

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "format": "markdown"}'

POST /api/crawl

Crawl a website recursively. Supports a JSON mode (wait for completion) and an SSE mode (real-time progress).

Request

{
  url: string           // Required - Starting URL
  maxPages?: number     // Default: 50, Max: 500
  maxDepth?: number     // Default: 3, Max: 10
  concurrency?: number  // Default: 2, Max: 10
  pathPrefix?: string   // Filter URLs by path (e.g., '/docs/')
  engine?: 'auto' | 'http' | 'browser'
  format?: 'markdown' | 'clean.html' | 'raw.html' | 'pdf'
  downloadImages?: boolean
}

JSON Mode (Default)

Wait for crawl to complete:

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "maxPages": 10}'

Response:

{
  success: true
  baseUrl: string
  totalPages: number
  pages: [{ url, title, content }]
  sitemap?: {
    crawledUrls: string[]              // URLs that were crawled
    discoveredUrls: DiscoveredLink[]   // Links discovered but not crawled
    stopReason: string                 // Why the crawl stopped ('maxPages' | 'maxDepth' | 'drained')
  }
}
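
The same request from JavaScript (a minimal sketch; without the Accept header the server returns a single JSON body):

const res = await fetch('/api/crawl', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com', maxPages: 10 })
});
const { success, totalPages, sitemap } = await res.json();
if (success) {
  console.log(`Crawled ${totalPages} pages, stop reason: ${sitemap?.stopReason}`);
}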

SSE Mode (Real-time Progress)

Add an Accept: text/event-stream header to receive real-time updates:

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"url": "https://example.com", "maxPages": 50}'

Event Types:

// Detecting page type (auto mode)
data: {"type": "detecting", "completed": 0, "total": 50, "currentUrl": "https://..."}

// Engine upgrade event (auto mode detected browser needed)
data: {"type": "engine_upgrade", "completed": 1, "total": 50, "upgradeReason": "spa_detected"}

// Progress update
data: {"type": "progress", "completed": 5, "total": 50, "currentUrl": "https://...", "data": {...}}

// Final result
data: {"type": "complete", "data": { success, baseUrl, totalPages, pages, sitemap }}

// Error
data: {"type": "error", "error": "message"}

JavaScript SSE Client

async function crawlWithProgress(url, maxPages, onProgress) {
  const res = await fetch('/api/crawl', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Accept': 'text/event-stream'
    },
    body: JSON.stringify({ url, maxPages })
  });
  
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  let result = null;
  
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() || '';
    
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const event = JSON.parse(line.substring(6));
      
      if (event.type === 'progress') {
        onProgress(event.completed, event.total, event.currentUrl);
      } else if (event.type === 'complete') {
        result = event.data;
      } else if (event.type === 'error') {
        throw new Error(event.error);
      }
    }
  }
  
  return result;
}

// Usage
const result = await crawlWithProgress(
  'https://docs.example.com',
  100,
  (done, total, url) => console.log(`${done}/${total}: ${url}`)
);

POST /api/convert

Convert an uploaded HTML file to Markdown, or Markdown text to PDF.

Request

Supports two modes:

Mode 1: File Upload (HTML → Markdown)

  • Content-Type: multipart/form-data
  • Body: file (binary HTML file)

curl -X POST http://localhost:3000/api/convert \
  -F "file=@page.html;type=text/html"

Mode 2: JSON Body (Markdown → PDF)

  • Content-Type: application/json

curl -X POST http://localhost:3000/api/convert \
  -H "Content-Type: application/json" \
  -d '{"markdown": "# Hello World", "format": "pdf"}'

POST /api/download

Create a ZIP archive containing Markdown content and local images.

Request

{
  markdown: string                                       // Required
  images?: { originalUrl: string, localPath: string }[]  // Optional
  title?: string                                         // Default: 'export'
}

Response

Binary ZIP file.

Example

curl -X POST http://localhost:3000/api/download \
  -H "Content-Type: application/json" \
  --output export.zip \
  -d '{
    "markdown": "# Hello\n![Image](./img/1.png)",
    "images": [{"originalUrl": "...", "localPath": "img/1.png"}],
    "title": "My Page"
  }'
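
In the browser, the returned ZIP can be offered to the user directly (a minimal sketch, assuming markdown and images come from a previous /api/scrape response):

const res = await fetch('/api/download', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ markdown, images, title: 'My Page' })
});
const blob = await res.blob();
const a = document.createElement('a');
a.href = URL.createObjectURL(blob);
a.download = 'export.zip';
a.click();
URL.revokeObjectURL(a.href);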

POST /api/download/batch

Batch download multiple pages, generating a ZIP archive containing all Markdown files and a shared images directory.

Request Parameters

Parameter         Type    Required  Description
pages             array   Yes       Array of page data
pages[].url       string  Yes       Original page URL
pages[].markdown  string  Yes       Extracted Markdown content
pages[].title     string  No        Page title (used for filename generation)
pages[].images    array   No        Image mapping array

Request Example

curl -X POST http://localhost:3000/api/download/batch \
  -H "Content-Type: application/json" \
  -d '{
    "pages": [
      {
        "url": "https://example.com/page1",
        "markdown": "# Page 1\n\nContent here...",
        "title": "Page 1"
      },
      {
        "url": "https://example.com/page2",
        "markdown": "# Page 2\n\nMore content...",
        "title": "Page 2"
      }
    ]
  }' --output batch.zip

Response

  • Content-Type: application/zip
  • Filename: Generated based on the first page's domain, e.g., example.com-batch.zip
  • ZIP Structure:
    example.com-batch.zip
    ├── page-1.md
    ├── page-2.md
    └── images/
        ├── image1.png
        └── image2.jpg
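
A typical frontend flow feeds the pages from a JSON-mode /api/crawl response straight into this endpoint (a minimal sketch; crawlResult is assumed to be that response):

const res = await fetch('/api/download/batch', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    pages: crawlResult.pages.map(p => ({
      url: p.url,
      markdown: p.content,   // /api/crawl returns page content as `content`
      title: p.title
    }))
  })
});
const zip = await res.blob();  // save or trigger a download as shown for /api/download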
    

GET /api/images/*

Serve images downloaded during scrape/crawl when downloadImages=true.

Usage

Images are stored locally and served from this endpoint. Append the localPath returned in metadata.images.items to /api/images/ to build the image URL.

Example

curl http://localhost:3000/api/images/2023/10/page-id/image.png --output image.png
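
From JavaScript, image URLs can be derived from the scrape metadata (a minimal sketch; result is assumed to be a /api/scrape response with downloadImages=true):

const base = '/api/images/';
for (const img of result.metadata.images.items) {
  console.log(img.originalUrl, '->', base + img.localPath);
}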

GET /api/health

Health check endpoint.

curl http://localhost:3000/api/health
# {"status":"ok","version":"..."}

Engine Selection

Engine    Speed      Use Case
auto      ~1.2s avg  Most cases (default)
http      ~170ms     Static HTML sites
browser   ~5s        JavaScript SPAs

Auto mode tries the HTTP engine first and upgrades to the browser engine if the fetched content appears missing or incomplete (e.g., a JavaScript-rendered SPA).
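
If the page type is known in advance, the detection step can be skipped by pinning the engine (a minimal sketch):

// A known JavaScript-heavy SPA: go straight to the browser engine
await fetch('/api/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com/app', engine: 'browser' })
});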


Output Formats

Format      Size      Best For
markdown    Smallest  LLM processing, reading
clean.html  Medium    Web display
raw.html    Largest   Debugging
pdf         Variable  Offline reading, archiving

Metadata Frontmatter

When format='markdown' and includeMetadata=true:

---
url: https://example.com
title: Page Title
converted_at: 2025-11-17T12:34:56.789Z
engine: browser

timing:
  fetch_duration_ms: 250
  total_duration_ms: 300

stats:
  raw_html_size_bytes: 125000
  markdown_size_bytes: 12000
  compression_ratio: 0.096
---

# Actual Markdown Content...
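
If the frontend needs the Markdown body without the frontmatter, the leading block can be split off before rendering (a minimal sketch; parsing the YAML itself is left to a YAML library):

function splitFrontmatter(markdown) {
  // Match a leading '---' ... '---' block like the one shown above
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n/);
  if (!match) return { frontmatter: null, body: markdown };
  return { frontmatter: match[1], body: markdown.slice(match[0].length) };
}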

Error Handling

// Error response
{
  success: false
  url: string
  error: string
}

// Common errors:
// - 400: Invalid URL or parameters
// - 403: Site blocked access
// - 404: Page not found
// - 408: Timeout
// - 500: Internal error
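
The retry example below assumes a scrape() helper that surfaces the HTTP status in the error; a minimal sketch (the error message format is illustrative):

async function scrape(url) {
  const res = await fetch('/api/scrape', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, format: 'markdown' })
  });
  const data = await res.json().catch(() => ({}));
  if (!res.ok || !data.success) {
    // Include the HTTP status so callers can decide whether to retry
    throw new Error(`Scrape failed (${res.status}): ${data.error || res.statusText}`);
  }
  return data;
}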

Retry Strategy

async function scrapeWithRetry(url, retries = 3) {
  let lastError;
  for (let i = 0; i < retries; i++) {
    try {
      return await scrape(url);  // scrape() as sketched in Error Handling above
    } catch (e) {
      lastError = e;
      const status = parseInt(e.message.match(/\d+/)?.[0]);
      if (status && status < 500) throw e;  // Don't retry 4xx client errors
      if (i < retries - 1) {
        await new Promise(r => setTimeout(r, 1000 << i));  // Exponential backoff: 1s, 2s, 4s...
      }
    }
  }
  throw lastError;  // All retries exhausted
}

Authentication (Optional)

If backend has API_KEY set:

fetch('/api/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: JSON.stringify({ url })
});

Rate Limiting

  • Crawl: 1 request/second (polite crawling)
  • Scrape: No limit by default
  • Configure in deployment if needed