API Reference

Complete API documentation for frontend integration

POST /api/scrape

Scrape a single URL and convert it to Markdown, HTML, or PDF.

Request

{
  url: string                                    // Required
  engine?: 'auto' | 'http' | 'browser'           // Default: 'auto'
  format?: 'raw.html' | 'clean.html' | 'markdown' | 'pdf' // Default: 'markdown'
  downloadImages?: boolean                       // Default: false (Fetches images to local storage)
  includeMetadata?: boolean                      // Default: true
  timeout?: number                               // Default: 30000ms
  enableAI?: boolean                             // Default: false (Extracts main content and removes clutter)
  selectors?: { title?: string, content?: string }
}

Response

{
  success: true
  url: string
  format: 'raw.html' | 'clean.html' | 'markdown' | 'pdf'
  engine: 'http' | 'browser'
  
  content: string      // Scraped data (Markdown, HTML, or PDF base64)
  rawHtml?: string     // Included when format='markdown'
  title?: string
  
  metadata?: {
    convertedAt: string
    timing: { fetchDurationMs, cleanDurationMs, totalDurationMs }
    stats: { rawHtmlSizeBytes, markdownSizeBytes, compressionRatio }
    images?: {         // When downloadImages=true
      download: boolean
      totalBytes: number
      items: { originalUrl, localPath }[]
    }
  }
  
  discoveredLinks?: {  // Internal links discovered on the page
    url: string
    text?: string
  }[]
}

Example

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "format": "markdown"}'

POST /api/crawl

Crawl a website recursively. Supports a JSON mode (wait for completion) and an SSE mode (real-time progress).

Request

{
  url: string           // Required - Starting URL
  maxPages?: number     // Default: 50, Max: 500
  maxDepth?: number     // Default: 3, Max: 10
  concurrency?: number  // Default: 2, Max: 10
  pathPrefix?: string   // Filter URLs by path (e.g., '/docs/')
  engine?: 'auto' | 'http' | 'browser'
  format?: 'markdown' | 'clean.html' | 'raw.html' | 'pdf'
  downloadImages?: boolean
}

JSON Mode (Default)

Wait for crawl to complete:

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "maxPages": 10}'

Response:

{
  success: true
  baseUrl: string
  totalPages: number
  pages: [{ url, title, content }]
  sitemap?: {
    crawledUrls: string[]              // URLs that were crawled
    discoveredUrls: DiscoveredLink[]   // Links discovered but not crawled
    stopReason: string                 // Why the crawl stopped ('maxPages' | 'maxDepth' | 'drained')
  }
}
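
The same request from JavaScript (a minimal sketch; without the Accept header the server returns a single JSON body):

const res = await fetch('/api/crawl', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com', maxPages: 10 })
});
const { success, totalPages, sitemap } = await res.json();
if (success) {
  console.log(`Crawled ${totalPages} pages, stop reason: ${sitemap?.stopReason}`);
}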

SSE Mode (Real-time Progress)

Add an Accept: text/event-stream header to receive real-time updates:

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"url": "https://example.com", "maxPages": 50}'

Event Types:

// Detecting page type (auto mode)
data: {"type": "detecting", "completed": 0, "total": 50, "currentUrl": "https://..."}

// Engine upgrade event (auto mode detected browser needed)
data: {"type": "engine_upgrade", "completed": 1, "total": 50, "upgradeReason": "spa_detected"}

// Progress update
data: {"type": "progress", "completed": 5, "total": 50, "currentUrl": "https://...", "data": {...}}

// Final result
data: {"type": "complete", "data": { success, baseUrl, totalPages, pages, sitemap }}

// Error
data: {"type": "error", "error": "message"}

JavaScript SSE Client

async function crawlWithProgress(url, maxPages, onProgress) {
  const res = await fetch('/api/crawl', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Accept': 'text/event-stream'
    },
    body: JSON.stringify({ url, maxPages })
  });
  
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  let result = null;
  
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() || '';
    
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const event = JSON.parse(line.substring(6));
      
      if (event.type === 'progress') {
        onProgress(event.completed, event.total, event.currentUrl);
      } else if (event.type === 'complete') {
        result = event.data;
      } else if (event.type === 'error') {
        throw new Error(event.error);
      }
    }
  }
  
  return result;
}

// Usage
const result = await crawlWithProgress(
  'https://docs.example.com',
  100,
  (done, total, url) => console.log(`${done}/${total}: ${url}`)
);

POST /api/convert

Convert an uploaded HTML file to Markdown, or Markdown text to PDF.

Request

Supports two modes:

Mode 1: File Upload (HTML → Markdown)

  • Content-Type: multipart/form-data
  • Body: file (binary HTML file)

curl -X POST http://localhost:3000/api/convert \
  -F "file=@page.html;type=text/html"

Mode 2: JSON Body (Markdown → PDF)

  • Content-Type: application/json

curl -X POST http://localhost:3000/api/convert \
  -H "Content-Type: application/json" \
  -d '{"markdown": "# Hello World", "format": "pdf"}'

POST /api/download

Create a ZIP archive containing Markdown content and local images.

Request

{
  markdown: string                                       // Required
  images?: { originalUrl: string, localPath: string }[]  // Optional
  title?: string                                         // Default: 'export'
}

Response

Binary ZIP file.

Example

curl -X POST http://localhost:3000/api/download \
  -H "Content-Type: application/json" \
  --output export.zip \
  -d '{
    "markdown": "# Hello\n![Image](./img/1.png)",
    "images": [{"originalUrl": "...", "localPath": "img/1.png"}],
    "title": "My Page"
  }'
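
In the browser, the returned ZIP can be offered to the user directly (a minimal sketch, assuming markdown and images come from a previous /api/scrape response):

const res = await fetch('/api/download', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ markdown, images, title: 'My Page' })
});
const blob = await res.blob();
const a = document.createElement('a');
a.href = URL.createObjectURL(blob);
a.download = 'export.zip';
a.click();
URL.revokeObjectURL(a.href);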

POST /api/download/batch

Batch download multiple pages, generating a ZIP archive containing all Markdown files and a shared images directory.

Request Parameters

Parameter         Type    Required  Description
pages             array   Yes       Array of page data
pages[].url       string  Yes       Original page URL
pages[].markdown  string  Yes       Extracted Markdown content
pages[].title     string  No        Page title (used for filename generation)
pages[].images    array   No        Image mapping array

Request Example

curl -X POST http://localhost:3000/api/download/batch \
  -H "Content-Type: application/json" \
  -d '{
    "pages": [
      {
        "url": "https://example.com/page1",
        "markdown": "# Page 1\n\nContent here...",
        "title": "Page 1"
      },
      {
        "url": "https://example.com/page2",
        "markdown": "# Page 2\n\nMore content...",
        "title": "Page 2"
      }
    ]
  }' --output batch.zip

Response

  • Content-Type: application/zip
  • Filename: Generated based on the first page's domain, e.g., example.com-batch.zip
  • ZIP Structure:
    example.com-batch.zip
    ├── page-1.md
    ├── page-2.md
    └── images/
        ├── image1.png
        └── image2.jpg
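
A typical frontend flow feeds the pages from a JSON-mode /api/crawl response straight into this endpoint (a minimal sketch; crawlResult is assumed to be that response):

const res = await fetch('/api/download/batch', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    pages: crawlResult.pages.map(p => ({
      url: p.url,
      markdown: p.content,   // /api/crawl returns page content as `content`
      title: p.title
    }))
  })
});
const zip = await res.blob();  // save or trigger a download as shown for /api/download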
    

GET /api/images/*

Serve images downloaded during scrape/crawl when downloadImages=true.

Usage

Images are stored locally and served from this endpoint. Append the localPath returned in metadata.images.items to /api/images/ to build the image URL.

Example

curl http://localhost:3000/api/images/2023/10/page-id/image.png --output image.png
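
From JavaScript, image URLs can be derived from the scrape metadata (a minimal sketch; result is assumed to be a /api/scrape response with downloadImages=true):

const base = '/api/images/';
for (const img of result.metadata.images.items) {
  console.log(img.originalUrl, '->', base + img.localPath);
}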

GET /api/health

Health check endpoint.

curl http://localhost:3000/api/health
# {"status":"ok","version":"..."}

Engine Selection

Engine    Speed      Use Case
auto      ~1.2s avg  Most cases (default)
http      ~170ms     Static HTML sites
browser   ~5s        JavaScript SPAs

Auto mode tries the HTTP engine first and upgrades to the browser engine if the fetched content appears missing or incomplete (e.g., a JavaScript-rendered SPA).
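
If the page type is known in advance, the detection step can be skipped by pinning the engine (a minimal sketch):

// A known JavaScript-heavy SPA: go straight to the browser engine
await fetch('/api/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com/app', engine: 'browser' })
});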


Output Formats

Format      Size      Best For
markdown    Smallest  LLM processing, reading
clean.html  Medium    Web display
raw.html    Largest   Debugging
pdf         Variable  Offline reading, archiving

Metadata Frontmatter

When format='markdown' and includeMetadata=true:

---
url: https://example.com
title: Page Title
converted_at: 2025-11-17T12:34:56.789Z
engine: browser

timing:
  fetch_duration_ms: 250
  total_duration_ms: 300

stats:
  raw_html_size_bytes: 125000
  markdown_size_bytes: 12000
  compression_ratio: 0.096
---

# Actual Markdown Content...
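
If the frontend needs the Markdown body without the frontmatter, the leading block can be split off before rendering (a minimal sketch; parsing the YAML itself is left to a YAML library):

function splitFrontmatter(markdown) {
  // Match a leading '---' ... '---' block like the one shown above
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n/);
  if (!match) return { frontmatter: null, body: markdown };
  return { frontmatter: match[1], body: markdown.slice(match[0].length) };
}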

Error Handling

// Error response
{
  success: false
  url: string
  error: string
}

// Common errors:
// - 400: Invalid URL or parameters
// - 403: Site blocked access
// - 404: Page not found
// - 408: Timeout
// - 500: Internal error
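
The retry example below assumes a scrape() helper that surfaces the HTTP status in the error; a minimal sketch (the error message format is illustrative):

async function scrape(url) {
  const res = await fetch('/api/scrape', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, format: 'markdown' })
  });
  const data = await res.json().catch(() => ({}));
  if (!res.ok || !data.success) {
    // Include the HTTP status so callers can decide whether to retry
    throw new Error(`Scrape failed (${res.status}): ${data.error || res.statusText}`);
  }
  return data;
}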

Retry Strategy

async function scrapeWithRetry(url, retries = 3) {
  let lastError;
  for (let i = 0; i < retries; i++) {
    try {
      return await scrape(url);  // scrape() as sketched in Error Handling above
    } catch (e) {
      lastError = e;
      const status = parseInt(e.message.match(/\d+/)?.[0]);
      if (status && status < 500) throw e;  // Don't retry 4xx client errors
      if (i < retries - 1) {
        await new Promise(r => setTimeout(r, 1000 << i));  // Exponential backoff: 1s, 2s, 4s...
      }
    }
  }
  throw lastError;  // All retries exhausted
}

Authentication (Optional)

If backend has API_KEY set:

fetch('/api/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: JSON.stringify({ url })
});

Rate Limiting

  • Crawl: 1 request/second (polite crawling)
  • Scrape: No limit by default
  • Configure in deployment if needed