Scraper - Web Content Extraction

Extract clean content from any webpage. Core tool of Alchemy Lab.

Core Features

  • 🌐 Single Page Scraping: Extract clean Markdown from any URL
  • 🔄 Batch Crawling: Automatically crawl sites following links (up to 200 pages)
  • 🖼️ Image Processing: Auto-download and convert to CDN links
  • 📄 Multi-format Export: Markdown / PDF / ZIP
  • 🤖 AI Refinement: Intelligently remove navigation, ads, and noise

Quick Start

Single Page Scraping

curl -X POST https://api.alchemy.izoa.fun/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "downloadImages": true
  }'

Response Example:

{
  "data": "# Page Title\n\nClean markdown content...",
  "metadata": {
    "title": "Page Title",
    "url": "https://example.com",
    "images": {
      "download": true,
      "cdnRewrite": true,
      "items": [...]
    }
  }
}

Batch Crawling

curl -X POST https://api.alchemy.izoa.fun/api/crawl \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "url": "https://example.com/docs",
    "limit": 10,
    "pathPrefix": "/docs"
  }'

SSE Event Stream:

event: discovering
data: {"phase":"discovering","discovered":5}

event: scraping
data: {"phase":"scraping","current":"https://example.com/docs/intro","progress":{"scraped":1,"total":5}}

event: complete
data: {"phase":"complete","results":[...]}

AI Refinement Mode

curl -X POST https://api.alchemy.izoa.fun/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "aiRefinement": true,
    "noiseHint": "Sidebar has many links, footer has copyright info"
  }'

API Reference

POST /api/scrape - Single Page Scraping

Request Body:

Field           Type     Required  Description
url             string   Yes       URL to scrape
downloadImages  boolean  No        Download images (default: false)
aiRefinement    boolean  No        Use AI refinement (default: false)
noiseHint       string   No        Noise hint to help the AI identify boilerplate
format          string   No        Output format: markdown (default) or pdf
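
The same fields expressed as a TypeScript type (mirroring the table above):

interface ScrapeRequest {
  url: string;                  // URL to scrape (required)
  downloadImages?: boolean;     // default: false
  aiRefinement?: boolean;       // default: false
  noiseHint?: string;           // free-text hint describing page noise
  format?: "markdown" | "pdf";  // default: "markdown"
}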

Response:

{
  data: string;           // Markdown content
  metadata: {
    title: string;        // Page title
    url: string;          // Original URL
    images?: {            // Image info (if downloadImages=true)
      download: boolean;
      cdnRewrite: boolean;
      items: Array<{
        original: string;
        cdn: string;
        size: number;
      }>;
    };
  };
}
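
A minimal typed client call, using the ScrapeRequest type from the request-body section above (Node 18+ global fetch assumed; error handling kept to a single check):

// Matches the response shape documented above.
interface ScrapeResponse {
  data: string;
  metadata: {
    title: string;
    url: string;
    images?: {
      download: boolean;
      cdnRewrite: boolean;
      items: Array<{ original: string; cdn: string; size: number }>;
    };
  };
}

async function scrape(req: ScrapeRequest): Promise<ScrapeResponse> {
  const res = await fetch("https://api.alchemy.izoa.fun/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`scrape failed: ${res.status}`);
  return (await res.json()) as ScrapeResponse;
}

// Usage
const page = await scrape({ url: "https://example.com", downloadImages: true });
console.log(page.metadata.title);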

POST /api/crawl - Batch Crawling

Request Body:

Field       Type    Required  Description
url         string  Yes       Starting URL
limit       number  No        Max pages (default: 10, max: 200)
depth       number  No        Crawl depth (default: 2, max: 5)
pathPrefix  string  No        Path prefix filter (e.g. /docs)
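
As a TypeScript type (mirroring the table above):

interface CrawlRequest {
  url: string;          // starting URL (required)
  limit?: number;       // max pages (default: 10, max: 200)
  depth?: number;       // crawl depth (default: 2, max: 5)
  pathPrefix?: string;  // only follow links under this path, e.g. "/docs"
}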

Response: Server-Sent Events (SSE) stream

Event        Data
discovering  { phase, discovered, current? }
detecting    { phase, current }
scraping     { phase, current, progress }
complete     { phase, results, stats }
error        { error, message }
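
The event payloads as a TypeScript union (the exact shapes of results and stats are not spelled out above, so they are left as unknown here):

type CrawlEventData =
  | { phase: "discovering"; discovered: number; current?: string }
  | { phase: "detecting"; current: string }
  | { phase: "scraping"; current: string; progress: { scraped: number; total: number } }
  | { phase: "complete"; results: unknown[]; stats: unknown }
  | { error: string; message: string }; // `error` event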

GET /api/download/:id - Download Results

Download crawl results as a ZIP archive (includes all Markdown and images).
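
A sketch of fetching the archive from Node.js (the crawl id is assumed to come from the complete event of /api/crawl; Node 18+ global fetch):

import { writeFile } from "node:fs/promises";

// Download the ZIP for a finished crawl and save it next to the script.
async function downloadResult(id: string): Promise<void> {
  const res = await fetch(`https://api.alchemy.izoa.fun/api/download/${id}`);
  if (!res.ok) throw new Error(`download failed: ${res.status}`);
  await writeFile(`${id}.zip`, Buffer.from(await res.arrayBuffer()));
}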

Tech Stack

  • Backend: Node.js + Express + TypeScript
  • Crawl Engine: Puppeteer (Chrome Headless)
  • HTML Parsing: Cheerio + Turndown
  • AI Service: new-api proxy (Gemini Flash)
  • Storage: Local file system + in-memory cache
  • Deployment: Docker + Caddy

Use Cases

1. RAG System Data Collection

# Crawl documentation sites to build knowledge base
curl -X POST .../api/crawl \
  -d '{"url": "https://docs.example.com", "limit": 100}'

2. Research Material Organization

# Batch download paper abstracts (jq extracts the Markdown `data` field)
for url in $(cat urls.txt); do
  curl -X POST .../api/scrape \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"$url\"}" | jq -r '.data' >> papers.md
done

3. Content Backup Archiving

# Download entire site as PDF
curl -X POST .../api/crawl \
  -d '{"url": "https://blog.example.com", "format": "pdf"}'

Deployment Guide

Docker Deployment (Recommended)

# Clone repository
git clone https://github.com/user/alchemy-lite.git
cd alchemy-lite

# Configure environment variables
cp .env.example .env
# Edit .env to set API keys

# Start services
docker-compose up -d

# Access
open http://localhost:9002

Local Development

# Install dependencies
pnpm install

# Start dev server
pnpm dev:start  # Start both frontend and backend

# Verify
pnpm verify

See project README.md and Deployment Docs for details.

Performance Metrics

Metric                 Value
Single Page Scraping   ~2-5s (~8s with AI refinement)
Batch Crawling         ~0.5s per page
Image Download         ~50-200ms per image
Concurrency            10+ req/s
Memory Usage           ~200-500MB

Limits and Constraints

  • Crawl Limit: Max 200 pages per crawl
  • Depth Limit: Max depth 5 levels
  • Timeout: 30s per page load
  • Image Size: Max 10MB per image
  • Rate Limit: No hard limit (recommend <10 req/s)

Troubleshooting

Issue: Page Load Timeout

Cause: Slow page load or anti-scraping blocking

Solution:

# Increase timeout (backend config)
PUPPETEER_TIMEOUT=60000

Issue: Image Download Failed

Cause: Image URL hotlink protection or expired

Solution:

  • Check network connection
  • Try without downloadImages
  • Manually download images and upload to your CDN

Issue: AI Refinement Not Effective

Cause: Complex page structure or inaccurate noiseHint

Solution:

# Provide more detailed noise hint
{
  "aiRefinement": true,
  "noiseHint": "Sidebar contains navigation links, Footer has 'About Us' and copyright, Header has login button"
}

Related Resources

Feedback and Support