Scraper - Web Content Extraction

Extract clean content from any webpage. Core tool of Alchemy Lab.

Core Features

  • 🌐 Single Page Scraping: Extract clean Markdown from any URL
  • 🔄 Batch Crawling: Automatically crawl sites following links (up to 200 pages)
  • 🖼️ Image Processing: Auto-download and convert to CDN links
  • 📄 Multi-format Export: Markdown / PDF / ZIP
  • 🤖 AI Refinement: Intelligently remove navigation, ads, and noise

Quick Start

Single Page Scraping

curl -X POST https://api.alchemy.izoa.fun/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "downloadImages": true
  }'

Response Example:

{
  "data": "# Page Title\n\nClean markdown content...",
  "metadata": {
    "title": "Page Title",
    "url": "https://example.com",
    "images": {
      "download": true,
      "cdnRewrite": true,
      "items": [...]
    }
  }
}

Batch Crawling

curl -X POST https://api.alchemy.izoa.fun/api/crawl \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "url": "https://example.com/docs",
    "limit": 10,
    "pathPrefix": "/docs"
  }'

SSE Event Stream:

event: discovering
data: {"phase":"discovering","discovered":5}

event: scraping
data: {"phase":"scraping","current":"https://example.com/docs/intro","progress":{"scraped":1,"total":5}}

event: complete
data: {"phase":"complete","results":[...]}

AI Refinement Mode

curl -X POST https://api.alchemy.izoa.fun/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "aiRefinement": true,
    "noiseHint": "Sidebar has many links, footer has copyright info"
  }'

API Reference

POST /api/scrape - Single Page Scraping

Request Body:

Field           Type     Required  Description
url             string   Yes       URL to scrape
downloadImages  boolean  No        Download images (default: false)
aiRefinement    boolean  No        Use AI refinement (default: false)
noiseHint       string   No        Noise hint to help the AI identify boilerplate
format          string   No        Output format: markdown (default) or pdf
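
The same fields expressed as a TypeScript type (mirroring the table above):

interface ScrapeRequest {
  url: string;                  // URL to scrape (required)
  downloadImages?: boolean;     // default: false
  aiRefinement?: boolean;       // default: false
  noiseHint?: string;           // free-text hint describing page noise
  format?: "markdown" | "pdf";  // default: "markdown"
}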

Response:

{
  data: string;           // Markdown content
  metadata: {
    title: string;        // Page title
    url: string;          // Original URL
    images?: {            // Image info (if downloadImages=true)
      download: boolean;
      cdnRewrite: boolean;
      items: Array<{
        original: string;
        cdn: string;
        size: number;
      }>;
    };
  };
}
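
A minimal typed client call, using the ScrapeRequest type from the request-body section above (Node 18+ global fetch assumed; error handling kept to a single check):

// Matches the response shape documented above.
interface ScrapeResponse {
  data: string;
  metadata: {
    title: string;
    url: string;
    images?: {
      download: boolean;
      cdnRewrite: boolean;
      items: Array<{ original: string; cdn: string; size: number }>;
    };
  };
}

async function scrape(req: ScrapeRequest): Promise<ScrapeResponse> {
  const res = await fetch("https://api.alchemy.izoa.fun/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`scrape failed: ${res.status}`);
  return (await res.json()) as ScrapeResponse;
}

// Usage
const page = await scrape({ url: "https://example.com", downloadImages: true });
console.log(page.metadata.title);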

POST /api/crawl - Batch Crawling

Request Body:

Field       Type    Required  Description
url         string  Yes       Starting URL
limit       number  No        Max pages (default: 10, max: 200)
depth       number  No        Crawl depth (default: 2, max: 5)
pathPrefix  string  No        Path prefix filter (e.g. /docs)
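
As a TypeScript type (mirroring the table above):

interface CrawlRequest {
  url: string;          // starting URL (required)
  limit?: number;       // max pages (default: 10, max: 200)
  depth?: number;       // crawl depth (default: 2, max: 5)
  pathPrefix?: string;  // only follow links under this path, e.g. "/docs"
}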

Response: Server-Sent Events (SSE) stream

Event        Data
discovering  { phase, discovered, current? }
detecting    { phase, current }
scraping     { phase, current, progress }
complete     { phase, results, stats }
error        { error, message }
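
The event payloads as a TypeScript union (the exact shapes of results and stats are not spelled out above, so they are left as unknown here):

type CrawlEventData =
  | { phase: "discovering"; discovered: number; current?: string }
  | { phase: "detecting"; current: string }
  | { phase: "scraping"; current: string; progress: { scraped: number; total: number } }
  | { phase: "complete"; results: unknown[]; stats: unknown }
  | { error: string; message: string }; // `error` event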

GET /api/download/:id - Download Results

Download crawl results as a ZIP archive (includes all Markdown and images).
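
A sketch of fetching the archive from Node.js (the crawl id is assumed to come from the complete event of /api/crawl; Node 18+ global fetch):

import { writeFile } from "node:fs/promises";

// Download the ZIP for a finished crawl and save it next to the script.
async function downloadResult(id: string): Promise<void> {
  const res = await fetch(`https://api.alchemy.izoa.fun/api/download/${id}`);
  if (!res.ok) throw new Error(`download failed: ${res.status}`);
  await writeFile(`${id}.zip`, Buffer.from(await res.arrayBuffer()));
}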

Tech Stack

  • Backend: Node.js + Express + TypeScript
  • Crawl Engine: Puppeteer (Chrome Headless)
  • HTML Parsing: Cheerio + Turndown
  • AI Service: new-api proxy (Gemini Flash)
  • Storage: Local file system + in-memory cache
  • Deployment: Docker + Caddy

Use Cases

1. RAG System Data Collection

# Crawl documentation sites to build knowledge base
curl -X POST .../api/crawl \
  -d '{"url": "https://docs.example.com", "limit": 100}'

2. Research Material Organization

# Batch download paper abstracts (jq extracts the Markdown `data` field)
for url in $(cat urls.txt); do
  curl -X POST .../api/scrape \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"$url\"}" | jq -r '.data' >> papers.md
done

3. Content Backup Archiving

# Download entire site as PDF
curl -X POST .../api/crawl \
  -d '{"url": "https://blog.example.com", "format": "pdf"}'

Deployment Guide

Docker Deployment (Recommended)

# Clone repository
git clone https://github.com/user/alchemy-lite.git
cd alchemy-lite

# Configure environment variables
cp .env.example .env
# Edit .env to set API keys

# Start services
docker-compose up -d

# Access
open http://localhost:9002

Local Development

# Install dependencies
pnpm install

# Start dev server
pnpm dev:start  # Start both frontend and backend

# Verify
pnpm verify

See project README.md and Deployment Docs for details.

Performance Metrics

Metric                 Value
Single Page Scraping   ~2-5s (~8s with AI refinement)
Batch Crawling         ~0.5s per page
Image Download         ~50-200ms per image
Concurrency            10+ req/s
Memory Usage           ~200-500MB

Limits and Constraints

  • Crawl Limit: Max 200 pages per crawl
  • Depth Limit: Max depth 5 levels
  • Timeout: 30s per page load
  • Image Size: Max 10MB per image
  • Rate Limit: No hard limit (recommend <10 req/s)

Troubleshooting

Issue: Page Load Timeout

Cause: Slow page load or anti-scraping blocking

Solution:

# Increase timeout (backend config)
PUPPETEER_TIMEOUT=60000

Issue: Image Download Failed

Cause: Image URL hotlink protection or expired

Solution:

  • Check network connection
  • Try without downloadImages
  • Manually download images and upload to your CDN

Issue: AI Refinement Not Effective

Cause: Complex page structure or inaccurate noiseHint

Solution:

# Provide more detailed noise hint
{
  "aiRefinement": true,
  "noiseHint": "Sidebar contains navigation links, Footer has 'About Us' and copyright, Header has login button"
}

Related Resources

Feedback and Support