Scraper - Web Content Extraction
Extract clean content from any webpage. Core tool of Alchemy Lab.
Core Features
- 🌐 Single Page Scraping: Extract clean Markdown from any URL
- 🔄 Batch Crawling: Automatically crawl sites following links (up to 200 pages)
- 🖼️ Image Processing: Auto-download and convert to CDN links
- 📄 Multi-format Export: Markdown / PDF / ZIP
- 🤖 AI Refinement: Intelligently remove navigation, ads, and noise
Quick Start
Single Page Scraping
curl -X POST https://api.alchemy.izoa.fun/api/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"downloadImages": true
}'
Response Example:
{
"data": "# Page Title\n\nClean markdown content...",
"metadata": {
"title": "Page Title",
"url": "https://example.com",
"images": {
"download": true,
"cdnRewrite": true,
"items": [...]
}
}
}
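The same call can be made from Node.js 18+ (built-in fetch). A minimal TypeScript sketch, run as an ES module so top-level await works; the helper name scrape and the output file page.md are illustrative, and only the documented url and downloadImages fields are used:

```typescript
// scrape.ts - minimal client for POST /api/scrape (illustrative sketch, not an official SDK)
import { writeFile } from "node:fs/promises";

interface ScrapeResult {
  data: string;                                 // Markdown content
  metadata: { title: string; url: string };     // subset of the documented metadata
}

async function scrape(url: string): Promise<ScrapeResult> {
  const res = await fetch("https://api.alchemy.izoa.fun/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, downloadImages: true }),
  });
  if (!res.ok) throw new Error(`scrape failed: HTTP ${res.status}`);
  return (await res.json()) as ScrapeResult;
}

// Save the extracted Markdown to disk
const result = await scrape("https://example.com");
await writeFile("page.md", result.data, "utf8");
console.log(`Saved "${result.metadata.title}"`);
```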
Batch Crawling
curl -X POST https://api.alchemy.izoa.fun/api/crawl \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"url": "https://example.com/docs",
"limit": 10,
"pathPrefix": "/docs"
}'
SSE Event Stream:
event: discovering
data: {"phase":"discovering","discovered":5}
event: scraping
data: {"phase":"scraping","current":"https://example.com/docs/intro","progress":{"scraped":1,"total":5}}
event: complete
data: {"phase":"complete","results":[...]}
AI Refinement Mode
curl -X POST https://api.alchemy.izoa.fun/api/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"aiRefinement": true,
"noiseHint": "Sidebar has many links, footer has copyright info"
}'
API Reference
POST /api/scrape - Single Page Scraping
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | ✅ | URL to scrape |
| downloadImages | boolean | ❌ | Download images (default false) |
| aiRefinement | boolean | ❌ | Use AI refinement (default false) |
| noiseHint | string | ❌ | Hint describing the noise (helps the AI identify it) |
| format | string | ❌ | Output format: markdown (default) or pdf |
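For typed clients, the same fields can be expressed as an interface. This is a sketch mirroring the table above; the name ScrapeRequest is illustrative and not part of the API:

```typescript
// Illustrative typing of the /api/scrape request body
interface ScrapeRequest {
  url: string;                     // URL to scrape (required)
  downloadImages?: boolean;        // default: false
  aiRefinement?: boolean;          // default: false
  noiseHint?: string;              // hint that helps the AI identify noise
  format?: "markdown" | "pdf";     // default: "markdown"
}
```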
Response:
{
data: string; // Markdown content
metadata: {
title: string; // Page title
url: string; // Original URL
images?: { // Image info (if downloadImages=true)
download: boolean;
cdnRewrite: boolean;
items: Array<{
original: string;
cdn: string;
size: number;
}>;
};
};
}
POST /api/crawl - Batch Crawling
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | ✅ | Starting URL |
| limit | number | ❌ | Max pages (default 10, max 200) |
| depth | number | ❌ | Crawl depth (default 2, max 5) |
| pathPrefix | string | ❌ | Path prefix filter (e.g., /docs) |
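As with /api/scrape, the crawl request body maps onto a small interface (a sketch; CrawlRequest is an illustrative name):

```typescript
// Illustrative typing of the /api/crawl request body
interface CrawlRequest {
  url: string;            // starting URL (required)
  limit?: number;         // max pages: default 10, max 200
  depth?: number;         // crawl depth: default 2, max 5
  pathPrefix?: string;    // only follow links under this path, e.g. "/docs"
}
```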
Response: Server-Sent Events (SSE) stream
| Event Type | Data |
|---|---|
| discovering | { phase, discovered, current? } |
| detecting | { phase, current } |
| scraping | { phase, current, progress } |
| complete | { phase, results, stats } |
| error | { error, message } |
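These payloads can be modelled as a discriminated union keyed on the event name. This is a loose sketch; the field types are inferred from the example stream above and may not match the server's exact shapes:

```typescript
// Loose typing of the crawl SSE events (field types inferred from the example stream)
type CrawlEvent =
  | { event: "discovering"; data: { phase: "discovering"; discovered: number; current?: string } }
  | { event: "detecting"; data: { phase: "detecting"; current: string } }
  | { event: "scraping"; data: { phase: "scraping"; current: string; progress: { scraped: number; total: number } } }
  | { event: "complete"; data: { phase: "complete"; results: unknown[]; stats: unknown } }
  | { event: "error"; data: { error: string; message: string } };
```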
GET /api/download/:id - Download Results
Download crawl results as a ZIP archive (includes all Markdown and images).
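Downloading the archive is a plain GET. A short TypeScript sketch; crawlId is a placeholder for the result identifier, and how that id is returned is not shown in this section:

```typescript
// download.ts - fetch a crawl result as a ZIP (sketch; crawlId is a placeholder)
import { writeFile } from "node:fs/promises";

const crawlId = "<your-crawl-id>";
const res = await fetch(`https://api.alchemy.izoa.fun/api/download/${crawlId}`);
if (!res.ok) throw new Error(`download failed: HTTP ${res.status}`);

// The archive contains all Markdown files and downloaded images
await writeFile("crawl-results.zip", new Uint8Array(await res.arrayBuffer()));
```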
Tech Stack
- Backend: Node.js + Express + TypeScript
- Crawl Engine: Puppeteer (Chrome Headless)
- HTML Parsing: Cheerio + Turndown
- AI Service: new-api proxy (Gemini Flash)
- Storage: Local file system + in-memory cache
- Deployment: Docker + Caddy
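To illustrate the parsing step of this stack: Cheerio can drop obvious noise elements from the fetched HTML before Turndown converts what remains into Markdown. A simplified sketch, not the project's actual implementation; the selectors are examples only:

```typescript
// Simplified HTML-to-Markdown step built on Cheerio + Turndown (illustrative only)
import * as cheerio from "cheerio";
import TurndownService from "turndown";

export function htmlToMarkdown(html: string): string {
  const $ = cheerio.load(html);

  // Remove elements that are rarely part of the main content
  $("script, style, nav, footer, aside").remove();

  // Convert the cleaned HTML to Markdown
  const turndown = new TurndownService({ headingStyle: "atx" });
  return turndown.turndown($("body").html() ?? "");
}
```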
Use Cases
1. RAG System Data Collection
# Crawl documentation sites to build knowledge base
curl -X POST .../api/crawl \
-d '{"url": "https://docs.example.com", "limit": 100}'
2. Research Material Organization
# Batch download paper abstracts
for url in $(cat urls.txt); do
curl -s -X POST .../api/scrape -d "{\"url\":\"$url\"}" | jq -r '.data' >> papers.md
done
3. Content Backup Archiving
# Download entire site as PDF
curl -X POST .../api/crawl \
-d '{"url": "https://blog.example.com", "format": "pdf"}'
Deployment Guide
Docker Deployment (Recommended)
# Clone repository
git clone https://github.com/user/alchemy-lite.git
cd alchemy-lite
# Configure environment variables
cp .env.example .env
# Edit .env to set API keys
# Start services
docker-compose up -d
# Access
open http://localhost:9002
Local Development
# Install dependencies
pnpm install
# Start dev server
pnpm dev:start # Start both frontend and backend
# Verify
pnpm verify
See project README.md and Deployment Docs for details.
Performance Metrics
| Metric | Value |
|---|---|
| Single Page Scraping | ~2-5s (with AI ~8s) |
| Batch Crawling | ~0.5s/page |
| Image Download | ~50-200ms/image |
| Concurrency | 10+ req/s |
| Memory Usage | ~200-500MB |
Limits and Constraints
- Crawl Limit: Max 200 pages per crawl
- Depth Limit: Max depth 5 levels
- Timeout: Page load timeout 30s
- Image Size: Max 10MB per image
- Rate Limit: No hard limit (recommend <10 req/s)
Troubleshooting
Issue: Page Load Timeout
Cause: Slow page load or anti-scraping blocking
Solution:
# Increase timeout (backend config)
PUPPETEER_TIMEOUT=60000
Issue: Image Download Failed
Cause: The image URL is hotlink-protected or has expired
Solution:
- Check the network connection
- Retry without downloadImages
- Manually download the images and upload them to your CDN
Issue: AI Refinement Not Effective
Cause: Complex page structure or inaccurate noiseHint
Solution:
# Provide more detailed noise hint
{
"aiRefinement": true,
"noiseHint": "Sidebar contains navigation links, Footer has 'About Us' and copyright, Header has login button"
}
Related Resources
- Source Code: GitHub - alchemy-lite
- Online Use: alchemy.izoa.fun
- Product Roadmap: Insights - Vision
- Best Practices: Guides - Scraping Best Practices
- API Reference: API section in this document
Feedback and Support
- Issue Report: GitHub Issues
- Feature Requests: GitHub Discussions
- Contact: admin@izoa.fun