Alchemy Page

A complete guide to using the Alchemy page for web scraping and multi-page crawling

The Alchemy page is the core interface of this application. It refines web pages from raw "ore" into clean knowledge "gold" and supports two modes: Scrape (single-page extraction) and Crawl (multi-page crawling).

[Screenshot: Alchemy page]


Scrape Mode

Scrape mode extracts content from one or more specific URLs. Ideal for single articles, blog posts, news pages, etc.

Steps

  1. Enter URLs — Type target URLs in the text box; multiple URLs are supported (one per line)
  2. Select output format — Click the format selector, choose Markdown, JSON, or PDF
  3. Configure advanced options (optional) — Enable AI optimization or image download
  4. Click "Start Scraping" — Wait for completion and view results

Options

| Option | Description | Default |
| --- | --- | --- |
| Format | Output format: Markdown / JSON / PDF | Markdown |
| AI Optimization | Use AI to clean noise and extract the core content | Off |
| No Cache | Force a re-fetch, ignoring cached versions | Off |
| Download Images | Save images locally to prevent broken links | Off |

Use Cases: Single technical blog posts, news articles, specific product documentation pages.
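For API-driven use, a scrape request might look like the sketch below. This is an illustration only: the endpoint path /api/scrape and the exact field names for Format and No Cache are assumptions, while enableAI and downloadImages are the parameter names used later in this guide.

```typescript
// Hypothetical request to the scrape backend. The endpoint path and
// payload shape are assumptions; the option meanings come from the
// options table above.
async function scrape(urls: string[]): Promise<unknown> {
  const response = await fetch("/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      urls,                  // one or more target URLs
      format: "markdown",    // "markdown" | "json" | "pdf"
      enableAI: false,       // AI Optimization (off by default)
      noCache: false,        // force a re-fetch, ignoring cache
      downloadImages: false, // save images locally
    }),
  });
  if (!response.ok) throw new Error(`Scrape failed: ${response.status}`);
  return response.json();
}
```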


Crawl Mode

Crawl mode starts from a single URL and automatically discovers and crawls linked pages. Ideal for documentation sites, blog series, full site backups, etc.

Steps

  1. Enter starting URL — Input the crawl starting point
  2. Configure crawl parameters — Set max pages, crawl depth, path prefix, etc.
  3. Click "Start Crawling" — Watch real-time progress and results

Configuration Options

| Option | Description | Default | Recommended |
| --- | --- | --- | --- |
| Max Pages (maxPages) | Maximum number of pages to crawl | 10 | 10-50 |
| Crawl Depth (maxDepth) | Maximum link-traversal depth | 3 | 2-3 |
| Path Prefix (pathPrefix) | Restrict the crawl scope to a specific path | Auto-extracted | /docs/ |
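Expressed as a request body, these options might be wired up as in the sketch below. The /api/crawl endpoint path is an assumption; the parameter names maxPages, maxDepth, and pathPrefix come from the table above.

```typescript
// Hypothetical request to the crawl backend. Endpoint path is an
// assumption; parameter names come from the configuration table.
async function crawl(startUrl: string): Promise<unknown> {
  const response = await fetch("/api/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url: startUrl,
      maxPages: 20,         // stays within the recommended 10-50 range
      maxDepth: 2,          // shallow depth keeps the crawl focused
      pathPrefix: "/docs/", // only follow links under this path
    }),
  });
  if (!response.ok) throw new Error(`Crawl failed: ${response.status}`);
  return response.json();
}
```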

Tip: The path prefix is automatically extracted from the input URL. For example, entering https://example.com/docs/intro auto-sets it to /docs.
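A minimal sketch of how such auto-extraction can work, assuming the rule is simply "drop the last path segment" (the application's actual logic is not documented here):

```typescript
// Illustrative only: derive a path prefix by dropping the last path
// segment, so https://example.com/docs/intro becomes /docs.
function extractPathPrefix(input: string): string {
  const { pathname } = new URL(input);
  const segments = pathname.split("/").filter(Boolean);
  segments.pop(); // drop the final segment ("intro")
  return segments.length ? `/${segments.join("/")}` : "/";
}

console.log(extractPathPrefix("https://example.com/docs/intro")); // "/docs"
```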

Use Cases: Technical documentation sites (Docusaurus, GitBook), blog post series, full site backup, building RAG knowledge bases.


Output Format Comparison

| Format | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Markdown | Smallest size, AI-friendly, easy to edit | No styling | RAG, knowledge bases, LLM processing |
| JSON | Structured data, includes metadata | Requires parsing | API integration, data processing |
| PDF | Print-ready, fixed layout, readable offline | Not editable, larger size | Archiving, reports, sharing |

Advanced Options

AI Optimization (enableAI)

When enabled, the system uses large language models to intelligently process scraped content: removing ads, navigation, footers, and other noise; identifying and preserving main article content; fixing formatting issues.

Note: Requires backend LLM service configuration (e.g., OpenAI API) to use this feature.
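As an illustration of what that backend configuration enables, the sketch below calls the OpenAI chat completions API to clean scraped content. The model choice and prompt are hypothetical; the application's actual pipeline is not documented here.

```typescript
// Illustration only: one way a backend could implement AI cleanup with
// the OpenAI chat completions API. Model and prompt are hypothetical.
async function aiOptimize(raw: string, apiKey: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [
        {
          role: "system",
          content:
            "Remove ads, navigation, and footers from the following " +
            "scraped content. Keep the main article text, fix obvious " +
            "formatting issues, and return Markdown only.",
        },
        { role: "user", content: raw },
      ],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;
}
```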

Image Download (downloadImages)

When enabled, page images are downloaded locally to prevent broken external links and to support offline viewing. Downloaded images are accessible via the /api/images/* endpoint or can be packaged as a ZIP archive.
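A consumer of Markdown output could, for example, rewrite remote image links to point at that local endpoint. The filename-based mapping below is a hypothetical scheme, shown only to illustrate the idea:

```typescript
// Illustration only: rewrite Markdown image links so they point at the
// local /api/images/ endpoint mentioned above. The filename-based
// mapping is a hypothetical scheme, not the documented one.
function localizeImages(markdown: string): string {
  return markdown.replace(
    /!\[([^\]]*)\]\((https?:\/\/[^)\s]+)\)/g,
    (_match: string, alt: string, url: string) => {
      const filename = new URL(url).pathname.split("/").pop() ?? "image";
      return `![${alt}](/api/images/${filename})`;
    },
  );
}
```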

Engine Selection

The system defaults to the auto engine, which picks the most suitable approach for each page:

| Engine | Speed | Use Case |
| --- | --- | --- |
| HTTP | ~170ms | Static HTML websites |
| Browser | ~5s | JavaScript single-page apps (SPAs) |
| Auto | ~1.2s | Tries HTTP first, upgrades to Browser when needed |
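The upgrade heuristic is not documented here, but an HTTP-first strategy generally looks like the sketch below (illustrative only; the renderWithBrowser callback stands in for whatever browser engine the backend uses):

```typescript
// Illustrative HTTP-first strategy: try a plain HTTP fetch, and fall
// back to a (hypothetical) browser-based renderer when the response
// looks like an empty JavaScript shell.
async function autoFetch(
  url: string,
  renderWithBrowser: (url: string) => Promise<string>,
): Promise<string> {
  const response = await fetch(url);
  const html = await response.text();

  // Crude heuristic: a tiny <body> usually means the page is rendered
  // client-side and needs a real browser engine.
  const body = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i)?.[1] ?? "";
  if (response.ok && body.trim().length > 200) {
    return html; // fast path: static HTML was enough
  }
  return renderWithBrowser(url); // slow path: JS-rendered SPA
}
```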

FAQ

Q: Why is the scrape result empty or incomplete? The target site may use JavaScript to render content dynamically; the system automatically upgrades to the Browser engine in that case. If the problem persists, check whether the site uses anti-scraping mechanisms.

Q: Why is crawling slow? Lower concurrency reduces the risk of being blocked by the target site, so some slowness is expected. Setting a reasonable path prefix avoids wasting time on irrelevant pages.

Q: How do I limit the crawl scope? Use the path prefix (pathPrefix) option to restrict crawling to a specific directory. For example, setting it to /docs/ crawls only the documentation directory.
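As an illustration of the scope check this implies (hypothetical code, not the application's implementation):

```typescript
// Hypothetical scope check: follow a discovered link only if it stays
// on the same host and under the configured path prefix.
function inScope(link: string, startUrl: string, pathPrefix: string): boolean {
  const target = new URL(link, startUrl); // resolves relative links
  const origin = new URL(startUrl);
  return target.host === origin.host && target.pathname.startsWith(pathPrefix);
}

inScope("/docs/api", "https://example.com/docs/", "/docs/");  // true
inScope("/blog/post", "https://example.com/docs/", "/docs/"); // false
```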