"Feishu Document Scraping Exploration: From Password Authentication to Virtual Scrolling"

A deep exploration of Feishu document scraping, covering the complete technical decision-making process from password authentication to virtual scrolling

"The best code is no code." — The most important lesson we learned from the Feishu exploration.

Background: A Seemingly Simple Request

On January 4, 2026, user feedback triggered our exploration journey:

"I want to scrape Feishu document content, but the page requires a password to access."

The requirement seemed straightforward:

  1. Feishu documents support password-protected sharing
  2. Users want to pass in a password, automatically complete authentication, and scrape content
  3. Finally output clean Markdown

This looked like a typical "20 lines of code to fix" requirement — at least that's what we initially thought.


Chapter 1: Small Victory with Password Authentication

Problem Analysis

Feishu's password-protected page structure:

  • Password input: input[placeholder="请输入密码"]
  • Confirm button: button containing "确定" text
  • Authentication triggers an SPA navigation, not a traditional page refresh

Solution

We designed a generic auth configuration:

// src/scraper/presets/types.ts
interface PasswordAuth {
  type: "password";
  detectSelector: string;      // Detect if on password page
  passwordSelector: string;    // Password input field
  submitSelector: string;      // Submit button
  contentWaitSelector?: string; // Content to wait for after auth
}
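
For Feishu, this configuration might be filled in roughly as follows. The selector strings come from the page structure described above; treat the exact values as illustrative assumptions, not verified preset code:

```typescript
// Hypothetical Feishu auth config; selectors are assumptions taken
// from the page structure observed above, not verified values.
const feishuAuth = {
  type: "password",
  detectSelector: 'input[placeholder="请输入密码"]',
  passwordSelector: 'input[placeholder="请输入密码"]',
  // Plain CSS cannot match by text; clicking the "确定" button would
  // need the runtime's text-matching support (e.g. a text pseudo-selector).
  submitSelector: "button",
  contentWaitSelector: "[data-block-id]",
};
```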

Implement password authentication flow:

// src/scraper/browser.ts
async function handlePasswordAuth(
  page: Page,
  authConfig: PasswordAuth,
  options: { password?: string },
) {
  // 1. Detect whether the password gate is present
  const hasPasswordPage = await page.$(authConfig.detectSelector);
  if (!hasPasswordPage) return;

  if (!options.password) {
    throw new Error("Page requires a password, but none was provided");
  }

  // 2. Enter the password and submit
  await page.type(authConfig.passwordSelector, options.password);
  await page.click(authConfig.submitSelector);

  // 3. Wait for content to render: the SPA navigates without a page
  //    reload, so wait for a content selector, not a navigation event
  if (authConfig.contentWaitSelector) {
    await page.waitForSelector(authConfig.contentWaitSelector);
  }
}

Unexpected Discovery: SPAs Need Stylesheets

During testing, we hit an interesting issue: the page rendered blank.

The reason: we block CSS loading by default to improve performance, but Feishu, as an SPA, relies on CSS for critical rendering logic. The solution was to let presets customize the blocking strategy:

// feishu preset
options: {
  blockResources: ["image", "media"], // keep stylesheets, block images/media
}
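
Under the hood, blockResources can map onto request interception. Below is a minimal sketch of the decision function, with Puppeteer-style wiring shown only in comments; both the function and the wiring are assumptions for illustration, not the project's actual code:

```typescript
// Pure decision function: should a request with this resource type be
// aborted, given the preset's blockResources list?
function shouldBlock(resourceType: string, blocked: string[]): boolean {
  return blocked.includes(resourceType);
}

// Wiring sketch (Puppeteer-style, shown as comments only):
//   await page.setRequestInterception(true);
//   page.on("request", (req) => {
//     shouldBlock(req.resourceType(), preset.options.blockResources ?? [])
//       ? req.abort()
//       : req.continue();
//   });
```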

Investment: ~100 lines of code
Benefit: Password authentication works

This was a small victory. But the real challenge was just beginning.


Chapter 2: Virtual Scrolling — The Real Boss

Verify Content Completeness

After successful authentication, we scraped a Feishu document and exported to /tmp/claude-code-feishu.md.

$ wc -l /tmp/claude-code-feishu.md
119 /tmp/claude-code-feishu.md

Only 119 lines? The original document should have dozens of chapters. Let's look at the content:

# Claude Code从入门到狂飙

👍 Hi everyone, I'm Zhihui, 10 years in big data architecture...
(author introduction)
(a few image references)
(learning roadmap heading)
...

Only the title and the opening author intro survived; the body content was completely lost.

Root Cause Analysis

We deeply analyzed Feishu's DOM structure:

// Execute in DevTools
document.querySelectorAll('[data-block-id]').length
// Result: 12 (only 12 blocks within the viewport)

It turns out Feishu uses virtual scrolling:

  1. Only renders content blocks within the viewport
  2. Dynamically loads new blocks and unloads old blocks when scrolling
  3. DOM always contains only a small number of elements

This isn't a bug but a standard performance optimization. For content scraping, however, it means the traditional "wait for page load → extract DOM" process fails completely.
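
To make the failure mode concrete, here is a toy model of virtual scrolling (illustrative names only, not project code): only a viewport-sized window of blocks ever "exists" in the DOM, so a one-shot extraction sees a fraction of the document, while scrolling and accumulating recovers all of it:

```typescript
// Model: a document with `total` blocks, of which only `viewport`
// consecutive blocks are rendered at any scroll offset.
function visibleBlocks(total: number, viewport: number, offset: number): number[] {
  const start = Math.min(offset, Math.max(0, total - viewport));
  return Array.from({ length: Math.min(viewport, total) }, (_, i) => start + i);
}

// Traditional "wait for load → extract DOM": one snapshot at offset 0.
function scrapeOnce(total: number, viewport: number): number[] {
  return visibleBlocks(total, viewport, 0);
}

// Scroll-and-accumulate: collect block IDs across scroll positions,
// deduplicating, since blocks unload as they leave the viewport.
function scrapeWithScroll(total: number, viewport: number): number[] {
  const seen = new Set<number>();
  for (let offset = 0; offset < total; offset += viewport) {
    for (const id of visibleBlocks(total, viewport, offset)) seen.add(id);
  }
  return [...seen].sort((a, b) => a - b);
}
```

With 60 blocks and a 12-block viewport, the one-shot scrape captures only the first 12 blocks, matching the 12-element DevTools result above.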

Evaluation: Is This Feature Worth Doing?

Here we paused and evaluated from a Google Distinguished Product Manager's perspective:

Investment:

  • Already invested: ~100 lines (password auth + blockResources)
  • Solving virtual scrolling would need: ~50-100 more lines (scroll + waitForSelector actions)

Return:

  • Password auth: ✅ Functional, but limited value
  • Content completeness: ❌ Core problem not solved

Conclusion: We are in a half-baked state — password passes but content is incomplete. If users try it, the experience is "authentication succeeded, but result is garbage."


Chapter 3: Systematic Planning — From Problem to Architecture

Competitor Research

Before deciding whether to continue, let's look at solutions in the market.

Firecrawl Official Actions

// Firecrawl's action schema
type ActionType = "wait" | "click" | "write" | "press" | "screenshot";

Discovery: Firecrawl supports five browser action types, but not scroll, which is exactly the core capability we need.

Playwright MCP

// Playwright MCP's tool set
tools: [
  "browser_navigate",
  "browser_click",
  "browser_type",
  "browser_scroll",  // ← Yes!
  "browser_screenshot",
  "browser_get_accessibility_tree",
  // ... 20+ tools
]

Playwright MCP is a complete browser automation solution with powerful features but high complexity.

browser-use

This is the other extreme: an AI agent autonomously controls the browser:

# imports as in browser-use's quickstart (added here for completeness)
from browser_use import Agent
from langchain_anthropic import ChatAnthropic

agent = Agent(
    task="Go to Reddit, search for 'browser-use' in search bar",
    llm=ChatAnthropic(model_name="claude-3-5-sonnet-20241022"),
)
await agent.run()

Capability Layering Model

Based on research, we proposed a three-layer capability model:

┌──────────────────────────────────────────────────────────┐
│  Level 3: AI Agent                                       │
│  Natural language → LLM autonomous planning → Execution  │
│  (browser-use, future vision)                            │
├──────────────────────────────────────────────────────────┤
│  Level 2: Browser Automation                             │
│  Declarative Action Sequence → Deterministic execution   │
│  (Firecrawl Actions, what we need)                       │
├──────────────────────────────────────────────────────────┤
│  Level 1: Static Preset                                  │
│  selector + waitFor → Single scrape                      │
│  (Current SitePreset)                                    │
└──────────────────────────────────────────────────────────┘

The minimal solution to the Feishu problem sits at Level 2 and needs only two action types:

  • scroll: Scroll to bottom
  • waitForSelector: Wait for new content to load

Action Sequence Design

interface ActionSequence {
  trigger: "onLoad" | "onAuth" | "preScrape";
  actions: Action[];
  maxIterations?: number;
  stopCondition?: string;
}

type Action =
  | { type: "scroll"; direction: "down" | "up"; distance?: number }
  | { type: "waitForSelector"; selector: string; timeout?: number }
  | { type: "wait"; ms: number };

// Feishu virtual scroll configuration
const feishuBehavior: ActionSequence = {
  trigger: "preScrape",
  actions: [
    { type: "scroll", direction: "down" },
    { type: "waitForSelector", selector: "[data-block-id]" },
  ],
  maxIterations: 100,
  stopCondition: "noNewContent",
};

Estimated Investment: ~50 lines of core logic
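
Those ~50 lines could look roughly like the sketch below. The PageLike interface, the iteration-count return value, and the scroll-position stop check are all assumptions for illustration; a real implementation would run against Puppeteer or Playwright:

```typescript
// Design sketch of an ActionSequence executor. "noNewContent" is
// approximated here as "the scroll position stopped advancing".
interface PageLike {
  scroll(direction: "down" | "up", distance?: number): Promise<void>;
  waitForSelector(selector: string, timeout?: number): Promise<void>;
  scrollPosition(): Promise<number>;
}

type Action =
  | { type: "scroll"; direction: "down" | "up"; distance?: number }
  | { type: "waitForSelector"; selector: string; timeout?: number }
  | { type: "wait"; ms: number };

// Returns the number of iterations executed before the stop condition hit.
async function runSequence(
  page: PageLike,
  actions: Action[],
  maxIterations = 100,
): Promise<number> {
  let lastPos = -1;
  for (let i = 0; i < maxIterations; i++) {
    for (const action of actions) {
      if (action.type === "scroll") {
        await page.scroll(action.direction, action.distance);
      } else if (action.type === "waitForSelector") {
        await page.waitForSelector(action.selector, action.timeout);
      } else {
        await new Promise((resolve) => setTimeout(resolve, action.ms));
      }
    }
    const pos = await page.scrollPosition();
    if (pos === lastPos) return i + 1; // noNewContent: page stopped moving
    lastPos = pos;
  }
  return maxIterations;
}
```

The executor loops the declared actions until the page stops producing new content or maxIterations is exhausted, which bounds runtime on infinitely scrolling pages.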


Chapter 4: A Bigger Question — Should We Make a Separate Component?

The Temptation of "browserless"

When designing Action Sequence, a thought emerged:

"Browser automation is a general capability. Should it be extracted into a separate component, like Browserless.io?"

This is a strategic question about architecture. We analyzed it using two thinking frameworks.

Linus Torvalds Perspective: "Don't reinvent the wheel"

"I'm not a visionary, I'm an engineer."

Existing ecosystem:

  • Browserless.io: Commercial-grade Headless Chrome service
  • Playwright: Microsoft product, excellent API design
  • Puppeteer: Google product, Chrome native
  • Playwright MCP: 20+ tools, direct Claude integration
  • browser-use: AI Agent framework, active development

Conclusion: We have no opportunity for 10x improvement. Reinventing the wheel is a waste of time.

Google Distinguished Product Manager Perspective

Three questions:

  1. Is there 10x improvement? No, existing solutions are mature enough
  2. Is there unique positioning? No, browser-use is more AI-native than we are
  3. What is the core competitiveness? It's knowledge accumulation in site presets, not browser control

Conclusion: Don't make a separate component. Focus on the preset configuration layer.

"The best code is no code."


Chapter 5: Final Decision

Feishu Feature Status: Experimental

After complete exploration, our decision is:

  1. Keep existing code: Password auth + blockResources can be reused
  2. Mark as experimental: Feature incomplete, not recommended for production use
  3. Future path: When there are more virtual scrolling site requirements, implement minimal Action Sequence

Requirement Planning

Created feat-dynamic-scraper-framework (P1):

#### Problem
- Feishu, Notion and other modern document platforms use virtual scrolling
- Existing Scraper lacks systematic mechanism to handle "interaction-generated content"

#### Expectation
- Extend SitePreset to support behavior/actions declarations
- Implement generic VirtualScrollHandler
- Systematically solve consistency issues for all "dynamic loading" type sites

Code Status

# Committed (keep)
- feishu preset basic configuration
- blockResources option
- Preset matching logic

# Uncommitted (experimental, documented)
- handlePasswordAuth function
- auth configuration types
- password parameter passing

Philosophical Reflection: The Value of Exploration

This exploration didn't produce a complete feature, but we gained something more important.

1. Clear Understanding of Problem Boundaries

"Virtual scrolling is not a bug, it's a feature."

Modern web applications universally use lazy loading techniques for performance. Traditional static scraping thinking needs to evolve.

2. Architectural Vision of Capability Layering

Static Preset → Action Sequence → AI Agent

This layering model applies not only to browser automation, but also to many "from simple to complex" system designs.

3. The Decision "Not to Do" Is Equally Important

We evaluated whether to make a separate browserless component and ultimately decided not to. This "not to do" decision avoided weeks or even months of ineffective investment.

4. Methodology of Competitor Research

Discover problem → Research competitors → Distill differences → Find positioning

Firecrawl lacks scroll, Playwright MCP is too heavy, browser-use is the future. Our positioning is knowledge accumulation at the preset configuration layer.


Next Steps

  • Document complete exploration process
  • Create feat-dynamic-scraper-framework requirement
  • (Experiment) Start implementation when there are 3+ virtual scrolling site requirements
  • (Future) Evaluate feasibility of Playwright MCP integration


Exploration time: 2026-01-04 ~ 2026-01-05
Participants: User, Claude (Strategist)