"Feishu Document Scraping Exploration: From Password Authentication to Virtual Scrolling"
A deep exploration of Feishu document scraping, covering the complete technical decision-making process from password authentication to virtual scrolling
"The best code is no code." — The most important lesson we learned from the Feishu exploration.
Background: A Seemingly Simple Request
On January 4, 2026, user feedback triggered our exploration journey:
"I want to scrape Feishu document content, but the page requires a password to access."
The requirement seemed straightforward:
- Feishu documents support password-protected sharing
- Users want to pass in a password, automatically complete authentication, and scrape content
- Finally output clean Markdown
This looked like a typical "20 lines of code to fix" requirement — at least that's what we initially thought.
Chapter 1: Small Victory with Password Authentication
Problem Analysis
Feishu's password-protected page structure:
- Password input: input[placeholder="请输入密码"] (the placeholder reads "Please enter the password")
- Confirm button: a button containing the text "确定" ("Confirm")
- After authentication the page performs an SPA navigation, not a traditional page refresh
Solution
We designed a generic auth configuration:
// src/scraper/presets/types.ts
interface PasswordAuth {
  type: "password";
  detectSelector: string; // Detect if on password page
  passwordSelector: string; // Password input field
  submitSelector: string; // Submit button
  contentWaitSelector?: string; // Content to wait for after auth
}
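As an illustration, a Feishu preset might fill in this configuration roughly as follows. This is a sketch under assumptions, not the project's actual preset: the interface is repeated so the block stands alone, the submit selector uses Playwright-style `:has-text()` syntax (plain CSS cannot match on button text), and using `[data-block-id]` as the post-auth content marker is a guess based on Feishu's block markup.

```typescript
// Repeated from the article so this block is self-contained.
interface PasswordAuth {
  type: "password";
  detectSelector: string;
  passwordSelector: string;
  submitSelector: string;
  contentWaitSelector?: string;
}

// Hypothetical values for illustration only.
const feishuAuth: PasswordAuth = {
  type: "password",
  // The password input doubles as the "are we on the password page?" probe.
  detectSelector: 'input[placeholder="请输入密码"]',
  passwordSelector: 'input[placeholder="请输入密码"]',
  // Assumption: a Playwright-style text selector is available.
  submitSelector: 'button:has-text("确定")',
  // Assumption: a rendered document block signals that content is ready.
  contentWaitSelector: "[data-block-id]",
};
```

Reusing the input selector for detection keeps the preset to one fragile selector instead of two.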
The password authentication flow is then implemented as:
// src/scraper/browser.ts
async function handlePasswordAuth(page, authConfig, options) {
  // 1. Detect whether we landed on the password page; if not, nothing to do
  const hasPasswordPage = await page.$(authConfig.detectSelector);
  if (!hasPasswordPage) return;
  // 2. Enter the password and submit
  await page.type(authConfig.passwordSelector, options.password);
  await page.click(authConfig.submitSelector);
  // 3. Wait for content to render. The SPA swaps views without a page load,
  //    so we wait on a selector; contentWaitSelector is optional, so guard it
  if (authConfig.contentWaitSelector) {
    await page.waitForSelector(authConfig.contentWaitSelector);
  }
}
Unexpected Discovery: SPAs Need Stylesheets
During testing, we hit an interesting issue: the page rendered blank.
The reason was that we block CSS loading by default to improve performance, but Feishu, as an SPA, depends on its stylesheets for critical rendering logic. The solution is to let each preset customize the blocking strategy:
// feishu preset
options: {
  blockResources: ["image", "media"], // Keep stylesheet
}
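The override logic can be sketched as a small pure predicate. The names `DEFAULT_BLOCKED`, `resolveBlockList`, and `shouldBlock` are hypothetical, and the default list is an assumption (the article only states that CSS was blocked by default).

```typescript
// Subset of browser resource types; assumed list for illustration.
type ResourceType = "document" | "stylesheet" | "script" | "image" | "media" | "font";

// Hypothetical default block list that includes stylesheets.
const DEFAULT_BLOCKED: readonly ResourceType[] = ["stylesheet", "image", "media", "font"];

// A preset's blockResources, when present, replaces the default list entirely.
function resolveBlockList(presetBlock?: readonly ResourceType[]): readonly ResourceType[] {
  return presetBlock ?? DEFAULT_BLOCKED;
}

// Decide per request whether to abort it. Types not in the list pass through.
function shouldBlock(resourceType: string, blockList: readonly string[]): boolean {
  return blockList.includes(resourceType);
}

// The Feishu preset keeps stylesheets by leaving them off its list.
const feishuBlockList = resolveBlockList(["image", "media"]);
console.log(shouldBlock("stylesheet", feishuBlockList));
console.log(shouldBlock("image", feishuBlockList));
```

In a real scraper this predicate would sit inside the request interception hook, for example Puppeteer's `page.setRequestInterception(true)` with a `request` handler that calls `request.abort()` or `request.continue()`, or Playwright's `page.route()`.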
Investment: ~100 lines of code
Benefit: Password authentication functionality works
This was a small victory. But the real challenge was just beginning.
Chapter 2: Virtual Scrolling — The Real Boss
Verify Content Completeness
After successful authentication, we scraped a Feishu document and exported it to /tmp/claude-code-feishu.md.
$ wc -l /tmp/claude-code-feishu.md
119 /tmp/claude-code-feishu.md
Only 119 lines? The original document should have dozens of chapters. Let's look at the content:
# Claude Code从入门到狂飙
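👍 Hi everyone, I'm 志辉 (Zhihui), with 10 years in big data architecture...
(author introduction)
(several image references)
(learning roadmap heading)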
...
Only the title and opening author intro — the body content was completely lost.
Root Cause Analysis
We inspected Feishu's DOM structure:
// Execute in DevTools
document.querySelectorAll('[data-block-id]').length
// Result: 12 (only 12 blocks within the viewport)
It turns out Feishu uses virtual scrolling:
- Only renders content blocks within the viewport
- Dynamically loads new blocks and unloads old blocks when scrolling
- DOM always contains only a small number of elements
This isn't a bug, but standard practice for performance optimization. But for content scraping, this means the traditional "wait for page load → extract DOM" process completely fails.
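One way to cope with blocks being unloaded as the viewport moves is to harvest whatever is rendered after each scroll step and merge it into an accumulator keyed by block ID. The sketch below assumes that `data-block-id` values are unique and stable across re-renders, which the article does not confirm; `BlockSnapshot` and `mergeSnapshot` are hypothetical names.

```typescript
// One rendered block as captured from the DOM at a given scroll position.
interface BlockSnapshot {
  id: string; // value of the data-block-id attribute
  text: string; // extracted text or HTML of the block
}

// Merge one viewport snapshot into the accumulator.
// Returns how many blocks were not seen before, which can drive a stop condition.
function mergeSnapshot(seen: Map<string, string>, snapshot: BlockSnapshot[]): number {
  let added = 0;
  for (const block of snapshot) {
    if (!seen.has(block.id)) {
      seen.set(block.id, block.text);
      added++;
    }
  }
  return added;
}

// Demo with overlapping viewports, as produced by scrolling top to bottom.
const collected = new Map<string, string>();
mergeSnapshot(collected, [
  { id: "b1", text: "Title" },
  { id: "b2", text: "Intro" },
]);
mergeSnapshot(collected, [
  { id: "b2", text: "Intro" },
  { id: "b3", text: "Chapter 1" },
]);
console.log(Array.from(collected.values()).join("\n"));
```

A `Map` preserves insertion order, so scrolling strictly downward yields blocks in document order without any extra sorting.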
Evaluation: Is This Feature Worth Doing?
At this point we paused and evaluated the work from the perspective of a distinguished Google product manager:
Investment:
- Already invested: ~100 lines (password auth + blockResources)
- Solving virtual scrolling needs: ~50-100 lines (scroll + waitForSelector action)
Return:
- Password auth: ✅ Functional, but limited value
- Content completeness: ❌ Core problem not solved
Conclusion: We are in a half-baked state — password passes but content is incomplete. If users try it, the experience is "authentication succeeded, but result is garbage."
Chapter 3: Systematic Planning — From Problem to Architecture
Competitor Research
Before deciding whether to continue, let's look at solutions in the market.
Firecrawl Official Actions
// Firecrawl's action schema
type ActionType = "wait" | "click" | "write" | "press" | "screenshot";
Discovery: the Firecrawl action schema we reviewed lists 5 browser action types but no scroll, and scroll is exactly the core capability we need.
Playwright MCP
// Playwright MCP's tool set
tools: [
  "browser_navigate",
  "browser_click",
  "browser_type",
  "browser_scroll", // ← Yes!
  "browser_screenshot",
  "browser_get_accessibility_tree",
  // ... 20+ tools
]
Playwright MCP is a complete browser automation solution with powerful features but high complexity.
browser-use
This is the other extreme, where an AI agent controls the browser autonomously:
agent = Agent(
    task="Go to Reddit, search for 'browser-use' in search bar",
    llm=ChatAnthropic(model_name="claude-3-5-sonnet-20241022"),
)
await agent.run()
Capability Layering Model
Based on research, we proposed a three-layer capability model:
┌─────────────────────────────────────────────────────────┐
│ Level 3: AI Agent │
│ Natural language → LLM autonomous planning → Execution │
│ (browser-use, future vision) │
├─────────────────────────────────────────────────────────┤
│ Level 2: Browser Automation │
│ Declarative Action Sequence → Deterministic execution │
│ (Firecrawl Actions, what we need) │
├─────────────────────────────────────────────────────────┤
│ Level 1: Static Preset │
│ selector + waitFor → Single scrape │
│ (Current SitePreset) │
└─────────────────────────────────────────────────────────┘
Minimal solution for the Feishu problem: Level 2, which needs only 2 action types:
- scroll: Scroll to bottom
- waitForSelector: Wait for new content to load
Action Sequence Design
interface ActionSequence {
  trigger: "onLoad" | "onAuth" | "preScrape";
  actions: Action[];
  maxIterations?: number;
  stopCondition?: string;
}
type Action =
  | { type: "scroll"; direction: "down" | "up"; distance?: number }
  | { type: "waitForSelector"; selector: string; timeout?: number }
  | { type: "wait"; ms: number };
// Feishu virtual scroll configuration
const feishuBehavior: ActionSequence = {
  trigger: "preScrape",
  actions: [
    { type: "scroll", direction: "down" },
    { type: "waitForSelector", selector: "[data-block-id]" },
  ],
  maxIterations: 100,
  stopCondition: "noNewContent",
};
Estimated Investment: ~50 lines of core logic
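To make that estimate concrete, here is a minimal sketch of what the executor loop could look like. It is not an implemented part of the project: `PageDriver` and `runActionSequence` are hypothetical names, the driver abstracts over whichever browser library is in use, and the interpretation of `"noNewContent"` (stop when a full pass over the actions yields no unseen blocks) is an assumption.

```typescript
type Action =
  | { type: "scroll"; direction: "down" | "up"; distance?: number }
  | { type: "waitForSelector"; selector: string; timeout?: number }
  | { type: "wait"; ms: number };

interface ActionSequence {
  trigger: "onLoad" | "onAuth" | "preScrape";
  actions: Action[];
  maxIterations?: number;
  stopCondition?: string;
}

// Hypothetical abstraction over the real browser page.
interface PageDriver {
  scroll(direction: "down" | "up", distance?: number): Promise<void>;
  waitForSelector(selector: string, timeout?: number): Promise<void>;
  // Harvests currently rendered blocks and reports how many were unseen.
  collectNewBlocks(): Promise<number>;
}

// Runs the actions repeatedly; returns the number of completed iterations.
async function runActionSequence(driver: PageDriver, seq: ActionSequence): Promise<number> {
  const limit = seq.maxIterations ?? 1;
  let iterations = 0;
  // Harvest what is rendered before the first scroll.
  await driver.collectNewBlocks();
  while (iterations < limit) {
    for (const action of seq.actions) {
      switch (action.type) {
        case "scroll":
          await driver.scroll(action.direction, action.distance);
          break;
        case "waitForSelector":
          await driver.waitForSelector(action.selector, action.timeout);
          break;
        case "wait":
          await new Promise<void>((resolve) => setTimeout(resolve, action.ms));
          break;
      }
    }
    iterations++;
    const added = await driver.collectNewBlocks();
    if (seq.stopCondition === "noNewContent" && added === 0) break;
  }
  return iterations;
}
```

Because some `[data-block-id]` element is always present under virtual scrolling, the selector wait alone cannot detect the end of the document; in this sketch the new-block count carries that responsibility, with `maxIterations` as a safety cap.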
Chapter 4: A Bigger Question — Should We Make a Separate Component?
The Temptation of "browserless"
When designing Action Sequence, a thought emerged:
"Browser automation is a general capability, should it be extracted as a separate component? Like Browserless.io?"
This is a strategic question about architecture. We analyzed it using two thinking frameworks.
Linus Torvalds Perspective: "Don't reinvent the wheel"
"I'm not a visionary, I'm an engineer."
Existing ecosystem:
- Browserless.io: Commercial-grade Headless Chrome service
- Playwright: Microsoft product, excellent API design
- Puppeteer: Google product, Chrome native
- Playwright MCP: 20+ tools, direct Claude integration
- browser-use: AI Agent framework, active development
Conclusion: We have no opportunity for 10x improvement. Reinventing the wheel is a waste of time.
Distinguished Google Product Manager Perspective
Three questions:
- Is there 10x improvement? No, existing solutions are mature enough
- Is there unique positioning? No, browser-use is more AI Native than us
- What is the core competitiveness? It's knowledge accumulation in site presets, not browser control
Conclusion: Don't make a separate component. Focus on the preset configuration layer.
"The best code is no code."
Chapter 5: Final Decision
Feishu Feature Status: Experimental
After complete exploration, our decision is:
- Keep existing code: Password auth + blockResources can be reused
- Mark as experimental: Feature incomplete, not recommended for production use
- Future path: When there are more virtual scrolling site requirements, implement minimal Action Sequence
Requirement Planning
Created feat-dynamic-scraper-framework (P1):
#### Problem
- Feishu, Notion and other modern document platforms use virtual scrolling
- Existing Scraper lacks systematic mechanism to handle "interaction-generated content"
#### Expectation
- Extend SitePreset to support behavior/actions declarations
- Implement generic VirtualScrollHandler
- Systematically solve consistency issues for all "dynamic loading" type sites
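A rough sketch of what an extended preset could look like is shown below. Only `blockResources` and the idea of a behavior/actions declaration come from this write-up; the `SitePreset` fields, the `hostPattern` matching approach, the `feishu.cn` domain pattern, and `findPreset` are all assumptions made for illustration.

```typescript
// Hypothetical shape of an extended preset.
interface SitePreset {
  name: string;
  hostPattern: RegExp; // assumed matching mechanism
  blockResources?: string[];
  behavior?: {
    trigger: "onLoad" | "onAuth" | "preScrape";
    actions: Array<{ type: string; [key: string]: unknown }>;
    maxIterations?: number;
    stopCondition?: string;
  };
}

const presets: SitePreset[] = [
  {
    name: "feishu",
    hostPattern: /(^|\.)feishu\.cn$/, // assumed domain
    blockResources: ["image", "media"],
    behavior: {
      trigger: "preScrape",
      actions: [
        { type: "scroll", direction: "down" },
        { type: "waitForSelector", selector: "[data-block-id]" },
      ],
      maxIterations: 100,
      stopCondition: "noNewContent",
    },
  },
];

// Pick the first preset whose host pattern matches the URL's hostname.
function findPreset(url: string, all: SitePreset[]): SitePreset | undefined {
  const host = new URL(url).hostname;
  return all.find((preset) => preset.hostPattern.test(host));
}
```

Keeping the behavior declarative means a new virtual scrolling site would need only a new preset entry, not new scraper code.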
Code Status
# Committed (keep)
- feishu preset basic configuration
- blockResources option
- Preset matching logic
# Uncommitted (experimental, documented)
- handlePasswordAuth function
- auth configuration types
- password parameter passing
Philosophical Reflection: The Value of Exploration
This exploration didn't produce a complete feature, but we gained something more important.
1. Clear Understanding of Problem Boundaries
"Virtual scrolling is not a bug, it's a feature."
Modern web applications universally use lazy loading techniques for performance. Traditional static scraping thinking needs to evolve.
2. Architectural Vision of Capability Layering
Static Preset → Action Sequence → AI Agent
This layering model applies not only to browser automation, but also to many "from simple to complex" system designs.
3. The Decision "Not to Do" Is Equally Important
We evaluated whether to make a separate browserless component and ultimately decided not to. This "not to do" decision avoided weeks or even months of ineffective investment.
4. Methodology of Competitor Research
Discover problem → Research competitors → Distill differences → Find positioning
Firecrawl lacks scroll, Playwright MCP is too heavy, browser-use is the future. Our positioning is knowledge accumulation at the preset configuration layer.
Next Steps
- Document complete exploration process
- Create feat-dynamic-scraper-framework requirement
- (Experiment) Start implementation when there are 3+ virtual scrolling site requirements
- (Future) Evaluate feasibility of Playwright MCP integration
Exploration time: 2026-01-04 ~ 2026-01-05
Participants: User, Claude (Strategist)