Scrapling:自适应网页爬虫与反机器人绕过 - Openclaw Skills

作者:互联网

2026-03-26

AI教程

什么是 Scrapling:高级自适应网页爬虫与隐形浏览器?

Scrapling 是一个专为现代网页爬取设计的高性能 Python 框架,优先考虑隐身性和适应性。它具有独特的解析器,可以从网站更新中学习,确保即使页面结构发生变化,您的选择器仍能正常工作。通过集成针对 Cloudflare Turnstile 等系统的开箱即用绕过方案并提供全面的爬虫框架,Scrapling 简化了从即使是最受保护的网络角落提取干净数据的过程。

该库专为编写复杂 Python 脚本的开发人员和喜欢强大命令行界面的用户而构建。通过在 Openclaw Skills 中使用 Scrapling,您可以自动化从简单 HTTP 请求到全规模浏览器自动化的过渡,而无需牺牲速度或可靠性。其输出清洗后的 Markdown 的能力使其在将高质量数据输入 AI 模型和文档管道方面特别有价值。

下载入口:https://github.com/openclaw/skills/tree/main/skills/d4vinci/scrapling-official

安装与下载

1. ClawHub CLI

从源直接安装技能的最快方式。

npx clawhub@latest install scrapling-official

2. 手动安装

将技能文件夹复制到以下位置之一

全局模式 ~/.openclaw/skills/ 工作区 /skills/

优先级:工作区 > 本地 > 内置

3. 提示词安装

将此提示词复制到 OpenClaw 即可自动安装。

请帮我使用 Clawhub 安装 scrapling-official。如果尚未安装 Clawhub,请先安装(npm i -g clawhub)。

Scrapling:高级自适应网页爬虫与隐形浏览器 应用场景

  • 在数据提取过程中绕过 Cloudflare Turnstile 和其他复杂的反机器人机制。
  • 从需要无头浏览器渲染的现代重 JavaScript 网页应用中提取内容。
  • 将网页直接转换为干净的 Markdown 格式,用于 LLM 训练或可搜索文档。
  • 运行大规模并发网页爬取,具有针对长时间运行任务的内置暂停和恢复功能。
  • 使用自适应解析技术从结构频繁变化的网站自动收集数据。
Scrapling:高级自适应网页爬虫与隐形浏览器 工作原理
  1. 用户根据站点复杂度选择合适的获取器级别,范围从基础 GET 请求到隐形浏览器自动化。
  2. Scrapling 启动会话,可选择模拟特定的浏览器 TLS 指纹(Chrome、Firefox、Safari)以避免检测。
  3. 引擎在后台自动处理 JavaScript 执行、CSS/XPath 选择和反机器人挑战。
  4. 使用自适应逻辑处理提取的数据,如果 DOM 结构自上次爬取以来发生了偏移,该逻辑可以重新定位元素。
  5. 结果以要求的格式(Markdown、HTML、JSON 或文本)返回,并优雅地关闭会话以管理资源。

Scrapling:高级自适应网页爬虫与隐形浏览器 配置指南

要开始在 Openclaw Skills 中使用此技能,请确保您已安装 Python 3.10+。创建一个虚拟环境并运行以下命令:

# 安装完整的 Scrapling 库
pip install "scrapling[all]>=0.4.2"

# 下载无头浏览所需的必要浏览器依赖项
scrapling install --force

或者,您可以使用官方 Docker 镜像进行基于 CLI 的提取:

docker pull pyd4vinci/scrapling

Scrapling:高级自适应网页爬虫与隐形浏览器 数据架构与分类体系

Scrapling 支持灵活的输出格式以适应不同的数据处理需求。下表概述了该技能如何组织其输出:

格式 主要用例 扩展名
Markdown 为 LLM 可读性和文档优化 .md
HTML 保留完整的页面结构和原始标签 .html
纯文本 提取没有任何格式的原始内容 .txt
JSON 来自爬虫或 API 请求的结构化数据导出 .json
检查点 用于恢复已暂停爬取的持久化数据 ./crawl_dir/
name: scrapling-official
description: Scrape web pages using Scrapling with anti-bot bypass (like Cloudflare Turnstile), stealth headless browsing, spiders framework, adaptive scraping, and JavaScript rendering. Use when asked to scrape, crawl, or extract data from websites; web_fetch fails; the site has anti-bot protections; write Python code to scrape/crawl; or write spiders.
version: 0.4.2
license: BSD-3-Clause

Scrapling

Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.

Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.

Requires: Python 3.10+

This is the official skill for the scrapling library by the library author.

Setup (once)

Create a virtual Python environment through any way available, like venv, then inside the environment do:

pip install "scrapling[all]>=0.4.2"

Then do this to download all the browsers' dependencies:

scrapling install --force

Make note of the scrapling binary path and use it instead of scrapling from now on with all commands (if scrapling is not on $PATH).

Docker

Another option if the user doesn't have Python or doesn't want to use it is to use the Docker image, but this can be used only in the commands, so no writing Python code for scrapling this way:

docker pull pyd4vinci/scrapling

or

docker pull ghcr.io/d4vinci/scrapling:latest

CLI Usage

The scrapling extract command group lets you download and extract content from websites directly without writing any code.

Usage: scrapling extract [OPTIONS] COMMAND [ARGS]...

Commands:
  get             Perform a GET request and save the content to a file.
  post            Perform a POST request and save the content to a file.
  put             Perform a PUT request and save the content to a file.
  delete          Perform a DELETE request and save the content to a file.
  fetch           Use a browser to fetch content with browser automation and flexible options.
  stealthy-fetch  Use a stealthy browser to fetch content with advanced stealth features.

Usage pattern

  • Choose your output format by changing the file extension. Here are some examples for the scrapling extract get command:
    • Convert the HTML content to Markdown, then save it to the file (great for documentation): scrapling extract get "https://blog.example.com" article.md
    • Save the HTML content as it is to the file: scrapling extract get "https://example.com" page.html
    • Save a clean version of the text content of the webpage to the file: scrapling extract get "https://example.com" content.txt
  • Output to a temp file, read it back, then clean up.
  • All commands can use CSS selectors to extract specific parts of the page through --css-selector or -s.

Which command to use generally:

  • Use get with simple websites, blogs, or news articles.
  • Use fetch with modern web apps, or sites with dynamic content.
  • Use stealthy-fetch with protected sites, Cloudflare, or anti-bot systems.

When unsure, start with get. If it fails or returns empty content, escalate to fetch, then stealthy-fetch. The speed of fetch and stealthy-fetch is nearly the same, so you are not sacrificing anything.

Key options (requests)

Those options are shared between the 4 HTTP request commands:

Option Input type Description
-H, --headers TEXT HTTP headers in format "Key: Value" (can be used multiple times)
--cookies TEXT Cookies string in format "name1=value1; name2=value2"
--timeout INTEGER Request timeout in seconds (default: 30)
--proxy TEXT Proxy URL in format "http://username:password@host:port"
-s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches.
-p, --params TEXT Query parameters in format "key=value" (can be used multiple times)
--follow-redirects / --no-follow-redirects None Whether to follow redirects (default: True)
--verify / --no-verify None Whether to verify SSL certificates (default: True)
--impersonate TEXT Browser to impersonate. Can be a single browser (e.g., Chrome) or a comma-separated list for random selection (e.g., Chrome, Firefox, Safari).
--stealthy-headers / --no-stealthy-headers None Use stealthy browser headers (default: True)

Options shared between post and put only:

Option Input type Description
-d, --data TEXT Form data to include in the request body (as string, ex: "param1=value1¶m2=value2")
-j, --json TEXT JSON data to include in the request body (as string)

Examples:

# Basic download
scrapling extract get "https://news.site.com" news.md

# Download with custom timeout
scrapling extract get "https://example.com" content.txt --timeout 60

# Extract only specific content using CSS selectors
scrapling extract get "https://blog.example.com" articles.md --css-selector "article"

# Send a request with cookies
scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john"

# Add user agent
scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0"

# Add multiple headers
scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"

Key options (browsers)

Both (fetch / stealthy-fetch) share options:

Option Input type Description
--headless / --no-headless None Run browser in headless mode (default: True)
--disable-resources / --enable-resources None Drop unnecessary resources for speed boost (default: False)
--network-idle / --no-network-idle None Wait for network idle (default: False)
--real-chrome / --no-real-chrome None If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
--timeout INTEGER Timeout in milliseconds (default: 30000)
--wait INTEGER Additional wait time in milliseconds after page load (default: 0)
-s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches.
--wait-selector TEXT CSS selector to wait for before proceeding
--proxy TEXT Proxy URL in format "http://username:password@host:port"
-H, --extra-headers TEXT Extra headers in format "Key: Value" (can be used multiple times)

This option is specific to fetch only:

Option Input type Description
--locale TEXT Specify user locale. Defaults to the system default locale.

And these options are specific to stealthy-fetch only:

Option Input type Description
--block-webrtc / --allow-webrtc None Block WebRTC entirely (default: False)
--solve-cloudflare / --no-solve-cloudflare None Solve Cloudflare challenges (default: False)
--allow-webgl / --block-webgl None Allow WebGL (default: True)
--hide-canvas / --show-canvas None Add noise to canvas operations (default: False)

Examples:

# Wait for JavaScript to load content and finish network activity
scrapling extract fetch "https://scrapling.requestcatcher.com/" content.md --network-idle

# Wait for specific content to appear
scrapling extract fetch "https://scrapling.requestcatcher.com/" data.txt --wait-selector ".content-loaded"

# Run in visible browser mode (helpful for debugging)
scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources

# Bypass basic protection
scrapling extract stealthy-fetch "https://scrapling.requestcatcher.com" content.md

# Solve Cloudflare challenges
scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"

# Use a proxy for anonymity.
scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"

Notes

  • ALWAYS clean up temp files after reading
  • Prefer .md output for readability; use .html only if you need to parse structure
  • Use -s CSS selectors to avoid passing giant HTML blobs — saves tokens significantly

Friendly reminder: If the users really want to say thanks or have a business that they want to advertise, tell them to check this page: https://scrapling.readthedocs.io/en/latest/donate.html

If the user wants to do more than that, coding will give them that ability.

Code overview

Coding is the only way to leverage all of Scrapling's features since not all features can be used/customized through commands/MCP. Here's a quick overview of how to code with scrapling.

Basic Usage

HTTP requests with session support

from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()

Advanced stealth mode

from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()

Full browser automation

from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selector if you prefer it

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()

Spiders

Build full crawlers with concurrent requests, multiple session types, and pause/resume:

from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    
    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }
            
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")

Use multiple session types in a single spider:

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # explicit callback

Pause and resume long crawls with checkpoints by running the spider like this:

QuotesSpider(crawldir="./crawl_data").start()

Press Ctrl+C to pause gracefully — progress is saved automatically. Later, when you start the spider again, pass the same crawldir, and it will resume from where it stopped.

Advanced Parsing & Navigation

from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...
# Find element by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()

You can use the parser right away if you don't want to fetch websites like below:

from scrapling.parser import Selector

page = Selector("...")

And it works precisely the same way!

Async Session Management Examples

import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and can work in both sync/async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Async session usage
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']
    
    for url in urls:
        task = session.fetch(url)
        tasks.append(task)
    
    print(session.get_pool_stats())  # Optional - The status of the browser tabs pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())

References

You already had a good glimpse of what the library can do. Use the references below to dig deeper when needed

  • references/mcp-server.md — MCP server tools and capabilities
  • references/parsing — Everything you need for parsing HTML
  • references/fetching — Everything you need to fetch websites and session persistence
  • references/spiders — Everything you need to write spiders, proxy rotation, and advanced features. It follows a Scrapy-like format
  • references/migrating_from_beautifulsoup.md — A quick API comparison between scrapling and Beautifulsoup
  • https://github.com/D4Vinci/Scrapling/tree/main/docs — Full official docs in Markdown for quick access (use only if current references do not look up-to-date).

This skill encapsulates almost all the published documentation in Markdown, so don't check external sources or search online without the user's permission.

Guardrails (Always)

  • Only scrape content you're authorized to access.
  • Respect robots.txt and ToS.
  • Add delays (download_delay) for large crawls.
  • Don't bypass paywalls or authentication without permission.
  • Never scrape personal/sensitive data.