MinerU PDF Extractor: Converting PDFs to Markdown with OCR - Openclaw Skills
Author: Internet
2026-03-29
What Is the MinerU PDF Extractor?
The MinerU PDF Extractor is a versatile, community-driven tool designed to bridge the gap between static PDF documents and LLM-ready structured data. Built on the MinerU API, it excels at recognizing complex elements such as mathematical formulas and nested tables, and at reading scanned text with high-quality OCR. Whether you are building a RAG pipeline or a research assistant, this addition to the Openclaw Skills library offers a robust route to document intelligence.
The tool provides two distinct workflows: a multi-step local file upload for private documents, and a streamlined two-step process for online URLs. It is built for developers who need precision and speed when converting academic papers, technical manuals, or business reports into clean Markdown.
Download: https://github.com/openclaw/skills/tree/main/skills/a-i-r/mineru-pdf-extractor
Installation
1. ClawHub CLI
The fastest way to install the skill directly from the source:

```shell
npx clawhub@latest install mineru-pdf-extractor
```

2. Manual Installation
Copy the skill folder to one of the following locations:
- Global: ~/.openclaw/skills/
- Workspace: /skills/
Priority: workspace > local > built-in
3. Prompt-Based Installation
Copy this prompt into OpenClaw to install automatically:
Please install mineru-pdf-extractor for me using Clawhub. If Clawhub is not installed yet, install it first (npm i -g clawhub).
MinerU PDF Extractor Use Cases
- Convert research papers containing complex LaTeX formulas into Markdown for LLM training.
- Extract data from multi-page business reports with complex tables for automated analysis.
- Process scanned PDF documents with OCR to create searchable knowledge bases.
- Fetch online research papers directly from arXiv via URL for a streamlined workflow.
- Batch-process entire directories of local PDF files for data migration projects.
MinerU PDF Extractor Workflow
1. The user provides the skill script with a local PDF file path or a publicly accessible document URL.
2. For local files, the tool requests a secure upload URL from the MinerU API, uploads the file, and initializes a parsing batch.
3. For online URLs, the tool submits the task directly to the MinerU engine, which returns a task ID, for faster processing.
4. The tool polls the MinerU API status endpoint at defined intervals to track extraction progress.
5. On completion, the tool downloads a complete ZIP package containing the converted Markdown and supporting assets.
6. Finally, the tool automatically unpacks the package, organizing the Markdown text, images, and layout metadata for immediate use.
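The polling step above amounts to a small retry loop. A minimal sketch, assuming a hypothetical `poll_until` helper name; the real scripts run a curl call against the MinerU status endpoint where the demo uses a stand-in command:

```shell
#!/bin/sh
# poll_until <max_retries> <interval_seconds> <command...>
# Runs <command> until it exits 0 or retries are exhausted.
poll_until() {
  max=$1; interval=$2; shift 2
  i=0
  while [ "$i" -lt "$max" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# Demo with a stand-in command; in the real scripts this would be a
# curl request to the status endpoint instead of `true`.
poll_until 3 1 true && echo "task finished"
```

The helper returns non-zero when retries run out, so callers can distinguish "still pending" from "done".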
MinerU PDF Extractor Configuration Guide
Make sure you have an API key from MinerU and that curl and unzip are installed on your system. These tools are fundamental to running Openclaw Skills such as this one.

```shell
# Set your authentication token
export MINERU_TOKEN="your_api_token_here"

# Optional: install jq for better JSON handling
# sudo apt-get install jq
```

Navigate to the scripts directory to start using the provided automation utilities.
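Before running the scripts, a quick shell check can confirm the prerequisites above are in place. A minimal sketch; `check_prereqs` is a hypothetical helper, and the demo call uses `sh` so it runs anywhere:

```shell
#!/bin/sh
# Verify that the required command-line tools are on PATH.
check_prereqs() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; return 1; }
  done
  echo "all tools found"
}

# In real use you would call: check_prereqs curl unzip
check_prereqs sh
```

Failing fast here avoids half-finished runs where the upload succeeds but the result ZIP cannot be extracted.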
MinerU PDF Extractor Data Architecture and Taxonomy
For each processed document, the tool organizes its output into a structured directory, ensuring compatibility with other Openclaw Skills that require structured input:
| File | Description |
|---|---|
| full.md | The primary Markdown output, containing text, formulas, and table data. |
| images/ | A subdirectory containing all graphic elements extracted from the PDF. |
| content_list.json | A structured JSON representation of the document hierarchy. |
| layout.json | Detailed metadata about document layout and coordinates. |
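A quick way to confirm an extraction directory matches the table above is to check for each expected entry. A minimal sketch; the fixture files here are mocks created purely for the demo:

```shell
#!/bin/sh
# Build a mock extraction directory matching the documented layout.
out=$(mktemp -d)
mkdir -p "$out/images"
printf '# Title\n' > "$out/full.md"
printf '[]'        > "$out/content_list.json"
printf '{}'        > "$out/layout.json"

# Verify every expected entry is present.
missing=0
for f in full.md content_list.json layout.json; do
  [ -f "$out/$f" ] || { echo "missing: $f"; missing=1; }
done
[ -d "$out/images" ] || { echo "missing: images/"; missing=1; }
[ "$missing" -eq 0 ] && echo "output looks complete"
rm -rf "$out"
```

A check like this is useful when feeding the output into downstream pipelines that assume all four entries exist.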
```yaml
name: mineru-pdf-extractor
description: Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.
author: Community
version: 1.0.0
homepage: https://mineru.net/
source: https://github.com/opendatalab/MinerU
env:
  - name: MINERU_TOKEN
    description: "MinerU API token for authentication (primary)"
    required: true
  - name: MINERU_API_KEY
    description: "Alternative API token if MINERU_TOKEN is not set"
    required: false
  - name: MINERU_BASE_URL
    description: "API base URL (optional, defaults to https://mineru.net/api/v4)"
    required: false
    default: "https://mineru.net/api/v4"
tools:
  required:
    - name: curl
      description: "HTTP client for API requests and file downloads"
    - name: unzip
      description: "Archive extraction tool for result ZIP files"
  optional:
    - name: jq
      description: "JSON processor for enhanced parsing and security (recommended)"
```
MinerU PDF Extractor
Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.
Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.
Skill Structure
mineru-pdf-extractor/
├── SKILL.md # English documentation
├── SKILL_zh.md # Chinese documentation
├── docs/ # Documentation
│ ├── Local_File_Parsing_Guide.md # Local PDF parsing detailed guide (English)
│ ├── Online_URL_Parsing_Guide.md # Online PDF parsing detailed guide (English)
│ ├── MinerU_本地文档解析完整流程.md # Local parsing complete guide (Chinese)
│ └── MinerU_在线文档解析完整流程.md # Online parsing complete guide (Chinese)
└── scripts/ # Executable scripts
├── local_file_step1_apply_upload_url.sh # Local parsing Step 1
├── local_file_step2_upload_file.sh # Local parsing Step 2
├── local_file_step3_poll_result.sh # Local parsing Step 3
├── local_file_step4_download.sh # Local parsing Step 4
├── online_file_step1_submit_task.sh # Online parsing Step 1
└── online_file_step2_poll_result.sh # Online parsing Step 2
Requirements
Required Environment Variables
Scripts automatically read the MinerU token from environment variables (choose one):

```shell
# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"
```
Required Command-Line Tools
- curl - For HTTP requests (usually pre-installed)
- unzip - For extracting results (usually pre-installed)
Optional Tools
- jq - For enhanced JSON parsing and security (recommended but not required)
  - If not installed, scripts will use fallback methods
  - Install: apt-get install jq (Debian/Ubuntu) or brew install jq (macOS)
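The "fallback methods" can be pictured as jq-or-sed field extraction. A minimal sketch; the JSON shape below is a made-up example, not the actual MinerU response format:

```shell
#!/bin/sh
# Sample API-style response (invented for the demo).
json='{"code":0,"data":{"batch_id":"abc-123"}}'

if command -v jq >/dev/null 2>&1; then
  # Preferred path: jq handles nesting and escaping correctly.
  batch_id=$(printf '%s\n' "$json" | jq -r '.data.batch_id')
else
  # Fallback: a narrow sed pattern, good enough for simple flat values.
  batch_id=$(printf '%s\n' "$json" | sed -n 's/.*"batch_id":"\([^"]*\)".*/\1/p')
fi
echo "$batch_id"
```

The sed fallback breaks on escaped quotes or repeated keys, which is why installing jq is recommended.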
Optional Configuration
```shell
# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"
```
Get a token: visit https://mineru.net/apiManage/docs to register and obtain an API key.
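The precedence rule (MINERU_TOKEN first, then MINERU_API_KEY) and the base-URL default boil down to shell parameter expansion. A minimal sketch with demo values in place of real credentials:

```shell
#!/bin/sh
# Demo values; the real scripts read these from your environment.
unset MINERU_BASE_URL
MINERU_TOKEN="demo_token"
MINERU_API_KEY="fallback_token"

# MINERU_TOKEN wins when set; otherwise fall back to MINERU_API_KEY.
TOKEN="${MINERU_TOKEN:-${MINERU_API_KEY:-}}"
# Base URL defaults to the documented endpoint when unset.
BASE_URL="${MINERU_BASE_URL:-https://mineru.net/api/v4}"

echo "token: $TOKEN"
echo "base url: $BASE_URL"
```

Unsetting MINERU_TOKEN in the demo would make TOKEN resolve to the fallback value instead.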
Feature 1: Parse Local PDF Documents
For locally stored PDF files. Requires 4 steps.
Quick Start
```shell
cd scripts/

# Step 1: Apply for an upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload the file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
```
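The steps above print KEY=VALUE lines, and the values must be captured before the next step. Because presigned upload URLs typically contain `=` characters in their query strings, strip the key prefix rather than splitting on `=`. A sketch using made-up sample output:

```shell
#!/bin/sh
# Sample output as printed by step 1 (values invented for the demo).
result='BATCH_ID=1234-abcd
UPLOAD_URL=https://example.com/upload?sig=abc=='

# Strip only the key prefix so '=' inside the value survives intact.
BATCH_ID=$(printf '%s\n' "$result" | sed -n 's/^BATCH_ID=//p')
UPLOAD_URL=$(printf '%s\n' "$result" | sed -n 's/^UPLOAD_URL=//p')

echo "$BATCH_ID"
echo "$UPLOAD_URL"
```

An equivalent cut-based form is `cut -d= -f2-`; the trailing dash keeps every field after the first `=`.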
Script Descriptions
local_file_step1_apply_upload_url.sh
Apply for upload URL and batch_id.
Usage:
./local_file_step1_apply_upload_url.sh <pdf_path> [language] [layout_model]
Parameters:
- pdf_path: path to the local PDF file (required)
- language: ch (Chinese), en (English), auto (auto-detect); default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate); default doclayout_yolo
Output:
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
local_file_step2_upload_file.sh
Upload PDF file to the presigned URL.
Usage:
./local_file_step2_upload_file.sh <upload_url> <pdf_path>
local_file_step3_poll_result.sh
Poll extraction results until completion or failure.
Usage:
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]
Output:
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
local_file_step4_download.sh
Download result ZIP and extract.
Usage:
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]
Output Structure:
extracted/
├── full.md # Markdown document (main result)
├── images/ # Extracted images
├── content_list.json # Structured content
└── layout.json # Layout analysis data
Detailed Documentation
Complete guide: see docs/Local_File_Parsing_Guide.md
Feature 2: Parse Online PDF Documents (URL Method)
For PDF files already available online (e.g., arXiv, websites). Takes only 2 steps, making it more concise and efficient.
Quick Start
```shell
cd scripts/

# Step 1: Submit the parsing task (provide the URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx

# Step 2: Poll results and auto-download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/
```
Script Descriptions
online_file_step1_submit_task.sh
Submit parsing task for online PDF.
Usage:
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]
Parameters:
- pdf_url: Complete URL of the online PDF (required)
- language: ch (Chinese), en (English), auto (auto-detect); default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate); default doclayout_yolo
Output:
TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
online_file_step2_poll_result.sh
Poll extraction results, automatically download and extract when complete.
Usage:
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]
Output Structure:
extracted/
├── full.md # ?? Markdown document (main result)
├── images/ # ??? Extracted images
├── content_list.json # Structured content
└── layout.json # Layout analysis data
Detailed Documentation
Complete guide: see docs/Online_URL_Parsing_Guide.md
Comparison of the Two Parsing Methods
| Feature | Local PDF Parsing | Online PDF Parsing |
|---|---|---|
| Steps | 4 steps | 2 steps |
| Upload Required | Yes | No |
| Average Time | 30-60 seconds | 10-20 seconds |
| Use Case | Local files | Files already online (arXiv, websites, etc.) |
| File Size Limit | 200MB | Limited by source server |
Advanced Usage
Batch Process Local Files
```shell
for pdf in /path/to/pdfs/*.pdf; do
  echo "Processing: $pdf"
  # Step 1: apply for an upload URL and batch ID.
  result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
  # -f2- keeps everything after the first '=', so presigned URLs
  # containing '=' in their query strings are not truncated.
  batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2-)
  upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2-)
  # Step 2: upload the file.
  ./local_file_step2_upload_file.sh "$upload_url" "$pdf"
  # Step 3: poll until the result is ready.
  zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2-)
  # Step 4: download and extract.
  filename=$(basename "$pdf" .pdf)
  ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done
```
Batch Process Online Files
```shell
for url in \
  "https://arxiv.org/pdf/2410.17247.pdf" \
  "https://arxiv.org/pdf/2409.12345.pdf"; do
  echo "Processing: $url"
  # Step 1: submit the parsing task.
  result=$(./online_file_step1_submit_task.sh "$url" 2>&1)
  task_id=$(echo "$result" | grep TASK_ID | cut -d= -f2-)
  # Step 2: poll, then download and extract.
  filename=$(basename "$url" .pdf)
  ./online_file_step2_poll_result.sh "$task_id" "${filename}_extracted"
done
```
Notes
- Token Configuration: Scripts prioritize MINERU_TOKEN and fall back to MINERU_API_KEY if it is not set
- Token Security: Do not hard-code tokens in scripts; use environment variables
- URL Accessibility: For online parsing, ensure the provided URL is publicly accessible
- File Limits: A single file should not exceed 200MB or 600 pages
- Network Stability: Ensure a stable network connection when uploading large files
- Security: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
- Optional jq: Installing jq provides enhanced JSON parsing and additional security checks
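The 200MB file limit above can be enforced with a pre-flight size check before uploading. A minimal sketch; the tiny temp file stands in for a real PDF:

```shell
#!/bin/sh
# Stand-in for a real PDF path (demo fixture).
pdf=$(mktemp)
printf 'dummy' > "$pdf"

max=$((200 * 1024 * 1024))        # documented 200MB limit
size=$(( $(wc -c < "$pdf") ))     # arithmetic strips any padding from wc output

if [ "$size" -le "$max" ]; then
  echo "size OK ($size bytes)"
else
  echo "too large: $size bytes (limit $max)"
fi
rm -f "$pdf"
```

Checking locally avoids a slow upload that the server would reject anyway; the 600-page limit can only be checked server-side or with a PDF-aware tool.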
Reference Documentation
| Document | Description |
|---|---|
| docs/Local_File_Parsing_Guide.md | Detailed curl commands and parameters for local PDF parsing |
| docs/Online_URL_Parsing_Guide.md | Detailed curl commands and parameters for online PDF parsing |
External Resources:
- MinerU Official: https://mineru.net/
- API Documentation: https://mineru.net/apiManage/docs
- GitHub Repository: https://github.com/opendatalab/MinerU
Skill Version: 1.0.0
Release Date: 2026-02-18
Community skill; not affiliated with the official MinerU project