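MinerU PDF Parser: Extract Text, Tables, and Formulas - Openclaw Skills
Author: Internet
2026-03-29
What is the MinerU PDF Parser?
The MinerU PDF Parser is a tool that converts static PDF documents into rich, AI-ready Markdown. As a core component of the Openclaw Skills ecosystem, it uses the MinerU MCP server to handle everything from basic text extraction to recognition of complex mathematical notation and tabular data. It is particularly valuable for researchers and developers who need to ingest technical documents into AI models without losing structural integrity.
The skill is tuned for modern development environments and offers Apple Silicon users a dedicated backend that uses MLX acceleration. It provides both a native MCP interface for interactive use and a direct Python tool for persistent storage, so your Openclaw Skills workflow stays flexible and efficient whether you are running a quick query or building a large document database.
Download: https://github.com/openclaw/skills/tree/main/skills/etoile04/mineru-pdf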
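Installation and Download
1. ClawHub CLI
The fastest way to install the skill directly from the source.
npx clawhub@latest install mineru-pdf
2. Manual Installation
Copy the skill folder to one of the following locations:
Global: ~/.openclaw/skills/
Workspace: /skills/
Priority: workspace > local > built-in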
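The priority rule above can be pictured as a first-match lookup over an ordered list of directories. The sketch below is an illustration only: `resolve_skill` and the roots passed to it are hypothetical names, not part of OpenClaw, and the real loader may behave differently.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_skill(name: str, roots: Iterable[Path]) -> Optional[Path]:
    """Return the first directory named `name` found under `roots`.

    `roots` must be ordered from highest to lowest priority
    (workspace first, then local, then built-in).
    """
    for root in roots:
        candidate = Path(root).expanduser() / name
        if candidate.is_dir():
            return candidate
    return None
```

Calling `resolve_skill("mineru-pdf", [workspace_root, local_root])` would return the workspace copy whenever both directories contain the skill.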
3. Prompt Installation
Copy this prompt into OpenClaw to install automatically.
Please help me install mineru-pdf using Clawhub. If Clawhub is not yet installed, install it first (npm i -g clawhub).
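MinerU PDF Parser Use Cases
- Digitize technical white papers into LaTeX and Markdown for research libraries.
- Extract financial data from multi-page PDF tables into structured formats for analysis.
- Preprocess scanned documents with built-in OCR in preparation for RAG pipelines.
- Batch-process large document libraries on Mac hardware using MLX-accelerated inference.
- Convert academic textbooks into structured text while preserving mathematical formulas.
- The user initiates a request by providing a PDF path and selecting a processing backend (such as pipeline or vlm-mlx-engine).
- The system invokes the MinerU engine to perform layout analysis and, when needed, optical character recognition.
- Complex elements such as tables and mathematical formulas are isolated and converted to Markdown and LaTeX, respectively.
- The skill compiles the processed components into a complete Markdown document containing metadata and structured content.
- The result is either returned immediately to the AI agent or saved to a local persistent directory for long-term access.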
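MinerU PDF Parser Configuration Guide
To add this capability to your environment, install the MinerU MCP server with the following command:
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
For users who need file persistence and want to avoid automatic cleanup of temporary files, use the direct parsing tool provided in the skill directory:
python /path/to/skills/mineru-pdf/parse.py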
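MinerU PDF Parser Data Architecture and Taxonomy
When used in persistent mode, the skill generates a structured output directory so that your Openclaw Skills project can easily access all assets:
| Path component | Description |
|---|---|
| input_name_parsed.md | Main Markdown file containing the full extracted content. |
| /images/ | Subdirectory containing all illustrations and graphic elements extracted from the PDF. |
| metadata | Integrated header containing backend settings, page count, and processing timestamp. |
| LaTeX | Mathematical content embedded directly in the Markdown using standard LaTeX delimiters. |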
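Because formulas are embedded inline in the Markdown, a downstream script may want to pull them back out. The following is a minimal sketch, assuming that "standard LaTeX delimiters" means `$...$` for inline math and `$$...$$` for display math; `extract_formulas` is a hypothetical helper, not part of the skill.

```python
import re
from typing import List, Tuple

# Display math is tried first so "$$...$$" is not read as two inline spans.
_MATH = re.compile(r"\$\$(.+?)\$\$|\$([^$\n]+?)\$", re.DOTALL)


def extract_formulas(markdown: str) -> List[Tuple[str, str]]:
    """Return (kind, latex) pairs, where kind is 'display' or 'inline'."""
    formulas = []
    for match in _MATH.finditer(markdown):
        display, inline = match.groups()
        if display is not None:
            formulas.append(("display", display.strip()))
        else:
            formulas.append(("inline", inline.strip()))
    return formulas
```

This pattern is deliberately simple: it would also match two currency amounts on one line (for example "$5 and $10"), so real documents may need a stricter rule.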
name: mineru-pdf
description: Parse PDF documents with MinerU MCP to extract text, tables, and formulas. Supports multiple backends including MLX-accelerated inference on Apple Silicon.
homepage: https://github.com/TINKPA/mcp-mineru
metadata:
  {
    "openclaw":
      {
        "emoji": "??",
        "requires": { "bins": ["uvx"] },
        "install":
          [
            {
              "id": "uvx",
              "kind": "uvx",
              "package": "mcp-mineru",
              "label": "Install mcp-mineru via uvx (auto-managed)",
            },
          ],
      },
  }
MinerU PDF Parser
Parse PDF documents using MinerU MCP to extract structured content including text, tables, and formulas with MLX acceleration on Apple Silicon.
Installation
Option 1: Install MinerU MCP (for Claude Code)
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
This installs and configures MinerU for all Claude projects. Models are downloaded on first use.
Option 2: Use Direct Tool (preserves files)
The skill includes a direct parsing tool that saves output to a persistent directory:
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py [options]
Advantages:
- Files are saved permanently (not auto-deleted)
- Full control over output location
- No MCP overhead
- Works with any Python environment that has MinerU
Quick Start
Method 1: Using the Direct Tool (Recommended)
# Parse entire PDF
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  "/path/to/document.pdf" \
  "/path/to/output"
# Parse specific pages
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  "/path/to/document.pdf" \
  "/path/to/output" \
  --start-page 0 --end-page 2
# Use Apple Silicon optimization
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  "/path/to/document.pdf" \
  "/path/to/output" \
  --backend vlm-mlx-engine
# Text only (faster)
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  "/path/to/document.pdf" \
  "/path/to/output" \
  --no-table --no-formula
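For batch jobs it can be convenient to assemble these commands in Python instead of typing them by hand. The sketch below uses only the flags shown in the commands above (`--start-page`, `--end-page`, `--backend`, `--no-table`, `--no-formula`); the function names are hypothetical, and the behaviour of parse.py itself is assumed from those examples.

```python
import subprocess
from typing import List, Optional


def build_parse_command(
    script: str,
    pdf_path: str,
    output_dir: str,
    backend: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    tables: bool = True,
    formulas: bool = True,
) -> List[str]:
    """Assemble the argument list for parse.py from the flags shown above."""
    cmd = ["python", script, pdf_path, output_dir]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if end_page is not None:
        cmd += ["--end-page", str(end_page)]
    if backend is not None:
        cmd += ["--backend", backend]
    if not tables:
        cmd.append("--no-table")
    if not formulas:
        cmd.append("--no-formula")
    return cmd


def run_parse(cmd: List[str]) -> int:
    """Run the assembled command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.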
Method 2: Using MinerU MCP (Temporary Files)
Parse a PDF document
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': True,
            'table_enable': True,
            'start_page': 0,
            'end_page': -1  # -1 for all pages
        }
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
Check system capabilities
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def list_backends():
    result = await call_tool(
        name='list_backends',
        arguments={}
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(list_backends())
"
Parameters
parse_pdf
Required:
- file_path - Absolute path to the PDF file
Optional:
- backend - Processing backend (default: pipeline)
  - pipeline - Fast, general-purpose (recommended)
  - vlm-mlx-engine - Fastest on Apple Silicon (M1/M2/M3/M4)
  - vlm-transformers - Slowest but most accurate
- formula_enable - Enable formula recognition (default: true)
- table_enable - Enable table recognition (default: true)
- start_page - Starting page (0-indexed, default: 0)
- end_page - Ending page (default: -1 for all pages)
list_backends
No parameters required. Returns system information and backend recommendations.
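The parse_pdf parameters listed above can be validated before a call is made, which catches typos in backend names or reversed page ranges early. This is a minimal sketch: `build_parse_args` is a hypothetical helper, and the checks reflect only the constraints stated in this section, not any validation the server itself performs.

```python
import os
from typing import Any, Dict

# Backend names as listed in the parameter description.
BACKENDS = ("pipeline", "vlm-mlx-engine", "vlm-transformers")


def build_parse_args(
    file_path: str,
    backend: str = "pipeline",
    formula_enable: bool = True,
    table_enable: bool = True,
    start_page: int = 0,
    end_page: int = -1,
) -> Dict[str, Any]:
    """Return an arguments dict for parse_pdf, raising ValueError on bad input."""
    if not os.path.isabs(file_path):
        raise ValueError(f"file_path must be absolute: {file_path!r}")
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    if start_page < 0:
        raise ValueError("start_page is 0-indexed and cannot be negative")
    if end_page != -1 and end_page < start_page:
        raise ValueError("end_page must be -1 (all pages) or >= start_page")
    return {
        "file_path": file_path,
        "backend": backend,
        "formula_enable": formula_enable,
        "table_enable": table_enable,
        "start_page": start_page,
        "end_page": end_page,
    }
```

The returned dict has the same keys as the `arguments` mapping used in the examples in this document.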
Usage Examples
Extract tables from a specific page range
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'table_enable': True,
            'start_page': 5,
            'end_page': 10
        }
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
Parse with formula recognition only (faster)
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'vlm-mlx-engine',
            'formula_enable': True,
            'table_enable': False  # Disable for speed
        }
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
Parse single page (fastest for testing)
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': False,
            'table_enable': False,
            'start_page': 0,
            'end_page': 0
        }
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
Performance
On Apple Silicon M4 (16GB RAM):
- pipeline: ~32s/page, CPU-only, good quality
- vlm-mlx-engine: ~38s/page, Apple Silicon optimized, excellent quality
- vlm-transformers: ~148s/page, highest quality, slowest
Note: First run downloads models (can take 5-10 minutes). Models are cached in ~/.cache/uv/ for faster subsequent runs.
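The per-page figures above allow a rough estimate of total processing time before starting a long job. The sketch below simply multiplies page count by the quoted rate; it assumes the rates hold on comparable hardware and ignores the one-time model download. `estimate_minutes` is a hypothetical helper.

```python
# Approximate seconds per page, taken from the M4 (16GB RAM) figures above.
SECONDS_PER_PAGE = {
    "pipeline": 32,
    "vlm-mlx-engine": 38,
    "vlm-transformers": 148,
}


def estimate_minutes(pages: int, backend: str = "pipeline") -> float:
    """Return a rough processing time in minutes for `pages` pages."""
    if pages < 0:
        raise ValueError("pages cannot be negative")
    return pages * SECONDS_PER_PAGE[backend] / 60
```

By this estimate a 100-page document would take roughly 53 minutes with pipeline and a little over 4 hours with vlm-transformers, which is one reason to test on a small page range first.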
Output Format
Returns structured Markdown with:
- Document metadata (file, backend, pages, settings)
- Extracted text with preserved structure
- Tables formatted as Markdown tables
- Formulas converted to LaTeX
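Once tables arrive as Markdown, they can be turned into structured rows for analysis. The following is a minimal sketch, assuming the tables use the pipe-delimited form with a header row and a separator row; if the actual output uses another table syntax, this parser will not apply. `parse_markdown_table` is a hypothetical helper.

```python
from typing import Dict, List


def _cells(line: str) -> List[str]:
    """Split one pipe-delimited row into trimmed cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_markdown_table(lines: List[str]) -> List[Dict[str, str]]:
    """Convert a pipe-delimited Markdown table into a list of row dicts."""
    rows = [_cells(line) for line in lines if line.strip().startswith("|")]
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]
```

Escaped pipes inside cells are not handled, so this is suitable only for simple tables.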
Supported Formats
- PDF documents (.pdf)
- JPEG images (.jpg, .jpeg)
- PNG images (.png)
- Other image formats (WebP, GIF, etc.)
Troubleshooting
Module not found error
If you get "No module named 'mcp_mineru'", make sure you installed it:
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
Slow processing on first run
This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.
Timeout errors
Increase timeout for large documents or use smaller page ranges for testing.
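One way to work in smaller page ranges is to split a document into fixed-size chunks and parse each chunk in a separate call. The sketch below assumes `end_page` is inclusive, as the single-page example (`start_page` 0, `end_page` 0) suggests; `page_ranges` is a hypothetical helper.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 0-indexed (start_page, end_page) pairs covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for start in range(0, total_pages, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, total_pages) - 1
```

For a 25-page document and a chunk size of 10, this yields (0, 9), (10, 19), and (20, 24).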
Notes
- Output is returned as Markdown text
- Tables are preserved in Markdown format
- Mathematical formulas are converted to LaTeX
- Works with scanned documents (OCR built-in)
- Optimized for Apple Silicon (M1/M2/M3/M4) with MLX backend
File Persistence
Why Files Get Deleted (MCP Method)
The MinerU MCP server uses Python's tempfile.TemporaryDirectory(), which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating.
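The cleanup behaviour of `tempfile.TemporaryDirectory()` is easy to reproduce with the standard library alone. The sketch below is not the server's code; it writes a stand-in file to a temporary directory and copies it out before the context exits, which is the general pattern for keeping such output.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def write_then_keep(keep_dir: Path) -> Tuple[Path, Path]:
    """Create a file in a temporary directory and copy it out before cleanup.

    Returns (temporary_path, kept_path); only kept_path exists afterwards.
    """
    keep_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        produced = Path(tmp) / "result.md"
        # Stand-in for real parser output.
        produced.write_text("# parsed output\n", encoding="utf-8")
        kept = Path(shutil.copy2(produced, keep_dir))
    # Leaving the `with` block deleted the temporary directory and its contents.
    return produced, kept
```

After the call, the first returned path no longer exists while the second one does, which mirrors why MCP output must be captured before the server's temporary directory is removed.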
How to Preserve Files
Method A: Use the Direct Tool (Recommended)
The skill provides parse.py which saves files to a persistent directory:
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/input.pdf \
  /path/to/output_dir
Advantages:
- Files are never auto-deleted
- Full control over output location
- Can be used in batch processing
- No MCP connection needed
Generated Structure:
/path/to/output_dir/
├── input.pdf_name/
│ └── auto/ # or vlm/ depending on backend
│ ├── input.pdf_name.md
│ └── images/
│ └── *.jpg
└── input.pdf_name_parsed.md # Copy at root for easy access
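Given the layout above, a script can locate the root-level Markdown copies and the extracted images with `pathlib` globbing. This is a minimal sketch that assumes exactly the directory shape shown (one backend subdirectory such as auto/ or vlm/ per document, images stored as .jpg); `collect_outputs` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List


def collect_outputs(output_dir: Path) -> Dict[str, List[Path]]:
    """Find root-level *_parsed.md files and images under <doc>/<backend>/images/."""
    root = Path(output_dir)
    return {
        "markdown": sorted(root.glob("*_parsed.md")),
        "images": sorted(root.glob("*/*/images/*.jpg")),
    }
```

The pattern `*/*/images/*.jpg` matches one document directory and one backend directory, so it works for both auto/ and vlm/ without naming either.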
Method B: Redirect MCP Output
If using the MCP method, capture the output and save it:
# Capture to file
claude -p "Parse this PDF: /path/to/file.pdf" > /tmp/output.md
# Or use within a script that saves the result
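A script that saves the result could look like the following sketch. It assumes the result object has the shape used in the examples in this document (a `content` sequence whose items may carry a `text` attribute); `first_text` and `save_result` are hypothetical helper names.

```python
from pathlib import Path
from typing import Any, Optional


def first_text(result: Any) -> Optional[str]:
    """Return the text of the first content item that has one, else None."""
    content = getattr(result, "content", None)
    if content is None:
        return None
    for item in content:
        if hasattr(item, "text"):
            return item.text
    return None


def save_result(result: Any, destination: Path) -> bool:
    """Write the extracted Markdown to `destination`; return False if there is none."""
    text = first_text(result)
    if text is None:
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_text(text, encoding="utf-8")
    return True
```

Saving only the text keeps the Markdown but not the extracted image files, which still live in the temporary directory.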
Comparison
| Feature | Direct Tool | MCP Method |
|---|---|---|
| Files persisted | Yes | No (auto-deleted) |
| Custom output dir | Yes | No (temp only) |
| Claude Code integration | Manual | Native |
| Speed | Fast | MCP overhead |
| Offline use | Yes | Needs Claude Code |
Recommendation
- Use Direct Tool when you need to keep the files for later use
- Use MCP Method when working within Claude Code and only need the text content