MinerU PDF 提取器:利用 OCR 将 PDF 转换为 Markdown - Openclaw Skills

作者:互联网

2026-03-29

AI教程

什么是 MinerU PDF 提取器?

MinerU PDF 提取器是一款功能多样的社区驱动工具,旨在弥合静态 PDF 文档与大模型就绪结构化数据之间的鸿沟。通过利用 MinerU API,该工具擅长识别复杂的元素,如数学公式、嵌套表格以及通过高质量 OCR 识别扫描文本。无论您是正在构建 RAG 流水线还是研究助手,这款添加到 Openclaw Skills 库中的工具都能为您提供一种稳健的文档智能获取方式。

该工具提供两种不同的工作流:针对私有文档的多步本地文件上传,以及针对在线 URL 的简化两步过程。它是专为在将学术论文、技术手册或商业报告转换为整洁 Markdown 格式时需要精准度和速度的开发者而构建的。

下载入口:https://github.com/openclaw/skills/tree/main/skills/a-i-r/mineru-pdf-extractor

安装与下载

1. ClawHub CLI

从源直接安装技能的最快方式。

npx clawhub@latest install mineru-pdf-extractor

2. 手动安装

将技能文件夹复制到以下位置之一

全局模式 ~/.openclaw/skills/ 工作区 /skills/

优先级:工作区 > 本地 > 内置

3. 提示词安装

将此提示词复制到 OpenClaw 即可自动安装。

请帮我使用 Clawhub 安装 mineru-pdf-extractor。如果尚未安装 Clawhub,请先安装(npm i -g clawhub)。

MinerU PDF 提取器 应用场景

  • 将包含复杂 LaTeX 公式的研究论文转换为 Markdown 以用于 LLM 训练。
  • 从具有复杂表格的多页商业报告中提取数据以进行自动化分析。
  • 使用 OCR 处理扫描的 PDF 文档以创建可搜索的知识库。
  • 通过 URL 直接从 arXiv 简化在线研究论文的获取。
  • 为数据迁移项目批量处理本地 PDF 文件的整个目录。
MinerU PDF 提取器 工作原理
  1. 用户向技能脚本提供本地 PDF 文件路径或公开文档 URL。
  2. 对于本地文件,该工具向 MinerU API 申请安全上传 URL,上传文件并初始化解析批次。
  3. 对于在线 URL,该工具直接向 MinerU 引擎提交任务 ID 以实现更快的处理。
  4. 该工具按定义的间隔轮询 MinerU API 状态端点,以跟踪提取进度。
  5. 完成后,该工具下载一个包含转换后的 Markdown 及支持资产的完整 ZIP 包。
  6. 最后,该工具自动解压该包,整理 Markdown 文本、图像和布局元数据以供立即使用。

MinerU PDF 提取器 配置指南

请确保您拥有来自 MinerU 的 API 密钥,并且您的系统上已安装了 curl 和 unzip。这些工具是此类 Openclaw Skills 运行的基础。

# 设置您的身份验证令牌
export MINERU_TOKEN="your_api_token_here"

# 可选:安装 jq 以获得更好的 JSON 处理能力
# sudo apt-get install jq

导航到脚本目录开始使用提供的自动化实用程序。

MinerU PDF 提取器 数据架构与分类体系

该工具为每个处理的文档将输出组织到一个结构化目录中,确保与需要结构化输入的其他 Openclaw Skills 兼容:

文件 描述
full.md 包含文本、公式和表格数据的主要 Markdown 输出。
images/ 包含从 PDF 中提取的所有图形元素的子目录。
content_list.json 文档层级结构的结构化 JSON 表示。
layout.json 关于文档布局和坐标的详细元数据。
name: mineru-pdf-extractor
description: Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.
author: Community
version: 1.0.0
homepage: https://mineru.net/
source: https://github.com/opendatalab/MinerU
env:
  - name: MINERU_TOKEN
    description: "MinerU API token for authentication (primary)"
    required: true
  - name: MINERU_API_KEY
    description: "Alternative API token if MINERU_TOKEN is not set"
    required: false
  - name: MINERU_BASE_URL
    description: "API base URL (optional, defaults to https://mineru.net/api/v4)"
    required: false
    default: "https://mineru.net/api/v4"
tools:
  required:
    - name: curl
      description: "HTTP client for API requests and file downloads"
    - name: unzip
      description: "Archive extraction tool for result ZIP files"
  optional:
    - name: jq
      description: "JSON processor for enhanced parsing and security (recommended)"

MinerU PDF Extractor

Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.

Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.


?? Skill Structure

mineru-pdf-extractor/
├── SKILL.md                          # English documentation
├── SKILL_zh.md                       # Chinese documentation
├── docs/                             # Documentation
│   ├── Local_File_Parsing_Guide.md   # Local PDF parsing detailed guide (English)
│   ├── Online_URL_Parsing_Guide.md   # Online PDF parsing detailed guide (English)
│   ├── MinerU_本地文档解析完整流程.md  # Local parsing complete guide (Chinese)
│   └── MinerU_在线文档解析完整流程.md  # Online parsing complete guide (Chinese)
└── scripts/                          # Executable scripts
    ├── local_file_step1_apply_upload_url.sh    # Local parsing Step 1
    ├── local_file_step2_upload_file.sh         # Local parsing Step 2
    ├── local_file_step3_poll_result.sh         # Local parsing Step 3
    ├── local_file_step4_download.sh            # Local parsing Step 4
    ├── online_file_step1_submit_task.sh        # Online parsing Step 1
    └── online_file_step2_poll_result.sh        # Online parsing Step 2

?? Requirements

Required Environment Variables

Scripts automatically read MinerU Token from environment variables (choose one):

# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"

Required Command-Line Tools

  • curl - For HTTP requests (usually pre-installed)
  • unzip - For extracting results (usually pre-installed)

Optional Tools

  • jq - For enhanced JSON parsing and security (recommended but not required)
    • If not installed, scripts will use fallback methods
    • Install: apt-get install jq (Debian/Ubuntu) or brew install jq (macOS)

Optional Configuration

# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"

?? Get Token: Visit https://mineru.net/apiManage/docs to register and obtain an API Key


?? Feature 1: Parse Local PDF Documents

For locally stored PDF files. Requires 4 steps.

Quick Start

cd scripts/

# Step 1: Apply for upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/

Script Descriptions

local_file_step1_apply_upload_url.sh

Apply for upload URL and batch_id.

Usage:

./local_file_step1_apply_upload_url.sh  [language] [layout_model]

Parameters:

  • language: ch (Chinese), en (English), auto (auto-detect), default ch
  • layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo

Output:

BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...

local_file_step2_upload_file.sh

Upload PDF file to the presigned URL.

Usage:

./local_file_step2_upload_file.sh  

local_file_step3_poll_result.sh

Poll extraction results until completion or failure.

Usage:

./local_file_step3_poll_result.sh  [max_retries] [retry_interval_seconds]

Output:

FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip

local_file_step4_download.sh

Download result ZIP and extract.

Usage:

./local_file_step4_download.sh  [output_zip_filename] [extract_directory_name]

Output Structure:

extracted/
├── full.md              # ?? Markdown document (main result)
├── images/              # ??? Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data

Detailed Documentation

?? Complete Guide: See docs/Local_File_Parsing_Guide.md


?? Feature 2: Parse Online PDF Documents (URL Method)

For PDF files already available online (e.g., arXiv, websites). Only 2 steps, more concise and efficient.

Quick Start

cd scripts/

# Step 1: Submit parsing task (provide URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx

# Step 2: Poll results and auto-download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/

Script Descriptions

online_file_step1_submit_task.sh

Submit parsing task for online PDF.

Usage:

./online_file_step1_submit_task.sh  [language] [layout_model]

Parameters:

  • pdf_url: Complete URL of the online PDF (required)
  • language: ch (Chinese), en (English), auto (auto-detect), default ch
  • layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo

Output:

TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

online_file_step2_poll_result.sh

Poll extraction results, automatically download and extract when complete.

Usage:

./online_file_step2_poll_result.sh  [output_directory] [max_retries] [retry_interval_seconds]

Output Structure:

extracted/
├── full.md              # ?? Markdown document (main result)
├── images/              # ??? Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data

Detailed Documentation

?? Complete Guide: See docs/Online_URL_Parsing_Guide.md


?? Comparison of Two Parsing Methods

Feature Local PDF Parsing Online PDF Parsing
Steps 4 steps 2 steps
Upload Required ? Yes ? No
Average Time 30-60 seconds 10-20 seconds
Use Case Local files Files already online (arXiv, websites, etc.)
File Size Limit 200MB Limited by source server

?? Advanced Usage

Batch Process Local Files

for pdf in /path/to/pdfs/*.pdf; do
    echo "Processing: $pdf"
    
    # Step 1
    result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
    batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2)
    upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2)
    
    # Step 2
    ./local_file_step2_upload_file.sh "$upload_url" "$pdf"
    
    # Step 3
    zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2)
    
    # Step 4
    filename=$(basename "$pdf" .pdf)
    ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done

Batch Process Online Files

for url in r
  "https://arxiv.org/pdf/2410.17247.pdf" r
  "https://arxiv.org/pdf/2409.12345.pdf"; do
    echo "Processing: $url"
    
    # Step 1
    result=$(./online_file_step1_submit_task.sh "$url" 2>&1)
    task_id=$(echo "$result" | grep TASK_ID | cut -d= -f2)
    
    # Step 2
    filename=$(basename "$url" .pdf)
    ./online_file_step2_poll_result.sh "$task_id" "${filename}_extracted"
done

?? Notes

  1. Token Configuration: Scripts prioritize MINERU_TOKEN, fall back to MINERU_API_KEY if not found
  2. Token Security: Do not hard-code tokens in scripts; use environment variables
  3. URL Accessibility: For online parsing, ensure the provided URL is publicly accessible
  4. File Limits: Single file recommended not exceeding 200MB, maximum 600 pages
  5. Network Stability: Ensure stable network when uploading large files
  6. Security: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
  7. Optional jq: Installing jq provides enhanced JSON parsing and additional security checks

?? Reference Documentation

Document Description
docs/Local_File_Parsing_Guide.md Detailed curl commands and parameters for local PDF parsing
docs/Online_URL_Parsing_Guide.md Detailed curl commands and parameters for online PDF parsing

External Resources:

  • ?? MinerU Official: https://mineru.net/
  • ?? API Documentation: https://mineru.net/apiManage/docs
  • ?? GitHub Repository: https://github.com/opendatalab/MinerU

Skill Version: 1.0.0
Release Date: 2026-02-18
Community Skill - Not affiliated with MinerU official