Mistral OCR PDF Extraction: Convert PDFs to Markdown


Author: Internet

2026-03-31

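What is Mistral OCR PDF Extraction?

This skill uses the Mistral OCR API to provide high-fidelity document understanding for both digital and scanned PDFs. It bridges the gap between unstructured documents and LLM-ready data, which makes it a key component in the Openclaw Skills ecosystem for developers building RAG systems or document-processing pipelines.

By automating the extraction of text, tables, and images, it gives your AI agent access to a document's full context. The skill handles file upload, API communication, and output organization, and exposes a developer-friendly interface for document intelligence.

Download: https://github.com/openclaw/skills/tree/main/skills/tristanmanchester/extracting-mistral-ocr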

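Installation and Download

1. ClawHub CLI

The fastest way to install the skill directly from the source.

npx clawhub@latest install extracting-mistral-ocr

2. Manual installation

Copy the skill folder to one of the following locations:

Global: ~/.openclaw/skills/
Workspace: /skills/

Priority: workspace > local > built-in

3. Prompt-based installation

Copy this prompt into OpenClaw to install the skill automatically.

Please help me install extracting-mistral-ocr using Clawhub. If Clawhub is not installed yet, install it first (npm i -g clawhub).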

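Mistral OCR PDF Extraction Use Cases

Convert scanned PDF documents into searchable Markdown for RAG applications.
Extract structured data, such as invoice totals or contract dates, with custom annotation prompts.
Parse complex multi-page reports while preserving table structure and visual elements.
Automate document-ingestion pipelines that require high-accuracy optical character recognition.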

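How Mistral OCR PDF Extraction Works

The script identifies the input source; both local file paths and public URLs are supported.
Local files are uploaded automatically to the Mistral Files API, which returns a unique file ID used for processing.
The OCR request is sent to Mistral with configuration parameters that control table formatting and image extraction.
The API processes the document and returns a detailed response containing page-level Markdown, metadata, and optional annotations.
The skill organizes the output into a structured directory containing the full text, per-page files, and any extracted assets.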
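
The first two steps above (detecting the input source and choosing between a direct URL and an upload) can be sketched as follows. This is a minimal illustration, not the skill's actual code: `is_url`, `build_document_payload`, and the `upload` callable are hypothetical names, and the dictionary shapes are an assumption based on the `document_url` and `file_id` terms used in the skill's workflow notes.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Return True when the input looks like an http(s) URL."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def build_document_payload(source: str, upload) -> dict:
    """Build the document part of an OCR request (assumed shape).

    `upload` is a hypothetical callable that takes a Path, uploads the
    file, and returns a file ID string.
    """
    if is_url(source):
        # Public URL: reference the document directly.
        return {"type": "document_url", "document_url": source}
    path = Path(source)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    # Local file: upload first, then reference it by ID.
    return {"type": "file", "file_id": upload(path)}
```

Passing the uploader in as a callable keeps the routing logic testable without network access; in real use it would wrap whatever upload call the installed client library provides.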

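Mistral OCR PDF Extraction Configuration Guide

Make sure you have a valid Mistral API key and that Python 3.9+ and the mistralai package are installed. This tool is a core part of the Openclaw Skills collection.

# Set the environment variable
export MISTRAL_API_KEY='your_api_key_here'

# Run the extraction script
python scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr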
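
Before running the script, it can help to confirm the three prerequisites named above. The following is a small preflight sketch using only the standard library; `preflight_problems` is a hypothetical helper and is not part of the skill.

```python
import importlib.util
import os
import sys


def preflight_problems(env=None, min_python=(3, 9), package="mistralai"):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    if not env.get("MISTRAL_API_KEY"):
        problems.append("MISTRAL_API_KEY is not set")
    # find_spec returns None when a top-level package cannot be imported.
    if importlib.util.find_spec(package) is None:
        problems.append(f"the {package} package is not installed")
    return problems
```

Calling `preflight_problems()` with no arguments checks the current process environment; printing the returned list gives a quick diagnosis before any API call is made.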

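Mistral OCR PDF Extraction Output Structure

The skill produces a deterministic output folder with the following structure, which keeps it compatible with downstream Openclaw Skills:

File/Folder	Description
combined.md	The entire document concatenated into a single Markdown file.
pages/	A separate Markdown file for each page (for example, page-000.md).
raw_response.json	The complete, unedited JSON response from the Mistral OCR API.
images/	Images and figures extracted from the PDF and decoded, if requested.
tables/	Tables extracted and saved as separate files, if HTML format is requested.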
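
To make the layout concrete, here is a rough sketch of how such a folder could be written from a list of per-page Markdown strings. `write_outputs` is a hypothetical function, and the blank-line separator used for combined.md is an assumption; the bundled script may join pages differently.

```python
import json
from pathlib import Path


def write_outputs(page_markdowns, raw_response, out_dir):
    """Write combined.md, pages/page-NNN.md, and raw_response.json."""
    out = Path(out_dir)
    pages_dir = out / "pages"
    pages_dir.mkdir(parents=True, exist_ok=True)
    for index, markdown in enumerate(page_markdowns):
        # Zero-based, three-digit names match the page-000.md example.
        (pages_dir / f"page-{index:03d}.md").write_text(markdown, encoding="utf-8")
    # Assumed separator: one blank line between pages.
    (out / "combined.md").write_text("\n\n".join(page_markdowns), encoding="utf-8")
    (out / "raw_response.json").write_text(
        json.dumps(raw_response, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out
```

Zero-padded file names keep the pages in document order under a plain lexicographic sort, which is convenient for downstream tools that simply glob `pages/*.md`.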

name: extracting-mistral-ocr
description: >-
  Extracts text, tables, and images from PDFs (including scanned PDFs) using the Mistral OCR API.
  Use when user asks to OCR a PDF/image, extract text from a PDF, parse a scanned document,
  convert a PDF to Markdown, or extract structured fields from a document.
compatibility: >-
  Requires network access and a MISTRAL_API_KEY environment variable. Expects Python 3.9+ and the mistralai package.
allowed-tools: "Read,Write,Bash(python:*)"
metadata:
  author: generated-by-chatgpt
  version: 0.1.0
  api: mistral
  default-model: mistral-ocr-latest

Mistral OCR PDF extraction

Quick start (default)

Run the bundled script to OCR a local PDF and write Markdown + JSON outputs:

python {baseDir}/scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr

Output directory layout:

combined.md (all pages concatenated)
pages/page-000.md (per-page markdown)
raw_response.json (full OCR response)
images/ (decoded embedded images, if requested)
tables/ (separate tables, if requested)

Workflow

1. Pick input mode
- Local PDF (most common): upload via Files API, then OCR via file_id.
- Public URL: OCR directly via document_url.
2. Choose output fidelity (defaults are safe for RAG)
- Keep table_format=inline unless the user explicitly wants tables split out.
- Set --include-image-base64 when the user needs figures/diagrams extracted.
- Use --extract-header/--extract-footer if header/footer noise hurts downstream search.
3. Run OCR
- Use scripts/mistral_ocr_extract.py to produce a deterministic on-disk artefact set.
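
When the script is driven from another program, the workflow choices above can be mapped onto a command line. The sketch below uses only flags named in this document; `build_command` is a hypothetical helper, and treating `--include-image-base64`, `--extract-header`, and `--extract-footer` as argument-free switches is an assumption.

```python
import sys


def build_command(input_path, out_dir, script="scripts/mistral_ocr_extract.py",
                  include_images=False, extract_header=False, extract_footer=False):
    """Return an argv list for the extraction script."""
    cmd = [sys.executable, script, "--input", input_path, "--out", out_dir]
    if include_images:
        cmd.append("--include-image-base64")  # assumed to be a boolean switch
    if extract_header:
        cmd.append("--extract-header")  # assumed to be a boolean switch
    if extract_footer:
        cmd.append("--extract-footer")  # assumed to be a boolean switch
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with paths that contain spaces.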

(Optional) Structured extraction from the whole document

If the user wants fields (invoice totals, contract parties, etc.), provide an annotation prompt.
The OCR API can return a document-level document_annotation in addition to page markdown.

Example:

python {baseDir}/scripts/mistral_ocr_extract.py \
  --input invoice.pdf \
  --out out/invoice \
  --annotation-prompt "Extract supplier_name, invoice_number, invoice_date (ISO-8601), currency, total_amount. Return JSON." \
  --annotation-format json_object
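
After a run like the one above, the annotation could be read back from the saved response. This sketch assumes that raw_response.json stores `document_annotation` as a top-level key and that its value is either a JSON-encoded string or an already-parsed object; both points are assumptions, and `load_document_annotation` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_document_annotation(out_dir):
    """Return the document-level annotation, or None if it is absent."""
    raw_path = Path(out_dir) / "raw_response.json"
    raw = json.loads(raw_path.read_text(encoding="utf-8"))
    annotation = raw.get("document_annotation")  # assumed top-level key
    if isinstance(annotation, str):
        try:
            return json.loads(annotation)
        except json.JSONDecodeError:
            # Not valid JSON: hand back the text unchanged.
            return annotation
    return annotation
```

Falling back to the raw string keeps the helper usable even when the model returns prose instead of strict JSON.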

Decision rules

If the PDF is local and not publicly accessible, upload it (the script does this automatically).
If the PDF URL is private or requires authentication, do not pass it as document_url; upload instead.
If output quality is critical, prefer table_format=html for downstream parsing over brittle regex.
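
To illustrate why HTML tables are easier to consume than regex-scraped Markdown, here is a minimal parser built on the standard library. It assumes the extracted table files contain ordinary `<table>`, `<tr>`, `<th>`, and `<td>` markup and does not handle nested tables; `TableRows` and `parse_table` are hypothetical names.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For example, `parse_table("<table><tr><th>Item</th></tr><tr><td>Widget</td></tr></table>")` yields one header row and one data row, each as a list of strings.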

Common failure modes

Missing MISTRAL_API_KEY: set it in the environment before running.
URL OCR fails: the URL likely is not publicly accessible; upload the file.
Large files: upload supports large files, but very large PDFs may need page selection (--pages) or batch processing.
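
For very large PDFs, one way to approach batch processing is to split the page range into fixed-size groups and run OCR once per group. The helper below only computes the groups; `page_batches` is a hypothetical function, the zero-based indices follow the page-000.md naming, and the exact value format that `--pages` accepts is not specified here, so converting each batch into a flag value is left open.

```python
def page_batches(total_pages, batch_size):
    """Split zero-based page indices into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(range(start, min(start + batch_size, total_pages)))
        for start in range(0, total_pages, batch_size)
    ]
```

Each batch can then be written to its own output directory, and the per-batch combined.md files concatenated in order afterwards.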

References

API + parameters: references/mistral_ocr_api.md
Output mapping rules (placeholders to extracted images/tables): references/output_mapping.md
Example annotation prompts for common document types: references/annotation_prompts.md
