Invoice Extractor: Automated Expense Tracking - Openclaw Skills
Author: Internet
2026-04-18
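What is Invoice Extractor?
Invoice Extractor is an Openclaw Skills extension that bridges the gap between messy financial documents and structured accounting. It takes a hybrid approach: a Python script handles PDF text extraction and ledger management, while an LLM parses the complex invoice layouts that traditional regex often fails to capture. Whether you are processing a digital PDF or a photo of a paper receipt, the skill extracts the financial data and categorizes it automatically.
Invoice Extractor fits into your existing workflow and keeps a clean expense history in CSV or JSON format. Its "confirm then add" workflow prioritizes data integrity: every extracted value, from tax details to vendor names, is verified by the user before it is committed to the permanent expense ledger.
Download: https://github.com/openclaw/skills/tree/main/skills/99rebels/rebels-invoice-extractor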
Installation and Download
1. ClawHub CLI
The fastest way to install the skill directly from source.
npx clawhub@latest install rebels-invoice-extractor
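2. Manual Installation
Copy the skill folder to one of the following locations:
Global: ~/.openclaw/skills/
Workspace: /skills/
Priority: workspace > local > built-in
3. Prompt Installation
Copy this prompt into OpenClaw to install the skill automatically.
Please use Clawhub to install rebels-invoice-extractor for me. If Clawhub is not installed yet, install it first (npm i -g clawhub).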
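Invoice Extractor Use Cases
- Automating data extraction from batches of PDF invoices for business accounting.
- Tracking everyday personal spending by photographing receipts.
- Preparing organized expense reports and tax documents for a specific period.
- Categorizing spending automatically across departments or project codes.
- Exporting verified expense data directly to accounting platforms such as Xero, FreeAgent, or Wave.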
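How Invoice Extractor Works
- The user gives the agent an invoice via a file path or an image upload.
- For PDFs, the system extracts text with pdfplumber; for images, the agent uses vision processing to identify the key financial fields.
- The extracted data is mapped to a standardized JSON schema covering vendor, date, line items, and tax.
- The skill runs an automatic categorization check against the local config file to tag the expense (for example: office, travel, software).
- A summary is shown to the user for review, allowing manual edits or confirmation.
- Once approved, the data is appended to the persistent ledger file, with automatic duplicate detection and backup creation.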
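Invoice Extractor Configuration Guide
To start using this Openclaw Skills utility, make sure the required Python dependency is installed:
pip install pdfplumber
# If pdfplumber is unavailable, PyPDF2 is used as an automatic fallback
The skill relies on scripts/extract.py in the skill directory for its processing logic and on expense-config.json for its categorization rules.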
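The pdfplumber-to-PyPDF2 fallback could look roughly like the sketch below. It is an illustration only, not the contents of scripts/extract.py: the function names pick_pdf_backend and extract_pdf_text are hypothetical, and the PyPDF2 branch assumes a release that provides PdfReader.

```python
import importlib


def pick_pdf_backend():
    """Return the name of the first importable PDF library, or None."""
    for name in ("pdfplumber", "PyPDF2"):
        try:
            importlib.import_module(name)
            return name
        except ImportError:
            continue
    return None


def extract_pdf_text(path):
    """Extract text from every page of a PDF and join it into one string."""
    backend = pick_pdf_backend()
    if backend == "pdfplumber":
        import pdfplumber

        with pdfplumber.open(path) as pdf:
            # extract_text() can return None for pages without a text layer
            return "\n".join((page.extract_text() or "") for page in pdf.pages)
    if backend == "PyPDF2":
        from PyPDF2 import PdfReader

        reader = PdfReader(path)
        return "\n".join((page.extract_text() or "") for page in reader.pages)
    raise RuntimeError("Install pdfplumber or PyPDF2 to extract PDF text")
```

Because all pages are joined into one string, a multi-page invoice reaches the LLM as a single block of text. A scanned PDF with no text layer would come back empty and needs the image path instead.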
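Invoice Extractor Data Schema and Categories
The skill organizes data with a strict JSON schema to stay compatible with accounting software. Key fields include:
| Field | Description | Requirement |
|---|---|---|
| vendor | Merchant name | Required |
| total | Final amount charged | Required |
| date | Transaction date (YYYY-MM-DD) | Required |
| lineItems | Array of descriptions, quantities, and prices | Optional |
| tax | Calculated tax amount | Optional |
| category | Expense category (for example: food, office) | Optional |
Every ledger entry is stored with a unique hash (vendor + date + total) to prevent duplicate records.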
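The source only says that the hash combines vendor, date, and total; it does not document the algorithm. The sketch below is one plausible way to build such a key. SHA-256, the lowercase and whitespace normalization, and the function names are all assumptions made for illustration.

```python
import hashlib
from decimal import Decimal


def dedup_key(vendor, date, total):
    """Build a stable key from vendor + date + total (hypothetical scheme)."""
    # Normalize so "Amazon" / " amazon " and 539.96 / "539.96" collide on purpose
    amount = Decimal(str(total)).quantize(Decimal("0.01"))
    normalized = f"{vendor.strip().lower()}|{date}|{amount}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(entry, existing_keys):
    """Check an entry against the set of keys already in the ledger."""
    return dedup_key(entry["vendor"], entry["date"], entry["total"]) in existing_keys
```

Editing the vendor, date, or total of an entry changes this key, so a scheme like this has to recompute it after every edit.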
name: invoice-extractor
description: >
Extract structured data from invoices and receipts (PDFs and images). Output JSON, CSV,
or build a running expense ledger. Use when someone shares an invoice to process, asks
to track expenses, categorize spending, or prepare tax documents.
Invoice Extractor
Turn invoices and receipts into structured expense data. Extract from PDFs and images, auto-categorize spending, and maintain a running CSV ledger.
Hybrid approach: A Python script handles PDF text extraction and ledger management, while you (the agent) parse the invoice content — LLMs understand varied formats far better than regex.
When to Use
- "Extract data from this invoice"
- "Track my expenses" / "Add to my expense ledger"
- "Categorize this receipt"
- "Process these invoices" / "Batch process receipts"
- "Show me my spending summary"
- "Prepare tax documents" / "Get my expenses for April"
Setup
pip install pdfplumber
# Fallback: PyPDF2 (auto-used if pdfplumber unavailable)
Script: scripts/extract.py (relative to this skill directory)
Config: expense-config.json (same directory)
Single Invoice Workflow
PDF Invoices
python3 scripts/extract.py pdf
Read the output text, parse it into structured JSON (see schema below), then confirm with the user before adding to ledger.
Image Invoices (jpg, png, webp, gif)
Use the image tool with a prompt like: "Extract all invoice/receipt data from this image. Return vendor, invoice number, date, line items, subtotal, tax, total, and currency."
Parse the result into structured JSON, then confirm with the user before adding to ledger.
Confirm Then Add
Always present extracted data for user review before writing to the ledger:
Invoice Extracted
Vendor: Amazon
Date: 2026-04-01
Invoice #: INV-2026-001
Description: Office supplies — keyboard and monitor
Total: €539.96 (incl. €100.97 tax)
Category: office (auto)
Add to ledger? (yes/edit/skip)
Format output for the current channel — adapt formatting to match what the platform supports. See references/formatting.md for platform-specific examples.
On confirmation, write the JSON to a temp file and run:
python3 scripts/extract.py ledger add /tmp/invoice-entry.json
Or pipe via stdin:
echo '' | python3 scripts/extract.py ledger add -
If the user says "edit", modify the requested fields and re-confirm. If "skip", discard.
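As a rough sketch of the "write the JSON to a temp file and run" step, the helpers below serialize a confirmed entry and build the ledger add command shown above. The helper names are hypothetical, and the code assumes it runs from the skill directory so that scripts/extract.py resolves.

```python
import json
import os
import tempfile


def write_entry_to_temp(entry):
    """Write a confirmed invoice entry to a temp JSON file and return its path."""
    fd, path = tempfile.mkstemp(prefix="invoice-entry-", suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(entry, handle, ensure_ascii=False, indent=2)
    return path


def ledger_add_command(entry_path):
    """Build the argument list for the documented `ledger add` command."""
    return ["python3", "scripts/extract.py", "ledger", "add", entry_path]
```

After the user answers "yes", the command list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a shell string avoids quoting problems with vendor names and descriptions.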
Batch Processing
python3 scripts/extract.py batch
- Run the batch command to get a JSON list of all PDFs and images
- Process each file one at a time (PDFs via the pdf command, images via the image tool)
- Collect all results; do NOT confirm each one individually
- Present a summary of ALL extracted data at the end
- Ask the user to confirm once: add all, edit specific entries, or skip
Show this summary after processing all files:
Batch Results - 8 files processed
1. Amazon EU S.a.r.l. - €191.84 - office
2. Tesco - €25.26 - food
3. DigitalOcean LLC - €35.81 - software
4. Insomnia Coffee - €9.84 - food
5. ACME Solutions Ltd - €3,867.11 - uncategorized
... (errors shown separately)
Total: €4,129.86 across 5 entries (1 error)
Add all to ledger? (yes/edit/skip)
On confirmation, add all entries at once. If the user wants to edit, modify specific entries and re-confirm.
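A batch summary of this kind can be assembled from the collected entries before the single confirmation prompt. The sketch below is a minimal illustration: batch_summary is a hypothetical helper, and the plain-hyphen line layout is an assumption that should be adapted to the channel.

```python
from decimal import Decimal


def batch_summary(entries, errors=0, symbol="€"):
    """Return one summary line per entry plus a closing total line."""
    lines = []
    total = Decimal("0")
    for index, entry in enumerate(entries, start=1):
        # Decimal(str(...)) avoids binary float rounding when summing money
        amount = Decimal(str(entry["total"]))
        total += amount
        category = entry.get("category") or "uncategorized"
        lines.append(f"{index}. {entry['vendor']} - {symbol}{amount:,.2f} - {category}")
    error_note = f" ({errors} error{'s' if errors != 1 else ''})" if errors else ""
    lines.append(f"Total: {symbol}{total:,.2f} across {len(entries)} entries{error_note}")
    return lines
```

Summing with Decimal keeps the total exact to the cent, which matters when the user is asked to approve every entry in one step.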
Viewing Expenses & Summaries
View entries with optional filters:
python3 scripts/extract.py ledger view [filters]
--from DATE Entries from this date (YYYY-MM-DD)
--to DATE Entries up to this date
--category CAT Filter by category name
--vendor VENDOR Filter by vendor (partial match)
--format json|csv Output format (default: json)
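To make the filter semantics concrete, here is a small sketch of how such filters might be applied to ledger entries in memory. It is not the script's code. Inclusive date bounds and case-insensitive vendor matching are assumptions, since the source only states "partial match".

```python
def filter_entries(entries, date_from=None, date_to=None, category=None, vendor=None):
    """Return the entries that pass every filter that was supplied."""
    result = []
    for entry in entries:
        date = entry.get("date", "")
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings
        if date_from and date < date_from:
            continue
        if date_to and date > date_to:
            continue
        if category and entry.get("category") != category:
            continue
        if vendor and vendor.lower() not in entry.get("vendor", "").lower():
            continue
        result.append(entry)
    return result
```

The string comparison on dates is one practical reason the ledger insists on YYYY-MM-DD.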
Edit an entry:
python3 scripts/extract.py ledger edit --id N --vendor "New Name"
python3 scripts/extract.py ledger edit --id N --total 250.00 --category software
python3 scripts/extract.py ledger edit --id N --date 2026-04-02
Editable fields: --vendor, --total, --date, --description, --category, --currency, --subtotal, --tax. Several fields can be changed in one command, and the dedup hash is recalculated automatically.
Delete an entry:
python3 scripts/extract.py ledger delete --id N
Removes the entry, renumbers the remaining IDs, and creates a backup.
Undo last add:
python3 scripts/extract.py ledger undo
Removes the most recently added entry (highest ID). One-level undo only.
Category summaries:
python3 scripts/extract.py ledger summary [--period week|month|year]
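The source does not define what each --period value covers. The sketch below assumes it means "from the start of the current calendar week, month, or year up to today" and totals the matching entries per category. Both function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from decimal import Decimal


def period_start(period, today):
    """First day of the calendar week (Monday), month, or year containing today."""
    if period == "week":
        return today - timedelta(days=today.weekday())
    if period == "month":
        return today.replace(day=1)
    if period == "year":
        return today.replace(month=1, day=1)
    raise ValueError(f"unknown period: {period}")


def summarize(entries, period, today):
    """Sum totals per category for entries dated within the period."""
    start = period_start(period, today).isoformat()
    end = today.isoformat()
    totals = defaultdict(Decimal)  # missing categories start at Decimal("0")
    for entry in entries:
        if start <= entry["date"] <= end:
            category = entry.get("category") or "uncategorized"
            totals[category] += Decimal(str(entry["total"]))
    return dict(totals)
```

Because negative totals are simply added, a credit note reduces its category's total without any special case.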
JSON Schema
Structure all extracted invoice data as:
{
"vendor": "Amazon",
"invoiceNumber": "INV-2026-001",
"date": "2026-04-01",
"dueDate": "2026-04-30",
"description": "Office supplies — keyboard and monitor",
"lineItems": [
{"description": "Mechanical Keyboard", "quantity": 1, "unitPrice": 89.99},
{"description": "USB-C Monitor", "quantity": 1, "unitPrice": 349.00}
],
"subtotal": 438.99,
"tax": 100.97,
"total": 539.96,
"currency": "EUR",
"category": "office"
}
Required for ledger: vendor, total, date
Optional: everything else; the script handles missing fields gracefully
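Before showing an extracted entry to the user, it can help to check that the numbers agree with each other. The sketch below is a hypothetical pre-check and not part of the skill: it verifies the three required fields, that the line items add up to the subtotal, and that subtotal plus tax equals the total.

```python
from decimal import Decimal

REQUIRED_FIELDS = ("vendor", "total", "date")


def check_entry(entry):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if entry.get(name) in (None, "")
    ]

    def money(value):
        return Decimal(str(value))

    items = entry.get("lineItems") or []
    if items and "subtotal" in entry:
        items_sum = sum(
            (money(i["quantity"]) * money(i["unitPrice"]) for i in items),
            Decimal("0"),
        )
        if items_sum != money(entry["subtotal"]):
            problems.append(f"line items sum to {items_sum}, subtotal is {entry['subtotal']}")
    if all(key in entry for key in ("subtotal", "tax", "total")):
        if money(entry["subtotal"]) + money(entry["tax"]) != money(entry["total"]):
            problems.append("subtotal + tax does not equal total")
    return problems
```

For the example above, 89.99 + 349.00 = 438.99 and 438.99 + 100.97 = 539.96, so the list comes back empty. Any mismatch is worth surfacing in the confirmation prompt rather than silently correcting.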
Auto-Categorization
Auto-categorizes based on keyword matching in expense-config.json. Checks vendor name and description against category keywords (case-insensitive).
python3 scripts/extract.py categories
Users can customize by editing the config. Suggest adding new keywords when a vendor doesn't match.
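Case-insensitive keyword matching of this kind can be sketched as follows. The shape of the keyword mapping (category name to list of keywords) is an assumption for illustration; the actual layout of expense-config.json is described in references/configuration.md.

```python
def categorize(entry, category_keywords):
    """Return the first category whose keyword appears in vendor or description."""
    haystack = f"{entry.get('vendor', '')} {entry.get('description', '')}".lower()
    for category, keywords in category_keywords.items():
        if any(keyword.lower() in haystack for keyword in keywords):
            return category
    return "uncategorized"


# Hypothetical keyword table, not the skill's shipped defaults
EXAMPLE_KEYWORDS = {
    "software": ["digitalocean", "github"],
    "food": ["tesco", "coffee"],
    "office": ["keyboard", "monitor", "stationery"],
}
```

With this table, categorize({"vendor": "Insomnia Coffee"}, EXAMPLE_KEYWORDS) yields "food", while an unknown vendor falls through to "uncategorized". That fall-through is the cue to suggest a new keyword to the user.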
Exporting the Ledger
Export ledger entries in platform-specific CSV formats for direct import into accounting software.
python3 scripts/extract.py ledger export --platform [filters] [--output FILE]
Filters: --from DATE, --to DATE, --category CAT, --vendor VENDOR
Built-in Platforms
| Platform | Use Case | Notes |
|---|---|---|
| xero | Bills/Expenses import | DD/MM/YYYY dates, includes AccountCode & TaxRate |
| freeagent | Out-of-pocket expenses | No header row, needs claimantName in config |
| wave | Bank transactions | Negative amounts for expenses |
| generic | Excel/Google Sheets | Full detail, clean format |
Examples
# Export all entries for Xero
python3 scripts/extract.py ledger export --platform xero
# Export April expenses to a file
python3 scripts/extract.py ledger export --platform xero --from 2026-04-01 --to 2026-04-30 --output /tmp/xero-export.csv
# Filter by category for FreeAgent
python3 scripts/extract.py ledger export --platform freeagent --category travel --output /tmp/freeagent-travel.csv
Custom Presets
Define custom export formats in expense-config.json under exportPresets:
{
"exportPresets": {
"my-accounting": {
"columns": ["date", "vendor", "amount", "category", "notes"],
"headerRow": true,
"dateFormat": "%m/%d/%Y",
"amountHandling": "positive",
"fieldMapping": {
"date": "date",
"vendor": "vendor",
"amount": "total",
"category": "category",
"notes": "description"
}
}
}
}
The fieldMapping maps CSV column names → ledger field names. Use: --platform my-accounting
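To show how a preset like this could drive an export, the sketch below applies columns, headerRow, dateFormat, and fieldMapping to ledger entries and returns CSV text. It is not the script's implementation. In particular, treating "positive" as "leave amounts unchanged" and supporting a "negative" value that flips the sign are assumptions.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal


def export_with_preset(entries, preset):
    """Render ledger entries as CSV text according to an export preset."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if preset.get("headerRow"):
        writer.writerow(preset["columns"])
    for entry in entries:
        row = []
        for column in preset["columns"]:
            field = preset["fieldMapping"][column]  # CSV column -> ledger field
            value = entry.get(field, "")
            if field == "date" and value:
                parsed = datetime.strptime(value, "%Y-%m-%d")
                value = parsed.strftime(preset["dateFormat"])
            elif field == "total" and value != "":
                amount = Decimal(str(value))
                if preset.get("amountHandling") == "negative":  # assumed option
                    amount = -amount
                value = f"{amount:.2f}"
            row.append(value)
        writer.writerow(row)
    return buffer.getvalue()
```

With the my-accounting preset above, an entry dated 2026-04-01 is written with the date 04/01/2026, because the preset's dateFormat is %m/%d/%Y.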
Sending the File
If no --output is specified, CSV goes to stdout. For file attachments:
- Use --output /tmp/invoice-export-- .csv
- Send via MEDIA:
Here's your Xero import file (12 entries, April 2026).
MEDIA:/tmp/invoice-export-xero-20260406.csv
Unknown Platform? (LLM Discovery Flow)
If the user names a platform that isn't built-in and isn't in their custom presets:
- Use web_search to find "[platform name] CSV import format expenses"
- Identify the required columns and their format
- Create a fieldMapping from our ledger fields to their columns
- Add the preset to the user's expense-config.json under exportPresets
- Tell the user the preset was created and saved
- Proceed with the export using the new preset
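The "add the preset to expense-config.json" step amounts to a small JSON merge. The sketch below is one cautious way to do it. The helper names are hypothetical, and refusing to overwrite an existing preset is a design choice made here, not documented behavior.

```python
import copy
import json


def add_export_preset(config, name, preset):
    """Return a copy of the config with the preset added under exportPresets."""
    updated = copy.deepcopy(config)
    presets = updated.setdefault("exportPresets", {})
    if name in presets:
        raise ValueError(f"export preset already exists: {name}")
    presets[name] = preset
    return updated


def save_config(path, config):
    """Write the config back as indented UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2, ensure_ascii=False)
        handle.write("\n")
```

Working on a copy means the original config stays untouched if the user rejects the discovered mapping.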
Config
The config file (expense-config.json) lives in the skill root directory. See references/configuration.md for the full config reference.
# Use a custom config
python3 scripts/extract.py --config /path/to/config.json
Important Notes
- Always confirm before adding to the ledger; never auto-add extracted data
- Duplicate detection: entries are auto-checked against the existing ledger (vendor + date + total hash). Duplicates are skipped with a warning. Use --force to override
- Dates must be YYYY-MM-DD: convert if the invoice uses a different format
- Currency symbols: normalize to ISO codes (€ → EUR, £ → GBP, $ → USD)
- Backups: the script automatically backs up the ledger before each write (keeps last 5)
For edge cases (encrypted PDFs, scanned/image-only PDFs, dependency errors), see references/notes.md.
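The currency normalization rule in the notes above can be sketched as a small parser that splits a printed amount into an ISO code and a decimal value. The symbol table comes from the notes; the function name, the EUR default, and the assumption that amounts use a comma for thousands and a dot for decimals are illustrative.

```python
import re
from decimal import Decimal

SYMBOL_TO_ISO = {"€": "EUR", "£": "GBP", "$": "USD"}


def parse_amount(text, default_currency="EUR"):
    """Split text such as '€3,867.11' into ('EUR', Decimal('3867.11'))."""
    text = text.strip()
    currency = default_currency
    for symbol, code in SYMBOL_TO_ISO.items():
        if symbol in text:
            currency = code
            text = text.replace(symbol, "")
            break
    # Accepts an optional minus sign, comma-grouped thousands, up to 2 decimals
    match = re.fullmatch(r"\s*(-?)\s*(\d{1,3}(?:,\d{3})*|\d+)(\.\d{1,2})?\s*", text)
    if not match:
        raise ValueError(f"unrecognized amount: {text!r}")
    sign, whole, fraction = match.groups()
    return currency, Decimal(f"{sign}{whole.replace(',', '')}{fraction or ''}")
```

Amounts printed in the continental style (3.867,11) are rejected rather than guessed at, which fits the rule of asking the user whenever the data is unclear.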
Edge Cases
Ambiguous Dates
- "03/04/2026" is ambiguous (March 4 US, April 3 EU)
- If the invoice doesn't specify a format, check the config defaults.dateFormat
- If still unclear, ask the user: "Is this March 4th or April 3rd?"
- Common formats: DD/MM/YYYY (Ireland, UK, EU), MM/DD/YYYY (US), YYYY-MM-DD (ISO — always prefer this)
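The ambiguity check described above can be made mechanical: parse the string under each candidate format and see how many distinct dates come out. In the sketch below, to_iso_dates is a hypothetical helper, and it assumes defaults.dateFormat holds a strptime-style pattern such as %d/%m/%Y.

```python
from datetime import datetime

CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]


def to_iso_dates(text, default_format=None):
    """Return every distinct ISO date the text could mean.

    One result is unambiguous, two means the user must choose,
    and an empty list means the text could not be parsed.
    """
    formats = [default_format] if default_format else CANDIDATE_FORMATS
    candidates = []
    for fmt in formats:
        try:
            iso = datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
        if iso not in candidates:
            candidates.append(iso)
    return candidates
```

For "03/04/2026" this returns both 2026-04-03 and 2026-03-04, which is the signal to ask the user. "25/04/2026" has only one valid reading, and passing the configured format settles the question without a prompt.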
Missing Fields
- If no invoice number: leave blank in JSON, the script handles it
- If no line items: just use the description field
- If no tax breakdown: set tax to 0 and note "tax not specified"
- If no currency: use the config default (EUR)
- If no vendor name but there's a company logo in the image: best effort from context
- Always show the user what was extracted — even incomplete data — and let them confirm or edit
Credit Notes and Refunds
- Negative totals indicate a credit/refund
- Still add to ledger — negative entries are valid expenses (they reduce totals)
- Categorize as normal based on vendor
- In the confirmation prompt, note it's a credit: "Credit note detected (negative total)"
Multi-page PDFs
- pdfplumber extracts text from all pages into one output
- The LLM sees all text and can find totals on any page
- No special handling needed — it just works
Non-invoice PDFs
- If the extracted text doesn't look like an invoice (no vendor, no amounts, no date), tell the user: "This doesn't appear to be an invoice or receipt. Want to skip it?"
- Don't force extraction on something that clearly isn't an invoice
Very Small Receipts
- Coffee receipts, parking tickets — often low-quality images or tiny text
- The LLM should still attempt extraction but flag low confidence: "Low confidence: please verify the amounts"