Invoice Extractor: Automated Expense Tracking - Openclaw Skills

Author: Internet

2026-04-18

AI Tutorials

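What is Invoice Extractor?

Invoice Extractor is an Openclaw Skills extension that turns messy financial documents into structured accounting data. It takes a hybrid approach: a Python script handles PDF text extraction and ledger management, while an LLM parses the complex invoice layouts that traditional regex often fails to capture. Whether you are working with digital PDFs or photos of paper receipts, the skill is designed to capture the financial data and categorize it automatically.

Invoice Extractor fits into your existing workflow and keeps a clean expense history in CSV or JSON format. Its confirm-then-add workflow puts data integrity first: every extracted detail, from tax breakdowns to vendor names, is verified by the user before it is committed to the permanent expense ledger.

Download: https://github.com/openclaw/skills/tree/main/skills/99rebels/rebels-invoice-extractor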

Installation and Download

1. ClawHub CLI

The fastest way to install the skill directly from the source.

npx clawhub@latest install rebels-invoice-extractor

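2. Manual Installation

Copy the skill folder to one of the following locations:

Global: ~/.openclaw/skills/
Workspace: /skills/

Priority: workspace > local > built-in

3. Prompt-Based Installation

Copy this prompt into OpenClaw to install the skill automatically.

Please help me install rebels-invoice-extractor using Clawhub. If Clawhub is not installed yet, install it first (npm i -g clawhub).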

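Invoice Extractor Use Cases

  • Automate data extraction from batches of PDF invoices for business accounting.
  • Track everyday personal spending by photographing receipts.
  • Prepare organized expense reports and tax documents for a specific period.
  • Categorize spending automatically across departments or project codes.
  • Export verified expense data for direct import into accounting platforms such as Xero, FreeAgent, or Wave.

How Invoice Extractor Works

  1. The user gives the agent an invoice as a file path or an image upload.
  2. For PDFs, the system extracts text with pdfplumber; for images, the agent uses vision processing to identify the key financial fields.
  3. The extracted data is mapped to a standardized JSON schema covering vendor, date, line items, and tax.
  4. The skill runs an automatic categorization check against the local config file to tag the expense (for example: office, travel, software).
  5. A summary is shown to the user for review, with the option to edit manually or confirm.
  6. Once approved, the data is appended to a persistent ledger file, with automatic duplicate detection and backup creation.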

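Invoice Extractor Configuration Guide

To start using this Openclaw Skills utility, make sure the required Python dependency is installed:

pip install pdfplumber
# PyPDF2 is used automatically as a fallback if pdfplumber is unavailable

The skill relies on scripts/extract.py in the skill directory for processing logic and on expense-config.json for categorization rules.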

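Invoice Extractor Data Schema and Categories

The skill organizes data with a strict JSON schema to stay compatible with accounting software. Key fields include:

Field      Description                                   Requirement
vendor     Merchant name                                 Required
total      Final amount charged                          Required
date       Transaction date (YYYY-MM-DD)                 Required
lineItems  Array of description, quantity, and price     Optional
tax        Calculated tax amount                         Optional
category   Expense category (for example: food, office)  Optional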

Every ledger entry is stored with a unique hash (vendor + date + total) to prevent duplicate records.
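A minimal sketch of how a vendor + date + total hash could be used for duplicate detection. The normalization rules (lowercased vendor, two-decimal total), the choice of SHA-256, and the function names `dedup_key` and `is_duplicate` are assumptions for illustration; the actual script may build its hash differently.

```python
import hashlib


def dedup_key(vendor: str, date: str, total: float) -> str:
    """Build a duplicate-detection hash from vendor + date + total.

    Assumption: the vendor is compared case-insensitively and the total
    is fixed to two decimals before hashing.
    """
    canonical = f"{vendor.strip().lower()}|{date}|{float(total):.2f}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_duplicate(entry: dict, ledger: list) -> bool:
    """Return True if an entry with the same key is already in the ledger."""
    key = dedup_key(entry["vendor"], entry["date"], entry["total"])
    return any(
        dedup_key(e["vendor"], e["date"], e["total"]) == key for e in ledger
    )
```

With this scheme, two entries collide only when all three fields match after normalization, so the same vendor billing the same amount on a different date is still accepted.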

name: invoice-extractor
description: >
  Extract structured data from invoices and receipts (PDFs and images). Output JSON, CSV,
  or build a running expense ledger. Use when someone shares an invoice to process, asks
  to track expenses, categorize spending, or prepare tax documents.

Invoice Extractor

Turn invoices and receipts into structured expense data. Extract from PDFs and images, auto-categorize spending, and maintain a running CSV ledger.

Hybrid approach: A Python script handles PDF text extraction and ledger management, while you (the agent) parse the invoice content — LLMs understand varied formats far better than regex.


When to Use

  • "Extract data from this invoice"
  • "Track my expenses" / "Add to my expense ledger"
  • "Categorize this receipt"
  • "Process these invoices" / "Batch process receipts"
  • "Show me my spending summary"
  • "Prepare tax documents" / "Get my expenses for April"

Setup

pip install pdfplumber
# Fallback: PyPDF2 (auto-used if pdfplumber unavailable)

Script: scripts/extract.py (relative to this skill directory)
Config: expense-config.json (same directory)


Single Invoice Workflow

PDF Invoices

python3 scripts/extract.py pdf 

Read the output text, parse it into structured JSON (see schema below), then confirm with the user before adding to ledger.

Image Invoices (jpg, png, webp, gif)

Use the image tool with a prompt like: "Extract all invoice/receipt data from this image. Return vendor, invoice number, date, line items, subtotal, tax, total, and currency."

Parse the result into structured JSON, then confirm with the user before adding to ledger.

Confirm Then Add

Always present extracted data for user review before writing to the ledger:

Invoice Extracted
Vendor: Amazon
Date: 2026-04-01
Invoice #: INV-2026-001
Description: Office supplies — keyboard and monitor
Total: €539.96 (incl. €100.97 tax)
Category: office (auto)

Add to ledger? (yes/edit/skip)

Format output for the current channel — adapt formatting to match what the platform supports. See references/formatting.md for platform-specific examples.

On confirmation, write the JSON to a temp file and run:

python3 scripts/extract.py ledger add /tmp/invoice-entry.json

Or pipe via stdin:

echo '' | python3 scripts/extract.py ledger add -

If the user says "edit", modify the requested fields and re-confirm. If "skip", discard.
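The confirm step can be sketched as a small dispatcher. This is an illustration only: `write_entry_file`, `ledger_add_command`, and `handle_reply` are hypothetical helper names, and only the `ledger add` subcommand itself comes from the text above.

```python
import json
import tempfile


def write_entry_file(entry: dict) -> str:
    """Write the confirmed entry to a temp JSON file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as fh:
        json.dump(entry, fh, ensure_ascii=False)
        return fh.name


def ledger_add_command(entry_path: str) -> list:
    """Argument list for the documented `ledger add` subcommand."""
    return ["python3", "scripts/extract.py", "ledger", "add", entry_path]


def handle_reply(reply: str, entry: dict):
    """Map the user's yes/edit/skip answer to the next action."""
    answer = reply.strip().lower()
    if answer == "yes":
        return ledger_add_command(write_entry_file(entry))
    if answer == "edit":
        return "edit"  # caller modifies the fields and re-confirms
    if answer == "skip":
        return None  # discard the entry
    raise ValueError(f"expected yes/edit/skip, got {reply!r}")
```

The returned argument list could then be passed to `subprocess.run(cmd, check=True)` from the skill directory; nothing is written to the ledger unless the user answered "yes".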


Batch Processing

python3 scripts/extract.py batch 
  1. Run the batch command to get a JSON list of all PDFs and images
  2. Process each file one at a time (PDFs via pdf command, images via image tool)
  3. Collect all results — do NOT confirm each one individually
  4. Present a summary of ALL extracted data at the end
  5. Ask the user to confirm once: add all, edit specific entries, or skip

Show this summary after processing all files:

Batch Results — 8 files processed

1. Amazon EU S.a.r.l.  —  €191.84  —  office
2. Tesco              —  €25.26   —  food
3. DigitalOcean LLC    —  €35.81   —  software
4. Insomnia Coffee     —  €9.84    —  food
5. ACME Solutions Ltd  —  €3,867.11 —  uncategorized
... (errors shown separately)

Total: €4,129.86 across 5 entries (1 error)

Add all to ledger? (yes/edit/skip)

On confirmation, add all entries at once. If the user wants to edit, modify specific entries and re-confirm.


Viewing Expenses & Summaries

View entries with optional filters:

python3 scripts/extract.py ledger view [filters]
--from DATE       Entries from this date (YYYY-MM-DD)
--to DATE         Entries up to this date
--category CAT    Filter by category name
--vendor VENDOR   Filter by vendor (partial match)
--format json|csv Output format (default: json)
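To make the filter semantics concrete, here is a rough sketch of how the four filters might be applied to a list of ledger entries. `filter_entries` is a hypothetical function, and two details are assumptions: that `--to` is inclusive and that the partial vendor match ignores case.

```python
def filter_entries(entries, date_from=None, date_to=None, category=None, vendor=None):
    """Apply the documented view filters to a list of ledger dicts.

    ISO dates (YYYY-MM-DD) sort correctly as plain strings, so no date
    parsing is needed for the range check.
    """
    result = []
    for e in entries:
        if date_from and e["date"] < date_from:
            continue
        if date_to and e["date"] > date_to:  # assumption: --to is inclusive
            continue
        if category and e.get("category") != category:
            continue
        # assumption: partial vendor match is case-insensitive
        if vendor and vendor.lower() not in e["vendor"].lower():
            continue
        result.append(e)
    return result
```

Storing dates in ISO format is what makes the plain string comparison safe here.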

Edit an entry:

python3 scripts/extract.py ledger edit --id N --vendor "New Name"
python3 scripts/extract.py ledger edit --id N --total 250.00 --category software
python3 scripts/extract.py ledger edit --id N --date 2026-04-02

Editable fields: --vendor, --total, --date, --description, --category, --currency, --subtotal, --tax. Multiple fields in one command. Auto-recalculates the dedup hash.

Delete an entry:

python3 scripts/extract.py ledger delete --id N

Removes the entry, renumbers remaining IDs, creates a backup.

Undo last add:

python3 scripts/extract.py ledger undo

Removes the most recently added entry (highest ID). One-level undo only.

Category summaries:

python3 scripts/extract.py ledger summary [--period week|month|year]
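As an illustration of what a category summary involves, the sketch below totals entries per category for a period. It assumes that "week", "month", and "year" mean the calendar period containing today, and that entries without a category are grouped as "uncategorized"; `period_start` and `summarize` are hypothetical names, not the script's API.

```python
from collections import defaultdict
from datetime import date, timedelta


def period_start(today: date, period: str) -> date:
    """First day of the calendar week (Monday), month, or year."""
    if period == "week":
        return today - timedelta(days=today.weekday())
    if period == "month":
        return today.replace(day=1)
    if period == "year":
        return today.replace(month=1, day=1)
    raise ValueError(f"unknown period: {period}")


def summarize(entries, period: str, today: date) -> dict:
    """Sum entry totals per category between the period start and today."""
    start = period_start(today, period).isoformat()
    end = today.isoformat()
    totals = defaultdict(float)
    for e in entries:
        if start <= e["date"] <= end:
            totals[e.get("category") or "uncategorized"] += e["total"]
    return {cat: round(amount, 2) for cat, amount in totals.items()}
```

Because credit notes are stored with negative totals, they reduce the category sum without any special handling.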

JSON Schema

Structure all extracted invoice data as:

{
  "vendor": "Amazon",
  "invoiceNumber": "INV-2026-001",
  "date": "2026-04-01",
  "dueDate": "2026-04-30",
  "description": "Office supplies — keyboard and monitor",
  "lineItems": [
    {"description": "Mechanical Keyboard", "quantity": 1, "unitPrice": 89.99},
    {"description": "USB-C Monitor", "quantity": 1, "unitPrice": 349.00}
  ],
  "subtotal": 438.99,
  "tax": 100.97,
  "total": 539.96,
  "currency": "EUR",
  "category": "office"
}

Required for ledger: vendor, total, date
Optional: everything else (the script handles missing fields gracefully)
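Before showing an entry for confirmation, it can help to check the three required fields. The following is a sketch of such a check, not the script's own validation; `validate_entry` and its messages are hypothetical.

```python
from datetime import datetime

REQUIRED = ("vendor", "total", "date")


def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry is usable."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED
        if entry.get(name) in (None, "")
    ]
    raw_date = entry.get("date")
    if raw_date:
        try:
            datetime.strptime(raw_date, "%Y-%m-%d")
            well_formed = len(raw_date) == 10  # reject unpadded "2026-4-1"
        except ValueError:
            well_formed = False
        if not well_formed:
            problems.append("date must be YYYY-MM-DD")
    total = entry.get("total")
    if total is not None and not isinstance(total, (int, float)):
        problems.append("total must be a number")
    return problems
```

An entry with problems can still be shown to the user, who then fills in or corrects the fields during the "edit" step.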


Auto-Categorization

Auto-categorizes based on keyword matching in expense-config.json. Checks vendor name and description against category keywords (case-insensitive).

python3 scripts/extract.py categories

Users can customize by editing the config. Suggest adding new keywords when a vendor doesn't match.
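Keyword matching of this kind can be sketched in a few lines. The shape of the category table below (category name mapped to a keyword list) is an assumption, since the layout of expense-config.json is not shown here, and `categorize` is a hypothetical function.

```python
def categorize(entry: dict, categories: dict) -> str:
    """Return the first category whose keyword appears in vendor or description.

    Matching is case-insensitive, as described above. Falls back to
    "uncategorized" when nothing matches.
    """
    haystack = f"{entry.get('vendor', '')} {entry.get('description', '')}".lower()
    for name, keywords in categories.items():
        if any(keyword.lower() in haystack for keyword in keywords):
            return name
    return "uncategorized"
```

For example, with `{"food": ["tesco", "coffee"]}` the vendor "Insomnia Coffee" maps to "food". When the result is "uncategorized", that is the moment to suggest a new keyword to the user.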


Exporting the Ledger

Export ledger entries in platform-specific CSV formats for direct import into accounting software.

python3 scripts/extract.py ledger export --platform  [filters] [--output FILE]

Filters: --from DATE, --to DATE, --category CAT, --vendor VENDOR

Built-in Platforms

Platform   Use Case                Notes
xero       Bills/Expenses import   DD/MM/YYYY dates, includes AccountCode & TaxRate
freeagent  Out-of-pocket expenses  No header row, needs claimantName in config
wave       Bank transactions       Negative amounts for expenses
generic    Excel/Google Sheets     Full detail, clean format

Examples

# Export all entries for Xero
python3 scripts/extract.py ledger export --platform xero

# Export April expenses to a file
python3 scripts/extract.py ledger export --platform xero --from 2026-04-01 --to 2026-04-30 --output /tmp/xero-export.csv

# Filter by category for FreeAgent
python3 scripts/extract.py ledger export --platform freeagent --category travel --output /tmp/freeagent-travel.csv

Custom Presets

Define custom export formats in expense-config.json under exportPresets:

{
  "exportPresets": {
    "my-accounting": {
      "columns": ["date", "vendor", "amount", "category", "notes"],
      "headerRow": true,
      "dateFormat": "%m/%d/%Y",
      "amountHandling": "positive",
      "fieldMapping": {
        "date": "date",
        "vendor": "vendor",
        "amount": "total",
        "category": "category",
        "notes": "description"
      }
    }
  }
}

The fieldMapping maps CSV column names → ledger field names. Use: --platform my-accounting
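To show how the preset keys fit together, here is a sketch that renders ledger entries to CSV from a preset like the one above. It is not the script's exporter: `export_csv` is a hypothetical name, and reading `"amountHandling": "positive"` as "emit absolute values" is an assumption.

```python
import csv
import io
from datetime import datetime


def export_csv(entries, preset: dict) -> str:
    """Render ledger entries as CSV text according to an export preset."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    if preset.get("headerRow", True):
        writer.writerow(preset["columns"])
    for entry in entries:
        row = []
        for column in preset["columns"]:
            field = preset["fieldMapping"][column]  # CSV column -> ledger field
            value = entry.get(field, "")
            if field == "date" and value:
                parsed = datetime.strptime(value, "%Y-%m-%d")
                value = parsed.strftime(preset["dateFormat"])
            elif field == "total" and value != "":
                # Assumption: "positive" means absolute values; other
                # modes are not handled in this sketch.
                if preset.get("amountHandling") == "positive":
                    value = abs(value)
                value = f"{value:.2f}"
            row.append(value)
        writer.writerow(row)
    return out.getvalue()
```

Keeping the ledger in ISO dates and converting only at export time means each platform's date format stays a one-line preset setting.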

Sending the File

If no --output is specified, CSV goes to stdout. For file attachments:

  1. Use --output /tmp/invoice-export--.csv
  2. Send via MEDIA:
Here's your Xero import file (12 entries, April 2026).
MEDIA:/tmp/invoice-export-xero-20260406.csv

Unknown Platform? (LLM Discovery Flow)

If the user names a platform that isn't built-in and isn't in their custom presets:

  1. Use web_search to find "[platform name] CSV import format expenses"
  2. Identify the required columns and their format
  3. Create a fieldMapping from our ledger fields to their columns
  4. Add the preset to the user's expense-config.json under exportPresets
  5. Tell the user the preset was created and saved
  6. Proceed with the export using the new preset

Config

The config file (expense-config.json) lives in the skill root directory. See references/configuration.md for the full config reference.

# Use a custom config
python3 scripts/extract.py --config /path/to/config.json 

Important Notes

  • Always confirm before adding to ledger — never auto-add extracted data
  • Duplicate detection — entries are auto-checked against existing ledger (vendor + date + total hash). Duplicates are skipped with a warning. Use --force to override
  • Dates must be YYYY-MM-DD — convert if the invoice uses a different format
  • Currency symbols — normalize to ISO codes (€ → EUR, £ → GBP, $ → USD)
  • Backups — the script automatically backs up the ledger before each write (keeps last 5)

For edge cases (encrypted PDFs, scanned/image-only PDFs, dependency errors), see references/notes.md.
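The currency rule above can be illustrated with a short helper that splits a printed amount into a number and an ISO code. `parse_amount` is a hypothetical function; it assumes a comma is a thousands separator and a dot is the decimal mark, which does not hold for every invoice locale.

```python
SYMBOL_TO_ISO = {"€": "EUR", "£": "GBP", "$": "USD"}


def parse_amount(text: str, default_currency: str = "EUR"):
    """Split a printed amount such as '€3,867.11' into (amount, ISO code)."""
    text = text.strip()
    currency = default_currency  # used when no symbol is printed
    for symbol, code in SYMBOL_TO_ISO.items():
        if symbol in text:
            currency = code
            text = text.replace(symbol, "")
            break
    # Assumption: comma = thousands separator, dot = decimal mark
    number = text.replace(",", "").strip()
    return float(number), currency
```

Note that "$" is mapped to USD as the notes specify, even though other currencies share the symbol; the confirmation step is where the user can correct that.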


Edge Cases

Ambiguous Dates

  • "03/04/2026" is ambiguous (March 4 US, April 3 EU)
  • If the invoice doesn't specify a format, check the config defaults.dateFormat
  • If still unclear, ask the user: "Is this March 4th or April 3rd?"
  • Common formats: DD/MM/YYYY (Ireland, UK, EU), MM/DD/YYYY (US), YYYY-MM-DD (ISO — always prefer this)
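The decision procedure in the list above can be sketched as follows. `resolve_date` is a hypothetical helper, and the format labels "DD/MM/YYYY" and "MM/DD/YYYY" are assumed values for defaults.dateFormat; the real config may use different strings.

```python
from datetime import datetime

FORMATS = {"DD/MM/YYYY": "%d/%m/%Y", "MM/DD/YYYY": "%m/%d/%Y"}


def resolve_date(text: str, default_format=None):
    """Return an ISO date string, or None when the user has to be asked."""
    readings = {}
    for name, fmt in FORMATS.items():
        try:
            readings[name] = datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass  # not a valid date under this format
    distinct = set(readings.values())
    if len(distinct) == 1:
        return distinct.pop()  # only one possible reading
    if default_format in readings:
        return readings[default_format]  # config default breaks the tie
    return None  # ambiguous or unparseable: ask the user
```

A date such as "25/04/2026" resolves on its own because 25 cannot be a month, while "03/04/2026" needs either the config default or a question to the user.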

Missing Fields

  • If no invoice number: leave blank in JSON, the script handles it
  • If no line items: just use the description field
  • If no tax breakdown: set tax to 0 and note "tax not specified"
  • If no currency: use the config default (EUR)
  • If no vendor name but there's a company logo in the image: best effort from context
  • Always show the user what was extracted — even incomplete data — and let them confirm or edit

Credit Notes and Refunds

  • Negative totals indicate a credit/refund
  • Still add to ledger — negative entries are valid expenses (they reduce totals)
  • Category as normal based on vendor
  • In the confirmation prompt, note it's a credit: "Credit note detected (negative total)"

Multi-page PDFs

  • pdfplumber extracts text from all pages into one output
  • The LLM sees all text and can find totals on any page
  • No special handling needed — it just works
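For readers curious how all pages can end up in one output, here is a sketch of multi-page extraction with the PyPDF2 fallback mentioned in Setup. It is an illustration under assumptions, not the contents of scripts/extract.py; `join_pages` and `extract_pdf_text` are hypothetical names.

```python
def join_pages(page_texts) -> str:
    """Merge per-page text into one string, skipping empty pages."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(path: str) -> str:
    """Extract text from every page, preferring pdfplumber over PyPDF2."""
    try:
        import pdfplumber
    except ImportError:
        pdfplumber = None
    if pdfplumber is not None:
        with pdfplumber.open(path) as pdf:
            return join_pages(page.extract_text() for page in pdf.pages)
    try:
        from PyPDF2 import PdfReader
    except ImportError as exc:
        raise RuntimeError("install pdfplumber or PyPDF2") from exc
    reader = PdfReader(path)
    return join_pages(page.extract_text() for page in reader.pages)
```

A scanned, image-only PDF would produce an empty string here, which is one of the edge cases that references/notes.md is said to cover.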

Non-invoice PDFs

  • If the extracted text doesn't look like an invoice (no vendor, no amounts, no date), tell the user: "This doesn't appear to be an invoice or receipt. Want to skip it?"
  • Don't force extraction on something that clearly isn't an invoice

Very Small Receipts

  • Coffee receipts, parking tickets — often low-quality images or tiny text
  • The LLM should still attempt extraction but flag low confidence: "Low confidence — please verify the amounts"
