Peer Review: A Local LLM Critique Layer for AI Agents - Openclaw Skills
Author: Internet
2026-03-29
What Is Peer Review?
The Peer Review skill is a key quality-assurance layer for AI-driven workflows. Using local LLM inference via Ollama, it implements a fan-out architecture in which several models (such as Mistral and Llama 3.1) independently critique the output of a cloud model such as Claude. This lets developers using Openclaw Skills catch hallucinations, logical inconsistencies, and overconfident claims before they reach production.
The skill is designed for demanding scenarios where accuracy is critical. It does more than offer a second opinion: it orchestrates a swarm of local reviewers and aggregates their findings through consensus logic. By providing a transparent, local, and cost-effective way to validate complex technical analysis and creative content, it significantly improves the reliability of agent output.
Download: https://github.com/openclaw/skills/tree/main/skills/staybased/peer-review
Installation & Download
1. ClawHub CLI
The fastest way to install the skill directly from the source.
npx clawhub@latest install peer-review
2. Manual Installation
Copy the skill folder to one of the following locations:
Global: ~/.openclaw/skills/
Workspace: /skills/
Precedence: workspace > local > built-in
3. Prompt Installation
Copy this prompt into OpenClaw to install automatically.
Please install peer-review for me using Clawhub. If Clawhub is not installed yet, install it first (npm i -g clawhub).
Peer Review Use Cases
- Validating high-stakes financial trade analyses or technical reports generated by cloud agents.
- Reviewing the quality of agent output in automated content pipelines before publishing.
- Benchmarking local models' accuracy and reasoning against cloud models.
- Detecting fabricated citations or nonexistent sources in long-form generated text.
- Providing a safety net for autonomous agents performing multi-step logical reasoning.
How Peer Review Works
- The primary cloud model (such as Claude) generates the initial analysis or response.
- The skill triggers a fan-out, sending the input text to three different local models: Drift (Mistral 7B), Pip (TinyLlama), and Lume (Llama 3.1).
- Each model plays the role of a skeptical reviewer, analyzing the text for factual, logical, and structural errors according to a specialized prompt.
- Critiques are returned as structured JSON identifying specific quotes, issue categories, and confidence levels.
- An aggregation script synthesizes the results, deduplicating flags and weighting them by model confidence to determine consensus.
- A final report is produced with a recommendation to publish, revise, or flag for human review.
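The fan-out step above can be sketched as a small bash helper. It assumes Ollama's local HTTP API (POST /api/generate on port 11434); the model names follow the skill's roster, and the prompt text is abbreviated from the critique template.

```shell
# Build the JSON request body for one reviewer model (jq keeps quoting safe).
build_request() {
  local model="$1" text="$2"
  jq -n --arg m "$model" --arg t "$text" \
    '{model: $m,
      prompt: ("You are a skeptical reviewer. Analyze the following text for errors.\n\nTEXT:\n" + $t),
      stream: false, format: "json"}'
}

# Send the same input to every local reviewer and store one critique per model.
fan_out() {
  local text="$1" out_dir="${2:-critiques}"
  mkdir -p "$out_dir"
  for model in mistral:7b tinyllama:1.1b llama3.1:8b; do
    build_request "$model" "$text" |
      curl -s http://localhost:11434/api/generate -d @- |
      jq -r '.response' > "$out_dir/${model/:/_}.json"
  done
}
```

The `format: "json"` field asks Ollama to constrain the response to valid JSON, which simplifies the aggregation step.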
Peer Review Configuration Guide
To integrate it into your Openclaw Skills library, make sure Ollama is installed and the required models have been pulled locally.
# Pull the required local models via Ollama
ollama pull mistral:7b
ollama pull tinyllama:1.1b
ollama pull llama3.1:8b
# Make sure the dependencies are available
sudo apt-get install jq curl
# Run peer review on a single document
bash scripts/peer-review.sh [output_dir]
Peer Review Data Schema & Taxonomy
The Peer Review skill produces structured metadata for each critique to support automated decision-making.
| Attribute | Type | Description |
|---|---|---|
| category | string | Classification (factual, logical, missing, overconfidence, hallucinated_source) |
| quote | string | The specific excerpt of the source text that was flagged |
| issue | string | Detailed explanation of the identified error or concern |
| confidence | integer | The model's certainty score, from 0 to 100 |
All review logs and performance metrics are stored in the experiments/peer-review-results/ directory for long-term tracking of the true positive rate (TPR).
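A hypothetical flag following this schema, filtered with jq; the 80-point threshold is illustrative, not part of the skill's published defaults.

```shell
# One example critique record matching the schema above.
flag='{"category":"factual","quote":"Bitcoin launched in 2010","issue":"Bitcoin launched in January 2009, not 2010.","confidence":92}'

# Keep only flags at or above the confidence threshold.
high=$(printf '%s' "$flag" | jq -r 'select(.confidence >= 80) | .category')
echo "$high"
```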
name: peer-review
description: |
  Multi-model peer review layer using local LLMs via Ollama to catch errors in cloud model output.
  Fan-out critiques to 2-3 local models, aggregate flags, synthesize consensus.
  Use when: validating trade analyses, reviewing agent output quality, testing local model accuracy,
  checking any high-stakes Claude output before publishing or acting on it.
  Don't use when: simple fact-checking (just search the web), tasks that don't benefit from
  multi-model consensus, time-critical decisions where 60s latency is unacceptable,
  reviewing trivial or low-stakes content.
  Negative examples:
  - "Check if this date is correct" → No. Just web search it.
  - "Review my grocery list" → No. Not worth multi-model inference.
  - "I need this answer in 5 seconds" → No. Peer review adds 30-60s latency.
  Edge cases:
  - Short text (<50 words) → Models may not find meaningful issues. Consider skipping.
  - Highly technical domain → Local models may lack domain knowledge. Weight flags lower.
  - Creative writing → Factual review doesn't apply well. Use only for logical consistency.
version: "1.0"
Peer Review — Local LLM Critique Layer
Hypothesis: Local LLMs can catch ≥30% of real errors in cloud output with <50% false positive rate.
Architecture
Cloud Model (Claude) produces analysis
│
▼
┌────────────────────────┐
│ Peer Review Fan-Out │
├────────────────────────┤
│ Drift (Mistral 7B) │──▶ Critique A
│ Pip (TinyLlama 1.1B) │──▶ Critique B
│ Lume (Llama 3.1 8B) │──▶ Critique C
└────────────────────────┘
│
▼
Aggregator (consensus logic)
│
▼
Final: original + flagged issues
Swarm Bot Roles
| Bot | Model | Role | Strengths |
|---|---|---|---|
| Drift | Mistral 7B | Methodical analyst | Structured reasoning, catches logical gaps |
| Pip | TinyLlama 1.1B | Fast checker | Quick sanity checks, low latency |
| Lume | Llama 3.1 8B | Deep thinker | Nuanced analysis, catches subtle issues |
Scripts
| Script | Purpose |
|---|---|
| scripts/peer-review.sh | Send single input to all models, collect critiques |
| scripts/peer-review-batch.sh | Run peer review across a corpus of samples |
| scripts/seed-test-corpus.sh | Generate seeded error corpus for testing |
Usage
# Single file review
bash scripts/peer-review.sh [output_dir]
# Batch review
bash scripts/peer-review-batch.sh [results_dir]
# Generate test corpus
bash scripts/seed-test-corpus.sh [count] [output_dir]
Scripts live at workspace/scripts/ — not bundled in skill to avoid duplication.
Critique Prompt Template
You are a skeptical reviewer. Analyze the following text for errors.
For each issue found, output JSON:
{"category": "factual|logical|missing|overconfidence|hallucinated_source",
"quote": "...", "issue": "...", "confidence": 0-100}
If no issues found, output: {"issues": []}
TEXT:
---
{cloud_output}
---
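Substituting the cloud model's output into the {cloud_output} placeholder can be sketched with bash parameter expansion, which treats the replacement text literally (no sed/awk escaping issues):

```shell
# Fill the {cloud_output} placeholder in a prompt template string.
render_prompt() {
  local template="$1" cloud_output="$2"
  printf '%s\n' "${template//\{cloud_output\}/$cloud_output}"
}
```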
Error Categories
| Category | Description | Example |
|---|---|---|
| factual | Wrong numbers, dates, names | "Bitcoin launched in 2010" |
| logical | Non-sequiturs, unsupported conclusions | "X is rising, therefore Y will fall" |
| missing | Important context omitted | Ignoring a major counterargument |
| overconfidence | Certainty without justification | "This will definitely happen" on 55% event |
| hallucinated_source | Citing nonexistent sources | "According to a 2024 Reuters report..." |
Discord Workflow
- Post analysis to #the-deep (or #swarm-lab)
- Drift, Pip, and Lume respond with independent critiques
- Celeste synthesizes: deduplicates flags, weights by model confidence
- If consensus (≥2 models agree) → flag is high-confidence
- Final output posted with recommendation: publish|revise|flag_for_human
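The consensus rule above can be sketched with jq, grouping flags by the quoted excerpt; the field names follow the critique schema, and using the quote as the grouping key is an assumption.

```shell
# stdin: one JSON flag per line, each with .model, .quote, .confidence.
# A flag is high-confidence when at least two models raise it.
consensus() {
  jq -s 'group_by(.quote)
         | map({quote: .[0].quote,
                models: [.[].model],
                agreement: length,
                high_confidence: (length >= 2)})'
}
```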
Success Criteria
| Outcome | TPR | FPR | Decision |
|---|---|---|---|
| Strong pass | ≥50% | <30% | Ship as default layer |
| Pass | ≥30% | <50% | Ship as opt-in layer |
| Marginal | 20–30% | 50–70% | Iterate on prompts, retest |
| Fail | <20% | >70% | Abandon approach |
Scoring Rules
- Flag = true positive if it identifies a real error (even if explanation is imperfect)
- Flag = false positive if flagged content is actually correct
- Duplicate flags across models count once for TPR but inform consensus metrics
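Under these scoring rules, the headline metrics reduce to simple ratios. A sketch, assuming TPR = caught seeded errors / total seeded errors and FPR = false flags / total flags raised:

```shell
# Compute TPR and FPR percentages from raw counts.
score() {
  local caught="$1" seeded="$2" false_flags="$3" total_flags="$4"
  awk -v c="$caught" -v s="$seeded" -v f="$false_flags" -v t="$total_flags" \
    'BEGIN { printf "TPR=%.0f%% FPR=%.0f%%\n", 100*c/s, 100*f/t }'
}
```

For example, catching 6 of 10 seeded errors with 3 false flags out of 12 total lands in the "Strong pass" band.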
Dependencies
- Ollama running locally with models pulled: mistral:7b, tinyllama:1.1b, llama3.1:8b
- jq and curl installed
- Results stored in experiments/peer-review-results/
Integration
When peer review passes validation:
- Package as Reef API endpoint: POST /review
- Agents call before publishing any analysis
- Configurable: model selection, consensus threshold, categories
- Log all reviews to #reef-logs with TPR tracking
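A hypothetical call shape for the proposed endpoint; the URL, field names, and defaults below are assumptions, since the API is not yet published.

```shell
# Build the review request body; the payload fields are illustrative.
build_review_request() {
  jq -n --arg text "$1" \
    '{text: $text,
      models: ["mistral:7b", "llama3.1:8b"],
      consensus_threshold: 2,
      categories: ["factual", "logical", "hallucinated_source"]}'
}

# Example (not executed here):
#   build_review_request "Analysis to validate..." |
#     curl -s -X POST https://reef.example/review \
#       -H 'Content-Type: application/json' -d @-
```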