OpenClaw 自愈:AI 驱动的网关恢复 - Openclaw Skills

作者:互联网

2026-04-17

AI教程

什么是 OpenClaw 自愈系统?

OpenClaw 自愈系统是一款专业级基础设施工具,旨在确保 AI 网关的持续可用性。它在 Openclaw Skills 框架内运行,针对进程崩溃、配置错误和资源泄漏等常见运行故障提供弹性多层防御。

通过将传统的看门狗坚控与先进的 AI 诊断相结合,此技能可最大限度地减少停机时间并减少人工干预的需求。对于通过 Openclaw Skills 部署复杂 Agent 和服务且需要“一劳永逸”环境的开发者来说,它是一个关键组件。

下载入口:https://github.com/openclaw/skills/tree/main/skills/ramsbaby/openclaw-self-healing

安装与下载

1. ClawHub CLI

从源直接安装技能的最快方式。

npx clawhub@latest install openclaw-self-healing

2. 手动安装

将技能文件夹复制到以下位置之一

全局模式 ~/.openclaw/skills/ 工作区 /skills/

优先级:工作区 > 本地 > 内置

3. 提示词安装

将此提示词复制到 OpenClaw 即可自动安装。

请帮我使用 Clawhub 安装 openclaw-self-healing。如果尚未安装 Clawhub,请先安装(npm i -g clawhub)。

OpenClaw 自愈系统 应用场景

  • 在意外崩溃或 HTTP 失败后自动重启 OpenClaw 网关。
  • 在中断 Agent 运行前检测并解决过期的 OAuth 令牌。
  • 清理导致端口冲突或 PID 不匹配的孤儿僵尸进程。
  • 当自动重启失败时,执行 AI 辅助的配置模式故障排查。
  • 坚控 macOS 和 Linux 生产环境中的系统健康状况。
OpenClaw 自愈系统 工作原理
  1. L1 配置坚控脚本坚控配置文件更改,以触发即时验证和重新加载。
  2. L2 看门狗持续轮询网关健康检查 URL,以检测进程挂起或连接问题。
  3. 如果故障持续存在,系统将执行指数退避重启策略并清除僵尸进程。
  4. 在 30 分钟故障窗口后,L3 触发 Claude Code AI 医生执行深度诊断并应用逻辑修复。
  5. 状态更新和关键升级警报通过 L4 与 Discord 或 T@elegrimm 的集成推送。

OpenClaw 自愈系统 配置指南

要在 Openclaw Skills 环境中部署此恢复系统,请确保已安装 tmux、jq 和 Claude Code CLI。

运行主安装脚本:

bash <(curl -fsSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh)

或者通过 ClawHub 安装:

npx clawhub@latest install openclaw-self-healing

~/.openclaw/.env 中配置环境变量(例如 Webhook URL)。

OpenClaw 自愈系统 数据架构与分类体系

系统整理其运行数据和恢复日志,以确保透明度和可审计性:

组件 类型 描述
环境配置 .env 存储网关 URL、重试限制和通知 Webhook
健康指标 JSON 跟踪成功率、MTTR 和趋势故障模式
推理日志 Markdown L3 恢复期间 AI 诊断步骤的详细记录
进程坚控 tmux 活动恢复层的实时仪表板视图
name: openclaw-self-healing
version: 3.1.1
description: 4-tier autonomous self-healing and auto-recovery system for OpenClaw Gateway. Monitors gateway health, auto-restarts on crash, detects OAuth token expiry, kills zombie processes, and escalates to Claude Code AI for diagnosis when automated recovery fails. Use when your OpenClaw gateway crashes, stops responding, enters a restart loop, or needs automatic monitoring and recovery. Features watchdog, config validation, exponential backoff, Discord/T@elegrimm alerts. macOS & Linux.
metadata:
  {
    "openclaw":
      {
        "requires": { "bins": ["tmux", "claude", "jq"] },
        "install":
          [
            {
              "id": "tmux",
              "kind": "brew",
              "package": "tmux",
              "bins": ["tmux"],
              "label": "Install tmux (brew)",
            },
            {
              "id": "claude",
              "kind": "node",
              "package": "@anthropic-ai/claude-code",
              "bins": ["claude"],
              "label": "Install Claude Code CLI (npm)",
            },
            {
              "id": "jq",
              "kind": "brew",
              "package": "jq",
              "bins": ["jq"],
              "label": "Install jq (brew) - for metrics dashboard",
            },
          ],
      },
  }

OpenClaw Self-Healing System

"The system that heals itself — or calls for help when it can't."

A 4-tier autonomous recovery system for OpenClaw Gateway, featuring AI-powered diagnosis via Claude Code. Tested in production on macOS + Linux.

Architecture

Level 1: config-watch        → Config file change detection + instant reload
Level 2: Watchdog v4.4       → OAuth detection, zombie kill, exponential backoff
Level 3: Claude Code Doctor  → AI-powered diagnosis & repair (30 min window) ??
Level 4: Discord/T@elegrimm    → Human escalation with full context

What's New in v3.1.0

  • Complete healing chain fix — config-watch → Watchdog → Emergency Recovery now fully connected
  • Installer rewrite — single install.sh covers macOS (LaunchAgent) + Linux (systemd)
  • Watchdog v4.4 — OAuth token expiry detection, zombie process auto-kill, Exponential Backoff
  • Emergency Recovery v2 — persistent learning repo, reasoning logs, multi-model support (Claude Code + Aider)
  • Metrics dashboard — success rate, MTTR, trending analysis via tmux

Quick Setup

bash <(curl -fsSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh)

Or install via ClawHub:

npx clawhub@latest install openclaw-self-healing

The 4 Tiers in Detail

Level Script Trigger Action
L1 config-watch.sh Config file change Validate + reload gateway
L2 gateway-watchdog.sh Process down / HTTP fail Kill zombie → restart → backoff
L3 emergency-recovery-v2.sh 30min continuous failure Claude Code PTY diagnosis
L4 emergency-recovery-monitor.sh L3 triggered Discord + T@elegrimm alert

Configuration

All settings via environment variables in ~/.openclaw/.env:

Variable Default Description
DISCORD_WEBHOOK_URL (none) Discord webhook for L4 alerts
OPENCLAW_GATEWAY_URL http://localhost:18789/ Gateway health check URL
HEALTH_CHECK_MAX_RETRIES 3 Restart attempts before L3 escalation
EMERGENCY_RECOVERY_TIMEOUT 1800 Claude recovery timeout (30 min)

Verified Recovery Cases

  • OAuth token expiry — Watchdog v4.4 detects 401 in logs, restarts before agent dies
  • Zombie process — Preflight detects PID mismatch, SIGKILL + launchctl kickstart
  • Config schema erroropenclaw doctor --fix auto-applied on exit_1 pattern
  • Level 3 triggered — Claude Code diagnosed and fixed broken config in < 15 min
  • GitHub: https://github.com/Ramsbaby/openclaw-self-healing
  • Changelog: https://github.com/Ramsbaby/openclaw-self-healing/blob/main/CHANGELOG.md
  • Linux setup: https://github.com/Ramsbaby/openclaw-self-healing/blob/main/docs/LINUX_SETUP.md

License

MIT — built by @ramsbaby + Jarvis ??

相关推荐