Health Check

Gateway provides a built-in health check endpoint for monitoring its runtime status and for integrating with external monitoring systems.

Health check endpoint

text
GET /health

bash
curl http://127.0.0.1:18789/health

Example response

json
{
  "status": "healthy",
  "version": "0.5.0",
  "uptime": 9240,
  "timestamp": "2025-01-15T13:04:00Z",
  "components": {
    "gateway": "healthy",
    "channels": "healthy",
    "sessions": "healthy",
    "storage": "healthy"
  }
}
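
A monitoring script can consume this response directly. The following Python sketch uses only the standard library. `summarize_health` is a hypothetical helper, not part of OpenClaw, and it assumes that `uptime` is measured in seconds and that `timestamp` is an ISO 8601 UTC time, as the sample suggests.

```python
import json
from datetime import datetime, timedelta

# Copy of the example response above.
SAMPLE = """{
  "status": "healthy",
  "version": "0.5.0",
  "uptime": 9240,
  "timestamp": "2025-01-15T13:04:00Z",
  "components": {"gateway": "healthy", "channels": "healthy",
                 "sessions": "healthy", "storage": "healthy"}
}"""


def summarize_health(body: str) -> dict:
    """Extract the overall status, failing components and inferred start time."""
    payload = json.loads(body)
    # Assumption: "timestamp" is ISO 8601 in UTC and "uptime" is in seconds.
    now = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
    failing = sorted(name for name, state in payload["components"].items()
                     if state != "healthy")
    return {
        "status": payload["status"],
        "failing": failing,
        "started_at": now - timedelta(seconds=payload["uptime"]),
    }


if __name__ == "__main__":
    summary = summarize_health(SAMPLE)
    print(summary["status"], summary["failing"], summary["started_at"].isoformat())
```

For the sample, 9240 seconds is 2 h 34 min, so the inferred start time is 10:30:00 UTC on the same day.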

Status codes

| HTTP status code | Meaning |
| --- | --- |
| 200 OK | Gateway is running normally |
| 503 Service Unavailable | Gateway is unhealthy or degraded |
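
Clients that call the endpoint from Python need to handle the 503 case explicitly, because `urllib` raises `HTTPError` for it. The sketch below is one way to do that with the standard library. `fetch_health` and `decode_payload` are hypothetical names, and whether a 503 response carries a JSON body is an assumption, so the decoder falls back to `None`.

```python
import json
import urllib.error
import urllib.request


def decode_payload(body: bytes):
    """Return the parsed JSON body, or None if it is not valid JSON."""
    try:
        return json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None


def fetch_health(url: str = "http://127.0.0.1:18789/health", timeout: float = 5.0):
    """Return (http_status, payload) for the health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses, but the body is still readable.
        code, body = err.code, err.read()
    return code, decode_payload(body)
```

A refused connection or a timeout raises `urllib.error.URLError` (or a timeout error) instead of returning a status, so callers should treat that case as a failed check too.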

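Status indicators

Overall status

| Status value | Description |
| --- | --- |
| healthy | All components are running normally |
| degraded | Some components are failing, but core functionality is available |
| unhealthy | Severe failure; the service cannot operate normally |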

Detailed status endpoint

text
GET /health/detailed

json
{
  "status": "degraded",
  "version": "0.5.0",
  "uptime": 9240,
  "components": {
    "gateway": {
      "status": "healthy",
      "pid": 42187,
      "memory": "85 MB",
      "cpu": "2.1%"
    },
    "channels": {
      "status": "degraded",
      "details": [
        {"name": "openai", "status": "healthy", "latency": "45ms"},
        {"name": "anthropic", "status": "unhealthy", "error": "connection timeout"}
      ]
    },
    "sessions": {
      "status": "healthy",
      "active": 3,
      "total": 156
    },
    "storage": {
      "status": "healthy",
      "diskFree": "45 GB",
      "dataSize": "128 MB"
    }
  }
}
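
The per-channel `details` list is what makes this endpoint useful for troubleshooting. As a minimal sketch, assuming the field names shown in the sample, the hypothetical helper `failing_channels` pulls out every Channel that is not healthy together with its reported error.

```python
import json


def failing_channels(payload: dict) -> list[tuple[str, str]]:
    """Return (name, error) pairs for channels not reporting "healthy"."""
    details = payload.get("components", {}).get("channels", {}).get("details", [])
    return [(ch["name"], ch.get("error", "unknown"))
            for ch in details if ch.get("status") != "healthy"]


# Trimmed copy of the sample response above.
SAMPLE = json.loads("""{
  "status": "degraded",
  "components": {
    "channels": {
      "status": "degraded",
      "details": [
        {"name": "openai", "status": "healthy", "latency": "45ms"},
        {"name": "anthropic", "status": "unhealthy", "error": "connection timeout"}
      ]
    }
  }
}""")

if __name__ == "__main__":
    for name, error in failing_channels(SAMPLE):
        print(f"{name}: {error}")
```

With the sample payload this reports the `anthropic` Channel and its `connection timeout` error.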

Degraded status

When one Channel is unavailable while the other Channels are working normally, the overall status is degraded rather than unhealthy.
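
The rule can be illustrated with a small aggregation function. This is a sketch of the described behavior, not Gateway's actual implementation. The function name `aggregate` is hypothetical, and treating "all Channels down" as unhealthy and "no Channels configured" as healthy are assumptions the text does not confirm.

```python
def aggregate(statuses: list[str]) -> str:
    """Combine per-channel statuses into a single status."""
    # Assumption: an empty channel list counts as healthy.
    if not statuses or all(s == "healthy" for s in statuses):
        return "healthy"
    if all(s == "unhealthy" for s in statuses):
        # Assumption: with no working channel left, the group is unhealthy.
        return "unhealthy"
    # Mixed results: at least one channel still works.
    return "degraded"


if __name__ == "__main__":
    print(aggregate(["healthy", "unhealthy"]))
```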

Metrics endpoint

Gateway exposes a Prometheus-compatible metrics endpoint:

text
GET /metrics

text
# HELP openclaw_gateway_uptime_seconds Gateway uptime in seconds
# TYPE openclaw_gateway_uptime_seconds gauge
openclaw_gateway_uptime_seconds 9240

# HELP openclaw_sessions_active Current active sessions
# TYPE openclaw_sessions_active gauge
openclaw_sessions_active 3

# HELP openclaw_requests_total Total requests processed
# TYPE openclaw_requests_total counter
openclaw_requests_total{channel="openai"} 1523
openclaw_requests_total{channel="anthropic"} 847

# HELP openclaw_request_duration_seconds Request duration histogram
# TYPE openclaw_request_duration_seconds histogram
openclaw_request_duration_seconds_bucket{le="0.1"} 500
openclaw_request_duration_seconds_bucket{le="1.0"} 1200
openclaw_request_duration_seconds_bucket{le="10.0"} 1500
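
For quick scripts that do not run a Prometheus server, the text output can be read with a few lines of Python. The parser below is a deliberately minimal sketch: `parse_metrics` is a hypothetical helper that handles only the simple `name{labels} value` lines shown above and skips anything else, such as samples with trailing timestamps or escaped label values.

```python
import re

# name, optional {labels}, then a single numeric value
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Excerpt of the sample output above.
SAMPLE = """\
# TYPE openclaw_requests_total counter
openclaw_requests_total{channel="openai"} 1523
openclaw_requests_total{channel="anthropic"} 847
openclaw_request_duration_seconds_bucket{le="1.0"} 1200
openclaw_request_duration_seconds_bucket{le="10.0"} 1500
"""


def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Return (metric_name, labels, value) for each simple sample line."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # unsupported syntax is ignored in this sketch
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    total = sum(v for name, _, v in parse_metrics(SAMPLE)
                if name == "openclaw_requests_total")
    print(f"total requests: {total:.0f}")
```

Summing the two `openclaw_requests_total` samples gives 1523 + 847 = 2370 requests across both Channels.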

Monitoring integration

Prometheus + Grafana

yaml
# prometheus.yml
scrape_configs:
  - job_name: 'openclaw-gateway'
    scrape_interval: 15s
    static_configs:
      - targets: ['127.0.0.1:18789']
    metrics_path: '/metrics'

Docker health check

yaml
# docker-compose.yml
services:
  openclaw:
    image: openclaw/gateway:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:18789/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

Kubernetes Probes

yaml
# k8s deployment
spec:
  containers:
    - name: openclaw
      livenessProbe:
        httpGet:
          path: /health
          port: 18789
        initialDelaySeconds: 10
        periodSeconds: 30
      readinessProbe:
        httpGet:
          path: /health/detailed
          port: 18789
        initialDelaySeconds: 5
        periodSeconds: 10

Alerting configuration

Simple alerting based on the health check

bash
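#!/bin/bash
# health-check.sh: a simple health check script
# --max-time keeps a hung Gateway from blocking the cron job; on timeout or
# connection failure curl reports 000, which also triggers the alert.
STATUS=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" http://127.0.0.1:18789/health)

if [ "$STATUS" != "200" ]; then
  echo "Gateway health check failed! Status: $STATUS" | \
    mail -s "OpenClaw Alert" admin@example.com
fi

bash
# crontab: run the check every minute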
* * * * * /path/to/health-check.sh

Prometheus alerting rules

yaml
# alert_rules.yml
groups:
  - name: openclaw
    rules:
      - alert: GatewayDown
        expr: up{job="openclaw-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw Gateway is down"

      - alert: ChannelUnhealthy
        expr: openclaw_channel_status{status="unhealthy"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Channel {{ $labels.channel }} is unhealthy"

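Alerting strategy

Set a critical alert on the Gateway process status (firing after 1 minute) and a warning alert on Channel status (firing after 5 minutes, which tolerates temporary fluctuations).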
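
The same "hold for a while before firing" idea can be applied in a custom checker that does not use Prometheus. The class below is a sketch that roughly mirrors the `for:` clause of an alerting rule. `PendingAlert` and its parameters are hypothetical names, and the caller supplies the clock value so the logic stays easy to test.

```python
from typing import Optional


class PendingAlert:
    """Fire only after the condition has held continuously for hold_seconds."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._failing_since: Optional[float] = None

    def update(self, failing: bool, now: float) -> bool:
        """Record one observation; return True when the alert should fire."""
        if not failing:
            self._failing_since = None  # any recovery resets the timer
            return False
        if self._failing_since is None:
            self._failing_since = now
        return now - self._failing_since >= self.hold_seconds


if __name__ == "__main__":
    channel_alert = PendingAlert(hold_seconds=300)  # warning after 5 minutes
    for t, failing in [(0, True), (240, True), (270, False), (300, True), (600, True)]:
        print(t, channel_alert.update(failing, now=t))
```

In this run the brief recovery at t=270 resets the timer, so the alert fires only at t=600, after the Channel has been failing for a full 300 seconds.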

CLI health check

bash
# Quick health check
openclaw status

# Detailed status
openclaw status --detailed

# JSON output
openclaw status --format json

Related documentation

Open source under the MIT License | Content translated from the official documentation and kept in sync