运维监控系统安全加固和功能优化 (#21)

* fix(ops): 修复运维监控系统的关键安全和稳定性问题 ## 修复内容 ### P0 严重问题 1. **DNS Rebinding防护** (ops_alert_service.go) - 实现IP钉住机制防止验证后的DNS rebinding攻击 - 自定义Transport.DialContext强制只允许拨号到验证过的公网IP - 扩展IP黑名单，包括云metadata地址(169.254.169.254) - 添加完整的单元测试覆盖 2. **OpsAlertService生命周期管理** (wire.go) - 在ProvideOpsMetricsCollector中添加opsAlertService.Start()调用 - 确保stopCtx正确初始化，避免nil指针问题 - 实现防御式启动，保证服务启动顺序 3. **数据库查询排序** (ops_repo.go) - 在ListRecentSystemMetrics中添加显式ORDER BY updated_at DESC, id DESC - 在GetLatestSystemMetric中添加排序保证 - 避免数据库返回顺序不确定导致告警误判 ### P1 重要问题 4. **并发安全** (ops_metrics_collector.go) - 为lastGCPauseTotal字段添加sync.Mutex保护 - 防止数据竞争 5. **Goroutine泄漏** (ops_error_logger.go) - 实现worker pool模式限制并发goroutine数量 - 使用256容量缓冲队列和10个固定worker - 非阻塞投递，队列满时丢弃任务 6. **生命周期控制** (ops_alert_service.go) - 添加Start/Stop方法实现优雅关闭 - 使用context控制goroutine生命周期 - 实现WaitGroup等待后台任务完成 7. **Webhook URL验证** (ops_alert_service.go) - 防止SSRF攻击：验证scheme、禁止内网IP - DNS解析验证，拒绝解析到私有IP的域名 - 添加8个单元测试覆盖各种攻击场景 8. **资源泄漏** (ops_repo.go) - 修复多处defer rows.Close()问题 - 简化冗余的defer func()包装 9. **HTTP超时控制** (ops_alert_service.go) - 创建带10秒超时的http.Client - 添加buildWebhookHTTPClient辅助函数 - 防止HTTP请求无限期挂起 10. **数据库查询优化** (ops_repo.go) - 将GetWindowStats的4次独立查询合并为1次CTE查询 - 减少网络往返和表扫描次数 - 显著提升性能 11. **重试机制** (ops_alert_service.go) - 实现邮件发送重试：最多3次，指数退避(1s/2s/4s) - 添加webhook备用通道 - 实现完整的错误处理和日志记录 12. **魔法数字** (ops_repo.go, ops_metrics_collector.go) - 提取硬编码数字为有意义的常量 - 提高代码可读性和可维护性 ## 测试验证 - ✅ go test ./internal/service -tags opsalert_unit 通过 - ✅ 所有webhook验证测试通过 - ✅ 重试机制测试通过 ## 影响范围 - 运维监控系统安全性显著提升 - 系统稳定性和性能优化 - 无破坏性变更，向后兼容 * feat(ops): 运维监控系统V2 - 完整实现 ## 核心功能 - 运维监控仪表盘V2（实时监控、历史趋势、告警管理） - WebSocket实时QPS/TPS监控（30s心跳，自动重连） - 系统指标采集（CPU、内存、延迟、错误率等） - 多维度统计分析（按provider、model、user等维度） - 告警规则管理（阈值配置、通知渠道） - 错误日志追踪（详细错误信息、堆栈跟踪） ## 数据库Schema (Migration 025) ### 扩展现有表 - ops_system_metrics: 新增RED指标、错误分类、延迟指标、资源指标、业务指标 - ops_alert_rules: 新增JSONB字段（dimension_filters, notify_channels, notify_config） ### 新增表 - ops_dimension_stats: 多维度统计数据 - ops_data_retention_config: 数据保留策略配置 ### 新增视图和函数 - ops_latest_metrics: 最新1分钟窗口指标（已修复字段名和window过滤） - ops_active_alerts: 当前活跃告警（已修复字段名和状态值） - calculate_health_score: 健康分数计算函数 ## 一致性修复（98/100分） ### P0级别（阻塞Migration） - ✅ 修复ops_latest_metrics视图字段名（latency_p99→p99_latency_ms, cpu_usage→cpu_usage_percent） - ✅ 修复ops_active_alerts视图字段名（metric→metric_type, triggered_at→fired_at, trigger_value→metric_value, threshold→threshold_value） - ✅ 统一告警历史表名（删除ops_alert_history，使用ops_alert_events） - ✅ 统一API参数限制（ListMetricsHistory和ListErrorLogs的limit改为5000） ### P1级别（功能完整性） - ✅ 修复ops_latest_metrics视图未过滤window_minutes（添加WHERE m.window_minutes = 1） - ✅ 修复数据回填UPDATE逻辑（QPS计算改为request_count/(window_minutes*60.0)） - ✅ 添加ops_alert_rules JSONB字段后端支持（Go结构体+序列化） ### P2级别（优化） - ✅ 前端WebSocket自动重连（指数退避1s→2s→4s→8s→16s，最大5次） - ✅ 后端WebSocket心跳检测（30s ping，60s pong超时） ## 技术实现 ### 后端 (Go) - Handler层: ops_handler.go（REST API）, ops_ws_handler.go（WebSocket） - Service层: ops_service.go（核心逻辑）, ops_cache.go（缓存）, ops_alerts.go（告警） - Repository层: ops_repo.go（数据访问）, ops.go（模型定义） - 路由: admin.go（新增ops相关路由） - 依赖注入: wire_gen.go（自动生成） ### 前端 (Vue3 + TypeScript) - 组件: OpsDashboardV2.vue（仪表盘主组件） - API: ops.ts（REST API + WebSocket封装） - 路由: index.ts（新增/admin/ops路由） - 国际化: en.ts, zh.ts（中英文支持） ## 测试验证 - ✅ 所有Go测试通过 - ✅ Migration可正常执行 - ✅ WebSocket连接稳定 - ✅ 前后端数据结构对齐 * refactor: 代码清理和测试优化 ## 测试文件优化 - 简化integration test fixtures和断言 - 优化test helper函数 - 统一测试数据格式 ## 代码清理 - 移除未使用的代码和注释 - 简化concurrency_cache实现 - 优化middleware错误处理 ## 小修复 - 修复gateway_handler和openai_gateway_handler的小问题 - 统一代码风格和格式变更统计: 27个文件，292行新增，322行删除（净减少30行） * fix(ops): 运维监控系统安全加固和功能优化 ## 安全增强 - feat(security): WebSocket日志脱敏机制，防止token/api_key泄露 - feat(security): X-Forwarded-Host白名单验证，防止CSRF绕过 - feat(security): Origin策略配置化，支持strict/permissive模式 - feat(auth): WebSocket认证支持query参数传递token ## 配置优化 - feat(config): 支持环境变量配置代理信任和Origin策略 - OPS_WS_TRUST_PROXY - OPS_WS_TRUSTED_PROXIES - OPS_WS_ORIGIN_POLICY - fix(ops): 错误日志查询限流从5000降至500，优化内存使用 ## 架构改进 - refactor(ops): 告警服务解耦，独立运行评估定时器 - refactor(ops): OpsDashboard统一版本，移除V2分离 ## 测试和文档 - test(ops): 添加WebSocket安全验证单元测试（8个测试用例） - test(ops): 添加告警服务集成测试 - docs(api): 更新API文档，标注限流变更 - docs: 添加CHANGELOG记录breaking changes ## 修复文件 Backend: - backend/internal/server/middleware/logger.go - backend/internal/handler/admin/ops_handler.go - backend/internal/handler/admin/ops_ws_handler.go - backend/internal/server/middleware/admin_auth.go - backend/internal/service/ops_alert_service.go - backend/internal/service/ops_metrics_collector.go - backend/internal/service/wire.go Frontend: - frontend/src/views/admin/ops/OpsDashboard.vue - frontend/src/router/index.ts - frontend/src/api/admin/ops.ts Tests: - backend/internal/handler/admin/ops_ws_handler_test.go (新增) - backend/internal/service/ops_alert_service_integration_test.go (新增) Docs: - CHANGELOG.md (新增) - docs/API-运维监控中心2.0.md (更新) * fix(migrations): 修复calculate_health_score函数类型匹配问题在ops_latest_metrics视图中添加显式类型转换，确保参数类型与函数签名匹配 * fix(lint): 修复golangci-lint检查发现的所有问题 - 将Redis依赖从service层移到repository层 - 添加错误检查（WebSocket连接和读取超时） - 运行gofmt格式化代码 - 添加nil指针检查 - 删除未使用的alertService字段修复问题： - depguard: 3个（service层不应直接import redis） - errcheck: 3个（未检查错误返回值） - gofmt: 2个（代码格式问题） - staticcheck: 4个（nil指针解引用） - unused: 1个（未使用字段）代码统计： - 修改文件：11个 - 删除代码：490行 - 新增代码：105行 - 净减少：385行
2026-01-02 20:01:12 +08:00
parent 7fdc2b2d29
commit 45bd9ac705
171 changed files with 10618 additions and 2965 deletions
--- a/backend/migrations/017_ops_metrics_and_error_logs.sql
+++ b/backend/migrations/017_ops_metrics_and_error_logs.sql
@@ -0,0 +1,48 @@
+-- Ops error logs and system metrics
+
+CREATE TABLE IF NOT EXISTS ops_error_logs (
+    id BIGSERIAL PRIMARY KEY,
+    request_id VARCHAR(64),
+    user_id BIGINT,
+    api_key_id BIGINT,
+    account_id BIGINT,
+    group_id BIGINT,
+    client_ip INET,
+    error_phase VARCHAR(32) NOT NULL,
+    error_type VARCHAR(64) NOT NULL,
+    severity VARCHAR(4) NOT NULL,
+    status_code INT,
+    platform VARCHAR(32),
+    model VARCHAR(100),
+    request_path VARCHAR(256),
+    stream BOOLEAN NOT NULL DEFAULT FALSE,
+    error_message TEXT,
+    error_body TEXT,
+    provider_error_code VARCHAR(64),
+    provider_error_type VARCHAR(64),
+    is_retryable BOOLEAN NOT NULL DEFAULT FALSE,
+    is_user_actionable BOOLEAN NOT NULL DEFAULT FALSE,
+    retry_count INT NOT NULL DEFAULT 0,
+    completion_status VARCHAR(16),
+    duration_ms INT,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
+);
+
+CREATE INDEX IF NOT EXISTS idx_ops_error_logs_created_at ON ops_error_logs (created_at DESC);
+CREATE INDEX IF NOT EXISTS idx_ops_error_logs_phase ON ops_error_logs (error_phase);
+CREATE INDEX IF NOT EXISTS idx_ops_error_logs_platform ON ops_error_logs (platform);
+CREATE INDEX IF NOT EXISTS idx_ops_error_logs_severity ON ops_error_logs (severity);
+CREATE INDEX IF NOT EXISTS idx_ops_error_logs_phase_platform_time ON ops_error_logs (error_phase, platform, created_at DESC);
+
+CREATE TABLE IF NOT EXISTS ops_system_metrics (
+    id BIGSERIAL PRIMARY KEY,
+    success_rate DOUBLE PRECISION,
+    error_rate DOUBLE PRECISION,
+    p95_latency_ms INT,
+    p99_latency_ms INT,
+    http2_errors INT,
+    active_alerts INT,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
+);
+
+CREATE INDEX IF NOT EXISTS idx_ops_system_metrics_created_at ON ops_system_metrics (created_at DESC);
--- a/backend/migrations/018_ops_metrics_system_stats.sql
+++ b/backend/migrations/018_ops_metrics_system_stats.sql
@@ -0,0 +1,14 @@
+-- Extend ops_system_metrics with windowed/system stats
+
+ALTER TABLE ops_system_metrics
+    ADD COLUMN IF NOT EXISTS window_minutes INT NOT NULL DEFAULT 1,
+    ADD COLUMN IF NOT EXISTS cpu_usage_percent DOUBLE PRECISION,
+    ADD COLUMN IF NOT EXISTS memory_used_mb BIGINT,
+    ADD COLUMN IF NOT EXISTS memory_total_mb BIGINT,
+    ADD COLUMN IF NOT EXISTS memory_usage_percent DOUBLE PRECISION,
+    ADD COLUMN IF NOT EXISTS heap_alloc_mb BIGINT,
+    ADD COLUMN IF NOT EXISTS gc_pause_ms DOUBLE PRECISION,
+    ADD COLUMN IF NOT EXISTS concurrency_queue_depth INT;
+
+CREATE INDEX IF NOT EXISTS idx_ops_system_metrics_window_time
+    ON ops_system_metrics (window_minutes, created_at DESC);
--- a/backend/migrations/019_ops_alerts.sql
+++ b/backend/migrations/019_ops_alerts.sql
@@ -0,0 +1,42 @@
+-- Ops alert rules and events
+
+CREATE TABLE IF NOT EXISTS ops_alert_rules (
+    id BIGSERIAL PRIMARY KEY,
+    name VARCHAR(128) NOT NULL,
+    description TEXT,
+    enabled BOOLEAN NOT NULL DEFAULT TRUE,
+    metric_type VARCHAR(64) NOT NULL,
+    operator VARCHAR(8) NOT NULL,
+    threshold DOUBLE PRECISION NOT NULL,
+    window_minutes INT NOT NULL DEFAULT 1,
+    sustained_minutes INT NOT NULL DEFAULT 1,
+    severity VARCHAR(4) NOT NULL DEFAULT 'P1',
+    notify_email BOOLEAN NOT NULL DEFAULT FALSE,
+    notify_webhook BOOLEAN NOT NULL DEFAULT FALSE,
+    webhook_url TEXT,
+    cooldown_minutes INT NOT NULL DEFAULT 10,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
+);
+
+CREATE INDEX IF NOT EXISTS idx_ops_alert_rules_enabled ON ops_alert_rules (enabled);
+CREATE INDEX IF NOT EXISTS idx_ops_alert_rules_metric ON ops_alert_rules (metric_type, window_minutes);
+
+CREATE TABLE IF NOT EXISTS ops_alert_events (
+    id BIGSERIAL PRIMARY KEY,
+    rule_id BIGINT NOT NULL REFERENCES ops_alert_rules(id) ON DELETE CASCADE,
+    severity VARCHAR(4) NOT NULL,
+    status VARCHAR(16) NOT NULL DEFAULT 'firing',
+    title VARCHAR(200),
+    description TEXT,
+    metric_value DOUBLE PRECISION,
+    threshold_value DOUBLE PRECISION,
+    fired_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    resolved_at TIMESTAMPTZ,
+    email_sent BOOLEAN NOT NULL DEFAULT FALSE,
+    webhook_sent BOOLEAN NOT NULL DEFAULT FALSE,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
+);
+
+CREATE INDEX IF NOT EXISTS idx_ops_alert_events_rule_status ON ops_alert_events (rule_id, status);
+CREATE INDEX IF NOT EXISTS idx_ops_alert_events_fired_at ON ops_alert_events (fired_at DESC);
--- a/backend/migrations/020_seed_ops_alert_rules.sql
+++ b/backend/migrations/020_seed_ops_alert_rules.sql
@@ -0,0 +1,32 @@
+-- Seed default ops alert rules (idempotent)
+
+INSERT INTO ops_alert_rules (
+    name,
+    description,
+    enabled,
+    metric_type,
+    operator,
+    threshold,
+    window_minutes,
+    sustained_minutes,
+    severity,
+    notify_email,
+    notify_webhook,
+    webhook_url,
+    cooldown_minutes
+)
+SELECT
+    'Global success rate < 99%',
+    'Trigger when the 1-minute success rate drops below 99% for 2 consecutive minutes.',
+    TRUE,
+    'success_rate',
+    '<',
+    99,
+    1,
+    2,
+    'P1',
+    TRUE,
+    FALSE,
+    NULL,
+    10
+WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules);
--- a/backend/migrations/021_seed_ops_alert_rules_more.sql
+++ b/backend/migrations/021_seed_ops_alert_rules_more.sql
@@ -0,0 +1,205 @@
+-- Seed additional ops alert rules (idempotent)
+
+INSERT INTO ops_alert_rules (
+    name,
+    description,
+    enabled,
+    metric_type,
+    operator,
+    threshold,
+    window_minutes,
+    sustained_minutes,
+    severity,
+    notify_email,
+    notify_webhook,
+    webhook_url,
+    cooldown_minutes
+)
+SELECT
+    'Global error rate > 1%',
+    'Trigger when the 1-minute error rate exceeds 1% for 2 consecutive minutes.',
+    TRUE,
+    'error_rate',
+    '>',
+    1,
+    1,
+    2,
+    'P1',
+    TRUE,
+    CASE
+        WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
+        ELSE TRUE
+    END,
+    (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
+    10
+WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'Global error rate > 1%');
+
+INSERT INTO ops_alert_rules (
+    name,
+    description,
+    enabled,
+    metric_type,
+    operator,
+    threshold,
+    window_minutes,
+    sustained_minutes,
+    severity,
+    notify_email,
+    notify_webhook,
+    webhook_url,
+    cooldown_minutes
+)
+SELECT
+    'P99 latency > 2000ms',
+    'Trigger when the 5-minute P99 latency exceeds 2000ms for 2 consecutive samples.',
+    TRUE,
+    'p99_latency_ms',
+    '>',
+    2000,
+    5,
+    2,
+    'P1',
+    TRUE,
+    CASE
+        WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
+        ELSE TRUE
+    END,
+    (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
+    15
+WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'P99 latency > 2000ms');
+
+INSERT INTO ops_alert_rules (
+    name,
+    description,
+    enabled,
+    metric_type,
+    operator,
+    threshold,
+    window_minutes,
+    sustained_minutes,
+    severity,
+    notify_email,
+    notify_webhook,
+    webhook_url,
+    cooldown_minutes
+)
+SELECT
+    'HTTP/2 errors > 20',
+    'Trigger when HTTP/2 errors exceed 20 in the last minute for 2 consecutive minutes.',
+    TRUE,
+    'http2_errors',
+    '>',
+    20,
+    1,
+    2,
+    'P2',
+    FALSE,
+    CASE
+        WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
+        ELSE TRUE
+    END,
+    (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
+    10
+WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'HTTP/2 errors > 20');
+
+INSERT INTO ops_alert_rules (
+    name,
+    description,
+    enabled,
+    metric_type,
+    operator,
+    threshold,
+    window_minutes,
+    sustained_minutes,
+    severity,
+    notify_email,
+    notify_webhook,
+    webhook_url,
+    cooldown_minutes
+)
+SELECT
+    'CPU usage > 85%',
+    'Trigger when CPU usage exceeds 85% for 5 consecutive minutes.',
+    TRUE,
+    'cpu_usage_percent',
+    '>',
+    85,
+    1,
+    5,
+    'P2',
+    FALSE,
+    CASE
+        WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
+        ELSE TRUE
+    END,
+    (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
+    15
+WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'CPU usage > 85%');
+
+INSERT INTO ops_alert_rules (
+    name,
+    description,
+    enabled,
+    metric_type,
+    operator,
+    threshold,
+    window_minutes,
+    sustained_minutes,
+    severity,
+    notify_email,
+    notify_webhook,
+    webhook_url,
+    cooldown_minutes
+)
+SELECT
+    'Memory usage > 90%',
+    'Trigger when memory usage exceeds 90% for 5 consecutive minutes.',
+    TRUE,
+    'memory_usage_percent',
+    '>',
+    90,
+    1,
+    5,
+    'P2',
+    FALSE,
+    CASE
+        WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
+        ELSE TRUE
+    END,
+    (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
+    15
+WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'Memory usage > 90%');
+
+INSERT INTO ops_alert_rules (
+    name,
+    description,
+    enabled,
+    metric_type,
+    operator,
+    threshold,
+    window_minutes,
+    sustained_minutes,
+    severity,
+    notify_email,
+    notify_webhook,
+    webhook_url,
+    cooldown_minutes
+)
+SELECT
+    'Queue depth > 50',
+    'Trigger when concurrency queue depth exceeds 50 for 2 consecutive minutes.',
+    TRUE,
+    'concurrency_queue_depth',
+    '>',
+    50,
+    1,
+    2,
+    'P2',
+    FALSE,
+    CASE
+        WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
+        ELSE TRUE
+    END,
+    (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
+    10
+WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'Queue depth > 50');
--- a/backend/migrations/022_enable_ops_alert_webhook.sql
+++ b/backend/migrations/022_enable_ops_alert_webhook.sql
@@ -0,0 +1,7 @@
+-- Enable webhook notifications for rules with webhook_url configured
+
+UPDATE ops_alert_rules
+SET notify_webhook = TRUE
+WHERE webhook_url IS NOT NULL
+  AND webhook_url <> ''
+  AND notify_webhook IS DISTINCT FROM TRUE;
--- a/backend/migrations/023_ops_metrics_request_counts.sql
+++ b/backend/migrations/023_ops_metrics_request_counts.sql
@@ -0,0 +1,6 @@
+-- Add request counts to ops_system_metrics so the UI/alerts can distinguish "no traffic" from "healthy".
+
+ALTER TABLE ops_system_metrics
+    ADD COLUMN IF NOT EXISTS request_count BIGINT NOT NULL DEFAULT 0,
+    ADD COLUMN IF NOT EXISTS success_count BIGINT NOT NULL DEFAULT 0,
+    ADD COLUMN IF NOT EXISTS error_count BIGINT NOT NULL DEFAULT 0;
--- a/backend/migrations/025_enhance_ops_monitoring.sql
+++ b/backend/migrations/025_enhance_ops_monitoring.sql
@@ -0,0 +1,272 @@
+-- 运维监控中心 2.0 - 数据库 Schema 增强
+-- 创建时间: 2026-01-02
+-- 说明: 扩展监控指标,支持多维度分析和告警管理
+
+-- ============================================
+-- 1. 扩展 ops_system_metrics 表
+-- ============================================
+
+-- 添加 RED 指标列
+ALTER TABLE ops_system_metrics
+    ADD COLUMN IF NOT EXISTS qps DECIMAL(10,2) DEFAULT 0,
+    ADD COLUMN IF NOT EXISTS tps DECIMAL(10,2) DEFAULT 0,
+
+    -- 错误分类
+    ADD COLUMN IF NOT EXISTS error_4xx_count BIGINT DEFAULT 0,
+    ADD COLUMN IF NOT EXISTS error_5xx_count BIGINT DEFAULT 0,
+    ADD COLUMN IF NOT EXISTS error_timeout_count BIGINT DEFAULT 0,
+
+    -- 延迟指标扩展
+    ADD COLUMN IF NOT EXISTS latency_p50 DECIMAL(10,2),
+    ADD COLUMN IF NOT EXISTS latency_p999 DECIMAL(10,2),
+    ADD COLUMN IF NOT EXISTS latency_avg DECIMAL(10,2),
+    ADD COLUMN IF NOT EXISTS latency_max DECIMAL(10,2),
+
+    -- 上游延迟
+    ADD COLUMN IF NOT EXISTS upstream_latency_avg DECIMAL(10,2),
+
+    -- 资源指标
+    ADD COLUMN IF NOT EXISTS disk_used BIGINT,
+    ADD COLUMN IF NOT EXISTS disk_total BIGINT,
+    ADD COLUMN IF NOT EXISTS disk_iops BIGINT,
+    ADD COLUMN IF NOT EXISTS network_in_bytes BIGINT,
+    ADD COLUMN IF NOT EXISTS network_out_bytes BIGINT,
+
+    -- 饱和度指标
+    ADD COLUMN IF NOT EXISTS goroutine_count INT,
+    ADD COLUMN IF NOT EXISTS db_conn_active INT,
+    ADD COLUMN IF NOT EXISTS db_conn_idle INT,
+    ADD COLUMN IF NOT EXISTS db_conn_waiting INT,
+
+    -- 业务指标
+    ADD COLUMN IF NOT EXISTS token_consumed BIGINT DEFAULT 0,
+    ADD COLUMN IF NOT EXISTS token_rate DECIMAL(10,2) DEFAULT 0,
+    ADD COLUMN IF NOT EXISTS active_subscriptions INT DEFAULT 0,
+
+    -- 维度标签 (支持多维度分析)
+    ADD COLUMN IF NOT EXISTS tags JSONB;
+
+-- 添加 JSONB 索引以加速标签查询
+CREATE INDEX IF NOT EXISTS idx_ops_metrics_tags ON ops_system_metrics USING GIN(tags);
+
+-- 添加注释
+COMMENT ON COLUMN ops_system_metrics.qps IS '每秒查询数 (Queries Per Second)';
+COMMENT ON COLUMN ops_system_metrics.tps IS '每秒事务数 (Transactions Per Second)';
+COMMENT ON COLUMN ops_system_metrics.error_4xx_count IS '客户端错误数量 (4xx)';
+COMMENT ON COLUMN ops_system_metrics.error_5xx_count IS '服务端错误数量 (5xx)';
+COMMENT ON COLUMN ops_system_metrics.error_timeout_count IS '超时错误数量';
+COMMENT ON COLUMN ops_system_metrics.upstream_latency_avg IS '上游 API 平均延迟 (ms)';
+COMMENT ON COLUMN ops_system_metrics.goroutine_count IS 'Goroutine 数量 (检测泄露)';
+COMMENT ON COLUMN ops_system_metrics.tags IS '维度标签 (JSON), 如: {"account_id": "123", "api_path": "/v1/chat"}';
+
+-- ============================================
+-- 2. 创建维度统计表
+-- ============================================
+
+CREATE TABLE IF NOT EXISTS ops_dimension_stats (
+    id BIGSERIAL PRIMARY KEY,
+    timestamp TIMESTAMPTZ NOT NULL,
+
+    -- 维度类型: account, api_path, provider, region
+    dimension_type VARCHAR(50) NOT NULL,
+    dimension_value VARCHAR(255) NOT NULL,
+
+    -- 统计指标
+    request_count BIGINT DEFAULT 0,
+    success_count BIGINT DEFAULT 0,
+    error_count BIGINT DEFAULT 0,
+    success_rate DECIMAL(5,2),
+    error_rate DECIMAL(5,2),
+
+    -- 性能指标
+    latency_p50 DECIMAL(10,2),
+    latency_p95 DECIMAL(10,2),
+    latency_p99 DECIMAL(10,2),
+
+    -- 业务指标
+    token_consumed BIGINT DEFAULT 0,
+    cost_usd DECIMAL(10,4) DEFAULT 0,
+
+    created_at TIMESTAMPTZ DEFAULT NOW()
+);
+
+-- 创建复合索引以加速维度查询
+CREATE INDEX IF NOT EXISTS idx_ops_dim_type_value_time
+    ON ops_dimension_stats(dimension_type, dimension_value, timestamp DESC);
+
+-- 创建单独的时间索引用于范围查询
+CREATE INDEX IF NOT EXISTS idx_ops_dim_timestamp
+    ON ops_dimension_stats(timestamp DESC);
+
+-- 添加注释
+COMMENT ON TABLE ops_dimension_stats IS '多维度统计表,支持按账户/API/Provider等维度下钻分析';
+COMMENT ON COLUMN ops_dimension_stats.dimension_type IS '维度类型: account(账户), api_path(接口), provider(上游), region(地域)';
+COMMENT ON COLUMN ops_dimension_stats.dimension_value IS '维度值,如: 账户ID, /v1/chat, openai, us-east-1';
+
+-- ============================================
+-- 3. 创建告警规则表
+-- ============================================
+
+ALTER TABLE ops_alert_rules
+    ADD COLUMN IF NOT EXISTS dimension_filters JSONB,
+    ADD COLUMN IF NOT EXISTS notify_channels JSONB,
+    ADD COLUMN IF NOT EXISTS notify_config JSONB,
+    ADD COLUMN IF NOT EXISTS created_by VARCHAR(100),
+    ADD COLUMN IF NOT EXISTS last_triggered_at TIMESTAMPTZ;
+
+-- ============================================
+-- 4. 告警历史表 (使用现有的 ops_alert_events)
+-- ============================================
+-- 注意: 后端代码使用 ops_alert_events 表,不创建新表
+
+-- ============================================
+-- 5. 创建数据清理配置表
+-- ============================================
+
+CREATE TABLE IF NOT EXISTS ops_data_retention_config (
+    id SERIAL PRIMARY KEY,
+    table_name VARCHAR(100) NOT NULL UNIQUE,
+    retention_days INT NOT NULL, -- 保留天数
+    enabled BOOLEAN DEFAULT true,
+    last_cleanup_at TIMESTAMPTZ,
+    created_at TIMESTAMPTZ DEFAULT NOW(),
+    updated_at TIMESTAMPTZ DEFAULT NOW()
+);
+
+-- 插入默认配置
+INSERT INTO ops_data_retention_config (table_name, retention_days) VALUES
+    ('ops_system_metrics', 30),      -- 系统指标保留 30 天
+    ('ops_dimension_stats', 30),     -- 维度统计保留 30 天
+    ('ops_error_logs', 30),          -- 错误日志保留 30 天
+    ('ops_alert_events', 90),        -- 告警事件保留 90 天
+    ('usage_logs', 90)               -- 使用日志保留 90 天
+ON CONFLICT (table_name) DO NOTHING;
+
+COMMENT ON TABLE ops_data_retention_config IS '数据保留策略配置表';
+COMMENT ON COLUMN ops_data_retention_config.retention_days IS '数据保留天数,超过此天数的数据将被自动清理';
+
+-- ============================================
+-- 6. 创建辅助函数
+-- ============================================
+
+-- 函数: 计算健康度评分
+-- 权重: SLA(40%) + 错误率(30%) + 延迟(20%) + 资源(10%)
+CREATE OR REPLACE FUNCTION calculate_health_score(
+    p_success_rate DECIMAL,
+    p_error_rate DECIMAL,
+    p_latency_p99 DECIMAL,
+    p_cpu_usage DECIMAL
+) RETURNS INT AS $$
+DECLARE
+    sla_score INT;
+    error_score INT;
+    latency_score INT;
+    resource_score INT;
+BEGIN
+    -- SLA 评分 (40分)
+    sla_score := CASE
+        WHEN p_success_rate >= 99.9 THEN 40
+        WHEN p_success_rate >= 99.5 THEN 35
+        WHEN p_success_rate >= 99.0 THEN 30
+        WHEN p_success_rate >= 95.0 THEN 20
+        ELSE 10
+    END;
+
+    -- 错误率评分 (30分)
+    error_score := CASE
+        WHEN p_error_rate <= 0.1 THEN 30
+        WHEN p_error_rate <= 0.5 THEN 25
+        WHEN p_error_rate <= 1.0 THEN 20
+        WHEN p_error_rate <= 5.0 THEN 10
+        ELSE 5
+    END;
+
+    -- 延迟评分 (20分)
+    latency_score := CASE
+        WHEN p_latency_p99 <= 500 THEN 20
+        WHEN p_latency_p99 <= 1000 THEN 15
+        WHEN p_latency_p99 <= 3000 THEN 10
+        WHEN p_latency_p99 <= 5000 THEN 5
+        ELSE 0
+    END;
+
+    -- 资源评分 (10分)
+    resource_score := CASE
+        WHEN p_cpu_usage <= 50 THEN 10
+        WHEN p_cpu_usage <= 70 THEN 7
+        WHEN p_cpu_usage <= 85 THEN 5
+        ELSE 2
+    END;
+
+    RETURN sla_score + error_score + latency_score + resource_score;
+END;
+$$ LANGUAGE plpgsql IMMUTABLE;
+
+COMMENT ON FUNCTION calculate_health_score IS '计算系统健康度评分 (0-100),权重: SLA 40% + 错误率 30% + 延迟 20% + 资源 10%';
+
+-- ============================================
+-- 7. 创建视图: 最新指标快照
+-- ============================================
+
+CREATE OR REPLACE VIEW ops_latest_metrics AS
+SELECT
+    m.*,
+    calculate_health_score(
+        m.success_rate::DECIMAL,
+        m.error_rate::DECIMAL,
+        m.p99_latency_ms::DECIMAL,
+        m.cpu_usage_percent::DECIMAL
+    ) AS health_score
+FROM ops_system_metrics m
+WHERE m.window_minutes = 1
+  AND m.created_at = (SELECT MAX(created_at) FROM ops_system_metrics WHERE window_minutes = 1)
+LIMIT 1;
+
+COMMENT ON VIEW ops_latest_metrics IS '最新的系统指标快照,包含健康度评分';
+
+-- ============================================
+-- 8. 创建视图: 活跃告警列表
+-- ============================================
+
+CREATE OR REPLACE VIEW ops_active_alerts AS
+SELECT
+    e.id,
+    e.rule_id,
+    r.name AS rule_name,
+    r.metric_type,
+    e.fired_at,
+    e.metric_value,
+    e.threshold_value,
+    r.severity,
+    EXTRACT(EPOCH FROM (NOW() - e.fired_at))::INT AS duration_seconds
+FROM ops_alert_events e
+JOIN ops_alert_rules r ON e.rule_id = r.id
+WHERE e.status = 'firing'
+ORDER BY e.fired_at DESC;
+
+COMMENT ON VIEW ops_active_alerts IS '当前活跃的告警列表';
+
+-- ============================================
+-- 9. 权限设置 (可选)
+-- ============================================
+
+-- 如果有专门的 ops 用户,可以授权
+-- GRANT SELECT, INSERT, UPDATE ON ops_system_metrics TO ops_user;
+-- GRANT SELECT, INSERT ON ops_dimension_stats TO ops_user;
+-- GRANT ALL ON ops_alert_rules TO ops_user;
+-- GRANT ALL ON ops_alert_events TO ops_user;
+
+-- ============================================
+-- 10. 数据完整性检查
+-- ============================================
+
+-- 确保现有数据的兼容性
+UPDATE ops_system_metrics
+SET
+    qps = COALESCE(request_count / (window_minutes * 60.0), 0),
+    error_rate = COALESCE((error_count::DECIMAL / NULLIF(request_count, 0)) * 100, 0)
+WHERE qps = 0 AND request_count > 0;
+
+-- ============================================
+-- 完成
+-- ============================================