* fix(ops): Fix critical security and stability issues in the ops monitoring system
## Fixes
### P0 critical issues
1. **DNS rebinding protection** (ops_alert_service.go)
- Pin the resolved IP to prevent post-validation DNS rebinding attacks
- Custom Transport.DialContext that only dials the validated public IP (see the sketch after this list)
- Extended IP blocklist, including the cloud metadata address (169.254.169.254)
- Full unit test coverage
2. **OpsAlertService lifecycle management** (wire.go)
- Call opsAlertService.Start() in ProvideOpsMetricsCollector
- Ensure stopCtx is initialized correctly to avoid nil-pointer issues
- Defensive startup to guarantee service start order
3. **Database query ordering** (ops_repo.go)
- Add an explicit ORDER BY updated_at DESC, id DESC in ListRecentSystemMetrics
- Add the same ordering guarantee in GetLatestSystemMetric
- Prevent nondeterministic row order from causing false alerts
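A minimal sketch of the IP-pinning idea from item 1, assuming the webhook host has already been resolved and validated as a public IP; `buildPinnedClient` and its parameters are illustrative names, not the actual code in ops_alert_service.go:

```go
package main

import (
    "context"
    "net"
    "net/http"
    "time"
)

// buildPinnedClient returns an http.Client whose transport ignores the hostname
// at dial time and always connects to the IP validated up front, so a second
// DNS resolution cannot redirect the request to an internal address.
func buildPinnedClient(validatedIP net.IP, port string) *http.Client {
    dialer := &net.Dialer{Timeout: 5 * time.Second}
    transport := &http.Transport{
        DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
            // addr carries the original "host:port"; dial the pinned IP instead.
            return dialer.DialContext(ctx, network, net.JoinHostPort(validatedIP.String(), port))
        },
    }
    return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}
```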
### P1 important issues
4. **Concurrency safety** (ops_metrics_collector.go)
- Guard the lastGCPauseTotal field with a sync.Mutex
- Prevents data races
5. **Goroutine leak** (ops_error_logger.go)
- Worker pool pattern to bound the number of concurrent goroutines
- Buffered queue with capacity 256 and 10 fixed workers
- Non-blocking submission; tasks are dropped when the queue is full (see the sketch after this list)
6. **Lifecycle control** (ops_alert_service.go)
- Add Start/Stop methods for graceful shutdown
- Use context to control goroutine lifetimes
- Use a WaitGroup to wait for background tasks to finish
7. **Webhook URL validation** (ops_alert_service.go)
- Prevent SSRF: validate the scheme and reject internal-network IPs
- DNS resolution check that rejects hostnames resolving to private IPs
- 8 unit tests covering the various attack scenarios
8. **Resource leaks** (ops_repo.go)
- Fix several defer rows.Close() issues
- Simplify redundant defer func() wrappers
9. **HTTP timeout control** (ops_alert_service.go)
- Create an http.Client with a 10-second timeout
- Add a buildWebhookHTTPClient helper
- Prevent HTTP requests from hanging indefinitely
10. **Database query optimization** (ops_repo.go)
- Merge the 4 separate queries in GetWindowStats into a single CTE query
- Fewer network round trips and table scans
- Significant performance improvement
11. **Retry mechanism** (ops_alert_service.go)
- Email delivery retry: up to 3 retries with exponential backoff (1s/2s/4s), sketched below
- Webhook fallback channel
- Complete error handling and logging
12. **Magic numbers** (ops_repo.go, ops_metrics_collector.go)
- Extract hard-coded numbers into named constants
- Improves readability and maintainability
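Minimal sketches of two of the patterns above, under stated assumptions: the identifiers (`pool`, `submit`, `sendWithRetry`, `sendEmail`) are illustrative, not the actual names in ops_error_logger.go or ops_alert_service.go.

```go
package main

import (
    "log"
    "time"
)

type task func()

// pool bounds concurrency: a fixed number of workers drain a buffered queue,
// so the number of goroutines never grows with load (item 5).
type pool struct {
    queue chan task
}

func newPool(workers, queueSize int) *pool {
    p := &pool{queue: make(chan task, queueSize)}
    for i := 0; i < workers; i++ {
        go func() {
            for t := range p.queue {
                t()
            }
        }()
    }
    return p
}

// submit is non-blocking: when the queue is full the task is dropped instead
// of blocking the request path.
func (p *pool) submit(t task) {
    select {
    case p.queue <- t:
    default:
        log.Println("ops error log queue full, dropping task")
    }
}

// sendWithRetry retries a failed send up to 3 more times, sleeping 1s, 2s and
// 4s between attempts (item 11). sendEmail stands in for the real sender.
func sendWithRetry(sendEmail func() error) error {
    var err error
    if err = sendEmail(); err == nil {
        return nil
    }
    backoff := time.Second
    for i := 0; i < 3; i++ {
        time.Sleep(backoff)
        if err = sendEmail(); err == nil {
            return nil
        }
        backoff *= 2
    }
    return err
}
```

In this sketch the ops error logger would be created once as `newPool(10, 256)` to match the numbers above.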
## Test verification
- ✅ go test ./internal/service -tags opsalert_unit passes
- ✅ All webhook validation tests pass
- ✅ Retry mechanism tests pass
## Impact
- Significantly improved security of the ops monitoring system
- Better system stability and performance
- No breaking changes; backward compatible
* feat(ops): Ops monitoring system V2 - full implementation
## Core features
- Ops dashboard V2 (real-time monitoring, historical trends, alert management)
- Real-time QPS/TPS monitoring over WebSocket (30s heartbeat, automatic reconnection)
- System metrics collection (CPU, memory, latency, error rate, etc.)
- Multi-dimensional statistics (by provider, model, user, etc.)
- Alert rule management (threshold configuration, notification channels)
- Error log tracing (detailed error information, stack traces)
## Database schema (Migration 025)
### Extended existing tables
- ops_system_metrics: added RED metrics, error classification, latency metrics, resource metrics, and business metrics
- ops_alert_rules: added JSONB columns (dimension_filters, notify_channels, notify_config)
### New tables
- ops_dimension_stats: multi-dimensional statistics
- ops_data_retention_config: data retention policy configuration
### New views and functions
- ops_latest_metrics: latest 1-minute-window metrics (column names and window filtering fixed)
- ops_active_alerts: currently active alerts (column names and status values fixed)
- calculate_health_score: health score calculation function
## Consistency fixes (98/100)
### P0 (migration blockers)
- ✅ Fixed ops_latest_metrics view column names (latency_p99→p99_latency_ms, cpu_usage→cpu_usage_percent)
- ✅ Fixed ops_active_alerts view column names (metric→metric_type, triggered_at→fired_at, trigger_value→metric_value, threshold→threshold_value)
- ✅ Unified the alert history table name (dropped ops_alert_history, use ops_alert_events)
- ✅ Unified API parameter limits (limit for ListMetricsHistory and ListErrorLogs changed to 5000)
### P1 (functional completeness)
- ✅ Fixed ops_latest_metrics view not filtering on window_minutes (added WHERE m.window_minutes = 1)
- ✅ Fixed the backfill UPDATE logic (QPS is now computed as request_count/(window_minutes*60.0))
- ✅ Added backend support for the ops_alert_rules JSONB columns (Go structs + serialization)
### P2 (optimizations)
- ✅ Frontend WebSocket auto-reconnect (exponential backoff 1s→2s→4s→8s→16s, max 5 attempts)
- ✅ Backend WebSocket heartbeat (30s ping, 60s pong timeout); a server-side sketch follows this list
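A minimal sketch of the server-side heartbeat, assuming the gorilla/websocket package; the actual ops_ws_handler.go may be structured differently:

```go
package main

import (
    "time"

    "github.com/gorilla/websocket"
)

const (
    pingInterval = 30 * time.Second // send a ping every 30s
    pongWait     = 60 * time.Second // give up if no pong arrives within 60s
)

// keepAlive pings the peer on a ticker and extends the read deadline whenever
// a pong comes back; once the peer stops answering, reads time out and the
// caller can close the connection.
func keepAlive(conn *websocket.Conn, done <-chan struct{}) {
    _ = conn.SetReadDeadline(time.Now().Add(pongWait))
    conn.SetPongHandler(func(string) error {
        return conn.SetReadDeadline(time.Now().Add(pongWait))
    })

    ticker := time.NewTicker(pingInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(5*time.Second)); err != nil {
                return
            }
        case <-done:
            return
        }
    }
}
```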
## Technical implementation
### Backend (Go)
- Handler layer: ops_handler.go (REST API), ops_ws_handler.go (WebSocket)
- Service layer: ops_service.go (core logic), ops_cache.go (caching), ops_alerts.go (alerts)
- Repository layer: ops_repo.go (data access), ops.go (model definitions)
- Routing: admin.go (new ops routes)
- Dependency injection: wire_gen.go (auto-generated)
### Frontend (Vue 3 + TypeScript)
- Component: OpsDashboardV2.vue (main dashboard component)
- API: ops.ts (REST API + WebSocket wrapper)
- Routing: index.ts (new /admin/ops route)
- i18n: en.ts, zh.ts (English and Chinese)
## Test verification
- ✅ All Go tests pass
- ✅ Migration runs cleanly
- ✅ WebSocket connections are stable
- ✅ Frontend and backend data structures are aligned
* refactor: Code cleanup and test optimization
## Test file optimization
- Simplified integration test fixtures and assertions
- Improved test helper functions
- Unified test data format
## Code cleanup
- Removed unused code and comments
- Simplified the concurrency_cache implementation
- Improved middleware error handling
## Minor fixes
- Fixed small issues in gateway_handler and openai_gateway_handler
- Unified code style and formatting
Change stats: 27 files, 292 insertions, 322 deletions (net reduction of 30 lines)
* fix(ops): Security hardening and functional improvements for the ops monitoring system
## Security enhancements
- feat(security): WebSocket log redaction to prevent token/api_key leakage (see the sketch after this list)
- feat(security): X-Forwarded-Host whitelist validation to prevent CSRF bypass
- feat(security): Configurable Origin policy with strict/permissive modes
- feat(auth): WebSocket authentication supports passing the token as a query parameter
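A minimal sketch of the kind of log redaction described above, assuming simple regex-based masking; the key names and pattern are illustrative, not the exact rules in middleware/logger.go:

```go
package main

import (
    "fmt"
    "regexp"
)

// sensitivePattern matches token / api_key style values in query strings or
// JSON bodies. The key names and formats here are assumptions for illustration.
var sensitivePattern = regexp.MustCompile(`(?i)(token|api_key|apikey)(["']?\s*[:=]\s*["']?)([^&"'\s]+)`)

// redact masks the value portion so log lines never carry usable credentials.
func redact(s string) string {
    return sensitivePattern.ReplaceAllString(s, `$1$2***`)
}

func main() {
    fmt.Println(redact(`GET /admin/ops/ws?token=sk-abc123`))       // ...?token=***
    fmt.Println(redact(`{"api_key": "secret-value", "id": "u1"}`)) // "api_key": "***"
}
```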
## Configuration improvements
- feat(config): Proxy trust and Origin policy configurable via environment variables
- OPS_WS_TRUST_PROXY
- OPS_WS_TRUSTED_PROXIES
- OPS_WS_ORIGIN_POLICY
- fix(ops): Error log query limit lowered from 5000 to 500 to reduce memory use
## Architecture improvements
- refactor(ops): Decoupled the alert service; it now runs its own evaluation timer
- refactor(ops): Unified OpsDashboard into a single version, removed the separate V2
## Tests and docs
- test(ops): Added WebSocket security validation unit tests (8 test cases)
- test(ops): Added alert service integration tests
- docs(api): Updated the API docs to note the rate-limit change
- docs: Added a CHANGELOG entry recording the breaking changes
## Files changed
Backend:
- backend/internal/server/middleware/logger.go
- backend/internal/handler/admin/ops_handler.go
- backend/internal/handler/admin/ops_ws_handler.go
- backend/internal/server/middleware/admin_auth.go
- backend/internal/service/ops_alert_service.go
- backend/internal/service/ops_metrics_collector.go
- backend/internal/service/wire.go
Frontend:
- frontend/src/views/admin/ops/OpsDashboard.vue
- frontend/src/router/index.ts
- frontend/src/api/admin/ops.ts
Tests:
- backend/internal/handler/admin/ops_ws_handler_test.go (new)
- backend/internal/service/ops_alert_service_integration_test.go (new)
Docs:
- CHANGELOG.md (new)
- docs/API-运维监控中心2.0.md (updated)
* fix(migrations): Fix the type mismatch in the calculate_health_score function
Add explicit type casts in the ops_latest_metrics view so the argument types match the function signature
* fix(lint): Fix all issues reported by golangci-lint
- Moved the Redis dependency from the service layer to the repository layer
- Added error checks (WebSocket connection and read timeouts)
- Ran gofmt over the code
- Added nil-pointer checks
- Removed the unused alertService field
Issues fixed:
- depguard: 3 (the service layer must not import redis directly)
- errcheck: 3 (unchecked error return values)
- gofmt: 2 (formatting issues)
- staticcheck: 4 (nil-pointer dereferences)
- unused: 1 (unused field)
Code stats:
- Files changed: 11
- Lines removed: 490
- Lines added: 105
- Net reduction: 385 lines
ops_repo.go (Go, 1334 lines, 30 KiB):
package repository

import (
    "context"
    "database/sql"
    "encoding/json"
    "errors"
    "fmt"
    "math"
    "strings"
    "time"

    dbent "github.com/Wei-Shaw/sub2api/ent"
    "github.com/Wei-Shaw/sub2api/internal/service"
    "github.com/redis/go-redis/v9"
)

const (
    DefaultWindowMinutes = 1

    MaxErrorLogsLimit     = 500
    DefaultErrorLogsLimit = 200

    MaxRecentSystemMetricsLimit     = 500
    DefaultRecentSystemMetricsLimit = 60

    MaxMetricsLimit     = 5000
    DefaultMetricsLimit = 300
)

// OpsRepository provides SQL- and Redis-backed access to ops monitoring data.
type OpsRepository struct {
    sql sqlExecutor
    rdb *redis.Client
}

// NewOpsRepository wires the repository against the shared *sql.DB and Redis client.
func NewOpsRepository(_ *dbent.Client, sqlDB *sql.DB, rdb *redis.Client) service.OpsRepository {
    return &OpsRepository{sql: sqlDB, rdb: rdb}
}

// CreateErrorLog inserts a single ops error log row and backfills its ID and created_at.
func (r *OpsRepository) CreateErrorLog(ctx context.Context, log *service.OpsErrorLog) error {
    if log == nil {
        return nil
    }

    createdAt := log.CreatedAt
    if createdAt.IsZero() {
        createdAt = time.Now()
    }

    query := `
        INSERT INTO ops_error_logs (
            request_id,
            user_id,
            api_key_id,
            account_id,
            group_id,
            client_ip,
            error_phase,
            error_type,
            severity,
            status_code,
            platform,
            model,
            request_path,
            stream,
            error_message,
            duration_ms,
            created_at
        ) VALUES (
            $1, $2, $3, $4, $5,
            $6, $7, $8, $9, $10,
            $11, $12, $13, $14, $15,
            $16, $17
        )
        RETURNING id, created_at
    `

    requestID := nullString(log.RequestID)
    clientIP := nullString(log.ClientIP)
    platform := nullString(log.Platform)
    model := nullString(log.Model)
    requestPath := nullString(log.RequestPath)
    message := nullString(log.Message)
    latency := nullInt(log.LatencyMs)

    args := []any{
        requestID,
        nullInt64(log.UserID),
        nullInt64(log.APIKeyID),
        nullInt64(log.AccountID),
        nullInt64(log.GroupID),
        clientIP,
        log.Phase,
        log.Type,
        log.Severity,
        log.StatusCode,
        platform,
        model,
        requestPath,
        log.Stream,
        message,
        latency,
        createdAt,
    }

    if err := scanSingleRow(ctx, r.sql, query, args, &log.ID, &log.CreatedAt); err != nil {
        return err
    }
    return nil
}

// ListErrorLogsLegacy returns error logs matching the given filters, newest first.
func (r *OpsRepository) ListErrorLogsLegacy(ctx context.Context, filters service.OpsErrorLogFilters) ([]service.OpsErrorLog, error) {
    conditions := make([]string, 0)
    args := make([]any, 0)

    addCondition := func(condition string, values ...any) {
        conditions = append(conditions, condition)
        args = append(args, values...)
    }

    if filters.StartTime != nil {
        addCondition(fmt.Sprintf("created_at >= $%d", len(args)+1), *filters.StartTime)
    }
    if filters.EndTime != nil {
        addCondition(fmt.Sprintf("created_at <= $%d", len(args)+1), *filters.EndTime)
    }
    if filters.Platform != "" {
        addCondition(fmt.Sprintf("platform = $%d", len(args)+1), filters.Platform)
    }
    if filters.Phase != "" {
        addCondition(fmt.Sprintf("error_phase = $%d", len(args)+1), filters.Phase)
    }
    if filters.Severity != "" {
        addCondition(fmt.Sprintf("severity = $%d", len(args)+1), filters.Severity)
    }
    if filters.Query != "" {
        like := "%" + strings.ToLower(filters.Query) + "%"
        startIdx := len(args) + 1
        addCondition(
            fmt.Sprintf("(LOWER(request_id) LIKE $%d OR LOWER(model) LIKE $%d OR LOWER(error_message) LIKE $%d OR LOWER(error_type) LIKE $%d)",
                startIdx, startIdx+1, startIdx+2, startIdx+3,
            ),
            like, like, like, like,
        )
    }

    limit := filters.Limit
    if limit <= 0 || limit > MaxErrorLogsLimit {
        limit = DefaultErrorLogsLimit
    }

    where := ""
    if len(conditions) > 0 {
        where = "WHERE " + strings.Join(conditions, " AND ")
    }

    query := fmt.Sprintf(`
        SELECT
            id,
            created_at,
            user_id,
            api_key_id,
            account_id,
            group_id,
            client_ip,
            error_phase,
            error_type,
            severity,
            status_code,
            platform,
            model,
            request_path,
            stream,
            duration_ms,
            request_id,
            error_message
        FROM ops_error_logs
        %s
        ORDER BY created_at DESC
        LIMIT $%d
    `, where, len(args)+1)

    args = append(args, limit)

    rows, err := r.sql.QueryContext(ctx, query, args...)
    if err != nil {
        return nil, err
    }
    defer func() { _ = rows.Close() }()

    results := make([]service.OpsErrorLog, 0)
    for rows.Next() {
        logEntry, err := scanOpsErrorLog(rows)
        if err != nil {
            return nil, err
        }
        results = append(results, *logEntry)
    }
    if err := rows.Err(); err != nil {
        return nil, err
    }
    return results, nil
}

// GetLatestSystemMetric returns the most recent system metric row for the default window.
func (r *OpsRepository) GetLatestSystemMetric(ctx context.Context) (*service.OpsMetrics, error) {
    query := `
        SELECT
            window_minutes,
            request_count,
            success_count,
            error_count,
            success_rate,
            error_rate,
            p95_latency_ms,
            p99_latency_ms,
            http2_errors,
            active_alerts,
            cpu_usage_percent,
            memory_used_mb,
            memory_total_mb,
            memory_usage_percent,
            heap_alloc_mb,
            gc_pause_ms,
            concurrency_queue_depth,
            created_at AS updated_at
        FROM ops_system_metrics
        WHERE window_minutes = $1
        ORDER BY updated_at DESC, id DESC
        LIMIT 1
    `

    var windowMinutes sql.NullInt64
    var requestCount, successCount, errorCount sql.NullInt64
    var successRate, errorRate sql.NullFloat64
    var p95Latency, p99Latency, http2Errors, activeAlerts sql.NullInt64
    var cpuUsage, memoryUsage, gcPause sql.NullFloat64
    var memoryUsed, memoryTotal, heapAlloc, queueDepth sql.NullInt64
    var createdAt time.Time
    if err := scanSingleRow(
        ctx,
        r.sql,
        query,
        []any{DefaultWindowMinutes},
        &windowMinutes,
        &requestCount,
        &successCount,
        &errorCount,
        &successRate,
        &errorRate,
        &p95Latency,
        &p99Latency,
        &http2Errors,
        &activeAlerts,
        &cpuUsage,
        &memoryUsed,
        &memoryTotal,
        &memoryUsage,
        &heapAlloc,
        &gcPause,
        &queueDepth,
        &createdAt,
    ); err != nil {
        return nil, err
    }

    metric := &service.OpsMetrics{
        UpdatedAt: createdAt,
    }
    if windowMinutes.Valid {
        metric.WindowMinutes = int(windowMinutes.Int64)
    }
    if requestCount.Valid {
        metric.RequestCount = requestCount.Int64
    }
    if successCount.Valid {
        metric.SuccessCount = successCount.Int64
    }
    if errorCount.Valid {
        metric.ErrorCount = errorCount.Int64
    }
    if successRate.Valid {
        metric.SuccessRate = successRate.Float64
    }
    if errorRate.Valid {
        metric.ErrorRate = errorRate.Float64
    }
    if p95Latency.Valid {
        metric.P95LatencyMs = int(p95Latency.Int64)
    }
    if p99Latency.Valid {
        metric.P99LatencyMs = int(p99Latency.Int64)
    }
    if http2Errors.Valid {
        metric.HTTP2Errors = int(http2Errors.Int64)
    }
    if activeAlerts.Valid {
        metric.ActiveAlerts = int(activeAlerts.Int64)
    }
    if cpuUsage.Valid {
        metric.CPUUsagePercent = cpuUsage.Float64
    }
    if memoryUsed.Valid {
        metric.MemoryUsedMB = memoryUsed.Int64
    }
    if memoryTotal.Valid {
        metric.MemoryTotalMB = memoryTotal.Int64
    }
    if memoryUsage.Valid {
        metric.MemoryUsagePercent = memoryUsage.Float64
    }
    if heapAlloc.Valid {
        metric.HeapAllocMB = heapAlloc.Int64
    }
    if gcPause.Valid {
        metric.GCPauseMs = gcPause.Float64
    }
    if queueDepth.Valid {
        metric.ConcurrencyQueueDepth = int(queueDepth.Int64)
    }
    return metric, nil
}

// CreateSystemMetric persists one aggregated system metrics window.
func (r *OpsRepository) CreateSystemMetric(ctx context.Context, metric *service.OpsMetrics) error {
    if metric == nil {
        return nil
    }
    createdAt := metric.UpdatedAt
    if createdAt.IsZero() {
        createdAt = time.Now()
    }
    windowMinutes := metric.WindowMinutes
    if windowMinutes <= 0 {
        windowMinutes = DefaultWindowMinutes
    }

    query := `
        INSERT INTO ops_system_metrics (
            window_minutes,
            request_count,
            success_count,
            error_count,
            success_rate,
            error_rate,
            p95_latency_ms,
            p99_latency_ms,
            http2_errors,
            active_alerts,
            cpu_usage_percent,
            memory_used_mb,
            memory_total_mb,
            memory_usage_percent,
            heap_alloc_mb,
            gc_pause_ms,
            concurrency_queue_depth,
            created_at
        ) VALUES (
            $1, $2, $3, $4, $5, $6, $7, $8, $9, $10,
            $11, $12, $13, $14, $15, $16, $17, $18
        )
    `
    _, err := r.sql.ExecContext(ctx, query,
        windowMinutes,
        metric.RequestCount,
        metric.SuccessCount,
        metric.ErrorCount,
        metric.SuccessRate,
        metric.ErrorRate,
        metric.P95LatencyMs,
        metric.P99LatencyMs,
        metric.HTTP2Errors,
        metric.ActiveAlerts,
        metric.CPUUsagePercent,
        metric.MemoryUsedMB,
        metric.MemoryTotalMB,
        metric.MemoryUsagePercent,
        metric.HeapAllocMB,
        metric.GCPauseMs,
        metric.ConcurrencyQueueDepth,
        createdAt,
    )
    return err
}

// ListRecentSystemMetrics returns the most recent metric rows for a window size, newest first.
func (r *OpsRepository) ListRecentSystemMetrics(ctx context.Context, windowMinutes, limit int) ([]service.OpsMetrics, error) {
    if windowMinutes <= 0 {
        windowMinutes = DefaultWindowMinutes
    }
    if limit <= 0 || limit > MaxRecentSystemMetricsLimit {
        limit = DefaultRecentSystemMetricsLimit
    }

    query := `
        SELECT
            window_minutes,
            request_count,
            success_count,
            error_count,
            success_rate,
            error_rate,
            p95_latency_ms,
            p99_latency_ms,
            http2_errors,
            active_alerts,
            cpu_usage_percent,
            memory_used_mb,
            memory_total_mb,
            memory_usage_percent,
            heap_alloc_mb,
            gc_pause_ms,
            concurrency_queue_depth,
            created_at AS updated_at
        FROM ops_system_metrics
        WHERE window_minutes = $1
        ORDER BY updated_at DESC, id DESC
        LIMIT $2
    `

    rows, err := r.sql.QueryContext(ctx, query, windowMinutes, limit)
    if err != nil {
        return nil, err
    }
    defer func() { _ = rows.Close() }()

    results := make([]service.OpsMetrics, 0)
    for rows.Next() {
        metric, err := scanOpsSystemMetric(rows)
        if err != nil {
            return nil, err
        }
        results = append(results, *metric)
    }
    if err := rows.Err(); err != nil {
        return nil, err
    }
    return results, nil
}

// ListSystemMetricsRange returns metric rows inside [startTime, endTime], oldest first.
func (r *OpsRepository) ListSystemMetricsRange(ctx context.Context, windowMinutes int, startTime, endTime time.Time, limit int) ([]service.OpsMetrics, error) {
    if windowMinutes <= 0 {
        windowMinutes = DefaultWindowMinutes
    }
    if limit <= 0 || limit > MaxMetricsLimit {
        limit = DefaultMetricsLimit
    }
    if endTime.IsZero() {
        endTime = time.Now()
    }
    if startTime.IsZero() {
        startTime = endTime.Add(-time.Duration(limit) * time.Minute)
    }
    if startTime.After(endTime) {
        startTime, endTime = endTime, startTime
    }

    query := `
        SELECT
            window_minutes,
            request_count,
            success_count,
            error_count,
            success_rate,
            error_rate,
            p95_latency_ms,
            p99_latency_ms,
            http2_errors,
            active_alerts,
            cpu_usage_percent,
            memory_used_mb,
            memory_total_mb,
            memory_usage_percent,
            heap_alloc_mb,
            gc_pause_ms,
            concurrency_queue_depth,
            created_at
        FROM ops_system_metrics
        WHERE window_minutes = $1
          AND created_at >= $2
          AND created_at <= $3
        ORDER BY created_at ASC
        LIMIT $4
    `

    rows, err := r.sql.QueryContext(ctx, query, windowMinutes, startTime, endTime, limit)
    if err != nil {
        return nil, err
    }
    defer func() { _ = rows.Close() }()

    results := make([]service.OpsMetrics, 0)
    for rows.Next() {
        metric, err := scanOpsSystemMetric(rows)
        if err != nil {
            return nil, err
        }
        results = append(results, *metric)
    }
    if err := rows.Err(); err != nil {
        return nil, err
    }
    return results, nil
}

// ListAlertRules returns every alert rule, including its JSONB notification settings.
func (r *OpsRepository) ListAlertRules(ctx context.Context) ([]service.OpsAlertRule, error) {
    query := `
        SELECT
            id,
            name,
            description,
            enabled,
            metric_type,
            operator,
            threshold,
            window_minutes,
            sustained_minutes,
            severity,
            notify_email,
            notify_webhook,
            webhook_url,
            cooldown_minutes,
            dimension_filters,
            notify_channels,
            notify_config,
            created_at,
            updated_at
        FROM ops_alert_rules
        ORDER BY id ASC
    `

    rows, err := r.sql.QueryContext(ctx, query)
    if err != nil {
        return nil, err
    }
    defer func() { _ = rows.Close() }()

    rules := make([]service.OpsAlertRule, 0)
    for rows.Next() {
        var rule service.OpsAlertRule
        var description sql.NullString
        var webhookURL sql.NullString
        var dimensionFilters, notifyChannels, notifyConfig []byte
        if err := rows.Scan(
            &rule.ID,
            &rule.Name,
            &description,
            &rule.Enabled,
            &rule.MetricType,
            &rule.Operator,
            &rule.Threshold,
            &rule.WindowMinutes,
            &rule.SustainedMinutes,
            &rule.Severity,
            &rule.NotifyEmail,
            &rule.NotifyWebhook,
            &webhookURL,
            &rule.CooldownMinutes,
            &dimensionFilters,
            &notifyChannels,
            &notifyConfig,
            &rule.CreatedAt,
            &rule.UpdatedAt,
        ); err != nil {
            return nil, err
        }
        if description.Valid {
            rule.Description = description.String
        }
        if webhookURL.Valid {
            rule.WebhookURL = webhookURL.String
        }
        if len(dimensionFilters) > 0 {
            _ = json.Unmarshal(dimensionFilters, &rule.DimensionFilters)
        }
        if len(notifyChannels) > 0 {
            _ = json.Unmarshal(notifyChannels, &rule.NotifyChannels)
        }
        if len(notifyConfig) > 0 {
            _ = json.Unmarshal(notifyConfig, &rule.NotifyConfig)
        }
        rules = append(rules, rule)
    }
    if err := rows.Err(); err != nil {
        return nil, err
    }
    return rules, nil
}

// GetActiveAlertEvent returns the latest firing event for a rule, or nil if none exists.
func (r *OpsRepository) GetActiveAlertEvent(ctx context.Context, ruleID int64) (*service.OpsAlertEvent, error) {
    return r.getAlertEvent(ctx, `WHERE rule_id = $1 AND status = $2`, []any{ruleID, service.OpsAlertStatusFiring})
}

// GetLatestAlertEvent returns the most recent event for a rule regardless of status.
func (r *OpsRepository) GetLatestAlertEvent(ctx context.Context, ruleID int64) (*service.OpsAlertEvent, error) {
    return r.getAlertEvent(ctx, `WHERE rule_id = $1`, []any{ruleID})
}

// CreateAlertEvent inserts a new alert event and backfills its ID and created_at.
func (r *OpsRepository) CreateAlertEvent(ctx context.Context, event *service.OpsAlertEvent) error {
    if event == nil {
        return nil
    }
    if event.FiredAt.IsZero() {
        event.FiredAt = time.Now()
    }
    if event.CreatedAt.IsZero() {
        event.CreatedAt = event.FiredAt
    }
    if event.Status == "" {
        event.Status = service.OpsAlertStatusFiring
    }

    query := `
        INSERT INTO ops_alert_events (
            rule_id,
            severity,
            status,
            title,
            description,
            metric_value,
            threshold_value,
            fired_at,
            resolved_at,
            email_sent,
            webhook_sent,
            created_at
        ) VALUES (
            $1, $2, $3, $4, $5, $6,
            $7, $8, $9, $10, $11, $12
        )
        RETURNING id, created_at
    `

    var resolvedAt sql.NullTime
    if event.ResolvedAt != nil {
        resolvedAt = sql.NullTime{Time: *event.ResolvedAt, Valid: true}
    }

    if err := scanSingleRow(
        ctx,
        r.sql,
        query,
        []any{
            event.RuleID,
            event.Severity,
            event.Status,
            event.Title,
            event.Description,
            event.MetricValue,
            event.ThresholdValue,
            event.FiredAt,
            resolvedAt,
            event.EmailSent,
            event.WebhookSent,
            event.CreatedAt,
        },
        &event.ID,
        &event.CreatedAt,
    ); err != nil {
        return err
    }
    return nil
}

// UpdateAlertEventStatus changes an event's status and optionally its resolved_at timestamp.
func (r *OpsRepository) UpdateAlertEventStatus(ctx context.Context, eventID int64, status string, resolvedAt *time.Time) error {
    var resolved sql.NullTime
    if resolvedAt != nil {
        resolved = sql.NullTime{Time: *resolvedAt, Valid: true}
    }
    _, err := r.sql.ExecContext(ctx, `
        UPDATE ops_alert_events
        SET status = $2, resolved_at = $3
        WHERE id = $1
    `, eventID, status, resolved)
    return err
}

// UpdateAlertEventNotifications records whether email/webhook notifications were sent.
func (r *OpsRepository) UpdateAlertEventNotifications(ctx context.Context, eventID int64, emailSent, webhookSent bool) error {
    _, err := r.sql.ExecContext(ctx, `
        UPDATE ops_alert_events
        SET email_sent = $2, webhook_sent = $3
        WHERE id = $1
    `, eventID, emailSent, webhookSent)
    return err
}

// CountActiveAlerts returns the number of alert events currently in the firing state.
func (r *OpsRepository) CountActiveAlerts(ctx context.Context) (int, error) {
    var count int64
    if err := scanSingleRow(
        ctx,
        r.sql,
        `SELECT COUNT(*) FROM ops_alert_events WHERE status = $1`,
        []any{service.OpsAlertStatusFiring},
        &count,
    ); err != nil {
        if errors.Is(err, sql.ErrNoRows) {
            return 0, nil
        }
        return 0, err
    }
    return int(count), nil
}

// GetWindowStats aggregates success/error counts and latency percentiles for a
// time window in a single CTE query (one round trip instead of four).
func (r *OpsRepository) GetWindowStats(ctx context.Context, startTime, endTime time.Time) (*service.OpsWindowStats, error) {
    query := `
        WITH
        usage_agg AS (
            SELECT
                COUNT(*) AS success_count,
                percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms)
                    FILTER (WHERE duration_ms IS NOT NULL) AS p95,
                percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_ms)
                    FILTER (WHERE duration_ms IS NOT NULL) AS p99
            FROM usage_logs
            WHERE created_at >= $1 AND created_at < $2
        ),
        error_agg AS (
            SELECT
                COUNT(*) AS error_count,
                COUNT(*) FILTER (
                    WHERE
                        error_type = 'network_error'
                        OR error_message ILIKE '%http2%'
                        OR error_message ILIKE '%http/2%'
                ) AS http2_errors
            FROM ops_error_logs
            WHERE created_at >= $1 AND created_at < $2
        )
        SELECT
            usage_agg.success_count,
            error_agg.error_count,
            usage_agg.p95,
            usage_agg.p99,
            error_agg.http2_errors
        FROM usage_agg
        CROSS JOIN error_agg
    `

    var stats service.OpsWindowStats
    var p95Latency, p99Latency sql.NullFloat64
    var http2Errors int64
    if err := scanSingleRow(
        ctx,
        r.sql,
        query,
        []any{startTime, endTime},
        &stats.SuccessCount,
        &stats.ErrorCount,
        &p95Latency,
        &p99Latency,
        &http2Errors,
    ); err != nil {
        return nil, err
    }

    stats.HTTP2Errors = int(http2Errors)
    if p95Latency.Valid {
        stats.P95LatencyMs = int(math.Round(p95Latency.Float64))
    }
    if p99Latency.Valid {
        stats.P99LatencyMs = int(math.Round(p99Latency.Float64))
    }

    return &stats, nil
}

// GetOverviewStats builds the dashboard overview (traffic, errors, latency
// percentiles, top error, latest resource usage) from a single CTE query.
func (r *OpsRepository) GetOverviewStats(ctx context.Context, startTime, endTime time.Time) (*service.OverviewStats, error) {
    query := `
        WITH
        usage_stats AS (
            SELECT
                COUNT(*) AS request_count,
                COUNT(*) FILTER (WHERE duration_ms IS NOT NULL) AS success_count,
                percentile_cont(0.50) WITHIN GROUP (ORDER BY duration_ms) FILTER (WHERE duration_ms IS NOT NULL) AS p50,
                percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) FILTER (WHERE duration_ms IS NOT NULL) AS p95,
                percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_ms) FILTER (WHERE duration_ms IS NOT NULL) AS p99,
                percentile_cont(0.999) WITHIN GROUP (ORDER BY duration_ms) FILTER (WHERE duration_ms IS NOT NULL) AS p999,
                AVG(duration_ms) FILTER (WHERE duration_ms IS NOT NULL) AS avg_latency,
                MAX(duration_ms) FILTER (WHERE duration_ms IS NOT NULL) AS max_latency
            FROM usage_logs
            WHERE created_at >= $1 AND created_at < $2
        ),
        error_stats AS (
            SELECT
                COUNT(*) AS error_count,
                COUNT(*) FILTER (WHERE status_code >= 400 AND status_code < 500) AS error_4xx,
                COUNT(*) FILTER (WHERE status_code >= 500) AS error_5xx,
                COUNT(*) FILTER (
                    WHERE
                        error_type IN ('timeout', 'timeout_error')
                        OR error_message ILIKE '%timeout%'
                        OR error_message ILIKE '%deadline exceeded%'
                ) AS timeout_count
            FROM ops_error_logs
            WHERE created_at >= $1 AND created_at < $2
        ),
        top_error AS (
            SELECT
                COALESCE(status_code::text, 'unknown') AS error_code,
                error_message,
                COUNT(*) AS error_count
            FROM ops_error_logs
            WHERE created_at >= $1 AND created_at < $2
            GROUP BY status_code, error_message
            ORDER BY error_count DESC
            LIMIT 1
        ),
        latest_metrics AS (
            SELECT
                cpu_usage_percent,
                memory_usage_percent,
                memory_used_mb,
                memory_total_mb,
                concurrency_queue_depth
            FROM ops_system_metrics
            ORDER BY created_at DESC
            LIMIT 1
        )
        SELECT
            COALESCE(usage_stats.request_count, 0) + COALESCE(error_stats.error_count, 0) AS request_count,
            COALESCE(usage_stats.success_count, 0),
            COALESCE(error_stats.error_count, 0),
            COALESCE(error_stats.error_4xx, 0),
            COALESCE(error_stats.error_5xx, 0),
            COALESCE(error_stats.timeout_count, 0),
            COALESCE(usage_stats.p50, 0),
            COALESCE(usage_stats.p95, 0),
            COALESCE(usage_stats.p99, 0),
            COALESCE(usage_stats.p999, 0),
            COALESCE(usage_stats.avg_latency, 0),
            COALESCE(usage_stats.max_latency, 0),
            COALESCE(top_error.error_code, ''),
            COALESCE(top_error.error_message, ''),
            COALESCE(top_error.error_count, 0),
            COALESCE(latest_metrics.cpu_usage_percent, 0),
            COALESCE(latest_metrics.memory_usage_percent, 0),
            COALESCE(latest_metrics.memory_used_mb, 0),
            COALESCE(latest_metrics.memory_total_mb, 0),
            COALESCE(latest_metrics.concurrency_queue_depth, 0)
        FROM usage_stats
        CROSS JOIN error_stats
        LEFT JOIN top_error ON true
        LEFT JOIN latest_metrics ON true
    `

    var stats service.OverviewStats
    var p50, p95, p99, p999, avgLatency, maxLatency sql.NullFloat64

    err := scanSingleRow(
        ctx,
        r.sql,
        query,
        []any{startTime, endTime},
        &stats.RequestCount,
        &stats.SuccessCount,
        &stats.ErrorCount,
        &stats.Error4xxCount,
        &stats.Error5xxCount,
        &stats.TimeoutCount,
        &p50,
        &p95,
        &p99,
        &p999,
        &avgLatency,
        &maxLatency,
        &stats.TopErrorCode,
        &stats.TopErrorMsg,
        &stats.TopErrorCount,
        &stats.CPUUsage,
        &stats.MemoryUsage,
        &stats.MemoryUsedMB,
        &stats.MemoryTotalMB,
        &stats.ConcurrencyQueueDepth,
    )
    if err != nil {
        return nil, err
    }

    if p50.Valid {
        stats.LatencyP50 = int(p50.Float64)
    }
    if p95.Valid {
        stats.LatencyP95 = int(p95.Float64)
    }
    if p99.Valid {
        stats.LatencyP99 = int(p99.Float64)
    }
    if p999.Valid {
        stats.LatencyP999 = int(p999.Float64)
    }
    if avgLatency.Valid {
        stats.LatencyAvg = int(avgLatency.Float64)
    }
    if maxLatency.Valid {
        stats.LatencyMax = int(maxLatency.Float64)
    }

    return &stats, nil
}

// GetProviderStats aggregates request/error/latency statistics per platform by
// combining usage_logs (successes) with ops_error_logs (failures).
func (r *OpsRepository) GetProviderStats(ctx context.Context, startTime, endTime time.Time) ([]*service.ProviderStats, error) {
    if startTime.IsZero() || endTime.IsZero() {
        return nil, nil
    }
    if startTime.After(endTime) {
        startTime, endTime = endTime, startTime
    }

    query := `
        WITH combined AS (
            SELECT
                COALESCE(g.platform, a.platform, '') AS platform,
                u.duration_ms AS duration_ms,
                1 AS is_success,
                0 AS is_error,
                NULL::INT AS status_code,
                NULL::TEXT AS error_type,
                NULL::TEXT AS error_message
            FROM usage_logs u
            LEFT JOIN groups g ON g.id = u.group_id
            LEFT JOIN accounts a ON a.id = u.account_id
            WHERE u.created_at >= $1 AND u.created_at < $2

            UNION ALL

            SELECT
                COALESCE(NULLIF(o.platform, ''), g.platform, a.platform, '') AS platform,
                o.duration_ms AS duration_ms,
                0 AS is_success,
                1 AS is_error,
                o.status_code AS status_code,
                o.error_type AS error_type,
                o.error_message AS error_message
            FROM ops_error_logs o
            LEFT JOIN groups g ON g.id = o.group_id
            LEFT JOIN accounts a ON a.id = o.account_id
            WHERE o.created_at >= $1 AND o.created_at < $2
        )
        SELECT
            platform,
            COUNT(*) AS request_count,
            COALESCE(SUM(is_success), 0) AS success_count,
            COALESCE(SUM(is_error), 0) AS error_count,
            COALESCE(AVG(duration_ms) FILTER (WHERE duration_ms IS NOT NULL), 0) AS avg_latency_ms,
            percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_ms)
                FILTER (WHERE duration_ms IS NOT NULL) AS p99_latency_ms,
            COUNT(*) FILTER (WHERE is_error = 1 AND status_code >= 400 AND status_code < 500) AS error_4xx,
            COUNT(*) FILTER (WHERE is_error = 1 AND status_code >= 500 AND status_code < 600) AS error_5xx,
            COUNT(*) FILTER (
                WHERE
                    is_error = 1
                    AND (
                        status_code = 504
                        OR error_type ILIKE '%timeout%'
                        OR error_message ILIKE '%timeout%'
                    )
            ) AS timeout_count
        FROM combined
        WHERE platform <> ''
        GROUP BY platform
        ORDER BY request_count DESC, platform ASC
    `

    rows, err := r.sql.QueryContext(ctx, query, startTime, endTime)
    if err != nil {
        return nil, err
    }
    defer func() { _ = rows.Close() }()

    results := make([]*service.ProviderStats, 0)
    for rows.Next() {
        var item service.ProviderStats
        var avgLatency sql.NullFloat64
        var p99Latency sql.NullFloat64
        if err := rows.Scan(
            &item.Platform,
            &item.RequestCount,
            &item.SuccessCount,
            &item.ErrorCount,
            &avgLatency,
            &p99Latency,
            &item.Error4xxCount,
            &item.Error5xxCount,
            &item.TimeoutCount,
        ); err != nil {
            return nil, err
        }

        if avgLatency.Valid {
            item.AvgLatencyMs = int(math.Round(avgLatency.Float64))
        }
        if p99Latency.Valid {
            item.P99LatencyMs = int(math.Round(p99Latency.Float64))
        }

        results = append(results, &item)
    }
    if err := rows.Err(); err != nil {
        return nil, err
    }
    return results, nil
}

// GetLatencyHistogram buckets request latencies into fixed ranges with percentages.
func (r *OpsRepository) GetLatencyHistogram(ctx context.Context, startTime, endTime time.Time) ([]*service.LatencyHistogramItem, error) {
    query := `
        WITH buckets AS (
            SELECT
                CASE
                    WHEN duration_ms < 200 THEN '<200ms'
                    WHEN duration_ms < 500 THEN '200-500ms'
                    WHEN duration_ms < 1000 THEN '500-1000ms'
                    WHEN duration_ms < 3000 THEN '1000-3000ms'
                    ELSE '>3000ms'
                END AS range_name,
                CASE
                    WHEN duration_ms < 200 THEN 1
                    WHEN duration_ms < 500 THEN 2
                    WHEN duration_ms < 1000 THEN 3
                    WHEN duration_ms < 3000 THEN 4
                    ELSE 5
                END AS range_order,
                COUNT(*) AS count
            FROM usage_logs
            WHERE created_at >= $1 AND created_at < $2 AND duration_ms IS NOT NULL
            GROUP BY 1, 2
        ),
        total AS (
            SELECT SUM(count) AS total_count FROM buckets
        )
        SELECT
            b.range_name,
            b.count,
            ROUND((b.count::numeric / t.total_count) * 100, 2) AS percentage
        FROM buckets b
        CROSS JOIN total t
        ORDER BY b.range_order ASC
    `

    rows, err := r.sql.QueryContext(ctx, query, startTime, endTime)
    if err != nil {
        return nil, err
    }
    defer func() { _ = rows.Close() }()

    results := make([]*service.LatencyHistogramItem, 0)
    for rows.Next() {
        var item service.LatencyHistogramItem
        if err := rows.Scan(&item.Range, &item.Count, &item.Percentage); err != nil {
            return nil, err
        }
        results = append(results, &item)
    }
    if err := rows.Err(); err != nil {
        return nil, err
    }
    return results, nil
}

// GetErrorDistribution returns the top 20 error code/message combinations with percentages.
func (r *OpsRepository) GetErrorDistribution(ctx context.Context, startTime, endTime time.Time) ([]*service.ErrorDistributionItem, error) {
    query := `
        WITH errors AS (
            SELECT
                COALESCE(status_code::text, 'unknown') AS code,
                COALESCE(error_message, 'Unknown error') AS message,
                COUNT(*) AS count
            FROM ops_error_logs
            WHERE created_at >= $1 AND created_at < $2
            GROUP BY 1, 2
        ),
        total AS (
            SELECT SUM(count) AS total_count FROM errors
        )
        SELECT
            e.code,
            e.message,
            e.count,
            ROUND((e.count::numeric / t.total_count) * 100, 2) AS percentage
        FROM errors e
        CROSS JOIN total t
        ORDER BY e.count DESC
        LIMIT 20
    `

    rows, err := r.sql.QueryContext(ctx, query, startTime, endTime)
    if err != nil {
        return nil, err
    }
    defer func() { _ = rows.Close() }()

    results := make([]*service.ErrorDistributionItem, 0)
    for rows.Next() {
        var item service.ErrorDistributionItem
        if err := rows.Scan(&item.Code, &item.Message, &item.Count, &item.Percentage); err != nil {
            return nil, err
        }
        results = append(results, &item)
    }
    if err := rows.Err(); err != nil {
        return nil, err
    }
    return results, nil
}

// getAlertEvent fetches the most recent alert event matching whereClause, or nil if none exists.
func (r *OpsRepository) getAlertEvent(ctx context.Context, whereClause string, args []any) (*service.OpsAlertEvent, error) {
    query := fmt.Sprintf(`
        SELECT
            id,
            rule_id,
            severity,
            status,
            title,
            description,
            metric_value,
            threshold_value,
            fired_at,
            resolved_at,
            email_sent,
            webhook_sent,
            created_at
        FROM ops_alert_events
        %s
        ORDER BY fired_at DESC
        LIMIT 1
    `, whereClause)

    var event service.OpsAlertEvent
    var resolvedAt sql.NullTime
    var metricValue sql.NullFloat64
    var thresholdValue sql.NullFloat64
    if err := scanSingleRow(
        ctx,
        r.sql,
        query,
        args,
        &event.ID,
        &event.RuleID,
        &event.Severity,
        &event.Status,
        &event.Title,
        &event.Description,
        &metricValue,
        &thresholdValue,
        &event.FiredAt,
        &resolvedAt,
        &event.EmailSent,
        &event.WebhookSent,
        &event.CreatedAt,
    ); err != nil {
        if errors.Is(err, sql.ErrNoRows) {
            return nil, nil
        }
        return nil, err
    }

    if metricValue.Valid {
        event.MetricValue = metricValue.Float64
    }
    if thresholdValue.Valid {
        event.ThresholdValue = thresholdValue.Float64
    }
    if resolvedAt.Valid {
        event.ResolvedAt = &resolvedAt.Time
    }
    return &event, nil
}

// scanOpsSystemMetric maps one ops_system_metrics row onto a service.OpsMetrics value.
func scanOpsSystemMetric(rows *sql.Rows) (*service.OpsMetrics, error) {
    var metric service.OpsMetrics
    var windowMinutes sql.NullInt64
    var requestCount, successCount, errorCount sql.NullInt64
    var successRate, errorRate sql.NullFloat64
    var p95Latency, p99Latency, http2Errors, activeAlerts sql.NullInt64
    var cpuUsage, memoryUsage, gcPause sql.NullFloat64
    var memoryUsed, memoryTotal, heapAlloc, queueDepth sql.NullInt64

    if err := rows.Scan(
        &windowMinutes,
        &requestCount,
        &successCount,
        &errorCount,
        &successRate,
        &errorRate,
        &p95Latency,
        &p99Latency,
        &http2Errors,
        &activeAlerts,
        &cpuUsage,
        &memoryUsed,
        &memoryTotal,
        &memoryUsage,
        &heapAlloc,
        &gcPause,
        &queueDepth,
        &metric.UpdatedAt,
    ); err != nil {
        return nil, err
    }

    if windowMinutes.Valid {
        metric.WindowMinutes = int(windowMinutes.Int64)
    }
    if requestCount.Valid {
        metric.RequestCount = requestCount.Int64
    }
    if successCount.Valid {
        metric.SuccessCount = successCount.Int64
    }
    if errorCount.Valid {
        metric.ErrorCount = errorCount.Int64
    }
    if successRate.Valid {
        metric.SuccessRate = successRate.Float64
    }
    if errorRate.Valid {
        metric.ErrorRate = errorRate.Float64
    }
    if p95Latency.Valid {
        metric.P95LatencyMs = int(p95Latency.Int64)
    }
    if p99Latency.Valid {
        metric.P99LatencyMs = int(p99Latency.Int64)
    }
    if http2Errors.Valid {
        metric.HTTP2Errors = int(http2Errors.Int64)
    }
    if activeAlerts.Valid {
        metric.ActiveAlerts = int(activeAlerts.Int64)
    }
    if cpuUsage.Valid {
        metric.CPUUsagePercent = cpuUsage.Float64
    }
    if memoryUsed.Valid {
        metric.MemoryUsedMB = memoryUsed.Int64
    }
    if memoryTotal.Valid {
        metric.MemoryTotalMB = memoryTotal.Int64
    }
    if memoryUsage.Valid {
        metric.MemoryUsagePercent = memoryUsage.Float64
    }
    if heapAlloc.Valid {
        metric.HeapAllocMB = heapAlloc.Int64
    }
    if gcPause.Valid {
        metric.GCPauseMs = gcPause.Float64
    }
    if queueDepth.Valid {
        metric.ConcurrencyQueueDepth = int(queueDepth.Int64)
    }

    return &metric, nil
}

// scanOpsErrorLog maps one ops_error_logs row onto a service.OpsErrorLog value.
func scanOpsErrorLog(rows *sql.Rows) (*service.OpsErrorLog, error) {
    var entry service.OpsErrorLog
    var userID, apiKeyID, accountID, groupID sql.NullInt64
    var clientIP sql.NullString
    var statusCode sql.NullInt64
    var platform sql.NullString
    var model sql.NullString
    var requestPath sql.NullString
    var stream sql.NullBool
    var latency sql.NullInt64
    var requestID sql.NullString
    var message sql.NullString

    if err := rows.Scan(
        &entry.ID,
        &entry.CreatedAt,
        &userID,
        &apiKeyID,
        &accountID,
        &groupID,
        &clientIP,
        &entry.Phase,
        &entry.Type,
        &entry.Severity,
        &statusCode,
        &platform,
        &model,
        &requestPath,
        &stream,
        &latency,
        &requestID,
        &message,
    ); err != nil {
        return nil, err
    }

    if userID.Valid {
        v := userID.Int64
        entry.UserID = &v
    }
    if apiKeyID.Valid {
        v := apiKeyID.Int64
        entry.APIKeyID = &v
    }
    if accountID.Valid {
        v := accountID.Int64
        entry.AccountID = &v
    }
    if groupID.Valid {
        v := groupID.Int64
        entry.GroupID = &v
    }
    if clientIP.Valid {
        entry.ClientIP = clientIP.String
    }
    if statusCode.Valid {
        entry.StatusCode = int(statusCode.Int64)
    }
    if platform.Valid {
        entry.Platform = platform.String
    }
    if model.Valid {
        entry.Model = model.String
    }
    if requestPath.Valid {
        entry.RequestPath = requestPath.String
    }
    if stream.Valid {
        entry.Stream = stream.Bool
    }
    if latency.Valid {
        value := int(latency.Int64)
        entry.LatencyMs = &value
    }
    if requestID.Valid {
        entry.RequestID = requestID.String
    }
    if message.Valid {
        entry.Message = message.String
    }

    return &entry, nil
}

// nullString converts an empty string to SQL NULL.
func nullString(value string) sql.NullString {
    if value == "" {
        return sql.NullString{}
    }
    return sql.NullString{String: value, Valid: true}
}