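* fix(ops): Fix critical security and stability issues in the ops monitoring system
## Fixes
### P0 critical issues
1. **DNS rebinding protection** (ops_alert_service.go)
- Pin the resolved IPs so a DNS rebinding attack after validation cannot redirect the request
- Use a custom Transport.DialContext that dials only validated public IPs
- Extend the IP blocklist, including the cloud metadata address (169.254.169.254)
- Add full unit test coverage
2. **OpsAlertService lifecycle management** (wire.go)
- Call opsAlertService.Start() in ProvideOpsMetricsCollector
- Ensure stopCtx is initialized correctly to avoid nil pointer dereferences
- Start defensively to guarantee the service startup order
3. **Database query ordering** (ops_repo.go)
- Add an explicit ORDER BY updated_at DESC, id DESC to ListRecentSystemMetrics
- Add a guaranteed ordering to GetLatestSystemMetric
- Avoid false alerts caused by nondeterministic row order from the database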
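### P1 important issues
4. **Concurrency safety** (ops_metrics_collector.go)
- Guard the lastGCPauseTotal field with a sync.Mutex
- Prevent data races
5. **Goroutine leak** (ops_error_logger.go)
- Use a worker pool to cap the number of concurrent goroutines
- Use a buffered queue with capacity 256 and 10 fixed workers
- Submit without blocking; drop the task when the queue is full
6. **Lifecycle control** (ops_alert_service.go)
- Add Start/Stop methods for graceful shutdown
- Control the goroutine lifetime with a context
- Use a WaitGroup to wait for background tasks to finish
7. **Webhook URL validation** (ops_alert_service.go)
- Prevent SSRF: validate the scheme and reject internal IPs
- Validate DNS resolution and reject hostnames that resolve to private IPs
- Add 8 unit tests covering various attack scenarios
8. **Resource leaks** (ops_repo.go)
- Fix several defer rows.Close() issues
- Simplify redundant defer func() wrappers
9. **HTTP timeout control** (ops_alert_service.go)
- Create an http.Client with a 10-second timeout
- Add the buildWebhookHTTPClient helper
- Prevent HTTP requests from hanging indefinitely
10. **Database query optimization** (ops_repo.go)
- Merge the 4 separate queries in GetWindowStats into 1 CTE query
- Reduce network round trips and table scans
- Improve query performance as a result
11. **Retry mechanism** (ops_alert_service.go)
- Retry email delivery: up to 3 retries with exponential backoff (1s/2s/4s)
- Add webhook as a fallback channel
- Add complete error handling and logging
12. **Magic numbers** (ops_repo.go, ops_metrics_collector.go)
- Extract hard-coded numbers into named constants
- Improve readability and maintainability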
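## Test verification
- ✅ go test ./internal/service -tags opsalert_unit passes
- ✅ All webhook validation tests pass
- ✅ Retry mechanism tests pass
## Impact
- Markedly better security for the ops monitoring system
- Improved system stability and performance
- No breaking changes; backward compatible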
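* feat(ops): Ops monitoring system V2 - full implementation
## Core features
- Ops monitoring dashboard V2 (real-time monitoring, historical trends, alert management)
- Real-time QPS/TPS monitoring over WebSocket (30s heartbeat, automatic reconnect)
- System metrics collection (CPU, memory, latency, error rate, and more)
- Multi-dimensional statistics (by provider, model, user, and other dimensions)
- Alert rule management (threshold configuration, notification channels)
- Error log tracing (detailed error messages, stack traces)
## Database schema (Migration 025)
### Extended existing tables
- ops_system_metrics: add RED metrics, error classification, latency metrics, resource metrics, business metrics
- ops_alert_rules: add JSONB columns (dimension_filters, notify_channels, notify_config)
### New tables
- ops_dimension_stats: multi-dimensional statistics
- ops_data_retention_config: data retention policy configuration
### New views and functions
- ops_latest_metrics: metrics for the latest 1-minute window (column names and window filter fixed)
- ops_active_alerts: currently active alerts (column names and status values fixed)
- calculate_health_score: health score calculation function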
## Consistency fixes (score 98/100)
### P0 (blocks the migration)
- ✅ Fix ops_latest_metrics view column names (latency_p99→p99_latency_ms, cpu_usage→cpu_usage_percent)
- ✅ Fix ops_active_alerts view column names (metric→metric_type, triggered_at→fired_at, trigger_value→metric_value, threshold→threshold_value)
- ✅ Unify the alert history table name (drop ops_alert_history, use ops_alert_events)
- ✅ Unify API parameter limits (limit for ListMetricsHistory and ListErrorLogs changed to 5000)
### P1 (functional completeness)
- ✅ Fix the ops_latest_metrics view not filtering on window_minutes (add WHERE m.window_minutes = 1)
- ✅ Fix the data backfill UPDATE logic (QPS is now computed as request_count/(window_minutes*60.0))
- ✅ Add backend support for the ops_alert_rules JSONB columns (Go structs and serialization)
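
For reference, the backfill formula above divides the request count by the window length in seconds. The Go helper below is a hypothetical mirror of that SQL expression, written only to make the arithmetic concrete; the zero-window guard is an assumption and not part of the migration.

```go
package main

import "fmt"

// qps is a hypothetical helper mirroring the backfill expression
// request_count / (window_minutes * 60.0).
func qps(requestCount int64, windowMinutes int) float64 {
	if windowMinutes <= 0 {
		return 0 // assumption: treat an invalid window as "no data"
	}
	return float64(requestCount) / (float64(windowMinutes) * 60.0)
}

func main() {
	fmt.Println(qps(1200, 1)) // 1200 requests in 60s  -> 20
	fmt.Println(qps(1200, 5)) // 1200 requests in 300s -> 4
}
```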
### P2 (optimization)
- ✅ Frontend WebSocket automatic reconnect (exponential backoff 1s→2s→4s→8s→16s, at most 5 attempts)
- ✅ Backend WebSocket heartbeat check (ping every 30s, 60s pong timeout)
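## Technical implementation
### Backend (Go)
- Handler layer: ops_handler.go (REST API), ops_ws_handler.go (WebSocket)
- Service layer: ops_service.go (core logic), ops_cache.go (caching), ops_alerts.go (alerting)
- Repository layer: ops_repo.go (data access), ops.go (model definitions)
- Routes: admin.go (new ops routes)
- Dependency injection: wire_gen.go (generated)
### Frontend (Vue3 + TypeScript)
- Component: OpsDashboardV2.vue (main dashboard component)
- API: ops.ts (REST API and WebSocket wrapper)
- Routes: index.ts (new /admin/ops route)
- i18n: en.ts, zh.ts (English and Chinese)
## Test verification
- ✅ All Go tests pass
- ✅ Migration runs successfully
- ✅ WebSocket connection is stable
- ✅ Frontend and backend data structures are aligned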
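* refactor: Code cleanup and test improvements
## Test file improvements
- Simplify integration test fixtures and assertions
- Improve test helper functions
- Unify the test data format
## Code cleanup
- Remove unused code and comments
- Simplify the concurrency_cache implementation
- Improve middleware error handling
## Minor fixes
- Fix minor issues in gateway_handler and openai_gateway_handler
- Unify code style and formatting
Change stats: 27 files, 292 insertions, 322 deletions (net reduction of 30 lines)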
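* fix(ops): Security hardening and feature improvements for the ops monitoring system
## Security enhancements
- feat(security): redact WebSocket logs so token/api_key values do not leak
- feat(security): validate X-Forwarded-Host against an allowlist to prevent CSRF bypass
- feat(security): make the Origin policy configurable, with strict/permissive modes
- feat(auth): allow WebSocket authentication to pass the token as a query parameter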
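
Because the token may now travel in the query string, request URIs must be masked before logging. The sketch below shows one way to do that with `net/url`; `redactQuery`, the `sensitiveParams` list, and the `REDACTED` placeholder are assumptions for illustration, not the implementation in middleware/logger.go.

```go
package main

import (
	"fmt"
	"net/url"
)

// sensitiveParams lists query keys to mask. The exact set is an assumption
// based on the token/api_key wording above; matching is case-sensitive.
var sensitiveParams = []string{"token", "api_key"}

// redactQuery is a hypothetical helper that masks sensitive query values
// before a request URI is written to logs.
func redactQuery(rawURI string) string {
	u, err := url.Parse(rawURI)
	if err != nil {
		return "[unparseable uri]" // never log input that cannot be inspected
	}
	q := u.Query()
	for _, key := range sensitiveParams {
		if q.Has(key) {
			q.Set(key, "REDACTED")
		}
	}
	u.RawQuery = q.Encode() // Encode sorts keys alphabetically
	return u.String()
}

func main() {
	// The path is a made-up example.
	fmt.Println(redactQuery("/ws?token=secret&window=1"))
}
```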
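## Configuration improvements
- feat(config): configure proxy trust and the Origin policy through environment variables
- OPS_WS_TRUST_PROXY
- OPS_WS_TRUSTED_PROXIES
- OPS_WS_ORIGIN_POLICY
- fix(ops): lower the error log query limit from 5000 to 500 to reduce memory usage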
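
A rough sketch of how the three variables might be read is shown below. Only the variable names and the strict/permissive modes come from this commit; the value formats (a boolean, a comma-separated list), the `strict` default, and the `wsConfig`/`loadWSConfig` names are all assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// wsConfig is a hypothetical container for the WebSocket settings.
type wsConfig struct {
	TrustProxy     bool
	TrustedProxies []string
	OriginPolicy   string
}

// loadWSConfig reads the three variables through getenv. Value formats and
// defaults are assumptions, not documented behavior.
func loadWSConfig(getenv func(string) string) wsConfig {
	cfg := wsConfig{OriginPolicy: "strict"} // assumed default: fail closed
	if v, err := strconv.ParseBool(getenv("OPS_WS_TRUST_PROXY")); err == nil {
		cfg.TrustProxy = v
	}
	for _, p := range strings.Split(getenv("OPS_WS_TRUSTED_PROXIES"), ",") {
		if p = strings.TrimSpace(p); p != "" {
			cfg.TrustedProxies = append(cfg.TrustedProxies, p)
		}
	}
	policy := strings.ToLower(strings.TrimSpace(getenv("OPS_WS_ORIGIN_POLICY")))
	if policy == "permissive" {
		cfg.OriginPolicy = policy
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", loadWSConfig(os.Getenv))
}
```

Passing `getenv` as a parameter keeps the loader testable without mutating the process environment.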
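## Architecture improvements
- refactor(ops): decouple the alert service so it runs its own evaluation timer
- refactor(ops): unify OpsDashboard into a single version and remove the separate V2
## Tests and documentation
- test(ops): add WebSocket security validation unit tests (8 test cases)
- test(ops): add alert service integration tests
- docs(api): update the API docs to note the limit change
- docs: add a CHANGELOG recording breaking changes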
## Changed files
Backend:
- backend/internal/server/middleware/logger.go
- backend/internal/handler/admin/ops_handler.go
- backend/internal/handler/admin/ops_ws_handler.go
- backend/internal/server/middleware/admin_auth.go
- backend/internal/service/ops_alert_service.go
- backend/internal/service/ops_metrics_collector.go
- backend/internal/service/wire.go
Frontend:
- frontend/src/views/admin/ops/OpsDashboard.vue
- frontend/src/router/index.ts
- frontend/src/api/admin/ops.ts
Tests:
- backend/internal/handler/admin/ops_ws_handler_test.go (new)
- backend/internal/service/ops_alert_service_integration_test.go (new)
Docs:
- CHANGELOG.md (new)
- docs/API-运维监控中心2.0.md (updated)
* fix(migrations): Fix argument type mismatch for the calculate_health_score function
Add explicit type casts in the ops_latest_metrics view so the argument types match the function signature
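* fix(lint): Fix all issues reported by golangci-lint
- Move the Redis dependency from the service layer to the repository layer
- Add error checks (WebSocket connection and read timeouts)
- Run gofmt to format the code
- Add nil pointer checks
- Remove the unused alertService field
Issues fixed:
- depguard: 3 (the service layer must not import redis directly)
- errcheck: 3 (unchecked error return values)
- gofmt: 2 (formatting issues)
- staticcheck: 4 (nil pointer dereferences)
- unused: 1 (unused field)
Code stats:
- Files changed: 11
- Lines deleted: 490
- Lines added: 105
- Net reduction: 385 lines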
835 lines
21 KiB
Go
package service

import (
	"bytes"
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"log"
	"net"
	"net/http"
	"net/url"
	"strconv"
	"strings"
	"sync"
	"time"
)

type OpsAlertService struct {
	opsService   *OpsService
	userService  *UserService
	emailService *EmailService
	httpClient   *http.Client

	interval time.Duration

	startOnce sync.Once
	stopOnce  sync.Once
	stopCtx   context.Context
	stop      context.CancelFunc
	wg        sync.WaitGroup
}

// opsAlertEvalInterval defines how often OpsAlertService evaluates alert rules.
//
// Production uses opsMetricsInterval. Tests may override this variable to keep
// integration tests fast without changing production defaults.
var opsAlertEvalInterval = opsMetricsInterval

func NewOpsAlertService(opsService *OpsService, userService *UserService, emailService *EmailService) *OpsAlertService {
	return &OpsAlertService{
		opsService:   opsService,
		userService:  userService,
		emailService: emailService,
		httpClient:   &http.Client{Timeout: 10 * time.Second},
		interval:     opsAlertEvalInterval,
	}
}

// Start launches the background alert evaluation loop.
//
// Stop must be called during shutdown to ensure the goroutine exits.
func (s *OpsAlertService) Start() {
	s.StartWithContext(context.Background())
}

// StartWithContext is like Start but allows the caller to provide a parent context.
// When the parent context is canceled, the service stops automatically.
func (s *OpsAlertService) StartWithContext(ctx context.Context) {
	if s == nil {
		return
	}
	if ctx == nil {
		ctx = context.Background()
	}

	s.startOnce.Do(func() {
		if s.interval <= 0 {
			s.interval = opsAlertEvalInterval
		}

		s.stopCtx, s.stop = context.WithCancel(ctx)
		s.wg.Add(1)
		go s.run()
	})
}

// Stop gracefully stops the background goroutine started by Start/StartWithContext.
// It is safe to call Stop multiple times.
func (s *OpsAlertService) Stop() {
	if s == nil {
		return
	}

	s.stopOnce.Do(func() {
		if s.stop != nil {
			s.stop()
		}
	})
	s.wg.Wait()
}

func (s *OpsAlertService) run() {
	defer s.wg.Done()

	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()

	s.evaluateOnce()
	for {
		select {
		case <-ticker.C:
			s.evaluateOnce()
		case <-s.stopCtx.Done():
			return
		}
	}
}

func (s *OpsAlertService) evaluateOnce() {
	ctx, cancel := context.WithTimeout(s.stopCtx, opsAlertEvaluateTimeout)
	defer cancel()

	s.Evaluate(ctx, time.Now())
}

func (s *OpsAlertService) Evaluate(ctx context.Context, now time.Time) {
	if s == nil || s.opsService == nil {
		return
	}

	rules, err := s.opsService.ListAlertRules(ctx)
	if err != nil {
		log.Printf("[OpsAlert] failed to list rules: %v", err)
		return
	}
	if len(rules) == 0 {
		return
	}

	maxSustainedByWindow := make(map[int]int)
	for _, rule := range rules {
		if !rule.Enabled {
			continue
		}
		window := rule.WindowMinutes
		if window <= 0 {
			window = 1
		}
		sustained := rule.SustainedMinutes
		if sustained <= 0 {
			sustained = 1
		}
		if sustained > maxSustainedByWindow[window] {
			maxSustainedByWindow[window] = sustained
		}
	}

	metricsByWindow := make(map[int][]OpsMetrics)
	for window, limit := range maxSustainedByWindow {
		metrics, err := s.opsService.ListRecentSystemMetrics(ctx, window, limit)
		if err != nil {
			log.Printf("[OpsAlert] failed to load metrics window=%dm: %v", window, err)
			continue
		}
		metricsByWindow[window] = metrics
	}

	for _, rule := range rules {
		if !rule.Enabled {
			continue
		}
		window := rule.WindowMinutes
		if window <= 0 {
			window = 1
		}
		sustained := rule.SustainedMinutes
		if sustained <= 0 {
			sustained = 1
		}

		metrics := metricsByWindow[window]
		selected, ok := selectContiguousMetrics(metrics, sustained, now)
		if !ok {
			continue
		}

		breached, latestValue, ok := evaluateRule(rule, selected)
		if !ok {
			continue
		}

		activeEvent, err := s.opsService.GetActiveAlertEvent(ctx, rule.ID)
		if err != nil {
			log.Printf("[OpsAlert] failed to get active event (rule=%d): %v", rule.ID, err)
			continue
		}

		if breached {
			if activeEvent != nil {
				continue
			}

			lastEvent, err := s.opsService.GetLatestAlertEvent(ctx, rule.ID)
			if err != nil {
				log.Printf("[OpsAlert] failed to get latest event (rule=%d): %v", rule.ID, err)
				continue
			}
			if lastEvent != nil && rule.CooldownMinutes > 0 {
				cooldown := time.Duration(rule.CooldownMinutes) * time.Minute
				if now.Sub(lastEvent.FiredAt) < cooldown {
					continue
				}
			}

			event := &OpsAlertEvent{
				RuleID:         rule.ID,
				Severity:       rule.Severity,
				Status:         OpsAlertStatusFiring,
				Title:          fmt.Sprintf("%s: %s", rule.Severity, rule.Name),
				Description:    buildAlertDescription(rule, latestValue),
				MetricValue:    latestValue,
				ThresholdValue: rule.Threshold,
				FiredAt:        now,
				CreatedAt:      now,
			}

			if err := s.opsService.CreateAlertEvent(ctx, event); err != nil {
				log.Printf("[OpsAlert] failed to create event (rule=%d): %v", rule.ID, err)
				continue
			}

			emailSent, webhookSent := s.dispatchNotifications(ctx, rule, event)
			if emailSent || webhookSent {
				if err := s.opsService.UpdateAlertEventNotifications(ctx, event.ID, emailSent, webhookSent); err != nil {
					log.Printf("[OpsAlert] failed to update notification flags (event=%d): %v", event.ID, err)
				}
			}
		} else if activeEvent != nil {
			resolvedAt := now
			if err := s.opsService.UpdateAlertEventStatus(ctx, activeEvent.ID, OpsAlertStatusResolved, &resolvedAt); err != nil {
				log.Printf("[OpsAlert] failed to resolve event (event=%d): %v", activeEvent.ID, err)
			}
		}
	}
}

const opsMetricsContinuityTolerance = 20 * time.Second

// selectContiguousMetrics picks the newest N metrics and verifies they are continuous.
//
// This prevents a sustained rule from triggering when metrics sampling has gaps
// (e.g. collector downtime) and avoids evaluating "stale" data.
//
// Assumptions:
// - Metrics are ordered by UpdatedAt DESC (newest first).
// - Metrics are expected to be collected at opsMetricsInterval cadence.
func selectContiguousMetrics(metrics []OpsMetrics, needed int, now time.Time) ([]OpsMetrics, bool) {
	if needed <= 0 {
		return nil, false
	}
	if len(metrics) < needed {
		return nil, false
	}
	newest := metrics[0].UpdatedAt
	if newest.IsZero() {
		return nil, false
	}
	if now.Sub(newest) > opsMetricsInterval+opsMetricsContinuityTolerance {
		return nil, false
	}

	selected := metrics[:needed]
	for i := 0; i < len(selected)-1; i++ {
		a := selected[i].UpdatedAt
		b := selected[i+1].UpdatedAt
		if a.IsZero() || b.IsZero() {
			return nil, false
		}
		gap := a.Sub(b)
		if gap < opsMetricsInterval-opsMetricsContinuityTolerance || gap > opsMetricsInterval+opsMetricsContinuityTolerance {
			return nil, false
		}
	}
	return selected, true
}

func evaluateRule(rule OpsAlertRule, metrics []OpsMetrics) (bool, float64, bool) {
	if len(metrics) == 0 {
		return false, 0, false
	}

	latestValue, ok := metricValue(metrics[0], rule.MetricType)
	if !ok {
		return false, 0, false
	}

	for _, metric := range metrics {
		value, ok := metricValue(metric, rule.MetricType)
		if !ok || !compareMetric(value, rule.Operator, rule.Threshold) {
			return false, latestValue, true
		}
	}

	return true, latestValue, true
}

func metricValue(metric OpsMetrics, metricType string) (float64, bool) {
	switch metricType {
	case OpsMetricSuccessRate:
		if metric.RequestCount == 0 {
			return 0, false
		}
		return metric.SuccessRate, true
	case OpsMetricErrorRate:
		if metric.RequestCount == 0 {
			return 0, false
		}
		return metric.ErrorRate, true
	case OpsMetricP95LatencyMs:
		return float64(metric.P95LatencyMs), true
	case OpsMetricP99LatencyMs:
		return float64(metric.P99LatencyMs), true
	case OpsMetricHTTP2Errors:
		return float64(metric.HTTP2Errors), true
	case OpsMetricCPUUsagePercent:
		return metric.CPUUsagePercent, true
	case OpsMetricMemoryUsagePercent:
		return metric.MemoryUsagePercent, true
	case OpsMetricQueueDepth:
		return float64(metric.ConcurrencyQueueDepth), true
	default:
		return 0, false
	}
}

func compareMetric(value float64, operator string, threshold float64) bool {
	switch operator {
	case ">":
		return value > threshold
	case ">=":
		return value >= threshold
	case "<":
		return value < threshold
	case "<=":
		return value <= threshold
	case "==":
		return value == threshold
	default:
		return false
	}
}

func buildAlertDescription(rule OpsAlertRule, value float64) string {
	window := rule.WindowMinutes
	if window <= 0 {
		window = 1
	}
	return fmt.Sprintf("Rule %s triggered: %s %s %.2f (current %.2f) over last %dm",
		rule.Name,
		rule.MetricType,
		rule.Operator,
		rule.Threshold,
		value,
		window,
	)
}

func (s *OpsAlertService) dispatchNotifications(ctx context.Context, rule OpsAlertRule, event *OpsAlertEvent) (bool, bool) {
	emailSent := false
	webhookSent := false

	notifyCtx, cancel := s.notificationContext(ctx)
	defer cancel()

	if rule.NotifyEmail {
		emailSent = s.sendEmailNotification(notifyCtx, rule, event)
	}
	if rule.NotifyWebhook && rule.WebhookURL != "" {
		webhookSent = s.sendWebhookNotification(notifyCtx, rule, event)
	}
	// Fallback channel: if email is enabled but ultimately fails, try webhook even if the
	// webhook toggle is off (as long as a webhook URL is configured).
	if rule.NotifyEmail && !emailSent && !rule.NotifyWebhook && rule.WebhookURL != "" {
		log.Printf("[OpsAlert] email failed; attempting webhook fallback (rule=%d)", rule.ID)
		webhookSent = s.sendWebhookNotification(notifyCtx, rule, event)
	}

	return emailSent, webhookSent
}

const (
	opsAlertEvaluateTimeout     = 45 * time.Second
	opsAlertNotificationTimeout = 30 * time.Second
	opsAlertEmailMaxRetries     = 3
)

var opsAlertEmailBackoff = []time.Duration{
	1 * time.Second,
	2 * time.Second,
	4 * time.Second,
}

func (s *OpsAlertService) notificationContext(ctx context.Context) (context.Context, context.CancelFunc) {
	parent := ctx
	if s != nil && s.stopCtx != nil {
		parent = s.stopCtx
	}
	if parent == nil {
		parent = context.Background()
	}
	return context.WithTimeout(parent, opsAlertNotificationTimeout)
}

var opsAlertSleep = sleepWithContext

func sleepWithContext(ctx context.Context, d time.Duration) error {
	if d <= 0 {
		return nil
	}
	if ctx == nil {
		time.Sleep(d)
		return nil
	}
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}

func retryWithBackoff(
	ctx context.Context,
	maxRetries int,
	backoff []time.Duration,
	fn func() error,
	onError func(attempt int, total int, nextDelay time.Duration, err error),
) error {
	if ctx == nil {
		ctx = context.Background()
	}
	if maxRetries < 0 {
		maxRetries = 0
	}
	totalAttempts := maxRetries + 1

	var lastErr error
	for attempt := 1; attempt <= totalAttempts; attempt++ {
		if attempt > 1 {
			backoffIdx := attempt - 2
			if backoffIdx < len(backoff) {
				if err := opsAlertSleep(ctx, backoff[backoffIdx]); err != nil {
					return err
				}
			}
		}

		if err := ctx.Err(); err != nil {
			return err
		}

		if err := fn(); err != nil {
			lastErr = err
			nextDelay := time.Duration(0)
			if attempt < totalAttempts {
				nextIdx := attempt - 1
				if nextIdx < len(backoff) {
					nextDelay = backoff[nextIdx]
				}
			}
			if onError != nil {
				onError(attempt, totalAttempts, nextDelay, err)
			}
			continue
		}
		return nil
	}

	return lastErr
}

func (s *OpsAlertService) sendEmailNotification(ctx context.Context, rule OpsAlertRule, event *OpsAlertEvent) bool {
	if s.emailService == nil || s.userService == nil {
		return false
	}

	if ctx == nil {
		ctx = context.Background()
	}

	admin, err := s.userService.GetFirstAdmin(ctx)
	if err != nil || admin == nil || admin.Email == "" {
		return false
	}

	subject := fmt.Sprintf("[Ops Alert][%s] %s", rule.Severity, rule.Name)
	body := fmt.Sprintf(
		"Alert triggered: %s\n\nMetric: %s\nThreshold: %.2f\nCurrent: %.2f\nWindow: %dm\nStatus: %s\nTime: %s",
		rule.Name,
		rule.MetricType,
		rule.Threshold,
		event.MetricValue,
		rule.WindowMinutes,
		event.Status,
		event.FiredAt.Format(time.RFC3339),
	)

	config, err := s.emailService.GetSMTPConfig(ctx)
	if err != nil {
		log.Printf("[OpsAlert] email config load failed: %v", err)
		return false
	}

	if err := retryWithBackoff(
		ctx,
		opsAlertEmailMaxRetries,
		opsAlertEmailBackoff,
		func() error {
			return s.emailService.SendEmailWithConfig(config, admin.Email, subject, body)
		},
		func(attempt int, total int, nextDelay time.Duration, err error) {
			if attempt < total {
				log.Printf("[OpsAlert] email send failed (attempt=%d/%d), retrying in %s: %v", attempt, total, nextDelay, err)
				return
			}
			log.Printf("[OpsAlert] email send failed (attempt=%d/%d), giving up: %v", attempt, total, err)
		},
	); err != nil {
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			log.Printf("[OpsAlert] email send canceled: %v", err)
		}
		return false
	}
	return true
}

func (s *OpsAlertService) sendWebhookNotification(ctx context.Context, rule OpsAlertRule, event *OpsAlertEvent) bool {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	webhookTarget, err := validateWebhookURL(ctx, rule.WebhookURL)
	if err != nil {
		log.Printf("[OpsAlert] invalid webhook url (rule=%d): %v", rule.ID, err)
		return false
	}

	payload := map[string]any{
		"rule_id":         rule.ID,
		"rule_name":       rule.Name,
		"severity":        rule.Severity,
		"status":          event.Status,
		"metric_type":     rule.MetricType,
		"metric_value":    event.MetricValue,
		"threshold_value": rule.Threshold,
		"window_minutes":  rule.WindowMinutes,
		"fired_at":        event.FiredAt.Format(time.RFC3339),
	}

	body, err := json.Marshal(payload)
	if err != nil {
		return false
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, webhookTarget.URL.String(), bytes.NewReader(body))
	if err != nil {
		return false
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := buildWebhookHTTPClient(s.httpClient, webhookTarget).Do(req)
	if err != nil {
		log.Printf("[OpsAlert] webhook send failed: %v", err)
		return false
	}
	defer func() { _ = resp.Body.Close() }()

	if resp.StatusCode < http.StatusOK || resp.StatusCode >= http.StatusMultipleChoices {
		log.Printf("[OpsAlert] webhook returned status %d", resp.StatusCode)
		return false
	}
	return true
}

const webhookHTTPClientTimeout = 10 * time.Second

func buildWebhookHTTPClient(base *http.Client, webhookTarget *validatedWebhookTarget) *http.Client {
	var client http.Client
	if base != nil {
		client = *base
	}
	if client.Timeout <= 0 {
		client.Timeout = webhookHTTPClientTimeout
	}
	client.CheckRedirect = func(req *http.Request, via []*http.Request) error {
		return http.ErrUseLastResponse
	}
	if webhookTarget != nil {
		client.Transport = buildWebhookTransport(client.Transport, webhookTarget)
	}
	return &client
}

var disallowedWebhookIPNets = []net.IPNet{
	// "this host on this network" / unspecified.
	mustParseCIDR("0.0.0.0/8"),
	mustParseCIDR("127.0.0.0/8"),    // loopback (includes 127.0.0.1)
	mustParseCIDR("10.0.0.0/8"),     // RFC1918
	mustParseCIDR("192.168.0.0/16"), // RFC1918
	mustParseCIDR("172.16.0.0/12"),  // RFC1918 (172.16.0.0 - 172.31.255.255)
	mustParseCIDR("100.64.0.0/10"),  // RFC6598 (carrier-grade NAT)
	mustParseCIDR("169.254.0.0/16"), // IPv4 link-local (includes 169.254.169.254 metadata IP on many clouds)
	mustParseCIDR("198.18.0.0/15"),  // RFC2544 benchmark testing
	mustParseCIDR("224.0.0.0/4"),    // IPv4 multicast
	mustParseCIDR("240.0.0.0/4"),    // IPv4 reserved
	mustParseCIDR("::/128"),         // IPv6 unspecified
	mustParseCIDR("::1/128"),        // IPv6 loopback
	mustParseCIDR("fc00::/7"),       // IPv6 unique local
	mustParseCIDR("fe80::/10"),      // IPv6 link-local
	mustParseCIDR("ff00::/8"),       // IPv6 multicast
}

func mustParseCIDR(cidr string) net.IPNet {
	_, block, err := net.ParseCIDR(cidr)
	if err != nil {
		panic(err)
	}
	return *block
}

var lookupIPAddrs = func(ctx context.Context, host string) ([]net.IPAddr, error) {
	return net.DefaultResolver.LookupIPAddr(ctx, host)
}

type validatedWebhookTarget struct {
	URL *url.URL

	host      string
	port      string
	pinnedIPs []net.IP
}

var webhookBaseDialContext = func(ctx context.Context, network, addr string) (net.Conn, error) {
	dialer := net.Dialer{
		Timeout:   5 * time.Second,
		KeepAlive: 30 * time.Second,
	}
	return dialer.DialContext(ctx, network, addr)
}

func buildWebhookTransport(base http.RoundTripper, webhookTarget *validatedWebhookTarget) http.RoundTripper {
	if webhookTarget == nil || webhookTarget.URL == nil {
		return base
	}

	var transport *http.Transport
	switch typed := base.(type) {
	case *http.Transport:
		if typed != nil {
			transport = typed.Clone()
		}
	}
	if transport == nil {
		if defaultTransport, ok := http.DefaultTransport.(*http.Transport); ok && defaultTransport != nil {
			transport = defaultTransport.Clone()
		} else {
			transport = (&http.Transport{}).Clone()
		}
	}

	webhookHost := webhookTarget.host
	webhookPort := webhookTarget.port
	pinnedIPs := append([]net.IP(nil), webhookTarget.pinnedIPs...)

	transport.Proxy = nil
	transport.DialTLSContext = nil
	transport.DialContext = func(ctx context.Context, network, addr string) (net.Conn, error) {
		host, port, err := net.SplitHostPort(addr)
		if err != nil || host == "" || port == "" {
			return nil, fmt.Errorf("webhook dial target is invalid: %q", addr)
		}

		canonicalHost := strings.TrimSuffix(strings.ToLower(host), ".")
		if canonicalHost != webhookHost || port != webhookPort {
			return nil, fmt.Errorf("webhook dial target mismatch: %q", addr)
		}

		var lastErr error
		for _, ip := range pinnedIPs {
			if isDisallowedWebhookIP(ip) {
				lastErr = fmt.Errorf("webhook target resolves to a disallowed ip")
				continue
			}

			dialAddr := net.JoinHostPort(ip.String(), port)
			conn, err := webhookBaseDialContext(ctx, network, dialAddr)
			if err == nil {
				return conn, nil
			}
			lastErr = err
		}
		if lastErr == nil {
			lastErr = errors.New("webhook target has no resolved addresses")
		}
		return nil, lastErr
	}

	return transport
}

func validateWebhookURL(ctx context.Context, raw string) (*validatedWebhookTarget, error) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return nil, errors.New("webhook url is empty")
	}
	// Avoid request smuggling / header injection vectors.
	if strings.ContainsAny(raw, "\r\n") {
		return nil, errors.New("webhook url contains invalid characters")
	}

	parsed, err := url.Parse(raw)
	if err != nil {
		return nil, errors.New("webhook url format is invalid")
	}
	if !strings.EqualFold(parsed.Scheme, "https") {
		return nil, errors.New("webhook url scheme must be https")
	}
	parsed.Scheme = "https"
	if parsed.Host == "" || parsed.Hostname() == "" {
		return nil, errors.New("webhook url must include host")
	}
	if parsed.User != nil {
		return nil, errors.New("webhook url must not include userinfo")
	}
	if parsed.Port() != "" {
		port, err := strconv.Atoi(parsed.Port())
		if err != nil || port < 1 || port > 65535 {
			return nil, errors.New("webhook url port is invalid")
		}
	}

	host := strings.TrimSuffix(strings.ToLower(parsed.Hostname()), ".")
	if host == "localhost" {
		return nil, errors.New("webhook url host must not be localhost")
	}

	if ip := net.ParseIP(host); ip != nil {
		if isDisallowedWebhookIP(ip) {
			return nil, errors.New("webhook url host resolves to a disallowed ip")
		}
		return &validatedWebhookTarget{
			URL:       parsed,
			host:      host,
			port:      portForScheme(parsed),
			pinnedIPs: []net.IP{ip},
		}, nil
	}

	if ctx == nil {
		ctx = context.Background()
	}
	ips, err := lookupIPAddrs(ctx, host)
	if err != nil || len(ips) == 0 {
		return nil, errors.New("webhook url host cannot be resolved")
	}
	pinned := make([]net.IP, 0, len(ips))
	for _, addr := range ips {
		if isDisallowedWebhookIP(addr.IP) {
			return nil, errors.New("webhook url host resolves to a disallowed ip")
		}
		if addr.IP != nil {
			pinned = append(pinned, addr.IP)
		}
	}

	if len(pinned) == 0 {
		return nil, errors.New("webhook url host cannot be resolved")
	}

	return &validatedWebhookTarget{
		URL:       parsed,
		host:      host,
		port:      portForScheme(parsed),
		pinnedIPs: uniqueResolvedIPs(pinned),
	}, nil
}

func isDisallowedWebhookIP(ip net.IP) bool {
	if ip == nil {
		return false
	}
	if ip4 := ip.To4(); ip4 != nil {
		ip = ip4
	} else if ip16 := ip.To16(); ip16 != nil {
		ip = ip16
	} else {
		return false
	}

	// Disallow non-public addresses even if they're not explicitly covered by the CIDR list.
	// This provides defense-in-depth against SSRF targets such as link-local, multicast, and
	// unspecified addresses, and ensures any "pinned" IP is still blocked at dial time.
	if ip.IsUnspecified() ||
		ip.IsLoopback() ||
		ip.IsMulticast() ||
		ip.IsLinkLocalUnicast() ||
		ip.IsLinkLocalMulticast() ||
		ip.IsPrivate() {
		return true
	}

	for _, block := range disallowedWebhookIPNets {
		if block.Contains(ip) {
			return true
		}
	}
	return false
}

func portForScheme(u *url.URL) string {
	if u != nil && u.Port() != "" {
		return u.Port()
	}
	return "443"
}

func uniqueResolvedIPs(ips []net.IP) []net.IP {
	seen := make(map[string]struct{}, len(ips))
	out := make([]net.IP, 0, len(ips))
	for _, ip := range ips {
		if ip == nil {
			continue
		}
		key := ip.String()
		if _, ok := seen[key]; ok {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, ip)
	}
	return out
}