Unverified commit 45bd9ac7 authored by IanShaw, committed by GitHub

Ops monitoring system: security hardening and feature improvements (#21)

* fix(ops): fix critical security and stability issues in the ops monitoring system

## Fixes

### P0 (critical)
1. **DNS rebinding protection** (ops_alert_service.go)
   - Pin the validated IP so that DNS rebinding after validation has no effect
   - Custom Transport.DialContext that only dials the validated public IP
   - Extended IP blocklist, including the cloud metadata address (169.254.169.254)
   - Full unit-test coverage

2. **OpsAlertService lifecycle management** (wire.go)
   - Call opsAlertService.Start() in ProvideOpsMetricsCollector
   - Ensure stopCtx is properly initialized to avoid nil-pointer issues
   - Defensive startup that guarantees service start order

3. **Database query ordering** (ops_repo.go)
   - Add an explicit ORDER BY updated_at DESC, id DESC to ListRecentSystemMetrics
   - Add the same ordering guarantee to GetLatestSystemMetric
   - Prevents false alerts caused by nondeterministic row order from the database

### P1 (important)
4. **Concurrency safety** (ops_metrics_collector.go)
   - Guard the lastGCPauseTotal field with a sync.Mutex
   - Prevents a data race

5. **Goroutine leak** (ops_error_logger.go)
   - Worker-pool pattern that bounds the number of concurrent goroutines
   - Buffered queue with capacity 256 and 10 fixed workers
   - Non-blocking submission; tasks are dropped when the queue is full

6. **Lifecycle control** (ops_alert_service.go)
   - Add Start/Stop methods for graceful shutdown
   - Use a context to control goroutine lifetimes
   - Use a WaitGroup to wait for background tasks to finish

7. **Webhook URL validation** (ops_alert_service.go)
   - Prevent SSRF: validate the scheme and reject internal IPs
   - Validate DNS resolution; reject hostnames that resolve to private IPs
   - Add 8 unit tests covering a range of attack scenarios

8. **Resource leaks** (ops_repo.go)
   - Fix several defer rows.Close() issues
   - Simplify redundant defer func() wrappers

9. **HTTP timeout control** (ops_alert_service.go)
   - Create an http.Client with a 10-second timeout
   - Add a buildWebhookHTTPClient helper
   - Prevents HTTP requests from hanging indefinitely

10. **Database query optimization** (ops_repo.go)
    - Merge the 4 separate queries in GetWindowStats into a single CTE query
    - Fewer network round trips and table scans
    - Significant performance improvement
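The shape of such a CTE merge looks roughly like the following. Table and column names here are illustrative, not the actual schema; the point is that one statement with named subqueries replaces four round trips, and each CTE can share the same window predicate.

```sql
-- One round trip instead of four: each CTE scans the window once and the
-- final SELECT stitches the aggregates together. Names are illustrative;
-- the real query lives in GetWindowStats (ops_repo.go).
WITH counts AS (
    SELECT
        COUNT(*) FILTER (WHERE status = 'success')  AS success_count,
        COUNT(*) FILTER (WHERE status <> 'success') AS error_count,
        COUNT(*) FILTER (WHERE error_type = 'http2') AS http2_errors
    FROM request_logs
    WHERE created_at >= $1 AND created_at < $2
), latencies AS (
    SELECT
        percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
        percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
    FROM request_logs
    WHERE created_at >= $1 AND created_at < $2
)
SELECT c.success_count, c.error_count, c.http2_errors, l.p95, l.p99
FROM counts c CROSS JOIN latencies l;
```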

11. **Retry mechanism** (ops_alert_service.go)
    - Email delivery retries: up to 3 attempts with exponential backoff (1s/2s/4s)
    - Webhook as a fallback channel
    - Complete error handling and logging

12. **Magic numbers** (ops_repo.go, ops_metrics_collector.go)
    - Extract hard-coded numbers into named constants
    - Improves readability and maintainability

## Testing
- go test ./internal/service -tags opsalert_unit passes
- All webhook validation tests pass
- Retry-mechanism tests pass

## Impact
- Significantly improved security of the ops monitoring system
- Better stability and performance
- No breaking changes; backward compatible

* feat(ops): ops monitoring system V2 - full implementation

## Core features
- Ops monitoring dashboard V2 (real-time monitoring, historical trends, alert management)
- Real-time QPS/TPS monitoring over WebSocket (30s heartbeat, automatic reconnect)
- System metrics collection (CPU, memory, latency, error rate, etc.)
- Multi-dimensional statistics (by provider, model, user, and other dimensions)
- Alert rule management (threshold configuration, notification channels)
- Error log tracing (detailed error information, stack traces)

## Database schema (Migration 025)
### Extended tables
- ops_system_metrics: new RED metrics, error classification, latency metrics, resource metrics, business metrics
- ops_alert_rules: new JSONB fields (dimension_filters, notify_channels, notify_config)

### New tables
- ops_dimension_stats: multi-dimensional statistics
- ops_data_retention_config: data retention policy configuration

### New views and functions
- ops_latest_metrics: latest 1-minute-window metrics (field names and window filter fixed)
- ops_active_alerts: currently active alerts (field names and status values fixed)
- calculate_health_score: health score calculation function

## Consistency fixes (scored 98/100)
### P0 (blocks the migration)
- Fix ops_latest_metrics view field names (latency_p99→p99_latency_ms, cpu_usage→cpu_usage_percent)
- Fix ops_active_alerts view field names (metric→metric_type, triggered_at→fired_at, trigger_value→metric_value, threshold→threshold_value)
- Unify the alert-history table name (drop ops_alert_history; use ops_alert_events)
- Unify API parameter limits (limit for ListMetricsHistory and ListErrorLogs set to 5000)

### P1 (functional completeness)
- Fix ops_latest_metrics view not filtering on window_minutes (add WHERE m.window_minutes = 1)
- Fix the data-backfill UPDATE logic (QPS is now computed as request_count/(window_minutes*60.0))
- Add backend support for the ops_alert_rules JSONB fields (Go structs + serialization)

### P2 (optimizations)
- Frontend WebSocket auto-reconnect (exponential backoff 1s→2s→4s→8s→16s, at most 5 attempts)
- Backend WebSocket heartbeat (30s ping, 60s pong timeout)

## Implementation
### Backend (Go)
- Handler layer: ops_handler.go (REST API), ops_ws_handler.go (WebSocket)
- Service layer: ops_service.go (core logic), ops_cache.go (caching), ops_alerts.go (alerting)
- Repository layer: ops_repo.go (data access), ops.go (model definitions)
- Routing: admin.go (new ops routes)
- Dependency injection: wire_gen.go (generated)

### Frontend (Vue3 + TypeScript)
- Components: OpsDashboardV2.vue (main dashboard component)
- API: ops.ts (REST API + WebSocket wrapper)
- Routing: index.ts (new /admin/ops route)
- i18n: en.ts, zh.ts (English and Chinese)

## Testing
- All Go tests pass
- The migration runs cleanly
- WebSocket connections are stable
- Frontend and backend data structures are aligned

* refactor: code cleanup and test improvements

## Test files
- Simplify integration-test fixtures and assertions
- Streamline test helper functions
- Unify test data formats

## Cleanup
- Remove unused code and comments
- Simplify the concurrency_cache implementation
- Improve middleware error handling

## Minor fixes
- Fix small issues in gateway_handler and openai_gateway_handler
- Unify code style and formatting

Change stats: 27 files, 292 insertions, 322 deletions (net -30 lines)

* fix(ops): security hardening and feature improvements for the ops monitoring system

## Security
- feat(security): WebSocket log redaction to prevent token/api_key leakage
- feat(security): X-Forwarded-Host allowlist validation to prevent CSRF bypass
- feat(security): configurable Origin policy with strict/permissive modes
- feat(auth): WebSocket authentication supports passing the token as a query parameter

## Configuration
- feat(config): proxy trust and Origin policy configurable via environment variables
  - OPS_WS_TRUST_PROXY
  - OPS_WS_TRUSTED_PROXIES
  - OPS_WS_ORIGIN_POLICY
- fix(ops): error-log query limit lowered from 5000 to 500 to reduce memory usage

## Architecture
- refactor(ops): decouple the alert service; the evaluation timer now runs independently
- refactor(ops): unify OpsDashboard into a single version; remove the V2 split

## Tests and docs
- test(ops): add WebSocket security unit tests (8 cases)
- test(ops): add alert-service integration tests
- docs(api): update the API docs to note the rate-limit change
- docs: add a CHANGELOG entry recording the breaking changes

## Files changed
Backend:
- backend/internal/server/middleware/logger.go
- backend/internal/handler/admin/ops_handler.go
- backend/internal/handler/admin/ops_ws_handler.go
- backend/internal/server/middleware/admin_auth.go
- backend/internal/service/ops_alert_service.go
- backend/internal/service/ops_metrics_collector.go
- backend/internal/service/wire.go

Frontend:
- frontend/src/views/admin/ops/OpsDashboard.vue
- frontend/src/router/index.ts
- frontend/src/api/admin/ops.ts

Tests:
- backend/internal/handler/admin/ops_ws_handler_test.go (new)
- backend/internal/service/ops_alert_service_integration_test.go (new)

Docs:
- CHANGELOG.md (new)
- docs/API-运维监控中心2.0.md (updated)

* fix(migrations): fix a type-matching issue in calculate_health_score

Add explicit type casts in the ops_latest_metrics view so the argument types match the function signature

* fix(lint): fix all issues reported by golangci-lint

- Move the Redis dependency from the service layer to the repository layer
- Add error checks (WebSocket connections and read timeouts)
- Run gofmt over the code
- Add nil-pointer checks
- Remove the unused alertService field

Issues fixed:
- depguard: 3 (the service layer must not import redis directly)
- errcheck: 3 (unchecked error return values)
- gofmt: 2 (formatting issues)
- staticcheck: 4 (nil-pointer dereferences)
- unused: 1 (unused field)

Code stats:
- Files changed: 11
- Lines removed: 490
- Lines added: 105
- Net change: -385 lines
parent 7fdc2b2d
package service

import (
    "context"
    "time"
)

const (
    OpsAlertStatusFiring   = "firing"
    OpsAlertStatusResolved = "resolved"
)

const (
    OpsMetricSuccessRate        = "success_rate"
    OpsMetricErrorRate          = "error_rate"
    OpsMetricP95LatencyMs       = "p95_latency_ms"
    OpsMetricP99LatencyMs       = "p99_latency_ms"
    OpsMetricHTTP2Errors        = "http2_errors"
    OpsMetricCPUUsagePercent    = "cpu_usage_percent"
    OpsMetricMemoryUsagePercent = "memory_usage_percent"
    OpsMetricQueueDepth         = "concurrency_queue_depth"
)

type OpsAlertRule struct {
    ID               int64          `json:"id"`
    Name             string         `json:"name"`
    Description      string         `json:"description"`
    Enabled          bool           `json:"enabled"`
    MetricType       string         `json:"metric_type"`
    Operator         string         `json:"operator"`
    Threshold        float64        `json:"threshold"`
    WindowMinutes    int            `json:"window_minutes"`
    SustainedMinutes int            `json:"sustained_minutes"`
    Severity         string         `json:"severity"`
    NotifyEmail      bool           `json:"notify_email"`
    NotifyWebhook    bool           `json:"notify_webhook"`
    WebhookURL       string         `json:"webhook_url"`
    CooldownMinutes  int            `json:"cooldown_minutes"`
    DimensionFilters map[string]any `json:"dimension_filters,omitempty"`
    NotifyChannels   []string       `json:"notify_channels,omitempty"`
    NotifyConfig     map[string]any `json:"notify_config,omitempty"`
    CreatedAt        time.Time      `json:"created_at"`
    UpdatedAt        time.Time      `json:"updated_at"`
}

type OpsAlertEvent struct {
    ID             int64      `json:"id"`
    RuleID         int64      `json:"rule_id"`
    Severity       string     `json:"severity"`
    Status         string     `json:"status"`
    Title          string     `json:"title"`
    Description    string     `json:"description"`
    MetricValue    float64    `json:"metric_value"`
    ThresholdValue float64    `json:"threshold_value"`
    FiredAt        time.Time  `json:"fired_at"`
    ResolvedAt     *time.Time `json:"resolved_at"`
    EmailSent      bool       `json:"email_sent"`
    WebhookSent    bool       `json:"webhook_sent"`
    CreatedAt      time.Time  `json:"created_at"`
}

func (s *OpsService) ListAlertRules(ctx context.Context) ([]OpsAlertRule, error) {
    return s.repo.ListAlertRules(ctx)
}

func (s *OpsService) GetActiveAlertEvent(ctx context.Context, ruleID int64) (*OpsAlertEvent, error) {
    return s.repo.GetActiveAlertEvent(ctx, ruleID)
}

func (s *OpsService) GetLatestAlertEvent(ctx context.Context, ruleID int64) (*OpsAlertEvent, error) {
    return s.repo.GetLatestAlertEvent(ctx, ruleID)
}

func (s *OpsService) CreateAlertEvent(ctx context.Context, event *OpsAlertEvent) error {
    return s.repo.CreateAlertEvent(ctx, event)
}

func (s *OpsService) UpdateAlertEventStatus(ctx context.Context, eventID int64, status string, resolvedAt *time.Time) error {
    return s.repo.UpdateAlertEventStatus(ctx, eventID, status, resolvedAt)
}

func (s *OpsService) UpdateAlertEventNotifications(ctx context.Context, eventID int64, emailSent, webhookSent bool) error {
    return s.repo.UpdateAlertEventNotifications(ctx, eventID, emailSent, webhookSent)
}

func (s *OpsService) ListRecentSystemMetrics(ctx context.Context, windowMinutes, limit int) ([]OpsMetrics, error) {
    return s.repo.ListRecentSystemMetrics(ctx, windowMinutes, limit)
}

func (s *OpsService) CountActiveAlerts(ctx context.Context) (int, error) {
    return s.repo.CountActiveAlerts(ctx)
}
package service

import (
    "context"
    "log"
    "runtime"
    "sync"
    "time"

    "github.com/shirou/gopsutil/v4/cpu"
    "github.com/shirou/gopsutil/v4/mem"
)

const (
    opsMetricsInterval           = 1 * time.Minute
    opsMetricsCollectTimeout     = 10 * time.Second
    opsMetricsWindowShortMinutes = 1
    opsMetricsWindowLongMinutes  = 5
    bytesPerMB                   = 1024 * 1024
    cpuUsageSampleInterval       = 0 * time.Second
    percentScale                 = 100
)

type OpsMetricsCollector struct {
    opsService         *OpsService
    concurrencyService *ConcurrencyService
    interval           time.Duration
    lastGCPauseTotal   uint64
    lastGCPauseMu      sync.Mutex
    stopCh             chan struct{}
    startOnce          sync.Once
    stopOnce           sync.Once
}

func NewOpsMetricsCollector(opsService *OpsService, concurrencyService *ConcurrencyService) *OpsMetricsCollector {
    return &OpsMetricsCollector{
        opsService:         opsService,
        concurrencyService: concurrencyService,
        interval:           opsMetricsInterval,
    }
}

func (c *OpsMetricsCollector) Start() {
    if c == nil {
        return
    }
    c.startOnce.Do(func() {
        if c.stopCh == nil {
            c.stopCh = make(chan struct{})
        }
        go c.run()
    })
}

func (c *OpsMetricsCollector) Stop() {
    if c == nil {
        return
    }
    c.stopOnce.Do(func() {
        if c.stopCh != nil {
            close(c.stopCh)
        }
    })
}

func (c *OpsMetricsCollector) run() {
    ticker := time.NewTicker(c.interval)
    defer ticker.Stop()
    c.collectOnce()
    for {
        select {
        case <-ticker.C:
            c.collectOnce()
        case <-c.stopCh:
            return
        }
    }
}

func (c *OpsMetricsCollector) collectOnce() {
    if c.opsService == nil {
        return
    }
    ctx, cancel := context.WithTimeout(context.Background(), opsMetricsCollectTimeout)
    defer cancel()
    now := time.Now()
    systemStats := c.collectSystemStats(ctx)
    queueDepth := c.collectQueueDepth(ctx)
    activeAlerts := c.collectActiveAlerts(ctx)
    for _, window := range []int{opsMetricsWindowShortMinutes, opsMetricsWindowLongMinutes} {
        startTime := now.Add(-time.Duration(window) * time.Minute)
        windowStats, err := c.opsService.GetWindowStats(ctx, startTime, now)
        if err != nil {
            log.Printf("[OpsMetrics] failed to get window stats (%dm): %v", window, err)
            continue
        }
        successRate, errorRate := computeRates(windowStats.SuccessCount, windowStats.ErrorCount)
        requestCount := windowStats.SuccessCount + windowStats.ErrorCount
        metric := &OpsMetrics{
            WindowMinutes:         window,
            RequestCount:          requestCount,
            SuccessCount:          windowStats.SuccessCount,
            ErrorCount:            windowStats.ErrorCount,
            SuccessRate:           successRate,
            ErrorRate:             errorRate,
            P95LatencyMs:          windowStats.P95LatencyMs,
            P99LatencyMs:          windowStats.P99LatencyMs,
            HTTP2Errors:           windowStats.HTTP2Errors,
            ActiveAlerts:          activeAlerts,
            CPUUsagePercent:       systemStats.cpuUsage,
            MemoryUsedMB:          systemStats.memoryUsedMB,
            MemoryTotalMB:         systemStats.memoryTotalMB,
            MemoryUsagePercent:    systemStats.memoryUsagePercent,
            HeapAllocMB:           systemStats.heapAllocMB,
            GCPauseMs:             systemStats.gcPauseMs,
            ConcurrencyQueueDepth: queueDepth,
            UpdatedAt:             now,
        }
        if err := c.opsService.RecordMetrics(ctx, metric); err != nil {
            log.Printf("[OpsMetrics] failed to record metrics (%dm): %v", window, err)
        }
    }
}

func computeRates(successCount, errorCount int64) (float64, float64) {
    total := successCount + errorCount
    if total == 0 {
        // No traffic => no data. Rates are kept at 0 and request_count will be 0.
        // The UI should render this as N/A instead of "100% success".
        return 0, 0
    }
    successRate := float64(successCount) / float64(total) * percentScale
    errorRate := float64(errorCount) / float64(total) * percentScale
    return successRate, errorRate
}

type opsSystemStats struct {
    cpuUsage           float64
    memoryUsedMB       int64
    memoryTotalMB      int64
    memoryUsagePercent float64
    heapAllocMB        int64
    gcPauseMs          float64
}

func (c *OpsMetricsCollector) collectSystemStats(ctx context.Context) opsSystemStats {
    stats := opsSystemStats{}
    if percents, err := cpu.PercentWithContext(ctx, cpuUsageSampleInterval, false); err == nil && len(percents) > 0 {
        stats.cpuUsage = percents[0]
    }
    if vm, err := mem.VirtualMemoryWithContext(ctx); err == nil {
        stats.memoryUsedMB = int64(vm.Used / bytesPerMB)
        stats.memoryTotalMB = int64(vm.Total / bytesPerMB)
        stats.memoryUsagePercent = vm.UsedPercent
    }
    var memStats runtime.MemStats
    runtime.ReadMemStats(&memStats)
    stats.heapAllocMB = int64(memStats.HeapAlloc / bytesPerMB)
    c.lastGCPauseMu.Lock()
    if c.lastGCPauseTotal != 0 && memStats.PauseTotalNs >= c.lastGCPauseTotal {
        stats.gcPauseMs = float64(memStats.PauseTotalNs-c.lastGCPauseTotal) / float64(time.Millisecond)
    }
    c.lastGCPauseTotal = memStats.PauseTotalNs
    c.lastGCPauseMu.Unlock()
    return stats
}

func (c *OpsMetricsCollector) collectQueueDepth(ctx context.Context) int {
    if c.concurrencyService == nil {
        return 0
    }
    depth, err := c.concurrencyService.GetTotalWaitCount(ctx)
    if err != nil {
        log.Printf("[OpsMetrics] failed to get queue depth: %v", err)
        return 0
    }
    return depth
}

func (c *OpsMetricsCollector) collectActiveAlerts(ctx context.Context) int {
    if c.opsService == nil {
        return 0
    }
    count, err := c.opsService.CountActiveAlerts(ctx)
    if err != nil {
        return 0
    }
    return count
}
package service

import (
    "context"
    "database/sql"
    "errors"
    "fmt"
    "log"
    "math"
    "runtime"
    "strings"
    "sync"
    "time"

    "github.com/shirou/gopsutil/v4/disk"
)

type OpsMetrics struct {
    WindowMinutes         int       `json:"window_minutes"`
    RequestCount          int64     `json:"request_count"`
    SuccessCount          int64     `json:"success_count"`
    ErrorCount            int64     `json:"error_count"`
    SuccessRate           float64   `json:"success_rate"`
    ErrorRate             float64   `json:"error_rate"`
    P95LatencyMs          int       `json:"p95_latency_ms"`
    P99LatencyMs          int       `json:"p99_latency_ms"`
    HTTP2Errors           int       `json:"http2_errors"`
    ActiveAlerts          int       `json:"active_alerts"`
    CPUUsagePercent       float64   `json:"cpu_usage_percent"`
    MemoryUsedMB          int64     `json:"memory_used_mb"`
    MemoryTotalMB         int64     `json:"memory_total_mb"`
    MemoryUsagePercent    float64   `json:"memory_usage_percent"`
    HeapAllocMB           int64     `json:"heap_alloc_mb"`
    GCPauseMs             float64   `json:"gc_pause_ms"`
    ConcurrencyQueueDepth int       `json:"concurrency_queue_depth"`
    UpdatedAt             time.Time `json:"updated_at,omitempty"`
}

type OpsErrorLog struct {
    ID          int64     `json:"id"`
    CreatedAt   time.Time `json:"created_at"`
    Phase       string    `json:"phase"`
    Type        string    `json:"type"`
    Severity    string    `json:"severity"`
    StatusCode  int       `json:"status_code"`
    Platform    string    `json:"platform"`
    Model       string    `json:"model"`
    LatencyMs   *int      `json:"latency_ms"`
    RequestID   string    `json:"request_id"`
    Message     string    `json:"message"`
    UserID      *int64    `json:"user_id,omitempty"`
    APIKeyID    *int64    `json:"api_key_id,omitempty"`
    AccountID   *int64    `json:"account_id,omitempty"`
    GroupID     *int64    `json:"group_id,omitempty"`
    ClientIP    string    `json:"client_ip,omitempty"`
    RequestPath string    `json:"request_path,omitempty"`
    Stream      bool      `json:"stream"`
}

type OpsErrorLogFilters struct {
    StartTime *time.Time
    EndTime   *time.Time
    Platform  string
    Phase     string
    Severity  string
    Query     string
    Limit     int
}

type OpsWindowStats struct {
    SuccessCount int64
    ErrorCount   int64
    P95LatencyMs int
    P99LatencyMs int
    HTTP2Errors  int
}

type ProviderStats struct {
    Platform      string
    RequestCount  int64
    SuccessCount  int64
    ErrorCount    int64
    AvgLatencyMs  int
    P99LatencyMs  int
    Error4xxCount int64
    Error5xxCount int64
    TimeoutCount  int64
}

type ProviderHealthErrorsByType struct {
    HTTP4xx int64 `json:"4xx"`
    HTTP5xx int64 `json:"5xx"`
    Timeout int64 `json:"timeout"`
}

type ProviderHealthData struct {
    Name         string                     `json:"name"`
    RequestCount int64                      `json:"request_count"`
    SuccessRate  float64                    `json:"success_rate"`
    ErrorRate    float64                    `json:"error_rate"`
    LatencyAvg   int                        `json:"latency_avg"`
    LatencyP99   int                        `json:"latency_p99"`
    Status       string                     `json:"status"`
    ErrorsByType ProviderHealthErrorsByType `json:"errors_by_type"`
}

type LatencyHistogramItem struct {
    Range      string  `json:"range"`
    Count      int64   `json:"count"`
    Percentage float64 `json:"percentage"`
}

type ErrorDistributionItem struct {
    Code       string  `json:"code"`
    Message    string  `json:"message"`
    Count      int64   `json:"count"`
    Percentage float64 `json:"percentage"`
}

type OpsRepository interface {
    CreateErrorLog(ctx context.Context, log *OpsErrorLog) error
    // ListErrorLogsLegacy keeps the original non-paginated query API used by the
    // existing /api/v1/admin/ops/error-logs endpoint (limit is capped at 500; for
    // stable pagination use /api/v1/admin/ops/errors).
    ListErrorLogsLegacy(ctx context.Context, filters OpsErrorLogFilters) ([]OpsErrorLog, error)
    // ListErrorLogs provides a paginated error-log query API (with total count).
    ListErrorLogs(ctx context.Context, filter *ErrorLogFilter) ([]*ErrorLog, int64, error)
    GetLatestSystemMetric(ctx context.Context) (*OpsMetrics, error)
    CreateSystemMetric(ctx context.Context, metric *OpsMetrics) error
    GetWindowStats(ctx context.Context, startTime, endTime time.Time) (*OpsWindowStats, error)
    GetProviderStats(ctx context.Context, startTime, endTime time.Time) ([]*ProviderStats, error)
    GetLatencyHistogram(ctx context.Context, startTime, endTime time.Time) ([]*LatencyHistogramItem, error)
    GetErrorDistribution(ctx context.Context, startTime, endTime time.Time) ([]*ErrorDistributionItem, error)
    ListRecentSystemMetrics(ctx context.Context, windowMinutes, limit int) ([]OpsMetrics, error)
    ListSystemMetricsRange(ctx context.Context, windowMinutes int, startTime, endTime time.Time, limit int) ([]OpsMetrics, error)
    ListAlertRules(ctx context.Context) ([]OpsAlertRule, error)
    GetActiveAlertEvent(ctx context.Context, ruleID int64) (*OpsAlertEvent, error)
    GetLatestAlertEvent(ctx context.Context, ruleID int64) (*OpsAlertEvent, error)
    CreateAlertEvent(ctx context.Context, event *OpsAlertEvent) error
    UpdateAlertEventStatus(ctx context.Context, eventID int64, status string, resolvedAt *time.Time) error
    UpdateAlertEventNotifications(ctx context.Context, eventID int64, emailSent, webhookSent bool) error
    CountActiveAlerts(ctx context.Context) (int, error)
    GetOverviewStats(ctx context.Context, startTime, endTime time.Time) (*OverviewStats, error)
    // Redis-backed cache/health (best-effort; implementation lives in repository layer).
    GetCachedLatestSystemMetric(ctx context.Context) (*OpsMetrics, error)
    SetCachedLatestSystemMetric(ctx context.Context, metric *OpsMetrics) error
    GetCachedDashboardOverview(ctx context.Context, timeRange string) (*DashboardOverviewData, error)
    SetCachedDashboardOverview(ctx context.Context, timeRange string, data *DashboardOverviewData, ttl time.Duration) error
    PingRedis(ctx context.Context) error
}
type OpsService struct {
    repo             OpsRepository
    sqlDB            *sql.DB
    redisNilWarnOnce sync.Once
    dbNilWarnOnce    sync.Once
}

const opsDBQueryTimeout = 5 * time.Second

func NewOpsService(repo OpsRepository, sqlDB *sql.DB) *OpsService {
    svc := &OpsService{repo: repo, sqlDB: sqlDB}
    // Best-effort startup health checks: log warnings if Redis/DB is unavailable,
    // but never fail service startup (graceful degradation).
    log.Printf("[OpsService] Performing startup health checks...")
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    redisStatus := svc.checkRedisHealth(ctx)
    dbStatus := svc.checkDatabaseHealth(ctx)
    log.Printf("[OpsService] Startup health check complete: Redis=%s, Database=%s", redisStatus, dbStatus)
    if redisStatus == "critical" || dbStatus == "critical" {
        log.Printf("[OpsService][WARN] Service starting with degraded dependencies - some features may be unavailable")
    }
    return svc
}

func (s *OpsService) RecordError(ctx context.Context, log *OpsErrorLog) error {
    if log == nil {
        return nil
    }
    if log.CreatedAt.IsZero() {
        log.CreatedAt = time.Now()
    }
    if log.Severity == "" {
        log.Severity = "P2"
    }
    if log.Phase == "" {
        log.Phase = "internal"
    }
    if log.Type == "" {
        log.Type = "unknown_error"
    }
    if log.Message == "" {
        log.Message = "Unknown error"
    }
    ctxDB, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
    defer cancel()
    return s.repo.CreateErrorLog(ctxDB, log)
}

func (s *OpsService) RecordMetrics(ctx context.Context, metric *OpsMetrics) error {
    if metric == nil {
        return nil
    }
    if metric.UpdatedAt.IsZero() {
        metric.UpdatedAt = time.Now()
    }
    ctxDB, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
    defer cancel()
    if err := s.repo.CreateSystemMetric(ctxDB, metric); err != nil {
        return err
    }
    // Latest metrics snapshot is queried frequently by the ops dashboard; keep a short-lived cache
    // to avoid unnecessary DB pressure. Only cache the default (1-minute) window metrics.
    windowMinutes := metric.WindowMinutes
    if windowMinutes == 0 {
        windowMinutes = 1
    }
    if windowMinutes == 1 {
        if repo := s.repo; repo != nil {
            _ = repo.SetCachedLatestSystemMetric(ctx, metric)
        }
    }
    return nil
}

func (s *OpsService) ListErrorLogs(ctx context.Context, filters OpsErrorLogFilters) ([]OpsErrorLog, int, error) {
    ctxDB, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
    defer cancel()
    logs, err := s.repo.ListErrorLogsLegacy(ctxDB, filters)
    if err != nil {
        return nil, 0, err
    }
    return logs, len(logs), nil
}

func (s *OpsService) GetWindowStats(ctx context.Context, startTime, endTime time.Time) (*OpsWindowStats, error) {
    ctxDB, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
    defer cancel()
    return s.repo.GetWindowStats(ctxDB, startTime, endTime)
}

func (s *OpsService) GetLatestMetrics(ctx context.Context) (*OpsMetrics, error) {
    // Cache first (best-effort): cache errors should not break the dashboard.
    if s != nil {
        if repo := s.repo; repo != nil {
            if cached, err := repo.GetCachedLatestSystemMetric(ctx); err == nil && cached != nil {
                if cached.WindowMinutes == 0 {
                    cached.WindowMinutes = 1
                }
                return cached, nil
            }
        }
    }
    ctxDB, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
    defer cancel()
    metric, err := s.repo.GetLatestSystemMetric(ctxDB)
    if err != nil {
        if errors.Is(err, sql.ErrNoRows) {
            return &OpsMetrics{WindowMinutes: 1}, nil
        }
        return nil, err
    }
    if metric == nil {
        return &OpsMetrics{WindowMinutes: 1}, nil
    }
    if metric.WindowMinutes == 0 {
        metric.WindowMinutes = 1
    }
    // Backfill cache (best-effort).
    if s != nil {
        if repo := s.repo; repo != nil {
            _ = repo.SetCachedLatestSystemMetric(ctx, metric)
        }
    }
    return metric, nil
}
func (s *OpsService) ListMetricsHistory(ctx context.Context, windowMinutes int, startTime, endTime time.Time, limit int) ([]OpsMetrics, error) {
    if s == nil || s.repo == nil {
        return nil, nil
    }
    if windowMinutes <= 0 {
        windowMinutes = 1
    }
    if limit <= 0 || limit > 5000 {
        limit = 300
    }
    if endTime.IsZero() {
        endTime = time.Now()
    }
    if startTime.IsZero() {
        startTime = endTime.Add(-time.Duration(limit) * opsMetricsInterval)
    }
    if startTime.After(endTime) {
        startTime, endTime = endTime, startTime
    }
    ctxDB, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
    defer cancel()
    return s.repo.ListSystemMetricsRange(ctxDB, windowMinutes, startTime, endTime, limit)
}

// DashboardOverviewData represents aggregated metrics for the ops dashboard overview.
type DashboardOverviewData struct {
    Timestamp    time.Time        `json:"timestamp"`
    HealthScore  int              `json:"health_score"`
    SLA          SLAData          `json:"sla"`
    QPS          QPSData          `json:"qps"`
    TPS          TPSData          `json:"tps"`
    Latency      LatencyData      `json:"latency"`
    Errors       ErrorData        `json:"errors"`
    Resources    ResourceData     `json:"resources"`
    SystemStatus SystemStatusData `json:"system_status"`
}

type SLAData struct {
    Current   float64 `json:"current"`
    Threshold float64 `json:"threshold"`
    Status    string  `json:"status"`
    Trend     string  `json:"trend"`
    Change24h float64 `json:"change_24h"`
}

type QPSData struct {
    Current           float64 `json:"current"`
    Peak1h            float64 `json:"peak_1h"`
    Avg1h             float64 `json:"avg_1h"`
    ChangeVsYesterday float64 `json:"change_vs_yesterday"`
}

type TPSData struct {
    Current float64 `json:"current"`
    Peak1h  float64 `json:"peak_1h"`
    Avg1h   float64 `json:"avg_1h"`
}

type LatencyData struct {
    P50          int    `json:"p50"`
    P95          int    `json:"p95"`
    P99          int    `json:"p99"`
    P999         int    `json:"p999"`
    Avg          int    `json:"avg"`
    Max          int    `json:"max"`
    ThresholdP99 int    `json:"threshold_p99"`
    Status       string `json:"status"`
}

type ErrorData struct {
    TotalCount   int64     `json:"total_count"`
    ErrorRate    float64   `json:"error_rate"`
    Count4xx     int64     `json:"4xx_count"`
    Count5xx     int64     `json:"5xx_count"`
    TimeoutCount int64     `json:"timeout_count"`
    TopError     *TopError `json:"top_error,omitempty"`
}

type TopError struct {
    Code    string `json:"code"`
    Message string `json:"message"`
    Count   int64  `json:"count"`
}

type ResourceData struct {
    CPUUsage      float64           `json:"cpu_usage"`
    MemoryUsage   float64           `json:"memory_usage"`
    DiskUsage     float64           `json:"disk_usage"`
    Goroutines    int               `json:"goroutines"`
    DBConnections DBConnectionsData `json:"db_connections"`
}

type DBConnectionsData struct {
    Active  int `json:"active"`
    Idle    int `json:"idle"`
    Waiting int `json:"waiting"`
    Max     int `json:"max"`
}

type SystemStatusData struct {
    Redis          string `json:"redis"`
    Database       string `json:"database"`
    BackgroundJobs string `json:"background_jobs"`
}

type OverviewStats struct {
    RequestCount          int64
    SuccessCount          int64
    ErrorCount            int64
    Error4xxCount         int64
    Error5xxCount         int64
    TimeoutCount          int64
    LatencyP50            int
    LatencyP95            int
    LatencyP99            int
    LatencyP999           int
    LatencyAvg            int
    LatencyMax            int
    TopErrorCode          string
    TopErrorMsg           string
    TopErrorCount         int64
    CPUUsage              float64
    MemoryUsage           float64
    MemoryUsedMB          int64
    MemoryTotalMB         int64
    ConcurrencyQueueDepth int
}
func (s *OpsService) GetDashboardOverview(ctx context.Context, timeRange string) (*DashboardOverviewData, error) {
    if s == nil {
        return nil, errors.New("ops service not initialized")
    }
    repo := s.repo
    if repo == nil {
        return nil, errors.New("ops repository not initialized")
    }
    if s.sqlDB == nil {
        return nil, errors.New("ops service not initialized")
    }
    if strings.TrimSpace(timeRange) == "" {
        timeRange = "1h"
    }
    duration, err := parseTimeRange(timeRange)
    if err != nil {
        return nil, err
    }
    if cached, err := repo.GetCachedDashboardOverview(ctx, timeRange); err == nil && cached != nil {
        return cached, nil
    }
    now := time.Now().UTC()
    startTime := now.Add(-duration)
    ctxStats, cancelStats := context.WithTimeout(ctx, opsDBQueryTimeout)
    stats, err := repo.GetOverviewStats(ctxStats, startTime, now)
    cancelStats()
    if err != nil {
        return nil, fmt.Errorf("get overview stats: %w", err)
    }
    if stats == nil {
        return nil, errors.New("get overview stats returned nil")
    }
    var statsYesterday *OverviewStats
    {
        yesterdayEnd := now.Add(-24 * time.Hour)
        yesterdayStart := yesterdayEnd.Add(-duration)
        ctxYesterday, cancelYesterday := context.WithTimeout(ctx, opsDBQueryTimeout)
        ys, err := repo.GetOverviewStats(ctxYesterday, yesterdayStart, yesterdayEnd)
        cancelYesterday()
        if err != nil {
            // Best-effort: overview should still work when historical comparison fails.
            log.Printf("[OpsOverview] get yesterday overview stats failed: %v", err)
        } else {
            statsYesterday = ys
        }
    }
    totalReqs := stats.SuccessCount + stats.ErrorCount
    successRate, errorRate := calculateRates(stats.SuccessCount, stats.ErrorCount, totalReqs)
    successRateYesterday := 0.0
    totalReqsYesterday := int64(0)
    if statsYesterday != nil {
        totalReqsYesterday = statsYesterday.SuccessCount + statsYesterday.ErrorCount
        successRateYesterday, _ = calculateRates(statsYesterday.SuccessCount, statsYesterday.ErrorCount, totalReqsYesterday)
    }
    slaThreshold := 99.9
    slaChange24h := roundTo2DP(successRate - successRateYesterday)
    slaTrend := classifyTrend(slaChange24h, 0.05)
    slaStatus := classifySLAStatus(successRate, slaThreshold)
    latencyThresholdP99 := 1000
    latencyStatus := classifyLatencyStatus(stats.LatencyP99, latencyThresholdP99)
    qpsCurrent := 0.0
    {
        ctxWindow, cancelWindow := context.WithTimeout(ctx, opsDBQueryTimeout)
        windowStats, err := repo.GetWindowStats(ctxWindow, now.Add(-1*time.Minute), now)
        cancelWindow()
        if err == nil && windowStats != nil {
            qpsCurrent = roundTo1DP(float64(windowStats.SuccessCount+windowStats.ErrorCount) / 60)
        } else if err != nil {
            log.Printf("[OpsOverview] get realtime qps failed: %v", err)
        }
    }
    qpsAvg := roundTo1DP(safeDivide(float64(totalReqs), duration.Seconds()))
    qpsPeak := qpsAvg
    {
        limit := int(duration.Minutes()) + 5
        if limit < 10 {
            limit = 10
        }
        if limit > 5000 {
            limit = 5000
        }
        ctxMetrics, cancelMetrics := context.WithTimeout(ctx, opsDBQueryTimeout)
        items, err := repo.ListSystemMetricsRange(ctxMetrics, 1, startTime, now, limit)
        cancelMetrics()
        if err != nil {
            log.Printf("[OpsOverview] get metrics range for peak qps failed: %v", err)
        } else {
            maxQPS := 0.0
            for _, item := range items {
                v := float64(item.RequestCount) / 60
                if v > maxQPS {
                    maxQPS = v
                }
            }
            if maxQPS > 0 {
                qpsPeak = roundTo1DP(maxQPS)
            }
        }
    }
    qpsAvgYesterday := 0.0
    if duration.Seconds() > 0 && totalReqsYesterday > 0 {
        qpsAvgYesterday = float64(totalReqsYesterday) / duration.Seconds()
    }
    qpsChangeVsYesterday := roundTo1DP(percentChange(qpsAvgYesterday, float64(totalReqs)/duration.Seconds()))
    tpsCurrent, tpsPeak, tpsAvg := 0.0, 0.0, 0.0
    if current, peak, avg, err := s.getTokenTPS(ctx, now, startTime, duration); err != nil {
        log.Printf("[OpsOverview] get token tps failed: %v", err)
    } else {
        tpsCurrent, tpsPeak, tpsAvg = roundTo1DP(current), roundTo1DP(peak), roundTo1DP(avg)
    }
    diskUsage := 0.0
    if v, err := getDiskUsagePercent(ctx, "/"); err != nil {
        log.Printf("[OpsOverview] get disk usage failed: %v", err)
    } else {
        diskUsage = roundTo1DP(v)
    }
    redisStatus := s.checkRedisHealth(ctx)
    dbStatus := s.checkDatabaseHealth(ctx)
    healthScore := calculateHealthScore(successRate, stats.LatencyP99, errorRate, redisStatus, dbStatus)
    data := &DashboardOverviewData{
        Timestamp:   now,
        HealthScore: healthScore,
        SLA: SLAData{
            Current:   successRate,
            Threshold: slaThreshold,
            Status:    slaStatus,
            Trend:     slaTrend,
            Change24h: slaChange24h,
        },
        QPS: QPSData{
            Current:           qpsCurrent,
            Peak1h:            qpsPeak,
            Avg1h:             qpsAvg,
            ChangeVsYesterday: qpsChangeVsYesterday,
        },
        TPS: TPSData{
            Current: tpsCurrent,
            Peak1h:  tpsPeak,
            Avg1h:   tpsAvg,
        },
        Latency: LatencyData{
            P50:          stats.LatencyP50,
            P95:          stats.LatencyP95,
            P99:          stats.LatencyP99,
            P999:         stats.LatencyP999,
            Avg:          stats.LatencyAvg,
            Max:          stats.LatencyMax,
            ThresholdP99: latencyThresholdP99,
            Status:       latencyStatus,
        },
        Errors: ErrorData{
            TotalCount:   stats.ErrorCount,
            ErrorRate:    errorRate,
            Count4xx:     stats.Error4xxCount,
            Count5xx:     stats.Error5xxCount,
            TimeoutCount: stats.TimeoutCount,
        },
        Resources: ResourceData{
            CPUUsage:      roundTo1DP(stats.CPUUsage),
            MemoryUsage:   roundTo1DP(stats.MemoryUsage),
            DiskUsage:     diskUsage,
            Goroutines:    runtime.NumGoroutine(),
            DBConnections: s.getDBConnections(),
        },
        SystemStatus: SystemStatusData{
            Redis:          redisStatus,
            Database:       dbStatus,
            BackgroundJobs: "healthy",
        },
    }
    if stats.TopErrorCount > 0 {
        data.Errors.TopError = &TopError{
            Code:    stats.TopErrorCode,
            Message: stats.TopErrorMsg,
            Count:   stats.TopErrorCount,
        }
    }
    _ = repo.SetCachedDashboardOverview(ctx, timeRange, data, 10*time.Second)
    return data, nil
}
func (s *OpsService) GetProviderHealth(ctx context.Context, timeRange string) ([]*ProviderHealthData, error) {
if s == nil || s.repo == nil {
return nil, nil
}
if strings.TrimSpace(timeRange) == "" {
timeRange = "1h"
}
window, err := parseTimeRange(timeRange)
if err != nil {
return nil, err
}
endTime := time.Now()
startTime := endTime.Add(-window)
ctxDB, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
stats, err := s.repo.GetProviderStats(ctxDB, startTime, endTime)
cancel()
if err != nil {
return nil, err
}
results := make([]*ProviderHealthData, 0, len(stats))
for _, item := range stats {
if item == nil {
continue
}
successRate, errorRate := calculateRates(item.SuccessCount, item.ErrorCount, item.RequestCount)
results = append(results, &ProviderHealthData{
Name: formatPlatformName(item.Platform),
RequestCount: item.RequestCount,
SuccessRate: successRate,
ErrorRate: errorRate,
LatencyAvg: item.AvgLatencyMs,
LatencyP99: item.P99LatencyMs,
Status: classifyProviderStatus(successRate, item.P99LatencyMs, item.TimeoutCount, item.RequestCount),
ErrorsByType: ProviderHealthErrorsByType{
HTTP4xx: item.Error4xxCount,
HTTP5xx: item.Error5xxCount,
Timeout: item.TimeoutCount,
},
})
}
return results, nil
}
func (s *OpsService) GetLatencyHistogram(ctx context.Context, timeRange string) ([]*LatencyHistogramItem, error) {
if s == nil || s.repo == nil {
return nil, nil
}
duration, err := parseTimeRange(timeRange)
if err != nil {
return nil, err
}
endTime := time.Now()
startTime := endTime.Add(-duration)
ctxDB, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
defer cancel()
return s.repo.GetLatencyHistogram(ctxDB, startTime, endTime)
}
func (s *OpsService) GetErrorDistribution(ctx context.Context, timeRange string) ([]*ErrorDistributionItem, error) {
if s == nil || s.repo == nil {
return nil, nil
}
duration, err := parseTimeRange(timeRange)
if err != nil {
return nil, err
}
endTime := time.Now()
startTime := endTime.Add(-duration)
ctxDB, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
defer cancel()
return s.repo.GetErrorDistribution(ctxDB, startTime, endTime)
}
func parseTimeRange(timeRange string) (time.Duration, error) {
value := strings.TrimSpace(timeRange)
if value == "" {
return 0, errors.New("invalid time range")
}
// Support "7d" style day ranges for convenience.
if strings.HasSuffix(value, "d") {
numberPart := strings.TrimSuffix(value, "d")
if numberPart == "" {
return 0, errors.New("invalid time range")
}
days := 0
for _, ch := range numberPart {
if ch < '0' || ch > '9' {
return 0, errors.New("invalid time range")
}
days = days*10 + int(ch-'0')
}
if days <= 0 {
return 0, errors.New("invalid time range")
}
dur := time.Duration(days) * 24 * time.Hour
// Apply the same 30-day cap as duration-style ranges.
const maxWindow = 30 * 24 * time.Hour
if dur > maxWindow {
dur = maxWindow
}
return dur, nil
}
dur, err := time.ParseDuration(value)
if err != nil || dur <= 0 {
return 0, errors.New("invalid time range")
}
// Cap to avoid unbounded queries.
const maxWindow = 30 * 24 * time.Hour
if dur > maxWindow {
dur = maxWindow
}
return dur, nil
}
func calculateHealthScore(successRate float64, p99Latency int, errorRate float64, redisStatus, dbStatus string) int {
score := 100.0
// SLA impact (max -45 points)
if successRate < 99.9 {
score -= math.Min(45, (99.9-successRate)*12)
}
// Latency impact (max -35 points)
if p99Latency > 1000 {
score -= math.Min(35, float64(p99Latency-1000)/80)
}
// Error rate impact (max -20 points)
if errorRate > 0.1 {
score -= math.Min(20, (errorRate-0.1)*60)
}
// Infra status impact
if redisStatus != "healthy" {
score -= 15
}
if dbStatus != "healthy" {
score -= 20
}
if score < 0 {
score = 0
}
if score > 100 {
score = 100
}
return int(math.Round(score))
}
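As a worked example of the scoring above (function body copied verbatim): a window with 99.0% success, 1500ms P99, 0.5% errors and healthy infra loses 10.8 SLA points, 6.25 latency points, and the full 20-point error-rate cap, rounding to 63:

```go
package main

import (
	"fmt"
	"math"
)

// calculateHealthScore is copied from the service above: it starts at 100
// and subtracts capped penalties for SLA, latency, error rate, and infra.
func calculateHealthScore(successRate float64, p99Latency int, errorRate float64, redisStatus, dbStatus string) int {
	score := 100.0
	if successRate < 99.9 {
		score -= math.Min(45, (99.9-successRate)*12)
	}
	if p99Latency > 1000 {
		score -= math.Min(35, float64(p99Latency-1000)/80)
	}
	if errorRate > 0.1 {
		score -= math.Min(20, (errorRate-0.1)*60)
	}
	if redisStatus != "healthy" {
		score -= 15
	}
	if dbStatus != "healthy" {
		score -= 20
	}
	if score < 0 {
		score = 0
	}
	if score > 100 {
		score = 100
	}
	return int(math.Round(score))
}

func main() {
	// 100 - 10.8 (SLA) - 6.25 (latency) - 20 (error-rate cap) = 62.95 → 63
	got := calculateHealthScore(99.0, 1500, 0.5, "healthy", "healthy")
	fmt.Println(got)
	if got != 63 {
		panic("unexpected score")
	}
	// A fully healthy window keeps the perfect score.
	if calculateHealthScore(100, 200, 0, "healthy", "healthy") != 100 {
		panic("expected 100")
	}
}
```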
func calculateRates(successCount, errorCount, requestCount int64) (successRate float64, errorRate float64) {
if requestCount <= 0 {
return 0, 0
}
successRate = (float64(successCount) / float64(requestCount)) * 100
errorRate = (float64(errorCount) / float64(requestCount)) * 100
return roundTo2DP(successRate), roundTo2DP(errorRate)
}
func roundTo2DP(v float64) float64 {
return math.Round(v*100) / 100
}
func roundTo1DP(v float64) float64 {
return math.Round(v*10) / 10
}
func safeDivide(numerator float64, denominator float64) float64 {
if denominator <= 0 {
return 0
}
return numerator / denominator
}
func percentChange(previous float64, current float64) float64 {
if previous == 0 {
if current > 0 {
return 100.0
}
return 0
}
return (current - previous) / previous * 100
}
func classifyTrend(delta float64, deadband float64) string {
if delta > deadband {
return "up"
}
if delta < -deadband {
return "down"
}
return "stable"
}
func classifySLAStatus(successRate float64, threshold float64) string {
if successRate >= threshold {
return "healthy"
}
if successRate >= threshold-0.5 {
return "warning"
}
return "critical"
}
func classifyLatencyStatus(p99LatencyMs int, thresholdP99 int) string {
if thresholdP99 <= 0 {
return "healthy"
}
if p99LatencyMs <= thresholdP99 {
return "healthy"
}
if p99LatencyMs <= thresholdP99*2 {
return "warning"
}
return "critical"
}
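The small helpers above compose: percentChange feeds classifyTrend, while the SLA classifier buckets a success rate against the threshold with a 0.5-point warning band. A quick sketch (bodies copied from above):

```go
package main

import "fmt"

// percentChange and the classifiers below are copied from the service code.
func percentChange(previous, current float64) float64 {
	if previous == 0 {
		if current > 0 {
			return 100.0
		}
		return 0
	}
	return (current - previous) / previous * 100
}

func classifyTrend(delta, deadband float64) string {
	if delta > deadband {
		return "up"
	}
	if delta < -deadband {
		return "down"
	}
	return "stable"
}

func classifySLAStatus(successRate, threshold float64) string {
	if successRate >= threshold {
		return "healthy"
	}
	if successRate >= threshold-0.5 {
		return "warning"
	}
	return "critical"
}

func main() {
	// 100 → 103 requests is a +3% change; with a 2% deadband that's "up".
	delta := percentChange(100, 103)
	fmt.Println(delta, classifyTrend(delta, 2))
	if classifyTrend(delta, 2) != "up" {
		panic("expected up")
	}
	// 99.6% against a 99.9% SLA falls in the 0.5-point warning band.
	if classifySLAStatus(99.6, 99.9) != "warning" {
		panic("expected warning")
	}
	// A +1% move inside the deadband reads as stable.
	if classifyTrend(percentChange(100, 101), 2) != "stable" {
		panic("expected stable")
	}
}
```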
func getDiskUsagePercent(ctx context.Context, path string) (float64, error) {
usage, err := disk.UsageWithContext(ctx, path)
if err != nil {
return 0, err
}
if usage == nil {
return 0, nil
}
return usage.UsedPercent, nil
}
func (s *OpsService) checkRedisHealth(ctx context.Context) string {
if s == nil {
log.Printf("[OpsOverview][WARN] ops service is nil; redis health check skipped")
return "critical"
}
if s.repo == nil {
s.redisNilWarnOnce.Do(func() {
log.Printf("[OpsOverview][WARN] ops repository is nil; redis health check skipped")
})
return "critical"
}
ctxPing, cancel := context.WithTimeout(ctx, 800*time.Millisecond)
defer cancel()
if err := s.repo.PingRedis(ctxPing); err != nil {
log.Printf("[OpsOverview][WARN] redis ping failed: %v", err)
return "critical"
}
return "healthy"
}
func (s *OpsService) checkDatabaseHealth(ctx context.Context) string {
if s == nil {
log.Printf("[OpsOverview][WARN] ops service is nil; db health check skipped")
return "critical"
}
if s.sqlDB == nil {
s.dbNilWarnOnce.Do(func() {
log.Printf("[OpsOverview][WARN] database is nil; db health check skipped")
})
return "critical"
}
ctxPing, cancel := context.WithTimeout(ctx, 800*time.Millisecond)
defer cancel()
if err := s.sqlDB.PingContext(ctxPing); err != nil {
log.Printf("[OpsOverview][WARN] db ping failed: %v", err)
return "critical"
}
return "healthy"
}
func (s *OpsService) getDBConnections() DBConnectionsData {
if s == nil || s.sqlDB == nil {
return DBConnectionsData{}
}
stats := s.sqlDB.Stats()
maxOpen := stats.MaxOpenConnections
if maxOpen < 0 {
maxOpen = 0
}
return DBConnectionsData{
Active: stats.InUse,
Idle: stats.Idle,
Waiting: 0, // database/sql exposes cumulative WaitCount, not a live waiting gauge
Max: maxOpen,
}
}
func (s *OpsService) getTokenTPS(ctx context.Context, endTime time.Time, startTime time.Time, duration time.Duration) (current float64, peak float64, avg float64, err error) {
if s == nil || s.sqlDB == nil {
return 0, 0, 0, nil
}
if duration <= 0 {
return 0, 0, 0, nil
}
// Current TPS: last 1 minute.
var tokensLastMinute int64
{
lastMinuteStart := endTime.Add(-1 * time.Minute)
ctxQuery, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
row := s.sqlDB.QueryRowContext(ctxQuery, `
SELECT COALESCE(SUM(input_tokens + output_tokens), 0)
FROM usage_logs
WHERE created_at >= $1 AND created_at < $2
`, lastMinuteStart, endTime)
scanErr := row.Scan(&tokensLastMinute)
cancel()
if scanErr != nil {
return 0, 0, 0, scanErr
}
}
var totalTokens int64
var maxTokensPerMinute int64
{
ctxQuery, cancel := context.WithTimeout(ctx, opsDBQueryTimeout)
row := s.sqlDB.QueryRowContext(ctxQuery, `
WITH buckets AS (
SELECT
date_trunc('minute', created_at) AS bucket,
SUM(input_tokens + output_tokens) AS tokens
FROM usage_logs
WHERE created_at >= $1 AND created_at < $2
GROUP BY 1
)
SELECT
COALESCE(SUM(tokens), 0) AS total_tokens,
COALESCE(MAX(tokens), 0) AS max_tokens_per_minute
FROM buckets
`, startTime, endTime)
scanErr := row.Scan(&totalTokens, &maxTokensPerMinute)
cancel()
if scanErr != nil {
return 0, 0, 0, scanErr
}
}
current = safeDivide(float64(tokensLastMinute), 60)
peak = safeDivide(float64(maxTokensPerMinute), 60)
avg = safeDivide(float64(totalTokens), duration.Seconds())
return current, peak, avg, nil
}
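The TPS math above reduces to three divisions guarded by safeDivide: last-minute tokens over 60 seconds for current, the busiest minute bucket over 60 seconds for peak, and the window total over the window length in seconds for average. A sketch with hypothetical token counts standing in for the query results:

```go
package main

import (
	"fmt"
	"time"
)

// safeDivide is copied from the service above.
func safeDivide(numerator, denominator float64) float64 {
	if denominator <= 0 {
		return 0
	}
	return numerator / denominator
}

func main() {
	// Hypothetical query results: 12,000 tokens in the last minute,
	// a peak minute bucket of 30,000, and 600,000 tokens over a 1h window.
	tokensLastMinute := int64(12000)
	maxTokensPerMinute := int64(30000)
	totalTokens := int64(600000)
	window := time.Hour

	current := safeDivide(float64(tokensLastMinute), 60)
	peak := safeDivide(float64(maxTokensPerMinute), 60)
	avg := safeDivide(float64(totalTokens), window.Seconds())

	fmt.Printf("current=%.1f peak=%.1f avg=%.1f\n", current, peak, avg)
	if current != 200 || peak != 500 {
		panic("unexpected tps")
	}
	// A zero-length window must not divide by zero.
	if safeDivide(100, 0) != 0 {
		panic("expected 0")
	}
}
```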
func formatPlatformName(platform string) string {
switch strings.ToLower(strings.TrimSpace(platform)) {
case PlatformOpenAI:
return "OpenAI"
case PlatformAnthropic:
return "Anthropic"
case PlatformGemini:
return "Gemini"
case PlatformAntigravity:
return "Antigravity"
default:
if platform == "" {
return "Unknown"
}
if len(platform) == 1 {
return strings.ToUpper(platform)
}
return strings.ToUpper(platform[:1]) + platform[1:]
}
}
func classifyProviderStatus(successRate float64, p99LatencyMs int, timeoutCount int64, requestCount int64) string {
if requestCount <= 0 {
return "healthy"
}
if successRate < 98 {
return "critical"
}
if successRate < 99.5 {
return "warning"
}
// Heavy timeout volume should be highlighted even if the overall success rate is okay.
if timeoutCount >= 10 && requestCount >= 100 {
return "warning"
}
if p99LatencyMs > 0 && p99LatencyMs >= 5000 {
return "warning"
}
return "healthy"
}
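The provider classifier above checks success-rate bands first, then surfaces timeout-heavy and slow-but-successful providers. A worked sketch (body copied from above):

```go
package main

import "fmt"

// classifyProviderStatus is copied from the service above.
func classifyProviderStatus(successRate float64, p99LatencyMs int, timeoutCount, requestCount int64) string {
	if requestCount <= 0 {
		return "healthy"
	}
	if successRate < 98 {
		return "critical"
	}
	if successRate < 99.5 {
		return "warning"
	}
	if timeoutCount >= 10 && requestCount >= 100 {
		return "warning"
	}
	if p99LatencyMs > 0 && p99LatencyMs >= 5000 {
		return "warning"
	}
	return "healthy"
}

func main() {
	// No traffic is treated as healthy rather than unknown.
	fmt.Println(classifyProviderStatus(100, 0, 0, 0))
	if classifyProviderStatus(100, 0, 0, 0) != "healthy" {
		panic("no traffic")
	}
	// 99.0% success is below the 99.5% bar but above the 98% critical line.
	if classifyProviderStatus(99.0, 800, 0, 1000) != "warning" {
		panic("success-rate band")
	}
	// High timeout volume flags a warning even at a perfect success rate.
	if classifyProviderStatus(100, 800, 25, 1000) != "warning" {
		panic("timeouts")
	}
	// Slow-but-successful providers surface via the 5000ms P99 check.
	if classifyProviderStatus(100, 6000, 0, 500) != "warning" {
		panic("latency")
	}
}
```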
@@ -61,9 +61,9 @@ func (s *SettingService) GetPublicSettings(ctx context.Context) (*PublicSettings
SettingKeySiteName,
SettingKeySiteLogo,
SettingKeySiteSubtitle,
SettingKeyApiBaseUrl,
SettingKeyAPIBaseURL,
SettingKeyContactInfo,
SettingKeyDocUrl,
SettingKeyDocURL,
}
settings, err := s.settingRepo.GetMultiple(ctx, keys)
@@ -79,9 +79,9 @@ func (s *SettingService) GetPublicSettings(ctx context.Context) (*PublicSettings
SiteName: s.getStringOrDefault(settings, SettingKeySiteName, "Sub2API"),
SiteLogo: settings[SettingKeySiteLogo],
SiteSubtitle: s.getStringOrDefault(settings, SettingKeySiteSubtitle, "Subscription to API Conversion Platform"),
ApiBaseUrl: settings[SettingKeyApiBaseUrl],
APIBaseURL: settings[SettingKeyAPIBaseURL],
ContactInfo: settings[SettingKeyContactInfo],
DocUrl: settings[SettingKeyDocUrl],
DocURL: settings[SettingKeyDocURL],
}, nil
}
@@ -94,15 +94,15 @@ func (s *SettingService) UpdateSettings(ctx context.Context, settings *SystemSet
updates[SettingKeyEmailVerifyEnabled] = strconv.FormatBool(settings.EmailVerifyEnabled)
// Email service settings (password is only updated when non-empty)
updates[SettingKeySmtpHost] = settings.SmtpHost
updates[SettingKeySmtpPort] = strconv.Itoa(settings.SmtpPort)
updates[SettingKeySmtpUsername] = settings.SmtpUsername
if settings.SmtpPassword != "" {
updates[SettingKeySmtpPassword] = settings.SmtpPassword
updates[SettingKeySMTPHost] = settings.SMTPHost
updates[SettingKeySMTPPort] = strconv.Itoa(settings.SMTPPort)
updates[SettingKeySMTPUsername] = settings.SMTPUsername
if settings.SMTPPassword != "" {
updates[SettingKeySMTPPassword] = settings.SMTPPassword
}
updates[SettingKeySmtpFrom] = settings.SmtpFrom
updates[SettingKeySmtpFromName] = settings.SmtpFromName
updates[SettingKeySmtpUseTLS] = strconv.FormatBool(settings.SmtpUseTLS)
updates[SettingKeySMTPFrom] = settings.SMTPFrom
updates[SettingKeySMTPFromName] = settings.SMTPFromName
updates[SettingKeySMTPUseTLS] = strconv.FormatBool(settings.SMTPUseTLS)
// Cloudflare Turnstile settings (secret is only updated when non-empty)
updates[SettingKeyTurnstileEnabled] = strconv.FormatBool(settings.TurnstileEnabled)
@@ -115,9 +115,9 @@ func (s *SettingService) UpdateSettings(ctx context.Context, settings *SystemSet
updates[SettingKeySiteName] = settings.SiteName
updates[SettingKeySiteLogo] = settings.SiteLogo
updates[SettingKeySiteSubtitle] = settings.SiteSubtitle
updates[SettingKeyApiBaseUrl] = settings.ApiBaseUrl
updates[SettingKeyAPIBaseURL] = settings.APIBaseURL
updates[SettingKeyContactInfo] = settings.ContactInfo
updates[SettingKeyDocUrl] = settings.DocUrl
updates[SettingKeyDocURL] = settings.DocURL
// Default configuration
updates[SettingKeyDefaultConcurrency] = strconv.Itoa(settings.DefaultConcurrency)
@@ -198,8 +198,8 @@ func (s *SettingService) InitializeDefaultSettings(ctx context.Context) error {
SettingKeySiteLogo: "",
SettingKeyDefaultConcurrency: strconv.Itoa(s.cfg.Default.UserConcurrency),
SettingKeyDefaultBalance: strconv.FormatFloat(s.cfg.Default.UserBalance, 'f', 8, 64),
SettingKeySmtpPort: "587",
SettingKeySmtpUseTLS: "false",
SettingKeySMTPPort: "587",
SettingKeySMTPUseTLS: "false",
}
return s.settingRepo.SetMultiple(ctx, defaults)
@@ -210,26 +210,26 @@ func (s *SettingService) parseSettings(settings map[string]string) *SystemSettin
result := &SystemSettings{
RegistrationEnabled: settings[SettingKeyRegistrationEnabled] == "true",
EmailVerifyEnabled: settings[SettingKeyEmailVerifyEnabled] == "true",
SmtpHost: settings[SettingKeySmtpHost],
SmtpUsername: settings[SettingKeySmtpUsername],
SmtpFrom: settings[SettingKeySmtpFrom],
SmtpFromName: settings[SettingKeySmtpFromName],
SmtpUseTLS: settings[SettingKeySmtpUseTLS] == "true",
SMTPHost: settings[SettingKeySMTPHost],
SMTPUsername: settings[SettingKeySMTPUsername],
SMTPFrom: settings[SettingKeySMTPFrom],
SMTPFromName: settings[SettingKeySMTPFromName],
SMTPUseTLS: settings[SettingKeySMTPUseTLS] == "true",
TurnstileEnabled: settings[SettingKeyTurnstileEnabled] == "true",
TurnstileSiteKey: settings[SettingKeyTurnstileSiteKey],
SiteName: s.getStringOrDefault(settings, SettingKeySiteName, "Sub2API"),
SiteLogo: settings[SettingKeySiteLogo],
SiteSubtitle: s.getStringOrDefault(settings, SettingKeySiteSubtitle, "Subscription to API Conversion Platform"),
ApiBaseUrl: settings[SettingKeyApiBaseUrl],
APIBaseURL: settings[SettingKeyAPIBaseURL],
ContactInfo: settings[SettingKeyContactInfo],
DocUrl: settings[SettingKeyDocUrl],
DocURL: settings[SettingKeyDocURL],
}
// Parse integer values
if port, err := strconv.Atoi(settings[SettingKeySmtpPort]); err == nil {
result.SmtpPort = port
if port, err := strconv.Atoi(settings[SettingKeySMTPPort]); err == nil {
result.SMTPPort = port
} else {
result.SmtpPort = 587
result.SMTPPort = 587
}
if concurrency, err := strconv.Atoi(settings[SettingKeyDefaultConcurrency]); err == nil {
@@ -245,8 +245,8 @@ func (s *SettingService) parseSettings(settings map[string]string) *SystemSettin
result.DefaultBalance = s.cfg.Default.UserBalance
}
// Sensitive values are returned directly for connection testing
result.SmtpPassword = settings[SettingKeySmtpPassword]
// Sensitive values are returned directly, for use when testing connections
result.SMTPPassword = settings[SettingKeySMTPPassword]
result.TurnstileSecretKey = settings[SettingKeyTurnstileSecretKey]
return result
@@ -278,28 +278,28 @@ func (s *SettingService) GetTurnstileSecretKey(ctx context.Context) string {
return value
}
// GenerateAdminApiKey generates a new admin API key
func (s *SettingService) GenerateAdminApiKey(ctx context.Context) (string, error) {
// GenerateAdminAPIKey generates a new admin API key
func (s *SettingService) GenerateAdminAPIKey(ctx context.Context) (string, error) {
// Generate 32 random bytes = 64 hex characters
bytes := make([]byte, 32)
if _, err := rand.Read(bytes); err != nil {
return "", fmt.Errorf("generate random bytes: %w", err)
}
key := AdminApiKeyPrefix + hex.EncodeToString(bytes)
key := AdminAPIKeyPrefix + hex.EncodeToString(bytes)
// Store in the settings table
if err := s.settingRepo.Set(ctx, SettingKeyAdminApiKey, key); err != nil {
if err := s.settingRepo.Set(ctx, SettingKeyAdminAPIKey, key); err != nil {
return "", fmt.Errorf("save admin api key: %w", err)
}
return key, nil
}
// GetAdminApiKeyStatus returns the admin API key status
// GetAdminAPIKeyStatus returns the admin API key status
// Returns the masked key, whether it exists, and any error
func (s *SettingService) GetAdminApiKeyStatus(ctx context.Context) (maskedKey string, exists bool, err error) {
key, err := s.settingRepo.GetValue(ctx, SettingKeyAdminApiKey)
func (s *SettingService) GetAdminAPIKeyStatus(ctx context.Context) (maskedKey string, exists bool, err error) {
key, err := s.settingRepo.GetValue(ctx, SettingKeyAdminAPIKey)
if err != nil {
if errors.Is(err, ErrSettingNotFound) {
return "", false, nil
@@ -320,10 +320,10 @@ func (s *SettingService) GetAdminApiKeyStatus(ctx context.Context) (maskedKey st
return maskedKey, true, nil
}
// GetAdminApiKey returns the full admin API key (for internal verification only)
// GetAdminAPIKey returns the full admin API key (for internal verification only)
// Returns an empty string and nil error when unconfigured; only database errors return an error
func (s *SettingService) GetAdminApiKey(ctx context.Context) (string, error) {
key, err := s.settingRepo.GetValue(ctx, SettingKeyAdminApiKey)
func (s *SettingService) GetAdminAPIKey(ctx context.Context) (string, error) {
key, err := s.settingRepo.GetValue(ctx, SettingKeyAdminAPIKey)
if err != nil {
if errors.Is(err, ErrSettingNotFound) {
return "", nil // 未配置,返回空字符串
@@ -333,7 +333,7 @@ func (s *SettingService) GetAdminApiKey(ctx context.Context) (string, error) {
return key, nil
}
// DeleteAdminApiKey deletes the admin API key
func (s *SettingService) DeleteAdminApiKey(ctx context.Context) error {
return s.settingRepo.Delete(ctx, SettingKeyAdminApiKey)
// DeleteAdminAPIKey deletes the admin API key
func (s *SettingService) DeleteAdminAPIKey(ctx context.Context) error {
return s.settingRepo.Delete(ctx, SettingKeyAdminAPIKey)
}
@@ -4,13 +4,13 @@ type SystemSettings struct {
RegistrationEnabled bool
EmailVerifyEnabled bool
SmtpHost string
SmtpPort int
SmtpUsername string
SmtpPassword string
SmtpFrom string
SmtpFromName string
SmtpUseTLS bool
SMTPHost string
SMTPPort int
SMTPUsername string
SMTPPassword string
SMTPFrom string
SMTPFromName string
SMTPUseTLS bool
TurnstileEnabled bool
TurnstileSiteKey string
@@ -19,9 +19,9 @@ type SystemSettings struct {
SiteName string
SiteLogo string
SiteSubtitle string
ApiBaseUrl string
APIBaseURL string
ContactInfo string
DocUrl string
DocURL string
DefaultConcurrency int
DefaultBalance float64
@@ -35,8 +35,8 @@ type PublicSettings struct {
SiteName string
SiteLogo string
SiteSubtitle string
ApiBaseUrl string
APIBaseURL string
ContactInfo string
DocUrl string
DocURL string
Version string
}
@@ -197,7 +197,7 @@ func TestClaudeTokenRefresher_CanRefresh(t *testing.T) {
{
name: "anthropic api-key - cannot refresh",
platform: PlatformAnthropic,
accType: AccountTypeApiKey,
accType: AccountTypeAPIKey,
want: false,
},
{
......
@@ -79,7 +79,7 @@ type ReleaseInfo struct {
Name string `json:"name"`
Body string `json:"body"`
PublishedAt string `json:"published_at"`
HtmlURL string `json:"html_url"`
HTMLURL string `json:"html_url"`
Assets []Asset `json:"assets,omitempty"`
}
@@ -96,13 +96,13 @@ type GitHubRelease struct {
Name string `json:"name"`
Body string `json:"body"`
PublishedAt string `json:"published_at"`
HtmlUrl string `json:"html_url"`
HTMLURL string `json:"html_url"`
Assets []GitHubAsset `json:"assets"`
}
type GitHubAsset struct {
Name string `json:"name"`
BrowserDownloadUrl string `json:"browser_download_url"`
BrowserDownloadURL string `json:"browser_download_url"`
Size int64 `json:"size"`
}
@@ -285,7 +285,7 @@ func (s *UpdateService) fetchLatestRelease(ctx context.Context) (*UpdateInfo, er
for i, a := range release.Assets {
assets[i] = Asset{
Name: a.Name,
DownloadURL: a.BrowserDownloadUrl,
DownloadURL: a.BrowserDownloadURL,
Size: a.Size,
}
}
@@ -298,7 +298,7 @@ func (s *UpdateService) fetchLatestRelease(ctx context.Context) (*UpdateInfo, er
Name: release.Name,
Body: release.Body,
PublishedAt: release.PublishedAt,
HtmlURL: release.HtmlUrl,
HTMLURL: release.HTMLURL,
Assets: assets,
},
Cached: false,
......
@@ -10,7 +10,7 @@ const (
type UsageLog struct {
ID int64
UserID int64
ApiKeyID int64
APIKeyID int64
AccountID int64
RequestID string
Model string
@@ -42,7 +42,7 @@ type UsageLog struct {
CreatedAt time.Time
User *User
ApiKey *ApiKey
APIKey *APIKey
Account *Account
Group *Group
Subscription *UserSubscription
......
@@ -17,7 +17,7 @@ var (
// CreateUsageLogRequest is a request to create a usage log
type CreateUsageLogRequest struct {
UserID int64 `json:"user_id"`
ApiKeyID int64 `json:"api_key_id"`
APIKeyID int64 `json:"api_key_id"`
AccountID int64 `json:"account_id"`
RequestID string `json:"request_id"`
Model string `json:"model"`
@@ -75,7 +75,7 @@ func (s *UsageService) Create(ctx context.Context, req CreateUsageLogRequest) (*
// Create the usage log
usageLog := &UsageLog{
UserID: req.UserID,
ApiKeyID: req.ApiKeyID,
APIKeyID: req.APIKeyID,
AccountID: req.AccountID,
RequestID: req.RequestID,
Model: req.Model,
@@ -128,9 +128,9 @@ func (s *UsageService) ListByUser(ctx context.Context, userID int64, params pagi
return logs, pagination, nil
}
// ListByApiKey lists usage logs for an API key
func (s *UsageService) ListByApiKey(ctx context.Context, apiKeyID int64, params pagination.PaginationParams) ([]UsageLog, *pagination.PaginationResult, error) {
logs, pagination, err := s.usageRepo.ListByApiKey(ctx, apiKeyID, params)
// ListByAPIKey lists usage logs for an API key
func (s *UsageService) ListByAPIKey(ctx context.Context, apiKeyID int64, params pagination.PaginationParams) ([]UsageLog, *pagination.PaginationResult, error) {
logs, pagination, err := s.usageRepo.ListByAPIKey(ctx, apiKeyID, params)
if err != nil {
return nil, nil, fmt.Errorf("list usage logs: %w", err)
}
@@ -165,9 +165,9 @@ func (s *UsageService) GetStatsByUser(ctx context.Context, userID int64, startTi
}, nil
}
// GetStatsByApiKey returns usage statistics for an API key
func (s *UsageService) GetStatsByApiKey(ctx context.Context, apiKeyID int64, startTime, endTime time.Time) (*UsageStats, error) {
stats, err := s.usageRepo.GetApiKeyStatsAggregated(ctx, apiKeyID, startTime, endTime)
// GetStatsByAPIKey returns usage statistics for an API key
func (s *UsageService) GetStatsByAPIKey(ctx context.Context, apiKeyID int64, startTime, endTime time.Time) (*UsageStats, error) {
stats, err := s.usageRepo.GetAPIKeyStatsAggregated(ctx, apiKeyID, startTime, endTime)
if err != nil {
return nil, fmt.Errorf("get api key stats: %w", err)
}
@@ -270,9 +270,9 @@ func (s *UsageService) GetUserModelStats(ctx context.Context, userID int64, star
return stats, nil
}
// GetBatchApiKeyUsageStats returns today/total actual_cost for given api keys.
func (s *UsageService) GetBatchApiKeyUsageStats(ctx context.Context, apiKeyIDs []int64) (map[int64]*usagestats.BatchApiKeyUsageStats, error) {
stats, err := s.usageRepo.GetBatchApiKeyUsageStats(ctx, apiKeyIDs)
// GetBatchAPIKeyUsageStats returns today/total actual_cost for given api keys.
func (s *UsageService) GetBatchAPIKeyUsageStats(ctx context.Context, apiKeyIDs []int64) (map[int64]*usagestats.BatchAPIKeyUsageStats, error) {
stats, err := s.usageRepo.GetBatchAPIKeyUsageStats(ctx, apiKeyIDs)
if err != nil {
return nil, fmt.Errorf("get batch api key usage stats: %w", err)
}
......
@@ -21,7 +21,7 @@ type User struct {
CreatedAt time.Time
UpdatedAt time.Time
ApiKeys []ApiKey
APIKeys []APIKey
Subscriptions []UserSubscription
}
......
@@ -73,6 +73,20 @@ func ProvideDeferredService(accountRepo AccountRepository, timingWheel *TimingWh
return svc
}
// ProvideOpsMetricsCollector creates and starts OpsMetricsCollector.
func ProvideOpsMetricsCollector(opsService *OpsService, concurrencyService *ConcurrencyService) *OpsMetricsCollector {
svc := NewOpsMetricsCollector(opsService, concurrencyService)
svc.Start()
return svc
}
// ProvideOpsAlertService creates and starts OpsAlertService.
func ProvideOpsAlertService(opsService *OpsService, userService *UserService, emailService *EmailService) *OpsAlertService {
svc := NewOpsAlertService(opsService, userService, emailService)
svc.Start()
return svc
}
// ProvideConcurrencyService creates ConcurrencyService and starts slot cleanup worker.
func ProvideConcurrencyService(cache ConcurrencyCache, accountRepo AccountRepository, cfg *config.Config) *ConcurrencyService {
svc := NewConcurrencyService(cache)
@@ -87,13 +101,14 @@ var ProviderSet = wire.NewSet(
// Core services
NewAuthService,
NewUserService,
NewApiKeyService,
NewAPIKeyService,
NewGroupService,
NewAccountService,
NewProxyService,
NewRedeemService,
NewUsageService,
NewDashboardService,
NewOpsService,
ProvidePricingService,
NewBillingService,
NewBillingCacheService,
@@ -125,5 +140,7 @@ var ProviderSet = wire.NewSet(
ProvideTimingWheelService,
ProvideDeferredService,
ProvideAntigravityQuotaRefresher,
ProvideOpsMetricsCollector,
ProvideOpsAlertService,
NewUserAttributeService,
)
// Package setup provides CLI-based installation wizard for initial system configuration.
package setup
import (
......
@@ -345,7 +345,7 @@ func writeConfigFile(cfg *SetupConfig) error {
Default struct {
UserConcurrency int `yaml:"user_concurrency"`
UserBalance float64 `yaml:"user_balance"`
ApiKeyPrefix string `yaml:"api_key_prefix"`
APIKeyPrefix string `yaml:"api_key_prefix"`
RateMultiplier float64 `yaml:"rate_multiplier"`
} `yaml:"default"`
RateLimit struct {
@@ -367,12 +367,12 @@ func writeConfigFile(cfg *SetupConfig) error {
Default: struct {
UserConcurrency int `yaml:"user_concurrency"`
UserBalance float64 `yaml:"user_balance"`
ApiKeyPrefix string `yaml:"api_key_prefix"`
APIKeyPrefix string `yaml:"api_key_prefix"`
RateMultiplier float64 `yaml:"rate_multiplier"`
}{
UserConcurrency: 5,
UserBalance: 0,
ApiKeyPrefix: "sk-",
APIKeyPrefix: "sk-",
RateMultiplier: 1.0,
},
RateLimit: struct {
......
//go:build !embed
// Package web provides web server functionality including embedded frontend support.
package web
import (
......
-- Ops error logs and system metrics
CREATE TABLE IF NOT EXISTS ops_error_logs (
id BIGSERIAL PRIMARY KEY,
request_id VARCHAR(64),
user_id BIGINT,
api_key_id BIGINT,
account_id BIGINT,
group_id BIGINT,
client_ip INET,
error_phase VARCHAR(32) NOT NULL,
error_type VARCHAR(64) NOT NULL,
severity VARCHAR(4) NOT NULL,
status_code INT,
platform VARCHAR(32),
model VARCHAR(100),
request_path VARCHAR(256),
stream BOOLEAN NOT NULL DEFAULT FALSE,
error_message TEXT,
error_body TEXT,
provider_error_code VARCHAR(64),
provider_error_type VARCHAR(64),
is_retryable BOOLEAN NOT NULL DEFAULT FALSE,
is_user_actionable BOOLEAN NOT NULL DEFAULT FALSE,
retry_count INT NOT NULL DEFAULT 0,
completion_status VARCHAR(16),
duration_ms INT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_ops_error_logs_created_at ON ops_error_logs (created_at DESC);
CREATE INDEX IF NOT EXISTS idx_ops_error_logs_phase ON ops_error_logs (error_phase);
CREATE INDEX IF NOT EXISTS idx_ops_error_logs_platform ON ops_error_logs (platform);
CREATE INDEX IF NOT EXISTS idx_ops_error_logs_severity ON ops_error_logs (severity);
CREATE INDEX IF NOT EXISTS idx_ops_error_logs_phase_platform_time ON ops_error_logs (error_phase, platform, created_at DESC);
CREATE TABLE IF NOT EXISTS ops_system_metrics (
id BIGSERIAL PRIMARY KEY,
success_rate DOUBLE PRECISION,
error_rate DOUBLE PRECISION,
p95_latency_ms INT,
p99_latency_ms INT,
http2_errors INT,
active_alerts INT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_ops_system_metrics_created_at ON ops_system_metrics (created_at DESC);
-- Extend ops_system_metrics with windowed/system stats
ALTER TABLE ops_system_metrics
ADD COLUMN IF NOT EXISTS window_minutes INT NOT NULL DEFAULT 1,
ADD COLUMN IF NOT EXISTS cpu_usage_percent DOUBLE PRECISION,
ADD COLUMN IF NOT EXISTS memory_used_mb BIGINT,
ADD COLUMN IF NOT EXISTS memory_total_mb BIGINT,
ADD COLUMN IF NOT EXISTS memory_usage_percent DOUBLE PRECISION,
ADD COLUMN IF NOT EXISTS heap_alloc_mb BIGINT,
ADD COLUMN IF NOT EXISTS gc_pause_ms DOUBLE PRECISION,
ADD COLUMN IF NOT EXISTS concurrency_queue_depth INT;
CREATE INDEX IF NOT EXISTS idx_ops_system_metrics_window_time
ON ops_system_metrics (window_minutes, created_at DESC);
-- Ops alert rules and events
CREATE TABLE IF NOT EXISTS ops_alert_rules (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(128) NOT NULL,
description TEXT,
enabled BOOLEAN NOT NULL DEFAULT TRUE,
metric_type VARCHAR(64) NOT NULL,
operator VARCHAR(8) NOT NULL,
threshold DOUBLE PRECISION NOT NULL,
window_minutes INT NOT NULL DEFAULT 1,
sustained_minutes INT NOT NULL DEFAULT 1,
severity VARCHAR(4) NOT NULL DEFAULT 'P1',
notify_email BOOLEAN NOT NULL DEFAULT FALSE,
notify_webhook BOOLEAN NOT NULL DEFAULT FALSE,
webhook_url TEXT,
cooldown_minutes INT NOT NULL DEFAULT 10,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_ops_alert_rules_enabled ON ops_alert_rules (enabled);
CREATE INDEX IF NOT EXISTS idx_ops_alert_rules_metric ON ops_alert_rules (metric_type, window_minutes);
CREATE TABLE IF NOT EXISTS ops_alert_events (
id BIGSERIAL PRIMARY KEY,
rule_id BIGINT NOT NULL REFERENCES ops_alert_rules(id) ON DELETE CASCADE,
severity VARCHAR(4) NOT NULL,
status VARCHAR(16) NOT NULL DEFAULT 'firing',
title VARCHAR(200),
description TEXT,
metric_value DOUBLE PRECISION,
threshold_value DOUBLE PRECISION,
fired_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
resolved_at TIMESTAMPTZ,
email_sent BOOLEAN NOT NULL DEFAULT FALSE,
webhook_sent BOOLEAN NOT NULL DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_ops_alert_events_rule_status ON ops_alert_events (rule_id, status);
CREATE INDEX IF NOT EXISTS idx_ops_alert_events_fired_at ON ops_alert_events (fired_at DESC);
-- Seed default ops alert rules (idempotent)
INSERT INTO ops_alert_rules (
name,
description,
enabled,
metric_type,
operator,
threshold,
window_minutes,
sustained_minutes,
severity,
notify_email,
notify_webhook,
webhook_url,
cooldown_minutes
)
SELECT
'Global success rate < 99%',
'Trigger when the 1-minute success rate drops below 99% for 2 consecutive minutes.',
TRUE,
'success_rate',
'<',
99,
1,
2,
'P1',
TRUE,
FALSE,
NULL,
10
WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules);
-- Seed additional ops alert rules (idempotent)
INSERT INTO ops_alert_rules (
name,
description,
enabled,
metric_type,
operator,
threshold,
window_minutes,
sustained_minutes,
severity,
notify_email,
notify_webhook,
webhook_url,
cooldown_minutes
)
SELECT
'Global error rate > 1%',
'Trigger when the 1-minute error rate exceeds 1% for 2 consecutive minutes.',
TRUE,
'error_rate',
'>',
1,
1,
2,
'P1',
TRUE,
CASE
WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
ELSE TRUE
END,
(SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
10
WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'Global error rate > 1%');
INSERT INTO ops_alert_rules (
name,
description,
enabled,
metric_type,
operator,
threshold,
window_minutes,
sustained_minutes,
severity,
notify_email,
notify_webhook,
webhook_url,
cooldown_minutes
)
SELECT
'P99 latency > 2000ms',
'Trigger when the 5-minute P99 latency exceeds 2000ms for 2 consecutive samples.',
TRUE,
'p99_latency_ms',
'>',
2000,
5,
2,
'P1',
TRUE,
CASE
WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
ELSE TRUE
END,
(SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
15
WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'P99 latency > 2000ms');
INSERT INTO ops_alert_rules (
name,
description,
enabled,
metric_type,
operator,
threshold,
window_minutes,
sustained_minutes,
severity,
notify_email,
notify_webhook,
webhook_url,
cooldown_minutes
)
SELECT
'HTTP/2 errors > 20',
'Trigger when HTTP/2 errors exceed 20 in the last minute for 2 consecutive minutes.',
TRUE,
'http2_errors',
'>',
20,
1,
2,
'P2',
FALSE,
CASE
WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
ELSE TRUE
END,
(SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
10
WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'HTTP/2 errors > 20');
INSERT INTO ops_alert_rules (
name,
description,
enabled,
metric_type,
operator,
threshold,
window_minutes,
sustained_minutes,
severity,
notify_email,
notify_webhook,
webhook_url,
cooldown_minutes
)
SELECT
'CPU usage > 85%',
'Trigger when CPU usage exceeds 85% for 5 consecutive minutes.',
TRUE,
'cpu_usage_percent',
'>',
85,
1,
5,
'P2',
FALSE,
CASE
WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
ELSE TRUE
END,
(SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
15
WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'CPU usage > 85%');
INSERT INTO ops_alert_rules (
name,
description,
enabled,
metric_type,
operator,
threshold,
window_minutes,
sustained_minutes,
severity,
notify_email,
notify_webhook,
webhook_url,
cooldown_minutes
)
SELECT
'Memory usage > 90%',
'Trigger when memory usage exceeds 90% for 5 consecutive minutes.',
TRUE,
'memory_usage_percent',
'>',
90,
1,
5,
'P2',
FALSE,
CASE
WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
ELSE TRUE
END,
(SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
15
WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'Memory usage > 90%');
INSERT INTO ops_alert_rules (
name,
description,
enabled,
metric_type,
operator,
threshold,
window_minutes,
sustained_minutes,
severity,
notify_email,
notify_webhook,
webhook_url,
cooldown_minutes
)
SELECT
'Queue depth > 50',
'Trigger when concurrency queue depth exceeds 50 for 2 consecutive minutes.',
TRUE,
'concurrency_queue_depth',
'>',
50,
1,
2,
'P2',
FALSE,
CASE
WHEN (SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1) IS NULL THEN FALSE
ELSE TRUE
END,
(SELECT webhook_url FROM ops_alert_rules WHERE webhook_url IS NOT NULL AND webhook_url <> '' LIMIT 1),
10
WHERE NOT EXISTS (SELECT 1 FROM ops_alert_rules WHERE name = 'Queue depth > 50');
-- Enable webhook notifications for rules with webhook_url configured
UPDATE ops_alert_rules
SET notify_webhook = TRUE
WHERE webhook_url IS NOT NULL
AND webhook_url <> ''
AND notify_webhook IS DISTINCT FROM TRUE;