可观测性:监控、日志、链路追踪三位一体

引子:一次线上故障的排查噩梦 2021年某晚,某电商平台接口响应慢,用户投诉激增。 排查过程: 运维:哪个服务出问题了?(无监控) 开发:日志在哪?(分散在100台服务器) 架构师:调用链路是什么?(无链路追踪) 耗时:3小时才定位到问题(数据库连接池配置错误) 教训:微服务架构下,可观测性至关重要 一、可观测性的本质 1.1 什么是可观测性? **可观测性(Observability)**是指通过外部输出理解系统内部状态的能力。 三大支柱: Metrics(指标):数字化的度量(QPS、响应时间、错误率) Logs(日志):事件的记录(请求日志、错误日志) Traces(追踪):请求的全链路视图(调用链路) 1.2 监控 vs 可观测性 维度 监控 可观测性 目标 已知问题 未知问题 方式 预设指标 任意维度查询 例子 CPU > 80%告警 为什么这个请求慢? 二、监控体系:Prometheus + Grafana 2.1 监控指标的四个黄金信号 延迟(Latency):请求响应时间 流量(Traffic):QPS、TPS 错误(Errors):错误率 饱和度(Saturation):CPU、内存、磁盘使用率 2.2 Prometheus监控配置 1. 添加依赖 <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-actuator</artifactId> </dependency> <dependency> <groupId>io.micrometer</groupId> <artifactId>micrometer-registry-prometheus</artifactId> </dependency> 2. 配置application.yml management: endpoints: web: exposure: include: "*" # 暴露所有端点 metrics: export: prometheus: enabled: true tags: application: ${spring.application.name} 3. 自定义指标 ...

2025-11-03 · maneng

监控告警体系:构建Redis可观测性

核心监控指标 1. 性能指标 @Component public class RedisMetricsCollector { @Autowired private RedisTemplate<String, String> redis; // QPS(每秒查询数) public long getQPS() { Properties info = redis.execute((RedisCallback<Properties>) connection -> connection.info("stats") ); return Long.parseLong(info.getProperty("instantaneous_ops_per_sec")); } // 延迟 public long getLatency() { long start = System.currentTimeMillis(); redis.opsForValue().get("health_check"); return System.currentTimeMillis() - start; } // 命中率 public double getHitRate() { Properties info = redis.execute((RedisCallback<Properties>) connection -> connection.info("stats") ); long hits = Long.parseLong(info.getProperty("keyspace_hits")); long misses = Long.parseLong(info.getProperty("keyspace_misses")); if (hits + misses == 0) { return 0; } return hits * 100.0 / (hits + misses); } // 慢查询数量 public long getSlowlogCount() { return redis.execute((RedisCallback<Long>) connection -> connection.slowlogLen() ); } } 2. 资源指标 // 内存使用 public Map<String, Object> getMemoryMetrics() { Properties info = redis.execute((RedisCallback<Properties>) connection -> connection.info("memory") ); Map<String, Object> metrics = new HashMap<>(); metrics.put("used_memory", Long.parseLong(info.getProperty("used_memory"))); metrics.put("used_memory_rss", Long.parseLong(info.getProperty("used_memory_rss"))); metrics.put("mem_fragmentation_ratio", Double.parseDouble(info.getProperty("mem_fragmentation_ratio"))); metrics.put("evicted_keys", Long.parseLong(info.getProperty("evicted_keys"))); return metrics; } // CPU使用 public double getCPUUsage() { Properties info = redis.execute((RedisCallback<Properties>) connection -> connection.info("cpu") ); return Double.parseDouble(info.getProperty("used_cpu_sys")); } // 连接数 public Map<String, Long> getConnectionMetrics() { Properties info = redis.execute((RedisCallback<Properties>) connection -> connection.info("clients") ); Map<String, Long> metrics = new HashMap<>(); metrics.put("connected_clients", Long.parseLong(info.getProperty("connected_clients"))); metrics.put("blocked_clients", Long.parseLong(info.getProperty("blocked_clients"))); return metrics; } 3. 持久化指标 public Map<String, Object> getPersistenceMetrics() { Properties info = redis.execute((RedisCallback<Properties>) connection -> connection.info("persistence") ); Map<String, Object> metrics = new HashMap<>(); // RDB metrics.put("rdb_last_save_time", Long.parseLong(info.getProperty("rdb_last_save_time"))); metrics.put("rdb_changes_since_last_save", Long.parseLong(info.getProperty("rdb_changes_since_last_save"))); // AOF if ("1".equals(info.getProperty("aof_enabled"))) { metrics.put("aof_current_size", Long.parseLong(info.getProperty("aof_current_size"))); metrics.put("aof_base_size", Long.parseLong(info.getProperty("aof_base_size"))); } return metrics; } 4. 复制指标 public Map<String, Object> getReplicationMetrics() { Properties info = redis.execute((RedisCallback<Properties>) connection -> connection.info("replication") ); Map<String, Object> metrics = new HashMap<>(); metrics.put("role", info.getProperty("role")); if ("master".equals(info.getProperty("role"))) { metrics.put("connected_slaves", Integer.parseInt(info.getProperty("connected_slaves"))); } else { metrics.put("master_link_status", info.getProperty("master_link_status")); metrics.put("master_last_io_seconds_ago", Integer.parseInt(info.getProperty("master_last_io_seconds_ago"))); } return metrics; } 告警规则 1. 性能告警 @Component public class PerformanceAlerting { @Autowired private RedisMetricsCollector metrics; @Scheduled(fixedRate = 60000) // 每分钟 public void checkPerformance() { // QPS过高 long qps = metrics.getQPS(); if (qps > 50000) { sendAlert("QPS告警", String.format("当前QPS: %d", qps)); } // 延迟过高 long latency = metrics.getLatency(); if (latency > 100) { sendAlert("延迟告警", String.format("当前延迟: %dms", latency)); } // 命中率过低 double hitRate = metrics.getHitRate(); if (hitRate < 80) { sendAlert("命中率告警", String.format("当前命中率: %.2f%%", hitRate)); } // 慢查询过多 long slowlogCount = metrics.getSlowlogCount(); if (slowlogCount > 100) { sendAlert("慢查询告警", String.format("慢查询数量: %d", slowlogCount)); } } private void sendAlert(String title, String message) { log.warn("{}: {}", title, message); // 发送钉钉/邮件/短信告警 } } 2. 资源告警 @Scheduled(fixedRate = 60000) public void checkResources() { // 内存使用 Map<String, Object> memMetrics = metrics.getMemoryMetrics(); long usedMemory = (long) memMetrics.get("used_memory"); long maxMemory = 4L * 1024 * 1024 * 1024; // 4GB if (usedMemory > maxMemory * 0.9) { sendAlert("内存告警", String.format("内存使用: %dMB / %dMB", usedMemory / 1024 / 1024, maxMemory / 1024 / 1024)); } // 内存碎片率 double fragRatio = (double) memMetrics.get("mem_fragmentation_ratio"); if (fragRatio > 1.5) { sendAlert("内存碎片告警", String.format("碎片率: %.2f", fragRatio)); } // 连接数 Map<String, Long> connMetrics = metrics.getConnectionMetrics(); long connectedClients = connMetrics.get("connected_clients"); if (connectedClients > 1000) { sendAlert("连接数告警", String.format("当前连接数: %d", connectedClients)); } } 3. 可用性告警 @Scheduled(fixedRate = 10000) // 每10秒 public void checkAvailability() { try { // 健康检查 redis.opsForValue().get("health_check"); } catch (Exception e) { sendAlert("Redis不可用", e.getMessage()); } // 主从复制状态 Map<String, Object> replMetrics = metrics.getReplicationMetrics(); if ("slave".equals(replMetrics.get("role"))) { String linkStatus = (String) replMetrics.get("master_link_status"); if (!"up".equals(linkStatus)) { sendAlert("主从复制断开", "master_link_status: " + linkStatus); } int lastIO = (int) replMetrics.get("master_last_io_seconds_ago"); if (lastIO > 60) { sendAlert("主从复制延迟", String.format("最后同步时间: %d秒前", lastIO)); } } } Prometheus + Grafana监控 1. 安装Redis Exporter docker run -d \ --name redis-exporter \ -p 9121:9121 \ oliver006/redis_exporter:latest \ --redis.addr=redis://redis:6379 2. Prometheus配置 # prometheus.yml scrape_configs: - job_name: 'redis' static_configs: - targets: ['redis-exporter:9121'] 3. Grafana Dashboard 导入官方Dashboard: ...

2025-01-22 · maneng

如约数科科技工作室

浙ICP备2025203501号

👀 本站总访问量 ...| 👤 访客数 ...| 📅 今日访问 ...