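Spring Boot End-to-End Monitoring System Guide: complete details from underlying principles to production-grade deployment, organized into eight core modules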
1. Monitoring Architecture in Depth
1.1 Layers of the Modern Monitoring Stack
Layers, from the application down to notification: application layer, Spring Boot Actuator, Micrometer, Prometheus/InfluxDB, Grafana, AlertManager, WeCom/DingTalk.
1.2 How Metric Collection Works
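// Simplified illustration of registry-side meter registration, not Micrometer's actual class
public class MicrometerRegistry {
    void registerMeter(Meter meter) {
        // registration logic elided
    }
}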
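Prometheus pull model:
1. Periodically send an HTTP GET to /actuator/prometheus
2. Parse the metrics returned in the text exposition format
3. Store the samples in the TSDB (time-series database)
4. Cut a new block every 2 hours; older blocks are compacted into larger ones in the background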
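Step 2 above can be sketched in plain Java. This is a minimal illustration, not Prometheus's real parser: the class name `ExpositionLineParser` is hypothetical, and it assumes label values contain no commas or escaped quotes and that values are plain decimals or `NaN`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Prometheus or Micrometer.
class ExpositionLineParser {

    /** One parsed sample: metric name, labels, and value. */
    static final class Sample {
        final String name;
        final Map<String, String> labels;
        final double value;

        Sample(String name, Map<String, String> labels, double value) {
            this.name = name;
            this.labels = labels;
            this.value = value;
        }
    }

    /** Parses one line; returns null for blank lines and "# HELP" / "# TYPE" comments. */
    static Sample parse(String line) {
        String text = line.trim();
        if (text.isEmpty() || text.startsWith("#")) {
            return null;
        }
        Map<String, String> labels = new LinkedHashMap<>();
        int open = text.indexOf('{');
        int nameEnd = open >= 0 ? open : text.indexOf(' ');
        String name = text.substring(0, nameEnd);
        int valueStart = nameEnd;
        if (open >= 0) {
            int close = text.lastIndexOf('}');
            // Simplification: label values must not contain commas or escaped quotes.
            for (String pair : text.substring(open + 1, close).split(",")) {
                if (pair.trim().isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                String value = pair.substring(eq + 1).trim();
                // Strip the surrounding double quotes from the label value.
                labels.put(pair.substring(0, eq).trim(), value.substring(1, value.length() - 1));
            }
            valueStart = close + 1;
        }
        // The first token after the name/labels is the value; an optional timestamp may follow.
        String valueToken = text.substring(valueStart).trim().split("\\s+")[0];
        return new Sample(name, labels, Double.parseDouble(valueToken));
    }

    public static void main(String[] args) {
        Sample s = parse("jvm_memory_used_bytes{area=\"heap\",id=\"G1 Old Gen\"} 1.048576E7");
        System.out.println(s.name + " " + s.labels + " " + s.value);
    }
}
```

A production scraper would also handle `# TYPE` metadata, escaped label values, and `+Inf`/`-Inf`, which this sketch deliberately skips.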
2. Spring Boot Monitoring Configuration in Full
2.1 Fine-Grained Control of Metric Exposure
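management:
  endpoints:
    web:
      exposure:
        include: health,prometheus  # expose only the endpoints that are needed
  metrics:
    export:
      prometheus:  # Spring Boot 2.x location; 3.x uses management.prometheus.metrics.export
        step: 1m
        descriptions: true
    enable:
      jvm: true
      logback: false
    distribution:
      percentiles:
        all: [0.5, 0.95, 0.99]  # keyed by meter name prefix; "all" applies to every meter
  endpoint:
    health:
      show-details: always
      probes:
        enabled: true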
2.2 Developing Custom Metrics in Practice
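import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.binder.MeterBinder;
import org.springframework.context.annotation.Bean;

// Declared inside a @Configuration class
@Bean
public MeterBinder orderMetrics(OrderRepository repo) {
    // REGION may be unset, and tag values must not be null, so fall back to "unknown"
    String region = System.getenv().getOrDefault("REGION", "unknown");
    // The gauge calls repo.count() each time it is read, so every scrape hits the repository
    return registry -> Gauge.builder("order.count", repo, OrderRepository::count)
            .tag("region", region)
            .register(registry);
}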
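import io.micrometer.core.instrument.*;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.*;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class ServiceMonitor {
    private final Timer serviceTimer = Timer.builder("service.time")
            .publishPercentiles(0.95)
            .register(Metrics.globalRegistry);
    @Around("execution(* com..*Service.*(..))")
    public Object timeService(ProceedingJoinPoint pjp) throws Throwable {
        // proceed() throws Throwable, which Timer.record(Supplier) cannot wrap, so use a Sample
        Timer.Sample sample = Timer.start();
        try {
            return pjp.proceed();
        } finally {
            sample.stop(serviceTimer);
        }
    }
}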
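To make the meaning of the 0.95 percentile concrete, here is a small nearest-rank sketch in plain Java. It is only an illustration of what a percentile is: Micrometer computes its published percentiles with its own histogram-based approximation, and the class name `PercentileSketch` is hypothetical.

```java
import java.util.Arrays;

// Illustration only; Micrometer does not compute percentiles this way.
class PercentileSketch {

    /** Nearest-rank percentile: the smallest sample with at least a fraction p of samples at or below it. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in (0, 1]");
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Hypothetical service latencies in milliseconds.
        double[] latenciesMs = {12, 15, 11, 240, 14, 13, 18, 16, 17, 19};
        System.out.println("p50 = " + percentile(latenciesMs, 0.5));
        System.out.println("p95 = " + percentile(latenciesMs, 0.95));
    }
}
```

With one slow outlier in the sample set, the median stays low while the 0.95 percentile jumps to the outlier, which is why tail percentiles are the usual choice for latency alerts.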
3. Advanced Prometheus Configuration
3.1 Storage Optimization
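# Retention and block duration are normally set with command-line flags rather than in prometheus.yml:
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.min-block-duration=2h
remote_write:
  - url: "http://thanos:10908/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 200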
3.2 Federated Cluster Deployment
Data center A: Prometheus-A → Thanos Receiver
Data center B: Prometheus-B → Thanos Receiver
Both Thanos Receivers → Thanos Query → Grafana
4. Advanced Grafana Dashboard Development
4.1 JVM Memory Analysis Template
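# Heap utilization per instance
sum(jvm_memory_used_bytes{area="heap"}) by (instance) /
sum(jvm_memory_max_bytes{area="heap"}) by (instance)
# Heap growth above 100 MB over the last hour; delta() is used because this metric is a gauge, and rate() applies only to counters
delta(jvm_memory_used_bytes{area="heap"}[1h]) > 100000000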
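The heap utilization ratio above can be reproduced for a single JVM with the standard `java.lang.management` API, which is a handy cross-check when a dashboard value looks off. This is a sketch; the class name `HeapUtilization` is hypothetical, and the `-1` guard reflects that the JVM reports `-1` when no maximum is defined.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Illustrative cross-check of the used/max heap ratio for the current JVM.
class HeapUtilization {

    /** Returns used/max, or -1.0 when the maximum is undefined (reported as -1 by the JVM). */
    static double ratio(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0;
        }
        return (double) usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("heap utilization = " + ratio(heap.getUsed(), heap.getMax()));
    }
}
```

The same guard matters in PromQL: series whose maximum is `-1` produce negative or meaningless ratios unless they are filtered out.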
4.2 Distributed Tracing Integration
# Spring Boot 3.x property names
management:
  tracing:
    sampling:
      probability: 0.1  # trace roughly 10% of requests
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans
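A sampling probability of 0.1 means roughly one trace in ten is recorded. The sketch below shows one common way to make that decision deterministically from the trace ID, so every service sees the same verdict for the same trace. It is an assumption-laden illustration, not the implementation used by Spring Boot or any tracing library; `ProbabilitySampler` and `countSampled` are hypothetical names.

```java
import java.util.Random;

// Illustrative sampler; tracing libraries ship their own implementations.
class ProbabilitySampler {
    private final double probability;
    private final long threshold;

    ProbabilitySampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.probability = probability;
        this.threshold = (long) (probability * Long.MAX_VALUE);
    }

    /** The same trace ID always yields the same decision, so all spans of one trace agree. */
    boolean shouldSample(long traceId) {
        if (probability == 1.0) {
            return true;
        }
        long positive = traceId & Long.MAX_VALUE; // clear the sign bit
        return positive < threshold;
    }

    /** Counts how many of `traces` pseudo-random trace IDs are kept. */
    static int countSampled(ProbabilitySampler sampler, int traces, long seed) {
        Random random = new Random(seed);
        int kept = 0;
        for (int i = 0; i < traces; i++) {
            if (sampler.shouldSample(random.nextLong())) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int kept = countSampled(new ProbabilitySampler(0.1), 100_000, 42L);
        System.out.println("kept " + kept + " of 100000 traces");
    }
}
```

Lower probabilities reduce tracing overhead and storage, at the cost of missing rare slow requests.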
5. Alerting System Design
5.1 Multi-Level Alert Rules
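groups:
  - name: critical
    rules:
      - alert: HighErrorRate
        # http_server_errors_total is assumed to be a custom counter; Micrometer's default is http_server_requests_seconds_count{status=~"5.."}
        expr: rate(http_server_errors_total[5m]) > 10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value }}"
  - name: warning
    rules:
      - alert: MemoryLeakWarning
        # Pools that report no maximum (-1) are filtered out so they cannot fire the alert permanently
        expr: predict_linear(jvm_memory_used_bytes{area="heap"}[6h], 86400) > (jvm_memory_max_bytes{area="heap"} > 0)
        labels:
          severity: warning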
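The MemoryLeakWarning rule relies on `predict_linear`, which fits a straight line through the samples in the range by least squares and extrapolates it. The arithmetic can be sketched as below. As a simplifying assumption, this sketch extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the evaluation time; `LinearPrediction` is a hypothetical name.

```java
// Illustrative least-squares extrapolation in the spirit of predict_linear.
class LinearPrediction {

    /** Fits v = intercept + slope * t and evaluates it secondsAhead after the last timestamp. */
    static double predictLinear(double[] t, double[] v, double secondsAhead) {
        int n = t.length;
        if (n < 2 || v.length != n) {
            throw new IllegalArgumentException("need at least two (t, v) pairs of equal length");
        }
        double meanT = 0.0;
        double meanV = 0.0;
        for (int i = 0; i < n; i++) {
            meanT += t[i];
            meanV += v[i];
        }
        meanT /= n;
        meanV /= n;
        double covariance = 0.0;
        double variance = 0.0;
        for (int i = 0; i < n; i++) {
            double dt = t[i] - meanT;
            covariance += dt * (v[i] - meanV);
            variance += dt * dt;
        }
        if (variance == 0.0) {
            throw new IllegalArgumentException("timestamps must not all be equal");
        }
        double slope = covariance / variance;
        double intercept = meanV - slope * meanT;
        return intercept + slope * (t[n - 1] + secondsAhead);
    }

    public static void main(String[] args) {
        // Hypothetical heap samples taken one hour apart, growing by about 50 MB per hour.
        double[] t = {0, 3600, 7200, 10800, 14400, 18000};
        double[] v = {500e6, 552e6, 601e6, 648e6, 703e6, 750e6};
        System.out.println("predicted bytes in 24h: " + predictLinear(t, v, 86400));
    }
}
```

Because the fit is linear, a sawtooth pattern from normal garbage collection can still trip the rule; a longer range or a `for:` clause helps dampen that.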
5.2 Alert Routing Strategy
route:
  group_by: ['alertname']
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: 'critical'
      receiver: 'sms-alert'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX
  - name: 'sms-alert'
    webhook_configs:
      - url: http://sms-gateway/api
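The routing behaviour of this configuration, where the first child route whose labels match wins and everything else falls back to the root receiver, can be modelled in a few lines. This is a simplified sketch: it ignores regex matchers, `continue`, and nested routes, and the names `AlertRouter`, `Route`, and `resolve` are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model of first-match routing; the real Alertmanager matcher supports much more.
class AlertRouter {

    /** A child route: every label in `match` must equal the alert's label for the route to apply. */
    static final class Route {
        final Map<String, String> match;
        final String receiver;

        Route(Map<String, String> match, String receiver) {
            this.match = match;
            this.receiver = receiver;
        }

        boolean matches(Map<String, String> labels) {
            for (Map.Entry<String, String> entry : match.entrySet()) {
                if (!entry.getValue().equals(labels.get(entry.getKey()))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Returns the receiver of the first matching child route, or the default receiver. */
    static String resolve(String defaultReceiver, List<Route> routes, Map<String, String> labels) {
        for (Route route : routes) {
            if (route.matches(labels)) {
                return route.receiver;
            }
        }
        return defaultReceiver;
    }

    public static void main(String[] args) {
        List<Route> routes = List.of(new Route(Map.of("severity", "critical"), "sms-alert"));
        Map<String, String> alert = Map.of("alertname", "HighErrorRate", "severity", "critical");
        System.out.println(resolve("slack-notifications", routes, alert));
    }
}
```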
6. Performance Optimization in Practice
6.1 Reducing Metric Collection Load
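import io.micrometer.core.instrument.config.MeterFilter;
import org.springframework.context.annotation.Bean;

@Bean
public MeterFilter samplingFilter() {
    // MeterFilter has no sampling operation, so this filter reduces load by denying meters
    // whose "uri" tag points at the actuator endpoints themselves
    return MeterFilter.deny(id -> {
        String uri = id.getTag("uri");
        return uri != null && uri.startsWith("/actuator");
    });
}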
6.2 High-Concurrency Optimization
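# These keys are assumed to be bound by the application's own configuration classes; they are not standard Spring Boot properties
reactor:
  netty:
    resources:
      max-connections: 50000
      max-idle-time: 30s
    metrics:
      enabled: true
      binders: ["jvm", "reactor"]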
7. Production Deployment Checklist
7.1 Kubernetes Deployment Template
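apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-boot-app
spec:
  selector:
    matchLabels:
      app: spring-boot-app  # label name is a placeholder; it must match the template labels
  template:
    metadata:
      labels:
        app: spring-boot-app
    spec:
      containers:
        - name: app
          image: spring-boot-app:latest  # placeholder; substitute the real image
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
          resources:
            limits:
              memory: 2Gi
            requests:
              memory: 1Gi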
7.2 Resource Planning for Monitoring Components
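| Component | CPU | Memory | Storage |
| --- | --- | --- | --- |
| Prometheus | 4 cores | 16 GB | 500 GB |
| Grafana | 2 cores | 4 GB | 50 GB |
| AlertManager | 1 core | 2 GB | - |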
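The Prometheus storage figure can be sanity-checked with the sizing rule from the Prometheus documentation: disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, where a sample typically takes 1 to 2 bytes after compression. The sketch below applies it; the ingestion rate of 100,000 samples per second is an assumed input for illustration, not a figure from this guide, and `PrometheusDiskEstimate` is a hypothetical name.

```java
// Illustrative capacity estimate based on the Prometheus sizing rule of thumb.
class PrometheusDiskEstimate {

    /** needed bytes = retention seconds * samples per second * bytes per sample */
    static long neededBytes(long retentionSeconds, double samplesPerSecond, double bytesPerSample) {
        return Math.round(retentionSeconds * samplesPerSecond * bytesPerSample);
    }

    public static void main(String[] args) {
        long retentionSeconds = 30L * 24 * 60 * 60; // 30 days of retention
        double samplesPerSecond = 100_000;          // assumed ingestion rate
        double bytesPerSample = 2.0;                // upper end of the 1 to 2 byte range
        long bytes = neededBytes(retentionSeconds, samplesPerSecond, bytesPerSample);
        System.out.println("estimated disk: " + bytes / 1e9 + " GB");
    }
}
```

Under these assumed inputs the estimate comes to roughly 518 GB, which is in the same range as the 500 GB planned above; the real ingestion rate should be measured before committing to a disk size.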
8. End-to-End Monitoring Case Studies
8.1 E-Commerce Sales Promotion Monitoring
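-- PostgreSQL syntax; assumes a metrics table with time, order_count, payment_latency, and status columns
SELECT
  date_trunc('minute', time) AS minute,
  -- orders over a trailing window of the five most recent one-minute buckets
  sum(sum(order_count)) OVER (ORDER BY date_trunc('minute', time) ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS recent_orders,
  avg(payment_latency) FILTER (WHERE status = 'paid') AS avg_pay_time
FROM metrics
WHERE time > now() - interval '1 hour'
GROUP BY date_trunc('minute', time)
ORDER BY minute DESC;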
8.2 Financial Transaction Monitoring
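import io.micrometer.core.instrument.Metrics;
import io.micrometer.core.instrument.Timer;
import java.math.BigDecimal;
import org.springframework.transaction.annotation.Transactional;

@Transactional
public void transfer(Account from, Account to, BigDecimal amount) {
    // Count transfers per currency; getCurrency() must return a String to be usable as a tag value
    Metrics.counter("transfer.count", "currency", from.getCurrency()).increment();
    Timer.Sample sample = Timer.start();
    try {
        // transfer logic elided
    } finally {
        // Record the elapsed time even when the transfer throws
        sample.stop(Metrics.timer("transfer.time"));
    }
}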