0. Overview
The monitoring stack consists of the exporters (node_exporter, kafka_exporter), Prometheus, Pushgateway, Alertmanager, and Grafana.
Software versions.
1. Exporter configuration
1.1 node_exporter
Startup command
nohup ./node_exporter &
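To confirm node_exporter is serving metrics (9100 is its default listen port), a quick check:
curl -s http://localhost:9100/metrics | head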
Service
Create the file /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/node_exporter/node_exporter
Restart=always
[Install]
WantedBy=multi-user.target
1.2 kafka_exporter
Startup script
#!/bin/bash
cd /opt/apps/exporters/kafka_exporter
nohup ./kafka_exporter --kafka.server=instance-kafka01:9092 --kafka.server=instance-kafka02:9092 --kafka.server=instance-kafka03:9092 \
--zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
--web.listen-address="172.16.0.243:9340" >/dev/null 2>&1 &
Service
Create the file /etc/systemd/system/kafka_exporter.service:
[Unit]
Description=Kafka Exporter for Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/exporters/kafka_exporter/kafka_exporter \
--kafka.server=instance-kafka01:9092 \
--kafka.server=instance-kafka02:9092 \
--kafka.server=instance-kafka03:9092 \
--zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
--web.listen-address=0.0.0.0:9340
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Starting the exporters
kafka_exporter is used as the example here; the other exporter services are started the same way.
Commands
sudo systemctl daemon-reload
sudo systemctl enable kafka_exporter
sudo systemctl start kafka_exporter
Check the service status
sudo systemctl status kafka_exporter
2. Prometheus configuration
2.1 Configure prometheus.yml
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- instance-metric01:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
- "rules/*.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: "prometheus"
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ["localhost:9090"]
- job_name: "pushgateway"
static_configs:
- targets: ["instance-metric01:9091"]
- job_name: "kafka"
static_configs:
- targets: ["1instance-kafka02:9340"]
- job_name: "node"
static_configs:
- targets: ["instance-kafka01:9100","instance-kafka02:9100","instance-kafka03:9100","instance-metric01:9100"]
metric_relabel_configs:
- action: replace
source_labels: ["instance"]
regex: ([^:]+):([0-9]+)
replacement: $1
target_label: "host_name"
2.2 Alerting rules configuration
The rule files are placed in the rules directory under the Prometheus installation directory.
cpu.yml
groups:
- name: cpu_state
rules:
- alert: cpu使用率告警
expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) by (host_name)) * 100 > 90
for: 30s
labels:
severity: warning
annotations:
summary: "{{$labels.host_name}}CPU使用率超过90%"
description: " 服务器【{{$labels.host_name}}】:当前CPU使用率{{$value}}%超过90%"
disk.yml
groups:
- name: disk_state
rules:
- alert: 磁盘使用率告警
expr: (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_avail_bytes{fstype=~"ext.?|xfs"}) / node_filesystem_size_bytes{fstype=~"ext.?|xfs"} * 100 > 80
for: 30s
labels:
severity: warning
annotations:
summary: "{{$labels.host_name}}磁盘分区使用率超过80%"
description: " 服务器【{{$labels.host_name}}】上的挂载点:【{{ $labels.mountpoint }}】当前值{{$value}}%超过80%"
dispatcher.yml
groups:
- name: dispatcher_state
rules:
- alert: dispatcher06状态
expr: sum(dispatcher06_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.218上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcher07状态
expr: sum(dispatcher07_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.219上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk1状态
expr: sum(dispatcherk1_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.243上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk2状态
expr: sum(dispatcherk2_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.244上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk3状态
expr: sum(dispatcherk3_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.245上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk4状态
expr: sum(dispatcherk4_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.246上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk5状态
expr: sum(dispatcherk5_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.247上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk6状态
expr: sum(dispatcherk6_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.140上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk7状态
expr: sum(dispatcherk7_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.141上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk8状态
expr: sum(dispatcherk8_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.142上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk9状态
expr: sum(dispatcherk9_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.143上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk13状态
expr: sum(dispatcherk13_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.155上的dispatcher写入数据为0,进程发生问题!"
dn.yml
groups:
- name: dn_state
rules:
- alert: DataNode容量告警
expr: (sum(Hadoop_DataNode_DfsUsed{name="FSDatasetState"}) by (host_name) / sum(Hadoop_DataNode_Capacity{name="FSDatasetState"}) by(host_name)) * 100 > 80
for: 30s
labels:
severity: warning
annotations:
summary: "DataNode节点:{{$labels.host_name}}已使用容量超过80%"
description: "DataNode节点:{{$labels.host_name}},当前已使用容量:{{$value}}超过总容量的80%"
kafka_lag.yml
groups:
- name: kafka_lag
rules:
- alert: kafka消息积压报警
expr: sum(kafka_consumergroup_lag{ topic!~"pct_.+"}) by(consumergroup,topic) > 500000 or sum(kafka_consumergroup_lag{topic=~"pct_.+"}) by(consumergroup,topic) > 2000000
for: 30s
labels:
severity: warning
annotations:
summary: "Topic:{{$labels.topic}}的消费组{{$labels.consumergroup}}消息积压"
description: "消息Lag:{{$value}}"
mem.yml
groups:
- name: memory_state
rules:
- alert: 内存使用率告警
expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes)))* 100 > 90
for: 30s
labels:
severity: warning
annotations:
summary: "{{$labels.host_name}}内存使用率超过90%"
description: " 服务器【{{$labels.host_name}}】:当前内存使用率{{$value}}%超过90%"
process.yml
groups:
- name: proc_state
rules:
- alert: 进程存活告警
expr: namedprocess_namegroup_num_procs<1
for: 60s
labels:
severity: critical
target: "{{$labels.app_name}}"
annotations:
summary: "进程{{$labels.app_name}}已停止"
description: "进程 {{$labels.app_name}} 在服务器:{{$labels.host_name}}上已经停止."
prometheus_process.yml
groups:
- name: proc_state
rules:
- alert: prometheus组件进程存活告警
expr: sum(up) by(instance,job) == 0
for: 30s
labels:
severity: critical
target: "{{$labels.job}}"
annotations:
summary: "进程{{$labels.job}}已停止"
description: "进程 {{$labels.job}} 在服务器:{{$labels.instance}}上已经停止."
yarn.yml
groups:
- name: yarn_node
rules:
- alert: yarn节点不足
expr: sum(Hadoop_ResourceManager_NumActiveNMs{job='rm'}) by (job) < 13 or sum(Hadoop_ResourceManager_NumActiveNMs{job='rmf'}) by (job) < 12
for: 30s
labels:
severity: warning
annotations:
summary: "yarn集群:{{$labels.job}}节点不足"
2.3 Startup
Startup command
nohup /opt/apps/prometheus/prometheus \
--web.listen-address="0.0.0.0:9090" \
--web.read-timeout=5m \
--web.max-connections=10 \
--storage.tsdb.retention=7d \
--storage.tsdb.path="data/" \
--query.max-concurrency=20 \
--query.timeout=2m \
--web.enable-lifecycle \
> /opt/apps/prometheus/logs/start.log 2>&1 &
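After startup, Prometheus exposes health and readiness endpoints that can be used to confirm the process is up:
curl http://localhost:9090/-/healthy
curl http://localhost:9090/-/ready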
2.4 Reloading the configuration
Reload the configuration via the lifecycle API (enabled by the --web.enable-lifecycle flag in the startup command above):
curl -X POST http://localhost:9090/-/reload
3. pushgateway
Startup command
nohup /opt/apps/pushgateway/pushgateway \
--web.listen-address="0.0.0.0:9091" \
> /opt/apps/pushgateway/start.log 2>&1 &
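As a quick test, a metric can be pushed with curl; test_job and test_metric below are placeholder names used only for illustration:
echo "test_metric 42" | curl --data-binary @- http://localhost:9091/metrics/job/test_job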
4. alertmanager
4.1 Configure alertmanager.yml
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 1m
repeat_interval: 5m
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://mecury-ca01:9825/api/alarm/send'
send_resolved: true
inhibit_rules:
- source_match:
alertname: 'ApplicationDown'
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'job', "target", 'instance']
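The configuration can be validated with amtool (shipped in the Alertmanager release tarball), assuming it sits next to the alertmanager binary:
/opt/apps/alertmanager/amtool check-config /opt/apps/alertmanager/alertmanager.yml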
Configure the alert webhook address here. For reference, a sample of the alert payload that Alertmanager POSTs to the webhook:
{
"version": "4",
"groupKey": "alertname:ApplicationDown",
"status": "firing",
"receiver": "web.hook",
"groupLabels": {
"alertname": "ApplicationDown"
},
"commonLabels": {
"alertname": "ApplicationDown",
"severity": "critical",
"instance": "10.0.0.1:8080",
"job": "web",
"target": "10.0.0.1"
},
"commonAnnotations": {
"summary": "Web application is down",
"description": "The web application at instance 10.0.0.1:8080 is not responding."
},
"externalURL": "http://alertmanager:9093",
"alerts": [
{
"status": "firing",
"labels": {
"alertname": "ApplicationDown",
"severity": "critical",
"instance": "10.0.0.1:8080",
"job": "web",
"target": "10.0.0.1"
},
"annotations": {
"summary": "Web application is down",
"description": "The web application at instance 10.0.0.1:8080 is not responding."
},
"startsAt": "2025-06-19T04:30:00Z",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://prometheus:9090/graph?g0.expr=up%7Bjob%3D%22web%22%7D+%3D%3D+0",
"fingerprint": "1234567890abcdef"
}
]
}
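To exercise the webhook receiver without waiting for a real alert, the sample payload above can be saved to a file (sample_alert.json is a placeholder name) and POSTed with curl:
curl -X POST -H "Content-Type: application/json" \
  --data @sample_alert.json \
  http://mecury-ca01:9825/api/alarm/send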
4.2 Startup
Startup script start.sh
#!/bin/bash
nohup /opt/apps/alertmanager/alertmanager \
--config.file=/opt/apps/alertmanager/alertmanager.yml \
> /opt/apps/alertmanager/start.log 2>&1 &
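Alertmanager listens on 9093 by default and exposes a health endpoint for a quick check:
curl http://localhost:9093/-/healthy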
5. grafana
5.1 Installation
Startup command
nohup /opt/apps/grafana/bin/grafana-server web > /opt/apps/grafana/grafana.log 2>&1 &
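Grafana listens on port 3000 by default; its health API can be used to confirm it is up:
curl -s http://localhost:3000/api/health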
The default username and password are both admin.
5.2 Common dashboard templates
Commonly used Grafana.com dashboard IDs (imported by ID in the Grafana UI):
node 16098
kafka 7589
process 249