0. Overview
The monitoring stack consists of the exporters (node_exporter, kafka_exporter), Prometheus, Pushgateway, Alertmanager, and Grafana.
Software versions.
1. Exporter configuration
1.1 node_exporter
Startup command
nohup ./node_exporter &
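To confirm node_exporter is serving metrics (9100 is its default listen port), a quick check:
curl -s http://localhost:9100/metrics | head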
Service
Create the file /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/node_exporter/node_exporter
Restart=always
[Install]
WantedBy=multi-user.target
1.2 kafka_exporter
Startup script
#!/bin/bash
cd /opt/apps/exporters/kafka_exporter
nohup ./kafka_exporter --kafka.server=instance-kafka01:9092 --kafka.server=instance-kafka02:9092 --kafka.server=instance-kafka03:9092 \
--zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
--web.listen-address="172.16.0.243:9340" >/dev/null 2>&1 &
Service
Create the file /etc/systemd/system/kafka_exporter.service:
[Unit]
Description=Kafka Exporter for Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/exporters/kafka_exporter/kafka_exporter \
--kafka.server=instance-kafka01:9092 \
--kafka.server=instance-kafka02:9092 \
--kafka.server=instance-kafka03:9092 \
--zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
--web.listen-address=0.0.0.0:9340
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Starting the exporters
kafka_exporter is used as the example here; the other exporter services are started the same way.
Commands
sudo systemctl daemon-reload
sudo systemctl enable kafka_exporter
sudo systemctl start kafka_exporter
Check the service status
sudo systemctl status kafka_exporter
2. Prometheus configuration
2.1 Configure prometheus.yml
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- instance-metric01:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
- "rules/*.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: "prometheus"
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ["localhost:9090"]
- job_name: "pushgateway"
static_configs:
- targets: ["instance-metric01:9091"]
- job_name: "kafka"
static_configs:
- targets: ["1instance-kafka02:9340"]
- job_name: "node"
static_configs:
- targets: ["instance-kafka01:9100","instance-kafka02:9100","instance-kafka03:9100","instance-metric01:9100"]
metric_relabel_configs:
- action: replace
source_labels: ["instance"]
regex: ([^:]+):([0-9]+)
replacement: $1
target_label: "host_name"
2.2 Alerting rules configuration
The rule files are placed in the rules directory under the Prometheus installation directory.
cpu.yml
groups:
- name: cpu_state
rules:
- alert: cpu使用率告警
expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) by (host_name)) * 100 > 90
for: 30s
labels:
severity: warning
annotations:
summary: "{{$labels.host_name}}CPU使用率超过90%"
description: " 服务器【{{$labels.host_name}}】:当前CPU使用率{{$value}}%超过90%"
disk.yml
groups:
- name: disk_state
rules:
- alert: 磁盘使用率告警
expr: (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_avail_bytes{fstype=~"ext.?|xfs"}) / node_filesystem_size_bytes{fstype=~"ext.?|xfs"} * 100 > 80
for: 30s
labels:
severity: warning
annotations:
summary: "{{$labels.host_name}}磁盘分区使用率超过80%"
description: " 服务器【{{$labels.host_name}}】上的挂载点:【{{ $labels.mountpoint }}】当前值{{$value}}%超过80%"
dispatcher.yml
groups:
- name: dispatcher_state
rules:
- alert: dispatcher06状态
expr: sum(dispatcher06_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.218上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcher07状态
expr: sum(dispatcher07_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.219上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk1状态
expr: sum(dispatcherk1_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.243上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk2状态
expr: sum(dispatcherk2_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.244上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk3状态
expr: sum(dispatcherk3_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.245上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk4状态
expr: sum(dispatcherk4_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.246上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk5状态
expr: sum(dispatcherk5_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.247上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk6状态
expr: sum(dispatcherk6_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.140上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk7状态
expr: sum(dispatcherk7_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.141上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk8状态
expr: sum(dispatcherk8_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.142上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk9状态
expr: sum(dispatcherk9_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.143上的dispatcher写入数据为0,进程发生问题!"
- alert: dispatcherk13状态
expr: sum(dispatcherk13_data) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "dispatcher写入数据为0"
description: "服务器172.16.0.155上的dispatcher写入数据为0,进程发生问题!"
dn.yml
groups:
- name: dn_state
rules:
- alert: DataNode容量告警
expr: (sum(Hadoop_DataNode_DfsUsed{name="FSDatasetState"}) by (host_name) / sum(Hadoop_DataNode_Capacity{name="FSDatasetState"}) by(host_name)) * 100 > 80
for: 30s
labels:
severity: warning
annotations:
summary: "DataNode节点:{{$labels.host_name}}已使用容量超过80%"
description: "DataNode节点:{{$labels.host_name}},当前已使用容量:{{$value}}超过总容量的80%"
kafka_lag.yml
groups:
- name: kafka_lag
rules:
- alert: kafka消息积压报警
expr: sum(kafka_consumergroup_lag{ topic!~"pct_.+"}) by(consumergroup,topic) > 500000 or sum(kafka_consumergroup_lag{topic=~"pct_.+"}) by(consumergroup,topic) > 2000000
for: 30s
labels:
severity: warning
annotations:
summary: "Topic:{{$labels.topic}}的消费组{{$labels.consumergroup}}消息积压"
description: "消息Lag:{{$value}}"
mem.yml
groups:
- name: memory_state
rules:
- alert: 内存使用率告警
expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes)))* 100 > 90
for: 30s
labels:
severity: warning
annotations:
summary: "{{$labels.host_name}}内存使用率超过90%"
description: " 服务器【{{$labels.host_name}}】:当前内存使用率{{$value}}%超过90%"
process.yml
groups:
- name: proc_state
rules:
- alert: 进程存活告警
expr: namedprocess_namegroup_num_procs<1
for: 60s
labels:
severity: critical
target: "{{$labels.app_name}}"
annotations:
summary: "进程{{$labels.app_name}}已停止"
description: "进程 {{$labels.app_name}} 在服务器:{{$labels.host_name}}上已经停止."
prometheus_process.yml
groups:
- name: proc_state
rules:
- alert: prometheus组件进程存活告警
expr: sum(up) by(instance,job) == 0
for: 30s
labels:
severity: critical
target: "{{$labels.job}}"
annotations:
summary: "进程{{$labels.job}}已停止"
description: "进程 {{$labels.job}} 在服务器:{{$labels.instance}}上已经停止."
yarn.yml
groups:
- name: yarn_node
rules:
- alert: yarn节点不足
expr: sum(Hadoop_ResourceManager_NumActiveNMs{job='rm'}) by (job) < 13 or sum(Hadoop_ResourceManager_NumActiveNMs{job='rmf'}) by (job) < 12
for: 30s
labels:
severity: warning
annotations:
summary: "yarn集群:{{$labels.job}}节点不足"
2.3 Startup
Startup command
nohup /opt/apps/prometheus/prometheus \
--web.listen-address="0.0.0.0:9090" \
--web.read-timeout=5m \
--web.max-connections=10 \
--storage.tsdb.retention=7d \
--storage.tsdb.path="data/" \
--query.max-concurrency=20 \
--query.timeout=2m \
--web.enable-lifecycle \
> /opt/apps/prometheus/logs/start.log 2>&1 &
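After startup, Prometheus exposes health and readiness endpoints that can be used to confirm the process is up:
curl http://localhost:9090/-/healthy
curl http://localhost:9090/-/ready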
2.4 Reloading the configuration
Reload the configuration via the lifecycle API (enabled by the --web.enable-lifecycle flag in the startup command above):
curl -X POST http://localhost:9090/-/reload
3. pushgateway
Startup command
nohup /opt/apps/pushgateway/pushgateway \
--web.listen-address="0.0.0.0:9091" \
> /opt/apps/pushgateway/start.log 2>&1 &
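As a quick test, a metric can be pushed with curl; test_job and test_metric below are placeholder names used only for illustration:
echo "test_metric 42" | curl --data-binary @- http://localhost:9091/metrics/job/test_job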
4. alertmanager
4.1 Configure alertmanager.yml
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 1m
repeat_interval: 5m
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://mecury-ca01:9825/api/alarm/send'
send_resolved: true
inhibit_rules:
- source_match:
alertname: 'ApplicationDown'
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'job', "target", 'instance']
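The configuration can be validated with amtool (shipped in the Alertmanager release tarball), assuming it sits next to the alertmanager binary:
/opt/apps/alertmanager/amtool check-config /opt/apps/alertmanager/alertmanager.yml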
Configure the alert webhook address here. For reference, a sample of the alert payload that Alertmanager POSTs to the webhook:
{
"version": "4",
"groupKey": "alertname:ApplicationDown",
"status": "firing",
"receiver": "web.hook",
"groupLabels": {
"alertname": "ApplicationDown"
},
"commonLabels": {
"alertname": "ApplicationDown",
"severity": "critical",
"instance": "10.0.0.1:8080",
"job": "web",
"target": "10.0.0.1"
},
"commonAnnotations": {
"summary": "Web application is down",
"description": "The web application at instance 10.0.0.1:8080 is not responding."
},
"externalURL": "http://alertmanager:9093",
"alerts": [
{
"status": "firing",
"labels": {
"alertname": "ApplicationDown",
"severity": "critical",
"instance": "10.0.0.1:8080",
"job": "web",
"target": "10.0.0.1"
},
"annotations": {
"summary": "Web application is down",
"description": "The web application at instance 10.0.0.1:8080 is not responding."
},
"startsAt": "2025-06-19T04:30:00Z",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://prometheus:9090/graph?g0.expr=up%7Bjob%3D%22web%22%7D+%3D%3D+0",
"fingerprint": "1234567890abcdef"
}
]
}
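To exercise the webhook receiver without waiting for a real alert, the sample payload above can be saved to a file (sample_alert.json is a placeholder name) and POSTed with curl:
curl -X POST -H "Content-Type: application/json" \
  --data @sample_alert.json \
  http://mecury-ca01:9825/api/alarm/send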
4.2 Startup
Startup script start.sh
#!/bin/bash
nohup /opt/apps/alertmanager/alertmanager \
--config.file=/opt/apps/alertmanager/alertmanager.yml \
> /opt/apps/alertmanager/start.log 2>&1 &
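Alertmanager listens on 9093 by default and exposes a health endpoint for a quick check:
curl http://localhost:9093/-/healthy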
5. grafana
5.1 Installation
Startup command
nohup /opt/apps/grafana/bin/grafana-server web > /opt/apps/grafana/grafana.log 2>&1 &
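Grafana listens on port 3000 by default; its health API can be used to confirm it is up:
curl -s http://localhost:3000/api/health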
The default username and password are both admin.
5.2 Common dashboard templates
Commonly used Grafana.com dashboard IDs (imported by ID in the Grafana UI):
node 16098
kafka 7589
process 249