Deploying kube-state-metrics in Kubernetes and Configuring Prometheus Monitoring in Practice




This article walks through deploying the kube-state-metrics component in a Kubernetes cluster, configuring its ServiceAccount and RBAC authorization, and then scraping kube-state-metrics and node-exporter metrics with Prometheus, so you can quickly stand up a cluster monitoring stack.


Environment Preparation

Assume your Kubernetes control-plane node IP is 10.255.101.217, kubectl is installed, and access to the cluster is already configured.

Related posts:

  • Running Kubernetes with one master and multiple nodes: one-command deployment with sealos
  • Building a complete Prometheus + Alertmanager + Node Exporter + Grafana monitoring stack with Supervisor and Systemd


Create the Monitoring Namespace

First, create a dedicated namespace monitor-sa for the monitoring components:

kubectl create ns monitor-sa

Confirm that the namespace has been created:

kubectl get ns



Prepare the Configuration Files

  • sa.yaml — ServiceAccount for kube-state-metrics
  • rbac.yaml — ClusterRole granting read access to cluster resources
  • clust.yaml — ClusterRoleBinding tying the ClusterRole to the ServiceAccount
  • jiankong.yaml — kube-state-metrics Deployment
  • svc.yaml — Service exposing kube-state-metrics
  • node.yaml — node-exporter DaemonSet

Create the ServiceAccount

In the monitor-sa namespace, create a ServiceAccount named kube-state-metrics for the component, so that permissions can be bound to it later.

Contents of sa.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  # name of the ServiceAccount
  name: kube-state-metrics
  # namespace the ServiceAccount lives in
  namespace: monitor-sa

Apply it:

# kubectl apply -f sa.yaml 

serviceaccount/kube-state-metrics created

Configure RBAC Permissions

To allow kube-state-metrics to read Kubernetes resources, create the corresponding ClusterRole.

Contents of rbac.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
- apiGroups: [""]
  resources: ["nodes/proxy"]
  verbs: ["get"]

Create a ClusterRoleBinding that binds the ClusterRole to the ServiceAccount created above.

Contents of clust.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitor-sa

Apply both:

# kubectl apply -f rbac.yaml

clusterrole.rbac.authorization.k8s.io/kube-state-metrics created


# kubectl apply -f clust.yaml

clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
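
Optionally, you can confirm that the binding grants the expected permissions by impersonating the ServiceAccount; both commands below should print "yes":

# kubectl auth can-i list pods --as=system:serviceaccount:monitor-sa:kube-state-metrics
# kubectl auth can-i list nodes --as=system:serviceaccount:monitor-sa:kube-state-metrics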

Deploy kube-state-metrics

Prepare the Deployment manifest jiankong.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: monitor-sa
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - image: registry.k8s.io/kube-state-metrics/kube-state-metrics:latest  # consider pinning a released version instead of latest
        imagePullPolicy: IfNotPresent
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        # telemetry port, referenced as targetPort: telemetry in svc.yaml
        - containerPort: 8081
          name: telemetry
          protocol: TCP

Then prepare its svc.yaml:

apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitor-sa
spec:
  ports:
  - name: http-metrics
    port: 8080
    protocol: TCP
    #targetPort: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    protocol: TCP
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
  sessionAffinity: None
  type: NodePort

Apply the manifests:

# kubectl apply -f jiankong.yaml
deployment.apps/kube-state-metrics created
# kubectl apply -f svc.yaml
service/kube-state-metrics created

Check the Pod status:

# kubectl -n monitor-sa get pods

NAME                                  READY   STATUS    RESTARTS   AGE
kube-state-metrics-5b7cf967d6-knhww   1/1     Running   0          40m

Check the mapped NodePorts:

# kubectl -n monitor-sa get svc
NAME                 TYPE       CLUSTER-IP        EXTERNAL-IP   PORT(S)                         AGE
kube-state-metrics   NodePort   192.168.144.178   <none>        8080:32470/TCP,8081:31602/TCP   75m
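
As a quick check, you can curl the metrics endpoint through the NodePort shown above (here using the node IP from the environment preparation); it should return plain-text Prometheus metrics such as kube_pod_info:

# curl -s http://10.255.101.217:32470/metrics | head -n 20
# curl -s http://10.255.101.217:32470/metrics | grep -c '^kube_pod_info'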

Deploy node-exporter (optional)

With many nodes in the cluster, we do not install node_exporter host by host as in a traditional setup. Instead, we create a DaemonSet so that Kubernetes automatically runs one node-exporter Pod on every node.

Prepare node.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor-sa
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.9.1
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9100
          name: metrics
        resources:
          requests:
            cpu: "150m"
          limits:
            cpu: "500m"
        securityContext:
          privileged: true  # set to false if not strictly required, for better security
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/rootfs
        # node_exporter >= 1.3 renamed the old ignored-mount-points flag
        - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
        volumeMounts:
        - name: dev
          mountPath: /host/dev
          readOnly: true
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
      tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /

Apply it:

# kubectl apply -f node.yaml
daemonset.apps/node-exporter created

Check the nodes and Pod placement:

# kubectl -n monitor-sa get nodes -o wide 

NAME             STATUS   ROLES                  AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
10-255-101-152   Ready    <none>                 41d   v1.22.0   10.255.101.152   <none>        CentOS Linux 7 (Core)   4.18.9-1.el7.elrepo.x86_64   containerd://1.4.3
10-255-101-216   Ready    <none>                 41d   v1.22.0   10.255.101.216   <none>        CentOS Linux 7 (Core)   4.18.9-1.el7.elrepo.x86_64   containerd://1.4.3
10-255-101-217   Ready    control-plane,master   41d   v1.22.0   10.255.101.217   <none>        CentOS Linux 7 (Core)   4.18.9-1.el7.elrepo.x86_64   containerd://1.4.3
10-255-101-82    Ready    <none>                 41d   v1.22.0   10.255.101.82    <none>        CentOS Linux 7 (Core)   4.18.9-1.el7.elrepo.x86_64   containerd://1.4.3

# kubectl -n monitor-sa get pods -o wide 

NAME                                  READY   STATUS    RESTARTS   AGE   IP               NODE             NOMINATED NODE   READINESS GATES
kube-state-metrics-5b7cf967d6-tk5kr   1/1     Running   0          87m   192.168.154.12   10-255-101-82    <none>           <none>
node-exporter-7sc7c                   1/1     Running   0          72m   10.255.101.152   10-255-101-152   <none>           <none>
node-exporter-d2w2z                   1/1     Running   0          72m   10.255.101.216   10-255-101-216   <none>           <none>
node-exporter-rc6bt                   1/1     Running   0          72m   10.255.101.82    10-255-101-82    <none>           <none>
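
Since node-exporter runs with hostNetwork, each instance is reachable on port 9100 of its node IP. For example, against one of the nodes above:

# curl -s http://10.255.101.152:9100/metrics | grep ^node_load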

Verify the ServiceAccount Token

List the secrets in the namespace and inspect the token of the kube-state-metrics ServiceAccount:

# kubectl -n monitor-sa get secrets 
NAME                             TYPE                                  DATA   AGE
default-token-wrbmj              kubernetes.io/service-account-token   3      5m9s
kube-state-metrics-token-bkrsr   kubernetes.io/service-account-token   3      3m41s

# kubectl -n monitor-sa describe secrets kube-state-metrics-token-bkrsr

The output includes the token, which Prometheus can use for authorization.
Important: copy the token contents to the Prometheus server, into the following file:

/data/app/prometheus/token

[root@10-255-101-216 prometheus]# cat token 
eyJhbGciOiJSUzI1NiIsImtpZCI6IlUyVjJSUGFyMWRDcWlZUUota2F0Q2xVY1pBTU45cW1HNEl2a1R2ajRlRzQifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJtb25pdG9yLXNhIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6Imt1YmUtc3RhdGUtbWV0cmljcy10b2tlbi1ia3JzciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlLXN0YXRlLW1ldHJpY3MiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJjZjRmODFlYS00Mzg3LTRhOGUtYjdlMC04ZjM1NjM0YjczMTciLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6bW9uaXRvci1zYTprdWJlLXN0YXRlLW1ldHJpY3MifQ.cco-tUhN7SeZL6H40ShY4WPwZ-h3TBQ2fLj1v64W9lCRAf2U0yTFackRO19odYY5YgVhujdaQcmMxfd3EGN_RQuQZv3p0AtRIXstOc9q9jdwFmQtGaPMjN-DuUWHa5Gx72jUXjgdXzEe6oHugjfFikBs13JCSU7uY3DfpDTIGWRorNz2hQCXWGJktydk_5J_mqH7y3DWsGNOLXZpENavVo25DMRgVvIGuRLTqh7atkcGGgke92cSSUJqhQ9RMqtrCApJ_8eZiL4r8vY-aF224yCqbzlMva1Jd2CMhagQbQIBQUeXzfMDRqVIyPv9KNziIKr68cA4XEaIv6yvqMzE8w
[root@10-255-101-216 prometheus]# 
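
Note: the auto-created ServiceAccount token Secret shown above exists because this cluster runs v1.22. On Kubernetes 1.24 and later such Secrets are no longer created automatically, so you would request a token explicitly instead, for example:

# kubectl -n monitor-sa create token kube-state-metrics --duration=8760h > /data/app/prometheus/token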

Prometheus Configuration Example

Save the ServiceAccount token obtained above to /data/app/prometheus/token on the Prometheus server.

Then add the following to the Prometheus configuration file prometheus.yml to scrape kube-state-metrics and node-exporter metrics:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # the NodePort address mapped by the Service shown above
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['10.255.101.217:32470']
        labels:
          env: test20250528

  # scrape node-exporter metrics
  - job_name: 'k8s-node-exporter'
    kubernetes_sd_configs:
    - role: pod
      api_server: https://10.255.101.217:6443
      bearer_token_file: /data/app/prometheus/token
      tls_config:
        insecure_skip_verify: true
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: node-exporter
      action: keep
    - target_label: env
      replacement: test20250528
    - source_labels: [__meta_kubernetes_pod_ip]
      target_label: __address__
      replacement: '${1}:9100'
      action: replace
    - source_labels: [__meta_kubernetes_pod_node_name]
      target_label: node
      action: replace
    - source_labels: [__meta_kubernetes_namespace]
      target_label: kubernetes_namespace
      action: replace
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: kubernetes_pod_name
      action: replace

  # scrape cadvisor metrics
  - job_name: test20250528-cadvisor
    honor_timestamps: true
    metrics_path: /metrics
    scheme: https
    kubernetes_sd_configs:
    - api_server: https://10.255.101.217:6443
      role: node
      bearer_token_file: /data/app/prometheus/token
      tls_config:
        insecure_skip_verify: true
    bearer_token_file: /data/app/prometheus/token
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - separator: ;
      regex: (.*)
      target_label: __address__
      replacement: 10.255.101.217:6443
      action: replace
    - source_labels: [__meta_kubernetes_node_name]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      action: replace
    - source_labels: [kubernetes_io_hostname]
      separator: ;
      regex: (.+)
      target_label: env_kubernetes_io_hostname
      replacement: test20250528-${1}
      action: replace
    - source_labels: [kubernetes_io_hostname]
      separator: ;
      regex: (.+)
      target_label: env
      replacement: test20250528
      action: replace
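
After editing the configuration, you can validate it with promtool and then reload Prometheus. The paths below assume prometheus.yml lives in /data/app/prometheus, and the reload endpoint requires Prometheus to have been started with --web.enable-lifecycle (otherwise send SIGHUP to the process or restart it):

# promtool check config /data/app/prometheus/prometheus.yml
# curl -X POST http://localhost:9090/-/reload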

Recap

With the steps above, you have:

  • Created the dedicated namespace monitor-sa
  • Created the kube-state-metrics ServiceAccount and the matching RBAC authorization
  • Deployed the kube-state-metrics monitoring component
  • Configured Prometheus to scrape kube-state-metrics and node-exporter metrics
  • Set up the token Prometheus uses to access the Kubernetes API server securely

Verification

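On the Prometheus web UI (Status → Targets), the kube-state-metrics, k8s-node-exporter and test20250528-cadvisor jobs should show as UP. You can also query the HTTP API from the Prometheus host as a quick check (adjust host and port to your setup):

# curl -s 'http://localhost:9090/api/v1/query?query=up{job="kube-state-metrics"}'
# curl -s 'http://localhost:9090/api/v1/query?query=count(kube_pod_info)'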

Add a Grafana Dashboard

Add the Prometheus data source

I only changed the URL; everything else was left at the defaults.
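If you prefer to script this instead of clicking through the UI, the data source can also be created via Grafana's HTTP API. This is only a sketch: it assumes default admin credentials, Grafana listening on port 3000, and Prometheus running on the 10.255.101.216 host used earlier; adjust to your environment.

# curl -s -u admin:admin -H 'Content-Type: application/json' -X POST http://localhost:3000/api/datasources -d '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"http://10.255.101.216:9090","isDefault":true}'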

Import a Kubernetes monitoring dashboard (the Grafana host needs internet access)

Dashboard ID: 10000
The dashboard is now imported. Panels that show no data may need minor query tweaks, which you can adjust as needed.


Additional Alerting Rules

# container-related alerting rules
groups:
- name: "ContainerRules"
  rules:
  - alert: "容器异常"
    expr: kube_pod_container_status_running{env="test20250528",pod !~ "security-inspector-polaris-cronjob.*"} != 1
    for: 90s
    labels:
      severity: Disaster
      env: test20250528
    annotations:
      summary: "ns:{{ $labels.namespace }} pod: {{ $labels.container }}]"
      description: "{{ $labels.instance }}: {{ $labels.namespace }} 服务{{ $labels.container }} 容器运行异常"

  # container memory usage alert (>80%)
  - alert: "ContainerMemoryUsage"
    expr: sum by(namespace,pod,container) (container_memory_rss{image!="",env="test20250528"}) / sum by(namespace,pod,container) (container_spec_memory_limit_bytes{image!="",env="test20250528"}) * 100  != +Inf > 80
    for: 1m
    labels:
      severity: Warning
      env: test20250528
    annotations:
      summary: "[{{ $labels.namespace }}/{{ $labels.pod }} - {{ $labels.container }}] Container memory usage warning"
      description: "Container memory usage is above 80%.\nVALUE = {{ $value | printf \"%.2f\" }}%\n"

  # container CPU usage alert (>80% - Warning)
  - alert: ContainerCpuUsage
    expr: sum by(container, namespace, pod) (irate(container_cpu_usage_seconds_total{env="test20250528",image!=""}[5m]) * 100) / sum by(container, namespace, pod) (container_spec_cpu_quota{env="test20250528",image!=""} / container_spec_cpu_period{env="test20250528",image!=""}) > 80
    for: 1m
    labels:
      severity: Warning
      env: test20250528
    annotations:
      summary: "[{{ $labels.namespace }}/{{ $labels.pod }} - {{ $labels.container }}] Container CPU usage warning"
      description: "Container CPU usage is above 80%.\nVALUE = {{ $value | printf \"%.2f\" }}%\n"

  # container CPU usage alert (>90% - Disaster)
  - alert: "ContainerCpuUsageCritical"
    expr: sum by(container, namespace, pod) (irate(container_cpu_usage_seconds_total{env="test20250528",image!=""}[5m]) * 100) / sum by(container, namespace, pod) (container_spec_cpu_quota{env="test20250528",image!=""} / container_spec_cpu_period{env="test20250528",image!=""}) > 90
    for: 1m
    labels:
      severity: Disaster
      env: test20250528
    annotations:
      summary: "[{{ $labels.namespace }}/{{ $labels.pod }} - {{ $labels.container }}] Container CPU usage critical"
      description: "Container CPU usage is above 90%.\nVALUE = {{ $value | printf \"%.2f\" }}%\n"

  - alert: "容器重启"
    expr: rate(kube_pod_container_status_restarts_total{env="test20250528"}[15m]) > 0
    for: 5m
    labels:
      severity: Disaster
      env: test20250528
    annotations:
      summary: "[{{ $labels.namespace }}/{{ $labels.pod }} - {{ $labels.container }}] 容器发生重启"
      description: "{{ $labels.namespace }} 命名空间中的容器 {{ $labels.container }}(所属 Pod: {{ $labels.pod }})在过去 15 分钟内发生了重启)"

Conclusion

This completes the basic framework of the Kubernetes monitoring stack. You can add further metrics and alerting rules as your workloads require.