K8S Game Process Monitoring: Field Notes

Offline deployment and architecture summary, from zero to one

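Core architecture and data flow

This guide walks through the full process of integrating process-exporter into GameServerSet-based game applications on Kubernetes (ACK), deploying the core components of kube-prometheus-stack, and forwarding all process metrics to a central Prometheus service. The document has been sanitized and can serve as a standard implementation SOP.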

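Architecture data flow (SVG)

The diagram below shows the full path of the "configuration flow" (blue dashed lines) and the "data flow" (black curves). It makes clear how **one** PodMonitor discovers **multiple** kinds of GameServerSet Pods.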

[Diagram: architecture and data flow]
Configuration layer (monitoring namespace): the Prometheus Operator (prometheus-operator-game) is the starting point of auto-discovery. It (A) watches the PodMonitor CRD my-game-monitor (Label: { release: prometheus }, Selector: { servertype: In [...] }) and (B) configures the local Prometheus (prometheus-game-instance, PodMonitorSelector: { release: prometheus }).
Data layer: the local Prometheus (prometheus-game-instance, ExternalLabel: { cluster: ... }) (1) scrapes the GSS Pods in the game cluster namespace your-game-namespace, one-to-many: Game (label 'servertype: game', /app/logic_server), Charge (label 'servertype: charge', /app/charge_server), Lister (label 'servertype: lister', /app/lister_server), and so on, each running /app/process-exporter. It then (2) remote-writes to the central Prometheus (http://your-central-url.com), which (3) Grafana queries as its data source, for example up{cluster=...}.

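Core concepts

Before starting the SOP, it helps to understand what each key component is responsible for. The system is declarative: several "actors" cooperate, each with its own role.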

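Prometheus Operator

This is the prometheus-operator-game component we install. It is not Prometheus itself; it is the controller that manages Prometheus on our behalf.

  • It continuously watches the K8S cluster for specific custom resources (defined by CRDs).
  • When it finds a PodMonitor resource (flow A), it reads that resource's definition.
  • It converts the definition into a Prometheus scrape configuration (scrape_config).
  • Finally, it updates the configuration of the Prometheus instance it manages (prometheus-game-instance) (flow B) and triggers a reload so the change takes effect.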

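PodMonitor

This is the my-game-monitor resource we create through the app-monitors chart.

  • It is a declaration that tells the Operator: "please monitor every Pod in the cluster that matches these conditions (namespace, labels)".
  • As the diagram shows, a single PodMonitor can discover Pods of several different `servertype` values through a list in its selector.
  • It performs no monitoring itself; it is only a target definition.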

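process-exporter

This is an exporter that reads /proc to inspect every process in its own PID namespace (here, the processes inside the container) and exposes their CPU, memory, file-descriptor, and other metrics in Prometheus format at the /metrics endpoint (port 9256 by default).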

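Remote Write

After the local Prometheus (prometheus-game-instance) scrapes the data (flow 1), this setting forwards a copy of everything to the central Prometheus (flow 2). The local instance becomes a "collect and forward" station.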

External Labels

This is externalLabels: { cluster: 'your-cluster-name' } in custom-values.yaml. It stamps every metric leaving this cluster. Without it, the central Prometheus could not tell which cluster each of the thousands of up series came from.

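Key decisions

We made several trade-offs during implementation; they are the foundation of this design.

Decision 1: how to deploy process-exporter

We chose to build the process-exporter binary into the game image instead of running it as a sidecar. This is the most important decision in the design.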

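| Item | Option A: sidecar | Option B: binary inside the image (★ chosen) |
| --- | --- | --- |
| Deployment | Add a process-exporter container to the GameServerSet's containers list. | COPY the process-exporter binary into the game image in the Dockerfile. |
| Pros | (1) Separation of concerns; the game image stays clean. (2) The exporter can be updated independently. | (1) Visibility: it runs in the same PID namespace as the game processes, so it can see them without extra configuration. (2) Simple to deploy; no extra container per Pod. |
| Cons | (1) PID namespace isolation: by default a sidecar cannot see the main container's processes. (2) Requires shareProcessNamespace: true (a Pod-level setting) or hostPID: true (high risk), which adds configuration complexity and security exposure. | (1) Image coupling: the game image is "polluted". (2) Updating the exporter means rebuilding the game image. |
| Conclusion | Not adopted for process monitoring because of PID isolation. | The best fit. To solve the root problem of process monitoring (access to the PIDs), we accept the cost of image coupling. |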

Decision 2: how to manage the PodMonitor

We chose a standalone app-monitors chart instead of placing the PodMonitor inside the game's business chart.

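| Option | Pros | Cons | Decision |
| --- | --- | --- | --- |
| A: PodMonitor embedded in the game chart | Deployed together with the game; strong atomicity. | (1) In this setup PodMonitor resources live in the monitoring namespace, so the business chart would have to operate across namespaces, which muddles permissions. (2) Monitoring configuration ends up scattered across business charts and is hard to manage centrally. | Not recommended |
| B: standalone app-monitors chart | (1) Single responsibility: all PodMonitors are deployed and managed together in the monitoring namespace. (2) Easy to maintain and extend. (3) Matches the way Prometheus Operator is meant to be used. | One extra chart to maintain. | ★ Final choice |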

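Step 1: Install the Prometheus core components

Goal: deploy the core components of kube-prometheus-stack (Operator, Prometheus) into the monitoring namespace, offline and without naming conflicts, with Grafana and Alertmanager disabled.

1.1 Preparation: download and push images (online machine)

On a machine with internet access, download the Helm chart and push every image it depends on to the private registry.

  1. Download the chart: kube-prometheus-stack-75.3.0.tgz
  2. Create the migrate-images.sh script: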

#!/bin/bash
# Usage: bash migrate-images.sh <CHART_FILE.tgz> <PRIVATE_REGISTRY/NAMESPACE>
# Example: bash migrate-images.sh ./kube-prometheus-stack-75.3.0.tgz your-harbor-registry.com/your-project-namespace

set -e
CHART_FILE="$1"
PRIVATE_REGISTRY="$2"
if [ -z "$CHART_FILE" ] || [ -z "$PRIVATE_REGISTRY" ]; then
  echo "Error: missing arguments!" >&2
  exit 1
fi

echo "Helm Chart: $CHART_FILE"
echo "Target private registry: $PRIVATE_REGISTRY"

# Requires yq (https://github.com/mikefarah/yq)
# Collect the value of every 'image' key in the rendered manifests, de-duplicated
IMAGES=$(helm template "$CHART_FILE" | yq e '.. | select(has("image")) | .image' - | grep -v '^---$' | sort -u)

if [ -z "$IMAGES" ]; then
  echo "Error: no images found in the chart." >&2
  exit 1
fi

echo "Unique images found:"
echo "$IMAGES"
echo "---"

# Log in to the registry host only (strip the project path from the argument)
docker login "${PRIVATE_REGISTRY%%/*}"

for IMAGE in $IMAGES; do
  # Keep only the last path segment (name:tag) and prefix it with the private registry
  IMAGE_NAME=$(echo "$IMAGE" | sed 's|.*/||')
  NEW_IMAGE="$PRIVATE_REGISTRY/$IMAGE_NAME"

  echo "==> Processing: $IMAGE"
  if ! docker pull "$IMAGE"; then
    echo "Warning: failed to pull $IMAGE, skipping." >&2
    continue
  fi
  docker tag "$IMAGE" "$NEW_IMAGE"
  if ! docker push "$NEW_IMAGE"; then
    echo "Warning: failed to push $NEW_IMAGE, skipping." >&2
    continue
  fi
  echo "Migrated: $IMAGE -> $NEW_IMAGE"
done

echo "Image migration finished!"
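
Note: handling missed images

This script may miss prometheus-config-reloader. If pulling that image fails during installation, migrate it manually:


# (original image)
docker pull quay.io/prometheus-operator/prometheus-config-reloader:v0.83.0
# (retag and push; replace the target with your own registry address)
docker tag ... your-harbor-registry.com/your-project-namespace/prometheus-config-reloader:v0.83.0
docker push ...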
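
A likely reason the script misses this image: the chart hands it to the operator as a command-line argument (--prometheus-config-reloader=...) rather than through an image: field, so a search for image keys never sees it. The sketch below is one way to pick it up from the rendered manifests. The function name extract_reloader_image is hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper (not part of migrate-images.sh): read rendered manifests
# on stdin and print the image passed via --prometheus-config-reloader=...
extract_reloader_image() {
  # -o prints only the matched text; the match stops at a quote or whitespace.
  grep -o -- '--prometheus-config-reloader=[^"[:space:]]*' \
    | sed 's|^--prometheus-config-reloader=||' \
    | sort -u
}

# Usage (assumes helm is installed and the chart file is in the current directory):
#   helm template ./kube-prometheus-stack-75.3.0.tgz | extract_reloader_image
```

The printed image could then be appended to the IMAGES list before the pull/tag/push loop runs.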

1.2 Offline installation (K8S management node)

1. Create the namespace


kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -

2. Install the CRDs manually (key step)

To avoid Helm timeouts, the CRDs must be applied manually first.


# 1. Extract the crds directory from the chart package
tar -xvf kube-prometheus-stack-75.3.0.tgz kube-prometheus-stack/charts/crds/crds/

# 2. Install with server-side apply (avoids the oversized 'metadata.annotations' problem)
kubectl apply --server-side -f kube-prometheus-stack/charts/crds/crds/
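
Before running Helm it can be worth confirming that the CRDs this guide relies on are registered. A minimal sketch, assuming the output of kubectl get crd -o name is piped in; the function name missing_crds and the short list of CRDs checked are illustrative choices, not part of the original procedure.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get crd -o name` output on stdin and print
# every CRD from a short required list that is not present.
missing_crds() {
  present=$(cat)
  for crd in prometheuses podmonitors servicemonitors prometheusrules; do
    case "$present" in
      *"/${crd}.monitoring.coreos.com"*) ;;        # registered, nothing to report
      *) echo "${crd}.monitoring.coreos.com" ;;    # not found in the input
    esac
  done
}

# Usage (assumes kubectl is configured for the target cluster):
#   kubectl get crd -o name | missing_crds
```

Empty output means all four CRDs were found.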

3. Create the K8S Secrets

A. Private registry credentials (harbor-credential)


# (replace with your own private registry details)
kubectl create secret docker-registry harbor-credential \
  --namespace=monitoring \
  --docker-server=your-harbor-registry.com \
  --docker-username=your-username \
  --docker-password=your-password

B. Remote write credentials (prometheus-rw-credentials)


# (replace with your central Prometheus credentials)
kubectl create secret generic prometheus-rw-credentials \
  --namespace=monitoring \
  --from-literal=username='central-prometheus-user' \
  --from-literal=password='central-prometheus-password'

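4. Prepare the custom-values.yaml file

This file is the core of the deployment. It has been sanitized and contains every key setting (conflict avoidance, private registry references, Remote Write).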


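# File: custom-values.yaml
# Values for an offline install of kube-prometheus-stack in the UAT/Test environment
# - Grafana and Alertmanager are disabled
# - Global imagePullSecrets are configured
# - All image references point to the private Harbor registry
# - Resource names are changed (e.g. 'prometheus-game-instance') to avoid clashing with ACK's built-in components
# - Remote write only; the local UI port is not exposed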
global:
  imagePullSecrets:
    - name: harbor-credential

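# 1. Disable Alertmanager and Grafana
alertmanager:
  enabled: false

grafana:
  enabled: false

# 2. Disable CRD installation (already applied manually)
crds:
  enabled: false

# 3. To avoid clashing with the existing Prometheus, override the generated resource names
# (using 'prometheus-game-instance' as the prefix); the namespace provides further isolation
fullnameOverride: "prometheus-game-instance"
nameOverride: "prometheus-game-instance"

# 4. Override the image registry, repository, and tag of every active component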
kube-state-metrics:
  nameOverride: "kube-state-metrics-game"
  fullnameOverride: "kube-state-metrics-game"
  image:
    registry: your-harbor-registry.com
    repository: your-project-namespace/kube-state-metrics
    tag: v2.15.0
  selfMonitor:
    enabled: false

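# node-exporter is switched off through the parent chart's nodeExporter.enabled
# condition; an 'enabled' key under 'prometheus-node-exporter' is not what gates it.
nodeExporter:
  enabled: false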

prometheusOperator:
  nameOverride: "prometheus-operator-game"
  fullnameOverride: "prometheus-operator-game"
  image:
    registry: your-harbor-registry.com
    repository: your-project-namespace/prometheus-operator
    tag: v0.83.0
  
  # Use a dedicated serviceAccount to avoid RBAC conflicts
  serviceAccount:
    create: true
    name: "prometheus-operator-game-sa"
  rbac:
    create: true
    pspEnabled: false

  admissionWebhooks:
    patch:
      image:
        registry: your-harbor-registry.com
        repository: your-project-namespace/kube-webhook-certgen
        tag: v1.5.4
  
  prometheusConfigReloader:
    image:
      registry: your-harbor-registry.com
      repository: your-project-namespace/prometheus-config-reloader
      tag: v0.83.0

# 5. Configure the Prometheus instance (core)
prometheus:
  nameOverride: "prometheus-game-instance"
  fullnameOverride: "prometheus-game-instance"
  prometheusSpec:
    # Key: label every metric with the cluster name so the central Prometheus can identify it
    externalLabels:
      cluster: 'your-cluster-name'
      instance: 'prometheus-game-instance'
    
    # -------------------------------------------------------------
    # !! Key setting (see Core concepts) !!
    # Tells this Prometheus instance which PodMonitors to "claim":
    # only PodMonitors whose metadata.labels include { release: "prometheus" }
    podMonitorSelector:
      matchLabels:
        release: prometheus
    # -------------------------------------------------------------

    # Key: configure remote write
    remoteWrite:
      - url: "http://your-central-prometheus-url.com:9090/api/v1/write"
        basicAuth:
          username:
            name: "prometheus-rw-credentials" # Secret created in step 1.2, item 3
            key: "username"
          password:
            name: "prometheus-rw-credentials" # Secret created in step 1.2, item 3
            key: "password"
            
    image:
      registry: your-harbor-registry.com
      repository: your-project-namespace/prometheus
      tag: v3.4.1
      
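    # Remote write only; the admin API stays disabled
    enableAdminAPI: false
      
    # Rule selection. Note that an empty selector ({}) does not mean "select nothing".
    # With the chart default ruleSelectorNilUsesHelmValues: true, only PrometheusRule
    # objects labeled release: <Helm release name> are loaded, which is what keeps
    # the rules shipped with ACK out.
    ruleNamespaceSelector: {}
    ruleSelector: {}
    
    # Listen on the loopback interface only, so the UI/API port is not reachable from outside the Pod
    listenLocal: true

  # The two blocks below are chart values at the 'prometheus' level (siblings of
  # prometheusSpec); nested under prometheusSpec they would not take effect.
  # Use a dedicated serviceAccount to avoid conflicts
  serviceAccount:
    name: "prometheus-game-instance"

  # Do not expose a Service port
  service:
    enabled: false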

5. Run the Helm install


helm install prometheus ./kube-prometheus-stack-75.3.0.tgz \
  --namespace monitoring \
  -f ./custom-values.yaml

Verify: run kubectl get pods -n monitoring and wait until all Pods (prometheus-operator-game-xxx, prometheus-game-instance-xxx, etc.) are Running.

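Step 2: Modify the game application (embed the exporter)

Goal: have the game container start a process-exporter process alongside the game processes so that metrics are exposed. (For the rationale, see Key decisions.)

2.1 Prepare the process-exporter configuration file

Prepare process-exporter.yaml in the config/ directory of the image build context; the Dockerfile in 2.2 copies it into the image. The {{.ExeFull}} template names each process group after the full path of its executable, so every process is matched dynamically.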


# config/process-exporter.yaml
process_names:
- name: "{{.ExeFull}}"
  cmdline:
  - '.+'
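
With {{.ExeFull}} as the name template, each group's groupname label should be the executable's full path. Once a container is running, the /metrics output can be checked for this. A minimal sketch that lists the distinct groupname values; the function name list_groupnames is hypothetical, and it assumes the exporter's namedprocess_namegroup_num_procs metric is present in the output.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the distinct groupname label values reported by process-exporter.
list_groupnames() {
  grep '^namedprocess_namegroup_num_procs{' \
    | sed -n 's/.*groupname="\([^"]*\)".*/\1/p' \
    | sort -u
}

# Usage inside a running game container (assumes curl is available there and
# the exporter listens on its default port 9256):
#   curl -s http://localhost:9256/metrics | list_groupnames
```

If the list contains only the exporter itself, the game processes are probably running in a different PID namespace than the exporter.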

2.2 Modify the Dockerfile

Make sure the process-exporter binary and its configuration file are copied into the image.


# ... (your other Dockerfile instructions)

# Copy process-exporter
COPY ./bin/process-exporter /usr/local/bin/process-exporter
RUN chmod +x /usr/local/bin/process-exporter

# Copy the configuration file (target path assumed to be /etc/process-exporter.yaml)
COPY ./config/process-exporter.yaml /etc/process-exporter.yaml

# ...
CMD ["/app/start_game.sh"]

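2.3 Modify the startup script (start_game.sh)

The script must start process-exporter in the background and end with a foreground process so that the container does not exit.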


#!/bin/sh

echo "Starting monitoring agent (process-exporter)..."
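# 1. Start process-exporter in the background with the correct config file path
/usr/local/bin/process-exporter -config.path /etc/process-exporter.yaml &

echo "Starting game processes..."
# 2. Run your existing logic that starts the game processes (assumed to run in the background too)
/app/logic_server &
/app/scene_server &
# ...

# 3. Key: keep one process in the foreground (for example tail).
#    Note: with tail in the foreground, the container keeps running even if a game process exits.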
echo "All processes started. Tailing logs to keep container alive."
tail -f /dev/null
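
Because tail -f /dev/null never exits, Kubernetes will not restart the container when a game process crashes. One possible variation, sketched below under that assumption, is to record each background PID and block until any of them exits. The function name wait_for_any_exit is hypothetical, and the one-second polling interval is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: block until any one of the given PIDs has exited.
# Intended for PIDs of background jobs started by this same shell.
wait_for_any_exit() {
  while :; do
    for pid in "$@"; do
      # kill -0 sends no signal; it only checks whether the process still exists.
      kill -0 "$pid" 2>/dev/null || return 0
    done
    sleep 1
  done
}
```

In start_game.sh this would mean saving $! after each background start (for example logic_pid=$!) and replacing the tail line with wait_for_any_exit "$logic_pid" "$scene_pid" followed by exit 1, so the container stops and is restarted when a game process dies.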

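Step 3: K8S resource configuration (GameServerSet)

Goal: modify the game's GameServerSet (GSS) definition to expose the `metrics` port and add the `label` used for service discovery.

This is the precondition for the PodMonitor to find the game Pods. Apply this configuration to every GSS you have (`charge`, `game`, `lister`, and so on).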


apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: my-game # (e.g. 'game', 'charge', 'lister'...)
  # The namespace must match the one configured in app-monitors
  namespace: your-game-namespace 
spec:
  replicas: 1 # (or your replica count)
  gameServerTemplate:
    metadata:
      labels:
        # -------------------------------------------------------------
        # !! Key label !!
        # Prometheus (through the PodMonitor) discovers the Pod by this label.
        # It must match a value in the serverTypes list of app-monitors/values.yaml
        servertype: "game" # (or 'charge', 'lister'...)
        # -------------------------------------------------------------
    spec:
      containers:
      - name: game
        image: your-game-image:latest
        ports:
        # -------------------------------------------------------------
        # !! Key port !!
        # Exposes the process-exporter port (9256 by default)
        - name: metrics
          containerPort: 9256
          protocol: TCP
        # -------------------------------------------------------------
        # ... (your other game ports)

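Step 4: Deploy auto-discovery (PodMonitor)

Goal: deploy a PodMonitor resource that tells the Prometheus instance installed in step 1 where and how to find the game Pods. (For the rationale, see Key decisions.)

4.1 Configure app-monitors/values.yaml

This file defines which namespace the PodMonitor searches and which servertype values it looks for. The list should contain the `servertype` label of every GSS you run.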


# app-monitors/values.yaml

default: 
  interval: 15s

myGame:
  enabled: true
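  # Namespace where the game application is deployed (must match the namespace in the GSS)
  namespace: "your-game-namespace"
  
  # Label values used to select GameServer Pods (must match gameServerTemplate.metadata.labels in the GSS)
  # (this list comes from your kubectl output and yaml files)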
  serverTypes:
    - "lister"
    - "game"
    - "global"
    - "router"
    - "match"
    - "collector"
    - "charge"
    - "manage"
    - "cross"
    - "member-game"
    # ... (other server types)

4.2 Configure app-monitors/templates/my-game-podmonitor.yaml

This is the PodMonitor template. It implements the key "three-way handshake":

  1. The local Prometheus (prometheus-game-instance) uses podMonitorSelector to look for the release: prometheus label.
  2. The PodMonitor carries the release: prometheus label, so Prometheus "claims" it.
  3. The PodMonitor uses namespaceSelector and selector (fed by the `serverTypes` list in `values.yaml`) to find every matching Pod.

{{- if .Values.myGame.enabled -}}
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-game-monitor
  # -------------------------------------------------------------
  # !! Key: in this setup the resource lives in the 'monitoring' namespace, !!
  # next to the Prometheus instance that selects it
  namespace: monitoring
  # -------------------------------------------------------------
  labels:
    # -------------------------------------------------------------
    # !! Key: this label must be 'prometheus' !!
    # so that the Prometheus instance deployed in step 1 selects it (matches podMonitorSelector)
    release: prometheus
    # -------------------------------------------------------------
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: app-monitors
spec:
  # Key: the namespace in which to look for target Pods
  namespaceSelector:
    matchNames:
      - {{ .Values.myGame.namespace | quote }}
      
  # Key: use matchExpressions to match several servertype values
  selector:
    matchExpressions:
      - key: servertype
        operator: In
        values:
          # (renders the serverTypes list from values.yaml)
          {{- toYaml .Values.myGame.serverTypes | nindent 10 }}
          
  podMetricsEndpoints:
  - port: "metrics" # matches the name of port 9256 defined in the GSS
    path: "/metrics"
    interval: {{ .Values.default.interval }}
    # (optional but recommended) copy the Pod label 'servertype' onto the metrics
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_label_servertype]
      targetLabel: servertype
{{- end -}}

4.3 Deploy the app-monitors chart


# (make sure the app-monitors chart and values.yaml are ready)
helm upgrade --install app-monitors ./app-monitors \
  --namespace monitoring \
  -f ./app-monitors/values.yaml
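
A Pod whose servertype value is missing from serverTypes is silently left out of scraping, because the In selector simply does not match it. The sketch below compares the label values found on running Pods with the configured list. The function name unlisted_servertypes is hypothetical, and the arguments must be kept in sync with values.yaml by hand.

```shell
#!/bin/sh
# Hypothetical helper: stdin carries one servertype label value per line (as
# found on running Pods); the arguments are the values listed under serverTypes.
# Prints each label value that the PodMonitor selector would not match.
unlisted_servertypes() {
  sort -u | while IFS= read -r found; do
    [ -n "$found" ] || continue          # skip Pods that have no servertype label
    covered=no
    for listed in "$@"; do
      [ "$found" = "$listed" ] && covered=yes
    done
    [ "$covered" = yes ] || echo "$found"
  done
}

# Usage (assumes kubectl access to the game namespace):
#   kubectl get pods -n your-game-namespace \
#     -o jsonpath='{range .items[*]}{.metadata.labels.servertype}{"\n"}{end}' \
#     | unlisted_servertypes lister game global router match collector charge manage cross member-game
```

Empty output means every labeled Pod is covered by the selector.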

Step 5: Configure Remote Write (data reporting)

This step was already completed in custom-values.yaml in step 1.

Two key settings are worth repeating here:


# ... in custom-values.yaml -> prometheus.prometheusSpec ...

    # 1. External labels:
    #    every reported metric automatically carries {cluster="your-cluster-name"}.
    #    This is the label that identifies the data source in the central Prometheus.
    externalLabels:
      cluster: 'your-cluster-name'
      instance: 'prometheus-game-instance'

    # 2. Remote write:
    #    points at the central Prometheus and authenticates with the Secret (prometheus-rw-credentials).
    remoteWrite:
      - url: "http://your-central-prometheus-url.com:9090/api/v1/write"
        basicAuth:
          username:
            name: "prometheus-rw-credentials"
            key: "username"
          password:
            name: "prometheus-rw-credentials"
            key: "password"

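Step 6: Grafana dashboards and reading the data

The data now reaches the central Prometheus. The last step is to build a dashboard in the central Grafana.

6.1 Key PromQL verification queries

Log in to the central Prometheus and use the following PromQL queries to verify that the data is arriving:

1. Check the up metric of the your-cluster-name cluster: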


# Confirm that the Prometheus instance of the 'your-cluster-name' cluster is sending data to the central Prometheus
up{cluster="your-cluster-name"}

2. Check that the game-server monitor (my-game-monitor) targets are up:


# Query every target scraped by 'my-game-monitor' in the 'your-cluster-name' cluster
# Expected: one series per GSS Pod, each with value 1 (UP)
up{cluster="your-cluster-name", job="monitoring/my-game-monitor"}

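3. Query the CPU usage of all game processes (core):


# Aggregate by process group (groupname): average CPU usage over 5 minutes, in cores.
# process-exporter reports per-group CPU as namedprocess_namegroup_cpu_seconds_total;
# process_cpu_seconds_total only describes the exporter process itself and has no groupname label.
# Expected: CPU for /app/logic_server, /app/scene_server, and the other processes
sum(rate(namedprocess_namegroup_cpu_seconds_total{cluster="your-cluster-name", job="monitoring/my-game-monitor"}[5m])) by (groupname)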

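4. Query the resident memory of all game processes (core):


# Aggregate by process group (groupname), summing the resident memory of all processes in the group
sum(namedprocess_namegroup_memory_bytes{cluster="your-cluster-name", job="monitoring/my-game-monitor", memtype="resident"}) by (groupname)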
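
To read the CPU query's result: the underlying metric is a cumulative count of CPU seconds, and rate() turns it into CPU seconds consumed per second, which is the same as cores in use. The arithmetic can be sketched as follows. The function name cpu_cores and the sample values are made up for illustration, and the real rate() additionally handles counter resets and extrapolation at the window edges.

```shell
#!/bin/sh
# Hypothetical helper: average cores used between two samples of a cumulative
# CPU-seconds counter. Arguments: first value, second value, seconds elapsed.
cpu_cores() {
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN { printf "%.2f\n", (b - a) / dt }'
}

# A counter that grows from 120 to 150 CPU-seconds within 60 seconds:
cpu_cores 120 150 60   # prints 0.50
```

A value of 0.50 means half a core on average; multiplying by 100, as the dashboard panels do, expresses it as a percentage of one core.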

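6.2 Grafana dashboard JSON

The JSON below can be imported directly into a self-hosted Grafana. Its queries hard-code cluster="game-test" and the data source UID Ar2rF7GHk; replace both with your own values before importing.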


{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
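  },
  "description": "This dashboard monitors Pod- and process-level metrics of the game application (process-exporter). [v2: merged the 'stable' and 'KEDA' cards into a 'global overview' to keep the data consistent]",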
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 183,
  "links": [],
  "panels": [
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 10,
      "panels": [],
      "title": "Overall Health",
      "type": "row"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Ar2rF7GHk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "decimals": 0,
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "red",
                "value": 0
              },
              {
                "color": "green",
                "value": 1
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 1
      },
      "id": 1,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "percentChangeColorMode": "standard",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "showPercentChange": false,
        "textMode": "auto",
        "wideLayout": true
      },
      "pluginVersion": "12.1.0",
      "targets": [
        {
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"})",
          "instant": true,
          "legendFormat": "UP",
          "range": false,
          "refId": "A"
        },
        {
          "editorMode": "code",
          "exemplar": false,
          "expr": "count(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"})",
          "instant": true,
          "legendFormat": "Total",
          "range": false,
          "refId": "B"
        }
      ],
      "title": "Game service overview (UP / Total) - instant",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Ar2rF7GHk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {
            "align": "auto",
            "cellOptions": {
              "type": "auto"
            },
            "inspect": false
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "red",
                "value": 0
              },
              {
                "color": "green",
                "value": 1
              }
            ]
          },
          "unit": "short"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "servertype"
            },
            "properties": [
              {
                "id": "displayName",
                "value": "Server Type"
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "Value #A"
            },
            "properties": [
              {
                "id": "displayName",
                "value": "UP"
              },
              {
                "id": "mappings",
                "value": [
                  {
                    "options": {
                      "0": {
                        "color": "red",
                        "text": "Down"
                      }
                    },
                    "type": "value"
                  }
                ]
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "Value #B"
            },
            "properties": [
              {
                "id": "displayName",
                "value": "TOTAL"
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 1
      },
      "id": 2,
      "options": {
        "cellHeight": "sm",
        "footer": {
          "countRows": false,
          "fields": "",
          "reducer": [
            "sum"
          ],
          "show": true
        },
        "frameIndex": 1,
        "showHeader": true,
        "sortBy": [
          {
            "desc": true,
            "displayName": "UP"
          }
        ]
      },
      "pluginVersion": "12.1.0",
      "targets": [
        {
          "expr": "sum(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"}) by (servertype)",
          "format": "table",
          "instant": true,
          "legendFormat": "UP",
          "range": false,
          "refId": "A"
        },
        {
          "expr": "count(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"}) by (servertype)",
          "format": "table",
          "instant": true,
          "legendFormat": "TOTAL",
          "range": false,
          "refId": "B"
        }
      ],
      "title": "Health by ServerType (instant)",
      "type": "table"
    },
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 9
      },
      "id": 11,
      "panels": [],
      "title": "Process & Pod Metrics",
      "type": "row"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Ar2rF7GHk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "barWidthFactor": 0.6,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "opacity",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": 0
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "bytes"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 10
      },
      "id": 3,
      "options": {
        "legend": {
          "calcs": [
            "lastNotNull",
            "mean",
            "max"
          ],
          "displayMode": "table",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "hideZeros": false,
          "mode": "multi",
          "sort": "desc"
        }
      },
      "pluginVersion": "12.1.0",
      "targets": [
        {
          "expr": "sum(namedprocess_namegroup_memory_bytes{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", memtype=\"resident\", servertype=~\"$servertype\", pod=~\"$pod\", groupname=~\"$groupname\"}) by (groupname)",
          "instant": false,
          "legendFormat": "{{groupname}}",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Process memory (resident) - by process group",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Ar2rF7GHk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "barWidthFactor": 0.6,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "opacity",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": 0
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "bytes"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 10
      },
      "id": 4,
      "options": {
        "legend": {
          "calcs": [
            "lastNotNull",
            "mean",
            "max"
          ],
          "displayMode": "table",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "hideZeros": false,
          "mode": "multi",
          "sort": "desc"
        }
      },
      "pluginVersion": "12.1.0",
      "targets": [
        {
          "expr": "sum(namedprocess_namegroup_memory_bytes{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", memtype=\"resident\", servertype=~\"$servertype\", pod=~\"$pod\"}) by (pod)",
          "instant": false,
          "legendFormat": "{{pod}}",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Pod memory (resident, monitored processes) - by Pod",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Ar2rF7GHk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "barWidthFactor": 0.6,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "opacity",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "max": 100,
          "min": 0,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": 0
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 18
      },
      "id": 5,
      "options": {
        "legend": {
          "calcs": [
            "lastNotNull",
            "mean",
            "max"
          ],
          "displayMode": "table",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "hideZeros": false,
          "mode": "multi",
          "sort": "desc"
        }
      },
      "pluginVersion": "12.1.0",
      "targets": [
        {
          "expr": "sum(rate(namedprocess_namegroup_cpu_seconds_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\", groupname=~\"$groupname\"}[5m])) by (groupname) * 100",
          "instant": false,
          "legendFormat": "{{groupname}}",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Process CPU usage - by process group",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Ar2rF7GHk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "barWidthFactor": 0.6,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "opacity",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "max": 100,
          "min": 0,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": 0
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 18
      },
      "id": 6,
      "options": {
        "legend": {
          "calcs": [
            "lastNotNull",
            "mean",
            "max"
          ],
          "displayMode": "table",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "hideZeros": false,
          "mode": "multi",
          "sort": "desc"
        }
      },
      "pluginVersion": "12.1.0",
      "targets": [
        {
          "expr": "sum(rate(namedprocess_namegroup_cpu_seconds_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\"}[5m])) by (pod) * 100",
          "instant": false,
          "legendFormat": "{{pod}}",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Pod CPU usage (monitored processes) - by Pod",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Ar2rF7GHk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "barWidthFactor": 0.6,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": 0
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 0,
        "y": 26
      },
      "id": 7,
      "options": {
        "legend": {
          "calcs": [
            "lastNotNull"
          ],
          "displayMode": "table",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "hideZeros": false,
          "mode": "multi",
          "sort": "desc"
        }
      },
      "pluginVersion": "12.1.0",
      "targets": [
        {
          "expr": "sum(process_open_fds{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\"}) by (pod)",
          "instant": false,
          "legendFormat": "{{pod}}",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Pod File Descriptors (FDs) - by Pod",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Ar2rF7GHk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "barWidthFactor": 0.6,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": 0
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "B/s"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 8,
        "y": 26
      },
      "id": 8,
      "options": {
        "legend": {
          "calcs": [
            "lastNotNull"
          ],
          "displayMode": "table",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "hideZeros": false,
          "mode": "multi",
          "sort": "desc"
        }
      },
      "pluginVersion": "12.1.0",
      "targets": [
        {
          "expr": "sum(rate(namedprocess_namegroup_read_bytes_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\", groupname=~\"$groupname\"}[5m])) by (groupname)",
          "instant": false,
          "legendFormat": "Read - {{groupname}}",
          "range": true,
          "refId": "A"
        },
        {
          "expr": "sum(rate(namedprocess_namegroup_write_bytes_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\", groupname=~\"$groupname\"}[5m])) by (groupname)",
          "instant": false,
          "legendFormat": "Write - {{groupname}}",
          "range": true,
          "refId": "B"
        }
      ],
      "title": "Process IO - by Process Group (5m rate)",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Ar2rF7GHk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "barWidthFactor": 0.6,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": true,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": 0
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 16,
        "y": 26
      },
      "id": 9,
      "options": {
        "legend": {
          "calcs": [
            "lastNotNull"
          ],
          "displayMode": "table",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "hideZeros": false,
          "mode": "multi",
          "sort": "desc"
        }
      },
      "pluginVersion": "12.1.0",
      "targets": [
        {
          "expr": "sum(go_goroutines{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\"}) by (pod)",
          "instant": false,
          "legendFormat": "{{pod}}",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Go Goroutines - by Pod",
      "type": "timeseries"
    }
  ],
  "preload": false,
  "refresh": "30s",
  "schemaVersion": 41,
  "tags": [
    "game",
    "prometheus",
    "process-exporter"
  ],
  "templating": {
    "list": [
      {
        "current": {
          "text": "VictoriaMetrics-大陆",
          "value": "Ar2rF7GHk"
        },
        "label": "Prometheus Datasource",
        "name": "DS_PROMETHEUS",
        "options": [],
        "query": "prometheus",
        "refresh": 1,
        "type": "datasource"
      },
      {
        "allValue": ".*",
        "current": {
          "text": "All",
          "value": [
            "$__all"
          ]
        },
        "datasource": {
          "type": "prometheus",
          "uid": "Ar2rF7GHk"
        },
        "includeAll": true,
        "label": "Server Type",
        "multi": true,
        "name": "servertype",
        "options": [],
        "query": "label_values(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"}, servertype)",
        "refresh": 1,
        "sort": 1,
        "type": "query"
      },
      {
        "allValue": ".*",
        "current": {
          "text": "All",
          "value": [
            "$__all"
          ]
        },
        "datasource": {
          "type": "prometheus",
          "uid": "Ar2rF7GHk"
        },
        "includeAll": true,
        "label": "Pod",
        "multi": true,
        "name": "pod",
        "options": [],
        "query": "label_values(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\"}, pod)",
        "refresh": 1,
        "sort": 1,
        "type": "query"
      },
      {
        "allValue": ".*",
        "current": {
          "text": "All",
          "value": [
            "$__all"
          ]
        },
        "datasource": {
          "type": "prometheus",
          "uid": "Ar2rF7GHk"
        },
        "includeAll": true,
        "label": "Process Group",
        "multi": true,
        "name": "groupname",
        "options": [],
        "query": "label_values(namedprocess_namegroup_cpu_seconds_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\"}, groupname)",
        "refresh": 1,
        "sort": 1,
        "type": "query"
      }
    ]
  },
  "time": {
    "from": "now-5m",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "2h",
      "1d"
    ]
  },
  "timezone": "browser",
  "title": "Game Application Monitoring Dashboard v2 (Global Overview)",
  "uid": "game-app-monitor-dashboard-v2",
  "version": 1
}
                        

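6.3 [Important] Interpreting the Data: KEDA and Autoscaling

When GameServerSet is autoscaled with KEDA or HPA, Pods are created and destroyed frequently. This produces characteristic patterns on Grafana charts that must be read correctly.

Symptom: "spikes" and "cliffs" on charts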

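1. The rate() / increase() pitfall:
A query such as rate(process_cpu_seconds_total[5m]) computes the average per-second rate over the last 5 minutes. If a Pod started only 1 minute ago, its [5m] window is incomplete, so that Pod's CPU looks very low on the chart until it has been running for a full 5 minutes. Likewise, once a Pod is destroyed its contribution abruptly drops to 0, so aggregate (sum) charts show a "cliff".

2. The avg() pitfall:
Never use avg() to aggregate autoscaled Pods. At peak, the Pod count may grow from 10 to 50, so the denominator of avg() grows to 50. This can mislead you into thinking the "average load" has dropped while the actual total load (sum) is soaring.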
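
To make the first pitfall concrete, here is a minimal Python sketch of a simplified rate(): the counter increase seen inside the window divided by the full window length. It deliberately ignores the extrapolation and counter-reset handling that real Prometheus performs, and the Pod, its 0.5-core load, and the 15s scrape interval are all assumed values for illustration.

```python
def simple_rate(samples, window_start, window_end):
    """Simplified rate(): counter increase inside the window divided by the
    full window length. Ignores extrapolation and counter resets."""
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None  # not enough samples to compute a rate
    increase = in_window[-1][1] - in_window[0][1]
    return increase / (window_end - window_start)


# Hypothetical Pod: starts at t=0, burns a steady 0.5 CPU cores,
# and is scraped every 15 seconds for 10 minutes.
samples = [(t, 0.5 * t) for t in range(0, 601, 15)]

WINDOW = 300  # seconds, matching the [5m] range used on the dashboard

rate_after_1m = simple_rate(samples, 60 - WINDOW, 60)
rate_after_5m = simple_rate(samples, 300 - WINDOW, 300)
rate_after_10m = simple_rate(samples, 600 - WINDOW, 600)

print(rate_after_1m, rate_after_5m, rate_after_10m)
```

Under these assumptions the Pod reports roughly 0.1 cores one minute after start and only reaches its true 0.5 cores once the window is full, even though its real load never changed.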

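Recommendations for reading the charts:

  • Look at the total (sum), not the average (avg): for the cluster overview, always use sum(rate(...)) by (servertype). This reflects the true total resource consumption of that server type, whether it is served by 10 Pods or 50.
  • Look at the instance count (count): always place a count(up{...}) panel next to the chart to show how many Pods are currently alive. This lets you cross-check whether a rise in total CPU is caused by an increase in the number of Pods.
  • Accept transients: on "by Pod" charts, seeing curves appear and disappear is completely normal. It reflects GSS autoscaling.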
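
The arithmetic behind these recommendations can be sketched in a few lines of Python. The per-Pod CPU figures below are invented purely for illustration (10 busy Pods off-peak, 50 less busy Pods after a scale-out); they are not measurements from any real cluster.

```python
from statistics import mean

# Hypothetical per-Pod CPU usage (in cores) for one servertype.
off_peak = [0.8] * 10  # 10 Pods, each fairly busy
peak = [0.5] * 50      # scaled out to 50 Pods, each less busy


def summarize(per_pod_cpu):
    """Return the three aggregates discussed above for one point in time."""
    return {
        "sum": round(sum(per_pod_cpu), 2),
        "avg": round(mean(per_pod_cpu), 2),
        "count": len(per_pod_cpu),
    }


before = summarize(off_peak)
during = summarize(peak)
print("off-peak:", before)
print("peak:    ", during)
```

With these numbers avg falls from 0.8 to 0.5 while sum more than triples from 8 to 25 cores, and count (10 to 50) explains why. Reading sum and count together gives the correct picture; avg alone suggests the opposite.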

Appendix: Frequently Asked Questions (FAQ)

Common questions summarized from K8S-Game-Process-Monitoring-Docs.md.

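Q1: I created my PodMonitor, so why does it not show up on the Prometheus Targets page?

A: The two most likely causes:

  • Cause 1 (namespace): by default the Prometheus Operator only scans the monitoring namespace it runs in. Make sure your PodMonitor (my-game-monitor) is also created in the monitoring namespace, not in the game workload's namespace, your-game-namespace.
  • Cause 2 (label selector): your Prometheus instance (defined in custom-values.yaml) uses podMonitorSelector to decide which PodMonitors it "claims". In our configuration it requires the PodMonitor to carry the label release: prometheus. Check that your PodMonitor YAML includes this entry under metadata.labels.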
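
Both causes can be checked mechanically before applying a manifest. The sketch below assumes the PodMonitor has already been loaded into a Python dict (for example from its YAML file); podmonitor_problems is a hypothetical helper name, and the required namespace and label simply mirror the values used in this guide.

```python
def podmonitor_problems(manifest, required_namespace="monitoring",
                        required_labels=None):
    """Return human-readable problems that would keep the local
    Prometheus from picking up this PodMonitor (empty list = looks fine)."""
    if required_labels is None:
        required_labels = {"release": "prometheus"}
    metadata = manifest.get("metadata") or {}
    problems = []

    # Cause 1: the PodMonitor must live in the namespace the Operator scans.
    namespace = metadata.get("namespace")
    if namespace != required_namespace:
        problems.append(
            f"namespace is {namespace!r}, expected {required_namespace!r}")

    # Cause 2: the labels must match the instance's podMonitorSelector.
    labels = metadata.get("labels") or {}
    for key, value in required_labels.items():
        if labels.get(key) != value:
            problems.append(f"missing label {key}: {value}")
    return problems


# A deliberately broken example: wrong namespace and no labels.
broken = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PodMonitor",
    "metadata": {"name": "my-game-monitor", "namespace": "your-game-namespace"},
}
for problem in podmonitor_problems(broken):
    print("-", problem)
```

This only inspects metadata. It does not validate the PodMonitor's own selector or endpoints, which still need to match the game Pods.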

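Q2: Why does the central Prometheus show no data?

A: Check both the local Prometheus (prometheus-game-instance) and the central Prometheus.

  • Local check: run kubectl logs -n monitoring -f prometheus-game-instance-0 and look for remoteWrite-related errors (for example "connection refused" or "unauthorized").
  • Credentials check: confirm that the username and password in the prometheus-rw-credentials Secret are correct.
  • Network check: confirm that Pods in the K8S cluster can reach http://your-central-prometheus-url.com:9090.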
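
For the network check, a plain TCP connection test is often enough to tell a routing or firewall problem apart from a credentials problem. The following is a minimal sketch, assuming a Python interpreter is available in some debug Pod inside the cluster; target_of and can_reach are hypothetical helper names, and the URL is the placeholder used throughout this guide.

```python
import socket
from urllib.parse import urlsplit


def target_of(url):
    """Split a URL into the (host, port) pair that a TCP check needs."""
    parts = urlsplit(url)
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def can_reach(url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = target_of(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, DNS failure, no route
        return False


# Example usage from inside a Pod:
#   can_reach("http://your-central-prometheus-url.com:9090")
```

A True result only shows that the port is reachable. It does not prove that the remote write endpoint accepts the credentials, so the log and Secret checks above are still needed.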

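Q3: Why can't I find my metrics in the central Prometheus?

A: Make sure you are using the correct cluster label!

We configured externalLabels: { cluster: 'your-cluster-name' } in custom-values.yaml. As a result, every PromQL statement you run against the central Prometheus must include {cluster="your-cluster-name", ...} to filter down to your data. For example, instead of process_cpu_seconds_total you must query process_cpu_seconds_total{cluster="your-cluster-name"}.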
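
As a small illustration, the selector can be built programmatically so the cluster label is never forgotten. This sketch assumes the central endpoint exposes the Prometheus-compatible HTTP API (/api/v1/query); with_cluster and instant_query_url are hypothetical helper names, and the base URL is the placeholder from Q2.

```python
from urllib.parse import urlencode


def with_cluster(metric, cluster, extra_labels=None):
    """Return a PromQL selector for `metric` that always carries the cluster label."""
    labels = {"cluster": cluster, **(extra_labels or {})}
    selector = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{selector}}}"


def instant_query_url(base_url, promql):
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


query = with_cluster("process_cpu_seconds_total", "your-cluster-name")
url = instant_query_url("http://your-central-prometheus-url.com:9090", query)
print(query)
print(url)
```

The printed selector matches the example in the answer above, and the URL can be opened with curl or a browser to confirm that the central side really holds series for this cluster.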