# File: custom-values.yaml
# Values for an offline install of kube-prometheus-stack in the UAT/Test environment
# - Grafana and Alertmanager are disabled
# - A global imagePullSecrets entry is configured
# - All image references point to the private Harbor registry
# - Resource names are changed (e.g. 'prometheus-game-instance') to avoid clashing with the components bundled with ACK
# - Only remote write is configured; the local UI port is not exposed
global:
  imagePullSecrets:
    - name: harbor-credential

# 1. Disable Alertmanager and Grafana
alertmanager:
  enabled: false
grafana:
  enabled: false

# 2. Disable CRD installation (the CRDs are applied manually beforehand)
crds:
  enabled: false

# 3. Override the release-derived names to avoid clashing with the existing Prometheus
#    ('prometheus-game-instance' is used as the prefix; the install is also isolated by namespace)
fullnameOverride: "prometheus-game-instance"
nameOverride: "prometheus-game-instance"

# 4. Override image location and tag for every active component
kube-state-metrics:
  nameOverride: "kube-state-metrics-game"
  fullnameOverride: "kube-state-metrics-game"
  image:
    registry: your-harbor-registry.com
    repository: your-project-namespace/kube-state-metrics
    tag: v2.15.0
  selfMonitor:
    enabled: false
prometheus-node-exporter:
  enabled: false
# Assumption to verify against your chart version: the upstream chart toggles the
# node-exporter subchart through 'nodeExporter.enabled', so it is disabled here as well.
nodeExporter:
  enabled: false
prometheusOperator:
  nameOverride: "prometheus-operator-game"
  fullnameOverride: "prometheus-operator-game"
  image:
    registry: your-harbor-registry.com
    repository: your-project-namespace/prometheus-operator
    tag: v0.83.0
  # Use a dedicated serviceAccount to avoid RBAC conflicts
  serviceAccount:
    create: true
    name: "prometheus-operator-game-sa"
  rbac:
    create: true
    pspEnabled: false
  admissionWebhooks:
    patch:
      image:
        registry: your-harbor-registry.com
        repository: your-project-namespace/kube-webhook-certgen
        tag: v1.5.4
  prometheusConfigReloader:
    image:
      registry: your-harbor-registry.com
      repository: your-project-namespace/prometheus-config-reloader
      tag: v0.83.0

# 5. Configure the Prometheus instance (the core part)
prometheus:
  nameOverride: "prometheus-game-instance"
  fullnameOverride: "prometheus-game-instance"
  prometheusSpec:
    # Key: stamp every metric with a cluster label so the central Prometheus can identify the source
    externalLabels:
      cluster: 'your-cluster-name'
      instance: 'prometheus-game-instance'
    # -------------------------------------------------------------
    # !! Key setting (see Core Concepts) !!
    # Tells this Prometheus instance which PodMonitors to "claim":
    # only PodMonitors whose metadata.labels contain { release: "prometheus" } are selected
    podMonitorSelector:
      matchLabels:
        release: prometheus
    # -------------------------------------------------------------
    # Key: remote write
    remoteWrite:
      - url: "http://your-central-prometheus-url.com:9090/api/v1/write"
        basicAuth:
          username:
            name: "prometheus-rw-credentials" # references the Secret created in step 1.2 (item 3)
            key: "username"
          password:
            name: "prometheus-rw-credentials" # references the Secret created in step 1.2 (item 3)
            key: "password"
    image:
      registry: your-harbor-registry.com
      repository: your-project-namespace/prometheus
      tag: v3.4.1
    # Remote write only: keep the admin API off
    enableAdminAPI: false
    # Rule isolation, so that the PrometheusRules shipped with ACK are not loaded.
    # (Assuming the chart default ruleSelectorNilUsesHelmValues: true, an empty
    #  ruleSelector is rendered as a selector on this release's 'release' label.)
    ruleNamespaceSelector: {}
    ruleSelector: {}
    # Listen on loopback only, so the UI/API port is not reachable from outside the Pod
    listenLocal: true
  # Use a dedicated serviceAccount to avoid conflicts
  serviceAccount:
    name: "prometheus-game-instance"
  # Do not create a Service for the local instance
  service:
    enabled: false
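Core Architecture and Data Flow
This guide describes the complete procedure, in a Kubernetes (ACK) environment, for integrating process-exporter into GameServerSet-based game applications, deploying the core components of kube-prometheus-stack, and finally aggregating all process metrics into a central Prometheus service. The document has been sanitized and can be used as a standard implementation SOP.
Architecture Data Flow (SVG)
The diagram shows the full path of the "configuration flow" (blue dashed lines) and the "data flow" (black curves). It makes explicit how **one** PodMonitor discovers **multiple kinds** of GameServerSet Pods.
Core Concepts
Before starting the SOP, it is essential to understand what each key component is responsible for. The system is declarative, and several "actors" work together.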
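Prometheus Operator
This is the prometheus-operator-game that we install. It is not Prometheus itself but Prometheus's "brain" and "housekeeper".
- It continuously watches the K8S cluster for specific kinds of custom resources (defined by CRDs).
- When it finds a PodMonitor resource (flow A), it reads that resource's definition.
- It automatically converts the definition into a Prometheus scrape configuration (scrape_config).
- Finally, it automatically updates the configuration of the Prometheus instance it manages (prometheus-game-instance) (flow B) and makes Prometheus reload it.
PodMonitor
This is the my-game-monitor resource that we create through the app-monitors Chart.
- It is a declaration that tells the Operator: "please monitor every Pod in the cluster that meets conditions A, B and C (namespace, labels)".
- As the diagram shows, a single PodMonitor can discover Pods of many different `servertype` values through the list in its selector.
- It does not perform any monitoring itself; it is only a target definition.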
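process-exporter
This is an exporter designed to inspect every process in the PID namespace it runs in (that is, inside the container) and to expose their CPU, memory, FD and other metrics in Prometheus format at the /metrics path (port 9256 by default).
Remote Write
After the local Prometheus (prometheus-game-instance) has scraped the data (flow 1), this setting forwards a copy of all of it to the central Prometheus (flow 2). The local instance becomes a "collect and forward" station.
External Labels
This is externalLabels: { cluster: 'your-cluster-name' } in custom-values.yaml. It brands every metric sent from this cluster. Without it, the central Prometheus cannot tell which cluster each of its thousands of up series comes from.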
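Key Decisions
During implementation we made several trade-offs. They are the foundation of this solution.
Decision 1: How to deploy process-exporter?
We chose to build the process-exporter binary into the game image instead of using the sidecar pattern. This is the most important decision in the solution.
| Aspect | Option A: Sidecar | Option B: Binary inside the image (★ chosen) |
|---|---|---|
| Deployment | Add a process-exporter container to the containers list of the GameServerSet. | COPY the process-exporter binary into the game image in the Dockerfile. |
| Pros | 1. Separation of concerns; the image stays clean. 2. The exporter can be updated independently. | 1. Visibility: it shares the PID namespace by default, so the game's main processes are easy to monitor. 2. Simple deployment; no extra container in the Pod. |
| Cons | 1. PID namespace isolation: by default a sidecar cannot see the main container's processes! 2. Requires shareProcessNamespace: true (a Pod-level setting) or hostPID: true (high risk); the configuration is complex and has security implications. | 1. Image coupling: it "pollutes" the game image. 2. Updating the exporter requires rebuilding the game image. |
| Conclusion | Because of PID isolation, not suitable for process monitoring. | The best fit. To solve the root problem of process monitoring (access to the PIDs), we accept the cost of image coupling. |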
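Decision 2: How to manage the PodMonitor?
We chose a standalone app-monitors Chart instead of putting the PodMonitor into the game's business Chart.
| Option | Pros | Cons | Decision |
|---|---|---|---|
| A: PodMonitor embedded in the game Chart | Deployed together with the game; strong atomicity | 1. The PodMonitor is created in the monitoring namespace, so the business Chart has to operate across namespaces and permissions get messy. 2. Monitoring configuration is scattered across business Charts and is hard to manage centrally. | Not recommended |
| B: Standalone app-monitors Chart | 1. Single responsibility: all PodMonitors are deployed and managed together in the monitoring namespace. 2. Easy to maintain and extend. 3. Matches the way Prometheus Operator works. | One extra Chart to maintain. | ★ Final choice |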
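Step 1: Install the Prometheus Core Components
Goal: deploy the core components of kube-prometheus-stack (Operator, Prometheus) into the monitoring namespace, offline and without conflicts, with Grafana and Alertmanager disabled.
1.1 Preparation: Download and Push Images (online machine)
On a machine with internet access, download the Helm Chart and push all dependent images to the private registry.
- Download the Chart: kube-prometheus-stack-75.3.0.tgz.
- Create the migrate-images.sh script: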
#!/bin/bash
# Usage:   bash migrate-images.sh <CHART_FILE.tgz> <PRIVATE_REGISTRY/NAMESPACE>
# Example: bash migrate-images.sh ./kube-prometheus-stack-75.3.0.tgz your-harbor-registry.com/your-project-namespace
set -e
CHART_FILE="$1"
PRIVATE_REGISTRY="$2"
if [ -z "$CHART_FILE" ] || [ -z "$PRIVATE_REGISTRY" ]; then
  echo "Error: missing arguments!" >&2
  exit 1
fi
echo "Helm Chart: $CHART_FILE"
echo "Target private registry: $PRIVATE_REGISTRY"
# Requires yq (https://github.com/mikefarah/yq)
# Render the chart with its default values and collect every unique 'image' value
IMAGES=$(helm template "$CHART_FILE" | yq e '.. | select(has("image")) | .image' - | grep -v '^---$' | sort -u)
if [ -z "$IMAGES" ]; then
  echo "Error: no images found in the Chart." >&2
  exit 1
fi
echo "Unique images parsed from the Chart:"
echo "$IMAGES"
echo "---"
docker login "$PRIVATE_REGISTRY"
for IMAGE in $IMAGES; do
  # Keep only the last path segment (name:tag) and prefix it with the private registry
  IMAGE_NAME=$(echo "$IMAGE" | sed 's|.*/||')
  NEW_IMAGE="$PRIVATE_REGISTRY/$IMAGE_NAME"
  echo "==> Processing: $IMAGE"
  if ! docker pull "$IMAGE"; then
    echo "Warning: failed to pull $IMAGE, skipping." >&2
    continue
  fi
  docker tag "$IMAGE" "$NEW_IMAGE"
  if ! docker push "$NEW_IMAGE"; then
    echo "Warning: failed to push $NEW_IMAGE, skipping." >&2
    continue
  fi
  echo "Migrated: $IMAGE -> $NEW_IMAGE"
done
echo "All images migrated!"
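The renaming rule used in the loop can be checked in isolation. The sketch below reproduces it as a small shell function; `target_image` is a hypothetical helper name that is not part of the script above, and the second input reference is only illustrative.

```shell
#!/bin/sh
# Map a source image reference to its name in the private registry.
# "${src##*/}" drops everything up to the last "/", which is what the
# sed expression 's|.*/||' in migrate-images.sh does.
target_image() {
  src="$1"        # e.g. quay.io/prometheus-operator/prometheus-config-reloader:v0.83.0
  registry="$2"   # e.g. your-harbor-registry.com/your-project-namespace
  printf '%s/%s\n' "$registry" "${src##*/}"
}

# Example:
# target_image quay.io/prometheus/prometheus:v3.4.1 your-harbor-registry.com/your-project-namespace
```

Because every upstream path segment is dropped, each image lands directly under the project namespace. That is why the repository values in custom-values.yaml take the form your-project-namespace/<image-name>.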
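This script may miss prometheus-config-reloader, because that image is handed to the operator as a command-line argument rather than through an image field in the rendered manifests. If pulling it fails during installation, migrate it manually:
# (original image)
docker pull quay.io/prometheus-operator/prometheus-config-reloader:v0.83.0
# (retag and push; change the target to your own registry address)
docker tag ... hbregistry-cn.lrgameglobal.com/kafka/prometheus-config-reloader:v0.83.0
docker push ...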
1.2 Offline Installation (K8S management node)
1. Create the Namespace
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
2. Install the CRDs manually (key step)
To avoid Helm timeouts, the CRDs must be applied manually first.
# 1. Extract the crds directory from the Chart package
tar -xvf kube-prometheus-stack-75.3.0.tgz kube-prometheus-stack/charts/crds/crds/
# 2. Install with server-side apply (avoids the oversized 'metadata.annotations' problem)
kubectl apply --server-side -f kube-prometheus-stack/charts/crds/crds/
3. Create the K8S Secrets
A. Private image registry credentials (harbor-credential)
# (replace with your own private registry details)
kubectl create secret docker-registry harbor-credential \
  --namespace=monitoring \
  --docker-server=your-harbor-registry.com \
  --docker-username=your-username \
  --docker-password=your-password
B. Remote write credentials (prometheus-rw-credentials)
# (replace with the credentials of your central Prometheus)
kubectl create secret generic prometheus-rw-credentials \
  --namespace=monitoring \
  --from-literal=username='central-prometheus-user' \
  --from-literal=password='central-prometheus-password'
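4. Prepare the custom-values.yaml file
This file is the core of the deployment. It is sanitized and contains every key setting (conflict avoidance, private registry references, Remote Write). The full file is reproduced at the top of this document.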
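Before installing, it can be worth rendering the chart with this values file and checking that no container image still points outside the private registry. The following is a minimal sketch under stated assumptions: `find_foreign_images` is a hypothetical helper, it only inspects plain `image:` lines in the rendered manifests, and it will not see images that are passed as command-line arguments (such as the config reloader).

```shell
#!/bin/sh
# Read rendered manifests on stdin and print every unique image reference
# that does NOT start with "<registry>/".
find_foreign_images() {
  awk -v prefix="$1/" '
    /^[ \t]*(- )?image:[ \t]/ {
      ref = $NF                 # last field on the line is the image reference
      gsub(/"/, "", ref)        # strip double quotes, if any
      if (index(ref, prefix) != 1) print ref
    }
  ' | sort -u
}

# Possible use (no output means every image line points to the private registry):
# helm template prometheus ./kube-prometheus-stack-75.3.0.tgz \
#   --namespace monitoring -f ./custom-values.yaml \
#   | find_foreign_images your-harbor-registry.com
```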
5. Run the Helm installation
helm install prometheus ./kube-prometheus-stack-75.3.0.tgz \
  --namespace monitoring \
  -f ./custom-values.yaml
Verify: run kubectl get pods -n monitoring and wait until all Pods (prometheus-operator-game-xxx, prometheus-game-instance-xxx, etc.) are Running.
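Instead of polling the Pod list by hand, the wait can be scripted. This is a small sketch, assuming the monitoring namespace contains only the Pods of this release (otherwise `--all` also waits for unrelated Pods); `wait_for_stack` is a hypothetical wrapper name and the 300s default timeout is an arbitrary choice.

```shell
#!/bin/sh
# Block until every Pod in the namespace reports the Ready condition,
# or fail once the timeout expires.
wait_for_stack() {
  ns="${1:-monitoring}"
  timeout="${2:-300s}"
  kubectl wait --namespace "$ns" --for=condition=Ready pod --all --timeout="$timeout"
}

# Possible use right after `helm install`:
# wait_for_stack monitoring 300s && echo "monitoring stack is ready"
```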
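Step 2: Modify the Game Application (embed the exporter)
Goal: have the game container start a process-exporter process alongside the game processes to expose metrics. (For the rationale, see Key Decisions.)
2.1 Prepare the process-exporter configuration file
Prepare process-exporter.yaml under the config/ directory of the game image's build context (the Dockerfile in 2.2 copies it to /etc/process-exporter.yaml). It uses {{.ExeFull}} so that the full executable path of every matched process becomes the group name label.
# config/process-exporter.yaml
process_names:
  - name: "{{.ExeFull}}"
    cmdline:
      - '.+'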
2.2 Modify the Dockerfile
Make sure the process-exporter binary and its configuration file are copied into the image.
# ... (your other Dockerfile instructions)
# Copy process-exporter
COPY ./bin/process-exporter /usr/local/bin/process-exporter
RUN chmod +x /usr/local/bin/process-exporter
# Copy the configuration file (target path assumed to be /etc/process-exporter.yaml)
COPY ./config/process-exporter.yaml /etc/process-exporter.yaml
# ...
CMD ["/app/start_game.sh"]
2.3 Modify the startup script (start_game.sh)
The script must start process-exporter in the background and end with a foreground process so that the container does not exit.
#!/bin/sh
echo "Starting monitoring agent (process-exporter)..."
# 1. Start process-exporter in the background with the correct config path
/usr/local/bin/process-exporter -config.path /etc/process-exporter.yaml &
echo "Starting game processes..."
# 2. Run your existing logic that starts the game processes (assumed to run in the background as well)
/app/logic_server &
/app/scene_server &
# ...
# 3. Key: keep one foreground process (for example tail)
echo "All processes started. Tailing logs to keep container alive."
tail -f /dev/null
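One limitation of `tail -f /dev/null` is that the shell never forwards termination signals to the background processes. A possible variation, shown here only as a sketch, starts the same commands, installs a trap that forwards SIGTERM/SIGINT, and blocks on `wait`. `run_all` is a hypothetical helper name; the binary paths are the ones used above.

```shell
#!/bin/sh
# Start every command in the background, forward SIGTERM/SIGINT to the
# children, then block until all of them have exited.
run_all() {
  pids=""
  for cmd in "$@"; do
    $cmd &                 # unquoted on purpose so "binary arg ..." is split into words
    pids="$pids $!"
  done
  trap 'kill $pids 2>/dev/null' TERM INT
  wait                     # takes the place of `tail -f /dev/null`
}

# Possible use in start_game.sh:
# run_all "/usr/local/bin/process-exporter -config.path /etc/process-exporter.yaml" \
#         /app/logic_server /app/scene_server
```

With this shape the container stays alive while any child is running and stops once all of them have exited. Note that the sketch only forwards the signal; it does not wait for a graceful shutdown to finish, and it does not restart a crashed process.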
Step 3: K8S Resource Configuration (GameServerSet)
Goal: modify the game's GameServerSet (GSS) definition to expose the `metrics` port and add the `label` used for service discovery.
This is the precondition for the PodMonitor to find the game Pods. Apply this configuration to all of your GSS resources (`charge`, `game`, `lister`, etc.).
# OpenKruiseGame API group
apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: my-game # (e.g. 'game', 'charge', 'lister'...)
  # The namespace must match the one configured in app-monitors
  namespace: your-game-namespace
spec:
  replicas: 1 # (or your replica count)
  gameServerTemplate:
    metadata:
      labels:
        # -------------------------------------------------------------
        # !! Key label !!
        # Prometheus (the PodMonitor) discovers Pods through this label.
        # It must match one of the values in the serverTypes list of app-monitors/values.yaml
        servertype: "game" # (or 'charge', 'lister'...)
        # -------------------------------------------------------------
    spec:
      containers:
        - name: game
          image: your-game-image:latest
          ports:
            # -------------------------------------------------------------
            # !! Key port !!
            # Expose the process-exporter port (9256 by default)
            - name: metrics
              containerPort: 9256
              protocol: TCP
            # -------------------------------------------------------------
            # ... (your other game ports)
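Once a Pod built this way is running, it helps to confirm that the exporter really sees the game processes before involving Prometheus at all. The sketch below extracts the distinct `groupname` values from the exporter's text output; `list_groupnames` is a hypothetical helper, and it relies on the `namedprocess_namegroup_num_procs` series that process-exporter emits per process group.

```shell
#!/bin/sh
# Read Prometheus text-format metrics on stdin and print the distinct
# process group names reported by process-exporter.
list_groupnames() {
  grep '^namedprocess_namegroup_num_procs{' \
    | sed 's/.*groupname="\([^"]*\)".*/\1/' \
    | sort -u
}

# Possible use (replace <pod-name> with one of your game Pods):
# kubectl port-forward -n your-game-namespace pod/<pod-name> 9256:9256 &
# curl -s http://localhost:9256/metrics | list_groupnames
```

With the configuration from step 2.1, the output should list full executable paths such as /app/logic_server and /app/scene_server.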
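Step 4: Deploy Auto-Discovery (PodMonitor)
Goal: deploy a PodMonitor resource that tells the Prometheus instance installed in step 1 where and how to find the game Pods. (For the rationale, see Key Decisions.)
4.1 Configure app-monitors/values.yaml
This file defines in which namespace, and for which servertype values, the PodMonitor looks for Pods. The list should contain the `servertype` label of every GSS you run.
# app-monitors/values.yaml
default:
  interval: 15s
myGame:
  enabled: true
  # Namespace the game application is deployed in (must match the namespace of the GSS)
  namespace: "your-game-namespace"
  # Label values used to select GameServer Pods (must match gameServerTemplate.metadata.labels in the GSS)
  # (this list comes from your kubectl output and YAML files)
  serverTypes:
    - "lister"
    - "game"
    - "global"
    - "router"
    - "match"
    - "collector"
    - "charge"
    - "manage"
    - "cross"
    - "member-game"
    # ... (other server types)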
4.2 Configure app-monitors/templates/my-game-podmonitor.yaml
This is the PodMonitor template. It implements the key "three-way handshake":
- The local Prometheus (prometheus-game-instance) uses podMonitorSelector to look for the release: prometheus label.
- This PodMonitor carries the release: prometheus label, so Prometheus "claims" it.
- This PodMonitor uses namespaceSelector and selector (which reads the `serverTypes` list from `values.yaml`) to find every matching Pod.
{{- if .Values.myGame.enabled -}}
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-game-monitor
  # -------------------------------------------------------------
  # !! Key: in this setup the resource is created in the 'monitoring' namespace !!
  # so that the Prometheus instance managed by the Operator picks it up
  namespace: monitoring
  # -------------------------------------------------------------
  labels:
    # -------------------------------------------------------------
    # !! Key: this label must be 'prometheus' !!
    # so that the Prometheus instance deployed in step 1 selects it (matches podMonitorSelector)
    release: prometheus
    # -------------------------------------------------------------
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: app-monitors
spec:
  # Key: the namespace in which to look for target Pods
  namespaceSelector:
    matchNames:
      - {{ .Values.myGame.namespace | quote }}
  # Key: use matchExpressions to match several servertype values
  selector:
    matchExpressions:
      - key: servertype
        operator: In
        values:
          # (the serverTypes list from values.yaml is rendered here)
          {{- toYaml .Values.myGame.serverTypes | nindent 10 }}
  podMetricsEndpoints:
    - port: "metrics" # matches the name of port 9256 defined in the GSS
      path: "/metrics"
      interval: {{ .Values.default.interval }}
      # (optional but recommended) copy the Pod label 'servertype' onto the metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_servertype]
          targetLabel: servertype
{{- end -}}
4.3 Deploy the app-monitors Chart
# (make sure the app-monitors Chart and its values.yaml are ready)
helm upgrade --install app-monitors ./app-monitors \
  --namespace monitoring \
  -f ./app-monitors/values.yaml
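After deploying, a quick way to see which Pods the PodMonitor's `In` expression would match is to run the equivalent set-based label selector through kubectl. The sketch below only builds that selector string; `servertype_selector` is a hypothetical helper, and the server types passed to it should be the same ones listed in values.yaml.

```shell
#!/bin/sh
# Build a set-based label selector equivalent to the PodMonitor's
# matchExpressions entry: key=servertype, operator=In, values=<args>.
servertype_selector() {
  list=""
  for t in "$@"; do
    list="${list:+$list,}$t"   # join the arguments with commas
  done
  printf 'servertype in (%s)\n' "$list"
}

# Possible use:
# kubectl get pods -n your-game-namespace -l "$(servertype_selector lister game charge)"
```

If a Pod you expect is missing from that list, its `servertype` label does not match any entry and the PodMonitor will not discover it either.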
Step 5: Configure Remote Write (data upload)
This step was already completed in custom-values.yaml in step 1.
Two key settings are worth repeating here:
# ... in custom-values.yaml -> prometheus.prometheusSpec ...
# 1. External labels:
#    Every uploaded metric automatically carries {cluster="your-cluster-name"}.
#    This is the only identifier that tells data sources apart on the central Prometheus.
externalLabels:
  cluster: 'your-cluster-name'
  instance: 'prometheus-game-instance'
# 2. Remote write:
#    Points to the central Prometheus and authenticates with the Secret (prometheus-rw-credentials).
remoteWrite:
  - url: "http://your-central-prometheus-url.com:9090/api/v1/write"
    basicAuth:
      username:
        name: "prometheus-rw-credentials"
        key: "username"
      password:
        name: "prometheus-rw-credentials"
        key: "password"
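Step 6: Grafana Dashboards and Reading the Data
The data now reaches the central Prometheus. The last step is to configure a dashboard in the central Grafana.
6.1 Key PromQL verification queries
Log in to the central Prometheus and use the following PromQL queries to verify that data is arriving:
1. Check the up metric of the your-cluster-name cluster:
# Confirms that the Prometheus instance of cluster 'your-cluster-name' is sending data to the central instance
up{cluster="your-cluster-name"}
2. Check that the game server monitor (my-game-monitor) is online:
# All targets scraped by 'my-game-monitor' in cluster 'your-cluster-name'
# Expected: one series per GSS Pod, each with value 1 (UP)
up{cluster="your-cluster-name", job="monitoring/my-game-monitor"}
3. Query the CPU usage of all game processes (core query):
# Aggregated by process group (groupname): average CPU usage over 5 minutes, in cores
# Expected: CPU of processes such as /app/logic_server and /app/scene_server
sum(rate(namedprocess_namegroup_cpu_seconds_total{cluster="your-cluster-name", job="monitoring/my-game-monitor"}[5m])) by (groupname)
4. Query the resident memory of all game processes (core query):
# Aggregated by process group (groupname): total memory of all processes with the same name
sum(namedprocess_namegroup_memory_bytes{cluster="your-cluster-name", job="monitoring/my-game-monitor", memtype="resident"}) by (groupname)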
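The same checks can be run from a terminal through the Prometheus HTTP API, which is convenient in a post-deployment script. This is a minimal sketch: `query_central` is a hypothetical helper, the base URL is the placeholder used throughout this guide, and authentication is left out (add curl's `-u user:password` if your central instance requires basic auth).

```shell
#!/bin/sh
# Run an instant PromQL query against a Prometheus-compatible HTTP API.
# curl -G turns the --data-urlencode payload into a URL query string.
query_central() {
  base="$1"    # e.g. http://your-central-prometheus-url.com:9090
  query="$2"   # PromQL expression
  curl -sG "$base/api/v1/query" --data-urlencode "query=$query"
}

# Possible use:
# query_central http://your-central-prometheus-url.com:9090 \
#   'up{cluster="your-cluster-name", job="monitoring/my-game-monitor"}'
```

The response is JSON; a `"status":"success"` with an empty `result` array means the query ran but matched no series, which usually points at a wrong cluster or job label.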
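6.2 Grafana Dashboard JSON
The JSON below can be imported into a self-hosted Grafana. Before importing, replace the datasource uid (Ar2rF7GHk) and the cluster label value used in the queries (game-test) with your own. The code is long.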
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
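},
"description": "This dashboard monitors Pod-level and process-level metrics (process-exporter) of the game application. [v2: the 'stable' and 'KEDA' cards are merged into a 'global overview' to keep the data consistent]",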
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 183,
"links": [],
"panels": [
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 10,
"panels": [],
"title": "Overall Health",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"decimals": 0,
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "red",
"value": 0
},
{
"color": "green",
"value": 1
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 1
},
"id": 1,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showPercentChange": false,
"textMode": "auto",
"wideLayout": true
},
"pluginVersion": "12.1.0",
"targets": [
{
"editorMode": "code",
"exemplar": false,
"expr": "sum(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"})",
"instant": true,
"legendFormat": "UP",
"range": false,
"refId": "A"
},
{
"editorMode": "code",
"exemplar": false,
"expr": "count(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"})",
"instant": true,
"legendFormat": "Total",
"range": false,
"refId": "B"
}
],
"title": "Game Service Overview (UP / Total) - Instant",
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"custom": {
"align": "auto",
"cellOptions": {
"type": "auto"
},
"inspect": false
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "red",
"value": 0
},
{
"color": "green",
"value": 1
}
]
},
"unit": "short"
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "servertype"
},
"properties": [
{
"id": "displayName",
"value": "Server Type"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Value #A"
},
"properties": [
{
"id": "displayName",
"value": "UP"
},
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "red",
"text": "Down"
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "Value #B"
},
"properties": [
{
"id": "displayName",
"value": "TOTAL"
}
]
}
]
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 1
},
"id": 2,
"options": {
"cellHeight": "sm",
"footer": {
"countRows": false,
"fields": "",
"reducer": [
"sum"
],
"show": true
},
"frameIndex": 1,
"showHeader": true,
"sortBy": [
{
"desc": true,
"displayName": "UP"
}
]
},
"pluginVersion": "12.1.0",
"targets": [
{
"expr": "sum(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"}) by (servertype)",
"format": "table",
"instant": true,
"legendFormat": "UP",
"range": false,
"refId": "A"
},
{
"expr": "count(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"}) by (servertype)",
"format": "table",
"instant": true,
"legendFormat": "TOTAL",
"range": false,
"refId": "B"
}
],
"title": "Health by ServerType (Instant)",
"type": "table"
},
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 9
},
"id": 11,
"panels": [],
"title": "Process & Pod Metrics",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "opacity",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": 0
},
{
"color": "red",
"value": 80
}
]
},
"unit": "bytes"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 10
},
"id": 3,
"options": {
"legend": {
"calcs": [
"lastNotNull",
"mean",
"max"
],
"displayMode": "table",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"hideZeros": false,
"mode": "multi",
"sort": "desc"
}
},
"pluginVersion": "12.1.0",
"targets": [
{
"expr": "sum(namedprocess_namegroup_memory_bytes{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", memtype=\"resident\", servertype=~\"$servertype\", pod=~\"$pod\", groupname=~\"$groupname\"}) by (groupname)",
"instant": false,
"legendFormat": "{{groupname}}",
"range": true,
"refId": "A"
}
],
"title": "Process Memory (Resident) - by Process Group",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "opacity",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": 0
},
{
"color": "red",
"value": 80
}
]
},
"unit": "bytes"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 10
},
"id": 4,
"options": {
"legend": {
"calcs": [
"lastNotNull",
"mean",
"max"
],
"displayMode": "table",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"hideZeros": false,
"mode": "multi",
"sort": "desc"
}
},
"pluginVersion": "12.1.0",
"targets": [
{
"expr": "sum(namedprocess_namegroup_memory_bytes{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", memtype=\"resident\", servertype=~\"$servertype\", pod=~\"$pod\"}) by (pod)",
"instant": false,
"legendFormat": "{{pod}}",
"range": true,
"refId": "A"
}
],
"title": "Pod Memory (Resident) - by Pod",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "opacity",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"max": 100,
"min": 0,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": 0
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 18
},
"id": 5,
"options": {
"legend": {
"calcs": [
"lastNotNull",
"mean",
"max"
],
"displayMode": "table",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"hideZeros": false,
"mode": "multi",
"sort": "desc"
}
},
"pluginVersion": "12.1.0",
"targets": [
{
"expr": "sum(rate(namedprocess_namegroup_cpu_seconds_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\", groupname=~\"$groupname\"}[5m])) by (groupname) * 100",
"instant": false,
"legendFormat": "{{groupname}}",
"range": true,
"refId": "A"
}
],
"title": "Process CPU Usage - by Process Group",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "opacity",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"max": 100,
"min": 0,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": 0
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 18
},
"id": 6,
"options": {
"legend": {
"calcs": [
"lastNotNull",
"mean",
"max"
],
"displayMode": "table",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"hideZeros": false,
"mode": "multi",
"sort": "desc"
}
},
"pluginVersion": "12.1.0",
"targets": [
{
"expr": "sum(rate(namedprocess_namegroup_cpu_seconds_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\"}[5m])) by (pod) * 100",
"instant": false,
"legendFormat": "{{pod}}",
"range": true,
"refId": "A"
}
],
"title": "Pod CPU Usage - by Pod",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": 0
},
{
"color": "red",
"value": 80
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 26
},
"id": 7,
"options": {
"legend": {
"calcs": [
"lastNotNull"
],
"displayMode": "table",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"hideZeros": false,
"mode": "multi",
"sort": "desc"
}
},
"pluginVersion": "12.1.0",
"targets": [
{
"expr": "sum(namedprocess_namegroup_open_filedesc{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\"}) by (pod)",
"instant": false,
"legendFormat": "{{pod}}",
"range": true,
"refId": "A"
}
],
"title": "Pod File Descriptors (FDs) - by Pod",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": 0
},
{
"color": "red",
"value": 80
}
]
},
"unit": "B/s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 8,
"x": 8,
"y": 26
},
"id": 8,
"options": {
"legend": {
"calcs": [
"lastNotNull"
],
"displayMode": "table",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"hideZeros": false,
"mode": "multi",
"sort": "desc"
}
},
"pluginVersion": "12.1.0",
"targets": [
{
"expr": "sum(rate(namedprocess_namegroup_read_bytes_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\", groupname=~\"$groupname\"}[5m])) by (groupname)",
"instant": false,
"legendFormat": "Read - {{groupname}}",
"range": true,
"refId": "A"
},
{
"expr": "sum(rate(namedprocess_namegroup_write_bytes_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\", groupname=~\"$groupname\"}[5m])) by (groupname)",
"instant": false,
"legendFormat": "Write - {{groupname}}",
"range": true,
"refId": "B"
}
],
"title": "Process IO - by Process Group (5m rate)",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": 0
},
{
"color": "red",
"value": 80
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 26
},
"id": 9,
"options": {
"legend": {
"calcs": [
"lastNotNull"
],
"displayMode": "table",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"hideZeros": false,
"mode": "multi",
"sort": "desc"
}
},
"pluginVersion": "12.1.0",
"targets": [
{
"expr": "sum(go_goroutines{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\"}) by (pod)",
"instant": false,
"legendFormat": "{{pod}}",
"range": true,
"refId": "A"
}
],
"title": "Go Goroutines (process-exporter itself) - by Pod",
"type": "timeseries"
}
],
"preload": false,
"refresh": "30s",
"schemaVersion": 41,
"tags": [
"game",
"prometheus",
"process-exporter"
],
"templating": {
"list": [
{
"current": {
"text": "VictoriaMetrics-大陆",
"value": "Ar2rF7GHk"
},
"label": "Prometheus Datasource",
"name": "DS_PROMETHEUS",
"options": [],
"query": "prometheus",
"refresh": 1,
"type": "datasource"
},
{
"allValue": ".*",
"current": {
"text": "All",
"value": [
"$__all"
]
},
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"includeAll": true,
"label": "Server Type",
"multi": true,
"name": "servertype",
"options": [],
"query": "label_values(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\"}, servertype)",
"refresh": 1,
"sort": 1,
"type": "query"
},
{
"allValue": ".*",
"current": {
"text": "All",
"value": [
"$__all"
]
},
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"includeAll": true,
"label": "Pod",
"multi": true,
"name": "pod",
"options": [],
"query": "label_values(up{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\"}, pod)",
"refresh": 1,
"sort": 1,
"type": "query"
},
{
"allValue": ".*",
"current": {
"text": "All",
"value": [
"$__all"
]
},
"datasource": {
"type": "prometheus",
"uid": "Ar2rF7GHk"
},
"includeAll": true,
"label": "Process Group",
"multi": true,
"name": "groupname",
"options": [],
"query": "label_values(namedprocess_namegroup_cpu_seconds_total{cluster=\"game-test\", job=\"monitoring/my-game-monitor\", servertype=~\"$servertype\", pod=~\"$pod\"}, groupname)",
"refresh": 1,
"sort": 1,
"type": "query"
}
]
},
"time": {
"from": "now-5m",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"2h",
"1d"
]
},
"timezone": "browser",
"title": "Game Application Monitoring Dashboard v2 (Global Overview)",
"uid": "game-app-monitor-dashboard-v2",
"version": 1
}
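6.3 [Important] Reading the Data: KEDA and Autoscaling
When a GameServerSet is autoscaled by KEDA or HPA, Pods are created and destroyed frequently. This produces characteristic patterns in Grafana charts that must be read correctly.
1. The rate() and increase() trap:
A query such as rate(namedprocess_namegroup_cpu_seconds_total[5m]) computes the average per-second rate over a 5-minute window. If a Pod started only 1 minute ago, its [5m] window is incomplete, so its CPU looks far too low on the chart until it has run for the full 5 minutes. Likewise, when a Pod is destroyed its series stops and drops out of the result, so an aggregate (sum) chart shows a sudden "cliff".
2. The avg() trap:
Do not use avg() to aggregate autoscaled Pods. At peak, when the Pod count grows from 10 to 50, the denominator of avg() (50) grows too, which can make it look as if the "average load" fell while the actual total load (sum) is soaring.
Recommendations:
- Look at the sum, not the avg: for a cluster overview, always use sum(rate(...)) by (servertype). It reflects the real total resource consumption of that service type, whether it is served by 10 Pods or by 50.
- Look at the instance count (count): always place a count(up{...}) panel next to the chart to show the number of live Pods. It lets you cross-check whether a rise in total CPU comes from a higher Pod count.
- Accept transients: on per-Pod charts, curves appearing and disappearing is completely normal; it reflects the GSS scaling up and down.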
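The avg() trap is easy to reproduce with plain arithmetic. The sketch below uses made-up numbers (10 Pods at 0.8 cores each before a scale-out, 50 Pods at 0.4 cores each after it); `per_pod_load` and `summarize` are hypothetical helpers that stand in for the per-Pod series and for the count/sum/avg aggregations.

```shell
#!/bin/sh
# Emit N identical per-Pod CPU values, one per line (stand-in for per-Pod series).
per_pod_load() {
  awk -v n="$1" -v v="$2" 'BEGIN { for (i = 0; i < n; i++) print v }'
}

# Aggregate per-Pod values the way count(), sum() and avg() would.
summarize() {
  awk '{ sum += $1; n++ }
       END { if (n) printf "pods=%d sum=%.1f avg=%.2f\n", n, sum, sum / n }'
}

per_pod_load 10 0.8 | summarize   # before scale-out
per_pod_load 50 0.4 | summarize   # after scale-out
```

The first line reports a sum of 8.0 cores with an average of 0.80, the second a sum of 20.0 cores with an average of 0.40. The average halves while the total load grows 2.5 times, which is why the overview panels should be built on sum and count rather than avg.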
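Appendix: Frequently Asked Questions (FAQ)
Common questions summarized from K8S-Game-Process-Monitoring-Docs.md.
Q1: Why does my PodMonitor exist but not show up on the Prometheus Targets page?
A: The two most likely causes:
- Cause 1 (namespace): this guide keeps every PodMonitor in the monitoring namespace, next to the Prometheus instance. Make sure your PodMonitor (my-game-monitor) is created in monitoring and not in the game namespace your-game-namespace. Which namespaces are searched for PodMonitors is decided by the podMonitorNamespaceSelector of the Prometheus resource, so check that field if you deviate from this layout.
- Cause 2 (label selector): your Prometheus instance (defined in custom-values.yaml) uses podMonitorSelector to decide which PodMonitors to "claim". In our configuration it requires the PodMonitor to carry the release: prometheus label. Check that your PodMonitor YAML has it under metadata.labels.
Q2: Why does the central Prometheus see no data?
A: Check both the local Prometheus (prometheus-game-instance) and the central Prometheus.
- Local check: kubectl logs -n monitoring -f prometheus-game-instance-0 (confirm the exact Pod name with kubectl get pods -n monitoring first). Look for errors related to remoteWrite (for example "connection refused" or "unauthorized").
- Credentials check: confirm that the username and password in the prometheus-rw-credentials Secret are correct.
- Network check: confirm that Pods in the K8S cluster can reach http://your-central-prometheus-url.com:9090.
Q3: Why can't I find my metrics on the central Prometheus?
A: Make sure you use the correct cluster label!
We configured externalLabels: { cluster: 'your-cluster-name' } in custom-values.yaml. When querying the central Prometheus, every PromQL expression therefore needs {cluster="your-cluster-name", ...} to filter down to your data. For example, query namedprocess_namegroup_cpu_seconds_total as namedprocess_namegroup_cpu_seconds_total{cluster="your-cluster-name"}.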