Nvidia GPU Metrics/Alerting
In the last post, I covered how to enable the Nvidia GPU in RHOAI (OpenShift AI). Today, we will cover the GPU metrics that can be exposed through the OpenShift Web Console. This Prometheus telemetry can then be used to build custom queries that send alerts through AlertManager.
Enabling GPU Monitoring and Metrics in OpenShift
The steps followed here are based on the NVIDIA GPU Operator on OpenShift documentation (https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/).
Run the following commands with the oc CLI.
# Download the JSON for CM
curl -LfO https://github.com/NVIDIA/dcgm-exporter/raw/main/grafana/dcgm-exporter-dashboard.json
# Create the CM
oc create configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed --from-file=dcgm-exporter-dashboard.json
# For view in the Administrator perspective
oc label configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed "console.openshift.io/dashboard=true"
# For view in the Developer perspective
oc label configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed "console.openshift.io/odc-dashboard=true"
# Verify labels
oc -n openshift-config-managed get cm nvidia-dcgm-exporter-dashboard --show-labels
The output of the last command should look like this:
![](https://myopenshiftblog.com/content/images/2024/04/image-26.png)
Now, verify that the OpenShift Web Console shows this information in the Administrator perspective. Go to Observe --> Dashboards --> Nvidia DCGM Exporter Dashboard.
![](https://myopenshiftblog.com/content/images/2024/04/image-27.png)
For the Developer perspective, go to Observe --> nvidia-gpu-operator namespace --> Dashboard --> Nvidia DCGM Exporter Dashboard.
![](https://myopenshiftblog.com/content/images/2024/04/image-28.png)
Here are the default metrics that show up in this dashboard:
![](https://myopenshiftblog.com/content/images/2024/04/image-29.png)
Now, let's put together some AlertManager rules for each of these metrics.
I've created a GitHub repo with all of the sample YAML files:
Some Fact-Finding
There are a few alerts that come with the GPU operator. These are shown below:
![](https://myopenshiftblog.com/content/images/2024/04/image-30.png)
- Go to Observe --> Targets and search for the nvidia-dcgm-exporter ServiceMonitor. It will be exposed through a NodePort on one of your OCP nodes. In this case, my endpoint is http://192.168.128.229:9400/metrics.
![](https://myopenshiftblog.com/content/images/2024/04/image-31.png)
Here is some information that shows up there:
![](https://myopenshiftblog.com/content/images/2024/04/image-32.png)
Let me break this down into a table so it is easier to read the description of each metric. We will use this information to develop the alerts.
| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_SM_CLOCK | SM clock frequency (in MHz). |
| DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency (in MHz). |
| DCGM_FI_DEV_MEMORY_TEMP | Memory temperature (in C). |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (in C). |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Total number of PCIe retries. |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (in %). |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (in %). |
| DCGM_FI_DEV_DEC_UTIL | Decoder utilization (in %). |
| DCGM_FI_DEV_XID_ERRORS | Value of the last XID error encountered. |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (in MiB). |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (in MiB). |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Total number of NVLink bandwidth counters for all lanes. |
| DCGM_FI_DEV_VGPU_LICENSE_STATUS | vGPU License status. |
- To see the raw PrometheusRule CRDs, look at Administration --> Custom Resource Definitions --> PrometheusRule
![](https://myopenshiftblog.com/content/images/2024/04/image-34.png)
- Go to Instances and search for nvidia. You will see two PrometheusRule resources.
![](https://myopenshiftblog.com/content/images/2024/04/image-35.png)
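The dcgm-exporter endpoint found under Observe --> Targets can also be inspected from the command line. The following is a minimal sketch, not part of the GPU operator: `dcgm_values` is a hypothetical helper name, and it assumes the Prometheus text format that dcgm-exporter emits, where each sample line carries a `gpu` label and ends with the value.

```shell
# Print "<gpu> <value>" for every sample of one DCGM metric read from stdin.
# Assumes sample lines of the form: METRIC{gpu="0",...} 31
dcgm_values() {
  awk -v m="$1" '
    # Only lines that start with the metric name followed by "{" are samples;
    # this skips "# HELP" / "# TYPE" comments and every other metric.
    index($0, m "{") == 1 {
      gpu = "?"
      # Pull the value of the gpu label out of the label set, if present.
      if (match($0, /gpu="[^"]*"/)) {
        gpu = substr($0, RSTART + 5, RLENGTH - 6)
      }
      # The sample value is the last whitespace-separated field.
      print gpu, $NF
    }
  '
}

# Usage against the endpoint from this post (replace with your own):
#   curl -s http://192.168.128.229:9400/metrics | dcgm_values DCGM_FI_DEV_GPU_TEMP
```

This is handy for confirming which metrics and label values a node actually exposes before writing an alert expression against them.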
Creating PrometheusRule for GPU Temperature
The metric used for this AlertManager rule is GPU temperature (DCGM_FI_DEV_GPU_TEMP).
To see this information on the OpenShift Web Console, go to Observe --> Metrics and run a query on DCGM_FI_DEV_GPU_TEMP as shown below (current value is 31 degrees Celsius):
![](https://myopenshiftblog.com/content/images/2024/04/image-33.png)
For demonstration purposes, I will set up an alert that fires when the GPU temperature is over 30 degrees Celsius. It will fire right away because my current temperature is 31.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gputemp
  namespace: nvidia-gpu-operator
spec:
  groups:
  - name: gputemp-rule
    rules:
    - alert: gputemp
      expr: DCGM_FI_DEV_GPU_TEMP > 30
      labels:
        severity: critical
      annotations:
        description: GPU Temperature is over 30 degree Celsius
This code is also in my Git repo.
Let's apply it:
oc apply -f gpu_alertmanager.yaml
Within a few minutes, this alert should show up in the OpenShift Console under Observe --> Alerting, as shown below.
![](https://myopenshiftblog.com/content/images/2024/04/image-36.png)
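A 30-degree threshold is useful for proving the pipeline works, but it would fire constantly in real use. Here is a sketch of a less noisy variant, under stated assumptions: the 85-degree default, the `gputemp-sustained` names, the `GPUTemperatureHigh` alert name, and the output file name are all hypothetical, and the annotation assumes the `gpu` and `Hostname` labels are present on the metric in your cluster. The `for: 5m` clause makes Prometheus wait until the condition has held for five minutes before firing.

```shell
# Write a hypothetical, less noisy variant of the rule to a local file.
# THRESHOLD and the file name are illustrative assumptions, not values from the post.
THRESHOLD="${THRESHOLD:-85}"

cat > gputemp-sustained.yaml <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gputemp-sustained
  namespace: nvidia-gpu-operator
spec:
  groups:
  - name: gputemp-sustained-rule
    rules:
    - alert: GPUTemperatureHigh
      expr: DCGM_FI_DEV_GPU_TEMP > ${THRESHOLD}
      # Fire only after the condition has held for 5 minutes.
      for: 5m
      labels:
        severity: warning
      annotations:
        description: GPU {{ \$labels.gpu }} on {{ \$labels.Hostname }} is at {{ \$value }} degrees Celsius
EOF
```

The generated file can be reviewed and then applied the same way as the demo rule, with `oc apply -f gputemp-sustained.yaml`. Pick a threshold that matches the thermal limits of your own GPU model.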
Alerts like this one can then be routed to any of the AlertManager receivers that OpenShift supports.
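As a rough illustration of what such routing might look like, the sketch below writes an Alertmanager configuration fragment that sends only the `gputemp` alert to a webhook. The receiver name `gpu-webhook`, the file name, and the URL are placeholders I made up for this example. It is a fragment, not a full configuration: it would need to be merged into the cluster's existing Alertmanager configuration rather than replace it.

```shell
# Write an Alertmanager routing fragment to a local file for review.
# The receiver name and webhook URL are placeholders, not values from this post.
cat > gputemp-route-fragment.yaml <<'EOF'
route:
  routes:
  # Send only the gputemp alert to the dedicated receiver.
  - matchers:
    - alertname = "gputemp"
    receiver: gpu-webhook
receivers:
- name: gpu-webhook
  webhook_configs:
  - url: http://webhook.example.invalid/gpu-alerts
EOF
```

Matching on `alertname` keeps the route narrow. Matching on a label such as `severity` would instead capture every alert with that severity, which may or may not be what you want.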