Hardware Observer

  • Canonical BootStack Charmers
| Channel | Revision | Published | Runs on |
|---|---|---|---|
| latest/stable | 84 | 02 Jul 2024 | Ubuntu 22.04, 20.04, 18.04 |
| latest/stable | 13 | 01 Nov 2023 | Ubuntu 22.04, 20.04, 18.04 |
| latest/candidate | 113 | 15 Oct 2024 | Ubuntu 24.04, 22.04, 20.04, 18.04 |
| latest/candidate | 112 | 15 Oct 2024 | Ubuntu 24.04, 22.04, 20.04, 18.04 |
| latest/candidate | 13 | 30 Oct 2023 | Ubuntu 24.04, 22.04, 20.04, 18.04 |
| latest/edge | 121 | 11 Nov 2024 | Ubuntu 24.04, 22.04, 20.04, 18.04 |
| latest/edge | 120 | 11 Nov 2024 | Ubuntu 24.04, 22.04, 20.04, 18.04 |
| latest/edge | 119 | 11 Nov 2024 | Ubuntu 24.04, 22.04, 20.04, 18.04 |
| latest/edge | 118 | 11 Nov 2024 | Ubuntu 24.04, 22.04, 20.04, 18.04 |
| latest/edge | 15 | 03 Nov 2023 | Ubuntu 24.04, 22.04, 20.04, 18.04 |

juju deploy hardware-observer

Platform: Ubuntu 24.04, 22.04, 20.04, 18.04

Metrics

Hardware Observer exposes the following GPU metrics through dcgm-exporter and node_exporter:

| Metric Name | Description | Labels |
|---|---|---|
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (in C) | DCGM_FI_DEV_BAR1_TOTAL, DCGM_FI_DEV_BRAND, DCGM_FI_DEV_CC_MODE, DCGM_FI_DEV_COMPUTE_MODE, DCGM_FI_DEV_COUNT, DCGM_FI_DEV_CUDA_COMPUTE_CAPABILITY, DCGM_FI_DEV_ECC_CURRENT, DCGM_FI_DEV_ECC_INFOROM_VER, DCGM_FI_DEV_ENFORCED_POWER_LIMIT, DCGM_FI_DEV_FB_TOTAL, DCGM_FI_DEV_GPU_MAX_OP_TEMP, DCGM_FI_DEV_INFOROM_IMAGE_VER, DCGM_FI_DEV_MAX_MEM_CLOCK, DCGM_FI_DEV_MAX_SM_CLOCK, DCGM_FI_DEV_MINOR_NUMBER, DCGM_FI_DEV_NAME, DCGM_FI_DEV_OEM_INFOROM_VER, DCGM_FI_DEV_PERSISTENCE_MODE, DCGM_FI_DEV_POWER_MGMT_LIMIT, DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, DCGM_FI_DEV_POWER_MGMT_LIMIT_MIN, DCGM_FI_DEV_SERIAL, DCGM_FI_DEV_SHUTDOWN_TEMP, DCGM_FI_DEV_SLOWDOWN_TEMP, DCGM_FI_DEV_VBIOS_VERSION, DCGM_FI_DEV_VIRTUAL_MODE, DCGM_FI_DRIVER_VERSION, DCGM_FI_NVML_VERSION, Hostname, UUID, device, gpu, modelName, pci_bus_id |
| DCGM_FI_DEV_POWER_USAGE | Power draw (in W) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (in %) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_FAN_SPEED | Fan speed (in 0-100%) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency (in MHz) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (in %) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Throttling reasons bitmask | Same as DCGM_FI_DEV_GPU_TEMP |
| node_hwmon_chip_names | Annotation metric for human-readable chip names | chip, chip_name |
| node_hwmon_temp_celsius | Hardware monitor for temperature (input) | chip, sensor |
| node_hwmon_power_average_watt | Hardware monitor for power usage in watts (average) | chip, sensor |
| node_hwmon_freq_freq_mhz | Hardware monitor for GPU frequency in MHz | sensor, chip |
| node_hwmon_fan_rpm | Hardware monitor for fan revolutions per minute (input) | sensor, chip |
| node_hwmon_fan_max_rpm | Hardware monitor for fan revolutions per minute (max) | sensor, chip |
| node_drm_card_info | Card information | card, chip, memory_vendor, power_performance_level, unique_id |
| node_drm_gpu_busy_percent | How busy the GPU is as a percentage | card, chip |
| node_drm_memory_vram_used_bytes | The used amount of VRAM in bytes | card, chip |
| node_drm_memory_vram_size_bytes | The size of VRAM in bytes | card, chip |

NOTE: This is the subset of metrics used for alerts and the GPU dashboard. Please see this file to learn about other DCGM metrics.

NOTE: Metrics prefixed with node_ are provided by the DRM and hwmon collectors of node_exporter for any GPU that uses open-source drivers. node_exporter is deployed by the grafana-agent charm, not by hardware-observer; these metrics are listed here for convenience.
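
As an illustration of how the node_drm_* metrics listed above can be combined, the following Python sketch parses a few lines of Prometheus text exposition and derives the VRAM usage percentage per card. The sample text, its label values, and the helper names parse_samples and vram_used_percent are hypothetical and not part of the charm. The parser is deliberately simplified: it skips comment lines and any line it cannot match, and it does not handle timestamps or escaped characters in label values.

```python
import re

# Hypothetical scrape output; label values and numbers are made up for illustration.
SAMPLE = '''\
# HELP node_drm_memory_vram_used_bytes The used amount of VRAM in bytes
node_drm_memory_vram_used_bytes{card="card0",chip="0000:03:00.0"} 2147483648
node_drm_memory_vram_size_bytes{card="card0",chip="0000:03:00.0"} 8589934592
node_drm_gpu_busy_percent{card="card0",chip="0000:03:00.0"} 37
'''

# name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # simplified parser: ignore anything unexpected
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


def vram_used_percent(samples):
    """Map (card, chip) to used VRAM as a percentage of total VRAM."""
    used, size = {}, {}
    for name, labels, value in samples:
        key = (labels.get('card'), labels.get('chip'))
        if name == 'node_drm_memory_vram_used_bytes':
            used[key] = value
        elif name == 'node_drm_memory_vram_size_bytes':
            size[key] = value
    # Only report cards that have both values and a non-zero size.
    return {key: 100.0 * used[key] / size[key] for key in used if size.get(key)}


if __name__ == '__main__':
    for (card, chip), percent in vram_used_percent(parse_samples(SAMPLE)).items():
        print(f'{card} ({chip}): {percent:.1f}% VRAM used')
```

In a real deployment the same ratio would more naturally be computed in a Prometheus query or a Grafana panel; the sketch only shows how the two metrics relate.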

Alerts

Hardware Observer provides the following alerts for NVIDIA GPUs:

| Alert Rule Name | Description | Severity |
|---|---|---|
| GPUPowerBrakeThrottle | NVIDIA GPU Hardware Power Brake Slowdown throttling detected | Warning |
| GPUThermalHWThrottle | NVIDIA GPU Hardware Thermal throttling detected | Warning |
| GPUThermalSWThrottle | NVIDIA GPU Software Thermal throttling detected | Warning |
| GPUSyncBoostThrottle | NVIDIA GPU Sync Boost throttling detected | Warning |
| GPUSlowdownThrottle | GPU Hardware Slowdown throttling detected | Warning |
| GPUPowerThrottle | GPU Software Power throttling detected | Warning |

For more details, please see NVIDIA Clocks Throttle reasons.

Throttling detection is currently only available for NVIDIA GPUs.
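
To make the relationship between the DCGM_FI_DEV_CLOCK_THROTTLE_REASONS bitmask and these alerts more concrete, here is a minimal Python sketch that decodes a bitmask value into alert names. The bit values follow NVML's nvmlClocksThrottleReason* constants. The pairing of each bit with an alert rule name is an assumption inferred from the alert descriptions above, not taken from the charm's rule files, and the helper name active_throttle_alerts is hypothetical.

```python
# Bit values as defined by NVML's nvmlClocksThrottleReason* constants.
# The alert name attached to each bit is an assumption based on the
# descriptions in the alerts table, not on the charm's actual rules.
THROTTLE_BITS = {
    0x04: "GPUPowerThrottle",       # software power cap
    0x08: "GPUSlowdownThrottle",    # hardware slowdown
    0x10: "GPUSyncBoostThrottle",   # sync boost
    0x20: "GPUThermalSWThrottle",   # software thermal slowdown
    0x40: "GPUThermalHWThrottle",   # hardware thermal slowdown
    0x80: "GPUPowerBrakeThrottle",  # hardware power brake slowdown
}


def active_throttle_alerts(bitmask):
    """Return the alert names whose bit is set, ordered by bit value.

    Prometheus sample values are floats, so the input is converted to int
    before the bits are tested.
    """
    mask = int(bitmask)
    return [name for bit, name in sorted(THROTTLE_BITS.items()) if mask & bit]


if __name__ == "__main__":
    # 0x48 = hardware slowdown (0x08) + hardware thermal slowdown (0x40)
    print(active_throttle_alerts(0x48))
```

Bits outside this table, such as the GPU idle bit (0x01), are ignored by the sketch because no alert in the table above corresponds to them.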