Hardware Observer

juju deploy hardware-observer

Metrics

The following GPU metrics are available through dcgm-exporter and node_exporter:

| Metric Name | Description | Labels |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (in °C) | DCGM_FI_DEV_BAR1_TOTAL, DCGM_FI_DEV_BRAND, DCGM_FI_DEV_CC_MODE, DCGM_FI_DEV_COMPUTE_MODE, DCGM_FI_DEV_COUNT, DCGM_FI_DEV_CUDA_COMPUTE_CAPABILITY, DCGM_FI_DEV_ECC_CURRENT, DCGM_FI_DEV_ECC_INFOROM_VER, DCGM_FI_DEV_ENFORCED_POWER_LIMIT, DCGM_FI_DEV_FB_TOTAL, DCGM_FI_DEV_GPU_MAX_OP_TEMP, DCGM_FI_DEV_INFOROM_IMAGE_VER, DCGM_FI_DEV_MAX_MEM_CLOCK, DCGM_FI_DEV_MAX_SM_CLOCK, DCGM_FI_DEV_MINOR_NUMBER, DCGM_FI_DEV_NAME, DCGM_FI_DEV_OEM_INFOROM_VER, DCGM_FI_DEV_PERSISTENCE_MODE, DCGM_FI_DEV_POWER_MGMT_LIMIT, DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, DCGM_FI_DEV_POWER_MGMT_LIMIT_MIN, DCGM_FI_DEV_SERIAL, DCGM_FI_DEV_SHUTDOWN_TEMP, DCGM_FI_DEV_SLOWDOWN_TEMP, DCGM_FI_DEV_VBIOS_VERSION, DCGM_FI_DEV_VIRTUAL_MODE, DCGM_FI_DRIVER_VERSION, DCGM_FI_NVML_VERSION, Hostname, UUID, device, gpu, modelName, pci_bus_id |
| DCGM_FI_DEV_POWER_USAGE | Power draw (in W) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (in %) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_FAN_SPEED | Fan speed (in %, 0-100) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency (in MHz) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (in %) | Same as DCGM_FI_DEV_GPU_TEMP |
| DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Throttling reasons bitmask | Same as DCGM_FI_DEV_GPU_TEMP |
| node_hwmon_chip_names | Annotation metric for human-readable chip names | chip, chip_name |
| node_hwmon_temp_celsius | Hardware monitor for temperature (input) | chip, sensor |
| node_hwmon_power_average_watt | Hardware monitor for power usage in watts (average) | chip, sensor |
| node_hwmon_freq_freq_mhz | Hardware monitor for GPU frequency in MHz | chip, sensor |
| node_hwmon_fan_rpm | Hardware monitor for fan revolutions per minute (input) | chip, sensor |
| node_hwmon_fan_max_rpm | Hardware monitor for fan revolutions per minute (max) | chip, sensor |
| node_drm_card_info | Card information | card, chip, memory_vendor, power_performance_level, unique_id |
| node_drm_gpu_busy_percent | How busy the GPU is as a percentage | card, chip |
| node_drm_memory_vram_used_bytes | The used amount of VRAM in bytes | card, chip |
| node_drm_memory_vram_size_bytes | The size of VRAM in bytes | card, chip |

NOTE: This is the subset of metrics used for alerts and the GPU dashboard. Please see this file to learn about other DCGM metrics.

NOTE: Metrics prefixed with node_ are provided by the node_exporter DRM and hwmon collectors for any GPU that uses open-source drivers. node_exporter is deployed by the grafana-agent charm, not by hardware-observer; these metrics are listed here for convenience.
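
As an illustration of how the node_drm metrics can be combined, the following minimal sketch parses Prometheus text exposition output and derives VRAM usage as a percentage per card. The sample text, its label values, and the helper names parse_samples and vram_used_percent are hypothetical and chosen only for this example; they are not part of Hardware Observer or node_exporter.

```python
import re

# Hypothetical scrape output; the numbers and label values are made up.
SAMPLE = """\
# HELP node_drm_memory_vram_size_bytes The size of VRAM in bytes
node_drm_memory_vram_size_bytes{card="card0",chip="0000:03:00.0"} 8589934592
node_drm_memory_vram_used_bytes{card="card0",chip="0000:03:00.0"} 2147483648
node_drm_gpu_busy_percent{card="card0",chip="0000:03:00.0"} 37
"""

# name, optional {labels}, value, optional timestamp
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\S+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def vram_used_percent(samples):
    """Map each card label to used VRAM as a percentage of its total VRAM."""
    size, used = {}, {}
    for name, labels, value in samples:
        card = labels.get("card")
        if name == "node_drm_memory_vram_size_bytes":
            size[card] = value
        elif name == "node_drm_memory_vram_used_bytes":
            used[card] = value
    return {
        card: 100.0 * used[card] / total
        for card, total in size.items()
        if card in used and total > 0
    }


print(vram_used_percent(parse_samples(SAMPLE)))  # {'card0': 25.0}
```

In a deployed environment the same ratio would more likely be computed in a dashboard or alert query; the parsing above only shows how the two metrics relate through their shared card label.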

Alerts

Hardware Observer provides the following alerts for NVIDIA GPUs:

| Alert Rule Name | Description | Severity |
| --- | --- | --- |
| GPUPowerBrakeThrottle | NVIDIA GPU Hardware Power Brake Slowdown throttling detected | Warning |
| GPUThermalHWThrottle | NVIDIA GPU Hardware Thermal throttling detected | Warning |
| GPUThermalSWThrottle | NVIDIA GPU Software Thermal throttling detected | Warning |
| GPUSyncBoostThrottle | NVIDIA GPU Sync Boost throttling detected | Warning |
| GPUSlowdownThrottle | GPU Hardware Slowdown throttling detected | Warning |
| GPUPowerThrottle | GPU Software Power throttling detected | Warning |

For more details, please see NVIDIA Clocks Throttle reasons.

Throttling detection is currently only available for NVIDIA GPUs.
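
To make the bitmask concrete, here is a minimal sketch that decodes a DCGM_FI_DEV_CLOCK_THROTTLE_REASONS value into the alert names listed above. The bit values follow NVML's clocks throttle reason constants; the pairing of each bit with an alert rule name is an assumption inferred from the alert descriptions, not taken from the charm's rule files, and decode_throttle_reasons is a hypothetical helper.

```python
# Bit values as defined by NVML's clocks throttle reasons.
# The alert names attached to each bit are an assumption based on the
# alert descriptions in this document.
THROTTLE_BITS = {
    0x04: "GPUPowerThrottle",       # software power cap
    0x08: "GPUSlowdownThrottle",    # hardware slowdown
    0x10: "GPUSyncBoostThrottle",   # sync boost
    0x20: "GPUThermalSWThrottle",   # software thermal slowdown
    0x40: "GPUThermalHWThrottle",   # hardware thermal slowdown
    0x80: "GPUPowerBrakeThrottle",  # hardware power brake slowdown
}


def decode_throttle_reasons(bitmask):
    """Return the alert names whose bit is set in the metric value."""
    mask = int(bitmask)  # metric samples usually arrive as floats
    return [name for bit, name in THROTTLE_BITS.items() if mask & bit]


# 0x48 has the hardware slowdown (0x08) and hardware thermal (0x40) bits set.
print(decode_throttle_reasons(0x48))
# ['GPUSlowdownThrottle', 'GPUThermalHWThrottle']
```

Bits outside this table, such as the GPU idle reason (0x01), are ignored here because none of the listed alerts correspond to them.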

