prometheus-k8s

Prometheus

By Canonical Observability
Channel            Revision   Published     Runs on
latest/stable      226        04 Feb 2025   Ubuntu 20.04
latest/candidate   228        04 Feb 2025   Ubuntu 20.04
latest/beta        228        21 Jan 2025   Ubuntu 20.04
latest/edge        232        10 Feb 2025   Ubuntu 20.04
1.0/stable         159        16 Feb 2024   Ubuntu 20.04
1.0/candidate      159        12 Dec 2023   Ubuntu 20.04
1.0/beta           159        12 Dec 2023   Ubuntu 20.04
1.0/edge           159        12 Dec 2023   Ubuntu 20.04
juju deploy prometheus-k8s

When we relate a metrics provider (e.g. some server) to Prometheus, we expect Prometheus to fire an alert if the server stops responding. In PromQL this can be expressed generically with the up and absent expressions:

  • up < 1
  • absent(up)

Instead of having every charm in the ecosystem duplicate the same alert rules, the prometheus_scrape and prometheus_remote_write charm libraries generate them automatically. This relieves charm authors of implementing their own HostHealth rules per charm and reduces the risk of implementation errors.

Avoiding alert fatigue

The alert rules are designed to stay in the Pending state for 5 minutes before transitioning to the Firing state. This avoids false positives in cases such as a fresh installation or flapping metric behaviour.

“Host down” vs. “metrics missing”

Note that HostHealth has slightly different semantics between remote-write and scrape:

  • If Prometheus fails to scrape a target, the target is down (up < 1).
  • If Grafana Agent fails to remote-write (regardless of whether its own scrape succeeded), the metrics are missing (absent(up)). See the sketch below.
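
The difference is easiest to see on the up series itself. Below is a minimal sketch, assuming a single scrape target; the label values are made-up examples, not taken from a real deployment:

# Scrape failure: Prometheus still knows the target, so the series exists with value 0.
#   up{juju_application="some-charm", juju_unit="some-charm/0"}  =>  0           (matches up < 1)
# Remote-write failure: the aggregator stops pushing samples, so the series disappears entirely.
#   up{juju_application="some-charm", juju_unit="some-charm/0"}  =>  no series   (matches absent(up))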

Scrape

With support for centralized (generic) alerts, Prometheus provides a HostDown alert for each charm, identifying each of its units via alert labels.

The alert rule within prometheus_scrape contains (ignoring annotations):

groups:
  - name: HostHealth
    rules:
    - alert: HostDown
      expr: up < 1
      for: 5m
      labels:
        severity: critical
    - alert: HostMetricsMissing
      # This alert is applicable only when the provider is linked via
      # an aggregator (such as grafana agent)
      expr: absent(up)
      for: 5m
      labels:
        severity: critical
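
When HostDown fires, the alert inherits whatever labels are present on the offending up series, which in a charmed deployment includes the Juju topology labels; that is how the specific unit is identified. A hedged illustration of the resulting label set (values are hypothetical):

# Labels on a firing HostDown alert (illustrative values only):
#   alertname="HostDown", severity="critical",
#   juju_model="my-model", juju_application="some-charm", juju_unit="some-charm/0"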

Note: We use absent(up) with for: 5m so that the alert transitions from Pending to Firing. If query portability is desired, absent_over_time(up[5m]) is an alternative, but it triggers after 5 minutes without a Pending state.
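
For comparison, here is a hedged sketch of that portable variant; because it has no for: clause, it goes straight to Firing with no Pending phase:

groups:
  - name: HostHealth
    rules:
    - alert: HostMetricsMissing
      # Fires once 'up' has been absent for the entire 5-minute window,
      # skipping the Pending state.
      expr: absent_over_time(up[5m])
      labels:
        severity: critical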

Remote write

With support for centralized (generic) alerts, Prometheus provides a HostMetricsMissing alert for Grafana Agent itself and for each application it aggregates.

Note: The HostMetricsMissing alert does not show each unit, only the application!

The alert rule within prometheus_remote_write contains (ignoring annotations):

groups:
  - name: AggregatorHostHealth
    rules:
    - alert: HostMetricsMissing
      expr: absent(up)
      for: 5m
      labels:
        severity: critical
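
A bare absent(up) cannot tell one aggregated application from another, so per-application alerts imply that the expression is scoped with the application's topology when the rule is rendered. The exact label set injected is an assumption here; a sketch of what one rendered rule could look like:

    # Hypothetical rendered rule for a single aggregated application (label values are examples):
    - alert: HostMetricsMissing
      expr: absent(up{juju_model="my-model", juju_application="some-charm"})
      for: 5m
      labels:
        severity: critical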

Alerting scenarios

Check Alertmanager for labelled alerts at either the unit level (HostDown) or at the application level (HostMetricsMissing).
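
The same alerts can also be inspected from the Prometheus side through the built-in ALERTS metric; the queries below are a hedged example, with the alertname values matching the rules above:

# Unit-level alerts produced by the scrape rules:
ALERTS{alertname="HostDown", alertstate="firing"}
# Application-level alerts produced by the remote-write rules:
ALERTS{alertname="HostMetricsMissing", alertstate="firing"}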

Note: In these examples, the aggregator is Grafana Agent.

Metrics endpoint (k8s charms)

  1. When a unit of some-charm is down for 5 minutes, the HostDown alert fires in the Prometheus UI (showing the specific unit).
  2. If multiple units are down, they show in the labels as well.

With an aggregator (k8s charms)

Scrape

  1. When a unit of some-charm is down for 5 minutes, the HostDown alert fires in the Prometheus UI (showing the specific unit).
  2. If multiple units are down, they show in the labels as well.

Remote write

  1. When Grafana Agent is down for 5 minutes, the HostMetricsMissing alert fires for both the HostHealth and AggregatorHostHealth groups in the Prometheus UI.

With an aggregator (machine charms)

Scrape

  1. When a unit of some-charm is down for 5 minutes, the HostDown alert fires in the Prometheus UI (showing the specific unit).
  2. If multiple units are down, they show in the labels as well.

Remote write

  1. When Grafana Agent is down for 5 minutes, the HostMetricsMissing alert fires for both the HostHealth and AggregatorHostHealth groups in the Prometheus UI.

With cos-proxy (machine charms)

  1. When cos-proxy is down for 5 minutes, the HostDown alert fires in the Prometheus UI.