alertmanager-k8s

Alertmanager

By Canonical Observability
Channel           Revision  Published    Runs on
latest/stable     150       04 Mar 2025  Ubuntu 20.04
latest/candidate  154       04 Mar 2025  Ubuntu 20.04
latest/beta       156       04 Mar 2025  Ubuntu 20.04
latest/edge       157       Yesterday    Ubuntu 20.04
1.0/stable        96        12 Dec 2023  Ubuntu 20.04
1.0/candidate     96        22 Nov 2023  Ubuntu 20.04
1.0/beta          96        22 Nov 2023  Ubuntu 20.04
1.0/edge          96        22 Nov 2023  Ubuntu 20.04
juju deploy alertmanager-k8s --channel 1.0/beta

When we relate a metrics provider (e.g. some server) to Prometheus, we expect Prometheus to post an alert if the server stops responding. With PromQL, this can be expressed generically using the up metric and the absent function:

  • up < 1
  • absent(up)

Instead of having every single charm in the ecosystem duplicate the same alert rules, they are generated automatically by the prometheus_scrape and prometheus_remote_write charm libraries. This relieves charm authors of implementing their own HostHealth rules per charm and reduces implementation errors.

Avoiding alert fatigue

The alert rules are designed to remain in the Pending state for 5 minutes before transitioning to the Firing state. This avoids false positives during a fresh installation or when metrics exhibit flapping behaviour.

“Host down” vs. “metrics missing”

Note that HostHealth has slightly different semantics between remote-write and scrape:

  • If Prometheus failed to scrape, then the target is down (up < 1).
  • If Grafana Agent failed to remote-write (regardless of whether the scrape succeeded), then the up series is absent (absent(up)).

Scrape

With support for centralized (generic) alerts, Prometheus provides a HostDown alert for each charm and each of its units via alert labels.

The alert rule within prometheus_scrape contains (ignoring annotations):

groups:
  - name: HostHealth
    rules:
    - alert: HostDown
      expr: up < 1
      for: 5m
      labels:
        severity: critical
    - alert: HostMetricsMissing
      # This alert is applicable only when the provider is linked via
      # an aggregator (such as grafana agent)
      expr: absent(up)
      for: 5m
      labels:
        severity: critical

Note: We use absent(up) with for: 5m so that the alert transitions from Pending to Firing. If query portability is desired, absent_over_time(up[5m]) is an alternative, but it fires after 5 minutes without passing through the Pending state.
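
For comparison, a rule using that alternative could look roughly like the following. This is only a sketch of the variant mentioned above, not what the charm library ships:

groups:
  - name: HostHealth
    rules:
    - alert: HostMetricsMissing
      # absent_over_time encodes the 5-minute window in the query itself,
      # so there is no `for:` clause; the alert skips the Pending state and
      # fires as soon as up has been absent for the full 5 minutes.
      expr: absent_over_time(up[5m])
      labels:
        severity: critical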

Remote write

With support for centralized (generic) alerts, Prometheus provides a HostMetricsMissing alert for Grafana Agent itself and each application that is aggregated by it.

Note: The HostMetricsMissing alert does not show each unit, only the application!

The alert rule within prometheus_remote_write contains (ignoring annotations):

groups:
  - name: AggregatorHostHealth
    rules:
    - alert: HostMetricsMissing
      expr: absent(up)
      for: 5m
      labels:
        severity: critical

Alerting scenarios

Centralized (generic) alerts are supported in the following deployment scenarios:

Note: In these examples, the aggregator is Grafana Agent.

Note: Check Alertmanager for labelled alerts at either the unit level (HostDown) or at the application level (HostMetricsMissing).
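
As an illustration, and assuming the standard Juju topology labels are attached to the alerts, the two levels of granularity could look roughly as follows (the model and unit names below are placeholders):

- alertname: HostDown             # unit level: the affected unit is identified
  severity: critical
  juju_model: my-model
  juju_application: some-charm
  juju_unit: some-charm/0
- alertname: HostMetricsMissing   # application level: no juju_unit label
  severity: critical
  juju_model: my-model
  juju_application: some-charm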

Metrics endpoint (k8s charms)

  1. When a unit of some-charm is down for 5 minutes, the HostDown alert fires in the Prometheus UI (showing the specific unit).
  2. If multiple units are down, they show in the labels as well.

With an aggregator (k8s charms)

Scrape

  1. When a unit of some-charm is down for 5 minutes, the HostDown alert fires in the Prometheus UI (showing the specific unit).
  2. If multiple units are down, they show in the labels as well.

Remote write

  1. When Grafana Agent is down for 5 minutes, the HostMetricsMissing alert fires for both the HostHealth and AggregatorHostHealth groups in the Prometheus UI.

With an aggregator (machine charms)

Scrape

  1. When a unit of some-charm is down for 5 minutes, the HostDown alert fires in the Prometheus UI (showing the specific unit).
  2. If multiple units are down, they show in the labels as well.

Remote write

  1. When Grafana Agent is down for 5 minutes, the HostMetricsMissing alert fires for both the HostHealth and AggregatorHostHealth groups in the Prometheus UI.

With cos-proxy (machine charms)

  1. When cos-proxy is down for 5 minutes, the HostDown alert fires in the Prometheus UI.
