When we relate a metrics provider (e.g. some server) to Prometheus, we expect Prometheus to post an alert if the server is not responding. With Prometheus’s PromQL this can be expressed universally with `up` and `absent` expressions:

```
up < 1
absent(up)
```
Instead of having every single charm in the ecosystem duplicate the same alert rules, they are automatically generated by the `prometheus_scrape` and `prometheus_remote_write` charm libraries. This relieves charm authors of having to implement their own `HostHealth` rules per charm and reduces implementation errors.
## Avoiding alert fatigue
The alert rules are designed to remain in the `Pending` state for 5 minutes before transitioning to the `Firing` state. This is necessary to avoid false-positive alerts in cases such as a new installation or flapping metric behaviour.
## “Host down” vs. “metrics missing”
Note that `HostHealth` has slightly different semantics between remote-write and scrape:

- If Prometheus failed to scrape, then the target is down (`up < 1`).
- If Grafana Agent failed to remote-write (regardless of whether the scrape succeeded), then it’s `absent(up)`.
## Scrape
With support for centralized (generic) alerts, Prometheus provides a `HostDown` alert for each charm and each of its units via alert labels.

The alert rule within `prometheus_scrape` contains (ignoring annotations):
```yaml
groups:
  - name: HostHealth
    rules:
      - alert: HostDown
        expr: up < 1
        for: 5m
        labels:
          severity: critical
      - alert: HostMetricsMissing
        # This alert is applicable only when the provider is linked via
        # an aggregator (such as grafana agent)
        expr: absent(up)
        for: 5m
        labels:
          severity: critical
```
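To see the `Pending` window in action, the group above can be exercised with `promtool test rules`. The sketch below is illustrative only: the rule file name and the `job`/`instance` labels are assumptions for the example, not something the charm libraries produce.

```yaml
# host_down_test.yaml (hypothetical file name)
# Assumes the rule group above is saved as host_health_rules.yaml.
rule_files:
  - host_health_rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The target is up at minute 0 and down from minute 1 onwards.
      - series: 'up{job="some-charm", instance="some-charm/0"}'
        values: '1 0 0 0 0 0 0 0'
    alert_rule_test:
      # After 5 minutes of Pending, HostDown should be firing by minute 7.
      - eval_time: 7m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: some-charm
              instance: some-charm/0
```

Running `promtool test rules host_down_test.yaml` should pass: by `eval_time: 7m` the alert has spent its 5 minutes in `Pending` and moved to `Firing`.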
Note: We use `absent(up)` with `for: 5m` so that the alert transitions from `Pending` to `Firing`. If query portability is desired, `absent_over_time(up[5m])` is an alternative, but it will trigger after 5 minutes without passing through a `Pending` state.
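For illustration, the alternative mentioned in the note could be written as follows. This is a sketch of the trade-off only, not a rule that the charm libraries generate:

```yaml
groups:
  - name: HostHealth
    rules:
      - alert: HostMetricsMissing
        # absent_over_time() already spans the 5-minute window, so no `for:`
        # clause is used and the alert fires without a Pending phase.
        expr: absent_over_time(up[5m])
        labels:
          severity: critical
```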
## Remote write
With support for centralized (generic) alerts, Prometheus provides a `HostMetricsMissing` alert for Grafana Agent itself and for each application that is aggregated by it.

Note: The `HostMetricsMissing` alert does not show each unit, only the application!

The alert rule within `prometheus_remote_write` contains (ignoring annotations):
```yaml
groups:
  - name: AggregatorHostHealth
    rules:
      - alert: HostMetricsMissing
        expr: absent(up)
        for: 5m
        labels:
          severity: critical
```
## Alerting scenarios
Check Alertmanager for labelled alerts at either the unit level (`HostDown`) or at the application level (`HostMetricsMissing`).

Note: In these examples, the aggregator is Grafana Agent.
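As a rough illustration, the label sets you might see in Alertmanager could look like the following. The Juju topology label names and values here are assumptions made for the example, not exact output:

```yaml
# Hypothetical label sets, for illustration only
- alertname: HostDown            # unit level: identifies the specific unit
  juju_application: some-charm
  juju_unit: some-charm/0
  severity: critical
- alertname: HostMetricsMissing  # application level: no unit label
  juju_application: some-charm
  severity: critical
```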
### Metrics endpoint (k8s charms)
- When a unit of `some-charm` is down for 5 minutes, the `HostDown` alert fires in the Prometheus UI (showing the specific unit).
- If multiple units are down, they show in the labels as well.
### With an aggregator (k8s charms)

#### Scrape
- When a unit of `some-charm` is down for 5 minutes, the `HostDown` alert fires in the Prometheus UI (showing the specific unit).
- If multiple units are down, they show in the labels as well.
#### Remote write
- When Grafana Agent is down for 5 minutes, the `HostMetricsMissing` alert fires for both the `HostHealth` and `AggregatorHostHealth` groups in the Prometheus UI.
### With an aggregator (machine charms)

#### Scrape
- When a unit of `some-charm` is down for 5 minutes, the `HostDown` alert fires in the Prometheus UI (showing the specific unit).
- If multiple units are down, they show in the labels as well.
#### Remote write
- When Grafana Agent is down for 5 minutes, the `HostMetricsMissing` alert fires for both the `HostHealth` and `AggregatorHostHealth` groups in the Prometheus UI.
### With Cos-proxy (machine charms)
- When `cos-proxy` is down for 5 minutes, the `HostDown` alert fires in the Prometheus UI.