Alertmanager

Canonical Observability

Channel	Revision	Published	Runs on
latest/stable	157	15 Apr 2025	Ubuntu 20.04
latest/candidate	158	15 Apr 2025	Ubuntu 20.04
latest/beta	158	01 Apr 2025	Ubuntu 20.04
latest/edge	158	18 Mar 2025	Ubuntu 20.04
1.0/stable	96	12 Dec 2023	Ubuntu 20.04
1.0/candidate	96	22 Nov 2023	Ubuntu 20.04
1.0/beta	96	22 Nov 2023	Ubuntu 20.04
1.0/edge	96	22 Nov 2023	Ubuntu 20.04

This post is a living document and could change frequently!

Prometheus evaluates alert rules and Alertmanager sends the resulting alert notifications. The purpose of this document is to outline best practices for creating alerts that are useful, descriptive, and not overwhelming.

Charmed alerts are packed together with their charm, and are sent over to the Canonical Observability Stack when integrating with Prometheus. The only exception is central-host-health-alerts, which exist in cos-lib and are “injected” into deployments that include Prometheus.

Principles

Define clear objectives for your alerts

Before creating your alerts, you should understand the behavior of your application and the potential impact of issues on the users. Identify the key metrics, service level indicators (SLIs) and objectives (SLOs), and think about how you can best express your failure modes.

Map failure modes

Map failure modes to symptoms

Start by thinking about the potential failures that the system may encounter.

Potential failure	Likelihood	Symptoms
Overloaded querying engine goes out of memory	Depends on load and resources	Querying errors are logged and counter accumulates

Inspect official documentation

Known failure modes may already be documented, so read up on them first.

Map metrics to failure modes

Inspect the /metrics endpoint of the application and come up with potential failure modes and the steps that could be taken to resolve the failure.

Try to address root causes rather than symptoms.

For example:

Metric name	Failure mode / root cause	Potential resolution
`querying_errors_total`	The querying engine service is OOM-killed	Increase resource limits or update rate limit in config file
`querying_errors_total`	Client sends malformed queries	Confirm the client is using an appropriate schema

Map log lines to failure modes

Inspect the logs your application emits immediately prior to a failure. Map the contents or number of log lines to failure modes.

Try to address root causes rather than symptoms.

For example:

Log line	Failure mode / root cause	Potential resolution
write: broken pipe (*tls.permanentError)	Client terminated connection prematurely due to too low timeout	Increase timeout on the client side
write: broken pipe (*tls.permanentError)	TLS certificate expired or version mismatch	Verify certs and connections configured correctly

Combine the tables above, grouping by failure mode

We want to alert on root causes (e.g. “cert expired”) rather than symptoms (e.g. “broken pipe”), and keep the list of failure modes up to date by revisiting it as part of every incident retrospective.
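
As an illustration of alerting on the root cause rather than the symptom, the rule below is a minimal sketch that fires before a certificate expires, instead of waiting for “broken pipe” log lines. It assumes the endpoint is probed by the Prometheus blackbox exporter, which exposes `probe_ssl_earliest_cert_expiry`; the group name, alert name, 14-day threshold and `for` duration are illustrative choices, not values taken from this document.

```yaml
groups:
  - name: tls-root-cause                     # hypothetical group name
    rules:
      - alert: TlsCertificateExpiringSoon    # hypothetical alert name
        # probe_ssl_earliest_cert_expiry is a Unix timestamp; subtracting
        # time() gives the number of seconds left before expiry.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 10m                             # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expires in less than 14 days"
          description: "The certificate served by {{ $labels.instance }} expires in {{ $value | humanizeDuration }}."
```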

Write effective and relevant alerts

Receiving too many alert notifications for irrelevant issues causes responders to start ignoring them, a phenomenon called “notification overload”.

For alerts to be relevant, you should watch out for the following:

false positives: alerts whose conditions are met, but that are not indicative of an issue in the application;
recipients: make sure you’re sending alert notifications to people that can act on them;
non-actionable alerts: if the alert doesn’t contain enough information (or if it has too much), it becomes hard for someone to quickly pinpoint the issue and resolve it.

Alerts should be relevant for all of your users

When bundling alerts in a charm, they’re going to be active for all the users of your application; if an alert only applies to a subset of them, then it’s probably best not to include it in the charm.

Actionable Advice

Keep the alert title in PagerDuty short

The notification title should communicate what the problem is, but it doesn’t need to contain all the relevant information; the rest can go into the description, so that a responder can still dig deeper and pinpoint the issue.
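
One way to apply this split is to keep a short `summary` annotation and move the detail into `description`. The rule below is a sketch only: it reuses the `querying_errors_total` example metric from earlier in this document, assumes the Juju topology labels `juju_application` and `juju_unit` are present on the series, and uses an arbitrary threshold. Which annotation ends up as the PagerDuty title depends on how the Alertmanager receiver templates are configured.

```yaml
groups:
  - name: querying-engine                    # hypothetical group name
    rules:
      - alert: QueryingErrorsHigh            # hypothetical alert name
        expr: rate(querying_errors_total[5m]) > 1   # assumed threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          # Short: says what the problem is.
          summary: "Querying errors on {{ $labels.juju_application }}"
          # Longer: gives the responder enough context to dig deeper.
          description: >-
            Unit {{ $labels.juju_unit }} has reported
            {{ $value | humanize }} querying errors per second over the
            last 5 minutes. Check the resource limits and the rate limit
            in the config file.
```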

Use `group_by` to your advantage

Grouping alerts via the configuration file can be extremely helpful. Imagine an application with lots of units goes down: getting one alert per unit wouldn’t be more useful than getting a single alert for the application; in fact, the excessive number of notifications could hinder the response process, as you could easily miss some important information.
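
A minimal Alertmanager configuration sketch for this scenario is shown below. It assumes alerts carry the Juju topology labels `juju_model` and `juju_application`, so that all units of one application collapse into a single notification; the receiver name, timings and routing key placeholder are illustrative.

```yaml
route:
  receiver: pagerduty                        # hypothetical receiver name
  # One notification per alert name and application, not one per unit.
  group_by: ['alertname', 'juju_model', 'juju_application']
  group_wait: 30s        # wait for more alerts of the same group to arrive
  group_interval: 5m     # minimum delay before sending updates for a group
  repeat_interval: 4h    # re-send interval while the group is still firing

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: '<routing-key>'         # placeholder, not a real key
```

Leaving the unit label out of `group_by` is what merges per-unit alerts; the individual units remain visible in the body of the grouped notification.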

Common Alert Rules

Generic up/absent alerts

The following rules (central-host-health-alerts) were determined to be generic enough to apply to all deployments:

    - alert: HostDown
      expr: up < 1
      for: 5m
      labels:
        severity: critical

    - alert: HostMetricsMissing
      expr: absent(up)
      for: 5m
      labels:
        severity: critical

The HostMetricsMissing alert is applicable only when the provider is linked via an aggregator (such as Grafana Agent). We use for: 5m so that the alert first enters the Pending state and transitions to Firing only if the condition persists for 5 minutes. If query portability is desired, up[5m] is an alternative, but it will trigger after 5 minutes without going through a Pending state.

Prometheus Self-Monitoring

Target/Job Missing

Single Job

expr: absent(up{job=<Service>})

For example, <Service> might be:

"prometheus"
"alertmanager"
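
Substituting "prometheus" for <Service>, a complete rule could look like the sketch below. The group name, alert name, `for` duration and severity are assumptions added for illustration.

```yaml
groups:
  - name: self-monitoring                    # hypothetical group name
    rules:
      - alert: PrometheusJobMissing          # hypothetical alert name
        # absent() returns 1 when no series matches the selector.
        expr: 'absent(up{job="prometheus"})'
        for: 5m                              # assumed duration
        labels:
          severity: warning                  # assumed severity
        annotations:
          summary: "No targets found for job prometheus"
```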

Single Target

expr: up == 0

All Targets

expr: sum by (job) (up) == 0

With Warmup Time

expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))

Too Many Restarts

expr: changes(process_start_time_seconds{job=<Job>}[15m]) > 2

An example <Job> might be:

Prometheus → ~"prometheus|pushgateway|alertmanager"
Loki → ~".*loki.*"
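
With the Prometheus matcher filled in, the expression above could be wrapped in a rule such as the following sketch. The alert name, severity and annotation text are assumptions; the expression itself is the one given in this section.

```yaml
groups:
  - name: restarts                           # hypothetical group name
    rules:
      - alert: TooManyRestarts               # hypothetical alert name
        # process_start_time_seconds changes each time the process restarts,
        # so more than 2 changes in 15 minutes suggests a crash loop.
        expr: 'changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2'
        labels:
          severity: warning                  # assumed severity
        annotations:
          summary: "{{ $labels.job }} restarted more than twice in 15 minutes"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} may be crash looping."
```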

Configuration Failure

Configuration Reload Failure

expr: <Service>_config_last_reload_successful != 1

Configuration Not Synced

expr: count(count_values("config_hash", <Service>_config_hash)) > 1
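
As a sketch, the two expressions above can be instantiated with `prometheus_config_last_reload_successful` (exposed by Prometheus) and `alertmanager_config_hash` (exposed by Alertmanager). Alert names, `for` durations and severities are assumptions.

```yaml
groups:
  - name: configuration                      # hypothetical group name
    rules:
      - alert: PrometheusConfigReloadFailed  # hypothetical alert name
        # The gauge is 1 after a successful reload and 0 after a failed one.
        expr: prometheus_config_last_reload_successful != 1
        for: 5m                              # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "Prometheus configuration reload failed on {{ $labels.instance }}"

      - alert: AlertmanagerConfigNotSynced   # hypothetical alert name
        # More than one distinct hash means the cluster members are
        # running different configurations.
        expr: 'count(count_values("config_hash", alertmanager_config_hash)) > 1'
        for: 5m                              # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager instances are running different configurations"
```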

Host and Hardware

Host and hardware alert rules can warn users of a potential system failure early enough to allow remediation before a serious failure occurs. These alerts require the Prometheus node exporter, which exposes the hardware and OS metrics provided by *NIX kernels.

Some alerts that are worth mentioning:

(Over|Under)utilized Memory
- Low Swap Memory
(Over|Under)utilized CPU
Unusual Network Throughput In/Out
Unusual Disk Read/Write Rate
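
Two of the alerts in this list could be sketched as follows, using standard node exporter metrics. The thresholds (10% available memory, 80% CPU), durations and alert names are assumptions that should be tuned to the workload.

```yaml
groups:
  - name: host-and-hardware                  # hypothetical group name
    rules:
      - alert: HostMemoryOverutilized        # hypothetical alert name
        # Percentage of memory still available to applications.
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% memory available on {{ $labels.instance }}"

      - alert: HostCpuOverutilized           # hypothetical alert name
        # 100 minus the average idle percentage across all cores.
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
```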

Others

The list of alert rules above is not exhaustive. Alerting coverage can be extended to topics like:

Databases and brokers
Reverse proxies and load balancers
Runtimes
Orchestrators
Network, security and storage
