Grafana Agent

Canonical Observability

`"connection reset by peer"` and `"499 Client Closed Request"` errors

Deployment example

Let’s imagine a Grafana Agent charm that scrapes an application and forwards its metrics to a Prometheus instance in another model.

Signs that something is wrong

In such a deployment, we expect the application’s metrics (OpenStack metrics, in this example) to be sent regularly to Prometheus through Grafana Agent, but that doesn’t happen.

If we investigate, we may find error logs like this one in grafana-agent:

Jul 19 12:39:05 scexporter01 grafana-agent.grafana-agent[724]: ts=2024-07-19T12:39:05.965934155Z caller=dedupe.go:112 agent=prometheus instance=605713fb3bd3f34da68dbf90216eef44 component=remote level=warn remote_name=605713-e27a59 url=http://172.16.14.11/cos-prometheus-0/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://172.16.14.11/cos-prometheus-0/api/v1/write\": read tcp 172.16.14.99:39972->172.16.14.11:80: read: connection reset by peer"

The most important part of this log line is "read: connection reset by peer". It tells us that Prometheus is closing the connection that grafana-agent is trying to establish.
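To see which remote-write endpoints are affected and how often, the agent logs can be filtered for this message. The following is a minimal sketch: the function name count_reset_urls is hypothetical, and it assumes every matching line carries a url= field, as in the sample log line above.

```shell
#!/bin/sh
# Hypothetical helper: reads grafana-agent log lines on stdin and counts
# "connection reset by peer" failures per remote-write URL.
# Assumes each matching line contains a url=... field, as in the sample log.
count_reset_urls() {
    grep 'connection reset by peer' \
        | grep -o 'url=[^ ]*' \
        | sort \
        | uniq -c
}

# Possible usage, assuming the agent logs to the systemd journal:
#   journalctl --no-pager | count_reset_urls
```

Each output line shows a count followed by the URL that the failed batches were sent to.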

Since Traefik is the ingress for Prometheus in our deployment, we can confirm this by checking its logs:

2024-07-22T17:47:10.811Z [traefik] time="2024-07-22T17:47:10Z" level=debug msg="'499 Client Closed Request' caused by: context canceled"

We might think the problem lies somewhere in the connection between Grafana Agent and Prometheus, but in this situation the real problem is at the other end, in the application being scraped.

If the application that Grafana Agent scrapes takes a long time to respond on its metrics endpoint, in particular longer than the default timeout configured in Grafana Agent, we will start to see these types of errors because Grafana Agent sends an empty request to Prometheus.

We can verify the response times by running:

$ time curl http://APPLICATION_ADDRESS:PORT/metrics

and we will obtain a bunch of metrics and the response time:

...
# TYPE ring_member_tokens_to_own gauge
ring_member_tokens_to_own{name="compactor"} 1
ring_member_tokens_to_own{name="scheduler"} 1
curl http://APPLICATION_ADDRESS:PORT/metrics 0,01s user 0,01s system 0% cpu 12,064 total

Note that in this example, the response time is more than 12s and our default global_scrape_timeout in Grafana Agent is 10s.
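The same comparison can be scripted. This is only a sketch: measure_scrape_seconds and exceeds_timeout are hypothetical helper names, the URL is the same placeholder used above, and the 10-second limit is the default mentioned in this example.

```shell
#!/bin/sh
# Hypothetical helper: prints the total time, in seconds, that the metrics
# endpoint takes to respond. The response body is discarded.
measure_scrape_seconds() {
    curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# Hypothetical helper: exits with status 0 when the measured time is
# strictly greater than the timeout. Both arguments are plain seconds.
exceeds_timeout() {
    awk -v elapsed="$1" -v limit="$2" 'BEGIN { exit !(elapsed + 0 > limit + 0) }'
}

# Possible usage (replace the placeholder address before running):
#   elapsed=$(measure_scrape_seconds http://APPLICATION_ADDRESS:PORT/metrics)
#   if exceeds_timeout "$elapsed" 10; then
#       echo "scrape took ${elapsed}s, longer than the 10s timeout"
#   fi
```

Running the measurement several times gives a better picture than a single request, since response times can vary between scrapes.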

How to solve this situation

As we have seen, the problem is the response time of the metrics endpoint that Grafana Agent scrapes. We need to figure out the root cause of the delay.

If we are not able to reduce the response time of our metrics endpoint, we can increase the global_scrape_timeout in the Grafana Agent charm by running:

juju config grafana-agent global_scrape_timeout="15s"
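The 15s value leaves some headroom over the 12s measured earlier. As one illustrative heuristic (an assumption for this sketch, not a rule from the charm), the measured time can be rounded up to the next multiple of 5 seconds. The function name suggest_scrape_timeout is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rounds a measured scrape time (in seconds) up to the
# next multiple of 5 strictly greater than it, and prints it as a duration
# string such as "15s". The 5-second step is an arbitrary choice.
suggest_scrape_timeout() {
    awk -v elapsed="$1" 'BEGIN { printf "%ds\n", (int(elapsed / 5) + 1) * 5 }'
}

# Possible usage:
#   juju config grafana-agent global_scrape_timeout="$(suggest_scrape_timeout 12.064)"
# Running the command with the key but no value prints the current setting:
#   juju config grafana-agent global_scrape_timeout
```

A larger timeout only hides a slow endpoint, so it is still worth investigating why the application takes so long to respond.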

Last updated 8 months ago.