Grafana Agent

  • Canonical Observability

"connection reset by peer" and "499 Client Closed Request" errors

Deployment example

Let’s imagine a Grafana Agent charm that scrapes an application and forwards its metrics to a Prometheus instance in another model.

Signs that something is wrong

In such a deployment, we expect the application's metrics (OpenStack metrics in this example) to be sent regularly to Prometheus through Grafana Agent, but that doesn’t happen.

If we investigate, we may find error logs like this one in grafana-agent:

Jul 19 12:39:05 scexporter01 grafana-agent.grafana-agent[724]: ts=2024-07-19T12:39:05.965934155Z caller=dedupe.go:112 agent=prometheus instance=605713fb3bd3f34da68dbf90216eef44 component=remote level=warn remote_name=605713-e27a59 url=http://172.16.14.11/cos-prometheus-0/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://172.16.14.11/cos-prometheus-0/api/v1/write\": read tcp 172.16.14.99:39972->172.16.14.11:80: read: connection reset by peer"

The most important part of this log line is: read: connection reset by peer. This tells us that the remote end is closing the connection that grafana-agent is using to send data to Prometheus.

Since Traefik is the ingress for Prometheus in our deployment, we can confirm this by checking its logs:

2024-07-22T17:47:10.811Z [traefik] time="2024-07-22T17:47:10Z" level=debug msg="'499 Client Closed Request' caused by: context canceled"

We may think the problem is somewhere in the connection between Grafana Agent and Prometheus, but in this situation the real problem is at the other end: between Grafana Agent and the application it scrapes.

Scrape timeouts

If the application that Grafana Agent scrapes takes a long time to respond on its metrics endpoint, in particular longer than the default timeout configured in Grafana Agent, we will start to see these types of errors because Grafana Agent sends an empty request to Prometheus.

We can verify the response times by running:

$ time curl http://APPLICATION_ADDRESS:PORT/metrics

and we will obtain a bunch of metrics and the response time:

...
# TYPE ring_member_tokens_to_own gauge
ring_member_tokens_to_own{name="compactor"} 1
ring_member_tokens_to_own{name="scheduler"} 1
curl http://APPLICATION_ADDRESS:PORT/metrics 0,01s user 0,01s system 0% cpu 12,064 total

Note that in this example, the response time is more than 12s and our default global_scrape_timeout in Grafana Agent is 10s.
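The same comparison can be scripted. The following is a minimal sketch, assuming curl and awk are installed; response_time and exceeds_timeout are hypothetical helper names, and the limit of 10 seconds is the default timeout mentioned above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the charm) to compare a metrics
# endpoint's response time with a scrape timeout.

# Print the total response time of a URL in seconds, discarding the body.
response_time() {
    curl -s -o /dev/null -w '%{time_total}' "$1"
}

# Succeed (exit status 0) when the elapsed seconds exceed the limit.
# LC_ALL=C makes awk read "12.064" as a number regardless of locale.
exceeds_timeout() {
    LC_ALL=C awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t > limit) }'
}

# Usage (replace the placeholder address with the real endpoint):
#   elapsed=$(response_time "http://APPLICATION_ADDRESS:PORT/metrics")
#   if exceeds_timeout "$elapsed" 10; then
#       echo "endpoint took ${elapsed}s, longer than the 10s timeout"
#   fi
```

A single measurement can be misleading if the endpoint is only slow under load, so it may be worth repeating the check a few times.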

How to solve this situation

As we have seen, the problem is the response time of the metrics endpoint that Grafana Agent scrapes. We need to figure out the root cause of this delay.

If we are not able to reduce the response time of our metrics endpoint, we can increase global_scrape_timeout in the Grafana Agent charm by running:

juju config grafana-agent global_scrape_timeout="15s"
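To check that the change was applied, the value can be read back with the same juju config command, without assigning anything. The sketch below wraps both steps in one function; bump_scrape_timeout is a hypothetical helper name, and 15s is only the example value used above.

```shell
#!/bin/sh
# Hypothetical helper: show the current scrape timeout of an application,
# set a new one, and show the value again to confirm the change.
bump_scrape_timeout() {
    app="$1"
    new_timeout="$2"
    # "juju config <app> <key>" with no value prints the current setting.
    echo "before: $(juju config "$app" global_scrape_timeout)"
    juju config "$app" global_scrape_timeout="$new_timeout"
    echo "after: $(juju config "$app" global_scrape_timeout)"
}

# Usage:
#   bump_scrape_timeout grafana-agent 15s
```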
