Canonical Observability Stack Lite

By Canonical Observability | bundle


Channel	Revision	Published
latest/stable	11	21 Oct 2022
latest/candidate	10	21 Oct 2022
latest/beta	9	21 Oct 2022
latest/edge	18	20 Jun 2023
1.0/stable	16	21 Oct 2022
1.0/candidate	14	21 Oct 2022
1.0/beta	13	21 Oct 2022
1.0/edge	12	21 Oct 2022


One of the goals for COS Lite is to be able to ingest a considerable amount of data on modest hardware. Load testing is useful for gaining insight into how to size observability clusters appropriately.

Method

No resource limits set (for now).
MicroK8s 1.24, Juju 2.9.34.
Loki workload version was pinned to 2.4.1 (reason).
The latest results were obtained with the following charm versions:

App	Version	Charm	Channel	Rev
alertmanager	0.25.0	alertmanager-k8s	edge	64
cos-config	3.5.0	cos-configuration-k8s	edge	31
grafana	9.2.1	grafana-k8s	edge	76
loki	2.4.1	loki-k8s	edge	80
prometheus	2.42.0	prometheus-k8s	edge	119
scrape-config	n/a	prometheus-scrape-config-k8s	edge	39
scrape-target	n/a	prometheus-scrape-target-k8s	edge	25
traefik	2.9.6	traefik-k8s	edge	124

The load test uses Terraform to provision:

COS Lite VM. Deployed from the cos-lite bundle with an overlay dedicated for the load test.
Avalanche VM. Used as scrape targets for Prometheus. The number of targets and the number of metrics per target are adjustable.
Query VM. Used for “browser testing” with 20 “virtual eyeballs” looking at a Grafana dashboard consisting of 600 log lines and 7200 graph points, refreshing every 5 seconds.
Loki log VM. Used for pushing logs to Loki from multiple virtual targets (“streams”).
Monitoring VM. Used for collecting metrics from the load test’s components for monitoring and producing performance data sheets.
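To get a feel for the read load generated by the Query VM, the figures in its description can be turned into per-minute rates. The following is only a back-of-the-envelope sketch: it assumes that every refresh re-fetches the whole dashboard (no caching), which the load test setup does not state, and the function name is made up for illustration.

```python
# Figures quoted in the Query VM description above.
VIRTUAL_USERS = 20            # "virtual eyeballs"
LOG_LINES_PER_VIEW = 600
GRAPH_POINTS_PER_VIEW = 7200
REFRESH_INTERVAL_SEC = 5


def query_load_per_minute(users=VIRTUAL_USERS,
                          log_lines=LOG_LINES_PER_VIEW,
                          graph_points=GRAPH_POINTS_PER_VIEW,
                          refresh_sec=REFRESH_INTERVAL_SEC):
    """Return (refreshes, log lines read, graph points read) per minute.

    ASSUMPTION: every refresh re-fetches the full dashboard (no caching).
    """
    refreshes = users * 60 // refresh_sec
    return refreshes, refreshes * log_lines, refreshes * graph_points


refreshes, lines_read, points_read = query_load_per_minute()
print(f"{refreshes} refreshes/min, "
      f"{lines_read:,} log lines/min, "
      f"{points_read:,} graph points/min")
```

Under that assumption the 20 viewers amount to 240 dashboard refreshes per minute.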


Results

	Metrics only	Logs only	Metrics and Logs
Max scraped data points / min	1,710,000	-	1,200,000
Max ingested log lines / min	-	180,000	135,000
Storage (GB/day)	4	54	50

Note that the results above are measured maximums and leave no headroom. For production you should probably add a margin of more than 10%.
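One possible way to apply such a margin is to reduce the measured maximums before comparing a planned load against them. The sketch below is illustrative only: the scenario names, the `derated` and `fits` helpers, and the choice to interpret "margin" as a percentage reduction of the measured limit are all assumptions, and 10% is the lower bound suggested above, not a recommendation. The "Metrics and Logs" column is a single tested combination, so other mixes are not covered.

```python
# Maximum sustained per-minute rates from the results table above.
LIMITS = {
    "metrics_only": {"datapoints": 1_710_000, "log_lines": 0},
    "logs_only": {"datapoints": 0, "log_lines": 180_000},
    "metrics_and_logs": {"datapoints": 1_200_000, "log_lines": 135_000},
}


def derated(limit, margin_percent=10):
    """Reduce a measured maximum by margin_percent (integer arithmetic)."""
    return limit * (100 - margin_percent) // 100


def fits(scenario, datapoints_per_min, log_lines_per_min, margin_percent=10):
    """True if the planned load stays within the derated limits.

    ASSUMPTION: "margin" is read as a percentage cut of the measured limit.
    """
    lim = LIMITS[scenario]
    return (datapoints_per_min <= derated(lim["datapoints"], margin_percent)
            and log_lines_per_min <= derated(lim["log_lines"], margin_percent))


print(derated(1_200_000), derated(135_000))
print(fits("metrics_and_logs", 1_000_000, 120_000))
```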

Discussion

VM resource usage

Load tests that run successfully for over 6h without incidents are marked as “PASSED”, and are used for constructing the datasheet. Passing tests are also used for curve fitting an approximation for resource usage.

In the diagram below:

Numbers in square boxes represent passing tests.
Red “X” marks represent failed tests.
Gray dashed contours show the estimated resource consumption, based on curve fitting a linear function in two variables (y = a1 * x1 + a2 * x2 + b) to all the tests.
Some data points appear inconsistent with the trend. This is most likely due to changing conditions from test to test, such as: a new charm release, compaction taking place close to recording final test data, or timing alignment of the querying browser instances producing a spike. As charms and Juju stabilize, the numbers are expected to be more consistent in future tests.
Log line ingestion rate is CPU-limited, and metrics ingestion rate is memory-limited.
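The linear model y = a1 * x1 + a2 * x2 + b mentioned above can be fitted with ordinary least squares. The following is a minimal, dependency-free sketch and not the code used for the load tests: it assumes x1 and x2 are the two load variables (for example log lines and data points per minute) and y is a resource reading, and it uses synthetic samples generated from a known plane rather than real test data.

```python
def fit_plane(samples):
    """Least-squares fit of y = a1*x1 + a2*x2 + b.

    samples: iterable of (x1, x2, y) tuples. Returns (a1, a2, b).
    Solves the 3x3 normal equations with Gauss-Jordan elimination.
    """
    # Augmented normal-equation matrix [A^T A | A^T y],
    # where each row of A is (x1, x2, 1).
    m = [[0.0] * 4 for _ in range(3)]
    for x1, x2, y in samples:
        row = (x1, x2, 1.0)
        for i in range(3):
            for j in range(3):
                m[i][j] += row[i] * row[j]
            m[i][3] += row[i] * y

    for col in range(3):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique plane")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, 4):
                    m[r][c] -= factor * m[col][c]

    return tuple(m[i][3] / m[i][i] for i in range(3))


# Synthetic samples from y = 2*x1 + 3*x2 + 1 (NOT load-test data).
samples = [(x1, x2, 2 * x1 + 3 * x2 + 1)
           for x1 in range(4) for x2 in range(4)]
a1, a2, b = fit_plane(samples)
print(round(a1, 6), round(a2, 6), round(b, 6))
```

With real measurements it would probably help to scale the inputs first (for example to millions of data points per minute) so the normal equations stay well conditioned.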

Per-pod resource usage

In addition to the overall VM resource usage, we also look at the CPU and memory usage per pod (reported by kubectl top pod).

We need 0.8 CPU and 1.4 Gi of memory for every 100,000 log lines per minute ingested by Loki.
We need 0.12 CPU and 2.3 Gi of memory for every 1,000,000 metric data points per minute ingested by Prometheus.

Pod	CPU	Memory
Loki	0.75 per 100k log lines/min	1.3 Gi per 100k log lines/min
Prometheus	0.11 per 1M data points/min	2.3 Gi per 1M data points/min

Table: pod resources needed for per-minute ingestion rates.
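The table can be turned into a rough sizing helper. The sketch below assumes that resource usage scales linearly with the ingestion rate and ignores any fixed baseline, which is a simplification; the function name and the returned dictionary layout are made up for illustration. It uses the coefficients from the table.

```python
# Per-pod coefficients taken from the table above.
LOKI_CPU_PER_100K = 0.75      # CPU per 100k log lines/min
LOKI_MEM_GI_PER_100K = 1.3    # Gi per 100k log lines/min
PROM_CPU_PER_1M = 0.11        # CPU per 1M data points/min
PROM_MEM_GI_PER_1M = 2.3      # Gi per 1M data points/min


def estimate_pod_resources(log_lines_per_min, datapoints_per_min):
    """Estimate per-pod CPU and memory for the given ingestion rates.

    ASSUMPTION: purely linear scaling, no fixed baseline, no margin.
    """
    loki_units = log_lines_per_min / 100_000
    prom_units = datapoints_per_min / 1_000_000
    return {
        "loki": {
            "cpu": loki_units * LOKI_CPU_PER_100K,
            "memory_gi": loki_units * LOKI_MEM_GI_PER_100K,
        },
        "prometheus": {
            "cpu": prom_units * PROM_CPU_PER_1M,
            "memory_gi": prom_units * PROM_MEM_GI_PER_1M,
        },
    }


estimate = estimate_pod_resources(log_lines_per_min=200_000,
                                  datapoints_per_min=2_000_000)
for pod, usage in estimate.items():
    print(f"{pod}: {usage['cpu']:.2f} CPU, {usage['memory_gi']:.1f} Gi")
```

The estimates carry no headroom, so any production margin would have to be added on top.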

Conclusions

The Grafana Labs stack that COS Lite is based on can ingest a substantial amount of data on fairly moderate hardware.
Being unable to deploy the same charm revisions (workload versions) in every load test run introduced some jitter in the results.
More load tests are needed to improve the prediction accuracy of the fitted curves.

Data

https://github.com/canonical/cos-lite-bundle/blob/feature/custom-exporter/tests/load/gcp/var_ssd-4cpu-8gb.csv

Code

https://github.com/canonical/cos-lite-bundle/pull/65

Last updated 7 months ago.