COS Lite ingestion limits for 8cpu-16gb-ssd

One of the goals for COS Lite is to be able to ingest a considerable amount of data on modest hardware. Load testing gives insight into how to size observability clusters appropriately.

Method

The test method is identical to the method used for load-testing on 4cpu8gb.

  • No k8s resource limits set. The per-pod resource requirements are presented below and can be used by admins to set resource limits.
  • MicroK8s 1.27, Juju 3.1.6.
  • 20 virtual SREs (dashboard gazers) “looking” at panels with thousands of datapoints and log lines. This mimics an outage response, where 20 people are suddenly looking at a heavy dashboard at the same time.
  • No receivers are configured for alertmanager, and prometheus evaluates no rules other than self-monitoring.
  • Load tests that run successfully for over 12h without incidents are marked as “PASSED” and are used for constructing the datasheet. Passing tests are also used for curve fitting an approximation for resource usage.
  • The latest results were obtained with the following charm versions:
    App            Workload version  Charm revision
    alertmanager   0.25.0            96
    catalogue      n/a               33
    cos-config     3.5.0             44
    grafana        9.2.1             97
    loki           2.9.2             109
    prometheus     2.48.0            163
    scrape-config  n/a               45
    scrape-target  n/a               32
    traefik        2.10.4            166
  • Several IaC fixes needed by the load test were introduced in cos-lite/84.
  • Loki config changes were made (loki-k8s/325) to improve query performance.

Results

In a “lab” environment, COS Lite on an 8cpu16gb VM with a “performance” SSD disk was able to ingest any one of the following loads:

  • 6.6 million datapoints per minute
  • (6 million datapoints + 3600 log lines) per minute
  • (4.5 million datapoints + 320000 log lines) per minute

Note that the results above leave no leeway. For production, you should probably leave a margin of more than 10%.

When COS Lite is in isolation and is only ingesting its own self-monitoring metrics (“idle” mode), it consumes 6% CPU (0.48 vCPU) and 16% memory (2.56 GB).

To estimate resource usage for a given ingestion rate, use:

disk = 3.011e-4 * L + 3.823e-6 * M + 1.023
cpu = 1.89 * arctan(1.365e-4 * L) + 1.059e-7 * M + 1.644
mem = 2.063 * arctan(2.539e-3 * L) + 1.464e-6 * M + 3.3

Where:

  • disk is in GiB/day
  • cpu is in vCPUs
  • mem is in GB
  • L is the number of ingested log lines per minute
  • M is the number of ingested metric datapoints per minute
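
The three fitted formulas above can be wrapped in a small helper for quick sizing estimates. The following is a minimal sketch in Python and is not part of the load-test code; the names `Estimate` and `estimate_resources` are assumptions made for illustration.

```python
import math
from typing import NamedTuple


class Estimate(NamedTuple):
    disk_gib_per_day: float
    cpu_vcpus: float
    mem_gb: float


def estimate_resources(log_lines_per_min: float, datapoints_per_min: float) -> Estimate:
    """Evaluate the fitted curves for L log lines/min and M metric datapoints/min."""
    L, M = log_lines_per_min, datapoints_per_min
    disk = 3.011e-4 * L + 3.823e-6 * M + 1.023
    cpu = 1.89 * math.atan(1.365e-4 * L) + 1.059e-7 * M + 1.644
    mem = 2.063 * math.atan(2.539e-3 * L) + 1.464e-6 * M + 3.3
    return Estimate(disk, cpu, mem)


# Example: the mixed load from the results
# (4.5 million datapoints + 320000 log lines per minute).
estimate = estimate_resources(320_000, 4_500_000)
print(f"disk: {estimate.disk_gib_per_day:.1f} GiB/day")
print(f"cpu:  {estimate.cpu_vcpus:.2f} vCPUs")
print(f"mem:  {estimate.mem_gb:.2f} GB")
```

The curves are fits to passing test runs, so values well outside the ingestion rates listed in the results should be treated as extrapolations.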

Discussion

  • For Loki, querying (data retrieval) is expensive: the resources required for ingestion were negligible compared to querying. As a result, Loki’s resource requirements were constant (independent of ingestion rate).
  • Major contributors to load are Loki retrieval (CPU-intensive) and Prometheus ingestion (memory-intensive).

Disk usage

  • Disk usage as a function of log ingestion rate has an excellent linear fit. The fit was made using data from Loki 2.9.2 only, using a 12h average.
  • The data spread for the metrics ingestion rate isn’t great, possibly because of human error: I recorded 6h and 12h averages inconsistently (I switched from 6h to 12h during the experiment). The linear fit gives good coverage for high ingestion rates. More accurate results will be published after the load test is refactored.
  • Disk usage (GiB/day) can be calculated as follows: (3.011e-4 * L + 2.447e-3) + (3.823e-6 * M + 1.021), where L is the number of log lines per minute, and M is the number of metric datapoints per minute. The time scale of 1m was chosen to match the default scrape interval of charmed prometheus. Self-monitoring contributes the 1.021 GiB/day to the overall usage. The 2.447e-3 GiB/day (about 2.5 MiB/day) is likely just a minor fitting error (effectively zero).
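
One way to read the two slopes of the disk fit is as an average on-disk cost per ingested item. The sketch below converts them from GiB/day per (item/minute) into bytes per item. This interpretation is an assumption, not something measured directly in the load test, and the helper name `bytes_per_item` is hypothetical. Any compression and indexing overhead present during the test is folded into these averages.

```python
GIB = 2**30                # bytes in one GiB
MINUTES_PER_DAY = 24 * 60

# Slopes of the disk fit, in GiB/day per (item/minute).
LOG_SLOPE = 3.011e-4
METRIC_SLOPE = 3.823e-6


def bytes_per_item(slope_gib_per_day: float) -> float:
    """Convert a fitted slope into average on-disk bytes per ingested item."""
    # One item per minute, sustained for a day, amounts to MINUTES_PER_DAY items.
    return slope_gib_per_day * GIB / MINUTES_PER_DAY


bytes_per_log_line = bytes_per_item(LOG_SLOPE)
bytes_per_datapoint = bytes_per_item(METRIC_SLOPE)
print(f"{bytes_per_log_line:.0f} bytes per log line")
print(f"{bytes_per_datapoint:.2f} bytes per datapoint")
```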

Per-pod resource usage

  • Per-pod resource usage is interesting because it gives better insight into how compute resources are consumed across components.

Loki

  • Ingestion load is negligible compared to query load. That is why resources saturate at the same level for a broad range of ingestion rates.
  • The difference in trend between the current experiment (8cpu16gb) and the previous experiment (4cpu8gb) can be explained by:
    • Different Loki configuration.
    • Different Loki version (more memory efficient).
  • For a reason that is currently unknown to me, Charmed Loki wasn’t able to ingest more than 360k log lines per minute, even though VM resources were not exhausted. This is likely related to a Loki configuration option that I haven’t discovered yet.
  • For fitting purposes, the arctan function was used in order to capture the behaviour near the origin. The choice of arctan is arbitrary.
  • Resource usage calculation (L is the number of log lines per minute):
    • CPU usage (vCPUs): 1.442e-1 + 1.89 * arctan(1.365e-4 * L)
    • Memory usage (GB): 4.851e-2 + 2.063 * arctan(2.539e-3 * L)
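
The saturation behaviour of the arctan fits can be made concrete: since arctan tends to pi/2, each fit approaches a plateau of offset + scale * pi/2. The sketch below computes that plateau and the ingestion rate at which the arctan term reaches 90% of its maximum. The helper names and the 90% threshold are assumptions chosen for illustration.

```python
import math

# Loki fit coefficients: usage = offset + scale * arctan(rate_factor * L)
CPU_FIT = (1.442e-1, 1.89, 1.365e-4)    # vCPUs
MEM_FIT = (4.851e-2, 2.063, 2.539e-3)   # GB


def loki_usage(fit, log_lines_per_min):
    """Evaluate a Loki fit at L log lines per minute."""
    offset, scale, rate_factor = fit
    return offset + scale * math.atan(rate_factor * log_lines_per_min)


def plateau(fit):
    """Value the fit approaches as L grows (arctan tends to pi/2)."""
    offset, scale, _ = fit
    return offset + scale * math.pi / 2


def rate_at_fraction(fit, fraction):
    """Log lines/min at which the arctan term reaches `fraction` of its maximum."""
    _, _, rate_factor = fit
    return math.tan(fraction * math.pi / 2) / rate_factor


print(f"CPU plateau: {plateau(CPU_FIT):.2f} vCPUs, "
      f"90% reached near {rate_at_fraction(CPU_FIT, 0.9):,.0f} log lines/min")
print(f"Memory plateau: {plateau(MEM_FIT):.2f} GB, "
      f"90% reached near {rate_at_fraction(MEM_FIT, 0.9):,.0f} log lines/min")
```

Both fits get close to their plateau at rates far below the highest tested log rates, which matches the observation that Loki's resource usage is nearly flat across most of the range.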

Prometheus

  • Total load is strongly linear with ingestion rate, indicating that querying is cheaper than ingestion.
  • The difference in trend between the current experiment (8cpu16gb) and the previous experiment (4cpu8gb) can be explained by the different Prometheus version used, which included improvements in memory and cpu usage.
  • Resource usage calculation (M is the number of metric data points per minute):
    • CPU usage (vCPUs): 1.059e-7 * M + 1.696e-1
    • Memory usage (GB): 1.464e-6 * M + 2.51e-1
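
Because the Prometheus fits are linear, they can be evaluated directly or inverted to ask how many datapoints per minute a given memory budget supports. The following is a minimal sketch; the function names and the 8 GB budget in the example are hypothetical.

```python
def prometheus_cpu(datapoints_per_min: float) -> float:
    """Fitted Prometheus CPU usage in vCPUs."""
    return 1.059e-7 * datapoints_per_min + 1.696e-1


def prometheus_mem(datapoints_per_min: float) -> float:
    """Fitted Prometheus memory usage in GB."""
    return 1.464e-6 * datapoints_per_min + 2.51e-1


def max_datapoints_for_memory(mem_budget_gb: float) -> float:
    """Invert the memory fit: datapoints/min that a memory budget supports."""
    return max(0.0, (mem_budget_gb - 2.51e-1) / 1.464e-6)


# Example: the metrics-only load from the results (6.6 million datapoints per minute).
M = 6_600_000
print(f"cpu: {prometheus_cpu(M):.2f} vCPUs, mem: {prometheus_mem(M):.2f} GB")

# Hypothetical 8 GB memory budget for the Prometheus pod.
print(f"{max_datapoints_for_memory(8.0):,.0f} datapoints/min")
```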

Everything else

For all other pods, resource consumption is fairly constant:

Component                                    vCPUs  Memory (GB)
Grafana                                      0.25   0.2
Traefik                                      0.08   0.2
Everything else (alertmanager, MicroK8s, …)  1.0    2.6
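
The constant terms in the overall cpu, mem and disk formulas from the results appear to be the sums of the per-pod constants listed in this section. The sketch below is a quick arithmetic check of that reading; the dictionary keys are labels chosen for illustration.

```python
# Constant terms of the per-pod fits plus the fixed figures from the table above.
cpu_constants = {"loki": 1.442e-1, "prometheus": 1.696e-1,
                 "grafana": 0.25, "traefik": 0.08, "everything_else": 1.0}
mem_constants = {"loki": 4.851e-2, "prometheus": 2.51e-1,
                 "grafana": 0.2, "traefik": 0.2, "everything_else": 2.6}
# Constant terms of the two disk fits (logs and metrics).
disk_constants = {"logs_fit_offset": 2.447e-3, "self_monitoring": 1.021}

cpu_total = sum(cpu_constants.values())
mem_total = sum(mem_constants.values())
disk_total = sum(disk_constants.values())

# Compare against the constants 1.644, 3.3 and 1.023 in the overall formulas.
print(f"cpu constant:  {cpu_total:.3f} vCPUs")
print(f"mem constant:  {mem_total:.1f} GB")
print(f"disk constant: {disk_total:.3f} GiB/day")
```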

Conclusions

  • The Grafana Labs stack that COS Lite is based on can ingest a substantial amount of data with fairly moderate compute resources.
  • Resource requirements are sensitive to Loki configuration. While default config values for Loki are likely to meet most users’ needs, further tweaking will result in better tailored performance.
  • Additional work is needed to produce more repeatable and accurate results.

Data & code

See https://github.com/canonical/cos-lite-bundle/pull/84.

Future plans

  • Figure out why loki ingestion rate saturates at around 300k log lines / min, regardless of available resources.
  • Switch from flood-element and locust to k6.
  • Use the terraform juju provider instead of pure cloud-init runcmd.
  • Repeat tests with end-to-end TLS enabled.
  • Add juju metrics (juju exporter?) to load test dashboard.

Last updated 5 months ago.