Observability Libs

  • By Jon Seager
Channel      Revision  Published  Runs on
latest/edge  38        Today      Ubuntu 20.04
juju deploy observability-libs --channel edge

charms.observability_libs.v0.kubernetes_compute_resources_patch

# Copyright 2022 Canonical Ltd.
# See LICENSE file for licensing details.

"""# KubernetesComputeResourcesPatch Library.

This library is designed to enable developers to more simply patch the Kubernetes compute resource
limits and requests created by Juju during the deployment of a sidecar charm.

When initialised, this library binds a handler to the parent charm's `config-changed` event.
The config-changed event is used because it is guaranteed to fire on startup, on upgrade and on
pod churn. Additionally, resource limits may be set by charm config options, which would also be
caught out-of-the-box by this handler. The handler applies the patch to the app's StatefulSet.
This should ensure that the resource limits are correct throughout the charm's life. Additional
optional user-provided events for re-applying the patch are supported but discouraged.

The constructor takes a reference to the parent charm, the name of the container to patch, and a
callable (`resource_reqs_func`) that returns a `ResourceRequirements` object whose `limits` and
`requests` dictionaries together define the resource requirements. For information regarding the
`lightkube` `ResourceRequirements` model, please visit the `lightkube`
[docs](https://gtsystem.github.io/lightkube-models/1.23/models/core_v1/#resourcerequirements).
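
As a rough illustration of what the two dictionaries express, the stand-alone sketch below compares
a `limits` and a `requests` dictionary and reports the resources whose limit is lower than its
request, a combination that Kubernetes rejects. It is not part of this library: `to_decimal` and
`find_inconsistencies` are hypothetical helpers, and the suffix table is an assumption that covers
only a subset of the Kubernetes quantity suffixes.

```python
from decimal import Decimal
from typing import Dict, List

# Assumption: a small subset of Kubernetes quantity suffixes, enough for this sketch.
_SUFFIXES = {
    "Ki": Decimal(1024),
    "Mi": Decimal(1024) ** 2,
    "Gi": Decimal(1024) ** 3,
    "k": Decimal(1000),
    "M": Decimal(1000) ** 2,
    "G": Decimal(1000) ** 3,
    "m": Decimal("0.001"),
}


def to_decimal(quantity: str) -> Decimal:
    # Convert a quantity such as "500m" or "1Gi" to a plain Decimal (hypothetical helper).
    for suffix, multiplier in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * multiplier
    return Decimal(quantity)


def find_inconsistencies(limits: Dict[str, str], requests: Dict[str, str]) -> List[str]:
    # Return the names of the resources whose limit is lower than the request.
    return sorted(
        name
        for name in limits.keys() & requests.keys()
        if to_decimal(limits[name]) < to_decimal(requests[name])
    )


limits = {"cpu": "2", "memory": "1Gi"}
requests = {"cpu": "500m", "memory": "2Gi"}
print(find_inconsistencies(limits, requests))  # ['memory']
```

In a charm, a check of this kind would belong inside the callable passed as `resource_reqs_func`,
which may raise `ValueError` to signal an unusable spec.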


## Getting Started

To get started using the library, you just need to fetch the library using `charmcraft`. **Note
that you also need to add `lightkube` and `lightkube-models` to your charm's `requirements.txt`.**

```shell
cd some-charm
charmcraft fetch-lib charms.observability_libs.v0.kubernetes_compute_resources_patch
cat << EOF >> requirements.txt
lightkube
lightkube-models
EOF
```

Then, to initialise the library:

```python
# ...
from charms.observability_libs.v0.kubernetes_compute_resources_patch import (
    KubernetesComputeResourcesPatch,
    ResourceRequirements,
)

class SomeCharm(CharmBase):
  def __init__(self, *args):
    # ...
    self.resources_patch = KubernetesComputeResourcesPatch(
        self,
        "container-name",
        resource_reqs_func=lambda: ResourceRequirements(
            limits={"cpu": "2"}, requests={"cpu": "1"}
        ),
    )
    self.framework.observe(self.resources_patch.on.patch_failed, self._on_resource_patch_failed)

  def _on_resource_patch_failed(self, event):
    self.unit.status = BlockedStatus(event.message)
    # ...
```

Or, if, for example, the resource specs are coming from config options:

```python
class SomeCharm(CharmBase):
  def __init__(self, *args):
    # ...
    self.resources_patch = KubernetesComputeResourcesPatch(
        self,
        "container-name",
        resource_reqs_func=self._resource_spec_from_config,
    )

  def _resource_spec_from_config(self) -> ResourceRequirements:
    spec = {"cpu": self.model.config.get("cpu"), "memory": self.model.config.get("memory")}
    return ResourceRequirements(limits=spec, requests=spec)
```
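
Config options that are unset may reach the charm as `None` or as an empty string. According to
this library's source, `sanitize_resource_spec_dict` drops such entries before the patch is
applied, so they are treated as "no limit" rather than as "0". The sketch below mimics that step
with the standard library only; `drop_empty` is a hypothetical name used for illustration, not a
function exported by this library.

```python
from typing import Dict, Optional


def drop_empty(spec: Dict[str, Optional[str]]) -> Dict[str, str]:
    # Keep only the entries that carry a non-empty value (hypothetical helper).
    return {key: value for key, value in spec.items() if value}


# Simulated charm config in which "memory" was left unset by the operator.
config = {"cpu": "1", "memory": ""}
spec = drop_empty({"cpu": config.get("cpu"), "memory": config.get("memory")})
print(spec)  # {'cpu': '1'}
```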


Additionally, you may wish to use mocks in your charm's unit testing to ensure that the library
does not try to make any API calls, or open any files during testing that are unlikely to be
present, and could break your tests. The easiest way to do this is during your test `setUp`:

```python
# ...

@patch.multiple(
    "charm.KubernetesComputeResourcesPatch",
    _namespace="test-namespace",
    _is_patched=lambda *a, **kw: True,
    is_ready=lambda *a, **kw: True,
)
@patch("lightkube.core.client.GenericSyncClient")
def setUp(self, *unused):
    self.harness = Harness(SomeCharm)
    # ...
```
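
The library reads the namespace from a file in the pod's service account directory, which does not
exist on a development machine. As an alternative to patching `_namespace` as above, the file read
itself can be mocked. The following is a minimal stand-alone sketch using only `unittest.mock`;
`read_namespace` is a hypothetical stand-in that mirrors what the `_namespace` property does.

```python
from unittest.mock import mock_open, patch

NAMESPACE_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"


def read_namespace() -> str:
    # Hypothetical stand-in for the library's `_namespace` property.
    with open(NAMESPACE_FILE, "r") as f:
        return f.read().strip()


# Replace the built-in `open` so that no real file is touched during the test.
with patch("builtins.open", mock_open(read_data="test-namespace")):
    namespace = read_namespace()

print(namespace)  # test-namespace
```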

References:
- https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
- https://gtsystem.github.io/lightkube-models/1.23/models/core_v1/#resourcerequirements
"""

import decimal
import logging
from decimal import Decimal
from math import ceil, floor
from typing import Callable, Dict, List, Optional, Union

from lightkube import ApiError, Client  # pyright: ignore
from lightkube.core import exceptions
from lightkube.models.apps_v1 import StatefulSetSpec
from lightkube.models.core_v1 import (
    Container,
    PodSpec,
    PodTemplateSpec,
    ResourceRequirements,
)
from lightkube.resources.apps_v1 import StatefulSet
from lightkube.resources.core_v1 import Pod
from lightkube.types import PatchType
from lightkube.utils.quantity import equals_canonically, parse_quantity
from ops.charm import CharmBase
from ops.framework import BoundEvent, EventBase, EventSource, Object, ObjectEvents

logger = logging.getLogger(__name__)

# The unique Charmhub library identifier, never change it
LIBID = "2a6066f701444e8db44ba2f6af28da90"

# Increment this major API version when introducing breaking changes
LIBAPI = 0

# Increment this PATCH version before using `charmcraft publish-lib` or reset
# to 0 if you are raising the major API version
LIBPATCH = 7


_Decimal = Union[Decimal, float, str, int]  # types that are potentially convertible to Decimal


def adjust_resource_requirements(
    limits: Optional[dict], requests: Optional[dict], adhere_to_requests: bool = True
) -> ResourceRequirements:
    """Adjust resource limits so that `limits` and `requests` are consistent with each other.

    Args:
        limits: the "limits" portion of the resource spec.
        requests: the "requests" portion of the resource spec.
        adhere_to_requests: a flag indicating which portion should be adjusted when "limits" is
         lower than "requests":
         - if True, "limits" will be adjusted to max(limits, requests).
         - if False, "requests" will be adjusted to min(limits, requests).

    Returns:
        A `ResourceRequirements` object holding the adjusted limits and requests.

    >>> adjust_resource_requirements({}, {})
    ResourceRequirements(claims=None, limits={}, requests={})
    >>> adjust_resource_requirements({"cpu": "1"}, {})
    ResourceRequirements(claims=None, limits={'cpu': '1'}, requests={'cpu': '1'})
    >>> adjust_resource_requirements({"cpu": "1"}, {"cpu": "2"}, True)
    ResourceRequirements(claims=None, limits={'cpu': '2'}, requests={'cpu': '2'})
    >>> adjust_resource_requirements({"cpu": "1"}, {"cpu": "2"}, False)
    ResourceRequirements(claims=None, limits={'cpu': '1'}, requests={'cpu': '1'})
    >>> adjust_resource_requirements({"cpu": "1"}, {"memory": "1G"}, True)
    ResourceRequirements(claims=None, limits={'cpu': '1'}, requests={'memory': '1G', 'cpu': '1'})
    >>> adjust_resource_requirements({"cpu": "1"}, {"memory": "1G"}, False)
    ResourceRequirements(claims=None, limits={'cpu': '1'}, requests={'memory': '1G', 'cpu': '1'})
    >>> adjust_resource_requirements({"cpu": "1", "memory": "1"}, {"memory": "2"}, True)
    ResourceRequirements(\
claims=None, limits={'cpu': '1', 'memory': '2'}, requests={'memory': '2', 'cpu': '1'})
    >>> adjust_resource_requirements({"cpu": "1", "memory": "1"}, {"memory": "1G"}, False)
    ResourceRequirements(\
claims=None, limits={'cpu': '1', 'memory': '1'}, requests={'memory': '1', 'cpu': '1'})
    >>> adjust_resource_requirements({"custom-resource": "1"}, {"custom-resource": "2"}, False)
    Traceback (most recent call last):
      ...
    ValueError: Invalid limits spec: {'custom-resource': '1'}
    """
    if not is_valid_spec(limits):
        raise ValueError("Invalid limits spec: {}".format(limits))
    if not is_valid_spec(requests):
        raise ValueError("Invalid default requests spec: {}".format(requests))

    limits = sanitize_resource_spec_dict(limits) or {}
    requests = sanitize_resource_spec_dict(requests) or {}

    # Make sure we do not modify in-place
    limits, requests = limits.copy(), requests.copy()

    # Need to copy key-val pairs from "limits" to "requests", if they are not present in
    # "requests". This replicates K8s behavior:
    # https://kubernetes.io/docs/concepts/configuration/manage-resources-containers
    requests.update({k: limits[k] for k in limits if k not in requests})

    if adhere_to_requests:
        # Keep requests fixed and raise limits when `limits` is too low
        adjusted, fixed = limits, requests
        func = max
    else:
        # Pull down requests when limit is too low
        fixed, adjusted = limits, requests
        func = min

    for k in adjusted:
        if k not in fixed:
            # The resource constraint is present in the "adjusted" dict but not in the "fixed"
            # dict. Keep the "adjusted" value as is
            continue

        adjusted_value = func(parse_quantity(fixed[k]), parse_quantity(adjusted[k]))  # type: ignore[type-var]
        adjusted[k] = (
            str(adjusted_value.quantize(decimal.Decimal("0.001"), rounding=decimal.ROUND_UP))  # type: ignore[union-attr]
            .rstrip("0")
            .rstrip(".")
        )

    return (
        ResourceRequirements(limits=adjusted, requests=fixed)
        if adhere_to_requests
        else ResourceRequirements(limits=fixed, requests=adjusted)
    )


def is_valid_spec(spec: Optional[dict], debug=False) -> bool:  # noqa: C901
    """Check if the spec dict is valid.

    TODO: generally, the keys can be anything, not just cpu and memory. Perhaps user could pass
     list of custom allowed keys in addition to the K8s ones?
    """
    if spec is None:
        return True
    if not isinstance(spec, dict):
        if debug:
            logger.error(
                "Invalid resource spec type '%s': must be either None or dict.", type(spec)
            )
        return False

    for k, v in spec.items():
        valid_keys = ["cpu", "memory"]  # K8s permits custom keys, but we limit here to what we use
        if k not in valid_keys:
            if debug:
                logger.error("Invalid key in resource spec: %s; valid keys: %s.", k, valid_keys)
            return False
        try:
            assert isinstance(v, (str, type(None)))  # for type checker
            pv = parse_quantity(v)
        except ValueError:
            if debug:
                logger.error("Invalid resource spec entry: {%s: %s}.", k, v)
            return False

        if pv and pv < 0:
            if debug:
                logger.error("Invalid resource spec entry: {%s: %s}; must be non-negative.", k, v)
            return False

    return True


def sanitize_resource_spec_dict(spec: Optional[dict]) -> Optional[dict]:
    """Fix spec values without altering semantics.

    The purpose of this helper function is to correct known issues.
    This function is not intended for fixing user mistakes such as incorrect keys present; that is
    left for the `is_valid_spec` function.
    """
    if not spec:
        return spec

    d = spec.copy()

    for k, v in spec.items():
        if not v:
            # Need to ignore empty values input, otherwise the StatefulSet will have "0" as the
            # setpoint, the pod will not be scheduled and the charm would be stuck in unknown/lost.
            # This slightly changes the spec semantics compared to lightkube/k8s: a setpoint of
            # `None` would be interpreted here as "no limit".
            del d[k]

    # Round up memory to whole bytes. This is needed to avoid K8s errors such as:
    # fractional byte value "858993459200m" (0.8Gi) is invalid, must be an integer
    memory = d.get("memory")
    if memory:
        as_decimal = parse_quantity(memory)
        if as_decimal and as_decimal != floor(as_decimal):
            d["memory"] = str(ceil(as_decimal))
    return d


class K8sResourcePatchFailedEvent(EventBase):
    """Emitted when patching fails."""

    def __init__(self, handle, message=None):
        super().__init__(handle)
        self.message = message

    def snapshot(self) -> Dict:
        """Save the failure message."""
        return {"message": self.message}

    def restore(self, snapshot):
        """Restore the failure message."""
        self.message = snapshot["message"]


class K8sResourcePatchEvents(ObjectEvents):
    """Events raised by :class:`KubernetesComputeResourcesPatch`."""

    patch_failed = EventSource(K8sResourcePatchFailedEvent)


class ContainerNotFoundError(ValueError):
    """Raised when a given container does not exist in the list of containers."""


class ResourcePatcher:
    """Helper class for patching a container's resource limits in a given StatefulSet."""

    def __init__(self, namespace: str, statefulset_name: str, container_name: str):
        self.namespace = namespace
        self.statefulset_name = statefulset_name
        self.container_name = container_name
        self.client = Client()  # pyright: ignore

    def _patched_delta(self, resource_reqs: ResourceRequirements) -> StatefulSet:
        statefulset = self.client.get(
            StatefulSet, name=self.statefulset_name, namespace=self.namespace
        )

        return StatefulSet(
            spec=StatefulSetSpec(
                selector=statefulset.spec.selector,  # type: ignore[attr-defined]
                serviceName=statefulset.spec.serviceName,  # type: ignore[attr-defined]
                template=PodTemplateSpec(
                    spec=PodSpec(
                        containers=[Container(name=self.container_name, resources=resource_reqs)]
                    )
                ),
            )
        )

    @classmethod
    def _get_container(cls, container_name: str, containers: List[Container]) -> Container:
        """Find our container from the container list, assuming list is unique by name.

        Typically, *.spec.containers[0] is the charm container, and [1] is the (only) workload.

        Raises:
            ContainerNotFoundError, if the user-provided container name does not exist in the list.

        Returns:
            An instance of :class:`Container` whose name matches the given name.
        """
        try:
            return next(iter(filter(lambda ctr: ctr.name == container_name, containers)))
        except StopIteration:
            raise ContainerNotFoundError(f"Container '{container_name}' not found")

    def is_patched(self, resource_reqs: ResourceRequirements) -> bool:
        """Report whether the resource patch has been applied to the StatefulSet.

        Returns:
            bool: A boolean indicating if the resource patch has been applied.
        """
        return equals_canonically(self.get_templated(), resource_reqs)  # pyright: ignore

    def get_templated(self) -> Optional[ResourceRequirements]:
        """Returns the resource limits specified in the StatefulSet template."""
        statefulset = self.client.get(
            StatefulSet, name=self.statefulset_name, namespace=self.namespace
        )
        podspec_tpl = self._get_container(
            self.container_name,
            statefulset.spec.template.spec.containers,  # type: ignore[attr-defined]
        )
        return podspec_tpl.resources

    def get_actual(self, pod_name: str) -> Optional[ResourceRequirements]:
        """Return the resource limits that are in effect for the container in the given pod."""
        pod = self.client.get(Pod, name=pod_name, namespace=self.namespace)
        podspec = self._get_container(
            self.container_name, pod.spec.containers  # type: ignore[attr-defined]
        )
        return podspec.resources

    def is_ready(self, pod_name: str, resource_reqs: ResourceRequirements) -> bool:
        """Report whether the resource patch has been applied and is in effect.

        Returns:
            bool: A boolean indicating if the resource patch has been applied and is in effect.
        """
        logger.info(
            "reqs=%s, templated=%s, actual=%s",
            resource_reqs,
            self.get_templated(),
            self.get_actual(pod_name),
        )
        return self.is_patched(resource_reqs) and equals_canonically(  # pyright: ignore
            resource_reqs, self.get_actual(pod_name)  # pyright: ignore
        )

    def apply(self, resource_reqs: ResourceRequirements) -> None:
        """Patch the Kubernetes resources created by Juju to limit cpu or mem."""
        # Nothing to do if the StatefulSet template already carries the desired requirements.
        if self.is_patched(resource_reqs):
            return

        self.client.patch(
            StatefulSet,
            self.statefulset_name,
            self._patched_delta(resource_reqs),
            namespace=self.namespace,
            patch_type=PatchType.APPLY,
            field_manager=self.__class__.__name__,
        )


class KubernetesComputeResourcesPatch(Object):
    """A utility for patching the Kubernetes compute resources set up by Juju."""

    on = K8sResourcePatchEvents()  # pyright: ignore

    def __init__(
        self,
        charm: CharmBase,
        container_name: str,
        *,
        resource_reqs_func: Callable[[], ResourceRequirements],
        refresh_event: Optional[Union[BoundEvent, List[BoundEvent]]] = None,
    ):
        """Constructor for KubernetesComputeResourcesPatch.

        References:
            - https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

        Args:
            charm: the charm that is instantiating the library.
            container_name: the container for which to apply the resource limits.
            resource_reqs_func: a callable returning a `ResourceRequirements`; if it raises, it
              should only raise ValueError.
            refresh_event: an optional bound event or list of bound events which
                will be observed to re-apply the patch.
        """
        super().__init__(charm, "{}_{}".format(self.__class__.__name__, container_name))
        self._charm = charm
        self._container_name = container_name
        self.resource_reqs_func = resource_reqs_func
        self.patcher = ResourcePatcher(self._namespace, self._app, container_name)

        # Ensure this patch is applied during the 'config-changed' event, which is emitted on
        # every startup and every upgrade. The config-changed event is a good time to apply this
        # kind of patch because it is always emitted after storage-attached, leadership and
        # peer-created, all of which only fire after install. Patching the statefulset prematurely
        # could result in those events firing without a workload.
        self.framework.observe(charm.on.config_changed, self._on_config_changed)

        if not refresh_event:
            refresh_event = []
        elif not isinstance(refresh_event, list):
            refresh_event = [refresh_event]
        for ev in refresh_event:
            self.framework.observe(ev, self._on_config_changed)

    def _on_config_changed(self, _):
        self._patch()

    def _patch(self) -> None:
        """Patch the Kubernetes resources created by Juju to limit cpu or mem."""
        try:
            resource_reqs = self.resource_reqs_func()
            limits = resource_reqs.limits
            requests = resource_reqs.requests
        except ValueError as e:
            msg = f"Failed obtaining resource limit spec: {e}"
            logger.error(msg)
            self.on.patch_failed.emit(message=msg)
            return

        for spec in (limits, requests):
            if not is_valid_spec(spec):
                msg = f"Invalid resource limit spec: {spec}"
                logger.error(msg)
                self.on.patch_failed.emit(message=msg)
                return

        resource_reqs = ResourceRequirements(
            limits=sanitize_resource_spec_dict(limits),  # type: ignore[arg-type]
            requests=sanitize_resource_spec_dict(requests),  # type: ignore[arg-type]
        )

        try:
            self.patcher.apply(resource_reqs)

        except exceptions.ConfigError as e:
            msg = f"Error creating k8s client: {e}"
            logger.error(msg)
            self.on.patch_failed.emit(message=msg)
            return

        except ApiError as e:
            if e.status.code == 403:
                msg = f"Kubernetes resources patch failed: `juju trust` this application. {e}"
            else:
                msg = f"Kubernetes resources patch failed: {e}"

            logger.error(msg)
            self.on.patch_failed.emit(message=msg)

        except ValueError as e:
            msg = f"Kubernetes resources patch failed: {e}"
            logger.error(msg)
            self.on.patch_failed.emit(message=msg)

        else:
            logger.info(
                "Kubernetes resources for app '%s', container '%s' patched successfully: %s",
                self._app,
                self._container_name,
                resource_reqs,
            )

    def is_ready(self) -> bool:
        """Report whether the resource patch has been applied and is in effect.

        Returns:
            bool: A boolean indicating if the resource patch has been applied and is in effect.
        """
        try:
            resource_reqs = self.resource_reqs_func()
            limits = resource_reqs.limits
            requests = resource_reqs.requests
        except ValueError as e:
            msg = f"Failed obtaining resource limit spec: {e}"
            logger.error(msg)
            return False

        if not is_valid_spec(limits) or not is_valid_spec(requests):
            logger.error("Invalid resource requirements specs: %s, %s", limits, requests)
            return False

        resource_reqs = ResourceRequirements(
            limits=sanitize_resource_spec_dict(limits),  # type: ignore[arg-type]
            requests=sanitize_resource_spec_dict(requests),  # type: ignore[arg-type]
        )

        try:
            return self.patcher.is_ready(self._pod, resource_reqs)
        except (ValueError, ApiError) as e:
            msg = f"Failed to apply resource limit patch: {e}"
            logger.error(msg)
            self.on.patch_failed.emit(message=msg)
            return False

    @property
    def _app(self) -> str:
        """Name of the current Juju application.

        Returns:
            str: A string containing the name of the current Juju application.
        """
        return self._charm.app.name

    @property
    def _pod(self) -> str:
        """Name of the unit's pod.

        Returns:
            str: A string containing the name of the current unit's pod.
        """
        return "-".join(self._charm.unit.name.rsplit("/", 1))

    @property
    def _namespace(self) -> str:
        """The Kubernetes namespace we're running in.

        If a charm is deployed into the controller model (which certainly could happen as we move
        to representing the controller as a charm) then self._charm.model.name != k8s namespace.
        Instead, the model name is controller in Juju and controller-<controller-name> for the
        namespace in K8s.

        Returns:
            str: A string containing the name of the current Kubernetes namespace.
        """
        with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f:
            return f.read().strip()