Rolling Ops Library and Example Charm

  • Canonical
Channel         Revision   Published     Runs on
latest/stable   5          26 Nov 2024   Ubuntu 22.04
latest/edge     13         19 Dec 2024   Ubuntu 22.04
juju deploy rolling-ops

charms.rolling_ops.v0.rollingops

This library enables "rolling" operations across units of a charmed Application.

For example, a charm author might use this library to implement a "rolling restart", in which all units in an application restart their workload, but no two units execute the restart at the same time.

To implement the rolling restart, a charm author would do the following:

  1. Add a peer relation called 'restart' to a charm's metadata.yaml:
peers:
    restart:
        interface: rolling_op

  2. Import this library into src/charm.py, and initialize a RollingOpsManager in the Charm's __init__. The Charm should also define a callback routine, which will be executed when a unit holds the distributed lock:

src/charm.py

# ...
from charms.rolling_ops.v0.rollingops import RollingOpsManager
# ...
class SomeCharm(...):
    def __init__(...):
        # ...
        self.restart_manager = RollingOpsManager(
            charm=self, relation="restart", callback=self._restart
        )
        # ...
    def _restart(self, event):
        systemd.service_restart('foo')

To kick off the rolling restart, emit this library's AcquireLock event. The simplest way to do so would be with an action, though it might make sense to acquire the lock in response to another event.

    def _on_trigger_restart(self, event):
        self.on[self.restart_manager.name].acquire_lock.emit()

In order to trigger the restart, a human operator would execute the following command on the CLI:

juju run-action some-charm/0 some-charm/1 <... some-charm/n> restart

Note that all units that plan to restart must receive the action and emit the acquire event. Any units that do not run their acquire handler will be left out of the rolling restart. (An operator might take advantage of this fact to recover from a failed rolling operation without restarting workloads that were able to successfully restart: simply omit the successful units from a subsequent run-action call.)
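
The behaviour described above can be modelled without Juju. The following is a minimal, self-contained sketch, not the library's implementation: simulate_rolling_op and the unit names are hypothetical, and the leader's scheduling is reduced to granting the lock to one requesting unit at a time in sorted order (the real ordering is not documented here and is an assumption).

```python
from typing import Callable, Dict, Iterable, List, Optional


def simulate_rolling_op(
    units: Iterable[str],
    requesting: Iterable[str],
    callback: Callable[[str], None],
) -> List[str]:
    """Model a rolling operation: only requesting units run, one at a time.

    Hypothetical helper for illustration; it is not part of rollingops.
    """
    # Unit-side state: a unit that received the action has asked to "acquire".
    unit_status: Dict[str, str] = {u: "acquire" for u in requesting}
    # Application-side state: None (idle) until the leader grants the lock.
    app_state: Dict[str, Optional[str]] = {u: None for u in units}
    order: List[str] = []

    for unit in sorted(app_state):
        if unit_status.get(unit) != "acquire":
            continue  # never asked for the lock, so it is left out
        # At most one unit may hold the lock at any time.
        assert "granted" not in app_state.values()
        app_state[unit] = "granted"
        callback(unit)  # e.g. restart the workload
        order.append(unit)
        unit_status[unit] = "release"  # the unit reports that it is finished
        app_state[unit] = None  # the leader returns the lock to idle
    return order


restarted: List[str] = []
order = simulate_rolling_op(
    units=["some-charm/0", "some-charm/1", "some-charm/2"],
    requesting=["some-charm/0", "some-charm/2"],
    callback=restarted.append,
)
print(order)
```

In this sketch some-charm/1 never requested the lock, so its callback never runs, which mirrors how an operator could skip units that already restarted successfully.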


class LockNoRelationError

Description

Raised if we are trying to process a lock, but do not appear to have a relation yet.

class LockState

Possible states for our Distributed lock.

Description

Note that there are two states set on the unit, and two on the application.
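
As a rough sketch of how such a set of states might be written, assuming the string values match the relation data documented for the Lock class on this page. The class name ExampleLockState, the member names, and the "idle" value standing in for None are all assumptions, not confirmed by this page.

```python
from enum import Enum


class ExampleLockState(Enum):
    """Illustrative lock states; names and the "idle" value are assumptions."""

    # Set by a unit in its own relation data.
    ACQUIRE = "acquire"
    RELEASE = "release"
    # Set in the application relation data.
    GRANTED = "granted"
    IDLE = "idle"  # stands in for the None (no request / completed) state


# Two states live on the unit, two on the application.
UNIT_STATES = {ExampleLockState.ACQUIRE, ExampleLockState.RELEASE}
APP_STATES = {ExampleLockState.GRANTED, ExampleLockState.IDLE}
```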

class Lock

A class that keeps track of a single asynchronous lock.

Description

Warning: a Lock has permission to update relation data, which means that there are side effects to invoking the .acquire, .release and .grant methods. Running any one of them will trigger a RelationChanged event, once per transition from one internal status to another.

This class tracks state across the cloud by implementing a peer relation interface. There are two parts to the interface:

  1. The data on a unit's peer relation (defined in metadata.yaml). Each unit can update this data. The only meaningful values are "acquire" and "release", which represent a request to acquire the lock and a request to release the lock, respectively.

  2. The application data in the relation. This tracks whether the lock has been "granted", or has been released (and reverted to idle). There are two valid states: "granted" or None. If a lock is in the "granted" state, a unit should emit a RunWithLock event and then release the lock.

    If a lock is in the None state, this means that a unit has not yet requested the lock, or that the request has been completed.

In more detail, here is the relation structure:

relation.data:
    <unit n>:
        status: 'acquire|release'
    <application>:
        <unit n>: 'granted|None'

Note that this class makes no attempt to timestamp the locks, and therefore cannot distinguish multiple requests made in a row. If a unit re-requests a lock before being granted the lock, the lock will simply stay in the "acquire" state. If a unit wishes to clear its lock, it simply needs to call lock.release().
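
To make the two-part interface concrete, here is a toy model that keeps the same data layout in plain dictionaries instead of a peer relation. It is a sketch under stated assumptions rather than the library's code: SketchLock is a hypothetical class, and the exact conditions used by is_pending, is_held and release_requested are guesses based on the description above.

```python
from typing import Dict, Optional


class SketchLock:
    """Toy model of one unit's lock, backed by dicts instead of a relation."""

    def __init__(
        self,
        unit: str,
        unit_data: Dict[str, Dict[str, str]],
        app_data: Dict[str, Optional[str]],
    ) -> None:
        self.unit = unit
        self._unit_data = unit_data.setdefault(unit, {})  # <unit n>: status
        self._app_data = app_data  # <application>: <unit n>

    def acquire(self) -> None:
        self._unit_data["status"] = "acquire"  # unit asks for the lock

    def release(self) -> None:
        self._unit_data["status"] = "release"  # unit is done with the lock

    def grant(self) -> None:
        self._app_data[self.unit] = "granted"  # application-side grant

    def clear(self) -> None:
        self._app_data[self.unit] = None  # back to idle

    def is_pending(self) -> bool:
        waiting = self._unit_data.get("status") == "acquire"
        return waiting and self._app_data.get(self.unit) is None

    def is_held(self) -> bool:
        asked = self._unit_data.get("status") == "acquire"
        return asked and self._app_data.get(self.unit) == "granted"

    def release_requested(self) -> bool:
        return self._unit_data.get("status") == "release"


unit_data: Dict[str, Dict[str, str]] = {}
app_data: Dict[str, Optional[str]] = {}
lock = SketchLock("some-charm/0", unit_data, app_data)

lock.acquire()
lock.acquire()  # re-requesting before a grant leaves the state at "acquire"
pending_after_rerequest = lock.is_pending()

lock.grant()
held_after_grant = lock.is_held()

lock.release()
lock.clear()
```

Because nothing is timestamped, the second acquire() call above is indistinguishable from the first; the stored value is simply overwritten with the same string.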

Methods

Lock.__init__(self, manager, unit)

Lock.acquire(self)

Description

Request that a lock be acquired.

Lock.release(self)

Description

Request that a lock be released.

Lock.clear(self)

Description

Unset a lock.

Lock.grant(self)

Description

Grant a lock to a unit.

Lock.is_held(self)

Description

This unit holds the lock.

Lock.release_requested(self)

Description

A unit has reported that they are finished with the lock.

Lock.is_pending(self)

Description

Is this unit waiting for a lock?

class Locks

Description

Generator that returns a list of locks.

Methods

Locks.__init__(self, manager)

Locks.__iter__(self)

Description

Yields a lock for each unit we can find on the relation.

class RunWithLock

Description

Event to signal that this unit should run the callback.

class AcquireLock

Description

Signals that this unit wants to acquire a lock.

Methods

AcquireLock.__init__(self, handle, callback_override)

AcquireLock.snapshot(self)

Description

Snapshot of lock event.

AcquireLock.restore(self, snapshot)

Description

Restores lock event.
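
This page does not say what the snapshot contains. In the ops framework, an event's snapshot() returns plain serialisable data and restore() rebuilds the event from it, for example when an event is deferred. The sketch below illustrates that round trip with a stand-in class; SketchAcquireEvent, the "callback_override" key, the handle string, and the _custom_restart name are all hypothetical and are not taken from the library.

```python
from typing import Dict, Optional


class SketchAcquireEvent:
    """Stand-in for an event that carries an optional callback override.

    Hypothetical class: it does not inherit from the ops event base class,
    and the snapshot key name is an assumption.
    """

    def __init__(self, handle: str, callback_override: Optional[str] = None) -> None:
        self.handle = handle
        self.callback_override = callback_override or ""

    def snapshot(self) -> Dict[str, str]:
        # Return only plain, serialisable data so the event can be stored.
        return {"callback_override": self.callback_override}

    def restore(self, snapshot: Dict[str, str]) -> None:
        # Rebuild the event's state from a previously saved snapshot.
        self.callback_override = snapshot["callback_override"]


original = SketchAcquireEvent("restart_acquire_lock", callback_override="_custom_restart")
saved = original.snapshot()

# Later, a fresh event object is created and its state restored.
revived = SketchAcquireEvent("restart_acquire_lock")
revived.restore(saved)
```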

class ProcessLocks

Description

Used to tell the leader to process all locks.

class RollingOpsManager

Description

Emitters and handlers for rolling ops.

Methods

RollingOpsManager.__init__(self, charm: CharmBase, relation: AnyStr, callback: Callable)

Register our custom events.

Description

params:
    charm: the charm we are attaching this to.
    relation: an identifier, by convention based on the name of the relation in metadata.yaml, which distinguishes this instance of RollingOpsManager from other instances that may be handling other events.
    callback: a closure to run when we have a lock. (It must take a CharmBase object and an EventBase object as args.)