Rolling Ops Library and Example Charm

Publisher: Canonical

Platform: Ubuntu 22.04

Channel         Revision  Published    Runs on
latest/stable   5         26 Nov 2024  Ubuntu 22.04
latest/edge     46        23 Apr 2026  Ubuntu 22.04
juju deploy rolling-ops

charms.rolling_ops.v1.rollingops

Rolling Ops v1 — coordinated rolling operations for Juju charms.

This library provides a reusable mechanism for coordinating rolling operations across units of a Juju application using a peer-relation distributed lock.

The library guarantees that at most one unit executes a rolling operation at any time, while allowing multiple units to enqueue operations and participate in a coordinated rollout.

Data model (peer relation)
Unit databag

Each unit maintains a FIFO queue of operations it wishes to execute.

Keys:

  • operations: JSON-encoded list of queued Operation objects
  • state: "idle" | "request" | "retry-release" | "retry-hold"
  • executed_at: UTC timestamp string indicating when the current operation last ran

Each Operation contains:

  • callback_id: identifier of the callback to execute
  • kwargs: JSON-serializable arguments for the callback
  • requested_at: UTC timestamp when the operation was enqueued
  • max_retry (optional): maximum retry count. None means unlimited
  • attempt: current attempt number
Application databag

The application databag represents the global lock state.

Keys:

  • granted_unit: unit identifier (unit name), or empty
  • granted_at: UTC timestamp indicating when the lock was granted
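As an illustration of the data model above, the following sketch builds the string values that a unit databag and the application databag might hold. The key names come from the lists above. The concrete values, the unit name `myapp/0`, the initial `attempt` of 0, and the helper `utc_now` are assumptions for illustration only, and the library's exact encoding may differ.

```python
import json
from datetime import datetime, timezone


def utc_now() -> str:
    """Return the current UTC time as an ISO 8601 string (hypothetical helper)."""
    return datetime.now(timezone.utc).isoformat()


# One queued operation, using the fields listed above.
operation = {
    "callback_id": "restart",
    "kwargs": {"force": True},
    "requested_at": utc_now(),
    "max_retry": 3,  # None would mean unlimited retries
    "attempt": 0,    # assumed starting value
}

# Relation databag values are strings, so the queue is JSON-encoded.
unit_databag = {
    "operations": json.dumps([operation]),
    "state": "request",
    "executed_at": "",
}

# The application databag holds the global lock state.
app_databag = {
    "granted_unit": "myapp/0",
    "granted_at": utc_now(),
}

decoded = json.loads(unit_databag["operations"])
print(decoded[0]["callback_id"])  # restart
```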
Operation semantics
  • Units enqueue operations instead of overwriting a single pending request.
  • Duplicate operations (same callback_id and kwargs) are ignored if they are already the last queued operation.
  • When granted the lock, a unit executes exactly one operation (the queue head).
  • After execution, the lock is released so that other units may proceed.
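The enqueue and execution rules above can be sketched with a small in-memory queue. `SketchOperation` and `SketchQueue` are hypothetical stand-ins written for this example, not the library's `Operation` and `OperationQueue` classes. The sketch assumes that two operations are equal when their `callback_id` and `kwargs` match.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchOperation:
    """Hypothetical stand-in for an Operation; equality uses both fields."""
    callback_id: str
    kwargs: dict = field(default_factory=dict)


class SketchQueue:
    """Hypothetical FIFO queue illustrating the enqueue rules above."""

    def __init__(self) -> None:
        self.operations: List[SketchOperation] = []

    def enqueue(self, callback_id: str, kwargs: dict) -> bool:
        """Append unless the same operation is already last in the queue."""
        op = SketchOperation(callback_id, kwargs)
        if self.operations and self.operations[-1] == op:
            return False  # duplicate of the last queued operation: ignored
        self.operations.append(op)
        return True

    def run_head(self) -> Optional[SketchOperation]:
        """Remove and return exactly one operation: the queue head."""
        return self.operations.pop(0) if self.operations else None


queue = SketchQueue()
queue.enqueue("restart", {"force": True})
queue.enqueue("restart", {"force": True})  # ignored: equals the last entry
queue.enqueue("upgrade", {})
queue.enqueue("restart", {"force": True})  # accepted: last entry is "upgrade"
print([op.callback_id for op in queue.operations])
# ['restart', 'upgrade', 'restart']
```

Note that only a duplicate of the last queued entry is ignored; the same operation can appear again later in the queue.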
Retry semantics
  • If a callback returns OperationResult.RETRY_RELEASE, the unit releases the lock and retries the operation later.
  • If a callback returns OperationResult.RETRY_HOLD, the unit keeps the lock and retries immediately.
  • Retry state (attempt) is tracked per operation.
  • When max_retry is exceeded, the failing operation is dropped and the unit proceeds to the next queued operation, if any.
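The retry rules above can be summarized as a small decision function. This is a sketch under stated assumptions, not the library's code: `Result` is a hypothetical mirror of the `OperationResult` members named in this document (the string values are invented), and `attempt` is assumed to count the retries already used.

```python
from enum import Enum
from typing import Optional


class Result(Enum):
    """Hypothetical mirror of the OperationResult members named above."""
    RELEASE = "release"
    RETRY_RELEASE = "retry-release"
    RETRY_HOLD = "retry-hold"


def next_step(result: Result, attempt: int, max_retry: Optional[int]) -> str:
    """Describe what happens after a callback returns `result`."""
    if result is Result.RELEASE:
        return "pop operation, release lock"
    attempt += 1  # a retry was requested, so count it
    if max_retry is not None and attempt > max_retry:
        return "drop operation, proceed to next queued operation"
    if result is Result.RETRY_RELEASE:
        return "release lock, retry later"
    return "keep lock, retry immediately"


print(next_step(Result.RETRY_RELEASE, attempt=0, max_retry=3))
# release lock, retry later
print(next_step(Result.RETRY_HOLD, attempt=3, max_retry=3))
# drop operation, proceed to next queued operation
print(next_step(Result.RETRY_HOLD, attempt=10, max_retry=None))
# keep lock, retry immediately
```

With this reading, `max_retry=0` drops the operation on its first failure, and `max_retry=None` never drops it.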
Scheduling semantics
  • Only the leader schedules lock grants.
  • If a valid lock grant exists, no new unit is scheduled.
  • Requests are preferred over retries.
  • Among requests, the operation with the oldest requested_at timestamp is selected.
  • Among retries, the operation with the oldest executed_at timestamp is selected.
  • Stale grants (e.g., pointing to departed units) are automatically released.
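A minimal sketch of the leader's selection rules listed above. `UnitView` and `pick_next` are hypothetical names created for this example; the library exposes `pick_oldest_request` and `pick_oldest_completed`, whose internals are not shown here. The sketch assumes that all timestamps share the same ISO 8601 UTC format, so they sort correctly as strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UnitView:
    """Hypothetical snapshot of one unit's databag, for illustration only."""
    name: str
    state: str              # "idle" | "request" | "retry-release" | "retry-hold"
    requested_at: str = ""  # ISO 8601 UTC timestamp of the head operation
    executed_at: str = ""   # ISO 8601 UTC timestamp of the last run


def pick_next(units: List[UnitView], granted_unit: str) -> Optional[str]:
    """Return the unit name to grant next, or None if nothing should change."""
    names = {u.name for u in units}
    if granted_unit in names:
        return None  # a valid grant exists: schedule nobody new
    # A grant pointing at a departed unit is stale and treated as released.
    requests = [u for u in units if u.state == "request"]
    if requests:  # requests are preferred over retries
        return min(requests, key=lambda u: u.requested_at).name
    retries = [u for u in units if u.state == "retry-release"]
    if retries:
        return min(retries, key=lambda u: u.executed_at).name
    return None


units = [
    UnitView("app/0", "retry-release", executed_at="2024-01-01T10:00:00+00:00"),
    UnitView("app/1", "request", requested_at="2024-01-01T10:05:00+00:00"),
    UnitView("app/2", "request", requested_at="2024-01-01T10:02:00+00:00"),
]
print(pick_next(units, granted_unit=""))       # app/2
print(pick_next(units, granted_unit="app/9"))  # app/2 (stale grant)
print(pick_next(units, granted_unit="app/1"))  # None
```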

All timestamps are stored in UTC using ISO 8601 format.
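For reference, UTC timestamps in ISO 8601 format can be produced and compared with the Python standard library as follows. The helper names `utc_timestamp` and `parse_timestamp` are assumptions for this sketch and are not part of the library.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current time as a timezone-aware ISO 8601 UTC string."""
    return datetime.now(timezone.utc).isoformat()


def parse_timestamp(value: str) -> datetime:
    """Parse a string produced by utc_timestamp back into a datetime."""
    return datetime.fromisoformat(value)


older = datetime(2024, 11, 26, 9, 0, tzinfo=timezone.utc).isoformat()
newer = datetime(2024, 11, 26, 9, 5, tzinfo=timezone.utc).isoformat()
print(older)  # 2024-11-26T09:00:00+00:00
print(parse_timestamp(older) < parse_timestamp(newer))  # True
```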

Using the library in a charm
1. Declare a peer relation
peers:
  restart:
    interface: rolling_op

Import this library into src/charm.py, and initialize a RollingOpsManagerV1 in the Charm's __init__. The Charm should also define a callback routine, which will be executed when a unit holds the distributed lock:

src/charm.py

from ops.charm import CharmBase

from charms.rolling_ops.v1.rollingops import RollingOpsManagerV1, OperationResult

class SomeCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)

        self.rolling_ops = RollingOpsManagerV1(
            charm=self,
            relation_name="restart",
            callback_targets={
                "restart": self._restart,
                "failed_restart": self._failed_restart,
                "defer_restart": self._defer_restart,
            },
        )

    def _restart(self, force: bool) -> OperationResult:
        # perform restart logic
        return OperationResult.RELEASE

    def _failed_restart(self) -> OperationResult:
        # perform restart logic
        return OperationResult.RETRY_RELEASE

    def _defer_restart(self) -> OperationResult:
        if not self.some_condition():
            return OperationResult.RETRY_HOLD
        # do restart logic
        return OperationResult.RELEASE

Request a rolling operation


    def _on_restart_action(self, event) -> None:
        self.rolling_ops.request_async_lock(
            callback_id="restart",
            kwargs={"force": True},
            max_retry=3,
        )

All participating units must enqueue the operation in order to be included in the rolling execution.

Units that do not enqueue the operation will be skipped, allowing operators to recover from partial failures by reissuing requests selectively.

Do not include sensitive information in the kwargs of the callback. These values will be stored in the databag.

Make sure that callback_targets is not dynamic and that the mapping contains the expected values at the moment of the callback execution.
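The reason the mapping must be stable is that a queued operation stores only a callback identifier and its kwargs, so the callable has to be looked up again when the operation eventually runs. The following sketch illustrates that lookup. The function `dispatch` and the JSON shape of the stored operation are assumptions for illustration, not the library's internals.

```python
import json
from typing import Any, Callable, Dict


def dispatch(callback_targets: Dict[str, Callable[..., Any]], encoded_op: str) -> Any:
    """Look up a stored operation's callback by id and call it with its kwargs."""
    op = json.loads(encoded_op)
    callback_id = op["callback_id"]
    if callback_id not in callback_targets:
        # The mapping changed between enqueue time and execution time.
        raise LookupError(f"no callback registered for {callback_id!r}")
    return callback_targets[callback_id](**op["kwargs"])


def restart(force: bool) -> str:
    """Example callback; a real one would return an OperationResult."""
    return f"restarted (force={force})"


targets = {"restart": restart}
stored = json.dumps({"callback_id": "restart", "kwargs": {"force": True}})
print(dispatch(targets, stored))  # restarted (force=True)
```

If the mapping were built dynamically and lacked `"restart"` when the lock was granted, the lookup would fail even though the request had been accepted earlier.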


Index

class RollingOpsNoRelationError

Description

Raised if we try to process a lock but the peer relation does not exist yet.

class RollingOpsDecodingError

Description

Raised if the content of the databag cannot be processed.

class RollingOpsInvalidLockRequestError

Description

Raised if the lock request is invalid.

class Operation

Description

A single queued operation.

Methods

Operation. __post_init__( self )

Description

Validate the class attributes.

Operation. create( cls , callback_id: str , kwargs , max_retry )

Description

Create a new operation from a callback id and kwargs.

Operation. to_string( self )

Description

Serialize to a string suitable for a Juju databag.

Operation. increase_attempt( self )

Description

Increment the attempt counter.

Operation. is_max_retry_reached( self )

Description

Return True if attempt exceeds max_retry (unless max_retry is None).

Operation. from_string( cls , data: str )

Deserialize from a Juju databag string.

Operation. __eq__( self , other: object )

Description

Compare two operations for equality.

Operation. __hash__( self )

Description

Return a hash for the operation.

class OperationQueue

Description

In-memory FIFO queue of Operations with encode/decode helpers for storing in a databag.

Methods

OperationQueue. __init__( self , operations )

OperationQueue. __len__( self )

Description

Return the number of operations in the queue.

OperationQueue. empty( self )

Description

Return True if there are no queued operations.

OperationQueue. peek( self )

Description

Return the first operation in the queue if it exists.

OperationQueue. dequeue( self )

Description

Drop the first operation in the queue if it exists and return it.

OperationQueue. increase_attempt( self )

Description

Increment the attempt counter for the head operation and persist it.

OperationQueue. enqueue_lock_request( self , callback_id: str , kwargs , max_retry )

Description

Append the operation only if it is not equal to the last enqueued operation.

OperationQueue. to_string( self )

Description

Encode the entire queue to a single string.

OperationQueue. from_string( cls , data: str )

Decode queue from a string.
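To illustrate the encode/decode round trip, here is a minimal sketch that stores a queue as one JSON string and turns malformed input into a dedicated error. `SketchOp`, `SketchDecodingError`, `queue_to_string`, and `queue_from_string` are hypothetical names for this example. The real `OperationQueue` encoding and the conditions under which `RollingOpsDecodingError` is raised may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


class SketchDecodingError(Exception):
    """Hypothetical analogue of RollingOpsDecodingError."""


@dataclass
class SketchOp:
    """Hypothetical operation record with a subset of the documented fields."""
    callback_id: str
    kwargs: dict = field(default_factory=dict)
    max_retry: Optional[int] = None
    attempt: int = 0


def queue_to_string(ops: List[SketchOp]) -> str:
    """Encode the whole queue as a single JSON string."""
    return json.dumps([asdict(op) for op in ops])


def queue_from_string(data: str) -> List[SketchOp]:
    """Decode a queue; an empty string is treated as an empty queue."""
    if not data:
        return []
    try:
        return [SketchOp(**item) for item in json.loads(data)]
    except (json.JSONDecodeError, TypeError) as exc:
        raise SketchDecodingError(f"cannot decode queue: {exc}") from exc


ops = [SketchOp("restart", {"force": True}, max_retry=3)]
encoded = queue_to_string(ops)
print(queue_from_string(encoded) == ops)  # True
```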

class LockIntent

Description

Unit-level lock intents stored in unit databags.

class OperationResult

Description

Callback return values.

class Lock

State machine view over peer relation databags for a single unit.

Description

This class is the only component that should directly read/write the peer relation databags for lock state, queue state, and grant state.

Important:

  • All relation databag values are strings.
  • This class updates both unit databags and app databags, which triggers relation-changed events.

Methods

Lock. __init__( self , model: Model , relation_name: str , unit: Unit )

Lock. request( self , callback_id: str , kwargs , max_retry )

Enqueue an operation and mark this unit as requesting the lock.

Arguments

callback_id

identifies which callback to execute.

kwargs

dict of callback kwargs.

max_retry

None means unlimited retries; otherwise an explicit integer limit.

Lock. retry_release( self )

Description

Indicate that the operation should be retried but the lock should be released.

Lock. retry_hold( self )

Description

Indicate that the operation should be retried but the lock should be kept.

Lock. complete( self )

Mark the head operation as completed successfully and pop it from the queue.

Description

Update unit state depending on whether more operations remain.

Lock. release( self )

Description

Clear the application-level grant.

Lock. grant( self )

Description

Grant a lock to a unit.

Lock. is_granted( self )

Description

Return True if the unit holds the lock.

Lock. should_run( self )

Description

Return True if the lock has been granted to the unit and it is time to execute the callback.

Lock. should_release( self )

Description

Return True if the unit finished executing the callback and should be released.

Lock. is_waiting( self )

Description

Return True if this unit is waiting for a lock to be granted.

Lock. is_completed( self )

Description

Return True if this unit has completed its callback but still has the grant (leader should clear).

Lock. is_retry( self )

Description

Return True if this unit requested retry but still has the grant (leader should clear).

Lock. is_waiting_retry( self )

Description

Return True if the unit requested retry and is waiting for the lock to be granted.

Lock. is_retry_hold( self )

Description

Return True if the unit requested retry and wants to keep the lock.

Lock. get_current_operation( self )

Description

Return the head operation for this unit, if any.

Lock. get_last_completed( self )

Description

Get the time the unit requested a retry of the head operation.

Lock. get_requested_at( self )

Description

Get the time the head operation was requested at.

class LockIterator

Description

Iterator over Lock objects for each unit present on the peer relation.

Methods

LockIterator. __init__( self , model: Model , relation_name: str )

LockIterator. __iter__( self )

Description

Yield a lock for each unit found on the relation.

def pick_oldest_completed(locks)

Description

Choose the retry lock with the oldest executed_at timestamp.

def pick_oldest_request(locks)

Description

Choose the lock with the oldest head operation.

class RollingOpsLockGrantedEvent

Description

Custom event emitted when the background worker grants the lock.

class RollingOpsManagerV1

Description

Emitters and handlers for rolling ops.

Methods

RollingOpsManagerV1. __init__( self , charm: CharmBase , relation_name: str , callback_targets )

Register our custom events.

Arguments

charm

the charm we are attaching this to.

relation_name

the peer relation name from metadata.yaml.

callback_targets

mapping from callback_id -> callable.

RollingOpsManagerV1. request_async_lock( self , callback_id: str , kwargs , max_retry )

Enqueue a rolling operation and request the distributed lock.

Arguments

callback_id

Identifier for the callback to execute when this unit is granted the lock. Must be a non-empty string and must exist in the manager's callback registry.

kwargs

Keyword arguments to pass to the callback when executed. If omitted, an empty dict is used. Must be JSON-serializable because it is stored in Juju relation databags.

max_retry

Retry limit for this operation. None means unlimited retries. 0 means no retries (drop immediately on first failure). Must be >= 0 when provided.

Description

This method appends an operation (identified by callback_id and kwargs) to the calling unit's FIFO queue stored in the peer relation databag and marks the unit as requesting the lock. It does not execute the operation directly.
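The argument constraints described above can be expressed as a small validation routine. This is a sketch only: `validate_request` and `SketchInvalidRequest` are hypothetical names, and whether the library raises `RollingOpsInvalidLockRequestError` for each of these cases is an assumption.

```python
import json
from typing import Any, Dict, Iterable, Optional


class SketchInvalidRequest(ValueError):
    """Hypothetical analogue of RollingOpsInvalidLockRequestError."""


def validate_request(
    registry: Iterable[str],
    callback_id: str,
    kwargs: Optional[Dict[str, Any]] = None,
    max_retry: Optional[int] = None,
) -> Dict[str, Any]:
    """Check the documented constraints and return the normalized request."""
    if not isinstance(callback_id, str) or not callback_id:
        raise SketchInvalidRequest("callback_id must be a non-empty string")
    if callback_id not in registry:
        raise SketchInvalidRequest(f"unknown callback_id: {callback_id!r}")
    kwargs = {} if kwargs is None else kwargs  # omitted kwargs become {}
    try:
        json.dumps(kwargs)  # must be storable in a relation databag
    except (TypeError, ValueError) as exc:
        raise SketchInvalidRequest("kwargs must be JSON-serializable") from exc
    if max_retry is not None and max_retry < 0:
        raise SketchInvalidRequest("max_retry must be >= 0 when provided")
    return {"callback_id": callback_id, "kwargs": kwargs, "max_retry": max_retry}


registry = {"restart"}
print(validate_request(registry, "restart", {"force": True}, max_retry=3))
# {'callback_id': 'restart', 'kwargs': {'force': True}, 'max_retry': 3}
```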

class RollingOpsAsyncWorker

Description

Spawns and manages the external rolling-ops worker process.

Methods

RollingOpsAsyncWorker. __init__( self , charm: CharmBase , relation_name: str )

RollingOpsAsyncWorker. start( self )

Description

Start a new worker process.

RollingOpsAsyncWorker. stop( self )

Description

Stop the running worker process if it exists.

def main()

Description

Juju hook event dispatcher.