Kubeflow

This guide describes how to leverage NVIDIA GPU resources in your Charmed Kubeflow (CKF) deployment.

Requirements

  • A CKF deployment and access to the Kubeflow dashboard. See Get started for more details.
  • An NVIDIA GPU accessible from the Kubernetes cluster that CKF is deployed on. Depending on your deployment, refer to the GPU setup guide for your Kubernetes distribution. A quick way to verify GPU visibility is shown below.
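
To quickly verify that the cluster advertises the GPU to Kubernetes, list the allocatable nvidia.com/gpu resource per node. This is a minimal check, assuming you have kubectl access to the cluster CKF is deployed on:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"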

Spin up a Notebook on a GPU

Kubeflow Notebooks can use any GPU resource available in the Kubernetes cluster. This is configurable during the Notebook’s creation.

When creating a Notebook, under GPUs, select the number of GPUs and NVIDIA as the GPU vendor. The number of GPUs depends both on the cluster setup and your workload's demands.

If your Notebook uses a TensorFlow-based image with CUDA, use the following code to confirm the notebook has access to a GPU:

import tensorflow as tf
gpus = tf.config.list_physical_devices("GPU")
print(f"Congratz! The following GPUs are available to the notebook: {gpus}" if gpus else "There's no GPU available to the notebook")

If your cluster setup uses taints, see Leverage PodDefaults for more details.

Run Pipeline steps on a GPU

Kubeflow Pipelines steps can use GPU resources available in your Kubernetes cluster. You can enable this by adding the nvidia.com/gpu: 1 limit to a step during the Pipeline’s definition. See the detailed steps below.

A GPU can be used by only one Pod at a time. Thus, a Pipeline can schedule Pods on a GPU only when one is available. For advanced GPU sharing practices on Kubernetes, see NVIDIA Multi-Instance GPU.

  1. Open a notebook with your Pipeline. If you don’t have one, use the following code as an example. It creates a Pipeline with a single component that checks GPU access:
# Import required objects
from kfp import dsl

@dsl.component(base_image="kubeflownotebookswg/jupyter-tensorflow-cuda:v1.9.0")
def gpu_check() -> str:
    """Get the list of GPUs and print it. If empty, raise a RuntimeError."""
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    print("GPU list:", gpus)
    if not gpus:
        raise RuntimeError("No GPU has been detected.")
    return str(len(gpus) > 0)

@dsl.pipeline
def gpu_check_pipeline() -> str:
    """Create a pipeline that runs code to check access to a GPU."""
    gpu_check_object = gpu_check()
    return gpu_check_object.output

Make sure the KFP SDK is installed in the Notebook’s environment:

!pip install "kfp>=2.4,<3.0"
  2. Ensure the step of the gpu_check component runs on a GPU by creating a function add_gpu_request(task) that uses the SDK’s add_node_selector_constraint() and set_accelerator_limit() methods. This sets the required limit on the step’s Pod:
def add_gpu_request(task: dsl.PipelineTask) -> dsl.PipelineTask:
    """Add a request field for a GPU to the container created by the PipelineTask object."""
    return task.add_node_selector_constraint(accelerator="nvidia.com/gpu").set_accelerator_limit(
        limit=1
    )
  3. Modify the Pipeline definition by calling add_gpu_request() on the component:
@dsl.pipeline
def gpu_check_pipeline() -> str:
    """Create a pipeline that runs code to check access to a GPU."""
    gpu_check_object = add_gpu_request(gpu_check())
    return gpu_check_object.output
  4. Submit and run the Pipeline:
# Submit the pipeline for execution
from kfp.client import Client
client = Client()
run = client.create_run_from_pipeline_func(
    gpu_check_pipeline,
    experiment_name="Check access to GPU",
    enable_caching=False,
)
  5. Navigate to the output Run’s details. In its logs, you can see the GPU devices available to the step.
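
To check the result from the notebook instead of the dashboard, you can also wait for the run to finish using the same KFP client. This is a minimal sketch reusing the client and run objects from the previous step; the timeout value is only an example:

# Block until the run finishes (or the timeout expires), then print its final state
finished_run = client.wait_for_run_completion(run.run_id, timeout=3600)
print(finished_run.state)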

Inference with a KServe ISVC on a GPU

KServe inference services (ISVC) can schedule their Pods on a GPU. To ensure the ISVC Pod is using a GPU, add the nvidia.com/gpu: 1 limit to the ISVC’s definition.

You can do so by using the kubectl Command Line Interface (CLI) or within a notebook.

Using kubectl CLI

Using the kubectl CLI, you can enable GPU usage in your InferenceService Pod by directly modifying its YAML configuration. For example, the InferenceService YAML file from this example would be modified as follows:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        limits:
          nvidia.com/gpu: 1
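
After saving the modified manifest, apply it with kubectl to the namespace where the ISVC should run. The file name below is only an example for this sketch:

kubectl apply -f sklearn-iris-gpu.yaml -n <your-namespace>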

Within a notebook

A GPU can be used by only one Pod at a time. Thus, an ISVC Pod can be scheduled on a GPU only when one is available. For advanced GPU sharing practices on Kubernetes, see NVIDIA Multi-Instance GPU.

  1. Open a notebook with your InferenceService. If you don’t have one, use this one as an example.

Make sure the KServe SDK is installed in the Notebook’s environment:

!pip install kserve
  2. Import V1ResourceRequirements from the kubernetes.client package and add a resources field to the workload you want to run on a GPU. See the example below for reference:
# Imports needed for the InferenceService definition
from kubernetes.client import V1ObjectMeta, V1ResourceRequirements
from kserve import (
    constants,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

ISVC_NAME = "sklearn-iris"
isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=V1ObjectMeta(
        name=ISVC_NAME,
        annotations={"sidecar.istio.io/inject": "false"},
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                resources=V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
            )
        )
    ),
)
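
To create the InferenceService defined above from the same notebook, you can use the KServe client. This is a minimal sketch, assuming the ISVC should run in the notebook’s namespace:

from kserve import KServeClient

# Create the ISVC and wait until it reports Ready before sending inference requests
kserve_client = KServeClient()
kserve_client.create(isvc)
kserve_client.wait_isvc_ready(ISVC_NAME)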