Kubeflow

  • By Kubeflow Charmers | bundle
  • Cloud
Channel           Revision  Published
latest/stable     414       01 Dec 2023
latest/candidate  294       24 Jan 2022
latest/beta       430       30 Aug 2024
latest/edge       423       26 Jul 2024
1.9/stable        426       31 Jul 2024
1.9/beta          420       19 Jul 2024
1.9/edge          425       31 Jul 2024
1.8/stable        414       22 Nov 2023
1.8/beta          411       22 Nov 2023
1.8/edge          413       22 Nov 2023
1.7/stable        409       27 Oct 2023
1.7/beta          408       27 Oct 2023
1.7/edge          407       27 Oct 2023
1.6/stable        329       07 Sep 2022
1.6/beta          326       23 Aug 2022
1.6/edge          328       07 Sep 2022
1.4/stable        321       30 Jun 2022
1.4/edge          320       30 Jun 2022
juju deploy kubeflow --channel edge

Platform:

NVIDIA DGX systems are purpose-built hardware for enterprise AI use cases. These platforms feature NVIDIA Tensor Core GPUs, which vastly outperform traditional CPUs for machine learning workloads, alongside advanced networking and storage capabilities.

This guide contains setup instructions for running Charmed Kubeflow on NVIDIA DGX-enabled hardware. It covers both single-node and multi-node environments and includes examples of how to use two components: Jupyter Notebooks and Kubeflow Pipelines.

Requirements:

  • NVIDIA DGX-enabled hardware setup with correctly configured and updated BIOS settings, bootloader, OS, drivers, and packages (sample setup instructions are provided below).
  • Familiarity with Python, Docker, and Jupyter Notebooks.
  • Tools: juju and kubectl (see the note below this list for one way to install them).
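
If the client tools are not installed yet, they are commonly available as snaps. The commands below are only one way to get them; channels and confinement flags may differ on your system:

$ sudo snap install juju
$ sudo snap install kubectl --classic
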
Sample Ubuntu and Grub setup

NOTE: The following setup instructions are given only as an example. There is no guarantee that they will be sufficient for all environments. Contact your hardware distributor for more details on your specific system setup. This guide was tested on a vanilla Ubuntu 20.04 installation.

Ensure No Drivers Preinstalled

Make sure no NVIDIA drivers are preinstalled. You can verify this with the following steps:

Check for apt packages:

$ sudo apt list --installed | grep nvidia

If any packages are listed, remove them:

$ sudo apt remove <package-name>
$ sudo apt autoremove

Check for kernel modules (if the output is empty, you are OK):

$ lsmod | grep nvidia

If any modules are listed, remove them:

$ sudo modprobe -r <module-name>

Reboot the machine:

$ sudo reboot

Grub Setup

Edit /etc/default/grub and add the following options to GRUB_CMDLINE_LINUX_DEFAULT:

modprobe.blacklist=nouveau nouveau.modeset=0
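
For reference, the edited line typically ends up looking like the one below (any options already present, such as quiet splash, stay in place). On Ubuntu the change only takes effect after regenerating the grub configuration, so running update-grub before the reboot is assumed here:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=nouveau nouveau.modeset=0"
$ sudo update-grub
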
$ sudo reboot


Install Kubernetes (MicroK8s)

Install MicroK8s and enable the required add-ons. The DNS forwarder and MetalLB address ranges below are specific to the example environment, so adjust them for yours; the commands also assume a user named ubuntu:

$ sudo snap install microk8s --classic --channel 1.22
 
$ sudo microk8s enable dns:10.229.32.21 storage ingress registry rbac helm3 metallb:10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111
 
$ sudo usermod -a -G microk8s ubuntu
$ sudo chown -f -R ubuntu ~/.kube
$ newgrp microk8s
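
Before moving on, it can help to confirm that the node and the enabled add-ons are up; the status command below simply blocks until MicroK8s reports itself ready:

$ microk8s status --wait-ready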

Edit /var/snap/microk8s/current/args/containerd-template.toml and add the following, replacing the username and password with your own Docker Hub credentials:

[plugins."io.containerd.grpc.v1.cri".registry.configs]

[plugins."io.containerd.grpc.v1.cri".registry.configs."registry-1.docker.io".auth]
username = "afrikha"
password = "<>"
$ microk8s.stop; microk8s.start

Enable GPU add-on and configure MIG

Install the GPU operator and export the cluster configuration for kubectl:

$ sudo microk8s.enable gpu
$ mkdir -p ~/.kube
$ microk8s config > ~/.kube/config
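
At this point kubectl should be able to reach the cluster, and you can watch the GPU operator pods come up. The namespace name below is an assumption based on what the MicroK8s gpu add-on has used and may differ between releases:

$ kubectl get nodes
$ kubectl get pods -n gpu-operator-resources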

Check the GPU count reported to Kubernetes:

$ kubectl get nodes --show-labels | grep gpu.count

Configure MIG devices (replace blanka with your node's name):

$ kubectl label nodes blanka nvidia.com/mig.config=all-1g.5gb --overwrite
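
The MIG manager applies the new layout asynchronously. One way to follow its progress is the state label it sets on the node; the label name is assumed from the NVIDIA GPU operator's MIG manager:

$ kubectl get nodes --show-labels | grep mig.config.state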

Recheck the GPU count (it should have increased, since each MIG slice is now reported as a separate GPU):

$ kubectl get nodes --show-labels | grep gpu.count
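
On the DGX host itself, the MIG devices can also be listed directly, assuming the NVIDIA driver set up by the GPU operator is active on the host:

$ nvidia-smi -L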

Troubleshooting: If no nodes appear in the kubectl get nodes output, uninstall all GPU drivers from the Kubernetes nodes and reinstall MicroK8s.

Deploy Charmed Kubeflow

Follow the instructions from How to install Charmed Kubeflow to deploy Charmed Kubeflow.
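
As a rough sketch, on a MicroK8s cloud that guide boils down to the steps below; the channel is only an example, and the linked guide remains the authoritative reference:

$ juju bootstrap microk8s
$ juju add-model kubeflow
$ juju deploy kubeflow --channel=1.8/stable --trust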

Try Kubeflow examples

Charmed Kubeflow can run on both single-node and multi-node DGX hardware. Each environment has its own requirements, and several examples are available for each.

Single-node DGX with Charmed Kubeflow examples

There is a GitHub repository that includes all the details about the Single-node DGX with Charmed Kubeflow.

The following examples can be found and tested:

  • Jupyter Notebook example on a single-node DGX, in the file gpu-notebook.ipynb from the repository. It also uses a multi-GPU setup.
  • Kubeflow Pipelines example on a single-node DGX that uses the same classifier as the notebook, available in the file gpu-pipeline.ipynb.
Multi-node DGX with Charmed Kubeflow examples

There is a GitHub repository that includes all the details about the Multi-node DGX with Charmed Kubeflow.

The following examples can be found and tested:

  • Training TensorFlow models with multiple GPUs in a Jupyter Notebook using Charmed Kubeflow, in the folder multi-gpu-in-notebook, which contains the notebook file gpu-notebook.ipynb.
  • Training TensorFlow models with GPUs in a Kubeflow Pipeline, in the folder multi-gpu-in-pipeline.
  • A simulated example of multi-node training in TensorFlow that uses just a single node, in the folder multi-node-gpu-simulated. It contains multiple files describing the workload distribution and how to run it.
  • Multi-node training in TensorFlow using the Kubeflow Training Operator's TFJob, in the folder multi-node-gpu-tfjob.
