
Serve a BERT model using NVIDIA Triton Inference Server.

Prerequisites

An active Charmed Kubeflow deployment. For installation instructions, follow the Get started tutorial.


Refresh the knative-serving charm

Upgrade the knative-serving charm to the latest/edge channel:

juju refresh knative-serving --channel=latest/edge

Wait until the charm is in active status. You can watch the status with:

juju status --watch 5s

Create a Notebook

Create a Kubeflow Jupyter Notebook. The Notebook will be your workspace for running the commands in this guide. These commands require in-cluster communication, so they won’t work outside of the Notebook environment.

The Notebook image can be any of the available ones, since you will only be using the CLI; you can leave it as the default.

See more: Explore components | Create a Kubeflow Notebook

Connect to the Notebook and start a new terminal from the Launcher.

Use this terminal session to run the commands in the next sections.

Create the InferenceService

Define a new InferenceService YAML for the BERT model with the following content:

cat <<EOF > "./isvc.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
  annotations:
    "sidecar.istio.io/inject": "false"
spec:
  transformer:
    containers:
      - name: kserve-container      
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    triton:
      runtimeVersion: 20.10-py3
      resources:
        limits:
          cpu: "1"
          memory: 8Gi
        requests:
          cpu: "1"
          memory: 8Gi
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF

Disable Istio sidecar

In the ISVC YAML, make sure to add the annotation "sidecar.istio.io/inject": "false", as done in the example above.

Due to issue GH 216, you will not be able to reach the ISVC without disabling Istio sidecar injection.
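
Once the ISVC is applied later in this guide, you can optionally confirm that no sidecar was injected by listing the containers of its pods. This is a sketch; it assumes KServe applies the serving.kserve.io/inferenceservice label to the ISVC pods and that your Notebook's ServiceAccount can list pods in the namespace:

# List the container names of the bert-v2 pods; "istio-proxy" should not appear
kubectl get pods -n <namespace> -l serving.kserve.io/inferenceservice=bert-v2 \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'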

GPU Scheduling

To run on a GPU, specify the GPU resources in the ISVC YAML. For example, to run the predictor on an NVIDIA GPU:

cat <<EOF > "./isvc-gpu.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
spec:
  transformer:
    containers:
      - name: kserve-container      
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    triton:
      runtimeVersion: 20.10-py3
      resources:      # specify GPU limits and vendor
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF

See more: Kubernetes | Schedule GPUs
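
Before scheduling on GPU, you can check whether your nodes actually advertise NVIDIA GPU capacity. This assumes the NVIDIA GPU device plugin is already installed on the cluster:

kubectl describe nodes | grep -i "nvidia.com/gpu"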

Modify the ISVC YAML to set the node selector, node affinity, or tolerations to match your GPU node.

For example, here is an ISVC YAML with node scheduling attributes:
cat <<EOF > "./isvc.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
spec:
  transformer:
    containers:
      - name: kserve-container      
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    nodeSelector:
      myLabel1: "true"
    tolerations:
      - key: "myTaint1"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    triton:
      runtimeVersion: 20.10-py3
      resources:      # specify GPU limits and vendor
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF

This example sets nodeSelector and tolerations for the predictor. Similarly, you can set affinity, as sketched below.
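
A minimal sketch of node affinity on the predictor follows; the label key and value (gpu-type: "a100") are hypothetical, so replace them with a label that actually exists on your GPU node:

  predictor:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gpu-type
                  operator: In
                  values:
                    - "a100"

The rest of the predictor section (triton, resources, storageUri) stays the same as in the previous examples.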

Apply the ISVC to your namespace with kubectl

kubectl apply -f ./isvc.yaml -n <namespace>

Since we are using the CLI from inside a Notebook, kubectl is using the ServiceAccount credentials of the Notebook pod.
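
If a command in this section is rejected with a permissions error, you can check what the Notebook's ServiceAccount is allowed to do. For example, assuming the KServe CRDs are installed under the serving.kserve.io API group:

kubectl auth can-i create inferenceservices.serving.kserve.io -n <namespace>
kubectl auth can-i get inferenceservices.serving.kserve.io -n <namespace>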

Wait until the InferenceService is in Ready state. It can take a few minutes to become Ready because the large Triton image has to be pulled. You can check the state with:

kubectl get inferenceservice bert-v2 -n <namespace>

You should see output similar to this:

NAME      URL                                           READY   AGE
bert-v2   http://bert-v2.default.10.64.140.43.nip.io   True    71s
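
Alternatively, instead of polling, you can block until the ISVC reports Ready. This is a sketch using kubectl wait on the standard Ready condition of the InferenceService:

kubectl wait --for=condition=Ready inferenceservice/bert-v2 -n <namespace> --timeout=600s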

Perform inference

Get the ISVC’s status.address.url

URL=$(kubectl get inferenceservice bert-v2 -n <namespace> -o jsonpath='{.status.address.url}')
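
You can confirm the variable was populated before making requests; if it prints empty, the ISVC is likely not Ready yet:

echo "${URL}"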

Make a request to the ISVC’s URL

  • Prepare the inference input:
cat <<EOF > "./input.json"
{
  "instances": [
    "What President is credited with the original notion of putting Americans in space?"
  ]
}
EOF
  • Make a prediction request:
curl -v -H "Content-Type: application/json" ${URL}/v1/models/bert-v2:predict -d @./input.json

The response will contain the prediction output:

{"predictions": "John F. Kennedy", "prob": 77.91851169430718}
