Kubeflow

Kubeflow Charmers | bundle
Cloud

Channel	Revision	Published
latest/candidate	294	24 Jan 2022
latest/beta	430	30 Aug 2024
latest/edge	423	26 Jul 2024
1.10/stable	436	07 Apr 2025
1.10/candidate	434	02 Apr 2025
1.10/beta	433	24 Mar 2025
1.9/stable	432	03 Dec 2024
1.9/beta	420	19 Jul 2024
1.9/edge	431	03 Dec 2024
1.8/stable	414	22 Nov 2023
1.8/beta	411	22 Nov 2023
1.8/edge	413	22 Nov 2023
1.7/stable	409	27 Oct 2023
1.7/beta	408	27 Oct 2023
1.7/edge	407	27 Oct 2023

Learn to deploy on juju >

Platform:

Relevant links

Homepage

Share your thoughts on this charm with the community on discourse.

Join the discussion

This guide describes how to serve a BERT model using NVIDIA Triton Inference Server.

Refresh the knative-serving charm

Upgrade the knative-serving charm to channel latest/edge:

juju refresh knative-serving --channel=latest/edge

Wait until the charm is in active status, you can check its status with:

juju status --watch 5s

Create a notebook

Create a Kubeflow Notebook to be used as your workspace. Leave the default notebook image, since you will only use the Command Line Interface (CLI) for running commands.

Running commands in this guide requires in-cluster communication, meaning instructions only work within the notebook environment.

Connect to the notebook, and start a new terminal from the launcher:

Use this terminal session to run the commands in the following sections.

Create the Inference Service

Define a new Inference Service YAML file for the BERT model as follows:

cat <<EOF > "./isvc.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
  annotations:
    "sidecar.istio.io/inject": "false"
spec:
  transformer:
    containers:
      - name: kserve-container      
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    triton:
      runtimeVersion: 20.10-py3
      resources:
        limits:
          cpu: "1"
          memory: 8Gi
        requests:
          cpu: "1"
          memory: 8Gi
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF

In the ISVC yaml, make sure to add the annotation "sidecar.istio.io/inject": "false". Due to issue GH 216, you will not be able to reach the ISVC without disabling istio sidecar injection.

Schedule GPUs

For running on GPUs, specify the GPU resources in the ISVC YAML file. For example, you can run the predictor on NVIDIA GPUs as follows:

cat <<EOF > "./isvc-gpu.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
spec:
  transformer:
    containers:
      - name: kserve-container      
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    triton:
      runtimeVersion: 20.10-py3
      resources:      # specifiy gpu limits and vendor
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF

See Schedule GPUs for more details.

Now you need to modify the ISVC YAML file to set the node selector, node affinity, or tolerations in the ISVC to match your GPU node.

For instance, this is an ISVC YAML file with node scheduling attributes:

cat <<EOF > "./isvc.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
spec:
  transformer:
    containers:
      - name: kserve-container      
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    nodeSelector:
      myLabel1: "true"
    tolerations:
      - key: "myTaint1"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    triton:
      runtimeVersion: 20.10-py3
      resources:      # specifiy gpu limits and vendor
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF

This example sets nodeSelector and tolerations for the predictor. Similarly, you can set the affinity.

Now apply the ISVC to your namespace with kubectl:

kubectl apply -f ./isvc.yaml -n <namespace>

Since you are using the CLI within a notebook, kubectl is using the Service Account credentials of the notebook pod.

Wait until the Inference Service is in Ready state. It can take up to few minutes. Check its status with:

kubectl get inferenceservice bert-v2 -n <namespace>

You should see an output similar to this:

NAME      URL                                           READY   AGE
bert-v2   http://bert-v2.default.10.64.140.43.nip.io   True    71s

Perform inference

Get ISVC status.address.url:

URL=$(kubectl get inferenceservice bert-v2 -n <namespace> -o jsonpath='{.status.address.url}')

Make a request to this URL:

Prepare the inference input:

cat <<EOF > "./input.json"
{
  "instances": [
    "What President is credited with the original notion of putting Americans in space?"
  ]
}
EOF

Make a prediction request:

curl -v -H "Content-Type: application/json" ${URL}/v1/models/bert-v2:predict -d @./input.json

The response contains the prediction output:

{"predictions": "John F. Kennedy", "prob": 77.91851169430718}

Help improve this document in the forum (guidelines). Last updated 4 months ago.