Charmed Spark K8s

  • Canonical | bundle
Channel Revision Published
latest/edge 4 06 Aug 2024
3.4/edge 4 06 Aug 2024
juju deploy spark-k8s-bundle --channel edge

Streaming workloads with Charmed Apache Spark

Apache Spark comes with built-in support for streaming workloads via Apache Spark Streaming. Charmed Apache Spark takes it a step further by making it easy to integrate with Apache Kafka using Juju. Apache Kafka is a distributed event store with a producer/consumer API, designed to achieve massive throughput, with clustering for horizontal scalability and high availability. For more information about Apache Kafka, please refer to the Apache Kafka project page. For more information about Spark Streaming, please refer to the Apache Spark project documentation.

In this section, we are going to generate some streaming data, push it to Apache Kafka, and then consume the stream of data using Apache Spark, computing an aggregation over it. We are going to use Juju to deploy an Apache Kafka cluster as well as a simple test application which will generate and push events to Apache Kafka. We will then show how to set up a Spark job to continuously consume those events from Apache Kafka and calculate some statistics.

First of all, let’s start by creating a fresh Juju model to be used as an experimental workspace for this project:

juju add-model spark-streaming

When you add a Juju model, a Kubernetes namespace of the same name is created automatically. You can verify that by running kubectl get namespaces; you should see a namespace called spark-streaming.

The service account spark that we created in the earlier section is in the spark namespace. Let’s create a similar service account but now in the spark-streaming namespace. We can copy the existing config options from the old service account into the new service account.

# Get config from old service account and store it in a file
spark-client.service-account-registry get-config \
  --username spark --namespace spark > properties.conf

# Create a new service account and load configurations from the file
spark-client.service-account-registry create \
  --username spark --namespace spark-streaming \
  --properties-file properties.conf

Now, let’s create a minimal Apache Kafka and Apache ZooKeeper setup. This can be done quickly and easily using the zookeeper-k8s and kafka-k8s charms. Although this setup is not highly available, using single instances for both should be enough to understand the underlying concepts.

# Deploy Apache ZooKeeper
juju deploy zookeeper-k8s --series=jammy --channel=edge

# Deploy Apache Kafka
juju deploy kafka-k8s --series=jammy --channel=edge

Once installed, let’s see the current status of the Juju model with the following command:

juju status --watch 1s

This command periodically refreshes the status of the Juju model and then presents a report in the terminal. Once all the charms have been deployed (you may need to wait some time), output similar to the following should appear:

Model            Controller      Cloud/Region        Version  SLA          Timestamp
spark-streaming  spark-tutorial  microk8s/localhost  3.1.7    unsupported  10:09:40Z

App            Version  Status   Scale  Charm          Channel  Rev  Address         Exposed  Message
kafka-k8s               waiting      1  kafka-k8s      3/edge    47  10.152.183.242  no       installing agent
zookeeper-k8s           active       1  zookeeper-k8s  3/edge    42  10.152.183.87   no       

Unit              Workload  Agent  Address      Ports  Message
kafka-k8s/0*      blocked   idle   10.1.29.184         missing required zookeeper relation
zookeeper-k8s/0*  active    idle   10.1.29.182         

The kafka-k8s/0 unit is blocked because we have not integrated Apache Kafka with Apache ZooKeeper yet. We can do that using:

juju integrate kafka-k8s zookeeper-k8s

After running the integration command, juju status should have an output similar to the following:

Model            Controller      Cloud/Region        Version  SLA          Timestamp
spark-streaming  spark-tutorial  microk8s/localhost  3.1.7    unsupported  10:13:55Z

App            Version  Status  Scale  Charm          Channel  Rev  Address         Exposed  Message
kafka-k8s               active      1  kafka-k8s      3/edge    47  10.152.183.242  no       
zookeeper-k8s           active      1  zookeeper-k8s  3/edge    42  10.152.183.87   no       

Unit              Workload  Agent  Address      Ports  Message
kafka-k8s/0*      active    idle   10.1.29.184         
zookeeper-k8s/0*  active    idle   10.1.29.182 

As you can see, both the Apache Kafka and Apache ZooKeeper charms are in “active” status. However, it can take some time before the applications and the “units” that compose them transition to the active/idle state.
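The active/idle check can also be done programmatically rather than by eye. The following is a minimal sketch, assuming that the document produced by juju status --format=json nests units under applications.<app>.units and reports their state in workload-status.current and juju-status.current. The helper name all_units_settled and the two sample dictionaries are hypothetical and only mirror the status tables shown above.

```python
def all_units_settled(status: dict) -> bool:
    """Return True when every unit reports workload 'active' and agent 'idle'."""
    for app in status.get("applications", {}).values():
        for unit in app.get("units", {}).values():
            workload = unit.get("workload-status", {}).get("current")
            agent = unit.get("juju-status", {}).get("current")
            if workload != "active" or agent != "idle":
                return False
    return True


def make_unit(workload: str, agent: str) -> dict:
    """Build a trimmed-down unit entry (assumed layout, for illustration only)."""
    return {
        "workload-status": {"current": workload},
        "juju-status": {"current": agent},
    }


# Before integrating Apache Kafka with Apache ZooKeeper: kafka-k8s/0 is blocked.
before_integration = {
    "applications": {
        "kafka-k8s": {"units": {"kafka-k8s/0": make_unit("blocked", "idle")}},
        "zookeeper-k8s": {"units": {"zookeeper-k8s/0": make_unit("active", "idle")}},
    }
}

# After the integration has settled: both units are active/idle.
after_integration = {
    "applications": {
        "kafka-k8s": {"units": {"kafka-k8s/0": make_unit("active", "idle")}},
        "zookeeper-k8s": {"units": {"zookeeper-k8s/0": make_unit("active", "idle")}},
    }
}

print(all_units_settled(before_integration))
print(all_units_settled(after_integration))
```

In practice the dictionary would come from parsing the command output with json.loads, and the check would be repeated until it succeeds.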

For us to experiment with the streaming feature, we need some sample streaming data to be generated in Apache Kafka continuously in real-time. For that, we can use the kafka-test-app charm to produce events.

Let’s deploy this charm with 3 units, and integrate it with kafka-k8s so that it is able to write messages to Apache Kafka.

juju deploy kafka-test-app -n 3 --series=jammy --channel=edge --config role=producer --config topic_name=spark-streaming-store --config num_messages=100000

juju integrate kafka-test-app kafka-k8s

Once the integration is complete, juju status should display something similar to:

Model            Controller      Cloud/Region        Version  SLA          Timestamp
spark-streaming  spark-tutorial  microk8s/localhost  3.1.7    unsupported  10:17:32Z

App             Version  Status  Scale  Charm           Channel  Rev  Address         Exposed  Message
kafka-k8s                active      1  kafka-k8s       3/edge    47  10.152.183.242  no       
kafka-test-app           active      3  kafka-test-app  edge       8  10.152.183.167  no       Topic spark-streaming-store enabled with process producer
zookeeper-k8s            active      1  zookeeper-k8s   3/edge    42  10.152.183.87   no       

Unit               Workload  Agent  Address      Ports  Message
kafka-k8s/0*       active    idle   10.1.29.184         
kafka-test-app/0   active    idle   10.1.29.185         Topic spark-streaming-store enabled with process producer
kafka-test-app/1   active    idle   10.1.29.186         Topic spark-streaming-store enabled with process producer
kafka-test-app/2*  active    idle   10.1.29.187         Topic spark-streaming-store enabled with process producer
zookeeper-k8s/0*   active    idle   10.1.29.182  

Now messages will be generated and written to Apache Kafka periodically by kafka-test-app. However, to establish a connection and actually consume these messages, Apache Spark needs to authenticate with Apache Kafka using a set of credentials. To retrieve these credentials, we are going to use the data-integrator charm. Let’s deploy data-integrator and integrate it with kafka-k8s with the following commands:

juju deploy data-integrator --series=jammy --channel=edge --config extra-user-roles=consumer,admin --config topic-name=spark-streaming-store

juju integrate data-integrator kafka-k8s 

Once this integration is complete, juju status should display something similar to:

Model            Controller      Cloud/Region        Version  SLA          Timestamp
spark-streaming  spark-tutorial  microk8s/localhost  3.1.7    unsupported  10:22:14Z

App              Version  Status  Scale  Charm            Channel  Rev  Address         Exposed  Message
data-integrator           active      1  data-integrator  edge      15  10.152.183.18   no       
kafka-k8s                 active      1  kafka-k8s        3/edge    47  10.152.183.242  no       
kafka-test-app            active      3  kafka-test-app   edge       8  10.152.183.167  no       Topic spark-streaming-store enabled with process producer
zookeeper-k8s             active      1  zookeeper-k8s    3/edge    42  10.152.183.87   no       

Unit                Workload  Agent  Address      Ports  Message
data-integrator/0*  active    idle   10.1.29.189         
kafka-k8s/0*        active    idle   10.1.29.184         
kafka-test-app/0    active    idle   10.1.29.185         Topic spark-streaming-store enabled with process producer
kafka-test-app/1    active    idle   10.1.29.186         Topic spark-streaming-store enabled with process producer
kafka-test-app/2*   active    idle   10.1.29.187         Topic spark-streaming-store enabled with process producer
zookeeper-k8s/0*    active    idle   10.1.29.182 

Now that data-integrator is deployed and integrated with kafka-k8s, we can get the credentials to connect to Apache Kafka by running the get-credentials action exposed by the data-integrator charm:

juju run data-integrator/0 get-credentials

The output generated by this command is similar to the following:

Running operation 1 with 1 task
  - task 2 on unit-data-integrator-0

Waiting for task 2...
kafka:
  consumer-group-prefix: relation-9-
  data: '{"extra-user-roles": "consumer,admin", "requested-secrets": "[\"username\",
    \"password\", \"tls\", \"tls-ca\", \"uris\"]", "topic": "spark-streaming-store"}'
  endpoints: kafka-k8s-0.kafka-k8s-endpoints:9092
  password: g6c5gjg48IjTFld664ipkz8Khqb5FOG0
  tls: disabled
  topic: spark-streaming-store
  username: relation-9
  zookeeper-uris: zookeeper-k8s-0.zookeeper-k8s-endpoints:2181/kafka-k8s
ok: "True"

As you can see, the endpoint, username and password needed to authenticate with Apache Kafka are displayed next to “endpoints”, “username” and “password” respectively. It helps to store these values as variables so that they can be used later in the tutorial. To do that, we can specify --format=json when running the get-credentials action, and then extract each value using jq.

KAFKA_USERNAME=$(juju run data-integrator/0 get-credentials --format=json | jq -r '.["data-integrator/0"].results.kafka.username')
KAFKA_PASSWORD=$(juju run data-integrator/0 get-credentials --format=json | jq -r '.["data-integrator/0"].results.kafka.password')
KAFKA_ENDPOINT=$(juju run data-integrator/0 get-credentials --format=json | jq -r '.["data-integrator/0"].results.kafka.endpoints')
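The same extraction can be done in a single pass over one copy of the action output. The following is a minimal Python sketch, assuming the JSON layout implied by the jq paths above (unit name, then results, then kafka). The function name extract_kafka_credentials is hypothetical, and the sample payload simply reuses the values from the example output shown earlier instead of calling juju.

```python
import json


def extract_kafka_credentials(action_output: str,
                              unit: str = "data-integrator/0") -> dict:
    """Pull username, password and endpoints out of the action's JSON output."""
    kafka = json.loads(action_output)[unit]["results"]["kafka"]
    return {key: kafka[key] for key in ("username", "password", "endpoints")}


# Sample payload for illustration only. In practice this string would be the
# captured output of: juju run data-integrator/0 get-credentials --format=json
sample_output = json.dumps({
    "data-integrator/0": {
        "results": {
            "kafka": {
                "endpoints": "kafka-k8s-0.kafka-k8s-endpoints:9092",
                "password": "g6c5gjg48IjTFld664ipkz8Khqb5FOG0",
                "topic": "spark-streaming-store",
                "username": "relation-9",
            }
        }
    }
})

credentials = extract_kafka_credentials(sample_output)
print(credentials["username"], credentials["endpoints"])
```

Parsing one captured output avoids running the action three times, at the cost of a small script.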

Let’s see the format of an event generated by kafka-test-app.

{"timestamp": 1707128717.277745, "_id": "339e0f352b4843bfa0d512b796019a92", "origin": "kafka-test-app-0 (10.1.29.185)", "content": "Message #121"}

As we can see, the value of the “origin” key is the name and the IP address of the unit producing the events.
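To make the structure concrete, here is a minimal sketch that parses the sample event above with Python's json module and reads its “origin” key. It assumes the message value arrives as UTF-8 encoded bytes; the helper name origin_of is hypothetical.

```python
from json import loads

# The sample event shown above, as raw bytes (assumed encoding: UTF-8).
raw_event = (
    b'{"timestamp": 1707128717.277745, '
    b'"_id": "339e0f352b4843bfa0d512b796019a92", '
    b'"origin": "kafka-test-app-0 (10.1.29.185)", '
    b'"content": "Message #121"}'
)


def origin_of(value: bytes) -> str:
    """Return the value of the "origin" key of one JSON-encoded event."""
    return loads(value)["origin"]


origin = origin_of(raw_event)

# Split "unit-name (ip-address)" into its two parts.
unit_name, _, address = origin.partition(" ")
address = address.strip("()")

print(unit_name)
print(address)
```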

Now we will write a Spark job in Python that counts the number of events grouped by the “origin” key in real-time.

First, we will authenticate with Apache Kafka and load the events from the spark-streaming-store topic. This can be done using the spark.readStream function as follows:

lines = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka-k8s-0.kafka-k8s-endpoints:9092") \
        .option("kafka.sasl.mechanism", "SCRAM-SHA-512") \
        .option("kafka.security.protocol", "SASL_PLAINTEXT") \
        .option("kafka.sasl.jaas.config", 'org.apache.kafka.common.security.scram.ScramLoginModule required username="relation-9" password="g6c5gjg48IjTFld664ipkz8Khqb5FOG0";') \
        .option("subscribe", "spark-streaming-store") \
        .option("includeHeaders", "true") \
        .load()
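The reader options above can also be assembled from variables rather than typed inline, which keeps the credentials out of the source code. The following is a minimal sketch: the helper names scram_jaas_config and kafka_reader_options are hypothetical, and the option keys and values are the ones used in the snippet above.

```python
def scram_jaas_config(username: str, password: str) -> str:
    """Build the JAAS configuration string for SCRAM authentication."""
    return (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        f'username="{username}" password="{password}";'
    )


def kafka_reader_options(endpoint: str, username: str,
                         password: str, topic: str) -> dict:
    """Collect the Kafka source options used in this tutorial into one dict."""
    return {
        "kafka.bootstrap.servers": endpoint,
        "kafka.sasl.mechanism": "SCRAM-SHA-512",
        "kafka.security.protocol": "SASL_PLAINTEXT",
        "kafka.sasl.jaas.config": scram_jaas_config(username, password),
        "subscribe": topic,
        "includeHeaders": "true",
    }


options = kafka_reader_options(
    endpoint="kafka-k8s-0.kafka-k8s-endpoints:9092",
    username="relation-9",
    password="g6c5gjg48IjTFld664ipkz8Khqb5FOG0",
    topic="spark-streaming-store",
)
print(options["kafka.sasl.jaas.config"])
```

With a Spark session available, such a dictionary could be passed as spark.readStream.format("kafka").options(**options).load().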

Now, let’s create a user-defined function that fetches the value of the “origin” key from the event:

from pyspark.sql.functions import udf, col
from json import loads

# Parse the JSON payload of an event and return its "origin" value
get_origin = udf(lambda x: loads(x)["origin"])

Then let’s count the number of events grouped by the value of the “origin” key, which is returned by the get_origin function that we wrote above.

count = lines.withColumn(
            "origin", 
            get_origin(col("value"))
          ).select("origin").groupBy("origin").count()

Finally, writing the result continuously to the console until the job is terminated can be done as follows:

query = count.writeStream.outputMode("complete").format("console").start()

query.awaitTermination()
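To build intuition for what the “complete” output mode prints, here is a plain-Python sketch that mimics the aggregation without Spark. It is an analogy under stated assumptions, not how Spark implements it: the events and the split into two micro-batches are made up, and the helper name update_counts is hypothetical.

```python
from collections import Counter
from json import loads


def update_counts(totals: Counter, batch: list) -> Counter:
    """Fold one micro-batch of JSON-encoded events into the running totals."""
    totals.update(loads(value)["origin"] for value in batch)
    return totals


# Two made-up micro-batches of events, keeping only the relevant keys.
batches = [
    [
        '{"origin": "kafka-test-app-0 (10.1.29.185)", "content": "Message #1"}',
        '{"origin": "kafka-test-app-1 (10.1.29.186)", "content": "Message #1"}',
    ],
    [
        '{"origin": "kafka-test-app-0 (10.1.29.185)", "content": "Message #2"}',
    ],
]

totals = Counter()
for number, batch in enumerate(batches):
    update_counts(totals, batch)
    # In "complete" mode the whole result table is emitted after every batch,
    # not only the rows that changed.
    print(f"Batch: {number}", dict(totals))
```

Each printed line corresponds to one batch in the console output of the streaming query: the counts keep growing because they cover all events seen so far.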

Putting all of this together and enabling username, password, endpoint and topic to be passed as command line arguments, we have the following program:

import argparse
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from json import loads

# Create a Spark Session
spark = SparkSession.builder.appName("SparkStreaming").getOrCreate()

# Read username, password, endpoint and topic from command line arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--kafka-username", "-u",
                    help="The username to authenticate to Kafka",
                    required=True)
parser.add_argument("--kafka-password", "-p",
                    help="The password to authenticate to Kafka",
                    required=True)
parser.add_argument("--kafka-endpoint", "-e",
                    help="The bootstrap server endpoint",
                    required=True)
parser.add_argument("--kafka-topic", "-t",
                    help="The Kafka topic to subscribe to",
                    required=True)
args = parser.parse_args()
username = args.kafka_username
password = args.kafka_password
endpoint = args.kafka_endpoint
topic = args.kafka_topic

# Authenticating with Kafka and reading the stream from the topic
lines = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", endpoint) \
        .option("kafka.sasl.mechanism", "SCRAM-SHA-512") \
        .option("kafka.security.protocol", "SASL_PLAINTEXT") \
        .option(
          "kafka.sasl.jaas.config", 
          f'org.apache.kafka.common.security.scram.ScramLoginModule required username="{username}" password="{password}";'
        ).option("subscribe", topic) \
        .option("includeHeaders", "true") \
        .load()

# User defined function that returns the origin of one particular event
get_origin = udf(lambda x: loads(x)["origin"])

# Group by the origin of the event and count the number of events for each origin
count = lines.withColumn(
            "origin", 
            get_origin(col("value"))
          ).select("origin").groupBy("origin").count()

# Start writing the result to console
query = count.writeStream.outputMode("complete").format("console").start()

# Keep doing this until the job is terminated
query.awaitTermination()

Save the Python code above in a file named spark_streaming.py. We’ll copy this script to the S3 bucket. Run the following command:

aws s3 cp spark_streaming.py s3://spark-tutorial/spark_streaming.py

Once the file has been copied to S3, let’s submit a new job to our Apache Spark cluster using spark-submit. Please note that we need to specify a few extra packages to interact with Apache Kafka because they are not included by default in the Charmed Apache Spark image.

spark-client.spark-submit \
    --username spark --namespace spark-streaming \
    --deploy-mode cluster \
    --packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.4.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1 \
    s3a://spark-tutorial/spark_streaming.py \
        --kafka-endpoint "$KAFKA_ENDPOINT" \
        --kafka-username "$KAFKA_USERNAME" \
        --kafka-password "$KAFKA_PASSWORD" \
        --kafka-topic "spark-streaming-store"

The job submission should start a driver pod, which in turn will create executor pods. The executor pods will then transition to the “Running” state and the streaming data is processed continuously by the executor pods.

We can view the status of the pods with the following command in a new terminal window:

watch -n1 "kubectl get pods -n spark-streaming | grep 'spark-streaming-.*-driver' "

The streaming output, which is directed to the console, is written to the logs of the driver pod. To fetch these logs, we first need to know the name of the driver pod. Let’s find its name and then follow its logs:

pod_name=$(kubectl get pods -n spark-streaming | grep "spark-streaming-.*-driver" | tail -n 1 | cut -d' ' -f1)

kubectl logs -n spark-streaming -f "$pod_name" | grep "Batch: " -A 10 # show each line containing "Batch: " and the 10 lines that follow it

The -f option follows the pod logs until you press Ctrl + C. If you observe carefully, you can see that new logs are appended roughly every ten seconds, including our aggregated results: the number of events grouped by origin, similar to the following:

...
2024-02-16T12:50:12.886Z [sparkd] Batch: 13
2024-02-16T12:50:12.886Z [sparkd] -------------------------------------------
...
2024-02-16T12:50:12.963Z [sparkd] +--------------------+-----+
2024-02-16T12:50:12.963Z [sparkd] |              origin|count|
2024-02-16T12:50:12.963Z [sparkd] +--------------------+-----+
2024-02-16T12:50:12.963Z [sparkd] |kafka-test-app-1 ...|   38|
2024-02-16T12:50:12.963Z [sparkd] |kafka-test-app-0 ...|   38|
2024-02-16T12:50:12.963Z [sparkd] |kafka-test-app-2 ...|   39|
2024-02-16T12:50:12.963Z [sparkd] +--------------------+-----+

...
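If the counts are needed outside the terminal, the table rows in the logs can be parsed back into numbers. The following is a minimal sketch, assuming the log line layout shown in the sample above; the regular expression and the helper name parse_counts are hypothetical, and the origin values remain truncated exactly as the console sink prints them.

```python
import re

# Matches table rows such as "|kafka-test-app-1 ...|   38|" and skips the
# header row, because its second cell is not a number.
ROW = re.compile(r"\|\s*(?P<origin>[^|]+?)\s*\|\s*(?P<count>\d+)\s*\|\s*$")


def parse_counts(log_lines) -> dict:
    """Map each origin found in the log lines to its event count."""
    counts = {}
    for line in log_lines:
        match = ROW.search(line)
        if match:
            counts[match.group("origin")] = int(match.group("count"))
    return counts


# Lines copied from the sample output above.
sample_lines = [
    "2024-02-16T12:50:12.886Z [sparkd] Batch: 13",
    "2024-02-16T12:50:12.963Z [sparkd] +--------------------+-----+",
    "2024-02-16T12:50:12.963Z [sparkd] |              origin|count|",
    "2024-02-16T12:50:12.963Z [sparkd] +--------------------+-----+",
    "2024-02-16T12:50:12.963Z [sparkd] |kafka-test-app-1 ...|   38|",
    "2024-02-16T12:50:12.963Z [sparkd] |kafka-test-app-0 ...|   38|",
    "2024-02-16T12:50:12.963Z [sparkd] |kafka-test-app-2 ...|   39|",
    "2024-02-16T12:50:12.963Z [sparkd] +--------------------+-----+",
]

counts = parse_counts(sample_lines)
print(counts)
print(sum(counts.values()))
```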

Bravo! We succeeded in processing some streaming data with the Charmed Apache Spark solution.

In the next section, we will learn how to monitor the status of the job using the Spark History Server and the Canonical Observability Stack.