Ingest tweets with Apache Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications. Learn more at flume.apache.org.

This charm provides a Flume agent designed to process tweets from the Twitter Streaming API and send them to the apache-flume-hdfs agent for storage in the shared filesystem (HDFS) of a connected Hadoop cluster. This leverages the TwitterSource jar packaged with Flume. Learn more about the 1% firehose.


The Twitter Streaming API requires developer credentials, which you must supply to this charm as configuration. You can find your credentials (or create an account if needed) on the Twitter developer site.

Create a secret.yaml file with your Twitter developer credentials:

    twitter_access_token: 'YOUR_TOKEN'
    twitter_access_token_secret: 'YOUR_TOKEN_SECRET'
    twitter_consumer_key: 'YOUR_CONSUMER_KEY'
    twitter_consumer_secret: 'YOUR_CONSUMER_SECRET'
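
Since a forgotten placeholder only shows up after deployment, it can help to check the file first. The following is a minimal sketch, not part of the charm: `check_secret_yaml` is a hypothetical helper name, and it only assumes the four keys listed above and that unfilled values still contain `YOUR_`.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the charm): verify that the
# credentials file defines all four keys and has no placeholder left.
check_secret_yaml() {
    file="${1:-secret.yaml}"
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    for key in twitter_access_token twitter_access_token_secret \
               twitter_consumer_key twitter_consumer_secret; do
        # Match "key:" at the start of a line, allowing leading indentation.
        line=$(grep -E "^[[:space:]]*${key}:" "$file") || {
            echo "missing key: $key" >&2; return 1; }
        # Assumption: an unfilled value still contains the text YOUR_.
        case "$line" in
            *YOUR_*) echo "placeholder still set for: $key" >&2; return 1 ;;
        esac
    done
    echo "ok: $file"
}
```

Running `check_secret_yaml secret.yaml` prints `ok: secret.yaml` when every key is present and filled in, and returns a non-zero status otherwise.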


This charm leverages our pluggable Hadoop model with the hadoop-plugin interface. This means that you will need to deploy a base Apache Hadoop cluster to run Flume. The suggested deployment method is to use the apache-ingestion-flume bundle. This will deploy the Apache Hadoop platform with a single Apache Flume unit that communicates with the cluster by relating to the apache-hadoop-plugin subordinate charm:

    juju quickstart u/bigdata-dev/apache-ingestion-flume

Alternatively, you may manually deploy the recommended environment as follows:

    juju deploy apache-hadoop-hdfs-master hdfs-master
    juju deploy apache-hadoop-yarn-master yarn-master
    juju deploy apache-hadoop-compute-slave compute-slave
    juju deploy apache-hadoop-plugin plugin
    juju deploy apache-flume-hdfs flume-hdfs

    juju add-relation yarn-master hdfs-master
    juju add-relation compute-slave yarn-master
    juju add-relation compute-slave hdfs-master
    juju add-relation plugin yarn-master
    juju add-relation plugin hdfs-master
    juju add-relation flume-hdfs plugin

Once the Hadoop environment has been deployed by either method, add the apache-flume-twitter charm and relate it to the flume-hdfs agent:

    juju deploy apache-flume-twitter flume-twitter --config=secret.yaml
    juju add-relation flume-twitter flume-hdfs

That's it! Once the Flume agents start, tweets will start flowing into HDFS in year-month-day/hour directories here: /user/flume/events/%y-%m-%d/%H.
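
To make the directory pattern concrete, the sketch below prints the directory for the current hour. `flume_event_dir` is a hypothetical helper, and it assumes the directory names follow the local clock of the machine where it runs (ideally the flume-hdfs unit); if the agent is configured with a different time zone, the result will be off by that offset.

```shell
#!/bin/sh
# Hypothetical helper: expand the /user/flume/events/%y-%m-%d/%H pattern
# for the current hour. date(1) uses the same %y, %m, %d and %H escapes.
# Assumption: the agent names directories using this machine's local time.
flume_event_dir() {
    date +/user/flume/events/%y-%m-%d/%H
}
```

On the flume-hdfs unit, `hdfs dfs -ls "$(flume_event_dir)"` would then list the events written during the current hour, if any exist yet.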

Test the deployment

To verify this charm is working as intended, SSH to the flume-hdfs unit, locate an event, and cat it:

    juju ssh flume-hdfs/0
    hdfs dfs -ls /user/flume/events  # <-- find a date
    hdfs dfs -ls /user/flume/events/yy-mm-dd  # <-- find an hour
    hdfs dfs -ls /user/flume/events/yy-mm-dd/HH  # <-- find an event
    hdfs dfs -cat /user/flume/events/yy-mm-dd/HH/FlumeData.[id].avro

You'll see Avro headers, since Avro is the default format used to store the tweets. You may not recognize the body of a tweet if it's not in a language you understand (remember, this is a 1% firehose of tweets from all over the world). You may have to cat a few different events before you find a tweet worth reading. Happy hunting!
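
If stepping through the date and hour directories by hand gets tedious, the lookup can be sketched as a small filter. `latest_event` is a hypothetical helper; it assumes the path is the last field of each listing line and that event ids sort lexically in creation order, which may not hold for every configuration.

```shell
#!/bin/sh
# Hypothetical helper: read an HDFS listing on stdin and print the path
# of the newest completed event file.
# Assumptions: the path is the last whitespace-separated field, and paths
# sort lexically in time order (yy-mm-dd/HH/FlumeData.<id>.avro).
latest_event() {
    # Keep only entries ending in .avro, skipping directories, the
    # "Found N items" header, and any file with another suffix.
    awk '$NF ~ /\.avro$/ { print $NF }' | sort | tail -n 1
}
```

On the flume-hdfs unit this could be used as `hdfs dfs -ls -R /user/flume/events | latest_event`, with the printed path passed to `hdfs dfs -cat`.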

