juju deploy cs:apache-flume-twitter
Ingest tweets with Apache Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. Learn more at flume.apache.org.
This charm provides a Flume agent designed to process tweets from the Twitter
Streaming API and send them to the apache-flume-hdfs agent for storage in
the shared filesystem (HDFS) of a connected Hadoop cluster. This leverages the
TwitterSource jar packaged with Flume.
The Twitter Streaming API requires developer credentials, which you must configure for this charm. Find your credentials (or create an application if needed) on the Twitter developer site.
Create a secret.yaml file with your Twitter developer credentials:
flume-twitter:
  twitter_access_token: 'YOUR_TOKEN'
  twitter_access_token_secret: 'YOUR_TOKEN_SECRET'
  twitter_consumer_key: 'YOUR_CONSUMER_KEY'
  twitter_consumer_secret: 'YOUR_CONSUMER_SECRET'
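Before deploying, it can help to confirm the file carries all four credential keys the charm expects. A minimal sketch (the helper function and validation approach are illustrative, not part of the charm):

```python
# Credential keys required by the flume-twitter config (see secret.yaml above).
REQUIRED_KEYS = {
    "twitter_access_token",
    "twitter_access_token_secret",
    "twitter_consumer_key",
    "twitter_consumer_secret",
}

def missing_keys(config: dict) -> set:
    """Return any required credential keys absent from the config mapping."""
    return REQUIRED_KEYS - config.keys()

# A partial config is flagged before juju ever sees it.
print(sorted(missing_keys({"twitter_access_token": "YOUR_TOKEN"})))
```

Catching a missing key here saves a deploy/debug round trip, since the agent cannot authenticate to Twitter without all four values.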
This charm leverages the pluggable Hadoop model: it connects to Hadoop through
an interface provided by the apache-hadoop-plugin subordinate charm. This means
you will need to deploy a base Apache Hadoop cluster to run Flume. The
suggested deployment method is to use the apache-ingestion-flume bundle, which
deploys the Apache Hadoop platform with a single Apache Flume unit that
communicates with the cluster by relating to the apache-hadoop-plugin
subordinate charm:
juju quickstart u/bigdata-dev/apache-ingestion-flume
Alternatively, you may manually deploy the recommended environment as follows:
juju deploy apache-hadoop-hdfs-master hdfs-master
juju deploy apache-hadoop-yarn-master yarn-master
juju deploy apache-hadoop-compute-slave compute-slave
juju deploy apache-hadoop-plugin plugin
juju deploy apache-flume-hdfs flume-hdfs
juju add-relation yarn-master hdfs-master
juju add-relation compute-slave yarn-master
juju add-relation compute-slave hdfs-master
juju add-relation plugin yarn-master
juju add-relation plugin hdfs-master
juju add-relation flume-hdfs plugin
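The add-relation commands above wire every service into one connected topology; in particular, flume-hdfs reaches HDFS only through the plugin subordinate. A small sketch, assuming nothing beyond the relations listed, that models the topology as a graph and checks that connectivity:

```python
from collections import defaultdict

# The relations from the manual deployment above, as an undirected graph.
RELATIONS = [
    ("yarn-master", "hdfs-master"),
    ("compute-slave", "yarn-master"),
    ("compute-slave", "hdfs-master"),
    ("plugin", "yarn-master"),
    ("plugin", "hdfs-master"),
    ("flume-hdfs", "plugin"),
]

adjacency = defaultdict(set)
for left, right in RELATIONS:
    adjacency[left].add(right)
    adjacency[right].add(left)

def reachable(start: str, goal: str) -> bool:
    """Depth-first search: can `start` reach `goal` through relations?"""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency[node])
    return False

# flume-hdfs talks to HDFS only via the plugin subordinate.
print(reachable("flume-hdfs", "hdfs-master"))  # True
```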
Once the bundle has been deployed, add the apache-flume-twitter charm and
relate it to the flume-hdfs charm:

juju deploy apache-flume-twitter flume-twitter --config=secret.yaml
juju add-relation flume-twitter flume-hdfs
That's it! Once the Flume agents start, tweets will start flowing into
HDFS in year-month-day/hour directories under /user/flume/events.
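The directory layout encodes the date and hour an event arrived. A sketch of how those paths are composed (the helper function is illustrative, not part of the charm; the base path matches the one used in the test commands below):

```python
from datetime import datetime

HDFS_BASE = "/user/flume/events"  # base path the flume-hdfs agent writes under

def event_dir(ts: datetime) -> str:
    """Build the yy-mm-dd/HH directory an event lands in."""
    return f"{HDFS_BASE}/{ts:%y-%m-%d}/{ts:%H}"

print(event_dir(datetime(2015, 6, 1, 14, 30)))  # /user/flume/events/15-06-01/14
```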
Test the deployment
To verify this charm is working as intended, SSH to the flume-hdfs unit,
locate an event, and cat it:
juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/events               # <-- find a date
hdfs dfs -ls /user/flume/events/yy-mm-dd      # <-- find an hour
hdfs dfs -ls /user/flume/events/yy-mm-dd/HH   # <-- find an event
hdfs dfs -cat /user/flume/events/yy-mm-dd/HH/FlumeData.[id].avro
You'll see AVRO headers since that's the default format used to contain the tweets. You may not recognize the body of the tweet if it's not in a language you understand (remember, this is a 1% firehose from tweets all over the world). You may have to cat a few different events before you find a tweet worth reading. Happy hunting!
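Those headers are easy to recognize: Avro object-container files begin with the magic bytes Obj followed by the version byte 0x01, so you can sanity-check a copied-down FlumeData file before trying to parse it. A minimal sketch (the function name is my own):

```python
AVRO_MAGIC = b"Obj\x01"  # magic bytes of the Avro object-container format

def looks_like_avro(data: bytes) -> bool:
    """True if the byte string starts with the Avro container magic."""
    return data.startswith(AVRO_MAGIC)

# e.g. with open("FlumeData.12345.avro", "rb") as f: looks_like_avro(f.read(4))
print(looks_like_avro(b"Obj\x01rest-of-file"))  # True
```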