Data ingestion framework built on top of Hadoop

"Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volume of data from a variety of data sources, e.g., databases, rest APIs, FTP/SFTP servers, filers, etc., onto Hadoop." from the Gobblin wiki


This charm uses the Hadoop base layer and the HDFS interface to pull its dependencies and act as a client to a Hadoop namenode. Here is how to deploy the Hadoop infrastructure:

juju quickstart apache-processing-mapreduce

Deploy the Gobblin charm and relate it to the namenode:

juju deploy gobblin
juju add-relation gobblin plugin

Testing the deployment

Smoke test Gobblin

From the Gobblin unit, start the Wikipedia ingestion demo job as the gobblin user:

juju ssh gobblin/0
cd /tmp
sudo su gobblin -c " --conf wikipedia.pull --jars /usr/lib/gobblin/lib/gobblin-example.jar"

The output will be in HDFS under /user/gobblin/work/job-output/gobblin/example/wikipedia/WikipediaOutput/. You can set the output directory through the --workdir flag.

List and get the job output file(s), which are in Avro format:

hdfs dfs -ls /user/gobblin/work/job-output/gobblin/example/wikipedia/WikipediaOutput/<Your_Job_Id>
hdfs dfs -get /user/gobblin/work/job-output/gobblin/example/wikipedia/WikipediaOutput/<Your_Job_Id>/<Path_To_Output>/<Output.avro>
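
The two commands above share a long path prefix, so a small shell helper can cut down on typing. The following is a minimal sketch and not part of the charm: the names gobblin_output_dir, avro_paths and GOBBLIN_WIKI_OUT are hypothetical, and the filter assumes the usual `hdfs dfs -ls` layout in which the path is the last whitespace-separated field (so paths containing spaces are not handled).

```shell
# Hypothetical helpers (not shipped with the charm or with Gobblin).

# Base directory of the Wikipedia demo output, as given in this README.
GOBBLIN_WIKI_OUT="/user/gobblin/work/job-output/gobblin/example/wikipedia/WikipediaOutput"

# Print the HDFS output directory for a given job ID.
gobblin_output_dir() {
  printf '%s/%s\n' "$GOBBLIN_WIKI_OUT" "$1"
}

# Read `hdfs dfs -ls` style lines on stdin and print only the paths
# that end in .avro. Assumes the path is the last field of each line.
avro_paths() {
  awk '$NF ~ /\.avro$/ { print $NF }'
}
```

On the Gobblin unit these might be combined as `hdfs dfs -ls -R "$(gobblin_output_dir <Your_Job_Id>)" | avro_paths`, after which each printed path can be passed to `hdfs dfs -get`.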

Convert the Avro output to JSON:

curl -O
java -jar avro-tools-1.7.7.jar tojson --pretty <Output.avro> > output.json
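
As a quick sanity check on the converted data, it can help to count the records that were ingested. The following is a minimal sketch under one assumption: that `tojson` without `--pretty` writes one JSON record per line, which is believed to be the avro-tools default. The helper name count_json_records is hypothetical.

```shell
# Hypothetical helper: count non-blank lines read from stdin.
# Intended for line-delimited JSON, i.e. one record per line.
count_json_records() {
  awk 'NF { n++ } END { print n + 0 }'
}
```

A possible use is `java -jar avro-tools-1.7.7.jar tojson <Output.avro> | count_json_records`. A count of 0 would suggest that the job wrote no records or that the wrong file was fetched.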

