Monday, December 15, 2014

Oozie Coordinator based on Data Availability

Apache Oozie is a framework (a Java web application) for scheduling Hadoop jobs such as MapReduce, Pig, Hive and HBase jobs. The individual tasks or jobs are referred to as actions. The actions are arranged into a DAG (directed acyclic graph), which is defined as a workflow in XML format.

This post covers two types of Oozie jobs:

  1. Workflow Jobs - These jobs specify the sequence of actions to be executed, expressed as a DAG. A workflow job consists of workflow.xml, workflow.properties and the code for the actions to be executed. The workflow.xml and the code, packaged as a jar, are bundled together.

  2. Coordinator Jobs - These jobs are recurrent Oozie workflow jobs triggered by time (frequency) and data availability. They have an additional coordinator.xml as part of the bundle that is pushed to Oozie.

The Oozie bundle needs to be copied to HDFS. The structure of the bundle, as it appears in the HDFS listing further down, is:

OozieSample/
    lib/
        OozieSample.jar
    workflow.xml

The content of workflow.xml is as below:
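As a rough illustration of what such a file can look like, the sketch below writes a minimal workflow.xml containing a single Java action. It is not the original file from this post: the main class com.example.OozieSample, the node names and the schema version are assumptions, and ${jobTracker} and ${nameNode} are expected to come from the properties file.

```shell
#!/bin/sh
# Sketch only: writes a minimal workflow.xml with one Java action.
# com.example.OozieSample is a hypothetical main class; use your own.
# The quoted 'EOF' keeps the shell from expanding the ${...} Oozie variables.
cat > workflow.xml <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.4" name="OozieSample">
    <start to="java-node"/>
    <action name="java-node">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.OozieSample</main-class>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Java action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```

The jar in the lib/ directory next to workflow.xml is put on the action's classpath by Oozie, which is why the workflow does not need to reference it explicitly.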


The code for the Oozie Java action:


The workflow.properties file:
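A properties file for a workflow job typically only needs the cluster endpoints and the HDFS location of the application. The sketch below writes one possible workflow.properties; the host names and ports are assumptions for a local single-node setup and will differ on a real cluster.

```shell
#!/bin/sh
# Sketch only: writes a minimal workflow.properties.
# localhost:8020 and localhost:8021 are assumed endpoints; adjust to your cluster.
cat > workflow.properties <<'EOF'
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
# HDFS directory that contains workflow.xml and lib/
oozie.wf.application.path=${nameNode}/apps/OozieSample
EOF
```

The oozie.wf.application.path property is what tells Oozie that a workflow job is being submitted and where to find it.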


The code jar and workflow.xml are copied to HDFS:

hadoop fs -rm -r /apps/${JOB_NAME}
hadoop fs -mkdir /apps/${JOB_NAME}
hadoop fs -copyFromLocal ${TARGET_DIR}/${JOB_NAME} /apps/

# aagarwal-mbpro:OozieSample ashok.agarwal$ hadoop fs -ls /apps/OozieSample/lib/
# Found 1 items
# -rw-r--r-- 1 ashok.agarwal supergroup 8038 2014-09-11 14:22 /apps/OozieSample/lib/OozieSample.jar
# aagarwal-mbpro:OozieSample ashok.agarwal$ hadoop fs -ls /apps/OozieSample/
# Found 2 items
# drwxr-xr-x - ashok.agarwal supergroup 0 2014-09-11 14:22 /apps/OozieSample/lib
# -rw-r--r-- 1 ashok.agarwal supergroup 794 2014-08-07 12:54 /apps/OozieSample/workflow.xml
# aagarwal-mbpro:OozieSample ashok.agarwal$

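Next, submit the workflow job to Oozie. Note that -config takes a path on the local filesystem, not an HDFS path:

oozie job -oozie http://aagarwal-mbpro.local:11000/oozie -config /apps/OozieSample/workflow.properties -run

The workflow job is now deployed.

We can make this job recurrent by adding a coordinator.xml.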
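A time-triggered coordinator can be quite small. The sketch below writes one possible coordinator.xml that runs the workflow every 5 minutes; the coordinator name, the frequency, the schema version and the variable names start, end and appPath are assumptions to be supplied from a properties file.

```shell
#!/bin/sh
# Sketch only: writes a time-based coordinator.xml.
# ${start}, ${end} and ${appPath} are assumed property names, resolved by Oozie.
cat > coordinator.xml <<'EOF'
<coordinator-app name="OozieSampleCoord" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory that contains workflow.xml -->
            <app-path>${appPath}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
EOF
```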


Copy this coordinator.xml to HDFS:

hadoop fs -copyFromLocal coordinator.xml /apps/OozieSample/

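The workflow.properties file will not work in this case, because a coordinator job is located through the oozie.coord.application.path property rather than oozie.wf.application.path. So for the coordinator we create a separate coordinator.properties file.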
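A minimal sketch of such a file is written by the script below. The endpoints, the start and end times, and the property names appPath, start and end are assumptions; they only need to match whatever variables the coordinator.xml refers to.

```shell
#!/bin/sh
# Sketch only: writes a minimal coordinator.properties.
# Endpoints and the start/end window are assumed values; adjust as needed.
cat > coordinator.properties <<'EOF'
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
appPath=${nameNode}/apps/OozieSample
# HDFS directory that contains coordinator.xml
oozie.coord.application.path=${appPath}
# Window in which the coordinator materializes actions (UTC)
start=2014-12-15T00:00Z
end=2014-12-16T00:00Z
EOF
```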


Now submit the job again using the command below:

oozie job -oozie http://aagarwal-mbpro.local:11000/oozie -config ${TARGET_DIR}/coordinator.properties -run

In order to make the coordinator trigger on data availability, the coordinator.xml is updated as below:
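Data availability is expressed in a coordinator through a dataset and an input event. The sketch below writes one possible coordinator.xml of that kind; the names feed and input, the 5 minute frequency, the schema version and the start/end variables are assumptions, while the URI template and the _SUCCESS flag follow the path used in this post.

```shell
#!/bin/sh
# Sketch only: writes a coordinator.xml that waits for a done-flag file.
# Dataset name "feed", event name "input" and the frequency are assumptions.
cat > coordinator.xml <<'EOF'
<coordinator-app name="OozieSampleDataCoord" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="feed" frequency="${coord:minutes(5)}"
                 initial-instance="${start}" timezone="UTC">
            <uri-template>${appPath}/feed/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
            <!-- The dataset instance counts as available once this file exists -->
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="feed">
            <!-- Wait for the instance matching the action's nominal time -->
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${appPath}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
EOF
```

With this layout each coordinator action stays in the waiting state until the done-flag appears in the directory for its nominal time.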


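Copy the updated coordinator.xml to HDFS (the file already exists there, so overwrite it, for example with the -f option of copyFromLocal) and push the job to Oozie.

This job will wait until it finds the data, i.e. the empty _SUCCESS signal file, at ${appPath}/feed/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}.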

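So create the file and copy it to that path, substituting actual values for the ${...} placeholders (they are resolved by Oozie, not by the shell).

touch _SUCCESS

hadoop fs -copyFromLocal _SUCCESS ${appPath}/feed/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}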
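Since the path has to be filled in by hand for each nominal time, a small helper can reduce mistakes. The sketch below is one way to do it: feed_dir and publish_done_flag are hypothetical helper names, and the timestamp is assumed to be in the yyyy-MM-ddTHH:mmZ form that Oozie uses for coordinator times.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not part of Oozie or Hadoop.

# Print the dataset directory for one nominal time.
# Usage: feed_dir <appPath> <yyyy-MM-ddTHH:mmZ>
feed_dir() {
    app_path="$1"
    ts="$2"
    year=$(printf '%s' "$ts" | cut -c1-4)
    month=$(printf '%s' "$ts" | cut -c6-7)
    day=$(printf '%s' "$ts" | cut -c9-10)
    hour=$(printf '%s' "$ts" | cut -c12-13)
    minute=$(printf '%s' "$ts" | cut -c15-16)
    printf '%s/feed/%s/%s/%s/%s/%s\n' \
        "$app_path" "$year" "$month" "$day" "$hour" "$minute"
}

# Create the directory and an empty _SUCCESS file in HDFS.
# Requires the hadoop CLI; call it only on a machine with cluster access.
publish_done_flag() {
    dir=$(feed_dir "$1" "$2")
    hadoop fs -mkdir -p "$dir" && hadoop fs -touchz "$dir/_SUCCESS"
}
```

For example, publish_done_flag hdfs://localhost:8020/apps/OozieSample 2014-12-15T10:05Z would create the flag for the 10:05 UTC instance, assuming that name node address.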

Check the Oozie workflow in the web UI. It will start executing once Oozie detects the file at the path the coordinator is watching.