Hands on Kite: Writing a Crunch Job
In this example, you'll build and run a Crunch job that calculates the number of movies released in each year. This will demonstrate:
- Using the Kite application parent POM to manage dependencies
- Using the Kite Maven plugin to submit a MapReduce (MR) job
- Opening a dataset with Kite's API
Download the maven project tarball and unzip it:
wget --http-user=kite --http-password=kite http://bits.cloudera.com/c68eb8e4/movies-crunch.tar.gz
tar xzf movies-crunch.tar.gz
Next, look at the contents:
[cloudera@localhost ~]$ tree movies-crunch
movies-crunch
├── pom.xml
└── src
    └── main
        ├── java
        │   └── com
        │       └── cloudera
        │           └── Movies.java
        └── resources
            └── mapred-site.xml
This is a minimal Crunch project:
- Movies.java is the driver program that defines functions to extract the year from each movie title, group the years, and count them.
- pom.xml configures Maven to build and run the driver program.
- mapred-site.xml sets cluster configuration, like the default FS and the metastore URI.
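The extract, group, and count steps that Movies.java performs can be illustrated without a cluster. The following is a minimal plain-Java sketch, not the tarball's Movies.java and not Crunch code. It assumes each movie title ends with its release year in parentheses, such as "Toy Story (1995)"; the class name YearCountSketch, its method names, and the sample titles are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the logic of the driver: no Crunch or Kite APIs are used here.
class YearCountSketch {
    // Assumption: a title ends with a four-digit year in parentheses, e.g. "Toy Story (1995)".
    private static final Pattern YEAR = Pattern.compile("\\((\\d{4})\\)\\s*$");

    // Extract step: pull the year out of a title, or return empty if none is found.
    static Optional<Integer> extractYear(String title) {
        Matcher m = YEAR.matcher(title);
        return m.find() ? Optional.of(Integer.parseInt(m.group(1))) : Optional.empty();
    }

    // Group and count steps: tally titles per year, sorted by year.
    static Map<Integer, Long> countByYear(List<String> titles) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (String title : titles) {
            extractYear(title).ifPresent(year -> counts.merge(year, 1L, Long::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input; the real job reads the movies dataset.
        List<String> titles = Arrays.asList(
            "Toy Story (1995)", "GoldenEye (1995)", "Casablanca (1942)", "Untitled");
        // Print each pair in the same bracketed style as the job output.
        countByYear(titles).forEach((year, n) -> System.out.println("[" + year + "," + n + "]"));
    }
}
```

In a Crunch pipeline the same three steps would run as distributed operations over the dataset rather than as a loop over an in-memory list; the sketch only shows what is being computed.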
The pom.xml file uses the Kite app POM as its parent (not a dependency):
<parent>
<groupId>org.kitesdk</groupId>
<artifactId>kite-app-parent-cdh4</artifactId>
<version>0.15.0</version>
</parent>
This parent POM configures the project's dependencies for CDH4, including test dependencies and test-jar artifacts. Notice that the only dependency listed directly in the POM is hive-exec, which is required for reading from the movies dataset in Hive. It is declared explicitly to change its scope from provided to compile, so that Kite will add it with -libjars when running the job.
The POM also adds the Kite Maven plugin, which performs Kite-specific tasks in Maven. In this case, you will use it to run the driver program, com.cloudera.Movies, after installing the jar:
cd movies-crunch
mvn clean install
mvn kite:run-tool
The toolClass is already configured in the POM, but could instead be added to the command line with -Dkite.toolClass=com.cloudera.Movies. You can find more information on the Kite Maven plugin in the kitesdk.org usage docs and about other tasks it can perform in the goal docs.
If you haven't already, build and run the Crunch tool:
mvn clean install kite:run-tool
The job will create a "year_counts" directory in HDFS that you can view:
[cloudera@localhost movies-crunch]$ hdfs dfs -cat year_counts/*
[1926,1]
[1931,3]
[1932,6]
[1933,11]
[1934,20]
[1935,33]
[1936,48]
[1937,67]
[1938,89]
[1939,118]