Hands on Kite: Writing a Crunch Job

Joey Echeverria edited this page Aug 27, 2014 · 4 revisions

In this example, you'll build and run a Crunch job that counts the number of movies released in each year. This demonstrates:

  • Using the Kite application parent POM to manage dependencies
  • Using the Kite Maven plugin to submit a MapReduce job
  • Opening a dataset with Kite's API

Download the project

Download the Maven project tarball and extract it:

wget --http-user=kite --http-password=kite http://bits.cloudera.com/c68eb8e4/movies-crunch.tar.gz
tar xzf movies-crunch.tar.gz

Next, look at the contents:

[cloudera@localhost ~]$ tree movies-crunch
movies-crunch
├── pom.xml
└── src
    └── main
        ├── java
        │   └── com
        │       └── cloudera
        │           └── Movies.java
        └── resources
            └── mapred-site.xml

This is a minimal Crunch project:

  • Movies.java is the driver program. It defines functions that extract the year from each movie title, group the years, and count them.
  • pom.xml configures Maven to build and run the driver program.
  • mapred-site.xml sets cluster configuration, such as the default file system and the metastore URI.
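
The extract, group, and count steps that Movies.java performs can be sketched in plain Java, without Crunch or a cluster. This is only an illustration of the logic, not the project's actual code: the class name YearCountSketch, its method names, and the assumption that each title ends with the release year in parentheses (for example "Toy Story (1995)") are all hypothetical. The real driver expresses the same steps as Crunch operations that run as a MapReduce job.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the logic in Movies.java; it does not use Crunch.
public class YearCountSketch {
    // Assumption: a title ends with a four-digit year in parentheses.
    private static final Pattern YEAR = Pattern.compile("\\((\\d{4})\\)\\s*$");

    /** Returns the year parsed from the end of the title, or -1 if there is none. */
    static int extractYear(String title) {
        Matcher m = YEAR.matcher(title);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Groups titles by extracted year and counts each group, ordered by year. */
    static Map<Integer, Long> countByYear(List<String> titles) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (String title : titles) {
            int year = extractYear(title);
            if (year > 0) {
                // Titles without a recognizable year are skipped.
                counts.merge(year, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "Toy Story (1995)", "GoldenEye (1995)", "General, The (1926)");
        // Print each entry as a [year,count] pair.
        for (Map.Entry<Integer, Long> e : countByYear(titles).entrySet()) {
            System.out.println("[" + e.getKey() + "," + e.getValue() + "]");
        }
    }
}
```

In a Crunch pipeline the extraction would typically be a function applied to every record and the counting would be done by the framework across the cluster; the in-memory TreeMap here only stands in for that grouping step.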

The Maven POM

The pom.xml file uses the Kite app POM as its parent (not a dependency):

  <parent>
    <groupId>org.kitesdk</groupId>
    <artifactId>kite-app-parent-cdh4</artifactId>
    <version>0.15.0</version>
  </parent>

This parent POM configures the project's dependencies for CDH4, including test dependencies and test-jar artifacts. Notice that the only dependency listed directly in the POM is hive-exec, which is required for reading the movies dataset from Hive. It is listed explicitly to change its scope from provided to compile, so that Kite adds it with -libjars when running the job.

The POM also adds the Kite Maven plugin, which performs Kite-specific tasks in Maven. In this case, you will use it to run the driver program, com.cloudera.Movies, after installing the jar:

cd movies-crunch
mvn clean install
mvn kite:run-tool

The toolClass is already configured in the POM, but it could instead be set on the command line with -Dkite.toolClass=com.cloudera.Movies. You can find more information on the Kite Maven plugin in the kitesdk.org usage docs, and on the other tasks it can perform in the goal docs.

Running the tool

If you haven't already, build and run the Crunch tool:

mvn clean install kite:run-tool

The job creates a "year_counts" directory in HDFS. You can view its contents, one [year,count] pair per line:

[cloudera@localhost movies-crunch]$ hdfs dfs -cat year_counts/*
[1926,1]
[1931,3]
[1932,6]
[1933,11]
[1934,20]
[1935,33]
[1936,48]
[1937,67]
[1938,89]
[1939,118]