This tutorial shows two ways to run Hadoop MapReduce programs on a Hadoop Distributed File System (HDFS) cluster using HDInsight Services for Windows Azure.

  • Use the Create Job UI to run MapReduce programs written in Java, contained in Hadoop jar files
  • Use the Interactive JavaScript Console to run programs written in JavaScript

You will learn:

  • How to run a basic Java MapReduce program by using a Hadoop jar file.
  • How to upload input files to the HDFS cluster and read output files from the HDFS cluster.
  • How to run a JavaScript MapReduce script with a query by using the fluent API on Pig that is provided by the Interactive JavaScript Console.
  • How to view the results from the HDInsight Interactive JavaScript Console.

This tutorial is composed of the following segments:

  1. How to run a basic Java MapReduce program using a Hadoop jar file with the Create Job UI.
  2. How to run a JavaScript MapReduce script using the Interactive Console.

Setup and configuration

To work through this tutorial, you must have an account to access Hadoop on Windows Azure and have created a cluster. To obtain an account and create a Hadoop cluster, follow the instructions outlined in the Getting started with Microsoft Hadoop on Windows Azure section of the Introduction to Apache Hadoop-based Service for Windows Azure topic.

How to run a basic Java MapReduce program by using a Hadoop jar file with the Create Job UI

From your Account page, click the Create Job icon in the Your Tasks section to bring up the Create Job UI.

 

To run a MapReduce program, specify the Job Name and the JAR File to use. Parameters are added to specify the name of the MapReduce program to run, the location of input and code files, and an output directory.
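As a rough illustration of how a jar file and its parameters combine into a single command, the sketch below assembles a command line in the standard "hadoop jar <jar file> <arguments>" form, using the Pi Estimator parameters described later in this tutorial. The function name buildFinalCommand and the jar file name are hypothetical, and the portal's actual formatting may differ.

```javascript
// Hypothetical helper: joins a jar file name and its parameters into one
// command line in the standard "hadoop jar <jar> <args...>" form.
function buildFinalCommand(jarFile, parameters) {
  // Each parameter may hold several space-separated tokens, so split them
  // and drop empty pieces before joining everything with single spaces.
  const tokens = parameters
    .flatMap((p) => p.trim().split(/\s+/))
    .filter((t) => t.length > 0);
  return ["hadoop", "jar", jarFile, ...tokens].join(" ");
}

// "my-examples.jar" is a placeholder name, not the jar used by the portal.
console.log(buildFinalCommand("my-examples.jar", ["pi 16 10000000"]));
```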

To see a simple example of how this interface is used to run the MapReduce job, let's look at the Pi Estimator sample. Return to your Account page. Scroll down to the Samples icon in the Manage your account section and click it.

Click the Pi Estimator sample icon in the Hadoop Sample Gallery. 

The Pi Estimator page describes the application and provides downloads for the Java MapReduce programs and for the jar file that contains the files Hadoop on Windows Azure needs to deploy the application.

To deploy the files to the cluster, click the Deploy to your cluster button on the right side. 

The fields on the Create Job page are populated for you in this example. The first parameter value defaults to "pi 16 10000000". The first number indicates how many maps to create (default is 16) and the second number indicates how many samples are generated per map (10 million by default). So this program uses 160 million random points to make its estimate of Pi. The Final Command is automatically constructed for you from the specified parameters and jar file.
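The idea behind the estimate can be sketched locally. The example below is a minimal illustration that assumes plain pseudo-random sampling of points in the unit square, which may not match how the sample generates its points internally. The names runMap and estimatePi are hypothetical and are not part of the Hadoop sample.

```javascript
// Counts how many of `samples` random points in the unit square fall inside
// the quarter circle of radius 1. This plays the role of one map task.
function runMap(samples, random = Math.random) {
  let inside = 0;
  for (let i = 0; i < samples; i++) {
    const x = random();
    const y = random();
    if (x * x + y * y <= 1) inside++;
  }
  return inside;
}

// Sums the per-map counts (the reduce step) and converts the ratio into an
// estimate of Pi: the quarter circle covers Pi/4 of the unit square.
function estimatePi(numMaps, samplesPerMap, random = Math.random) {
  let inside = 0;
  for (let m = 0; m < numMaps; m++) {
    inside += runMap(samplesPerMap, random);
  }
  const total = numMaps * samplesPerMap;
  return { total, estimate: (4 * inside) / total };
}

// Defaults from the tutorial: 16 maps x 10,000,000 samples = 160,000,000 points.
console.log(16 * 10000000);
// A much smaller run so the sketch finishes quickly on a local machine.
console.log(estimatePi(16, 10000).estimate);
```

Increasing either the number of maps or the samples per map raises the total number of points, which generally tightens the estimate at the cost of a longer run.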

To run the program on the Hadoop cluster, simply click the blue Execute job button on the right side of the page.

The status of the job is displayed on the page and changes to Completed Successfully when it is done. The result is displayed at the bottom of the Output (stdout) section. For the default parameters, the result is Pi = 3.14159155000000000000, which matches the true value of Pi (3.14159265...) to five decimal places.
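A quick way to check how closely a reported value tracks Pi is to compare its digits with a reference value. The sketch below does this for the value quoted above, using JavaScript's built-in Math.PI as the reference. The helper name matchingDecimalPlaces is hypothetical.

```javascript
// Counts how many digits after the decimal point two decimal strings share
// before they first differ (no rounding is applied).
function matchingDecimalPlaces(a, b) {
  const [intA, fracA = ""] = a.split(".");
  const [intB, fracB = ""] = b.split(".");
  if (intA !== intB) return 0;
  let n = 0;
  while (n < fracA.length && n < fracB.length && fracA[n] === fracB[n]) n++;
  return n;
}

const reported = "3.14159155";        // value shown in the job output
const reference = Math.PI.toFixed(8); // "3.14159265"
console.log(matchingDecimalPlaces(reported, reference)); // 5
console.log(Math.abs(Number(reported) - Math.PI));       // about 1.1e-6
```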

How to run a JavaScript MapReduce script using the Interactive Console

This segment shows how to run a MapReduce job with a query by using the fluent API layered on Pig that is provided by the Interactive Console. This example requires an input data file. The WordCount sample used here has already had this file uploaded to the cluster. The sample does, however, require that the .js script be uploaded to the cluster, and this step is used to show the procedure for uploading files to HDFS from the Interactive Console.

First, download a copy of the WordCount.js script to your local machine so that you can upload it to the cluster. Click here and save a copy of the WordCount.js file to your local ../downloads directory. In addition, download The Notebooks of Leonardo Da Vinci, available here.

To bring up the Interactive JavaScript console, return to your Account page, scroll down to the Your Cluster section, and click the Interactive Console icon.

To upload the WordCount.js file to the cluster, enter the upload command fs.put() at the js> console, select the WordCount.js file from your downloads folder, and use ./WordCount.js/ for the Destination parameter.



Click the Browse button for the Source, navigate to the ../downloads directory and select the WordCount.js file. Enter the Destination value as shown and click the Upload button.

Repeat this step to upload the davinci.txt file by using ./example/data/ for the Destination.

Execute the MapReduce program from the js> console by using the following command:

pig.from("/example/data/gutenberg/davinci.txt").mapReduce("WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("DaVinciTop10Words")
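To make the query above more concrete, here is a sketch of what a word-count script of this kind could look like, together with a small local stand-in for the pipeline so it can be tried outside the cluster. The map and reduce signatures, the context.write call, and the values iterator are assumptions about the script's shape, and the runLocally helper is hypothetical. The downloaded WordCount.js and the console's real implementation may differ.

```javascript
// Assumed shape of a word-count script: map emits a (word, 1) pair per word.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

// reduce sums the counts emitted for one word.
var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local stand-in for mapReduce(...).orderBy("count DESC").take(n).
function runLocally(lines, n) {
  var groups = new Map();
  lines.forEach(function (line, i) {
    map(i, line, {
      write: function (k, v) {
        if (!groups.has(k)) groups.set(k, []);
        groups.get(k).push(v);
      },
    });
  });
  var rows = [];
  groups.forEach(function (vals, word) {
    var pos = 0;
    var iterator = {
      hasNext: function () { return pos < vals.length; },
      next: function () { return vals[pos++]; },
    };
    reduce(word, iterator, {
      write: function (k, count) { rows.push({ word: k, count: count }); },
    });
  });
  rows.sort(function (a, b) { return b.count - a.count; }); // count DESC
  return rows.slice(0, n);                                   // take(n)
}

console.log(runLocally(["the cat and the hat", "The end"], 2));
```

In the local run, "the" appears three times across the two lines (case is folded by the map step), so it is the first row returned.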

Scroll to the right and click view log if you want to observe the details of the job's progress. This log also provides diagnostics if the job fails to complete.

To display the results in the DaVinciTop10Words directory once the job completes, use the file = fs.read("DaVinciTop10Words") command at the js> prompt.
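Once the output text is available, it can help to turn it into records. The sketch below assumes the job output is plain text with one tab-separated word and count per line (tab is the default field delimiter for Pig's text storage). The parseRows helper and the sample rows are invented for illustration and are not actual results from the job.

```javascript
// Hypothetical helper: parses tab-separated "word<TAB>count" lines into objects.
function parseRows(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const [word, count] = line.split("\t");
      return { word, count: Number(count) };
    });
}

// Made-up sample rows, not real output from the DaVinciTop10Words job.
const sample = "alpha\t30\nbeta\t20\ngamma\t10\n";
console.log(parseRows(sample));
```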


Summary

In this tutorial, you have seen two ways to run MapReduce jobs by using the Hadoop on Windows Azure portal. One used the Create Job UI to run a Java MapReduce program by using a jar file. The other used the Interactive Console to run a MapReduce job by using a .js script within a Pig query.

See Also

Another important place to find an extensive amount of Cortana Intelligence Suite related articles is the TechNet Wiki itself. The best entry point is Cortana Intelligence Suite Resources on the TechNet Wiki.