Getting Started

The Microsoft Distribution of Hadoop comes with a web-based interactive JavaScript console that is started along with the other Hadoop services. The console allows you to:

  • Perform HDFS operations, including uploading files to and reading files from the HDFS
  • Run MapReduce programs from .js scripts or JAR files, and monitor their progress
  • Run a Pig job specified using a fluent query syntax in JavaScript, and monitor its progress
  • Visualize data with graphs built using HTML5

To get started, open the console in your browser at http://localhost:8080/ after running "isotope start" in a local installation, or click the appropriate link on the Azure portal after you have signed in.

Walkthrough: Visualizing Word Count

Write the JavaScript MapReduce script

Using Notepad or your favorite text editor, create a text file with the following contents:

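// Map: split each input line into words and emit (word, 1) for every word.
var map = function (key, value, context) {
    // Split on any non-letter character; adjacent separators yield empty strings.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            // Lowercase so that "The" and "the" are counted together.
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: sum the counts emitted for a single word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        // Pass the radix explicitly so the value is always read as decimal.
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};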

Save the text file as “WordCount.js” on your hard drive. Note that UTF-8 encoding, which is often the default in Visual Studio, causes an "illegal character" exception when the Pig job runs, so set the encoding to "US-ASCII – Codepage 20127" instead.
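
Before uploading, it can help to check the script's logic outside Hadoop. The following is a minimal sketch of a local test harness, not part of the console: makeContext, makeIterator and runLocally are hypothetical helpers that only imitate the context.write, hasNext and next calls the script relies on, and the sample sentences are made up. It repeats the mapper and reducer (passing an explicit radix to parseInt) so that it is self-contained, and it is assumed to be run in a standalone JavaScript runtime such as Node.js.

```javascript
// Mapper and reducer repeated from WordCount.js so this sketch runs on its own.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical stand-in for the context object: records every pair written to it.
function makeContext() {
    var pairs = [];
    return {
        pairs: pairs,
        write: function (key, value) { pairs.push([key, value]); }
    };
}

// Hypothetical stand-in for the values iterator handed to reduce.
function makeIterator(items) {
    var position = 0;
    return {
        hasNext: function () { return position < items.length; },
        next: function () { return items[position++]; }
    };
}

// Runs map over each line, groups the emitted values by key,
// then runs reduce once per key in alphabetical order.
function runLocally(lines) {
    var mapContext = makeContext();
    lines.forEach(function (line, offset) {
        map(offset, line, mapContext);
    });

    // Object.create(null) avoids clashes with inherited names such as "constructor".
    var groups = Object.create(null);
    mapContext.pairs.forEach(function (pair) {
        if (!groups[pair[0]]) {
            groups[pair[0]] = [];
        }
        groups[pair[0]].push(pair[1]);
    });

    var reduceContext = makeContext();
    Object.keys(groups).sort().forEach(function (word) {
        reduce(word, makeIterator(groups[word]), reduceContext);
    });
    return reduceContext.pairs;
}

// Made-up sample input; each array element plays the role of one input line.
var counts = runLocally(["The quick brown fox", "the lazy dog, the end"]);
console.log(counts);
```

The harness only exercises the word-splitting and summing logic. It says nothing about how the real job partitions input or what types Hadoop passes to the reducer.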

Upload the script and input data

Open the interactive JavaScript console and type:

fs.put()

Then select the WordCount.js file you created in the previous step and upload it to the HDFS.

Next, create a directory on the HDFS for the Gutenberg sample by typing:

#mkdir gutenberg

Finally, upload each of the Gutenberg files by typing

fs.put("gutenberg")

and selecting a .txt file from the Gutenberg set (located in C:\Apps\dist\examples\data\gutenberg). Repeat this step for each of the text files.

To make sure the files were uploaded correctly, use the following commands:

#ls
#ls gutenberg
#cat WordCount.js

Run the query

Run the following to find the top 10 most frequent words in the Gutenberg sample texts:

pig.from("gutenberg").mapReduce("WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("gbtop10")

Once the job completes, you can see the output files in the HDFS by typing:

#ls gbtop10
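
To make the last two stages of the query concrete, here is a rough sketch in plain JavaScript of what ordering by "count DESC" and taking the first rows amounts to. This is only an in-memory illustration, not how Pig executes the job: orderByCountDesc and takeFirst are hypothetical names, and the rows are made-up sample data.

```javascript
// Returns a new array sorted by the count field, largest first.
// The input array is copied first so the caller's data is left untouched.
function orderByCountDesc(rows) {
    return rows.slice().sort(function (a, b) {
        return b.count - a.count;
    });
}

// Returns at most the first n rows.
function takeFirst(rows, n) {
    return rows.slice(0, n);
}

// Made-up (word, count) rows standing in for the MapReduce output.
var rows = [
    { word: "dog", count: 1 },
    { word: "the", count: 3 },
    { word: "fox", count: 2 }
];

var top2 = takeFirst(orderByCountDesc(rows), 2);
console.log(top2);
```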

Visualize the results

(Note: If you are using Internet Explorer, this step requires IE9 or later.)

Read the results into the JavaScript context by typing:

file = fs.read("gbtop10")
data = parse(file.data, "word, count:long")

Then make a bar graph of the data:

graph.bar(data)
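
For readers curious about what the parse step might be doing with the schema string, the sketch below shows one plausible interpretation. It is not the console's implementation: parseRows is a hypothetical function, the return shape (an array of objects keyed by field name) is an assumption, and the input is assumed to be tab-separated text with one record per line, which is Pig's default output format. The sample text is made up.

```javascript
// Hypothetical stand-in for the console's parse function.
// schema looks like "word, count:long"; fields without a type are kept as strings.
function parseRows(text, schema) {
    var fields = schema.split(",").map(function (spec) {
        var parts = spec.trim().split(":");
        return { name: parts[0], type: parts[1] || "chararray" };
    });

    return text
        .split(/\r?\n/)
        .filter(function (line) { return line !== ""; })
        .map(function (line) {
            var cells = line.split("\t");
            var row = {};
            fields.forEach(function (field, index) {
                // Convert "long" fields to numbers; leave everything else as text.
                row[field.name] = field.type === "long"
                    ? parseInt(cells[index], 10)
                    : cells[index];
            });
            return row;
        });
}

// Made-up sample in the assumed tab-separated layout.
var sample = "the\t120\nand\t85\n";
var parsed = parseRows(sample, "word, count:long");
console.log(parsed);
```

Under these assumptions, each row carries a text label and a numeric value, which is the kind of data a bar graph needs.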

Enjoy!

This article was written by David Zhang.