
This tutorial shows how to use Azure HDInsight to run a MapReduce program that counts word occurrences in a text. The text file analyzed here is the Project Gutenberg eBook edition of The Notebooks of Leonardo Da Vinci. The Java MapReduce program outputs the total number of occurrences of each word in a text file.


This wiki topic is obsolete.

The wiki topics on Windows Azure HDInsight Service are no longer updated by Microsoft. We moved the content to where we keep it current. This topic can be found at Getting Started with Windows Azure HDInsight Service.

MapReduce Programs

The Hadoop MapReduce program reads the text file and counts how often each word occurs. The output is a new text file in which each line contains a word and the number of times that word occurred in the document, written as a tab-separated key/value pair. The process has two stages. The mapper (cat.exe in this sample) takes each line of the input text and breaks it into words. Each time a word occurs, it emits a key/value pair consisting of the word followed by a 1. The reducer (wc.exe in this sample) then sums these individual counts for each word and emits a single key/value pair that contains the word followed by the sum of its occurrences.
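The two stages described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, group, and reduce flow; it does not use the Hadoop API and is not the code behind cat.exe or wc.exe. The function names (map_line, reduce_counts, word_count, format_output) are hypothetical, and splitting words on whitespace is an assumption made for the example.

```python
from collections import defaultdict


def map_line(line):
    """Mapper: break one input line into words and emit (word, 1) per word."""
    # Assumption: words are separated by whitespace; punctuation is not stripped.
    return [(word, 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reducer: sum the individual counts emitted for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run the map stage, group the pairs by key, then run the reduce stage."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_line(line):
            # Stand-in for the shuffle step that groups values by key.
            grouped[word].append(one)
    # Sort by word so the output order is deterministic.
    return [reduce_counts(word, counts) for word, counts in sorted(grouped.items())]


def format_output(pairs):
    """Render the result as tab-separated key/value lines, one word per line."""
    return "\n".join(f"{word}\t{count}" for word, count in pairs)


if __name__ == "__main__":
    sample = ["the cat", "the dog"]
    print(format_output(word_count(sample)))
```

In a real Hadoop job the grouping by key is done by the framework between the map and reduce stages; the dictionary here only stands in for that step.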