Overview

This tutorial illustrates how to use Hadoop on Azure to do cluster analysis with Mahout.

The various forms of cluster analysis address the following problem: given a collection of objects, each with values for a set of properties, devise a scheme for grouping them so that similar objects are placed in the same class.
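
To make the grouping problem concrete, the following is a minimal sketch of one common approach, k-means clustering, written in plain Python. It is an illustration only and is not Mahout's implementation; the function names (squared_distance, mean_vector, kmeans), the parameters, and the toy data are all hypothetical and chosen for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption for this sketch: start from k distinct input points.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return centroids, labels


if __name__ == "__main__":
    # Toy data: two visibly separate groups of 2-D points.
    data = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
            [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]]
    centroids, labels = kmeans(data, k=2)
    print(labels)
```

Points that receive the same label are the "similar ones" placed in the same class. Which group gets which label number depends on the random starting centroids.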

The sample used in this tutorial is derived from the clustering sample on Mahout's website.

This sample requires you to use the Remote Desktop in Hadoop on Azure to run Mahout jobs on the head node of the cluster.

Goals

In this tutorial you see two things:

  1. How to use the Remote Desktop in Hadoop on Azure to access the head node of the Hadoop cluster.

  2. How to run Mahout cluster analysis from Hadoop on Azure.

Key Technologies

Setup and Configuration

You must have an account to access Hadoop on Azure and have created a cluster to work through this tutorial. To obtain an account and create a Hadoop cluster, follow the instructions outlined in the Getting started with Microsoft Hadoop on Azure section of the Introduction to Hadoop on Azure topic.


Tutorial

This tutorial is composed of the following segments:

  1. How to do cluster analysis by using Mahout on Hadoop on Azure.

How to do cluster analysis by using Mahout on Hadoop on Azure

From your Account page, scroll down to the Remote Desktop icon in the Your cluster section and click the icon to open the head node in your cluster.

Select Open when prompted to open the .rdp file.

Select Connect in the Remote Desktop Connection window.

Enter your credentials for the Hadoop cluster (not your Hadoop on Azure account) into the Windows Security window and select OK.

Double-click the Hadoop Command Shell icon in the upper-left corner of the Desktop to open it, and then change the directory to c:\apps\dist\mahout\examples\bin\work. Run the dir command to check that the synthetic_control.data file, which contains the data that you will analyze, is present.
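
Beyond confirming that the file exists, it can be useful to check that it parses as numeric data. The following is a hedged sketch in Python for doing so on any machine that has Python and a copy of the file. It assumes the file is plain text with whitespace-separated numbers, one record per line; that format, and the helper names load_rows and summarize, are assumptions of this example and are not stated in this tutorial.

```python
from pathlib import Path


def load_rows(path):
    """Parse a text file of whitespace-separated numbers into a list of rows."""
    rows = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            rows.append([float(value) for value in fields])
    return rows


def summarize(rows):
    """Return (row count, set of distinct row lengths) as a quick sanity check."""
    return len(rows), {len(row) for row in rows}
```

For example, summarize(load_rows(r"c:\apps\dist\mahout\examples\bin\work\synthetic_control.data")) returns the number of records and the set of record lengths. More than one distinct length would suggest a malformed file.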

To launch the Mahout cluster analysis on this data, go to the c:\apps\dist\mahout\examples\bin folder and run the following command:

build-cluster-syntheticcontrol.cmd

When prompted by the driver script, type the number of the clustering algorithm that you want to run and press Enter. Status is reported as the job progresses. The other forms of cluster analysis can be run in the same way.

Once the job has completed, verify that the results are in the HDFS output directories by using the list command: hadoop fs -lsr output.

To download the output of the cluster analysis from HDFS to the local file system on the head node of the cluster for further processing and visualization, use the following command:

hadoop fs -get output c:\apps\dist\mahout\examples


Summary

In this tutorial, you have seen how to run a Mahout cluster analysis from Hadoop on Azure by using the Remote Desktop.