HDInsight Services For Windows

This article is the main portal for technical information about HDInsight Services for Windows and related Microsoft technologies. It provides a brief overview of Apache Hadoop, as well as information on the HDInsight Services provided by Microsoft for deployment on both Windows and Windows Azure. It also provides links to more detailed technical content in various formats.

Note: Contributions are welcome and appreciated: Please feel free to update this and other articles on this Wiki, and to add links to relevant content both from within and outside Microsoft.

Table of Contents

Topics | Content Types
Hadoop Overview | Orientation
Learning Apache Hadoop | Tutorials
Getting Started with HDInsight Services on Windows | Tutorials
Getting Started with HDInsight Services for Windows Azure | Tutorials
Samples on the HDInsight Services for Windows Azure Dashboard | Samples
Developing with Hadoop | Tutorials
Using HDInsight Services with other BI Technologies | HowTos
How To | HowTos
Code Examples | Samples
Videos | Videos
Audio | Audio
Books | Books
Hadoop on Windows and on Windows Azure Best Practices | Guidance

 

Hadoop Overview

Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable, distributed data store, and MapReduce, a parallel, distributed processing system. A Hadoop cluster can range from a single node to thousands of nodes.

HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster. This enables reliable and extremely rapid computations.
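The splitting and replication described above can be illustrated with a small in-memory sketch. It is not how HDFS is implemented: the block size, the node names, and the round-robin placement are assumptions made for illustration (real HDFS placement also considers rack topology and free space), and the replication factor of 3 simply mirrors the usual HDFS default.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes in round-robin order.

    Toy placement only: this is an illustrative policy, not the HDFS one.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


data = b"x" * 1000                       # stand-in for a file being loaded
blocks = split_into_blocks(data, block_size=256)
nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster nodes
placement = place_replicas(len(blocks), nodes)

for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves at least one readable copy of each block, which is the property that makes the storage reliable.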

Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data set into independent chunks, which are processed by map tasks running across the nodes of the Hadoop cluster in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
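The map, sort, and reduce flow described above can be sketched in a few lines of plain Python. This is a single-process simulation for illustration only, not the Hadoop API: the function names and the sample chunks are made up, and scheduling, monitoring, and retries are left out entirely.

```python
from collections import defaultdict


def map_phase(chunk: str):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in key order.

    This stands in for the sorting the framework does between map and reduce.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce task: collapse all values for one key into a single total."""
    return key, sum(values)


# Each string plays the role of one independent input chunk.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]
print(result)
```

Each call to `map_phase` depends only on its own chunk, which is why the map tasks can run on different nodes at the same time.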

Some of the main advantages of Hadoop are that it can process vast amounts of data (hundreds of terabytes or even petabytes) quickly and efficiently, handle both structured and unstructured data, perform the processing where the data is located rather than moving the data to a separate processing location, and detect and handle failures by design.

Two other key Apache technologies are frequently used with Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language allows MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
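As a rough illustration of what "projecting structure onto data" means, the sketch below applies a schema to raw tab-delimited lines only at read time and then runs a grouped aggregation. It is plain Python, not Hive: the table name `page_visits`, the columns, and the sample rows are all hypothetical, and the HiveQL statement in the docstring is only the approximate query this code mimics.

```python
from collections import defaultdict

# Raw, tab-delimited lines as they might sit in a file; no schema is stored with them.
raw_lines = [
    "2012-03-01\tseattle\t12",
    "2012-03-01\tboston\t7",
    "2012-03-02\tseattle\t5",
]

# Hypothetical schema (column name, type), applied only when the data is read.
schema = [("day", str), ("city", str), ("visits", int)]


def project(line: str) -> dict:
    """Turn one raw line into a typed record by applying the schema at read time."""
    fields = line.split("\t")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_visits_by_city(lines) -> dict:
    """Rough equivalent of: SELECT city, SUM(visits) FROM page_visits GROUP BY city."""
    totals = defaultdict(int)
    for record in map(project, lines):
        totals[record["city"]] += record["visits"]
    return dict(totals)


print(total_visits_by_city(raw_lines))
```

The stored lines never change; a different schema could be projected onto the same file later, which is the flexibility the paragraph above refers to.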

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

For more details on Apache Hadoop, see http://hadoop.apache.org/.

Learning Apache Hadoop

This section contains links to resources useful in learning Hadoop, such as installation, configuration, and basic how-to information.

Link | Description
Apache Hadoop | The Apache Hadoop home page
Introduction to Apache MapReduce and HDFS [Video] | An introduction to Apache MapReduce and HDFS
Hive | A data warehouse system for Hadoop
Introduction to Apache Hive [Video] | An introduction to Apache Hive
Pig | A platform for analyzing large data sets
Introduction to Pig [Video] | An introduction to Apache Pig
Learning resources for Apache Mahout | An introduction to Apache Mahout
Mahout | A scalable machine learning library
How to Contribute | How to contribute to Hadoop Common




Getting Started with HDInsight Services for Windows

The links in this section provide information on deploying and using the Developer Preview of HDInsight Services on Windows.

       
Link | Description
Installing the Developer Preview of HDInsight Services on Windows | How to install the Developer Preview of Hadoop on Windows with the Microsoft Web Platform Installer 4.0.
Getting Started with HDInsight Services for Windows | Tour through the Microsoft HDInsight dashboard and resources for getting started with the Developer Preview.

       

Getting Started with HDInsight Services for Windows Azure

The links in this section provide information on deploying and using Apache Hadoop on the Microsoft Windows Azure Platform. Instead of setting up and managing a Hadoop cluster on Azure yourself, you can use the HDInsight Services for Windows Azure dashboard that Microsoft has made available at hadooponazure.com. This is a preview of the HDInsight Services for Windows Azure, to which you can submit MapReduce jobs together with the data they process. It lets you process vast amounts of structured and unstructured data without having to set up, configure, maintain, and manage the Hadoop cluster manually.

Link | Description
Deployment of Hadoop-based Services on the Windows Azure Portal | A walkthrough for provisioning and using a temporary HDFS cluster on the Hadoop on Windows Azure Portal.
Introduction to HDInsight Services for Windows Azure | A service that deploys and provisions clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data.
HDInsight Services for Windows Azure QuickStart: Running Hadoop Jobs | This tutorial shows two ways to run MapReduce programs in a cluster by using Apache™ Hadoop™-based Services for Windows Azure.
Working With Data in HDInsight Services for Windows Azure | Outlines several techniques for importing and storing data for use in Hadoop jobs run with Hadoop-based Services for Windows Azure.
Analyzing Twitter Movie Data with Hive in HDInsight Services for Windows Azure | In this tutorial you will query, explore, and analyze data from Twitter using Apache™ Hadoop™-based Services for Windows Azure and a Hive query in Excel. Social web sites are one of the major driving forces for Big Data adoption.
Simple recommendation engine using Apache Mahout | In this tutorial you use the Million Song Dataset to create song recommendations for users based on their past listening habits.
A Lap Around HDInsight | An end-to-end introduction to HDInsight, Map/Reduce, Pig, and Hive.

 

Samples on the HDInsight Services for Windows Azure Dashboard

This section contains links to the tutorials for the samples that are on the Hadoop on Windows Azure Portal.

Link | Description
The Hadoop on Azure Pi Estimator Sample Tutorial | This tutorial shows how to deploy a MapReduce program with Hadoop on Windows Azure that uses a statistical (quasi-Monte Carlo) method to estimate the value of Pi.
The Hadoop on Azure 10-GB Graysort Sample Tutorial | This tutorial shows how to run a general purpose GraySort on a 10 GB file using Hadoop on Windows Azure.
The Hadoop on Azure C# Streaming Sample Tutorial | This tutorial shows how to use C# programs with the Hadoop streaming interface.
The Hadoop on Azure Mahout Classification Sample | This tutorial illustrates how to use Apache Mahout in Hadoop on Windows Azure to do classification.
The Hadoop on Azure Mahout Clustering Sample | This tutorial illustrates how to use Hadoop on Windows Azure to do cluster analysis with Mahout.
The Hadoop on Azure Pegasus Degree Distribution Sample Tutorial | This tutorial shows how to deploy Pegasus from the Hadoop on Windows Azure portal to compute the degree of each node and the distribution of degrees for a simple 16-node graph.
The Hadoop on Azure Pegasus Page Rank Sample Tutorial | This tutorial shows how to deploy Pegasus from the Hadoop on Windows Azure portal to compute the page rank for a simple 16-node graph.
The Hadoop on Azure Sqoop Import Sample Tutorial | This tutorial shows how to use Sqoop to import data from a SQL database on Windows Azure to a Hadoop on Windows Azure HDFS cluster.
The Hadoop on Azure Wordcount Sample Tutorial | This tutorial shows two ways to use Hadoop on Windows Azure to run a MapReduce program that counts word occurrences in a text.
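
To give a feel for the quasi-Monte Carlo idea behind the Pi Estimator sample, here is a minimal single-process sketch. It is not the sample's code: it assumes a Halton low-discrepancy sequence with bases 2 and 3 as the point generator, and the function names, task count, and points per task are illustrative choices. Points are placed in the unit square, and the fraction that falls inside the inscribed circle approximates Pi/4.

```python
import math


def halton(index: int, base: int) -> float:
    """Return element `index` (1-based) of the Halton sequence for `base`."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def count_inside(start: int, count: int) -> int:
    """Map-task analogue: count points of one index range inside the inscribed circle."""
    inside = 0
    for i in range(start, start + count):
        # Shift the point so the circle of radius 0.5 is centred on the origin.
        x, y = halton(i, 2) - 0.5, halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    return inside


def estimate_pi(num_tasks: int, points_per_task: int) -> float:
    """Reduce-step analogue: sum the per-task counts and scale the ratio by 4."""
    total = num_tasks * points_per_task
    inside = sum(
        count_inside(1 + task * points_per_task, points_per_task)
        for task in range(num_tasks)
    )
    return 4.0 * inside / total


estimate = estimate_pi(num_tasks=10, points_per_task=10_000)
print(estimate, abs(estimate - math.pi))
```

Each task works on its own range of sequence indices, so the counting can be spread across map tasks and only the small per-task totals need to be combined at the end.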

 

Developing with Hadoop

This section contains information on developing solutions using Hadoop.

Link | Description
Yahoo! Hadoop tutorial | A tutorial on using Hadoop 0.18.0
Map Reduce example | A tutorial on using Map/Reduce
Hadoop Streaming | Hadoop Wiki page on the Streaming utility
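
Hadoop Streaming lets any program that reads lines from standard input and writes lines to standard output act as a mapper or reducer, with the key and value on each line separated by a tab by default. The sketch below imitates that line protocol in one Python process. The log format and sample lines are hypothetical, and `sorted()` stands in for the sort Hadoop performs between the two steps; in a real streaming job the two functions would live in separate scripts that read `sys.stdin`.

```python
from itertools import groupby


def mapper(lines):
    """Streaming-style mapper: read raw text lines, emit 'key<TAB>value' lines."""
    for line in lines:
        parts = line.split()
        if parts:
            # Hypothetical log format: the first token is the log level.
            yield f"{parts[0]}\t1"


def reducer(sorted_lines):
    """Streaming-style reducer: input arrives sorted by key, so equal keys are adjacent."""
    keyed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


log_lines = [
    "ERROR disk full",
    "INFO started",
    "ERROR timeout",
    "INFO stopped",
    "WARN slow",
]

# sorted() stands in for the sort that happens between the map and reduce steps.
output = list(reducer(sorted(mapper(log_lines))))
print(output)
```

Because the contract is just text lines on standard streams, the same structure carries over to other languages, such as the C# programs covered by the streaming sample listed earlier.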

 

Using HDInsight Services with other BI Technologies

This section contains information on using Hadoop with other BI technologies.

Link | Description
How to Connect Excel to Hadoop on Azure via HiveODBC | Explains how to use Excel 2010 to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver.
How to Connect Excel PowerPivot to Hive on Azure via HiveODBC | Explains how to use PowerPivot to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver.

Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS)

With the explosion of data, the open source Apache™ Hadoop™ framework is gaining traction thanks to the large ecosystem that has grown up around its core functionality, the Hadoop Distributed File System (HDFS™) and Hadoop MapReduce. Being able to have SQL Server work with Hadoop™ is increasingly important because the two are complementary. For instance, while petabytes of unstructured data can be stored in Hadoop and take hours to query, terabytes of structured data can be stored in the SQL Server platform and queried in seconds. This leads to the need to transfer data between Hadoop and SQL Server.

 

How To

This section contains a list of Hadoop-related how-to articles.

Link | Description
Hadoop-based Services on Windows Azure How-Tos and FAQs | A collection of common How To topics along with FAQs.
How to Contribute | How to contribute to Hadoop Common
How to count the number of lines in a file | An example of counting the number of lines in a file using Map Reduce
How to get distinct values | An example of getting distinct values/lines using Map Reduce
Avkash Chauhan's Blog | Information related to Hadoop-based services on Windows Azure.
How to Run a Job on a Provisioned Hadoop on Windows Azure Cluster | Information about creating Map Reduce jobs on a cluster that has been provisioned on the Hadoop on Windows Azure Portal
Use SQL Azure database as a Hive metastore | Information about using SQL Azure database as a Hive metastore

 

Code Examples

This section contains a list of Hadoop-related examples.

Link | Description
Yahoo! Hadoop tutorial | A tutorial on using Hadoop 0.18.0
Map Reduce example | A tutorial on using Map/Reduce
How to count the number of lines in a file | An example of counting the number of lines in a file using Map Reduce
How to get distinct values | An example of getting distinct values/lines using Map Reduce
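
The two small jobs named in the list above, counting lines and finding distinct values, differ only in which key the mapper emits. The following sketch shows that with a tiny in-memory stand-in for a MapReduce run. It is not taken from the linked articles: `run_job` and the mapper and reducer names are hypothetical helpers written for this illustration.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce run: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: count lines. Every line maps to the same key, so one reduce call sees them all.
def count_mapper(line):
    yield "lines", 1


def count_reducer(key, values):
    return key, sum(values)


# Job 2: distinct values. The line itself is the key, so grouping removes duplicates.
def distinct_mapper(line):
    yield line, None


def distinct_reducer(key, values):
    return key


lines = ["apple", "pear", "apple", "fig", "pear"]
print(run_job(lines, count_mapper, count_reducer))
print(run_job(lines, distinct_mapper, distinct_reducer))
```

In the distinct-values job the reducer ignores its values entirely; the grouping step has already done the work by bringing identical lines together under one key.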

 

Videos

This section contains a list of Hadoop-related videos.

Link | Description
Introduction to Interactive JavaScript Console | Learn how to use the JS console with your Hadoop cluster.
Introduction to Interactive Hive Console | Learn how to use the Hive console with your Hadoop cluster.
Use Excel Hive Add-in to Access Hive on Windows Azure | Use the Add-in to import data from Hive on Windows Azure.
Use PowerPivot to Access Hive on Windows Azure | Use Excel PowerPivot to access data from Hive on Windows Azure.
Introduction to Apache Hive | An introduction to Apache Hive
Introduction to Pig | An introduction to Apache Pig
Uploading Data and the WordCount Sample | Upload data to Azure cluster and then run the WordCount sample
Pi Sample | Run the Pi Estimator Sample
Import from Azure Marketplace | Import data from Marketplace into Hadoop Services for Windows Azure
10GB GraySort Sample - Generate Data | Introduction to the GraySort benchmark and generating test data
10GB GraySort Sample - Sort Data | Running the MapReduce job to sort your data
10GB GraySort Sample - Validate Data | After sorting the data, validate that the operation worked
PowerView, PowerPivot, Hadoop, and Hive | Use PowerView to connect to a Hive sample table in PowerPivot

 

Audio

This section contains a list of Hadoop-related audio recordings.

Link | Description
.NET Rocks (podcast) episode discussing Hadoop on Azure | .NET Rocks episode 755 (March 2012) with general discussion of Hadoop on Azure.



 

Books

This section contains a list of Hadoop-related books.

Link | Description
Hadoop: The Definitive Guide, 3rd Edition by Tom White (May 26, 2012) | A comprehensive guide to building and maintaining reliable, scalable, distributed systems with Apache Hadoop.

 

 

Hadoop on Windows and on Windows Azure Best Practices

Microsoft plans to provide guidance on best practices in the future. If you have best practices guidance that you'd like to share, please feel free to provide a link to it here.

Some suggestions: it would be great to list best practices around:

  1. How to get big data sets into Windows Azure.
  2. Understanding how the costs work, so that the process can be cost-optimized.

 
