Hadoop-based Services For Windows (en-US)

This article is the main portal for technical information about Hadoop-based services for Windows and related Microsoft technologies. It provides a brief overview of Hadoop, as well as overview information for the Hadoop-based services provided by Microsoft. It also provides links to more detailed technical content in various formats.

Note: Contributions are welcome and appreciated. Please feel free to update this and other articles on this wiki, and to add links to relevant content from both inside and outside Microsoft.

Table of Contents

Topics
  Hadoop Overview
  Hadoop on Windows Overview
    Hadoop-based services on Windows Server
    Hadoop-based services on Windows Azure
    Elastic Map Reduce on Windows Azure
  Learning Hadoop
    General
    Getting Started with Hadoop-based services for Windows
    Hadoop Best Practices
    Managing Hadoop
    Developing with Hadoop
    Using Hadoop with other BI Technologies

Content Types
  How To
  Code Examples
  Videos
  Audio

 

Hadoop Overview

Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable distributed data store, and MapReduce, a parallel, distributed processing system. A Hadoop cluster can consist of a single node or of thousands of nodes.

HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster. This enables reliable and extremely rapid computations.
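The splitting and replication described above can be illustrated with a small sketch. This is a toy model, not HDFS code: the block size, node names, and round-robin placement are assumptions chosen for readability, whereas real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simple round-robin placement, used here only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


data = b"x" * 10                                 # stand-in for a loaded file
blocks = split_into_blocks(data, block_size=4)   # toy block size
nodes = ["node1", "node2", "node3", "node4"]     # hypothetical node names
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several nodes, losing one node does not lose data, and a task can be scheduled on any node that already holds the block it needs.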

Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data set into independent chunks. These chunks are processed by map tasks that run across the nodes of the Hadoop cluster in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
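The map, sort, and reduce flow described above can be sketched in a single process. This is an illustration of the programming model only, assuming a word-count job; it does not use the Hadoop API, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(chunk: str):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.split():
        yield word.lower(), 1


def shuffle_and_sort(pairs):
    """Stand-in for the framework: group values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce task: sum the counts collected for one key."""
    return key, sum(values)


def run_job(chunks):
    """Run every map task, sort the map output, then run the reduce tasks."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


# Each string stands for one independent chunk of the input data set.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
result = run_job(chunks)
print(result)
```

On a real cluster each call to `map_phase` would run as a separate task on the node that holds the chunk, and the sorting step would be carried out by the framework between the map and reduce tasks.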

Some of the main advantages of Hadoop are that it can process vast amounts of data, from hundreds of terabytes to petabytes, quickly and efficiently; process both structured and unstructured data; perform the processing where the data is, rather than moving the data to the processing; and detect and handle failures by design.

Two other technologies are closely related to Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. This language also allows MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
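One way to plug custom logic into a Hive query is the HiveQL `TRANSFORM ... USING` clause, which streams rows to an external script as tab-separated lines and reads tab-separated lines back. The sketch below shows what such a script might look like. The table name, column names, script name, and the host-extraction logic are all hypothetical and serve only to illustrate the row format.

```python
import io

# Hypothetical HiveQL that would call this script (all names are made up):
#   SELECT TRANSFORM (user_id, url)
#   USING 'python extract_host.py'
#   AS (user_id, host)
#   FROM page_views;


def transform_row(line: str) -> str:
    """Turn one tab-separated input row into one tab-separated output row."""
    user_id, url = line.rstrip("\n").split("\t")
    # Illustrative logic: keep only the host part of the URL.
    host = url.split("//", 1)[-1].split("/", 1)[0]
    return f"{user_id}\t{host}"


def run(rows, out) -> None:
    """Read tab-separated rows and write one transformed row per input row."""
    for line in rows:
        if line.strip():
            out.write(transform_row(line) + "\n")


# Local demonstration with in-memory data instead of a Hive query.
sample = io.StringIO("42\thttp://example.com/a/b\n7\thttps://contoso.com/\n")
result = io.StringIO()
run(sample, result)
print(result.getvalue(), end="")
```

When used from Hive, the script would call `run(sys.stdin, sys.stdout)` instead of working on the in-memory sample.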

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

For more details on Apache Hadoop, see http://hadoop.apache.org/.

Hadoop-based Services on Windows Overview

Hadoop-based services for Windows are currently available through a limited Community Technology Preview (CTP). If you are interested in participating in the CTP, please fill out the form at https://connect.microsoft.com/SQLServer/Survey/Survey.aspx?SurveyID=13697.

For details on getting started using the Hadoop-based services for Windows, see Getting Started with Hadoop-based Services on Microsoft Windows.

Hadoop-based Server for Microsoft Windows

You can use Hadoop-based Server for Windows to easily set up a Hadoop cluster on Windows servers in your enterprise (on-premises). The GUI-based deployment tool lets you specify the name node and worker nodes for the cluster, pings the nodes to ensure that they are reachable, deploys Hadoop-based Server for Windows onto each node of the cluster, and reports the results back to you. Alternatively, you can deploy Hadoop-based Server for Windows to a cluster of Windows servers using a command-line deployment tool.

When you install the Hadoop-based Server for Windows on a Windows Server computer, the binaries are installed into the c:\apps\dist folder. The bin subfolder contains the binaries and command files that you will use frequently with the Hadoop cluster. Hadoop components such as Name Node, Job Tracker and Task Tracker are installed as Windows services.

You mainly use the Hadoop.cmd file in the bin folder for tasks such as working with the Hadoop Distributed File System (HDFS) and running MapReduce jobs. You use the Hive command-line program (Hive.cmd) to execute commands interactively against the Hive data warehouse, and Pig.cmd to execute Pig scripts on the cluster.
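As a rough sketch of what typical Hadoop.cmd invocations look like, the following builds a few command lines using the standard `fs` and `jar` subcommands of the Hadoop command-line tool. The full path to Hadoop.cmd is inferred from the install folder mentioned above, and the local file, HDFS directories, and JAR file name are hypothetical placeholders.

```python
# Assumption: the bin folder sits directly under the install folder c:\apps\dist.
HADOOP_CMD = r"c:\apps\dist\bin\hadoop.cmd"


def hdfs_command(*args: str) -> list:
    """Build the argument list for an HDFS file system operation."""
    return [HADOOP_CMD, "fs", *args]


def jar_command(jar: str, *args: str) -> list:
    """Build the argument list for running a MapReduce job packaged as a JAR."""
    return [HADOOP_CMD, "jar", jar, *args]


# Hypothetical paths and JAR name, for illustration only.
commands = [
    hdfs_command("-mkdir", "/example/input"),
    hdfs_command("-put", r"c:\data\sample.txt", "/example/input"),
    jar_command("hadoop-examples.jar", "wordcount", "/example/input", "/example/output"),
    hdfs_command("-ls", "/example/output"),
]

for cmd in commands:
    print(" ".join(cmd))
    # To execute on a cluster node: subprocess.run(cmd, check=True)
```

The script only prints the command lines, so it can be run anywhere; on a cluster node the same argument lists could be passed to `subprocess.run`.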

This distribution includes an ODBC driver for Hive and an Excel add-in that uses the driver to access data in the Hive data warehouse. You can also use this driver to access Hive data from an Excel PowerPivot window.

Hadoop-based services on Windows Azure

You can install the Hadoop-based services for Windows Azure to set up a private Hadoop cluster on Windows Azure. The administration/deployment tool in the package lets you deploy your own Hadoop cluster on Azure very quickly; you just need to update the configuration file for the tool with your Azure subscription ID, a storage account within the subscription, and the management certificate thumbprint before running the installation.

The administration tool creates a hosted service under your subscription and deploys a Hadoop cluster with four nodes: one head node and three worker nodes. The head node is preconfigured to allow RDP (Remote Desktop Protocol) access. You can use the same administration tool to access the head node via RDP, and then launch MapReduce jobs from a shell as you do with an on-premises Hadoop cluster.

This tool also allows you to monitor the status of the Hadoop cluster, resize the cluster (that is, increase the number of nodes), shut down the cluster and delete the hosted service, and purge an existing cluster and redeploy a new one.

To get started deploying a Hadoop cluster to a Windows Azure subscription, see Windows Azure Deployment of Hadoop-based Services.

Elastic Map Reduce on Windows Azure

Instead of setting up and managing a Hadoop cluster on Azure yourself, you can use Microsoft’s hosted Elastic Map Reduce service on Azure. You submit MapReduce jobs to the service along with the data to be processed. This lets you process vast amounts of structured and unstructured data easily and cost-effectively, without having to set up, configure, maintain, and manage the Hadoop cluster.

To get started provisioning a Hadoop cluster on the Elastic Map Reduce (EMR) Portal, see Getting Started using Hadoop-based Services on the Elastic Map Reduce Portal.

Learning Hadoop

This section contains links to resources useful in learning Hadoop, such as installation, configuration, and basic how-to information.

General

Link | Description
Apache Hadoop | The Apache Hadoop home page
Hive | A data warehouse system for Hadoop
Introduction to Apache Hive [Video] | An introduction to Apache Hive
Pig | A platform for analyzing large data sets
Introduction to Pig [Video] | An introduction to Apache Pig
Hadoop quickstart | Steps on getting Hadoop up and running
How to Contribute | How to contribute to Hadoop Common

 

Getting Started with Hadoop-based Services for Windows

The links in this section provide information on deploying Apache Hadoop to Microsoft Windows Platforms.

Link | Description
Getting Started with Hadoop-based Services for Windows | An overview of the Getting Started guides currently available.
Getting Started Deploying Hadoop-based Services for an On-Premise Hadoop Cluster | A walkthrough for deploying Hadoop to a set of servers that you manage.
Getting Started Deploying Hadoop-Based Services for a cluster on Windows Azure | A walkthrough for deploying a Hadoop cluster on compute instances with your Windows Azure subscription.
Getting Started a Hadoop cluster on the Elastic Map Reduce Portal | A walkthrough for provisioning and using a temporary Hadoop cluster on the Elastic Map Reduce (EMR) Portal.

 

Hadoop Best Practices

Microsoft is planning on providing guidance on best practices in the future. If you have best practices guidance that you'd like to share, please feel free to provide a link to it here.

(Suggestion) It would be great to list some best practices for (a) getting big data sets into Windows Azure, and (b) understanding how the costs work, so that the process can be cost-optimized.

Managing Hadoop

This section contains information specific to the management of Hadoop.

Link | Description
Hadoop On Demand user’s guide | Hadoop On Demand is a system for provisioning virtual Hadoop clusters

 

Developing with Hadoop

This section contains information on developing solutions using Hadoop.

Link | Description
Yahoo! Hadoop tutorial | A tutorial on using Hadoop 0.18.0
Map Reduce example | A tutorial on using Map/Reduce
Hadoop Streaming | Hadoop Wiki page on the Streaming utility

 

Using Hadoop with other BI Technologies

This section contains information on using Hadoop with other BI technologies.

Link | Description
How to Connect Excel to Hadoop on Azure via HiveODBC | Explains how to use Excel 2010 to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver.
How to Connect Excel PowerPivot to Hive on Azure via HiveODBC | Explains how to use PowerPivot to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver.

 

How To

This section contains a list of Hadoop-related how-to articles.

Link | Description
Hadoop-based Services on Windows Azure How-Tos and FAQs | A collection of common How To topics along with FAQs.
How to Contribute | How to contribute to Hadoop Common
How to count the number of lines in a file | An example of counting the number of lines in a file using Map Reduce
How to get distinct values | An example of getting distinct values/lines using Map Reduce
Avkash Chauhan's Blog | Information related to Hadoop-based services on Windows Azure.
How to Run a Job on a Provisioned Hadoop on Windows Azure Cluster | Information about creating Map Reduce jobs on a cluster that has been provisioned on the Elastic Map Reduce (EMR) portal
How To FTP Data to Hadoop on Windows Azure | A walkthrough for using FTPS to send file data to the cluster
How to create a mapper and reducer in C# (Hadoop Streaming) | A walkthrough for creating a mapper and reducer in C# using Hadoop Streaming
Use SQL Azure database as a Hive metastore | Information about using SQL Azure database as a Hive metastore

 

Code Examples

This section contains a list of Hadoop-related examples.

Link | Description
Yahoo! Hadoop tutorial | A tutorial on using Hadoop 0.18.0
Map Reduce example | A tutorial on using Map/Reduce
How to count the number of lines in a file | An example of counting the number of lines in a file using Map Reduce
How to get distinct values | An example of getting distinct values/lines using Map Reduce
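
To give a flavor of the last two examples, here is a minimal sketch of a line-count job and a distinct-values job, written in the Hadoop Streaming style in which mappers and reducers exchange tab-separated key/value lines. It is not the code from the linked articles. The function names are made up, the `simulate` helper is a local stand-in for the framework's sort step, and input lines are assumed to contain no tab characters.

```python
from itertools import groupby


def key_of(pair: str) -> str:
    """Return the key part of a tab-separated key/value line."""
    return pair.split("\t", 1)[0]


def line_count_mapper(lines):
    """Emit the same key for every line so that one reducer can total them."""
    for _ in lines:
        yield "lines\t1"


def distinct_mapper(lines):
    """Emit each line as a key; the value is left empty because it is unused."""
    for line in lines:
        yield line.rstrip("\n") + "\t"


def sum_reducer(sorted_pairs):
    """Sum the values for each key; the input must already be sorted by key."""
    for key, group in groupby(sorted_pairs, key=key_of):
        total = sum(int(pair.split("\t", 1)[1]) for pair in group)
        yield f"{key}\t{total}"


def distinct_reducer(sorted_pairs):
    """Emit each key once; the input must already be sorted by key."""
    for key, _ in groupby(sorted_pairs, key=key_of):
        yield key


def simulate(mapper, reducer, lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines), key=key_of)))


lines = ["b\n", "a\n", "b\n", "c\n", "a\n"]
line_count = simulate(line_count_mapper, sum_reducer, lines)
distinct = simulate(distinct_mapper, distinct_reducer, lines)
print(line_count)
print(distinct)
```

In an actual Streaming job the mapper and reducer would be separate programs that read from standard input and write to standard output, and the framework would perform the sort between them.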

 

Videos

This section contains a list of Hadoop-related videos.

Link | Description
Introduction to Interactive JavaScript Console | Learn how to use the JS console with your Hadoop cluster.
Introduction to Interactive Hive Console | Learn how to use the Hive console with your Hadoop cluster.
Use Excel Hive Add-in to Access Hive on Windows Azure | Use the Add-in to import data from Hive on Windows Azure.
Use PowerPivot to Access Hive on Windows Azure | Use Excel PowerPivot to access data from Hive on Windows Azure.
Introduction to Apache Hive | An introduction to Apache Hive
Introduction to Pig | An introduction to Apache Pig
Uploading Data and the WordCount Sample | Upload data to Azure cluster and then run the WordCount sample
Pi Sample | Run the Pi Estimator Sample
Import from Azure Marketplace | Import data from Marketplace into Hadoop Services for Windows Azure
10GB GraySort Sample - Generate Data | Introduction to the GraySort benchmark and generating test data
10GB GraySort Sample - Sort Data | Running the MapReduce job to sort your data
10GB GraySort Sample - Validate Data | After sorting the data, validate that the operation worked
PowerView, PowerPivot, Hadoop, and Hive | Use PowerView to connect to a Hive sample table in PowerPivot
Back Up Hadoop HDFS Metadata to Azure | Take periodic snapshots of your data and upload to an Azure storage account
Restore a Cluster from a Backup | Use a backup to get your cluster up and running again

 

Audio

Please feel free to add links to relevant audio under this section.
