This article is the main portal for technical information about Hadoop-based services for Windows and related Microsoft technologies. It provides a brief overview of Hadoop, as well as overview information for the Hadoop-based services provided by Microsoft. It also provides links to more detailed technical content in various formats.
Note: Contributions are welcome and appreciated. Please feel free to update this and other articles on this wiki, and to add links to relevant content from both within and outside Microsoft.
Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable, distributed data store, and MapReduce, a parallel, distributed processing system. A Hadoop cluster can consist of a single node or of thousands of nodes.
HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster, which enables reliable storage and fast, parallel computation.
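As a rough illustration of the splitting and replication described above, the following Python sketch cuts a byte string into fixed-size blocks and assigns each block to several nodes in round-robin order. The block size, replication factor, node names, and placement policy are all assumptions chosen for illustration; real HDFS uses far larger blocks and its own placement policy, and none of this is HDFS code.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block index to `replication` distinct nodes, round-robin.

    Toy policy for illustration only; this is not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement: Dict[int, List[str]] = {}
    for block_id in range(num_blocks):
        # Consecutive offsets modulo the node count give distinct nodes.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
    for block_id, holders in place_replicas(len(blocks), nodes, replication=3).items():
        print(block_id, blocks[block_id], holders)
```

Because every block lives on several nodes, losing one node in this model still leaves at least two copies of each block available.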
Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data set into independent chunks. These chunks are processed by the map tasks, which run across the nodes of the Hadoop cluster in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
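The map, sort, and reduce flow described above can be sketched in a few lines of single-process Python. This is only a model of the data flow using a word count as the example job; the function names are hypothetical, and it does not use the Hadoop API or run anything in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group the map outputs by key and sort the keys, as the framework does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce task: sum all the counts collected for one word."""
    return key, sum(values)


def run_job(chunks: List[str]) -> Dict[str, int]:
    """Run the three phases in order over a list of independent chunks."""
    mapped: List[Tuple[str, int]] = []
    for chunk in chunks:  # on a cluster, each chunk would go to its own map task
        mapped.extend(map_phase(chunk))
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


if __name__ == "__main__":
    print(run_job(["the quick fox", "the lazy dog the end"]))
```

Each call to `map_phase` depends only on its own chunk, which is what allows the real framework to run map tasks on different nodes at the same time.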
Some of the main advantages of Hadoop are that it can process vast amounts of data (hundreds of terabytes or even petabytes) quickly and efficiently; process both structured and unstructured data; perform the processing where the data is, rather than moving the data to the processing; and detect and handle failures by design.
There are two other technologies that are related to Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
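One way HiveQL lets programmers plug in a custom mapper is its TRANSFORM ... USING clause, which streams rows to an external script as tab-separated lines and reads tab-separated lines back. The Python sketch below shows what such a script could look like. The table name, column names, and script name in the comment are hypothetical, and the domain-extraction logic is only an example of work that is awkward to express in HiveQL.

```python
import sys
from typing import Iterable, Iterator

# Hypothetical HiveQL that would call this script (names are assumptions):
#   ADD FILE extract_domain.py;
#   SELECT TRANSFORM (user_id, url)
#          USING 'python extract_domain.py'
#          AS (user_id, domain)
#   FROM page_views;


def transform_rows(lines: Iterable[str]) -> Iterator[str]:
    """For each tab-separated (user_id, url) row, emit (user_id, domain)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip rows that do not have exactly two columns
        user_id, url = fields
        # Drop an optional scheme such as "http://", then keep the host part.
        domain = url.split("//")[-1].split("/")[0]
        yield f"{user_id}\t{domain}"


def main() -> None:
    """Read rows from standard input and write transformed rows to standard output."""
    for row in transform_rows(sys.stdin):
        sys.stdout.write(row + "\n")
```

To use the file as a streaming script, it would need to call `main()` when executed; the call is left out here so that the functions can be imported and tested without reading standard input.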
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
For more details on Apache Hadoop, see http://hadoop.apache.org/.
Hadoop-based services for Windows is currently available through a limited Community Technology Preview (CTP). If you are interested in participating in the CTP, please fill out the form at https://connect.microsoft.com/SQLServer/Survey/Survey.aspx?SurveyID=13697. For details on getting started using the Hadoop-based services for Windows, see Getting Started with Hadoop-based Services on Microsoft Windows.
You can use Hadoop-based Server for Windows to easily set up a Hadoop cluster on Windows servers in your enterprise (on-premises). The GUI-based deployment tool lets you specify the name node and worker nodes for the cluster, pings the nodes to ensure that they are reachable, deploys the Hadoop-based Server for Windows onto each node of the cluster, and reports the results back to you. Alternatively, you can deploy Hadoop-based Server for Windows to a cluster of Windows servers using a command-line deployment tool.
When you install the Hadoop-based Server for Windows on a Windows Server computer, the binaries are installed into the c:\apps\dist folder. The bin subfolder contains the binaries and command files that you will use frequently with the Hadoop cluster. Hadoop components such as Name Node, Job Tracker and Task Tracker are installed as Windows services.
You mainly use the Hadoop.cmd file in the bin folder to perform tasks such as working with files in the Hadoop Distributed File System (HDFS) and running MapReduce jobs. You use the Hive command-line program (Hive.cmd) to execute commands interactively against the Hive data warehouse, and Pig.cmd to execute Pig scripts on the cluster.
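As a hedged sketch of the kinds of tasks mentioned above, the Python helpers below build argument lists for three standard Hadoop subcommands (`fs -ls`, `fs -put`, and `jar`) without running them. The full path to Hadoop.cmd is an assumption derived from the install folder named in this article, and the helper names, jar name, and HDFS paths in the demo are hypothetical.

```python
import ntpath
from typing import List

# Assumption: Hadoop.cmd sits in the bin subfolder of the install folder
# named in the article. Adjust this path for your own cluster.
HADOOP_CMD = ntpath.join(r"c:\apps\dist", "bin", "hadoop.cmd")


def hdfs_ls(path: str) -> List[str]:
    """Build the command that lists an HDFS directory."""
    return [HADOOP_CMD, "fs", "-ls", path]


def hdfs_put(local_path: str, hdfs_path: str) -> List[str]:
    """Build the command that copies a local file into HDFS."""
    return [HADOOP_CMD, "fs", "-put", local_path, hdfs_path]


def run_jar(jar_path: str, *job_args: str) -> List[str]:
    """Build the command that submits a MapReduce job packaged as a jar."""
    return [HADOOP_CMD, "jar", jar_path, *job_args]


if __name__ == "__main__":
    # Print the commands only; nothing is executed here.
    print(" ".join(hdfs_ls("/")))
    print(" ".join(hdfs_put(r"c:\data\input.txt", "/input/input.txt")))
    print(" ".join(run_jar("examples.jar", "wordcount", "/input", "/output")))
```

On a cluster node, a list returned by one of these helpers could be passed to `subprocess.run(..., check=True)` to actually execute the command.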
This distribution includes an ODBC driver for Hive and an Excel add-in that uses the ODBC driver to access data in the Hive data warehouse. You can also use this driver to access Hive data from an Excel PowerPivot window.
You can install the Hadoop-based services for Windows Azure to set up a private Hadoop cluster on Windows Azure. The administration/deployment tool in the package lets you deploy your own Hadoop cluster on Azure very quickly; you just need to update the configuration file for the tool with your Azure subscription ID, a storage account within the subscription, and the management certificate thumbprint before running the installation.
The administration tool creates a hosted service under your subscription and deploys a Hadoop cluster with four nodes: one head node and three worker nodes. The head node is preconfigured to allow RDP (Remote Desktop Protocol) access. You can use the same administration tool to access the head node via RDP, and then launch MapReduce jobs using a shell as you do with an on-premises Hadoop cluster.
This tool also allows you to monitor the status of the Hadoop cluster, resize the cluster (that is, increase the number of nodes in the cluster), shut down the cluster and delete the hosted service, and purge an existing cluster and redeploy a new one. To get started deploying a Hadoop cluster to a Windows Azure subscription, see Windows Azure Deployment of Hadoop-based Services.
Instead of setting up and managing a Hadoop cluster on Azure yourself, you can use Microsoft’s hosted Elastic Map Reduce Service on Azure, to which you submit MapReduce jobs along with the data they process. It enables you to easily and cost-effectively process vast amounts of structured and unstructured data without having to set up, configure, maintain, and manage the Hadoop cluster. To get started provisioning a Hadoop cluster on the Elastic Map Reduce (EMR) Portal, see Getting Started using Hadoop-based Services on the Elastic Map Reduce Portal.
This section contains links to resources useful in learning Hadoop, such as installation, configuration, and basic how-to information.
The links in this section provide information on deploying Apache Hadoop to Microsoft Windows Platforms.
Microsoft is planning to provide guidance on best practices in the future. If you have best practices guidance that you'd like to share, please feel free to provide a link to it here. Suggestion: it would be great to list some best practices for (a) getting big data sets into Windows Azure, and (b) understanding how the costs work, so that the process can be cost-optimized.
This section contains information specific to the management of Hadoop.
This section contains information on developing solutions using Hadoop.
This section contains information on using Hadoop with other BI technologies.
This section contains a list of Hadoop-related how-to articles.
This section contains a list of Hadoop-related examples.
This section contains a list of Hadoop-related videos.
Please feel free to add links to relevant audio under this section.
Great article! I'm nominating it to be featured. Thanks!
Do we have links to those videos on the bottom of that list?
Congratulations on being featured on the front page of TechNet Wiki!
Thanks for the great collection.
Question: In the Quick Start link at hadoop.apache.org/.../quickstart.html
I am not able to understand how to do this (I am doing it on Windows Server 2008 R2):
I unpacked hadoop-0.20.203.0rc1.tar after downloading it, and then opened the hadoop-env.sh file. It has JAVA_HOME=/usr/lib/j2sdk1.5.
The site above gives the steps below to perform, but I am not seeing anything like hadoop in bin.
"Prepare to Start the Hadoop Cluster
Unpack the downloaded Hadoop distribution. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.
Try the following command:
$ bin/hadoop
This will display the usage documentation for the hadoop script. "
Please help.
I am unable to view the contents when I click the links for "On-Premise Deployment of Hadoop-based Services for Windows" and "Windows Azure Deployment of Hadoop-based Services". Can you please share the links for these? Also, I am not able to see the videos.
Hi Mindtree,
Those videos aren't ready for prime time yet. I've removed the deployment videos for now.