This article is the main portal for technical information about HDInsight Services for Windows and related Microsoft technologies. It provides a brief overview of Apache Hadoop and of the HDInsight Services that Microsoft provides for deployment on both Windows and Windows Azure. It also provides links to more detailed technical content in various formats.
Note: Contributions are welcome and appreciated. Please feel free to update this and other articles on this Wiki, and to add links to relevant content from both within and outside Microsoft.
Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable, distributed data store, and MapReduce, a parallel, distributed processing system. A Hadoop cluster can consist of a single node or of thousands of nodes.
HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster. This enables reliable and rapid computation: replication protects against node failure, and distribution lets many nodes process the data in parallel.
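The splitting and replication described above can be illustrated with a small sketch. This is not HDFS code: the function names, the tiny block size, and the round-robin placement are assumptions chosen for readability, and real HDFS uses much larger blocks and a rack-aware placement policy.

```python
# Toy illustration of block splitting and replica placement.
# All names and values here are hypothetical, not part of any Hadoop API.

def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simple round-robin placement stands in for the
    real, topology-aware HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"abcdefghij", block_size=4)
layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
for block_id, holders in layout.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on several nodes, losing one node in this sketch never removes the only copy of a block, which is the property the replication step is meant to provide.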
Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data set into independent chunks, which are processed by map tasks running across the nodes of the Hadoop cluster in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
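The map, sort, and reduce flow described above can be sketched as a word count that runs in a single Python process. This is only a model of the data flow, not the Hadoop API: the function names are hypothetical, and a real job would run the map and reduce tasks on different nodes, with the framework handling scheduling and retries.

```python
# Single-process model of the MapReduce data flow (word count).
# Function names are illustrative; they are not Hadoop APIs.
from collections import defaultdict


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.split():
        yield word.lower(), 1


def shuffle_and_sort(pairs):
    """Group values by key and sort by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce task: sum the counts collected for one word."""
    return key, sum(values)


def run_job(chunks):
    """Run map over every chunk, then shuffle/sort, then reduce each key."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


counts = run_job(["the quick fox", "the lazy dog", "The end"])
print(counts)
```

Each chunk is mapped independently of the others, which is what allows the real framework to run map tasks in parallel and to re-run a single failed task without restarting the whole job.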
Some of the main advantages of Hadoop are that it can process vast amounts of data (hundreds of terabytes or even petabytes) quickly and efficiently, process both structured and non-structured data, perform the processing where the data is located rather than moving the data to a separate processing location, and detect and handle failures by design.
Two other key Apache technologies are frequently used with Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. At the same time, this language allows MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
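One way custom logic is commonly plugged in is HiveQL's TRANSFORM clause, which streams rows to an external script as tab-separated lines and reads the script's output back as rows. The sketch below shows what such a script might look like. The table name, column names, and script name in the comment are hypothetical, and the exact query should be checked against the Hive documentation for your version.

```python
# Sketch of a streaming script that a HiveQL TRANSFORM clause could call.
# Hypothetical query (table, columns, and file name are assumptions):
#   ADD FILE extract_host.py;
#   SELECT TRANSFORM (user_id, url) USING 'python extract_host.py'
#          AS (user_id, host) FROM page_views;
import io
import sys
from urllib.parse import urlparse


def transform_rows(lines):
    """Turn 'user_id<TAB>url' lines into 'user_id<TAB>host' lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows instead of failing the whole task
        user_id, url = fields
        host = urlparse(url).netloc.lower()
        yield f"{user_id}\t{host}"


def main(stdin, stdout):
    """Read rows from stdin and write transformed rows to stdout."""
    for row in transform_rows(stdin):
        stdout.write(row + "\n")


# Local demonstration with in-memory input. A deployed streaming script
# would instead call main(sys.stdin, sys.stdout).
sample = io.StringIO("u1\thttp://Example.com/a\nu2\thttps://hadoop.apache.org/docs\n")
main(sample, sys.stdout)
```

Keeping the row logic in `transform_rows` separates it from the I/O, so the same function can be tested locally before the script is attached to a query.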
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
For more details on Apache Hadoop, see
This section contains links to resources useful in learning Hadoop, such as installation, configuration, and basic how-to information.
The links in this section provide information on deploying and using the Developer Preview of HDInsight Services on Windows.
The links in this section provide information on deploying and using Apache Hadoop on the Microsoft Windows Azure Platform. Instead of setting up and managing a Hadoop cluster on Azure yourself, you can use the HDInsight Services for Windows Azure dashboard that Microsoft has made available at hadooponazure.com. This is a preview of the HDInsight Services for Windows Azure, to which you can submit MapReduce jobs together with the data they process. It enables you to process vast amounts of structured and non-structured data easily, without having to set up, configure, maintain, and manage a Hadoop cluster manually.
This section contains links to the tutorials for the samples that are on the Hadoop on Windows Azure Portal.
This section contains information on developing solutions using Hadoop.
This section contains information on using Hadoop with other BI technologies.
Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS)
This section contains a list of Hadoop-related how-to articles.
This section contains a list of Hadoop-related examples.
This section contains a list of Hadoop-related videos.
This section contains a list of Hadoop-related books.
Microsoft is planning on providing guidance on best practices in the future. If you have best practices guidance that you'd like to share, please feel free to provide a link to it here.
Some suggestions: it would be useful to list best practices around: