(Note: Parts of this topic, especially the Getting Started section, need revision for the latest release of the HDInsight Emulator.)
The Windows Azure HDInsight Emulator is an implementation of HDInsight on Windows. This Apache™ Hadoop™-based service on Windows uses a single-node deployment only. HDInsight Server provides a local development environment for the Windows Azure HDInsight Service. This technology is being developed to provide a software framework designed to manage, analyze, and report on big data.
Like the Windows Azure HDInsight Service, this local development environment for HDInsight simplifies configuring, running, and post-processing Hadoop jobs. It provides a PowerShell library with HDInsight cmdlets for managing the cluster and the jobs run on it, as well as a .NET SDK for HDInsight that can automate these procedures.
Pig is a high-level platform for processing big data on Hadoop clusters. Pig includes a data flow language, called Pig Latin, for writing queries over large datasets. A Pig Latin program consists of a series of dataset transformations that is converted, under the covers, into a series of MapReduce programs.
Hive is a distributed data warehouse that manages data stored in HDFS. It is the Hadoop query engine. Hive is aimed at analysts with strong SQL skills and provides an SQL-like interface and a relational data model. Hive uses a language called HiveQL, a dialect of SQL. Like Pig, Hive is an abstraction on top of MapReduce: when a query runs, Hive translates it into a series of MapReduce jobs.
The scenario for using Hadoop is valid for a wide variety of activities in business, science, and governance. These activities include, for example, monitoring supply chains in retail, suspicious trading patterns in finance, demand patterns for public utilities and services, air and water quality from arrays of environmental sensors, or crime patterns in metropolitan areas.
Scenarios for Hive are closer in concept to those for an RDBMS, and so are appropriate for use with more structured data. For unstructured data, Pig is typically a better choice.
A feature of the Microsoft Big Data Solution is the integration of Hadoop with components of Microsoft Business Intelligence (BI). An example of Hadoop integration with Microsoft BI provided in the Developer Preview is the ability for Excel to connect to the Hive data warehouse framework in the Hadoop cluster, via either Power Query or the Simba Hive ODBC driver for HDInsight, to access and view data in the cluster.
The HDInsight Emulator provides a set of samples that enable new users to get started quickly by walking them through a series of tasks that are typically needed when processing a big data set.
For more information, see Run the Hadoop samples in HDInsight: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-run-samples/
The installation procedures for the Windows Azure HDInsight Emulator are outlined in Install the Windows Azure HDInsight Emulator.
(New content is needed here now that the Windows Azure HDInsight Emulator is publicly available. The material that follows also needs updating.)
The samples are organized around the IIS W3C log data scenarios. A data generation tool is provided to create and import these two data sets in various sizes. MapReduce, Pig, or Hive jobs may then be run on the pages of data generated by using PowerShell scripts that are also provided. Note that the Pig and Hive scripts used both compile to MapReduce programs. Users may run a series of jobs to observe for themselves the effects of using these different technologies and the effects of the size of the data on the execution of the processing tasks.
powershell -F buildsamples.ps1
powershell -F runSamples.ps1 <scenario> <size> <method> <job>
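The four positional parameters in the command template above can also be assembled programmatically. The following is a minimal sketch only: `build_run_samples_command` is a hypothetical helper, not part of the samples, and the set of methods it accepts is limited to the values that appear in this article's example commands (other valid values may exist). It spells out `-File`, the full name of the `-F` switch used above.

```python
from typing import List

# Methods that appear in this article's example commands. This is an
# assumption for illustration; the samples may accept other values.
KNOWN_METHODS = {"java", "csharp", "pig", "hive"}


def build_run_samples_command(scenario: str, size: str, method: str, job: str) -> List[str]:
    """Return the argument list for one runSamples.ps1 invocation (hypothetical helper)."""
    normalized = method.lower()
    if normalized not in KNOWN_METHODS:
        raise ValueError(f"unrecognized method: {method!r}")
    return [
        "powershell", "-ExecutionPolicy", "unrestricted",
        "-File", "runSamples.ps1",
        scenario, size, normalized, job,
    ]


if __name__ == "__main__":
    # Prints the command line rather than running it.
    print(" ".join(build_run_samples_command("w3c", "small", "java", "totalhits")))
```

The returned list could be passed to `subprocess.run` from the samples directory; it is only printed here so the sketch has no side effects.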
The values available for the four parameters in the W3C log data scenario are:
Additional information on generating and importing the scenario data and on how to run the samples is available by clicking on the tiles at the top of the Getting Started with HDInsight Server page or here:
Importing the data is done using the PowerShell script importdata.ps1. Note: Importing the data can take quite a while and can also be resource intensive. Import the w3c scenario data now by running:
powershell -ExecutionPolicy unrestricted -F importdata.ps1 w3c
This will create the following data structure in HDFS:
MapReduce is the basic compute engine for Hadoop. By default, it is implemented in Java, but there are also examples that use C# through .NET and Hadoop Streaming.
powershell -ExecutionPolicy unrestricted /F runSamples.ps1 w3c <size> java <job>
There are three values available for the size and job parameters in the W3C log data scenario:
Use one of the possible sets of parameter values, for example:
powershell -ExecutionPolicy unrestricted /F runSamples.ps1 w3c small java totalhits
This will begin to output the status of the job to the console. To track the job, you can follow along with the output in the console, or navigate to the Job Tracker that's hosted on the head node in the HDFS cluster.
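To make the MapReduce model behind these jobs concrete, here is a small in-memory sketch of the map, shuffle, and reduce phases counting hits per page. It is not the Java sample's code. The sample log lines, the field layout, and the interpretation of a "total hits" job as a per-URI count are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed W3C extended log layout; lines starting with '#' are directives.
SAMPLE_LOG = [
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2013-01-01 00:00:01 GET /index.html 200",
    "2013-01-01 00:00:02 GET /about.html 200",
    "2013-01-01 00:00:03 GET /index.html 304",
]
URI_FIELD = 3  # position of cs-uri-stem in the assumed layout


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (uri, 1) for every log record, skipping directive lines."""
    for line in lines:
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) > URI_FIELD:
            yield fields[URI_FIELD], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def total_hits(lines: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(total_hits(SAMPLE_LOG))
```

On a real cluster the map and reduce functions run in parallel across many tasks and the shuffle moves data between them; the single-process version here only shows the data flow.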
If you are interested in the Java source code, it is available here:
powershell -ExecutionPolicy unrestricted /F runSamples.ps1 w3c <size> csharp <job>
powershell -ExecutionPolicy unrestricted /F runSamples.ps1 w3c small csharp totalhits
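Hadoop Streaming lets any executable act as a mapper or reducer by exchanging text lines over standard input and output, with keys and values separated by a tab. The C# samples are not reproduced here; the sketch below illustrates the same line protocol in Python, under the assumption that the fourth whitespace-separated field of each log record is the requested URI. The `simulate` function is a hypothetical stand-in for the sort step that the framework performs between the two programs.

```python
import itertools
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'uri<TAB>1' for each log record, as a streaming mapper would print."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) > 3:  # assumed position of the URI field
            yield f"{fields[3]}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts for runs of identical keys; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def simulate(lines: Iterable[str]) -> List[str]:
    """Stand-in for the framework: run the mapper, sort its output, run the reducer."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    log = [
        "#Fields: date time cs-method cs-uri-stem sc-status",
        "2013-01-01 00:00:01 GET /index.html 200",
        "2013-01-01 00:00:02 GET /about.html 200",
        "2013-01-01 00:00:03 GET /index.html 304",
    ]
    for out_line in simulate(log):
        print(out_line)
```

Because the reducer only sees a sorted stream of lines, it relies on equal keys being adjacent, which is why `itertools.groupby` is sufficient here.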
If you are interested in the C# source code, it is available here:
Pig processing uses a data flow language, called Pig Latin. Pig Latin abstractions provide richer data structures than MapReduce, and do for Hadoop what SQL does for RDBMS systems. For more information, see the Apache Pig documentation.
To run Pig jobs, open the Hadoop command prompt, navigate to c:\Hadoop\GettingStarted and execute the command:
powershell -ExecutionPolicy unrestricted /F runSamples.ps1 <scenario> <size> Pig <job>
There are three values available for the size and job parameters and two scenarios can be used:
powershell -ExecutionPolicy unrestricted /F runSamples.ps1 w3c small pig totalhits
If you are interested in the Pig source code, it is available here:
Note that since Pig scripts compile to MapReduce jobs, and potentially to more than one such job, users may see multiple MapReduce jobs executing in the course of processing a Pig job.
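The following sketch shows why one data flow can turn into more than one job. It is written in Python rather than Pig Latin and is not the sample script: each function stands in for one transformation step, and the two steps that regroup or reorder the whole data set are the ones that would generally each need their own shuffle on a cluster. The field position and function names are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def load(lines: Iterable[str]) -> List[List[str]]:
    """Split each non-directive log line into fields."""
    return [line.split() for line in lines if line and not line.startswith("#")]


def project_uri(records: Iterable[List[str]], uri_index: int = 3) -> List[str]:
    """Keep only the URI field (assumed to be at uri_index)."""
    return [rec[uri_index] for rec in records if len(rec) > uri_index]


def group_and_count(uris: Iterable[str]) -> Counter:
    """Group by URI and count: regroups all data, so it needs a shuffle (stage 1)."""
    return Counter(uris)


def order_by_hits(counts: Counter) -> List[Tuple[str, int]]:
    """Sort by descending count: a global ordering, so a further stage (stage 2)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def hits_ranked(lines: Iterable[str]) -> List[Tuple[str, int]]:
    return order_by_hits(group_and_count(project_uri(load(lines))))


if __name__ == "__main__":
    log = [
        "#Fields: date time cs-method cs-uri-stem sc-status",
        "2013-01-01 00:00:01 GET /index.html 200",
        "2013-01-01 00:00:02 GET /about.html 200",
        "2013-01-01 00:00:03 GET /index.html 304",
    ]
    print(hits_ranked(log))
```

How many jobs a real script produces depends on the Pig version and its optimizer, so the two-stage split above should be read as a rough mental model only.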
The Hive query engine will feel familiar to analysts with strong SQL skills. It provides a SQL-like interface and a relational data model for HDFS. Hive uses a language called HiveQL (or HQL), which is a dialect of SQL. For more information, see the Apache Hive documentation.
To run Hive jobs, open the Hadoop command prompt, navigate to c:\Hadoop\GettingStarted and execute a command of the form:
powershell -ExecutionPolicy unrestricted /F runSamples.ps1 w3c <size> Hive <job>
powershell -ExecutionPolicy unrestricted /F runSamples.ps1 w3c small hive totalhits
Note that as a first step in each of the jobs, a table will be created and data will be loaded into the table from the file created earlier. You can browse the file that was created by looking under the /Hive node in HDFS.
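The create-table, load, then query pattern described above can be tried locally with any SQL engine. The sketch below uses SQLite from the Python standard library purely as a stand-in; it is not Hive, the statements are standard SQL rather than the sample's HiveQL, and the table name, column names, and rows are assumptions for illustration.

```python
import sqlite3
from typing import Iterable, List, Tuple


def total_hits_sql(rows: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Create a table, load rows into it, and count hits per URI."""
    conn = sqlite3.connect(":memory:")
    try:
        # Step 1: create the table (hypothetical schema).
        conn.execute("CREATE TABLE w3c_log (uri_stem TEXT, status INTEGER)")
        # Step 2: load the data.
        conn.executemany("INSERT INTO w3c_log VALUES (?, ?)", rows)
        # Step 3: run the aggregate query.
        cursor = conn.execute(
            "SELECT uri_stem, COUNT(*) AS hits "
            "FROM w3c_log GROUP BY uri_stem "
            "ORDER BY hits DESC, uri_stem"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample_rows = [("/index.html", 200), ("/about.html", 200), ("/index.html", 304)]
    print(total_hits_sql(sample_rows))
```

In Hive the same kind of GROUP BY query is translated into MapReduce jobs over files in HDFS instead of being executed by an embedded database engine.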
If you are interested in the Hive source code, it is available here:
Additional installation issues for the HDInsight Community Technology Preview may be found in the Release Notes.
Microsoft HDInsight feature suggestions may be made on the Feature Voting page.
MSDN forum for discussing HDInsight (Windows and Windows Azure)
Development - Portal to the Windows Azure cloud platform.
TechNet - The main portal for technical information about Hadoop-based services for Windows and related Microsoft technologies.