Introduction to Big Data

What is Big Data?

Big Data refers to data that is too large or complex for analysis in traditional databases because of factors such as the volume, variety, and velocity of the data to be analyzed.

Volume: The quantity of data generated is very high.
For example, consider analyzing application logs, where new data is generated each time a user performs an action in an application. This may generate several lines per minute or even per second as the user works.

Variety: The data that needs to be analyzed is not uniform; it consists of both structured and unstructured data.
One example is the analysis of social media data consisting of emoticons, hashtags, and text in several languages.

Velocity: Data is generated very frequently. This is becoming common with emerging technologies such as the Internet of Things, where devices and sensors generate data continuously.

What is Apache Hadoop?

Apache Hadoop is an open-source Java framework primarily intended for the storage and processing of very large data sets.
It performs distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

What is MapReduce?

MapReduce is the application logic that splits the data for processing by different nodes in the Hadoop cluster.
A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system.
The framework also takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

A MapReduce job runs in the 3 steps below:

1. Source data is divided among data nodes
2. Map phase generates key/value pairs
3. Reduce phase aggregates values for each key
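
The three steps above can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, not how Hadoop itself is implemented: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are assumptions made for this example, and a real cluster would run the map and reduce tasks on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key and sort by key, standing in for
    the sorting the framework performs between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce step: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then shuffle, then reduce each key."""
    pairs = []
    for chunk in chunks:  # each chunk stands in for data held by one node
        pairs.extend(map_phase(chunk))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical source data, already divided into three chunks
    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(chunks))
```

Each chunk is mapped independently, which is what allows the map tasks to run in parallel on a real cluster.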

Introduction to Azure HDInsight

Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data.
The Hadoop core provides reliable data storage with the Hadoop Distributed File System (HDFS), and a simple MapReduce programming model to process and analyze, in parallel, the data stored in this distributed system.

Creating an HDInsight Cluster

To create an Azure HDInsight cluster, open the Azure portal and click New > Data Services > HDInsight.
The following options are available:
a. Hadoop is the default and native implementation of Apache Hadoop.
b. HBase is an Apache open-source NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured data.
c. Storm is a distributed, fault-tolerant, open-source computation system that allows you to process data in real time.

This article uses the Hadoop cluster.


The next step is to enter a cluster name, select the cluster size, enter a password, select a storage account, and click Create HDInsight Cluster.

Enable Remote Desktop on the Cluster

Once the cluster has been created, its jobs and contents can be viewed by remote connection.
To enable remote connection to the cluster, follow the steps below:

1. Click HDINSIGHT on the left pane. You will see a list of deployed HDInsight clusters.
2. Click the HDInsight cluster that you want to connect to.
3. From the top of the page, click CONFIGURATION.
4. From the bottom of the page, click ENABLE REMOTE.

In the Configure Remote Desktop wizard, enter a username and password for the remote desktop. Note that the username must be different from the one used to create the cluster (admin by default with the Quick Create option). Enter an expiration date in the EXPIRES ON box.

Accessing the Hadoop Cluster using Remote Desktop Connection

To connect to the cluster via Remote Desktop Connection, select your cluster in the portal, go to CONFIGURATION, and click CONNECT.
An RDP file will be downloaded, which is used to connect to the cluster. Open the file, enter the required credentials, and click Connect.

Once the Remote Connection is established, double-click the Hadoop Command Line icon. 
This will be used to navigate through the Hadoop File System.

View files in root directory
Once the command line is open, you can view all the files in the root folder.
The syntax is hadoop fs followed by a hyphen and the Linux-style file command to run against the Hadoop File System.

hadoop fs -ls /

The command above will list all the files in the root folder.

Browse to the Example folder
When the cluster has been created, some sample files and data have already been included. To view them, navigate to the example folder.

hadoop fs -ls /example

Browse to Jars folder
A JAR is the archive format in which compiled Java code is packaged. This folder contains a JAR with sample MapReduce implementations.

hadoop fs -ls /example/jars

View the sample data available

hadoop fs -ls /example/data

Browse to Gutenberg folder

hadoop fs -ls /example/data/gutenberg

From the gutenberg folder, assume that a MapReduce job needs to be run on the file davinci.txt.
The file contains a large amount of text, which is an extract of an ebook.

Run MapReduce
To run a MapReduce job on the file davinci.txt, the following command is used.

hadoop jar hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /example/results

The command consists of:
a. hadoop-mapreduce-examples.jar is the JAR file containing the compiled Java code
b. wordcount is the name of the example program to run from the JAR file
c. /example/data/gutenberg/davinci.txt is the source data
d. /example/results is the folder where the result is stored

View the result

hadoop fs -tail /example/results/part-r-00000

The MapReduce job has been executed and the result saved in /example/results/.
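
Once a result file such as part-r-00000 has been copied locally, it can be explored further. The sketch below assumes the usual wordcount output layout of one word and its count per line, separated by a tab; the function names and the sample text are hypothetical and only illustrate the idea.

```python
def parse_wordcount_output(text):
    """Parse wordcount output, assumed to be 'word<TAB>count' per line."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field
        word, _, count = line.rpartition("\t")
        counts[word] = int(count)
    return counts


def top_words(counts, n=3):
    """Return the n most frequent words, ties broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


if __name__ == "__main__":
    # Hypothetical sample in the assumed output format
    sample = "a\t3\nof\t7\nthe\t10\nzebra\t1\n"
    print(top_words(parse_wordcount_output(sample), n=2))
```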

Running MapReduce Jobs using PowerShell

Download and Install PowerShell

PowerShell can be downloaded at the link here.

Connect PowerShell to a Microsoft Azure Account

Once PowerShell is installed, it's time to connect it to your Azure account.
The command below opens the Azure portal in a browser, asks for your credentials, and downloads a publish settings file.

PS C:\> Get-AzurePublishSettingsFile

Key in the following command, together with the path to the file downloaded above.

PS C:\> Import-AzurePublishSettingsFile "FILE PATH \Visual Studio Ultimate with MSDN-4-29-2015-credentials.publishsettings"

PowerShell is now connected to your Azure Account.

Upload Data

The script below uploads all the files from a local folder to Azure storage. Enter the source location in the variable $localFolder and the destination location on Azure in the variable $destFolder.

The script loops through all the files in the local folder and uploads them to the destination folder.
Replace the values of $storageAccountName and $containerName with values that match the Azure account being used.

# Replace these values with the ones that match your Azure account
$storageAccountName = ""
$containerName = "chervinehadoop"
$localFolder = "K:\Wiki & Blog\Big Data Wikis\Intro\Upload"
$destFolder = "UploadedData"
# Build a storage context from the primary key of the storage account
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
# Upload each item in the local folder as a blob under $destFolder
$files = Get-ChildItem $localFolder
foreach ($file in $files) {
  $fileName = $file.FullName
  $blobName = "$destFolder/$($file.Name)"
  Write-Host "copying $fileName to $blobName"
  Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext -Force
}
Write-Host "All files in $localFolder uploaded to $containerName!"

Once the files have been uploaded, they can be viewed from the portal by going to the cluster > Dashboard > Linked Resources > Containers.
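
Azure Blob storage has a flat namespace, so the "folder" seen in the portal is really just a prefix in each blob name. The Python sketch below illustrates how the upload loop above derives blob names from local file names; the function name blob_names and the sample file names are hypothetical and used only for illustration.

```python
from pathlib import PureWindowsPath


def blob_names(local_files, dest_folder):
    """Map each local Windows path to '<dest_folder>/<file name>',
    mirroring how the upload script composes blob names."""
    return {
        path: f"{dest_folder}/{PureWindowsPath(path).name}"
        for path in local_files
    }


if __name__ == "__main__":
    # Hypothetical local files
    files = [r"C:\Upload\book1.txt", r"C:\Upload\book2.txt"]
    for local, blob in blob_names(files, "UploadedData").items():
        print(f"{local} -> {blob}")
```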

Run the MapReduce

Once the data has been uploaded, it can be processed using MapReduce. The script below creates a new MapReduce job definition and submits it to the cluster.
The command New-AzureHDInsightMapReduceJobDefinition takes the following parameters:
1. JarFile "wasb:///example/jars/hadoop-mapreduce-examples.jar" : The location of the Jar file containing the MapReduce code
2. ClassName "wordcount" : The class to be used inside the Jar file.
3. Arguments "wasb:///UploadedData", "wasb:///UploadedData/output" : Represents the Source and Destination folder respectively.

Once the job definition is created, the job is executed by the command Start-AzureHDInsightJob, which takes the cluster name and the job definition as parameters.

$clusterName = "ChervineHadoop"
$jobDef = New-AzureHDInsightMapReduceJobDefinition -JarFile "wasb:///example/jars/hadoop-mapreduce-examples.jar" -ClassName "wordcount" -Arguments "wasb:///UploadedData", "wasb:///UploadedData/output"
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Map/Reduce job submitted..."
Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $wordCountJob.JobId -StandardError

The execution progress shall be displayed on the PowerShell console.
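
The wasb:/// paths passed to the job definition are a short form that refers to the default storage container of the cluster. The fully qualified form is wasb://<container>@<account>.blob.core.windows.net/<path>. The Python sketch below shows how the two forms relate; the helper name expand_wasb and the storage account name "mystorage" are assumptions made for illustration only.

```python
def expand_wasb(path, container, account):
    """Expand a short 'wasb:///...' path into its fully qualified form.

    Paths that do not use the short form are returned unchanged.
    """
    prefix = "wasb:///"
    if not path.startswith(prefix):
        return path
    relative = path[len(prefix):]
    return f"wasb://{container}@{account}.blob.core.windows.net/{relative}"


if __name__ == "__main__":
    # "mystorage" is a hypothetical storage account name
    print(expand_wasb("wasb:///UploadedData/output", "chervinehadoop", "mystorage"))
```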

View the result

When the MapReduce job completes, the output folder specified above is created and the result is stored in it.
From the Azure portal, navigate to the storage account > Containers and notice that the folder "output" has been created.

Select the files and download them to view the results.

Conclusion & Next Steps

In this article, the basic concepts of Big Data were introduced before looking at some examples of how the Microsoft Azure platform can be used to solve big data problems. With Microsoft Azure, it is easy not only to use and explore big data but also to automate these tasks using PowerShell. Combining Azure and PowerShell makes it possible to automate the whole process, from creating a Hadoop cluster to getting the results back.
In the next article, we shall see how Hive can be used to execute SQL-like queries against big data.

See Also

Big Data Analytics using Microsoft Azure: Hive