The purpose of this wiki post is to provide an example scenario on how to work with Hadoop on Azure, upload a web log sample file via secure FTP, and run some simple HiveQL queries.

 Important!
This wiki topic may be obsolete.
The wiki topics on Windows Azure HDInsight Service are no longer updated by Microsoft. We moved the content to windowsazure.com where we keep it current. This topic can be found at Using Hive with HDInsight.

Preparation

Please download the sample weblog file ex20111214.log.gz and the weblog_sample.hql file to a local location on your computer.

Upload the Weblog file

To upload the weblog, let's make use of the FTP option using curl.  To do this, you will need to do the following:

1) From the Interactive Javascript Console, create the folder weblog using the command #mkdir
#mkdir weblog






2) Follow the instructions at How to FTP Data to Hadoop on Windows Azure to get the data up to the weblog folder you had created (i.e. instructions on how to open up the FTP ports and how to push data up to HDFS using curl).  In this case, you can use the curl command used to ftp the data is noted below.
curl -k -T ex20111214.log.gz ftps://Campschurmann:[MD5 Hash Password]@tardis.cloudapp.net:2226/user/Campschurmann/weblog/


Some quick notes based on the color coding:
  • Campschurmann - this is the username that I had specified when I had created my cluster.  Notice that it is case sensitive.
  • [MD5 Hash Password] - this is a MD5 hash of the password you had created when you had created your cluster.  
  • tardis.cloudapp.net - this is the name of my cluster where i had specified tardis as the name of my cluster.
  • 2226 - this is the FTP port to allow the transfer of data, this is the port you had previously opened in the Open Ports Live tile.
  • weblog - this is the folder that you had just created using the #mkdir command.


3) To verify the file, go back to the Interactive Javascript console and type #ls weblog and you should see the file listed.



Create HiveQL table pointing to this sample weblog file

Now that you have uploaded the sample weblog file, now you can create a Hive table that points to the weblog folder you just created which contains the sample file.  To do this:

1) Go to the Interactive Hive Console, and type the command below.

CREATE EXTERNAL TABLE weblog_sample (
   evtdate STRING,
   evttime STRING,
   svrsitename STRING,
   svrip STRING, 
   csmethod STRING, 
   csuristem STRING, 
   csuriquery STRING, 
   svrport INT,
   csusername STRING, 
   cip STRING, 
   UserAgent STRING, 
   Referer STRING, 
   scstatus STRING, 
   scsubstatus STRING, 
   scwin32status STRING, 
   scbytes STRING, 
   csbytes STRING, 
   timetaken STRING
)
COMMENT 'This is a web log sample'
ROW FORMAT DELIMITED FIELDS TERMINATED by '32'
STORED AS TEXTFILE
LOCATION '/user/campschurmann/weblog/';

You should be able to copy/paste it directly from this wiki post but just in case you cannot, the weblog_sample.hql file you had previously downloaded contains the same command.  


Note
You will notice that this is a CREATE EXTERNAL TABLE command - this allows you to create a Hive table that points to the files located in a folder instead of going through the task of uploading the data into separate hive table / partitions.


More Information
For more information about CREATING TABLES in HIVE, please reference the Apache Hive Tutorial > Creating Tables.


2) To verify that the table exists, type the command:
show tables

and you should see the weblog_sample table that you had created listed.



3) To validate the data can be read, you can type the command:
select * from weblog_sample limit 10;

and you should view the first ten rows from the weblog_sample Hive table, which is pointing to the ex20111214.log.gz web log file.





Note
You may notice that the weblog_sample Hive table is pointing to the weblog folder which contains a compressed gzip file.  There advantage is that if your weblog files are already gzipped, you do not need to decompress them to read them with Hive.  


Querying your Hive Table

As noted above, you can run your HiveQL queries against this sample web log.  But one of the key important things is to utilize hive parsing functions to extract valuable data from the key-value pairs.  For example, the query below extracts out the first page hierarchy information from the csuristem column and groups by that value, and does a count.

select regexp_replace(split(csuristem, "/")[1], "MainFeed.aspx", "Home"), count(*) 
from weblog_sample 
group by regexp_replace(split(csuristem, "/")[1], "MainFeed.aspx", "Home")


The page hierarchy in the csuristem column looks like:
/Olympics/archive/2007/09/13/Lena-Lake.aspx

By using the split function, in the form of:
split(csuristem, "/")[1]

I'm able to extract out the first value of the string array defined by "/" - in the above case, this would be the value "Olympics".  I'm also using the regexp_replace function to change the MainFeed.aspx page to indicate that it's actually the Home Page.

Finally, I use the group by and count(*) functions to perform my aggregate query.






More Information
To review all of the available Hive functions, please reference the Apache Hive Language Manual UDF at Apache Hive > LanguageManual UDF.