Sign In
Home
Library
Wiki
Learn
Gallery
Downloads
Support
Forums
Blogs
Resources For IT Professionals
United States (English)
Россия (Pусский)
中国(简体中文)
Brasil (Português)
Post an article
Translate this page
Powered by
Microsoft® Translator
Wikis - Page Details
First published by
Denny Lee [MSFT]
When:
9 Jan 2012 11:56 PM
Last revision by
Denny Lee [MSFT]
When:
10 Jan 2012 12:56 AM
Revisions:
8
Comments:
0
Options
Subscribe to Article (RSS)
Share this
Can You Improve This Article?
Positively!
Click Sign In to add the tip, solution, correction or comment that will help other users.
Report inappropriate content using the Report Abuse link in the footer.
Wiki
>
TechNet Articles
>
Hadoop on Azure Scenario: Query a web log via HiveQL
Hadoop on Azure Scenario: Query a web log via HiveQL
Article
History
Hadoop on Azure Scenario: Query a web log via HiveQL
The purpose of this wiki post is to provide an example scenario on how to work with Hadoop on Azure, upload a web log sample file via secure FTP, and run some simple HiveQL queries.
Preparation
Please download the sample weblog file
ex20111214.log.gz
and the
weblog_sample.hql
file to a local location on your computer.
Upload the Weblog file
To upload the weblog, let's make use of the FTP option using curl. To do this, you will need to do the following:
1) From the
Interactive Javascript Console
, create the folder weblog using the command #mkdir
#mkdir weblog
2) Follow the instructions at
How to FTP Data to Hadoop on Windows Azure
to get the data up to the weblog folder you had created (i.e. instructions on how to open up the FTP ports and how to push data up to HDFS using curl). In this case, you can use the curl command used to ftp the data is noted below.
curl -k -T ex20111214.log.gz ftps://
Campschurmann
:
[MD5 Hash Password]
@
tardis.cloudapp.net
:
2226
/user/
Campschurmann
/
weblog
/
Some quick notes based on the color coding:
Campschurmann
- this is the username that I had specified when I had created my cluster. Notice that it is
case sensitive
.
[MD5 Hash Password]
- this is a MD5 hash of the password you had created when you had created your cluster.
tardis.cloudapp.net
- this is the name of my cluster where i had specified tardis as the name of my cluster.
2226
- this is the FTP port to allow the transfer of data, this is the port you had previously opened in the Open Ports Live tile.
weblog
- this is the folder that you had just created using the #mkdir command.
3) To verify the file, go back to the Interactive Javascript console and type #ls weblog and you should see the file listed.
Create HiveQL table pointing to this sample weblog file
Now that you have uploaded the sample weblog file, now you can create a Hive table that points to the weblog folder you just created which contains the sample file. To do this:
1) Go to the
Interactive Hive Console
, and type the command below.
CREATE EXTERNAL TABLE weblog_sample (
evtdate STRING,
evttime STRING,
svrsitename STRING,
svrip STRING,
csmethod STRING,
csuristem STRING,
csuriquery STRING,
svrport INT,
csusername STRING,
cip STRING,
UserAgent STRING,
Referer STRING,
scstatus STRING,
scsubstatus STRING,
scwin32status STRING,
scbytes STRING,
csbytes STRING,
timetaken STRING
)
COMMENT 'This is a web log sample'
ROW FORMAT DELIMITED FIELDS TERMINATED by '32'
STORED AS TEXTFILE
LOCATION '/user/campschurmann/weblog/';
You should be able to copy/paste it directly from this wiki post but just in case you cannot, the weblog_sample.hql file you had previously downloaded contains the same command.
Note
You will notice that this is a CREATE EXTERNAL TABLE command - this allows you to create a Hive table that points to the files located in a folder instead of going through the task of uploading the data into separate hive table / partitions.
More Information
For more information about CREATING TABLES in HIVE, please reference the
Apache Hive Tutorial > Creating Tables
.
2) To verify that the table exists, type the command:
show tables
and you should see the weblog_sample table that you had created listed.
3) To validate the data can be read, you can type the command:
select * from weblog_sample limit 10;
and you should view the first ten rows from the weblog_sample Hive table, which is pointing to the ex20111214.log.gz web log file.
Note
You may notice that the weblog_sample Hive table is pointing to the weblog folder which contains a
compressed gzip file
. There advantage is that if your weblog files are already gzipped, you do not need to decompress them to read them with Hive.
Querying your Hive Table
As noted above, you can run your HiveQL queries against this sample web log. But one of the key important things is to utilize hive parsing functions to extract valuable data from the key-value pairs. For example, the query below extracts out the first page hierarchy information from the csuristem column and groups by that value, and does a count.
select regexp_replace(split(csuristem, "/")[1], "MainFeed.aspx", "Home"), count(*)
from weblog_sample
group by regexp_replace(split(csuristem, "/")[1], "MainFeed.aspx", "Home")
The page hierarchy in the csuristem column looks like:
/Olympics/archive/2007/09/13/Lena-Lake.aspx
By using the
split
function, in the form of:
split(csuristem, "/")[1]
I'm able to extract out the first value of the string array defined by "/" - in the above case, this would be the value "Olympics". I'm also using the
regexp_replace
function to change the MainFeed.aspx page to indicate that it's actually the Home Page.
Finally, I use the group by and count(*) functions to perform my aggregate query.
More Information
To review all of the available Hive functions, please reference the Apache Hive Language Manual UDF at
Apache Hive > LanguageManual UDF
.
azure
,
faq
,
Hadoop
,
Hadoop on Azure
,
Hive
,
HiveQL
,
scenario
,
Web Log