The following article demonstrates how unstructured data and relational data can be queried, joined and processed in a single query using PolyBase, a new feature in SQL Server 2016.
Traditionally, Big Data is processed using Apache Hadoop which is totally fine. But what if the result of this needs to be linked to the traditional Relation Database? For example, assume that from the analysis of tons of application logs, marketing needs to contact some customs that faced problems in an application following a failure in the application.
Then install PolyBase Query Service for External Data
CREATE
MASTER
KEY
ENCRYPTION
BY
PASSWORD
=’MyP@ssword31’;
DATABASE
SCOPED CREDENTIAL HadoopUser1
WITH
IDENTITY = ‘xxxx’,
Secret = ‘xxxxxx’;
Sp_configure ‘hadoop connectivity’, 1;
reconfigure;
EXTERNAL DATA SOURCE AzureDs1
(
TYPE = HADOOP,
--Specifiy the container name and account name
LOCATION = 'wasbs://X@Z.blob.core.windows.net/',
--Specify the credential that we created earlier
CREDENTIAL = HadoopUser1
);
EXTERNAL FILE FORMAT CommaFormat
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR =’,’)
EXTERNAL
TABLE
Hvac (
Date
varchar
(10),
Time
TargetTemp
smallint
,
ActualTemp
SystemID
SystemAge
BuildingID
)
--Set the file to be the HVAC sensor sample file
LOCATION=’/HdiSamples/SensorSampleData/hvac/HVAC.csv’,
DATA_SOURCE = AzureDs,
FILE_FORMAT = CommaFormat,
--We will allow the header to be rejected
REJECT_TYPE = VALUE,
REJECT_VALUE = 1
Assume someone has a business running and has been storing all his application logs.
Someday, the application got a severe error on one of the most critical areas and the management decides to contact all the customers that tries to access the Application at that time.
Since, the database is down, the only way to track these customers were through the logs. But now, how to get the details of these customers?
This problem can be easily solved with PolyBase.
The following example demonstrates the analytics and BI capabilities of joining both the relational and unstructured data using PolyBase.
Relational Customer table The following is a relational customer table where all the customer data are stored on-premise.
Application Logs on Azure Blob Store The following is the raw dump of application logs stored on the Azure blob storage.
APPLICATION_LOGS (
[LOG_DATE]
(50),
[LOG_TIME]
[CUSTOMER_ID]
int
[PAGE]
[
ACTION
]
[STATUS]
(50)
LOCATION=
'/ApplicationLogSample.txt'
select
CUST.[FIRST_NAME],
CUST.LAST_NAME,
CUST.PHONE,
COUNT
(1)
from
APPLICATION_LOGS APPLO
inner
join
[dbo].[CUSTOMER] CUST
ON
CUST.[ID] = APPLO.CUSTOMER_ID
WHERE
APPLO.STATUS=
'ERROR'
AND
PAGE =
'EPAYMENT'
GROUP
CUST.PHONE
HAVING
(1)>3
count
(1),
page
APPLICATION_LOGS
group
by
having
(1) > 10
SELECT
*
INTO
RELATIONAL_LOGS
FROM
APPLICATION_LOGS;