SQL11UPD05-TSCRIPT-05

This wiki is a transcript of a previously recorded video.

Introducing SQL Server 2012 Data Quality Services

Welcome to this presentation introducing SQL Server 2012 Data Quality Services. My name is Peter Myers. I am a SQL Server MVP and a Mentor with SolidQ.

 

In this presentation, we are going to introduce Data Quality Services (DQS) and then walk through its components, including the Knowledge Base and how to produce a Data Quality Project. Then we’ll explore the integration scenario with Integration Services using the DQS Cleansing transform. Finally, we’ll discuss developer opportunities and finish with a summary and resources.

 

Data Quality Services is a new knowledge-driven data cleansing solution delivered in SQL Server 2012. It supports building Knowledge Bases that support the correction and de-duplication of your data. Knowledge in a Knowledge Base can improve over time, so I think of it like a wine: we can let it sit and evolve with little effort, becoming more and more useful as we discover more rules and values used to identify and correct data problems. Integration Services includes the DQS Cleansing transform, and Master Data Services includes the Excel Add-in that can leverage DQS.

 

SQL Server Enterprise Information Management (EIM) is really referring to these three services and their ability to deliver quality data. We’ve already explored Integration Services and Master Data Services – in this module we’ll explore Data Quality Services, and knowledge-driven data cleansing, correction, de-duplication and standardization of data and the fact that it can integrate with Integration Services. Collectively, these three services deliver complete, current, consistent and clean data. This is what Enterprise Information Management is all about. Specifically, Data Quality Services is about clean data.

 

What are the business scenarios for Data Quality Services? Well, for knowledge management and reference data, we can create and manage Knowledge Bases. We can discover knowledge from the existing data we have. We can explore and integrate with third party reference data. For cleansing and matching – we can correct, de-duplicate and standardize data. And there is an administration role too that supports the monitoring and control of data quality processes.

 

When we take a look at this diagram, the DQS process involves first of all building a Knowledge Base. That Knowledge Base consists of domains. Domains can be described in terms of values; they can also be described in terms of rules and reference data services. This process involves discovery, exploration, and connection to these services, and is known as knowledge management. Now, a Data Quality Project is just one of three supported ways to use the Knowledge Base: using it to help us correct, standardize, or de-duplicate data.

 

DQS consists of a Data Quality Server and the Client. There is a one-to-one mapping between the Data Quality Server and the relational instance of SQL Server. In fact, when you install and configure Data Quality Services, it results in the creation of three SQL Server databases and the registration of some assemblies and CLR objects that are used to support the functionality of DQS. The Data Quality Client is a standalone Windows application that allows data stewards, data experts, and IT professionals to perform data quality operations. Its functionality includes the ability to create and maintain Knowledge Bases, create and maintain Data Quality Projects, monitor data quality activities, and manage the system settings.

 

This is a snapshot of what the Data Quality Client looks like. We have the different pillars of Knowledge Base Management, Data Quality Projects, and Administration. So let’s focus on the Knowledge Bases themselves. The rationale is: To clean data you need to know something about it. So the Knowledge Base is then a data repository of knowledge that enables professionals to understand their data and to maintain its integrity. Essentially a Knowledge Base can be a collection of domains and matching policies.

 

Domains capture the semantics of data, and they can be used with online reference data; for example, DataMarket reference data services or third-party reference data services. The processes include domain management to define the domains, knowledge discovery to learn domain values from your existing data, and matching policies to define how to identify potential duplicates within your data.

 

Here’s a look at a Knowledge Base, which can consist of domains, matching policies for de-duplication, and composite domains, which are domains that work together. For example, an address could consist of street, city, zip, and country domains: four domains that collectively work together as a composite. Now, each domain consists of a configuration. A domain could be based on a list of values (valid values, invalid values, and corrected values); it could be based on rules, for example, an e-mail address domain could use a regular expression that values must match; or it could reference an external data service for its abilities.
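To make the rule-based domain idea concrete, here is a minimal sketch (in Python, not DQS itself) of what an e-mail address domain rule does conceptually: classify each value as valid or invalid against a regular expression. The pattern and sample values are assumptions for illustration only.

```python
import re

# Hypothetical "EmailAddress" domain rule: values must match this pattern.
# The pattern is a simplified illustration, not the one DQS would use.
EMAIL_RULE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def validate_domain_value(value: str) -> str:
    """Classify a value the way a rule-based domain would."""
    return "Valid" if EMAIL_RULE.match(value) else "Invalid"

for candidate in ["jane@contoso.com", "not-an-email"]:
    print(candidate, "->", validate_domain_value(candidate))
```

A value-based domain works the same way, except membership is checked against the lists of valid, invalid, and corrected values instead of a rule.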

 

So once you have a Knowledge Base, and that’s been published and users have access to it, there are really three distinct ways that you can use that Knowledge Base. That includes first of all the Data Quality Project; there is also the Integration Services scenario that we’ll talk about later; and the Master Data Services Add-in also includes the ability to integrate DQS functionality for the purpose of de-duplication. So let’s consider the Data Quality Project here, which is a means of using that Knowledge Base to improve the quality of source data by performing data cleansing and data matching activities. This is created using the Data Quality Client. The results can be exported either to a SQL Server table or to a CSV file, and there are two types of activities: the cleansing activity and the matching activity to prevent data duplication.

 

So the DQS Cleansing transform is the other approach, in Integration Services 2012. Its configuration involves defining a connection to the Data Quality Server and then configuring the component to use this connection, selecting a Knowledge Base, mapping the component’s input columns to the domains in the Knowledge Base, and then optionally selecting advanced statistical columns that determine the outputs. Now the output of this component includes the original data together with the corrected data and status. And so downstream in the data flow you will likely use Conditional Split components to assess the status and determine what to do with the data. Do we use the original data or the corrected values? And so on.
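The downstream routing logic described above can be sketched outside of SSIS. This Python fragment mimics what a Conditional Split typically does with DQS Cleansing output: pick the corrected value when the status allows it, and divert other rows for review. The column names, status values, and sample rows here are assumptions for illustration, not the exact DQS output schema.

```python
# Sample rows shaped like DQS Cleansing output: the original (source) value,
# the cleansed (output) value, and a record status. All values are invented.
rows = [
    {"City_Source": "Seattle", "City_Output": "Seattle", "Status": "Correct"},
    {"City_Source": "Seatle",  "City_Output": "Seattle", "Status": "Corrected"},
    {"City_Source": "Xyzzy",   "City_Output": "Xyzzy",   "Status": "Invalid"},
]

def route(row):
    """Decide what to do with a row, mirroring a Conditional Split."""
    if row["Status"] in ("Correct", "Corrected"):
        return ("load", row["City_Output"])   # load the cleansed value
    return ("review", row["City_Source"])     # divert invalid rows for review

for row in rows:
    print(route(row))
```

In a real package, each branch of the Conditional Split would feed a different destination, such as the data warehouse for loadable rows and a staging table for rows needing review.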

 

So there is a demonstration that accompanies this presentation, and this takes the demonstration from Master Data Services and extends it by exploring a scenario whereby we need to add some extra members to an entity, and yet we suspect that there could be duplicates. And so it requires us to create a Knowledge Base with a Matching Policy and then return to the Master Data Services Add-in to use it to identify potential duplicates, and then use the statistics returned by Data Quality Services to remove those that we consider to be duplicates. Then, effectively, we have loaded new data into the list and have avoided the loading of duplicate members.
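To illustrate the idea behind the Matching Policy in that demonstration, here is a conceptual sketch: score candidate pairs for similarity and treat pairs above a threshold as potential duplicates. DQS computes its own matching scores from the policy you define; the member names, the `difflib` scoring, and the 0.8 threshold here are assumptions for demonstration only.

```python
from difflib import SequenceMatcher

# Invented member names, two of which look like duplicates.
members = ["Contoso Ltd", "Contoso Limited", "Fabrikam Inc"]

def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1]; stand-in for a DQS matching score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def potential_duplicates(values, threshold=0.8):
    """Return pairs whose similarity meets the threshold."""
    pairs = []
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if similarity(a, b) >= threshold:
                pairs.append((a, b))
    return pairs

print(potential_duplicates(members))
```

In the demonstration, it is these per-pair matching statistics returned by DQS that let you decide which members to remove before loading.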

 

Now for developers, there is limited opportunity in this initial release. The only potential is to develop third-party reference data providers, for either internal use within the organization or for public consumption by using Windows Azure Marketplace. Be aware that there is no public API available in this release. We include a link here if it’s of use to you to learn more about Windows Azure Marketplace.

 

So in summary, DQS is a new knowledge-driven data cleansing solution delivered in SQL Server 2012. Knowledge Bases are designed to support correction and/or de-duplication of data. Knowledge in a Knowledge Base can improve over time and can be reused by many solutions. This is very much the value of this product: you can create and maintain a repository of values and rules that may be reused in many scenarios. Integration Services 2012 includes the DQS Cleansing transform, and the MDS Excel Add-in can leverage Data Quality Services for de-duplication.

 

For the resources, we include a presentation from TechEd North America 2011, a great introduction to using Knowledge Bases to cleanse data with DQS. The DQS Team Blog is a great resource for understanding what the team is doing, some of the design principles, and some of the challenges you may face and how to solve them. And then, finally, Books Online for SQL Server includes, of course, comprehensive Data Quality Services documentation.

 

There is a hands-on lab that accompanies this presentation, so wherever you found this presentation, you’re likely to find links that will allow you to explore SQL Server 2012 Data Quality Services. Essentially, this lab allows you to create a Knowledge Base and then to integrate that Knowledge Base into an Integration Services data flow to help correct data when you are populating the data warehouse. Thanks very much for watching this presentation.


Return to SQL Server 2012 Developer Training Kit BOM (en-US)
