This wiki is a transcript of a previously recorded video.

Related content assets:

  • Demo: De-Duplicating Data with SQL Server 2012 Data Quality Services (SQL11UPD05-DEMO-03)
  • Video: De-Duplicating Data with SQL Server 2012 Data Quality Services (SQL11UPD05-REC-06)


De-Duplicating Data with SQL Server 2012 Data Quality Services

Welcome to this demonstration, De-Duplicating Data with SQL Server 2012 Data Quality Services (DQS). My name is Peter Myers. I am a SQL Server MVP and a Mentor with SolidQ.

This demonstration is in fact an extension of the previous demonstration I did showing the new features of SQL Server 2012 Master Data Services (MDS). There is no direct requirement that you see that demonstration first, but if you do intend to see both, I would strongly recommend you start with the Master Data Services demonstration and then see how Data Quality Services integrates its feature set into the MDS Add-in for Excel.

The starting point for this demonstration is to launch the Data Quality Client. In this part of the demonstration, I am connecting to the DQS server on the KIWI instance, and I am wearing the administrator hat. I am going to go into the administration configuration and here, under General Settings, I am going to change the minimum record score from the default of 80 to a value of 50 percent. This is the threshold that a matching policy uses to determine whether two records are in fact a match, and it acts as a constraint to ensure that matching policies do not use a minimum score lower than the one the administrator configures. To support the demonstration I have, I am going to relax that constraint by reducing the minimum score to 50 percent.

Switching across to SQL Server Management Studio (SSMS), I'll just point out the databases that I have on my relational instance. When you install Data Quality Services, it in fact creates three relational databases (DQS_MAIN, DQS_PROJECTS and DQS_STAGING_DATA). If you are going to use the combined functionality of Data Quality Services and Master Data Services through the MDS Add-in for Excel, there is a requirement that DQS and MDS be on the same instance, and we do see that here because my Master Data Services database is also hosted on this instance. Now, what we need to be aware of is that when you access data through the Data Quality Client, permissions must be granted to the DQS users on the instance. The intention of this demonstration is that I am going to load some new resellers into the Reseller entity in Master Data Services; in fact, this is the entity that I created in the previous demonstration. What I suspect is that some of the new resellers I want to load into this list are ones we already have, and I want to use the power of Data Quality Services to identify any duplicates and help me eliminate them prior to inserting the new resellers into the master list.

What I need to do is open a script that is going to grant privileges to Data Quality Services to access the data I am looking to learn from. So I open the script, which grants privileges for the DQS service and DQS database owner users to the AdventureWorks2008R2 database. I execute the script, and now I am ready to proceed by creating a Knowledge Base that will learn from the data stored in the AdventureWorks2008R2 database. Switching back to the Data Quality Client, I create a new Knowledge Base and provide a name for it – AdventureWorksBI. Of the three activities available here, I select Knowledge Discovery: I am going to learn from the data we currently have in the AdventureWorks database. So I click Next. The next step – in fact the first step in this series – is to map the data. Note that we can introduce data from either SQL Server or from an Excel file; in this demonstration, I will be sourcing the data from SQL Server. The database is AdventureWorks2008R2, and I have a view defined in this database named USStoreAddressList. Essentially, these are all of the resellers by name and their address. Then I create a mapping from the State column in this view to a domain. Knowledge Bases consist of domains, and because this is a new Knowledge Base, I do not yet have any domains defined. So I click the button here to create a domain, which I name State.
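The script itself is not shown for long in the video, but a sketch of what such a grant script could look like is below. The user and login names here are assumptions based on the logins a default DQS installation creates; verify the actual names on your own instance before running anything like this.

```sql
-- Hypothetical sketch of the permissions script, assuming the default
-- DQS logins created at installation (##MS_dqs_service_login## and
-- ##MS_dqs_db_owner_login##); check the names on your own instance.
USE [AdventureWorks2008R2];
GO
CREATE USER [dqs_service] FOR LOGIN [##MS_dqs_service_login##];
CREATE USER [dqs_db_owner] FOR LOGIN [##MS_dqs_db_owner_login##];
GO
-- Read access is sufficient for knowledge discovery and matching.
ALTER ROLE [db_datareader] ADD MEMBER [dqs_service];
ALTER ROLE [db_datareader] ADD MEMBER [dqs_db_owner];
GO
```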

Step two is the Discover phase. When I click Next, it takes me to this phase, and I can click Start; this is where DQS, provided those permissions were granted, goes ahead and retrieves the data and extracts values. In this case, from the 391 rows of data that I have, it found 35 unique states. Be aware that these are resellers defined in states of the United States. So that is the Discover phase.

Then I click Next, and that takes me to the third phase, which is to manage the domain values. We can see here on the left-hand side that the State domain consists of 35 values, and here we see, in alphabetic order, the state codes for those 35. I can explore these, and I am also in a position to create a new domain value. What we have discovered is that when we load data into the data warehouse, some of the extracts we receive contain incorrect data: some of the stores use the abbreviation "Calif.", as we see here, to represent California, and that value is an error when it arrives in our data. So what I am going to do is introduce "Calif." as a domain value but configure it as an error that should correct to the two-character code CA. And so we can see already the way that knowledge can evolve: first, we can discover it in our data source, and second, we can provide additional values that map to the correct values.
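To make the idea concrete, the domain now behaves like a small lookup from incoming values to their leading (corrected) values. Here is a minimal Python sketch; the dictionary structure and function names are illustrative only, not the internal DQS model:

```python
# Illustrative sketch of a DQS domain with an error correction, using
# the values from the demo; this is not the internal DQS data model.
state_domain = {
    "CA": {"type": "Correct"},
    "WA": {"type": "Correct"},
    "Calif.": {"type": "Error", "correct_to": "CA"},  # error corrects to CA
}

def standardize(value, domain):
    """Return the leading value for an incoming state code, or None
    if the value is unknown and should be flagged for review."""
    entry = domain.get(value)
    if entry is None:
        return None
    return entry.get("correct_to", value)
```

So `standardize("Calif.", state_domain)` yields `"CA"`, while an unknown code such as `"ZZ"` yields `None` and would be surfaced for review.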

All right, so now that I have scrolled to the top, it is interesting to see the relationship between the CA member and the "Calif." value, which is an error that corrects to it. This is an example of what we can do to define a domain that can be used for data correction. Now, that is not really the intention of this demonstration, so I thought I would kick-start it by showing you that. So I finish this process. The client prompts me: do I wish to publish this Knowledge Base? Publishing unlocks it and makes it available for use. Because I have more work to do, I click No.

The next part is to define a matching policy that can be used to de-duplicate data. So, coming back to my AdventureWorksBI Knowledge Base, I choose the Matching Policy activity. I work through the mapping step again: I connect to the same database and then to the same view. This time I am going to create four domains for the different parts of the address. Let's go for the store itself, so I create a new domain for it; and then its address; and then a city; and then a state. Oh no, I already have a State domain, so we'll go for a zip instead. I click Next to move forward to the matching policy definition. We have no matching policies in this Knowledge Base at this stage, so I create a new matching rule and provide a name for it: Duplicate Stores. Note the minimum matching score here: because I updated that administration value from 80 down to 50, the rule defaults to 50 and allows me to test down to this level. Here in the rule editor is where we introduce the domains and the rules by which a match is to be identified. I am going to click this button four times to add the four domains of Store, Address, City and Zip. Now, I do appreciate this window is really small, which makes it difficult to configure the rule. First, for the City and Zip domains, my rule is this: to identify a duplicate, the city and the zip must be an exact match. So, by checking the prerequisite checkbox, you can see that the matching policy says they must be exact. That leaves me to configure the two domains of Store and Address, where I am looking for matches that are similar. The weights in total must add up to 100 percent. So what my matching rule is saying is that, of the four domains, City and Zip must be identical, and Store and Address must be similar, with their similarities weighted at 50 percent each.
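The Duplicate Stores rule can be pictured as a small scoring function: prerequisite domains must match exactly, and the weighted similarities of the remaining domains must reach the minimum score. A rough Python sketch follows, using difflib as a stand-in similarity measure (DQS uses its own internal algorithms, so real DQS scores will differ):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1]; a stand-in for DQS's own measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b, min_score=0.5):
    """Sketch of the Duplicate Stores rule: City and Zip are prerequisites
    (exact match required); Store and Address are 'similar' domains
    weighted 50 percent each. Returns 0.0 for a non-match."""
    # Prerequisite domains: must be identical, or the pair is not a match.
    if rec_a["city"].lower() != rec_b["city"].lower():
        return 0.0
    if rec_a["zip"] != rec_b["zip"]:
        return 0.0
    # Similar domains: the weights total 100 percent (50 + 50).
    score = (0.5 * similarity(rec_a["store"], rec_b["store"])
             + 0.5 * similarity(rec_a["address"], rec_b["address"]))
    return score if score >= min_score else 0.0
```

An identical pair scores 1.0; a pair differing only by an abbreviated address still clears the 50 percent threshold; and a pair in a different zip scores 0.0 regardless of the other domains.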

Now that I have a complete rule defined, I could click Start, which would look for duplicates within the view itself. That is not my intention, though; my intention is that when I load new stores into the master list, we identify the duplicates then. So I don't need to click Start at this point. I do click Next, which takes me to the third step, the matching results. Then I just complete the process: in the bottom right corner here, you'll see that I have a Finish button, and clicking it completes the configuration of the Knowledge Base. And now that I have completed my work, I go ahead and publish.

And so now I can switch across to Excel. I open a workbook that consists of the new resellers I want to add to my master list, and here they are: there are ten in total, so we have a header row and then ten rows representing new resellers. Now, remember I have suspicions about this data. I suspect that there are some duplicates, because some of these stores were already in the master list. So what I am going to do is come across to a new sheet and connect to my Master Data Services instance. In the previous demonstration, in the AdventureWorksBI model, I created the Reseller entity, so I double-click it to load the entity and its members into this worksheet. The next thing is to combine the data in my NewReseller worksheet with the data here in the Reseller entity. Here on the Master Data ribbon, I have the Combine Data function. I select the range that represents the new resellers, including the header, and then I map the columns across to the attributes in the Reseller entity: the name is really the reseller, we don't have a code, the address maps to the address, city maps to city, state maps to state and, I think you have guessed it, zip maps to zip. Simply by clicking Combine, all it is doing for me is appending those ten new resellers to the bottom of the list. Note the color coding here, which we learned in the previous demonstration represents changes made to the list since it was last retrieved. You might also notice the addition of a column here that says what the source is: is it from MDS, or is it external data that does not yet exist in MDS? We can see that those ten rows that were added are classified as external.

Now we can bring the power of Data Quality Services to Master Data Services by saying, let's go ahead and match data. So here on the Master Data ribbon, I click Match Data and select the Knowledge Base. It then says, well, in this Knowledge Base it found a matching policy that uses these four domains, so how do the attributes in the master list map to them? Store is found in the Name attribute, address is found in Address, city is found in City and zip is found in Zip. Now, when I click OK, Master Data Services is assessing the entire set of data, both MDS and external, and it is looking for duplicates. It returns the findings in some additional columns that it adds to the table here. Just to be clear while it is doing this (sorry, I may have blocked your view there): the External column shows the external values that I was pointing out to you earlier.

All right, so what has it found? It has added this additional column here, the Cluster ID. We can add some additional information by saying Show Details. I think what is helpful is to actually look at these by Cluster ID; let me just make this a little easier to see. What we see is that, for example, these two rows here, one being an MDS entry and one being external, are considered very, very similar, and therefore they have been put together under the same Cluster ID. This one here, it says, is 94 percent similar to the one that we have here, and let's take a look at why: the name is pretty much the same, but when we look at the address, we see that this address is in full and this one is abbreviated. So they are indeed the same store; one just happens to be written slightly differently than the other. We have the control here to right-click and delete that row; it is not part of MDS until we publish, so I simply delete it, and therefore it is no longer available for publication. This one here is a 100 percent match, so we can easily delete it. This one here is an 85 percent match, again because of the abbreviations. Again, we have 100 percent. This one here is a 50 percent match, mainly because the name of the company is very, very different. So we might identify that, while the address is the same, perhaps there was a change of ownership and it is truly a different entity, so I do not delete this one.
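The Cluster ID behaves like connected components over pairwise matches: any two rows whose pairwise comparison passes the policy end up sharing an ID. Below is a self-contained union-find sketch of that idea in Python; it is purely conceptual, not how DQS or the add-in is implemented.

```python
from itertools import combinations

def assign_clusters(records, is_match):
    """Assign a cluster ID to each record: records linked by a passing
    pairwise match share an ID (union-find over matching pairs)."""
    parent = list(range(len(records)))

    def find(i):
        # Walk to the root of i's cluster, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i  # merge the two clusters
    return [find(i) for i in range(len(records))]
```

For instance, passing a predicate that compares zip codes exactly puts all rows sharing a zip into one cluster; in the demo, the predicate is the full matching rule.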

All right, so now, just for convenience, I'll hide those extra columns. I can filter by saying, show me just the external data now, and so essentially those ten resellers that were about to be published have been narrowed down to six, and all I need to do is provide a unique code for each of them. Then I can go ahead and publish, and my annotation here will be "adding new stores". A quick publish, and the rest is over to MDS.

And so what we have seen in this demonstration is the combination of Data Quality Services, used to define a Knowledge Base including domain values and a matching policy, and the way that matching policy can help us identify duplicates prior to publication, here in Excel using the MDS Add-in for Excel. Thanks very much for watching.
