Fuzzy Matching Transformation RRS feed

  • Question

  • Hi, 

    I am working on the ETL design for the CRM system and would need to device a matching logic while matching person information in the CRM database ( person table ) with the information for the person in the stg table. 

    The Match is based on:

    1. First Name 

    2.Last Name 

    3. Email Address 

    4.Email Address 

    5.Phone Number 

    6. Address Line1 

    7.Address Line2 



    10.Postal code 

    If there is a match found a new record in the CRM person table would not be created. But if the match is not found a new record will be created. Due the nature of the data coming in from different sources I tried using the Fuzzy Lookup Transformation in SSIS. Based on the _similarity score and the threshold defined (>=8.85) the record is either ignored or created in the CRM database. 

    Now when database will be back filled the person table would have 10 M-12 M rows.  I have read that it is best practice to use the Fuzzy Lookup Transformation in SSIS with small datasets . The staging data sets would be approx 4000 records daily but the reference data set would be 10-12 M records. 

    What is the best method to achieve such kind of matching logic in SSIS ? Is it a good approach to use to stored procedure and function (Tsql) to achieve this ? Which approach would be faster and optimal ? ( Considering we have indexes set up on the person table) 

    Please advice ......



    Monday, September 10, 2012 2:11 PM

All replies

  • 10-12M could be a little to much... what are the specs of your machine?

    But you could also try to reduce the number of columns. Fuzzy matching on numbers, codes and fields like emailaddresses isn't very usefull.
    Lastname, addressline1 and city are good candidates.

    Please mark the post as answered if it answers your question | My SSIS Blog: | Twitter

    Monday, September 10, 2012 2:21 PM
  • Thank you for the prompt response. machine specs are : 8 gigs memory, 64 bit - windows server . The business needs to match on first name, last name , emailaddress, addressline1, city . It would be difficult to eliminate emailaddress due to business requirements. 

    I have used t-sql stored proc and function to apply the matching logic in prior projects but was concerned if I can use the capabilities of SSIS 2008 ( Fuzzy Logic Tranformation ) to achieve this ? Would implementing it in SSIS  be any faster than TSQL programming ? Also I noticed that Fuzzy Logic Creates some index tables when it processes records ..would be required to clean them up as the matching is completed .....but I am not sure what is the best practice when it comes to comparing records  ( 10-12 M ) with the staging records ( 4000 ) and then determine whether its a new record or an existing record. The aim is to make sure no duplicates are created in the person table. 

    Please advice ....



    Monday, September 10, 2012 3:03 PM