none
Cities lookup fuzzy match? RRS feed

  • Question

  • Hi

    I have a Cities table:

    CREATE TABLE [dbo].[tblReferenceCities](
    	[CounterID] [bigint] IDENTITY(1,1) NOT NULL,
    	[CountryCode] [nvarchar](2) NOT NULL,
    	[StateCode] [nvarchar](4) NOT NULL,
    	[CityCode] [nvarchar](5) NOT NULL,
    	[CityName] [nvarchar](255) NOT NULL
    ) ON [PRIMARY]
    

    I have a asp.net webservice that amoung other things asks people to submit a City code. However if they don't know that I want to allow them to enter a City Name in that field instead and then I can then lookup and match it to a City Code so I can insert that into another table, tblTransactions.

    So I could first validate to see if there is a match for the CityCode entered. If not I could search on the data entered and see if I can find a match on CityName.

    Is there a way to get a best match? For example, if they enter:

    Lop Angeles, Las Angelis or Los Wngeles

    I still want any of those to match with my LAX for Los Angeles (I want to be quite generous -- remembering that they will have to enter Country and State codes first so the resulting cities to search from are quite finite).

    However, if they enter:

    ernjjvjpoj veqrpovij

    I don't want any city match, return a "-"

     TIA

    • Moved by Tom PhillipsModerator Thursday, September 8, 2011 1:43 PM TSQL question (From:SQL Server Database Engine)
    Thursday, September 8, 2011 2:00 AM

Answers

All replies

  • You Can Use SoundEX Function 

    Select * from CityMaster Where SOUNDEX(CityName)=SOUNDEX('ABC')
    



    Mark as Answer If Reply Was Helpful Thanks Kuldeep Bisht Technical Lead @ Simplion Technologies http://mssqlguide.kuldeepbisht.com
    Thursday, September 8, 2011 4:51 AM
  • Using:

    SELECT  CityCode,CityName
    FROM  tblTSReferenceCities
    WHERE (CountryCode = @EnterCountryCode)
     AND (StateCode = @EnterStateCode)
     AND SOUNDEX(CityName)=SOUNDEX(@EnterCityName)

    'Los Angelles' returned all of these:

    LKU Lake Almanor
    LA2 Lake Arrowhead
    LE3 Lake Elsinore
    LFC Lake Forest
    LAT Los Alamitos
    XYL Los Alamos
    LLT Los Altos
    LAH Los Altos Hills
    LAX Los Angeles
    LSN Los Banos
    LXX Los Gatos
    IOS Los Molinos
    LNE Los Nietos
    OVS Los Olivos

    I tried:

    SELECT DIFFERENCE(CityName, @EnterCityName) AS Calc, CityCode,CityName
    
    FROM  tblTSReferenceCities
    
    WHERE (CountryCode = @EnterCountryCode) AND (StateCode = @EnterStateCode)
    
    ORDER BY Calc DESC
    
    

    and got:

    4 LKU Lake Almanor
    4 LA2 Lake Arrowhead
    4 LE3 Lake Elsinore
    4 LFC Lake Forest
    4 LAT Los Alamitos
    4 XYL Los Alamos
    4 LLT Los Altos
    4 LAH Los Altos Hills
    4 LAX Los Angeles
    4 LSN Los Banos
    4 LXX Los Gatos
    4 IOS Los Molinos
    4 LNE Los Nietos
    4 OVS Los Olivos
    3 TSO Lost Hills
    3 LGB Long Beach
    3 OXZ Loomis
    3 LOH Lyoth
    3 LB5 Lakehead
    3 LAB Laguna Beach
    ... 


    • Edited by Mark B 123 Thursday, September 8, 2011 11:15 AM
    Thursday, September 8, 2011 11:13 AM
  • Take a look also at

    http://www.merriampark.com/ldtsql.htm

    and threads referenced in

    http://tek-tips.com/viewthread.cfm?qid=1461657


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog
    Thursday, September 8, 2011 2:04 PM
    Moderator
  • I don't know why, but that code (shown below) returns NULL for:

    SELECT [dbo].[LEVENSHTEIN]('los angeles','los angeiles')

    Here's the code:

    ALTER function [dbo].[LEVENSHTEIN]( @s varchar(50), @t
    varchar(50) ) 
    --Returns the Levenshtein Distance between strings s1 and s2.
    --Original developer: Michael Gilleland http://www.merriampark.com/ld.htm
    --Translated to TSQL by Joseph Gama
    returns varchar(50)
    as
    BEGIN
    DECLARE @d varchar(100), @LD int, @m int, @n int, @i
    int, @j int,
    @s_i char(1), @t_j char(1),@cost int
    --Step 1
    SET @n=LEN(@s)
    SET @m=LEN(@t)
    SET @d=replicate(CHAR(0),100)
    If @n = 0
    	BEGIN
    	SET @LD = @m
    	GOTO done
    	END
    If @m = 0
    	BEGIN
    	SET @LD = @n
    	GOTO done
    	END
    --Step 2
    SET @i=0
    WHILE @i<=@n
    	BEGIN
    	SET @d=STUFF(@d,@i+1,1,CHAR(@i))--d(i, 0) = i
    	SET @i=@i+1
    	END
    
    SET @i=0
    WHILE @i<=@m
    	BEGIN
    	SET @d=STUFF(@d,@i*(@n+1)+1,1,CHAR(@i))--d(0, j) = j
    	SET @i=@i+1
    	END
    --goto done
    --Step 3
    	SET @i=1
    	WHILE @i<=@n
    		BEGIN
    		SET @s_i=(substring(@s,@i,1))
    --Step 4
    	SET @j=1
    	WHILE @j<=@m
    		BEGIN
    		SET @t_j=(substring(@t,@j,1))
    		--Step 5
    		If @s_i = @t_j
    			SET @cost=0
    		ELSE
    			SET @cost=1
    --Step 6
    		SET @d=STUFF(@d,@j*(@n+1)+@i+1,1,CHAR(dbo.MIN3(
    		ASCII(substring(@d,@j*(@n+1)+@i-1+1,1))+1,
    		ASCII(substring(@d,(@j-1)*(@n+1)+@i+1,1))+1,
    		ASCII(substring(@d,(@j-1)*(@n+1)+@i-1+1,1))+@cost)
    		))
    		SET @j=@j+1
    		END
    	SET @i=@i+1
    	END      
    --Step 7
    SET @LD = ASCII(substring(@d,@n*(@m+1)+@m+1,1))
    done:
    --RETURN @LD
    --I kept this code that can be used to display the matrix with all calculated values
    --From Query Analyser it provides a nice way to check the algorithm in action
    --
    RETURN @LD
    --declare @z varchar(255)
    --set @z=''
    --SET @i=0
    --WHILE @i<=@n
    --	BEGIN
    --	SET @j=0
    --	WHILE @j<=@m
    --		BEGIN
    --		set @z=@z+CONVERT(char(3),ASCII(substring(@d,@i*(@m+1)+@j+1 ,1)))
    --		SET @j=@j+1 
    --		END
    --	SET @i=@i+1
    --	END
    --print dbo.wrap(@z,3*(@n+1))
    END
    

    And here's the MIN3 function needed to run it: 

    ALTER FUNCTION [dbo].[Min3](@a int, @b int,  @c int )
    RETURNS int
    AS
    BEGIN
        DECLARE @m INT
        SET @m = @a
    
        IF @b < @m SET @m = @b
        IF @c < @m SET @m = @c
        
        RETURN @m
    END
    

     



    • Edited by Mark B 123 Monday, September 12, 2011 2:17 AM
    Monday, September 12, 2011 2:12 AM
  • Here, I would suggest you to place some intelligence in your system.

    • Add a column keywords VARCHAR(MAX) in you table.
    • Add delimiter separated words that you want to match with your CityName column (in case wrongly spelled).
    • Initially you can dump as many keywords as you want and then with time let the different searches do the job for you.
    • Using the SOUNDEX() funtion, once you identify the CityName with which you want to match the user input, add that input into your keywords column. This way your keywords collection will keep increasing giving you more precision over time.
    • Your search sequence will be : CityCode - CityName - Keywords - SOUNDEX()

    Hope, this helps !


    -Vinay Pugalia
    If a post answers your question, please click "Mark As Answer" on that post or "Vote as Helpful".
    Web : Inkey Solutions
    Blog : My Blog
    Email : Vinay Pugalia
    Monday, September 12, 2011 5:21 AM
  • Thanks. Can you or anyone else see a bug in the SQL code above that is causing it to return a NULL for Los Angeilles?
    Tuesday, September 13, 2011 12:13 AM
  • Yes, the @d varchar variable is too small. Try this definition (note the bold):

     

    ALTER function [dbo].[LEVENSHTEIN]( @s varchar(50), @t
    varchar(50) ) 
    --Returns the Levenshtein Distance between strings s1 and s2.
    --Original developer: Michael Gilleland http://www.merriampark.com/ld.htm
    --Translated to TSQL by Joseph Gama
    returns varchar(50)
    as
    BEGIN
    DECLARE @d varchar(500), @LD int, @m int, @n int, @i
    int, @j int,
    @s_i char(1), @t_j char(1),@cost int
    --Step 1
    SET @n=LEN(@s)
    SET @m=LEN(@t)
    SET @d=replicate(CHAR(0),500)
    

     

     

     


    MG.-
    Mariano Gomez, MIS, MCITP, PMP
    IntellPartners, LLC.
    http://www.intellpartners.com/
    Blog http://dynamicsgpblogster.blogspot.com/
    • Edited by Mariano GomezMVP Tuesday, September 13, 2011 1:40 AM
    • Marked as answer by Mark B 123 Tuesday, September 13, 2011 5:28 AM
    Tuesday, September 13, 2011 1:39 AM
  • Thanks Mariano. That works.

    Now I need to get the CityCode for the CityName that has the lowest LEVENSHTEIN number.

    I have tried Sorting the result set:

    SELECT
    
    dbo.LEVENSHTEIN(@EnterCityName, CityName) AS Calc1
    
    ,CityCode
    
    ,CityName
    
    FROM  tblTSReferenceCities
    
    WHERE 
    
    (CountryCode = @EnterCountryCode) AND
    
    (StateCode = @EnterStateCode)
    
    
    
    

    but it doesn't sort because I suspect you can't sort columns that are a result of a function. So I need to create a temporary table to store the results then SELECT TOP(1)?

    I haven't had experience creating temporary tables before. Can someone show me what the SQL code to do that would be (i.e. what the code above would need to be changed to return the best CityCode?

     

     

     


    • Edited by Mark B 123 Tuesday, September 13, 2011 5:27 AM
    Tuesday, September 13, 2011 5:25 AM
  • You could if you commit the initial search as a derived table:

     

    SELECT Calc1, CitiCode, CityName FROM (
      SELECT
         dbo.LEVENSHTEIN(@EnterCityName, CityName) AS Calc1
        ,CityCode
        ,CityName
      FROM  tblTSReferenceCities
      WHERE 
        (CountryCode = @EnterCountryCode) AND
        (StateCode = @EnterStateCode)
    ) LevnshResult 
    ORDER BY Calc1 DESC, etc, etc;
    


    MG.-
    Mariano Gomez, MIS, MCITP, PMP
    IntellPartners, LLC.
    http://www.intellpartners.com/
    Blog http://dynamicsgpblogster.blogspot.com/
    Tuesday, September 13, 2011 11:24 AM
  • Wouldn't fulltext index be usefull here?

     

    Is a predicate used in a WHERE clause to search columns containing character-based data types for precise or fuzzy (less precise) matches to single words and phrases, the proximity of words within a certain distance of one another, or weighted matches.

    In SQL Server, you can use four-part names in CONTAINS or FREETEXT full-text predicates to execute queries against linked servers.

    CONTAINS can search for:

    • A word or phrase.

    • The prefix of a word or phrase.

    • A word near another word.

    • A word inflectionally generated from another (for example, the word drive is the inflectional stem of drives, drove, driving, and driven).

    • A word that is a synonym of another word using a thesaurus (for example, the word metal can have synonyms such as aluminum and steel).

    Tuesday, September 13, 2011 3:16 PM
  • For some reason that still didn't sort correctly but the following worked:

    ALTER PROCEDURE [dbo].[uspTSGeneralReferenceValidateCityGet]
    
    @EnterCountryCode nvarchar(2),
    @EnterStateCode nvarchar(4),
    @EnterCityName nvarchar(200)
    
    AS
     BEGIN
    
    DECLARE @TempResults TABLE 
    (
    TempLevenshteinNumber int,  
    TempCityCode nvarchar(5)
    )
    
    INSERT INTO @TempResults (TempLevenshteinNumber,TempCityCode)
    SELECT
    dbo.LEVENSHTEIN(@EnterCityName, CityName) AS Calc1
    ,CityCode
    FROM  tblTSReferenceCities
    WHERE 
    (CountryCode = @EnterCountryCode) AND
    (StateCode = @EnterStateCode)
    
    SELECT TOP(1) TempCityCode FROM @TempResults
    ORDER BY TempLevenshteinNumber
    
    
    END
    

    Thanks very much everybody for all of your help. Seems to be all done now.

    Mark
     

     

    Tuesday, September 13, 2011 11:01 PM
  • That's the beauty of SQL... there's no one way to do things and it comes down to what works best for your particular situation. What I initially posted, worked well in my environment, with the limited data I had.
    MG.-
    Mariano Gomez, MIS, MCITP, PMP
    IntellPartners, LLC.
    http://www.intellpartners.com/
    Blog http://dynamicsgpblogster.blogspot.com/
    Tuesday, September 13, 2011 11:03 PM