Gold Award WinnerGold Award Winner

All current Web search engines have only one input field for the search term and always deliver a result, no matter what you type. This makes us developer’s life difficult, because now end users expect the same in Line-of-Business (LOB) applications like ERP, for example, when searching for an address or an item. This article shows how to implement a similar search function with the Microsoft SQL Server.

Introducing

The built-In function “Full-Text Index” of the Microsoft SQL Server provides an easy way to index all the words in a database, as well as to query them in a flexible and performant way. But there are a few limitations that cause it's not suitable for a similar web search functionality.

  • no suffix (ends with) search possible like *part
  • no typing error tolerance with respect to search term and/or base data
  • only conditional indexing record delimiter and other Noise Word

    An alternative is to implement a own indexing and in this case the so-called N-Gram Index. In a n-gram index all N consecutive characters are counted as an unit (fragment) and indexed, where N can be any numeric value from 1-x. An example with "Hello World" and a 3-grams results in the units
    "HEL", "ELL", "LLO", "LO ", "O W", " WO", "WOR", "ORL", "RLD"

    The higher the N value, the larger the fragment and the more precise is the search, the less the N value the higher the typing error tolerance,


    Normalization

    The aim is to implement a typing error tolerant and accent insensitive search of addresses, for this we have to provide normalized data base for the n-gram index accordingly. This requires a few considerations ahead.

    Umlauts

    German umlauts like Ü can be replaced by UE, so that we can find with the search phrase "Müller" also the name "Mueller". In other languages, there are special characters that come from the Latin alphabet and include indicators for emphasis, as the Polish Ł. Such characters could be also replace.

    Delimiter/Punctuation Mark

    For full-text index delimiters are counted as the end of words and are not indexed. There are, however, company and product names that contain delimiter and therefore they should be in a certain way searchable. Here you have to decide by the data if the delimiters are retained or replaced.

    In this example, all delimiters are replaced by a single character for simplicity. So the structure of the text is retained, but for indexing and searching just one sign is used.

    Special Character

    Again, you have to decide by database whether to replace special characters or not. Good example is the @ sign in e-mail addresses; one would like to specifically search only for the e-mail or also after the parts of the e-mail address like domain name in another data areas.

    This means in this example only

  • digits
  • Western capital letters
  • Exclamation mark ! as a general substitute for all delimiter
  • hash mark # as a general substitute for all other special characters
    are approved as signs of the normalized text.To maintain a central list of allowable characters, a function is created that returns them; global variables are not available in MS SQL Server.


    CREATE FUNCTION dbo.fnGetNGramValidChars()
        RETURNS nvarchar(39)
    AS
    BEGIN
        -- Approved Signs for N-Gram Index
        -- Western capital letters + Space
        -- Digits
        -- Delimiter sign: !
        -- Special character sign: #
        RETURN N'ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789!#';
     
    END;

    All other characters are replaced accordingly. For this another function is created, which normalizes the supplied text. It is later used for both indexing the data and for the normalization of the search term.
    Here the shortened version:

    CREATE FUNCTION dbo.fnNormalizeDatastring
        (@string nvarchar(max))
        RETURNS nvarchar(max)
    AS
    BEGIN
        -- Get all valid characters
        DECLARE @validChars nvarchar(50) = (SELECT dbo.dbo.fnGetNGramValidChars() AS Result);
        -- Only capital characters
        SET @string = UPPER(@string);
        -- Replace umlauts
        SET @string = REPLACE(@string, N'Ä', N'AE');
        SET @string = REPLACE(@string, N'Æ', N'AE');
        -- usw. ...
     
        -- Registered Marks
        SET @string = REPLACE(@string, N'℅', N'C#O');
        SET @string = REPLACE(@string, N'™', N'TM');
        -- usw. ...
     
        -- Replace delimiter by !
        SET @string = REPLACE(@string, N'.', N'!');
        SET @string = REPLACE(@string, N',', N'!');
        -- usw. ...
     
        -- Replace special characters by hash mark.
        SET @string = REPLACE(@string, N'+', N'#');
        SET @string = REPLACE(@string, N'-', N'#');
        -- usw. ...
     
        -- Replace all other invalid characters by space.
        DECLARE @loop int = 1;
        DECLARE @len int = LEN(@string);
        DECLARE @invalidChar nchar(1);
     
        WHILE @loop <= @len
        BEGIN
            SET @invalidChar = SUBSTRING(@string, @loop, 1);
            IF CHARINDEX(@invalidChar, @validChars) <= 0
            BEGIN
                SET @string = REPLACE(@string, @invalidChar, N' ');
            END;
     
            SET @loop = @loop + 1;
        END;
     
        -- Remove multiple spaces
        DECLARE @goOn tinyint = 1;
        DECLARE @oldLen int;
     
        WHILE @goOn <> 0
        BEGIN
            SET @oldLen = LEN(@string);
            SET @string = REPLACE(@string, N'    ', N' ');
            SET @string = REPLACE(@string, N'  ', N' ');
     
            IF @oldLen = LEN(@string)
            BEGIN
                SET @goOn = 0;
            END;
        END;
     
        RETURN @string;
    END;

    Note: Such elaborate string operations are very expensive in SQL Server. It would be conceivable to optimize implementation as CLR Assembly.


    Indexer

    The index can be easily calculated on the number of allowed characters, the position of the character and the N-gram value. For the fragment "BDHZ" the index for 4-Gram is calculated by way of example as below (position is 0-based):
    B = Position  1 * (39^0) = 1
    D = Position  4 * (39^1) = 156
    H = Position  7 * (39^2) = 10,647
    Z = Position 25 * (39^3) = 1,482,975
    Sum = 1,493,779 less integer lower limit -2,147,483,648 => index value -2,145,989,869.

    Take the example of a 4-gram index with the above defined 39 characters, then we can get 39 ^ 4 = 2,313,441 different character combinations (fragments).
    Let we allow negative values, then the value range of the int data type fits up to the 6-gram index values. The number of possible index values for each n-gram:
    N-Gram Formular Value Data Typ
    2-Gram = 39^2 = 1,521 => smallint
    3-Gram = 39^3 = 59,319 => smallint
    4-Gram = 39^4 = 2,313,441 => int
    5-Gram = 39^5 = 90,224,199 => int
    6-Gram = 39^6 = 3,518,743,761 => int
    7-Gram = 39^7 = 137,231,006,679 => bigint

    The index calculation function in T-SQL provide as follows is being assumed, that the fragment is passed normalized.

    CREATE FUNCTION dbo.fnCalculate4GramIndex
        (@fragment nchar(4))
        RETURNS int
    AS
    BEGIN
        -- Get all valid characters
        DECLARE @validChars varchar(50) = (SELECT dbo.fnGetNGramValidChars() AS Result);
        DECLARE @size int = (SELECT LEN(@validChars) AS size);
     
        DECLARE @id int = -2147483648;
        -- Calculate position of each char
        DECLARE @pos1 int = CHARINDEX(SUBSTRING(@fragment, 1, 1), @validChars);
        DECLARE @pos2 int = CHARINDEX(SUBSTRING(@fragment, 2, 1), @validChars);
        DECLARE @pos3 int = CHARINDEX(SUBSTRING(@fragment, 3, 1), @validChars);
        DECLARE @pos4 int = CHARINDEX(SUBSTRING(@fragment, 4, 1), @validChars);
     
        SET @id = @id + (@pos1 - 1) +
                        (@pos2 - 1) * @size +
                        (@pos3 - 1) * POWER(@size, 2) +
                        (@pos4 - 1) * POWER(@size, 3);
     
        RETURN @id;
    END;

    Now, in order to generate all the index values for an entire text, a Table Valued Function (TVF) is created, which breaks down the text into fragments, calculates the values for each and returns the result as a table.

    CREATE FUNCTION dbo.fnBuildIndex4Gram
        (@string nvarchar(max))
        RETURNS @ids TABLE (id int)
    AS
    BEGIN
        SET @string = (SELECT dbo.fnNormalizeDatastring(@string));
        DECLARE @pos int = 1;
        DECLARE @len int = LEN(@string) - 4;
     
        IF @len <= 0
            RETURN;
     
        WHILE @pos <= @len
        BEGIN
            INSERT INTO @ids (id)
            SELECT dbo.fnCalculate4GramIndex(SUBSTRING(@string, @pos, 4));
     
            SET @pos = @pos + 1;
        END
     
        RETURN;
    END;
     
    -- Small test:
    SELECT *
    FROM dbo.fnBuildIndex4Gram(N'Hello World') AS IDX;

    As an example for indexing, the address data used in the scheme "Person" from the AdventureWorks database is used.
    To get all the relevant data in just one field, incl. of 0-n telephone numbers, a view is created that provides the data. This view has the advantage that the data can be extended at any time, without having to adjust other logics and process steps.

    CREATE VIEW Person.vAddressIndexData
    AS
        SELECT ADR.AddressID,
               ADR.AddressLine1 + N' ' +
               ISNULL(ADR.AddressLine2 + N' ', N'') +
               ADR.City + N' ' +
               ADR.PostalCode +  N' ' +
               SPR.StateProvinceCode + N' ' +
               SPR.Name + N' ' +
               CR.Name + N' ' +
               ISNULL(PER.Title + N' ', N'') +
               ISNULL(PER.FirstName + N' ', N'') +
               ISNULL(PER.MiddleName + N' ', N'') +
               ISNULL(PER.LastName + N' ', N'') +
               ISNULL(PER.Suffix + N' ', N'') +
               ISNULL(EA.EmailAddress + N' ', N'') +
               ISNULL((SELECT CAST((SELECT (SELECT PHO.PhoneNumber + ','
                                     FROM Person.PersonPhone AS PHO
                                     WHERE PHO.BusinessEntityID = PER.BusinessEntityID
                                     FOR XML PATH (''), Type
                                   ) AS ColCSV
                                  ) AS nvarchar(MAX))) + N' ', N'')
               AS IndexData
        FROM Person.Address AS ADR
             INNER JOIN
             Person.StateProvince AS SPR
                 ON ADR.StateProvinceID = SPR.StateProvinceID
             INNER JOIN
             Person.CountryRegion AS CR
                 ON SPR.CountryRegionCode = CR.CountryRegionCode
             LEFT JOIN
             Person.BusinessEntityAddress AS BEA
                 ON ADR.AddressID = BEA.AddressID
             LEFT JOIN
             Person.Person AS PER
                 ON BEA.BusinessEntityID = PER.BusinessEntityID
             LEFT JOIN
             Person.EmailAddress AS EA
                 ON PER.BusinessEntityID = EA.BusinessEntityID
                    AND PER.EmailPromotion = EA.EmailAddressID;

    Note: The view is only meant as an example, sense and correctness was not respected here.

    Next, we need a table to save the index values. What is needed is the primary key of the record, the AddressId, and the n-gram index value. Since a fragment / index value can occur more than once for each record, a field for the occurrence is used instead of storing each index value individually; which of course also would work.

    CREATE TABLE dbo.Gram4Index(
        AddressId int NOT NULL,
        ID_4Gram int NOT NULL,
        Occurrence smallint NOT NULL,
        CONSTRAINT PK_Gram4Index PRIMARY KEY CLUSTERED
        (
            AddressId ASC,
            ID_4Gram ASC
        )
    );

    Note: The mixing of different schemes, like here with Person/dbo, usually does not make sense.

    During search only the index value is directly needed, the primary key index is not so suitable for this, so another index is created, which is more selective. To avoid unnecessary key lookups on the address and occurrences in the base table, the fields are included as "include".

    CREATE NONCLUSTERED INDEX IDX_Gram4Index_Search
        ON dbo.Gram4Index
           (ID_4Gram)
        INCLUDE (AddressId, Occurrence);
     
    -- Some additional statistics.
     
    CREATE STATISTICS [STAT_Gram4Index_AddressId]
        ON [dbo].[Gram4Index]
           ([AddressId]);
     
    CREATE STATISTICS [STAT_Gram4Index_Occurrence]
        ON [dbo].[Gram4Index]
           ([Occurrence]);
     
    CREATE STATISTICS [STAT_Gram4Index_AddressId_Occurrence]
        ON [dbo].[Gram4Index]
           (AddressId, [Occurrence]);

    How it should be, filling the index value is done centrally via a stored procedure, which can then be called from other SP's or from triggers. The following SP serves this purpose, it first deletes the old index values and adds the new into the index table for an address; it expects only the PK AddressId as parameters.

    CREATE PROCEDURE dbo.spUpdate4Gram
        @addressId int
    AS
    BEGIN
        SET NOCOUNT ON;
     
        -- Delete old index data
        DELETE FROM dbo.Gram4Index
        WHERE AddressId  = @addressId;
     
        -- Re-create Index for the Address.
        INSERT INTO dbo.Gram4Index
           (AddressId, ID_4Gram, Occurrence)
        SELECT ADR.AddressId, NG.ID AS ID_NGram4, COUNT(*) AS Occurrence
        FROM Person.vAddressIndexData AS ADR
             CROSS APPLY
             dbo.fnBuildIndex4Gram(ADR.IndexData) AS NG
        WHERE ADR.AddressId = @addressId
        GROUP BY ADR.AddressId, NG.ID;
    END;

    Since we have with the Adventure Works an existing database with data, we need additionally a routine to initial process the existing data. For this we use a cursor to process all addresses one by one.

    SET NOCOUNT ON;
    DECLARE @addressId int;
    DECLARE addresses CURSOR LOCAL FOR
        SELECT AddressId
        FROM Person.Address;
     
    OPEN addresses;
     
    FETCH NEXT FROM addresses INTO @addressId;
    WHILE (@@FETCH_STATUS = 0)
    BEGIN
        EXEC dbo.spUpdate4Gram @addressId;
        FETCH NEXT FROM addresses INTO @addressId;
    END;
     
    CLOSE addresses;
    DEALLOCATE addresses;
     
    PRINT N'Finished';



    Space requirements for the index

    Every index requires additional space, with this N-Gram index it's the same, of course.
    For a N-Gram index and a text length of x characters we get (x - N) fragments. For each fragment we save
  • AddressId = int = 4 byte
  • ID_4Gram = int = 4 byte
  • Occurrence = smallint = 2 byte
  • Primary Key = 8 byte
  • Search Index = 10 byte
    in total it is 28 bytes per fragment. If we have a text of 1,000 characters, we get for a 4-Gram index 996 * 28 bytes = 27,888 bytes storage size. In compare a common index on a NVarChar column needs 2 bytes per character, here 2,000 bytes. The N-Gram index requires ~14 times of space and therefore it is from storage side an expensive index. In this particular case, 30.4 MB for data and 21 MB for indexes are needed, thus total 51.4 MB.

    Note: The size calculation is simplified, the more accurate calculation can be read at MSDN: Estimate the Size of a Table


    Search Logic

    Suitable search results are the addresses, that have as many of the same index values as in the search term. The fewer index values matches, the less the address matches with the search term.
    Unlike a normal index, the order of individual term in the search text as well as in the data is almost irrelevant. On the other hand, of course, fragments from the search text can appear in the data also in completely different combinations from different words.
    For searching first the entire search string is separated by the function dbo.fnBuildIndex4Gram as previously in the individual index values, this is joined with the stored index values and then counted the matches, more specifically, the sum of "occurrence" is formed and so frequent occurring in the text matching index values are more heavily weighted.

    CREATE PROCEDURE dbo.spSearch4Gram
        (@search nvarchar(4000),
         @factor decimal(9, 4)
        )
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @len int;
     
        -- Normalize search string.
        SET @search = (SELECT dbo.fnNormalizeDatastring(@search));
        SET @len = (LEN(@search) - 4) * @factor;
        IF @len <= 0
            SET @len = 1;
     
        -- Weighted 4-Gram index search
        ;WITH idx AS
            (SELECT GR4.AddressId, SUM(GR4.Occurrence) AS Weigth
             FROM dbo.fnBuildIndex4Gram(@search) AS BIG
                  INNER JOIN
                  dbo.Gram4Index AS GR4
                      ON GR4.ID_4Gram = BIG.id
             GROUP BY GR4.AddressId
             HAVING SUM(GR4.Occurrence) > @len
            )
        SELECT ADR.AddressID,
               ADR.AddressLine1,
               ADR.City,
               ADR.City,
               idx.Weigth
        FROM Person.Address AS ADR WITH (NOLOCK)
             INNER JOIN
             idx
                 ON ADR.AddressID = idx.AddressId
        ORDER BY idx.weigth DESC;
    END;
     
    -- Test cases
     
    -- Umlaut search => "Buergermeister" as result
    EXEC dbo.spSearch4Gram N'Saarland Bürgermeister', 0.8;
    -- and counter wise
    EXEC dbo.spSearch4Gram N'rotthaeuser germany saarbruecken', 0.8;
     
    -- City name and telephone number without area code.
    EXEC dbo.spSearch4Gram N'Springfield -555-0181', 0.7;
     
    -- Typo: "Bery" instead "Berry"
    EXEC dbo.spSearch4Gram N'Bery court', 0.6;
     
    -- Some part terms
    EXEC dbo.spSearch4Gram N'West gloria California 91791', 0.75;
     
    -- Long search text
    EXEC dbo.spSearch4Gram N'5157 Washington 98027 lane marywood Issaquah', 0.8;
     
    -- Large result set
    EXEC dbo.spSearch4Gram N'Washington', 0.8;



    Summary

    The N-gram index is easy to implement using standard Transact-SQL and can be easily adapted to specific characteristics of each data set and the desired search logic.
    The search function is very performant and the accuracy or tolerance of typing errors can be controlled individually via a parameter, as well as the choice of the N value. These benefits is bought with a high storage requirements for the index values.

    As with every solution you can also still find potential for optimization, for example, the usage of a Column Store Index, but this would restricts the usage to Enterprise Edition.


    Other Languages

    SQL Server: Implementierung eines N-Gram Suchindex (de-DE)

    Weblinks


    https://en.wikipedia.org/wiki/N-gram