All current Web search engines have only one input field for the search term and always deliver a result, no matter what you type. This makes developers' lives difficult, because end users now expect the same in Line-of-Business (LOB) applications such as ERP, for example when searching for an address or an item. This article shows how to implement a similar search function with Microsoft SQL Server.
The built-in "Full-Text Index" feature of Microsoft SQL Server provides an easy way to index all the words in a database and to query them in a flexible and performant way. But it has a few limitations that make it unsuitable for a web-like search functionality.
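For comparison, a minimal full-text setup could look like the following sketch; the catalog name ftcAddressDemo is made up for illustration, and KEY INDEX refers to the primary key of Person.Address in AdventureWorks. Such a query finds whole words, but tolerates neither typos nor arbitrary substrings:

-- Hypothetical full-text example for comparison (ftcAddressDemo is a made-up name).
CREATE FULLTEXT CATALOG ftcAddressDemo AS DEFAULT;
CREATE FULLTEXT INDEX ON Person.Address (AddressLine1)
    KEY INDEX PK_Address_AddressID;
-- Word-based search: finds "Washington", but not "Washingtn" or "ashingto".
SELECT AddressID, AddressLine1
FROM Person.Address
WHERE CONTAINS(AddressLine1, N'Washington');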
An alternative is to implement your own indexing, in this case the so-called N-Gram index. In an n-gram index, every N consecutive characters are treated as one unit (fragment) and indexed, where N can be any value from 1 to x. An example with "Hello World" and 3-grams results in the units "HEL", "ELL", "LLO", "LO ", "O W", " WO", "WOR", "ORL", "RLD". The higher the N value, the larger the fragment and the more precise the search; the lower the N value, the higher the tolerance for typing errors.
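As a quick illustration, a few lines of T-SQL are enough to break a text into its 3-grams; this is only a sketch to show the principle, not part of the later solution:

-- Print all 3-character fragments (3-grams) of a sample text.
DECLARE @text nvarchar(100) = N'HELLO WORLD';
DECLARE @pos int = 1;
WHILE @pos <= LEN(@text) - 2
BEGIN
    PRINT SUBSTRING(@text, @pos, 3);
    SET @pos = @pos + 1;
END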
The aim is to implement a typing-error-tolerant and accent-insensitive search over addresses; for this we have to provide a correspondingly normalized data basis for the n-gram index. This requires a few considerations up front.
For a full-text index, delimiters count as word boundaries and are not indexed. There are, however, company and product names that contain delimiters and should therefore remain searchable in some way. Here you have to decide based on the data whether the delimiters are retained or replaced. In this example, all delimiters are replaced by a single character for simplicity. The structure of the text is thus retained, but only one sign is used for indexing and searching.
Again, you have to decide based on the data whether to replace special characters or not. A good example is the @ sign in e-mail addresses: do you want to search specifically for the complete e-mail address only, or also for parts of it, such as the domain name, in other data areas?
For this example this means that only the following characters are permitted:
CREATE FUNCTION dbo.fnGetNGramValidChars()
RETURNS nvarchar(39)
AS
BEGIN
    -- Approved signs for the N-Gram index:
    -- Western capital letters + space
    -- Digits
    -- Delimiter sign: !
    -- Special character sign: #
    RETURN N'ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789!#';
END
All other characters are replaced accordingly. For this, another function is created that normalizes the supplied text. It is later used both for indexing the data and for normalizing the search term. Here is the shortened version:
CREATE FUNCTION dbo.fnNormalizeDatastring (@string nvarchar(max))
RETURNS nvarchar(max)
AS
BEGIN
    -- Get all valid characters
    DECLARE @validChars nvarchar(50) = (SELECT dbo.fnGetNGramValidChars() AS Result);

    -- Only capital characters
    SET @string = UPPER(@string);

    -- Replace umlauts
    SET @string = REPLACE(@string, N'Ä', N'AE');
    SET @string = REPLACE(@string, N'Æ', N'AE');
    -- etc. ...

    -- Registered marks
    SET @string = REPLACE(@string, N'℅', N'C#O');
    SET @string = REPLACE(@string, N'™', N'TM');

    -- Replace delimiters by !
    SET @string = REPLACE(@string, N'.', N'!');
    SET @string = REPLACE(@string, N',', N'!');

    -- Replace special characters by hash mark.
    SET @string = REPLACE(@string, N'+', N'#');
    SET @string = REPLACE(@string, N'-', N'#');

    -- Replace all other invalid characters by space.
    DECLARE @loop int = 1;
    DECLARE @len int = LEN(@string);
    DECLARE @invalidChar nchar(1);

    WHILE @loop <= @len
    BEGIN
        SET @invalidChar = SUBSTRING(@string, @loop, 1);
        IF CHARINDEX(@invalidChar, @validChars) <= 0
            SET @string = REPLACE(@string, @invalidChar, N' ');
        SET @loop = @loop + 1;
    END

    -- Remove multiple spaces
    DECLARE @goOn tinyint = 1;
    DECLARE @oldLen int;

    WHILE @goOn <> 0
    BEGIN
        SET @oldLen = LEN(@string);
        SET @string = REPLACE(@string, N'  ', N' ');
        IF @oldLen = LEN(@string)
            SET @goOn = 0;
    END

    RETURN @string;
END
Note: Such elaborate string operations are very expensive in SQL Server. It would be conceivable to optimize the implementation as a CLR assembly.
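If you want to stay in pure T-SQL, the number of individual REPLACE calls can at least be reduced. A sketch, assuming SQL Server 2017 or later, where TRANSLATE maps a whole set of characters in a single pass:

-- Assumption: SQL Server 2017+. TRANSLATE maps each character of the second
-- argument to the character at the same position in the third argument.
DECLARE @s nvarchar(max) = N'MAIN-STREET 1, 2ND FLOOR';
SET @s = TRANSLATE(@s, N'.,;:', N'!!!!');   -- delimiters => !
SET @s = TRANSLATE(@s, N'+-*/', N'####');   -- special characters => #
SELECT @s;   -- MAIN#STREET 1! 2ND FLOOR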
Take the example of a 4-gram index with the 39 characters defined above: this gives 39^4 = 2,313,441 different character combinations (fragments). If we also allow negative values, the value range of the int data type is sufficient up to the 6-gram index. The number of possible index values for each n-gram:

N-Gram: Formula = Value => Data Type
2-Gram: 39^2 = 1,521 => smallint
3-Gram: 39^3 = 59,319 => smallint
4-Gram: 39^4 = 2,313,441 => int
5-Gram: 39^5 = 90,224,199 => int
6-Gram: 39^6 = 3,518,743,761 => int
7-Gram: 39^7 = 137,231,006,679 => bigint
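The figures in the table can be verified directly in T-SQL, for example with this small sketch:

-- Possible fragment combinations per n-gram length, based on 39 valid characters.
SELECT G.N, POWER(CAST(39 AS bigint), G.N) AS Combinations
FROM (VALUES (2), (3), (4), (5), (6), (7)) AS G(N);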
The index calculation function in T-SQL looks as follows; it is assumed that the fragment passed in is already normalized.
CREATE FUNCTION dbo.fnCalculate4GramIndex (@fragment nchar(4))
RETURNS int
AS
BEGIN
    DECLARE @validChars varchar(50) = (SELECT dbo.fnGetNGramValidChars() AS Result);
    DECLARE @size int = (SELECT LEN(@validChars));
    DECLARE @id int = -2147483648;

    -- Calculate position of each char
    DECLARE @pos1 int = CHARINDEX(SUBSTRING(@fragment, 1, 1), @validChars);
    DECLARE @pos2 int = CHARINDEX(SUBSTRING(@fragment, 2, 1), @validChars);
    DECLARE @pos3 int = CHARINDEX(SUBSTRING(@fragment, 3, 1), @validChars);
    DECLARE @pos4 int = CHARINDEX(SUBSTRING(@fragment, 4, 1), @validChars);

    SET @id = @id + (@pos1 - 1) +
              (@pos2 - 1) * @size +
              (@pos3 - 1) * POWER(@size, 2) +
              (@pos4 - 1) * POWER(@size, 3);

    RETURN @id;
END
Now, in order to generate all index values for an entire text, a table-valued function (TVF) is created, which breaks the text down into fragments, calculates the index value for each of them and returns the result as a table.
CREATE FUNCTION dbo.fnBuildIndex4Gram (@string nvarchar(max))
RETURNS @ids TABLE (id int)
AS
BEGIN
    SET @string = (SELECT dbo.fnNormalizeDatastring(@string));

    DECLARE @pos int = 1;
    DECLARE @len int = LEN(@string) - 4;
    IF @len <= 0
        RETURN;

    WHILE @pos <= @len
    BEGIN
        INSERT INTO @ids (id)
        SELECT dbo.fnCalculate4GramIndex(SUBSTRING(@string, @pos, 4));
        SET @pos = @pos + 1;
    END

    RETURN;
END
GO

-- Small test:
SELECT *
FROM dbo.fnBuildIndex4Gram(N'Hello World') AS IDX;
As an example for indexing, the address data of the "Person" schema from the AdventureWorks database is used. To get all relevant data into a single field, including 0-n telephone numbers, a view is created that provides the data. The view has the advantage that the data can be extended at any time without having to adjust other logic and processing steps.
CREATE VIEW Person.vAddressIndexData
AS
SELECT ADR.AddressID,
       ADR.AddressLine1 + N' ' +
       ISNULL(ADR.AddressLine2 + N' ', N'') +
       ADR.City + N' ' +
       ADR.PostalCode + N' ' +
       SPR.StateProvinceCode + N' ' +
       SPR.Name + N' ' +
       CR.Name + N' ' +
       ISNULL(PER.Title + N' ', N'') +
       ISNULL(PER.FirstName + N' ', N'') +
       ISNULL(PER.MiddleName + N' ', N'') +
       ISNULL(PER.LastName + N' ', N'') +
       ISNULL(PER.Suffix + N' ', N'') +
       ISNULL(EA.EmailAddress + N' ', N'') +
       ISNULL((SELECT CAST((SELECT PHO.PhoneNumber + N' '
                            FROM Person.PersonPhone AS PHO
                            WHERE PHO.BusinessEntityID = PER.BusinessEntityID
                            FOR XML PATH(''), Type
                           ) AS nvarchar(MAX)) AS ColCSV
              ), N'') AS IndexData
FROM Person.Address AS ADR
     INNER JOIN Person.StateProvince AS SPR
         ON ADR.StateProvinceID = SPR.StateProvinceID
     INNER JOIN Person.CountryRegion AS CR
         ON SPR.CountryRegionCode = CR.CountryRegionCode
     LEFT JOIN Person.BusinessEntityAddress AS BEA
         ON ADR.AddressID = BEA.AddressID
     LEFT JOIN Person.Person AS PER
         ON BEA.BusinessEntityID = PER.BusinessEntityID
     LEFT JOIN Person.EmailAddress AS EA
         ON PER.BusinessEntityID = EA.BusinessEntityID
        AND PER.EmailPromotion = EA.EmailAddressID;
Note: The view is only meant as an example; no attention was paid to whether it is semantically meaningful or correct.
Next, we need a table to store the index values. Required are the primary key of the record, the AddressId, and the n-gram index value. Since a fragment / index value can occur more than once per record, a field for the number of occurrences is used instead of storing every single index value individually, which of course would also work.
CREATE TABLE dbo.Gram4Index(
    AddressId int NOT NULL,
    ID_4Gram int NOT NULL,
    Occurrence smallint NOT NULL,
    CONSTRAINT PK_Gram4Index
        PRIMARY KEY CLUSTERED (AddressId ASC, ID_4Gram ASC)
);
Note: Mixing different schemas, as here with Person/dbo, usually does not make sense.
During the search only the index value is needed directly, and the primary key index is not well suited for that, so another, more selective index is created. To avoid unnecessary key lookups into the base table for the address and the occurrence, these fields are added as INCLUDE columns.
CREATE NONCLUSTERED INDEX IDX_Gram4Index_Search
    ON dbo.Gram4Index (ID_4Gram)
    INCLUDE (AddressId, Occurrence);

-- Some additional statistics.
CREATE STATISTICS [STAT_Gram4Index_AddressId]
    ON [dbo].[Gram4Index] ([AddressId]);
CREATE STATISTICS [STAT_Gram4Index_Occurrence]
    ON [dbo].[Gram4Index] ([Occurrence]);
CREATE STATISTICS [STAT_Gram4Index_AddressId_Occurrence]
    ON [dbo].[Gram4Index] (AddressId, [Occurrence]);
As it should be, filling the index table is done centrally via a stored procedure, which can then be called from other stored procedures or from triggers (a trigger sketch follows below). The following procedure serves this purpose: it first deletes the old index values for an address and then inserts the new ones into the index table; it expects only the primary key AddressId as a parameter.
CREATE PROCEDURE dbo.spUpdate4Gram
    @addressId int
AS
BEGIN
    SET NOCOUNT ON;

    -- Delete old index data
    DELETE FROM dbo.Gram4Index
    WHERE AddressId = @addressId;

    -- Re-create index for the address.
    INSERT INTO dbo.Gram4Index (AddressId, ID_4Gram, Occurrence)
    SELECT ADR.AddressId,
           NG.ID AS ID_NGram4,
           COUNT(*) AS Occurrence
    FROM Person.vAddressIndexData AS ADR
         CROSS APPLY dbo.fnBuildIndex4Gram(ADR.IndexData) AS NG
    WHERE ADR.AddressId = @addressId
    GROUP BY ADR.AddressId, NG.ID;
END
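As mentioned above, the procedure can also be called from triggers so that the n-gram index stays current when addresses change. A minimal sketch for Person.Address; the trigger name is an assumption, and tables such as Person.Person or Person.PersonPhone would need similar triggers:

-- Sketch: keep the 4-gram index up to date for inserted / updated addresses.
CREATE TRIGGER Person.trAddress_Update4Gram
ON Person.Address
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @addressId int;
    DECLARE changed CURSOR LOCAL FAST_FORWARD FOR
        SELECT AddressID FROM inserted;
    OPEN changed;
    FETCH NEXT FROM changed INTO @addressId;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC dbo.spUpdate4Gram @addressId;
        FETCH NEXT FROM changed INTO @addressId;
    END
    CLOSE changed;
    DEALLOCATE changed;
END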
Since AdventureWorks is an existing database that already contains data, we additionally need a routine to process the existing data initially. For this we use a cursor that processes all addresses one by one.
DECLARE @addressId int;

DECLARE addresses CURSOR LOCAL FOR
    SELECT AddressID
    FROM Person.Address;

OPEN addresses;
FETCH NEXT FROM addresses INTO @addressId;

WHILE (@@FETCH_STATUS = 0)
BEGIN
    EXEC dbo.spUpdate4Gram @addressId;
    FETCH NEXT FROM addresses INTO @addressId;
END

CLOSE addresses;
DEALLOCATE addresses;

PRINT N'Finished';
Note: The size calculation is simplified; a more accurate calculation is described on MSDN: Estimate the Size of a Table.
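Instead of estimating, the actual space consumption of the filled index table can simply be measured, for example with the system procedure sp_spaceused:

-- Show row count and space used by the n-gram index table.
EXEC sp_spaceused N'dbo.Gram4Index';

Now for the search itself. The following stored procedure normalizes the search term, generates its 4-gram index values and returns the matching addresses weighted by the number of fragment hits; the @factor parameter controls how large the share of matching fragments has to be.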
CREATE PROCEDURE dbo.spSearch4Gram
    (@search nvarchar(4000),
     @factor decimal(9, 4))
AS
BEGIN
    -- Normalize search string.
    SET @search = (SELECT dbo.fnNormalizeDatastring(@search));

    -- Minimum number of fragments that must match.
    DECLARE @len int = (LEN(@search) - 4) * @factor;
    IF @len <= 0
        SET @len = 1;

    -- Weighted 4-Gram index search
    WITH idx AS
        (SELECT GR4.AddressId,
                SUM(GR4.Occurrence) AS Weight
         FROM dbo.fnBuildIndex4Gram(@search) AS BIG
              INNER JOIN dbo.Gram4Index AS GR4
                  ON GR4.ID_4Gram = BIG.id
         GROUP BY GR4.AddressId
         HAVING SUM(GR4.Occurrence) > @len)
    SELECT ADR.AddressLine1,
           ADR.City,
           idx.Weight
    FROM idx
         INNER JOIN Person.Address AS ADR WITH (NOLOCK)
             ON ADR.AddressID = idx.AddressId
    ORDER BY idx.Weight DESC;
END
-- Test cases
-- Umlaut search => "Buergermeister" as result
EXEC dbo.spSearch4Gram N'Saarland Bürgermeister', 0.8;
-- ... and the other way round
EXEC dbo.spSearch4Gram N'rotthaeuser germany saarbruecken', 0.8;  -- factor value assumed
-- City name and telephone number without area code.
EXEC dbo.spSearch4Gram N'Springfield -555-0181', 0.7;
-- Typo: "Bery" instead of "Berry"
EXEC dbo.spSearch4Gram N'Bery court', 0.6;
-- Some partial terms
EXEC dbo.spSearch4Gram N'West gloria California 91791', 0.75;
-- Long search text
EXEC dbo.spSearch4Gram N'5157 Washington 98027 lane marywood Issaquah', 0.75;  -- factor value assumed
-- Large result set
EXEC dbo.spSearch4Gram N'Washington', 0.5;  -- factor value assumed
As with every solution, there is still potential for optimization, for example the use of a columnstore index, but that would restrict the solution to the Enterprise Edition.
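A sketch of such an index; note that this assumes a version and edition that support it, and that on SQL Server 2012/2014 a nonclustered columnstore index makes the table read-only:

-- Possible columnstore optimization for the index table (the index name is an assumption).
CREATE NONCLUSTERED COLUMNSTORE INDEX CSX_Gram4Index
    ON dbo.Gram4Index (AddressId, ID_4Gram, Occurrence);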