Generic Word breaker in SQL FTS

    Question

  • Hello everyone,

    I am working on an application that requires full-text search on the database.

    I understand that the database will contain data in multiple languages (e.g. US English, German, Japanese, Spanish), depending on the customer location.

    We realized that the default word breakers Microsoft provides out of the box for each language have some limitations with respect to tokenization.

    If we develop a customized, language-independent word breaker, can we use the same word breaker for all of the above languages?

    Or am I required to develop a customized word breaker for each language?

    Is the first solution I am proposing a viable option?

    Can we have a language-independent word breaker?

    Please provide your comments, and let me know if you need more information regarding this.

    Thanks

    Thursday, February 23, 2012 6:29 AM

All replies

  • You should read this from the SQL Server documentation.  It gives you some pretty good guidance:

    http://msdn.microsoft.com/en-us/library/ms142507.aspx

    It offers 3 possible solutions for using different word breakers for different content: (1) separate columns for radically different languages, (2) use binary objects such as Word documents, since they carry language information with them, or (3) include HTML language tags in text columns.

    For European Roman alphabet languages, the German word breaker will probably work fine for English and Spanish, but not for Japanese. Japanese, other East Asian languages, Arabic, etc. are radically different from European languages. I cannot see how a single word breaker will meet all needs.
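
    Option (1) above can be sketched as follows. This is a minimal illustration, not a recommended schema: the table, column, key index, and catalog names are hypothetical, and the LCIDs assume the English (1033), German (1031), and Japanese (1041) word breakers are installed on the instance.

```sql
-- Hypothetical names throughout: dbo.Documents, PK_Documents, DocsCatalog.
CREATE FULLTEXT CATALOG DocsCatalog;

-- Each column is indexed with the word breaker for its own language.
CREATE FULLTEXT INDEX ON dbo.Documents
(
    BodyEnglish  LANGUAGE 1033,
    BodyGerman   LANGUAGE 1031,
    BodyJapanese LANGUAGE 1041
)
KEY INDEX PK_Documents
ON DocsCatalog;

-- Query with the matching language so the search term is broken the same way as the indexed text.
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(BodyJapanese, N'"東京"', LANGUAGE 1041);
```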

    RLF

    Thursday, February 23, 2012 4:09 PM
  • Hi Russel,

    Thanks for your quick response. We are in the process of evaluating the need for a customized word breaker for our application, which will contain data in multiple languages (assume one instance, one language, depending on customer location).

    With this being the main requirement, can you please guide me on the questions below?

    1.Basically I want to treat special characters differently from the default word breaker behavior, which treats special characters as delimiters. Is it possible to achieve this with a custom word breaker?

    2.How to resolve the multiple languages support along with a customized word breaker for my application?

    2.Under what circumstances people are going for customized word breaker?

    3.How to design/develop a new word breaker for specified language?

    4.Can I have the default English word breaker for one database and a customized English word breaker for another database that uses FTS functionality in a given instance of SQL Server?

    5.What are the side effects of having customized word breaker?

    Thanks & Regards

    Samba

    Friday, February 24, 2012 3:23 AM
  • 1.Basically I want to treat special characters differently from the default word breaker behavior, which treats special characters as delimiters. Is it possible to achieve this with a custom word breaker?

    Did you see these posts on creating word breakers and using custom dictionaries?

    http://msdn.microsoft.com/en-us/library/windows/desktop/ff819112%28v=vs.85%29.aspx#implementing_a_word_beaker

    Creating Custom Dictionaries for special terms to be indexed 'as-is' in SQL Server 2008 Full-Text Indexes

    2.How to resolve the multiple languages support along with a customized word breaker for my application? 

    My previous post mentioned language tags, since without knowing what language you are breaking, how do you decide?  If you simply break on white space or some simple punctuation, you might get it to work mostly OK.

    But what will you do for Chinese and Japanese, which do not have western-style space usage?
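
    The difference can be inspected with sys.dm_fts_parser, which returns the tokens a given word breaker produces. A sketch, assuming the neutral (LCID 0) and Japanese (LCID 1041) word breakers are available on the instance; the sample sentence is only illustrative:

```sql
-- Neutral word breaker (LCID 0): breaks mainly on white space and punctuation.
SELECT display_term, occurrence
FROM sys.dm_fts_parser(N'"東京都に住んでいます"', 0, 0, 0);

-- Japanese word breaker (LCID 1041): segments the same text linguistically.
SELECT display_term, occurrence
FROM sys.dm_fts_parser(N'"東京都に住んでいます"', 1041, 0, 0);
```

    Comparing the two result sets shows how much of the text a white-space-based breaker leaves as a single token.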

    2.Under what circumstances people are going for customized word breaker?

    Wanting to include special characters, wanting to handle a language for which there is no effective word-breaker, etc.

    3.How to design/develop a new word breaker for specified language?

    See the post above and look at these comments from Hilary Cotter: http://social.msdn.microsoft.com/Forums/en/sqlsearch/thread/ece84d87-1a19-463b-9c80-a45a47242277

    4.Can I have the default English word breaker for one database and a customized English word breaker for another database that uses FTS functionality in a given instance of SQL Server?

    Word-Breakers are server resources.  So you cannot have two word breakers for the same language.  (Of course, American English and British English are separate breakers, so you could keep one and change the other.)
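
    To see which languages are registered on an instance, and which language each indexed column uses, the catalog views can be queried. A small sketch; the LIKE filter is just a convenience for spotting the English variants:

```sql
-- Languages (and therefore word breakers) registered with this instance.
SELECT lcid, name
FROM sys.fulltext_languages
WHERE name LIKE N'%English%';

-- Language assigned to each full-text indexed column in the current database.
SELECT OBJECT_NAME(object_id) AS table_name, column_id, language_id
FROM sys.fulltext_index_columns;
```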

    5.What are the side effects of having customized word breaker?

    More work and you are responsible for the reliability and stability of the code.

    All the best,
    RLF

    Friday, February 24, 2012 3:05 PM
  • Hi Russel,

    Below is a quick question on custom dictionary implementation.

    I have gone through the steps involved for developing a custom dictionary.
    Currently I am using SQL Server 2008 R2 on an XP machine, and I couldn't find the DLLs below anywhere on my machine.
    1.      NlsData0009.dll
    2.      NlsLexicons0009.dll
    3.      NlsGrammars0009.dll

    Please let me know how to implement a custom dictionary for the English word breaker when SQL Server 2008 R2 is running on XP.

    Or is the custom dictionary feature (or the above-mentioned DLLs) available only on Vista and later operating systems?
    What about Windows Server 2003 and 2008?

    Thanks & Regards

    Samba

    Monday, February 27, 2012 10:27 AM
  • Why do you want a custom dictionary for English? What added functionality do you need?

    looking for a book on SQL Server 2008 Administration? http://www.amazon.com/Microsoft-Server-2008-Management-Administration/dp/067233044X looking for a book on SQL Server 2008 Full-Text Search? http://www.amazon.com/Pro-Full-Text-Search-Server-2008/dp/1430215941

    Monday, February 27, 2012 2:14 PM
  • Hi Hilary,

    While exploring the SQL Server 2008 R2 FTS features, I came across the custom dictionary feature for indexing words as is.

    For example, words like AT&T and ISBN-7 are indexed as is, along with the usual tokens generated by the default word breaker.

    So I just want to try out this feature on my SQL Server instance, which is on XP.

    I created the Custom1033.lex file for the word AT&T and placed the file in the BINN folder of my SQL Server instance.

    Select * from sys.dm_fts_parser('AT&T',1033,5,0) without any stop words results in the following output.

    Only at and t are indexed, not at&t, which is the desired output with the custom dictionary. Can you please let me know what mistake I am making in testing the custom dictionary feature? Are there any additional settings needed on an XP machine to use the custom dictionary feature for English?
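
    One thing that may be worth ruling out in a test like this: in CONTAINS syntax an unquoted & is the AND operator, so the query string may be split into two terms before the word breaker sees it. A minimal sketch, assuming the same LCID and stoplist ID as above, that passes the term as a quoted phrase instead. The restart is included on the assumption that the dictionary file is only read when the filter daemon hosts start:

```sql
-- Assumption: Custom1033.lex is already in the instance's Binn folder.
-- Restart the filter daemon hosts so a newly added dictionary file can be picked up.
EXEC sp_fulltext_service 'restart_all_fdhosts';

-- Double quotes inside the string make AT&T one phrase term rather than "AT AND T".
SELECT keyword, display_term, special_term, occurrence
FROM sys.dm_fts_parser(N'"AT&T"', 1033, 5, 0);
```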

    Tuesday, February 28, 2012 3:54 AM