Lemmatization and connected words

  • Question

    Hi all,

     

    This is probably documented somewhere, but I can't for the life of me find it.

     

    We have enabled stemming in our environment for some managed properties. It's working to some extent. We have discovered a problem though: it seems that FS4SP won't lemmatize words that consist of two connected words.

     

    Example:

    A company has "Usersummaries" as a business concept/often used word (a weird one in English, but let's take it as an example fwiw).

    We search for "Summary" and get hits for "Summary" and "Summaries", and also where the word is part of a connected word, like "usersummaries".

    We search for "Usersummary", but it will NOT match "Usersummaries". 

     

    Example 2:

    I search for "leadereducation" and do not get hits for "leadereducations". I search for "education" and also get hits for "educations".

    Now as I understand FS4SP, the word should be split into two words: "User" and "Summaries". "Summaries" should then be lemmatized to "Summary". If I search for "Summary" as a single word it will also return hits for "Summaries". Does this mean that FS4SP can't split connected words and also apply lemmatization and stemming?

     


    • Edited by tarjeieo Wednesday, May 18, 2011 11:26 AM Added example
    Wednesday, May 18, 2011 11:12 AM

All replies

  • FAST's tokenizer will not split up the words at indexing time. In order for a word to be lemmatized it needs to match what is in the internal dictionaries.

    There are languages where "connected words" are actually valid, but English is not one of them.

     

    Wednesday, May 18, 2011 7:59 PM
  • Hi,

     

    "bdw" is right: the FAST tokenizer is not able to split compound words. And that limitation applies also to those languages where compound (or "connected") words actually are valid, such as German, Danish and Norwegian.

    A possible workaround is to use a custom property extractor (http://msdn.microsoft.com/en-US/library/ff795797.aspx#custom-prop-mapping) and define mappings there between the specific terms (e.g. usersummary -> usersummaries). You would then need to map the output of that into the default full-text index. If there is a limited number of such terms, this approach may be OK.

    Note that the suggested solution is case-sensitive, so you may need to apply some logic like in this script: http://gallery.technet.microsoft.com/scriptcenter/15f58a4a-42e5-44b8-b17e-624b83d6f902

     

    Regards


    Thomas Svensen | Microsoft Enterprise Search Practice
    Thursday, May 19, 2011 6:26 AM
    Moderator
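
The mapping idea in the reply above can be illustrated with a minimal Python sketch. It does not use the FS4SP custom property extractor or its dictionary format; the term groups and the function names `build_lookup` and `expand` are hypothetical. It only shows how a small, hand-maintained list of variants could be matched case-insensitively so that the extra forms can be added to the indexed content.

```python
# Illustration only: the term groups and helper names are hypothetical,
# and nothing here calls an FS4SP API.
VARIANT_GROUPS = [
    ["usersummary", "usersummaries"],
    ["leadereducation", "leadereducations"],
]

PUNCTUATION = ".,;:!?\"'()"


def build_lookup(groups):
    """Map every lowercased term to the sorted list of terms in its group."""
    lookup = {}
    for group in groups:
        normalized = sorted({term.lower() for term in group})
        for term in normalized:
            lookup[term] = normalized
    return lookup


def expand(text, lookup):
    """Return the extra variants to index alongside the words found in text."""
    extra = []
    for token in text.split():
        # Lowercasing the token sidesteps the case-sensitivity problem.
        key = token.strip(PUNCTUATION).lower()
        for variant in lookup.get(key, []):
            if variant != key and variant not in extra:
                extra.append(variant)
    return extra


lookup = build_lookup(VARIANT_GROUPS)
print(expand("Our Usersummaries are due.", lookup))
```

Because every term in a group maps to the whole group, the same lookup works in both directions (singular to plural and plural to singular). This only stays manageable while the list of such terms is small.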
  • How large is your term set? The synonym system can be used to tackle this on a small scale as well. Otherwise, if this is a business requirement, you will have to do some interesting things.

    Thursday, May 19, 2011 6:38 AM
  • Okay, this makes sense.

    I have one question though: why does a single word match a connected word? E.g. I search for 'education' and get hits for 'leadershipeducation'.

    Thursday, May 19, 2011 8:34 AM
  • Good question.

     

    I remember now hearing about FAST getting support for "breaking up" connected words to increase recall in languages such as Norwegian and German. Unfortunately, it seems they didn't integrate that process with lemmatization, so you just get the base forms.

    You could verify that by looking at the FIXML, as described here: Seeing what actually gets indexed

    Regards


    Thomas Svensen | Microsoft Enterprise Search Practice
    • Marked as answer by tarjeieo Monday, May 23, 2011 10:31 AM
    Friday, May 20, 2011 11:55 PM
    Moderator
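
The behaviour reported in this thread can be reproduced with a toy model. This is a sketch under stated assumptions, not FAST's actual implementation: the dictionaries, the splitting rule, and the function names are all hypothetical. It assumes that lemmatization is a lookup on whole words, that a compound is split into known parts, and that the compound as a whole never receives a lemma of its own.

```python
# Toy model only: hypothetical data and functions, not FAST internals.
LEMMAS = {"summaries": "summary", "educations": "education"}
KNOWN_WORDS = {"user", "leader", "summary", "education"} | set(LEMMAS)


def lemma(word):
    """Dictionary lookup; words missing from the dictionary stay unchanged."""
    return LEMMAS.get(word, word)


def split_compound(token):
    """Return [head, tail] if token is two known words joined, else []."""
    for i in range(1, len(token)):
        head, tail = token[:i], token[i:]
        if head in KNOWN_WORDS and tail in KNOWN_WORDS:
            return [head, tail]
    return []


def index_terms(token):
    """Terms under which a document token is findable in this toy model."""
    token = token.lower()
    terms = {lemma(token)}  # the whole compound is not in the dictionary
    terms.update(lemma(part) for part in split_compound(token))
    return terms


def matches(query, document_token):
    return lemma(query.lower()) in index_terms(document_token)


print(matches("Summary", "Usersummaries"))
print(matches("Usersummary", "Usersummaries"))
```

In this model "Summary" finds "Usersummaries" through the split-off part, while "Usersummary" does not, because neither compound form is in the dictionary and so the two never reduce to a common term. The FIXML mentioned above is the place to check what the real pipeline emits.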
  • Thanks for the reply.

     

    One final question: is it possible to view the dictionaries FAST uses for stemming? (I bet not) Can I modify them? (I bet not)

    Monday, May 23, 2011 9:43 AM
  • Hi

    I have to admit that you are pretty good at this betting game...

    ;-)

    Regards


    Thomas Svensen | Microsoft Enterprise Search Practice
    Monday, May 23, 2011 11:11 AM
    Moderator
  • One more question:

    So FAST is breaking up words. This is done as a stage in the pipeline, right? I would assume that this then happens in the stage(s) WordPartExtractor (1 and 2) - correct?

    What's weird is that the lemmatization stage happens after this stage, so I would think that lemmatization should have an effect on the broken-up words as well. This doesn't seem to be the case though. What have I misunderstood?

    Friday, May 27, 2011 9:17 AM