Any better way of extracting names in East Asian languages?

  • Question

  • Hi,

    At the present time, Word Part (Substring) Matching property extractors are the only supported type for extracting names from East Asian languages, and their precision is very low. For example:

    "历史伟人" is a word in Chinese and "史伟" is a name. A custom property extractor (designed to match Chinese) with the key "史伟" will match "历史伟人", which does not make sense.

    As far as I know, a word breaker can easily recognize a person's name, or at least can with a custom dictionary. However, custom property extractors in FAST don't seem to give us any way to integrate that feature. Is there one?

    Thank you,

    Bruce


    Monday, February 20, 2012 8:36 AM
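
The false positive described in the question can be reproduced outside FAST. The following is a minimal sketch, not FAST's word breaker: it uses forward maximum matching over a tiny hypothetical dictionary (`DICTIONARY`, `segment` and `token_match` are illustrative names) to show how a token-level match rejects "史伟" inside "历史伟人" while a plain substring match accepts it.

```python
def substring_match(text, name):
    """Plain substring test: matches a name anywhere, even inside a longer word."""
    return name in text


def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def token_match(text, name, dictionary):
    """Match a name only when it appears as a whole token after segmentation."""
    return name in segment(text, dictionary)


# Hypothetical dictionary for illustration only.
DICTIONARY = {"历史伟人", "历史", "伟人", "史伟"}

print(substring_match("历史伟人", "史伟"))              # substring test accepts the false positive
print(token_match("历史伟人", "史伟", DICTIONARY))      # token test rejects it
print(token_match("史伟是一位工程师", "史伟", DICTIONARY))  # a real occurrence is still found
```

Maximum matching is only one simple segmentation strategy; a production word breaker would also handle ambiguity and unknown words.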

All replies

  • Hi Bruce,

    You actually can provide custom dictionaries for some languages: Japanese, Thai, Chinese Simplified and Chinese Traditional.

    Take a look at http://technet.microsoft.com/en-us/library/gg130819.aspx which explains how to create a custom dictionary for doing word breaking in these languages without turning on substring support. You can also combine both methods of a custom dictionary and substrings, also explained at the above link.

    Regards,
    Mikael Svenson


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/

    Monday, February 20, 2012 8:36 PM
  • Thank you, Mikael. I followed the article step by step and still can't get it to work.

    I created the .lex file, saved it in Unicode, reset the docprocs and restarted the qrserver.

    Searching for "历史伟人" still returns two words, <c0>历史</c0><c0>伟人</c0>, where I expected a single word, <c0>历史伟人</c0>, on the results page.

    Now I have two questions:

    1) Is there any way to know whether the custom dictionary is loaded properly at runtime?

    2) If I understand http://msdn.microsoft.com/en-us/library/ff795797.aspx correctly, a word part (substring) property extractor is the only type available for East Asian languages. I can't figure out how a custom dictionary would apply there.

    Thanks for your help!


    Best Regards, Bruce

    Tuesday, February 21, 2012 3:33 AM
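
One thing that can be checked independently of FAST is the dictionary file itself. "Unicode" in Windows editors such as Notepad means UTF-16 little-endian with a byte order mark. The sketch below writes and re-reads a word list in that encoding. The helper names and the one-word-per-line layout are assumptions for illustration, so check the linked TechNet article for the exact file name and format. Passing this check only rules out an encoding mistake; it does not prove the dictionary is loaded at runtime.

```python
import codecs
import os
import tempfile
from pathlib import Path


def write_lexicon(path, words):
    """Write one word per line as UTF-16 LE with a BOM (assumed layout)."""
    text = "\r\n".join(words) + "\r\n"
    Path(path).write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))


def read_lexicon(path):
    """Read the file back, failing loudly if the UTF-16 LE BOM is missing."""
    raw = Path(path).read_bytes()
    if not raw.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("file does not start with a UTF-16 LE byte order mark")
    text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    return [line for line in text.splitlines() if line.strip()]


# Usage with a throwaway file; the file name here is a placeholder.
demo_path = os.path.join(tempfile.mkdtemp(), "example.lex")
write_lexicon(demo_path, ["历史伟人"])
print(read_lexicon(demo_path))
```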
  • Hi,

    I tried it myself and also got two hits back. I guess you could override the core results, check the XML highlights, compare them to the search query, and concatenate them if there is a match. Certainly not ideal, but it could work.

    I'm not sure you can check that the dictionary is loaded, other than by using something like FileMon to confirm that the file is read. I had a 0404 file that I know worked before; I tested it now and got the same results as you did.

    As for using a property extractor, this could help if you search against a particular managed property, but not when searching all the content, as you would get the two-word hits as well.

    Regards,
    Mikael Svenson



    Wednesday, February 22, 2012 9:34 AM
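
The highlight-merging workaround suggested in this reply can be sketched as follows. In SharePoint this would normally live in the results XSLT or a custom web part; the Python below only illustrates the logic, and `merge_highlights` is a hypothetical helper. It joins a run of adjacent `<c0>` highlights into one only when their concatenation equals the query term exactly.

```python
import re

# A run of two or more directly adjacent <c0>...</c0> highlights.
HIGHLIGHT_RUN = re.compile(r"(?:<c0>[^<]*</c0>){2,}")
# The text inside a single highlight.
HIGHLIGHT_TEXT = re.compile(r"<c0>([^<]*)</c0>")


def merge_highlights(snippet, query):
    """Collapse adjacent highlights into one when they spell out the query."""
    def merge(match):
        joined = "".join(HIGHLIGHT_TEXT.findall(match.group(0)))
        if joined == query:
            return f"<c0>{joined}</c0>"
        return match.group(0)  # leave unrelated runs untouched

    return HIGHLIGHT_RUN.sub(merge, snippet)


print(merge_highlights("<c0>历史</c0><c0>伟人</c0>的故事", "历史伟人"))
```

This only changes how hits are displayed. The index still treats the term as two words, so recall and ranking are unaffected.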
  • Hi Mikael,

        For #1, the behavior is odd. When I search for the built-in word '一路顺风', FAST Search produces results like '<c0>一路顺风</c0>', which is why I expected the same for the word I added to the custom dictionary. That is also why I doubt whether the word was loaded at all, or maybe words in a custom dictionary behave differently from built-in ones.

        For #2, the stage I really care about is when the property extractor extracts names from a crawled item. The extractor is useless if it extracts the wrong name, and I believe it will if it only uses a substring pattern to match names.

        I apologize for not expressing the issue clearly, and thank you again for the effort.


    Best Regards, Bruce

    Friday, February 24, 2012 5:34 AM
  • Hi,

    You are absolutely right that a substring matcher would probably extract a name from inside the word. You could create a custom extractor in C# (pipeline extensibility) where you use your own logic to break up the data. That should probably work.

    Regards,
    Mikael Svenson



    Friday, February 24, 2012 7:21 PM
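
A custom extractor along these lines might look like the sketch below. It is written in Python rather than C# purely for illustration, and it assumes the pipeline extensibility contract in which the executable receives an input file path and an output file path, each holding a `Document` element with `CrawledProperty` children. The property set GUID, the property name, the word list and the name list are all hypothetical placeholders, and emitting one element per extracted name is also an assumption.

```python
import sys
import xml.etree.ElementTree as ET

# Hypothetical placeholders; a real deployment defines its own values.
OUTPUT_PROPERTY_SET = "00000000-0000-0000-0000-000000000000"
OUTPUT_PROPERTY_NAME = "personnames"
WORDS = {"历史伟人", "史伟"}   # segmentation dictionary
NAMES = {"史伟"}               # names to extract


def segment(text, words, max_len=4):
    """Forward maximum matching with a single-character fallback."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in words:
                tokens.append(candidate)
                i += size
                break
    return tokens


def process(document):
    """Build an output Document holding each name found as a whole token."""
    found = []
    for prop in document.iter("CrawledProperty"):
        for token in segment(prop.text or "", WORDS):
            if token in NAMES and token not in found:
                found.append(token)
    output = ET.Element("Document")
    for name in found:
        element = ET.SubElement(
            output, "CrawledProperty",
            propertySet=OUTPUT_PROPERTY_SET,
            varType="31",
            propertyName=OUTPUT_PROPERTY_NAME,
        )
        element.text = name
    return output


def main(argv):
    input_path, output_path = argv[1], argv[2]
    result = process(ET.parse(input_path).getroot())
    ET.ElementTree(result).write(output_path, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Because the names are matched against tokens rather than raw text, an item containing only "历史伟人" yields no extracted name.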
  • Hi, I once deployed a pipeline module that slowed down item processing quite a bit.

    Using a word tokenizer to break up the data in a custom pipeline stage would work; I just worry about the impact on item processing speed.

    Are there any real-world examples of using pipeline extensibility in production?


    Best Regards, Bruce

    Monday, February 27, 2012 5:34 AM
  • Hi,

    I guess there are many real-world examples, and as you said, it might impact indexing speed, because an .exe file is executed for each item. The question is how many items you have; a slower initial index might not hinder the incremental crawls later on.

    When you say "quite a bit", how big an increase did you get?

    Regards,
    Mikael Svenson



    Monday, February 27, 2012 2:22 PM
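
The trade-off raised here can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only: the item count, the per-item overhead and the number of parallel document processors are made-up inputs, not measurements from this thread, and `extra_indexing_hours` is a hypothetical helper.

```python
def extra_indexing_hours(item_count, overhead_ms_per_item, parallel_processors):
    """Estimate the added wall-clock hours from launching one process per item,
    assuming the work is spread evenly across the document processors."""
    total_ms = item_count * overhead_ms_per_item
    return total_ms / parallel_processors / 3_600_000


# Assumed inputs: 1,000,000 items, 50 ms of overhead each, 4 processors.
print(round(extra_indexing_hours(1_000_000, 50, 4), 2))  # → 3.47
```

The same formula applied to a small incremental crawl, say a few thousand changed items, gives a cost measured in seconds, which is the point made above about initial versus incremental indexing.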