XMLMapper pipeline flaw

  • Question

  • I'm trying to index some xml files, and have created a simple xmlmapper file. I'm mapping title and body to separate crawled properties. So far so good. Mapping my custom title to the managed property "title" works just fine. It's the mapping to "body" which is the issue.

    When the pipeline is executing, the XMLMapper kicks in as the extension is ".xml", and all my custom crawled properties are set. When it comes down to the IFilterConverter stage, this also picks up the extension and transforms my whole xml file, not only the nodes of my choosing, into html.

    Further, all content of the xml files is used for entity extraction and teaser generation, and all words are searchable, not only what I have mapped to "body". I have tried both "include all" and "single value" on the mapping of my custom body to the "body" managed property. Same result.

    If I remove the IFilterConverter stage, then it works just fine, except I don't get a teaser generated from my custom body element (or entity extraction).

    With the IFilterConverter running the problem is that it sets the "html" field, which is further used by the FASTHtmlParser to set the "body" field etc etc.

    If I disregard the entity extractors I can get it to work with a custom search profile where I decide all the mapping, but that's an awful lot of work to get the XMLMapper to work the way I would expect. And the entity extractors match on xml nodes I want excluded.

    The way the XMLMapper works now is pretty much useless in my opinion, as there is no way to omit data you don't want searchable/processed.

    I know I'm ranting a bit here.. and I hope I'm wrong and that someone can correct me :)

    (Edit: One solution might be to use a third-party IFilter with XPath capabilities instead of XMLMapper)

    Regards,
    Mikael Svenson 

     

     


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Tuesday, October 25, 2011 7:34 PM


All replies

  • Hi Mikael,

     

    I believe there is some discussion about this at the below page, with details on dealing with XML and creating a bypass rule to prevent the IFilter from converting the XML to HTML:

    http://msdn.microsoft.com/library/ff795813.aspx#custom-xml-parsing

     

    Feel free to review and let us know your thoughts.

     

    Thanks!

    Rob Vazzana | Sr Support Escalation Engineer | US Customer Service & Support

    Customer Service & Support                          Microsoft | Services

    Wednesday, October 26, 2011 6:01 PM
  • Hi Rob,

    Thanks for taking the time to answer, but I'm not any wiser :)

    As I tried to say in my last post, I have already tried everything on that page. And the page is not talking about a bypass rule for the IFilter, but about adding a rule to make sure xml is not accidentally detected as html :)

    I have also tried both settings when mapping my crawled property body to "body", meaning with MergeCrawledProperties set to both true and false. But this does not stop the IFilter from triggering.

    If setting the custom body mapping had worked, then all would be ok. But in the pipeline, the internal "body" and "html" properties will be based on the whole xml file, due to the IFilter doing its thing. And then all entity extraction and teaser generation will happen on all the content.

    In effect I end up with a larger index than necessary, due to my body and title being pulled out twice, as well as all the content in the xml which I don't want in the index.

    My quickest solution is probably to convert all my xml files to html files, and index those instead. Then I have full control of what data I want to index, can put any metadata into the header/meta tags, and indexing will be quicker as it can skip conversion and html generation. 

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Wednesday, October 26, 2011 8:08 PM
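    The XML-to-HTML conversion workaround described above can be sketched in a few lines of Python. The element names (`title`, `summary`, `content`) and the meta-tag layout are hypothetical placeholders, not part of the product; adjust them to your own schema and to whatever tags your HTML parser maps to managed properties.

    ```python
    # Sketch: convert an XML document to a minimal HTML page, keeping only
    # the nodes we actually want indexed (everything else is dropped).
    import xml.etree.ElementTree as ET
    from html import escape

    def xml_to_html(xml_text: str) -> str:
        root = ET.fromstring(xml_text)
        title = root.findtext("title", default="")
        teaser = root.findtext("summary", default="")[:100]  # first 100 chars as teaser
        body = root.findtext("content", default="")
        return (
            "<html><head>"
            f"<title>{escape(title)}</title>"
            # metadata goes into header/meta tags, as described above
            f'<meta name="description" content="{escape(teaser)}">'
            "</head><body>"
            f"<p>{escape(body)}</p>"
            "</body></html>"
        )

    if __name__ == "__main__":
        sample = ("<doc><title>Hello</title><summary>Short intro</summary>"
                  "<content>Main text</content><internal>do not index</internal></doc>")
        print(xml_to_html(sample))
    ```

    Anything not explicitly copied over (like the `<internal>` node above) never reaches the index, which is the whole point of the workaround.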
  • Hi Mikael,

     

    I agree with your proposed quickest solution.  Unfortunately, after discussing this with some of my colleagues, I do not have much in the way of additional suggestions.  I assume you have already seen the below article that outlines an end-to-end XMLMapper example:

    http://social.technet.microsoft.com/wiki/contents/articles/how-to-create-a-fast-search-for-sharepoint-test-document-using-xmlmapper.aspx

     

    But that aligns pretty closely with the product documentation.  Sorry I do not have a better suggestion for you, but hopefully others may chime in with additional suggestions.

     

    Thanks!

    Rob Vazzana | Sr Support Escalation Engineer | US Customer Service & Support

    Customer Service & Support                          Microsoft | Services

    Wednesday, October 26, 2011 9:32 PM
  • Hi Mikael

    Although your "workaround" of converting the XML files to HTML prior to indexing is far from ideal, I'll propose it as an answer anyhow, since it may be useful for others coming to this thread to at least be aware of the possibility. I, for one, hadn't thought of that alternative before you mentioned it :-)

    PS! No surprise there are few uses of XMLMapper to be seen...

    Regards


    Thomas Svensen | Microsoft Consulting Services
    Thursday, October 27, 2011 5:28 AM
    Moderator
  • Yes, I've read that as well. I think that article + the MSDN docs are just about the only hits you get when searching for XMLMapper on the net.

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Thursday, October 27, 2011 7:05 AM
  • Hi Mikael,

     

    Some more food for thought.  In addition to mapping a suitable crawled property from the XML to “body”, you should also map a crawled property from the XML to “teaser” (e.g. the first 100 characters from something in the XML). Otherwise, content from the body that IFilterConverter extracted might be used.

     

    If you don’t want or need the IFilterConverter to touch any XML documents, you can change the GUID in the “PersistentHandler” key of the .xml extension in the registry to point to the null filter.  In my example, the “PersistentHandler” key has a value with a name of Default, a Type of REG_SZ, and Data of {7E9D8D44-6926-426F-AA2B-217A819A5CCE}.

     

    Change the Data from {7E9D8D44-6926-426F-AA2B-217A819A5CCE} to {098f2470-bae0-11cd-b579-08002b30bfeb}.

     

    If you now try to run "%FASTSEARCH%\bin\ifilter2html hello.xml", you should get an error/warning saying:

     

    >LoadIFilter() failed: No IFilter (besides Null) registered for the extension '.xml' (0x0)

     

    When you crawl documents, this will end up as a warning in the crawl log (which you can safely ignore).

     

    I believe this would prevent the pipeline from processing anything you haven’t extracted through the XMLMapper. However, you still have the problem that the crawled properties from the XMLMapper are not used during entity extraction (if you need that).  You could try this in your development environment and observe the results.

     

    Thanks!

    Rob Vazzana | Sr Support Escalation Engineer | US Customer Service & Support

    Customer Service & Support                          Microsoft | Services

    Friday, October 28, 2011 9:28 PM
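    The registry change described above can be captured as a .reg file so it is easy to apply and revert. The exact key path is an assumption here (the “PersistentHandler” key under the .xml extension in HKEY_CLASSES_ROOT); export the existing key first so you can restore the original GUID.

    ```
    Windows Registry Editor Version 5.00

    ; Point the .xml PersistentHandler at the Null filter
    [HKEY_CLASSES_ROOT\.xml\PersistentHandler]
    @="{098f2470-bae0-11cd-b579-08002b30bfeb}"
    ```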
  • Hi Rob,

    I was actually thinking about changing how the ifilters were loaded, but didn't have time to look into it at the time. I really appreciate you taking the time to figure this out :)

    And I think I like this solution the best so far. But as you say, no entity extraction, which might actually be a bonus if you don't need the built-in ones, as indexing will speed up. Any custom dictionary can be created as a custom extensibility stage.

    By going xml->html I can map both a custom teaser without needing a custom crawled property, as well as map the body directly. And by using my real dates as the file dates I get those automatically as well.

    I might do a test with both options and compare the index size and indexing speed.

    -m


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Sunday, October 30, 2011 2:47 PM
  • Glad to help!  If you have an opportunity to do the test for both options, feel free to post your results!

     

    Thanks!

    Rob Vazzana | Sr Support Escalation Engineer | US Customer Service & Support

    Customer Service & Support                          Microsoft | Services

    Monday, October 31, 2011 2:41 PM