Drop the document in pipeline

  • Thursday, September 15, 2011 11:00 PM
     
     

    We want to drop specific documents whose title tag contains "don't crawl".

    1. How can we achieve this in FS2010?

    2. Is there any way to drop the page through the pipeline stages?

    3. Is processors.DropIfMatch the right stage for this requirement?


    Chittaranjan. Consultant Enterprise Search Products.

All Replies

  • Thursday, September 15, 2011 11:28 PM
     
     Proposed Answer

    Hi,

    Dropping a document in the pipeline in FS4SP is easier said than done. Please see this thread for a previous discussion: http://social.technet.microsoft.com/Forums/en-US/fastsharepoint/thread/835bb78e-3bdf-4ed3-8ac1-aa3ce389fdd5

    As you can see, it is possible using unsupported operations and/or quirky workarounds. However, if you can find a way to do this before the documents enter the pipeline, that is strongly advised. Are crawler exclusion rules an option?

    Regarding 3: DropIfMatch drops all documents with the noindex property set to true. So if you could make your crawler, connector, or whatever you use to feed documents populate that property, the document would get dropped in the pipeline.

    Cheers,

    Marcus

     

     

     


    Marcus Johansson | Search Nerd | comperiosearch.com | linkedin.com/in/marcusjohansson
  • Thursday, September 29, 2011 5:31 PM
     
     

    Hello Marcus,

    As I can't use crawl rules to exclude certain content from the index, I need to go for pipeline extensibility. I am using the Web Crawler connector.

    1> Does the "noindex" property refer to the meta robots tag values picked up by the crawler?

    2> In the DropIfMatch stage, can we add a condition that sets the value the stage checks to 1?

     

     

    Thanks,

    Nikhil Sankolli

  • Friday, September 30, 2011 6:27 PM
     
     

    Hi,

    You are correct that "noindex" is in the meta robots header tag. By default the FAST Web crawler will obey this setting, but you can also configure it to disregard it. If you set this header tag, the DropIfMatch stage will pick it up.
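
    To check whether a given page actually carries the tag, a small script can help. The following is a minimal Python sketch using only the standard library. The class and function names are made up for illustration and have nothing to do with how the FAST Web crawler itself is implemented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the page has a meta robots tag containing noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives
```

    Calling has_noindex on the page source tells you whether a crawler that obeys the meta robots tag would be asked to skip the page.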

    If your approach is to modify the HTML in a pipeline extensibility stage to insert the meta robots tag, this will not work, because your custom stage runs after the HTML parser stage. You would need to make unsupported edits in order to drop the document, as Marcus mentions.

    In theory it is technically easy to drop a document, but doing it in a supported manner is a different story. The best approach is to not pass the document to FAST in the first place.

    Another approach is to set a crawled/managed property on the items you want excluded and filter them out with a scope. They will still enter your index, but they are filtered out of the results. It is not an ideal solution, but it can work.
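
    As an illustration of the flagging step, here is a rough Python sketch of an external stage that reads an input XML file and writes an output XML file with a flag. It assumes the pipeline extensibility file layout is a Document element containing CrawledProperty elements with propertySet, varType and propertyName attributes, and that varType 31 means a string; please verify this against the pipeline extensibility documentation. The property set GUID, the property names, and the idea of running a Python script as the stage are all hypothetical placeholders.

```python
import sys
import xml.etree.ElementTree as ET

# Hypothetical placeholders: use the property set GUID and the names
# configured for your own crawled properties.
FLAG_PROPERTY_SET = "00000000-0000-0000-0000-000000000000"
FLAG_PROPERTY_NAME = "excludefromresults"
TITLE_PROPERTY_NAME = "title"
MARKER = "don't crawl"


def flag_document(root: ET.Element) -> ET.Element:
    """Build an output Document holding a true/false flag property."""
    flagged = any(
        prop.get("propertyName", "").lower() == TITLE_PROPERTY_NAME
        and MARKER in (prop.text or "").lower()
        for prop in root.iter("CrawledProperty")
    )
    out = ET.Element("Document")
    flag = ET.SubElement(out, "CrawledProperty", {
        "propertySet": FLAG_PROPERTY_SET,
        "varType": "31",  # assumed to be the string type
        "propertyName": FLAG_PROPERTY_NAME,
    })
    flag.text = "true" if flagged else "false"
    return out


# Assumed invocation: script.py <input.xml> <output.xml>
if __name__ == "__main__" and len(sys.argv) == 3:
    source_root = ET.parse(sys.argv[1]).getroot()
    result = ET.ElementTree(flag_document(source_root))
    result.write(sys.argv[2], encoding="utf-8", xml_declaration=True)
```

    The flag would still need to be mapped to a managed property and excluded with a scope. As said, the items stay in the index and are only hidden from results.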

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
  • Monday, October 03, 2011 7:04 AM
     
     

    Hi Mikael,

    When you say "do not pass the document to FAST in the first place", do you mean collecting the documents that need to be removed during the first full crawl (for example, the URLs of the pages), then running a tool that adds the HTML tag with the required values, so that those pages are removed during the next crawl?

    If that's the case, I would like to know how DropIfMatch works exactly. Does it refer to a crawled property that stores the values of the meta robots tag? If so, which crawled property is it?

    Adding scopes would be an option, but I need to remove the documents from the index itself, as this data will never be used or referenced, and there are some pages I don't want to expose at all.

    I would also like to know how adding more stages will affect crawling/indexing time and performance. What would be the best approach to get optimum performance?

     


    Regards,

    Nikhil Sankolli

    • Edited by N Sankolli Monday, October 03, 2011 7:04 AM
  • Monday, October 03, 2011 4:50 PM
     
     

    Hi,

    I'm talking about having the connector skip those documents rather than sending them to FAST, as Marcus said, with crawl rules or with meta tags.

    How DropIfMatch works is not that important as long as we are talking about supported scenarios. Your supported options are those mentioned earlier and in the thread Marcus referred to.

    As for the impact of adding custom stages, I would not be worried, but Eric Belisle has a blog post on it at http://fs4sp.blogspot.com/2011/06/fs4sp-pipeline-performance-and.html. If performance on the document processors really becomes an issue, you can add more processors. For most scenarios it's the initial indexing that takes time; on incremental crawls you won't notice this as much.

    Regards,
    Mikael Svenson 


  • Tuesday, November 29, 2011 6:40 AM
     
     

    Hey Mikael,

    Creating a managed property that segregates the unwanted pages and filtering via scopes or FQL will keep them off the results page, but it will not remove them from the index. Can we change the value of the robots crawled property in a pipeline extensibility stage and then call the DropIfMatch stage after the custom stage? Would that remove the document from the index?

    regards,

    nikhil   

  • Tuesday, November 29, 2011 7:16 AM
     
     

    Hi nikhil,

    The meta/robots tags are picked up by the crawler, thus preventing the page from ever being sent over to FAST for indexing. So the answer is no.

    Dropping documents in the pipeline seems to be one of the more asked about and missing features, and we can only hope it will be introduced at some time or another.

    Regards,
    Mikael Svenson 


  • Saturday, December 10, 2011 8:11 PM
     
     

    It's doable!

    Read my blogpost: http://techmikael.blogspot.com/2011/12/how-to-prevent-item-from-being-indexed.html

    You can use the Offensive Content Filter to help you out.

    -m


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/