How can we drop a document from indexing in FAST Search for SharePoint 2010?

  • Question

  • Hi All,

    Has anyone worked on dropping a document from being indexed in FAST Search for SharePoint 2010?

    Does this have to be done during the pre-processing or the post-processing stages?

    Please let me know how this can be achieved in FS4SP 2010.

    Thanks,

    Ajay

    Friday, May 6, 2011 6:58 AM

All replies

  • There are various ways to do this.

    Before indexing,

    1 - You can use crawl exclusion rules, or

    2 - Under Library Settings > Advanced Settings, you can prevent the specific library from appearing in search results.

    After indexing,

    1 - You can create scopes to narrow the search results with powerful FQL filters.
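    For example, a scope filter like the following could keep a whole library out of the results (a minimal sketch; the URL is a placeholder, and in FS4SP the filter string would typically be supplied as the scope's extended search filter):

    # Sketch: an FQL filter that keeps everything under a given library out of
    # a scope's results. The URL is a placeholder; in FS4SP this string would
    # typically be supplied as the scope's extended search filter.
    scope_filter = 'not(path:starts-with("http://intranet/sites/hr/private-library/"))'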

     

    Hope this helps.


    Friday, May 6, 2011 7:55 AM
  • Is there an option for doing this "during indexing"?

    I haven't tried this, but can we do it during the pipeline extensibility stage?

    -A

    Friday, May 6, 2011 8:27 AM
  • Hi Ashwani

    You are not the first to ask this question, and I am sorry to say that, NO, that is not possible. You can in theory try to empty all content inside pipeline extensibility, so that no queries will match the given document, but I don't think that is a feasible solution. E.g., you cannot alter the URL, so queries matching the URL string would still match.

    Regards


    Thomas Svensen | Microsoft Enterprise Search Practice
    Saturday, May 7, 2011 7:57 AM
    Moderator
  • Thomas,
    if you return an exit code != 0 from a custom pipeline stage for a specific URL, I think it will drop that item from indexing, as the pipeline fails. I haven't tried this myself, and it's not the prettiest solution, but do you think it would work?

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Saturday, May 7, 2011 7:46 PM
  • Hi Thomas/Mikael,

    Please let me know if you have any solution to drop documents being indexed in FS4SP 2010.

    Thanks,

    Ajay

    Sunday, May 8, 2011 11:53 AM
  • My suggestion did not work. Returning an error from a custom stage will only skip the stage, not abort processing of the pipeline.

    Next, I tried to use the Offensive Content Filter, but there is no way to assign offensive words to the title/body in a custom stage. There is also mention of a field called "ocfcontribution", but I have no idea how to set it.

    So, your only solution would be to write an unsupported stage in Python, and implement your drop logic there.

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Sunday, May 8, 2011 7:58 PM
  • Hi

    There is a solution which does not use unsupported and should work:

    In a pipeline extensibility stage, write the URLs of the pages you want deleted to a file (it would have to be in the AppData/LocalLow folder of the user running FAST), and then have a scheduled task run at frequent intervals, read this file, and submit a "docpush -d <URL>" for each entry; ref. http://technet.microsoft.com/en-us/library/ee943508.aspx
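    A minimal sketch of the scheduled-task side in Python (assuming the extensibility stage appends one URL per line to a drop file; the file path and the "sp" collection name below are illustrative assumptions):

    # drop_queued.py - run at frequent intervals as a scheduled task.
    # Reads URLs queued by the pipeline extensibility stage and removes
    # each one from the index with docpush -d. The paths and the
    # collection name ("sp") are assumptions for illustration.
    import os
    import subprocess

    DROP_FILE = r"C:\Users\fastuser\AppData\LocalLow\docs_to_drop.txt"
    DOCPUSH = r"C:\FASTSearch\bin\docpush.exe"

    def drop_queued_documents():
        if not os.path.exists(DROP_FILE):
            return
        with open(DROP_FILE) as f:
            urls = [line.strip() for line in f if line.strip()]
        for url in urls:
            # docpush -d deletes the item with the given URL from the collection
            subprocess.check_call([DOCPUSH, "-c", "sp", "-d", url])
        # Truncate the file so the same URLs are not submitted twice
        open(DROP_FILE, "w").close()

    if __name__ == "__main__":
        drop_queued_documents()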

    I don't have an image/install available for testing it, but I am pretty sure it should work.

    Regards


    Thomas Svensen | Microsoft Enterprise Search Practice
    Monday, May 9, 2011 4:51 AM
    Moderator
  • Thomas,

    That's quite ingenious, and I'm sure it will work. And probably the only case where docpush is useful in production :)

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Monday, May 9, 2011 7:01 AM
  • I agree, this should work.

    We had done this quite a few times in production, for some skewed scenarios, with the old ESP. :)

    I hope that support for something like "ProcessorStatus.NotPassing" gets added to pipeline extensibility; otherwise docpush would become the de facto mechanism in quite a few cases.

    Thanks,

    Ashwani

    Monday, May 9, 2011 7:23 AM
  • Mikael,

    Could you please elaborate on what you meant by "...Offensive Content Filter, but there is no way to assign offensive words to the title/body in a custom stage"? What's the downside of assigning offensive content to the title and having the document dropped in the OCF stage?

    Thank you,

    Mike.

    Friday, May 20, 2011 3:51 PM
  • Sadhak,

    There is no way to modify the title or body of a document during the pipeline (in a supported manner); therefore, you can't inject words to get it dropped by the Offensive Content Filter.

    If, however, you can assign the words to the title or body at crawl time, then it will be dropped as expected.

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Friday, May 20, 2011 7:44 PM
  • Mikael,

    Thank you for clarifying.

    It seems pretty straightforward to me: you just write an "offensive" string into a title/body crawled property. And it looks "supported" as well :-).

    I tested rewriting the title and it works.

    I would feel more comfortable assigning certain words to the 'ocfcontribution' crawled property, but I haven't been able to find a way (supported or unsupported) to make it work. It would be nice of Microsoft to explain in more detail how to do this, particularly in a supported way.
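    For reference, the general shape of such a pipeline extensibility stage could look like this in Python (a minimal sketch; it assumes the stage is registered in pipelineextensibility.xml with the Title crawled property as output, that Title lives as property id 2 in the standard Office property set, and the trigger word is a placeholder for something the Offensive Content Filter actually flags):

    # ocf_drop_stage.py - invoked as: python ocf_drop_stage.py %(input)s %(output)s
    import sys
    from xml.dom import minidom

    # Assumptions (verify against your own crawled property schema):
    OFFICE_PROPSET = "f29f85e0-4ff9-1068-ab91-08002b27b3d9"  # Office property set; Title = id 2
    TRIGGER = "some-term-the-ocf-flags"  # placeholder trigger word

    def should_drop(input_doc):
        # Placeholder: decide based on the input crawled properties
        return True

    def main(input_path, output_path):
        input_doc = minidom.parse(input_path)
        out = minidom.Document()
        root = out.createElement("Document")
        out.appendChild(root)
        if should_drop(input_doc):
            prop = out.createElement("CrawledProperty")
            prop.setAttribute("propertySet", OFFICE_PROPSET)
            prop.setAttribute("propertyId", "2")  # Title
            prop.setAttribute("varType", "31")    # VT_LPWSTR (string)
            prop.appendChild(out.createTextNode(TRIGGER))
            root.appendChild(prop)
        with open(output_path, "w") as f:
            f.write(out.toxml())

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])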

    Thank you,

    Mike.

    Friday, May 20, 2011 8:45 PM
  • Hi Mike

    If you are able to change the title/body of your documents at crawl time, then why not add a "noindex" robots meta tag (e.g. <meta name="robots" content="noindex">) to the page instead, assuming your content is HTML, of course? That would be a cleaner and more officially supported approach.

    Regards


    Thomas Svensen | Microsoft Enterprise Search Practice
    Friday, May 20, 2011 11:08 PM
    Moderator
  • Dear Ajay,

    Documents can be excluded during pre-processing and during post-processing.

    1. During pre-processing, i.e. at indexing time, you can add exclusion rules in the crawler itself via advanced settings, or through the document processing pipeline.
    2. During post-processing, i.e. at query time, you can use advanced FQL features to narrow down the scope.
    3. In the worst case, where documents are already indexed and re-indexing is very inconvenient, you can apply rules in the search business center so that the documents will not be returned in the result set.

    Regards,

    Chirag Shah

    Enterprise Search


    Monday, July 25, 2011 6:25 AM
  • Hi,

    I solved it!

    Read my blogpost: http://techmikael.blogspot.com/2011/12/how-to-prevent-item-from-being-indexed.html

    You can use the Offensive Content Filter to help you out.

    -m


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Saturday, December 10, 2011 8:12 PM
  • Great blog, Mikael. Thanks for posting this.
    Sriram S
    Sunday, December 11, 2011 4:32 PM
  • Hi Mikael!

    Your approach doesn't work if the text has more than 3-4 thousand characters, so we found another (tricky) solution to prevent indexing. You create a crawled property of int type (varType="20"), map it to a managed property of the same type (the mapping is required), and in a pipeline extensibility stage write an invalid integer to this property ("hi there", for example). After document processing the FIXML will be generated, and your crawled property will be represented something like this:

    <context name="bconintprop1">hi there</context>

    But the indexer will fail when trying to parse this string into an integer, and the document will not be passed to the index. If you open "c:\FASTSearch\data\data_fixml\doc_errors_sp.dat" you can find this line:

    Aborted document during indexing at fixml file line 61 column 37. Reason: AddDecimalNumber(bi1, bconintprop1, hi) failed: Could not convert decimal number 'hi' to an integer using 0 digit decimal precision

    So, as a result, the document is not indexed. The only question is whether this is really a good solution. Mikael, what do you think?

    Thank you.

    UPDATE

    We've done some more experiments and found another solution. Let me summarize what we found; the first solution I call "Invalid integer" and the second "MaxIndexSize overflow":

    • Invalid integer: you must create a crawled property of type int and map it to a managed property of type int. In a pipeline extensibility stage, write an invalid int to this property. As a result, FIXML will be generated, but the indexer will fail with the error "Could not convert decimal number 'hi' to an integer using 0 digit decimal precision" in the logs. The question is how your document can appear in the index again: if you run an incremental crawl and the document has not changed, it will not pass through the pipeline; but if it changes somehow (its modified date changes), it will appear in the pipeline again and you'll have the opportunity to decide what to do, depending on your requirements (drop it again or not).
    • MaxIndexSize overflow: you must create a crawled property of type text and map it to a managed property of type text, then set the MaxIndexSize property of the managed property to 1 (meaning you can write only 1 KB of information to this property). In the pipeline extensibility stage, write (as you guessed) a string longer than 1 KB to the crawled property. As a result, the indexer will fail with the error "DocError: max-index-size limit 1023 bytes for field longprop1 exceeded (field length is 2118 bytes)". The only difference from the first approach is that no FIXML is generated, which means a dropped document will appear in the pipeline on the next incremental crawl regardless of whether it was modified or not. Rough sketches of both variants follow this list.
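    As a minimal sketch, the output side of the extensibility stage for both variants could look like this (the property set GUID is a placeholder for your own crawled property category; "bconintprop1" and "longprop1" are the property names from the examples above):

    # drop_variants.py - output side of a pipeline extensibility stage.
    # The property set GUID is a placeholder; the property names are the
    # ones used in the examples above.
    MY_PROPSET = "00000000-0000-0000-0000-000000000000"  # placeholder GUID

    DOC_XML = ('<Document><CrawledProperty propertySet="%s" varType="%s" '
               'propertyName="%s">%s</CrawledProperty></Document>')

    def write_invalid_integer(output_path):
        # Variant 1: non-numeric text in an int property (varType 20) aborts indexing
        with open(output_path, "w") as f:
            f.write(DOC_XML % (MY_PROPSET, "20", "bconintprop1", "hi there"))

    def write_overflow(output_path):
        # Variant 2: more than 1 KB in a text property (varType 31) whose managed
        # property has MaxIndexSize = 1 makes the indexer reject the document
        with open(output_path, "w") as f:
            f.write(DOC_XML % (MY_PROPSET, "31", "longprop1", "x" * 2048))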

    Yes, I agree this is a tricky solution, but MS doesn't provide an easy way to drop a document during indexing. I hope they'll fix this in vNext.



    Friday, May 25, 2012 9:14 AM
  • Hi Sergei,

    Excellent solution indeed! Much better than using the Offensive Content Filter, in my opinion. Why didn't I think of that :D

    I will add this to my blog post on the issue.

    Regards,
    Mikael Svenson


    Search Enthusiast - SharePoint MVP/MCT
    http://techmikael.blogspot.com/
    Author of Working with FAST Search Server 2010 for SharePoint



    Friday, May 25, 2012 10:37 AM
  • Sergei,

    Interesting solution! One thing to keep in mind here is that this will not actually "drop" the document. It will still be in FIXML and will show up as a "non-indexed" document in the indexer count indefinitely. So it will get the job done of keeping this document out of the actual index and out of search results, but it will have some unintended consequences.


    Igor Veytskin

    Tuesday, May 29, 2012 2:12 PM
    Moderator
  • Igor, I agree with you. We found another solution (see my update) where the FIXML is not generated; maybe that solution is better, but it is still tricky.
    Tuesday, May 29, 2012 3:09 PM
  • Hi Sergei,
    You've done some excellent "hacking" around here :) Two thumbs up!

    The cleanest way is most likely writing a Python stage that does a real DROP based on some rules (which you can set in a custom pipeline module). Sure, it's not supported, but as MS forgot to give us an option to drop documents, it's a matter of which "hack" is the nicest ;)

    Thanks,
    Mikael Svenson

    Search Enthusiast - SharePoint MVP/MCT
    http://techmikael.blogspot.com/
    Author of Working with FAST Search Server 2010 for SharePoint

    Wednesday, May 30, 2012 7:04 PM
  • Hi,

    Just want to add that you have to make the managed property Queryable to get the overflow method to work.

    Thanks,
    Mikael Svenson


    Search Enthusiast - SharePoint MVP/MCT/MCPD - If you find an answer useful, please up-vote it.
    http://techmikael.blogspot.com/
    Author of Working with FAST Search Server 2010 for SharePoint

    Thursday, September 27, 2012 11:00 AM
  • I tried the MaxIndexSize overflow method for the title. 

    Managed Property details are as follows:

    Name                   : Title
    Description            : The title of the document
    Type                   : Text
    Queryable              : True
    StemmingEnabled        : True
    RefinementEnabled      : False
    MergeCrawledProperties : False
    SubstringEnabled       : False
    DeleteDisallowed       : True
    MappingDisallowed      : False
    MaxIndexSize           : 1024
    MaxResultSize          : 64
    DecimalPlaces          : 3
    SortableType           : SortableEnabled
    SummaryType            : Dynamic

    The text I tried to assign is:

     Deep refinement is based on the aggregation of managed property statistics for all of the search results. The indexer creates aggregation data that is used in the query matching process. The advantage of using deep refiners is that the refinement options will reflect all the search items matching a query. The number of matching search items is displayed in parentheses behind each refiner. This is usually the recommended mode, but defining many deep refiners may have a significant adverse effect on memory usage in the query matching component.

     [the same paragraph repeated back-to-back, roughly 12 KB of text in total]

    The Title is getting updated with the string, but the document is not failing during indexing.

    Please help me get this working.


    Regards, Shwetha Veeraiah

    Wednesday, September 4, 2013 10:51 AM
  • Hi,

    Your Title property has MaxIndexSize set to 1024, which means 1 MB of text (the unit is KB), far more than the roughly 12 KB you pasted above. I suggest you set it to 1, as Sergei proposed in his original post.

    Thanks,
    Mikael Svenson


    Search Enthusiast - SharePoint MVP/MCT/MCPD - If you find an answer useful, please up-vote it.
    http://techmikael.blogspot.com/
    Author of Working with FAST Search Server 2010 for SharePoint

    Saturday, September 7, 2013 5:42 PM