Remove a document from the index based on an attribute value in ESP 5.3 / SP2007

  • Question

  • Hello again,

    In our SharePoint application, we want to let users exclude individual list items from indexing by FAST.

    We added an "IndexerStatus" field to our lists' content types with two possible values: "Index" (default) and "Do not index".

    We are aware of the "spallowcrawl" attribute we get from the SP Connector, but it works at the site and list levels, not at the list item level.

    Our initial solution was to write our own AttributeFilter in Python that sets the ProcessorStatus to "Completed" when the attribute matches the filter value, so that the document is never passed to the indexer. While this worked at initial indexing, it obviously didn't remove documents already in the index whose "IndexerStatus" value we later changed from "Index" to "Do not index".

    We are now considering adding the documents to delete to a list (in the "Process" method) and then removing them from the index by calling "indexeradmin" in the "PostProcessBatch" method.

    This seems a bit extreme to me, so I was wondering whether there is a simpler way of achieving our goal.

    Bonus question: Are there reference documents detailing the Python API for Python stages? (ProcessorStatus possible values and meanings, ...)

    Thanks,

    Mat
    Tuesday, February 15, 2011 10:31 AM

Answers

  • No worries Lina,

    There are two cases where the stock AttributeFilter doesn't meet our needs:
    • The field we want to filter on has a lot of possible values, yet only one is valid for a given pipeline. We set up a SharePoint ContentType filter for each of our collections, and it was better to set up an "include if value" filter than an "exclude all the other possible values that we might not even know about yet" filter.
    • The value has a space in it, and the default AttributeFilter treats a space in the filter values as a separator. If you have 3 content types "Good Article", "Article" and "Bad Article", you can't filter out "Bad Article" and "Article" and keep only "Good Article". With our own stage we can specify which separator to use.
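
    As a rough illustration of the two points above, the decision logic of such a filter could look like the sketch below. This is not our actual stage and it does not use the ESP API: the function name keep_document, the mode parameter and the '|' separator are all made up for the example.

```python
def keep_document(attribute_value, filter_values, mode='include',
                  value_separator='|'):
    """Decide whether a document should be kept.

    Hypothetical helper, not part of ESP: filter_values is one string whose
    entries are split on value_separator, so entries may contain spaces.
    """
    wanted = [v.strip() for v in filter_values.split(value_separator)
              if v.strip()]
    matched = attribute_value.strip() in wanted
    if mode == 'include':
        return matched       # keep only documents whose value is listed
    return not matched       # 'exclude': drop documents whose value is listed


content_types = ['Good Article', 'Article', 'Bad Article']
kept = [ct for ct in content_types if keep_document(ct, 'Good Article')]
print(kept)
```

    In "include" mode only the listed value has to be known, and because the separator is configurable, "Good Article" stays a single value instead of being split at the space.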

    About the internalid: according to the online documentation, it looks like the first part is an MD5 hash of the document:

    internalid: Reserved field name representing a unique internal document ID that is also returned in the query results.
    This is formatted as: <MD5 checksum of document>_<collection name>. 
    
    Anyway, I finally found the solution to my initial issue: in the processor, change the routing of the document being processed with:

    document.SetRouting('op', 'DEL')

    This will trigger an error in the admin console log if the document doesn't exist in the index.
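
    For context, here is a minimal sketch of where that call could sit. Only SetRouting('op', 'DEL') is what I actually use; the FakeDocument class is a stand-in so the snippet runs outside ESP, and the attribute name 'indexerstatus' and the GetValue(name, default) accessor are assumptions about how the field reaches the pipeline.

```python
DO_NOT_INDEX = 'Do not index'


class FakeDocument(object):
    """Stand-in for the ESP document object, for illustration only."""

    def __init__(self, attributes):
        self.attributes = attributes
        self.routing = {}

    def GetValue(self, name, default=None):
        # Assumed accessor; mirrors a plain dictionary lookup here.
        return self.attributes.get(name, default)

    def SetRouting(self, key, value):
        self.routing[key] = value


def mark_for_deletion_if_excluded(document, attribute='indexerstatus'):
    """Turn the operation into a delete when the item is flagged.

    The attribute name is hypothetical; use whatever name the field is
    mapped to in your pipeline.
    """
    if document.GetValue(attribute, None) == DO_NOT_INDEX:
        document.SetRouting('op', 'DEL')
        return True
    return False


flagged = FakeDocument({'indexerstatus': 'Do not index'})
normal = FakeDocument({'indexerstatus': 'Index'})
print(mark_for_deletion_if_excluded(flagged), flagged.routing)
print(mark_for_deletion_if_excluded(normal), normal.routing)
```

    In a real stage the same check would go in the "Process" method, with the document object supplied by the pipeline instead of the stand-in.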

    Regards,

    Mat

    • Marked as answer by mgrandis Wednesday, February 16, 2011 4:06 PM
    Wednesday, February 16, 2011 2:16 PM

All replies

  • Hi Mat,

    One of the greatest things about FAST ESP is that up to 80% of what you need can be configured rather than implemented.

    For me, the easiest way to do this is to add the out-of-the-box AttributeFilter stage to your pipeline, which does the following:

    Drop documents based on attribute values
    Drop document if the configured attribute doesn't have any of the configured values.
    The 'Attribute' parameter should contain the name of the attribute to act on. Example: 'languages'.
    The 'Separator' parameter should contain the string that separates different values in the attribute. Example: ';'.
    The 'Values' parameters should contain a list of required values (one of which must be present in the configured attribute), separated by space characters. Example: 'es pt'.
    The 'DropSilently' boolean parameter should be set to '0' to get a log message for dropped documents or '1' to drop documents silently.
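
    Read literally, the description above amounts to the check sketched below. This is only an interpretation of the quoted text written in plain Python, not the stage's real source; the function name is invented.

```python
def stock_filter_drops(attribute_value, separator, values_param):
    """Return True if the document would be dropped under the quoted rules.

    Interpretation only: 'Values' is split on whitespace, the attribute is
    split on 'Separator', and the document is dropped when the two sets
    share no value.
    """
    required = set(values_param.split())
    present = set(v.strip() for v in attribute_value.split(separator)
                  if v.strip())
    return not (required & present)


# Using the examples from the description: Separator ';', Values 'es pt'.
print(stock_filter_drops('en;es', ';', 'es pt'))
print(stock_filter_drops('en;fr', ';', 'es pt'))
```

    The first document contains 'es' and is kept; the second contains neither 'es' nor 'pt' and is dropped.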

    You really do not need to implement this yourself, though of course you can if you want to. Note that this only covers documents that are not indexed yet, which means you have to reindex everything to see all the changes.

    If your documents are already indexed and you do not want to reindex, then you can try the following: collect the document IDs of the indexed documents in a .txt file and delete them all by running filetraverser with the option -t <your txt file>. You can also run this from a cron job or some other triggering process. In my opinion, using indexeradmin for this is a bit too extreme. If you really are building a list of documents, then try filetraverser -t.

    Regards

    Tuesday, February 15, 2011 11:10 AM
  • Hi Lina thanks for the reply,

    We knew about the bundled AttributeFilter, but we actually wanted to be able to specify whether to "include" or "exclude" a document based on the value of one of its attributes. The out-of-the-box AttributeFilter doesn't offer this option.

    We can use neither Filetraverser nor Docpush to remove docs, because they both resubmit the document to the pipeline. The removal then fails because the document content has changed ("IndexerStatus" went from "Index" to "Do not index") and thus the internalid of the document has changed too.

    What we do for now is:

     - find the internalids of the documents using getfixml;

     - build a text file containing the internalids;

     - submit the file to the "indexeradmin rdocs" command to remove the docs.
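
    The second step can be sketched as below. The one-ID-per-line layout is an assumption on my side about what the command accepts, the helper name is invented, and the IDs are made-up placeholders rather than real internalids.

```python
import os
import tempfile


def write_id_file(internal_ids, path):
    """Write unique, non-empty IDs to path, one per line (assumed format).

    Returns the number of IDs written.
    """
    seen = set()
    count = 0
    with open(path, 'w') as handle:
        for internal_id in internal_ids:
            internal_id = internal_id.strip()
            if not internal_id or internal_id in seen:
                continue  # skip blanks and duplicates
            seen.add(internal_id)
            handle.write(internal_id + '\n')
            count += 1
    return count


# Placeholder IDs for illustration only.
ids = ['0123abcd_mycollection', '4567ef01_mycollection',
       '0123abcd_mycollection']
path = os.path.join(tempfile.mkdtemp(), 'to_remove.txt')
print(write_id_file(ids, path))
```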

    Regards
    Tuesday, February 15, 2011 1:19 PM
  • Hi Mat,

    I am sorry my answer did not help you, but I don't quite follow.

    You say you want to exclude or include. The default behaviour of the AttributeFilter is to exclude a document from indexing (drop it) if the attribute has exactly the value "xyz" and otherwise do nothing, so you already have binary behaviour (include / exclude). Is your attribute binary? I mean: does it have only the two values include / exclude, or does it have more possible values, and what happens then? I am not saying you should not implement your own stage, please do; I just don't quite follow.

    There are many ways to do something. I would prefer not to use indexeradmin in a process, but that does not mean you shouldn't use it if it works fine for you.

    The internal ID of a document is not derived from its content; it is a hash of its URI plus the collection (for crawled content, the URL plus the collection). Depending on your configuration, the ID can be anything (title, timestamp, and so on) as long as it is unique.

    What you submit to filetraverser or docpush is the URI. Does this mean that you do not know the URIs of your documents?

    If you know the document URI, that is sufficient for docpush. You could write a stage that records the URIs of the documents whose attribute is set to exclude, and then remove those documents with docpush or filetraverser afterwards, or something like that.

    Regards

    Tuesday, February 15, 2011 4:06 PM