FAST ESP 2010 Document Processing Pipeline

    Question

  • We are working with a US product development organisation that is implementing FAST ESP 2010 for its e-discovery product, which includes heavy customisation of the Document Processing Pipeline API. We need to modify the FAST document processing pipeline to add custom metadata and perform custom activities on it.

     

    It would be helpful if you could provide some pointers towards finding the solution. Even a link with details around this would be appreciated.

    • Moved by Mike Walsh FIN Tuesday, November 30, 2010 6:51 AM FAST questions go to a FAST forum. SP 2010 questions that aren't FAST go to a SP 2010 forum. You posted this question to a pre-2010 SP forum. (From:SharePoint - Search (pre-SharePoint 2010))
    Monday, November 29, 2010 11:15 PM

All replies

  • Are you referring to FAST ESP 5.3, or to FAST Search Server 2010 for SharePoint?

    Regards,

    Mikael Svenson


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Tuesday, November 30, 2010 12:18 PM
  • Hi Mikael,

                  We are referring to  FAST Search Server 2010 for SharePoint.

    Regards,

    Indraneel

     

    Wednesday, December 01, 2010 6:46 AM
  • Then you should take a look at the following sections on MSDN:

    Configuring Optional Item Processing

    and

    Integrating an External Item Processing Component

    Regards,

    Mikael Svenson


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Wednesday, December 01, 2010 6:29 PM
  • Hi Mikael,

    These links give very limited information in the direction I am seeking. Here is a more detailed overview of what I am trying to achieve.

    1)      Add custom components to enhance the metadata during the indexing process

    2)      Save a copy of documents and perform an MD5 checksum to keep only one copy of each document

    3)      Maintain parent-child relationships between archive files and the individual files within them

    4)      Index documents with up to 16 threads

    5)      Modify the search results as per the requirement

    6)      Quick preview of the indexed documents

     

    I have performed a similar activity using Coveo as a search engine whereby we added custom scripts during the pre-indexing and post indexing process to perform all the above activities.

     

     

    Now I have to perform similar functions in FAST Search for SharePoint 2010 to accomplish the objectives specified above. FAST Search also provides an option to embed custom steps during the pre-indexing and post-indexing stages; however, I am not able to find any documentation that would help me with this. I was able to add a static value using custom .NET code, but that was more of an attempt to understand how it works.
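For illustration, a minimal sketch of such a stage (in Python rather than .NET, assuming the input/output XML files that the pipeline extensibility passes to the external process; the property set GUID and property name below are placeholders, not real definitions):

```python
# Sketch of an FS4SP pipeline extensibility stage. FS4SP invokes the
# configured executable once per item with the paths of an input and an
# output XML file; the output's CrawledProperty elements are fed back
# into the pipeline. GUID and property name here are placeholders.
import sys
import xml.etree.ElementTree as ET

MY_PROPSET = "11111111-2222-3333-4444-555555555555"  # placeholder GUID
MY_PROPNAME = "mycustomfield"                        # placeholder name
VT_LPWSTR = "31"  # variant type code for a string crawled property

def main(input_path, output_path):
    # The input file holds the crawled properties requested for this item;
    # real logic would read values from it to compute the output.
    doc = ET.parse(input_path).getroot()  # <Document> element

    # Build the output document with the properties we want to emit.
    out_doc = ET.Element("Document")
    prop = ET.SubElement(out_doc, "CrawledProperty",
                         propertySet=MY_PROPSET,
                         varType=VT_LPWSTR,
                         propertyName=MY_PROPNAME)
    prop.text = "my static value"  # replace with logic derived from `doc`

    ET.ElementTree(out_doc).write(output_path, encoding="utf-8",
                                  xml_declaration=True)

if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

The stage would be registered in pipelineextensibility.xml so that FS4SP supplies the two file paths on the command line.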

     

     

    It would be helpful if you can provide me with the following:

    1)      FAST SDK

    2)      FAST Virtual Machine... I have come across a number of technical labs that are available in some virtual lab or virtual machine, but I am not able to find either that virtual lab or the virtual machine. The 2010-a and 2010-b virtual machines do not have the required files for those labs.

    Friday, December 03, 2010 2:45 PM
  • Since we operate in the SharePoint space, the way I see it you have three options given your parameters. One is to create your own protocol handler, which is deprecated in SharePoint 2010; the second is to create a BCS connector (which I'm not sure how you can speed up to 16 threads); and the third is to use the Content API for Microsoft FAST Search Server 2010 for SharePoint, which is also deprecated and unsupported in future versions.

    The Content API will give you the most flexibility, as it works outside of the SharePoint APIs, much like using the Content API in ESP 5.3. The drawback is that you have to create the crawling framework and status database yourself. More plumbing, but possibly more flexibility and control.

    Any action you want to execute after the text is retrieved from the document files has to be done in the FAST pipeline, which is described in the docs from the previous links I posted.

    As long as you do most of your logic in the pipeline, you should be able to use any of the three methods and still get good speed. Just set up more document processors.

    The FAST SDK is of no use when we are talking about FAST for SharePoint. If you go for FSIA/FSIS, then the story is different, as this is the old ESP 5.3.

    Hope this gives you a bit more information you can use to create an optimal solution.

    As for document preview, if you don't already have this capability developed I would look at BA-Insight's Longitude Document Preview Silverlight web part. The built-in preview in SharePoint 2010 only covers Office 2007/2010 format files, and they have to be stored in a SharePoint library.

    Regards,
    Mikael Svenson


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Friday, December 03, 2010 6:39 PM
  • Hi Mikael,

    I would like to elaborate each of these requirements to give you a better idea of what we are trying to accomplish over here:

    1)      Add custom components to enhance the metadata during the indexing process

    As per our previous experience, the crawling engine identifies a lot of document metadata during the indexing process. However, to keep the index optimized, it normally saves only the portion of the metadata that is relevant to business users. Most of the metadata I am looking for is hidden and not saved, so I cannot pull it into my application database. My requirement is to bring it into my app database and use it according to the business rules of my application.

    2)      Save a copy of documents and perform an MD5 checksum to keep only one copy of each document

    While crawling the documents, the indexing engine holds the document in a binary format, which has to be saved by my application and stored at a different location as per the business requirement. I can only do the MD5 once I know or can specify the location for saving the document.

    3)      Maintain parent-child relationships between archive files and the individual files within them.

    This will also come from the same hidden metadata.

    6)      Quick preview of the indexed documents

    I can use the HTML to load it in a div using jQuery to show the document as a preview when the user clicks on it.

    I am not sure about the Content API, but I had looked at it previously as well. I will certainly recheck the same.

    My basic problem boils down to accessing the hidden metadata mentioned under point 1, and saving the files in a location that I can specify in the application.

    Thanks for helping on this.

    Best Regards,

    Indraneel

    Friday, December 03, 2010 8:02 PM
  • Hi Mikael,

    One update: we have cracked the metadata piece; we successfully found it in the crawl logs.

     

    We now need to understand how to save the files for preservation.

     

    Best Regards,

     

    Indraneel

    Friday, December 03, 2010 8:33 PM
  • With FAST ESP the binary data would be stored in a field called "data". I haven't explored the pipeline in FS4SP yet, so I'm not sure whether it will be in this field or not.

    If it is, then you should be able to use the pipeline extensibility and send the data in that field to your external pipeline stage which could save the file.

    If we were "allowed" to create Python doc procs with FS4SP, then doing the above plus the MD5 would be simple. In a docproc you can return a value saying whether the next docproc should be executed or not, which is necessary if you want to skip documents with the same MD5. Let me know if you manage to get it working; if not, I might have some time next week to explore this further.
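For illustration, the skip-duplicates logic described above can be sketched independently of any particular pipeline API (the `save_document` callback is a placeholder for whatever preservation step the application performs):

```python
# Sketch of MD5-based duplicate skipping: hash the document bytes and let
# only the first occurrence of each hash continue to later stages.
import hashlib

_seen_hashes = set()

def process_document(data: bytes, save_document) -> bool:
    """Return True if later stages should run, False to skip a duplicate."""
    digest = hashlib.md5(data).hexdigest()
    if digest in _seen_hashes:
        return False             # duplicate content: skip further processing
    _seen_hashes.add(digest)
    save_document(digest, data)  # preserve exactly one copy, keyed by hash
    return True
```

In a real deployment the seen-hash set would have to live in shared storage (e.g. a database), since multiple document processors run in parallel.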

    Regards,

    Mikael Svenson


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Saturday, December 04, 2010 12:44 PM
  • Some additional comments/suggestions in-line

    1)      Add custom components to enhance the metadata during the indexing process

    As per our previous experience, the crawling engine identifies a lot of document metadata during the indexing process. However, to keep the index optimized, it normally saves only the portion of the metadata that is relevant to business users. Most of the metadata I am looking for is hidden and not saved, so I cannot pull it into my application database. My requirement is to bring it into my app database and use it according to the business rules of my application.

    >> The cost of storing additional metadata in the index is not high when you use FAST Search. As long as the managed properties are not enabled for sorting or query refinement, they do not use that much space in the index.

    >> You can also specify that all crawled properties within a category are mapped to the full-text index, but that probably does not solve your problem.

    >> If you really need to dump all crawled properties for the crawled items without creating any crawled property mapping, it may be trickier.

    2)      Save a copy of documents and perform an MD5 checksum to keep only one copy of each document

    While crawling the documents, the indexing engine holds the document in a binary format, which has to be saved by my application and stored at a different location as per the business requirement. I can only do the MD5 once I know or can specify the location for saving the document.

    >> What you can do is use the pipeline extensibility (http://msdn.microsoft.com/en-us/library/ff795801.aspx ) to create the MD5 and map it to a new crawled property, which in turn is mapped to a managed property. Then you can use custom duplicate removal at query time (http://msdn.microsoft.com/en-us/library/ff521593.aspx ) to filter out duplicates.
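A minimal sketch of that MD5 step (assuming the binary content reaches the external stage as a base64-encoded crawled property; whether FS4SP exposes the document body this way is exactly the open question in this thread):

```python
# Sketch of the hashing core of an MD5 pipeline extensibility stage.
# The stage would read the base64 payload from the input XML, compute
# the digest, and write it back as a new crawled property.
import base64
import hashlib

def md5_of_item(data_b64: str) -> str:
    """Decode the base64 payload and return its MD5 hex digest."""
    raw = base64.b64decode(data_b64)
    return hashlib.md5(raw).hexdigest()
```

Identical documents then produce identical digests, which is what makes the query-time duplicate removal on that managed property work.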

    3)      Maintain parent-child relationships between archive files and the individual files within them.

    This will also come from the same hidden metadata.

    >> It should also be possible to use pipeline extensibility to extract a relevant collapse ID, given that both the parent and the child items have the info in their metadata.
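As an illustration, if the item URL carries the archive path with the inner entry appended after a separator (the "::" marker below is an assumed convention for this sketch, not something FS4SP defines), the collapse ID could be derived like this:

```python
# Sketch: group an archive and the files inside it under the archive's
# own path, so query-time collapsing treats them as one family.
def collapse_id(item_url: str, marker: str = "::") -> str:
    """Return the archive path for archive members, the URL itself otherwise."""
    # "share/docs/evidence.zip::mail/msg1.eml" -> "share/docs/evidence.zip"
    return item_url.split(marker, 1)[0]
```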

    6)      Quick preview of the indexed documents

    I can use the HTML to load it in a div using jQuery to show the document as a preview when the user clicks on it.

    I am not sure about the Content API, but I had looked at it previously as well. I will certainly recheck the same.

    >> It may be difficult to get a preview of the content with the visual structure intact (e.g. an HTML representation). But you can use a pipeline extensibility stage to dump the relevant properties of the items to some external store, add a custom URL into that store as a new property in the index, and present that URL to the user in the results.
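A sketch of that approach (the store path and base URL are assumptions for illustration only):

```python
# Sketch: dump each item's extracted HTML to an external store and build
# the preview URL that would be indexed as a new managed property.
import hashlib
import os

PREVIEW_ROOT = r"\\server\previewstore"           # assumed file share
PREVIEW_BASE_URL = "http://previews.example.com"  # assumed preview site

def store_preview(item_url, extracted_html, root=PREVIEW_ROOT,
                  base_url=PREVIEW_BASE_URL):
    """Write the item's HTML dump to the store and return its preview URL."""
    # Key the stored file on a hash of the item URL so names stay unique.
    name = hashlib.md5(item_url.encode("utf-8")).hexdigest() + ".html"
    with open(os.path.join(root, name), "w", encoding="utf-8") as f:
        f.write(extracted_html)
    # This URL would be mapped to a managed property and shown in results.
    return base_url + "/" + name
```

The result page then only needs to load that URL into a div (e.g. with jQuery, as described above) when the user clicks a result.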

    My basic problem boils down to accessing the hidden metadata mentioned under point 1, and saving the files in a location that I can specify in the application.

    Tuesday, December 14, 2010 3:32 PM