Using FAST for custom metadata and document full-text searches at the same time

  • Question

  • Please forgive me for the wall of text here, but I think I can reasonably promise that this is a somewhat interesting question around FAST/SharePoint.  Let me know if I'm asking the impossible here, but I'm a bit new to FAST and somewhat new to SharePoint 2010 as well.  I just need a pointer in the right direction concerning FAST/SharePoint and what features I ought to be taking advantage of to accomplish what I have in mind.  I'm finding that it's a rather large product. :)

    My customer has an in-house document management system backed by an OLTP database full of data about their documents and related entities.  So, to use a simplistic example, we have a Document entity, with a related Contract entity and a related Customer entity.  The Document entity has an attribute which points to the specific file on the SAN; that file can be a PDF, XLS, etc.

    Now, there are quite a few visibility rules around when to show documents to users.  Today, when a user executes a search, the application runs some fairly extensive SQL, also executes a search against the file server's index service, and then merges those results back together in the application.  Obviously, this takes quite a bit of time for the user, and it generally doesn't work well: the index service itself isn't what one would hope, and the SQL can take too long to execute.

    What I would like to do is reduce the above process to a single round trip to a FAST server.  The user would send the query to FAST, FAST would query its search index (with the database attributes in question already in the index for each document), and it would return the applicable documents.

    To do this, I'm expecting that our application would need to put together a query behind the scenes against the specific metadata attributes applicable to the documents.  So for example, if the user searches for 'marine insurance' as a full text query in a specific contract folder for a given year and customer, we would submit a query something like this:

    marine insurance Contract:995 Year:2005 Customer:128,135 BeginDate>2000-01-01

    So, you can see what I'm trying to do here is combine the operations: a full-text search together with a search on specific custom metadata attributes (the foreign keys in the Document entity, from the application), plus comparison and set operations as shown in the Customer and BeginDate portions of the query.  I know I don't have the syntax quite right in the Customer portion, but you get the idea.
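    (For reference, here's a rough sketch of how such a combined query might look in FQL, the FAST Query Language in FS4SP.  The managed property names — contract, year, customer, begindate — are my assumptions and would have to match whatever you map from crawled properties:)

```
and(
    string("marine insurance", mode="and"),
    filter(
        and(
            contract:string("995"),
            year:int("2005"),
            or(customer:int("128"), customer:int("135")),
            begindate:range(datetime("2000-01-01T00:00:00"), max)
        )
    )
)
```

    The filter() operator keeps the metadata restrictions out of relevance ranking, so only the full-text portion affects scoring.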

    So, this leaves some basic questions then:

    1. Will FAST/SharePoint 2010 work for what I want to do?  How difficult will it be?

    2. What feature(s)/APIs should I be using to create the metadata attributes and populate them in the index?

    3. How do I get FAST to index the document along with the metadata attributes and correlate them to each other, so I don't have to use two separate content stores to achieve this?

    4. What might be some good ways to get FAST to re-index specific documents when their metadata changes?  For example, a specific document could have a Customer added to or deleted from it.

    FYI – I have a functioning SharePoint 2010 install here with FAST as the search engine, and I have file and BCS search working too.  So, while I'm a bit of a beginner here, I can get around in the product to a degree.  Please just hit me with both barrels and I'll go do my homework on this.  I just want to be sure I'm going down the right road.

    Also, in case it wasn't obvious from the above, the customer is not prepared to move off of their custom document management system at this point.  It's not clear that SharePoint would even work for that, given everything else their system does that SharePoint would have to somehow accommodate.  So, I'm talking about using FAST search from within their system, not from the context of a SharePoint portal.  I can use the FSIA/ESP product if it's really needed, but I am trying to stay under the SharePoint 64-bit umbrella to maximize our future flexibility, as this FAST install could very well be the basis for larger search solutions in their enterprise.

    Thank you!


    Wednesday, February 2, 2011 5:00 PM

All replies

  • My first thought would be to create a database view with the added metadata and a column with a pointer to the physical file, then create a custom pipeline stage to retrieve the file and combine the file text with the metadata.  Unfortunately this won't work (in an easy way), as custom stages sit below the content-extraction stages in the pipeline.  This would leave you doing the text extraction from the file yourself, which is cumbersome.

    Doing it the other way around is a better approach: set up file crawling of the physical files, then create a custom pipeline stage which looks up the file reference in the database, fetches all the metadata, and outputs it into new crawled properties.  From your description this would be a fairly easy SQL query.
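    As a sketch of the lookup that stage would do (the table and column names here are hypothetical, not from your actual system):

```sql
-- Hypothetical schema: resolve a crawled file path back to its metadata,
-- including the many-to-many Customer relationship.
SELECT d.DocumentId,
       d.ContractId,
       d.BeginDate,
       dc.CustomerId
FROM   dbo.Document d
       LEFT JOIN dbo.DocumentCustomer dc
              ON dc.DocumentId = d.DocumentId
WHERE  d.FilePath = @crawledFilePath;
```

    Each returned column would then be emitted as a crawled property and mapped to a managed property for querying.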

    Now you should have enough data to create a query where you can filter on any metadata.

    The easiest way for the indexing process to pick up metadata changes is to update the file timestamp on the physical files whenever the metadata changes.  On the other hand, the complexity of doing this could be a downside.  Unless you have huge amounts of data, you could instead run a full crawl every day to get changed metadata into the index.
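    The "touch the file" idea itself is tiny — the complexity is in hooking it into every place the application writes metadata.  A sketch, in Python purely for illustration:

```python
import os

def touch_for_recrawl(path):
    """Bump the file's modified time so an incremental crawl
    treats the document as changed and re-indexes it."""
    # os.utime with times=None sets both atime and mtime to "now".
    os.utime(path, None)
```

    Called from the application's metadata-update code path, this lets the next incremental crawl pick the document up without a full crawl.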

    I'll think some more about this and try to come up with a better solution for changed metadata.

    Another approach is to write your own protocol handler (crawler), but that will involve a lot of work.

    Mikael Svenson

    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4 -
    Thursday, February 3, 2011 12:12 PM
  • Those ideas are good food for thought, Mikael.  Being somewhat new to FAST/SharePoint search, I have to wonder: isn't there a feature that allows either indexer to navigate to a file based on a value in the SQL dataset and then index that file as part of that metadata record?  It would seem that SharePoint itself must already do this in support of its own features via remote blobs.  Should I be looking at RBS, either to use it directly or to build this functionality myself?

    I feel like I'm missing something obvious here.  That's what I'm hoping, anyway.  :)

    Thanks again for thinking about this!  I'll keep this thread updated as I either make progress or hit roadblocks.

    Thursday, February 3, 2011 7:16 PM
  • FSIS/FSIA has this capability (as does FS4SP, technically), as you can modify the pipeline to suit your needs.  Modifying the pipeline in FS4SP is not supported, however, and you have to add your custom stages at a certain point in the pipeline, which is unfortunately a bit too far down.

    With FSIS/FSIA you would send in the metadata and the link (url/fileref) to the physical document, and the pipeline would retrieve the file, thus combining the metadata and the file data.

    Since this is not directly supported with FS4SP, you have to find a way around it, ideally without writing too much code or building a solution that is too complex to maintain.

    A third option is to create a connector using the Content API.  This is fairly trivial, but the catch is that this API was discontinued the day it was released and will not work with newer versions of FS4SP.  It's basically what is included with FSIS/FSIA.

    If you could have included the file data as a blob in a column in your query, it would have been easier :)

    Mikael Svenson

    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4 -
    Friday, February 4, 2011 7:42 AM
  • Mikael, I've gone down a couple more avenues with this, and I'm starting to think that having the blob as a column in the view that feeds BCS would be the best way to go.  I had initially dismissed this as a poor solution because of the additional overhead of having SQL Server retrieve the file, serialize it as a BLOB, etc.  Now, though, we're thinking that overhead could be worth it, because we can add more intelligence into the view than we easily could with a straight filesystem crawl; for example, we can pass on extremely large files or files we know can't even be indexed.
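    Roughly, I'm picturing a view like the following (the table, the dbo.ReadFileBlob CLR function, and the filtering rules are all placeholders for what our real schema would need):

```sql
-- Hypothetical view feeding the BCS external content type.
-- dbo.ReadFileBlob is an assumed SQL CLR function that reads the
-- file off the SAN and returns it as varbinary(max).
CREATE VIEW dbo.SearchableDocument AS
SELECT d.DocumentId,
       d.ContractId,
       d.BeginDate,
       d.FileName,
       dbo.ReadFileBlob(d.FilePath) AS FileData
FROM   dbo.Document d
WHERE  d.FileSizeBytes <= 50 * 1024 * 1024          -- pass on extremely large files
  AND  d.FileExtension IN ('pdf', 'doc', 'docx', 'xls', 'xlsx');
```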

    I'm assuming you mean that FAST can crawl and index the contents of BLOBs, but do you know if standard SharePoint search can do this as well?

    I ask because we're also debating the value of FAST vs. stock SharePoint search right now.  We don't intend to take advantage of FAST's additional features at this point, so if you have additional input around considerations for that decision, then I'd love to hear it.

    Thanks again!


    Wednesday, February 9, 2011 7:06 PM
  • I'm not sure if standard SharePoint can index the blobs.  FAST's pipeline examines the binary content of the data field and can thus handle a lot of file formats.  Standard SharePoint uses IFilter, so it's dependent on the file extension to invoke the correct filter.

    Based on that, I assume it might save a copy of the actual files to a temp folder before running IFilter on them, unless it also does some binary detection and invokes the correct IFilter with a stream instead of a file reference.

    Maybe someone else knows how this works for sure?

    Mikael Svenson

    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4 -
    Thursday, February 10, 2011 7:00 AM
  • I did find an MSDN article titled "Accessing BLOB Data from External Systems Using Business Connectivity Services in SharePoint Server 2010" which would seem to crack the case for me.  Only, instead of having BCS access a table with the BLOB for indexing, I'm going to have it access a view, through which I hope to return the BLOB using a SQL Server CLR function that accesses the file for me during the crawl/indexing.


    Sound crazy?  Wish me luck!  :)

    Friday, February 11, 2011 7:17 PM
  • Sounds like a smart idea, but does the SQL CLR allow file access? And how did it go?


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4 -
    Monday, February 21, 2011 7:20 PM
  • Actually, I did get it to work with a custom view in AdventureWorks.  Accessing the filesystem from a SQL CLR assembly is possible, but you have to mark the assembly as EXTERNAL_ACCESS instead of SAFE, and doing that involves some DBA gymnastics along the lines of creating a database master key and then essentially registering your key with SQL Server so the strongly named assembly can be validated.
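    For anyone following along, the gymnastics look roughly like this (the assembly, key, login names, and paths are placeholders, not our real ones):

```sql
-- Run in master: trust the strongly named assembly's key.
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';
CREATE ASYMMETRIC KEY BlobReaderKey
    FROM EXECUTABLE FILE = 'C:\Deploy\BlobReader.dll';
CREATE LOGIN BlobReaderLogin FROM ASYMMETRIC KEY BlobReaderKey;
GRANT EXTERNAL ACCESS ASSEMBLY TO BlobReaderLogin;

-- Then register the assembly in the application database.
USE DocumentDb;
CREATE ASSEMBLY BlobReader
    FROM 'C:\Deploy\BlobReader.dll'
    WITH PERMISSION_SET = EXTERNAL_ACCESS;
```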

    The article I referenced above concerning BLOB indexing did the trick, and if I combine that technique with the changelog-based approach to crawling described in Eric White's paper, I have a solution to the overall problem.

    Getting it all lined up properly and working isn't as much fun as one would hope, because SharePoint Designer can't handle the operations needed for BLOBs (StreamAccessor) or for the changelog-based crawls (ChangedIdEnumerator), so I'm spending a fair amount of time hacking around in BDCM XML files and learning by trial and error there.
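    To give a flavor of the hand-editing, the StreamAccessor piece of the BDCM XML ends up looking roughly like this (method, parameter, and identifier names here are hypothetical sketches, not the exact model I'm building):

```xml
<Method Name="ReadDocumentStream">
  <Parameters>
    <Parameter Direction="In" Name="documentId">
      <TypeDescriptor TypeName="System.Int32" IdentifierName="DocumentId"
                      Name="DocumentId" />
    </Parameter>
    <Parameter Direction="Return" Name="fileData">
      <TypeDescriptor TypeName="System.Byte[]" Name="FileData" />
    </Parameter>
  </Parameters>
  <MethodInstances>
    <!-- SharePoint Designer can't author this stereotype; hand-edit it. -->
    <MethodInstance Type="StreamAccessor" Name="ReadDocumentStreamInstance"
                    ReturnParameterName="fileData" />
  </MethodInstances>
</Method>
```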

    So, right now I'm in the process of re-erecting all of this on top of our actual operational database, and I should have a working proof of concept very shortly.  

    It turns out that FS4SP had nothing in the toolbox to help with this, and that this problem was really more about BCS.

    Thanks for your thoughts and checking up on the situation!



    Tuesday, February 22, 2011 10:56 PM
  • You should do a writeup on this :) as I'm sure it's a very common scenario. Or maybe I'll give it a try myself when I have the time.

    Mikael Svenson

    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4 -
    Wednesday, February 23, 2011 8:56 AM