What is more maintainable, a custom connector or custom pipeline stages?

  • Question

  • Questions on this forum often relate to how to index some piece of data, or how to achieve something with the search results. With the built-in SharePoint search, if one of the bundled connectors didn't work the way you wanted, you had to roll your own custom connector, e.g. via BCS. With FS4SP we can create custom pipeline stages, which can often solve some of the same issues and pains.

    As I have a long history of developing connectors, I try to stay away from them, since I know it's hard to cover a system 100% when crawling it, and support/maintenance will be costly. Creating a custom pipeline stage, on the other hand, is more like a patch, and it lives on its own well outside of SharePoint on a FAST server. The only reference to it is in an XML file (sketched below). Maybe not too maintainable either?
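
    For context, that XML reference is an entry in pipelineextensibility.xml on the FAST server, along these lines (the executable path, property set GUID and property names here are just placeholders):

    ```xml
    <PipelineExtensibility>
      <!-- FS4SP runs the command once per document, substituting the paths of
           temporary input/output XML files for %(input)s and %(output)s -->
      <Run command="C:\stages\MyStage.exe %(input)s %(output)s">
        <Input>
          <CrawledProperty propertySet="11111111-2222-3333-4444-555555555555"
                           varType="31" propertyName="title"/>
        </Input>
        <Output>
          <CrawledProperty propertySet="11111111-2222-3333-4444-555555555555"
                           varType="31" propertyName="mycustomprop"/>
        </Output>
      </Run>
    </PipelineExtensibility>
    ```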

    I'd love to hear what others think about maintainability and cost of these two approaches. 

    -m


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Friday, June 24, 2011 8:52 PM

Answers

  • I am leaning towards using the standard configuration of the pipeline, and putting more logic into the crawler.

    As I have a long history of managing FAST ESP pipeline configurations, I try to stay away from it :-)

    My experience from managing only slightly complex pipelines for FAST ESP search solutions with multiple content sources is that it quickly becomes a big hassle to maintain the pipeline configuration. Different pipeline stages start to depend on what you did in previous stages. I have experienced many times that a simple task such as renaming a field becomes a complex one that you avoid, because you don't know whether you actually renamed the field in all the files until a document fails. I did not find having to maintain your logic in a bunch of interdependent Python and XML files very convenient.

    I preferred developing connectors using the existing APIs. It is a lot nicer to code logic in an IDE. Now I hope BCS can make this easier by providing a framework for it, especially since it will be easier to run multiple crawlers in parallel on the same or different servers.

    Also, if you are doing lean development you will have to refeed documents quite often, and it will quickly become a real pain if the reindex process is slow. Speeding it up by avoiding file copies seems like a good idea to me. But it might not be an issue if your system is scaled well.

    CTS sounds nice, but I haven't used it.

    I guess there are pros and cons with both, and that it can be a matter of taste. Better tools would be nice though :-)

    With regards,

    Gunnar


    Gunnar Braaten - Search Consultant - Bouvet ASA - www.bouvet.no

    Tuesday, June 28, 2011 11:41 AM

All replies

  • Mikael:

    I have been trying to stay away from pipeline stages for a couple of reasons:

    1. Indexing performance: the stage uses XML files as input and output. Since there is only one pipeline in FS4SP, every document has to go through the stage regardless of whether the stage should be applied to it or not, and the file I/O will be costly (see the sketch after this list).

    2. Using a pipeline stage would also mean losing the reporting capabilities that come with the SharePoint connectors.
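
    To make the I/O point concrete: FS4SP runs the registered command once per document, passing paths to a temporary input and output XML file. A minimal sketch of such a stage follows; real stages are usually small .Net console applications, and the Python, the property names and the uppercase "logic" here are illustrative placeholders only.

    ```python
    import sys
    import xml.etree.ElementTree as ET

    def main(input_path, output_path):
        # Input file: a <Document> element holding the crawled properties that
        # were declared in the <Input> section of pipelineextensibility.xml.
        doc = ET.parse(input_path).getroot()
        out = ET.Element("Document")

        for prop in doc.findall("CrawledProperty"):
            if prop.get("propertyName") == "title":
                text = (prop.text or "").strip()
                # Write a derived value into a property declared under <Output>.
                new = ET.SubElement(out, "CrawledProperty",
                                    propertySet=prop.get("propertySet", ""),
                                    propertyName="mycustomprop",
                                    varType="31")
                new.text = text.upper()  # placeholder for real business logic

        # Output file: read back into the pipeline for further processing.
        ET.ElementTree(out).write(output_path, encoding="utf-8",
                                  xml_declaration=True)

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])
    ```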

    The SharePoint connectors, including connector assemblies and custom BCS, do involve some serious coding, and they are somewhat difficult to debug, diagnose and deploy, but they offer more flexibility over which processing is applied to which content source.

    What do you think?

    Ben

    Friday, June 24, 2011 9:47 PM
  • Hi Ben

    Regarding your points:

    1. Yes, file I/O is costly for the custom pipeline stage, but it's actually the starting/stopping of a separate process which really takes time. Nonetheless, in intranet scenarios the document volumes are often not as big as in other search scenarios. When I first heard about this "launch a process for every document" design, I thought it would be almost unusable. But so far, I have not heard many complaints about performance. Adding a large number of document processors will also relieve some of that pain.

    2. I am not aware of any "reporting capability" that you lose when applying a pipeline stage?

    On the original topic of "custom connector" vs. "pipeline stage", I would assume a pipeline stage would always be the preferred option, unless the limitations of the pipeline stage make it unusable (e.g., not being able to update certain crawled properties). As Mikael said at the outset, implementing a 100% production-quality connector is a large undertaking; it is significantly less work to add some extra logic as part of the indexing process.

    It would also be interesting to hear whether anyone has opted for using the unsupported(!) Python stages, which are still technically possible to use in FAST Search for SharePoint. I know they are tempting for everyone with a FAST ESP background, but I must stress that I do not recommend using them!

    Regards


    Thomas Svensen | Microsoft Enterprise Search Practice
    Monday, June 27, 2011 6:45 AM
    Moderator
  • Thomas, 

    I have to admit I haven't done a real pipeline stage in production yet. So my thoughts are solely based on my reading and a little POC on my development environment so far. 

    What I mean by "reporting capability" with the BCS connector is that you can use the SharePoint analytics reports to see the performance of the component. It is almost impossible to report on a pipeline stage's performance, since it fires up for every document, even if it only does real processing on some of them. So when it comes to manageability, I would choose a BCS connector first, since it only applies to the content I need to process.

    Using a BCS connector also allows me to use standard SharePoint diagnostics tools like the ULS log. Logging and diagnostics in a pipeline stage would be more difficult, but that could be my lack of knowledge.

    I do agree it is much more complex to develop a BCS connector. I hope Microsoft can publish more documentation on this to help us out :-)

    Thanks

    Ben

    Monday, June 27, 2011 1:53 PM
  • Eric Belisle did some benchmarking on a simple stage copying a file in a recent blog post (http://fs4sp.blogspot.com/2011/06/fs4sp-pipeline-performance-and.html).
    The C++ and .cmd versions took on average ~70ms each, while the .Net version took ~140ms. As the file copy itself takes the same amount of time, the difference is startup time. Then again, the added time would mainly be an issue for the initial indexing of several million documents, and is insignificant in most day-to-day operations. So I agree with Thomas' points.
    As for logging from a pipeline stage to the SharePoint log, this could be done via a web service (which you have to create yourself). But firing off a web call for logging might increase the execution time unnecessarily.
    I have an idea for a pipeline-within-the-pipeline framework, which would use a stub loader in the pipeline to notify a .Net service, which would then pick up and process the XML file. This service would perform all the custom stages as well, so you would only register one pipeline extensibility module.
    And now we are moving closer and closer to CTS, which today exists for FSIS, but which will most likely see the light of day in the next version of SharePoint.
    PS! I'm leaning towards pipeline modules as well :)
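
    Roughly, the stub itself would do no more than this (sketched in Python for brevity; a real loader would more likely be a tiny .Net executable, and the local service URL and payload format are made up for illustration):

    ```python
    import sys
    import json
    import urllib.request

    # Hypothetical endpoint of the long-running service doing the real work.
    SERVICE_URL = "http://localhost:8099/process"

    def main(input_path, output_path):
        # Hand the per-document input/output file paths to the service and
        # block until it has written the output file. Letting any exception
        # propagate yields a non-zero exit code, which the pipeline treats as
        # a failed stage for that document.
        payload = json.dumps({"input": input_path,
                              "output": output_path}).encode("utf-8")
        req = urllib.request.Request(SERVICE_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            resp.read()

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])
    ```
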
    Regards,
    Mikael Svenson

    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Monday, June 27, 2011 6:05 PM
  • Hi everyone,

    I take the liberty of marking Gunnar's reply as the answer, as I don't think we'll get much further on this topic :-)

    A couple of additional notes:

    Mikael: your idea about a stub which calls out to an external web service has actually been implemented as a proof of concept by Microsoft consultants, code-named "PEWS", which I think stands for Pipeline Extensibility Web Service. But the uptake hasn't been very broad, to my knowledge. My gut feeling is that people generally don't do operations in pipeline stages that are complex enough to justify the extra complexity.

    CTS is really nice, with a graphical UI to configure the logic. I am crossing my fingers that we will see it sooner rather than later for FAST Search for SharePoint!

    Regards


    Thomas Svensen | Microsoft Enterprise Search Practice
    Wednesday, June 29, 2011 1:42 PM
    Moderator
  • Hi Thomas,

    PEWS sounds like a cool idea, but I was thinking of this more from an architectural angle than just speed.

    I'd like to create a VS2010 template which sets up a default processor skeleton, but also includes some basic functionality. When executing something like "yourproc.exe install" or "Install-CustomProc yourproc.dll", it would automatically register the input and output crawled properties in the configuration file, and the module would also be loaded by the service.
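
    The install step would mostly be configuration file manipulation; a rough sketch of the idea (the pipelineextensibility.xml path, GUID and property names are placeholders, and a real version would back up and validate the file):

    ```python
    import xml.etree.ElementTree as ET

    # Assumed FS4SP install path; in reality resolve %FASTSEARCH% from the environment.
    CONFIG = r"C:\FASTSearch\etc\pipelineextensibility.xml"

    def register(command, in_props, out_props, prop_set):
        tree = ET.parse(CONFIG)
        root = tree.getroot()  # <PipelineExtensibility>
        run = ET.SubElement(root, "Run", command=command)
        for section, names in (("Input", in_props), ("Output", out_props)):
            parent = ET.SubElement(run, section)
            for name in names:
                ET.SubElement(parent, "CrawledProperty",
                              propertySet=prop_set, propertyName=name,
                              varType="31")
        tree.write(CONFIG, encoding="utf-8", xml_declaration=True)

    if __name__ == "__main__":
        register(r"C:\stages\yourproc.exe %(input)s %(output)s",
                 in_props=["title"], out_props=["mycustomprop"],
                 prop_set="11111111-2222-3333-4444-555555555555")
        # The document processors must then reload the configuration
        # (psctrl reset on FS4SP).
    ```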

    This would make it easier to create and deploy custom modules. If they also run better, that's just a bonus :)

    I also hope CTS will make its way to FS4SP sooner rather than later, but I hope it will be more flexible than today's version, and also more lightweight in its hardware requirements. It's really heavy for something that could easily be pretty lightweight, imo.

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Wednesday, June 29, 2011 7:50 PM
    Hello Thomas, Mikael,

    I have a number of pipeline stages which I believe could take advantage of the PEWS framework, which I formerly used at Microsoft.

    Is PEWS available for partners and customers outside Microsoft?

    Anthony

    Monday, October 24, 2011 9:57 PM
  • Hi Anthony

    I haven't heard any updates on the availability of PEWS, but I'm checking again now. I will update this thread with any new info.

    Regards


    Thomas Svensen | Microsoft Consulting Services
    Tuesday, October 25, 2011 7:32 AM
    Moderator