How can I increase the speed of crawling data with my .NET Assembly Connector?

  • Question

  • Hi!

    How can I crawl and send documents faster to my FAST Search for SharePoint 2010 (FS4SP) server? The speed I am currently getting is ~10 documents per second. This is much lower than what I believe is possible with my current setup.

    I have coded and deployed a .NET Assembly Connector to BCS in my SharePoint 2010 installation. This connector retrieves data from an Oracle database. I have also configured the Search Service Application in SharePoint to use this external data source for crawling. The crawler impact rule specifies that 16 documents are to be retrieved in parallel. All of this works just fine, apart from the fact that the retrieval of documents is very slow.

    The document processors in FAST Search for SharePoint 2010 are mostly idle; 2 of 10 are occasionally used. Indexing runs at 250-300 documents per second. The SharePoint server is only using around 10-15% CPU, and the Oracle server can deliver far more than 10 documents per second, at least 50-60.

    Any suggestions on how to change my configuration or what to check are welcome!

    Thanks and regards,

    Gunnar

    Tuesday, June 21, 2011 10:58 AM

Answers

  • I have received an interesting response from MS.

    All documents are indeed being parsed by IFilters in SharePoint for no reason. And apparently all data is being filtered, not just the data from the StreamAccessor.
    My contact explained that "The reason that the filtering is there, is because the SharePoint crawler is a shared component between FAST Search for SharePoint 2010 and the built-in search in SharePoint."
    And they apparently have no official plans of changing this.

    (Wow! This will slow down crawling for EVERY content source for EVERY FS4SP out there. It applies to all content source types, not just BCS-based ones. Why Microsoft did not fix this before shipping is beyond me. It is just incredibly sloppy work. I hope they fix it soon.)

    If you remove the filters from the registry, all documents are instead filtered by a Null filter that "will just pass the data through and not do any filtering". This will speed up the crawl for every source you have.

    The performance counters "OSS Search Gatherer\Documents Filtered" and "OSS Search Gatherer\Documents Successfully Filtered Rate" do indeed show that documents are filtered by IFilters before being sent to FS4SP, including by the Null filter.

    So I guess a very clear recommendation is to remove these filter values from the registry under [HKLM\SOFTWARE\Microsoft\Office Server\14.0\Search\Setup\ContentIndexCommon\Filters\Extension] or [HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\14.0\Search\Setup\ContentIndexCommon\Filters\Extension] - or both, I don't know which one yet - as soon as possible if you are only going to use FS4SP!
    And also exclude the [...AppData\Local\Temp\gthrsvc_osearch14] folder from anti-virus scanning.
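
    Something like the following PowerShell sketch can back up and then clear these filter registrations (the key paths are the ones above; the backup location, clearing both hives and the service restart are my own assumptions - test on a non-production farm first):

        # Candidate keys as discussed above; which one applies may vary per farm
        $keys = @(
            'SOFTWARE\Microsoft\Office Server\14.0\Search\Setup\ContentIndexCommon\Filters\Extension',
            'SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\14.0\Search\Setup\ContentIndexCommon\Filters\Extension'
        )

        $i = 0
        foreach ($key in $keys) {
            $i++
            # Export a .reg backup first (assumes C:\Backup exists); merge it back to restore
            reg export "HKLM\$key" "C:\Backup\FiltersExtension$i.reg" /y

            # Remove the per-extension filter registrations (subkeys like .docx, .pdf, ...)
            Get-ChildItem "HKLM:\$key" | Remove-Item -Recurse -Verbose
        }

        # Restart the SharePoint Server Search 14 service so the change takes effect
        Restart-Service OSearch14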

    Regards Gunnar


    Tuesday, June 5, 2012 3:45 PM

All replies

  • Hi Gunnar,

    Check this post: http://social.technet.microsoft.com/Forums/en-IN/sharepoint2010programming/thread/6042490c-4f83-4308-8962-4ed118033837

    This tells you how to check and set the performance level of the Search Service (osearch14), which handles the crawling. On my test image it was set to "PartiallyReduced" instead of "Maximum". If running in a reduced mode, it will disregard the number of threads set by the crawler impact rules.
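
    For reference, you can check and set this from the SharePoint 2010 Management Shell with something like the following (a sketch; verify the exact value names with Get-Help Set-SPEnterpriseSearchService on your farm):

        # Load the SharePoint snap-in when running from a plain PowerShell prompt
        Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

        # Show the current performance level of the search service (osearch14)
        Get-SPEnterpriseSearchService | Select-Object PerformanceLevel

        # Raise it to Maximum so the crawler impact rules are honored
        Set-SPEnterpriseSearchService -PerformanceLevel Maximum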

    Also take a look at http://sharepoint.microsoft.com/Blogs/fromthefield/Lists/Posts/Post.aspx?ID=96, which might shed some light on this. Basically, try lowering your number from 16 and see if the speed increases. Too many threads can create a bottleneck on the Oracle calls.

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Tuesday, June 21, 2011 7:24 PM
  • Hi Mikael,

    Thanks for your reply, but neither of those suggestions led to any notable improvement. Regardless of the performance level and impact rules, the speed remained the same. I know I can retrieve data more quickly from my Oracle server, but for some reason the crawler does not. Maybe a Custom Connector rather than a .NET Assembly Connector would speed things up?

    Also, I noticed that when creating or changing a Crawler Impact Rule, the performance level setting is reset to "PartiallyReduced". So each time I created or changed a rule, I had to reset the performance level to "Maximum". Quite annoying. Bug or feature? Does anyone else experience this?

    With regards,

    Gunnar


    Gunnar Braaten - Search Consultant - Bouvet ASA - www.bouvet.no

    Thursday, June 23, 2011 10:34 AM
  • I think it might have to do with me crawling binary file data.

     

    I retrieve BLOBs from a database using a StreamAccessor method. The retrieval of data is quick, but it takes a while before the data is actually sent to the FS4SP server. It looks like the Search Service in SharePoint is doing something with the binary data before sending it on to FS4SP. I thought the parsing of data was done by FS4SP, not by this service, and that it should only pass the data along.

    • Am I wrong?
    • Is there any good documentation on what the Search Service does to binary file data before sending it to FS4SP?
    • Is there any good way to monitor what it does, and how long it takes?

    With regards,

    Gunnar

     

    Monday, June 27, 2011 12:34 AM
  • When I send document BLOBs via the StreamAccessor, it looks like the documents are being parsed by IFilters in SharePoint as well as by FS4SP.

    The performance counters "OSS Search Gatherer\Documents Filtered" and "OSS Search Gatherer\Documents Successfully Filtered Rate" show that documents are "filtered" before being sent to FS4SP.
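
    If you want to watch these counters from a shell instead of perfmon, something like this works (the counter paths are the ones quoted above; a sketch, so verify the names on your crawl server):

        # Counter paths as quoted above; names may differ per install
        $paths = '\OSS Search Gatherer\Documents Filtered',
                 '\OSS Search Gatherer\Documents Successfully Filtered Rate'

        # Take twelve samples, five seconds apart, while a crawl is running
        Get-Counter -Counter $paths -SampleInterval 5 -MaxSamples 12 |
            ForEach-Object {
                foreach ($s in $_.CounterSamples) {
                    '{0,-65} {1,12:N1}' -f $s.Path, $s.CookedValue
                }
            }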

    I removed all the filters from the registry under [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\14.0\Search\Setup\ContentIndexCommon\Filters\Extension], and crawling speed increased dramatically. And I still get correct results from FS4SP. But documents were still being "filtered", as per the definition of the counters.

    What do these counters refer to? Does anybody have any detailed information regarding this? Maybe someone in Microsoft knows?

    How are these filters used?

    What is the point of this extra parsing by these filters?

    With regards,

    Gunnar


    Gunnar Braaten - Search Consultant - Bouvet ASA - www.bouvet.no

    Tuesday, June 28, 2011 11:01 AM
  • Gunnar,

    Did you figure this out? I'm seeing the same thing on two FAST farms - the documents are parsed on the crawling host as well as in FAST. The crawling host is of course the bottleneck, as the FAST farm is idling.

    Thanks,

    Step van Schalkwyk

    Thursday, April 5, 2012 4:30 PM
  • Hello Gunnar,

    I would not expect the data to be filtered twice. At this point I would suggest opening a support case for SharePoint 2010 to understand why the SharePoint crawler is also filtering the data.

    Best Wishes,

    Michael Puangco | Senior Support Escalation Engineer | US Customer Service & Support


    Tuesday, April 24, 2012 7:28 PM
    Moderator
  • Hi Step!

    I have not figured this one out, no.

    Here are some more cases I believe are related to this issue:

    http://social.technet.microsoft.com/Forums/en-AU/fastsharepoint/thread/137b3277-0cc1-4c74-bf10-ed3e50659373

    http://social.technet.microsoft.com/Forums/en-US/fastsharepoint/thread/ae3a61c8-218c-43ac-84cd-07d73c3c4bef

    http://social.msdn.microsoft.com/Forums/sa/sharepoint2010general/thread/6be9914c-48e0-4ec1-9b5f-44060f040759

    All of these look like error messages that are variants of, or the same as, the warning I am currently getting:

    "The filtering process terminated because the item content reached the maximum filter output limit. Check that the filter does not generate a large amount of data relative to the size of the document. The item's content may be too large to index."

    Again, this indicates that the documents are indeed being processed both on the SharePoint (BCS) side and by FS4SP. But I am not sure.

    Have you tried removing the registry keys (after taking a backup:-) )?

    With regards,

    Gunnar

    Friday, April 27, 2012 8:21 AM
  • Hi Michael!

    Since you work at Microsoft, can you give us a little more information on how this crawling is implemented? Or find someone that can?

    There seems to be a lot of confusion around this issue, and very little documentation.

    With regards,

    Gunnar

    Friday, April 27, 2012 8:58 AM
  • I just opened a support case on this. Let's see if that will help us understand this issue better.

    Regards Gunnar


    Thursday, May 3, 2012 10:36 AM
  • Hi Gunnar

    Did you manage to get any help from MSFT? I am experiencing the same issue, but my customer does not mind the crawl speed. What they are concerned about is that the BCS connector seems to create copies of the stream objects in the crawl account's temp folder (for example, windowsdirectory\Users\username\AppData\Local\Temp\gthrsvc_OSearch14). Every time you run the crawler, that folder gets filled up with what look like stream objects with the exact same file sizes as the source files.

    Job

    Tuesday, May 29, 2012 11:00 AM
  • Also, you need to exclude this folder from antivirus monitoring, which helps crawl speed. Have you tried implementing batching in your .NET Assembly Connector? You can retrieve the records in batches, say 1000 per request, which also bulk-feeds the indexer.

    You can also monitor crawl performance through performance counters:

    http://technet.microsoft.com/en-us/library/ff383289.aspx#FASTContentPlugin

    Use the Performance Monitor (perfmon) tool, add these counters, and you can find out how many records are open/pending for indexing.
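
    For example, something like this lists the gatherer and FS4SP counter sets available on the crawl server (the wildcard patterns are guesses on my part - adjust them to whatever your server reports):

        # Discover the gatherer / FS4SP content plugin counter sets and their counters
        Get-Counter -ListSet '*Gatherer*', '*FAST*' |
            Select-Object -ExpandProperty Counter

        # Then sample whichever counters report open/pending items, e.g.:
        # Get-Counter -Counter '\OSS Search Gatherer\Heartbeats' -Continuous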


    Sriram S

    Tuesday, May 29, 2012 11:22 AM
  • Job,

    I'm not sure this particular point is unusual; my understanding is that this temp directory is always written to by the SharePoint Gatherer before content is passed on to FAST. This is why it's a common recommendation from support to exclude this folder from anti-virus scanning on the SharePoint crawl component side. Here is what I see looking through various bits and pieces:

    *************************************************************************

    Gatherer Process:

    This part is responsible for crawling and indexing content from various repositories, such as SharePoint sites, HTTP sites, file shares, Exchange Server, etc. This component lives inside MSSearch.exe.

    When a request is issued to crawl a 'Content Source', MSSearch.exe invokes a 'Filter Daemon' process called MssDmn.exe. This loads the protocol handlers and filters necessary to connect, fetch and parse the content. The protocol handler and IFilter can be third-party code.

    MssDmn.exe is responsible for downloading the documents to %TEMP%\gthrsvc_osearch14 (temp dir of the Content SSA user. Example: C:\Users\FARMSVCACCOUNT\AppData\Local\Temp\gthrsvc_osearch14), before the FAST content plugin in MSSearch.exe picks them up.
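
    You can see this happening by polling that temp folder while a crawl is running, e.g. with a rough sketch like this (substitute your own crawl account's temp path):

        # Path is the example from above; adjust for your crawl service account
        $dir = 'C:\Users\FARMSVCACCOUNT\AppData\Local\Temp\gthrsvc_osearch14'

        while ($true) {
            $files = @(Get-ChildItem $dir -ErrorAction SilentlyContinue)
            $mb = ($files | Measure-Object Length -Sum).Sum / 1MB
            '{0:HH:mm:ss}  {1,6} files  {2,10:N1} MB' -f (Get-Date), $files.Count, $mb
            Start-Sleep -Seconds 5
        }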


    Igor Veytskin

    Tuesday, May 29, 2012 2:05 PM
    Moderator
  • Did removing the filter registry values help you increase the crawl speed? Are you able to see the difference in crawl speed? If so, by what percentage did the speed increase? Are there any drawbacks or issues to watch out for after making this registry change?

    Sriram S

    Monday, July 30, 2012 7:37 AM
  • Gunnar, can you confirm that this registry change made things quicker for you? And did your "Documents Filtered" count and rate go to zero because nothing was being filtered any more, or do documents still show up in the counters?

    Sharepoint Administrator

    Friday, January 31, 2014 3:20 AM