Process Server timeout

  • Question

  • Hello,

    We are indexing a file share with approx. 15,000 text files (.txt). The problem is that a crawl (full or incremental) takes very long (17 hours). It seems that the processor servers are restarted every 5 minutes because of timeouts:


    [2012-01-25 12:11:44.101] INFO       contentdistributor Processor Server procserver_2 on <Server> has been restarted
    [2012-01-25 12:11:44.101] WARNING    contentdistributor Processor server <Server> _13396 timed out while processing batch. 4849 - 4849. Removing processor server and notifying client of batch failure
    [2012-01-25 12:12:10.147] INFO       contentdistributor Processor Server procserver_1 on <Server> has been restarted
    [2012-01-25 12:12:10.147] WARNING    contentdistributor Processor server <Server> _13395 timed out while processing batch. 4856 - 4856. Removing processor server and notifying client of batch failure
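    The cadence of these warnings can be measured directly from the log lines. A small sketch, assuming only the timestamp layout shown above (the regex and helper function are illustrative, not an FS4SP tool):

```python
# Sketch: extract the timestamps of contentdistributor batch-timeout
# warnings and measure the gap between them. The regex matches the
# log layout shown above; it is not an official FS4SP log schema.
import re
from datetime import datetime

TIMEOUT_RE = re.compile(
    r"\[(?P<ts>[\d-]+ [\d:.]+)\]\s+WARNING\s+contentdistributor\s+"
    r"Processor server .* timed out while processing batch"
)

def timeout_timestamps(lines):
    """Return the timestamps of all batch-timeout warnings, in order."""
    stamps = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            stamps.append(datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"))
    return stamps

log = [
    "[2012-01-25 12:11:44.101] WARNING    contentdistributor Processor server <Server> _13396 timed out while processing batch. 4849 - 4849. Removing processor server and notifying client of batch failure",
    "[2012-01-25 12:12:10.147] WARNING    contentdistributor Processor server <Server> _13395 timed out while processing batch. 4856 - 4856. Removing processor server and notifying client of batch failure",
]
stamps = timeout_timestamps(log)
print(len(stamps), (stamps[1] - stamps[0]).total_seconds())
```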

    The FAST Server runs in a virtual environment. The text files are not that big (the biggest file is 240 KB).

    Is it possible to configure this timeout value?

    best regards
    Hannes

    Wednesday, January 25, 2012 11:29 AM

Answers

  • Hi,

    Good find indeed!

    The out.gz files will be larger when you have more content, as they contain terms gathered during indexing. The TermExtractor stage is used for automatic spell tuning. I have actually never looked into this myself, but I will check the same files in other environments to compare sizes.

    Depending on your content you might want to change parameters in C:\FASTSearch\etc\processors\linguistics\termextractor.xml or comment out the processor altogether.

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    • Marked as answer by Hannes Pils Thursday, January 26, 2012 9:53 PM
    Thursday, January 26, 2012 9:14 PM

All replies

  • Hi,

    Could you check the following KB article and see if it might relate to your issue as well?

    http://support.microsoft.com/kb/2570111

    Other questions:

    • Any custom item processors?
    • One server deployment?
    • Number of CPUs?
    • How much RAM?
    • Any errors in the windows event log on the FS4SP server(s)?

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Wednesday, January 25, 2012 8:35 PM
  • Hello Mikael,

    KB article: I don't think this applies to us; the indexed data is local on the FS4SP server.

    We have one custom item processor. It's a .NET application that does two things:
    - parse the url
    - read an additional text file with metadata information for the indexed file

    We have a one server deployment, the hardware specs:
    - 4 CPUs (2.66 GHz)
    - 8 GB RAM

    There are no errors related to FS4SP in the event log. The whole thing works without problems on a test environment (which runs on a physical machine). Maybe the disk I/O on the production system (virtual environment) is the problem. With 4 document processors the CPU was 100% utilized during indexing. I reduced it to 2 document processors; now the CPU is 70-80% utilized.

    best regards
    Hannes

    Thursday, January 26, 2012 7:04 AM
  • Hi,

    It's better to have it steady below 100%, so 70-80% sounds fine. You should use Performance Monitor to check your disk queue length and see if I/O is being saturated. If the data is being read and indexed on the same storage, you are pushing the I/O for sure.

    The Average Disk Queue Length counter should stay below two for a single disk. If it's above two for long periods of time, you have disk issues. If it's a RAID system the numbers would be different, and how the storage is mounted in the VM counts as well. Is it directly attached or virtualized?
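    That rule of thumb can be sketched as a small check. The threshold of two per spindle comes from the paragraph above; the function and its parameters are illustrative, not a Performance Monitor API:

```python
# Sketch: flag disk saturation from sampled "Avg. Disk Queue Length"
# values using the guideline of about 2 outstanding requests per
# physical spindle. For a RAID set, pass the spindle count.
def disk_queue_saturated(samples, spindles=1, threshold_per_spindle=2.0):
    """True if the average queue length exceeds the per-spindle guideline."""
    avg = sum(samples) / len(samples)
    return avg > threshold_per_spindle * spindles

print(disk_queue_saturated([1.2, 1.8, 1.5]))              # single disk -> False
print(disk_queue_saturated([6.0, 7.5, 8.1]))              # single disk -> True
print(disk_queue_saturated([6.0, 7.5, 8.1], spindles=4))  # 4-spindle RAID -> False
```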

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Thursday, January 26, 2012 8:04 AM
  • Hello Mikael,

    Thank you for your help. We will watch the Average Disk Queue length counter on the next crawl.

    Another problem we face is that all files are crawled on an incremental crawl even if they haven't changed. Could this be caused by the timeout issue?

    best regards
    Hannes

    Thursday, January 26, 2012 9:30 AM
    I think I found something:

    contentdistributor.exe has the following parameter:
    --batch-processing-timeout=<#seconds>    timeout value for how long a procserver is permitted to process on a batch before it's being told to abort processing

    This parameter is currently not set in NodeConf.xml. It would be interesting to know the default value.

    best regards
    Hannes

    Thursday, January 26, 2012 10:16 AM
  • Hi,

    I would be more interested in why each item is taking so long to process.

    Execute "psctrl statistics" and see if one of the processors is taking much more time than the others.

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Thursday, January 26, 2012 11:58 AM
  • Hello Mikael,

    I ran a crawl on our test system, and the processors taking the most time are (in that order):
    - CustomerExtensibility
    - CompanyExtractor1
    - TermExtractor
    - FastHTMLParser
    - LocationExtractor1

    I think CustomerExtensibility is our .NET application. I will try to optimize the application.

    We are not using the person/company/location refiners. Can I remove the processors CompanyExtractor1, CompanyExtractor2, LocationExtractor1, etc. from the pipeline to save execution time?

    best regards
    Hannes

    Thursday, January 26, 2012 2:01 PM
  • Hi,

    You can open etc\pipelineconfig.xml and comment out the Company and Location modules if they are not needed. Be aware that an update to FS4SP can put them back in. Commenting them out will leave you in an unsupported state, but I wouldn't worry too much about that for these processors.
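    To illustrate the idea of filtering stages by name, here is a small sketch. The XML is a made-up miniature standing in for the real pipelineconfig.xml, whose actual structure may differ; in practice you would comment the entries out by hand as described above:

```python
# Sketch: drop Company/Location processor stages from a pipeline
# definition. SAMPLE is an invented miniature, NOT the real FS4SP
# pipelineconfig.xml layout -- purely for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """
<pipeline name="Office14 (webcluster)">
  <processor name="FastHTMLParser"/>
  <processor name="CompanyExtractor1"/>
  <processor name="LocationExtractor1"/>
  <processor name="TermExtractor"/>
</pipeline>
"""

root = ET.fromstring(SAMPLE)
unwanted = [p for p in root.findall("processor")
            if p.get("name").startswith(("Company", "Location"))]
for p in unwanted:
    root.remove(p)

print([p.get("name") for p in root.findall("processor")])
# -> ['FastHTMLParser', 'TermExtractor']
```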

    You can always comment them back in after your initial crawl, or if you need support at some point.

    It seems your custom module is the culprit. I would try commenting it out as well and see how fast indexing is without it. Then re-add it and measure.

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Thursday, January 26, 2012 6:29 PM
  • Hello Mikael,

    I have started a crawl on the production system, and the processor server statistics look different than on the test system. The TermExtractor stage has more than 10 times the value of CustomerExtensibility. It seems to me that this stage is the problem.

    TermExtractor is accessing files with the extension out.gz in the FAST\data\termextractor directory. There is one file for every processor server. These files are much bigger (30 MB) than on the test system (200 KB).

    That's strange: each out.gz file contains only one tiny text file with a few bytes. I stopped the processor servers and moved these files away; they were recreated after I started the servers again. And there is no timeout anymore!

    It seems to me this is a bug in FAST: the out.gz files grow with each indexed item, and extracting them takes too long once they become too big.
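    For what it's worth, such out.gz files are ordinary gzip streams and can be inspected with standard tools. A sketch (the file contents here are invented; it only demonstrates that decompression cost grows with what is stored):

```python
# Sketch: write and re-read an out.gz-style file with the standard
# gzip module. The "term" lines are a stand-in for gathered terms,
# not TermExtractor's actual file layout.
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for i in range(1000):
        f.write(f"term{i}\n")  # stand-in for terms gathered during indexing

with gzip.open(path, "rt", encoding="utf-8") as f:
    lines = sum(1 for _ in f)

print(lines, os.path.getsize(path) > 0)
```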

    best regards
    Hannes

    • Edited by Hannes Pils Thursday, January 26, 2012 8:55 PM
    Thursday, January 26, 2012 7:55 PM
  • Hello Mikael,

    I wanted to thank you for your great help. I never would have found this by myself. The indexing of the production system just finished in 1 h 20 min (instead of 24 hours previously).

    best regards
    Hannes

    Thursday, January 26, 2012 9:52 PM
  • Hannes/Mikael,

     

    It is true that the TermExtractor stage is used for spell tuning. Remember that if you comment it out, your spellcheck dictionary will not be updated. The TermExtractor stage collects terms in each document, and the spelltuner then periodically checks these terms and writes them to the spellcheck dictionary. By removing the stage, the spellcheck dictionaries will not get any relevant spellcheck suggestions.

    If you have been crawling for quite some time, it may be that this stage is not truly needed in your case, since you most likely will not be adding many new terms in new documents. Most documents contain the same terms over and over, as there is only a limited number of terms in each language. But again, this is highly dependent on your case.

    You could also use the article below to add custom terms:

    http://support.microsoft.com/kb/2592062


    Igor Veytskin
    Friday, January 27, 2012 7:03 PM
    Moderator
  • Hi Igor,

    Is there a way to improve the speed of this step in a supported way? For example, scanning over FIXML files afterwards to generate a frequency list for the spell-tuner. That way indexing could run without having to create the frequency list, and it could be done afterwards, for example during off-hours.
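    The offline frequency-list idea can be sketched like this. Plain strings stand in for FIXML content, and the function name is illustrative:

```python
# Sketch: build a term-frequency list from already-indexed text so
# spell-tuning input could be produced outside the indexing pipeline.
# Purely illustrative; real FIXML would need parsing first.
import re
from collections import Counter

def term_frequencies(documents):
    """Count lowercase alphabetic terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

docs = ["FAST search is fast", "search index search"]
print(term_frequencies(docs).most_common(2))
# -> [('search', 3), ('fast', 2)]
```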

    I have also seen the extractors take up time, and if you don't use them it would be great if you could turn them on/off like the people extractor.

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Friday, January 27, 2012 8:39 PM