Is it possible to index a 1 GB MS Office Document in FS4SP?

  • Question

  • Hi,

    This may sound like a bit much, but is it possible to feed and index a 1 GB MS Office document in FS4SP? Since FS4SP is 64-bit, RAM usage is most likely not a limitation. Has anyone tried this before?

    Thanks,

    Ken

    Tuesday, March 22, 2011 10:57 AM

Answers

  • There are several obstacles in the way of doing this.

    I'd love to see this work, so I'll give some input.

    First is the default max size for the crawler components in SharePoint 2010. (FASTContentSSA is the name of my test Content SSA)

    $s = Get-SPEnterpriseSearchServiceApplication FASTContentSSA
    $s.GetProperty("MaxDownloadSize")
    64
    

    As you can see, the default size is 64 MB. All files larger than this will be skipped.

    This can be increased with the following command:

    $s.SetProperty("MaxDownloadSize",1024)
    $s.Update()
    
    #Restart-Service osearch14
    

    In theory this should let files of 1GB be crawled. How it will work I have no idea :)
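
    To double-check that the change took effect, the same property can simply be read back after the update (the value is in MB, so it should now return 1024):

    $s = Get-SPEnterpriseSearchServiceApplication FASTContentSSA
    $s.GetProperty("MaxDownloadSize")
    # expected output after the change: 1024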

    Office files would most likely be converted using IFilter, and you're in luck, as the default max output size is 1073741824 bytes, which is 1 GB. If your Office file is 1 GB, it will most likely contain less than 1 GB of text.
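
    As a quick sanity check on that figure, PowerShell's built-in size literal agrees:

    1024 * 1024 * 1024
    # 1073741824
    1GB -eq 1073741824
    # True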

    Next thing would be to allow the file to be indexed in FAST.

    By default, the body content is set to 16 MB (specified in the index profile), and it can be increased up to 2 GB.

    In order to accommodate 1 GB files we could do this:

    $field = Get-FASTSearchMetadataManagedProperty -Name body
    $field.MaxIndexSize = 1048576
    $field.Update()
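
    For reference, MaxIndexSize is evidently specified in KB here (1048576 KB = 1 GB, matching the value above). Reading it back after Update() is a cheap way to confirm the change stuck:

    (Get-FASTSearchMetadataManagedProperty -Name body).MaxIndexSize
    # expected: 1048576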
    

    Once you have employed these two changes it should work in theory, unless some component decides to give up.

    Regards,
    Mikael Svenson 

     


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Tuesday, March 22, 2011 7:02 PM

All replies

  • Hi Mikael,

    Thank you for your response. I tested this scenario yesterday, except for increasing the max index size to 1 GB (1048576). I created an empty text file of 1 GB in size and appended the following text as the last line: "FAST Search Test". I started crawling, and after 49 minutes the crawl status had still not changed to "Idle".
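
    One way to build such a test file (the path below is just an example) is to create a zero-filled 1 GB file with fsutil and then append the marker line:

    fsutil file createnew C:\temp\fs4sp-test.txt 1073741824
    # fsutil fills the file with zero bytes; Add-Content appends the marker as the last line
    Add-Content -Path C:\temp\fs4sp-test.txt -Value "FAST Search Test"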

    Is there a GUI for checking the docproc and indexer logs in FS4SP, similar to the one in ESP 5.3?

    Regards,

    Ken

     

    Wednesday, March 23, 2011 2:14 AM
  • Hi Ken

    Sorry, no GUI in FS4SP. But the "doctrace + doclog" utility gives you more details anyway. The official documentation is here.

    Regarding tuning of the maximum document size, here is an unofficial statement from someone on the developer team:

    "... The document processors on the FAST nodes have a hard-coded 2GB memory limit. If they use more than that, they will restart. MaxIndexSize should therefore not be tuned too high, otherwise you will reach that limit. Due to various processing that is performed (document conversion, tokenization, entity extraction and so on), I think anything above 300MB on MaxIndexSize will give you trouble. In other words, you will probably only be able to index the first 300-400MB of extracted text from a crawled file. The rest will be truncated..."

    Of course, your mileage may vary, and only testing will give you the real truth.

    Please share findings here!

    Regards,

    Thomas


    Thomas Svensen | Microsoft Enterprise Search Practice
    Wednesday, March 23, 2011 10:02 AM
    Moderator
  • Thank you for the info, Thomas.

    Once my FS4SP environment is available I will try to test the scenario. In a real environment there may be files of around 100-500 MB in size. Have you tried feeding files of these sizes? Do you need a custom item processor for this?

    Regards,

    Ken

    Wednesday, March 23, 2011 10:59 AM
  • You can use "psctrl doctrace on" and "doclog -a" as you would in ESP.

    Also you can add the spy stage manually or turn on FFDDumper in optional processing.
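
    Roughly, from the FS4SP command shell (on the admin server, if memory serves):

    psctrl doctrace on
    # re-feed / re-crawl the problem document, then inspect what was logged
    doclog -a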

    -m


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Thursday, March 24, 2011 1:17 PM
  • Mikael,

    I followed your suggested steps above, and a content source whose initial full crawl took 1.5 hours was still being indexed after 16 hours. Are there any logs I can view to determine what might have gone wrong?

    I also can't seem to stop the crawl! I'll restart the server.

    I'm going to revert to 300 MB, as per the developer's advice above, and try again.

    Friday, June 10, 2011 8:26 AM
  • Hi Shane,

    I decided to test this. First I created a 300 MB text file with repeating content: "this is a test this is a test".
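
    As a rough sketch, a file like that could be generated along these lines (path and size are purely illustrative):

    $path   = "C:\temp\repeat-test.txt"
    $sizeMB = 300
    $line   = "this is a test this is a test"
    $writer = New-Object System.IO.StreamWriter($path)
    # estimate bytes per line including CR/LF to work out how many lines to write
    $lines = [math]::Ceiling(($sizeMB * 1MB) / ($line.Length + 2))
    for ($i = 0; $i -lt $lines; $i++) { $writer.WriteLine($line) }
    $writer.Close()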

    When indexing, the mssearch process (the crawler) peaked at 917 MB of RAM, the content distributor at 619 MB, and the document processor at 2.2 GB, but then it died, and it kept retrying the file for a while before I killed the indexing.

    My next test was with a 200 MB file with the same content.

    mssearch: 600 MB
    content distributor: 400 MB
    docproc: 1.8 GB

    The results were the same: it did not manage to complete, and kept on retrying the file.

    It would never move past the PersonExtractor1 stage.

    I continued to reduce the file size, and once I got down to 10 MB it worked and managed to clear all the pipeline stages; 20 MB failed. So somewhere between 10 and 20 megabytes of raw text seems to be the limit in my crude test. This means the file itself can be much larger, but the extracted text should not be much above 10 MB (which in itself is quite a lot).

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Friday, June 10, 2011 8:03 PM