The document data was not indexed when feeding content from the BDC with a huge XML file (FS4SP)

  • Question

  • Hi,

    I implemented a BDC connector to feed content from an XML file and use FS4SP (the FAST Content SSA) to index that data.

    Reference: http://toddbaginski.com/blog/how-to-create-a-searchable-sharepoint-2010-bdc-.net-assembly-connector-which-reads-from-a-flat-file/

    The problem I encounter is this: if I index a small XML file, around 3-4 MB, everything works fine and all the XML nodes are indexed into FAST and searchable. But if I index a bigger XML file, around 10 MB, the data does not get indexed and the crawl never seems to stop.

    On the crawl history panel it shows success = 2, and it is still crawling. (Crawl duration is 2 hours and ticking.)

    FYI, when I index the 3-4 MB XML, it takes just 4 minutes to complete.

    Do you have any idea what the cause of this problem is?

    Thanks and Regards,

    Andy

    Tuesday, June 21, 2011 4:50 AM

All replies

  • Hello Andy,

    If you index files which contain a lot of text by themselves, you will get errors in the pipeline. I have done some small tests on this, and somewhere between 10 and 20 MB of text will choke the extractors in the pipeline (locations, names, companies), and the content will be retried several times before giving up, consuming a lot of time.

    But since you are indexing XML files there is hope. You should take a look at the XmlMapper (http://msdn.microsoft.com/en-us/library/ff795813.aspx#custom-xml-overview) and configure XPath statements to select the nodes you want indexed and map them to separate fields. This way the extracted text will be less than 10 MB.

    10 MB might not be very much in terms of data, but it is in fact a lot of text. Let me know how this goes and if you need further help on the issue.

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Tuesday, June 21, 2011 8:00 AM
  • Hi Mikael,

     

    Thanks for your response.

    The reason I have to use the BDC is that I have to do security trimming and stamp the SID into the ACL of each item. So I don't think I can switch from the BDC to the XmlMapper.
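
    (For context, the ACL stamping I mean is roughly the usual pattern of building a binary security descriptor per item so the index can trim results per user. This is just a sketch, not my actual code; the full-access mask and the single allowed account are simplifications.)

     // Sketch: build a binary security descriptor that grants one Windows account
     // access to the item; the crawler stores it and uses it for security trimming.
     // Uses System.Security.Principal and System.Security.AccessControl.
     public static byte[] GetSecurityDescriptor(string domainUser)
     {
         var account = new NTAccount(domainUser);
         var sid = (SecurityIdentifier)account.Translate(typeof(SecurityIdentifier));

         var dacl = new DiscretionaryAcl(false, false, 1);
         dacl.AddAccess(AccessControlType.Allow, sid, unchecked((int)0xFFFFFFFF),
                        InheritanceFlags.None, PropagationFlags.None);

         var descriptor = new CommonSecurityDescriptor(false, false,
             ControlFlags.None, sid, null, null, dacl);

         var bytes = new byte[descriptor.BinaryLength];
         descriptor.GetBinaryForm(bytes, 0);
         return bytes;
     }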

    Moreover, the crawl I told you about earlier has stopped now, but the documents were not indexed, with the following error:

    The server is unavailable and could not be accessed. The server is probably disconnected from the network.

    And as I told you earlier, when I changed the XML to the small one, the documents were indexed successfully. (So I don't think it's a permission problem.)

     

    FYI, each XML node stores the metadata of one document, and one of the metadata fields is the file path.

    And I implemented the BDC connector to open a file stream on that path and index the content into FAST.

     

    Example of the XML format:

     <documents>
         <document>
             <id>1</id>
             <author>Andy</author>
             <path>//myserver/sharefile/doc1.doc</path>
         </document>
         <document>
             <id>2</id>
             <author>Andy</author>
             <path>//myserver/sharefile/doc2.doc</path>
         </document>
     </documents>

     

    Thanks and Regards,

    Andy

     

    Tuesday, June 21, 2011 8:45 AM
  • Hi Andy,

    I see. So you are not sending the XML itself to FAST, as I initially thought.

    The question then is whether it is your custom connector that fails on large files, or something on the FAST side. Can you examine the logs on the crawler server (the SharePoint logs)?

    Just a hunch: are you retrieving the binary file for every item in your iterator before it's sent over to FAST? If so, you will most likely run out of memory from holding all the data in memory at the same time. Lazy loading of the binary files in the ReadItem method might solve the issue.
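
    Roughly the difference I mean, as a sketch (your entity type and the file path lookup will have different names; treat these as placeholders):

     // Eager loading: the binary content of every document is read while enumerating,
     // so the whole data set has to fit in memory at the same time.
     public static List<Entity> GetAllEntitiesEager()
     {
         var all = new List<Entity>();
         foreach (Entity entity in GetAllEntities())
         {
             entity.Content = File.ReadAllBytes(GetFilePath(entity.ID));
             all.Add(entity);
         }
         return all;
     }

     // Lazy loading: only the metadata is enumerated up front; the file stream is
     // opened when the crawler asks for that one item (e.g. via a StreamAccessor method).
     public static Stream GetAttachmentLazy(string id)
     {
         return File.OpenRead(GetFilePath(id));
     }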

    And are you getting the same error if you index into the built-in SharePoint search instead of FAST?

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Tuesday, June 21, 2011 9:47 AM
  • Hi Mikael,

    What do you mean by "lazy loading" in ReadItem?

    Currently, ReadItem returns the whole entity like this:

     // read each document node
     public static entity1 ReadItem(string id)
     {
         foreach (entity1 entity in GetAllEntities())
         {
             if (entity.ID == id)
             {
                 return entity;
             }
         }
         return null;
     }

    -----------------------------------------------------------------

    where GetAllEntities returns all the document nodes from the XML, except the file content, which is loaded through GetAttachment as follows:

     public static Stream GetAttachment(string id)
     {
         string filepath = GetFilePath(id);

         return File.Open(filepath, FileMode.Open, FileAccess.Read);
     }

     

    and this is the GetAttachment method in the BDC model:

     <Method Name="GetAttachment">
       <Parameters>
         <Parameter Name="id" Direction="In">
           <TypeDescriptor Name="ID" TypeName="System.String" IdentifierName="ID" />
         </Parameter>
         <Parameter Name="stream" Direction="Return">
           <TypeDescriptor Name="Stream" TypeName="System.IO.Stream" />
         </Parameter>
       </Parameters>
       <MethodInstances>
         <MethodInstance Name="AttachmentStream" Type="StreamAccessor" ReturnParameterName="stream">
           <Properties>
             <Property Name="MimeTypeField" Type="System.String">FileType</Property>
             <Property Name="FileNameField" Type="System.String">OriginalFileName</Property>
           </Properties>
         </MethodInstance>
       </MethodInstances>
     </Method>

     

    I'm not sure: is this applying the lazy loading you mentioned?

    Do you have any reference regarding lazy loading in ReadItem?

     

    And are you getting the same error if you index into the built-in SharePoint search instead of FAST?

    -> If I use the file share crawler there is no error, but I cannot use it because I have to do the security trimming process.

     

    Thanks and Regards,

    Andy

    Tuesday, June 21, 2011 10:09 AM
  • Hi Andy,

    Calling GetAttachment per item as the items are handled would be lazy loading, yes :) If the files were retrieved in the GetAllEntities loop, it would be eager loading.

    Unfortunately I don't have much experience with this myself. When you start the crawl, can you monitor the memory usage of the processes on your SharePoint server and see if any of them grows a lot once the indexing starts?

    Maybe you could also split the XML into smaller parts instead of creating one big file? If you turn the GetAllEntities method into an IEnumerable, add a loop to read the XML files one after the other, and yield the items out one by one (with the yield statement), this should help.
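
    Just a sketch of what I mean; the Entity class, the folder path, and the element names here are placeholders for whatever your model actually uses:

     public static IEnumerable<Entity> GetAllEntities()
     {
         // Read the split xml files one at a time and stream the document nodes out
         // with yield return, so only one file is held in memory at any point.
         foreach (string filePath in Directory.GetFiles(@"\\server\xmlfolder", "*.xml"))
         {
             var doc = new XmlDocument();
             doc.Load(filePath);

             foreach (XmlNode node in doc.GetElementsByTagName("document"))
             {
                 yield return new Entity
                 {
                     ID = node["id"].InnerText,
                     Author = node["author"].InnerText,
                     Path = node["path"].InnerText
                 };
             }
         }
     }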

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Tuesday, June 21, 2011 7:40 PM
  • Hi Mikael,

     

    I think I have done what you said.

    I split the XML into small chunks of about 2 MB each and stored them in one folder.

    In the BDC, I loop over each XML document in the folder:

     public static List<Entity> GetAllEntities()
     {
         List<Entity> EntityList = new List<Entity>();

         String[] filePaths = Directory.GetFiles(@"\\myserver\XMLfolder", "*.xml");
         XmlDocument doc = new XmlDocument();
         XmlNodeList nodeList;
         for (int i = 0; i < filePaths.Length; i++)
         {
             doc.Load(filePaths[i]);
             nodeList = doc.GetElementsByTagName("Document");
             foreach (XmlNode node in nodeList)
             {
                 // get value from xml node and pass to entity.property
             }
         }

         return EntityList;
     }

     

    Like I said, everything works fine as long as the total size of the XML in that folder is less than 10 MB.

    It's kind of strange, because it is supposed to process the files one at a time.

    And the error:

    The server is unavailable and could not be accessed. The server is probably disconnected from the network.

    is not related to any memory issues.

    I have done some research on this error, and some say that disableLoopBack (disabling the loopback check) might help.
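
    (The disableLoopBack change I mean is the well-known loopback check registry value; I would normally set it with regedit, but expressed as code it is roughly this sketch:)

     // Sketch: equivalent to adding a DWORD value DisableLoopbackCheck = 1 under
     // HKLM\SYSTEM\CurrentControlSet\Control\Lsa. Uses Microsoft.Win32.Registry;
     // a reboot or at least an IISRESET is usually needed afterwards.
     public static void DisableLoopbackCheck()
     {
         Registry.SetValue(
             @"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa",
             "DisableLoopbackCheck",
             1,
             RegistryValueKind.DWord);
     }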

    Also, some say that Microsoft released a patch to fix this issue. Do you have any idea regarding this?

     

    Thanks for your help so far :)

    Andy

    Wednesday, June 22, 2011 3:38 AM
  • Hi Mikael,

     

    FYI, after I applied disableLoopBack, the "server unavailable" error was gone. BUT I am now facing a new error:

    "The filter daemon did not respond within the timeout limit."


    And the same scenario occurs ....

     

    When I index 6 MB of XML, everything is fine.

    But when I add more XML files to that folder and the total size of all the XML exceeds 10 MB, the crawl takes a long time and returns 1 success; the rest are errors ("The filter daemon did not respond within the timeout limit.").

    And most of the documents are Word documents, so I don't think it is an IFilter problem.

    Thanks and Regards,

    Andy

    Wednesday, June 22, 2011 9:57 AM
  • Hello Andy,

    Could you provide a copy of your connector and code which fails so I could try to reproduce it? You can reach me at miksvenson AT gmail.com.

    You can also try to increase the timeout: http://technet.microsoft.com/en-us/library/ee808892.aspx

    Regards,
    Mikael Svenson 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Thursday, June 23, 2011 6:07 PM
  • Hi Mikael,

     

    I have changed the feeding logic to index smaller chunks of the XML.

    I also optimized the code related to reading the XML files.

    Now it seems OK. I haven't seen any errors, but I am still not completely sure it's working 100%.

    So I assume that "The filter daemon did not respond within the timeout limit." may be related to a memory issue.

    Have you had any clients who faced this problem and then reported back that it was because of a memory issue?

     

    Many Thanks

    Andy

    Friday, June 24, 2011 7:35 AM
  • Hello Andy,

    I have never encountered this error myself; that's why I wanted to investigate it further by looking at the code.

    Glad you seem to have found a way around it :)

    -m 


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Friday, June 24, 2011 12:00 PM