Fast ESP: Bulk indexing using filetraverser

  • Question

  • Hi,

    Our aim is to index 100 million records (average record size: 15 KB) in FAST ESP 5.3 SP3.

    Our environment has 7 indexers, 2 QR servers, 1 admin server, and 2 document processors.

    During the initial phase we indexed 0.4 million records and all went fine, although the "systemmsg" collection kept throwing the following errors/warnings:

    1) [2011-02-14 02:22:15]   WARNING fsearch   <admin host>   37707   systemmsg   Couldn't get fully qualified host name for [<indexer host>]: Unable to find fully qualified hostname, gethostbyname("<indexer host>") returned "<indexer host>"


    2) [2011-02-14 02:24:00]   ERROR  fdispatch   <admin host>  37700   systemmsg   Document summary too short, couldn't unpack

     
    3) [2011-02-14 02:24:01]   WARNING  fdispatch    <admin host> 37700   systemmsg   System inconsistency, /Fast/data/data_index/<indexer host>.normalized.1297666824/index.cf has 9 catalogs, while /Fast/FastESP/var/searchctrl/etc/index.cf file has 11 catalogs.

        Despite these problems we were able to index all the records properly; we verified this by searching randomly.

    Afterwards, when we fed a batch of 3 million records, the following problems (errors/warnings) were encountered:

    1) All of the problems from phase 1, plus:

    2)[2011-02-07 03:39:51] WARNING  filetraverser <admin host>  <collection> Document failed for URI=17051223: processing::Batch probably lost during processing


    3)[2011-02-10 09:43:26] INFO      : filetraverser@<host>: <collection>: Documents processed unsuccessfully: 5 (0.0/s)      
       [2011-02-10 09:43:28] INFO      : filetraverser@<host>: <collection>: Documents processed ok: 2696032 (130.3/s)      

        Even so, the filetraverser showed all the fed documents as indexed successfully and existing; the same was reflected in the admin GUI collection overview.

    When we tried to verify through the search view, the following was observed:

    1) Searching for a unique ID returned multiple records.

    2) Sometimes entirely different records were returned, but not the one we were searching for (the unique ID).

    In short, the data was inconsistent.

     

    I am totally confused as to what the reason could be. Please help me find the root cause of the problem and a solution.

     

    Thanks

    Monday, February 14, 2011 8:08 AM

All replies

  •  


    Hello Nitin,

     

    Problems 2 and 3 from your first phase are actually not real problems. This is a known issue, caused by FAST not properly rewriting all config files when you upload a new index profile. Perform the following steps on all relevant nodes:

    1. nctrl stop

    2. delete

    %FASTESP%\var\searchctrl\etc\*

    %FASTESP%\var\etc\*

    3. nctrl start

    (consult this Article: http://support.microsoft.com/kb/2017688)
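
    For a Linux node the whole procedure is a short script (a minimal sketch; $FASTSEARCH here is an assumption for the ESP install root, matching the %FASTESP% paths above):

        # stop all ESP processes on this node
        nctrl stop
        # remove the generated config files that go out of sync with the index profile
        rm -rf "$FASTSEARCH"/var/searchctrl/etc/*
        rm -rf "$FASTSEARCH"/var/etc/*
        # start ESP again; the files are regenerated on startup
        nctrl start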

    Please report back whether this works. On a Linux environment it works perfectly; on Windows, unfortunately, not always.

     

    For problem 1, check that the IP and hostname configuration on all of your servers is done properly, as described in the FAST documentation (installation prerequisites).
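
    For example, on each node you can quickly check what gethostbyname() will see (a sketch; the FQDN shown is made up):

        # the fsearch warning means name lookup returns the short name instead of a FQDN
        hostname -f                    # should print something like indexer1.example.com
        getent hosts "$(hostname)"     # should map to the FQDN as well
        # in /etc/hosts, list the FQDN before the short alias, e.g.:
        # 10.0.0.7   indexer1.example.com   indexer1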

     

    Phase 2: "documents processed unsuccessfully" means that these documents have some format failures (for example, if they are XML, the XML is not well formed: a closing tag is missing or something like that).
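
    If your input is XML, a quick well-formedness pass over the feed directory shows which files the document processors would reject (a sketch; /data/feed is a hypothetical path):

        # report every input file that is not well-formed XML
        for f in /data/feed/*.xml; do
            xmllint --noout "$f" 2>/dev/null || echo "not well formed: $f"
        done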

    If you do not end up with all 3 million docs in the index, it may be because the filetraverser already sent the documents with these IDs on a previous run, so it sees no need to send them again. Check your filetraverser update and delete options and modes; this may also be the cause of the inconsistency in your index. The DB tables of the filetraverser are independent of the index: the filetraverser keeps track of the IDs it has already fed and will not resend them unless you tell it to. Consult the documentation for the filetraverser options and modes that fit your case best, and for how to clean up the filetraverser DB if you need to.

     

    Regards

    Monday, February 14, 2011 4:05 PM
  • Hi again,

     

    I am not really sure that I explained it very well. If you delete a document from the index, its ID remains in the filetraverser DB, so the next time you index via the filetraverser, it does not know that this document is no longer in the index and will not send it again unless you use the right filetraverser options. The same goes for updates and so on.

    FAST does not keep the filetraverser DB and the index in sync; you have to do that yourself. One way is to feed, update, and delete documents only via the filetraverser; then, if you use the right options, the two will always stay in sync (this, of course, is the best case).

     

    Regards

    Monday, February 14, 2011 4:41 PM
  • Hi Lina,

    As per your suggestions, the following things were observed.

    1) Regarding the <host name> error:

    I have checked the whole configuration and the host names are all correct. Still, FAST keeps giving this error for all the indexing nodes at one time or another.

     

    2) Regarding unsuccessful document processing:

    I have observed 2-3 times that the failures start from one particular record. On analysis we found that this single record was 18 MB. FAST tries twice to process the record and then throws the warning

    "[2011-02-07 03:39:51] WARNING  filetraverser <admin host>  <collection> Document failed for URI=17051223: processing::Batch probably lost during processing"

    and for all subsequent records the following log is obtained:

    "[2011-02-10 09:43:26] INFO      : filetraverser@<host>: <collection>: Documents processed unsuccessfully: 5 (0.0/s)      
     [2011-02-10 09:43:28] INFO      : filetraverser@<host>: <collection>: Documents processed ok: 2696032 (130.3/s)"

    In short, after the warning, not all records are processed successfully.
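
    To pull such records out of the bulk feed up front, the share can be scanned for oversized files (a sketch; the share path and the 15 MB cutoff are assumptions, adjust them to your data):

        # list input files above the size the pipeline handles reliably, largest first
        find /data/share -type f -size +15M -printf '%s\t%p\n' | sort -nr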

     

    Solution tried:

    filetraverser -c <collection name> -r <File share> -x :<tag>:<primary id> -S 1 -m 20480 -z 51200 -T 200

     

    Results obtained:

    On a standalone installation of FAST, the batch containing the 18 MB record got indexed with some delay; results verified.

    On the multinode installation, the same batch could not be indexed.

     

    3) Regarding the searching issue:

    When searching FAST after an indexing run that involved the above errors, the results obtained for a unique-ID search are wrong. Wrong here means that sometimes the searched entity is not present in the results at all, or that alongside the intended result we get some additional nearby hits that were not desired at all.

     

    4) We were able to index 2.9 million records in FAST after removing the batches containing huge records. FAST reported that the records were successfully indexed, but error 3 (the searching issue) was still observed.

     

    Please help me analyze what is happening on the backend (i.e., the FAST side) and suggest a way out.

     

    Thanks

    Wednesday, February 16, 2011 6:56 AM
  • Hi,

    Regarding the searching issue:

    After changing the key field's datatype from double to int, we are getting proper results when searching.
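
    That fits the symptoms: an IEEE 754 double represents integers exactly only up to 2^53, so large numeric IDs stored as double can collapse onto the same value, which would explain both the missing exact match and the undesired "nearby" hits. A quick illustration (awk stores all numbers as doubles):

        # two distinct 17-digit ids become the same value once stored as doubles
        awk 'BEGIN { print (90071992547409931 == 90071992547409932) }'   # prints 1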

    Thanks.

     

    Monday, February 21, 2011 9:57 AM