Crawling a simple site, the SP crawler cannot follow hyperlinks; the FAST crawler can follow hyperlinks but reports error: Could not convert decimal number

  • Question

  • Hi,
    Recently we encountered an issue while crawling a simple internet site.

    Background:
    1. It is an anonymous-access site, containing 1 homepage and 10 individual pages. The homepage contains 10 hyperlinks which point to these 10 individual pages.
    2. The crawler server is able to access these 10 pages (tested with IE).
    3. Below is the HTTP response from the home page:
    HTTP/1.1 200 OK
    Server: nginx
    Date: Mon, 25 Mar 2013 02:21:06 GMT
    Content-Type: text/html; charset=UTF-8
    Connection: keep-alive
    Keep-Alive: timeout=60
    Vary: Accept-Encoding
    Content-Language: en
    Expires: Thu, 01 Jan 1970 00:00:01 GMT
    Cache-Control: no-cache
    Cache-Control: private
    Content-Length: 571

    <html oty_id="1" ptl_id="198" name="Photo archive"><body><a href="com67_index.obt?obt_id=250186">250186</a><a href="com67_index.obt?obt_id=250190">250190</a><a href="com67_index.obt?obt_id=276298">276298</a><a href="com67_index.obt?obt_id=316266">316266</a><a href="com67_index.obt?obt_id=269604">269604</a><a href="com67_index.obt?obt_id=269606">269606</a><a href="com67_index.obt?obt_id=330751">330751</a><a href="com67_index.obt?obt_id=330745">330745</a><a href="com67_index.obt?obt_id=330743">330743</a><a href="com67_index.obt?obt_id=330744">330744</a></body></html>
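    As a sanity check on the markup, the ten hrefs parse out cleanly with nothing more than the Python standard library (the ids below are copied from the response above; this is only a sketch of what any HTML-aware crawler should see, not what either crawler actually does):

```python
from html.parser import HTMLParser

# Rebuild the homepage body from the ten obt_id values quoted above.
ids = [250186, 250190, 276298, 316266, 269604,
       269606, 330751, 330745, 330743, 330744]
homepage_html = ('<html oty_id="1" ptl_id="198" name="Photo archive"><body>'
                 + ''.join(f'<a href="com67_index.obt?obt_id={i}">{i}</a>' for i in ids)
                 + '</body></html>')

class LinkExtractor(HTMLParser):
    """Collect every href attribute from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

parser = LinkExtractor()
parser.feed(homepage_html)
print(len(parser.links))   # 10
print(parser.links[0])     # com67_index.obt?obt_id=250186
```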

    4. For one individual page, the HTTP response is below:
    HTTP/1.1 200 OK
    Server: nginx
    Date: Mon, 25 Mar 2013 02:24:52 GMT
    Content-Type: text/html; charset=UTF-8
    Connection: keep-alive
    Keep-Alive: timeout=60
    Vary: Accept-Encoding
    Content-Language: en
    Expires: Thu, 01 Jan 1970 00:00:01 GMT
    Cache-Control: no-cache
    Cache-Control: private
    Content-Length: 595

    <html object="330751"><head><meta name="Title" content="FirstNameAAA LastNameBBB"/><meta name="Description"/><meta name="ImageUrl" content="http://www.xxx.com/fileroot/gallery/73114a.jpg"/><meta name="DirectUrl" content="http://www.xxx.com/mars/search.search?id=198p_aun_obt_id=330751"/><meta name="Created" content="2011-10-28"/><meta name="Keywords"/><meta
    name="author" content="aaa bbb"/><body>Object 330751</body></head></html>
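    Incidentally, the individual-page markup quoted above is itself malformed (the `<body>` element is nested inside `<head>`). Lenient HTML parsers shrug this off and still recover the meta properties, which would be consistent with the content indexing fine once a crawler reaches the page. A minimal sketch using an abridged copy of the markup above:

```python
from html.parser import HTMLParser

# Abridged copy of the individual-page response; note <body> inside <head>.
page_html = ('<html object="330751"><head>'
             '<meta name="Title" content="FirstNameAAA LastNameBBB"/>'
             '<meta name="Created" content="2011-10-28"/>'
             '<body>Object 330751</body></head></html>')

class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags, malformed nesting or not."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            d = dict(attrs)
            if 'name' in d:
                self.meta[d['name']] = d.get('content')

collector = MetaCollector()
collector.feed(page_html)
print(collector.meta['Title'])    # FirstNameAAA LastNameBBB
print(collector.meta['Created'])  # 2011-10-28
```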

    Everything looks fine so far.

    Next we set up a content source in the content SSA:
    Type: Web
    Start address: <the url of home page>
    Crawl setting: Only crawl within the server of each start address
    We also created a crawl rule (e.g. checked "crawl complex URLs which contain ?") to make sure the individual URLs will be crawled.

    Then we hit an issue with the SharePoint crawler:
    1. After a full crawl, we only got one successful crawl, which is the home page. The SP crawler "refused" to follow the hyperlinks on the homepage and crawl the 10 individual pages.
    2. If we set one individual page URL as the start address, it can be crawled successfully. So there is nothing wrong with, for example, the connection to the individual pages.
    3. Whenever the SP crawler reports a page as crawled, we can get it from the search results.

    And we also have a *different* issue with the FAST crawler:
    1. We tried the FAST web crawler. In this case, it is able to find all pages (1 homepage + 10 individual pages), and I can see the document count in its brand-new content collection go from 0 to 11.
    2. However (again!), if I run indexerinfo, I find that these 11 "successfully crawled" documents are in the not_indexed count.
    After further research, I found that FAST is able to extract the crawled properties and generate FIXML files, but the documents fail to be indexed.
    The log file D:\FASTSearch\data\data_fixml\doc_errors_TheNewSearch.dat contains
    the error below:
    "ec507eaa1973814c40a7c4fefb8c32b1 272 Aborted document during indexing at fixml file line 2189 column 51. Reason: AddDecimalNumber(bi1, bconsortdate, 1363244311) failed: Could not convert decimal number '1363244311' to an integer using 0 digit decimal precision. Hindexing".

    Line 2189, column 51 in the FIXML is below:
        <context name="bconsortdate"><![CDATA[1363244311']]></context>
    It belongs to the section "bi1" below:
      <catalog name="bi1">
        <context name="bconprocessingtime"><![CDATA[2013-03-14T06:59:10Z]]></context>
        <context name="bcondocdatetime"><![CDATA[2013-03-14T06:58:29Z]]></context>
        <context name="bconsize">617</context>
        <context name="bconhwboost">10000</context>
        <context name="bcondocrank"><![CDATA[0]]></context>
        <context name="bconsiterank"><![CDATA[0]]></context>
        <context name="bconurldepthrank"><![CDATA[500]]></context>
        <context name="bconwrite"><![CDATA[2013-03-14T06:58:29Z]]></context>
        <context name="bcondocumentsignature"><![CDATA[369874431937263284]]></context>
        <context name="bconsortdate"><![CDATA[1363244311']]></context>
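    Two quick observations about the rejected value, both verifiable in plain Python: 1363244311 reads as a Unix epoch timestamp that lands within a couple of seconds of the bconwrite value above, and the stray apostrophe visible inside the CDATA section is, on its own, enough to make any strict integer parser reject the string (whether that apostrophe is the actual trigger or just a transcription artifact is not something this sketch can settle):

```python
from datetime import datetime, timezone

# The rejected value reads as seconds since the Unix epoch and matches the
# document's write time above to within a couple of seconds.
ts = datetime.fromtimestamp(1363244311, tz=timezone.utc)
print(ts.isoformat())  # 2013-03-14T06:58:31+00:00

# The clean digit string converts fine ...
print(int("1363244311"))  # 1363244311

# ... but the trailing apostrophe seen in the CDATA value does not.
try:
    int("1363244311'")
except ValueError as exc:
    print("rejected:", exc)
```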

    To summarize:
    1. It looks like the content of the pages is correct, since the SP crawler is able to crawl and index the content of a page whenever it gets to it.
    2. The hyperlinks look fine as well, since the FAST crawler is able to follow them and find and download all 10 pages.
    3. Something is wrong with the SP crawler - it cannot follow hyperlinks the way the FAST crawler did.
    4. Something is wrong with the indexer when using the FAST web crawler - it reports "Could not convert decimal number to integer"; using the SP crawler we did not have this error.

    Has anyone had a similar issue before?

    Many thanks,
    Feng
    Monday, March 25, 2013 3:04 AM

Answers

  • OK, now I am answering the issue myself.

    1. Issue of the SP web crawler not following hyperlinks - MS is able to reproduce it in SP2010 and SP2013. It is not clear whether it is caused by the way the web site was built.
    MS support said this issue will be fixed in SharePoint 2013 SP1 or SharePoint 2010 SP2.

    2. Issue of FAST web crawler/indexer reports error:

    - In our FAST ranking profile, we use a managed property “sortdate” for freshness. This managed property “sortdate” maps to a crawled property named “sortdate”.

    - For the site which has the problem, the pipeline does not put any value into the crawled property “sortdate” (it is a new site, and we have not changed anything in the pipeline).

    - The problem is that the FAST web crawl cannot put the pages from the site into the index (it just puts them in “not_indexed”).

    - After debugging, we found that for the FIXML file, the indexer reports an error such as “Aborted document during indexing at fixml file line 2198 column 51. Reason: AddDecimalNumber(bi1, bconsortdate, 1368763127) failed: Could not convert decimal number '1368776044' to an integer using 0 digit decimal precision.” Meanwhile, by checking the spy file, the FAST crawler puts in an OOTB crawled property named “crawltime”, which contains a value like #### ATTRIBUTE crawltime <type 'int'>: 1368763127

    - Docpush is able to put the same page into the index. Please note that in the spy file of docpush, there is no “crawltime” crawled property.

    - By removing “sortdate” from the ranking profile, the problem is gone.

    - If we create another managed property (e.g. “sortdate2”) and apply it to the ranking profile, then as long as “sortdate2” contains NOTHING (not mapped to any crawled property, or mapped to an empty crawled property, e.g. sortdate), re-crawling the site will produce the same error: “Aborted document during indexing at fixml file line 2193 column 51. Reason: AddDecimalNumber(bi1, bconsortdate2, 1368777820) failed: Could not convert decimal number '1368777820' to an integer using 0 digit decimal precision.”

    - If we map “sortdate2” to “crawltime”, which contains an integer value, the problem is gone.

     

    Guessed root cause:

    “sortdate” is used by the ranking profile, but when it contains no value, FAST puts in another value from somewhere and then reports the “decimal convert to integer” error above.

     

    Solution:

    - Map “sortdate” to crawled property “crawltime” as the 2nd mapping (after crawled property “sortdate”). Even though “crawltime” contains the same integer value as in the error message above, it works. (Verified)

    - Make sure the pipeline always generates a value for managed property “sortdate”. (Verified)

    - Add a meta tag named “sortdate” to the HTML sources in order to feed the desired value to the crawled property when it is called in the document processing pipeline. (Not verified yet)
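    For the third (not yet verified) workaround, the idea would be to emit the tag server-side so the crawled property is never empty. A hypothetical sketch; the property name "sortdate" and the ISO-8601 timestamp format (chosen to mirror the bconwrite style seen in the FIXML) are assumptions, not confirmed FAST requirements:

```python
from datetime import datetime, timezone

def sortdate_meta(ts: datetime) -> str:
    """Render a hypothetical sortdate meta tag for inclusion in <head>."""
    # Format mirrors the bconwrite/bcondocdatetime values seen in the FIXML.
    return '<meta name="sortdate" content="{}"/>'.format(
        ts.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))

tag = sortdate_meta(datetime(2013, 3, 14, 6, 58, 29, tzinfo=timezone.utc))
print(tag)  # <meta name="sortdate" content="2013-03-14T06:58:29Z"/>
```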




    • Marked as answer by Feng_Lu Thursday, June 27, 2013 8:32 AM
    • Edited by Feng_Lu Thursday, July 25, 2013 4:41 AM update answer from MS for the first issue.
    Thursday, June 27, 2013 8:30 AM

All replies

  • In addition, we ported all pages from this external site to a local IIS site, but changed the file extension to *.html instead of "com67_index.obt?obt_id=***". All content in the pages remains the same.

    When we tried the SP crawler with the URL of the homepage on the local site, all pages were crawled and indexed successfully.

    Maybe something is wrong with the HTTP response header from the external internet site?

    -Feng
    Monday, March 25, 2013 3:09 AM
  • Hallo

    I have the same problem: issue 1, the web crawler cannot follow links.

    Waiting for SharePoint 2013 SP1 may take a long time.
    Has the SP1 publishing date been announced yet?

    Is there really no earlier fix available?
    Is there any CU or PU that fixes this?

    

    Friday, October 11, 2013 11:56 AM
  • The "AddDecimalNumber(bi1, bconsortdate, 1368763127) failed"

    issue is fixed by the October CU for FAST. Please see the link below for FAST CU info.

    http://social.technet.microsoft.com/wiki/contents/articles/2796.sharepoint-2010-fast-search-for-sharepoint-cumulative-updates.aspx

    • Proposed as answer by Srini Du - MSFT Wednesday, October 16, 2013 11:09 PM
    Wednesday, October 16, 2013 11:09 PM