FAST Search Server 2010 won't follow links found on .CFM page

  • Question

  • For some reason I cannot get FAST Search Server 2010 to follow links from an HTML page with the file extension .CFM.  I was able to confirm that the file extension alone is causing the problem by generating two versions of the sitemap (map.jsp and map.cfm), each of which refers to a single piece of content (content.jsp and content.cfm respectively).  When I set the starting point to map.jsp, both the sitemap and the content page get indexed.  When I set the starting point to map.cfm, only the sitemap gets indexed.

    This error can be reproduced using Tomcat as the web application content source.  An additional servlet mapping is necessary in web.xml so that Tomcat will process the .CFM file just like any other JSP file.

    <servlet-mapping>
            <servlet-name>jsp</servlet-name>
            <url-pattern>*.cfm</url-pattern>
    </servlet-mapping>

    map.jsp

    <a href="content.jsp">content</a>

    content.jsp

    <html>
    <meta name="description" content="This is meta content for the page" >
    
    hello world.
    </html>

    map.cfm

    <a href="content.cfm">content</a>

    content.cfm

    <html>
    <meta name="description" content="This is meta content for the page" >
    
    hello world.
    </html>

    The CFM file extension is NOT listed as a file type to exclude from content index in Central Administration.

    Any thoughts on what the cause is?  I know the HTML is malformed, but the fact that the JSP content page is indexed indicates that the HTML markup really isn't a problem for FAST.

     -Tim

    Tuesday, February 14, 2012 1:45 AM

Answers

  • We had the same problem and found changing a FAST setting on the Content SSA solved it.

    1. Open up a FAST PowerShell prompt (run as your FAST service account)
    2. Run the following to get a handle to the Content SSA (correct the SSA name as needed for each environment):
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "<Content SSA name>"
    3. Run the following to get the list of extensions:
    Get-SPEnterpriseSearchExtendedConnectorProperty -SearchApplication $ssa -Identity ExtensionsToFilter
    4. Add ColdFusion plus any others you may need (whatever comes back from the previous command + cfm, etc.):
    Set-SPEnterpriseSearchExtendedConnectorProperty -SearchApplication $ssa -Identity ExtensionsToFilter -Value ";ascx;asp;aspx;htm;html;jhtml;jsp;cfm;"
    5. To verify, run the get again:
    Get-SPEnterpriseSearchExtendedConnectorProperty -SearchApplication $ssa -Identity ExtensionsToFilter
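
    If you would rather not retype the whole list by hand in step 4, the new value can be built from whatever step 3 returns. The following is only a minimal sketch: Add-FilterExtension is a hypothetical helper name (it is not a SharePoint or FAST cmdlet), and the commented usage assumes the object returned by the Get cmdlet exposes the list through a Value property, which you should check in your own environment first.

    ```powershell
    # Hypothetical helper (not part of SharePoint/FAST): merges extra extensions
    # into a semicolon-delimited list such as ";ascx;asp;aspx;", skipping duplicates.
    function Add-FilterExtension {
        param(
            [string]$Current,       # existing list, e.g. the value returned in step 3
            [string[]]$Extension    # extensions to add, with or without a leading dot
        )
        # Split on ';' and drop the empty entries produced by the leading/trailing ';'
        $items = @($Current -split ';' | Where-Object { $_ -ne '' })
        foreach ($ext in $Extension) {
            $clean = $ext.Trim().TrimStart('.').ToLowerInvariant()
            # -notcontains compares case-insensitively, so "ASP" matches "asp"
            if ($clean -and ($items -notcontains $clean)) { $items += $clean }
        }
        # Rebuild the list in the same ";a;b;c;" shape used in step 4
        return ';' + ($items -join ';') + ';'
    }

    # Stand-alone demo with a sample value:
    Add-FilterExtension -Current ';ascx;asp;aspx;htm;html;jhtml;jsp;' -Extension 'cfm'
    # -> ;ascx;asp;aspx;htm;html;jhtml;jsp;cfm;

    # Assumed usage against the Content SSA (the .Value property is an assumption):
    # $prop = Get-SPEnterpriseSearchExtendedConnectorProperty -SearchApplication $ssa -Identity ExtensionsToFilter
    # $new  = Add-FilterExtension -Current $prop.Value -Extension 'cfm'
    # Set-SPEnterpriseSearchExtendedConnectorProperty -SearchApplication $ssa -Identity ExtensionsToFilter -Value $new
    ```

    Building the value from the current setting avoids accidentally dropping an extension that someone else already added to the list.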


    • Edited by cho_c Friday, July 19, 2013 7:26 PM typo
    • Proposed as answer by cho_c Friday, July 19, 2013 7:27 PM
    • Marked as answer by cecropin Thursday, July 25, 2013 2:59 PM
    Friday, July 19, 2013 7:25 PM

All replies

  • Another less Java-centric way of reproducing this effect is the following:

    Use a sitemap called map.txt with the following content:

    <html> <head> <title>Map Test</title> </head> <body> <a href="content.txt">content</a><br> hello world</body> </html>

    Then create a content.txt page with any HTML content in it.  When I set map.txt as the starting page, map.txt gets indexed and its content type is detected as text/html despite the .txt file extension; however, content.txt is not crawled.

    Alternatively, you could use the JSP and CFM pages above and deploy them to IIS.  Since the pages are actually static HTML, configure IIS to treat .jsp and .cfm as MIME type text/html.  The pages will then get served by IIS.  I've also tried configuring user_converter_rules.xml so that .cfm and .jsp are recognized as text/html with autodetection, and this still had no effect on whether the crawler follows links.

    This suggests that document format detection is somehow independent of the processing of links within a document.  Any idea what module in the pipeline handles this?  There's not much documentation on URLProcessor or any of the other modules.

    Wednesday, February 15, 2012 2:23 AM
  • I'm having the same issue. Were you able to find a solution?

    Thanks, -Erkan

    Thursday, July 18, 2013 3:58 PM
  • Erkan,

    Unfortunately I was not able to find a true solution to the problem.  Instead I developed a workaround.  It so happened that I only needed to follow the links from a single CFM page that acts as a sitemap.  Because FAST wouldn't follow the links on a CFM page, I created a proxy with a .NET page (ASPX).  The ASPX page rendered the original content and adjusted the relative URLs.  I pointed FAST to the proxy, and in the end it indexed the CFM content one hop from the ASPX sitemap.

    It's a kludge but got the job done.

    -Tim

    Thursday, July 18, 2013 4:48 PM
  • Hi,
    As you already mentioned that the CFM file extension is not excluded, I do not think it is a crawl rule issue.

    Are you using the SP web crawler (not the FAST web crawler)?

    I had a similar problem: for a customized web site (not based on ASP.NET), the SP web crawler could not follow the hyperlinks. But if we moved all content to an ASP.NET-based site, or simply used the FAST web crawler, it worked.

    I contacted MS support and they were able to reproduce this problem (at least with my web site). The answer I got from them was to wait for SharePoint 2010 SP2. The good news is that SharePoint 2010 SP2 was released yesterday, but I have not verified whether SP2 fixes this issue.

    Or, you can try the FAST web crawler.

    /Feng
    Thursday, July 25, 2013 4:39 AM
  • Using cho_c's solution worked great for us. The ExtensionsToFilter property, albeit an undocumented feature, seems to be the key to whether the crawler in the FAST Content SSA follows the hyperlinks.

    Thanks, -Erkan

    Thursday, July 25, 2013 2:52 PM
  • I haven't had a chance to confirm your answer myself and probably won't for some time due to other priorities; however, based on Erkan's response and the fact that the FAST Content SSA is responsible for crawling website content on our search infrastructure, I see no reason to think it is incorrect.

    Thanks for the info.

    Feng,

    Cho's solution indicates that there's another layer of filtering performed by the FAST Content SSA in addition to the standard crawl rules you can access through the SharePoint Central Administration console.  Even if the crawl rules don't exclude .cfm, the FAST Content SSA's configuration does, and that takes precedence.

    -Tim

    Thursday, July 25, 2013 3:07 PM