FAST Search Web Crawler Proxy Issue

  • Question

  • Hi - I am using the FAST Search web crawler (not the SharePoint web crawler). If I include a proxy server in the XML configuration file, I can crawl and index external (i.e. public website) content but not intranet content. If I remove the proxy server, I can crawl internal content but not external content (the crawl log shows an HTTP 503 error). I want a crawl configuration that indexes intranet content and, where a page links to an external site, also indexes the linked page (i.e. a mixture of internal and external content).

    I guess I need a 'bypass proxy for local addresses' option, but I can't see a configuration setting for this in the online TechNet help.

    Regards,

    Francis

     


    Monday, July 18, 2011 3:56 PM

All replies

  • Hello Francis,

    Take the easy route: create one configuration file for the internal sites without a proxy, and one configuration file for the external sites with your proxy.

    You can add as many configurations as you want, as you might want different options for different sites.
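
    A rough sketch of the external-sites half of that split, assuming the usual <CrawlerConfig>/<DomainSpecification> layout of a crawler configuration file. The collection name, start URI, and proxy host are placeholders, not values from a working setup. The internal configuration would be a second file of the same shape, with its own name and start URIs and no proxy attribute.

    ```xml
    <?xml version="1.0" encoding="utf-8"?>
    <CrawlerConfig>
      <!-- "external_sites" is a placeholder collection name -->
      <DomainSpecification name="external_sites">
        <!-- Placeholder start URI for a public site -->
        <attrib name="start_uris" type="list-string">
          <member> http://www.example.org/ </member>
        </attrib>
        <!-- Send this collection's requests through the proxy (placeholder host) -->
        <attrib name="proxy" type="list-string">
          <member> http://proxy.example.com:8080 </member>
        </attrib>
      </DomainSpecification>
    </CrawlerConfig>
    ```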

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Monday, July 18, 2011 7:08 PM
  • Hi Mikael - Thanks for the reply. It's true I could do that, but it would be a very manual process.

    The internal content I want to index is several SharePoint blogs. These blogs contain links to external websites. I'd like the FAST web crawler to index all internal blog content (crawl mode FULL in the XML config file), but for external content to index only the specific page linked to (DEPTH:0).

    Using this article http://goo.gl/zBnqP I have created a config XML file with a main <DomainSpecification> section that has the crawl depth set to 0 and is set not to follow links between hosts. These are the settings that will apply to any external site linked to from the blogs:

        <!-- Crawl Mode -->
        <section name="crawlmode">
          <!-- Crawl depth (use DEPTH:n to do level crawling). -->
          <attrib name="mode" type="string"> DEPTH:0 </attrib>
          <!-- Follow links from one hostname to another (interlinks). -->
          <attrib name="fwdlinks" type="boolean"> no </attrib>
        </section>

    I also have a <SubDomain> section whose 'include_uris' section lists the address of the blog sites:

        <section name="include_uris">
          <attrib name="prefix" type="list-string">
            <member>http://blogs.example.com</member>
          </attrib>
        </section>

    The crawl settings in this subdomain, which will match only the internal blog sites, look like this (i.e. index everything and follow links to external sites):

        <section name="crawlmode">
          <!-- Crawl depth (use DEPTH:n to do level crawling). -->
          <attrib name="mode" type="string">FULL</attrib>
          <!-- Follow links from one hostname to another (interlinks). -->
          <attrib name="fwdlinks" type="boolean">yes</attrib>
        </section>

    I've tried including the following proxy definition in both the <DomainSpecification> and <SubDomain> sections, but can't get it to work.

        <attrib name="proxy" type="list-string">
          <member> http://proxy.example.com:8080 </member> 
        </attrib>
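
    Put together, the layout described above might look roughly like the sketch below. The <CrawlerConfig> root, the names "blogs_and_links" and "internal_blogs", and the start_uris attribute are assumptions added to make the fragment complete. The proxy is shown at the DomainSpecification level, which is only one of the two placements tried, and nothing in this thread confirms that the crawler will skip the proxy for URIs matched by the SubDomain.

    ```xml
    <?xml version="1.0" encoding="utf-8"?>
    <CrawlerConfig>
      <!-- Collection name is a placeholder -->
      <DomainSpecification name="blogs_and_links">
        <!-- Assumed start point: the internal blog site -->
        <attrib name="start_uris" type="list-string">
          <member> http://blogs.example.com/ </member>
        </attrib>
        <!-- Default rules: apply to external pages linked from the blogs -->
        <section name="crawlmode">
          <attrib name="mode" type="string"> DEPTH:0 </attrib>
          <attrib name="fwdlinks" type="boolean"> no </attrib>
        </section>
        <!-- Proxy placement is one of the two variants tried, not a confirmed fix -->
        <attrib name="proxy" type="list-string">
          <member> http://proxy.example.com:8080 </member>
        </attrib>
        <!-- Override for the internal blogs: crawl fully and follow outbound links -->
        <SubDomain name="internal_blogs">
          <section name="include_uris">
            <attrib name="prefix" type="list-string">
              <member> http://blogs.example.com </member>
            </attrib>
          </section>
          <section name="crawlmode">
            <attrib name="mode" type="string"> FULL </attrib>
            <attrib name="fwdlinks" type="boolean"> yes </attrib>
          </section>
        </SubDomain>
      </DomainSpecification>
    </CrawlerConfig>
    ```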


    In the end there may not be a workaround to this - just wanted to check and see if anyone had a bright idea.

    Thanks,

    Francis

    Tuesday, July 19, 2011 7:46 AM
  • Hello Francis,

    I haven't tried this myself, but from reading the documentation it should work the way you have set it up. If you specify include_uris (or include_domains) and set the proxy in the SubDomain section, it should in theory pick up the links crawled by the top configuration.

    You could perhaps try to use include_domains instead of include_uris and see if that works.
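
    As a sketch of that variation, the SubDomain could match on host name rather than on URI prefix. The "exact" matcher and the SubDomain name are assumptions based on my reading of the crawler configuration reference, not something tested in this thread.

    ```xml
    <SubDomain name="internal_blogs">
      <!-- Match by host name instead of by URI prefix -->
      <section name="include_domains">
        <attrib name="exact" type="list-string">
          <member> blogs.example.com </member>
        </attrib>
      </section>
      <!-- Same crawl mode as the include_uris version -->
      <section name="crawlmode">
        <attrib name="mode" type="string"> FULL </attrib>
        <attrib name="fwdlinks" type="boolean"> yes </attrib>
      </section>
    </SubDomain>
    ```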

    Also see this thread - http://social.technet.microsoft.com/Forums/en-US/fastsharepoint/thread/e419aa0c-5eb2-4606-b436-6fbfe1afef0f/ - for information about an API to control the enterprise web crawler.

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Tuesday, July 19, 2011 10:52 AM