FAST Search for SharePoint RSS feed

  • Question

  • Hi guys,

    I'm struggling to get an issue fixed. I'm trying to use the FAST web crawler in FAST Search for SharePoint. From the FAST Search Server 2010 management shell I run the command "crawleradmin.exe -f c:\fastsearch\etc\mgRSS.xml" to push through a config file to crawl an RSS website. However, I keep getting an error message saying that the start URL "http://www.thedti.gov.za" is invalid and not included in the start URIs. What confuses me is that I don't even want to crawl that site, and it has nothing to do with the site I do want to crawl. I even tried the crawler template that comes standard with FAST Search Server 2010, but I still get the same error.

    I'd appreciate any ideas or solutions, please.

    Thanks

    Clayton

    Tuesday, October 4, 2011 6:42 AM

All replies

  • Hi Clayton,

    It's difficult to determine the cause without seeing the config file you have in place. Are you using a standard crawler template? Out of the box, C:\FASTSEARCH\etc\crawlerconfigtemplate-rss.xml doesn't have a start URI; it only has a placeholder that says:

    <member>RSS STARTURI HERE</member> 

    The template assumes that you will put your own start URI in. If you did that, perhaps you did not create hostname include rules? Just a thought. Maybe you could paste the relevant sections of your crawler configuration file here, and we can try to determine whether the problem is related to that.
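
    For example, a minimal filled-in sketch might look something like this (www.contoso.com is just a placeholder hostname; note that the include_domains rules match hostnames, so the host name goes there rather than the full feed URI):

    <section name="rss">
        <!-- The feed itself goes in the start URI list. -->
        <attrib name="start_uris" type="list-string">
            <member> http://www.contoso.com/rss </member>
        </attrib>
    </section>

    <section name="include_domains">
        <!-- Hostname only, not the full feed URL. -->
        <attrib name="exact" type="list-string">
            <member> www.contoso.com </member>
        </attrib>
    </section>

    If "follow_links" is enabled, any linked hosts you want crawled would need include rules as well.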

    Thanks!

    Rob Vazzana | Sr Support Escalation Engineer | US Customer Service & Support

    Customer Service & Support | Microsoft Services

    Friday, October 7, 2011 8:58 PM
  • Hi Rob,

    I am using the exact template that you mentioned. Below is my config:

    <?xml version="1.0"?>
    <CrawlerConfig>
        <!-- Crawl collection name, must be unique for each collection.      -->
        <!-- Documents are indexed in the collection by the same name.       -->
        <DomainSpecification name="sp">

            <!-- Basic crawl options (RSS template).                            -->
            <!--                                                                -->
            <!-- To crawl only RSS feeds and the direct links from the feed,    -->
            <!-- specify the RSS feeds below and set "auto_discover",           -->
            <!-- "follow_links" and "ignore_rules" to no. Do not specify any    -->
            <!-- normal start URIs or crawl rules.                              -->

            <section name="rss">
                <!-- List of start (seed) URIs pointing to RSS feeds. -->
                <attrib name="start_uris" type="list-string">
                    <member> http://mg.co.za/rss </member>
                    <member> http://mg.co.za/page/rss-feeds/ </member>
                </attrib>
               
                <!-- Automatically discover new feeds. Use when combined with -->
                <!-- a regular crawl (with crawl rules!) only.                -->
                <attrib name="auto_discover" type="boolean"> yes </attrib>

                <!-- Follow hyperlinks from RSS documents. Enabling this will -->
                <!-- create a wider crawl, and crawl rules must be used.      -->
                <attrib name="follow_links" type="boolean"> yes </attrib>
               
                <!-- Ignore crawl rules. Should not be used together with     -->
                <!-- "follow_links".                                          -->
                <attrib name="ignore_rules" type="boolean"> no </attrib>

                <!-- Toggle indexing of the RSS feed itself.                  -->
                <attrib name="index_feed" type="boolean"> yes </attrib>

                <!-- Maximum age of links. Documents older than this will be  -->
                <!-- deleted if "del_expired_links" is enabled.               -->
                <attrib name="max_link_age" type="integer"> 14400 </attrib>

                <!-- Maximum number of links. Once exceeded the oldest        -->
                <!-- documents will be deleted if "del_expired_links" is      -->
                <!-- enabled.                                                 -->
                <attrib name="max_link_count" type="integer"> 128 </attrib>

                <!-- Delete expired links, see above.                         -->
                <attrib name="del_expired_links" type="boolean"> yes </attrib>
            </section>

            <!-- Delay in seconds between requests to a single site -->
            <attrib name="delay" type="real"> 10 </attrib>

            <!-- Length of crawl cycle expressed in minutes -->
            <attrib name="refresh" type="real"> 1440 </attrib>

            <!-- Maximum size of a document (bytes). -->
            <attrib name="cut_off" type="integer"> 5000000 </attrib>

            <!-- Toggle JavaScript support (using the Browser Engine). -->
            <attrib name="use_javascript" type="boolean"> no </attrib>

            <!-- Toggle near duplicate detection. -->
            <attrib name="near_duplicate_detection" type="boolean"> yes </attrib>

            <!-- Only crawl HTTP/HTTPS (e.g., don't crawl FTP). -->
            <attrib name="allowed_schemes" type="list-string">
                <member> http </member>
                <member> https </member>
            </attrib>

            <!-- Allow these MIME types to be retrieved. -->
            <attrib name="allowed_types" type="list-string">
                <member> text/* </member>
                <member> application/* </member>
                <member> application/xml </member>
                <member> application/pdf </member>
                <member> application/rss+xml </member>
            </attrib>
           
            <section name="include_domains">
                <attrib name="exact" type="list-string">
                    <member> http://mg.co.za/rss </member>
                    <member> http://mg.co.za/page/rss-feeds/ </member>
                </attrib>
            </section>

        </DomainSpecification>
    </CrawlerConfig>

    I don't see a problem in the config, but I hope you can help determine the issue.

    Clayton

    Monday, October 10, 2011 7:23 AM
  • Hi Clayton,

    I did not see anything obvious in the config file, so I would recommend opening a service request with our Technical Support team to address this issue.

    Thanks!

    Rob Vazzana | Sr Support Escalation Engineer | US Customer Service & Support

    Customer Service & Support | Microsoft Services

    Friday, October 28, 2011 4:00 PM