SharePoint 2010 crawling external websites not working

    Question

  • Hi
    I have a SharePoint 2010 Enterprise installation on two Windows 2008 R2 servers, with SQL Server 2008 R2 on an additional Windows 2008 R2 server. My problem is that I cannot crawl external websites. Local SharePoint sites and file shares are working; only external websites fail. I have a SharePoint 2007 farm and a Search Server 2008 Express installation on which I can crawl the same websites without problems. I set the DisableLoopbackCheck DWORD in the registry on both servers, but this didn't help. The warning in the event log varies depending on the crawled URL. Here are the two warnings:

    Source: SharePoint Server Search
    Event ID: 14

    The start address http://www.exampleurl1.ch cannot be crawled.

    Context: Application 'Search_Service_Application_01', Catalog 'Portal_Content'

    Details:
    Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. If the repository being crawled is a SharePoint repository, verify that the account you are using has "Full Read" permissions on the SharePoint Web Application being crawled.   (0x80041205)

    ------

    The start address http://www.exampleurl2.ch cannot be crawled.

    Context: Application 'Search_Service_Application_01', Catalog 'Portal_Content'

    Details:
    The filtering process has been terminated (0x80040db4)


    I know these warnings are known problems on SharePoint 2007, but none of the solutions I found work on SharePoint 2010. I hope someone can help me.
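
    For reference, this is roughly how I set the DisableLoopbackCheck value on both servers (a minimal PowerShell sketch; the registry path is the standard one for this setting, and a reboot or IIS reset may be needed afterwards):

    # Create/overwrite the DisableLoopbackCheck DWORD and set it to 1.
    New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa' `
        -Name 'DisableLoopbackCheck' -Value 1 -PropertyType DWord -Force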

    Thanks

    Reto Boehi

    Thursday, June 24, 2010 12:27 PM

All replies

  • Is there anything between your server and the target website? For example, a proxy or a firewall...
    Thursday, June 24, 2010 6:19 PM
  • Nothing special.
    The websites are of course on the internet and protected by a firewall.
    But crawling them works from SharePoint 2007 and Search Server 2008 Express on the same network.

    Reto

     

    Friday, June 25, 2010 11:41 AM
  • Reto,

    There is a problem related to SharePoint 2010 and Windows Server 2008 R2. I have tracked down the specific issue and was able to provide Microsoft with logs and repro steps. I was told that the Microsoft support team was able to repro the issue and that they are investigating. If you drop back to Windows Server 2008 (not R2), everything will work fine.

    I am supposed to get a call back from Microsoft support today regarding this issue.  I will update the post once I hear more info.

    Friday, June 25, 2010 5:39 PM
  • Ok... here is something that might be a workaround. It looks like if the crawler finds a robots.txt file in the root of the public web server, the crawl is successful. If that file is missing, it gives an access denied error. This only happens when running SharePoint 2010 on Windows Server 2008 R2.

    I am still waiting to see if we can get an answer on exactly why this is happening only on that OS and if there will be a fix since robots.txt is not a required item for a website.


    If you get your question answered, please come back and mark the reply as an answer.  
    If you are helped by an answer to someone else's question, please mark it as helpful.
    Mike Hacker | Blog: http://mphacker.spaces.live.com 

    Saturday, June 26, 2010 2:29 PM
  • If you are running FAST, you can also configure the FAST crawler to ignore robots directives by supplying a custom configuration through XML.

    These are the steps you would need to launch the FAST Enterprise Crawler:

    Modify the crawler config XML lines, for example:

    <attrib name="check_meta_robots" type="boolean"> no </attrib>

    <attrib name="robots" type="boolean"> no </attrib>

    Then start the crawler from a command prompt in %FASTSEARCH%\bin:

    Crawleradmin -f <YOUR XML FILE>

    Sunday, June 27, 2010 8:27 PM
  • Hi Michael

    Thank you for your reply. I spent a lot of time trying to find a solution, but now I know that I didn't make a configuration mistake. I hope Microsoft will fix it as soon as possible. I really don't want to reinstall all of SharePoint 2010 on Windows 2008 instead of Windows 2008 R2.

    Reto

    Tuesday, June 29, 2010 6:50 AM
  • Thanks Natalya, however I am not using FAST Search. This is standard, out-of-the-box SharePoint 2010 running on Windows Server 2008 R2.

    I agree with you, Reto. Rebuilding on Windows Server 2008 would be a big problem for me too. I have 4 server farms I would have to rebuild, which I don't have time to do before my client is supposed to launch their production SharePoint environments.

    I just got another call very early this morning (1am) from another tech support person. I guess he didn't check the time zone before calling. Anyway, it looks like I am being passed to another tech support person at Microsoft to work on this issue. This will be the 4th person I have spoken with regarding the issue. I sure hope someone figures the problem out soon.

    Tuesday, June 29, 2010 11:16 AM
  • I just installed Search Server 2010 Express on a Windows 2008 R2 server and also on a Windows 2008 server. I can confirm Michael's finding that the problem only occurs on the Windows 2008 R2 server. The search crawl on the Windows 2008 server works fine. I hope Microsoft will fix the problem soon.

    Monday, July 05, 2010 12:59 PM
  • Microsoft has said that it appears to be caused by differences in one of the HTTP system files between the Windows 2008 and 2008 R2 operating systems. I am hoping to hear more about this issue later this week.

    If you get your question answered, please come back and mark the reply as an answer.  
    If you are helped by an answer to someone else's question, please mark it as helpful.
    Mike Hacker | Blog: http://mphacker.spaces.live.com 

    Monday, July 05, 2010 1:22 PM
  • Hi Reto,

    Did you try specifying proxy information in the Search Service Application? If not, try adding proxy information to the Search Service Application; I was able to get the crawl working after adding proxy information in the SSA.

    Hope this helps...


    Regards,

    Hiran
    Microsoft Online Community Support
    Tuesday, July 06, 2010 1:03 AM
  • Hi Hiran

    We don't have a proxy server in our network.

    Regards

    Reto

    Tuesday, July 06, 2010 9:06 PM
  • Spoke with Microsoft support again today. The Windows Server team is looking into the issue. It looks like there was a change in Windows Server 2008 R2 for security reasons, and it is impacting SharePoint's ability to crawl anonymous sites. They have reproduced the issue, and I was told they should be able to provide more information later this week.

    If you get your question answered, please come back and mark the reply as an answer.  
    If you are helped by an answer to someone else's question, please mark it as helpful.
    Mike Hacker | Blog: http://mphacker.spaces.live.com 

    Tuesday, July 06, 2010 11:08 PM
  • Reto,

    Do you have control of the anonymous web sites that you are crawling? If so, see if they are running IIS with both Anonymous AND Windows Integrated authentication configured, and try disabling the Windows Integrated authentication on those web sites. By default, when creating a new IIS web site using the IIS wizard, both boxes are checked.

    I found out that I can get crawling to work by disabling the Windows Integrated authentication on the anonymous web sites.

    This is not a fix, just a workaround until a final resolution can be found. I am still in touch with Microsoft support. Any updates I get will be posted.
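
    If it helps, the change on the crawled IIS 7.x server can be made roughly like this (a sketch using the WebAdministration PowerShell module; the site name is a placeholder):

    # Run elevated on the web server being crawled. "My Anonymous Site" is a placeholder name.
    Import-Module WebAdministration
    # Keep anonymous access enabled and turn Windows Integrated authentication off for the site.
    Set-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' -Location 'My Anonymous Site' `
        -Filter 'system.webServer/security/authentication/anonymousAuthentication' -Name enabled -Value $true
    Set-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' -Location 'My Anonymous Site' `
        -Filter 'system.webServer/security/authentication/windowsAuthentication' -Name enabled -Value $false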


    If you get your question answered, please come back and mark the reply as an answer.  
    If you are helped by an answer to someone else's question, please mark it as helpful.
    Mike Hacker | Blog: http://mphacker.spaces.live.com 

    Tuesday, July 13, 2010 3:36 PM
  • Hi Mike

    Yes, your workaround works on anonymous IIS websites, thanks for that. But I don't have access to change things on all of our customers' web servers, so I'm still waiting for a final solution from Microsoft.

    Reto

    Wednesday, July 14, 2010 1:48 PM
  • Hi Marc/ Hiran,

     

    Sorry to change the context of the discussion, however I have a similar issue.

    Problem statement: I want to crawl content on a MOSS 2007 server from a SharePoint 2010 server.

    Actions performed:

    1. Created a new content source in SharePoint 2010.

    2. Added the URLs of the web applications on the MOSS 2007 server box.

    3. Started a full crawl.

    Result:

    It stops crawling after 2 minutes, and the crawl log shows the following warning:

    "The content for this address was excluded by the crawler because this item was marked with a no-index meta-tag. To index this item, remove the meta-tag and recrawl."

    It sounds so simple to remove the meta-tag, however I am not able to find out how to do this.

    NOTE: The account used to perform this activity has full permissions on both servers.

    Please help!

    Regards,

    Ketan

     

     

    Saturday, July 31, 2010 10:36 AM
  • Hello Reto


    As Mike has pointed out, Microsoft is aware of this issue and is currently working on a fix. We cannot comment on what form that fix will take at this point in time. Please monitor the Update Center for Microsoft Office, Office Servers, and Related Products (http://technet.microsoft.com/en-us/office/ee748587.aspx) for the release of SharePoint cumulative updates; their KBs will list when this issue is addressed.


    Regards,

    Hiran
    Microsoft Online Community Support
    Wednesday, August 11, 2010 10:12 PM
  • Could anyone please tell me whether Microsoft has come up with a fix for the issue? I can't seem to find it via the above-mentioned link yet. I have the exact same problem.

    I have installed the Office 2010 cumulative updates from August 2010, without luck.

    I have monitored the network activity while performing the crawl, and it seems that my crawl user is authenticated correctly and the pages are returned with content. But I still get the above-mentioned Access Denied error.

    If I apply a crawl rule with "Crawl SharePoint content as http pages" checked, it returns 2 successes for each sub site (Default.aspx and _vti_bin/spsdisco.aspx); the rest are errors.
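
    For reference, an equivalent crawl rule can be created from the SharePoint 2010 Management Shell roughly like this (a sketch; the URL is a placeholder):

    # Sketch: add an inclusion rule that crawls the SharePoint content as regular HTTP pages.
    $ssa = Get-SPEnterpriseSearchServiceApplication   # assumes a single Search Service Application
    New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa -Path "http://moss2007.example.com/*" `
        -Type InclusionRule -CrawlAsHttp $true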

    Regards,
    Peder

    Friday, October 08, 2010 9:45 AM
  • Hi Peder

    Yes, unfortunately it looks like Microsoft hasn't fixed this error yet. I'm also waiting for a fix.

    Regards

    Reto

     

     

    Tuesday, October 12, 2010 9:30 AM
  • I have the same issue. Any update on it?
    developer
    Saturday, October 23, 2010 1:40 PM
  • Does this issue also apply to the following error:

    Item not crawled due to one of the following reasons: Preventive crawl rule; Specified content source hops/depth exceeded; URL has query string parameter; Required protocol handler not found; Preventive robots directive

    I love how they bundle 10 reasons into a generic error instead of just telling you what the issue is. I am able to crawl these same sites without issues using Search Server 2008.

    I am running Search Server 2010 Express on Windows Server 2008 R2. Thanks.

    Thursday, November 04, 2010 1:45 PM
  • Any updates on this issue? We're facing the same issue for a client and we do not have server access to the MOSS 2007 we're trying to crawl (so we cannot try the suggested workaround). Any suggestions?
    Tuesday, November 23, 2010 1:10 PM
  • Try adding the URL as a federated search location. I do not know whether this requires FAST, but it seems to work for bing.com, Wikipedia, and other external URLs.

    http://download.microsoft.com/download/6/A/8/6A83D203-0369-4B6D-B1F2-21D93996B4D6/SP14EntSearchIT_FdrtdSrch_1.doc

    Tuesday, November 23, 2010 4:20 PM
  • I installed the latest cumulative updates (December 2010) for SharePoint Foundation 2010, but the problem is still there. Any help on this issue would be appreciated!
    Thursday, January 27, 2011 11:59 AM
  • I am having exactly the same problems others report, but I did accidentally discover a clue/workaround. The crawl had been working up to the point where I replaced a temporary URL for the crawled site with its permanent URL. The URL that works is on the same AD domain as both the index server and the target web site, while the non-working URL uses a different domain name. I don't know whether this is a factor or not. I don't believe it is a name-resolution or authentication problem, because I can browse the target site via a browser on the index server machine.
    Tuesday, February 01, 2011 11:08 PM
  • I'm having the same issue with SharePoint 2010 and Windows 2008 R2 after the February 2011 CU.  Microsoft, please fix.
    Thursday, April 14, 2011 9:43 PM
  • I'm having the same problem with MOSS 2007 and SharePoint 2010 environments when they are installed on 2008 R2. No problem crawling internal sites and most external sites, but attempting to crawl one particular external site produces the error above. This site has a robots.txt file. I installed SP1 for R2, but still no luck.
    Friday, April 15, 2011 1:51 PM
  • Having the same issue here too.

    Any inkling of a fix coming?


    Mike Bennett
    Thursday, May 19, 2011 3:41 PM
  • The permission bug was fixed with the latest update.

    The error "Item not crawled due to one of the following reasons: Preventive crawl rule; Specified content source hops/depth exceeded; URL has query string parameter; Required protocol handler not found; Preventive robots directive." has nothing to do with the bug mentioned above.

    SharePoint has no problems crawling a site without a robots.txt. However, if the site has a robots.txt, you must have an allow rule similar to the one below:

    User-Agent: Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)

    Allow: /

     

     

     




    Tuesday, May 31, 2011 8:37 AM
  • Do you know what Update the permission bug was resolved in?

     

    Thanks,

    Monday, June 13, 2011 10:06 PM
  • It works for sure in version 14.0.5130.5002.

    However, there could be other reasons why the crawler won't crawl the site. The error "Item not crawled due to one of the following reasons: Preventive crawl rule; Specified content source hops/depth exceeded; URL has query string parameter; Required protocol handler not found; Preventive robots directive." has nothing to do with the bug mentioned above; it is not a bug.

    Tuesday, June 14, 2011 11:41 AM
  • As Oscar mentioned, we are on the same version, but it is not working. Search returns nothing for the external site. Any ideas?

     

     


    sreenivas
    Wednesday, June 22, 2011 4:46 PM
  • I would recommend that you use Fiddler to debug the crawler. It's a great tool to see what's actually going on. Does the crawler only hit your robots.txt? Do you get a 401, or perhaps a 404?

    For how to use Fiddler as a reverse proxy, follow the link below:

    http://www.fiddler2.com/fiddler/help/reverseproxy.asp

     

    Wednesday, June 22, 2011 5:01 PM
  • Hi,

    I am using Windows 2008 R2 and Search Server Express 2010.

     

    My issue is the same as the one at the beginning of this thread.

    When I try to crawl an external web site with Search Server, I receive an Access Denied error.

    I used Fiddler to track the problem, but Search Server doesn't even generate requests to crawl the website.

    Does anyone have any idea how to solve this problem?


    Friday, July 15, 2011 6:29 PM
  • The search service will cache the robots.txt. Restarting the search service will force the robots.txt to be reloaded.
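
    If it helps, the restart can be done roughly like this (a sketch; the OSearch14 service name assumes a standard SharePoint Server 2010 / Search Server 2010 install):

    # Restart the SharePoint Server Search 14 service so the cached robots.txt is re-fetched on the next crawl.
    Restart-Service -Name OSearch14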
    Monday, July 25, 2011 10:07 AM
  • Hi Jie Li,

     

    We have a problem similar to this.

    We configured a SharePoint 2010 web application behind a firewall, and search results do not come back when we try to access the application through the firewall.

    So we tried to configure a content source with the IP address (instead of localhost / the server name) and planned to create a scope to crawl within this content source (I am able to create neither the content source nor the scopes). The crawl failed with errors, apparently because of the numbers in the IP address.

    So, how can we configure a content source with an IP address and create a scope to crawl within this content source?

     

    Regards,

    PVSAVSG.

    Friday, August 12, 2011 7:27 AM
  • We have similar issues with SharePoint crawlers stopping prematurely on web sites. We have Windows Server 2008 R2 as the server OS.

    Has anyone found a way to crawl phpBB forum sites successfully?

    Another site is a JSP application website which can be crawled successfully with the FAST ESP 5.1 crawler (the old search product, which we are now replacing) but fails to be crawled by the SharePoint 2010 crawler.

    So maybe it would be good to provide some pointers on troubleshooting the SharePoint crawlers?

    Thanks

    Carry

     


    kind regards, Carry Megens
    Wednesday, October 19, 2011 2:17 PM
  • Ignoring robots.txt is not good advice.

    robots.txt files are there for a reason.

    Sending out crawlers that do not adhere to these rules can break the websites hit by those crawlers.


    kind regards, Carry Megens
    Wednesday, October 19, 2011 2:19 PM
  • We have this in our robots.txt:

    User-agent: *

    So it should be OK for the SharePoint crawler, yet we still see the crawler stop.

     


    kind regards, Carry Megens
    Wednesday, October 19, 2011 2:30 PM
  • Hi

    I am also not convinced this is fixed and have similar issues to Carry.

    We have a fully patched SP2010 server running Windows 2008 R2 (this seems to be the key thing: Windows 2008 R2).

    We can crawl SharePoint content with no problems. However, when we go to crawl another internal, non-SharePoint website (anonymous access allowed), the crawler seems to crawl only a certain number of results before you get the access denied errors in the crawl log; weirdly, if you attempt to crawl the source again it won't go any further, it just says access denied.

    I initially put this down as a network issue, i.e. some device on our network thinking we were doing some form of denial of service and preventing our sources from being crawled. However, after I stop and restart the search service, the system again tries to crawl before bombing out halfway through, i.e. it indexes 2,000 out of the 30,000 expected items.

    So in summary:

    Our live SP2010 environment on Windows 2008 R2 and our pre-production SP2010 environment on 2008 R2 both have this issue (virtualised).

    Our dev boxes on Windows 2008 (non-R2) do not have this issue, although they are not in the same physical location as production, but are on the same domain.

    DisableLoopbackCheck is set to 1 in the registry.

    There is no proxy specified in the search admin.

    There is no firewall, as it is an internal site.

    There are no alternate access mappings for the site we are crawling.

    I increased the timeout from 60 secs to 120 secs just to see if this was an issue (no effect).

    We even added a robots.txt to allow access on this website, as that was mentioned as a possible workaround; no effect.

    The website we are crawling is nothing complicated: no security, open access, standard HTML (Apache based).

    We do have some crawl rules, but we have copied these from dev to live, so there should be no difference, and dev finds many more results. For reference, I have been checking those settings roughly as in the sketch below.
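
    (A quick PowerShell sketch of how I'm verifying the values from the SharePoint 2010 Management Shell; the cmdlets and the registry path are the standard ones, and this is just a check, not a fix:)

    # Loopback check value (should be 1 as described above).
    Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa' -Name DisableLoopbackCheck

    # Crawler-level timeouts on the search service (increased from 60 to 120 seconds above).
    Get-SPEnterpriseSearchService | Select-Object ConnectionTimeout, AcknowledgementTimeout

    # Crawl rules on the Search Service Application, to compare dev against live.
    $ssa = Get-SPEnterpriseSearchServiceApplication   # assumes a single Search Service Application
    Get-SPEnterpriseSearchCrawlRule -SearchApplication $ssa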

    So my questions are:

    1) Does anyone have anything else I can try? I was thinking of using Fiddler 2 as some form of proxy, though I'm not convinced it will highlight the problem.

    2) I really think it is Windows 2008 R2 related!

    3) If this was fixed via some cumulative update, as the message thread implies, can anyone say what the fix was and which CU it was definitely fixed in? I would guess it is some registry tweak rather than a binary update, in which case it would be good to trial it and see whether it fixes this, as the fix may have been overwritten or lost with later changes. Our environment is fully patched (14.0.6109.5002) but we still have this issue.

     Thanks

    Brad

    Monday, October 24, 2011 12:55 PM
  • Brad, did you remember to restart the search service after adding the robots.txt?
    Tuesday, December 13, 2011 3:30 PM
  • Any updates on a fix for this?  
    Friday, June 01, 2012 4:23 PM
  • Did you try enabling directory browsing for the site that you want to crawl? I had the same "Access is denied.." message when trying to crawl PDFs on another web server. I enabled directory browsing on that website, set up a content source (web sites) in SharePoint, and then did a full crawl of the web server. All of the searched PDF content was returned.
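
    For reference, directory browsing can be enabled roughly like this on the crawled IIS server (a sketch using the WebAdministration PowerShell module; the site name is a placeholder):

    # Run elevated on the web server hosting the PDFs. "Docs Site" is a placeholder name.
    Import-Module WebAdministration
    Set-WebConfigurationProperty -PSPath 'IIS:\Sites\Docs Site' `
        -Filter 'system.webServer/directoryBrowse' -Name enabled -Value $true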



    Wednesday, October 24, 2012 8:53 PM
  • Hi,

    I ran into this same issue; the problem was that the search account did not have access to the search directory on the C drive. After adding the search account to the WSS_ADMIN_WPG group, search worked correctly.

    Regards,

    Kaine

    Friday, February 15, 2013 2:58 AM
  • This particular problem seems to be related to SharePoint 2010 running on Windows Server 2008 R2.


    There are a couple of options to resolve the issue:
    1. Turn off Integrated Windows authentication on the site that you are trying to crawl.
    2. Install the August (or later) cumulative update for SharePoint 2010.

    If you still continue to experience issues after installing the cumulative update, it is recommended that you contact Microsoft Support and open a case to track and resolve your specific issue.

    Wednesday, May 15, 2013 9:38 PM