Implementing robots.txt in every site

  • Question

  • Hi

    Could you please let me know how to implement robots.txt for every site (in MOSS 2007) and how to test that it works?

     

    Thanks


    • Edited by NabX Monday, April 18, 2011 6:24 AM change
    • Moved by Mike Walsh FIN Monday, April 18, 2011 8:48 AM admin q (From:SharePoint - General Question and Answers and Discussion (pre-SharePoint 2010))
    Monday, April 18, 2011 6:23 AM

Answers

  • Nab, 

    Just copied the text from the article described by Sandeep. 

    How to use the Robots.txt file and HTML tags to prevent access to content on the portal site

    You can use a Robots.txt file to control where robots (Web crawlers) can go on a Web site. You can also use the Robots.txt file to indicate whether to exclude specific crawlers. Web servers use these rules to control access to Web sites by preventing robots from accessing certain areas. SharePoint Portal Server 2003 and SharePoint Server 2007 look for this file when they crawl, and they obey the restrictions that are contained in the Robots.txt file.

    You can prevent another server from crawling content on the portal site by modifying the Robots.txt file. For example, you might want to restrict a specific robot from accessing the server because the frequency of requests from the robot is blocking the Web site. You may also want to restrict all robots from certain areas on the server.
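As a rough illustration of the two cases above (shutting out one specific robot, and keeping all robots out of certain areas), here is a minimal sketch that uses Python's standard `urllib.robotparser` to read a sample Robots.txt. The robot names `BadBot` and `OtherBot` and the paths `/_layouts/` and `/private/` are made up for the example and do not come from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: BadBot is shut out entirely, every other
# robot is only kept out of two example areas.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /_layouts/
Disallow: /private/
"""


def load_rules(text):
    """Parse Robots.txt content that is already held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch("BadBot", "/default.aspx"))           # False
print(rules.can_fetch("OtherBot", "/default.aspx"))         # True
print(rules.can_fetch("OtherBot", "/private/report.aspx"))  # False
```

Note that these rules only affect robots that choose to honor Robots.txt.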

    SharePoint Portal Server 2003 and SharePoint Server 2007 do not install a Robots.txt file. However, you can create a Robots.txt file and put the Robots.txt file in the home directory of the default Web site on the server. To determine the home directory of the default Web site on the server, follow these steps:
    1. Start Internet Information Services (IIS) Manager.
    2. Expand <var style="box-sizing: border-box;">server name</var>, and then expand Web Sites.
    3. Right-click Default Web Site, and then click Properties.
    4. Click the Home Directory tab.
    5. Make a note of the path that appears in the Local Path box, and then click Cancel.

    Put the Robots.txt file in the path that appears in the Local Path box. For example, if the path is D:\Inetpub\Wwwroot, put the Robots.txt file in the D:\Inetpub\Wwwroot folder on the server. To confirm that the Robots.txt file is in the correct folder on the server, start your Web browser, and then type http://<var style="box-sizing: border-box;">server name</var>/robots.txt.
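The browser check described above can also be scripted. The following is only a sketch using the Python standard library. The helper names `robots_url` and `fetch_robots` are made up for this example, and `http://servername` in the usage comment is a placeholder for your own server.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def robots_url(site_root):
    """Return the address where crawlers look for Robots.txt on this host."""
    # A leading slash makes urljoin discard any subsite path in site_root,
    # because the file is expected at the root of the host.
    return urljoin(site_root, "/robots.txt")


def fetch_robots(site_root, timeout=10):
    """Download Robots.txt and return (HTTP status, body text).

    urlopen raises urllib.error.HTTPError if the server answers with an
    error such as 404, which would mean the file is not in place.
    """
    with urlopen(robots_url(site_root), timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


# Example usage (placeholder host name, replace with your own):
#   status, body = fetch_robots("http://servername")
#   print(status)
#   print(body)
```

A status of 200 together with the rules you wrote suggests the file is in the right folder.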
    You can restrict access to certain documents by using HTML META tags. HTML META tags tell the robot whether a document can be included in the index and whether the robot can follow the links in the document by using the INDEX/NOINDEX attribute and the FOLLOW/NOFOLLOW attributes in the tag. For example, you can mark a document with the following if you do not want the document crawled and you do not want links in the document followed:
    <META name="robots" content= "NOINDEX, NOFOLLOW">
    SharePoint Portal Server 2003 and SharePoint Server 2007 automatically obey the restrictions that are contained in the Robots.txt file.
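To check which robots directives a given page actually carries in its META tag, something like the following could be used. It is a minimal sketch built on Python's standard `html.parser`. The class name `RobotsMetaParser`, the helper `robots_directives`, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <META name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for item in attributes.get("content", "").split(","):
            if item.strip():
                self.directives.add(item.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that uses the tag from the article.
page = (
    '<html><head><META name="robots" content="NOINDEX, NOFOLLOW">'
    "</head><body>Draft</body></html>"
)
directives = robots_directives(page)
print(sorted(directives))  # ['nofollow', 'noindex']
```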


    V
    • Marked as answer by Emir Liu Tuesday, April 26, 2011 2:36 AM
    Monday, April 18, 2011 4:56 PM

All replies

  • Put only one robots.txt file at the root of the web application; it is not necessary on every subsite. Disallow sites and pages in subsites directly from this file.

    Use these links for reference:

    http://support.microsoft.com/kb/837847

    http://www.robotstxt.org/robotstxt.html
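The single-file approach described above could be sketched like this: one robots.txt at the root that lists the subsite paths to exclude. The helper `build_robots_txt` and the paths `/sites/hr/` and `/sites/finance/Pages/draft.aspx` are hypothetical and only illustrate the idea.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Build the text of one root-level robots.txt for the given paths."""
    lines = [f"User-agent: {user_agent}"]
    for path in disallowed_paths:
        # Paths in robots.txt are relative to the root of the host.
        if not path.startswith("/"):
            path = "/" + path
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


# Hypothetical subsite and page to keep out of the crawl.
robots_txt = build_robots_txt(
    ["/sites/hr/", "sites/finance/Pages/draft.aspx"]
)
print(robots_txt)
```

The resulting text would be saved as robots.txt in the root folder of the web application.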


    w: http://www.worldofsharepoint.com | t: @sharesandip
    Monday, April 18, 2011 7:52 AM
  • Could you please elaborate on the steps?

    Where exactly do I need to put the robots.txt: inside the IIS folder or inside the content database?

    robots.txt in the application does have the option to selectively enable/disable crawling; I know that we can put the text inside the robots.txt.

    Can we give content managers an option to enable/disable crawling (access robots.txt) from the SharePoint page, so that they could enable or disable crawling for the current working page?

     

    Also, the first link seems to be more about TIFF files than about robots.txt.

     

    Many Thanks...

    Monday, April 18, 2011 4:34 PM
  • A couple of questions

    2. Expand <var style="box-sizing: border-box;">server name</var>, and then expand Web Sites.

    I did not understand this step. After opening inetmgr, where do I need to put it?

     

     

    Also, in the article, "To confirm that the Robots.txt file is in the correct folder on the server, start your Web browser, and then type http://<var style="box-sizing: border-box;">server name</var>/robots.txt. "

    So if the server name is example.com,

    then should I paste exactly

    http://<var style="box-sizing: border-box;">example.com</var>/robots.txt

    in the address bar?

    Could you please let me know what is to be done?

    Tuesday, April 19, 2011 4:04 AM
  • Nab, 

    Ignore the HTML tags. They were added as a result of copy/paste (a typical problem).

    Hence it should read this way.

    The same mistake appears twice here.

    2. Expand server name, and then expand Web Sites.

    To confirm that the Robots.txt file is in the correct folder on the server, start your Web browser, and then type http://server name/robots.txt.

     

    Let me know if it is still not clear.

     

    Thanks

    V



     


    V
    • Edited by V284 Tuesday, April 19, 2011 2:45 PM html tags
    Tuesday, April 19, 2011 2:43 PM
  • One last question:
    Could you please let me know, after putting the robots.txt file in place and configuring the directories to be excluded from crawling, how am I going to know that it is actually working?

    Do we have some kind of tool to test it?
    Wednesday, April 20, 2011 2:44 PM
  • This article should give you a good idea of the tools that exist.

    http://www.google.com/support/webmasters/bin/answer.py?answer=156449
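Besides an online tool, the rules can be checked locally before the file is deployed. Below is a minimal sketch that compares a robots.txt against a list of paths you expect to be allowed or blocked, using Python's standard `urllib.robotparser`. The function `find_mismatches`, the robot name `ExampleBot`, and the sample paths are assumptions for illustration. It only tests the rules themselves, not whether a particular crawler honors them.

```python
from urllib.robotparser import RobotFileParser


def find_mismatches(robots_txt, expectations, user_agent="ExampleBot"):
    """Return (path, expected, actual) for every path whose result differs.

    expectations maps a path to True (should be crawlable) or
    False (should be blocked).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = []
    for path, should_be_allowed in expectations.items():
        allowed = parser.can_fetch(user_agent, path)
        if allowed != should_be_allowed:
            mismatches.append((path, should_be_allowed, allowed))
    return mismatches


# Hypothetical rules and expectations.
sample_rules = "User-agent: *\nDisallow: /sites/hr/\n"
expected = {
    "/sites/hr/Pages/default.aspx": False,  # should be blocked
    "/Pages/default.aspx": True,            # should stay crawlable
}
problems = find_mismatches(sample_rules, expected)
print(problems)  # [] means the rules behave as expected
```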

     

    Thanks

    V


    V
    Wednesday, April 20, 2011 2:53 PM