How to crawl complex URLs? (URLs with querystrings)

  • Question

  • Hi, everyone!

     

    I have problems crawling complex URLs using MOSS 2007. I have added the site as a SharePoint content source and have added a crawl rule like this:

    This doesn't work. The crawler doesn't follow complex URLs like http://myserver.com/site1/Office.aspx?ID=1. I have checked the crawl logs, and the crawler never detects the complex URLs: the logs contain no entries for them at all, not even a message about exclusion.

     

    If I edit the crawl rule and check the box for "Crawl SharePoint content as Http pages", the crawler successfully crawls the pages.

     

    Why does this happen? I need to crawl the data as SharePoint-content, not as HTTP-pages. I have tried to solve this for days now, and I'm quite frustrated. I hope someone can help ;-)

     

    Regards,

    Erik

     

     

    Wednesday, February 6, 2008 10:47 PM
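
For readers unsure what counts as a "complex URL" here: it is a URL that carries a query string (the part after the `?`). The following is only an illustrative Python sketch, not part of MOSS 2007; the helper name `is_complex_url` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def is_complex_url(url: str) -> bool:
    """Return True when the URL has a non-empty query string."""
    return bool(urlsplit(url).query)

# The example URL from the question above.
url = "http://myserver.com/site1/Office.aspx?ID=1"

print(is_complex_url(url))            # does the URL carry a query string?
print(parse_qs(urlsplit(url).query))  # the parsed query parameters
```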

Answers

  • I found a partial answer:

    It's not possible to crawl complex URLs when crawling the site as SharePoint content, even if it looks like it in the user interface. I found this information as a note in a document called "Administering Enterprise Search in Office SharePoint Server" (http://go.microsoft.com/fwlink/?LinkId=100254 - page 58).

     

    Regarding complex URLs: "This option has no effect when crawling SharePoint sites, because Office SharePoint Server 2007 enumerates all content when crawling SharePoint sites."

     

    Now I have to crawl the content as HTTP pages. My next problem is how to crawl only part of the server as HTTP content. I want to include http://myserver.com/siteA and everything beneath it, but not the rest of the server. The problem is that siteA links to the root of the server, and the crawler follows these URLs. I can't see how to control this with crawl settings such as page depth and server hops.

     

    The problem would have been solved if I could crawl the entire site as HTTP-pages, but this is not a solution for me :/

     

    Tuesday, February 12, 2008 9:12 AM

All replies

  • Hello,

     

    I am having the same problem. Same crawl rule and it only works when "Crawl SharePoint content as Http pages" is selected.

     

    Any ideas?

    Thursday, February 7, 2008 3:35 PM
  • I couldn't find a solution to this problem, so I ended up crawling the content as HTTP content.

     

    Tuesday, March 4, 2008 12:54 PM
  • You can exclude any URLs you don't need from the crawl rules page.
    Tarek El-Mallah
    Wednesday, June 8, 2011 4:04 PM
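
The include/exclude idea suggested above can be reasoned about with a small simulation. This is a hedged Python sketch, not SharePoint's API: it assumes rules are evaluated top to bottom, that the first matching rule decides, and that `*` is the only wildcard. The rule list, the `should_crawl` helper, and the default of "do not crawl" when nothing matches are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical ordered rule list: (pattern, include?). The first match wins,
# so the narrow include rules must come before the broad exclude rule.
RULES = [
    ("http://myserver.com/sitea", True),     # the start address itself
    ("http://myserver.com/sitea/*", True),   # everything beneath siteA
    ("http://myserver.com/*", False),        # the rest of the server
]

def should_crawl(url, rules=RULES):
    """Return the include flag of the first rule whose pattern matches."""
    candidate = url.lower()  # compare case-insensitively
    for pattern, include in rules:
        if fnmatchcase(candidate, pattern):
            return include
    return False  # assumption: URLs that match no rule are not crawled

for url in [
    "http://myserver.com/siteA",
    "http://myserver.com/siteA/Office.aspx?ID=1",
    "http://myserver.com/default.aspx",
]:
    print(url, should_crawl(url))
```

With this ordering, links from siteA back to the server root fall through to the exclude rule, while query-string URLs beneath siteA still match the include rule. Swapping the order would exclude everything, which is why the ordering is worth checking in the real rule list.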