How to crawl complex URLs? (URLs with querystrings)

Question
-
Hi, everyone!
I have problems crawling complex URLs using MOSS 2007. I have added the site as a SharePoint content source and have added a crawl rule like this:
-
http://myserver.com/* - Include - Crawl complex URLs
This doesn't work. The crawler doesn't follow complex URLs like http://myserver.com/site1/Office.aspx?ID=1. I have checked the crawl logs, and the complex URLs don't show up there at all, not even as excluded items.
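To make "complex URL" concrete: per the thread title, it means a URL with a query string. A minimal Python sketch for spotting such URLs in a list of links follows; the helper name `is_complex_url` is made up for illustration and is not part of MOSS.

```python
from urllib.parse import urlsplit


def is_complex_url(url: str) -> bool:
    """Return True when the URL carries a non-empty query string.

    Hypothetical helper for illustration only; it just mirrors the
    "URLs with querystrings" definition used in this thread.
    """
    return urlsplit(url).query != ""


if __name__ == "__main__":
    for link in (
        "http://myserver.com/site1/Office.aspx?ID=1",
        "http://myserver.com/site1/Office.aspx",
    ):
        print(link, "->", "complex" if is_complex_url(link) else "simple")
```

Running it over the links on a page shows which ones depend on the "Crawl complex URLs" option.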
If I edit the crawl rule and check the box for "Crawl SharePoint content as HTTP pages", the crawler successfully crawls the pages.
Why does this happen? I need to crawl the data as SharePoint-content, not as HTTP-pages. I have tried to solve this for days now, and I'm quite frustrated. I hope someone can help ;-)
Regards,
Erik
Wednesday, February 6, 2008 10:47 PM -
Answers
-
I found a partial answer:
It's not possible to crawl complex URLs when crawling the site as SharePoint content, even if it looks like it in the user interface. I found this information as a note in a document called "Administering Enterprise Search in Office SharePoint Server" (http://go.microsoft.com/fwlink/?LinkId=100254 - page 58).
Regarding complex URLs: "This option has no effect when crawling SharePoint sites, because Office SharePoint Server 2007 enumerates all content when crawling SharePoint sites."
Now I have to crawl the content as HTTP pages. My next problem is how to crawl only part of the server as HTTP content. I want to include http://myserver.com/siteA and everything beneath it, but not the rest of the server. The problem is that siteA links to the root of the server, and the crawler follows these URLs. I can't see how to control this with crawl rules or with settings such as page depth and server hops.
The problem would have been solved if I could crawl the entire site as HTTP pages, but this is not an option for me :/
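One way to reason about the scoping is an include rule for siteA listed above an exclude rule for the rest of the server. The Python sketch below only simulates that idea. It assumes rules are evaluated top to bottom, that the first matching rule decides, and that unmatched URLs are crawled; verify all three against the MOSS 2007 documentation. The names `RULES` and `decide` are hypothetical and not a SharePoint API.

```python
from fnmatch import fnmatchcase

# Hypothetical ordered rule list: the more specific include comes first.
RULES = [
    ("http://myserver.com/siteA/*", "include"),
    ("http://myserver.com/*", "exclude"),
]


def decide(url, rules=RULES, default="include"):
    """Return the action of the first rule whose wildcard pattern matches.

    Assumption: first match wins and matching is case-insensitive.
    URLs that match no rule fall back to `default`.
    """
    lowered = url.lower()
    for pattern, action in rules:
        if fnmatchcase(lowered, pattern.lower()):
            return action
    return default


if __name__ == "__main__":
    for link in (
        "http://myserver.com/siteA/Office.aspx?ID=1",
        "http://myserver.com/",
        "http://myserver.com/siteB/default.aspx",
    ):
        print(link, "->", decide(link))
```

In this simulation, links from siteA back to the server root hit the exclude rule, while everything under http://myserver.com/siteA/ stays in scope. Note that the include pattern ends in `/*`, so the bare address http://myserver.com/siteA without a trailing slash would fall through to the exclude rule.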
Tuesday, February 12, 2008 9:12 AM
All replies
-
Hello,
I am having the same problem. Same crawl rule and it only works when "Crawl SharePoint content as Http pages" is selected.
Any ideas?
Thursday, February 7, 2008 3:35 PM -
I couldn't find a solution to this problem, so I ended up crawling the content as HTTP content.
Tuesday, March 4, 2008 12:54 PM
-
You can exclude any URLs you don't need from the Crawl Rules page.
Tarek El-Mallah
Wednesday, June 8, 2011 4:04 PM