Web Crawler and http vs https

  • Question

  • Our current website search runs on SharePoint and FAST. A basic web crawl indexes everything under our domain name. Some pages are being indexed under both http and https. FAST flags them as duplicates but always uses the https link as the result. The problem is that some of these pages contain embedded video clips served over http. Our current Internet Explorer Group Policy does not allow mixed content, so the videos do not show up.

    I've tried to remove the https URLs from the search results, but am not having any luck.


    Jeff Scroggin

    Tuesday, February 14, 2012 9:32 PM

All replies

  • Hi,

    Have you tried to add a crawl rule to exclude crawling of https links? (Assuming you are using the SharePoint web crawler)

    Regards,
    Mikael Svenson


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/

    Monday, February 20, 2012 8:39 PM
  • Jeff,

    If you are using FAST web crawler, is it possible for you to simply exclude all HTTPS links from getting crawled or do you need some of them? 

    Just to make sure I understand the use-case.  Some of the links you crawl have both HTTP and HTTPS prefixes, but they are the exact same pages as far as the content?  If that is the case, FAST web crawler should consider one of the types as a DUP and not let both be in the index and searchable.  Do you see something like 200 DUPLICATE call in the crawler fetch log for one of these types?  All the crawler is doing is validating whether the checksum for the crawled document matches what's already in the crawler store...if it does, it's considered to be a DUP and it should not appear in the index as well. 

    The checksum is calculated with MD5, so make sure you get the same signature when you compare both files with the "md5sum" utility. Is it possible that there are some slight differences, maybe in metadata, and the pages are not truly identical?
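
    As a rough illustration of the comparison described above, the following Python sketch hashes two response bodies with MD5 and checks whether the digests match. The sample bodies and the host name in them are made up for the example, and the crawler may compute its checksum over processed content rather than raw bytes, so treat this only as a way to spot obvious differences between the http and https versions of a page.

```python
import hashlib


def md5_hex(body: bytes) -> str:
    """Return the MD5 hex digest of a response body."""
    return hashlib.md5(body).hexdigest()


def same_checksum(http_body: bytes, https_body: bytes) -> bool:
    """True when both bodies produce the same MD5 digest."""
    return md5_hex(http_body) == md5_hex(https_body)


# Hypothetical sample bodies: the second differs only in one embedded link.
http_body = b'<html><video src="http://myserver/clip.mp4"></video></html>'
https_body = b'<html><video src="https://myserver/clip.mp4"></video></html>'

print(same_checksum(http_body, http_body))   # True
print(same_checksum(http_body, https_body))  # False
```

    Even a one-character difference, such as a scheme rewritten inside the page, is enough to change the digest.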


    Igor Veytskin

    Monday, February 20, 2012 9:12 PM
    Moderator
  • Hi Igor,

    I cannot simply exclude all https as we do have pages served over https that we need in the index.

    FAST is considering these pages duplicates, and was using the https page as the first link in the results. There's a duplicate link in the results that, if clicked, shows both the http and https results. I tried putting in crawl rules excluding the https results, but that excluded the page from the index altogether. What I wound up doing is promoting the http links in site promotion, and that seems to do the trick. There are still duplicates listed, but the http address is now the one shown on the results page, not the https address.


    Jeff Scroggin


    Monday, February 20, 2012 9:21 PM
  • Hi Jeff,

    It would help if you could state which web crawler you are using. As Igor says, for the Enterprise Web Crawler the MD5 check should take care of the problem, and with the SharePoint Web Crawler, using crawl rules to block https://thatsite should also work.

    Regards,
    Mikael Svenson


    Tuesday, February 21, 2012 7:42 AM
  • Hi Jeff.

    I have another proposal.

    Read this blog entry:

    http://searchunleashed.wordpress.com/2011/12/08/how-remove-duplicate-results-works-in-fast-search-for-sharepoint/

    You can define which properties the duplicate checksum is calculated on.

    Maybe you should try using the url/urls managed property for this calculation. It might solve the issue, but will probably make the same result appear twice, once for http and once for https.
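
    To make the trade-off concrete, here is a small Python sketch of grouping results by a checksum over a chosen set of properties. It is not how FAST Search for SharePoint implements duplicate removal; the property names, the sample items, and the helper functions are all hypothetical. It only shows why keying on content collapses the http and https copies into one group, while keying on the URL keeps them apart.

```python
import hashlib
from collections import defaultdict


def duplicate_key(item, properties):
    """Checksum over the chosen properties of one result item."""
    joined = "\x1f".join(str(item.get(name, "")) for name in properties)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def group_duplicates(items, properties):
    """Group items whose chosen properties hash to the same key."""
    groups = defaultdict(list)
    for item in items:
        groups[duplicate_key(item, properties)].append(item)
    return list(groups.values())


# Hypothetical results: same page content, reachable under both schemes.
results = [
    {"url": "http://myserver/video.aspx", "title": "Video", "body": "same text"},
    {"url": "https://myserver/video.aspx", "title": "Video", "body": "same text"},
]

by_content = group_duplicates(results, ["title", "body"])
by_url = group_duplicates(results, ["url"])

print(len(by_content))  # 1 group: the two copies count as duplicates
print(len(by_url))      # 2 groups: each URL is its own result
```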

    Amir

    Tuesday, February 21, 2012 11:10 AM
  • I'm very new to FAST, but I believe I'm using the SharePoint crawler. I've basically installed FAST and set up a search center in SharePoint. I then set up content sources and crawls using the GUI in SharePoint.

    Jeff Scroggin

    Tuesday, February 21, 2012 2:17 PM
  • Hi Jeff,

    We all have to start somewhere :)

    Take a look at http://technet.microsoft.com/en-us/library/ff473168.aspx which describes how to add crawl rules. Basically add a rule which will exclude https addresses for the server in question. You would have to do a full crawl afterwards in order for the https results to be removed.

    Regards,
    Mikael Svenson


    Tuesday, February 21, 2012 2:31 PM
  • Ha! We actually did this, and it excluded both http and https. I'll give it another shot though.

    Does the SharePoint crawler work like a traditional web crawler, meaning it's not going to crawl a URL unless it finds the link on a page? Our web developers swear to me that these particular pages should have no references to https anywhere in the site.
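
    One way to check the developers' claim is to scan the page markup for links that resolve to https. The Python sketch below does this with the standard library only. The sample HTML, the base URL, and the choice of attributes to inspect (href, src, action) are assumptions for illustration, and it would not catch links generated by scripts or https hops caused by server redirects.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute URLs from href, src and action attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src", "action") and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def https_links(html, base_url):
    """Return every collected link whose scheme is https."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links if urlsplit(url).scheme == "https"]


# Hypothetical page fragment fetched from http://myserver/
sample_html = (
    '<a href="/a.aspx">a</a>'
    '<a href="https://myserver/b.aspx">b</a>'
    '<form action="//myserver/login"></form>'
)

print(https_links(sample_html, "http://myserver/"))
```

    Running this over the pages that show up twice in the index would list any absolute https links a crawler could follow.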


    Jeff Scroggin

    Tuesday, February 21, 2012 2:43 PM
  • Hi,

    The SharePoint crawler does not work like the regular web crawler as it looks at lists and libraries when indexing files. And to be sure... you are using the SharePoint crawler, not the web crawler?

    Regards,
    Mikael Svenson


    Tuesday, February 21, 2012 3:05 PM
  • That would be the web crawler, sorry. My understanding is, if you're configuring this stuff through the SharePoint GUI, you're using the crawlers available in SharePoint and not the FAST crawlers which is what I meant.

    Jeff Scroggin

    Tuesday, February 21, 2012 3:58 PM
  • Just to make sure there is no confusion about terminology:

    FAST Connector (via FAST Content SSA). It essentially uses the SharePoint crawler/gatherer:

    http://technet.microsoft.com/en-us/library/ff384288.aspx

    FAST Web crawler (a remnant of the legacy FAST ESP product). It has no GUI and doesn't use the SharePoint crawler in the background; it's a pure web spider:

    http://technet.microsoft.com/en-us/library/ff381266.aspx


    Igor Veytskin


    Tuesday, February 21, 2012 4:00 PM
    Moderator
  • Jeff,

    If you are configuring things through the SharePoint GUI and creating a FAST Content SSA, you are using a FAST Connector, which uses the SharePoint crawler/gatherer, with the slight difference that a FAST plugin redirects batches to the FAST Content Distributor.

    I know this terminology can be confusing, but by "FAST web crawler" we mean the crawler in the second TechNet article, which is configured manually and without a GUI.


    Igor Veytskin

    Tuesday, February 21, 2012 4:07 PM
    Moderator
  • Ok,

    So the scenario is that you have set up a web crawl via the FAST Content SSA as a web source. Your starting URL is something like:

    http://myserver

    When crawling this you also get hits with https://, and they are then shown as duplicates in the search results. This clearly indicates that the item has been crawled twice, once for http and once for https.
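
    If a list of indexed URLs can be exported, the pages that were crawled under both schemes can be picked out with a short script. This Python sketch is only an illustration: the input list is invented, and it compares host, path and query exactly, so a URL with an explicit port such as :443 would count as a different host.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def indexed_under_both_schemes(urls):
    """Return (host, path, query) keys that appear with both http and https."""
    schemes_by_page = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        key = (parts.netloc.lower(), parts.path, parts.query)
        schemes_by_page[key].add(parts.scheme.lower())
    return sorted(
        key for key, schemes in schemes_by_page.items()
        if {"http", "https"} <= schemes
    )


# Hypothetical export of indexed URLs.
indexed = [
    "http://myserver/a.aspx",
    "https://myserver/a.aspx",
    "http://myserver/b.aspx",
]

print(indexed_under_both_schemes(indexed))  # [('myserver', '/a.aspx', '')]
```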

    Figuring out how it managed to venture over to https is not easy, as there is no log saying which links were found on which page. In theory the crawl rule to exclude everything on https://server/* should work. How did you write the rule? Also, do you have any server name mappings in place? (http://technet.microsoft.com/en-us/library/cc164184.aspx)
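
    How the rule is written matters because a wildcard that also covers the scheme would match both versions of a page. The Python sketch below uses the standard fnmatch module as a stand-in for wildcard matching; it is not SharePoint's crawl rule engine, and the patterns, URLs and helper name are assumptions. It shows one possible reason an exclusion could drop the http page as well, not a diagnosis of this farm.

```python
from fnmatch import fnmatchcase


def is_excluded(url, exclusion_patterns):
    """True when any wildcard exclusion pattern matches the URL."""
    lowered = url.lower()
    return any(fnmatchcase(lowered, p.lower()) for p in exclusion_patterns)


urls = ["http://myserver/video.aspx", "https://myserver/video.aspx"]

narrow = ["https://myserver/*"]  # scheme spelled out in the pattern
broad = ["*://myserver/*"]       # wildcard also covers the scheme

for url in urls:
    print(url, is_excluded(url, narrow), is_excluded(url, broad))
```

    With the narrow pattern only the https URL is excluded; with the broad one both are. Note that fnmatch also treats ? and [ as wildcards, so URLs with query strings would need more careful handling than this sketch gives them.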

    Regards,
    Mikael Svenson


    Tuesday, February 21, 2012 7:49 PM
  • I appreciate the replies everyone. For now, we're just going to roll with using site promotion, and promoting the http results for the pages with videos. That seems to do the trick for now. I just got done building a new FAST dev farm this afternoon, so I'll try some more of the crawl rules to tweak my results.

    Jeff Scroggin

    Tuesday, February 21, 2012 10:03 PM