Is it possible to know which page URL contains the images in the crawling process?

  • Question

  • Hi All,

    When we perform a crawl and the crawler detects an item of image type, is it possible to tell from any of its properties which page contains that image?

    For example, there is a page at http://contoso/a.html,
    and this page contains the image "http://contoso/image1.jpg".

    During the crawl, I would like to know that "http://contoso/image1.jpg" is in "http://contoso/a.html", or was referred from "http://contoso/a.html". Is that possible?

    Best Regards,
    Andy
    Friday, February 24, 2012 5:10 AM

All replies

  • Hi Andy,

    I think it's possible only if you write the URLs to a separate location, such as a list or a database.

    It's not possible to transfer data, metadata, or other information directly between two items/documents, but you can achieve the same result by writing the data to a database and reading it back from the database during crawling, using pipeline extensibility.

    I can't assume that the HTML page will be crawled before the image (that order is logical but not proven), so you'll also have to crawl twice: once to write the URLs to the database, and a second time to read the URLs back from the database.
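
    To illustrate the two-pass idea, here is a minimal Python sketch. It is not FAST pipeline code: SQLite stands in for "a database", and the table name image_refs and the functions open_store, record_page and pages_for_image are hypothetical names made up for this example. In a real deployment the second-pass lookup would presumably be called from a pipeline extensibility stage.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the absolute URLs of all <img src="..."> tags in one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative image paths against the page URL.
                self.images.append(urljoin(self.page_url, src))


def open_store(path=":memory:"):
    """Open the (hypothetical) side store that maps images to pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image_refs ("
        "image_url TEXT NOT NULL, page_url TEXT NOT NULL, "
        "PRIMARY KEY (image_url, page_url))"
    )
    return conn


def record_page(conn, page_url, html):
    """First crawl: store every image URL referenced by this HTML page."""
    collector = ImageCollector(page_url)
    collector.feed(html)
    conn.executemany(
        "INSERT OR IGNORE INTO image_refs (image_url, page_url) VALUES (?, ?)",
        [(image_url, page_url) for image_url in collector.images],
    )
    conn.commit()
    return collector.images


def pages_for_image(conn, image_url):
    """Second crawl: look up which pages reference this image URL."""
    rows = conn.execute(
        "SELECT page_url FROM image_refs WHERE image_url = ? ORDER BY page_url",
        (image_url,),
    )
    return [row[0] for row in rows]


# Example using the URLs from the question.
conn = open_store()
record_page(conn, "http://contoso/a.html", '<html><body><img src="image1.jpg"></body></html>')
print(pages_for_image(conn, "http://contoso/image1.jpg"))  # ['http://contoso/a.html']
```

    Because the lookup only runs after all pages have been recorded, it does not matter whether the crawler visits the image or the page first.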

    Hope this helps,

    Amir

    Sunday, February 26, 2012 8:15 AM
  • You could get this info from the FAST Web crawler:

    http://technet.microsoft.com/en-us/library/ff381266.aspx

    However, that would mean moving away from the FAST Connector (the SharePoint crawler), so it's probably not the best option. If you haven't used it in ESP, it definitely has a bit of a learning curve: it has no GUI and no connection to the SharePoint back end, so it would really only be used for web crawls.

    But in general, this web crawler records this information in its log files and its database (not a SQL database), showing both the crawled URL and its "referrer".


    Igor Veytskin

    Monday, February 27, 2012 2:44 PM
    Moderator