Is it possible to capture metadata in a document?

  • Question

  • Below is the sample HTML file I need to index.


    <html>
    <body>
    <a href='/Detail.Page?id=123456789'>Title of Support Note with ID 123456789</a><br/>
    <a href='/Detail.Page?id=987654321'>Title of Support Note with ID 987654321</a><br/>
    <a href='/troubleshooting_guide_example.pdf'>Title of Troubleshooting Guide Example</a>
    <meta>
     <name="product" value="Laptops & Netbooks-Thinkpad Series Laptops-Thinkpad T60 Series-3100,Laptops & Netbooks-Thinkpad Series Laptops-Thinkpad T60 Series-3300"/>
     <name="category" value="Audio,BIOS"/>
     <name="audience" value="Call Center"/>
     <name="language" value="en_US"/>
    </meta>
    <br/>
    <a href='/troubleshooting_guide_example_2.pdf'>Title of Troubleshooting Guide Example #2</a>
    <meta>
     <name="product" value="Laptops & Netbooks-Thinkpad Series Laptops-Thinkpad T61 Series-3100,Laptops & Netbooks-Thinkpad Series Laptops-Thinkpad T61 Series-3110"/>
     <name="category" value="Video,Keyboard"/>
     <name="audience" value="Service Provider"/>
     <name="language" value="es_ES"/>
    </meta>
    <br/>
    </body>
    </html>


    I want to know whether it is possible (as shown in the sample HTML file above) to index the PDF document pointed to by the link

    /troubleshooting_guide_example.pdf

    and attach the metadata that follows the link

     <name="product" value="Laptops & Netbooks-Thinkpad Series Laptops-Thinkpad T60 Series-3100,Laptops & Netbooks-Thinkpad Series Laptops-Thinkpad T60 Series-3300"/>
     <name="category" value="Audio,BIOS"/>
     <name="audience" value="Call Center"/>
     <name="language" value="en_US"/>

    to the indexed PDF document.

    If so, how do I do that?
    We are running FAST ESP 5.3 on Windows 2003.
    Wednesday, December 8, 2010 8:42 AM

All replies

  • To do this, you need to write a custom document processor stage and place it above the SearchExportConverter stage in the pipeline. See ESP_Document_Processor_Integration_Guide.pdf in the ESP documentation for how to create custom document processing stages.

    Basically, you create a stage that retrieves the HTML from either the "data" or the "html" field (depending on which default pipeline you are using and how the HTML document entered the pipeline). Then you can use XPath (if it's XHTML) or fall back to regular expressions (which I don't like to use on HTML, but they work in many cases) to extract the URL of the PDF document from the HTML. Extract the metadata in the same way.

    Once you have the URL, you can fetch the PDF, write its binary data to the "data" field, and set your metadata on the appropriate fields defined in your index profile.

    You might also be able to achieve the same thing by utilizing a combination of some of the existing stages.

    Hope this points you in the right direction. It's pretty straightforward once you've done it a couple of times.

    Regards,

    Mikael Svenson


    Search Enthusiast - MCTS SharePoint/WCF4/ASP.Net4
    http://techmikael.blogspot.com/ - http://www.comperiosearch.com/
    Wednesday, December 8, 2010 7:18 PM
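
The extraction step described in the reply above can be sketched in plain Python. This is only an illustration of the regular-expression approach applied to the sample file from the question: it does not use the ESP document processor API (see the integration guide for that), it does not fetch the PDF, and the function and pattern names are hypothetical. It also assumes each `<meta>` block immediately follows the link it describes, as in the sample.

```python
import re

# A link to a PDF immediately followed by its <meta>...</meta> block.
# Assumption: the block directly follows the closing </a>, as in the sample.
LINK_WITH_META = re.compile(
    r"<a\s+href=['\"](?P<url>[^'\"]+\.pdf)['\"]\s*>(?P<title>[^<]*)</a>\s*"
    r"<meta>(?P<meta>.*?)</meta>",
    re.IGNORECASE | re.DOTALL,
)

# One entry of the form <name="key" value="..."/> used in the sample.
META_ENTRY = re.compile(r'<name="(?P<name>[^"]+)"\s+value="(?P<value>[^"]*)"\s*/>')


def extract_pdf_metadata(html):
    """Return {pdf_url: {"title": ..., meta_name: meta_value, ...}}."""
    result = {}
    for match in LINK_WITH_META.finditer(html):
        fields = {"title": match.group("title")}
        for entry in META_ENTRY.finditer(match.group("meta")):
            fields[entry.group("name")] = entry.group("value")
        result[match.group("url")] = fields
    return result
```

Calling `extract_pdf_metadata` on the sample file would give one entry per PDF link, keyed by URL, with the `product`, `category`, `audience`, and `language` values as strings. Inside a real stage, those values would then be copied to the fields defined in the index profile, and comma-separated values such as `Audio,BIOS` could be split first if the target fields are multi-valued.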