Document Highlighting with CTS Flow

  • Question

  • If we use a CTS flow to feed content, we ideally need to remove all the parsing stages, so the Export...Html stage used to configure highlighting cannot be used. In short, that stage writes the HTML content to the field htmlsource, so we are trying to emulate it with a Mapper inside the CTS flow. Neither the Document Parser nor the retriever exposes the HTML source; we can get the body as a string, but that is not acceptable as HTML source.

    Is there any way we can decompress the blob data and get an htmlsource out of the data element?


    Wednesday, November 9, 2011 12:50 AM

All replies

  • Hello,

    I don't completely understand your use case from the information provided, but it looks like you want to get the intact HTML source of an HTML document out of the CTS DocumentParser operator, so that you can put that source into a field and pass it into the ESP indexing pipeline.

    The only way to get the HTML source (with the tags intact) for an HTML document from the DocumentParser operator is to filter the document as "text/plain". The 'body' field passed out of the DocumentParser should then contain the intact HTML source of the document.

    You can test this pretty easily within a CTS flow. I'd recommend creating just a simple one as a test (you could of course use different reader/writer operators):

    Before the DocumentParser operator you would need to insert a new Mapper operator.  In this new Mapper operator you would need to create two additional fields:

    Name             Type      Expression
    mimeType         String    "text/plain"
    fileExtension    String    ".html"    <- this could technically be anything but blank


    Directly setting these fields will cause the DocumentParser to parse the html file as plain text.  This will keep all the html source intact in the body field which then gets passed out of the DocumentParser.  You should be able to test and confirm that by modifying a simple flow like the above.
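    To make the effect of those two fields concrete, here is a toy model in Python. It is not CTS code and does not use any CTS API: the function names, the record layout (a dict with "id" and "data"), and the tag-stripping regex are all assumptions made up for illustration. It only shows the idea that the same bytes come out of the parser with tags stripped when treated as HTML, and untouched when forced to "text/plain".

```python
import re


def mapper_force_plain_text(record):
    """Hypothetical stand-in for the Mapper: add the two override fields."""
    out = dict(record)  # copy so the incoming record is left unchanged
    out["mimeType"] = "text/plain"
    out["fileExtension"] = ".html"  # any non-blank value, per the table above
    return out


def toy_document_parser(record):
    """Hypothetical stand-in for DocumentParser.

    Assumption: content marked text/plain passes through as-is,
    anything else is treated as HTML and has its tags stripped.
    """
    raw = record["data"].decode("utf-8")
    if record.get("mimeType") == "text/plain":
        body = raw
    else:
        body = re.sub(r"<[^>]+>", "", raw)
    out = dict(record)
    out["body"] = body
    return out


record = {"id": "doc1", "data": b"<html><body><p>Hello</p></body></html>"}

parsed_as_html = toy_document_parser(record)
parsed_as_text = toy_document_parser(mapper_force_plain_text(record))

print(parsed_as_html["body"])
print(parsed_as_text["body"])
```

    In the toy model the first body has lost its markup and the second still carries every tag, which is the behavior you would be checking for in the real test flow.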

    In your case you could potentially branch your CTS flow to pass the html documents through a secondary DocumentParser which parses them as text so that the original html source remains intact and is passed out in the body field.  You could then use a Mapper operator to put the body data into whatever field you need and a Join operator to then join that field back to the recordset being passed as a part of the primary flow.

    Definitely a bit convoluted, but it would allow you to retain your HTML source and pass it in a field.
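    The branch-and-join idea can be sketched outside CTS as a plain keyed join. Again, this is only a Python illustration under assumptions: the record lists, the "id" join key, and the function name are hypothetical, and the real Join operator is configured in the flow rather than written as code.

```python
def join_html_source(primary, secondary, key="id"):
    """Inner-join two record lists on `key`.

    The body of each secondary record (the document parsed as plain
    text, so still raw HTML) is copied into an `htmlsource` field on
    the matching primary record.
    """
    raw_by_key = {rec[key]: rec["body"] for rec in secondary}
    joined = []
    for rec in primary:
        if rec[key] in raw_by_key:
            merged = dict(rec)
            merged["htmlsource"] = raw_by_key[rec[key]]
            joined.append(merged)
    return joined


# Primary branch: normal parsing, body is extracted text.
primary = [{"id": "doc1", "title": "Hello", "body": "Hello"}]
# Secondary branch: parsed as text/plain, body is the raw HTML.
secondary = [{"id": "doc1", "body": "<html><body><p>Hello</p></body></html>"}]

result = join_html_source(primary, secondary)
print(result)
```

    The point of the sketch is the shape of the output: one record per document, carrying both the parsed body and the untouched HTML in a separate field.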



    Jason Greene

    Thursday, November 17, 2011 8:24 PM
  • Sorry for the delay. The requirement is to make document hit highlighting work in FAST ESP when content is fed through a CTS flow. Considering the double parsing and the multiple mappings and joins needed to support this, it looks more like a workaround, and RunCode with using (WebClient client = new WebClient()) {htmlsource = client.DownloadString(id);} would be much simpler, although not the correct way to do it.

    For HTML sources, we can implement the flow below, but we cannot figure out how to remove the input1. and input0. property prefixes produced by the hash join operator. They cause the data to be ignored by the ESP writer because the schema does not match, and adding a Mapper gives the error "Cannot locate operator with label UnhandledDocumentUnion". Any ideas?
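    To spell out the renaming we are after, here is the transformation written as a small Python sketch. It is not a CTS solution and uses no CTS API; the function name, the sample record, and the rule that the first occurrence wins on a name clash are assumptions for illustration only.

```python
def flatten_join_output(record, rename=None):
    """Drop 'inputN.' prefixes from field names.

    `rename` maps a prefixed field name to an explicit final name,
    for fields that would otherwise collide after the prefix is removed.
    """
    rename = rename or {}
    flat = {}
    for name, value in record.items():
        if name in rename:
            target = rename[name]
        else:
            prefix, sep, rest = name.partition(".")
            is_join_prefix = (
                bool(sep) and prefix.startswith("input") and prefix[5:].isdigit()
            )
            target = rest if is_join_prefix else name
        flat.setdefault(target, value)  # first occurrence wins on a clash
    return flat


joined = {
    "input0.id": "doc1",
    "input0.body": "Hello",
    "input1.id": "doc1",
    "input1.body": "<p>Hello</p>",
}

flat = flatten_join_output(joined, rename={"input1.body": "htmlsource"})
print(flat)
```

    The resulting record has only id, body and htmlsource, which is the schema the writer would need to see.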

    Also, how do we get the same working for Word, Excel and PDF documents in order for document hit highlighting to work?

    If there were an optional parameter at the document parser level (like body, properties and title) to get the htmlsource to support FAST ESP's document hit highlighting functionality, then we could avoid a lot of duplicate processing.

    Please let us know,

    • Edited by mdate Wednesday, December 14, 2011 10:33 PM
    Wednesday, December 14, 2011 8:37 PM
  • Hi,


    I would recommend that you open up a service request with our Technical Support department so we can properly research the problem, and help move this towards resolution from within the confines of a support ticket.



    Rob Vazzana | Sr Support Escalation Engineer | US Customer Service & Support

    Customer Service & Support                          Microsoft | Services

    Wednesday, December 28, 2011 10:13 PM