How can I input XML data from an external web service into a scope field during a FAST pipeline stage?

  • Question

  • Hi there,

    We have an external web service that we access from a custom pipeline stage in FAST. Our task is to take the XML output by this web service and store it in a scope field. So far we have tried the following:

    1. Putting the XML output by the web service directly into the scope field. This did not work: the FAST pipeline complained that the data was of the wrong type (i.e. <type 'str'>).
    2. Writing the XML output of the web service into a flat field and then using the XMLParser and XMLScopifier stages to convert the string into the format expected by the scope field. This did not work either: the pipeline complained that the XML data in our flat field was invalid. On closer inspection, we found that FAST had escaped all of the angle brackets in the web service output to &lt; and &gt; before storing the string in the flat field, which appears to be why the XMLParser cannot convert the string into the object expected by the XMLScopifier.

    How can we solve this?

    We are basically stuck on this at the moment and have absolutely no idea how to move forward from here, so any help or hints would be much appreciated!

    Many thanks!

    Thursday, January 26, 2012 2:36 PM

All replies

  • If the XML output of the web service is escaped, you may have to look into how you invoke the web service, and how you store the output into the document attribute. I doubt that the FAST document processing machinery is responsible for escaping the XML data.

    How sure are you that the XML data *is* escaped, by the way? Is it possible that the escaping happens as part of whatever mechanism you use to look at the content of the document attribute(s)?

    Note that you (probably!) have to convert your XML output into a specific format before the XMLParser/XMLScopifier stage.

    Friday, January 27, 2012 10:07 AM
  • Thanks for the reply, Raymond.

    To work out what was going on when we first saw the error, we ran a simple FQL query against the web front-end (i.e. http://localhost:15100/cgi-bin/xsearch?offset=0&hits=10&query=a) and looked at the source of the XML returned (e.g. Ctrl+U in Firefox). The angle brackets and quotes in the XML value of the flat field of interest had all been escaped to &lt;, &gt; and &quot;. We initially thought this might just be the browser trying to render the XML string, but the same thing does not happen in the body field (i.e. <FIELD NAME="body">), which also contains tags in its string value (e.g. '<sep/>convert the whole structure to <key>a</key> series of maps <sep/>create <key>a</key> dao that handles all interaction with<sep/>').

    There is also the possibility that FAST is escaping the characters before the results are sent to the browser, although I am not sure how we can determine if this is the case.

    You mention that we may need to convert the XML output into a specific format before the XMLParser/XMLScopifier stage. Any ideas as to what this format might be and how we can generate it from our custom Python stage? From my understanding, the XMLParser converts the string into a DOM model and stores it in a meta-value called 'dom' on the field we are trying to convert; the XMLScopifier then fetches this 'dom' meta-value and converts it into the scope field object. Is this correct?

    Either way, we are still stuck on this issue as it is.


    • Edited by rod82 Friday, January 27, 2012 11:19 AM
    Friday, January 27, 2012 11:18 AM
  • OK, first of all: the encoding you see in the result XML is just there to avoid confusion between the "container" format (the XML result) and text data in the result set. It does not indicate that the processing pipeline has encoded the data (unless you see something like "&amp;lt;", which would be a double-encoding of "<").
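
    If you want to verify this from the query side, a quick script along these lines (using the xsearch URL from your post; adjust host, port and query as needed) should tell you whether you are looking at ordinary result-XML escaping or at a genuine double-encoding of the field value:

    # Count single- vs. double-escaped angle brackets in the raw result XML.
    # "&lt;" is normal container escaping; "&amp;lt;" means the field value
    # itself already contained "&lt;" before it reached the index.
    import urllib2

    url = "http://localhost:15100/cgi-bin/xsearch?offset=0&hits=10&query=a"
    raw = urllib2.urlopen(url).read()

    print("single-escaped '<': %d" % raw.count("&lt;"))
    print("double-escaped '<': %d" % raw.count("&amp;lt;"))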

    The format I've used is something like

    <document>
      <field-name-1 type="string">string-value-1</field-name-1>
      <field-name-2 type="int32"> ... </field-name-2>
      <field-name-3 type="datetime"> ... </field-name-3>
      <field-name-4 type="float"> ... </field-name-4>
      ...
    </document>

    --- this is, in effect, an XML form of a simple hash table, where you can specify terms like

      xml:document:"field-name-1":"whatever"

    My pipeline uses the stages "XMLParser" and "XMLScopifier2".
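
    Just to illustrate, a small helper along these lines (the field names and types below are examples only) is enough to build that structure before the data goes into the string field that XMLParser reads:

    # Illustrative only: build the flat "hash table" XML structure shown above.
    # Values are escaped so markup inside the data cannot break the container.
    from xml.sax.saxutils import escape

    def build_document_xml(fields):
        # fields: list of (field_name, type_name, value) tuples
        parts = ["<document>"]
        for name, type_name, value in fields:
            parts.append('<%s type="%s">%s</%s>'
                         % (name, type_name, escape(str(value)), name))
        parts.append("</document>")
        return "".join(parts)

    xml_string = build_document_xml([
        ("field-name-1", "string", "whatever"),
        ("field-name-2", "int32", 42),
    ])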

    Monday, January 30, 2012 10:06 AM
  • How are you passing this document to the pipeline? Just to reiterate the second point in our original question, our current workflow is the following:

    1) Trigger the file traverser to fetch the contents of files inside a specific folder. These files are not XML; they are unstructured and semi-structured files in various formats (e.g. PDF, Excel, Word).

    2) In the later stages of the pipeline, before the final document is output to the indexer (i.e. RTSOutput), we call a custom Python stage that:

        i) Sends the document content (i.e. document.GetValue('body')) to an external web service.

        ii) Gets the response from this web service, which is XML, and stores it in a flat field (i.e. document.Set('generic1', webservicedata)); a rough sketch of this stage is shown after this list.

    3) We tried placing the XMLParser (we tried both "Attributes = generic1" and "Attributes = generic1:") and XMLScopifier (with "AutoRealType = generic1:xml") stages right after our custom stage, to convert the XML string in the 'generic1' flat field into a DOM object (XMLParser) and then into a scope field (XMLScopifier). That is where we get the errors: the XMLParser just says that the XML content it receives is invalid.
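
    For reference, the core of step 2 above boils down to something like the sketch below. The web service URL here is a placeholder and the stage boilerplate around it is omitted, but document.GetValue and document.Set are the calls we actually use:

    # Simplified extract of the custom stage logic from step 2; error handling omitted.
    # SERVICE_URL is a placeholder for our real web service endpoint.
    import urllib2

    SERVICE_URL = "http://our-webservice.example/annotate"

    def process(document):
        body = document.GetValue('body')            # step 2.i: document content
        response = urllib2.urlopen(SERVICE_URL, body)
        webservicedata = response.read()            # XML string returned by the service
        document.Set('generic1', webservicedata)    # step 2.ii: store in the flat field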

    We have tried feeding in a hard-coded piece of XML at point 2.ii of the flow above, but we still get the error from the XMLParser saying that the data is invalid. It would help us compare our approach with yours if you could post the exact configuration you used for the "XMLParser" and "XMLScopifier2" stages to index the XML content you posted above. It would also be good to know whether you did any of this inside a custom Python stage or whether it all came in via the file traverser.

    Also, is there an easy way to get more detailed messages from the default XMLParser stage to determine what is wrong with the content it is getting? Like I said before, all that FAST tells us in the log is that the format was invalid, but it doesn't even output the string it was trying to process or tell us what may be wrong with it.

    Thanks,

    R

    Monday, January 30, 2012 2:18 PM
  • We're setting up the XML content in a custom connector. This connector gets its input from a set of XML files that are dropped into a specific directory; the connector picks up the XML files, parses and processes them, and sends the information from the files to the FAST ContentDistributor (via the Content API).

    Part of the processing done by the connector is to extract data from the XML fields and create a new XML structure that matches what XMLParser/XMLScopifier2 expects.

    We configure XMLParser with default values, plus

    Attributes = xmldata:xml

    (xmldata is the string data field that contains the XML structure created by our connector).

    XMLScopifier2 is configured with default values, plus

    Mapping = xml:xml(result:resxml,datetimeresolution:minute)
    attributesFirst = 0
    TypeAttribute = type

    It is a bit odd that you're getting errors from XMLParser; this stage is (I think) only concerned with XML parsing, so if it complains that the XML is invalid, either your web service is generating invalid XML, or the encoding of the XML does not match what the processing pipeline expects. One way of checking this would be to use the "Spy" stage (which I think is part of the standard document processor set); alternatively, you could create a small custom document processor that simply writes the content of your XML field to disk somewhere.
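
    The dump-to-disk variant only needs a few lines inside a custom stage, something like the following (the output path is just an example, and I am assuming the same GetValue call you already use):

    # Debug aid: write the raw content of the flat field to disk so it can be
    # inspected or validated with an external XML tool.
    data = document.GetValue('generic1')
    if data is not None:
        f = open('/tmp/generic1_dump.xml', 'w')
        f.write(data)
        f.close()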

    Tuesday, January 31, 2012 9:35 AM