Accessing OOTB fields when processing PDFs

  • Question

  • Hello,

    When processing PDFs in the pipeline, some OOTB fields such as 'url' and 'title' are not accessible (i.e. they are empty) during the document processing stages; however, a query against the collection shows that these fields are populated.

    Setup:

    1. A document processing stage was created from the custom class 'RegularExpression' to extract a portion of 'url' (or 'title') into a newly created field, 'generic1'.
    2. A pipeline was created from the 'Generic' template with the stage above added (I also tried moving the stage to various positions in the pipeline).
    3. The result is that 'generic1' does not contain any data, although 'url' and 'title' are populated.
    4. When the same document processing stage is used on HTML pages, 'generic1' is populated.


    Wednesday, May 2, 2012 9:14 AM

All replies

  • Clear the doclog and enable doctrace to see which fields are populated, and when, during feeding. If the URL is in fact populated before the stage that tries to extract something from it, it could be that your regex simply doesn't match.

    1. psctrl reset && psctrl doctrace on

    2. Feed a document

    3. doclog -a | more
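
To rule out the regex itself, it may help to run it outside the pipeline against the exact 'url' value that doclog reports for the PDF. The following is a minimal sketch in plain Python. The pattern, the `extract` helper and the sample values are made up for illustration; they are not taken from the original 'RegularExpression' stage or its configuration.

```python
import re

# Hypothetical pattern (an assumption): capture the first path segment of an http(s) URL.
PATTERN = re.compile(r"^https?://[^/]+/([^/]+)/")


def extract(value):
    """Return the first captured group, or None if the value is empty or does not match."""
    if not value:
        return None
    match = PATTERN.search(value)
    return match.group(1) if match else None


# Hypothetical sample values; replace them with the 'url' values shown by doclog.
samples = {
    "html": "http://intranet.example.com/docs/page.html",
    "pdf": "http://intranet.example.com/docs/report.pdf",
    "empty": "",
    "unc path": r"\\server\share\docs\report.pdf",
}

for label, value in samples.items():
    print(label, "->", extract(value))
```

If the PDF's value matches here but 'generic1' is still empty after feeding, the field was most likely not yet populated when the stage ran. If it does not match, the pattern is the place to look.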


    Dan Gøran Lunde

    Wednesday, May 2, 2012 10:29 AM
  • Use the SPY stage. It will show you which properties have been extracted by the pipeline.

    Thursday, May 10, 2012 6:16 PM