How can I extract text from an XML file with a regular expression?

  • Wednesday, December 12, 2012 6:06 AM
     
     

    I would like to extract text from an XML file with a regular expression that matches something like

    <delta  .*

    /delta>

    How could I do that?


    • Edited by blackjack08 Wednesday, December 12, 2012 6:40 AM

All Replies

  • Wednesday, December 12, 2012 11:25 AM
     
     

    Use XPath to extract values.

    $xml=[xml]'<x/>'

    $xml.Load('myfile.xml')

    $xml.SelectSingleNode('//delta')

    That's it.
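
    One caveat worth sketching: Load is a .NET method, so a relative path such as 'myfile.xml' is resolved against the process working directory, which can differ from the current PowerShell location. The snippet below is a minimal sketch under assumptions: the sample file name and its contents are made up so that it runs on its own.

    ```powershell
    # Hypothetical sample file, created only so the sketch is self-contained.
    $path = Join-Path ([System.IO.Path]::GetTempPath()) 'delta-sample.xml'
    Set-Content -Path $path -Value '<root><delta operation="add">value 1</delta></root>'

    # Resolve to a full path before handing it to the .NET Load method.
    $fullPath = (Resolve-Path -Path $path).Path

    $xml = New-Object System.Xml.XmlDocument
    $xml.Load($fullPath)

    # SelectSingleNode returns the first match, or $null if nothing matches.
    $node = $xml.SelectSingleNode('//delta')
    $node.OuterXml    # the whole element as text
    $node.InnerText   # only the text inside the element
    ```

    OuterXml is probably closer to what was asked for (the record including its tags), while InnerText gives only the value.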


    ¯\_(ツ)_/¯

  • Wednesday, December 12, 2012 12:17 PM
    Moderator
     

    Using V3 / regex:

    (gc test.xml -raw) -replace '(?ms).+(<delta.*/delta>).+','$1'


    [string](0..33|%{[char][int](46+("686552495351636652556262185355647068516270555358646562655775 0645570").substring(($_*2),2))})-replace " "

  • Thursday, December 13, 2012 12:08 PM
     
     

    Thank you.

    Is this PowerShell?

  • Thursday, December 13, 2012 12:34 PM
     

    How about a pattern like this:

    <delta operation="add"  ******

    ******

    /delta>

    There are operation="add", operation="update" and operation="delete" records,

    and I would like to extract only the add data.


    I tried

     '(?ms).+(<delta operation="add".*/delta>).+','$1'

    but the result also includes the delete and update operations.

     
    • Edited by blackjack08 Thursday, December 13, 2012 12:41 PM
  • Thursday, December 13, 2012 12:38 PM
    Moderator
     
     

    I need better information than that.

    What exactly does the text look like and what exactly is "add data"?

    The first rule of regular expressions is "know your data", and right now I don't.



  • Thursday, December 13, 2012 4:58 PM
     
     

    Thank you.

    The XML file looks like this:

    <delta operation="add"  ******

    ******

    /delta>

    <delta operation="delete"  ******

    ******

    /delta>

    <delta operation="update"  ******

    ******

    /delta>

    Each <delta ...> ... </delta> represents one record, and the add, delete and update records appear in random order in the XML file.

    I would like to extract only the add records, which look like

    <delta operation="add"  ******

    ******

    /delta>

    Each record spans multiple lines, so I cannot use grep to extract it.

  • Thursday, December 13, 2012 4:58 PM
     
    This is what happens with XML.  A user asks how to do one small thing, then expands the question to ever more complex issues.  That is exactly why we invented XML.  Textual storage of complex data cannot be easily or reliably queried without a format and schema unless you are a master of RegEx.  XPath is for humans to talk to XML.

    My earlier example can be easily modified to extract only specific tags.

    $xml=[xml]'<x/>'
    $xml.Load('myfile.xml')
    $xml.SelectNodes("//delta[@operation='add']")

    Of course, without an example of the complete XML file we cannot verify this.

    You are free to use RegEx, but it is a good idea for anyone working with computers as more than an end user to learn the basics of XML.
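
    To get each add record back as text, tags included, something along these lines may work. It is a sketch under assumptions: the real file was not posted, so the sample XML and the item elements inside each delta are invented for illustration.

    ```powershell
    # Hypothetical sample data; the real file layout was not posted.
    $sample = '<root>' +
              '<delta operation="add"><item>1</item></delta>' +
              '<delta operation="delete"><item>2</item></delta>' +
              '<delta operation="add"><item>3</item></delta>' +
              '</root>'
    $xml = [xml]$sample

    # OuterXml returns the element with its tags and children as one string.
    $addRecords = @($xml.SelectNodes("//delta[@operation='add']") |
        ForEach-Object { $_.OuterXml })

    $addRecords.Count   # number of add records found
    $addRecords         # one string per add record
    ```

    The strings in $addRecords can then be written out with Set-Content or processed further.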


    ¯\_(ツ)_/¯


  • Thursday, December 13, 2012 5:27 PM
     
     

    With multiline captures you will have issues with RegEx.

    Using XML/XPath you can do this easily.


    ¯\_(ツ)_/¯

  • Thursday, December 13, 2012 6:18 PM
     

    Terminology issues are also at work here.

    When the OP says "only data" - what is data?  Either all of the file is data, or it is XML. If it is XML, then what is being asked is "what is the value of the text node of the tag 'delta' when the attribute 'operation' has a value of 'add'?"

    This is a complex query, and when it has multiple values spread over many lines it becomes much more difficult to extract it reliably with RegEx.

    Here is a complete, working example of how to do this with XPath.

    $xml=[xml]@'
    <root>
        <delta>some data value 1</delta>
        <delta operation='add'>some data value 2</delta>
        <delta operation='update'>
    some data value 3
    that is stored on more
    than one line.        
        </delta>
        <delta operation='add'>some data value 4</delta>
        <delta operation='delete'>some data value 5</delta>
        <delta operation='add'>some data value 6</delta>
    </root>
    '@
    Write-Host " Get all nodes named 'delta'" -fore green
    $xml.SelectNodes('//delta')|%{$_.'#text'}
    Write-Host " Get all nodes named 'delta' with an attribute named 'operation'" -fore green
    $xml.SelectNodes('//delta[@operation]')|%{$_.'#text'}
    Write-Host " Get all nodes named 'delta' with an attribute named 'operation' whose value is 'add'" -fore green
    $xml.SelectNodes('//delta[@operation="add"]')|%{$_.'#text'}
    Write-Host " Get all nodes named 'delta' with an attribute named 'operation' whose value is 'delete'" -fore green
    $xml.SelectNodes('//delta[@operation="delete"]')|%{$_.'#text'}

    Just copy and paste it at a prompt and you will see how selective it is.  The example also demonstrates how it can get the node text even if it is on more than one line.

    This is what XML is for and it works very nicely.

    Again, the header conditions of each XML file determine how the query is written.  For basic XML the example works.  With namespaces we need to add one step.
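
    A rough sketch of that extra step, assuming the file declares a default namespace. The URI 'urn:example:deltas' and the prefix 'd' are made up for illustration; the real file's namespace, if it has one, would go in their place.

    ```powershell
    # Hypothetical sample with a default namespace on the root element.
    $sample = '<root xmlns="urn:example:deltas">' +
              '<delta operation="add">value 1</delta>' +
              '<delta operation="delete">value 2</delta>' +
              '</root>'
    $xml = [xml]$sample

    # Unprefixed names in XPath match only elements in no namespace,
    # so this query finds nothing in a namespaced document.
    $plain = $xml.SelectNodes('//delta[@operation="add"]')

    # The extra step: map a prefix of our choosing to the namespace URI.
    $ns = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
    $ns.AddNamespace('d', 'urn:example:deltas')
    $withNs = $xml.SelectNodes('//d:delta[@operation="add"]', $ns)

    $plain.Count    # nothing matched without the namespace manager
    $withNs | ForEach-Object { $_.InnerText }
    ```

    The attribute needs no prefix here because unprefixed attributes are not placed in the default namespace.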


    ¯\_(ツ)_/¯

  • Thursday, December 13, 2012 6:40 PM
    Moderator
     
     

    With multiline captures you will have issues with RegEx.

    Using XML/XPath you can do this easily.


    ¯\_(ツ)_/¯


    Not necessarily.  The regex I posted is doing a multi-line capture.
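
    For illustration, a multi-line capture limited to the add records could look roughly like the sketch below. It assumes the delta records are not nested and that each one ends with </delta>; the sample text is invented, since the real file was not posted.

    ```powershell
    # Hypothetical sample text; each record spans several lines.
    $lines = '<delta operation="add">', 'record 1', '</delta>',
             '<delta operation="delete">', 'record 2', '</delta>',
             '<delta operation="add">', 'record 3', '</delta>'
    $text = $lines -join "`n"

    # (?s) lets . match newlines; .*? is lazy, so each match stops at the
    # first </delta> after its own opening tag.
    $pattern = '(?s)<delta operation="add".*?</delta>'

    $found = @([regex]::Matches($text, $pattern) | ForEach-Object { $_.Value })

    $found.Count   # number of add records matched
    $found         # each add record, line breaks included
    ```

    Unlike -replace with a single capture group, [regex]::Matches returns every matching record rather than one.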


  • Thursday, December 13, 2012 7:52 PM
     
     

    Rob - I know you can do it, but the spec keeps changing, so it would be easier to adjust the XPath, as it is very narrowly targeted.

    In any case both methods are available.


    ¯\_(ツ)_/¯

  • Thursday, December 13, 2012 8:03 PM
    Moderator
     
     

    I understand.  I just don't want the OP to be left with the wrong impression about regex, and to believe that there's a problem with regex being able to do multiline captures in general.

    The XPath solution may be easier to implement, but at this point that relies on an assumption that we're dealing with well-formed XML, and I don't think we even know that for sure.
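
    One way to check that assumption before choosing an approach is to try the [xml] conversion and see whether it throws. This is only a sketch: Test-WellFormedXml is a hypothetical helper name, not a built-in cmdlet, and the sample strings are invented.

    ```powershell
    # Hypothetical helper: returns $true if the text parses as XML.
    function Test-WellFormedXml {
        param([string]$Text)
        try {
            $null = [xml]$Text   # the conversion throws on malformed XML
            return $true
        }
        catch {
            return $false
        }
    }

    $good = Test-WellFormedXml '<root><delta operation="add">ok</delta></root>'
    $bad  = Test-WellFormedXml '<root><delta operation="add">ok</root>'

    $good   # parses
    $bad    # mismatched closing tag, does not parse
    ```

    For a file, the text could be read with Get-Content -Raw (V3) and passed to the same helper.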



  • Thursday, December 13, 2012 8:08 PM
     
     
    Agreed.  Good luck on scope creep.

    ¯\_(ツ)_/¯

  • Thursday, December 13, 2012 8:11 PM
    Moderator
     
     
    Agreed.  Good luck on scope creep.

    ¯\_(ツ)_/¯

    I was about to wish you the same.  We'll see which solution gets latched onto.


  • Thursday, December 13, 2012 8:46 PM
     
     
    If I win it'll cost you a pint.

    ¯\_(ツ)_/¯

  • Thursday, December 13, 2012 8:51 PM
    Moderator
     
     
    Done.  I guess if he takes both, we have to split it.


  • Thursday, December 13, 2012 10:36 PM
     
     
    Done.  I guess if he takes both, we have to split it.



    How about a black and tan...that is referred to as half and half?

    ¯\_(ツ)_/¯

  • Thursday, December 13, 2012 11:34 PM
    Moderator
     
     
    Whatever works for you.  I'm not much of a drinker (have a genetic pre-disposition to migraines, and alcohol is one of the triggers).


  • Monday, December 17, 2012 12:58 AM
     
     

    Thank you.

    I have a question.

    Is the default regex match the largest match or the smallest match?

  • Monday, December 17, 2012 12:58 AM
     
     

    Thank you.

    I will try XPath.

  • Monday, December 17, 2012 1:06 AM
    Moderator
     
     Answered
    If I understand the question correctly, default is largest match (sometimes referred to as a "greedy" match).
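
    A small sketch of the difference, using a made-up one-line sample. Adding ? after the quantifier switches it from greedy to lazy (smallest match).

    ```powershell
    # Hypothetical sample with two records on one line.
    $text = '<delta>a</delta><delta>b</delta>'

    # Greedy (default): .* runs to the LAST </delta>.
    $greedy = [regex]::Match($text, '<delta>.*</delta>').Value

    # Lazy: .*? stops at the FIRST </delta>.
    $lazy = [regex]::Match($text, '<delta>.*?</delta>').Value

    $greedy   # <delta>a</delta><delta>b</delta>
    $lazy     # <delta>a</delta>
    ```

    This is likely why the earlier add-only pattern also returned the delete and update records: its greedy .* ran through to the last /delta> in the file.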
