newbie question on regular expressions

  • Question

  • I have a file which looks like this:

    x: 0
          y:1
    z: 4

    The output after parsing should be as shown below. The idea is to split each line on colon + space. If a line begins with whitespace, it is part of the previous value; otherwise it starts a new ID.

    ID is x, Value is "0 y:1"

    ID is z, Value is "4"

    I have the following code for parsing, and it doesn't seem to work. I validated the regular expression at the site below, where it appears to work.

    http://regexpal.com/

    My guess is that it is a syntax problem. I would appreciate any ideas on this. Thanks,

    #$pattern = [regex]"(?m)((\S+)[^:`n]*:(\s+)(.+)`n)((\s+)(.+)`n)*"
    $pattern = [regex]"((\S+)[^:`n]*:(\s+)(.+)`n)((\s+)(.+)`n)*"
    $content.split($pattern,[System.StringSplitOptions]::RemoveEmptyEntries) | ForEach-Object {
        $split = $_.split(":", 2)
        if ($split -eq $null) { return }
        if ($split[0] -ne $null) { $Name = $split[0].Trim() } else { return }
        if ($split[1] -ne $null) { $Id = $split[1].Trim() } else { return }
        Write-Host "#Name is- $Name #Id is- $Id"
    }
    

    Tuesday, April 22, 2014 5:33 PM

Answers

  • This appears to work according to the sample input and output you've posted:

    $text = Get-Content -Path .\test.txt -Raw
    
    $results = [regex]::Matches($text, '(?ms)^(\S+):\s+(.*?)(?=^\S+:\s+|\Z)')
    
    foreach ($match in $results)
    {
        $id = $match.Groups[1].Value
        $value = $match.Groups[2].Value -split '\s+' -match '\S' -join ' '
        
        "ID is $id, Value is $value"
    }
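
    As a quick check that needs no file on disk, here is a minimal sketch that runs the same pattern against the multi-line sample from this thread. The names $sampleText, $pattern and $records are only illustrative, and the lines are joined with explicit newline characters (an assumption) so the result does not depend on the script file's line endings.

```powershell
# Sample input built in memory (assumption: LF line endings).
$sampleText = @(
    'x: 1'
    '   y: 2'
    '   yy:10'
    '  zz:100'
    'z: 3'
    'd: 4'
) -join "`n"

# Same pattern as above: each record starts at an unindented "name:" and
# runs lazily until the next such line or the end of the string.
$pattern = '(?ms)^(\S+):\s+(.*?)(?=^\S+:\s+|\Z)'

$records = foreach ($match in [regex]::Matches($sampleText, $pattern)) {
    $id = $match.Groups[1].Value
    # Collapse all whitespace runs, including newlines, into single spaces.
    $value = ($match.Groups[2].Value -split '\s+' -match '\S') -join ' '
    "ID is $id, Value is $value"
}

$records
```

    With this input the sketch should emit three lines, one each for x, z and d, with the indented lines folded into the value of x.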
    

    • Marked as answer by vm121 Wednesday, April 23, 2014 5:35 AM
    Wednesday, April 23, 2014 3:34 AM
  • Here is a pretty annotated version:

    $x=''
    Get-Content \temp\regtest.txt |
         ForEach-Object{
              if($_ -match '(^.:\s\d+)'){
                   if($x){$x} # output if set
                   $x='Name is {0}, ID is {1}' -f $matches[1].Split(':')
              }else{
                   "$x$_"
                   $x='' # clear after output
              }
         }
    


    ¯\_(ツ)_/¯

    • Marked as answer by vm121 Wednesday, April 23, 2014 5:34 AM
    Wednesday, April 23, 2014 4:06 AM

All replies

  • HINT:  it is a multiline detection problem.  The default is single line.


    ¯\_(ツ)_/¯


    • Edited by jrv Tuesday, April 22, 2014 5:53 PM
    Tuesday, April 22, 2014 5:53 PM
  • I did have (?m) earlier which is commented out. That didn't seem to help either. Any other ideas?

    Tuesday, April 22, 2014 6:13 PM
  • Not sure how you're getting your content.  If you're using Get-Content filename, you get an array of lines with no newline characters, so the $Content.Split call wouldn't be valid; at least, it wasn't for me.  When I was trying this I used

    $Content = (Get-Content c:\content.txt) -join "`n"

    That gives me a string with newlines.

    Then try splitting it like

    $Content -split "\n(?=\S)" | Foreach-Object {

    and go on from there.  That should group your sub-elements with their parent element by only splitting where a newline is followed by a non-whitespace character.  I didn't try any of the rest of your code but it looks ok.
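
    Put together, the approach above might look like the following sketch. It is not the original poster's full code: the in-memory $lines array stands in for Get-Content, and the whitespace clean-up and output format are assumptions added for illustration.

```powershell
# Stand-in for (Get-Content c:\content.txt): an array of lines.
$lines = 'x: 1', '   y: 2', '   yy:10', '  zz:100', 'z: 3', 'd: 4'
$content = $lines -join "`n"

# Split only where a newline is followed by a non-whitespace character,
# so indented lines stay attached to the record above them.
$records = $content -split '\n(?=\S)' | ForEach-Object {
    # Split each record on the first colon only: name, then the rest.
    $name, $rest = $_ -split ':', 2
    # Collapse runs of whitespace (including newlines) into single spaces.
    $value = ($rest -split '\s+' -match '\S') -join ' '
    'Name is {0}, Id is {1}' -f $name.Trim(), $value
}

$records
```

    Note that this uses the -split operator, which treats its argument as a regular expression; the String.Split method takes its separators literally.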


    I hope this post has helped!

    Tuesday, April 22, 2014 7:26 PM
  • Thanks. This does help, but not completely. There are two cases that need to work: one where a line ends with a new line only, and another where the new line is followed by a space, as shown in the original example. Your code doesn't work for the first case (new line only). Please see the example below.

    So the output I should see for the following test file:

    x: 1

       y: 2

    z: 3

    d: 4

    should be

    Name is x, Id is "1 y: 2"

    Name is z, Id is 3

    Name is d, Id is 4


    • Edited by vm121 Tuesday, April 22, 2014 8:39 PM
    Tuesday, April 22, 2014 8:34 PM
  • Also, the following case needs to be supported, where there are multiple lines that start with blanks:

    x: 1

       y: 2

       yy:10

      zz:100

      ....

    z: 3

    d: 4

    The output should be

    Name is x, Id is "1 y:2 yy:10 zz:100 ...."

    Name is z, Id is 3

    Name is d, Id is 4

    Tuesday, April 22, 2014 11:23 PM
  • Just an idea.

    Why not check the first character of each line, like this:

    Get the first character of the line,

    then split the name and ID.

    On the next line, get the first character again.

    If it is a blank space, append the data to the previous result;

    otherwise, split the name and ID as before.

    Continue evaluating until the last line is reached.
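
    The steps above can be sketched as a small function. This is only one possible reading of the idea: the function name ConvertTo-Record, its -Line parameter, and the choice to skip blank lines are all hypothetical and not from the thread.

```powershell
# Hypothetical helper: folds indented lines into the record above them.
function ConvertTo-Record {
    param([string[]]$Line)

    $name  = $null   # name of the record currently being built
    $parts = @()     # value fragments collected for that record

    foreach ($l in $Line) {
        # Assumption: blank lines carry no data and are skipped.
        if ($l -notmatch '\S') { continue }

        if ($l -match '^\s') {
            # Line starts with whitespace: append it to the current value.
            if ($null -ne $name) { $parts += $l.Trim() }
        }
        else {
            # New record: emit the previous one first, if there is one.
            if ($null -ne $name) {
                'Name is {0}, Id is {1}' -f $name, ($parts -join ' ')
            }
            # Split on the first colon only; "$first" is '' if no colon exists.
            $name, $first = $l -split ':', 2
            $name  = $name.Trim()
            $parts = @("$first".Trim())
        }
    }

    # Emit the last record, which no following line would trigger.
    if ($null -ne $name) {
        'Name is {0}, Id is {1}' -f $name, ($parts -join ' ')
    }
}

$sample = 'x: 1', '', '   y: 2', '', '   yy:10', '', '  zz:100', '', 'z: 3', '', 'd: 4'
$out = ConvertTo-Record -Line $sample
$out
```

    Because the whole value is collected before anything is written, continuation lines end up on the same output line as their name.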



    Wednesday, April 23, 2014 2:29 AM
  • Hint 2:

     $l=Get-Content data.txt
     $l |%{ if(   $_ -match '(^.:\s\d+)' ){ $matches[1] } }
    

    Now you have two-thirds of the answer.  It should take one extra detection to complete.


    ¯\_(ツ)_/¯


    • Edited by jrv Wednesday, April 23, 2014 2:55 AM
    Wednesday, April 23, 2014 2:53 AM
  • Ok - so you want it to look pretty too:

    $l=Get-Content data.txt
    $l |%{ if(   $_ -match '(^.:\s\d+)' ){ 'Name is {0}, ID is {1}' -f $matches[1].Split(':') } }


    ¯\_(ツ)_/¯



    • Edited by jrv Wednesday, April 23, 2014 3:01 AM
    Wednesday, April 23, 2014 2:59 AM
  • Hint #3:

    The }else{ clause gets you the rest.


    ¯\_(ツ)_/¯

    Wednesday, April 23, 2014 2:59 AM
  • This returns just 1 for the ID instead of "1 y:2 yy:10 zz:100 ....". I think the regular expression needs to handle more, as in my initial question.
    Wednesday, April 23, 2014 3:34 AM
  • As I posted.  Just add it in the else clause:

    $l=Get-Content data.txt
    $x='';$l |%{if($_ -match '(^.:\s\d+)' ){if($x){$x};$x='Name is {0}, ID is {1}' -f $matches[1].Split(':') }else{"$x$_";$x=''}}


    ¯\_(ツ)_/¯

    Wednesday, April 23, 2014 4:01 AM
  • Thanks for helping. I think we are getting there. The output comes up as:

    Name is x, ID is  1   y: 2
       yy:10
      zz:100
    Name is z, ID is  3
    Name is d, ID is  4

    There should be no new lines so more like this:

    Name is x, ID is  1   y: 2 yy:10 zz:100
    Name is z, ID is  3

    Name is d, ID is  4

    Wednesday, April 23, 2014 5:34 AM
  • Thanks. Is the symbol for indicating a multi-line match (?m) or (?ms)? My code is slightly more complicated, which I didn't mention earlier, but this works for the input I provided.

    Wednesday, April 23, 2014 5:39 AM
  • You cannot keep changing the requirements.  I am sure if we adjust to this requirement there will be another. 

    You can adjust either solution yourself by simple extension of the logic.


    ¯\_(ツ)_/¯

    Wednesday, April 23, 2014 5:39 AM
  • I did mark you as the answer for what I have provided :).
    Wednesday, April 23, 2014 5:50 AM
  • (?m) enables multi-line mode, and (?s) enables single-line mode (which I think are really crappy names for what these features do, but that's what they're called.) (?ms) enables both at once.

    Multi-line mode makes the ^ and $ anchors match the beginning and end of individual lines, instead of only matching the beginning and end of the entire string.  Single-line mode makes the . character match everything (including newlines); by default, .* would only match up to the end of the current line.

    If you want to match the end of the string while using multi-line mode, use the \Z anchor instead of $ (as I did in the pattern I posted).  Note that \Z also matches just before a final trailing newline; \z matches only at the very end.
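
    A small sketch may make the difference between these modes easier to see. The sample string and variable names below are made up for illustration; the string uses explicit LF newlines.

```powershell
# Three lines, joined with explicit LF newlines.
$text = "a: 1`n  b: 2`nc: 3"

# Default mode: ^ matches only at the start of the whole string.
$defaultCount = [regex]::Matches($text, '^\S+:').Count

# (?m): ^ also matches at the start of every line.
$multiCount = [regex]::Matches($text, '(?m)^\S+:').Count

# Default mode: . stops at the first newline.
$dotDefault = [regex]::Match($text, 'a:.*').Value

# (?s): . also matches newlines, so .* runs to the end of the string.
$dotSingle = [regex]::Match($text, '(?s)a:.*').Value

# In (?m) mode, $ matches at the end of each line, so the first hit is on
# line one; \Z still refers to the end of the string.
$dollarMatch = [regex]::Match($text, '(?m)\d$').Value
$endMatch    = [regex]::Match($text, '(?m)\d\Z').Value

"^ without (?m): $defaultCount match(es); with (?m): $multiCount"
"\d`$ in (?m) mode finds '$dollarMatch'; \d\Z finds '$endMatch'"
```

    Here the unindented "name:" pattern matches once in default mode and twice with (?m), and the digit before $ is 1 while the digit before \Z is 3.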

    Wednesday, April 23, 2014 11:47 AM