Powershell Extract over Multiple Lines

  • Question

  • Hi folks,

    My code extracts between 2 placeholders BEGIN and END.

    It works nicely except when it comes to the text being on multiple lines.

    For example, it extracts

    BEGIN hello a line  END

    But NOT

    BEGIN
    hello
    a
    nice
    line
    END


    ==============================

    $sPat = [regex]'(BEGIN)*(END)'

    $files = Get-ChildItem "C:\Users\Work\Desktop\Files\"

    for ($i=0; $i -lt $files.Count; $i++) {
        $outfile = $files[$i].FullName + ".txt"
        Get-Content $files[$i].FullName | Select-String -Pattern $sPat -AllMatches | Set-Content $outfile
    }
    ==========================================

    Do I need to add something for newline characters to my script, such as a carriage return/line feed? Would that tell it that the text continues on a new line?

    thank you :)

    pw
    Sunday, June 19, 2016 9:25 PM

All replies

  • Hi pw,

    As Get-Content by default passes the content of the file on line by line, Select-String cannot match a pattern that spans multiple lines. To fix this, you can instruct Get-Content to pass on the whole document at once:

    Get-Content $files[$i].FullName -ReadCount 0 | Select-String -Pattern $sPat -AllMatches | Set-Content $outfile

    As you will notice when you try this out, you now get the entire document if it has at least one match. To counter this, you can try to grab just the matched parts, rather than the entire lines:

    Get-Content $files[$i].FullName -ReadCount 0 | Select-String -Pattern $sPat -AllMatches | select -expand Matches | select -expand Value | Set-Content $outfile
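
    As a possible alternative, here is a minimal sketch that assumes PowerShell 3.0 or later, where Get-Content has a -Raw switch. It reads the file as a single string and uses the (?s) option so that "." also matches newline characters. The temp file and its contents are made up for illustration only.

    ```powershell
    # Hypothetical sample file, created only for this illustration.
    $path = Join-Path ([System.IO.Path]::GetTempPath()) 'extract-demo.txt'
    Set-Content -Path $path -Value 'before', 'BEGIN', 'hello', 'a', 'nice', 'line', 'END', 'after'

    # -Raw returns the whole file as one string, newlines included (PowerShell 3.0+).
    $text = Get-Content -Path $path -Raw

    # (?s) lets "." match newline characters; ".+?" stops at the first END it finds.
    $blocks = [regex]::Matches($text, '(?s)BEGIN.+?END') | ForEach-Object { $_.Value }

    $blocks
    Remove-Item -Path $path
    ```

    Because the text is never split into lines, the extracted block keeps its original line breaks.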

    Cheers,
    Fred


    There's no place like 127.0.0.1


    • Edited by FWN Monday, June 20, 2016 7:17 AM Expanded wrong property
    Sunday, June 19, 2016 9:40 PM
  • Hi Fred,

    thanks for your input here.

    I have been trying for a while to find a way to work with multiple lines, it's awfully difficult :(

    I thought placeholders might be able to identify a section of text to be extracted.

    In essence it is extracting a sub string from the text file.

    Just happens to be a few newlines and lots of white space

    I managed to extract only the END from this round

    aww -  I'll keep trying. :)

    Sunday, June 19, 2016 10:14 PM
  • Hi pw,

    you'll need to change your pattern:

    $spat = "BEGIN.+END"

    Plain letters are matched to the letters in your text.
    . matches any single character except a newline (it's like a "?" in wildcard processing)
    + matches one or more repetitions of the previous letter/item

    This combines to saying "Match anything that starts with 'BEGIN', then has one or more characters of any kind and finally ends in END"

    You can also remember parts of it explicitly like this:

    $spat = "BEGIN(.+)END"
    Get-Content $files[$i].FullName -ReadCount 0 | Select-String -Pattern $sPat -AllMatches | select -expand Matches | select -expand Groups | ? { $_.Value -notlike "BEGIN*END" } | select -expand Value | Set-Content $outfile
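
    The captured part can also be read by index, which avoids filtering the Groups with -notlike. A small sketch on a made-up one-line string (the sample text and variable names are only for illustration):

    ```powershell
    # Made-up sample text for illustration.
    $text = 'noise BEGIN hello a line END noise'

    # Groups[0] is the whole match; Groups[1] is the first parenthesised group.
    $m = [regex]::Match($text, 'BEGIN(.+)END')
    $whole = $m.Groups[0].Value          # includes BEGIN and END
    $inner = $m.Groups[1].Value.Trim()   # only the text between the markers

    $inner
    ```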

    Cheers,
    Fred


    There's no place like 127.0.0.1

    • Marked as answer by Dan_CS Monday, June 20, 2016 6:07 PM
    Monday, June 20, 2016 7:25 AM
  • Hi,

    this can be done more easily, look at my answer here:

    https://social.technet.microsoft.com/Forums/en-US/b4b3d6bf-6d06-422e-9acd-ebcf65c62700/split-a-file-between-two-line?forum=winserverpowershell

    Alternatively, you can join the array into one string with the -join operator. Use a special character that is not in the text (or a group of characters like "xyz") as the argument. This way, you can split the string into an array again afterwards. Then your old pattern will work:

    $array = "something","BEGIN","hello","a","nice","line","END","something else"
    
    $string = $array -join "<xyz>"
    
    $match = $string | Select-String -Pattern "BEGIN.*END" -AllMatches | foreach{$_.Matches.Value}
    $match = $match -split "<xyz>"

    Best wishes

    Christoph

    Monday, June 20, 2016 7:51 AM
  • Hi Folks,

    thanks for the help. I have been fiddling about and tweaking, I think I am nearly there.

    With Fred's

    "BEGIN(.+)END"

    I get the text, but also everything after the END.

    Let me test some more

    I appreciate all the tips

    thank you to hp as well

    pw

    Monday, June 20, 2016 12:39 PM
  • Hi pw,

    I just figured out how to make it behave. I won't pretend to fully understand why it works, but ... it does. Anybody able to explain to me why, feel very free to explain it...

    That said, here's the pattern:

    "BEGIN(.+?)END"

    Yes, just adding a question mark did the trick. The documentation in about_Regular_Expressions did not help me understand why it works, but a post about .NET regex on that topic led me to give this a try.
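
    For what it's worth, the usual explanation in .NET regex is that "+" is greedy (it takes as much as it can and then backs up to the last END), while "+?" is lazy (it takes as little as it can and stops at the first END). A small sketch on a made-up string with two blocks:

    ```powershell
    # Made-up sample with two BEGIN/END blocks.
    $text = 'BEGIN one END filler BEGIN two END tail'

    # Greedy: .+ runs to the end of the text, then backs up to the LAST END.
    $greedy = [regex]::Matches($text, 'BEGIN(.+)END') | ForEach-Object { $_.Groups[1].Value }

    # Lazy: .+? stops at the FIRST END, so each block is matched separately.
    $lazy = [regex]::Matches($text, 'BEGIN(.+?)END') | ForEach-Object { $_.Groups[1].Value }

    @($greedy).Count   # 1
    @($lazy).Count     # 2
    ```

    That would also explain why the greedy pattern returned text after the first END: the match only stopped at the last END in the input.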

    Cheers,
    Fred


    There's no place like 127.0.0.1


    • Edited by FWN Monday, June 20, 2016 4:39 PM
    Monday, June 20, 2016 4:39 PM
  • Hi Fred,

    thanks ever so much for all the help.

    Regex is one of those areas I really struggle with.

    "BEGIN(.+?)END"

    Really did the trick :)

    I've tested the life out of my files and this is a great starting point for extracting multi-line blocks. It does not preserve the newlines, but I can just add placeholders and then replace them at the other end.

    Have a great day my friend

    :)

    pw

    Monday, June 20, 2016 6:06 PM
  • Hi pw,

    glad to have been of assistance :)

    Btw, it should preserve the line breaks; however, the default Notepad often struggles with them. Try checking the output file with WordPad or Notepad++.
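
    Another possible cause worth checking, offered as an assumption rather than a confirmed diagnosis: with -ReadCount 0 the lines travel down the pipeline as one array, and when PowerShell converts an array to a string it joins the elements with a space by default, which would replace the line breaks. A sketch with made-up lines:

    ```powershell
    # Made-up lines standing in for what Get-Content returns.
    $lines = 'BEGIN', 'hello', 'a', 'nice', 'line', 'END'

    # Converting an array to a string joins the elements with a space by default.
    $asString = [string]$lines

    # Joining explicitly with a newline keeps one element per line.
    $joined = $lines -join "`n"

    $asString.Contains("`n")   # False
    $joined.Contains("`n")     # True
    ```

    If that is what happens here, joining the lines explicitly or reading the file as a single string before matching would keep the line breaks in the extracted text.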

    Cheers,
    Fred


    There's no place like 127.0.0.1

    Monday, June 20, 2016 6:27 PM