RegEx multiline matching [Solved]

  • Question

  • Hi all,

    Main question: I have a text file whose content closely resembles HTML markup (because it is HTML). Because of the way the text is arranged, I can extract some information from it. The following is the literal content of the file, exactly as it appears:

    <div><b>EmployeeName:</b> Luckas Duckins</div>
    <div><b>CCName:</b> Mike McMice</div>
    <div><b>CCEmail:</b> MikeMcMice@funinc.com</div>
    <div><b>ExpirationDate:</b> 7/17/2015</div>
    

    I have a script that was working last Friday, but when I went back today to keep working on it, I got no match, so I wonder what I was doing last Friday that I did not do today. Script as follows:

    $MyPath = "c:\Path\to\textfile.txt"
    $regex99 = @'
    (?ms)<div><b>EmployeeName:<\/b> (.+?)</div>
    <div><b>CCName:<\/b> (.+?)<\/div>
    <div><b>CCEmail:<\/b> (.+?)<\/div>
    <div><b>ExpirationDate:<\/b> (.+?)<\/div>
    '@
    
    [IO.File]::ReadAllText($MyPath) -match $regex99
    if ([IO.File]::ReadAllText($Mypath) -match $regex99)
      {
       $EmployeeName = $matches[1]
       $CCName = $matches[2] 
       $CCEmail = $matches[3] 
       $ExpirtationDate = $matches[4] 
      }
    "EmpName"
    $EmployeeName 
    "CC Name"
    $CCName 
    "CC Email"
    $CCEmail 
    "EXP Date"
    $ExpirtationDate 
    
    #output was
    #True
    #EmpName 
    #Luckas Duckins
    #CC Name
    #Mike McMice
    #CC Email
    #MikeMcMice@funinc.com
    #EXP Date
    #7/17/2015

    Right now I just get a big False. I suspect the issue may be with the file itself. I resaved the file (after adding a new line at the end of it), and the script worked. Then I removed the new line, and the script still worked. If I try any of the following regexes, each one works on its own, but I am trying to get everything in one go.

    $regex99 = @'
    (?ms)<div><b>EmployeeName:<\/b>\s(.+?)<\/div>
    '@
    
    $regex99 = @'
    (?ms)<div><b>CCName:<\/b>\s(.+?)<\/div>
    '@
    
    $regex99 = @'
    (?ms)<div><b>CCEmail:<\/b>\s(.+?)<\/div>
    '@

    I have used https://mjolinor.wordpress.com/2012/01/05/powershell-multiline-regex-matching/ as a reference, as well as a post I found on Stackoverflow (cannot find it anymore).

    Any help is appreciated.

    UPDATE:

    Found the post on Stackoverflow that I used as reference. http://stackoverflow.com/questions/15375921/powershell-parse-parts-of-a-text-file-and-save-to-csv

    UPDATE 2:

    I kept working on the script and I modified the text file, so basically after resaving the file the script worked.

    Background about the text file. I get the text content from another script, I save the text on the text file, then I read the file to process it.

    Is it possible to save the text to a variable, and keep the text as a here-string so I can process it?


    Monday, July 20, 2015 3:51 PM

Answers

  • You are using a regex new line \n. Windows uses \r\n as a line terminator.  That causes the match to fail. Unix and most web servers use \n or no line breaks.


    \_(ツ)_/

    Monday, July 20, 2015 7:18 PM
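
One way to sidestep the mismatch described above is to write the pattern so it accepts either terminator. The following is a minimal sketch, not the poster's script: it assumes the input may arrive with LF or CRLF endings, uses a shortened two-line version of the sample data, and the variable names are illustrative only.

```powershell
# Sample text with Unix (LF-only) line endings, as a web feed might deliver it.
$lf   = "<div><b>EmployeeName:</b> Luckas Duckins</div>`n<div><b>CCName:</b> Mike McMice</div>"
# The same text with Windows (CRLF) line endings.
$crlf = $lf -replace "`n", "`r`n"

# \r?\n accepts either terminator, so the pattern does not depend on
# how the script file or the input happened to be saved.
$pattern = '<div><b>EmployeeName:</b> (.+?)</div>\r?\n<div><b>CCName:</b> (.+?)</div>'

# Run the same pattern against both variants and collect the captures.
$results = foreach ($text in $lf, $crlf) {
    if ($text -match $pattern) { '{0} / {1}' -f $Matches[1], $Matches[2] }
}
$results
```

Both variants should produce the same captured values, which is the point of using `\r?\n` instead of a literal line break inside the pattern.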
  • Just an update. I am not using the text file anymore. However, this still applies. I get the information from the HTML page, and then I run a replace on it to match the newLines.
    $feedURL = "http://website.com/feed/getfeed/" #sample url for AtomFeed
    #property object
    $property = New-Object System.Collections.Specialized.OrderedDictionary
    $property.Add('UseDefaultCredentials', $true)
    
    #I get the AtomFeed specific property that I  need
    #in the production script I would loop through all .rss.channel.item[n] instances
    #to capture the results
    $result = ((New-Object Net.Webclient -Property $property ).DownloadString($feedURL) -as [xml]).rss.channel.item[0].description.InnerText 
    
    #I replace the new line with the newline that matches my OS (Windows)
    $result = $result.Replace("`n","`r`n")
    
    #Then I run the former script
    $regex = @'
    (?ms)<div><b>EmployeeName:<\/b> (.+?)</div>
    <div><b>CCName:<\/b> (.+?)<\/div>
    <div><b>CCEmail:<\/b> (.+?)<\/div>
    <div><b>ExpirationDate:<\/b> (.+?)<\/div>
    '@
    
    #reset variables
    $EmployeeName, $CCName, $CCEmail, $ExpirtationDate = $null
    #check if there are matches
    $result -match $regex
    #get the values I want
    if ($result -match $regex)
      {
       $EmployeeName = $matches[1]
       $CCName = $matches[2] 
       $CCEmail = $matches[3] 
       $ExpirtationDate = $matches[4] 
      }
    
    $EmployeeName 
    $CCName 
    $CCEmail 
    $ExpirtationDate 
    That works for me.

    Monday, July 20, 2015 8:19 PM

All replies

  • Patterns for matching cannot be stored in "here" strings:

    $regex99='(?ms)<div><b>EmployeeName:<\/b>\s(.+?)<\/div>'


    \_(ツ)_/

    Monday, July 20, 2015 4:17 PM
  • According to the script at https://mjolinor.wordpress.com/2012/01/05/powershell-multiline-regex-matching/ it seems they can. I tested the script from that post myself, and edited it a little to also get the version number from the here-string sample, and it worked. I just wonder where my regex is wrong. I also found the post I used as reference, http://stackoverflow.com/questions/15375921/powershell-parse-parts-of-a-text-file-and-save-to-csv; in http://stackoverflow.com/posts/15382469/revisions the regex is in a here-string.

    Monday, July 20, 2015 4:53 PM
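
For reference, a small sketch of a pattern held in a single-quoted here-string and used with `-match`. It uses one line of the thread's sample data; the generic `(\w+)` label capture is an illustrative assumption, not taken from either linked post.

```powershell
# A single-quoted here-string keeps backslashes and other regex
# metacharacters exactly as typed, which makes it convenient for patterns.
$pattern = @'
^<div><b>(\w+):</b>\s(.+?)</div>$
'@

$line = '<div><b>CCName:</b> Mike McMice</div>'

# -match returns a Boolean and fills the automatic $Matches table.
$isMatch = $line -match $pattern
$isMatch
$Matches[1]
$Matches[2]
```

A here-string that spans several lines also contains the line terminators typed between those lines, so a multi-line pattern only matches input that uses the same terminators.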
  • I would recommend a line-by-line approach and separate regular expressions based on input line, rather than trying to create a single regex that matches everything.

    As the old saying goes, now you have two problems.

    The simpler the regex, the better.


    -- Bill Stewart [Bill_Stewart]

    Monday, July 20, 2015 5:36 PM
    Moderator
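
The line-by-line approach suggested above could look roughly like the following. This is a sketch under assumptions: the sample text is built inline rather than read from a file, and the `$fields` table and the named groups `name` and `value` are hypothetical names chosen for illustration.

```powershell
# Sample input; in the thread this text comes from a file or a feed.
$text = "<div><b>EmployeeName:</b> Luckas Duckins</div>`r`n" +
        "<div><b>CCName:</b> Mike McMice</div>`r`n" +
        "<div><b>CCEmail:</b> MikeMcMice@funinc.com</div>`r`n" +
        "<div><b>ExpirationDate:</b> 7/17/2015</div>"

# Split on either CRLF or LF, then apply one small pattern per line.
$fields = [ordered]@{}
foreach ($line in $text -split '\r?\n') {
    if ($line -match '^<div><b>(?<name>\w+):</b>\s*(?<value>.*?)</div>$') {
        # Store each label/value pair under its label.
        $fields[$Matches['name']] = $Matches['value']
    }
}
$fields
```

Because every line is matched independently, the line terminator never appears inside the pattern, and a missing or reordered line only affects its own field.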
  • The problem is that if the line terminators are complex the match will fail.  HTML pages may have only a linefeed or both cr and lf or may have nothing at all.

    Mostly I don't understand your question.  You say it works but that you get a false.  If you get a false then it doesn't work.

    We have no idea what is in your file.


    \_(ツ)_/

    Monday, July 20, 2015 5:44 PM
  • To continue on Bill's line; use multiple patterns and passes.  One for each extraction.  That would be most reliable.

    PS >$html=@'
    <div><b>EmployeeName:</b> Luckas Duckins</div>
    <div><b>CCName:</b> Mike McMice</div>
    <div><b>CCEmail:</b> MikeMcMice@funinc.com</div>
    <div><b>ExpirationDate:</b> 7/17/2015</div>
    '@
    
    PS > if($html -match 'EmployeeName:</b>(?<x>.*)</div>') { $matches['x'] }
    Luckas Duckins
    PS > if($html -match 'CCName:</b>(?<x>.*)</div>') { $matches['x'] }
    Mike McMice
    PS >


    \_(ツ)_/


    • Edited by jrv Monday, July 20, 2015 5:51 PM
    Monday, July 20, 2015 5:49 PM
  • The problem is that if the line terminators are complex the match will fail.  HTML pages may have only a linefeed or both cr and lf or may have nothing at all.

    Mostly I don't understand your question.  You say it works but that you get a false.  If you get a false then it doesn't work.

    We have no idea what is in your file.


    \_(ツ)_/

    I think you are on to something. So, the issue has taken a new turn. Will keep working on this issue. Will update when I find something new. jrv's suggestion is worth considering.

    Thanks all.

    Monday, July 20, 2015 7:00 PM
  • This is crazy. I changed the regex as follows:

    $regex = @'
    (?ms)<div><b>EmployeeName:<\/b> (.+?)</div>\n<div><b>CCName:<\/b> (.+?)<\/div>\n<div><b>CCEmail:<\/b> (.+?)<\/div>\n<div><b>ExpirationDate:<\/b> (.+?)<\/div>
    '@

    I had actual new lines in the regex; I have now replaced them with escaped newlines (\n), so the regex is on one line, and the script works without issue. I just want to know why.

    Looks like we are all good now. Any ideas on how to convert whichever newline the input uses to one specific newline?

    Thanks. 

    Monday, July 20, 2015 7:09 PM
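
A likely explanation, stated as an assumption since the input file is not shown: a line break typed inside the here-string takes whatever terminator the script was saved with, while `\n` in the pattern matches a bare LF, which is what the input appears to contain. To convert whichever newline the input uses to one specific newline, a sketch along these lines may help; the sample string and variable names are illustrative.

```powershell
# Input with a mix of terminators: LF, CRLF, and a bare CR.
$mixed = "one`ntwo`r`nthree`rfour"

# \r\n is listed first so a CRLF pair is replaced once, not twice.
$normalized = $mixed -replace '\r\n|\r|\n', "`r`n"

# Show the result with the control characters made visible.
$visible = $normalized.Replace("`r", '\r').Replace("`n", '\n')
$visible
```

Unlike a plain `.Replace("`n", "`r`n")`, this does not turn an existing CRLF into CR CR LF when the input already uses Windows terminators.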
  • RSS/Atom feeds can be converted directly to XML, as they are XML.  If you use a webclient then formatting is applied and the XML is lost.

    Just grab the XML and it will be easy to query with XPath.


    \_(ツ)_/

    Monday, July 20, 2015 8:34 PM
  • This feed has both iTunes and atom namespaces.

    $feed=Invoke-WebRequest 'http://www.sciencefriday.com/audio/scifriaudio.xml'
    $xml=[xml]$feed.Content
    $xml.rss.channel


    \_(ツ)_/

    Monday, July 20, 2015 8:40 PM
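
Once the feed is an XML document, the items can be queried with XPath instead of indexing each one by hand. The sketch below uses a made-up inline fragment in place of a real download; the element layout (`rss/channel/item/description` with CDATA content) and the second item's value are assumptions for illustration.

```powershell
# Hypothetical feed fragment standing in for the real download.
$sample = @'
<rss><channel>
  <item><title>First</title><description><![CDATA[<div><b>CCName:</b> Mike McMice</div>]]></description></item>
  <item><title>Second</title><description><![CDATA[<div><b>CCName:</b> Someone Else</div>]]></description></item>
</channel></rss>
'@

$xml = [xml]$sample

# XPath selects every item's description without indexing item[n] by hand.
$descriptions = foreach ($node in $xml.SelectNodes('/rss/channel/item/description')) {
    $node.InnerText
}
$descriptions
```

Each entry in `$descriptions` is the raw markup of one item, which can then be handed to whichever extraction pattern is in use.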
  • jrv,

    Thank you for the feedback. I did try a script similar to yours, but for some reason I could not get it to behave as I expected, so I ended up sticking with the process I posted above.

    In another turn on this matter, due to some limitations regarding data availability and possible change in the pattern of the data (as it is generated by the server), I shifted research to access the information through a ConnectionString using http://powershell.com/cs/blogs/tobias/archive/2011/03/01/accessing-data-bases.aspx as reference. Then I run an SQL command to get the information we wanted as an object.

    I have made progress with regard to accessing the data; we are now just ironing out the UI to add the information to the database, which will happen through SharePoint 2010.

    Thanks for your help and time in this matter.



    • Edited by Mr. Potter III Thursday, July 23, 2015 1:49 PM Clarification
    Thursday, July 23, 2015 1:47 PM