Grab Values within HTML Tag

  • Question

  • <div class="primary-title-and-description">
    	<h1>
    		Title Here
    	</h1>
    
    	<p>
    		Paragraph Here
    	</p>
    </div>
    
    <div class="secondary-title-and-description">
    	<h1>
    		Secondary Title Here
    	</h1>
    
    	<p>
    		Secondary Paragraph Here
    	</p>
    </div>

    Hi

    I am trying to get the values within the <h1></h1> only from the <div class="primary-title-and-description">. How do I go about doing that?

    I tried my luck with the code below but no joy. I copied the above HTML, saved it as test.html, and ran this:

    $test = gc test.html | % { [regex]::matches( $_ , '(?<=<div class="primary-title-and-description">\s+)(.*?)(?=\s+</h1>)' ) } | select -expa value
    $test

    I did not get any output from PowerShell. What have I done wrong? Should I use an XML method to get the values?


    Thanks

    Monday, October 6, 2014 6:31 AM

Answers

  • I wouldn't expect the command to work without the -raw switch.  Without that, get-content returns an array of single-line strings.  That is not the way the text will be returned from the web page, and the regex is not designed to work with the data in that form.

    Here's an alternate regex that works with your test data and may be more forgiving of extraneous white space:

    $text = 
    @'
    <div class="primary-title-and-description">
    	<h1>
    		Title Here
    	</h1>
    
    	<p>
    		Paragraph Here
    	</p>
    </div>
    
    <div class="secondary-title-and-description">
    	<h1>
    		Secondary Title Here
    	</h1>
    
    	<p>
    		Secondary Paragraph Here
    	</p>
    </div>
    '@
    
    $regex ='(?ms).*?<div class="primary-title-and-description">.*?<h1>(.+?)</h1>.+'
    
    $text -match $regex > $null
    $matches[1].trim()
    Title Here
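
    A variation on the same idea, offered only as a sketch: instead of matching the whole document, the pattern below anchors on the primary div and lets \s* absorb the whitespace around the tags, so no trim is needed afterwards. The named group "title" and the variable names are illustrative assumptions and not part of the original answer.

    ```powershell
    # Sample text built without a here-string; it mirrors the HTML from the question.
    $text = ( '<div class="primary-title-and-description">', "`t<h1>", "`t`tTitle Here", "`t</h1>", '</div>',
              '<div class="secondary-title-and-description">', "`t<h1>", "`t`tSecondary Title Here", "`t</h1>", '</div>' ) -join "`n"

    # (?s) lets . match newlines; \s* swallows the tabs and line breaks around the tags.
    $pattern = '(?s)<div class="primary-title-and-description">\s*<h1>\s*(?<title>.*?)\s*</h1>'

    $m = [regex]::Match($text, $pattern)
    $title = if ($m.Success) { $m.Groups['title'].Value } else { $null }
    $title
    ```

    Checking $m.Success before reading the group avoids the null-array error when the downloaded page does not contain the expected div.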
    


    [string](0..33|%{[char][int](46+("686552495351636652556262185355647068516270555358646562655775 0645570").substring(($_*2),2))})-replace " "


    • Edited by mjolinor Thursday, October 9, 2014 7:53 PM
    • Marked as answer by cyw77 Thursday, October 9, 2014 10:59 PM
    Thursday, October 9, 2014 7:52 PM
  • I had a go at this and finally got it to work without the -raw. Here is what I did. Thanks for your help too.

    Here is my code:

    $test = gc test.html    # no -Raw, so $test is an array of lines
    $test = $test | ForEach-Object {$_ -replace ("`r","") }
    $test = $test | ForEach-Object {$_ -replace ("`t","") }
    $test = $test | ForEach-Object {$_ -replace ("`n","") }
    $OFS = ""    # separator used when an array is cast to [string]
    $hello = [string]$test -match '<h1>(.*?)</h1>'
    $Matches[1]

    I have to use $OFS to make them into a string. But your solution is great too. Thanks for your help. I also just realized that one of my replies was in bold. It wasn't meant to be rude; I'm not sure why the iPhone changed the font like that. Anyway, thanks for your help.

    The entire problem was that $test is an array of strings: I had to go through each value in the array to remove the tabs and carriage returns, and then use $OFS to join them into a single string before doing the match.

    My commands above are not optimized, but I am sure you have a smarter way to reduce them to a single line.
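
    One possible way to shorten the steps above, as a rough sketch: trim each line and join the results with -join, which does not depend on $OFS. The $lines array here is a stand-in for the output of gc test.html, and the variable names are assumptions made for illustration.

    ```powershell
    # Stand-in for: $lines = Get-Content test.html
    $lines = '<div class="primary-title-and-description">', "`t<h1>", "`t`tTitle Here", "`t</h1>", '</div>'

    # Trim leading/trailing whitespace from every line, then glue the lines together.
    $joined = ($lines | ForEach-Object { $_.Trim() }) -join ''

    # -match on a single string fills $Matches; the first <h1> is the primary one.
    $title = if ($joined -match '<h1>(.*?)</h1>') { $Matches[1] }
    $title
    ```

    Like the original commands, this picks up the first <h1> in the file, so it relies on the primary div coming first.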

     


    Thanks

    • Marked as answer by cyw77 Thursday, October 9, 2014 11:41 PM
    Thursday, October 9, 2014 11:04 PM

All replies

  • I would use this approach.

    gc splits the file into lines and puts them into an array. The -raw option can be used to avoid that.

    ((gc -raw test.html)  -replace "`t|`n|`r","") -match 'primary-title-and-description"><h1>(.*?)</h1>'
    $matches[1]
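
    To illustrate why -raw matters here, the following sketch compares the two forms. When the left-hand side of -match is an array, the operator filters the elements and does not populate $Matches; against a single string it returns a Boolean and fills $Matches. The temp-file name and variable names are made up for the example.

    ```powershell
    # Write a small sample file (hypothetical content mirroring the question's HTML).
    $path  = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-raw-demo.html'
    $lines = '<div class="primary-title-and-description">', "`t<h1>", "`t`tTitle Here", "`t</h1>", '</div>'
    Set-Content -Path $path -Value $lines

    $asLines = Get-Content -Path $path        # array: one string per line
    $asText  = Get-Content -Path $path -Raw   # one string that still contains the line breaks

    $pattern = 'primary-title-and-description"><h1>(.*?)</h1>'

    # Array input: -replace runs per element, and -match returns the matching elements (none here).
    $filtered = @(($asLines -replace "`t|`n|`r", '') -match $pattern)
    $filtered.Count

    # Single-string input: -match returns True/False and sets $Matches on success.
    $ok    = ($asText -replace "`t|`n|`r", '') -match $pattern
    $title = if ($ok) { $Matches[1] }
    $title

    Remove-Item -Path $path
    ```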

    • Marked as answer by cyw77 Monday, October 6, 2014 10:25 AM
    • Unmarked as answer by cyw77 Thursday, October 9, 2014 8:17 AM
    Monday, October 6, 2014 7:36 AM
  • Hi

    Thanks for the code. I have this issue here.

    I get the page with DownloadString($url), so there is no way to use -raw on that. And when I use

    $url -replace ("`t|`n|`r","") -match 'primary-title-and-description"><h1>(.*?)</h1>'
    $matches[1]

    I don't get any results.


    Thanks

    Wednesday, October 8, 2014 10:45 PM
  • This post seems like a tough nut to crack. No one seems to be able to resolve this. I tried searching online, but no one seems to have the same problem as me.

    Maybe I will try another forum for assistance.

    Thanks hysh_00 but the original solution didn't quite get what I wanted.


    Thanks

    Thursday, October 9, 2014 12:10 PM
  • $url will be the URL of the web page you're downloading from, not the downloaded content of the page.

    It looks like the problem may not be with the regex, but with some other error in the code, but it's impossible to say without seeing the code.
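
    A minimal sketch of the distinction, assuming the page is fetched with System.Net.WebClient: the match has to run against the string that DownloadString returns, not against $url. Get-PrimaryTitle is a hypothetical helper name, and the download lines are left commented out so the example runs without a web server.

    ```powershell
    function Get-PrimaryTitle {
        param([string]$Html)
        # Flatten the text, then capture the <h1> that directly follows the primary div.
        $flat = $Html -replace "`t|`n|`r", ''
        if ($flat -match 'primary-title-and-description"><h1>(.*?)</h1>') { $Matches[1] }
    }

    # $url holds only the address; DownloadString returns the page text.
    # $url  = 'http://localhost/test.html'
    # $html = (New-Object System.Net.WebClient).DownloadString($url)

    # Offline stand-in for the downloaded text:
    $html  = "<div class=`"primary-title-and-description`">`r`n`t<h1>`r`n`t`tTitle Here`r`n`t</h1>`r`n</div>"
    $title = Get-PrimaryTitle -Html $html
    $title
    ```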




    • Edited by mjolinor Thursday, October 9, 2014 12:42 PM
    Thursday, October 9, 2014 12:41 PM
  • Basically the code is correct; it's just that hysh_00 uses -raw.

    If you run gc test.html without -raw, it splits the file into an array with one element per line.

    How do you then get only the values between <h1>*</h1>?


    Thanks


    • Edited by cyw77 Thursday, October 9, 2014 10:58 PM
    Thursday, October 9, 2014 1:00 PM
  • Get-Content -Raw reads the data from a file as a single, multi-line string, which emulates the way you receive data from a web page using DownloadString($url).

    The confusion originated from testing with data from a different source, obtained by a different method than what was actually being used in the production code.



    Thursday, October 9, 2014 1:07 PM
  • I didn't do anything different. Instead of gc test.html, I called webClient.DownloadString("http://localhost/test.html"), applied the same -replace as above, and read $matches[1]. I got the error "Cannot index into a null array".

    Thanks

    Thursday, October 9, 2014 1:29 PM
  • Then the regex didn't match, which would seem to indicate that the test data doesn't match the production data.

    If you can post a sample of the actual data you're downloading, the regex can probably be adjusted to work.



    Thursday, October 9, 2014 1:41 PM
  • The above HTML is all there is. Copy the code into an HTML file, launch IIS, and host the file. Run the command that I copied from hysh. Alternatively, run the command hysh gave without -raw and it will not work; you will get the null array error. Try it and you will understand what I mean.

    Thanks

    Thursday, October 9, 2014 7:31 PM