How to grab HTML within HTML

  • Question

  • I have this simple script:

    $url = "http://www.nhs.uk/Services/hospitals/Services/Service/DefaultView.aspx?id=89268"
    $page = Invoke-WebRequest -Uri $url
    $ptemp = $page.parsedhtml.getelementbyid("ctl00_ctl00_ctl00_PlaceHolderMain_contentColumn2").InnerHTML

    $html = New-Object -ComObject "HTMLFile";
    $html.IHTMLDocument2_write($ptemp);
    $html.body.getElementsByTagName('ul')

    The last line shows what I want in the .innerHTML property of each element.

    However,

    $html.body.getElementsByTagName('ul').innerHTML

    doesn't return anything.

    I'm starting to feel really silly about this.

    Is there an elegant way to grab HTML within HTML, preferably without the string-to-HTML conversion I did above?


    • Edited by PHEC Thursday, June 4, 2015 9:40 AM
    • Changed type PHEC Thursday, June 4, 2015 6:46 PM it is a question!
    Thursday, June 4, 2015 9:39 AM

Answers

  • We can also do this:

    $item=$page.parsedhtml.getelementbyid("ctl00_ctl00_ctl00_PlaceHolderMain_contentColumn2")
    $item.getElementsByTagName('li') |select innertext


    \_(ツ)_/

    • Marked as answer by PHEC Thursday, June 4, 2015 6:46 PM
    Thursday, June 4, 2015 10:20 AM

All replies

  • $url = "http://www.nhs.uk/Services/hospitals/Services/Service/DefaultView.aspx?id=89268"
    $page = Invoke-WebRequest -Uri $url
    $page.parsedhtml.getElementsByTagName('ul')|select uniqueid


    \_(ツ)_/

    Thursday, June 4, 2015 10:10 AM
  • Here is everything available on that page:

    $page.parsedhtml.getElementsByTagName('ul')|select id,innerhtml


    \_(ツ)_/

    Thursday, June 4, 2015 10:13 AM
  • If the page were not invalid, broken HTML, we could parse it as XML. Many older sites have badly formed HTML and are very hard to parse.

    This is a common construct and is not allowed in XHTML, which must be well-formed XML:

    <DIV class=panel-content>
    <DIV class="pad clear">
    <DIV class=module>

    Note how the tags are uppercase and some attribute values are missing their quotes.
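
To illustrate the point about XML parsing, here is a minimal sketch. The markup strings are made-up samples, not taken from the NHS page. It shows PowerShell's [xml] cast accepting a well-formed fragment and rejecting one with an unquoted attribute value.

```powershell
# Well-formed sample fragment (lowercase tags, quoted attributes): the [xml] cast succeeds.
$good = [xml]'<div class="panel-content"><ul><li>Monday</li><li>Tuesday</li></ul></div>'

# <li> elements that contain only text come back as plain strings.
$items = @($good.div.ul.li)

# Sample markup with an unquoted attribute value is not well-formed XML,
# so the cast throws and an HTML parser is needed instead.
$badMarkup = '<DIV class=panel-content><UL><LI>Monday</LI></UL></DIV>'
$parsedAsXml = $true
try {
    $null = [xml]$badMarkup
} catch {
    $parsedAsXml = $false
}

$items
"Parsed broken markup as XML: $parsedAsXml"
```

When the cast fails like this, falling back to $page.ParsedHtml, as done elsewhere in this thread, is the usual route.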


    \_(ツ)_/

    Thursday, June 4, 2015 10:27 AM
  • Thanks jrv

    that pointed me in the right direction.

    I'm going with .innerHTML because I'm interested in the number of days on the next page (for all 15k pages that have this block on them).
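
One reason to prefer .innerHTML over .innerText is that it keeps the links needed to reach a follow-up page. The sketch below assumes that is the goal. The $innerHtml sample and its URLs are hypothetical, shaped like the markup discussed in this thread. It pulls href values out of the string with a regular expression.

```powershell
# Hypothetical innerHTML string; the real page content will differ.
$innerHtml = '<LI><A href="/Services/a.aspx?id=1">Clinic A</A></LI><LI><A href=/Services/b.aspx?id=2>Clinic B</A></LI>'

# Match href values whether or not they are quoted ('' is an escaped single quote).
$pattern = 'href\s*=\s*["'']?([^"''\s>]+)'

# Collect the captured URL from each match.
$links = [regex]::Matches($innerHtml, $pattern, 'IgnoreCase') |
    ForEach-Object { $_.Groups[1].Value }

$links
```

A regular expression is fragile against unusual markup, so for anything beyond simple link extraction, walking the parsed elements is likely safer.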

    The site is great for members of the public who want to know more about their local health services, but in the age of open data they should also make this available to query remotely, for people like me who want to combine it with other sources.

    I don't see a vote as answer button, but your response has my vote!

    Cheers

    Thursday, June 4, 2015 3:51 PM
  • To get an answer button, change the thread type from discussion to question.


    \_(ツ)_/

    Thursday, June 4, 2015 3:54 PM