Getting data from a very specific <td> tag in an HTML document that won't parse in XML

  • Question

  • So I have a batch of very (very, very, very) badly formatted and built HTML documents (no, I didn't build them). I need to find certain, very specific information on each one. 

    Header 1    Header 2    Header 3
    Content A   Content B   Content C.a
                            Content C.b
                            Content C.c
    Content D   Content E   Content F

    Using the table above as an example, I would need to get all the text in the second row, third column.

    Ideally, there'd be a <span> or <div> tag with a unique ID that would let me do a simple Select-String to find the cell and grab the lines below it. However, as I said, I did not build it, and the person who did didn't know enough to properly set up their HTML pages. Net result: I have to go through a couple hundred individual documents and visually hunt down that one cell in the table, then manually enter the information I need (a count of the lines in the cell).

    Bonus: the pages are locally hosted (so no using Invoke-WebRequest) and won't load into the XML parser in PowerShell.

    Possible solution: build into the script a way to ignore the non-XML-compliant stuff, like the <meta> tags, push the rest through the XML parser, then programmatically identify the cell in the table using Header 3 and Content A as the column and row identifiers (how to do this, I have no idea), and finally clear it from memory.
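
    A minimal sketch of that idea, assuming the table itself is well-formed once isolated (lowercase tags, matching close tags, no <tbody>, no HTML-only entities such as &nbsp;, and <br> as the line separator inside the cell). The sample markup and all variable names are hypothetical, since the real pages are not shown here. Rather than stripping every non-XML tag, it cuts out just the <table> fragment, so the <meta> tags never reach the parser:

```powershell
# Hypothetical sample page; the real documents' markup is an assumption.
$html = @(
    '<html><head><meta charset="utf-8"><title>Sample</title></head><body>'
    '<table>'
    '<tr><th>Header 1</th><th>Header 2</th><th>Header 3</th></tr>'
    '<tr><td>Content A</td><td>Content B</td><td>Content C.a<br>Content C.b<br>Content C.c</td></tr>'
    '<tr><td>Content D</td><td>Content E</td><td>Content F</td></tr>'
    '</table>'
    '</body></html>'
) -join "`n"

# Keep only the first <table>...</table> fragment.
$match = [regex]::Match($html, '(?is)<table\b.*?</table>')
if (-not $match.Success) { throw 'No <table> found.' }

# Self-close <br> so the fragment is well-formed XML, then parse it.
$fragment = $match.Value -replace '(?i)<br\s*/?>', '<br/>'
$doc = [xml]$fragment

# Column: position of the header cell whose text is 'Header 3'.
$headers = @($doc.SelectNodes('/table/tr[1]/th') | ForEach-Object { $_.InnerText.Trim() })
$col = [array]::IndexOf($headers, 'Header 3')
if ($col -lt 0) { throw 'Column header not found.' }

# Row: the first <tr> whose first <td> reads 'Content A'.
$row = $doc.SelectSingleNode("/table/tr[normalize-space(td[1])='Content A']")
if ($null -eq $row) { throw 'Row not found.' }

# XPath positions are 1-based, hence $col + 1.
$cell = $row.SelectSingleNode("td[$($col + 1)]")

# Each text node between <br/> elements is one line of the cell.
$lines = @($cell.SelectNodes('text()') |
    ForEach-Object { $_.InnerText.Trim() } |
    Where-Object { $_ })

$lines.Count
```

    For files on disk, $html could instead come from Get-Content -Raw, and $lines.Count is the per-cell line count. XPath is case-sensitive, so pages that use <TD> or wrap rows in <tbody> would need the expressions adjusted.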

    Tuesday, June 20, 2017 10:40 PM

Answers

  • I recommend using RegEx to parse the documents for the tags and contents you need.

    You can also load the document into Internet Explorer and use HTML methods to extract tags by name or ID, among other methods.
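
    The RegEx suggestion could look roughly like the sketch below. It assumes every <tr> and <td> has a closing tag, that there are no nested tables, and that <br> separates the lines inside the cell; the sample markup and variable names are hypothetical. Badly broken pages may need looser patterns.

```powershell
# Hypothetical sample page; the real documents' markup is an assumption.
$html = @(
    '<html><head><meta charset="utf-8"></head><body>'
    '<table>'
    '<tr><th>Header 1</th><th>Header 2</th><th>Header 3</th></tr>'
    '<tr><td>Content A</td><td>Content B</td><td>Content C.a<br>Content C.b<br>Content C.c</td></tr>'
    '<tr><td>Content D</td><td>Content E</td><td>Content F</td></tr>'
    '</table>'
    '</body></html>'
) -join "`n"

# (?is): ignore case, and let '.' match newlines so rows can span lines.
$rows = [regex]::Matches($html, '(?is)<tr\b[^>]*>(.*?)</tr>')
if ($rows.Count -lt 2) { throw 'Expected a header row and at least one data row.' }

# Index 0 is the header row, so index 1 is the first data row.
$cells = [regex]::Matches($rows[1].Groups[1].Value, '(?is)<td\b[^>]*>(.*?)</td>')
if ($cells.Count -lt 3) { throw 'Expected at least three cells in the row.' }

# Third column of that row (0-based index 2), as raw inner HTML.
$cell = $cells[2].Groups[1].Value

# Split on <br>, strip any leftover tags, and drop empty entries.
$lines = @($cell -split '(?i)<br\s*/?>' |
    ForEach-Object { ($_ -replace '<[^>]+>', '').Trim() } |
    Where-Object { $_ })

$lines.Count
```

    Picking the row and column by position keeps the patterns short; matching on the 'Content A' and 'Header 3' text instead would survive reordered rows, at the cost of a second pass over the header row.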


    \_(ツ)_/

    Tuesday, June 20, 2017 10:43 PM

All replies

  • I did kinda expect that to be the case, thanks for the feedback.
    Wednesday, June 21, 2017 3:27 AM