none
use power Query to extract hyperlinks from a web table RRS feed

  • Question

  • Hi,

    is there any way to extract, using power query, hyperlinks from a table column?

    (e.g. actors links from a distribution page from a movie info website)

    with old data tab you can choose to import full html format and use a custom VBA function to extract the hyperlinks from the table, but then missing out all the amazing thinks you can do using power query.

    Is there any future plans to include such a functionality into Power Query

    Thank you.



    Saturday, February 8, 2014 7:54 PM

Answers

  • Power Query lets you browse the structure of HTML DOM so it theoretically should be possible for you to extract raw data from the page, though there is on built in function to do this and the formula to do this may grow complex and fragile. Could you post an example url to a page and the end results you are looking for?

    Thanks


    Zafar Abbas

    Thursday, February 13, 2014 7:09 AM
  • Could you expand on how you actually browse the HTML DOM using Power Query?

    Many Thanks

    • Marked as answer by hobbitalex Sunday, June 15, 2014 11:18 AM
    Friday, May 30, 2014 10:09 AM
  • Hello. Could you please post Power Query formulas of this process?
    • Marked as answer by hobbitalex Sunday, June 15, 2014 11:18 AM
    Friday, June 13, 2014 7:21 PM

All replies

  • Power Query lets you browse the structure of HTML DOM so it theoretically should be possible for you to extract raw data from the page, though there is on built in function to do this and the formula to do this may grow complex and fragile. Could you post an example url to a page and the end results you are looking for?

    Thanks


    Zafar Abbas

    Thursday, February 13, 2014 7:09 AM
  • Thanks,

    I follow your suggestion and done the import the data for this kind of cases in three steps: first, imported the table as Data{} form webpage, second a query using the webpage as text and extracted the hyperlinks, third merged the two queries together;

    let's take as example:

    http://www.cnvmr.ro/RCNVM/lista.php?sect=101

    NR.
    CRT.
    NR.REGISTRU
    DATA REGISTRU
    TIP PERS.
    DENUMIRE
    DECIZIE AUTORIZARE
    ATESTAT INSCRIERE
    LOCALITATE / JUDET
    TEL
    FAX
    EMAIL
    1
    10.02.2006
    PJR
    NBG SECURITIES ROMANIA
    2334/22.07.2003
    70/10.02.2006
    BUCURESTI
    021/316.71.87; 316.71.53; 330.71.87; 330.71.53
    021/316.72.69.

    where I want to import the table including hyperlinks addresses from second column "NR.REGISTRU" obtainig a table like:

    NR. CRT.

    NR.REGISTRU (VALUE)

    NR.REGISTRU (LINK)

    DATA REGISTRU

    TIP PERS.

    DENUMIRE

    DECIZIE AUTORIZARE

    ATESTAT INSCRIERE

    LOCALITATE / JUDET

    TEL

    FAX

    EMAIL

    1

    PJR01SSIF/400007

    http://www.cnvmr.ro/RCNVM/detalii.php?id=71&nrcnvm=PJR01SSIF/400007

    10.02.2006

    PJR

    NBG SECURITIES ROMANIA

    2334/22.07.2003

    70/10.02.2006

    BUCURESTI

    021/316.71.87; 316.71.53; 330.71.87; 330.71.53

    021/316.72.69.

    invest@eteba.ro

    -It was pretty straightforward using the three steps mentioned above to obtain a table in this structure (with the third column, "LINK" value added in the final result),

    This was just an easy example, which could become harder for more complex web pages and a build in facility to obtained the same result with power query without the need to dig in the page html code, I think would resolve couple of scenarios for importing data from webpages.

    Thanks again for the good work that have been done with the new power BI programs and keep it on!

    NR. CRT.

    NR.REGISTRU (VALUE)

    NR.REGISTRU (LINK)

    DATA REGISTRU

    TIP PERS.

    DENUMIRE

    DECIZIE AUTORIZARE

    ATESTAT INSCRIERE

    LOCALITATE / JUDET

    TEL

    FAX

    EMAIL

    1

    PJR01SSIF/400007

    http://www.cnvmr.ro/RCNVM/detalii.php?id=71&nrcnvm=PJR01SSIF/400007

    10.02.2006

    PJR

    NBG SECURITIES ROMANIA

    2334/22.07.2003

    70/10.02.2006

    BUCURESTI

    021/316.71.87; 316.71.53; 330.71.87; 330.71.53

    021/316.72.69.

    invest@eteba.ro


    Tuesday, February 18, 2014 9:05 PM
  • Could you expand on how you actually browse the HTML DOM using Power Query?

    Many Thanks

    • Marked as answer by hobbitalex Sunday, June 15, 2014 11:18 AM
    Friday, May 30, 2014 10:09 AM
  • Hello. Could you please post Power Query formulas of this process?
    • Marked as answer by hobbitalex Sunday, June 15, 2014 11:18 AM
    Friday, June 13, 2014 7:21 PM
  • Sorry for the late answer, i haven't log in for a while,

    for the example above because I need it to be run on multiple pages so it's split in two parts:

    first query ("Links-page"),  from the webpage filtered the rows with hyperlink and set up up as a function:

    (getSectiune) =>
    let
        Source = Table.FromColumns({Lines.FromBinary(Web.Contents("http://www.cnvmr.ro/asf/registru/lista.php?sect=" &  Number.ToText(getSectiune)))}),
        FilteredRows = Table.SelectRows(Source, each Text.Contains([Column1], "detalii.php?id="))
    in
        FilteredRows

    second, for 23 table pages, removed all the unnecessary  text frm the links:

    let
        Source = {1..23},
        TableFromList = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
        InsertedCustom = Table.AddColumn(TableFromList, "Custom", each #"Links-Page"([Column1])),
        #"Expand Custom" = Table.ExpandTableColumn(InsertedCustom, "Custom", {"Column1"}, {"Custom.Column1"}),
        ReplacedValue = Table.ReplaceValue(#"Expand Custom","<td class=""tablecontent"" align=left><div align=left><a href='detalii.php?id=","",Replacer.ReplaceText,{"Custom.Column1"}),
        SplitColumnDelimiter = Table.SplitColumn(ReplacedValue,"Custom.Column1",Splitter.SplitTextByEachDelimiter({"&"}, null, false),{"Custom.Column1.1", "Custom.Column1.2"}),
        SplitColumnDelimiter1 = Table.SplitColumn(SplitColumnDelimiter,"Custom.Column1.2",Splitter.SplitTextByDelimiter("'"),{"Custom.Column1.2.1"}),
        ReplacedValue1 = Table.ReplaceValue(SplitColumnDelimiter1,"nrcnvm=","",Replacer.ReplaceText,{"Custom.Column1.2.1"}),
        ChangedType = Table.TransformColumnTypes(ReplacedValue1,{{"Custom.Column1.1", type number}, {"Custom.Column1.2.1", type text}}),
        RenamedColumns = Table.RenameColumns(ChangedType,{{"Custom.Column1.1", "WebID"}, {"Custom.Column1.2.1", "Nr_reg"}})
    in
        RenamedColumns

    that's because in this case I didn't need the full URL of the link and only a part of it (a unique ID)

    the starting point for digging into the webpage was

    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("http://www.exemplelink..."))}),

    from there filter, replace, split until you get to obtained only the part of the html code that contains the links

    Sunday, June 15, 2014 11:17 AM