Challenge: Transforming Data with M-Language - messy data scraped from the web RSS feed

  • Question

  • I am new to M-Language, and I am playing around with it to see how far I can get.

    I pulled data from the web: https://www.t-systems.com/de/en/locations for the purpose of creating a clean directory of company locations.

    After several failed attempts to get the data loaded directly from the web into Excel / Power BI, I tried several web scrapers, of which only one worked halfway well.

    Finally, I got an Excel file that looks like this:

    I am trying to wrap my head around this data, and I was wondering if it's even possible to clean it up with the help of M?

    It's barely digestible for a human ...

    Obviously, it would take me less time to just manually copy and paste the data into a text file, clean it up by hand and then load it into Excel. But I was wondering how a data pro would handle such a scenario, given a data set that wasn't just a couple dozen addresses but a huge volume, impossible to clean up manually.

    Please bear in mind that the example above is totally arbitrary, based on publicly available data, for educational purposes only.

    I am using it to demonstrate a realistic scenario.

    Thanks for your feedback!


    Rafael Knuth


    Tuesday, February 13, 2018 2:15 PM

Answers

  • Excel 2016 Pro Plus with Power Query (aka Get & Transform)
    http://www.mediafire.com/file/lzgxk3387a42ttg/02_13_18.xlsx


    Wednesday, February 14, 2018 1:44 AM
  • The truth is that this relies more on knowing HTML than on knowing Power Query, so it might not be a good example unless you're into working with HTML documents.

    Some of the data that you need is inside a container rather than in a list or a table, so the Web.Page function will be of little help here. You have multiple ways to target that container: drill down into that specific id or the class associated with it (using the Web.Page function), or grab the inline JS code from the HTML document (you can read the HTML document as a text file), find the row where the values you need appear, and do the transformations from there.

    Here's an example of how you can get that data with Power Query, but I wouldn't recommend it even to an experienced user, as there are better tools for the job.

    let
        // Read the raw HTML of the page as lines of text (20106 is the text encoding codepage)
        Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://www.t-systems.com/de/en/locations"), null, null, 20106)}),
        #"Added Index" = Table.AddIndexColumn(Source, "Index", 1, 1),
        // Keep only the script lines that assign placemark.description
        #"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each Text.Contains([Column1], "placemark.description") ),
        #"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] = true)),
        // Strip the JS syntax so only the HTML fragment remains
        #"Remove Characters" = Table.TransformColumns(#"Filtered Rows",{{"Column1", each Text.Remove(_, {"'", ";"}), type text}}),
        #"Replaced Value" = Table.ReplaceValue(#"Remove Characters"," placemark.description = ","",Replacer.ReplaceText,{"Column1"}),
        #"Removed Columns" = Table.RemoveColumns(#"Replaced Value",{"Index", "Custom"}),
        // Parse each HTML fragment with Web.Page and drill down to the text inside BODY
        #"Added Custom1" = Table.AddColumn(#"Removed Columns", "Custom", each Web.Page([Column1])),
        #"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom1", "Custom", {"Data"}, {"Data"}),
        #"Expanded Data" = Table.ExpandTableColumn(#"Expanded Custom", "Data", {"Children"}, {"Children"}),
        #"Expanded Children" = Table.ExpandTableColumn(#"Expanded Data", "Children", {"Kind", "Name", "Children"}, {"Kind", "Name", "Children.1"}),
        #"Filtered Rows1" = Table.SelectRows(#"Expanded Children", each ([Name] = "BODY")),
        #"Removed Columns1" = Table.RemoveColumns(#"Filtered Rows1",{"Kind", "Name"}),
        #"Expanded Children.1" = Table.ExpandTableColumn(#"Removed Columns1", "Children.1", {"Kind", "Name", "Children", "Text"}, {"Kind", "Name", "Children", "Text"}),
        #"Removed Columns2" = Table.RemoveColumns(#"Expanded Children.1",{"Kind", "Name", "Children"}),
        #"Filtered Rows2" = Table.SelectRows(#"Removed Columns2", each [Text] <> null),
        // Group the text lines per original fragment and combine them into one address per row
        #"Grouped Rows" = Table.Group(#"Filtered Rows2", {"Column1"}, {{"Count", each _, type table}}),
        #"Added Custom2" = Table.AddColumn(#"Grouped Rows", "Custom", each Table.Column([Count], "Text")),
        #"Removed Columns3" = Table.RemoveColumns(#"Added Custom2",{"Count"}),
        #"Extracted Values" = Table.TransformColumns(#"Removed Columns3", {"Custom", each Text.Combine(List.Transform(_, Text.From), "#(cr)#(lf)"), type text}),
        #"Removed Columns4" = Table.RemoveColumns(#"Extracted Values",{"Column1"})
    in
        #"Removed Columns4"

    Wednesday, February 14, 2018 4:52 AM

All replies

  • Yes, I am already using Power Query in Excel 2016 Pro Plus; however, I am new to the subject of ETL.

    Rafael Knuth

    Wednesday, February 14, 2018 8:06 AM
  • Hi Miguel,

    thanks for the clarification!

    What are commonly used tools for scenarios like the one above?

    Thanks!


    Rafael Knuth


    • Edited by Rafael Knuth Wednesday, February 14, 2018 8:10 AM
    Wednesday, February 14, 2018 8:10 AM
  • Either 3rd-party tools, or pure and simple good ole' code (like Python) with some sort of automation. For more suggestions on web scraping and what tools to use, I'd probably create a thread on StackOverflow or search for the tag "web-scraping" on that site.

    The main point here is that you can do web scraping with Power Query, but it's not easy and probably not a good idea either. For example: if the data is behind a login page, PQ can't access it, but with other tools you could get past that.
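    To make the code route concrete, here is a minimal Python sketch of the "target the container by its id" approach mentioned earlier, using only the standard library. The HTML fragment and the `location-list` id are made up for illustration; a real script would first fetch the page (e.g. with urllib or requests) and use whatever id the actual page exposes.

    ```python
    from html.parser import HTMLParser

    # Hypothetical HTML fragment standing in for a fetched page.
    HTML = """
    <html><body>
      <div id="location-list">
        <p>T-Systems Austria</p>
        <p>Rennweg 97-99, 1030 Vienna</p>
      </div>
      <div id="footer"><p>Imprint</p></div>
    </body></html>
    """

    class ContainerText(HTMLParser):
        """Collects the text nodes inside the element with a given id."""

        def __init__(self, target_id):
            super().__init__()
            self.target_id = target_id
            self.depth = 0      # > 0 while we are inside the target element
            self.lines = []

        def handle_starttag(self, tag, attrs):
            if self.depth:
                self.depth += 1                     # nested tag inside the target
            elif dict(attrs).get("id") == self.target_id:
                self.depth = 1                      # entered the target container

        def handle_endtag(self, tag):
            if self.depth:
                self.depth -= 1

        def handle_data(self, data):
            if self.depth and data.strip():
                self.lines.append(data.strip())

    parser = ContainerText("location-list")
    parser.feed(HTML)
    print(parser.lines)  # → ['T-Systems Austria', 'Rennweg 97-99, 1030 Vienna']
    ```

    The same drill-down logic is what the M query above does by hand; a dedicated parser just makes the targeting (by id, class, or tag) far less brittle.
    
    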

    Wednesday, February 14, 2018 5:20 PM