none
Remove HTML tags from a column in power query RRS feed

  • Question

  • I have brought in data from an Odata source. One of the columns (Executive Summary) has HTML tags included in it.

    Example 1: <p>​The DIA tool is an existing application that tracks Credits and Incentives crated to clients. The purpose of the project is to add enhancements to the dashboard, streamline the addition of users to specific programs, and making report additions.</p>

    Example 2: <p><span style="color: #404040; font-family: &quot;open sans&quot;; font-size: 10pt; font-weight: normal;">Within </span><span style="color: #404040; font-family: &quot;open sans&quot;; font-size: 10pt; font-weight: normal;">My Clients 2.0, identify and integrate new risk metrics with new and existing risk data, to visually highlight risk</span><span style="color: #404040; font-family: &quot;open sans&quot;; font-size: 10pt; font-weight: normal; vertical-align: baseline;"> indicators in an interactive manner, in order for LCSPs and account teams to proactively understand, manage, and mitigate risk.</span></p>

    I want to remove all the HTML tags from this column, leaving only the text.  The 2 examples above should look like this after removing the HTML tags.

    Example 1: The DIA tool is an existing application that tracks Credits and Incentives crated to clients. The purpose of the project is to add enhancements to the dashboard, streamline the addition of users to specific programs, and making report additions.

    Example 2: My Clients 2.0, identify and integrate new risk metrics with new and existing risk data, to visually highlight risk indicators in an interactive manner, in order for LCSPs and account teams to proactively understand, manage, and mitigate risk.

    I've looked at some code on here but haven't been able to make that work for me.  Any suggestion / ideas are appreciated.

    Can someone help me out with this please?

    Thanks,

    MM

     
    • Edited by MLM3877 Tuesday, June 16, 2015 6:41 PM
    Tuesday, June 16, 2015 4:25 PM

Answers

  • oops - there's still this " Within  " at the beginning of the 2nd example. Some brute force would do it, but I'm not sure if there's more to come of this sort:

    let
         Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
         ChType = Table.TransformColumnTypes(Source,{{"Data", type text}}),
         ReplaceStr = Table.ReplaceValue(ChType,"><", "--",Replacer.ReplaceText,{"Data"}),
         ListsCol = Table.AddColumn(ReplaceStr, "ID", each
                    Text.Split(Text.Combine(List.Alternate(Text.SplitAny([Data],"<>"),1,1,1)," "),"--")),
         ExpCol = Table.ExpandListColumn(ListsCol, "ID"),
        Replace = Table.ReplaceValue(ExpCol," Within  ","",Replacer.ReplaceText,{"ID"}),
        TransfCol = Table.TransformColumns(Replace, {"ID", each Text.TrimStart(_)}),
         TheEnd = Table.SelectColumns(TransfCol,{"ID"})
     in
         TheEnd


    Imke

    • Marked as answer by MLM3877 Wednesday, June 17, 2015 2:53 PM
    Wednesday, June 17, 2015 4:19 AM
    Moderator

All replies

  • These adjustments on Bills code (https://social.technet.microsoft.com/Forums/en-US/7ec64d6d-c3fc-4110-94c7-2e0087171475/how-to-remove-html-tags-from-a-column?forum=powerquery) will do it for your case:

    let
         Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
         ChType = Table.TransformColumnTypes(Source,{{"Data", type text}}),
         ReplaceStr = Table.ReplaceValue(ChType,"><", "---",Replacer.ReplaceText,{"Data"}),
         ListsCol = Table.AddColumn(ReplaceStr, "ID", each
                    Text.Split(Text.Combine(List.Alternate(Text.SplitAny([Data],"<>"),1,1,1)," "),"---")),
         ExpCol = Table.ExpandListColumn(ListsCol, "ID"),
         TransfCol = Table.TransformColumns(ExpCol, {"ID", each Text.TrimStart(_)}),
         TheEnd = Table.SelectColumns(TransfCol,{"ID"})
     in
         TheEnd

    Please watch that "---" acts as the separator, so this string shouldn't be included in your texts to keep - if thats a problem, replace by a different string.


    Imke

    Wednesday, June 17, 2015 4:01 AM
    Moderator
  • oops - there's still this " Within  " at the beginning of the 2nd example. Some brute force would do it, but I'm not sure if there's more to come of this sort:

    let
         Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
         ChType = Table.TransformColumnTypes(Source,{{"Data", type text}}),
         ReplaceStr = Table.ReplaceValue(ChType,"><", "--",Replacer.ReplaceText,{"Data"}),
         ListsCol = Table.AddColumn(ReplaceStr, "ID", each
                    Text.Split(Text.Combine(List.Alternate(Text.SplitAny([Data],"<>"),1,1,1)," "),"--")),
         ExpCol = Table.ExpandListColumn(ListsCol, "ID"),
        Replace = Table.ReplaceValue(ExpCol," Within  ","",Replacer.ReplaceText,{"ID"}),
        TransfCol = Table.TransformColumns(Replace, {"ID", each Text.TrimStart(_)}),
         TheEnd = Table.SelectColumns(TransfCol,{"ID"})
     in
         TheEnd


    Imke

    • Marked as answer by MLM3877 Wednesday, June 17, 2015 2:53 PM
    Wednesday, June 17, 2015 4:19 AM
    Moderator
  • I'm completely new at M code so.....in your's/Bills example I have a couple of questions:

    My source looks like this:  OData.Feed("https://stgepm2013.companyname.com/blah/blah/projectdata/Projects"), 

    Do I need {[Name="Tabelle1"]}[Content], at the end?   If so what does "Tabelle1" refer to?  [Content]? 

    It would be helpful and very much appreciated if the sample was written like this for us novice folks. :)

    text = Text.From(TheNameOfYourColumn),

    Thank you very very much for the reply, I'm working on putting this into my sheet now and trying to work through what goes where in the sample code.

    Also, there are NULL values in the ExecutiveSummary column.  Is there a line of code that can be added to deal with those "null" values?

    It literally says null in the column when there isn't any text there.  For now I filtered the nulls out but those lines may have data that I need.  The Executive Summary may not always have text in it, so I need those columns to just be blank.

    Thanks,

    MM



    • Edited by MLM3877 Wednesday, June 17, 2015 2:25 PM
    Wednesday, June 17, 2015 1:20 PM
  • Sorry MM,

    this would be it then:

         Source =  OData.Feed("https://stgepm2013.companyname.com/blah/blah/projectdata/Projects"), 

         anotherStep = Source {[Name="TheNameOfTheTableYouveSelected",Signature="table"]}[Data]
         ChType = Table.TransformColumnTypes(anotherStep,{{"TheNameOfYourColumnThatContainsTheTextToClean", type text}}),
         ReplaceStr = Table.ReplaceValue(ChType,"><", "---",Replacer.ReplaceText,{"Data"}),
         ListsCol = Table.AddColumn(ReplaceStr, "ID", each
                    Text.Split(Text.Combine(List.Alternate(Text.SplitAny([Data],"<>"),1,1,1)," "),"---")),
         ExpCol = Table.ExpandListColumn(ListsCol, "ID"),
        Replace = Table.ReplaceValue(ExpCol," Within  ","",Replacer.ReplaceText,{"ID"}),
        TransfCol = Table.TransformColumns(Replace, {"ID", each Text.TrimStart(_)}),
         TheEnd = Table.SelectColumns(TransfCol,{"ID"})
     in
         TheEnd 

    where source and anotherSteps are the first steps in your query to retrieve the content you're interested in. (If it's more tables in it - maybe your second step Looks different, but that doesn't matter). What's important are the 2 elements in the 3rd step (ChType/ i.e. the step where you want to begin your column clean operation): There your need to replace "anotherStep" by your last step (where you want to begin to clean up your column) and the column Name (TheNameOfYourColumnThatContainsTheTextToClean) by what it says :-)

    Hope it goes well now.


    Imke

    Wednesday, June 17, 2015 2:50 PM
    Moderator