none
How to remove html tags from a column RRS feed

  • Question

  • Hi

    Problem is this: I get a column with a comma separated list of id's and I can successfully parse these id's and use them elsewhere. BUT, occasionally there are html tags within that id list like this:

    1082471,1237423<br xmlns="http://www.w3.org/1999/xhtml" />

    Is there a way to just automatically remove all tags from a column? Could do this with regex, but since there is no support, I don't know what to do.

    Wednesday, April 15, 2015 6:27 AM

Answers

  • I propose slightly different approach to the problem.

    let
        Source = Excel.CurrentWorkbook(){[Name="Tabela2"]}[Content],
        ChType = Table.TransformColumnTypes(Source,{{"Data", type text}}),
        ReplaceStr = Table.ReplaceValue(ChType,"><",">,<",Replacer.ReplaceText,{"Data"}),
        ListsCol = Table.AddColumn(ReplaceStr, "ID", each
                   Text.Split(Text.Combine(List.Alternate(Text.SplitAny([Data],"<>"),1,1,1),","),",")),
        ExpCol = Table.ExpandListColumn(ListsCol, "ID"),
        FilterID = Table.SelectRows(ExpCol, each ([ID] <> "")),
        TheEnd = Table.SelectColumns(FilterID,{"ID"})
    in
        TheEnd

    Of course we can change part of this code to function.

    Friday, April 17, 2015 7:03 PM
  • You need only a small adjustment to the code:

    let
         fnRemoveHTML_ = (Value) =>
             let
                 text = Text.From(Value),
                 length = Text.Length(text),
                 position = Text.PositionOf(text, "<"),
                 positionEnd = Text.PositionOf(text, ">"),
                 range = positionEnd-position+1,
                 new = if position >= 0 then Text.ReplaceRange(text, position, range, "") else Value
             in
                 new,
         Quelle = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
         Last = Table.AddColumn(Quelle, "Custom", each fnRemoveHTML_([Value])),
         ChangedType = Table.TransformColumnTypes(Last,{{"Custom", type number}})
     in
         ChangedType


    Imke

    Friday, April 17, 2015 7:10 AM
    Moderator

All replies

  • Hi,

    If the HTML can be detected by a starting symbol like „<“, then you could use the following:

    Unfortuntely the operation “ReplaceRange” is only available on a Text-level, so you have to invoke a function (at least to my knowledge). You also need an Index-column in your table, so if you don’t have it, you need to create one as well.

    This is your function:

    let

       fnRemoveHTML = (Value, Index) =>

    let

       Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],

       IndeNo = Index,

       Value_ = Source{IndeNo-1}[Value],

       length = Text.Length(Text.From(Value_)),

       position = Text.PositionOf(Text.From(Value_), "<"),

       range = length-position,

       new= if Value_ is number then Value_ else Text.ReplaceRange(Value_, position, range, "")

    in

        new

    in

      fnRemoveHTML

    And this is how you invoke it:

    let

        Quelle = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],

        Last = Table.AddColumn(Quelle, "Custom", each fn_RemoveHTML([Value], [Index])),

        ChangedType = Table.TransformColumnTypes(Last,{{"Custom", type number}})

    in

        ChangedType

    Provided your table is called “Tabelle1” & the column with your values to be replaced “Value” & your index-col “Index”


    Imke

    Thursday, April 16, 2015 7:31 AM
    Moderator
  • I would simplify Imke's solution slightly to this, removing the reference to the source data from the function to make it more general.

    let
        fnRemoveHTML = (Value) =>
            let
                text = Text.From(Value),
                length = Text.Length(text),
                position = Text.PositionOf(text, "<"),
                range = length-position,
                new = if position >= 0 then Text.ReplaceRange(text, position, range, "") else Value
            in
                new,
        Quelle = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
        Last = Table.AddColumn(Quelle, "Custom", each fn_RemoveHTML([Value])),
        ChangedType = Table.TransformColumnTypes(Last,{{"Custom", type number}})
    in
        ChangedType

    However, this will only remove tags from the end of the data, not from the middle. Is that sufficient for your needs?
    Thursday, April 16, 2015 10:38 AM
  • Of course, if html tag is always at the end of text, we can just split the column by delimiter ("<" first left) and next remove the second column. :-)

    Regards

    Thursday, April 16, 2015 11:26 AM
  • Unfortunately this is not the case, but the tags are randomly anywhere...
    Friday, April 17, 2015 7:02 AM
  • You need only a small adjustment to the code:

    let
         fnRemoveHTML_ = (Value) =>
             let
                 text = Text.From(Value),
                 length = Text.Length(text),
                 position = Text.PositionOf(text, "<"),
                 positionEnd = Text.PositionOf(text, ">"),
                 range = positionEnd-position+1,
                 new = if position >= 0 then Text.ReplaceRange(text, position, range, "") else Value
             in
                 new,
         Quelle = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
         Last = Table.AddColumn(Quelle, "Custom", each fnRemoveHTML_([Value])),
         ChangedType = Table.TransformColumnTypes(Last,{{"Custom", type number}})
     in
         ChangedType


    Imke

    Friday, April 17, 2015 7:10 AM
    Moderator
  • Imke's solution looks like it could work in my case, if I would add the search for the end tag as well.

    The problem is, I do not know how to use functions together with PowerQuery. I found this tutorial around the functions and I will look at that next... http://blogs.msdn.com/b/mvpawardprogram/archive/2013/08/19/creating-power-query-functions.aspx

    I'll come back afterwards and tell if I got it working or not.

    Friday, April 17, 2015 7:11 AM
  • That source looks really good, however - it will not give you this compact format that Curt suggested (& I copied)

    You could also consider to simply copy this code and adjust it to your needs. It's only 2 fields actually:

    Well - almost - as you need to replace the reference to the table in our example with the reference to your csv-file:

    let
         fnRemoveHTML_ = (TheNameOfYourColumn) =>
             let
                 text = Text.From(TheNameOfYourColumn),
                 length = Text.Length(text),
                 position = Text.PositionOf(text, "<"),
                 positionEnd = Text.PositionOf(text, ">"),
                 range = positionEnd-position+1,
                 new = if position >= 0 then Text.ReplaceRange(text, position, range, "") else TheNameOfYourColumn
             in
                 new,

        Source= Csv.Document(File.Contents("TheNameOfYourCsvFile"),null,";",null,1252),
        FirstRowAsHeader = Table.PromoteHeaders(Source),
        Last = Table.AddColumn(FirstRowAsHeader, "Custom", each fnRemoveHTML_([TheNameOfYourColumn])),
        ChangedType = Table.TransformColumnTypes(Last,{{"Custom", type number}})
      in
          ChangedType

    So it's basically the path of your csv-file & the name of your column to search in that you need to replace. Also check the settings of the csv-file - they might be different for your language.


    Imke

    Friday, April 17, 2015 7:37 AM
    Moderator
  • I propose slightly different approach to the problem.

    let
        Source = Excel.CurrentWorkbook(){[Name="Tabela2"]}[Content],
        ChType = Table.TransformColumnTypes(Source,{{"Data", type text}}),
        ReplaceStr = Table.ReplaceValue(ChType,"><",">,<",Replacer.ReplaceText,{"Data"}),
        ListsCol = Table.AddColumn(ReplaceStr, "ID", each
                   Text.Split(Text.Combine(List.Alternate(Text.SplitAny([Data],"<>"),1,1,1),","),",")),
        ExpCol = Table.ExpandListColumn(ListsCol, "ID"),
        FilterID = Table.SelectRows(ExpCol, each ([ID] <> "")),
        TheEnd = Table.SelectColumns(FilterID,{"ID"})
    in
        TheEnd

    Of course we can change part of this code to function.

    Friday, April 17, 2015 7:03 PM
  • Hi All, 

    I'm still having problems with this solution, this is what a I have at the Advanced Editor.. can someone help me?

    let
        Source = Excel.Workbook(File.Contents("C:\TFS\Sales Response\Team Management\Weekly_Status.xlsx"), null, true),
        #"Work Items_Sheet" = Source{[Item="Work Items",Kind="Sheet"]}[Data],
        #"Promoted Headers" = Table.PromoteHeaders(#"Work Items_Sheet", [PromoteAllScalars=true]),
        #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ID", Int64.Type}, {"Title", type text}, {"Priority", Int64.Type}, {"Assigned To", type text}, {"State", type text}, {"History", type text}, {"Changed Date", type datetime}, {"Created Date", type datetime}, {"Status", type any}, {"Extraction Date", type date}}),
        #"Replaced Value" = Table.ReplaceValue(#"Changed Type","Yellow","http://www.colorcombos.com/images/colors/FFCC00.png",Replacer.ReplaceText,{"Status Color"}),
        #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","Red","https://s-media-cache-ak0.pinimg.com/originals/0d/e7/dc/0de7dc042e1e246fc41ba184f54156a9.jpg",Replacer.ReplaceText,{"Status Color"}),
        #"Replaced Value2" = Table.ReplaceValue(#"Replaced Value1","Green","http://www.colorcombos.com/images/colors/00FF00.png",Replacer.ReplaceText,{"Status Color"}),
        fnRemoveHTML_ = (History) =>
             let
                 text = Text.From(History),
                 length = Text.Length(text),
                 position = Text.PositionOf(text, "<"),
                 positionEnd = Text.PositionOf(text, ">"),
                 range = positionEnd-position+1,
                 new = if position >= 0 then Text.ReplaceRange(text, position, range, "") else History
             in
                 new
        
    in
        #"Replaced Value2"

    Tuesday, May 9, 2017 9:36 PM
  • If the tags are always at the end of the text as shown in the example, then something like

    Table.TransformColumns(<previousStepName, {"Text", each Text.BeforeDelimiter(_, "<")})
    would work.

    Wednesday, May 10, 2017 4:56 AM