none
Assistance with data cleanup - unpivot and find/replace

    Question

  • I have a field I need to clean up, then unpivot. The field is data entered by users so the clean up is going to be interesting. The field contains a list of company codes, which are relevant for a particular customer.

    Some sample data is below, you can see that the field is going to need tidying before I can unpivot.

    If my approach is wrong, interested to her other opinions.

    Here is what I want to do:

    1. Replace colon or semi-colon with comma (am able to do this)
    2. Remove all character fields A-Z and non numeric characters - except comma (need help on this)
    3. UNPIVOT the CompanyCode field (am able to do this)
    4. need to pad out the company with leading zero, so its 10 chars long (need help)

    Sample Data

    CustomerCode,CompanyCode

    "101","ABC123 (Europe), ABC456 (USA)"

    "102","XYZ123;XYZ456;XYZ789"

    "103","No company codes"

    "104","DEF123"

    Sample Output

    101,0000ABC123

    101,0000ABC456

    102,0000XYZ123

    102,0000XYZ456

    102,0000XYZ789

    103,NULL

    104,DEF123

    Tuesday, March 18, 2014 7:52 PM

Answers

  • FYI, you can generate a list of letters the same way that you generated the list of numbers: {"A".."Z"}. You can expand to include lower-case letters by saying {"A".."Z", "a".."z"}.

    We'd like to improve the experience around doing this kind of manipulation, but we're not sure how best to do it. It's a little tempting to just add some kind of regular expression support, but people don't generally use the phrases "regular expression" and "improved user experience" in the same sentence...

    Good point. I overlooked {"A".."Z"} because I used a modified script that generated all of the ANSI characters between 32 and 126. In many cases, I find that you can use either the list generate function, or the list transform function, e.g. List.Transform({32..126}, each Character.FromNumber(_)). I also discovered that {" ".."~"} works just as well, so in this scenario, a function call (never mind two) isn't required at all.

    As far as the regular expression dilemma is concerned, my recommendation would be to take a page from Office. Word has always had the best support for regular expressions of the Office apps - except that Word (and the other Office apps) don't refer to the feature as regular expressions at all. The term used is wildcards - a term that most folks can associate with (the wildcards in Office don't follow the standard regex format). If a Word user needs to do a complex search, there's no way to avoid the somewhat cryptic nature of wildcards...and that's working in the UI - not via some script behind the scenes.

    In summary:

    1. Use the term "wildcards" if you think that "regular expression" is unfriendly
    2. Decide whether you want to follow the Word wildcard syntax, or the Regex syntax
    3. As a minimum, the Word metacharacters (or equivalent) should be supported. Word supports only one quantifier "@" (equivalent to "+" in regex), so you might consider supporting more (regex *, ?, {})
    4. Either extend the Text.Contains function to support wildcards, or include a new function, e.g. Text.Like

     

     
    Monday, March 24, 2014 4:03 PM

All replies

  • #4 is easy: Text.PadStart("ABC123", 10, "0") will do what you want.

    We don't have a particularly good library tools to help with #2. Additionally, I think you'll need to specify the problem a little more rigorously. For instance, it's not clear to me what rule allows "ABC123" to be okay but the "N" in "No company" to be not okay. Is it that all of the codes have to follow the pattern letter-letter-letter-number-number-number? Are lower-case letters okay in that pattern? If you can provide precise instructions for the transformation you want, we can probably help you write a function.

    Thursday, March 20, 2014 2:54 PM
    Moderator
  • I can offer a solution based on techniques that I currently use in PQ for validating data in columns. The solution is based solely on the information provided by Benzane, but can probably be modified depending on the amount of mind-reading that we were supposed to do. :)

    I'm also posting this solution with the hope that someone can indicate whether there is a less convoluted method to achieve the same (general) goal.

    The key to the solution lies within a custom function called TransformCompanyCode. The main script calls this function. The entire main script is generated through UI tasks only. The starting point is an Excel table with one column and all the data (including header) in rows. If the data is already split into two columns, the first few steps of the script will obviously be different.

    Main Script:

    let
        Source = Excel.CurrentWorkbook(){[Name="Table6"]}[Content],
        SplitColumnDelimiter = Table.SplitColumn(Source,"Column1",Splitter.SplitTextByEachDelimiter({","}, null, false),{"Column1.1", "Column1.2"}),
        ChangedType = Table.TransformColumnTypes(SplitColumnDelimiter,{{"Column1.1", type text}, {"Column1.2", type text}}),
        FirstRowAsHeader = Table.PromoteHeaders(ChangedType),
        RenamedColumns = Table.RenameColumns(FirstRowAsHeader,{{"CompanyCode", "CompanyCodeOld"}}),
        InsertedCustom = Table.AddColumn(RenamedColumns, "CompanyCode", each TransformCompanyCode([CompanyCodeOld],",;")),
        #"Expand CompanyCode" = Table.ExpandListColumn(InsertedCustom, "CompanyCode"),
        RemovedColumns = Table.RemoveColumns(#"Expand CompanyCode",{"CompanyCodeOld"})
    in
        RemovedColumns

    The custom function is called in the highlighted step. In an earlier step I renamed CompanyCode to CompanyCodeOld so that the custom column can be named CompanyCode.

    TransformCompanyCode:

    The function takes two arguments - the company code and a list of delimiters in quotes.

    let
        Source = (text as nullable text, separators as text) =>
    let
        TextRemovedQuotes = try Text.Remove(text,"""") otherwise null,
        TextList = try Text.SplitAny(TextRemovedQuotes,separators) otherwise null,
        RemovedUnwantedTextList = try List.Transform(TextList,each Text.Start(Text.Trim(_),6)) otherwise null,
        ValidAlphaPartList = List.Generate(()=>65, each _<=90, each _ +1, each Character.FromNumber(_)), //ANSI characters A-Z
        ValidNumericPartList = {"1".."9"}, //Change to {"0".."9"} if zero is also valid
        Final = try List.Transform(RemovedUnwantedTextList,each if Text.Length(Text.Remove(Text.Start(_,3),ValidAlphaPartList))>0
                                    or Text.Length(Text.Remove(Text.End(_,3),ValidNumericPartList))>0 then "NULL"
                                    else Text.PadStart(_,10,"0")) otherwise null
    in
        Final
    in
        Source

    The function works as follows:

    1. Removes any quotes in the text.
    2. Splits the text into a list, based on any of the delimiters supplied.
    3. Removes unwanted text, which is any text after the first six characters of each list element (it trims leading and trailing spaces from each element before removing unwanted characters).
    4. Generates two lists which will be used for subsequent validation. The first is a list of ANSI characters A-Z. The second is a list numbers as text.
    5. The final transformation takes the first three characters of each list element and removes the characters A-Z. If, after removing A-Z, the text length is >0, it means other characters exist, and the alpha part of the code is invalid. Similarly, using the last three characters of each list element, the list of numbers is removed. If the text length >0 then the number part of the code is invalid. In either case, the list element is changed to "NULL." If the list element passes the validation tests, it's padded out to 10 characters using 0 as the padding character.
    Saturday, March 22, 2014 3:23 PM
  • FYI, you can generate a list of letters the same way that you generated the list of numbers: {"A".."Z"}. You can expand to include lower-case letters by saying {"A".."Z", "a".."z"}.

    We'd like to improve the experience around doing this kind of manipulation, but we're not sure how best to do it. It's a little tempting to just add some kind of regular expression support, but people don't generally use the phrases "regular expression" and "improved user experience" in the same sentence...

    Monday, March 24, 2014 2:10 PM
    Moderator
  • FYI, you can generate a list of letters the same way that you generated the list of numbers: {"A".."Z"}. You can expand to include lower-case letters by saying {"A".."Z", "a".."z"}.

    We'd like to improve the experience around doing this kind of manipulation, but we're not sure how best to do it. It's a little tempting to just add some kind of regular expression support, but people don't generally use the phrases "regular expression" and "improved user experience" in the same sentence...

    Good point. I overlooked {"A".."Z"} because I used a modified script that generated all of the ANSI characters between 32 and 126. In many cases, I find that you can use either the list generate function, or the list transform function, e.g. List.Transform({32..126}, each Character.FromNumber(_)). I also discovered that {" ".."~"} works just as well, so in this scenario, a function call (never mind two) isn't required at all.

    As far as the regular expression dilemma is concerned, my recommendation would be to take a page from Office. Word has always had the best support for regular expressions of the Office apps - except that Word (and the other Office apps) don't refer to the feature as regular expressions at all. The term used is wildcards - a term that most folks can associate with (the wildcards in Office don't follow the standard regex format). If a Word user needs to do a complex search, there's no way to avoid the somewhat cryptic nature of wildcards...and that's working in the UI - not via some script behind the scenes.

    In summary:

    1. Use the term "wildcards" if you think that "regular expression" is unfriendly
    2. Decide whether you want to follow the Word wildcard syntax, or the Regex syntax
    3. As a minimum, the Word metacharacters (or equivalent) should be supported. Word supports only one quantifier "@" (equivalent to "+" in regex), so you might consider supporting more (regex *, ?, {})
    4. Either extend the Text.Contains function to support wildcards, or include a new function, e.g. Text.Like

     

     
    Monday, March 24, 2014 4:03 PM