none
Split column from character in middle of the string in Power query RRS feed

  • Question

  • Hi all,

    This is my first post. I am just working on a query and I need to edit this type of data, I need everything that is before the string:  

    • 26587p11 -> 26587
    • 7588p2 -> 7588
    • 26587p12 -> 26587

    • 5360xmaszagreb -> 5360

    I try some answer in the forum but nothing really help. 

    = Table.AddColumn(#"Changed Type2", "Vendor", each List.First(List.Select(Text.Split([Vendor.2]), each Number.Mod(try Number.From(_) otherwise null,1)=0)))
    

    I highly appreciate the help.

    Cheers,

    Jorge

    Wednesday, January 25, 2017 3:24 PM

Answers

  • = Table.AddColumn(
    #"Changed Type2",
    "Vendor",
    each List.First(Text.SplitAny([Vendor.2], Text.Combine({"A".."z"})))
    )



    Wednesday, January 25, 2017 4:54 PM
  • Alternatively, the following code will return the first part of the string until the first non-digit:

    = Table.AddColumn(#"Changed Type2", "Vendor", each Text.Combine(List.FirstN(Text.ToList([Vendor.2]), each _ >= "0" and _<= "9")))


    Wednesday, January 25, 2017 7:14 PM
  • Marcel,

    Speaking of code that looks inefficient, here is an option that I rejected out of hand, but on testing, doesn't appear to affect performance:

    = Table.AddColumn(
    #"Changed Type2",
    "Vendor",
    each List.First(
    Text.SplitAny(
    [Vendor.2],
    Text.Combine(
    {Character.FromNumber(0)..Character.FromNumber(47), Character.FromNumber(58)..Character.FromNumber(256)}
    )
    )
    )
    )

    The change is in the Text.Combine function. Instead of simply using only characters in the range {"A".."z"}, we look at all characters in the extended ANSI character set except for digits 0 to 9. One reason I didn't like this solution was because it seemed that if you have to go through the trouble of excluding the digits from a large set of possible characters, then you're better off with a solution that considers the digits only (which is a better solution in many scenarios). The modified code looks verbose because Character.FromNumber() is a long function name, compared to Char() :^) However, this solution retains the key "advantage" of the original solution, in that we only split the text into a list at the first non-numeric character occurrence, leaving the leading numeric characters together. If we want a list of valid characters only so that invalid characters are flagged (point 3 of previous post), then the character list would be modified accordingly.

    By the way, if we can ignore the first 31 non-printable characters of the extended ANSI character set, then we can avoid the verbosity of using the Character.FromNumber function.

    = Table.AddColumn(
          #"Changed Type",
          "Vendor",
          each List.First(Text.SplitAny([Vendor.2], Text.Combine({" ".."/", ":".."ÿ"}))))

    In the unlikely case that we have to consider every Unicode character (up to decimal 65535) just to exclude digits, now that would result in a significant performance hit!!!



    Thursday, January 26, 2017 7:02 PM

All replies

  • = Table.AddColumn(
    #"Changed Type2",
    "Vendor",
    each List.First(Text.SplitAny([Vendor.2], Text.Combine({"A".."z"})))
    )



    Wednesday, January 25, 2017 4:54 PM
  • Alternatively, the following code will return the first part of the string until the first non-digit:

    = Table.AddColumn(#"Changed Type2", "Vendor", each Text.Combine(List.FirstN(Text.ToList([Vendor.2]), each _ >= "0" and _<= "9")))


    Wednesday, January 25, 2017 7:14 PM
  • And if you want to avoid converting the text to a list (and back to text in Marcel's case):

    = Table.AddColumn(
    #"Changed Type",
    "Vendor",
    each Text.Start([Vendor.2], Text.PositionOfAny([Vendor.2], {"A".."z"}))
    )


    Wednesday, January 25, 2017 8:22 PM
  • @Colin: I agree with you that my code appears the least efficient.
    However, every now and then Power Query seems to be quite smart in evaluating code.

    So I wouldn't be surprised if the actual execution isn't exactly according to the code: there may be no conversion to list and back at all.

    At least that is my experience.

    As a test, I created a table in Excel with about 300,000 values and ran both codes (mine and your last code).
    I didn't notice a difference in runtime: both ran for about 22 seconds (hand clocked, so nothing scientific here, but for me enough to conclude that runtimes are equivalent).

    Otherwise your code will leave all characters with a code below "A" or above "z", including e.g. a space and !"#$%&'()*+,-./:;<=>?@ and special characters like é.
    Last but not least your code will return an empty string if the input string contains all digits.



    • Edited by MarcelBeug Thursday, January 26, 2017 4:52 AM small refinements
    Thursday, January 26, 2017 2:47 AM
  • Hi Marcel,

    "@Colin: I agree with you that my code appears the least efficient."

    Nope - I wasn't trying to imply that your code was less efficient in any way - far from it. My testing also showed that there was no discernable difference between the three versions.

    "Otherwise your code will leave all characters with a code below "A" or above "z", including e.g. a space and !"#$%&'()*+,-./:;<=>?@ and special characters like é"

    That's a good point that hasn't escaped me throughout this discussion. Depending on user preference, there may be grounds for choosing one solution over the other:
    1) If there are no exceptions that occur in the vendor code (follows the examples provided), then either solution is fine.
    2) If the user doesn't care about any character appearing in the vendor code, then your solution is preferable.
    3) If the user want's to flag illegitimate vendor code characters (perhaps to correct the source data), then my solution (with adjustments to the character list) would be a better option.

    "Last but not least, your code will return an empty string if the input string contains all digits."

    This is true if such exceptions occur, and is so only for the non-list conversion solution. However, as currently written, none of the solutions are optimized for handling exceptions. For example, if the vendor code is blank, all solutions return an error. Also, if there are no digits in the code, a blank is returned, which may or may not be the proper response.

    Thursday, January 26, 2017 3:32 PM
  • Marcel,

    Speaking of code that looks inefficient, here is an option that I rejected out of hand, but on testing, doesn't appear to affect performance:

    = Table.AddColumn(
    #"Changed Type2",
    "Vendor",
    each List.First(
    Text.SplitAny(
    [Vendor.2],
    Text.Combine(
    {Character.FromNumber(0)..Character.FromNumber(47), Character.FromNumber(58)..Character.FromNumber(256)}
    )
    )
    )
    )

    The change is in the Text.Combine function. Instead of simply using only characters in the range {"A".."z"}, we look at all characters in the extended ANSI character set except for digits 0 to 9. One reason I didn't like this solution was because it seemed that if you have to go through the trouble of excluding the digits from a large set of possible characters, then you're better off with a solution that considers the digits only (which is a better solution in many scenarios). The modified code looks verbose because Character.FromNumber() is a long function name, compared to Char() :^) However, this solution retains the key "advantage" of the original solution, in that we only split the text into a list at the first non-numeric character occurrence, leaving the leading numeric characters together. If we want a list of valid characters only so that invalid characters are flagged (point 3 of previous post), then the character list would be modified accordingly.

    By the way, if we can ignore the first 31 non-printable characters of the extended ANSI character set, then we can avoid the verbosity of using the Character.FromNumber function.

    = Table.AddColumn(
          #"Changed Type",
          "Vendor",
          each List.First(Text.SplitAny([Vendor.2], Text.Combine({" ".."/", ":".."ÿ"}))))

    In the unlikely case that we have to consider every Unicode character (up to decimal 65535) just to exclude digits, now that would result in a significant performance hit!!!



    Thursday, January 26, 2017 7:02 PM
  • Nice!

    Jorge can make a well informed decision.

    Thursday, January 26, 2017 7:48 PM
  • "Jorge can make a well informed decision."

    You think? I suspect by now he's totally confused, thanks to my rambling on about theoretical scenarios. :)

    Thursday, January 26, 2017 8:51 PM