none
What I miss in Power Query/M RRS feed

  • Question

  • Is the possibility to compare two columns in two different tables to find the rows in table.1, column.x that have the most matches with the words in the text in table.2, column.x and then lets the user, via a UI, make the definitive match via a drag & Drop visual, with that creating the definitive table, column. After all: Power BI users need to Analyse Data - data needs to be clean before it can be analysed - so why not clean the data at the point of analysis.

    Example (sorry for the Dutch text):

    Text in column.1

    Text in column.2

    So, if the text in column.2 contains both “Afschrijving” and “Goodwill” it’s a match with “Afschrijving Goodwill” in column.1.

    When I look at the many possibilities R offers, it’s probably the most efficient to create an R-visual that  offers adialog which allows a user to move options between two lists

    However, this is far out of my league. 

    Can anybody help met with this?

    Friday, August 18, 2017 4:36 PM

Answers

  • In this perspective, I have been developing an application in Excel / Power Query to support the matching and linking of 2 charts of accounts.

    It is based on both matching of words (or substrings) and the hierarchy within the reference chart of accounts.

    As far as I know: wouldn't this be a new concept of using Excel/Power Query as an interactive tool?

    I would really appreciate your honest feedback (I prefer "honest" above "friendly", both would be nice though) on the previous question and on this video (length: 10:39) about the application, including some Power Query specifics (from 6:30 in the video).

    Saturday, September 9, 2017 3:53 PM

All replies

  • I'm not completely sure about how you need to do the comparison, but here an idea. You could try:

    1. Adding an Index column on both tables so you know what row it came from
    2. On either table, do a Merge operation with the correct Join (right outter? inner join?) using the column and the Row Index number (if you need it at that level)
    3. Instead of expanding the table, just aggregate with a simple count
    4. Filter out the ones that are blank and you'll end up with the ones that match on both.
    Thursday, August 24, 2017 4:46 AM
  • Hi Robert

    If I catch your idea, you need to find in column.1 a string that has *most* matches to current string in column.2, and then place this string from column.1 right to the source string from column.2

    If you have a table named "Source" with [Column2] (this is a table and column with 'raw' text, column.2 in your example) and a table named "Lookup" with [Column1] in it (i.e. column.1 in your example), then the next code will help you:

        LookUpList = List.Buffer(Lookup[Column1]),
        LookUpSplit = List.Transform(LookUpList, each Text.Split(_," ")),
        Match = Table.AddColumn(Source, "Matched", (row)=> 
            let
                MatchList = List.Transform(
                    LookUpSplit, 
                    (LookUpString) => 
                        List.Sum(List.Transform(
                        LookUpString, 
                        (LookUpWord) => if Text.PositionOf(
                            row[Column2], 
                            LookUpWord, 
                            null, 
                            Comparer.OrdinalIgnoreCase) > 0 then 1 else 0))
                    ),
                MaxMatch = List.PositionOf(MatchList, List.Max(MatchList))
            in
                LookUpList{MaxMatch}
            )

    This code will add a new column with matched strings from column.1


    Maxim Zelensky Excel Inside

    Saturday, August 26, 2017 12:19 AM
  • This isn't possible in Power Query as you've described it. However, you might want to look at the Fuzzy Lookup tool that Microsoft offers for Excel:

    https://www.microsoft.com/en-us/download/details.aspx?id=15011
    Tuesday, August 29, 2017 6:55 PM
  • if this is about a Fuzzy lookup or close match lookup then Chris has the correct answer. Unless you want to go with the SQL Server version of Fuzzy Lookup instead of the Excel addin. Perhaps you could even use R to do a fuzzy lookup, but I have no clue if that would be a good idea.
    Wednesday, August 30, 2017 2:11 PM
  • Let's start with the statement:

    "So, if the text in column.2 contains both “Afschrijving” and “Goodwill” it’s a match with “Afschrijving Goodwill” in column.1"

    Clearly, there is nothing in that statement to suggests that we need to do a fuzzy lookup, since we are looking for exact text.

    Maxim's solution is on the right track, but goes off the rails somewhere. In my test data, I ended up with some matches from the wrong row. (Maxim, please recheck your code).

    If we assume that we want to match strings exactly (e.g. “Afschrijving Goodwill”), we can use a pattern that's been repeated about a half-dozen times already in this forum:

    let Source = Excel.CurrentWorkbook(){[Name="Source"]}[Content], AccountDescriptionList = List.Buffer(Lookup[accountDesc]), AddedCustom = Table.AddColumn( Source, "accountDesc", each List.Select( AccountDescriptionList, (current)=>
    Text.Contains(
    [IMMATERIELE VASTE ACTIVA],
    current,
    Comparer.OrdinalIgnoreCase
    )
    ){0}?
    ),
    FilteredNullRows = Table.SelectRows(AddedCustom, each ([accountDesc] <> null))
    in
    FilteredNullRows

    On the other hand, the phrase: contains both “Afschrijving” and “Goodwill”, suggests that the words can be in any order within the source text. If this is the case, the following modification can be used:  

    let
        Source = Excel.CurrentWorkbook(){[Name="Source"]}[Content],
        AccountDescriptionList = List.Buffer(Lookup[accountDesc]),
        SplittedListText = List.Transform(AccountDescriptionList, (current)=> Text.Split(current, " ")),
        AddedCustom = Table.AddColumn(
                          Source, 
                          "accountDesc", 
                          each List.Select(
                                  SplittedListText, 
                                  (current)=> 
                                      List.ContainsAll(
                                          List.Buffer(Text.Split([IMMATERIELE VASTE ACTIVA], " ")),
                                          current,
                                          Comparer.OrdinalIgnoreCase
                                      )
                                ){0}?
                       ),
        FilteredNullRows = Table.SelectRows(AddedCustom, each ([accountDesc] <> null)),
        ExtractedValues = Table.TransformColumns(FilteredNullRows, {"accountDesc", each Text.Combine(_, " "), type text})
    in
        ExtractedValues

    Wednesday, August 30, 2017 9:57 PM
  • Hi Colin. Thanks to pointing out. 

    Rechecked my code, this step needs improvement:

    Match = Table.AddColumn(Source, "Matched", (row)=> 
            let
                MatchList = List.Transform(
                    LookUpSplit, 
                    (LookUpString) => 
                        List.Sum(List.Transform(
                        LookUpString, 
                        (LookUpWord) => if Text.PositionOf(
                            row[Column2], 
                            LookUpWord, 
                            null, 
                            Comparer.OrdinalIgnoreCase) >= 0 then 1 else 0))
                    ),
                LM = List.Max(MatchList),
                MaxMatch = List.PositionOf(MatchList, LM)
            in
                if LM = 0 then null else LookUpList{MaxMatch}
            )


    Maxim Zelensky Excel Inside


    Friday, September 1, 2017 7:43 AM
  • Clearly, there is nothing in that statement to suggests that we need to do a fuzzy lookup, since we are looking for exact text.

    Except for " column.x that have the most matches with the words in the text in table.2"

    Friday, September 1, 2017 7:11 PM
  • Hi Colin. Thanks to pointing out. 

    Rechecked my code, this step needs improvement:

    Match = Table.AddColumn(Source, "Matched", (row)=> 
            let
                MatchList = List.Transform(
                    LookUpSplit, 
                    (LookUpString) => 
                        List.Sum(List.Transform(
                        LookUpString, 
                        (LookUpWord) => if Text.PositionOf(
                            row[Column2], 
                            LookUpWord, 
                            null, 
                            Comparer.OrdinalIgnoreCase) >= 0 then 1 else 0))
                    ),
                LM = List.Max(MatchList),
                MaxMatch = List.PositionOf(MatchList, LM)
            in
                if LM = 0 then null else LookUpList{MaxMatch}
            )


    Maxim Zelensky Excel Inside


    Hi Maxim,

    I still see a problem, unfortunately.

    Here are the tables I used for testing:

    Lookup Table

    accountDesc
    Afschrjving Goodwill
    Afschr.gebouwen
    VERVOERMIDDELEN
    AUTOMATISERIING
    Onderh. Werk/nog te faktureren
    Onverdeelde winst
    Reorganisatiekosten

    Source Table

    IMMATERIELE VASTE ACTIVA
    Afschrjving this is a test good
    This is Afschr.gebouwen a test
    This is a test VERVOERMIDDELEN
    AUTOMATISERIING this is a test
    Onderh. This is Werk/nog te faktureren a test
    Onverdeelde This is winst a test
    Reorganisatiekosten

    What should be returned if the lookup text and the source text must be identical:

    IMMATERIELE VASTE ACTIVA accountDesc
    This is Afschr.gebouwen a test Afschr.gebouwen
    This is a test VERVOERMIDDELEN VERVOERMIDDELEN
    AUTOMATISERIING this is a test AUTOMATISERIING
    Reorganisatiekosten Reorganisatiekosten

    What should be returned if all of the words in the lookup list must be in the source:

    IMMATERIELE VASTE ACTIVA accountDesc
    This is Afschr.gebouwen a test Afschr.gebouwen
    This is a test VERVOERMIDDELEN VERVOERMIDDELEN
    AUTOMATISERIING this is a test AUTOMATISERIING
    Onderh. This is Werk/nog te faktureren a test Onderh. Werk/nog te faktureren
    Onverdeelde This is winst a test Onverdeelde winst
    Reorganisatiekosten Reorganisatiekosten

    What your modified code is returning:

    IMMATERIELE VASTE ACTIVA accountDesc Matched
    This is Afschr.gebouwen a test Afschr.gebouwen Afschr.gebouwen
    This is a test VERVOERMIDDELEN VERVOERMIDDELEN VERVOERMIDDELEN
    AUTOMATISERIING this is a test AUTOMATISERIING AUTOMATISERIING
    Reorganisatiekosten Reorganisatiekosten Onderh. Werk/nog te faktureren

    Which is better than what your orignal code returned:

    IMMATERIELE VASTE ACTIVA accountDesc Matched
    This is Afschr.gebouwen a test Afschr.gebouwen Afschr.gebouwen
    This is a test VERVOERMIDDELEN VERVOERMIDDELEN VERVOERMIDDELEN
    AUTOMATISERIING this is a test AUTOMATISERIING Onderh. Werk/nog te faktureren
    Reorganisatiekosten Reorganisatiekosten Onderh. Werk/nog te faktureren

    Friday, September 1, 2017 10:42 PM
  • Clearly, there is nothing in that statement to suggests that we need to do a fuzzy lookup, since we are looking for exact text.

    Except for " column.x that have the most matches with the words in the text in table.2"

    Hi Chris, there are several ways to interpret the text, but the example appears to suggest that, for example, if the lookup text contains “Afschrijving” and “Goodwill” but the source contained only “Afschrijving”, then would be no match.

    However, since Robert appears to have abandoned the thread he started, we may never be sure of the intended interpretation, and all of this discussion may be nothing more than an academic exercise.

    Friday, September 1, 2017 10:54 PM
  • I have not abandoned the conversation and I thank all of you for your valuable input. However, sometimes the answers are far out of my league. I will continue to keep on trying to find a good solution.
    Monday, September 4, 2017 10:11 AM
  • I now add a extra column with a reference code

    Referentiecode =

    VAR account =
        FORMAT ( MID('line'[accountID]; 2; 3); "" )
    RETURN
        SWITCH (
            TRUE ();
            account = "100"; "BIva";
            account = "101"; "Bmva";
            account = "102"; "Bmva";
            account = "103"; "Bmva";
            account = "105"; "Bmva";
            account = "106"; "Beiv";
            account = "107"; "Beiv";
            account = "109"; "BVrz";
            account = "110"; "Blim";
            account = "113"; "Bvor";
            account = "115"; "Bvor";
            account = "116"; "BSch";
            account = "117"; "BSch";
            account = "118"; "BSch";
            account = "119"; "BSch";
            account = "120"; "BSch";
            account = "130"; "BVrd";
            account = "139"; "WKpr";
            account = "140"; "Wper";
            account = "141"; "WKpr";
            account = "143"; "WKpr";
            account = "142"; "Wafs";
            account = "144"; "WPer";
            account = "145"; "Wbed";
            account = "146"; "Wbed";
            account = "147"; "Wbed";
            account = "148"; "Wbed";
            account = "149"; "Wfbe";
            account = "160"; "WPer";
            account = "170"; "WKpr";
            account = "180"; "Womz";
            account = "181"; "Womz";
            account = "185"; "Womz";
            account = "190"; "WBbe";
            account = "199"; "WBbe";
            account = "900"; "WBbe";
            "boekhouder"
        )

    Monday, September 4, 2017 10:16 AM
  • Hi Robert.

    Do I understand you correctly that you try to find solution in the DAX? Because all proposed solutions is in Power Query


    Maxim Zelensky Excel Inside

    Monday, September 4, 2017 12:01 PM
  • Hi Colin!

    Thank you for comment. Sure, I found an error. Your code looks for the entire words of lookup phrase in the entire words of source phrase, but mine code looks for the part of text. Thats why my code found one of the lookup parts in the Onderh. Werk/nog te faktureren, specially "te", in the source phrase "Reorganisatiekosten".

    If this is the case (lookup phrases with prepositions and conjunctions) we can surely omit my code and use yours


    Maxim Zelensky Excel Inside

    Monday, September 4, 2017 12:21 PM
  • Ik moet het nu in DAX doen, omdat Ik niet weet hoe het in M te doen.
    Saturday, September 9, 2017 7:09 AM
  • Ik moet het nu in DAX doen, omdat Ik niet weet hoe het in M te doen.

    Hi Robert,

    We would like to help you to find an M solution. Looking at my post where I show many tables, could you indicate what's wrong with the results?

    Thanks!

    Saturday, September 9, 2017 2:12 PM
  • In this perspective, I have been developing an application in Excel / Power Query to support the matching and linking of 2 charts of accounts.

    It is based on both matching of words (or substrings) and the hierarchy within the reference chart of accounts.

    As far as I know: wouldn't this be a new concept of using Excel/Power Query as an interactive tool?

    I would really appreciate your honest feedback (I prefer "honest" above "friendly", both would be nice though) on the previous question and on this video (length: 10:39) about the application, including some Power Query specifics (from 6:30 in the video).

    Saturday, September 9, 2017 3:53 PM
  • Hi Marcel

    Honestly: my spoken language skills is not good enough (and it is impossible for me to catch meanings of Dutch phrases on the sheet), may be I missed some details, but for me the process in whole is unclear. It corresponds to understanding why you enter one or another number in "Selecteer" the table (here) and then on top of the table at 1:32 - what is the reason to enter level twice? And why you mark as 9 some rows (at 2:53, for example)? what is the difference with other rows? Manually?

    Anyway, I think that this application is an interesting tool to perform fuzzy match with iterations and different conditions. It shows that PQ is not only for transformations and data access, but also could be used as interactive tool, this is really cool example!

    Concerning PQ part: I specially marked this "comment step" approach for me for the future uses - nice tool for long queries!


    Maxim Zelensky Excel Inside

    Wednesday, September 13, 2017 3:17 PM
  • Thanks Maxim, well appreciated.

    Regarding "spoken language": I spent a few hours adding "closed captions" (subtitles) that will be displayed if you choose the "cc" option below the video.

    Regarding the unclarity about the process: maybe I should have been more clear that there actually 2 processes, matching and linking. One query refresh however, may include both matching and linking, so in the "look and feel" of the application, there is no clear distinction between matching and linking.
    Most of the process elements are about matching; the entries in "Selecteer" are about linking (and any rematching in a next round): a 9 means a definitive link of the 2 accounts on that row.
    Another number, e.g. 4 means that the account will be included in a next round of matching at level 4; at the same time, this 4 links the account with the reference account on the same row (which will be a level 3 account), so in the next round of matching, the account will be matched with the reference accounts under the selected reference account at level 3.
    The level at the top of the table defines the level of the current matching round, which is used to select al accounts with the corresponding "Selecteer" value.

    Anyhow I will use your comments to make sure that the process will be more clearly explained in my next video (for the Dutch market), so thanks for that.

    With regard to the "comment step" there is 1 thing to note: if you are in the middle of developing a query via the regular editor (not the advanced editor) and still deleting/moving/adding steps, then the "comment steps" may get referred to, which is not what you want. So I would recommend to add these steps as one of the last steps of the query development.


    • Edited by MarcelBeug Wednesday, September 13, 2017 6:38 PM
    Wednesday, September 13, 2017 6:37 PM
  • Yes, I also like the comments-technique very much.

    Entering them in the advanced editor instead of in the applied steps saves your sanity here :-) 


    Imke Feldmann - MVP Data Platform (PowerBI) - http://www.TheBIccountant.com 

    Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to vote it as helpful - Thanks!

    Friday, September 15, 2017 5:44 AM
    Moderator