none
How to remove duplicates based on sort order RRS feed

  • Question

  • In Excel, I can sort a table of data and then use Remove Duplicates and it will retain the first row with a unique value, honoring the sorted order.

    When I attempt this in Power Query, Remove Duplicates (Table.Distinct) retains the first row with a unique value based on SOURCE order, disregarding any sorting I have previously performed on the data.

    Here is a brief, contrived example:

    Tom and Mary make office visits periodically and their weight is recorded.  The order of dates and names is jumbled in the source data.  I want to transform the table so that only the information from each person's most recent visit is retained.

    Name DateSeen Weight
    Tom 3/1/2015 170
    Mary 12/23/2014 115
    Mary 10/26/2014 120
    Tom 5/1/2015 175
    Tom 2/1/2015 165
    Mary 6/1/2015 125

    In Excel, I can sort by Name and then by descending DateSeen, then Remove Duplicates on Name and I get what I want:

    Name DateSeen Weight
    Mary 6/1/2015 125
    Tom 5/1/2015 175

    In Power Query, the Remove Duplicates uses the Source Order to select the rows to keep and does not give me the output I expect:

    Name DateSeen Weight
    Mary 12/23/2014 115
    Tom 3/1/2015 170

    let
        Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"DateSeen", type date}}),
        #"Sorted Rows" = Table.Sort(#"Changed Type",{{"Name", Order.Ascending}, {"DateSeen", Order.Descending}}),
        #"Removed Duplicates" = Table.Distinct(#"Sorted Rows", {"Name"})
    in
        #"Removed Duplicates"

    I have found a way to get the results I need, but it is a lot of extra overhead:

    1. I create a new table by doing GroupBy "Name" and compute MaxDateSeen
    2. I join that table back to the original by the Name field
    3. I then filter out all rows where the DateSeen does not equal the MaxDateSeen
    4. I then remove the MaxDateSeen column

    Then I get the results I want.

    Again, it works, but lots of extra overhead.

    Is there any way to force Table.Distinct to honor the prior SORT?  

    Is there another way to do this that is more straight forward and less overhead than my work around?

    Thanks,

    Dale

    

    Friday, June 19, 2015 8:14 PM

Answers

  • OK.  I think I just found my own answer:  Table.Buffer

    I'll certainly need to factor in performance when working on large datasets, but if you buffer the table after the sort, the Table.Distinct appears to work as expected.

    Name DateSeen Weight
    Mary 6/1/2015 125
    Tom 5/1/2015 175

    let
        Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"DateSeen", type date}}),
        #"Sorted Rows" = Table.Sort(#"Changed Type",{{"Name", Order.Ascending}, {"DateSeen", Order.Descending}}),
        #"Buffered" = Table.Buffer(#"Sorted Rows"),
        #"Removed Duplicates" = Table.Distinct(#"Buffered", {"Name"})
    in
        #"Removed Duplicates"

    Dale

    • Marked as answer by Dale Hohm Friday, June 19, 2015 9:46 PM
    Friday, June 19, 2015 9:46 PM

All replies

  • OK.  I think I just found my own answer:  Table.Buffer

    I'll certainly need to factor in performance when working on large datasets, but if you buffer the table after the sort, the Table.Distinct appears to work as expected.

    Name DateSeen Weight
    Mary 6/1/2015 125
    Tom 5/1/2015 175

    let
        Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"DateSeen", type date}}),
        #"Sorted Rows" = Table.Sort(#"Changed Type",{{"Name", Order.Ascending}, {"DateSeen", Order.Descending}}),
        #"Buffered" = Table.Buffer(#"Sorted Rows"),
        #"Removed Duplicates" = Table.Distinct(#"Buffered", {"Name"})
    in
        #"Removed Duplicates"

    Dale

    • Marked as answer by Dale Hohm Friday, June 19, 2015 9:46 PM
    Friday, June 19, 2015 9:46 PM
  • I just used your approach to solve a similar problem. I'm not sure if Table.Buffer is available from the GUI but it worked great. Many thanks! Dale
    Wednesday, September 9, 2015 6:58 PM
  • Thanks very much for this solution! I struggled with this for a few hours until now.
    Wednesday, January 13, 2016 9:13 AM
  • Hello Dale, it worked for me, thanks for sharing.

    Diego Gaona.

    Sunday, August 7, 2016 9:52 PM
  • Awesome! Exactly what I needed!
    Monday, March 6, 2017 8:47 PM
  • There seems to be an implicit index constructed in the backend that is overwritten when you add an index column. After adding the index column, the latest sort is respected when removing duplicates. Lot faster than grouping / merging or buffering the table! Would have been nice to see this in the manual, as we cannot be sure whether after updates this implicit behavior will stay the same.
    Sunday, July 9, 2017 1:46 PM
  • The logic employed by the Table.Distinct function when all of the columns are not selected, is questionable:

    1) It's not clear why the default behavior would be to select distinct rows based on the original row order of the table, and not the most recent sort order.

    2) It's not clear why distinct rows should be selected based on the most recent sort order, after an index column is introduced. Since the index column is not participating in any selection criteria, it's introduction results in a side effect (changes the behavior of Table.Distinct for no obvious reason). 

    Without a better understanding of the inner workings of this function, one is left to conclude that it is somehow flawed.

    Tuesday, July 11, 2017 3:05 PM
  • I am relatively new to PowerPivot and PowerQuery. This solved the exact problem I was having. It was an awesome moment when I saw it worked. Thanks for the leg work and sharing
    Sunday, October 1, 2017 2:16 PM