none
Random Sample in Power Query RRS feed

  • Question

  • I work in Audit and we often have to take random samples from documents that our clients provide us. How can I set up a query to pull random rows from a data set. 

    In the end I want to switch the file (this is assuming there is no SQL database), press refresh and then have 100 random records pop up from the data set. 

    I can get a random number to pop up with Number.Random() as number .... but that is as much as I have been able to figure out. 

    Monday, November 7, 2016 7:14 PM

Answers

  • Out of curiosity, why not provide a direct solution to the request? :)

    Imke attempted a direct solution, but there are two problems:

    1) Buffering the table for each row doesn't work because for each row, you end up recreating the table. The point of buffering is to create an immutable table that can then be applied per row. Therefore, the table must be buffered before the #"Added Custom" step.
    2) An even bigger problem is that random number functions (Number.Random & Number.RandomBetween) are unreliable for generating different random numbers in a column. In most cases, these functions will generate the same number in each row.

    The only reliable solution that I've found for reliably generating a different random number per row (which is documented in another thread on this forum), is using List.Random with an index column.

    An example direct solution could be: 

    let
        Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
        AddedIndex = Table.AddIndexColumn(Source, "Index", 0, 1),
        BufferedRandomNumberList = List.Buffer(List.Random(Table.RowCount(AddedIndex))),
        AddedCustom = Table.AddColumn(AddedIndex, "Random", each BufferedRandomNumberList{[Index]}),
        RemovedIndex = Table.RemoveColumns(AddedCustom,{"Index"}),
        SortedRows = Table.Sort(RemovedIndex,{{"Random", Order.Ascending}}), //order direction not important
        KeptFirst100Rows = Table.FirstN(SortedRows,100)
    in
        KeptFirst100Rows

    In the above code, each time the source is queried, a different random number is assigned to each row. Sorting ensures that the order of the rows will be random on each refresh, thus providing the best possible chance that the top 100 rows will always be a different combination of rows.


    Thursday, November 10, 2016 7:00 PM
  • Nice one Colin,

    so the List.Random will always create unique values?


    Imke Feldmann TheBIccountant.com

    Right. It's not clear to me why the number random/randombetween functions don't always generate a new random number for each row, given that a call to the function is made for each row. They are so inconsistent that I've added a random number column on a table and got a different random number per row. Subsequently, I deleted the query, redid the query under the exact same circumstances (same table, same number of rows, etc.) and got the same random number per row.  I've found List.Random to be 100% consistent.
    Thursday, November 10, 2016 9:59 PM
  • Yes, had that pain as well, but didn't end as successful as you. So thank you very much for  a new entry in my function library :-) :

    let func =  
     (SourceTable as table) =>
    
    let
         Source = SourceTable,
         AddedIndex = Table.AddIndexColumn(Source, "Index", 0, 1),
         BufferedRandomNumberList = List.Buffer(List.Random(Table.RowCount(AddedIndex))),
         AddedCustom = Table.AddColumn(AddedIndex, "Random", each BufferedRandomNumberList{[Index]}),
         Table.AddRandomNumberColumn = Table.RemoveColumns(AddedCustom,{"Index"})
    in
        Table.AddRandomNumberColumn
    
    , documentation = [
    Documentation.Name =  "  Table.AddRandomNumberColumn
    ", Documentation.Description = " Will add a column with a random number to a table
    " , Documentation.LongDescription = " Will add a column with a random number to a table. All numbers will be unique.
    ", Documentation.Category = " Table
    ", Documentation.Source = " https://social.technet.microsoft.com/Forums/en-US/af93118d-98c0-4fa7-bf4c-06765ca7c770/random-sample-in-power-query?forum=powerquery
    ", Documentation.Examples = {[Description =  " 
    " , Code = " 
     ", Result = " 
    "]}] 
    
     in 
      Value.ReplaceType(func, Value.ReplaceMetadata(Value.Type(func), documentation))


    Imke Feldmann TheBIccountant.com

    Friday, November 11, 2016 5:24 AM
    Moderator

All replies

  • Sort the recordset on some key like trx date.

    Index the recordset.

    create a column with a formula that divides by 5, 10, 18 evenly and delete the others.

    Generating a random number per record won't help IMHO because you just get a new column of unrelated numbers.

    Since you can see how many lines there are you can make the divisor as big or small as needed.

    By date would let you sample throughout the FY, but you could use amount, vendor id, some other column.

    I remember in school there was a table in the book to pick out the number to skip between.


    Monday, November 7, 2016 7:59 PM
  • "create a column with a formula that divides by 5, 10, 18 evenly and delete the others."

    So are you saying create a field that divides the index by 5 and then by 10 and then by 18?

    Thanks for your time but you lost me. 

    I am never going to know how big the file is going to be as well. 

    Monday, November 7, 2016 8:08 PM
  • A different way would be to add a column to your table that creates a random number which is within the range of the number of rows from your table:

    Number.RandomBetween(1, Table.RowCount(YourTable))

    Then your sort your table by that new column and take the top or last 100 from that list.

    But in order to make this work, you have to Buffer the table before you sort. Otherwise the random-number-generation will be affected again and screw this up. You can do this by editing your code like this:

    Table.Buffer(Table.AddColumn(YourTable, "Custom", each Number.RandomBetween(1, Table.RowCount(YourTable))))

    So your full code would look like this:

    let
        YourTable = ReferenceToSource,
        #"Added Custom" = Table.Buffer(Table.AddColumn(YourTable, "RandomNumber", each Number.RandomBetween(1, Table.RowCount(YourTable)))),
        #"Sorted Rows" = Table.Sort(#"Added Custom",{{"RandomNumber", Order.Ascending}}),
        #"Kept First Rows" = Table.FirstN(#"Sorted Rows",100)
    in
        #"Kept First Rows"


    Imke Feldmann TheBIccountant.com


    Tuesday, November 8, 2016 6:20 AM
    Moderator
  • Hi Mr. J. You might also consider using Remove Rows->Remove Alternate Rows. This allows you to, for example, keep only every 10th or 100th row, and will work no matter how large the data set is.

    Here's a tutorial video on the feature:

    https://youtu.be/GPsQmcVOxKQ?t=4m20s

    You could also incorporate Number.Random() into the formula to randomize the number of rows to remove. For example:

    = Table.AlternateRows(PreviousStepName, 1, Number.RoundDown(Number.Random() * 10 + 1), 1)

    Ehren


    Tuesday, November 8, 2016 8:07 PM
    Owner
  • I would do this: create a new query that contains random integers, then remove duplicates, Add an index to your data table, Do an inner join to decide which to keep

    Self Service BI Expert using Power Pivot http://exceleratorbi.com.au


    Thursday, November 10, 2016 4:32 PM
  • Out of curiosity, why not provide a direct solution to the request? :)

    Imke attempted a direct solution, but there are two problems:

    1) Buffering the table for each row doesn't work because for each row, you end up recreating the table. The point of buffering is to create an immutable table that can then be applied per row. Therefore, the table must be buffered before the #"Added Custom" step.
    2) An even bigger problem is that random number functions (Number.Random & Number.RandomBetween) are unreliable for generating different random numbers in a column. In most cases, these functions will generate the same number in each row.

    The only reliable solution that I've found for reliably generating a different random number per row (which is documented in another thread on this forum), is using List.Random with an index column.

    An example direct solution could be: 

    let
        Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
        AddedIndex = Table.AddIndexColumn(Source, "Index", 0, 1),
        BufferedRandomNumberList = List.Buffer(List.Random(Table.RowCount(AddedIndex))),
        AddedCustom = Table.AddColumn(AddedIndex, "Random", each BufferedRandomNumberList{[Index]}),
        RemovedIndex = Table.RemoveColumns(AddedCustom,{"Index"}),
        SortedRows = Table.Sort(RemovedIndex,{{"Random", Order.Ascending}}), //order direction not important
        KeptFirst100Rows = Table.FirstN(SortedRows,100)
    in
        KeptFirst100Rows

    In the above code, each time the source is queried, a different random number is assigned to each row. Sorting ensures that the order of the rows will be random on each refresh, thus providing the best possible chance that the top 100 rows will always be a different combination of rows.


    Thursday, November 10, 2016 7:00 PM
  • Nice one Colin,

    so the List.Random will always create unique values?


    Imke Feldmann TheBIccountant.com

    Thursday, November 10, 2016 8:22 PM
    Moderator
  • Nice one Colin,

    so the List.Random will always create unique values?


    Imke Feldmann TheBIccountant.com

    Right. It's not clear to me why the number random/randombetween functions don't always generate a new random number for each row, given that a call to the function is made for each row. They are so inconsistent that I've added a random number column on a table and got a different random number per row. Subsequently, I deleted the query, redid the query under the exact same circumstances (same table, same number of rows, etc.) and got the same random number per row.  I've found List.Random to be 100% consistent.
    Thursday, November 10, 2016 9:59 PM
  • Yes, had that pain as well, but didn't end as successful as you. So thank you very much for  a new entry in my function library :-) :

    let func =  
     (SourceTable as table) =>
    
    let
         Source = SourceTable,
         AddedIndex = Table.AddIndexColumn(Source, "Index", 0, 1),
         BufferedRandomNumberList = List.Buffer(List.Random(Table.RowCount(AddedIndex))),
         AddedCustom = Table.AddColumn(AddedIndex, "Random", each BufferedRandomNumberList{[Index]}),
         Table.AddRandomNumberColumn = Table.RemoveColumns(AddedCustom,{"Index"})
    in
        Table.AddRandomNumberColumn
    
    , documentation = [
    Documentation.Name =  "  Table.AddRandomNumberColumn
    ", Documentation.Description = " Will add a column with a random number to a table
    " , Documentation.LongDescription = " Will add a column with a random number to a table. All numbers will be unique.
    ", Documentation.Category = " Table
    ", Documentation.Source = " https://social.technet.microsoft.com/Forums/en-US/af93118d-98c0-4fa7-bf4c-06765ca7c770/random-sample-in-power-query?forum=powerquery
    ", Documentation.Examples = {[Description =  " 
    " , Code = " 
     ", Result = " 
    "]}] 
    
     in 
      Value.ReplaceType(func, Value.ReplaceMetadata(Value.Type(func), documentation))


    Imke Feldmann TheBIccountant.com

    Friday, November 11, 2016 5:24 AM
    Moderator
  • This is an old post, but I recently faced same problem.  The buffered list approach works well, but I figured out a simpler way that calculates on each row and doesn't require any external references (but you can add them).  Add a custom column and use the expression below to generate a random number between 1 and 50 on each row.  The user can change the "50" to whatever # is needed or use a dynamic reference.  The second variable calculates a random index value and is dynamically calculated based on the "range".

    =let range = {1..50},
           randomindex = Number.RoundDown(Number.RandomBetween(0,List.Count(range)),0)
       in
       range{randomindex}


    • Edited by Pat M2 Wednesday, August 1, 2018 1:21 AM
    Sunday, July 22, 2018 1:42 AM