none
Strange effects with Excel random numbers and Power Query table buffering RRS feed

  • Question

  • Today I ran into a strange phenomenon when splitting an Excel table with random numbers, and combining the parts back: if one of the parts is buffered, then the result may have a different number of records than the original table.

    As usual, I was trying something very complex and difficult to explain ...but fortunately I was able to reproduce the issue with a simple example:

    Input table is an Excel table with 1 column with random numbers between 1-10: =RANDBETWEEN(1,10).

    I have 2 queries, splitting the table in parts with numbers 1-5 and 6-10.
    In both cases, also an Index is added as original sort order.

    Query Result:

    let
        Source = Table1,
        #"Added Index" = Table.AddIndexColumn(Source, "OriginalSort", 1, 1),
        #"Numbers1-5" = Table.SelectRows(#"Added Index", each [Random] <= 5),
        #"Numbers6-10" = Table.SelectRows(#"Added Index", each [Random] >= 6),
        Combined = #"Numbers1-5"&#"Numbers6-10",
        Sorted = Table.Sort(Combined,{{"OriginalSort", Order.Ascending}})
    in
        Sorted

    The same query in which the first part is buffered:

    let
        Source = Table1,
        #"Added Index" = Table.AddIndexColumn(Source, "OriginalSort", 1, 1),
        #"Numbers1-5" = Table.Buffer(Table.SelectRows(#"Added Index", each [Random] <= 5)),
        #"Numbers6-10" = Table.SelectRows(#"Added Index", each [Random] >= 6),
        Combined = #"Numbers1-5"&#"Numbers6-10",
        Sorted = Table.Sort(Combined,{{"OriginalSort", Order.Ascending}})
    in
        Sorted

    Without buffering, the resulting table has always the same number of records as the original table.

    With buffering, the resulting table may have a different number of records (more or less than the input) and also the sort index may have both gaps and duplicates. (Remark: as the random numbers are regularly updated, the resulting tables have different numbers):

    Duplicates, while 6 and 13 are missing

    Can somebody explain why this is happening? It looks like the input table is imported twice, with different numbers.

    The issue only happens occasionally: there may be many consecutive runs without issue.
    I suspect this might be related to limited available memory, that may cause the input table to be loaded twice.

    Wednesday, September 27, 2017 6:17 PM

Answers

  • I think your theory about this discrepancy being load-related is probably correct.

    As you surmise, each time we enumerate the rows -- in this case, that's effectively represented by the two different invocations of Table.SelectRows -- we go back to the source data. Why does this matter? Well, if you watch the sheet closely while refreshing the query, you'll notice that the source data is changing. That's because a change to the sheet causes Excel to automatically recalculate any random cells. So when you see the inconsistent results, the second enumeration is actually seeing different source data than the first enumeration saw.

    This is likely to be very timing-dependent. If the auto-recalculation in Excel is or isn't delayed by a little bit, you either won't see the effect or you will. Introducing the Table.Buffer to the first enumeration adds just a little bit of extra delay in enumerating the first half of the data. I'd guess that adding it to the second enumeration is less likely to produce this effect because there will be less of a pause between the two enumerations.

    And of course, anything that's timing-dependent is going to be sensitive to system load.

    • Marked as answer by MarcelBeug Thursday, September 28, 2017 2:02 PM
    Thursday, September 28, 2017 12:37 PM

All replies

  • I've duplicated your scenario exactly, but I'm unable to duplicate your results after about 100 consecutive refreshes. In every case (table with buffering), all 1,004 rows are returned, with no changes in the sort column. These are small tables, so it's not clear why memory would be a factor.
    Wednesday, September 27, 2017 8:34 PM
  • Thanks Colin and my apologies: I should have elaborated a bit on the suspected memory issues.

    The issue seems to occur after running some other, heavy, queries and while the mashup engine seems to be  still busy with those queries (although they already returned their output, I can hear the fan of my laptop rotating like mad...).

    When I was just testing again, I had my screen casting software running as well:

    https://youtu.be/n5AGS02ha9c


    • Edited by MarcelBeug Thursday, September 28, 2017 5:05 AM
    Thursday, September 28, 2017 4:54 AM
  • I think your theory about this discrepancy being load-related is probably correct.

    As you surmise, each time we enumerate the rows -- in this case, that's effectively represented by the two different invocations of Table.SelectRows -- we go back to the source data. Why does this matter? Well, if you watch the sheet closely while refreshing the query, you'll notice that the source data is changing. That's because a change to the sheet causes Excel to automatically recalculate any random cells. So when you see the inconsistent results, the second enumeration is actually seeing different source data than the first enumeration saw.

    This is likely to be very timing-dependent. If the auto-recalculation in Excel is or isn't delayed by a little bit, you either won't see the effect or you will. Introducing the Table.Buffer to the first enumeration adds just a little bit of extra delay in enumerating the first half of the data. I'd guess that adding it to the second enumeration is less likely to produce this effect because there will be less of a pause between the two enumerations.

    And of course, anything that's timing-dependent is going to be sensitive to system load.

    • Marked as answer by MarcelBeug Thursday, September 28, 2017 2:02 PM
    Thursday, September 28, 2017 12:37 PM
  • That is very interesting, Curt.

    Many thanks for your comprehensive and clear explanation.

    What I'll take from this, is not to use random data from Excel directly as input for Power Query, but first make a static copy in Excel.
    That would be closer to real live situations as well.

    Thursday, September 28, 2017 2:01 PM