none
PQ query optimization RRS feed

  • Question

  • Hi Curt,

    Some time ago Chris Webb wrote about using Table.Buffer() to cache table for reusing it in next PQ steps.

    I tried use this technique but timing the same.

    Could you please share more information how to use Buffer() properly?

    Thanks,

    Paul

    Monday, May 18, 2015 3:52 PM

Answers

  • Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.

    If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won't help performance.

    Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB -- which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.

    For simple scenarios against relational databases, the most likely source of slowness is that we weren't able to push enough of the query to run on the server. If you're reasonably familiar with SQL, you can turn on logging and then look at the trace files to see the SQL that was actually run -- and compare it to your query to see how much work is happening locally.

    • Marked as answer by Paul Levchuk Monday, May 18, 2015 5:19 PM
    Monday, May 18, 2015 4:09 PM

All replies

  • Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.

    If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won't help performance.

    Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB -- which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.

    For simple scenarios against relational databases, the most likely source of slowness is that we weren't able to push enough of the query to run on the server. If you're reasonably familiar with SQL, you can turn on logging and then look at the trace files to see the SQL that was actually run -- and compare it to your query to see how much work is happening locally.

    • Marked as answer by Paul Levchuk Monday, May 18, 2015 5:19 PM
    Monday, May 18, 2015 4:09 PM
  • Thanks for quick reply.

    Actually, almost all data which I load from DBs are from native SQL queries.

    So Query Folding does not take place. 

    But I would like to understand how to improve performance in PQ.

    I dont measure how much memory SQL queries eat, but taking in to account that a lot my tables 500K+ rows, I think 256Mb is not enough.

    Probably there is some guidelines for above mentioned case (when used Native SQL Query) which PQ operations cost less?
    Or there is some guideline for table merging?

    -- Paul
    Monday, May 18, 2015 4:35 PM
  • Currently, "table at a time" operations like joins, sort, many groupings, pivot, unpivot, etc., all happen in RAM (unless folded). For large tables, these will consume a lot of memory.

    In the Windows Task Manager, you can add a column to the detailed view which shows "Peak working set". Keep an eye on this for your Microsoft.Mashup.Container processes to get an idea for how much memory you're using to run your queries. If it's much over 256MB, consider using "Send a Frown" to report the size. If we get enough requests like that, we may add a configuration option to let the user change the limit.

    Monday, May 18, 2015 8:47 PM
  • Hi Curt,

    Actually I have already looking for Microsoft.Mashup.Container and can tell you that PQ splits processing logic for 2 up to 5 such processes. I think they concur because often they all together eat 100% processor time and after that I got 100% of Disk Usage. 

    I don't understand, if PQ uses memory to join tables why this is much slower than PP which also uses memory. Is it just because PP is columnar based storage? Or there is another reason?

    Actually MS has done good job and I have a choice how to load data: via PQ or via PP. But often I see that loading data via PP is much more efficient. Data loaded with bigger chunks (PP 120K rows instead of 30K in PQ using MySQL connector). I going to unify my approach for data loading that is migrate for data loading to PQ, but I cant because of PQ slow performance.

    On the other hand 21 century is BigData century so volume of data will be just growing and growing very fast. So I hope MS will improve PQ mechanism of loading Big Data.

    -- Paul

    Tuesday, May 19, 2015 6:37 AM
  • Is there any written information with regard to RAM and memory usage in Power Query and best practices? Thanks!
    Monday, January 11, 2016 4:32 AM
  • Hi Paul,

    Sorry to ask this question almost one year later. I just saw this post. When you said "load data via PP", do you mean using PP directly to load data? If I choose Data Model ->From Database ->From SQL Server, I will encounter this window. But the retrieving speed is really slow, around 30,000 rows per second. Is this the normal case?

    Monday, May 9, 2016 4:44 PM