PowerQuery Execution time

  • Question

  • I have lately been involved in quite a few different projects that needed tooling, and PowerQuery has been my tool of choice, with good results. One aspect I keep getting stuck on is execution performance. I have been timing the execution of my entire query chain with a stopwatch and am seeing widely varying results (from 50s to 2m30s for the same query). This is in a workbook with 3 source tables in Excel (hundreds to thousands of rows) and a chain of about 14 queries of varying complexity.

    What exactly does the execution time depend on? For instance, should I expect different results after:
    - running the same query chain on the same data? Is anything reused from cache?
    - changing small parts of the query?
    - restarting Excel (saving, closing, reopening)?
    - making changes to the source tables in Excel?

    I found the posts from Ehren about the caching model and the query execution chain very confusing; even more so because I measured an execution time of 50s before the weekend after making small changes, and now, after pasting in new data of similar size, it is suddenly back to over 2m again.

    Can anyone give some pointers on how to set up the file for a proper execution time 'lab test'?

    Tuesday, May 9, 2017 11:34 AM


All replies

  • Hi there. Are the three Excel tables the only data sources you're pulling from? And are they in the same workbook as the PQ queries in question, or a different file?

    Ehren

    Monday, May 15, 2017 7:43 PM
    Owner
  • Hi Ehren! First, my apologies for not responding sooner; I had an alert activated, but I am not sure how it works and did not get an email. I will look into it.

    The (large) input tables, the queries, and the output tables are all in the same file.

    Sunday, July 9, 2017 1:51 PM
  • Hm. If that's the case then caching shouldn't be affecting the execution times, as I don't think we cache data from the current Excel workbook.

    If you're able to share a copy of the workbook with us (scrubbed of any sensitive data and anonymized, of course), that would be great. Let me know if you're able to do that.

    Ehren

    Monday, July 17, 2017 5:17 PM
    Owner
  • I've also experienced very inconsistent execution times.

    They happened when RAM limits were hit: not the physical memory in use, but the committed memory. This led to hard faults, which seemed to cause longer execution times. Although one MashupContainer only uses up to 263k of RAM, it will start to "block" as much modified RAM as is available ... until the query finishes, or not.

    So depending on which other processes were running on my PC and consuming RAM, the "critical" zone where RAM limits slowed down execution seemed to vary, causing the varying execution times.

    Please have a look at Resource Monitor and check the level of "Commit Charge": does it hit the limit?

    Thx!


    Imke Feldmann - MVP Data Platform (PowerBI) - http://www.TheBIccountant.com 


    Monday, July 17, 2017 9:00 PM
    Moderator
  • Hi Ehren,

    Scrubbing / anonymizing the data would be quite a challenge due to the amount of data involved; would you have any tips on how to do this efficiently? It is my customer's data and I can't release it. Instead I have created a minimal working example.

    I created 5 tables, each with 500,000 rows and 3 columns. Each cell is filled with a random number (generated in Excel). I remove duplicates from each table based on the first column, then cascade down some merges and expands (a rough sketch of the pattern is below).

    https://1drv.ms/f/s!AtbWIhGBTwxqmHdBNu8LiUpnWi9I
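
    In M terms, each query in the cascade is roughly of this shape. The table and column names (Table1, Column1, Table2Query, ...) are placeholders for illustration, not the exact names in the linked workbook:

        let
            // Pull one of the Excel tables from the current workbook.
            Source = Excel.CurrentWorkbook(){[Name = "Table1"]}[Content],
            // Remove duplicates based on the first column only.
            Deduped = Table.Distinct(Source, {"Column1"}),
            // Merge with the next (already deduplicated) query in the cascade...
            Merged = Table.NestedJoin(Deduped, {"Column1"}, Table2Query, {"Column1"}, "Table2Query", JoinKind.LeftOuter),
            // ...and expand the merged columns.
            Expanded = Table.ExpandTableColumn(Merged, "Table2Query", {"Column2", "Column3"}, {"T2.Column2", "T2.Column3"})
        in
            Expanded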

    My testing steps:

    1. Close Excel, then open Excel and open the file from the File menu. The file is set to manual calculation so that the random values stay the same in each test. I press Show Queries, hit the Refresh button and start my stopwatch, then stop the stopwatch when the query has finished loading (the 'refresh circle' turns back into a 'table icon'). Results:

    1:13
    1:03
    1:03

    2. I try the same file, but I remove the last step of the output query, which expands the last merge. Same procedure as above, completely closing and restarting Excel:

    0:50
    0:57
    0:55

    3. I try step 3 [correction: step 1] again; it loads in 1:03. After it has finished loading I go into PQ and remove the last step (expand table), then immediately close the editor. A refresh is triggered. The result is produced in:

    0:28


    So my question is: what causes the last case to go faster? And assuming this were a real-life case with more interdependencies between queries, how would we go about measuring performance and quickly finding performance blockers within the steps?

    I find PQ great for producing reports with a minimum of user interaction, but it is significantly more time consuming to develop in PowerQuery than in Excel, as many of the GUI interactions depend on previews being loaded. If I could at least understand better why my query times are exploding, that would be a great help in learning how to create transformation steps more efficiently in the future.


    Thursday, July 20, 2017 12:18 PM
  • Hi there. Can you clarify what you mean in step 3 by "I try step 3 again" (which I'm assuming was a typo)?

    If I'm understanding your steps correctly (which may not be the case), each successive step runs in less time because there are fewer steps involved. Or am I missing something?

    One way to deal with perf issues when building your queries is to work against a subset of the data. For example, you might do a "Keep Top Rows" and only use the first 100 records from a given source. Then, once you've authored your query, you can remove the "Keep Top Rows" step and let it run against the full data set.
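
    In M, that is just one extra step that can be deleted later. A minimal sketch, assuming a source table named InputTable (the names here are made up):

        let
            Source = Excel.CurrentWorkbook(){[Name = "InputTable"]}[Content],
            // Development-only step: keep just the first 100 rows so previews and
            // refreshes stay fast. Remove this step to run against the full data set.
            KeptTopRows = Table.FirstN(Source, 100)
        in
            KeptTopRows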

    Ehren

    Friday, July 21, 2017 7:18 PM
    Owner
  • Hi Ehren, indeed it was a typo, I meant step 1.

    So I have two PowerQuery setups that each have a different execution time. Setup 2 has one fewer calculation than Setup 1 and therefore takes about 10 seconds less. However, if I execute Setup 1 and then remove the last calculation without closing the Excel file, it takes 40 seconds less instead of 10 seconds less.

    On your second point: unfortunately, working against a subset of the data during development is not satisfying, as many of the cases for which I have to build calculation steps (e.g. for treating exceptions in the data) do not occur if I take a random subset. I have tried it from time to time to zoom in on a specific issue, though.

    Regards, Menno

    Tuesday, July 25, 2017 10:36 AM
  • It's difficult to say. My guess is that it's taking less time in the "warm" case (vs. the "cold" case of freshly starting Excel) due to lots of things already being loaded into memory. If you're using Excel 2016, this includes a fairly substantial dependency: the .NET framework (which is loaded on demand the first time PQ is invoked).

    Ehren

    Tuesday, July 25, 2017 7:45 PM
    Owner
  • That would also mean that even a really small query would have to load the .NET framework, which should account for all those seconds. But a really small query loads very fast even from a cold start.

    Additionally, with more complex queries I have observed even more extreme differences in execution times, to the point where any kind of performance A/B testing became completely useless. If I ever have a good example without sensitive contents I'll get back to you, but it may take a while.


    Thursday, July 27, 2017 9:32 AM
  • Yeah, I was experimenting with this earlier today and realized that in Excel 2016, .NET is loaded when you open a PQ-ified workbook, not the first time you refresh. And it doesn't account for the disparity you're seeing.

    I'll do some more digging.

    Ehren

    Thursday, July 27, 2017 5:36 PM
    Owner
  • Ok, here are my numbers:

    1. 50 seconds
    2. 45 seconds
    3. 37 seconds

    It still seems to me to be related to resources already being pre-allocated (for example, perhaps it's the large amount of sheet data that Excel ends up handing over to PQ). When I did step 3, the Excel process had a significantly higher amount of RAM allocated (after re-doing step 1 and before re-doing step 2) than when doing step 2 from a cold start.

    My recommendation for timing your queries going forward would be to do a full refresh once, and then time subsequent refreshes within the same session. That way, you should eliminate the startup costs which seem to be causing the variation you're seeing.
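
    If you want a rough number from inside the query itself rather than from a stopwatch, one pattern (only a sketch, with a made-up query name OutputQuery, and only approximate because M evaluates lazily) is to stamp DateTime.LocalNow() before and after forcing the query to evaluate:

        let
            StartTime = DateTime.LocalNow(),
            // Table.Buffer forces OutputQuery (a placeholder for the query being
            // measured) to be evaluated and pulled fully into memory.
            Buffered = Table.Buffer(OutputQuery),
            RowCount = Table.RowCount(Buffered),
            // The dependency on RowCount nudges the engine to take the second
            // timestamp only after the buffered table has been counted.
            EndTime = if RowCount >= 0 then DateTime.LocalNow() else null,
            Result = [Rows = RowCount, ApproxSeconds = Duration.TotalSeconds(EndTime - StartTime)]
        in
            Result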

    Hope that helps.

    Ehren


    Thursday, July 27, 2017 7:34 PM
    Owner
  • Thanks Ehren, I'll try it out in my next project.
    Friday, July 28, 2017 6:40 PM