none
How do I cache or buffer an intermediate query result for repeated use? RRS feed

  • Question

  • I need to loop through more than 2,000 text files in the cloud, transform each file in two different ways, and then load two tables in the Data Model, one with all the Way 1 data and one with the Way 2 data.

    To keep from opening each text file twice, I set up a query that combines Way 1 and Way 2 data into one query. Then I use the results of that query in two different queries; one grabs the Way 1 data and the other grabs the Way 2 data.

    However, I it appears that the first query's results aren't being cached.

    Is there any way that I can cache a query's results so that I can use them in several other queries without having having the first query recalculate from scratch each time it's called?

    I can't "cache" preliminary results in a worksheet because there's too much data for that.

    Thanks.

    Charley


    Founder, ExcelUser.com





    • Edited by Charley Kyd Friday, February 22, 2019 9:30 PM
    Friday, February 22, 2019 9:26 PM

Answers

  • Table.Buffer doesn't cache data on disk. The table is put in memory. The purpose of doing so is to avoid materializing the table when you make multiple calls to it within a query. Unlike other values that are fixed when defined in a query, tables, lists, and binary data (files on disk) are always recalculated when multiple requests are made to then. Buffering makes these types act like other types that are fixed when defined within a query. In the case of a table, buffering also stops an query folding (not relevant for text files). 

    Queries are never buffered or cached. However, the query sources may or may not be cached. See the following link for a full explanation. 

    https://social.technet.microsoft.com/Forums/en-US/8d5ee632-fdff-4ba2-b150-bb3591f955fb/queries-evaluation-chain?forum=powerquery

    and here's another interesting thread on the subject

    https://social.technet.microsoft.com/Forums/en-US/34e454b5-3a18-4eef-b920-40703c93f390/tablebuffer-for-cashing-intermediate-query-results-or-how-workaround-unnecessary-queries-issue?forum=powerquery

    • Marked as answer by Charley Kyd Tuesday, February 26, 2019 8:37 PM
    Monday, February 25, 2019 4:50 PM
  • Hi Charley,

    In recent versions of Excel 2016+ and Power BI Desktop, multiple query refreshes that are triggered at the same time will share an on-disk (aka persistent) cache. However, this is not true for the PQ addin running in Excel 2010/2013. Which are you using?

    The following info may be helpful, regardless of your environment:

    1. The persistent cache is normally* only updated if the data being requested is read in its entirety. Certain operations (such as Promote Headers) may only read a subset of the data (e.g. the first row, in the case of Promote Headers). This can result in multiple requests to the data source, even when all the data was returned by the first request. If only a subset of the data was consumed, the data will not be cached.
    2. Power Query's persistent cache (which stores data on disk during a particular refresh) is updated via a background thread. And separate evaluations (i.e. separate Microsoft.Mashup.Container.*.exe processes) running at the same time are not coordinated; when evaluation A is accessing the persistent cache (including updating it), this doesn't block evaluation B from accessing the cache. This means that even when using a shared persistent cache, PQ can potentially end up requesting the same data twice. It depends on the timing of the various requests.

    For #1, you can force the data you're requesting to be read in its entirety (and thus cached) the first time it's accessed by surrounding your call to Web.Contents with a call to Binary.Buffer. (Note that this will only work if all the binary data you're accessing will fit in RAM.)

    For #2, you can reduce the number of evaluations happening at any given time (and thus the likelihood of timing-related cache misses, or additional unwanted requests) by doing the following:

    • Disabling background analysis (which pre-calculates the PQ Editor previews behind the scenes)
    • Disabling the Data Privacy Firewall (which does its own evaluations during refresh that could potentially cause duplicate requests, and also only consume a subset of the data in certain cases)

    In PBIDesktop, you can also reduce the likelihood of timing-related cache misses by disabling parallel loading. (In Excel, the loading of multiple queries is always sequential.)

    I hope this is helpful.
    Ehren

    *I say "normally" because certain sources are cached as pages, rather than "all-or-nothing".



    Tuesday, February 26, 2019 7:02 PM
    Owner
  • Hi Charley,

    If understand you correctly, I should set up one query that other queries would call for their data. if nothing else, that would deal with the timing issue. Correct?

    Yes, I'd recommend doing this. But it has more to do with the organization of your queries (i.e. not duplicating code) than anything else. If you were loading multiple queries, each with their own call to Web.Contents("foo"), this would behave exactly the same from a caching perspective as if they all referenced a common query that contained a single call to Web.Contents("foo").

    I could use the same workbook with 2010 and 2013, but because the called query wouldn't save a persistent cache, those queries would take longer. Correct?

    Correct.

    Binary.Buffer/List.Buffer/Table.Buffer all do essentially the same thing, but with different types of input: they load the entire contents of what they're passed into memory. That's it. In your case, this would be a way to force each of your text files to be fully read and cached, in order to avoid situations where certain operations (such as Promote Headers) read a subset of the data, which later causes an extra request due to the full data not getting cached.

    I'd recommend using a tool like Fiddler to watch what happens when you refresh your queries. Start by building the queries without any bells and whistles. Refresh and watch what happens. Then you can experiment with changes to see if they help reduce any duplicate calls. It might be the case that it works as you expect with little or no modification.

    Ehren

    • Marked as answer by Charley Kyd Tuesday, February 26, 2019 8:38 PM
    Tuesday, February 26, 2019 7:59 PM
    Owner
  • Ehren,

    Yes, I'm using Web Contents, which is good, based on your message. But I'm uncertain about the degree to which the data is cached. 

    In Excel, if Query1 uses Web.Contents as a connection only. And then Query2 and Query3 both use Web.Contents during the same refresh to reference the same URL, will they reference the cache or the web? 

    If Query2 and Query3 both call Query1 for their data, will THAT reference the cache?

    And to what degree does your answer depend on the version of Excel that's used? It would appear that we have five relevant versions: 2010 and 2013 with up-to-date add-ins, 2016 and 2019, and 365. 

    Thanks, Ehren!

    Charley


    Founder, ExcelUser.com

    • Marked as answer by Charley Kyd Tuesday, February 26, 2019 8:37 PM
    Tuesday, February 26, 2019 1:01 AM
  • Ehren,

    Generally, I use Excel 365. But I also need to work with 2010 and 2013 at times. 

    All my data will fit into RAM, but I won't know what kind of paging is going on as a result of other processes.

    If understand you correctly, I should set up one query that other queries would call for their data. if nothing else, that would deal with the timing issue. Correct?

    Come to think of it, a general strategy would be to set up functions to load all data, and then to call those functions as needed in other queries. Correct?

    I could use the same workbook with 2010 and 2013, but because the called query wouldn't save a persistent cache, those queries would take longer. Correct?

    What little Microsoft documentation I can find about Primary.Buffer is of little help. But Chris Webb mentioned it in Improving Power Query Calculation Performance With List.Buffer, which was written in 2015. Does List.Buffer or Table.Buffer have any bearing on this discussion? Is four-year-old information about those buffers still accurate?

    Thanks, Ehren!

    Charley


    Founder, ExcelUser.com





    • Edited by Charley Kyd Tuesday, February 26, 2019 7:52 PM
    • Marked as answer by Charley Kyd Tuesday, February 26, 2019 8:38 PM
    Tuesday, February 26, 2019 7:42 PM

All replies

  • If I understand correctly, your queries fork out into two paths.

    Try putting Table.Buffer() around the step of your combined W1 and W2 query, the step that has the latest information before the fork. This should save the tables on your disk, thus preventing you from connecting to each table twice.

    Have you disabled load for the combined query as well?

    Monday, February 25, 2019 12:01 PM
  • Table.Buffer doesn't cache data on disk. The table is put in memory. The purpose of doing so is to avoid materializing the table when you make multiple calls to it within a query. Unlike other values that are fixed when defined in a query, tables, lists, and binary data (files on disk) are always recalculated when multiple requests are made to then. Buffering makes these types act like other types that are fixed when defined within a query. In the case of a table, buffering also stops an query folding (not relevant for text files). 

    Queries are never buffered or cached. However, the query sources may or may not be cached. See the following link for a full explanation. 

    https://social.technet.microsoft.com/Forums/en-US/8d5ee632-fdff-4ba2-b150-bb3591f955fb/queries-evaluation-chain?forum=powerquery

    and here's another interesting thread on the subject

    https://social.technet.microsoft.com/Forums/en-US/34e454b5-3a18-4eef-b920-40703c93f390/tablebuffer-for-cashing-intermediate-query-results-or-how-workaround-unnecessary-queries-issue?forum=powerquery

    • Marked as answer by Charley Kyd Tuesday, February 26, 2019 8:37 PM
    Monday, February 25, 2019 4:50 PM
  • Colin,

    In this case, I'm getting my data from online text files. Are those queries stored in the persistent cache? If so, there shouldn't be much of a problem.

    Is there a way to monitor the consistent cache for myself?

    Thanks.

    Charley


    Founder, ExcelUser.com

    Monday, February 25, 2019 8:41 PM
  • Colin,

    After re-reading Ehren's explanation, there's seems to be no way to avoid two trips to the server. This is because when Q2 calls Q1, I get one persistent cache. But when Q3 calls Q1 I get a SECOND persistent cache.

    Nuts.

    What we need is a Table.CacheAsCSV function!

    Thanks.

    Charley


    Founder, ExcelUser.com

    Monday, February 25, 2019 9:34 PM
  • Hi Charley,

    Are you using Web.Contents? If so, then the request results should be cached. File data sources are not cached on disk if they're accessed via File.Contents, but Web.Contents does perform on-disk caching.

    Ehren

    Tuesday, February 26, 2019 12:30 AM
    Owner
  • Ehren,

    Yes, I'm using Web Contents, which is good, based on your message. But I'm uncertain about the degree to which the data is cached. 

    In Excel, if Query1 uses Web.Contents as a connection only. And then Query2 and Query3 both use Web.Contents during the same refresh to reference the same URL, will they reference the cache or the web? 

    If Query2 and Query3 both call Query1 for their data, will THAT reference the cache?

    And to what degree does your answer depend on the version of Excel that's used? It would appear that we have five relevant versions: 2010 and 2013 with up-to-date add-ins, 2016 and 2019, and 365. 

    Thanks, Ehren!

    Charley


    Founder, ExcelUser.com

    • Marked as answer by Charley Kyd Tuesday, February 26, 2019 8:37 PM
    Tuesday, February 26, 2019 1:01 AM
  • Hi Charley,

    In recent versions of Excel 2016+ and Power BI Desktop, multiple query refreshes that are triggered at the same time will share an on-disk (aka persistent) cache. However, this is not true for the PQ addin running in Excel 2010/2013. Which are you using?

    The following info may be helpful, regardless of your environment:

    1. The persistent cache is normally* only updated if the data being requested is read in its entirety. Certain operations (such as Promote Headers) may only read a subset of the data (e.g. the first row, in the case of Promote Headers). This can result in multiple requests to the data source, even when all the data was returned by the first request. If only a subset of the data was consumed, the data will not be cached.
    2. Power Query's persistent cache (which stores data on disk during a particular refresh) is updated via a background thread. And separate evaluations (i.e. separate Microsoft.Mashup.Container.*.exe processes) running at the same time are not coordinated; when evaluation A is accessing the persistent cache (including updating it), this doesn't block evaluation B from accessing the cache. This means that even when using a shared persistent cache, PQ can potentially end up requesting the same data twice. It depends on the timing of the various requests.

    For #1, you can force the data you're requesting to be read in its entirety (and thus cached) the first time it's accessed by surrounding your call to Web.Contents with a call to Binary.Buffer. (Note that this will only work if all the binary data you're accessing will fit in RAM.)

    For #2, you can reduce the number of evaluations happening at any given time (and thus the likelihood of timing-related cache misses, or additional unwanted requests) by doing the following:

    • Disabling background analysis (which pre-calculates the PQ Editor previews behind the scenes)
    • Disabling the Data Privacy Firewall (which does its own evaluations during refresh that could potentially cause duplicate requests, and also only consume a subset of the data in certain cases)

    In PBIDesktop, you can also reduce the likelihood of timing-related cache misses by disabling parallel loading. (In Excel, the loading of multiple queries is always sequential.)

    I hope this is helpful.
    Ehren

    *I say "normally" because certain sources are cached as pages, rather than "all-or-nothing".



    Tuesday, February 26, 2019 7:02 PM
    Owner
  • Ehren,

    Generally, I use Excel 365. But I also need to work with 2010 and 2013 at times. 

    All my data will fit into RAM, but I won't know what kind of paging is going on as a result of other processes.

    If understand you correctly, I should set up one query that other queries would call for their data. if nothing else, that would deal with the timing issue. Correct?

    Come to think of it, a general strategy would be to set up functions to load all data, and then to call those functions as needed in other queries. Correct?

    I could use the same workbook with 2010 and 2013, but because the called query wouldn't save a persistent cache, those queries would take longer. Correct?

    What little Microsoft documentation I can find about Primary.Buffer is of little help. But Chris Webb mentioned it in Improving Power Query Calculation Performance With List.Buffer, which was written in 2015. Does List.Buffer or Table.Buffer have any bearing on this discussion? Is four-year-old information about those buffers still accurate?

    Thanks, Ehren!

    Charley


    Founder, ExcelUser.com





    • Edited by Charley Kyd Tuesday, February 26, 2019 7:52 PM
    • Marked as answer by Charley Kyd Tuesday, February 26, 2019 8:38 PM
    Tuesday, February 26, 2019 7:42 PM
  • Hi Charley,

    If understand you correctly, I should set up one query that other queries would call for their data. if nothing else, that would deal with the timing issue. Correct?

    Yes, I'd recommend doing this. But it has more to do with the organization of your queries (i.e. not duplicating code) than anything else. If you were loading multiple queries, each with their own call to Web.Contents("foo"), this would behave exactly the same from a caching perspective as if they all referenced a common query that contained a single call to Web.Contents("foo").

    I could use the same workbook with 2010 and 2013, but because the called query wouldn't save a persistent cache, those queries would take longer. Correct?

    Correct.

    Binary.Buffer/List.Buffer/Table.Buffer all do essentially the same thing, but with different types of input: they load the entire contents of what they're passed into memory. That's it. In your case, this would be a way to force each of your text files to be fully read and cached, in order to avoid situations where certain operations (such as Promote Headers) read a subset of the data, which later causes an extra request due to the full data not getting cached.

    I'd recommend using a tool like Fiddler to watch what happens when you refresh your queries. Start by building the queries without any bells and whistles. Refresh and watch what happens. Then you can experiment with changes to see if they help reduce any duplicate calls. It might be the case that it works as you expect with little or no modification.

    Ehren

    • Marked as answer by Charley Kyd Tuesday, February 26, 2019 8:38 PM
    Tuesday, February 26, 2019 7:59 PM
    Owner
  • Hi Charley,

    Above you asked:

    In Excel, if Query1 uses Web.Contents as a connection only. And then Query2 and Query3 both use Web.Contents during the same refresh to reference the same URL, will they reference the cache or the web? 

    If Query2 and Query3 both call Query1 for their data, will THAT reference the cache?

    Here's a diagram that should hopefully answer these questions. The answer is the same regardless of whether Query2 and Query3 each do their own Web.Contents call, or reference a shared connection-only Query1 that does the Web.Contents call.

    Diagram of Charley's Scenario

    Ehren


    Tuesday, February 26, 2019 8:09 PM
    Owner
  • Thanks a lot, Ehren!

    I hope we used a lot of keywords that will help others about this issue, as well!

    Charley


    Founder, ExcelUser.com

    Tuesday, February 26, 2019 8:43 PM
  • Hi Ehren,

    On a side note, accessing data functions should have something like an "isLocal" flag. Most of my clients store files either on a network share or a document management system (a different matter altogether). Because File.Contents mindlessly assume that all files are local, these remote files sources aren't cached when the opposite should be the case.

    On the other hand, I assume that tables on my local SQL Server installation are cached, when they don't need to be. 

    Tuesday, February 26, 2019 11:11 PM
  • Ehren,

    Also, unless I'm missing something (which is likely), there's no way for one query to reference the results of another query that loads to the Data Model...unless we also load it to a worksheet. 

    There ought to be some way that we can cache a table for use in ANOTHER query, even if the initial table is loaded to the Data Model.

    Charley


    Founder, ExcelUser.com

    Tuesday, February 26, 2019 11:22 PM
  • Hi Colin. These sound like great ideas to post on the Excel/PBI UserVoice sites.

    Ehren

    Tuesday, February 26, 2019 11:29 PM
    Owner
  • Hi Charley,

    Yes, you are correct.

    Feel free to post your on-demand-caching idea on the Excel/PBI UserVoice sites.

    Ehren

    Tuesday, February 26, 2019 11:30 PM
    Owner
  • Hi Colin. These sound like great ideas to post on the Excel/PBI UserVoice sites.

    Hi, Ehren. Thanks...but no thanks. :) The one area on the uservoice forums that gets no love at all is anything to do with Power Query/M. There's stuff with much higher priority that doesn't get acknowledged. So I've just learnt to live with the product's limitations or tried to overcome them the best way that I could. 

    Wednesday, February 27, 2019 3:16 PM
  • Ehren,

    To Colin's point, I've had the same issue with Excel. The only suggestion I've ever made that was actually implemented--the IFERROR function--was in an email to someone on the Excel team. I've found the Wish forums to be a write-only cache. 

    Charley


    Founder, ExcelUser.com

    Wednesday, February 27, 2019 4:37 PM
  • I've found the Wish forums to be a write-only cache. 

    Lol! More like a write-only flushed cache!

    Wednesday, February 27, 2019 4:52 PM
  • Ehren,

    I have some follow-up questions, if I may.

    My set of queries downloads about 2500 text files, each of which needs to be split into data and meta-data tables. And then, the data tables must be merged and transformed in various ways. Finally, when I'm done, I have one fact table and several lookup tables that are loaded to Power Pivot. So...

    1. Am I correct in assuming that PQ doesn't maintain a cache of all 2500 text files? 

    2. Suppose the Q1 function merely queries a specified text file and then both Q2 and Q3 call Q1. Does PQ run Q2 and Q3 somewhat in parallel so that they're both accessing the cache at about the same time? If so, I'd only need a one-file cache, not a 2500-file cache...theoretically. Correct?

    3. If Q2 and Q3 *do* run in parallel, what if Q2 has twice the work to do than Q3? Does Q3 wait for Q2? Or does it rush ahead, thus caching many files before Q2 can get to them?

    4. Is there any practical way that I can monitor the number of calls my queries make to the source? Can I monitor the cache in some way?

    Thanks!

    Charley


    Founder, ExcelUser.com

    Wednesday, March 6, 2019 8:25 PM
  • Hi Charley,

    1. If you're accessing them with Web.Contents, PQ will maintain a cache of all 2500 text files (or however many fit within the configured cache size limits). Note that there are multiple on-disk caches: a long-term one for the PQ Editor previews, and temporary ones for each refresh.
    2. PQ doesn't do any coordination across evaluations, so it's possible that Q2 and Q3 could each not find the data they need in the on-disk cache, and then each fetch it themselves. It's also possible that one of them could run first, thus populating the cache for the other one. And either way, the on-disk cache will still end up storing up to 2500 cache entries (depending on the configured cache size limits, of course), one for each file.
    3. No, there's no waiting or coordination across evaluations. Each of them is doing their work without regard for the other evaluations that may be using/caching the same data at the same time.
    4. An easy way to monitor calls to your data sources is to use a tool like Fiddler. Doing so is often also a good indication of what is getting cached vs. not. (For example, if you see foo.csv getting requested multiple times, then it indicates the subsequent calls aren't finding it in the cache.)

    I hope that helps.

    Ehren



    Monday, March 11, 2019 7:38 PM
    Owner
  • Outstanding!

    Thanks, Ehren.

    Charley


    Founder, ExcelUser.com

    Monday, March 11, 2019 7:46 PM
  • Ehren,

    A related issue...

    For one set of web queries, I'm getting random 404 html text. But when I query one at a time, I don't. So I thought I'd add a loop with a short wait, so that I could, perhaps, get good data after several tries.

    However, because PQ caches my initial 404 result, it would seem that my loop would ALWAYS return that initial 404 result.

    So, is there any way to flush the cache for that one page in my loop so that I get refreshed data for each try? Or, is there any way to cause PQ to bypass caching until I get a result worth using again?

    Thanks!

    Charley


    Founder, ExcelUser.com

    Tuesday, March 12, 2019 3:29 PM
  • Hi Charley. See the IsRetry option for Web.Contents.

    https://docs.microsoft.com/en-us/powerquery-m/web-contents

    Ehren

    Tuesday, March 12, 2019 4:46 PM
    Owner
  • (Outstanding!)^2

    Founder, ExcelUser.com

    Tuesday, March 12, 2019 4:50 PM