locked
Power Query Large File Performance RRS feed

  • Question

  • I'm iterating over two folders of multiple CSVs, the total size is roughly 46MB.

    Two Queries, each looking at data in one folder, updated quickly (c2-3 mins). I then created a third query taking those queries and merging them. My data now takes a long time to update (c1hour+), with a lot of time spent slowly iterating over the files referenced (the Excel power Query dialogue pane shows the different file names and volume of data processes ticking over painfully slowly.

    To try to degug the slowness, I used the excellent Trace Logs Analyser Power BI created by Rui Romano (I can't link in this post apparently, but Google is your friend).

    The slow file iteration created long durations for 'Packagestorage'. However the full query time was hugely taken up by three other processes: RemoteDocumentEvaluator, PackagePartitionAnalysisInfo, and RemotePreviewValueSource.

    Here's a screengrab of the Trace Report for scale of delay (edit, no images allowed either! In essence, the three processes listed are several thousand x larger than everything else, even the painfully slow iteration over the files which is under 'PackageStorage'.):

    My question is, can anyone help me solve the long processing time delay?

    I'm using 64bit Excel (v15.0.4833.1001) with the latest Power Pivot 64 extension (2.34.4372.163), and am going to upgrade Excel to the latest version of office 360 to see if that helps, though I somewhat doubt it will!

    I use c85% of physical memory when running the Power Query refresh, but again, this is only since including the third Query which does minimal processing. Here is the Advanced Editor code from this Query:

    let
        Source = Table.NestedJoin(Adwords,{"Search Term", "Start Date", "Device"},#"Search Console",{"Search Term", "Start Date", "Device"},"Org",JoinKind.FullOuter),
        #"Expanded Org" = Table.ExpandTableColumn(Source, "Org", {"Search Term", "Av Position", "Impressions", "Clicks", "CTR", "Vs Avg. CTR", "Click Opportunity"}, {"Org.Search Term", "Org.Av Position", "Org.Impressions", "Org.Clicks", "Org.CTR", "Org.Vs Avg. CTR", "Org.Click Opportunity"})
    in
        #"Expanded Org"

    Friday, July 22, 2016 11:15 AM

Answers

  • OK, so I've tested a few approaches and nailed down the issue: loading to the Data Model is creating the lag.

    I edited a copy of the file to generate an index column by merging three columns to make a unique index that would allow the relationship to be created in powerpivot. Fast, easy, no problem while still connection only queries.

    However, loading the two tables to the data model (just the data model, i.e. no table, just as in my original configuration) takes just as long as the full join data - i.e. it actually takes 3-4 hours all in.

    So, I began testing, and found if I just sucked it up an loaded the table - i.e. uncheck load to data model (which is no longer needed) and load to a table - the full join query runs just as quickly as my old queries. 4-5 mins or so.

    Similarly, to test this I converted my indexed queries in the duplicate sheet to both load to tables in the same way and, yep, they load quickly.

    I also made sure I was loading to a 'clean' data model by creating new Excel docs and recreating the queries so the first run to the ata model was my finished query, as I know there are issues with loading to the data model when you 'overwrite' with minor changes in your query. That's not the case here.

    So the issue for me is if I load to JUST the data model, my Queries all take a huge amount of time to run. Loading to a table, fixes that.

    This seems counter-intuitive: I thought that the data model was just as viable - and potentially faster - than loading to a 'physical' table in the worksheet.

    Functionally I'm unaffected, I suppose, as it just means I have a table I don't use other than to reference in my pivot, but it *feels* inefficient and against the grain of the data model design.

    So: fixed, I guess, but this does feel like a bug rather than a feature!

    • Marked as answer by CLiversidge Thursday, August 18, 2016 11:25 AM
    Thursday, August 18, 2016 11:25 AM

All replies

  • Hi there. Is the full outer join necessary?

    Alternatively, have you considered loading the two tables to the Data Model separately and creating a relationship between them in Power Pivot (thus avoiding the need to do a Merge/NestedJoin in PQ)?

    Ehren

    Tuesday, July 26, 2016 11:10 PM
  • I wish a full join was not necessary, but yes it is.

    I'll test trying the relationship approach, though it is an approach I started out with before transitioning the logic to Power Query. I was getting into lagging behaviour on Excel's side which is why I went full Power Query. Now I have my data fully transformed in the two tables though it is worth revisiting.

    If it works, I'll update!

    Thursday, July 28, 2016 12:53 PM
  • Hi CLiversidge,

    How did the approach of handling the relationship in the Data Model go?


    Regards,

    Michael Amadi

    Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to vote it as helpful :)

    Website: http://www.nimblelearn.com, Twitter: @nimblelearn

    • Edited by Michael Amadi Monday, August 15, 2016 7:23 PM Minor edit
    Sunday, August 14, 2016 7:50 PM
  • To me this looks as if we should expect the intermediate results being cached once executed and not seeing large iterations over the source-files.

    Try with buffering:

    let
        Source = Table.NestedJoin(Table.Buffer(AdWords),{"Search Term", "Start Date", "Device"},Table.Buffer(#"Search Console"),{"Search Term", "Start Date", "Device"},"Org",JoinKind.FullOuter),
        #"Expanded Org" = Table.ExpandTableColumn(Source, "Org", {"Search Term", "Av Position", "Impressions", "Clicks", "CTR", "Vs Avg. CTR", "Click Opportunity"}, {"Org.Search Term", "Org.Av Position", "Org.Impressions", "Org.Clicks", "Org.CTR", "Org.Vs Avg. CTR", "Org.Click Opportunity"})
    in
        #"Expanded Org"


    Imke Feldmann TheBIccountant.com

    • Proposed as answer by Michael Amadi Monday, August 15, 2016 7:22 PM
    Monday, August 15, 2016 6:08 AM
  • OK, so I've tested a few approaches and nailed down the issue: loading to the Data Model is creating the lag.

    I edited a copy of the file to generate an index column by merging three columns to make a unique index that would allow the relationship to be created in powerpivot. Fast, easy, no problem while still connection only queries.

    However, loading the two tables to the data model (just the data model, i.e. no table, just as in my original configuration) takes just as long as the full join data - i.e. it actually takes 3-4 hours all in.

    So, I began testing, and found if I just sucked it up an loaded the table - i.e. uncheck load to data model (which is no longer needed) and load to a table - the full join query runs just as quickly as my old queries. 4-5 mins or so.

    Similarly, to test this I converted my indexed queries in the duplicate sheet to both load to tables in the same way and, yep, they load quickly.

    I also made sure I was loading to a 'clean' data model by creating new Excel docs and recreating the queries so the first run to the ata model was my finished query, as I know there are issues with loading to the data model when you 'overwrite' with minor changes in your query. That's not the case here.

    So the issue for me is if I load to JUST the data model, my Queries all take a huge amount of time to run. Loading to a table, fixes that.

    This seems counter-intuitive: I thought that the data model was just as viable - and potentially faster - than loading to a 'physical' table in the worksheet.

    Functionally I'm unaffected, I suppose, as it just means I have a table I don't use other than to reference in my pivot, but it *feels* inefficient and against the grain of the data model design.

    So: fixed, I guess, but this does feel like a bug rather than a feature!

    • Marked as answer by CLiversidge Thursday, August 18, 2016 11:25 AM
    Thursday, August 18, 2016 11:25 AM
  • Oh, and I should say, my file size increase by 30MB  (from 5.3 to 35.3) too, so that's another drawback I'd say.

    With very large data that will start making this approach unviable (though arguably I shouldn't be using Excel at that point!).

    Thursday, August 18, 2016 11:34 AM
  • That's interesting.

    How are your settings re Data Load: Creating or updating relationships when loading to Data Model activated?


    Imke Feldmann TheBIccountant.com

    Thursday, August 18, 2016 1:08 PM
  • The relationships are created when ticking the 'Load to Data Model' option in the Query connection 'Load to' dialogue.

    Otherwise they are added when the table is created in Excel in the normal way, i.e. Excel creates the relationship.

    I remove old data model data by editing the xlsx with 7zip to remove the /xl/model folder and repairing the file when I reopen it in Excel. See comment by dejazeus here: https://www.reddit.com/r/excel/comments/40yn76/remove_broken_data_model_w_broken_powerpivot/

    Thursday, August 18, 2016 2:30 PM
  • Sorry, I was a bit unclear here. Tried to ask if any of these boxes are checked:


    Imke Feldmann TheBIccountant.com

    Thursday, August 18, 2016 2:42 PM
  • Ah, yes both are checked.

    Thursday, August 18, 2016 3:39 PM
  • Just wondering if your query would run faster if you uncheck both boxes?

    Imke Feldmann TheBIccountant.com

    Thursday, August 18, 2016 3:51 PM
  • Sadly not.
    Monday, August 22, 2016 8:01 AM
  • I'm very interested in all performance aspects of Power Query and this example is really massive. Would you mind sending a frown to the PQ-team for this to be investigated further - and then share your results here? :-)

    Thanks a lot!


    Imke Feldmann TheBIccountant.com

    Tuesday, August 23, 2016 9:14 AM
  • I'd be happy to - how do I do that?
    Tuesday, August 23, 2016 10:11 AM
  • Super!

    In Excel you can find them here:

    or in the query-editor under File -> SendFeedback


    Imke Feldmann TheBIccountant.com

    Tuesday, August 23, 2016 12:18 PM