none
ODBC and Large Data Sets

    Question

  • I'm having an issue with Power Query (both Power BI and Excel) and am casting a wide net looking for options.

     

    Here's my requirement - the data source has about 300M records. I have the option of using three different database engines - AWS Athena, AWS Redshift, or Presto. The problem that I'm having is with Power Query SQL generation using the ODBC driver. Let's say I select two attribute columns and two measure columns. If I group by both attribute columns, Power Query pushes down the group by SQL to the database and returns only a few records with the measures aggregated appropriately. In my analysis I noticed that one of the measure columns had a bunch of null values that I want to exclude. So I placed a filter on that column. When I hit go Power Query proceeds to try to bring back the entire dataset (filtered, but that only removes 10% of the records) before doing the group by. The client soon chokes on all the incoming data. This seems like a signficant deficiencey in the Power Query engine.

     

    I've tested with Tableau and the same driver. Tableau creates a SQL statmenet like SELECT foo, SUM(foofact) FROM bar WHERE foofact IS NOT NULL GROUP BY foo. That's exactly what I would expect from Power BI.

     

    Any suggestions appreciated...


    Tuesday, February 20, 2018 4:29 PM

Answers

  • Our existing ODBC connector uses ODBC's capability reporting to determine what a driver can and cannot support. If we did the grouping locally, then it was because the capabilities reported by the driver suggested that we could not run the query on the server. Unfortunately, a lot of drivers misreport their capabilities or have other kinds of bugs. That's why (for instance) Tableau has "TDC" files. The "custom data connector" mechanism in Power BI lets you do something similar by wrapping the ODBC driver and overriding some of the information it reports. This is admittedly not very accessible, but we're working on making it easier to share these between users.

    A more sophisticated approach -- which we'd like to implement in the future -- would be to run exemplar queries against an unknown source to figure out what it's actually capable of. It's quite possible that this is what Tableau is doing.

    A special case of this problem is the "top" or "limit" option, because the feature neither exists in SQL-92 (which is what the ODBC spec supports) nor is it reported by the ODBC capabilities mechanism pre-version-4.

    Friday, March 16, 2018 1:40 PM

All replies

  • Hey Scott,

    Would you mind posting your M code?

    You can check if query folding is happening using the view native query on each of the steps of your query:

    From the logs of your ODBC driver, what SQL Statement is being sent? with those pieces of information, you should be able to pinpoint where exactly the query folding is not happening.

    If ALL fails, you can enter the exact SQL statement that you need through the SQL Statement (optional) parameter box when choosing your odbc driver:

    Tuesday, February 20, 2018 6:20 PM
  • Thanks very much for the response. Entering SQL or M code is not an option here. We're using Power BI to allow the users to get at the data without needing to code. I'm working really hard to justify Power BI over Tableau here.

    I'm pretty sure that I know where the "folding" isn't occurring - it can't seem to write a SQL statement with both a where and group by clauses.

    It is sending the following query when coming into the Power Query editor in preview mode (10 min wait):

    SELECT foo1, foo2,...foo74 FROM bar

    Note: NO LIMIT CLAUSE! Tableau uses a limit clause and returns in 5 seconds.

    Then I ask for only my two attributes and two measures (2-3 minute wait):

    SELECT foo1, foo2, foo3, foo4 FROM bar

    Again: no limit clause

    Then I filter where one measure is not null (another 2-3 minute wait):

    SELECT foo1, foo2, foo3, foo4 FROM bar WHERE foo3 IS NOT NULL

    Then I do a group by both attributes with SUM on both measures. I can't provide this query from Power BI as it blows up my client before finishing. It's trying to return all 300M records to my client so it's pretty obvious that it did about the same thing as immediately above except without whatever limits it normally uses.

    Tableau sends the following, which returns in about 5 seconds:

    SELECT foo1, foo2, SUM(foo3) foo3, SUM(foo4) foo4 FROM bar WHERE FOO3 IS NOT NULL GROUP BY foo1, foo2 LIMIT 1000

    Here is the M code from Power BI:

    let
        Source = Odbc.DataSource("dsn=datalake-poc-athena-64", [HierarchicalNavigation=true]),
        AwsDataCatalog_Database = Source{[Name="AwsDataCatalog",Kind="Database"]}[Data],
        hmda_Schema = AwsDataCatalog_Database{[Name="hmda",Kind="Schema"]}[Data],
        hmda_lar_Table = hmda_Schema{[Name="hmda_lar",Kind="Table"]}[Data],
        #"Removed Other Columns" = Table.SelectColumns(hmda_lar_Table,{"action_taken_name", "agency_name", "applicant_income_000s", "loan_amount_000s"}),
        #"Filtered Rows" = Table.SelectRows(#"Removed Other Columns", each ([applicant_income_000s] <> null)),
        #"Grouped Rows" = Table.Group(#"Filtered Rows", {"action_taken_name", "agency_name"}, {{"Income", each List.Sum([applicant_income_000s]), type number}, {"Loan", each List.Sum([loan_amount_000s]), type number}})
    in
        #"Grouped Rows"

    Tuesday, February 20, 2018 7:57 PM
  • The 'View Native Query' option is more of a way to tell if query folding is present at the current step or not.  That's the only method that Power Query has to let the user know if it's doing any query folding or not and, in some cases, it might still be doing some query folding even if the view native query button is grayed out. You shouldn't need any special M code to make it work.

    Nevertheless, you could try changing the filter to "Remove Empty" and that might help you do the filter. 

    also, just to make more tests while working on the query, I'd recommend that your first step uses a "Keep Top Rows" and select 1000 rows. That'll make sure that your query at least loads and you can work on it to see where exactly things are going south.

    I'm going to let the MSFT folks jump on this thread to see if there's anything else that you can do for this.

    ---Calling Ehren for support

    Wednesday, February 21, 2018 7:18 AM
  • Miguel, thank you very much for the thoughtful reply. I did what you said and it had essentially the same impact. Here's the rub:

    • 10 min lag to preview
    • 2-5 min each time I add an operation in Power Query
    • 10 min wait THEN bring 300M records back when I group by
    • In looking at the database, I'm seeing many queries that are essentially SELECT * (no WHERE clause).

    Brining back 300M records back just isn't going to work - even if it didn't take 20 min to get to that point. Also, all of these unfiltered and ungrouped queries are going to create havoc on my DB. 

    As I mentioned - I'm a Power BI guy. Unfortunately, because of the way Tableau works I was able to start from scratch, set up a connection, select data, preview it, filter it, and create a report in less time than I can even open the data in Power Query. Having said that (and so you don't think I'm a shill for Tableau) - Tableau can't carry Power BI's water when it comes to spinning the data. There's no native answer to Power Query. BUT if I can't even get the data, the existence of Power Query is not relevant.

    I've done the following:

    1. Entered an improvement request in the appropriate forum
    2. I'm pursuing support via our corporate Office 365 license - we have 20K seats so I'm hoping for some attention.
    3. Continue to research options...

    For background, here's what Tableau sends to Athena:

    SELECT CAST("hmda_lar"."action_taken" AS VARCHAR) AS "action_taken",
      "hmda_lar"."action_taken_name" AS "action_taken_name",
      "hmda_lar"."agency_name" AS "agency_name",
      SUM("hmda_lar"."applicant_income_000s") AS "sum_applicant_inco",
      SUM("hmda_lar"."loan_amount_000s") AS "sum_loan_amount_00"
    FROM "hmda"."hmda_lar" "hmda_lar"
    WHERE (NOT ("hmda_lar"."applicant_income_000s" IS NULL))
    GROUP BY CAST("hmda_lar"."action_taken" AS VARCHAR),
      "hmda_lar"."action_taken_name",
      "hmda_lar"."agency_name"

    Power BI insists on sending SELECT * (not really *, but the whole field list) FROM hmda.hmda_lar


    Wednesday, February 21, 2018 2:57 PM
  • You mentioned in the original post that you had access to that database through Amazon Redshift. Power BI Desktop has a native connector for that data source. When you install PBI Desktop (msi version) it creates a new folder under "C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers" where you can see the ODBC drivers that installed with it and there you can find the one that it uses for Amazon Redshift.

    Are you getting the same issues when connecting through that connector?

    (note that this connector doesn't use the same ODBC Driver that you find on the Amazon website, but one that was created by Simba technologies )

    I'm sure that the Power Query team will be interested in learning more about your case and that this will be solved, but, in the meantime, I'd highly recommend that an end-user only gets to see a View of the data and not the Tables. Can you imagine even if just 1% of the users (200) do a full query to get ALL the rows from that table, and all doing it at the same time? By only having access to those views, you'll make sure that they only get to see a summarized version of your data without any possibility of ever having 200 or 20,000 users choking your server. I'm not sure how Athena or the rest of the services work, but Power Query can also run database functions (through the UI) and even stored procedures (through some light M coding) - so those are other options that are available.

    Thursday, February 22, 2018 10:55 AM
  • Miguel, thank you once again for the response. Any idea how I can get in touch with the Power Query team? I've got an issue open with Power BI, but they are walking me through the obvious stuff. I'm also trying to get help through our corporate folks.

    Before starting this thread I have tried the Redshift drivers and they have a show stopper bug - they do not show external Redshift tables. Ironically AWS' ODBC drivers had a similar bug. I have been working with a Redshift PM and their driver developer. A new, fixed ODBC driver should be coming out soon. I have a beta copy and while it shows the external tables, Power BI doesn't work any better with it than ODBC to Athena or Presto. The source of the problem unfortunately is Power BI.

    I hear you on the views. There are a couple of things going for me - Athena scales really well as does S3 for storage. It does require an elastic wallet, but since charges can be pegged to the user, cost can help constrain query craziness. For my user base I don't anticipate them being able to overload Athena even if they tried - there's always the 30 min query limit as a safety net. I could be wrong here, but I don't anticipate overwhelming the architecture.

    Another issue with views is that this is a data lake requirement. As soon as I start making aggregation decisions I'm limiting analytics. I'm a star schema guy by background and I want to get away from over-architecting where possible. I truly believe that it's possible here. Even so, at some point I may need to put on some constraints for some use cases. That's fine I just don't want the tools forcing that decision.

    Again, thanks Miguel!

    Thursday, February 22, 2018 3:54 PM
  • I'm hoping that the fix would be something simple like adding something to the connection string be enabling "something", but you can always click on the frown icon of the Power Query tool (on the top right or top left corner) and that way you'll send them not only your comments, but also your code, screenshots and more.

    If that fails, you can also write them directly to this email: pq2fb@microsoft.com 

    The PQ team usually monitors this forum, so it's only a matter of time until it catches their attention.

    Thursday, February 22, 2018 7:09 PM
  • Scott, could you please share your power query trace files? It should have the information why your query is not folded. If you don't want to share it with the public, you can send to me directly at xinguo@microsoft.com

    Of course, please remove the sensitive information in the trace files before sending :)

    • Proposed as answer by Miguel Escobar Tuesday, February 27, 2018 5:52 PM
    Monday, February 26, 2018 8:05 PM
  • I now have an open support case via our corporate account. The case # is 118022217693054. They have the trace files and a bunch of other things. Thus far I'm getting "Works as designed." I'm still pushing. Will post resolution here once I have one.
    Tuesday, February 27, 2018 11:40 PM
  • I promised a post when I got a resolution. There isn't one at this time, but we closed the support case.

    It seems that when working with ODBC sources Power BI just doesn't do much in the way of writing efficient SQL. The behavior when filtering then grouping is to filter at the source, but group on the client. Also the previews do not send limit clauses. Effectively this all means that Power BI can't work with datasets beyond a few thousand records with ODBC. 

    The development team did accept this as a potential improvement so there is hope that someday this capability will be added. Until then unfortunately I'll have to use a competing product to Power BI.

    Friday, March 16, 2018 1:02 PM
  • Our existing ODBC connector uses ODBC's capability reporting to determine what a driver can and cannot support. If we did the grouping locally, then it was because the capabilities reported by the driver suggested that we could not run the query on the server. Unfortunately, a lot of drivers misreport their capabilities or have other kinds of bugs. That's why (for instance) Tableau has "TDC" files. The "custom data connector" mechanism in Power BI lets you do something similar by wrapping the ODBC driver and overriding some of the information it reports. This is admittedly not very accessible, but we're working on making it easier to share these between users.

    A more sophisticated approach -- which we'd like to implement in the future -- would be to run exemplar queries against an unknown source to figure out what it's actually capable of. It's quite possible that this is what Tableau is doing.

    A special case of this problem is the "top" or "limit" option, because the feature neither exists in SQL-92 (which is what the ODBC spec supports) nor is it reported by the ODBC capabilities mechanism pre-version-4.

    Friday, March 16, 2018 1:40 PM
  • Curt, thank you very much for providing a clear explanation of the specifics. I was unaware of the subtleties of identifying the capabilities of an ODBC driver. I'm relatively certain that Tableau has explicitly specified the Athena as well as Redshift properties as there is a selection for both in the connection type list. I'm looking forward to this improvement in Power Query.

    Thanks again!

    Tuesday, April 3, 2018 9:21 PM
  • Dear Scott,

    Did you ever find a solution on how to prefilter the ODBC query at step one - avoding all data to load into Power BI before filtering?

    BR, Henrik
    WIKAP.dk


    Resultater med Microsoft CRM

    Wednesday, September 26, 2018 6:52 AM