Power Query Poor Performance - Join from Data table to tab in workbook

  • Question

  • Hi, 

    We are having performance issues with Power Query in the following scenario.

    We are using a big data platform - SQL Server APS (Analytics Platform System / PDW) - and are proposing Power Query as a self-service application for business users to interact with the data.

    One really common scenario is that users have a list of values that they want to filter a database table by and to return a small subset of rows.

    We are struggling to know how to do this. 

    Currently we store the list of filtering values on an Excel tab and convert it to a query in the Power Query editor pane.

    We then do a merge join with a table and select just the columns we need. 

    What happens is that the client machine runs out of disk space, because Power Query tries to pull the entire table down to the client machine and then join the result set with the filtering values list. This is not going to work for us.

    How can we do an "in database join" so that the filtering values query values are passed down to the APS appliance and joined with the data set there and then just the filtered results are passed back to the client machine?

    Can someone help point us in the right direction?

    Best Regards,

    Geoff


    Thursday, December 15, 2016 2:18 PM

Answers

  • I've solved similar issues by building dynamic SQL queries in Power Query so that the results returned by the database have already been filtered.

    Here is an example where I pass a list of Customer IDs to the WHERE clause of the query:

        CustomerID = Source[CustID],
        CustomerParameterList = List.Accumulate(CustomerID, "0", (state, current) => state & ", " & Number.ToText(current)),
        CustomerParameter = if CustomerParameterList <> "0" then Text.End(CustomerParameterList, Text.Length(CustomerParameterList) - 3) else "9999999999999",
        CustNames = Sql.Database("DBserver", "DatabaseName", [Query="Select CustomerName from Customer where CustID in " & Character.FromNumber(40) & CustomerParameter & Character.FromNumber(41)])

    Here is an explanation of the steps:

    CustomerID is the list of values you are using to filter the data. In this case a customer ID.  This generally comes from another Query or an early step in this query.

    CustomerParameterList basically takes the list of CustomerID values and creates a comma separated string of the values. 

    CustomerParameter is basically a cleanup step.  Because of the way the accumulator is seeded, every comma-separated string begins with "0, " - so for example you might get a string like "0, 101, 24, 203".  I only care about the actual ID values, so this step removes the leading "0, ".  The if statement checks whether the string is just "0", because that means I don't have any parameter values to pass into my query; in that case I substitute a value that would never occur in my DB (9999999999999).

    CustNames is the dynamic query to my database using the CustomerParameter. The basic structure is SELECT Field FROM Table WHERE value IN (CustomerParameter).  Character.FromNumber(40) and Character.FromNumber(41) correspond to "(" and ")". Just build the query string and concatenate the parts together with '&'.

    Using these steps all the filtering of the large table will be done server side with only filtered results coming back to Power Query.
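    The string-building logic of those steps can be sketched outside of M as well. Here is a rough Python equivalent (the table and column names, and the 9999999999999 sentinel, are taken from the post above; everything else is illustrative only):

        def build_customer_query(customer_ids):
            """Build a dynamic SQL string that filters server-side via an IN clause."""
            if customer_ids:
                # Comma-separate the ID values, like the List.Accumulate step.
                parameter_list = ", ".join(str(i) for i in customer_ids)
            else:
                # No parameters: fall back to a value that never occurs in the DB.
                parameter_list = "9999999999999"
            return ("Select CustomerName from Customer "
                    "where CustID in (" + parameter_list + ")")

        print(build_customer_query([101, 24, 203]))
        # Select CustomerName from Customer where CustID in (101, 24, 203)

    The point is the same either way: the full table never leaves the server, because the filter travels inside the query text.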

    I also noticed that Chris Webb's Blog had a post this week about Value.NativeQuery which is a somewhat newer function.  It looks like it could help address similar issues but I haven't actually tried using it myself yet:

    Blog Link




    Thursday, December 15, 2016 10:39 PM

All replies

  • To avoid the "cleanup" step, simply redo the parameter step:

    CustomerParameterList = Text.Trim(List.Accumulate(Source,"",(state,current)=>state & "," & Number.ToText(current)),",")

    In the CustNames step, why can't you just specify the string as:

    Sql.Database("DBserver", "DatabaseName", [Query="Select CustomerName from Customer where CustID in (" & CustomerParameterList & ")"])

    Your particular example doesn't need a native query though, because you can create a Power Query script that folds with a WHERE...IN clause, with the "in" list coming from a parameters table.


    Friday, December 16, 2016 12:13 AM
  • All good suggestions Colin.  I'll have to experiment with using Text.Trim in the second step when I have a chance.

    Another thing the "clean up" step is doing is checking to see if I have no parameters to pass to the sql query.  I'll have to tinker with the second step to see if I can get it do that part as well.

    As far as the CustNames suggestion, I agree that it looks cleaner.  At the time, while figuring this piece out in the context of a much larger and more complex overall query, I kept running into issues keeping my parentheses straight between the M functions and the text strings.  Using Character.FromNumber helped me keep track of things a bit more easily. It was probably due to other sections of the query; I don't recall the details, but once I had it working I didn't revisit this part.

    Friday, December 16, 2016 3:03 PM
  • Hi Mike,

    I'll have to experiment with using Text.Trim in the second step when I have a chance.

    Whenever you concatenate text with a separator using List.Accumulate, you are going to end up with a leading or trailing separator. Text.Trim simply removes the extra separator at the beginning or end of the string. The separator to remove is the second parameter in Text.Trim.

    Another thing the "clean up" step is doing is checking to see if I have no parameters to pass to the sql query.  I'll have to tinker with the second step to see if I can get it do that part as well.

    If the parameter list is empty when passed to the List.Accumulate function, an error will be generated. In the CustomerParameterList step, you could use a try...otherwise clause to trap the error:

    = try Text.Trim(List.Accumulate(Source,"",(state,current)=>state & "," & Number.ToText(current)),",") otherwise "9999999999999"
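    The accumulate-then-trim pattern, including the empty-list fallback, can be sketched in Python (purely illustrative; the 9999999999999 sentinel comes from the earlier post):

        from functools import reduce

        def parameter_list(ids):
            """Fold the IDs into a comma-separated string, then trim the
            leading separator - mirroring List.Accumulate + Text.Trim."""
            try:
                accumulated = reduce(lambda state, cur: state + "," + str(cur), ids, "")
                trimmed = accumulated.strip(",")
                if not trimmed:
                    raise ValueError("empty parameter list")
                return trimmed
            except ValueError:
                # Like the try...otherwise: fall back to a value that
                # never occurs in the database.
                return "9999999999999"

        print(parameter_list([101, 24, 203]))   # 101,24,203
        print(parameter_list([]))               # 9999999999999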

    Here's an example using Power Query queries instead of using a native SQL query. The example uses the Contoso DW database. The table we want to filter is FactSales (4M rows). We want to filter by a list of product keys.

    The first query is simply a table called Parameters, which has a list of product keys in a column named ProductID.

    Excel.CurrentWorkbook(){[Name="Table1"]}[Content]

    Since this query is a single step, we don't need to use let...in and specify a variable name.

    The main query fetches the FactSales table from Contoso:

    let
        Source = Sql.Databases("Colin-PC\SQLSERVERTABULAR"),
        ContosoRetailDW = Source{[Name="ContosoRetailDW"]}[Data],
        dbo_FactSales = ContosoRetailDW{[Schema="dbo",Item="FactSales"]}[Data],
        RemovedColumns = Table.RemoveColumns(dbo_FactSales,{"ETLLoadID", "LoadDate", "UpdateDate", "DimChannel", "DimCurrency", "DimDate", "DimProduct", "DimPromotion", "DimStore"}),
        ProductIDList = Parameters[ProductID],
        FilteredRows = Table.SelectRows(RemovedColumns, each List.Contains(ProductIDList,[ProductKey]))
    in
        FilteredRows

    This query is a modification of the original query created through the Query Editor UI. The original FilteredRows step filtered on just a single product key: ...each [ProductKey] = 956. ProductKey is a column in the FactSales table.

    In the modified query, the step ProductIDList was added (list from Parameters table), and the FilteredRow step changed to ...each List.Contains(ProductIDList,[ProductKey]).

    The query was folded, and the SQL script sent to the database included a where clause with an "in" operator containing the list of product keys (IDs).
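    For illustration only, the effect of that List.Contains filter can be mimicked in Python with invented rows and keys - after folding, the real work happens in SQL on the server:

        # Hypothetical sales rows keyed by ProductKey, standing in for FactSales.
        fact_sales = [
            {"ProductKey": 956, "SalesAmount": 120.0},
            {"ProductKey": 101, "SalesAmount": 75.5},
            {"ProductKey": 203, "SalesAmount": 310.0},
        ]

        product_id_list = [956, 203]  # the Parameters table column

        # Table.SelectRows(..., each List.Contains(ProductIDList, [ProductKey]))
        filtered_rows = [row for row in fact_sales
                         if row["ProductKey"] in product_id_list]

        print([row["ProductKey"] for row in filtered_rows])  # [956, 203]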

    Friday, December 16, 2016 5:02 PM
  • Just to say thanks so much for your posts and quick responses! I am so impressed with you guys and with the Microsoft community.

    We are looking into the solutions you are providing here. Providing some dynamic SQL queries looks like the approach we will use.

    In terms of developing the Power Query product for better "self service" use by business users, it would be good to see some more intelligence built into the product. For example, in the scenario we face here, it would be good if Power Query could detect when to shift heavy processing back to the server to fulfil a business user's demanding query. This would mean there would be no need for a data team to prepare dynamic SQL queries. At the end of the day, fewer hand-offs between business and BI teams means a faster ability to provide business value and contribute to competitive advantage.


    Geoff

    Thursday, December 22, 2016 8:47 AM
  •  For example, in the scenario we face here, it would be good if Power Query could detect when to shift heavy processing back to the server to fulfil a business user's demanding query.


    Geoff

    The PQ example I presented does exactly that. All of the heavy processing is done on the server, and filtering the 4M rows took under a second. PQ doesn't have to detect when to shift heavy processing back to the server. It will always do so, if possible.

    From a filtering perspective in the Query Editor, the mind-boggling omission in the column filter drop-down is a "List" option that can take a literal set of values or a list variable.

    Thursday, December 22, 2016 5:13 PM