Group by turning estimated execution from batch mode into row mode - Columnstore index overhead

  • Question

  • Hi there,

    This may contain very little information about my issue, and I apologize for that beforehand, but I find it hard to replicate this with fake data. I'm also not quite sure if this is the right place to post this, but any help is greatly appreciated.

    Some background:

    I'm running a query in Excel 2016 Power Query that merges 4 tables using a left outer join for each. The tables reside in an on-premises SQL Server 2016 relational database, and all of them have a clustered columnstore index. If I run the query directly, without Power Query, I get the following:

    - Total rows that result from query = 56886

    - After aggregation (grouping on the 4 columns), rows that result from the query = 1880

    The Issue:

    When I apply a Group By step - using Power Query - on the same 4 columns after all the merges are applied, the query freezes. If I look at the execution plan - by copying the query from SQL Profiler into Management Studio and running it - I see that some tables have huge reads, and my cool clustered columnstore indexes are no longer the coolest: after enabling the live execution plan and checking where the bottleneck occurs, I see that one table goes from "batch mode" into "row mode" execution. This causes a massive overhead; I had to stop the query when the "estimated number of rows read" on one table was reaching 1,300,000,000 (I know, I'm not missing a zero here).

    If I go one step back and delete the Group By step, the query runs in 2 seconds, and the estimated number of rows read for the same table is 1,128,290.

    Why is a humble Group By causing such a massive overhead? Within the live execution plan I can see that, when I apply the Group By, quite a few Table Spool and Nested Loops operators get created.
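    For anyone digging into the same thing, a minimal sketch of how the per-operator progress of the running query can be watched from a second session, assuming query profiling is enabled for it (e.g. via Live Query Statistics); the session_id of 55 is a placeholder:

    -- Hedged sketch: watch per-operator row counts of a running query.
    -- Assumes the target session has query profiling enabled (e.g. Live
    -- Query Statistics in SSMS); 55 is a placeholder session_id.
    select node_id,
        physical_operator_name,
        row_count,           -- rows produced so far
        estimate_row_count   -- optimizer's estimate for this operator
    from sys.dm_exec_query_profiles
    where session_id = 55
    order by node_id;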

    Thanks!

    Monday, July 31, 2017 8:00 AM


All replies

  • Hi there. Is there an issue in the SQL query that PQ is generating? Or is this more an issue of tuning the backend SQL db?

    If it's the former, can you provide the text of the problematic SQL query that we're generating?

    Ehren

    Monday, July 31, 2017 7:10 PM
  • Hi Ehren, thanks for replying. The code I get from SQL Profiler is below (I have changed the names of the fields).

    execute sp_executesql N'
    select 
    [rows].[FirstName] as [FirstName],
        [rows].[LastName] as [LastName],
        [rows].[Country] as [Country],
        [rows].[Location] as [Location]
    from 
    (
        select [$Outer].[Scan_Door_ID],
            [$Outer].[Guest_ID],
            [$Outer].[FirstName],
            [$Outer].[LastName],
            [$Outer].[Country],
            [$Inner].[Location]
        from 
        (
            select [$Outer].[Scan_Door_ID],
                [$Outer].[GDate],
                [$Outer].[G_LocationID],
                [$Outer].[GStatus_ID],
                [$Outer].[Guest_ID],
                [$Inner].[FirstName],
                [$Inner].[LastName],
                [$Inner].[Country]
            from 
            (
                select [$Outer].[Scan_Door_ID],
                    [$Outer].[GDate],
                    [$Outer].[G_LocationID],
                    [$Outer].[GStatus_ID],
                    [$Inner].[Guest_ID]
                from 
                (
                    select [_].[Scan_Door_ID],
                        [_].[GDate],
                        [_].[G_LocationID],
                        [_].[GStatus_ID]
                    from 
                    (
                        select [Scan_Door_ID],
                            [GDate],
                            [G_LocationID],
                            [GStatus_ID]
                        from [dbo].[Scan] as [$Table]
                    ) as [_]
                    where ([_].[GDate] = convert(datetime2, ''2017-07-27 00:00:00'') and [_].[GDate] is not null) and [_].[GStatus_ID] < 10
                ) as [$Outer]
                left outer join 
                (
                    select [Door_ID],
                        [Guest_ID],
                        [VarCode]
                    from [dbo].[Guest_Access] as [$Table]
                ) as [$Inner] on ([$Outer].[Scan_Door_ID] = [$Inner].[Door_ID] or [$Outer].[Scan_Door_ID] is null and [$Inner].[Door_ID] is null)
            ) as [$Outer]
            left outer join 
            (
                select [Guest_ID] as [Guest_ID2],
                    [FirstName] as [FirstName],
                    [LastName] as [LastName],
                    [Country] as [Country]
                from [dbo].[Guest] as [$Table]
            ) as [$Inner] on ([$Outer].[Guest_ID] = [$Inner].[Guest_ID2] or [$Outer].[Guest_ID] is null and [$Inner].[Guest_ID2] is null)
        ) as [$Outer]
        left outer join 
        (
            select [Location],
                [G_Location_ID]
            from [dbo].[Region] as [$Table]
        ) as [$Inner] on ([$Outer].[G_LocationID] = [$Inner].[G_Location_ID] or [$Outer].[G_LocationID] is null and [$Inner].[G_Location_ID] is null)
    ) as [rows] 
    group by 
    [rows].[FirstName] ,
        [rows].[LastName] ,
        [rows].[Country] ,
        [rows].[Location]'

    If I take out the Group By aggregation then my life turns to happiness, but I want grouped rows to avoid overloading the Excel spreadsheet.

    Thanks!

    Monday, July 31, 2017 10:13 PM
  • Hi there. I shared the SQL query with our team, and it looks to be what we'd expect to generate, given the PQ queries you described. Is there a more efficient SQL query you'd prefer that we generate (one that's semantically equivalent but yields better performance)? If so, we could look at improving the SQL we generate. But if not, you'll likely need to tune your backend SQL server to yield better performance on the above query, and this isn't something we're equipped to help with.

    Ehren

    Tuesday, August 1, 2017 4:03 PM
  • Hi Ehren, 

    Thanks very much for this quick reply. Regarding your answer, you are right: more efficient code can be written (if you check the execution plan, a new branch of Nested Loops and Table Spool operators gets created after applying the Group By, and it can be avoided if the subqueries are avoided). In fact, I have re-written this code in SQL and the performance improves hugely. That said, this would make letting end users run analytics through PQ inefficient, and pushing the work back to a SQL techie who runs the query straight from SQL Server Management Studio against the DB would be the choice.

    I believe this can still be optimized in the Mashup engine, perhaps by not creating so many subqueries for non-manipulated tables in a dataset:

    - If no data transformation or grouping is added, push the selection into the scope of the existing subquery, if one exists.

    Also, some additional suggestions; I hope I'm making sense here.

    - Improve the dictionary that maps native PQ functions to DB engine functions, to avoid Excel pulling records from the DB to the local computer to run basic calculations. For instance, try Date.Year(date_field): it pulls all records through to Excel to apply the formula. Instead, recognizing that SQL has its own year() function, the records should stay in the DB (see the sketch after this list).

    - Avoid additional Compute Scalar operators: for instance, converting fields of type datetime into datetime2.
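    As a hedged illustration of the first point: if the Date.Year step folded all the way to the server, the generated SQL could evaluate the year in the database instead of locally. The columns come from the queries above; the alias [GYear] is made up.

    -- Hedged sketch: the year calculation evaluated server-side rather
    -- than after pulling all rows into Excel. [GYear] is a made-up alias.
    select [Scan_Door_ID],
        year([GDate]) as [GYear]
    from [dbo].[Scan]
    where [GStatus_ID] < 10;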

    Thanks!  


    Wednesday, August 2, 2017 2:20 AM
  • Hi there.

    I have re-written this code in sql and the performance improves hugely

    Can you share the rewritten version of the above SQL query? This will help us understand how we could better optimize our generated SQL.

    Thanks,

    Ehren



    Wednesday, August 2, 2017 4:41 PM
  • Hi Ehren,

    Please find the query below; my apologies for the late reply.

    select
    [Guest].[FirstName] as [FirstName],
        [Guest].[LastName] as [LastName],
        [Guest].[Country] as [Country],
        [Region].[Location] as [Location]
    from [dbo].[Scan] as [$Table]
    left outer join [dbo].[Guest_Access]
    on [$Table].[Scan_Door_ID] = [Guest_Access].[Door_ID]
    left outer join [dbo].[Guest]
    on [Guest_Access].[Guest_ID] = [Guest].[Guest_ID]
    left outer join [dbo].[Region]
    on [Region].[G_Location_ID] = [$Table].[G_Location_ID]
    where
    [$Table].[GDate] = '2017-07-27' and [$Table].[GStatus_ID] < 10
    group by
    [Guest].[FirstName],
        [Guest].[LastName],
        [Guest].[Country],
        [Region].[Location]

    Thursday, August 17, 2017 4:50 AM
  • Our team took a look and they suspect that the perf difference is due to the handling of NULL. To preserve the semantics of M, we have to include logic in the query to ensure that NULLs will match when doing the join.

    If you add logic to your SQL query that causes NULLs to match, does that slow the perf down?
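    One commonly cited way to test that without the OR'd predicate is the EXISTS/INTERSECT rewrite, which matches NULLs the way M requires but typically still allows a hash or merge join; a hedged sketch against two of the tables above:

    -- Hedged sketch: a NULL-matching join written with EXISTS/INTERSECT
    -- instead of "a = b or (a is null and b is null)". INTERSECT treats
    -- NULLs as equal, and the optimizer can usually still pick a hash or
    -- merge join for this form.
    select [s].[Scan_Door_ID], [ga].[Guest_ID]
    from [dbo].[Scan] as [s]
    left outer join [dbo].[Guest_Access] as [ga]
        on exists (select [s].[Scan_Door_ID] intersect select [ga].[Door_ID]);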

    Ehren

    Thursday, August 17, 2017 5:36 PM
  • Hi Ehren,

    Actually, there are no null rows in any of the data tables, as I'm matching by PK/FK initially. I think the problem is, first, that the Mashup engine is passing a lot of fields in the subqueries, which impacts the execution plan and changes it compared to the hand-written SQL query (adding additional index scans and scalar functions). Second, the scalar functions can be avoided: there is no need to transform a date field into datetime2.

    In addition to this, transformations are not mapped onto SQL's relational engine and its native set of functions. I reckon there should be a module within Mashup that checks whether an M function is applicable at the database layer; by doing this, no rowsets would be passed to the spreadsheet for processing.

    I have also noticed that separate subqueries, once merged (e.g. pull tables a, b, c and d, then merge a with b and c with d), produce extra overhead for the Mashup engine and the SQL engine, as a new subquery gets added atop the merged tables, and sometimes this gets processed in the spreadsheet even when it could have been done in the database engine too.

    I hope this helps,

    Thanks

    Saturday, August 19, 2017 8:06 AM
  • It would be useful to see both the table definitions and the M query which led to these SQL queries.

    When you say "there are no null rows": we don't generate queries based on anything other than metadata, so what matters here is whether our type system thinks the value might be null, not whether there are actually any null values in the table. Consider the clause

    where ([_].[GDate] = convert(datetime2, ''2017-07-27 00:00:00'') and [_].[GDate] is not null) and [_].[GStatus_ID] < 10

    Here, you can see that our type system believes that GStatus_ID is not nullable because there's no clause testing for null in the query. GDate, on the other hand, is considered to be nullable. Are you saying that the definition of the GDate column in the database is non-nullable?

    In other cases where there's a null check, one of the comparands is sometimes on the outer side of an outer join. So even if the underlying table column is defined as non-nullable, the projection may in fact contain a null.
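    For reference, a hedged sketch of how the declared nullability and type of those columns can be checked straight from the catalog views (table and column names taken from the queries above):

    -- Hedged sketch: inspect the declared type and nullability of the
    -- columns in question via the catalog views.
    select c.name as column_name,
        t.name as type_name,
        c.is_nullable
    from sys.columns as c
    join sys.types as t
        on t.user_type_id = c.user_type_id
    where c.object_id = object_id(N'dbo.Scan')
        and c.name in (N'GDate', N'GStatus_ID');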

    With respect to the "convert(datetime2, ", are you saying that the type of the GDate column is date and not datetime2? If so, we should be converting to date and not datetime2. That said, I would be very surprised if there were any performance impact of this, as the SQL query optimizer ought to be constant-folding both the explicit and the implicit conversion operations. It would be interesting to learn whether changing just that part of the query we emitted had any impact on overall performance.
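    A hedged sketch of that isolated test: the same filter with the conversion target changed from datetime2 to date, so any difference can be attributed to that one change:

    -- Hedged sketch: the generated filter with the literal converted to
    -- date (the column's declared type) instead of datetime2.
    select [Scan_Door_ID], [GDate], [G_LocationID], [GStatus_ID]
    from [dbo].[Scan]
    where [GDate] = convert(date, '2017-07-27')
        and [GDate] is not null
        and [GStatus_ID] < 10;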

    With respect to additional transformations in M being converted (or "folded" as we say) to SQL, it really depends on the transformation. There are some we can fold and some we can't. Of the ones we can fold, there are undoubtedly some for which the work simply hasn't been done to actually make it happen. Are there transformations in particular you've observed this for?

    Saturday, August 19, 2017 3:36 PM
  • Hi Curt,

    Thanks very much for your answer. I appreciate the very prompt feedback I'm receiving, so I will try to get back accordingly.

    When I referred to "no null rows" I meant that there are no null key values to be matched in the join/merge. That IS NOT NULL check does create a Filter stage, but as far as my query goes it does not impact the execution plan, since the filter reads keys that are indexed in the source DB. But to answer your question: none of the fields (GDate, GStatus_ID) have been set as non-nullable in the source DB.

    Regarding GDate being converted to datetime2: the source field comes through as type date. I reckon (from my non-advanced knowledge of both M and SQL) that passing a smaller field to the processor will have an impact regardless of the implicit conversion.

    "Folded"! Thanks, I keep forgetting the term, hehe. At the moment I have been creating new columns using Date.Year/Month/Day and basic text extractions. What I have seen from other test examples is that - as the very first post shows - the query does not get "folded" (run on the engine side) if these M functions occur outside an already aggregated subquery, even though, for the scope of the example, a datepart(year, [GDate]) and a substring(textfield, 1, x) could have been applied in the database and the result set exposed in the Power Query pane afterwards. I wouldn't dare comment on additional transformations, but it would be great to see what does and does not get folded naturally by the M engine.
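    As a hedged sketch of the substring case (the year case was sketched earlier in the thread), assuming a character column; [textfield] is a placeholder name and [Prefix] a made-up alias:

    -- Hedged sketch: a basic text extraction evaluated server-side.
    -- [textfield] is a placeholder column; [Prefix] is a made-up alias.
    select substring([textfield], 1, 3) as [Prefix]
    from [dbo].[Guest];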

    I hope I make sense from my, still, learning stage, and again thanks for your help. Power Query is amazing, and I think there is no other tool as cool and out-of-the-box as this one.

     

    Sunday, August 20, 2017 7:11 AM