Question on UNION ALL.

    Question

  • Hi there, I've come across an exam question and wonder if someone can give me some advice:

    The data in these two tables are distinct from one another, and each table contains more than 100 million rows. Each table has a primary key column named SalesOrderId. Users want a report that includes aggregate information about the total number of global sales and the total sales amount. We need to ensure that the query executes in the minimum possible time:

    SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
    FROM (SELECT SalesOrderId, SalesAmount FROM DomesticSalesOrders
          UNION ALL
          SELECT SalesOrderId, SalesAmount FROM InternationalSalesOrders) AS p

    SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
    FROM DomesticSalesOrders
    UNION ALL
    SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
    FROM InternationalSalesOrders

    The paper gives the second query as the answer, but I noticed that the second query produces a result table with two rows, unlike the first query, which produces one row and so seems more likely to fulfil the user's request. I would like to know how the second query outperforms the first one in terms of performance.
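
    To illustrate what I mean about the result shapes, here is a small sqlite3 sketch with tiny stand-in tables (the table names follow the question; the data are made up):

```python
import sqlite3

# Tiny in-memory stand-ins for the two sales tables (hypothetical data).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE DomesticSalesOrders(SalesOrderId INTEGER PRIMARY KEY, SalesAmount REAL);
    CREATE TABLE InternationalSalesOrders(SalesOrderId INTEGER PRIMARY KEY, SalesAmount REAL);
    INSERT INTO DomesticSalesOrders VALUES (1, 10.0), (2, 20.0);
    INSERT INTO InternationalSalesOrders VALUES (100000001, 5.0);
""")

# First form: aggregate over the UNION ALL -> a single row.
one_row = con.execute("""
    SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
    FROM (SELECT SalesOrderId, SalesAmount FROM DomesticSalesOrders
          UNION ALL
          SELECT SalesOrderId, SalesAmount FROM InternationalSalesOrders) AS p
""").fetchall()

# Second form: UNION ALL of two aggregates -> one row per table.
two_rows = con.execute("""
    SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
    FROM DomesticSalesOrders
    UNION ALL
    SELECT COUNT(*), SUM(SalesAmount) FROM InternationalSalesOrders
""").fetchall()

print(one_row)   # [(3, 35.0)]
print(two_rows)  # [(2, 30.0), (1, 5.0)]
```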

    Can someone knowledgeable in this area please enlighten me?

    Thank you.

    Thursday, September 05, 2013 3:59 AM

Answers

  • Take a look at the execution plans.

    The first query contains a subquery with 200 million rows!  Can't possibly be fast.

    Your point about getting two rows from the second query is a good one. A final summation is still necessary, but that step should be fast.
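
    To be concrete, the two per-table rows can be rolled up into the single report row with one more outer aggregate; a minimal sqlite3 sketch (table names from the question, data hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE DomesticSalesOrders(SalesOrderId INTEGER PRIMARY KEY, SalesAmount REAL);
    CREATE TABLE InternationalSalesOrders(SalesOrderId INTEGER PRIMARY KEY, SalesAmount REAL);
    INSERT INTO DomesticSalesOrders VALUES (1, 10.0), (2, 20.0);
    INSERT INTO InternationalSalesOrders VALUES (100000001, 5.0);
""")

# Sum the two per-table aggregate rows into the single report row.
grand_total = con.execute("""
    SELECT SUM(n) AS NumberOfSales, SUM(t) AS TotalSalesAmount
    FROM (SELECT COUNT(*) AS n, SUM(SalesAmount) AS t FROM DomesticSalesOrders
          UNION ALL
          SELECT COUNT(*) AS n, SUM(SalesAmount) AS t FROM InternationalSalesOrders) AS p
""").fetchone()

print(grand_total)  # (3, 35.0)
```

    The outer SUM only touches two rows, so its cost is negligible next to the table scans.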


    Kalman Toth Database & OLAP Architect IPAD SELECT Query Video Tutorial 3.5 Hours
    New Book / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012


    Thursday, September 05, 2013 4:11 AM
  • You cannot answer this question without knowing what query plan each of those queries generates.  As written, the first one contains a subquery that generates 200 million rows.  But SQL can and does take shortcuts.  So what do the query plans look like?

    I generated two tables with 10 million rows - I didn't want to take the time to generate tables with 100 million rows, but 10 million should be sufficient.  (To run this you will need a Numbers table with at least all integers from 1 to 1,000,000.)

    CREATE TABLE t1Foo (SalesOrderID int IDENTITY PRIMARY KEY, SalesAmount numeric(13,2));
    CREATE TABLE t2Foo (SalesOrderID int IDENTITY(100000001,1) PRIMARY KEY, SalesAmount numeric(13,2));
    GO
    INSERT t1Foo (SalesAmount)
    SELECT 1.27 FROM dbo.Numbers n1 INNER JOIN dbo.Numbers n2 ON n2.Number <= 10 WHERE n1.Number <= 1000000;
    INSERT t2Foo (SalesAmount)
    SELECT 1.27 FROM dbo.Numbers n1 INNER JOIN dbo.Numbers n2 ON n2.Number <= 10 WHERE n1.Number <= 1000000;

    This meets the requirements of the question: the SalesOrderID values are unique (in one table the identity starts at 1, in the other at 100,000,001) and each row has a SalesAmount. Now look at the execution plans for:

    SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
    FROM (SELECT SalesOrderId, SalesAmount FROM t1Foo
          UNION ALL
          SELECT SalesOrderId, SalesAmount FROM t2Foo) AS p

    SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
    FROM t1Foo
    UNION ALL
    SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
    FROM t2Foo

    The first few steps in each plan are the same and have the same cost.  They both do a clustered index scan of t1Foo (cost 22.8813), a stream aggregate of t1Foo (cost 1.5) and a parallelism gather streams on t1Foo (cost 0.0285).  So for both queries the total cost to scan t1Foo and aggregate the data is 24.4098.  Both queries also do the same for t2Foo and since the tables are the same size the costs are the same.

    So far, both queries have done exactly the same thing.  And the cost so far for both queries is 48.8196 (2 * 24.4098).  At this point the execution plans for the queries start to differ.

    Query 1 then does a concatenation (cost 0.0000008), a stream aggregate (cost 0.0000053), a compute scalar (cost 0.0000001) and a select (cost 0.0), for a total cost of 48.8196 + 0.0000008 + 0.0000053 + 0.0000001, or 48.8196062 (rounded to six significant figures, 48.8196).

    Query 2 then does two stream aggregates (cost 0.0000029 each, 0.0000058 total), two compute scalars (cost 0.0000001 each, 0.0000002 total) and a concatenation (cost 0.0000002), for a total cost of 48.8196 + 0.0000058 + 0.0000002 + 0.0000002, or 48.8196062 (rounded to six significant figures, 48.8196).

    So for all practical purposes, these are equally efficient.  If you run them, sometimes one will be slightly faster, sometimes the other due to random noise.
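
    One way to sanity-check this without SQL Server is to build two tables with disjoint key ranges, run both query shapes, and confirm they agree; a rough Python/sqlite3 sketch scaled down to 100,000 rows per table (a toy engine, so only the results carry over, not the cost model):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1Foo(SalesOrderID INTEGER PRIMARY KEY, SalesAmount REAL);
    CREATE TABLE t2Foo(SalesOrderID INTEGER PRIMARY KEY, SalesAmount REAL);
""")
# Disjoint key ranges, as in the exam question (100,000 rows per table).
con.executemany("INSERT INTO t1Foo VALUES (?, 1.27)",
                ((i,) for i in range(1, 100001)))
con.executemany("INSERT INTO t2Foo VALUES (?, 1.27)",
                ((i,) for i in range(100000001, 100100001)))

# Shape 1: aggregate over the UNION ALL of the raw rows.
q1 = """SELECT COUNT(*), SUM(SalesAmount)
        FROM (SELECT SalesOrderID, SalesAmount FROM t1Foo
              UNION ALL
              SELECT SalesOrderID, SalesAmount FROM t2Foo) AS p"""
# Shape 2: per-table aggregates, rolled up to one row for comparison.
q2 = """SELECT SUM(n), SUM(t)
        FROM (SELECT COUNT(*) AS n, SUM(SalesAmount) AS t FROM t1Foo
              UNION ALL
              SELECT COUNT(*) AS n, SUM(SalesAmount) AS t FROM t2Foo) AS p"""

results = {}
for label, q in (("aggregate-over-union", q1), ("union-of-aggregates", q2)):
    start = time.perf_counter()
    results[label] = con.execute(q).fetchone()
    print(label, results[label], f"{time.perf_counter() - start:.4f}s")
```

    This only verifies that the two shapes return the same numbers; the cost comparison itself still has to be read from SQL Server's actual execution plans, as above.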

    Tom

    Thursday, September 05, 2013 6:02 AM

All replies

  • Hi,

    The second query should perform better because it refers to each table directly and then takes the UNION ALL of the two aggregated record sets, so the rows can be retrieved using each table's existing indexes. The first query aggregates over a single record set built from two subqueries, and we cannot expect any indexes to be used when retrieving those records.


    Many Thanks & Best Regards, Hua Min



    Thursday, September 05, 2013 4:16 AM