Question on UNION ALL.

Hi there, I've come across an exam question and wonder if someone can give me some advice:
The data in these two tables are distinct from one another, and each table contains more than 100 million rows. Each table has a primary key column named SalesOrderId. Users want a report that includes aggregate information about the total number of global sales and the total sales amount, and we need to ensure that the query executes in the minimum possible time:
-- Query 1
SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
FROM (SELECT SalesOrderId, SalesAmount
      FROM DomesticSalesOrders
      UNION ALL
      SELECT SalesOrderId, SalesAmount
      FROM InternationalSalesOrders) AS p

-- Query 2
SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
FROM DomesticSalesOrders
UNION ALL
SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
FROM InternationalSalesOrders

The paper suggested the second query as the answer, but I noticed that the second query produces a result with two rows (unlike the first query, which produces one row and is more likely to fulfil the user's request). I would like to know how the second query outperforms the first one in terms of performance.
Can someone knowledgeable about this please enlighten me?
Thank you.
Answers

Take a look at the execution plans.
The first query contains a subquery with 200 million rows! Can't possibly be fast.
As for the second query returning 2 rows, that is a good point. A final summation is still necessary, but it should be fast.
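For completeness, the two-row output of the second approach can be collapsed into the single row the report asks for by summing the per-table aggregates. A sketch (the derived-table alias partials is just an illustrative name):

```sql
-- Sketch: wrap the per-table aggregates so the report gets one row.
-- Each table is still scanned and aggregated independently; the outer
-- SELECT combines the two partial results (note the partial COUNTs
-- are combined with SUM, not COUNT).
SELECT SUM(NumberOfSales)    AS NumberOfSales,
       SUM(TotalSalesAmount) AS TotalSalesAmount
FROM (SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
      FROM DomesticSalesOrders
      UNION ALL
      SELECT COUNT(*), SUM(SalesAmount)
      FROM InternationalSalesOrders) AS partials;
```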
Kalman Toth Database & OLAP Architect IPAD SELECT Query Video Tutorial 3.5 Hours
New Book / Kindle: Pass SQL Exam 70461 & Job Interview: Programming SQL Server 2012
Proposed as answer by Saeid Hasani, Thursday, September 05, 2013 5:26 AM
Marked as answer by Allen Li (MSFT Moderator), Tuesday, September 17, 2013 7:57 AM

You cannot answer this question without knowing what query plan each of those queries generates. As written, the first one contains a subquery that generates 200 million rows. But SQL can and does take shortcuts. So what do the query plans look like?
I generated two tables with 10 million rows each; I didn't want to take the time to generate tables with 100 million rows, but 10 million should be sufficient. (To run this you will need a Numbers table containing at least all integers from 1 to 1,000,000.)
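If you don't already have a Numbers table, one common way to build one is to number the rows of a cross join over a large system view. This is only a sketch, and the use of sys.all_objects (which usually has a few thousand rows) is an assumption about the environment:

```sql
-- Sketch: populate dbo.Numbers with the integers 1..1,000,000.
-- The cross join of sys.all_objects with itself comfortably yields
-- over a million rows to number with ROW_NUMBER().
CREATE TABLE dbo.Numbers(Number int NOT NULL PRIMARY KEY);

INSERT dbo.Numbers(Number)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a CROSS JOIN sys.all_objects b;
```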
create table t1Foo(SalesOrderID int identity primary key, SalesAmount numeric(13,2));
create table t2Foo(SalesOrderID int identity(100000001,1) primary key, SalesAmount numeric(13,2));
go
insert t1Foo(SalesAmount)
select 1.27
from dbo.Numbers n1
inner join dbo.Numbers n2 on n2.Number <= 10
where n1.Number <= 1000000;

insert t2Foo(SalesAmount)
select 1.27
from dbo.Numbers n1
inner join dbo.Numbers n2 on n2.Number <= 10
where n1.Number <= 1000000;
This meets the requirements of the question: the SalesOrderID values are unique (in one table the identity starts at 1, in the other at 100,000,001) and each row has a SalesAmount. Now look at the execution plans for:
-- Query 1
SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
FROM (SELECT SalesOrderId, SalesAmount FROM t1Foo
      UNION ALL
      SELECT SalesOrderId, SalesAmount FROM t2Foo) AS p

-- Query 2
SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount FROM t1Foo
UNION ALL
SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount FROM t2Foo
The first few steps in each plan are the same and have the same cost. They both do a clustered index scan of t1Foo (cost 22.8813), a stream aggregate of t1Foo (cost 1.5) and a parallelism gather streams on t1Foo (cost 0.0285). So for both queries the total cost to scan t1Foo and aggregate the data is 24.4098. Both queries also do the same for t2Foo and since the tables are the same size the costs are the same.
So far, both queries have done exactly the same thing. And the cost so far for both queries is 48.8196 (2 * 24.4098). At this point the execution plans for the queries start to differ.
Query 1 then does a concatenation (cost 0.0000008), a stream aggregate (cost 0.0000053), a compute scalar (cost 0.0000001) and a select (cost 0.0), for a total cost of 48.8196 + 0.0000008 + 0.0000053 + 0.0000001, or 48.8196062 (rounded to six significant figures, 48.8196).
Query 2 does 2 stream aggregates (cost 0.0000029 each, 0.0000058 total), 2 compute scalars (cost 0.0000001 each, 0.0000002 total) and a concatenation (cost 0.0000002), for a total cost of 48.8196 + 0.0000058 + 0.0000002 + 0.0000002, or 48.8196062 (rounded to six significant figures, 48.8196).
So for all practical purposes, these are equally efficient. If you run them, sometimes one will be slightly faster, sometimes the other due to random noise.
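If you would rather compare actual timings than estimated plan costs, SQL Server's session statistics can be switched on around the two queries. A sketch using the t1Foo/t2Foo tables above:

```sql
-- Sketch: measure actual CPU and elapsed time for both queries;
-- the timings appear on the Messages tab in SSMS.
SET STATISTICS TIME ON;

SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount
FROM (SELECT SalesOrderId, SalesAmount FROM t1Foo
      UNION ALL
      SELECT SalesOrderId, SalesAmount FROM t2Foo) AS p;

SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount FROM t1Foo
UNION ALL
SELECT COUNT(*) AS NumberOfSales, SUM(SalesAmount) AS TotalSalesAmount FROM t2Foo;

SET STATISTICS TIME OFF;
```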
Tom
Marked as answer by Allen Li (MSFT Moderator), Tuesday, September 17, 2013 7:57 AM
All replies

Hi,
The 2nd query performs better because it refers directly to the tables and then takes the UNION ALL of the two result sets, which means the records are retrieved using the existing table indexes, while the 1st one queries against a single result set built from the combination of two subqueries, and we cannot expect any indexes to be used when retrieving those records.
Many Thanks & Best Regards, Hua Min
 Edited by HuaMin Chen Thursday, September 05, 2013 4:47 AM
