ROW_NUMBER OVER PARTITION BY Puzzle
-
Monday, February 18, 2013 2:33 PMModerator
Can the following be done in one CTE (I want to keep the outer query clean without RN predicate)? Thanks.
The nested CTE-s are selecting the recent salary rate for each employee of AdventureWorks Cycles fictitious company, a manufacturer of mountain, road and touring bikes.
WITH CTE AS (SELECT EmpID=BusinessEntityID, RN=ROW_NUMBER() OVER (PARTITION BY BusinessEntityID ORDER BY RateChangeDate DESC), Rate FROM HumanResources.EmployeePayHistory), CTEfinal AS (SELECT EmpID, Rate FROM CTE WHERE RN = 1) SELECT * FROM CTEfinal ORDER BY EmpID;
EmpID Rate
1 125.50
2 63.4615
3 43.2692
4 29.8462
5 32.6923
6 32.6923
....
Kalman Toth Database & OLAP Architect
Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012
- Edited by Kalman TothMicrosoft Community Contributor, Moderator Monday, February 18, 2013 2:35 PM
- Edited by Kalman TothMicrosoft Community Contributor, Moderator Monday, February 18, 2013 3:50 PM
All Replies
-
Monday, February 18, 2013 3:01 PM
WITH CTE AS (SELECT EmpID=BusinessEntityID, RN=ROW_NUMBER()
OVER (PARTITION BY BusinessEntityID
ORDER BY RateChangeDate DESC), Rate
FROM HumanResources.EmployeePayHistory)
SELECT * FROM CTEfinal WHERE RN= 1 ORDER BY EmpID;Or am I missing something?
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se -
Monday, February 18, 2013 3:10 PM
Trivially yes, but really no. Window functions can only be used in SELECT and ORDER BY.
You can always switch between CTEs and Subqueries, but a CTE "pipeline" is much nicer than a bunch of lexically nested subqueries.
WITH CTEfinal AS ( SELECT EmpID, Rate from ( SELECT EmpID=BusinessEntityID, RN=ROW_NUMBER() OVER (PARTITION BY BusinessEntityID ORDER BY RateChangeDate DESC), Rate FROM HumanResources.EmployeePayHistory ) CTE where RN = 1 ) SELECT * FROM CTEfinal ORDER BY EmpID;DavidDavid http://blogs.msdn.com/b/dbrowne/
- Marked As Answer by Kalman TothMicrosoft Community Contributor, Moderator Monday, February 18, 2013 3:35 PM
-
Monday, February 18, 2013 4:00 PM
Guessing you don't want the old-fashioned way of doing this? You could of course wrap this in a CTE but no need for this simple example:
SELECT eph.BusinessEntityID AS EmpID, eph.Rate FROM HumanResources.EmployeePayHistory eph INNER JOIN ( SELECT BusinessEntityID, MAX(RateChangeDate) RateChangeDate FROM HumanResources.EmployeePayHistory GROUP BY BusinessEntityID ) max_ph ON eph.BusinessEntityID = max_ph.BusinessEntityID AND eph.RateChangeDate = max_ph.RateChangeDate
-
Monday, February 18, 2013 4:25 PMModeratorIs GROUP BY better than OVER PARTITION BY? Thanks.
Kalman Toth Database & OLAP Architect
Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012- Edited by Kalman TothMicrosoft Community Contributor, Moderator Monday, February 18, 2013 4:25 PM
-
Monday, February 18, 2013 4:43 PM
Well I had to scale up to 2.5 million rows to see any difference, but the GROUP BY with subquery does appear slightly quicker, similar reads, and better CPU ( eg ~10k versus 2k ). To be honest the difference is fairly trivial, although in terms of CPU the GROUP BY query makes a slightly better neighbour on a server. See what you think.
My simple rig:
Warning!! Do not run this on a production server!
USE AdventureWorks2012 GO IF OBJECT_ID('HumanResources.EmployeePayHistory2') IS NOT NULL DROP TABLE HumanResources.EmployeePayHistory2 GO SELECT * INTO HumanResources.EmployeePayHistory2 FROM HumanResources.EmployeePayHistory GO -- OK we haven't done any indexes .... INSERT INTO HumanResources.EmployeePayHistory2 SELECT BusinessEntityID, DATEADD( day, 1, RateChangeDate ), Rate + (Rate * 0.1), PayFrequency, GETDATE() FROM HumanResources.EmployeePayHistory2 GO 13 SELECT FORMAT( COUNT(*), '#,#') FROM HumanResources.EmployeePayHistory2 GO DBCC DROPCLEANBUFFERS DBCC FREEPROCCACHE GO ;WITH CTE AS ( SELECT EmpID=BusinessEntityID, RN = ROW_NUMBER() OVER ( PARTITION BY BusinessEntityID ORDER BY RateChangeDate DESC), Rate FROM HumanResources.EmployeePayHistory2) , CTEfinal AS (SELECT * FROM CTE WHERE RN = 1) SELECT * FROM CTEfinal ORDER BY EmpID; GO DBCC DROPCLEANBUFFERS DBCC FREEPROCCACHE GO SELECT eph.BusinessEntityID AS EmpID, eph.Rate FROM HumanResources.EmployeePayHistory2 eph INNER JOIN ( SELECT BusinessEntityID, MAX(RateChangeDate) RateChangeDate FROM HumanResources.EmployeePayHistory2 GROUP BY BusinessEntityID ) max_ph ON eph.BusinessEntityID = max_ph.BusinessEntityID AND eph.RateChangeDate = max_ph.RateChangeDate ORDER BY EmpID GO -- Cleanup IF OBJECT_ID('HumanResources.EmployeePayHistory2') IS NOT NULL DROP TABLE HumanResources.EmployeePayHistory2 GOProfiler result:
- Marked As Answer by Kalman TothMicrosoft Community Contributor, Moderator Monday, February 18, 2013 9:15 PM
-
Monday, February 18, 2013 8:48 PMModerator
I repeated your test. The GROUP BY version beats OVER PARTITION BY for the sample data.
Kalman Toth Database & OLAP Architect
Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012 -
Monday, February 18, 2013 9:52 PM
Once you have some index structure, cross apply sometimes gets really good (and parallelizable) plans.
create clustered index ix_eph2 on HumanResources.EmployeePayHistory2(BusinessEntityID,RateChangeDate) select BusinessEntityID, Rates.Rate from HumanResources.Employee emp cross apply (select top 1 Rate from HumanResources.EmployeePayHistory2 where BusinessEntityID = emp.BusinessEntityID order by RateChangeDate desc) Rates
David
David http://blogs.msdn.com/b/dbrowne/
- Edited by davidbaxterbrowneMicrosoft Employee Monday, February 18, 2013 9:54 PM
- Marked As Answer by Kalman TothMicrosoft Community Contributor, Moderator Tuesday, February 19, 2013 1:07 AM
-
Monday, February 18, 2013 10:49 PM
One word of caution. The GROUP BY method really is equivalent to the RANK() (or DENSE_RANK) function, not the ROW_NUMBER() function. It works in this case since since for each BusinessEntityID, there is only one row in EmployeePayHistory2 with the maximum value in RateChangeDate. If for a particular BusinessEntityID there were multiple rows with RateChangeDate equal to the maximum value of RateChangeDate for that BusinessEntityID, all of those rows would be returned. That is what RANK or DENSE_RANK would do, ROW_NUMBER will return only one of the rows.
Tom
-
Tuesday, February 19, 2013 1:05 AMModerator
Including CROSS APPLY method in the test.
The first "select" is the CROSS APPLY method, a sure winner (see "Reads", "CPU") for this sample data.
DBCC DROPCLEANBUFFERS DBCC FREEPROCCACHE GO select BusinessEntityID AS EmpID, Rates.Rate from HumanResources.Employee emp cross apply (select top 1 Rate from HumanResources.EmployeePayHistory2 where BusinessEntityID = emp.BusinessEntityID order by RateChangeDate desc) Rates ORDER BY EmpID; GO DBCC DROPCLEANBUFFERS DBCC FREEPROCCACHE GO ;WITH CTE AS ( SELECT EmpID=BusinessEntityID, RN = ROW_NUMBER() OVER ( PARTITION BY BusinessEntityID ORDER BY RateChangeDate DESC), Rate FROM HumanResources.EmployeePayHistory2) , CTEfinal AS (SELECT * FROM CTE WHERE RN = 1) SELECT * FROM CTEfinal ORDER BY EmpID; GO DBCC DROPCLEANBUFFERS DBCC FREEPROCCACHE GO SELECT eph.BusinessEntityID AS EmpID, eph.Rate FROM HumanResources.EmployeePayHistory2 eph INNER JOIN ( SELECT BusinessEntityID, MAX(RateChangeDate) RateChangeDate FROM HumanResources.EmployeePayHistory2 GROUP BY BusinessEntityID ) max_ph ON eph.BusinessEntityID = max_ph.BusinessEntityID AND eph.RateChangeDate = max_ph.RateChangeDate ORDER BY EmpID GO
Kalman Toth Database & OLAP Architect
Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012
- Edited by Kalman TothMicrosoft Community Contributor, Moderator Tuesday, February 19, 2013 1:09 AM
- Edited by Kalman TothMicrosoft Community Contributor, Moderator Tuesday, February 19, 2013 1:10 AM
- Edited by Kalman TothMicrosoft Community Contributor, Moderator Tuesday, February 19, 2013 1:10 AM
-
Tuesday, February 19, 2013 1:28 AM
This was my winner (with original index):
ALTER TABLE HumanResources.EmployeePayHistory2 ADD CONSTRAINT PK PRIMARY KEY ( BusinessEntityID, RateChangeDate ) GO select BusinessEntityID, ( select top 1 Rate from HumanResources.EmployeePayHistory2 where BusinessEntityID = emp.BusinessEntityID order by RateChangeDate desc ) Rate from HumanResources.Employee emp order by 1
Seems counter-intuitive that a correlated subquery with a sort should win. Still, goes to show you should always try more than one method with your data.- Edited by wBobMicrosoft Community Contributor Tuesday, February 19, 2013 1:39 AM original index added
-
Tuesday, February 19, 2013 1:37 AMGood point, however the original primary key on HumanResource.EmployeePayHistory is on ( BusinessEntity, RateChangeDate ) would have enforced this. So for this particular example they would be equivalent.
-
Tuesday, February 19, 2013 3:10 AM
While they are the same here since we only want one column from group, the OUTER APPLY can return any number of columns from the first row in the group. So it's semantically equivalent to the ROW_NUMBER() query.
David
David http://blogs.msdn.com/b/dbrowne/
- Edited by davidbaxterbrowneMicrosoft Employee Tuesday, February 19, 2013 3:10 AM

