ROW_NUMBER OVER PARTITION BY Puzzle

Answered ROW_NUMBER OVER PARTITION BY Puzzle

  • Monday, February 18, 2013 2:33 PM
    Moderator
     
      Has Code

    Can the following be done in one CTE (I want to keep the outer query clean without RN predicate)?  Thanks.

    The nested CTE-s are selecting the recent salary rate for each employee of AdventureWorks Cycles fictitious company, a manufacturer of mountain, road and touring bikes.

    WITH CTE AS (SELECT EmpID=BusinessEntityID, RN=ROW_NUMBER() 
    				OVER (PARTITION BY BusinessEntityID 
    				ORDER BY RateChangeDate DESC), Rate
    			 FROM HumanResources.EmployeePayHistory),
    CTEfinal AS (SELECT EmpID, Rate FROM CTE WHERE RN = 1)
    
    SELECT * FROM CTEfinal ORDER BY EmpID;

    EmpID Rate

    1 125.50

    2 63.4615

    3 43.2692

    4 29.8462

    5 32.6923

    6 32.6923

    ....


    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012



All Replies

  • Monday, February 18, 2013 3:01 PM
     
     

     WITH CTE AS (SELECT EmpID=BusinessEntityID, RN=ROW_NUMBER()
                         OVER (PARTITION BY BusinessEntityID
                         ORDER BY RateChangeDate DESC), Rate
                     FROM HumanResources.EmployeePayHistory)
     SELECT * FROM CTEfinal WHERE RN= 1 ORDER BY EmpID;

    Or am I missing something?


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
  • Monday, February 18, 2013 3:10 PM
     
     Answered Has Code

    Trivially yes, but really no.  Window functions can only be used in SELECT and ORDER BY.

    You can always switch between CTEs and Subqueries, but a CTE "pipeline" is much nicer than a bunch of lexically nested subqueries.

    WITH CTEfinal AS 
    (
      SELECT EmpID, Rate
      from
      (
        SELECT EmpID=BusinessEntityID, RN=ROW_NUMBER() 
        OVER (PARTITION BY BusinessEntityID 
        ORDER BY RateChangeDate DESC), Rate
        FROM HumanResources.EmployeePayHistory
      ) CTE
      where RN = 1
    )
    SELECT * FROM CTEfinal ORDER BY EmpID;
    David

    David http://blogs.msdn.com/b/dbrowne/

  • Monday, February 18, 2013 4:00 PM
     
      Has Code

    Guessing you don't want the old-fashioned way of doing this?  You could of course wrap this in a CTE but no need for this simple example:

    SELECT eph.BusinessEntityID AS EmpID, eph.Rate
    FROM HumanResources.EmployeePayHistory eph
    	INNER JOIN
    	(
    	SELECT BusinessEntityID, MAX(RateChangeDate) RateChangeDate
    	FROM HumanResources.EmployeePayHistory
    	GROUP BY BusinessEntityID
    	) max_ph ON eph.BusinessEntityID = max_ph.BusinessEntityID
    		AND eph.RateChangeDate = max_ph.RateChangeDate

  • Monday, February 18, 2013 4:25 PM
    Moderator
     
     
    Is GROUP BY better than OVER PARTITION BY? Thanks.

    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012


  • Monday, February 18, 2013 4:43 PM
     
     Answered Has Code

    Well I had to scale up to 2.5 million rows to see any difference, but the GROUP BY with subquery does appear slightly quicker, similar reads, and better CPU ( eg ~10k versus 2k ).  To be honest the difference is fairly trivial, although in terms of CPU the GROUP BY query makes a slightly better neighbour on a server.  See what you think.

    My simple rig:

    Warning!! Do not run this on a production server!

    USE AdventureWorks2012
    GO
    
    IF OBJECT_ID('HumanResources.EmployeePayHistory2') IS NOT NULL DROP TABLE HumanResources.EmployeePayHistory2
    GO
    
    SELECT *
    INTO HumanResources.EmployeePayHistory2
    FROM HumanResources.EmployeePayHistory
    GO
    
    -- OK we haven't done any indexes ....
    
    INSERT INTO HumanResources.EmployeePayHistory2
    SELECT BusinessEntityID, DATEADD( day, 1, RateChangeDate ), Rate + (Rate * 0.1), PayFrequency, GETDATE()
    FROM HumanResources.EmployeePayHistory2
    GO 13
    
    SELECT FORMAT( COUNT(*), '#,#') FROM HumanResources.EmployeePayHistory2
    GO
    
    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE
    GO
    
    ;WITH CTE AS 
    (
    SELECT EmpID=BusinessEntityID, RN = ROW_NUMBER() 
    OVER ( PARTITION BY BusinessEntityID ORDER BY RateChangeDate DESC), Rate
    FROM HumanResources.EmployeePayHistory2)
    , CTEfinal AS (SELECT * FROM CTE WHERE RN = 1)
    SELECT * FROM CTEfinal ORDER BY EmpID;
    GO
    
    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE
    GO
    
    SELECT eph.BusinessEntityID AS EmpID, eph.Rate
    FROM HumanResources.EmployeePayHistory2 eph
    	INNER JOIN
    	(
    	SELECT BusinessEntityID, MAX(RateChangeDate) RateChangeDate
    	FROM HumanResources.EmployeePayHistory2
    	GROUP BY BusinessEntityID
    	) max_ph ON eph.BusinessEntityID = max_ph.BusinessEntityID
    		AND eph.RateChangeDate = max_ph.RateChangeDate
    ORDER BY EmpID
    GO
    
    -- Cleanup
    IF OBJECT_ID('HumanResources.EmployeePayHistory2') IS NOT NULL DROP TABLE HumanResources.EmployeePayHistory2
    GO

    Profiler result:

  • Monday, February 18, 2013 8:48 PM
    Moderator
     
     

    I repeated your test. The GROUP BY version beats OVER PARTITION BY for the sample data.


    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012

  • Monday, February 18, 2013 9:52 PM
     
     Answered Has Code

    Once you have some index structure, cross apply sometimes gets really good (and parallelizable) plans.

    create clustered index ix_eph2 on HumanResources.EmployeePayHistory2(BusinessEntityID,RateChangeDate) select BusinessEntityID, Rates.Rate from HumanResources.Employee emp cross apply (select top 1 Rate from HumanResources.EmployeePayHistory2 where BusinessEntityID = emp.BusinessEntityID order by RateChangeDate desc) Rates

    David


    David http://blogs.msdn.com/b/dbrowne/


  • Monday, February 18, 2013 10:49 PM
     
     

    One word of caution.  The GROUP BY method really is equivalent to the RANK() (or DENSE_RANK) function, not the ROW_NUMBER() function.  It works in this case since since for each BusinessEntityID, there is only one row in EmployeePayHistory2 with the maximum value in RateChangeDate.  If for a particular BusinessEntityID there were multiple rows with RateChangeDate equal to the maximum value of RateChangeDate for that BusinessEntityID, all of those rows would be returned.  That is what RANK or DENSE_RANK would do, ROW_NUMBER will return only one of the rows.

    Tom

  • Tuesday, February 19, 2013 1:05 AM
    Moderator
     
      Has Code

    Including CROSS APPLY method in the test.

    The first "select" is the CROSS APPLY method, a sure winner (see "Reads", "CPU") for this sample data.

    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE
    GO
    select BusinessEntityID AS EmpID, Rates.Rate
    from HumanResources.Employee emp
    cross apply 
      (select top 1 Rate from HumanResources.EmployeePayHistory2 
       where BusinessEntityID = emp.BusinessEntityID 
       order by RateChangeDate desc) Rates
    ORDER BY EmpID;
    GO
    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE
    GO
    
    ;WITH CTE AS 
    (
    SELECT EmpID=BusinessEntityID, RN = ROW_NUMBER() 
    OVER ( PARTITION BY BusinessEntityID ORDER BY RateChangeDate DESC), Rate
    FROM HumanResources.EmployeePayHistory2)
    , CTEfinal AS (SELECT * FROM CTE WHERE RN = 1)
    SELECT * FROM CTEfinal ORDER BY EmpID;
    GO
    
    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE
    GO
    
    SELECT eph.BusinessEntityID AS EmpID, eph.Rate
    FROM HumanResources.EmployeePayHistory2 eph
    	INNER JOIN
    	(
    	SELECT BusinessEntityID, MAX(RateChangeDate) RateChangeDate
    	FROM HumanResources.EmployeePayHistory2
    	GROUP BY BusinessEntityID
    	) max_ph ON eph.BusinessEntityID = max_ph.BusinessEntityID
    		AND eph.RateChangeDate = max_ph.RateChangeDate
    ORDER BY EmpID
    GO


    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012




  • Tuesday, February 19, 2013 1:28 AM
     
      Has Code

    This was my winner (with original index):

    ALTER TABLE HumanResources.EmployeePayHistory2 ADD CONSTRAINT PK PRIMARY KEY ( BusinessEntityID, RateChangeDate )
    GO
    
    select BusinessEntityID, 
    	(
    	select top 1 Rate from HumanResources.EmployeePayHistory2 
    	where BusinessEntityID = emp.BusinessEntityID 
    	order by RateChangeDate desc ) Rate
    from HumanResources.Employee emp
    order by 1


    Seems counter-intuitive that a correlated subquery with a sort should win.  Still, goes to show you should always try more than one method with your data.


  • Tuesday, February 19, 2013 1:37 AM
     
     
    Good point, however the original primary key on HumanResource.EmployeePayHistory is on ( BusinessEntity, RateChangeDate ) would have enforced this.  So for this particular example they would be equivalent.
  • Tuesday, February 19, 2013 3:10 AM
     
     

    While they are the same here since we only want one column from group, the OUTER APPLY can return any number of columns from the first row in the group.  So it's semantically equivalent to the ROW_NUMBER() query.

    David


    David http://blogs.msdn.com/b/dbrowne/