none
TSQL Finding the next record in dataset that matches criteria - CTE Recursion?

    Question

  • Hi,

    I'm working on a section of T-SQL code that is done using a Cursor type of query in that it parses through a dataset and looks for the next record that has a greater date and a greater value.  This bit of code takes 24+ hours to run with the amount of data it's processing.  I'm looking for a way to process this faster and I think the solution might be a form of Recursive CTE but I'm having trouble writing it. Perhaps there is another path beyond cursors that I've not considered.

    SQL Version (@@Version) :

    Microsoft SQL Server 2008 R2 (RTM) - 10.50.1617.0 (X64)   Apr 22 2011 19:23:43   Copyright (c) Microsoft Corporation  Enterprise Edition (64-bit) on Windows NT 6.0 <X64> (Build 6002: Service Pack 2) (Hypervisor) 

    Below is a sample of data in a demo table I've included a column [EventDesirable] for ease in this discussion in identifying the desirable event and tagged the desirable records as 'y'.  This column is not in the actual data, so the code must look backwards a the last desirable record that was identified.

    The criteria for selecting the next desirable record (row in the dataset)  is as follows:

    • the product id must remain the same 
    • the event date must be greater than the previous "desirable record"
    • the event read must be greater that the previous "desirable record"
    • the first record in the table that has the earliest date will be the starting point, commonly the event read will be 0.

    DECLARE @demoTable TABLE (ProductID varchar(3), EventDate date, EventRead int, EventDesirable varchar(1))
    INSERT INTO @demoTable VALUES ('abc','08/29/2006','0','y')
    INSERT INTO @demoTable VALUES ('abc','12/10/2012','7454212','y')
    INSERT INTO @demoTable VALUES ('abc','12/17/2012','7601927','y')
    INSERT INTO @demoTable VALUES ('abc','12/26/2012','7684135','y')
    INSERT INTO @demoTable VALUES ('abc','01/02/2013','7711375','y')
    INSERT INTO @demoTable VALUES ('abc','01/14/2013','7802123','y')
    INSERT INTO @demoTable VALUES ('abc','01/28/2013','7942439','y')
    INSERT INTO @demoTable VALUES ('abc','01/29/2013','7972469','y')
    INSERT INTO @demoTable VALUES ('abc','01/31/2013','7972511','y')
    INSERT INTO @demoTable VALUES ('abc','02/26/2013','8805260','y')
    INSERT INTO @demoTable VALUES ('abc','03/05/2013','8249572','n')
    INSERT INTO @demoTable VALUES ('abc','03/07/2013','8284862','n')
    INSERT INTO @demoTable VALUES ('abc','03/11/2013','8310419','n')
    INSERT INTO @demoTable VALUES ('abc','03/13/2013','8322439','n')
    INSERT INTO @demoTable VALUES ('abc','03/18/2013','8383571','n')
    INSERT INTO @demoTable VALUES ('abc','04/22/2013','8721137','n')
    INSERT INTO @demoTable VALUES ('abc','05/20/2013','9057232','y')
    
    SELECT * FROM @demoTable

    What I've tried so far

    ;WITH baseData AS (
    
    SELECT 
    ProductID, 
    EventDate, 
    EventRead,
    DateRank =  ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY EventDate ASC) 
    FROM @demoTable 
    ), recursiveData AS(
    
    SELECT TOP(1) 
    	ProductID, 
    	EventDate, 
    	EventRead,
    	DateRank = CAST(1 AS INT)
    FROM baseData
    WHERE DateRank = 1
    
    UNION ALL
    
    SELECT
    bd.ProductID, 
    bd.EventDate, 
    bd.EventRead,
    DataRank = CAST(bd.DateRank AS INT) + 1
    FROM baseData bd INNER JOIN recursiveData rd
    ON bd.ProductID =rd.ProductID
    WHERE bd.DateRank > rd.DateRank
    AND bd.EventDate > rd.EventDate
    AND bd.EventRead > bd.EventRead
    
    )
    
    SELECT * FROM recursiveData

    The above is my current attempt at recursion and is returning only 1 result, the anchor piece of recursive data.  I think there is a key piece I'm missing or not understanding with  recursive CTEs.  If this is the case and there is something I should read to get a better understanding, guidance would be appreciated!

    For reference I've included the desired result set here:


    ProductID EventDate EventRead
    abc 8/29/2006 0
    abc 12/10/2012 7454212
    abc 12/17/2012 7601927
    abc 12/26/2012 7684135
    abc 1/2/2013 7711375
    abc 1/14/2013 7802123
    abc 1/28/2013 7942439
    abc 1/29/2013 7972469
    abc 1/31/2013 7972511
    abc 2/26/2013 8805260
    abc 5/20/2013 9057232

    Notes about the actual dataset(providing a scale to the TSQL functionality required):

    • 7million + records to parse
    • 600k product ids
    • processing occurs monthly
    • current cursor runs in 24 hours - goal is to be faster

    If any additional information would be helpful, please let me know. 

    Thanks,

    Mike

    Monday, June 24, 2013 11:27 PM

Answers

  • SELECT * 
    FROM @demoSourceTable s
    WHERE NOT EXISTS (SELECT * FROM @demoSourceTable s2
      WHERE s.ProductID = s2.ProductID AND s.EventDate > s2.EventDate AND s.EventRead <= s2.EventRead)

    should do what you want.  But I'm not sure that it will be faster than your cursor.  If you do the above, it will be a lot more efficient if you have either an index on ProductID and EventDate or an index on ProductID and EventRate.  And it may very well be best to have both indexes.

    Tom

    • Marked as answer by recanem3 Tuesday, June 25, 2013 7:40 PM
    Tuesday, June 25, 2013 5:31 PM

All replies

  • The last JOIN condition in second CTE is incorrect. The comparison is done on column from same table.

    try this,

    ;WITH baseData AS (
    
    SELECT 
    ProductID, 
    EventDate, 
    EventRead,
    DateRank =  ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY EventDate ASC) 
    FROM @demoTable where EventDesirable='y'
    ), recursiveData AS(
    
    SELECT 
    	ProductID, 
    	EventDate, 
    	EventRead,
    	DateRank 
    FROM baseData
    WHERE DateRank = 1
    
    UNION ALL
    
    SELECT
    bd.ProductID, 
    bd.EventDate, 
    bd.EventRead,
    bd.DateRank
    FROM baseData bd INNER JOIN recursiveData rd
    ON bd.ProductID =rd.ProductID
    WHERE bd.DateRank=rd.DateRank+1
    AND bd.EventDate > rd.EventDate
    AND bd.EventRead > rd.EventRead
    
    )
    
    SELECT * FROM recursiveData


    Thanks
    Sarat

    Please use Marked as Answer if my post solved your problem and use Vote As Helpful if a post was useful.

    Tuesday, June 25, 2013 12:40 AM
  • >> I'm working on a section of T-SQL code that is done using a Cursor type of query in that it parses through a dataset and looks for the next record [sic] that has a greater date and a greater value.  <<

    Rows are not records! A table is not a dataset! Why are you using terms from magnetic tape files and punch cards? This mindset is why you have a cursor; the cursor in SQL is based on the 1960's IBM tape file. 

    >> Below is a sample of data in a demo table I've included a column [EventDesirable] for ease in this discussion in identifying the desirable event and tagged the desirable records [sic] as 'y'. <<

    Of course, you would never use an assembly language like this in real SQL. Just as we would never have “table” in a table name. And why did you fail to use ISO-8601 dates? You cast a string to an integer over and over; why??  This is skeleton code, but the bones are broken. Here is a my guess and extra typing:

    CREATE TABLE Demo 
    (product_id VARCHAR(3) NOT NULL, 
     event_date DATE NOT NULL, 
     PRIMARY KEY (product_id, event_date),
     event_read INTEGER NOT NULL);

    INSERT INTO Demo
    VALUES ('abc', '2006-08-29', 0), 
       ('abc', '2012-12-10', 7454212), 
       ('abc', '2012-12-17', 7601927), 
       ('abc', '2012-12-26', 7684135), 
       ('abc', '2013-01-02', 7711375), 
       ('abc', '2013-01-14', 7802123), 
       ('abc', '2013-01-28', 7942439), 
       ('abc', '2013-01-29', 7972469), 
       ('abc', '2013-01-31', 7972511), 
       ('abc', '2013-02-26', 8805260), 
       ('abc', '2013-03-05', 8249572), 
       ('abc', '2013-03-07', 8284862), 
       ('abc', '2013-03-11', 8310419), 
       ('abc', '2013-03-13', 8322439), 
       ('abc', '2013-03-18', 8383571), 
       ('abc', '2013-04-22', 8721137), 
       ('abc', '2013-05-20', 9057232), 

    ;WITH X1 (product_id, event_date, event_read, next_event_read)
    AS
    (SELECT product_id, event_date, event_read, 
            LEAD (event_read) 
             OVER (PARTITION BY product_id ORDER BY event_date)
      FROM Demo),
    X2 (product_id, event_date, event_read, delta, prior_max)
    AS
    (SELECT product_id, event_date, event_read,
           (next_event_read - event_read),
            MAX(event_read) 
              OVER (PARTITION BY product_id ORDER BY event_date
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
      FROM X1)

      SELECT product_id, event_date, event_read,
             CASE WHEN prior_max <= event_read 
             THEN 'y' ELSE 'n' END AS demo_flg
       FROM X2
    -- ORDER BY product_id, event_date;

    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Tuesday, June 25, 2013 5:32 AM
  • A bit change in CELKO's solution. May be I my mistake to understand the requriement. Add below WHERE clause in SELECT statement of X2 (the last query)

    where prior_max <= event_read


    Best Luck, Shenoy

    Tuesday, June 25, 2013 6:57 AM
  • Sarat - Thanks for the help.  I think this only works because of the where EventDesirable clause in basedata.  When the where clause is commented out the last "good" record is missed.  That column is only present in the demo table to identify the desired results and shouldn't be used as part of the query.

    SELECT 
    ProductID, 
    EventDate, 
    EventRead,
    DateRank =  ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY EventDate ASC) 
    FROM @demoTable where EventDesirable='y'

    Tuesday, June 25, 2013 11:15 AM
  • Celko - Thanks for your help.  I believe this is written for SQL 2012, and I'm currently constraint to SQL 2008 R2 so I cannot test this approach.

    I'm curious as to what you are using for definitions/context? For example,  a quick google search for row and record  returned the same definition and the below link is one of many examples indicating they are used interchangeably.

    http://www.techterms.com/definition/record


    Thanks

    Mike
    Tuesday, June 25, 2013 11:28 AM
  • Sarat - Thanks for the help.  I think this only works because of the where EventDesirable clause in basedata.  When the where clause is commented out the last "good" record is missed.  That column is only present in the demo table to identify the desired results and shouldn't be used as part of the query.

    SELECT 
    ProductID, 
    EventDate, 
    EventRead,
    DateRank =  ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY EventDate ASC) 
    FROM @demoTable where EventDesirable='y'

    Even i put it as sample query , if all the 'eventdesirable' to be included , use Dense_rank() instead of row_number()

    ;WITH baseData AS (
    
    SELECT 
    ProductID, 
    EventDate, 
    EventRead,EventDesirable,
    DateRank =  dense_rank() OVER(PARTITION BY ProductID,EventDesirable ORDER BY EventDate ASC) 
    FROM @demoTable
    ), recursiveData AS(
    
    SELECT 
    	ProductID, 
    	EventDate, 
    	EventRead,EventDesirable,
    	DateRank 
    FROM baseData
    WHERE DateRank = 1
    
    UNION ALL
    
    SELECT
    bd.ProductID, 
    bd.EventDate, 
    bd.EventRead,rd.EventDesirable,
    bd.DateRank
    FROM baseData bd INNER JOIN recursiveData rd
    ON bd.ProductID =rd.ProductID and bd.EventDesirable = rd.EventDesirable
    WHERE bd.DateRank=rd.DateRank+1
    AND bd.EventDate > rd.EventDate
    AND bd.EventRead > rd.EventRead
    
    )
    
    SELECT * FROM recursiveData order by ProductID,EventDesirable,DateRank
    


    Thanks
    Sarat

    Please use Marked as Answer if my post solved your problem and use Vote As Helpful if a post was useful.

    Tuesday, June 25, 2013 4:19 PM
  • Sarat - Thank you again.   I think I'm not correctly explaining what I intended by including the EventDesirable Column .  That column will never be present in the data, I added it to condense the code I would have to post in the forums here (ie. posting once declare statement instead of two).   To aid in this discussion I've added a code block below with the two tables.

    Below are two separate variable tables with comments.  The first should look familiar, but the comments indicate , starting from the anchor, which records are desirable and which should be skipped and why.  The second variable table shows the desired result set from the query.  

    --had typos here
    DECLARE @demoSourceTable TABLE (ProductID varchar(3), EventDate date, EventRead int)
    INSERT INTO @demoSourceTable VALUES ('abc','08/29/2006','0')	--Anchor value
    INSERT INTO @demoSourceTable VALUES ('abc','12/10/2012','7454212')	---This value has incremented from the last "good" EventRead
    INSERT INTO @demoSourceTable VALUES ('abc','12/17/2012','7601927')	---This value has incremented from the last "good" EventRead
    INSERT INTO @demoSourceTable VALUES ('abc','12/26/2012','7684135')	---This value has incremented from the last "good" EventRead
    INSERT INTO @demoSourceTable VALUES ('abc','01/02/2013','7711375')	---This value has incremented from the last "good" EventRead
    INSERT INTO @demoSourceTable VALUES ('abc','01/14/2013','7802123')	---This value has incremented from the last "good" EventRead
    INSERT INTO @demoSourceTable VALUES ('abc','01/28/2013','7942439')	---This value has incremented from the last "good" EventRead
    INSERT INTO @demoSourceTable VALUES ('abc','01/29/2013','7972469')	---This value has incremented from the last "good" EventRead
    INSERT INTO @demoSourceTable VALUES ('abc','01/31/2013','7972511')	---This value has incremented from the last "good" EventRead
    INSERT INTO @demoSourceTable VALUES ('abc','02/26/2013','8805260')	---This value has incremented from the last "good" EventRead
    INSERT INTO @demoSourceTable VALUES ('abc','03/05/2013','8249572')	---EventRead has decremented, CTE should keep processing until it finds the next increment
    INSERT INTO @demoSourceTable VALUES ('abc','03/07/2013','8284862')	---EventRead has decremented, CTE should keep processing until it finds the next increment
    INSERT INTO @demoSourceTable VALUES ('abc','03/11/2013','8310419')  ---EventRead has decremented, CTE should keep processing until it finds the next increment
    INSERT INTO @demoSourceTable VALUES ('abc','03/13/2013','8322439')	---EventRead has decremented, CTE should keep processing until it finds the next increment
    INSERT INTO @demoSourceTable VALUES ('abc','03/18/2013','8383571')	---EventRead has decremented, CTE should keep processing until it finds the next increment
    INSERT INTO @demoSourceTable VALUES ('abc','04/22/2013','8721137')	---EventRead has decremented, CTE should keep processing until it finds the next increment
    INSERT INTO @demoSourceTable VALUES ('abc','05/20/2013','9057232')  ---This value has incremented from the last "good" EventRead
    
    SELECT * FROM @demoSourceTable  --- THis will return what the source data looks like
    
    DECLARE @resultsTable TABLE (ProductID varchar(3), EventDate date, EventRead int)
    INSERT INTO @resultsTable VALUES ('abc','08/29/2006','0')  
    INSERT INTO @resultsTable VALUES ('abc','12/10/2012','7454212')
    INSERT INTO @resultsTable VALUES ('abc','12/17/2012','7601927')
    INSERT INTO @resultsTable VALUES ('abc','12/26/2012','7684135')
    INSERT INTO @resultsTable VALUES ('abc','01/02/2013','7711375')
    INSERT INTO @resultsTable VALUES ('abc','01/14/2013','7802123')
    INSERT INTO @resultsTable VALUES ('abc','01/28/2013','7942439')
    INSERT INTO @resultsTable VALUES ('abc','01/29/2013','7972469')
    INSERT INTO @resultsTable VALUES ('abc','01/31/2013','7972511')
    INSERT INTO @resultsTable VALUES ('abc','02/26/2013','8805260')
    INSERT INTO @resultsTable VALUES ('abc','05/20/2013','9057232')
    
    SELECT * FROM @resultsTable  --This is what the query results should be from the @demoSourceTable table


    Again, thank you for your help.  I'm still not quite sure if I asking the question the right way.  Please let me know if more information will help.

    Thanks,

    Mike


    • Edited by recanem3 Tuesday, June 25, 2013 4:43 PM typos in first code block
    Tuesday, June 25, 2013 4:40 PM
  • Just an aside note - you need to install SP3 on your SQL Server version. With SQL Server you want to keep up with service packs, they are important.

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Tuesday, June 25, 2013 4:45 PM
    Moderator
  • Naomi -  Thanks.  Updating now, hadn't realized this service pack was out of date.
    Tuesday, June 25, 2013 4:56 PM
  • SELECT * 
    FROM @demoSourceTable s
    WHERE NOT EXISTS (SELECT * FROM @demoSourceTable s2
      WHERE s.ProductID = s2.ProductID AND s.EventDate > s2.EventDate AND s.EventRead <= s2.EventRead)

    should do what you want.  But I'm not sure that it will be faster than your cursor.  If you do the above, it will be a lot more efficient if you have either an index on ProductID and EventDate or an index on ProductID and EventRate.  And it may very well be best to have both indexes.

    Tom

    • Marked as answer by recanem3 Tuesday, June 25, 2013 7:40 PM
    Tuesday, June 25, 2013 5:31 PM
  • Actually, this is screwed up all over the place. We have three concepts to worry about! Here are the short, loose definitions

    Tuple: A mathematical abstraction that Dr. Codd made popular in the Relational Model. Relations are made of tuples. Tuples are made of dimensions. The dimensions have an abstract type that holds scalar values drawn from abstract domains. The dimensions in a tuple have no ordering. Tuples have no ordering in the relation. All we know about these values is that they have theta operators, so we can do joins.

    This is pure math; but it is the foundations of the Relational Model. Think of a number versus a numeral, IEEE floating point representations versus an irrational number. Etc. Only geeks liek me care about it. 

    Record: A physical data storage made of a contiguous physical unit . The classic example is “unit records”, aka punch cards. The fields are read left to right. The values are physically encoded with binary, BCD, Hollerith, etc. without any domain concept. The meaning of the fields is from the program reading the data! The key concepts are “sequence” and “contiguous” in this model. It works great for mag tapes and served us well for decades; this good today for many things other than database. 

    |The only place the term “field” appears in SQL is the parts of a temporal value (year, month, day, hour, minute second). The term “record” is for non-SQL language data and is not part of SQL itself.

    Row: The hybrid offspring in SQL and RDBMS. A table is made of unordered rows (tuple model). A

    row is made up of columns with a default ordering for use in SQL (record model) fro SELECT *, INSERT, etc. This default ordering might not be the physical ordering of the record that holds the row. In fact, many SQLs store the varying character columns at the end of the physical records of base tables to improve access.

    The data types are limited to a specific set (numeric, string, temporal, etc). The data type and its constraints are part of the column. This defines a domain of sorts (no operators, etd). This is not in the host program, but in the DDL. This is why 80-95% of the real work in SQL is done with DDL, not DML.

    A table can be virtual (VIEW, CTE, derived table) while a file cannot not. A row can be virtual while a record cannot not. Ever use a columnar database? If a row and record are the same thing, then show me. Given a row (a,b,c) each column can be on a separate physical disk, have redundant RAID copies on several disks or be created by a computation rule.

    This is why a Noob will materialize temp tables and an SQL programmer will use virtual table. Their mindset is still physical — they want to see punch cards and a mag tape! No abstractions! 

    Once a child makes the leap from physically counting (1,2,3, many) to abstract numbers (1,2,3, aleph nul), he has a mind tool that is incredibly powerful. After teaching SQL and RDBMS for 30+ years, I have found that thinking in set based abstraction (tables are no files, rows are not records, columns are not fields) is that same kind leap.  


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Tuesday, June 25, 2013 6:15 PM
  • Tom - Thank you. This is the solution I was looking for and has a substantial improvement in speed without the indexing (30 mins now vs. 24hrs+)

    Thanks,

    Mike

    Tuesday, June 25, 2013 7:45 PM