Efficient Cumulative Sum Calculation

    Question

  • Hello,

    Can anyone please advise me how to calculate a cumulative sum efficiently? My data structure can be represented by the example below:

    CustomerID  ProductID  PurchaseOrderID  Quantity  Cumulative Purchase
    1           1          1                12        12
    1           1          2                67        79
    1           1          3                12        91
    1           2          1                7         7
    1           2          2                65        72
    1           2          3                48        120
    2           2          1                20        20
    2           2          2                29        49
    2           2          3                24        73
    2           2          4                32        105
    2           3          1                12        12
    2           3          2                88        100
    2           3          3                19        119

    This is the code that generates the same table for SQL testing:

    declare @TestPurchases as table (
            CustomerID int,
            ProductID int,
            PurchaseOrderID int,
            Quantity int
    )

    insert @TestPurchases
    values  (1, 1, 1, 12),
            (1, 1, 2, 67),
            (1, 1, 3, 12),
            (1, 2, 1, 7),
            (1, 2, 2, 65),
            (1, 2, 3, 48),
            (2, 2, 1, 20),
            (2, 2, 2, 29),
            (2, 2, 3, 24),
            (2, 2, 4, 32),
            (2, 3, 1, 12),
            (2, 3, 2, 88),
            (2, 3, 3, 19)

    Each customer purchases one or more products and can purchase the same product more than once. I'm trying to calculate the cumulative sum of purchases as shown in the last column. I usually use the following code:

    select dt.CustomerID
            , dt.ProductID
            , dt.PurchaseOrderID
            , dt.Quantity
            , sum(CumSum.Quantity) as CumulativePurchase
    from @TestPurchases as dt
    left join @TestPurchases as CumSum
            on      dt.CustomerID = CumSum.CustomerID
            and     dt.ProductID = CumSum.ProductID
            and     dt.PurchaseOrderID >= CumSum.PurchaseOrderID
    group by dt.CustomerID
            , dt.ProductID
            , dt.PurchaseOrderID
            , dt.Quantity

    The code works well as long as no single customer has too many purchases of one product. The intermediate worktable built by the optimizer holds roughly half the square of the purchase count for each customer/product combination. My real data has several million customer/product combinations, and some of them have several thousand purchases.
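
    For a sense of the scale, here is a quick sketch that reuses the same join to count the intermediate rows on the sample data; a combination with n purchase orders expands to n*(n+1)/2 joined rows before the GROUP BY:

    select dt.CustomerID
            , dt.ProductID
            , count(*) as JoinedRows
    from @TestPurchases as dt
    join @TestPurchases as CumSum
            on      dt.CustomerID = CumSum.CustomerID
            and     dt.ProductID = CumSum.ProductID
            and     dt.PurchaseOrderID >= CumSum.PurchaseOrderID
    group by dt.CustomerID
            , dt.ProductID
    -- each combination with 3 orders shows 6 joined rows, the one with 4 orders shows 10;
    -- at several thousand orders per combination this grows to millions of joined rows per pair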

    Can anybody recommend an approach that would be more efficient than the query above?

    Thank you!

    Daniel
    Tuesday, August 06, 2013 8:39 PM

Answers

  • The easiest, shortest, and by far the most performant solution is to use the windowed SUM function. In SQL Server 2012 onwards you can add "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" in the OVER() clause - that will give you the running sum. See this BOL page. A minimal sketch of the query is shown below this post.

    • Marked as answer by drym Wednesday, August 07, 2013 2:43 AM
    Tuesday, August 06, 2013 10:36 PM
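
    A minimal sketch of the windowed SUM the answer describes, applied to the @TestPurchases table from the question (SQL Server 2012 or later):

    select dt.CustomerID
            , dt.ProductID
            , dt.PurchaseOrderID
            , dt.Quantity
            -- running total restarts for every customer/product combination
            , sum(dt.Quantity) over (partition by dt.CustomerID, dt.ProductID
                                     order by dt.PurchaseOrderID
                                     rows between unbounded preceding and current row) as CumulativePurchase
    from @TestPurchases as dt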

All replies

  • Can you please give us the indexes and your real table structure?

    If you are loading millions of rows into a table variable as above and are not using proper indexes, you might run into the performance trouble you mentioned.

    It will help us figure out the issue if you can post the indexes and the table structure.


    Nothing is Permanent... even Knowledge.... My Blog

    Tuesday, August 06, 2013 8:54 PM
  • Surendra,

    Thank you very much for your help. I'm afraid I can't share the real data structure, but continuing the analogy with this example, the table would use the following unique clustered index:

            CustomerID int,
            ProductID int,
            PurchaseOrderID int

    The real table has more attributes that are part of the index, but all of them are also part of the join clause.
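
    As a rough sketch of the shape being described (the table and constraint names here are illustrative, not the real structure):

    create table dbo.Purchases (
            CustomerID int not null,
            ProductID int not null,
            PurchaseOrderID int not null,
            Quantity int not null,
            -- unique clustered index on the three key columns, as described above
            constraint PK_Purchases primary key clustered (CustomerID, ProductID, PurchaseOrderID)
    )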

    Please let me know if this insight is helpful or if more details are needed.

    Thank you!

    Daniel

    Tuesday, August 06, 2013 10:19 PM
  • The easiest, shortest, and by far the most performant solution is to use the windowed SUM function. In SQL Server 2012 onwards you can add "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" in the OVER() clause - that will give you the running sum. See this BOL page.

    • Marked as answer by drym Wednesday, August 07, 2013 2:43 AM
    Tuesday, August 06, 2013 10:36 PM
  • Thank you very much, Vedran,

    That's exactly what I was looking for. My entire data set was processed in less than 3 minutes - that's fantastic! I just hope ROWS BETWEEN is a new 2012 feature, because I can't understand how I could have lived without it all those years.

    Thanks again!

    Daniel

    P.S. This is the solution for my original example based on Vedran's recommendation:

    declare @TestPurchases as table (
            CustomerID int,
            ProductID int,
            PurchaseOrderID int,
            Quantity int
    )

    insert @TestPurchases
    values  (1, 1, 1, 12),
            (1, 1, 2, 67),
            (1, 1, 3, 12),
            (1, 2, 1, 7),
            (1, 2, 2, 65),
            (1, 2, 3, 48),
            (2, 2, 1, 20),
            (2, 2, 2, 29),
            (2, 2, 3, 24),
            (2, 2, 4, 32),
            (2, 3, 1, 12),
            (2, 3, 2, 88),
            (2, 3, 3, 19)

    select dt.CustomerID
            , dt.ProductID
            , dt.PurchaseOrderID
            , dt.Quantity
            , SUM(Quantity) OVER (PARTITION BY dt.CustomerID, dt.ProductID
                                  ORDER BY PurchaseOrderID
                                  ROWS UNBOUNDED PRECEDING) AS CumulativeTotal
    from @TestPurchases as dt

    Wednesday, August 07, 2013 2:43 AM
  • There is one thing you could try to optimize this query even further: creating a supporting index. The key columns should be the PARTITION BY columns followed by the ORDER BY columns of the windowing clause. Included columns should be all other columns you reference in the query.
    For example:
    CREATE INDEX TestPurchases_IX_CustomerProduct ON #TestPurchases
    (CustomerID, ProductID, PurchaseOrderID) INCLUDE(Quantity)
    The goal is to eliminate the sort operation. Verify that the estimated execution plan has changed and no longer includes a sort operator. The query should have no outer ORDER BY (outside the windowing function), or it must exactly match the index key, e.g. ORDER BY CustomerID, ProductID, PurchaseOrderID.
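    Since the sample data above lives in a table variable, one way to try this is to copy the rows into a temp table first so the suggested index can be created; a sketch combining the earlier pieces:

    -- copy the sample rows into a temp table that can carry the supporting index
    select CustomerID, ProductID, PurchaseOrderID, Quantity
    into #TestPurchases
    from @TestPurchases

    -- key columns match PARTITION BY + ORDER BY; Quantity is included to cover the query
    create index TestPurchases_IX_CustomerProduct on #TestPurchases
            (CustomerID, ProductID, PurchaseOrderID) include (Quantity)

    select CustomerID, ProductID, PurchaseOrderID, Quantity
            , sum(Quantity) over (partition by CustomerID, ProductID
                                  order by PurchaseOrderID
                                  rows unbounded preceding) as CumulativeTotal
    from #TestPurchases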
    Wednesday, August 07, 2013 7:07 AM
  • Thank you very much Vedran, the index is helping a lot.

    Thanks!

    Daniel

    Wednesday, August 07, 2013 10:07 PM
  • I'm glad to hear that.
    Wednesday, August 07, 2013 10:20 PM