This article discusses one of the problems of relational algebra which was brought up in this Transact-SQL MSDN Forum T-sql - finding all sales orders that have similar products.



Introduction


There are certain kinds of problems in relational databases which may be solved using principals from Relational Division. There are many articles about Relational Division or Relational Algebra. Here is a list of a few very interesting articles Divided We Stand: The SQL of Relational Division by Celko and Relational division and  Relationally Divided over EAV by Peter Larsson and readers may want to take a look at them and other articles on this topic. 

Problem Definition


In the aforementioned thread the topic starter first wanted to find all orders that have similar products. He provided the table definition along with few rows of data.
Rather than using data from that thread let's consider the same problem but using AdventureWorks database instead. First let's see a solution for the problem of finding orders that have the same products.

Solutions


This problem has several solutions. First two are the true relational division solutions and the last solution is non-portable T-SQL only solution based on the de-normalization of the table. The first solution in that script was suggested by Peter Larsson after I asked him to check this article. I'll post the script I ran to compare all three solutions:

USE AdventureWorks2012;
SET NOCOUNT ON;
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
PRINT 'PESO Solution';
 
SELECT t1.SalesOrderID AS OrderID
    ,t2.SalesOrderID AS SimilarOrderID
FROM (
    SELECT SalesOrderID
        ,COUNT(*) AS Items
        ,MIN(ProductID) AS minProdID
        ,MAX(ProductID) AS maxProdID
    FROM Sales.SalesOrderDetail
    GROUP BY SalesOrderID
    ) AS v
INNER JOIN Sales.SalesOrderDetail AS t1 ON t1.SalesOrderID = v.SalesOrderID
INNER JOIN Sales.SalesOrderDetail AS t2 ON t2.ProductID = t1.ProductID
    AND t2.SalesOrderID > t1.SalesOrderID
INNER JOIN (
    SELECT SalesOrderID
        ,COUNT(*) AS Items
        ,MIN(ProductID) AS minProdID
        ,MAX(ProductID) AS maxProdID
    FROM Sales.SalesOrderDetail
    GROUP BY SalesOrderID
    ) AS w ON w.SalesOrderID = t2.SalesOrderID
WHERE w.minProdID = v.minProdID
    AND w.maxProdID = v.maxProdID
    AND w.Items = v.Items
GROUP BY t1.SalesOrderID
    ,t2.SalesOrderID
HAVING COUNT(*) = MIN(v.Items);
 
PRINT 'Common Relational Division /CELKO/Naomi solution';
 
SELECT O1.SalesOrderId AS OrderID
    ,O2.SalesOrderID AS SimilarOrderID
FROM Sales.SalesOrderDetail O1
INNER JOIN Sales.SalesOrderDetail O2 ON O1.ProductID = O2.ProductID
    AND O1.SalesOrderID < O2.SalesOrderID
GROUP BY O1.SalesOrderID
    ,O2.SalesOrderID
HAVING COUNT(O1.ProductID) = (
        SELECT COUNT(ProductID)
        FROM Sales.SalesOrderDetail SD1
        WHERE SD1.SalesOrderID = O1.SalesOrderID
        )
    AND COUNT(O2.ProductID) = (
        SELECT COUNT(ProductID)
        FROM Sales.SalesOrderDetail SD2
        WHERE SD2.SalesOrderID = O2.SalesOrderID
        );
 
 
PRINT 'XML PATH de-normalization solution';
 
WITH cte
AS (
    SELECT SalesOrderID
        ,STUFF((
                SELECT ', ' + CAST(ProductID AS VARCHAR(30))
                FROM Sales.SalesOrderDetail SD1
                WHERE SD1.SalesOrderID = SD.SalesOrderID
                ORDER BY ProductID
                FOR XML PATH('')
                ), 1, 2, '') AS Products
    FROM Sales.SalesOrderDetail SD
    GROUP BY SD.SalesOrderID
    )
SELECT cte.SalesOrderID AS OrderID
    ,cte1.SalesOrderID AS SimilarOrderID
    ,cte.Products
FROM cte
INNER JOIN cte AS cte1 ON cte.SalesOrderID < cte1.SalesOrderID
    AND cte.Products = cte1.Products;
 
SET STATISTICS IO OFF;

First solution JOINS with the number of items in each order and MIN/MAX product in each order. This solution is based on the idea Peter proposed in this closed MS Connect Item Move T-SQL language closer to completion with a DIVIDE BY operator.

The second solution self-joins the table based on the ProductID using an extra condition of O1.OrderID < O2.Order2 (we're using < instead of <> in order to avoid opposite combinations), then groups by both OrderID columns and uses HAVING clause to make sure the number of products is the same as the number of products in each individual order. This HAVING idea is very typical for the Relational Division problem.

Interestingly, the number of combinations in AdventureWorks database is 1,062, 238 (more than rows in the SalesOrderDetail table itself). This is due to the fact that many orders consist of only single product.

The last solution is rather straightforward and uses XML PATH approach to get all products in one row for each order ID, then self-joins based on this new Products column.  This solution is not portable into other relational database languages but specific for T-SQL only. Interestingly, it performs better than second 'true' Relational Division solution as you can see in this picture.



As you can see, the first query takes 0%, second 60% while the last takes 40% of the execution time. 

The last solution, however, is also not very flexible and is only suitable for finding exact matches.

These are results I got on SQL Server 2012 SP1 64 bit (they are much better on SQL Server 2014 CTP according to Peter):

PESO Solution

Table 'SalesOrderDetail'. Scan count 3410626, logical reads 7265595, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 855922, logical reads 3462746, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 35272 ms,  elapsed time = 114920 ms.

Common Relational Division /CELKO/Naomi solution

 Table 'SalesOrderDetail'. Scan count 36, logical reads 3292, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 266, logical reads 907592, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 478703 ms,  elapsed time = 214748 ms.

XML PATH de-normalization solution

Table 'Worktable'. Scan count 0, logical reads 12971, physical reads 0, read-ahead reads 0, lob logical reads 8764, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 62932, logical reads 194266, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 5054 ms,  elapsed time = 14069 ms.

Best Exact Match Solution


Peter sent yet another variation of the solution for the integer Product ID (this solution will not work if the product ID /Item ID uses character or GUID key).

SELECT                   v.SalesOrderID AS OrderID,
                         w.SalesOrderID AS SimilarOrderID,
                         v.Items
FROM                     (
                                     SELECT                   SalesOrderID,
                                                              COUNT(ProductID) AS Items,
                                                              MIN(ProductID) AS minProdID,
                                                              MAX(ProductID) AS maxProdID,
                                                              SUM(ProductID) AS sumProdID,
                                                              CHECKSUM_AGG(10000 * ProductID) AS cs
                                     FROM                     Sales.SalesOrderDetail
                                     GROUP BY    SalesOrderID
                         ) AS v
INNER JOIN  (
                                     SELECT                   SalesOrderID,
                                                              COUNT(ProductID) AS Items,
                                                              MIN(ProductID) AS minProdID,
                                                              MAX(ProductID) AS maxProdID,
                                                              SUM(ProductID) AS sumProdID,
                                                              CHECKSUM_AGG(10000 * ProductID) AS cs
                                     FROM                     Sales.SalesOrderDetail
                                     GROUP BY    SalesOrderID
                         ) AS w ON w.Items = v.Items
                                     AND w.minProdID = v.minProdID
                                     AND w.maxProdID = v.maxProdID
                                     AND w.cs = v.cs
                                     AND w.sumProdID = v.sumProdID
WHERE                    w.SalesOrderID > v.SalesOrderID

This solution joins 2 aggregate information together based on CHECKSUM_AGG function. By checking all these aggregate functions it is enough to conclude if the orders consist of the same products or not. This is the simplest and ingenious query and it performs the best among the other variations I tried. The limitation of this query is that it assumes integer key for the product id.

I got the following results for this solution:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 2, logical reads 2492, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 562 ms,  elapsed time = 9791 ms.

Slight Variation of the Original Problem


In that thread the topic starter also wanted to compare orders based on partial similarity. You may recognize this problem as 'Customers who bought this item also bought...' as you often can see in different websites.

Say, we want to find orders, that have 2/3 or more of the products matching. We will only consider orders with more than 2 items (3 and up) for this problem. The first solution can be easily adjusted for this new problem:

;
 
WITH cte
AS (
    SELECT SalesOrderID
        ,ProductID
        ,COUNT(ProductID) OVER (PARTITION BY SalesOrderID) AS ProductsCount
    FROM Sales.SalesOrderDetail
    )
SELECT O1.SalesOrderId AS OrderID
    ,O2.SalesOrderID AS SimilarOrderID
FROM cte O1
INNER JOIN cte O2 ON O1.ProductID = O2.ProductID
    AND O1.SalesOrderID < O2.SalesOrderID
WHERE O1.ProductsCount > = 3
    AND O2.ProductsCount >= 3
GROUP BY O1.SalesOrderID
    ,O2.SalesOrderID
HAVING COUNT(O1.ProductID) >= (
        (
            SELECT COUNT(ProductID)
            FROM Sales.SalesOrderDetail SD1
            WHERE SD1.SalesOrderID = O1.SalesOrderID
            ) * 2.0
        ) / 3.0
    AND COUNT(O2.ProductID) >= (
        (
            SELECT COUNT(ProductID)
            FROM Sales.SalesOrderDetail SD2
            WHERE SD2.SalesOrderID = O2.SalesOrderID
            ) * 2.0
        ) / 3.0
ORDER BY OrderID
    ,SimilarOrderID;


We can verify our results back for the few first rows:
SELECT SalesOrderID
    ,stuff((
            SELECT ', ' + cast(ProductID AS VARCHAR(30))
            FROM Sales.SalesOrderDetail SD1
            WHERE SD1.SalesOrderID = SD.SalesOrderID
            ORDER BY ProductID
            FOR XML PATH('')
            ), 1, 2, '') AS Products
FROM Sales.SalesOrderDetail SD
WHERE SalesOrderID IN (
        43659
        ,43913,
 
        43659,  44528,
43659,  44566,
43659,  44761,
43659,  46077)
         
GROUP BY SalesOrderID
ORDER BY SalesOrderID


I will show two variations of the solutions for the similar orders problem. While I am getting better reads for the second query, the execution time is much better for the first query:

SET STATISTICS TIME ON;
SET STATISTICS IO ON;
 
DECLARE @Percentage DECIMAL(10, 2);
 
SET @Percentage = 0.75;
 
WITH cte
AS (
    SELECT SalesOrderID
        ,ProductID
        ,COUNT(ProductID) OVER (PARTITION BY SalesOrderID) AS ProductsCount
    FROM Sales.SalesOrderDetail
    )
SELECT O1.SalesOrderId AS OrderID
    ,O2.SalesOrderID AS SimilarOrderID
FROM cte O1
INNER JOIN cte O2 ON O1.ProductID = O2.ProductID
    AND O1.SalesOrderID < O2.SalesOrderID
WHERE O1.ProductsCount > = 3
    AND O2.ProductsCount >= 3
GROUP BY O1.SalesOrderID
    ,O2.SalesOrderID
HAVING COUNT(O1.ProductID) >= (
        SELECT COUNT(ProductID)
        FROM Sales.SalesOrderDetail SD1
        WHERE SD1.SalesOrderID = O1.SalesOrderID
        ) * @Percentage
    AND COUNT(O2.ProductID) >= (
        SELECT COUNT(ProductID)
        FROM Sales.SalesOrderDetail SD2
        WHERE SD2.SalesOrderID = O2.SalesOrderID
        ) * @Percentage
ORDER BY OrderID
    ,SimilarOrderID;
 
WITH cte
AS (
    SELECT SalesOrderID
        ,COUNT(ProductID) AS Items
    FROM Sales.SalesOrderDetail
    GROUP BY SalesOrderID
    )
SELECT O1.SalesOrderId AS OrderID
    ,MIN(C1.Items) AS [Products 1]
    ,O2.SalesOrderID AS SimilarOrderID
    ,MIN(C2.Items) AS [Products 2]
FROM Sales.SalesOrderDetail O1
INNER JOIN cte C1 ON O1.SalesOrderID = C1.SalesOrderID
INNER JOIN Sales.SalesOrderDetail O2 ON O1.ProductID = O2.ProductID
    AND O1.SalesOrderID < O2.SalesOrderID
INNER JOIN cte C2 ON O2.SalesOrderID = C2.SalesOrderID
 
GROUP BY O1.SalesOrderID   
    ,O2.SalesOrderID
     
HAVING COUNT(*) >= MIN(C1.Items) * @Percentage
    AND COUNT(*) >= MIN(C2.Items) * @Percentage
    AND MIN(C1.Items)>=3 AND MIN(C2.Items) >=3
ORDER BY OrderID
    ,SimilarOrderID;

I will be interested in your results and ideas for this type of the query for partial (percentage) match.


Note from the Comments


I received the following comment from Andriy Zabavskyy:

We could significantly improve the first out of the latest two solutions extending the join condition by the following statement:

AND O1.ProductsCount between floor(O2.ProductsCount * @Percentage) and ceiling(O2.ProductsCount / @Percentage)

On my local machine it gives an improvement equal to about 2.5 times in terms of time execution.

I let the readers of this article to test this suggestion.


Conclusion


In this article I showed that some common relational division problems can be solved using set-based solutions. These solutions may not perform well, however, on the big datasets. I encourage the readers to provide their ideas for the listed problems and their Pros/Cons.

Credits


I want to thank Peter Larsson who reviewed this article and came up with several great solution ideas. I knew I can count on Peter!


See Also



This article participated in the TechNet Guru Contributions for December 2013 and won the Gold Prize.