This article discusses one of the problems of relational algebra which was brought up in this Transact-SQL MSDN Forum T-sql - finding all sales orders that have similar products.

# Introduction

There are certain kinds of problems in relational databases which may be solved using principals from Relational Division. There are many articles about Relational Division or Relational Algebra. Here is a list of a few very interesting articles Divided We Stand: The SQL of Relational Division by Celko and Relational division and  Relationally Divided over EAV by Peter Larsson and readers may want to take a look at them and other articles on this topic.

# Problem Definition

In the aforementioned thread the topic starter first wanted to find all orders that have similar products. He provided the table definition along with few rows of data.
Rather than using data from that thread let's consider the same problem but using AdventureWorks database instead. First let's see a solution for the problem of finding orders that have the same products.

# Solutions

This problem has several solutions. First two are the true relational division solutions and the last solution is non-portable T-SQL only solution based on the de-normalization of the table. The first solution in that script was suggested by Peter Larsson after I asked him to check this article. I'll post the script I ran to compare all three solutions:

`USE AdventureWorks2012;`
`SET` `NOCOUNT ``ON``;`
`SET` `STATISTICS` ``` IO ````ON``;`
`SET` `STATISTICS` ``` TIME``` `ON``;`
```PRINT ````'PESO Solution'``;`

`SELECT` `t1.SalesOrderID ``AS` `OrderID`
`    ````,t2.SalesOrderID ````AS` ``` SimilarOrderID```
`FROM` `(`
`    ``SELECT` `SalesOrderID`
`        ``,``COUNT````(*) ````AS` ``` Items```
`        ``,``MIN````(ProductID) ````AS` ``` minProdID```
`        ``,``MAX````(ProductID) ````AS` ``` maxProdID```
`    ``FROM` `Sales.SalesOrderDetail`
`    ``GROUP` `BY` ``` SalesOrderID```
`    ````) ````AS` ``` v```
`INNER` `JOIN` ```Sales.SalesOrderDetail ````AS` ``` t1 ````ON` ``` t1.SalesOrderID = v.SalesOrderID```
`INNER` `JOIN` ```Sales.SalesOrderDetail ````AS` ``` t2 ````ON` ``` t2.ProductID = t1.ProductID```
`    ``AND` `t2.SalesOrderID > t1.SalesOrderID`
`INNER` `JOIN` `(`
`    ``SELECT` `SalesOrderID`
`        ``,``COUNT````(*) ````AS` ``` Items```
`        ``,``MIN````(ProductID) ````AS` ``` minProdID```
`        ``,``MAX````(ProductID) ````AS` ``` maxProdID```
`    ``FROM` `Sales.SalesOrderDetail`
`    ``GROUP` `BY` ``` SalesOrderID```
`    ````) ````AS` ``` w ````ON` ``` w.SalesOrderID = t2.SalesOrderID```
`WHERE` `w.minProdID = v.minProdID`
`    ``AND` `w.maxProdID = v.maxProdID`
`    ``AND` `w.Items = v.Items`
`GROUP` `BY` ``` t1.SalesOrderID```
`    ``,t2.SalesOrderID`
`HAVING` `COUNT``(*) = ` `MIN``(v.Items);`

```PRINT ````'Common Relational Division /CELKO/Naomi solution'``;`

`SELECT` `O1.SalesOrderId ``AS` `OrderID`
`    ````,O2.SalesOrderID ````AS` ``` SimilarOrderID```
`FROM` `Sales.SalesOrderDetail O1`
`INNER` `JOIN` ```Sales.SalesOrderDetail O2 ````ON` ``` O1.ProductID = O2.ProductID```
`    ``AND` `O1.SalesOrderID < O2.SalesOrderID`
`GROUP` `BY` ``` O1.SalesOrderID```
`    ``,O2.SalesOrderID`
`HAVING` `COUNT``(O1.ProductID) = (`
`        ``SELECT` `COUNT``(ProductID)`
`        ``FROM` `Sales.SalesOrderDetail SD1`
`        ``WHERE` `SD1.SalesOrderID = O1.SalesOrderID`
`        ``)`
`    ``AND` `COUNT``(O2.ProductID) = (`
`        ``SELECT` `COUNT``(ProductID)`
`        ``FROM` `Sales.SalesOrderDetail SD2`
`        ``WHERE` `SD2.SalesOrderID = O2.SalesOrderID`
`        ``);`

```PRINT ````'XML PATH de-normalization solution'``;`

`WITH` `cte`
`AS` `(`
`    ``SELECT` `SalesOrderID`
`        ``,STUFF((`
`                ``SELECT` `', '` `+ ``CAST````(ProductID ````AS` ``` VARCHAR````(30))`
`                ``FROM` `Sales.SalesOrderDetail SD1`
`                ``WHERE` `SD1.SalesOrderID = SD.SalesOrderID`
`                ``ORDER` `BY` ``` ProductID```
`                ``FOR` `XML PATH(``''``)`
`                ````), 1, 2, ````''``) ``AS` `Products`
`    ``FROM` `Sales.SalesOrderDetail SD`
`    ``GROUP` `BY` ``` SD.SalesOrderID```
`    ``)`
`SELECT` `cte.SalesOrderID ``AS` `OrderID`
`    ````,cte1.SalesOrderID ````AS` ``` SimilarOrderID```
`    ``,cte.Products`
`FROM` `cte`
`INNER` `JOIN` `cte ``AS` `cte1 ``ON` `cte.SalesOrderID < cte1.SalesOrderID`
`    ``AND` `cte.Products = cte1.Products;`

`SET` `STATISTICS` ``` IO ````OFF``;`

First solution JOINS with the number of items in each order and MIN/MAX product in each order. This solution is based on the idea Peter proposed in this closed MS Connect Item Move T-SQL language closer to completion with a DIVIDE BY operator.

The second solution self-joins the table based on the ProductID using an extra condition of O1.OrderID < O2.Order2 (we're using < instead of <> in order to avoid opposite combinations), then groups by both OrderID columns and uses HAVING clause to make sure the number of products is the same as the number of products in each individual order. This HAVING idea is very typical for the Relational Division problem.

Interestingly, the number of combinations in AdventureWorks database is 1,062, 238 (more than rows in the SalesOrderDetail table itself). This is due to the fact that many orders consist of only single product.

The last solution is rather straightforward and uses XML PATH approach to get all products in one row for each order ID, then self-joins based on this new Products column.  This solution is not portable into other relational database languages but specific for T-SQL only. Interestingly, it performs better than second 'true' Relational Division solution as you can see in this picture.

As you can see, the first query takes 0%, second 60% while the last takes 40% of the execution time.

The last solution, however, is also not very flexible and is only suitable for finding exact matches.

These are results I got on SQL Server 2012 SP1 64 bit (they are much better on SQL Server 2014 CTP according to Peter):

PESO Solution

SQL Server Execution Times:
CPU time = 35272 ms,  elapsed time = 114920 ms.

Common Relational Division /CELKO/Naomi solution

SQL Server Execution Times:
CPU time = 478703 ms,  elapsed time = 214748 ms.

XML PATH de-normalization solution

SQL Server Execution Times:
CPU time = 5054 ms,  elapsed time = 14069 ms.

# Best Exact Match Solution

Peter sent yet another variation of the solution for the integer Product ID (this solution will not work if the product ID /Item ID uses character or GUID key).

`SELECT`                   `v.SalesOrderID ``AS` `OrderID,`
`                         ````w.SalesOrderID ````AS` ``` SimilarOrderID,```
`                         ``v.Items`
`FROM`                     `(`
`                                     ``SELECT`                   `SalesOrderID,`
`                                                              ``COUNT````(ProductID) ````AS` ``` Items,```
`                                                              ``MIN````(ProductID) ````AS` ``` minProdID,```
`                                                              ``MAX````(ProductID) ````AS` ``` maxProdID,```
`                                                              ``SUM````(ProductID) ````AS` ``` sumProdID,```
`                                                              ````CHECKSUM_AGG(10000 * ProductID) ````AS` ``` cs```
`                                     ``FROM`                     `Sales.SalesOrderDetail`
`                                     ``GROUP` `BY`    ``` SalesOrderID```
`                         ````) ````AS` ``` v```
`INNER` `JOIN`  `(`
`                                     ``SELECT`                   `SalesOrderID,`
`                                                              ``COUNT````(ProductID) ````AS` ``` Items,```
`                                                              ``MIN````(ProductID) ````AS` ``` minProdID,```
`                                                              ``MAX````(ProductID) ````AS` ``` maxProdID,```
`                                                              ``SUM````(ProductID) ````AS` ``` sumProdID,```
`                                                              ````CHECKSUM_AGG(10000 * ProductID) ````AS` ``` cs```
`                                     ``FROM`                     `Sales.SalesOrderDetail`
`                                     ``GROUP` `BY`    ``` SalesOrderID```
`                         ````) ````AS` ``` w ````ON` ``` w.Items = v.Items```
`                                     ``AND` `w.minProdID = v.minProdID`
`                                     ``AND` `w.maxProdID = v.maxProdID`
`                                     ``AND` `w.cs = v.cs`
`                                     ``AND` `w.sumProdID = v.sumProdID`
`WHERE`                    `w.SalesOrderID > v.SalesOrderID`

This solution joins 2 aggregate information together based on CHECKSUM_AGG function. By checking all these aggregate functions it is enough to conclude if the orders consist of the same products or not. This is the simplest and ingenious query and it performs the best among the other variations I tried. The limitation of this query is that it assumes integer key for the product id.

I got the following results for this solution:

SQL Server Execution Times:
CPU time = 562 ms,  elapsed time = 9791 ms.

# Slight Variation of the Original Problem

In that thread the topic starter also wanted to compare orders based on partial similarity. You may recognize this problem as 'Customers who bought this item also bought...' as you often can see in different websites.

Say, we want to find orders, that have 2/3 or more of the products matching. We will only consider orders with more than 2 items (3 and up) for this problem. The first solution can be easily adjusted for this new problem:

`;`

`WITH` `cte`
`AS` `(`
`    ``SELECT` `SalesOrderID`
`        ``,ProductID`
`        ``,``COUNT````(ProductID) OVER (PARTITION ````BY` ``` SalesOrderID) ````AS` ``` ProductsCount```
`    ``FROM` `Sales.SalesOrderDetail`
`    ``)`
`SELECT` `O1.SalesOrderId ``AS` `OrderID`
`    ````,O2.SalesOrderID ````AS` ``` SimilarOrderID```
`FROM` `cte O1`
`INNER` `JOIN` `cte O2 ` `ON` ``` O1.ProductID = O2.ProductID```
`    ``AND` `O1.SalesOrderID < O2.SalesOrderID`
`WHERE` `O1.ProductsCount > = 3`
`    ``AND` `O2.ProductsCount >= 3`
`GROUP` `BY` ``` O1.SalesOrderID```
`    ``,O2.SalesOrderID`
`HAVING` `COUNT``(O1.ProductID) >= (`
`        ``(`
`            ``SELECT` `COUNT``(ProductID)`
`            ``FROM` `Sales.SalesOrderDetail SD1`
`            ``WHERE` `SD1.SalesOrderID = O1.SalesOrderID`
`            ``) * 2.0`
`        ``) / 3.0`
`    ``AND` `COUNT``(O2.ProductID) >= (`
`        ``(`
`            ``SELECT` `COUNT``(ProductID)`
`            ``FROM` `Sales.SalesOrderDetail SD2`
`            ``WHERE` `SD2.SalesOrderID = O2.SalesOrderID`
`            ``) * 2.0`
`        ``) / 3.0`
`ORDER` `BY` ``` OrderID```
`    ``,SimilarOrderID;`

We can verify our results back for the few first rows:
`SELECT` `SalesOrderID`
`    ``,stuff((`
`            ``SELECT` `', '` `+ ``cast````(ProductID ````AS` ``` VARCHAR````(30))`
`            ``FROM` `Sales.SalesOrderDetail SD1`
`            ``WHERE` `SD1.SalesOrderID = SD.SalesOrderID`
`            ``ORDER` `BY` ``` ProductID```
`            ``FOR` `XML PATH(``''``)`
`            ````), 1, 2, ````''``) ``AS` `Products`
`FROM` `Sales.SalesOrderDetail SD`
`WHERE` `SalesOrderID ``IN` `(`
`        ``43659`
`        ``,43913,`

`        ``43659,  44528,`
`43659,  44566,`
`43659,  44761,`
`43659,  46077)`
`        `
`GROUP` `BY` ``` SalesOrderID```
`ORDER` `BY` ``` SalesOrderID```

I will show two variations of the solutions for the similar orders problem. While I am getting better reads for the second query, the execution time is much better for the first query:

`SET` `STATISTICS` ``` TIME``` `ON``;`
`SET` `STATISTICS` ``` IO ````ON``;`

`DECLARE` `@Percentage ``DECIMAL``(10, 2);`

`SET` `@Percentage = 0.75;`

`WITH` `cte`
`AS` `(`
`    ``SELECT` `SalesOrderID`
`        ``,ProductID`
`        ``,``COUNT````(ProductID) OVER (PARTITION ````BY` ``` SalesOrderID) ````AS` ``` ProductsCount```
`    ``FROM` `Sales.SalesOrderDetail`
`    ``)`
`SELECT` `O1.SalesOrderId ``AS` `OrderID`
`    ````,O2.SalesOrderID ````AS` ``` SimilarOrderID```
`FROM` `cte O1`
`INNER` `JOIN` `cte O2 ` `ON` ``` O1.ProductID = O2.ProductID```
`    ``AND` `O1.SalesOrderID < O2.SalesOrderID`
`WHERE` `O1.ProductsCount > = 3`
`    ``AND` `O2.ProductsCount >= 3`
`GROUP` `BY` ``` O1.SalesOrderID```
`    ``,O2.SalesOrderID`
`HAVING` `COUNT``(O1.ProductID) >= (`
`        ``SELECT` `COUNT``(ProductID)`
`        ``FROM` `Sales.SalesOrderDetail SD1`
`        ``WHERE` `SD1.SalesOrderID = O1.SalesOrderID`
`        ``) * @Percentage`
`    ``AND` `COUNT``(O2.ProductID) >= (`
`        ``SELECT` `COUNT``(ProductID)`
`        ``FROM` `Sales.SalesOrderDetail SD2`
`        ``WHERE` `SD2.SalesOrderID = O2.SalesOrderID`
`        ``) * @Percentage`
`ORDER` `BY` ``` OrderID```
`    ``,SimilarOrderID;`

`WITH` `cte`
`AS` `(`
`    ``SELECT` `SalesOrderID`
`        ``,``COUNT````(ProductID) ````AS` ``` Items```
`    ``FROM` `Sales.SalesOrderDetail`
`    ``GROUP` `BY` ``` SalesOrderID```
`    ``)`
`SELECT` `O1.SalesOrderId ``AS` `OrderID`
`    ``,``MIN````(C1.Items) ````AS` ``` [Products 1]```
`    ````,O2.SalesOrderID ````AS` ``` SimilarOrderID```
`    ``,``MIN````(C2.Items) ````AS` ``` [Products 2]```
`FROM` `Sales.SalesOrderDetail O1`
`INNER` `JOIN` `cte C1 ` `ON` ``` O1.SalesOrderID = C1.SalesOrderID```
`INNER` `JOIN` ```Sales.SalesOrderDetail O2 ````ON` ``` O1.ProductID = O2.ProductID```
`    ``AND` `O1.SalesOrderID < O2.SalesOrderID`
`INNER` `JOIN` `cte C2 ` `ON` ``` O2.SalesOrderID = C2.SalesOrderID```

`GROUP` `BY` ``` O1.SalesOrderID    ```
`    ``,O2.SalesOrderID`
`    `
`HAVING` `COUNT``(*) >= ` `MIN``(C1.Items) * @Percentage`
`    ``AND` `COUNT``(*) >= ` `MIN``(C2.Items) * @Percentage`
`    ``AND` `MIN````(C1.Items)>=3 ````AND` ``` MIN````(C2.Items) >=3`
`ORDER` `BY` ``` OrderID```
`    ``,SimilarOrderID;`

I will be interested in your results and ideas for this type of the query for partial (percentage) match.

I received the following comment from Andriy Zabavskyy:

We could significantly improve the first out of the latest two solutions extending the join condition by the following statement:

`AND` `O1.ProductsCount ``between` `floor(O2.ProductsCount * @Percentage) ``and` `ceiling(O2.ProductsCount / @Percentage)`

On my local machine it gives an improvement equal to about 2.5 times in terms of time execution.

# Conclusion

In this article I showed that some common relational division problems can be solved using set-based solutions. These solutions may not perform well, however, on the big datasets. I encourage the readers to provide their ideas for the listed problems and their Pros/Cons.

# Credits

I want to thank Peter Larsson who reviewed this article and came up with several great solution ideas. I knew I can count on Peter!