none
T-sql - finding all sales orders that have similar products.

    Question

  • Hi,

    I've the following table and the data in it as well as per below. My question now is, I've to find all orders that have similar products placed in the sales table.

    So looking at the data entered, I should get output as -  SO1,SO2,SO3,SO8,SO4,SO6 since

    S01,SO2 have same products placed

    SO3,S08 have same products placed

    SO4,SO6 have same products placed

    CREATE TABLE [dbo].[SALES_TEST](
    [SALESID] [nchar](10) NULL,
    [ITEMID] [nchar](10) NULL
    ) ON [PRIMARY]

    INSERT INTO [dbo].[SALES_TEST]
               ([SALESID]
               ,[ITEMID])
         VALUES
    ('SO1','BA1'),       
    ('SO1','KM100'),
    ('SO1','SM1RR'),
    ('SO1','ZT6'),

    ('SO2','BA1'),
    ('SO2','KM100'),
    ('SO2','sM1RR'),
    ('SO2','ZT6'),       

    ('SO3','BA1'),       
    ('SO3','KM100'),     
    ('SO3','SM1RR'),     

    ('SO4','BA2'),       
    ('SO4','KM200') ,    
    ('SO4','SM1RR'),     
    ('SO4','ZT6'),       

    ('SO5','KM100'),     
    ('SO5','SM1RR'),     
    ('SO5','ZT6'),       

    ('SO6','BA2'),       
    ('SO6','KM200'),     
    ('SO6','SM1RR'),     
    ('SO6','ZT6'),      

    'SO7','BA2'),       
    ('SO7','KM100'),     
    ('SO7','SM1RR'),     
    ('SO7','ZT6'),

    ('SO8','BA1'),       
    ('SO8','KM100'),     
    ('SO8','SM1RR')        


    Tuesday, December 24, 2013 6:45 PM

Answers

  • The first quick solution that comes to mind is to de-normalize that table and compare that way. So, this is one possible solution:

    IF OBJECT_id('TEMPDB..#SalesItems', N'U') IS NOT NULL
    	DROP TABLE #SalesItems;
    
    SELECT SalesID AS SalesID
    	,STUFF((
    			SELECT ', ' + ITEMID
    			FROM SALES_TEST ST1
    			WHERE ST1.SALESID = S.SALESID
    			ORDER BY ITEMID
    			FOR XML PATH('')
    				,type
    			).value('.', 'nvarchar(max)'), 1, 2, '') AS AllItems
    INTO #SalesItems
    FROM SALES_TEST S
    GROUP BY S.SALESID
    ORDER BY S.SALESID;
    
    SELECT T.SalesID
    	,T1.SalesID
    FROM #SalesItems T
    INNER JOIN #SalesItems T1 ON T.AllItems = T1.AllItems
    	AND T.SalesID < T1.SalesID


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Tuesday, December 24, 2013 7:01 PM
    Moderator
  • The NP-Complete factor hammers query execution exponentially as the sales data volumes grow with each sale item being compared to ALL other sales items. The query below identifies distinct sales and item sets then joins each set to every other. Total items within each set and percentage of items that intersect the comparison set are calculated, to constrain and optimize comparisons, a new column has been added, TRADEDAY, and only sales items brought on the same day are compare with each other.

    IF NOT EXISTS(SELECT name FROM sys.databases WHERE name = 'MSDN1')  
        CREATE DATABASE [MSDN1] 
    GO
      
    USE [MSDN1] 
    GO 
      
    IF OBJECT_ID('SALES_TEST', N'U') IS NOT NULL
        DROP TABLE SALES_TEST; 
      
    CREATE TABLE [dbo].[SALES_TEST]( 
        [SALESID] [nchar](10) NOT NULL, 
        [ITEMID] [nchar](10) NOT NULL, 
        [TRADEDAY] [integer] NOT NULL, 
     CONSTRAINT [PK_SALES_TEST] PRIMARY KEY CLUSTERED  
    ( 
        [SALESID] ASC, 
        [ITEMID] ASC
    ) ON [PRIMARY] 
    ) ON [PRIMARY] 
    GO 
     
    CREATE NONCLUSTERED INDEX [IX_SALES_TEST_TRADEDAY_SALESID] ON [dbo].[SALES_TEST]
    (
    	[TRADEDAY] ASC,
    	[SALESID] ASC
    )
    INCLUDE ([ITEMID])  
    ON [PRIMARY]
    GO
      
    -- TEST DATA 
    DECLARE @x INT = 0, @TRADEDAY INT = 1 
    DECLARE 
    @S01i INT, @S02i INT, @S03i INT, @S04i INT,  
    @S05i INT, @S06i INT, @S07i INT, @S08i INT, @S09i INT
      
    DECLARE 
    @S01 VARCHAR(10), @S02 VARCHAR(10), @S03 VARCHAR(10), @S04 VARCHAR(10), 
    @S05 VARCHAR(10), @S06 VARCHAR(10), @S07 VARCHAR(10), @S08 VARCHAR(10),@S09 VARCHAR(10) 
      
      
    WHILE @TRADEDAY < 50 -- Set to data volume required (32 sales items per day) 
    BEGIN
      
        SELECT  @S01i = 1 + @X, @S02i = 2 + @X, @S03i = 3 + @X, @S04i = 4 + @X, @S05i = 5 + @X, @S06i = 6 + @X, @S07i = 7 + @X, @S08i = 8 + @X, @S09i = 9 + @X 
        SET @S01 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S01i AS VARCHAR(5))))) + CAST(@S01i AS VARCHAR(9))  
        SET @S02 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S02i AS VARCHAR(5))))) + CAST(@S02i AS VARCHAR(9))  
        SET @S03 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S03i AS VARCHAR(5))))) + CAST(@S03i AS VARCHAR(9))  
        SET @S04 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S04i AS VARCHAR(5))))) + CAST(@S04i AS VARCHAR(9))  
        SET @S05 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S05i AS VARCHAR(5))))) + CAST(@S05i AS VARCHAR(9))   
        SET @S06 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S06i AS VARCHAR(5))))) + CAST(@S06i AS VARCHAR(9))  
        SET @S07 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S07i AS VARCHAR(5))))) + CAST(@S07i AS VARCHAR(9))  
        SET @S08 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S08i AS VARCHAR(5))))) + CAST(@S08i AS VARCHAR(9))  
        SET @S09 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S09i AS VARCHAR(5))))) + CAST(@S09i AS VARCHAR(9))  
      
        INSERT INTO [dbo].[SALES_TEST] ([TRADEDAY],[SALESID],[ITEMID]) 
        VALUES 
        (@TRADEDAY,@S01,'BA1'),(@TRADEDAY,@S01,'KM100'),(@TRADEDAY,@S01,'SM1RR'),(@TRADEDAY,@S01,'ZT6'), 
        (@TRADEDAY,@S02,'BA1'),(@TRADEDAY,@S02,'KM100'),(@TRADEDAY,@S02,'SM1RR'),(@TRADEDAY,@S02,'ZT6'), 
        (@TRADEDAY,@S03,'BA1'),(@TRADEDAY,@S03,'KM100'),(@TRADEDAY,@S03,'SM1RR'), 
        (@TRADEDAY,@S04,'BA2'),(@TRADEDAY,@S04,'KM200'),(@TRADEDAY,@S04,'SM1RR'),(@TRADEDAY,@S04,'ZT6'), 
        (@TRADEDAY,@S05,'KM100'),(@TRADEDAY,@S05,'SM1RR'),(@TRADEDAY,@S05,'ZT6'), 
        (@TRADEDAY,@S06,'BA2'),(@TRADEDAY,@S06,'KM200'),(@TRADEDAY,@S06,'SM1RR'),(@TRADEDAY,@S06,'ZT6'), 
        (@TRADEDAY,@S07,'BA2'),(@TRADEDAY,@S07,'KM100'),(@TRADEDAY,@S07,'SM1RR'),(@TRADEDAY,@S07,'ZT6'), 
        (@TRADEDAY,@S08,'BA1'),(@TRADEDAY,@S08,'KM100'),(@TRADEDAY,@S08,'SM1RR'), 
        (@TRADEDAY,@S09,'BA1'),(@TRADEDAY,@S09,'AAA'),(@TRADEDAY,@S09,'SM1RR'); 
      
        SET @x = @x + 10 
        SET @TRADEDAY = @TRADEDAY + 1 
    END
      
    ;WITH Sales AS
    ( 
        SELECT TRADEDAY, SALESID, COUNT(*) AS ItemsCount 
        FROM [dbo].[SALES_TEST] 
        GROUP BY TRADEDAY, SALESID 
    ), 
    Items AS
    ( 
        SELECT ITEMID, COUNT(*) AS Items 
        FROM [dbo].[SALES_TEST] 
        GROUP BY ITEMID 
    ), 
    JoinSets AS
    ( 
        SELECT 
            x1.TRADEDAY, 
            x1.SALESID AS LeftSALESID,  
            x1.ItemsCount  AS LeftSalesItemsCount, 
            x2.SALESID AS RightSALESID,  
            x2.ItemsCount  AS RightSalesItemsCount, 
            x3.ITEMID, 
            (SELECT COUNT(*) FROM [dbo].[SALES_TEST] WHERE SALESID = x1.SALESID AND ITEMID = x3.ITEMID) AS ItemInLeftSale, 
            (SELECT COUNT(*) FROM [dbo].[SALES_TEST] WHERE SALESID = x2.SALESID AND ITEMID = x3.ITEMID) AS ItemInRightSale     
        FROM Sales x1 
        FULL OUTER JOIN Sales x2  
            ON x1.SALESID < x2.SALESID  
        LEFT OUTER JOIN
        ( 
            SELECT ITEMID 
            FROM Items 
        ) x3 
            ON 1 = 1 
        WHERE x1.TRADEDAY = x2.TRADEDAY 
    ), 
    vTable AS
    ( 
        SELECT 
            z.TRADEDAY, 
            z.LeftSALESID, 
            z.RightSALESID, 
            AVG(z.LeftSalesItemsCount) AS LeftSalesItemsCount, 
            AVG(z.RightSalesItemsCount) AS RightSalesItemsCount, 
            SUM(NotInRightCount) AS NotInRightCount,     
            SUM(NotInLeftCount) AS NotInLeftCount 
        FROM JoinSets z 
        LEFT OUTER JOIN
        ( 
            SELECT LeftSALESID, RightSALESID, ITEMID, COUNT(*) AS NotInRightCount 
            FROM JoinSets 
            WHERE ItemInLeftSale = 1 
            AND ItemInRightSale = 0 
            GROUP BY LeftSALESID, RightSALESID, ITEMID 
        ) x4 
            ON x4.LeftSALESID = z.LeftSALESID 
            AND x4.RightSALESID = z.RightSALESID 
            AND x4.ITEMID = z.ITEMID 
      
        LEFT OUTER JOIN
        ( 
            SELECT LeftSALESID, RightSALESID, ITEMID, COUNT(*) AS NotInLeftCount 
            FROM JoinSets 
            WHERE ItemInLeftSale = 0 
            AND ItemInRightSale = 1 
            GROUP BY LeftSALESID, RightSALESID, ITEMID 
        ) x5 
            ON x5.LeftSALESID = z.LeftSALESID 
            AND x5.RightSALESID = z.RightSALESID 
            AND x5.ITEMID = z.ITEMID 
          
        GROUP BY
            z.TRADEDAY, 
            z.LeftSALESID,  
            z.RightSALESID 
    ), 
    Results AS
    ( 
        SELECT 
                a.TRADEDAY, 
                a.LeftSALESID,  
                a.LeftSalesItemsCount,  
                b.PercentofRightInLeft, 
                a.RightSALESID,  
                a.RightSalesItemsCount, 
                b.PercentofLeftInRight 
        FROM vTable a 
        LEFT OUTER JOIN
        ( 
            SELECT 
                LeftSALESID,  
                RightSALESID,  
                ISNULL(CASE NotInLeftCount WHEN NULL THEN 100 ELSE 100 - ((100/RightSalesItemsCount) * NotInLeftCount) END,100) AS PercentofRightInLeft, 
                ISNULL(CASE NotInRightCount WHEN NULL THEN 100 ELSE 100 - ((100/LeftSalesItemsCount) * NotInRightCount) END,100) AS PercentofLeftInRight 
            FROM vTable 
        ) b 
            ON b.LeftSALESID = a.LeftSALESID 
            AND b.RightSALESID = a.RightSALESID 
        WHERE a.LeftSALESID IS NOT NULL AND a.RightSALESID IS NOT NULL
    ) 
      
      
    SELECT 
            TRADEDAY, 
            LeftSALESID,  
            LeftSalesItemsCount,  
            PercentofRightInLeft, 
            RightSALESID,  
            RightSalesItemsCount,  
            PercentofLeftInRight 
    FROM Results 
    -- WHERE (PercentOfRightSaleItemsInLeft + PercentOfLeftSaleItemsInRight) = 200 -- filter exact matches only
    ORDER BY 
        (PercentofRightInLeft + PercentofLeftInRight) DESC, 
        LeftSALESID,  
        RightSALESID

    In the results, SALESID Sxxxx1 & Sxxxx2,  Sxxxx3 & Sxxxx8, Sxxxx4 & Sxxxx6 are 100% matches both sides. 75% of Sxxxx1 items appear in Sxxxx3, 100% of Sxxxx3 items appear in Sxxxx1 and so on.

    To improve the performance further add another new column to the table, 'shop' for example, and only compare shop 1 sales with shop 2 on the same trading day. You need to constraint comparison sets to achieve scale, to compare all sales items against each other, Analysis Services is the way to go.
    Saturday, January 04, 2014 2:25 AM

All replies

  • The first quick solution that comes to mind is to de-normalize that table and compare that way. So, this is one possible solution:

    IF OBJECT_id('TEMPDB..#SalesItems', N'U') IS NOT NULL
    	DROP TABLE #SalesItems;
    
    SELECT SalesID AS SalesID
    	,STUFF((
    			SELECT ', ' + ITEMID
    			FROM SALES_TEST ST1
    			WHERE ST1.SALESID = S.SALESID
    			ORDER BY ITEMID
    			FOR XML PATH('')
    				,type
    			).value('.', 'nvarchar(max)'), 1, 2, '') AS AllItems
    INTO #SalesItems
    FROM SALES_TEST S
    GROUP BY S.SALESID
    ORDER BY S.SALESID;
    
    SELECT T.SalesID
    	,T1.SalesID
    FROM #SalesItems T
    INNER JOIN #SalesItems T1 ON T.AllItems = T1.AllItems
    	AND T.SalesID < T1.SalesID


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Tuesday, December 24, 2013 7:01 PM
    Moderator
  • Thanks Naomi. I will try your solution and see if it works or not.
    Tuesday, December 24, 2013 7:14 PM
  • I got the following results after running your solution.

    SalesID SalesID
    SO1        SO2       
    SO1        SO5       
    SO2        SO5       
    SO4        SO6       
    SO1        SO7       
    SO2        SO7       
    SO5        SO7       

    I'm not quite sure what the results tell me. The correct result should be as per below.

    SalesID
    SO1       
    SO2         
    SO3          
    SO4               
    SO6          
    SO8       


    Tuesday, December 24, 2013 7:31 PM
  • This is an alternative solution, you may want to compare execution plans:

    ;
    
    WITH cte
    AS (
    	SELECT *
    		,COUNT(ItemID) OVER (PARTITION BY SalesID) AS cntItemsPerSale
    	FROM SALES_TEST
    	)
    SELECT cte.SalesID
    	,cte1.SalesID AS SimilarSalesID
    FROM cte
    INNER JOIN cte AS cte1 ON cte.SALESID < cte1.SALESID
    	AND cte.ITEMID = cte1.ITEMID
    WHERE cte.cntItemsPerSale = cte1.cntItemsPerSale
    GROUP BY cte.SALESID
    	,cte1.SALESID
    HAVING COUNT(cte.ItemID) = MIN(cte.cntItemsPerSale)


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Tuesday, December 24, 2013 7:35 PM
    Moderator
  • With your data in both cases I got this result:

    SO1       SO2       
    SO4       SO6       
    SO3       SO8       

    It shows SalesID and a similar to it SalesID. You can always UNPIVOT that data to get them into one column.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Tuesday, December 24, 2013 7:41 PM
    Moderator
  • You're right. I had modified my table and I was not getting the right results. Sorry about that. One more question, if I have let's say 4 or multiple Sales Orders that match all the items, will this solution work to find them as well ?
    Tuesday, December 24, 2013 7:49 PM
  • This is to return your desired result (I used the second solution idea and slightly modified it):

    ;
    
    WITH cte
    AS (
    	SELECT cte.SalesID AS SalesID1
    		,cte1.SalesID AS SimilarSalesID
    	FROM Sales_Test cte
    	INNER JOIN Sales_Test AS cte1 ON cte.SALESID < cte1.SALESID
    		AND cte.ITEMID = cte1.ITEMID
    	GROUP BY cte.SALESID
    		,cte1.SALESID
    	HAVING COUNT(cte.ItemID) = (
    			SELECT count(ItemID)
    			FROM SALES_TEST S
    			WHERE S.SALESID = cte.SALESID
    			)
    		AND COUNT(cte1.ItemID) = (
    			SELECT count(ItemID)
    			FROM SALES_TEST S
    			WHERE S.SALESID = cte1.SALESID
    			)
    	)
    SELECT SalesID
    FROM cte
    UNPIVOT(SalesID FOR ColumnName IN (
    			[SalesID1]
    			,[SimilarSalesID]
    			)) unpvt
    ORDER BY SalesID


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Tuesday, December 24, 2013 7:53 PM
    Moderator
  • Yes, the solution should get all 4 of them. It may produce more combinations but unpivoting data will produce the result you're after.

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Tuesday, December 24, 2013 7:56 PM
    Moderator
  • I think you nailed it. Thanks for the help. Merry Christmas.
    Tuesday, December 24, 2013 7:59 PM
  • Happy Holidays!

    The similar problems were discussed here in the past, you may try searching for them as well.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles


    Tuesday, December 24, 2013 8:02 PM
    Moderator
  • >> I've the following table and the data in it as well as per below. My question now is, I've to find all orders that have similar products placed in the sales table. <<

    Why do you have no key? And no way to ever have a key?? 
    This is one kind of relational division. Google it. You seem to want exact matches, but we have no specs. Sometime this kind of query has a divisor and can accept sales with more than the diversion it in – how many people bought at least peanut butter and jelly?
    The worst way to do this is with XML. The code will not port, performance is bad and it is not easy to maintain. 

    CREATE TABLE Sales
    (sales_id CHAR(3) NOT NULL, 
     item_id VARCHAR(10) NOT NULL, 
     PRIMARY KEY (sales_id, item_id));

    >> So looking at the data entered, I should get output as -  SO1, SO2, SO3, SO8, SO4, SO6 since
    S01, SO2 have same products placed
    SO3, S08 have same products placed
    SO4, SO6 have same products placed <<


    SELECT S1.sales_id, S2.sales_id
      FROM Sales AS S1, Sales AS S2
     WHERE S1.sales_id < S2.sales_id
       AND S1.item_id = S2.item_id
     GROUP BY S1.sales_id, S2.sales_id 
     HAVING COUNT(*)
            = (SELECT COUNT(*)
                 FROM Sales AS S3 
                WHERE S1.sales_id = S3.sales_id)
        AND COUNT(*)
            = (SELECT COUNT(*)
                 FROM Sales AS S4 
                WHERE S2.sales_id = S4.sales_id);


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Tuesday, December 24, 2013 9:18 PM
  • Hi Celko,

    First of all, thanks for looking into the problem and providing your suggestions. I do have PK on the real sales table on SalesID and ItemID but for this exercise, I created a dummy table for illustration purpose only. I see that you suggest not to do with XML for performance reasons and I'll compare the execution plans of both the solutions and go with the best since I'll be running the code for 1000's of orders. 

    Another question, and this one was my first intended one. If I want to find all orders that are >50% similar to other orders, how can I modify your sql to go about it? All the solutions above give me a 100% match so far.


    Thursday, December 26, 2013 1:48 PM
  • To change that solution to be 50%+ is simple:

    SELECT cte.SalesID AS SalesID1
    		,cte1.SalesID AS SimilarSalesID
    	FROM Sales_Test cte
    	INNER JOIN Sales_Test AS cte1 ON cte.SALESID < cte1.SALESID
    		AND cte.ITEMID = cte1.ITEMID
    	GROUP BY cte.SALESID
    		,cte1.SALESID
    	HAVING COUNT(cte.ItemID) >= (
    			SELECT count(ItemID)
    			FROM SALES_TEST S
    			WHERE S.SALESID = cte.SALESID
    			) * 0.5
    		AND COUNT(cte1.ItemID) >= (
    			SELECT count(ItEmID)
    			FROM SALES_TEST S
    			WHERE S.SALESID = cte1.SALESID) * 0.5
    I didn't re-test this idea, but I think it should do it.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles


    Thursday, December 26, 2013 5:10 PM
    Moderator

  • I see that you suggest not to do with XML for performance reasons and I'll compare the execution plans of both the solutions and go with the best since I'll be running the code for 1000's of orders. 

    It is more than performance. Mixed language code is a bitch to maintain and cannot be optimized. Remember 80-90% of the cost of a system is in maintaining it. 

    If I want to find all orders that are >50% similar to other orders, how can I modify your SQL to go about it? All the solutions above give me a 100% match so far.

    What does 50% similar mean? If you have an explicit divisor {peanut butter, jelly}, does that mean A={peanut butter} is one set and B={jelly} is another,  even tho A and B have no intersection?  Division with remainder makes sense and it is esy. 


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Thursday, December 26, 2013 5:50 PM
  • By >50% similar what I meant was, if one Sales Order SO1 has items A,B,C,D,E,F  and another order SO2 has items 

    D,E,F,G,H,I  then order SO2 is 50% similar to order SO1. Not necessarily the percentage for analysis will remain the same though. It could also change to for example , show me all orders that are 85 % similar.

    Friday, December 27, 2013 8:18 PM
  • Did you try my solution? It should be as simple as adding having clause and checking count >=(select count(Item) as TotalItems ..) * 0.85

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Friday, December 27, 2013 8:43 PM
    Moderator
  • Here is a sample test:

    USE AllTests
    
    IF OBJECT_ID('SALES_TEST', N'U') IS NOT NULL
    	DROP TABLE SALES_TEST;
    
    CREATE TABLE [dbo].[SALES_TEST] (
    	[SALESID] [nchar](10) NULL
    	,[ITEMID] [nchar](10) NULL
    	) ON [PRIMARY]
    
    INSERT INTO [dbo].[SALES_TEST] (
    	[SALESID]
    	,[ITEMID]
    	)
    VALUES (
    	'SO1'
    	,'BA1'
    	)
    	,(
    	'SO1'
    	,'KM100'
    	)
    	,(
    	'SO1'
    	,'SM1RR'
    	)
    	,(
    	'SO1'
    	,'ZT6'
    	)
    	,(
    	'SO2'
    	,'BA1'
    	)
    	,(
    	'SO2'
    	,'KM100'
    	)
    	,(
    	'SO2'
    	,'sM1RR'
    	)
    	,(
    	'SO2'
    	,'ZT6'
    	)
    	,(
    	'SO3'
    	,'BA1'
    	)
    	,(
    	'SO3'
    	,'KM100'
    	)
    	,(
    	'SO3'
    	,'SM1RR'
    	)
    	,(
    	'SO4'
    	,'BA2'
    	)
    	,(
    	'SO4'
    	,'KM200'
    	)
    	,(
    	'SO4'
    	,'SM1RR'
    	)
    	,(
    	'SO4'
    	,'ZT6'
    	)
    	,(
    	'SO5'
    	,'KM100'
    	)
    	,(
    	'SO5'
    	,'SM1RR'
    	)
    	,(
    	'SO5'
    	,'ZT6'
    	)
    	,(
    	'SO6'
    	,'BA2'
    	)
    	,(
    	'SO6'
    	,'KM200'
    	)
    	,(
    	'SO6'
    	,'SM1RR'
    	)
    	,(
    	'SO6'
    	,'ZT6'
    	)
    	,(
    	'SO7'
    	,'BA2'
    	)
    	,(
    	'SO7'
    	,'KM100'
    	)
    	,(
    	'SO7'
    	,'SM1RR'
    	)
    	,(
    	'SO7'
    	,'ZT6'
    	)
    	,(
    	'SO8'
    	,'BA1'
    	)
    	,(
    	'SO8'
    	,'KM100'
    	)
    	,(
    	'SO8'
    	,'SM1RR'
    	)
    	,(
    	'SO9'
    	,'BA1'
    	)
    	,(
    	'SO9'
    	,'AAA'
    	)
    	,(
    	'SO9'
    	,'SM1RR'
    	)
    
    /*
    IF OBJECT_id('TEMPDB..#SalesItems', N'U') IS NOT NULL
    	DROP TABLE #SalesItems;
    
    SELECT SalesID AS SalesID
    	,STUFF((
    			SELECT ', ' + ITEMID
    			FROM SALES_TEST ST1
    			WHERE ST1.SALESID = S.SALESID
    			ORDER BY ITEMID
    			FOR XML PATH('')
    				,type
    			).value('.', 'nvarchar(max)'), 1, 2, '') AS AllItems
    INTO #SalesItems
    FROM SALES_TEST S
    GROUP BY S.SALESID
    ORDER BY S.SALESID;
    
    SELECT T.SalesID
    	,T1.SalesID
    FROM #SalesItems T
    INNER JOIN #SalesItems T1 ON T.AllItems = T1.AllItems
    	AND T.SalesID < T1.SalesID
    
    ;
    
    WITH cte
    AS (
    	SELECT *
    		,COUNT(ItemID) OVER (PARTITION BY SalesID) AS cntItemsPerSale
    	FROM SALES_TEST
    	)
    SELECT cte.SalesID
    	,cte1.SalesID AS SimilarSalesID
    FROM cte
    INNER JOIN cte AS cte1 ON cte.SALESID < cte1.SALESID
    	AND cte.ITEMID = cte1.ITEMID
    WHERE cte.cntItemsPerSale = cte1.cntItemsPerSale
    GROUP BY cte.SALESID
    	,cte1.SALESID
    HAVING COUNT(cte.ItemID) = MIN(cte.cntItemsPerSale)
    
    
    ;
    */
    DECLARE @Percentage DECIMAL(10, 2) = 60;
    
    --WITH cte
    --AS (
    SELECT cte.SalesID AS SalesID1
    	,cte1.SalesID AS SimilarSalesID
    FROM Sales_Test cte
    INNER JOIN Sales_Test AS cte1 ON cte.SALESID < cte1.SALESID
    	AND cte.ITEMID = cte1.ITEMID
    GROUP BY cte.SALESID
    	,cte1.SALESID
    HAVING COUNT(cte.ItemID) >= (
    		SELECT count(ItemID)
    		FROM SALES_TEST S
    		WHERE S.SALESID = cte.SALESID
    		) * @Percentage * 0.01
    	AND COUNT(cte1.ItemID) >= (
    		SELECT count(ItemID)
    		FROM SALES_TEST S
    		WHERE S.SALESID = cte1.SALESID
    		) * @Percentage * 0.01
    ORDER BY cte.SalesID;
    	--)
    	--SELECT SalesID
    	--FROM cte
    	--UNPIVOT(SalesID FOR ColumnName IN (
    	--			[SalesID1]
    	--			,[SimilarSalesID]
    	--			)) unpvt
    	--ORDER BY SalesID


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Friday, December 27, 2013 8:56 PM
    Moderator
  • By >50% similar what I meant was, if one Sales Order SO1 has items A,B,C,D,E,F  and another order SO2 has items 

    D,E,F,G,H,I  then order SO2 is 50% similar to order SO1. Not necessarily the percentage for analysis will remain the same though. It could also change to for example , show me all orders that are 85 % similar.

    Nope, this is an NP-Complete problem. Basically, every pair has to be tested, so it is (n!) tests are required. 

    Now if you said {D,E, F} is the common subset life is easy. We can use INTERSECT and be done.


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Saturday, December 28, 2013 4:18 AM
  • I posted an article for this problem

    http://social.technet.microsoft.com/wiki/contents/articles/22165.t-sql-relational-division.aspx

    Please take a look and make suggestions.

    Thanks in advance.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Sunday, December 29, 2013 9:26 PM
    Moderator
  • I tested your solution and I'm getting the intended results based on the "Percentage" that I declare. I will test the solution on our DEV environment first to see the performance of the query before moving it to LIVE. I also saw your article on TN and I believe it will help others as well who are looking for such solutions.
    Monday, December 30, 2013 2:38 PM
  • "Nope, this is an NP-Complete problem. Basically, every pair has to be tested, so it is (n!) tests are required"

    An NP-Complete problem? I will have to search what it is but what I understand is, the bigger the dataset, the longer it will take to test every pair.


    Monday, December 30, 2013 3:03 PM
  • I had some emails exchange with Peter Larsson in regards to this problem. He is questioning my solution and also proposes another which is supposed to be more performant which I am not fully understanding.

    I tried another version of exact match query

    ;with cteCnt AS (
    	SELECT SalesOrderID
    		,COUNT(ProductID) AS ProductsCount
    	FROM Sales.SalesOrderDetail
    	GROUP BY SalesOrderID
    	)
    SELECT O1.SalesOrderId AS OrderID
    	,O2.SalesOrderID AS SimilarOrderID
    FROM Sales.SalesOrderDetail O1
    INNER JOIN Sales.SalesOrderDetail O2 ON O1.ProductID = O2.ProductID
    	AND O1.SalesOrderID < O2.SalesOrderID
    INNER JOIN cteCnt cnt1 ON O1.SalesOrderID = cnt1.SalesOrderID
    INNER JOIN cteCnt cnt2 ON O2.SalesOrderID = cnt2.SalesOrderID
    
    GROUP BY O1.SalesOrderID
    	,O2.SalesOrderID
    HAVING COUNT(O1.ProductID) = MIN(cnt1.ProductsCount)
    AND COUNT(O2.ProductID) = MIN(cnt2.ProductsCount)	
    ORDER BY OrderID
    	,SimilarOrderID;

    which I believe is more in accordance to his latest article and this Connect item https://connect.microsoft.com/SQLServer/feedback/details/670531/move-t-sql-language-closer-to-completion-with-a-divide-by-operator, but for AW database the performance of this query is worse than other two. The AW is not really a good test case since we have many orders with a single product, but anyway, I was trying to find a good solution.

    I hope Peter will jump himself into this thread with his ideas.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Monday, December 30, 2013 7:33 PM
    Moderator
  • I added Peter's solution to that article. I tried running all 3 queries at once and while I see Peter's query giving 0% in the actual query plan, the elapsed time is lowest for the XML PATH solution on the AW database SQL Server 2012 SP1.

    I am wondering if you can run similar test for AW as well as for your own tables and tell us which query performs the best and what results you'll get.

    I am also curious if anyone can run these queries in other SQL Server versions.

    Thanks in advance.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Monday, December 30, 2013 10:40 PM
    Moderator
  • Peter and I have a question in regards to what to consider similar orders. Say, we want 60% similarity:

    The problem arises when Order #1 have items A, B, C, D and E, and Order #2 have items A, B, C, Q, R, S, T, U, V, X, Y and Z. They both have 3 items in common; A, B and C. That means Order #1 have 60% of its item in common with Order #2, but Order #2 only have 25% it its items in common with Order #1.

     

    The question is then if Orders #1 and #2 are considered similar to each other?

    --------------------------

    In my opinion, both orders should have at least 60% of orders matching to be considered similar. This is the problem I was trying to solve. However, we may try to solve the first problem instead.

    Let me know what definition you're using.


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles


    Friday, January 03, 2014 6:12 PM
    Moderator
  • I believe the one that I'm looking for this particular problem is that both orders should have at least 60% of orders matching.
    Friday, January 03, 2014 8:58 PM
  • There are two different solutions posted in my article for that problem.

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Friday, January 03, 2014 9:24 PM
    Moderator
  • Everything breaks down to what your definition of similar is.

    The problem arises when Order #1 have items A, B, C, D and E, and Order #2 have items A, B, C, Q, R, S, T, U, V, X, Y and Z. They both have 3 items in common; A, B and C. That means Order #1 have 60% of its item in common with Order #2, but Order #2 only have 25% it its items in common with Order #1.


    N 56°04'39.26"
    E 12°55'05.63"

    Friday, January 03, 2014 9:29 PM
  • An NP-Complete problem? I will have to search what it is but what I understand is, the bigger the dataset, the longer it will take to test every pair.


    Actually, "longer" means "impossible for any good sized set"; look it up on Wikipedia. And my math is off; If I have 100 shopping baskets, I have 4,950 pairs to examine. 1000 baskets = 499,500 pairs; 10000 baskets = 49,995,000 pairs, etc.

    This is worse because of the fuzzy matching criteria. 


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Saturday, January 04, 2014 12:58 AM
  • The NP-Complete factor hammers query execution exponentially as the sales data volumes grow with each sale item being compared to ALL other sales items. The query below identifies distinct sales and item sets then joins each set to every other. Total items within each set and percentage of items that intersect the comparison set are calculated, to constrain and optimize comparisons, a new column has been added, TRADEDAY, and only sales items brought on the same day are compare with each other.

    IF NOT EXISTS(SELECT name FROM sys.databases WHERE name = 'MSDN1')  
        CREATE DATABASE [MSDN1] 
    GO
      
    USE [MSDN1] 
    GO 
      
    IF OBJECT_ID('SALES_TEST', N'U') IS NOT NULL
        DROP TABLE SALES_TEST; 
      
    CREATE TABLE [dbo].[SALES_TEST]( 
        [SALESID] [nchar](10) NOT NULL, 
        [ITEMID] [nchar](10) NOT NULL, 
        [TRADEDAY] [integer] NOT NULL, 
     CONSTRAINT [PK_SALES_TEST] PRIMARY KEY CLUSTERED  
    ( 
        [SALESID] ASC, 
        [ITEMID] ASC
    ) ON [PRIMARY] 
    ) ON [PRIMARY] 
    GO 
     
    CREATE NONCLUSTERED INDEX [IX_SALES_TEST_TRADEDAY_SALESID] ON [dbo].[SALES_TEST]
    (
    	[TRADEDAY] ASC,
    	[SALESID] ASC
    )
    INCLUDE ([ITEMID])  
    ON [PRIMARY]
    GO
      
    -- TEST DATA 
    DECLARE @x INT = 0, @TRADEDAY INT = 1 
    DECLARE 
    @S01i INT, @S02i INT, @S03i INT, @S04i INT,  
    @S05i INT, @S06i INT, @S07i INT, @S08i INT, @S09i INT
      
    DECLARE 
    @S01 VARCHAR(10), @S02 VARCHAR(10), @S03 VARCHAR(10), @S04 VARCHAR(10), 
    @S05 VARCHAR(10), @S06 VARCHAR(10), @S07 VARCHAR(10), @S08 VARCHAR(10),@S09 VARCHAR(10) 
      
      
    WHILE @TRADEDAY < 50 -- Set to data volume required (32 sales items per day) 
    BEGIN
      
        SELECT  @S01i = 1 + @X, @S02i = 2 + @X, @S03i = 3 + @X, @S04i = 4 + @X, @S05i = 5 + @X, @S06i = 6 + @X, @S07i = 7 + @X, @S08i = 8 + @X, @S09i = 9 + @X 
        SET @S01 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S01i AS VARCHAR(5))))) + CAST(@S01i AS VARCHAR(9))  
        SET @S02 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S02i AS VARCHAR(5))))) + CAST(@S02i AS VARCHAR(9))  
        SET @S03 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S03i AS VARCHAR(5))))) + CAST(@S03i AS VARCHAR(9))  
        SET @S04 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S04i AS VARCHAR(5))))) + CAST(@S04i AS VARCHAR(9))  
        SET @S05 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S05i AS VARCHAR(5))))) + CAST(@S05i AS VARCHAR(9))   
        SET @S06 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S06i AS VARCHAR(5))))) + CAST(@S06i AS VARCHAR(9))  
        SET @S07 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S07i AS VARCHAR(5))))) + CAST(@S07i AS VARCHAR(9))  
        SET @S08 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S08i AS VARCHAR(5))))) + CAST(@S08i AS VARCHAR(9))  
        SET @S09 = 'S' + REPLICATE('0',(5 - LEN(CAST(@S09i AS VARCHAR(5))))) + CAST(@S09i AS VARCHAR(9))  
      
        INSERT INTO [dbo].[SALES_TEST] ([TRADEDAY],[SALESID],[ITEMID]) 
        VALUES 
        (@TRADEDAY,@S01,'BA1'),(@TRADEDAY,@S01,'KM100'),(@TRADEDAY,@S01,'SM1RR'),(@TRADEDAY,@S01,'ZT6'), 
        (@TRADEDAY,@S02,'BA1'),(@TRADEDAY,@S02,'KM100'),(@TRADEDAY,@S02,'SM1RR'),(@TRADEDAY,@S02,'ZT6'), 
        (@TRADEDAY,@S03,'BA1'),(@TRADEDAY,@S03,'KM100'),(@TRADEDAY,@S03,'SM1RR'), 
        (@TRADEDAY,@S04,'BA2'),(@TRADEDAY,@S04,'KM200'),(@TRADEDAY,@S04,'SM1RR'),(@TRADEDAY,@S04,'ZT6'), 
        (@TRADEDAY,@S05,'KM100'),(@TRADEDAY,@S05,'SM1RR'),(@TRADEDAY,@S05,'ZT6'), 
        (@TRADEDAY,@S06,'BA2'),(@TRADEDAY,@S06,'KM200'),(@TRADEDAY,@S06,'SM1RR'),(@TRADEDAY,@S06,'ZT6'), 
        (@TRADEDAY,@S07,'BA2'),(@TRADEDAY,@S07,'KM100'),(@TRADEDAY,@S07,'SM1RR'),(@TRADEDAY,@S07,'ZT6'), 
        (@TRADEDAY,@S08,'BA1'),(@TRADEDAY,@S08,'KM100'),(@TRADEDAY,@S08,'SM1RR'), 
        (@TRADEDAY,@S09,'BA1'),(@TRADEDAY,@S09,'AAA'),(@TRADEDAY,@S09,'SM1RR'); 
      
        SET @x = @x + 10 
        SET @TRADEDAY = @TRADEDAY + 1 
    END
      
    ;WITH Sales AS
    ( 
        SELECT TRADEDAY, SALESID, COUNT(*) AS ItemsCount 
        FROM [dbo].[SALES_TEST] 
        GROUP BY TRADEDAY, SALESID 
    ), 
    Items AS
    ( 
        SELECT ITEMID, COUNT(*) AS Items 
        FROM [dbo].[SALES_TEST] 
        GROUP BY ITEMID 
    ), 
    JoinSets AS
    ( 
        SELECT 
            x1.TRADEDAY, 
            x1.SALESID AS LeftSALESID,  
            x1.ItemsCount  AS LeftSalesItemsCount, 
            x2.SALESID AS RightSALESID,  
            x2.ItemsCount  AS RightSalesItemsCount, 
            x3.ITEMID, 
            (SELECT COUNT(*) FROM [dbo].[SALES_TEST] WHERE SALESID = x1.SALESID AND ITEMID = x3.ITEMID) AS ItemInLeftSale, 
            (SELECT COUNT(*) FROM [dbo].[SALES_TEST] WHERE SALESID = x2.SALESID AND ITEMID = x3.ITEMID) AS ItemInRightSale     
        FROM Sales x1 
        FULL OUTER JOIN Sales x2  
            ON x1.SALESID < x2.SALESID  
        LEFT OUTER JOIN
        ( 
            SELECT ITEMID 
            FROM Items 
        ) x3 
            ON 1 = 1 
        WHERE x1.TRADEDAY = x2.TRADEDAY 
    ), 
    vTable AS
    ( 
        SELECT 
            z.TRADEDAY, 
            z.LeftSALESID, 
            z.RightSALESID, 
            AVG(z.LeftSalesItemsCount) AS LeftSalesItemsCount, 
            AVG(z.RightSalesItemsCount) AS RightSalesItemsCount, 
            SUM(NotInRightCount) AS NotInRightCount,     
            SUM(NotInLeftCount) AS NotInLeftCount 
        FROM JoinSets z 
        LEFT OUTER JOIN
        ( 
            SELECT LeftSALESID, RightSALESID, ITEMID, COUNT(*) AS NotInRightCount 
            FROM JoinSets 
            WHERE ItemInLeftSale = 1 
            AND ItemInRightSale = 0 
            GROUP BY LeftSALESID, RightSALESID, ITEMID 
        ) x4 
            ON x4.LeftSALESID = z.LeftSALESID 
            AND x4.RightSALESID = z.RightSALESID 
            AND x4.ITEMID = z.ITEMID 
      
        LEFT OUTER JOIN
        ( 
            SELECT LeftSALESID, RightSALESID, ITEMID, COUNT(*) AS NotInLeftCount 
            FROM JoinSets 
            WHERE ItemInLeftSale = 0 
            AND ItemInRightSale = 1 
            GROUP BY LeftSALESID, RightSALESID, ITEMID 
        ) x5 
            ON x5.LeftSALESID = z.LeftSALESID 
            AND x5.RightSALESID = z.RightSALESID 
            AND x5.ITEMID = z.ITEMID 
          
        GROUP BY
            z.TRADEDAY, 
            z.LeftSALESID,  
            z.RightSALESID 
    ), 
    Results AS
    ( 
        SELECT 
                a.TRADEDAY, 
                a.LeftSALESID,  
                a.LeftSalesItemsCount,  
                b.PercentofRightInLeft, 
                a.RightSALESID,  
                a.RightSalesItemsCount, 
                b.PercentofLeftInRight 
        FROM vTable a 
        LEFT OUTER JOIN
        ( 
            SELECT 
                LeftSALESID,  
                RightSALESID,  
                ISNULL(CASE NotInLeftCount WHEN NULL THEN 100 ELSE 100 - ((100/RightSalesItemsCount) * NotInLeftCount) END,100) AS PercentofRightInLeft, 
                ISNULL(CASE NotInRightCount WHEN NULL THEN 100 ELSE 100 - ((100/LeftSalesItemsCount) * NotInRightCount) END,100) AS PercentofLeftInRight 
            FROM vTable 
        ) b 
            ON b.LeftSALESID = a.LeftSALESID 
            AND b.RightSALESID = a.RightSALESID 
        WHERE a.LeftSALESID IS NOT NULL AND a.RightSALESID IS NOT NULL
    ) 
      
      
    SELECT 
            TRADEDAY, 
            LeftSALESID,  
            LeftSalesItemsCount,  
            PercentofRightInLeft, 
            RightSALESID,  
            RightSalesItemsCount,  
            PercentofLeftInRight 
    FROM Results 
    -- WHERE (PercentOfRightSaleItemsInLeft + PercentOfLeftSaleItemsInRight) = 200 -- filter exact matches only
    ORDER BY 
        (PercentofRightInLeft + PercentofLeftInRight) DESC, 
        LeftSALESID,  
        RightSALESID

    In the results, SALESID Sxxxx1 & Sxxxx2,  Sxxxx3 & Sxxxx8, Sxxxx4 & Sxxxx6 are 100% matches both sides. 75% of Sxxxx1 items appear in Sxxxx3, 100% of Sxxxx3 items appear in Sxxxx1 and so on.

    To improve the performance further add another new column to the table, 'shop' for example, and only compare shop 1 sales with shop 2 on the same trading day. You need to constraint comparison sets to achieve scale, to compare all sales items against each other, Analysis Services is the way to go.
    Saturday, January 04, 2014 2:25 AM