COUNT of the number of duplicates from a table with a billion rows

  • Question

  • Hi gurus,

    I need to know how many duplicate rows there are in a table which has a billion rows.

    This table has 52 columns, and I want to check whether there are any repeated rows with the same values in all 52 columns.

    I have tried 

    1)
    SELECT COUNT(*) FROM (
        SELECT COLUMN1, COLUMN2, COLUMN3, ... , COLUMN52, COUNT(*) AS cnt
        FROM TABLE
        GROUP BY COLUMN1, COLUMN2, COLUMN3, ... , COLUMN52
        HAVING COUNT(*) > 1
    ) AS X
    
    2)
    
    I have also tried using ROW_NUMBER().
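
    Roughly along these lines (a sketch with placeholder column and table names; the real query lists all 52 columns):

    SELECT COUNT(*) AS duplicate_rows
    FROM (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY COLUMN1, COLUMN2, COLUMN3 /* ... COLUMN52 */
                   ORDER BY (SELECT NULL)) AS rn
        FROM dbo.BigTable        -- placeholder name for the billion-row table
    ) AS d
    WHERE d.rn > 1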
    
    

    Could not allocate space for object 'dbo.SORT temporary run storage:  141881590939648' in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup


    Both attempts take forever and fail with the tempdb error above. Please help.

    Kind Regards,

    Keerthi.




    • Edited by chinna738 Tuesday, October 22, 2019 10:37 PM
    Friday, October 18, 2019 3:57 PM

All replies

  • Hi chinna738,

    The error message is about tempdb space usage. It indicates that the tempdb data and/or log files have run out of space.

    If you are sure that you have enough disk space for the tempdb data and log files, then the problem is probably that autogrowth is disabled or that your tempdb files have reached their size limit.

    Or the autogrowth settings may not be suitable (see this earlier post for information about that). If it is not autogrowth, you will have to either delete other files from the disk that tempdb is on, or move tempdb to a bigger disk.

    Please check these settings.
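
    For example, a query like this against sys.database_files, run in the tempdb database, shows the current size, the size limit and the autogrowth setting of each tempdb file (sizes are stored in 8 KB pages, converted to MB below):

    USE tempdb;
    GO
    SELECT  name,
            physical_name,
            size * 8 / 1024 AS size_mb,
            CASE max_size
                 WHEN -1 THEN 'unlimited'
                 ELSE CAST(max_size * 8 / 1024 AS varchar(20)) + ' MB'
            END AS max_size,
            CASE WHEN is_percent_growth = 1
                 THEN CAST(growth AS varchar(10)) + ' %'
                 ELSE CAST(growth * 8 / 1024 AS varchar(20)) + ' MB'
            END AS growth
    FROM    sys.database_files;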

     

    Best Regards,

    Rachel



    Monday, October 21, 2019 6:39 AM
  • Try creating an indexed view; see the example:

    CREATE  VIEW dbo.IV WITH SCHEMABINDING AS

    SELECT  col1, col2, COUNT_BIG(*) AS row_count
    FROM    dbo.HugeTable
    GROUP   BY
            col1,
            col2;

    GO

    -- Index it

    CREATE  UNIQUE CLUSTERED INDEX cuq ON dbo.IV (col1, col2);

    CREATE  NONCLUSTERED INDEX nc1 ON dbo.IV (row_count);

    GO

    -- NOEXPAND hint required for non-Enterprise SKUs

    SELECT  IV.col1,
            IV.col2
    FROM    dbo.IV AS IV WITH (NOEXPAND)
    WHERE   row_count > 1;
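
    If what you actually need is the total number of surplus (duplicate) rows rather than the list of duplicated key values, the same view can be aggregated; a sketch reusing the view above:

    SELECT  SUM(row_count - 1) AS duplicate_rows
    FROM    dbo.IV WITH (NOEXPAND)
    WHERE   row_count > 1;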


    Best Regards, Uri Dimant, SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    Tuesday, October 22, 2019 4:33 AM
    Moderator
  • Thanks Uri

    I have 52 columns; to create a unique index I cannot have more than 16 key columns.

    The index 'cuq' on table 'dbo.IV' has 52 columns in the key list. The maximum limit for index key column list is 16.

    Thanks,

    Chinna.

    Tuesday, October 22, 2019 10:36 PM
  • Well, then I think you need to run the SELECT in chunks:

    SELECT * FROM TableWithDups
    WHERE EXISTS
          (SELECT *
           FROM   TableWithDups AS TWD
           WHERE  TWD.row_id <> TableWithDups.row_id
             AND  TWD.col1 = TableWithDups.col1
             AND  TWD.col2 = TableWithDups.col2
             AND  ...)
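
    One way to chunk it, as a sketch, assuming row_id is a sequential key on TableWithDups and using an arbitrary chunk size, is to scan one range of row_id values at a time:

    DECLARE @start bigint = 1,
            @chunk bigint = 10000000,          -- rows per chunk (arbitrary)
            @max   bigint;
    SELECT  @max = MAX(row_id) FROM TableWithDups;

    WHILE @start <= @max
    BEGIN
        SELECT  T.*
        FROM    TableWithDups AS T
        WHERE   T.row_id BETWEEN @start AND @start + @chunk - 1
          AND   EXISTS (SELECT *
                        FROM   TableWithDups AS TWD
                        WHERE  TWD.row_id <> T.row_id
                          AND  TWD.col1 = T.col1
                          AND  TWD.col2 = T.col2
                          /* ... remaining columns ... */);

        SET @start += @chunk;
    END;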


    Best Regards,Uri Dimant SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Consulting: Large scale of database and data cleansing
    Remote DBA Services: Improves MS SQL Database Performance
    SQL Server Integration Services: Business Intelligence

    Wednesday, October 23, 2019 4:05 AM
    Moderator
  • Hi Uri,

    I used the following:

    SELECT -COUNT_BIG(*) AS Dup INTO #T
    FROM (SELECT DISTINCT Table.* FROM Table) AS X

    INSERT INTO #T
    SELECT COUNT_BIG(*) FROM Table

    SELECT SUM(Dup) FROM #T

    DROP TABLE #T

    It took a few hours to run, as this table has billions of records.

    Kind Regards,

    Chinna.



    • Edited by chinna738 Thursday, October 24, 2019 10:43 AM
    Thursday, October 24, 2019 10:42 AM
  • Per your description, you insert the whole table into a temp table. In your script I could not find a WHERE clause. I'm afraid that even if you create an index, the efficiency won't change much.

    Best Regards,

    Rachel 



    Friday, October 25, 2019 6:58 AM
  • Hi Rachel, I am only inserting unique rows, using "DISTINCT *".

    Kind Regards,

    Chinna.

    Friday, October 25, 2019 10:05 AM
  • >>Insert into #T
    >>SELECT COUNT_BIG(*) FROM Table

    >>>SELECT SUM(DUP) FROM #T

    >>>DROP TABLE #T

    What is the logic behind the above script? You insert a single number into a temporary table and then issue SUM()?

    For example, this returns 258 on my machine:

    SELECT COUNT_BIG(*) FROM sys.objects

    How do you identify the dups?


    Best Regards, Uri Dimant, SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    Sunday, October 27, 2019 5:15 AM
    Moderator
  • Hi Uri,

    Please read the comment above each step.

    Insert the negative of the number of distinct rows into a temp table:

    SELECT -COUNT_BIG(*) AS Dup INTO #T
    FROM (SELECT DISTINCT Table.* FROM Table) AS X

    Insert the total number of rows into the same temp table:

    INSERT INTO #T
    SELECT COUNT_BIG(*) FROM Table

    Subtract the number of distinct rows from the total number of rows to get the number of duplicate rows:

    SELECT SUM(Dup) FROM #T
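
    As a tiny self-contained illustration of the same arithmetic (using a throwaway #Demo table, not the real one): 5 rows of which 3 are distinct gives SUM(Dup) = 2 duplicate rows.

    CREATE TABLE #Demo (col1 int, col2 int);
    INSERT INTO #Demo VALUES (1, 1), (1, 1), (2, 2), (2, 2), (3, 3);  -- 5 rows, 3 distinct

    SELECT -COUNT_BIG(*) AS Dup INTO #T2
    FROM (SELECT DISTINCT * FROM #Demo) AS X;      -- inserts -3

    INSERT INTO #T2
    SELECT COUNT_BIG(*) FROM #Demo;                -- inserts 5

    SELECT SUM(Dup) AS duplicate_rows FROM #T2;    -- returns 2

    DROP TABLE #T2;
    DROP TABLE #Demo;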
    
    
    

    Kind Regards,

    Chinna.

    Monday, October 28, 2019 2:15 PM
  • Sorry for my poor understanding.

    Could you please share your table structure (CREATE TABLE …) and some sample data (INSERT INTO …) along with your expected result, so that we can get the right direction and run some tests?

     

    Best Regards,

    Rachel



    Tuesday, October 29, 2019 6:42 AM