none
Duplicate Records for All Fields but One

    Question

  • All

    How would I go about deleting rows from a table which are complete dups except for one field (one row will have a value for FileName and the other will be NULL in FileName - otherwise they are dups)?  I think the below code is correct, but due to spaces, nulls, and probably some other things I haven't thought of, not all joins seem to be working.  How to account for this?  I have to join about 33 fields, all of which are of the nvarchar(255) data type.

    delete from b
    from
    
    (select *
     from table1
     where FileName is not null) a
    
    inner join 
    
    (select 
     from table1
     where FileName is null) b
    
    on 
    
    isnull(a.field1,0) = isnull(b.field1,0)
    and isnull(a.field2,0) = isnull(b.field2,0)
    etc..
    
    --joining on all fields except FileName



    Bonediggler

    Monday, August 26, 2013 6:52 PM

Answers

  • IF EXISTS(SELECT * FROM tempdb.Information_Schema.TABLES WHERE Table_Name LIKE '#Temp%')
    DROP TABLE #temp3
    CREATE TABLE #temp ([Id] int, [Name] varchar(50), [Age] int, [Sex] bit default 1)
    Go
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] )
    Values(1,'James',25,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] )
    Values(1,'James',25,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(1,'James',25,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(2,'Lisa',24,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(2,'Lisa',24,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(2,'Lisa',24,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(3,'Mirsa',23,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(3,'Mirsa',23,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(3,'Mirsa',23,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(4,'John',26,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(5,'Abraham',28,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(6,'Lincoln',30,default)
     
     
    DELETE T FROM
    (SELECT Row_Number() Over(Partition BY [ID],[Name],[Age],[Sex] ORDER BY [ID]) AS RowNumber,* FROM #Temp)T
    WHERE T.RowNumber > 1
    SELECT * FROM #temp
     
    Above is the example to delete duplicate rows, just need to add in column in Over(Partition BY [ID],[Name],[Age],[Sex] ORDER BY [ID]) on which need to check when going to delete duplicacy.
    Tuesday, August 27, 2013 8:28 PM
  • Try

    DENSE_RANK OVER (partition by all other columns except FileName ORDER BY case when FileName IS NULL then 2 else 1 end) as Row

    delete from Dups where Row > 1 or FileName IS NULL (in case you need to always delete rows with FileName null even if no other rows exist).


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Tuesday, August 27, 2013 8:21 PM
    Moderator

All replies

  • Monday, August 26, 2013 7:14 PM
    Moderator
  • Solution:

    Narrow down uniquely identifying combination of fields and join only on those (they should have been in there anyway as the primary key).



    Bonediggler

    Monday, August 26, 2013 7:36 PM
  • Why do you need to JOIN the information? If the only problem is to delete bad row, then normally you would run this query

    ;with cte as (select *, row_number() over (partition by field1, .... order by FileName) as Row, count(*) over (partition by field1, .., fieldN) as cntDups

    from myTable)

    select * from cte where cntDups > 1 -- this is to examine dups first

    To get rid of fileName = '' you change select * from cte where cntDups > 1 to 

    delete from cte where Row > 1


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Monday, August 26, 2013 7:38 PM
    Moderator
  • http://dimantdatabasesolutions.blogspot.co.il/2007/02/dealing-with-duplicates.html

    Best Regards,Uri Dimant SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Blog: Large scale of database and data cleansing
    Remote DBA Services: Improves MS SQL Database Performance

    Tuesday, August 27, 2013 6:46 AM
    Answerer
  • Hi Naomi This isn't quite working but I think it is close. If I throw FileName into the PARTITION BY clause, every row is distinct (nothing is returned) If I omit FileName from the PARTITION BY clause, it shows duplicate rows where every value is the same except for FileName - when FileName is populated with a value besides NULL (for my purposes, these are not duplicate rows). It does not seem to consider rows with NULL FileName. I need to delete only those rows that have NULL as FileName, whereas every other field matches another row that does not have NULL as FileName. Any ideas?

    Bonediggler

    Tuesday, August 27, 2013 6:39 PM
  • Try

    DENSE_RANK OVER (partition by all other columns except FileName ORDER BY case when FileName IS NULL then 2 else 1 end) as Row

    delete from Dups where Row > 1 or FileName IS NULL (in case you need to always delete rows with FileName null even if no other rows exist).


    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    Tuesday, August 27, 2013 8:21 PM
    Moderator
  • IF EXISTS(SELECT * FROM tempdb.Information_Schema.TABLES WHERE Table_Name LIKE '#Temp%')
    DROP TABLE #temp3
    CREATE TABLE #temp ([Id] int, [Name] varchar(50), [Age] int, [Sex] bit default 1)
    Go
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] )
    Values(1,'James',25,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] )
    Values(1,'James',25,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(1,'James',25,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(2,'Lisa',24,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(2,'Lisa',24,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(2,'Lisa',24,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(3,'Mirsa',23,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(3,'Mirsa',23,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(3,'Mirsa',23,0)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(4,'John',26,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(5,'Abraham',28,default)
    INSERT INTO #temp ([Id] , [Name] , [Age] , [Sex] ) Values(6,'Lincoln',30,default)
     
     
    DELETE T FROM
    (SELECT Row_Number() Over(Partition BY [ID],[Name],[Age],[Sex] ORDER BY [ID]) AS RowNumber,* FROM #Temp)T
    WHERE T.RowNumber > 1
    SELECT * FROM #temp
     
    Above is the example to delete duplicate rows, just need to add in column in Over(Partition BY [ID],[Name],[Age],[Sex] ORDER BY [ID]) on which need to check when going to delete duplicacy.
    Tuesday, August 27, 2013 8:28 PM