Marking duplicates in a table

Answered Marking duplicates in a table

  • Friday, January 18, 2013 2:55 PM
     
      Has Code

    I have a table where there are multiple duplicates in one of the columns.  My need is to find those duplicate values and mark all but one of them as a duplicate.   This should be fairly simple, but most of the solutions I've found have to deal with deleting or just removing the dupes in a SELECT.

    My output needs to resemble:

    findingkey         dupefinding
    -----                  ----
    2.38116            0
    2.38116            1
    2.38116            1
    25.45022           0
    25.45022           1
    25.45022           1

    I've written the following query, but it obviously marks ALL the duplicate records.  Have been trying to come up with a solution that doesn't mark the first occurrence as s duplicate.

    ;WITH cte_dupes AS (
        SELECT findingkey, COUNT(*) totalCount
        FROM excelDataRepository
        WHERE dataSetID=1
        GROUP BY findingKey
        HAVING COUNT(*) >1
        )
    UPDATE excelDataRepository
    SET dupeFinding = 1
    FROM excelDataRepository as ee
    INNER JOIN cte_dupes as dupes
    ON ee.findingKey = dupes.findingKey
    WHERE dupes.totalCount > 1
    Would appreciate the help once again.  Thanks.






    • Edited by ms960 Friday, January 18, 2013 2:56 PM
    •  

All Replies

  • Friday, January 18, 2013 3:23 PM
     
     Answered Has Code

    ;WITH cte_dupes AS (
        SELECT findingkey, dupefinding, Row_Number() Over (Partition By findingkey Order By findingkey) As rn
        FROM excelDataRepository
        WHERE dataSetID=1
        )
    UPDATE cte_dupes
    SET dupeFinding = 1
    WHERE rn > 1

    That will mark all but one of each findingkey.  Note that with the above, you don't know which of the duplicates with be marked.  If you have an algorithm for which of the duplicates should be the one selected to not be marked, just change the order by in the Over clause so that the one you don't want marked comes first.

    As with any update, carefully test and be sure to have a good backup before running against a production database.

    Tom

    • Marked As Answer by ms960 Friday, January 18, 2013 3:30 PM
    •  
  • Friday, January 18, 2013 3:31 PM
     
      Has Code

    ;WITH cte_dupes AS (
        SELECT findingkey, dupefinding, Row_Number() Over (Partition By findingkey Order By findingkey) As rn
        FROM excelDataRepository
        WHERE dataSetID=1
        )
    UPDATE cte_dupes
    SET dupeFinding = 1
    WHERE rn > 1

    That will mark all but one of each findingkey.  Note that with the above, you don't know which of the duplicates with be marked.  If you have an algorithm for which of the duplicates should be the one selected to not be marked, just change the order by in the Over clause so that the one you don't want marked comes first.

    As with any update, carefully test and be sure to have a good backup before running against a production database.

    Tom

    I was inching closer but your solution solves my problem.  Thank you very much!
  • Friday, January 18, 2013 3:32 PM
    Moderator
     
      Has Code
    ;WITH cte_dupes AS (
        SELECT findingkey, dupeFinding, ROW_NUMBER() OVER(PARTITION BY findingkey ORDER BY findingkey ) rn
        FROM excelDataRepository
        WHERE dataSetID=1
         
        )
    UPDATE cte_dupes
    SET dupeFinding = CASE WHEN rn=1 THEN 0 ELSE 1 END