BINARY_CHECKSUM produces the same number for different values

    Question

  • Hi All

    I want to produce a unique number from a set of table columns. The problem is that I have tried lots of combinations of CHECKSUM and BINARY_CHECKSUM, but these functions produce the same number for different values. Do you have any suggestions? I want to check the table using this number computed over several columns, and if the row already exists, delete it and insert the new one. I don't want to do a big join because of the performance cost. The example below produces the same number for both value lists.

    SELECT
        BINARY_CHECKSUM('LU82','FS8B22','0012301112','0012301112','0002712649','2013-08-27','B000711001014900'),
        BINARY_CHECKSUM('LU82','FS8B22','0012301112','0012301112','0002712646','2013-08-27','B000711001014600')

    Saturday, August 31, 2013 3:36 PM

Answers

  • Concatenate string values with "|" between them?

    A checksum is not a one-to-one mapping. It is a many-to-one mapping, hence the duplicates.

    BOL: "BINARY_CHECKSUM(*), computed on any row of a table, returns the same value as long the row is not subsequently modified. BINARY_CHECKSUM(*) will return a different value for most, but not all, changes to the row, and can be used to detect most row modifications.

    BINARY_CHECKSUM can be applied over a list of expressions, and returns the same value for a specified list. BINARY_CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and byte representation. For this definition, null values of a specified type are considered to have the same byte representation.

    BINARY_CHECKSUM and CHECKSUM are similar functions: They can be used to compute a checksum value on a list of expressions, and the order of expressions affects the resultant value. The order of columns used for BINARY_CHECKSUM(*) is the order of columns specified in the table or view definition. These include computed columns.

    CHECKSUM and BINARY_CHECKSUM return different values for the string data types, where locale can cause strings with different representation to compare equal. The string data types are char, varchar, nchar, nvarchar, or sql_variant (if the base type of sql_variant is a string data type). For example, the BINARY_CHECKSUM values for the strings "McCavity" and "Mccavity" are different. In contrast, in a case-insensitive server, CHECKSUM returns the same checksum values for those strings. CHECKSUM values should not be compared with BINARY_CHECKSUM values."

    Link: http://technet.microsoft.com/en-us/library/ms173784.aspx
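
    For instance, here is a minimal sketch of that suggestion applied to the values from the question. CONCAT is available from SQL Server 2012 onward and treats NULL as an empty string; whether this particular pair still collides after concatenation is something you would have to test, since the result is still a 32-bit int.

    -- Feeding one delimited string instead of a column list changes how the
    -- bytes line up, but collisions remain possible with any 32-bit result:
    SELECT
        BINARY_CHECKSUM(CONCAT('LU82','|','FS8B22','|','0012301112','|','0012301112','|','0002712649','|','2013-08-27','|','B000711001014900')),
        BINARY_CHECKSUM(CONCAT('LU82','|','FS8B22','|','0012301112','|','0012301112','|','0002712646','|','2013-08-27','|','B000711001014600'))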


    Kalman Toth Database & OLAP Architect sqlusa.com
    New Book / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012


    • Marked as answer by altuko Monday, September 02, 2013 12:34 AM
    Saturday, August 31, 2013 7:35 PM

All replies

  • Actually, checksums are in principle not meant to be used for reliably detecting differences. You can read more about checksums here: http://en.wikipedia.org/wiki/Checksum

    Depending on the amount of data you have, it might be easier to just compare column by column.

    Otherwise you should use hashing.

    For big data warehouse tables I use an SSIS transformation called “Multiple Hash”, which lets you choose between different algorithms: http://ssismhash.codeplex.com/
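
    If you want to stay in T-SQL rather than SSIS, a rough sketch of the same idea with the built-in HASHBYTES function could look like this (table and column names are hypothetical):

    -- SHA1 returns a 160-bit varbinary(20), so accidental collisions are far
    -- less likely than with a 32-bit checksum. The '|' delimiter keeps
    -- adjacent column values from blending into each other.
    SELECT HASHBYTES('SHA1',
               CONCAT(col1, '|', col2, '|', col3, '|', col4)) AS row_hash
    FROM dbo.my_source_table;  -- hypothetical table and columns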


    Andreas Wolter | Microsoft Certified Master SQL Server
    Blog: www.insidesql.org/blogs/andreaswolter
    Web: www.andreas-wolter.com

    Saturday, August 31, 2013 9:22 PM
  • The BINARY_CHECKSUM function calculates a hash; it tries to avoid collisions (= duplicates), but can never guarantee there will be none.

    The BINARY_CHECKSUM function returns an int. Any data with more than 32 bits of information cannot fit into this int. If that is the case for your data, then you should expect duplicates.

    If you want to continue to pursue this approach (which I have also used quite successfully), you could implement a uniquefier, just like the storage engine does for any non-unique clustered index. Declare the key as the combination of a checksum column and a counter column.

    If the maximum number of instances of a duplicate is 255, then you only need a tinyint.

    CREATE TABLE my_table
    (checksum_id int          NOT NULL  -- checksum of the hidden column(s)
    ,uniquefier  tinyint      NOT NULL  -- counter that disambiguates collisions
    ,big_data    varchar(900) NOT NULL  -- the data the checksum is computed from
    ,...
    ,CONSTRAINT PK_my_table
     PRIMARY KEY CLUSTERED (checksum_id, uniquefier)
    )
    

    It does mean that if you need to search for a row and don't know the uniquefier value yet, you not only have to match the checksum value, but also all the individual columns that you hid behind the checksum.

    Because of that, I think this approach is only a good idea if you will access the rows by their primary key most of the time. Of course, a low number of duplicates is also a prerequisite.
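
    A sketch of what such a lookup could look like against the table above, treating big_data as the only column hidden behind the checksum:

    DECLARE @big_data varchar(900) = 'some value to look up';  -- hypothetical input
    DECLARE @cs       int          = BINARY_CHECKSUM(@big_data);

    -- The checksum gives a cheap seek on the narrow clustered index;
    -- comparing big_data itself weeds out colliding rows.
    SELECT checksum_id, uniquefier
    FROM   my_table
    WHERE  checksum_id = @cs
      AND  big_data    = @big_data;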


    Gert-Jan

    Sunday, September 01, 2013 9:16 AM