Hash functions in T-SQL

SQL Server has the CHECKSUM() (and BINARY_CHECKSUM()) function for generating hash values. These are simple hash functions that map their input value(s) to a 32-bit INTEGER value. CHECKSUM() and BINARY_CHECKSUM() differ in how they treat inputs of string data types; see the BINARY_CHECKSUM() topic in BOL for more information.
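
As a quick illustration of that difference (assuming a case-insensitive default collation, which is typical): CHECKSUM() respects the collation and treats 'hello' and 'HELLO' as the same value, whereas BINARY_CHECKSUM() works over the underlying bytes and distinguishes them.

SELECT checksum_lower     = CHECKSUM('hello')
     , checksum_upper     = CHECKSUM('HELLO')           -- same as checksum_lower under a CI collation
     , bin_checksum_lower = BINARY_CHECKSUM('hello')
     , bin_checksum_upper = BINARY_CHECKSUM('HELLO')    -- differs: computed over the raw bytes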

When might you use a hash function? Hash functions are a useful option for improving the efficiency of particular queries over large volumes of data. For example, on long strings of text you can build a hash index to perform efficient lookups or to speed up aggregate operations. Hash values generated for an entire row are useful for efficiently searching for differences between rows in tables. Hash functions have many other uses that are beyond the scope of this post.
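
Here is a minimal sketch of the hash-index idea over a long string column (the table and column names are purely illustrative): keep CHECKSUM() of the string in an indexed computed column, seek on that small 4-byte value, and then re-check the full string to weed out any rows that merely collide on the hash.

-- Illustrative table: CHECKSUM() of the long string kept in an indexed computed column
CREATE TABLE dbo.Document
(
    DocumentID   INT IDENTITY(1, 1) PRIMARY KEY
  , LongText     VARCHAR(MAX) NOT NULL
  , LongTextHash AS CHECKSUM(LongText)
)
GO

CREATE INDEX IX_Document_LongTextHash ON dbo.Document (LongTextHash)
GO

-- Lookup: seek on the 4-byte hash first, then re-check the full string
DECLARE @search VARCHAR(MAX)
SELECT @search = 'some long string to find'

SELECT DocumentID
FROM dbo.Document
WHERE LongTextHash = CHECKSUM(@search)
  AND LongText = @search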

One problem with CHECKSUM() is that the probability of a collision (generating the same output value for two different input values) may not be sufficiently low for all applications – it’s not too difficult to come across examples of two different inputs hashing to the same output value. Of course, collisions are possible with any function that has a larger domain than its range (by definition!), it’s just that the probability with CHECKSUM() is a little too high to live with for many applications. For example, the following script shows an example of two UNIQUEIDENTIFIER values that hash to the same value:

DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0'
     , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'
SELECT chksum_guid1 = CHECKSUM(@guid1), chksum_guid2 = CHECKSUM(@guid2)

This is a particular weakness with CHECKSUM() (and BINARY_CHECKSUM()) since it has only 2^32 possible output values. It’s trivial to find further examples by brute force.
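
If you want to see this for yourself, a rough brute-force sketch (the temporary table and the row count are arbitrary choices) is to generate a large batch of random UNIQUEIDENTIFIER values and look for hash buckets containing more than one distinct value; with around a million GUIDs squeezed into 2^32 possible outputs, a few collisions are very likely to turn up.

-- Generate ~1 million random GUIDs (sys.all_objects cross-joined to itself
-- simply provides enough rows to select from)
CREATE TABLE #guids (guid UNIQUEIDENTIFIER NOT NULL)

INSERT INTO #guids (guid)
SELECT TOP (1000000) NEWID()
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b

-- Any hash value shared by more than one distinct GUID is a collision
SELECT hash_value      = CHECKSUM(guid)
     , guids_in_bucket = COUNT(DISTINCT guid)
FROM #guids
GROUP BY CHECKSUM(guid)
HAVING COUNT(DISTINCT guid) > 1

DROP TABLE #guids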

Whether or not you are concerned about collisions with CHECKSUM() depends somewhat on your application requirements. For lookups (via a hash index), collisions are generally not very costly, provided the initial lookup eliminates the vast majority of rows and the full values are then re-checked. However, for aggregates or comparisons across a large dataset, collisions eliminate the usefulness of the hash function and are therefore very costly.
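
The row-comparison use mentioned earlier is an example of the second case: if two genuinely different sets of values happen to collide on the checksum, the changed row is silently reported as unchanged, and there is no cheap re-check that saves you. A rough sketch (the table and column names are made up):

-- Find rows that differ between a source table and a copy by comparing per-row checksums
SELECT s.CustomerID
FROM dbo.Customer_Source AS s
JOIN dbo.Customer_Copy AS c
  ON c.CustomerID = s.CustomerID
WHERE BINARY_CHECKSUM(s.FirstName, s.LastName, s.ModifiedDate)
   <> BINARY_CHECKSUM(c.FirstName, c.LastName, c.ModifiedDate)
-- A collision here means a genuinely changed row is missed entirely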

One workaround is to run CHECKSUM() twice, once against the input value and once against its reverse, and combine the two results into a 64-bit output value. This reduces the probability of a collision. Taking the two colliding UNIQUEIDENTIFIER values from the earlier example:

DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0'
     , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'

-- Each CHECKSUM() result is converted to BINARY(4); the two 4-byte values are
-- concatenated and the resulting 8 bytes are converted to a BIGINT
SELECT chksum_guid1 = CONVERT(BIGINT, CONVERT(BINARY(4), CHECKSUM(REVERSE(@guid1))) + CONVERT(BINARY(4), CHECKSUM(@guid1)))
     , chksum_guid2 = CONVERT(BIGINT, CONVERT(BINARY(4), CHECKSUM(REVERSE(@guid2))) + CONVERT(BINARY(4), CHECKSUM(@guid2)))

The HASHBYTES() function was introduced in SQL Server 2005. It lets you specify one of several hash algorithms and, for MD5 and SHA1 in particular, is far less likely than CHECKSUM() to produce the same output for different inputs (BOL mentions this in the CHECKSUM() topic): MD5 outputs a 128-bit value and SHA1 a 160-bit value, giving 2^128 and 2^160 possible output values respectively.
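
As a quick check, the two UNIQUEIDENTIFIER values that collided under CHECKSUM() above produce different digests under HASHBYTES(). The GUIDs are converted to VARBINARY explicitly here, since HASHBYTES() expects a character or binary input.

DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0'
     , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'

-- MD5 (128-bit) and SHA1 (160-bit) digests of the two colliding GUIDs
SELECT md5_guid1  = HASHBYTES('MD5',  CONVERT(VARBINARY(16), @guid1))
     , md5_guid2  = HASHBYTES('MD5',  CONVERT(VARBINARY(16), @guid2))
     , sha1_guid1 = HASHBYTES('SHA1', CONVERT(VARBINARY(16), @guid1))
     , sha1_guid2 = HASHBYTES('SHA1', CONVERT(VARBINARY(16), @guid2))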

4 Comments

  1. Generating a unique number for each row « Exploring Microsoft.NET & Apple Macintosh said,

    August 4, 2008 @ 10:05 am

    [...] Update: I happen to read that checksum() function is not guaranteed to give unique values. This is the case with SQL Server 2000 and I believe the same case with SQL Server 2005. For more information on checksum() collisions, take a look at this URL: http://blogs.clarience.com/davide/?p=11. [...]

  2. Ryan WIlliams said,

    December 3, 2008 @ 4:10 pm

    FFF61235-EC1A-428C-9FC8-DD794D4115A4 also has the same checksum

  3. Bernie said,

    January 28, 2009 @ 2:20 pm

    It is somewhat ridiculous to have a checksum function that returns collisions to this degree. In my own experience, using a population of 100K rows and 3 fields (char, datetime, int) to compute a checksum on a record, the CHECKSUM function returned the same value 23 times. I deem this to be hardly deterministic, and therefore this function, and SQL Server by association, is making my job more difficult. Temporarily, I am using the HASHBYTES function and passing that value to CHECKSUM. No collisions so far :). I am very concerned that SQL Server has such a poor implementation of a CRC-32 hash algorithm. I might switch to Oracle if I have any more issues with hashing, or use a third-party hash written in C++.

  4. Crikey said,

    August 30, 2012 @ 9:56 pm

    re: “I am very concerned that SQL Server has such a poor implementation of a CRC-32 hash algorithm. I might switch to Oracle…”

    The CRC-32 algorithm is standardized and not defined by SQL Server (or Oracle). All implementations of CRC-32 must return the same output when given the same input (or else they aren’t actually computing CRC-32). Switching from one implementation of CRC-32 to another is not going to reduce the likelihood of collisions.

    re: “It is somewhat ridiculous to have a checksum function that returns collisions to this degree.”

    That is kind of like saying that golf carts are ridiculous because they can’t haul 12 tons of cargo at 1,000 miles an hour. Or that windows are silly because they can be broken so easily. Algorithms each have their place and satisfy different sets of requirements and constraints. It’s up to the designer to understand this and use the right algorithm for the right problem.
