## Hash functions in T-SQL

SQL Server provides the CHECKSUM() (and BINARY_CHECKSUM()) functions for generating hash values. These are simple hash functions that map their input value(s) to a 32-bit INTEGER value. CHECKSUM() and BINARY_CHECKSUM() differ in how they treat inputs of string data types; see the BINARY_CHECKSUM() topic in Books Online for more information.

When might you use a hash function? Hash functions are a useful option for improving the efficiency of particular queries over large volumes of data. For example, on long strings of text you can build a hash index to perform efficient lookups or to speed up aggregate operations. Hash values generated for an entire row are useful for efficiently finding differences between rows in two tables. Hash functions have many other uses that are beyond the scope of this post.
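To make the hash-index idea concrete, here is a sketch (the dbo.Documents table, its columns, and the search variable are my own illustration, not from the original example): a computed column holds the CHECKSUM() of a long string, a regular index is built over it, and lookups seek on the hash first with a residual predicate to weed out collisions.

```sql
-- Hypothetical table: a CHECKSUM-based hash index over a long string column.
CREATE TABLE dbo.Documents
(
    doc_id    INT IDENTITY PRIMARY KEY,
    body      NVARCHAR(4000) NOT NULL,
    body_hash AS CHECKSUM(body)          -- deterministic, so it can be indexed
);

CREATE INDEX IX_Documents_body_hash ON dbo.Documents (body_hash);

DECLARE @search NVARCHAR(4000);
SELECT @search = N'some long string to find';

-- The seek on body_hash narrows the search to a handful of rows;
-- the residual predicate on body itself eliminates any collisions.
SELECT doc_id
FROM dbo.Documents
WHERE body_hash = CHECKSUM(@search)
  AND body = @search;
```

The point of the second predicate is that a hash match is only a candidate match; re-checking the full string keeps the query correct even when CHECKSUM() collides.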

One problem with CHECKSUM() is that the probability of a collision (generating the same output value for two different input values) may not be sufficiently low for all applications – it’s not too difficult to come across examples of two different inputs hashing to the same output value. Of course, collisions are possible with any function that has a larger domain than its range (by definition!), it’s just that the probability with CHECKSUM() is a little too high to live with for many applications. For example, the following script shows an example of two UNIQUEIDENTIFIER values that hash to the same value:

```sql
DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER;

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0',
       @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B';

SELECT chksum_guid1 = CHECKSUM(@guid1), chksum_guid2 = CHECKSUM(@guid2);
```

This is a particular weakness with CHECKSUM() (and BINARY_CHECKSUM()) since it has only 2^32 possible output values. It's trivial to find further examples by brute force.
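To sketch what that brute force might look like (the temp table and row count here are my own illustration): hash a large batch of random GUIDs and group by the hash value to surface duplicates.

```sql
-- Hash a million random GUIDs and look for duplicate CHECKSUM() outputs.
CREATE TABLE #hashes (guid UNIQUEIDENTIFIER NOT NULL, hash INT NOT NULL);

INSERT #hashes (guid, hash)
SELECT TOP (1000000) g, CHECKSUM(g)
FROM (SELECT NEWID() AS g
      FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b) AS src;

-- With only 2^32 possible outputs, the birthday bound says roughly
-- 2^16 (about 65,000) random inputs already give an even chance of a
-- collision, so a million inputs make one all but certain.
SELECT hash
FROM #hashes
GROUP BY hash
HAVING COUNT(*) > 1;
```

Joining #hashes back to itself on any hash returned by the final query yields concrete colliding GUID pairs like the one above.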

Whether or not you are concerned about collisions with CHECKSUM() depends somewhat on your application requirements. For lookups (via a hash index), collisions are generally not so costly, provided the initial lookup eliminates the vast majority of rows. However, for aggregates across a large dataset, collisions silently produce wrong results, which destroys the usefulness of the hash function and is therefore very costly.

You could run CHECKSUM() twice, against both the input value and its reverse, and combine the two results into a single 64-bit output value. This reduces the probability of a collision. Taking the two colliding UNIQUEIDENTIFIER values from the earlier example:

```sql
DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER;

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0',
       @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B';

SELECT chksum_guid1 = CONVERT(BIGINT, CONVERT(BINARY(4), CHECKSUM(REVERSE(@guid1))) + CONVERT(BINARY(4), CHECKSUM(@guid1))),
       chksum_guid2 = CONVERT(BIGINT, CONVERT(BINARY(4), CHECKSUM(REVERSE(@guid2))) + CONVERT(BINARY(4), CHECKSUM(@guid2)));
```

The HASHBYTES() function was introduced in SQL Server 2005. This function lets you specify one of a selection of hash algorithms and, compared to CHECKSUM(), is far less likely to produce a collision for different inputs, for MD5 and SHA1 in particular (Books Online mentions this in the CHECKSUM() topic). MD5 outputs a 128-bit value and SHA1 a 160-bit value, giving 2^128 and 2^160 possible output values respectively.
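As a quick sketch, applying HASHBYTES() to the same two GUIDs that collided under CHECKSUM() shows distinct MD5 outputs. HASHBYTES() accepts character or binary input rather than a UNIQUEIDENTIFIER directly, so the conversion below is an explicit choice on my part:

```sql
DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER;

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0',
       @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B';

-- HASHBYTES() takes VARCHAR, NVARCHAR or VARBINARY input,
-- so convert the UNIQUEIDENTIFIER values to BINARY(16) first.
SELECT md5_guid1 = HASHBYTES('MD5', CONVERT(BINARY(16), @guid1)),
       md5_guid2 = HASHBYTES('MD5', CONVERT(BINARY(16), @guid2));
-- The two 128-bit outputs differ, unlike the CHECKSUM() results above.
```

The trade-off is cost: HASHBYTES() is more expensive to compute than CHECKSUM(), and its wider output takes more space in an index, so the cheap 32-bit hash can still be the right choice where occasional collisions are tolerable.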