Archive for September, 2007

Hash functions in T-SQL

SQL Server has the CHECKSUM() (and BINARY_CHECKSUM()) function for generating hash values. This is a simple hash function that maps input value(s) to a 32-bit INTEGER value. CHECKSUM() and BINARY_CHECKSUM() differ in how they treat inputs of the string data-type; see the BINARY_CHECKSUM() topic in BOL for more information.
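For example (assuming a case-insensitive default collation, which is the common setup), CHECKSUM() treats two strings that compare equal as the same input, whereas BINARY_CHECKSUM() works on the underlying bytes:

-- Under a case-insensitive collation the first two values match;
-- the BINARY_CHECKSUM() values do not
SELECT chksum_lower    = CHECKSUM('hello world')
     , chksum_upper    = CHECKSUM('HELLO WORLD')
     , binchksum_lower = BINARY_CHECKSUM('hello world')
     , binchksum_upper = BINARY_CHECKSUM('HELLO WORLD')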

When might you use a hash function? Hash functions are a useful option to improve the efficiency of particular queries on large volumes of data. For example, on long strings of text, you can build a hash-index to perform efficient lookups or to speed up aggregate operations. Hash values generated for an entire row are useful for efficiently searching for differences between rows in tables. Hash functions have many other uses that are beyond the scope of this post.
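As a quick sketch of the hash-index idea (the table and column names here are invented for illustration), you can add a computed column over CHECKSUM() of a long string column and index that instead of the string itself:

-- Hypothetical table with a string column too long to index comfortably
CREATE TABLE dbo.Documents
(
    doc_id           INT IDENTITY PRIMARY KEY
  , doc_title        NVARCHAR(4000) NOT NULL
  , doc_title_chksum AS CHECKSUM(doc_title)    -- computed hash column
)

-- Index the small 32-bit hash rather than the long string
CREATE INDEX IX_Documents_doc_title_chksum ON dbo.Documents (doc_title_chksum)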

One problem with CHECKSUM() is that the probability of a collision (generating the same output value for two different input values) may not be sufficiently low for all applications – it’s not too difficult to come across examples of two different inputs hashing to the same output value. Of course, collisions are possible with any function that has a larger domain than its range (by definition!), it’s just that the probability with CHECKSUM() is a little too high to live with for many applications. For example, the following script shows an example of two UNIQUEIDENTIFIER values that hash to the same value:

DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0'
     , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'

SELECT chksum_guid1 = CHECKSUM(@guid1), chksum_guid2 = CHECKSUM(@guid2)

This is a particular weakness with CHECKSUM() (and BINARY_CHECKSUM()) since it has only 2^32 possible output values. It’s trivial to find further examples by brute force.
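If you want to see this for yourself, a brute-force sketch along the following lines will usually turn up a good number of collisions among a million random GUIDs (the temp table is purely illustrative):

-- Generate ~1 million random GUIDs; the cross join is just a convenient row source
CREATE TABLE #guids (guid UNIQUEIDENTIFIER NOT NULL)

INSERT #guids (guid)
SELECT TOP 1000000 NEWID()
FROM master.dbo.spt_values a CROSS JOIN master.dbo.spt_values b

-- Any CHECKSUM value shared by two or more distinct GUIDs is a collision
SELECT chksum = CHECKSUM(guid), guid_count = COUNT(DISTINCT guid)
FROM #guids
GROUP BY CHECKSUM(guid)
HAVING COUNT(DISTINCT guid) > 1

DROP TABLE #guids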

Whether or not you are concerned about collisions with CHECKSUM() depends somewhat on your application requirements. For lookups (via a hash index), collisions are generally not costly, provided the initial lookup eliminates the vast majority of rows and the original value is re-checked. However, for aggregates across a large dataset, where the hash value stands in for the original value, a collision produces an incorrect result rather than just a slow one, so it is far more costly.
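For the lookup case, the usual pattern is to seek on the hash column and then re-check the original value, so the odd collision costs only a few extra comparisons. A sketch against the hypothetical dbo.Documents table from earlier:

DECLARE @title NVARCHAR(4000)
SET @title = N'Some document title to look up'

-- Seek on the small hash index first, then re-check the original string
-- so that any colliding rows are filtered out
SELECT doc_id, doc_title
FROM dbo.Documents
WHERE doc_title_chksum = CHECKSUM(@title)
  AND doc_title = @title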

You could run CHECKSUM() twice, against both the input value and its reverse, generating a combined 64-bit output value. This reduces the probability of a collision. Taking the two colliding UNIQUEIDENTIFIER values from the earlier example:

DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0'
     , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'

SELECT chksum_guid1 = CONVERT(BIGINT, CONVERT(BINARY(4), CHECKSUM(REVERSE(@guid1))) + CONVERT(BINARY(4), CHECKSUM(@guid1)))
     , chksum_guid2 = CONVERT(BIGINT, CONVERT(BINARY(4), CHECKSUM(REVERSE(@guid2))) + CONVERT(BINARY(4), CHECKSUM(@guid2)))

The HASHBYTES() function was introduced in SQL Server 2005. It lets you specify one of several hash algorithms and, for MD5 and SHA1 in particular, is far less likely than CHECKSUM() to produce a collision for different inputs (BOL mentions this in the CHECKSUM() topic): MD5 outputs a 128-bit value and SHA1 a 160-bit value, giving 2^128 and 2^160 possible output values respectively.
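For example, re-running the colliding UNIQUEIDENTIFIER pair from above through HASHBYTES() produces distinct MD5 values (HASHBYTES() expects a character or binary input, hence the explicit conversion):

DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0'
     , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'

SELECT md5_guid1 = HASHBYTES('MD5', CONVERT(BINARY(16), @guid1))
     , md5_guid2 = HASHBYTES('MD5', CONVERT(BINARY(16), @guid2))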

Comments (4)

Auto-create/auto-update statistics and read-only databases

This post is a short follow-up to my last post about read-only data and locking in SQL Server. Infrequently updated databases (particularly those refreshed via some kind of batch process) are often set to a read-only state immediately after each update, to prevent the lock-manager from taking out shared-locks on their data.

Be aware that if you set a database read-only, one consequence is that the auto-create and auto-update statistics functionality becomes disabled (for obvious reasons). At the very least, you may want to invoke sp_updatestats prior to setting the DB read-only. Ideally, ensure you have all the required statistics / indexes predefined so that the optimizer is unlikely to require additional column statistics. Of course, you may generously index your read-only DB, since the usual warnings about over-indexing do not apply to read-only data…
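A minimal sketch of that sequence (the database name is just a placeholder):

-- Refresh statistics while the database is still writable...
USE MyReportingDB
EXEC sp_updatestats

-- ...then flip it to read-only
USE master
ALTER DATABASE MyReportingDB SET READ_ONLY WITH ROLLBACK IMMEDIATE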

Comments off

Read-only data

If you have read-only data, placing it in a read-only database will prevent shared-locks from being taken on that data (at all transaction isolation levels), reducing some of the locking overhead.

However, there are several disadvantages to placing data in a separate database: for example, you cannot enforce foreign key constraints across databases, and the auto-create/auto-update statistics functionality is effectively disabled in a read-only database.

A common misconception is that shared-locks are not taken on data stored on read-only filegroups. This is (unfortunately) not the case – SQL Server will take shared-locks on data backed by a read-only filegroup – you can verify this easily enough.
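One way to verify it (a sketch, assuming SQL Server 2005 and a table you know is stored on a read-only filegroup) is to query the table inside an open transaction at REPEATABLE READ, so the shared-locks are held long enough to inspect, and then look at sys.dm_tran_locks from the same session:

-- dbo.SomeReadOnlyTable is assumed to sit on a read-only filegroup
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ

BEGIN TRAN

SELECT COUNT(*) FROM dbo.SomeReadOnlyTable

-- Shared (S) locks taken by this session are still reported,
-- even though the underlying filegroup is read-only
SELECT resource_type, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID

ROLLBACK TRAN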

A few workarounds come to mind.

  1. Specify the WITH (NOLOCK) hint wherever you access your read-only tables, or create views over the read-only tables that specify the WITH (NOLOCK) hint and only access the read-only data through those views. None of the usual warnings around “read uncommitted” apply, provided the underlying data is not being modified.

  2. Specify table-locking only (by disabling row/page locks via the ALTER INDEX statement or sp_indexoption stored procedure) for your read-only tables/indexes. This way, you’ll eliminate much of the locking overhead associated with accessing these read-only tables.

Of the two options above, I prefer option 2; it’s likely the more maintainable approach, and option 1 intermingles logical and physical constructs, which is never a good idea.
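A minimal sketch of option 2 on SQL Server 2005 (the table name is illustrative; sp_indexoption offers the equivalent on earlier versions):

-- Restrict every index on a read-only table to table-level locks only
ALTER INDEX ALL ON dbo.ReferenceData
    SET (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF)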

Comments (1)