Archive for SQL Server

Hash functions in T-SQL

SQL Server has the CHECKSUM() (and BINARY_CHECKSUM()) function for generating hash values. This is a simple hash function that maps input value(s) to a 32-bit INTEGER value. CHECKSUM() and BINARY_CHECKSUM() differ in how they treat inputs of the string data-type; see the BINARY_CHECKSUM() topic in BOL for more information.
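As a quick sketch of the difference (the exact behaviour depends on your collation), compare strings that differ only in case:

```sql
-- CHECKSUM() hashes strings according to the current collation, so under a
-- case-insensitive collation these first two calls return the same value;
-- BINARY_CHECKSUM() works on the underlying bytes, so the last two differ.
SELECT cs_lower  = CHECKSUM('hello'),
       cs_upper  = CHECKSUM('HELLO'),
       bcs_lower = BINARY_CHECKSUM('hello'),
       bcs_upper = BINARY_CHECKSUM('HELLO')
```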

When might you use a hash function? Hash functions are a useful option to improve the efficiency of particular queries on large volumes of data. For example, on long strings of text, you can build a hash-index to perform efficient lookups or to speed up aggregate operations. Hash values generated for an entire row are useful for efficiently searching for differences between rows in tables. Hash functions have many other uses that are beyond the scope of this post.
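The hash-index pattern for long strings can be sketched as follows (table and column names here are hypothetical):

```sql
-- Hypothetical table with a long string column we want to search on
CREATE TABLE dbo.Document
(
    document_id INT IDENTITY PRIMARY KEY,
    body        VARCHAR(7000) NOT NULL,
    body_chksum AS CHECKSUM(body)  -- computed column holding the 4-byte hash
);

-- Index the small hash value rather than the long string itself
CREATE INDEX IX_Document_body_chksum ON dbo.Document (body_chksum);

-- Lookups seek on the hash, then re-check the full string to weed out collisions
DECLARE @search VARCHAR(7000)
SET @search = 'some long text...'

SELECT document_id
FROM dbo.Document
WHERE body_chksum = CHECKSUM(@search)
  AND body = @search
```

The residual predicate on `body` is what makes collisions merely a small cost here rather than a correctness problem.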

One problem with CHECKSUM() is that the probability of a collision (generating the same output value for two different input values) may not be sufficiently low for all applications – it’s not too difficult to come across examples of two different inputs hashing to the same output value. Of course, collisions are possible with any function that has a larger domain than its range (by definition!), it’s just that the probability with CHECKSUM() is a little too high to live with for many applications. For example, the following script shows an example of two UNIQUEIDENTIFIER values that hash to the same value:


DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER
SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'
SELECT chksum_guid1 = CHECKSUM(@guid1), chksum_guid2 = CHECKSUM(@guid2)

This is a particular weakness with CHECKSUM() (and BINARY_CHECKSUM()) since it has only 2^32 possible output values. It’s trivial to find further examples by brute force.

Whether or not you are concerned about collisions with CHECKSUM() depends somewhat on your application requirements. For lookups (via a hash index), collisions are generally not so costly, provided the initial lookup eliminates the vast proportion of rows. However, for aggregates across a large dataset, collisions eliminate the usefulness of the hash function and are therefore very costly.

You could run CHECKSUM() twice, against both the input value and its reverse, generating a combined 64-bit output value. This reduces the probability of a collision. Taking the two colliding UNIQUEIDENTIFIER values from the earlier example:


DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER
SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0', @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'
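One way to sketch the combined hash (using the character form of the values so that REVERSE() applies):

```sql
DECLARE @guid1 UNIQUEIDENTIFIER, @guid2 UNIQUEIDENTIFIER
SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0',
       @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'

-- Hash the 36-character form forwards and backwards; the pair of 32-bit
-- values acts as a combined 64-bit hash, so both halves must collide
-- before the combined values do.
SELECT fwd1 = CHECKSUM(CAST(@guid1 AS CHAR(36))),
       rev1 = CHECKSUM(REVERSE(CAST(@guid1 AS CHAR(36)))),
       fwd2 = CHECKSUM(CAST(@guid2 AS CHAR(36))),
       rev2 = CHECKSUM(REVERSE(CAST(@guid2 AS CHAR(36))))
```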


The HASHBYTES() function was introduced with SQL Server 2005. This function gives the option to specify a selection of different hash algorithms and, compared to CHECKSUM(), is far less likely to result in a collision for different inputs – MD5 and SHA1 in particular (BOL mentions this in the CHECKSUM() topic). MD5 outputs a 128-bit value and SHA1 a 160-bit value, giving 2^128 and 2^160 possible output values respectively.
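For example:

```sql
DECLARE @s VARCHAR(100)
SELECT @s = 'the quick brown fox'

-- HASHBYTES() accepts varchar, nvarchar or varbinary input
SELECT md5_hash  = HASHBYTES('MD5',  @s),  -- VARBINARY(16): 128-bit output
       sha1_hash = HASHBYTES('SHA1', @s)   -- VARBINARY(20): 160-bit output
```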


Auto-create/auto-update statistics and read-only databases

This post is a short follow-on to my last post about read-only data and locking in SQL Server. To prevent the lock manager from taking out shared locks, infrequently updated databases (particularly those updated via some kind of batch process) are often set to a read-only state immediately following each update.

Be aware that if you set a database read-only, one consequence is that the auto-create and auto-update statistics functionality is disabled (for obvious reasons). At the very least, you may want to invoke sp_updatestats prior to setting the DB read-only. Ideally, ensure you have all the required statistics / indexes predefined so that the optimizer is unlikely to require additional column statistics. Of course, you may generously index your read-only DB, since the usual warnings about over-indexing do not apply to read-only data…
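A minimal sketch of the batch-update pattern (the database name is hypothetical):

```sql
USE SalesArchive;
-- Refresh statistics while the database is still writable
EXEC sp_updatestats;

-- Then flip it read-only until the next batch load
ALTER DATABASE SalesArchive SET READ_ONLY WITH ROLLBACK IMMEDIATE;

-- ...and back again before the next load:
-- ALTER DATABASE SalesArchive SET READ_WRITE WITH ROLLBACK IMMEDIATE;
```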


Read-only data

If you have read-only data, placing it in a read-only database will prevent shared-locks from being taken on that data (for all transaction isolation levels) reducing some locking overhead.

However, there are many disadvantages to placing data in a separate database, for example, you cannot enforce foreign key constraints across databases and auto create/update statistics functionality is effectively disabled in a read-only database.

A common misconception is that shared-locks are not taken on data stored on read-only filegroups. This is (unfortunately) not the case – SQL Server will take shared-locks on data backed by a read-only filegroup – you can verify this easily enough.

A few workarounds come to mind.

  1. Specify the WITH (NOLOCK) hint wherever you access your read-only tables. Alternatively, create views on your read-only tables that specify the WITH (NOLOCK) hint and only access the read-only data through those views. None of the usual warnings around “read uncommitted” apply, provided the underlying data is not being modified.

  2. Specify table-locking only (by disabling row/page locks via the ALTER INDEX statement or sp_indexoption stored procedure) for your read-only tables/indexes. This way, you’ll eliminate much of the locking overhead associated with accessing these read-only tables.

Of the two options above, I prefer option 2: it’s likely the more maintainable approach, and option 1 intermingles logical and physical constructs, which is never a good idea.
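Both workarounds can be sketched against a hypothetical read-only table:

```sql
-- Option 1: a NOLOCK view over the read-only table
CREATE VIEW dbo.CountryReadOnly
AS
SELECT country_code, country_name
FROM dbo.Country WITH (NOLOCK);
GO

-- Option 2 (preferred): disable row/page locks on all indexes, so that only
-- a single table-level lock is taken per access
ALTER INDEX ALL ON dbo.Country
SET (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);
```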


An Alternative to delimited-lists

Sooner or later, you’ll come across the need to pass “lists” of items into a SQL Server Stored Procedure. Unfortunately, SQL Server doesn’t help us out much here. There are a lot of approaches¹, but most of them seem more “workaround” than “solution” in nature.

When dealing with fixed length data-items, you can make use of the [sequence_generator] function that I showed in an earlier post. Here’s an example function to extract UNIQUEIDENTIFIER values from a VARBINARY(max) input parameter:

CREATE FUNCTION [dbo].[unpivot_uniqueidentifiers](@uniqueidentifiers VARBINARY(max))
RETURNS TABLE
AS
RETURN
(
      SELECT v = CAST(SUBSTRING(@uniqueidentifiers, seq.i, 16) AS UNIQUEIDENTIFIER)
            FROM [dbo].[sequence_generator](DATALENGTH(@uniqueidentifiers) / 16, 1, 16) seq
)


You can easily adapt the above function for INTEGER, DATETIME and other fixed-length data-types. Turning the “list” into a table is straightforward:
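For instance, an INTEGER variant is a near copy, substituting the 4-byte item length (this again assumes the [sequence_generator] function from the earlier post):

```sql
CREATE FUNCTION [dbo].[unpivot_integers](@integers VARBINARY(max))
RETURNS TABLE
AS
RETURN
(
      -- Each packed item is 4 bytes; cast each BINARY(4) slice back to INT
      SELECT v = CAST(SUBSTRING(@integers, seq.i, 4) AS INT)
      FROM [dbo].[sequence_generator](DATALENGTH(@integers) / 4, 1, 4) seq
)
```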


DECLARE @input VARBINARY(max)

--Build a "list" of (up to 100) UNIQUEIDENTIFIERs
--(concatenating into the variable row by row; the packed format is 16 bytes per value)
SELECT @input = ISNULL(@input, 0x) + CAST(NEWID() AS BINARY(16))
      FROM [dbo].[sequence_generator](RAND() * 100, 0, 1)

--"unpivot" into a table
SELECT ui.v
      FROM [dbo].[unpivot_uniqueidentifiers](@input) ui


¹ See this article by Erland Sommarskog for a detailed discussion of many approaches (including a variation of the method discussed here) to representing lists in SQL Server.


Take care when choosing your data-types

I’ve come across many SQL Server database designs where the vast majority of the data-types used were INTEGER, NVARCHAR, UNIQUEIDENTIFIER and DATETIME.

If you have a domain table with fewer than, say, 100 rows, an INTEGER primary key and infrequent updates, perhaps TINYINT would meet your needs?

If a column will only ever store ASCII character data, e.g. an “email address” column, there’s no need to use NVARCHAR; use VARCHAR instead. URLs and codes (e.g. ISO country codes) are further examples that may not require Unicode support.

When storing date/time information, do you really need granularity down to the nearest three milliseconds? If not, maybe SMALLDATETIME (nearest minute) will meet your requirements instead of DATETIME?

How about fixed length strings? Use (N)CHAR instead of (N)VARCHAR – save those two bytes of storage that the CHARACTER VARYING types use to record the length of the string.
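Putting those suggestions together in a hypothetical table definition:

```sql
CREATE TABLE dbo.CustomerEvent
(
    event_type_id TINYINT       NOT NULL,  -- small domain table key: 1 byte vs 4 for INT
    email_address VARCHAR(320)  NOT NULL,  -- ASCII data: no need for NVARCHAR
    country_code  CHAR(2)       NOT NULL,  -- fixed-length ISO code: CHAR, not VARCHAR
    occurred_at   SMALLDATETIME NOT NULL   -- minute granularity: 4 bytes vs 8 for DATETIME
);
```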

If you’re careful when choosing your data-types, you’ll reap the benefits in reduced storage requirements and faster-performing queries: lower memory, I/O and CPU overhead (from greater index/data page density) almost always translates to improved performance. You’ll make more efficient use of the buffer cache (more data in cache, and more room for the plan cache, which can be at least as large as the buffer cache for some workloads) and transfer less data over your (relatively) slow LAN/WAN connection to the client, resulting in lower latency and hence better client response times. The knock-on benefits are so compelling that it’s well worth a little extra effort up-front in your schema design.


A note on implicit data-conversion
Ensure that your literals and variable data-types match the column data-type. Otherwise, you may (in certain circumstances) inadvertently prevent the optimizer from considering an index.
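For example, against a VARCHAR column, an NVARCHAR literal or variable can force an implicit conversion of the column side of the predicate (the table here is hypothetical):

```sql
-- email_address is VARCHAR; an N'...' literal is NVARCHAR, which has higher
-- data-type precedence, so the column may be implicitly converted and an
-- index seek on email_address prevented:
SELECT customer_id FROM dbo.Customer WHERE email_address = N'a@example.com';

-- Matching the column's data-type keeps the index eligible:
SELECT customer_id FROM dbo.Customer WHERE email_address = 'a@example.com';
```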

