Tuesday, March 6, 2012

An Implementation of the FNV1a Hash Algorithm for SQL Server


Summary

This CLR code implements the Fowler-Noll-Vo (FNV) variant "1a" hash function. This algorithm provides exellent avalanche characteristics, low collision, and high dispersion across either string or integer inputs such as GUIDs, URLs, host names, file names, long text, IPv4 or v6 addresses, etc. My personal testing shows that the 64-bit version can successfully hash 10 billion 36-byte UNIQUEIDENTIFIER (GUID) values to 8-byte BIGINT (long) values with zero collisions. It is IMHO far, far, far better than the SQL native HASHBYTES, CHECKSUM-based hashing, etc.
You can read more about the FNV algorithm, its mathematical underpinning, and its variants here.
Three functions are provided:
xf_GetHash16
Hashes a VARCHAR or NVARCHAR string (up to 4000 characters) to a SMALLINT
OK collision avoidance
xf_GetHash32
Hashes a VARCHAR or NVARCHAR string (up to 4000 characters) to an INT
Better collision avoidance
xf_GetHash64
Hashes a VARCHAR or NVARCHAR string (up to 4000 characters) to a BIGINT
Best collision avoidance
Note that these are NOT encryption functions, they are hashing functions and have no equivalent "decryption" algorithm. One-way only!

Source Code

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text;

public partial class UserDefinedFunctions
{

    static readonly ulong prime64 = 1099511628211;
    static readonly ulong offset64 = 0xcbf29ce484222325;
    static readonly uint prime32 = 16777619;
    static readonly uint offset32 = 2166136261;

    [SqlFunction(IsDeterministic = true, IsPrecise = true, DataAccess = DataAccessKind.None)]
    public static SqlInt64 xf_GetHash64(SqlString value)
    {
        return (SqlInt64)HashFNV1a_64((string)value);
    }

    [SqlFunction(IsDeterministic = true, IsPrecise = true, DataAccess = DataAccessKind.None)]
    public static SqlInt32 xf_GetHash32(SqlString value)
    {
        return (SqlInt32)HashFNV1a_32((string)value);
    }

    [SqlFunction(IsDeterministic = true, IsPrecise = true, DataAccess = DataAccessKind.None)]
    public static SqlInt16 xf_GetHash16(SqlString value)
    {
        return (SqlInt16)HashFNV1a_16((string)value);
    }

    private static long HashFNV1a_64(string value)
    {
        ulong hash = offset64;

        byte[] bytes = Encoding.UTF8.GetBytes(value.ToLower());

        for (int i = 0; i < bytes.Length; i++)
        {
            hash = (hash ^ bytes[i]) * prime64;
        }
        return (long)(hash - long.MaxValue);

    }

    private static int HashFNV1a_32(string value)
    {
        uint hash = offset32;

        byte[] bytes = Encoding.UTF8.GetBytes(value.ToLower());

        for (int i = 0; i < bytes.Length; i++)
        {
            hash = (hash ^ bytes[i]) * prime32;
        }
        return (int)(hash - int.MaxValue);

    }

    private static short HashFNV1a_16(string value)
    {
        uint MASK_16 = (((uint)1 << 16) - 1);    /* i.e., (u_int32_t)0xffff */
        uint hash = offset32;

        byte[] bytes = Encoding.UTF8.GetBytes(value.ToLower());

        for (int i = 0; i < bytes.Length; i++)
        {
            hash = (hash ^ bytes[i]) * prime32;
        }

        hash = (hash >> 16) ^ (hash & MASK_16);

        return (short)(hash - short.MaxValue);

    }

};

Usage

Once installed, you can use these CLR functions like any TSQL function. Here is a proc that will return the 16-, 32-, and 64-bit hashes for a given passed string:
CREATE PROCEDURE GetHashes(@stringval NVARCHAR(4000))
AS BEGIN

DECLARE
   @smallhash SMALLINT, 
   @inthash INT,
   @bighash BIGINT

SELECT
   @smallhash=dbo.xf_GetHash16(@stringval),
   @inthash=dbo.xf_GetHash32(@stringval),
   @bighash=dbo.xf_GetHash64(@stringval)

SELECT
   @smallhash as Hash16Bit,
   @inthash as Hash32Bit,
   @bighash as Hash64Bit

END
Reference inline in a query:
SELECT dbo.xf_GetHash32(x.somecolumn) as HashValue
FROM mytable x
WHERE....
Or use in a computed column and optionally persist it for use as an index or even the primary key:
CREATE TABLE StringTable (
StringValue VARCHAR(512),
HashKey AS (dbo.xf_GetHash64(StringValue)) PERSISTED,
CONSTRAINT PK_StringTable PRIMARY KEY CLUSTERED (HashKey),
CONSTRAINT UK_StringTable_StringValue UNIQUE (StringValue)
)

Disclaimer

The source code is provided to you as is, without warranty. There is no warranty for the program, expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose and non infringement of third party rights. The entire risk as to the quality and performance of the program is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair and correction.

Source:

No comments:

Post a Comment