This report describes a hashing scheme for a dictionary of short bit strings. The scheme, which we call near-perfect hashing, was designed as part of the construction of Mercury BLAST, an FPGA-based accelerator for the BLAST family of biosequence comparison algorithms. Near-perfect hashing is a heuristic variant of the well-known displacement hashing approach to building perfect hash functions. It uses a family of hash functions composed from linear transformations on bit vectors and lookups in small precomputed tables, both of which are especially appropriate for implementation in hardware logic. We show empirically that for inputs derived from genomic DNA sequences, our scheme obtains a good tradeoff between the size of the hash table and the time required to compute it from a set of input strings, while generating few or no collisions between keys in the table.

One of the building blocks of our scheme is the H3 family of hash functions, which are linear transformations on bit vectors. We show that the uniformity of hashing performed with randomly chosen linear transformations depends critically on their rank, and that randomly chosen transformations have a high probability of having the maximum possible uniformity. A simple test is sufficient to ensure that a randomly chosen H3 hash function will not cause an unexpectedly large number of collisions. Moreover, if two such functions are chosen independently at random, the second function is unlikely to hash together two keys that were hashed together by the first. Hashing schemes based on H3 hash functions therefore tend to distribute their inputs more uniformly than would be expected under a simple uniform hashing model, and schemes using pairs of these functions are more uniform than would be assumed for a pair of independent hash functions.
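The role of rank in an H3 hash function can be illustrated with a small sketch. It is not the report's implementation: the function names, the bit widths, and the choice of Python are all assumptions made for illustration, and the "simple test" is assumed here to be a full-rank check, which the abstract suggests but does not state. An H3 function is modelled as an m x n bit matrix over GF(2), stored as one n-bit mask per output bit.

```python
import random

def random_h3(n_bits, m_bits, rng):
    """Draw a random m x n matrix over GF(2): one n-bit row mask per output bit."""
    return [rng.getrandbits(n_bits) for _ in range(m_bits)]

def h3_hash(rows, key):
    """Output bit i is the parity of (rows[i] AND key), i.e. a GF(2) dot product."""
    h = 0
    for i, row in enumerate(rows):
        h |= (bin(row & key).count("1") & 1) << i
    return h

def gf2_rank(rows):
    """Rank of the matrix over GF(2), by elimination on integer bit masks."""
    pivots = {}  # leading-bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = row
                break
            row ^= pivots[top]  # cancel the leading bit and keep reducing
    return len(pivots)

def random_full_rank_h3(n_bits, m_bits, rng):
    """Redraw until the matrix has rank m_bits (assumed test; needs m_bits <= n_bits)."""
    while True:
        rows = random_h3(n_bits, m_bits, rng)
        if gf2_rank(rows) == m_bits:
            return rows

def bucket_counts(rows, n_bits):
    """Count how many of the 2**n_bits possible keys land in each bucket."""
    counts = [0] * (1 << len(rows))
    for key in range(1 << n_bits):
        counts[h3_hash(rows, key)] += 1
    return counts
```

Because the map is linear over GF(2), a matrix of rank r reaches exactly 2^r buckets, and each reached bucket receives 2^(n-r) of the 2^n possible keys. A full-rank matrix therefore spreads the whole key space evenly over all 2^m buckets, while a rank-deficient one leaves at least half of the buckets empty. For example, `bucket_counts(random_full_rank_h3(8, 4, random.Random(0)), 8)` returns sixteen buckets of 16 keys each. This statement concerns the full key space only; the behaviour on a particular dictionary of keys is what the report measures empirically.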
Buhler, Jeremy, "Mercury BLAST dictionaries: analysis and performance measurement" Report Number: WUCSE-2007-13 (2007). All Computer Science and Engineering Research.