hash function instruction count

6

Click here to load reader

Upload: api-1752250

Post on 11-Apr-2015

371 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Hash Function Instruction Count

A typical hash function at work

Hash function

From Wikipedia, the free encyclopedia

A hash function is any well-defined procedure or mathematical function for turning some kind of data into a

relatively small integer, that may serve as an index into an array. The values returned by a hash function are

called hash values, hash codes, hash sums, or simply hashes.

Hash functions are mostly used to speed up table lookup or data comparison tasks — such as finding items in a

database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and

so on.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomizing

functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some

extent, each has its own uses and requirements. The HashKeeper database maintained by the National Drug

Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values.

Contents

1 Applications

1.1 Hash tables

1.2 Finding duplicate records

1.3 Finding similar records

1.4 Finding similar substrings

1.5 Geometric hashing

2 Properties

2.1 Determinism

2.2 Uniformity

2.3 Variable range

2.4 Data normalization

2.5 Continuity

3 Hash function algorithms

3.1 Injective and perfect hashing

3.2 Use of short search key as indirect

index

3.3 Hashing uniformly distributed data

3.4 Hashing data with other distributions

3.5 Special-purpose hash functions

3.6 Hashing with checksum functions

3.7 Hashing with cryptographic hash

functions

3.8 Audio identification

4 Origins of the term

5 See also

6 Notes

7 References

8 External links

Applications

Hash tables

Wikipedia is sustained by people like you. Please donate today.

Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function

1 of 6 24/7/2008 14:50

Page 2: Hash Function Instruction Count

Hash functions are mostly used in hash tables, to quickly locate a data record (for example, a dictionary

definition) given its search key (the headword). Specifically, the hash function is used to map the search key to

the index of a slot in the table where the corresponding record is supposedly stored.

In general, a hashing function may map several different keys to the same hash value. Therefore, each slot of a

hash table contains (implicitly or explicitly) a set of records, rather than a single record. For this reason, each

slot of a hash table is often called a bucket, and hash values are also called bucket indices.

Thus, the hash function only hints at the record's location — it only tells where one should start looking for it.

Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.

In the Java programming language, for example, hash functions are central to the way the language allows

objects to be stored in hashing collections such as the standard HashMap and HashSet classes. To this end,

Java's Object parent class provides a hashCode() method mandating the generation of a 32-bit integer hash

value.

Finding duplicate records

To find duplicated records in a large unsorted file, one may use a hash function to map each file record to an

index into a table T, and collect in each bucket T[i] a list of the numbers of all records with the same hash value

i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then

be found by scanning every bucket T[i] which contains two or more members, fetching those records, and

comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative

approach (such as sorting the file and comparing all consecutive pairs).

Finding similar records

Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or

pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps

similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table

of T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in

nearby buckets. Then one need only check the records in each bucket T[i] against those in buckets T[i+k] where

k ranges between -m and m.

Finding similar substrings

The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a

document repository or a genomic database. In this case, the input strings are broken into many small pieces,

and a hash function is used to detect potentially equal pieces, as above.

The Rabin-Karp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is

based on the use of hashing to compare strings.

Geometric hashing

This principle is widely used in computer graphics, computational geometry and many other disciplines, to locate

close pairs of points in the plane or in three-dimensional space, similar shapes in a list of shapes, similar images

in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the

hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array

with two or more indices (called a bucket grids), and the hash function returns an index tuple. This special case

of hashing is known as geometric hashing or the grid method.

Properties

Determinism

Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function

2 of 6 24/7/2008 14:50

Page 3: Hash Function Instruction Count

To serve its purpose, a hash function must be fast and deterministic — meaning that two identical or equivalent

inputs must generate the same hash value.

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every

hash value in the output range should be generated with roughly the same probability. The reason for this last

requirement is that the cost of hashing-based methods goes up sharply as the number of collisions — pairs of

inputs that are mapped to the same hash value — increases. Basically, if some hash values are more likely to

occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding

table entries.

For many applications the hash values need only be uniformly distributed, not random in any sense. A good

randomizing function is usually good for hashing, but the converse need not be true. For cryptographic hash

functions, the function must however be randomizing, that is, although deterministic, pragmatically

indistinguishable from a random function.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may

contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the

uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just

for the global set of all possible entries.

In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many

more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should

have more than one or two records. (Ideally, no bucket should have more than one record; but a small number

of collisions is virtually inevitable, even if n is much larger than m (see the birthday paradox).

Variable range

In many applications, the range of hash values may be different for each run of the program, or may change

along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash

function which takes two parameters — the input data z, and the number n of allowed hash values.

Data normalization

In some applications, the input data may contain features that are irrelevant for comparison purposes. When

looking up a personal name, for instance, it may be desirable to ignore the distinction between upper and lower

case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion

being used: that is, any two inputs that are considered equivalent must yield the same hash value.

Continuity

A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as

possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.

Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other

related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables that

use linear search.

Hash function algorithms

The choice of a hashing function depends strongly on the nature of the input data, and their probability

distribution in the intended application.

Injective and perfect hashing

Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function

3 of 6 24/7/2008 14:50

Page 4: Hash Function Instruction Count

The ideal hashing function should be injective — that is, it should map each valid input to a different hash value.

Such a function would directly locate the desired entry in a hash table, without any additional search.

An injective hash function whose range is all integers between 0 and n−1, where n is the number of valid inputs,

is said to be perfect. Besides providing single-step lookup, a perfect hash function also results in a compact hash

table, without any vacant slots.

Unfortunately, injective and perfect hash functions exist only in very few special situations (such as mapping

month names to the integers 0 to 11); and even then they are often too complicated or expensive to be of

practical use. Indeed, hash functions are typically required to map a large set of valid potential inputs to a much

smaller range of hash values and therefore cannot be injective.

Use of short search key as indirect index

In special situations, such as short, one or two byte search keys, the search key itself can be used as the first

(indirect) 'index' to a table of target index values (1-255) - with any gaps in sequence set as null (i.e. index=0) -

denoting an 'invalid' key. Consider the search key to be an 8-bit character containing the alphanumeric

characters A-Z, (both upper and lower case) and the numeric digits 0-9, with other possible values an error. The

'hash table' can simply be a pre-defined, and pre-compiled, static 256 byte table containing a sequential index

(X'00' - 'FF') in the relevant offsets - equating to the search key itself.

If no validation is required, (because the validity of the key is already known), the table need only be as large as

the range of (high-low) keys. For example if the possible ASCII keys were say A thru Z, a table of only 26 bytes

(containing x'00' - x'19') would be needed for the table - providing excellent locality of reference. The

'conversion' from search key to index value can be accomplished in an extremely short fixed time interval - in

fact the time to execute around 5 or 6 machine instructions - or even less in some architectures - irrespective of

the search value. Similar methods can also be applied for 16-bit keys (range 0-65535) with an index range but in

this case requiring a maximum pre-defined table size of around 128K. If the search key range is known however,

the table can often be considerably less. Path length in terms of instruction count is similar to that required for 8

bit keys. Once obtained, the index value can be used for any number of direct indexing including use in branch

tables to determine program logic where appropriate and available in the language being used.

Hashing uniformly distributed data

If the inputs are bounded-length strings (such as telephone numbers, car license plates, invoice numbers, etc.),

and each input may independently occur with uniform probability, then a hash function needs only map roughly

the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0

to N−1, and the output must be an integer h in the range 0 to n−1, where N is much larger than n. Then the hash

function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷ N (the value z scaled down by

n/N and truncated to an integer), or many other formulas.

Hashing data with other distributions

These simple formulas will not do if the input values are not equally likely, or are not independent. For instance,

most patrons of a supermarket will live in the same geographic area, so their telephone numbers are likely to

begin with the same 3 to 4 digits. In that case, if n is 10000 or so, the division formula (z × n) ÷ N, which

depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod n,

which is quite sensitive to the trailing digits, may still yield a fairly even distribution of hash values.

When the data values are long (or variable-length) character strings — such as personal names, web page

addresses, or mail messages — their distribution is usually very uneven, with complicated dependencies. For

example, text in any natural language has highly non-uniform distributions of characters, and character pairs,

very characteristic of the language. For such data, it is prudent to use a hash function that depends on all

characters of the string — and depends on each character in a different way.

Special-purpose hash functions

Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function

4 of 6 24/7/2008 14:50

Page 5: Hash Function Instruction Count

In many such cases, one can design a special-purpose (heuristic) hash function that yield many fewer collisions

than a good general-purpose hash function. For example, suppose that the input data are file names such as

FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function

that extracts the numeric part k of the file name and returns k mod n would be nearly optimal. Needless to say, a

function that is exceptionally good for a specific kind of data may have dismal performance on data with

different distribution.

Hashing with checksum functions

One can obtain good general-purpose hash functions for string data by adapting certain checksum or

fingerprinting algorithms. Some of those algorithms will map arbitrary long string data z, with any typical

real-world distribution — no matter how non-uniform and dependent — to a fixed length bit string, with a fairly

uniform distribution. This string can be interpreted as a binary integer k, and turned into a hash value by the

formula h = k mod n.

This method will produce a fairly even distribution of hash values, as long as the hash range size n is small

compared to the range of the checksum function. Bob Jenkins' LOOKUP3 (http://burtleburtle.net/bob/c

/lookup3.c) algorithm[1] uses a 32-bit checksum. A 64-bit checksum should provide adequate hashing for tables

of any feasible size.

Hashing with cryptographic hash functions

Some cryptographic hash functions, such as MD5, have even stronger uniformity guarantees than checksums or

fingerprints, and thus can provide very good general-purpose hashing functions. However, the uniformity

advantage may be too small to offset their much higher cost.

Audio identification

For audio identification [2] such as finding out whether an MP3 file matches one of a list of known items, one

could use a conventional hash function such as MD5, but this would be very sensitive to highly likely

perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or

changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another

more advanced algorithm is required to find all items that would nonetheless be interpreted as identical to a

human listener. Though they are not common, hashing algorithms do exist that are robust to these minor

differences. There is a service called MusicBrainz which creates a fingerprint for an audio file and matches it to

its online community driven database.

Origins of the term

The term "hash" comes by way of analogy with its standard meaning in the physical world, to "chop and mix".

Donald Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo

dated January 1953, and that Robert Morris used the term in a survey paper in CACM which elevated the term

from technical jargon to formal terminology.[3]

In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then

"mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to

be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular

division.

See also

Universal hashing

Cryptography

Cryptographic hash function

HMAC

Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function

5 of 6 24/7/2008 14:50

Page 6: Hash Function Instruction Count

Geometric hashing

Distributed hash table

Perfect hash function

Linear hash

Rolling hash

Rabin-Karp string search algorithm

Zobrist hashing

Bloom filter

Hash table

Hash list

Hash tree

Coalesced hashing

Transposition table

List of hash functions

Notes

^ In the remainder of this article, the term function is used to refer to algorithms as well as the functions

they compute.

1.

References

^ Jenkins, Bob (September, 1997), Hash Functions, “Algorithm Alley”, Dr. Dobb's Journal, <http://www.ddj.com

/184410284>

1.

^ "Robust Audio Hashing for Content Identification by Jaap Haitsma, Ton Kalker and Job Oostveen"

(http://citeseer.ist.psu.edu/rd/11787382%2C504088%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache

/papers/cs/25861

/http:zSzzSzwww.extra.research.philips.comzSznatlabzSzdownloadzSzaudiofpzSzcbmi01audiohashv1.0.pdf

/haitsma01robust.pdf)

2.

^ Knuth, Donald (1973). The Art of Computer Programming, volume 3, Sorting and Searching, 506-542.3.

External links

General purpose hash function algorithms (C/C++/Pascal/Java/Python/Ruby) (http://www.partow.net

/programming/hashfunctions/index.html)

Hash Functions and Block Ciphers by Bob Jenkins (http://burtleburtle.net/bob/hash/index.html)

Integer Hash Function (http://www.concentric.net/~Ttwang/tech/inthash.htm) by Thomas Wang

The Goulburn Hashing Function (http://www.geocities.com/drone115b/Goulburn06.pdf) by Mayur Patel

Hash Functions (http://www.azillionmonkeys.com/qed/hash.html) by Paul Hsieh

HSH 11/13 (http://herbert.gandraxa.com/herbert/hsh.asp) by Herbert Glarner

Online Char (ASCII), HEX, Binary, Base64, etc... Encoder/Decoder with MD2, MD4, MD5, SHA1+2,

etc. hashing algorithms (http://www.paulschou.com/tools/xlate/)

FNV (http://isthe.com/chongo/tech/comp/fnv/) Fowler, Noll, Vo Hash Function

Retrieved from "http://en.wikipedia.org/wiki/Hash_function"

Categories: Hash functions | Search algorithms | Error detection and correction

Hidden categories: All articles with unsourced statements | Articles with unsourced statements since February

2007

This page was last modified on 16 July 2008, at 11:18.

All text is available under the terms of the GNU Free Documentation License. (See Copyrights for

details.)

Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a U.S. registered 501(c)(3)

tax-deductible nonprofit charity.

Hash function - Wikipedia, the free encyclopedia http://en.wikipedia.org/wiki/Hash_function

6 of 6 24/7/2008 14:50