DASH: hash functions for storage and data management
Great tools may be inappropriate for some tasks
Popular hash functions are often used for tasks they were not designed for
This blunder is often found in storage/data management
Synopsis
“Enterprise storage needs will increase by a factor of seven over the next three years.” (Strategic Research Corporation, 2002)
“An enterprise spends an average of three dollars managing storage for every one dollar spent on storage hardware.” (Gartner, quoted in “Emerging Technology: Keeping Storage Costs Under Control”, Network Magazine, Oct. 5, 2002)
According to a study at the University of California, Berkeley, more new information is predicted to be stored in the next several years than in all of previous recorded human history combined.
“(…) technology improvements in current magnetic and optical data storage systems are saturating, (…) reaching their theoretically achievable storage densities.” (Liz Murphy, Vice President of Marketing, InPhase Technologies)
Storage Management: An increasing need
Given the extreme growth of storage needs, storage management has become a necessity…
… and thus the bread and butter of many emerging IT companies.
Storage management
• Reduce storage space
• Data mining/warehousing
• Reliable and efficient backup of files
• Duplicate detection
• Mirroring/Synchronization
• Compressed representation
• Indexing
Hash functions
Storage Management and Hash Functions
[Figure: long bit-streams mapped to short hash values]
A hash function maps bit-streams to a small number of “short” hash values
What is a Hash Function?
A hash value is essentially a digital fingerprint
What is a Hash Function?
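As a minimal sketch of the "digital fingerprint" idea, the snippet below fingerprints a very short and a much longer bit-stream. SHA-256 from Python's standard library is used only as a readily available stand-in, and the two inputs are made up for illustration.

```python
import hashlib

short_input = b"10"
long_input = b"1011101000101001" * 100_000  # about 1.6 MB of repeated data

for data in (short_input, long_input):
    # The fingerprint has a fixed, small size regardless of the input length.
    fingerprint = hashlib.sha256(data).hexdigest()
    print(len(data), "bytes ->", len(fingerprint) * 4, "bit fingerprint", fingerprint[:16], "...")
```

Whatever the input length, the fingerprint stays the same short size, which is what makes it cheap to store and compare.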
A hash function is an algorithm to automatically create the fingerprints
Hash functions are used in many applications, such as
• Duplicate detection
• Mirroring/Synchronization
• Indexing
• Error detection and error correction
• Privacy/Security
• Many more applications…
But existing hash functions are often used for the wrong purpose, or the scientific community has not yet produced a hash function designed for the purpose at hand
The many uses of Hash Functions
• Most hash functions used in storage management applications are off-the-shelf hash functions that were designed for dissimilar, sometimes contrary, purposes
• The adequacy of these hashes is erroneously assessed, because the assessment is usually based on the probability that two random equiprobable bit-streams hash to the same value. Yet:
  – Bit-streams generated by computer applications are not random, but have a definite statistical or deterministic structure
  – The probability of at least one collision does not scale linearly with the number of files hashed, but roughly with its square (ask your mathematician about the “birthday paradox”)
• Inadequate hash functions may lead to slower processes, greater memory requirements, and sometimes complete disasters (loss of data, data corruption, etc.)
• Given the exponential growth of storage needs, current “good enough” hashes will become inadequate, if not disastrous, in the near future.
The need for better hash functions
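The birthday-paradox point can be made concrete with a small calculation. This is a sketch under the idealised assumption that hash values are uniformly random; the 32-bit hash size and the file counts are hypothetical values chosen for illustration.

```python
import math

def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) among num_files uniformly random hashes."""
    buckets = 2.0 ** hash_bits
    # Birthday approximation: 1 - exp(-n(n-1) / (2m)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-num_files * (num_files - 1) / (2.0 * buckets))

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} files, 32-bit hash: {collision_probability(n, 32):.6f}")
```

While the probability is still small, ten times more files means roughly a hundred times more collision risk, so a hash size that looks safe today can stop being safe after modest growth.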
[Figure: bit-streams mapped to hash values]
• Few bits to represent hash values
• Hash values fast to compute
We want our fingerprints to be small and fast to compute…
Desired properties of all Hash Functions
…But further desired properties diverge depending on what we are using the fingerprints for.
Applied to: all possible bit-streams
Goal: “catch” transmission errors
(e.g. checksum, CRC, Reed-Solomon, etc.)
Error-Detecting and Correcting Hash Functions
1. Fingerprint of data is computed
2. Data and fingerprint are sent together
3. Fingerprint of received data is computed
4. If received and computed fingerprints don’t match, we know an error occurred
[Figure: sent and received bit-streams with their fingerprints; the fingerprints differ because bits were altered in transit]
These hash functions are designed to “catch” transmission errors.
Error-Detecting and Correcting Hash Functions (e.g. checksum, CRC, Reed-Solomon, etc.)
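The four steps of the error-detection scheme can be sketched with CRC-32 from Python's standard library. The `send` and `receive` helpers and the payload are hypothetical names used only for this illustration; no real network is involved.

```python
import zlib

def send(data: bytes) -> tuple[bytes, int]:
    """Steps 1 and 2: compute the fingerprint and ship it with the data."""
    return data, zlib.crc32(data)

def receive(data: bytes, fingerprint: int) -> bool:
    """Steps 3 and 4: recompute the fingerprint and compare it with the one received."""
    return zlib.crc32(data) == fingerprint

payload, crc = send(b"storage block 0042")

corrupted = bytearray(payload)
corrupted[3] ^= 0x01  # simulate a single bit flipped in transit

print("intact copy accepted:   ", receive(payload, crc))
print("corrupted copy accepted:", receive(bytes(corrupted), crc))
```

CRC-32 is guaranteed to catch any single-bit error, which is exactly the kind of property these hashes are designed around. Nothing in that design says how well it tells apart two unrelated files.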
Applied to: all possible bit-streams
Goal: privacy/security
(e.g. MD5, SHA, GOST, RIPEMD, etc.)
Cryptographic Hash Functions
[Figure: bit-streams mapped to short hash values]
These hash functions are intended for security and privacy applications. They are designed so that, given a fingerprint, it is infeasible to create a bit-stream having this fingerprint.
Cryptographic Hash Functions (e.g. MD5, SHA, GOST, RIPEMD, etc.)
Applied to: application-generated bit-streams
Goals of hash functions: “catch” transmission errors; privacy/security; differentiate bit-streams
What storage management applications need: Differentiate bit-streams generated by computer applications
What storage management uses: Off-the-shelf hash functions
Consequence: Less efficiency, less effectiveness, more memory
What storage management needs from Hash Functions
[Figure: bit-streams mapped to short hash values]
differentiation effectiveness ↔ collision probability
Goal in storage management settings: Differentiate bit-streams
What storage management needs from Hash Functions
DASH: Differentiating Application Specific Hash
Hash        Applied to                                        Goal
CRC, etc.   Any type of bit-streams                           Reliable transmission
MD5, etc.   Any type of bit-streams                           Secure encryption
DASH        Bit-streams generated by computer applications    Effective differentiation
[Figure: files → hash values → hash groups → duplicate groups]
Hashes allow duplicate detection processes to group files into “probable duplicates” groups, so that byte-by-byte comparison only has to be carried out on significantly smaller collections of files.
Hash Functions in Duplicate Detection
Low collision probability = efficient duplicate detection
Hash Functions in Duplicate Detection
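The pipeline files → hash values → hash groups → duplicate groups could be sketched as follows. This is an illustrative sketch rather than the presenters' implementation: files are modelled as an in-memory mapping from name to content, SHA-256 stands in for whichever hash is used, and `find_duplicates` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict

def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Return groups of file names whose contents are byte-identical."""
    # Step 1: build hash groups ("probable duplicates").
    hash_groups: dict[bytes, list[str]] = defaultdict(list)
    for name, content in files.items():
        hash_groups[hashlib.sha256(content).digest()].append(name)

    # Step 2: confirm with a byte-by-byte comparison, only inside each hash group.
    duplicate_groups: list[list[str]] = []
    for names in hash_groups.values():
        confirmed: list[list[str]] = []
        for name in names:
            for group in confirmed:
                if files[group[0]] == files[name]:
                    group.append(name)
                    break
            else:
                confirmed.append([name])
        duplicate_groups.extend(g for g in confirmed if len(g) > 1)
    return duplicate_groups

print(find_duplicates({"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world"}))
```

The fewer collisions the hash produces, the smaller the hash groups are, and the less byte-by-byte work remains in step 2.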
[Figure: the master site and the mirror site each compute a hash value; the hash value is transmitted and the two hash values are compared]
If the hash values are different, the files are different, so the master file is transmitted to the mirror for backup.
Hash Functions in Mirroring/Synchronization
If hash values are equal, files are assumed to be equal, so master file is NOT sent for backup.
In this case collision probability must be extremely low, so that the likelihood of not backing up a file that should be backed up is almost nil.
Longer hash values → lower collision probabilities
But… longer hash values → more network load
Hash Functions in Mirroring/Synchronization
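A minimal sketch of the synchronization decision, assuming the mirror sends its hash values to the master (the slides do not say which direction the hash travels). The function name, the in-memory "sites" and SHA-256 are illustrative choices.

```python
import hashlib

def files_to_transmit(master: dict[str, bytes], mirror_hashes: dict[str, bytes]) -> list[str]:
    """Names of master files whose hash differs from (or is missing on) the mirror."""
    to_send = []
    for name, content in master.items():
        local_hash = hashlib.sha256(content).digest()
        # Equal hashes: the files are ASSUMED equal and nothing is sent.
        # Different hashes: the files are certainly different, so the file is sent.
        if mirror_hashes.get(name) != local_hash:
            to_send.append(name)
    return to_send

master = {"report.doc": b"version 2", "logo.png": b"same bytes", "new.txt": b"fresh"}
mirror = {"report.doc": b"version 1", "logo.png": b"same bytes"}
mirror_hashes = {name: hashlib.sha256(data).digest() for name, data in mirror.items()}

print(files_to_transmit(master, mirror_hashes))
```

Only the hash values cross the network for unchanged files, which is why their length trades collision safety against network load.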
[Figure: standard storage. The three bit-streams the user sees are each stored in full]
Hash Functions in Compressed Representations
[Figure: factored storage. The same three bit-streams are stored as a small set of shared sub-streams, each kept only once]
Hash Functions in Compressed Representations
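Factored storage can be sketched as a store that keeps each distinct sub-stream once and records, per file, the list of fingerprints needed to rebuild it. The `FactoredStore` class, the fixed chunk size and the use of SHA-256 are assumptions made for this illustration; the slides do not specify how bit-streams are split.

```python
import hashlib

class FactoredStore:
    """Toy store that keeps each distinct fixed-size chunk only once."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks: dict[bytes, bytes] = {}       # fingerprint -> chunk bytes
        self.recipes: dict[str, list[bytes]] = {}  # file name -> fingerprints

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            key = hashlib.sha256(chunk).digest()
            stored = self.chunks.setdefault(key, chunk)
            # A collision here would silently corrupt data, so verify the bytes.
            if stored != chunk:
                raise ValueError("hash collision between different chunks")
            recipe.append(key)
        self.recipes[name] = recipe

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[key] for key in self.recipes[name])

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())

store = FactoredStore(chunk_size=4)
store.put("a", b"AAAABBBB")
store.put("b", b"AAAACCCC")
print(store.get("a"), store.stored_bytes())
```

Here 16 bytes of user data occupy 12 bytes of chunk storage because the shared chunk is kept once. The explicit byte check shows why collisions matter in this setting: without it, a collision would return the wrong data.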
Hash Functions in Indexing
Type of hash function                           Relation between bit-streams and fingerprints
Error-correcting and detecting hash functions   Similar bit-streams ↔ dissimilar fingerprints
Cryptographic hash functions                    Scramble the relation between bit-streams and fingerprints
Indexing hash functions                         Similar bit-streams ↔ similar fingerprints
When indexing bit-streams using fingerprints with the intent of carrying out information retrieval, we want similar bit-streams to produce similar fingerprints. This is precisely what most customary hash functions avoid.
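The contrast can be illustrated by measuring how many fingerprint bits change when a single byte of the input changes. The `presence_fingerprint` below is a deliberately simple toy invented for this sketch (bit v is set when byte value v occurs in the data); it is not a proposed indexing hash, only the smallest example of a fingerprint under which similar inputs stay close.

```python
import hashlib

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def sha256_fingerprint(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def presence_fingerprint(data: bytes) -> int:
    """Toy 256-bit fingerprint: bit v is set if byte value v occurs in data."""
    fingerprint = 0
    for byte in data:
        fingerprint |= 1 << byte
    return fingerprint

doc_a = b"quarterly storage report 2002"
doc_b = b"quarterly storage report 2003"  # differs from doc_a in one byte

print("SHA-256 bits changed: ", hamming(sha256_fingerprint(doc_a), sha256_fingerprint(doc_b)))
print("presence bits changed:", hamming(presence_fingerprint(doc_a), presence_fingerprint(doc_b)))
```

Changing one byte can alter at most two bits of the toy fingerprint, whereas a cryptographic hash is designed to change about half of its output bits.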
When popular hashes such as checksums, CRC, MD5, or SHA are used for the sole purpose of bit-stream differentiation,
• the hash values are larger, and
• the hash computation load is higher,
than what is necessary and sufficient for the task of differentiating bit-streams.
Further,
• the ACTUAL collision probabilities are higher than the claimed best-case scenario, since bit-streams generated by computer applications are not equiprobable.
Consequences of using customary hashes
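For the simplest customary hash, an additive checksum, the effect of structure is easy to show. The two records below are made-up examples of application-generated data that contain the same bytes in a different order; the example says nothing about how MD5 or SHA behave on such inputs.

```python
def additive_checksum(data: bytes) -> int:
    """8-bit checksum: sum of all bytes modulo 256."""
    return sum(data) % 256

# Two different records built from the same bytes, as structured data often is.
record_a = b"id=12;qty=21"
record_b = b"id=21;qty=12"

print(record_a != record_b)
print(additive_checksum(record_a) == additive_checksum(record_b))
```

Under the equiprobable assumption an 8-bit checksum would collide about once in 256 pairs, yet for records that are permutations of each other it collides every time.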
If all bit-streams are random, or their structure is unknown, “balanced hashes” such as CRC, MD5, SHA, etc. have optimal collision probability. Yet, in this case, faster balanced hashes may be used, which reach the same optimal collision probability.
Adapting hash functions to what they’ll hash
Yet, when dealing with computer generated data, the bit-streams are often not random, but have a given structure specific to the creating application. In this case, the mentioned customary hashes have higher collision probabilities than those inferred by the equiprobable assumption.
[Figure: bit-streams ranging from higher probability to lower probability]
Wasting fingerprints on highly unlikely bit-streams…
… fingerprints which would better be used to differentiate highly likely bit-streams.
Adapting hash functions to what they’ll hash
We would be better off with fingerprints that are adapted to the likelihoods of the bit-streams. These fingerprints would be
• More effective (lower collision probabilities)
• Shorter (smaller hash sizes, taking less space)
• More efficient (faster to compute)
Adapting hash functions to what they’ll hash
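A tiny worked example of adapting fingerprints to likelihoods, under assumptions invented for illustration: four possible bit-streams A to D with known probabilities, and a 1-bit fingerprint. Both hashes below put two bit-streams in each bucket; they differ only in whether the two likely bit-streams share a fingerprint.

```python
from fractions import Fraction
from itertools import permutations

def collision_probability(prob: dict, bucket_of: dict) -> Fraction:
    """P(two independent draws are different bit-streams with the same fingerprint)."""
    return sum(
        prob[x] * prob[y]
        for x, y in permutations(prob, 2)
        if bucket_of[x] == bucket_of[y]
    )

# Hypothetical distribution: A and B are common, C and D are rare.
skewed = {"A": Fraction(2, 5), "B": Fraction(2, 5), "C": Fraction(1, 10), "D": Fraction(1, 10)}
uniform = {name: Fraction(1, 4) for name in skewed}

naive_hash = {"A": 0, "B": 0, "C": 1, "D": 1}    # the two likely streams collide
adapted_hash = {"A": 0, "C": 0, "B": 1, "D": 1}  # the two likely streams are separated

print("equiprobable assumption:", collision_probability(uniform, naive_hash))
print("naive hash, real data:  ", collision_probability(skewed, naive_hash))
print("adapted hash, real data:", collision_probability(skewed, adapted_hash))
```

The equiprobable estimate is 1/4 for either hash. On the skewed data the naive hash actually collides with probability 17/50, while the adapted hash collides with probability 4/25 using the same single fingerprint bit.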
If we were mostly fingerprinting human beings…
… The above would be a better fingerprinting scheme.
An anthropomorphic example
• Allow duplicate detection applications to
  – need less space to store file fingerprints
  – compute fingerprints faster
  – create smaller candidate duplicate groups
  – reduce the time needed to purge the file system of duplicates
• Allow synchronization applications to
  – compute fingerprints faster
  – reduce the required network load
  – reduce the likelihood of not backing up a file that needed to be backed up
Better hashes to…
• Allow bit-stream factoring to
  – avoid data corruption and loss due to collisions
  – pinpoint the most common bit-streams
  – hence reduce storage space throughout the file system
• Allow hashed indexing to
  – reflect the semantic relationship of the bit-streams
  – do so in an efficient manner
Better hashes to…
• Produce general hashes for a large class of file types
• Produce optimal hashes for specific common file types
• Design an application that will collect statistics of files in a given file system, file server, or network, and automatically produce hashes that are optimal for the current specific environment
• Design an application that will automatically produce optimal hash functions having specific parameters and functionality
A few implementation ideas…
• Design a system to safely dispatch new hashes to all components of a given protocol scope
• Research new ways to reduce storage requirements by hashing common bit-streams found in the files of a file system
• Produce new indexing hashes for information retrieval and search engines
A few implementation ideas…