
File authentication

Just as people need authentication, so, too, do the files that are so freely distributed through the Internet. The Web works on a basic underpinning of implicit confidence that everything operates as it should; we generally believe that the files we send and receive are authentic and uncorrupted.

But there is a growing awareness that this is not always true. Files can be altered, whether by accident or malicious intent, and the casual user may be completely unaware of the changes.

When a file is sent as an attachment to an e-mail message, or is retrieved via a web browser, how can the user be sure that the document that is received is, in fact, the document that was sent? It is perfectly plausible that a file may be corrupted in transmission as it moves through the system.

The corruption could be physical, as with a noisy modem line, such that a few characters are changed (this used to be quite common in the days of slow analog phone links). The corruption might be deliberate, such that an order issued to buy stocks or transfer money is modified in some specific manner, such as changing the account number. The file might be replaced altogether, as part of a prank or hoax.

There are numerous standards for how various types of information should be stored and managed, but at a “lower” level there are mechanisms for detecting when bits have been altered.

At the most fundamental level there is the use of “parity bits”. The concept of parity is built into most computer hardware and networking equipment; it is present in your compact disc system as well.

The basic idea is as follows:

• for any given sequence of binary digits (perhaps 7), count the number of ‘1’ bits

• add an extra bit at the end of the sequence

• the value of the extra bit depends on …

–         whether “even” or “odd” parity is desired

–         how many ‘1’ bits already exist

such that if “even” parity is desired and the existing sequence has an odd number of ‘1’ bits, then the extra “parity” bit is set to ‘1’; if the existing sequence already has an even number of ‘1’ bits, then the parity bit is set to ‘0’. (If “odd” parity is desired, then the parity bit values are the opposite of the “even” parity situation.)

Original bit sequence    Parity bit value (even parity)
0101101                  0
1100100                  1
0011010                  1
1010110                  0
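As a minimal sketch of this scheme (in Python, assuming even parity), the parity bit is simply the count of ‘1’ bits taken modulo 2:

    def parity_bit(bits: str, even: bool = True) -> int:
        """Return the parity bit for a string of '0'/'1' characters."""
        ones = bits.count("1")
        bit = ones % 2            # makes the total number of '1' bits even
        return bit if even else 1 - bit

    for seq in ["0101101", "1100100", "0011010", "1010110"]:
        print(seq, parity_bit(seq))   # reproduces the table above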

Thus, the use of parity bits provides an easy and fast mechanism to detect certain types of errors in the storage or transmission of binary data. The value of the parity bit is set initially, and can later be re-examined … if a bit has been flipped, then we can detect a “parity error”.

An unfortunate detail is that while we can detect that one of the bits is “bad”, we don’t know which bit it is. Simple parity schemes allow for error detection, but not error correction.

Parity bits are useful, but not foolproof. For example, suppose two ‘0’ bits get flipped to values of ‘1’. In this case, the original parity bit would still be accurate, and the error would go undetected. Not so good, really, since we would prefer to detect the errors, and, even better, correct them when detected so as to avoid having to re-transmit the original data and thereby waste time and resources.
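A quick sketch of this blind spot, reusing a bit pattern from the table above:

    def parity_bit(bits: str) -> int:
        return bits.count("1") % 2    # even parity

    original  = "0011010"   # three '1' bits -> parity bit 1
    corrupted = "1111010"   # first two '0' bits flipped; five '1' bits
    print(parity_bit(original), parity_bit(corrupted))   # 1 1: error undetected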

The need for this type of error detection and correction in the transmission of data led to the development and use of a technique called the “Cyclic Redundancy Check” (CRC).

The basic idea behind CRC is to extend the use of parity bits by organizing data into small blocks that might be viewed as a two-dimensional array.

A simple CRC strategy would calculate parity for each row, generating a new column of parity bits, and then for each column, generating a new bottom row of parity bits. If any individual bit were altered, then the parity bits in the corresponding row and column would both be wrong, immediately identifying the precise location of the “bad” bit. In this way, the error could be detected and corrected. (Strictly speaking, this row-and-column scheme is a two-dimensional parity check; production CRCs compute their check bits by polynomial division, but the goal of detecting altered bits is the same.)
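The following is a minimal sketch, in Python, of the row-and-column parity scheme just described (a toy illustration, not a production CRC):

    def parity(bits):
        return sum(bits) % 2  # 0 means even parity holds

    def encode(block):
        """Append an even-parity column and parity row to a 2-D bit block."""
        rows = [row + [parity(row)] for row in block]
        rows.append([parity(col) for col in zip(*rows)])
        return rows

    def locate_error(rows):
        """Return (row, col) of a single flipped bit, or None if parity holds."""
        bad_rows = [r for r, row in enumerate(rows) if parity(row)]
        bad_cols = [c for c, col in enumerate(zip(*rows)) if parity(col)]
        if bad_rows and bad_cols:
            return bad_rows[0], bad_cols[0]
        return None

    block = [[0, 1, 0, 1],
             [1, 1, 0, 0],
             [0, 0, 1, 1]]

    coded = encode(block)
    coded[1][2] ^= 1               # flip one bit "in transit"
    print(locate_error(coded))     # -> (1, 2): the exact position of the bad bit

Note that two flipped bits can still confuse this toy scheme; real CRCs are designed to catch much broader classes of errors.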

Parity can be used to detect when bits are modified, but it does not verify that the file received is, in fact, the file that was requested: a file can be perfectly correct by parity and yet have had its contents altered or replaced prior to or during transmission.

The solution to file authentication involves an extra requirement, whereby the files stored at their origin are examined and a special digital “fingerprint” is computed and stored for each document.

When a user requests a particular document, the document is sent, along with an encrypted version of the document’s fingerprint. When the user receives the document and the fingerprint, the receiving software re-computes the fingerprint (using the same algorithm) and then compares the local result with the now-decrypted fingerprint sent from the original location.

If the two fingerprint values match, then the document is determined to be unmodified: its integrity has been validated by the digital fingerprint.
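In code, the receiver-side check might look like the following sketch (Python; the filename document.pdf and the received fingerprint value are hypothetical placeholders, and the decryption step is elided):

    import hashlib

    def fingerprint(path: str) -> str:
        """Compute the MD5 digest ('fingerprint') of a file's contents."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    # Fingerprint that accompanied the document, after decryption
    # (shown here as a placeholder value).
    received_fingerprint = "..."

    if fingerprint("document.pdf") == received_fingerprint:
        print("document unmodified")
    else:
        print("document altered or corrupted")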

You might think that this strategy doesn’t completely solve the problem either: what would stop an interloper from substituting a different document together with a freshly computed fingerprint for it?

The solution to this problem relies on the encryption process: the fingerprint value is protected using public-key cryptography, which ensures with a high degree of confidence that the communication took place between the intended participants and not with some outside interloper.
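As an illustrative sketch of protecting the fingerprint, here is a sign-and-verify round trip using the third-party Python cryptography package; the document bytes are made up, and SHA-256 stands in for MD5 as the digest, since modern signing APIs pair RSA with SHA-2 digests:

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # Origin: hash-and-sign the document with the private key.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    document = b"order: buy 100 shares for account 12345"
    signature = private_key.sign(
        document,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )

    # Receiver: verify against the public key; verify() raises
    # InvalidSignature if the document or signature was tampered with.
    private_key.public_key().verify(
        signature,
        document,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )
    print("fingerprint verified; document is authentic")

Because only the origin holds the private key, a matching signature ties the fingerprint (and hence the document) to the origin, which is exactly the guarantee the plain fingerprint alone could not provide.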

What does a fingerprint look like? It is a number, the output of a function, consisting of a fixed number of binary digits. The technique is generally known as a “message digest function”, whereby a file of any size is processed by a function that yields an output value of that fixed size.

The most famous and widely used message digest function is known as “Message Digest 5” (MD5), developed by Ron Rivest, one of the best known names in the field of security and authentication, and a founder of RSA Security, Inc.

The (MD5) algorithm takes as input a message of arbitrary length and produces as output a 128-bit ‘fingerprint’ or ‘message digest’ of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest.[1]

[1] Request for Comments (RFC) 1321, R. Rivest, p. 1

The significance of the preceding statement is as follows: suppose an MD5 output is calculated for a given file. Changing even one character in the file will (with overwhelming probability) change the MD5 result. Thus, if a file is altered, the alteration is immediately detectable.
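A quick demonstration using Python’s standard-library hashlib (the message text is invented for illustration):

    import hashlib

    a = hashlib.md5(b"transfer $100 to account 12345").hexdigest()
    b = hashlib.md5(b"transfer $100 to account 12346").hexdigest()
    print(a)           # 32 hex digits (128 bits)
    print(b)           # differs radically despite a one-character change
    print(a == b)      # False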

And, just as important, if an adversary happens to know the MD5 value for a particular file, it would be highly improbable (“computationally infeasible”) that the adversary could create a substitute file that would produce the same output value as the original file.

MD5 is an example of a “hash” function: it maps input values (in this case, files of arbitrary length) to a fixed range of output values (in this case, numbers that can be represented in 128 bits). It is therefore possible for two input files to map to the same output value, but the attractiveness of MD5 is that such a collision is highly unlikely to occur by chance and, by design, very hard to engineer deliberately.

It is important to note that MD5 is used to validate the document, but not to protect the document—that would be done by encrypting the entire document, as compared to this example where only the fingerprint value is encrypted.

The MD5 strategy was published as RFC 1321 back in 1992. It is widely known, and it withstood testing and close scrutiny for more than a decade, building confidence that the algorithm was reasonably robust and reliable for the intended purpose of authenticating documents … (Practical MD5 collisions have since been demonstrated, and stronger digests such as the SHA-2 family are now preferred where collision resistance matters, but the message-digest principle is unchanged.)

and (to be discussed shortly) in the generation of “digital signatures” that are used to establish a non-repudiable association between an individual and a particular document.

A detailed treatment of the MD5 system can be found in RFC 1321, “The MD5 Message-Digest Algorithm”, which also includes sample code that demonstrates the algorithm in action.