
Adding Security to Compressed Digital Audio

Joseph Bracken
August 9, 2000

Submitted to Prof. Ruby Lee
Princeton University


INTRODUCTION:

In 1987, Fraunhofer IIS began developing the CD-quality perceptual audio compression scheme later standardized as MPEG-1, Audio Layer 3 (‘MP3’). Combined with increasing network speed and proliferation, this has created opportunities for commercial as well as pirated digital music distribution. A decade later, the volume of unauthorized Internet music distribution far exceeds that of commercial distribution. College students, traditionally avid music consumers, are taking advantage of their schools’ powerful Internet connections and of music swapping software such as Napster to download large amounts of free music. 75% of students on campus claim to have engaged in music piracy over the Internet.[1] The Recording Industry Association of America claims that college students are now purchasing one-third less music.[1] A number of portable MP3 players have been introduced, removing a final barrier to MP3 acceptance.

The music industry’s response has been litigious: it has searched the Internet for illegal distributions of MP3s and has sued the manufacturers of the players and the swapping software. However, the industry recognizes the commercial potential of the medium and is attempting to profit by selling music online.

In order both to provide for-sale music online and to prevent illegal music distribution, the industry requires a secure audio compression format. Most current proposals accomplish this in a sequential manner, i.e. full audio compression followed by encryption, creating a protective shell around the data. These techniques take advantage of cryptographically proven algorithms such as those in [2]. Authorized users possess the key to unlock the audio data; the bitstream is useless to those without the key. The protective shell technique is not specific to audio bitstreams and may be applied to any insecure data.

Any intellectual property protection scheme must relate the strength of protection to the value of the property. Computationally intensive schemes appropriate for financial or military secrets may not be appropriate for music worth a few dollars. Shell techniques recognize this by protecting just a small percentage of the data to minimize decoding computation. This paper proposes an alternative: a method to embed security into the audio compression algorithm itself. It allows a more elegant and efficient solution to the problem of audio protection.

The goals of the encryption scheme are simple:

1. Maintain audio quality for authorized listeners. Any data-hiding method, such as that used in audio watermarking, introduces the possibility of audible distortion.

2. Preserve bitstream efficiency. MP3 is designed for a high compression ratio, and a significant increase in download time and storage space is unacceptable to consumers.

3. Degrade audio for unauthorized listeners in a controlled, predictable manner. A scheme should not necessarily render the audio entirely unusable. A distributor may desire the option of a preview quality, poor enough to deter piracy yet intelligible enough to encourage purchase of the authorized track.

4. Minimize memory and computation increases. This is especially relevant in the decoder as portable device use grows. These portable players are designed for low cost and therefore an algorithm with high computational or memory increases is unacceptable.

5. Ensure applicability to other compression schemes. MP3 is the de facto standard for audio compression on the Internet. To entice paying customers away from piracy, the industry must offer a substantial improvement in performance. A number of schemes, such as MPEG-2 AAC, offer CD-quality music at improved compression ratios.

6. Combine decoding and decryption into one operation. This eliminates the possibility that an attacker, without knowledge of the audio stream, may break the key and leave the data unprotected.

PSYCHOACOUSTICS:

Psychoacoustics is the study of human auditory perception. Perceptual coding of digital audio uses psychoacoustics to model audible responses that guide compression decisions. This section will briefly describe the human auditory system, followed by a more thorough discussion of its cognitive effects.


The ear may be divided into outer, middle, and inner sections. The outer ear collects sound with the pinna. The size of the pinna increases the loudness of the sound and adds directional information. The ear canal connects the pinna to the tympanic membrane (‘the eardrum’), and also increases loudness near the resonant frequencies.

Figure 1: the human ear[3]

The middle ear has three bones (‘the ossicles’) which provide impedance matching for the transmission of sound from the outer air to the inner fluid. Maximal efficiency in transferring energy between systems is achieved when the impedances of those systems are equal. The impedance ratio from outer ear to inner ear is about 4000:1, indicating the need for impedance matching. The ossicles connect the 90mm^2 eardrum to the 3.2mm^2 footplate (‘the oval window’) of the inner ear, a 27:1 ratio. The motion of the ossicles results in a 2:1 reduction in the effective amplitude. The combined transfer ratio is then 54:1, close to the sqrt(4000) = 63 pressure ratio.[4][5]

The inner ear contains the balancing vestibular canals and the cochlea, which performs auditory processing. The fluid-filled cochlea is coiled in 2.5 turns and encased in bone. The motion of the footplate drives longitudinal standing waves down the basilar membrane which runs the length of the inner ear. Changing frequencies change the position of maximum amplitude of the wave. Low frequencies are strongest farthest from the footplate; high frequencies are strongest near it. Frequency discrimination is most sensitive at the low frequencies.[6]

Waves in the fluid stimulate 20,000 hair cells along the basilar membrane. These cells send information to the brain. The phenomenon of resonance means that when a frequency is applied to structures of different periods of vibration, the structure with natural period closest to that frequency will vibrate the most. The resonance theory of hearing suggests that portions of the membrane vibrate selectively in response to different frequencies, i.e. the membrane engages in a Fourier analysis of limited resolution.[5] In the 19th century, Helmholtz suggested two possible explanations for this effect: that the structure of the hair fibers themselves was constructed for selective resonance by tension, or that the tension in the membrane itself or its placement of hairs creates gradations in resonance.[5] Both theories are still debated, though a straightforward resonance theory has been rejected as too simplistic. The reader is referred to [4], [5], [6], or [7] for further discussion.

Loudness is the perceived human response to sound, and is a function of four variables, $L = f(I, F, C, S)$, where I is intensity, F is frequency, C is complexity or bandwidth, and S is the stage of the auditory pipeline.[5] In all further discussion, S is assumed to represent the final cognitive processing.

Intensity is a mathematical quantity defined ambiguously in a number of ways, most commonly as the maximum displacement of the audible signal or the sound pressure level (SPL). Intensity is generally referenced to the minimum audible sound pressure level at 1kHz, referred to as a ‘phon’ and corresponding to an air pressure of 20 µPa.[8] Loudness increases with intensity for fixed F and C; a 10dB increase in SPL results in an approximate doubling of loudness.


Frequency has a nonlinear effect on loudness. The ear perceives frequencies from 20Hz to 20kHz but is most sensitive to those near 3kHz.[5] An absolute threshold of hearing below which sounds are inaudible is that generated by air particles contacting the eardrum.

An example calculation of the absolute threshold as a function of frequency is:

$$T_q(f) = 3.64\,(f/1000)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^2} + 10^{-3}\,(f/1000)^4 \quad \text{(dB)}$$ [8]

with f in Hz.
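As a rough illustration, the threshold formula from [8] can be evaluated directly. The function name `absolute_threshold_db` is chosen here for illustration only; frequencies are assumed to be given in Hz.

```python
import math

def absolute_threshold_db(f_hz):
    """Approximate absolute threshold of hearing (dB) at frequency f_hz (Hz)."""
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

# The curve dips near 3.3 kHz, where the ear is most sensitive,
# and rises steeply toward both ends of the audible range.
for f in (100, 1000, 3300, 10000):
    print(f, "Hz:", round(absolute_threshold_db(f), 2), "dB")
```

The minimum near 3.3kHz agrees with the statement above that the ear is most sensitive near 3kHz.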

A graph of equal loudness curves for varying frequency and intensity is shown below, including the absolute threshold.

Fig 2: equal loudness curves[10]

The discussions of intensity and frequency thus far have implied evaluation of single tones. Real audio combines tones into complex sounds over finite bandwidths. These bandwidths may affect loudness. For example, a tone at 1kHz has approximately the same loudness as a bandwidth of up to 160Hz, centered at 1kHz, possessing the same intensity. A bandwidth of greater than 160Hz, however, increases the loudness. The hair cells are divided into critical bands stimulated together, each about 1.3mm long but corresponding to unequal frequency bands.[3] Far from the footplate, at low frequencies, each critical band reacts to just a 100Hz range; close to the footplate, the largest critical band may span 4kHz. As long as the bandwidth of the signal is confined to one critical band, the same cells react and loudness is constant. When the bandwidth expands into adjacent critical bands, more cells react and the loudness increases. The critical bandwidth surrounding arbitrary frequency f is given by

$$BW_c(f) = 25 + 75\,\left[1 + 1.4\,(f/1000)^2\right]^{0.69} \quad \text{(Hz)}$$ [8]
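A short sketch of the critical bandwidth formula follows, assuming f is given in Hz; the function name `critical_bandwidth_hz` is illustrative. It reproduces the two figures quoted in the text: about 100Hz at low frequencies and roughly 160Hz around 1kHz.

```python
def critical_bandwidth_hz(f_hz):
    """Critical bandwidth (Hz) around centre frequency f_hz (Hz), per [8]."""
    khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * khz ** 2) ** 0.69

for f in (50, 1000, 5000, 15000):
    print(f, "Hz ->", round(critical_bandwidth_hz(f), 1), "Hz wide")
```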

Although critical bands are transitory, it is convenient for psychoacoustic analysis to define fixed bands. A typical table of twenty-five bands spanning the human auditory range is shown below.


Fig 3: Critical band division [6]

Tones raise the audible threshold in the nearby frequency region. Frequency masking occurs when one tone raises the threshold of a second tone. Two sounds close in frequency compete for the same hair cells in the cochlea. A significantly stronger tone dominates the response of those cells. A diagram of masking is shown below.

Figure 4 shows a large tone masking a smaller tone of close frequency. Because the smaller tone is above the absolute threshold shown and would have been audible if solitary, it is said to be completely masked. Masking is greatest at the masker’s frequency and decreases slowly at frequencies higher than that of the masker. The effect decreases more rapidly for frequencies lower than that of the masker. Masking is not confined within critical bandwidths. Narrowband, tone-like maskers have less effect than broader, noise-like sounds.

Just as masking occurs in the frequency domain, it occurs in the time domain. Forward masking refers to the end of a sound masking a weaker tone that begins sometime later. Recently activated hair cells require a finite time, on the order of milliseconds, to recover sensitivity. Backward masking refers to the beginning of a tone masking a weaker tone ending sometime earlier and is due to integration of signals over time in the nervous system. A picture of time masking is shown below. Note that forward masking is much stronger.

Fig 4: frequency masking [3]


Fig 5: temporal masking [3]

PERCEPTUAL ENCODING:

The psychoacoustic phenomena discussed in the previous section provide guidelines for construction of a perceptual encoder to compress audio data. A general diagram of perceptual encoding is provided below.

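Fig 6: block diagram of perceptual encoding (blocks: Audio Signal, Filter Bank, Psychoacoustic Modeling, Quantization & Coding, Compressed Output)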

The MPEG audio compression algorithm will be discussed here, with particular attention to MPEG-1 Audio Layer III (‘MP3’). A more thorough discussion may be found in [11].

MPEG, like all perceptual encoders, is a lossy compression algorithm. It exploits auditory masking, eliminating inaudible information and allowing masked quantization noise. A general block diagram of the MPEG encoding process, from subband filtering to creation of the final bitstream, is shown in figure 7.

Fig 7: A generic MPEG encoder[12]

INPUT: The algorithm takes input 16 bit PCM data sampled at 32kHz, 44.1kHz, or 48kHz. One or two audio channels are supported in MPEG-1; more are supported in subsequent versions.

SUBBAND FILTERING & MDCT: MP3 audio compression begins with a polyphase filter bank. 32 PCM samples are shifted into a 512-sample FIFO buffer. A filter convolution is performed as follows:

$$s_t[i] = \sum_{n=0}^{511} x[t-n]\,H_i[n], \qquad H_i[n] = h[n]\,\cos\left[\frac{(2i+1)(n-16)\,\pi}{64}\right]$$ [11]

h[n] corresponds to the low pass filter response defined in the standard and H_i[n] is a modulation of that low pass filter into the appropriate frequency band. s_t[i] represents the filter output sample for subband i. 32 band pass filters are created, each with a bandwidth of pi/32T and with center frequencies at k*pi/64T, k=1,3,5,…. Filter outputs are subsampled by 32, as shown in figure 7, so that for each 32 input samples, a total across all filters of 32 output samples is generated.

A more efficient implementation is:

$$s_t[i] = \sum_{k=0}^{63}\sum_{j=0}^{7} M[i][k]\,\bigl(C[k+64j]\;x_t[k+64j]\bigr)$$

where C[n] are filter coefficients defined in the standard and $M[i][k] = \cos\left[\frac{(2i+1)(k-16)\,\pi}{64}\right]$ are the analysis coefficients. [11]
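The equivalence of the direct and efficient forms can be checked numerically. The sketch below rests on several assumptions: the prototype filter is a placeholder window, not the coefficient table from the standard; the FIFO is indexed so that `fifo[n]` holds x[t-n]; and C[n] is taken to be h[n] with its sign flipped on every odd block of 64 samples, which is what makes the cosine periodic enough to fold 512 terms into 64. All function names are illustrative.

```python
import math

NUM_SUBBANDS = 32
FIFO_LEN = 512

# Analysis matrix M[i][k] = cos((2i + 1)(k - 16) * pi / 64).
M = [[math.cos((2 * i + 1) * (k - 16) * math.pi / 64) for k in range(64)]
     for i in range(NUM_SUBBANDS)]

def placeholder_prototype():
    """Stand-in low-pass prototype h[n]; NOT the table from the standard."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * (n + 0.5) / FIFO_LEN)
            for n in range(FIFO_LEN)]

def subbands_direct(fifo, h):
    """Direct form: 32 x 512 multiply-accumulates per block of 32 inputs."""
    return [sum(fifo[n] * h[n] * math.cos((2 * i + 1) * (n - 16) * math.pi / 64)
                for n in range(FIFO_LEN))
            for i in range(NUM_SUBBANDS)]

def subbands_fast(fifo, c):
    """Efficient form: window, fold 512 values into 64, then a 32 x 64 product."""
    folded = [sum(c[k + 64 * j] * fifo[k + 64 * j] for j in range(8))
              for k in range(64)]
    return [sum(M[i][k] * folded[k] for k in range(64))
            for i in range(NUM_SUBBANDS)]

h = placeholder_prototype()
# Assumed relation between the two coefficient sets: sign flips on odd 64-blocks.
c = [(-1) ** (n // 64) * h[n] for n in range(FIFO_LEN)]
fifo = [math.sin(0.05 * n) + 0.3 * math.cos(0.9 * n) for n in range(FIFO_LEN)]

direct = subbands_direct(fifo, h)
fast = subbands_fast(fifo, c)
max_diff = max(abs(a - b) for a, b in zip(direct, fast))
print("largest difference between the two forms:", max_diff)
```

The saving comes from performing the 512 window multiplications once, instead of once per subband.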

MP3 provides further frequency discrimination with time domain aliasing cancellation by inputting the subband samples to a lossless Modified Discrete Cosine Transform (MDCT). Two different block lengths are used; a long length (18 samples) allows greater frequency resolution, while a short length (6 samples) allows greater time resolution. Each MDCT frame uses all long, all short, or mixed modes. In mixed mode, lower frequencies use long blocks and higher frequencies use short blocks. Because of a 50% overlap between transform windows, the window length is either 36 or 12 samples. Any errors introduced by transform are cancelled when successive blocks of the inverse transform are added.
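The aliasing cancellation can be demonstrated with a small, unoptimized MDCT. This is a sketch under stated assumptions: a sine window is applied at both analysis and synthesis, the inverse transform is scaled by 2/N so that overlapped blocks sum to the input, and only the long block size (18 coefficients from a 36-sample window) is shown. Function names are illustrative and the code is not the standard's reference implementation.

```python
import math

def sine_window(n_half):
    """Window of length 2*n_half whose squared halves sum to one."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half)) for n in range(2 * n_half)]

def mdct(block, window):
    """2N windowed time samples -> N coefficients (critically sampled)."""
    n_half = len(block) // 2
    return [sum(window[n] * block[n] *
                math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
                for n in range(2 * n_half))
            for k in range(n_half)]

def imdct(coeffs, window):
    """N coefficients -> 2N windowed, time-aliased samples."""
    n_half = len(coeffs)
    return [2.0 / n_half * window[n] *
            sum(coeffs[k] *
                math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
                for k in range(n_half))
            for n in range(2 * n_half)]

def analyze_synthesize(signal, n_half):
    """Transform 50%-overlapped blocks and overlap-add the inverses."""
    window = sine_window(n_half)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n_half + 1, n_half):
        rec = imdct(mdct(signal[start:start + 2 * n_half], window), window)
        for n, value in enumerate(rec):
            out[start + n] += value
    return out

N_HALF = 18  # long block: 18 coefficients from a 36-sample window
signal = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(8 * N_HALF)]
rebuilt = analyze_synthesize(signal, N_HALF)
# The first and last half-blocks are covered by only one window, so compare the interior.
interior_error = max(abs(a - b) for a, b in
                     zip(signal[N_HALF:-N_HALF], rebuilt[N_HALF:-N_HALF]))
print("interior reconstruction error:", interior_error)
```

Each individual inverse block contains time-domain aliasing; the error vanishes only after adjacent blocks are added, which is the cancellation described above.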

A word is necessary on the data sizes. Each set of 32 input samples resulted in 32 output samples, one at each output of the critically sampled polyphase filters. Samples from each filter are blocked into sets of either 6 or 18 and input to an equal number of MDCTs, which are again critically sampled. This results in 32*18=576 MDCT output samples for long blocks or 32*6=192 MDCT coefficients for short blocks. In the latter case, output samples are blocked into sets of three. Therefore, the entire process is critically sampled. For 576 input samples, 576 MDCT coefficients are produced. A group of 576 samples is referred to as a granule, and each frame contains two granules. Therefore, each frame contains 1152 samples.

PSYCHOACOUSTIC MODEL: A psychoacoustic model determines minimum masking thresholds for inaudibility. The MPEG standard allows flexibility in implementation of the psychoacoustic model, but suggests two outlines, one in particular suitable for MP3. The MP3 recommended model uses the results of one psychoacoustic analysis per frame. It begins by converting time samples into a frequency domain representation. It Hann windows the data and submits it to a 1024 sample FFT. The resulting frequency coefficients are grouped into critical bands. Each critical band is defined to span one ‘bark’. The bark spectrum may be calculated by:

$$x = 13\arctan(0.76\,f) + 3.5\arctan\left[(f/7.5)^2\right] \quad \text{(bark)}$$ [8]

with f in kHz.
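A minimal sketch of the frequency-to-bark mapping is given below. It assumes the usual form of this formula with f expressed in kHz and a 7.5kHz normalization in the second term; the function name `hz_to_bark` is illustrative.

```python
import math

def hz_to_bark(f_hz):
    """Map a frequency in Hz onto the bark (critical band rate) scale."""
    khz = f_hz / 1000.0
    return 13.0 * math.atan(0.76 * khz) + 3.5 * math.atan((khz / 7.5) ** 2)

# The audible range maps onto roughly 25 barks, matching the 25-band table.
for f in (100, 1000, 4000, 20000):
    print(f, "Hz ->", round(hz_to_bark(f), 2), "bark")
```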

Because tonal and non-tonal components have different masking effects, it is necessary to calculate a tonality index for each perceptual value. The index is a Spectral Flatness Measure, and is based on a measure of predictability such as:

$$SFM = 10\log_{10}(GM/AM)$$ [8]

where GM and AM are the geometric and arithmetic means of the components in each band. A low SFM indicates high predictability and therefore high tonality.
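The measure is easy to compute for a band of spectral power values. The sketch below assumes strictly positive inputs, since the geometric mean is taken through logarithms; the function name is illustrative.

```python
import math

def spectral_flatness_db(power):
    """SFM in dB for one band: 10*log10(geometric mean / arithmetic mean)."""
    n = len(power)
    geometric = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic = sum(power) / n
    return 10.0 * math.log10(geometric / arithmetic)

noise_like = [2.0] * 8                  # flat spectrum
tone_like = [1e3, 1e-3, 1e-3, 1e-3]     # one dominant component
print(spectral_flatness_db(noise_like), spectral_flatness_db(tone_like))
```

A flat, noise-like band gives 0dB, while a band dominated by one component gives a strongly negative value, i.e. high tonality.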

The energy inside each critical band is summed and the data is now considered to be in the perceptual domain. This reduces the number of data points, compacting the high frequency quanta more than the low frequency. Next a spreading function is applied. The effect of masking across critical bands is computed, before considering tonality, as:

$$SF(x) = 15.81 + 7.5\,(x + 0.474) - 17.5\sqrt{1 + (x + 0.474)^2} \quad \text{(dB)}$$ [8]
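The spreading function can be tabulated as follows. The sketch assumes x is the separation in barks between the masked band and the masker, positive toward higher frequencies; the function name is illustrative.

```python
import math

def spreading_db(x_bark):
    """Spreading function value (dB) at a separation of x_bark from the masker."""
    s = x_bark + 0.474
    return 15.81 + 7.5 * s - 17.5 * math.sqrt(1.0 + s * s)

for x in (-3, -2, -1, 0, 1, 2, 3):
    print(x, round(spreading_db(x), 2))
```

The values fall off more steeply for negative x than for positive x, consistent with masking extending further toward frequencies above the masker.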

The revised critical band spectrum C is calculated by convolving the original bark spectrum with the SF values. The tonality is then computed in an offset such as:

$$O_i = \alpha\,(14.5 + i) + (1 - \alpha)\,5.5 \quad \text{(dB)}$$ [8]

and revises the perceptual spectrum as $T_i = 10^{\log_{10}(C_i) - (O_i/10)}$. [8]


The result Ti is the threshold of each critical region i.

The final masking value for each subband is chosen as the maximum of Ti, adjusted for time masking, and the constant absolute threshold. Finally, the model calculates the signal-to-mask ratio for each critical band, the ratio of signal energy within that critical band to the masking threshold. Those values are passed to the quantization/bit allocation routine.
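The offset and threshold steps can be sketched as below. The sketch assumes alpha is a tonality coefficient between 0 (noise-like) and 1 (tone-like), that the spread band energy and the absolute threshold are expressed in the same linear units, and it omits the time masking adjustment. All names are illustrative.

```python
import math

def masking_offset_db(alpha, band_index):
    """Offset O_i in dB: larger for tone-like maskers (alpha near 1)."""
    return alpha * (14.5 + band_index) + (1.0 - alpha) * 5.5

def band_threshold(spread_energy, alpha, band_index):
    """T_i = 10^(log10(C_i) - O_i / 10) for spread band energy C_i."""
    offset = masking_offset_db(alpha, band_index)
    return 10.0 ** (math.log10(spread_energy) - offset / 10.0)

def final_threshold(spread_energy, alpha, band_index, absolute_threshold):
    """Masking value: the larger of T_i and the absolute threshold."""
    return max(band_threshold(spread_energy, alpha, band_index), absolute_threshold)

def signal_to_mask_ratio_db(signal_energy, threshold):
    """SMR in dB passed on to the bit allocation routine."""
    return 10.0 * math.log10(signal_energy / threshold)

t = final_threshold(1000.0, 0.0, 3, 1.0)
print(round(t, 2), round(signal_to_mask_ratio_db(1000.0, t), 2))
```

For a purely noise-like masker the threshold sits 5.5dB below the spread energy; for a purely tonal masker in band i it sits 14.5 + i dB below.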

QUANTIZATION: Figure 8 gives a graphical description of the bit allocation process. This determines the number of bits used to encode each frequency coefficient based on noise allocation. Encoding with fewer bits, i.e. less resolution, yields greater quantization noise which must be kept beneath the masking threshold. After discarding inaudible high frequency coefficients, the quantization process uses two loops: the outer assigns quantization bits to each audible signal and calculates the resulting noise. If any subbands have noise exceeding the masking threshold, the inner loop decreases the quantizer step size for that band and the quantization bits are recounted. The process repeats until the quantization distortion in every critical band is within its allowed level.

Fig 8: Perceptual Quantization using SMR [3]

BITSTREAM FORMATTING: The 576 quantized frequency coefficients in each granule are ordered by increasing frequency. In the case of short blocks in the MDCT, the samples are ordered first by frequency and then chronologically. These samples are broken into partitions in preparation for lossless entropy coding. A picture of the division is shown in figure 9.

Fig 9: MP3 frequency coefficient arrangement


First, from the highest frequency, there should be a long run of all zero coefficients because high frequencies contain little audible information. This region is not transmitted; its size is deduced from that of the other regions. Next, there should be a run of low amplitude coefficients, only -1, 0, or 1. These coefficients are grouped by fours and encoded using one of two Huffman tables designated by a flag in the header. This section is labeled ‘count1’. The remaining coefficients, ‘bigvalues’, are subdivided into three regions, each using a different Huffman table. These tables are chosen from among thirty possibilities, based on their maximum quantized value and the statistics of the signal, and are designated by flags in the header. Coefficients are grouped into pairs and used to index the tables.
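The division into regions can be sketched as a scan from the highest frequency downward. This is a simplified illustration, not the encoder's actual routine: it assumes the zero run is trimmed in pairs and the count1 region in groups of four, and it does not split bigvalues into its three sub-regions. The function name and the sample data are hypothetical.

```python
def partition_granule(coeffs):
    """Return (bigvalues_end, count1_end) boundaries for one granule.

    coeffs[:bigvalues_end]            -> 'bigvalues', coded in pairs
    coeffs[bigvalues_end:count1_end]  -> 'count1', values in {-1, 0, 1}, coded in fours
    coeffs[count1_end:]               -> all zero, not transmitted
    """
    end = len(coeffs)
    # Trim the trailing zero run two coefficients at a time.
    while end >= 2 and coeffs[end - 1] == 0 and coeffs[end - 2] == 0:
        end -= 2
    count1_end = end
    # Extend the count1 region downward while whole quadruples stay within +/-1.
    while end >= 4 and all(abs(v) <= 1 for v in coeffs[end - 4:end]):
        end -= 4
    return end, count1_end

granule = [12, -7, 5, 3, 2, -2] + [1, 0, -1, 0, 0, 1, 1, -1] + [0] * 562
bigvalues_end, count1_end = partition_granule(granule)
print(bigvalues_end, count1_end, len(granule) - count1_end)
```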

A high-level picture of an MP3 bitstream is shown in figure 10. Each frame contains a header and relevant side information. This provides instructions to the decoder on scale factor formats, data sizes, and the location of the beginning of main data. In MP3, main data may begin before or after the header. This allows the main data sizes of each frame to vary while the headers remain at regular intervals. The main data of each frame contains the scale factors and Huffman coefficients for both granules in the frame.

Fig 10: an MP3 bitstream [11]

THE FREQUENCY HOPPING ALGORITHM:

This audio security proposal is based on a concept from wireless communications. In tactical systems, military communications rely on a frequency hopping method in which the frequency at which the data are modulated varies pseudo-randomly in time. This defeats jammers and eavesdroppers. (Imagine trying to tune a radio station that randomly changes its frequency many times a second.) The hopping algorithm, although classified, is presumed known by the enemy. What is not known is the key which controls the hopping pattern.

Fig 11: Frequency hopping in wireless communications [13]


This idea can be used to secure audio data. In MP3, information is transmitted for each of 576 distinct frequency bands. A pseudo-random permutation of frequency coefficients across the human auditory spectrum provides adequate security. The permutation may be large enough to make a brute force attack computationally infeasible and need not even be time varying, as suggested in [14]. An unauthorized user decompressing the file hears each signal in the wrong frequency band. This gives the music distributor the ability to degrade the signal in a predictable manner for unauthorized listeners. Authorized listeners possessing the hopping pattern may decompress the signal with perfect fidelity.

A demonstration of the permutation is shown in figure 12. It represents a permutation of 10 Huffman coefficients in the encoder and the repermutation in the decoder.
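The ten-coefficient example of figure 12 can be reproduced in a few lines. The sketch uses letters in place of Huffman coefficients and the same permutation pattern as the figure; the function names `permute` and `depermute` are illustrative.

```python
def permute(coeffs, perm):
    """Encoder side: position i carries Coeff[perm[i]]."""
    return [coeffs[p] for p in perm]

def depermute(received, perm):
    """Decoder side: Output[perm[i]] = Rec[i]."""
    output = [None] * len(received)
    for i, p in enumerate(perm):
        output[p] = received[i]
    return output

PERM = [5, 6, 1, 8, 0, 4, 9, 2, 3, 7]   # the hopping pattern (the key)
initial = list("ABCDEFGHIJ")

sent = permute(initial, PERM)            # what an unauthorized decoder sees
restored = depermute(sent, PERM)         # what an authorized decoder recovers
print("".join(sent), "".join(restored))
```

The decoder performs no arithmetic: it only stores each decoded value at the index given by the pattern.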

Index   Initial    Permutation   Encoder Permuted   Decoder Received   Decoder Depermuted
        Sequence   Pattern       Sequence           Sequence           Sequence
i       Coeff[i]   perm[i]       Coeff[perm[i]]     Rec[i]             Output[perm[i]]=Rec[i]
0       A          5             F                  F                  A
1       B          6             G                  G                  B
2       C          1             B                  B                  C
3       D          8             I                  I                  D
4       E          0             A                  A                  E
5       F          4             E                  E                  F
6       G          9             J                  J                  G
7       H          2             C                  C                  H
8       I          3             D                  D                  I
9       J          7             H                  H                  J

Fig 12: Permuting ten Huffman coefficients

This permutation maintains audio quality for authorized users, yielding an identical post-decoder signal, and adds nothing but possible key data to the compressed bitstream. It minimizes memory and computational demands and degrades audio quality predictably, as will be seen in subsequent sections. Because all perceptual compression algorithms use some subband or transform coding, it is directly applicable to any format. Finally, it makes decryption and at least partial decoding inseparable.

LOADING:


Decoding the hopping algorithm is as fast as or faster than decoding traditional means of encryption. Deformatting N Huffman coefficients in an unprotected bitstream requires N repetitions of reading in a coefficient, Huffman decoding it, and storing the resulting two frequency coefficients, each one byte, in an output array. The hopping algorithm, assuming all permutations are within the same Huffman tables, performs the same N repetitions of reading from the bitstream and Huffman decoding, but each time looks up the next permutation value and stores the two frequency coefficients at the array location specified by that value. In total, the algorithm adds just the N memory reads of the permutation values. Because the permutation values are read sequentially, any cache prefetching scheme should minimize these memory access times. No arithmetic operations are required.

The above operations are repeated every granule. For example, hopping N=40 Huffman coefficients (allowing 40!, roughly 8 x 10^47, possible permutations) with a 44.1kHz input signal (76.5 granules/sec) requires an additional 76.5*40=3060 memory accesses per second. N is altered as security and loading goals dictate.
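The figures in this example follow from simple arithmetic, sketched below. The constant names are illustrative; the key space is expressed both as a count and as an equivalent key length in bits.

```python
import math

SAMPLE_RATE_HZ = 44100
SAMPLES_PER_GRANULE = 576
N_HOPPED = 40

keyspace = math.factorial(N_HOPPED)            # number of distinct hopping patterns
keyspace_bits = math.log2(keyspace)            # equivalent key length in bits
granules_per_second = SAMPLE_RATE_HZ / SAMPLES_PER_GRANULE
extra_reads_per_second = granules_per_second * N_HOPPED

print(f"{keyspace:.3e} patterns, about {keyspace_bits:.0f} bits")
print(f"{granules_per_second:.2f} granules/sec, "
      f"{extra_reads_per_second:.0f} extra memory reads/sec")
```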

In the case of permuting across Huffman tables, a condition statement must be made on each permutation value to determine the Huffman table in use.

Total memory requirements are minimally increased: the only addition is storage of the permutation pattern. This requires up to 288 entries for a full permutation, but upper frequencies contain little information and therefore are rarely encoded and never hopped. The pattern may safely be reduced to no more than 256 entries, each then requiring one byte. If specifying that no more than N coefficients will be permuted, only N key values need to be stored. In total, the algorithm adds just N bytes of additional memory demands.

This may be compared to a traditional cryptographic algorithm such as the fast stream cipher RC4 performed on the compressed bitstream. Discounting s-box initialization, for each byte the following code is performed before Huffman decoding can begin:

    i = (i+1) % 256
    j = (j+S[i]) % 256
    swap (S[i],S[j])
    t = (S[i]+S[j]) % 256
    output byte = input byte XOR S[t]    [2]

This requires the following operations per byte: 3 adds, 1 register move, three memory reads, two memory writes, and an XOR. In addition, each byte must be read from the bitstream buffer before decryption and written back to it after because the Huffman decoding routine does not operate on whole bytes. Full decryption of even a mono 64kBit/second channel would then require 57,000 memory accesses and 33,000 register/register operations per second. Using a slower and more secure algorithm with rounds such as DES may take ten times that load. Partial encryption of the bitstream may suffice and can be chosen for any desired speed. MMP literature suggests a requirement of encrypting just 8 bytes / kByte (a rate of less than one byte per granule) using DES[15], which requires more than twice the decoding load of the hypothetical 40 coefficient permutation above.
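For comparison, a complete RC4 routine, including the s-box initialization discounted above, can be sketched as follows. It is included only to make the per-byte work concrete; RC4 is no longer regarded as a secure cipher. The function names are illustrative.

```python
def rc4_init(key):
    """Key schedule: build the 256-byte state S from the key bytes."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s

def rc4_crypt(key, data):
    """Encrypt or decrypt data (the operation is its own inverse)."""
    s = rc4_init(key)
    i = j = 0
    out = bytearray()
    for byte in data:
        # These five steps are the per-byte work listed in the text.
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        t = (s[i] + s[j]) % 256
        out.append(byte ^ s[t])
    return bytes(out)

ciphertext = rc4_crypt(b"Key", b"Plaintext")
print(ciphertext.hex(), rc4_crypt(b"Key", ciphertext))
```

Unlike the permutation lookup, every protected byte passes through this loop before Huffman decoding can start.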

Cryptographic algorithms consume more memory than the hopping algorithm. RC4 requires 256 bytes for the state and 256 bytes for the key. To save permanent storage, RC4 may alternatively recompute state bytes each time decryption is performed, but this adds 1300 register/register operations and 1300 memory accesses per call. A software implementation of DES, even bypassing initial and final permutations, requires 700 bytes of permanent storage in addition to working memory.

<TBD official time comparison in code here.>

AUDIO QUALITY EVALUATION:

A strength of the frequency hopping algorithm is predictable degradation of quality in the protected signal. Traditionally, audio quality is measured as signal to noise ratio (SNR) or total harmonic distortion (THD). These measures are misleading when applied to perceptually encoded audio, which intentionally removes inaudible signal and introduces masked, inaudible quantization noise. Perceptual encoders can achieve transparent quality over a broad range of SNRs. A more accurate perceptual quality measurement is required.

Subjective listening tests are the most common way to evaluate the effectiveness of audio coders. The test uses only pre- and post-screened expert listeners, known as ‘golden ears’. ITU-R recommendation BS.1116 specifies a listening environment, a test procedure, and a performance scale. Listeners are presented with three signals: the first is a reference, the second and third contain the reference and the coded signal in random order.[16] Listeners must identify the encoded signal and grade its quality on the 41 point scale shown below.

ITU-R 5-grade impairment scale

Severity of Degradation          ITU-R Grade
Imperceptible                    5.0
Perceptible, but not annoying    4.0-4.9
Slightly annoying                3.0-3.9
Annoying                         2.0-2.9
Very annoying                    1.0-1.9

Fig 13: Subjective impairment measure [17]

Subjective listening tests are considered the most reliable evaluation of coder quality. Nevertheless, they may be influenced by factors such as presentation and agreement on weighting spectral discrepancies, and the reliability of ‘golden ears’ has been questioned in [8] and elsewhere. These concerns, as well as ease and cost, have fueled research into objective perceptual quality measurement.

Telecommunications companies wanting a quick way to evaluate online transmission quality have performed much of the research. British Telecom developed PAMS, the Perceptual Analysis Measurement System, to identify audible errors introduced by a system. A diagram of the system is shown in the figure below.

Fig 14: Perceptual Analysis Measurement System [18]

The ITU has adopted standard ITU-R BS.1387, but improved schemes are being developed. BS.1387 incorporates seven separate audio quality models. [19] Most follow an outline similar to the one described in figure 14. A degraded signal and its reference are input to models of the auditory path and a difference signal is taken. That difference signal is then input to a cognitive model that predicts the perceived audibility of the differences. The auditory path model begins with a time-to-frequency transformation which is then attenuated by a frequency dependent nonlinear function modeling the outer and middle ear. This output is integrated into critical bands and convolved with a spreading function to model the frequency response of the basilar membrane in the inner ear. A number of options exist for the perceptual/cognitive model. Similarity of the degraded and reference signals may be estimated, probability of detection of the errors may be calculated, or properties of the error surface may be computed directly.[20]

For this research, a simplified version of the auditory model above is implemented. Both the protected bitstream and the reference bitstream are dequantized but not yet resynthesized into their output audio files. This provides output data already in the frequency domain. The coefficients are grouped into twenty-five critical bands and integrated to provide the total intensity in each. The frequency spread function shown in the figure below is applied to each critical band, and masked data from each subband is discarded. Finally, an intrinsic frequency-dependent energy is added to each critical band where the audible signal either does not exist or has been masked. This ensures that inaudible critical bands are equal between the two signals and do not contribute to error surface calculations. Equations for the described actions are provided below.

Fig 15: Implemented spreading function (SMR plotted across critical bands Cb n-1, Cb n, Cb n+1; labeled slopes: -10dB/bark and 25dB/bark)

Requantization: ‘is’ are the Huffman decoded frequency coefficients, ‘xr’ are the dequantized coefficients.[9]

$$xr(i) = is(i)^{4/3} \cdot 2^{\,0.25\,(global\_gain - 64 - 8 \cdot subband\_gain)} \cdot 2^{\,-0.25\,\left(2\,(1 + scalefac\_scale) \cdot scalefac + 2 \cdot preflag \cdot (1 + scalefac\_scale) \cdot pretab\right)}$$

Computing the bark spectrum: ‘xr’ are the dequantized coefficients, ‘Bi’ are the bark spectrum coefficients.

$$B_i = \sum_{n=lb_i}^{hb_i} xr[n]$$

where lb_i and hb_i are the lower and upper bounds of critical band i.

Computing the energy masking effects: ‘Bi’ are the bark coefficients, ‘Ci’ are masking thresholds. Discard Bj if Bj < Cj, where Cj is defined as

$$C_i = B_i * SF_i$$

with Bi converted to dB as $90.302 + 10\log_{10}(B_i^2)$.

Computing the absolute threshold: enter the center frequencies of the critical bands in figure 3 into the absolute threshold equation Tq(f) given in the PSYCHOACOUSTICS section.

Twenty-five critical band values are then collected for each granule over the length of the input for each signal, and the two sets of points are differenced. This yields a three dimensional (time x bark x error) error surface such as those shown in figure 16. A useful cognitive analysis, proposed by [20], is derived from work on adaptive transform coding of video images by Mester and Franke[21]. Mester and Franke suggest calculating two quantities to measure subjective error effects in an image: error surface activity and error surface entropy. Activity is simply the volume of error in the surface. By itself the measure is inadequate; uniformly distributed error may have little subjective impact yet create as much error as a highly significant concentrated error. The second quantity, entropy, is introduced to gauge this concentration. Entropy in a probabilistic context relates to the dispersion of a probability distribution; in the context of a surface it indicates the degree of energy spread. An error surface has low entropy if its energy is clustered into only a few coefficients; subjective significance of error is then inversely related to error entropy. Note that neither of these quantities concerns the displacement of energy; that is, a component moving to an adjacent critical band or to one far away contributes equally to these distortion measures. A third quantity, correlation of error with the reference signal, accounts for delay and echo. The three quantities are computed as follows:

$$E_a = \sum_{i=1}^{N}\sum_{j=1}^{M} |c(i,j)|$$ [21]

$$E_e = -\sum_{i=1}^{N}\sum_{j=1}^{M} a(i,j)\,\ln a(i,j), \qquad a(i,j) = |c(i,j)| / E_a$$ [21]


$$R_{i+N-1} = \sum_{k=0}^{N-1} c_{k+i}\,B_k, \qquad i = (-N+1), \ldots, (N-1)$$ [20]

where c = the error surface, N = time instances in granules, M = critical bands in bark, Ea = error activity, Ee = error entropy, B = the reference signal intensity surface, and R as shown is the cross-correlation of the error sequence with the reference sequence in time for constant bark.
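A minimal sketch of the three quantities follows. It assumes the error surface is stored as a list of rows (time instances) of per-band error values with at least one nonzero entry, treats zero-error cells as contributing nothing to the entropy, and takes out-of-range terms of the cross-correlation as zero. The lag convention of `cross_correlation` is an assumption, and all names are illustrative.

```python
import math

def error_activity(c):
    """Ea: total absolute error over the (time x bark) surface."""
    return sum(abs(v) for row in c for v in row)

def error_entropy(c):
    """Ee: entropy of the normalized error distribution a = |c| / Ea."""
    ea = error_activity(c)
    total = 0.0
    for row in c:
        for v in row:
            a = abs(v) / ea
            if a > 0.0:
                total -= a * math.log(a)
    return total

def cross_correlation(c_band, b_band):
    """R over lags -(N-1)..(N-1) for one critical band, stored at index lag + N - 1."""
    n = len(c_band)
    return [sum(c_band[k + lag] * b_band[k]
                for k in range(n) if 0 <= k + lag < n)
            for lag in range(-(n - 1), n)]

spread_out = [[1.0] * 5 for _ in range(4)]       # error evenly distributed
concentrated = [[0.0] * 5 for _ in range(4)]
concentrated[2][3] = 20.0                        # same activity in a single cell

for surface in (spread_out, concentrated):
    print(error_activity(surface), round(error_entropy(surface), 3))
```

Both surfaces have the same activity, but the concentrated one has zero entropy, which is the case the text identifies as subjectively more significant.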


[Figure 16 panels: Jazz sample 1; 175kHz tone]

Fig 16: Sound samples and their error surfaces from distortion generated by permuting all bigvalues within Huffman tables

Hollier and Cosier [22] propose a method of predicting subjective performance from these data by relating the error activity and entropy to the ITU subjective scale. Their equation was calibrated from subjective experimentation with human voice at error activities above 200, and, as Mester and Franke expect, these data become unreliable for low activity and entropy. Units on the ITU scale are MOS (mean opinion score).

$$MOS = 4 \cdot \left[ \exp(w) / (1 + \exp(w)) \right] + 1,$$

$$w = -132.1834 + 0.0030718\,E_a - 0.0000143\,E_a^2 + 50.752538\,E_e - 6.459106\,E_e^2 + 0.2723012\,E_e^3$$

A more general combination is of the form: $Distortion = a_1 \log E_a + a_2 E_e^{-1}$. What follows is a brief chart of the error parameters calculated for two samples of music:

Input file    | Permutation method            | Average error activity | Error entropy | Sound quality rank | Audible distortion estimate, a1=a2
Jazz sample 1 | Permute across Huffman tables | 14.8                   | 5.8           | 3rd                | 1.343
Jazz sample 1 | Permute within Huffman tables | 11.1                   | 5.6           | 2nd                | 1.224
Jazz sample 1 | Permute just region 2         | 0.7                    | 4.6           | 1st                | 0.06
175 kHz Tone  | Permute within Huffman Tables | 9.0                    | 3.8           | 2nd                | 1.217
175 kHz Tone  | Permute just region 2         | 0.1                    | 1.6           | 1st                | 0

Fig 17: Calculated error parameters for different frequency permutations. Error fragments measuring 1/3 second are displayed, and parameters are averaged across fragments.
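The audible distortion estimates in Fig 17 can be checked with a short Python sketch. It assumes the general combination has the form a1*log10(Ea) + a2/Ee with a1 = a2 = 1, which reproduces the tabulated values to the printed precision. Clamping negative results to zero is an additional assumption, made only because the last row of the chart reports 0. The function name and the row list are hypothetical.

```python
import math

def distortion_estimate(ea, ee, a1=1.0, a2=1.0):
    """a1*log10(Ea) + a2/Ee, clamped at zero (the clamp is an assumption)."""
    raw = a1 * math.log10(ea) + a2 / ee
    return max(raw, 0.0)

# (input file, permutation method, Ea, Ee, tabulated estimate) from Fig 17
rows = [
    ("Jazz sample 1", "across Huffman tables", 14.8, 5.8, 1.343),
    ("Jazz sample 1", "within Huffman tables", 11.1, 5.6, 1.224),
    ("Jazz sample 1", "just region 2",          0.7, 4.6, 0.06),
    ("175 kHz Tone",  "within Huffman tables",  9.0, 3.8, 1.217),
    ("175 kHz Tone",  "just region 2",          0.1, 1.6, 0.0),
]

for name, method, ea, ee, tabulated in rows:
    print(f"{name:14s} {method:22s} {distortion_estimate(ea, ee):.3f} (chart: {tabulated})")
```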

PERMUTATION VARIATIONS:

All discussions thus far have implied permutation of the Huffman coefficients. There are two reasons for avoiding either frequency encryption or scale factor permutation or encryption. Frequency encryption, in addition to protecting a smaller percentage of the bitstream per unit processing time, renders that bitstream invalid. This is unattractive to music distributors enticing customers with degraded audio tracks. It also results in unpredictable distortion on players that discard or bypass the inevitable errors in Huffman decoding.

Scalefactor protection initially offers a fast way to provide significant degradation, since much signal strength information is contained in just a few bytes. Scalefactor protection, however, either encrypted or permuted, yields unpredictable audio output. Because each scalefactor multiplies the relevant frequency coefficient (in MP3, coefficients are actually multiplied by two raised to a power that is a function of the scalefactor), manipulation of scaling could even cause damage to audio equipment or human hearing if not limited.[14] Experimentation with scalefactor protection produced unpredictable results; one key provides little audible distortion while a different key provides intense distortion.

Frequency hopping options are available to the music distributor, according to the audio degradation and computational complexity desired. Encrypting lower frequencies (region 0 coefficients) only, where most of the signal energy is likely to be concentrated, provides strong error activity with the least computational complexity. Smoother degradations are found in higher frequencies (region 1 and 2) but require more permutations to generate entropy. Target MOSs are useful: MOS of no higher than 4.0 (‘perceptible but not annoying’) and no lower than 2.0 (‘annoying’) should be sought for preview quality. To achieve this range, entropy should be between 7 and 10 with activity above 50. Manipulations within fixed N coefficients are faster than those within regions because a constant permutation pattern is used.

GENERATING PSEUDO-RANDOM NUMBERS:

The hopping algorithm relies on a secure permutation pattern. The pattern is implemented using a pseudo-random number generator (PRNG) in which the key is the seed. The PRNG output should be uniformly distributed along the length of the permutation (i.e. between 1 and 288), meaning that each value is equally likely to appear.

The simplest and fastest random number generator is called ‘linear-congruential’ and constructed as $X_{i+1} = (A X_i + B) \,\%\, M$. The sequence repeats over one period; A, B, and M, the multiplier, increment, and modulus, respectively, are chosen to maximize the period, which cannot exceed M. For more details see [23]. Most compiler-provided random number generators use a linear congruential sequence. However, successive entries in the series are correlated and low order bits may have poor random properties.
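A linear congruential generator and its weak low-order bits can be sketched in a few lines of Python. The default constants below (A = 1664525, B = 1013904223, M = 2^32) are the 'quick and dirty' values given in Numerical Recipes and are used purely for illustration; the function names and the toy modulus of 16 are hypothetical and not part of the implementation described here.

```python
from itertools import islice

def lcg(seed, a=1664525, b=1013904223, m=2**32):
    """Yield X_{i+1} = (a*X_i + b) % m indefinitely, starting from seed."""
    x = seed % m
    while True:
        x = (a * x + b) % m
        yield x

def cycle_length(seed, a, b, m):
    """Length of the cycle eventually entered from seed (at most m)."""
    seen = {}
    x, step = seed % m, 0
    while x not in seen:
        seen[x] = step
        x = (a * x + b) % m
        step += 1
    return step - seen[x]

# A toy generator with M = 16 reaches the maximal period of 16.
print(cycle_length(0, a=5, b=3, m=16))

# The least significant bit of the 32-bit generator simply alternates.
low_bits = [x & 1 for x in islice(lcg(1), 8)]
print(low_bits)
```

With an odd multiplier and odd increment modulo a power of two, the lowest bit flips on every step, which is one concrete form of the poor low-order randomness noted above.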

Feedback shift register (FSR) generators are more suitable for cryptographic applications. An FSR is a recurrence function on a shift register, the next output being the least significant entry and the next input being the result of the function. A simple bitwise operation on a register of N bits has $2^N - 1$ states, discounting the repeating all-zero state, and therefore a maximal period of $2^N - 1$. For maximal periodicity, the characteristic polynomial of the recurrence must be primitive mod 2. A number q is primitive to prime number p if any integer between 1 and p-1 may be expressed as $q^k \,\%\, p$, where k < p-1. A primitive polynomial of degree n is irreducible and divides $x^{2^n - 1} + 1$ but not $x^d + 1$ for any smaller d that divides the number $2^n - 1$. Generating primitive polynomials is slow, and FSR generators are often slow in software.[2]
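The effect of a primitive characteristic polynomial on the period can be seen in a small Python sketch of a 4-bit linear feedback shift register. The register width, tap positions, and function names are illustrative assumptions: taps (0, 1) correspond to the primitive polynomial x^4 + x + 1, while taps (0, 2) correspond to x^4 + x^2 + 1, which factors as (x^2 + x + 1)^2 and is therefore not primitive.

```python
def lfsr_step(state, taps, nbits):
    """Shift right by one; the new top bit is the XOR of the tapped bits."""
    feedback = 0
    for t in taps:
        feedback ^= (state >> t) & 1
    return (state >> 1) | (feedback << (nbits - 1))

def lfsr_period(taps, nbits, seed=1):
    """Steps until the register returns to seed.

    Bit 0 must be among the taps, which makes the update invertible,
    so every non-zero seed lies on a cycle.
    """
    state, steps = seed, 0
    while True:
        state = lfsr_step(state, taps, nbits)
        steps += 1
        if state == seed:
            return steps

print(lfsr_period((0, 1), 4))  # primitive polynomial: maximal period 2**4 - 1
print(lfsr_period((0, 2), 4))  # reducible polynomial: a much shorter cycle
```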

A solution achieving both better randomness and speed is presented by Twisted Generalized FSRs. A promising example is the Mersenne Twister. The Mersenne Twister is a Generalized FSR multiplying the output by a non-identity matrix A and dividing the shift register between two entries. The fundamental linear recurrence is given by $x_{k+n} = x_{k+m} \oplus (x_k^u \,|\, x_{k+1}^l) A$ [24] operating on a sequence of words of length w. n is the degree of the recurrence, r is an integer dividing the words, $x_k^u$ is the upper w-r bits of $x_k$, $x_{k+1}^l$ is the lower r bits of $x_{k+1}$ and A is a sparse matrix for fast multiplication. The output is multiplied by a ‘tempering matrix’ which improves the k-distribution, a spectral measure of randomness. K-distribution to v-bit accuracy means that each of the $2^{kv}$ possible combinations of leading bits occurs equally often in a period. [24]

The result of the generator is a 623-distribution and a period of $2^{19937} - 1$, i.e. for practical purposes it never repeats. It accomplishes this with a fast test of polynomial primitivity and the use of a Mersenne prime. A Mersenne prime is of the form $2^p - 1$ where p is prime. The algorithm is just 5% slower than rand() and uses a working memory of just 624 words. [24]

CONSTRUCTING THE SECURE PERMUTATION:

Uniform pseudo-random number generators such as the Mersenne Twister provide the appearance of randomness suitable for non-secure applications such as Monte Carlo simulations. The output is not cryptographically secure, however, since it can be reduced to a linear recurring sequence. In the case of the Mersenne Twister, multiplying by the inverse of the constant tempering matrix recovers the fundamental recurrence, although that recurrence is non-trivial. An attacker with the recurrence and the output sequence may be able to deduce the current state.
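The invertibility of the tempering step can be demonstrated with a short Python sketch. It assumes the 32-bit MT19937 variant and uses the tempering constants of its reference implementation; it also relies on the fact that CPython's `random` module is built on that generator. The helper names and the seed 2000 are hypothetical. After observing 624 consecutive raw outputs, the sketch undoes the tempering and loads the result as the state of a second generator.

```python
import random

MASK32 = 0xFFFFFFFF

def temper(y):
    """MT19937 output tempering (reference-implementation constants)."""
    y ^= y >> 11
    y ^= (y << 7) & 0x9D2C5680
    y ^= (y << 15) & 0xEFC60000
    y ^= y >> 18
    return y & MASK32

def undo_right(y, shift):
    """Invert x ^= x >> shift by fixed-point iteration from the top bits down."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x

def undo_left(y, shift, mask):
    """Invert x ^= (x << shift) & mask, recovering bits from the bottom up."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x & MASK32

def untemper(y):
    """Apply the inverse tempering steps in reverse order."""
    y = undo_right(y, 18)
    y = undo_left(y, 15, 0xEFC60000)
    y = undo_left(y, 7, 0x9D2C5680)
    return undo_right(y, 11)

def clone(outputs):
    """Rebuild a generator from 624 consecutive 32-bit outputs."""
    state = tuple(untemper(o) for o in outputs) + (624,)
    copy = random.Random()
    copy.setstate((3, state, None))
    return copy

victim = random.Random(2000)
observed = [victim.getrandbits(32) for _ in range(624)]
attacker = clone(observed)
predicted = [attacker.getrandbits(32) for _ in range(5)]
actual = [victim.getrandbits(32) for _ in range(5)]
print(predicted == actual)
```

The sketch needs a full window of 624 unmodified outputs, which is the quantity an attacker is denied when only a short, hashed fraction of the sequence is ever exposed.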

A simple hash function maps a possibly large key into one of a finite number of slots. A one-way hash function as used in cryptography operates on an arbitrarily long message and returns a fixed length value of h bits, i.e. $2^h$ slots. [25] The proper solution to secure the output of a PRNG is conversion through a cryptographically secure one-way hash function, denying the attacker the linear recurring sequence.

The brevity of the hopping algorithm makes this secure step unnecessary. N ≤ 288 random numbers need to be generated, each simply hashed into the slots remaining to be filled. Let N = 40, meaning that a permutation array 40 slots long is to be generated. The first random number is simply hashed to a value between 1 and 40. The result is removed from consideration, for example by removing the component of a linked list, and the second random number is hashed to a value between 1 and 39. No fraction of these random numbers is sufficient to identify the state of the linear recursion and hence the remaining values. A secure hashing algorithm provides additional security not warranted by the value of the data. Nevertheless, the algorithm is implemented with both simple and secure MD5 hashing algorithms. Note that the simple hash is not a simple modulo as might be expected. For added security at almost no computational cost, the PRNG output is right-shifted by a byte before the modulo divide in order to discard the least random least significant bits. MD5 is selected for the secure hash because it is faster than the more secure SHA signature.
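The simple-hash construction described above might look as follows in Python. This is a sketch under stated assumptions rather than the implementation prepared for this research: it uses CPython's `random.Random` (a Mersenne Twister) seeded with the key, 32-bit draws, zero-based slot numbers in place of 1 to N, and a Python list in place of the linked list. The function names and the key 1234 are hypothetical.

```python
import random

def build_permutation(key, n):
    """Return a key-dependent permutation of range(n)."""
    prng = random.Random(key)          # the key is the seed
    remaining = list(range(n))         # slots still to be filled
    pattern = []
    for _ in range(n):
        draw = prng.getrandbits(32) >> 8   # discard the low-order byte
        slot = draw % len(remaining)       # hash into the remaining slots
        pattern.append(remaining.pop(slot))
    return pattern

def permute(coeffs, pattern):
    """Scramble: output position i takes the coefficient at pattern[i]."""
    return [coeffs[p] for p in pattern]

def depermute(scrambled, pattern):
    """Restore the original order given the same pattern."""
    restored = [None] * len(pattern)
    for dst, src in enumerate(pattern):
        restored[src] = scrambled[dst]
    return restored

pattern = build_permutation(key=1234, n=40)
coeffs = list(range(100, 140))
scrambled = permute(coeffs, pattern)
print(depermute(scrambled, pattern) == coeffs)
```

Because the encoder and decoder derive the same pattern from the same key, only the key needs to be shared; the pattern itself is never transmitted.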

The implementation prepared for this research generates just one set of random numbers at the beginning of encoding or decoding, using the key as the seed. This set of numbers becomes the permutation pattern as shown in figure 12. The permutation pattern is constant for the length of the track. The total time of the permutation in the decoder, without the MD5 hash, is less than one percent of the total decoding time of a one minute music sample. Allamanche and Herre [14] suggest running the PRNG throughout the encoding/decoding process, mapping each output to a specific manipulation. This time-varying permutation remains closer to the original concept of frequency hopping, but its use may be unacceptable for a number of reasons. If a permutation array is periodically generated as described in the preceding paragraphs, a high update frequency is a computational burden. If PRNG outputs are simply hashed directly to permutation arrays stored in memory, the speed is acceptable but the memory requirement is untenable for the portable devices for which this algorithm is designed. In addition, prespecified permutations greatly reduce the workload of a brute force attack. Time varying permutations would on average have no effect on the activity and entropy distortion measures used above, and an attacker could make reasonable guesses based on signal strength of frequencies corresponding across the permutation update.

SECURITY:

If the security of the length N permutation array is assured, an attacker demanding perfect reconstruction is faced with N! possible permutations, large enough to discourage a well-funded brute force attack. Note that because the Huffman tables have finite lengths of ≤ 256 entries, the number of distinct permutations in some cases falls short of N!, since multiple coefficients are guaranteed to have the same value.

An attacker finds less than perfect reconstruction an easier task. Because psychoacoustic modeling integrates within critical bands, a reasonable aim that agrees with experimentation is placing all coefficients into their original critical band. The workload is then permutation dependent. For example, assume that the N lowest frequency coefficients are permuted. For N ≤ 30 at 44.1kHz, one can approximate N/3 equal width critical bands. There are 3!(N-3)! possible permutations to restore the coefficients in just the lowest critical band, and (3!)^(N/3) possible permutations to restore all coefficients to the proper critical bands. Out of N! total permutations, then, (3!)^(N/3) will restore the data. Generalizing, a brute force attack for acceptable rather than perfect reconstruction seeks one of $\prod_{i=1}^{CB} (len_i)!$ possible reconstruction sequences, where CB is the number of critical bands subject to hopping and $len_i$ is the number of frequency coefficients in critical band i. A brute force search for an acceptable reconstruction has an order of N!/X^N (in the equal width example, X = (3!)^(1/3), approximately 1.82), still computationally infeasible for large N.

Most discussion in this report has suggested permuting N coefficients with little reference to Huffman table region. The music distributor must decide between permuting across Huffman tables or within based on their security and marketing goals. Permuting within tables ensures bitstream validity even without the permutation pattern in the decoder; because no Huffman coefficient is the prefix of another in the same table, the N coefficients remain unambiguous. Permuting across tables requires a slightly larger workload for correct depermutation, as discussed in the section on loading. Without the key, however, the Huffman decoding will err on the first coefficient from the wrong table and on all subsequent coefficients. Errors then become as unpredictable as if encrypted. The attacker faces a much more difficult task, but the property of preview quality has been denied.
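The brute force workload for acceptable reconstruction, as counted in the critical band argument of this section, can be evaluated with a few lines of Python. The sketch assumes the equal width example of N = 30 coefficients in ten critical bands of three coefficients each; the function names are hypothetical.

```python
import math

def acceptable_count(band_lengths):
    """Permutations that leave every coefficient in its own critical band."""
    return math.prod(math.factorial(n) for n in band_lengths)

def search_ratio(band_lengths):
    """Total permutations N! divided by the number of acceptable ones."""
    n = sum(band_lengths)
    return math.factorial(n) // acceptable_count(band_lengths)

bands = [3] * 10                      # N = 30, ten equal-width bands
print(acceptable_count(bands))        # (3!)**10
print(f"{search_ratio(bands):.3e}")   # candidates per acceptable permutation
```

Even with the relaxed goal, the ratio for N = 30 is on the order of 10^24, which supports the observation that the search remains infeasible for large N.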

Permuting across Huffman tables facilitates permutation of a constant N coefficients. Huffman regions change size among granules, and permuting within requires that each granule divide the key among regions. Alternatively, the distributor may permute regions precisely, i.e. just region 1 is permuted each granule, but this requires storage of a full permutation pattern for a worst-case region size and adds bound checking to each key entry read. In addition, certain permutation patterns leave security lapses. For example, on some granules region 2 does not exist. Permutation of just region 2 then leaves an unprotected granule from which an attacker can match each frequency coefficient with a similar one in an adjacent protected granule, thereby forming an estimate of the hopping pattern. In this situation, time varying permutations are recommended.

APPLICATIONS:

The music industry recognizes the potential of commercial online distribution and the corresponding importance of protecting their intellectual property rights. A number of companies are working to make MP3 and other coding schemes into profitable media. Recognizing that MP3 has no inherent security features, they have created secure envelopes that know the terms of purchase. These companies provide the wrappers that protect rights to the data and the commerce systems to execute the transactions.

Efficient Internet music commerce will rely on an open accepted standard for distribution. The Recording Industry Association of America and its European and Japanese counterparts have announced the creation of the Secure Digital Music Initiative, a forum of companies and organizations collaborating on an open anti-piracy distribution system for PCs and portable devices.[26] SDMI is currently evaluating proposals for the screening technology. SDMI’s Digital Music Access Technology (DMAT) will need compliance from the manufacturers and software providers. DMAT technology is expected to offer the following:
- DMAT encourages free preview of music.
- DMAT devices allow users to play unprotected music they already have as well as DMAT files.
- DMAT devices do not play pirated DMAT files.
- DMAT permits audio watermarking to track unauthorized distribution.
- DMAT allows portable devices to play content from multiple DMAT-compliant vendors. [26]
The technology descriptions that follow adhere to the above properties.

Other multimedia standardization efforts are active, including the Open Platform Initiative for Multimedia Access (OPIMA) and MPEG Intellectual Property Management and Protection (IPMP). OPIMA standardizes access control and content management, primarily for broadcast technology. [27] MPEG-IPMP associates intellectual property rights information to audio and visual objects and systems in MPEG-4.[28] It does not standardize the specification of those rights; it specifies the interface to IP rights. It attaches IP Descriptors and IP Elementary Streams to objects, treating them as any other Descriptor and Elementary Stream constructs. IPMP-System includes public key encryption-decryption control, a block cipher, cryptographic hash functions, and key management protocol. [28]

A brief description of some of the major companies operating Internet music distribution systems and their security technologies follows:

MCY - MCY licenses MMP technology from Fraunhofer IIS and calls it NETrax. MMP ciphers parts of the content with secure ciphering algorithms such as DES. It adds headers at the beginning of and throughout the file to provide key and algorithm information. Every download is digitally encrypted to prevent unauthorized duplication. MCY has a software system that tracks sales and forwards the information to the rights owner. Before playing a NETrax file or loading it onto a player, the software verifies ownership of the track.[29]

DESKGATE – Deskgate makes VIAcommerce applications and VIAmusic for secure MP3. It uses a proprietary security technology called DGX based on Blowfish.

INTERTRUST – Intertrust develops the secure MP3 format MP3Plus. It uses a wrapper called a Digibox, with technology provided by ASPSecure. The InterRights point is the consumer’s PC, which does rights management processing. It is a database that stores user’s rights, identities, transactions, etc. Only an InterRights point can open a Digibox. Rules such as pricing, playing, copying, etc. are set by the provider and contained in the Digibox. The user’s PC, which must be online, connects to a transaction authority which collects the payments and additional information. Intertrust is working with Fraunhofer IIS to develop the audio-specific details. [30]

AT&T – Provides a2b music, which is a DES encrypted MPEG AAC audio file. AT&T and its Policymaker distribution mechanism have been very influential in the industry, beginning with the paper [31]. Policymaker is to be used with a standard cryptographic container, created by generating a random key, cryptographically hashing the music, block ciphering the data, and signing the hash with the private key of the owner using RSA public/private keying. The keys and containers are stored in a database and upon purchase downloaded jointly. Policymaker then verifies the credentials of anyone accessing the music. It inputs an access policy, a set of licenses, and a requested action and returns a possibly conditional success or failure. Policymaker may be built into applications or may be a separate service. It allows complicated access rights such as copy N times or play for N days. [32]

Frequency hopping is suitable for any of the distribution methodologies discussed here. A reference and patent search reveals a number of predecessors to the audio security suggested here, beginning with the 1942 frequency hopping patent of Hedy Lamarr and George Antheil. Some patents on keyed signal rearrangement technology are [33], [34], [35], and [36]. Patents appropriate for intellectual property protection are [37], an example of software copy prevention, [38], accessing encrypted data on local media using a networked key, [39], the secure transfer of programs, and [40], a copy management system. [41] provides a pseudorandom secure keystream generator, and [42] presents the most recent patented method for encoding and encrypting an audio signal.¹ In February 2000, Fraunhofer IIS published a paper [14] discussed earlier, proposing an audio scrambling technology similar to that discussed here, using bitwise XORs and/or word length permutations.

CONCLUSIONS:

This paper presents frequency hopping, an elegant and efficient method to preserve intellectual property rights in digital music. Audio quality and compression ratio are maintained with insignificant computational increase in the encoder and decoder. This scheme allows on-line retailers to safely distribute music to authorized customers. Anticipating future industry trends towards portable MP3 players, it is designed for quick, low memory decryption. The protection scheme is designed for open access to preview quality music; a Huffman table-by-table permutation may be read by any decoder, but at reduced audio quality. Permuting across tables renders the bitstream unreadable and protects more content per unit time than any cryptographic algorithm. For large N, a brute-force attack on frequency hopping is computationally infeasible.

1 Patent search was limited to the United States Patent Office


REFERENCES:

1. Gehr, Richard “The MP3 Era”, Time Digital 2000
2. Schneier, Bruce Applied Cryptography 1996
3. Pohlmann, Ken Principles of Digital Audio 1995
4. Buser, Pierre and Imbert, Michel Audition 1992
5. Stevens, Stanley and Davis, Hallowell Hearing: Its Psychology and Physiology 1983
6. Tobias, Jerry Foundations of Modern Auditory Theory 1970
7. Warren, Richard Auditory Perception 1982
8. Painter, Ted and Spanias, Andreas “Perceptual Coding of Digital Audio” Proceedings of the IEEE, April 2000, vol 88
9. International Organization for Standardization “MPEG-1 Coding of Moving Picture and Audio” 1993
10. Fletcher, H and Munson, W “Loudness, Its Definition, Measurement, and Calculation” Journal of the Acoustical Society of America 1937
11. Pan, Davis “A Tutorial on MPEG/Audio Compression” IEEE Transactions on Multimedia, 1995 number 2, vol 2
12. Bhaskaran, Vasudev and Konstantinides, Konstantinos Image and Video Compression Standards: Algorithms and Architectures 2nd Ed. 1997
13. The Research Dept. Canterbury Christ Church College, University of Kent
14. Allamanche, Eric and Herre, Jurgen “Secure Delivery of Compressed Audio by Compatible Bitstream Scrambling” 108th Audio Engineering Society Convention February 2000
15. Rump, Niels “Copyright Protection of Multimedia Data: The Multimedia Protection Protocol” 19th International Convention on Sound Design, 1997
16. Bosi, Marina “Perceptual Audio Coding” IEEE Signal Processing Magazine, September 1997, number 5 vol 14
17. ITU. Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems. Technical Report, 1994
18. Rix, A, Bourret, A and Hollier, M “Models of Human Perception” British Telecom Technology Journal, January 1999, number 1 vol 17
19. ITU. ITU-R recommendation BS.1387
20. Hollier, M, Hawksford, M, and Guard, D “Error Activity and Error Entropy as a Measure of Psychoacoustic Significance in the Perceptual Domain”, IEE Proceedings-Visual, Image, Signal Processing June 1994 number 3 vol 141
21. Mester, R and Franke, “Spectral Entropy-Activity Classification in Adaptive Transform Coding” IEEE Journal on Selected Areas of Communications June 1992
22. Hollier, M and Cosier, G “Assessing Human Perception” BT Tech J January 1996, number 1 vol 14
23. Press, W, Teukolsky, S, Vetterling, W, Flannery, B Numerical Recipes in C 2nd Ed 1997
24. Matsumoto, Makoto, and Nishimura, Takuji “Mersenne Twister: A 623-dimensionally Equidistributed Uniform Pseudorandom Number Generator” ACM Transactions on Modeling and Computer Simulations 1998
25. Cormen, T, Leiserson, C, and Rivest, R Introduction to Algorithms 1999
26. www.riaa.com
27. OPIMA. OPIMA Specification version 1.1 June 2000
28. Lacy, J, Rump, N, and Kudumakis, P “MPEG-4 Intellectual Property Management & Protection Overview and Applications” ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio MPEG98 December 1998
29. www.mcy.com
30. www.intertrust.com
31. Blaze, M, Feigenbaum, J and Lacy, J “Decentralized Trust Management” IEEE Conference on Security and Privacy May 1996
32. Lacy, J, Snyder, J and Maher, D “Music on the Internet and the Intellectual Property Protection Problem” Proceedings of the International Symposium on Industrial Electronics July 1997
33. Jayant, et. al. “Uniform Permutation Privacy System” Bell Telephone Laboratories, US Patent, July 11, 1978
34. Davidov “Pay TV scrambling by audio encryption” Oak Industries Inc. US Patent, Aug 18, 1987
35. Davio, et. al. “Cryptographic system and process and its application” US Philips Corp, US Patent, March 19, 1991
36. Scheidt et. al. “Cryptographic communication process and apparatus” TecSec Inc. US Patent, June 13, 2000
37. Curran, et. al. “Software protection methods and apparatus” General Computer Corporation, US Patent, June 25, 1985
38. Mages, et. al. “Method of secure server control of local media via a trigger through a network for instant local access of encrypted data on local media” US Patent, April 6, 1999
39. Katz, et. al. “Digital information library and delivery system with logic for generating files targeted to the playback device” Audible, Inc. US Patent, July 20, 1999
40. Warren et. al. “Multimedia copy management system” Solana Technology Development Corp. US Patent, October 5, 1999
41. Clark “Anti-spoof without error extension” ITT Corp. US Patent, Mar 31, 1998
42. Tsutsui “Methods and apparatus for encoding, decoding, encrypting, and decrypting an audio signal, recording medium therefore, and method of transmitting an encoded encrypted audio signal” Sony Corp US Patent June 27, 2000.
43. Bladeenc 92.0 MP3 Encoder source code
