
AN AUDIO CODEC FOR MULTIPLE GENERATIONS COMPRESSION WITHOUT LOSS OF PERCEPTUAL QUALITY

FRANK KURTH

Department of Computer Science V, University of Bonn, Bonn, [email protected]

We describe a generic audio codec allowing for multiple, i.e., cascaded, lossy compression without loss of perceptual quality as compared to the first generation of compressed audio. To this end, we transfer encoding information to all subsequent codecs in a cascade. The supplemental information is embedded in the decoded audio signal without causing degradations. The new method is applicable to a wide range of current audio codecs, as documented by our MPEG-1 implementation.

INTRODUCTION

Low bit rate, high quality coding is used in a wide range of today's audio applications such as digital audio broadcasting or network conferencing. Although the decoded versions of the compressed data maintain very high sound quality, multiple or tandem coding may result in accumulated coding errors resulting from lossy data reduction schemes. Such multiple or tandem coding, leading to the notion of a signal's generations, may be described as follows. Assuming a coder operation $C$ and a corresponding decoder operation $D$ we shall call, for a given signal $x$, $DCx$ the first generation and, for an integer $n$, $(DC)^n x$ the $n$-th generation of $x$.

In this paper we propose a method to overcome ageing effects. More precisely, our ultimate goal is to preserve the first generation's perceptual quality. To this end we use an embedding technique, conceptually similar to the audio mole [1], to transport coding information from one codec in a cascade to subsequent codecs. As compared to [1], our embedding technique is performed in the transform domain. Moreover, it is based on psychoacoustic principles, which allows the embedding to be performed without causing audible distortions.

The paper is organized as follows. In the first section we perform a detailed analysis of ageing effects. To this end we look at a general model of a psychoacoustic codec and investigate which of its components may induce artifacts in cascaded coding. Ageing effects in a real-world coding application are documented by the results of listening tests as well as measurements of relevant encoding parameters. The second section develops a generic audio codec for cascaded coding without perceptual loss of quality. In the third section, an implementation based on an MPEG-1 Layer II codec is described. The fourth section gives a codec evaluation based on extensive listening tests as well as objective signal similarity measurements. In conclusion, we point out some applications and future work in this area.
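The notion of generations introduced above can be made concrete with a small sketch. The cascade function below follows the definition of the $n$-th generation directly; the toy codec (a fixed uniform quantizer with a hypothetical step size) is only an assumption for illustration and is not the codec studied in this paper. With such a fixed quantizer the cascade is idempotent, i.e., later generations equal the first one, which is the behaviour the proposed method tries to restore for real codecs by reusing first-generation parameters.

```python
from typing import Callable, List, Sequence

def nth_generation(
    x: Sequence[float],
    n: int,
    encode: Callable[[List[float]], List[int]],
    decode: Callable[[List[int]], List[float]],
) -> List[float]:
    """Apply the cascade decode(encode(.)) n times to the signal x."""
    y = list(x)
    for _ in range(n):
        y = decode(encode(y))
    return y

STEP = 0.25  # hypothetical fixed quantizer step size

def toy_encode(samples: List[float]) -> List[int]:
    # Lossy step: map each sample to the index of the nearest grid point.
    return [round(v / STEP) for v in samples]

def toy_decode(codes: List[int]) -> List[float]:
    # Requantization: map each index back to its grid point.
    return [c * STEP for c in codes]

signal = [0.1, -0.62, 0.33, 0.9]
first = nth_generation(signal, 1, toy_encode, toy_decode)
fifth = nth_generation(signal, 5, toy_encode, toy_decode)
```

Here `first` and `fifth` coincide because every sample already lies on the quantizer grid after the first pass. Ageing appears as soon as the encoder re-derives its parameters from the decoded signal, as analyzed in the first section.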

1. AGEING EFFECTS IN CASCADED CODING

Figure 1: Simplified scheme of a psychoacoustic codec. The encoder comprises a block or subband transform, a spectral analysis with psychoacoustic model, and quantization; the decoder comprises requantization and the inverse transform.

A general scheme of a psychoacoustic audio codec is given in Fig. 1. In this figure we have omitted optional components such as entropy coding or scalefactor calculation, although those may indirectly influence degeneration effects. Such effects will have to be dealt with in each particular case.

The encoder unit consists of a subband transform $T$ operating in a block by block mode. For a signal block $x$, a synchronous (w.r.t. the subband transform of that block) spectral analysis $S$ is used to obtain parameters of a psychoacoustic model $\Psi$. A bit allocation $B$ is performed using the psychoacoustic parameters, leading to a choice of quantizers for lossy data reduction.

Following quantization, codewords and side information, e.g., quantizers, scalefactors etc., are transmitted to the decoder. The decoder reconstructs subband samples using the side information, e.g., by dequantization, codebook look-up, or inverse scaling, and then carries out a reconstruction subband transform $\tilde{T}$. Within this framework, sources for signal degeneration are

1. the lossy quantization step,

2. round-off and arithmetic errors due to $T$, $\tilde{T}$, as well as possibly further calculations,

3. aliasing due to $T$, cancelled by $\tilde{T}$ only in the absence of lossy quantization,

4. NPR (near perfect reconstruction) errors, i.e., if $\tilde{T}$ is only a pseudo inverse of $T$,

5. inaccuracy of the psychoacoustic model $\Psi$,

6. non-suitable time-frequency resolution in spectral analysis,

7. missing translation invariance of $T$ and $\tilde{T}$ (in view of multiple coding).

AES 17th International Conference on High Quality Audio Coding

The quantization error 1. dominates and is likely to be responsible for the greatest part of the degenerations. This error may differ by several orders of magnitude from all of the other errors. As an extreme example, consider MPEG coding where, depending on the codec configuration, several of the highest subbands of the transformed signal are simply set to zero. Round-off and NPR errors (2. and 4.) are small as compared to 1., although they must be controlled from codec to codec in a cascade. Those errors will be especially important in conjunction with our embedding technique. As, e.g., Layer III of MPEG-1 audio coding shows, aliasing errors (3.) have to be dealt with carefully. The errors 5. and 6. indirectly contribute to the quantization error.

Figure 2: Comparison of first vs. third generations. The solid bars give the average rating, the small bars show the variance. Rating scale from −3 (worse) through 0 (equal) to +3 (better); test pieces: Pop (Abba), Castanets, Speech (female), Speech (male), Wind Instruments, Orchestra, Pop (Jackson), Percussion.

Listening tests are used to investigate signal degeneration with increasing generations. Fig. 2 shows the results of a listening test where 26 listeners compared first and third generations coded with an MPEG-1 Layer II codec at 128 kbps. Negative values indicate that the third generations were rated worse than first generations, whereas positive values give them better ratings. It is obvious from Fig. 2 that the third generations could almost all be distinguished from the first ones and were generally judged to be of worse quality.

The most important results [2] may be summarized as follows:

• Almost all test pieces already show noticeable perceptual changes in the second and third generations.

• Fourth to sixth generations show a clear loss in sound quality. The induced noise in many of the pieces is annoying. Almost all pieces above the eighth generation show a severe perceptual distortion.

• Quiet and very harmonic signal parts are (almost) not distorted due to MPEG's extensive use of scalefactors and due to higher signal-to-mask ratios (SMR) induced by tonal signal components.

It becomes clear that lossy coding changes the signal's spectral content in a way that does not allow subsequent encoders to perform a suitable psychoacoustic analysis. This leads to a degeneration of coding parameters and, finally, of overall sound quality. We further illustrate this by an example documenting the change of the spectral content of a signal for increasing generations. Fig. 3 shows the subband energies for a fixed frame in the case of an MPEG-1 encoded guitar piece. The different lines represent the energies for different generations. One observes the decrease of energy in subbands 6-9 and 11-17. The higher subbands are attenuated as an effect of the MPEG encoder not allocating any bits to them.

Figure 3: Degeneration of subband energies in MPEG-1 coding. For a fixed frame, the figure shows the energy [dB] in each of the subbands. The four lines correspond to the energies of the original signal and the fifth, tenth, and fifteenth generation, respectively.

In Fig. 4 the energy of the scalefactor bands is given for several frames of a piece containing a strummed electric guitar (the rhythm of strumming may be observed as the vertical structure in each of the plots). Increasing generations are plotted in a zig-zag scheme. The signal's degeneration is obvious from the loss of structure. Black regions indicate that scalefactor bands are set to zero during bit allocation.

Figure 4: Degeneration of subband energies in MPEG-1 coding. For successive frames, the figure shows the energy in each of the scalefactor bands as an intensity plot (axes: frame, scalefactor band). The four subplots correspond to the energies of the first, fifth, tenth, and fifteenth generation, respectively.

We briefly mention the effect of missing translation invariance of the coding operation. As already observed in [3], a crucial point in cascaded coding is synchronicity. Taking the 32-band multirate MPEG-1 filter bank as an example, a subsequent encoder will not obtain the same subband samples as the preceding decoder unless the input signal is “in phase” with the 32-band filter bank. That is, the filter bank is only translation invariant w.r.t. signal shifts by multiples of 32 samples. On a frame by frame basis, things are even worse, since in order to obtain the same content in a framewise manner, the input to the second coder has to be frame-aligned (e.g., within 1152 samples in the MPEG-1 Layer II case). Hence, we may also have various types of generation effects, depending on possible signal shifts prior to subsequent encoding. Listening tests in our MPEG-1 Layer II framework show remarkable differences in the type of degeneration depending on whether the signal was

1. fed into subsequent codecs without any synchronization, or

2. fully synchronized to the original input and then fed into the codec.

Whereas the unsynchronized signals tend to produce audible artifacts and noise-like components, the synchronized versions tend to attenuate several frequency bands.
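The role of synchronization can be reproduced with a toy block codec. The sketch below is only an illustration under strong assumptions: the block length, the number of quantizer levels, and the per-block scaling are hypothetical and much simpler than the MPEG-1 filter bank. Re-coding the first generation on the same block grid leaves it unchanged up to rounding, whereas a shift by a single sample makes the second coder see different blocks and introduces new errors.

```python
import math
from typing import List, Sequence

BLOCK = 8    # hypothetical block length (the text mentions 32 and 1152 samples)
LEVELS = 7   # hypothetical number of quantizer levels per sign

def code_blockwise(x: Sequence[float]) -> List[float]:
    """One encode/decode pass: per-block scale followed by uniform quantization."""
    out: List[float] = []
    for start in range(0, len(x), BLOCK):
        block = x[start:start + BLOCK]
        scale = max(abs(v) for v in block) or 1.0  # avoid division by zero
        out.extend(round(v / scale * LEVELS) * scale / LEVELS for v in block)
    return out

def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(u - v) for u, v in zip(a, b))

x = [math.sin(0.1 * i) + 0.5 * math.sin(0.37 * i) for i in range(1024)]
gen1 = code_blockwise(x)

# Second pass on the same block grid: the quantizer grid is reproduced.
err_aligned = max_abs_diff(gen1, code_blockwise(gen1))

# Second pass after a one-sample shift: blocks no longer match the first pass.
shifted_input = gen1[1:]
err_shifted = max_abs_diff(shifted_input, code_blockwise(shifted_input))
```

In this toy setting `err_aligned` stays at the level of floating point rounding while `err_shifted` is of the order of the quantizer step. As stated in the text, synchronization alone does not prevent degeneration in a real codec, since the psychoacoustic analysis is repeated in every generation.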

Figure 5: Development of mean subband energy (left; energy difference [dB] over subband and generation) and mean bit allocation (right; difference in allocation over subband and generation) as a difference signal of (increasing) generations and the first generation. The top plots show those difference signals for the case of synchronization between each of the codec cascades, the bottom plots for the case of absence of such a synchronization.

It depends on the signal which of the degenerations is perceived to be more severe. Fig. 5 shows four difference plots (first generation versus higher generations) for the scalefactor bands' mean energies and mean difference in bit allocation, where the mean ranges over 200 frames of a relatively stationary signal. Clearly, in the case of synchronization between subsequent coding steps (top left plot), the signal energy is attenuated, whereas absence of synchronization (bottom left plot) shows no such tendency but rather distributes the energy across the subbands.

It is remarkable that all formerly proposed methods for cascaded audio coding [1, 3] take into account a synchronization mechanism to overcome the mentioned distortion effects. Yet it should be clear that such a mechanism alone is not sufficient to overcome degeneration.

2. A CODEC PREVENTING AGEING EFFECTS

The idea behind our new coding method is to reuse the first generation's encoding parameters in all of the subsequent codecs in a cascade. This allows us to indirectly use the psychoacoustic analysis of the first encoder in all of the following encoders. Since the encoding parameters are not present at the point of decoding, we use the decoding parameters, i.e., the side information. This is reasonable, for in most cases we may derive the encoding parameters from them.

While it first seems trivial that one may reuse the encoding information of a certain encoding operation for subsequent encodings, it is not at all clear

• that one may deduce all encoding information from the decoding information only,

• that it is possible, using this information, to perfectly reproduce the compressed bitstream of the first generation (requantization and scaling might not be injective),

• how to convey this information from one codec to another.

While the first two issues depend on the particular codec, the last one poses a general problem, for it is in general undesirable to create a secondary bitstream or file format. Such a secondary bitstream would cause additional data overhead, would require a new format definition, and, most importantly, would make it very difficult to use the proposed technique in conjunction with standard media not supporting the new format. For this reason, the secondary information for subsequent encoding should be embedded into the decoded PCM data. The use of such a steganographic technique [4] in this framework was also independently developed by Fletcher [1]. However, Fletcher's embedding method differs significantly from the method presented in this paper, leading to rather different codec structures.

2.1. Psychoacoustic Embedding

The embedding of secondary data into some target data stream is known as steganography and has been studied to a considerable extent [4]. Our demands on a steganographic process are the absence of any perceptual degradation in the target signal, a high embedding capacity on a frame by frame basis, and a computationally efficient embedding procedure. Furthermore, efficient detection and extraction of the embedded data must be possible.

For a straightforward embedding approach consider a sample $x = \sum_{i=0}^{n-1} x_i 2^i$, also written in its binary representation $x = (x_{n-1} \dots x_0)$, $x_i \in \{0, 1\}$. For $k \le n$ and an embedding word $w = (w_{k-1} \dots w_0)$ we define the $k$-bit (direct) embedding by

$$E_k : \big((x_{n-1} \dots x_0), (w_{k-1} \dots w_0)\big) \mapsto (x_{n-1} \dots x_k\, w_{k-1} \dots w_0). \quad (1)$$

This kind of embedding destroys the $k$ least significant bits of the data word $x$. Application of this technique to a time signal already causes a noticeable noise level for small $k$. In the audio mole proposal [1], this embedding technique is used with the least significant bit(s). Depending on the bit resolution, the possibility of small audible distortions is reported [5]. As an alternative to overwriting the least significant bit, the authors propose to change it according to a parity/non-parity decision, which causes a more random signal change.

A refinement of the time-domain embedding technique uses an invertible linear transform $T$ prior to embedding. Embedding of a word $w$ into a signal $x$ now becomes a map

$$(x, w) \mapsto T^{-1} E_k(Tx, w). \quad (2)$$

From a steganographic point of view, a significant advantage of this method is that, without knowledge of $T$, the embedding is not as easy to detect as the one above. Yet it is not guaranteed that the reconstructed signal is still of perceptually transparent quality.

From a psychoacoustic coding point of view it seems advisable to perform a kind of selective embedding. Embedding is performed prior to applying the reconstruction transform $\tilde{T}$. The embedding positions and widths (e.g., $k$ above) are given by the psychoacoustic parameters or, in view of embedding in the decoder unit, by the decoding parameters. Intuitively, for a given quantizer $Q$ with requantizer $\tilde{Q}$, the freedom in choosing reconstruction values grows with the amount of data reduction, i.e., with a decreasing number of reconstruction levels. The embedding thus becomes a map

$$(x, w) \mapsto T^{-1} \tilde{E}\big(\tilde{Q}(Q T x), w, Q\big), \quad (3)$$

where the embedding width of $\tilde{E}$ depends on $Q$. This technique is used in the proposed embedding codec presented in the next subsection.
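The direct embedding of Eq. (1) amounts to overwriting the least significant bits of a data word. A minimal sketch follows, assuming samples are given as non-negative integers; the function names are hypothetical and not taken from the paper.

```python
def embed_lsb(x: int, w: int, k: int) -> int:
    """k-bit direct embedding: replace the k least significant bits of x by w."""
    if not 0 <= w < (1 << k):
        raise ValueError("embedding word does not fit into k bits")
    mask = (1 << k) - 1
    return (x & ~mask) | w

def extract_lsb(x: int, k: int) -> int:
    """Read back the k least significant bits of x."""
    return x & ((1 << k) - 1)

x = 0b10110110
y = embed_lsb(x, 0b011, 3)  # the upper five bits of x are kept
```

The change of the sample value is bounded by $2^k - 1$, which explains why the noise level grows quickly with $k$ when the embedding is applied directly to the time signal.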

2.2. Codec Model

We first give an overview of the proposed codec's functionality. Fig. 6 shows the generic codec scheme.

Figure 6: Codec for direct embedding. The encoder comprises the subband or block transform, a detector (branches a and b: detected, not detected), psychoacoustic analysis and model, bit allocation, a decoder for the embedded data, quantization, and the multiplexer; the decoder comprises the demultiplexer, dequantization, embedding bit allocation, embedding, and the inverse transform.

To describe the functionality, we start with the decoder. The decoder extracts codewords and side information for one data block from the bitstream and proceeds by dequantization. All parameters needed by the next encoder to reproduce the compressed bitstream are assembled in a bit buffer. Then, using the quantization information, a bit allocation algorithm decides which subbands will be used for embedding, and how many bits will be embedded in each of them. If enough bits could be allocated, the decoder proceeds by embedding all data from the bit buffer. To allow extraction of the embedded data in subsequent encoding steps, certain markers, i.e., bit combinations, are used to indicate embedding. Afterwards, the inverse transform reconstructs the time signal block.

The encoder proceeds by transforming the signal blocks, producing subband signals. Here we assume that encoding is synchronous to the decoding operation w.r.t. the processed signal blocks. The detector tries to detect the markers created by the decoder. If markers are found, a decoder extracts the embedded information from the subbands. Using the embedded information, the bit allocation and further encoder parameters can be set accordingly. If, however, no markers are found or the embedded information is detected to be corrupt, the encoder follows the standard encoding procedure using the psychoacoustic model. Quantization and channel multiplexing conclude the encoding.

The single modules deserve a more detailed treatment, where we shall again start with the decoder:

• To obtain the embedding capacity, i.e., the number of bits per sample which we may use to store our secondary data stream, we divide the subband samples (or transform blocks in transform coding) into groups of samples sharing the same quantizers. Those groups are called embedding blocks. For each embedding block we calculate the number of bits available for embedding, which depends on the quantizer's granularity. If, e.g., the quantizer reduces an $n$-bit sample to an $m < n$ bit representation, we may use at most $n - m$ bits for embedding in this sample. Note that this direct correspondence is not valid for all kinds of quantizers, and more elaborate conversion rules have to be used for non-uniform quantizers. It is also important to note that lossless coding operations might limit the embedding capacity, as will show up in the case of MPEG's scalefactors.

• All information to be embedded is collected in an embedding bitstream or bit buffer. The type of information depends on the chosen codec; examples are bit allocation, quantizers, scalefactors, codebooks, bitrates etc. Sometimes it will be useful to exploit the compressed bitstream's structure, since most of the latter information is already stored in this bitstream in a very compact form.

• If the total amount of embedding capacity suffices to transmit the desired coding information, we have to select the embedding blocks where we want to store the embedding bit stream. Furthermore, we have to determine an embedding bit width per embedding block. The decision about those parameters is termed embedding bit allocation. The embedding bit allocation, although consuming only a fraction of computational resources as compared to the encoder's bit allocation, is the most complex part of the proposed codec extension.

Since the proposed embedding technique may theoretically yield a bigger reconstruction error than normally induced by the utilized quantizers, a conservative allocation method should be used. We propose a greedy bit allocation algorithm with embedding blocks sorted in descending order w.r.t. their embedding capacities. To each of the embedding blocks we assign a certain bit budget, obeying a safety margin accounting for the possibly increased reconstruction error. If the greedy bit allocation loop does not yield enough embedding capacity, the bit budget is increased until no further increase is possible. Embedding is only performed if the allocation procedure succeeds.

• To facilitate the detection of the embedded information, the kind and position of the embedding have to be signaled to the encoder or, more precisely, to its detector unit. This amounts to conveying the embedding blocks and embedding bit widths chosen by the embedding bit allocation. Several techniques are possible, e.g., the use of a descriptor embedding block containing all of this “logistic” information. We propose a block-by-block solution where each used embedding block contains a separate marker indicating the corresponding bit width. We only have to ensure that no “wrong detections” (i.e., the embedding bit width is not detected correctly) occur.

• Since floating point operations commonly cause arithmetic errors, a forward error correction (FEC) may have to be applied to the embedded information. Those errors may accumulate with NPR errors as discussed above. A CRC checksum can be used to decide whether the information extracted from a certain embedding block is valid or not.

• The first task of the encoder is to synchronize the PCM bitstream w.r.t. the frame boundaries used by a previous codec. For this purpose, again several techniques may be utilized. In the case of a filter bank transform, a (fast) search for markers on certain predetermined (frequently used) subbands may be used. Once the synchronization is done, the encoder proceeds on a frame by frame basis and no further synchronization is necessary.

• The detector tries to find the embedded markers, deduce the embedding blocks' embedding widths, and check the embedding blocks' integrity.

• Hybrid coding provides a mechanism against corrupt embedded data or frames where no embedding was possible. In a hybrid framework, for each frame it is possible to choose between

  – usage of all embedded coding information if this information could be extracted,

  – partial usage of embedded information, e.g., bit allocation only,

  – discarding of all of the embedded information and use of the standard encoding procedure.
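The statement on the embedding capacity in the list above (an $n$-bit sample reduced to $m < n$ bits leaves at most $n - m$ bits for embedding) can be checked with a toy uniform quantizer. The sketch assumes non-negative integer samples, a quantizer that simply truncates to the $m$ most significant bits, and reconstruction at the lower edge of the quantization cell; all names are hypothetical. As noted in the text, real quantizers may need more elaborate conversion rules.

```python
N_BITS = 16  # assumed resolution n of a (non-negative) subband sample

def quantize(x: int, m: int) -> int:
    """Toy uniform quantizer: keep the m most significant of N_BITS bits."""
    return x >> (N_BITS - m)

def requantize(c: int, m: int) -> int:
    """Reconstruct at the lower edge of the quantization cell."""
    return c << (N_BITS - m)

def embed_in_cell(c: int, m: int, w: int, k: int) -> int:
    """Requantize codeword c and store the k-bit word w in the freed low bits."""
    if k > N_BITS - m:
        raise ValueError("at most n - m bits are available for embedding")
    if not 0 <= w < (1 << k):
        raise ValueError("embedding word does not fit into k bits")
    return requantize(c, m) | w

x = 0b1011011001101101
m = 10
c = quantize(x, m)                    # first-generation codeword
y = embed_in_cell(c, m, 0b101101, 6)  # decoded sample carrying 6 embedded bits
```

Because the embedded bits stay inside the quantization cell, a subsequent encoder using the same quantizer obtains the same codeword `c` from `y`, and the embedded word can be read from the low bits.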

Input: Codewords $Q(y)$, decoding parameters $P_D$, embedding function $\tilde{E}$.

1. Requantize $Q(y) \mapsto \tilde{y} =: (\tilde{y}_1, \dots, \tilde{y}_N)$.

2. Determine embedding positions and widths, $(\tilde{y}, P_D) \mapsto (i, b) =: ((i_1, \dots, i_r), (b_1, \dots, b_r))$, as well as a suitable marker $m$, such that $1 \le i_j < i_{j+1} \le N$, and a suitable binary representation $(P_D, m) \mapsto \mathrm{bin}(P_D, m)$, consisting of $|\mathrm{bin}(P_D, m)| = \sum_{j=1}^{r} b_j$ bits. Here, $b_j$ denotes the embedding width in bits of position $j$.

3. Create a partition $\mathrm{bin}(P_D, m) =: w_1 \dots w_r$, where $w_j \in \{0, 1\}^{b_j}$.

4. For $1 \le j \le r$ embed into $\tilde{y}$ using $b_j$-bit embedding: $\tilde{E} : (\tilde{y}_{i_j}, w_j) \mapsto \tilde{\tilde{y}}_{i_j}$.

5. Let $\tilde{\tilde{y}}_l := \tilde{y}_l$ for all $l$, where $l \ne i_j$ for $1 \le j \le r$.

6. Reconstruct $\tilde{\tilde{y}} \mapsto T^{-1} \tilde{\tilde{y}}$.

Output: Signal block $T^{-1} \tilde{\tilde{y}}$.

Figure 7: Pseudo code version of the decoding algorithm.

A pseudo code version of the decoding algorithm is given in Fig. 7. We assume that codewords $Q(y)$ as well as decoding parameters $P_D$ are given as an input. In this example, $Q$ denotes the encoder's quantizer function. For the sake of simplicity we only use fixed quantizers as well as a global marker $m$. Note that we separate encoding and decoding parameters to clarify the coding steps. The embedding bit allocation is summarized in step 2. In step 3, the embedding bit stream is partitioned according to the bit allocation. Finally, embedding is performed in step 4, prior to the reconstruction transform.
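Steps 3 to 5 of the decoding algorithm can be sketched as follows. The sketch assumes that the requantized samples are non-negative integers, that positions are counted from zero, that all widths are at least one bit, and that the bit stream (coding parameters plus marker) is given as a string of '0' and '1' characters; these representations and all names are assumptions made for illustration only.

```python
from typing import List, Sequence

def partition_bits(bits: str, widths: Sequence[int]) -> List[int]:
    """Step 3: split the bit string into words w_1 ... w_r of the given widths."""
    if len(bits) != sum(widths):
        raise ValueError("bit stream length must equal the sum of the widths")
    words, pos = [], 0
    for b in widths:
        words.append(int(bits[pos:pos + b], 2))
        pos += b
    return words

def embed_block(
    samples: Sequence[int],
    positions: Sequence[int],
    widths: Sequence[int],
    bits: str,
) -> List[int]:
    """Steps 4 and 5: embed w_j at position i_j, copy all other samples."""
    out = list(samples)
    for i, b, w in zip(positions, widths, partition_bits(bits, widths)):
        mask = (1 << b) - 1
        out[i] = (out[i] & ~mask) | w  # b-bit direct embedding
    return out

requantized = [0b1000, 0b1111, 0b0100, 0b1010]
embedded = embed_block(requantized, positions=[0, 2], widths=[2, 3], bits="10011")
```

The reconstruction transform of step 6 would then be applied to `embedded`; it is omitted here because it depends on the underlying codec.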

Input: Signal block $x \in \mathbb{R}^N$, transform $T$, psychoacoustic model $\Psi$, bit allocation $B$, detector function $\Delta$ on $\mathbb{R}^N$; $M$ denotes the set of all valid markers.

1. Calculate $y := Tx$.

2. Detect $\Delta(y) =: m$.

3. If $m \notin M$

   (a) Calculate $\Psi(x)$. Let $y' := y$.

   (b) Calculate the bit allocation $B$ from $\Psi(x)$ and $y$, and from this the encoding parameters $P_E$.

4. else

   (a) Decode $y \mapsto \tilde{Q}(y) =: (y', P_D)$.

   (b) Calculate encoding parameters $P_D \mapsto P_E$.

5. Quantize $y'$ using $P_E$: $y' \mapsto Q(y')$.

6. Determine decoding parameters $P_E \mapsto P_D$.

Output: Codewords $Q(y')$, decoding parameters $P_D$.

Figure 8: Pseudo code version of the encoding algorithm.

We summarize the encoding algorithm in Fig. 8. In the pseudo code, the function symbols for the bit allocation and psychoacoustic model are chosen as in the initial discussions. The set of admissible or valid markers used to indicate embedding is denoted by $M$. The detector function is named $\Delta$; quantizer and dequantizer are $Q$ and $\tilde{Q}$, respectively. The heart of the encoding algorithm lies in the decision whether to use the embedded information (step 4) or not (step 3) and use the standard procedure instead.

It is important to note that several design decisions, including the choice of embedding blocks and embedding bit allocation, heavily depend on the underlying codec. Such decisions will be treated in the next section.
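The central decision of the encoding algorithm (reuse embedded parameters if a valid marker is found and the data is intact, otherwise fall back to the standard procedure) can be sketched as a small dispatcher. The detector, the extractor, and the standard allocation are passed in as callables because they depend on the underlying codec; their signatures, and the convention that the extractor returns `None` for corrupt data, are assumptions of this sketch.

```python
from typing import Any, Callable, Collection, List, Optional, Sequence, Tuple

Params = Any  # placeholder type for codec-specific encoding parameters

def choose_encoding_parameters(
    y: Sequence[int],
    detect: Callable[[Sequence[int]], int],
    extract: Callable[[Sequence[int], int], Optional[Tuple[List[int], Params]]],
    standard_allocation: Callable[[Sequence[int]], Params],
    valid_markers: Collection[int],
) -> Tuple[List[int], Params, str]:
    """Return (samples to quantize, encoding parameters, chosen branch)."""
    marker = detect(y)
    if marker in valid_markers:
        payload = extract(y, marker)  # None signals corrupt embedded data
        if payload is not None:
            samples, params = payload
            return samples, params, "reused"
    # No marker or corrupt data: standard procedure with psychoacoustic model.
    return list(y), standard_allocation(y), "standard"
```

A partial reuse of the embedded information, as mentioned for hybrid coding, would add a third branch that takes only some fields (e.g., the bit allocation) from the extracted parameters.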

3. MPEG-1 LAYER II IMPLEMENTATION

To implement a codec preventing ageing effects we chose MPEG-1 Layer II [6] as a basis. The well-known coding scheme is depicted in Fig. 9. Although, strictly speaking, only the bitstream syntax and decoding are standardized, in what follows we shall also talk of the encoder. Relating the Layer II codec to our generic scheme, we have a 32-band multirate filterbank as a transform and a psychoacoustic model based on a windowed Fourier spectrum. The psychoacoustic model yields a signal-to-mask ratio used in the bit allocation process. To reduce amplitude redundancy, blocks of subband samples are assigned common scalefactors and scaled accordingly. The scaled values are linearly quantized using a variable quantizer step size according to the bit allocation. Codewords and side information are stored in the MPEG bitstream and transported to the decoder.

Figure 9: MPEG-1 Layer II codec. Encoder: 32-band multirate filterbank (analysis), windowed Fourier transform, psychoacoustic model, signal-to-mask ratio, bit and scalefactor allocation, scalefactor calculation, scaling and quantization, multiplexer. Decoder: demultiplexer, bit and scalefactor decoding, requantization and inverse scaling, 32-band multirate filterbank (synthesis).

To extend the MPEG Layer II codec to the proposed novel codec scheme, we consider

• the choice of suitable embedding blocks,

• the side information to be embedded and the embedding bit buffer,

• the determination of the available embedding bit widths,

• the embedding bit allocation,

• error correction mechanisms, and

• the choice of suitable markers.

3.1. Embedding blocks

MPEG coding works on a frame by frame basis. In Layer II, blocks of 1152 samples per channel are filtered, yielding 36 samples in each of the 32 subbands. Scalefactors are assigned to blocks of 12, 24 or 36 samples within one subband, depending on the magnitudes of the samples. Since scalefactors implicitly change the quantization resolution, the embedding capacity depends on the particular scalefactor assignment/combination. For simplicity, we assume the worst case of 12-sample scalefactor blocks and choose those blocks as embedding blocks. Hence there are potentially three embedding blocks per subband and frame. The following discussion assumes a fixed given MPEG frame.

3.2. Embedded information and bit buffer

Besides global information such as the number of channels, bitrate or joint stereo coding modes, we have to transmit/embed the following frame-related information:

• Bit allocation: depending on the target bit rate, a variable number of subbands is allowed for bit allocation. In MPEG, 2-4 bits are used for each subband's allocation information.

• Scalefactor selection: each subband with a positive number of allocated bits is assigned 1-3 scalefactors, depending on the subband samples' magnitudes. The select information consists of 2 bits conveying the utilized scalefactor pattern.

• Scalefactors: for each of the 1-3 scalefactors, 6 bits are used.

At first, one might think that the transmission of scalefactors could be obsolete, for they only represent a lossless coding step. Yet since dequantization may significantly alter a sample's magnitude, a subsequent encoder might calculate different scalefactors, again implicitly resulting in a different quantizer step size. Thus we transmit both scalefactors and select information. Listening tests for the case where we only transmitted bit allocation information show a significant amount of degeneration due to the aforementioned effects of scalefactor changes.

To store those parameters, we implemented a bit buffering mechanism consisting of a buffer with read/write access to bit blocks of variable size. The coding parameters are sequentially fed into the buffer. During embedding, the buffer is read in blocks of $n \cdot k$ bits, corresponding to the embedding block size $n$ and the embedding bit width $k$. In the extractor, the procedure is reversed.
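The amount of frame-related side information listed above can be estimated with a short calculation. The sketch below only uses the field sizes stated in the text (2-4 bits of allocation information per subband, 2 bits of select information, 6 bits per scalefactor); the example frame with four subbands is hypothetical, and global information, markers and FEC bits are not counted.

```python
from typing import Sequence

def side_info_bits(
    alloc_field_bits: Sequence[int],
    allocated: Sequence[bool],
    n_scalefactors: Sequence[int],
) -> int:
    """Bits needed for bit allocation, scalefactor select info and scalefactors."""
    total = 0
    for s, field in enumerate(alloc_field_bits):
        if not 2 <= field <= 4:
            raise ValueError("allocation field must have 2 to 4 bits")
        total += field                      # bit allocation information
        if allocated[s]:
            if not 1 <= n_scalefactors[s] <= 3:
                raise ValueError("1 to 3 scalefactors per allocated subband")
            total += 2                      # scalefactor select information
            total += 6 * n_scalefactors[s]  # 6 bits per scalefactor
    return total

# Hypothetical frame with four subbands, one of them without allocated bits.
bits_needed = side_info_bits(
    alloc_field_bits=[4, 4, 3, 2],
    allocated=[True, True, False, True],
    n_scalefactors=[3, 1, 0, 2],
)
```

For this example the buffer would have to hold 55 bits; the embedding bit allocation then has to provide at least this capacity plus the overhead for markers and error correction.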

3.3. Embedding bit width

For each scaling/embedding block $(s, j)$ (where $s$ denotes the subband number and $1 \le j \le 3$ one of the three scalefactors) we determine the maximum embedding bit width $m_{s,j}$ from the corresponding subband's quantizer resolution $b_s$ (in bits) and the assigned scalefactor $f_{s,j}$, $1 \le j \le 3$. We have to account for the scalefactors, since they implicitly increase the bit resolution: scaling by a factor $2^n$ prior to quantization allows for an increase in bit resolution by $n$ bits as compared to quantization without scaling. We do not consider subbands with zero bits allocated ($b_s = 0$). In this case we define $m_{s,j} := 0$ for all $j$. In the case that fewer than three scalefactors are assigned to a specific subband, we determine the $f_{s,j}$ according to the scalefactor pattern.

If the scalefactor $f_{s,j}$ of scalefactor band $(s, j)$ is given by $f_{s,j} = f_i := 2^{1 - i/3}$ for $i \in [3 : 62]$, we let

$$m_{s,j} := 16 - \min\big(16, \lfloor -\log_2 f_{s,j} \rfloor + b_s\big).$$

Since $\lfloor -\log_2 f_{s,j} \rfloor = \lfloor -\log_2 2^{1 - i/3} \rfloor = \lfloor i/3 \rfloor - 1$ we obtain

$$m_{s,j} = 16 - \min\big(16, \lfloor i/3 \rfloor - 1 + b_s\big).$$

For $i \in [0 : 2]$, scaling causes the loss of at most one bit of precision. In this case $m_{s,j} := 16 - \min(16, b_s - 1)$.

The calculation of $m_{s,j}$ may be motivated as follows: starting from an initial precision of 16 bits, we obtain the MPEG reconstruction precision as the sum of the allocation precision $b_s$ and the precision gained by using scalefactors. The difference between the initial resolution and this sum gives the maximum embedding bit width. In case of a precision exceeding 16 bits (which is possible in MPEG), we let $m_{s,j} := 0$.

3.4. Embedding bit allocation
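As a preliminary for the allocation, the maximum embedding bit width of Section 3.3 can be computed from the quantizer resolution and the scalefactor index. The following is a minimal sketch, assuming the scalefactor with index $i$ equals $2^{1 - i/3}$ for $0 \le i \le 62$ and a 16 bit reference precision; function and parameter names are hypothetical.

```python
PCM_BITS = 16  # reference precision used in the text

def max_embedding_width(b_s: int, sf_index: int) -> int:
    """Maximum embedding bit width for one scalefactor block.

    b_s      -- quantizer resolution of the subband in bits (0: nothing allocated)
    sf_index -- scalefactor index i, assumed scalefactor 2**(1 - i/3), 0 <= i <= 62
    """
    if not 0 <= sf_index <= 62:
        raise ValueError("scalefactor index out of range")
    if b_s == 0:
        return 0
    # floor(-log2(2**(1 - i/3))) = floor(i/3) - 1. For i in 0..2 this gives -1,
    # i.e., the loss of one bit of precision treated as a separate case above.
    gained_bits = sf_index // 3 - 1
    return PCM_BITS - min(PCM_BITS, gained_bits + b_s)
```

For example, a subband quantized with 4 bits and scalefactor index 9 gains 2 bits through scaling, which leaves 10 of the 16 bits for embedding.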

For the embedding bit allocation we use a greedy algorithm as already sketched above. We shall only give an overview of the algorithm:

• Sort the embedding blocks in order of decreasing capacities.

• For each block $(s, j)$, assign an initial embedding bit width $e_{s,j} < m_{s,j}$, where $m_{s,j}$ denotes the maximum embedding bit width of Section 3.3.

• Main allocation loop: allocate embedding blocks in the given descending order. Add the capacity due to $e_{s,j}$ to a counter for the embedded bit size. Note that at this point, we have to take care of FEC bits and markers too.

• End the allocation if enough bits could be allocated.

• End the allocation if no change in the allocated bit size occurred as compared to the last loop.

• Increase each $e_{s,j}$ by one bit provided $e_{s,j} < m_{s,j}$, then restart the main allocation procedure.

If the embedding bit allocation fails, the decoder does not embed anything for that particular frame.
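The greedy allocation outlined above can be sketched as follows. The sketch makes several assumptions that are not specified in the text: the safety margin is a fixed number of bits subtracted from each maximum width, marker and FEC bits are modelled as a constant overhead per used block, and each block holds 12 samples. It is meant as an illustration of the loop structure, not as the implementation used for the experiments.

```python
from typing import List, Optional, Sequence, Tuple

def greedy_embedding_allocation(
    max_widths: Sequence[int],
    payload_bits: int,
    block_size: int = 12,
    margin: int = 2,
    overhead_bits: int = 0,
) -> Optional[List[Tuple[int, int]]]:
    """Return [(block index, embedding width)] or None if the allocation fails."""
    # Sort blocks by decreasing capacity (ties keep their original order).
    order = sorted(range(len(max_widths)), key=lambda i: max_widths[i], reverse=True)
    # Conservative initial widths below the maximum (safety margin).
    widths = [max(m - margin, 0) for m in max_widths]
    while True:
        chosen, capacity = [], 0
        for i in order:
            net = widths[i] * block_size - overhead_bits
            if net <= 0:
                continue  # block cannot carry any payload at this width
            chosen.append((i, widths[i]))
            capacity += net
            if capacity >= payload_bits:
                return chosen
        # Not enough capacity: relax the margin by one bit where possible.
        changed = False
        for i, m in enumerate(max_widths):
            if widths[i] < m:
                widths[i] += 1
                changed = True
        if not changed:
            return None  # no further increase possible, do not embed
```

With maximum widths `[4, 0, 6, 2]` and a payload of 100 bits, the first pass (widths reduced by the margin) yields only 72 bits, so the widths are increased once and the second pass succeeds using three blocks.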

3.5. Error correction and markers

Since the MPEG filterbank does not yield perfect reconstruction, we cannot rely on the encoder's subband samples being identical to those produced by our embedding procedure. There is a certain reconstruction error inherent in the filter bank which accumulates with arithmetic errors possibly introduced by floating point operations. Thus, prior to embedding we perform an arithmetic FEC, mapping the word to be embedded, $w$, to $aw + c =: F(w)$ for suitable positive integers $a$ and $c$. Then, $F(w)$ is the word to be embedded. As an example, consider $w$ to be an integer; then, choosing $a := 8$, $c := 4$, we may recover $w$ from the perturbed word $8w + 4 + \delta$ for $\delta \in [-4, 3]$. In this case, the coding overhead for the arithmetic code would be 3 bits.

Finally, we consider the markers indicating the embedding positions and bit widths. In our implementation we chose variable width markers, embedded at the beginning of our embedding blocks. The marker's width depends on the actual embedding bit width. To prevent false detections in the encoder, we perform an analysis-by-synthesis examination of all possible embedding blocks and, if necessary, adapt their content accordingly. Fig. 10 illustrates the concept of an embedding block and gives a rough impression of the embedding positions. For the sake of simplicity, we used a 16 bit PCM representation as a reference.

Figure 10: Scheme of an embedding block and the embedding positions (horizontal axis: bits 0 to 15; vertical axis: samples 0 to 11; legend: bits for FEC, marker bits, embedded data bits, bits of subband sample). The least significant bits are depicted on the l.h.s.
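The arithmetic FEC from the example above (multiplier 8, offset 4) can be written down in a few lines. The sketch assumes non-negative integer words and an additive integer error; the names are hypothetical.

```python
R = 3        # overhead in bits
A = 1 << R   # multiplier a = 8
C = A >> 1   # offset c = 4, placing the word in the middle of its interval

def fec_encode(w: int) -> int:
    """Map the word w to F(w) = a * w + c."""
    return A * w + C

def fec_decode(v: int) -> int:
    """Recover w from v = F(w) + delta, valid for -4 <= delta <= 3."""
    return v >> R  # floor(v / 8)
```

Any perturbed value $8w + 4 + \delta$ with $\delta \in [-4, 3]$ lies in the interval $[8w, 8w + 7]$ and is therefore decoded to $w$; an error of $+4$ already leads to the neighbouring word.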

4. CODEC EVALUATION

4.1. Test settings

The proposed codec was evaluated on a broad variety of audio pieces, mostly chosen from the widely used EBU test material [7]. We tested our codec on music as well as on male and female speakers. In our test settings, the source material was given as 16 bit PCM stereo with a sampling rate of 44.1 kHz. We used stereo bit rates of 128 to 192 kbps for compression.

4.2. Listening tests

In what follows, we shall concentrate on the listening test results for a bit rate of 128 kbps. For our tests we recruited 26 test listeners chosen from among the members of the audio signal processing group and other computer science students at Bonn University.

In this section, the term standard codec refers to a non-modified MPEG-1 Layer II codec implementation. In summary, we conducted the following tests:

1. Comparison of 5th generations of the standard codec and 5th generations of the proposed codec (relative).

2. Comparison of first generations and 25th generations of the proposed codec (relative).

3. Comparison of 5th generations using the standard codec and 25th generations of the proposed codec (relative).

4. Comparison of 3rd generations using the standard codec and 5th generations of the proposed codec (relative).

5. Comparison of 10th generations of the standard codec and 25th generations of the proposed codec (relative).

6. Absolute ratings of

   • first generations,
   • 5th generations of the standard codec (without synchronization),
   • 5th generations of the standard codec (including synchronization),
   • 5th generations of the proposed codec, and
   • 15th generations of the proposed codec.

In those tests absolute ratings were given according to the five point MOS impairment scale, while relative ratings were given w.r.t. an integer scale of $[-3, +3]$, indicating if a second stimulus is judged to be of better ($\{+1, +2, +3\}$), worse ($\{-3, -2, -1\}$) or same ($0$) quality as compared to a first stimulus. We shall give an overview of the test results.

Figure 11: Comparison of first generations and 25th generations using the proposed codec. For most of the pieces, both versions could not be distinguished. (Horizontal axis: relative rating from −3 (worse) through 0 (equal) to 3 (better). Test pieces shown: Pop (Abba); Castanets; Speech, female; Speech, male; Wind Instruments; Orchestra; Pop (Jackson); Percussion.)

Figure 12: Comparison of 10th generations using the standard codec vs 25th generations using the proposed codec. Positive values give the proposed codec a better rating. (Horizontal axis: relative rating from −3 (worse) through 0 (equal) to 3 (better). Test pieces shown: Pop (Abba); Castanets; Speech, female; Speech, male; Wind Instruments; Orchestra; Pop (Jackson); Percussion.)

In Fig. 11, the results of the comparison of first generations and 25th generations using the proposed codec are given. For almost all of the pieces, the two versions could not be distinguished.

The decrease in quality caused by the standard codec was already reported in the first section. Fig. 12 gives the results for the “extreme” situation of a comparison of 10th generations using the standard codec and 25th generations using the proposed codec. Clearly, the proposed codec is considered to yield much better results. The results of the other tests are consistent with the ones reported here. One problem prevalent in our test settings was the loss of quality in the first generations introduced by the Layer II codec in case of one or two pieces. In those cases, some of the listeners were unsure about which of a piece's versions should be rated worse.

We summarize the trends of our listening tests:

• For six of the test pieces, first generations could not be distinguished from 25th generations generated by the proposed codec.

• For the other pieces, the quality of higher (about 25th) generations is generally judged to be comparable to or better than the 3rd generations produced by the standard codec.

• Tests with high generations (about 50th) show that the signal quality stays stable.

It is important to comment on segments where no embedding is possible. As will be shown in the next subsection, embedding is indeed not possible for every frame. Yet it turns out that this is not crucial for our MPEG implementation. More precisely, the frames with no possibility of embedding turn out to correspond to quiet signal passages with tonal content. Since

1. the use of scale factors implies an increased accuracy within those segments and

2. tonal components are quantized in a conservative way (higher signal-to-mask ratio assumed),

there is a natural “workaround” to this problem. Our listening tests confirm this reasoning.

4.3. Objective measurements

To support the above results, we give some objective measurements concerning the signal changes with increasing generations. Furthermore, we consider possible (or maximum) embedding capacities, which are of interest for other steganographic applications.

Figure 13: The four parts of the plot show (from top to bottom) the segmental SNRs between 1st and 3rd generations, 3rd and 5th generations, 5th and 10th generations, as well as 10th and 25th generations. Small circles at about 60 dB indicate that the corresponding segments are identical. (Horizontal axis: frames, 0 to 90; vertical axes: SegSNR [dB], 0 to 100.)

In Fig. 13 the frame by frame SegSNR between various generations of the castanet piece is given. That is, the segment size is 1152 samples, in synchronicity with the MPEG frames. While the SegSNR is plotted as a solid line, segments where the signal content does not change from generation to generation are indicated by small circles at about 60 dB. It can be observed that with increasing generations (from the top to the bottom plot) there are more and more segments where the signal does not change anymore. This is perfectly in accordance with the observation of the listening tests that the signal quality becomes "stable" from about the 3rd to 5th generations onwards. In other words, we may say that the proposed codec tends

Figure 14: The plot shows the waveform of the castanet piece (bottom). Above, the difference signal between 5th and 10th generations is given. The upper horizontal bars indicate positions where embedding could be performed while the lower bars indicate segments where no embedding was performed. It is obvious that the difference signal vanishes at the embedding positions. (Note that the magnitude of the difference signal is much lower than that of the signal.) (Horizontal axes: signal position in samples, scaled by 10^4; vertical axes: amplitude, scaled by 10^4, and difference, −50 to 50.)

to map the signal partially to some fixed point signal. Comparing the embedding segments to those fixed point positions within the signal, we observe a match, as depicted in Fig. 14. The lower part of the plot shows the waveform of the castanet piece. The difference signal between 5th and 10th generations is given in the upper part. The upper horizontal bars indicate positions where embedding could be performed while the lower bars indicate segments where no embedding was performed. It is obvious that the difference signal vanishes at the embedding positions. Measurements with different test pieces and other measures (norm-based distance, relative segmental SNR) show similar results.
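The frame-wise SegSNR used above could be computed roughly as follows. This is a sketch under the assumption of the usual definition (signal energy over error energy per segment, in dB); the function name `segmental_snr` and the use of `None` to flag identical segments (the small circles in Fig. 13) are choices made for this example only.

```python
import math

FRAME = 1152  # samples per MPEG-1 Layer II frame, as stated in the text


def segmental_snr(ref, test, frame=FRAME):
    """Return one SNR value in dB per complete frame.

    None marks a frame in which both signals are identical, i.e. the
    error energy is zero and the SNR would be infinite.
    """
    n = min(len(ref), len(test))
    values = []
    for start in range(0, n - frame + 1, frame):
        r = ref[start:start + frame]
        t = test[start:start + frame]
        noise = sum((a - b) ** 2 for a, b in zip(r, t))
        if noise == 0:
            values.append(None)  # segment did not change between generations
            continue
        signal = sum(a * a for a in r)
        if signal == 0:
            values.append(float("-inf"))  # silent reference, non-zero error
        else:
            values.append(10.0 * math.log10(signal / noise))
    return values


# Toy signals: the first frame is unchanged, the second differs by one LSB.
generation_a = [100] * (2 * FRAME)
generation_b = generation_a[:FRAME] + [101] * FRAME
print(segmental_snr(generation_a, generation_b))
```

Counting the `None` entries per pair of generations gives the number of frames that have reached a fixed point, which is the quantity that grows from the top to the bottom plot of Fig. 13.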

In view of some natural extensions of the proposed codec as well as other steganographic applications, we examine the embedding capacities obtainable by our approach. In case of the castanet piece, Fig. 15 shows the various bit demands and capacities for each frame. The solid line gives the bit demand for the embedding of the coding parameters and the dashed line shows the total embedding capacity. The dotted solid line gives the net embedding capacity, i.e., the total embedding capacity minus the capacity needed for the error correction and the markers. Embedding is only performed for frames where the solid line is plotted below the dotted solid line. One clearly observes the overhead produced by the markers and error correction. Furthermore, some signal parts allow a very large amount of embedding (about 4000-6000 bits per



Figure 15: Embedding capacities and demands for the castanet piece. The solid line gives the bit demand for embedding of the coding parameters (per frame), the dashed line shows the total embedding capacity, and the dotted solid line gives the net embedding capacity. (Upper panel: bits per frame, 0 to 8000, over frames 0 to 90; lower panel: signal waveform.)

Figure 16: Embedding capacities and demands for a longer segment of a piece of pop music. (Four panels covering frames 20 to 560; vertical axes: bits.)

frame) which is a considerable quantity. Note that this is not exceptional, as illustrated in Fig. 16. The plot is analogous to the upper part of Fig. 15 except that a considerably longer signal piece is shown. For this piece of pop music, there are some parts with net embedding capacities of about 4000 bits per frame, in this case corresponding to a total of 10000 bits per frame.
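The per-frame embedding decision described above amounts to a simple comparison, sketched here. The function names, the split of the overhead into separate `fec_bits` and `marker_bits` arguments, and the non-strict comparison are assumptions for illustration; the numbers in the usage lines are made up and are not measured values.

```python
def net_capacity(total_bits: int, fec_bits: int, marker_bits: int) -> int:
    """Net capacity: total capacity minus FEC and marker overhead, never negative."""
    return max(0, total_bits - fec_bits - marker_bits)


def can_embed(demand_bits: int, total_bits: int, fec_bits: int, marker_bits: int) -> bool:
    """True if the coding parameters of a frame fit into its net capacity."""
    return demand_bits <= net_capacity(total_bits, fec_bits, marker_bits)


# Hypothetical frames: (bit demand, total capacity, FEC bits, marker bits).
frames = [
    (1200, 10000, 4500, 1500),  # net capacity 4000: embedding possible
    (1200, 1500, 400, 200),     # net capacity 900: demand too high
    (900, 0, 0, 0),             # quiet frame without any capacity
]
print([can_embed(*frame) for frame in frames])  # [True, False, False]
```

Frames for which `can_embed` returns `False` correspond to the segments without embedding discussed in the listening test section.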

5. CONCLUSIONS AND FUTURE WORK

We have presented a novel audio codec scheme suitable for multiple generations audio compression without loss of perceptual quality. As an implementation of the new concept we described an MPEG-1 Layer II based codec. A thorough codec evaluation including extensive listening tests and objective measurements shows the proper functionality of the proposed codec. The codec concept is generic and may be combined with a wide range of today's audio codecs (and, naturally, video/image codecs). Further research describing an improved second prototype based on popular transform coding techniques is reported elsewhere.

We also considered steganographic applications. The possibly huge embedding capacities as illustrated in Fig. 16 show the flexibility of our embedding approach in view of other applications. Among those is one of our current projects, which is concerned with the synchronous integration of textual or score information into the decoded PCM bitstream.

Naturally, there is much room for further improvements of our prototypic implementation, including the reduction of FEC/marker overhead using a technique similar to the MPEG-1 Layer III bit reservoir, or the introduction of robustness w.r.t. transmission errors. The application of the introduced technique to heterogeneous codec cascades is also of great interest for further studies. Recent developments on a VQ embedding framework allowing for codecs which, under some mild conditions, may be proven to keep the perceptual quality of the first generation signal, will be reported elsewhere.

6. ACKNOWLEDGEMENT

The author would like to express his gratitude to Michael Clausen for supervising this work and to Meinard Müller for giving several valuable comments. The research activities have been partially supported by Deutsche Forschungsgemeinschaft under grant CL 64/3-1, CL 64/3-2, and CL 64/3-3.

REFERENCES

[1] John Fletcher, “ISO/MPEG layer 2 - optimum re-encoding of decoded audio using a mole signal,” in Proc. 104th AES Convention, Amsterdam, 1998.

[2] Frank Kurth, Vermeidung von Generationseffekten in der Audiocodierung (in German), Ph.D. thesis, Institut für Informatik V, Universität Bonn, 1999.

[3] Michael Keyhl, Harald Popp, Ernst Eberlein, Karl-Heinz Brandenburg, Heinz Gerhäuser, and Christian Schmidmer, “Verfahren zum kaskadierten Codieren und Decodieren von Audiodaten,” 1994, Patentschrift beim Deutschen Patentamt DE 4405659 C1.

[4] Ross J. Anderson and Fabien A.P. Petitcolas, “On the limits of steganography,” IEEE Journal on Selected Areas in Communications, vol. 16, no. 4, pp. 474–482, May 1998.



[5] A. J. Mason, A. K. McParland, and N. J. Yeadon, “The ATLANTIC audio demonstration equipment,” in 106th AES Convention, Munich, Germany, 1999.

[6] ISO/IEC 11172-3, “Coding of moving pictures and associated audio for digital storage media at up to 1.5 Mbits/s - audio part,” 1992, International Standard.

[7] Technical Centre of the European Broadcasting Union, “Sound Quality Assessment Material Recordings for Subjective Tests,” 1988.
