
  • 8/12/2019 Guidelines for Jury Evaluations

    1/22

    400 Commonwealth Drive, Warrendale, PA 15096-0001 U.S.A. Tel: (724) 776-4841 Fax: (724) 776-5760

SAE TECHNICAL PAPER SERIES 1999-01-1822

    Guidelines for Jury Evaluations

    of Automotive Sounds

Norm Otto and Scott Amman, Ford Motor Company

Chris Eaton, HEAD acoustics Inc.

Scott Lake, General Motors Corporation

Reprinted From: Proceedings of the 1999 Noise and Vibration Conference (P-342)

Noise and Vibration Conference & Exposition, Traverse City, Michigan

    May 17-20, 1999


The appearance of this ISSN code at the bottom of this page indicates SAE's consent that copies of the paper may be made for personal or internal use of specific clients. This consent is given on the condition, however, that the copier pay a $7.00 per article copy fee through the Copyright Clearance Center, Inc. Operations Center, 222 Rosewood Drive, Danvers, MA 01923 for copying beyond that permitted by Sections 107 or 108 of the U.S. Copyright Law. This consent does not extend to other kinds of copying such as copying for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale.

SAE routinely stocks printed papers for a period of three years following date of publication. Direct your orders to SAE Customer Sales and Satisfaction Department.

    Quantity reprint rates can be obtained from the Customer Sales and Satisfaction Department.

    To request permission to reprint a technical paper or permission to use copyrighted SAE publications inother works, contact the SAE Publications Group.

No part of this publication may be reproduced in any form, in an electronic retrieval system or otherwise, without the prior written permission of the publisher.

ISSN 0148-7191
Copyright 1999 Society of Automotive Engineers, Inc.

Positions and opinions advanced in this paper are those of the author(s) and not necessarily those of SAE. The author is solely responsible for the content of the paper. A process is available by which discussions will be printed with the paper if it is published in SAE Transactions. For permission to publish this paper in full or in part, contact the SAE Publications Group.

Persons wishing to submit papers to be considered for presentation or publication through SAE should send the manuscript or a 300 word abstract of a proposed manuscript to: Secretary, Engineering Meetings Board, SAE.

    Printed in USA

All SAE papers, standards, and selected books are abstracted and indexed in the Global Mobility Database.


    1999-01-1822

    Guidelines for Jury Evaluations of Automotive Sounds

Norm Otto and Scott Amman, Ford Motor Company

Chris Eaton, HEAD acoustics Inc.

Scott Lake, General Motors Corporation

    Copyright 1999 Society of Automotive Engineers, Inc.

    ABSTRACT

The following document is a set of guidelines intended to be used as a reference for the practicing automotive sound quality (SQ) engineer, with potential application to the field of general consumer product sound quality. Practicing automotive sound quality engineers are those individuals responsible for understanding and/or conducting the physical and perceptual measurement of automotive sound. This document draws upon the experience of the four authors and thus contains many "rules of thumb" which the authors have found to work well in their many automotive-related sound quality projects over the past years. When necessary, more detailed publications are referenced. The intent of publication of this document is to provide a reference to assist in automotive sound quality work efforts and to solicit feedback from the general sound quality community as to the completeness of the material presented. To that end, contact information for each author is given at the end of the document.

    INTRODUCTION

Why do subjective testing and analysis in automotive sound quality investigations? One might ask why bother with the trouble of conducting subjective testing in the first place? In the authors' experience, conducting subjective jury evaluations of automotive sounds has led to a deeper understanding of those sounds and the way that potential customers react to and sometimes appreciate automotive sounds. The following is an attempt to describe subjective testing and analysis as applied to sound quality, and its relevance to gaining this deeper understanding. The remainder of this document draws upon the experience of the four authors and, as a result, may be biased toward the techniques they commonly use or have found to work well in their automotive sound quality studies. However, an attempt has been made to address other techniques commonly used by other researchers in the general field of product sound quality. Although not a comprehensive document, it is hoped that this paper will provide a set of guidelines which addresses a majority of the issues and techniques used in the field of automotive and general product sound quality. It is hoped that this guide will act as a springboard; a launching point for your own individual investigation into subjective testing and analysis for automotive sound quality.

DEFINITIONS It is appropriate to begin with a few fundamental definitions of terms used throughout this document.

Subjective. In Webster's Dictionary, subjective is defined by the following: "...peculiar to a particular individual, ...modified or affected by personal views, experience, or background, ...arising from conditions within the brain or sense organs and not directly caused by external stimuli," etc. In certain situations, the word subjective conjures up negative connotations, as if subjective results are less valuable pieces of information than objective results. We do not hold that opinion in our treatment of this topic, but consider subjective evaluation to be a vital, information-rich portion of automotive sound quality work.

Quality. Again Webster helps to clarify what we are investigating. According to Webster, quality is: "...a distinguishing attribute, ...the attribute of an elementary sensation that makes it fundamentally unlike any other sensation." Notice that "goodness" or "badness" does not enter into the definition.

Subjective testing and analysis. Subjective testing and analysis involves presentation of sounds to listeners, then requesting judgments of those sounds from the listeners, and finally performing statistical analysis on the responses.

Jury testing. Jury testing is simply subjective testing done with a group of persons, rather than one person at a time.


Subjective testing can be done with a single person or many people at a time, both cases having their own set of benefits and caveats.
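As a minimal illustration of the statistical-analysis step in the definition above (this sketch is not from the paper; the sound names and ratings are hypothetical), jury responses on a rating scale can be summarized per sound by a mean and a spread:

```python
# Illustrative sketch: first-pass statistical analysis of jury ratings.
# Subjects rate each sound on a 1-10 scale; we report the mean rating
# and the sample standard deviation per sound.
from statistics import mean, stdev

# Hypothetical responses: {sound_name: [one rating per subject]}
ratings = {
    "door_close_A": [7, 8, 6, 7, 9, 8],
    "door_close_B": [4, 5, 3, 5, 4, 6],
}

def summarize(ratings):
    """Return {sound: (mean rating, standard deviation)}."""
    return {
        sound: (round(mean(r), 2), round(stdev(r), 2))
        for sound, r in ratings.items()
    }

print(summarize(ratings))
```

More elaborate analyses (significance testing, scaling methods) build on exactly this kind of per-sound summary of the response data.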

THE TASK OF SOUND QUALITY In automotive sound quality work, one tries to identify what aspects of a sound define its quality. It has been the experience of most persons involved in noise and vibration testing that analysis of acoustic signals alone does not identify the quality (as defined by Webster) of those signals. Individuals will use words like "buzzy," "cheap," "luxurious," "weak," etc. to describe the defining attributes in sounds. Knowing how to design the correct attributes into a vehicle sound directly impacts the appeal of the vehicle, and ultimately impacts the profitability of a vehicle line. No instruments or analysis techniques to date have been able to quantify the descriptive terms mentioned above without the aid of subjective testing of some kind, hence the need for subjective testing and analysis.

THE REST OF THE STORY The remainder of this guide will take you through most of the salient issues involved in subjective testing for automotive sound quality work. This document is not intended to cover psychoacoustic testing, but rather provide guidance for the practicing sound quality engineer. Specific topics to be covered include:

    Listening Environment

    Subjects

    Sample (sound) Preparation

    Test Preparation and Delivery

    Jury Evaluation Methods

    Analysis Methods

    Subjective to Objective Correlation

Before any type of jury evaluation can be conducted, an adequate listening space is required. This is the topic of the first section.

    LISTENING ENVIRONMENT

ROOM ACOUSTICS If the sounds are presented over loudspeakers in a room other than an anechoic chamber, the frequency characteristic of the room will be superimposed on the frequency characteristics of the loudspeaker and the sounds. Room resonances will then have an effect on the perceived sound. Additionally, if a jury evaluation is being conducted in which the subjects are physically located in different positions within the room, the room acoustics will affect the sound differently for each subject. No longer will all the subjects experience the exact same sound, thus introducing a bias into the results. If it is necessary to conduct listening evaluations using loudspeakers, adherence to Sections 4.1 (Room size and shape), 4.2.1 (Reflections and reverberation), and 4.2.2 (Room modes) of the AES standard [1] is recommended. It is recommended that loudspeaker arrangement conform to AES20-1996 Section 5.2 (Loudspeaker locations) and location of listeners conform to AES20-1996 Section 5.4 (Listening locations).

AMBIENT NOISE Reduction of ambient noise is essential for the proper administration of a subjective listening evaluation. Noise sources within the listening room can be due to: computer fans, fluorescent lighting, HVAC, etc. The influence of these sources can be minimized through the remote location or containment of computers in acoustic enclosures, incandescent lighting, and HVAC baffles/sound treatment. High transmission loss into the room is desirable to minimize the influences of outside noise.

Single-value dB or dBA levels are generally inadequate in describing the ambient noise levels of indoor environments. ANSI S3.1 defines octave and one-third octave band noise levels for audiometric test rooms [2]. However, bands below 125 Hz are undefined, and everyday sounds with energy below 125 Hz are commonly encountered. As a result, it is recommended that ambient noise levels conform to NCB (noise criteria) 20 or better [3], which specifies allowable levels in the 16 to 8000 Hz octave bands.

During jury evaluations, the station at which the subject is located should be free from influences from the other subjects. Many times partitions are placed between subjects to minimize the interaction between subjects. When listening to low-level sounds, subjects with colds or respiratory ailments can make it difficult for not only themselves, but also the adjacent subjects, to hear the stimuli.

DECOR The listening room should be a comfortable and inviting environment for the subject. The room should look natural as opposed to high tech. The more clinical the room looks, the more apprehension and anxiety the subject will experience. Neutral colors should be used for the walls and furniture. Comfortable chairs and headphones (if used) are essential to reducing the distractions and keeping the subject focused on the task at hand. Moderate lighting should be used. Lighting which is too dim may reduce the subject's attention to the desired task, especially during lengthy or monotonous listening evaluations.

AIR CIRCULATION, TEMPERATURE AND HUMIDITY The listening area should be air conditioned at 72 to 75°F and 45 to 55% relative humidity. Air circulation and filtration should be adequate to prevent distractions due to lingering odors. Construction materials used in the facility should be nonodorous.

    SUBJECTS

In this document, the term subject is used to refer to any person that takes part in the evaluation of sounds in a listening study. This section discusses the selection and training of these subjects.


SUBJECT SELECTION Some of the factors that should be considered when selecting subjects include subject type, the number of subjects required, and how these subjects are obtained.

    Subject type Subject type is defined based on listeningexperience, product experience, and demographics.

Listening Experience As a general rule, it is desired that the listening experience level of subjects is appropriate to the task at hand as well as representative of the target customer. An experienced listener may be more capable of judging certain sound attributes than an inexperienced subject. An example is the evaluation of loudspeaker timbre for high-end audio systems [4]. For this task, audiophiles familiar with the concept of timbre are the subjects of choice. An inexperienced listener would no doubt have difficulty in discerning the nuances important to the expert. However, most sound quality evaluations do not require such a high level of expertise. Most automotive sound quality work falls into this category and, generally, subjects are not required to have previous listening experience. In fact, in these cases, using only experts may not be a desirable thing. Experts often pick up little things that are not particularly important to the customer. Generally, screening subjects for hearing loss is not done. To be done properly, hearing tests require skills and equipment usually not at one's disposal. In addition, such testing may violate subject privacy. Presented in the Test Preparation and Delivery section are methods of detecting poor subject performance, and these methods will help to identify any hearing-related performance issues.

Product Experience Listeners' judgments of sounds are always influenced by their expectations. These expectations are, in turn, affected by the experience they have with the product. Thus, it is important that one is conscious of these expectations when selecting subjects. For example, one would not use luxury car owners to evaluate the engine noise of sporty cars. The product experience of the subjects must be matched to the task at hand. Because company employees generally have exposure to all product segments, they are often immune to this effect. Because of this, company employees may be used as subjects for the majority of listening evaluations. One should use actual customers as subjects when segment-specific information is required.

Demographics In the listening studies the authors have done over the years, a huge dependence of the results on subject demographics has not been observed. Nevertheless, the subject population should contain a demographic mix (age, gender, economic status) that is representative of the customer base for the product. Generally, customers from only one vehicle segment are used to ensure proper demographics. Note that a representative mix does not always mean an equal demographic mix. For example, in the luxury vehicle segment, the mean customer age is well over 50 and the owners are predominantly male. When company employees are used as subjects, it is more difficult to control demographics. Usually an attempt is made to use roughly equal numbers of males and females in evaluations.

Number of subjects This section discusses the number of subjects needed for a listening evaluation. This decision is greatly influenced by whether extensive subject training is required, as well as by the difficulty of the evaluation task.

Simple Evaluation Tasks Fairly simple evaluation methods (like those described in the Test Preparation and Delivery section) require little or no subject training. As a result, a number of subjects can take part in the study over a relatively short amount of time. Furthermore, if a facility exists which allows several subjects to listen to the recorded sounds simultaneously, a large number of people can take part in the evaluation. The question is how many subjects are required to obtain representative results. This is an important point because it is often implicitly assumed that results obtained with N subjects would be unchanged if 2N or 10N subjects were used. The question is what is the value of N for which this assumption is approximately true. If one knew the distribution of the subject responses, then the value of N could be calculated for a given confidence level. There are several limitations to measuring this distribution. One is simply the time involved in gathering such data. Another is that the response scales used do not always conform to the usual statistical assumptions, like those of the central limit theorem. Thus, one must rely on experience for these estimates. In general, the more subjects the better, but time constraints always are a factor. Usually, 25 to 50 is an appropriate number of subjects for listening studies which use company employees as subjects. About 10% of these subjects will have their data removed because of poor performance. If customers are used as subjects, however, 75-100 participants are selected [5]. Customers have greater variability in their responses than employees and tend to exhibit a higher proportion of poor performers. Finally, always schedule more than the required number of subjects to allow for no-shows (about 20% of the subjects).
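The scheduling arithmetic implied by these rules of thumb can be sketched as follows. The ~10% data-removal and ~20% no-show figures are the ones given above; the target of 40 usable subjects and the function name are made up for illustration:

```python
# Sketch (not from the paper): how many subjects to schedule so that,
# after ~20% no-shows and ~10% of data removed for poor performance,
# the desired number of usable subjects remains.
import math

def subjects_to_schedule(usable_needed, removal_rate=0.10, no_show_rate=0.20):
    # Subjects who must actually attend, allowing for data removal...
    attending = math.ceil(usable_needed / (1 - removal_rate))
    # ...then inflate for expected no-shows among those scheduled.
    return math.ceil(attending / (1 - no_show_rate))

print(subjects_to_schedule(40))  # 57 scheduled to end with ~40 usable
```

For a customer jury targeting 75 usable participants, the same arithmetic suggests scheduling on the order of 105.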

Complex Evaluation Tasks Difficult evaluation tasks generally require some training of the subjects prior to evaluation. This training serves to familiarize the subjects with both the sounds and the evaluation task. While methods requiring extensive training are outside the scope of this paper, it is instructive to note that the training time often limits the number of subjects that can participate in the evaluation to a small number, certainly less than 10 and often less than 5. As a result, the inter-subject variability is quite high in these studies.

    Recruiting subjects

Company Employees Recruiting company employees is generally fairly easy. An announcement asking for volunteers can be posted on the e-mail system and reach a large number of people. As a rule, employees are generally not paid for their participation in listening clinics. However, company employees have been provided with refreshments in appreciation of their efforts. Most employees are glad to give their time to help improve the company products. When using employees, it is very important to keep each subject's results confidential. These evaluations should not become a contest to see who gets the highest grade.

Customers Recruiting customers to participate in listening clinics is more involved than recruiting employees. Since the task often focuses on specific product segments, the owners of these products must first be identified. This information is not always readily available, particularly if one wants to include owners of competitive products. This is why it may be necessary to work with market research companies when dealing with customers. Using owner registration lists, these companies will identify, recruit, and schedule subjects for testing. In addition, they will also gather demographic information. To attract customers, the participants are paid for participation in the clinic. This payment can range from $50-$150 depending on the time required and the type of customer. It takes a larger fee to interest luxury car owners than it does compact car owners.

SUBJECT TRAINING Training refers to the process of acclimating subjects to both the sounds and the evaluation task. Depending on the difficulty of the task, this training can range from very simple familiarization to extensive regimes.

Simple evaluation tasks A distinction should be made between simple and complex evaluation tasks. Simple tasks often require the subject only to give their opinion about the sounds. These opinions may take the form of choosing which of two sounds the subject prefers or rating certain sound attributes. Since most people are accustomed to giving opinions, little or no training is needed beyond the simple mechanics of the evaluation. Since it is recommended to use real-life, product-derived sounds that most everyone has heard before, subjects do not need to be trained to recognize the sounds. For these cases, the subjects are simply familiarized with the sounds and the evaluation process by having a practice block at the start of each session. All the sounds in the study should be included in this practice block so that subjects get an idea of the range covered in the evaluation. This is particularly important when sounds are presented, and evaluated, sequentially. The evaluation of any given sound can be adversely affected if subjects are not aware of the other sounds in the study. This is particularly true if sounds are being rated on scales which are bounded. Hearing a very good sound, a subject might be tempted to use the top of the scale if they did not know that an even better sound existed in the evaluation samples.

Complex evaluation tasks These tasks are those with which the subject has little or no familiarity and which require training to bring performance to an acceptable level. The more common psychoacoustic methods fall into this category. Methods like magnitude estimation, pure tone matching techniques, and detection tasks are just some examples. The principle behind subject training is that performance will improve with increasing exposure to the task. Feedback on performance during training has been shown to increase the acclimation rate. Often this training is very extensive, requiring 3 or more hours per day for a number of weeks depending on the difficulty of the task. Training is complete when subject performance reaches an asymptote. Bech [6, 7] gives a very good discussion on subject training.

    SAMPLE PREPARATION

GOOD RECORDING/MEASUREMENT PRACTICE Good recording practices are dictated when preparing sound samples to be used in any jury test of real product sounds. Adequately prepared samples do not ensure a successful investigation, yet poorly recorded or edited sound samples can ruin an otherwise valid test. Therefore, close attention must be paid to recording practices.

In general, a literal representation of product sounds is desirable so that jurors get the sensation of participating in the original auditory event, whether that event is riding in the passenger seat of an automobile or experiencing a fine orchestra in a renowned concert hall. The most popular way of achieving this authentic near-duplicate of the original event is through the binaural recording technique, which employs an artificial head to record the sound onto some form of digital media to be audited later in a controlled environment, usually via high-quality headphones. This section will be addressed with an emphasis on artificial head recordings, although some of the same considerations can be applied to other recording techniques. Following is a guide for stimuli preparation for presenting real product sounds to jurors for listening tests.

LEVEL SETTING AND CALIBRATION In general, digital recordings are considered the standard today and should be used if possible. Currently, digital audio tape recordings are the most popular media for SQ recordings. A high dynamic range (90+ dB) and inexpensive cost per megabyte of data storage are two of the most appealing properties of this media. The following guidelines should be followed for achieving authentic recordings.

Recording practices All sounds to be used in a particular listening test should be recorded using the same sensitivity and equalization settings on the measurement system, if at all possible. This reduces the likelihood of operator error being introduced with hardware changes during recording, since recording settings are unchanged. Recordings made with the same transducer sensitivity may permit the sounds to be presented via calibrated headphones or loudspeakers at the correct playback volume, so they do not need any additional amplitude compensation through software or hardware. Another benefit of this practice is that it ensures the same transducer/recorder self-noise is recorded each time. This is an important issue since different transducer sensitivities will have different background noise levels due to the corresponding instrumentation self-noise. This unwanted noise, which varies amongst recordings made with different sensitivities, may affect the perception of the sound stimuli or be distracting, especially when compared back-to-back in A-B fashion.

Recording range As with making measurements with other transducers, it is good measurement practice to choose a recording range that yields the highest recorded signal without measurement system overload, so that the maximum number of bits in the digitized signal are modulated. This ensures that the greatest differential in level exists between the sound under investigation and the noise floor of the measuring system. This is not a trivial task, since the range must accommodate the loudest sound of interest, so a bit of intuition is needed to set a recording level that is high enough to capture the loudest signal without overloading, yet not so low that unnecessary noise enters the recording. Setting the maximum without overload is easier to accomplish with the advent of improved digital recording technology of 20 bits and beyond. The additional 4 bits provide 24 dB more dynamic range, making fine-tuning of recording amplitudes less critical.
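The "4 bits ≈ 24 dB" arithmetic above follows from each bit contributing 20·log10(2) ≈ 6.02 dB of dynamic range, which a short sketch (not from the paper) can verify:

```python
# Sketch of the dynamic-range arithmetic: each additional bit of
# resolution contributes 20*log10(2) ~= 6.02 dB, so going from 16-bit
# to 20-bit recording gains about 24 dB of dynamic range.
import math

def dynamic_range_db(bits):
    # Simple 6 dB/bit rule as used in the text; the textbook quantization
    # SNR for a full-scale sine adds a further constant 1.76 dB, which
    # cancels when comparing two bit depths.
    return bits * 20 * math.log10(2)

gain = dynamic_range_db(20) - dynamic_range_db(16)
print(round(gain, 1))  # 24.1 dB of extra dynamic range
```

The same rule gives roughly 96 dB for 16-bit media, consistent with the "90+ dB" figure quoted above for digital audio tape.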

Measurement variation Measurement (recording) variation not due to instrumentation changes should be controlled as closely as possible.

Recording Variations Examples of factors that influence the perception and measurement results of a recording are room acoustics (or cabin acoustics, as in automotive or airframe), direction and distance to the sound source, and background noise or extraneous noise infiltration (as mentioned in the previous section). Obvious exceptions to these guidelines would be a jury test used for architectural acoustics purposes, where the environment is the focus of the investigation, or a test for speech intelligibility, where various speech segments are part of the stimulus. In these examples, it is advantageous to measure and record different acoustic environments since this is key to the nature of the test. Differences in recording equipment and acoustic environment should be kept to a minimum within a set of sound stimuli; otherwise jurors may be sensitized or cued to unintentionally give a response based on these perceived changes instead of focusing on the important characteristics of the stimuli.

Sample Variations When making recordings, it is important to use or actuate products in the same manner, being careful not to bias a test because components were used at different speeds or other test conditions. One problem often encountered is a noticeable pitch discrepancy between sounds recorded at the correct speed and a sound recorded slightly off-speed. Sometimes, however, products must be operated at different speeds to get the same performance, such as power output or cooling effect, for example. Whatever the circumstances, it is important to be consistent. For example, if the listening test is for automotive door closing sounds, then the data collection should be consistent across all closing events; each door must be closed the same way with the same physical parameters.

Measurement Position It is important to have procedures in place that require consistent transducer placement, whether artificial head or microphone, so that consistent measurements are made.

SAMPLE (SOUND) SELECTION Listening test sample selection should be governed by the purpose of the investigation. For example, if the objective of the test is to find where the sound quality of a product lies in comparison to competitive products, then real product sounds from competitors as found in the marketplace should be used. The sounds, recorded using real products, should be as authentic as possible and reflect what a consumer would hear and experience under normal use conditions. Sometimes products are tested under extreme operating conditions, such as those used in a wide-open-throttle vehicle acceleration test for powertrain sound quality. By testing the extremes using this strategy, it may be easier to pinpoint differences among product sounds and find potential sound quality shortcomings and vulnerable areas that need to be targeted for improvement. If, however, the objective of the investigation is to determine the optimum sound quality for a product, then the collection of sound stimuli should exhibit variation across the relevant attributes (like drone, hum, hiss) for that family of sounds or across sensation continua (loudness, pitch, or roughness). By studying the level of the attributes or sensation continua and the impact of their qualities on the whole sound, the optimum overall sound quality may be found. Consumer terms such as "liking," "minimum annoyance," or "comfort," etc. can be posed to the jurors to study and pinpoint an optimum attribute mix.

Sample editing and preparation This discussion of sample preparation is limited to audio stimuli and pre-stimuli preparation for listening tests.

Extraneous Noise Since test accuracy is directly influenced by the accuracy of the sounds themselves, careful screening of the sound stimuli is important. The purpose of screening is to ensure that the sound is true to the original situation. Screening is accomplished by simply auditing the sounds as they are to be presented. The sound samples should be prepared so that they include no extraneous noises. If the sounds are to be presented in a quiet room via headphones, then screening should be performed under the same conditions. For example, where product sounds are concerned, extraneous sounds, like talking or environmental noise, that are not intended to be heard by jurors should be eliminated from the recordings by editing or by choosing a different data segment that is not contaminated. Recordings should be made in quiet environments, absent of noises to distract from, or interfere with, the product sound. Only the intended sounds should be audible. The stimuli should be free of unwanted noises that might interact with or distract from the test. Sounds should be scrutinized for unwanted noises before they are administered to a jury.

Equalizing Loudness Sounds may be amplitude-adjusted so that each sound in a group gives the same calculated loudness value. This is a useful technique when it is necessary to have listeners focus on important quality aspects of the sounds other than loudness level. This can be very effective when using pair-comparison test strategies.
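A minimal sketch of this amplitude adjustment is given below. Note the simplification: the paper equalizes a calculated loudness value, whereas this sketch matches RMS level as an assumed stand-in; a real implementation would iterate on a perceptual loudness model (e.g., Zwicker loudness) rather than RMS:

```python
# Illustrative sketch: scaling each sound in a group to a common level so
# listeners can focus on quality aspects other than loudness. RMS-level
# matching is used here as a simplified stand-in for calculated loudness.
import math

def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def match_rms(signal, target_rms):
    """Return a copy of signal scaled to the target RMS level."""
    gain = target_rms / rms(signal)
    return [x * gain for x in signal]

quiet = [0.1, -0.1, 0.1, -0.1]   # hypothetical sample values
loud = [0.4, -0.4, 0.4, -0.4]
adjusted = match_rms(quiet, rms(loud))
print(round(rms(adjusted), 3))  # 0.4 -- both sounds now share the same RMS
```

After such an adjustment, a pair-comparison presentation of the two sounds no longer cues jurors with a level difference.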

Sample Length In general, for steady sounds, the sound stimuli should be between 3 and 5 seconds long. For transient events, such as door closure sounds, the single event may need to be repeated per comparison if greater differentiation is sought between short-duration sounds.

Binaural Recording Equalization Equalization is achieved during recording and playback. Its purpose is twofold: to make recordings sound authentic and to allow measurement compatibility. It is important to be consistent with equalization. Improper or unmatched record-playback equalization pairs can affect the tone color of sound presented using the system. Use the same record equalization and playback equalization, since they are a matched set. There is a paradox in the use of artificial head equalization in some measurement systems that use software to equalize headphone playback. Most modern artificial heads used for sound quality work have only an ear-canal entrance (or a cavum conchae only) and no ear canal that would otherwise normally extend to a microphone placed at the eardrum location. The ear canal is not needed since it does not add any directional cues to the sound signal. Since there is no ear canal, it is not appropriate to use these artificial heads to measure headphone characteristics to generate an equalization curve, since an erroneous acoustic impedance of the coupling of headphone to ear canal will result. This makes measured headphone equalization, and any sound heard through this system, inaccurate or different from the original.

    FF - Free Field Equalization. Free field equalization is suitable only when the artificial head is directly in front of the test object (0 deg azimuth, 0 deg elevation) and in a reflection-free environment. FF provides a two-stage equalization that nulls the cavum and ear canal entrance resonance and normalizes the measurement to give a flat amplitude response for frontal sound incidence. FF playback equalization is necessary to flatten the headphone response and reverse the effect of the normalizing equalization for frontal sound incidence.

    DF - Diffuse Field Equalization. Diffuse field equalization assumes that all incoming sound frequencies and directions are weighted equally; the inverse of the average artificial head response is applied. While not truly measurement-microphone compatible because of the averaging effect, it is included as a feature across manufacturers of artificial heads.

    ID - Independent of Direction Equalization. ID is a general-purpose equalization that nulls the cavum and ear canal entrance resonance so that instrumental measurements, such as frequency spectra or loudness measurements, do not show these resonances. This gives measurement results consistent with what an omnidirectional microphone might yield while also providing realistic playback capability. In general, ID equalization should be used whenever the sound field is spatially distributed, whether due to source location, number of sources, or room/cabin acoustics. In most cases, sound fields in recording environments for product sound quality are neither diffuse nor free, so ID equalization is the correct choice.

    Other recording issues

    Sampling Frequency  Typical sample frequencies are 44.1 kHz, the compact disc standard, and 48 kHz, the digital audio tape standard. Other sample rates, such as 32 kHz, are available on some DAT machines and computer sound cards. 44.1 and 48 kHz are the most popular sample rates since their frequency response spans the audible frequency range. A 44.1 kHz sample rate should be employed if CD audio files are desired in the future. Many companies have standardized on recording at a single sample rate (44.1 kHz, for example). This ensures that computational results based on the audio files do not differ because of sample rate differences. Also, depending on the hardware, clicks may result on the audio output when sounds with different sample rates are played back-to-back. Sounds with differing sample rates should be thought of as incompatible, since they have different time resolution, which will usually affect any data analysis performed on them. Using a re-sampling algorithm, sample rates can be changed to make data compatible for playback and analysis.
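
As an illustration of re-sampling, here is a bare-bones converter using linear interpolation (production tools use band-limited sinc interpolation to avoid aliasing; the function name and signal are invented for this sketch):

```python
def resample_linear(x, src_rate, dst_rate):
    """Resample a signal by linear interpolation between samples.

    Illustrative only: real re-sampling should be band-limited
    (sinc interpolation) to avoid aliasing artifacts.
    """
    n_out = int(len(x) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        t = i * src_rate / dst_rate      # position in source samples
        k = int(t)
        frac = t - k
        nxt = x[k + 1] if k + 1 < len(x) else x[k]
        out.append(x[k] * (1 - frac) + nxt * frac)
    return out

dat = [0.0, 1.0, 0.0, -1.0] * 100        # invented 48 kHz recording
cd = resample_linear(dat, 48000, 44100)  # convert to 44.1 kHz
```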

    Other Inputs  Embedded tachometer signals are used to record a speed or rotational signal related to a product's sound. This signal is particularly useful for modifying (removing, reducing, amplifying, or extracting) parts of the sound that are related to the speed. This feature helps sound quality personnel solve problems and also play what-if games by modifying existing product sounds to create new ones. The embedded tachometer is registered by modulating the least significant bit of the digital word and storing it in the sound signal. The tachometer signal should not be audible if the playback equipment is capable of ignoring the least significant bit.


    Quantization  Quantization refers to the number of levels in A/D conversion. In sound quality work, 16-bit recordings (or 15-bit plus pulse signal) or better (20-bit) are usually used for jury work. This maintains the level of accuracy necessary for transmitting these signals.
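
Each added bit of quantization buys roughly 6 dB of dynamic range, which is one reason 16-bit or better recording is preferred; a back-of-envelope check:

```python
import math

def dynamic_range_db(bits):
    # Ideal linear quantizer: 20 * log10(2**bits), about 6.02 dB/bit.
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))  # ~96 dB for 16-bit audio
print(round(dynamic_range_db(20), 1))  # ~120 dB for 20-bit audio
```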

    Emphasis  In a digital recorder, emphasis is provided by a pair of analog shelving filters, one before A/D conversion and one after D/A conversion, to provide a 1:1 input-versus-output signal. The emphasis circuit provides a signal with a high frequency boost, as much as 9 dB at 20 kHz. The purpose of emphasis is to improve the signal-to-noise ratio of a recording by yielding a "hotter" signal with greater high frequency content. An emphasized signal must be de-emphasized, via hardware or software, for either analysis or listening tests. In general, emphasis should be avoided in light of compatibility issues and improvements in digital recording technology that afford quieter measurement systems and higher dynamic range.

    TEST PREPARATION AND DELIVERY

    SOUND PRESENTATION (PLAY) ORDER  Incorporating a method of controlling sound presentation order is important for reducing experimental error due to biases as much as possible. Guidelines follow for the testing strategies mentioned.

    Paired comparison tests

    Forced Choice Task  For a paired comparison of preference test, usually t(t-1) pairs are presented, where t is the number of sounds in the study. This is known as a two-sided test. One kind of presentation order effect is removed since each sample appears with each other sample twice, but in opposite order. For example, the stimulus pair blue-red would be repeated in reverse order as red-blue. The two-sided test introduces other niceties, such as the ability to check subject consistency or to see if subjects' overall agreement improves during the test. The two-sided test is, of course, roughly twice the length of a one-sided test, which is its only disadvantage. To optimize the play order and further reduce presentation effects, pairs should be evenly spread out so that no two identical pairs are near each other and no two adjacent sounds in the play order are the same. Strategies for determining optimum presentation order can be found in [8]. Ideally, each juror would hear a different play order, but this is usually prohibited by available hardware and test set-up time.
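
A simple randomized heuristic for building such a two-sided play order might look like this (a greedy sketch, not one of the optimal designs of [8]; all names are illustrative):

```python
import itertools
import random

def two_sided_order(sounds, seed=1):
    """Greedy randomized play order for a two-sided paired comparison:
    every ordered pair appears once, a pair is never immediately
    followed by its reverse, and the last sound of one pair never
    equals the first sound of the next."""
    rng = random.Random(seed)
    pairs = list(itertools.permutations(sounds, 2))   # t*(t-1) pairs
    for _ in range(1000):                             # retry on dead ends
        pool, order = pairs[:], []
        rng.shuffle(pool)
        while pool:
            prev = order[-1] if order else None
            ok = [p for p in pool
                  if prev is None
                  or (prev[1] != p[0] and set(prev) != set(p))]
            if not ok:
                break                                 # greedy dead end
            pick = rng.choice(ok)
            order.append(pick)
            pool.remove(pick)
        if not pool:
            return order
    raise RuntimeError("no valid order found")

order = two_sided_order(["A", "B", "C", "D"])
```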

    Scaling Task  For a paired comparison of similarity or difference test, usually t² pairs are prepared for presentation. Unlike the paired comparison of preference, same-sound pairs, such as red-red and blue-blue, are presented. The same general guidelines described above for paired comparison (and for the other test methods, such as response scales) can be followed, although the data scaling and interpretation are much different.

    PRESENTATION OF SAMPLES

    Pacing or timing

    Self-paced Tests  Self-paced means that the juror has control of the test and can play the sounds as many times as necessary. Using this methodology it is possible to deliver the same sounds to each juror but with a different play order, which can be used to minimize the effects of play order on the test results. This is usually executed using a computer-based jury system.

    Paced Juries  Paced jury tests present one set of stimuli to several jurors at once, typically through headphones.

    Sample size  The number of samples included in a test is usually chosen based on the test length constraints and the number needed to reach some desired level of product variation. Another consideration is that as the number of samples increases (and they better represent the variation to be expected among the product sounds), the likelihood of building a more robust or accurate model relating to consumer preference goes up. Clearly, the most important first consideration is the test methodology to be used (Jury Evaluation Methods section), which can govern the number and range of stimuli to be presented.

    Test length  The length of the test is important due to the potential for juror fatigue, which, in turn, depends on the level, duration, and annoyance of the stimuli. The health of the juror must also be kept in consideration, since exposure to high noise levels can cause long-term hearing damage. In addition, a test that is too long produces results that are less discriminating. In general, try to limit the maximum test length to 30-45 minutes.

    Sound reproduction method

    Loudspeakers  Sounds may be audited through loudspeakers. Presentation of the same stimuli to all jurors is difficult, however, since the speaker type, position, and listening room will influence the sound. Loudspeaker playback is more appropriate for products recorded in a reflection-free environment (free field) whose source is not spatially distributed.

    Headphones  Headphone auditioning can be an effective way to ensure that each juror hears the same stimuli under the same conditions. Headphones can be level-calibrated and equalized so that their responses are equivalent. Headphone auditing also allows flexibility in jury facility setup. Since the sound each auditor hears is presented over headphones, it is not influenced by the room's acoustic properties or by listener positioning.

    Low Frequency Phenomena with Headphones  Jurors may report that the sound they hear via headphones is "louder" or has "more bass" than the real product sound they are accustomed to, even though the sound is delivered at the correct calibrated playback level. This discrepancy is introduced when the mind tries to assimilate and process what is seen in conjunction with what is heard. Under normal hearing conditions there is no problem, since the visual and auditory scenes agree. However, when listening to a recorded acoustic image while experiencing the visual cues of a listening room, a mismatch exists. For best results and the most authentic experience, sounds should be played back over headphones in the original sound environment, or in a simulation or mock-up of that environment. Most times this is impractical, though, and listening tests are usually limited to a jury room. Anything that improves the visual context in which the sound is experienced is a plus.

    Headphones with Subwoofer  Headphone listening can be enhanced through the use of a subwoofer system, which can replace missing low frequency sound energy that normally impinges on the human body during exposure to sound events. The subwoofer system can improve the realism of sound playback by giving a sense of the importance of the vibration behavior of the device under scrutiny and a better sense of the low frequency sound characteristics of the product.

    Visual stimuli  Context improvements can help the jurors get a sense of "being there" at the site of the original recording. By demonstrating the product in use in a video, or even a still picture, the expectation of the jurors is better controlled and focused on the product under test.

    DATA COLLECTION ENVIRONMENT

    Forms processing  Using software and a scanner, a test administrator can design a form to collect juror responses. The bubble-in type survey is a very familiar interface and should require little or no juror training. The test administrator may elect for the form data to be sent to a master database once form images are scanned into the computer; from the database, statistics can be applied and conclusions drawn about the data. Form entry, while being the most flexible data collection device, requires a form to be designed each time a new type of test is taken. Fortunately, templates can be created so that the design time can be greatly reduced.

    Computer display  A computer can be utilized to set up jury tests. Such a system uses a desktop computer and a professional sound card to deliver listening tests. The test may be conducted on an individual workstation, where it is run separately for each panelist, or with a group, where all panelists are presented the samples at the same time for evaluation.

    Hand-held devices  Hand-held devices, controllers, and PDAs could be set up to collect data into a computer.

    SUBJECT INSTRUCTIONS  The instructions given to subjects are very important when trying to obtain good subjective data without inadvertently biasing the jury. Sample instructions for paired comparison, semantic differential, attribute intensity scaling (response scaling), and magnitude estimation tasks are provided in the appendix. Every evaluation is unique and may require significant alterations of the examples given; they are intended to provide a starting point for the evaluation organizer. The jury evaluation methods are discussed in the next section.

    JURY EVALUATION METHODS

    This section discusses the methods that are used to elicit opinions of sounds. The term jury evaluation is meant to be synonymous with other like descriptors such as listening test and, more generally, subjective evaluation. These methods define both the presentation and evaluation format. In addition, they may also imply particular analysis methods (Analysis Methods section).

    DEFINITION OF SCOPE  No attempt is made to discuss every possible psychometric method in this section. For testing consumer products, it is important that the subjective results be representative of customer opinion. Since most consumers are neither sound quality experts nor audiophiles, the scope will be limited to those methods appropriate to inexperienced, relatively untrained subjects (see SUBJECTS section). This excludes many traditional psychoacoustic methods (like matching techniques [9] and Levitt procedures [10]) from this discussion.

    METHODS  Several jury evaluation methods appropriate for inexperienced, untrained subjects are discussed in the sections that follow. While each method has its strengths and weaknesses, it is important to note that no one method works best for every application. It is very important to choose the method which best fits the application.

    Rank order  Rank ordering of sounds is one of the simplest subjective methods. Subjects are asked to order sounds from 1 to N (where N is the number of sounds) based on some evaluation criterion (preference, annoyance, magnitude, etc.). The sounds are presented sequentially, and the subjects often have the option of listening to a sound as many times as they want. However, since the complexity of the ordering task grows combinatorially with the number of sounds, the sample size is usually kept low (six or fewer). The major disadvantage of this method is that it gives no scaling information. While ranking will tell one that sound A is, for example, preferred to sound B, it does not reveal how much more A is preferred over B. Because of this lack of scaling information, rank order results are not useful for correlation with the objective properties of the sounds. Rank ordering is used only when one simply wants a quick idea of how the sounds compare; for example, when evaluating customer preference for alternative component designs.

    Response (rating) scales  In general, the term response scale refers to any evaluation method in which subjects' responses are recorded on a scale. For this discussion, however, the subject will be limited to numbered response scales, like the familiar 1-10 ratings; the discussion of descriptive scales is deferred until later. Numbered response scales (e.g., 1-10) are very familiar to most people. Subjects rate sounds by assigning a number on a scale. The sounds are presented sequentially with, generally, no option to replay. This method is quick and, on the surface, easy, and scaling information is directly provided. However, rating scales can be difficult for inexperienced, untrained subjects to use successfully. Some of the reasons for this are given below.

    1. Numbered response scales do not allow the subjects to express their impressions in an easy and natural way. Inexperienced subjects have no idea what a "3" or a "5" or an "8" rating means in terms of their impressions. When, for example, people normally listen to an engine sound, they do not say that the engine sounds like a "3" or a "5" or an "8". Instead, they would describe the engine as loud, rough, powerful, etc.

    2. Different subjects use the scales differently. Some use only a small rating range, while others may use most of the scale. An example of this effect is given by Kousgaard [11], in which four different loudspeakers were rated by five subjects on a 0-9 scale. The rating ranges for each subject are given below.

    Subject 1    3.0-7.0

    Subject 2    6.0-7.2

    Subject 3    6.5-8.5

    Subject 4    0.0-8.0

    Subject 5    6.0-8.4

    While subjects 2, 3 and 5 have reasonably similar ranges, subjects 1 and 4 use the scale very differently. Another source of inter-subject rating variability is that different subjects use different sound attributes as the basis for their judgments; placing the attribute to be rated on the scale can eliminate this problem. In any case, with intrinsic rating differences like those shown above, statistics like the average may be misleading.

    3. The extremes of the scales (e.g., "1" and "10") are generally not used. Because sounds are usually rated sequentially, subjects avoid extreme ratings for the current sound just in case an upcoming sound is better (or worse). An example of this effect is taken from a study of powertrain sound quality [12]. A paired comparison of similarity was conducted in which subjects rated similarity on a 1-10 scale, with 1 being the most dissimilar and 10 the most similar. The results showed that the rating extremes were never used, even when a sound was compared to itself! The numbered scale was then replaced with an unnumbered line labeled "Very Dissimilar" and "Very Similar" at the ends. With this method, subjects readily used the extremes.

    4. There is absolutely no reason to believe that ratings on an arbitrary interval scale should correlate with the objective characteristics of the sounds. While trendwise agreement between subjective ratings and objective values can be achieved, correlation requires ratings that are proportional to the objective characteristics of the sounds. This is rarely achieved with rating scales.

    In summary, rating scales are fraught with difficulties foruntrained subjects and should be used with caution.

    Paired comparison methods  Paired comparison (PC) methods are those in which sounds are presented in pairs and subjects are asked to make relative judgments on the sounds in the pair. This same basic paradigm can be extended to more than two sounds, but the discussion here will be limited to the paired presentation. For paired sound presentations, a number of evaluation tasks have been developed. Three of these will be discussed in some detail; the first two are forced choice procedures, where the subject must choose one sound in the pair, while the last is a scaling task.

    Detection Tasks  In this implementation of the paired comparison method, the subject must choose which of the sounds in the pair contains the signal to be detected. This method is often used for determining detection thresholds, for example, detection of a tone masked by broadband noise. One of the sounds in the pair is the masker alone and the other is the tone plus masker. The level of the tone is varied from pair to pair. Since the experimenter knows the answer (which sound contains the tone), subject performance (the psychometric function) is measured by the percentage of correct answers. The tone level at which the subjects' percentage correct equals 75% is defined as the threshold [10]. Difficult detection tasks require extensive subject training.

    Evaluative Tasks  In this method, subjects make relative judgments (pick A or B) on sounds presented to them in pairs based on some evaluative criterion. The criterion for these paired judgments can be virtually anything. Preference is often used as the basis for judgment; if all the samples are unpleasant, preference seems inappropriate and annoyance is often used. Attributes like loudness and roughness can be used, but care must be taken to ensure that subjects can readily distinguish the attribute among the different sounds being evaluated. This pair judgment process is repeated until all possible pairs have been evaluated (complete block design). Very often, a replicate experiment is conducted. A variation on this procedure is to include scaling information as part of the judgment [11]; an example might be to both pick which sound in the pair you prefer and rate how much more you prefer that sound on a 1-10 scale. The scaling is generally unnecessary because PC models like Bradley-Terry (see the Analysis section) will give the same information with much less work. Since judgments are relative, not absolute, subjects never have to worry about previous or future judgments. The PC method is very natural and easy for untrained subjects to use because it reflects something they do in everyday life: make comparisons. A number of successful studies using the paired comparison method have been conducted [13,14,15]. One disadvantage of the paired comparison method is that the number of pairs can be quite large, since it grows as the square of the number of sounds. More specifically, the number of pairs in a full paired comparison is t(t-1)/2, where t is the number of sounds. This means that the evaluation can be quite tedious if there are a large number of sounds. An incomplete block design can sometimes alleviate this problem. In this design, only some of the pairs are evaluated, and these results are used to infer how the other pairs would have been judged. This approach will work well only if the presented pairs are chosen appropriately. This is best done adaptively, where the next pair presented is based on the current results.
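
For instance, Bradley-Terry "worths" can be recovered from forced-choice counts with a few lines of the standard minorization-maximization iteration (a sketch; the win counts below are invented for illustration):

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry worths from a win-count matrix with the
    standard MM iteration; wins[i][j] = times sound i beat sound j."""
    t = len(wins)
    p = [1.0] * t
    for _ in range(iters):
        new = []
        for i in range(t):
            w_i = sum(wins[i])                           # total wins of i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(t) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [v / s for v in new]                         # normalize to 1
    return p

# invented counts: 10 jurors, each ordered pair judged once (two-sided)
wins = [[0, 13, 16],
        [7, 0, 12],
        [4, 8, 0]]
worths = bradley_terry(wins)
```

The fitted worths provide exactly the scaling information a bolted-on 1-10 rating would have tried to capture, which is why the extra rating step is generally unnecessary.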

    Similarity Tasks  Unlike detection and evaluation, similarity is not a forced choice but a scaling task. Sounds are again presented in pairs, but instead of choosing one of the sounds, an estimate of their similarity is made. The similarity judgment is rated on an unnumbered line, labeled only at the extremities as "very dissimilar" and "very similar". By not numbering the scale, some of the problems associated with response scales are avoided. All possible pairs are evaluated in this manner. In addition, each sound is paired with itself to help judge how well subjects are performing. After the evaluation, a numbered grid is placed over the line and the markings are converted to numbers, usually 1-10. Similarity scaling is useful for determining how well subjects discriminate among the sounds in the study. Combined with proper analysis techniques, such as multidimensional scaling and cluster analysis, this method can determine the number of perceptual dimensions that underlie the judgments as well as give clues to the important objective properties of the sounds [5, 12].
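
As a sketch of the multidimensional scaling step, classical (Torgerson) MDS recovers a spatial configuration from averaged dissimilarities (the dissimilarity values below are invented; real studies would use dedicated MDS software):

```python
import numpy as np

def classical_mds(d, k=2):
    """Classical (Torgerson) MDS: embed items in k dimensions so that
    inter-point distances approximate the dissimilarities in d."""
    d = np.asarray(d, dtype=float)
    n = len(d)
    j = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    b = -0.5 * j @ (d ** 2) @ j              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)
    top = np.argsort(vals)[::-1][:k]         # k largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# invented averaged dissimilarity ratings (10 = very dissimilar)
diss = [[0, 2, 9],
        [2, 0, 8],
        [9, 8, 0]]
coords = classical_mds(diss, k=2)
```

Plotting the resulting coordinates shows which sounds jurors hear as similar (close together) and how many dimensions carry the perceptual differences.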

    Semantic differential  While the paired comparison method focuses on one attribute of the sounds (preference, annoyance, similarity, etc.), the semantic differential (SD) technique allows evaluation of multiple sound attributes. Subjects evaluate sounds on a number of descriptive response scales using bipolar adjective pairs. A bipolar pair is simply an adjective and its antonym. The adjectives generally consist of attributes (quiet/loud, smooth/rough) or impressions (cheap/expensive, powerful/weak) of the sound. These lie at opposite ends of a scale with several gradations, labeled with appropriate adverbs that allow the subject to rate the magnitude of their impressions. Five-, seven- and nine-point scales are common. A typical seven-point scale is shown below for the quiet/loud category.

    Extremely Very Somewhat Neither Somewhat Very Extremely

    Quiet ___ ___ ___ ___ ___ ___ ___ Loud

    Subjects choose whichever gradation best fits their impression of the sound. In choosing the semantic pairs, it is very important that they are appropriate to the application. This can be done in several ways. Newspapers and magazines are often good sources of semantic descriptors for consumer product sounds; focus groups are another source. In general, technical engineering lingo is not good because customers are usually not familiar with these terms. Finally, pairs chosen for Americans are probably not ideal for Europeans or Japanese; of course, the converse is also true. It is also important to avoid pairs that are closely associated with each other (associative norms). For example, quiet/loud and peaceful/noisy will almost always be correlated, so there is no value in using the second pair since quiet/loud will serve the same purpose. By using both, you are "wasting" a pair which could be better used for other, unique descriptors. This is especially important when you consider that the practical limit is 8-12 semantic pairs.

    Magnitude estimation  Magnitude estimation is a method where subjects assign a number to some attribute of the sound (how loud or how pleasant it is). There is generally no limit to the range of numbers a subject may use; magnitude estimation is basically a scaling task without a bounded scale. This method may offer some advantages over bounded response scale methods (numbered or semantic) in that the subject need never run out of scale. A major disadvantage of this method is that different subjects may give wildly different magnitude estimates. Thus, a key element of this technique is subject training. Magnitude estimation is, initially, more difficult for subjects to accomplish. They must be given a period of experimentation and practice in the technique before the actual data gathering exercise. This is accomplished by providing the subjects with a series of practice sounds and asking them to perform magnitude estimation on those sounds. There is no general rule for the amount of practice that is appropriate, although subject progress can be monitored. Subjects with prior experience in magnitude estimation often require less training; thus, this technique is probably more appropriate for "expert" evaluators. Subject-to-subject variability can be addressed in a number of other ways as well. One is to present a reference sound with a specified magnitude (like 100) and have all other sounds rated relative to that reference (ratio estimation). A variant of this technique again uses a reference, but no value is given to that reference. In either case, presentation of a reference with each test stimulus effectively doubles the length of the evaluation.


    ANALYSIS METHODS

    Analysis methods for sensory evaluation data are numerous. It is the intent of this section to outline methods which have been used successfully with the sound quality jury evaluation techniques described in the Jury Evaluation Methods section. Details of each analysis technique will not be given, but references will be cited in which the technique is used or demonstrated. The techniques discussed here, along with many others used for sensory data analysis, can be found in Meilgaard et al. [16]. Additionally, Malhotra [17] provides a comprehensive examination of statistical analysis used in the evaluation of market research data. Both of these texts are easy to read and provide a good overview of techniques used for the interpretation and analysis of customer data.

    MAGNITUDE ESTIMATION, RATING AND SEMANTIC DIFFERENTIAL SCALES  Magnitude estimation, rating and semantic differential scales fall into the category called interval scaling. An interval scale contains all the information of an ordinal scale but also allows the computation of differences between objects. The data generated by these scales can be analyzed by a number of different methods that are described in the following sections. Before outlining the common analysis techniques used for these types of data, a few words must be said about the normalization of magnitude estimation responses.

    Since magnitude estimation involves the subjects creating their own scales, a method of normalizing responses is needed before any statistical analysis can be performed. There are two basic methods for normalizing the results of a magnitude estimation exercise. The first involves creating a geometric average for each stimulus across the range of subjective magnitude estimation scores for that stimulus. This is done by multiplying all individual scores for a particular stimulus together, then raising that result to the 1/n power, where n is the number of scores. The geometric averaging process ensures that each person's scale is given equal importance in the overall average for the particular stimulus.

    A second technique commonly used involves transforming each subject's range of scores to a percentage scale, then averaging the percentages from all subjects for each stimulus. An individual's maximum score in the set of stimuli is set to 100%, their minimum score is set to 0%, and all values in between are scaled accordingly. This is done for each subject in the evaluation, and the percentages are then averaged together for each stimulus. Bisping [18] gives an accounting of this technique with a comparison to absolute scaling.
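
Both normalizations are easy to express directly; a sketch with invented scores from three subjects (note how subject 2's compressed scale still contributes equally after either normalization):

```python
import math

def geometric_means(scores):
    """Geometric mean across subjects for each stimulus.
    scores[s][k] = subject s's magnitude estimate for stimulus k."""
    t = len(scores[0])
    return [math.prod(s[k] for s in scores) ** (1 / len(scores))
            for k in range(t)]

def percent_normalize(scores):
    """Rescale each subject's scores to 0-100%, then average."""
    rescaled = []
    for s in scores:
        lo, hi = min(s), max(s)
        rescaled.append([100 * (v - lo) / (hi - lo) for v in s])
    t = len(scores[0])
    return [sum(s[k] for s in rescaled) / len(rescaled) for k in range(t)]

# invented magnitude estimates from three subjects for four sounds
scores = [[10, 20, 40, 80],
          [1, 2, 4, 8],
          [5, 15, 30, 50]]
gm = geometric_means(scores)
pct = percent_normalize(scores)
```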

    Distribution analysis

    Measures of Location  Measures of location are used to quantify the central tendency of a distribution of responses. The common measures of location are the mean, median and mode.

    Mean. The mean is simply the value obtained by summing all responses and dividing by the number of responses. The mean can be somewhat deceptive in that it can be heavily influenced by outliers in a data set. A given data set should always be screened for outliers to determine whether the mean is being unduly influenced by just one or two data values and may not be representative of the majority of the population.

    Median. The median is the value above which half of the responses fall and below which half of the responses fall. The median is sometimes used instead of the mean because it is less sensitive to the influence of outliers in the response data.

    Mode. Another measure of central tendency is the mode. The mode is the value which occurs most often in a sample distribution and is commonly applied to data which is categorical or which has been grouped into categories.

    Measures of Variability  Measures of variability are used to describe the spread in a data distribution. Measures of central tendency mean very little without knowing something about the spread of a given set of data.

    Range. The difference between the largest and smallest values in a set of responses. The range can be greatly impacted by outliers and should be used with caution.

    Interquartile Range. The range of the distribution covering the middle 50% of the responses. This measure is much more robust to outliers in a data set.

    Variance and Standard Deviation. The variance is the mean squared deviation of all the values from the mean; the standard deviation is the square root of the variance. The variance and standard deviation are the most commonly used measures of distribution spread.
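
These location and spread measures are all one-liners with Python's standard statistics module; the ratings below are invented for illustration:

```python
import statistics

responses = [6.0, 6.5, 7.0, 7.0, 7.5, 8.0, 9.5]  # invented jury ratings

mean = statistics.mean(responses)
median = statistics.median(responses)
mode = statistics.mode(responses)                 # most frequent value
rng = max(responses) - min(responses)             # range
var = statistics.variance(responses)              # sample variance (n - 1)
sd = statistics.stdev(responses)                  # sqrt of the variance
```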

    Measures of Shape  Measures of shape can be used to quantify the nature of distributions resulting from response data. Common shape measures include the skewness and kurtosis.

    Skewness. Skewness is the characteristic of a distribution which describes the distribution's symmetry about the mean. It is the standardized third central moment of the distribution data. For a normal distribution the skewness is zero. If the skewness is positive, the distribution is skewed to the right. If it is negative, the distribution is skewed left. Skewing of scaling data at the scale extremes is common and can be quantified using skewness.

    Kurtosis. The kurtosis is a measure which quantifies the peakedness or flatness of the distribution relative to the normal distribution. It is derived from the standardized fourth central moment of the distribution data. The (excess) kurtosis of a normal distribution is zero. If the distribution data contain a large number of outliers, the distribution tails will be heavy and the kurtosis will be positive. Conversely, if the data are closely centered about the mean with light tails, the kurtosis will be negative.
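Both shape measures follow directly from their moment definitions. The sketch below (hypothetical data) implements them with the population standard deviation and reports excess kurtosis, so a normal distribution scores near zero on both, matching the convention used here.

```python
from statistics import mean, pstdev

def skewness(xs):
    """Standardized third central moment; ~0 for symmetric data."""
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def excess_kurtosis(xs):
    """Standardized fourth central moment minus 3; ~0 for normal data."""
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4) - 3.0

symmetric = [1, 2, 3, 4, 5, 6, 7]        # symmetric, light tails
right_skewed = [1, 1, 1, 2, 2, 3, 10]    # long right tail
print(skewness(symmetric))               # 0.0
print(skewness(right_skewed))            # positive -> skewed right
print(excess_kurtosis(symmetric))        # negative -> light tails
```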


    Test for Normality Since many test procedures assume that the data distribution is normal, it is desirable to have tests which will indicate whether the response data actually belong to the family of normal distributions. The Kolmogorov-Smirnov (K-S) test statistic is commonly used to test how well a sample distribution fits a particular theoretical distribution. This statistic is defined as the maximum absolute vertical deviation of the sample cumulative distribution function (CDF) from the theoretical CDF. For a normal distribution, the actual population mean and variance must be known in order to generate the theoretical CDF. Of course, this information is not known. Lilliefors' variation of the K-S test statistic substitutes the sample mean and variance for the actual mean and variance. The sample CDF can then be compared to the normal CDF generated from the sample mean and variance.
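A minimal sketch of the Lilliefors-style statistic in plain Python: the theoretical normal CDF is built from the sample mean and standard deviation (via math.erf) and compared with the empirical CDF on both sides of each step. Note that significance must be judged against Lilliefors' critical values, not the standard K-S tables; the data below are hypothetical.

```python
from math import erf, sqrt
from statistics import mean, stdev

def lilliefors_ks(xs):
    """Max absolute deviation of the empirical CDF from a normal CDF
    whose mean and standard deviation are estimated from the sample."""
    xs = sorted(xs)
    m, s, n = mean(xs), stdev(xs), len(xs)
    normal_cdf = lambda x: 0.5 * (1.0 + erf((x - m) / (s * sqrt(2.0))))
    d = 0.0
    for i, x in enumerate(xs):
        t = normal_cdf(x)
        # check the gap just below and just above each empirical-CDF step
        d = max(d, abs(t - i / n), abs(t - (i + 1) / n))
    return d

d_stat = lilliefors_ks([4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.0])
print(d_stat)  # small D suggests the sample is close to normal
```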

    Graphical techniques Enormous amounts of quantitative information can be conveyed by using graphical techniques. Too often experimenters are quick to jump into quantitative statistics before exploring their data qualitatively. Graphical techniques can provide insight into the structure of the data that may not be evident in non-graphical methods. Numerous graphical techniques have been developed for areas such as distribution and data relationship analysis. Only a few of these techniques will be outlined here. For a comprehensive treatment, the text by Chambers [19] is excellent.

    Scatter Plots Scatter plots can be used to explore relationships among data sets. Two subjective sets of data, or one subjective and one objective set, may be plotted against one another to investigate relationships. Scatter plots can reveal information on data relationships that may not be apparent when using numerical approaches. Whether a relationship is linear, logarithmic, etc. can often be guessed at by an initial inspection of a scatter plot. Additionally, data outliers are readily apparent.

    Quantile-Quantile and Normal Probability Plots

    Quantile-Quantile or Q-Q plots are commonly used to compare two data distributions to see how similar they are. When the quantiles of two empirical data sets are plotted against one another, the points will form a straight line if the data distributions come from the same family. Sometimes the quantiles from an empirical data set are plotted against the quantiles from a theoretical distribution to investigate how closely the empirical data set matches the theoretical distribution. Normal probability plots are a special case of this situation in which the theoretical distribution used is the normal distribution. As a result, the normal probability plot can be used as a graphical means of checking for data normality.
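The coordinates of a normal probability plot can be generated without any plotting library; statistics.NormalDist supplies the inverse normal CDF. The plotting positions (i + 0.5)/n used below are one common convention among several, and the sample is hypothetical.

```python
from statistics import NormalDist

sample = [6.1, 5.8, 6.4, 5.9, 6.0, 6.3, 5.7, 6.2]   # hypothetical ratings
ordered = sorted(sample)
n = len(ordered)
# theoretical normal quantiles at plotting positions (i + 0.5) / n
theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
points = list(zip(theo, ordered))  # plot these; straightness suggests normality
```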

    Histograms Another way to summarize aspects of a data distribution is to use a histogram. The histogram divides the data into equal intervals and plots the number of points in each interval as a bar whose height represents the number of data occurrences. Histograms are very easy for nontechnical people to understand; however, the selection of interval size can be very important in determining the conclusions drawn from the histogram. Intervals that are too wide may hide some of the detail of the distribution (multi-modality, outliers, etc.) while intervals which are too narrow remove the simplicity of the display.
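The underlying computation is simple equal-width binning, sketched below with hypothetical data; running it with two different widths shows how the interval size changes the summary.

```python
def histogram_counts(xs, width, lo=None):
    """Count data points falling in equal-width bins starting at lo."""
    lo = min(xs) if lo is None else lo
    counts = {}
    for x in xs:
        b = int((x - lo) // width)          # bin index for this value
        counts[b] = counts.get(b, 0) + 1
    return {lo + b * width: c for b, c in sorted(counts.items())}

data = [1, 2, 2, 3, 3, 3, 4, 4, 7, 8]
print(histogram_counts(data, width=2))  # coarse bins hide detail
print(histogram_counts(data, width=1))  # finer bins show more structure
```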

    Others There are many other graphical techniques commonly used in exploratory data analysis. Stem-and-leaf plots, box plots, bar charts, etc. are examples of graphical techniques commonly used to initially investigate response data.

    Confidence intervals The confidence interval is the range of values of a population parameter that is assumed to contain the true parameter value. The population parameter of interest here is the true mean value of the response data (as opposed to the sample mean). Normally distributed data is assumed. A confidence level is chosen which specifies the probability that the true mean value will be covered by the confidence interval. Confidence levels of 0.90, 0.95 and 0.99 are commonly chosen. For example, if a confidence level of 0.95 is chosen, it can be said that we are 95% confident that the true response mean is contained within the confidence interval calculated from the sample responses.
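As a sketch with hypothetical ratings, the interval below uses the normal approximation, which is adequate for larger juries; for small samples the z value should be replaced by the t critical value.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

ratings = [6.5, 7.0, 6.8, 7.2, 6.1, 6.9, 7.4, 6.6, 7.1, 6.8]
m, s, n = mean(ratings), stdev(ratings), len(ratings)
z = NormalDist().inv_cdf(0.975)        # two-sided 0.95 confidence level
half_width = z * s / sqrt(n)
ci = (m - half_width, m + half_width)
print(ci)  # a wide interval signals large response variability
```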

    Large confidence intervals indicate large variability in the response data. If responses obtained for two different sounds have confidence intervals with significant overlap, the true mean values of the responses may not be different. Observing the overlap of confidence intervals obtained from subject responses is a visually appealing and intuitive method of determining whether significant differences exist in the ratings. However, this process is very qualitative. Tests such as the t-test for pairwise comparisons or analysis of variance for group testing must be performed if rigorous significance testing is desired.

    Testing and comparing sample means (t-test) The t-test is a parametric test used for making statements about the means of populations. It is based on Student's t statistic. The t statistic assumes that the data are normally distributed, that the mean is known or assumed known, and that the population variance is estimated from the sample. The t distribution is similar to the normal distribution and approaches the normal distribution as the number of data points or responses increases (for sample sizes > 120 the two distributions are nearly identical).

    One Sample t-test A one sample t-test is commonly used to test the validity of a statement made about the mean of a single sample (in our case the responses for a single sound). The one sample t-test can be used to determine if the mean value is significantly different from some given standard or threshold. For example, is the mean value of the responses significantly different from zero? Additionally, if the threshold of acceptance of a sound has previously been determined to be 7 on a 10-point scale and the sample mean is 7.9, what is the probability that the true mean is greater than 7?
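The one-sample t statistic for exactly this threshold question can be sketched in a few lines (hypothetical ratings; the resulting t is compared against a critical value with n - 1 degrees of freedom).

```python
from math import sqrt
from statistics import mean, stdev

ratings = [7.9, 8.2, 7.5, 8.0, 7.8, 8.1, 7.6, 8.3]
threshold = 7.0                      # previously determined acceptance level
n = len(ratings)
t_stat = (mean(ratings) - threshold) / (stdev(ratings) / sqrt(n))
print(t_stat)  # compare against the t critical value with n - 1 df
```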

    Two-sample t-test Very often it is desirable to test whether the means of the responses given for two different sounds are significantly different. The two-sample t-test tests the equality of the means of two sets of independent responses, where independence implies that the responses generated for one sound have no effect on those generated for the second sound.
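A sketch of the pooled-variance form of the two-sample t statistic, using hypothetical ratings for two sounds; Welch's unpooled form is the common alternative when the equal-variance assumption is doubtful.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (df = len(a)+len(b)-2)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

sound_a = [7, 8, 6, 7, 8, 7, 9, 7]   # hypothetical ratings, sound A
sound_b = [5, 6, 5, 4, 6, 5, 5, 6]   # hypothetical ratings, sound B
t_stat = pooled_t(sound_a, sound_b)
print(t_stat)  # large |t| suggests the two mean ratings differ
```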

    Comparing equality of means for k samples (ANOVA) An extensive treatment of ANalysis Of VAriance (ANOVA) is beyond the scope of this document; only a brief overview will be given here. An application of ANOVA to subjective listening evaluations can be found in [20].

    Data relationships from interval scaled data can be explored using ANOVA. ANOVA can be used to determine if the mean values given to various sounds are indeed significantly different. ANOVA makes the assumption that the distributions of the scale values for each sound are normal. It also assumes that the standard deviations are equal. These assumptions can be tested using some of the previously described techniques. Departures from these two assumptions can result in erroneous conclusions. The test statistic used to determine significance is the F statistic. Typically the significance level used is α = 0.05.

    In so-called designed experiments, ANOVA may also be used to determine if certain factors inherent in the design influence the mean values given to the sounds. For instance, the loudness of the sound could have an influence on the annoyance ratings given to sounds in a sample. The loudness could be determined either subjectively or objectively. In this case, only one factor is of interest and thus the analysis is referred to as a one-way ANOVA. When m factors are of interest the analysis is referred to as an m-way ANOVA. The factors influencing the mean values do not necessarily need to be intrinsic properties of the sound. Typically, the subjects themselves will impart their own variation in the form of scale use differences. In this case, the evaluators themselves are a factor which causes variation in the scale ratings. Since subjects may use the scale differently, factors in the experiment may not prove to be significant due to large variances in the responses for individual sounds. If each evaluator's responses are normalized by the mean value of all the sounds rated by that evaluator, the evaluator variance can be removed. This is commonly referred to as a within-subject ANOVA. The disadvantage of the within-subject ANOVA is that only relative values are obtained and the absolute scale values are lost. For example, one could say that in presenting a sound with a loudness of 20 sones and one with a loudness of 25 sones, the influence on the annoyance ratings of all the evaluators (using a within-subject ANOVA) was an increase of 3 (on a 10-point scale); however, it could not be said that the absolute rating would go from a 5 to an 8, since for some evaluators the rating could go from a 2 to a 5, or a 3 to a 6, etc.
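The evaluator normalization described above can be sketched as subtracting each evaluator's own mean rating (one common choice of normalization). The two hypothetical evaluators below use different parts of the scale but show an identical relative pattern afterwards.

```python
from statistics import mean

ratings = {                          # evaluator -> ratings of four sounds
    "evaluator_1": [2, 3, 4, 5],     # uses the low end of the scale
    "evaluator_2": [6, 7, 8, 9],     # uses the high end of the scale
}
normalized = {who: [r - mean(rs) for r in rs] for who, rs in ratings.items()}
print(normalized)  # identical relative patterns once scale use is removed
```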

    Fisher's LSD Once significance has been determined among average values given for a sample of sounds, post hoc pairwise testing may be done to determine if significant differences occur for individual pairings of sounds. The most common of the pairwise comparison techniques is Fisher's Least Significant Difference (LSD) method. This is essentially the application of the two-sample t-test to each pair of means. Other post hoc pairwise tests include Duncan's Multiple Range, Newman-Keuls and Tukey's HSD.

    Linear regression analysis Regression analysis is a technique used to assess the relationship between one dependent variable (DV) and one or several independent variables (IVs). Assumptions made in any regression analysis include: normal distribution of the variables, linear relationships between the DV and IVs, and equal variance. Regression analysis, as it applies to the prediction of subjective responses using objective measures of the sound data, is discussed in the next section. A thorough treatment of regression techniques may be found in [21]. Regression analysis may also be used to explore relationships among subjective responses. For instance, the annoyance ratings for different aspects of a set of sounds may be used to predict an overall preference rating. The regression equation coefficients can then be used to quantify the relative importance of each of the sound aspects to the overall preference of the sound. An example as applied to automotive windshield wiper systems is given in [15].

    It should be mentioned that although regression analysis can reveal relationships between variables, this does not imply that the relationships are causal. A measure used to quantify the regression model fit is called the coefficient of determination, or R². R² takes on values from 0 to 1, with 0 indicating no relationship and 1 indicating a perfect fit. The tendency is to add IVs to the regression model in order to improve the fit. Caution should be used, since additional variables may not actually make a significant contribution to the prediction of the dependent variable. P-values for the individual independent variables are commonly used to assess how meaningful each contribution is; a small p-value indicates that the variable's contribution is unlikely to be due to chance. Generally, variables with p-values greater than 0.20 may be considered questionable additions to the regression model. In regression analysis a large ratio of cases (number of subjective responses) to independent variables is desired. A bare minimum of 5 times more cases than IVs should be used.
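For the single-IV case, the least-squares fit and R² reduce to a few lines; the loudness and annoyance values below are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope, intercept and R^2 for one independent variable."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

loudness = [18, 20, 22, 24, 26]          # sones (hypothetical IV)
annoyance = [3.1, 4.0, 5.2, 5.9, 7.1]    # 10-point scale (hypothetical DV)
slope, intercept, r2 = fit_line(loudness, annoyance)
print(r2)  # near 1: loudness predicts annoyance well in this toy data
```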

    Residual analysis can be useful when using regression techniques. Residuals are the differences between the observed values of the response variable and the values predicted by the model. Residual plots can be used to validate the assumptions of normality, linearity and equal variance. Residual analysis can also identify outliers, which can have a strong impact on the regression solution.

    Factor analysis Factor analysis is a general term for a class of procedures used in data reduction. Only an overview of the basic principles of factor analysis will be given here. References [22] and [23] are excellent introductory texts on factor analysis. References [24-26] are case studies in which factor analysis is used with automotive powertrain sounds.

    Factor analysis is a statistical technique applied to a single set of variables to discover which sets of variables form coherent subsets that are relatively independent of one another. In other words, factor analysis can be used to reduce a large number of variables to a smaller set of variables, or factors. It can also be used to detect structure in the relationships between variables. Factor analysis is similar to multiple regression analysis in that each variable is expressed as a linear combination of underlying factors. However, whereas multiple regression attempts to predict some dependent variable using multiple independent variables, factor analysis explores the association among variables with no distinction made between independent and dependent variables.

    The general factor model is given as:

    Xi = Ai1F1 + Ai2F2 + Ai3F3 + ... + AimFm + ViUi    (1)

    where,

    Xi = ith standardized variable

    Aij = standardized multiple regression coefficient of variable i on common factor j

    F = common factor

    Vi = standardized regression coefficient of variable i on unique factor i

    Ui = the unique factor for variable i

    m = number of common factors

    The unique factors are uncorrelated with each other and with the common factors. The common factors themselves can be expressed as linear combinations of the observed variables.

    Fi = Wi1X1 + Wi2X2 + Wi3X3 + ... + WikXk    (2)

    where,

    Fi = estimate of ith factor

    Wij = weight or factor score coefficient for factor i on standardized variable j

    Figure 1 shows the general process used in factor analysis. For the case of applying factor analysis to scaling data derived from a jury evaluation, the first step is the acquisition of the data. From that data a correlation matrix is derived. This matrix shows the simple correlations between all possible pairs of variables. Next, a method of factor analysis is chosen. The two main types of factor analysis are common factor analysis (CFA) and principal component analysis (PCA). In PCA it is assumed that all the variability in a variable should be used in the analysis, while in CFA only the variability in a variable that is common to the other variables is considered. PCA is generally used as a method for variable reduction, while CFA is usually preferred when the goal of the analysis is to detect structure. In most cases, these two methods yield similar results.

    Figure 1. Process Flowchart for Factor Analysis

    One of the goals of factor analysis is variable reduction. The question of how many factors should be extracted may be based on prior knowledge of the structure in question. Other guidelines used in practice include the Kaiser Criterion and the Scree Test. Both methods use the eigenvalues of the correlation matrix to determine the appropriate number of factors. The references contain detailed information on these and other techniques for determination of the number of factors to be extracted.

    The factor analysis generates a factor pattern matrix which contains the coefficients used to express the standardized variables in terms of the factors. These coefficients are called factor loadings and represent the correlations between the factors and the variables. Initially, this matrix contains factors which are correlated with many variables, which makes interpretation difficult. Rotation of the factor pattern matrix reduces the number of variables with which a factor is correlated and makes interpretation easier. Orthogonal rotations such as the varimax, equimax and quartimax procedures result in factors which are uncorrelated with one another. Oblique rotations may provide a clearer reduction in the variables but lead to factors which are correlated.

    [Figure 1 flowchart: Acquire Scaling Info. for Individual Variables → Construct Correlation Matrix → Factor Analysis → Determine Number of Factors → Rotation of Factors → Interpret Factors → Calculate Factor Scores / Select Surrogate Variables → Determination of Model Fit with Surrogate Variables]


    A factor is generally made up of variables which have high factor loadings for that particular factor. If the goal of factor analysis is to reduce the original set of variables to a smaller set of composite variables, or factors, to be used in subsequent analysis, the factor scores for each respondent can be calculated using equation (2). Sometimes the researcher is interested in selecting a single variable which represents the factor. This surrogate variable can then be used in subsequent listening evaluations to represent the factor of interest.

    The final step is to determine how well the factor model fits the actual data. The correlations between the variables can be reproduced from the estimated correlations between the variables and factors. The differences between the observed correlations (correlation matrix) and the reproduced correlations (estimated from the factor pattern matrix) can be used to measure model fit.

    In general, there should be at least four or five times as many observations as variables. For example, if factor analysis is used to reduce the variables that result from a semantic differential evaluation in which there were twelve semantic pairs, approximately 48-60 subject responses would be required.

    PAIRED COMPARISONS In one approach to paired comparison evaluations, the subjects are presented with two sounds and asked to select one based on some criterion. This is referred to as a forced choice method. As a result, the data obtained are ordinal. Paired comparison of similarity has also been used in sound quality evaluations. Similarity is a special application of the PC paradigm: sounds are again presented in pairs, but instead of choosing one of the sounds, an estimate of their similarity is made, thus providing interval scaled data.

    Forced choice paired comparisons A thorough treatment of the methods discussed in this section can be found in David [8]. A very readable synopsis of paired comparison analysis, including application to automotive wind noise listening evaluations, is given by Otto [5]. Selected applications of the methods discussed here can be found in [15, 20, 27-29].

    Tests of Subject Performance Performance measures are used to judge subject suitability. These measures reveal how each subject, as well as the population as a whole, performs. Because paired comparison is a forced choice method, an answer will always be obtained. However, it is not known whether that answer is valid or just a guess. The performance measures help us find out. Repeatability and consistency are the two measures used most for PC evaluations.

    Subject Repeatability. Repeatability is defined simply as the percentage of all comparisons which are judged the same the first and second time through the PC. Subjects who are doing poorly, for whatever reason, will usually not have high repeatability. Subject average repeatability should be 70% or greater. Subjects with individual repeatabilities of 60% or below usually have their data removed from further consideration. If the subject average repeatability falls much below 70%, then there is usually a problem with the evaluation itself. For example, having sounds which are not easily discriminable can often give rise to poor repeatability.

    Subject Consistency. Consistency is a measure of how well the pair judgments map into higher order constructs. Of general use is the Kendall consistency [30], which is based on triads. For example, a triad of sounds A, B, C is said to be consistent if the paired comparison shows that if A > B and B > C then A > C, and inconsistent if C > A, where > can mean preferred, more annoying, louder, rougher, etc. The Kendall consistency is defined as the percentage of consistent triads relative to the total number of triads. A subject average consistency value of 75% or higher is considered acceptable. Individuals who have poor repeatability often show poor consistency as well.
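The triad count can be sketched directly from the forced-choice outcomes. The sounds and judgments below are hypothetical, with a set of (winner, loser) pairs standing in for one subject's choices; a triad is inconsistent when the three choices form a cycle.

```python
from itertools import combinations

def kendall_consistency(beats, sounds):
    """beats: set of (winner, loser) pairs; returns % of consistent triads."""
    wins = lambda x, y: (x, y) in beats
    consistent = total = 0
    for a, b, c in combinations(sounds, 3):
        total += 1
        circular = (wins(a, b) and wins(b, c) and wins(c, a)) or \
                   (wins(b, a) and wins(c, b) and wins(a, c))
        consistent += 0 if circular else 1
    return 100.0 * consistent / total

sounds = ["A", "B", "C", "D"]
transitive = {("A","B"), ("A","C"), ("A","D"), ("B","C"), ("B","D"), ("C","D")}
one_cycle  = {("A","B"), ("B","C"), ("C","A"), ("A","D"), ("B","D"), ("C","D")}
print(kendall_consistency(transitive, sounds))  # 100.0
print(kendall_consistency(one_cycle, sounds))   # 75.0
```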

    Scores Once the PC data have been adjusted for subject performance, the next step is to analyze the data. Under the assumption of transitivity it is possible to convert paired comparison data to rank order data. One very straightforward way to obtain the rank order result is to use the scores. The score for a given sound is simply the total number of times that the sound is chosen, summed over all paired comparisons. Scores are a quick and easy way to look at the results, but are not appropriate for correlation with objective metrics. This is because scores are a measure of how each sound compares against the rest of the sound population; scores tell you nothing about how one sound was judged against another. In order to determine if significant differences exist among the scores, a test statistic such as Friedman's T can be calculated. This statistic is analogous to the F statistic for equality of treatment means in an ANOVA. Pairwise testing of the scores can also be done using the nonparametric least significant difference method; again, this is analogous to the paired t-test. It should also be mentioned that since scores provide rank order data, they can be subjected to the same methods discussed in the section on rank order data analysis.
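Computing scores is a simple tally of wins over all presentations; the choices below are hypothetical.

```python
from collections import Counter

# winner of each paired presentation, pooled over subjects (hypothetical)
choices = ["A", "A", "B", "A", "C", "B", "A", "B", "C", "A"]
scores = Counter(choices)
rank_order = [sound for sound, _ in scores.most_common()]
print(scores)      # A chosen 5 times, B 3 times, C 2 times
print(rank_order)  # ['A', 'B', 'C']
```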

    Bradley-Terry and Thurstone-Mosteller Models It is possible to derive scale data from paired comparison data using statistical models proposed by Thurstone and Mosteller [31, 32] and Bradley and Terry [33]. To apply each of these models, the scores must be converted into a random variable having either a normal (Thurstone-Mosteller) or logistic (Bradley-Terry) distribution. The values of these random variables are then further analyzed to arrive at a scale value for each sound in the study. These scale values are appropriate for correlation with objective properties of the sounds. The Bradley-Terry model has met with a great deal of success in the automotive sound quality industry. As a result, the mechanics for derivation of the Bradley-Terry scale values are outlined here. This material is taken from David [8].


    Data on how one sound compares to another are available in the form of pair probabilities. For example, in a paired comparison of preference study with 100 subjects, if 80 subjects prefer sound i to sound j, the probability that sound i is preferred to sound j, pij, is 0.8, while the probability that j is preferred to i, pji, is 0.2. In general, the relationship is pij = 1 - pji. These pair probabilities are the basis for the models of the paired comparison evaluation. These are statistical models of the paired comparison process that take pair probabilities as input and produce a single-valued measure for each sound.

    All the linear PC models follow the same basic premise: for each sound in a PC study, there exists a value, called the merit value, that underlies all the pair judgments. These merit values lie along a linear scale and the relative position of each value is indicative of how the sounds will be judged in the PC. Sounds with merit values that are close together should have pair probabilities near 0.5, while sounds with vastly different merit values should have pair probabilities approaching 1 (or 0). Like all models, this is an abstraction, but it is useful in analyzing the results. If merit values determine pair probabilities, then one can reverse this process and use pair probabilities (which are known) to find merit values (which are unknown). Otto [27] gives a straightforward summary of the mathematical formulation for the Bradley-Terry model.
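A minimal sketch of recovering Bradley-Terry merit values from observed win counts, using the classic iterative (Zermelo/MM) update rather than the closed-form treatment in David [8]; the win matrix is hypothetical.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry merit values from wins[i][j], the number of
    subjects choosing sound i over sound j, by iterative scaling."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        for i in range(n):
            w_i = sum(wins[i])                       # total wins for sound i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p[i] = w_i / denom
        total = sum(p)
        p = [v / total for v in p]                   # normalize each sweep
    return p

wins = [[0, 80, 70],        # hypothetical forced-choice tallies, 100 subjects
        [20, 0, 60],
        [30, 40, 0]]
merits = bradley_terry(wins)
print(merits)  # merit ordering: sound 0 > sound 1 > sound 2
```

Under the model, the fitted merits reproduce the pair probabilities via p_i / (p_i + p_j), so the resulting scale values are suitable for correlation with objective metrics.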

    In