
V

Vascular Biometrics

▶Vascular Image Data Format, Standardization

Vascular Image Data Format, Standardization

ALEX H. CHOI 1, JONATHAN R. AGRE 2
1 Department of Information Engineering, Myongji University, Seoul, South Korea
2 Fujitsu Laboratories of America, College Park, MD, USA

Synonyms

Vascular biometrics; Vein biometrics

Definition

A vascular image format standard (▶Biometrics, Overview) is useful for the exchange of vascular biometric image information across different systems developed by multiple organizations. As one part of this standardization effort, the International Organization for Standardization (ISO) has published a standard for a vascular biometric image interchange format, ISO/IEC 19794-9 (Biometric Data Interchange Format – Part 9: Vascular Image Data). The standard includes general requirements for image capture devices, environmental conditions, specific definitions of image attributes, and the data record format for storing and transmitting vascular biometric images. The vascular biometric image format standard was developed in response to the need for system interoperability, which allows different vascular biometric systems to be easily integrated with other biometric modalities in a large-scale system.


Introduction

Vascular biometric technologies have existed for many years. Moreover, new technologies employing vascular images obtained from various parts of the human body are emerging or under continuous improvement as a result of new, state-of-the-art imaging devices. Some of these technologies are being widely adopted as reliable biometric modalities [1].

Vascular biometrics offer several intrinsic advantages in comparison with the other popular biometric modalities. First, vascular imaging devices use near-infrared or infrared light to capture the vein pattern underneath the skin. This provides a high degree of privacy that is not available with fingerprints, which can be unintentionally left on objects, or with facial images for face recognition schemes, which are easily captured without one's knowledge. A similar possibility exists for iris images captured without consent for use in iris recognition schemes. Second, vascular imaging devices can be constructed to operate in a non-contact fashion, so that it is not necessary for the individual to touch the sensor in order to provide the biometric data. This is advantageous in applications that require a high degree of hygiene, such as medical operating room access, or where persons are sensitive about touching a biometric sensing device. Third, a high percentage of the population is able to provide viable vascular images for use in biometric identification, increasing ▶usability by providing an additional way to identify persons not able to provide fingerprints or other biometric modal data. Fourth, depending on the particular wavelength of (near-)infrared light that is used, the image can capture only the vein patterns containing oxygen-depleted blood. This can be a good indication that the biometric image is from a live person. Fifth, the complexity of the vascular image can be controlled so that the underlying amount of information contained in the image can be quite high when compared to a fingerprint, allowing one to reduce the false accept or false reject rates to low levels. At the same time, the image information can be compressed, or it can be processed into a template to reduce storage requirements.

Vascular biometric technologies are being used or proposed for many applications. Some of these include access control to secure areas, employee time-clock tracking, Automatic Teller Machines (ATMs), secure computer login, person identification, and use as one of several biometrics in multi-biometric systems. The technology is not appropriate for certain other applications such as criminal forensics or surveillance.

Currently, however, little vascular biometric image information is being exchanged between the equipment and devices of different vendors. This is due in part to the lack of standards relating to interoperability of vascular biometric technology. In the general area of biometrics interoperability, the International Organization for Standardization (ISO) and the regional organizations, such as the INCITS M1 group in the US, define a collection of standards relating to the various biometric modalities that include data interchange formats, conformance testing of image and template interchange formats, performance testing, and application profiles. The most critical are the formats for information exchange that ensure interoperability among the various vendors. Definition and standardization of the data structures for the interoperable use of biometric data among organizations is addressed in the ISO/IEC 19794 series [2], the multipart biometric data interchange format standard, which describes standards for capturing, exchanging, and transferring different biometric data derived from personal characteristics such as voice, or from properties of parts of the body such as the face, iris, fingerprint, hand geometry, or vascular patterns.

To address this shortcoming in the vascular domain, ISO has published a standard for a vascular biometric image interchange format, entitled ISO/IEC 19794-9 (Biometric data interchange format – Part 9: Vascular image data) [3].

The main purpose of this standard is to define a data record format for storing and transmitting vascular biometric images and certain of their attributes, for applications requiring the exchange of raw or processed vascular biometric images. It is intended for applications not severely limited by the amount of storage required, and it represents a trade-off between the resources required for data storage or transmission and the potential for improved data quality and accuracy. Basically, it enables various preprocessing or matching algorithms to identify and verify the type of vascular biometric image data transferred from other image sources and to allow operations on the data. The currently commercialized vascular biometric technologies that may utilize this standard for image exchange are those that use the back of the hand, the palm, and the finger [4–6]. The standard can be extended to accommodate other portions of the body if the appropriate technology is brought forward.

The use of standardized source images can provide interoperability among and between vendors relying on various different recognition or verification algorithms. Moreover, the format standard offers the developer more freedom in choosing or combining matching algorithm technology. This also helps application developers focus on their application domain without concern about variations in how the vascular biometric data was processed in the vascular biometric modalities.

Introduction to ISO/IEC 19794-9 Vascular Image Data Format Standard

ISO published the ISO/IEC 19794-9 Vascular Image Data Format Standard in 2007, as a part of the ISO/IEC 19794 series. The ISO/IEC 19794-9 vascular image data format standard specifies an image interchange format for biometric person identification or verification technologies that utilize human vascular biometric images, and it may be used for the exchange and comparison of vascular image data [7]. It specifies a data record format for storing, recording, and transmitting vascular biometric information from one or more areas of the human body. It defines the contents, format, and units of measurement for the image exchange. The format consists of mandatory and optional items, including scanning parameters, compressed or uncompressed image specifications, and vendor-specific information.

The ISO/IEC 19794-9 vascular image data format standard describes the data interchange format for three different vascular biometric technologies utilizing different parts of the hand: the back of the hand, the finger, and the palm. The standard also leaves room for extension to other vascular biometrics on other parts of the human body, if needed. Figure 1 shows an example of vascular biometric areas on different parts of the hand that are specified in ISO/IEC 19794-9.

Vascular Image Data Format, Standardization. Figure 1 Examples of vascular biometric areas on different parts of the hand [3].

The interchange format follows the standard data conventions of the 19794 series of standards: all multi-byte data must be in big-endian format, transmitting the most significant byte first and the least significant byte last; within a byte, the order of transmission shall be the most significant bit first and the least significant bit last. All numeric values are treated as fixed-length unsigned integers.
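As an illustration of this convention, here is a minimal Python sketch (the field value is invented for the example) showing how a big-endian fixed-length unsigned integer would be read and written with the standard struct module:

```python
import struct

# The 19794 series transmits multi-byte values big-endian (most significant
# byte first); struct selects this byte order with the ">" prefix.
raw = bytes([0x00, 0x00, 0x00, 0x3C])
(value,) = struct.unpack(">I", raw)      # 4-byte fixed-length unsigned integer
print(value)                             # -> 60

# Writing follows the same convention: most significant byte first.
assert struct.pack(">H", 0x1234) == b"\x12\x34"
```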

The vascular pattern biometric technologies currently available employ images from the finger, the back of the hand, and the palm side of the hand. The location used for imaging is to be specified in the format. To further specify the locations, the object (target body) coordinate system for each vascular technology is defined. Standard poses and object coordinate systems are also defined; all the coordinate systems are right-handed Euclidean coordinate systems. It is then possible to optionally specify a rotation of the object from the standard pose. In order to map the object coordinate system to the image coordinate system without further translation, an x- and y-axis origin for scanning can be specified in the data.

The image is acquired by scanning a rectangular region of interest of the human body from the upper left corner to the lower right in raster scan order, that is, along the x-axis and proceeding from top to bottom in the y direction. The vascular image data can be stored either in a raw or a compressed format. In a raw format, the image is represented by a rectangular array of ▶pixels with specified numbers of columns and rows. Images can also be stored using one of the specified lossless or lossy compression methods, resulting in compressed image data. The allowable compression methods include JPEG [8], JPEG 2000 [9], and JPEG-LS [10]. It is recommended that the compression ratio be less than a factor of 4:1 in order to maintain the quality level necessary for further processing.
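A minimal sketch of the two storage routes, assuming a Pillow build with JPEG 2000 (OpenJPEG) support and a hypothetical input file vein_raw.png; the 4:1 target mirrors the recommendation above:

```python
from PIL import Image

img = Image.open("vein_raw.png").convert("L")   # 8-bit gray scale vein image

# Lossless route: the pixel array is preserved exactly.
img.save("vein_lossless.png")

# Lossy route: JPEG 2000 with a target compression ratio of 4:1, the
# upper bound recommended by the standard for further processing.
img.save("vein_4to1.jp2", quality_mode="rates", quality_layers=[4])
```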

Image capture requirements are dependent on various factors such as the type of application, the available amount of raw pixel information to be retained or exchanged, and the targeted performance. Another factor to consider as a requirement for vascular biometric imaging is that the physical size of the target body area where an application captures an image for the extraction of vascular pattern data may vary substantially (unlike other biometric modalities).

The image capture requirements also define a set of additional attributes for the capture devices such as ▶gray scale depth, ▶illumination source, horizontal and vertical resolution (in pixels per cm), and the aspect ratio. For most of the available vascular biometric technologies, the gray scale depth of the image ranges up to 128 gray scale levels, but the format may, if required, utilize two or more bytes per gray scale value instead of one. The illumination sources used in a typical vascular biometric system are near-infrared light sources with wavelengths in the range of approximately 700–1,200 nm. However, near-infrared, mid-infrared, and visible light sources can all be defined, and more than one source may be employed.

Vascular Image Data Format, Standardization. Table 1 Vascular image biometric data block

Bytes   Type                   Content description
1–26    Data block header      Header used by all vascular biometric image providers. Information on format version, capture device ID, number of vascular images contained in the data block, etc.
27–58   Vascular image header  Image header for the first image. Contains all individual image-specific information
        Unsigned char          Image data
...     ...                    ...
        Vascular image header  Image header for the last image
        Unsigned char          Image data

Table 1 shows the basic structure of the vascular image biometric data block. A single data block starts with a vascular image record header, which contains general information about the data block such as the identification of the image capture device and the format version. One or more vascular image blocks follow the record header. Each image block consists of an image header and raw or compressed image data. The image header contains all the image-specific information such as the body location, rotation angle, and imaging conditions. All images in a data block must come from the same capture device. If multiple devices are used, then multiple blocks must be used.

The vascular image record header consists of general information on the vascular images contained in the data block, such as the format version number, total length of the record block, capture device identification, and the number of images contained in the data block. More specifically, its fields include a format identifier, version number, record length, capture device ID, and number of images.

For each image in the data block, the vascular image header describes individual image-specific information including image type, vascular image record length, image width and height, gray scale depth, image position, property bit field, and rotation angle. Other information in the vascular image header may include image format, illumination type, image background, horizontal scan resolution, vertical scan resolution, pixel aspect ratio, and vascular image header constants. The image data follows and is used to store the biometric image information in the specific format defined in the vascular image record header.
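The following Python sketch shows how such a record might begin to be parsed. The text above fixes a 26-byte record header (bytes 1–26) and a 32-byte image header (bytes 27–58), but the exact field offsets below are hypothetical, chosen only to illustrate the field names listed above; they are not copied from ISO/IEC 19794-9.

```python
import struct
from typing import NamedTuple

class RecordHeader(NamedTuple):
    format_id: bytes       # format identifier
    version: bytes         # version number
    record_length: int     # total length of the record block
    device_id: int         # capture device identification
    num_images: int        # number of images in the data block

def parse_record_header(blob: bytes) -> RecordHeader:
    # ">" enforces the big-endian convention of the 19794 series.
    # Illustrative layout: 4-byte ID, 4-byte version, 4-byte length,
    # 2-byte device ID, 2-byte image count (a subset of the 26 bytes).
    return RecordHeader(*struct.unpack_from(">4s4sIHH", blob, 0))
```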

Future Activities

There are considerable ongoing standardization activities relating to vascular biometrics, building upon the biometric data interchange format for vascular images standard. A companion document that specifies the conformance testing for the data interchange format is currently under development. The conformance standard specifies how to check whether the data produced by a vascular imaging device does indeed agree with the interchange format, as well as which items are mandatory or optional. There are also ongoing efforts, both internationally and in the U.S., to include the vascular image formats in the various application profiles (such as the INCITS M1 Profile for Point-of-Sale Biometric Identification/Verification), which define how to use vascular biometrics in the specific context of an application. There are also efforts at including vascular methods in multi-biometric fusion schemes or as a biometric component of a smart-card based solution. Eventually, it is expected that vascular methods will become one of the important biometric modalities, offering benefits not provided by the other techniques in certain applications.

Summary

Vascular biometric technologies, including those using vascular images from the back of the hand, the finger, and the palm, are being used as integrated security solutions in many applications. The need to easily exchange and transfer vascular biometric data between biometric recognition devices and applications, or between different biometric modalities, requires the definition of a vascular biometrics data format standard. The development of the vascular biometric data interchange format standard also helps to ensure interoperability among the various vendors. This paves the way for vascular biometric technologies to be adopted as a standard security technology that is easily integrated into a wide range of applications.

Related Entries

▶Back-of-hand Vein

▶ Finger Data Interchange Format Standardization

▶ Finger Vein

▶Palm Vein

▶Vein and Vascular Recognition


References

1. Choi, A.H., Tran, C.N.: Handbook of Biometrics: Hand Vascular Pattern Recognition Technology. Springer, New York (2007)
2. ISO/IEC 19794-1 Information Technology: Biometric Data Interchange Format – Part 1: Framework/Reference Model
3. ISO/IEC 19794-9 Information Technology: Biometric Data Interchange Format – Part 9: Vascular Image Data
4. Im, S.K., Park, H.M., Kim, Y.W., Han, S.C., Kim, S.W., Kang, C.H.: Biometric identification system by extracting hand vein patterns. J. Korean Phys. Soc. 38(3), 268–272 (2001)
5. Miura, N., Nagasaka, A., Miyatake, T.: Feature extraction of finger-vein patterns based on repeated line tracking and its application to personal identification. Mach. Vis. Appl. 15, 194–203 (2004)
6. Watanabe, M., Endoh, T., Shiohara, M., Sasaki, S.: Palm vein authentication technology and its applications. In: Proceedings of the Biometric Consortium Conference, VA, USA, September 2005
7. Volner, R., Bores, P.: Multi-biometric techniques, standards activities and experimenting. In: Baltic Electronics Conference, pp. 1–4. Tallinn, Estonia (2006)
8. ISO/IEC 10918 (all parts) Information Technology: Digital Compression and Coding of Continuous-Tone Still Images
9. ISO/IEC 15444 (all parts) Information Technology: JPEG 2000 Image Coding System
10. ISO/IEC 14495 (all parts) Information Technology: Lossless and Near-Lossless Compression of Continuous-Tone Still Images

Vascular Network Pattern

The network pattern composed of blood vessels. Human blood vessels develop network structures at each level: artery, arteriole, capillary, venule, and vein. The network of major blood vessels can be seen in funduscopy and in visual observation of the body surface. The vascular networks in a fundus image are those of the retinal arteries and retinal veins; the blood vessels observed on the body surface are the cutaneous veins. Both network patterns can be used in biometric authentication. There is no clear evidence on the uniqueness and permanence of the vascular network pattern; in practice, however, the vascular pattern has been used for biometric authentication without serious problems. Since the retinal pattern is kept inside the eye, it is stable and seldom affected by changes in the outer environment. It is not easily observable by others and is robust against theft and forgery. The retinal pattern is complex, and high identification accuracy can be expected. Authentication using the retinal pattern has been used in institutions that require a high level of security.

The vascular network pattern in a hand or a finger can be visualized by transillumination imaging or reflection-type imaging using near-infrared light. Authentication with the vascular pattern of a hand or a finger is safer and more convenient than that with the retinal pattern. It has been used in common security applications such as authentication at ATMs and in access management.

▶Performance Evaluation, Overview

Vascular Recognition

▶Retina Recognition

Vector Quantization

Vector quantization (VQ) is a process of mapping vectors from a large vector space to a finite number of regions in that space (Linde, Y., Buzo, A., Gray, R.: An algorithm for vector quantizer design. IEEE Trans. Commun. 28(1), 84–95 (1980)). Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook. During the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering the corresponding training acoustic vectors. The distance from a vector to the closest codeword of a codebook is called the VQ distortion. During the recognition phase, an input utterance of an unknown voice is vector-quantized using each trained codebook, yielding a VQ distortion for each codebook, that is, for each client speaker. The speaker corresponding to the VQ codebook with the smallest distortion is identified. In both the training and testing phases, the VQ process works independently on each input frame and produces an averaged result (a codebook or a VQ distortion). Thus, there is no need to perform a time alignment. The lack of time warping greatly simplifies the system; however, it neglects speaker-dependent temporal information that may be present in prompted phrases.
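A minimal sketch of this training/recognition loop, using scipy's k-means-based codebook training and quantizer; the Gaussian "acoustic vectors" below are stand-ins for real feature frames such as MFCCs:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)

# Stand-ins for training acoustic vectors of two known speakers.
frames_a = rng.normal(0.0, 1.0, size=(500, 12))
frames_b = rng.normal(2.0, 1.0, size=(500, 12))

# Training phase: one speaker-specific codebook (16 codewords) per speaker.
codebook_a, _ = kmeans(frames_a, 16)
codebook_b, _ = kmeans(frames_b, 16)

# Recognition phase: quantize the unknown utterance with each codebook,
# average the per-frame VQ distortions, and pick the smallest.
utterance = rng.normal(2.0, 1.0, size=(300, 12))   # actually speaker B
_, dist_a = vq(utterance, codebook_a)
_, dist_b = vq(utterance, codebook_b)
print("identified:", "A" if dist_a.mean() < dist_b.mean() else "B")
```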

▶ Speaker Matching

Vein

Veins are the blood vessels that carry blood toward the heart. In the cardiovascular system, blood vessels consist of arteries, capillaries, and veins. Veins collect blood from capillaries and carry it toward the heart. In most of the veins the blood is deoxygenated; the pulmonary vein is one of the exceptions that carry oxygenated blood. The walls of veins are relatively thinner and less elastic than those of arteries. Some veins have one-way flaps called venous valves that prevent blood from flowing back. The valves are found in the veins that carry blood against the force of gravity, especially in the veins of the lower extremities.

A vein in the subcutaneous tissue is called a cutaneous vein. Some of the cutaneous veins can be observed on the body surface with the naked eye. With light that passes easily through body tissue, such as near-infrared light, a clear image of the cutaneous veins can be obtained. Since the pattern of the venous network differs greatly between individuals, the images can be used for authentication. Biometric authentication using the venous network patterns of the palm and the finger is common.

▶Palm Vein Image Sensor

▶Performance Evaluation, Overview

Vein Biometrics

▶Vascular Image Data Format, Standardization

Vein Recognition

▶Retina Recognition

Velocity (Speed)

Velocity of pen movement during the signing process. Velocity features seem to be among the most useful features of on-line signatures. Generally, velocity is computed from the first-order derivative of the pen position signal with respect to time. The easiest way to compute the velocity is to calculate the distance between two consecutive pen-tip positions if the data is acquired at equidistant sample points. Velocity features are represented in two ways: velocities along the x-axis and y-axis, or velocity along the pen movement direction (the tangential direction). In the latter case, the direction of pen movement is also considered as a separate feature.
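For data sampled at equidistant time intervals, both representations reduce to simple differences, as in this small numpy sketch (function and argument names are illustrative):

```python
import numpy as np

def pen_velocities(x: np.ndarray, y: np.ndarray, dt: float):
    """Velocity features from pen-tip positions sampled every dt seconds."""
    vx = np.diff(x) / dt              # velocity along the x-axis
    vy = np.diff(y) / dt              # velocity along the y-axis
    v = np.hypot(vx, vy)              # tangential (pen-movement) velocity
    direction = np.arctan2(vy, vx)    # movement direction, a separate feature
    return vx, vy, v, direction
```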

▶ Signature Recognition

Verification

Biometric verification is a process that confirms or refutes a claim about the similarity of biometric reference(s) and recognition biometric sample(s) by making biometric comparison(s).

▶Verification/Identification/Authentication/Recognition: The Terminology

Vetting

▶Background Checks

Video Camera

▶ Face Device

Video Surveillance

▶Human Detection and Tracking

Video-based Face Recognition

▶ Face Recognition, Video-based


Video-based Motion Capture

▶Markerless 3D Human Motion Capture from Images

Visible Spectrum

Synonyms

Optical spectrum; Visible light

Definition

The portion of the electromagnetic spectrum that is visible to (detected by) the human eye. The wavelengths for this spectrum are 380–750 nm, which are the wavelengths seen (detected) by the human eye in air.

▶ Iris Databases

Visual Memory

Visual memory is the perceptual ability that allows visual images to remain in memory after they are no longer visible. It supports the matching process between two fingerprints when eye movements are required.

▶ Latent Fingerprint Experts

Visual Sensor

▶ Face Device

Visual-dynamic Speaker Recognition

▶ Lip Movement Recognition


Vitality

▶ Liveness Detection: Fingerprint

▶ Liveness Detection: Iris

Viterbi Algorithm

The Viterbi algorithm is the conventional, recursive, efficient way to decode a Hidden Markov Model, that is, to find the optimal state sequence given the observation sequence and the model. It provides information about the hidden process and is a good and efficient approximation of the evaluation problem.
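A compact sketch of the recursion in the log domain, assuming a discrete-observation HMM given by a log transition matrix log_A, log emission matrix log_B, and log initial distribution log_pi:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Return the most likely hidden state sequence for `obs`."""
    n_states, T = log_A.shape[0], len(obs)
    delta = np.empty((T, n_states))        # best log score ending in each state
    psi = np.zeros((T, n_states), int)     # back-pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A     # every i -> j transition
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    states = [int(delta[-1].argmax())]             # best final state
    for t in range(T - 1, 0, -1):                  # backtrack
        states.append(int(psi[t, states[-1]]))
    return states[::-1]
```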

▶Hidden Markov Models

VOCs (Volatile Organic Compounds)

Organic chemicals that have a high vapor pressure, resulting in a relatively high abundance in the headspace of samples.

▶Odor Biometrics

Voice Authentication

Voice authentication is also known as speaker authentication, speaker verification, and one-to-one speaker recognition. For example, for a client – a bank customer – to be authenticated, the client must first go through an enrollment procedure, also known as training. During enrollment, the client provides a number of voice samples to the system, which in turn are used to build a voice model for the client. When requesting voice authentication, a client must first announce his or her identity. This may be done verbally, by saying a name, user ID, account number, or the like, or it may be done by presenting an identifying token such as a staff card or bank card. The authentication then takes place when the person speaks a set phrase or a requested phrase, or simply engages in a dialogue with the authentication system. If the voice sample matches the stored model or template of the claimed identity, the client is authenticated. If an impostor tries to be authenticated as a particular client, the impostor's voice will not match the client model and the impostor will be rejected. The authentication paradigm compares a speech sample with only a single client model, namely the model of the claimed identity; hence, it is sometimes known as one-to-one speaker recognition. In contrast, speaker identification compares a speech sample with every possible client model to find the closest match; hence, this paradigm is also known as one-to-many speaker recognition.
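The contrast between the two paradigms can be captured in a few lines. This is only a schematic sketch, with score, models, and threshold as hypothetical placeholders for a real comparison function, enrolled model store, and decision threshold:

```python
def verify(sample, claimed_id, models, score, threshold):
    """One-to-one: compare the sample only against the claimed identity."""
    return score(sample, models[claimed_id]) >= threshold

def identify(sample, models, score):
    """One-to-many: compare against every client model; return best match."""
    return max(models, key=lambda client: score(sample, models[client]))
```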

▶ Liveness Assurance in Voice Authentication

▶ Speaker Recognition Standardization

Voice Biometric

▶ Speaker Recognition, Overview

Voice Biometric Engine

▶ Speaker Matching


Voice Device

DOROTEO T. TOLEDANO, JOAQUIN GONZALEZ-RODRIGUEZ, JAVIER ORTEGA-GARCIA
ATVS – Biometric Recognition Group, Escuela Politecnica Superior, Universidad Autonoma de Madrid, Spain


Synonyms

Microphone; Speech input device

Definition

Voice device in the context of biometrics is frequently used as a synonym for a simpler word: microphone. A microphone [1] is a transducer that converts sound (or, equivalently, air pressure variations) into electrical signals. There are many different types of microphones that use different methods to achieve this transduction, most of which will be reviewed in this article. Beyond the transduction method employed, microphones are most frequently encapsulated, and the encapsulation makes it possible to build microphones with different directional characteristics, which allow, for instance, capturing the voice coming from one direction while rejecting (to a certain extent) the noises or voices coming from other directions. Apart from directionality, microphones also differ in frequency response, sensitivity, and dynamic range. All these characteristics can dramatically influence the performance of a speech biometric system and should therefore be taken into account in the design of such systems.

Microphones are the most commonly used speech input devices, and for that reason they deserve most of the space of this article. However, this article would be incomplete without mentioning that microphones, at least traditional microphones, are not the only speech input device that can be used in speech biometrics. For instance, microphones may be arranged to form ▶microphone arrays. There also exist special microphones called ▶contact microphones that transduce vibrations in solid bodies into electrical signals. Finally, there is also the possibility of combining the acoustic evidence and the visual evidence of speech by recording the audio together with the movement of the lips, in what is commonly referred to as audio-visual speech processing. Definitional entries at the end of this article are devoted to these special speech input devices.

The first step in any voice biometric (or automatic speaker recognition) system is to capture the voice of the speaker, and speech input devices are used for this purpose.

Introduction

The human hearing sense is extremely robust against noise and small distortions in speech, and humans are very good at recognizing people based on their voices, even under strong distortion and noise. Most speech input devices are designed with the goal of capturing speech or music, translating it into electrical signals, transmitting or storing it and, finally, reproducing that speech or music (by means of the opposite transducer, a loudspeaker). The important point here is that microphones are designed to be used in a chain, at the end of which is, most times, the human ear. Having such a robust receptor at the end of the chain makes it unnecessary to be very careful in the design or selection of a speech input device.

In recent years, however, there has been a fundamental change in speech communication, since the receiver in the speech communication chain is no longer always a human listener. Nowadays machines are used for transcribing speech signals (in automatic speech recognition) and also, most importantly in this context, for recognizing the speaker given a segment of speech (in voice biometrics or automatic speaker recognition). This fundamental change has brought an uncomfortable reality for all speech researchers: machines are still far less robust than humans at processing speech.

Of course, the goal of speech researchers is to make machines not merely as robust as humans but even more so. Currently, voice biometric systems achieve very good results in relatively controlled conditions, such as in telephone conversations with similar durations. This has been the basic setup for the yearly competitive Speaker Recognition Evaluations (SRE) organized by the National Institute of Standards and Technology (NIST) [2] in recent years. These evaluations show that current technology is capable of achieving very competitive results in these conditions and is becoming more and more robust against variabilities. However, the problem of variability due to the speech input device is far from being solved. Actually, this is a very active research and technological topic. The proof of it is that the next NIST evaluations in voice biometrics will probably be centered on cross-channel conditions, in which training and testing data come from different channels (including different microphones, microphone locations (close-talking and far-field), and recording devices). However, achieving robustness against such variations is a long-term research goal that most probably will not be fulfilled in the next few years.

In the meantime, it should be stressed that the technology is already usable in practical situations, but it should also be highlighted that current technology may not be as robust as desirable. In these circumstances it is essential to take extra care in the design or selection of the speech input device. In some cases, of course, the speech input device is out of one's control, such as in telephone applications. But there are other cases where it is necessary to design the speech input device and, in these cases, it is essential to make the right choice, because there are multiple choices of speech input devices with very different features, and an appropriate selection of the speech input device could be the key to success or failure in a voice biometrics application. This section provides an introduction to the world of speech input devices, or microphones.

Microphones

Definition

A microphone is a transducer that converts sounds (air pressure variations) into variations of an electrical magnitude, typically voltage.

History

The early history of the microphone is tied to the development of the telephone [3]. In fact, the microphone was the last element required for a telephonic conversation to be developed. One of the earliest versions of the microphone was developed by the German researcher Philipp Reis in 1861. These microphones were just a platinum piece associated with a membrane that opened and closed an electric circuit as the sound made the membrane vibrate. This allowed Reis to build primitive prototypes that could transmit voice and music over several hundred meters. It was several years later, in 1874, that Alexander Graham Bell patented the telephone and transmitted what is considered the first telephone conversation: "Mr. Watson, come here, I want you." Bell improved the microphones to make them better and better suited for commercial applications. Among the earlier microphones developed by Bell were liquid microphones, in which a diaphragm moved a metallic needle inside a metal recipient filled with a solution of water and sulfuric acid, so that the resistance between the needle and the recipient varied with the movement of the diaphragm. Later microphones developed by Bell were based on the variations of inductance in a moving coil attached to a diaphragm. However, it was not until 1878 that the word microphone was used for the first time, and it was associated with what is known today as the carbon microphone. The carbon microphone was invented by Edison and Hughes and constituted a real breakthrough for telephone systems, since it was more efficient and robust than the earlier devices. It has since mostly been replaced by the more modern microphones described in the following sections.

Types

All microphones are based on the transduction of air pressure variations into an electromagnetic magnitude. However, there are many ways to achieve this, and therefore there are many types of microphones with different characteristics and applications. Some of the most important types are summarized here.

• Condenser or capacitance microphones. These microphones are based on the following physical principle (Fig. 1): the capacitance of a condenser with two metallic plates depends on the distance between the two plates. If one metallic plate of a capacitor is replaced by a metallic membrane that vibrates with sound, the capacitance of the condenser varies with the sound, and this variation can be translated into the variation of an electrical magnitude. There are two ways of doing this transformation. The most common one is to set a constant charge on the two plates and measure the variations of the voltage between them. The other (slightly more complex) way is to use the variations in the capacitance to modulate the frequency of an oscillator. This generates a frequency-modulated signal that needs to be demodulated, but the demodulated signal usually has less noise and can more effectively reproduce low-frequency signals than one obtained with the constant-charge method. A special type of condenser microphone is the electret microphone, a capacitor microphone in which the charge on the plates is maintained not by applying an external constant voltage to the capacitor, but by using a ferroelectric material that keeps a constant charge, in a similar way as a magnet generates a constant magnetic field. Condenser microphones are the most frequently used microphones nowadays, and they range from low-quality cheap versions to high-quality expensive microphones.

Voice Device. Figure 1 Principle of functioning of a condenser microphone.

• Dynamic or induction microphones. These microphones are based on a different physical principle: when a conductor moves inside a magnetic field, it generates a voltage by electromagnetic induction. If a small coil is attached to a diaphragm that moves with sounds, and if this coil is placed in a magnetic field (generated by a permanent magnet), the movement of the coil will produce a voltage at its terminals that is related to the sound. A special type of induction microphone is the ribbon microphone, in which the coil is replaced by a metallic ribbon that vibrates with sound as it is suspended in a constant magnetic field, thus generating a current related to the sound. These microphones are more sensitive than coil microphones, but they are also more fragile.

• Carbon microphones. This microphone is essentially a recipient filled with carbon powder, closed by a metallic membrane on one side and a metallic plate on the other. As the membrane vibrates with the sound, the powder supports more or less pressure and its electrical resistance falls or rises (with more pressure, carbon particles increase their surface in contact with other particles, and this makes the electrical resistance decrease). Carbon microphones were widely used in telephones. They have now been replaced by capacitor microphones.

• Piezo-electric microphones. These microphones are based on yet another physical effect: some solid materials, called piezo-electric materials, have the property of producing a voltage when pressure is applied to them. Using this property, a microphone can be built by just placing two electrical contacts on a piezo-electric material. Piezo-electric microphones are mainly used in musical instruments (such as electric guitars) to collect and amplify the sound.

• Silicon microphones. Silicon (or chip) microphones are not based on a new physical effect. Rather, they are capacitor microphones built on a silicon chip in which the membrane is directly attached to the chip. These microphones can be very small and are usually associated with electronic circuitry such as a preamplifier and an analog-to-digital converter (ADC), so that a single chip can produce digital audio.

Directional Characteristics

Microphones have different characteristics depending on the direction of arrival of the sound with respect to the microphone. A microphone's directionality pattern measures its sensitivity to a particular direction. Microphones may be classified by their directional properties as omnidirectional (or non-directional) and directional [4]. The latter can be subdivided into bidirectional and unidirectional, based on their directionality patterns. Directionality patterns are usually specified in terms of the polar pattern of the microphone (Fig. 2); a small numerical sketch of the three classic patterns follows the list below.

Voice Device. Figure 2 Typical polar patterns for omnidirectional, bidirectional and unidirectional (or cardioid) microphones.

• Omnidirectional microphones. An omnidirectional (or nondirectional) microphone is a microphone whose response is independent of the direction of arrival of the sound wave: sounds coming from different directions are picked up equally. If a microphone is built to respond only to pressure, the resultant microphone is omnidirectional. These microphones are the simplest and least expensive and have the advantage of a very flat frequency response. However, the property of capturing sounds coming from every direction with the same sensitivity is very often undesirable, since it is usually desirable to capture the sounds coming from the front of the microphone but not from behind or from the sides.

• Bidirectional microphones. If a microphone is built to respond to the gradient of the pressure in a particular direction, rather than to the pressure itself, a bidirectional microphone is obtained. This is achieved by letting the sound wave reach the diaphragm not only from the front of the microphone but also from the rear, so that if a wave comes from a perpendicular direction its effects on the front and the rear cancel. This type of microphone reaches maximum sensitivity at the front and the rear, and minimum sensitivity in the perpendicular directions. This directionality pattern is particularly interesting for reducing noises from the sides of the microphone; for this reason it is sometimes said that these microphones are noise-canceling microphones. Among the disadvantages of this kind of microphone, it must be mentioned that the frequency response is not nearly as flat as that of an omnidirectional microphone, and it also varies with the direction of arrival. The frequency response also changes with the distance from the sound source to the microphone. In particular, for sounds generated close to the microphone (near field), the response at low frequencies is higher than for sounds generated far from the microphone (far field). This is known as the proximity effect. For that reason frequency responses are usually given for both far-field and near-field conditions, particularly for close-talking microphones. This type of microphone is more sensitive to noises produced by the wind, including the puffs of air induced by the pronunciation of plosive sounds (such as /p/) into close-talking microphones.

• Unidirectional microphones. These microphones have maximum response to sounds coming from the front of the microphone, nearly zero response to sounds coming from the rear, and a small response to sounds coming from the sides. Unidirectionality is achieved by building a microphone that responds to the gradient of the sound, similar to a bidirectional microphone. The null response from the rear is attained by introducing a material to slow down the acoustic waves coming from the rear, so that a wave coming from the rear takes equal time to reach the rear and front parts of the diaphragm, and the two contributions cancel out. The polar pattern of these microphones usually has the shape of a heart, and for that reason they are sometimes called cardioid microphones. These microphones have good noise-cancelation properties and are therefore very well suited for capturing clean audio input.
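As referenced above, the three classic polar patterns have simple textbook forms: a pressure microphone responds uniformly, a pressure-gradient microphone as |cos θ|, and a cardioid as (1 + cos θ)/2. A small numerical sketch:

```python
import numpy as np

theta = np.linspace(0.0, 2.0 * np.pi, 361)     # sound arrival angle (radians)

omni = np.ones_like(theta)                     # pressure: flat response
figure_eight = np.abs(np.cos(theta))           # pressure gradient (bidirectional)
cardioid = 0.5 * (1.0 + np.cos(theta))         # unidirectional, heart-shaped

# Nulls show where each pattern rejects sound: the bidirectional pattern
# at 90 and 270 degrees, the cardioid at 180 degrees (the rear).
print("bidirectional at 90 deg:", round(float(figure_eight[90]), 3))   # -> 0.0
print("cardioid at 180 deg:   ", round(float(cardioid[180]), 3))       # -> 0.0
```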


Microphone Location

Some microphones have a different frequency response when the sound source is close to the microphone (near field, or close-talking) than when the sound source is far from the microphone (far field). In fact, not only the frequency response but also the problems posed to the voice biometric application, and therefore the selection of the microphone, can differ. For this reason a few concepts about microphone location are reviewed here.

• Close-talking or near-field microphones. These microphones are located close to the mouth of the speaker, usually pointing at it. This kind of microphone can benefit from the directionality pattern to capture mainly the sounds produced by the speaker, but it can also be very sensitive to the puffs of air produced by the speaker if placed just in front of the mouth. The characteristics of the sound captured may be very different if the microphone is placed at different positions relative to the mouth, which is sometimes a problem for voice biometrics applications.

• Far-field microphones. These microphones are located at some distance from the speaker. They have the disadvantage that they tend to capture more noise than close-talking microphones, because they sometimes cannot take advantage of directionality patterns. This is particularly true if the speaker can move around as she speaks. In general, far-field microphone speech is considered to be far more difficult to process than close-talking speech. In some circumstances it is possible to take advantage of microphone arrays to locate the speaker spatially and to focus the array to listen specifically to that speaker.

Specifications

There is an international standard for microphone specifications [5], but few manufacturers follow it exactly. Among the most common specifications of a microphone, the following must be mentioned.

• Sensitivity. The sensitivity measures the efficiency of the transduction (i.e., how much voltage the microphone generates for a given input acoustic pressure). It is measured in millivolts per pascal at 1 kHz.

• Frequency response. The frequency response is a measure of the variation of the sensitivity of a microphone as a function of the frequency of the signal. It is usually represented in decibels (dB) over a frequency range typically between 0 and 20 kHz. The frequency response depends on the direction of arrival of the sound and on the distance from the sound source. It is typically measured for sound sources very far from the microphone, with the sound reaching the microphone from the front. For close-talking microphones it is also typical to give the frequency response for sources close to the microphone, to take the proximity effect into account.

• Directional characteristics. The directionality of a microphone is the variation of its sensitivity as a function of the sound arrival direction, and is usually specified in the form of a directionality pattern, as explained earlier.

• Non-linearity. Ideally, a microphone should be a linear transducer, and therefore a pure audio tone should produce a single pure voltage sinusoid at the same frequency. As microphones are not exactly linear, a pure acoustic tone produces a voltage sinusoid at the same frequency but also some harmonics. The most widespread nonlinearity measure is the total harmonic distortion (THD), which is the ratio between the power of the harmonics produced and the power of the voltage sinusoid produced at the input frequency (a computational sketch follows this list).

• Limiting characteristics. These characteristics indicate the maximum sound pressure level (SPL) that can be transduced with limited distortion by the microphone. There are two different measures: the maximum peak SPL for a given maximum THD, and the overload, clipping, or saturation level. The latter indicates the SPL that produces the maximum displacement of the diaphragm of the microphone.

• Inherent noise. A microphone, in the absence of sound, produces a voltage level due to the noise inherent in the device itself. This noise is measured as the input SPL that would produce the same output voltage, which is termed the equivalent SPL due to inherent noise. This parameter determines the minimum SPL that can be effectively transduced by the microphone.

• Dynamic range. The former parameters define the dynamic range of the microphone (i.e., the minimum and maximum SPL that can be effectively transduced).
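As referenced in the non-linearity item above, THD can be estimated from the output spectrum for a pure input tone. A minimal sketch (the simple nearest-bin lookup assumes the tone frequency aligns reasonably with an FFT bin):

```python
import numpy as np

def total_harmonic_distortion(signal, fs, f0, n_harmonics=5):
    """Ratio of harmonic power to fundamental power for a pure tone at f0."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)

    def power_at(f):
        # Power in the FFT bin nearest to frequency f.
        return spectrum[np.argmin(np.abs(freqs - f))]

    harmonics = sum(power_at(k * f0) for k in range(2, n_harmonics + 1)
                    if k * f0 < fs / 2.0)
    return harmonics / power_at(f0)
```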

Summary

Speech input devices are the first element in a voice biometric system and are sometimes not given the attention they deserve in the design of voice biometric applications. This section has presented some of the variables to take into account in the selection or design of a microphone for a voice biometric application. The right selection, design, and even placement of a microphone can be crucial for the success of a voice biometric system.

Related Entries

▶Biometric Sample Acquisition

▶ Sample Acquisition (System Design)

▶ Sensors

References

1. Eargle, J.: The Microphone Book, 2nd edn. Focal Press, Elsevier, Burlington, MA (2005)
2. National Institute of Standards and Technology (NIST): NIST Speaker Recognition Evaluation. http://www.nist.gov/speech/tests/spk/
3. Flichy, P.: Une Histoire de la Communication Moderne. La Découverte (1997)
4. Huang, X., Acero, A., Hon, H.W.: Spoken Language Processing. Prentice-Hall PTR, New Jersey (2001)
5. International Electrotechnical Commission: International Standard IEC 60268-4: Sound Systems Equipment, Part 4: Microphones. Geneva, Switzerland (2004)

Voice Evidence

The forensic evidence of voice consists of the quantified degree of similarity between the speaker-dependent features extracted from the questioned recording (trace) and the same features extracted from recorded speech of a suspect, represented by his or her model.

▶Voice, Forensic Evidence of

Voice Recognition

▶ Speaker Recognition, Overview

▶ Speaker Recognition Standardization

Voice Sample Synthesis

JUERGEN SCHROETER

AT&T Labs – Research, Florham Park, NJ, USA

Synonyms

Speech synthesis; Synthetic voice creation; Text-to-speech (TTS)

Definition

Over the last decade, speech synthesis, the technology that enables machines to talk to humans, has become so natural-sounding that a naïve listener might assume that he or she is listening to a recording of a live human speaker. Speech synthesis is not new; indeed, it took several decades to arrive where it is today. Originally starting from the idea of using physics-based models of the vocal tract, it took many years of research to perfect the encapsulation of the acoustic properties of the vocal tract as a "black box", using so-called formant synthesizers. Then, with the help of ever more powerful computing technology, it became viable to use snippets of recorded speech directly and glue them together to create new sentences, in the form of concatenative synthesizers. Combining this idea with now-available methods for fast search, potentially millions of choices are evaluated to find the optimal sequence of speech snippets to render a given new sentence. It is the latter technology that is now prevalent in the highest-quality speech synthesis systems. This essay gives a brief overview of the technology behind this progress and then focuses on the processes used in creating voice inventories for it, starting with recordings of a carefully selected donor voice. The fear of abuse of the technology is addressed by disclosing all important steps in creating a high-quality synthetic voice. It is also made clear that even the best synthetic voices today still trip up often enough so as not to fool the critical listener.


Introduction

Speech synthesis is the technology that gives computers the ability to communicate with their users by voice. When driven by text input, speech synthesis is part of the more elaborate ▶text-to-speech (TTS) synthesis, which also includes text processing (expanding abbreviations, for example), letter-to-sound transformation (rules, pronunciation dictionaries, etc.), and stress and pitch assignment [1]. Speech synthesis is often viewed as encompassing the signal-processing "back end" of text-to-speech synthesis (with text and linguistic processing being carried out in the "front end"). As such, speech synthesis takes phoneme-based information in context and transforms it into audible speech. Context information is very important because, in naturally produced speech, no speech sound stands by itself but is always highly influenced by the sounds that come before it and the sounds that follow immediately after. It is precisely this context information that is key to achieving high-quality speech output.

A high-quality TTS system can be used for many applications, from telecommunications to personal use. In the telecom area, TTS is the only practical way to provide highly flexible speech output to the caller of an automated speech-enabled service. Examples of such services include reading back name and address information, and providing news or email reading. In the personal use area, the author has witnessed the ingenious "hijacking" of AT&T's web-based TTS demonstration by a young student to fake his mother's voice in a telephone call to his school: "Timmy will be out sick today. He cannot make it to school." It seems obvious that natural-sounding, high-quality speech synthesis is vital for both kinds of applications. In the telecom area, the provider of an automated voice service might lose customers if the synthetic voice is unintelligible or sounds unnatural. If the young student wants to get an excused day off, creating a believable "real-sounding" voice seems essential. It is mostly concern about the latter kind of potential abuse that motivates this author to write this essay. If the even stricter requirement is added that the synthetic voice be indistinguishable from the voice of a specific person, the challenge is clearly significantly more difficult. Shortly after AT&T's Natural Voices TTS system became commercially available in August 2001, an article in the New York Times' Circuits section [2] asked precisely whether people will be safe from serious criminal abuse of this technology. Therefore, the purpose of this essay is to demystify the process of creating such a voice, disclose what processes are involved, and show current limitations of the technology that make it somewhat unlikely that speech synthesis could be criminally abused anytime soon.

This essay is organized as follows. The next section

briefly summarizes different speech synthesis methods,

followed by a somewhat deeper overview of the so-

called Unit Selection synthesis method that currently

delivers the highest quality speech output. The largest

section of this essay deals with creating voice databases

for unit selection synthesis. The essay concludes with

an outlook.

Overview of Voice Synthesis Methods

The voice (speech) synthesis method with the most

vision and potential, but also with somewhat unful-

filled promises, is articulatory synthesis. This method

employs mathematical models of the speech produc-

tion process in the human vocal tract, for example,

models of the mechanical vibrations of the vocal

folds (glottis) that interact with the fluid dynamics

of the laminar and turbulent airflow from the lungs to

the lips, plus linear or even nonlinear acoustical


models of sound generation and propagation along the

vocal tract. A somewhat comprehensive review of this

method is given in [3]. Due to high computational

requirements and the need for highly accurate model-

ing, articulatory synthesis is mostly useful for research

in speech production. It usually delivers unacceptably

low-quality synthetic speech.

One level higher in abstraction, and much more

practical in its use, is formant synthesis. This method

captures the characteristics of the resonances of

the human vocal tract in terms of simple filters. The

single-peaked frequency characteristic of such a filter

element is called formant. Its frequency, bandwidth

(narrow to broad), and amplitude fully specify each

formant. For adult vocal tracts, four to five formants

are enough to determine their acoustic filter character-

istics. Phonetically most relevant are the lowest three

formants that span the vowel and sonorant space of a

speaker and a language. Together with a suitable wave-

form generator that approximates the glottal pulse,

formant synthesis systems, due to their highly versatile

control parameter sets, are very useful for speech per-

ception research. More on formant synthesis can be

found in [4]. For use as a speech synthesizer, the

computational requirements are relatively low, making

this method the preferred option for embedded appli-

cations, such as reading back names (e.g., ‘‘calling

Mom’’) in a dial-by-voice cellular phone handset. Its

storage requirements are minuscule (as little as 1 MB).

Formant synthesis delivers intelligible speech when

special care is given to consonants.
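To make the formant idea concrete, the following minimal sketch (illustrative only; the resonance frequencies, bandwidths, and the crude impulse-train source are assumptions for an /a/-like vowel, not values from any production system) filters a glottal-like pulse train through a cascade of second-order resonators, one per formant:

import numpy as np

def resonator(x, freq_hz, bw_hz, fs):
    # Second-order IIR resonator realizing one formant: the pole angle
    # sets the center frequency, the pole radius sets the bandwidth.
    r = np.exp(-np.pi * bw_hz / fs)
    theta = 2.0 * np.pi * freq_hz / fs
    y = np.zeros_like(x)
    for n in range(len(x)):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = (1.0 - r) * x[n] + 2.0 * r * np.cos(theta) * y1 - r * r * y2
    return y

fs = 16000                          # sampling rate in Hz
f0 = 120                            # pitch of the synthetic voice in Hz
source = np.zeros(fs // 2)          # 0.5 s of signal
source[:: fs // f0] = 1.0           # crude glottal impulse train

vowel = source                      # cascade the formant filters
for freq_hz, bw_hz in [(700, 130), (1220, 70), (2600, 160)]:
    vowel = resonator(vowel, freq_hz, bw_hz, fs)

The result is a buzzy but recognizably vowel-like sound; the intelligibility of real formant synthesis comes from carefully steering such parameters over time.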

In the 1970s, a new method started to compete

with the, by then, well-established formant synthesis

method. Due to its main feature of stitching together

recorded snippets of natural speech, it was called con-

catenative synthesis. Many different options exist for

selecting the specific kind of elementary speech units

to concatenate. Using words as such units, although

intuitive, is not a good choice given that there are many

tens of thousands of them in a language and that each

recorded word would have to fit into several different

contexts with its neighbors, creating the need to record

several versions of each word. Therefore, word-based

concatenation usually sounds very choppy and artifi-

cial. However, subword units, such as diphones or demisyllables, turned out to be much more useful because of favorable statistics. For English, there is a

minimum of about 1500 ▶ diphones that would need

to be in the inventory of a diphone-based

concatenative synthesizer. The number is only slightly

higher for concatenating ▶ demisyllables. For both

kinds of units, however, elaborate methods are needed

to identify the best single (or few) instances of units to

store in the voice inventory, based on statistical mea-

sures of acoustic typicality and ease of concatenation,

with a minimum of audible glitches. In addition, at

synthesis time, elaborate speech signal processing is

needed to assure smooth transitions, deliver the de-

sired prosody, etc. For more details on this method, see

[5]. Concatenative synthesis, like formant synthesis,

delivers highly intelligible speech and usually has no

problem with transients like stop consonants, but usu-

ally lacks naturalness and thus cannot match the qual-

ity of direct human voice recordings. Its storage

requirements are moderate by today’s standards

(≈10–100 MB).

Unit Selection Synthesis

The effort and care given to creating the voice inventory

determines to a large extent the quality of any concatena-

tive synthesizer. For best results, most concatenative syn-

thesis researchers well up into the 1990s employed a

largely manual off-line process of trial and error that

relied on dedicated experts. A selected unit needed to fit

all possible contexts (or be made to fit by signal processing such as stretching or shrinking durations, pitch scaling,

etc.). However, morphing any given unit by signal proces-

sing in the synthesizer at synthesis time degrades voice

quality. So, the idea was born to minimize the use of signal processing by taking advantage of the ever-increasing power of computers to handle ever-increasing data sets.

Instead of outright morphing a unit to make it fit, the

synthesizer may try to pick a suitable unit from a large

number of available candidates, optionally followed by

much more moderate signal processing. The objective

is to find automatically the optimal sequence of unit

instances at synthesis time, given a large inventory of

unit candidates and the available sentence to be synthe-

sized. This new objective turned the speech synthesis

problem into a rapid search problem [6].

The process of selecting the right units in the in-

ventory that instantiate a given input text, appropri-

ately called unit selection, is outlined in Fig. 1. Here,

the word ‘‘two’’ (or ‘‘to’’) is synthesized from using

diphone candidates for silence into ‘‘t’’ (/#-t/), ‘‘t’’

into ‘‘uw’’ (/t-uw/), and ‘‘uw’’ into silence (/uw-#/).

Voice Sample Synthesis. Figure 1 Viterbi search to retrieve optimal diphone units for the word "two" or "to".


Each time slot (column in Fig. 1) has several candi-

dates to choose from. Two different objective distance

measures are employed. First, transitions from one

unit to the next (depicted by arrows in the figure) are

evaluated by comparing the speech spectra at the end

of the left-side unit candidates to the speech spectra at

the beginning of the right-side unit candidates. These are n*m comparisons, where n and m are the numbers of candidates in the left and right columns, respectively. Second, each node (circle) in the net-

work of choices depicted in Fig. 1 has an intrinsic

‘‘goodness of fit’’ measured by a so-called target cost.

The ideal target cost of a candidate unit measures the

acoustic distance of the unit against a hypothetical unit

cut from a perfect recording of the sentence to be

synthesized. However, since it is unlikely that the

exact sentence would be in the inventory, an algorithm

has to estimate the target cost using symbolic and

nonacoustic cost components such as the difference

between desired and given pitch, amplitude, and con-

text (i.e., left and right phone sequences).

The objective of selecting the optimal unit sequence

for a given sentence is to minimize the total cost that is

accumulated by summing transitional and target costs

for a given path through the network from its left-side

beginning to its right-side end. The optimal path is

the one with the minimum total cost. This path

can be identified efficiently using the Viterbi search

algorithm [7].
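The recursion can be sketched in a few lines (a toy illustration under the assumption that target_cost and join_cost functions are supplied; real systems combine many weighted cost components and add pruning):

def unit_select(columns, target_cost, join_cost):
    # columns[i] is the list of candidate units for time slot i (Fig. 1).
    # best[i][j] = (cheapest total cost of any path ending in columns[i][j],
    #               index of the predecessor candidate in columns[i-1]).
    best = [[(target_cost(u), None) for u in columns[0]]]
    for i in range(1, len(columns)):
        row = []
        for u in columns[i]:
            # Evaluate all n*m transitions from the previous column.
            cost, back = min(
                (best[i - 1][k][0] + join_cost(v, u), k)
                for k, v in enumerate(columns[i - 1])
            )
            row.append((cost + target_cost(u), back))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(columns) - 1, -1, -1):
        path.append(columns[i][j])
        j = best[i][j][1]
    return path[::-1]

For the example of Fig. 1, columns would hold the candidates for /#-t/, /t-uw/, and /uw-#/, and the returned path is the unit sequence with the minimum total cost.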

More detailed information about unit selection syn-

thesis can be found in [1, 8]. The latter book chapter also

summarizes the latest use of automatic speech recogni-

tion (ASR) technology in unit selection synthesis.

Voice Creation

Creating a simple-minded unit selection synthesizer

would involve just two steps: First, record exactly the

sentences that a user wants the machine to speak; and

second, identify at ‘‘synthesis’’ time the input sentence

to be spoken, and then play it back. In practice units

are used that are much shorter than sentences to be

able to create previously unseen input sentences, so

this simple-minded paradigm would not work. How-

ever, when employing a TTS front-end that converts

any input text into a sequence of unit specifications,

intuition may ask for actually playing back any inven-

tory sentence in its entirety in the odd chance that the

corresponding text has been entered. Since the transla-

tion of text into unit-based tags and back into speech is

not perfect, the objective is unlikely to ever be fully

met. In practice, however, the following, somewhat

weaker objective holds: as long as the text to be synthe-

sized is similar enough to that of a corresponding

recording that actually exists in the inventory, a high

output voice quality can be expected. It is for this

reason that unit-selection synthesis is particularly well

suited for so-called limited domain synthesis, such as


weather reports, stock reports, or any automated tele-

com dialogue application (banking, medical, etc.)

where the application designer can afford the luxury

of recording a special inventory, using a carefully se-

lected voice talent. High quality synthesis for general

news or email reading is usually much more difficult to

achieve because of coverage issues [9].

Voice Sample Synthesis. Figure 2 Steps in unit selection voice inventory creation.

Because unit selection synthesis, to achieve its best

quality results, mimics a simple tape recorder playback,

it is obvious that its output voice quality largely

depends on what material is in its voice inventory.

Without major modifications/morphing at synthesis

time, the synthesizer output is confined to the quality,

speaking style, and emotional state of the voice that

was recorded from the voice talent/donor speaker. For

this reason, careful planning of the voice inventory is

required. For example, if the inventory contains only

speech recorded from a news anchor, the synthesizer

will always sound like a news anchor.

Several issues need to be addressed in planning a

voice inventory for a unit selection synthesizer. The

steps involved are outlined in Fig. 2, starting with text

preparation to cover the material selected. Since voice

recordings cannot be done faster than real time, they

are always a major effort in time and expense. To get

optimal results, a very strict quality assurance process

for the recordings is paramount. Furthermore, the

content of the material to be recorded needs to be

addressed. Limited domain synthesis covers typical

text for the given application domain, including greet-

ings, apologies, core transactions, and good-byes. For

more general use such as email and news reading,

potentially hundreds of hours of speech need to be

recorded. However, the base corpus for both kinds of

applications needs to maximize linguistic coverage

within a small size. Including a core corpus that was

optimized for traditional diphone synthesis might

satisfy this need. In addition, news material, sentences

that use the most common names in different prosodic

contexts, addresses, and greetings are useful. For limited

domain applications, domain-specific scripts need to

be created. Most of them require customer input

such as getting access to text for existing voice

prompts, call flows, etc. There is a significant danger

in underestimating this step in the planning phase.

Finally, note that a smart and frugal effort in designing

the proper text corpus to record helps to reduce the

amount of data to be recorded. This, in turn, will speed

up the rest of the voice building process.

Quality assurance starts with selecting the best pro-

fessional voice talent. Besides the obvious criteria of

voice preference, accent, pleasantness, and suitability

for the task (a British butler voice might not be appro-

priate for reading instant messages from a banking

application), the voice talents needs to be very consis-

tent in how she/he pronounces the same word over time

and in different contexts. Speech production issues

might come into play, such as breath noise, frequent

lip smacks, disfluencies, and other speech defects. A

clearly articulated and pleasant sounding voice and a

natural prosodic quality are important. The same is true

for consistency in speaking rate, level, and style. Surpris-

ingly, good sight reading skills are not very common

among potential voice talents. Speakers with heavy

vocal fry (glottal vibration irregularities) or strong

nasality should be avoided. Overall, a low ratio of

usable recordings to total recordings done in a test

run is a good criterion for rejecting a voice talent.


Pronunciations of rare words, such as foreign names,

need to be agreed upon beforehand and their realiza-

tions monitored carefully. Therefore, phonetic supervi-

sion has to be part of all recording sessions.

Next, the recording studio used for the recording

sessions should have almost ‘‘anechoic’’ acoustic char-

acteristics and a very low background noise in order to

avoid coloring or tainting the speech spectrum in any

way. Since early acoustic reflections off a nearby wall or

table are highly dependent on the time-varying geom-

etry relative to the speaker’s mouth and to the micro-

phone, the recording engineer needs to make sure that

the speaker does not move at all (unrealistic) or mini-

mize these reflections. The recording engineer also

needs to make sure that sound levels, and trivial things

like the file format of the recordings are consistent and

on target. Finally, any recorded voice data needs to be

validated and inconsistencies between desired text and

actually spoken text reconciled (e.g., the speaker reads

‘‘vegetarian’’ where ‘‘veterinarian’’ was requested).

Automatic labeling of large speech corpora is a

crucial step because manual labeling by linguists is

slow (up to 500 times real time) and potentially incon-

sistent (different human labelers disagree). Therefore,

an automatic speech recognizer (ASR) is used in

so-called forced alignment mode for phonetic labeling.

Given the text of a sentence, the ASR identifies the

identities and the beginnings and ends of all ▶ pho-

nemes. ASR might employ several passes, starting from

speaker-independent models, and adapting these mod-

els to the given single speaker, and his/her speaking

style. Adapting the pronunciation dictionary to the

specific speaker’s individual pronunciations is vital to

get the correct phoneme sequence for each recorded

word. Pronunciation dictionaries used for phonetic

labeling should also be used in the synthesizer. In

addition, an automated prosodic labeler is useful for

identifying typical stress and pitch patterns, prominent

words, and phrase boundaries. Both kinds of automat-

ic labeling need to use paradigms and conventions

(such as phoneme sets and symbolic ▶ prosody tags)

that match those used in the TTS front-end at synthe-

sis time. A good set of automatic labeling and other

tools allowed the author’s group of researchers to

speed up their voice building process by more than

100 times over 6 years.
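The core of forced alignment can be sketched as a dynamic program (heavily simplified; the per-frame log-likelihoods in loglik are assumed to come from acoustic models such as HMM states, and real aligners add minimum durations, optional silences, and pruning):

import numpy as np

def forced_align(loglik):
    # loglik[t, p]: log-likelihood of frame t under the p-th phoneme of
    # the known transcription. Returns one phoneme index per frame; the
    # points where the index changes are the phoneme boundaries.
    T, P = loglik.shape
    score = np.full((T, P), -np.inf)
    back = np.zeros((T, P), dtype=int)
    score[0, 0] = loglik[0, 0]          # alignment starts in phoneme 0
    for t in range(1, T):
        for p in range(P):
            stay = score[t - 1, p]
            advance = score[t - 1, p - 1] if p > 0 else -np.inf
            prev, best = (p, stay) if stay >= advance else (p - 1, advance)
            score[t, p] = best + loglik[t, p]
            back[t, p] = prev
    labels = [P - 1]                    # alignment ends in the last phoneme
    for t in range(T - 1, 0, -1):
        labels.append(back[t, labels[-1]])
    return labels[::-1]

Because the phoneme sequence is forced to match the known text, only the boundary positions are left for the search to decide, which is what makes alignment so much faster and more consistent than manual labeling.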

Once the recordings are done, the first step in the

voice building process is to build an index of which

sound (phoneme) is where, normalize the amplitudes,

and extract acoustic and segmental features, and then build distance tables used to trade off (weigh) the different cost components of unit selection described in the previous section. One important part of the runtime synthesiz-

er, the so-called Unit Preselection (a step used to nar-

row down the potentially very large number of

candidates) can be sped up by looking at statistics

of triples of phonemes (i.e., so-called triphones) and

caching the results. Then, running a large independent

training text corpus through the synthesizer and

gathering statistics of unit use can be used to build a

so-called join cache that eliminates recomputing join

costs at runtime for a significant speedup. The final

assembly of the voice database may include reordering

of units for access efficiency plus packaging the voice

data and indices.
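A join cache can be sketched as a lookup table filled offline (a minimal illustration; the unit IDs, the expensive spectral comparison behind compute_join_cost, and the training-text driver are stand-ins, not a particular system's design):

# Offline: run a large training text through the synthesizer and memoize
# every join cost it computes; ship the table with the voice database so
# frequent joins need no recomputation at runtime.
join_cache = {}

def cached_join_cost(left_id, right_id, compute_join_cost):
    key = (left_id, right_id)
    if key not in join_cache:                 # compute once offline...
        join_cache[key] = compute_join_cost(left_id, right_id)
    return join_cache[key]                    # ...then reuse at runtime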

Voice database validation consists of comprehen-

sive, iterative testing with the goal of identifying bad

units, either by automatic identification tools or by

many hours of careful listening and ‘‘detective’’ work

(where did this bad sound come from?), plus repair.

Allocating sufficient testing time before compute-

intensive parts of the voice building process (e.g.,

cache building) is a good idea. Also, setting realistic

expectations with the customer (buyer of the voice

database) is vital. For example, the author found that

the ‘‘damage’’ that the TTS-voice creation and synthe-

sis process introduces relative to a direct recording

seems to be somewhat independent of the voice talent.

Therefore, starting out with a ‘‘bad’’ voice talent will

only lead to a poorer sounding synthetic voice. Reduc-

ing the TTS damage over time is the subject of ongoing

research in synthesis-related algorithms employed in

voice synthesis.

The final step in unit selection voice creation is for-

mal customer acceptance and, potentially, ongoing

maintenance. Formal customer acceptance is needed

to avoid disagreements over expected and delivered qual-

ity, coverage, etc. Ongoing maintenance assures high

quality for slightly different applications or application

domains, including, for example, additional recordings.

Conclusion

This essay highlighted the steps involved in creating a

high-quality sample-based speech synthesizer. Special

focus was given to the process of voice inventory

creation.


From the details in this essay, it should be clear

that voice inventory creation is not trivial. It involves

many weeks of expert work and, most importantly, full

collaboration with the chosen voice talent. The idea of

(secretly) recording any person and creating a synthet-

ic voice that sounds just like her or him is simply

impossible, given the present state of the art. Collecting

the several hundred hours of recordings necessary to have a good chance of success in creating such a voice

inventory is only practical when high-quality archived

recordings are already available that were recorded

under very consistent acoustic conditions. A possible

workable example would be an archive containing a

year or more of evening news read by a well-known

news anchor. Even then, however, one would need to

be concerned about voice consistency, since even slight

cold infections, as well as more gradual natural changes

over time (i.e., caused by aging of the speaker) can

make such recordings unusable.

An interesting extension to the sample synthesis of

(talking) faces was made in [10]. The resulting head-

and-shoulder videos of synthetic personal agents are

largely indistinguishable from video recordings of the

face talent. Again, similar potential abuse issues are a

concern.

One specific concern is that unit-selection voice

synthesis may ‘‘fool’’ automatic speaker verification

systems. Unlike a human listener’s ear that is able to

pick up the subtle flaws and repetitiveness of a

machine’s renderings of a human voice, today’s speaker

verification systems are not (yet) designed to pay at-

tention to small blurbs and glitches that are a clear

giveaway of a unit selection synthesizer’s output, but

this could change if it became a significant problem. If

this happens, perceptually undetectable watermarking

is an option to identify a voice (or talking face) sample

as ‘‘synthetic’’. Other procedural options include ask-

ing for a second rendition of the passphrase and

comparing the two versions. If they are too similar

(or even identical), reject the speaker identity claim

as bogus.

Related Entries

▶Hidden Markov Model (HMM)

▶ Speaker Databases and Evaluation

▶ Speaker Matching

▶ Speaker Recognition, Overview

▶ Speech Production

References

1. Schroeter, J.: Basic principles of speech synthesis, In: Benesty, J.

(ed.) Springer Handbook of Speech Processing and Communi-

cation, Chap. 19 (2008)

2. Bader, J.L.: Presidents as pitchmen, and posthumous play-by-

play, commentary in the New York Times, August 9 (2001)

3. van Santen, J., Sproat, R., Olive, J., Hirschberg, J., (eds.): Prog-

ress in speech synthesis, section III. Springer, NY (1997)

4. Holmes, J.N.: Formant synthesizers: cascade or parallel? Speech Commun. 2(4), 251–273 (1983)

5. Sproat, R. (ed.): Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Kluwer Academic Publishers, Dordrecht (1998)

6. Hunt, A., Black, A.W.: Unit selection in a concatenative speech

synthesis system using a large speech database. In: Proceedings

of the ICASSP-96, pp. 373–376, GA, USA (1996)

7. Forney, G.D.: The Viterbi algorithm. Proc. IEEE 61(3), 268–278

(1973)

8. Dutoit, T.: Corpus-based speech synthesis, In: Benesty, J. (ed.)

Springer Handbook of Speech Processing and Communication,

Chap. 21 (2008)

9. van Santen, J.: Prosodic processing. In: Benesty, J. (ed.) Springer

Handbook of Speech Processing and Communication, Chap. 23

(2008)

10. Cosatto, E., Graf, H.P., Ostermann, J., Schroeter, J.: From

audio-only to audio and video text-to-speech. Acta Acustica

90, 1084–1095 (2004)

Voice Verification

▶ Liveness Assurance in Voice Authentication

Voice, Forensic Evidence of

ANDRZEJ DRYGAJLO

Swiss Federal Institute of Technology Lausanne

(EPFL), Lausanne, Switzerland

Synonym

Forensic speaker recognition

Definition

Forensic speaker recognition is the process of determin-

ing if a specific individual (suspected speaker) is the


source of a questioned voice recording (trace). The

forensic application of speaker recognition technology

is one of the most controversial issues within the wide

community of researchers, experts, and police workers.

This is mainly due to the fact that very different methods

are applied in this area by phoneticians, engineers, law-

yers, psychologists, and investigators. The approaches

commonly used for speaker recognition by forensic

experts include the aural-perceptual, the auditory-

instrumental, and the automatic methods. The forensic

expert’s role is to testify to the worth of the evidence by

using, if possible, a quantitative measure of this worth.

It is up to other people (the judge and/or the jury) to

use this information as an aid to their deliberations

and decision.

This essay aims at presenting forensic automatic

speaker recognition (FASR) methods that provide a

coherent way of quantifying and presenting recorded

voice as scientific evidence. In such methods, the evi-

dence consists of the quantified degree of similarity

between speaker-dependent features extracted from

the trace and speaker-dependent features extracted

from recorded speech of a suspect. The interpretation

of a recorded voice as evidence in the forensic context

presents particular challenges, including within-speaker

(within-source) variability, between-speakers (between-

sources) variability, and differences in recording sessions

conditions. Consequently, FASR methods must provide

a probabilistic evaluation which gives the court an indi-

cation of the strength of the evidence given the estimated

within-source, between-sources, and between-session

variabilities.


Introduction

Speaker recognition is the general term used to include

all of the many different tasks of discriminating people

based on the sound of their voices. Forensic speaker

recognition involves the comparison of recordings of

an unknown voice (questioned recording) with one

or more recordings of a known voice (voice of the

suspected speaker) [1, 2].

There are several types of forensic speaker recog-

nition [3, 4]. When the recognition employs any

trained skill or any technologically-supported proce-

dure, the term technical forensic speaker recognition

is often used. In contrast to this, so-called naïve for-

ensic speaker recognition refers to the application of

un-reflected everyday abilities of people to recognize

familiar voices.

The approaches commonly used for technical foren-

sic speaker recognition include the aural-perceptual,

auditory-instrumental, and automatic methods [2].

Aural-perceptual methods, based on human auditory

perception, rely on the careful listening of recordings

by trained phoneticians, where the perceived differ-

ences in the speech samples are used to estimate the

extent of similarity between voices [3]. The use of

aural-spectrographic speaker recognition can be con-

sidered as another method in this approach. The

exclusively visual comparison of spectrograms in what

has been called the ‘‘▶ voiceprint ’’ approach has come

under considerable criticism in the recent years [5]. The

auditory-instrumental methods involve the acoustic

measurements of various parameters, such as the aver-

age fundamental frequency, articulation rate, formant

centre-frequencies, etc. [4]. The means and variances

of these parameters are compared. FASR is an estab-

lished term used when automatic speaker recognition

methods are adapted to forensic applications. In auto-

matic speaker recognition, the statistical or determin-

istic models of acoustic features of the speaker’s voice

and the acoustic features of questioned recordings are

compared [6].

FASR offers data-driven methodology for quanti-

tative interpretation of recorded speech as evidence.

It is a relatively recent application of digital speech

signal processing and pattern recognition for judicial

purposes and particularly law enforcement. Results

of FASR based investigations may be of pivotal im-

portance at any stage of the course of justice, be it the

very first police investigation or a court trial. FASR

has been gaining more and more importance ever

since the telephone has become an almost ideal

tool for the commission of certain criminal offences,

especially drug dealing, extortion, sexual harassment,

and hoax calling. To a certain degree, this is undoubt-

edly a consequence of the highly-developed and fully

automated telephone networks, which may safeguard

a perpetrator’s anonymity. Nowadays, speech com-

munications technology is accessible anywhere, any-

time and at a low price. It helps to connect people,

but unfortunately also makes criminal activities

easier. Therefore, the identity of a speaker and the

interpretation of recorded speech as evidence in

the forensic context are quite often at issue in court

cases [1, 7].


Although several speaker recognition systems for

commercial applications (mostly speaker verification)

have been developed over the past 30 years, until

recently the development of a reliable technique for

FASR has been unsuccessful because methodological

aspects concerning automatic recognition of speakers

in criminalistics and the role of the forensic expert have

not been investigated sufficiently [8]. The role of a

forensic expert is to testify in court using, if possible,

quantitative measures that estimate the value and

strength of the evidence. The judge and/or the jury

use the testimony as an aid to the deliberations and

decisions [9].

A forensic expert testifying in court is not an advo-

cate, but a witness who presents factual information

and offers a professional opinion based upon that

factual information. In order for it to be effective, it

must be carefully documented, and expressed with

precision in a neutral and objective way with the adver-

sary system in mind. Technical concepts based on

digital signal processing and pattern recognition must

be articulated in layman's terms such that the judge and

the attorneys may understand them. They should also

be developed according to specific recommendations

that also take into account the forensic, legal, judicial,

and criminal policy perspectives. Therefore, forensic

speaker recognition methods should be developed

based on current state-of-the-art interpretation of

forensic evidence, the concept of identity used in crim-

inalistics, a clear understanding of the inferential pro-

cess of identity, and the respective duties of the actors

involved in the judicial process, jurists, and forensic

experts.

Voice as Evidence

When using FASR, the goal is to identify whether an

unknown voice of a questioned recording (trace)

belongs to a suspected speaker (source). The ▶ voice

evidence consists of the quantified degree of similarity

between speaker dependent features extracted from the

trace, and speaker dependent features extracted from

recorded speech of a suspect, represented by his or her

model [1], so the evidence does not consist of

the speech itself. To compute the evidence, the proces-

sing chain illustrated in Fig. 1 may be employed [10].

As a result, the suspect’s voice can be recognized as the

recorded voice of the trace, to the extent that the

evidence supports the hypothesis that the questioned

and the suspect’s recorded voices were generated by

the same person (source) rather than the hypothesis

that they were not. However, the calculated value of

evidence does not allow the forensic expert alone to

make an inference on the identity of the speaker.

As no ultimate set of speaker specific features is

present or detected in speech, the recognition process

remains in essence a statistical-probabilistic process

based on models of speakers and collected data,

which depend on a large number of design decisions.

Information available from the auditory features and

their evidentiary value depend on the speech organs

and language used [3]. The various speech organs have

to be flexible to carry out their primary functions

such as eating and breathing as well as their secondary

function of speech, and the number and flexibility of

the speech organs results in a high number of ‘‘degrees

of freedom’’ when producing speech. These ‘‘degrees of

freedom’’ may be manipulated at will or may be subject

to variation due to external factors such as stress,

fatigue, health, and so on. The result of this plasticity

of the vocal organs is that no two utterances from the

same individual are ever identical in a physical sense.

In addition to this, the linguistic mechanism (lan-

guage) driving the vocal mechanism is itself far from

invariant. We are all aware of changing the way we

speak, including the loudness, pitch, emphasis, and

rate of our utterances; aware, probably, too, that style,

pronunciation, and to some extent dialect, vary as we

speak in different circumstances. Speaker recognition

thus involves a situation where neither the physical

basis of a person’s speech (the vocal organs) nor the

language driving it, are constant.

The speech signal can be represented by a sequence

of short-term feature vectors. This is known as feature

extraction (Fig. 1). It is typical to use features based on

the various speech production and perception models.

Although there are no exclusive features conveying

speaker identity in the speech signal, from the source-

filter theory of speech production it is known that the

speech spectrum envelope encodes information about

the speaker’s vocal tract shape [11]. Thus some form

of spectral envelope based features is used in most

speaker recognition systems even if they are dependent

on external recording conditions. Recently, the major-

ity of speaker recognition systems have converged to

the use of cepstral features derived from the envelope

spectra models [1].
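As an illustration of such features (a sketch assuming the open-source librosa library; the file name, sampling rate, and coefficient count are arbitrary example choices, not values prescribed by this essay):

import librosa

# One 20-dimensional cepstral vector per 10 ms frame of the recording.
samples, sr = librosa.load("questioned_recording.wav", sr=8000)  # hypothetical file
mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=20, hop_length=80)
print(mfcc.shape)   # (20, number_of_frames)

Each column of this matrix is one short-term feature vector of the kind compared by the recognition stage.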

Voice, Forensic Evidence of. Figure 1 Block diagram of the evidence processing and interpretation system. © IEEE.


Thus, the most persistent real-world challenge in

this field is the variability of speech. There is within-

speaker (within-source) variability as well as between-

speakers (between-sources) variability. Consequently,

forensic speaker recognition methods should provide

a statistical-probabilistic evaluation, which attempts

to give the court an indication of the strength of the

evidence, given the estimated within-source variability

and the between-sources variability [4, 10].

Bayesian Interpretation of Evidence

To address these variabilities, a probabilistic model [9],

Bayesian inference [8] and data-driven approaches [6]

appear to be adequate: in FASR statistical techniques

the distribution of various features extracted from a

suspect’s speech is compared with the distribution of

the same features in a reference population with re-

spect to the questioned recording. The goal is to infer

the identity of a source [9], since it cannot be known

with certainty.

The inference of identity can be seen as a reduction

process, from an initial population to a restricted class,

or, ultimately, to unity [8]. Recently, an investigation

concerning the inference of identity in forensic speaker

recognition has shown the inadequacy of the speaker

verification and speaker identification (in closed set

and in open set) techniques [8]. Speaker verification

and identification are the two main automatic techni-

ques of speech recognition used in commercial appli-

cations. When they are used for forensic speaker

recognition they imply a final discrimination decision

based on a threshold. Speaker verification is the task of


deciding, given a sample of speech, whether a specified

speaker is the source of it. Speaker identification is the

task of deciding, given a sample of speech, which

among many speakers is the source of it. Therefore,

these techniques are clearly inadequate for forensic

purposes, because they force the forensic expert to

make decisions which are devolved upon the court.

Consequently, the state-of-the-art speaker recognition

algorithms using dynamic time warping (DTW) and

hidden Markov models (HMMs) for text-dependent

tasks, and vector quantization (VQ), Gaussian mixture

models (GMMs), ergodic HMMs and others for text-

independent tasks have to be adapted to the Bayesian

interpretation framework which represents an ade-

quate solution for the interpretation of the evidence

in the judicial process [9].

The court is faced with decision-making under un-

certainty. In a case involving FASR it wants to know

how likely it is that the speech samples of questioned

recording have come from the suspected speaker.

The answer to this question can be given using

the Bayes’ theorem and a data-driven approach to

interpret the evidence [1, 7, 10].

The odds form of Bayes’ theorem shows how new

data (questioned recording) can be combined with

prior background knowledge (prior odds (province

of the court)) to give posterior odds (province of the

court) for judicial outcomes or issues (Eq. 1). It allows

for revision based on new information of a measure of

uncertainty (likelihood ratio of the evidence (province

of the forensic expert)) which is applied to the pair

of competing hypotheses: H0 – the suspected speaker

is the source of the questioned recording, H1 – the

speaker at the origin of the questioned recording is

not the suspected speaker.

\[
\underbrace{\frac{p(H_0 \mid E)}{p(H_1 \mid E)}}_{\substack{\text{posterior odds}\\ \text{(province of the court)}}}
=
\underbrace{\frac{p(E \mid H_0)}{p(E \mid H_1)}}_{\substack{\text{likelihood ratio}\\ \text{(province of the expert)}}}
\times
\underbrace{\frac{p(H_0)}{p(H_1)}}_{\substack{\text{prior odds}\\ \text{(province of the court)}}}
\tag{1}
\]
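For a purely illustrative reading of Eq. 1 (the prior odds of 1:100 are invented for this example; the likelihood ratio of 9.165 is the one obtained in the casework example later in this essay), evidence that is about nine times more likely under H0 than under H1 moves prior odds of 1:100 to posterior odds of roughly 1:11; the evidence strengthens H0, while the prior and posterior odds themselves remain the province of the court:

\[
\frac{p(H_0 \mid E)}{p(H_1 \mid E)} = 9.165 \times \frac{1}{100} \approx \frac{1}{10.9}
\]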

This hypothetical-deductive reasoning method, based

on the odds form of the Bayes’ theorem, allows evalu-

ating the likelihood ratio of the evidence that leads

to the statement of the degree of support for one

hypothesis against the other. The ultimate question

relies on the evaluation of the probative strength of

this evidence provided by an automatic speaker recog-

nition method [12]. Recently, it was demonstrated that

outcome of the aural (subjective) and instrumental

(objective) approaches can also be expressed as a

Bayesian likelihood ratio [4, 13].

Strength of Evidence

The ▶ strength of voice evidence is the result of the

interpretation of the evidence, expressed in terms of

the likelihood ratio of two alternative hypotheses. The

principal structure for the calculation and the inter-

pretation of the evidence is presented in Fig. 1. It

includes the collection (or selection) of the databases,

the automatic speaker recognition and the Bayesian

interpretation [10].

The methodological approach based on a Bayesian

interpretation (BI) framework is independent of the

automatic speaker recognition method chosen, but the

practical solution presented in this essay as an example

uses text-independent speaker recognition system based

on Gaussian mixture model (GMM) [14].

The Bayesian interpretation (BI) methodology

needs a two-stage statistical approach [10]. The first

stage consists in modeling multivariate feature data

using GMMs. The second stage transforms the data

to a univariate projection based on modeling the simi-

larity scores. The exclusively multivariate approach is

also possible but it is more difficult to articulate

in layman's terms [15]. The GMM method is not only

used to calculate the evidence by comparing the

questioned recording (trace) to the GMM of the sus-

pected speaker (source), but it is also used to produce

data necessary to model the within-source variability

of the suspected speaker and the between-sources

variability of the potential population of relevant

speakers, given the questioned recording. The interpre-

tation of the evidence consists of calculating the likeli-

hood ratio using the probability density functions

(pdfs) of the variabilities and the numerical value of

evidence.

The information provided by the analysis of the

questioned recording (trace) leads to specify the initial

reference population of relevant speakers (potential pop-

ulation) having voices similar to the trace, and,


combined with the police investigation, to focus on and

select a suspected speaker. The methodology presented

needs three databases for the calculation and the inter-

pretation of the evidence: the potential population data-

base (P), the suspected speaker reference database (R),

and the suspected speaker control database (C) [14].

The potential population database (P) is a database

for modeling the variability of the speech of all the

potential relevant sources, using the automatic speaker

recognition method. It allows evaluating the between-

sources variability given the questioned recording,

which means the distribution of the similarity scores

that can be obtained, when the questioned recording is

compared to the speaker models (GMMs) of the po-

tential population database. The calculated between-

sources variability pdf is then used to estimate the

denominator of the likelihood ratio p(E|H1). Ideally,

the technical characteristics of the recordings (e.g.,

signal acquisition and transmission) should be chosen

according to the characteristics analyzed in the trace.

The suspected speaker reference database (R) is

recorded with the suspected speaker to model his/her

speech with the automatic speaker recognition method.

In this case, speech utterances should be produced in

the same way as those of the P database. The sus-

pected speaker model obtained is used to calculate the


value of the evidence, by comparing the questioned

recording to the model.

The suspected speaker control database (C) is

recorded with the suspected speaker to evaluate her/his

within-source variability, when the utterances of this

database are compared to the suspected speaker model

(GMM). This calculated within-source variability pdf

is then used to estimate the numerator of the likeli-

hood ratio p(E|H0). The recordings of the C database should consist of utterances as equivalent to the trace as possible with respect to the technical characteristics, as well as to the quantity and style of speech.

The basic method proposed has been exhaustively

tested in mock forensic cases corresponding to real

casework [11, 14]. In an example presented in Fig. 2,

the strength of evidence, expressed in terms of likeli-

hood ratio gives LR = 9.165 for the evidence value

E = 9.94, in this case. This means that it is 9.165

times more likely to observe the score E given the

hypothesis H0 than H1. The important point to be

made here is that the estimate of the LR is only as

good as the modeling techniques and databases used

to derive it. In the example, the GMM technique was

used to estimate pdfs from the data representing simi-

larity scores [11].
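The two-stage computation can be sketched as follows (a minimal illustration using scikit-learn's GaussianMixture and a kernel-density estimate of the score distributions; every feature array here is a random stand-in for data that would come from the trace and from the R, C, and P databases):

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def fit_gmm(frames):
    # Stage 1: model a speaker's feature vectors with a GMM.
    return GaussianMixture(n_components=8, random_state=0).fit(frames)

suspect_gmm = fit_gmm(rng.normal(0.0, 1.0, (2000, 12)))         # database R
population_gmms = [fit_gmm(rng.normal(0.5, 1.2, (2000, 12)))    # database P
                   for _ in range(30)]
questioned = rng.normal(0.1, 1.0, (300, 12))                    # trace features

# Evidence E: mean log-likelihood of the trace under the suspect's model.
E = suspect_gmm.score(questioned)

# Stage 2: univariate score distributions.
within = np.array([suspect_gmm.score(rng.normal(0.0, 1.0, (300, 12)))
                   for _ in range(30)])                         # database C
between = np.array([gmm.score(questioned) for gmm in population_gmms])

# LR = ratio of the two estimated score pdfs evaluated at E.
LR = gaussian_kde(within)(E)[0] / gaussian_kde(between)(E)[0]
print(f"E = {E:.2f}, LR = {LR:.2f}")

Note that the between-sources scores compare the questioned recording to the population models, exactly as described above, while the within-source scores compare the suspect's control recordings to the suspect's own model.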

Voice, Forensic Evidence of. Figure 2 The LR estimation given the value of the evidence E. © IEEE.


Evaluation of the Strength of Evidence

The likelihood ratio (LR) summarizes the statement of

the forensic expert in the casework. However, the great-

est interest to the jurists is the extent to which the LRs

correctly discriminate ‘‘the same speaker and different-

speaker’’ pairs under operating conditions similar to

those of the case in hand. As was made clear in the US

Supreme Court decision in the Daubert case (Daubert v. Merrell Dow Pharmaceuticals, 1993), a criterion for the admissibility of scientific evidence should be to know to what extent the method can be, and has been, tested.

The principle for evaluation of the strength of

evidence consists in the estimation and the comparison

of the likelihood ratios that can be obtained from the

evidence E, on one hand when the hypothesis H0 is

true (the suspected speaker truly is the source of the

questioned recording) and, on the other hand, when

the hypothesis H1 is true (the suspected speaker is truly

not the source of the questioned recording) [14]. The

performance of an automatic speaker recognition

method is evaluated by repeating the experiment de-

scribed in the previous sections, with several speakers

being at the origin of the questioned recording, and by

representing the results using experimental (histogram-based) probability distribution plots such as probabili-

ty density functions and cumulative distribution func-

tions in the form of Tippett plots (Fig. 3a) [10, 14].

The way of representation of the results in the form

of Tippett plots is the one proposed by Evett and


Buckleton in the field of interpretation of the forensic

DNA analysis [6]. The authors have named this repre-

sentation ‘‘Tippett plot,’’ referring to the concepts of

‘‘within-source comparison’’ and ‘‘between-sources

comparison’’ defined by Tippett et al.

Forensic Speaker Recognition in Mismatched Conditions

Nowadays, state-of-the-art automatic speaker recogni-

tion systems show very good performance in discrimi-

nating between voices of speakers under controlled

recording conditions. However, the conditions in

which recordings are made in investigative activities

(e.g., anonymous calls and wire-tapping) cannot be

controlled and pose a challenge to automatic speaker

recognition. Differences in the background noise, in

the phone handset, in the transmission channel, and in

the recording devices can introduce variability over

and above that of the voices in the recordings. The

main unresolved problem in FASR today is that of

handling mismatch in recording conditions, also in-

cluding mismatch in languages, linguistic content, and

non-contemporary speech samples. Mismatch in re-

cording conditions has to be considered in the estima-

tion of the likelihood ratio [11–13]. A next step can be the combination of the strength of evidence using

aural-perceptive and acoustic-phonetic approaches

(aural-instrumental) of trained phoneticians with

that of the likelihood ratio returned by the automatic


system [4]. In order for FASR to be acceptable for

presentation in the courts, the methods and techniques

have to be researched, tested and evaluated for error, as

well as be generally accepted in the scientific commu-

nity. The methods proposed should be analyzed in the

light of the admissibility of scientific evidence (e.g.,

Daubert ruling, USA, 1993) [11].


Summary

The essay discussed some important aspects of fore-

nsic speaker recognition, focusing on the necessary sta-

tistical-probabilistic framework for both quantifying

and interpreting recorded voice as scientific evidence.

Methodological guidelines for the calculation of the

evidence, its strength and the evaluation of this strength

under operating conditions of the casework were pre-

sented. As an example, an automatic method using the

Gaussian mixture models (GMMs) and the Bayesian

interpretation (BI) framework were implemented for

the forensic speaker recognition task. The BI method

represents neither speaker verification nor speaker iden-

tification. These two recognition techniques cannot be

used for the task, since categorical, absolute and deter-

ministic conclusions about the identity of source of

evidential traces are logically untenable because of the

inductive nature of the process of the inference of iden-

tity. This method, using a likelihood ratio to indicate the

strength of the evidence of the questioned recording,

measures how this recording of voice scores for the

suspected speaker model, compared to relevant non-

suspect speaker models. It became obvious that partic-

ular effort is needed in the trans-disciplinary domain of

adaptation of the state-of-the-art speech recognition

techniques to real-world environmental conditions for

forensic speaker recognition. The future methods to be

developed should combine the advantages of automatic

signal processing and pattern recognition objectivity

with the methodological transparency required in

forensic investigations.

Related Entries

▶ Forensic Biometrics

▶ Forensic Evidence

▶ Speaker Recognition, An Overview

References

1. Rose, P.: Forensic Speaker Identification. Taylor & Francis,

London (2002)

2. Dessimoz, D., Champod, C.: Linkages between biometrics

and forensic science. In: Jain, A., Flynn, P., Ross, A. (eds.)

Handbook of Biometrics, pp. 425–459. Springer, New York

(2008)

3. Nolan, F.: Speaker identification evidence: its forms, limitations,

and roles. In: Proceedings of the Conference ‘‘Law and Language:

Prospect and Retrospect’’, Levi (Finnish Lapland), pp. 1–19

(2001)

4. Rose, P.: Technical forensic speaker recognition: Evaluation,

types and testing of evidence. Comput. Speech Lang. 20(2–3),

159–191 (2006)

5. Meuwly, D.: Voice analysis. In: Siegel, J., Knupfer, G., Saukko,

P. (eds.) Encyclopedia of Forensic Sciences, pp. 1413–1421.

Academic Press, London (2000)

6. Drygajlo, A.: Forensic automatic speaker recognition. IEEE

Signal Process. Mag. 24(2), 132–135 (2007)

7. Robertson, B., Vignaux, G.: Interpreting Evidence. Evaluating

Forensic Science in the Courtroom. John Wiley & Sons, Chiche-

ster (1995)

8. Champod, C., Meuwly, D.: The inference of identity in forensic

speaker identification.’’ Speech Commun. 31(2–3), 193–203

(2000)

9. Aitken, C., Taroni, F.: Statistics and the Evaluation of

Evidence for Forensic Scientists. John Wiley & Sons, Chichester

(2004)

10. Drygajlo, A., Meuwly, D., Alexander, A.: Statistical

methods and Bayesian interpretation of evidence in forensic

automatic speaker recognition. In: Proceedings of Eighth

European Conference on Speech Communication and

Technology (Eurospeech’03), pp. 689–692 Geneva, Switzerland,

(2003)

11. Alexander, A.: Forensic automatic speaker recognition using

Bayesian interpretation and statistical compensation for mis-

matched conditions. Ph.D. thesis, EPFL (2005)

12. Gonzalez-Rodriguez, J., Drygajlo, A., Ramos-Castro, D., Garcia-

Gomar, M., Ortega-Garcia, J.: Robust estimation, interpretation

and assessment of likelihood ratios in forensic speaker recogni-

tion. Comput. Speech Lang. 20(2–3), 331–355 (2006)

13. Alexander, A., Dessimoz, D., Botti, F., Drygajlo, A.: Aural and

automatic forensic speaker recognition in mismatched condi-

tions. Int. J. Speech Lang. Law, 12(2), 214–234 (2005)

14. Meuwly, D., Drygajlo, A.: Forensic speaker recognition based

on a Bayesian framework and Gaussian mixture model-

ling (GMM). In: Proceedings 2001: A Speaker Odyssey, The

Speaker Recognition Workshop, pp. 145–150, Crete, Greece (2001)

15. Alexander, A., Drygajlo, A.: Scoring and direct methods for the

interpretation of evidence in forensic speaker recognition.

In: Proceedings of Eighth International Conference on Spoken

Language Processing (ICSLP’04), pp. 2397–2400 Jeju, Korea,

(2004)


Voiced Sounds

Voiced speech is generated by the modulation of the airstream from the lungs through periodic opening and closing of the vocal folds in the glottis or larynx. This voicing is used, e.g., for vowels and nasal consonants.

▶ Speech Production

Voiceprint

Voiceprint is another name for spectrogram. The term is usually avoided because of its association with voiceprint recognition, a highly controversial method of forensic speaker recognition that relies exclusively on visual examination of spectrograms.

▶Voice, Forensic Evidence of

Volunteer Crew

The volunteer crew for a biometric test consists of the individuals who participate in the evaluation of the biometric system and from whom biometric samples are taken.

▶Test Sample and Size