Vascular Biometrics
▶Vascular Image Data Format, Standardization
Vascular Image Data Format, Standardization
ALEX H. CHOI1, JONATHAN R. AGRE2
1Department of Information Engineering, Myongji University, Seoul, South Korea
2Fujitsu Laboratories of America, College Park, MD, USA
Synonyms
Vascular biometrics; Vein biometrics
Definition
A Vascular Image Format Standard (see ▶Biometrics, Overview) is useful for the exchange of vascular biometric
image information across different systems developed
by multiple organizations. As one part of this standar-
dization effort, the International Organization for Standardization (ISO) has published a standard for a vascular biometric image interchange format, ISO/IEC 19794-9 (Biometric Data Interchange Formats – Part 9: Vascular Image Data). The standard includes general require-
ments for image capture devices, environmental condi-
tions, specific definitions of image attributes, and the
data record format for storing and transmitting vascu-
lar biometric images. The vascular biometric image
format standard was developed in response to the
need for system interoperability which allows different
vascular biometric systems to be easily integrated with
other biometric modalities in a large-scale system.
Introduction
Vascular biometric technologies have existed for many
years. Moreover, new technologies employing vascular
images obtained from various parts of the human body
are emerging or under continuous improvement as a
result of new, state-of-the-art imaging devices. Some of
these technologies are being widely adopted as reliable
biometric modalities [1].
Vascular biometrics offer several intrinsic advan-
tages in comparison with the other popular biometric
modalities. First, the vascular imaging devices use
near-infrared or infrared images to capture the vein
pattern underneath the skin. This provides a high
degree of privacy that is not available with fingerprints,
which can be unintentionally left on objects, or by
facial images for face recognition schemes, which are
easily captured without one's knowledge. A similar
possibility exists for iris images captured without con-
sent for use in iris recognition schemes. Second, the
vascular imaging devices can be constructed to operate
in a non-contact fashion, so that it is not necessary for
the individual to touch the sensor in order to provide
the biometric data. This is advantageous in applica-
tions that require a high degree of hygiene such as
medical operating room access or where persons are
sensitive about touching a biometric sensing device.
Third, a high percentage of the population is able to
provide viable vascular images for use in biometric
identification, increasing ▶ usability by providing an
additional way to identify persons not able to provide
fingerprints or other biometric modal data. Fourth,
depending on the particular wavelength of (near-)
infrared light that is used, the image can capture only
the vein patterns containing oxygen-depleted blood.
This can be a good indication that the biometric image
is from a live person. Fifth, the complexity of the
vascular image can be controlled so that the underlying
amount of information contained in the image can be
quite high when compared to a fingerprint, allowing
one to reduce the false accept or false reject rates to low
levels. At the same time, the image information can be
compressed or it can be processed into a template to
reduce storage requirements.
Vascular biometric technologies are being used or
proposed for many applications. Some of these include
access control to secure areas, employee time-clock
tracking, Automatic Teller Machines (ATMs), secure
computer login, person identification, and as one of
several biometrics in multi-biometric systems. The
technology is not appropriate for certain other appli-
cations such as criminal forensics or surveillance.
Currently, however, little vascular biometric image
information is being exchanged between the equip-
ment and devices from different vendors. This is due
in part to the lack of standards relating to interopera-
bility of vascular biometric technology. In the general
area of biometrics interoperability, the International
Standard Organization (ISO) and the regional organi-
zations, such the INCITS M1 group in the US, define a
collection of standards relating to the various biomet-
ric modalities that include data interchange formats,
conformance testing of image and template inter-
change formats, performance testing and application
profiles. The most critical are the formats for infor-
mation exchange that would ensure interoperability
among the various vendors. Definition and standardi-
zation of the data structures for the interoperable use
of biometric data among organizations is addressed in
the ISO/IEC 19794 series [2], the multipart biometric data interchange format standard, which describes formats for capturing, exchanging, and
transferring different biometric data from personal
characteristics such as voice, or properties of parts of
the body like face, iris, fingerprint, hand geometry, or
vascular patterns.
To address this shortcoming in the vascular do-
main, the ISO has published a standard for a vascular
biometric image interchange format, entitled ISO/IEC 19794-9 (Biometric Data Interchange Formats – Part 9: Vascular Image Data) [3].
The main purpose of this standard is to define a
data record format for storing and transmitting vascu-
lar biometric images and certain of their attributes
for applications requiring the exchange of raw or
processed vascular biometric images. It is intended
for applications not severely limited by the amount of
storage required and is a compromise or a trade-off
between the resources required for data storage or
transmission and the potential for improved data
quality/accuracy. Basically, it enables various prepro-
cessing or matching algorithms to identify and verify
the type of vascular biometric image data transferred
from other image sources and to allow operations on
the data. The currently available vascular biometric
technologies that are commercialized and that may
utilize this standard for image exchange are technolo-
gies that use the back of the hand, the palm, and the
finger [4–6]. There is the ability to extend the standard
to accommodate other portions of the body if the
appropriate technology is brought forward.
The use of standardized source images can provide
interoperability among and between vendors relying
on various different recognition or verification algo-
rithms. Moreover, the format standard will offer the
developer more freedom in choosing or combining
matching algorithm technology. This also helps appli-
cation developers focus on their application domain
without concern about variations in how the vascular
biometric data was processed in the vascular biometric
modalities.
Introduction to ISO/IEC 19794-9 Vascular Image Data Format Standard
ISO published the ISO/IEC 19794-9 Vascular Image
Data Format Standard in 2007, as a part of the ISO/
IEC 19794 series. The ISO/IEC 19794-9 vascular image
data format standard specifies an image interchange
format for biometric person identification or verifica-
tion technologies that utilize human vascular biometric
images and may be used for the exchange and compari-
son of vascular image data [7]. It specifies a data record
format for storing, recording, and transmitting vascu-
lar biometric information from one or more areas
of the human body. It defines the contents, format,
and units of measurement for the image exchange. The
format consists of mandatory and optional items, in-
cluding scanning parameters, compressed or uncom-
pressed image specifications, and vendor-specific
information.
The ISO/IEC 19794-9 vascular image data format
standard describes the data interchange format for
three different vascular biometric technologies utiliz-
ing different parts of the hand including back-of-hand,
finger, and palm. The standard also supports room for
extension to other vascular biometrics on other parts
of the human body, if needed. Figure 1 shows an
example of vascular biometric areas on different parts of the hand that are specified in ISO/IEC 19794-9.

Vascular Image Data Format, Standardization. Figure 1 Examples of vascular biometric areas on different parts of the hand [3].
The interchange format follows the standard data conventions of the 19794 series of standards: all multi-byte data shall be in big-endian format, transmitting the most significant byte first and the least significant byte last, and within a byte the order of transmission shall be the most significant bit first and the least significant bit last. All numeric values are treated as fixed-length unsigned integers.
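As an illustration of this convention, a minimal Python sketch (the specific fields packed here are hypothetical, chosen only to show the byte order):

```python
import struct

# Pack a hypothetical 16-bit image width and a 32-bit record length in
# big-endian byte order: most significant byte transmitted first, as the
# 19794 series data conventions require.
width, record_length = 640, 123456
encoded = struct.pack(">HI", width, record_length)
assert encoded[0] == (width >> 8) & 0xFF  # MSB comes first

# Decoding reverses the operation.
assert struct.unpack(">HI", encoded) == (width, record_length)
```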
The vascular pattern biometric technologies cur-
rently available employ images from the finger, back of
the hand, and palm side of the hand. The location used
for imaging is to be specified in the format. To further
specify the locations, the object (target body) coordi-
nate system for each vascular technology is defined.
Standard poses and object coordinate systems are also
defined. All the coordinate systems are right-handed
Euclidian coordinate systems. It is then possible to
optionally specify a rotation of the object from the
standard pose. In order to map the object coordinate
system to the image coordinate system without further
translation, an x- and y-axis origin for scanning can be
specified in the data.
The image is acquired by scanning a rectangular
region of interest of a human body from the upper left
corner to the lower right in raster scan order, that is, along the x-axis within each row, with rows proceeding from top to bottom in the y direction.
The vascular image data can be stored either in a raw or
compressed format. In a raw format, the image is
represented by a rectangular array of ▶ pixels with
specified numbers of columns and rows. Images can
also be stored using one of the specified lossless or lossy
compression methods, resulting in compressed image
data. The allowable compression methods include JPEG [8], JPEG 2000 [9], and JPEG-LS [10]. It is
recommended that the compression ratio be less than
a factor of 4:1 in order to maintain a quality level
necessary for further processing.
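To make the raster-scan and compression conventions concrete, here is a small sketch; the image dimensions are illustrative, and the 4:1 check simply encodes the recommendation quoted above:

```python
def raster_index(x, y, width):
    # Raster scan order: along the x-axis within a row, rows proceeding
    # from top to bottom in the y direction (row-major storage).
    return y * width + x

width, height = 320, 240              # hypothetical image dimensions
raw = bytearray(width * height)       # one byte per pixel (8-bit gray scale)
raw[raster_index(5, 2, width)] = 128  # pixel at column 5, row 2

def compression_within_recommendation(raw_size, compressed_size):
    # The standard recommends a compression ratio of less than 4:1.
    return raw_size / compressed_size < 4.0
```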
Image capture requirements are dependent on var-
ious factors such as the type of application, the avail-
able amount of raw pixel information to be retained or
exchanged, and the targeted performance. Another
factor to consider as a requirement for vascular bio-
metric imaging is that the physical size of the target
body area where an application captures an image for
the extraction of vascular pattern data may vary sub-
stantially (unlike other biometric modalities).
The image capture requirements also define a set of
additional attributes for the capture devices such as
▶ gray scale depth, ▶ illumination source, horizontal
and vertical resolution (in pixels per cm), and the
aspect ratio. For most of the available vascular biomet-
ric technologies, the gray scale depth of the image
ranges up to 128 gray scale levels, but may, if required,
utilize two or more bytes per gray scale value instead of
one. The illumination sources used in a typical vascu-
lar biometric system are near-infrared wavelengths in
the range of approximately 700–1200 nm infrared light
sources. However, near-infrared, mid-infrared, and
visible light sources can be defined and more than one source may be employed.

Vascular Image Data Format, Standardization. Table 1 Vascular image biometric data block

Bytes 1–26 (data block header): header used by all vascular biometric image providers; carries information on the format version, capture device ID, number of vascular images contained in the data block, etc.
Bytes 27–58 (vascular image header): image header for the first image; contains all individual image-specific information.
Following bytes (unsigned char): image data for the first image.
…
Vascular image header: image header for the last image.
Unsigned char: image data for the last image.
Table 1 shows the basic structure of the vascular
image biometric data block. A single data block starts
with a vascular image record header, which contains
general information about the data block such as the
identification of the image capture device and the
format version. One or more vascular image blocks
follow the record header. Each image block consists
of an image header and raw or compressed image data.
The image header contains all the image specific infor-
mation such as the body location, rotation angle, and
imaging conditions. All images in a data block must
come from the same capture device. If multiple devices
are used, then multiple blocks must be used.
The vascular image record header consists of general
information on the vascular images contained in the
data block, such as the format version number, total
length of the record block, capture device identifica-
tion, and the number of images contained in the data
block. More specific information includes format iden-
tifier, version number, record length, capture device
ID, and number of images.
For each image in the data block, the vascular
image header describes individual image-specific in-
formation including image type, vascular image record
length, image width and height, gray scale depth,
image position, property bit field, and rotation angle.
Other information in the vascular image header may
include image format, illumination type, image back-
ground, horizontal scan resolution, vertical scan reso-
lution, pixel aspect ratio, and vascular image header
constants. The image data follows and is used to store
the biometric image information in the specific format
defined in the vascular image record header.
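A hedged sketch of reading the start of such a data block in Python; only the overall layout (a fixed-size record header followed by per-image headers and image data) is taken from the text, while the individual field widths below are illustrative assumptions rather than the standard's actual layout:

```python
import struct

def parse_record_header(block: bytes):
    # Hypothetical field widths: 4-byte format identifier, 4-byte version,
    # 4-byte total record length, 2-byte capture device ID, and a 1-byte
    # count of vascular images contained in the data block.
    fmt_id, version, rec_len, dev_id, n_images = struct.unpack_from(
        ">4s4sIHB", block, 0)
    return {"format_id": fmt_id, "version": version,
            "record_length": rec_len, "device_id": dev_id,
            "num_images": n_images}
```

Each vascular image header would be unpacked the same way (big-endian, fixed-length unsigned fields), followed by the raw or compressed image bytes it announces.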
Future Activities
There are considerable ongoing standardization activ-
ities relating to vascular biometrics, building upon the
biometric data interchange format for vascular images
standard. A companion document that specifies the
conformance testing for the data interchange format
is currently under development. The conformance
standard specifies how to check whether the data pro-
duced by a vascular imaging device does indeed agree
with the interchange format, as well as which items are
mandatory or optional. There are also ongoing efforts,
both internationally and in the U.S., to include the
vascular image formats into the various application
profiles (such as the INCITS M1 Profile for Point-
of-Sale Biometric Identification/Verification), which
define how to use vascular biometrics in the specific
context of an application. There are also efforts at
including vascular methods in multi-biometric fusion
schemes or as a biometric component of a smart-card
based solution. Eventually, it is expected that vascular
methods will become one of the important biometric
modalities, offering benefits not provided by the other
techniques in certain applications.
Summary
Vascular biometric technologies including vascular
images from the back-of-hand, finger, and palm are
being used as an integrated security solution in many
applications. The need for ease of exchanging and
transferring vascular biometric data from biometric
recognition devices and applications or between differ-
ent biometric modalities requires the definition of a
vascular biometrics data format standard. The devel-
opment of the vascular biometric data interchange
format standard also helps to ensure interoperability
among the various vendors. This paves the way for vascular biometric technologies to be adopted as a standard security technology that is easily integrated in a wide range of applications.
Related Entries
▶Back-of-hand Vein
▶ Finger Data Interchange Format Standardization
▶ Finger Vein
▶Palm Vein
▶Vein and Vascular Recognition
References
1. Choi, A.H., Tran, C.N.: Handbook of Biometrics: Hand Vascular
Pattern Recognition Technology. Springer, New York (2007)
2. ISO/IEC 19794-1 Information Technology: Biometric Data
Interchange Format – Part 1: Framework/reference model
3. ISO/IEC 19794-9 Information Technology: Biometric Data
Interchange Format – Part 9: Vascular image data
4. Im, S.K., Park, H.M., Kim, Y.W., Han, S.C., Kim, S.W., Kang, C.H.: Biometric identification system by extracting hand vein patterns. J. Korean Phys. Soc. 38(3), 268–272 (2001)
5. Miura, N., Nagasaka, A., Miyatake, T.: Feature extraction of finger-vein patterns based on repeated line tracking and its application to personal identification. Mach. Vis. Appl. 15, 194–203 (2004)
6. Watanabe, M., Endoh, T., Shiohara, M., Sasaki, S.: Palm vein
authentication technology and its applications. In: Proceedings of
Biometric Consortium Conference, VA, USA, September 2005
7. Volner, R., Bores, P.: Multi-Biometric techniques, standards
activities and experimenting. In: Baltic Electronics Conference,
pp. 1–4. Tallinn, Estonia (2006)
8. ISO/IEC 10918 (all parts) Information Technology: Digital
Compression and Coding of Continuous Tone Still Images
9. ISO/IEC 15444 (all parts) Information Technology: JPEG 2000
Image Coding System
10. ISO/IEC 14495 (all parts) Information Technology: Lossless and
Near-Lossless Compression of Continuous Tone Still Images
Vascular Network Pattern
The network pattern composed of blood vessels.
Human blood vessels develop network structures in
each level of artery, arteriole, capillary, venule, and
vein. The network of major blood vessels can be seen
in funduscopy and in visual observation of body sur-
face. The vascular networks in fundus image are those
of retinal arteries and retinal veins. The blood vessels
observed on body surface are the cutaneous veins.
Both network patterns can be used in biometric
authentication. There is no apparent evidence on the uniqueness and the permanence of the vascular network pattern. However, in practice, the vascular pattern has been used for biometric authentication without serious problems. Since the retinal pattern is kept inside the eye, it is stable and seldom affected by changes in the outer environment. It is not easily observable by others and is robust against theft and forgery. The retinal pattern is complex, and high iden-
tification accuracy can be expected. The authentication
using this retinal pattern has been used in the institu-
tions that require high level of security.
The vascular network pattern in a hand and in a
finger can be visualized by transillumination imaging
or reflection-type imaging using near-infrared light.
The authentication with vascular pattern of a hand
and a finger is safer and more convenient than that
with retinal pattern. It has been used in common
security applications such as the authentication in
ATM and in access management.
▶Performance Evaluation, Overview
Vascular Recognition
▶Retina Recognition
Vector Quantization
Vector quantization (VQ) is a process of mapping
vectors from a large vector space to a finite number of
regions in that space (Linde, Y., Buzo, A., Gray, R.: An
algorithm for vector quantizer design. IEEE Trans. Comm. 28, 84–95 (1980)). Each region is called
a cluster and can be represented by its center called a
codeword. The collection of all codewords is called
a codebook. During the training phase, a speaker-
specific VQ codebook is generated for each known
speaker by clustering the corresponding training acous-
tic vectors. The distance from a vector to the closest
codeword of a codebook is called a VQ-distortion.
During the recognition phase, an input utterance of
an unknown voice is vector-quantized using each
trained codebook, outputting a VQ distortion for
each codebook, i.e., for each client speaker. The speaker
corresponding to the VQ codebook with the smallest
distortion is identified. Both for the training and test-
ing phases, the VQ process works independently on
each input frame and produces an averaged result (a
codebook or VQ distortion). Thus, there is no need to
perform a time alignment. The lack of time warping
greatly simplifies the system; however, it neglects
speaker-dependent temporal information that may be
present in prompted phrases.
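A minimal numpy sketch of the process described above; the codebook size, iteration count, and data shapes are illustrative, and a production system would use a dedicated LBG/k-means implementation:

```python
import numpy as np

def train_codebook(vectors, k=8, iters=20, seed=0):
    # Plain k-means clustering: each cluster center is a codeword.
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each training vector to its nearest codeword ...
        d = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each codeword to the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = vectors[labels == j].mean(axis=0)
    return codebook

def vq_distortion(vectors, codebook):
    # Average distance from each vector to its closest codeword.
    d = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
    return d.min(axis=1).mean()
```

Identification then amounts to quantizing the unknown utterance with every client codebook and picking the speaker whose codebook yields the smallest average distortion.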
▶ Speaker Matching
Vein
Veins are the blood vessels that carry blood to the
heart. In the cardiovascular system, blood vessels con-
sist of arteries, capillaries, and veins. Veins collect
blood from capillaries and carry it toward the heart.
In most of the veins, the blood is deoxygenated. The
pulmonary vein is one of the exceptions that carry
oxygenated blood. The walls of veins are relatively
thinner and less elastic than those of arteries. Some
veins have one-way flaps called venous valves that
prevent blood from flowing back. The valves are
found in the veins that carry blood against the force
of gravity, especially in the veins of the lower
extremities.
The vein in the subcutaneous tissue is called a
cutaneous vein. Some of the cutaneous veins can be
observed on the body surface with the naked eye. With
the light of high transmission through body tissue such
as near-infrared light, we can obtain a clear image of
the cutaneous vein. Since the pattern of venous
network is largely different between individuals, the
images can be used for authentication. The biometric
authentication using the venous network patterns in a
palm and a finger is common.
▶Palm Vein Image Sensor
▶Performance Evaluation, Overview
Vein Biometrics
▶Vascular Image Data Format, Standardization
Vein Recognition
▶Retina Recognition
Velocity (Speed)
Velocity of pen movement during the signing process.
Velocity features seem to be one of the most useful
features of on-line signatures. Generally, velocity is com-
puted from the first-order derivative of the pen position
signal with respect to time. The easiest way to compute
the velocity is to calculate the distance between two
consecutive pen-tip positions if the data is acquired at
equidistant sample points. Velocity features are repre-
sented in two ways: velocities along the x-axis and y-axis
or velocity along the pen movement direction (tangen-
tial direction). In the latter case, the direction of pen
movement is also considered as a separate feature.
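A small numpy sketch of these computations for equidistantly sampled pen-tip positions (the coordinate arrays are illustrative):

```python
import numpy as np

# x, y: pen-tip coordinates sampled at a fixed rate (equidistant in time).
x = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([0.0, 0.5, 0.5, 1.5])

vx = np.diff(x)                 # velocity along the x-axis
vy = np.diff(y)                 # velocity along the y-axis
speed = np.hypot(vx, vy)        # velocity along the pen-movement (tangential) direction
direction = np.arctan2(vy, vx)  # pen-movement direction, usable as a separate feature
```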
▶ Signature Recognition
Verification
Biometric verification is a process that confirms or denies a claim about the similarity of biometric reference(s) and recognition biometric sample(s) by making biometric comparison(s).
▶Verification/Identification/Authentication/Recogni-
tion: The Terminology
Vetting
▶Background Checks
Video Camera
▶ Face Device
Video Surveillance
▶Human Detection and Tracking
Video-based Face Recognition
▶ Face Recognition, Video-based
Video-based Motion Capture
▶Markerless 3D Human Motion Capture from Images
Visible Spectrum
Synonyms
Optical spectrum; Visible light
Definition
The portion of the electromagnetic spectrum that is
visible (detected) by the human eye. The wavelengths
for this spectrum are 380 to 750 nm, which are the wavelengths seen (detected) by the human eye in air.
▶ Iris Databases
Visual Memory
Visual memory is the perceptual ability that allows
visual images to remain in memory after they are no
longer visible. It supports the matching process be-
tween two fingerprints when eye movements are
required.
▶ Latent Fingerprint Experts
Visual Sensor
▶ Face Device
Visual-dynamic Speaker Recognition
▶ Lip Movement Recognition
Vitality
▶ Liveness Detection: Fingerprint
▶ Liveness Detection: Iris
Viterbi Algorithm
The Viterbi algorithm is the conventional, recursive,
efficient way to decode a Hidden Markov Model that is
to find the optimal state sequence, given the observa-
tion sequence and the model. It provides information
about the hidden process and is a good and efficient approximation of the evaluation problem.
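As an illustration, a minimal Viterbi decoder for a discrete HMM in log domain (the model matrices are hypothetical inputs):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    # pi: initial state probabilities, A: state transition matrix,
    # B: emission matrix. Returns the optimal state sequence for obs.
    delta = np.log(pi) + np.log(B[:, obs[0]])
    backpointers = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(A)   # score of every transition
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(B[:, o])
    path = [int(delta.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```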
▶Hidden Markov Models
VOCs (Volatile Organic Compounds)
Organic chemicals that have a high vapor pressure
resulting in a relatively high abundance in the head-
space of samples.
▶Odor Biometrics
Voice Authentication
Voice authentication is also known as speaker au-
thentication, speaker verification, and one-to-one
speaker recognition. For example, for a client – a
bank customer – to be authenticated, the client must
first go through an enrollment procedure, also known
as training. During enrollment, the client provides a
number of voice samples to the system, which in turn
are used to build a voice model for the client. When
requesting a voice authentication, a client must first
announce his or her identity. This may be done verbally, by saying a name, user ID, or account number, or it may be done by presenting an identifying
token such as a staff card or bank card. Then the
authentication takes place when the person speaks a
set phrase or a requested phrase or simply engages in
a dialogue with the authentication system. If the
voice sample matches the stored model or template
of the claimed identity, the client is authenticated.
If an impostor tries to be authenticated as a particular
client, the impostor’s voice will not match the client
model and the impostor will be rejected. The authen-
tication paradigm only compares a speech sample
with a single client model, namely the model of
the claimed identity. Hence, it is sometimes known as
one-to-one speaker recognition. In contrast, speaker
identification compares a speech sample with every
possible client model, to find the closest match.
Hence this paradigm is also known as one-to-many
speaker recognition.
▶ Liveness Assurance in Voice Authentication
▶ Speaker Recognition Standardization
Voice Biometric
▶ Speaker Recognition, Overview
Voice Biometric Engine
▶ Speaker Matching
Voice Device
DOROTEO T. TOLEDANO,
JOAQUIN GONZALEZ-RODRIGUEZ,
JAVIER ORTEGA-GARCIA
ATVS – Biometric Recognition Group, Escuela Politecnica Superior, Universidad Autonoma de Madrid, Spain
Synonyms
Microphone; Speech input device
Definition
Voice device in the context of biometrics is frequently
used as a synonym for a simpler word: microphone.
A microphone [1] is a transducer that converts sound
(or equivalently, air pressure variations) into electrical
signals. There are many different types of microphones
that use different methods to achieve this transduction,
most of which will be reviewed in this article. Besides the method employed to perform the transduction, microphones are most frequently encapsulated, and the encapsulation makes it possible to build microphones with different directional characteristics, which allow, for instance, capturing the voice coming from one direction and rejecting (to a certain extent) the noises or voices coming from
other directions. Apart from the directionality, micro-
phones also have different frequency responses, sensitiv-
ities, and dynamic ranges. All these characteristics can
dramatically influence the performance of a speech bio-
metric system, and should therefore be taken into ac-
count in the design of such systems.
Microphones are the most commonly used speech
input devices, and for that reason they deserve most of
the space of this article. However, this article would be
incomplete without mentioning that microphones, at
least traditional microphones, are not the only speech
input device that can be used in speech biometrics.
For instance, microphones may be arranged to form
▶microphone arrays. There also exist special micro-
phones called ▶ contact microphones that transduce
vibrations in solid bodies into electrical signals. Finally,
there is also the possibility of combining the acoustic
evidence and the visual evidence of speech by record-
ing the audio and also the movement of the lips in
what is commonly referred to as audio-visual speech
processing. Definitional entries at the end of this article are devoted to these special speech input devices.
The first step in any voice biometric (or automatic
speaker recognition) system is to capture the voice of
the speaker, and speech input devices are used for this
purpose.
Introduction
The human hearing sense is extremely robust against
noise and small distortions in speech, and humans
are very good at recognizing people based on their
voices, even under strong distortion and noise. Most
speech input devices are designed with the goal of
capturing speech or music, translating it into electrical
signals, transmitting or storing it and, finally, reprodu-
cing that speech or music (by means of the opposite
transducer, a loudspeaker). The important point here is
that microphones are designed to be used in a chain, at the end of which, most of the time, is the human ear. Having such
a robust receptor at the end of the chain makes it
unnecessary to be very careful in the design or selection
of a speech input device.
In recent years, however, there has been a
fundamental change in speech communication since
the receiver in the speech communication chain is
not always a human listener any more. Nowadays
machines are used for transcribing speech signals (in
automatic speech recognition) and also, and most im-
portantly in this context, for recognizing the speaker
given a segment of speech (in voice biometrics or auto-
matic speaker recognition). This fundamental change
has brought an uncomfortable reality for all speech
researchers: machines are still far less robust than
humans at processing speech.
Of course, the goal of speech researchers is to make machines not just as robust as humans but even more robust.
Currently, voice biometric systems achieve very good
results in relatively controlled conditions, such as in
telephone conversations with similar durations. This
has been the basic setup for the yearly competitive
Speaker Recognition Evaluations (SRE) organized
by the National Institute of Standards and Technology
(NIST) [2] for the last years. These evaluations
show that currently, technology is capable of achieving
very competitive results in these conditions and is
becoming more and more robust against variabilities.
However, the problem of variability due to the speech
input device is far from being solved. Actually, this is a
very hot research and technological topic. The proof of
it is that upcoming NIST evaluations in voice biometrics will probably be centered on cross-channel conditions in which training and testing data come from different channels, including different microphones, microphone locations (close-talking and far-field), and recording devices. However, achieving robustness against
such variations is a long-term research goal that most
probably will not be fulfilled in the next few years.
In the meantime, it should be stressed that technol-
ogy is already usable in practical situations, but it should
also be highlighted that current technology may not be
as robust as desirable. In these circumstances it is essen-
tial to take extra care of the design or the selection of the
speech input device. In some cases, of course, the speech
input device is out of control, such as in telephone
applications. But there are other cases where it is neces-
sary to design the speech input device and, in these cases,
it is essential to make the right choice because there are
multiple choices of speech input devices with very dif-
ferent features, and an appropriate selection of the
speech input device could be the key to success or failure
in a voice biometrics application. This section tries to
provide an introduction to the world of speech input
devices or microphones.
Microphones
Definition
A microphone is a transducer that converts sounds (air
pressure variations) into variations of an electrical
magnitude, typically voltage.
History
The early history of the microphone is tied to the
development of the telephone [3]. In fact, the micro-
phone was the last element required for a telephonic
conversation to be developed. One of the earliest
versions of microphones was developed by German
researcher Philipp Reis in 1861. These microphones
were just a platinum piece associated with a membrane
that opened and closed an electric circuit as the sound
made the membrane vibrate. This allowed Reis to build
primitive prototypes that could transmit voice and music over several hundred meters. It was several
years later, in 1876, when Alexander Graham Bell pat-
ented the telephone and transmitted what is consid-
ered the first telephone conversation ‘‘Mr. Watson,
come here, I want you.’’ Bell improved the microphones
to make them better and better suited for commercial
applications. Among the earlier microphones devel-
oped by Bell there are liquid microphones in which a
diaphragm moved a metallic needle inside a metal
recipient filled with a solution of water and sulfuric
acid, so that the resistance between the needle and the
recipient varied with the movement of the diaphragm.
The latter microphones developed by Bell were based
on the variations of inductance in a moving coil at-
tached to a diaphragm. However, it was not until 1878
that the word microphone was used for the first time,
and it was associated with what is known today as the
carbon microphone. The carbon microphone was
invented by Edison and Hughes, and constituted a
real breakthrough for telephone systems, since they
were more efficient and robust than the earlier devices.
Currently it has mostly been replaced by more mod-
ern microphones that will be described in the following
sections.
Types
All microphones are based on the transduction of air
pressure variations into an electromagnetic magnitude.
However, there are many ways to achieve this, and
therefore there are many types of microphones with
different characteristics and applications. In this article
some of the most important types will be summarized.
• Condenser or capacitance microphones. These
microphones are based on the following physical
principle (Fig. 1): the capacitance of a condenser
with two metallic plates depends on the distance
between the two plates. If one metallic plate of a
capacitor is substituted by a metallic membrane
that vibrates with sound, the capacitance of the
condenser varies with sound, and this variation
can be translated into the variation of an electrical
magnitude. There are two ways of doing this trans-
formation. The most common one is trying to set a
constant charge in the two plates and measuring
the variations of the voltage between the two plates.
Voice Device. Figure 1 Principle of functioning of a condenser microphone.
The other one (slightly more complex) is using the
variations in the capacitance to modulate the fre-
quency of an oscillator. This generates a frequency
modulated signal that needs to be demodulated,
but the demodulated signal has usually less noise
and can more effectively reproduce low frequency
signals than the one obtained with the constant
charge method. A special type of condenser micro-
phone is the electret microphone. This microphone
is a capacitor microphone in which the charge in
the plates is maintained not by applying an external
constant voltage to the capacitor, but by using a
ferroelectric material that keeps a constant charge,
in a similar way as a magnet generates a constant
magnetic field. Condenser microphones are the
most frequently used microphones nowadays, and
it is possible to find them from low-quality cheap
versions to high-quality expensive microphones.
• Dynamic or induction microphones. These micro-
phones are based on a different physical principle:
when an induction moves inside a magnetic field, it
generates a voltage by electromagnetic induction. If a
small coil is attached to a diaphragm thatmoves with
sounds and if this coil is placed into a magnetic field
(generated by a permanent magnet), the movement
of the coil will produce a voltage in its extremes that
is related to the sound. A special type of induction
microphone is ribbon microphones in which the coil
is replaced by a metallic ribbon that vibrates with sound as it is suspended in a constant magnetic
field, thus generating a current related to the
sound. These microphones are more sensitive
than coil microphones, but also are more fragile.
• Carbon microphones. This microphone is essentially
a recipient filled with carbon powder and closed by a
metallic membrane on one side and a metallic plate
on the other. As the membrane vibrates with the
sound the powder supports more or less pressure
and its electrical resistance is higher or lower (with
more pressure carbon particles increase their surface
in contact with other particles and this makes elec-
trical resistance decrease). Carbon microphones
were widely used in telephones. Currently they have
been replaced by condenser microphones.
• Piezo-electric microphones. These microphones are
based on yet another physical effect: some solid
materials, called piezo-electric materials, have the
property of producing a voltage when a pressure
is applied to them. Using this property and a piezo-
electric material, a microphone can be built by just
placing two electrical contacts on the piezo-electric
material. Piezo-electric microphones are mainly
used in musical instruments (such as electric gui-
tars) to collect and amplify the sound.
• Silicon microphones. Silicon (or chip) microphones
are not based on a new physical effect. Rather, they are
just capacitor microphones built on a silicon chip in
which the membrane is directly attached to the chip.
These microphones can be very small and are usually
associated with electronic circuitry such as a pream-
plifier and an analog-to-digital converter (ADC), so
that a single chip can produce digital audio.
Directional Characteristics
Microphones have different characteristics depending
on the direction of arrival of the sound with respect to
the microphone. A microphone’s directionality pattern
measures its sensitivity to a particular direction.
1380V Voice Device
Microphones may be classified by their directional
properties as omnidirectional (or non-directional) and
directional [4]. The latter can also be subdivided into
bidirectional and unidirectional, based on their direc-
tionality patterns. Directionality patterns are usually
specified in terms of the polar pattern of the microphone (Fig. 2).

Voice Device. Figure 2 Typical polar patterns for omnidirectional, bidirectional, and unidirectional (or cardioid) microphones.
• Omnidirectional microphones. An omnidirectional
(or nondirectional) microphone is a microphone
whose response is independent of the direction of
arrival of the sound wave. Sounds coming from
different directions are picked equally. If a micro-
phone is built only to respond to the pressure, then
the resultant microphone is an omnidirectional
microphone. These types of microphones are the simplest and least expensive and have the advantage of a very flat frequency response. However, the
property of capturing sounds coming from every
direction with the same sensitivity is very often
undesirable, since it is usually preferable to capture the sounds coming from the front of the microphone but not from behind or from the sides.
• Bidirectional microphones. If a microphone is built
to respond to the gradient of the pressure in a
particular direction, rather than to the pressure
itself, a bidirectional microphone is obtained.
This is achieved by letting the sound wave reach
the diaphragm not only from the front of the
microphone but also from the rear, so that if a
wave comes from a perpendicular direction the
effects on the front and the rear are canceled. This type of microphone reaches its maximum
sensitivity at the front and the rear, and reach
their minimum sensitivity at the perpendicular
directions. This directionality pattern is particularly useful for reducing noise from the sides of the
microphone. For this reason sometimes it is said
that these microphones are noise-canceling micro-
phones. Among the disadvantages of this kind of microphone, it must be mentioned that its frequency response is not nearly as flat as that of an omnidirectional microphone, and it also varies
with the direction of arrival. The frequency re-
sponse is also different with the distance from the
sound source to the microphone. Particularly, for
sounds generated close to the microphone (near
field) the response for low frequencies is higher
than for sounds generated far from the microphone
(far field). This is known as the proximity effect. For
that reason frequency responses are given usually
for far-field and near-field conditions, particularly
for close-talking microphones. This type of microphone is more sensitive to the noise produced by wind and by the air bursts induced by the pronunciation of plosive sounds (such as /p/) in close-talking use.
• Unidirectional Microphones. These microphones
have maximum response to sounds coming from
the front of the microphone, have nearly zero re-
sponse to sounds coming from the rear of the
microphone and small response to sounds coming
from the sides of the microphone. Unidirectional-
ity is achieved by building a microphone that
responds to the gradient of the sounds, similar to
a bidirectional microphone. The null response
from the rear is attained by introducing a material
to slow down the acoustic waves coming from the
rear so that when the wave comes from the rear it
takes equal time to reach the rear part and the front
part of the diaphragm, and therefore both cancel
out. The polar pattern of these microphones usually has the shape of a heart, and for that reason they are sometimes called cardioid microphones. These
microphones have good noise-cancelation proper-
ties, and for these reasons, are very well suited for
capturing clean audio input.
Microphone Location
Some microphones have different frequency response
when the sound source is close to the microphone
(near field, or close-talking) and when the sound
source is far from the microphone (far field). In fact,
not only the frequency response, but also the problems
to the voice biometric application and the selection of
the microphone could be different. For this reason a
few concepts about microphone location will be
reviewed.
• Close-talking or near-field microphones. These microphones are located close to the mouth of the speaker, usually pointing at the mouth of the speaker. This kind of microphone can benefit from
the directionality pattern to capture mainly the
sounds produced by the speaker, but could also be
very sensitive to the winds produced by the speaker,
if placed just in front of the mouth. The character-
istics of the sound captured may be very different
if the microphone is placed at different relative
positions from the mouth, which is sometimes a
problem for voice biometrics applications.
• Far-field microphones. These microphones are located at some distance from the speaker. They have the disadvantage that they tend to capture more noise than close-talking microphones because they sometimes cannot take advantage of directionality
patterns. This is particularly true if the speaker can
move around as she speaks. In general, far-field
microphone speech is considered to be far more
difficult to process than close-talking speech. In
some circumstances it is possible to take advantage
of microphone arrays to locate the speaker spatially
and to focus the array to listen specifically to the speaker.
Specifications
There is an international standard for microphone
specifications [5], but few manufacturers follow it ex-
actly. Among the most common specifications of a
microphone the following must be mentioned.
• Sensitivity. The sensitivity measures the efficiency of the transduction (i.e., how much voltage the microphone generates for an input acoustic pressure). It is measured in millivolts per pascal at 1 kHz.
• Frequency Response. The frequency response is a
measure of the variation of the sensitivity of a
microphone as a function of the frequency of the
signal. It is usually represented in decibels (dB)
over a range of frequency typically between 0 and
20 kHz. The frequency response is dependent on
the direction of arrival of the sound and the dis-
tance from the sound source. The frequency res-
ponse is typically measured for sound sources very
far from the microphone and with the sound
reaching the microphone from its front direction.
For close talking microphones it is also typical to
represent the frequency response for sources close
to the microphone to take into account the pro-
ximity effect.
• Directional Characteristics. The directionality of a
microphone is the variation of its sensitivity as a
function of the sound arrival direction, and is usu-
ally specified in the form of a directionality pattern,
as explained earlier.
• Non-Linearity. Ideally, a microphone should be a linear transducer, and therefore a pure audio tone should produce a single pure voltage sinusoid at the same frequency. As microphones are not exactly linear, a pure acoustic tone produces a voltage sinusoid at the same frequency but also some harmonics. The most widespread nonlinearity measure is the total harmonic distortion (THD), which is the ratio between the power of the harmonics produced and the power of the voltage sinusoid produced at the input frequency (see the sketch after this list).
• Limiting Characteristics. These characteristics indicate the maximum sound pressure level (SPL) that can be transduced with limited distortion by the microphone. There are two different measures, the maximum peak SPL for a maximum THD, and the overload, clipping, or saturation level. The latter indicates the SPL that produces the maximum displacement of the diaphragm of the microphone.
• Inherent Noise. A microphone, in the absence of
sound, produces a voltage level due to the inherent
noise produced by itself. This noise is measured as
the input SPL that would produce the same output
voltage, which is termed the equivalent SPL due to
inherent noise. This parameter determines the min-
imum SPL that can be effectively transduced by the
microphone.
• Dynamic Range. The former parameters together define the dynamic range of the microphone (i.e., the minimum and maximum SPL that can be effectively transduced).
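As a small numeric illustration of the THD measure defined in the Non-Linearity item above, the following sketch estimates THD from a recorded microphone output for a pure input tone (the windowing choice and the number of harmonics considered are arbitrary):

```python
import numpy as np

def thd(signal, fs, f0, n_harmonics=5):
    # Ratio between the power found at the harmonics (2*f0, 3*f0, ...)
    # and the power at the fundamental input frequency f0.
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power_at = lambda f: spectrum[np.argmin(np.abs(freqs - f))]
    harmonic_power = sum(power_at(k * f0) for k in range(2, n_harmonics + 2))
    return harmonic_power / power_at(f0)
```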
Summary
Speech input devices are the first element in a voice
biometric system and are sometimes not given the
attention they deserve in the design of voice biometric
applications. This section has presented some of the
variables to take into account in the selection or design
of a microphone for a voice biometric application. The
right selection, design, and even placement of a micro-
phone could be crucial for the success of a voice bio-
metric system.
Related Entries
▶Biometric Sample Acquisition
▶ Sample Acquisition (System Design)
▶ Sensors
References
1. Eargle, J.: The Microphone Book, 2nd edn. Focal, Elsevier, Bur-
lington, MA (2005)
2. National Institute of Standards and Technology (NIST): NIST
Speaker Recognition Evaluation. http://www.nist.gov/speech/
tests/spk/
3. Flichy, P.: Une Histoire de la Communication Moderne. La
Decouverte (1997)
4. Huang, X., Acero, A., Hon, H.W.: Spoken Language Processing.
Prentice-Hall PTR, New Jersey (2001)
5. International Electrotechnical Commission: International Stan-
dard IEC 60268-4: Sound systems equipment, Part 4: Micro-
phones. Geneva, Switzerland (2004)
Voice Evidence
The forensic evidence of voice consists of the quanti-
fied degree of similarity between the speaker depen-
dent features extracted from the questioned recording
(trace) and the same extracted from recorded speech of
a suspect, represented by his or her model.
▶Voice, Forensic Evidence of
Voice Recognition
▶ Speaker Recognition, Overview
▶ Speaker Recognition Standardization
Voice Sample Synthesis
JUERGEN SCHROETER
AT&T Labs – Research, Florham Park, NJ, US
Synonyms
Speech synthesis; Synthetic voice creation; Text-to-
speech (TTS)
Definition
Over the last decade, speech synthesis, the technology
that enables machines to talk to humans, has become
so natural-sounding that a naïve listener might assume
that he/she is listening to a recording of a live human
speaker. Speech synthesis is not new; indeed, it took
several decades to arrive where it is today. Originally
starting from the idea of using physics-based models of
the vocal-tract, it took many years of research to per-
fect the encapsulation of the acoustic properties of the
vocal-tract as a ‘‘black box’’, using so-called formant
synthesizers. Then, with the help of ever more
powerful computing technology, it became viable to
use snippets of recorded speech directly and glue them
together to create new sentences in the form of con-
catenative synthesizers. Combining this idea with now
available methods for fast search, potentially millions
of choices are evaluated to find the optimal sequence of
speech snippets to render a given new sentence. It is the
latter technology that is now prevalent in the highest
quality speech synthesis systems. This essay gives a
brief overview of the technology behind this progress
and then focuses on the processes used in creating
voice inventories for it, starting with recordings of a
carefully-selected donor voice. The fear of abusing the
technology is addressed by disclosing all important
steps towards creating a high-quality synthetic voice.
It is also made clear that even the best synthetic voices
today still trip up often enough so as not to fool the
critical listener.
Introduction
Speech synthesis is the technology that gives compu-
ters the ability to communicate to the users by voice.
When driven by text input, speech synthesis is part of
the more elaborate ▶ text-to-speech (TTS) synthesis,
which also includes text processing (expanding abbre-
viations, for example), letter-to-sound transformation
(rules, pronunciation dictionaries, etc.), and stress and
pitch assignment [1]. Speech synthesis is often viewed
as encompassing the signal-processing ‘‘backend’’ of
text-to-speech synthesis viewed as encompassing the
signal-processing ‘‘backend’’ of text-to-speech synthe-
sis (with text and linguistic processing being carried
out in the ‘‘front-end’’). As such, speech synthesis takes
phoneme-based information in context and transforms
it into audible speech. Context information is very
important because, in naturally-produced speech, no
single speech sound stands by itself but is always highly
influenced by what sounds came before, and what
sounds will follow immediately after. It is precisely
this context information that is key to achieving
high-quality speech output.
A high-quality TTS system can be used for many
applications, from telecommunications to personal
use. In the telecom area, TTS is the only practical way
to provide highly flexible speech output to the caller
of an automated speech-enabled service. Examples of
such services include reading back name and address
information, and providing news or email reading. In
the personal use area, the author has witnessed the
ingenious ‘‘high jacking’’ of AT&T’s web-based TTS
demonstration by a young student to fake his mother’s
voice in a telephone call to his school: "Timmy will be out sick today. He cannot make it to school." It seems
obvious that natural-sounding, high quality speech syn-
thesis is vital for both kinds of applications. In the
telecom area, the provider of an automated voice service
might lose customers if the synthetic voice is unintelli-
gible or sounds unnatural. If the young student wants to
get an excused day off, creating a believable "real-sounding" voice seems essential. It is mostly concerns about
the latter kind of potential abuse that motivates this
author to write this essay. In the event that the even
stricter requirement is added of making the synthetic
voice indistinguishable from the voice of a specific per-
son, there is clearly a significantly more difficult chal-
lenge. Shortly after AT&T’s Natural Voices1 TTS
system became commercially available in August 2001,
an article in the New York Times’ Circuits section [2]
asked precisely whether people will be safe from seri-
ous criminal abuse of this technology. Therefore, the
purpose of this essay is to demystify the process of
creating such a voice, disclose what processes are
involved, and show current limitations of the technol-
ogy that make it somewhat unlikely that speech syn-
thesis could be criminally abused anytime soon.
This essay is organized as follows. The next section
briefly summarizes different speech synthesis methods,
followed by a somewhat deeper overview of the so-
called Unit Selection synthesis method that currently
delivers the highest quality speech output. The largest
section of this essay deals with creating voice databases
for unit selection synthesis. The essay concludes with
an outlook.
Overview of Voice Synthesis Methods
The voice (speech) synthesis method with the most
vision and potential, but also with somewhat unful-
filled promises, is articulatory synthesis. This method
employs mathematical models of the speech produc-
tion process in the human vocal tract, for example,
models of the mechanical vibrations of the vocal
chords (glottis) that interact with the fluid dynamics
of the laminar and turbulent airflow from the lungs to
the lips, plus linear or even nonlinear acoustical
1384V Voice Sample Synthesis
models of sound generation and propagation along the
vocal tract. A somewhat comprehensive review of this
method is given in [3]. Due to high computational
requirements and the need for highly accurate model-
ing, articulatory synthesis is mostly useful for research
in speech production. It usually delivers unacceptably
low-quality synthetic speech.
One level higher in abstraction, and much more
practical in its use, is formant synthesis. This method
captures the characteristics of the resonances of
the human vocal tract in terms of simple filters. The
single-peaked frequency characteristic of such a filter
element is called formant. Its frequency, bandwidth
(narrow to broad), and amplitude fully specify each
formant. For adult vocal tracts, four to five formants
are enough to determine their acoustic filter character-
istics. Phonetically most relevant are the lowest three
formants that span the vowel and sonorant space of a
speaker and a language. Together with a suitable wave-
form generator that approximates the glottal pulse,
formant synthesis systems, due to their highly versatile
control parameter sets, are very useful for speech per-
ception research. More on formant synthesis can be
found in [4]. For use as a speech synthesizer, the
computational requirements are relatively low, making
this method the preferred option for embedded appli-
cations, such as reading back names (e.g., ‘‘calling
Mom’’) in a dial-by-voice cellular phone handset. Its
storage requirements are miniscule (as little as 1 MB).
Formant synthesis delivers intelligible speech when
special care is given to consonants.
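As an illustration of the filter element described above, a minimal sketch of one formant modeled as a two-pole digital resonator (the sampling rate and the unity-DC-gain normalization are illustrative choices):

```python
import math

def formant_filter(x, f_hz, bw_hz, fs=16000):
    # Two-pole resonator: pole radius set by the formant bandwidth,
    # pole angle by the formant (center) frequency.
    r = math.exp(-math.pi * bw_hz / fs)
    theta = 2.0 * math.pi * f_hz / fs
    a1, a2 = 2.0 * r * math.cos(theta), -(r * r)
    b0 = 1.0 - a1 - a2                # unity gain at DC
    y1 = y2 = 0.0
    out = []
    for s in x:                       # y[n] = b0*x[n] + a1*y[n-1] + a2*y[n-2]
        y = b0 * s + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out
```

Cascading four or five such resonators, excited by a glottal pulse train, yields the classic formant synthesizer structure.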
In the 1970s, a new method started to compete
with the, by then, well-established formant synthesis
method. Due to its main feature of stitching together
recorded snippets of natural speech, it was called con-
catenative synthesis. Many different options exist for
selecting the specific kind of elementary speech units
to concatenate. Using words as such units, although
intuitive, is not a good choice given that there are many
tens of thousands of them in a language and that each
recorded word would have to fit into several different
contexts with its neighbors, creating the need to record
several versions of each word. Therefore, word-based
concatenation usually sounds very choppy and artifi-
cial. However, subword units, such as diphones or
demisyllables, turned out to be much more useful be-
cause of favorable statistics. For English, there is a
minimum of about 1500 ▶ diphones that would need
to be in the inventory of a diphone-based
concatenative synthesizer. The number is only slightly
higher for concatenating ▶ demisyllables. For both
kinds of units, however, elaborate methods are needed
to identify the best single (or few) instances of units to
store in the voice inventory, based on statistical mea-
sures of acoustic typicality and ease of concatenation,
with a minimum of audible glitches. In addition, at
synthesis time, elaborate speech signal processing is
needed to assure smooth transitions, deliver the de-
sired prosody, etc. For more details on this method, see
[5]. Concatenative synthesis, like formant synthesis,
delivers highly intelligible speech and usually has no
problem with transients like stop consonants, but usu-
ally lacks naturalness and thus cannot match the qual-
ity of direct human voice recordings. Its storage
requirements are moderate by today’s standards
(≈10–100 MB).
Unit Selection Synthesis
The effort and care given to creating the voice inventory
determine to a large extent the quality of any concatena-
tive synthesizer. For best results, most concatenative syn-
thesis researchers well up into the 1990s employed a
largely manual off-line process of trial and error that
relied on dedicated experts. A selected unit needed to fit
all possible contexts (or be made to fit by signal proces-
sing such as stretching or shrinking durations, pitch
scaling, etc.). However, morphing any given unit by
signal processing in the synthesizer at synthesis time
degrades voice quality. So, the idea was born to minimize
the use of signal processing by taking advantage of the
ever-increasing power of computers to handle ever-increasing data sets.
Instead of outright morphing a unit to make it fit, the
synthesizer may try to pick a suitable unit from a large
number of available candidates, optionally followed by
much more moderate signal processing. The objective
is to find automatically the optimal sequence of unit
instances at synthesis time, given a large inventory of
unit candidates and the available sentence to be synthe-
sized. This new objective turned the speech synthesis
problem into a rapid search problem [6].
The process of selecting the right units in the in-
ventory that instantiate a given input text, appropri-
ately called unit selection, is outlined in Fig. 1. Here,
the word ‘‘two’’ (or ‘‘to’’) is synthesized using
diphone candidates for silence into ‘‘t’’ (/#-t/), ‘‘t’’
into ‘‘uw’’ (/t-uw/), and ‘‘uw’’ into silence (/uw-#/).
Voice Sample Synthesis. Figure 1 Viterbi search to retrieve optimal diphone units for the word ‘‘two’’ or ‘‘to’’.
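The diphone decomposition illustrated here is mechanical enough to sketch in a few lines of Python (a minimal sketch; the ‘‘#’’ silence marker and the /l-r/ unit naming follow the figure):

def to_diphones(phonemes):
    """Split a phoneme sequence (with '#' as silence) into diphone unit names."""
    padded = ['#'] + phonemes + ['#']
    return ['/%s-%s/' % (l, r) for l, r in zip(padded, padded[1:])]

print(to_diphones(['t', 'uw']))   # ['/#-t/', '/t-uw/', '/uw-#/']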
Each time slot (column in Fig. 1) has several candi-
dates to choose from. Two different objective distance
measures are employed. First, transitions from one
unit to the next (depicted by arrows in the figure) are
evaluated by comparing the speech spectra at the end
of the left-side unit candidates to the speech spectra at
the beginning of the right-side unit candidates. These
are n*m comparisons, where n is the number of unit
candidates for the left column of candidates, and m is
the number of unit candidates in the right-side column
of candidates. Second, each node (circle) in the net-
work of choices depicted in Fig. 1 has an intrinsic
‘‘goodness of fit’’ measured by a so-called target cost.
The ideal target cost of a candidate unit measures the
acoustic distance of the unit against a hypothetical unit
cut from a perfect recording of the sentence to be
synthesized. However, since it is unlikely that the
exact sentence would be in the inventory, an algorithm
has to estimate the target cost using symbolic and
nonacoustic cost components such as the difference
between desired and given pitch, amplitude, and con-
text (i.e., left and right phone sequences).
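As a rough illustration of such an estimated target cost, the sketch below combines pitch, amplitude, and context mismatches in a weighted sum. The field names and weights are hypothetical choices for illustration, not values from the text.

from dataclasses import dataclass

@dataclass
class UnitSpec:
    pitch: float        # fundamental frequency (Hz)
    amplitude: float    # energy level
    left_phone: str     # left-context phone
    right_phone: str    # right-context phone

def target_cost(unit, spec, w_pitch=1.0, w_amp=0.5, w_ctx=2.0):
    """Weighted sum of symbolic/nonacoustic mismatch components between a
    candidate unit and the desired unit specification."""
    cost = w_pitch * abs(unit.pitch - spec.pitch)
    cost += w_amp * abs(unit.amplitude - spec.amplitude)
    cost += w_ctx * (unit.left_phone != spec.left_phone)
    cost += w_ctx * (unit.right_phone != spec.right_phone)
    return cost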
The objective of selecting the optimal unit sequence
for a given sentence is to minimize the total cost that is
accumulated by summing transitional and target costs
for a given path through the network from its left-side
beginning to its right-side end. The optimal path is
the one with the minimum total cost. This path
can be identified efficiently using the Viterbi search
algorithm [7].
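The following Python sketch shows one way such a Viterbi search over unit candidates could look. It is a minimal illustration rather than a production synthesizer: target_cost and join_cost are assumed to be supplied by the caller (the per-slot specification is taken to be already bound into target_cost).

def viterbi_unit_selection(candidates, target_cost, join_cost):
    """Pick one unit per time slot, minimizing summed target and join costs.

    candidates:  list of candidate lists, one list per diphone slot.
    target_cost: target_cost(u), the 'goodness of fit' of unit u for its slot.
    join_cost:   join_cost(u, v), spectral mismatch between the end of u
                 and the beginning of v.
    """
    # best[i][j] = cheapest total cost of any path ending in candidate j of slot i
    best = [[target_cost(u) for u in candidates[0]]]
    back = []
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for v in candidates[i]:
            costs = [best[-1][j] + join_cost(u, v)
                     for j, u in enumerate(candidates[i - 1])]  # the n*m comparisons
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_min] + target_cost(v))
            ptr.append(j_min)
        best.append(row)
        back.append(ptr)
    # trace back the minimum-total-cost path
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]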
More detailed information about unit selection syn-
thesis can be found in [1, 8]. The latter book chapter also
summarizes the latest use of automatic speech recogni-
tion (ASR) technology in unit selection synthesis.
Voice Creation
Creating a simple-minded unit selection synthesizer
would involve just two steps: First, record exactly the
sentences that a user wants the machine to speak; and
second, identify at ‘‘synthesis’’ time the input sentence
to be spoken, and then play it back. In practice, units
much shorter than sentences are used in order to create
previously unseen input sentences, so this simple-
minded paradigm would not work. However, when
employing a TTS front-end that converts any input
text into a sequence of unit specifications, intuition
may suggest actually playing back an inventory
sentence in its entirety on the off chance that the
corresponding text has been entered. Since the transla-
tion of text into unit-based tags and back into speech is
not perfect, the objective is unlikely to ever be fully
met. In practice, however, the following, somewhat
weaker objective holds: as long as the text to be synthe-
sized is similar enough to that of a corresponding
recording that actually exists in the inventory, a high
output voice quality can be expected. It is for this
reason that unit-selection synthesis is particularly well
suited for so-called limited domain synthesis, such as
weather reports, stock reports, or any automated tele-
com dialogue application (banking, medical, etc.)
where the application designer can afford the luxury
of recording a special inventory, using a carefully se-
lected voice talent. High quality synthesis for general
news or email reading is usually much more difficult to
achieve because of coverage issues [9].
Voice Sample Synthesis. Figure 2 Steps in unit selection voice inventory creation.
Because unit selection synthesis, to achieve its best
quality results, mimics a simple tape recorder playback,
it is obvious that its output voice quality largely
depends on what material is in its voice inventory.
Without major modifications/morphing at synthesis
time, the synthesizer output is confined to the quality,
speaking style, and emotional state of the voice that
was recorded from the voice talent/donor speaker. For
this reason, careful planning of the voice inventory is
required. For example, if the inventory contains only
speech recorded from a news anchor, the synthesizer
will always sound like a news anchor.
Several issues need to be addressed in planning a
voice inventory for a unit selection synthesizer. The
steps involved are outlined in Fig. 2, starting with text
preparation to cover the material selected. Since voice
recordings cannot be done faster than real time, they
are always a major effort in time and expense. To get
optimal results, a very strict quality assurance process
for the recordings is paramount. Furthermore, the
content of the material to be recorded needs to be
addressed. Limited domain synthesis covers typical
text for the given application domain, including greet-
ings, apologies, core transactions, and good-byes. For
more general use such as email and news reading,
potentially hundreds of hours of speech need to be
recorded. However, the base corpus for both kinds of
applications needs to maximize linguistic coverage
within a small size. Including a core corpus that was
optimized for traditional diphone synthesis might
satisfy this need. In addition, news material, sentences
that use the most common names in different prosodic
contexts, addresses, and greetings are useful. For limited
domain applications, domain-specific scripts need to
be created. Most of them require customer input
such as getting access to text for existing voice
prompts, call flows, etc. There is a significant danger
in underestimating this step in the planning phase.
Finally, note that a smart and frugal effort in designing
the proper text corpus to record helps to reduce the
amount of data to be recorded. This, in turn, will speed
up the rest of the voice building process.
Quality assurance starts with selecting the best pro-
fessional voice talent. Besides the obvious criteria of
voice preference, accent, pleasantness, and suitability
for the task (a British butler voice might not be appro-
priate for reading instant messages from a banking
application), the voice talent needs to be very consis-
tent in how she/he pronounces the same word over time
and in different contexts. Speech production issues
might come into play, such as breath noise, frequent
lip smacks, disfluencies, and other speech defects. A
clearly articulated and pleasant sounding voice and a
natural prosodic quality are important. The same is true
for consistency in speaking rate, level, and style. Surpris-
ingly, good sight reading skills are not very common
among potential voice talents. Speakers with heavy
vocal fry (glottal vibration irregularities) or strong
nasality should be avoided. Overall, a low ratio of
usable recordings to total recordings done in a test
run is a good criterion for rejecting a voice talent.
Pronunciations of rare words, such as foreign names,
need to be agreed upon beforehand and their realiza-
tions monitored carefully. Therefore, phonetic supervi-
sion has to be part of all recording sessions.
Next, the recording studio used for the recording
sessions should have almost ‘‘anechoic’’ acoustic char-
acteristics and a very low background noise in order to
avoid coloring or tainting the speech spectrum in any
way. Since early acoustic reflections off a nearby wall or
table are highly dependent on the time-varying geom-
etry relative to the speaker’s mouth and to the micro-
phone, the recording engineer needs to either make sure
that the speaker does not move at all (unrealistic) or
minimize these reflections. The recording engineer also
needs to make sure that sound levels, and trivial things
like the file format of the recordings are consistent and
on target. Finally, any recorded voice data needs to be
validated and inconsistencies between desired text and
actually spoken text reconciled (e.g., the speaker reads
‘‘vegetarian’’ where ‘‘veterinarian’’ was requested).
Automatic labeling of large speech corpora is a
crucial step because manual labeling by linguists is
slow (up to 500 times real time) and potentially incon-
sistent (different human labelers disagree). Therefore,
an automatic speech recognizer (ASR) is used in
so-called forced alignment mode for phonetic labeling.
Given the text of a sentence, the ASR identifies the
identities and the beginnings and ends of all ▶ pho-
nemes. ASR might employ several passes, starting from
speaker-independent models, and adapting these mod-
els to the given single speaker, and his/her speaking
style. Adapting the pronunciation dictionary to the
specific speaker’s individual pronunciations is vital to
get the correct phoneme sequence for each recorded
word. Pronunciation dictionaries used for phonetic
labeling should also be used in the synthesizer. In
addition, an automated prosodic labeler is useful for
identifying typical stress and pitch patterns, prominent
words, and phrase boundaries. Both kinds of automat-
ic labeling need to use paradigms and conventions
(such as phoneme sets and symbolic ▶ prosody tags)
that match those used in the TTS front-end at synthe-
sis time. A good set of automatic labeling and other
tools allowed the author’s group of researchers to
speed up their voice building process by more than
100 times over 6 years.
Once the recordings are done, the first step in the
voice building process is to build an index of which
sound (phoneme) is where, normalize the amplitudes,
and extract acoustic and segmental features, and
then build distance tables used to trade off (weigh)
the different cost components of unit selection discussed
in the previous section. One important part of the runtime synthesiz-
er, the so-called Unit Preselection (a step used to nar-
row down the potentially very large number of
candidates) can be sped up by looking at statistics
of triples of phonemes (i.e., so-called triphones) and
caching the results. Then, running a large independent
training text corpus through the synthesizer and
gathering statistics of unit use can be used to build a
so-called join cache that eliminates recomputing join
costs at runtime for a significant speedup. The final
assembly of the voice database may include reordering
of units for access efficiency plus packaging the voice
data and indices.
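A minimal sketch of the join-cache idea follows; the unit_id attribute and the compute_join_cost function are hypothetical names for illustration. Offline, the cache can be populated by running a large training text corpus through the synthesizer and recording which unit pairs actually occur, so their join costs need not be recomputed at runtime.

join_cache = {}   # (left_unit_id, right_unit_id) -> precomputed join cost

def cached_join_cost(u, v, compute_join_cost):
    """Return the join cost for concatenating u before v, caching on a miss."""
    key = (u.unit_id, v.unit_id)
    if key not in join_cache:
        join_cache[key] = compute_join_cost(u, v)
    return join_cache[key]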
Voice database validation consists of comprehen-
sive, iterative testing with the goal of identifying bad
units, either by automatic identification tools or by
many hours of careful listening and ‘‘detective’’ work
(where did this bad sound come from?), plus repair.
Allocating sufficient testing time before compute-
intensive parts of the voice building process (e.g.,
cache building) is a good idea. Also, setting realistic
expectations with the customer (buyer of the voice
database) is vital. For example, the author found that
the ‘‘damage’’ that the TTS-voice creation and synthe-
sis process introduces relative to a direct recording
seems to be somewhat independent of the voice talent.
Therefore, starting out with a ‘‘bad’’ voice talent will
only lead to a poorer sounding synthetic voice. Reduc-
ing the TTS damage over time is the subject of ongoing
research in synthesis-related algorithms employed in
voice synthesis.
The final step in unit selection voice creation is for-
mal customer acceptance and, potentially, ongoing
maintenance. Formal customer acceptance is needed
to avoid disagreements over expected and delivered qual-
ity, coverage, etc. Ongoing maintenance assures high
quality for slightly different applications or application
domains, including, for example, additional recordings.
Conclusion
This essay highlighted the steps involved in creating a
high-quality sample-based speech synthesizer. Special
focus was given to the process of voice inventory
creation.
From the details in this essay, it should be clear
that voice inventory creation is not trivial. It involves
many weeks of expert work and, most importantly, full
collaboration with the chosen voice talent. The idea of
(secretly) recording any person and creating a synthet-
ic voice that sounds just like her or him is simply
impossible, given the present state of the art. Collecting
the several hundred hours of recordings necessary to
have a good chance of creating such a voice
inventory is only practical when high-quality archived
recordings are already available that were recorded
under very consistent acoustic conditions. A possible
workable example would be an archive containing a
year or more of evening news read by a well-known
news anchor. Even then, however, one would need to
be concerned about voice consistency, since even slight
cold infections, as well as more gradual natural changes
over time (i.e., caused by aging of the speaker) can
make such recordings unusable.
An interesting extension to the sample synthesis of
(talking) faces was made in [10]. The resulting head-
and-shoulder videos of synthetic personal agents are
largely indistinguishable from video recordings of the
face talent. Again, similar potential abuse issues are a
concern.
One specific concern is that unit-selection voice
synthesis may ‘‘fool’’ automatic speaker verification
systems. Unlike a human listener’s ear that is able to
pick up the subtle flaws and repetitiveness of a
machine’s renderings of a human voice, today’s speaker
verification systems are not (yet) designed to pay at-
tention to small blurbs and glitches that are a clear
giveaway of a unit selection synthesizer’s output, but
this could change if it became a significant problem. If
this happens, perceptually undetectable watermarking
is an option to identify a voice (or talking face) sample
as ‘‘synthetic’’. Other procedural options include ask-
ing for a second rendition of the passphrase and
comparing the two versions. If they are too similar
(or even identical), reject the speaker identity claim
as bogus.
Related Entries
▶Hidden Markov Model (HMM)
▶ Speaker Databases and Evaluation
▶ Speaker Matching
▶ Speaker Recognition, Overview
▶ Speech Production
References
1. Schroeter, J.: Basic principles of speech synthesis, In: Benesty, J.
(ed.) Springer Handbook of Speech Processing and Communi-
cation, Chap. 19 (2008)
2. Bader, J.L.: Presidents as pitchmen, and posthumous play-by-
play, commentary in the New York Times, August 9 (2001)
3. van Santen, J., Sproat, R., Olive, J., Hirschberg, J., (eds.): Prog-
ress in speech synthesis, section III. Springer, NY (1997)
4. Holmes, J.N.: Research report formant synthesizers: cascade or
parallel? Speech Commun. 2(4), 251–273 (1983)
5. Sproat, R. (ed.): Multilingual text-to-speech synthesis: the Bell
Labs approach. Kluwer Academic Publishers, Dordrecht
(1998)
6. Hunt, A., Black, A.W.: Unit selection in a concatenative speech
synthesis system using a large speech database. In: Proceedings
of the ICASSP-96, pp. 373–376, GA, USA (1996)
7. Forney, G.D.: The Viterbi algorithm. Proc. IEEE 61(3), 268–278
(1973)
8. Dutoit, T.: Corpus-based speech synthesis, In: Benesty, J. (ed.)
Springer Handbook of Speech Processing and Communication,
Chap. 21 (2008)
9. van Santen, J.: Prosodic processing. In: Benesty, J. (ed.) Springer
Handbook of Speech Processing and Communication, Chap. 23
(2008)
10. Cosatto, E., Graf, H.P., Ostermann, J., Schroeter, J.: From
audio-only to audio and video text-to-speech. Acta Acustica
90, 1084–1095 (2004)
Voice Verification
▶ Liveness Assurance in Voice Authentication
Voice, Forensic Evidence of
ANDRZEJ DRYGAJLO
Swiss Federal Institute of Technology Lausanne
(EPFL), Lausanne, Switzerland
Synonym
Forensic speaker recognition
Definition
Forensic speaker recognition is the process of determin-
ing if a specific individual (suspected speaker) is the
source of a questioned voice recording (trace). The
forensic application of speaker recognition technology
is one of the most controversial issues within the wide
community of researchers, experts, and police workers.
This is mainly due to the fact that very different methods
are applied in this area by phoneticians, engineers, law-
yers, psychologists, and investigators. The approaches
commonly used for speaker recognition by forensic
experts include the aural-perceptual, the auditory-
instrumental, and the automatic methods. The forensic
expert’s role is to testify to the worth of the evidence by
using, if possible, a quantitative measure of this worth.
It is up to other people (the judge and/or the jury) to
use this information as an aid to their deliberations
and decision.
This essay aims at presenting forensic automatic
speaker recognition (FASR) methods that provide a
coherent way of quantifying and presenting recorded
voice as scientific evidence. In such methods, the evi-
dence consists of the quantified degree of similarity
between speaker-dependent features extracted from
the trace and speaker-dependent features extracted
from recorded speech of a suspect. The interpretation
of a recorded voice as evidence in the forensic context
presents particular challenges, including within-speaker
(within-source) variability, between-speakers (between-
sources) variability, and differences in recording-session
conditions. Consequently, FASR methods must provide
a probabilistic evaluation which gives the court an indi-
cation of the strength of the evidence given the estimated
within-source, between-sources, and between-session
variabilities.
Introduction
Speaker recognition is the general term used to include
all of the many different tasks of discriminating people
based on the sound of their voices. Forensic speaker
recognition involves the comparison of recordings of
an unknown voice (questioned recording) with one
or more recordings of a known voice (voice of the
suspected speaker) [1, 2].
There are several types of forensic speaker recog-
nition [3, 4]. When the recognition employs any
trained skill or any technologically-supported proce-
dure, the term technical forensic speaker recognition
is often used. In contrast to this, so-called naïve for-
ensic speaker recognition refers to the application of
people's unreflective, everyday ability to recognize
familiar voices.
The approaches commonly used for technical foren-
sic speaker recognition include the aural-perceptual,
auditory-instrumental, and automatic methods [2].
Aural-perceptual methods, based on human auditory
perception, rely on the careful listening of recordings
by trained phoneticians, where the perceived differ-
ences in the speech samples are used to estimate the
extent of similarity between voices [3]. The use of
aural-spectrographic speaker recognition can be con-
sidered as another method in this approach. The
exclusively visual comparison of spectrograms in what
has been called the ‘‘▶ voiceprint ’’ approach has come
under considerable criticism in recent years [5]. The
auditory-instrumental methods involve the acoustic
measurements of various parameters, such as the aver-
age fundamental frequency, articulation rate, formant
centre-frequencies, etc. [4]. The means and variances
of these parameters are compared. FASR is an estab-
lished term used when automatic speaker recognition
methods are adapted to forensic applications. In auto-
matic speaker recognition, the statistical or determin-
istic models of acoustic features of the speaker’s voice
and the acoustic features of questioned recordings are
compared [6].
FASR offers a data-driven methodology for quanti-
tative interpretation of recorded speech as evidence.
It is a relatively recent application of digital speech
signal processing and pattern recognition for judicial
purposes and particularly law enforcement. Results
of FASR based investigations may be of pivotal im-
portance at any stage of the course of justice, be it the
very first police investigation or a court trial. FASR
has been gaining more and more importance ever
since the telephone has become an almost ideal
tool for the commission of certain criminal offences,
especially drug dealing, extortion, sexual harassment,
and hoax calling. To a certain degree, this is undoubt-
edly a consequence of the highly-developed and fully
automated telephone networks, which may safeguard
a perpetrator’s anonymity. Nowadays, speech com-
munications technology is accessible anywhere, any-
time and at a low price. It helps to connect people,
but unfortunately also makes criminal activities
easier. Therefore, the identity of a speaker and the
interpretation of recorded speech as evidence in
the forensic context are quite often at issue in court
cases [1, 7].
Although several speaker recognition systems for
commercial applications (mostly speaker verification)
have been developed over the past 30 years, until
recently the development of a reliable technique for
FASR has been unsuccessful because methodological
aspects concerning automatic recognition of speakers
in criminalistics and the role of the forensic expert have
not been investigated sufficiently [8]. The role of a
forensic expert is to testify in court using, if possible,
quantitative measures that estimate the value and
strength of the evidence. The judge and/or the jury
use the testimony as an aid to the deliberations and
decisions [9].
A forensic expert testifying in court is not an advo-
cate, but a witness who presents factual information
and offers a professional opinion based upon that
factual information. In order for it to be effective, it
must be carefully documented, and expressed with
precision in a neutral and objective way, with the adver-
sary system in mind. Technical concepts based on
digital signal processing and pattern recognition must
be articulated in layman’s terms such that the judge and
the attorneys may understand them. They should also
be developed according to specific recommendations
that also take into account the forensic, legal, judicial,
and criminal policy perspectives. Therefore, forensic
speaker recognition methods should be developed
based on current state-of-the-art interpretation of
forensic evidence, the concept of identity used in crim-
inalistics, a clear understanding of the inferential pro-
cess of identity, and the respective duties of the actors
involved in the judicial process, jurists, and forensic
experts.
Voice as Evidence
When using FASR, the goal is to identify whether an
unknown voice of a questioned recording (trace)
belongs to a suspected speaker (source). The ▶ voice
evidence consists of the quantified degree of similarity
between speaker dependent features extracted from the
trace, and speaker dependent features extracted from
recorded speech of a suspect, represented by his or her
model [1], so the evidence does not consist of
the speech itself. To compute the evidence, the proces-
sing chain illustrated in Fig. 1 may be employed [10].
As a result, the suspect’s voice can be recognized as the
recorded voice of the trace, to the extent that the
evidence supports the hypothesis that the questioned
and the suspect’s recorded voices were generated by
the same person (source) rather than the hypothesis
that they were not. However, the calculated value of
evidence does not allow the forensic expert alone to
make an inference on the identity of the speaker.
As no ultimate set of speaker specific features is
present or detected in speech, the recognition process
remains in essence a statistical-probabilistic process
based on models of speakers and collected data,
which depend on a large number of design decisions.
Information available from the auditory features and
their evidentiary value depend on the speech organs
and language used [3]. The various speech organs have
to be flexible to carry out their primary functions
such as eating and breathing as well as their secondary
function of speech, and the number and flexibility of
the speech organs results in a high number of ‘‘degrees
of freedom’’ when producing speech. These ‘‘degrees of
freedom’’ may be manipulated at will or may be subject
to variation due to external factors such as stress,
fatigue, health, and so on. The result of this plasticity
of the vocal organs is that no two utterances from the
same individual are ever identical in a physical sense.
In addition to this, the linguistic mechanism (lan-
guage) driving the vocal mechanism is itself far from
invariant. We are all aware of changing the way we
speak, including the loudness, pitch, emphasis, and
rate of our utterances; aware, probably, too, that style,
pronunciation, and to some extent dialect, vary as we
speak in different circumstances. Speaker recognition
thus involves a situation where neither the physical
basis of a person’s speech (the vocal organs) nor the
language driving it, are constant.
The speech signal can be represented by a sequence
of short-term feature vectors. This is known as feature
extraction (Fig. 1). It is typical to use features based on
the various speech production and perception models.
Although there are no exclusive features conveying
speaker identity in the speech signal, from the source-
filter theory of speech production it is known that the
speech spectrum envelope encodes information about
the speaker’s vocal tract shape [11]. Thus some form
of spectral envelope based features is used in most
speaker recognition systems even if they are dependent
on external recording conditions. Recently, the major-
ity of speaker recognition systems have converged to
the use of cepstral features derived from the envelope
spectra models [1].
Voice, Forensic Evidence of. Figure 1 Block diagram of the evidence processing and interpretation system. © IEEE.
Thus, the most persistent real-world challenge in
this field is the variability of speech. There is within-
speaker (within-source) variability as well as between-
speakers (between-sources) variability. Consequently,
forensic speaker recognition methods should provide
a statistical-probabilistic evaluation, which attempts
to give the court an indication of the strength of the
evidence, given the estimated within-source variability
and the between-sources variability [4, 10].
Bayesian Interpretation of Evidence
To address these variabilities, a probabilistic model [9],
Bayesian inference [8] and data-driven approaches [6]
appear to be adequate: in FASR statistical techniques
the distribution of various features extracted from a
suspect’s speech is compared with the distribution of
the same features in a reference population with re-
spect to the questioned recording. The goal is to infer
the identity of a source [9], since it cannot be known
with certainty.
The inference of identity can be seen as a reduction
process, from an initial population to a restricted class,
or, ultimately, to unity [8]. Recently, an investigation
concerning the inference of identity in forensic speaker
recognition has shown the inadequacy of the speaker
verification and speaker identification (in closed set
and in open set) techniques [8]. Speaker verification
and identification are the two main automatic techni-
ques of speaker recognition used in commercial appli-
cations. When they are used for forensic speaker
recognition they imply a final discrimination decision
based on a threshold. Speaker verification is the task of
deciding, given a sample of speech, whether a specified
speaker is the source of it. Speaker identification is the
task of deciding, given a sample of speech, which
among many speakers is the source of it. Therefore,
these techniques are clearly inadequate for forensic
purposes, because they force the forensic expert to
make decisions which are devolved upon the court.
Consequently, the state-of-the-art speaker recognition
algorithms using dynamic time warping (DTW) and
hidden Markov models (HMMs) for text-dependent
tasks, and vector quantization (VQ), Gaussian mixture
models (GMMs), ergodic HMMs and others for text-
independent tasks have to be adapted to the Bayesian
interpretation framework which represents an ade-
quate solution for the interpretation of the evidence
in the judicial process [9].
The court is faced with decision-making under un-
certainty. In a case involving FASR it wants to know
how likely it is that the speech samples of questioned
recording have come from the suspected speaker.
The answer to this question can be given using
the Bayes’ theorem and a data-driven approach to
interpret the evidence [1, 7, 10].
The odds form of Bayes’ theorem shows how new
data (questioned recording) can be combined with
prior background knowledge (prior odds (province
of the court)) to give posterior odds (province of the
court) for judicial outcomes or issues (Eq. 1). It allows
for revision based on new information of a measure of
uncertainty (likelihood ratio of the evidence (province
of the forensic expert)) which is applied to the pair
of competing hypotheses: H0 – the suspected speaker
is the source of the questioned recording, H1 – the
speaker at the origin of the questioned recording is
not the suspected speaker.
\[
\underbrace{\frac{p(H_0 \mid E)}{p(H_1 \mid E)}}_{\substack{\text{posterior odds}\\\text{(province of the court)}}}
=
\underbrace{\frac{p(E \mid H_0)}{p(E \mid H_1)}}_{\substack{\text{likelihood ratio}\\\text{(province of the expert)}}}
\times
\underbrace{\frac{p(H_0)}{p(H_1)}}_{\substack{\text{prior odds}\\\text{(province of the court)}}}
\tag{1}
\]
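In code, the odds form of Eq. (1) is a single multiplication. The sketch below also converts posterior odds back into a posterior probability; the even prior odds of 1 are an illustrative assumption, and the LR value is the one reported later in this essay.

def posterior_odds(likelihood_ratio, prior_odds):
    """Odds form of Bayes' theorem (Eq. 1):
    posterior odds = likelihood ratio x prior odds."""
    return likelihood_ratio * prior_odds

def odds_to_probability(odds):
    """Convert odds in favour of H0 into a probability of H0."""
    return odds / (1.0 + odds)

# LR = 9.165 with even prior odds of 1 gives P(H0|E) of about 0.90
print(odds_to_probability(posterior_odds(9.165, 1.0)))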
This hypothetical-deductive reasoning method, based
on the odds form of the Bayes’ theorem, allows evalu-
ating the likelihood ratio of the evidence that leads
to the statement of the degree of support for one
hypothesis against the other. The ultimate question
rests on the evaluation of the probative strength of
this evidence provided by an automatic speaker recog-
nition method [12]. Recently, it was demonstrated that
outcome of the aural (subjective) and instrumental
(objective) approaches can also be expressed as a
Bayesian likelihood ratio [4, 13].
Strength of Evidence
The ▶ strength of voice evidence is the result of the
interpretation of the evidence, expressed in terms of
the likelihood ratio of two alternative hypotheses. The
principal structure for the calculation and the inter-
pretation of the evidence is presented in Fig. 1. It
includes the collection (or selection) of the databases,
the automatic speaker recognition and the Bayesian
interpretation [10].
The methodological approach based on a Bayesian
interpretation (BI) framework is independent of the
automatic speaker recognition method chosen, but the
practical solution presented in this essay as an example
uses a text-independent speaker recognition system based
on Gaussian mixture models (GMMs) [14].
The Bayesian interpretation (BI) methodology
needs a two-stage statistical approach [10]. The first
stage consists in modeling multivariate feature data
using GMMs. The second stage transforms the data
to a univariate projection based on modeling the simi-
larity scores. The exclusively multivariate approach is
also possible but it is more difficult to articulate
in layman’s terms [15]. The GMM method is not only
used to calculate the evidence by comparing the
questioned recording (trace) to the GMM of the sus-
pected speaker (source), but it is also used to produce
data necessary to model the within-source variability
of the suspected speaker and the between-sources
variability of the potential population of relevant
speakers, given the questioned recording. The interpre-
tation of the evidence consists of calculating the likeli-
hood ratio using the probability density functions
(pdfs) of the variabilities and the numerical value of
evidence.
The information provided by the analysis of the
questioned recording (trace) leads to specifying the initial
reference population of relevant speakers (potential pop-
ulation) having voices similar to the trace, and,
combined with the police investigation, to focus on and
select a suspected speaker. The methodology presented
needs three databases for the calculation and the inter-
pretation of the evidence: the potential population data-
base (P), the suspected speaker reference database (R),
and the suspected speaker control database (C) [14].
The potential population database (P) is a database
for modeling the variability of the speech of all the
potential relevant sources, using the automatic speaker
recognition method. It allows evaluating the between-
sources variability given the questioned recording,
which means the distribution of the similarity scores
that can be obtained, when the questioned recording is
compared to the speaker models (GMMs) of the po-
tential population database. The calculated between-
sources variability pdf is then used to estimate the
denominator of the likelihood ratio p(E|H1). Ideally,
the technical characteristics of the recordings (e.g.,
signal acquisition and transmission) should be chosen
according to the characteristics analyzed in the trace.
The suspected speaker reference database (R) is
recorded with the suspected speaker to model his/her
speech with the automatic speaker recognition method.
In this case, speech utterances should be produced in
the same way as those of the P database. The sus-
pected speaker model obtained is used to calculate the
value of the evidence, by comparing the questioned
recording to the model.
The suspected speaker control database (C) is
recorded with the suspected speaker to evaluate her/his
within-source variability, when the utterances of this
database are compared to the suspected speaker model
(GMM). This calculated within-source variability pdf
is then used to estimate the numerator of the likeli-
hood ratio p(E|H0). The recording of the C database
should consist of utterances as close as possible to the
trace with respect to the technical characteristics, as
well as to the quantity and style of speech.
The basic method proposed has been exhaustively
tested in mock forensic cases corresponding to real
casework [11, 14]. In the example presented in Fig. 2,
the strength of evidence, expressed in terms of likeli-
hood ratio gives LR = 9.165 for the evidence value
E = 9.94. This means that it is 9.165
times more likely to observe the score E given the
hypothesis H0 than H1. The important point to be
made here is that the estimate of the LR is only as
good as the modeling techniques and databases used
to derive it. In the example, the GMM technique was
used to estimate pdfs from the data representing simi-
larity scores [11].
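A minimal sketch of this two-stage, score-based LR estimation follows, assuming scikit-learn's GaussianMixture and univariate similarity scores gathered as described above from the P, R, and C databases. The number of mixture components is an illustrative choice, not a value from the text.

import numpy as np
from sklearn.mixture import GaussianMixture

def likelihood_ratio(E, within_scores, between_scores, n_components=2):
    """Estimate LR = p(E|H0) / p(E|H1) from similarity-score samples.

    within_scores:  scores of the suspect's control recordings (database C)
                    against the suspect's model built from database R,
                    modeling the within-source variability.
    between_scores: scores of the questioned recording against the models
                    of the potential population (database P), modeling the
                    between-sources variability."""
    pdf_h0 = GaussianMixture(n_components).fit(np.reshape(within_scores, (-1, 1)))
    pdf_h1 = GaussianMixture(n_components).fit(np.reshape(between_scores, (-1, 1)))
    log_p0 = pdf_h0.score_samples([[E]])[0]   # log p(E|H0)
    log_p1 = pdf_h1.score_samples([[E]])[0]   # log p(E|H1)
    return float(np.exp(log_p0 - log_p1))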
Voice, Forensic Evidence of. Figure 2 The LR estimation given the value of the evidence E. © IEEE.
Evaluation of the Strength of Evidence
The likelihood ratio (LR) summarizes the statement of
the forensic expert in the casework. However, of great-
est interest to the jurists is the extent to which the LRs
correctly discriminate ‘‘the same speaker and different-
speaker’’ pairs under operating conditions similar to
those of the case in hand. As was made clear in the US
Supreme Court decision in the Daubert case (Daubert v.
Merrell Dow Pharmaceuticals, 1993), a criterion for the
admissibility of scientific evidence should be to know
to what extent the method can be, and has been, tested.
The principle for evaluation of the strength of
evidence consists in the estimation and the comparison
of the likelihood ratios that can be obtained from the
evidence E, on one hand when the hypothesis H0 is
true (the suspected speaker truly is the source of the
questioned recording) and, on the other hand, when
the hypothesis H1 is true (the suspected speaker is truly
not the source of the questioned recording) [14]. The
performance of an automatic speaker recognition
method is evaluated by repeating the experiment de-
scribed in the previous sections, with several speakers
being at the origin of the questioned recording, and by
representing the results using experimental (histogram
based) probability distribution plots such as probabili-
ty density functions and cumulative distribution func-
tions in the form of Tippett plots (Fig. 3a) [10, 14].
The way of representation of the results in the form
of Tippett plots is the one proposed by Evett and
Buckleton in the field of interpretation of the forensic
DNA analysis [6]. The authors have named this repre-
sentation ‘‘Tippett plot,’’ referring to the concepts of
‘‘within-source comparison’’ and ‘‘between-sources
comparison’’ defined by Tippett et al.
Forensic Speaker Recognition in Mismatched Conditions
Nowadays, state-of-the-art automatic speaker recogni-
tion systems show very good performance in discrimi-
nating between voices of speakers under controlled
recording conditions. However, the conditions in
which recordings are made in investigative activities
(e.g., anonymous calls and wire-tapping) cannot be
controlled and pose a challenge to automatic speaker
recognition. Differences in the background noise, in
the phone handset, in the transmission channel, and in
the recording devices can introduce variability over
and above that of the voices in the recordings. The
main unresolved problem in FASR today is that of
handling mismatch in recording conditions, also in-
cluding mismatch in languages, linguistic content, and
non-contemporary speech samples. Mismatch in re-
cording conditions has to be considered in the estima-
tion of the likelihood ratio [11–13]. A next step can
be the combination of the strength of evidence obtained
using the aural-perceptive and acoustic-phonetic
(aural-instrumental) approaches of trained phoneticians
with the likelihood ratio returned by the automatic
system [4]. In order for FASR to be acceptable for
presentation in the courts, the methods and techniques
have to be researched, tested and evaluated for error, as
well as be generally accepted in the scientific commu-
nity. The methods proposed should be analyzed in the
light of the admissibility of scientific evidence (e.g.,
Daubert ruling, USA, 1993) [11].
Summary
The essay discussed some important aspects of fore-
nsic speaker recognition, focusing on the necessary sta-
tistical-probabilistic framework for both quantifying
and interpreting recorded voice as scientific evidence.
Methodological guidelines for the calculation of the
evidence, its strength and the evaluation of this strength
under operating conditions of the casework were pre-
sented. As an example, an automatic method using
Gaussian mixture models (GMMs) within the Bayesian
interpretation (BI) framework was implemented for
the forensic speaker recognition task.
represents neither speaker verification nor speaker iden-
tification. These two recognition techniques cannot be
used for the task, since categorical, absolute and deter-
ministic conclusions about the identity of source of
evidential traces are logically untenable because of the
inductive nature of the process of the inference of iden-
tity. This method, using a likelihood ratio to indicate the
strength of the evidence of the questioned recording,
measures how this recording of voice scores for the
suspected speaker model, compared to relevant non-
suspect speaker models. It became obvious that partic-
ular effort is needed in the trans-disciplinary domain of
adaptation of the state-of-the-art speech recognition
techniques to real-world environmental conditions for
forensic speaker recognition. The future methods to be
developed should combine the advantages of automatic
signal processing and pattern recognition objectivity
with the methodological transparency solicited in
forensic investigations.
Related Entries
▶ Forensic Biometrics
▶ Forensic Evidence
▶ Speaker Recognition, An Overview
References
1. Rose, P.: Forensic Speaker Identification. Taylor & Francis,
London (2002)
2. Dessimoz, D., Champod, C.: Linkages between biometrics
and forensic science. In: Jain, A., Flynn, P., Ross, A. (eds.)
Handbook of Biometrics, pp. 425–459. Springer, New York
(2008)
3. Nolan, F.: Speaker identification evidence: its forms, limitations,
and roles. In: Proceedings of the Conference ‘‘Law and Language:
Prospect and Retrospect’’, Levi (Finnish Lapland), pp. 1–19
(2001)
4. Rose, P.: Technical forensic speaker recognition: Evaluation,
types and testing of evidence. Comput. Speech Lang. 20(2–3),
159–191 (2006)
5. Meuwly, D.: Voice analysis. In: Siegel, J., Knupfer, G., Saukko,
P. (eds.) Encyclopedia of Forensic Sciences, pp. 1413–1421.
Academic Press, London (2000)
6. Drygajlo, A.: Forensic automatic speaker recognition. IEEE
Signal Process. Mag. 24(2), 132–135 (2007)
7. Robertson, B., Vignaux, G.: Interpreting Evidence. Evaluating
Forensic Science in the Courtroom. John Wiley & Sons, Chiche-
ster (1995)
8. Champod, C., Meuwly, D.: The inference of identity in forensic
speaker identification. Speech Commun. 31(2–3), 193–203
(2000)
9. Aitken, C., Taroni, F.: Statistics and the Evaluation of
Evidence for Forensic Scientists. John Wiley & Sons, Chichester
(2004)
10. Drygajlo, A., Meuwly, D., Alexander, A.: Statistical
methods and Bayesian interpretation of evidence in forensic
automatic speaker recognition. In: Proceedings of Eighth
European Conference on Speech Communication and
Technology (Eurospeech’03), pp. 689–692, Geneva, Switzerland
(2003)
11. Alexander, A.: Forensic automatic speaker recognition using
Bayesian interpretation and statistical compensation for mis-
matched conditions. Ph.D. thesis, EPFL (2005)
12. Gonzalez-Rodriguez, J., Drygajlo, A., Ramos-Castro, D., Garcia-
Gomar, M., Ortega-Garcia, J.: Robust estimation, interpretation
and assessment of likelihood ratios in forensic speaker recogni-
tion. Comput. Speech Lang. 20(2–3), 331–355 (2006)
13. Alexander, A., Dessimoz, D., Botti, F., Drygajlo, A.: Aural and
automatic forensic speaker recognition in mismatched condi-
tions. Int. J. Speech Lang. Law, 12(2), 214–234 (2005)
14. Meuwly, D., Drygajlo, A.: Forensic speaker recognition based
on a Bayesian framework and Gaussian mixture model-
ling (GMM). In: Proceedings 2001: A Speaker Odyssey, The
Speaker Recognition Workshop, pp. 145–150, Crete, Greece
(2001)
15. Alexander, A., Drygajlo, A.: Scoring and direct methods for the
interpretation of evidence in forensic speaker recognition.
In: Proceedings of Eighth International Conference on Spoken
Language Processing (ICSLP’04), pp. 2397–2400, Jeju, Korea
(2004)
Voiced Sounds
Voiced speech is generated by the modulation of
the airstream from the lungs by the periodic opening and
closing of the vocal folds in the glottis or larynx. This
is used, e.g., for vowels and nasal consonants.
▶ Speech Production
Voiceprint
Voiceprint is another name for spectrogram. This
name is usually avoided because of its association
with voiceprint recognition, a highly contro-
versial method of forensic speaker recognition that
relies exclusively on visual examination of spectrograms.
▶Voice, Forensic Evidence of
Volunteer Crew
The volunteer crew for a biometric test consists of the
individuals who participate in the evaluation of the
biometric system and from whom biometric samples are taken.
▶Test Sample and Size