
APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series www.apsipa.org


Making speech tangible for better understanding of human speech communication

Hideki Kawahara, APSIPA Distinguished Lecturer, 2015-2016

Emeritus Professor, Wakayama University, Japan; Visiting Research Scientist, Google UK

29 January 2016, Edinburgh, UK




Introduction to APSIPA and APSIPA DL

» APSIPA Mission: To promote a broad spectrum of research and education activities in signal and information processing in Asia Pacific
» APSIPA Conferences: APSIPA Annual Summit and Conference
» APSIPA Publications: Transactions on Signal and Information Processing, in partnership with Cambridge Journals since 2012; APSIPA Newsletters
» APSIPA Social Network: To link members together and to disseminate valuable information more effectively
» APSIPA Distinguished Lectures: An APSIPA educational initiative to reach out to the community

Prosody, Information, and Modeling - with Emphasis on Tonal Features of Speech -

Hiroya Fujisaki

Professor Emeritus, University of Tokyo
e-mail: [email protected]

Abstract

Starting from the author's view on the process of information manifestation in the tonal features of speech, this paper emphasizes the importance of objective and quantitative modeling in the study of these features. It then describes a model for the process of fundamental frequency control of speech that has been originally proposed and established for Japanese, and explains the physiological and physical evidences on which the model is based. Application of the model for generation of F0 contours of languages other than Japanese is then described, indicating how the original model can be modified and extended to cover those features that are not found in Japanese. The underlying mechanisms responsible for production of these features are also discussed.

1 Introduction [1]

In the first place, we shall look into the process by which certain kinds of information intended by a speaker are manifested in the contour of the voice fundamental frequency (henceforth to be denoted by F0 contour). Although the manifestations are differently named as tone, pitch accent, or intonation, depending mainly on the size of linguistic units associated with them, the word 'tonal features' will be used as a generic term to include all of them.

[Figure 1 diagram. Input information: linguistic (lexical, syntactic, semantic, pragmatic), para-linguistic (intentional, attitudinal, stylistic), non-linguistic (physical, emotional). Constraints: rules of grammar, rules of prosody, physiological constraints, physical constraints. Stages: message planning, utterance planning, motor command generation, speech sound production. Output: segmental and suprasegmental features of speech.]

Figure 1: Processes by which various types of information are manifested in the segmental and suprasegmental features of speech.

The information expressed by speech can be regarded to fall into three categories: linguistic, para-linguistic, and non-linguistic, though their boundaries may not always be clear. Here I define linguistic information as the symbolic information that is represented by a set of discrete symbols and rules for their combination. It can be represented either explicitly by the written language, or can be easily and uniquely inferred from context. Linguistic information thus defined is discrete and categorical. For example, the information concerning the accent type of a Japanese word is discrete in the sense that it specifies one out of a finite number of possible accent types.

On the other hand, paralinguistic information is defined as the information that is not inferable from the written counterpart but is deliberately added by the speaker to modify or supplement the linguistic information. A written sentence can be uttered in various ways to express different intentions, attitudes, and speaking styles which are under the conscious control of the speaker. Paralinguistic information can be both discrete and continuous. For example, the information regarding whether a speaker's intention is an assertion or a question is discrete, but it can also be continuous in the sense that a speaker can express the degree within each category.

Nonlinguistic information concerns such factors as the age, gender, idiosyncrasy, physical and emotional states of the speaker, etc. These factors are not directly related to the linguistic and paralinguistic contents of the utterances and cannot gen-

Fujisaki, H.: Prosody, Models, and Spontaneous Speech. In Computing Prosody (Sagisaka, Y., Campbell, N., and Higuchi, N., eds.), Springer-Verlag (1996) 27–42.

書不盡言言不盡意

Text is not enough to represent speech. Speech is not enough to represent heart. (from a Chinese sutra of divination lore compiled in B.C. 800)




» Important to make the future better for humans by providing rich and deep communications mediated by ubiquitous intelligent/empathic machines
» Interactive tools for speech science education
» Research tools for investigating non- and para-linguistic aspects of speech

Making speech tangible

Auditory media

Background

[Career timeline figure. Affiliations: Hokkaido University; graduate school; NTT basic research; Yokosuka R&D; NTT basic research; ATR; Wakayama University. Projects: CREST (shown twice); e-Society. Topics: electrical engineering; speech signal processing; broad-band networks; auditory information processing; neural networks. Interests: SF, computer, music, singing. Year marks: 1968~, 1972~, 1977~, 1982~, 1984~, 1992~, 1997~, ~2006, 2015~.]

STRAIGHT 1997-
morphing 2003-
TANDEM-STRAIGHT 2007
Spark 1986-
Temporally variable multi-aspect morphing 2009
Temporally variable multi-aspect N-way morphing 2013

I am a tool builder hoping to make useful tools to promote understanding of human speech communication and to encourage collaborations between researchers and developers. I would appreciate your suggestions for me to produce further interesting tools.

F0 extractors

YIN

XSX

1997- 1999- 2002- 2005- 2007- 2008- 2012- 2013-

NDF

Overview

manipulate/explore STRAIGHT visualize/interact




» Familiarity with the mathematical as well as the practical background of signal processing helps in understanding speech phenomena

» Computers are powerful enough to provide realtime and interactive visualization/sonification of sounds

» An open source set of tools (SparkNG) is available

Interactive tools are essential

Fourier transform and sounds

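The Fourier transform enables us to represent a waveform as a sum of complex exponential functions:

$$F(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt$$

$$j = \sqrt{-1}, \qquad e^{j\omega t} = \cos(\omega t) + j \sin(\omega t)$$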

Joseph Fourier (1768-1830)

http://www.wakayama-u.ac.jp/~kawahara/MatlabRealtimeSpeechTools/

complex exponential and sinusoid
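The link between complex exponentials and sinusoids can be checked numerically. The following is a minimal sketch, not part of the SparkNG tools: it verifies Euler's formula at one arbitrary point and applies a direct discrete Fourier transform (an unnormalized, discrete stand-in for the integral on the slide) to a sampled cosine. The function name `dft` and all numeric values are assumptions chosen for illustration.

```python
import cmath
import math


def dft(samples):
    """Direct O(N^2) discrete Fourier transform (forward, unnormalized)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * m / n) for m, x in enumerate(samples))
        for k in range(n)
    ]


# Euler's formula: exp(j*w*t) = cos(w*t) + j*sin(w*t), checked at one point.
omega_t = 0.7
euler_lhs = cmath.exp(1j * omega_t)
euler_rhs = complex(math.cos(omega_t), math.sin(omega_t))

# A cosine with exactly 5 cycles in 64 samples (illustrative values).
n_samples = 64
cycles = 5
cosine = [math.cos(2 * math.pi * cycles * m / n_samples) for m in range(n_samples)]

spectrum = dft(cosine)
magnitudes = [abs(c) for c in spectrum]

# The two largest bins: one positive-frequency and one negative-frequency
# complex exponential, which together make up the real cosine.
peak_bins = sorted(range(n_samples), key=lambda k: magnitudes[k], reverse=True)[:2]
```

A real cosine shows up as a pair of complex exponentials at mirrored bins (here `cycles` and `n_samples - cycles`), each carrying half of the amplitude.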

Fourier transform as Ptolemy

Component, pitch and timbre

Matlab

[Screenshot panels: log-area, transfer function, display, control]

English translation of following slides

[Panel labels: LPC coeff.; LPC polynom.; pole freq. bw.; PARCOR coeff.; autocorrelation; LSP freq.; interp. corr.; Jacobi coeff.; poly. coeff. p.; CSM freq. int.; CSM poly. root.]

log-area manipulation

pole location manipulation

16-19 December, 2015

Overview





» Decomposes speech signal into parameters
  » Source information
    » Fundamental frequency: F0
    » Aperiodicity: mixed mode excitation
  » Filter information
» Enables precise control of parameters
» Resynthesize from modified parameters

STRAIGHT is a VOCODER
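The source/filter split described above can be illustrated with a toy vocoder-style synthesizer. This is only a sketch of the general idea (periodic pulses mixed with noise, then shaped by a filter); it is not STRAIGHT's actual algorithm. The function names, the single `aperiodicity` mixing ratio, and the one-pole filter standing in for a spectral envelope are all assumptions made for illustration.

```python
import math
import random


def mixed_excitation(f0_hz, aperiodicity, fs_hz, n_samples, seed=0):
    """Pulse train at f0_hz mixed with white noise.

    aperiodicity is in [0, 1]: 0 gives pure pulses, 1 gives pure noise.
    """
    rng = random.Random(seed)
    period = fs_hz / f0_hz  # samples per pitch period (may be fractional)
    next_pulse = 0.0
    out = []
    for n in range(n_samples):
        pulse = 0.0
        if n >= next_pulse:
            pulse = math.sqrt(period)  # keeps the pulse-train power near 1
            next_pulse += period
        noise = rng.gauss(0.0, 1.0)
        out.append(
            math.sqrt(1.0 - aperiodicity) * pulse + math.sqrt(aperiodicity) * noise
        )
    return out


def one_pole_filter(signal, pole=0.95):
    """Toy 'filter information': y[n] = x[n] + pole * y[n-1]."""
    filtered, previous = [], 0.0
    for x in signal:
        previous = x + pole * previous
        filtered.append(previous)
    return filtered


fs = 8000
excitation = mixed_excitation(f0_hz=100.0, aperiodicity=0.0, fs_hz=fs, n_samples=800)
speechlike = one_pole_filter(excitation)
pulse_positions = [n for n, v in enumerate(excitation) if v != 0.0]
```

Changing `f0_hz` alters the pitch without touching the filter, and changing `pole` alters the timbre without touching the pitch, which is the kind of independent parameter control the slide refers to.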


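[Block diagram with three stages: analysis, modification, synthesis. Analysis: the input signal passes through F0 analysis, spectral envelope analysis, and non-periodicity analysis, giving F0, spectral envelope, and non-periodicity. Modification: applied to the physical attributes. Synthesis: a periodic pulse generator and a non-periodic component generator feed a shaper and mixer, followed by a filter that produces the output signal. Legend: data, process, signal, parameter.]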

TANDEM-STRAIGHT: periodic pulse

power spectrum Movie

TANDEM-STRAIGHT: synthetic vowel /a/

log-power spectrum

Movie

TANDEM-STRAIGHT: natural speech

Movie



F0 extraction is crucial
» Phase spectrogram and instantaneous frequency
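The idea of reading frequency from phase can be sketched as follows: the phase of a short-time Fourier component advances by $2\pi f / f_s$ per sample, so the phase difference between two frames one sample apart gives the instantaneous frequency $f$. This is a generic illustration, not the F0 extractor presented in the talk; the function names, the Hann window, and all numeric values are assumptions.

```python
import cmath
import math


def windowed_bin(signal, start, length, freq_hz, fs_hz):
    """Hann-windowed Fourier component of one frame at a single frequency.

    The phase is referenced to the first sample of the frame.
    """
    acc = 0j
    for m in range(length):
        window = 0.5 - 0.5 * math.cos(2 * math.pi * (m + 0.5) / length)
        acc += window * signal[start + m] * cmath.exp(-2j * math.pi * freq_hz * m / fs_hz)
    return acc


def instantaneous_frequency(signal, start, length, freq_hz, fs_hz):
    """Frequency (Hz) of the component near freq_hz, from phase advance per sample."""
    first = windowed_bin(signal, start, length, freq_hz, fs_hz)
    second = windowed_bin(signal, start + 1, length, freq_hz, fs_hz)
    phase_step = cmath.phase(second * first.conjugate())  # radians per sample
    return phase_step * fs_hz / (2 * math.pi)


fs = 8000.0
true_f0 = 240.0
signal = [math.cos(2 * math.pi * true_f0 * n / fs) for n in range(1000)]

# The analysis frequency (230 Hz) is deliberately off target; the phase
# advance still reports the frequency of the sinusoid that is present.
estimate = instantaneous_frequency(signal, start=100, length=400, freq_hz=230.0, fs_hz=fs)
```

The estimate is not tied to the analysis grid, which is why instantaneous frequency is attractive when fine F0 movements matter.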

DVD

APSIPA ASC 2014

[Plot: frequency (Hz), 236 to 244, versus time (s), 0.1 to 0.25. Legend: proposed, YIN, SWIPE, modulation.]

Sinusoidal modulation
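A test signal of the kind such a plot evaluates can be sketched by integrating a sinusoidally modulated frequency into a phase. The centre frequency of 240 Hz and the depth of 4 Hz are assumptions read off the plot's axis range (236 to 244 Hz); the modulation rate, sampling frequency, and function name are hypothetical.

```python
import math


def fm_test_signal(f0_hz, mod_hz, depth_hz, fs_hz, duration_s):
    """Sinusoid whose frequency is f0_hz + depth_hz * sin(2*pi*mod_hz*t).

    Returns the samples and the reference frequency trajectory, so an
    F0 extractor's output can be compared with the known modulation.
    """
    n_samples = int(round(fs_hz * duration_s))
    phase = 0.0
    samples, reference = [], []
    for n in range(n_samples):
        t = n / fs_hz
        freq = f0_hz + depth_hz * math.sin(2 * math.pi * mod_hz * t)
        samples.append(math.sin(phase))
        reference.append(freq)
        phase += 2 * math.pi * freq / fs_hz  # accumulate phase sample by sample
    return samples, reference


signal, reference = fm_test_signal(
    f0_hz=240.0, mod_hz=8.0, depth_hz=4.0, fs_hz=8000.0, duration_s=0.5
)
```

Sweeping `mod_hz` and measuring how much of the modulation depth survives in an extractor's output gives a curve of modulation depth against modulation frequency.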

[Plot: modulation depth (relative dB), -120 to 0, versus modulation frequency (Hz) on a logarithmic axis from 10^0 to 10^2. Legend: modulation, YIN, SWIPE, proposed.]

Sinusoidal modulation

F0 modulation in Noh voice

[Plot: frequency (Hz), 0 to 1200, versus time (ms), 1.5 x 10^4 to 1.7 x 10^4. Trace label: 6*F0.]

F0 modulation in Noh voice

[Plot: frequency (Hz), 0 to 1200, versus time (ms), 1.57 x 10^4 to 1.64 x 10^4. Trace label: 6*F0. Annotations: 1:2, 1:3, chaos?, coupling.]

STRAIGHT GUIs

Matlab

Snapshot: F0 extraction

customizable F0 extractor

Matlab

Snapshot: modification

duration, size, F0, amplitude

Matlab

Overview






» Provides an objective reference scale for measuring subjective effects

» Provides a means to quantify perceptual effects of speech parameter modifications

» Provides new test stimuli for speech communication

» Provides new special effects for post processing of existing speech materials

Morphing is a powerful tool


November, 2013, APSIPA, Taiwan

Matlab

ISCSLP2008, Kunming China, 16-19 December 2008

What does morphing do?

example:A

example:B

high-dimensional parameter space

trajectory design

Matlab

APSIPA ASC 2014


[Figure; axis labels: voices, attribute.]

[Block diagram of morphing built on STRAIGHT. Analysis: each input signal (signal-1, ..., signal-k, ..., signal-N) is analyzed into F0, non-periodicity, and spectral envelope, with time axis alignment and frequency axis alignment. Morphing: the physical attributes of the analyzed signals are combined using a set of indexed weights of physical attributes, together with time axis mapping and frequency axis mapping. Synthesis: periodic pulse generator, shaper and mixer. Legend: data, process, signal, parameter.]

Generalized morphing

» no constraint: w.sum(function); e.g. location, speed ...
» positivity: exponent(w.sum(log(function))); e.g. F0, power ...
» monotonicity: integration(exponent(w.sum(log(derivative of function)))); e.g. time axis, frequency axis ...

enabling extrapolation
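The three rules above can be sketched for sampled data as follows. This is a minimal illustration, not the TANDEM-STRAIGHT implementation: the function names are hypothetical, the weights are constant over time, and the integral in the monotonic case is replaced by a cumulative sum of morphed increments.

```python
import math


def morph_unconstrained(values, weights):
    """Weighted sum: for attributes with no range constraint."""
    return sum(w * v for w, v in zip(weights, values))


def morph_positive(values, weights):
    """exp of the weighted sum of logs: the result stays positive."""
    return math.exp(sum(w * math.log(v) for w, v in zip(weights, values)))


def morph_monotonic(axes, weights):
    """Morph increasing axes sampled on a shared grid.

    The positive increments are morphed with morph_positive and then
    re-accumulated, a discrete stand-in for integrating the morphed
    derivative. The result starts at 0 and is strictly increasing.
    """
    n_points = len(axes[0])
    morphed = [0.0]
    for i in range(1, n_points):
        steps = [axis[i] - axis[i - 1] for axis in axes]
        morphed.append(morphed[-1] + morph_positive(steps, weights))
    return morphed


f0_a, f0_b = 100.0, 200.0
halfway_f0 = morph_positive([f0_a, f0_b], [0.5, 0.5])         # geometric mean
extrapolated_f0 = morph_positive([f0_a, f0_b], [-0.5, 1.5])   # beyond example B

time_a = [0.0, 0.1, 0.3, 0.6]
time_b = [0.0, 0.2, 0.4, 0.5]
morphed_time = morph_monotonic([time_a, time_b], [-0.5, 1.5])
```

With weights outside the range 0 to 1, a plain weighted sum could drive F0 negative or make a time axis run backwards; working in the log domain keeps the constraints satisfied, which is what makes extrapolation safe.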

APSIPA ASC 2014


56

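IPSJ SIG Technical Report (Information Processing Society of Japan)

For each element of $\Theta^{(k)}(\nu, \tau)$, we summarize which of the kinds described above it belongs to.

3.4.1 No restriction on the range of the function
The elements of the vector $a^{(k)}$ of aperiodicity indices are real numbers with no restriction on their range. Equation (2) is used to morph this parameter.

3.4.2 The range of the function is positive
The fundamental frequency $f_0^{(k)}$ and the STRAIGHT spectrum $P^{(k)}$ must be positive. Equation (4) is used to morph these parameters.

3.4.3 The function is monotonically increasing
The time axis $t^{(k)}(\tau)$ and the frequency axis $f^{(k)}(\nu)$ must be monotonically increasing functions. Equation (5) is used to morph these parameters.

3.5 Morphing as a functional
Morphing as a functional $T$ is defined by the following equation. Let $\Theta_m(\nu, \tau)$ denote the set of parameters (in practice, a set of functions) obtained by morphing.

$$\Theta_m(\nu, \tau) = T\left[\Theta^{(1)}(\nu, \tau), \Theta^{(2)}(\nu, \tau), \ldots, \Theta^{(K)}(\nu, \tau); W\right], \qquad (6)$$

where $W$ is the following set of vector functions, which represent the contribution of each attribute of each example.

$$W = \{w_{F0}(\tau), w_{A}(\tau), w_{P}(\tau), w_{Fx}(\tau), w_{Tx}(\tau)\}, \qquad (7)$$

Specifically, $w_X(\tau) = [w_X^{(1)}(\tau), w_X^{(2)}(\tau), \ldots, w_X^{(K)}(\tau)]^T$ is a vector whose elements are the functions $w_X^{(k)}(\tau)$ that represent the contribution for attribute $X \in \{F0, A, P, Fx, Tx\}$. The attribute subscripts $F0$, $A$, $P$, $Fx$, and $Tx$ denote the fundamental frequency, the aperiodicity index, the STRAIGHT spectrum, the frequency axis, and the time axis, respectively.

To actually synthesize the morphed speech, the inverse functions of the morphed time axis $t_{m3}(\tau)$ and frequency axis $f_{m3}(\nu)$ are used to obtain the abstract time and frequency as $\tau = t_{m3}^{-1}(t_{m3})$ and $\nu = f_{m3}^{-1}(f_{m3})$; the values are then read out and supplied to the TANDEM-STRAIGHT synthesis system. "Real-time" processing, in which the result of manipulating the contributions is output successively as morphed speech, can also be realized by using these inverse functions.

4. Mapping as annotation, and morphing
With the formulation above, if the time axis $t^{(k)}$ and the frequency axis $f^{(k)}$ of an example (say, the k-th example) are mapped to the abstract time $\tau$ and the abstract frequency $\nu$, morphing can be carried out in a single step even when the total number of examples $N$ is 2 or more. This is a major advantage.

To be specific: in the GUI currently provided to make temporally variable multi-aspect morphing easy to use, the correspondence between the time-frequency representations of two examples is set using paired reference points, and the examples together with their paired reference points are recorded as a morphing substrate [11], [12]. With this conventional method, when there are $M$ examples, the very costly procedure of setting paired reference points has to be carried out $M!$ times. With the new formulation, the mapping to the abstract time $\tau$ and frequency $\nu$ only has to be made $M$ times, once per example. Put differently, if the mapping to the abstract time $\tau$ and frequency $\nu$ is attached to the TANDEM-STRAIGHT analysis results as an annotation, morphing among any number of examples becomes possible simply by combining annotated analysis results.

4.1 A concrete example using piecewise linear functions
The only condition required of the mapping is that it be a differentiable, monotonically increasing function. From the functions that satisfy this condition we choose the piecewise linear function, which is easy to implement, and use it to explain the mapping and the resulting morphing concretely (*4). The time axis is used as the example. Suppose there are $K$ reference points in the time direction, and let $p^{(k)}(\tau_n)$ denote the n-th reference point of the k-th example. To simplify the discussion, $\tau_n$ takes integer values, with $\tau_1 = 0$, $\tau_{n+1} = \tau_n + 1$, and $p^{(k)}(\tau_1) = 0$. With these preparations, the function $t^{(k)}(\tau)$ that gives the value on the time axis of example $k$ from the abstract time $\tau$ is defined by the following piecewise linear function, where $\tau_{n+1} > \tau \geq \tau_n$.

$$t^{(k)}(\tau) = \left(p^{(k)}(\tau_{n+1}) - p^{(k)}(\tau_n)\right)(\tau - \tau_n) + p^{(k)}(\tau_n). \qquad (8)$$

Taking the derivative of this function,

$$\frac{dt^{(k)}(\tau)}{d\tau} = p^{(k)}(\tau_{n+1}) - p^{(k)}(\tau_n), \qquad (9)$$

and then its logarithm,

$$\log\left(\frac{dt^{(k)}(\tau)}{d\tau}\right) = \log\left(p^{(k)}(\tau_{n+1}) - p^{(k)}(\tau_n)\right), \qquad (10)$$

the morphed time axis $t_{m3}(\tau)$ can be written out explicitly as

$$t_{m3}(\tau) = \left(p_m(\tau_{n+1}) - p_m(\tau_n)\right)(\tau - \tau_n) + p_m(\tau_n), \qquad (11)$$

where $p_m(\tau_n)$ is defined by

$$p_m(\tau_n) = \prod_{k=1}^{K} \left(p^{(k)}(\tau_n) - p^{(k)}(\tau_{n-1})\right)^{\bar{w}_{Tx}^{(k)}(\tau_n)} + p_m(\tau_{n-1}), \qquad (12)$$

*4 The method using piecewise linear functions sacrifices the flexibility of this new framework. It also leads users to misunderstand the role of the reference points, which are used merely as a convenience for defining the piecewise linear function. Alternatives that avoid these problems are under study.

© 2013 Information Processing Society of Japan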



[Annotations: morphing entity; exemplar; F0; aperiodicity; time-frequency rep.; time c.; frequency c.]

Speech parameter constraints


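... a common abstract time axis $\tau$ and frequency axis $\nu$ are prepared. The set of parameters obtained by analyzing a particular example (indexed by $k$) with TANDEM-STRAIGHT is defined as follows and denoted by $\Theta^{(k)}(\nu, \tau)$.

$$\Theta^{(k)}(\nu, \tau) = \left\{ f_0^{(k)}\left(t^{(k)}(\tau)\right),\; a^{(k)}\left(t^{(k)}(\tau)\right),\; P^{(k)}\left(f^{(k)}(\nu), t^{(k)}(\tau)\right),\; f^{(k)}(\nu),\; t^{(k)}(\tau) \right\}, \qquad (1)$$

Here $f_0^{(k)}$ is the fundamental frequency (F0), $a^{(k)}$ is a vector of aperiodicity indices (*1), and $P^{(k)}$ is the STRAIGHT spectrum, a smooth time-frequency representation. In addition, the mapping $t^{(k)}(\tau)$ from the abstract time $\tau$ to the time axis of the example and the mapping $f^{(k)}(\nu)$ from the abstract frequency $\nu$ to the frequency axis of the example are also elements of the parameter set $\Theta^{(k)}$ to be morphed. The TANDEM-STRAIGHT analysis results $f_0^{(k)}$ and $a^{(k)}$ are functions of $t^{(k)}(\tau)$, and $P^{(k)}$ is a function of $t^{(k)}(\tau)$ and $f^{(k)}(\nu)$.

Morphing among $N$ examples requires a formulation suited to the nature of each function. Three kinds of functions are considered here:
( 1 ) functions with no restriction on their range;
( 2 ) functions whose range is positive;
( 3 ) monotonically increasing functions.
Each is described in its own section below.

3.1 No restriction on the range of the function (subscript: m1)
Let $g^{(k)}(t^{(k)}(\tau))$ be a complex-valued function to be morphed, obtained for example $k$. The function $g_{m1}(t_{m3}(\tau))$ obtained by morphing among $N$ examples is defined by the following equation (*2).

$$g_{m1}(t_{m3}(\tau)) = \sum_{k=1}^{N} w^{(k)}(t^{(k)}(\tau))\, g^{(k)}(t^{(k)}(\tau)), \qquad (2)$$

where $w^{(k)}(t^{(k)}(\tau))$ is the contribution of the k-th example. When morphing is defined as interpolation, the following constraint is added.

$$\sum_{k=1}^{N} w^{(k)}(t^{(k)}(\tau)) = 1. \qquad (3)$$

This constraint is convenient for ordinary morphing, but it is not essential to morphing. In applications such as content production, dropping the constraint allows freer expression. The morphed time axis $t_{m3}(\tau)$ is defined in the section on monotonically increasing functions below.

*1 Here the elements are the frequency of the inflection point of a sigmoid and a coefficient representing the slope at the inflection point. In practice, the frequency of the inflection point is expressed as a log frequency.

*2 The function subscripts "m1" and "m3" indicate the nature of the function being morphed.

3.2 The range of the function is positive (subscript: m2)
Let $g^{(k)}(t^{(k)}(\tau)) > 0$ be a real-valued function to be morphed, obtained for example $k$. The function $g_{m2}(t_{m3}(\tau))$ obtained by morphing among $N$ examples is defined by the following equation.

$$g_{m2}(t_{m3}(\tau)) = \exp\left(\sum_{k=1}^{N} w^{(k)}(t^{(k)}(\tau)) \log\left(g^{(k)}(t^{(k)}(\tau))\right)\right) = \prod_{k=1}^{N} \left(g^{(k)}(t^{(k)}(\tau))\right)^{w^{(k)}(t^{(k)}(\tau))}, \qquad (4)$$

The function obtained in this way satisfies the condition $g_{m2}(t_{m3}(\tau)) > 0$, just as each constituent example does.

3.3 The function is monotonically increasing (subscript: m3)
Let $g^{(k)}(\tau)$ be a monotonically increasing function to be morphed, obtained for example $k$. A monotonically increasing function satisfies $\frac{dg^{(k)}(\tau)}{d\tau} > 0$ for any $\tau$. The function $g_{m3}(\tau)$ obtained by morphing among $N$ examples is defined by the following equation.

$$g_{m3}(\tau) = \int_0^{\tau} \exp\left(\sum_{k=1}^{N} w^{(k)}(\xi) \log\left(\frac{dg^{(k)}(\xi)}{d\xi}\right)\right) d\xi = \int_0^{\tau} \prod_{k=1}^{N} \left(\frac{dg^{(k)}(\xi)}{d\xi}\right)^{w^{(k)}(\xi)} d\xi, \qquad (5)$$

The function obtained in this way satisfies the condition $\frac{dg_{m3}(\tau)}{d\tau} > 0$, just as each constituent example does. For $N = 2$ it coincides with the formulation of "temporally variable multi-aspect morphing" reported previously [8].

Because TANDEM-STRAIGHT is based on the discrete Fourier transform, the frequency axis $f_{m3}(\nu)$ obtained by morphing is subject to the boundary conditions $f_{m3}(0) = 0$ and $f_{m3}(1) = \frac{f_s}{2}$ ($f_s$ is the sampling frequency). The domain of the function is taken to be $[0, 1]$ here. In this case, the value at the boundary obtained from Eq. (5) is used to rescale the range so that the boundary conditions are satisfied (*3). When the contributions of the examples are not functions of time but constants given for each example, it is probably easier for users to understand if the duration of the morphed result is the weighted average of the durations of the examples. In this case as well, the value at the boundary obtained from Eq. (5) is used to rescale the range so that the boundary condition is satisfied.

3.4 Classification of TANDEM-STRAIGHT parameters
The set of parameters obtained by TANDEM-STRAIGHT ...

*3 Eq. (5) allows the contribution of each example to be a function of the abstract frequency $\nu$. Strictly speaking, this formulation therefore represents "frequency-variable, temporally variable, multi-aspect, multi-speaker morphing". In the present implementation, however, the contributions are functions of the abstract time $\tau$ only.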


[Annotations: abstract time; abstract frequency; morphing entity]

( 1 ) time increases monotonically
( 2 ) frequency increases monotonically
( 3 ) time-frequency spectral representation is positive
( 4 ) fundamental frequency is positive
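The four constraints can be written as a small check on sampled parameters. This is a sketch under stated assumptions: the function names and the data layout (lists of samples, and a list of frames for the spectrum) are hypothetical and do not reflect the TANDEM-STRAIGHT data structures.

```python
def is_strictly_increasing(samples):
    """True when every sample is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


def is_positive(samples):
    """True when every sample is greater than zero."""
    return all(value > 0 for value in samples)


def violated_constraints(time_axis, frequency_axis, spectrum, f0):
    """Return the names of the constraints that the parameters violate.

    spectrum is a list of frames, each a list of spectral values.
    f0 is assumed to hold voiced frames only (an assumption of this sketch).
    """
    checks = [
        ("time increases monotonically", is_strictly_increasing(time_axis)),
        ("frequency increases monotonically", is_strictly_increasing(frequency_axis)),
        (
            "time-frequency spectral representation is positive",
            all(is_positive(frame) for frame in spectrum),
        ),
        ("fundamental frequency is positive", is_positive(f0)),
    ]
    return [name for name, satisfied in checks if not satisfied]


good = violated_constraints(
    [0.0, 0.005, 0.010],
    [0.0, 100.0, 200.0],
    [[1e-3, 2e-3, 1e-3]] * 3,
    [120.0, 121.5, 123.0],
)
bad = violated_constraints(
    [0.0, 0.010, 0.005],
    [0.0, 100.0, 200.0],
    [[1e-3, 0.0, 1e-3]],
    [120.0, 121.5, 123.0],
)
```

A check of this kind is useful after extrapolation, where a naive weighted sum is most likely to break one of the constraints.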


No constraint case

$$g_{m1}(t_{m3}(\tau)) = \sum_{k=1}^{N} w^{(k)}(t^{(k)}(\tau))\, g^{(k)}(t^{(k)}(\tau)) \qquad (2)$$

$$\sum_{k=1}^{N} w^{(k)}(t^{(k)}(\tau)) = 1 \qquad (3)$$

[Annotations: $k$, index of case; $N$, number of cases; $g_{m1}$, morphed parameter (a function); $w^{(k)}$, weight; $g^{(k)}$, speech parameter; constraint (3), not always necessary.]

positivity constraint

$$g_{m2}(t_{m3}(\tau)) = \exp\left(\sum_{k=1}^{N} w^{(k)}(t^{(k)}(\tau)) \log\left(g^{(k)}(t^{(k)}(\tau))\right)\right) = \prod_{k=1}^{N} \left(g^{(k)}(t^{(k)}(\tau))\right)^{w^{(k)}(t^{(k)}(\tau))} \qquad (4)$$

monotonicity constraint

$$g_{m3}(\tau) = \int_0^{\tau} \exp\left(\sum_{k=1}^{N} w^{(k)}(\xi) \log\left(\frac{dg^{(k)}(\xi)}{d\xi}\right)\right) d\xi = \int_0^{\tau} \prod_{k=1}^{N} \left(\frac{dg^{(k)}(\xi)}{d\xi}\right)^{w^{(k)}(\xi)} d\xi \qquad (5)$$

[Annotations: $k$, index of case; $N$, number of cases; $g_{m3}$, morphed attribute (a function); $w^{(k)}$, weight; $g^{(k)}$, speech attribute; $\tau$, abstract parameter.]

Implementation: piece-wise linear function

$$t^{(k)}(\tau) = \left(p^{(k)}(\tau_{n+1}) - p^{(k)}(\tau_n)\right)(\tau - \tau_n) + p^{(k)}(\tau_n) \qquad (8)$$

$$t_{m3}(\tau) = \left(p_m(\tau_{n+1}) - p_m(\tau_n)\right)(\tau - \tau_n) + p_m(\tau_n) \qquad (11)$$

$$p_m(\tau_n) = \prod_{k=1}^{K} \left(p^{(k)}(\tau_n) - p^{(k)}(\tau_{n-1})\right)^{\bar{w}_{Tx}^{(k)}(\tau_n)} + p_m(\tau_{n-1}) \qquad (12)$$

[Annotations: $p^{(k)}(\tau_n)$, value at an anchor; $\tau_n$, anchor location; $n$, ID of the anchor; $t^{(k)}(\tau)$, time axis of an example; $k$, ID of the example; $t_{m3}(\tau)$, morphed time axis; $p_m(\tau_n)$, value at morphed location.]
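The piecewise linear implementation can be sketched as follows. This is a minimal illustration, not the author's Matlab code: anchors are stored in a 0-based list so that anchor $n$ sits at abstract time $\tau_n = n$, the weights are taken as constants rather than the per-anchor $\bar{w}_{Tx}^{(k)}(\tau_n)$, and the function names and anchor values are hypothetical.

```python
import math


def example_time(anchors, tau):
    """Map abstract time tau onto a piecewise linear time axis (cf. Eq. (8)).

    anchors[n] is the anchor value at tau_n = n, with anchors[0] == 0.
    """
    n = min(int(math.floor(tau)), len(anchors) - 2)
    return (anchors[n + 1] - anchors[n]) * (tau - n) + anchors[n]


def morphed_anchors(anchor_sets, weights):
    """Accumulate weighted geometric means of segment lengths (cf. Eq. (12))."""
    n_anchors = len(anchor_sets[0])
    morphed = [0.0]
    for n in range(1, n_anchors):
        log_length = sum(
            w * math.log(anchors[n] - anchors[n - 1])
            for anchors, w in zip(anchor_sets, weights)
        )
        morphed.append(morphed[-1] + math.exp(log_length))
    return morphed


# Two examples with the same three segments but different timing (seconds).
anchors_a = [0.0, 0.1, 0.4, 0.5]
anchors_b = [0.0, 0.4, 0.5, 0.9]

halfway = morphed_anchors([anchors_a, anchors_b], [0.5, 0.5])
only_a = morphed_anchors([anchors_a, anchors_b], [1.0, 0.0])

# Time on the morphed axis halfway through the second segment (cf. Eq. (11)).
mid_second_segment = example_time(halfway, 1.5)
```

Because each segment length is a product of positive lengths raised to real powers, the morphed anchors stay in increasing order for any choice of weights, including weights used for extrapolation.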



GUI for generalized morphing preparation

Matlab

November, 2013, APSIPA, Taiwan

Matlab





Making speech tangible

Thank you!

Questions/comments, please!

Life

Vertebrates

Intelligence

Language Univ. Turing machine

Speech recognition

context: M. L.

Everyone probably has the layer to which Srinivasa Aiyangar Ramanujan had access

mathematics

Ramanujan BMI?

http://commons.wikimedia.org/wiki/File:Srinivasa_Ramanujan_-_OPC_-_1.jpg

Life

Vertebrates

Intelligence


Speech recognition

context: M. L.

mathematics

Terminator

Life

Vertebrates

Intelligence


Speech recognition

context: M. L.

Everyone probably has the layer to which Wolfgang Amadeus Mozart had access

mathematics

Mozart BMI?

https://commons.wikimedia.org/wiki/File:Wolfgang-amadeus-mozart_1.jpg

Phase and timbre: 50 Hz

Phase and timbre: 400 Hz
