
Time-Frequency Analysis of the Shepard Tone

A Senior Project submitted to
The Division of Science, Mathematics, and Computing
of
Bard College

by
Shun-Yang Lee

Annandale-on-Hudson, New York
April, 2010

Abstract

This project focuses on the time-frequency analysis of the Shepard Tone. The main techniques used in this project are the Fourier Transform and other related mathematical tools. From the analysis we are able to state the general properties of the Shepard Tone, which enable us to reconstruct our own Shepard Tones. We conclude that the choice of envelope functions, the spacing between sound threads, and whether or not the sound threads are parallel determine the degree of illusiveness of a Shepard Tone.

Contents

Abstract

Dedication

Acknowledgments

1 Introduction

2 Background Research
    2.1 Orthogonal Systems of Functions
    2.2 Fourier Series
        2.2.1 Dirichlet’s Conditions
    2.3 Fourier Transform
        2.3.1 Linearity Property
        2.3.2 Time-differentiation Property
        2.3.3 Time-shift Property
        2.3.4 Frequency-shift Property
        2.3.5 Fourier Transform Pair
        2.3.6 Symmetry Property
    2.4 Dirichlet’s Conditions for the Fourier Integral
    2.5 Convolution
        2.5.1 Convolution in time
        2.5.2 Convolution in frequency
    2.6 Gibbs Phenomenon
    2.7 Discrete Fourier Transform
    2.8 Aliasing
    2.9 Windowed Fourier Transform
    2.10 Physics of Sound

3 Preliminary Investigation of the Shepard Tone
    3.1 Using SoundRuler
    3.2 Using Tone Analyzer
        3.2.1 About Tone Analyzer
    3.3 Observations
    3.4 Some Explanations

4 Reconstructing the Shepard Tone
    4.1 Method
    4.2 Choice of Envelope
    4.3 Setup 1: Parallel Threads Equally Spaced in Time
        4.3.1 Octaves
        4.3.2 Major Sevenths (M7)
        4.3.3 Minor Sevenths (m7)
        4.3.4 Major Sixths (M6)
        4.3.5 Minor Sixths (m6)
        4.3.6 Perfect Fifths (P5)
        4.3.7 Augmented Fourths (A4)
        4.3.8 Perfect Fourths (P4)
        4.3.9 Major Thirds (M3)
        4.3.10 Minor Thirds (m3)
        4.3.11 Major Seconds (M2)
        4.3.12 Minor Seconds (m2)
        4.3.13 Observations
    4.4 Setup 2: Parallel VS Non-Parallel Threads
        4.4.1 Case 1: Parallel Sound Threads
        4.4.2 Case 2: Non-Parallel Sound Threads A
        4.4.3 Case 3: Non-Parallel Sound Threads B
        4.4.4 Observations

5 Conclusion

6 Future Work

A Tone Analyzer

B Shepard Tone Generator

Bibliography

List of Figures

2.6.1 Gibbs phenomenon (from http://cnx.org/content/m28717/latest/hv12.jpg)

3.1.1 The 48.8’th Second
3.1.2 The 49.2’th Second
3.1.3 The 49.6’th Second
3.1.4 The 50.0’th Second
3.1.5 The 50.4’th Second
3.1.6 The 50.8’th Second
3.2.1 Frequency vs Time
3.2.2 Log(Frequency) vs Time
3.2.3 At the 41.1 second

4.2.1 Envelope = $t^{10}(t-50)^6$
4.3.1 Harmonics (from http://www.lamadeguido.com/Image4.gif)
4.3.2 Helix Representation of Pitches (from http://www1.appstate.edu/~kms/classes/psy3203/MusicIllusions/helix.gif)
4.4.1 Case 1: Frequency vs Time
4.4.2 Case 1: 14.7 second
4.4.3 Case 1: 16.8 second
4.4.4 Case 2: Frequency vs Time
4.4.5 Case 2: 14.7 second
4.4.6 Case 2: 16.8 second
4.4.7 Case 3: Frequency vs Time
4.4.8 Case 3: 14.7 second
4.4.9 Case 3: 16.8 second

Dedication

To my family, for encouraging me to explore the world in my own way.

Acknowledgments

I would like to thank my advisor, Cliona Golden, for her guidance on this project and for the countless pieces of advice she has given me, which have been invaluable both inside and outside the classroom. I would like to thank Sam Hsiao and Melvin Chen for being on my board and giving me helpful suggestions. I would also like to express my gratitude to John Halle, without whom I would have never known that the Shepard Tone actually exists. Lastly, I want to thank all my friends at Bard who make my college experience unforgettable.

1 Introduction

An audio illusion is a sound that confuses the listener’s auditory mechanisms and so leads to illusory perceptions. Some audio illusions manipulate the harmonics of certain fundamental frequencies; some illusions make use of two or more independent audio channels to create illusory effects. Examples of audio illusions include the Glissando Illusion, the Scale Illusion, the Octave Illusion, the Tritone Paradox, and the Shepard Tone. This project focuses on the Shepard Tone and explores its properties.

The term Shepard Tone often refers to the audio illusion in which different sound threads are superposed an octave, or several octaves, apart. In this project we consider the generalized Shepard Tone, in which the sound threads may be any interval apart.

Chapter 2 discusses the Fourier Transform and other related mathematical tools that

are used in this project to perform a time-frequency analysis of the Shepard Tone.

Chapter 3 describes our investigations of the Shepard Tone. We first utilize the commercial software SoundRuler to perform an initial analysis of James Tenney’s Shepard Tone piece ‘For Ann’, and then use Tone Analyzer, written in MATLAB for the purpose of this project, to perform our time-frequency analysis of this Shepard Tone piece.

Chapter 4 discusses the reconstruction of the Shepard Tone. Setup 1 focuses on Shepard

Tones with parallel threads that are equally spaced in time. A distance measure is used

to describe the degree of illusiveness in each of the settings in setup 1. Setup 2 focuses on

Shepard Tones with non-parallel threads. A variance-related measurement is introduced

in order to describe the degree of illusiveness in each of the settings in setup 2.

Chapter 5 lists the properties of the Shepard Tone that we discovered from the investigations in the previous chapters.

Finally, chapter 6 describes some topics for future research.

2 Background Research

In the field of audio signal processing, computers are often utilized to analyze audio signals. We are able to analyze any audio signal by converting the input time-domain function (i.e. the input audio signal) into a frequency-domain function. The Fourier Transform is the standard tool for describing the frequency content of an audio signal. However, as we will see later, it is eventually inadequate for the purposes of this project and it will be replaced by the Short-time Fourier Transform (also called the Windowed Fourier Transform). Nonetheless, the Fourier Transform forms the basis of our toolbox, so we will start by introducing this transform.

2.1 Orthogonal Systems of Functions

In this section we are going to describe the notion of orthogonality. To do this, we first

provide some preliminary definitions on which it relies.

Definition 2.1.1. Let $H$ be a subspace of a vector space $V$. An indexed set of vectors $B = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p\}$ in $V$ is a basis for $H$ if

• $B$ is a linearly independent set, and

• the subspace spanned by $B$ coincides with $H$; that is,

\[ H = \operatorname{Span}\{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p\}. \]

Definition 2.1.2. The inner product of two vectors $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ and $\mathbf{b} = [b_1, b_2, \ldots, b_n]$ in $\mathbb{R}^n$ is defined as

\[ \langle \mathbf{a}, \mathbf{b} \rangle = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n. \]

Definition 2.1.3. A set of vectors $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_p\}$ in $\mathbb{R}^n$, with $p \le n$, is said to be an orthogonal set if each pair of distinct vectors from the set is orthogonal, that is, if

\[ \langle \mathbf{u}_i, \mathbf{u}_j \rangle = 0 \quad \text{whenever } i \neq j. \]

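Here we cite a theorem from [3, P.384].

Theorem 2.1.4. If $S = \{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_p\}$ is an orthogonal set of nonzero vectors in $\mathbb{R}^n$, then $S$ is linearly independent and hence is a basis for the subspace spanned by $S$.

Proof. If $\mathbf{0} = c_1\mathbf{u}_1 + \cdots + c_p\mathbf{u}_p$ for some scalars $c_1, c_2, \ldots, c_p$, then

\begin{align}
0 = \mathbf{0} \cdot \mathbf{u}_1 &= (c_1\mathbf{u}_1 + c_2\mathbf{u}_2 + \cdots + c_p\mathbf{u}_p) \cdot \mathbf{u}_1 \tag{2.1.1} \\
&= (c_1\mathbf{u}_1) \cdot \mathbf{u}_1 + (c_2\mathbf{u}_2) \cdot \mathbf{u}_1 + \cdots + (c_p\mathbf{u}_p) \cdot \mathbf{u}_1 \tag{2.1.2} \\
&= c_1(\mathbf{u}_1 \cdot \mathbf{u}_1) + c_2(\mathbf{u}_2 \cdot \mathbf{u}_1) + \cdots + c_p(\mathbf{u}_p \cdot \mathbf{u}_1) \tag{2.1.3} \\
&= c_1(\mathbf{u}_1 \cdot \mathbf{u}_1), \tag{2.1.4}
\end{align}

since $\mathbf{u}_1$ is orthogonal to $\mathbf{u}_2, \mathbf{u}_3, \ldots, \mathbf{u}_p$. Since $\mathbf{u}_1$ is nonzero, $\mathbf{u}_1 \cdot \mathbf{u}_1$ is not zero and so $c_1$ must be zero. Similarly, $c_2, c_3, \ldots, c_p$ must be zero. Therefore $S$ is linearly independent and is a basis for the subspace spanned by $S$.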


The set of Riemann-integrable real-valued functions defines a vector space. We can

define an inner product on this space as follows.

Definition 2.1.5. If $f$ and $g$ are Riemann-integrable, real-valued functions that are defined on $[a, b] \subset \mathbb{R}$, then the integral

\[ \int_a^b f(x)\,g(x)\,dx \]

defines the inner product of $f$ and $g$, denoted by $\langle f, g \rangle$.

The non-negative number $\langle f, f \rangle^{1/2}$, denoted by $\|f\|$, is the norm of $f$.

From this, we get the definition of orthogonality which can now be stated for a countably

infinite set of functions.

Definition 2.1.6. Two functions $f$ and $g$ are said to be orthogonal if $\langle f, g \rangle = 0$. Similarly, let $S = \{f_1, f_2, f_3, \ldots\}$ be a collection of Riemann-integrable functions on $[a, b]$. Then $S$ is said to be an orthogonal system on $[a, b]$ if

\[ \langle f_m, f_n \rangle = 0 \quad \text{whenever } m \neq n. \]

Furthermore, if $f_n$ has norm 1 for each $n$, then the system is an orthonormal system on $[a, b]$.

2.2 Fourier Series

The simplest form of the Fourier Transform is the Fourier Series, in which we consider as inputs only time-domain functions that are periodic with period 1, restricted to $[0, 1]$. We will proceed with a special trigonometric system $S = \{\varphi_0, \varphi_1, \ldots\}$ for which

\[ \varphi_0(x) = 1, \qquad \varphi_{2n-1}(x) = \sqrt{2}\cos 2\pi n x, \qquad \varphi_{2n}(x) = \sqrt{2}\sin 2\pi n x, \qquad n = 1, 2, \ldots \]

We will prove that $S$ is an orthonormal system on the interval $[0, 1]$.


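Proof. • First we show that $\langle \varphi_{2m}, \varphi_{2n} \rangle = 0$ whenever $m \neq n$. Using the identity $\sin A \sin B = -\tfrac{1}{2}\left[\cos(A+B) - \cos(A-B)\right]$,

\begin{align}
\langle \varphi_{2m}, \varphi_{2n} \rangle &= \int_0^1 \sqrt{2}\sin(2\pi m x)\,\sqrt{2}\sin(2\pi n x)\,dx \notag \\
&= 2\int_0^1 -\frac{1}{2}\left[\cos(2\pi(m+n)x) - \cos(2\pi(m-n)x)\right]dx \tag{2.2.1} \\
&= -\left[\int_0^1 \cos(2\pi(m+n)x)\,dx - \int_0^1 \cos(2\pi(m-n)x)\,dx\right] \tag{2.2.2} \\
&= -\left[\frac{\sin(2\pi(m+n)x)}{2\pi(m+n)} - \frac{\sin(2\pi(m-n)x)}{2\pi(m-n)}\right]_0^1 \tag{2.2.3} \\
&= 0, \tag{2.2.4}
\end{align}

since $m + n$ and $m - n$ are nonzero integers.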

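• Similarly, it can be shown that

\begin{align}
\langle \varphi_{2m-1}, \varphi_{2n-1} \rangle &= \int_0^1 \sqrt{2}\cos(2\pi m x)\,\sqrt{2}\cos(2\pi n x)\,dx \tag{2.2.5} \\
&= 0 \quad \text{whenever } m \neq n, \tag{2.2.6} \\
\langle \varphi_{2m}, \varphi_{2n-1} \rangle &= \int_0^1 \sqrt{2}\sin(2\pi m x)\,\sqrt{2}\cos(2\pi n x)\,dx \tag{2.2.7} \\
&= 0 \quad \text{for all } m \text{ and } n. \tag{2.2.8}
\end{align}

For $n = 1, 2, \ldots$, we have the following.

\begin{align}
\|\varphi_{2n}\|^2 = \langle \varphi_{2n}, \varphi_{2n} \rangle &= \int_0^1 \sqrt{2}\sin(2\pi n x)\,\sqrt{2}\sin(2\pi n x)\,dx \tag{2.2.9} \\
&= 1, \tag{2.2.10} \\
\|\varphi_{2n-1}\|^2 = \langle \varphi_{2n-1}, \varphi_{2n-1} \rangle &= \int_0^1 \sqrt{2}\cos(2\pi n x)\,\sqrt{2}\cos(2\pi n x)\,dx \tag{2.2.11} \\
&= 1. \tag{2.2.12}
\end{align}

We also know that

\[ \langle \varphi_0, \varphi_0 \rangle = \int_0^1 1^2\,dx = 1, \]

and that $\langle \varphi_0, \varphi_k \rangle = 0$ for every $k \ge 1$, because sine and cosine integrate to zero over a whole number of periods.

Therefore

\[ \langle \varphi_{2m}, \varphi_{2n} \rangle = \langle \varphi_{2m-1}, \varphi_{2n-1} \rangle = \delta_{m,n} \]

for $n, m = 1, 2, \ldots$, and

\[ \|\varphi_{2m}\| = \|\varphi_{2n-1}\| = 1 \]

for $m = 0, 1, 2, \ldots$, $n = 1, 2, \ldots$.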

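• The equations above constitute the orthogonality relations for the $\varphi_n$’s, and show that the set of functions

\[ \varphi_0(x) = 1, \qquad \varphi_{2n-1}(x) = \sqrt{2}\cos 2\pi n x, \qquad \varphi_{2n}(x) = \sqrt{2}\sin 2\pi n x, \qquad n = 1, 2, \ldots \]

is an orthonormal system.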
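The orthonormality relations can also be checked numerically. The following is a minimal sketch in Python rather than MATLAB; the helper names `phi` and `inner`, the midpoint rule, and the choice of the first five functions are our own assumptions for illustration. It approximates each inner product on $[0, 1]$ and collects the results in a Gram matrix, which should be close to the identity.

```python
import math

def phi(k, x):
    """k-th function of the trigonometric system on [0, 1]."""
    if k == 0:
        return 1.0
    n = (k + 1) // 2                      # frequency index n = 1, 2, ...
    if k % 2 == 1:                        # odd index: phi_{2n-1} = sqrt(2) cos(2 pi n x)
        return math.sqrt(2) * math.cos(2 * math.pi * n * x)
    return math.sqrt(2) * math.sin(2 * math.pi * n * x)   # even index: phi_{2n}

def inner(j, k, steps=2000):
    """Midpoint-rule approximation of <phi_j, phi_k> on [0, 1]."""
    h = 1.0 / steps
    return h * sum(phi(j, (i + 0.5) * h) * phi(k, (i + 0.5) * h)
                   for i in range(steps))

# Gram matrix of phi_0, ..., phi_4; orthonormality means it is the identity.
gram = [[inner(j, k) for k in range(5)] for j in range(5)]
```

Each diagonal entry of `gram` approximates a squared norm and each off-diagonal entry approximates an inner product of two distinct functions.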

Given $f$ in $C[0, 1]$, where $C[0, 1]$ denotes the set of all continuous functions on the interval $[0, 1]$, we can find the $n$-th order Fourier approximation to $f$ on $[0, 1]$ by calculating the orthogonal projection of $f$ onto the span of $\varphi_0, \varphi_1, \ldots, \varphi_{2n}$, where

\[ \varphi_0(x) = 1, \qquad \varphi_{2n-1}(x) = \sqrt{2}\cos 2\pi n x, \qquad \varphi_{2n}(x) = \sqrt{2}\sin 2\pi n x, \qquad n = 1, 2, \ldots. \]

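The Fourier coefficients of $f$ are given by

\begin{align}
a_0 &= \langle f, \varphi_0 \rangle = \int_0^1 f(t)\,dt, \tag{2.2.13} \\
a_k &= \langle f, \varphi_{2k-1} \rangle = \int_0^1 f(t)\sqrt{2}\cos(2\pi k t)\,dt, \quad k \ge 1, \tag{2.2.14} \\
b_k &= \langle f, \varphi_{2k} \rangle = \int_0^1 f(t)\sqrt{2}\sin(2\pi k t)\,dt, \quad k \ge 1. \tag{2.2.15}
\end{align}

The Fourier Series of $f$ in $C[0, 1]$ is then defined by

\[ f(t) = a_0 + \sum_{m=1}^{\infty} \left\{ a_m\sqrt{2}\cos 2\pi m t + b_m\sqrt{2}\sin 2\pi m t \right\}, \tag{2.2.16} \]

where $a_0$, $a_m$ and $b_m$, $m = 1, 2, \ldots$, are the Fourier Coefficients.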
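As an illustration of equations (2.2.13) to (2.2.16), the sketch below computes the coefficients of a sample function numerically and evaluates a partial sum. The test function $f(t) = t(1-t)$, the midpoint rule, and the names `integrate`, `coefficients` and `partial_sum` are assumptions made for this example only. For this particular $f$ the exact values are $a_0 = 1/6$, $a_k = -1/(\sqrt{2}\,\pi^2 k^2)$ and $b_k = 0$, which gives a way to sanity-check the output.

```python
import math

def f(t):
    return t * (1.0 - t)          # hypothetical test function, continuous on [0, 1]

def integrate(g, steps=4000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / steps
    return h * sum(g((i + 0.5) * h) for i in range(steps))

def coefficients(f, order):
    """Fourier coefficients a_0, a_1..a_order, b_1..b_order as in (2.2.13)-(2.2.15)."""
    a0 = integrate(f)
    a = [integrate(lambda t, k=k: f(t) * math.sqrt(2) * math.cos(2 * math.pi * k * t))
         for k in range(1, order + 1)]
    b = [integrate(lambda t, k=k: f(t) * math.sqrt(2) * math.sin(2 * math.pi * k * t))
         for k in range(1, order + 1)]
    return a0, a, b

def partial_sum(a0, a, b, t):
    """Truncation of the series (2.2.16) after len(a) terms."""
    s = a0
    for m, (am, bm) in enumerate(zip(a, b), start=1):
        s += am * math.sqrt(2) * math.cos(2 * math.pi * m * t)
        s += bm * math.sqrt(2) * math.sin(2 * math.pi * m * t)
    return s

a0, a, b = coefficients(f, order=10)
approx = partial_sum(a0, a, b, 0.3)   # compare with f(0.3) = 0.21
```

Increasing `order` brings `approx` closer to `f(0.3)`, in line with the convergence statement of Dirichlet’s conditions.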


The Fourier Series can also be written in exponential form, using Euler’s formula

\[ e^{i\theta} = \cos(\theta) + i\sin(\theta), \]

where $i$ is $\sqrt{-1}$.

Therefore, the exponential form of the Fourier Series expansion of $f$ is

\[ f(t) = \sum_{n=-\infty}^{\infty} F_n e^{i 2\pi n t}, \tag{2.2.17} \]

where

\[ F_n = F(n) = \int_0^1 f(t)\,e^{-i 2\pi n t}\,dt. \]
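The link between the exponential coefficients $F_n$ and the trigonometric coefficients of (2.2.13) to (2.2.15) can be made explicit. The short derivation below is our own worked step, obtained by substituting Euler’s formula into the definition of $F_n$. For $n \ge 1$,

\[
F_n = \int_0^1 f(t)\left[\cos(2\pi n t) - i\sin(2\pi n t)\right]dt = \frac{a_n - i\,b_n}{\sqrt{2}},
\qquad
F_{-n} = \frac{a_n + i\,b_n}{\sqrt{2}},
\qquad
F_0 = a_0,
\]

so that each pair of terms in (2.2.17) recombines into one term of (2.2.16):

\[
F_n e^{i 2\pi n t} + F_{-n} e^{-i 2\pi n t} = a_n\sqrt{2}\cos 2\pi n t + b_n\sqrt{2}\sin 2\pi n t.
\]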

We can extend this definition to functions with period $T$ as follows:

\[ f(t) = \sum_{n=-\infty}^{\infty} F_n e^{i \frac{2\pi}{T} n t}, \tag{2.2.18} \]

where the integral is now taken over one full period,

\[ F_n = F(n) = \frac{1}{T}\int_0^T f(t)\,e^{-i \frac{2\pi}{T} n t}\,dt. \]

2.2.1 Dirichlet’s Conditions

It turns out that functions which satisfy certain conditions will have a convergent Fourier

Series expansion. Citing from [1, P.286], the conditions known as Dirichlet’s conditions

are stated in the following theorem.

Theorem 2.2.1. Dirichlet’s Conditions

If f(t) is a bounded, periodic function that in any period has

• a finite number of isolated maxima and minima, and

• a finite number of points of finite discontinuity

then the Fourier Series expansion of f(t) converges to f(t) at all points where f(t) is

continuous and to the average of the right- and left-hand limits of f(t) at points where

f(t) is discontinuous.


2.3 Fourier Transform

For input audio signals that are not periodic, the more general Fourier Transform is often used. The Fourier Transform of a time-domain function $f(t)$ is defined as

\[ (\mathcal{F}f)(\xi) = F(\xi) = \int_{-\infty}^{\infty} f(t)\,e^{-i 2\pi \xi t}\,dt. \tag{2.3.1} \]

We can represent a time-domain input function in terms of frequency-domain functions by the following equation:

\[ f(t) = \int_{-\infty}^{\infty} F(\xi)\,e^{i 2\pi \xi t}\,d\xi. \tag{2.3.2} \]

The Fourier Transform has the following properties:

• Linearity Property

• Time-differentiation Property

• Time-shift Property

• Frequency-shift Property

• Symmetry Property

which are explained below.

2.3.1 Linearity Property

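If $f(t)$ and $g(t)$ are functions having Fourier Transforms $F(\xi)$ and $G(\xi)$, respectively, and if $\alpha$ and $\beta$ are constants, then

\[ (\mathcal{F}(\alpha f + \beta g))(\xi) = \alpha(\mathcal{F}f)(\xi) + \beta(\mathcal{F}g)(\xi) = \alpha F(\xi) + \beta G(\xi). \]

Proof.

\begin{align}
(\mathcal{F}(\alpha f + \beta g))(\xi) &= \int_{-\infty}^{\infty} \left[\alpha f(t) + \beta g(t)\right] e^{-i 2\pi \xi t}\,dt \tag{2.3.3} \\
&= \alpha \int_{-\infty}^{\infty} f(t)\,e^{-i 2\pi \xi t}\,dt + \beta \int_{-\infty}^{\infty} g(t)\,e^{-i 2\pi \xi t}\,dt \tag{2.3.4} \\
&= \alpha F(\xi) + \beta G(\xi). \tag{2.3.5}
\end{align}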

2.3.2 Time-differentiation Property

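If the function $f(t)$ has a Fourier Transform $F(\xi)$ then

\[ f(t) = \int_{-\infty}^{\infty} F(\xi)\,e^{i 2\pi \xi t}\,d\xi. \]

If we differentiate $f$ with respect to $t$ then we will get

\[ \frac{df}{dt} = \int_{-\infty}^{\infty} \frac{\partial}{\partial t}\left[F(\xi)\,e^{i 2\pi \xi t}\right]d\xi = \int_{-\infty}^{\infty} (i 2\pi \xi)\,F(\xi)\,e^{i 2\pi \xi t}\,d\xi, \]

which implies that $\frac{df}{dt}$ is the inverse Fourier Transform of $(i 2\pi \xi)F(\xi)$. This means

\[ \mathcal{F}\left\{\frac{df}{dt}\right\} = (i 2\pi \xi)\,F(\xi). \]

By repeatedly taking derivatives we get

\[ \mathcal{F}\left\{\frac{d^n f}{dt^n}\right\} = (i 2\pi \xi)^n F(\xi). \]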

2.3.3 Time-shift Property

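If a function $f(t)$ has Fourier Transform $F(\xi)$ and we set $g(t) = f(t - \tau)$ to be a shifted version of $f(t)$, then

\[ (\mathcal{F}g)(\xi) = e^{-i 2\pi \xi \tau} F(\xi). \]

Proof.

\[ (\mathcal{F}g)(\xi) = \int_{-\infty}^{\infty} g(t)\,e^{-i 2\pi \xi t}\,dt = \int_{-\infty}^{\infty} f(t - \tau)\,e^{-i 2\pi \xi t}\,dt. \]

Let $x = t - \tau$; then we get

\[ (\mathcal{F}g)(\xi) = \int_{-\infty}^{\infty} f(x)\,e^{-i 2\pi \xi (x + \tau)}\,dx = e^{-i 2\pi \xi \tau}\int_{-\infty}^{\infty} f(x)\,e^{-i 2\pi \xi x}\,dx = e^{-i 2\pi \xi \tau} F(\xi). \]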
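The time-shift property is easy to test numerically. The sketch below is an illustration under our own assumptions: it uses the Gaussian $f(t) = e^{-\pi t^2}$ (whose Fourier Transform under convention (2.3.1) is $e^{-\pi \xi^2}$), truncates the integral to $[-8, 8]$, and uses a midpoint rule. The function name `fourier` and the values of `tau` and `xi` are hypothetical choices.

```python
import cmath
import math

def fourier(f, xi, lo=-8.0, hi=8.0, steps=4000):
    """Midpoint-rule approximation of the integral of f(t) e^{-i 2 pi xi t} dt."""
    h = (hi - lo) / steps
    total = 0j
    for i in range(steps):
        t = lo + (i + 0.5) * h
        total += f(t) * cmath.exp(-2j * math.pi * xi * t)
    return total * h

def f(t):
    return math.exp(-math.pi * t * t)    # Gaussian, negligible outside [-8, 8]

tau = 0.75                               # hypothetical shift

def g(t):
    return f(t - tau)                    # shifted version of f

xi = 0.6                                 # hypothetical test frequency
lhs = fourier(g, xi)                                          # (F g)(xi)
rhs = cmath.exp(-2j * math.pi * xi * tau) * fourier(f, xi)    # e^{-i 2 pi xi tau} F(xi)
```

The two complex numbers `lhs` and `rhs` agree up to numerical error, and `fourier(f, xi)` itself is close to `math.exp(-math.pi * xi * xi)`.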

2.3.4 Frequency-shift Property

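If the function $f(t)$ has Fourier Transform $F(\xi)$, then the Fourier Transform of $g(t) = e^{i 2\pi \xi_0 t} f(t)$ is

\[ (\mathcal{F}(e^{i 2\pi \xi_0 t} f))(\xi) = F(\xi - \xi_0). \]

Proof.

\[ (\mathcal{F}g)(\xi) = \int_{-\infty}^{\infty} e^{i 2\pi \xi_0 t} f(t)\,e^{-i 2\pi \xi t}\,dt = \int_{-\infty}^{\infty} f(t)\,e^{-i 2\pi (\xi - \xi_0) t}\,dt = F(\xi - \xi_0). \]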

2.3.5 Fourier Transform Pair

Recall that the Fourier Transform of a function $f(t)$ is defined by

\[ (\mathcal{F}f)(\xi) = F(\xi) = \int_{-\infty}^{\infty} f(t)\,e^{-i 2\pi \xi t}\,dt, \]

whenever the integral exists. Now we define the inverse Fourier Transform of $G(\xi)$ as

\[ (\mathcal{F}^{-1}G)(t) = g(t) = \int_{-\infty}^{\infty} G(\xi)\,e^{i 2\pi \xi t}\,d\xi. \]

$\{f, \mathcal{F}f\}$ is called a Fourier Transform pair.

2.3.6 Symmetry Property

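We take as given, without proof, that the inverse Fourier Transform is indeed the inverse of the Fourier Transform:

\[ (\mathcal{F}^{-1}F)(t) = f(t) = \int_{-\infty}^{\infty} F(\xi)\,e^{i 2\pi \xi t}\,d\xi. \]

Replacing the dummy variable $\xi$ with $y$ we get

\[ f(t) = \int_{-\infty}^{\infty} F(y)\,e^{i 2\pi y t}\,dy. \]

Therefore,

\[ f(-t) = \int_{-\infty}^{\infty} F(y)\,e^{-i 2\pi y t}\,dy. \]

Now replacing $t$ by $\xi$ we get

\[ f(-\xi) = \int_{-\infty}^{\infty} F(y)\,e^{-i 2\pi \xi y}\,dy. \]

Notice that the right hand side of the above equation is the Fourier Transform of $F(y)$, evaluated at $\xi$. This means, given that

\[ (\mathcal{F}f)(\xi) = F(\xi), \]

we can conclude

\[ (\mathcal{F}^2 f)(\xi) = (\mathcal{F}F)(\xi) = f(-\xi). \]

This can also be written as $(\mathcal{F}^{-1}f)(\xi) = (\mathcal{F}f)(-\xi)$, and is called the symmetry property.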

2.4 Dirichlet’s Conditions for the Fourier Integral

Citing from [1, P.347], a set of conditions for the validity of Fourier integral representations is stated in the following theorem.

Theorem 2.4.1. Dirichlet’s Conditions for the Fourier integral

If the function $f(t)$ is such that

• it is absolutely integrable, so that

\[ \int_{-\infty}^{\infty} |f(t)|\,dt < \infty, \]

and

• it has at most a finite number of maxima and minima and a finite number of discontinuities in any finite interval,

then the Fourier integral representation of $f(t)$ converges to $f(t)$ at all points where $f(t)$ is continuous and to the average of the right- and left-hand limits of $f(t)$ where $f(t)$ is discontinuous.

2.5 Convolution

The concept of convolution is defined as follows.

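Definition 2.5.1. The convolution of two functions $f(t)$ and $g(t)$, denoted by $f * g$, is defined by

\begin{align}
(f * g)(t) &= \int_{-\infty}^{\infty} f(\tau)\,g(t - \tau)\,d\tau \tag{2.5.1} \\
&= \int_{-\infty}^{\infty} f(t - \tau)\,g(\tau)\,d\tau. \tag{2.5.2}
\end{align}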

2.5.1 Convolution in time

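If functions $u(t)$ and $v(t)$ have Fourier Transforms $U(\xi)$ and $V(\xi)$, respectively, then the Fourier Transform of the convolution $y(t) = (u * v)(t)$ is

\[ (\mathcal{F}y)(\xi) = (\mathcal{F}(u * v))(\xi) = (\mathcal{F}(v * u))(\xi) = (UV)(\xi). \]

Proof.

\begin{align}
(\mathcal{F}y)(\xi) = (\mathcal{F}(u * v))(\xi) &= \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} u(\tau)\,v(t - \tau)\,d\tau\right] e^{-i 2\pi \xi t}\,dt \tag{2.5.3} \\
&= \int_{-\infty}^{\infty} u(\tau) \left[\int_{-\infty}^{\infty} e^{-i 2\pi \xi t}\,v(t - \tau)\,dt\right] d\tau. \tag{2.5.4}
\end{align}

Now replacing $t - \tau$ with $z$ we get

\begin{align}
(\mathcal{F}y)(\xi) &= \int_{-\infty}^{\infty} u(\tau) \left[\int_{-\infty}^{\infty} e^{-i 2\pi \xi (z + \tau)}\,v(z)\,dz\right] d\tau \tag{2.5.5} \\
&= \int_{-\infty}^{\infty} u(\tau)\,e^{-i 2\pi \xi \tau}\,d\tau \int_{-\infty}^{\infty} e^{-i 2\pi \xi z}\,v(z)\,dz \tag{2.5.6} \\
&= (UV)(\xi). \tag{2.5.7}
\end{align}

That is, a convolution in the time domain is transformed into a product in the frequency domain.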

2.5.2 Convolution in frequency

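If $(\mathcal{F}u)(\xi) = U(\xi)$ and $(\mathcal{F}v)(\xi) = V(\xi)$, then

\[ (\mathcal{F}(uv))(\xi) = (U * V)(\xi). \]

Proof. Let $u(t) = \int_{-\infty}^{\infty} U(\xi)\,e^{i 2\pi \xi t}\,d\xi$ and $v(t) = \int_{-\infty}^{\infty} V(\xi)\,e^{i 2\pi \xi t}\,d\xi$. Then

\begin{align}
(\mathcal{F}^{-1}(U * V))(t) &= \int_{-\infty}^{\infty} e^{i 2\pi \xi t} \left[\int_{-\infty}^{\infty} U(y)\,V(\xi - y)\,dy\right] d\xi \tag{2.5.8} \\
&= \int_{-\infty}^{\infty} U(y) \left[\int_{-\infty}^{\infty} V(\xi - y)\,e^{i 2\pi \xi t}\,d\xi\right] dy. \tag{2.5.9}
\end{align}

Now replacing $\xi - y$ by $z$ we get

\begin{align}
(\mathcal{F}^{-1}(U * V))(t) &= \int_{-\infty}^{\infty} U(y) \left[\int_{-\infty}^{\infty} V(z)\,e^{i 2\pi (z + y) t}\,dz\right] dy \tag{2.5.10} \\
&= \int_{-\infty}^{\infty} U(y)\,e^{i 2\pi y t}\,dy \int_{-\infty}^{\infty} V(z)\,e^{i 2\pi z t}\,dz \tag{2.5.11} \\
&= (uv)(t). \tag{2.5.12}
\end{align}

Therefore,

\[ (\mathcal{F}(uv))(\xi) = (U * V)(\xi). \]

This means that a multiplication in the time domain is transformed into a convolution in the frequency domain.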
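Both convolution properties have discrete counterparts that can be checked directly on short sequences. The sketch below is a discrete analog and not the continuous statement proved above: it assumes a naive $O(N^2)$ DFT and circular (wrap-around) convolution, and the two test sequences are arbitrary. Note that in the discrete case the second identity picks up a factor $1/N$.

```python
import cmath

def dft(x):
    """Naive discrete Fourier Transform of a sequence x."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def circular_convolve(x, y):
    """Circular convolution: indices wrap around modulo N."""
    N = len(x)
    return [sum(x[m] * y[(n - m) % N] for m in range(N)) for n in range(N)]

# Arbitrary test sequences of equal length.
u = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 3.0]
v = [0.5, 0.0, 1.5, 1.0, 0.0, -2.0, 0.0, 0.0]
U, V = dft(u), dft(v)

# Convolution in time  <->  product in frequency
time_side = dft(circular_convolve(u, v))
freq_side = [a * b for a, b in zip(U, V)]

# Product in time  <->  convolution in frequency (with the discrete 1/N factor)
prod_side = dft([a * b for a, b in zip(u, v)])
conv_side = [c / len(u) for c in circular_convolve(U, V)]
```

Here `time_side` matches `freq_side` and `prod_side` matches `conv_side`, entry by entry, up to rounding.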


2.6 Gibbs Phenomenon

Figure 2.6.1 (from http://cnx.org/content/m28717/latest/hv12.jpg) shows one deficiency of the Fourier Transform, the Gibbs Phenomenon: the overshoot and undershoot that occur near a jump when a time-domain function with discontinuities or sudden changes is represented by finitely many Fourier terms. Capturing the sudden change requires many high-frequency oscillations, which create undesirable artifacts.

Figure 2.6.1. Gibbs phenomenon (from http://cnx.org/content/m28717/latest/hv12.jpg)

2.7 Discrete Fourier Transform

We know that computers deal with discrete data and that, when a recording is made, it is sampled discretely in time (for CD format the sampling rate is 44100 samples per second). We have to use the discrete form of the Fourier Transform to meet our requirements. Citing from [1, P.389], suppose we have a sequence $\{g_n\}$ of $N$ samples drawn from a continuous-time signal $g(t)$ at equal intervals $T$. The discrete Fourier Transform (DFT) pair can be written as

\[ G_k = \sum_{n=0}^{N-1} g_n\,e^{-\frac{2\pi i}{N} k n}, \]

\[ g_n = \frac{1}{N}\sum_{k=0}^{N-1} G_k\,e^{\frac{2\pi i}{N} k n}, \]

where $\{G_k\}$ roughly samples the continuous Fourier Transform at equal intervals $\frac{1}{NT}$.

We see a high degree of similarity by comparing the discrete Fourier Transform with the continuous Fourier Transform (equation 2.3.1). The main difference is that the discrete Fourier Transform replaces the integral of the continuous form with a finite sum, which enables the computer to process sampled audio signals.
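The DFT pair translates almost line by line into code. The following is a minimal sketch with a naive $O(N^2)$ implementation, written in Python for illustration (it is not the Tone Analyzer code); the test signal, a 5 Hz cosine sampled at 64 samples per second, is a hypothetical choice. It shows that bin $k$ corresponds to the frequency $k/(NT)$ and that the inverse transform recovers the samples.

```python
import cmath
import math

def dft(g):
    """G_k = sum_n g_n exp(-2 pi i k n / N)."""
    N = len(g)
    return [sum(g[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(G):
    """g_n = (1/N) sum_k G_k exp(2 pi i k n / N)."""
    N = len(G)
    return [sum(G[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

N, T = 64, 1.0 / 64.0                     # 64 samples, one second of signal
samples = [math.cos(2 * math.pi * 5.0 * n * T) for n in range(N)]

G = dft(samples)
peak_bin = max(range(N // 2), key=lambda k: abs(G[k]))   # search non-negative frequencies
peak_hz = peak_bin / (N * T)              # bin spacing is 1 / (N T)
recovered = idft(G)                       # should reproduce the samples
```

The largest coefficient sits in the bin whose frequency `peak_hz` equals the frequency of the cosine, and `recovered` matches `samples` up to rounding.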

2.8 Aliasing

Sometimes different signals become indistinguishable after being sampled, due to the choice of sampling frequency. This means that, when trying to recover the original signals from the sampled versions, the resulting continuous signals are different from the original signals. For example, if we undersample an audio signal, then the recovered signal may include undesired lower frequency oscillations. This is because, by undersampling, we have failed to capture all higher frequency information in the signal. It turns out that as long as the sampling frequency is no less than a quantity called the Nyquist frequency (more commonly known as the Nyquist rate), the aliasing effect can be avoided. The Nyquist frequency is defined as twice the highest frequency present, i.e.,

\[ f_{\mathrm{Nyq}} = 2 f_{\max}, \]

where $f_{\max}$ is the highest frequency for which the Fourier Transform is nonzero.
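A small numerical experiment makes the aliasing effect concrete. In the sketch below, the sampling frequency of 10 Hz and the two test frequencies are hypothetical values chosen so that the arithmetic is easy to follow: a 7 Hz cosine sampled at 10 Hz produces exactly the same samples as a 3 Hz cosine, because 7 Hz lies above half the sampling frequency.

```python
import math

fs = 10.0                      # hypothetical sampling frequency in Hz
f_high, f_alias = 7.0, 3.0     # 7 Hz exceeds fs / 2, and 7 = fs - 3

# Sample both cosines at the same instants n / fs.
high = [math.cos(2 * math.pi * f_high * n / fs) for n in range(20)]
low = [math.cos(2 * math.pi * f_alias * n / fs) for n in range(20)]

# Largest difference between the two sample sequences (zero up to rounding).
max_gap = max(abs(a - b) for a, b in zip(high, low))

def min_sampling_frequency(f_max):
    """Sampling frequency 2 * f_max required by the criterion in the text."""
    return 2.0 * f_max
```

Since the two sample sequences coincide, no reconstruction method can tell the 7 Hz signal from the 3 Hz one; sampling at `min_sampling_frequency(7.0)` or above removes the ambiguity.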

2.9 Windowed Fourier Transform

Sometimes we are interested in the behavior of functions when localized in time, i.e., we would like to know the behavior of functions over some very short period of time. For this purpose we have the Windowed Fourier Transform, or Short-time Fourier Transform. The Windowed Fourier Transform is defined as

\[ X(\tau, \xi) = \int_{-\infty}^{\infty} x(t)\,W(t - \tau)\,e^{-i 2\pi \xi t}\,dt, \]

where $W(t)$ is the window function, which could be, for example, a Rectangle, Hamming, Gaussian, Hanning, or Blackman window, and $x(t)$ is the input time-domain function that is to be transformed. We can interpret $X(\tau, \xi)$ as the Fourier Transform of the windowed function $W(t - \tau)\,x(t)$.

We can measure the energy of the function in a certain time-frequency neighborhood. One way to measure the energy density in some time-frequency neighborhood is the spectrogram, which is denoted $P_S$ and is defined as

\[ (P_S x)(\tau, \xi) = |X(\tau, \xi)|^2 = \left|\int_{-\infty}^{\infty} x(t)\,W(t - \tau)\,e^{-i 2\pi \xi t}\,dt\right|^2. \]

In practice, for functions that have been sampled, we apply the discrete Fourier Transform, rather than the Fourier Transform, along with a discrete window.
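The discrete version of the spectrogram can be sketched in a few lines. The code below is an illustrative Python sketch and not the Tone Analyzer itself; the frame length `M`, the hop size, the sampling rate of 256 Hz and the two-tone test signal are all assumptions chosen to keep the example small. Each frame is multiplied by a Hanning window and passed through a naive DFT, and the squared moduli form one row of the spectrogram.

```python
import cmath
import math

def hann(M):
    """Hanning window 0.5 (1 - cos(2 pi m / M)) sampled at M points."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * m / M)) for m in range(M)]

def spectrogram(x, M, hop):
    """|X|^2 for each frame (rows) and each non-negative frequency bin (columns)."""
    w = hann(M)
    frames = []
    for start in range(0, len(x) - M + 1, hop):
        seg = [x[start + m] * w[m] for m in range(M)]      # windowed frame
        row = []
        for k in range(M // 2 + 1):
            X = sum(seg[m] * cmath.exp(-2j * math.pi * k * m / M) for m in range(M))
            row.append(abs(X) ** 2)
        frames.append(row)
    return frames

fs = 256
# Hypothetical test signal: 16 Hz for the first second, then 48 Hz.
x = [math.sin(2 * math.pi * (16 if n < fs else 48) * n / fs) for n in range(2 * fs)]

S = spectrogram(x, M=64, hop=64)
# Frequency (in Hz) of the strongest bin in each frame; bin spacing is fs / M.
peaks_hz = [max(range(len(row)), key=row.__getitem__) * fs / 64 for row in S]
```

A plain Fourier Transform of the whole signal would show both frequencies but not when each occurs; the list `peaks_hz` recovers the change from 16 Hz to 48 Hz frame by frame.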

2.10 Physics of Sound

The ‘pitch’ of a soundwave is determined by its frequency. A soundwave with a higher

frequency will be heard ‘higher’ in pitch than one with a lower frequency. A normal

human being can perceive soundwaves with frequencies ranging from approximately 20

Hz to 15000 Hz.

Definition 2.10.1. A sound thread is a continuous sound with its frequency and amplitude varying continuously over time.

Definition 2.10.2. A pitch class is the set of all pitches that are some integer multiple of an octave apart. For example, the pitch class of D is the set of all D’s in different octaves.

3 Preliminary Investigation of the Shepard Tone

3.1 Using SoundRuler

Using the commercial software SoundRuler, we perform an initial analysis of James Tenney’s famous Shepard Tone piece ‘For Ann’. First we load the input .wav file into SoundRuler and partition the piece into 120 subsections of 0.2 seconds each. We then generate figures that represent the frequency-amplitude relation at 48.8, 49.2, 49.6, 50.0, 50.4, and 50.8 seconds. We display the figures below.

Figure 3.1.1 shows the frequency-amplitude relation of James Tenney’s ‘For Ann’ at 48.8 seconds. We can see that there are eight clear peaks, with the leftmost peak having the largest amplitude. Generally speaking, at this stage each peak has a larger amplitude than the peaks at higher frequencies. In Figure 3.1.2, we see that this relationship among peaks does not hold so clearly at 49.2 seconds. Nonetheless, we do notice that the rightmost peak shifted out of the observation window between 48.8 and 49.2 seconds.

Comparing Figure 3.1.2 and Figure 3.1.3, we notice that the three rightmost peaks shifted noticeably to the right, which means the frequencies increased and listeners would hear the sound ascend. The same phenomenon can be observed by comparing Figure 3.1.3 with Figure 3.1.4.

Another thing worth noticing is the number of peaks in each figure. Table 3.1.1 summarizes this information. Notice that the number of peaks in each figure remains fairly constant. This shows that, whenever the rightmost peaks shift out of the frequency range on which we focus, some peaks with lower frequencies appear to replace them, so the total number of peaks does not decrease.

3.2 Using Tone Analyzer

3.2.1 About Tone Analyzer

We use code we wrote in MATLAB, which we call Tone Analyzer, to analyze sound files.

The code of Tone Analyzer can be found in Appendix A. It firstly loads a .wav file into

MATLAB, and then calculates relevant parameters. More specifically, the parameters it

calculates are

• Length of the sound file, L.

• Number of samples in a window, M.

• Number of points to sample the Fourier Transform at, N.

• Sampling rate, fs.

• Number of frames into which the sound file is chopped and to which the window is applied, nframe.

Figure    Number of Peaks
3.1.1     8
3.1.2     7
3.1.3     6
3.1.4     8
3.1.5     7
3.1.6     7

Table 3.1.1. Number of Peaks in Figures 3.1.1-3.1.6

We load James Tenney’s ‘For Ann’ into Tone Analyzer. The values of the parameters

found by Tone Analyzer are listed in Table 3.2.1.

Using the parameters calculated, Tone Analyzer will construct frames, i.e., segments of

the original audio signal that have equal length, that the designed window will be applied

to. Here we use W = 0.5(1− cos(2πx)) to be our window.

Each frame captures 0.1 seconds of audio signal, and a Windowed Fourier Transform

is performed. The moduli of the output signal are calculated and we plot three figures to

show the results.

Figure 3.2.1 reveals the time-frequency relation of the input audio signal. The horizontal

axis represents the time, the vertical axis represents the frequency, and the color represents

the amplitude, with blue being lower amplitude and red being higher amplitude. One

can notice that there are about seventeen overlapping sound threads appearing during

the whole piece, and they are equally spaced in time. The color change of the threads

corresponds to the change in amplitude. We also notice that, from Figure 3.2.1, these

threads’ frequencies seem to be increasing exponentially. We next plot another figure,

Figure 3.2.2, this time with the vertical axis representing Log-frequency. It is obvious

that these ‘sound threads’ become straight lines after taking log, confirming that they are

indeed exponential functions.
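The reason a logarithmic axis straightens the threads can be written out in one line. As an assumption for illustration (the symbols $f_0$, $T$ and $t_i$ are ours and are not measured from the piece), suppose thread $i$ starts at time $t_i$ with frequency $f_0$ and doubles its frequency every $T$ seconds, so that $f_i(t) = f_0\,2^{(t - t_i)/T}$. Then

\[
\log f_i(t) = \log f_0 + \frac{\log 2}{T}\,(t - t_i),
\]

which is linear in $t$ with slope $\frac{\log 2}{T}$. Because the slope does not depend on $i$, threads of this form appear as parallel straight lines in a log-frequency plot, shifted from one another in time.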

Parameter   Value     Description
L           1248815   Length of the sound file
M           2206      Number of samples in a window
N           2206      Number of points to sample the Fourier Transform at
fs          22050     Sampling rate
nframe      5640      Number of frames needed to chop the sound file

Table 3.2.1. Parameters of ‘For Ann’


The last figure, Figure 3.2.3, shows the frequency-amplitude relation at the 41.1 second.

3.3 Observations

From Figure 3.2.1 and 3.2.2 we observe that there are multiple equally-spaced time threads

that make up the Shepard Tone. We can see that the frequencies of the threads exhibit

exponential growth. The amplitude of each thread varies over time: it starts with lower

amplitude and grows as time passes until it reaches the highest amplitude, and decays

afterward.

3.4 Some Explanations

The manipulation of frequencies and amplitudes of the sound threads seems to be the source of the illusion. In fact, Shimizu’s research [4] shows that, when listening to a Shepard Tone, the listener perceives the frequency closest to the one previously heard, as long as both frequencies lie within the listener’s sensitive range of perception, which is typically between 500 Hz and 5000 Hz. Once the frequency on which the listener concentrates leaves the sensitive range, the listener’s attention automatically shifts back to a different frequency that lies within the sensitive range (Shimizu et al., 2007). The research [4] also claims that the attention shift will most likely coincide with a multiple of an octave, due to the spatial closeness of the intervals (see Figure 4.3.2). (A sound that is perceived an octave lower than another sound has half the frequency of the other sound.)

Therefore, when the frequencies grow in time, the listener follows the rise of the sound until it leaves the sensitive range of perception. Once the sensitive range is passed, the listener automatically shifts his/her attention to sounds with lower frequencies. Since the amplitude also changes over time, it becomes difficult for the listener to notice the shifts he/she makes while listening to the Shepard Tone. This is precisely the source of the illusion.


Figure 3.1.1. The 48.8’th Second



Figure 3.2.1. Frequency vs Time

Figure 3.2.2. Log(Frequency) vs Time


Figure 3.2.3. At the 41.1 second

4 Reconstructing the Shepard Tone

4.1 Method

From the end of the previous chapter, we know that a Shepard Tone is made by superposing sound threads with different but related frequencies, each of which, at a fixed point in time, has a different amplitude. In fact, the term Shepard Tone sometimes refers specifically to sound threads that are superposed in such a way that every two neighboring sound threads are an octave apart. In this project we consider the generalized Shepard Tone, in which neighboring sound threads need not be an octave apart and need not be parallel to each other.

Here we will use Shepard Tone Generator, which we wrote in MATLAB, to recon-

struct the Shepard Tone. The Shepard Tone Generator utilizes the idea of superposing

sound threads together in order to create an audio signal that is illusive. The frequen-

cies of threads will be controlled by an exponential-growth function, and the amplitudes

of threads will be controlled by an envelope function. Schematically, our Shepard Tone


Generator produces an audio signal of the form

$$\sum_i (\text{envelope function})(t)\,\sin(f_i(t)),$$

where i indexes the sound threads, and we will describe our choice of envelope function as

well as our choices of exponential frequency functions $f_i(t)$ below. Notice that a normal
human being can perceive sound waves with frequencies ranging from 20 Hz to 15000 Hz.
The corresponding Nyquist rate, i.e., the minimum sampling frequency that avoids aliasing,
is 2 × 15000 = 30000 Hz. Since 44100 > 30000, we can thus choose the sampling frequency
to be FS = 44100 Hz to avoid aliasing effects.
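The schematic form above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MATLAB Shepard Tone Generator itself: the function names are hypothetical, the polynomial envelope and the exponential function with A = 2 and B = 1/4 are the choices this chapter goes on to describe, and thread k is assumed to delay both its envelope and its frequency function by k times a fixed delay.

```python
import math

def polynomial_envelope(t, duration=50.0):
    """Envelope t^10 (t - duration)^6 on [0, duration], zero elsewhere (assumed)."""
    if t < 0.0 or t > duration:
        return 0.0
    return t ** 10 * (t - duration) ** 6

def thread_phase(t, A=2.0, B=0.25):
    """Exponential-growth function f(t) = A * exp(B * t) that is passed to sin."""
    return A * math.exp(B * t)

def shepard_sample(t, num_threads=6, delay=2.7726, envelope=polynomial_envelope):
    """Sum of enveloped threads at time t; thread k starts k * delay seconds late."""
    total = 0.0
    for k in range(num_threads):
        local_t = t - k * delay
        total += envelope(local_t) * math.sin(thread_phase(local_t))
    return total

FS = 44100  # sampling frequency in Hz
# A short excerpt of 1000 samples starting at t = 20 s (unnormalized amplitudes).
excerpt = [shepard_sample(20.0 + n / FS) for n in range(1000)]
```

The samples are left unnormalized; a playback routine would typically rescale them to the range [-1, 1] first.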

4.2 Choice of Envelope

The manipulation of the amplitude of each sound thread is crucial to the overall success

of a Shepard Tone. In order to have the listener unaware of their attention shift from

one sound thread to another, we have to increase and then decrease the volume of each

sound thread gradually. We can control the amplitude of each sound thread by applying

an envelope function to it. In the following setups, we use the polynomial

$$t^{10}(t-50)^6$$

as our envelope function. (See Fig 4.2.1)

We chose this particular polynomial because it has the following properties: It is zero

at the beginning (t = 0 seconds) and at the end (t = 50 seconds) of our audio file; it has

a faster descent than ascent due to the choice of powers of 6 and 10; and of course it is

smooth.
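The stated properties of the envelope can be checked numerically. The sketch below is a hypothetical helper, not part of the original generator: it evaluates the polynomial on a 0.25-second grid (an arbitrary choice) and locates the peak, so that the ascent and descent durations can be compared.

```python
def envelope(t, duration=50.0):
    """The polynomial envelope t^10 (t - duration)^6."""
    return t ** 10 * (t - duration) ** 6

step = 0.25                                           # grid spacing in seconds (assumed)
grid = [i * step for i in range(int(50 / step) + 1)]  # 0, 0.25, ..., 50
peak_time = max(grid, key=envelope)                   # time of the largest envelope value

ascent_seconds = peak_time            # time spent rising from t = 0 to the peak
descent_seconds = 50.0 - peak_time    # time spent falling from the peak to t = 50

print(envelope(0.0), envelope(50.0), peak_time, ascent_seconds, descent_seconds)
```

Setting the derivative to zero gives the same peak analytically: $t = 50 \cdot \frac{10}{16} = 31.25$ seconds, so the rise takes 31.25 seconds and the fall only 18.75 seconds.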

4.3 Setup 1: Parallel Threads Equally Spaced in Time

Using the harmonics graph (from http://www.lamadeguido.com/Image4.gif) shown below
(Fig 4.3.1) and some simple calculations, which will be demonstrated in the following
sections, we can construct the desired musical intervals by controlling the time interval
between sound threads. In each of the following setups we use $f(t) = Ae^{Bt}$, where
$A = 2$, $B = \frac{1}{4}$, to control the frequencies of the sound threads, and we use
curve = sin(f(t)) to construct sound threads with increasing frequencies. The choices of
A and B here are from observations of Figure 3.2.1, which shows the time-frequency
relation of James Tenney’s ‘For Ann’. We estimate the distances between every two
neighboring sound threads and the rate of change in frequency for each thread.

Figure 4.2.1. Envelope = $t^{10}(t-50)^6$

One way of defining intervals is by counting the number of half steps that lie between

the given two notes. For example, two notes that form an octave have twelve half steps

that lie in between.

From the study [4], we know that the Shepard Tone’s illusiveness comes from the lis-

tener’s attention shift. Furthermore, Shimizu’s research claims that shifts of attention

mostly coincide with a multiple of octaves because of the spatial closeness of the octave

tones. Figure 4.3.2 shows the spatial closeness of musical pitches in a helix representation
(from http://www1.appstate.edu/~kms/classes/psy3203/MusicIllusions/helix.gif).

Figure 4.3.1. Harmonics (from http://www.lamadeguido.com/Image4.gif)

We can therefore derive a distance measure which measures the distance of attention shift

that one makes when the pitch on which one concentrates passes the sensitive frequency

range and, as a consequence, the process of searching for notes from the same pitch class

begins.

Suppose a Shepard Tone is composed of sound threads that are octaves apart, i.e., at any
given point in time the frequencies of these threads represent notes from the same pitch
class. Then once the sound thread on which the listener focuses passes the sensitive
frequency range, the listener only needs to shift attention one thread below the original
one, since the new thread will be in the same pitch class as the original note, only an
octave below.


Figure 4.3.2. Helix Representation of Pitches (from http://www1.appstate.edu/~kms/classes/psy3203/MusicIllusions/helix.gif)

We propose a distance measurement that can be applied to all intervals except for

major seconds (M2) and minor seconds (m2). The cases of major and minor seconds will

be discussed separately after the other intervals.

Definition 4.3.1 (Distance Measure). Suppose a particular Shepard Tone is formed by

superposing sound threads in such a way that any two neighboring threads are n half steps

apart, where n ≤ 12. Then the distance of the listener’s attention shift when searching for

the closest note from the same pitch class is

$$\sigma(n) = \frac{\operatorname{lcm}(n, 12)}{12}$$

(Note that σ(n) will take values between 1 and 11.) In other words, the new thread

from the same pitch class as the original thread is σ(n) octaves away from the original

one. Therefore, for an ascending Shepard Tone the listener’s attention has to shift σ(n)

octaves downward in order to find a note from the same pitch class as before.
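The distance measure is easy to tabulate. The following Python sketch (the function name `sigma` and the dictionary are hypothetical) evaluates Definition 4.3.1 for every spacing from one to twelve half steps, computing the least common multiple from the greatest common divisor.

```python
from math import gcd

def sigma(n):
    """Attention-shift distance in octaves for threads n half steps apart."""
    if not 1 <= n <= 12:
        raise ValueError("n must be between 1 and 12 half steps")
    lcm = n * 12 // gcd(n, 12)   # least common multiple of n and 12
    return lcm // 12

# Distance for each spacing, from the octave (12) down to the minor second (1).
sigma_by_half_steps = {n: sigma(n) for n in range(12, 0, -1)}
print(sigma_by_half_steps)
```

Since lcm(n, 12) = 12n / gcd(n, 12), the measure simplifies to σ(n) = n / gcd(n, 12), which makes it clear why spacings that divide 12 evenly all give σ = 1.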


4.3.1 Octaves

First of all we have to figure out the required time interval between different sound threads

in order to make any two neighboring sound threads an octave apart from each other. From

Figure 4.3.1 we can see that two notes that are an octave apart will have frequency f for

the low note and 2f for the high note. Now, suppose at time t a sound thread has frequency

f , and at a later point of time, t1, the frequency of the same sound thread changes to 2f .

To figure out the delay in time that will make the sound threads octaves apart from each

other, we need to figure out the time difference t1 − t. Therefore, we have to solve the

following system of equations.

$$f = f(t) = Ae^{Bt},$$

$$2f = f(t_1) = Ae^{Bt_1}.$$

Substituting $Ae^{Bt}$ for $f$ and plugging that into the second equation we get

$$2Ae^{Bt} = Ae^{Bt_1}.$$

Since $A \neq 0$, we get

$$2e^{Bt} = e^{Bt_1}.$$

Now applying logarithms to both sides of the equation we get

$$\log\left(2e^{Bt}\right) = \log e^{Bt_1}.$$

Since

$$\log\left(2e^{Bt}\right) = \log 2 + \log e^{Bt} = \log 2 + Bt$$

and

$$\log e^{Bt_1} = Bt_1,$$

we then have that

$$\log 2 + Bt = Bt_1.$$

Therefore

$$t_1 = \frac{\log 2}{B} + t,$$

and

$$t_1 - t = \frac{\log 2}{B}.$$

Therefore, the time interval required for the sound threads to be octaves apart is
$\frac{\log 2}{B}$ seconds, or an integer multiple of $\frac{\log 2}{B}$ seconds. Plugging in
$B = \frac{1}{4}$, we know that the threads have to be 2.7726 seconds, or an integer
multiple of 2.7726 seconds, apart.
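The same derivation works for any frequency ratio r in place of 2, giving a delay of log(r)/B. A small Python sketch follows; the function name `thread_delay` is hypothetical, and the ratio 3/2 for a perfect fifth is the one listed in the comments of Appendix B.

```python
import math

def thread_delay(ratio, B=0.25):
    """Seconds between threads so that neighbors differ by the given frequency ratio."""
    return math.log(ratio) / B

def thread_frequency(t, A=2.0, B=0.25):
    """The exponential-growth function f(t) = A * exp(B * t)."""
    return A * math.exp(B * t)

octave_delay = thread_delay(2.0)       # log(2) / B
fifth_delay = thread_delay(3.0 / 2.0)  # log(3/2) / B

# Check: after octave_delay seconds, a thread's frequency has doubled.
t = 10.0
growth = thread_frequency(t + octave_delay) / thread_frequency(t)
print(round(octave_delay, 4), round(fifth_delay, 4), growth)
```

Because the ratio f(t + d)/f(t) equals exp(B d) regardless of t and A, a constant delay keeps neighboring threads the same interval apart for the whole signal.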

Since an octave is formed by two notes that are twelve half steps apart, we can calculate

the listener’s attention shift by plugging in n = 12 to our distance formula.

$$\sigma(12) = \frac{\operatorname{lcm}(12, 12)}{12} = \frac{12}{12} = 1.$$

This calculation shows that the listener only has to shift to the thread one octave below

in order to locate a note from the same pitch class.

4.3.2 Major Sevenths (M7)

We can calculate the time interval between threads in order to make a major seventh

interval. By Figure 4.3.1, we know that to make a major seventh interval we need the two
notes to have frequencies $f$ and $\frac{11}{6}f$. Note that $f$ and $\frac{11}{6}f$ are not
the only pair that forms a major seventh; $f$ and $\frac{15}{8}f$ would work as well. The
choice of $f$ and $\frac{11}{6}f$ is related to some musical tuning considerations which we
will not go into in detail. In the following discussion the reader should keep in mind that
the given ratios to form the desired intervals might not be the only choices.

Similar to the octave case, we get

$$t_1 - t = \frac{\log\frac{11}{6}}{B},$$

and plugging in $B = \frac{1}{4}$,

$$t_1 - t \approx 2.4245.$$


Therefore, the threads have to be 2.4245 seconds, or an integer multiple of 2.4245 seconds,

apart in order to make a major seventh system.

Since a major seventh is formed by two notes that are eleven half steps apart, we can

calculate the listener’s attention shift by plugging in n = 11 to our distance formula.

$$\sigma(11) = \frac{\operatorname{lcm}(12, 11)}{12} = \frac{132}{12} = 11.$$

This calculation shows that the listener has to shift to the thread eleven octaves below

in order to locate a note from the same pitch class.

4.3.3 Minor Sevenths (m7)

Similarly, the threads have to be 2.2385 seconds, or an integer multiple of 2.2385 seconds,

apart in order to make a minor seventh system.

Since a minor seventh is formed by two notes that are ten half steps apart, we can

calculate the listener’s attention shift by plugging in n = 10 to our distance formula.

$$\sigma(10) = \frac{\operatorname{lcm}(12, 10)}{12} = \frac{60}{12} = 5.$$

Therefore the listener has to shift to the thread five octaves below in order to locate a

note from the same pitch class.

4.3.4 Major Sixths (M6)

To make a major sixth system the threads have to be 2.0433 seconds, or an integer multiple

of 2.0433 seconds, apart. Since a major sixth is formed by two notes that are nine half

steps apart, we can calculate the listener’s attention shift by plugging in n = 9 to our

distance formula.

$$\sigma(9) = \frac{\operatorname{lcm}(12, 9)}{12} = \frac{36}{12} = 3.$$

Therefore the listener has to shift to the thread three octaves below in order to locate a
note from the same pitch class.



4.3.5 Minor Sixths (m6)

To make a minor sixth system the threads have to be 1.8800 seconds, or an integer multiple

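of 1.8800 seconds, apart. Since a minor sixth is formed by two notes that are eight half
steps apart, we can calculate the listener’s attention shift by plugging in n = 8 to our
distance formula.

$$\sigma(8) = \frac{\operatorname{lcm}(12, 8)}{12} = \frac{24}{12} = 2.$$

Therefore the listener has to shift to the thread two octaves below in order to locate a
note from the same pitch class.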


4.3.6 Perfect Fifths (P5)

To make a perfect fifth system the threads have to be 1.6219 seconds, or an integer multiple

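of 1.6219 seconds, apart. Since a perfect fifth is formed by two notes that are seven half
steps apart, we can calculate the listener’s attention shift by plugging in n = 7 to our
distance formula.

$$\sigma(7) = \frac{\operatorname{lcm}(12, 7)}{12} = \frac{84}{12} = 7.$$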

Therefore the listener has to shift to the thread seven octaves below in order to locate

a note from the same pitch class.

4.3.7 Augmented Fourths (A4)

To make an augmented fourth system the threads have to be 1.4267 seconds, or an integer
multiple of 1.4267 seconds, apart. Since an augmented fourth is formed by two notes that
are six half steps apart, we can calculate the listener’s attention shift by plugging in
n = 6 to our distance formula.

$$\sigma(6) = \frac{\operatorname{lcm}(12, 6)}{12} = \frac{12}{12} = 1.$$

Therefore the listener has to shift to the thread one octave below in order to locate a
note from the same pitch class.



4.3.8 Perfect Fourths (P4)

To make a perfect fourth system the threads have to be 1.1507 seconds, or an integer

multiple of 1.1507 seconds, apart. Since a perfect fourth is formed by two notes that are

five half steps apart, we can calculate the listener’s attention shift by plugging in n = 5

to our distance formula.

$$\sigma(5) = \frac{\operatorname{lcm}(12, 5)}{12} = \frac{60}{12} = 5.$$

Therefore the listener has to shift to the thread five octaves below in order to locate a
note from the same pitch class.


4.3.9 Major Thirds (M3)

To make a major third system the threads have to be 0.8926 seconds, or an integer multiple

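of 0.8926 seconds, apart. Since a major third is formed by two notes that are four half
steps apart, we can calculate the listener’s attention shift by plugging in n = 4 to our
distance formula.

$$\sigma(4) = \frac{\operatorname{lcm}(12, 4)}{12} = \frac{12}{12} = 1.$$

Therefore the listener has to shift to the thread one octave below in order to locate a
note from the same pitch class.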



4.3.10 Minor Thirds(m3)

To make a minor third system the threads have to be 0.7293 seconds, or an integer multiple

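of 0.7293 seconds, apart. Since a minor third is formed by two notes that are three half
steps apart, we can calculate the listener’s attention shift by plugging in n = 3 to our
distance formula.

$$\sigma(3) = \frac{\operatorname{lcm}(12, 3)}{12} = \frac{12}{12} = 1.$$

Therefore the listener has to shift to the thread one octave below in order to locate a
note from the same pitch class.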




4.3.11 Major Seconds (M2)

To make a major second system the threads have to be 0.5341 seconds, or an integer
multiple of 0.5341 seconds, apart. Suppose we were to use the distance formula to calculate
the listener’s attention shift. Then, since a major second is formed by two notes that are
two half steps apart, we can plug in n = 2 to our distance formula.

$$\sigma(2) = \frac{\operatorname{lcm}(12, 2)}{12} = \frac{12}{12} = 1.$$



However, when sound threads are just two half steps, i.e., one whole step, apart, the

distance between sound threads is too small for our ears to tell clearly whether the Shepard

Tone is ascending or not. Therefore, even if we get σ(2) = 1 for Shepard Tones containing

sound threads that are major seconds apart, we cannot conclude that it has the same

degree of illusiveness as other intervals with σ value of 1.

4.3.12 Minor Seconds (m2)

To make a minor second system the threads have to be 0.3480 seconds, or an integer
multiple of 0.3480 seconds, apart. Similar to the major second case discussed above, suppose
we were to use the distance formula to calculate the listener’s attention shift. Then, since
a minor second is formed by two notes that are one half step apart, we can plug in n = 1
to our distance formula.

$$\sigma(1) = \frac{\operatorname{lcm}(12, 1)}{12} = \frac{12}{12} = 1.$$



However, when sound threads are just one half step apart, the distance between sound

threads is too small for our ears to tell clearly whether the Shepard Tone is ascending or


not. Therefore, even if we get σ(1) = 1 for Shepard Tones containing sound threads that

are minor seconds apart, we cannot conclude that it has the same degree of illusiveness as

other intervals with σ value of 1.

The relationship between intervals and the number of half steps is summarized in Ta-

ble 4.3.1.

4.3.13 Observations

As Table 4.3.2 shows, by comparing the σ(n)’s of different intervals we can form a hierarchy

in terms of attention-shifting distance. We say that intervals that require shorter travel

distance tend to be more illusive because the listener is less likely to be aware of the

attention shift.

4.4 Setup 2: Parallel VS Non-Parallel Threads

From study [4], we know that the illusiveness of a Shepard Tone comes from the inad-

vertent attention shift from one sound thread to another that the listener makes when

concentrating on the Shepard Tone. A Shepard Tone is illusive precisely when the listener

is indeed unaware of this attention shift. In the following discussion, we will use the term

‘frequency distribution’ of the sound threads to refer to the pattern, or more specifically

the relative spacings, of frequency peaks that are present locally in time. These frequency

peaks correspond to the frequencies of the sound threads. (See, for example, Figure 4.4.6.)

We claim that if the sound threads’ frequency distribution hardly varies as time passes,

the listener would be unlikely to notice the attention shift. Alternatively, if there is a large

change in the frequency distribution over time, which the listener would perceive as a

change in the rate of ascent at the attention shift, then the listener would be alerted to

this shift of attention, and consequently would not find such a Shepard Tone illusive. In

order to measure this frequency distribution variation over time, we use some statistical


tools to create a measurement of variation. We will then use this measurement to describe

the degrees of illusiveness of different Shepard Tones. In this section we will compare the

degree of illusiveness between Shepard Tones with parallel sound threads and those with

non-parallel sound threads.

For each of the following 21-second-long setups, we randomly pick two points in time

(here we pick the 14.7th and the 16.8th seconds), identify the frequencies of the peaks,

and calculate the ratios of the frequencies of every two neighboring peaks. Then we will

calculate the variance of the logarithm of the ratios. Finally, we will compare the percent-

age change in variance between these two points in time. Note that the smaller the value

of percentage change in variance is, the more constant the frequency spacing between

peaks is. We claim that one setup is less illusive than another if the percentage change

in variance between the two arbitrarily chosen points in time is larger than that for the

other setup. This is because if a setup has a larger percentage change of variance, then

the listener is more likely to notice the change in frequency distribution, which is central

to the illusiveness of the Shepard Tone, and therefore this setup would be less illusive. We

outline the way we calculate the relative change in variance in the following definition.

Definition 4.4.1 (Calculating the Relative Change in Variance). Suppose a particular
Shepard Tone is formed by superposing n sound threads, and, at a particular point of
time $t_0$, the frequencies of these sound threads reach $f_1, f_2, f_3, \cdots, f_n$, where
$f_1 \le f_2 \le f_3 \le \cdots \le f_n$. Then we define $\sigma^2(t_0)$ in the following way.

$$\sigma^2(t_0) = \operatorname{var}\left(\log\frac{f_2}{f_1}, \log\frac{f_3}{f_2}, \cdots, \log\frac{f_n}{f_{n-1}}\right)$$

We can therefore calculate the relative change in variance between two different
points in time $t_0$, $t_1$, with $t_0 \le t_1$, by

$$\text{relative change in variance} = \frac{\sigma^2(t_1) - \sigma^2(t_0)}{\sigma^2(t_0)}.$$

Here we will consider three cases. Case 1 consists of parallel sound threads, while Cases

2 and 3 consist of non-parallel sound threads with different parameters.

4.4.1 Case 1: Parallel Sound Threads

In the first case we use Shepard Tone Generator to generate a Shepard Tone of six parallel

sound threads, each of which has parameters A = 2, B = 0.25 and delay = 2.77. We use

the MATLAB command wavwrite to make this Shepard Tone into a .wav file. Then we use

Tone Analyzer to generate Figure 4.4.1, Figure 4.4.2 and Figure 4.4.3. Figure 4.4.1 shows

the time-frequency distribution of this case. Also, by reading off the data from Figure 4.4.2

and 4.4.3 we can calculate the σ2’s of the 14.7th second and the 16.8th second.

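$$\sigma^2(14.7) = \operatorname{var}\left(\log\frac{140}{70}, \log\frac{270}{140}, \cdots, \log\frac{4390}{2190}\right) \tag{4.4.1}$$

$$= \operatorname{var}(0.6931, 0.6568, 0.7115, 0.6931, 0.6886, 0.6954) \tag{4.4.2}$$

$$= 0.3233 \times 10^{-3}, \tag{4.4.3}$$

$$\sigma^2(16.8) = \operatorname{var}\left(\log\frac{120}{60}, \log\frac{230}{120}, \cdots, \log\frac{3700}{1850}\right) \tag{4.4.4}$$

$$= \operatorname{var}(0.6931, 0.6506, 0.6931, 0.6931, 0.6986, 0.6931) \tag{4.4.5}$$

$$= 0.3222 \times 10^{-3}, \tag{4.4.6}$$

$$\frac{\sigma^2(16.8) - \sigma^2(14.7)}{\sigma^2(14.7)} = \frac{0.3222 \times 10^{-3} - 0.3233 \times 10^{-3}}{0.3233 \times 10^{-3}} = -0.0034.$$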


Figure 4.4.1. Case 1: Frequency vs Time

4.4.2 Case 2: Non-Parallel Sound Threads A

The second case consists of six non-parallel sound threads that, while all having parameters

A = 2 and delay = 2.77, have different B values that are 0.25, 0.15, 0.21, 0.19, 0.17, 0.23,

respectively. Notice the variance of the B values is 0.0014. Similar to Case 1, we use Shepard

Tone Generator and Tone Analyzer to generate relevant graphs. Figure 4.4.4 shows the

time-frequency distribution of this case. By reading off the data from Figure 4.4.5 and

4.4.6, we can calculate the σ2’s of the 14.7th second and the 16.8th second.


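$$\sigma^2(14.7) = \operatorname{var}\left(\log\frac{90}{30}, \log\frac{610}{90}, \cdots, \log\frac{4150}{1600}\right) \tag{4.4.7}$$

$$= \operatorname{var}(1.0986, 1.9136, 0.9643, 0.9531) \tag{4.4.8}$$

$$= 0.2106, \tag{4.4.9}$$

$$\sigma^2(16.8) = \operatorname{var}\left(\log\frac{130}{40}, \log\frac{960}{130}, \cdots, \log\frac{7120}{2630}\right) \tag{4.4.10}$$

$$= \operatorname{var}(1.1787, 1.9994, 1.0078, 0.9959) \tag{4.4.11}$$

$$= 0.2272, \tag{4.4.12}$$

$$\frac{\sigma^2(16.8) - \sigma^2(14.7)}{\sigma^2(14.7)} = \frac{0.2272 - 0.2106}{0.2106} = 0.0787.$$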

4.4.3 Case 3: Non-Parallel Sound Threads B

The third case consists of six non-parallel sound threads that, while all having parameters

A = 2 and delay = 2.77, have different B values that are 0.25, 0.08, 0.19, 0.10, 0.24, 0.16,

respectively. Notice the variance of the B values is 0.0050, which is larger than that of Case

2. This means the B values in Case 3 vary more than those in Case 2. Figure 4.4.7 shows

the time-frequency distribution of this case. By reading off the data from Figure 4.4.8 and

4.4.9 we can calculate the σ2’s of the 14.7th second and the 16.8th second.

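$$\sigma^2(14.7) = \operatorname{var}\left(\log\frac{250}{60}, \log\frac{2810}{250}, \log\frac{4530}{2810}\right) \tag{4.4.13}$$

$$= \operatorname{var}(1.4271, 2.4195, 0.4775) \tag{4.4.14}$$

$$= 0.9429, \tag{4.4.15}$$

$$\sigma^2(16.8) = \operatorname{var}\left(\log\frac{370}{80}, \log\frac{4590}{370}, \log\frac{7560}{4590}\right) \tag{4.4.16}$$

$$= \operatorname{var}(1.5315, 2.5181, 0.4990) \tag{4.4.17}$$

$$= 1.0194, \tag{4.4.18}$$

$$\frac{\sigma^2(16.8) - \sigma^2(14.7)}{\sigma^2(14.7)} = \frac{1.0194 - 0.9429}{0.9429} = 0.0811.$$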
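The calculation of Definition 4.4.1 can be reproduced from the peak frequencies quoted for Case 3. The Python sketch below is an illustration with hypothetical function names. It assumes that var denotes the sample variance (normalized by the number of values minus one, which is the default of MATLAB's var and of Python's statistics.variance) and that log is the natural logarithm.

```python
import math
from statistics import variance

def log_ratio_variance(peaks_hz):
    """Sample variance of the log-ratios of neighboring peak frequencies."""
    freqs = sorted(peaks_hz)
    log_ratios = [math.log(hi / lo) for lo, hi in zip(freqs, freqs[1:])]
    return variance(log_ratios)

def relative_change_in_variance(peaks_t0, peaks_t1):
    """(sigma^2(t1) - sigma^2(t0)) / sigma^2(t0) for two sets of peaks."""
    v0 = log_ratio_variance(peaks_t0)
    v1 = log_ratio_variance(peaks_t1)
    return (v1 - v0) / v0

# Peak frequencies (Hz) quoted for Case 3 at the 14.7th and 16.8th seconds.
case3_at_14_7 = [60, 250, 2810, 4530]
case3_at_16_8 = [80, 370, 4590, 7560]

change = relative_change_in_variance(case3_at_14_7, case3_at_16_8)
print(round(change, 4))
```

Using log-ratios rather than raw frequency differences means that equally spaced musical intervals give a variance near zero, however high the threads have climbed.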

4.4.4 Observations

Comparing the relative change in variance across the three cases, we find −0.0034 for
Case 1, where all sound threads are parallel, 0.0787 for Case 2, and 0.0811 for Case 3.
The small magnitude of the relative change in variance in Case 1 indicates that the
spacing between sound threads is fairly constant. Since the frequency spacing between
sound threads hardly changes, it is difficult for the listener to notice any attention shift
when listening to the sound file. By contrast, since Cases 2 and 3 have a comparatively
large relative change in variance, the listener is more likely to notice the attention shift
when focusing on those two Shepard Tones, and thus to find both Case 2 and Case 3 less
illusive. Furthermore, since the relative change in variance in Case 3 is larger than that
in Case 2, Case 3 is even less illusive than Case 2. This conclusion is consistent with the
variances of the two cases’ B values. (The B values in Case 3 have a larger variance than
those in Case 2.)

Therefore we conclude that Shepard Tones with parallel sound threads are more illusive

than those with non-parallel sound threads, as would be expected. We can further conclude

that Shepard Tones whose B values have larger variances are less illusive than those whose

B values have smaller variance.
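The variances of the B values quoted for Cases 2 and 3 can be checked in the same way. The short Python sketch below assumes, as before, that the sample variance is meant; the variable names are hypothetical.

```python
from statistics import variance

# B values of the six threads in each non-parallel case, as listed above.
case2_B = [0.25, 0.15, 0.21, 0.19, 0.17, 0.23]
case3_B = [0.25, 0.08, 0.19, 0.10, 0.24, 0.16]

case2_variance = variance(case2_B)  # sample variance, about 0.0014
case3_variance = variance(case3_B)  # sample variance, about 0.0050

print(round(case2_variance, 4), round(case3_variance, 4))
```

With only three cases, this ordering is suggestive rather than conclusive, but it does agree with the ordering of the relative changes in variance.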


Interval            Number of Half Steps
Octave              12
Major Seventh       11
Minor Seventh       10
Major Sixth         9
Minor Sixth         8
Perfect Fifth       7
Augmented Fourth    6
Perfect Fourth      5
Major Third         4
Minor Third         3
Major Second        2
Minor Second        1

Table 4.3.1. Intervals and their corresponding number of half steps

σ    1         2     3     5     7     11
     Octave    m6    M6    m7    P5    M7
     A4                    P4
     M3
     m3
     (M2)
     (m2)

Table 4.3.2. Intervals and corresponding σ values

Interval            Seconds Apart
Octave              2.7726
Major Seventh       2.4245
Minor Seventh       2.2385
Major Sixth         2.0433
Minor Sixth         1.8800
Perfect Fifth       1.6219
Augmented Fourth    1.4267
Perfect Fourth      1.1507
Major Third         0.8926
Minor Third         0.7293
Major Second        0.5341
Minor Second        0.3480

Table 4.3.3. Intervals and their corresponding space in time when B = 1/4


Figure 4.4.2. Case 1: 14.7 second


5 Conclusion

In this project we investigated some properties of the Shepard Tone via time-frequency
analysis.

We learned that a Shepard Tone is illusive because it makes the listener unaware of

the attention shift that is made while listening to it. Whenever the frequency of the

sound thread on which one concentrates passes the sensitive range, typically between 500

Hz and 5000 Hz, the listener’s attention will automatically shift back to another sound

thread that lies within the sensitive range. A well designed Shepard Tone takes advantage

of these attention shifts made by the listener in such a way that the listener is unaware of

these shifts. This is why the Shepard Tone is mistakenly heard as infinitely ascending.

We wrote Tone Analyzer in MATLAB to perform a time-frequency analysis on the

Shepard Tone. Next, using information from this analysis, we wrote another piece of

MATLAB code, which we called Shepard Tone Generator, to construct our own Shepard

Tones. In this way, we were able to experiment with different settings, and found through

this that the degree of illusiveness of a Shepard Tone is decided by the following factors.


• Choice of envelope function: The envelope function sets the amplitude of each sound

thread over time. A good envelope function should be smooth and have a fast ascent

followed by a fast descent. In our investigation we chose the polynomial function

$t^{10}(t-50)^6$ to be our envelope function.

• Spacing between sound threads: The spacing between sound threads partially de-

termines the degree of illusiveness of a Shepard Tone. We designed a distance mea-

surement σ which, along with our own perception, allowed us to conclude that the

spacings between sound threads that resulted in the most illusive Shepard Tones are

intervals with σ value of 1, which are octaves, augmented fourths, major thirds and

minor thirds. Intervals with larger σ values lead to less illusive Shepard Tones.

• Parallel or non-parallel sound threads: Based on our claim that the larger the vari-

ation of frequency distribution of a Shepard Tone is, the less illusive this Shepard

Tone would be, we defined a measurement of variation which calculates the relative

change in variance of a Shepard Tone, and concluded from this, and our own auditory

perception, that parallel sound threads create a higher degree of illusiveness than

non-parallel sound threads. We further concluded that non-parallel sound threads

with a larger relative change in variance appear to be less illusive than those with a

smaller relative change in variance.

6 Future Work

Since the envelope function of the Shepard Tone partially determines its degree of illusive-

ness, it is worth investigating the ideal type of envelope function. In this project we focused

on polynomial functions since they are smooth and can be easily chosen to have a fast

ascent followed by a fast descent. It would be worthwhile to further explore if variations

on our choice of polynomial function could improve the degree of illusiveness, or perhaps

to search for other kinds of functions that could serve as better envelope functions than

polynomial functions.

It would also be worthwhile to investigate whether the choice of the sound threads’

parameters, A and B, could potentially affect the degree of illusiveness of a Shepard Tone.

As mentioned in the introduction, there are a lot of other audio illusions that are being

investigated, mainly by scholars from the field of psychology. It would be interesting to

extend our time-frequency analysis of the Shepard Tone to those illusions and explore

properties they possess.

Appendix A Tone Analyzer

% Assumes x (a column vector of audio samples) and fs (its sampling rate in Hz)
% are already in the workspace.
L=length(x);
M=0.1*fs;               % raw window length = the number of samples in 0.1 sec
                        % (captures frequencies down to 10 Hz)
M=2*round(M/2);         % good to have even number of samples in window
N=M;                    % how many points to sample the Fourier transform at
                        % (also good to have this even)
H=round(M/10);          % H=hopsize is such that 10 consecutive windows overlap
nframes=floor((L-M)/H); % the number of frames that fit

T=(0.5*M+H*[0:1:nframes-1])/fs; % time grid formed by midpoint of each frame
F=fs*[0:N/2-1]/N;               % frequency grid for calculated Fourier transform samples

W=0.5*(1-cos(2*pi*[0:M-1]/M))'; % raised cosine window

output=zeros(N,nframes);

xoff=0;
for m=1:nframes
    xt=x(xoff+1:xoff+M);    % extract data from x with length M,
                            % which equals window length
    xtw=W.*xt;              % apply window function W to xt
    output(:,m)=fft(xtw,N); % perform fft at N points
    xoff=xoff+H;
end

y=abs(output);
y=y(1:N/2,:);               % extract only positive frequencies
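The windowing step in the listing above can be tried on a synthetic tone. The following Python sketch is not a port of Tone Analyzer. It processes a single 0.1-second frame, uses a naive discrete Fourier transform from the standard library in place of fft, and assumes an 8000 Hz sampling rate and a 440 Hz test sine; all names are hypothetical.

```python
import cmath
import math

def raised_cosine_window(M):
    """Same raised-cosine formula as W in the listing above."""
    return [0.5 * (1 - math.cos(2 * math.pi * n / M)) for n in range(M)]

def frame_spectrum(frame):
    """Magnitudes of the DFT of one windowed frame, positive frequencies only."""
    M = len(frame)
    window = raised_cosine_window(M)
    windowed = [window[n] * frame[n] for n in range(M)]
    magnitudes = []
    for k in range(M // 2):
        acc = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / M)
                  for n in range(M))
        magnitudes.append(abs(acc))
    return magnitudes

fs = 8000                    # assumed sampling rate in Hz
M = 2 * round(0.1 * fs / 2)  # even number of samples in 0.1 sec
frame = [math.sin(2 * math.pi * 440 * n / fs) for n in range(M)]

mags = frame_spectrum(frame)
peak_bin = max(range(len(mags)), key=lambda k: mags[k])
peak_hz = peak_bin * fs / M  # same frequency grid as F in the listing
print(peak_hz)
```

A 0.1-second window gives a frequency resolution of 10 Hz, which is why peak frequencies read from such an analysis come out as multiples of 10 Hz.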

Appendix B Shepard Tone Generator

FS=44100;                    % sampling frequency
sec=50;                      % length of signal in seconds
t=[0:1/FS:sec];              % number of entries = FS*sec+1 (the extra entry is time 0)
numOfEntries = FS*sec+1;

A=2;
B=1/4;

curve=zeros(1,numOfEntries); % allocate the curve vector
out=zeros(1,numOfEntries);

% if B=1/4 then
% octave=log(2)/B = 2.7726
% M7=log(11/6)/B = 2.4245
% m7=log(7/4)/B = 2.2385
% M6=log(5/3)/B = 2.0433
% m6=log(8/5)/B = 1.8800
% p5=log(3/2)/B = 1.6219
% A4=log(10/7)/B = 1.4267
% p4=log(4/3)/B = 1.1507
% M3=log(5/4)/B = 0.8926
% m3=log(6/5)/B = 0.7293
% M2=log(8/7)/B = 0.5341
% m2=log(12/11)/B = 0.3480
delay=2.4245;
numOfThreads=15;
envelope=power(t,10).*power((t-sec),6); % element-wise product of the two row vectors

for k=0:numOfThreads-1
    curve=A*exp(B*(t-k*delay));
    % shift the envelope right by one delay, padding the front with zeros
    envelope=[zeros(1,floor(delay*FS)),envelope(1:numOfEntries-floor(delay*FS))];
    out=out+envelope.*sin(curve);
end

% play sound
halfTime=floor(numOfEntries/2);
excerpt=out(:,(1)*halfTime:numOfEntries-1);
soundsc(excerpt,FS);         % default rate is 8192; here we use FS instead

Bibliography

[1] Glyn James, Advanced Modern Engineering Mathematics, Prentice Hall, Pearson Education, 2004.

[2] Stephane Mallat, A Wavelet Tour of Signal Processing, Academic Press, New York, 1999.

[3] David C. Lay, Linear Algebra and Its Applications, Addison Wesley, Pearson Education, 2006.

[4] Yu Shimizu, Neuronal response to Shepard’s tones. An auditory fMRI study using multifractal analysis, Brain Research 1186 (2007), 113–123.
