time-frequency analysis for music signal analysis

Upload: vadlani-dinesh

Post on 04-Jun-2018


  • 8/13/2019 Time-Frequency Analysis for Music Signal Analysis


    Time-Frequency Analysis for Music Signal Analysis

    [Tutorial]

r99943

Abstract
I. Introduction
II. Time-Frequency Analysis and Classic Fourier Transform
III. Basic Concepts about Music
    i. Musical Pitch
    ii. Harmony
    iii. Tempo, Beat and Rhythm
IV. Time-Frequency Analysis and Musical Signal
    i. Short-Time Fourier Transform and Gabor Transform
    ii. Wigner Distribution Function
V. Time-Frequency Representation
    i. Log-Frequency Spectrogram
    ii. Time-Chroma Representation
VI. Other Applications on Musical Signal
    i. Onset Detection and Novelty Curve
    ii. Periodicity Analysis and Tempo Estimation
    iii. Harmonic Pitch Class Profiles
    iv. Modified HHT for Detecting Fundamental Frequency
VII. Conclusion
VIII. Reference


Abstract

Time-frequency analysis is an efficient tool for analyzing signals. It extends the classic Fourier approach. This tutorial introduces several kinds of time-frequency analysis and applies them to musical signals.

There are many time-frequency methods, such as the short-time Fourier transform (STFT), the Gabor transform (GT), and the Wigner distribution function (WDF). Here they are employed to analyze music played on a piano, a flute, or a guitar. Musical sound is more complicated than the sound produced by the human voice: it has a wider frequency band and different methods of producing sound. Most importantly, music signals are typical examples of time-varying signals, so the classic Fourier transform is not sufficient to analyze them. With time-frequency analysis, we can see how frequency varies with time.

I. Introduction

In this tutorial, I will first explain why time-frequency analysis is useful for music signals and how it differs from the classic Fourier transform. In Section III, I will also introduce some basic music theory. Several kinds of time-frequency analysis will be introduced and implemented in Section IV.

Section V presents two kinds of time-frequency representation for musical signals. Next, in Section VI, some other advanced analyses for musical signals are discussed. For example, chroma (HPCP) is an advanced application of time-frequency analysis: frequency is mapped into 12 pitch classes, so we can follow the change of pitch class over time. Finally, the conclusion is in Section VII, and the references are in Section VIII.


II. Time-Frequency Analysis and Classic Fourier Transform

In the past, we obtained the spectrum of a continuous signal s(t) with the classic Fourier transform, computed by

    S(f) = ∫ s(t) e^(−j2πft) dt

In the spectrum, we can see the magnitude at each frequency, which helps a lot in research. For example, consider a sinusoid whose frequency is 440 Hz. Applying the Fourier transform to the signal in Figure 1(a) gives the result in Figure 1(b): a peak at 440 Hz.

Figure 1. Fourier transform of a sinusoid signal. (a) The sinusoid signal with frequency 440 Hz. (b) The Fourier spectrum of (a).
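As a quick sketch of Figure 1's idea (the power-of-two sampling rate and one-second duration are assumptions, not from the text), the spectral peak of a 440-Hz sinusoid can be located with an FFT:

```python
import numpy as np

fs = 8192                              # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)          # one second of samples
x = np.cos(2 * np.pi * 440 * t)        # the 440-Hz sinusoid of Figure 1(a)

X = np.fft.rfft(x)                     # spectrum of the real signal
freqs = np.fft.rfftfreq(len(x), 1 / fs)
peak = freqs[np.argmax(np.abs(X))]
print(peak)                            # 440.0: the single peak of Figure 1(b)
```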

Similarly, the Fourier transform of a sum of sinusoid components has several peaks at the right frequencies. However, this representation cannot give any information about the localization of the sinusoids in time: we do not know when each sinusoid appears in the signal. Time-frequency analysis solves this problem. Let's take a typical example from class.

Example:

    x(t) = cos(t)  for t < 10,
    x(t) = cos(3t) for 10 ≤ t < 20,
    x(t) = cos(2t) for t ≥ 20.


Figure 2. (a) The signal and its classic Fourier transform. (b) The time-frequency analysis of the signal.

In Figure 2, the most important difference is that the time-frequency analysis retains the time information of the signal. From Figure 2(b), we know that cos(t) appears from 0 to 10 s, cos(3t) from 10 to 20 s, and cos(2t) from 20 to 30 s. This is why we need time-frequency analysis. Except for fast convolution, all the applications of the classic Fourier transform can be carried over to time-frequency analysis.
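A minimal Rec-STFT sketch of this example (assuming a 100-Hz sampling rate and reading the three frequencies in Hz rather than rad/s, for readability) recovers which frequency is active in each interval:

```python
import numpy as np

fs = 100                                   # assumed sampling rate (Hz)
t = np.arange(0, 30, 1 / fs)
# the piecewise example, with the three frequencies read in Hz for readability
x = np.where(t < 10, np.cos(2 * np.pi * 1 * t),
    np.where(t < 20, np.cos(2 * np.pi * 3 * t),
                     np.cos(2 * np.pi * 2 * t)))

def rec_stft(x, fs, win_sec=4.0):
    """Rec-STFT: FFT of rectangular-windowed frames, hopped by half a window."""
    n = int(win_sec * fs)
    starts = range(0, len(x) - n + 1, n // 2)
    frames = np.array([x[i:i + n] for i in starts])
    times = np.array([(i + n / 2) / fs for i in starts])
    return np.fft.rfftfreq(n, 1 / fs), times, np.abs(np.fft.rfft(frames, axis=1))

freqs, times, spec = rec_stft(x, fs)
for s in (5, 15, 25):                      # dominant frequency near t = 5, 15, 25 s
    frame = spec[np.argmin(np.abs(times - s))]
    print(s, freqs[np.argmax(frame)])      # 1.0 Hz, then 3.0 Hz, then 2.0 Hz
```

Unlike the single FFT of Figure 2(a), the frame times tell us when each component is active.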

III. Basic Concepts about Music

This section will introduce some knowledge about music. Music is sound that has some stable frequencies over a time period. Music can be produced by


    Fig. 3.Middle C (262 Hz) played on a piano and a violin. The top pane shows the

    waveform, with the spectrogram below. Zoomed-in regions shown above the

    waveform reveal the 3.8-ms fundamental period of both notes.

Fig. 3 shows the waveforms and spectrograms of middle C (fundamental frequency 262 Hz) played on a piano and a violin. Zoomed-in views above the waveforms show the relatively stationary waveform with a 3.8-ms period in both cases. The spectrograms (calculated with a 46-ms window) show the harmonic series at integer multiples of the fundamental. Obvious differences between the piano and violin sounds include the decaying energy within the piano note and the slight frequency modulation (vibrato) on the violin.

Although different cultures have developed different musical conventions, a common feature is the musical scale, a set of discrete pitches that repeats every octave, from which melodies are constructed. For example, contemporary western music is based on the equal-tempered scale, which divides the octave into twelve equal steps on a logarithmic axis while still (almost) preserving the intervals corresponding to the most pleasant note combinations. The equal division makes each frequency 2^(1/12) ≈ 1.06 times larger than its predecessor, and this interval is


known as a semitone. There are twelve semitones in an octave, as shown in Figure 4. For example, if the frequency of A in one octave is 440 Hz, the A one octave higher is 880 Hz.

Figure 4. The twelve pitch classes of an octave.

It is a happy coincidence that the octave can be divided uniformly into such a small number of steps and still have these steps give close, if not exact, matches to the simple integer ratios that result in consonant harmonies, e.g., 2^(7/12) ≈ 1.498 ≈ 3/2.
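The two ratios quoted above can be checked directly:

```python
semitone = 2 ** (1 / 12)           # the equal-tempered semitone ratio
fifth = 2 ** (7 / 12)              # seven semitones: a perfect fifth
print(round(semitone, 4))          # 1.0595, the ~1.06x step quoted above
print(round(fifth, 4))             # 1.4983, very close to 3/2
```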

The western major scale spans the octave using seven of the twelve steps (the "white notes" on a piano), denoted C, D, E, F, G, A, B. The spacing between successive notes is two semitones, except for E/F and B/C, which are only one semitone apart. The black notes in between are named in reference to the note immediately below (e.g., C#) or above (e.g., Db), depending on musicological


conventions. The octave degree denoted by these symbols is sometimes known as the pitch's chroma, and a particular pitch can be specified by the concatenation of a chroma and an octave number (where each numbered octave spans C to B). The lowest note on a piano is A0 (27.5 Hz), the highest note is C8 (4186 Hz), and middle C (262 Hz) is C4.
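These reference pitches all follow from the semitone ratio. A small helper (hypothetical, assuming A4 = 440 Hz as the tuning reference) maps a chroma/octave pair to its frequency:

```python
NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_to_freq(chroma, octave, a4=440.0):
    """Frequency of a pitch such as ('A', 0) or ('C', 4), taking A4 = 440 Hz."""
    steps = NAMES.index(chroma) - NAMES.index('A') + 12 * (octave - 4)
    return a4 * 2 ** (steps / 12)        # semitone steps away from A4

print(round(pitch_to_freq('A', 0), 1))   # 27.5   (lowest piano note)
print(round(pitch_to_freq('C', 4), 1))   # 261.6  (middle C)
print(round(pitch_to_freq('C', 8)))      # 4186   (highest piano note)
```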

Fig. 5. Middle C, followed by the E and G above, then all three notes together (a C Major triad), played on a piano. The top pane shows the spectrogram; the bottom pane shows the chroma representation.

    B. Harmony

While sequences of pitches create melodies (the only part reproducible by a monophonic instrument such as the voice), another essential aspect of much music is harmony, the simultaneous presentation of notes at different pitches. Different combinations of notes result in different chords, which remain recognizable regardless of the instrument used to play them. Consonant harmonies tend to involve pitches with simple frequency ratios, indicating many shared harmonics. Fig. 5 shows middle C (262 Hz), E (330 Hz), and G (392 Hz) played on a piano; these three notes together form a C Major triad, a common harmonic unit in western music. The ubiquity of simultaneous pitches, with coincident or near-coincident harmonics, is a major challenge in the automatic analysis of music audio.


C. Tempo, Beat and Rhythm

The musical aspects of tempo, beat, and rhythm play a fundamental role in the understanding of music. The beat is the steady pulse that drives music forward and provides the temporal framework of a piece of music. Intuitively, the beat can be described as a sequence of perceived pulses that are regularly spaced in time and correspond to the pulse a human taps along to when listening to the music.

The term tempo then refers to the rate of the pulse. Musical pulses typically go along with note onsets or percussive events. Locating such events within a given signal constitutes a fundamental task, often referred to as onset detection. This topic is covered more comprehensively in Section VI.

IV. Time-Frequency Analysis and Musical Signal

Figure 6 shows several kinds of time-frequency analysis. This section introduces three time-frequency methods and the implementation results on musical signals.

Fig. 6. Time-frequency analysis methods.


A. Short-Time Fourier Transform and Gabor Transform

The short-time Fourier transform is a basic type of time-frequency analysis. For a continuous signal x(t), we can compute the short-time Fourier transform by

    X(t, f) = ∫ x(τ) w(t − τ) e^(−j2πfτ) dτ

where w(t) is a mask (window) function. When w(t) is a rectangular function, the transform is called the Rec-STFT. When w(t) is a Gaussian function, the transform is called the Gabor transform.

However, a musical signal is not a continuous signal; it is sampled at some sampling frequency. Therefore, we cannot use the continuous form to compute the Rec-short-time Fourier transform, so we change the original form to

    X(nΔt, mΔf) = Σ_{p=n−Q}^{n+Q} x(pΔt) e^(−j2πpm/N) Δt

where t = nΔt, f = mΔf, τ = pΔt, and B = QΔt.

There are some constraints because of the discrete form of the short-time Fourier transform. First, Δt·Δf = 1/N, where N is an integer. Second, N ≥ 2Q + 1. Third, Δt


Take a drum for example. Figure 7 shows the waveform of a drum. The length of the signal is 0.05 seconds, and the sampling frequency is 44100 Hz. The analysis was implemented in Matlab, with window widths of 0.005 and 0.002 s and a frequency band of 0-5000 Hz. The result is shown in Figure 8.

Figure 8. (a) Rec-STFT of a drum, window width B = 0.005 s. (b) Rec-STFT of a drum, B = 0.002 s. The vertical axis is frequency (Hz) and the horizontal axis is time (s).

As you can see, the fundamental frequency of the drum is about 2000 Hz, and there is an overtone at 4000 Hz. You can also see that when B = 0.005 s the white line is clearer, whereas when B = 0.002 s the line is rough and the resolution is poor. Therefore, the width of the window is also an important factor. Another example, on a piano, is shown in Fig. 9.

Figure 9. The analysis of the piano.


Figure 9 shows the waveform of the piano and the spectrum of the piano notes. The fundamental frequency is about 440 Hz, and there are several harmonic overtones at higher frequencies.

Figure 10 shows the spectrogram of the piano notes. The spectrogram is the squared magnitude of the STFT, so its interpretation is the same as that of the STFT. The spectrogram is computed by

    SP(t, f) = |X(t, f)|²

Figure 10. The spectrogram of piano notes.

Figure 11. (a) The piano waveform. (b) The STFT of the piano waveform.

Different window functions give different short-time Fourier transforms. Besides the rectangular and Gaussian functions, there are also the triangular function, the Hanning function, the Hamming function, and others you can imagine. Compared to the other functions, the Gaussian function gives better resolution because it is an eigenfunction of the Fourier transform, so it achieves good resolution in both the time domain and the frequency domain.

B. Wigner Distribution Function

The Wigner distribution function is also a useful tool for analyzing signals. It is computed by

    W(t, f) = ∫ x(t + τ/2) x*(t − τ/2) e^(−j2πfτ) dτ

where x(t) is the signal and x* is its complex conjugate.

The advantage of the Wigner distribution function (WDF) is its high clarity. However, it also has a high computational cost and the cross-term problem. Fig. 12 shows the comparison between the Gabor transform and the WDF.

Figure 12. Comparing the WDF to the Gabor transform.
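A direct discrete sketch of this definition (the signal, its length, and the frequency-axis convention are illustrative assumptions):

```python
import numpy as np

def wigner(x):
    """Discrete Wigner distribution sketch:
    W(n, m) = sum_k x[n+k] x*[n-k] e^(-j 4 pi m k / N),
    so a tone at f0 peaks at bin m = f0*N/fs (axis f_m = m*fs/N, m < N/2)."""
    N = len(x)
    W = np.zeros((N, N // 2))
    m = np.arange(N // 2)                  # keep frequencies below Nyquist
    for n in range(N):
        kmax = min(n, N - 1 - n)           # largest symmetric lag at time n
        k = np.arange(-kmax, kmax + 1)
        r = x[n + k] * np.conj(x[n - k])   # instantaneous autocorrelation
        W[n] = np.real(np.exp(-4j * np.pi * np.outer(m, k) / N) @ r)
    return W

fs = N = 128                               # assumed sampling rate and length
n = np.arange(N)
x = np.exp(2j * np.pi * 32 * n / fs)       # a single 32-Hz complex tone
W = wigner(x)
print(np.argmax(W[N // 2]))                # energy concentrates at bin 32
```

For a single tone the energy concentrates sharply on one bin (the high clarity mentioned above); feeding a sum of two tones into the same code would also show the cross-term midway between them.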


V. Time-Frequency Representation

Although the spectrogram is profoundly useful, it has one drawback: it displays frequencies on a uniform scale, whereas musical scales are based on a logarithmic frequency scale. Below we describe how such a logarithmic scale relates to human hearing and how it leads to a new type of time-frequency analysis. We introduce two types of representation.

A. Log-Frequency Spectrogram

As mentioned above, our perception of music defines a logarithmic frequency scale, with each doubling in frequency (an octave) corresponding to an equal musical interval. This motivates the use of time-frequency representations with a similar logarithmic frequency axis, which in fact correspond more closely to the representation in the ear. (Because the bandwidth of each bin varies in proportion to its center frequency, these representations are also known as constant-Q transforms, since each filter's effective ratio of center frequency to bandwidth, its Q, is the same.) The constant-Q transform is a type of time-frequency analysis developed from the short-time Fourier transform; it converts a data series to the frequency domain and is computed by

    X(k) = (1/N(k)) Σ_{n=0}^{N(k)−1} W(k, n) x(n) e^(−j2πQn/N(k))

where N(k) = Q·fs/fk, W(k, n) = α − (1 − α) cos(2πn/N(k)), fs is the sampling rate, Q = fk/Δfk, fk is the center frequency, and α is a number between zero and one.

With, for instance, 12 frequency bins per octave, the result is a representation with one bin per semitone of the equal-tempered scale.

    A simple way to achieve this is as a mapping applied to an STFT

    representation. Each bin in the log-frequency spectrogram is formed as a linear


weighting of the corresponding frequency bins from the original spectrogram. For a log-frequency axis with K_L bins, this calculation can be expressed in matrix notation as Y = MX, where Y is the log-frequency spectrogram with K_L rows and T columns, X is the original STFT magnitude array |X(t, k)| (with t indexing columns and k indexing rows), and M is a weighting matrix with K_L rows, each of K+1 columns, that gives the weight with which STFT bin X(·, k) contributes to log-frequency bin Y(l, ·). For instance, using a Gaussian window,

    M(l, k) = exp( −(1/2) · ((log2(fk/fmin) − l/N0) / B)² )

where B defines the bandwidth of the filter bank as the frequency difference (in octaves) at which the bin's weight has fallen to exp(−1/2) of its peak gain, fmin is the frequency of the lowest bin (l = 0), and N0 is the number of bins per octave on the log-frequency axis. The calculation is illustrated in Fig. 13, where the top-left image is the matrix M, the top right is the conventional spectrogram X, and the bottom right shows the resulting log-frequency spectrogram Y.

Figure 13. Calculation of a log-frequency spectrogram as a column-wise linear mapping of bins from a conventional (linear-frequency) spectrogram.
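The Y = MX mapping can be sketched as follows, with fmin = 110 Hz and one bin per semitone as in the figure (the STFT size, sampling rate, and three-octave range are assumptions):

```python
import numpy as np

def logfreq_matrix(K, fs, n_fft, fmin=110.0, bins_per_octave=12, KL=36, B=0.25):
    """Gaussian weighting matrix M (KL x K) mapping linear STFT bins to
    log-frequency bins; row l peaks at fmin * 2^(l / bins_per_octave), and B is
    the bandwidth in octaves at which the weight falls to exp(-1/2)."""
    fk = np.arange(K) * fs / n_fft                 # STFT bin center frequencies
    fk[0] = 1e-6                                   # avoid log(0) at the DC bin
    oct_k = np.log2(fk / fmin)                     # STFT bins on an octave axis
    l = np.arange(KL)[:, None]
    return np.exp(-0.5 * ((oct_k - l / bins_per_octave) / B) ** 2)

fs, n_fft = 11025, 1024
M = logfreq_matrix(n_fft // 2 + 1, fs, n_fft)
X = np.abs(np.random.default_rng(0).normal(size=(n_fft // 2 + 1, 10)))  # dummy |STFT|
Y = M @ X                                          # log-frequency spectrogram
print(Y.shape)                                     # (36, 10): 3 octaves x 12 semitones
```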


Drawback of the Log-Frequency Spectrogram

Although conceptually simple, such a mapping often gives unsatisfactory results. In the figure, the logarithmic frequency axis uses one bin per semitone, starting at fmin = 110 Hz (A2). At this point, the log-frequency bins have centers only 6.5 Hz apart; to have these centered on distinct STFT bins would require a window of 153 ms, or almost 7000 points at a 44.1-kHz sampling rate. Using a 64-ms window, as in the figure, blurs the low-frequency bins.

The long time window required to achieve semitone resolution at low frequencies has serious implications for the temporal resolution of any analysis. Since human perception of rhythm can often discriminate changes of 10 ms or less, an analysis window of 100 ms or more can lose important temporal structure. One popular alternative to a single STFT analysis is to construct a bank of individual band-pass filters, for instance one per semitone, each tuned to the appropriate bandwidth and with minimal temporal support. Although this loses the famed computational efficiency of the fast Fourier transform, some of it may be regained by processing the highest octave with an STFT-based method, down-sampling by a factor of 2, then repeating for as many octaves as are desired. However, this results in a different sampling rate for each octave of the analysis, raising further computational issues.

B. Time-Chroma Representation

Some applications are primarily concerned with the chroma of the notes present, but less with the octave. Foremost among these is chord transcription: the annotation of the current chord as it changes through a song. Chords are a joint property of all the notes sounding at or near a particular point in time, for instance the C Major chord of Fig. 5, which is the unambiguous label of the three notes C, E,


    and G. Chords are generally defined by three or four notes, but the precise octave

    in which those notes occur is of secondary importance. Thus, for chord recognition,

    a representation that describes the chroma present but folds the octaves

    together seems ideal.

A typical chroma representation consists of a 12-bin vector for each time step, one bin for each chroma class from C to B. Given a log-frequency spectrogram with semitone resolution, as in the preceding section, one way to create chroma vectors is simply to add together all the bins corresponding to each distinct chroma. More involved approaches may try to include energy only from strong sinusoidal components in the audio, and to exclude non-tonal energy such as percussion and other noise.
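With one bin per semitone, the octave folding described above is just a sum over every twelfth bin (a sketch, assuming the lowest log-frequency bin corresponds to a known chroma class):

```python
import numpy as np

def chroma_from_logfreq(Y):
    """Fold a 12-bins-per-octave log-frequency spectrogram into 12 chroma bins
    (assumes the lowest bin corresponds to a known chroma class)."""
    C = np.zeros((12, Y.shape[1]))
    for l in range(Y.shape[0]):
        C[l % 12] += Y[l]            # same chroma bin every twelve semitones
    return C

Y = np.zeros((36, 1))
Y[[0, 12, 24], 0] = 1.0              # one pitch class sounding in three octaves
Y[7, 0] = 0.5                        # a weaker note seven semitones up
C = chroma_from_logfreq(Y)
print(C[:, 0])                       # chroma 0 collects 3.0; chroma 7 holds 0.5
```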

Figure 14. Three representations of a chromatic scale comprising every note on the piano from lowest to highest. Top pane: conventional spectrogram (93-ms window). Middle pane: log-frequency spectrogram (186-ms window). Bottom pane: chroma-gram (based on the 186-ms window).

Fig. 14 shows a chromatic scale consisting of all 88 piano keys played one per second in an ascending sequence. The top pane shows the conventional, linear-frequency spectrogram, and the middle pane shows a log-frequency spectrogram calculated as in Fig. 13. Notice how the constant ratio between the fundamental frequencies of successive notes appears as exponential growth on a linear axis but becomes a straight line on a logarithmic axis. The bottom pane shows a 12-bin chroma representation (a chroma-gram) of the same data.

Drawback of the Time-Chroma Spectrogram

Even though only one note sounds at each time, notice that very few notes result in a chroma vector with energy in only a single bin. This is because although the fundamental is mapped neatly into the appropriate chroma bin, as are the harmonics at 2f0, 4f0, 8f0, etc. (all related to the fundamental by octaves), the other harmonics map onto other chroma bins. The harmonic at 3f0, for instance, corresponds to an octave plus 7 semitones, since 2^((12+7)/12) ≈ 3; thus, for the C4 sounding at 40 s, the second most intense chroma bin after C is the G seven steps higher. Other harmonics fall in other bins, giving the more complex pattern. Many musical notes have the highest energy in the fundamental, and even with a weak fundamental, the root chroma is the bin into which the greatest number of low-order harmonics fall; but for a note with energy spread across a large number of harmonics, such as the lowest notes in the figure, the chroma vector can become quite cluttered.
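The chroma offset of each harmonic follows from rounding 12·log2(h) to the nearest semitone, which reproduces the pattern described above:

```python
import math

# chroma offset (semitones above the root, mod 12) of the harmonics h*f0
offsets = {h: round(12 * math.log2(h)) % 12 for h in range(1, 9)}
print(offsets)   # {1: 0, 2: 0, 3: 7, 4: 0, 5: 4, 6: 7, 7: 10, 8: 0}
```

Harmonics 1, 2, 4, and 8 stay on the root chroma, while 3f0 and 6f0 land seven semitones up (the G above a C root), matching the cluttered chroma-gram of Fig. 14.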

One might think that attenuating higher harmonics would give better chroma representations by reducing these alias terms. In fact, many applications are improved by whitening the spectrum, i.e., boosting weaker bands to make the energy approximately constant across the spectrum. This helps remove differences arising from the different spectral balance of different musical instruments, and hence better represents the tonal content.

Chroma representations may use more than 12 bins per octave to reflect finer pitch variations, while still retaining the property of combining energy from frequencies separated by an octave. To obtain robustness against global mis-tunings, practical chroma analyses need to employ some kind of adaptive tuning, for instance by building a histogram of the differences between the frequencies of all strong harmonics and the nearest quantized semitone frequency, then shifting the semitone grid to match the peak of this histogram. It is, however, useful to limit the range of frequencies over which chroma is calculated. Human pitch perception is most strongly influenced by harmonics that occur in a dominance region between about 400 and 2000 Hz. Thus, after whitening, the harmonics can be shaped by a smooth, tapered frequency window to favor this range.

VI. Other Applications on Musical Signals

A. Onset Detection and Novelty Curve

The objective of onset detection is to determine the physical starting times of notes or other musical events as they occur in a music recording. The general idea is to capture sudden changes in the music signal, which are typically caused by the onset of novel events. The result is a so-called novelty curve, whose peaks indicate onset candidates. For example, playing a note on a percussive instrument typically results in a sudden increase of the signal's energy; see Fig. 15(a). With such a pronounced attack phase, note onset candidates may be determined by locating the time positions where the signal's amplitude envelope starts to increase. Much more challenging, however, is the detection of onsets in non-percussive music, where one often has to deal with soft onsets or blurred note transitions. This is often the case for vocal music or classical music dominated by string instruments.


Figure 15. Waveform of the beginning of "Another One Bites the Dust" by Queen. (a) Note onsets. (b) Beat positions.

Furthermore, in complex polyphonic mixtures, simultaneously occurring events may mask each other, which makes it hard to detect individual onsets. As a consequence, more refined methods have to be used to compute the novelty curves, e.g., by analyzing the signal's spectral content, pitch, harmony, or phase. To handle the variety of signal types, a combination of novelty curves designed for particular classes of instruments can improve the detection accuracy.

To illustrate some of these ideas, we now describe a typical spectral-based approach for computing novelty curves. Given a music recording, a short-time Fourier transform is used to obtain a spectrogram X = (X(t, k)) with k ∈ [0 : K] and t ∈ [0 : T−1]. Note that the Fourier coefficients of X are linearly spaced on the frequency axis. Using suitable binning strategies, various approaches switch over to a logarithmically spaced frequency axis. Keeping the linear frequency axis puts greater emphasis on the high-frequency regions of the signal, thus accentuating the aforementioned noise bursts visible as high-frequency content. One simple, yet


important step, often applied in the processing of music signals, is logarithmic compression. In our context, this step consists of applying a logarithm to the magnitude spectrogram |X| of the signal, yielding Y = log(1 + C·|X|) for a suitable constant C > 1. Such a compression step not only accounts for the logarithmic sensation of human sound intensity but also balances out the dynamic range of the signal. In particular, by increasing C, low-intensity values in the high-frequency spectrum become more prominent. This effect is clearly visible in Fig. 16.
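A tiny numeric sketch of the compression step (the three magnitudes are made up):

```python
import numpy as np

def log_compress(X, C=1000.0):
    """Logarithmic compression Y = log(1 + C|X|), as used for Fig. 16(c)."""
    return np.log(1.0 + C * np.abs(X))

X = np.array([1e-4, 1e-2, 1.0])     # weak, medium, strong magnitudes
Y = log_compress(X)
print(Y)                            # the 10000:1 spread shrinks to roughly 70:1
```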

Figure 16. (a) Score representation. (b) Magnitude spectrogram. (c) Compressed spectrogram using C = 1000. (d) Novelty curve derived from (b). (e) Novelty curve derived from (c).

To obtain a novelty curve, one basically computes the discrete derivative of the compressed spectrum Y. More precisely, one sums up only the positive intensity changes, emphasizing onsets while discarding offsets, to obtain the novelty function

    Δ(t) = Σ_{k=0}^{K} max(0, Y(t+1, k) − Y(t, k))

for t ∈ [0 : T−2].

Fig. 16(e) shows a typical novelty curve for our Shostakovich example. As mentioned above, one often processes the spectrum in a band-wise fashion, obtaining a novelty curve for each band separately. These novelty curves are then weighted and summed to yield a final novelty function.
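Band-wise details aside, the core positive-difference step can be sketched on a toy spectrogram:

```python
import numpy as np

def novelty_curve(Y):
    """Sum the positive frame-to-frame changes of a (compressed) spectrogram:
    delta(t) = sum_k max(0, Y(t+1, k) - Y(t, k))."""
    diff = np.diff(Y, axis=1)                 # Y(t+1, k) - Y(t, k)
    return np.sum(np.maximum(diff, 0.0), axis=0)

# toy spectrogram: energy switches on at frame 3 and off again at frame 7
Y = np.zeros((4, 10))
Y[:, 3:7] = 1.0
d = novelty_curve(Y)
print(np.argmax(d))   # 2: the step into frame 3; the offset at frame 7 is discarded
```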

The peaks of the novelty curve typically indicate the positions of note onsets. Therefore, to explicitly determine the onset positions, one employs peak-picking strategies based on fixed or adaptive thresholds. In the case of noisy novelty curves with many spurious peaks, however, this is a fragile and error-prone step; selecting the relevant peaks that correspond to true note onsets becomes a difficult or even infeasible problem. For example, in the Shostakovich waltz, the first beats (downbeats) of the 3/4 meter are played softly by non-percussive instruments, leading to relatively weak and blurred onsets, whereas the second and third beats are played staccato, supported by percussive instruments. As a result, the peaks of the novelty curve corresponding to downbeats are hardly visible or even missing, whereas the peaks corresponding to the percussive beats are much more pronounced; see Fig. 16(e).

B. Periodicity Analysis and Tempo Estimation

Generally speaking, one can perform this analysis with three different methods. The autocorrelation method detects periodic self-similarities by comparing a novelty curve with time-shifted (localized) copies of itself. Another widely used method is based on a bank of comb-filter resonators, where a novelty curve is compared with templates that consist of equally spaced spikes covering a range of


periods and phases. Third, the short-time Fourier transform can be used to derive a time-frequency representation of the novelty curve. Here, the novelty curve is compared with templates consisting of sinusoidal kernels, each representing a specific frequency. Each of these methods reveals periodicity properties of the underlying novelty curve, from which one can estimate the tempo or beat structure.

Figure 17. Excerpt of Shostakovich's Waltz No. 2. (a) Fourier tempo-gram. (b) Autocorrelation tempo-gram.

For example, suppose that a music signal has a dominant tempo of τ = 220 BPM (beats per minute) around position t; then the tempo-gram has a correspondingly large value T(t, τ), as in Fig. 17. In practice, one often has to deal with tempo ambiguities, where a tempo τ is confused with its integer multiples 2τ, 3τ, ... (referred to as harmonics of τ) and integer fractions τ/2, τ/3, ... (referred to as sub-harmonics of τ). To avoid such ambiguities, a mid-level tempo representation referred to as a cyclic tempo-gram can be constructed, in which tempi differing by a power of two are identified.

A tempo-gram can be obtained by analyzing a novelty curve with respect to local periodic patterns using a short-time Fourier transform. To this end, one fixes a window function W of finite length centered at t = 0. Then, for a frequency parameter w, the complex Fourier coefficient F(t, w) is defined by

    F(t, w) = Σ_n Δ(n) W(n − t) e^(−j2πwn)

Note that the frequency parameter w (measured in Hertz) corresponds to the tempo parameter τ = 60w (measured in BPM). Therefore, one obtains a discrete Fourier tempo-gram by

    T^F(t, τ) = |F(t, τ/60)|

As an example, Fig. 17(a) shows the tempo-gram of our Shostakovich example from Fig. 16. Note that T^F reveals a slightly increasing tempo over time, starting at roughly τ = 225 BPM. T^F also reveals the second tempo harmonic, starting at τ = 450 BPM. In fact, since the novelty curve locally behaves like a train of positive clicks, it is not hard to see that Fourier analysis responds to harmonics but tends to suppress sub-harmonics.
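A sketch of the Fourier tempo-gram idea, simplified to a single full-length window and an idealized click-track novelty curve (the 10-ms resolution and tempo grid are assumptions):

```python
import numpy as np

def fourier_tempogram(nov, r, bpms):
    """|F(w)| of a novelty curve over a BPM grid (tau = 60*w), using one
    full-length window instead of the sliding window W, for simplicity."""
    n = np.arange(len(nov))
    return np.array([abs(np.sum(nov * np.exp(-2j * np.pi * (bpm / 60.0) * n * r)))
                     for bpm in bpms])

r = 0.01                                  # novelty curve sampled every 10 ms
nov = np.zeros(1000)                      # 10 s of novelty curve
nov[::50] = 1.0                           # one click every 0.5 s -> 120 BPM
bpms = np.arange(60, 181)                 # tempo grid below the first harmonic
T = fourier_tempogram(nov, r, bpms)
print(bpms[np.argmax(T)])                 # strongest response at 120 BPM
```

Consistent with the remark above, the sub-harmonic at 60 BPM gets an almost zero response here (successive clicks cancel in phase), while the 240 BPM harmonic would respond as strongly as 120 BPM if it were included in the grid.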

Next we introduce the autocorrelation-based method. To obtain a discrete autocorrelation tempo-gram, one again fixes a window function W of finite length centered at t = 0. The local autocorrelation is then computed by comparing the windowed novelty curve with time-shifted copies of itself. Here, we use the unbiased local autocorrelation

    A(t, l) = (1/(L − l)) Σ_n Δ(n) W(n − t) Δ(n + l) W(n − t + l)

where L is the window length. Now, to convert the lag parameter l into a tempo parameter, one needs to know the sampling rate. Supposing that each time parameter t corresponds to r seconds, the lag l corresponds to the tempo τ = 60/(rl) BPM. From this, one obtains the autocorrelation tempo-gram T^A by

    T^A(t, 60/(rl)) = A(t, l)
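The lag-to-tempo conversion τ = 60/(rl) can be sketched with a global (unwindowed) unbiased autocorrelation on the same idealized click track (all parameters are assumptions):

```python
import numpy as np

def autocorr_tempogram(nov, r, max_lag):
    """Unbiased autocorrelation of a novelty curve over lags 1..max_lag-1,
    with each lag l converted to the tempo tau = 60/(r*l) BPM.
    (A single global window is used instead of the sliding window W.)"""
    N = len(nov)
    lags = np.arange(1, max_lag)
    A = np.array([np.dot(nov[:N - l], nov[l:]) / (N - l) for l in lags])
    return 60.0 / (r * lags), A

r = 0.01
nov = np.zeros(1000)
nov[::50] = 1.0                      # one click every 0.5 s (120 BPM)
bpm, A = autocorr_tempogram(nov, r, 75)
print(round(bpm[np.argmax(A)]))      # best lag is 0.5 s -> 120 BPM
```

Lags at integer multiples of the beat period (1.0 s, 1.5 s, ...) respond just as strongly, which is exactly the sub-harmonic behavior of T^A described below; the lag grid is limited to under 0.75 s here to sidestep that ambiguity.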

    Finally, using standard resampling and interpolation techniques applied to

  • 8/13/2019 Time-Frequency Analysis for Music Signal Analysis

    25/32

    25

the tempo domain, one can derive an autocorrelation tempo-gram T_A that is

defined on the same tempo set as the Fourier tempo-gram T_F. The tempo-gram T_A

for our Shostakovich example is shown in Fig. 17(b). It clearly indicates the

sub-harmonics. Actually, the parameter τ = 75 is the third sub-harmonic of

τ = 225 and corresponds to the tempo on the measure level.
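The unbiased local autocorrelation and the lag-to-tempo conversion can be sketched in the same style as before. Again this is an illustrative sketch with assumed parameters (Hann window, hop size, toy click track), not the cited implementation.

```python
import numpy as np

def autocorrelation_tempogram(novelty, fs, win_len, hop, max_lag):
    """Unbiased local autocorrelation of a windowed novelty curve.
    Row l corresponds to lag l samples, i.e. tempo 60*fs/l BPM."""
    window = np.hanning(win_len)
    starts = np.arange(0, len(novelty) - win_len + 1, hop)
    A = np.zeros((max_lag, len(starts)))
    for j, s in enumerate(starts):
        seg = novelty[s:s + win_len] * window
        for l in range(1, max_lag):
            # unbiased estimate: divide by the number of overlapping samples
            A[l, j] = np.dot(seg[:win_len - l], seg[l:]) / (win_len - l)
    return A

# Same toy input as above: clicks every 0.5 s (a period of 50 samples)
fs = 100
novelty = np.zeros(6 * fs)
novelty[::fs // 2] = 1.0
A = autocorrelation_tempogram(novelty, fs, win_len=3 * fs, hop=fs, max_lag=2 * fs)
lag = np.argmax(A[1:, 0]) + 1
print(60 * fs / lag)   # → 120.0
```

In contrast to the Fourier case, the autocorrelation also shows secondary peaks at lags 100 and 150 samples (60 and 40 BPM), i.e. at the sub-harmonics, matching the behavior seen in Fig. 17(b).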

Figure.18 Excerpt of the Mazurka Op.30 No.2 (a) Score (b) Fourier tempogram with

reference tempo (c) Beat position

    Assuming a more or less steady tempo, most tempo estimation approaches

    determine only one global tempo value for the entire recording. For example, such

    a value may be obtained by averaging the tempo values obtained from a frame-

    wise periodicity analysis. Dealing with music with significant tempo changes, the

    task of local tempo estimation becomes a much more difficult problem. See Fig. 18

for a complex example. Having computed a tempo-gram, the frame-wise maximum

yields a good indicator of the locally dominating tempo; however, one often has to

struggle with confusions between tempo harmonics and sub-harmonics. Here, the

results can be improved by a combined usage of Fourier and autocorrelation

tempo-grams.


    C. Harmonic Pitch Class Profiles (HPCP)

To detect pitch correctly, we have to extract a feature that represents

the pitch clearly. The tool for this is the Harmonic Pitch Class Profile (HPCP).

The HPCP is an enhanced pitch distribution feature, also called chroma. We can

process musical signals to obtain the HPCP feature and then use it to measure

similarity. We will focus on how to compute the HPCP feature, because the

process is also related to time-frequency analysis.

    Figure.19 General HPCP feature extraction block diagram. Music signals are

    converted to a sequence of HPCP vectors that evolves with time

After a musical signal is input, we first perform spectral analysis to find its

frequency components. A constant-Q transform can be used to convert the signal


into a spectrogram. After the constant-Q transform, frequency filtering is applied,

so only the frequency band between 100 and 5000 Hz is used. Peak detection is then

applied, so only the local maxima of the spectrum are considered.

In the reference frequency computation procedure, we estimate the tuning

deviation with respect to 440 Hz.

The frequency to pitch class mapping is a procedure for determining pitch

class values from frequency values. A weighting scheme using a cosine function

is introduced, and the presence of harmonic frequencies is considered, taking

into account a total of 8 harmonics for each frequency. Values are mapped with a

resolution of one-third of a semitone, so the size of the pitch class

distribution vector is 36.

Finally, in the post-processing, we normalize the feature frame by frame,

dividing by the maximum value in order to eliminate the dependency on global

loudness. The result looks like Figure.20.

Figure.20 Example of an HPCP sequence.
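The extraction steps above can be sketched in a much-simplified form, assuming the spectral peaks have already been detected. This is only an illustration: the nearest-bin assignment below stands in for the cosine weighting window, and the harmonic decay factor is an assumed value, not the one from the cited system.

```python
import numpy as np

def hpcp(peak_freqs, peak_mags, f_ref=440.0, bins=36, n_harm=8, decay=0.6):
    """Fold spectral peaks into a 36-bin (1/3-semitone) pitch-class profile,
    crediting each of the first n_harm harmonics of a candidate fundamental
    (harmonic k attenuated by decay**(k-1)), then normalize by the maximum."""
    profile = np.zeros(bins)
    for f, m in zip(peak_freqs, peak_mags):
        if not (100.0 <= f <= 5000.0):        # band limiting, as in the text
            continue
        for k in range(1, n_harm + 1):
            f0 = f / k                        # candidate fundamental of harmonic k
            b = int(np.round(bins * np.log2(f0 / f_ref))) % bins
            profile[b] += m * decay ** (k - 1)
    peak = profile.max()
    return profile / peak if peak > 0 else profile

# Peaks of an idealized A4 tone (440 Hz plus two overtones)
v = hpcp([440.0, 880.0, 1320.0], [1.0, 0.5, 0.3])
print(np.argmax(v))   # → 0 (bin 0 = pitch class of the 440 Hz reference)
```

Note how the octave fold (the modulo) sends 440 Hz and 880 Hz to the same bin, and how the frame-wise normalization makes the strongest bin equal to 1 regardless of loudness.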

After we get the HPCP feature, we know the pitch content within each time

section. It has been used in many papers to compute the similarity between two

songs. Figure.21 shows a system for measuring similarity between two songs.

First, we use time-frequency analysis to extract the HPCP features. Then we set the two


songs' HPCPs relative to a global HPCP, so there is a common standard for

comparison. Next, the two features are used to construct a binary similarity

matrix. The Smith-Waterman algorithm is applied to construct a local alignment

matrix H in the dynamic programming local alignment step. Finally, after some

post-processing, we can compute the distance between the two songs and use a

distance threshold to select the songs we want.

Figure.21 Example of a music similarity measure.
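The local alignment step can be illustrated with a minimal Smith-Waterman-style recursion over a binary similarity matrix. The gap penalty value here is an illustrative assumption, not the scoring from the cited cover-song system.

```python
import numpy as np

def local_alignment_score(sim, gap=0.5):
    """Smith-Waterman-style local alignment over a binary similarity matrix:
    H[i, j] accumulates matching diagonal runs; mismatches and insertions or
    deletions are penalized, and scores are clipped at zero (local alignment)."""
    n, m = sim.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1, j - 1] + (1.0 if sim[i - 1, j - 1] else -gap)
            H[i, j] = max(0.0, diag, H[i - 1, j] - gap, H[i, j - 1] - gap)
    return H.max()

# Two identical 5-frame sequences give a perfect diagonal of matches
sim = np.eye(5, dtype=bool)
print(local_alignment_score(sim))   # → 5.0
```

The zero clipping is what makes the alignment local: a long matching excerpt scores highly even if the rest of the two songs differ completely.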

    D. Modified HHT for Detecting Fundamental Frequency

The traditional Hilbert-Huang transform (HHT) is computed in two stages. First, empirical mode decomposition (EMD) sifts the signal into intrinsic mode functions (IMFs):

x(t) = Σ_{i=1..n} c_i(t) + r_n(t),

where the c_i are the IMFs and r_n is the residue. Then the Hilbert transform of each IMF yields an analytic signal c_i(t) + j·H[c_i](t) = a_i(t)·e^(jθ_i(t)), from which the instantaneous frequency ω_i(t) = dθ_i(t)/dt is obtained.

However, the traditional HHT has some drawbacks. First, when a signal

has multiple primary frequency components, the same frequency component may

not reside in the same IMF. Second, a small perturbation in the neighborhood

of an extremum may change its position, which also changes the upper and lower

envelope curves and the mean function; more iterations are then needed to sift

out an IMF with a suitable scale, complicating the stopping criterion and

increasing the computational complexity. Third, HHT is very sensitive to

non-stationary components, whose existence complicates the task of fundamental

frequency estimation.


Therefore, a modified HHT has been proposed for fundamental frequency

estimation. Its block diagram is shown in Figure.22.

Figure.22 The modified HHT block diagram for fundamental frequency estimation.

After the signal is segmented by a window, a filter bank decomposes it

into several narrowband music signals. Weak bands are then discarded using an

energy threshold. EMD is used to obtain each individual band's IMFs. In the

next step, IMFs whose frequencies fall outside the pass-band are discarded.

Then the IMF containing the fundamental frequency is selected as the one with

maximum correlation with the original signal. Finally, the traditional Hilbert

transform is applied to the selected IMF, and the median of the instantaneous

frequency inside the effective window is taken as the fundamental frequency.

Figure.23 IMFs of C4 (261 Hz) by sifting (a) without and (b) with the filter-bank

pre-processing.
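The last two steps of the pipeline, selecting the IMF by correlation and taking the median instantaneous frequency, can be sketched as below. This assumes the IMFs were already extracted by EMD (a pure tone stands in for one here), and the FFT-based analytic signal stands in for the Hilbert transform; all names and parameters are illustrative.

```python
import numpy as np

def analytic_signal(x):
    """Discrete analytic signal via the FFT (one-sided spectrum doubling)."""
    N = len(x)
    h = np.zeros(N)
    h[0] = 1.0
    h[1:(N + 1) // 2] = 2.0
    if N % 2 == 0:
        h[N // 2] = 1.0
    return np.fft.ifft(np.fft.fft(x) * h)

def f0_from_imfs(imfs, x, fs):
    """Select the IMF best correlated with the original signal, then take the
    median instantaneous frequency of its analytic phase as the f0 estimate."""
    corrs = [abs(np.corrcoef(imf, x)[0, 1]) for imf in imfs]
    imf = imfs[int(np.argmax(corrs))]
    phase = np.unwrap(np.angle(analytic_signal(imf)))
    inst_freq = np.diff(phase) * fs / (2 * np.pi)
    return float(np.median(inst_freq))

# Toy check: a pure C4 (261 Hz) tone stands in for one extracted IMF
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 261 * t)
imfs = [x, 0.1 * np.random.default_rng(0).standard_normal(fs)]
print(round(f0_from_imfs(imfs, x, fs)))   # → 261
```

The median is used instead of the mean precisely because instantaneous frequency estimates are noisy near window boundaries and at weak amplitudes.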


The modified HHT algorithm has three unique features. First, it uses a

mirror approach to estimate the extrema outside the window in the EMD process.

Second, it uses Rilling's stopping criterion in the EMD process to handle the

mode-mixing problem. Third, to solve the problem of sub-harmonics and partials,

it discards the weak bands in the original signal. Therefore, it performs

better at estimating the fundamental frequency.

Experimental results comparing the modified HHT to other methods are

shown in Figure.24. From this section, we see that the Hilbert-Huang transform

can also be used to estimate the fundamental frequency, so it is useful for

musical signals.

Figure.24 Performance comparison of hit rates of the YIN method, the HHT method,

and the modified HHT.


VII. Conclusion

In this tutorial, we have seen that time-frequency analysis is more powerful

than the classic Fourier transform for analyzing music signals. There are many

types of time-frequency analysis, such as the short-time Fourier transform and

the Wigner distribution function. However, not all time-frequency methods are

appropriate for processing music signals; we need to choose according to the

situation.

Musical scales are based on a logarithmic frequency scale, which is why we

introduced the log-frequency spectrogram and the time-chroma representation.

There are many applications in which time-frequency analysis can be used to

process musical signals, for instance beat detection, tempo estimation, and

similarity measurement. Moreover, the Hilbert-Huang transform has some

drawbacks, and the modified Hilbert-Huang transform adds some pre-processing

before the HHT in order to adapt it to musical signals.

There may still be applications of time-frequency analysis not discussed

here. I will keep studying to improve my knowledge, and I hope this tutorial

offers readers the basic related knowledge.

VIII. References

[1] Joan Serra, Emilia Gomez, Perfecto Herrera, and Xavier Serra, "Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification," August 2008.

[2] William J. Pielemeier, Gregory H. Wakefield, and Mary H. Simoni, "Time-Frequency Analysis of Musical Signals," September 1996.

[3] Jeremy F. Alm and James S. Walker, "Time-Frequency Analysis of Musical Instruments," 2002.

[4] Monika Dorfler, "What Time-Frequency Analysis Can Do to Music Signals," April 2004.

[5] EnShuo Tsau, Namgook Cho, and C.-C. Jay Kuo, "Fundamental Frequency Estimation for Music Signals with Modified Hilbert-Huang Transform."

[6] Meinard Muller, Daniel P. W. Ellis, Anssi Klapuri, and Gael Richard, "Signal Processing for Music Analysis," IEEE Journal of Selected Topics in Signal Processing, Vol. 5, No. 6, October 2011.

[7] Masataka Goto, "An Audio-based Real-time Beat Tracking System for Music With or Without Drum-sounds," Journal of New Music Research, Vol. 30, No. 2, pp. 159-171, 2001.

[8] Kuo-Cyuan Kuo, "Fractional Fourier Transform and Time-Frequency Analysis and Apply to Acoustic Signals," Master Thesis, June 2008.

[9] Chung-Han Huang, tutorial "Time-Frequency Analysis for Music Signal Analysis."

[10] J. J. Ding, slides of "Time-Frequency Analysis and Wavelet Transform."