Singer Similarity
A Brief Literature Review
Catherine Lai
MUMT-611 MIR
March 24, 2005
Outline of Presentation
Introduction
– Motivation
– Related research
Recent publications
– Kim & Whitman, 2002
– Liu & Huang, 2002
– Tsai, Wang, Rodgers, Cheng & Yu, 2003
– Bartsch & Wakefield, 2004
Discussion and Conclusion
Introduction
Motivation
– Multitude of audio files circulating on the Internet
– Replace human documentation efforts and organize collections of music recordings automatically
– Singer identification is relatively easy for humans but not for machines
Related Research
– Speaker identification
– Musical instrument identification
Kim & Whitman, 2002.
“Singer Identification in Popular Music Recordings Using Voice Coding Features” (MIT Media Lab)
Automatically establish the identity of the singer using acoustic features extracted from songs in a DB of pop music
– Performs segmentation of the vocal regions prior to singer identification
– Classifier uses features drawn from voice coding based on Linear Predictive Coding (LPC)
– LPC is good at highlighting formant locations; regions of resonance are perceptually significant
Kim & Whitman, 2002.
Detection of Vocal Region
To detect regions of singing, detect energy within the frequencies bounded by the range of vocal energy
– Filter the audio signal with a band-pass filter
– A Chebyshev IIR digital filter of order 12 was used
This attenuates other instruments that fall outside the vocal range, e.g. bass and cymbals
– The voice is not the only instrument remaining in this region
To discriminate the other sounds, e.g. drums, use a measure of harmonicity
– A vocal segment is > 90% voiced and thus highly harmonic
– Measure the harmonicity of the filtered signal within an analysis frame and threshold the harmonicity against a fixed value
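The detection steps above can be sketched as follows. The band edges, frame length, harmonicity threshold, and the autocorrelation-peak harmonicity measure are illustrative assumptions for this sketch, not the paper's exact values.

```python
import numpy as np
from scipy.signal import cheby1, sosfilt

def detect_vocal_frames(x, sr, lo=200.0, hi=2000.0,
                        frame_len=2048, harm_thresh=0.4):
    # Band-pass filter keeps energy in an assumed vocal range; the paper
    # uses a 12th-order Chebyshev IIR filter (N is SciPy's prototype order).
    sos = cheby1(N=12, rp=1, Wn=[lo, hi], btype="band", fs=sr, output="sos")
    y = sosfilt(sos, x)

    flags = []
    for start in range(0, len(y) - frame_len + 1, frame_len):
        frame = y[start:start + frame_len]
        # Harmonicity proxy: peak of the autocorrelation away from lag 0,
        # normalized by the lag-0 value.  Voiced (harmonic) frames score high.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        harmonicity = ac[32:frame_len // 2].max() / max(ac[0], 1e-12)
        flags.append(harmonicity > harm_thresh)
    return np.array(flags)
```

A harmonic tone passes the threshold in almost every frame, while broadband noise in the same band does not.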
Kim & Whitman, 2002.
Feature Extraction
12-pole LP analysis, based on the general principle behind LPC for speech, is used for feature extraction
LP analysis is performed on both linear and warped frequency scales
– The linear scale treats all frequencies equally
– Human ears are not equally sensitive to all frequencies
– The warping function adjusts closely to the Bark scale, which approximates the frequency sensitivity of human hearing
– The warped function is better at capturing formant locations at lower frequencies
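The linear-scale LP analysis can be sketched with the classic autocorrelation (Levinson-Durbin) method; the Hamming window is an assumed choice, and the warped-scale (Bark-like) variant is not shown here.

```python
import numpy as np

def lp_coefficients(frame, order=12):
    # 12-pole LP analysis by the autocorrelation (Levinson-Durbin) method,
    # the classic LPC front end this feature extractor builds on.
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    # Prediction-error filter A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order
    return a
```

On a synthetic autoregressive signal the estimated coefficients recover the generating model, which is the sanity check usually applied to an LPC front end.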
Kim & Whitman, 2002.
Experiments
Data set includes 17 different singers and > 200 songs
2 classifiers, a Gaussian Mixture Model (GMM) and an SVM, used on 3 different feature sets
– Linear-scaled, warped-scaled, and combined linear and warped data
Run on entire songs and on segments classified as vocal only
Kim & Whitman, 2002.
Results
Linear-frequency features tend to outperform warped-frequency features when each is used alone; the combination performs best
Song and frame accuracy increases when using only vocal segments with the GMM, but decreases when using only vocal segments with the SVM
Kim & Whitman, 2002.
Discussion and Future Work
Better performance of linear-scale features vs. warped-scale features indicates that
– The machine finds the increased resolution of the linear scale at higher frequencies useful
– This is contrary to the human auditory system
The decreased performance of the SVM is puzzling
– It may be finding aspects of the features not specifically related to the voice
Add high-level musical knowledge to the system
– Attempt to identify song structure, such as locating verses or choruses
– Vocals have a higher probability of occurring in these sections
Liu & Huang, 2002.
“A Singer Identification Technique for Content-Based Classification of MP3 Music Objects”
Automatically classify MP3 music objects according to singer
Major steps:
– Coefficients extracted from the compressed raw data are used to compute the MP3 features for segmentation
– These features are used to segment MP3 objects into a sequence of notes or phonemes
[Figure: waveform of 2 phonemes]
– For each MP3 phoneme in the training set, its MP3 features are extracted and stored with its associated singer in a phoneme DB
– Phonemes in the phoneme DB are used as discriminators in an MP3 classifier to identify the singers of unknown MP3 objects
Liu & Huang, 2002.
Classification
The number of different phonemes a singer can sing is limited, and singers with different timbres possess unique phoneme sets
Phonemes of an unknown MP3 song can be associated with the similar phonemes of the same singer in the phoneme DB
A kNN classifier is used for classification
– Each unknown MP3 song is first segmented into phonemes
– The first N phonemes are used and compared with every discriminator in the phoneme DB
– The k closest neighbors are found
For each of the k closest neighbors
– If its distance is within a threshold, a weighted vote is given
– k*N weighted votes are accumulated according to singer
– The unknown MP3 song is assigned to the singer with the largest score
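The voting scheme above can be sketched as follows. The Euclidean distance metric and the linear vote weighting are illustrative assumptions; the paper's reported best parameters are k = 80 and threshold = 0.2.

```python
import numpy as np

def classify_song(query_phonemes, db_features, db_singers,
                  k=80, dist_thresh=0.2, n_phonemes=20):
    # Weighted-vote kNN: each of the first N query phonemes casts votes for
    # the singers of its k nearest discriminators, weighted by closeness;
    # neighbors beyond the distance threshold are discarded.
    scores = {}
    for q in query_phonemes[:n_phonemes]:
        dists = np.linalg.norm(db_features - q, axis=1)
        for idx in np.argsort(dists)[:k]:
            if dists[idx] <= dist_thresh:
                weight = 1.0 - dists[idx] / dist_thresh
                scores[db_singers[idx]] = scores.get(db_singers[idx], 0.0) + weight
    # Assign the song to the singer with the largest accumulated score
    return max(scores, key=scores.get) if scores else None
```

With tight per-singer phoneme clusters, a query drawn from one singer's cluster accumulates nearly all the votes for that singer.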
Liu & Huang, 2002.
Experiments
Data set consists of 10 male and 10 female Chinese singers, each with 30 songs
3 factors dominate the results of the MP3 music classification method
– The setting of k in the kNN classifier (best k = 80, yielding a 90% precision rate)
– The threshold for the vote decision used by the discriminator (best threshold = 0.2)
– The number of singers allowed in a music class (a larger number gives higher precision)
Allowing > 1 singer in a music class, i.e. grouping several singers with similar voices, provides the ability to find songs with singers of similar voices
Liu & Huang, 2002.
Results and Future Work
Results were within expectations
– Songs sung by a singer with a very unique style resulted in the highest precision rate (> 90%)
– Songs sung by a singer with a common voice resulted in only a 50% precision rate
Future work: use more music features
– Pitch, melody, rhythm, and harmonicity for music classification
– Represent MP3 features according to the syntax and semantics of the MPEG-7 standard
Tsai et al., 2003.
“Blind Clustering of Popular Music Recordings Based on Singer Voice Characteristics” (ISMIR)
Technique for automatically clustering undocumented music recordings based on their associated singers, given no singer information or knowledge of the singer population
Clustering method is based on the singer's voice rather than the background music, genre, or other attributes
3-stage process proposed:
– Segmentation of each recording into vocal/non-vocal segments
– Suppression of the background characteristics in the vocal segments
– Clustering of the recordings based on singer-characteristic similarity
Tsai et al., 2003.
Classification
Classifier for vocal/non-vocal segmentation
– A front-end signal processor converts the digital waveform into spectrum-based feature vectors
– A back-end statistical processor performs modeling, matching, and decision making
Tsai et al., 2003.
Classification
Classifier operates in 2 phases: training and testing
– During the training phase, a music DB with manual vocal/non-vocal transcriptions is used to form two separate GMMs: a vocal GMM and a non-vocal GMM
– In the testing phase, the recognizer takes as input feature vectors extracted from an unknown recording and produces as output the frame log-likelihoods for the vocal GMM and the non-vocal GMM
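A toy version of this two-GMM scheme can be sketched as follows. The tiny diagonal-covariance EM fit and the two-component models are illustrative stand-ins; a real system would use many more mixture components and a robust library implementation.

```python
import numpy as np

def fit_gmm(X, n_comp=2, n_iter=50, seed=0):
    # Tiny diagonal-covariance GMM fit by EM (stand-in for the vocal and
    # non-vocal GMMs; component count and iteration count are illustrative).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, n_comp, replace=False)]
    var = np.tile(X.var(axis=0) + 1e-6, (n_comp, 1))
    w = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibilities from per-component log densities
        logp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances
        nk = r.sum(axis=0) + 1e-9
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def gmm_loglik(X, model):
    # Per-frame log-likelihood under a fitted GMM
    w, mu, var = model
    logp = (np.log(w)
            - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
            - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
    m = logp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))
```

A frame is then labeled vocal when its log-likelihood under the vocal GMM exceeds its log-likelihood under the non-vocal GMM.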
Tsai et al., 2003.
Classification
[Figure: block diagram of the classifier (Tsai, 2003)]
Tsai et al., 2003.
Decision Rules
A decision for each frame is made according to one of three decision rules: (1) frame-based, (2) fixed-length-segment-based, and (3) homogeneous-segment-based.
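As an illustration of the second rule, per-frame evidence can be pooled over fixed-length segments. The segment length and the use of the mean log-likelihood ratio as the pooled statistic are assumptions for this sketch, not the paper's values.

```python
import numpy as np

def fixed_segment_decision(frame_llr, seg_len=100):
    # Fixed-length-segment-based rule (sketch): pool the per-frame
    # vocal vs. non-vocal log-likelihood ratios over each segment and
    # assign every frame in the segment the pooled decision.
    labels = np.empty(len(frame_llr), dtype=bool)
    for s in range(0, len(frame_llr), seg_len):
        seg = frame_llr[s:s + seg_len]
        labels[s:s + len(seg)] = seg.mean() > 0.0
    return labels
```

Pooling makes the decision robust to noisy individual frames: a segment with a positive average ratio is labeled vocal even if some of its frames score negatively.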
[Figure: decision rules; segment-based rules assign a single classification per segment (Tsai, 2003)]
Tsai et al., 2003.
Singer Characteristic Modeling
Characteristics of the voice must be modeled to cluster the recordings
– Let V = {v1, v2, v3, …} be the feature vectors from a vocal region; V is a mixture of solo-voice feature vectors S = {s1, s2, s3, …} and background-accompaniment feature vectors B = {b1, b2, b3, …}
– S and B are unobservable
– B can be approximated from the non-vocal segments
– S is subsequently estimated given V and B
A solo model and a background music model are generated for each recording to be clustered
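One simple way to realize this estimation idea, as an illustrative stand-in rather than the paper's actual estimator, is to model B from the non-vocal frames and weight each vocal-region frame by how poorly the background model explains it:

```python
import numpy as np

def estimate_solo_mean(V, B):
    # Model the background B with a diagonal Gaussian fit on non-vocal
    # frames, then down-weight vocal-region frames that the background
    # model explains well.  This weighting scheme is an assumption for
    # the sketch, not the estimator used in the paper.
    mu_b = B.mean(axis=0)
    var_b = B.var(axis=0) + 1e-9
    logp_b = -0.5 * np.sum(np.log(2 * np.pi * var_b)
                           + (V - mu_b) ** 2 / var_b, axis=1)
    # Frames the background explains badly get the largest weights
    w = 1.0 - np.exp(logp_b - logp_b.max())
    w /= w.sum() + 1e-12
    # Weighted mean approximates the solo-voice component of V
    return (w[:, None] * V).sum(axis=0)
```

When V mixes background-dominated and voice-dominated frames, the weighted estimate sits closer to the voice frames than the plain mean of V does.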
Tsai et al., 2003.
Clustering
Each recording is evaluated against each singer's solo model
– The log-likelihood of the vocal portion of one recording tested against one solo model is computed (for all solo models)
A k-means algorithm is used for clustering
– Starts with a single cluster and recursively splits clusters
– The Bayesian Information Criterion (BIC) is employed to decide the best value of k
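The BIC-guided model selection can be sketched as follows. As assumptions for this sketch, candidate values of k are scanned directly (the paper instead splits clusters recursively) and a crude tied-spherical-Gaussian likelihood is used for the BIC.

```python
import numpy as np

def kmeans(X, k, seed=0, n_iter=100):
    # Plain Lloyd's algorithm
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

def choose_k_by_bic(X, k_max):
    # Keep the cluster count with the best BIC under a tied-spherical-
    # Gaussian model; a few k-means restarts guard against bad inits.
    n, d = X.shape
    best_k, best_bic = 1, -np.inf
    for k in range(1, k_max + 1):
        sse = min(((X - c[l]) ** 2).sum()
                  for l, c in (kmeans(X, k, seed=s) for s in range(10)))
        var = max(sse / (n * d), 1e-9)
        loglik = -0.5 * n * d * (np.log(2 * np.pi * var) + 1.0)
        bic = loglik - 0.5 * (k * d + 1) * np.log(n)
        if bic > best_bic:
            best_k, best_bic = k, bic
    return best_k
```

On well-separated clusters the criterion recovers the true cluster count, which in the paper's setting corresponds to estimating the singer population.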
Tsai et al., 2003.
Experiments
Data set consists of 416 tracks from Mandarin pop music CDs
Experiments were run to validate the vocal/non-vocal segmentation method
– The best accuracy achieved was 78%, using the homogeneous-segment-based method
Tsai et al., 2003.
Results
System evaluated on the basis of average cluster purity
When k = the singer population, the highest purity achieved was 0.77
Tsai et al., 2003.
Future Work
Test the method on a wider variety of data
– Larger singer population
– Richer set of songs from different genres
Discussion and Conclusion
Singer similarity techniques can be used to
– Automatically organize a collection of music recordings based on the lead singer
– Label guest performers, information usually omitted in music databases
– Replace human documentation efforts
Extend to handle duets, choruses, background vocals, and other musical data with multiple simultaneous or non-simultaneous singers
– In rock band songs, parts sung by the guitarist, drummer, or other band members could be identified
Bibliography
Bartsch, M., and G. Wakefield (2004). Singing voice identification using spectral envelope estimation. IEEE Transactions on Speech and Audio Processing, vol. 12, no. 2, 100–9.
Kim, Y., and B. Whitman (2002). Singer identification in popular music recordings using voice coding features. In Proceedings of the 2002 International Symposium on Music Information Retrieval.
Liu, C., and C. Huang (2002). A singer identification technique for content-based classification of MP3 music objects. In Proceedings of the 2002 Conference on Information and Knowledge Management (CIKM), 438–445.
Tsai, W., H. Wang, D. Rodgers, S. Cheng, and H. Yu (2003). Blind clustering of popular music recordings based on singer voice characteristics. In Proceedings of the 2003 International Symposium on Music Information Retrieval.