Singer Similarity
A Brief Literature Review
Catherine Lai
MUMT-611 MIR
March 24, 2005
Outline of Presentation
Introduction
– Motivation
– Related research
Recent publications
– Kim & Whitman, 2002
– Liu & Huang, 2002
– Tsai, Wang, Rodgers, Cheng & Yu, 2003
– Bartsch & Wakefield, 2004
Discussion and Conclusion
Introduction
Motivation
– Multitude of audio files circulating on the Internet
– Replace human documentation efforts and organize collections of music recordings automatically
– Singer identification is relatively easy for humans but not for machines
Related Research
– Speaker identification
– Musical instrument identification
Kim & Whitman, 2002.
“Singer Identification in Popular Music Recordings Using Voice Coding Features” (MIT Media Lab)
Automatically establish the identity of the singer using acoustic features extracted from songs in a DB of pop music
– Performs segmentation of the vocal regions prior to singer identification
– Classifier uses features drawn from voice coding based on Linear Predictive Coding (LPC)
– LPC is good at highlighting formant locations; regions of resonance are perceptually significant
Kim & Whitman, 2002.
Detection of Vocal Region
To detect regions of singing, detect energy within the frequencies bounded by the range of vocal energy
– Filter the audio signal with a band-pass filter
– A Chebyshev IIR digital filter of order 12 was used
This attenuates other instruments that fall outside the vocal range, e.g. bass and cymbals
– The voice is not the only instrument remaining in this region
To discriminate the other sounds, e.g. drums, use a measure of harmonicity
– A vocal segment is > 90% voiced and thus highly harmonic
– Measure the harmonicity of the filtered signal within an analysis frame and threshold the harmonicity against a fixed value
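The detection steps above can be sketched as follows. The band edges, frame length, harmonicity threshold, and the autocorrelation-peak harmonicity measure are illustrative assumptions for this sketch, not the paper's exact values.

```python
import numpy as np
from scipy.signal import cheby1, sosfilt

def detect_vocal_frames(x, sr, lo=200.0, hi=2000.0,
                        frame_len=2048, harm_thresh=0.4):
    # Band-pass filter keeps energy in an assumed vocal range; the paper
    # uses a 12th-order Chebyshev IIR filter (N is SciPy's prototype order).
    sos = cheby1(N=12, rp=1, Wn=[lo, hi], btype="band", fs=sr, output="sos")
    y = sosfilt(sos, x)

    flags = []
    for start in range(0, len(y) - frame_len + 1, frame_len):
        frame = y[start:start + frame_len]
        # Harmonicity proxy: peak of the autocorrelation away from lag 0,
        # normalized by the lag-0 value.  Voiced (harmonic) frames score high.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        harmonicity = ac[32:frame_len // 2].max() / max(ac[0], 1e-12)
        flags.append(harmonicity > harm_thresh)
    return np.array(flags)
```

A harmonic tone passes the threshold in almost every frame, while broadband noise in the same band does not.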
Kim & Whitman, 2002.
Feature Extraction
12-pole LP analysis, based on the general principle behind LPC for speech, is used for feature extraction
LP analysis is performed on both linear and warped frequency scales
– The linear scale treats all frequencies equally
– Human ears are not equally sensitive to all frequencies
– The warping function adjusts closely to the Bark scale, which approximates the frequency sensitivity of human hearing
– The warped function is better at capturing formant locations at lower frequencies
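The linear-scale LP analysis can be sketched with the classic autocorrelation (Levinson-Durbin) method; the Hamming window is an assumed choice, and the warped-scale (Bark-like) variant is not shown here.

```python
import numpy as np

def lp_coefficients(frame, order=12):
    # 12-pole LP analysis by the autocorrelation (Levinson-Durbin) method,
    # the classic LPC front end this feature extractor builds on.
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    # Prediction-error filter A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order
    return a
```

On a synthetic autoregressive signal the estimated coefficients recover the generating model, which is the sanity check usually applied to an LPC front end.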
Kim & Whitman, 2002.
Experiments
Data set includes 17 different singers and > 200 songs
2 classifiers, a Gaussian Mixture Model (GMM) and an SVM, used on 3 different feature sets
– Linear-scaled, warped-scaled, and combined linear and warped data
Run on entire songs and on segments classified as vocal only
Kim & Whitman, 2002.
Results
Linear-frequency features tend to outperform warped-frequency features when each is used alone; the combination performs best
Song and frame accuracy increases when using only vocal segments with the GMM, but decreases when using only vocal segments with the SVM
Kim & Whitman, 2002.
Discussion and Future Work
Better performance of linear-scale features vs. warped-scale features indicates that
– The machine finds the increased resolution of the linear scale at higher frequencies useful
– This is contrary to the human auditory system
The decreased performance of the SVM is puzzling
– It may be finding aspects of the features not specifically related to the voice
Add high-level musical knowledge to the system
– Attempt to identify song structure, such as locating verses or choruses
– Vocals have a higher probability of occurring in these sections
Liu & Huang, 2002.
“A Singer Identification Technique for Content-Based Classification of MP3 Music Objects”
Automatically classify MP3 music objects according to singer
Major steps:
– Coefficients extracted from the compressed raw data are used to compute the MP3 features for segmentation
– These features are used to segment MP3 objects into a sequence of notes or phonemes
[Figure: waveform of 2 phonemes]
– For each MP3 phoneme in the training set, its MP3 features are extracted and stored with its associated singer in a phoneme DB
– Phonemes in the phoneme DB are used as discriminators in an MP3 classifier to identify the singers of unknown MP3 objects
Liu & Huang, 2002.
Classification
The number of different phonemes a singer can sing is limited, and singers with different timbres possess unique phoneme sets
Phonemes of an unknown MP3 song can be associated with the similar phonemes of the same singer in the phoneme DB
A kNN classifier is used for classification
– Each unknown MP3 song is first segmented into phonemes
– The first N phonemes are used and compared with every discriminator in the phoneme DB
– The k closest neighbors are found
For each of the k closest neighbors
– If its distance is within a threshold, a weighted vote is given
– k*N weighted votes are accumulated according to singer
– The unknown MP3 song is assigned to the singer with the largest score
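The voting scheme above can be sketched as follows. The Euclidean distance metric and the linear vote weighting are illustrative assumptions; the paper's reported best parameters are k = 80 and threshold = 0.2.

```python
import numpy as np

def classify_song(query_phonemes, db_features, db_singers,
                  k=80, dist_thresh=0.2, n_phonemes=20):
    # Weighted-vote kNN: each of the first N query phonemes casts votes for
    # the singers of its k nearest discriminators, weighted by closeness;
    # neighbors beyond the distance threshold are discarded.
    scores = {}
    for q in query_phonemes[:n_phonemes]:
        dists = np.linalg.norm(db_features - q, axis=1)
        for idx in np.argsort(dists)[:k]:
            if dists[idx] <= dist_thresh:
                weight = 1.0 - dists[idx] / dist_thresh
                scores[db_singers[idx]] = scores.get(db_singers[idx], 0.0) + weight
    # Assign the song to the singer with the largest accumulated score
    return max(scores, key=scores.get) if scores else None
```

With tight per-singer phoneme clusters, a query drawn from one singer's cluster accumulates nearly all the votes for that singer.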
Liu & Huang, 2002.
Experiments
Data set consists of 10 male and 10 female Chinese singers, each with 30 songs
3 factors dominate the results of the MP3 music classification method
– The setting of k in the kNN classifier (best k = 80, yielding a 90% precision rate)
– The threshold for the vote decision used by the discriminator (best threshold = 0.2)
– The number of singers allowed in a music class (a larger number gives higher precision)
Allowing > 1 singer in a music class, i.e. grouping several singers with similar voices, provides the ability to find songs with singers of similar voices
Liu & Huang, 2002.
Results and Future Work
Results were within expectations
– Songs sung by a singer with a very unique style resulted in the highest precision rate (> 90%)
– Songs sung by a singer with a common voice resulted in only a 50% precision rate
Future work: use more music features
– Pitch, melody, rhythm, and harmonicity for music classification
– Represent MP3 features according to the syntax and semantics of the MPEG-7 standard
Tsai et al., 2003.
“Blind Clustering of Popular Music Recordings Based on Singer Voice Characteristics” (ISMIR)
Technique for automatically clustering undocumented music recordings based on their associated singers, given no singer information or knowledge of the singer population
Clustering method is based on the singer's voice rather than the background music, genre, or other attributes
3-stage process proposed:
– Segmentation of each recording into vocal/non-vocal segments
– Suppression of the background characteristics in the vocal segments
– Clustering of the recordings based on singer-characteristic similarity
Tsai et al., 2003.
Classification
Classifier for vocal/non-vocal segmentation
– A front-end signal processor converts the digital waveform into spectrum-based feature vectors
– A back-end statistical processor performs modeling, matching, and decision making
Tsai et al., 2003.
Classification
Classifier operates in 2 phases: training and testing
– During the training phase, a music DB with manual vocal/non-vocal transcriptions is used to form two separate GMMs: a vocal GMM and a non-vocal GMM
– In the testing phase, the recognizer takes as input feature vectors extracted from an unknown recording and produces as output the frame log-likelihoods for the vocal GMM and the non-vocal GMM
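A toy version of this two-GMM scheme can be sketched as follows. The tiny diagonal-covariance EM fit and the two-component models are illustrative stand-ins; a real system would use many more mixture components and a robust library implementation.

```python
import numpy as np

def fit_gmm(X, n_comp=2, n_iter=50, seed=0):
    # Tiny diagonal-covariance GMM fit by EM (stand-in for the vocal and
    # non-vocal GMMs; component count and iteration count are illustrative).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, n_comp, replace=False)]
    var = np.tile(X.var(axis=0) + 1e-6, (n_comp, 1))
    w = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibilities from per-component log densities
        logp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances
        nk = r.sum(axis=0) + 1e-9
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def gmm_loglik(X, model):
    # Per-frame log-likelihood under a fitted GMM
    w, mu, var = model
    logp = (np.log(w)
            - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
            - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
    m = logp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))
```

A frame is then labeled vocal when its log-likelihood under the vocal GMM exceeds its log-likelihood under the non-vocal GMM.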
Tsai et al., 2003.
Classification
[Figure: block diagram of the classifier (Tsai, 2003)]
Tsai et al., 2003.
Decision Rules
A decision for each frame is made according to one of three decision rules: (1) frame-based, (2) fixed-length-segment-based, and (3) homogeneous-segment-based.
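As an illustration of the second rule, per-frame evidence can be pooled over fixed-length segments. The segment length and the use of the mean log-likelihood ratio as the pooled statistic are assumptions for this sketch, not the paper's values.

```python
import numpy as np

def fixed_segment_decision(frame_llr, seg_len=100):
    # Fixed-length-segment-based rule (sketch): pool the per-frame
    # vocal vs. non-vocal log-likelihood ratios over each segment and
    # assign every frame in the segment the pooled decision.
    labels = np.empty(len(frame_llr), dtype=bool)
    for s in range(0, len(frame_llr), seg_len):
        seg = frame_llr[s:s + seg_len]
        labels[s:s + len(seg)] = seg.mean() > 0.0
    return labels
```

Pooling makes the decision robust to noisy individual frames: a segment with a positive average ratio is labeled vocal even if some of its frames score negatively.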
[Figure: decision rules; segment-based rules assign a single classification per segment (Tsai, 2003)]
Tsai et al., 2003.
Singer Characteristic Modeling
Characteristics of the voice must be modeled to cluster the recordings
– Let V = {v1, v2, v3, …} be the feature vectors from a vocal region; V is a mixture of solo-voice feature vectors S = {s1, s2, s3, …} and background-accompaniment feature vectors B = {b1, b2, b3, …}
– S and B are unobservable
– B can be approximated from the non-vocal segments
– S is subsequently estimated given V and B
A solo model and a background music model are generated for each recording to be clustered
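One simple way to realize this estimation idea, as an illustrative stand-in rather than the paper's actual estimator, is to model B from the non-vocal frames and weight each vocal-region frame by how poorly the background model explains it:

```python
import numpy as np

def estimate_solo_mean(V, B):
    # Model the background B with a diagonal Gaussian fit on non-vocal
    # frames, then down-weight vocal-region frames that the background
    # model explains well.  This weighting scheme is an assumption for
    # the sketch, not the estimator used in the paper.
    mu_b = B.mean(axis=0)
    var_b = B.var(axis=0) + 1e-9
    logp_b = -0.5 * np.sum(np.log(2 * np.pi * var_b)
                           + (V - mu_b) ** 2 / var_b, axis=1)
    # Frames the background explains badly get the largest weights
    w = 1.0 - np.exp(logp_b - logp_b.max())
    w /= w.sum() + 1e-12
    # Weighted mean approximates the solo-voice component of V
    return (w[:, None] * V).sum(axis=0)
```

When V mixes background-dominated and voice-dominated frames, the weighted estimate sits closer to the voice frames than the plain mean of V does.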
Tsai et al., 2003.
Clustering
Each recording is evaluated against each singer's solo model
– The log-likelihood of the vocal portion of one recording tested against one solo model is computed (for all solo models)
A k-means algorithm is used for clustering
– Starts with a single cluster and recursively splits clusters
– The Bayesian Information Criterion (BIC) is employed to decide the best value of k
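The BIC-guided model selection can be sketched as follows. As assumptions for this sketch, candidate values of k are scanned directly (the paper instead splits clusters recursively) and a crude tied-spherical-Gaussian likelihood is used for the BIC.

```python
import numpy as np

def kmeans(X, k, seed=0, n_iter=100):
    # Plain Lloyd's algorithm
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

def choose_k_by_bic(X, k_max):
    # Keep the cluster count with the best BIC under a tied-spherical-
    # Gaussian model; a few k-means restarts guard against bad inits.
    n, d = X.shape
    best_k, best_bic = 1, -np.inf
    for k in range(1, k_max + 1):
        sse = min(((X - c[l]) ** 2).sum()
                  for l, c in (kmeans(X, k, seed=s) for s in range(10)))
        var = max(sse / (n * d), 1e-9)
        loglik = -0.5 * n * d * (np.log(2 * np.pi * var) + 1.0)
        bic = loglik - 0.5 * (k * d + 1) * np.log(n)
        if bic > best_bic:
            best_k, best_bic = k, bic
    return best_k
```

On well-separated clusters the criterion recovers the true cluster count, which in the paper's setting corresponds to estimating the singer population.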
Tsai et al., 2003.
Experiments
Data set consists of 416 tracks from Mandarin pop music CDs
Experiments were run to validate the vocal/non-vocal segmentation method
– The best accuracy achieved was 78%, using the homogeneous-segment-based method
Tsai et al., 2003.
Results
System evaluated on the basis of average cluster purity
When k = the singer population, the highest purity achieved was 0.77
Tsai et al., 2003.
Future Work
Test the method on a wider variety of data
– Larger singer population
– Richer set of songs from different genres
Discussion and Conclusion
Singer similarity techniques can be used to
– Automatically organize a collection of music recordings based on the lead singer
– Label guest performers, information usually omitted in music databases
– Replace human documentation efforts
Extend to handle duets, choruses, background vocals, and other musical data with multiple simultaneous or non-simultaneous singers
– In rock band songs, parts sung by the guitarist, drummer, or other band members could be identified
Bibliography
Bartsch, M., and G. Wakefield (2004). Singing voice identification using spectral envelope estimation. IEEE Transactions on Speech and Audio Processing, vol. 12, no. 2, 100–9.
Kim, Y., and B. Whitman (2002). Singer identification in popular music recordings using voice coding features. In Proceedings of the 2002 International Symposium on Music Information Retrieval.
Liu, C., and C. Huang (2002). A singer identification technique for content-based classification of MP3 music objects. In Proceedings of the 2002 Conference on Information and Knowledge Management (CIKM), 438–445.
Tsai, W., H. Wang, D. Rodgers, S. Cheng, and H. Yu (2003). Blind clustering of popular music recordings based on singer voice characteristics. In Proceedings of the 2003 International Symposium on Music Information Retrieval.