![Page 1: Audio Thumbnailing of Popular Music Using Chroma-Based Representations Matt Williamson Chris Scharf Implementation based on: IEEE Transactions on Multimedia,](https://reader035.vdocument.in/reader035/viewer/2022062422/56649ee55503460f94bf488c/html5/thumbnails/1.jpg)
Audio Thumbnailing of Popular Music Using Chroma-Based
Representations
Matt Williamson
Chris Scharf
Implementation based on:
IEEE Transactions on Multimedia, Vol. 7, No. 1, February 2005
Mark A. Bartsch, Member, IEEE, and Gregory H. Wakefield, Member, IEEE
Introduction
• Multimedia content is growing rapidly
• Efficient method of browsing is necessary
• Indexing and retrieval methods are media-dependent
Primary goal
• Minimize audition time for a given type of media
Current methods
• Images
  – Downsampling
    • Produces a smaller version of the image (thumbnail)
    • Reduces cost of delivery and display
Current methods
• Audio: speech
  – Symbolic representation
    • Produces a transcript of the audio
What about music?
• Adapt an existing method:
  – Downsampling (time compression)
    • Results in highly distorted, unintelligible audio
What about music?
• Adapt an existing method (cont’d):
  – Symbolic representation (score transcription)
    • Extremely difficult
    • Results in essentially meaningless information
    • Does not convey other important elements:
      – Vocal style
      – Instruments used
      – Processing effects used
Essential problem:
Adapting existing methods cannot both reduce the audition time for music and effectively convey the “gist” of the song
Possible Solution:
Audio thumbnailing via chroma-based analysis
Audio thumbnailing
• Produces a short clip of the selection to represent the “gist” of the song
Chroma-based analysis
• Based on the extraction of chroma features from the audio
• Chroma Feature Extraction Algorithm:
  – Frame Segmentation
  – Feature Calculation
  – Correlation Calculation
  – Correlation Filtering
  – Thumbnail Selection
Chroma Feature Extraction
• Extract frequencies from the audio file
• Calculate chroma values from frequencies:
  $c = \log_2 f - \lfloor \log_2 f \rfloor$
• Categorize chroma values into pitch classes
  – 12 pitch classes: A, A#/Bb, B, C, C#/Db, …, G#/Ab
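The chroma mapping above can be sketched in a few lines of Python. This is only an illustration, not the presenters' code: the function names, the A = 440 Hz tuning reference, and the choice to number pitch class 0 as A are all assumptions.

```python
import math

# Pitch-class names, indexed so that class 0 is A (an assumption of this sketch).
PITCH_CLASSES = ["A", "A#/Bb", "B", "C", "C#/Db", "D",
                 "D#/Eb", "E", "F", "F#/Gb", "G", "G#/Ab"]

def chroma(freq_hz):
    """Return the chroma c = log2(f) - floor(log2(f)), a value in [0, 1)."""
    log_f = math.log2(freq_hz)
    return log_f - math.floor(log_f)

def pitch_class(freq_hz, ref_hz=440.0):
    """Map a frequency to the nearest of 12 pitch classes, relative to ref_hz (A)."""
    semitones = 12.0 * math.log2(freq_hz / ref_hz)  # signed distance from A in semitones
    return int(round(semitones)) % 12               # fold all octaves onto one
```

Frequencies one octave apart (for example 440 Hz and 880 Hz) get the same chroma and the same pitch class, which is the property the thumbnailing method relies on.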
Frame Segmentation
• Authors’ Implementation:
  – Determined via beat tracking algorithm
  – Range: 0.25 s to 0.56 s
• Our Implementation:
  – Average of range: 0.41 s
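A fixed-length segmentation like the 0.41 s choice above might look as follows. The function name is hypothetical, and the use of non-overlapping frames with the trailing partial frame dropped is an assumption; the slides do not say how frame boundaries were handled.

```python
def frame_bounds(num_samples, sample_rate, frame_seconds=0.41):
    """Return (start, end) sample indices of consecutive, non-overlapping frames."""
    frame_len = int(round(frame_seconds * sample_rate))
    # Step by one full frame; a trailing partial frame is dropped (assumption).
    return [(start, start + frame_len)
            for start in range(0, num_samples - frame_len + 1, frame_len)]
```

For example, `frame_bounds(1000, 1000)` yields two 410-sample frames and discards the last 180 samples.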
Feature Calculation
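• Calculate a 12-element chroma feature vector, v_t, for each frame:
  – Apply the FFT to each frame, then average the spectrum over the bins in each pitch class:
    $v_{t,k} = \sum_{n \in S_k} \frac{F_t(n)}{N_k}, \quad k \in \{0, \ldots, 11\}$
    where F_t(n) is the spectrum of frame t at FFT bin n, S_k is the set of bins assigned to pitch class k, and N_k is the number of bins in S_k
  – Constraints:
    • Minimum frequency: 20 Hz
      – Lower limit of human hearing
    • Maximum frequency: 2000 Hz
      – Higher frequencies affect the perception of chroma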
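One possible reading of this step, sketched with NumPy: take the FFT of a frame, keep the bins between 20 Hz and 2000 Hz, assign each bin to a pitch class, and average within each class. Using the plain FFT magnitude for F_t, nearest-semitone bin assignment, and an A = 440 Hz reference are all assumptions of this sketch; the slides themselves note that the original paper's feature calculation was unclear to the presenters.

```python
import numpy as np

def chroma_vector(frame, sample_rate, f_min=20.0, f_max=2000.0, ref_hz=440.0):
    """Average FFT magnitudes over the bins belonging to each of 12 pitch classes."""
    spectrum = np.abs(np.fft.rfft(frame))                     # magnitude per bin
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)  # bin frequencies in Hz
    in_band = (freqs >= f_min) & (freqs <= f_max)             # 20 Hz .. 2000 Hz only
    # Nearest semitone relative to ref_hz, folded onto 12 classes (class 0 = A).
    classes = np.round(12.0 * np.log2(freqs[in_band] / ref_hz)).astype(int) % 12
    mags = spectrum[in_band]
    v = np.zeros(12)
    for k in range(12):
        members = mags[classes == k]   # magnitudes of the bins in S_k
        if members.size:               # N_k can be zero for very short frames
            v[k] = members.mean()      # sum over S_k divided by N_k
    return v
```

A pure 440 Hz tone should then produce a vector whose largest element is class 0.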
Correlation Calculation
• Calculate the similarity matrix C
  – Each element is equal to the correlation between two feature vectors:
    $C_{i,j} = v_i^T v_j$
  – High correlation along diagonals of the matrix indicates repetition within the song
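With the feature vectors stacked as rows, the similarity matrix is a single matrix product. The sketch below is illustrative only; the optional normalization (zero mean, unit length per frame, so that the inner product behaves like a correlation coefficient) is an assumption and is not stated on the slide.

```python
import numpy as np

def similarity_matrix(features, normalize=True):
    """features: array of shape (num_frames, 12). Returns C with C[i, j] = v_i . v_j."""
    v = np.asarray(features, dtype=float)
    if normalize:
        # Assumption: centre and scale each frame so entries of C lie in [-1, 1].
        v = v - v.mean(axis=1, keepdims=True)
        norms = np.linalg.norm(v, axis=1, keepdims=True)
        v = v / np.where(norms == 0.0, 1.0, norms)  # leave all-constant frames at zero
    return v @ v.T
```

Two frames with the same chroma content then score close to 1, and C is symmetric by construction.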
Correlation Filtering
• Calculate the filtered time-lag matrix T:
  – Exposes similarity between extended segments that are separated by a constant lag
  – Filtering is performed along the diagonals of C
    • Uses a symmetric rectangular windowing function w (a uniform moving average filter)
  – T is then “rotated” so that the diagonals are oriented vertically
    $T_{i,j} = \sum_k C_{i+k,\, i+j+k}\, w(k)$
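The filtering can be sketched by walking the diagonals of C, smoothing each with a moving average, and storing the result as a column indexed by lag. The function name and the `half_width` parameter are hypothetical, and the edge handling (zero padding at the ends of each diagonal, diagonals shorter than the window left at zero) is an assumption.

```python
import numpy as np

def time_lag_matrix(C, half_width):
    """Return T with T[i, lag] = moving average of C along the diagonal at that lag."""
    n = C.shape[0]
    window = np.ones(2 * half_width + 1) / (2 * half_width + 1)  # symmetric, uniform
    T = np.zeros((n, n))
    for lag in range(n):
        diag = np.diagonal(C, offset=lag)   # diag[i] == C[i, i + lag]
        if diag.size >= window.size:        # skip diagonals shorter than the window
            # mode="same" keeps the length and centres the window on each frame;
            # values near the ends are averaged against implicit zeros.
            T[: diag.size, lag] = np.convolve(diag, window, mode="same")
    return T
```

Storing each diagonal as a column is what performs the "rotation": a repeated segment shows up as a vertical run of high values at its lag.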
Thumbnail Selection
• Select the maximum value in T
  – The location of this value indicates:
    • Occurrence of the segment (the y-coordinate)
    • Lag time (the x-coordinate)
  – Constraints:
    • Minimum lag time = 1/10 of song length
    • Maximum start time = 3/4 of song length
      – To reduce susceptibility to “fading repeat”
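The constrained maximum search could be written as below. This is a sketch under assumptions: T is taken to be a square array indexed as T[start_frame, lag], the song length is measured in frames, and the function and parameter names are hypothetical.

```python
import numpy as np

def select_thumbnail(T, min_lag_frac=0.1, max_start_frac=0.75):
    """Return (start_frame, lag) of the largest entry of T that meets both constraints."""
    n = T.shape[0]                                  # song length in frames
    min_lag = int(np.ceil(min_lag_frac * n))        # lag must be at least 1/10 of the song
    max_start = int(np.floor(max_start_frac * n))   # start no later than 3/4 of the song
    allowed = np.full(T.shape, -np.inf)             # disallowed cells can never win
    allowed[: max_start + 1, min_lag:] = T[: max_start + 1, min_lag:]
    start, lag = np.unravel_index(np.argmax(allowed), allowed.shape)
    return int(start), int(lag)
```

Multiplying the returned start frame by the frame length (0.41 s in the presenters' implementation) gives the thumbnail's start time in seconds.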
Results
• Jimmy Buffett – “Math Suks”
  – System: [64, 89]
• Lifehouse – “You and Me”
  – System: [38, 63]
• Gavin DeGraw – “I Don’t Want To Be”
  – System: [95, 120]
• Super Mario Brothers Theme
  – System: [18, 43]
Conclusion
• Successfully extracted time segments which closely match the chorus of the song
• Feature Calculation issue:
  – Authors’ implementation unclear
Possible Uses
• Audio domain:
  – Improved search capability
    • Searching for similar songs
  – Audio fingerprinting
• Other domains:
  – Detection of irregular heartbeats
Suggested Improvements and Alternatives
• Image-based analysis on the waveform
• Tested alternatives
  – MSE on signal frequencies
    • Chroma-based analysis proved more accurate