Audio-visual Analysis of Music Performance
Bochen Li
Advisor: Zhiyao Duan
Audio Information Research Lab, University of Rochester
Background
Music is a multi-modal art form
Background
• Visual component is not a marginal phenomenon in music perception, but an important factor in the communication of meanings [Platz’12]
• Visually perceived elements of performances (gesture, motion, facial expressions of the performer) affect the evaluations of judges [Tsay’13]
Background
More music video streaming services nowadays
Background
Multi-modal Music Information Retrieval (MIR)
• Instrument Recognition
• Playing Activity Detection
• Polyphonic Music Analysis
• Fingering Estimation
• Conductor Following
• Cross-modal Music Generation/Retrieval
Background
State of the art
Tasks Percussion Piano Guitar Strings Wind Singing
Fingering N/A ✔ ✔ ✔ N/A
Association ✔
Play/Non-play ✔ ✔ ✔
Onset ✔ ✔
Vibrato N/A N/A ✔
Transcription ✔ ✔ ✔ ✔
Separation ✔
Overview
Research Topics
Case study 1: Multi-modal Source Association
- Body motion [Li’17a], finger motion, vibrato motion [Li’17b]
Case study 2: Performance Expressiveness Analysis
- Vibrato analysis [Li’17c]
- Visual performance generation [Li’18]
Case study 3: Visually Informed Music Transcription
- Guitar transcription [Paleari’08]
- Piano transcription [Akbari’15]
- Violin transcription [Zhang’07]
- Multi-pitch analysis for strings [Dinesh’17]
Case study 4: Visually Informed Audio Source Separation
- Motion-driven [Parekh’17]
- Cross-modal deep representations [Zhao’18]
Case Study 1: Multi-modal Source Association
• Concept
• System 1: Body Motion Analysis [Li’17a]
• System 2: Finger Motion Analysis
• System 3: Vibrato Motion Analysis [Li’17b]
• Integrated System
Case Study 1: Multi-modal Source Association
Concept
[Diagram: inputs are the video, audio, and score; output is the player/track association]
Case Study 1: Multi-modal Source Association
Concept
[Diagram: video performance, score tracks, and audio tracks, with pairwise links labeled by alignment status (temporally aligned / not aligned) and association status (track associated / player-track not associated)]
• Alignment: mapping of event sequence in temporal domain
• Association: bijection between tracks/players in ensemble
Audio tracks are separated using score-informed techniques
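The alignment operation defined above (mapping event sequences in the temporal domain) is commonly computed with dynamic time warping. A minimal sketch, not the system's actual alignment code; the `dtw` helper and the toy sequences are hypothetical:

```python
def dtw(a, b):
    """Classic dynamic-time-warping cost between two feature sequences.

    D[i][j] holds the minimal cumulative |a - b| cost of aligning the
    first i elements of `a` with the first j elements of `b`.
    """
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Each step may advance in a, in b, or in both (diagonal).
            D[i][j] = abs(a[i - 1] - b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(a)][len(b)]

# A tempo-stretched copy of the same melody aligns with zero cost.
print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

Association, by contrast, is a discrete matching problem (a bijection between tracks and players) and needs a separate optimization over permutations.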
Case Study 1: Multi-modal Source Association
Concept
• Alignment: mapping of event sequence in temporal domain
• Association: bijection between tracks/players in ensemble
Traditional MIR task
[Diagram: after audio-score alignment, score tracks and audio tracks are temporally aligned, but players and tracks remain unassociated]
Audio tracks are separated using score-informed techniques
Proposed Work
Case Study 1: Multi-modal Source Association
System 1: Body Motion Analysis
Overview
• Designed for string instrumentalists
• Bow strokes indicate tone onsets
[System diagram: video of players A, B, C feeds motion analysis; audio and MIDI feed audio-score alignment and source separation into sources 1, 2, 3; association optimization matches players to sources]
Case Study 1: Multi-modal Source Association
Visual feature extraction
• Method 1: Optical flow estimation
- Pixel-level motion velocity
- Calculated between adjacent frames
System 1: Body Motion Analysis
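Optical flow yields a per-pixel velocity vector field between adjacent frames. As a much cruder stand-in, simple frame differencing already gives a frame-wise motion-velocity curve; this toy sketch (nested-list grayscale frames, a hypothetical format) only illustrates the idea, it is not the system's feature extractor:

```python
def motion_velocity(frames):
    """Mean absolute pixel change between adjacent frames.

    Real optical flow estimates a velocity *vector* per pixel; here we
    only measure the magnitude of change, which suffices to sketch a
    motion-velocity curve over time.
    """
    curve = []
    for prev, cur in zip(frames, frames[1:]):
        diffs = [abs(b - a)
                 for row_a, row_b in zip(prev, cur)
                 for a, b in zip(row_a, row_b)]
        curve.append(sum(diffs) / len(diffs))
    return curve

# Three 2x2 toy frames: no change, then every pixel jumps by 10.
f0 = [[0, 0], [0, 0]]
f1 = [[0, 0], [0, 0]]
f2 = [[10, 10], [10, 10]]
print(motion_velocity([f0, f1, f2]))  # [0.0, 10.0]
```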
Case Study 1: Multi-modal Source Association
Visual feature extraction
• Method 2: Human Pose Estimation
- Semantic representation
- Frame-wise estimation without tracking
- Lower computational cost
System 1: Body Motion Analysis
Case Study 1: Multi-modal Source Association
Onset Likelihood Estimation
System 1: Body Motion Analysis
Case Study 1: Multi-modal Source Association
Association Optimization
[Matrix of pairwise correspondence scores M(i, j) between 4 sources and 4 players]
M: pair-wise correspondence score
σ: permutation function
System 1: Body Motion Analysis
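The association optimization above picks the permutation σ that maximizes the total pairwise correspondence score. For small ensembles a brute-force search over permutations is enough (the Hungarian algorithm scales better); this sketch, with a hypothetical toy score matrix, illustrates the objective:

```python
from itertools import permutations

def best_association(M):
    """Find the permutation sigma maximizing sum_i M[i][sigma(i)].

    M[i][j] is the pairwise correspondence score between audio source i
    and player j.
    """
    n = len(M)
    best_sigma, best_score = None, float("-inf")
    for sigma in permutations(range(n)):
        score = sum(M[i][sigma[i]] for i in range(n))
        if score > best_score:
            best_sigma, best_score = sigma, score
    return best_sigma, best_score

# Toy 3x3 score matrix: source 0 best matches player 2, and so on.
M = [[0.1, 0.2, 0.9],
     [0.8, 0.3, 0.1],
     [0.2, 0.7, 0.3]]
sigma, score = best_association(M)
# sigma == (2, 0, 1): each source assigned to its best-scoring player.
```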
Case Study 1: Multi-modal Source Association
System 2: Finger Motion Analysis
Visual feature extraction
• Method: Optical flow estimation
Case Study 1: Multi-modal Source Association
Onset Likelihood Estimation
System 2: Finger Motion Analysis
Case Study 1: Multi-modal Source Association
System 3: Vibrato Motion Analysis
Visual feature extraction
• Method: Optical flow estimation
Case Study 1: Multi-modal Source Association
Vibrato correspondence
Video: Motion Velocity
Audio: Pitch Trajectory
System 3: Vibrato Motion Analysis
Case Study 1: Multi-modal Source Association
Integrated System
Overview
• Works for all common instruments in Western chamber music
• Universal framework
Case Study 1: Multi-modal Source Association
Evaluations
• Longer excerpt duration gives a higher chance of retrieving the correct association
• Association accuracy: string > wind/brass
Integrated System
Case Study 2: Performance Expressiveness Analysis
• Vibrato Detection and Analysis [Li’17c]
• Visual Performance Generation [Li’18]
Case Study 2: Performance Expressiveness Analysis
Vibrato Detection and Analysis
Concept
• Important artistic effect
• Pitch modulation of a note in a periodic fashion
• Characterized by rate & extent
Spectrogram
Audio
Vibrato
Non-vibrato
Case Study 2: Performance Expressiveness Analysis
Vibrato Detection and Analysis
Vibrato Detection
• Note-level vibrato/non-vibrato classification
Vibrato Analysis
• Vibrato rate: speed of pitch variation (1/T Hz)
• Vibrato extent: amount of pitch variation (A cents)
[Figure: pitch-vs-time vibrato curve, with the period T and the extent A marked]
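Under the definitions above (rate = 1/T in Hz, extent = A in cents), both parameters can be read off a pitch trajectory. A simplified sketch, not the system's actual estimator: rate from zero crossings of the mean-removed trajectory, extent as half the peak-to-peak deviation:

```python
import math

def vibrato_params(pitch, fps):
    """Estimate vibrato rate (Hz) and extent (cents) from a pitch
    trajectory in cents, sampled at `fps` frames per second.
    """
    mean = sum(pitch) / len(pitch)
    dev = [p - mean for p in pitch]
    # Two zero crossings per vibrato cycle.
    crossings = sum(1 for a, b in zip(dev, dev[1:]) if a * b < 0)
    duration = (len(pitch) - 1) / fps
    rate = crossings / 2 / duration
    extent = (max(pitch) - min(pitch)) / 2
    return rate, extent

# Synthetic 5 Hz vibrato with +-50 cents extent, 1 second at 100 fps.
fps = 100
pitch = [50 * math.sin(2 * math.pi * 5 * t / fps + 0.3) for t in range(fps)]
rate, extent = vibrato_params(pitch, fps)
```

The zero-crossing count slightly underestimates the rate at sequence boundaries; a real system would use autocorrelation or spectral analysis of the trajectory.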
Case Study 2: Performance Expressiveness Analysis
Vibrato Detection and Analysis
[Figure: a 0-1.2 sec excerpt comparing the ground-truth pitch contour, the audio-based (polyphonic) pitch spectrogram, and the video-based hand displacement]
Case Study 2: Performance Expressiveness Analysis
Vibrato Detection and Analysis
Motion Feature Extraction
• Hand tracking
• Optical flow estimation
Case Study 2: Performance Expressiveness Analysis
Vibrato Detection and Analysis
Vibrato Detection
• Each note segment as a sample
• Support Vector Machine (SVM)
[Diagram: an 8-D motion feature per note segment over time t feeds the SVM classifier, which outputs vibrato / non-vibrato]
Case Study 2: Performance Expressiveness Analysis
Vibrato Detection and Analysis
Vibrato Analysis
• Principal Component Analysis
• Vibrato rate: motion rate
Case Study 2: Performance Expressiveness Analysis
Vibrato Detection and Analysis
Vibrato Analysis
• Principal Component Analysis
• Vibrato rate: motion rate
• Vibrato extent: motion extent rescaled by pitch contour
[Figure: the motion displacement curve X(t) and its motion extent are rescaled via the estimated pitch contour into the estimated vibrato extent, shown against the ground-truth pitch contour]
Case Study 2: Performance Expressiveness Analysis
Vibrato Detection and Analysis
Evaluations
• Outperforms audio-based baseline systems
Case Study 2: Performance Expressiveness Analysis
Visual Performance Generation
Concept
Input: MIDI score
Output: Skeleton movement as a pianist
[Figure: piano-roll (pitch vs. time) with a beat grid marking downbeat, pick-up, and other beats]
Case Study 2: Performance Expressiveness Analysis
Visual Performance Generation
Method
[Diagram: the MIDI note stream (pitch vs. time) and the metric structure (beat vs. time) pass through two CNNs into 50-d and 10-d features, which an LSTM maps to the output body skeleton]
• Convolutional Neural Network (CNN)
- Automatic feature extraction from MIDI
• Recurrent Neural Network (RNN)
- Model the temporal coherence
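The CNN's input is the MIDI note stream rasterized into a pitch-by-time matrix. This toy rasterizer is an assumption for illustration, not the system's actual feature extraction; the `(pitch, onset, offset)` note format and the A0-based row mapping are hypothetical:

```python
def midi_to_pianoroll(notes, n_pitches=88, n_frames=16, fps=4):
    """Rasterize a MIDI note stream into a pitch-by-time piano roll.

    `notes` holds (midi_pitch, onset_sec, offset_sec) triples; row 0
    corresponds to MIDI pitch 21 (A0), covering the 88 piano keys.
    """
    roll = [[0.0] * n_frames for _ in range(n_pitches)]
    for pitch, onset, offset in notes:
        # Mark every frame during which the note sounds.
        for f in range(int(onset * fps), min(int(offset * fps), n_frames)):
            roll[pitch - 21][f] = 1.0
    return roll

# Middle C (MIDI 60) held for the first second, at 4 frames per second:
# row 39 gets four active frames.
roll = midi_to_pianoroll([(60, 0.0, 1.0)])
```

The metric-structure input would be a parallel beat-by-time matrix built the same way from downbeat/pick-up annotations.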
Case Study 2: Performance Expressiveness Analysis
Visual Performance Generation
Evaluations
YouTube Playlist: https://www.youtube.com/playlist?list=PLSf9SKAnNHL1je3Cfsx9xho07sWcvJhGV
Case Study 3: Visually Informed Music Transcription
• Music Transcription for Guitar [Paleari’08]
• Music Transcription for Piano [Akbari’15]
• Music Transcription for Violin [Zhang’07]
• Multi-pitch Analysis for String Ensemble [Dinesh’17]
Case Study 3: Music Transcription
Guitar Music Transcription [Paleari’08]
Method
1. Guitar body localization
2. Fretboard tracking
3. Hand detection
4. Audio-visual information fusion
Evaluations
• 89% correctly detected notes
Case Study 3: Music Transcription
Piano Music Transcription [Akbari’15]
Method
1. Keyboard registration
2. Illumination normalization
3. Pressed key detection
4. Music transcription
Evaluations
• 95% F-measure
Case Study 3: Music Transcription
Violin Music Transcription [Zhang’07]
Method
1. Motion compensation
2. Multiple finger tracking
3. String detection
4. Fingering event detection
5. Note inference
Case Study 3: Music Transcription
Multi-pitch Analysis for String Ensembles [Dinesh’17]
Concept
• Multi-pitch Estimation (MPE)
- Estimate instantaneous pitches and polyphony
• Multi-pitch Streaming (MPS)
- Organize pitches into streams corresponding to individual sources
[Figure: original spectrogram, MPE output, and MPS output]
Case Study 3: Music Transcription
Multi-pitch Analysis for String Ensembles
System overview
• Video-based play/non-play (P/NP) activity detection
• P/NP provides the instantaneous polyphony count
• P/NP restricts assignment of detected pitches to active sources
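The P/NP constraint above can be sketched as pruning pitch candidates per frame: keep only as many pitches as the video says players are active. The data layouts (`frame -> (pitch, salience)` candidates and a `frame -> count` polyphony map) are hypothetical simplifications of the cited system:

```python
def constrain_polyphony(candidates, polyphony):
    """Keep only the k most salient pitch candidates per frame, where
    k is the polyphony count implied by video P/NP detection.

    candidates: frame -> list of (pitch, salience) pairs
    polyphony:  frame -> number of visually active players
    """
    pruned = {}
    for frame, cands in candidates.items():
        k = polyphony.get(frame, 0)
        # Rank candidates by salience and keep the top k pitches.
        top = sorted(cands, key=lambda c: c[1], reverse=True)[:k]
        pruned[frame] = sorted(p for p, _ in top)
    return pruned

# Frame 0: the video shows only 2 players bowing, so the weakest of
# three pitch candidates is dropped.
out = constrain_polyphony({0: [(60, 0.9), (64, 0.8), (67, 0.1)]}, {0: 2})
print(out)  # {0: [60, 64]}
```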
Case Study 3: Music Transcription
Multi-pitch Analysis for String Ensembles
Evaluations
Multi-pitch Estimation
Case Study 3: Music Transcription
Multi-pitch Analysis for String Ensembles
Evaluations
Multi-pitch Streaming
Case Study 4: Visually Informed Source Separation
• Motion-driven [Parekh’17]
• Cross-modal Deep Representation [Zhao’18]
Case Study 4: Source Separation
Motion-driven [Parekh’17]
Overview
• Speed of sound-producing motion correlates with characteristics of the sound event
• Extends Non-negative Matrix Factorization (NMF)
Case Study 4: Source Separation
Motion-driven [Parekh’17]
Overview
• V: audio mixture’s spectrogram
• W: audio dictionary
• M: clustered average motion speeds
• H: activity matrix (shared by video and audio)
• A: regression coefficients for each motion cluster
Case Study 4: Source Separation
Cross-modal Deep Representation [Zhao’18]
Overview
• Train a two-stream network on a large amount of audio-video data
• Learn cross-modal representations (embeddings in an aligned space)
• Localize audio sources in the video frame
• Separate the sound for each pixel in the video
Case Study 4: Source Separation
Cross-modal Deep Representation [Zhao’18]
Results
• Clustering of sound in space
Case Study 4: Source Separation
Cross-modal Deep Representation [Zhao’18]
Results
• Visualization of corresponding channel activations