Multimodal Pattern Matching Algorithms and Applications
DESCRIPTION
In this presentation I focus on three projects I have been working on over the last year. The first is a novel pattern matching algorithm based on the well-known Dynamic Time Warping (DTW). The presented algorithm can be used to find real-valued subsequences within a longer sequence, without prior knowledge of their start-end points. I have applied the algorithm to the task of acoustic matching, for which I will show some preliminary results. I will then explain a second DTW-based algorithm, this one able to perform an online alignment of two musical pieces. One of the pieces can be input live or retrieved from an audio file, while the second one is extracted from an online music video. The online alignment allows the music video to be played in total synchrony with the corresponding ambient/recorded audio. Finally, I will talk about video copy detection, which is the task of finding duplicate video segments within a large database. I will explain our multimodal approach, based on audio-visual change-based features.
TRANSCRIPT
Multimodal pattern matching algorithms and applications
Xavier Anguera Telefonica Research
Outline
• Introduction
• Partial sequence matching
  – U-DTW algorithm
• Music/video online synchronization
  – MuViSync prototype
• Video copy detection
Partial Sequence Matching Using an Unbounded Dynamic Time Warping Algorithm
Xavier Anguera, Robert Macrae and Nuria Oliver
Telefonica Research, Barcelona, Spain
Proposed challenge
• Given one or several audio signals, we want to find and align recurring acoustic patterns.
• We could use the ASR/phonetic output and search for symbol repetitions
  PROS:
  – It is easy to apply; the ASR takes care of any time warping
  CONS:
  – ASR is language dependent and requires training
  – We introduce additional sources of error (acoustic conditions, OOVs)
  – It can be very slow and not embeddable
• Automatic motif discovery directly in the speech signal
  – Training-free, language independent and resilient to some noises
[Diagram: two approaches — ASR/phonetization into a symbolic representation followed by symbol alignment, vs. direct acoustic alignment — both producing alignment locations and scores]
Areas of application
• Improve ASR by disambiguation over several repetitions (Park and Glass, 2005)
• Pattern-based speech recognition – flat modelling (Zweig and Nguyen, 2010)
• Acoustic summarization (Muscariello, 2009)
• Musical structure analysis (Müller, 2007)
• Server-less mobile voice search (Anguera, 2010)
Automatic motif discovery
• Goal is to avoid going to text and therefore be more robust to errors
• Good deal of applicable work in this area:
  – Biomedicine, in matching DNA sequences (converting the speech signals into symbol strings)
  – Directly from real-valued multidimensional samples using DTW-like algorithms
    • Müller'07, Muscariello'09, Park'05, Zweig'10
    • Most need to compute the whole cost matrix a priori
Dynamic Time Warping (DTW)
• The DTW algorithm allows the computation of the optimal alignment between two time series X_U, X_V ∈ Φ^D:
  X_U = (u_1, ..., u_m, ..., u_M)
  X_V = (v_1, ..., v_n, ..., v_N)
Image by Daniel Lemire
Dynamic Time Warping (II)
• The optimal alignment can be found in O(MN) complexity using dynamic programming.
• We need to define a cost function between any two elements in the series and build a distance matrix:
  d : Φ^D × Φ^D → ℝ_{≥0}
  where usually the Euclidean distance is used: d(i, j) = ‖u_m − v_n‖
• Warping function: F = c(1), ..., c(K), where c(k) = (i(k), j(k))
Image by Tsanko Dyustabanov
Warping constraints
For speech signals some constraints are usually applied to the warping function F:
– Monotonicity:
  i(k−1) ≤ i(k),  j(k−1) ≤ j(k)
– Continuity (i.e. local constraints):
  i(k) − i(k−1) ≤ 1,  j(k) − j(k−1) ≤ 1
Sakoe, H. and Chiba, S. (1978), "Dynamic programming algorithm optimization for spoken word recognition", IEEE Trans. on Acoust., Speech, and Signal Process., ASSP-26, 43–49.
Predecessor cells of (m, n): (m−1, n−1), (m−1, n), (m, n−1)
D(m, n) = d(u_m, v_n) + min{ D(m−1, n), D(m, n−1), D(m−1, n−1) }
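The recursion above can be sketched as a minimal DTW in Python (an illustrative implementation of the textbook algorithm, not the authors' code):

```python
import numpy as np

def dtw(X, Y, dist=lambda u, v: np.linalg.norm(u - v)):
    """Classic DTW: optimal alignment cost between sequences X (M x D) and Y (N x D)."""
    M, N = len(X), len(Y)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            d = dist(X[m - 1], Y[n - 1])
            # D(m,n) = d(u_m, v_n) + min over the three predecessor cells
            D[m, n] = d + min(D[m - 1, n], D[m, n - 1], D[m - 1, n - 1])
    return D[M, N]
```

Aligning a sequence with a time-warped copy of itself yields cost 0, which is what the boundary conditions of the next slide guarantee.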
Warping constraints (II)
– Boundary condition:
  i(1) = 1, j(1) = 1, i(K) = M, j(K) = N
  i.e. DTW needs prior knowledge of the start-end alignment points.
– Global constraints
Image from Keogh and Ratanamahatana
DTW Dynamic Programming
DTW main problem
• The boundary condition constrains the time series to be aligned from start to end
  – We need a modification to DTW that allows common pattern discovery in reference and query signals regardless of the sequences' other content
Alternative proposals
• Meinard Müller's path extraction for music [1]
  – Needs to pre-compute the complete cost matrix.
• Alex Park's Segmental DTW [2]
  – Needs to pre-compute the complete cost matrix; very computationally expensive afterwards.
• Armando Muscariello's word discovery algorithm [3]
  – Searches for patterns locally; does not check all possible starting points.
[1] M. Müller, "Information Retrieval for Music and Motion", Springer, New York, USA, 2007.
[2] A. Park et al., "Towards unsupervised pattern discovery in speech", in Proc. ASRU'05, Puerto Rico, 2005.
[3] A. Muscariello et al., "Audio keyword extraction by unsupervised word discovery", in Proc. INTERSPEECH'09, 2009.
Unbounded-DTW Algorithm
• U-DTW is a modification to DTW that is fast and accurate in finding recurring patterns
• We call it unbounded because:
  – The start-end positions of both segments are not constrained
  – Multiple matching segments can be found with a single pass of the algorithm
  – It minimizes the computational cost of comparing two multidimensional time series
U-DTW cost function and matching length
• Given two sequences to be matched, U = (u_1, u_2, ..., u_M) and V = (v_1, v_2, ..., v_N), we use the inner product (cosine) similarity:
  s(m, n) = cos θ = ⟨u_m, v_n⟩ / (‖u_m‖ ‖v_n‖)
• Values range in [−1, 1]; the higher, the closer
• We look for matching sequences with a minimum length L_min (set at 400 ms in our experiments)
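The similarity above is the standard cosine between two feature frames; as a sketch:

```python
import numpy as np

def cosine_similarity(u, v):
    """Inner-product (cosine) similarity between two feature frames; range [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```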
U-DTW global/local constraints
• No global constraints are applied, in order to allow matching of any segment between both sequences
• Local constraints are set to allow warping of up to 2x
  Predecessor cells of (m, n): (m−1, n−2), (m−1, n−1), (m−2, n−1)
  D(m, n) = s(u_m, v_n) + max{ D(m−1, n−2), D(m−1, n−1), D(m−2, n−1) }
U-DTW computational savings
• Computational savings are achieved thanks to:
  1. Sampling the distance/similarity matrix at certain possible matching start points (setting synchronization points)
  2. Dynamic programming done forward, pruning out low-similarity paths
Synchronization points
• Only certain (m, n) positions in the matrix are analyzed as possible matching segment starts:
  – Selected so as not to lose any matching segment
  – Optimize the computational cost
• Two methods are followed: horizontal and diagonal bands
[Figure: synchronization points over the U × V matrix, placed in horizontal bands (spacing τ_h, width λ) and diagonal bands (spacing τ_d, at π/4)]
U-DTW Dynamic Programming
Forward dynamic programming
• For each position (m, n), 3 possible forward paths are considered: (m+1, n+1), (m+1, n+2), (m+2, n+1)
• A forward path is extended IFF:
  – Its length-normalized global similarity is above a pruning threshold:
    S(m′, n′) = (D(m, n) + s(m′, n′)) / (M(m, n) + 1) ≥ Thr_prun
  – S(m′, n′) is greater than that of any previous path at that location
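The pruning test can be illustrated as a small helper (the names `thr_prun` and `best_at_cell`, and the exact bookkeeping, are illustrative, not the authors' implementation):

```python
def extend_path(D_mn, path_len, s_next, thr_prun, best_at_cell):
    """Decide whether a forward path may be extended to the next cell.

    D_mn: accumulated similarity D(m, n) of the path so far
    path_len: number of steps in the path, M(m, n)
    s_next: similarity s(m', n') at the candidate cell
    thr_prun: pruning threshold on the length-normalized similarity
    best_at_cell: best normalized similarity of any path already at (m', n')
    Returns (extend?, normalized similarity at the candidate cell).
    """
    S_next = (D_mn + s_next) / (path_len + 1)
    return (S_next >= thr_prun and S_next > best_at_cell), S_next
```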
Backward path algorithm
• When a possible matching segment is found in the forward pass, the same procedure is applied backwards, starting from the originating SP position.
• The same procedure is followed as in the forward path, with predecessor cells (m−1, n−2), (m−1, n−1), (m−2, n−1) relative to (m, n)
Computational savings example
[Figure: similarity matrix for two utterances of the word "Barcelona", showing the small fraction of cells actually computed]
Experimental setup
• We asked 23 people to record 47 words from 6 categories, 5 iterations each:
  Nature, Cities, People, Events, Family, Monuments
• Simple energy-based trimming eliminates non-speech regions
• We simulate acoustic context by attaching different start-end audio sequences to X_{U,V}:
  X_{U,V}[n, i], i = 1...5, n = 1...47
Experimental setup (II)
• Signals are parameterized with 10 MFCCs every 10 ms
• Each word X_U is compared to all words X_V from the same speaker (234 comparisons) and the closest one is retrieved:
  argmin_{m,j} D(X_U[n, i], X_V[m, j]) | (n, i) ≠ (m, j)
  We get a hit if m = n, a miss otherwise
• Tests were performed on an Ubuntu Linux PC @ 2.4GHz.
Comparing systems
• Standard DTW
  – Compare the sequences without any added acoustic context (i.e. prior knowledge of start-end points)
• Segmental DTW (Park and Glass, 2005)
  – Minimum segment length of 500 ms
  – Band size of 70 ms, 50% overlap
  – Used 2 distances: Euclidean and 1 − inner product
Performance evaluation
Used metrics:
– Accuracy: percentage of words correctly matched (X_U and X_V are different iterations of the same word):
  Acc = (Σ correct matches / Σ all matches) · 100
– Average processing time per sequence pair (X_U–X_V), excluding parameterization:
  Time = Σ time(D(X_U[n, i], X_V[m, j])) / #matches
– Average ratio of frame-pair distances computed within each sequence-pair cost matrix:
  Ratio = (Σ computed(d(X_U[n, i], X_V[m, j])) / (M·N)) · 100
Results
Algorithm                      Accuracy   Avg. time   Ratio
Segmental DTW w/ Eucl.         80.61%     82.7 ms     1
Segmental DTW w/ inner prod.   74.62%     86.7 ms     1
U-DTW horiz. bands             89.53%     10.6 ms     0.51
U-DTW diag. bands              89.34%     9.0 ms      0.42
Standard DTW                   95.42%     0.6 ms      1
Effect of the Cutout Threshold
Conclusions and future work
• We propose a novel algorithm called U-DTW for unconstrained pattern discovery in speech
• We show it is faster and more accurate than existing alternatives
• We are starting to test the algorithm for unrestricted audio summarization
MuViSync: Audio-Visual Music Synchronization
Xavier Anguera, Robert Macrae and Nuria Oliver
People enjoy listening to their favorite music everywhere…
…at home, …on the go, …or at a party with friends
Users increasingly have a personal mp3 music collection…
…but it usually contains 'only' music.
You could go to sites like YouTube…
…but the audio quality is much worse than in your mp3…
What if you could watch the video clip of any of your songs while listening to it?
What if you could listen to your high quality mp3 music while watching the video clips?
MuViSync: Music and Video Synchronization system
• MuViSync synchronizes audio and video from two different sources and plays them together in-sync
[Diagram: personal music (local) and a video clip (streaming) fed into MuViSync]
Application scenarios
• Watch your favorite music on TV
  – Personal music synchronization with video clips, either local or streamed
• Watch your music on your iPhone
  – Personal music synchronization by streaming the video to the iPhone
• Identify and watch any music
  – Combined with songID technology, either at home or on the go.
MuViSync application
• We have developed a prototype application for Windows/Mac, and soon for iPhone.
Alignment algorithm requirements
• Perform an alignment between the mp3 music and the video's audio track
• Initially only partial knowledge is available from both sources (live recording or buffering)
• Alignment has to be done online and in real-time
• Emphasis is needed on user satisfaction when playing the video.
Application testbed
• We use 320 music videos (YouTube) + their corresponding mp3 files
• A supervised ground-truth alignment was performed using offline DTW and checking for consistency
• Audio is processed every 100 ms (200 ms window) and chroma features are extracted
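Chroma features fold spectral energy into 12 pitch classes. A toy sketch follows (real systems use tuned filter banks; the exact feature implementation used in MuViSync is not specified on the slides):

```python
import numpy as np

def chroma_frame(frame, sr):
    """Fold an FFT magnitude spectrum of one audio frame into a 12-bin
    chroma vector (pitch class 0 = A), normalized to unit length."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, mag in zip(freqs[1:], spec[1:]):  # skip the DC bin
        pc = int(round(12 * np.log2(f / 440.0))) % 12
        chroma[pc] += mag
    norm = np.linalg.norm(chroma)
    return chroma / norm if norm > 0 else chroma
```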
MuViSync online alignment algorithm
1. Initial path discovery
   – Both signals (audio and video) are buffered, features are extracted and an initial alignment is found
2. Real-time online alignment
   – An incremental alignment is computed
3. Alignment post-processing to ensure a smooth playback of the aligned video.
[Diagram: audio + feature extraction and video feature extraction feed 1) initial path discovery and 2) real-time alignment, producing the (t_a, t_v) alignment]
Initial path discovery (online mp3 playback + video buffering)
[Figure: audio available from the video vs. audio from the mp3 file, showing the video buffering end and the sync request point]
Initial path discovery
• A segment of the audio and the buffered video are checked for alignment using forward-DTW
• The global similarity D(m, n) at each location (m, n) is normalized by the length of the optimum path to that location
• At each step, all paths with D′(m, n) < D_ave(*, n) are pruned.
• The initial alignment is selected when only one path survives or the sync time is reached.
Initial path discovery
[Animation over the matrix of audio available from the video vs. audio being played from the mp3: candidate alignment paths are progressively extended and pruned within the audio time alignment buffer (about 1 s)]
Real-time online alignment
• Starting from the initial alignment we iteratively compute:
  1. Locally optimal forward path for L steps, p_1…p_L, using a) local constraints (no dynamic programming)
  2. Backward (standard) DTW from p_L to p_1 using b) local constraints
  3. Add the initial L/2 steps to the final path, and restart 1) from p_{L/2}, until the playback ends
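Step 1, the locally optimal forward path computed without dynamic programming, can be sketched as follows (the step set {(1, 1), (1, 2), (2, 1)} is an assumption about the local constraints, which the slides do not spell out):

```python
import numpy as np

def greedy_forward(sim, start, L):
    """Greedily extend a path for L steps, always taking the neighbor with
    the highest similarity.  sim: precomputed similarity matrix;
    start: (m, n) starting cell."""
    m, n = start
    path = [(m, n)]
    steps = [(1, 1), (1, 2), (2, 1)]  # assumed local constraints
    for _ in range(L - 1):
        candidates = [(m + dm, n + dn) for dm, dn in steps
                      if m + dm < sim.shape[0] and n + dn < sim.shape[1]]
        if not candidates:
            break
        m, n = max(candidates, key=lambda c: sim[c])
        path.append((m, n))
    return path
```

On a matrix whose best matches lie on the diagonal, the greedy path simply follows it; the backward DTW pass of step 2 then refines this locally chosen path.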
Real-time online alignment
[Animation over the matrix of audio available from the video vs. audio being played from the mp3:
 1) forward locally best path with L=8, from p_1 to p_L;
 2) standard DTW backwards from p_L to p_1;
 3) the new starting point is moved forward]
Alignment post-processing
• Alignment estimates every 100 ms are not enough to drive 25/30 fps video
• Interpolation of the points + averaging over 5 seconds gives the projection estimate for the current playback position
Experiments
• We use 320 videos + mp3s, aligned using offline DTW and manually checked for consistency.
• Accuracy is computed as the % of songs with average error below a given threshold (in ms).
[Figure: average accuracy @100 ms for different video buffer lengths]
Video Duplicate Detection
Xavier Anguera and Pere Obrador
Let's say you're looking for the Bush attack video…
…and you get 11,100 results.
…after 40 minutes of watching many of the videos returned, you notice that many are similar, i.e. near duplicates:
– 27% on average in YouTube [Wu et al., 2007]
– 12% on average in YouTube [Anguera et al., 2009]
Near duplicate (NDVC) definition
• Identical or approximately identical videos that differ in some feature:
  – file formats, encoding parameters
  – photometric variations (color, lighting changes)
  – overlays (caption, logo, audio commentary)
  – editing operations (frames added/removed)
  – semantic similarity
NDVC are videos that are "essentially the same"
Near duplicates (NDVC) vs. video copies
• These two concepts are not clearly distinguished in the literature.
• Video copy: an exact video segment, with some transformations applied to it
• Near duplicate: similar videos on the same topic (different viewpoints, semantically similar videos, …)
In our research we address video copy detection
Examples of video copies
Use Scenarios: Copyright law enforcement
• Detection of copyright-infringing videos on online video sharing sites
• In a recent study we found that on average 12% of search results on YouTube are copies of the same video
Use Scenarios: Video forensics for illegal activities
• Currently, police forces usually have to manually scroll through ALL material in child-abuse cases searching for evidence
• Discover illegal content hidden within other videos
Use Scenarios: Database management
• Database management/optimization, and helping in searches over historic content
• Video excerpts used several times
Use Scenarios: advertisement detection and management
• Advertisement detection/identification
• Programming analysis
Use Scenarios: Information overload reduction
• Improved (more diverse) video search results by clustering all video duplicates
[Figure: "George Bush" search results before and after clustering]
Steps in Video Duplicate detection
1. Indexing of the reference videos
   A. Obtain features representing the video
   B. Store these features in a scalable manner
2. Search of queries within the reference set
[Diagram: OFFLINE — reference videos → feature extraction → features database; ONLINE — query video → feature extraction → search for duplicates]
Ways to approach near-duplicate video detection
• Local features
  – Extracted from selected frames in the videos
  – Focus on local characteristics within those frames
• Global features
  – Extracted from selected frames or from the whole video
  – Focus on overall characteristics
Local features
• Build on prior work on image copy / near-duplicate detection
• Steps:
  – Keyframes are first extracted from the videos at regular intervals or by detecting shots
  – Local features are obtained for these keyframes: SIFT, SURF, HARRIS, …
Global Features
• Features are extracted either from the whole video or from keyframes, by looking at the overall image (not at particular points).
• In our work we extract them from the whole video
Multimodal video copy detection
• Most works use only video/image information
  – They prefer local features for their robustness
• We introduce audio information by combining global features from both the audio and video tracks
• We are also experimenting with fusing local features with global features (work in progress)
Multimodal global features
• We use features based on changes in the data → more robust to transformations
• Video:
  – Hue + saturation interframe change
  – Lightest and darkest centroid interframe distance
• Audio:
  – Bayesian Information Criterion (BIC) between adjacent segments
  – Cross-BIC between adjacent segments
  – Kullback-Leibler divergence (KL2) between adjacent segments
Hue+Saturation interframe change
1. Transform the colorspace from RGB to HSV (Hue+Saturation+Value)
2. For each two consecutive frames, compute their HS histograms and compute their intersection
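The intersection formula is not reproduced on the slide; the standard histogram intersection (sum of bin-wise minima of normalized histograms) can be sketched as follows (the bin count is an illustrative choice, not taken from the talk):

```python
import numpy as np

def hist_intersection(h1, h2):
    """Standard histogram intersection of two normalized histograms;
    1.0 means identical, 0.0 means disjoint."""
    return float(np.minimum(h1, h2).sum())

def hs_change(frame_hsv_a, frame_hsv_b, bins=16):
    """Interframe change from the Hue+Saturation histograms of two
    consecutive frames (HSV arrays with channel values in [0, 1])."""
    def hs_hist(f):
        h, _ = np.histogramdd(f[..., :2].reshape(-1, 2),
                              bins=(bins, bins), range=((0, 1), (0, 1)))
        return h / h.sum()
    return hist_intersection(hs_hist(frame_hsv_a), hs_hist(frame_hsv_b))
```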
Lightest and darkest centroid interframe distance
1. Find the lightest and darkest regions in each frame and obtain their centroids
2. Compute the Euclidean distance between each two adjacent frames, obtaining two global feature streams
Acoustic features
• Compute an acoustic distance between adjacent acoustic segments
[Diagram: adjacent segments A and B modeled by GMM A, GMM B and a combined GMM A+B]
Acoustic features (II)
• Likelihood-based metrics:
  – Bayesian Information Criterion (BIC)
  – Cross-BIC
• Model distance metrics:
  – Kullback-Leibler divergence (KL2)
Acoustic features (III)
• For example, the Bayesian Information Criterion (BIC) output: [figure]
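As an illustration, a ΔBIC between two adjacent segments can be computed with single full-covariance Gaussians as follows (a sketch only: the talk's system models segments with GMMs, and its penalty weight is not given, so `lam` is an assumption):

```python
import numpy as np

def delta_bic(A, B, lam=1.0):
    """Delta-BIC between adjacent feature segments A (Na x D) and B (Nb x D),
    each modeled by one full-covariance Gaussian.  Larger values indicate a
    stronger acoustic change between the segments."""
    def logdet(X):
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        return np.linalg.slogdet(cov)[1]
    na, nb = len(A), len(B)
    n, d = na + nb, A.shape[1]
    # Likelihood gain of modeling the segments separately vs. jointly
    gain = 0.5 * (n * logdet(np.vstack([A, B])) - na * logdet(A) - nb * logdet(B))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty
```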
Search for full copies
• For each video-query pair we compute the correlation of each feature pair
• We then find the positions with high similarity (peaks)
[Diagram: reference and possible-copy feature streams → FFT → cross-multiplication → IFFT → find peaks]
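The FFT / cross-multiplication / IFFT pipeline on the slide corresponds to FFT-based cross-correlation; a one-dimensional sketch over a single feature stream:

```python
import numpy as np

def xcorr_peak(ref, query):
    """Locate a query feature stream inside a reference stream using
    FFT-based cross-correlation (zero-padded to avoid circular wrap)."""
    n = len(ref) + len(query) - 1
    nfft = 1 << (n - 1).bit_length()  # next power of two
    R = np.fft.rfft(ref, nfft)
    Q = np.fft.rfft(query, nfft)
    corr = np.fft.irfft(R * np.conj(Q), nfft)[:n]
    lag = int(np.argmax(corr))        # offset of the query within ref
    return lag, corr
```

Accumulating such correlation peaks across feature streams and modalities is what the fusion step on the next slides operates on.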
Multimodal fusion
• When multiple modalities are available, fusion is performed on the correlations
Output score
• The resulting score is computed as a weighted sum of the different modalities' normalized dot products at the found peak
• Automatic weights are obtained via
Finding subsegments of the query
• The previously described algorithm assumes the whole query matches a portion of the reference videos
• To avoid this restriction, a modification of the algorithm first splits the query into overlapping 20s segments
• By accumulating the resulting peaks for each segment we can obtain the main delay and its matching segment
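The segmentation step can be sketched as follows (only the 20 s segment length is stated on the slide; the 50% overlap hop is an illustrative assumption):

```python
def split_query(duration_s, seg_len=20.0, overlap=0.5):
    """Split a query of the given duration (seconds) into overlapping
    fixed-length (start, end) segments."""
    hop = seg_len * (1.0 - overlap)
    segs, t = [], 0.0
    while t < duration_s:
        segs.append((t, min(t + seg_len, duration_s)))
        t += hop
    return segs
```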
Algorithm performance evaluation
• To test the algorithm we used the MUSCLE-VCD database:
  – Over 100 hours of reference videos from the SoundVision group (Netherlands)
  – 2 test sets:
    • ST1: 15 query videos where the whole query is considered
    • ST2: 3 videos with 21 segments appearing in the reference database
http://www-roc.inria.fr/imedia/civr-bench/benchMuscle.html
MUSCLE-VCD transformation examples
Evaluation metrics
• We use the same metrics as in the MUSCLE-VCD benchmark tests
Evaluation metrics (II)
• We also use the more standard Precision and Recall metrics
Evaluation results
Evaluation results histogram for ST1
Youtube reranking application
• We downloaded all videos returned when searching for the top 20 most viewed and 20 most visited videos
• We applied multimodal copy detection and grouped all near duplicates
Youtube reranking test
• Results show how some videos have multiple clear copies that can boost their ranking once clustered
Thanks for your attention
xanguera@tid.es
Linkedin: http://es.linkedin.com/in/xanguera
Twitter: http://twitter.com/xanguera
Website: http://www.xavieranguera.com/