polyphonic queries a review of recent research by cory mckay
TRANSCRIPT
Melodic Fragments as Queries
Melodic fragments are a useful way of searching electronic music databases
Convenient for humans to remember musical phrases
Easy to enter musical fragments Monophonic: query by humming Polyphonic: notation software
Search Requirements
Want searches to return all occurrences of a given set of notes in a database
May also want records that contain similar sets of notes
Should be able to deal with incomplete or partially erroneous queries
Ideally, want to search music stored in Symbolic formats (MIDI, scores, sketches) Raw audio data
Monophonic Databases
Has been some success with special case of monophonic databases Andreas Kornstadt’s Themefinder Roger J. MacNab’s Meledex
Searching polyphonic databases is much more difficult
Polyphonic Databases: Problems
Notes may begin simultaneously. Makes it impossible to outline an unambiguous sequence of events
Multiple voices, with varying roles and relevance to particular queries
Hard to deal with both symbolic and raw audio representations: Monophonic: can just transcribe audio Polyphonic: no good transcription system
Current Polyphonic Systems
Wide-spread systems rely on meta-data This no good for searches of melodic
fragments Currently no widely accepted content-
based system A number of papers on topic have recently
been published
Wiggins et al. (2002)
Designed a new algorithm called SIA(M)ESE For making transposition-invariant queries Matches a query even if there are events in a
score being searched that separate musical events in the query
Assumes that both queries and database files are in symbolic form and are accurate
This limits generality of algorithm No implementation of the algorithm given
Doraisamy & Rüger (2001)
Polyphonic queries Partial queries permitted Symbolic queries Symbolic records Pitch and rhythm features
Doraisamy & Rüger (2001)
N-grams are produced by converting notes into interval-based representation and grouping intervals into subdivisions of length n using a gliding window
Leads to transposition-invariant data N-grams work well with monophonic
queries
Doraisamy & Rüger (2001)
To deal with potential for simultaneous note onsets in polyphonic music, constructs exhaustive “melodic strings”: Divide piece into overlapping windows of n
adjacent onset times Find all possible combinations of onsets
within each window
Doraisamy & Rüger (2001)
Incorporates rhythmic information as well as intervals into each n-gram window
Done by calculating ratio of time differences between adjacent note onsets: Ratioi = (Onseti+2 – Onseti+1) / (Onseti+1 – Onseti)
Avoids need to quantize events based on a predetermined beat duration
By using onsets only, avoids needing to determine the duration of notes, which can be difficult to detect in raw audio recordings
Doraisamy & Rüger (2001)
N-grams converted into text-based representations to allow use of text-based search engines
Interval and rhythmic ratio histograms constructed in order to search for patterns in each piece
Doraisamy & Rüger (2001)
Tested using a database of 3096 MIDI representations of classical music
Studied effects of varying window sizes, bin ranges and query lengths
95% success rate with window sizes of 4 onset times, variable bin ranges and queries involving 50 notes
74% with query lengths of ten notes 65% with queries containing errors
Doraisamy & Rüger (2001)
Evaluation: Performs well under ideal conditions Databases or queries containing raw audio not
considered Only transposition invariant searches possible Undesirably long query lengths necessary Only classical music records tested Successful in showing the potential utility of n-grams Most recent work uses a variant of this n-gram
approach
Doraisamy & Rüger (2002)
Monophonic queries Partial queries permitted Symbolic queries Symbolic records Pitch and rhythm features
Doraisamy & Rüger (2002)
Focused on monophonic queries because of query by humming
N-grams particularly appropriate for error-prone queries resulting from query by humming
One or two mistakes only lead to a few incorrect n-grams among a larger number of correct ones
Doraisamy & Rüger (2002)
Used same overall design as previous system
Includes more sophisticated error models to test effects of query inaccuracies
Database expanded to include popular music as well as classical music
80% of the relevant compositions were returned in first 15 hits, on average
Doraisamy & Rüger (2002)
Evaluation N-grams are an effective and error-tolerant
tool for searching polyphonic music with monophonic queries
Improvements still need to be made Queries still symbolic, so true applicability of
system to query by humming not tested
Pickens et al. (2002)
Polyphonic queries Full-length queries only Audio queries Symbolic records Harmonic features
Pickens et al. (2002)
Relies on transcription algorithms to transform audio queries into symbolic form
Relies on error tolerance of searches to compensate for transcription errors
Two types of polyphonic transcription systems used, including blackboard
Pickens et al. (2002)
Transcribed queries and database records analyzed and stored using a harmonic modelling module
Characterizes pieces by mapping chords to a probability distribution Breaks into sequences of independent note sets Applies smoothing procedure to sets Markov models created from smoothed sets
Pickens et al. (2002)
Performing searches: Scoring function used to compare query
models with each model stored in database Dissimilarity scores produced Allows search hits to be ranked
Pickens et al. (2002)
Tested with database containing 3150 classical piano pieces
Queries consisted of full-length audio recordings On average, searches assigned a rank of
between 2 and 6 to the correct database record Moderately successful at matching variations of
a piece: on average, 3 of top 5 hits relevant
Pickens et al. (2002)
Evaluation Can use polyphonic audio queries Limited by effectiveness of its transcription
systems and error tolerance of the query system
System only tested on piano music Full-length queries limit usefulness Only one feature used
Song, Bae & Yoon (2002)
Monophonic queries Partial queries permitted Audio queries Audio records Melodic features
Song, Bae & Yoon (2002)
Intended to be used with query by humming
Avoids disadvantages of automated transcription by mapping audio data directly to “mid-level” melody-based feature set description
Contrasts with approach of first transcribing audio data and then extracting features from this high-level symbolic representation
Song, Bae & Yoon (2002)
“Mid-level” representation produced by processing audio frames using a five-step process
Song, Bae & Yoon (2002)
Instead of making definite decision on which notes were present, as a blackboard system would have done, vectors of all possible notes were kept for each audio segment
A DP-matching method was used during searches so that potentially error-prone patterns of different lengths could be compared
Song, Bae & Yoon (2002)
Tested by attempting to match 176 hummed samples to 92 short extracts (15-20 seconds long) of popular Korean and Western popular songs
Resulted in exact matches roughly 43% of the time and a match in the top ten from 69% to 76% of the time (depending on window size)
Search time varied from 3 to 14 seconds
Song, Bae & Yoon (2002)
Evaluation Performance relatively poor Only monophonic queries possible Only tested using a database of short
recordings This was only system using audio data for
both queries and database records
Conclusions
No truly viable systems produced yet Promising approaches have been proposed Recurring problems:
None of systems can deal with both symbolic and audio data
None have been tested with both polyphonic and monophonic queries
Tend to require long queries to achieve good results Searches do not allow much flexibility (e.g. must be
transposition invariant)
Conclusions
Difficult to compare systems because they each use different performance evaluation metrics
Limited data sets used during testing