Lyrics Segmentation via Bimodal Text-audio Representation†‡

Michael Fell1E, Yaroslav Nechaev2, Gabriel Meseguer-Brocal3, Elena Cabrio1, Fabien Gandon1 and Geoffroy Peeters4

1Université Côte d'Azur, CNRS, Inria, I3S, France
2Amazon, Cambridge, MA, USA

3Ircam Lab, CNRS, Sorbonne Université, France
4LTCI, Télécom Paris, Institut Polytechnique de Paris, France

ECorresponding author. Email: [email protected]

(Received 14 October 2021)

Abstract

Song lyrics contain repeated patterns that have been proven to facilitate automated lyrics segmentation, with the final goal of detecting the building blocks (e.g. chorus, verse) of a song text. Our contribution in this article is two-fold. First, we introduce a convolutional neural network-based model that learns to segment the lyrics based on their repetitive text structure. We experiment with novel features to reveal different kinds of repetitions in the lyrics, for instance based on phonetic and syntactic properties. Second, using a novel corpus where the song text is synchronized to the audio of the song, we show that the text and audio modalities capture complementary structure of the lyrics and that combining both is beneficial for lyrics segmentation performance. For purely text-based lyrics segmentation on a dataset of 103k lyrics, we achieve an f-score of 67.4%, improving on the state of the art (59.2% f-score). On the synchronized text-audio dataset of 4.8k songs, we show that the additional audio features improve segmentation performance to 75.3% f-score, significantly outperforming the purely text-based approaches.

1 Introduction

Understanding the structure of song lyrics (e.g. intro, verse, chorus) is an important

task for music content analysis (Cheng et al., 2009; Watanabe et al., 2016) since it

allows splitting a song into semantically meaningful segments, enabling a description

of each section rather than a global description of the whole song. The importance of

this task arises also in Music Information Retrieval, where music structure detection

is a research area aiming at automatically estimating the temporal structure of a

† This article is an extended version of (Fell et al., 2018).
‡ This work is partly funded by the French National Research Agency (ANR) under the WASABI project (contract ANR-16-CE23-0017-01).


music track by analyzing the characteristics of its audio signal over time. Given

that lyrics contain rich information about the semantic structure of a song, relying

on textual features could help in overcoming the existing difficulties associated with

large acoustic variation in music. However, so far only a few works have addressed

the task from the lyrics side (Fell et al., 2018; Mahedero et al., 2005; Watanabe et al., 2016;

Barate et al., 2013). Carrying out structure detection by means of an automated

system is therefore a challenging but useful task that would allow enriching song

lyrics with improved structural clues that can be used for instance by search engines

handling real-world large song collections. Going a step further, a complete music search

engine should support search criteria exploiting both the audio and the textual

dimensions of a song.

Structure detection consists of two steps: a text segmentation stage that divides

lyrics into segments, and a semantic labelling stage that labels each segment with a

structure type (e.g. intro, verse, chorus). Given the variability in the set of structure

types provided in the literature according to different genres (Tagg, 1982; Brackett,

1995), rare attempts have been made to achieve the second step, i.e. semantic

labelling. While addressing the first step is the core contribution of this paper, we

leave the task of semantic labelling for future work.

In (Fell et al., 2018) we proposed a first neural approach for lyrics segmentation

that relied on purely textual features. However, with this approach we fail to

capture the structure of the song when there is no clear structure in the lyrics -

when sentences are never repeated or in the opposite case when they are always

repeated. In such cases, however, the structure may arise from the acoustic/audio

content of the song, often from the melody representation. This paper aims at

extending the approach proposed in (Fell et al., 2018) by complementing the textual

analysis with acoustic aspects. We perform lyrics segmentation on a synchronized

text-audio representation of a song to benefit from both textual and audio features.

In this direction, this work focuses on the following research question: given the

text and audio of a song, can we learn to detect the lines delimiting segments in

the song text? This question is broken down into two sub-questions: 1) given solely

the song text, can we learn to detect the lines delimiting segments in the song? and

2) do audio features - in addition to the text - boost the model performance on the

lyrics segmentation task?

To address these questions, this article contains the following contributions. Con-

tributions 1a) and 1b) have been previously published in (Fell et al., 2018), while

2) is a novel contribution.

1a) We introduce a convolutional neural network-based model that i) efficiently

exploits the Self-Similarity Matrix representations (SSM) used in the state-

of-the-art (Watanabe et al., 2016), and ii) can utilize traditional features

alongside the SSMs (see Sections 2 through 2.2).

1b) We experiment with novel features that aim at revealing different properties

of a song text, such as its phonetics and syntax. We evaluate this unimodal

(purely text-based) approach on two standard datasets of English lyrics, the


Music Lyrics Database and the WASABI corpus (see Section 3.1). We show

that our proposed method can effectively detect the boundaries of music seg-

ments outperforming the state of the art, and is portable across collections of

song lyrics of heterogeneous musical genre (see Sections 3.2-3.4).

2) We experiment with a bimodal lyrics representation (see Section 2.3) that

incorporates audio features into our model. For this, we use a novel bimodal

corpus (DALI, see Section 4.1) in which each song text is time-aligned to

its associated audio. Our bimodal lyrics segmentation approach performs significantly

better than the unimodal approach. We investigate which text and audio

features are the most relevant to detect lyrics segments and show that the

text and audio modalities complement each other. We perform an ablation

test to find out to what extent our method relies on the alignment quality of

the lyrics-audio segment representations (see Sections 4.2-4.4).

To better understand the rationale underlying the proposed approach, consider the

segmentation of the Pop song depicted in Figure 1. The left side shows the lyrics and

its segmentation into its structural parts: the horizontal green lines indicate the seg-

ment borders between the different lyrics segments. We can summarize the segmen-

tation as follows: Verse1-Verse2-Bridge1-Chorus1-Verse3-Bridge2-Chorus2-Chorus3-

Chorus4-Outro. The middle of Figure 1 shows the repetitive structure of the lyrics.

The exact nature of this structure representation is introduced later and is not

needed to understand this introductory example. The crucial point is that the seg-

ment borders in the song text (green lines) coincide with highlighted rectangles

in the chorus (the Ci) of the lyrics structure (middle). We find that in the verses

(the Vi) and bridges (the Bi) highlighted rectangles are only found in the melody

structure1 (right). The reason is that these verses have different lyrics, but share

the same melody (analogous for the bridges). While the repetitive structure of the

lyrics is an effective representation for lyrics segmentation, we believe that an en-

riched segment representation that also takes into account the audio of a song can

improve segmentation models. While previous approaches relied on purely textual

features for lyrics segmentation, showing the discussed limitations, we propose to

perform lyrics segmentation on a synchronized text-audio representation of a song

to benefit from both textual and audio features.

Earlier in this section, we presented our research questions and motivation, along

with a motivational example. In the remainder of the paper, in Section 2 we de-

fine the task of classifying lines as segment borders, the classification methods we

selected for the task, and the bimodal text-audio representation. In Section 3 we

describe our lyrics segmentation experiments using text lines as input. Then, Sec-

tion 4 describes our lyrics segmentation experiments using multimodal (unimodal

or bimodal) lyrics lines, containing text or audio information or both as input.

We follow up with a shared error analysis for both experiments in Section 5. In

Section 6 we position our work in the current state of the art, and in Section 7

1 Technically, what we show is a part of the audio structure, based on chroma features. We describe these features and their connection to the melody in more detail in Section 2.3. For the purpose of a simpler presentation, we often call this the melody structure.


we conclude with future research directions to provide more metadata to music

information retrieval systems.

2 Modelling Segments in Song Lyrics

Detecting the structure of a song text is a non-trivial task that requires diverse

knowledge and consists of two steps: text segmentation followed by segment la-

belling. In this work we focus on the task of segmenting the lyrics. This first step is

fundamental to segment labelling when segment borders are not known. Even when

segment borders are “indicated” by line breaks in lyrics available online, those line

breaks have usually been annotated by users and neither are they necessarily iden-

tical to those intended by the songwriter, nor do users in general agree on where

to put them. Thus, a method to automatically segment unsegmented song texts is

needed to automate that first step. Many heuristics can be imagined to find the

segment borders. In our example, separating the lyrics into segments of a constant

length of four lines (Figure 1) gives the correct segmentation. However, in another

example, the segments can be of different lengths. This is to say that enumerating

heuristic rules is an open-ended task.

We follow (Watanabe et al., 2016) by casting the lyrics segmentation task as

binary classification. Let L = {a1, a2, ..., an} be the lyrics of a song composed of n

text lines and seg : L −→ B be a function that indicates for each line ai ∈ L whether it is

the end of a segment. Here, B = {0, 1} is the Boolean domain. The task is to learn

a classifier that approximates seg. At the learning stage, the ground truth segment

borders are observed as double line breaks in the lyrics. At the testing stage, we

hide the segment borders and the classifier has to predict them.
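To make this setup concrete, the following minimal Python sketch (not the authors' code; the function name and the toy lyrics are hypothetical) shows how the binary border labels can be read off the double line breaks of a segmented song text; at test time, only the lines are given to the classifier.

def segment_labels(lyrics):
    # Split on double line breaks and mark the last line of each segment as a border.
    segments = [seg.splitlines() for seg in lyrics.strip().split("\n\n") if seg.strip()]
    lines, labels = [], []
    for seg in segments:
        for i, line in enumerate(seg):
            lines.append(line)
            labels.append(1 if i == len(seg) - 1 else 0)
    return lines, labels

lines, labels = segment_labels("Verse line 1\nVerse line 2\n\nChorus line 1\nChorus line 2")
# labels == [0, 1, 0, 1]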

As lyrics are texts that accompany music, their text lines do not exist in isolation.

Instead, each text line is naturally associated to a segment of audio. We define a

bimodal lyrics line ai = (li, si) as a pair containing both the i-th text line li, and

its associated audio segment si. When we only use the text lines, we model

this as unimodal lyrics lines, i.e. ai = (li).2

In order to infer the lyrics structure, we rely on our Convolutional Neural

Network-based model that we introduced in (Fell et al., 2018). Our model archi-

tecture is detailed in Section 2.2. It detects segment boundaries by leveraging the

repeated patterns in a song text that are conveyed by the Self-Similarity Matrices.

2 This definition can be straightforwardly extended to more modalities; ai then becomes a tuple containing time-synchronized information.


Fig. 1: Lyrics (left) of a Pop song, the repetitive structure of the lyrics (middle), and the repetitive structure of the song melody (right). Lyrics segment borders (green lines) coincide with highlighted rectangles in lyrics structure and melody structure. (“Don‘t Break My Heart” by Den Harrow)


2.1 Self-Similarity Matrices

We produce Self-Similarity Matrices (SSMs) based on bimodal lyrics lines ai =

(li, si) in order to capture repeated patterns in the text line li as well as its as-

sociated audio segment si. SSMs have been previously used in the literature to

estimate the structure of music (Foote, 2000; Cohen-Hadria & Peeters, 2017) and

lyrics (Watanabe et al., 2016; Fell et al., 2018). Given a song consisting of lyrics

lines {a1, a2, ..., an}, a Self-Similarity Matrix SSMM ∈ Rn×n is constructed, where

each element is set by computing a similarity measure between the two correspond-

ing elements (SSMM)ij = simM(xi, xj). We choose xi, xj to be elements from the

same modality, i.e. they are either both text lines (li) or both audio segments (si)

associated to text lines. simM is a similarity measure that compares two elements

of the same modality to each other. In the unimodal case, we compute SSMs from

only one modality: either text lines li or audio segments si.
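As an illustration of this construction, a minimal numpy sketch (the function and variable names are ours, not the authors') that builds one SSM from an arbitrary pairwise similarity and stacks several SSMs into a multi-channel tensor could look as follows:

import numpy as np

def self_similarity_matrix(elements, sim):
    # elements: same-modality items (all text lines, or all audio segments);
    # sim: a pairwise similarity measure simM returning a value in [0, 1].
    n = len(elements)
    ssm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ssm[i, j] = sim(elements[i], elements[j])
    return ssm

# Several SSMs (e.g. string-, phonetic- and chroma-based) are time-aligned, so they
# can be stacked into one tensor with c channels, e.g.:
#   ssm_tensor = np.stack([ssm_str, ssm_phon, ssm_chr], axis=-1)   # shape (n, n, c)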

As a result, SSMs constructed from a text-based similarity highlight distinct

patterns of the text, revealing the underlying structure. Analogously, SSMs con-

structed from an audio-based similarity highlight distinct patterns of the audio. In

our motivational example, the textual SSM encodes how similar the text lines are

on a character level (see Figure 1, middle) while the audio SSM encodes how similar

the associated melodies are to each other (see Figure 1, right). In our experiments,

we work with text-based similarities (see Section 3.2) as well as audio-based simi-

larities (see Section 4.2). While in our motivational example we manually overlay

the different SSMs, to find that some structural elements are only unveiled by the

melody - and not by the text - in our neural architecture, we overlay different SSMs

by stacking them into a single time-aligned tensor with c channels, as described in

the following.

There are two common patterns that were investigated in the literature: diagonals

and rectangles. Diagonals parallel to the main diagonal indicate sequences that

repeat and are typically found in a chorus. Rectangles, on the other hand, indicate

sequences in which all the lines are highly similar to one another. Both of these

patterns were found to be indicators of segment borders.

2.2 Convolutional Neural Network-based Model

Lyrics segments manifest themselves in the form of distinct patterns in the SSM.

In order to detect these patterns efficiently, we introduce the Convolutional Neural

Network (CNN) architecture which is illustrated in Figure 2. The model predicts for

each lyrics line whether it is a segment ending. For each of the n lines of a song text the model

receives patches (see Figure 2, step A) extracted from SSMs ∈ Rn×n and centered

around the line: inputi = {P1i, P2i, ..., Pci} ∈ R2w×n×c, where c is the number of

SSMs or number of channels and w is the window size. To ensure the model captures

the segment-indicating patterns regardless of their location and relative size, the

input patches go through two convolutional layers (see Figure 2, step B) (Goodfellow

et al., 2016), using filter sizes of (w+1)×(w+1) and 1×w, respectively. By applying

max pooling after both convolutions each feature is downsampled to a scalar. After


Fig. 2: Convolutional Neural Network-based model inferring lyrics segmentation.

the convolutions, the resulting feature vector is concatenated with the line-based

features (see Figure 2, step C) and goes through a series of densely connected layers.

Finally, the softmax is applied to produce probabilities for each class (border/not

border) (see Figure 2, step D). The model is trained with supervision using binary

cross-entropy loss between predicted and ground truth segment border labels (see

Figure 2, step E). Note that while the patch extraction is a local process, the SSM

representation captures global relationships, namely the similarity of a line to all

other lines in the lyrics.
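For illustration, a rough tf.keras sketch of such a patch-level classifier is given below. It is not the authors' implementation: the exact pooling placement, activations and optimizer are assumptions, while the window size, number of filters and hidden units follow the values reported later in Section 3.3.

import tensorflow as tf

def build_patch_classifier(w=2, channels=3, n_line_features=240,
                           n_filters=128, hidden=512):
    # Input A: SSM patches of height 2w centered around the current line
    # (the width, i.e. the number of lines n, varies per song).
    patches = tf.keras.Input(shape=(2 * w, None, channels))
    # Optional line-based features (e.g. the 240 n-gram term features).
    line_features = tf.keras.Input(shape=(n_line_features,))

    # Step B: two convolutions with filter sizes (w+1)x(w+1) and 1xw,
    # followed by max pooling so that every feature map ends up as a scalar.
    x = tf.keras.layers.Conv2D(n_filters, (w + 1, w + 1), activation="relu")(patches)
    x = tf.keras.layers.MaxPooling2D(pool_size=(2, 1))(x)
    x = tf.keras.layers.Conv2D(n_filters, (1, w), activation="relu")(x)
    x = tf.keras.layers.GlobalMaxPooling2D()(x)

    # Step C: concatenate with line-based features, then dense layers.
    x = tf.keras.layers.Concatenate()([x, line_features])
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)
    # Step D: softmax over the two classes (border / not border).
    out = tf.keras.layers.Dense(2, activation="softmax")(x)

    model = tf.keras.Model([patches, line_features], out)
    # Step E: cross-entropy between predicted and ground truth border labels.
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model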

2.3 Bimodal Lyrics Lines

To perform lyrics segmentation on a bimodal text-audio representation of a song to

benefit from both textual and audio features, we use a corpus where the annotated

lyrics ground truth (segment borders) is synchronized with the audio. This bimodal

dataset is described in Section 4.1. We focus solely on the audio extracts that have

singing voice, as only they are associated to the lyrics. For that let ti be the time

interval of the (singing event of) text line li in our synchronized text-audio corpus.

Then, a bimodal lyrics line ai = (li, si) consists of both a text line li (the text

line during ti) and its associated audio segment si (the audio segment during ti).

As a result, we have the same number of text lines and audio segments. While

the text lines li can be used directly to produce SSMs, the complexity of the raw

audio signal prevents it from being used as direct input of our system. Instead,

it is common to extract features from the audio that highlight some aspects of

the signal that are correlated with the different musical dimensions. Therefore, we

describe each audio segment si as a set of different time vectors. Each frame of a

vector contains information of a precise and small time interval. The size of each

audio frame depends on the configuration of each audio feature. Specifically, we use

a sample rate of 22kHz to extract from each time frame two sets of features using

librosa.feature (McFee et al., 2015). We call an audio segment si featurized by a

feature f if f is applied to all frames of si. For our bimodal segment representation

we featurize each si with one of the following features:

• Mel-frequency cepstral coefficients (mfcc ∈ R14): these coefficients


(Davis & Mermelstein, 1980) emphasize parts of the signal that are related

with our understanding of the musical timbre. The mfcc describe the over-

all shape of a spectral envelope of a signal as a set of features. We extract

15 coefficients and discard the first component as it only conveys a constant

offset.

• Chroma feature (chr ∈ R12): this feature (Fujishima, 1999) describes the

harmonic information of each frame by computing the “presence” of the twelve

different notes. We compute a 12-element feature vector where each element

corresponds to one pitch class of Western music, in which one octave is divided

into 12 equal-tempered pitches.
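By way of illustration, the two feature sets can be extracted per audio segment as sketched below. This is a sketch under our assumptions, not the authors' extraction pipeline; in particular, chroma_stft is only one of several chroma implementations available in librosa.

import librosa

def featurize_line_audio(audio_path, t_start, t_end, sr=22050):
    # Load only the singing event of one text line (the time interval ti)
    # and compute per-frame mfcc and chroma features.
    y, _ = librosa.load(audio_path, sr=sr, offset=t_start, duration=t_end - t_start)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=15)[1:]   # drop the first coefficient -> 14 x n_frames
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # 12 x n_frames
    return mfcc, chroma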

3 Experiments with Unimodal Text-based Representations

This section describes our lyrics segmentation experiments using text lines as input.

First, we describe the datasets used (Section 3.1). We then define the different

similarity measures used to construct the self-similarity matrices (see Section 3.2).

In Section 3.3 we describe the models and configurations that we have investigated.

Finally, we present and discuss the obtained results (Section 3.4).

3.1 Datasets: MLDB and WASABI

Song texts are available widely across the Web in the form of user-generated content.

Unfortunately for research purposes, there is no comprehensive publicly available

online resource that would allow a more standardized evaluation of research results.

This is mostly attributable to copyright limitations and has been criticized before in

(Mayer & Rauber, 2011). Research therefore is usually undertaken on corpora that

were created using standard web-crawling techniques by the respective researchers.

Due to the user-generated nature of song texts on the Web, such crawled data

is potentially noisy and heterogeneous, e.g. the way in which line repetitions are

annotated can range from verbatim duplication to something like Chorus (4x) to

indicate repeating the chorus four times.

In the following we describe the lyrics corpora we used in our experiments. First,

MLDB and WASABI are purely textual corpora. Complementarily, DALI is a cor-

pus that contains bimodal lyrics representations in which text and audio are syn-

chronized.

The Music Lyrics Database (MLDB) V.1.2.73 is a proprietary lyrics corpus of

popular songs of diverse genres. We use this corpus in the same configuration as

used before by the state of the art in order to facilitate a comparison with their

work. Consequently, we only consider English song texts that have five or more

segments and we use the same training, development and test indices, which is a

60%-20%-20% split. In total we have 103k song texts with at least 5 segments. 92%

of the remaining song texts have between 6 and 12 segments.

3 http://www.odditysoftware.com/page-datasales1.htm


The WASABI corpus4 (Meseguer-Brocal et al., 2017), is a larger corpus of song

texts, consisting of 744k English song texts with at least 5 segments, and for each

song it provides the following information: its lyrics5, the synchronized lyrics when

available6, DBpedia abstracts and categories the song belongs to, genre, label,

writer, release date, awards, producers, artist and/or band members, the stereo

audio track from Deezer, when available, the unmixed audio tracks of the song, its

ISRC, bpm, and duration.

3.2 Similarity Measures

In the following, we define the text-based similarities used to compute the SSMs.

Given the text lines of the lyrics, we compute three line-based text similarity mea-

sures, based on either their characters, their phonetics or their syntax.

• String similarity (simstr): a normalized Levenshtein string edit similarity

between the characters of two lines of text (Levenshtein, 1966). This has been

widely used - e.g. (Watanabe et al., 2016; Fell et al., 2018).

• Phonetic similarity (simphon): a simplified phonetic representation of the

lines computed using the “Double Metaphone Search Algorithm” (Philips,

2000). When applied to “i love you very much” and “i’l off you vary match”

it returns the same result: “ALFFRMX”. This algorithm was developed to

capture the similarity of similar sounding words even with possibly very dis-

similar orthography. We translate the text lines into this “phonetic language”

and then compute simstr between them.

• Lexico-syntactical similarity (simlsyn): this measure was initially pro-

posed in (Fell, 2014) to capture both the lexical similarity between text lines

as well as the syntactical similarity. Consider the two text lines “Look into

my eyes” and “I look into your eyes”: there is a similarity on the lexical level,

as similar words are used. Also, the lines are similar on the syntactical level,

as they share a similar word order. We estimate the lexical similarity simlex

between two lines as their relative word bigram overlap, and in analogy, we

estimate their syntactical similarity simsyn via relative POS tag bigram over-

lap. Finally, we define the lexico-syntactical similarity simlsyn as a weighted

sum of simlex and simsyn. The details of the computation are described in the

Appendix - Section B.
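As a small illustration of the string similarity (our sketch; the authors' exact normalization may differ slightly), a normalized Levenshtein similarity can be computed as follows; the phonetic similarity simply reuses it on phonetic encodings of the lines.

def levenshtein(a, b):
    # Plain dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def sim_str(x, y):
    # Normalized string edit similarity in [0, 1].
    if not x and not y:
        return 1.0
    return 1.0 - levenshtein(x, y) / max(len(x), len(y))

# The phonetic similarity reuses sim_str on a phonetic encoding of the lines,
# e.g. (hypothetically) via the `metaphone` package:
#   from metaphone import doublemetaphone
#   sim_phon = sim_str(doublemetaphone(x)[0], doublemetaphone(y)[0])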

3.3 Models and Configurations

We represent song texts via text lines and experiment on the MLDB and WASABI

datasets. We compare to the state of the art (Watanabe et al., 2016) and successfully

reproduce their best features to validate their approach. Two groups of features are

4 https://wasabi.i3s.unice.fr/
5 Extracted from http://lyrics.wikia.com/
6 From http://usdb.animux.de


used in the replication: repeated pattern features (RPF) extracted from SSMs and

n-grams extracted from text lines. The RPF basically act as hand-crafted image

filters that aim to detect the edges and the insides of diagonals and rectangles in

the SSM.

Then, our own models are neural networks as described in Section 2.2, that use

as features SSMs and two line-based features: the line length and n-grams. For

the line length, we extracted the character count from each line, a simple proxy of

the orthographic shape of the song text. Intuitively, segments that belong together

tend to have similar shapes. Similarly to (Watanabe et al., 2016)’s term features we

extracted those n-grams from each line that are most indicative for segment borders:

using the tf-idf weighting scheme, we extracted n-grams that are typically found left

or right from the segment border, varied n-gram lengths and also included indicative

part-of-speech tag n-grams. This resulted in 240 term features in total. The most

indicative words at the start of a segment were: {ok, lately, okay, yo, excuse, dear,

well, hey}. As segment-initial phrases we found: {Been a long, I’ve been, There’s

a, Won’t you, Na na na, Hey, hey}. Typical words ending a segment were: {..., ..,

!, ., yeah, ohh, woah, c’mon, wonderland}. And as segment-final phrases we found

as most indicative: {yeah!, come on!, love you., !!!, to you., with you., check it out,

at all., let’s go, ...}.

In this experiment we consider only SSMs made from text-based similarities;

we note this in the model name as CNNtext. We further name a CNN model by

the set of SSMs that it uses as features. For example, the model CNNtext{str} uses as its only feature the SSM made from string similarity simstr, while the model

CNNtext{str, phon, lsyn} uses three SSMs in parallel (as different input channels),

one from each similarity.

For the convolutional layers we empirically set wsize = 2 and the number of features

extracted after each convolution to 128. Dense layers have 512 hidden units. We

have also tuned the learning rate (negative powers of 10) and the dropout probability

with increments of 0.1. The batch size was selected from the beginning to be 256 to

better saturate our GPU. The CNN models were implemented using TensorFlow.

For comparison, we implement two baselines. The random baseline guesses for

each line independently if it is a segment border (with a probability of 50%) or not.

The line length baseline uses as only feature the line length in characters and is

trained using a logistic regression classifier.
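For concreteness, the line length baseline amounts to nothing more than the following sketch (toy data; not the authors' code), using scikit-learn's logistic regression on a single character-count feature:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy lines and border labels, e.g. as produced by the segment_labels sketch in Section 2.
lines = ["Verse line one", "Verse line two", "Chorus!", "And the chorus again"]
labels = [0, 1, 0, 1]

X = np.array([[len(line)] for line in lines])    # single feature: character count
baseline = LogisticRegression().fit(X, labels)
print(baseline.predict(X))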

In order to favor comparative analysis, the first experiments are run against the

MLDB data set (see Section 3.1) used by the state-of-the-art method (Watanabe

et al., 2016). To test the system portability to bigger and more heterogeneous

data sources, we further experimented with our method on the WASABI corpus (see

Section 3.1). In order to test the influence of genre on classification performance,

we aligned MLDB to WASABI as the latter provides genre information. Song texts

that had the exact same title and artist names (ignoring case) in both data sets were

aligned. This rather strict filter resulted in 58567 (57%) song texts

with genre information in MLDB. Table 2 shows the distribution of the genres in

MLDB song texts. We then tested our method on each genre separately, to test our

hypothesis that classification is harder for some genres in which almost no repeated


patterns can be detected (such as Rap songs). To the best of our knowledge, previous

work did not report on genre-specific results.

In this work we did not normalize the lyrics in order to rigorously compare our

results to (Watanabe et al., 2016). We estimate the proportion of lyrics containing

words that indicate the text structure (Chorus, Intro, Refrain, ...), to be marginal

(0.1-0.5%) in the MLDB corpus. When applying our methods for lyrics segmenta-

tion to lyrics found online, an appropriate normalization method should be applied

as a pre-processing step. For details on such a normalization procedure we refer the

reader to (Fell, 2014), Section 2.1.

Evaluation metrics are Precision (P), Recall (R), and f-score (F1). Significance is

tested with a permutation test (Ojala & Garriga, 2010), and the p-value is reported.
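The exact permutation scheme is not spelled out here; as one plausible reading of (Ojala & Garriga, 2010), a paired randomization test over the line-level predictions of two models could be sketched as follows (function name and arguments are illustrative):

import numpy as np
from sklearn.metrics import f1_score

def permutation_test_f1(y_true, pred_a, pred_b, n_rounds=10000, seed=0):
    # Randomly swap the two models' predictions per line and count how often
    # the F1 gap is at least as large as the observed one.
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    observed = abs(f1_score(y_true, pred_a) - f1_score(y_true, pred_b))
    count = 0
    for _ in range(n_rounds):
        swap = rng.random(len(y_true)) < 0.5
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        if abs(f1_score(y_true, a) - f1_score(y_true, b)) >= observed:
            count += 1
    return (count + 1) / (n_rounds + 1)    # estimated p-value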

3.4 Results and Discussion

Table 1 shows the results of our experiments with text lines on the MLDB dataset.

We start by measuring the performance of our replication of (Watanabe et al.,

2016)’s approach. This reimplementation exhibits 56.3% F1, similar to the results

reported in the original paper (57.7%). The divergence could be attributed to a

different choice of hyperparameters and feature extraction code. Much weaker base-

lines were explored as well. The random baseline resulted in 18.6% F1, while the

usage of simple line-based features, such as the line length (character count), im-

proves this to 25.4%.

The best CNN-based model, CNNtext{str, phon, lsyn}+n-grams, outperforms all

our baselines reaching 67.4% F1, 8.2pp better than the results reported in (Watan-

abe et al., 2016). We perform a permutation test (Ojala & Garriga, 2010) of this

model against all other models. In every case, the performance difference is statis-

tically significant (p < .05).

Subsequent feature analysis revealed that the model CNNtext{str} is by far the

most effective. The CNNtext{lsyn} model exhibits much lower performance, despite

using a much more complex feature. We believe the lexico-syntactical similarity is

much noisier as it relies on n-grams and PoS tags, and thus propagates error from

the tokenizers and PoS taggers. The CNNtext{phon} exhibits a small but measur-

able performance decrease from CNNtext{str}, possibly due to phonetic features

capturing similar regularities, while also depending on the quality of preprocess-

ing tools and the rule-based phonetic algorithm being relevant for our song-based

dataset. The CNNtext{str, phon, lsyn} model that combines the different textual

SSMs yields a performance comparable to CNNtext{str}.In addition, we test the performance of several line-based features on our dataset.

Most notably, the n-grams feature provides a significant performance improvement

producing the best model. Note that adding the line length feature to any CNNtext

model does not increase performance.

To show the portability of our method to bigger and more heterogeneous datasets,

we ran the CNN model on the WASABI dataset (as described in Section 3.1), ob-

taining results that are very close to the ones obtained for the MLDB dataset: 67.4% precision, 67.3% recall, and 67.4% f-score using the CNNtext{str} model.


Model                  Features                       P      R      F1
Random baseline        n/a                            18.6   18.6   18.6
Line length baseline   text line length               16.7   52.8   25.4
Handcrafted filters    RPF (our replication)          48.2   67.8   56.3
                       RPF (Watanabe et al., 2016)    56.1   59.4   57.7
                       RPF + n-grams                  57.4   61.2   59.2
CNNtext                {str}                          70.4   63.0   66.5
                       {phon}                         75.9   55.6   64.2
                       {lsyn}                         74.8   50.0   59.9
                       {str, phon, lsyn}              74.1   60.5   66.6
                       {str, phon, lsyn} + n-grams    72.1   63.3   67.4

Table 1: Results with text lines on MLDB dataset in terms of Precision (P), Recall (R) and F1 in %.


Results differ significantly based on genre. We split the MLDB dataset with

genre annotations into training and test, trained on all genres, and tested on each

genre separately. In Table 2 we report the performances of the CNNtext{str} on

lyrics of different genres. Songs belonging to genres such as Country, Rock or

Pop, contain recurrent structures with repeating patterns, which are more easily

detectable by the CNNtext algorithm. Therefore, they show significantly better

performance. On the other hand, the performance on genres such as Hip Hop or

Rap is much worse.

4 Experiments with Multimodal Text-audio Representations

This section describes our lyrics segmentation experiments using multimodal (uni-

modal or bimodal) lyrics lines, containing text or audio information or both as

input. We follow the same structure as in our experiments with text lines, describ-

ing the dataset used (see Section 4.1), defining the similarity measures used to

construct the self-similarity matrices (see Section 4.2), describing the models and

configurations that we have investigated (see Section 4.3), and finally, presenting

and discussing the obtained results (see Section 4.4).


Genre               Lyrics[#]   P      R      F1
Rock                6011        73.8   57.7   64.8
Hip Hop             5493        71.7   43.6   54.2
Pop                 4764        73.1   61.5   66.6
RnB                 4565        71.8   60.3   65.6
Alternative Rock    4325        76.8   60.9   67.9
Country             3780        74.5   66.4   70.2
Hard Rock           2286        76.2   61.4   67.7
Pop Rock            2150        73.3   59.6   65.8
Indie Rock          1568        80.6   55.5   65.6
Heavy Metal         1364        79.1   52.1   63.0
Southern Hip Hop    940         73.6   34.8   47.0
Punk Rock           939         80.7   63.2   70.9
Alternative Metal   872         77.3   61.3   68.5
Pop Punk            739         77.3   68.7   72.7
Soul                603         70.9   57.0   63.0
Gangsta Rap         435         73.6   35.2   47.7

Table 2: Results with text lines. CNNtext{str} model performances across musical genres in the MLDB dataset in terms of Precision (P), Recall (R) and F1 in %. Underlined are the performances on genres with less repetitive text. Genres with highly repetitive structure are in bold.

4.1 Dataset: DALI

The DALI corpus7 (Meseguer-Brocal et al., 2018) contains synchronized lyrics-

audio representations on different levels of granularity: syllables, words, lines and

segments. Depending on the song, the alignment quality between text segments and

audio segments is higher or lower. In the Appendix (see Section A), we explain how

we estimate this segment alignment quality Qual.

Then, in order to test the impact of Qual on the performance of our lyrics seg-

mentation algorithm, we partition the DALI corpus into parts with different Qual.

Initially, DALI consists of 5358 lyrics that are synchronized to their audio track.

Like in previous publications (Watanabe et al., 2016; Fell et al., 2018), we ensure

that all song texts contain at least 5 segments. This constraint reduces the number

of tracks used by us to 4784. We partition the 4784 tracks based on their Qual into

high (Q+), med (Q0), and low (Q−) alignment quality datasets. Table 3 gives an

overview of the resulting dataset partitions. The Q+ dataset consists of 50842

lines and 7985 segment borders and has the following language distribution: 72%

English, 11% German, 4% French, 3% Spanish, 3% Dutch, 7% other languages.

7 https://github.com/gabolsgabs/DALI


Corpus name    Alignment quality   Song count
Q+             high (90-100%)      1048
Q0             med (52-90%)        1868
Q−             low (0-52%)         1868
full dataset   -                   4784

Table 3: The DALI dataset partitioned by alignment quality

4.2 Similarity Measures

In this experiment, we add to the common choice of text-based similarity measures

also audio-based similarities - the crucial ingredient that makes our approach

multimodal. In the following, we define the text-based and audio-based similarities

that we use to compute the SSMs.

Text similarity: For our model, we produce SSMs based on the String similarity

measure as introduced in Section 3.2. The measure is applied to the textual

component li of the multimodal lines ai.

Audio similarities: We have previously defined the process of extracting audio

features, as well as the concrete audio features (see Section 2.3). When extracting

features from audio segments of different lengths, we obtain feature vectors of dif-

ferent lengths. There are several alternatives to measure the similarity between two

audio sequences (e.g. mfcc sequences) of possibly different lengths, among which Dy-

namic Time Warping Td is the most popular one in the Music Information Retrieval

community. Given bimodal lyrics lines au, av, we compare two audio segments su and sv that are featurized by a particular audio feature (mfcc or chroma) using Td:

Td(i, j) = d(su(i), sv(j)) + min{ Td(i−1, j), Td(i−1, j−1), Td(i, j−1) }

Td must be parametrized by an inner distance d to measure the distance between

the frame i of su and the frame j of sv. Depending on the particular audio feature

su and sv are featurized with, we employ a different inner distance as defined below.

Let m be the length of the vector su and n be the length of sv. Then, we compute

the minimal distance between the two audio sequences as Td(m,n) and normalize

this by the length r of the shortest alignment path between su and sv to obtain

values in [0,1] that are comparable to each other. We finally apply λx.(1 − x) to

turn the distance Td into a similarity measure Sd:

Sd(su, sv) = 1 − Td(m, n) · r−1

Given bimodal lyrics lines ai, we now define similarity measures between audio


segments si that are featurized by a particular audio feature presented previously

(mfcc, chr) based on our similarity measure Sd:

• MFCC similarity (simmfcc): Sd between two audio segments featurized

by the mfcc feature. As inner distance we use the cosine distance: d(x, y) = 1 − x · y · (‖x‖ · ‖y‖)−1

• Chroma similarity (simchr): Sd between two audio segments featurized by

the chroma feature. As inner distance we use the cosine distance.
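A minimal numpy sketch of Sd is given below (not the authors' code): the inner distance is taken as the standard cosine distance and the normalizing path length r as max(m, n), the length of the shortest possible alignment path, both of which are our reading of the description above.

import numpy as np

def cosine_dist(x, y):
    # Standard cosine distance between two feature frames.
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-9))

def sim_dtw(su, sv, d=cosine_dist):
    # su, sv: featurized audio segments of shapes (m, k) and (n, k),
    # i.e. frames x feature dimensions (librosa output transposed).
    m, n = len(su), len(sv)
    T = np.full((m + 1, n + 1), np.inf)
    T[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            T[i, j] = d(su[i - 1], sv[j - 1]) + min(T[i - 1, j],
                                                    T[i - 1, j - 1],
                                                    T[i, j - 1])
    r = max(m, n)               # length of the shortest possible alignment path
    return 1.0 - T[m, n] / r    # Sd: normalized distance turned into a similarity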

4.3 Models and Configurations

Here, the song texts are represented via bimodal lyrics lines, incorporating both text

and audio information, and experimentation is performed on the DALI corpus. In

order to test our hypotheses about which text and audio features are most relevant to

detect segment boundaries, and whether the text and audio modalities complement

each other, we compare different types of models: baselines, text-based models,

audio-based models, and finally bimodal models that use both text and audio fea-

tures. We provide the following baselines: the random baseline guesses for each line

independently if it is a segment border (with a probability of 50%) or not. The

line length baselines use as feature only the line length in characters (text-based

model) or milliseconds (audio-based model) or both, respectively. These baselines

are trained using a logistic regression classifier.

Finally, the last baseline models the segmentation task as sequence tagging by

tagging each text line as segment-ending or not ending. This model uses an RNN

and the lyrics line is here modelled as the average word vector of all words in the

line. The RNN uses GRU cells with 50 hidden states and 300-dimensional word

vectors (Pennington et al., 2014).
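A rough tf.keras version of this baseline (our sketch; everything beyond the GRU size and the 300-dimensional averaged word vectors is an assumption) could look like this:

import tensorflow as tf

def build_rnn_tagger(hidden=50, emb_dim=300):
    # One song = a sequence of lines, each represented by its averaged word vector.
    song = tf.keras.Input(shape=(None, emb_dim))
    h = tf.keras.layers.GRU(hidden, return_sequences=True)(song)
    # Tag every line as segment-ending (1) or not (0).
    out = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, activation="sigmoid"))(h)
    model = tf.keras.Model(song, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model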

All other models are CNNs using the architecture described previously and use

as features SSMs made from different textual or audio similarities as described

in Section 4.2. The CNN-based models that use purely textual features (str) are

named CNNtext, while the CNN-based models using purely audio features (mfcc,

chr) are named CNNaudio. Lastly, the CNNmult models are multimodal in the sense

that they use combinations of textual and audio features. We name a CNN model

by its modality (text, audio, mult) as well as by the set of SSMs that it uses as

features. For example, the model CNNmult{str, mfcc} uses as textual feature the

SSM made from string similarity simstr and as audio feature the SSM made from

mfcc similarity simmfcc.

As dataset we use the Q+ part of the DALI dataset (see Section 4.1). We split the

data randomly into training and test sets using the following scheme: considering

that the DALI dataset is relatively small, we average over two different 5-fold cross-

validations. We prefer this sampling strategy for our small dataset over a more

common 10-fold cross-validation as it avoids the test set becoming too small.


4.4 Results and Discussion

The results of our experiments with multimodal lyrics lines on the DALI dataset

are depicted in Table 4. The random baseline and the different line length baselines

reach a performance of 15.7%-33.5% F1. Interestingly, the audio-based line length

(33.5% F1) is more indicative of the lyrics segmentation than the text-based line

length (25.0% F1).8 Finally, the word-based RNN sequence tagger performs better

(41.6% F1) than the simple baselines, but is vastly inferior to the CNN-based mod-

els. Given this finding, we did not try the sequence tagger with additional audio

features.

The model CNNtext{str} performs with 70.8% F1 similarly to the CNNtext{str} model from the first experiment (66.5% F1). The models use the exact same SSMstr

feature and hyperparameters, but another lyrics corpus (DALI instead of MLDB).

We believe that as DALI was assembled from karaoke singing instances, it likely

contains more repetitive song texts that are easier to segment using the employed

method. Note that the DALI dataset is too small to allow a genre-wise comparison

as we did in the previous experiment using the MLDB dataset.

The CNNaudio models perform similarly well to the CNNtext models.

CNNaudio{mfcc} reaches 65.3% F1, while CNNaudio{chr} results in 63.9% F1. The

model CNNaudio{mfcc, chr} performs with 70.4% F1 significantly (p < .001) better

than the models that use only one of the features. As the mfcc feature models tim-

bre and instrumentation, whilst the chroma feature models melody and harmony,

they provide complementary information to the CNNaudio model which increases

its performance.

Most importantly, the CNNmult models combining text- with audio-based features

consistently outperform the CNNtext and CNNaudio models. CNNmult{str, mfcc} and

CNNmult{str, chr} achieve a performance of 73.8% F1 and 74.5% F1, respectively

- this is significantly (p < .001) higher compared to the 70.8% (70.4%) F1 of the

best CNNtext (CNNaudio) model. Finally, the overall best performing model is a

combination of the best CNNtext and CNNaudio models and delivers 75.3% F1.

CNNmult{str, mfcc, chr} is the only model to significantly (p < .05) outperform all

other models in all three evaluation metrics: precision, recall, and F1. Note that

all CNNmult models outperform all CNNtext and CNNaudio models significantly

(p < .001) in recall.

We perform an ablation test on the alignment quality. For this, we train CNN-

based models with those feature sets that performed best on the Q+ part of DALI.

For each modality (text, audio, mult), i.e. CNNtext{str}, CNNaudio{mfcc, chr}, and

CNNmult{str, mfcc, chr}, we train a model for each feature set on each partition

of DALI (Q+, Q0, Q−). We always test our models on the same alignment quality

they were trained on. The alignment quality ablation results are depicted in Ta-

ble 5. We find that independent of the modality (text, audio, mult.), all models

perform significantly (p < .001) better with higher alignment quality. The effect of

8 Note that adding line length features to any CNN-based model does not increase performance.


Model                    Features                     P      R      F1
Random baseline          n/a                          15.7   15.7   15.7
Line length baselines    text length                  16.6   51.8   25.0
                         audio length                 22.7   63.8   33.5
                         text length + audio length   22.6   63.0   33.2
RNN sequence tagging     word vectors                 39.9   43.6   41.6
CNNtext                  {str}                        78.7   64.2   70.8
CNNaudio                 {mfcc}                       79.3   55.9   65.3
                         {chr}                        76.8   54.7   63.9
                         {mfcc, chr}                  79.2   63.8   70.4
CNNmult                  {str, mfcc}                  80.6   69.0   73.8
                         {str, chr}                   82.5   69.0   74.5
                         {str, mfcc, chr}             82.7   70.3   75.3

Table 4: Results with multimodal lyrics lines on the Q+ dataset in terms of Precision (P), Recall (R) and F1 in %. Note that the CNNtext{str} model is the same configuration as in Table 2, but trained on a different dataset.

modality on segmentation performance (F1) is as follows: on all datasets we find

CNNmult{str, mfcc, chr} to significantly (p < .001) outperform both CNNtext{str} and CNNaudio{mfcc, chr}. Further, CNNtext{str} significantly (p < .001) outper-

forms CNNaudio{mfcc, chr} on the Q0 and Q− datasets, whereas this does not hold

on the Q+ dataset (p ≥ .05).

5 Error Analysis

An SSM for a Rap song is depicted in Figure 3. As texts in this genre are less

repetitive, the SSM-based features are less reliable for determining a song’s structure.

Moreover, when returning to the introductory example in Figure 1, we observe

that verses (the Vi) and bridges (the Bi) are not detectable when looking at the

text representation only (see Figure 1, middle). The reason is that these verses

have different lyrics. However, as these parts share the same melody, highlighted

rectangles are visible in the melody structure.

Indeed, we found our bimodal segmentation model to produce significantly

(p < .001) better segmentations (75.3% F1) compared to the purely text-based

(70.8% F1) and audio-based models (70.4% F1). The increase in F1 stems from

both increased precision and recall. The increase in precision is observed as


Dataset   Model      Features            P      R      F1
Q+        CNNtext    {str}               78.7   64.2   70.8
          CNNaudio   {mfcc, chr}         79.2   63.8   70.4
          CNNmult.   {str, mfcc, chr}    82.7   70.3   75.3
Q0        CNNtext    {str}               73.6   54.5   62.8
          CNNaudio   {mfcc, chr}         74.9   48.9   59.5
          CNNmult.   {str, mfcc, chr}    75.8   59.4   66.5
Q−        CNNtext    {str}               67.5   30.9   41.9
          CNNaudio   {mfcc, chr}         66.1   24.7   36.1
          CNNmult.   {str, mfcc, chr}    68.0   35.8   46.7

Table 5: Results with multimodal lyrics lines for the alignment quality ablation test on the datasets Q+, Q0, Q− in terms of Precision (P), Recall (R) and F1 in %.

CNNmult often produces fewer false positive segment borders, i.e. the model deliv-

ers less noisy results. We observe an increase in recall in two ways: first, CNNmult

sometimes detects a combination of the borders detected by CNNtext and CNNaudio.

Secondly, there are cases where CNNmult detects borders that are not recalled in

either of CNNtext or CNNaudio.

Segmentation algorithms that are based on exploiting patterns in an SSM, share

a common limitation: non-repeated segments are hard to detect as they do not show

up in the SSM. Note that such segments are still occasionally detected indirectly

when they are surrounded by repeated segments. Furthermore, a consecutively re-

peated pattern such as C2-C3-C4 in Figure 1 is not easily segmentable as it could

potentially also form one (C2C3C4) or two (C2-C3C4 or C2C3-C4) segments. An-

other problem is that of inconsistent classification inside of a song: sometimes,

patterns in the SSM that look the same to the human eye are classified differently.

Note, however, that on the pixel level there is a difference, as the inference in the

used CNN is deterministic. This is a phenomenon similar to adversarial examples

in image classification (same intension, but different extension).

We now analyze the predictions of our different models for the example song given

in Figure 1. We compare the predictions of the following three different models:

the text-based model CNNtext{str} (visualized in Figure 1 as the left SSM called

“repetitive lyrics structure”), the audio-based model CNNaudio{chr} (visualized in

Figure 1 as the right SSM called “repetitive melody structure”), and the bimodal

model CNNmult{str, mfcc, chr}. Starting with the first chorus, C1, we find it to

be segmented correctly by both CNNtext{str} and CNNaudio{chr}. As previously

discussed, consecutively repeated patterns are hard to segment and our text-based

model indeed fails to correctly segment the repeated chorus (C2-C3-C4). The audio-

based model CNNaudio{chr} overcomes this limitation and segments the repeated


Fig. 3: Example SSM computed from textual similarity simstr. As common for

Rap song texts, there is no chorus (diagonal stripe parallel to main diagonal).

However, there is a highly repetitive musical state from line 18 to 21 indicated by

the corresponding rectangle in the SSM spanning from (18,18) to (21,21). (“Meet

Your Fate” by Southpark Mexican, MLDB-ID: 125521)

chorus correctly. Finally, we find that in this example both the text-based and

the audio-based models fail to segment the verses (the Vi) and bridges (the Bi)

correctly. The CNNmult{str, mfcc, chr} model manages to detect the bridges and

verses in our example.

Note that adding more features to a model does not always increase its ability

to detect segment borders. While in some examples, the CNNmult{str, mfcc, chr} model detects segment borders that were not detected in any of the models

CNNtext{str} or CNNaudio{mfcc, chr}, there are also examples where the bimodal

model does not detect a border that is detected by both the text-based and the

audio-based models.

6 Related Work

Besides the work of (Watanabe et al., 2016) that we have discussed in detail in

Section 2, only a few papers in the literature have focused on the automated detec-

tion of the structure of lyrics. (Mahedero et al., 2005) report experiments on the

use of standard NLP tools for the analysis of music lyrics. Among the tasks they

address, for structure extraction they focus on lyrics having a clearly recognizable

structure (which is not always the case) divided into segments. Such segments are

weighted following the results given by the descriptors used (such as full length text, relative

position of a segment in the song, segment similarity), and then tagged with a label

describing them (e.g. chorus, verses). They test the segmentation algorithm on a


small dataset of 30 lyrics, 6 for each language (English, French, German, Spanish

and Italian), which had previously been manually segmented.

More recently, (Barate et al., 2013) describe a semantics-driven approach to the

automatic segmentation of song lyrics, and mainly focus on pop/rock music. Their

goal is not to label a set of lines in a given way (e.g. verse, chorus), but rather

to identify recurrent as well as non-recurrent groups of lines. They propose a rule-

based method to estimate such structure labels of segmented lyrics, while in our

approach we apply machine learning methods to unsegmented lyrics.

(Cheng et al., 2009) propose a new method for enhancing the accuracy of audio

segmentation. They derive the semantic structure of songs by lyrics processing to

improve the structure labeling of the estimated audio segments. With the goal of

identifying repeated musical parts in music audio signals to estimate music structure

boundaries (lyrics are not considered), (Cohen-Hadria & Peeters, 2017) propose to

feed Convolutional Neural Networks with the square-sub-matrices centered on the

main diagonals of several SSMs, each one representing a different audio descriptor,

building their work on (Foote, 2000).

For a different task than ours, (Mihalcea & Strapparava, 2012) use a corpus of

100 lyrics synchronized to an audio representation with information on musical

key and note progression to detect emotion. Their classification results using both

modalities, textual and audio features, are significantly improved compared to a

single modality.

7 Conclusion

In this article, we have addressed the task of lyrics segmentation on synchronized

text-audio representations of songs. For the songs in the corpus DALI where the

lyrics are aligned to the audio, we have derived a measure of alignment quality

specific to our task of lyrics segmentation. Then, we have shown that exploiting

both textual and audio-based features leads the employed Convolutional Neural

Network-based model to significantly outperform the state-of-the-art system for

lyrics segmentation that relies on purely text-based features. Moreover, we have

shown that the advantage of a bimodal segment representation persists even in

the case where the alignment is noisy. This indicates that a lyrics segmentation

model can be improved in most situations by enriching the segment representation

by another modality (such as audio).

As for future work, the problem of inconsistent classification inside of a song (SSM

patterns look almost identical, but classifications differ) may be tackled by cluster-

ing the SSM patterns in such a way that very similar looking SSM patterns end up

in the same cluster. This can be seen as a preprocessing denoising step of the SSMs

where details that are irrelevant to our task are deleted, without losing relevant

information. Furthermore, the problem that the bimodal model sometimes fails to

detect a segment border, even if the submodels correctly detected that border, may

be tackled by implementing a late fusion approach (Snoek et al., 2005) where the

prediction of the bimodal model is conditioned on the predictions of both the text-

based and the audio-based submodels. As an alternative to our CNN-based approach,


other neural architectures such as RNNs and Transformers (Vaswani et al., 2017)

can be applied to the lyrics segmentation problem. While our initial experiments

in framing the lyrics segmentation task as sequence tagging (see Section 4.3) did

not yield results competitive to our CNNs, we believe that experimentation with

more recent sentence embeddings, such as those derived from pretrained language

models (Devlin et al., 2018), can be beneficial. Finally, we would like to experiment

with further modalities, for instance with subtitled music videos where text, audio,

and video are all synchronized to each other.

A Measuring the Segment Alignment Quality in DALI

The DALI corpus (Meseguer-Brocal et al., 2018) contains synchronized lyrics-audio representations at different levels of granularity: syllables, words, lines, and segments. It was created by joining two datasets: (1) a corpus for karaoke singing (AMX), which contains alignments between lyrics and audio on the syllable level, and (2) a subset of WASABI lyrics that belong to the same songs as the lyrics in AMX. Note that corresponding lyrics in WASABI can differ from those in AMX to some extent. Also, AMX contains no segment annotations. DALI provides estimated segments for the AMX lyrics, projected from the ground truth segments in WASABI. For example, Figure 4 shows on the left side the lyrics lines as given in AMX, together with the estimated lyrics segments; the right side shows the lyrics lines given in WASABI as well as the ground truth lyrics segments. Note how the lyrics in WASABI have one more segment, as segment W3 has no counterpart in AMX.

Based on the requirements of our task, we derive a measure to assess how well the estimated AMX segments align with the ground truth WASABI segments. Since we will use the WASABI segments as ground truth labels for supervised learning, we need to make sure that the AMX lines (and hence the audio information) actually belong to the aligned segment. As aligned audio features are available only for the AMX lyrics segments, and we want to consistently use audio features in our segment representations, we require that every AMX segment has a counterpart WASABI segment (see Figure 4, A0 ∼ W0, A1 ∼ W1, A2 ∼ W2, A3 ∼ W4). On the other hand, we allow WASABI segments to have no corresponding AMX segment (see Figure 4, W3). We further impose no constraints on the order of appearance of segments in AMX segmentations vs. WASABI segmentations, to allow for possible rearrangements in the order of corresponding segments. With these considerations, we formulate a measure of alignment quality that is tailored to our task of bimodal lyrics segmentation. Let A and W be segmentations, where A = A0 A1 ... An with AMX segments Ai, and W = W0 W1 ... Wm with WASABI lyrics segments Wi. The alignment quality between the segmentations A and W is then composed from the similarities of the best-matching segments. Using the string similarity sim_str as defined in Section 3.2, we define the alignment quality Qual as follows:


Fig. 4: Lyrics lines and estimated lyrics segments in AMX (left); lyrics lines and ground truth lyrics segments in WASABI (right), for the song “Don't Break My Heart” by Den Harrow.

$$\mathrm{Qual}(A, W) = \mathrm{Qual}(A_0 A_1 \ldots A_n,\; W_0 W_1 \ldots W_m) = \min_{0 \le i \le n} \Big\{ \max_{0 \le j \le m} \big\{ \mathrm{sim}_{\mathrm{str}}(A_i, W_j) \big\} \Big\}$$
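The following minimal Python sketch illustrates how Qual can be computed. It assumes each segment is represented as a plain string (e.g., the concatenated lyric lines of that segment), and it uses difflib's SequenceMatcher ratio purely as a stand-in for the string similarity sim_str defined in Section 3.2, not as the measure used in the paper.

```python
from difflib import SequenceMatcher

def sim_str(a: str, b: str) -> float:
    # Stand-in for the string similarity sim_str of Section 3.2;
    # here a simple normalized edit similarity from the standard library.
    return SequenceMatcher(None, a, b).ratio()

def alignment_quality(amx_segments, wasabi_segments):
    # Qual(A, W): every AMX segment must have a well-matching WASABI
    # counterpart, so we take the worst (min) over AMX segments of their
    # best (max) match among the WASABI segments.
    return min(
        max(sim_str(a, w) for w in wasabi_segments)
        for a in amx_segments
    )
```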

B Lexico-syntactical similarity

Formally, given two text lines x and y, let bigrams(x) be the set of word bigrams in line x. Following (Fell, 2014), the lexical similarity sim_lex between lines x and y is then defined as:

$$\mathrm{sim}_{\mathrm{lex}}(x, y) = \frac{|\mathrm{bigrams}(x) \cap \mathrm{bigrams}(y)|}{\max\{|\mathrm{bigrams}(x)|,\, |\mathrm{bigrams}(y)|\}}$$
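As a minimal sketch of sim_lex (whitespace tokenization and lowercasing are assumptions made here for brevity; the worked example at the end of this appendix additionally assumes stemming):

```python
def bigrams(line: str) -> set:
    # Set of word bigrams of a lyric line; tokens are lowercased.
    tokens = line.lower().split()
    return {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}

def sim_lex(x: str, y: str) -> float:
    # Bigram overlap, normalized by the larger of the two bigram sets.
    bx, by = bigrams(x), bigrams(y)
    if not bx or not by:
        return 0.0
    return len(bx & by) / max(len(bx), len(by))
```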

To define the syntactical similarity sim_syn, we apply a POS tagger to those word bigrams that do not overlap. Formally, the non-overlapping bigrams are x̄ = bigrams(x) \ (bigrams(x) ∩ bigrams(y)) and ȳ = bigrams(y) \ (bigrams(x) ∩ bigrams(y)). We then apply element-wise a function postag to the non-overlapping


bigrams in x̄ and ȳ to obtain POS-tagged bigrams. The syntactical similarity sim_syn is thus given by:

$$\mathrm{sim}_{\mathrm{syn}}(x, y) = \left( \frac{|\mathrm{postag}(\bar{x}) \cap \mathrm{postag}(\bar{y})|}{\max\{|\mathrm{postag}(\bar{x})|,\, |\mathrm{postag}(\bar{y})|\}} \right)^{2}$$

Note that the whole term is squared to heuristically account for the simple fact that there are usually many more distinct words than POS tags; syntactical similarities are therefore inherently larger than lexical ones, since the overlap is normalized by the smaller number of POS tags under consideration.
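Continuing the sketch above (and reusing its bigrams helper), one possible implementation of sim_syn takes a user-supplied postag_fn that maps a token sequence to a sequence of POS tags; the concrete tagger is deliberately left open here, as it is not specified by the formula itself:

```python
def sim_syn(x: str, y: str, postag_fn) -> float:
    # POS-tag only the bigrams that the two lines do NOT share, then
    # compute the squared overlap of the resulting POS-tag bigrams.
    bx, by = bigrams(x), bigrams(y)
    shared = bx & by
    px = {tuple(postag_fn(list(b))) for b in bx - shared}
    py = {tuple(postag_fn(list(b))) for b in by - shared}
    if not px or not py:
        return 0.0
    return (len(px & py) / max(len(px), len(py))) ** 2
```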

We define sim_lsyn as a weighted sum of sim_lex and sim_syn:

$$\mathrm{sim}_{\mathrm{lsyn}}(x, y) = \alpha_{\mathrm{lex}} \cdot \mathrm{sim}_{\mathrm{lex}}(x, y) + (1 - \alpha_{\mathrm{lex}}) \cdot \mathrm{sim}_{\mathrm{syn}}(x, y)$$

We heuristically set α_lex = sim_lex(x, y). The idea behind this weighting is that when x and y have similar wordings, they likely have a high overall similarity, so the wording should weigh more the more similar it is. If, on the other hand, the wording is rather dissimilar, the structural similarity should matter more when determining a lexico-syntactical similarity between two lines. Hence, sim_lsyn can be written as:

$$\mathrm{sim}_{\mathrm{lsyn}}(x, y) = \mathrm{sim}_{\mathrm{lex}}^{2}(x, y) + \big(1 - \mathrm{sim}_{\mathrm{lex}}(x, y)\big) \cdot \mathrm{sim}_{\mathrm{syn}}(x, y)$$
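Combining the two sketches above, sim_lsyn reduces to a single expression that mirrors the formula; it inherits all assumptions of the earlier sketches:

```python
def sim_lsyn(x: str, y: str, postag_fn) -> float:
    # Lexico-syntactical similarity with the data-dependent weight
    # alpha_lex = sim_lex(x, y).
    lex = sim_lex(x, y)
    return lex ** 2 + (1.0 - lex) * sim_syn(x, y, postag_fn)
```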

We close with an example to illustrate the computation of sim_lsyn:9

x = “The man sleeps deeply.” ⇒ bigrams(x) = {the man, man sleep, sleep deep}
y = “A man slept.” ⇒ bigrams(y) = {a man, man sleep}
⇒ sim_lex(x, y) = 1/3

x̄ = {the man, sleep deep}, ȳ = {a man}
postag(x̄) = {DET NOUN, VERB ADVERB}, postag(ȳ) = {DET NOUN}
⇒ sim_syn(x, y) = (1/2)² = 1/4

⇒ sim_lsyn(x, y) = (1/3)² + (1 − 1/3) · 1/4 = 1/9 + 1/6 ≈ 0.28

9 We assume stemming in this example.
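As a quick check, the sketches above reproduce the value of ≈ 0.28 when given pre-stemmed input and a toy POS lookup; both the stemmed strings and the tag table below are illustrative assumptions, not the tagger used in the paper:

```python
# Toy POS lookup standing in for a real POS tagger, just for this example.
TOY_TAGS = {"the": "DET", "a": "DET", "man": "NOUN", "sleep": "VERB", "deep": "ADVERB"}
toy_tagger = lambda tokens: [TOY_TAGS[t] for t in tokens]

# Inputs are pre-stemmed, as assumed in the example above.
print(round(sim_lsyn("the man sleep deep", "a man sleep", toy_tagger), 2))  # -> 0.28
```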


References

Barate, A., Ludovico, L. A., & Santucci, E. 2013 (Dec). A Semantics-Driven Approach to Lyrics Segmentation. Pages 73–79 of: 2013 8th International Workshop on Semantic and Social Media Adaptation and Personalization.

Brackett, D. 1995. Interpreting Popular Music. Cambridge University Press.

Cheng, H. T., Yang, Y. H., Lin, Y. C., & Chen, H. H. 2009 (May). Multimodal structure segmentation and analysis of music using audio and textual information. Pages 1677–1680 of: 2009 IEEE International Symposium on Circuits and Systems.

Cohen-Hadria, Alice, & Peeters, Geoffroy. 2017 (June). Music Structure Boundaries Estimation Using Multiple Self-Similarity Matrices as Input Depth of Convolutional Neural Networks. In: AES International Conference Semantic Audio 2017.

Davis, Steven B., & Mermelstein, Paul. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Acoustics, Speech and Signal Processing, IEEE Transactions on, 357–366.

Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, & Toutanova, Kristina. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Fell, Michael. 2014. Lyrics classification. M.Phil. thesis, Saarland University, Germany.

Fell, Michael, Nechaev, Yaroslav, Cabrio, Elena, & Gandon, Fabien. 2018. Lyrics Segmentation: Textual Macrostructure Detection using Convolutions. Pages 2044–2054 of: Proceedings of the 27th International Conference on Computational Linguistics.

Foote, Jonathan. 2000. Automatic audio segmentation using a measure of audio novelty. Pages 452–455 of: Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on, vol. 1. IEEE.

Fujishima, Takuya. 1999. Realtime Chord Recognition of Musical Sound: a System Using Common Lisp Music. In: ICMC. Michigan Publishing.

Goodfellow, Ian, Bengio, Yoshua, & Courville, Aaron. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Pages 707–710 of: Soviet physics doklady, vol. 10.

Mahedero, José P. G., Martínez, Álvaro, Cano, Pedro, Koppenberger, Markus, & Gouyon, Fabien. 2005. Natural Language Processing of Lyrics. Pages 475–478 of: Proceedings of the 13th Annual ACM International Conference on Multimedia. MULTIMEDIA ’05. New York, NY, USA: ACM.

Mayer, Rudolf, & Rauber, Andreas. 2011. Musical genre classification by ensembles of audio and lyrics features. Pages 675–680 of: Proceedings of the 12th International Conference on Music Information Retrieval.

McFee, Brian, Raffel, Colin, Liang, Dawen, Ellis, Daniel PW, McVicar, Matt, Battenberg, Eric, & Nieto, Oriol. 2015. librosa: Audio and music signal analysis in python. Pages 18–25 of: Proceedings of the 14th python in science conference, vol. 8.

Meseguer-Brocal, Gabriel, Peeters, Geoffroy, Pellerin, Guillaume, Buffa, Michel, Cabrio, Elena, Faron Zucker, Catherine, Giboin, Alain, Mirbel, Isabelle, Hennequin, Romain, Moussallam, Manuel, Piccoli, Francesco, & Fillon, Thomas. 2017 (Aug.). WASABI: a Two Million Song Database Project with Audio and Cultural Metadata plus WebAudio enhanced Client Applications. In: Web Audio Conference 2017 – Collaborative Audio #WAC2017. Queen Mary University of London, London, United Kingdom.

Meseguer-Brocal, Gabriel, Cohen-Hadria, Alice, & Peeters, Geoffroy. 2018. DALI: a large Dataset of synchronized Audio, Lyrics and notes, automatically created using teacher-student machine learning paradigm. In: ISMIR, Paris, France.

Mihalcea, Rada, & Strapparava, Carlo. 2012. Lyrics, music, and emotions. Pages 590–599 of: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics.


Ojala, Markus, & Garriga, Gemma C. 2010. Permutation Tests for Studying Classifier Performance. J. Mach. Learn. Res., 11(Aug.), 1833–1863.

Pennington, Jeffrey, Socher, Richard, & Manning, Christopher. 2014. GloVe: Global vectors for word representation. Pages 295–313 of: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP).

Philips, Lawrence. 2000. The Double Metaphone Search Algorithm. C/C++ Users Journal, 18(06), 38–43.

Snoek, Cees G. M., Worring, Marcel, & Smeulders, Arnold W. M. 2005. Early Versus Late Fusion in Semantic Video Analysis. Pages 399–402 of: Proceedings of the 13th Annual ACM International Conference on Multimedia. MULTIMEDIA ’05. New York, NY, USA: ACM.

Tagg, Philip. 1982. Analysing popular music: theory, method and practice. Popular Music, 2, 37–67.

Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., Kaiser, Łukasz, & Polosukhin, Illia. 2017. Attention is All you Need. Pages 5998–6008 of: Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., & Garnett, R. (eds), Advances in Neural Information Processing Systems 30. Curran Associates, Inc.

Watanabe, Kento, Matsubayashi, Yuichiroh, Orita, Naho, Okazaki, Naoaki, Inui, Kentaro, Fukayama, Satoru, Nakano, Tomoyasu, Smith, Jordan, & Goto, Masataka. 2016. Modeling Discourse Segments in Lyrics Using Repeated Patterns. Pages 1959–1969 of: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers.