
Self Organizing Maps for the Visual Analysis of Pitch Contours

Dominik Sacha, Yuki Asano, Christian Rohrdantz, Felix Hamborg, Daniel Keim, Bettina Braun, Miriam Butt

Data Analysis and Visualization Group & Department of Linguistics, University of Konstanz

[email protected]

Abstract

We present a novel interactive approach for the visual analysis of intonation contours. Audio data are processed algorithmically and presented to researchers through interactive visualizations. To this end, we automatically analyze the data using machine learning in order to find groups or patterns. These results are visualized with respect to meta-data. We present a flexible, interactive system for the analysis of prosodic data. Using real-world application examples, one containing preprocessed, the other raw data, we demonstrate that our system enables researchers to interact dynamically with the data at several levels and by means of different types of visualizations, thus arriving at a better understanding of the data via a cycle of hypothesis generation and testing that takes full advantage of our visual processing abilities.

1 Introduction and Related Work

Traditionally, linguistic research on F0 contours has been conducted by manually annotating the data using an agreed-upon set of pitch accents and boundary tones such as the ToBI system (Beckman et al., 2005). However, the manual categorization of F0 contours is open to subjectiveness in decision making. To overcome this disadvantage, recent research has focused on functional data analysis of F0 contour data (Gubian et al., 2013). The F0 contours are smoothed and normalized, resulting in comparable pitch vectors for different utterances of the same structure. However, with this method, the original underlying data is abstracted away from and cannot be easily accessed (or visualized) for individual analysis.

One of the typical tasks in prosodic research is to determine specific F0 contours that signal certain functions. State-of-the-art analysis is time-intensive and not ideal, because statistics or projections are applied to the data, leading to a possible loss of important aspects of the original data. To overcome these problems, we offer a visual analytics system that allows for the use of preprocessed F0 pitch vectors in data analysis as well as the ability to work with the original, individual data points. Moreover, the linguistic researcher is interactively involved in the visual analytics process by guiding the machine learning and by interacting with the visualization according to the visual analytics mantra "Analyze first, Show the Important, Zoom, filter and analyze further, Details on demand" (Keim et al., 2008).

Our system consists of three components: the Data Input component, where all input files are read and converted into the internal data model; the Machine Learning component, where we make use of Self Organizing Maps (SOM) in order to find clusters of similar pitch contours; and the Interactive Visualization component, which realizes the visualization based on the SOM result. The researcher can interpret the data directly via this visualization, but may also interact with the system in order to steer the underlying model. The overall work flow is illustrated in Figure 1. This combination of human knowledge and reasoning with automated computational processing is the key idea of visual analytics (Thomas and Cook, 2006) and supports human knowledge generation processes (Sacha et al., 2014). Our contribution builds on existing previous work on SOM-based visual analysis (Vesanto, 1999; Moehrmann et al., 2011), but also on previous attempts to visually investigate data from the domain of prosodic research (Ward and McCartney, 2010; Ward, 2014). Furthermore, we profit from approaches to analyzing speech using the SOM algorithm (Mayer et al., 2009; Silva et al., 2011; Tadeusiewicz et al., 1999), but open up a new domain within this field as we allow for a visualization of pitch contours directly on a SOM-grid. We furthermore do not just produce one SOM, but also compute and visually present several dependent/derivative SOMs.

Figure 1: Work flow in four steps. A-Data Input, B-Configuration, C-Training, D-Visualization.

2 System

The system pipeline consists of three main components: 1) Data Input; 2) Machine Learning; 3) Interactive Visualizations.

2.1 Data Input

Our system is able to process and visualize any kind of data that satisfies the following restrictions. The data set needs to consist of a list of data items, where each item contains a set of key-value pairs, also called data attributes. The value of a data attribute must be a primitive, i.e., either a number, a text string, or an array consisting of primitives. Except for primitive arrays we do not allow nested data; thus we flatten the input data if necessary. Overall, data items should be comparable and contain attributes with equal keys (and different values).

The system also expects comparable feature vectors to which a distance measure can be applied. Furthermore, additional (meta) data can be part of the input. In the use cases presented here, each F0 data item is connected with speaker information such as the native language of the speaker, the level of second language (L2) proficiency, and the context the data was produced in.
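To make this data model concrete, the sketch below shows one way such a data item could be represented. It is purely illustrative: the class, field names, and example values are our own and do not reflect the system's actual internal implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# A data attribute value must be a primitive or an array of primitives.
Primitive = Union[int, float, str, List[float]]

@dataclass
class DataItem:
    """One utterance: a feature vector (the Input Vector) plus flat meta data."""
    attributes: Dict[str, Primitive] = field(default_factory=dict)

# Hypothetical item in the spirit of Experiment 1 (all values invented).
item = DataItem(attributes={
    "f0_contour": [182.3, 185.1, 190.4, 171.0, 150.2],  # raw F0 samples in Hz
    "word": "sumimasen",
    "speaker_group": "JN",           # Japanese native vs. German learner
    "l2_proficiency": "advanced",
    "repetition": 2,
})
```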

Vector Preprocessing: After the data has been loaded, our system allows for the inspection of the data prior to the actual analysis. Figure 1-A shows the inspection view that is typically used first in the work flow. As part of the configuration work flow, the user selects an attribute as the Input Vector (Figure 1-B). This forms the basis of the machine learning component.

Before entering the machine learning training phase, our system performs a validation of the Input Vector and allows for its adjustment if necessary. Whereas normalized and smoothed data, i.e., data items with vectors of equal length, can be processed directly, our system also offers the functionality to perform basic preprocessing of raw Input Vectors. If not all vectors have equal length, we offer several preprocessing techniques from which one can be chosen: besides the simple approaches of adding mean values (mean-padding) or 0s (zero-padding), we also offer an approach that makes use of linear interpolation (pairwise). If time and landmark information is available, it is also possible to divide the vectors into parts and adjust each of the parts separately. As a result, all the parts have equal length and are therefore better suited for comparison. The Input Vector values can be normalized using semitone normalization. The mean value can also be subtracted from each contour in order to minimize gender effects.
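As a rough illustration of these preprocessing options, the sketch below shows mean-/zero-padding, pairwise linear interpolation to a common length, semitone conversion, and mean subtraction. The function names, the reference frequency, and the target length are our own choices; the system's actual implementation may differ.

```python
import numpy as np

def pad_to_length(v, n, mode="mean"):
    """Mean-padding or zero-padding of a short vector up to length n."""
    v = np.asarray(v, dtype=float)
    if len(v) >= n:
        return v[:n]
    fill = np.mean(v) if mode == "mean" else 0.0
    return np.concatenate([v, np.full(n - len(v), fill)])

def resample_pairwise(v, n):
    """Pairwise linear interpolation onto n evenly spaced points."""
    old_x = np.linspace(0.0, 1.0, num=len(v))
    new_x = np.linspace(0.0, 1.0, num=n)
    return np.interp(new_x, old_x, v)

def to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 values from Hz to semitones relative to a reference frequency."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

def normalize_contour(f0_hz, n=50):
    """Equal length, semitone scale, mean removed (to reduce gender effects)."""
    st = to_semitones(resample_pairwise(np.asarray(f0_hz, dtype=float), n))
    return st - st.mean()
```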

In sum, we offer a very flexible preprocessing functionality for the Input Vectors. The available techniques can be combined flexibly and dynamically according to what is most suitable for the analysis task at hand. However, there are still methods that could be added. For example, one could additionally enhance the vector processing by a stronger leveraging of the time information in order to prepare the data for duration-focused analysis tasks.

2.2 Machine Learning

We make use of Machine Learning (ML) for the detection of groups/clusters that are present in the data based on the Input Vectors. Additionally, the system detects correlations to the meta data. In our use cases this included information about the native language of the speakers and the level of their language proficiency.

Figure 2: SOM-Training illustrated by 4 steps. For each cluster the prototype and the distances between adjacent cells are visualized by black lines in between. In step 4 the training has finished and the dedicated F0 contours are also drawn in each cell.

In principle, any distance function, projection, or clustering method could be applied in our extensible framework. The central problem that needs to be resolved is that the high-dimensional data from the Input Vectors needs to be reduced to a two-dimensional visualization that can be rendered on a computer screen or a piece of paper. We experimented with several different methods and found that SOMs, also known as Kohonen Maps (Kohonen, 2001), match the demands of this task best. SOMs are a well-established ML technique that can be used for clustering or as a classifier based on feature vectors. SOMs are very suitable for our purpose for several reasons. First, we can use SOMs as an unsupervised ML technique to find a fixed number of clusters after a training phase. SOMs also provide a topology where similar clusters are adjacent. Finally, the algorithm adapts to the given input data depending on the desired number of clusters and the amount of data.

Furthermore, in our system, the clustering and dimensionality reduction are integrated in one step. This stands in contrast to other clustering and dimensionality reduction techniques like Multi-Dimensional Scaling (MDS), Principal Component Analysis (PCA) or Non-negative Matrix Factorization (NMF). A disadvantage of these other methods is that they tend to lead to clutter in the two-dimensional space (when there is a high degree of overlap in the data). It is also unclear when to perform the clustering: in the high-dimensional space before projection or in the two-dimensional space afterwards.

Our system proceeds as follows. First, the SOM-grid is initialized with random cluster centroids, which are feature vector prototypes, one per cluster. Afterwards each feature vector is used to train the SOM in a random order. For each vector the SOM algorithm determines the best matching unit (BMU) and adjusts the prototypes of the BMU and the adjacent clusters based on the input vector. This process is repeated n times until the SOM is in a stable state (Figure 2, steps A-C). After the training phase the resulting grid can be used for clustering: each vector is assigned to the cluster whose prototype (its BMU) has the smallest distance to it. In Step D of Figure 2 each cell represents a cluster containing the cluster prototype (black vector) and the cluster members (colored vectors).
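The following sketch illustrates this training procedure (random prototype initialization, best matching unit search, neighborhood update, repeated over the data). It is a generic SOM implementation with decay schedules and parameter values of our own choosing, not the system's actual code.

```python
import numpy as np

def train_som(vectors, rows=4, cols=4, epochs=20, lr=0.5, sigma=1.5, seed=0):
    """Train a rows x cols SOM on `vectors`, a 2D array of shape (n_items, dim)."""
    rng = np.random.default_rng(seed)
    dim = vectors.shape[1]
    # Random initial prototypes, one per grid cell.
    protos = rng.normal(size=(rows, cols, dim))
    # Grid coordinates of each cell, used for the neighborhood function.
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    ).astype(float)

    for epoch in range(epochs):
        # Learning rate and neighborhood radius decay over the training run.
        t = epoch / max(epochs - 1, 1)
        lr_t, sigma_t = lr * (1 - t) + 0.01, sigma * (1 - t) + 0.5
        for v in rng.permutation(vectors):
            # Best matching unit: cell whose prototype is closest to v.
            dists = np.linalg.norm(protos - v, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Pull the BMU and its grid neighbors towards v.
            grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
            h = np.exp(-grid_d2 / (2 * sigma_t ** 2))[..., None]
            protos += lr_t * h * (v - protos)
    return protos

def assign_clusters(vectors, protos):
    """Assign each vector to the cell index of its best matching unit."""
    flat = protos.reshape(-1, protos.shape[-1])
    return np.argmin(
        np.linalg.norm(flat[None, :, :] - vectors[:, None, :], axis=2), axis=1
    )
```

Applied to the preprocessed contours, such an assignment step yields the per-cell membership on which grid visualizations of the kind described in Section 2.3 can be built.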

Note that we did not rely on existing software libraries like the SOM-toolbox, but instead implemented the algorithm from scratch. The reason for this is that we aim at being able to visualize and steer the algorithm at every step (see Section 2.4).

2.3 Visualizations

We build on Schreck's work on SOM-based visual analysis (Schreck, 2010). Within the basic SOM-grid, we provide several different ways of visualizing the information of interest to the researcher. As shown in Figure 3-A, we provide an overview visualization which shows the SOM-grid filled by the clustered pitch contours. The individual cells also show the cluster centroid and the vectors (contours) that belong to that cell in relation to the centroid (Figure 3-F). We also visualize the training history of a cluster in the background of each cell (Figure 3-A) in order to keep track of the training phase.

Beyond the clustered contours, we furthermore provide optional visualizations (these can be selected or not), which add simple highlighters or bar charts to the SOM result (Figure 3-C). We also experimented with heatmaps,¹ which turned out to be good for visualizing the distribution of data attributes among the SOM-grid (Figure 3-G). The color intensity of a node depends on the number of data items it contains; the more data items, the stronger the intensity.

¹In our approach, a color overlay for the SOM grid.

Figure 3: Different approaches to visualize SOM results according to available meta data. (A) Grid visualization, (B) word cloud, (C) bar charts, (D) mixed color cells, (E) ranked group clusters, (F) one single cell that visualizes contained vectors and the cluster prototype, (G) separated heatmaps for all values of a categorical attribute.

We offer several normalization options. One approach takes the global maximum (of all groups/grids), whereas the other one takes a local maximum for each single group/grid. Different kinds of normalizations can also be chosen in order to handle outliers or small variabilities in the data. Depending on the underlying data, an adequate normalization technique is needed to obtain visible patterns in the data.
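A minimal sketch of the two normalization options just described, assuming heatmap intensity is simply the cell count divided by either the global or the per-group maximum (the system may use a different scaling):

```python
import numpy as np

def heatmap_intensities(counts_per_group, mode="global"):
    """counts_per_group: dict mapping group name -> 2D array of per-cell counts.

    Returns per-group intensity grids in [0, 1]. "global" divides by the largest
    cell count over all groups; "local" divides each grid by its own maximum."""
    global_max = max(counts.max() for counts in counts_per_group.values())
    intensities = {}
    for group, counts in counts_per_group.items():
        denom = global_max if mode == "global" else counts.max()
        intensities[group] = (
            counts / denom if denom > 0 else np.zeros_like(counts, dtype=float)
        )
    return intensities
```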

A drawback of the heatmaps is that it is not easy to detect whether cells are homogeneous or heterogeneous, that is, whether a cell contains only vectors of a specific group (in our use cases, just Japanese natives or just German learners) or is a mixed cell. For that reason we also offer another visualization. For each cell we derive the color depending on the number of group members: we assign a color (e.g., red vs. blue) to each group and mix them accordingly. As a result, homogeneous (red vs. blue) and heterogeneous (purple) clusters are easy to detect (see Figure 3-D, where GL stands for "German learner" and JN for "Japanese native").
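The group coloring can be sketched as a count-weighted blend of per-group base colors; the linear RGB mixing below is an assumption on our part, but it reproduces the described effect of one-group cells staying red or blue and mixed cells drifting towards purple.

```python
def cell_color(group_counts, base_colors):
    """group_counts: dict group -> count in this cell.
    base_colors: dict group -> (r, g, b), e.g. {"JN": (255, 0, 0), "GL": (0, 0, 255)}.
    Returns the count-weighted RGB mix for the cell."""
    total = sum(group_counts.values())
    if total == 0:
        return (255, 255, 255)  # empty cell stays white
    mixed = [0.0, 0.0, 0.0]
    for group, count in group_counts.items():
        weight = count / total
        for i in range(3):
            mixed[i] += weight * base_colors[group][i]
    return tuple(int(round(c)) for c in mixed)

# A cell with 3 Japanese natives and 3 German learners comes out purple-ish:
# cell_color({"JN": 3, "GL": 3}, {"JN": (255, 0, 0), "GL": (0, 0, 255)}) == (128, 0, 128)
```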

Each of these visualizations offers different perspectives on the data, and the user is able to interact dynamically with each of the different visualization possibilities.

2.4 Interaction

The system offers various possibilities for interaction: 1) Configuration/Encoding Interactions; 2) SOM Interactions; 3) Selection Interactions.

Configuration/Encoding Interactions: The algorithm and the visualization techniques offer many possibilities for individualized configuration, e.g., the grid dimensions of the SOM or the normalization techniques that are applied by the visualization techniques. Furthermore, the cell layout can be toggled interactively from the SOM-grid to a grouped alignment. An advantage of the grouped alignment is that the typical feature clusters for each group can be determined by their position. In combination with our coloring approach, the analysts are thus able to locate the top group clusters and detect if they are homogeneous or heterogeneous (Figure 3-E). Users may also define and change visual mappings like the colors that are assigned to the attribute values.
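Conceptually, these settings can be collected in a small configuration object that is handed to the training and rendering steps; the fields below are illustrative and do not mirror the system's actual settings API.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class SomViewConfig:
    """Hypothetical bundle of user-adjustable parameters."""
    grid_rows: int = 4
    grid_cols: int = 4
    layout: str = "grid"                    # "grid" or "grouped" cell alignment
    heatmap_normalization: str = "global"   # "global" or "local" maximum
    group_colors: Dict[str, Tuple[int, int, int]] = field(
        default_factory=lambda: {"JN": (255, 0, 0), "GL": (0, 0, 255)}
    )
```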

SOM Interactions: We incorporate the idea that the analyst should be able to steer the training phase of the algorithm as well (Schreck et al., 2009). The analyst is able to enter into an iterative process that refines the analysis in each step. In each step the SOM result can be manipulated and serves as an input for the next iteration. For one, it is possible to delete cells directly on the grid. Another interactive possibility is to move cells to a desired position and to "pin" them to this position. That means that for the next SOM training this cell is fixed. We make use of these interactions to steer the SOM algorithm to deliver visually similar outputs. For example, if we fix a cell near the upper right corner, in the next round of training this cell and the cells similar to it will be in the same corner (e.g., in Figure 4-E the two gray cells are fixed). Finally, it is possible to break off the current training process and to restart, or to investigate the current state in more detail if the analyst already perceives a pattern or discovers problems.
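Pinning can be sketched as masking pinned cells out of the neighborhood update so that their prototypes stay fixed across retraining; this is our own simplified reading of the mechanism, shown as a single update step compatible with the earlier training sketch.

```python
import numpy as np

def masked_update(protos, v, bmu, coords, lr_t, sigma_t, pinned):
    """One SOM update step in which pinned cells keep their prototypes.

    pinned: boolean array of shape (rows, cols); True cells are frozen so that
    the next training round reproduces a visually similar layout."""
    grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
    h = np.exp(-grid_d2 / (2 * sigma_t ** 2))
    h = np.where(pinned, 0.0, h)              # frozen cells receive no update
    protos += lr_t * h[..., None] * (v - protos)
    return protos
```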

Selection Interactions: These interactions help to filter and select the data during the analysis process. The data that are contained in the current SOM visualization serve as input for the next iteration of the analysis work flow. Besides removing data elements directly on the SOM grid, data can be selected to be removed directly in the attribute table (Figure 4-D). This feature allows the analyst to drill down into selected data subspaces. Details on Demand operations also enable the user to inspect subsets of clusters. Furthermore, single cells can be selected and investigated in a separate linked detail view.
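Selection can be thought of as filtering the data items on their meta data before the next training iteration. The helper below is hypothetical and assumes the DataItem structure sketched in Section 2.1.

```python
def filter_items(items, **conditions):
    """Keep only data items whose attributes match all given key=value pairs.

    Example: filter_items(items, word="sumimasen", speaker_group="JN") returns
    the subset that would feed the next SOM training iteration."""
    return [
        item for item in items
        if all(item.attributes.get(key) == value for key, value in conditions.items())
    ]
```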

By enabling these interactions we present the analyst with flexible possibilities for an iterative analysis process. The system first provides an overview of the data; the analyst is then able to interact with the data in iterations of hypothesis formation and testing. The hypothesis testing can be done with respect to the entire data set, or with respect to a selected subset. In order to keep track of the various visualizations and interactions conducted by the analyst, we offer a visualization history that displays the developed SOM grids next to one another (e.g., Figure 5). Clicking on one of these grids will automatically bring the selected SOM to the front of the screen.

3 Use Cases

We demonstrate the added value that our approach brings to prosodic research with respect to two linguistic experiments that were originally conducted independently of this work. We take a "pair analytics" approach for an evaluation of the potential of our system (Arias-Hernandez et al., 2011). In this approach, an expert for visual analytics collaborates with a domain expert: the domain expert places their focus on tasks, hypotheses and ideas, while the analysis expert operates the system.

Figure 4: Interaction techniques that enable an iterative data exploration. Configuration Interactions can be used to define parameters like the grid dimensions or visual mappings (e.g., selecting the attribute colors). SOM Interactions include the direct manipulation of the SOM visualization (move, delete, or pin cells; begin or stop SOM training). Selection Interactions enable the analyst to dismiss data in each step in order to drill down into interesting data subspaces.

We are well aware that the standards for evaluation in natural language processing are quantitative in nature. There is an inherent conflict between quantitative evaluation and the rationale for using a visual analytics system in the first place. Visual analytics has the overall aim of allowing interactive, exploratory access to an underlying complex data set. It is very difficult to quantify data exploration and cycles of hypothesis testing in the absence of a benchmark or gold standard. This is a known problem within visual analytics (Keim et al., 2010; Sacha et al., 2014), but one which cannot be addressed within the scope of this paper. The two use cases presented here should be seen as an initial test of the added value of our system. An application to other scenarios and other use cases is planned as future work.

The use cases discussed below consist of experiments that were concerned with whether linguistic structures of a native language (henceforth L1) influence second language (henceforth L2) learning. The experiments involved Japanese native speakers vs. German learners of Japanese. The latter group had varying degrees of L2 competence. The data set consists of F0 contours and meta data about the speakers.

3.1 Experiment 1

The first experiment investigated how native speakers of an intonation language (German) produce attitudinal differences in an L2 that has lexically specified pitch movement (Japanese).


Methods

15 Japanese native speakers and 15 German native speakers, who were proficient in the respective languages, participated in the experiment. They produced the German word Entschuldigung and the Japanese word sumimasen, which both mean 'excuse me'. The Japanese word contains a lexically specified pitch fall associated with the penultimate mora in the word, /se/. Materials were presented with descriptions of short scenes. The task was to produce the target word three times in order to attract an imaginary waiter's attention in a crowded and noisy bar.

Our hypotheses were that Japanese native speakers would not change the F0 contours across the three attempts, because the Japanese falling pitch accent is lexically fixed. German learners would change them, because German F0 can be changed in order to convey attitude or emotion.

Segmental boundary annotation was carried out on the recorded raw data using Praat (Boersma and Weenink, 2011) as the first step. In Experiment 1, segmental boundaries were put between the smallest Japanese segmental units, morae, which resulted in |su|mi|ma|se|n| (the straight lines signal the segmental boundaries). Then, F0 contours were computed from the annotated data using the F0 tracking algorithm in the Praat toolkit with the default range of 70-350 Hz for males and 100-500 Hz for females. Following the procedures of Functional Data Analysis (Ramsay and Silverman, 2009), we first smoothed the sampled F0 contours into a continuous curve represented by a mathematical function of time f(t), adopting B-splines (de Boor, 2001). Values of F0 were expressed in semitones (st) and the mean value was subtracted from each contour, in order to minimize gender effects. After smoothing the curves, we automatically carried out landmark registration in order to align corresponding segmental boundaries in time (Gubian et al., 2013; Ramsay et al., 2009). After these steps, the smoothed F0 data all had the same duration.
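The original smoothing was done with the Functional Data Analysis toolchain; the sketch below only approximates that step with SciPy's B-spline fitting, and the smoothing factor and number of samples are illustrative. Semitone conversion and mean subtraction then proceed as in the preprocessing sketch in Section 2.1; landmark registration is not shown.

```python
import numpy as np
from scipy.interpolate import splrep, splev

def smooth_f0(times, f0_hz, n_samples=100, smoothing=1.0):
    """Fit a cubic B-spline f(t) to sampled F0 values and resample it densely.

    times, f0_hz: equal-length arrays from the pitch tracker (voiced frames only).
    Returns the dense time axis and the smoothed contour evaluated on it."""
    tck = splrep(times, f0_hz, s=smoothing)   # B-spline representation of f(t)
    dense_t = np.linspace(times[0], times[-1], n_samples)
    return dense_t, splev(dense_t, tck)
```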

Analysis

The analysis process for Experiment 1 is shown in Figure 5. The first SOM offers an overview of the whole dataset. The word cloud visualization additionally shows the utterances that occur in the cells (sumimasen, Entschuldigung). In a next step the data set was filtered to show only the data for sumimasen (Figure 5-A) and a second SOM with only this data was trained. In the 2nd SOM in Figure 5 the cells are colored according to the number of speaker groups in each cell. Our analyst was able to discover different pitch contours per group (blue German cells on the left-hand side and red Japanese cells on the right-hand side).

Figure 5: Experiment 1 work flow history: An overview is shown first. In the following steps data is filtered and the analysis is refined stepwise into an interesting subspace. First, only the utterances sumimasen 'excuse me' are selected (A). These are then further subdivided according to speaker group (B/C): Japanese Native (JN) vs. German (DE).

Figure 6: Experiment 1: Heatmaps for the repetition attribute for each speaker group. German learner contours clearly show more variation than native speaker contours.

In order to get more details we decided to train an additional SOM for each speaker group. We simply added the relevant filters and began a new SOM training for each group (Figure 5-B/C). As a result, the two visualizations now clearly show that the F0 contours produced by the groups look different. For further analysis, we also opened a heatmap visualization for another attribute for each group, based on the SOM-grids B and C. In Figure 6 the repetitions (1st, 2nd, or 3rd) are shown for each group. One can clearly discover that the Japanese native speakers' (top) F0 contours rarely vary in comparison with those of the German speakers (bottom).


3.2 Experiment 2

In Experiment 1 we were able to determine, just on the basis of unannotated F0 data, that German learners did not produce typical Japanese F0 contours, namely a flat F0 followed by a drastic pitch fall. The second experiment examined whether German learners can produce this typical Japanese F0 phonetic form in an imitation experiment. The experiment was originally conducted independently of Experiment 1.

Methods

24 Japanese native speakers and 48 German learners were asked to imitate Japanese disyllabic non-words consisting of three morae (/CV:CV/) with a long vowel. All stimuli were recorded either with a flat pitch (high-high, HH) or with a falling pitch (high-low, HL) that occurs after the long vowel. Japanese native speakers are expected to imitate the stimuli correctly by realizing the typical phonetic form of a Japanese pitch accent, namely a drastic pitch fall preceded by a flat F0. In contrast, as per the results of Experiment 1, German learners are expected not to produce this phonetic form.

In analogy to Experiment 1, segmental annotation was carried out. Segmental boundaries were put between consonants and vowels, which resulted in |c|v|(c)c|(v)v|. Then, F0 contours were computed as in Experiment 1. The data contained the raw Hertz values of F0; additional information included data about segments, speaker information, and time and landmark information for the produced pitch contour. In total, 2393 data records were put into the SOM system.

Analysis

The analysis workflow for Experiment 2 is shown in Figure 7. The first SOM offers an overview of the whole dataset. This overview clearly shows two clusters for flat and falling F0 contours ("HH" in blue and "HL" in red). In the lowermost right corner, there is a red cell in the blue cluster. This type of pattern could be indicative of an error or noise in the data set.

Note that the SOM system did not know which experimental conditions the data contained. Without any information about the experimental variables, the SOM detected differences across conditions. Furthermore, no other current analysis techniques enable an overview of F0 data in this manner. Since we were interested in the phonetic realization of the Japanese pitch accent, we further analyzed only the data of the falling F0 condition.

As a consequence, a second SOM containing only the "HL" contours was trained (Figure 7-A). The next step was to remove the noise from the data (Figure 7, 2nd SOM). In the second SOM we discovered one cell that contains non-falling F0 contours (lower left corner). We deleted this cell and fixed/pinned the other corner cells in order to steer the SOM algorithm to produce a similar SOM in the next training phase (Figure 7-B). In the next SOM the cells are colored according to the number of speaker groups in each cell (blue: German, red: Japanese). The three cells in the lower left corner were the most frequent F0 contours produced only by German learners of Japanese. To analyze this further, we also changed the grid-based layout to the ranked group layout to show the three most frequent F0 contours in each language group (Figure 7-C). As a result, the last SOM visualization now clearly shows that the F0 contours produced by the groups look different: Japanese native speakers produced typical Japanese F0 contours consisting of a flat F0 before a drastic F0 fall (Gussenhoven, 2004). The third cell from the top in each of the language groups shows the same F0 form, suggesting that some German native speakers produced F0 contours that were very similar to those of Japanese native speakers. Note, however, that the most frequent contours produced by German learners clearly differed from the Japanese contours. Finally, one of the most important contributions of the SOM system was that it delivered the findings without the necessity of first manually annotating a large amount of data, saving personnel costs.

Figure 7: Experiment 2 work flow history: An overview is shown first. Then only "HL" F0 contours (red-colored cells) are selected (A). The researcher interacted directly with the second SOM: the noise cluster was removed (bottom left corner) and the other corner cells were fixed in order to steer the SOM algorithm to produce a visually similar SOM for the next training (B). The resulting SOM reveals blue clusters on the left-hand side. Changing the layout to the top clusters per group (C) allows for a better comparison.

4 Conclusion

We provide an interactive system for the analysis of prosodic feature vectors. To complement other state-of-the-art techniques, we make use of machine learning in combination with interactive visualizations. We implemented an iterative process using chains of SOM trainings for a step-by-step refinement of the analysis. We show with real experiment data that the system supports linguistic research. Importantly, the analysis allows for a clustering of F0 contours that works without time-intensive and possibly subjective manual intonational analysis. The clustered contours can be subjected to intensive phonological analysis and furthermore allow the potential detection of more fine-grained phonetic differences across conditions. The analyses hence provide an important first step on which the linguist can then focus for subsequent analysis. For example, it is very easy to filter the data (e.g., examine only a subset of the data) or to adjust the grid size. More importantly, the approach is advantageous for an analysis of L2 data, since the learners' language has a dynamic character (Selinker, 1972) and it is difficult to determine intonational categories beforehand. Our SOM approach is generalizable to all kinds of data for which feature vectors can be derived, including other linguistic features such as intensity, amplitude, or duration.

We learned that the visualization of F0 contours provides the most intuitive access for an understanding of the underlying data. One reason is that the F0 contour can be visually inspected and directly related to meta data (e.g., through colors). Even without time-intensive manual annotation of F0 contours, we could clearly see the differences between L1 and L2 performance despite the different characteristics of the two experimental data sets. We visualized and animated the SOM training phase and presented this to the researcher as well. This may seem unnecessary, but experience has shown that it helps users who are not experienced with ML to better understand the processes.

We applied our technique to two different datasets. A comparison of the achieved results shows that our approach works very well "out of the box" with preprocessed data and also with data that received less preprocessing effort. To overcome the problem of handling less preprocessed data, we added simple methods that turned out to be sufficient in order to reveal new insights. The system also helped us to handle unexpected outliers or noise in the data: all the F0 contours that do not match the major clusters of the SOM algorithm are assigned to a few single cells, and the data in these cells could easily be removed.

We plan to make the system available to other researchers in the future and are considering several expansions as well. For one, other machine learning and visualization techniques could be added for additional or further tasks. We could also try to support the user more in detecting interesting subspaces in the data. It is possible, for instance, to visualize an overview of attribute heatmaps that enables the analyst to detect patterns in each iteration.

In sum, this paper has presented an innovative and promising new approach for the automatic analysis of prosodic data. Its key components are that prosodic data are translated into vectors that can be processed and analyzed further by SOM techniques and presented to the user in an interactive visual analytics system.

Acknowledgments

We gratefully acknowledge the funding for this work, which comes from the Research Initiative LingVisAnn and the research project "Visual Analytics of Text Data in Business Applications" of the University of Konstanz.


References

Richard Arias-Hernandez, Linda T. Kaastra, Tera Marie Green, and Brian Fisher. 2011. Pair analytics: Capturing reasoning processes in collaborative visual analytics. In System Sciences (HICSS), 2011 44th Hawaii International Conference on, pages 1–10. IEEE.

Mary E. Beckman, Julia Hirschberg, and Stefanie Shattuck-Hufnagel. 2005. The original ToBI system and the evolution of the ToBI framework. In S.-A. Jun, editor, Prosodic Typology: The Phonology of Intonation and Phrasing. Oxford University Press.

Paul Boersma and David Weenink. 2011. Praat: Doing phonetics by computer [computer program], version 5.2.20.

Carl de Boor. 2001. A Practical Guide to Splines. Springer, New York.

Michele Gubian, Yuki Asano, Salomi Asaridou, and Francesco Cangemi. 2013. Rapid and smooth pitch contour manipulation. In Proceedings of the 14th Annual Conference of the International Speech Communication Association, Lyon, France, pages 31–35.

Carlos Gussenhoven. 2004. The Phonology of Tone and Intonation. Research Surveys in Linguistics. Cambridge University Press, Cambridge.

Daniel Keim, Gennady Andrienko, Jean-Daniel Fekete, Carsten Görg, Jörn Kohlhammer, and Guy Melançon. 2008. Visual analytics: Definition, process, and challenges. Springer.

Daniel A. Keim, Jörn Kohlhammer, Geoffrey P. Ellis, and Florian Mansmann. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. Eurographics Association.

Teuvo Kohonen. 2001. Self-Organizing Maps, volume 30. Springer.

Rudolf Mayer, Jakob Frank, and Andreas Rauber. 2009. Analytic comparison of audio feature sets using self-organising maps. In Proceedings of the Workshop on Exploring Musical Information Spaces, in conjunction with ECDL, pages 62–67.

Julia Moehrmann, Andre Burkovski, Evgeny Baranovskiy, Geoffrey-Alexeij Heinze, Andrej Rapoport, and Gunther Heidemann. 2011. A discussion on visual interactive data exploration using self-organizing maps. In Advances in Self-Organizing Maps: 8th International Workshop, WSOM 2011, Espoo, Finland, June 13-15, 2011, Proceedings, pages 178–187.

James Ramsay and Bernard W. Silverman. 2009. Functional Data Analysis. Springer.

James Ramsay, Giles Hooker, and Spencer Graves. 2009. Functional Data Analysis with R and MATLAB. Springer.

Dominik Sacha, Andreas Stoffel, Florian Stoffel, Bum Chul Kwon, Geoffrey P. Ellis, and Daniel A. Keim. 2014. Knowledge generation model for visual analytics. IEEE Transactions on Visualization and Computer Graphics, 20(12):1604–1613.

Tobias Schreck, Jürgen Bernard, Tatiana Tekušová, and Jörn Kohlhammer. 2009. Visual cluster analysis of trajectory data with interactive Kohonen maps. Information Visualization, 8:14–29. Palgrave Macmillan.

Tobias Schreck. 2010. Visual-Interactive Analysis with Self-Organizing Maps: Advances and Research Challenges, pages 83–96. InTech.

Larry Selinker. 1972. Interlanguage. IRAL - International Review of Applied Linguistics in Language Teaching, 10(1-4):209–232.

Ana Cristina C. Silva, Ana Cristina P. Macedo, and Guilherme A. Barreto. 2011. A SOM-based analysis of early prosodic acquisition of English by Brazilian learners: preliminary results. In Advances in Self-Organizing Maps, pages 267–276. Springer.

Ryszard Tadeusiewicz, Wieslaw Wszolek, Antoni Izworski, and Tadeusz Wszolek. 1999. The methods of pathological speech visualization [using Kohonen neural networks]. In Engineering in Medicine and Biology, 1999: 21st Annual Conference and the 1999 Annual Fall Meeting of the Biomedical Engineering Society, Proceedings of the First Joint BMES/EMBS Conference, volume 2, pages 980–, October.

James J. Thomas and Kristin A. Cook. 2006. A visual analytics agenda. IEEE Computer Graphics and Applications, 26(1):10–13.

Juha Vesanto. 1999. SOM-based data visualization methods. Intelligent Data Analysis, 3(2):111–126.

Nigel G. Ward and Joshua L. McCartney. 2010. Visualization to support the discovery of prosodic contours related to turn-taking.

Nigel G. Ward. 2014. Automatic discovery of simply-composable prosodic elements. In Speech Prosody, volume 2014, pages 915–919.
