Probabilistic Typology: Deep Generative Models of Vowel Inventories

Ryan Cotterell and Jason Eisner
Department of Computer Science
Johns Hopkins University
{ryan.cotterell,eisner}@jhu.edu

Abstract

Linguistic typology studies the range of structures present in human language. The main goal of the field is to discover which sets of possible phenomena are universal, and which are merely frequent. For example, all languages have vowels, while most—but not all—languages have an [u] sound. In this paper we present the first probabilistic treatment of a basic question in phonological typology: What makes a natural vowel inventory? We introduce a series of deep stochastic point processes, and contrast them with previous computational, simulation-based approaches. We provide a comprehensive suite of experiments on over 200 distinct languages.

1 Introduction

Human languages exhibit a wide range of phenomena, within some limits. However, some structures seem to occur or co-occur more frequently than others. Linguistic typology attempts to describe the range of natural variation and seeks to organize and quantify linguistic universals, such as patterns of co-occurrence. Perhaps one of the simplest typological questions comes from phonology: which vowels tend to occur and co-occur within the phoneme inventories of different languages? Drawing inspiration from the linguistic literature, we propose models of the probability distribution from which the attested vowel inventories have been drawn.

It is a typological universal that every language contains both vowels and consonants (Velupillai, 2012). But which vowels a language contains is guided by softer constraints, in that certain configurations are more widely attested than others. For instance, in a typical phoneme inventory, there tend to be far fewer vowels than consonants. Likewise, all languages contrast vowels based on height, although which contrast is made is language-dependent (Ladefoged and Maddieson, 1996).

Figure 1: The transformed vowel space that is constructed within one of our deep generative models (see §7.1). A deep network nonlinearly maps the blue grid ("formant space") to the red grid ("metric space"), with individual vowels mapped from blue to red position as shown. Vowel pairs such as [ə]–[ɔ] that are brought close together are anti-correlated in the point process. Other pairs such as [y]–[ɨ] are driven apart. For purposes of the visualization, we have transformed the red coordinate system to place red vowels near their blue positions—while preserving distances up to a constant factor (a "Procrustes transformation").

Moreover, while over 600 unique vowel phonemes have been attested cross-linguistically (Moran et al., 2014), certain regions of acoustic space are used much more often than others, e.g., the regions conventionally transcribed as [a], [i], and [u]. Human language also seems to prefer inventories where phonologically distinct vowels are spread out in acoustic space ("dispersion") so that they can be easily distinguished by a listener. We depict the acoustic space for English in Figure 2.

In this work, we regard the proper goal of linguistic typology as the construction of a universal prior distribution from which linguistic systems are drawn. For vowel system typology, we propose three formal probability models based on stochastic point processes. We estimate the parameters of the model on one set of languages and evaluate performance on a held-out set. We explore three questions: (i) How well do the properties of our proposed probability models line up experimentally with linguistic theory? (ii) How well can our models predict held-out vowel systems? (iii) Do our models benefit from a "deep" transformation from formant space to metric space?

Figure 2: The standard vowel table in IPA for the RP accent of English (vowels shown: iː, ɪ, e, æ, ə, ʌ, ɑː, ɒ, ɔː, ʊ, uː). The x-axis indicates the front–back spectrum and the y-axis indicates the high–low distinction.

2 Vowel Inventories and their Typology

Vowel inventories are a simple entry point into the study of linguistic typology. Every spoken language chooses a discrete set of vowels, and the number of vowel phonemes ranges from 3 to 46, with a mean of 8.7 (Gordon, 2016). Nevertheless, the empirical distribution over vowel inventories is remarkably peaked. The majority of languages have 5–7 vowels, and there are only a handful of distinct 4-vowel systems attested despite many possibilities. Reigning linguistic theory (Becker-Kristal, 2010) has proposed that vowel inventories are shaped by the principles discussed below.

2.1 Acoustic Phonetics

One way to describe the sound of a vowel is through its acoustic energy at different frequencies. A spectrogram (Figure 3) is a visualization of the energy at various frequencies over time. Consider the "peak" frequencies F0 < F1 < F2 < ... that have a greater energy than their neighboring frequencies. F0 is called the fundamental frequency or pitch. The other qualities of the vowel are largely determined by F1, F2, ..., which are known as formants (Ladefoged and Johnson, 2014). In many languages, the first two formants F1 and F2 contain enough information to identify a vowel: Figure 3 shows how these differ across three English vowels. We consider each vowel listed in the International Phonetic Alphabet (IPA) to be cross-linguistically characterized by some (F1, F2) pair.

2.2 Dispersion

The dispersion criterion (Liljencrants and Lindblom, 1972; Lindblom, 1986) states that the phonemes of a language must be "spread out" so that they are easily discriminated by a listener.

Figure 3: Example spectrogram of the three English vowels [i], [u], and [ɑ]. The x-axis is time and the y-axis is frequency (0–5000 Hz). The first two formants F1 and F2 are marked with colored arrows for each vowel. We used the Praat toolkit to generate the spectrogram and find the formants (Boersma et al., 2002).

A language seeks phonemes that are sufficiently "distant" from one another to avoid confusion. Distances between phonemes are defined in some latent "metric space." We use this term rather than "perceptual space" because the confusability of two vowels may reflect not just their perceptual similarity, but also their common distortions by imprecise articulation or background noise.¹

2.3 Focalization

The dispersion criterion alone does not seem to capture the whole story. Certain vowels are simply more popular cross-linguistically. A commonly accepted explanation is the quantal theory of speech (Stevens, 1972, 1989). The quantal theory states that certain sounds are easier to articulate and to perceive than others. These vowels may be characterized as those where F1 and F2 have frequencies that are close to one another. On the production side, these vowels are easier to pronounce since they allow for greater articulatory imprecision. On the perception side, they are more salient since the two spectral peaks aggregate and, to a certain degree, act as one larger peak. In general, languages will prefer these vowels.

2.4 Dispersion-Focalization Theory

The dispersion-focalization theory (DFT) combines both of the above notions. A good vowel system now consists of vowels that contrast with each other and are individually desirable (Schwartz et al., 1997). This paper provides the first probabilistic treatment of DFT, and new evaluation metrics for future probabilistic and non-probabilistic treatments of vowel inventory typology.

¹ We assume in this paper that the metric space is universal—although it would not be unreasonable to suppose that each language's vowel system has adapted to avoid confusion in the specific communicative environment of its speakers.

3 Point Process Models

Given a base set 𝒱, a point process is a distribution over its subsets.² In this paper, we take 𝒱 to be the set of all IPA symbols corresponding to vowels. Thus a draw from a point process is a vowel inventory V ⊆ 𝒱, and the point process itself is a distribution over such inventories. We will consider three basic point process models for vowel systems: the Bernoulli Point Process, the Markov Point Process, and the Determinantal Point Process. In this section, we review the relevant theory of point processes, highlighting aspects related to §2.

3.1 Bernoulli Point Processes

Taking 𝒱 = {v_1, ..., v_N}, a Bernoulli point process (BPP) makes an independent decision about whether to include each vowel in the subset. The probability of a vowel system V ⊆ 𝒱 is thus

    p(V) ∝ ∏_{v_i ∈ V} φ(v_i),    (1)

where φ is a unary potential function, i.e., φ(v_i) ≥ 0. Qualitatively, this means that φ(v_i) should be large if the ith vowel is good in the sense of §2.3. Marginal inference in a BPP is computationally trivial. The probability that the inventory V contains v_i is φ(v_i)/(1 + φ(v_i)), independent of the other vowels in V. Since a BPP predicts each vowel independently, it only models focalization. Thus, the model provides an appropriate baseline that will let us measure the importance of the dispersion principle—how far can we get with just focalization? A BPP may still tend to generate well-dispersed sets if it defines φ to be large only on certain vowels in 𝒱 and these are well-dispersed (e.g., [i], [u], [a]). More precisely, it can define φ so that φ(v_i)φ(v_j) is small whenever v_i, v_j are similar.³ But it cannot actively encourage dispersion:

² A point process is a specific kind of stochastic process, which is the technical term for a distribution over functions. Under this view, drawing some subset of 𝒱 from the point process is regarded as drawing some indicator function on 𝒱.

³ We point out that such a scheme would break down if we extended our work to cover fine-grained phonetic modeling of the vowel inventory. In that setting, we ask not just whether the inventory includes /i/ but exactly which pronunciation of /i/ it contains. In the limit, φ becomes a function over a continuous vowel space 𝒱 = R², turning the BPP into an inhomogeneous spatial Poisson process. A continuous φ function implies that the model places similar probability on similar vowels. Then if most vowel inventories contain some version of /i/, many of them will contain several closely related variants of /i/ (independently chosen). By contrast, the other methods in this paper do extend nicely to fine-grained phonetic modeling.

including v_i does not lower the probability of also including v_j.
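
As a concrete illustration, the following minimal Python sketch (the vowel set and potential values are made up, not the learned ones) computes the BPP marginals and the exact probability of a particular inventory under equation (1):

    import numpy as np

    # Toy BPP over five vowels; phi values are illustrative, not learned.
    vowels = ["i", "u", "a", "o", "e"]
    phi = np.array([9.0, 6.0, 8.0, 2.0, 2.5])   # hypothetical unary potentials

    # Marginal probability that the inventory contains each vowel: phi / (1 + phi).
    p_include = phi / (1.0 + phi)

    def inventory_prob(inventory):
        """Exact p(V): product of phi over included vowels, normalized over all subsets."""
        in_V = np.array([v in inventory for v in vowels])
        return np.prod(np.where(in_V, phi, 1.0)) / np.prod(1.0 + phi)

    # Sampling is trivial: include each vowel independently.
    rng = np.random.default_rng(0)
    sample = {v for v, p in zip(vowels, p_include) if rng.random() < p}
    print(p_include, inventory_prob({"i", "u", "a"}), sample)

The normalizing constant is simply the product of (1 + φ(v_i)) over all vowels, which is why marginal inference and sampling are both trivial for the BPP.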

3.2 Markov Point Processes

A Markov Point Process (MPP) (Van Lieshout, 2000)—also known as a Boltzmann machine (Ackley et al., 1985; Hinton and Sejnowski, 1986)—generalizes the BPP by adding pairwise interactions between vowels. The probability of a vowel system V ⊆ 𝒱 is now

    p(V) ∝ ∏_{v_i ∈ V} φ(v_i) ∏_{v_i, v_j ∈ V} ψ(v_i, v_j),    (2)

where each φ(v_i) ≥ 0 is, again, a unary potential that scores the quality of the ith vowel, and each ψ(v_i, v_j) ≥ 0 is a binary potential that scores the combination of the ith and jth vowels. Roughly speaking, the potential ψ(v_i, v_j) should be large if the ith and jth vowel often co-occur. Recall that under the principle of dispersion, the vowels that often co-occur are easily distinguishable. Thus, confusable vowel pairs should tend to have potential ψ(v_i, v_j) < 1.

Unlike the BPP, the MPP can capture both focalization and dispersion. In this work, we will consider a fully connected MPP, i.e., there is a potential function for each pair of vowels in 𝒱. MPPs closely resemble Ising models (Ising, 1925), but with the difference that Ising models are typically lattice-structured, rather than fully connected.

Inference in MPPs. Inference in fully connected MPPs, just as in general Markov Random Fields (MRFs), is intractable (Cooper, 1990) and we must rely on approximation. In this work, we estimate any needed properties of the MPP distribution by (approximately) drawing vowel inventories from it via Gibbs sampling (Geman and Geman, 1984; Robert and Casella, 2005). Gibbs sampling simulates a discrete-time Markov chain whose stationary distribution is the desired MPP distribution. At each time step, for some random v_i ∈ 𝒱, it stochastically decides whether to replace the current inventory V with V̄, where V̄ is a copy of V with v_i added (if v_i ∉ V) or removed (if v_i ∈ V). The probability of replacement is p(V̄) / (p(V) + p(V̄)).
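
A minimal sketch of this sampler is given below; the potentials are random placeholders rather than learned values, and unnorm_prob implements equation (2) up to the normalizing constant:

    import numpy as np

    # Toy MPP over five vowels with placeholder potentials.
    vowels = ["i", "u", "a", "o", "e"]
    N = len(vowels)
    rng = np.random.default_rng(0)
    phi = rng.uniform(0.5, 5.0, size=N)                   # unary potentials
    psi = np.exp(-rng.uniform(0.1, 2.0, size=(N, N)))     # pairwise potentials in (0, 1)
    psi = (psi + psi.T) / 2                               # make symmetric

    def unnorm_prob(mask):
        """Unnormalized p(V) of equation (2); mask is a boolean inclusion vector."""
        score = np.prod(np.where(mask, phi, 1.0))
        for i in range(N):
            for j in range(i + 1, N):
                if mask[i] and mask[j]:
                    score *= psi[i, j]
        return score

    def gibbs_sample(n_steps=10000):
        mask = rng.random(N) < 0.5                        # random initial inventory
        for _ in range(n_steps):
            i = rng.integers(N)
            flipped = mask.copy()
            flipped[i] = not flipped[i]                   # toggle vowel i in or out
            p_cur, p_new = unnorm_prob(mask), unnorm_prob(flipped)
            if rng.random() < p_new / (p_cur + p_new):    # replacement probability
                mask = flipped
        return {v for v, m in zip(vowels, mask) if m}

    print(gibbs_sample())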

3.3 Determinantal Point Processes

A determinantal point process (DPP) (Macchi, 1975) provides an elegant alternative to an MPP, and one that is directly suited to modeling both focalization and dispersion.

Inference requires only a few matrix computations and runs tractably in O(|𝒱|³) time, even though the model may encode a rich set of multi-way interactions. We focus on the L-ensemble parameterization of the DPP, due to Borodin and Rains (2005).⁴ This type of DPP defines the probability of an inventory V ⊆ 𝒱 as

    p(V) ∝ det L_V,    (3)

where L ∈ R^{N×N} (for N = |𝒱|) is a symmetric positive semidefinite matrix, and L_V refers to the submatrix of L with only those rows and columns corresponding to those elements in the subset V.

Although MAP inference remains NP-hard in DPPs (just as in MPPs), marginal inference becomes tractable. We may compute the normalizing constant in closed form as follows:

    ∑_{V ∈ 2^𝒱} det L_V = det(L + I).    (4)

How does a DPP ensure focalization and dispersion? L is positive semidefinite iff it can be written as EᵀE for some matrix E ∈ R^{N×N}. It is possible to express p(V) in terms of the column vectors of E, which we call e_1, ..., e_N:

• For inventories of size 2, p({v_i, v_j}) ∝ (φ(v_i) φ(v_j) sin θ)², where φ(v_i), φ(v_j) represent the quality of vowels v_i, v_j (as in the BPP) while sin θ ∈ [0, 1] represents their dissimilarity. More precisely, φ(v_i), φ(v_j) are the lengths of the vectors e_i, e_j, while θ is the angle between them. Thus, we should choose the columns of E so that focal vowels get long vectors and similar vowels get vectors of similar direction.

• Generalizing beyond inventories of size 2, p(V) is proportional to the square of the volume of the parallelepiped whose sides are given by {e_i : v_i ∈ V}. This volume can be regarded as ∏_{v_i ∈ V} φ(v_i) times a term that ranges from 1 for an orthogonal set of vowels to 0 for a linearly dependent set of vowels.

• The events v_i ∈ V and v_j ∈ V are anti-correlated (when not independent). That is, while both vowels may individually have high probabilities (focalization), having either one in the inventory lowers the probability of the other (dispersion).

⁴ Most DPPs are L-ensembles (Kulesza and Taskar, 2012).
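
The quantities above are easy to check numerically. The sketch below uses a random embedding matrix standing in for the learned one, computes exact DPP probabilities via equations (3)–(4), and verifies that they sum to 1 over all subsets:

    import numpy as np
    from itertools import chain, combinations

    # Toy L-ensemble DPP over five vowels; E is random, not a learned embedding.
    vowels = ["i", "u", "a", "o", "e"]
    N = len(vowels)
    rng = np.random.default_rng(0)
    E = rng.normal(size=(N, N))          # columns e_1, ..., e_N
    L = E.T @ E                          # symmetric positive semidefinite kernel

    def dpp_prob(inventory):
        """Exact p(V) = det(L_V) / det(L + I)."""
        idx = [vowels.index(v) for v in inventory]
        numer = np.linalg.det(L[np.ix_(idx, idx)]) if idx else 1.0
        return numer / np.linalg.det(L + np.eye(N))      # closed-form normalizer (4)

    subsets = chain.from_iterable(combinations(vowels, k) for k in range(N + 1))
    print(sum(dpp_prob(s) for s in subsets))             # ~1.0: probabilities normalize
    print(dpp_prob({"i", "u", "a"}))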

4 Dataset

At this point it is helpful to introduce the empirical dataset we will model. For each of 223 languages,⁵ Becker-Kristal (2010) provides the vowel inventory as a set of IPA symbols, listing the first 5 formants for each vowel (or fewer when not available in the original source). Some corpus statistics are shown in Figs. 4 and 5.⁶ For the present paper, we take 𝒱 to be the set of all 53 IPA symbols that appear in the corpus. We treat these IPA labels as meaningful, in that we consider two vowels in different languages to be the same vowel in 𝒱 if (for example) they are both annotated as [ɔ]. We characterize that vowel by its average formant vector across all languages in the corpus that contain the vowel: e.g., (F1, F2, ...) = (500, 700, ...) for [ɔ]. In future work, we plan to relax this idealization (see footnote 3), allowing us to investigate natural questions such as whether [u] is pronounced higher (smaller F1) in languages that also contain [o] (to achieve better dispersion).

5 Model Parameterization

The BPP, MPP, and DPP models (§3) require us to specify parameters for each vowel in 𝒱. In §5.1, we will accomplish this by deriving the parameters for each vowel v_i from a possibly high-dimensional embedding of that vowel, e(v_i) ∈ R^r.

In §5.2, e(v_i) ∈ R^r will in turn be defined as some learned function of f(v_i) ∈ R^k, where f : 𝒱 → R^k is the function that maps a vowel to a k-vector of its measurable acoustic properties. This approach allows us to determine reasonable parameters even for rare vowels, based on their measurable properties.

⁵ Becker-Kristal lists some languages multiple times with different measurements. When a language had multiple listings, we selected one randomly for our experiments.

⁶ Caveat: The corpus is a curation of information from various phonetics papers into a common electronic format. No standard procedure was followed across all languages: it was up to individual phoneticists to determine the size of each vowel inventory, the choice of IPA symbols to describe it, and the procedure for measuring the formants. Moreover, it is an idealization to provide a single vector of formants for each vowel type in the language. In real speech, different tokens of the same vowel are pronounced differently, because of coarticulation with the vowel context, allophony, interspeaker variation, and stochastic intraspeaker variation. Even within a token, the formants change during the duration of the vowel. Thus, one might do better to represent a vowel's pronunciation not by a formant vector, but by a conditional probability distribution over its formant trajectories given its context, or by a parameter vector that characterizes such a conditional distribution. This setting would require richer data than we present here.

It will even enable us in future to generalize to vowels that were unseen in the training set, letting us scale to very large or infinite 𝒱 (footnote 3).

5.1 Deep Point Processes

We consider deep versions of all three processes.

Deep Bernoulli Point Process. We define

    φ(v_i) = ||e(v_i)|| ≥ 0.    (5)

Deep Markov Point Process. The MPP employs the same unary potential as the BPP, as well as the binary potential

    ψ(v_i, v_j) = exp(−1 / (T · ||e(v_i) − e(v_j)||²)) < 1,    (6)

where the learned temperature T > 0 controls the relative strength of the unary and binary potentials.

This formula is inspired by Coulomb's law for describing the repulsion of static electrically charged particles. Just as the repulsive force between two particles approaches ∞ as they approach each other, the probability of finding two vowels in the same inventory approaches exp(−∞) = 0 as they approach each other. The formula is also reminiscent of Shepard (1987)'s "universal law of generalization," which says here that the probability of responding to v_i as if it were v_j should fall off exponentially with their distance in some "psychological space" (here, embedding space).

Deep Determinantal Point Process. For the DPP, we simply define the vector e_i to be e(v_i), and proceed as before.

Summary. In the deep BPP, the probability of a set of vowels is proportional to the product of the lengths of their embedding vectors. The deep MPP modifies this by multiplying in pairwise repulsion terms in (0, 1) that increase as the vectors' endpoints move apart in Euclidean space (or as T → ∞). The deep DPP instead modifies it by multiplying in a single setwise repulsion term in (0, 1) that increases as the embedding vectors become more mutually orthogonal. In the limit, then, the MPP and DPP both approach the BPP.
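
A minimal sketch of how the three deep processes share one embedding is given below; the embedding table and temperature are placeholders for what the embedding function of §5.2 would produce:

    import numpy as np

    # Placeholder embeddings e(v); in the model these come from the network of §5.2.
    embed = {"i": np.array([2.0, 0.1]),
             "u": np.array([-1.8, 0.3]),
             "a": np.array([0.1, -2.1]),
             "ə": np.array([0.2, 0.1])}
    T = 1.0   # hypothetical temperature

    def phi(v):
        """Deep unary potential, equation (5): the length of the embedding vector."""
        return np.linalg.norm(embed[v])

    def psi(vi, vj):
        """Deep MPP binary potential, equation (6): repulsion vanishing as vowels collide."""
        d2 = np.sum((embed[vi] - embed[vj]) ** 2)
        return np.exp(-1.0 / (T * d2))

    # Deep DPP: kernel entries are dot products of the same embedding vectors.
    def L_entry(vi, vj):
        return float(embed[vi] @ embed[vj])

    print(phi("i"), psi("i", "u"), psi("i", "ə"), L_entry("i", "u"))

Here the closer pair [i]–[ə] receives the smaller ψ, i.e., the stronger penalty against co-occurring, while the well-separated pair [i]–[u] is barely penalized.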

5.2 Embeddings

Throughout this work, we simply have f extract the first k = 2 formants, since our dataset does not provide higher formants for all languages.⁷

⁷ In lieu of higher formants, we could have extended the vector f(v_i) to encode the binary distinctive features of the IPA vowel v_i: round, tense, long, nasal, creaky, etc.

For example, we have f([ɔ]) = (500, 700). We now describe three possible methods for mapping f(v_i) to an embedding e(v_i). Each of these maps has learnable parameters.

Neural Embedding. We first consider directly embedding each vowel v_i into a vector space R^r. We achieve this through a feed-forward neural net:

    e(v_i) = W_1 tanh(W_0 f(v_i) + b_0) + b_1.    (7)

Equation (7) gives an architecture with 1 layer of nonlinearity; in general we consider stacking d ≥ 0 layers. Here W_0 ∈ R^{r×k}, W_1 ∈ R^{r×r}, ..., W_d ∈ R^{r×r} are weight matrices, b_0, ..., b_d ∈ R^r are bias vectors, and tanh could be replaced by any pointwise nonlinearity. We treat both the depth d and the embedding size r as hyperparameters, and select the optimal values on a development set.
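
A NumPy sketch of this architecture, with arbitrary random initialization standing in for trained weights, is:

    import numpy as np

    def make_embedder(k=2, r=10, d=1, seed=0):
        """Build e: R^k -> R^r with d tanh layers, as in equation (7) for d = 1."""
        rng = np.random.default_rng(seed)
        Ws = [rng.normal(scale=0.1, size=(r, k))]                        # W_0
        Ws += [rng.normal(scale=0.1, size=(r, r)) for _ in range(d)]     # W_1 .. W_d
        bs = [rng.normal(scale=0.1, size=r) for _ in range(d + 1)]       # b_0 .. b_d

        def e(f_v):
            h = np.asarray(f_v, dtype=float)
            for W, b in zip(Ws[:-1], bs[:-1]):     # d nonlinear layers
                h = np.tanh(W @ h + b)
            return Ws[-1] @ h + bs[-1]             # final affine layer
        return e

    e = make_embedder()
    print(e([0.5, 0.7]))   # embedding of a vowel from its centered, scaled formants

With d = 0 this reduces to a linear map of the formant vector; larger d gives the stacked architecture described above.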

Interpretable Neural Embedding. We are interested in the special case of neural embeddings when r = k, since then (for any d) the mapping f(v_i) ↦ e(v_i) is a diffeomorphism:⁸ a smooth invertible function of R^k. An example of such a diffeomorphism is shown in Figure 1.

There is a long history in cognitive psychology of mapping stimuli into some psychological space. The distances in this psychological space may be predictive of generalization (Shepard, 1987) or of perception. Due to the anatomy of the ear, the mapping of vowels from acoustic space to perceptual space is often presumed to be nonlinear (Rosner and Pickering, 1994; Nearey and Kiefte, 2003), and there are many perceptually-oriented phonetic scales, e.g., Bark and Mel, that carry out such nonlinear transformations while preserving the dimensionality k, as we do here. As discussed in §2.2, vowel system typology is similarly believed to be influenced by distances between the vowels in a latent metric space. We are interested in whether a constrained k-dimensional model of these distances can do well in our experiments.

Prototype-Based Embedding. Unfortunately, our interpretable neural embedding is incompatible with the DPP. The DPP assigns probability 0 to any vowel inventory V whose e vectors are linearly dependent. If the vectors are in R^k, then this means that p(V) = 0 whenever |V| > k. In our setting, this would limit vowel inventories to size 2.

⁸ Provided that our nonlinearity in (7) is a differentiable invertible function like tanh rather than relu.

Our solution to this problem is to still construct our interpretable metric space R^k, but then map that nonlinearly to R^r for some large r. This latter map is constrained. Specifically, we choose "prototype" points μ_1, ..., μ_r ∈ R^k. These prototype points are parameters of the model: their coordinates are learned and do not necessarily correspond to any actual vowel. We then construct e(v_i) ∈ R^r as a "response vector" of similarities of our vowel v_i to these prototypes. Crucially, the responses depend on distances measured in the interpretable metric space R^k. We use a Gaussian-density response function, where x(v_i) denotes the representation of our vowel v_i in the interpretable space:

    e(v_i)_ℓ = w_ℓ · p(x(v_i); μ_ℓ, σ²I)    (8)
             = w_ℓ · (2πσ²)^(−k/2) exp(−||x(v_i) − μ_ℓ||² / (2σ²)),

for ℓ = 1, 2, ..., r. We additionally impose the constraints that each w_ℓ ≥ 0 and ∑_{ℓ=1}^{r} w_ℓ = 1.

Notice that the sum ∑_{ℓ=1}^{r} e(v_i)_ℓ may be viewed as the density at x(v_i) under a Gaussian mixture model. We use this fact to construct a prototype-based MPP as well: we redefine φ(v_i) to equal this positive density, while still defining ψ via equation (6). The idea is that dispersion is measured in the interpretable space R^k, and focalization is defined by certain "good" regions in that space that are centered at the r prototypes.

6 Evaluation Metrics

Fundamentally, we are interested in whether our model has abstracted the core principles of what makes a good vowel system. Our choice of a probabilistic model provides a natural test: how surprised is our model by held-out languages? In other words, how likely does our model think unobserved, but attested, vowel systems are? While this is a natural evaluation paradigm in NLP, it has not—to the best of our knowledge—been applied to a quantitative investigation of linguistic typology.

As a second evaluation, we introduce a vowel system cloze task that could also be used to evaluate non-probabilistic models. This task is defined by analogy to the traditional semantic cloze task (Taylor, 1953), where the reader is asked to fill in a missing word in a sentence from the context. In our vowel system cloze task, we present a learner with a subset of the vowels in a held-out vowel system and ask them to predict the remaining vowels. Consider, as a concrete example, the general American English vowel system (excluding long vowels) {[i], [ɪ], [u], [ʊ], [ɛ], [æ], [ɔ], [ɑ], [ə]}. One potential cloze task would be to predict {[i], [u]} given {[ɪ], [ʊ], [ɛ], [æ], [ɔ], [ɑ], [ə]} and the fact that two vowels are missing from the inventory. Within the cloze task, we report accuracy, i.e., did we guess the missing vowel right? We consider three versions of the cloze task. First, we predict one missing vowel in a setting where exactly one vowel was deleted. Second, we predict up to one missing vowel, where a vowel may have been deleted. Third, we predict up to two missing vowels, where one or two vowels may have been deleted.
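
The one-missing-vowel variant can be scored with a few lines of code. The sketch below only pins down the protocol; it assumes inventories are represented as sets and that some scoring function inventory_prob is available (e.g., one of the model sketches above), which is an assumption for illustration:

    def cloze_1_predict(inventory_prob, observed, candidates):
        """Guess the single deleted vowel: the candidate whose addition scores highest."""
        return max((v for v in candidates if v not in observed),
                   key=lambda v: inventory_prob(observed | {v}))

    def cloze_1_accuracy(inventory_prob, held_out_inventories, candidates):
        correct = total = 0
        for inv in held_out_inventories:
            for deleted in inv:                        # delete each vowel in turn
                guess = cloze_1_predict(inventory_prob, inv - {deleted}, candidates)
                correct += (guess == deleted)
                total += 1
        return correct / total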

7 Experiments

We evaluate our models using 10-fold cross-validation over the 223 languages. We report the mean performance over the 10 folds. The performance on each fold ("test") was obtained by training many models on 8 of the other 9 folds ("train"), selecting the model that obtained the best task-specific performance on the remaining fold ("development"), and assessing it on the test fold. Parameter optimization is performed with the L-BFGS algorithm (Liu and Nocedal, 1989). As a preprocessing step, the first two formant values F1 and F2 are centered around zero and scaled down by a factor of 1000, since the formant values themselves may be quite large.
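
For instance, the preprocessing amounts to something like the following, where formants is a hypothetical array of raw (F1, F2) values in Hz:

    import numpy as np

    # Hypothetical raw formant measurements for three vowels, in Hz.
    formants = np.array([[300.0, 2300.0],
                         [700.0, 1200.0],
                         [350.0,  800.0]])
    # Center each formant dimension around zero and scale down by 1000.
    f = (formants - formants.mean(axis=0)) / 1000.0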

Specifically, we use the development fold to select among the following combinations of hyperparameters. For neural embeddings, we tried r ∈ {2, 10, 50, 100, 150, 200}. For prototype embeddings, we took the number of components r ∈ {20, 30, 40, 50}. We tried network depths d ∈ {0, 1, 2, 3}. We also swept the coefficient of an L2 regularizer on the neural network parameters.

7.1 Results and Discussion

Figure 1 visualizes the diffeomorphism from formant space to metric space for one of our DPP models (depth d = 3 with r = 20 prototypes). Similar figures can be generated for all of the interpretable models.

We report results for cross-entropy and the cloze evaluation in Table 1.⁹ Under both metrics, we see that the DPP is slightly better than the MPP; both are better than the BPP. This ranking holds for each of the 3 embedding schemes.

⁹ Computing cross-entropy exactly is intractable with the MPP, so we resort to an unbiased importance sampling scheme where we draw samples from the BPP and reweight according to the MPP (Liu et al., 2015).

            BPP      uBPP     uMPP     uDPP     iBPP     iMPP     iDPP   pBPP     pMPP     pDPP
x-ent       8.24     8.28     8.08     8.00     13.01    11.50    ✗      12.83    10.95    10.29
cloze-1     69.55%   69.55%   72.05%   73.18%   64.13%   67.02%   ✗      65.13%   68.18%   68.18%
cloze-01    60.00%   60.00%   61.01%   62.27%   61.78%   61.04%   ✗      61.02%   63.04%   63.63%
cloze-012   53.18%   53.18%   57.92%   58.18%   39.04%   43.02%   ✗      40.56%   45.01%   45.46%

Table 1: Cross-entropy in nats (lower is better) and cloze prediction accuracy (higher is better). "BPP" is a simple BPP with one parameter for each of the 53 vowels in 𝒱. This model does artificially well by modeling an "accidental" feature of our data: it is able to learn not only which vowels are popular among languages, but also which IPA symbols are popular or conventional among the descriptive phoneticists who created our dataset (see footnote 6), something that would become irrelevant if we upgraded our task to predict actual formant vectors rather than IPA symbols (see footnote 3). Our point processes, by contrast, are appropriately allowed to consider a vowel only through its formant vector. The "u-" versions of the models use the uninterpretable neural embedding of the formant vector into R^r: by taking r to be large, they are still able to learn special treatment for each vowel in 𝒱 (which is why uBPP performs identically to BPP, before being beaten by uMPP and uDPP). The "i-" versions limit themselves to an interpretable neural embedding into R^k, giving a more realistic description that does not perform as well. The "p-" versions lift that R^k embedding into R^r by measuring similarities to r prototypes; they thereby improve on the corresponding i- versions. For each result shown, the depth d of our neural network was tuned on a development set (typically d = 2). r was also tuned when applicable (typically r > 100 dimensions for the u- models and r ≈ 30 prototypes for the p- models).

The embedding schemes themselves are compared in the caption of Table 1.

Within each embedding scheme, the BPP performs several points worse on the cloze tasks, confirming that dispersion is needed to model vowel inventories well. Still, the BPP's respectable performance shows that much of the structure can be captured by focalization. As §3 noted, the BPP may generate well-dispersed sets, as the common vowels tend to be dispersed already (see Figure 4). In this capacity, however, the BPP is not explanatory, as it cannot actually tell us why these vowels should be frequent.

We mention that depth in the neural network is helpful, with deeper embedding networks performing slightly better than depth d = 0.

Finally, we identified each model's favorite complete vowel system of size n (Table 2). For the BPP, this is simply the n most probable vowels. Decoding the DPP and MPP is NP-hard, but we found the best system by brute force (for small n). The dispersion in these models predicts different systems than the BPP.

8 Discussion: Probabilistic Typology

Typology as Density Estimation? Our goal is to define a universal distribution over all possible vowel inventories. Is this appropriate? We regard this as a natural approach to typology, because it directly describes which kinds of linguistic systems are more or less common. Traditional implicational universals ("all languages with v_i have v_j") are softened, in our approach, into conditional probabilities such as "p(v_j ∈ V | v_i ∈ V) ≈ 0.9." Here the 0.9 is not merely an empirical ratio, but a smoothed probability derived from the complete estimated distribution. It is meant to make predictions about unseen languages.

Whether human language learners exploit any properties of this distribution¹⁰ is a separate question that goes beyond typology. Jakobson (1941) did find that children acquired phoneme inventories in an order that reflected principles similar to dispersion ("maximum contrast") and focalization.

At any rate, we estimate the distribution given some set of attested systems that are assumed to have been drawn IID from it. One might object that this IID assumption ignores evolutionary relationships among the attested systems, causing our estimated distribution to favor systems that are coincidentally frequent among current human languages, rather than being natural in some timeless sense. We reply that our approach is then appropriate when the goal of typology is to estimate the distribution of actual human languages—a distribution that can be utilized in principle (and also in practice, as we show) to predict properties of actual languages from outside the training set.

A different possible goal of typology is a theory of natural human languages. This goal would require a more complex approach. One should not imagine that natural languages are drawn in a vacuum from some single, stationary distribution. Rather, each language is drawn conditionally on its parent language. Thus, one should estimate a stochastic model of the evolution of linguistic systems through time, and identify "naturalness" with the directions in which this system tends to evolve.

¹⁰ This could happen because learners have evolved to expect the languages (the Baldwin effect), or because the languages have evolved to be easily learned (universal grammar).

        BPP                               MPP                               DPP
n       MAP inventory    add.   del.     MAP inventory    add.   del.     MAP inventory    add.   del.
1       i                i      –        ə                ə      –        ə                ə      –
2       i, u             u      –        i, u             i, u   ə        i, u             i, u   ə
3       i, u, a          a      –        i, u, a          a      –        i, u, a          a      –
4       i, u, a, o       o      –        i, u, a, e       e      –        i, u, a, o       o      –
5       i, u, a, o, e    e      –        i, u, a, e, ə    ə      –        i, u, a, o, ə    ə      –

Table 2: Highest-probability inventory of each size n according to our three models (prototype-based embeddings and d = 3), together with its additions and deletions relative to the MAP inventory of size n − 1. The MAP configuration is computed by brute-force enumeration for small n.


Energy Minimization Approaches. The traditional energy-based approach (Liljencrants and Lindblom, 1972) to vowel simulation minimizes the following objective (written in our notation):

    E(m) = ∑_{1 ≤ i < j ≤ m} 1 / ||e(v_i) − e(v_j)||²,    (9)

where the vectors e(v_i) ∈ R^r are not spit out of a deep network, as in our case, but rather directly optimized. Liljencrants and Lindblom (1972) propose a coordinate descent algorithm to optimize E(m). While this is not in itself a probabilistic model, they generate diverse vowel systems through random restarts that find different local optima (a kind of deterministic evolutionary mechanism). We note that equation (9) assumes that the number of vowels m is given, and only encodes a notion of dispersion. Roark (2001) subsequently extended equation (9) to include the notion of focalization.
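
As a numerical illustration of equation (9), with made-up 2-dimensional coordinates, a crowded configuration incurs much higher energy than a dispersed one:

    import numpy as np

    def energy(E):
        """Equation (9): sum of inverse squared distances over all vowel pairs."""
        m = len(E)
        return sum(1.0 / np.sum((E[i] - E[j]) ** 2)
                   for i in range(m) for j in range(i + 1, m))

    dispersed = np.array([[0.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])   # roughly i, u, a corners
    crowded   = np.array([[0.0, 1.0], [0.1,  0.9],  [1.0, -1.0]])
    print(energy(dispersed), energy(crowded))   # the crowded system has far higher energy

Note that dropping any vowel can only remove positive terms from this sum, which is exactly the size problem discussed next.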

Vowel Inventory Size. A fatal flaw of the traditional energy minimization paradigm is that it has no clear way to compare vowel inventories of different sizes. The problem is quite crippling since, in general, inventories with fewer vowels will have lower energy. This does not match reality—the empirical distribution over inventory sizes (shown in Figure 5) shows that the mode is actually 5 and that small inventories are uncommon: no 1-vowel inventory is attested and only one 2-vowel inventory is known. A probabilistic model over all vowel systems must implicitly model the size of the system. Indeed, our models pit all potential inventories against each other, taking on the extra burden of matching the empirical distribution over size.

Frequency of Inventories. Another problem is the inability to model frequency. While for inventories of a modest size (3–5 vowels) there are very few unique attested systems, there is a plethora of attested larger vowel systems. The energy minimization paradigm has no principled manner to tell the scientist how likely a novel system may be. Appealing again to the empirical distribution over attested vowel systems, we consider the relative diversity of systems of each size. We graph this in Figure 5. Consider all vowel systems of size 7. There are (|𝒱| choose 7) potential inventories, yet the empirical distribution is remarkably peaked. Our probabilistic models have the advantage in this context as well, as they naturally quantify the likelihood of an individual inventory.

Typology is a Small-Data Problem. In contrast to many common problems in applied NLP, e.g., part-of-speech tagging, parsing, and machine translation, the modeling of linguistic typology is fundamentally a "small-data" problem. Out of the 7105 languages on earth, we only have linguistic annotation for 2600 of them (Comrie et al., 2013). Moreover, we only have phonetic and phonological annotation for a much smaller set of languages—between 300 and 500 (Maddieson, 2013). Given the paucity of data, overfitting on only those attested languages is a dangerous possibility—just because a certain inventory has never been attested, it is probably wrong to conclude that it is impossible—or even improbable—on that basis alone. By analogy to language modeling, almost all sentences observed in practice are novel with respect to the training data, but we still must employ a principled manner to discriminate high-probability sentences (which are syntactically and semantically coherent) from low-probability ones. Probabilistic modeling provides a natural paradigm for this sort of investigation—machine learning has developed well-understood smoothing techniques, e.g., regularization with tuning on a held-out dev set, to avoid overfitting in a small-data scenario.

Related Work in NLP. Various point processes have previously been applied to a potpourri of tasks in NLP.

Figure 4: Percentage of the vowel inventories (y-axis) in the Becker-Kristal corpus (Becker-Kristal, 2010) that have a given vowel (shown in IPA along the x-axis: i, u, a, o, e, ɔ, ɛ, ɪ, y, ʊ, ɑ, ø, æ, ə, ɨ, œ, ʏ, ɯ, ʌ, ɤ, ɒ, ɵ, ʉ, ɜ, ɐ, ...).

Determinantal point processes have found a home in the literature in tasks that require diversity. E.g., DPPs have achieved state-of-the-art results on multi-document summarization (Kulesza and Taskar, 2011), news article selection (Affandi et al., 2012), recommender systems (Gartrell et al., 2017), and joint clustering of verbal lexical semantic properties (Reichart and Korhonen, 2013), inter alia. Poisson point processes have also been applied to NLP problems: Yee et al. (2015) model emerging topics on social media using a homogeneous point process, and Lukasik et al. (2015) apply a log-Gaussian point process, a variant of the Poisson point process, to rumor detection on Twitter. We are unaware of previous attempts to probabilistically model vowel inventory typology.

Future Work. This work lends itself to several technical extensions. One could expand the function f to more completely characterize each vowel's acoustic properties, perceptual properties, or distinctive features (footnote 7). One could generalize our point process models to sample finite subsets from the continuous space of vowels (footnote 3). One could consider augmenting the MPP with a new factor that explicitly controls the size of the vowel inventory. Richer families of point processes might also be worth exploring. For example, perhaps the vowel inventory is generated by some temporal mechanism with latent intermediate steps, such as sequential selection of the vowels or evolutionary drift of the inventory. Another possibility is that vowel systems tend to reuse distinctive features or even follow factorial designs, so that an inventory with creaky front vowels also tends to have creaky back vowels.

Figure 5: Histogram of the sizes of different vowel inventories in the corpus. The x-axis is the size of the vowel inventory (ranging from 3 to 14) and the y-axis is the number of inventories with that size.

9 Conclusions

We have presented a series of point process models of vowel inventory typology, with the goal of providing a mathematical grounding for research in phonological typology. All models were additionally given a deep parameterization to learn representations similar to the perceptual spaces of cognitive science. We also motivated our preference for probabilistic modeling in linguistic typology over previously proposed computational approaches and argued that it is a more natural research paradigm. Additionally, we have introduced several novel evaluation metrics for research in vowel-system typology, which we hope will spark further interest in the area. The models' performance was empirically validated on the Becker-Kristal corpus, which includes data from over 200 languages.

Acknowledgments

The first author was funded by an NDSEG graduate fellowship, and the second author by NSF grant IIS-1423276. We would like to thank Tim Vieira and Huda Khayrallah for helpful initial feedback.

References

David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. 1985. A learning algorithm for Boltzmann machines. Cognitive Science 9(1):147–169.

Raja Hafiz Affandi, Alex Kulesza, and Emily B. Fox. 2012. Markov determinantal point processes. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 26–35.

Roy Becker-Kristal. 2010. Acoustic Typology of Vowel Inventories and Dispersion Theory: Insights from a Large Cross-Linguistic Corpus. Ph.D. thesis, UCLA.

Paulus Petrus Gerardus Boersma et al. 2002. Praat, a system for doing phonetics by computer. Glot International 5.

Alexei Borodin and Eric M. Rains. 2005. Eynard-Mehta theorem, Schur process, and their Pfaffian analogs. Journal of Statistical Physics 121(3–4):291–317.

Bernard Comrie, Matthew S. Dryer, David Gil, and Martin Haspelmath. 2013. Introduction. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online, Max Planck Institute for Evolutionary Anthropology, Leipzig. http://wals.info/chapter/s1.

Gregory F. Cooper. 1990. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence 42(2–3):393–405.

Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. 2017. Low-rank factorization of determinantal point processes, pages 1912–1918.

Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6):721–741.

Matthew K. Gordon. 2016. Phonological Typology. Oxford.

Geoffrey E. Hinton and Terry J. Sejnowski. 1986. Learning and relearning in Boltzmann machines. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing, MIT Press, volume 2, chapter 7, pages 282–317.

Ernst Ising. 1925. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei 31(1):253–258.

Roman Jakobson. 1941. Kindersprache, Aphasie und allgemeine Lautgesetze. Suhrkamp, Frankfurt a.M.

Alex Kulesza and Ben Taskar. 2011. Learning determinantal point processes. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 419–427.

Alex Kulesza and Ben Taskar. 2012. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning 5(2–3):123–286.

Peter Ladefoged and Keith Johnson. 2014. A Course in Phonetics. Cengage.

Peter Ladefoged and Ian Maddieson. 1996. The Sounds of the World's Languages. Oxford.

Johan Liljencrants and Björn Lindblom. 1972. Numerical simulation of vowel quality systems: The role of perceptual contrast. Language, pages 839–862.

Björn Lindblom. 1986. Phonetic universals in vowel systems. Experimental Phonology, pages 13–44.

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45(1–3):503–528.

Qiang Liu, Jian Peng, Alexander T. Ihler, and John W. Fisher III. 2015. Estimating the partition function by discriminance sampling. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 514–522.

Michal Lukasik, Trevor Cohn, and Kalina Bontcheva. 2015. Point process modelling of rumour dynamics in social media. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, Beijing, China, pages 518–523.

Odile Macchi. 1975. The coincidence approach to stochastic point processes. Advances in Applied Probability, pages 83–122.

Ian Maddieson. 2013. Vowel quality inventories. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online, Max Planck Institute for Evolutionary Anthropology, Leipzig. http://wals.info/chapter/2.

Steven Moran, Daniel McCloy, and Richard Wright. 2014. PHOIBLE online. Leipzig: Max Planck Institute for Evolutionary Anthropology.

Terrance M. Nearey and Michael Kiefte. 2003. Comparison of several proposed perceptual representations of vowel spectra. Proceedings of the XVth International Congress of Phonetic Sciences 1:1005–1008.

Roi Reichart and Anna Korhonen. 2013. Improved lexical acquisition through DPP-based verb clustering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, pages 862–872.

Brian Roark. 2001. Explaining vowel inventory tendencies via simulation: Finding a role for quantal locations and formant normalization. In North East Linguistic Society, volume 31, pages 419–434.

Christian P. Robert and George Casella. 2005. Monte Carlo Statistical Methods. Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Burton S. Rosner and John B. Pickering. 1994. Vowel Perception and Production. Oxford University Press.

Jean-Luc Schwartz, Louis-Jean Boë, Nathalie Vallée, and Christian Abry. 1997. The dispersion-focalization theory of vowel systems. Journal of Phonetics 25(3):255–286.

Roger N. Shepard. 1987. Toward a universal law of generalization for psychological science. Science 237(4820):1317–1323.

Kenneth N. Stevens. 1972. The quantal nature of speech: Evidence from articulatory-acoustic data. In E. E. David and P. B. Denes, editors, Human Communication: A Unified View, McGraw-Hill, pages 51–56.

Kenneth N. Stevens. 1989. On the quantal nature of speech. Journal of Phonetics 17:3–45.

Wilson L. Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism and Mass Communication Quarterly 30(4):415.

M. N. M. Van Lieshout. 2000. Markov Point Processes and Their Applications. Imperial College Press, London.

Viveka Velupillai. 2012. An Introduction to Linguistic Typology. John Benjamins Publishing Company.

Connie Yee, Nathan Keane, and Liang Zhou. 2015. Modeling and characterizing social media topics using the gamma distribution. In EVENTS, pages 117–122.