Data Management and Linguistic Analysis: MDS applied to RODA

Sheila M. Embleton, Dorin Uritescu & Eric S. Wheeler

York University, Toronto, Canada

Post on 14-Jan-2016

Page 1: Data Management and Linguistic Analysis:  MDS applied to RODA

Data Management and Linguistic Analysis:

MDS applied to RODA

Sheila M. Embleton, Dorin Uritescu & Eric S. Wheeler

York University, Toronto, Canada

Page 2: Data Management and Linguistic Analysis:  MDS applied to RODA

Order of Presentation

Context

Romanian and RODA

RODA as Linguistic Technology

Examples
• Latin word-final /u/
• Non-palatalized dentals before front vowels

MDS

MDS as an analytic tool

MDS and Romanian dialects

Page 3: Data Management and Linguistic Analysis:  MDS applied to RODA

Context

Page 4: Data Management and Linguistic Analysis:  MDS applied to RODA

Romania

Source: http://en.wikipedia.org/wiki/Romanian_language#Geographic_distribution

Page 5: Data Management and Linguistic Analysis:  MDS applied to RODA

Romanian

22+ million speakers

Critical exemplar of the eastern Romance language family

Page 6: Data Management and Linguistic Analysis:  MDS applied to RODA

Noul Atlas lingvistic român. Crişana

Crişana region in north-west Romania

Hard-copy atlas by Stan and Uritescu (1996, 2003)

Digitize to make it more accessible

Page 7: Data Management and Linguistic Analysis:  MDS applied to RODA

Objective

Use Information Technology to permit a broad range of scholars to
• access the data,
• select the data appropriately, and
• present the data clearly;

and so gain greater understanding of its significance.

Page 8: Data Management and Linguistic Analysis:  MDS applied to RODA

State of the Project (Nov 2007)

Have entered all 407 maps from Vol. I and II
• Twice proof-read
• Consulted source slips, when needed

Have developed search and mapping tools to access the digital data

Initial version now posted at: http://vpacademic.yorku.ca/romanian

Page 9: Data Management and Linguistic Analysis:  MDS applied to RODA

RODA as linguistic technology

Page 10: Data Management and Linguistic Analysis:  MDS applied to RODA
Page 11: Data Management and Linguistic Analysis:  MDS applied to RODA

The technology allows one to:

• View the data
• Search for data and count it
• Interpret the data or the counts
• Analyze the data (e.g. MDS)
• See the results as maps
• Save the maps as .jpg pictures
• Save the results for later use
• Hear samples of the data

Page 12: Data Management and Linguistic Analysis:  MDS applied to RODA

RODA: function

Custom-defined maps
• You select the data
• You see the result as a map

Programmable access to the whole set of digitized data
• You ask about data spread over many maps
• You can customize what you search for (not just the editor’s choice)

Page 13: Data Management and Linguistic Analysis:  MDS applied to RODA

RODA: search of data

Context of search becomes important
• Word-final vs non-final vs either
• Plain character vs accented character
• Character vs (superposed) alternate

Choice of fields to search
• E.g. with nouns: sg. vs pl. entries
• Variations heard by field workers
• Flags to mark special situations (e.g. hesitation)
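
As a rough illustration of how search context could be handled, here is a minimal Python sketch. The function names `strip_accents` and `matches`, and the `position` and `ignore_accents` parameters, are hypothetical and are not RODA's actual interface. The sketch only covers the word-final / non-final / either distinction and plain versus accented matching.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks so that, e.g., 'â' is compared as 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(form, char, position="either", ignore_accents=False):
    """Report whether `char` occurs in `form` in the requested position.

    position: "final", "non-final", or "either" (hypothetical option names).
    """
    if not char:
        raise ValueError("char must be a non-empty string")
    if ignore_accents:
        form, char = strip_accents(form), strip_accents(char)
    else:
        # Compare precomposed forms so accented characters stay distinct.
        form = unicodedata.normalize("NFC", form)
        char = unicodedata.normalize("NFC", char)

    if position == "final":
        return form.endswith(char)
    if position == "non-final":
        first = form.find(char)
        # True only if the first occurrence starts before the final slot.
        return 0 <= first < len(form) - len(char)
    if position == "either":
        return char in form
    raise ValueError(f"unknown position: {position!r}")


# Usage with forms quoted in this presentation
print(matches("cântu", "u", position="final"))
print(matches("cântu", "a", ignore_accents=True))
```

Treating the accent choice as a normalization step before comparison keeps the position logic the same for plain and accented searches.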

Page 14: Data Management and Linguistic Analysis:  MDS applied to RODA

Examples from RODA

Page 15: Data Management and Linguistic Analysis:  MDS applied to RODA

Crişana, Romania

Page 16: Data Management and Linguistic Analysis:  MDS applied to RODA

Crişana, Romania

(from RODA)

Page 17: Data Management and Linguistic Analysis:  MDS applied to RODA

Seeing Words Change

Word-final /u/ in Latin and non-Latin words

Page 18: Data Management and Linguistic Analysis:  MDS applied to RODA

Word-final /u/ from Latin

Latin             Romanian (standard     Dialectal variation    Dialectal variation
                  and most dialects)     (vowel present)        (non-syllabic)
canto ‘I sing’    cânt                   cântu                  cântu
oculum ‘eye’      ochi                   ochiu                  ochiu

Page 19: Data Management and Linguistic Analysis:  MDS applied to RODA

Is word-final /u/ random?

Look for a geographic pattern over all potential occurrences

The maps for single examples, such as /ochi/ and others, are in the hard-copy dialect atlas, but the total data for all examples is spread widely over many maps.

Page 20: Data Management and Linguistic Analysis:  MDS applied to RODA

Word-final /u/

Data from:
• 407 maps
• Field 1

Size of cross shows the number of occurrences

Horizontal = syllabic

Vertical = non-syllabic
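
The counts behind such a map can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the `(location, form)` record layout, the function name `count_final_u`, the use of a combining inverted breve (U+032F) to mark non-syllabic /u/, and the sample records themselves. RODA's own encoding may differ.

```python
import unicodedata
from collections import Counter

# Assumed marker for non-syllabic /u/: combining inverted breve below.
NON_SYLLABIC = "\u032f"


def count_final_u(records):
    """Tally word-final syllabic and non-syllabic /u/ for each location."""
    counts = {}
    for location, form in records:
        form = unicodedata.normalize("NFD", form)
        tally = counts.setdefault(location, Counter())
        if form.endswith("u" + NON_SYLLABIC):
            tally["non-syllabic"] += 1   # would set the vertical bar size
        elif form.endswith("u"):
            tally["syllabic"] += 1       # would set the horizontal bar size
    return counts


# Hypothetical records; the location numbers only echo those on the slides.
records = [
    (137, "cântu"),
    (137, "ochiu"),
    (141, "cântu" + NON_SYLLABIC),
    (150, "cânt"),
]
for location, tally in sorted(count_final_u(records).items()):
    print(location, tally["syllabic"], tally["non-syllabic"])
```

The two tallies per location correspond to the horizontal and vertical arms of the cross drawn on the map.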

Page 21: Data Management and Linguistic Analysis:  MDS applied to RODA

Syllabic and non-syllabic /u/

Data from:
• Selected maps
• Field 1
• Word-final or non-word-final

Size of cross shows the number of occurrences

Horizontal = syllabic

Vertical = non-syllabic

Page 22: Data Management and Linguistic Analysis:  MDS applied to RODA

Word-final, syllabic /u/

Data from:
• 407 maps
• Field 1
• Word-final only
• (horizontal = vertical)

Locations 137, 141, 146 show most examples

Page 23: Data Management and Linguistic Analysis:  MDS applied to RODA

Word-final, syllabic /u/

Can review the data

Page 24: Data Management and Linguistic Analysis:  MDS applied to RODA

Word-final, syllabic /u/

Data from:
• Selected maps
• Field 1
• Word-final only
• Removed non-vocalic /u/, def. art., some clusters + /u/
• (horizontal = vertical)

Locations 137, 141, 146 show most examples

Page 25: Data Management and Linguistic Analysis:  MDS applied to RODA

/u/ Pattern

There is a pattern:

Word-final /u/ is retained in central and north-eastern areas

It is syllabic mostly in parts of the central area

The locations with most frequent syllabic final /u/ do not form a continuous area

Page 26: Data Management and Linguistic Analysis:  MDS applied to RODA

Dialect sub-regions

Some locations have a given feature; others do not.

On the basis of such (sometimes limited) examples, linguists posit the existence of dialect sub-regions.

MDS analysis of “all” data raises questions about the nature of these sub-regions.

Page 27: Data Management and Linguistic Analysis:  MDS applied to RODA

Non-palatalized dentals before front vowels

Page 28: Data Management and Linguistic Analysis:  MDS applied to RODA

Non-palatalized dentals before front vowels

Crişana: dentals before front vowels are palatalized.

Are they restructured as palatals?

If the process is no longer productive, there may be non-palatalized dentals before front vowels.

If so, where, in what forms, and with what frequency?

Page 29: Data Management and Linguistic Analysis:  MDS applied to RODA

Non-palatalized dentals before front vowels

•Examples everywhere.

•(As is well-known, dentals are not palatalized in Oaş, except for 220.)

•Map shows where and how many examples.

Page 30: Data Management and Linguistic Analysis:  MDS applied to RODA

Non-palatalized dentals before front vowels

There are examples everywhere (not only in Oaş)

Here we establish a result with the location and frequency of examples.

Can view the examples that support the conclusion.

Page 31: Data Management and Linguistic Analysis:  MDS applied to RODA

MDS

Page 32: Data Management and Linguistic Analysis:  MDS applied to RODA

MDS as Analytic tool

In addition to select, search, count and map functions, RODA can have special-purpose analytic tools.

A built-in MDS tool allows us to create MDS maps based on any selected set of data.

Other analytic techniques could also be implemented.

Page 33: Data Management and Linguistic Analysis:  MDS applied to RODA

MDS Process-1

Multidimensional scaling (MDS) uses the “linguistic distance” between n+1 locations to place them in an n-dimensional space exactly...

Page 34: Data Management and Linguistic Analysis:  MDS applied to RODA

MDS Process-2

MDS projects an n-space onto a 2-space (a map) so that the distances among the points are preserved as well as possible.
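
The slides do not say which MDS variant RODA implements. As one possible reading, the sketch below uses classical (Torgerson) MDS in plain Python: the squared distances are double-centered, and the two leading eigenvectors, found here by simple power iteration, give the 2-space coordinates. The function names are hypothetical, and the sketch assumes a symmetric distance matrix whose leading eigenvalues are non-negative.

```python
import math
import random


def double_center(dist):
    """Turn a symmetric distance matrix into a centered Gram-like matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
             for j in range(n)] for i in range(n)]


def top_eigenpair(mat, iters=500, seed=0):
    """Dominant eigenvalue and unit eigenvector by power iteration."""
    n = len(mat)
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        new = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in new]
    # Rayleigh quotient gives the eigenvalue for the converged vector.
    value = sum(vec[i] * sum(mat[i][j] * vec[j] for j in range(n))
                for i in range(n))
    return value, vec


def classical_mds(dist, dims=2):
    """Place each location in `dims` dimensions from pairwise distances."""
    b = double_center(dist)
    n = len(b)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        value, vec = top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][k] = vec[i] * scale
        # Deflate: remove the component just extracted.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords


# Usage: four points on a 3-by-4 rectangle, given only their distances
points = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points] for p in points]
for row in classical_mds(dist):
    print([round(x, 3) for x in row])
```

The recovered coordinates may be rotated or reflected relative to the originals; only the distances among the points are meaningful, which is also why an MDS map need not line up with geography.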

Page 35: Data Management and Linguistic Analysis:  MDS applied to RODA

Projection to 2-space

Page 36: Data Management and Linguistic Analysis:  MDS applied to RODA

MDS Process-3

The linguistic map may or may not correspond to geography

It does give a high-level picture of the total linguistic relationship: all the data used to get the distances is now displayed as a single picture.

Page 37: Data Management and Linguistic Analysis:  MDS applied to RODA

Distance measures

Based on linguistic forms being “same” or “not same”

Does not account for forms that are nearly the same:
• “cat” ~ “caţ” ~ “feline”

Missing forms are “not same”

Summed over many comparisons
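
The same / not-same measure can be sketched in a few lines of Python. The function names, the use of `None` for a missing form, and the sample forms are assumptions made for illustration. Treating a pair in which both forms are missing as "not same" is also one reading of the slide, not a confirmed detail.

```python
def linguistic_distance(forms_a, forms_b):
    """Number of compared items on which two locations do not match exactly."""
    if len(forms_a) != len(forms_b):
        raise ValueError("both locations need one entry per compared item")
    return sum(
        1
        for a, b in zip(forms_a, forms_b)
        if a is None or b is None or a != b   # a missing form counts as different
    )


def distance_matrix(forms_by_location):
    """Pairwise distances for all locations, in sorted location order."""
    locations = sorted(forms_by_location)
    matrix = [
        [0 if p == q else
         linguistic_distance(forms_by_location[p], forms_by_location[q])
         for q in locations]
        for p in locations
    ]
    return locations, matrix


# Hypothetical data: three locations, three compared items each
data = {
    137: ["cântu", "ochiu", None],
    141: ["cântu", "ochi", "casă"],
    146: ["cânt", "ochi", "casă"],
}
locations, matrix = distance_matrix(data)
print(locations)  # [137, 141, 146]
print(matrix)     # [[0, 2, 3], [2, 0, 1], [3, 1, 0]]
```

A matrix of this shape is the kind of input an MDS procedure takes.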

Page 38: Data Management and Linguistic Analysis:  MDS applied to RODA

MDS and dialects

Embleton and Wheeler have used an MDS process on
• English dialects
• Finnish dialects

Dialect roughly correlates with geography

Page 39: Data Management and Linguistic Analysis:  MDS applied to RODA

Romanian Dialect groupings

Begin with a hypothesis about dialect groupings in Crişana.

Analyzed all data in 403 maps, using the MDS method.

Identity is exact match; any difference is a difference of 1.

Distance is sum of differences.

We see the groupings on a map.

Page 40: Data Management and Linguistic Analysis:  MDS applied to RODA

MDS map: All groups

South-east and South-west are distinct.

The rest are less so.

Suggests the dialect unity of the region

--> refine groupings

Page 41: Data Management and Linguistic Analysis:  MDS applied to RODA

MDS map: Refined groupings

Still, considerable overlap or closeness

More groups that could be identified, e.g.:

Several divisions in West

Two areas in Oaş

Oaş is close to southern areas

Still, its distinctness is clear (cf. also Uritescu 1984a).

Page 42: Data Management and Linguistic Analysis:  MDS applied to RODA

MDS map: Refined groupings

Page 43: Data Management and Linguistic Analysis:  MDS applied to RODA

Crişana dialect regions

When a lot of data is considered:
• There is much overlap of regions
• A few regions are distinct

It is possible that areas share features in a complex way, based on distance, physical geography and other factors.

There is more apparent unity than traditional analyses (based on a few features) would provide.

Page 44: Data Management and Linguistic Analysis:  MDS applied to RODA

Further investigation

We want to look at:
• Differences in vocabulary (rare vs common terms)
• Phonetics vs morphology vs syntax
• Other definitions of distance
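
One candidate for another definition of distance is a graded measure that treats nearly identical forms as close. The sketch below uses a length-normalized Levenshtein (edit) distance. This is a swapped-in, generic technique chosen for illustration, not a measure the authors say they plan to adopt, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                   # delete ch_a
                cur[j - 1] + 1,                # insert ch_b
                prev[j - 1] + (ch_a != ch_b),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def graded_difference(a, b):
    """Difference between 0.0 (identical) and 1.0 (nothing in common).

    A missing form (None) is still treated as fully different.
    """
    if a is None or b is None:
        return 1.0
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Usage with the example forms from the distance-measures slide
print(graded_difference("cat", "ca\u0163"))   # one substitution in three characters
print(graded_difference("cat", "feline"))
```

Under this measure “cat” and “caţ” differ by a third, while “cat” and “feline” remain maximally different, which the same / not-same measure cannot express.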

Page 45: Data Management and Linguistic Analysis:  MDS applied to RODA

RODA and MDS

RODA provides the large amount of data.

MDS makes the large amount of data readily understandable as a single picture.

Implementing MDS in RODA means that researchers can easily try the approach.

Page 46: Data Management and Linguistic Analysis:  MDS applied to RODA

Summary

RODA provides:
• Accessible data
• Flexible searching and custom presentation
• Repeatable processing

MDS makes the data easy to visualize

Result: new linguistic insights based on the greater understanding of the data

Page 47: Data Management and Linguistic Analysis:  MDS applied to RODA

Contacts

Sheila [email protected]

Dorin [email protected]

Eric [email protected]

Site: vpacademic.yorku.ca/romanian/