PERICLES - A field approach to semantic drift

GRANT AGREEMENT: 601138 | SCHEME FP7 ICT 2011.4.3 Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics [Digital Preservation] Borås, May 19, 2015

Upload: periclesfp7

Post on 08-Aug-2015


Society, technology, language etc. change:
◦ Evolution, dynamics – requires representation

Semantics:
◦ The study of word vs. sentence meaning
◦ In natural vs. artificial languages, e.g. LRM = abstract language, domain-specific LRM = abstract sublanguage

Representation for computation of past trends vs. future projections

Distributional semantics (Harris 1970)

Semantic/lexical fields (Trier 1934)

Reconstruct groups of words with related meaning in space, i.e. construct a field

Means of representation:
◦ Vector space vs. vector field
◦ At stake: compute smooth change with meaningful groups of related words in a region/field, i.e. create an artificial lexical field
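The distributional idea behind an artificial lexical field can be illustrated with a minimal sketch: terms that occur in similar documents get similar count vectors, and a "field" around a term is the set of its close neighbours. The toy counts, the function names and the 0.7 threshold below are assumptions made for illustration only; they are not taken from the slides.

```python
import math

# Toy term-by-document counts (hypothetical data, not from the corpus).
counts = {
    "bank":   [3, 2, 0, 0],
    "loan":   [2, 3, 0, 1],
    "credit": [2, 2, 1, 0],
    "wheat":  [0, 0, 3, 2],
    "grain":  [0, 1, 2, 3],
}

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lexical_field(term, threshold=0.7):
    """Terms whose distribution over documents is close to that of `term`."""
    return sorted(
        other for other in counts
        if other != term and cosine(counts[term], counts[other]) >= threshold
    )

print(lexical_field("bank"))
print(lexical_field("wheat"))
```

A fixed similarity threshold is only one way to delimit a group; the map-based approach on the later slides lets the groups emerge from the trained grid instead.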

Lexical field in graph form

From language to its representation

Vector space

A vector field

An evolving vector field

Dependency drift as landscape

Matrix

21578 docs, 12000 terms (filtered here)

Economy news over one year (8 periods)
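As a rough sketch of how such input could be prepared, the snippet below sorts dated documents, cuts them into consecutive periods, and builds one term-by-document count matrix per period. The slides do not say how the 8 periods were delimited, so the equal-sized slicing, the whitespace tokenizer and all names here are assumptions.

```python
from collections import Counter

def split_into_periods(docs, n_periods):
    """Sort (iso_date, text) pairs by date and cut them into n_periods slices."""
    docs = sorted(docs, key=lambda d: d[0])
    n = len(docs)
    return [docs[i * n // n_periods:(i + 1) * n // n_periods]
            for i in range(n_periods)]

def term_document_matrix(period_docs, vocabulary):
    """Rows are terms, columns are documents, entries are raw counts."""
    columns = [Counter(text.lower().split()) for _, text in period_docs]
    return {term: [col[term] for col in columns] for term in vocabulary}

# Hypothetical toy corpus; ISO dates sort correctly as plain strings.
docs = [
    ("1987-03-02", "oil prices rise"),
    ("1987-01-15", "bank raises rates"),
    ("1987-02-01", "bank cuts rates"),
    ("1987-04-10", "oil output falls"),
]
vocabulary = ["bank", "oil", "rates"]

periods = split_into_periods(docs, 2)
matrices = [term_document_matrix(p, vocabulary) for p in periods]
for i, m in enumerate(matrices):
    print(i, m)
```

In the toy output, "bank" and "rates" are active only in the first period and "oil" only in the second, which is the kind of change over time that the later field representation is meant to capture.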

ESOM (source: Ultsch et al. 2005):
◦ Weight vectors (z axis) indicate a potential, mapped here to a grid of nodes with best matching units (BMUs) standing for terms
◦ Approximately 5 nodes per index term to interpolate a potential field to study lexical gaps in distributional patterns (60,000 nodes)
◦ Index terms converted to location + direction vectors
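The points above can be sketched with a small self-organizing map in plain Python: a grid of weight vectors is trained on term vectors, and each term is then placed at its best matching unit. This is a generic textbook SOM, not the ESOM or Somoclu implementation; the toy term vectors, the linear decay schedules and the function names are assumptions. The grid has 4 × 5 = 20 nodes for 4 terms, mirroring the "about 5 nodes per index term" ratio.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid coordinates (row, col) of the node whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_dist:
                best, best_dist = (r, c), d
    return best

def train_som(data, n_rows, n_cols, epochs=20, lr0=0.5, seed=0):
    """Online SOM training with a Gaussian neighbourhood on a flat grid."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(n_cols)]
               for _ in range(n_rows)]
    radius0 = max(n_rows, n_cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)  # neighbourhood shrinks, floor 0.5
        for x in data:
            br, bc = best_matching_unit(weights, x)
            for r in range(n_rows):
                for c in range(n_cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        # Pull the node towards x, more strongly near the BMU.
                        w[i] += lr * h * (x[i] - w[i])
    return weights

# Hypothetical term vectors scaled to [0, 1].
terms = {
    "bank":  [1.0, 0.9, 0.0],
    "loan":  [0.9, 1.0, 0.1],
    "wheat": [0.0, 0.1, 1.0],
    "grain": [0.1, 0.0, 0.9],
}
weights = train_som(list(terms.values()), n_rows=4, n_cols=5)
bmus = {t: best_matching_unit(weights, v) for t, v in terms.items()}
print(bmus)
```

Having more nodes than terms leaves unoccupied nodes between the BMUs; their weight vectors interpolate between neighbouring terms, which is what makes gaps in the map observable.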

Somoclu: fastest open-source SOM algorithm available

Visualization: ESOM Tools by Databionics (third-party software)


A cropped section of the U-matrix with best matching units and labels, showing a tight cluster

Large gap with BMUs pulled apart indicating tensions in the field
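How a gap shows up on a U-matrix can be illustrated with a minimal sketch: the height of each node is the mean distance between its weight vector and those of its grid neighbours, so a boundary between two dissimilar regions appears as a ridge. The hand-made weights and the function name are assumptions, and the grid here is flat, whereas ESOM maps are often toroidal.

```python
import math

def u_matrix(weights):
    """Mean distance from each node's weight vector to its 4-connected neighbours."""
    n_rows, n_cols = len(weights), len(weights[0])
    heights = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < n_rows and 0 <= nc < n_cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights

# Hypothetical 3 x 4 map: the two left columns hold one tight cluster,
# the two right columns another, with a large jump between columns 1 and 2.
row = [[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 1.0]]
weights = [[list(w) for w in row] for _ in range(3)]

heights = u_matrix(weights)
print([round(h, 3) for h in heights[1]])
```

In the middle row, the two outer nodes sit in flat terrain, while the two nodes on either side of the jump have markedly larger heights; BMUs on opposite sides of such a ridge are the ones described as pulled apart.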

The terms in this group, including ones that are not plotted in the figure, are: bongard, consign, ita, louisvill, occupi, reafffirm (with this misspelling), stabil, stabilis, strength, temporao, tight. Some are clearly related; for others, we find justification in the corpus.

Large gap examples: Apart from energet and garrison, these words are frequent, with over twenty appearances each, but were separated from other regions by emerging content in the fault lines.

The evolution of the LRM and its domain ontologies…

… influence the evolution of feature and DO categories displayed as a field…

Amazon scalability test with Somoclu (2015): 12.8 million book reviews represented as a lexical field
◦ Evaluation ongoing