Timo Honkela: Semantic and pragmatic representations of large text corpora

Timo Honkela: Semantic and pragmatic representations of large text corpora. FIN-CLARIN Jubilee Seminar and Nordic CLARIN Network Seminar, University of Helsinki, 9 Jun 2016.

Post on 27-Jan-2017


Page 1: Timo Honkela: Semantic and pragmatic representations of large text corpora

Timo Honkela, FIN-CLARIN Jubilee seminar, 9.6.2016

Timo Honkela

FIN-CLARIN Jubilee Seminar and Nordic CLARIN Network Seminar, University of Helsinki, 9 Jun 2016

Semantic and pragmatic representations of large text corpora

Page 2

Agenda

● Digital humanities in Finland

● Strategic role of humanities and social sciences

● Research using text corpora

Page 3

Digital humanities in Finland

● Research in humanities and social sciences is increasingly using digitally stored resources and computational analysis tools

Page 4

Krister Lindén et al.

Page 5

Varieng - Research Unit for the Study of Variation, Contacts and Change in English

Big Data, Rich Data, Uncharted Data
19–22 October 2015, Helsinki, Finland

Terttu Nevalainen

Irma Taavitsainen

Tanja Säily

http://www.helsinki.fi/varieng/

http://www.helsinki.fi/varieng/people/varieng_saily.html

et al.

Page 6

Multilingual language technology

Jörg Tiedemann

Mathias Creutz et al.

Page 7

Text mining historical newspapers

Mikko Tolonen

Kimmo Kettunen

Page 8

Citizen Mindscapes

Analysis of large social media corpora in order to increase understanding of social and societal phenomena

Page 9

Educational efforts: e.g. Digital Humanities Hackathon

Page 10

In many such research efforts and educational activities, FIN-CLARIN serves as an essential resource and infrastructure.

Page 11

Let's celebrate and have a moment of applause

http://375humanistia.helsinki.fi/en/humanists/kimmo-koskenniemi

Page 12

Complexity associated with different areas of science

Biological phenomena

Physical phenomena

Cultural phenomena

Page 13

Importance of humanities and social sciences

● Surprising as it may at first sound, one can claim that the humanities and social sciences are the most important sciences

● These disciplines deal with topics like language and communication, social conditions, historical developments, economy, etc.

● Due to this complexity, research in these areas is challenging; generalizations of the kind commonplace in physics are rarely possible

Page 14

Understanding the phenomena

Theory and knowledge formation

Qualitative Quantitative

Open data: corpora

Open methods

Computational resources

Page 15

Lars Borin

Linguistics has been the first e-science

Page 16

Challenges:

“Language is BIG”

“Human INTERPRETATION is inherently involved”

Importance of language:

“Language is involved in most relevant human activities”

Page 17

Example: Complexity of Finnish at the level of word forms

Kimmo Koskenniemi (2013): Johdatus kieliteknologiaan, sen merkitykseen ja sovelluksiin (Introduction to language technology, its significance and applications)

https://helda.helsinki.fi/bitstream/handle/10138/38503/kt-johd.pdf?sequence=1

Page 18

> 6000 languages, many more dialects

Billions of people

A large number of different cultures

A vast number of ways to relate language, concepts and the world to each other

Image sources: blogs.state.gov, en.wikipedia.org

Page 19

Language as a system

● Considering natural language as a signal and dynamic system at cognitive and social levels (also in its written form) rather than a symbolic and logical system

● Importance of embodiment (cf. e.g. Harnad) and embeddedness (cf. e.g. Edelman)

● Learning and pattern recognition processes are essential (as opposed to the theories presented e.g. by Chomsky, Fodor, Pinker); much of the learning is bound to be unsupervised

Page 20

Complexity of language regarding different areas and levels

Structure: morphology and syntax

Meaning: semantics and pragmatics

Page 21

What are the nature, granularity, type, metadata involved, etc. for different research purposes in different areas of linguistics and other areas of humanities and social sciences?

Page 22

Need to harmonize, build shared terminologies, theories, frameworks, etc.

Need to model contextuality, ambiguity, vagueness, history-dependence, change, etc.

Page 23

The same medium, language, is the object of study as well as the basis for theory formation, representing the ideas and resources, etc.

Page 24

Philosophy of science is essential to understand what is going on...

Data-driven, inductive mode

Hypothesis-driven, deductive mode

Page 25

An old research example:

Data-driven emergence of implicit word categories that match with human syntactic and semantic intuitions

Page 26

Classical example: Learning meaning from context:

Maps of words in Grimm fairy tales

Honkela, Pulkki & Kohonen 1995
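The idea of learning meaning from context can be illustrated with a minimal sketch. It is not the method of the cited work, which built word maps with self-organizing maps; it only shows the kind of context statistics such a map could be trained on. The toy corpus, the window size and the function names `context_vectors` and `cosine` are assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, window=1):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy corpus, not data from the Grimm fairy tales.
toy_corpus = [
    "the king rode home",
    "the queen rode home",
    "the king slept late",
    "the queen slept late",
]

vecs = context_vectors(toy_corpus)
sim_king_queen = cosine(vecs["king"], vecs["queen"])
sim_king_home = cosine(vecs["king"], vecs["home"])
```

In this toy corpus "king" and "queen" occur in identical contexts, so their context vectors are more similar to each other than either is to "home". Words used in similar contexts end up close together, which is the property a word map makes visible.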

Page 27

Research example:

Multimodally grounded models of meaning

Page 28

Labeling movements: Associating high-dim. kinesthetic time series with linguistic labels

Förger & Honkela 2014

Page 29

RUNNING

WALKING

LIMPING

JOGGING

Förger & Honkela 2014

Page 30

Research example:

Tensor-based analysis of the subjective aspect of interpreting linguistic expressions

Page 31

GICA: Grounded Intersubjective Concept Analysis

Honkela et al. 2012
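A tensor-based view of subjective interpretation can be sketched as a three-way data structure: who rates (subject), what is rated (item, e.g. a word), and against which context. The slide does not give the details of GICA, so the sketch below is only one possible arrangement under stated assumptions. The subjects, contexts, ratings and the function names `build_tensor`, `flatten` and `euclidean` are all hypothetical.

```python
import math

# Hypothetical, made-up ratings: how strongly each subject associates
# the item with each context (scale 1-5). Not data from the cited study.
subjects = ["s1", "s2"]
items = ["health"]
contexts = ["exercise", "medicine", "happiness"]
ratings = {
    ("s1", "health", "exercise"): 5,
    ("s1", "health", "medicine"): 2,
    ("s1", "health", "happiness"): 4,
    ("s2", "health", "exercise"): 2,
    ("s2", "health", "medicine"): 5,
    ("s2", "health", "happiness"): 3,
}


def build_tensor(subjects, items, contexts, ratings):
    """Arrange ratings as a nested list indexed [subject][item][context]."""
    return [
        [[ratings[(s, i, c)] for c in contexts] for i in items]
        for s in subjects
    ]


def flatten(tensor, subjects, items):
    """Flatten the tensor into one context-rating row per (subject, item)."""
    rows = {}
    for si, s in enumerate(subjects):
        for ii, i in enumerate(items):
            rows[(s, i)] = tensor[si][ii]
    return rows


def euclidean(a, b):
    """Euclidean distance between two rating rows of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


tensor = build_tensor(subjects, items, contexts, ratings)
rows = flatten(tensor, subjects, items)
gap = euclidean(rows[("s1", "health")], rows[("s2", "health")])
```

Each row describes how one subject interprets one item, so the distance between two rows for the same item gives a rough measure of how differently two people understand the same word.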

Page 32

Analysis of the word 'health'

Honkela et al. 2012

Page 33

Ideas for building corpora

● Expansion of the contextual framework

● Enriching metadata

● Increasing multimodal data sources that associate linguistic data with other modalities

● Involving a large number of people in labeling data to model variation

● Collecting data in real-world contexts