Timo Honkela: Semantic and pragmatic representations of large text corpora
TRANSCRIPT
Timo Honkela, FIN-CLARIN Jubilee seminar, 9.6.2016
Timo Honkela
FIN-CLARIN Jubilee Seminar and Nordic CLARIN Network Seminar, University of Helsinki, 9 Jun 2016
Semantic and pragmatic representations
of large text corpora
Agenda
● Digital humanities in Finland
● Strategic role of humanities and social sciences
● Research using text corpora
Digital humanities in Finland
● Research in humanities and social sciences is increasingly using digitally stored resources and computational analysis tools
Krister Lindén et al.
Varieng - Research Unit for the Study of Variation, Contacts and Change in English
Big Data, Rich Data, Uncharted Data, 19–22 October 2015, Helsinki, Finland
Terttu Nevalainen, Irma Taavitsainen, Tanja Säily et al.
http://www.helsinki.fi/varieng/
http://www.helsinki.fi/varieng/people/varieng_saily.html
Multilingual language technology
Jörg Tiedemann
Mathias Creutz et al.
Text mining historical newspapers
Mikko Tolonen
Kimmo Kettunen
Citizen Mindscapes
Analysis of large social media corpora in order to increase understanding of social and societal phenomena
Educational efforts: e.g. Digital Humanities Hackathon
In many such research efforts and educational activities, FIN-CLARIN serves as an essential resource and infrastructure.
Let's celebrate and have a moment of applause
http://375humanistia.helsinki.fi/en/humanists/kimmo-koskenniemi
Complexity associated with different areas of science
Biological phenomena
Physical phenomena
Cultural phenomena
Importance of humanities and social sciences
● As surprising as it may at first sound, one can claim that the humanities and social sciences are the most important ones
● These disciplines deal with topics like language and communication, social conditions, historical developments, economy, etc.
● Due to the complexity, research in these areas is challenging; generalizations commonplace in physics are rarely possible
Understanding the phenomena
Theory and knowledge formation
Qualitative, Quantitative
Open data: corpora
Open methods
Computational resources
Lars Borin
Linguistics has been the first e-science
Challenges:
“Language is BIG”
“Human INTERPRETATION is inherently involved”
Importance of language:
“Language is involved in most relevant human activities”
Example:
Complexity of Finnish at the level of word forms
Kimmo Koskenniemi (2013): Johdatus kieliteknologiaan, sen merkitykseen ja sovelluksiin (Introduction to language technology, its significance and applications)
https://helda.helsinki.fi/bitstream/handle/10138/38503/kt-johd.pdf?sequence=1
> 6000 languages, many more dialects
Billions of people
blogs.state.gov
en.wikipedia.org
A large number of different cultures
en.wikipedia.org
A vast number of ways to relate language, concepts and the world to each other
Simulating processes of language emergence and communication
Language as a system
● Considering natural language as a signal and dynamic system at cognitive and social levels (also in its written form) rather than a symbolic and logical system
● Importance of embodiment (cf. e.g. Harnad) and embeddedness (cf. e.g. Edelman)
● Learning and pattern recognition processes are essential (as opposed to the theories presented e.g. by Chomsky, Fodor, Pinker); much of the learning is bound to be unsupervised
Complexity of language regarding different areas and levels
Structure: morphology and syntax
Meaning: semantics and pragmatics
What are the nature, granularity, type, metadata involved, etc. for different research purposes in different areas of linguistics and other areas of humanities and social sciences?
Need to harmonize, build shared terminologies, theories, frameworks, etc.
Need to model contextuality, ambiguity, vagueness, history-dependence, change, etc.
The same medium, language, is the object of study as well as the basis for theory formation, representing the ideas and resources, etc.
Philosophy of science is essential to understand what is going on...
Data-driven, inductive mode
Hypothesis-driven, deductive mode
An old research example:
Data-driven emergence of implicit word categories that match with human syntactic and semantic intuitions
Classical example: Learning meaning from context:
Maps of words in Grimm fairy tales
Honkela, Pulkki & Kohonen 1995
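As a rough illustration of what "learning meaning from context" can mean in practice, the following minimal sketch counts the neighbouring words of each word in a toy text and compares the resulting context vectors with cosine similarity. It is not the method of the cited work, which organized words on a self-organizing map; the toy sentences, the window size and the function names `context_vectors` and `cosine` are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens, window=1):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented toy text, standing in for a real corpus such as the fairy tales.
text = ("the king saw the queen . the queen saw the king . "
        "the wolf ate the goat . the goat ate the grass .")
tokens = text.split()
vecs = context_vectors(tokens)

# Words used in similar contexts end up with similar vectors.
sim_king_queen = cosine(vecs["king"], vecs["queen"])
sim_king_ate = cosine(vecs["king"], vecs["ate"])
print(sim_king_queen > sim_king_ate)
```

In this toy text the nouns "king" and "queen" share their contexts, so they come out closer to each other than to the verb "ate". On a real corpus, the vectors would be fed to a map or clustering step rather than compared pairwise.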
Research example:
Multimodally grounded models of meaning
Labeling movements: Associating high-dim. kinesthetic time series with linguistic labels
Förger & Honkela 2014
RUNNING
WALKING
LIMPING
JOGGING
Förger & Honkela 2014
Research example:
Tensor-based analysis of the subjective aspect of interpreting linguistic expressions
GICA: Grounded Intersubjective Concept Analysis
Honkela et al. 2012
Analysis of the word 'health'
Honkela et al. 2012
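One possible reading of a tensor-based analysis of subjective interpretation is a three-way array of subjects × items × contexts, in which each subject rates how strongly an item such as 'health' relates to each context. The sketch below builds such an array from invented toy values, flattens it into one row per item-subject pair, and measures how much subjects disagree on each item. The context labels, the ratings and the function names `unfold` and `disagreement` are hypothetical and are not taken from the cited study.

```python
from statistics import mean, pstdev

contexts = ["fitness", "medicine", "wellbeing"]  # hypothetical context labels

# ratings[subject][item] holds one score per context, in the order of `contexts`.
# All numbers are invented toy values, not data from the study.
ratings = {
    "subject_1": {"health": [0.9, 0.2, 0.1], "exercise": [0.8, 0.1, 0.1]},
    "subject_2": {"health": [0.2, 0.9, 0.3], "exercise": [0.9, 0.1, 0.2]},
    "subject_3": {"health": [0.3, 0.3, 0.9], "exercise": [0.8, 0.2, 0.1]},
}


def unfold(ratings):
    """Flatten the 3-way array into one row per (item, subject) pair."""
    return {
        (item, subject): scores
        for subject, items in ratings.items()
        for item, scores in items.items()
    }


def disagreement(ratings, item):
    """Mean over contexts of the standard deviation across subjects."""
    per_subject = [items[item] for items in ratings.values()]
    per_context = zip(*per_subject)  # regroup scores by context
    return mean(pstdev(scores) for scores in per_context)


rows = unfold(ratings)
spread = {item: disagreement(ratings, item) for item in ("health", "exercise")}
print(spread)
```

In these toy values the three subjects place 'health' in different contexts but agree on 'exercise', so 'health' shows the larger spread. The flattened rows could then be passed to any dimensionality-reduction or clustering method to visualize how interpretations differ between people.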
Ideas for building corpora
● Expansion of the contextual framework
● Enriching metadata
● Increasing multimodal data sources that associate linguistic data with other modalities
● Involving a large number of people in labeling data to model variation
● Collecting data in real world contexts