nederlab laboratory for research on the patterns of change in the dutch language and culture...

Nederlab: Laboratory for research on the patterns of change in the Dutch language and culture. E-Humanities Group Research Meeting, May 16th, 2013, Meertens Institute, Amsterdam

Upload: david-hensley

Post on 27-Dec-2015


Page 1: Nederlab Laboratory for research on the patterns of change in the Dutch language and culture E-Humanities Group Research Meeting, May 16 th, 2013 Meertens

Nederlab

Laboratory for research on the patterns of change in the Dutch

language and culture

E-Humanities Group Research Meeting, May 16th, 2013

Meertens Institute, Amsterdam

Page 2:

A bit of history

• The CLARIN EU project (2008-2011) intended to provide an answer to the digital challenge set out by the EU:

– How to bring together large amounts of data from all over Europe along with the necessary tools to process them?

This was followed by a number of national CLARIN projects (CLARIN-NL, D-SPIN…) tackling these challenges at a national level

Page 3:

A bit of history (cont)

• The CatchPlus project turns scientific research results into usable tools and services for the entire Dutch heritage sector.

– This software leads to better disclosure and greater accessibility of collections from heritage institutions.

Page 4:

A bit of history (cont)

• It brought us:

– PID services
– ‘Concept’ registries
– Flexible metadata formats (CMDI)
– Standard publication protocols (OAI-PMH)
– Web authentication methods (SAML 2)
– And a lot of tools and data sets at the national level
  • (Anyone remember the CLARIN-NL call 1-4 projects?)
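As an illustration of the OAI-PMH item above, here is a minimal sketch of how a harvester might build a `ListRecords` request. The `verb`, `metadataPrefix` and `set` parameters are part of the protocol; the base URL and the `cmdi` prefix name are assumptions for this example, since each repository publishes its own endpoint and advertises its own prefixes.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: a real repository publishes its own base URL.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# "cmdi" is an assumed prefix name; a repository lists the prefixes it
# supports in its response to the ListMetadataFormats verb.
url = list_records_url(BASE_URL, "cmdi")
print(url)
```

The sketch only constructs the URL; fetching it and following resumption tokens is left out.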

Page 5:

Scenario characterized mainly by accidental and temporary interactions

Scenario where dedicated service centres of a new type interact in a stable way and give persistent and easy-to-use services to the community. Researchers must be able to rely on the services offered.

CLARIN center network

Page 6:

Source: Riding the Wave: How Europe can gain from the rising tide of scientific data. Report of the High Level Expert Group on Scientific Data

Page 7:

Arguments for Nederlab

• Bridge the gap between community support services and user community/data providers

• 7 points of criticism on digitisation, from NRC Handelsblad (science section), September 10 and 11, 2011:

– Digitisation of older texts is going wrong
– A lot of money is wasted

Page 8:

7 points

1. All the money for digitisation has to come from a single fund; the funding body is to impose quality requirements.

2. Funds are only provided if both the digitisation and the metadata meet the (international) standards. This is the only way that sub collections can eventually be combined.

3. Linking money and quality. Text quality varies greatly, from corrected OCR to messy, uncorrected OCR.

4. Scientists, researchers and other users have to be more closely involved with the development of large websites. Better cooperation with users.

5. Central register which shows what has already been digitised, as much work is unnecessarily repeated. Money is only offered to those institutions who first investigate what has already been done.

6. Central register has to be accessible to the public. This way, people can donate books which they would otherwise throw away, and which now can be cut up. This saves a lot of time when digitising.

7. A national plan should be drawn up to professionally digitise the most important sources within 10 years, at the lowest possible cost.

Page 9:

Hypothesis

• The hypothesis is that changes in language and culture – both of which express human cognition – are related to each other and that they are based on identical or comparable regularities. By means of Nederlab we want to uncover these regularities. Research into those regularities will show which parts of the Dutch language and culture are subject to change, and which remain constant.

Page 10:

Hypothesis

• Nature versus nurture debate

Page 11:

Some research questions

• Detecting new concepts, words and combinations of words.

• Concept history: What is meant by ‘burgerschap’ (‘citizenship’)?

• Systematically mapping linguistic changes; for example deflexion.

• Determining patterns and motives; How are the nobility, the clergy, etc., described, and with which motives are these ‘groups’ associated?

Page 12:

Some research questions

• Detecting similarities in texts: Who is citing whom?

• When were terse phrases, idioms and expressions coined and how were they taken over by authors and by different text genres?

• In which text genre was a certain metaphor first used?

• Author recognition. Who was the author of a certain text?
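One simple way to approach the question of who cites whom is to look for the longest run of words that two texts share. The sketch below uses Python's standard `difflib.SequenceMatcher` for this; it is an illustration under simple assumptions (whitespace tokenisation, exact matches only), not a description of the tooling Nederlab uses, and the sample sentences are invented.

```python
from difflib import SequenceMatcher


def longest_shared_passage(tokens_a, tokens_b):
    """Return the longest run of consecutive tokens found in both texts."""
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    match = matcher.find_longest_match(0, len(tokens_a), 0, len(tokens_b))
    return tokens_a[match.a:match.a + match.size]


# Invented sample texts, tokenised on whitespace.
text_a = "the quick brown fox jumps".split()
text_b = "a quick brown fox sleeps".split()

shared = longest_shared_passage(text_a, text_b)
print(" ".join(shared))
```

On historical material, spelling variation would have to be normalised first, otherwise exact matching misses most reuse.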

Page 13:

(Some) Challenges for Nederlab

1. Usability
2. Handling large amounts of data from various sources and of varying quality
3. Handling the editorial process
4. Dealing with diachronic (processing) issues
5. Integrating technologies from different technology providers
6. Integrating technologies that contribute towards answering research questions
7. Identifying gaps

Page 14:

De Gids

DBNL has mass digitized all volumes of ‘De Gids’. Not only are their contents now accessible, but so are the contributions by individual authors.

– How did the number of contributions by female authors progress over the years?
– How did the average age vary over the years?
– Where do the authors come from?
– The percentage of poetry/prose over the years?
– What are the ‘new’ words occurring over the years?
– Which frequently used terms are used over the years?
  • How do these change?
– Which words are used in one period, but not in another?
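The last question, which words occur in one period but not in another, can be sketched as a set difference between period vocabularies. The example below is a minimal illustration: the `(year, text)` records, the placeholder words and the period boundaries are all invented, and real data would need proper tokenisation and spelling normalisation.

```python
def period_vocabulary(docs, start, end):
    """Collect the lower-cased tokens of all documents dated start..end (inclusive)."""
    vocab = set()
    for year, text in docs:
        if start <= year <= end:
            vocab.update(text.lower().split())
    return vocab


# Invented (year, text) records with placeholder words.
docs = [
    (1840, "alpha beta"),
    (1850, "beta gamma"),
    (1900, "gamma delta"),
]

early = period_vocabulary(docs, 1837, 1860)
late = period_vocabulary(docs, 1890, 1910)

only_early = early - late  # words that dropped out
only_late = late - early   # words that are 'new' in the later period
print(sorted(only_early), sorted(only_late))
```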

Page 15:

Dutch language innovations

• The second research pilot concerns the hypothesis that in the 19th century innovations in the Dutch language started in the Dutch overseas territories: in Indonesia, Surinam, and the Dutch Antilles. This hypothesis is supported by the fact that in this period relatively large contingents of bilingual speakers were living in the Dutch colonies for the first time, which is an important condition for language innovation. The hypothesis will be tested (by Sjef Barbiers and Nicoline van der Sijs) by comparing texts printed in the Netherlands and overseas.
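One way such a comparison could be operationalised is to record, per word, the earliest year it is attested in each region and check where it appears first. The sketch below assumes each text carries a year and a place-of-printing label; the region labels, records and placeholder words are invented for illustration and do not describe the researchers' actual method.

```python
def first_attestation(docs):
    """Map (word, region) to the earliest year in which the word occurs."""
    first = {}
    for year, region, text in docs:
        for word in set(text.lower().split()):
            key = (word, region)
            if key not in first or year < first[key]:
                first[key] = year
    return first


def leads_overseas(first, word, home="NL", overseas="overseas"):
    """True if the word is attested overseas before (or without) a home attestation."""
    overseas_year = first.get((word, overseas))
    home_year = first.get((word, home))
    if overseas_year is None:
        return False
    return home_year is None or overseas_year < home_year


# Invented records: (year, region label, text) with placeholder words.
docs = [
    (1820, "overseas", "w1 w2"),
    (1835, "NL", "w1"),
    (1810, "NL", "w2"),
]

first = first_attestation(docs)
print(leads_overseas(first, "w1"), leads_overseas(first, "w2"))
```

A first attestation only reflects what has been digitised, so gaps in the corpus can distort the result.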

Page 16:

[Diagram: Nederlab processing pipeline]

• Sources: KB (DIDL, ALTO); DBNL (metadata, data)

• Processing steps: Extract articles; Convert to FoLiA (FoLiA XML); POS tagging (Frog) + cleanup (TICCL); N-gram generation (http://software.ticc.uvt.nl/tel-0.1.tar.gz)

• Indexing: Index metadata (Nederlab metadata index, SOLR); Index n-grams (n-gram indices, SOLR); Index POS tags (BlackLab POS indices, Lucene)
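To make the n-gram generation step concrete, here is a minimal sketch of counting word n-grams in a tokenised text. It illustrates the general technique only; it is not the implementation of the tool linked in the diagram, and the sample sentence is invented.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list yields each n-gram as a tuple.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Invented sample, tokenised on whitespace.
tokens = "de gids is de gids".split()
counts = ngram_counts(tokens, 2)
print(counts.most_common(1))
```

Counts like these are what an n-gram index would store per text or per period.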

Page 17:

Page 18:

Thank you

[email protected]