Preserving meanings in (multilingual) text mining for Cultural Heritage
Michel Généreux, David B. Arnold
University of Brighton, UK

•The extraction tool for English:

•Natural Language & Semantic Processing

•Experiment 1: triple extraction from the CIDOC-CRM documentation

•Experiment 2: triple extraction from free text

•Extension to other languages (French and German)

•Use of the tool in EPOCH

•(Cultural Heritage corpus creation)


Motivation for the extraction tool

• Extract and facilitate the mapping of (unstructured) textual information to CIDOC-CRM

• Assist more generic mapping tools (AMA)

• Develop a first prototype able to extract triples from English texts

• Evaluate the tool, using data from the CIDOC-CRM documentation

On the propositional nature of CIDOC-CRM

“The domain class is analogous to the grammatical subject of the phrase for which the property is analogous to the verb. Property names in the CRM are designed to be semantically meaningful and grammatically correct when read from domain to range. In addition, the inverse property name, normally given in parentheses, is also designed to be semantically meaningful and grammatically correct when read from range to domain.”

Definition of the CIDOC Conceptual Reference Model, June 2005, page vi

De facto standard model

CIDOC model



Mapping Tools

Assisting generic tools: EPOCH AMA

The idea is to represent in the simplest way the CIDOC model as well as the original model and to let the domain expert insert his/her knowledge in the mapping.

Text mining for CH




The creation of a CIDOC-CRM compatible database for CH is partially automated by extracting triples after analyzing free texts in natural language.

Text analysis

Triple extraction

Database creation

CRM ~ semantic representation

We have a painting of John Constable entitled “A country lane” and identified by “203-1888”. It is a watercolour which is 21.3 cm high and 18.3 cm wide. It was painted in England.


The problem

• “The capability of the CRM to describe so many formats with so few properties is due to the fact that most actions and events are not encoded as properties, but as paths with the event as node in the middle. So "Van Gogh painted this .." translates to two triples with a Production in the centre. I.e. hundreds of action verbs have to be recognized and mapped. If this is not understood no useful matching can be done.” (anonymous reviewer)
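The reviewer's point can be illustrated with a minimal Python sketch. It assumes a tiny hand-made verb lexicon (`VERB_TO_EVENT`, hypothetical) and uses the CRM names E12 Production, P14 carried out by and P108 has produced. The function name and the event identifier are illustrative only and do not describe the tool presented here.

```python
# Hypothetical lexicon: verb lemma -> (event class, actor property, object property).
VERB_TO_EVENT = {
    "paint": ("E12 Production", "P14 carried out by", "P108 has produced"),
    "draw": ("E12 Production", "P14 carried out by", "P108 has produced"),
}


def expand_action(subject, verb_lemma, obj, event_id="event1"):
    """Turn (subject, verb, object) into a two-triple path through an event node."""
    if verb_lemma not in VERB_TO_EVENT:
        return []  # unknown verb: no mapping is attempted
    event_class, actor_prop, object_prop = VERB_TO_EVENT[verb_lemma]
    event = f"{event_class} ({event_id})"
    return [
        (event, actor_prop, subject),  # who carried out the event
        (event, object_prop, obj),     # what the event produced
    ]


triples = expand_action("Van Gogh", "paint", "this painting")
for triple in triples:
    print(triple)
```

Every action verb needs its own lexicon entry, which is why the number of verbs to recognise grows quickly.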

Some terms

• semantic association: a measure of the co-occurrence of terms

• lemma: the base form of a word

• POS: part of speech

• clause: a coherent whole of POSs

• hypernym: a word more generic than a given word

• chunking: breaking sentences into clauses

• WordNet: a structured lexical database

POS tagging and Phrase chunking



Getting Semantics through Syntax

“CIDOC-CRM is great”

Sentence → NP VP

VP → V AdjP

be(CIDOC-CRM, great)


Getting Semantics directly through a Semantic Grammar

“CIDOC-CRM is great”

Assertion → Brand Action_Info

Action_Info → Action Info

be(CIDOC-CRM, great)


Getting Semantics from keywords and pattern-matching

<assertion> <CIDOC-CRM> “is” <info>

<opinion> “I think” <CIDOC-CRM> “is” <info>

Other noisy input is simply skipped.
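A minimal Python sketch of this keyword and pattern-matching idea follows. The regular expressions, the `PATTERNS` list and the `match_utterance` function are assumptions made for illustration. The more specific pattern is tried first, and any surrounding noise is ignored because the patterns are searched for rather than matched against the whole input.

```python
import re

# Hypothetical patterns: the more specific one (opinion) is tried first.
PATTERNS = [
    ("opinion", re.compile(r"\bI think (?P<topic>CIDOC-CRM) is (?P<info>\w+)", re.IGNORECASE)),
    ("assertion", re.compile(r"\b(?P<topic>CIDOC-CRM) is (?P<info>\w+)", re.IGNORECASE)),
]


def match_utterance(text):
    """Return (label, topic, info) for the first pattern found, else None."""
    for label, pattern in PATTERNS:
        found = pattern.search(text)  # search() skips noise before and after
        if found:
            return label, found.group("topic"), found.group("info")
    return None


print(match_utterance("Well, erm, CIDOC-CRM is great, you know"))
print(match_utterance("I think CIDOC-CRM is great"))
print(match_utterance("Hello there"))
```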

What about words not in WordNet (proper nouns)? Semantic Orientation from Association (SO-A)


SO-A in practice

We compute Association using pointwise mutual information (PMI), which is positive when words co-occur more often than chance would predict and negative otherwise.

We compute PMI in an information-retrieval setting (PMI-IR) using hit counts over a large corpus.

N is the total number of documents in the corpus, smoothing makes PMI-IR 0 for words not in the corpus, and NEAR means a distance of at most 20 words (Turney and Littman, 2003).
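As a rough illustration, PMI-IR can be sketched in Python from four hit counts. The sketch assumes the usual definition PMI(w1, w2) = log2(p(w1, w2) / (p(w1) p(w2))), with each probability estimated as hits / N. The function name, the smoothing constant `eps` and the handling of unseen words are assumptions, chosen so that a word absent from the corpus scores 0 as described above.

```python
import math


def pmi_ir(hits_near, hits_w1, hits_w2, n_docs, eps=0.01):
    """PMI estimated from document hit counts (illustrative sketch).

    hits_near: documents where w1 occurs NEAR w2
    hits_w1, hits_w2: documents containing each word
    n_docs: total number of documents in the corpus (N)
    """
    # Assumption: a word with no hits at all gets a score of 0.
    if hits_w1 == 0 or hits_w2 == 0:
        return 0.0
    # eps keeps the logarithm defined when the words never co-occur.
    return math.log2(n_docs * (hits_near + eps) / (hits_w1 * hits_w2))


print(pmi_ir(50, 100, 200, 10_000))  # frequent co-occurrence: positive
print(pmi_ir(0, 500, 500, 1_000))    # never co-occur: negative
print(pmi_ir(0, 0, 200, 10_000))     # word not in the corpus: 0.0
```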


1. Text cleaning

2. Tokenization and POS tagging

3. Clause chunking and pruning

4. NC regrouping

5. Intermediate triples (IT) creation

6. Referent resolution

7. Final triple (FT) creation
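Steps 5 to 7 can be sketched in Python as follows. This is a deliberately naive illustration, not the procedure used by the tool: the pronoun list, the triple format and the heuristic (a third-person pronoun subject is replaced by the object of the most recent clause whose subject was not such a pronoun) are all assumptions.

```python
# Third-person pronouns that the sketch tries to resolve (assumed list).
PRONOUNS = {"it", "he", "she", "they"}


def resolve_referents(intermediate_triples):
    """Replace pronoun subjects with the most recently introduced entity."""
    resolved = []
    antecedent = None
    for subj, pred, obj in intermediate_triples:
        if subj.lower() in PRONOUNS:
            if antecedent is not None:
                subj = antecedent
        else:
            # A clause with a full subject introduces a new entity as its object.
            antecedent = obj
        resolved.append((subj, pred, obj))
    return resolved


# Hand-written intermediate triples for the Constable example.
intermediate = [
    ("we", "have", "painting"),
    ("it", "be", "watercolour"),
    ("it", "paint_in", "England"),
]
final = resolve_referents(intermediate)
for triple in final:
    print(triple)
```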


Experiments (1)


•144 sentences

•184 final triples

•at least one final triple for 46 sentences

Experiments (2)

Lange Herzogstrasse is Wolfenbüttel's main shopping area. The street's particular charm lies in its broad-face half-timbered buildings, historic merchant's houses; their central gables still retain the distinctive hatches through which goods could be hoisted up to the attics for storage.


•A text of 3922 words and 173 sentences.

•197 intermediate triples and 79 final triples extracted.

A shallower approach for English

We have a painting of John Constable entitled “A country lane” and identified by “203-1888”. It is a watercolour which is 21.3 cm high and 18.3 cm wide. It was painted in England.



Shallow extraction (English)

We PP we

have VHP have

a DT a

painting NN painting

of IN of

John NP John

Constable NP Constable

entitled VVD entitle

" `` "

A DT a

country NN country

lane NN lane

" '' "

and CC and

identified VVD identify

by IN by

" `` "

203-1888 JJ @card@

" '' "

. SENT .

It PP it

is VBZ be

a DT a

watercolour NN

which WDT which

is VBZ be

21.3 CD @card@

centimetres NNS centimetre

high JJ high

and CC and

18.3 CD @card@

centimetres NNS centimetre

wide JJ wide

. SENT .

It PP it

was VBD be

painted VVN paint

in IN in

England NP England

. SENT .
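Tagger output in this three-column form (token, POS tag, lemma) is easy to process. The Python sketch below shows one assumed way to read it and to regroup consecutive proper nouns into names. The function names are hypothetical, and rows that do not have exactly three fields are simply skipped.

```python
def parse_tagged(text):
    """Read 'token POS lemma' lines into tuples; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows


def proper_names(rows, proper_tag="NP"):
    """Group runs of consecutive proper-noun tokens into single names."""
    names, current = [], []
    for token, pos, _lemma in rows:
        if pos == proper_tag:
            current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


sample = """John NP John
Constable NP Constable
entitled VVD entitle
It PP it
was VBD be
painted VVN paint
in IN in
England NP England
. SENT ."""

rows = parse_tagged(sample)
print(proper_names(rows))
```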

Shallow extraction (French)

Nous avons une toile de John Constable intitulée «Une route de campagne» et identifiée par «203-1888». C'est une aquarelle qui a une largeur de 21.3 centimètres et une hauteur de 18.3 centimètres. Elle fût peinte en Angleterre.

Nous PRO:PER nous

avons VER:pres avoir

une DET:ART un

toile NOM toile

de PRP de

John NAM John

Constable NOM constable

intitulée ADJ <unknown>

" PUN:cit "

Une DET:ART un

route NOM route

de PRP de

campagne NOM campagne

" PUN:cit "

et KON et

identifiée VER:subp <unknown>

par PRP par

" PUN:cit "

203-1888 ABR @card@

" PUN:cit "

est VER:pres être

une DET:ART un

aquarelle NOM aquarelle

qui PRO:REL qui

a VER:pres avoir

une DET:ART un

largeur NOM largeur

de PRP de

21.3 NUM @card@

centimètres NOM

et KON et

une DET:ART un

hauteur NOM hauteur

de PRP de

18.3 NUM @card@

centimètres NOM <unknown>

. SENT .

Elle PRO:PER elle

fût VER:futu <unknown>

peinte VER:pper peindre

en PRP en

Angleterre NAM Angleterre

. SENT .

Shallow extraction (German)

Wir haben einen Anstrich des John Constable erlauben „Einen Landweg” und durch „203-1888” gekennzeichnet werden. Es ist ein Aquarell, das 21.3 Zentimeter hoch und 18.3 Zentimeter breit ist. Es würde in England gemält.

Wir PPER wir

haben VAFIN haben

einen ART ein

Anstrich NN Anstrich

des ART d

John NE John

Constable NE <unknown>

erlauben VVFIN erlauben

" $( "

Einen NN Eine

Landweg NN Landweg

" $( "

und KON und

durch APPR durch

" $( "

203-1888 CARD @card@

" $( "

gekennzeichnet VVPP kennzeichnen

werden VAINF werden

. $. .

Es PPER es

ist VAFIN sein

ein ART ein

Aquarell NN Aquarell

, $, ,

das PRELS d

21.3 CARD @card@

Zentimeter NN Zentimeter

hoch ADJD hoch

und KON und

18.3 CARD @card@

Zentimeter NN Zentimeter

breit ADJD breit

ist VAFIN sein

. $. .

Es PPER es

würde VVFIN <unknown>

in APPR in

England NE England

gemält VVFIN <unknown>

. $. .
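The three tagged samples use different tag sets, so a language-independent extractor needs a common layer. A minimal Python sketch of such a layer follows. The language-specific tags are those visible in the listings above, while the coarse labels (`PROPN`, `NOUN`, `NUM`, `OTHER`) and the `normalise` function are illustrative assumptions.

```python
# Language-specific tags (from the listings) mapped to assumed coarse labels.
TAG_MAP = {
    "en": {"NP": "PROPN", "NN": "NOUN", "NNS": "NOUN", "CD": "NUM"},
    "fr": {"NAM": "PROPN", "NOM": "NOUN", "NUM": "NUM"},
    "de": {"NE": "PROPN", "NN": "NOUN", "CARD": "NUM"},
}


def normalise(lang, rows):
    """Rewrite (token, tag, lemma) rows with a coarse, shared tag."""
    mapping = TAG_MAP[lang]
    return [(token, mapping.get(pos, "OTHER"), lemma) for token, pos, lemma in rows]


german = [("England", "NE", "England"), ("21.3", "CARD", "@card@"), ("ist", "VAFIN", "sein")]
french = [("Angleterre", "NAM", "Angleterre"), ("aquarelle", "NOM", "aquarelle")]

print(normalise("de", german))
print(normalise("fr", french))
```

With the tags normalised, the same extraction rules can in principle be applied to each language, although tagging errors such as the `<unknown>` lemmas above still carry through.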


Question Answering for CH


Natural Language Interaction


Through Natural Language Processing (NLP) and Generation (NLG), users can interact with and query CIDOC-CRM databases. Resources are available for a wide range of languages (multilingualism). By combining the mining and interactive tools, language technology automates the structuring and querying of heterogeneous and semi-structured information within the framework of CIDOC-CRM.



Resources (1)


•Towards a Semantic Web for Heritage Resources


•Art & Architecture Thesaurus Online


•National Monuments Record Thesauri


•Networked Knowledge Organization Systems/Services


•AGROVOC Multilingual Thesaurus


•Controlled vocabulary for the applied life sciences


•National Library of Medicine:

•MDA Archaeological Objects Thesaurus:


Resources (2)


•HEREIN:

•MUSEUM:

•IKEM:

•Cultural Heritage:

•Michael:


•AMA tools:




Resources (3): Corpus and multiword term extraction

First seeds

Second seeds

Archaeological Sites

Image for Archeology

Mailing lists and discussion

North Carolina

Back Issues

archaeological research

Financial Corp

Summer Institute

NCPTT Partners

Ministry of Culture

via e-mail

University Art

Hire Date

Preschooler Programs

Next page

de los

plot keywords

natural resources

Quick Links

Sculpture Conservation Studio

Museum Shop

Dot Net Nuke

Cast overview

megalithic tombs

Archaeological Institute of America

landscape painting

Sound Mix

Mc Culley

Department of Animal

Historical Center

historic sites

Find surnames

Previous page

Golf Course

Genealogy CDs

Children's and Juvenile Books

list archives

Historic Materials

Nine Inch Nails

favorite player

Dennis Montagna

Disaster Recovery by the Book

Preservation Officers

historic properties

U Minh

non-profit organization

Web Notify

Internet Service

Southwest Parks

Arts Council

archeology nationwide

download protocol

Science Center

web site

site provides

African Studies

Mini Grants

Travel Guides

programming language

Landscape Restoration

click the Order button

Ancient Near

Grant Programs

Personal Profile

Development Center

Break w

Remembering the Way

Kuala Lumpur

Guns N Roses

National Monuments

materials conservation

Hidden Treasures

para o

Command History

Architectural History

writing skills

x cm

Preservation Pioneer

Powered by LISTSERV

Remembering Nelson Hall

Please send

Financial Group

Local History

Art Center

Grant Recipients

Site Map

Grand Staircase-Escalante National

Blue Book

Can Tho

Resource guides

Park Net Accessibility FOIA Privacy

Fine art

Contact Lorine

South Carolina

Up the Heat on Research

Road Safety

web pages


Total Transfers

Common Ground

100 random multiword terms out of 1567

Discussion

•Benchmark for evaluation: CIDOC-CRM human-annotated texts

•Over-generation: properties expressed with ‘be’ and ‘have’

•How do we combine triples to form paths? (see *)

•Given modest results for English, how do we extend the previous method to other languages?

*The capability of the CRM to describe so many formats with so few properties is due to the fact that most actions and events are not encoded as properties, but as paths with the event as node in the middle. So "Van Gogh painted this .." translates to two triples with a Production in the centre. I.e. hundreds of action verbs have to be recognized and mapped. If this is not understood no useful matching can be done (anonymous reviewer).


Thank you!
Questions ?


