trausan-matu: natural language processing and topic modelling

Post on 30-Dec-2016


Natural Language Processing and Topic Modelling

Ştefan Trăușan-Matu

University Politehnica of Bucharest

Romanian Academy Research Institute for Artificial Intelligence

COST Action IS1310 - Reassembling the Republic of Letters

23rd March, St Anne’s College, University of Oxford

Natural Language Processing (NLP)

• Input:

– Text in digital format (strings of characters)

• document

• corpus

• question

• transcription of a monologue or of a conversation

• instant messenger log

• discussion forum, social network

• corpus of interlinked documents (e.g. letters)

• dialog

Republic of Letters, University of Oxford, 23 March 2015

Natural Language Processing

• Output:

– Text(s) in digital format

• translation – e.g. Google translate

• document(s) summary - summarizers

• answer – question answering

• clusters of documents

– Automatically generated annotations

– List of topics in the text

– Links among topics

– Similar documents

– Links among documents – intertextuality

– Threads of discussion, COLLABORATION

– Other data

• collocations

• structures (syntactical, discourse, rhetorical, etc.)

• opinions in text

• participation & collaboration degrees in conversations

• …

NLP approaches

• Grammar-based

• Statistical (corpus-based, machine learning)

– unsupervised (clustering, LSA, LDA)

– supervised: annotated corpus → learned model → automated annotation
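The supervised route (annotated corpus, learned model, automated annotation) can be sketched with a tiny multinomial Naive Bayes classifier. This is only an illustration under stated assumptions: the training sentences, the labels `space` and `vehicle`, and the function names `train` and `annotate` are invented here and do not come from the talk.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated corpus: (text, label) pairs invented for illustration.
ANNOTATED = [
    ("the astronaut walked on the moon", "space"),
    ("the cosmonaut saw the moon", "space"),
    ("the truck followed the car", "vehicle"),
    ("a car and a truck on the road", "vehicle"),
]


def train(corpus):
    """Learn per-label word counts and label counts from the annotated corpus."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in corpus:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def annotate(text, model):
    """Return the most probable label (add-one smoothed multinomial Naive Bayes)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(ANNOTATED)  # the learned model
print(annotate("the cosmonaut and the astronaut", model))  # automated annotation
print(annotate("a truck on the road", model))
```

A real annotator would be trained on a much larger corpus and would typically label spans or tokens rather than whole sentences; the three-stage flow stays the same.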


Text annotation

• Space

• Time

• Named Entities

• Links

• Syntactic

• Semantic

• Pragmatic

• Discourse

• Rhetoric

• …

Text Annotators


Topic Modeling

• No generally accepted definition for a “topic” in NLP

– Document clusters

– Abstractions based on document clusters: labels, centroids, etc.

– (Word, Probability) pairs

• Bayesian statistical models

– Topics – distributions over words

– Documents – distributions over topics

– Generative model

• Topic intertwining

– Conceptually similar to the ideas of Mikhail Bakhtin

– Topics and voices
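The generative view (topics as distributions over words, documents as distributions over topics) can be made concrete with a small sampler. This is a minimal sketch, not an implementation of LDA: the topic names, the probabilities and the helper `generate_document` are all invented for illustration, and a real topic model runs the other way, inferring such distributions from observed text.

```python
import random

# Hypothetical topics: each topic is a distribution over words.
TOPICS = {
    "space": {"moon": 0.5, "astronaut": 0.3, "cosmonaut": 0.2},
    "vehicles": {"car": 0.6, "truck": 0.4},
}


def generate_document(topic_mixture, length, rng):
    """Generate words by first drawing a topic, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[t] for t in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words


rng = random.Random(0)  # fixed seed so repeated runs give the same sample
# A document is a distribution over topics (values invented for illustration).
doc_mixture = {"space": 0.8, "vehicles": 0.2}
document = generate_document(doc_mixture, length=10, rng=rng)
print(document)
```

Because each word is drawn independently, word order carries no information, which matches the bag-of-words assumption these models make.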


Topic Modeling (2)

• LSA/pLSA/hLDA/CTM

– Each newer version corrects some flaws of the earlier ones

• LDA

– Readily available

• Mallet

• Easily reproducible experiments


The LSA idea

• Reducing the dimensionality of the vector space, similarly to the least squares method

• The effect is the creation of semantic spaces containing semantically related words

• Bag-of-words approach

• http://lsa.colorado.edu
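The bag-of-words starting point for LSA can be sketched as a term-document count matrix. The documents, the stop list and the helper `term_document_matrix` below are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy documents, invented for illustration.
DOCS = {
    "d1": "the moon and the cosmonaut",
    "d2": "the astronaut and the moon",
    "d3": "the car and the truck",
}
STOPWORDS = {"the", "and"}  # assumed stop list


def term_document_matrix(docs):
    """Return (sorted terms, doc ids, rows of counts); word order is discarded."""
    bags = {
        doc_id: Counter(w for w in text.lower().split() if w not in STOPWORDS)
        for doc_id, text in docs.items()
    }
    terms = sorted({t for bag in bags.values() for t in bag})
    doc_ids = list(docs)
    matrix = [[bags[d][t] for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix


terms, doc_ids, matrix = term_document_matrix(DOCS)
for term, row in zip(terms, matrix):
    print(f"{term:10s}", row)
```

Each row is a term and each column a document; LSA then reduces the dimensionality of this matrix.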


LSA - Vector space model

Singular value decomposition (SVD)

$A_{t \times d} = T_{t \times n}\, S_{n \times n}\, (D_{d \times n})^{T}$, where $n = \min(t, d)$


A: terms-documents array (ex. from Manning and Schütze, 1999)

            d1  d2  d3  d4  d5  d6
cosmonaut    1   0   1   0   0   0
astronaut    0   1   0   0   0   0
moon         1   1   0   0   0   0
car          1   0   0   1   1   0
truck        0   0   0   1   0   1

T^T

       cosmonaut  astronaut   moon    car   truck
dim1     -0.44      -0.13    -0.48  -0.70  -0.26
dim2     -0.30      -0.33    -0.51   0.35   0.65
dim3      0.57      -0.59    -0.37   0.15  -0.41
dim4      0.58       0.00     0.00  -0.58   0.58
dim5      0.25       0.73    -0.61   0.16  -0.09

S

2.16  0.00  0.00  0.00  0.00
0.00  1.59  0.00  0.00  0.00
0.00  0.00  1.28  0.00  0.00
0.00  0.00  0.00  1.00  0.00
0.00  0.00  0.00  0.00  0.39

D^T

        d1     d2     d3     d4     d5     d6
dim1  -0.75  -0.28  -0.20  -0.45  -0.33  -0.12
dim2  -0.29  -0.53  -0.19   0.63   0.22   0.41
dim3   0.28  -0.75   0.45  -0.20   0.12  -0.33
dim4   0.00   0.00   0.58   0.00  -0.58   0.58
dim5  -0.53   0.29   0.63   0.19   0.41  -0.22
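As a sanity check on the decomposition, the three factors can be multiplied back together and compared with the terms-documents array. The sketch below uses the two-decimal values, with signs, of the Manning and Schütze example. The helper name `reconstruct` is an assumption for illustration, and plain Python lists are used so no numerical library is needed.

```python
# Factors of the example, rounded to two decimals.
T = [  # terms x dimensions
    [-0.44, -0.30,  0.57,  0.58,  0.25],  # cosmonaut
    [-0.13, -0.33, -0.59,  0.00,  0.73],  # astronaut
    [-0.48, -0.51, -0.37,  0.00, -0.61],  # moon
    [-0.70,  0.35,  0.15, -0.58,  0.16],  # car
    [-0.26,  0.65, -0.41,  0.58, -0.09],  # truck
]
S = [2.16, 1.59, 1.28, 1.00, 0.39]  # singular values (diagonal of S)
DT = [  # dimensions x documents d1..d6
    [-0.75, -0.28, -0.20, -0.45, -0.33, -0.12],
    [-0.29, -0.53, -0.19,  0.63,  0.22,  0.41],
    [ 0.28, -0.75,  0.45, -0.20,  0.12, -0.33],
    [ 0.00,  0.00,  0.58,  0.00, -0.58,  0.58],
    [-0.53,  0.29,  0.63,  0.19,  0.41, -0.22],
]
A = [  # terms x documents
    [1, 0, 1, 0, 0, 0],  # cosmonaut
    [0, 1, 0, 0, 0, 0],  # astronaut
    [1, 1, 0, 0, 0, 0],  # moon
    [1, 0, 0, 1, 1, 0],  # car
    [0, 0, 0, 1, 0, 1],  # truck
]


def reconstruct(T, S, DT):
    """Multiply T * diag(S) * DT entry by entry."""
    return [
        [sum(T[i][k] * S[k] * DT[k][d] for k in range(len(S)))
         for d in range(len(DT[0]))]
        for i in range(len(T))
    ]


product = reconstruct(T, S, DT)
max_error = max(
    abs(product[i][d] - A[i][d]) for i in range(len(A)) for d in range(len(A[0]))
)
print(f"max |A - T S D^T| = {max_error:.3f}")
```

The remaining error comes only from rounding the factors to two decimals.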

Reduced A

• SVD maps the n-dimensional space onto a k-dimensional one, with n >> k

• Common values for k are 100 and 150.

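The reduced matrix $\hat{A}$ minimizes $\|A - \hat{A}\|_2$ among matrices of rank k.

$B = S_{2 \times 2}\, D^{T}_{2 \times d}$

B

        d1     d2     d3     d4     d5     d6
dim1  -1.62  -0.60  -0.43  -0.97  -0.71  -0.26
dim2  -0.46  -0.84  -0.30   1.00   0.35   0.65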
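The effect of the reduction can be seen by comparing two documents before and after it. The sketch below builds B from the two largest singular values and the first two rows of D^T of the example, then compares d2 and d3, which share no terms. The helper names `column` and `cosine` are assumptions for illustration.

```python
import math

S2 = [2.16, 1.59]  # the two largest singular values
D2T = [  # first two rows of D^T: dimensions x documents d1..d6
    [-0.75, -0.28, -0.20, -0.45, -0.33, -0.12],
    [-0.29, -0.53, -0.19,  0.63,  0.22,  0.41],
]
A = [  # terms x documents
    [1, 0, 1, 0, 0, 0],  # cosmonaut
    [0, 1, 0, 0, 0, 0],  # astronaut
    [1, 1, 0, 0, 0, 0],  # moon
    [1, 0, 0, 1, 1, 0],  # car
    [0, 0, 0, 1, 0, 1],  # truck
]

# B = S_{2x2} D^T_{2xd}: scale each row of D^T by its singular value.
B = [[S2[k] * D2T[k][d] for d in range(6)] for k in range(2)]


def column(matrix, d):
    """Extract document d as a vector (one entry per row of the matrix)."""
    return [row[d] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms


# d2 contains astronaut and moon; d3 contains only cosmonaut.
original = cosine(column(A, 1), column(A, 2))
reduced = cosine(column(B, 1), column(B, 2))
print(f"d2 vs d3 in term space:    {original:.2f}")
print(f"d2 vs d3 in reduced space: {reduced:.2f}")
```

In term space the two documents are orthogonal, while in the two-dimensional space they come out as close neighbours, which is the sense in which LSA groups semantically related words and documents.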

LSA based text processing

The most significant 20 words from Plato

[Plato|TheApology,Justin|TheSecondApology-(0.6475);Plato|TheRepublic.7,Irenaeus|AgainstHeresies.6-(0.6095)]

The similarity of Plato’s works with the works of other writers


Latent Dirichlet Allocation


http://www.columbia.edu/~ih2240/dataviz/G4063-week5/images/text/LDA.png


Bakhtin’s Polyphonic Intertextuality

[Figure: Voices I, II and III in dialog; Texts 1, 2 and 3 in dialog in Text 4]


Polyphony

• Appears in music (e.g. J.S. Bach) and in novels (Bakhtin)

• The Polyphonic Model (Trausan-Matu, 2005, 2010)

• Analysis method (Trausan-Matu, Dascalu and Rebedea, 2010)

• Computer support tools for the polyphonic analysis of conversations and networks of documents

– The “Polyphony” system (Trausan-Matu et al., 2007)

– ASAP (Dascalu, Chioasca and Trausan-Matu, 2008)

– PolyCAFe (Trausan-Matu, Rebedea and Dascalu, 2011; Rebedea, Dascalu, Trausan-Matu et al., 2010)

– Collaboration regions detection (Banica, Trausan-Matu and Rebedea, 2011)

– Detection of the important moments (Chiru and Trausan-Matu, 2012)

– Intertextuality detection (Ghiban and Trausan-Matu, 2012)

– ReaderBench (Dascălu, Trăușan-Matu and Dessus, 2013)

Intertextuality analysis

• Mikhail Bakhtin’s dialogical and polyphonic model; intertextuality (Kristeva)

• Analyze how concepts are echoed from one text to another (intertextual networks)

• To indicate membership in a philosophical trend or influences among authors
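The idea of an intertextual network can be sketched by linking texts whose vocabularies overlap. This is a deliberately crude stand-in, not the polyphonic model or any of the tools named in the talk: the texts, the stop list, the Jaccard threshold and the helper names are all invented for illustration, and shared words are only a rough proxy for shared concepts.

```python
from itertools import combinations

# Hypothetical texts, invented for illustration.
TEXTS = {
    "text1": "virtue is knowledge and knowledge is the good",
    "text2": "the good is known through virtue",
    "text3": "the truck followed the car",
}
STOPWORDS = {"is", "and", "the", "through"}  # assumed stop list


def concepts(text):
    """Approximate a text's concepts by its set of non-stopword tokens."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def intertextual_links(texts, threshold=0.2):
    """Link two texts when the Jaccard overlap of their word sets reaches the threshold."""
    links = []
    for a, b in combinations(texts, 2):
        ca, cb = concepts(texts[a]), concepts(texts[b])
        overlap = len(ca & cb) / len(ca | cb)
        if overlap >= threshold:
            links.append((a, b, round(overlap, 2)))
    return links


print(intertextual_links(TEXTS))
```

A concept-level analysis would replace the raw word sets with representations from a semantic space such as LSA or LDA, so that different words for the same idea can still produce a link.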


Bakhtin’s Polyphonic Intertextuality

Theme 2 and Theme 3 may have the same words but different concepts.

Sections 1 and 6 are dialogical or polyphonic. They may present a higher force of expressivity.


PolyCAFe (Trăușan-Matu, Dascălu and Rebedea)

• Polyphony-based Collaboration Analysis and Feedback generation

• Developed in the “Language Technologies for Lifelong Learning” EU FP7 project (http://www.ltfll-project.org/)

• Analyses chat (instant messenger) logs with more than two participants using the polyphonic model (Trăușan-Matu)


From: Trăuşan-Matu, A Polyphonic Model for Interethnic Discourse, 2013

PolyCAFe

From: Trăuşan-Matu, A Polyphonic Model for Interethnic Discourse, 2013

ReaderBench (Dascalu, Trăușan-Matu and Dessus)

• Based on

– LSA, LDA

– Polyphonic model

– WordNet (http://wordnet.princeton.edu)

– Social Network Analysis

NLP Text pre-processing in PolyCAFe and ReaderBench


ReaderBench Document view


ReaderBench Concept View


Concept View


ReaderBench Corpus Similarity


ReaderBench Document Centrality


Thank you!

Questions?

stefan.trausan@cs.pub.ro

trausan@gmail.com

http://www.racai.ro/trausan
