pig for natural language processing

47
Pig for Natural Language Processing Max Jakob

Upload: others

Post on 12-Sep-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pig for Natural Language Processing

Pig forNatural Language ProcessingMax Jakob

Page 2: Pig for Natural Language Processing

Agenda 1

2

3

Introduction (speaker, affiliation, project)

Named Entities

pignlproc

Page 3: Pig for Natural Language Processing

• MSc in Computational Linguistics

• Software Developer at Neofonie GmbH

• Dicode project

• Text mining

Speaker: Max Jakob

• Text mining

• Scalability

• Past year

• DBpedia extraction framework

• DBpedia Spotlight

3 © Neofonie GmbH

Page 4: Pig for Natural Language Processing

• Spin-off out of TU Berlin (1998)

• 160 employees in Berlin (head quarters) and Hamburg

• State-of-the-art Search, Online Portals, Mobile Apps

Neofonie

• R&D: 14 successfully finished research projects

• Current focus:

• Semantics

• Question Answering

• Recommendations

• Cloud Computing

4 © Neofonie GmbH

Page 5: Pig for Natural Language Processing

Dicode

• EU-funded project (FP7)

• Goal: ``Augment collaboration and decision making in data-intensive

and cognitively-complex settings.´´

• Build scalable services for data mining and collaboration

• Example Use Case: SocialSocialSocialSocial Media Media Media Media MonitoringMonitoringMonitoringMonitoring• Example Use Case: SocialSocialSocialSocial Media Media Media Media MonitoringMonitoringMonitoringMonitoring

• Sources: blogs, news, Twitter, etc.

• What are the key trends?

• How is my brand perceived on the web?

• Problem: ambiguity of names and concepts

• Need Named Entity Disambiguation

5 © Neofonie GmbH

Page 6: Pig for Natural Language Processing

Named Entity Recognition/Disambiguation

• By example:

• Input: plain text

• Output: text withWikipedia links

6 © Neofonie GmbH

Page 7: Pig for Natural Language Processing

Named Entity Recognition/Disambiguation

• By example:

• Input: plain text

• Output: text withWikipedia links

• 1. Find „interesting“ strings (recognition)

• SSSSurfaceurfaceurfaceurface formsformsformsforms

7 © Neofonie GmbH

Page 8: Pig for Natural Language Processing

Named Entity Recognition/Disambiguation

• By example:

• Input: plain text

• Output: text withWikipedia links

• 1. Find „interesting“ strings (recognition)

• SurfaceSurfaceSurfaceSurface formsformsformsforms

• 2. Choose appropriate Wikipedia page (disambiguation)

• EachWikipedia page represents an entityentityentityentity

• Every surface form can have multiple candidate entities for linking

8 © Neofonie GmbH

Page 9: Pig for Natural Language Processing

„Michael Jackson died in 2007.“

9 © Neofonie GmbH

Page 10: Pig for Natural Language Processing

„Michael Jackson died in 2007.“

• Named Entity Recognition

• Find surfacesurface formsforms

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] died in 2007.

10 © Neofonie GmbH

Page 11: Pig for Natural Language Processing

„Michael Jackson died in 2007.“

• Named Entity Recognition

• Find surfacesurface formsforms

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] died in 2007.

• Named Entity Disambiguation• Named Entity Disambiguation

• Choose entityentity from candidates

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson]

11 © Neofonie GmbH

Page 12: Pig for Natural Language Processing

„Michael Jackson died in 2007.“

• Named Entity Recognition

• Find surfacesurface formsforms

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] died in 2007.

• Named Entity Disambiguation• Named Entity Disambiguation

• Choose entityentity from candidates

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson]

• Michael Jackson (singer)

12 © Neofonie GmbH

Page 13: Pig for Natural Language Processing

„Michael Jackson died in 2007.“

• Named Entity Recognition

• Find surfacesurface formsforms

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] died in 2007.

• Named Entity Disambiguation• Named Entity Disambiguation

• Choose entityentity from candidates

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson]

• Michael Jackson (singer)

• Michael Jackson (writer)

13 © Neofonie GmbH

Page 14: Pig for Natural Language Processing

„Michael Jackson died in 2007.“

• Named Entity Recognition

• Find surfacesurface formsforms

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] died in 2007.

• Named Entity Disambiguation• Named Entity Disambiguation

• Choose entityentity from candidates

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson]

• Michael Jackson (singer)

• Michael Jackson (writer)

• Context: died in 2007

14 © Neofonie GmbH

Page 15: Pig for Natural Language Processing

„Michael Jackson died in 2007.“

• Named Entity Recognition

• Find surfacesurface formsforms

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] died in 2007.

• Named Entity Disambiguation• Named Entity Disambiguation

• Choose entityentity from candidates

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson]

• Michael Jackson (singer)

• Michael Jackson (Michael Jackson (Michael Jackson (Michael Jackson (writerwriterwriterwriter))))

• Context: died in 2007

15 © Neofonie GmbH

Page 16: Pig for Natural Language Processing

„Michael Jackson died in 2007.“

• Named Entity Recognition

• Find surfacesurface formsforms

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] died in 2007.

• Named Entity Disambiguation• Named Entity Disambiguation

• Choose entityentity from candidates

• [Michael JacksonMichael JacksonMichael JacksonMichael Jackson]

• Michael Jackson (singer)

• Michael Jackson (Michael Jackson (Michael Jackson (Michael Jackson (writerwriterwriterwriter))))

• Context: died in 2007

• Context may not be distinctive

16 © Neofonie GmbH

Page 17: Pig for Natural Language Processing

Probabilities

• P( entity | surface form )

• Which entity is typically meant by a name?

• For example, given [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] (and ignoring the context), what are theprobabilities of the candidates?

• Michael Jackson (singer) 0.75Michael Jackson (singer) 0.75

• Michael Jackson (writer) 0.25

17 © Neofonie GmbH

Page 18: Pig for Natural Language Processing

Probabilities

• P( entity | surface form )

• Which entity is typically meant by a name?

• For example, given [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] (and ignoring the context), what are theprobabilities of the candidates?

• Michael Jackson (singer) 0.75Michael Jackson (singer) 0.75

• Michael Jackson (writer) 0.25

• Other useful probabilities:

• P( surface form | entity ), P( entity ), P( surface form )

18 © Neofonie GmbH

Page 19: Pig for Natural Language Processing

Probabilities

• P( entity | surface form )

• Which entity is typically meant by a name?

• For example, given [Michael JacksonMichael JacksonMichael JacksonMichael Jackson] (and ignoring the context), what are theprobabilities of the candidates?

• Michael Jackson (singer) 0.75Michael Jackson (singer) 0.75

• Michael Jackson (writer) 0.25

• Other useful probabilities:

• P( surface form | entity ), P( entity ), P( surface form )

• Estimate usingWikipedia page links

• In 1994 the beer journalist [[Michael Jackson (writer)|Michael Jackson]] described Webster's beers as "light" and "faintly oily".

19 © Neofonie GmbH

Page 20: Pig for Natural Language Processing

Previous method

• Sequential processing on one machine

• Process includes

• Parsing theWikipedia articles

• Resolving redirects

• Tokenizing and counting 3-grams*• Tokenizing and counting 3-grams*

• Aggregating counts

20 © Neofonie GmbH

* Word sequences with length <= 3

Page 21: Pig for Natural Language Processing

Previous method

• Sequential processing on one machine

• Process includes

• Parsing theWikipedia articles

• Resolving redirects

• Tokenizing and counting 3-grams*• Tokenizing and counting 3-grams*

• Aggregating counts

• Extremely long runtime: more than 1 week�

• Tedious update process

• Hard to improve pipeline

• New concepts do not have probabilities

• Going beyond 3-grams seems impossible

21 © Neofonie GmbH

*Word sequences with length <= 3

Page 22: Pig for Natural Language Processing

Apache Pig

• Framework for analyzing large datasets on top of Apache Hadoop

• High-level scripting language PigLatinPigLatinPigLatinPigLatin

• Data-flow language

• Think in tuples, bags and maps

• load, filter, join, group by, store, …• load, filter, join, group by, store, …

• Pig analyzes the PigLatin script and derives a MapReduce plan

• No need to dive deep into MapReduce

• High development productivity

• Automatic optimizations

• Simple interface for user defined functions (UDF)

22 © Neofonie GmbH

Page 23: Pig for Natural Language Processing

pignlproc

• Open source project started by Olivier Grisel

• Pig-Loader forWikipedia articles

• Several UDFs, e.g. to extract links

• Example scripts, e.g. to build a training corpus for Named Entity Recognition in OpenNLP formatOpenNLP format

• Our Extensions:

• Loader: Parse Wikipedia page ID

• UDF: Resolve redirects

• UDF: N-gram generator

• PigLatin script for probability estimation

23 © Neofonie GmbH

Page 24: Pig for Natural Language Processing

Probability estimation

• P( entity | surface form ) =count( surface form, entity )

count( surface form )

24 © Neofonie GmbH

Page 25: Pig for Natural Language Processing

Probability estimation

• P( entity | surface form ) =

• count( Michael Jackson ) = 4

count( surface form, entity )

count( surface form )

25 © Neofonie GmbH

Page 26: Pig for Natural Language Processing

Probability estimation

• P( entity | surface form ) =

• count( Michael Jackson ) = 4

• count( Michael Jackson, Michael Jackson (singer) ) = 3

count( surface form, entity )

count( surface form )

• count( Michael Jackson, Michael Jackson (writer) ) = 1

26 © Neofonie GmbH

Page 27: Pig for Natural Language Processing

Probability estimation

• P( entity | surface form ) =

• count( Michael Jackson ) = 4

• count( Michael Jackson, Michael Jackson (singer) ) = 3

count( surface form, entity )

count( surface form )

• count( Michael Jackson, Michael Jackson (writer) ) = 1

• P( Michael Jackson (singer) | Michael Jackson) = ¾ = 0.75

• P( Michael Jackson (writer) | Michael Jackson) = ¼ = 0.25

27 © Neofonie GmbH

Page 28: Pig for Natural Language Processing

Probability estimation

• P( entity | surface form ) =

• count( Michael Jackson ) = 4

• count( Michael Jackson, Michael Jackson (singer) ) = 3

count( surface form, entity )

count( surface form )

• count( Michael Jackson, Michael Jackson (writer) ) = 1

• P( Michael Jackson (singer) | Michael Jackson) = ¾ = 0.75

• P( Michael Jackson (writer) | Michael Jackson) = ¼ = 0.25

• Check the project web for estimation of other probabilities

28 © Neofonie GmbH

Page 29: Pig for Natural Language Processing

PigLatin code example (1)

parsed = LOADLOADLOADLOAD 'enwiki-20111207-pages-articles.xml'‚

USINGUSINGUSINGUSING pignlproc.storage.ParsingWikipediaLoader('en')

ASASASAS (title, id, pageUrl, text, redirect, links, headers, paragraphs);

… more … more … more … more pignlprocpignlprocpignlprocpignlproc magic …magic …magic …magic …… more … more … more … more pignlprocpignlprocpignlprocpignlproc magic …magic …magic …magic …

29 © Neofonie GmbH

Page 30: Pig for Natural Language Processing

PigLatin code example (1)

parsed = LOADLOADLOADLOAD 'enwiki-20111207-pages-articles.xml'‚

USINGUSINGUSINGUSING pignlproc.storage.ParsingWikipediaLoader('en')

ASASASAS (title, id, pageUrl, text, redirect, links, headers, paragraphs);

… more … more … more … more pignlprocpignlprocpignlprocpignlproc magic …magic …magic …magic …… more … more … more … more pignlprocpignlprocpignlprocpignlproc magic …magic …magic …magic …

DESCRIBEDESCRIBEDESCRIBEDESCRIBE pageLinks;

pageLinks: {

surfaceForm: chararray,

entity: chararray

}

30 © Neofonie GmbH

Page 31: Pig for Natural Language Processing

PigLatin code example (1)

parsed = LOADLOADLOADLOAD 'enwiki-20111207-pages-articles.xml'‚

USINGUSINGUSINGUSING pignlproc.storage.ParsingWikipediaLoader('en')

ASASASAS (title, id, pageUrl, text, redirect, links, headers, paragraphs);

… more … more … more … more pignlprocpignlprocpignlprocpignlproc magic …magic …magic …magic …… more … more … more … more pignlprocpignlprocpignlprocpignlproc magic …magic …magic …magic …

DESCRIBEDESCRIBEDESCRIBEDESCRIBE pageLinks;

pageLinks: {

surfaceForm: chararray,

entity: chararray

}

31 © Neofonie GmbH

• Bag of tuples, e.g. {

(Micheal Jackson, Michael Jackson (singer)),

(Micheal Jackson, Michael Jackson (singer)),

(Micheal Jackson, Michael Jackson (writer)),

(King of Pop, Michael Jackson (singer)), … }

Page 32: Pig for Natural Language Processing

PigLatin code example (2)

groupedBySurfaceForms = GROUP GROUP GROUP GROUP pageLinks BYBYBYBY surfaceForm;

32 © Neofonie GmbH

Page 33: Pig for Natural Language Processing

PigLatin code example (2)

groupedBySurfaceForms = GROUP GROUP GROUP GROUP pageLinks BYBYBYBY surfaceForm;

surfaceFormCounts = FOREACHFOREACHFOREACHFOREACH groupedBySurfaceForms GENERATEGENERATEGENERATEGENERATE

group ASASASAS surfaceForm,

COUNT(pageLinks) ASASASAS surfaceFormCount;

33 © Neofonie GmbH

Page 34: Pig for Natural Language Processing

PigLatin code example (2)

groupedBySurfaceForms = GROUP GROUP GROUP GROUP pageLinks BYBYBYBY surfaceForm;

surfaceFormCounts = FOREACHFOREACHFOREACHFOREACH groupedBySurfaceForms GENERATEGENERATEGENERATEGENERATE

group ASASASAS surfaceForm,

COUNT(pageLinks) ASASASAS surfaceFormCount;

DESCRIBEDESCRIBEDESCRIBEDESCRIBE surfaceFormCounts ;

surfaceFormCounts : {

surfaceForm : chararray,

surfaceFormCount: long

}

34 © Neofonie GmbH

Page 35: Pig for Natural Language Processing

PigLatin code example (2)

groupedBySurfaceForms = GROUP GROUP GROUP GROUP pageLinks BYBYBYBY surfaceForm;

surfaceFormCounts = FOREACHFOREACHFOREACHFOREACH groupedBySurfaceForms GENERATEGENERATEGENERATEGENERATE

group ASASASAS surfaceForm,

COUNT(pageLinks) ASASASAS surfaceFormCount;

DESCRIBEDESCRIBEDESCRIBEDESCRIBE surfaceFormCounts ;

surfaceFormCounts : {

surfaceForm : chararray,

surfaceFormCount: long

}

35 © Neofonie GmbH

• Bag of tuples, e.g. {

(Micheal Jackson, 4),

(King of Pop, 1),

… }

Page 36: Pig for Natural Language Processing

PigLatin code example (3)

groupedByPairs = GROUP GROUP GROUP GROUP pageLinks BYBYBYBY (surfaceForm, entity);

36 © Neofonie GmbH

Page 37: Pig for Natural Language Processing

PigLatin code example (3)

groupedByPairs = GROUP GROUP GROUP GROUP pageLinks BYBYBYBY (surfaceForm, entity);

pairCounts = FOREACHFOREACHFOREACHFOREACH groupedByPairs GENERATEGENERATEGENERATEGENERATE

group ASASASAS pair,

COUNT(pageLinks) ASASASAS pairCount;

37 © Neofonie GmbH

Page 38: Pig for Natural Language Processing

PigLatin code example (3)

groupedByPairs = GROUP GROUP GROUP GROUP pageLinks BYBYBYBY (surfaceForm, entity);

pairCounts = FOREACHFOREACHFOREACHFOREACH groupedByPairs GENERATEGENERATEGENERATEGENERATE

group ASASASAS pair,

COUNT(pageLinks) ASASASAS pairCount;

DESCRIBEDESCRIBEDESCRIBEDESCRIBE pairCounts ;

pairCounts : {

pair: (chararray, chararray),

pairCount: long

}

38 © Neofonie GmbH

Page 39: Pig for Natural Language Processing

PigLatin code example (3)

groupedByPairs = GROUP GROUP GROUP GROUP pageLinks BYBYBYBY (surfaceForm, entity);

pairCounts = FOREACHFOREACHFOREACHFOREACH groupedByPairs GENERATEGENERATEGENERATEGENERATE

group ASASASAS pair,

COUNT(pageLinks) ASASASAS pairCount;

DESCRIBEDESCRIBEDESCRIBEDESCRIBE pairCounts ;

pairCounts : {

pair: (chararray, chararray),

pairCount: long

}

39 © Neofonie GmbH

• Bag of tuples, e.g. {

((Micheal Jackson, Michael Jackson (singer)), 3),

((Micheal Jackson, Michael Jackson (writer)), 1),

((King of Pop, Michael Jackson (singer)), 1), … }

Page 40: Pig for Natural Language Processing

PigLatin code example (4)

joined = JOINJOINJOINJOIN

surfaceFormCounts BYBYBYBY surfaceForm,

pairCounts BYBYBYBY pageLinks::surfaceForm;

40 © Neofonie GmbH

Page 41: Pig for Natural Language Processing

PigLatin code example (4)

joined = JOINJOINJOINJOIN

surfaceFormCounts BYBYBYBY surfaceForm,

pairCounts BYBYBYBY pageLinks::surfaceForm;

probEntityGivenSf = FOREACHFOREACHFOREACHFOREACH joined GENERATEGENERATEGENERATEGENERATE

surfaceForm,

pairCount/surfaceFormCount,

pairUri;

41 © Neofonie GmbH

Page 42: Pig for Natural Language Processing

PigLatin code example (4)

joined = JOINJOINJOINJOIN

surfaceFormCounts BYBYBYBY surfaceForm,

pairCounts BYBYBYBY pageLinks::surfaceForm;

probEntityGivenSf = FOREACHFOREACHFOREACHFOREACH joined GENERATEGENERATEGENERATEGENERATE

surfaceForm,

pairCount/surfaceFormCount,

pairUri;

42 © Neofonie GmbH

• Bag of tuples, e.g. {

(Micheal Jackson, ¾, Michael Jackson (singer)),

(Micheal Jackson, ¼, Michael Jackson (writer)),

(Jacko, 1, Michael Jackson (singer)), … }

Page 43: Pig for Natural Language Processing

Runtimes for probability estimation

• Runtime single-threaded

• 3-grams: > 1 1 1 1 weekweekweekweek

43 © Neofonie GmbH

Page 44: Pig for Natural Language Processing

Runtimes for probability estimation

• Runtime single-threaded

• 3-grams: > 1 1 1 1 weekweekweekweek

• Hadoop Cluster

• 3 nodes, 2 hexacores each, hyper-threaded

44 © Neofonie GmbH

Page 45: Pig for Natural Language Processing

Runtimes for probability estimation

• Runtime single-threaded

• 3-grams: > 1 1 1 1 weekweekweekweek

• Hadoop Cluster

• 3 nodes, 2 hexacores each, hyper-threaded

• Runtime pignlproc

• 3-grams: 1.5 1.5 1.5 1.5 hourshourshourshours

45 © Neofonie GmbH

Page 46: Pig for Natural Language Processing

Runtimes for probability estimation

• Runtime single-threaded

• 3-grams: > 1 1 1 1 weekweekweekweek

• Hadoop Cluster

• 3 nodes, 2 hexacores each, hyper-threaded

• Runtime pignlproc

• 3-grams: 1.5 1.5 1.5 1.5 hourshourshourshours

• 5-grams: 2.5 hours

• „Karl Theodor zu Guttenberg“, „Ursula von der Leyen“

46 © Neofonie GmbH

Page 47: Pig for Natural Language Processing

DicodeDicodeDicodeDicode

PigPigPigPig

pignlprocpignlprocpignlprocpignlproc

DBpediaDBpediaDBpediaDBpedia SpotlightSpotlightSpotlightSpotlight

Jobs Jobs Jobs Jobs atatatat NeofonieNeofonieNeofonieNeofonie

http://dicode-project.eu

http://pig.apache.org

https://github.com/dicode-project/pignlproc

http://spotlight.dbpedia.org

http://www.neofonie.de/karriere/jobs

Neofonie GmbHRobert-Koch-Platz 410115 BerlinGermany

T +49.30 24627 -F +49.30 24627 120

www.neofonie.de

Max JakobMax JakobMax JakobMax JakobSoftware Developer

290

[email protected]