Pig for Natural Language Processing
Max Jakob
Agenda
1. Introduction (speaker, affiliation, project)
2. Named Entities
3. pignlproc
Speaker: Max Jakob
• MSc in Computational Linguistics
• Software Developer at Neofonie GmbH
• Dicode project
• Text mining
• Scalability
• Past year
• DBpedia extraction framework
• DBpedia Spotlight
3 © Neofonie GmbH
Neofonie
• Spin-off of TU Berlin (1998)
• 160 employees in Berlin (headquarters) and Hamburg
• State-of-the-art search, online portals, mobile apps
• R&D: 14 successfully finished research projects
• Current focus:
• Semantics
• Question Answering
• Recommendations
• Cloud Computing
Dicode
• EU-funded project (FP7)
• Goal: “Augment collaboration and decision making in data-intensive and cognitively-complex settings.”
• Build scalable services for data mining and collaboration
• Example use case: Social Media Monitoring
• Sources: blogs, news, Twitter, etc.
• What are the key trends?
• How is my brand perceived on the web?
• Problem: ambiguity of names and concepts
• Need Named Entity Disambiguation
Named Entity Recognition/Disambiguation
• By example:
• Input: plain text
• Output: text with Wikipedia links
• 1. Find “interesting” strings (recognition)
• Surface forms
• 2. Choose appropriate Wikipedia page (disambiguation)
• Each Wikipedia page represents an entity
• Every surface form can have multiple candidate entities for linking
“Michael Jackson died in 2007.”
• Named Entity Recognition
• Find surface forms
• [Michael Jackson] died in 2007.
• Named Entity Disambiguation
• Choose entity from candidates
• [Michael Jackson]
• Michael Jackson (singer)
• Michael Jackson (writer)
• Context: died in 2007
• Context may not be distinctive
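The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CANDIDATES` dictionary and the `recognize` function are hypothetical names, recognition is done by plain substring lookup, and the disambiguation step is left out.

```python
# Hypothetical toy dictionary (assumption): surface form -> candidate entities.
CANDIDATES = {
    "Michael Jackson": ["Michael Jackson (singer)", "Michael Jackson (writer)"],
    "King of Pop": ["Michael Jackson (singer)"],
}


def recognize(text, candidates=CANDIDATES):
    """Return (offset, surface form, candidate entities) for every known surface form in text."""
    spots = []
    for surface_form, entities in candidates.items():
        start = text.find(surface_form)
        while start != -1:
            spots.append((start, surface_form, entities))
            start = text.find(surface_form, start + 1)
    return sorted(spots)


spots = recognize("Michael Jackson died in 2007.")
for offset, surface_form, entities in spots:
    print(offset, surface_form, entities)
```

Recognition yields one surface form with two candidates; choosing between them is the disambiguation step, which needs the probabilities and the context.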
Probabilities
• P( entity | surface form )
• Which entity is typically meant by a name?
• For example, given [Michael Jackson] (and ignoring the context), what are the probabilities of the candidates?
• Michael Jackson (singer) 0.75
• Michael Jackson (writer) 0.25
• Other useful probabilities:
• P( surface form | entity ), P( entity ), P( surface form )
• Estimate using Wikipedia page links
• In 1994 the beer journalist [[Michael Jackson (writer)|Michael Jackson]] described Webster's beers as "light" and "faintly oily".
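A page link in wiki markup carries both pieces of information: the link target is the entity and the link label is the surface form. The following Python sketch extracts such pairs with a regular expression. It is an assumption-laden simplification (the names `WIKI_LINK` and `extract_page_links` are hypothetical, and templates, nested links, and namespaces are ignored), not the parser used by pignlproc.

```python
import re

# Matches [[target]], [[target|label]] and [[target#section|label]] (simplified).
WIKI_LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def extract_page_links(wikitext):
    """Return a list of (surface form, entity) pairs found in the wiki markup."""
    pairs = []
    for match in WIKI_LINK.finditer(wikitext):
        entity = match.group(1).strip()
        label = match.group(2)
        # Without an explicit label, the link target itself is the surface form.
        surface_form = label.strip() if label else entity
        pairs.append((surface_form, entity))
    return pairs


text = ("In 1994 the beer journalist "
        "[[Michael Jackson (writer)|Michael Jackson]] described Webster's beers.")
pairs = extract_page_links(text)
print(pairs)
```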
Previous method
• Sequential processing on one machine
• Process includes
• Parsing the Wikipedia articles
• Resolving redirects
• Tokenizing and counting 3-grams*
• Aggregating counts
• Extremely long runtime: more than 1 week
• Tedious update process
• Hard to improve pipeline
• New concepts do not have probabilities
• Going beyond 3-grams seems impossible
* Word sequences with length <= 3
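To make the footnote concrete, here is a small Python sketch of an n-gram generator that yields all word sequences up to a maximum length. The function name `ngrams` and the whitespace tokenization are assumptions for illustration; the original pipeline may tokenize differently.

```python
def ngrams(tokens, max_n=3):
    """Yield every contiguous word sequence of length 1..max_n."""
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            if start + n > len(tokens):
                break
            yield " ".join(tokens[start:start + n])


# Whitespace tokenization is a simplifying assumption.
tokens = "Karl Theodor zu Guttenberg".split()
up_to_3 = list(ngrams(tokens, max_n=3))
up_to_5 = list(ngrams(tokens, max_n=5))

print("Karl Theodor zu Guttenberg" in up_to_3)
print("Karl Theodor zu Guttenberg" in up_to_5)
```

A four-word name never appears as a candidate surface form when sequences are capped at three words, which is why going beyond 3-grams matters.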
Apache Pig
• Framework for analyzing large datasets on top of Apache Hadoop
• High-level scripting language PigLatin
• Data-flow language
• Think in tuples, bags and maps
• load, filter, join, group by, store, …
• Pig analyzes the PigLatin script and derives a MapReduce plan
• No need to dive deep into MapReduce
• High development productivity
• Automatic optimizations
• Simple interface for user defined functions (UDF)
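As a rough mental model of what a "group by" followed by a count turns into, the Python sketch below spells out the map, shuffle, and reduce phases by hand. It is a conceptual illustration only, with hypothetical function names and toy records; it does not reproduce the plan Pig actually generates.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit (key, 1) for every (surface form, entity) record."""
    for surface_form, _entity in records:
        yield surface_form, 1


def shuffle(pairs):
    """Shuffle: bring all values with the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Toy records (assumption), mirroring the deck's running example.
records = [
    ("Michael Jackson", "Michael Jackson (singer)"),
    ("Michael Jackson", "Michael Jackson (singer)"),
    ("Michael Jackson", "Michael Jackson (singer)"),
    ("Michael Jackson", "Michael Jackson (writer)"),
    ("King of Pop", "Michael Jackson (singer)"),
]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

In PigLatin the same computation is a GROUP followed by a FOREACH with COUNT, and the three phases stay hidden.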
pignlproc
• Open source project started by Olivier Grisel
• Pig loader for Wikipedia articles
• Several UDFs, e.g. to extract links
• Example scripts, e.g. to build a training corpus for Named Entity Recognition in OpenNLP format
• Our extensions:
• Loader: Parse Wikipedia page ID
• UDF: Resolve redirects
• UDF: N-gram generator
• PigLatin script for probability estimation
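The idea behind resolving redirects can be sketched as follows: a link target that is itself a redirect page is replaced by the page it finally points to. This Python sketch is not the pignlproc UDF; the function name, the `max_hops` limit, and the redirect table are hypothetical.

```python
def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirect chains until a non-redirect page, a cycle, or the hop limit is reached."""
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


# Hypothetical redirect table (assumption): redirect page -> target page.
redirects = {
    "Jacko": "King of Pop",
    "King of Pop": "Michael Jackson (singer)",
}

print(resolve_redirect("Jacko", redirects))
print(resolve_redirect("Michael Jackson (writer)", redirects))
```

Resolving redirects before counting makes sure that links to different redirect pages are counted for the same entity.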
Probability estimation
• P( entity | surface form ) = count( surface form, entity ) / count( surface form )
• count( Michael Jackson ) = 4
• count( Michael Jackson, Michael Jackson (singer) ) = 3
• count( Michael Jackson, Michael Jackson (writer) ) = 1
• P( Michael Jackson (singer) | Michael Jackson ) = ¾ = 0.75
• P( Michael Jackson (writer) | Michael Jackson ) = ¼ = 0.25
• See the project website for the estimation of the other probabilities
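The estimate is a ratio of link counts, which a few lines of Python can reproduce on the toy numbers above. The variable and function names are hypothetical. The formula for P( surface form | entity ) is an assumption built by analogy (same pair count, divided by the entity count); the slides do not spell it out.

```python
from collections import Counter

# Toy link data matching the counts on the slide: 3 links to the singer, 1 to the writer.
page_links = [
    ("Michael Jackson", "Michael Jackson (singer)"),
    ("Michael Jackson", "Michael Jackson (singer)"),
    ("Michael Jackson", "Michael Jackson (singer)"),
    ("Michael Jackson", "Michael Jackson (writer)"),
]

surface_form_counts = Counter(sf for sf, _ in page_links)
entity_counts = Counter(entity for _, entity in page_links)
pair_counts = Counter(page_links)


def p_entity_given_surface_form(entity, surface_form):
    """count(surface form, entity) / count(surface form)"""
    total = surface_form_counts[surface_form]
    return pair_counts[(surface_form, entity)] / total if total else 0.0


def p_surface_form_given_entity(surface_form, entity):
    """Assumed analogue: count(surface form, entity) / count(entity)"""
    total = entity_counts[entity]
    return pair_counts[(surface_form, entity)] / total if total else 0.0


print(p_entity_given_surface_form("Michael Jackson (singer)", "Michael Jackson"))
print(p_entity_given_surface_form("Michael Jackson (writer)", "Michael Jackson"))
```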
PigLatin code example (1)
parsed = LOAD 'enwiki-20111207-pages-articles.xml'
    USING pignlproc.storage.ParsingWikipediaLoader('en')
    AS (title, id, pageUrl, text, redirect, links, headers, paragraphs);
… more pignlproc magic …
DESCRIBE pageLinks;
pageLinks: {
    surfaceForm: chararray,
    entity: chararray
}
• Bag of tuples, e.g. {
    (Michael Jackson, Michael Jackson (singer)),
    (Michael Jackson, Michael Jackson (singer)),
    (Michael Jackson, Michael Jackson (writer)),
    (King of Pop, Michael Jackson (singer)), … }
PigLatin code example (2)
groupedBySurfaceForms = GROUP pageLinks BY surfaceForm;
surfaceFormCounts = FOREACH groupedBySurfaceForms GENERATE
    group AS surfaceForm,
    COUNT(pageLinks) AS surfaceFormCount;
DESCRIBE surfaceFormCounts;
surfaceFormCounts: {
    surfaceForm: chararray,
    surfaceFormCount: long
}
• Bag of tuples, e.g. {
    (Michael Jackson, 4),
    (King of Pop, 1),
    … }
PigLatin code example (3)
groupedByPairs = GROUP pageLinks BY (surfaceForm, entity);
pairCounts = FOREACH groupedByPairs GENERATE
    group AS pair,
    COUNT(pageLinks) AS pairCount;
DESCRIBE pairCounts;
pairCounts: {
    pair: (chararray, chararray),
    pairCount: long
}
• Bag of tuples, e.g. {
    ((Michael Jackson, Michael Jackson (singer)), 3),
    ((Michael Jackson, Michael Jackson (writer)), 1),
    ((King of Pop, Michael Jackson (singer)), 1), … }
PigLatin code example (4)
joined = JOIN
    surfaceFormCounts BY surfaceForm,
    pairCounts BY pair.$0;  -- first field of the pair tuple: the surface form
probEntityGivenSf = FOREACH joined GENERATE
    surfaceForm,
    -- cast to double: dividing two longs would truncate to 0
    (double) pairCount / (double) surfaceFormCount AS probability,
    pair.$1 AS entity;      -- second field of the pair tuple: the entity
• Bag of tuples, e.g. {
    (Michael Jackson, 0.75, Michael Jackson (singer)),
    (Michael Jackson, 0.25, Michael Jackson (writer)),
    (Jacko, 1.0, Michael Jackson (singer)), … }
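For readers less familiar with joins, the step can be mirrored in Python as a simple hash join between the two count relations, followed by the division. This is a sketch with hypothetical names and toy data, not a description of how Pig executes the join.

```python
# Toy relations (assumption), shaped like surfaceFormCounts and pairCounts.
surface_form_counts = [("Michael Jackson", 4), ("King of Pop", 1)]
pair_counts = [
    (("Michael Jackson", "Michael Jackson (singer)"), 3),
    (("Michael Jackson", "Michael Jackson (writer)"), 1),
    (("King of Pop", "Michael Jackson (singer)"), 1),
]


def join_and_divide(surface_form_counts, pair_counts):
    """Inner-join both relations on the surface form and compute pairCount / surfaceFormCount."""
    totals = dict(surface_form_counts)  # build side of the hash join
    result = []
    for (surface_form, entity), pair_count in pair_counts:
        if surface_form in totals:  # inner join: skip pairs without a total
            probability = pair_count / totals[surface_form]  # true division, not 3 // 4
            result.append((surface_form, probability, entity))
    return result


prob_entity_given_sf = join_and_divide(surface_form_counts, pair_counts)
for row in prob_entity_given_sf:
    print(row)
```

The division has to be a floating-point one. With integer division, as Pig performs on two long values, every probability below 1 would come out as 0.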
Runtimes for probability estimation
• Runtime single-threaded
• 3-grams: > 1 week
• Hadoop cluster
• 3 nodes, 2 hexacores each, hyper-threaded
• Runtime pignlproc
• 3-grams: 1.5 hours
• 5-grams: 2.5 hours
• “Karl Theodor zu Guttenberg”, “Ursula von der Leyen”
Dicode: http://dicode-project.eu
Pig: http://pig.apache.org
pignlproc: https://github.com/dicode-project/pignlproc
DBpedia Spotlight: http://spotlight.dbpedia.org
Jobs at Neofonie: http://www.neofonie.de/karriere/jobs
Neofonie GmbH, Robert-Koch-Platz 4, 10115 Berlin, Germany
T +49.30 24627 -290
F +49.30 24627 120
www.neofonie.de
Max Jakob, Software Developer