Large Scale Processing of Unstructured Text
![Page 1: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/1.jpg)
Large Scale Processing of Text
Suneel Marthi
DataWorks Summit 2017, San Jose, California
@suneelmarthi
![Page 2: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/2.jpg)
$WhoAmI
● Principal Software Engineer in the Office of Technology, Red Hat
● Member of Apache Software Foundation
● Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams
![Page 3: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/3.jpg)
What is a Natural Language?
![Page 4: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/4.jpg)
What is a Natural Language?
A natural language is any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation. (From Wikipedia)
![Page 5: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/5.jpg)
What is NOT a Natural Language?
![Page 6: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/6.jpg)
Characteristics of Natural Language
Unstructured
Ambiguous
Complex
Hidden semantics
Ironic
Informal
Unpredictable
Rich
Most updated
Noisy
Hard to search
![Page 7: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/7.jpg)
But it holds most of human knowledge
![Page 10: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/10.jpg)
As information overload grows ever worse, computers may
become our only hope for handling a growing deluge of
documents.
MIT Press - May 12, 2017
![Page 11: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/11.jpg)
What is Natural Language Processing?
NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions
between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully
process large natural language corpora. (From Wikipedia)
![Page 12: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/12.jpg)
???
![Page 13: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/13.jpg)
![Page 14: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/14.jpg)
How?
![Page 15: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/15.jpg)
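By solving small problems each time
A pipeline in which one type of ambiguity is resolved at each step, incrementally.
Sentence Detector
Mr. Robert talk is today at room num. 7. Let's go?
❌ [Diagram: 4 sentence boundaries marked under the text]
✅ [Diagram: 2 sentence boundaries marked under the text]
Tokenizer
Mr. Robert talk is today at room num. 7. Let's go?
❌ [Diagram: token boundaries marked under the text, with too many splits]
✅ [Diagram: the correct token boundaries marked under the text]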
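The sentence detector example above can be sketched in plain Java. This is only an illustration of the ambiguity: it is rule-based, whereas OpenNLP's sentence detector is a trained model. The class name `SentenceSplitSketch` and the abbreviation list are assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative rule-based splitter; not how OpenNLP's trained sentence detector works.
class SentenceSplitSketch {

    // Hypothetical abbreviation list, chosen only to cover the example sentence.
    static final Set<String> ABBREVIATIONS = new HashSet<>(Arrays.asList("Mr.", "num."));

    // Naive rule: every word ending in '.', '?' or '!' closes a sentence.
    static List<String> naiveSplit(String text) {
        return split(text, false);
    }

    // Same rule, except that known abbreviations do not close a sentence.
    static List<String> abbreviationAwareSplit(String text) {
        return split(text, true);
    }

    private static List<String> split(String text, boolean skipAbbreviations) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
            boolean endsWithTerminator = word.matches(".*[.?!]");
            boolean isAbbreviation = skipAbbreviations && ABBREVIATIONS.contains(word);
            if (endsWithTerminator && !isAbbreviation) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without a terminator
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Mr. Robert talk is today at room num. 7. Let's go?";
        for (String s : naiveSplit(text)) {
            System.out.println("naive: " + s);
        }
        for (String s : abbreviationAwareSplit(text)) {
            System.out.println("aware: " + s);
        }
    }
}
```

The naive rule yields four fragments for the example text, while the abbreviation-aware rule yields two sentences. A fixed abbreviation list does not generalize, which is one reason to prefer a trainable detector.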
![Page 16: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/16.jpg)
By solving small problems each time
Each step of a pipeline solves one ambiguity problem.
Name Finder
<Person>Washington</Person> was the first president of the USA.
<Place>Washington</Place> is a state in the Pacific Northwest region of the USA.
POS Tagger
Laura  Keene  brushed  by  him  with  the  glass  of  water  .
NNP    NNP    VBD      IN  PRP  IN    DT   NN     IN  NN     .
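A name finder result can be thought of as typed spans over the token array. The sketch below shows one way to turn such spans back into the inline markup used in the example above. `TypedSpan` and `annotate` are hypothetical names for this illustration and are not OpenNLP classes; the half-open `[start, end)` token range is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

class NameSpanSketch {

    // Hypothetical stand-in for a name finder result: tokens [start, end) carry a type.
    static final class TypedSpan {
        final int start;   // index of the first token in the span (inclusive)
        final int end;     // index just past the last token (exclusive)
        final String type; // entity type, e.g. "Person" or "Place"

        TypedSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Wraps the tokens of each span in <Type>...</Type>.
    // Spans are expected to be sorted by start and non-overlapping.
    static String annotate(String[] tokens, List<TypedSpan> spans) {
        StringBuilder out = new StringBuilder();
        int next = 0; // index of the next span to open or close
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            if (next < spans.size() && spans.get(next).start == i) {
                out.append('<').append(spans.get(next).type).append('>');
            }
            out.append(tokens[i]);
            if (next < spans.size() && spans.get(next).end == i + 1) {
                out.append("</").append(spans.get(next).type).append('>');
                next++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Washington", "was", "the", "first", "president", "of", "the", "USA", "."};
        System.out.println(annotate(tokens, Arrays.asList(new TypedSpan(0, 1, "Person"))));
    }
}
```

Keeping spans separate from the tokens means the same token array can be passed to several pipeline steps without re-parsing markup.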
![Page 17: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/17.jpg)
By solving small problems each time
A pipeline can be long and resolve many ambiguities.
Lemmatizer
He  is  better  than  many  others
He  be  good    than  many  other
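One simple way to picture lemmatization is a dictionary lookup keyed by the word and its part-of-speech tag, which is why the lemmatizer comes after the POS tagger in a pipeline. The sketch below is a minimal illustration with a made-up three-entry dictionary; the class name, the key format, and the POS tags assigned to the example sentence are assumptions, and a real lemmatizer would use a full dictionary or a trained model.

```java
import java.util.HashMap;
import java.util.Map;

class LemmaLookupSketch {

    // Hypothetical mini-dictionary: "word/TAG" -> lemma, covering only the example sentence.
    static final Map<String, String> DICTIONARY = new HashMap<>();
    static {
        DICTIONARY.put("is/VBZ", "be");
        DICTIONARY.put("better/JJR", "good");
        DICTIONARY.put("others/NNS", "other");
    }

    // Looks up each (token, tag) pair; tokens without an entry are returned unchanged.
    static String[] lemmatize(String[] tokens, String[] tags) {
        String[] lemmas = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            String key = tokens[i].toLowerCase() + "/" + tags[i];
            lemmas[i] = DICTIONARY.getOrDefault(key, tokens[i]);
        }
        return lemmas;
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "is", "better", "than", "many", "others"};
        String[] tags = {"PRP", "VBZ", "JJR", "IN", "JJ", "NNS"}; // assumed tags for illustration
        System.out.println(String.join(" ", lemmatize(tokens, tags)));
    }
}
```

Including the tag in the key lets the same surface form map to different lemmas depending on its role in the sentence.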
![Page 18: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/18.jpg)
Apache OpenNLP
![Page 19: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/19.jpg)
Apache OpenNLP
Mature project (> 10 years)
Actively developed
Machine learning
Java
Easy to train
Highly customizable
Fast
Language Detector (soon)
Sentence detector
Tokenizer
Part of Speech Tagger
Lemmatizer
Chunker
Parser
....
![Page 20: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/20.jpg)
Training Models for English
Corpus - OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19)
bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-ner-ontonotes.bin
bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin
![Page 21: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/21.jpg)
Training Models for Portuguese
Corpus - Amazonia (http://www.linguateca.pt/floresta/corpus.html)
bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin -detokenizer lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1
bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin -encoding ISO-8859-1 -includeFeatures false
bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin -encoding ISO-8859-1
bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin -encoding ISO-8859-1
![Page 22: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/22.jpg)
Name Finder API - Detect Names
NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(
    OpenNLPMain.class.getResource("/opennlp-models/por-ner.bin")));

for (String[][] document : documents) {
  for (String[] sentence : document) {
    Span[] nameSpans = nameFinder.find(sentence);
    // do something with the names
  }
  // forget document-level context before moving on to the next document
  nameFinder.clearAdaptiveData();
}
![Page 23: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/23.jpg)
Name Finder API - Train a model
// Recent OpenNLP releases read training data through an InputStreamFactory
ObjectStream<String> lineStream = new PlainTextByLineStream(
    new MarkableFileInputStreamFactory(new File("en-ner-person.train")), StandardCharsets.UTF_8);

TokenNameFinderModel model;
try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream)) {
  model = NameFinderME.train("en", "person", sampleStream,
      TrainingParameters.defaultParams(), new TokenNameFinderFactory());
}

model.serialize(modelFile);
![Page 24: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/24.jpg)
Name Finder API - Evaluate a model
TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME(model));
evaluator.evaluate(sampleStream);
FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());
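The F-measure reported by an evaluator combines precision and recall. The sketch below works through the standard F1 arithmetic on made-up counts; it is a plain-Java illustration and not OpenNLP's `FMeasure` implementation, and the class and method names are assumptions.

```java
class FMeasureSketch {

    // Fraction of predicted names that are correct.
    static double precision(int truePositives, int predicted) {
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    // Fraction of reference (gold) names that were found.
    static double recall(int truePositives, int reference) {
        return reference == 0 ? 0.0 : (double) truePositives / reference;
    }

    // Harmonic mean of precision and recall (F1).
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Made-up counts: 10 names predicted, 8 of them correct, 16 names in the reference data.
        double p = precision(8, 10);
        double r = recall(8, 16);
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n", p, r, f1(p, r));
    }
}
```

Because F1 is a harmonic mean, a model with high precision but low recall (or the reverse) scores closer to the lower of the two values.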
![Page 25: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/25.jpg)
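Name Finder API - Cross Evaluate a model
// Read one sentence per line, then parse each line into a NameSample
ObjectStream<String> lineStream = new PlainTextByLineStream(
    new MarkableFileInputStreamFactory(new File("en-ner-person.train")), StandardCharsets.UTF_8);
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

// The available constructors differ between OpenNLP releases; arguments kept as in the original
TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator("en", 100, 5);

// Cross validation with 10 folds
evaluator.evaluate(sampleStream, 10);

FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());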
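Cross validation repeatedly holds out one part of the data for testing and trains on the rest. The sketch below shows one simple way to partition sample indices into k folds; it illustrates the idea only and makes no claim about how OpenNLP partitions its samples. The class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class KFoldSketch {

    // For each fold, the indices held out for testing. Sample i goes to fold i % k.
    static List<List<Integer>> testFolds(int sampleCount, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < sampleCount; i++) {
            folds.get(i % k).add(i);
        }
        return folds;
    }

    // The indices used for training when the given fold is held out.
    static List<Integer> trainingIndices(int sampleCount, int k, int fold) {
        List<Integer> training = new ArrayList<>();
        for (int i = 0; i < sampleCount; i++) {
            if (i % k != fold) {
                training.add(i);
            }
        }
        return training;
    }

    public static void main(String[] args) {
        int samples = 10;
        int k = 5;
        List<List<Integer>> folds = testFolds(samples, k);
        for (int f = 0; f < k; f++) {
            System.out.println("fold " + f + ": test=" + folds.get(f)
                    + " train=" + trainingIndices(samples, k, f));
        }
    }
}
```

Every sample is used for testing exactly once, so the averaged score is less dependent on one particular train/test split.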
![Page 26: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/26.jpg)
Language Detector
Sentence Detector
Tokenizer
POS Tagger
Lemmatizer
Name Finder
Chunker
Language 1
Language 2
Language N
Index...
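The diagram above routes each document to a language-specific pipeline before indexing. A minimal plain-Java sketch of that routing follows. The hint-word detector and the whitespace tokenizers are hypothetical stand-ins for trained models, and all names in the sketch are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class LanguageRoutingSketch {

    // Hypothetical hint words; a real system would use a trained language detector.
    static final Set<String> PORTUGUESE_HINTS = new HashSet<>(Arrays.asList("de", "que", "uma"));

    // Hypothetical per-language pipelines; each one is a whitespace tokenizer stand-in.
    static final Map<String, Function<String, String[]>> PIPELINES = new HashMap<>();
    static {
        PIPELINES.put("eng", text -> text.trim().split("\\s+"));
        PIPELINES.put("por", text -> text.trim().split("\\s+"));
    }

    // Returns "por" if any hint word occurs, otherwise "eng".
    static String detectLanguage(String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            if (PORTUGUESE_HINTS.contains(word)) {
                return "por";
            }
        }
        return "eng";
    }

    // Routes each document to the pipeline for its detected language and
    // collects the results per language, as a stand-in for the index.
    static Map<String, List<String[]>> process(List<String> documents) {
        Map<String, List<String[]>> index = new LinkedHashMap<>();
        for (String document : documents) {
            String language = detectLanguage(document);
            Function<String, String[]> pipeline = PIPELINES.get(language);
            if (pipeline == null) {
                continue; // no pipeline registered for this language
            }
            index.computeIfAbsent(language, key -> new ArrayList<>()).add(pipeline.apply(document));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String[]>> index = process(Arrays.asList("Let's go", "uma casa de pedra"));
        for (Map.Entry<String, List<String[]>> entry : index.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue().size() + " document(s)");
        }
    }
}
```

Keeping the pipelines in a map keyed by language code means adding a language only requires registering one more entry.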
![Page 27: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/27.jpg)
Apache Flink
![Page 28: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/28.jpg)
Apache Flink
Mature project - 320+ contributors, > 11K commits
Very Active project on Github
Java/Scala
Streaming first
Fault-Tolerant
Scalable - to 1000s of nodes and more
High Throughput, Low Latency
![Page 29: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/29.jpg)
Apache Flink - Pos Tagger and NER
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> portugueseText = env.readTextFile(
    OpenNLPMain.class.getResource("/input/por_newscrawl.txt").getFile());

DataStream<String> engText = env.readTextFile(
    OpenNLPMain.class.getResource("/input/eng_news.txt").getFile());

// Merge both sources into a single stream
DataStream<String> mergedStream = engText.union(portugueseText);

// Route each record by its detected language.
// Note: the conversion of String records into Tuple2<String, String> is not shown here.
SplitStream<Tuple2<String, String>> langStream = mergedStream.split(new LanguageSelector());
![Page 30: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/30.jpg)
Apache Flink - Pos Tagger and NER
DataStream<Tuple2<String, String>> porNewsArticles = langStream.select("por");

DataStream<Tuple2<String, String[]>> porNewsTokenized =
    porNewsArticles.map(new PorTokenizerMapFunction());

DataStream<POSSample> porNewsPOS = porNewsTokenized.map(new PorPOSTaggerMapFunction());

DataStream<NameSample> porNewsNamedEntities = porNewsTokenized.map(new PorNameFinderMapFunction());
![Page 31: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/31.jpg)
Apache Flink - Pos Tagger and NER
private static class LanguageSelector implements OutputSelector<Tuple2<String, String>> {
  @Override
  public Iterable<String> select(Tuple2<String, String> s) {
    // The detected language code becomes the name of the output to route to
    List<String> list = new ArrayList<>();
    list.add(languageDetectorME.predictLanguage(s.f1).getLang());
    return list;
  }
}

private static class PorTokenizerMapFunction
    implements MapFunction<Tuple2<String, String>, Tuple2<String, String[]>> {
  @Override
  public Tuple2<String, String[]> map(Tuple2<String, String> s) {
    return new Tuple2<>(s.f0, porTokenizer.tokenize(s.f0));
  }
}
![Page 32: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/32.jpg)
Apache Flink - Pos Tagger and NER
private static class PorPOSTaggerMapFunction
    implements MapFunction<Tuple2<String, String[]>, POSSample> {
  @Override
  public POSSample map(Tuple2<String, String[]> s) {
    String[] tags = porPosTagger.tag(s.f1);
    return new POSSample(s.f0, s.f1, tags);
  }
}

private static class PorNameFinderMapFunction
    implements MapFunction<Tuple2<String, String[]>, NameSample> {
  @Override
  public NameSample map(Tuple2<String, String[]> s) {
    // Use the Portuguese name finder for the Portuguese stream
    Span[] names = porNameFinder.find(s.f1);
    return new NameSample(s.f0, s.f1, names, null, true);
  }
}
![Page 33: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/33.jpg)
What’s Coming ??
![Page 34: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/34.jpg)
What’s Coming ??
● DL4J: Mature project: 114 contributors, ~8k commits
● Modular: Tensor library, reinforcement learning, ETL, ...
● Focused on integrating with the JVM ecosystem while supporting the state of the art, such as GPUs on large clusters
● Implements most neural nets you’d need for language
● Named Entity Recognition using DL4J with LSTMs
● Language Detection using DL4J with LSTMs
● Possible: Translation using Bidirectional LSTMs with embeddings
● Computation graph architecture for more advanced use cases
![Page 35: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/35.jpg)
Credits
Joern Kottmann - PMC Chair, Apache OpenNLP
Tommaso Teofili - PMC, Apache Lucene, Apache OpenNLP
William Colen - Head of Technology, Stilingue - Inteligência Artificial, Sao Paulo, Brazil; PMC, Apache OpenNLP
Till Rohrmann - Engineering Lead, Data Artisans, Berlin, Germany; Committer and PMC, Apache Flink
Fabian Hueske - Data Artisans; Committer and PMC, Apache Flink
![Page 36: Large Scale Processing of Unstructured Text](https://reader031.vdocument.in/reader031/viewer/2022030318/5a67b7a17f8b9a360c8b6f11/html5/thumbnails/36.jpg)
Questions ???