large scale processing of unstructured text

Large Scale Processing of Text

Suneel Marthi

DataWorks Summit 2017, San Jose, California

@suneelmarthi

$WhoAmI

● Principal Software Engineer in the Office of Technology, Red Hat

● Member of Apache Software Foundation

● Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams

What is a Natural Language?

A natural language is any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation. (From Wikipedia)

What is NOT a Natural Language?

Characteristics of Natural Language

Unstructured

Ambiguous

Complex

Hidden semantics

Ironic

Informal

Unpredictable

Most updated

Hard to search

but it holds most of human knowledge

As information overload grows ever worse, computers may become our only hope for handling a growing deluge of documents.

MIT Press - May 12, 2017

What is Natural Language Processing?

NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. (From Wikipedia)

By solving small problems each time

A pipeline in which one type of ambiguity is resolved at each step, incrementally.

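Sentence Detector

Mr. Robert talk is today at room num. 7. Let's go?

❌ Mr. | Robert talk is today at room num. | 7. | Let's go? |

✅ Mr. Robert talk is today at room num. 7. | Let's go? |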
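The two segmentations above can be reproduced with a toy rule-based splitter. This is only a sketch of why the task is ambiguous, not how OpenNLP's trained sentence detector works; the class name and the abbreviation list are hypothetical and chosen to fit the example sentence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SentenceSplitSketch {
    // Hypothetical abbreviation list, for illustration only.
    static final Set<String> ABBREVIATIONS = new HashSet<>(Arrays.asList("Mr.", "num."));

    // Ends a sentence after every word that ends in '.', '?' or '!'.
    static List<String> naiveSplit(String text) {
        return split(text, false);
    }

    // Same rule, except that a word listed in ABBREVIATIONS never ends a sentence.
    static List<String> abbreviationAwareSplit(String text) {
        return split(text, true);
    }

    private static List<String> split(String text, boolean useAbbreviations) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
            boolean endsWithMark = word.endsWith(".") || word.endsWith("?") || word.endsWith("!");
            boolean isAbbreviation = useAbbreviations && ABBREVIATIONS.contains(word);
            if (endsWithMark && !isAbbreviation) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Mr. Robert talk is today at room num. 7. Let's go?";
        System.out.println(naiveSplit(text));             // 4 fragments
        System.out.println(abbreviationAwareSplit(text)); // 2 sentences
    }
}
```

A fixed abbreviation list breaks down quickly on real text, which is one reason a trained model is preferred for this step.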

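Tokenizer

Mr. Robert talk is today at room num. 7. Let's go?

❌ Mr | . | Robert | talk | is | today | at | room | num | . | 7 | . | Let | ' | s | go | ?

✅ Mr. | Robert | talk | is | today | at | room | num. | 7 | . | Let | 's | go | ?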
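The same contrast can be sketched for tokenization. The naive version splits off every punctuation character; the second keeps known abbreviations whole and treats a clitic such as 's as one token. Both are toy rules under stated assumptions (the abbreviation list and regular expressions are hypothetical), not OpenNLP's tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerSketch {
    // Hypothetical abbreviation list, for illustration only.
    static final Set<String> ABBREVIATIONS = new HashSet<>(Arrays.asList("Mr.", "num."));
    // A run of word characters, or any single punctuation character.
    static final Pattern NAIVE = Pattern.compile("\\w+|[^\\w\\s]");
    // As above, but an apostrophe followed by letters (a clitic like 's) stays together.
    static final Pattern WITH_CLITICS = Pattern.compile("'\\w+|\\w+|[^\\w\\s]");

    static List<String> naiveTokenize(String text) {
        return findAll(NAIVE, text);
    }

    static List<String> ruleTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (ABBREVIATIONS.contains(word)) {
                tokens.add(word); // keep "Mr." and "num." as single tokens
            } else {
                tokens.addAll(findAll(WITH_CLITICS, word));
            }
        }
        return tokens;
    }

    private static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Mr. Robert talk is today at room num. 7. Let's go?";
        System.out.println(naiveTokenize(text)); // 17 tokens
        System.out.println(ruleTokenize(text));  // 14 tokens
    }
}
```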

By solving small problems each time

Each step of a pipeline solves one ambiguity problem.

Name Finder

<Person>Washington</Person> was the first president of the USA.

<Place>Washington</Place> is a state in the Pacific Northwest region of the USA.

POS Tagger

Laura  Keene  brushed  by  him  with  the  glass  of  water  .
NNP    NNP    VBD      IN  PRP  IN    DT   NN     IN  NN     .

By solving small problems each time

A pipeline can be long and resolve many ambiguities.

Lemmatizer

He  is  better  than  many  others
He  be  good    than  many  other
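The lemmatizer example can be sketched as a dictionary lookup keyed on the token and its POS tag, which also shows why POS tagging sits earlier in the pipeline. The dictionary entries, tags and class name below are hypothetical and cover only this sentence; OpenNLP's lemmatizers likewise take tokens and tags as input but are backed by a full dictionary or a trained model.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class LemmatizerSketch {
    // Hypothetical mini-dictionary keyed by "token/POS-tag".
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("is/VBZ", "be");
        LEMMAS.put("better/JJR", "good");
        LEMMAS.put("others/NNS", "other");
    }

    // Returns one lemma per token; tokens without a dictionary entry are returned unchanged.
    static String[] lemmatize(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        String[] lemmas = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            String key = tokens[i].toLowerCase(Locale.ROOT) + "/" + tags[i];
            lemmas[i] = LEMMAS.getOrDefault(key, tokens[i]);
        }
        return lemmas;
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "is", "better", "than", "many", "others"};
        String[] tags = {"PRP", "VBZ", "JJR", "IN", "JJ", "NNS"};
        System.out.println(Arrays.toString(lemmatize(tokens, tags)));
        // prints [He, be, good, than, many, other]
    }
}
```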

Apache OpenNLP

Mature project (> 10 years)

Actively developed

Machine learning

Easy to train

Highly customizable

Language Detector (soon)

Sentence detector

Tokenizer

Part of Speech Tagger

Lemmatizer

Chunker

Parser

Training Models for English

Corpus - OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19)

bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-ner-ontonotes.bin

bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin

Training Models for Portuguese

Corpus - Amazonia (http://www.linguateca.pt/floresta/corpus.html)

bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin -detokenizer lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1

bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin -encoding ISO-8859-1 -includeFeatures false

bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin -encoding ISO-8859-1

bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin -encoding ISO-8859-1

Name Finder API - Detect Names

NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(
    OpenNLPMain.class.getResource("/opennlp-models/por-ner.bin")));

for (String[][] document : documents) {
    for (String[] sentence : document) {
        Span[] nameSpans = nameFinder.find(sentence);
        // do something with the names
    }
    // Reset document-level adaptive data before the next document.
    nameFinder.clearAdaptiveData();
}

Name Finder API - Train a model

// Read the training data line by line.
ObjectStream<String> lineStream = new PlainTextByLineStream(
    new MarkableFileInputStreamFactory(new File("en-ner-person.train")), StandardCharsets.UTF_8);

TokenNameFinderModel model;
try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream)) {
    model = NameFinderME.train("en", "person", sampleStream,
        TrainingParameters.defaultParams(), new TokenNameFinderFactory());
}

model.serialize(modelFile);
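The training file read above (en-ner-person.train) uses OpenNLP's name finder format: one whitespace-tokenized sentence per line, with entities wrapped in <START:type> ... <END> tags. A minimal sketch of how such a line maps to tokens and entity spans follows; the class names are hypothetical, and untyped <START> tags are given the placeholder type "default" here.

```java
import java.util.ArrayList;
import java.util.List;

class NameSampleFormatSketch {
    // Half-open token range [start, end) with an entity type (illustrative stand-in, not OpenNLP's Span).
    static final class EntitySpan {
        final int start;
        final int end;
        final String type;

        EntitySpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ") " + type;
        }
    }

    // Appends the plain tokens to tokensOut and returns the annotated entity spans.
    static List<EntitySpan> parse(String line, List<String> tokensOut) {
        List<EntitySpan> spans = new ArrayList<>();
        int start = -1;
        String type = null;
        for (String part : line.trim().split("\\s+")) {
            if (part.startsWith("<START") && part.endsWith(">")) {
                start = tokensOut.size();
                int colon = part.indexOf(':');
                type = colon >= 0 ? part.substring(colon + 1, part.length() - 1) : "default";
            } else if (part.equals("<END>")) {
                if (start < 0) {
                    throw new IllegalArgumentException("<END> without a matching <START>");
                }
                spans.add(new EntitySpan(start, tokensOut.size(), type));
                start = -1;
            } else {
                tokensOut.add(part);
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        List<EntitySpan> spans = parse(
            "<START:person> Pierre Vinken <END> , 61 years old , will join the board .", tokens);
        System.out.println(tokens);
        System.out.println(spans); // prints [[0..2) person]
    }
}
```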

Name Finder API - Evaluate a model

TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME(model));

evaluator.evaluate(sampleStream);

FMeasure result = evaluator.getFMeasure();

System.out.println(result.toString());
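The F-measure printed above combines precision and recall over predicted versus reference name spans. As a rough sketch of the arithmetic only (the span strings and class name are made up for illustration, and this is not OpenNLP's FMeasure class):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class FMeasureSketch {
    // Number of predicted spans that also appear in the reference annotation.
    static int truePositives(Set<String> reference, Set<String> predicted) {
        Set<String> common = new HashSet<>(predicted);
        common.retainAll(reference);
        return common.size();
    }

    // Fraction of predicted spans that are correct.
    static double precision(Set<String> reference, Set<String> predicted) {
        return predicted.isEmpty() ? 0.0 : (double) truePositives(reference, predicted) / predicted.size();
    }

    // Fraction of reference spans that were found.
    static double recall(Set<String> reference, Set<String> predicted) {
        return reference.isEmpty() ? 0.0 : (double) truePositives(reference, predicted) / reference.size();
    }

    // Harmonic mean of precision and recall.
    static double f1(Set<String> reference, Set<String> predicted) {
        double p = precision(reference, predicted);
        double r = recall(reference, predicted);
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Spans encoded as "start-end:type"; the values are invented for this example.
        Set<String> reference = new HashSet<>(Arrays.asList("0-2:person", "5-6:person", "9-11:person"));
        Set<String> predicted = new HashSet<>(
            Arrays.asList("0-2:person", "5-6:person", "7-8:person", "12-13:person"));
        System.out.println("precision = " + precision(reference, predicted)); // 2 of 4 predictions correct
        System.out.println("recall    = " + recall(reference, predicted));    // 2 of 3 references found
        System.out.println("F1        = " + f1(reference, predicted));        // 4/7
    }
}
```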

Name Finder API - Cross Evaluate a model

// Wrap the line stream so that it yields NameSample objects.
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
    new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("en-ner-person.train")), StandardCharsets.UTF_8));

TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator("en", 100, 5);

// 10-fold cross validation
evaluator.evaluate(sampleStream, 10);

FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());
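The second argument to evaluate above is the number of folds. The idea behind k-fold cross validation can be sketched as follows: each sample lands in exactly one test fold, and the model for that fold is trained on all remaining samples. Assigning samples by index modulo the fold count is an assumption made for this sketch, not a statement about how OpenNLP partitions its data.

```java
import java.util.ArrayList;
import java.util.List;

class CrossValidationSketch {
    // Samples held out for testing in the given fold (0 <= fold < folds).
    static <T> List<T> testFold(List<T> samples, int folds, int fold) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % folds == fold) {
                out.add(samples.get(i));
            }
        }
        return out;
    }

    // Samples used for training in the given fold: everything not held out.
    static <T> List<T> trainFold(List<T> samples, int folds, int fold) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % folds != fold) {
                out.add(samples.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> samples = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            samples.add("sentence-" + i);
        }
        int folds = 5;
        for (int fold = 0; fold < folds; fold++) {
            // A real run would train on trainFold(...) and score on testFold(...),
            // then average the per-fold scores.
            System.out.println("fold " + fold + ": test=" + testFold(samples, folds, fold)
                + " train size=" + trainFold(samples, folds, fold).size());
        }
    }
}
```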

[Figure: pipeline diagram. Language Detector, then per-language pipelines (Language 1, Language 2, ... Language N), each consisting of Sentence Detector, Tokenizer, POS Tagger, Lemmatizer, Name Finder and Chunker, then Index...]
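The routing idea here (detect the language first, then hand the text to that language's pipeline) can be sketched in plain Java. The stop-word lists, class name and whitespace-only "pipelines" below are hypothetical stand-ins; a real setup would plug in a trained language detector and per-language models.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class LanguageRoutingSketch {
    // Tiny, hypothetical stop-word lists; a real detector is a trained model.
    static final Set<String> ENG = new HashSet<>(Arrays.asList("the", "is", "of", "and", "at"));
    static final Set<String> POR = new HashSet<>(Arrays.asList("os", "de", "que", "uma", "para"));

    // Picks the language whose stop words occur most often; ties fall back to "eng".
    static String detectLanguage(String text) {
        int eng = 0;
        int por = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (ENG.contains(word)) eng++;
            if (POR.contains(word)) por++;
        }
        return por > eng ? "por" : "eng";
    }

    // One pipeline per language; each entry would normally hold that language's models.
    static final Map<String, Function<String, List<String>>> PIPELINES = new HashMap<>();
    static {
        PIPELINES.put("eng", text -> Arrays.asList(text.split("\\s+")));
        PIPELINES.put("por", text -> Arrays.asList(text.split("\\s+")));
    }

    // Detects the language, runs the matching pipeline, and labels the result.
    static String process(String text) {
        String lang = detectLanguage(text);
        return lang + " " + PIPELINES.get(lang).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(process("The talk is today at room 7"));
        System.out.println(process("os documentos de hoje"));
    }
}
```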

Apache Flink

Mature project - 320+ contributors, > 11K commits

Very active project on GitHub

Java/Scala

Streaming first

Fault-Tolerant

Scalable - to 1000s of nodes and more

High Throughput, Low Latency

Apache Flink - POS Tagger and NER

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Read one news corpus per language from the classpath.
DataStream<String> portugueseText = env.readTextFile(
    OpenNLPMain.class.getResource("/input/por_newscrawl.txt").getFile());
DataStream<String> engText = env.readTextFile(
    OpenNLPMain.class.getResource("/input/eng_news.txt").getFile());

// Merge both corpora into a single stream.
DataStream<String> mergedStream = engText.union(portugueseText);

// Route each record by detected language (the conversion of each line
// to a Tuple2<String, String> is omitted here).
SplitStream<Tuple2<String, String>> langStream = mergedStream.split(new LanguageSelector());

Apache Flink - POS Tagger and NER

// Select the Portuguese articles and tokenize them.
DataStream<Tuple2<String, String>> porNewsArticles = langStream.select("por");
DataStream<Tuple2<String, String[]>> porNewsTokenized =
    porNewsArticles.map(new PorTokenizerMapFunction());

// POS tagging and name finding both consume the tokenized stream.
DataStream<POSSample> porNewsPOS = porNewsTokenized.map(new PorPOSTaggerMapFunction());
DataStream<NameSample> porNewsNamedEntities = porNewsTokenized.map(new PorNameFinderMapFunction());

Apache Flink - POS Tagger and NER

// Tags each record with its detected language so the stream can be split.
private static class LanguageSelector implements OutputSelector<Tuple2<String, String>> {
    public Iterable<String> select(Tuple2<String, String> s) {
        List<String> list = new ArrayList<>();
        list.add(languageDetectorME.predictLanguage(s.f1).getLang());
        return list;
    }
}

// Tokenizes each Portuguese record.
private static class PorTokenizerMapFunction
        implements MapFunction<Tuple2<String, String>, Tuple2<String, String[]>> {
    public Tuple2<String, String[]> map(Tuple2<String, String> s) {
        return new Tuple2<>(s.f0, porTokenizer.tokenize(s.f0));
    }
}

Apache Flink - POS Tagger and NER

// Tags the tokens of each Portuguese record with parts of speech.
private static class PorPOSTaggerMapFunction
        implements MapFunction<Tuple2<String, String[]>, POSSample> {
    public POSSample map(Tuple2<String, String[]> s) {
        String[] tags = porPosTagger.tag(s.f1);
        return new POSSample(s.f0, s.f1, tags);
    }
}

// Finds named entities in the tokens of each Portuguese record.
private static class PorNameFinderMapFunction
        implements MapFunction<Tuple2<String, String[]>, NameSample> {
    public NameSample map(Tuple2<String, String[]> s) {
        // Use the Portuguese name finder here, not the English one.
        Span[] names = porNameFinder.find(s.f1);
        return new NameSample(s.f0, s.f1, names, null, true);
    }
}

What’s Coming ??

● DL4J: Mature Project: 114 contributors, ~8k commits

● Modular: Tensor library, reinforcement learning, ETL, ..

● Focused on integrating with the JVM ecosystem while supporting the state of the art, such as GPUs on large clusters

● Implements most neural nets you’d need for language

● Named Entity Recognition using DL4J with LSTMs

● Language Detection using DL4J with LSTMs

● Possible: Translation using Bidirectional LSTMs with embeddings

● Computation graph architecture for more advanced use cases

Credits

Joern Kottmann - PMC Chair, Apache OpenNLP

Tommaso Teofili - PMC, Apache Lucene, Apache OpenNLP

William Colen - Head of Technology, Stilingue - Inteligência Artificial, Sao Paulo, Brazil; PMC, Apache OpenNLP

Till Rohrmann - Engineering Lead, Data Artisans, Berlin, Germany; Committer and PMC, Apache Flink

Fabian Hueske - Data Artisans; Committer and PMC, Apache Flink

Questions ???
