An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

An Introduction to NLP4L

Natural Language Processing tool for Apache Lucene

Koji Sekiguchi @kojisays Founder & CEO, RONDHUIT

Agenda

• What’s NLP4L?

• How NLP improves search experience

• Count number of words in Lucene index

• Application: Transliteration

• Future Plans

What’s NLP4L?

• GOAL

  • Improve Lucene users’ search experience

• FEATURES

  • Use of Lucene index as a Corpus Database

  • Lucene API Front-end written in Scala

  • NLP4L provides

    • Preprocessors for existing ML tools

    • ML algorithms and applications (e.g. Transliteration)

What’s Lucene?

Documents (indexing):

1: Alice ate an apple.
2: Mike likes an orange.
3: An apple is red.

(inverted) index:

alice 1
an 1, 2, 3
apple 1, 3
ate 1
is 3
likes 2
mike 2
orange 2
red 3

searching: “apple” → 1, 3

Lucene is a high-performance, full-featured text search engine library written entirely in Java.
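To make the inverted index idea concrete, here is a minimal sketch in plain Scala. It does not use Lucene at all; the `tokenize` and `search` helpers are illustrative names, and the tokenization is only a rough approximation of what a real analyzer does.

```scala
// Toy inverted index in plain Scala (no Lucene involved).
val docs = Seq(
  1 -> "Alice ate an apple.",
  2 -> "Mike likes an orange.",
  3 -> "An apple is red."
)

// Lowercase and split on non-letters (a rough stand-in for a real analyzer).
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

// term -> sorted IDs of the documents that contain the term
val invertedIndex: Map[String, Seq[Int]] =
  docs
    .flatMap { case (id, text) => tokenize(text).distinct.map(term => term -> id) }
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).sorted }

// Searching is a lookup in the term dictionary.
def search(term: String): Seq[Int] =
  invertedIndex.getOrElse(term.toLowerCase, Seq.empty)

println(search("apple"))
```

A query for “apple” never scans the sentences themselves; it only reads the posting list stored under that term.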

How NLP improves search experience

Evaluation Measures

(Figure: a Venn diagram of two overlapping sets, “target” and “result”.)

• positive: inside “result”; negative: outside “result”
• true positive (tp): in both “target” and “result”
• false positive (fp): in “result” only
• false negative (fn): in “target” only
• true negative: in neither set

precision = tp / (tp + fp)

recall = tp / (tp + fn)
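As a worked example of the two formulas, the sketch below computes them from two sets of document IDs. The IDs are made up for illustration.

```scala
// Relevant documents (target) vs. documents returned by the engine (result).
// The document IDs are invented for this example.
val target = Set(1, 2, 3, 4)
val result = Set(3, 4, 5)

val tp = (target intersect result).size // relevant and returned
val fp = (result diff target).size      // returned but not relevant
val fn = (target diff result).size      // relevant but not returned

val precision = tp.toDouble / (tp + fp)
val recall    = tp.toDouble / (tp + fn)

println(f"precision=$precision%.3f recall=$recall%.3f")
```

Here tp = 2, fp = 1 and fn = 2, so precision is 2/3 and recall is 1/2.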

Recall, Precision

(Figure: the “target” and “result” sets with their tp, fp and fn regions.)

Solution

• To raise recall: n-gram, synonym dictionary, etc. (e.g. Transliteration)
• To raise precision: facet (filter query; e.g. Named Entity Extraction), Ranking Tuning

gradual precision improvement

(Figure: the “result” set narrows toward the “target” set with each step.)

1. q=watch
2. filter by “Gender=Men’s”
3. filter by “Price=100-150”

Structured Documents

ID | product | price | gender
1 | CURREN New Men’s Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men’s
2 | Suiksilver The Gamer Watch | 87.99 | Men’s
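Filtering on structured fields can be sketched in plain Scala as successive `filter` calls. The `Product` case class and rows 3 and 4 are hypothetical, added only so that each filter has something to remove; a real system would use Lucene filter queries instead.

```scala
case class Product(id: Int, name: String, price: Double, gender: String)

val products = Seq(
  Product(1, "CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch", 8.92, "Men's"),
  Product(2, "Suiksilver The Gamer Watch", 87.99, "Men's"),
  // Hypothetical rows, not from the slides.
  Product(3, "Ladies Quartz Watch", 120.00, "Women's"),
  Product(4, "Men's Chronograph Watch", 129.00, "Men's")
)

// q=watch
val hits = products.filter(_.name.toLowerCase.contains("watch"))
// filter by Gender=Men's
val mens = hits.filter(_.gender == "Men's")
// filter by Price=100-150
val priced = mens.filter(p => p.price >= 100 && p.price <= 150)

println(priced.map(_.id))
```

Each step can only shrink the result set, which is why facets improve precision gradually.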

Unstructured Documents

ID | article
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.

Make them Structured

ID | article | person | org | loc
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain

NEE[1] extracts interesting words.

[1] Named Entity Extraction

Manual Tagging using brat

Count number of words in Lucene index

A small Corpus

index_simple.scala

// text data
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

// Lucene index directory
val index = "/tmp/index-simple"

// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  val fieldTypes = Map(
    "text" -> FieldType(analyzer, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create Lucene document
def doc(text: String): Document = {
  Document(Set(
    Field("text", text)
  ))
}

// open a writer
val writer = IWriter(index, schema())

// write documents
CORPUS.foreach(text => writer.write(doc(text)))

// close writer
writer.close

The code snippets used in this talk are available at: https://github.com/NLP4L/meetups/tree/master/20150818

Getting word counts

(Figure: the inverted index of the small corpus.)

val reader = RawReader(index)

reader.sumTotalTermFreq("text") // -> 12

reader.field("text").get.terms.size // -> 9

reader.totalTermFreq("text", "an") // -> 3

reader.close

getting_word_counts.scala
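The three numbers above can be checked by hand with a plain Scala sketch that does not touch Lucene. The `tokenize` helper is an assumption standing in for the "standard" tokenizer plus "lowercase" filter.

```scala
val corpus = Seq("Alice ate an apple.", "Mike likes an orange.", "An apple is red.")

// Rough stand-in for the "standard" tokenizer plus "lowercase" filter.
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

val tokens = corpus.flatMap(t => tokenize(t))

// term -> number of occurrences in the whole corpus
val termFreq: Map[String, Int] =
  tokens.groupBy(identity).map { case (term, occ) => term -> occ.size }

val sumTotalTermFreq = tokens.size   // total number of tokens: 12
val uniqueTerms      = termFreq.size // number of distinct terms: 9
val freqOfAn         = termFreq.getOrElse("an", 0) // occurrences of "an": 3

println((sumTotalTermFreq, uniqueTerms, freqOfAn))
```

The point of NLP4L is that the index already stores these statistics, so no pass over the raw text is needed.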


Getting top terms

reader.topTermsByDocFreq("text")
reader.topTermsByTotalTermFreq("text")
// ->
// (term, docFreq, totalTermFreq)
// (an,3,3)
// (apple,2,2)
// (likes,1,1)
// (is,1,1)
// (orange,1,1)
// (mike,1,1)
// (ate,1,1)
// (red,1,1)
// (alice,1,1)

reader.close
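The (term, docFreq, totalTermFreq) triples can be reproduced with the plain Scala sketch below. It is only an illustration of what the two statistics mean; the tie-breaking order is an assumption and need not match what NLP4L returns.

```scala
val corpus = Seq("Alice ate an apple.", "Mike likes an orange.", "An apple is red.")

// Rough stand-in for the "standard" tokenizer plus "lowercase" filter.
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

val perDoc: Seq[Seq[String]] = corpus.map(t => tokenize(t))

// (term, docFreq, totalTermFreq) for every distinct term
val stats: Seq[(String, Int, Int)] =
  perDoc.flatten.distinct.map { term =>
    val docFreq = perDoc.count(_.contains(term))           // documents containing the term
    val totalTermFreq = perDoc.map(_.count(_ == term)).sum // occurrences over all documents
    (term, docFreq, totalTermFreq)
  }

// Highest docFreq first; ties are broken alphabetically here.
val topByDocFreq = stats.sortBy { case (term, df, _) => (-df, term) }

topByDocFreq.foreach(println)
```

In this corpus no term occurs twice in one document, so docFreq and totalTermFreq coincide for every term.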

What’s ShingleFilter?

• ShingleFilter = Word n-gram TokenFilter

“Lucene is a popular software”
↓ WhitespaceTokenizer
Lucene/is/a/popular/software
↓ ShingleFilter (N=2)
Lucene is/is a/a popular/popular software
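The same two steps can be imitated in plain Scala with `sliding`. This is a sketch of the idea only, not of Lucene's `ShingleFilter` implementation; `whitespaceTokenize` and `shingles` are illustrative names.

```scala
// Split on whitespace, like a whitespace tokenizer would.
def whitespaceTokenize(text: String): Seq[String] =
  text.trim.split("\\s+").toSeq

// Word n-grams: every window of n consecutive tokens, joined by a space.
def shingles(tokens: Seq[String], n: Int): Seq[String] =
  tokens.sliding(n).filter(_.size == n).map(_.mkString(" ")).toList

val tokens  = whitespaceTokenize("Lucene is a popular software")
val bigrams = shingles(tokens, 2)

println(bigrams.mkString("/"))
```

The `filter(_.size == n)` guard drops the short window that `sliding` yields when the input has fewer than n tokens.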

Language Model

• LM represents the fluency of language

• The N-gram model is the most widely used LM

• Calculation example for 2-gram

language_model.scala

val index = "/tmp/index-lm"

// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2", "outputUnigrams", "false")
  val analyzer2g = builder.build
  val fieldTypes = Map(
    "word" -> FieldType(analyzer, true, true, true, true),
    "word2g" -> FieldType(analyzer2g, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create a language model index
val writer = IWriter(index, schema())

def addDocument(doc: String): Unit = {
  writer.write(Document(Set(
    Field("word", doc),
    Field("word2g", doc)
  )))
}

CORPUS.foreach(addDocument(_))

writer.close()


// open the language model index for reading
val reader = RawReader(index)

// P(apple|an) = C(an apple) / C(an)
val count_an_apple = reader.totalTermFreq("word2g", "an apple")
val count_an = reader.totalTermFreq("word", "an")
val prob_apple_an = count_an_apple.toFloat / count_an.toFloat

// P(orange|an) = C(an orange) / C(an)
val count_an_orange = reader.totalTermFreq("word2g", "an orange")
val prob_orange_an = count_an_orange.toFloat / count_an.toFloat

reader.close
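The 2-gram calculation can be verified without an index. The sketch below counts unigrams and bigrams directly from the three sentences; `tokenize` and `bigramProb` are illustrative helpers, not NLP4L functions.

```scala
val corpus = Seq("Alice ate an apple.", "Mike likes an orange.", "An apple is red.")

// Rough stand-in for the "standard" tokenizer plus "lowercase" filter.
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

val sentences = corpus.map(t => tokenize(t))
val unigrams  = sentences.flatten
// Bigrams are built per sentence, so they never span a sentence boundary.
val bigrams   = sentences.flatMap(s => s.sliding(2).filter(_.size == 2).map(_.mkString(" ")).toList)

def count(items: Seq[String], item: String): Int = items.count(_ == item)

// P(next | prev) = C(prev next) / C(prev)
def bigramProb(prev: String, next: String): Double = {
  val cPrev = count(unigrams, prev)
  if (cPrev == 0) 0.0 else count(bigrams, s"$prev $next").toDouble / cPrev
}

println(bigramProb("an", "apple"))
println(bigramProb("an", "orange"))
```

With C(an) = 3, C(an apple) = 2 and C(an orange) = 1, the model considers “an apple” twice as likely as “an orange”.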

Part-of-Speech Tagging

Our Corpus for training

Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.

NNP Proper noun, singular
VB Verb
AT Article
JJ Adjective
. period

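Hidden Markov Model

• Series of Words (the observations)
• Series of Part-of-Speech (the hidden states)

HMM state diagram

Initial probabilities: NNP 0.667, VB 0.0, JJ 0.0, AT 0.333

Transition probabilities labeled in the figure: 0.4, 0.6; 0.667, 0.333

Emission probabilities:
NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2
VB: ate 0.333, is 0.333, likes 0.333
AT: an 1.0
JJ: red 1.0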
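The probabilities in the state diagram are relative frequencies over the tagged training corpus. The sketch below estimates them in plain Scala; it is an assumption about how such numbers are obtained (simple counting, no smoothing), not NLP4L's own code.

```scala
val corpus = Seq(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)

// Each sentence becomes a sequence of (lowercased word, tag) pairs.
val tagged: Seq[Seq[(String, String)]] = corpus.map { line =>
  line.split("\\s+").toSeq.map { pair =>
    val parts = pair.split("/")
    (parts(0).toLowerCase, parts(1))
  }
}

// Relative frequency of each distinct item in a sequence.
def relFreq[A](items: Seq[A]): Map[A, Double] =
  items.groupBy(identity).map { case (k, v) => k -> v.size.toDouble / items.size }

// Initial probabilities: tag of the first word of each sentence.
val initial: Map[String, Double] = relFreq(tagged.map(_.head._2))

// Transition probabilities: P(next tag | tag), counted within sentences.
val transition: Map[String, Map[String, Double]] =
  tagged
    .flatMap(s => s.map(_._2).sliding(2).filter(_.size == 2).map(p => (p(0), p(1))).toList)
    .groupBy(_._1)
    .map { case (from, pairs) => from -> relFreq(pairs.map(_._2)) }

// Emission probabilities: P(word | tag).
val emission: Map[String, Map[String, Double]] =
  tagged.flatten
    .groupBy(_._2)
    .map { case (tag, pairs) => tag -> relFreq(pairs.map(_._1)) }

println(initial)
println(transition("NNP"))
println(emission("NNP"))
```

Under this counting, the 0.6 and 0.4 in the figure correspond to NNP followed by VB and by the period, and 0.667 and 0.333 to VB followed by AT and by JJ.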

hmm.scala

val index = "/tmp/index-hmm"

// text data (they are tagged!)
val CORPUS = Array(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)

// write-open Lucene index
val indexer = HmmModelIndexer(index)

// tagged texts are indexed here
CORPUS.foreach{ text =>
  val pairs = text.split("\\s+")
  val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
  indexer.addDocument(doc)
}

indexer.close()

// execute part-of-speech tagging on an unknown text
// make an HmmModel from Lucene index
val model = HmmModel(index)
// get HmmTagger from HmmModel
val tagger = HmmTagger(model)

// use HmmTagger to annotate unknown sentence
tagger.tokens("alice likes an apple .")

NLP4L includes hmm_postagger.scala in its examples directory. It uses the Brown Corpus for HMM training.

Application: Transliteration

Transliteration

Transliteration is the process of transcribing letters or words from one alphabet into another to facilitate comprehension and pronunciation for non-native speakers.

computer コンピューター

server サーバー

internet インターネット

mouse マウス

information インフォメーション

examples of transliteration from English to Japanese

It helps improve recall

You search for the English word “mouse”, and you get “マウス” (= mouse) highlighted in Japanese text.

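Training data in NLP4L

train_data/alpha_katakana.txt

academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー

train_data/alpha_katakana_aligned.txt (segment alignment as drawn on the slide; the “|” separators are added here for readability)

ア a | カ ca | デ de | ミー my
ア a | ク c | セ ce | ン n | ト t
ア a | ク c | セ ce | ス ss
ア a | ク c | シ ci | デ de | ン n | ト t
ア a | ク c | ロ ro | バッ ba | ト t
ア a | ク c | ショ tio | ン n
ア a | ダ da | プ p | ター ter
ア a | フ f | リ ri | カ ca
エ a | ア ir | バ bu | ス s
ア a | ラ la | ス s | カ ka
ア a | ル l | コー coho | ル l
ア a | レ lle | ル r | ギー gy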

Demo: Transliteration

Input | Prediction | Right Answer
アルゴリズム | algorism | algorithm
プログラム | program | (OK)
ケミカル | chaemmical | chemical
ダイニング | dining | (OK)
コミッター | committer | (OK)
エントリー | entree | entry

nlp4l> :load examples/trans_katakana_alpha.scala

Gathering loan words

① crawl
② gather Katakana-Alphabet string pairs, e.g. (アルゴリズム, algorithm)
③ transliterate the Katakana string: “アルゴリズム” → “algorism”
④ calculate the edit distance between the transliteration and the crawled alphabet string
⑤ store the pair of strings in synonyms.txt if the edit distance is small enough

Got 1,800+ records of synonym knowledge from jawiki
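The edit distance check can be sketched with a standard Levenshtein implementation. The talk does not say which distance variant or threshold is used, so both the unit costs and `maxDistance` below are assumptions.

```scala
// Levenshtein edit distance with a rolling row
// (insertion, deletion and substitution each cost 1).
def editDistance(a: String, b: String): Int = {
  var prev = Array.range(0, b.length + 1)
  for (i <- 1 to a.length) {
    val curr = new Array[Int](b.length + 1)
    curr(0) = i
    for (j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
    }
    prev = curr
  }
  prev(b.length)
}

// Hypothetical threshold; the talk only says "small enough".
val maxDistance = 2

// Keep a crawled pair when the transliteration of the Katakana string
// is close to the crawled alphabet string.
def isSynonymCandidate(transliterated: String, crawled: String): Boolean =
  editDistance(transliterated, crawled) <= maxDistance

println(editDistance("algorism", "algorithm"))
println(isSynonymCandidate("algorism", "algorithm"))
```

An imperfect transliteration such as “algorism” is still only two edits away from “algorithm”, so the pair survives the filter, while an unrelated string pair would not.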

Future Plans

NLP4L Framework

• A pluggable framework that improves the search experience (mainly for Lucene-based search systems).

• Reference implementation of plug-ins and corpora provided.

• Uses NLP/ML technologies to output models, dictionaries and indexes.

• Since NLP/ML are not perfect, an interface that enables users to personally examine output dictionaries is provided as well.

(Figure: NLP4L Framework architecture; the labels recovered from the slide are listed below.)

• Data Source: Corpus (Text data, Lucene index), Query Log, Access Log
• Processing: TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search
• Related tools: Mahout, Spark
• Outputs: Model files, Tagged Corpus, Document Vectors
• Dictionaries (with maintenance): Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment

Keyword Attachment

• “Keyword attachment” is a general format that enables the following functions.

• Learning to Rank

• Personalized Search

• Named Entity Extraction

• Document Classification

(Figure: a keyword is attached to a Lucene doc; ↑ Increase boost)

Learning to Rank

• Program learns, from access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d)

Lucene doc d: q, q, …

https://en.wikipedia.org/wiki/Learning_to_rank

Personalized Search

• Program learns, from access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d)

• Lucene does not let you specify score(q,d,u), so you have to specify score(qu,d) instead.

• Because the number of q-u combinations can be enormous, limit the data to high-order queries or divide fields depending on the user.

Lucene doc d1: q1u1, q2u2

Lucene doc d2: q2u1, q1u2
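The score(qu,d) idea can be sketched in plain Scala as follows. This is not a Lucene or NLP4L API: `queryUserKey`, `personalizedScore` and the boost factor are hypothetical, and a real system would store the attached keywords in an index field and boost at query time.

```scala
// Attached keywords per document, as on the slide:
// d1 carries q1u1 and q2u2, d2 carries q2u1 and q1u2.
val attached: Map[String, Set[String]] = Map(
  "d1" -> Set("q1u1", "q2u2"),
  "d2" -> Set("q2u1", "q1u2")
)

// Hypothetical helper: fold the user into the query so that
// score(q, d, u) can be expressed as score(qu, d).
def queryUserKey(q: String, u: String): String = q + u

// Hypothetical scoring: multiply the base score by `boost`
// when the document carries the combined query-user keyword.
def personalizedScore(baseScore: Double, doc: String, q: String, u: String, boost: Double = 2.0): Double =
  if (attached.getOrElse(doc, Set.empty[String]).contains(queryUserKey(q, u))) baseScore * boost
  else baseScore

println(personalizedScore(1.0, "d1", "q1", "u1"))
println(personalizedScore(1.0, "d1", "q1", "u2"))
```

The same query q1 boosts d1 for user u1 but d2 for user u2, which is the behaviour the two example documents describe.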

Join and Code with Us!

Contact us at koji at apache dot org for the details.

Demo or Q & A

Thank you!
