An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Post on 22-Jan-2018


An Introduction to NLP4L

Natural Language Processing tool for Apache Lucene

Koji Sekiguchi @kojisays Founder & CEO, RONDHUIT

Agenda

• What’s NLP4L?
• How NLP improves search experience
• Count number of words in Lucene index
• Application: Transliteration
• Future Plans

What’s NLP4L?

• GOAL
  • Improve Lucene users’ search experience
• FEATURES
  • Use of Lucene index as a Corpus Database
  • Lucene API Front-end written in Scala
• NLP4L provides
  • Preprocessors for existing ML tools
  • ML algorithms and applications (e.g. Transliteration)

What’s Lucene?

Lucene is a high-performance, full-featured text search engine library written entirely in Java.

Documents:

1: Alice ate an apple.
2: Mike likes an orange.
3: An apple is red.

Indexing builds the (inverted) index, which maps each term to the documents that contain it:

alice   1
an      1, 2, 3
apple   1, 3
ate     1
is      3
likes   2
mike    2
orange  2
red     3

Searching for “apple” looks the term up in the (inverted) index.
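The inverted index in the example above can be sketched in a few lines of plain Scala. This is only a toy illustration, not Lucene's implementation: the `analyze` helper, the `invertedIndex` map and the `search` function are names made up for this sketch, and the analysis step is a crude lowercase-and-split.

```scala
// Toy inverted index in plain Scala (no Lucene involved).
val docs: Map[Int, String] = Map(
  1 -> "Alice ate an apple.",
  2 -> "Mike likes an orange.",
  3 -> "An apple is red."
)

// Crude analysis: lowercase, then split on anything that is not a letter or digit.
def analyze(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

// term -> sorted IDs of the documents that contain the term
val invertedIndex: Map[String, List[Int]] =
  docs.toList
    .flatMap { case (id, text) => analyze(text).distinct.map(term => term -> id) }
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).sorted }

// Searching is a lookup by (lowercased) term.
def search(term: String): List[Int] =
  invertedIndex.getOrElse(term.toLowerCase, Nil)
```

With this sketch, `search("apple")` yields documents 1 and 3, matching the posting list shown for “apple”.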

How NLP improves search experience

Evaluation Measures

[Figure: two overlapping sets, “target” and “result”. Everything inside “result” is positive; everything outside it is negative.]

• tp (true positive): inside both target and result
• fp (false positive): inside result, outside target
• fn (false negative): inside target, outside result
• tn (true negative): outside both

Recall, Precision

precision = tp / (tp + fp)

recall = tp / (tp + fn)
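As a small worked example of the two formulas, the sketch below computes precision and recall from a target set and a result set. The document IDs are hypothetical values chosen for illustration; nothing here comes from NLP4L or Lucene.

```scala
// Hypothetical document IDs, for illustration only.
val target: Set[Int] = Set(1, 2, 3, 4)     // documents that should be found
val result: Set[Int] = Set(3, 4, 5, 6, 7)  // documents the search returned

val tp = (target & result).size   // returned and wanted
val fp = (result -- target).size  // returned but not wanted
val fn = (target -- result).size  // wanted but not returned

val precision = tp.toDouble / (tp + fp)  // 2 / 5 = 0.4
val recall    = tp.toDouble / (tp + fn)  // 2 / 4 = 0.5
```

Shrinking `result` toward `target` (for example with a filter) raises precision, while enlarging it tends to raise recall.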

Solution

• recall: n-gram, synonym dictionary, etc. (e.g. Transliteration)
• precision: facet (filter query) (e.g. Named Entity Extraction)
• recall, precision: Ranking Tuning

Gradual precision improvement

[Figure: the “target” and “result” sets, redrawn after each step.]

• q=watch
• filter by “Gender=Men’s”
• filter by “Price=100-150”

Structured Documents

ID | product | price | gender
1 | CURREN New Men’s Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men’s
2 | Suiksilver The Gamer Watch | 87.99 | Men’s

Unstructured Documents

ID 1
article: David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.

ID 2
article: He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.

Make them Structured

ID 1
article: (as above)
person: David Cameron
org: EU
loc: Brussels

ID 2
article: (as above)
person:
org: EU
loc: UK, Britain

NEE[1] extracts interesting words.

[1] Named Entity Extraction

Manual Tagging using brat

Count number of words in Lucene index

A small Corpus

val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

index_simple.scala

// Imports are an assumption about the NLP4L package layout; the nlp4l REPL may already provide them.
import org.nlp4l.core._
import org.nlp4l.core.analysis._

// text data
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

// Lucene index directory
val index = "/tmp/index-simple"

// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  val fieldTypes = Map(
    "text" -> FieldType(analyzer, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create Lucene document
def doc(text: String): Document = {
  Document(Set(
    Field("text", text)
  ))
}

// open a writer
val writer = IWriter(index, schema())

// write documents
CORPUS.foreach(text => writer.write(doc(text)))

// close writer
writer.close()

For the code snippets used in this talk, please look at: https://github.com/NLP4L/meetups/tree/master/20150818

Getting word counts

alice   1
an      1, 2, 3
apple   1, 3
ate     1
is      3
likes   2
mike    2
orange  2
red     3

getting_word_counts.scala

val reader = RawReader(index)

// total number of term occurrences in the field
reader.sumTotalTermFreq("text")       // -> 12

// number of unique terms in the field
reader.field("text").get.terms.size   // -> 9

// number of occurrences of the term "an"
reader.totalTermFreq("text", "an")    // -> 3

reader.close

Getting top terms

val reader = RawReader(index)

reader.topTermsByDocFreq("text")
reader.topTermsByTotalTermFreq("text")
// ->
// (term, docFreq, totalTermFreq)
// (an,3,3)
// (apple,2,2)
// (likes,1,1)
// (is,1,1)
// (orange,1,1)
// (mike,1,1)
// (ate,1,1)
// (red,1,1)
// (alice,1,1)

reader.close

getting_word_counts.scala
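To see where the numbers 12, 9 and 3 come from, the same counts can be reproduced without Lucene. The sketch below is plain Scala over the three-sentence corpus; `analyze`, `totalTermFreq`, `docFreq` and the other names are local to this sketch and only mimic what the index stores (lowercased tokens, punctuation dropped).

```scala
val corpus = Seq("Alice ate an apple.", "Mike likes an orange.", "An apple is red.")

// Crude stand-in for the analyzer: lowercase and split on non-alphanumerics.
def analyze(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

val tokens: Seq[String] = corpus.flatMap(analyze)

// term -> total number of occurrences across the corpus
val totalTermFreq: Map[String, Int] =
  tokens.groupBy(identity).map { case (term, occurrences) => term -> occurrences.size }

// term -> number of documents that contain the term
val docFreq: Map[String, Int] =
  corpus.flatMap(text => analyze(text).distinct)
    .groupBy(identity).map { case (term, hits) => term -> hits.size }

val sumTotalTermFreq = tokens.size    // 12 term occurrences in total
val uniqueTerms = totalTermFreq.size  // 9 unique terms

// Terms ordered by descending document frequency, ties broken alphabetically.
val topByDocFreq = docFreq.toSeq.sortBy { case (term, df) => (-df, term) }
```

Here `totalTermFreq("an")` is 3 and `topByDocFreq` starts with `an` and `apple`, in line with the top-terms listing above (the order among terms with equal frequency may differ from Lucene's).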

What’s ShingleFilter?

• ShingleFilter = Word n-gram TokenFilter

“Lucene is a popular software”
→ WhitespaceTokenizer: Lucene/is/a/popular/software
→ ShingleFilter (N=2): Lucene is/is a/a popular/popular software
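The effect of a whitespace tokenizer followed by a shingle filter with N=2 can be imitated in plain Scala with `sliding`. This is a sketch of the idea only; `shingles` is a hypothetical helper and does not call Lucene's ShingleFilter.

```scala
// Word n-grams ("shingles") over whitespace-separated tokens.
def shingles(text: String, n: Int): Seq[String] =
  text.trim.split("\\s+").filter(_.nonEmpty).toSeq
    .sliding(n)
    .filter(_.size == n)   // drop a short trailing window when there are fewer than n tokens
    .map(_.mkString(" "))
    .toSeq

val bigrams = shingles("Lucene is a popular software", 2)
// bigrams: "Lucene is", "is a", "a popular", "popular software"
```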

Language Model

• An LM represents the fluency of language
• The n-gram model is the most widely used LM
• Calculation example for 2-gram

language_model.scala (1/2)

val index = "/tmp/index-lm"

val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  // the same builder plus a shingle filter gives the 2-gram analyzer
  builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2", "outputUnigrams", "false")
  val analyzer2g = builder.build
  val fieldTypes = Map(
    "word" -> FieldType(analyzer, true, true, true, true),
    "word2g" -> FieldType(analyzer2g, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create a language model index
val writer = IWriter(index, schema())

def addDocument(doc: String): Unit = {
  writer.write(Document(Set(
    Field("word", doc),
    Field("word2g", doc)
  )))
}

CORPUS.foreach(addDocument(_))

writer.close()

language_model.scala (2/2)

val reader = RawReader(index)

// P(apple|an) = C(an apple) / C(an)
val count_an_apple = reader.totalTermFreq("word2g", "an apple")
val count_an = reader.totalTermFreq("word", "an")
val prob_apple_an = count_an_apple.toFloat / count_an.toFloat

// P(orange|an) = C(an orange) / C(an)
val count_an_orange = reader.totalTermFreq("word2g", "an orange")
val prob_orange_an = count_an_orange.toFloat / count_an.toFloat

reader.close
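The 2-gram calculation P(word | prev) = C(prev word) / C(prev) can also be checked by hand in plain Scala, without an index. This is a minimal maximum-likelihood sketch with no smoothing; `analyze`, `counts` and `prob` are names invented here, and the tokenization only approximates the standard tokenizer plus lowercase filter.

```scala
val corpus = Seq("Alice ate an apple.", "Mike likes an orange.", "An apple is red.")

def analyze(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

// Count occurrences of each distinct item.
def counts(items: Seq[String]): Map[String, Int] =
  items.groupBy(identity).map { case (item, occurrences) => item -> occurrences.size }

val unigramCount: Map[String, Int] = counts(corpus.flatMap(analyze))

// Bigrams are built per sentence, so they never span a sentence boundary.
val bigramCount: Map[String, Int] = counts(
  corpus.flatMap(s => analyze(s).sliding(2).filter(_.size == 2).map(_.mkString(" ")))
)

// P(word | prev) = C(prev word) / C(prev); 0.0 when prev was never seen.
def prob(word: String, prev: String): Double =
  unigramCount.get(prev) match {
    case Some(c) => bigramCount.getOrElse(s"$prev $word", 0).toDouble / c
    case None    => 0.0
  }
```

For this corpus, `prob("apple", "an")` is 2/3 and `prob("orange", "an")` is 1/3, because “an” occurs three times, “an apple” twice and “an orange” once.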

Part-of-Speech Tagging

Our Corpus for training:

Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.

NNP  Proper noun, singular
VB   Verb
AT   Article
JJ   Adjective
.    period

Hidden Markov Model

[Figure: Hidden Markov Model]

• Series of Words (observed)
• Series of Part-of-Speech (hidden states)

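HMM state diagram

[Figure: state diagram over the tags NNP, VB, AT, JJ and “.”; the values below are the labels recovered from it, matched to the training corpus.]

Initial probabilities:

NNP  0.667
VB   0.0
AT   0.333
JJ   0.0
.    0.0

Transition probabilities:

AT → NNP   1.0
JJ → .     1.0
NNP → .    0.4
NNP → VB   0.6
VB → AT    0.667
VB → JJ    0.333

Emission probabilities:

NNP  alice 0.2, apple 0.4, mike 0.2, orange 0.2
VB   ate 0.333, is 0.333, likes 0.333
AT   an 1.0
JJ   red 1.0
.    . 1.0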
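The probabilities in the state diagram can be re-derived from the tagged training corpus by simple counting. The sketch below does this in plain Scala with relative frequencies and no smoothing. It is not NLP4L's HmmModelIndexer; all names (`relFreq`, `initial`, `transition`, `emission`) are assumptions made for this illustration.

```scala
val taggedCorpus = Seq(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)

// Each sentence becomes a sequence of (lowercased word, tag) pairs.
val sentences: Seq[Seq[(String, String)]] = taggedCorpus.map { line =>
  line.split("\\s+").toSeq.map { pair =>
    val parts = pair.split("/")
    (parts(0).toLowerCase, parts(1))
  }
}

// Relative frequency of each distinct item in a sequence.
def relFreq[A](items: Seq[A]): Map[A, Double] =
  items.groupBy(identity).map { case (item, occ) => item -> occ.size.toDouble / items.size }

// Initial probabilities: tag of the first word of each sentence.
val initial: Map[String, Double] = relFreq(sentences.map(_.head._2))

// Transition probabilities P(next tag | tag), from adjacent tag pairs within a sentence.
val tagPairs: Seq[(String, String)] =
  sentences.flatMap(s => s.map(_._2).sliding(2).collect { case Seq(a, b) => (a, b) })
val transition: Map[String, Map[String, Double]] =
  tagPairs.groupBy(_._1).map { case (tag, pairs) => tag -> relFreq(pairs.map(_._2)) }

// Emission probabilities P(word | tag).
val emission: Map[String, Map[String, Double]] =
  sentences.flatten.groupBy(_._2).map { case (tag, pairs) => tag -> relFreq(pairs.map(_._1)) }
```

Tags that never start a sentence (VB, JJ, “.”) are simply absent from `initial`, which corresponds to the 0.0 entries in the diagram.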

hmm.scala

// Import is an assumption about the NLP4L package layout; the nlp4l REPL may already provide it.
import org.nlp4l.lm._

val index = "/tmp/index-hmm"

// text data (they are tagged!)
val CORPUS = Array(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)

// write-open Lucene index
val indexer = HmmModelIndexer(index)

// tagged texts are indexed here
CORPUS.foreach{ text =>
  val pairs = text.split("\\s+")
  // split each "word/TAG" token into a (lowercased word, tag) pair
  val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
  indexer.addDocument(doc)
}

indexer.close()

// execute part-of-speech tagging on an unknown text

// make an HmmModel from Lucene index
val model = HmmModel(index)

// get HmmTagger from HmmModel
val tagger = HmmTagger(model)

// use HmmTagger to annotate unknown sentence
tagger.tokens("alice likes an apple .")

NLP4L has hmm_postagger.scala in its examples directory. It uses the Brown corpus for HMM training.
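For illustration, tagging with such a model can be sketched with the standard Viterbi algorithm in plain Scala. The slides do not say how HmmTagger works internally, so this is a generic decoder and not NLP4L's code. The parameter tables are copied from the toy state diagram, and `p` and `viterbi` are hypothetical helpers.

```scala
// Toy model parameters; entries that are missing mean probability 0.
val initial = Map("NNP" -> 0.667, "AT" -> 0.333)
val transition = Map(
  "NNP" -> Map("VB" -> 0.6, "." -> 0.4),
  "VB"  -> Map("AT" -> 0.667, "JJ" -> 0.333),
  "AT"  -> Map("NNP" -> 1.0),
  "JJ"  -> Map("." -> 1.0)
)
val emission = Map(
  "NNP" -> Map("alice" -> 0.2, "apple" -> 0.4, "mike" -> 0.2, "orange" -> 0.2),
  "VB"  -> Map("ate" -> 0.333, "is" -> 0.333, "likes" -> 0.333),
  "AT"  -> Map("an" -> 1.0),
  "JJ"  -> Map("red" -> 1.0),
  "."   -> Map("." -> 1.0)
)
val tags: Seq[String] = emission.keys.toSeq

// Look up a probability in a nested table, defaulting to 0.
def p(table: Map[String, Map[String, Double]], from: String, to: String): Double =
  table.getOrElse(from, Map.empty[String, Double]).getOrElse(to, 0.0)

// Viterbi: per position and tag, keep the most probable tag path ending in that tag.
// Paths are stored in reverse order (latest tag first).
def viterbi(words: Seq[String]): Seq[String] =
  if (words.isEmpty) Seq.empty
  else {
    val start: Map[String, (Double, List[String])] =
      tags.map(t => (t, (initial.getOrElse(t, 0.0) * p(emission, t, words.head), List(t)))).toMap
    val best = words.tail.foldLeft(start) { (prev, word) =>
      tags.map { t =>
        // choose the best predecessor path for reaching tag t
        val (score, path) = prev.values
          .map { case (s, prevPath) => (s * p(transition, prevPath.head, t), prevPath) }
          .maxBy(_._1)
        (t, (score * p(emission, t, word), t :: path))
      }.toMap
    }
    best.values.maxBy(_._1)._2.reverse
  }

val tagged = viterbi("alice likes an apple .".split(" ").toSeq)
// tagged: NNP, VB, AT, NNP, .
```

Because the toy model has no smoothing, a sentence containing a word outside the training corpus gets probability 0 on every path, and the returned tags are then arbitrary.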

Application: Transliteration

Transliteration

Transliteration is the process of transcribing letters or words from one alphabet into another to facilitate comprehension and pronunciation for non-native speakers.

Examples of transliteration from English to Japanese:

computer     コンピューター
server       サーバー
internet     インターネット
mouse        マウス
information  インフォメーション

It helps improve recall

• You search for English “mouse”
• but you get “マウス” (=mouse) highlighted in Japanese

Training data in NLP4L

train_data/alpha_katakana.txt

academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー

train_data/alpha_katakana_aligned.txt

アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy

Demo: Transliteration

nlp4l> :load examples/trans_katakana_alpha.scala

Input | Prediction | Right Answer
アルゴリズム | algorism | algorithm
プログラム | program | (OK)
ケミカル | chaemmical | chemical
ダイニング | dining | (OK)
コミッター | committer | (OK)
エントリー | entree | entry

Gathering loan words

① crawl
② gathering Katakana-Alphabet string pairs, e.g. アルゴリズム, algorithm
③ Transliteration: “アルゴリズム” → “algorism”
④ calculate edit distance
⑤ store pair of strings if edit distance is small enough (synonyms.txt)

Got 1,800+ records of synonym knowledge from jawiki
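The edit-distance check in the loan-word pipeline can be sketched with the classic Levenshtein distance. The code below is a generic plain-Scala version, not taken from NLP4L. The threshold `maxDistance` is a hypothetical value (the slides only say “small enough”), and comparing the transliterated string with the crawled alphabet string is an assumption about which two strings are compared.

```scala
// Levenshtein edit distance: minimum number of single-character insertions,
// deletions and substitutions needed to turn `a` into `b`.
def editDistance(a: String, b: String): Int = {
  // Row for the empty prefix of `a`: distance to b.take(j) is j insertions.
  val initialRow = (0 to b.length).toVector
  a.foldLeft(initialRow) { (prev, ca) =>
    // prev(j) is the distance between the prefix of `a` processed so far and b.take(j).
    b.zipWithIndex.foldLeft(Vector(prev.head + 1)) { case (row, (cb, j)) =>
      val substitution = prev(j) + (if (ca == cb) 0 else 1)
      val deletion = prev(j + 1) + 1
      val insertion = row.last + 1
      row :+ math.min(substitution, math.min(deletion, insertion))
    }
  }.last
}

// Hypothetical threshold for "small enough".
val maxDistance = 2

// Keep a crawled pair when the predicted spelling is close to the crawled one.
def isSynonymCandidate(predicted: String, crawled: String): Boolean =
  editDistance(predicted, crawled) <= maxDistance
```

For the demo rows, `editDistance("algorism", "algorithm")` and `editDistance("entree", "entry")` are both 2, so with this threshold both pairs would be kept.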

Future Plans

NLP4L Framework

• A pluggable framework that improves the search experience (mainly for Lucene-based search systems).
• Reference implementations of plug-ins and corpora are provided.
• Uses NLP/ML technologies to output models, dictionaries and indexes.
• Since NLP/ML are not perfect, an interface that lets users examine the output dictionaries themselves is provided as well.

[Figure: NLP4L Framework overview. Labels recovered from the diagram:]

• Data Source: Corpus (Text data, Lucene index), Query Log, Access Log
• TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection
• Learning to Rank, Personalized Search
• Dictionaries (with maintenance): Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment
• Model files, Tagged Corpus, Document Vectors
• Solr, ES
• Mahout, Spark

Keyword Attachment

• “Keyword attachment” is a general format that enables the following functions:
  • Learning to Rank
  • Personalized Search
  • Named Entity Extraction
  • Document Classification

[Figure: Lucene doc → Lucene doc + keyword (↑ Increase boost)]

Learning to Rank

• The program learns, from the access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d).

[Figure: Lucene doc d with attached keywords q, q, …]

https://en.wikipedia.org/wiki/Learning_to_rank

Personalized Search

• The program learns, from the access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d).
• Lucene does not let you specify score(q,d,u), so you have to specify score(qu,d) instead.
• Because the number of q-u combinations can be enormous, limit the data to high-order queries or divide fields by user.

[Figure: Lucene doc d1: q1u1, q2u2; Lucene doc d2: q2u1, q1u2]
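The score(qu,d) idea can be pictured as fusing the query and the user into a single keyword that is attached to a document. The sketch below is a toy in plain Scala, not a planned NLP4L API. The separator in `quKeyword`, the `attached` table (mirroring the d1/d2 example) and the additive `boost` are all assumptions made for illustration.

```scala
// Hypothetical helper: fuse query and user into one "qu" keyword.
def quKeyword(query: String, user: String): String = s"${query}__$user"

// Keywords attached to each document (doc ID -> keywords), e.g. learned from access logs.
val attached: Map[String, Set[String]] = Map(
  "d1" -> Set(quKeyword("q1", "u1"), quKeyword("q2", "u2")),
  "d2" -> Set(quKeyword("q2", "u1"), quKeyword("q1", "u2"))
)

// score(qu, d): the normal score, plus a boost when the qu keyword is attached to d.
def score(baseScore: Double, query: String, user: String, doc: String, boost: Double = 1.0): Double =
  if (attached.getOrElse(doc, Set.empty[String]).contains(quKeyword(query, user))) baseScore + boost
  else baseScore
```

With this table, query q1 issued by user u1 boosts d1 but not d2, while the same query issued by u2 boosts d2.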

Join and Code with Us!

Contact us at koji at apache dot org for the details.

Demo or Q & A

Thank you!