An Attempt to Obtain A Similar Japanese Historical Material Using The Variable Order N-gram Taizo YAMADA National Institutes for the Humanities, JAPAN PNC2012


Page 1

An Attempt to Obtain A Similar Japanese Historical Material Using The Variable Order N-gram

Taizo YAMADA

National Institutes for the Humanities, JAPAN

PNC2012

Page 2

Contribution

• We prototyped a search system that has
  – a method of collecting similar materials using the Vector Space Model (VSM) for Japanese historical materials, and
  – a method of displaying similar materials on a timeline.
• To use the VSM, we introduced a word segmentation method based on a nonparametric Bayesian model.
  – The word segmentation technique uses unsupervised learning.
    • No dictionary is needed.
    • It tries to find hidden structure in unlabeled text.

Page 3

Outline

• Background, purpose
• Methodology
  – Show search results sorted by similarity using VSM
  – Word segmentation using a nonparametric Bayesian model
• Conclusion, future work

Page 4

Background

• Recently, the amount of encoded text of Japanese historical materials has grown.
  – Encoding is no longer limited to catalogues.
• Japanese historical full-text databases in Japan
  – SHIPS
    • Historiographical Institute, The University of Tokyo
  – AzumaKagami Database
    • National Institute of Japanese Literature
  – Kaneaki-kyo-ki Database
    • National Museum of Japanese History
  – …

Page 5

Example: SHIPS @ Historiographical Institute

27 DBs

Target: premodern Japanese historical materials

From: Nara period, to: Edo period

URL: http://wwwap.hi.u-tokyo.ac.jp/ships/shipscontroller

Page 6

Example: SHIPS @ Historiographical Institute

Full-text Databases

• The Dai Nihon Shiryo Unified Database
• The Full-text Database of the Old Japanese Diaries
• The Komonjo full-text database
• The Nara Period Komonjo full-text database
• The Heian Ibun full-text database
• The Kamakura Ibun full-text database

Page 7

The Komonjo full-text database (1)

keywords

Page 8

The Komonjo full-text database (2)

Page 9

The Komonjo full-text database (2)

The number of results and the search condition

Date

Bibliographic information, link to image

Text (the substring that was hit)

Page 12

The Komonjo full-text database (2)

The system can sort according to date or bibliographic information.

Date

Sorted by date

But… sorting according to similarity with the query is not supported.

Furthermore…
• Collecting similar texts is not supported.
• The system has no visual analysis method (e.g., display on a timeline).

Page 13

Purpose

• To analyze a huge amount of text efficiently, a method of obtaining and visualizing similar texts is important.
  – Which material is similar to a query?
  – Which material is similar to a material selected by a user?
• Our approach is as follows:
  – Show search results sorted by similarity with a query.
  – Show similar materials according to the selected search result.
  – Display the similar materials on the timeline.

Page 14

Sorting according to similarity (1)

• Let me search with the following conditions:
  – Query: “足利直義” (ASHIKAGA, Tadayoshi)
  – Target:
    • Full-text databases in SHIPS.
    • Temporal: “南北朝時代” (Nanboku-chō period, Japan, spanning from 1333 to 1392)
    • 7007 materials (catalogues)
    • Text: 4067 kinds of characters, 1,204,594 total characters

Page 16

Sorting according to similarity (2)

• The search results are sorted according to their similarity to the query.
• The similarity (score) is calculated using the Vector Space Model (VSM).

Let me select No. 2.

Page 18

Show similar materials (1)

• When a result is selected, the system presents the details of the result (title, text, temporal information, …) and the materials that are similar to the result.

Moreover, select No. 2.

Page 19

Show similar materials (2)

• It is displayed in the same way.

Page 20

Displaying on the timeline (1)

• It is displayed in the same way.

click

Page 21

Displaying on the timeline (2)

Page 22

Displaying on the timeline (2)

This timeline feature was developed using Simile Timeline, version 2.3.0 (with Ajax lib pre 2.3.0).

Page 23

Displaying on the timeline (2)

Selected material

Page 24

Displaying on the timeline (2)

Similar materials

Page 25

Displaying on the timeline (2)

Similar materials

The color of an icon shows the similarity of the material.
• [icon]: this material (font color: green)
• [icon]: top 1-5
• [icon]: top 6-9
• [icon]: top 11-

Page 26

Displaying on the timeline (2)

The materials have normalized temporal information. According to this temporal information, the materials are displayed on the timeline.
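
As a rough illustration of how ranked materials with normalized dates could be handed to a timeline widget, the sketch below builds a JSON event source. The field names (`dateTimeFormat`, `events`, `start`, `title`, `color`) follow what Simile Timeline's JSON event source is understood to accept and should be checked against its documentation. The colors, the ISO dates, the ranks, and the handling of rank 10 (which the slide's legend leaves unspecified) are all assumptions made for this example.

```python
import json

# Hypothetical colors: the slide's icon colors are not preserved in the transcript
# (only "font color: green" for the selected material is stated).
BUCKET_COLORS = {"selected": "green", "top 1-5": "red", "top 6-10": "orange", "top 11-": "gray"}


def rank_bucket(rank):
    """Map a similarity rank (0 = the selected material itself) to a legend bucket."""
    if rank == 0:
        return "selected"
    if rank <= 5:
        return "top 1-5"
    if rank <= 10:  # assumption: the legend on the slide skips rank 10
        return "top 6-10"
    return "top 11-"


def to_timeline_source(materials):
    """Build a dict shaped like a timeline JSON event source from ranked materials."""
    events = [
        {
            "start": m["date"],
            "title": m["title"],
            "color": BUCKET_COLORS[rank_bucket(m["rank"])],
        }
        for m in materials
    ]
    return {"dateTimeFormat": "iso8601", "events": events}


# Titles come from the slides; the dates and ranks are invented for illustration.
materials = [
    {"title": "足利直義御教書案(切紙)", "date": "1350-11-03", "rank": 0},
    {"title": "大内義弘挙状案", "date": "1390-08-17", "rank": 12},
]
print(json.dumps(to_timeline_source(materials), ensure_ascii=False, indent=2))
```

Keeping the rank-to-bucket mapping in one small function makes it easy to change the legend without touching the event construction.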

Page 27

What should be solved?

• To sort according to similarity, a method of similarity calculation is needed.
• For this, we use the Vector Space Model (VSM).
  – In the VSM, a material is represented as a vector.
  – An element of a vector (the weight of a term in the material) is calculated using the tf.idf measure.
  – To calculate the similarity between a query and a text, a vector is also created for the query.
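
The VSM steps listed on this slide can be sketched in a few lines. This is a minimal illustration, not the system's implementation: the weighting `tf * log(N / df)` and the cosine similarity are common choices assumed here, the function names are hypothetical, and the three tiny "documents" are term lists picked from the segmentation examples in this deck.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of term lists. Returns (sparse vectors, idf table)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(x, y):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def rank(query_terms, vectors, idf):
    """Return (score, document index) pairs, most similar first."""
    q = {t: tf * idf[t] for t, tf in Counter(query_terms).items() if t in idf}
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)


docs = [
    ["御教書", "案", "師直", "師泰", "誅伐事", "軍忠之状", "如件"],
    ["譲與", "所領", "事右", "美作国", "地頭", "譲状", "如件"],
    ["吉河", "讃岐", "入道", "軍忠", "御下文", "恐惶", "謹言"],
]
vectors, idf = tfidf_vectors(docs)
ranking = rank(["御教書"], vectors, idf)
```

The same `cosine` function serves both use cases in the deck: ranking materials against a query vector, and finding materials similar to a selected material by passing that material's vector instead.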

Page 30

Similarity Calculation using VSM

• The similarity between x and y is calculated by the following formula.

[Formula not preserved in the transcript.]

  – [symbol]: the frequency of term i in document (material) x
  – [symbol]: the frequency of document (material) x
  – [symbol]: the number of documents (materials)
• Currently, the number of terms in the system is 119,282.

By the way… the tf.idf measure needs terms from the text.

Is it possible to extract terms from Japanese historical text?
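
The similarity formula itself did not survive in this transcript. For orientation only, a common form that is consistent with the three quantities listed on the slide is cosine similarity over tf.idf weights. The symbol names below are assumptions, and the second quantity is read here as the document frequency of term i (the number of materials containing it):

```latex
w_{i,x} = \mathrm{tf}_{i,x} \cdot \log\frac{N}{\mathrm{df}_i},
\qquad
\mathrm{sim}(x, y) =
  \frac{\sum_i w_{i,x}\, w_{i,y}}
       {\sqrt{\sum_i w_{i,x}^2}\;\sqrt{\sum_i w_{i,y}^2}}
```

Here tf_{i,x} is the frequency of term i in material x, df_i is the number of materials containing term i, and N is the number of materials. The slide's actual formula may differ in its weighting or normalization.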

Page 31

Word segmentation (1)

• An example of the text of a Japanese historical material:

足利直義御教書案(切紙) 島津家文書 _1

御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年十一月三日御判島津左京進入道殿

• Term extraction is very hard…
  – There is no morphological analyzer for the text of Japanese historical materials.
  – Moreover, we do not have any dictionaries (for persons, locations, …) that cover the Nanboku-chō period.
• Here, we use the word segmentation results of the text as terms.
  – In Japanese historical text, there are
    • many nouns,
    • some auxiliary verbs, some verbs, and few others.

Page 32

Word segmentation (3)

• We introduce a word segmentation technique using NPYLM (Nested Pitman-Yor Language Model) [1].
  – NPYLM generates probability distributions with a nonparametric Bayesian model and is a word n-gram model that needs no dictionaries (an unsupervised learning technique).
    • A word is represented by a character variable-order n-gram in the model.
    • On the word “nonparametric”: it means “many parameters”, not “no parameters”.
• An example: the result of word segmentation using NPYLM is as follows:

text: 御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年十一月三日御判島津左京進入道殿

result: 御教書 | 案 | 師直師 | 泰誅伐事 | 早馳 | 参御方 | 可致 | 軍忠之 | 状如件 | 観応元年十一月三 | 日 | 御判 | 島津 | 左京進 | 入道殿

correct?: 御教書 | 案 | 師直 | 師泰 | 誅伐事 | 早馳参 | 御方 | 可致 | 軍忠之状 | 如件 | 観応 | 元年 | 十一月 | 三日 | 御判 | 島津 | 左京進 | 入道 | 殿

It is not so bad!?

There are a number of correct segmentations.

[1] Daichi Mochihashi, Takeshi Yamada, Naonori Ueda: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling". ACL-IJCNLP 2009, pp. 100-108, 2009.
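
To make the link between a segmentation result and the terms used by the VSM concrete, the sketch below turns the two "|"-delimited segmentations from this slide into term lists, checks that each one reproduces the original text, and counts the terms they share. The helper name `terms_from_segmentation` is hypothetical.

```python
from collections import Counter

TEXT = "御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年十一月三日御判島津左京進入道殿"
NPYLM_RESULT = (
    "御教書 | 案 | 師直師 | 泰誅伐事 | 早馳 | 参御方 | 可致 | 軍忠之 | 状如件 | "
    "観応元年十一月三 | 日 | 御判 | 島津 | 左京進 | 入道殿"
)
REFERENCE = (
    "御教書 | 案 | 師直 | 師泰 | 誅伐事 | 早馳参 | 御方 | 可致 | 軍忠之状 | 如件 | "
    "観応 | 元年 | 十一月 | 三日 | 御判 | 島津 | 左京進 | 入道 | 殿"
)


def terms_from_segmentation(segmented):
    """Split a '|'-delimited segmentation into a list of terms."""
    return [piece.strip() for piece in segmented.split("|") if piece.strip()]


npylm_terms = terms_from_segmentation(NPYLM_RESULT)
reference_terms = terms_from_segmentation(REFERENCE)

# A segmentation only inserts boundaries, so joining the terms must give back the text.
npylm_ok = "".join(npylm_terms) == TEXT
reference_ok = "".join(reference_terms) == TEXT

# Term frequencies are what the tf.idf weighting consumes.
term_freq = Counter(npylm_terms)

# Terms on which the two segmentations agree.
shared = sorted(set(npylm_terms) & set(reference_terms))
print(npylm_ok, reference_ok, shared)
```

Comparing term sets is only a crude agreement measure; comparing boundary positions would be stricter, since the same string can appear as a term at different places.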

Page 33

The technique of word segmentation (1)

• Forward filtering-Backward sampling
  – This is often used in Bayesian learning of HMMs (Hidden Markov Models).
  – 2 phases: forward filtering, backward sampling
• Forward filtering phase: computing the forward variables α[t][k]
  – The variable α[t][k] is the probability of string s (= c1, …, ct) with the final k characters being a word.
  – α[t][k] is computed from α[t-k][j]: aggregating marginal probabilities!

[Figure: the string s = 御教書案師直師泰誅伐事早馳参… with character positions 1, t-k-j+1, t-k, t-k+1, t, t+1. The last k characters form one word and the j characters before them form the previous word, e.g. 御教書案師 | 直師泰 and 御教書案師直 | 師泰.]

Here, the probability of the bigram is used. How is the probability computed? Using NPYLM.
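
The recursion behind the forward variables is not legible in this transcript. As given in Mochihashi et al. (2009), the paper this deck cites for NPYLM, it has the following form for the bigram case, written with the slide's index names t, k and j:

```latex
\alpha[t][k] = \sum_{j=1}^{t-k}
  p\!\left(c_{t-k+1}^{t} \mid c_{t-k-j+1}^{t-k}\right)\,\alpha[t-k][j],
\qquad \alpha[0][0] = 1
```

Here c_{t-k+1}^{t} is the candidate word made of the last k characters, and c_{t-k-j+1}^{t-k} is the candidate previous word of length j. Summing over j marginalizes out where the previous word started, which is presumably what the slide's "aggregating marginal probabilities" refers to.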

Page 34

The technique of word segmentation (2)

• Backward sampling phase: determining the value of k in α[t][k]
  – The value of k is drawn in proportion to the forward variable α[t][k].
  – Thus,
    • Start state: [expression not preserved]
      – N: the length of string s (thus, s = c1, …, cN)
      – $: a sentence boundary. $ is at the end of the string.
    • Terminate condition: t = 0
    • [symbol not preserved]: string length
  – If k is determined, the word is determined too.

[Figure: the string s = 御教書案師直師泰誅伐事早馳参… with character positions 1, t-k+1, t, t+1. The candidate word c_{t-k+1}^{t} of length k is followed by the word w_i, e.g. 御教書案師 | 直師泰 and 御教書案師直 | 師泰.]

Here, P(w_i | c_{t-k+1}^{t}) is the probability of the bigram. How is the probability calculated? Using NPYLM.
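
The two phases can be sketched end to end on a short string. This is a toy under stated assumptions, not the system's code: `toy_bigram_prob` is a hypothetical, unnormalized stand-in for the NPYLM bigram probability, `KNOWN_WORDS` and the word-length cap `max_len` are invented for the example, and drawing k in proportion to p(next word | candidate) × α[t][k] follows Mochihashi et al. (2009).

```python
import random

BOUNDARY = "$"  # sentence boundary marker, as on the slide
KNOWN_WORDS = {"御教書", "案", "師直", "師泰"}  # invented for this toy model


def toy_bigram_prob(word, prev):
    """Hypothetical, unnormalized stand-in for the NPYLM bigram probability p(word | prev)."""
    if word == BOUNDARY:
        return 1.0
    score = 0.5 if word in KNOWN_WORDS else 0.05 ** len(word)
    # Slightly favor a known word following a known word, so the context matters.
    return score * (2.0 if prev in KNOWN_WORDS else 1.0)


def forward_filter(s, prob, max_len=4):
    """alpha[t][k]: probability of s[:t] with its last k characters forming a word."""
    n = len(s)
    alpha = [[0.0] * (max_len + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            word = s[t - k:t]
            if t - k == 0:
                # The word starts the string, so its context is the boundary.
                alpha[t][k] = prob(word, BOUNDARY)
                continue
            # Marginalize over the length j of the previous word.
            alpha[t][k] = sum(
                prob(word, s[t - k - j:t - k]) * alpha[t - k][j]
                for j in range(1, min(max_len, t - k) + 1)
            )
    return alpha


def backward_sample(s, alpha, prob, rng, max_len=4):
    """Draw word lengths from the end of the string back to t = 0."""
    t, next_word, words = len(s), BOUNDARY, []
    while t > 0:
        ks = list(range(1, min(max_len, t) + 1))
        weights = [prob(next_word, s[t - k:t]) * alpha[t][k] for k in ks]
        k = rng.choices(ks, weights=weights)[0]
        next_word = s[t - k:t]
        words.append(next_word)
        t -= k
    return words[::-1]


s = "御教書案師直師泰"
alpha = forward_filter(s, toy_bigram_prob)
words = backward_sample(s, alpha, toy_bigram_prob, random.Random(0))
print(" | ".join(words))
```

Because the last step is a random draw, repeated calls with different seeds give different segmentations; that variability is what lets the sampler explore alternatives instead of always returning the single best path.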

Page 35

The technique of word segmentation (3)

• NPYLM is an extension of HPYLM (Hierarchical Pitman-Yor Language Model) [1].
  – In HPYLM, for a context h and a word w, the probability that w follows h is p(w|h), which is computed by

    [Formula not preserved in the transcript.]

    – [symbol]: the suffix of h consisting of all but the first word
    – [symbols]: parameters of HPYLM … many parameters, so “nonparametric”
    • The n-gram frequency is discounted and interpolated with the (n-1)-gram frequency.
• NPYLM has a character n-gram model and a word n-gram model.
  – The character n-gram is embedded in the base measure of the word n-gram.
    • Thus, the 0-gram of the word n-gram is the character n-gram.
  – HPYLM is used to represent both the character n-gram and the word n-gram.
• Details are given in Daichi Mochihashi, Takeshi Yamada, Naonori Ueda: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling". ACL-IJCNLP 2009, pp. 100-108, 2009.

[1] Y. W. Teh: "A Hierarchical Bayesian Language Model Based on Pitman-Yor Processes", ACL 2006.
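
The HPYLM formula is missing from this transcript. For reference, the predictive probability in the cited work (Teh 2006, also restated in Mochihashi et al. 2009) has the following form; the symbol names are the ones used in those papers and are not recoverable from the slide itself:

```latex
p(w \mid h) = \frac{c(w \mid h) - d \cdot t_{hw}}{\theta + c(h)}
            + \frac{\theta + d \cdot t_{h\cdot}}{\theta + c(h)} \, p(w \mid h')
```

Here c(w|h) is the count of w after context h, c(h) is the sum of c(w|h) over w, t_{hw} is the number of "tables" serving w in the Chinese-restaurant representation of context h, t_{h·} is their sum over w, d is the discount parameter, θ is the strength parameter, and h' is the shortened context. The first term is the discounted n-gram estimate, and the second interpolates with the (n-1)-gram probability p(w|h').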

Page 36

The technique of word segmentation (4)

• Finally, the segmented string can be obtained.

text: 御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年十一月三日御判島津左京進入道殿

segment: 御教書 | 案 | 師直師 | 泰誅伐事 | 早馳 | 参御方 | 可致 | 軍忠之 | 状如件 | 観応元年十一月三 | 日 | 御判 | 島津 | 左京進 | 入道殿

• Note: the procedure is carried out by an MCMC (Markov chain Monte Carlo) method.
  – Using all the training data, the various parameters are sampled. We use Gibbs sampling.
    • All parameters are fixed except the parameter targeted for sampling.
    • That parameter is sampled given the training data.
    • The next sampling target is chosen.
    • The above is repeated until all the parameters have been sampled.
  – The procedure should be repeated many times,
    • until “you are satisfied” (e.g., the change in the parameters is quite small) or the parameters have converged.
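
The sampling loop described here can be illustrated with a much simpler model than NPYLM. The sketch below is a toy blocked Gibbs sampler over sentence segmentations using a unigram word model with a character-level base measure; it is not NPYLM (no bigrams, no Pitman-Yor discounting), and the class name, hyperparameters, and three short training strings are assumptions for illustration. Like many practical samplers, it keeps the counts fixed while one sentence is resampled.

```python
import random
from collections import Counter


class UnigramSegmenter:
    """Toy blocked Gibbs sampler over segmentations with a unigram word model."""

    def __init__(self, sentences, alpha=1.0, p_stop=0.5, max_len=4, seed=0):
        self.sentences = sentences
        self.alpha, self.p_stop, self.max_len = alpha, p_stop, max_len
        self.rng = random.Random(seed)
        self.n_chars = len({c for s in sentences for c in s})
        # Initial state: every character is its own word.
        self.segs = [list(s) for s in sentences]
        self.counts = Counter(w for seg in self.segs for w in seg)
        self.total = sum(self.counts.values())

    def base(self, word):
        """Character-level base measure: geometric length, uniform characters."""
        n = len(word)
        return self.p_stop * (1 - self.p_stop) ** (n - 1) * (1.0 / self.n_chars) ** n

    def prob(self, word):
        """Predictive word probability given the current counts."""
        return (self.counts[word] + self.alpha * self.base(word)) / (self.total + self.alpha)

    def sample_sentence(self, s):
        """Forward filtering-backward sampling for one sentence (unigram case)."""
        n = len(s)
        fwd = [1.0] + [0.0] * n
        for t in range(1, n + 1):
            fwd[t] = sum(self.prob(s[t - k:t]) * fwd[t - k]
                         for k in range(1, min(self.max_len, t) + 1))
        words, t = [], n
        while t > 0:
            ks = list(range(1, min(self.max_len, t) + 1))
            weights = [self.prob(s[t - k:t]) * fwd[t - k] for k in ks]
            k = self.rng.choices(ks, weights=weights)[0]
            words.append(s[t - k:t])
            t -= k
        return words[::-1]

    def sweep(self):
        """One Gibbs sweep: resample every sentence given all the others."""
        order = list(range(len(self.sentences)))
        self.rng.shuffle(order)
        for i in order:
            for w in self.segs[i]:  # remove this sentence's words from the model
                self.counts[w] -= 1
                self.total -= 1
            self.segs[i] = self.sample_sentence(self.sentences[i])
            for w in self.segs[i]:  # add the newly sampled words back
                self.counts[w] += 1
                self.total += 1


sentences = ["御教書案師直師泰", "師直師泰誅伐事", "御教書案"]
model = UnigramSegmenter(sentences)
for _ in range(20):  # "repeat many times"; 20 sweeps is an arbitrary choice
    model.sweep()
print([" | ".join(seg) for seg in model.segs])
```

In practice the stopping rule would watch a quantity such as the corpus likelihood across sweeps instead of using a fixed number of iterations.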

Page 37

Examples: Word segmentation

13750880550 赤堀文書 大日本史料 6_46

text: 譲與所領事右美作国塩湯郷壱分地頭職并公文職内半分者兄義季観応二年二月廿一日戴御判今半分事康季同日戴御判者也彼御判同御施行遵行軍忠之御判等相副之甥為帯刀左衛門尉季治於猶子譲與者也聊一族親類中不可有違乱妨儀也仍為後日譲状如件永和元年八月十日藤原康季

segment: 譲與 | 所領 | 事右 | 美作国 | 塩湯郷壱分 | 地頭 | 職并 | 公文 | 職内 | 半分者 | 兄義季 | 観応二 | 年二月 | 廿一日 | 戴御判 | 今半分事 | 康季同日 | 戴御 | 判者也 | 彼 | 御判 | 同 | 御施行 | 遵行 | 軍忠之 | 御判 | 等相副之 | 甥為 | 帯刀 | 左衛門 | 尉季 | 治於 | 猶子 | 譲與 | 者也 | 聊一 | 族親 | 類中 | 不可有 | 違乱 | 妨儀也 | 仍為後日 | 譲状 | 如件 | 永和元 | 年八月 | 十日 | 藤原 | 康季

13900080170 大内義弘挙状案 吉川家文書 _2

text: 吉河讃岐入道玄龍申安芸国山県郡内志路原庄事依軍忠預置之候御下文事申御沙汰候者可然候恐惶謹言明徳元年八月十七日左京権大夫義弘在判進上御奉行所

segment: 吉河 | 讃岐 | 入道 | 玄龍申 | 安芸国 | 山県郡内 | 志路原庄事 | 依 | 軍忠 | 預置 | 之候 | 御下文 | 事申 | 御沙汰 | 候者 | 可然候 | 恐惶 | 謹言 | 明徳元 | 年八月 | 十七日 | 左京 | 権大夫 | 義弘 | 在判 | 進上御 | 奉行所

Segmentation may become finer if some dictionaries can be introduced. Should we take on semi-supervised learning!?

Page 38

Conclusion, Future Work

• To support searching and showing results from a huge amount of text of Japanese historical materials, we introduced a search system using VSM and a timeline.
• To use VSM, we introduced word segmentation using NPYLM.
• All the methods we introduced are at the “sketch” stage.
  – The search system is a prototype.
  – We introduced and prototyped the methods, but have not evaluated them.
    • The evaluation is very hard!?
  – We will consider introducing semi-supervised learning.
    • For finer word segmentation…
  – We will also investigate methods of text analysis for Japanese historical materials.
    • For example:
      – topic extraction,
      – with temporal data: topic detection / topic tracking, …
    • For discovering similar materials “semantically”…

Page 39

Thank you for listening to my presentation.

– E-mail: [email protected]
