Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text JURISIN 2008 Second International Workshop on Juris-informatics Graduate School of Information Science, Nagoya University, Japan Masato HAGIWARA, Yasuhiro OGAWA, Katsuhiko TOYAMA

Page 1: Graduate School of Information Science, Nagoya University, Japan

Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text

JURISIN 2008 Second International Workshop on Juris-informatics

Graduate School of Information Science, Nagoya University, Japan

Masato HAGIWARA, Yasuhiro OGAWA, Katsuhiko TOYAMA

Page 2: Graduate School of Information Science, Nagoya University, Japan

Background

• Growing demand for translation of Japanese statutes
  – Social and economic globalization
  – Promotion of international investment toward Japan
  – Technical assistance to developing and/or former socialist countries

• Japanese government effort
  – “Study Council for Promoting Translation of Japanese Laws and Regulations into Foreign Language”

Page 3: Graduate School of Information Science, Nagoya University, Japan

Bilingual Dictionary

• Standard Japanese-English bilingual dictionary of legal terms (SBD)
  – Recommended to translators and lawyers
  – More than 250 major statutes to be translated, 120 already released based on SBD

• High compiling/maintenance cost

• Should be technically supported

Page 4: Graduate School of Information Science, Nagoya University, Japan

Dictionary Compilation Support

• Natural language processing technique
  – Automatic extraction of bilingual lexicons by word alignment technique [Toyama et al. 2006]
  – Japanese entries must be fixed before application
  – Appropriate terms are still selected by hand

Supported by automatic dictionary term selection from unsegmented legal text

Page 5: Graduate School of Information Science, Nagoya University, Japan

Defined Terms

• What kind of terms should be selected?

Definition sentences

この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。
(The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.)
(Act No. 239, 1950)

この法律において、次の各号に掲げる用語の意義は、当該各号に定めるところによる。
一 著作物 思想又は感情を創作的に表現したものであつて、文芸、学術、美術又は音楽の範囲に属するものをいう。
二 著作者 著作物を創作する者をいう。
(In this Act, the meanings of the terms listed in the following items shall be as prescribed respectively in those items:
(i) “work” means a production in which thoughts or sentiments are expressed in a creative way and which falls within the literary, scientific, artistic or musical domain;
(ii) “author” means a person who creates the work;)
(Act No. 48, 1970)
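The first definition style above (a quoted term followed by とは) is regular enough to be matched mechanically. The following is a minimal sketch of that idea only. The regular expression and the function name `extract_defined_terms` are illustrative assumptions, not the expression used by the authors, and the item-list style of the second example is not covered.

```python
import re

# Hypothetical pattern: a term in corner brackets immediately followed by "とは".
# This is an assumption for illustration, not the authors' actual expression.
DEFINED_TERM = re.compile(r"「([^「」]+)」とは")


def extract_defined_terms(text):
    """Return the quoted terms that are introduced by a '「...」とは' definition."""
    return DEFINED_TERM.findall(text)


SENTENCE = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
print(extract_defined_terms(SENTENCE))
```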

Page 6: Graduate School of Information Science, Nagoya University, Japan

Pattern-based Term Extraction

Commodity Exchange

… in accordance with the standards and methods specified by a Commodity Exchange …

… a market that a Commodity Exchange has opened for each single kind of …

… a Member, etc. of a Commodity Exchange or a Member, etc. of a facility equivalent …

… a Member, etc. of a facility equivalent to a Commodity Exchange in a foreign state …

“Important terms appear in similar contexts”

Page 7: Graduate School of Information Science, Nagoya University, Japan

Pattern-based Term Extraction

Commodity Exchange

… in accordance with the standards and methods specified by a Commodity Exchange …

… a market that a Commodity Exchange has opened for each single kind of …

… a Member, etc. of a Commodity Exchange or a Member, etc. of a facility equivalent …

… a Member, etc. of a facility equivalent to a Commodity Exchange in a foreign state …

Patterns

specified by #

# has opened

Member, etc. of a #

equivalent to a #

… one-third or more has been specified by articles of incorporation, at least such …

… the locations where Old Markets have been opened and Listed Commodities …

… person is a member of a commodity futures association (hereinafter referred to…

… in a foreign state equivalent to a Commodity Market; hereinafter the same shall apply …

“Important terms appear in similar contexts”

Page 8: Graduate School of Information Science, Nagoya University, Japan

Pattern-based Term Extraction

Commodity Exchange

… in accordance with the standards and methods specified by a Commodity Exchange …

… a market that a Commodity Exchange has opened for each single kind of …

… a Member, etc. of a Commodity Exchange or a Member, etc. of a facility equivalent …

… a Member, etc. of a facility equivalent to a Commodity Exchange in a foreign state …

Patterns

specified by #

# has opened

Member, etc. of a #

equivalent to a #

… one-third or more has been specified by articles of incorporation, at least such …

… the locations where Old Markets have been opened and Listed Commodities …

… person is a member of a commodity futures association (hereinafter referred to…

… in a foreign state equivalent to a Commodity Market; hereinafter the same shall apply …

Instances

Articles of incorporation

Old Markets

commodity futures association

Commodity Market

“Important terms appear in similar contexts”

Page 9: Graduate School of Information Science, Nagoya University, Japan

Bootstrapping-based Methods

• Espresso [Pantel and Pennacchiotti 2006]
  – Extraction of lexical relations (binary)
  – English news articles (segmented)

• Tchai [Komachi and Suzuki 2008]
  – Extraction of semantic categories (unary)
  – Japanese query logs (unsegmented but short)

• Long, unsegmented Japanese legal text → Conventional analyzers/parsers are not applicable

Page 10: Graduate School of Information Science, Nagoya University, Japan

Objectives

• A new algorithm Monaka is proposed
  – Based on Tchai algorithm
  – Character n-gram based instance/pattern induction
  – Constraint to ensure proper segmentation

• Evaluation to confirm its effectiveness for dictionary term extraction

Page 11: Graduate School of Information Science, Nagoya University, Japan

Espresso Algorithm [Pantel and Pennacchiotti 2006]

Bootstrapping over the corpus: Seed instances → Pattern Induction → Pattern Ranking → Instance Induction → Instance Ranking → Extracted instances (fed back into Pattern Induction)

Pattern Ranking

$$ r(p) = \frac{1}{|I|} \sum_{i \in I} \frac{pmi(i, p)}{max_{pmi}} \, r(i) $$

Instance Ranking

$$ r(i) = \frac{1}{|P|} \sum_{p \in P} \frac{pmi(i, p)}{max_{pmi}} \, r(p) $$

Seed instances

wheat :: crop

George Wendt :: star

nitrogen :: element

diborane :: substance

Patterns

x is a y

y such as x

x and other y

Instances

Picasso :: artist

tax :: charge

protein :: biopolymer

HCl :: strong acid
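The two ranking steps can be sketched in a few lines of Python. This is a simplified illustration, not the Espresso implementation: it uses plain PMI over a toy count table, treats unseen (instance, pattern) pairs as contributing zero, assumes the global maximum PMI is positive, and omits any further weighting the original algorithm applies. The count table and the function names are made up for the example.

```python
import math
from collections import defaultdict


def pmi_table(counts):
    """PMI for each observed pair; counts maps (instance, pattern) -> frequency."""
    total = sum(counts.values())
    inst_freq = defaultdict(int)
    pat_freq = defaultdict(int)
    for (inst, pat), c in counts.items():
        inst_freq[inst] += c
        pat_freq[pat] += c
    return {
        (inst, pat): math.log(c * total / (inst_freq[inst] * pat_freq[pat]))
        for (inst, pat), c in counts.items()
    }


def pattern_reliability(pmi, instance_rel):
    """r(p): mean over the given instances of pmi(i, p) / max_pmi * r(i)."""
    max_pmi = max(pmi.values())  # global maximum, assumed to be > 0
    patterns = {pat for _, pat in pmi}
    return {
        p: sum(pmi.get((i, p), 0.0) / max_pmi * r for i, r in instance_rel.items())
        / len(instance_rel)
        for p in patterns
    }


def instance_reliability(pmi, pattern_rel):
    """r(i): mean over the given patterns of pmi(i, p) / max_pmi * r(p)."""
    max_pmi = max(pmi.values())  # global maximum, assumed to be > 0
    instances = {inst for inst, _ in pmi}
    return {
        i: sum(pmi.get((i, p), 0.0) / max_pmi * r for p, r in pattern_rel.items())
        / len(pattern_rel)
        for i in instances
    }


# Toy co-occurrence counts (invented for illustration only).
COUNTS = {
    ("wheat", "x is a y"): 2,
    ("wheat", "y such as x"): 1,
    ("nitrogen", "x is a y"): 1,
    ("tax", "x and other y"): 3,
}
PMI = pmi_table(COUNTS)
SEEDS = {"wheat": 1.0, "nitrogen": 1.0}  # seed instances start fully reliable
PATTERN_REL = pattern_reliability(PMI, SEEDS)
INSTANCE_REL = instance_reliability(PMI, PATTERN_REL)
```

In this toy table the pattern "x and other y" never co-occurs with a seed, so it receives zero reliability and the instance it would induce ("tax") is ranked last.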

Page 12: Graduate School of Information Science, Nagoya University, Japan

Tchai Algorithm [Komachi and Suzuki 2008]

• Applied Espresso to semantic category extraction from Japanese web query logs

• Some improvements over Espresso
  – Query-based pattern induction
    seed: JAL, query: JAL_flight, pattern: #_flight
  – Local PMI Max

$$ r(p) = \frac{1}{|I|} \sum_{i \in I} \frac{pmi(i, p)}{max_{pmi}(p)} \, r(i) $$

  – Ambiguous instance/pattern filtering
    • Ambiguous instance: 1.5x patterns of prev. instances
    • Ambiguous pattern: 2.0x instances of prev. patterns

• Improves the precision of the extracted instances

Page 13: Graduate School of Information Science, Nagoya University, Japan

Monaka Algorithm – Pattern Induction

• Character n-gram based induction

Espresso → Segmented English text

Tchai → Short Japanese queries

この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。(The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.)

(Act No. 239, 1950)

Instance

商品取引所 (Commodity Exchange)

Patterns (character n-grams adjacent to the instance, 2 ≤ n ≤ 6)

て「 #

いて「 #

おいて「 #

# 」と

# 」とは

# 」とは、

…
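Character n-gram pattern induction can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: it assumes the n-gram length range is 2 ≤ n ≤ 6, writes the slot as `#` with no surrounding space, and collects contexts from every occurrence of the instance. The function name `induce_patterns` is hypothetical.

```python
def induce_patterns(text, instance, n_min=2, n_max=6):
    """Collect left and right character n-gram contexts of every occurrence."""
    patterns = set()
    start = text.find(instance)
    while start != -1:
        end = start + len(instance)
        for n in range(n_min, n_max + 1):
            if start - n >= 0:
                # n characters immediately preceding the instance
                patterns.add(text[start - n:start] + "#")
            if end + n <= len(text):
                # n characters immediately following the instance
                patterns.add("#" + text[end:end + n])
        start = text.find(instance, start + 1)
    return patterns


SENTENCE = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
PATTERNS = induce_patterns(SENTENCE, "商品取引所")
```

Because the instance also occurs inside 会員商品取引所 and 株式会社商品取引所, contexts from those occurrences are collected as well, which illustrates how quickly generic or noisy patterns accumulate.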

Page 14: Graduate School of Information Science, Nagoya University, Japan

Monaka Algorithm – Instance Induction

この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。(The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.)

(Act No. 239, 1950)

Pattern

律において「 # (# as used in this Act)

Instances (character n-grams adjacent to the pattern, 2 ≤ n ≤ 10)

商品

商品取

商品取引

商品取引所

商品取引所」

…

Incorrectly segmented instances are extracted as well
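The reverse step, instance induction from a pattern, can be sketched in the same way. The sketch assumes candidate lengths of 2 ≤ n ≤ 10 and a pattern written as a context string with a single `#` slot at one end. `induce_instances` is a hypothetical name, and no ranking or filtering is applied here.

```python
def induce_instances(text, pattern, n_min=2, n_max=10):
    """Return every character n-gram that fills the '#' slot of the pattern."""
    candidates = set()
    if pattern.endswith("#"):
        # Left-context pattern: candidates start right after the context.
        context = pattern[:-1]
        pos = text.find(context)
        while pos != -1:
            start = pos + len(context)
            for n in range(n_min, n_max + 1):
                if start + n <= len(text):
                    candidates.add(text[start:start + n])
            pos = text.find(context, pos + 1)
    elif pattern.startswith("#"):
        # Right-context pattern: candidates end right before the context.
        context = pattern[1:]
        pos = text.find(context)
        while pos != -1:
            for n in range(n_min, n_max + 1):
                if pos - n >= 0:
                    candidates.add(text[pos - n:pos])
            pos = text.find(context, pos + 1)
    return candidates


SENTENCE = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
CANDIDATES = induce_instances(SENTENCE, "律において「#")
```

Only one of the candidates (商品取引所) is a properly segmented term. The others are prefixes or overshoot the closing bracket, which is the problem the segmentation constraint on the following slides addresses.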

Page 15: Graduate School of Information Science, Nagoya University, Japan

Bidirectional Adjacency Constraint (BAC)

• Constraint to ensure proper segmentation

…   この法律において「商品取引所」とは、会員商品取引所 …

Instance i

r(i): Instance reliability

Page 16: Graduate School of Information Science, Nagoya University, Japan

Bidirectional Adjacency Constraint (BAC)

• Constraint to ensure proper segmentation

…   この法律において「商品取引所」とは、会員商品取引所 …

r_p(i): Preceding instance reliability

r_s(i): Succeeding instance reliability

Instance i

Combine as the generalized average

$$ r(i) = \left( \frac{r_p(i)^m + r_s(i)^m}{2} \right)^{1/m} $$

… 律において「商品取引所」とは、会員 …

r_p(i)   r_s(i)   r(i)
high     high     high
high     low      low
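The combination step is a power mean of the two directional reliabilities. The sketch below assumes non-negative reliabilities and leaves the exponent m as a parameter, since its value is not given on the slide. A negative m (m = -1 is used here purely as an example) reproduces the behaviour in the table: the combined score is high only when both sides are high.

```python
import math


def generalized_mean(r_p, r_s, m):
    """Power mean of preceding and succeeding reliabilities with exponent m."""
    if m == 0:
        # Limit case of the power mean: geometric mean.
        return math.sqrt(r_p * r_s)
    if m < 0 and (r_p == 0 or r_s == 0):
        # A zero on either side drives the mean to zero for negative m.
        return 0.0
    return ((r_p ** m + r_s ** m) / 2) ** (1 / m)


# Example values are invented; m = -1 (harmonic mean) is an assumption.
BOTH_HIGH = generalized_mean(0.9, 0.9, m=-1)
ONE_LOW = generalized_mean(0.9, 0.1, m=-1)
```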

Page 17: Graduate School of Information Science, Nagoya University, Japan

Monaka Algorithm – Ambiguous Patterns and Instances

• Character n-gram based pattern/instance induction
  – Negative effect of generic instance/pattern is more serious, e.g. “て「 #”, “# 」と”
  – The number of extracted instances is unpredictable

• Ambiguous pattern filtering
  – Ambiguity = # of co-occurring instance types
  – Discard 10 most ambiguous patterns after each induction

• Ambiguous instance filtering
  – Ambiguity = # of statutes in which the instance appears (DF)
  – Discard ones which appear in more than 70% of the statutes
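Both filters reduce to simple counting. The sketch below is one possible reading of the two rules, with hypothetical function names and toy data. It assumes the co-occurrences are available as (instance, pattern) pairs and that each instance is mapped to the set of statutes in which it appears.

```python
from collections import defaultdict


def keep_unambiguous_patterns(pairs, k=10):
    """Drop the k patterns that co-occur with the most distinct instances."""
    types = defaultdict(set)
    for inst, pat in pairs:
        types[pat].add(inst)
    ranked = sorted(types, key=lambda p: len(types[p]), reverse=True)
    return set(ranked[k:])


def keep_unambiguous_instances(instance_statutes, n_statutes, max_df_ratio=0.7):
    """Drop instances appearing in more than max_df_ratio of the statutes."""
    return {
        inst
        for inst, statutes in instance_statutes.items()
        if len(statutes) <= max_df_ratio * n_statutes
    }


# Toy data, invented for illustration.
PAIRS = [
    ("銀行等", "#又は"), ("登記", "#又は"), ("石油", "#又は"),
    ("登記", "#に係る"), ("石油", "当該#"),
]
DF = {"地域": set(range(8)), "港務局": {0, 1}}

KEPT_PATTERNS = keep_unambiguous_patterns(PAIRS, k=1)
KEPT_INSTANCES = keep_unambiguous_instances(DF, n_statutes=10)
```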

Page 18: Graduate School of Information Science, Nagoya University, Japan

Experimental Settings

• Corpus
  – 228 Japanese acts included in the translation project
  – Article, paragraph, and item numbers → head markers

• Seed instances
  – Randomly chosen 100 defined terms out of 1,225 defined terms extracted by regular expression

• Bootstrapping
  – # of patterns: initially 100, incremented by 10
  – # of instances: start with 100 seeds, 100 new instances cumulatively learned in each iteration
  – A total of 10 iterations
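The bootstrapping settings imply a fixed schedule of pattern and instance counts. The short sketch below spells out the arithmetic under one assumed reading: the first iteration uses 100 patterns, each later iteration uses 10 more, and the instance pool grows by 100 per iteration on top of the 100 seeds. The function name is hypothetical.

```python
def bootstrapping_schedule(n_seeds=100, new_instances=100,
                           n_patterns=100, pattern_step=10, iterations=10):
    """Return (iteration, patterns used, cumulative instances) per iteration."""
    schedule = []
    for it in range(1, iterations + 1):
        patterns = n_patterns + pattern_step * (it - 1)
        instances = n_seeds + new_instances * it
        schedule.append((it, patterns, instances))
    return schedule


SCHEDULE = bootstrapping_schedule()
for row in SCHEDULE:
    print(row)
```

Under this reading the final iteration uses 190 patterns and ends with 1,100 instances, of which 1,000 are newly extracted.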

Page 19: Graduate School of Information Science, Nagoya University, Japan

Evaluation

1. Defined term reproducibility test
  – How well the rest of the defined terms are reproduced, without depending on the definition sentences
  – Gold standard: 1,225 defined terms
  – Closed test

2. SBD coverage test
  – How many of the SBD entries are covered
  – Gold standard: all the 3,510 SBD entries that appeared at least once in the corpus
  – Open test
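Both tests come down to comparing the extracted instances against a gold-standard term set. A minimal sketch of that comparison follows. Excluding the seed instances from both sides is an assumption based on the phrase "the rest of the defined terms", and the function name and toy sets are invented for illustration.

```python
def precision_recall(extracted, gold, seeds=frozenset()):
    """Precision and recall of extracted terms against gold, ignoring seeds."""
    extracted = set(extracted) - set(seeds)
    gold = set(gold) - set(seeds)
    hits = extracted & gold
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Toy example: 'a'..'f' stand in for terms; 's' is a seed.
PRECISION, RECALL = precision_recall(
    extracted=["a", "b", "c", "d", "s"],
    gold={"a", "b", "e", "f", "s"},
    seeds={"s"},
)
```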

Page 20: Graduate School of Information Science, Nagoya University, Japan

Results – Defined Term Reproducibility

Extracted a quarter of the defined terms with a precision of 29.2%

Page 21: Graduate School of Information Science, Nagoya University, Japan

Results – SBD Coverage

5% or more improvement

→ Supports the effectiveness of the constraint


Page 22: Graduate School of Information Science, Nagoya University, Japan

Results – Extracted Instances

Monaka-BAC | Monaka+BAC
銀行等 (bank etc.) | 銀行等 (bank etc.)
証券会 (securities company) | 特定目的信託 (special purpose trust)
設立事務所 (established place of business) | 登記 (registration)
石油 (oil) | 紛争 (dispute)
同項第2号 (item (ii) of the same paragraph) | 地域 (area)
再生債務者の (rehabilitation debtor) | 破産手続 (bankruptcy proceedings)
処分に (disposition) | 都市 (city)
販売 (sale) | 外国 (foreign state)
再生計画 (rehabilitation plan) | 道路 (road)
廃棄 (disposal) | 港務局 (port bureau)

[DT and SBD columns: the per-instance marks are not preserved in the transcript]

Page 23: Graduate School of Information Science, Nagoya University, Japan

Result – Extracted Patterns

• Mostly substrings of other patterns

• Most of the patterns are quite generic
  – A single pattern may induce too many incorrect instances

Monaka+BAC

# に係る (concerning #)

# 又は (# or)

# 及び (# and)

# 」という。 (referred to as “#”)

当該 # (the #, that #)

に規定する # (# provided for in)

(...) において、 # (in ..., #)

Reliability measures are effective to rank patterns/instances

BAC is essential for extraction from unsegmented text

Page 24: Graduate School of Information Science, Nagoya University, Japan

Conclusion

• Monaka algorithm was proposed
  – Bootstrapping-based lexical knowledge acquisition
  – Simple character n-gram based instance/pattern induction
  – Constraint (BAC) to ensure proper segmentation
  – Ambiguous pattern/instance filtering

• Evaluation results
  – Improved precision/recall in both defined term reproducibility and SBD coverage
  – BAC helped to extract many correctly segmented instances

• Future work: Application of Monaka to other domains
  – Highly “fixed” format of Japanese statutes
  – Investigation on the effect of “topic drift”
    [Komachi et al. 2008] showed bootstrapping tends to converge to generic instances