Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text
JURISIN 2008, Second International Workshop on Juris-informatics
Graduate School of Information Science, Nagoya University, Japan
Masato HAGIWARA, Yasuhiro OGAWA, Katsuhiko TOYAMA
Background
• Growing demand for translation of Japanese statutes
– Social and economic globalization
– Promotion of international investment toward Japan
– Technical assistance to developing and/or former socialist countries
• Japanese government effort
– “Study Council for Promoting Translation of Japanese Laws and Regulations into Foreign Language”
Bilingual Dictionary
• Standard Japanese-English bilingual dictionary of legal terms (SBD)
– Recommended to translators and lawyers
– More than 250 major statutes to be translated, 120 already released based on SBD
• High compiling/maintenance cost
• Should be technically supported
Dictionary Compilation Support
• Natural language processing technique
– Automatic extraction of bilingual lexicons by word alignment technique [Toyama et al. 2006]
– Japanese entries must be fixed before application
– Appropriate terms are still selected by hand
Supported by automatic dictionary term selection from unsegmented legal text
Defined Terms
• What kind of terms should be selected?
この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。(The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.)
(Act No. 239, 1950)
この法律において、次の各号に掲げる用語の意義は、当該各号に定めるところによる。 一 著作物 思想又は感情を創作的に表現したものであつて、文芸、学術、美術又は音楽の範囲に属するものをいう。 二 著作者 著作物を創作する者をいう。(In this Act, the meanings of the terms listed in the following items shall be as prescribed respectively in those items:
(i) “work” means a production in which thoughts or sentiments are expressed in a creative way and which falls within the literary, scientific, artistic or musical domain;
(ii) “author” means a person who creates the work;)
(Act No. 48, 1970)
Definition sentences
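The first definition style above (a quoted term followed by とは) can be picked up with a regular expression. The sketch below is only an illustration: the slides say later that defined terms were extracted by regular expression but do not give the expression, so `DEFINED_TERM` and `extract_defined_terms` are hypothetical names, and the itemized style of the second example is not covered.

```python
import re

# Hypothetical pattern (not from the slides): a term quoted with 「...」
# that is immediately followed by the definition marker とは.
DEFINED_TERM = re.compile(r"「([^「」]+)」とは")

def extract_defined_terms(text: str) -> list[str]:
    """Return every quoted term that is directly followed by とは."""
    return DEFINED_TERM.findall(text)

sentence = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
terms = extract_defined_terms(sentence)
```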
Pattern-based Term Extraction
“Important terms appear in similar contexts”
Seed instance: Commodity Exchange
Contexts of the seed instance:
… in accordance with the standards and methods specified by a Commodity Exchange …
… a market that a Commodity Exchange has opened for each single kind of …
… a Member, etc. of a Commodity Exchange or a Member, etc. of a facility equivalent …
… a Member, etc. of a facility equivalent to a Commodity Exchange in a foreign state …
Patterns:
specified by #
# has opened
Member, etc. of a #
equivalent to a #
Contexts matched by the patterns:
… one-third or more has been specified by articles of incorporation, at least such …
… the locations where Old Markets have been opened and Listed Commodities …
… person is a member of a commodity futures association (hereinafter referred to…
… in a foreign state equivalent to a Commodity Market; hereinafter the same shall apply …
Instances:
Articles of incorporation
Old Markets
commodity futures association
Commodity Market
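A toy word-level sketch of this idea, under stated assumptions: patterns are taken to be the `width` words to the left of the seed followed by `#`, and every word n-gram of up to `max_len` words after a matching left context becomes a candidate instance. The helper names and both parameters are illustrative and do not come from the slides.

```python
import re

def tokenize(text):
    """Split into word tokens, dropping punctuation (a simplification)."""
    return re.findall(r"\w+", text)

def induce_patterns(seed, sentences, width=3):
    """Left-context patterns: `width` words before the seed, then '#'."""
    seed_toks = tokenize(seed)
    k = len(seed_toks)
    patterns = set()
    for s in sentences:
        toks = tokenize(s)
        for j in range(width, len(toks) - k + 1):
            if toks[j:j + k] == seed_toks:
                patterns.add(" ".join(toks[j - width:j]) + " #")
    return patterns

def induce_instances(patterns, sentences, max_len=3):
    """Fill '#' with each word n-gram (n <= max_len) that follows a left context."""
    candidates = set()
    for s in sentences:
        toks = tokenize(s)
        for p in patterns:
            left = p.split()[:-1]  # drop the '#' slot
            w = len(left)
            for j in range(w, len(toks)):
                if toks[j - w:j] == left:
                    for n in range(1, max_len + 1):
                        if j + n <= len(toks):
                            candidates.add(" ".join(toks[j:j + n]))
    return candidates

seed_contexts = [
    "in accordance with the standards and methods specified by a Commodity Exchange",
    "a Member, etc. of a facility equivalent to a Commodity Exchange in a foreign state",
]
patterns = induce_patterns("Commodity Exchange", seed_contexts)
candidates = induce_instances(
    patterns,
    ["in a foreign state equivalent to a Commodity Market; hereinafter the same shall apply"],
)
```

Because the right boundary of an instance is unknown, the sketch returns "Commodity" and "Commodity Market hereinafter" alongside "Commodity Market". Candidates therefore have to be ranked or constrained.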
Bootstrapping-based Methods
• Espresso [Pantel and Pennacchiotti 2006]
– Extraction of lexical relations (binary)
– English news articles (segmented)
• Tchai [Komachi and Suzuki 2008]
– Extraction of semantic categories (unary)
– Japanese query logs (unsegmented but short)
• Long, unsegmented Japanese legal text → Conventional analyzers/parsers are not applicable
Objectives
• A new algorithm Monaka is proposed
– Based on Tchai algorithm
– Character n-gram based instance/pattern induction
– Constraint to ensure proper segmentation
• Evaluation to confirm its effectiveness for dictionary term extraction
Espresso Algorithm [Pantel and Pennacchiotti 2006]
Bootstrapping over the corpus:
Seed instances → Pattern Induction → Pattern Ranking → Instance Induction → Instance Ranking → Extracted instances → (repeat)

Pattern Ranking:
$$r_\pi(p) = \frac{1}{|I|} \sum_{i \in I} \frac{pmi(i, p)}{max_{pmi}} \, r_\iota(i)$$

Instance Ranking:
$$r_\iota(i) = \frac{1}{|P|} \sum_{p \in P} \frac{pmi(i, p)}{max_{pmi}} \, r_\pi(p)$$

Seed instances:
wheat :: crop
George Wendt :: star
nitrogen :: element
diborane :: substance
Patterns:
x is a y
y such as x
x and other y
Instances:
Picasso :: artist
tax :: charge
protein :: biopolymer
HCl :: strong acid
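The two reliability scores can be sketched as below. This is a minimal illustration, not the Espresso implementation: it assumes plain PMI over a list of (instance, pattern) co-occurrences, treats unseen pairs as PMI 0, and uses a toy data set with abbreviated names (`A`, `B`, `C` stand for three patterns).

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Plain PMI for each (instance, pattern) pair seen in `pairs` (assumption)."""
    n = len(pairs)
    joint = Counter(pairs)
    inst = Counter(i for i, _ in pairs)
    pat = Counter(p for _, p in pairs)
    return {(i, p): math.log(c * n / (inst[i] * pat[p]))
            for (i, p), c in joint.items()}

def pattern_reliability(p, instances, r_inst, pmi, max_pmi):
    """r_pi(p): average over instances of pmi(i, p) / max_pmi * r_iota(i)."""
    total = sum(pmi.get((i, p), 0.0) / max_pmi * r_inst[i] for i in instances)
    return total / len(instances)

def instance_reliability(i, patterns, r_pat, pmi, max_pmi):
    """r_iota(i): average over patterns of pmi(i, p) / max_pmi * r_pi(p)."""
    total = sum(pmi.get((i, p), 0.0) / max_pmi * r_pat[p] for p in patterns)
    return total / len(patterns)

# Toy co-occurrence data (illustrative only).
pairs = [("wheat", "A"), ("wheat", "B"), ("nitrogen", "A"),
         ("Picasso", "B"), ("tax", "C")]
pmi = pmi_table(pairs)
max_pmi = max(pmi.values())

seeds = ["wheat", "nitrogen"]
r_inst = {s: 1.0 for s in seeds}  # seed instances start fully reliable
r_pat = {p: pattern_reliability(p, seeds, r_inst, pmi, max_pmi) for p in "ABC"}

top_patterns = ["A", "B"]
r_picasso = instance_reliability("Picasso", top_patterns, r_pat, pmi, max_pmi)
r_tax = instance_reliability("tax", top_patterns, r_pat, pmi, max_pmi)
```

In this toy data, pattern `A` co-occurs with both seeds and ranks highest, and `tax` scores 0 because it only appears with the unselected pattern `C`.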
Tchai Algorithm [Komachi and Suzuki 2008]
• Applied Espresso to semantic category extraction from Japanese web query logs
• Some improvements over Espresso
– Query-based pattern induction
seed: JAL  query: JAL_flight  pattern: #_flight
– Local PMI Max
$$r_\pi(p) = \frac{1}{|I|} \sum_{i \in I} \frac{pmi(i, p)}{max_{pmi}(p)} \, r_\iota(i)$$
– Ambiguous instance/pattern filtering
  • Ambiguous instance: 1.5x patterns of prev. instances
  • Ambiguous pattern: 2.0x instances of prev. patterns
  • Improves the precision of the extracted instances
Monaka Algorithm – Pattern Induction
• Character n-gram based induction
Espresso → Segmented English text
Tchai → Short Japanese queries
この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。(The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.)
(Act No. 239, 1950)
Instance: 商品取引所 (Commodity Exchange)
Patterns (character n-grams, $2 \le n \le 6$):
て「 #
いて「 #
おいて「 #
…
# 」と
# 」とは
# 」とは、
…
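A minimal sketch of character n-gram pattern induction, assuming a pattern is either the n characters before an occurrence of the instance followed by `#`, or `#` followed by the n characters after it, with 2 ≤ n ≤ 6. The function name is illustrative, and patterns are written without the space the slide puts next to `#`.

```python
def induce_patterns(text, instance, n_min=2, n_max=6):
    """Collect left- and right-context character n-grams around each occurrence."""
    patterns = set()
    start = text.find(instance)
    while start != -1:
        end = start + len(instance)
        for n in range(n_min, n_max + 1):
            if start - n >= 0:
                patterns.add(text[start - n:start] + "#")  # preceding context
            if end + n <= len(text):
                patterns.add("#" + text[end:end + n])      # succeeding context
        start = text.find(instance, start + 1)
    return patterns

text = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
patterns = induce_patterns(text, "商品取引所")
```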
Monaka Algorithm – Instance Induction
この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。(The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.)
(Act No. 239, 1950)
Pattern: 律において「 # (# as used in this Act)
Instances (character n-grams, $2 \le n \le 10$):
商品
商品取
商品取引
商品取引所
商品取引所」
…
Incorrectly segmented instances are extracted as well
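Instance induction can be sketched the same way, assuming every character n-gram with 2 ≤ n ≤ 10 adjacent to a pattern match is a candidate. The function name and the one-sided pattern format (`left#` or `#right`) are assumptions made for illustration.

```python
def induce_instances(text, pattern, n_min=2, n_max=10):
    """Return all character n-grams that fill the '#' slot of a one-sided pattern."""
    left, right = pattern.split("#")
    context = left or right
    candidates = set()
    pos = text.find(context)
    while pos != -1:
        for n in range(n_min, n_max + 1):
            if left:
                start = pos + len(left)
                if start + n <= len(text):
                    candidates.add(text[start:start + n])  # n-gram after the context
            elif pos - n >= 0:
                candidates.add(text[pos - n:pos])          # n-gram before the context
        pos = text.find(context, pos + 1)
    return candidates

text = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
candidates = induce_instances(text, "律において「#")
```

Only one of the nine candidates (商品取引所) is correctly segmented. The rest are prefixes or overshoot the closing bracket.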
Bidirectional Adjacency Constraint (BAC)
• Constraint to ensure proper segmentation
… この法律において「商品取引所」とは、会員商品取引所 …
Instance i
$r_p(i)$: Preceding instance reliability
$r_s(i)$: Succeeding instance reliability
$r_\iota(i)$: Instance reliability
Combine as the generalized average:
$$r_\iota(i) = \left( \frac{r_p(i)^m + r_s(i)^m}{2} \right)^{1/m}$$
… 律において「商品取引所」とは、会員 …
r_p(i) | r_s(i) | r_ι(i)
high | high | high
high | low | low
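The generalized average can be sketched as follows. The slides do not state the value of m, so the values used here are assumptions. The sketch also assumes reliabilities are non-negative and returns 0 when m is negative and either side is 0.

```python
def generalized_mean(r_p, r_s, m):
    """Generalized (power) mean of preceding and succeeding reliabilities."""
    if m == 0:
        raise ValueError("m must be non-zero")
    if m < 0 and (r_p == 0 or r_s == 0):
        return 0.0  # limit value; avoids division by zero
    return ((r_p ** m + r_s ** m) / 2) ** (1 / m)

both_high = generalized_mean(0.8, 0.8, -1)    # both boundaries reliable
arithmetic = generalized_mean(0.9, 0.1, 1)    # m = 1: arithmetic mean
harmonic = generalized_mean(0.9, 0.1, -1)     # m = -1: harmonic mean
```

With m = 1 a high preceding score compensates for a low succeeding score (0.5), while m = -1 pulls the result toward the lower side (about 0.18). The second case matches the "high, low → low" row of the table.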
Monaka Algorithm – Ambiguous Patterns and Instances
• Character n-gram based pattern/instance induction
– Negative effect of generic instances/patterns is more serious, e.g. “て「 #”, “# 」と”
– The number of extracted instances is unpredictable
• Ambiguous pattern filtering
– Ambiguity = # of co-occurring instance types
– Discard 10 most ambiguous patterns after each induction
• Ambiguous instance filtering
– Ambiguity = # of statutes in which the instance appears (DF)
– Discard ones which appear in more than 70% of the statutes
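Both filters can be sketched in a few lines. The data structures (a dict from pattern to its co-occurring instance types, and a list of statute texts) and the function names are assumptions. Document frequency is approximated here by substring search.

```python
def filter_ambiguous_patterns(cooccurring, k=10):
    """Drop the k patterns that co-occur with the most instance types."""
    ranked = sorted(cooccurring, key=lambda p: len(cooccurring[p]), reverse=True)
    return set(ranked[k:])

def filter_ambiguous_instances(instances, statutes, max_df_ratio=0.7):
    """Keep instances that appear in at most max_df_ratio of the statutes."""
    kept = set()
    for inst in instances:
        df = sum(1 for text in statutes if inst in text)
        if df <= max_df_ratio * len(statutes):
            kept.add(inst)
    return kept

# Toy data (illustrative only); k=1 instead of 10 because there are 3 patterns.
cooccurring = {"て「#": {"a", "b", "c"}, "#」とは": {"a", "b"}, "律において「#": {"a"}}
kept_patterns = filter_ambiguous_patterns(cooccurring, k=1)

statutes = ["登記 商品取引所", "登記 著作物", "登記 著作者", "登記"]
kept_instances = filter_ambiguous_instances({"登記", "商品取引所"}, statutes)
```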
Experimental Settings
• Corpus
– 228 Japanese acts included in the translation project
– Article, paragraph, and item numbers → head markers
• Seed instances
– 100 defined terms chosen at random from the 1,225 defined terms extracted by regular expression
• Bootstrapping
– # of patterns: initially 100, incremented by 10
– # of instances: start with 100 seeds, 100 new instances cumulatively learned in each iteration
– A total of 10 iterations
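One reading of the bootstrapping settings, written out as a schedule. It assumes the pattern count grows by 10 per iteration starting from 100, and that the instance count includes the 100 seeds. Both are interpretations of the bullets above, and the function name is illustrative.

```python
def bootstrapping_schedule(iterations=10, init_patterns=100, pattern_step=10,
                           seeds=100, instances_per_iter=100):
    """Return (patterns used, instances held after the iteration) for each iteration."""
    schedule = []
    for t in range(iterations):
        n_patterns = init_patterns + pattern_step * t
        n_instances = seeds + instances_per_iter * (t + 1)
        schedule.append((n_patterns, n_instances))
    return schedule

schedule = bootstrapping_schedule()
```

Under this reading the last iteration uses 190 patterns and ends with 1,100 instances, 1,000 of them newly learned.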
Evaluation
1. Defined term reproducibility test
– How well the rest of the defined terms are reproduced, without depending on the definition sentences
– Gold standard: 1,225 defined terms
– Closed test
2. SBD coverage test
– How many of the SBD entries are covered
– Gold standard: all 3,510 SBD entries that appeared at least once in the corpus
– Open test
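Both tests compare extracted instances with a gold standard. A minimal sketch of the scoring, assuming exact string match between extracted instances and gold entries (the slides do not state the matching criterion):

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted instances against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Toy data (illustrative only).
precision, recall = precision_recall(
    ["商品取引所", "著作物", "商品取", "登記"],
    ["商品取引所", "著作物", "著作者"],
)
```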
Results – Defined Term Reproducibility
Extracted a quarter of the defined terms with a precision of 29.2%
Results – SBD Coverage
5% or more improvement
→ Supports the effectiveness of the constraint
Results – Extracted Instances
Monaka-BAC | Monaka+BAC
銀行等 (bank etc.) | 銀行等 (bank etc.)
証券会 (securities company) | 特定目的信託 (special purpose trust)
設立事務所 (established place of business) | 登記 (registration)
石油 (oil) | 紛争 (dispute)
同項第2号 (item (ii) of the same paragraph) | 地域 (area)
再生債務者の (rehabilitation debtor) | 破産手続 (bankruptcy proceedings)
処分に (disposition) | 都市 (city)
販売 (sale) | 外国 (foreign state)
再生計画 (rehabilitation plan) | 道路 (road)
廃棄 (disposal) | 港務局 (port bureau)
(The DT and SBD membership marks of the original table are not preserved in this transcript.)
Results – Extracted Patterns
• Mostly substrings of other patterns
• Most of the patterns are quite generic
– A single pattern may induce too many incorrect instances
Monaka+BAC
# に係る (concerning #)
# 又は (# or)
# 及び (# and)
# 」という。 (referred to as “#”)
当該 # (the #, that #)
に規定する # (# provided for in)
(...) において、 # (in ..., #)
Reliability measures are effective to rank patterns/instances
BAC is essential for extraction from unsegmented text
Conclusion
• Monaka algorithm was proposed
– Bootstrapping-based lexical knowledge acquisition
– Simple character n-gram based instance/pattern induction
– Constraint (BAC) to ensure proper segmentation
– Ambiguous pattern/instance filtering
• Evaluation results
– Improved precision/recall in both defined term reproducibility and SBD coverage
– BAC helped to extract many correctly segmented instances
• Future work: Application of Monaka to other domains
– Highly “fixed” format of Japanese statutes
– Investigation on the effect of “topic drift”: [Komachi et al. 2008] showed that bootstrapping tends to converge to generic instances