
Page 1: Slides.ltdca

On Semi-Supervised Learning of Legal Semantics

L. Thorne McCarty, Rutgers University

Page 2: Slides.ltdca

Three Papers

● 1998: Structured Casenotes: How Publishers Can Add Value to Public Domain Legal Materials on the World Wide Web.
● 2007: Deep Semantic Interpretations of Legal Texts.
● 2015: How to Ground a Language for Legal Discourse in a Prototypical Perceptual Semantics.

And a Proposal:

A research strategy to produce a computational summary of a legal case, which can be scaled up to a realistic legal corpus.

Page 3: Slides.ltdca

The Challenge

From my 1998 paper:

A structured casenote is a computational summary of the procedural history of a case along with the substantive legal conclusions articulated at each stage of the process. It would play the same role in the legal information systems of the 21st century that West Headnotes and Key Numbers have played in the 20th century.

Why focus on procedural history?

Page 4: Slides.ltdca

Think about the traditional “brief” that students are taught to write in their first year of law school. The traditional case brief focuses on the procedural context first:

Who is suing whom, and for what? What is the plaintiff's legal theory? What facts does the plaintiff allege to support this theory? How does the defendant respond? How does the trial court dispose of the case? What is the basis of the appeal? What issues of law are presented to the appellate court? How does the appellate court resolve these issues, and with what justification?

Within this procedural framework, we would represent the substantive issues at stake in the decision.

Page 5: Slides.ltdca

● For the computational summary, we need an expressive Knowledge Representation (KR) language.
● How can we build a database of structured casenotes at the appropriate scale?
  ● Fully automated processing of legal texts?
  ● Semi-automated, with a human editor in the loop?
● For either approach, we need a Natural Language (NL) technology that can handle the complexity of legal cases.
● But in 1998, neither the NL nor the KR technology was sufficiently advanced.

Page 6: Slides.ltdca

Two Steps Toward a Solution: ICAIL '07

Contributions:
● Showed that a “state-of-the-art statistical parser ... can handle even the complex syntactic constructions of an appellate court judge.”
● Showed that the “semantic interpretation of the full text of a judicial opinion can be computed automatically from the output of the parser.”

Technical specifications:
● Quasi-Logical Form (QLF).
● Definite Clause Grammar (DCG).

Page 7: Slides.ltdca

She has also brought this ADA suit in which she claims that her former employer, Policy Management Systems Corporation, discriminated against her on account of her disability.

526 U.S. 795 (1999)

Page 8: Slides.ltdca

Terms:

term(lex, var, list)

...

“She has also brought this ADA suit ... “
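As a concrete illustration, here is a hypothetical QLF for the fragment above, following the term(lex, var, list) schema and the sterm/nterm notation used on the later slides. This is my reconstruction, not the parser's actual output:

    % Hypothetical QLF for "She has also brought this ADA suit":
    % sterm/3 for the clause, nterm/3 for the noun phrases, with
    % the tense/aspect feature list attached by the '/' operator.
    example_qlf(sterm(brought, _E, [nterm(she, _X, []),
                                    nterm(suit, _S, [nterm(ada, _A, [])])])
                / [perfect]).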

Page 9: Slides.ltdca

The petitioner contends that the regulatory takings claim should not have been decided by the jury and that the Court of Appeals adopted an erroneous standard for regulatory takings liability.

526 U.S. 687 (1999)

sterm(decided, C, [_,_]) / [modal(should), negative, perfect, passive]
...
AND
sterm(adopted, J, [_,_]) ...

Page 10: Slides.ltdca

The court ruled that sufficient evidence had been presented to the jury from which it reasonably could have decided each of these questions in Del Monte Dunes' favor.

526 U.S. 687 (1999)

Page 11: Slides.ltdca

Semantics of 'WDT' and 'WHNP': W^nterm(which,W,[])

Semantics of 'IN': Obj^Subj^P^pterm(in,P,[Subj,Obj])

Unify: Obj = nterm(which,W,[])

Term = pterm(in,P,[Subj,Obj])

Semantics of 'WHPP':

W^Subj^P^pterm(in,P,[Subj,nterm(which,W,[])])
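To see how this composition works operationally, here is a minimal, runnable Prolog sketch of the WHPP step. It is an assumed reconstruction in the style of the slides, not the actual grammar code:

    % Lexical semantics, written as Prolog facts, with '^' as the
    % abstraction operator.
    whnp_sem(W^nterm(which, W, [])).
    in_sem(Obj^Subj^P^pterm(in, P, [Subj, Obj])).

    % WHPP = IN + WHNP: unification plugs the WHNP term into the
    % object slot of the preposition.
    whpp_sem(W^Subj^P^Term) :-
        whnp_sem(W^Obj),
        in_sem(Obj^Subj^P^Term).

    % ?- whpp_sem(S).
    % S = W^Subj^P^pterm(in, P, [Subj, nterm(which, W, [])]).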

Page 12: Slides.ltdca

Semantics of 'S': E^sterm(claims,E,[_,_])

Unify: Term = pterm(in,P,[E,nterm(which,W,[])])

Tense = [present]

Semantics of 'SBAR':

W^(E^(P^pterm(in,P,[E,nterm(which,W,[])]) & sterm(claims,E,[_,_]))/[present])
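Continuing the sketch from the previous slide, the SBAR step unifies the WHPP's subject slot with the event variable of the embedded clause and then attaches the tense feature. Again, this is an assumed reconstruction; whpp_sem/1 below is the derived result from the previous sketch:

    :- op(500, xfy, &).   % the conjunction operator used in the QLF notation

    % Derived WHPP semantics from the previous slide.
    whpp_sem(W^Subj^P^pterm(in, P, [Subj, nterm(which, W, [])])).

    % Semantics of 'S'.
    s_sem(E^sterm(claims, E, [_, _])).

    % SBAR = WHPP + S: Subj unifies with the event variable E,
    % and the tense list [present] is attached with '/'.
    sbar_sem(W^(Body/[present])) :-
        whpp_sem(W^E^P^PTerm),
        s_sem(E^STerm),
        Body = E^(P^PTerm & STerm).

    % ?- sbar_sem(S).
    % S = W^(E^(P^pterm(in,P,[E,nterm(which,W,[])]) &
    %           sterm(claims,E,[_,_]))/[present]).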

Page 13: Slides.ltdca

● How accurate are these semantic interpretations?
● Unfortunately, we do not have the data to answer this question.
● Consider a different strategy:
  ● Write hand-coded extraction patterns to map information from the QLF interpretations into the format of a structured casenote (see the sketch below).
  ● Generalize these extraction patterns by the unsupervised learning of the legal semantics implicit in a large set of unannotated legal cases.
  ● The total system would thus be engaged in a form of semi-supervised learning of legal semantics.
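To make the first step concrete, here is a hypothetical extraction pattern. It is purely illustrative: the predicate names and the casenote format are my assumptions, not McCarty's:

    % A hand-coded extraction pattern: if the QLF of an opinion
    % contains a clause "Plaintiff brought Suit", emit a casenote
    % field recording the claim. member/2 searches the list of
    % QLF terms.
    casenote_field(claim(Plaintiff, SuitType), QLF) :-
        member(sterm(brought, _E, [nterm(Plaintiff, _, _),
                                   nterm(suit, _, Mods)]) / _Tense, QLF),
        member(nterm(SuitType, _, _), Mods).

    % ?- casenote_field(F,
    %        [sterm(brought, e1, [nterm(she, x1, []),
    %             nterm(suit, s1, [nterm(ada, a1, [])])]) / [perfect]]).
    % F = claim(she, ada).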

Page 14: Slides.ltdca

Two Steps Toward a Solution: ICAIL '15

● New Article (less technical, more intuitive): “How to Ground a Language for Legal Discourse in a Prototypical Perceptual Semantics”

(An edited transcript of a presentation at the Legal Quanta Symposium at Michigan State University College of Law on October 29, 2015)

Forthcoming in 2016 Michigan State Law Review _____.

Includes links to my more technical papers.

Page 15: Slides.ltdca

● Prototype Coding:
  ● The basic idea is to represent a point in an n-dimensional space by measuring its distance from a prototype in several specified directions.
  ● Furthermore, assuming that our initial space is Euclidean, we want to select a prototype that lies at the origin of an embedded, low-dimensional, nonlinear subspace, which is in some sense “optimal”.
● The second point leads to a theory of:
  ● Manifold Learning
  ● Deep Learning
● The theory has three components, drawn from: Probability, Geometry, Logic.

Page 16: Slides.ltdca

● The Probabilistic Model: This is a diffusion process determined by a potential function, U(x), and its gradient, ∇U(x), in an arbitrary n-dimensional Euclidean space.

The invariant probability measure for the diffusion process is proportional to e^{2U(x)}, which means that ∇U(x) is proportional to the gradient of the log of the stationary probability density.
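A short check of the last claim (my gloss, writing Z for the normalizing constant of the stationary density):

    p(x) \propto e^{2U(x)}
    \;\Longrightarrow\;
    \nabla \log p(x) \;=\; \nabla \bigl( 2U(x) - \log Z \bigr) \;=\; 2\, \nabla U(x).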

Page 17: Slides.ltdca

● The Geometric Model: This is a Riemannian manifold with a Riemannian metric, gᵢⱼ(x), which we interpret as a measure of dissimilarity.

Using this dissimilarity metric, we can define a radial coordinate, ρ, and the directional coordinates, θ₁, θ₂, ..., θₙ₋₁, in our original n-dimensional space, and then compute an optimal nonlinear k-dimensional subspace.

The radial coordinate is defined to follow the gradient vector, ∇U(x), and the directional coordinates are defined to be orthogonal to ∇U(x).
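One way to read “prototype coding” in these coordinates (my gloss, not a formula from the slides): a point x near a prototype is encoded by its radial distance and its first k − 1 directional coordinates in the optimal subspace:

    x \;\mapsto\; \bigl( \rho(x),\, \theta_1(x),\, \ldots,\, \theta_{k-1}(x) \bigr).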

Page 18: Slides.ltdca

[Figure: the encoding pipeline. 60,000 images are sampled into 600,000 7×7 patches (49 dimensions), which are encoded into 12 dimensions; a scan over 14×14 patches yields 48 dimensions, encoded again into 12 dimensions; a final encoding yields the category (here: 4).]

Page 19: Slides.ltdca

● ∇U(x) is estimated from the data using the mean shift algorithm.
● ∇U(x) = 0 at a prototype.
● The prototypical clusters partition the space of 600,000 patches.

[Figure: 35 Prototypes]
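Why mean shift is the natural estimator here (a standard identity, sketched under the model's assumption that the density is proportional to e^{2U(x)}): for a Gaussian kernel density estimate p̂ with bandwidth h,

    m(x) \;=\; \frac{\sum_i x_i\, e^{-\|x - x_i\|^2 / 2h^2}}
                    {\sum_i e^{-\|x - x_i\|^2 / 2h^2}} \;-\; x
    \;=\; h^2\, \nabla \log \hat{p}(x)
    \;\approx\; 2h^2\, \nabla U(x),

so the mean shift vector vanishes exactly where ∇U(x) = 0, i.e., at the prototypes.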

Page 20: Slides.ltdca

[Figure: Prototype 09, Prototype 27, and Prototype 30, shown with their principal axes and the radial coordinate ρ.]

Page 21: Slides.ltdca

[Figure: geodesic coordinate curves for the directional coordinates θ.]

Page 22: Slides.ltdca

● The Logical Language: The proposed logical language is a categorical logic based on the category of differential manifolds (Man), which is weaker than a logic based on the category of sets (Set) or the category of topological spaces (Top).

For an intuitive understanding of what this means, assume that we have replaced the standard semantics of classical logic, based on sets and their elements, with a semantics based on manifolds and their points. The atomic formulas can then be interpreted as prototypical clusters, and the geometric properties of these clusters can be propagated throughout the rest of the language.

The same strategy can be applied to the entirety of my Language for Legal Discourse (LLD).

Page 23: Slides.ltdca

[Diagram: three layers, Logic, Geometry, Probability, linked by constraints.]

● Logic is constrained by the geometry.
● The geometric model is constrained by the probabilistic model.
● The probability measure is constrained by the data.

Conjecture: The existence of these mutual constraints makes it possible to learn the semantics of a complex knowledge representation language.

Page 24: Slides.ltdca

● Why is this a “prototypical perceptual semantics”?
  ● It is a prototypical semantics because it is based on a representation of prototypical clusters.
  ● It is a prototypical perceptual semantics because the primary illustrations of the theory are drawn from the field of image processing.

● Claim: If we can build a logical language on these foundations, we will have a plausible account of how human cognition could be grounded in human perception.

Page 25: Slides.ltdca

Can We Learn a Grounded Semantics Without a Perceptual Ground?

● Two reasons to think this is possible:
  ● The theory of differential similarity is not really sensitive to the precise details of the representations used at the lower levels.
  ● There is increasing evidence that the semantics of lexical items can be represented, approximately, as vectors in a high-dimensional vector space, using only the information available in the texts.

Page 26: Slides.ltdca

● Research Strategy:
  ● We initialize our model with a word embedding computed from legal texts.
  ● We learn the higher-level concepts in a legal domain by applying the theory of differential similarity.
● Discussion?