2017-01-25-systemt-overview-stanford
TRANSCRIPT
![Page 1: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/1.jpg)
1
![Page 2: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/2.jpg)
2
![Page 3: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/3.jpg)
[Pie chart: Bank1 15%, Bank2 15%, Bank3 5%, Bank4 21%, Bank5 20%, Bank6 23%]
3
![Page 4: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/4.jpg)
Operations Analysis
4
![Page 5: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/5.jpg)
5
![Page 6: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/6.jpg)
6
![Page 7: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/7.jpg)
7
We are raising our tablet forecast.
[Figure: dependency tree for the sentence above, with nodes We, are, raising, our, tablet, forecast and labels S, NP, DET, subj, obj, pred]
Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected
Time: Oct 1 04:12:24
Host: 9.1.1.3
Process: 41865
Category: %PLATFORM_ENV-1-DUAL_PWR
Message: Faulty internal power supply B detected
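The field breakdown above can be reproduced with a single regular expression. This is a minimal Python sketch, not SystemT code: the pattern and the names `LOG_PATTERN` and `parse_log_line` are assumptions made for this one example line, and real log formats vary.

```python
import re

# Hypothetical pattern written for the single log line on the slide.
LOG_PATTERN = re.compile(
    r"(?P<time>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "   # e.g. "Oct 1 04:12:24"
    r"(?P<host>\d{1,3}(?:\.\d{1,3}){3}) "             # dotted IPv4 address
    r"(?P<process>\d+): "                             # numeric process id
    r"(?P<category>%[A-Z0-9_-]+): "                   # e.g. "%PLATFORM_ENV-1-DUAL_PWR"
    r"(?P<message>.*)"                                # rest of the line
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

record = parse_log_line(
    "Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: "
    "Faulty internal power supply B detected"
)
```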
![Page 8: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/8.jpg)
8
Singapore 2012 Annual Report (136-page PDF)
Identify line item for Operating expenses from Income statement (financial table in PDF document)
Identify note breaking down Operating expenses line item, and extract opex components
![Page 9: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/9.jpg)
9
![Page 10: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/10.jpg)
10
Intel's 2013 capex is elevated at 23% of sales, above average of 16%
FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
I'm still hearing from clients that Merrill's website is better.
Customer or competitor?
Good or bad?
Entity of interest
![Page 11: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/11.jpg)
11
![Page 12: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/12.jpg)
12
![Page 13: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/13.jpg)
13
![Page 14: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/14.jpg)
14
![Page 15: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/15.jpg)
15
![Page 16: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/16.jpg)
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet,
sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc
volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante.
Suspendisse
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in
sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in
Tokenization (preprocessing step)
Level 1:
Gazetteer[type = LastGaz] → Last
Gazetteer[type = FirstGaz] → First
Token[~ "[A-Z]\w+"] → Caps
• Rule priority used to prefer First over Caps.
• Lossy Sequencing: annotations dropped because input to next stage must be a sequence
– First preferred over Last since it was declared earlier
16
![Page 17: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/17.jpg)
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet,
sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc
volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante.
Suspendisse
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in
sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in e
sagittis Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra
lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque
id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent
Tokenization (preprocessing step)
Level 1:
Gazetteer[type = LastGaz] → Last
Gazetteer[type = FirstGaz] → First
Token[~ "[A-Z]\w+"] → Caps
Level 2:
First Last → Person
First Caps → Person
First → Person
Rigid Rule Priority and Lossy Sequencing in Level 1 caused partial results
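The partial-result problem can be reproduced in a few lines. This is a hedged Python sketch of a two-level cascade, not an actual cascading-grammar engine: the gazetteer contents and the function names `level1` and `level2` are assumptions chosen to match the example sentence.

```python
import re

FIRST_GAZ = {"Mark", "Scott", "Howard"}   # assumed gazetteer entries
LAST_GAZ = {"Scott", "Smith"}             # assumed gazetteer entries
CAPS = re.compile(r"[A-Z]\w+")

def level1(tokens):
    """Give each token at most one label; First beats Last beats Caps."""
    labeled = []
    for tok in tokens:
        if tok in FIRST_GAZ:
            label = "First"        # "Scott" lands here, so its Last label is lost
        elif tok in LAST_GAZ:
            label = "Last"
        elif CAPS.fullmatch(tok):
            label = "Caps"
        else:
            label = None
        labeled.append((tok, label))
    return labeled

def level2(labeled):
    """Apply First Last, First Caps, then First, scanning left to right."""
    persons, i = [], 0
    while i < len(labeled):
        tok, label = labeled[i]
        nxt = labeled[i + 1] if i + 1 < len(labeled) else (None, None)
        if label == "First" and nxt[1] in ("Last", "Caps"):
            persons.append(tok + " " + nxt[0])
            i += 2
        elif label == "First":
            persons.append(tok)
            i += 1
        else:
            i += 1
    return persons

tokens = re.findall(r"\w+|[^\w\s]", "Tomorrow, we will meet Mark Scott, Howard Smith")
persons = level2(level1(tokens))
```

Under these assumptions "Mark Scott" comes out as two separate one-word persons, because Level 1 kept only the First label for "Scott", while "Howard Smith" is found in full.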
17
![Page 18: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/18.jpg)
18
![Page 19: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/19.jpg)
19
![Page 20: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/20.jpg)
AQL Language: Specify extractor semantics declaratively (express logic of computation, not control flow)
Optimizer: Choose efficient execution plan that implements semantics
Operator Graph: Optimized execution plan executed at runtime
20
![Page 21: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/21.jpg)
21
![Page 22: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/22.jpg)
Document(text: String)
Person(first: Span, last: Span, fullname: Span)
22
![Page 23: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/23.jpg)
23
![Page 24: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/24.jpg)
Dictionary: Mark, Scott, Anna, …
Input Tuple: Document = "… we will meet Mark Scott and …"
Output Tuple 1: Document, Span 1
Output Tuple 2: Document, Span 2
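A dictionary operator of this kind takes one document tuple and emits one output tuple per match. The Python sketch below is an illustration only: `dictionary_matches` is a hypothetical name, matching is limited to single words, and case-insensitive comparison is an assumption.

```python
import re

def dictionary_matches(text, entries):
    """Return (begin, end) character spans of words found in the dictionary."""
    lowered = {entry.lower() for entry in entries}
    return [
        (m.start(), m.end())
        for m in re.finditer(r"\w+", text)
        if m.group().lower() in lowered
    ]

document = "we will meet Mark Scott and"
spans = dictionary_matches(document, ["Mark", "Scott", "Anna"])
# Each output tuple pairs the document with one span.
output_tuples = [(document, span) for span in spans]
matched_text = [document[b:e] for b, e in spans]
```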
24
![Page 25: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/25.jpg)
25
![Page 26: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/26.jpg)
……Tomorrow, we will meet Mark Scott, Howard Smith …
Dictionary<First>: Mark, Scott, Howard
Dictionary<Last>: Scott, Smith
Regex<Caps>: Tomorrow, Mark, Scott, Howard, Smith
Join <First> <Caps>: Mark Scott, Howard Smith
Join <First> <Last>: Mark Scott, Howard Smith
Union: Mark Scott, Howard Smith, Mark Scott, Howard Smith, Mark, Scott, Howard
Consolidate: Mark Scott, Howard Smith
Explicit operator for resolving ambiguity
Input may contain overlapping annotations (No Lossy Sequencing problem)
Output may contain overlapping annotations (No Rigid Matching Regimes)
26
![Page 27: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/27.jpg)
create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);
<First> <Caps>: 0 tokens apart
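To make the join predicate concrete, here is a small Python sketch of what FollowsTok and CombineSpans compute for this view. It is an approximation under stated assumptions: the tokenizer, the `Span` type, and the First dictionary contents are made up for the example, and the real AQL built-ins may differ in edge cases.

```python
import re
from typing import NamedTuple

class Span(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # exclusive character offset

TOKEN = re.compile(r"\w+|[^\w\s]")          # assumed tokenizer: words and punctuation
CAPS = re.compile(r"[A-Z]\w+")
FIRST_NAMES = {"Mark", "Scott", "Howard"}   # assumed dictionary entries

def tokenize(text):
    return [Span(m.start(), m.end()) for m in TOKEN.finditer(text)]

def follows_tok(a, b, min_tok, max_tok, tokens):
    """True if b starts after a ends with min_tok..max_tok tokens in between."""
    if b.begin < a.end:
        return False
    between = sum(1 for t in tokens if t.begin >= a.end and t.end <= b.begin)
    return min_tok <= between <= max_tok

def combine_spans(a, b):
    """Smallest span covering a followed by b."""
    return Span(a.begin, b.end)

text = "Tomorrow, we will meet Mark Scott, Howard Smith"
tokens = tokenize(text)
first = [t for t in tokens if text[t.begin:t.end] in FIRST_NAMES]
caps = [t for t in tokens if CAPS.fullmatch(text[t.begin:t.end])]

# Equivalent of: select CombineSpans(F.name, C.name) ... where FollowsTok(..., 0, 0)
first_caps = [
    combine_spans(f, c)
    for f in first
    for c in caps
    if follows_tok(f, c, 0, 0, tokens)
]
names = [text[s.begin:s.end] for s in first_caps]
```

The comma between "Scott" and "Howard" counts as one token, so that pair fails the 0-token condition.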
27
![Page 28: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/28.jpg)
create view Person as
select S.name as name
from (
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0) )
union all
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0) )
union all
( select * from First F )
) S
consolidate on name;
The three subqueries correspond to <First><Caps>, <First><Last> and <First>.
28
![Page 29: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/29.jpg)
create view Person as
select S.name as name
from (
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0) )
union all
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0) )
union all
( select * from First F )
) S
consolidate on name;
consolidate: Explicit clause for resolving ambiguity (No Rigid Priority problem)
union all: Input may contain overlapping annotations (No Lossy Sequencing problem)
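The effect of the consolidate clause can be sketched in Python. The sketch assumes the policy "drop any span that is contained in another span, and keep one copy of exact duplicates"; that choice, and the helper names `span_of` and `consolidate`, are assumptions for illustration rather than a statement of AQL's exact semantics.

```python
def span_of(text, phrase):
    """Span (begin, end) of the first occurrence of phrase in text."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase))

def consolidate(spans):
    """Keep only spans that are not contained in a different span."""
    unique = sorted(set(spans))   # exact duplicates collapse to one span
    return [
        s for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    ]

text = "Tomorrow, we will meet Mark Scott, Howard Smith"
first_caps = [span_of(text, "Mark Scott"), span_of(text, "Howard Smith")]
first_last = [span_of(text, "Mark Scott"), span_of(text, "Howard Smith")]
first = [span_of(text, "Mark"), span_of(text, "Scott"), span_of(text, "Howard")]

union_all = first_caps + first_last + first   # overlapping candidates are all kept
person = [text[b:e] for b, e in consolidate(union_all)]
```

Nothing is discarded before the union, so the ambiguity is resolved in exactly one place and on complete information.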
29
![Page 30: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/30.jpg)
30
AQL: Language to express NLP Algorithms
Core Operators: Tokenization, Parts of Speech, Dictionaries, Regular Expressions, Span Operations, Relational Operations, Aggregation Operations, ….
Deep Syntactic Parsing, Semantic Role Labels, ML Training & Scoring
![Page 31: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/31.jpg)
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("\n", "
// "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %d\n",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* part of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("\n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
create dictionary AbbrevDict from file 'abbreviation.dict';
create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([\.\?!]+\s)|(\n\s*\n))/
on D.text as match from Document D ) R
where
Not(ContainsDict('AbbrevDict',
CombineSpans(LeftContextTok(R.match, 1),R.match)));
Equivalent AQL Implementation
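For readers without a SystemT installation, the logic of the AQL view can be approximated in Python. This is a rough sketch, not a port: the abbreviation entries are hypothetical, and the ContainsDict test is simplified to an exact comparison of "previous word plus punctuation" against the dictionary.

```python
import re

ABBREVIATIONS = {"mr.", "dr.", "st."}                 # hypothetical dictionary entries
BOUNDARY = re.compile(r"([.?!]+\s)|(\n\s*\n)")        # same pattern as the AQL regex
LAST_WORD = re.compile(r"\w+$")

def sentence_boundaries(text):
    """Return (begin, end) spans of boundary matches not preceded by an abbreviation."""
    boundaries = []
    for m in BOUNDARY.finditer(text):
        # Stand-in for CombineSpans(LeftContextTok(R.match, 1), R.match).
        left = LAST_WORD.search(text[:m.start()])
        left_begin = left.start() if left else m.start()
        combined = text[left_begin:m.end()].strip().lower()
        if combined in ABBREVIATIONS:
            continue                                   # e.g. "Mr. " is not a boundary
        boundaries.append((m.start(), m.end()))
    return boundaries

sample = "Mr. Smith arrived. He met Dr. Jones! Really? Yes."
found = [sample[b:e] for b, e in sentence_boundaries(sample)]
```

The final "." in the sample is not reported because the pattern requires trailing whitespace, which mirrors the regex as written on the slide.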
31
![Page 32: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/32.jpg)
32
![Page 33: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/33.jpg)
33
![Page 34: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/34.jpg)
34
![Page 35: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/35.jpg)
35
Tokenization overhead is paid only once
Plan A (Join): evaluate First and Caps, then join (followed within 0 tokens)
Plan B (Restricted Span Evaluation): evaluate First; extract text to the right; identify Caps starting within 0 tokens
Plan C (Restricted Span Evaluation): evaluate Caps; extract text to the left; identify First ending within 0 tokens
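The difference between Plan A and Plan B can be illustrated in Python. This is a simplified sketch under assumptions (a toy tokenizer, an assumed First dictionary, hypothetical function names); it shows only that both plans return the same result while Plan B applies the Caps regex to far fewer tokens, and it does not model SystemT's cost-based optimizer.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")
CAPS = re.compile(r"[A-Z]\w+")
FIRST_NAMES = {"Mark", "Scott", "Howard"}   # assumed dictionary entries

def token_spans(text):
    return [(m.start(), m.end()) for m in TOKEN.finditer(text)]

def plan_a(text):
    """Evaluate First and Caps over the whole document, then join."""
    toks = token_spans(text)
    first = [i for i, (b, e) in enumerate(toks) if text[b:e] in FIRST_NAMES]
    caps = {i for i, (b, e) in enumerate(toks) if CAPS.fullmatch(text[b:e])}
    return [text[toks[i][0]:toks[i + 1][1]] for i in first if i + 1 in caps]

def plan_b(text):
    """Evaluate First, then test Caps only on the token to its right."""
    toks = token_spans(text)
    results = []
    for i, (b, e) in enumerate(toks):
        if text[b:e] in FIRST_NAMES and i + 1 < len(toks):
            nb, ne = toks[i + 1]
            if CAPS.fullmatch(text[nb:ne]):
                results.append(text[b:ne])
    return results

text = "Tomorrow, we will meet Mark Scott, Howard Smith"
result_a = plan_a(text)
result_b = plan_b(text)
```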
![Page 36: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/36.jpg)
[Figure: throughput (KB/sec) versus average document size (KB) for an Open Source Entity Tagger and SystemT; SystemT is 10~50x faster]
[Chiticariu et al., ACL’10]
36
![Page 37: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/37.jpg)
[Chiticariu et al., ACL’10]
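| Dataset | Document size range | Average document size | Throughput, ANNIE (KB/sec) | Throughput, SystemT (KB/sec) | Average memory, ANNIE (MB) | Average memory, SystemT (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| Web Crawl | 68 B – 388 KB | 8.8 KB | 42.8 | 498.8 | 201.8 | 77.2 |
| Medium SEC Filings | 240 KB – 0.9 MB | 401 KB | 26.3 | 703.5 | 601.8 | 143.7 |
| Large SEC Filings | 1 MB – 3.4 MB | 1.54 MB | 21.1 | 954.5 | 2683.5 | 189.6 |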
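The ratios implied by the table can be checked with a few lines of Python. The numbers are copied from the table; the dictionary layout and key names are just a convenient assumption.

```python
# Throughput in KB/sec and average memory in MB, as listed in the table.
ROWS = {
    "Web Crawl":          {"annie_kbps": 42.8, "systemt_kbps": 498.8, "annie_mb": 201.8,  "systemt_mb": 77.2},
    "Medium SEC Filings": {"annie_kbps": 26.3, "systemt_kbps": 703.5, "annie_mb": 601.8,  "systemt_mb": 143.7},
    "Large SEC Filings":  {"annie_kbps": 21.1, "systemt_kbps": 954.5, "annie_mb": 2683.5, "systemt_mb": 189.6},
}

ratios = {}
for name, row in ROWS.items():
    ratios[name] = {
        "speedup": row["systemt_kbps"] / row["annie_kbps"],   # how many times faster
        "memory_ratio": row["annie_mb"] / row["systemt_mb"],  # how many times less memory
    }

for name, r in ratios.items():
    print(f"{name}: {r['speedup']:.1f}x throughput, {r['memory_ratio']:.1f}x less memory")
```

The throughput ratios come out at roughly 12x, 27x and 45x, which is consistent with the "10~50x faster" claim, and the memory advantage grows with document size.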
37
![Page 38: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/38.jpg)
38
![Page 39: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/39.jpg)
39
![Page 40: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/40.jpg)
Anna at James St. office (555-5555) ….
[Annotation labels over the text: Person, Person, Phone, PersonPhone]
create view PersonPhone as
select P.name as person, N.number as phone
from Person P, Phone N
where Follows(P.name, N.number, 0, 30);
Provenance: Boolean expression
Person: t1, t2
Phone: t3
PersonPhone: (t1, t3), (t2, t3)
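The join and its provenance can be traced in Python. This is a sketch under assumptions: Follows is read as "the second span starts 0 to 30 characters after the first ends", the name dictionary is made up, and the tuple ids t1, t2, t3 are assigned in reading order to mirror the slide.

```python
import re

text = "Anna at James St. office (555-5555)"
FIRST_NAMES = {"anna", "james"}   # assumed dictionary entries

def follows(a, b, min_chars, max_chars):
    """True if span b begins min_chars..max_chars characters after span a ends."""
    return min_chars <= b[0] - a[1] <= max_chars

person_spans = [m.span() for m in re.finditer(r"[A-Za-z]+", text)
                if m.group().lower() in FIRST_NAMES]
phone_spans = [m.span() for m in re.finditer(r"\d{3}-\d{4}", text)]

# Give every base tuple an id so derived tuples can record where they came from.
person = {f"t{i + 1}": s for i, s in enumerate(person_spans)}
phone = {f"t{len(person) + i + 1}": s for i, s in enumerate(phone_spans)}

person_phone = [
    (text[p[0]:p[1]], text[n[0]:n[1]], (pid, nid))   # (person, phone, provenance)
    for pid, p in person.items()
    for nid, n in phone.items()
    if follows(p, n, 0, 30)
]
```

Each PersonPhone tuple exists only if both of its source tuples exist, which is why its provenance can be read as a conjunction of the two ids.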
40
![Page 41: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/41.jpg)
41
![Page 42: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/42.jpg)
Timeline: 2013, 2014, 2015, 2016, 2017
• UC Santa Cruz (full Graduate class)
• U. Washington (Grad)
• U. Oregon (Undergrad)
• U. Aalborg, Denmark (Grad)
• UIUC (Grad)
• U. Maryland Baltimore County (Undergrad)
• UC Irvine (Grad)
• NYU Abu-Dhabi (Undergrad)
• U. Washington (Grad)
• U. Oregon (Undergrad)
• U. Maryland Baltimore County (Undergrad)
• …
• UC Santa Cruz, 3 lectures in one Grad class
SystemT MOOC
42
![Page 43: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/43.jpg)
43
![Page 44: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/44.jpg)
44
![Page 45: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/45.jpg)
45
![Page 46: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/46.jpg)
create dictionary PurchaseVerbs as
('buy.01', 'purchase.01', 'acquire.01', 'get.01');
create view Relation as
select A.verb as BUY_VERB, R.head as PURCHASE, A.polarity as WILL_BUY
from Action A, Role R
where
MatchesDict('PurchaseVerbs', A.verbClass)
and Equals(A.aid, R.aid)
and Equals(R.type, 'A1');
ACL ’15, ’16, EMNLP ’16, COLING ’16a, ’16b, ’16c
46
![Page 47: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/47.jpg)
47
![Page 48: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/48.jpg)
48
![Page 49: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/49.jpg)
49
![Page 50: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/50.jpg)
50
![Page 51: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/51.jpg)
51
![Page 52: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/52.jpg)
52
![Page 53: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/53.jpg)
Ease of Programming
Ease of Sharing
53
![Page 54: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/54.jpg)
R1: create view Phone as
Regex('\d{3}-\d{4}', Document, text);
R2: create view Person as
Dictionary('first_names.dict', Document, text);
Dictionary file first_names.dict:
anna, james, john, peter…
R3: create table PersonPhone(match span);
insert into PersonPhone
select Merge(F.match, P.match) as match
from Person F, Phone P
where Follows(F.match, P.match, 0, 60);
Anna at James St. office (555-5555), or James, her assistant - 777-7777 have the details.
[Annotation labels over the text: Person, Person, Phone, Person, Phone]
54
![Page 55: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/55.jpg)
Doc: Anna at James St. office (555-5555), …
Person: Dictionary FirstNames.dict → James
Phone: Regex /\d{3}-\d{4}/ → 555-5555
PersonPhone: Join Follows(name, phone, 0, 60) → James 555-5555
55
![Page 56: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/56.jpg)
Goal: remove “James 555-5555” from output
Doc
R1: Regex '\d{3}-\d{4}', text
R2: Dictionary 'firstName.dict', text
R3: ⋈ Follows(F.match, P.match, 0, 60); Merge(F.match, P.match) as match; true
HLC 1: Remove 555-5555 from output of R1's Regex op.
HLC 2: Remove James from output of R2's Dictionary op.
HLC 3: Remove James 555-5555 from output of R3's join op.
56
![Page 57: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/57.jpg)
Goal: remove “James 555-5555” from output
HLC 1: Remove 555-5555 from output of R1's Regex op.
HLC 2: Remove James from output of R2's Dictionary op.
HLC 3: Remove James 555-5555 from output of R3's join op.
LLC 1: Remove 'James' from FirstNames.dict
LLC 2: Add filter predicate on street suffix in right context of match
LLC 3: Reduce character gap between F.match and P.match from 60 to 10
57
![Page 58: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/58.jpg)
58
⋈
Dictionary ContainsDict()
Contains IsContained Overlaps
“ PersonPhonePersonPhone ”
58
![Page 59: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/59.jpg)
• Input:
– Set of HLCs, provenance graph, labeled results
• Output:
– List of LLCs, ranked based on improvement in F1-measure
• Algorithm:
– For each operator Op, consider all HLCs (ti, Op)
– For each HLC, enumerate all possible LLCs
– For each LLC:
• Compute the set of local tuples it removes from the output of Op
• Propagate these removals up through the provenance graph to compute the effect on end-to-end result
– Rank LLCs
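The ranking step above can be sketched in Python. Everything concrete here is hypothetical: the result ids, the provenance sets, the gold labels and the three candidate LLCs are toy data, and each LLC is modeled simply as the set of local tuples it removes. A result survives only if none of the tuples in its provenance is removed.

```python
def f1(results, gold):
    """F1-measure of a result set against the labeled correct results."""
    true_pos = len(results & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(results)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

def rank_llcs(results, provenance, gold, llcs):
    """Rank candidate LLCs by the F1 improvement of the end-to-end result."""
    baseline = f1(results, gold)
    ranked = []
    for name, removed in llcs.items():
        # Propagate removals: drop every result that depends on a removed tuple.
        surviving = {r for r in results if not (provenance[r] & removed)}
        ranked.append((f1(surviving, gold) - baseline, name))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked

# Toy data (hypothetical ids): r2 is the unwanted "James 555-5555" result.
results = {"r1", "r2", "r3"}
gold = {"r1", "r3"}
provenance = {
    "r1": {"p_anna", "n_555", "j1"},
    "r2": {"p_james_st", "n_555", "j2"},
    "r3": {"p_james", "n_777", "j3"},
}
llcs = {
    "remove 'james' from dictionary": {"p_james_st", "p_james"},
    "add street-suffix filter": {"p_james_st"},
    "tighten character gap": {"j1", "j2"},
}
ranked = rank_llcs(results, provenance, gold, llcs)
```

In this toy setup the street-suffix filter ranks first because it removes only the incorrect result, while the other two candidates also remove a correct one.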
59
![Page 60: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/60.jpg)
[Figure: output tuples of Op, marked + or -; label: Tuples to remove from output of Op]
60
![Page 61: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/61.jpg)
[Figure: two bar charts, y-axis 0% to 90%, x-axis Baseline and iterations I1 to I5; series: Enron, ACE, CoNLL, EnronPP]
Precision improves greatly after a few iterations, while recall remains fairly stable
Precision: % of identified results that are correct
Recall: % of correct labels that are identified
Person extraction on formal text (CoNLL, ACE)
Person and PersonPhone extraction on informal text (Enron)
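Written as formulas, with $R$ for the set of results identified and $G$ for the set of correct labels (both symbols are introduced here for illustration), the two definitions and the F1-measure that combines them are:

```latex
\[
\text{Precision} = \frac{|R \cap G|}{|R|},
\qquad
\text{Recall} = \frac{|R \cap G|}{|G|},
\qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```

Removing an incorrect result shrinks $|R|$ without changing $|R \cap G|$ or $|G|$, so precision can rise while recall stays the same.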
61
![Page 62: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/62.jpg)
[Figure: F1-measure, scale 0 to 90]
Almost all of the expert's refinements are among the top 12 generated refinements
Done in 2 minutes!
Expert A after 1 hour, 9 refinements
Person extraction on informal text (Enron)
62
![Page 63: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/63.jpg)
63
![Page 64: 2017-01-25-SystemT-Overview-Stanford](https://reader031.vdocument.in/reader031/viewer/2022030309/58f2ead71a28ab16068b4607/html5/thumbnails/64.jpg)
Development Environment: Discovery tools for AQL development
AQL Extractor:
create view ProductMention as select ... from ... where ...
create view IntentToBuy as select ... from ... where ...
...
Cost-based optimization
SystemT Runtime: Input Documents → Extracted Objects
Challenge: Building extractors for enterprise applications requires an information extraction system that is expressive, efficient, transparent and usable. Existing solutions are either rule-based solutions built on cascading grammars, which have expressivity and efficiency issues, or black-box solutions based on machine learning, which lack transparency.
Our Solution: A declarative information extraction system with cost-based optimization, a high-performance runtime and novel development tooling, based on a solid theoretical foundation [PODS’13, PODS’14], shipping with 10+ IBM products.
AQL: a declarative language that can be used to build extractors outperforming the state of the art [ACL’10]
Multilingual SRL-enabled: [ACL’15, ACL’16, EMNLP’16, COLING’16]
A suite of novel development tooling leveraging machine learning and HCI [EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13, SIGMOD’13, ACL’13, VLDB’15, NAACL’15]
Cost-based optimization for text-centric operations [ICDE’08, ICDE’11, FPL’13, FPL’14]
Highly embeddable runtime with high throughput and small memory footprint [SIGMOD Record’09, SIGMOD’09]
For details and the online class, visit: https://ibm.biz/BdF4GQ
64