2017-01-25-systemt-overview-stanford

[Figure: pie chart of shares by bank: Bank1 15%, Bank2 15%, Bank3 5%, Bank4 21%, Bank5 20%, Bank6 23%]


Operations Analysis


We are raising our tablet forecast.

[Figure: dependency tree of the sentence above, with constituents labeled S, NP and DET and edges labeled subj, obj and pred]

Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected

Time Oct 1 04:12:24

Host 9.1.1.3

Process 41865

Category %PLATFORM_ENV-1-DUAL_PWR

Message Faulty internal power supply B detected
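The field breakdown above can be sketched with a single regular expression. This is only an illustration for the one log line shown; the pattern and the function name `parse_log_line` are assumptions made here, and real device logs will need a more tolerant pattern.

```python
import re

# Hypothetical pattern written for the single log line in the slide.
LOG_PATTERN = re.compile(
    r"^(?P<time>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<process>\d+): "
    r"(?P<category>%[A-Z0-9_]+-\d+-[A-Z0-9_]+): "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None


record = parse_log_line(
    "Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: "
    "Faulty internal power supply B detected"
)
for field, value in record.items():
    print(field, value)
```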


Singapore 2012 Annual Report (136 pages PDF)

Identify note breaking down Operating expenses line item, and extract opex components

Identify line item for Operating expenses from Income statement (financial table in pdf document)


Intel's 2013 capex is elevated at 23% of sales, above average of 16%

FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.

I'm still hearing from clients that Merrill's website is better.

Customer or competitor?

Good or bad?

Entity of interest


[Sample document: filler text containing the sentence “Tomorrow, we will meet Mark Scott, Howard Smith and …”]

Tokenization

(preprocessing step)

Level 1

Gazetteer[type = LastGaz] → Last

Gazetteer[type = FirstGaz] → First

Token[~ "[A-Z]\w+"] → Caps

• Rule priority used to prefer First over Caps.

• Lossy Sequencing: annotations dropped because input to next stage must be a sequence

– First preferred over Last since it was declared earlier


[Sample document: filler text containing the sentence “Tomorrow, we will meet Mark Scott, Howard Smith and …”]

Tokenization

(preprocessing step)

Level 1

Gazetteer[type = LastGaz] → Last

Gazetteer[type = FirstGaz] → First

Token[~ "[A-Z]\w+"] → Caps

Level 2

First Last → Person

First Caps → Person

First → Person

Rigid Rule Priority and Lossy Sequencing in Level 1 caused partial results
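A minimal sketch of how a two-level cascade can produce such partial results. The gazetteer entries, the tokenizer and the priority order (First, then Last, then Caps) are assumptions chosen to mirror the slides; this is not a real CPSL engine.

```python
import re

# Assumed gazetteer contents, chosen to match the names in the example sentence.
FIRST = {"Mark", "Scott", "Howard"}
LAST = {"Scott", "Smith"}
CAPS = re.compile(r"[A-Z]\w+")


def level1(tokens):
    """Assign exactly one type per token (priority First > Last > Caps).

    Because the output must be a sequence, a token that is both First and
    Last keeps only its First label: this is the lossy sequencing step.
    """
    labeled = []
    for tok in tokens:
        if tok in FIRST:
            labeled.append((tok, "First"))
        elif tok in LAST:
            labeled.append((tok, "Last"))
        elif CAPS.fullmatch(tok):
            labeled.append((tok, "Caps"))
        else:
            labeled.append((tok, "Token"))
    return labeled


def level2(labeled):
    """Apply First Last -> Person, First Caps -> Person, First -> Person."""
    persons = []
    i = 0
    while i < len(labeled):
        tok, typ = labeled[i]
        if typ == "First":
            nxt = labeled[i + 1] if i + 1 < len(labeled) else None
            if nxt and nxt[1] in ("Last", "Caps"):
                persons.append(tok + " " + nxt[0])
                i += 2
                continue
            persons.append(tok)
        i += 1
    return persons


tokens = re.findall(r"\w+|[^\w\s]", "Tomorrow, we will meet Mark Scott, Howard Smith")
persons = level2(level1(tokens))
print(persons)
```

In this sketch "Scott" keeps only its First label after Level 1, so Level 2 never sees a First followed by a Last for "Mark Scott" and emits "Mark" and "Scott" as two separate persons.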


AQL Language: Specify extractor semantics declaratively (express logic of computation, not control flow)

Optimizer: Choose efficient execution plan that implements semantics

Operator Graph: Optimized execution plan executed at runtime


Document

text: String

Person

last: Span, first: Span, fullname: Span


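[Figure: the Dictionary operator. Input tuple: a Document whose text contains “… we will meet Mark Scott and …”. Dictionary entries: Mark, Scott, Anna, … The operator emits one output tuple per match: Output Tuple 1 (Document, Span 1) and Output Tuple 2 (Document, Span 2).]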
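The behavior of the Dictionary operator can be sketched as a function from one input tuple to zero or more output tuples, each carrying the document and a span. The `Span` type, the word-level tokenization and the function name `dictionary_op` are assumptions for illustration, not SystemT's implementation.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # offset of the first character
    end: int    # offset one past the last character


def dictionary_op(doc_text, entries):
    """Emit one (document, span) tuple per dictionary match, in document order."""
    out = []
    for m in re.finditer(r"\w+", doc_text):
        if m.group() in entries:
            out.append((doc_text, Span(m.start(), m.end())))
    return out


doc = "we will meet Mark Scott and"
matches = dictionary_op(doc, {"Mark", "Scott", "Anna"})
covered = [doc[s.begin:s.end] for _, s in matches]
print(covered)
```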


[Figure: operator graph for the Person extractor over the input “… Tomorrow, we will meet Mark Scott, Howard Smith …”. Dictionary<First>, Dictionary<Last> and Regex<Caps> run over the document; Join<First><Caps> and Join<First><Last> combine adjacent matches; Union merges the join outputs with the First matches; Consolidate resolves the overlaps.]

Consolidate: explicit operator for resolving ambiguity

Input may contain overlapping annotations (No Lossy Sequencing problem)

Output may contain overlapping annotations (No Rigid Matching Regimes)


create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);

<First> <Caps>

0 tokens
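To make the join predicate concrete, here is a rough Python analogue of FollowsTok and CombineSpans. The tokenizer (words and single punctuation marks) and the helper names are assumptions; AQL's own tokenizer may split text differently.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int
    end: int


def tokens_between(text, left, right):
    """Number of tokens strictly between two spans, or None if right starts before left ends."""
    if right.begin < left.end:
        return None
    return len(re.findall(r"\w+|[^\w\s]", text[left.end:right.begin]))


def follows_tok(text, left, right, min_tok, max_tok):
    """True if right follows left with between min_tok and max_tok tokens in between."""
    n = tokens_between(text, left, right)
    return n is not None and min_tok <= n <= max_tok


def combine_spans(left, right):
    """Smallest span covering both inputs, assuming left starts first."""
    return Span(left.begin, right.end)


text = "Tomorrow, we will meet Mark Scott, Howard Smith"
first = [Span(*m.span()) for m in re.finditer(r"\b(?:Mark|Scott|Howard)\b", text)]
caps = [Span(*m.span()) for m in re.finditer(r"\b[A-Z]\w+", text)]

first_caps = [
    combine_spans(f, c)
    for f in first
    for c in caps
    if follows_tok(text, f, c, 0, 0)
]
names = [text[s.begin:s.end] for s in first_caps]
print(names)
```

Note that "Scott" followed by "Howard" is rejected because the comma counts as one intervening token under the tokenizer assumed here.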


create view Person as
select S.name as name
from (
  ( select CombineSpans(F.name, C.name) as name
    from First F, Caps C
    where FollowsTok(F.name, C.name, 0, 0) )
  union all
  ( select CombineSpans(F.name, L.name) as name
    from First F, Last L
    where FollowsTok(F.name, L.name, 0, 0) )
  union all
  ( select * from First F )
) S
consolidate on name;

<First><Caps>

<First><Last>

<First>


Explicit clause for resolving ambiguity (No Rigid Priority problem)

Input may contain overlapping annotations (No Lossy Sequencing problem)
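The effect of the union followed by consolidation can be sketched as below. The policy used here (drop any span contained in another span, and drop exact duplicates) is an assumption about what `consolidate on name` does by default; the span offsets are those of the example sentence.

```python
from typing import NamedTuple


class Span(NamedTuple):
    begin: int
    end: int


def consolidate(spans):
    """Keep each distinct span that is not contained in a different span."""
    distinct = sorted(set(spans))
    kept = []
    for s in distinct:
        contained = any(
            o != s and o.begin <= s.begin and s.end <= o.end for o in distinct
        )
        if not contained:
            kept.append(s)
    return kept


text = "Tomorrow, we will meet Mark Scott, Howard Smith"

# Outputs of the three sub-queries; overlapping annotations are allowed.
first_caps = [Span(23, 33), Span(35, 47)]            # <First><Caps>
first_last = [Span(23, 33), Span(35, 47)]            # <First><Last>
first = [Span(23, 27), Span(28, 33), Span(35, 41)]   # <First>

union_all = first_caps + first_last + first
person = consolidate(union_all)
names = [text[s.begin:s.end] for s in person]
print(names)
```

Unlike the cascading grammar, no candidate is discarded before the final step, so "Mark Scott" survives and the shorter "Mark" and "Scott" are removed only because a longer span covers them.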


AQL: Language to express NLP Algorithms

Core Operators: Tokenization, Parts of Speech, Dictionaries, Regular Expressions, Span Operations, Relational Operations, Aggregation Operations, ….

Deep Syntactic Parsing, Semantic Role Labels, ML Training & Scoring


package com.ibm.avatar.algebra.util.sentence;

import java.io.BufferedWriter;

import java.util.ArrayList;

import java.util.HashSet;

import java.util.regex.Matcher;

public class SentenceChunker

{

private Matcher sentenceEndingMatcher = null;

public static BufferedWriter sentenceBufferedWriter = null;

private HashSet<String> abbreviations = new HashSet<String> ();

public SentenceChunker ()

{

}

/** Constructor that takes in the abbreviations directly. */

public SentenceChunker (String[] abbreviations)

{

// Generate the abbreviations directly.

for (String abbr : abbreviations) {

this.abbreviations.add (abbr);

}

}

/**

* @param doc the document text to be analyzed

* @return true if the document contains at least one sentence boundary

*/

public boolean containsSentenceBoundary (String doc)

{

String origDoc = doc;

/*

* Based on getSentenceOffsetArrayList()

*/

// String origDoc = doc;

// int dotpos, quepos, exclpos, newlinepos;

int boundary;

int currentOffset = 0;

do {

/* Get the next tentative boundary for the sentenceString */

setDocumentForObtainingBoundaries (doc);

boundary = getNextCandidateBoundary ();

if (boundary != -1) {

String candidate = doc.substring (0, boundary + 1);

String remainder = doc.substring (boundary + 1);

/*

* Looks at the last character of the String. If this last

* character is part of an abbreviation (as detected by

* REGEX) then the sentenceString is not a fullSentence and

* "false" is returned

*/

// while (!(isFullSentence(candidate) &&

// doesNotBeginWithCaps(remainder))) {

while (!(doesNotBeginWithPunctuation (remainder)

&& isFullSentence (candidate))) {


int nextBoundary = getNextCandidateBoundary ();

if (nextBoundary == -1) {

break;

}

boundary = nextBoundary;

candidate = doc.substring (0, boundary + 1);

remainder = doc.substring (boundary + 1);

}

if (candidate.length () > 0) {

// sentences.addElement(candidate.trim().replaceAll("\n", "

// "));

// sentenceArrayList.add(new Integer(currentOffset + boundary

// + 1));

// currentOffset += boundary + 1;

// Found a sentence boundary. If the boundary is the last

// character in the string, we don't consider it to be

// contained within the string.

int baseOffset = currentOffset + boundary + 1;

if (baseOffset < origDoc.length ()) {

// System.err.printf("Sentence ends at %d of %d\n",

// baseOffset, origDoc.length());

return true;

}

else {

return false;

}

}

// origDoc.substring(0,currentOffset));

// doc = doc.substring(boundary + 1);

doc = remainder;

}

}

while (boundary != -1);

// If we get here, didn't find any boundaries.

return false;

}

public ArrayList<Integer> getSentenceOffsetArrayList (String doc)

{

ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();

// String origDoc = doc;

// int dotpos, quepos, exclpos, newlinepos;

int boundary;

int currentOffset = 0;

sentenceArrayList.add (new Integer (0));

do {


setDocumentForObtainingBoundaries (doc);

boundary = getNextCandidateBoundary ();

if (boundary != -1) {

String candidate = doc.substring (0, boundary + 1);

String remainder = doc.substring (boundary + 1);

/*

* Looks at the last character of the String. If this last character

* is part of an abbreviation (as detected by REGEX) then the

* sentenceString is not a fullSentence and "false" is returned

*/

// while (!(isFullSentence(candidate) &&

// doesNotBeginWithCaps(remainder))) {

while (!(doesNotBeginWithPunctuation (remainder) &&

isFullSentence (candidate))) {


int nextBoundary = getNextCandidateBoundary ();

if (nextBoundary == -1) {

break;

}

boundary = nextBoundary;

candidate = doc.substring (0, boundary + 1);

remainder = doc.substring (boundary + 1);

}

if (candidate.length () > 0) {

sentenceArrayList.add (new Integer (currentOffset + boundary + 1));

currentOffset += boundary + 1;

}

// origDoc.substring(0,currentOffset));

// doc = doc.substring(boundary + 1);

doc = remainder;

}

}

while (boundary != -1);

if (doc.length () > 0) {

sentenceArrayList.add (new Integer (currentOffset + doc.length ()));

}

sentenceArrayList.trimToSize ();

return sentenceArrayList;

}

private void setDocumentForObtainingBoundaries (String doc)

{

sentenceEndingMatcher = SentenceConstants.

sentenceEndingPattern.matcher (doc);

}

private int getNextCandidateBoundary ()

{

if (sentenceEndingMatcher.find ()) {

return sentenceEndingMatcher.start ();

}

else

return -1;

}

private boolean doesNotBeginWithPunctuation (String remainder)

{

Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);

return (!m.find ());

}

private String getLastWord (String cand)

{

Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);

if (lastWordMatcher.find ()) {

return lastWordMatcher.group ();

}

else {

return "";

}

}

/*

* Looks at the last character of the String. If this last character is

* part of an abbreviation (as detected by REGEX)

* then the sentenceString is not a fullSentence and "false" is returned

*/

private boolean isFullSentence (String cand)

{

// cand = cand.replaceAll("\n", " "); cand = " " + cand;

Matcher validSentenceBoundaryMatcher =

SentenceConstants.validSentenceBoundaryPattern.matcher (cand);

if (validSentenceBoundaryMatcher.find ()) return true;

Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);

if (abbrevMatcher.find ()) {

return false; // Means it ends with an abbreviation

}

else {

// Check if the last word of the sentenceString has an entry in the

// abbreviations dictionary (like Mr etc.)

String lastword = getLastWord (cand);

if (abbreviations.contains (lastword)) { return false; }

}

return true;

}

}

Java Implementation of Sentence Boundary Detection

create dictionary AbbrevDict from file 'abbreviation.dict';

create view SentenceBoundary as

select R.match as boundary

from ( extract regex /(([\.\?!]+\s)|(\n\s*\n))/

on D.text as match from Document D ) R

where

Not(ContainsDict('AbbrevDict',

CombineSpans(LeftContextTok(R.match, 1),R.match)));

Equivalent AQL Implementation
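For readers unfamiliar with AQL, the same logic can be approximated in a few lines of Python. This is a loose sketch: the abbreviation set stands in for abbreviation.dict (whose contents are not shown), and the dictionary test is simplified to an exact match on the preceding word plus the punctuation.

```python
import re

# Same boundary pattern as the AQL view: sentence punctuation followed by
# whitespace, or a blank line.
BOUNDARY = re.compile(r"([.?!]+\s)|(\n\s*\n)")

# Hypothetical stand-in for the entries of abbreviation.dict.
ABBREVIATIONS = {"Mr.", "Dr.", "St."}


def sentence_boundaries(text):
    """Return (begin, end) offsets of boundaries not preceded by an abbreviation."""
    boundaries = []
    for m in BOUNDARY.finditer(text):
        # One word of left context plus the match, similar in spirit to
        # CombineSpans(LeftContextTok(R.match, 1), R.match).
        left = re.search(r"\w+$", text[:m.start()])
        candidate = (left.group() if left else "") + m.group().strip()
        if candidate not in ABBREVIATIONS:
            boundaries.append((m.start(), m.end()))
    return boundaries


text = "We met Mr. Scott today. He was late! Why?"
result = sentence_boundaries(text)
print(result)
```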


Tokenization overhead is paid only once

Plan A (Join): evaluate First and Caps separately, then join the pairs where Caps follows First within 0 tokens

Plan B (Restricted Span Evaluation): identify First, extract text to the right, identify Caps starting within 0 tokens

Plan C (Restricted Span Evaluation): identify Caps, extract text to the left, identify First ending within 0 tokens

[Figure: throughput (KB/sec, 0 to 700) against average document size (KB, 0 to 100) for SystemT and an open source entity tagger; SystemT is 10~50x faster] [Chiticariu et al., ACL’10]

[Chiticariu et al., ACL’10]

Dataset              Document size               Throughput (KB/sec)   Average memory (MB)
                     Range             Average   ANNIE     SystemT     ANNIE     SystemT
Web Crawl            68 B – 388 KB     8.8 KB    42.8      498.8       201.8     77.2
Medium SEC Filings   240 KB – 0.9 MB   401 KB    26.3      703.5       601.8     143.7
Large SEC Filings    1 MB – 3.4 MB     1.54 MB   21.1      954.5       2683.5    189.6
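As a quick check on how the table relates to the "10~50x faster" claim, the ratios can be computed directly from the reported numbers. The dictionary layout and key names below are just a convenient encoding of the table.

```python
# Values copied from the table above: throughput in KB/sec, memory in MB.
rows = {
    "Web Crawl": {"annie_kbps": 42.8, "systemt_kbps": 498.8,
                  "annie_mb": 201.8, "systemt_mb": 77.2},
    "Medium SEC Filings": {"annie_kbps": 26.3, "systemt_kbps": 703.5,
                           "annie_mb": 601.8, "systemt_mb": 143.7},
    "Large SEC Filings": {"annie_kbps": 21.1, "systemt_kbps": 954.5,
                          "annie_mb": 2683.5, "systemt_mb": 189.6},
}

speedup = {name: r["systemt_kbps"] / r["annie_kbps"] for name, r in rows.items()}
memory_ratio = {name: r["annie_mb"] / r["systemt_mb"] for name, r in rows.items()}

for name in rows:
    print(f"{name}: {speedup[name]:.1f}x throughput, "
          f"{memory_ratio[name]:.1f}x less memory")
```

The throughput ratios come out at roughly 12x, 27x and 45x, which is consistent with the 10~50x range quoted on the chart, and the gap widens as documents get larger.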


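PersonPhone

Anna at James St. office (555-5555) ….

[Annotations: “Anna” and “James” are marked Person, “555-5555” is marked Phone, and each Person followed by the Phone forms a PersonPhone match]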

create view PersonPhone as

select P.name as person, N.number as phone

from Person P, Phone N

where Follows(P.name, N.number, 0, 30);

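[Figure: provenance of the PersonPhone output. Person tuples t1 and t2 each combine with Phone tuple t3, so the provenance of each output tuple is a Boolean expression over its input tuples (t1 with t3, t2 with t3).]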
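A small sketch of how provenance could be recorded alongside the join. The tuple ids t1, t2 and t3 follow the figure; the `follows` helper measures the character gap between the end of the person span and the start of the phone span, which is an assumption about how Follows is evaluated.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int
    end: int


def follows(left, right, min_chars, max_chars):
    """True if right starts between min_chars and max_chars after left ends."""
    gap = right.begin - left.end
    return min_chars <= gap <= max_chars


text = "Anna at James St. office (555-5555)"

# Input tuples, with ids as in the figure: t1, t2 are Person; t3 is Phone.
person_spans = [Span(*m.span()) for m in re.finditer(r"\b(?:Anna|James)\b", text)]
phone_spans = [Span(*m.span()) for m in re.finditer(r"\d{3}-\d{4}", text)]
person = {"t1": person_spans[0], "t2": person_spans[1]}
phone = {"t3": phone_spans[0]}

# Each output tuple remembers which input tuples produced it (a conjunction).
person_phone = []
for pid, p in person.items():
    for nid, n in phone.items():
        if follows(p, n, 0, 30):
            person_phone.append({
                "person": text[p.begin:p.end],
                "phone": text[n.begin:n.end],
                "provenance": (pid, nid),
            })

for row in person_phone:
    print(row)
```

With this record in hand, the incorrect result for "James" can be traced back to the Person tuple t2 and the Phone tuple t3.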


2013 2015 2016 2017

• UC Santa Cruz

(full Graduate class)

2014

• U. Washington (Grad)

• U. Oregon (Undergrad)

• U. Aalborg, Denmark (Grad)

• UIUC (Grad)

• U. Maryland Baltimore County

(Undergrad)

• UC Irvine (Grad)

• NYU Abu-Dhabi (Undergrad)

• U. Washington (Grad)

• U. Oregon (Undergrad)

• U. Maryland Baltimore County

(Undergrad)

• …

• UC Santa Cruz, 3 lectures

in one Grad class

SystemT MOOC


create dictionary PurchaseVerbs as

('buy.01', 'purchase.01', 'acquire.01', 'get.01');

create view Relation as

select A.verb as BUY_VERB, R.head as PURCHASE, A.polarity as WILL_BUY

from Action A, Role R

where

MatchesDict('PurchaseVerbs', A.verbClass)

and Equals(A.aid, R.aid)

and Equals(R.type, 'A1');

ACL ’15, ’16, EMNLP ’16, COLING ’16a, ’16b, ’16c


Ease of Programming

Ease of Sharing


R1: create view Phone as

Regex('\d{3}-\d{4}', Document, text);

R2: create view Person as

Dictionary('first_names.dict', Document, text);

Dictionary file first_names.dict:

anna, james, john, peter…

R3: create table PersonPhone(match span);

insert into PersonPhone

select Merge(F.match, P.match) as match

from Person F, Phone P

where Follows(F.match, P.match, 0, 60);

Person PhonePerson Person Phone

Anna at James St. office (555-5555), or James, her assistant - 777-7777 have the details.


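[Figure: provenance graph for the output “James 555-5555” on the text “Anna at James St. office (555-5555), …”. Doc feeds the Person Dictionary operator (FirstNames.dict), which outputs “James”, and the Phone Regex operator (/\d{3}-\d{4}/), which outputs “555-5555”; the PersonPhone Join operator, Follows(name,phone,0,60), combines them into “James 555-5555”.]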


Goal: remove “James 555-5555” from output

[Figure: operator graph. Doc feeds R1 (Regex ‘\d{3}-\d{4}’, text) and R2 (Dictionary ‘firstName.dict’, text); R3 joins them with ⋈ Follows(F.match,P.match,0,60) and outputs Merge(F.match, P.match) as match.]

HLC 1: Remove 555-5555 from output of R1’s Regex op.

HLC 2: Remove James from output of R2’s Dictionary op.

HLC 3: Remove James 555-5555 from output of R3’s join op.


LLC 1: Remove ‘James’ from FirstNames.dict

LLC 2: Add filter pred. on street suffix in right context of match

LLC 3: Reduce character gap between F.match and P.match from 60 to 10


⋈

Dictionary ContainsDict()

Contains IsContained Overlaps


• Input:

– Set of HLCs, provenance graph, labeled results

• Output:

– List of LLCs, ranked based on improvement in F1-measure

• Algorithm:

– For each operator Op, consider all HLCs (ti, Op)

– For each HLC, enumerate all possible LLCs

– For each LLC:

• Compute the set of local tuples it removes from the output of Op

• Propagate these removals up through the provenance graph to compute the effect on end-to-end result

– Rank LLCs
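The steps above can be sketched on the running PersonPhone example. Everything specific here is an assumption for illustration: the gold labels are invented (the slides do not list them), the three candidate LLCs are hand-written rather than enumerated, and each LLC is modeled simply as the set of operator-output tuples it removes.

```python
import re

TEXT = ("Anna at James St. office (555-5555), or James, "
        "her assistant - 777-7777 have the details.")

# Operator outputs as (begin, end) character offsets.
persons = [m.span() for m in re.finditer(r"\b(?:Anna|James)\b", TEXT)]  # R2
phones = [m.span() for m in re.finditer(r"\d{3}-\d{4}", TEXT)]          # R1
# R3: join on Follows(person, phone, 0, 60). Each output is the pair of
# input tuples that produced it, which doubles as its provenance.
outputs = [(p, n) for p in persons for n in phones if 0 <= n[0] - p[1] <= 60]

# Assumed labeled results: Anna with 555-5555, the second James with 777-7777.
gold = {(persons[0], phones[0]), (persons[2], phones[1])}


def f1(results):
    """F1-measure of a result list against the assumed gold labels."""
    results = set(results)
    tp = len(results & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(results)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Each LLC is represented by the local tuples it removes at its operator.
llcs = {
    "LLC 1: remove 'James' from dictionary":
        {p for p in persons if TEXT[p[0]:p[1]] == "James"},
    "LLC 2: filter on street suffix":
        {p for p in persons if TEXT[p[1]:].startswith(" St.")},
    "LLC 3: reduce gap from 60 to 10":
        {o for o in outputs if o[1][0] - o[0][1] > 10},
}


def surviving(removed):
    """Propagate removals: an output survives only if it and both inputs survive."""
    return [o for o in outputs
            if o not in removed and o[0] not in removed and o[1] not in removed]


baseline = f1(outputs)
ranking = sorted(
    ((f1(surviving(removed)) - baseline, name) for name, removed in llcs.items()),
    reverse=True,
)
for gain, name in ranking:
    print(f"{gain:+.3f}  {name}")
```

Under these assumptions the street-suffix filter ranks first, because it removes the wrong "James" without touching correct results, while shrinking the gap to 10 characters removes every output, correct ones included.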

[Figure: output tuples of an operator Op, marked + or -, with the subset of tuples to remove from the output of Op highlighted. The slide notes mention a Dictionary operator, parameters t and k, and an O(n²) term in n.]

[Figure: two bar charts with y-axis 0% to 90% and x-axis Baseline, I1, I2, I3, I4, I5, each with one series per dataset: Enron, ACE, CoNLL and EnronPP]

Precision improves greatly after a few iterations, while recall remains fairly stable

Precision – % correct results of total results identified

Recall – % correct results identified of total correct labels

Person extraction on formal text (CoNLL, ACE)

Person and PersonPhone extraction on informal text (Enron)

[Figure: F1-measure chart, y-axis 0 to 90]

Almost all of the expert’s refinements are among the top 12 generated refinements

Done in 2 minutes!

Expert A after 1 hour, 9 refinements

Person extraction on informal text (Enron)


Development Environment

AQL Extractor

create view ProductMention as
select ...
from ...
where ...

create view IntentToBuy as
select ...
from ...
where ...

. . .

Cost-based optimization

Discovery tools for AQL development

SystemT Runtime

Input Documents

Extracted Objects

Challenge: Building extractors for enterprise applications requires an information extraction system that is expressive, efficient, transparent and usable. Existing solutions are either rule-based systems built on cascading grammars, which have expressivity and efficiency issues, or black-box solutions based on machine learning, which lack transparency.

Our Solution: A declarative information extraction system with cost-based optimization, a high-performance runtime and novel development tooling, built on a solid theoretical foundation [PODS’13, PODS’14] and shipping with 10+ IBM products.

AQL: a declarative language that can be used to build extractors outperforming the state of the art [ACL’10]

Multilingual SRL-enabled: [ACL’15, ACL’16, EMNLP’16, COLING’16]

A suite of novel development tooling leveraging machine learning and HCI [EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13, SIGMOD’13, ACL’13, VLDB’15, NAACL’15]

Cost-based optimization for text-centric operations [ICDE’08, ICDE’11, FPL’13, FPL’14]

Highly embeddable runtime with high throughput and small memory footprint [SIGMOD Record’09, SIGMOD’09]

For details and the online class, visit: https://ibm.biz/BdF4GQ


http://ibm.co/1Cdm1Mj

http://ibm.co/1DIouEv

https://bigdatauniversity.com/learn/text_analytics/

mailto:[email protected]