resqu: a framework for automatic evaluation of knowledge-driven automatic summarization

RESQU: A FRAMEWORK FOR AUTOMATIC EVALUATION OF

KNOWLEDGE-DRIVEN AUTOMATIC SUMMARIZATION

MASTER'S THESIS DEFENSE

NISHITA JAYKUMAR

MAY 26, 2016

MASTER'S COMMITTEE

AMIT P. SHETH (ADVISOR)

THOMAS C. RINDFLESCH (NIH)

DELROY CAMERON (APPLE INC.)

KRISHNAPRASAD THIRUNARAYAN

1

Main Issue: Indirect Information access

PubMed Search Service

2


Acetaminophen TREATS Migraine Disorders

Sumatriptan TREATS Migraine Disorders

Topiramate PREVENTS Migraine Disorders

More direct Information access

Semantic MEDLINE

3

Thesis Motivation

• Automatically evaluate summaries in Semantic MEDLINE.

• Identify features that impact summary quality.

• Improve the semantic summaries that Semantic MEDLINE generates.

4

Outline

•Automatic summarization

- Extractive, abstractive

- Summarization in Semantic MEDLINE and ResQu

•Automatic summarization evaluation

- Intrinsic, extrinsic

•Datasets

- UMLS, SemRep, MetaMap

•Approach

- Summary transformation

- Semantic similarity

•Experimental evaluation

•Conclusion

5

Automatic Summarization

• What is an effective summary?

- Saliency

- Compressed format

• Approaches to Automatic Summarization

- Extractive

- Abstractive

6

Extractive summary

A randomized, placebo-controlled trial of

acetaminophen for treatment of migraine

headache.

Long-term evaluation of sumatriptan and

naproxen sodium for the acute treatment of

migraine in adolescents.

…………….

Mapping from disease-specific measures to

health-state utility values in individuals with

migraine.

Abstractive summary


Sumatriptan TREATS Migraine Disorders

…………….

Migraine Disorders PROCESS_OF Individuals

Semantic MEDLINE Summarization System Overview

[Figure: pipeline from Source Documents through Semantic Predications (Conceptual Representation) and Semantic Predications (Conceptual Condensate) to the Semantic Summary. Stage labels: Interpretation, Transformation, Reduction, Generalization. SemRep extracts the predications.]

Feature application:

• Relevance

• Connectivity

• Novelty

• Saliency

7

Aspirin TREATS Coronary artery disease

Coronary artery disease COEXISTS_WITH Inflammation

Coronary artery disease ISA Vascular disease

Tomography DIAGNOSES Coronary artery disease

Intrinsic Evaluation:

- Compared to a human-curated gold standard.

- Using document similarity measures.

• Evaluating Summary Quality

Evaluating Summaries

Extrinsic evaluation:

- Based on a secondary task.

- Through a discrete scoring system.

8

Intrinsic Evaluation of Extractive Summarization

• Pyramid Approach [Nenkova et al., 2004]

- Summary Content Units (SCU)

• Louis et al. [2009]

- Distribution of terms

- Kullback-Leibler divergence

- Jensen-Shannon divergence

Nenkova, Ani, and Rebecca Passonneau. "Evaluating content selection in summarization: The pyramid method." (2004).

Louis, Annie, and Ani Nenkova. "Automatic summary evaluation without human models." Notebook Papers and Results, Text Analysis Conference (TAC-2008), Gaithersburg, Maryland (USA). 2008.

9

• Information Misalignment

• Semantic summary – structured background knowledge.

• Gold standard – textual.

• Proposed Solution

• Summary transformation: predications to text.

• Semantic similarity computation.

Intrinsic Evaluation of Abstractive Summarization

10

Approach: ResQu

Word Co-occurrence

We can use the words that co-occur with the semantic predications in a summary to represent the meaning of those predications, based on distributional semantics.

Leave-one-Out

By generating multiple summaries, each with one feature held out, we can evaluate the impact of each feature.

11

A semantic summary can be understood and potentially improved by leveraging distributional

statistics between the structured knowledge that comprises the semantic summary and the

words with which these structured constructs co-occur, across the corpus.

Thesis Statement


valproic acid TREATS migraine

Sumatriptan TREATS Migraine Disorders

lamotrigine TREATS Migraine with Aura

Dihydroergotamine TREATS Migraine Disorders


Aspirin TREATS Migraine Disorders

zolmitriptan TREATS Migraine Disorders

eletriptan TREATS Migraine Disorders

Analgesics TREATS Migraine Disorders

ziconotide TREATS Migraine Disorders

Proposed Solution (ResQu)

[Figure: the semantic predications listed above, their co-occurring arguments, and the resulting semantic summary vector]

13


Measuring Similarity

• Similarity between the semantic summary (SS) and the gold standard (GS)

- Cosine similarity, Euclidean distance, Jensen-Shannon divergence

• Root Mean-Squared Error

- Computed for each summary generated with a feature held out

The held-out summary that is least similar to the gold standard identifies the most important feature, because removing that feature degrades the summary the most.


Resource-Rich Biomedical Knowledge

Assertional Knowledge and Definitional Knowledge: Complementary, Disjoint

65 attributes: 62 provenance metadata, 3 semantic attributes

MEDLINE (1865 – 2015)
- Largest biomedical knowledge base
- >25 million abstracts
- PubMed, PMC
- Documents d1, d2, d3, ..., dn with MeSH indexing

Semantic Predications

Medical Subject Headings (MeSH)
- 15 unique trees, max depth 15
- ~27,000 terms

Unified Medical Language System (UMLS)
- SPECIALIST Lexicon
- Semantic Network
- Metathesaurus: >300k concepts, >100 vocabularies
- 9 million triples
- 134 types, 15 groups, 54 predicates


ResQu System Architecture

[Figure: ResQu system architecture]

Components: User Query Processor, Document Selector, Predication Extractor (SemRep), Predication Mapper, Concept Mapper, Summarizer (Schema Summarizer), Graph Generator, Vectorizer, Jericho Crawler, Gold standard creation module, Similarity Computation Module

Data: MEDLINE, ResQu Summary Vectors, Gold standard Vectors

15

User Query

q = (l, c1, c2, dt, ub)

• l: label of an entity (or concept) in the UMLS

- Migraine Disorders: C0149931

• c1: Humans[MH] and c2: Clinical Trial[PTYP]

• dt: the date range of the documents

• ub: the upper bound (default = 5000)


q = (Migraine Disorders[MH] AND Humans[MH] AND Clinical Trial[PTYP] AND 1860/01:2014/08[DCOM])

User Query Instance
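The query instance above can be assembled mechanically from the components of q. The following is a minimal sketch, not the thesis implementation: the function name `build_pubmed_query` and its parameters are hypothetical, and the upper bound ub is left out because it limits the number of results rather than appearing in the query string.

```python
def build_pubmed_query(label, constraints, date_range):
    """Join the MeSH label, the filter constraints, and the date range with AND.

    All names here are illustrative; only the field tags ([MH], [PTYP], [DCOM])
    come from the query shown on the slide.
    """
    terms = [f"{label}[MH]"]             # l: UMLS/MeSH label of the topic
    terms.extend(constraints)            # c1, c2: already tagged constraints
    terms.append(f"{date_range}[DCOM]")  # dt: date range of the documents
    return " AND ".join(terms)


query = build_pubmed_query(
    label="Migraine Disorders",
    constraints=["Humans[MH]", "Clinical Trial[PTYP]"],
    date_range="1860/01:2014/08",
)
print(query)
```

Keeping the constraints as a list makes it easy to add or drop a filter without touching the joining logic.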


• Takes the query from the User Query Processor.

• Retrieves the set of MEDLINE documents D = {d1, d2, . . . , dn}.

• Uses the MEDLINE Entrez Search API.

Document Selection
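As an illustration of the document selection step, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint, which is one way to reach the Entrez search service. The slides only name the "MEDLINE Entrez Search API", so the choice of ESearch, the helper name `esearch_url`, and the mapping of the upper bound ub to the `retmax` parameter are assumptions. The code only constructs the URL and performs no network call.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint (assumed here to be the API meant on the slide).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(query, upper_bound=5000):
    """Return an ESearch URL for PubMed; upper_bound plays the role of ub."""
    params = {
        "db": "pubmed",         # search the PubMed/MEDLINE database
        "term": query,          # the query string built from q
        "retmax": upper_bound,  # cap on the number of returned document ids
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = esearch_url("Migraine Disorders[MH] AND Humans[MH]")
print(url)
```

Fetching this URL would return the ids of the matching documents, from which the set D can be populated.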

20

Document Selection

21

Semantic Predications Extractor

22

A randomized, placebo-controlled trial of acetaminophen for

treatment of migraine disorders

Acetaminophen TREATS Migraine Disorders

Automatic Summarizer

Inflammation mediated by the immune system is known to be important in carcinogenesis and, specifically, T helper 17 cells have been reported to play a role in tumor

progression by promoting neo-angiogenesis. The aim of this study was to investigate whether inflammatory cytokines and vascular endothelial growth factor (VEGF) levels

in exhaled breath condensate (EBC) and in serum were related to tumor size in patients with non-small cell lung cancer (NSCLC). IL-6, IL-17, TNF-α and VEGF levels were

measured in EBC and serum of 15 patients with stage I-IIA NSCLC and in 30 healthy controls by immunoassay. The tumor size was measured by a CT scan. The

concentrations of IL-6, IL-17 and VEGF were significantly higher in EBC of patients with lung cancer, compared with controls, while only serum IL-6 concentration was

higher in patients compared to controls. A significant correlation (r = 0.78, p = 0.001) was observed between EBC levels of IL-6 and IL-17; IL-17 was also correlated to EBC

levels of the VEGF (r = 0.83, p < 0.001) and TNF-α (r = 0.62, p = 0.014). The tumor diameter was significantly correlated with EBC concentrations of VEGF (r = 0.58, p =

0.039), IL-6 (r = 0.67, p = 0.013) and IL-17 (r = 0.66, p = 0.017). Our results show a significant relationship between inflammatory and angiogenic markers, measured in

EBC by a non-invasive method, and tumor mass. To assess whether polymorphisms of the interleukin-23 receptor (IL23R) gene are associated with bladder transitional cell

carcinoma because chronic inflammation contributes to bladder cancer and the IL23R is known to be critically involved in the carcinogenesis of various malignant tumors.

226 patients with bladder cancer and 270 age-matched controls were involved in the study. Polymerase chain reaction-restriction fragment length polymorphism was used

for genotyping. Genotype distribution and allelic frequencies between patients and controls were compared. In all three single nucleotide polymorphisms of IL23R studied,

the distribution of genotype and allele frequencies of rs10889677 differed significantly between patients and controls. The frequency of allele C of rs10889677 was

significantly increased in cases compared with controls (0.2898 vs. 0.1833, odds ratio 1.818, 95 % confidence interval 1.349-2.449). The result indicates that IL23R may

play an important role in the susceptibility of bladder cancer in Chinese population. For over a century, inactivated or attenuated bacteria have been employed in the clinic

as immunotherapies to treat cancer, starting with the Coley's vaccines in the 19th century and leading to the currently approved bacillus Calmette-Guérin vaccine for

bladder cancer. While effective, the inflammation induced by these therapies is transient and not designed to induce long-lasting tumor-specific cytolytic T lymphocyte

(CTL) responses that have proven so adept at eradicating tumors. Therefore, in order to maintain the benefits of bacteria-induced acute inflammation but gain long-lasting

anti-tumor immunity, many groups have constructed recombinant bacteria expressing tumor-associated antigens (TAAs) for the purpose of activating tumor-specific CTLs.

One bacterium has proven particularly adept at inducing powerful anti-tumor immunity, Listeria monocytogenes (Lm). Lm is a gram-positive bacterium that selectively

infects antigen-presenting cells wherein it is able to efficiently deliver tumor antigens to both the MHC Class I and II antigen presentation pathways for activation of tumor-

targeting CTL-mediated immunity. Lm is a versatile bacterial vector as evidenced by its ability to induce therapeutic immunity against a wide-array of TAAs and specifically

infect and kill tumor cells directly. It is for these reasons, among others, that Lm-based immunotherapies have delivered impressive therapeutic efficacy in preclinical

models of cancer for two decades and are now showing promise clinically.

[Figure: semantic summary graph. Node labels: Migraine Disorders, Ibuprofen, Topiramate, Acetaminophen, Headache, Vestibule, Pain. Edge labels: TREATS, PREVENTS, ISA, LOCATION_OF.]

24

Semantic Summary

25

Summary Transformation

Step 1: get all documents for each concept in the semantic summary.

Step 2: create a bag-of-words for each concept (term frequency).

Step 3: aggregate the bags-of-words of all concepts in the semantic summary.

Step 4: use the idf of each word in the corpus to create the tf-idf vector for the given semantic summary.

$$tfidf(t, d, D) = tf(t, d) \cdot \log \frac{N}{n_t}$$
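The four transformation steps and the tf-idf weighting can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ResQu code: the whitespace tokenizer, the `concept_docs` mapping from a concept to the documents that mention it, and the three-sentence corpus are all hypothetical. N is taken as the number of corpus documents and n_t as the number of documents containing term t.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def summary_tfidf(concepts, concept_docs, corpus):
    """Build a tf-idf vector (term -> weight) for a semantic summary."""
    # Steps 1-3: collect each concept's documents, count terms, aggregate.
    tf = Counter()
    for concept in concepts:
        for doc in concept_docs.get(concept, []):
            tf.update(tokenize(doc))
    # Step 4: weight each term by log(N / n_t) computed over the corpus.
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {t: f * math.log(n_docs / doc_freq[t])
            for t, f in tf.items() if doc_freq[t] > 0}


corpus = ["ibuprofen treats migraine",
          "sumatriptan treats migraine",
          "aspirin treats fever"]
concept_docs = {"Ibuprofen": [corpus[0]],
                "Migraine Disorders": [corpus[0], corpus[1]]}
vec = summary_tfidf(["Ibuprofen", "Migraine Disorders"], concept_docs, corpus)
print(vec)
```

In this toy corpus "treats" occurs in every document, so its idf is log(1) = 0 and it contributes nothing to the vector, which is the intended effect of the idf factor.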

26

Bag-of-words Model

We used hemofiltration to treat a patient with digoxin overdose that was

complicated by refractory hyperkalemia.

bow = [(we,1), (used,1), . . ., (hyperkalemia,1)]

bow_sparse_vector =[(678,1), (2,1), . . ., (999,1)]
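A minimal sketch of how the example sentence becomes a bag-of-words and then a sparse vector of (term index, count) pairs. The punctuation stripping and the small alphabetical dictionary built here are assumptions for illustration; the indices on the slide (678, 2, 999) come from the full corpus dictionary, so the indices produced below will differ.

```python
from collections import Counter

sentence = ("We used hemofiltration to treat a patient with digoxin overdose "
            "that was complicated by refractory hyperkalemia.")


def bag_of_words(text):
    # Lowercase, split on whitespace, and strip trailing punctuation (assumption).
    tokens = [tok.strip(".,;:").lower() for tok in text.split()]
    return Counter(tok for tok in tokens if tok)


def to_sparse_vector(bow, dictionary):
    # Replace each known term by its dictionary index; keep the count.
    return sorted((dictionary[term], count)
                  for term, count in bow.items() if term in dictionary)


bow = bag_of_words(sentence)
# Hypothetical dictionary: alphabetical index over the terms of this one sentence.
dictionary = {term: idx for idx, term in enumerate(sorted(bow))}
bow_sparse_vector = to_sparse_vector(bow, dictionary)
print(list(bow.items())[:3])
print(bow_sparse_vector[:3])
```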

27

Dictionary Creation

28

Term        Index   Document ids
ibuprofen   0       1, 3, …, 3000
…
migraine    5       5, 6, …, 475

Documents

ibuprofen is …. migraine

Ibuprofen is effective in treating Migraine
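The dictionary shown above pairs each term with an index and the ids of the documents that contain it. A small sketch of one way to build such a structure follows; the function name `build_dictionary`, the first-seen index assignment, and the second example document are assumptions made for illustration.

```python
def build_dictionary(documents):
    """Map each term to (index, sorted document ids).

    documents: dict of document id -> text. Indices are assigned in order of
    first appearance (an assumption; the slides do not state the ordering).
    """
    index_of = {}
    postings = {}
    for doc_id, text in documents.items():
        for token in text.lower().split():
            if token not in index_of:
                index_of[token] = len(index_of)
                postings[token] = set()
            postings[token].add(doc_id)
    return {term: (index_of[term], sorted(postings[term])) for term in index_of}


documents = {
    1: "Ibuprofen is effective in treating Migraine",
    2: "Sumatriptan is effective in treating Migraine",  # hypothetical second document
}
dictionary = build_dictionary(documents)
print(dictionary["ibuprofen"])
print(dictionary["migraine"])
```

The document-id lists give the document frequency n_t of each term directly, which is what the idf computation needs.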

Gold Standard Dataset

29

Gold Standard Vectorization

Step 1: iterate over each document in the gold standard.

Step 2: tokenize each sentence.

Step 3: create the bag-of-words model.

Step 4: use the idf of each word from the dictionary to create the tf-idf vector for the gold standard.

Problem: data sparsity.

30

Gold Standard Vectorization Enhancement

Step 1: map the gold standard document to UMLS concepts with MetaMap.

Step 2: create a bag-of-words for each concept (term frequency).

Step 3: aggregate the per-concept bags-of-words into one bag-of-words for the gold standard summary.

Step 4: use the idf of each word from the dictionary to create the tf-idf vector for the gold standard.

Solution: enhance with context clues from the corpus.

31

Step 1: select 20 diseases as topics for an information need.

Step 2: use each query to generate a semantic summary.

Step 3: transform each semantic summary into a semantic summary vector.

Step 4: transform each gold standard into a gold standard tf-idf vector.

Step 5: compute the similarity between each semantic summary vector and its associated gold standard vector under different features.

Step 6: determine the features that generate the most informative summary in each scenario.

Evaluation: Overall Approach

32

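Summarization Evaluation Metrics

• Cosine similarity

$$\cos(s, T) = \frac{s \cdot T}{\lVert s \rVert \, \lVert T \rVert}$$

• Euclidean distance

$$e(s, T) = \sqrt{\sum_{i=1}^{n} (\omega_i - t_i)^2}$$

• Jensen-Shannon divergence

$$JSD(s \,\|\, T) = \frac{1}{2}\left[ KL(s \,\|\, M) + KL(T \,\|\, M) \right], \quad \text{where } M = \frac{1}{2}(s + T)$$

$$KL(s \,\|\, T) = \sum_{i=1}^{n} P(w_i) \log \frac{P(w_i)}{P(t_i)}$$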
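The three metrics can be written directly from their definitions. The sketch below uses plain Python lists as dense vectors. Two choices are assumptions rather than something stated on the slides: the vectors are normalized to sum to 1 before the Jensen-Shannon divergence is computed, and logarithms are taken base 2 so that the divergence lies in [0, 1].

```python
import math


def cosine_similarity(s, t):
    dot = sum(a * b for a, b in zip(s, t))
    norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
    return dot / norm if norm else 0.0


def euclidean_distance(s, t):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))


def _normalize(v):
    # Turn non-negative weights into a probability distribution (assumption).
    total = sum(v)
    return [x / total for x in v]


def kl_divergence(p, q):
    # Terms with p_i = 0 contribute nothing to the sum.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(s, t):
    p, q = _normalize(s), _normalize(t)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture M = (s + T) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


s = [3, 0, 2]  # toy semantic summary vector
t = [5, 4, 1]  # toy gold standard vector
print(cosine_similarity(s, t), euclidean_distance(s, t), js_divergence(s, t))
```

Note the different directions: a higher cosine means the vectors are closer, while higher Euclidean distance and Jensen-Shannon divergence mean they are further apart.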

33


Cosine Similarity

[Figure: example semantic summary vector s and gold standard vector T over the vocabulary W = {w1, w2, . . . , wn}]

T: gold standard vector; s: semantic summary vector

$$\cos(s, T) = \frac{s \cdot T}{\lVert s \rVert \, \lVert T \rVert}$$

Euclidean Distance

[Figure: example semantic summary vector s and gold standard vector T over the vocabulary W = {w1, w2, . . . , wn}]

T: gold standard vector; s: semantic summary vector

$$e(s, T) = \sqrt{\sum_{i=1}^{n} (\omega_i - t_i)^2}$$

$$= \sqrt{(3-5)^2 + (2-1)^2 + (3-0)^2 + (2-0)^2 + (0-3)^2 + (0-2)^2 + (0-5)^2 + \dots + (0-2)^2}$$

$$= \sqrt{122} \approx 11.04$$

Semantic Similarity Comparison

34

Root Mean-Squared Error

35

$$E = (e_1, e_2, \dots, e_{20})$$

$$E_S = (S'_1, S'_2, \dots, S'_{20})$$

$$E_T = (T'_1, T'_2, \dots, T'_{20})$$

$$\cos(E_S, E_T) = (\cos_1, \cos_2, \dots, \cos_{20})$$

$$eud(E_S, E_T) = (eud_1, eud_2, \dots, eud_{20})$$

$$JS(E_S, E_T) = (js_1, js_2, \dots, js_{20})$$

Root Mean-Squared Error

36

$$SIM = \{sim_1, sim_2, \dots, sim_{20}\}$$

$$RMSE(SIM) = \sqrt{\frac{\sum_{i=1}^{n} sim_i^2}{n}}$$
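The aggregation over the 20 queries can be sketched as follows. The function follows the formula as it appears on the slide, that is, the square root of the mean of the squared per-query scores. The name `rmse` and the example scores are illustrative only.

```python
import math


def rmse(similarities):
    """Root of the mean of squared per-query scores, as in RMSE(SIM)."""
    n = len(similarities)
    return math.sqrt(sum(x * x for x in similarities) / n)


# Hypothetical per-query cosine scores for two queries.
scores = [3.0, 4.0]
print(rmse(scores))
```

Applying the same function to the cosine, Euclidean, and Jensen-Shannon score lists yields one aggregate value per metric and per held-out feature.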

Method                   Cosine-RMSE   Euclidean-RMSE   JS-RMSE
Leave-out-relevancy      0.263         0.315            0.187
Leave-out-connectivity   0.263         0.335            0.143
Leave-out-novelty        0.254         0.329            0.252
Leave-out-saliency       0.237         0.333            0.281

Evaluation

Saliency is the most important feature: leaving it out yields the lowest Cosine-RMSE (0.237) and the highest JS-RMSE (0.281).
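The leave-one-out reading of the table can be reproduced with a short sketch: for each metric, pick the held-out feature whose summary is least similar to the gold standard. The values are copied from the table above; the helper `most_important` and the direction flags are assumptions about how each metric is interpreted (a higher cosine means closer, while a higher distance or divergence means further).

```python
# RMSE values per held-out feature, copied from the evaluation table.
rmse_table = {
    "relevancy":    {"cosine": 0.263, "euclidean": 0.315, "js": 0.187},
    "connectivity": {"cosine": 0.263, "euclidean": 0.335, "js": 0.143},
    "novelty":      {"cosine": 0.254, "euclidean": 0.329, "js": 0.252},
    "saliency":     {"cosine": 0.237, "euclidean": 0.333, "js": 0.281},
}


def most_important(table, metric, higher_is_closer):
    """Return the feature whose removal leaves the least similar summary."""
    pick = min if higher_is_closer else max
    return pick(table, key=lambda feature: table[feature][metric])


by_cosine = most_important(rmse_table, "cosine", higher_is_closer=True)
by_js = most_important(rmse_table, "js", higher_is_closer=False)
by_euclidean = most_important(rmse_table, "euclidean", higher_is_closer=False)
print(by_cosine, by_js, by_euclidean)
```

Under this reading, cosine and Jensen-Shannon both single out saliency, while Euclidean distance puts connectivity (0.335) marginally ahead of saliency (0.333).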

37

• We propose a method for intrinsic evaluation of abstractive summarization.

• We transform semantic summaries into an equivalent textual representation.

• We evaluate the impact of summarization features using several similarity metrics.

• We adopt a leave-one-out strategy to identify and evaluate the features that impact automatically generated semantic summaries.

Contributions

38

Limitations and Future Work

1. Query diversity

- 20 disease treatments

2. Concept-based bag-of-words

3. Gold standard impurities

- Diluted quality based on co-occurrence

39

Use machine learning and a larger query set

Involve more domain experts and consider

other gold standard creation techniques

Use facts instead of concepts

40

THANK YOU!

Acknowledgements

Prof. Amit P. Sheth (Advisor)

Prof. Krishnaprasad Thirunarayan

Thomas C. Rindflesch

Delroy Cameron
