tweeting beyond facts – the need for a linguistic perspective

Tweeting Beyond Facts: The Need for a Linguistic Perspective

Sabine Bergler, CLaC Labs

Sofia 2015

CLaC Labs Core Idea

Linguistics (like mathematics) is
• general
• consistent (across domains, corpora, and tasks)
• modular (= compositional)

Domain knowledge is
• specific
• only sometimes compositional
• reasonably well supported for some domains (NLM suite of tools for BioNLP)

CLaC Modules and Architecture

• discourse structure
• embedding graph (typed)
• coreference
• semantic annotations
• parse tree, dependencies
• domain ontology
• lexical semantics

Archaeological Approach

Theory

• shallow
• slow and careful (small goals)
• attention to context
• analyzed extensively
• iterative

Practice

• linguistically inspired
• modular
• vetted in shared tasks
• extensive ablation studies
• reuse in different pipelines for additional evaluation

Pertussis Seroprevalence in Korean Adolescents and Adults Using Anti-Pertussis Toxin Immunoglobulin G
J Korean Med Sci. 2014 May;29(5)

This finding indicates that natural pertussis infection is endemic in older adults and that Tdap booster vaccination rates at 11-12 yr of age may be insufficient. Reports from Israel and the Netherlands have already indicated that the highest pertussis seroprevalence was in older adults (13,18). Because protective immunity against pertussis may last for 4-12 yr after a primary DTaP vaccination series (19,20), natural pertussis infection could occur in older adults even after previous vaccinations.

Legend: report, negation, modal, temporal ordering

Existence and Facts

The mean anti-PT IgG titer and pertussis seroprevalence were 35.53 ± 62.91 EU/mL and 41.4%, respectively.

The mean anti-PT IgG titers and seroprevalence were not significantly different between the age groups.

However, the seroprevalence in individuals 51 yr of age or older was significantly higher than in individuals younger than 51 yr (46.5% vs 39.1%, P = 0.017).

Legend: negation, comparison, contrast, irrealis

Negation: explicit and implicit

• trigger (lists of different lengths available, domain-specific lists possible)
• linguistic scope (derived from parser information)

We observed no genetic alterations in the IRF-4 promoter, which can account for the lack of IRF-4 expression.

entailment: no alterations? no alterations in the IRF-4 promoter?

Stanford Parse Tree

(S (NP (PRP We)) (VP (VBD observed) (NP (NP (DT no) (JJ genetic) (NNS alterations)) (PP (IN in) (NP (NP (DT the) (NN IRF-4) (NN promoter)) (SBAR (WHNP (WDT which)) (S (VP (MD can) (VP (VB account) (PP (IN for) (NP (NP (DT the) (NN lack)) (PP (IN of) (NP (NN IRF-4) (NN expression)))))))))))))))

Collapsed Typed Dependencies

nsubj(observed-2, We-1)
root(ROOT-0, observed-2)
neg(alterations-5, no-3)
amod(alterations-5, genetic-4)
dobj(observed-2, alterations-5)
det(promoter-9, the-7)
nn(promoter-9, IRF-4-8)
prep_in(alterations-5, promoter-9)
nsubj(account-13, promoter-9)
aux(account-13, can-12)
rcmod(promoter-9, account-13)
det(lack-16, the-15)
prep_for(account-13, lack-16)
nn(expression-19, IRF-4-18)
prep_of(lack-16, expression-19)
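The trigger and linguistic-scope idea can be sketched directly on the collapsed typed dependencies of the IRF-4 sentence. The following is a minimal illustration, not NEGATOR's actual rule set: it assumes the trigger is whatever carries a `neg` relation and takes the scope to be everything reachable from the negated head. The function names and the `barriers` parameter are hypothetical. Treating `rcmod` as a barrier yields the narrow reading ("no alterations in the IRF-4 promoter"); dropping the barrier pulls the relative clause into the scope.

```python
import re
from collections import defaultdict

# Collapsed typed dependencies for the IRF-4 example sentence.
DEPENDENCIES = """
nsubj(observed-2, We-1)
root(ROOT-0, observed-2)
neg(alterations-5, no-3)
amod(alterations-5, genetic-4)
dobj(observed-2, alterations-5)
det(promoter-9, the-7)
nn(promoter-9, IRF-4-8)
prep_in(alterations-5, promoter-9)
nsubj(account-13, promoter-9)
aux(account-13, can-12)
rcmod(promoter-9, account-13)
det(lack-16, the-15)
prep_for(account-13, lack-16)
nn(expression-19, IRF-4-18)
prep_of(lack-16, expression-19)
"""

DEP_RE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def parse_dependencies(text):
    """Return (relation, governor, dependent) triples; nodes are (word, index)."""
    return [(rel, (gov, int(gi)), (dep, int(di)))
            for rel, gov, gi, dep, di in DEP_RE.findall(text)]


def negation_scopes(edges, barriers=()):
    """Map each neg trigger to the words reachable from its head.

    Assumption for illustration only: relations listed in `barriers`
    are not crossed when collecting the scope.
    """
    children = defaultdict(list)
    for rel, gov, dep in edges:
        children[gov].append((rel, dep))

    scopes = {}
    for rel, head, trigger in edges:
        if rel != "neg":
            continue
        seen = {head, trigger}  # the visited set also guards against cycles
        stack = [head]
        while stack:
            node = stack.pop()
            for child_rel, child in children[node]:
                if child_rel in barriers or child in seen:
                    continue
                seen.add(child)
                stack.append(child)
        seen.discard(trigger)  # report the scope without the trigger itself
        scopes[trigger] = [word for word, _ in sorted(seen, key=lambda n: n[1])]
    return scopes


edges = parse_dependencies(DEPENDENCIES)
narrow = negation_scopes(edges, barriers=("rcmod",))
wide = negation_scopes(edges)
print(narrow[("no", 3)])
print(wide[("no", 3)])
```

Because the dependencies are collapsed, the preposition "in" is folded into `prep_in` and does not appear as a token in either scope.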

NEGATOR, developed by Sabine Rosenberg

1. trigger detection
2. linguistic scope determination
3. focus of negation detection
4. negation and modality interaction

Leader in two Shared Task competitions:

*Sem 2012 pilot task on negation focus (sole participant)

CLEF 2012 QA4MRE pilot task on interaction of negation and modality (Rank 1 and 2 of 6 with over 10% advance)

ModNegator for CLEF QA4MRE assembled from existing modules:

• negation triggers from NEGATOR
• modality triggers from Kilicoglu
• scope from NEGATOR (auxiliary rules added)

Rank 1 with wide margin (Conan Doyle data)

              narrow  greedy
macroaverage  .64     .62
microaverage  .71     .68
accuracy      .71     .67

Error Case

Scope barrier relative clause:
Dr Gallo had initially suggested that AIDS was caused by HTLV-I, a virus that no one disputes he discovered.

ModalTrigger: suggested
Modal Scope: Dr Gallo had initially suggested that AIDS was caused by HTLV-I, a virus that no one disputes he discovered.

NegTrigger: no
Negation Scope: Dr Gallo had initially suggested that AIDS was caused by HTLV-I, a virus that no one disputes he discovered.

NEGATOR: disputes : LABEL = NEGMOD
Gold Standard: disputes : LABEL = NEG
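The mismatch between NEGMOD and NEG can be reproduced with a small labeling sketch. It assumes, purely for illustration, that scopes are token ranges to the right of their trigger and that a token inside both a negation scope and a modal scope is labeled NEGMOD. The token ranges, the `label` function, and the choice of the appositive "a virus that ..." as the barrier position are hypothetical, not NEGATOR's actual rules.

```python
SENTENCE = ("Dr Gallo had initially suggested that AIDS was caused by HTLV-I, "
            "a virus that no one disputes he discovered.")
TOKENS = SENTENCE.split()


def label(index, neg_scope, mod_scope):
    """Label one token by the scopes that contain it."""
    in_neg = index in neg_scope
    in_mod = index in mod_scope
    if in_neg and in_mod:
        return "NEGMOD"
    if in_neg:
        return "NEG"
    if in_mod:
        return "MOD"
    return "NONE"


target = TOKENS.index("disputes")
modal_trigger = TOKENS.index("suggested")
neg_trigger = TOKENS.index("no")

# Assumed scopes: everything to the right of each trigger.
neg_scope = range(neg_trigger + 1, len(TOKENS))
greedy_modal_scope = range(modal_trigger + 1, len(TOKENS))

# Assumed barrier: the modal scope stops before the appositive
# "a virus that ...", which contains the relative clause.
barrier = TOKENS.index("a")
bounded_modal_scope = range(modal_trigger + 1, barrier)

print(label(target, neg_scope, greedy_modal_scope))   # modal scope crosses the clause
print(label(target, neg_scope, bounded_modal_scope))  # modal scope stops at the barrier
```

Under these assumptions the unbounded modal scope reproduces the system label, and the bounded one reproduces the gold label.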

Speculative Language (aka Hedging)

Also we could not find any RAG-like sequences in the recently sequenced sea urchin lancelet hydra.

Caspases can also be activated with the aid of Apaf-1, which in turn appears to be regulated by cytochrome c and dATP.

Phenotypic differences are suggestive of distinct functions for some of these genes in regulating dendrite arborization.

Speculative Language Detection (Halil Kilicoglu)

• BioNLP 08, BioNLP 09, CoNLL 2011
• same system adapted for subsequent tasks
• based on triggers and parser dependencies
• also incorporates negation, modality, etc.

Embedding Predications (Halil Kilicoglu, 2012)

Unified account of semantic phenomena beyond categorical assertions

• core notion: semantic embedding
• categorization: comprehensive, domain-independent, consolidated
• embedding graph: compositional semantic interpretation
• genre-independent: news, molecular biology, shared tasks

Kilicoglu Processing Pipeline

Syntactic Dependency Graph 1

Dependency Graph 2

Typed Combined Embedding Graph

Sentiment Towards Vaccination

The incidence ☹ of pertussis ☹ decreased ☺ with the introduction of the diphtheria-tetanus-whole cell pertussis (DTwP) vaccination ☺ in children around the world (1), and a decrease ☺ in pertussis ☹ was also observed in Korea where the DTwP vaccination ☺ has been universally recommended ☺ for infants and children since 1954 (2).

However, pertussis ☹ began to rise ☹ in the 1990s in Europe and North America, especially in adolescents (1,3,4,5), and it has been also observed since the 2000s in Korea (2).

Summary: §1:☺ §2: ☹

Sentiment Inferences

The incidence ☹ of pertussis decreased with the introduction of the diphtheria-tetanus-whole cell pertussis (DTwP) vaccination.

Baseline: count sentiment words, use majority vote: ☹

Lexical semantics + syntactic inferences:

NP: (The incidence of pertussis ☹) ☹ → pertussis ☹
Valence shifter verb: decreased(NP) = decreased(☹) = ☺
Bonus inference: decrease(DTwP, pertussis ☹) ☺ → DTwP ☺
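The contrast between the majority-vote baseline and the valence-shifter inference can be sketched in a few lines. The lexicon entries below are assumptions taken from the ☹/☺ marks on the vaccination example; the function names and the one-verb shifter list are hypothetical and only illustrate the composition step.

```python
# Hypothetical toy lexicon: -1 for negative (☹), +1 for positive (☺).
LEXICON = {"incidence": -1, "pertussis": -1, "vaccination": +1}
VALENCE_SHIFTERS = {"decreased"}


def sign(value):
    """Return -1, 0, or +1."""
    return (value > 0) - (value < 0)


def baseline_majority(tokens):
    """Baseline: count sentiment words and take the majority polarity."""
    return sign(sum(LEXICON.get(token.lower(), 0) for token in tokens))


def shifted_polarity(np_tokens, verb):
    """Polarity of 'verb(NP)': a valence shifter flips the NP polarity."""
    np_polarity = baseline_majority(np_tokens)
    return -np_polarity if verb.lower() in VALENCE_SHIFTERS else np_polarity


sentence = ("The incidence of pertussis decreased with the introduction "
            "of the DTwP vaccination").split()
subject_np = sentence[:4]  # "The incidence of pertussis"

print(baseline_majority(sentence))               # majority vote over the sentence
print(shifted_polarity(subject_np, "decreased"))  # NP polarity flipped by the verb
```

With this lexicon the baseline comes out negative, while applying the shifter to the subject noun phrase comes out positive, matching the two readings on the slide.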

Sentiment Analysis for Tweets (Canberk Özdemir)

SemEval 2015 Task 10B: rank 9 (of 40)

introduces new large semantic lexicon: Gezi

combines 5 sentiment lexica (AFINN smallest, Gezi largest)

uses linguistic scope for negation and modality (NEGATOR)

benefits from 5 point sentiment scale (strong pos, pos, neg, strong neg)

Tweets with Figurative Language (Canberk Özdemir)

SemEval 2015 Task 11: rank 1 (of 35) with wide margin

same system as for Task 10B

no special tailoring for figurative language apart from using training data for decision tree

linguistic notions at the moment equivalent to training for figurative language

Negation and Modality in ClacSentipipe

• negation triggers from Rosenberg
• modality triggers: modal auxiliaries
• scope from NEGATOR

He is hurt. -2
Negation flips and dampens (*-.5): He is not hurt. +1
Modality dampens (*.5): He may be hurt. -1

features: negated-negative, modalized-negative, …
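The arithmetic on this slide can be written out as a short sketch. The two factors come from the slide; the function names are hypothetical, and applying both factors when a word is both negated and modalized is an assumption, since the slide does not show that case.

```python
NEGATION_FACTOR = -0.5  # negation flips and dampens
MODALITY_FACTOR = 0.5   # modality dampens


def adjust(score, negated=False, modalized=False):
    """Adjust a lexicon score for a word inside a negation or modality scope."""
    if negated:
        score *= NEGATION_FACTOR
    if modalized:
        # Assumption: factors simply multiply when both scopes apply.
        score *= MODALITY_FACTOR
    return score


def scope_features(score, negated=False, modalized=False):
    """Feature names such as 'negated-negative' or 'modalized-negative'."""
    polarity = "negative" if score < 0 else "positive" if score > 0 else "neutral"
    names = []
    if negated:
        names.append(f"negated-{polarity}")
    if modalized:
        names.append(f"modalized-{polarity}")
    return names


hurt = -2  # prior score of "hurt" in "He is hurt."
print(adjust(hurt, negated=True))          # He is not hurt.
print(adjust(hurt, modalized=True))        # He may be hurt.
print(scope_features(hurt, negated=True))
```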

Sample Tweet Gold Annotations

Need car financing? Toyota of Hollywood has you covered! http://t.co/rMFV0qYNOK

Kobe Bryant is better than the 40th best player. I would say about 25th

@TV_Exposed: Every episode of Friends is coming to Netflix on January 1st http://t.co/OiVJzaTOh9 damn i want netflix heere tooo

Equalizer tomorrow, Alexander and the Terrible Horrible No Good Very Bad Day & Fury Sunday. #lastfreemovieweekend


Current Work at CLaC Labs

Extend the trigger scope approach for

✓ negation
• modality
• sentiment annotation
• modification (human monocytes)
• emotion annotation
• causal chain extraction
• vaccine avoidance argument detection in blogs

Explicit Negation

Noun Phrases

Sundries

Underappreciated Items

• numbers (IV, twice, 100,00, 100.00, 100,000)
• amounts (57%, 16Gb, 12ml, pH7, 7mph)
• locations
• person
• tense and aspect (this type of research has not been done/was not done/is not done/is not being done)
• modifier semantics (prenominal modifiers: long-term prospective studies, adverbials: virtually no risk)
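As a small illustration of why amounts deserve their own treatment, the sketch below tags the amount examples from the list with a single regular expression. The unit inventory is limited to the units that appear in those examples and is an assumption; a real module would need a much larger, domain-aware list.

```python
import re

# Hypothetical pattern covering only the units shown in the examples:
# a pH value, or a number followed by %, Gb, ml, or mph.
AMOUNT_RE = re.compile(
    r"\bpH\s?\d+(?:\.\d+)?"
    r"|\b\d+(?:\.\d+)?\s?(?:%|(?:Gb|ml|mph)\b)"
)


def find_amounts(text):
    """Return all amount expressions found in the text, in order."""
    return AMOUNT_RE.findall(text)


print(find_amounts("57%, 16Gb, 12ml, pH7, 7mph"))
print(find_amounts("Coverage rose to 57% after 12 ml doses were introduced."))
```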

Junk Language?

there is much information in ignored language

linguistic treatments are universal, can be adapted to domain specific usage

a suite of general, language oriented modules should be considered as a form of preprocessing of the data, followed by domain specific treatments

this can significantly improve the downstream specialized processing

Conclusion

linguistic principles form a solid baseline for modular, adaptable NLP modules

trigger-linguistic scope approach to speculative language, negation, and modality proved effective

parsing feasible, even for tweets, with preprocessing

extra-propositional parts of text prove effective in task-oriented evaluation

Headnoun, Base NP, MaxNP, PP

<MaxNP> <BaseNP> a 1993 <headnoun> survey </headnoun> </BaseNP> <PP> of pediatricians and family practitioners </PP> </MaxNP>

overly simplistic heuristic: in <MaxNP> <BaseNP> the news in California </BaseNP> </MaxNP>

ellipsis, coordination, …

<MaxNP> <BaseNP> the health </BaseNP> of <MaxNP> <BaseNP> vaccinated vs unvaccinated children </BaseNP> </MaxNP> </MaxNP>

Causal Triggers

Causality (Michelle Khalife)

• is pervasive in language
• conveys important information
• trigger lists exist for biomedical texts
• triggers require predicate argument structure
