IEEE SocialCom 2015: Intent Classification of Social Media Text

Intent Classification of Short-text Social Media. Dec 19 2015, The 8th IEEE SocialCom-2015. Hemant Purohit (Information Sciences and Technology, George Mason U); Guozhu Dong, Valerie Shalin, Krishnaprasad Thirunarayan, Amit Sheth (Kno.e.sis, Wright State U)


Intent Classification of Short-text Social Media

Dec 19 2015

The 8th IEEE SocialCom-2015

Hemant Purohit

Information Sciences and Technology, George Mason U

Guozhu Dong, Valerie Shalin, Krishnaprasad Thirunarayan, Amit Sheth

Kno.e.sis, Wright State U

@hemant_pt IEEE SocialCom-2015

Outline

● Intention

● Social Media Short-text

● Intent Classification Problem

● Feature Representation

    ● Bottom-Up

        ● Bag of Tokens model

    ● Top-Down

        ● Set of Patterns:

            ● Declarative Knowledge & Social Behavior Knowledge

            ● Contrast Mining based Patterns

● Experiments & Results

● Limitations & Future Work


Intention

● Intent: Purpose or aim for an action

● ‘we are tempted to speak of “different senses” of a word which is clearly not equivocal, we may infer that we are pretty much in the dark about the character of the concept which it represents’ (Anscombe 1963, p. 1) [Stanford Encyclopedia of Philosophy]

● Latent in the utterance


Social Media Short-text & Intent

Social media text: unstructured, informal language, short

DOCUMENT | INTENT

Text REDCROSS to 90999 to donate 10$ to help the victims of hurricane sandy | SEEKING HELP

Anyone know where the nearest #RedCross is? I wanna give blood today to help the victims of hurricane Sandy | OFFERING HELP

Would like to urge all citizens to make the proper preparations for Hurricane #Sandy - prep is key - http://t.co/LyCSprbk has valuable info! | ADVISING

Short-text Document Intent

● Intent: Aim of action


How to identify relevant intent from ambiguous, unconstrained natural language text?

Relevant intent ➔ Articulation of organizational tasks (e.g., Seeking vs. Offering resources)


Intent Classification: Problem Formulation

● Given a set of user-generated text documents, identify existing intents

● Variety of interpretations

● Problem statement: a multi-class classification task

approximate f: S → C, where C = {C1, C2, …, CK} is a set of K predefined intent classes, and S = {m1, m2, …, mN} is a set of N short text documents

Focus - Cooperation-assistive intent classes, C= {Seeking, Offering, None}


Intent Classification: Related Work

TEXT CLASSIFICATION TYPE | FOCUS | EXAMPLE

Topic | Predominant subject matter | sports or entertainment

Sentiment/Emotion/Opinion | Focus on present state of emotional affairs | negative or positive; happy emotion

Intent | Focus on action, hence, future state of affairs | offer to help after floods

e.g., I am going to watch the awesome Fast and Furious movie!! #Excited


Intent Classification: Related Work

DATA TYPE | APPROACH FOCUS | LIMITED APPLICABILITY

Formal text on Webpages/blogs (Kröll and Strohmaier 2009, 2015; Raslan et al. 2013, 2014) | Knowledge Acquisition: via Rules, Clustering | Lack of large corpora with proper grammatical structure; poor-quality text is hard to parse for dependencies

Commercial Reviews, marketplace (Hollerit et al. 2013, Chen et al. 2013, Wang et al. 2015, Wu et al. 2011, Ramanand et al. 2010, Carlos & Yalamanchi 2012) | Classification: via Rules, Lexical template based, Pattern | More generalized intents (e.g., ‘help’ is broader than ‘sell’); patterns are more implicit and harder to capture than for buying/selling

Search Queries (Broder 2002, Downey et al. 2008, Case 2012, Wu et al. 2010, Strohmaier & Kröll 2012) | User Profiling: Query Classification | Lack of large query logs, click graphs; existence of social conversation


Intent Classification: Challenges

● Unconstrained Natural Language in small space

● Ambiguity in interpretation

● Sparsity and low ‘signal-to-noise’ ratio: imbalanced classes

    ● 1% signals (Seeking/Offering) in 4.9 million tweets #Sandy

● Hard-to-predict problem

    ● e.g., commercial intent, F-1 score 65% on Twitter [Hollerit et al. 2013]

@Zuora wants to help @Network4Good with Hurricane Relief. Text SANDY to 80888 & donate $10 to @redcross @AmeriCares & @SalvationArmyUS #help

*Blue: offering intent, *Red: seeking intent


Intent Classification: Domain & Features

Intent

● Binary

    ● Crisis Domain:

        - [Varga et al. 2013] Problem & Aid (Japanese)

        - [Purohit et al. 2013, 2014] Seeking & Offering

        - Features: N-grams, Rules, Noun-Verb templates, etc.

    ● Commercial Domain:

        - [Hollerit et al. 2013] Buy vs. Sell intent

        - Features: N-grams, Part-of-Speech

● Multiclass

    ● Commercial Domain:

        - [Wang et al. 2015] Semi-supervised

        - Features: N-grams, Part-of-Speech


TOP-DOWN

Pattern Rules: Declarative (DK) & Social Behavior (SK) Knowledge, Contrast Mining (CTK, CPK)

(patterns defined for intent association)

BOTTOM-UP

Bag of N-grams Tokens: Independent Tokens

(patterns derived from the data)

Our Hybrid Approach

Learning Improves

Expressivity Increases


Intent Classification Hybrid: Multiclass Classifier – Feature Creation

1. (T) Bag of Tokens

Abstraction: due to importance in info sharing [Nagarajan et al. 2010]

- Numeric (e.g., $10) → _NUM_

- Interactions (e.g., RT & @user) → _RT_ , _MENTION_

- Links (e.g., http://bit.ly) → _URL_

N-grams: after stemming and abstraction [Hollerit et al. 2013]

TOKENIZER(mi, min, max) → n-grams of mi; here, TOKENIZER(mi) → { bi-, tri-gram }
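The abstraction and n-gram steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the regular expressions and the function names `abstract` and `tokenizer` are hypothetical, placeholders are lowercased together with the rest of the text, and stemming is omitted because it would need an external stemmer.

```python
import re

# Assumed regular expressions for the abstractions listed on the slide.
# Order matters: links first, so digits inside a URL are not rewritten,
# and mentions before numbers, so digits inside a user name are kept.
ABSTRACTIONS = [
    (re.compile(r"https?://\S+"), " _URL_ "),
    (re.compile(r"\bRT\b"), " _RT_ "),
    (re.compile(r"@\w+"), " _MENTION_ "),
    (re.compile(r"\$?\d+(?:[.,]\d+)*\$?"), " _NUM_ "),
]


def abstract(text):
    """Replace links, retweet markers, mentions and numbers with placeholders."""
    for pattern, placeholder in ABSTRACTIONS:
        text = pattern.sub(placeholder, text)
    return text


def tokenizer(text, min_n=2, max_n=3):
    """Return all n-grams with min_n <= n <= max_n over the abstracted tokens."""
    # Lowercasing also lowercases the placeholders (_URL_ becomes _url_).
    tokens = re.findall(r"_[a-z]+_|#?[a-z']+", abstract(text).lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


grams = tokenizer("RT @redcross: Text SANDY to 80888 & donate $10 http://bit.ly/x")
```

With the default `min_n=2, max_n=3`, the call yields bi-grams such as `donate _num_` and tri-grams such as `_rt_ _mention_ text`, which would then be counted as bag-of-tokens features.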


Leveraging Declarative Knowledge

● Conceptual Dependency Theory [Schank, 1972]

    ● Make meaning independent from the actual words in input

    ● e.g., Class in an Ontology abstracts similar instances

● Verb Lexicon [Hollerit et al. 2013]

    ● Verb reflects action

    ● Relevant Levin’s Verb categories [Levin, 1993], e.g., give, send, etc.

● Syntactic Pattern

    ● Auxiliary & modals: e.g., ‘be’, ‘do’, ‘could’, etc. [Ramanand et al. 2010]

    ● Word order: Verb-Subject positions, etc.


Leveraging Social Behavior Knowledge

● Conversation indicators often thrown away in Text Mining

CATEGORY Hj | Hj SET

H1 - Determiners | the

H3 - Subject pronouns | she, he, we, they

H9 - Dialogue management indicators | thanks, yes, ok, sorry, hi, hello, bye, anyway, how about, so, what do you mean, please, {could, would, should, can, will} followed by pronoun

H11 - Hedge words | kinda, sorta

• Feature_Hj (mi) = term-frequency ( Hj-set, mi )

• Normalized

• Total 14 feature categories
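As a rough sketch of Feature_Hj, the snippet below counts how often words from each indicator set occur in a message and divides by the message length. The sets are abbreviated to single-word entries from the four categories shown above, the name `social_features` is hypothetical, and normalizing by token count is an assumption, since the slide only says "Normalized".

```python
import re

# Abbreviated indicator sets taken from the examples on the slide.
# Multi-word entries ("how about", "what do you mean") are left out to keep
# the sketch short; the full method uses 14 categories.
H_SETS = {
    "H1_determiners": {"the"},
    "H3_subject_pronouns": {"she", "he", "we", "they"},
    "H9_dialogue_management": {
        "thanks", "yes", "ok", "sorry", "hi", "hello", "bye", "anyway", "please",
    },
    "H11_hedges": {"kinda", "sorta"},
}


def social_features(text):
    """Term frequency of each Hj-set in the message, divided by token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty messages
    return {
        name: sum(token in words for token in tokens) / total
        for name, words in H_SETS.items()
    }


features = social_features("Thanks, we kinda need the help")
```

Each of the four categories matches one of the six tokens in this message, so every feature value is 1/6.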


Intent Classification Hybrid: Multiclass Classifier - Feature Creation

2. (DK) Declarative Knowledge Patterns

● Domain expert guidance

● Psycholinguistics syntactic & semantic rules

● Expand by WordNet and Levin Verbs

Feature_Pj (mi) = 1 if Pj exists in mi , else 0

3. (SK) Social Knowledge Indicators

● Offline conversation indicators, e.g., Hj = Dialogue Management, Hj-set = {Thanks, anyway, ...}

Feature_Hj (mi) = term-frequency ( Hj-set, mi )
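The binary Feature_Pj can be illustrated with regular expressions standing in for declarative patterns. The two patterns below are invented for illustration only (the actual expert-defined rules are not given in this transcript), and `dk_features` is a hypothetical name.

```python
import re

# Hypothetical declarative patterns; NOT the rules used in the talk.
DK_PATTERNS = {
    "P_offer_want_to_give": re.compile(
        r"\b(?:i|we)\s+(?:want|wanna|would like)\s+(?:to\s+)?(?:give|donate|send)\b"
    ),
    "P_seek_need": re.compile(
        r"\b(?:need|needs|looking for)\b.*\b(?:help|donations?|volunteers?)\b"
    ),
}


def dk_features(text):
    """Feature_Pj(mi): 1 if pattern Pj occurs in the message, else 0."""
    lowered = text.lower()
    return {name: int(bool(p.search(lowered))) for name, p in DK_PATTERNS.items()}


offer = dk_features("I wanna give blood today to help the victims")
seek = dk_features("Shelter needs volunteers and donations")
```

Here `offer` fires only the first pattern and `seek` only the second, giving one binary feature column per pattern.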


Intent Classification Hybrid: Multiclass Classifier - Feature Creation

4. (CTK) Contrast Knowledge Patterns

INPUT: corpus {mi} cleaned and abstracted, min. support, X

For each class Cj

● Find contrasting pattern using sequential pattern mining

OUTPUT: contrast patterns set {P} for each class Cj

e.g., unique sequential patterns:

SEEKING: help .* victim .* _url_ .*

OFFERING: anyon .* know .* cloth .*

5. (CPK) Contrast Patterns: on Part-of-Speech tags of {mi}
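A sequential pattern such as `help .* victim .* _url_ .*` can be read as an ordered list of items with arbitrary gaps. The sketch below checks whether a tokenized message contains such a pattern; `parse_pattern` and `contains_sequence` are hypothetical helper names, and the sample message is made up.

```python
def parse_pattern(pattern):
    """Turn 'help .* victim .* _url_ .*' into ['help', 'victim', '_url_']."""
    return [item for item in pattern.split() if item != ".*"]


def contains_sequence(items, tokens):
    """True if items occur in tokens in the same order, with any gaps between."""
    remaining = iter(tokens)
    # 'item in remaining' advances the iterator past the first match,
    # so each later item can only match at a later position.
    return all(item in remaining for item in items)


seeking = parse_pattern("help .* victim .* _url_ .*")
message = ["pleas", "help", "the", "victim", "of", "sandi", "_url_"]
matched = contains_sequence(seeking, message)
```

The same check works on Part-of-Speech tag sequences for CPK if `tokens` holds tags instead of stemmed words.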


Contrast Mining based Patterns

Finding CTK (CPK): Contrast Knowledge Patterns

For each class Cj

1. Tokenize the cleaned, abstracted text of {mi}

2. Mine Sequential Patterns, {P}: using SPADE Algorithm

3. Reduce to minimal sequences {P}

4. Compute growth rate & contrast strength for P with all other Ck

5. Top-K ranked {P} by contrast strength

OUTPUT: contrast patterns set {P} for each class Cj

gr(P, Cj, Ck) = support(P, Cj) / support(P, Ck)    .. (1)

Contrast-Growth(P, Cj) = 1/(|Cj| − 1) · Σ_{Ck, k≠j} gr(P, Cj, Ck) / (1 + gr(P, Cj, Ck))    .. (2)

Sparse-Contrast-Strength(P, Cj) = support(P, Cj) · Contrast-Growth(P, Cj)    .. (3)
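Equations (1) to (3) can be worked through on a toy corpus as below. Several points are assumptions rather than details from the talk: support is taken as the fraction of a class's messages containing the pattern, the normalizer in (2) is taken as the number of other classes (K − 1), the term gr/(1 + gr) is computed in the algebraically equal form support_j / (support_j + support_k) so that a zero support in Ck needs no division by zero, and the SPADE mining step is not implemented.

```python
def contains_sequence(items, tokens):
    """True if items occur in tokens in order, allowing gaps."""
    remaining = iter(tokens)
    return all(item in remaining for item in items)


def support(pattern, docs):
    """Assumed definition: fraction of documents that contain the pattern."""
    return sum(contains_sequence(pattern, d) for d in docs) / len(docs) if docs else 0.0


def growth_term(sup_j, sup_k):
    """gr / (1 + gr) with gr = sup_j / sup_k, rewritten as sup_j / (sup_j + sup_k)."""
    return sup_j / (sup_j + sup_k) if sup_j + sup_k > 0 else 0.0


def sparse_contrast_strength(pattern, target, corpus):
    """Equation (3): support in the target class times the contrast growth (2)."""
    sup_j = support(pattern, corpus[target])
    others = [c for c in corpus if c != target]
    contrast_growth = sum(
        growth_term(sup_j, support(pattern, corpus[c])) for c in others
    ) / len(others)  # assumed normalizer: K - 1 other classes
    return sup_j * contrast_growth


# Made-up, already stemmed and abstracted messages per class.
corpus = {
    "Seeking": [["help", "victim", "_url_"], ["pleas", "help", "the", "victim"], ["donat", "_num_"]],
    "Offering": [["anyon", "know", "cloth"], ["want", "help", "victim", "now"]],
    "None": [["watch", "movi"], ["good", "morn"]],
}
strength = sparse_contrast_strength(["help", "victim"], "Seeking", corpus)
```

For this toy data the supports are 2/3, 1/2 and 0, so the two growth terms are 4/7 and 1, the contrast growth is 11/14, and the strength is 11/21. Ranking all mined patterns by this value and keeping the top K would correspond to step 5.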


Binarization Frameworks for Multiclass Classifier: 1 vs. All (OVA)

[Figure: CORPUS (set of short text documents, S) → FEATURES (knowledge-driven features; X^T, y) → binary models M_1, M_2, …, M_K, where model M_j takes (Xj^T, yj) and outputs P(cj)]

Subset Xj^T ⊂ S such that Xj^T includes all the labeled instances of class Cj for model M_j

(In 1 vs. 1 (OVO) framework: K*(K-1)/2 classifiers, one for each Cj, Ck pair)
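The two binarization frameworks can be sketched without any learner. The code below follows the standard textbook formulation of OVA and OVO, which may differ in detail from the setup in the talk; the function names and the example scores are hypothetical.

```python
from itertools import combinations

CLASSES = ["Seeking", "Offering", "None"]


def ova_targets(labels, classes=CLASSES):
    """One binary target vector per class: 1 for that class, 0 for all others."""
    return {c: [int(y == c) for y in labels] for c in classes}


def ova_predict(scores):
    """Pick the class whose binary model M_j reports the highest P(c_j)."""
    return max(scores, key=scores.get)


def ovo_pairs(classes=CLASSES):
    """Class pairs (Cj, Ck) that each get their own classifier under OVO."""
    return list(combinations(classes, 2))


targets = ova_targets(["Seeking", "None", "Offering", "None"])
predicted = ova_predict({"Seeking": 0.2, "Offering": 0.7, "None": 0.4})
pairs = ovo_pairs()
```

With K = 3 classes, OVA trains three binary models while OVO trains K*(K-1)/2 = 3 pairwise models, so the two frameworks only diverge in model count for larger K.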


Intent Classification Hybrid: Multiclass Classifier - Experiments

● Datasets

    ● Dataset-1: Hurricane Sandy, Oct 27 – Nov 7, 2012

    ● Dataset-2: Philippines Typhoon, Nov 7 – Nov 17, 2013

● Parameters

    ● Base Learner M_j: Random Forest, 10 trees with 100 features

    ● bi-, tri-gram for (T)

    ● K=100% & min. support 10% for CTK, 50% for CPK


Intent Classification: Multiclass Classifier – Results

Dataset-1 (Hurricane Sandy, 2012)

[Figure: Avg. F-1 Score (10-fold CV) by framework and feature set: (Declarative), (Social), (Contrast); T,DK,SK,CTK,CPK; T,CTK,CPK]

Gain 7%, p < 0.05


Intent Classification: Multiclass Classifier - Results

Dataset-2 (Philippines Typhoon, 2013)

[Figure: Avg. F-1 Score (10-fold CV) by framework and feature set: (Declarative), (Social), (Contrast); T,DK,SK,CTK,CPK; T,CTK,CPK]

Gain 6%, p < 0.05


Lessons

1. Top-down & bottom-up hybrid approach improves data representation for learning (complementary) intent classes

    - 50% of the top 1% discriminative features were knowledge-driven

2. Offline theoretic social conversation (SK) features (the, thanks, etc.), often removed for text mining, are valuable for intent mining.

3. The effect of knowledge types (SK vs. DK vs. CTK/CPK) varies across different types of real-world event datasets

    ➢ Culturally-sensitive psycholinguistics knowledge in future


Limitations & Future Work Directions

- Intent classes other than cooperation-assistive ones not considered

- Temporal drift of intent not considered

- Possibility of multilabel intent classes for instances

- Mining actor-level intent beyond document level


Conclusion

A hybrid approach of interplaying features from top-down representation (via patterns using prior knowledge of psycholinguistics, social behavior, & contrast mining) and bottom-up representation (via bag-of-tokens model) improves Intent Classification of short-text on social media.

Questions?

TWITTER: @hemant_pt

MAIL: [email protected]

Acknowledgement: Respective image sources, and Grant IIS-1111182