
Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls

Kai Hong, Christian G. Kohler, Mary E. March, Amber A. Parker, Ani Nenkova

University of Pennsylvania

Our Task

Identifying significant differences in lexical use from narratives by Patients vs Controls

Perform automatic classification

Identify a small subset of highly distinguishing features

How prediction accuracy varies with emotion type

Observations on lexical use

# occurrences in narratives

Subjects        Patient   Control
dog/dogs        28        1
money           41        4
sorry           0         7
relationship    0         9

Self reference ("I")
- Total occurrences: 1291 vs 626 (Patients vs Controls)
- Ratio after normalization by #words: 5.5% vs 4.3%

Dataset

201 stories from 39 subjects

- Patients: 120 stories, 23 patients
- Controls: 81 stories, 16 controls

Five emotions: Anger Sad Happy Disgust Fear

Talk about past experience (moderately, mildly, extremely) in their lives

30 – 90 seconds to finish the story

Length of Stories

• No big difference between Patients and Controls
• Some difference between emotions

Average # Words
Patients   192
Controls   181
P-value: 0.4254

Workflow

Narratives (Training) -> Lexical Feature Extraction -> Features

Features

Basic Features
- Few, easy to compute general features

Lexical Features and Repetitions
- Sparse and many

LIWC, Diction
- Based on dictionary, more general

Two-tailed T-test for significant features
- 169 out of 6057 significant

Basic Features

• Patients have more: sentences/document, words/document
• Controls have more: letters/word, words/sentence, tokens/vocabulary

Control > SCH        P-value
letters/word         0.003
words/sentence       0.001
tokens/vocabulary    0.153

SCH > Control        P-value
sentences/doc        0.038
words/doc            0.460

Repetitions

• Example: One day um , my um , my sister had brought my , her niece , her daughter , my sister had brought her daughter uh to watch my dog right .

• Repetition: the frequency with which a word reappears within a window of a given size (5).

• Repetition of words and punctuation
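The repetition feature described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `repetition_rates`, the choice to count a token as repeated when it occurs among the preceding 5 tokens, and the normalization by the total number of tokens are all assumptions, since the slides give only the window size.

```python
def repetition_rates(tokens, window=5):
    """Return the share of tokens that repeat a token seen in the previous `window` tokens.

    Word and punctuation repeats are counted separately (Rep-Word, Rep-Punc).
    Normalizing by the total token count is an assumption of this sketch.
    """
    total = len(tokens)
    if total == 0:
        return {"rep_word": 0.0, "rep_punc": 0.0}
    rep_word = 0
    rep_punc = 0
    for i, tok in enumerate(tokens):
        previous = tokens[max(0, i - window):i]
        if tok in previous:
            # A token with at least one letter or digit is treated as a word.
            if any(ch.isalnum() for ch in tok):
                rep_word += 1
            else:
                rep_punc += 1
    return {"rep_word": rep_word / total, "rep_punc": rep_punc / total}


example = "she was , she was a huge , she was very , very wonderful .".split()
rates = repetition_rates(example)
```

On this tokenized example, five word tokens and two commas repeat something from the preceding five tokens.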

Repetitions: Significance?

[Bar charts: Rep-Word and Rep-Punc rates for SCH vs NC]

Significant:
- Rep-word (P-value < 0.001)
- Rep-punctuation (P-value < 0.001)

Lexical Features

• Words - Frequency in narratives

• Repetition of specific words - Whether a given word is repeated (0/1)

• Example: She was , she was a huge , she was very , very wonderful.

More common in Schizophrenia

P-value         Features
< 1e-3          I, couldn’t, extremely, mildly, money
0.001 – 0.01    extreme, feeling, moderately, my, took, way, ?
0.01 – 0.05     ain’t, alone, at, aw, before, behind, became, care, chance, confused

• First-person pronouns: I, my
• money
• Feelings
• Some adverbs: mildly, moderately, extremely
• ?

More common in Schizophrenia

• Focus on family (grandfather, sister, son)
• dog/dogs

P-value        Features
0.01 – 0.05    December, dog, dogs, forty, friends, god, got, grandfather, guess, guy, hand, hanging, hearing, hundred, increased, looking, loved, mental, met, mild, moderate, myself, outside, paper, passed, piece, remember, sister, son, stand, stand, stop, story, take, taken, throwing, trouble, use, wake, wanna

More common in Control

P-value         Features
< 1e-3          comma
0.001 – 0.01    really, sorry, very
0.01 – 0.05     able, actually, are, basically, be, being, get’s, in, late, not, relationship, result, she’s, sleep, tell, their, there’s, weeks

• Third person plural: their
• sorry
• Some adjectives and adverbs: actually, basically, really, very

Significant Rep+Lexical Features

• Patients: more repetition of and, um, I, a, was.
• Control: more repetition of comma, very.

Schizophrenia   Status   P-value
Rep-and         SCH      < 0.001
Rep-um          SCH      0.008
Rep-I           SCH      < 0.001
Rep-a           SCH      0.011
Rep-was         SCH      0.018

Control         Status   P-value
Rep-,           NC       0.001
Rep-very        NC       0.007

LIWC

Method:
- Degree of usage of different categories of words
- Dictionary-based approach
- 69 dictionaries

Example: Cried - sadness, negative emotion, overall affect, verb, past-tense verb

Previous Use
- writing styles, physical and emotional pain (Tausczik and Pennebaker, 2010)
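A dictionary-based feature of this kind can be sketched as the share of tokens that fall into each category. The lexicon below is a toy, hypothetical stand-in (the real LIWC dictionaries are not reproduced here); only the categories for "cried" come from the slide, and `category_proportions` is an illustrative name.

```python
from collections import Counter

# Toy lexicon for illustration only; NOT the LIWC dictionaries.
TOY_LEXICON = {
    "cried": {"sadness", "negative emotion", "overall affect", "verb", "past-tense verb"},
    "i": {"I", "personal pronoun"},
    "very": {"adverb"},
}


def category_proportions(tokens, lexicon=TOY_LEXICON):
    """Return, for each category, the fraction of tokens that belong to it."""
    if not tokens:
        return {}
    counts = Counter()
    for tok in tokens:
        # A word may belong to several categories at once.
        for category in lexicon.get(tok.lower(), ()):
            counts[category] += 1
    return {category: n / len(tokens) for category, n in counts.items()}


features = category_proportions("I cried very very loudly".split())
```

Each story then yields one real-valued feature per category, which is the form a two-tailed T-test between groups needs.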

LIWC – Significant Features

Status: group in which the category is more common (SCH = Patients, NC = Control)

Category           #words   Example                    Status   P-value
I                  12       I, me, mine                SCH      < 0.001
personal pronoun   70       I, them, itself, you       SCH      0.029
insight            195      think, know, consider      SCH      0.026
adverb             69       very, really, quickly      NC       0.001
exclusive words    17       but, without, exclusive    NC       0.005
inhibition         111      block, constrain, stop     NC       0.019

Diction

Method:
- Also a dictionary-based approach
- 28 small categories, 5 master variables

Master variables (major categories)
- Realism, Optimism, Certainty, Activity, Commonality.

Example: Certainty = [Tenacity + Leveling + Collectives + Insistence] - [Numerical Terms + Ambivalence + Self Reference + Variety]

Diction – Significant Features

Category            Status   P-value
self reference      SCH      < 0.001
word-mean-length    NC       < 0.001
realism             NC       < 0.001
diversity           NC       0.005
familiarity         NC       0.019
cognitive terms     SCH      0.014
cooperation         NC       0.027
past                SCH      0.036
insistence          SCH      0.046
satisfaction        SCH      0.047

Feature Selection

Two-tailed T-test for real valued features
- Thresholds: 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.15

Signal to noise
- Using Challenge Learning Object Package (CLOP)
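Signal-to-noise selection can be sketched as ranking features by how far apart the two class means are relative to the class spreads. The slides do not give the formula CLOP applies, so the criterion below, |mean(SCH) - mean(NC)| / (std(SCH) + std(NC)), is an assumption, and `signal_to_noise` and `select_top_k` are hypothetical names written in plain Python rather than CLOP.

```python
from statistics import mean, stdev


def signal_to_noise(pos_values, neg_values):
    """Assumed criterion: |difference of class means| / (sum of class standard deviations).

    Each class needs at least two values for the sample standard deviation.
    """
    gap = abs(mean(pos_values) - mean(neg_values))
    spread = stdev(pos_values) + stdev(neg_values)
    if spread == 0:
        # No within-class variation: perfectly separating if the means differ.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def select_top_k(rows, labels, k):
    """Return the indices of the k features with the highest score.

    rows: list of feature vectors; labels: 1 for patient, 0 for control.
    """
    scored = []
    for j in range(len(rows[0])):
        pos = [row[j] for row, label in zip(rows, labels) if label == 1]
        neg = [row[j] for row, label in zip(rows, labels) if label == 0]
        scored.append((signal_to_noise(pos, neg), j))
    # Highest score first; ties broken by feature index for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:k]]


rows = [[1.0, 5.0], [1.2, 1.0], [0.9, 5.5], [3.0, 1.2], [3.1, 5.1], [2.9, 0.9]]
labels = [1, 1, 1, 0, 0, 0]
best = select_top_k(rows, labels, k=1)
```

In a leave-one-subject-out setting, the scores would be computed on the training stories only, so that the held-out subject does not influence which features are kept.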

Experimental Setup

Leave-one-subject-out (39 times)

Subject Status = Story Status

Voting: stories -> subjects

Evaluation metrics: Accuracy and F-measure
- by stories
- by subjects
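The setup above can be sketched as two small steps: hold out all stories of one subject at a time, then turn that subject's story-level predictions into one subject-level label by majority vote. The function names and the tie-breaking rule (a tie goes to `tie_label`) are assumptions; the slides do not say how ties were resolved.

```python
from collections import Counter


def leave_one_subject_out(stories):
    """Yield (held_out_subject, train, test) for each subject.

    stories: list of (subject_id, features, label) tuples.
    All stories of the held-out subject go to the test split.
    """
    for held_out in sorted({story[0] for story in stories}):
        train = [s for s in stories if s[0] != held_out]
        test = [s for s in stories if s[0] == held_out]
        yield held_out, train, test


def vote(story_predictions, tie_label="SCH"):
    """Majority vote over one subject's story predictions (tie rule is an assumption)."""
    if not story_predictions:
        raise ValueError("at least one story prediction is required")
    ranked = Counter(story_predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


stories = [
    ("p1", [0.1], "SCH"),
    ("p1", [0.2], "SCH"),
    ("c1", [0.3], "NC"),
    ("p2", [0.4], "SCH"),
]
splits = list(leave_one_subject_out(stories))
```

With 39 subjects this produces 39 train/test splits, matching the "39 times" in the setup.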

Workflow

[Diagram: Narratives (Training) -> Features -> Feature Selection -> Selected Features -> SVM-light; Narratives (Testing) -> SVM-light -> Voting -> Patients or Control?]

Performance by T-test

- Much higher than Random
- More noise when relaxing the threshold
- Better performance when tightening the threshold
- Best performance when threshold = 0.001

P-value   by Story   by Subject   # Features
0.15      59.0       58.9         450
0.1       61.7       64.1         341
0.05      62.7       64.1         169
0.01      57.7       65.4         44
0.005     64.2       71.6         32
0.001     65.7       75.6         18
0.0005    61.7       66.7         14

Performance changing with feature size

Signal to noise selection
- Best performance achieved when #Features = 25

Best Performance by Signal-to-noise

Achieved when #Features = 25
- Accuracy for story: 64.7%, accuracy for subject: 76.9%
- Patient Recall: 91.3%

By subject:

Class           P(%)   R(%)   F(%)
Schizophrenia   75.0   91.3   82.4
Control         81.8   56.3   66.7

General: Accuracy 76.9, Macro-F 74.6
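As a worked check of the subject-level figures, precision, recall and F-measure can be recomputed from confusion counts. The counts used here are inferred from the reported percentages and the 23 patients and 16 controls in the dataset (21 patients and 9 controls classified correctly); they are an assumption of this sketch and are not stated on the slides.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Inferred counts (assumption): 21 of 23 patients and 9 of 16 controls correct.
sch = precision_recall_f(tp=21, fp=7, fn=2)  # 7 controls labelled as patients
nc = precision_recall_f(tp=9, fp=2, fn=7)    # 2 patients labelled as controls
accuracy = (21 + 9) / 39
macro_f = (sch[2] + nc[2]) / 2
```

These counts reproduce the reported precision, recall, F-measure and accuracy to one decimal place. The macro-F computed from the unrounded F values comes out at about 74.5 rather than 74.6, which may reflect averaging the rounded F values.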

Status Prediction by Emotion

Accuracy (%) Signal-to-noise (25) T-test (0.05) T-test (0.001)

Happy 66.7 59.0 71.8

Disgust 63.4 61.0 51.2

Anger 61.0 70.7 70.7

Fear 60.0 55.0 67.5

Sad 72.5 60.0 67.5

Story 64.7 62.9 65.7

Patient 76.9 64.1 74.4

- Same training data
- Predict on different emotions
- Different approaches and settings

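Number of features on different thresholds

[Line chart: # Features (y-axis, 0 to 150) against p-value threshold (x-axis: 0.1, 0.05, 0.01), one line per emotion: Anger, Sad, Happy, Disgust, Fear]

- Features selected by T-test from one emotion
- More features -> more distinguishing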

Emotion related analysis

Higher value in each emotion

Emotion   Schizophrenia   Control
Happy     ambivalent      do
Disgust   dogs, health    communication
Anger     argued          praise
Fear      money           accident
Sad       satisfaction    working

Conclusion

Analyze distinguishing power of different features
- Basic features
- Lexical features, repetitions
- LIWC
- Diction

25 features: top performance (65%, 77%)
- p-value feature selection
- signal-to-noise feature selection

Different emotions have different distinguishing power
- anger, sad > happy > fear, disgust

Thank you !!!

Backup Slides

Related work

- LMs to detect language dominance and language impairment (Gabani et al., 2009)
- Speech related features for autism patients (Heeman et al., 2010)
- Syntax features for mild cognitive impairment (Roark et al., 2011)
- Syntactic complexity features for autism (Prud’hommeaux et al., 2011)
- Lexical features to recognize different personalities (Gill et al., 2009; Mairesse et al., 2006)
- Predict adherence to treatment and syndrome scale in Schizophrenia through conversations (Howes et al., 2012)

Language Model

Using unigrams, bigrams, trigrams

Using POS tags and lexical tokens

Simply using Laplace smoothing
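A Laplace-smoothed bigram model can be sketched as below. The slides do not say how the language models were turned into a classifier; training one model per group and picking the group whose model assigns the higher log-probability is an assumption, as are the names `BigramLM` and `classify` and the extra vocabulary slot reserved for unseen tokens.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, stories):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = set()
        for tokens in stories:
            padded = ["<s>"] + list(tokens) + ["</s>"]
            self.vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, tokens):
        """Log-probability of a token sequence under the smoothed model."""
        padded = ["<s>"] + list(tokens) + ["</s>"]
        size = len(self.vocab) + 1  # one extra slot for unseen tokens (assumption)
        return sum(
            math.log((self.bigram_counts[(prev, cur)] + 1) / (self.context_counts[prev] + size))
            for prev, cur in zip(padded, padded[1:])
        )


def classify(tokens, lm_sch, lm_nc):
    """Pick the group whose model scores the story higher (assumed decision rule)."""
    return "SCH" if lm_sch.log_prob(tokens) >= lm_nc.log_prob(tokens) else "NC"


lm_sch = BigramLM([["i", "couldn't", "sleep"], ["i", "took", "my", "money"]])
lm_nc = BigramLM([["she", "was", "really", "sorry"], ["they", "are", "very", "late"]])
label = classify(["i", "took", "money"], lm_sch, lm_nc)
```

The same class works for the POS variant by passing sequences of POS tags instead of words.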

LMs Performance

By Story (%)     Schizophrenia-F   Control-F   Accuracy
Random           54.4              44.6        50.0
2-gram           62.5              44.4        55.2
2-gram-pos       62.2              53.3        58.2

By Subject (%)   Schizophrenia-F   Control-F   Accuracy
Random           54.1              45.0        50.0
2-gram           62.5              50.0        58.9
2-gram-pos       62.2              54.5        61.5

Feature Normalization

Approach 1:
- Get the average from training data.

Approach 2:
- Get the maximum and minimum from training data.
- Projection into [0,1].
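Approach 2 can be sketched as min-max scaling fitted on the training stories only. The function names are illustrative; clipping test values that fall outside the training range, and mapping constant features to 0, are assumptions made here so that the output always stays in [0,1].

```python
def fit_min_max(train_rows):
    """Return a (minimum, maximum) pair per feature, computed on training data only."""
    return [(min(column), max(column)) for column in zip(*train_rows)]


def apply_min_max(row, ranges):
    """Project one feature vector into [0, 1] using the training ranges."""
    scaled = []
    for value, (low, high) in zip(row, ranges):
        if high == low:
            # Constant feature in training: no range to scale by (assumption: map to 0).
            scaled.append(0.0)
        else:
            # Clip so that unseen test values cannot leave [0, 1] (assumption).
            scaled.append(min(1.0, max(0.0, (value - low) / (high - low))))
    return scaled


ranges = fit_min_max([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]])
scaled_test = apply_min_max([5.0, 7.0], ranges)
```

Fitting the ranges on the training split keeps the held-out subject's stories from leaking into the normalization.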

Motivating Applications

Track patient status between visits

Early automatic diagnosis and screening

Best Performance by Signal-to-noise

Achieved when #Features = 25
- Accuracy for story: 64.7%, accuracy for subject: 76.9%
- Patient Recall: 91.3%

                          Schizophrenia          Control                General
Measurement               P(%)   R(%)   F(%)     P(%)   R(%)   F(%)     Accuracy   Macro-F
Story     Majority        59.7   100    74.8     0      0      0        59.7       37.4
Story     25-Features     68.7   75.0   71.7     57.1   49.4   52.9     64.7       62.3
Subject   Majority        59.0   100    74.2     0      0      0        59.0       37.1
Subject   25-Features     75.0   91.3   82.4     81.8   56.3   66.7     76.9       74.6
(All)     Average         59.7   50     54.4     40.5   50     44.6     50.0       49.5

Diction Definitions

• Cognitive terms: Modes of discovery, Mental challenges, Institutional learning practices, Intellection: intuitional, rationalistic, calculative.

• Realism: [Familiarity + Spatial Awareness + Temporal Awareness + Present Concern + Human Interest + Concreteness] - [Past Concern + Complexity]

• Diversity: Neutral: inconsistent, contrasting; Positive: exceptional, unique; Negative: Extremist

• Cooperation: work relations, interactions, associations, job-related tasks, personal involvement, etc. (sisterhood, friendship, teamwork, consolidate, relationship)

• Familiarity: consisting of a selected number of C.K. Ogden’s (1968) operation words which he calculates to be the most common words in the English language
