Page 1: Seelig Columbia Preliminary 062615

Positive Psychology and Movies

Avner Abrami, Malek Ben Sliman, Garud Iyengar, Luba Smolensky, Olivier Toubia

Preliminary report for Seelig group, June 2015


Overall Approach

- 24 Character Traits (CT) based on the Positive Psychology literature
- Extract CT from movie descriptions (Wikipedia)
  - Seeded LDA text mining algorithm
  - Each movie is represented by a set of 24 CT weights
- Cluster movies based on their CT weights (k-means clustering)
- Simplify CT taxonomy / find key relevant groups of CT (Non-negative Matrix Factorization)
- Link between CT and movie performance (regression)
  - Regress widest release, box office, marketability, playability on CT-related predictors
  - Control for production budget and genre


Identify Character Traits in Movies Using Seeded LDA

- Relevant Literatures
  - Media Psychology (identification, parasocial interactions)
  - Positive Psychology (Character Strengths)
  - Natural Language Processing (Latent Dirichlet Allocation, LDA)
- Seeded LDA
  - "Topics" associated with Character Traits
  - "Seed words" mapped to Character Traits
  - Word-Topic associations
  - Document-Topic associations
- Each movie is represented by a vector of 24 CT weights
  - Each weight can be thought of as the proportion of words in that movie description that are attached to that CT
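The "proportion of words attached to a CT" reading of a weight can be illustrated with a minimal sketch. It assumes each token in a description has already been attached to one trait or to none; the function name `ct_weights` and the toy assignments are hypothetical, and the slides do not say whether the reported weights are computed this way or taken from the estimated document-topic distribution.

```python
from collections import Counter

def ct_weights(token_traits, traits):
    """Return, for each trait, the share of tokens attached to that trait.

    token_traits holds one entry per token in a movie description: a trait
    name, or None when the token is attached to no trait.
    """
    n_tokens = len(token_traits)
    if n_tokens == 0:
        return {trait: 0.0 for trait in traits}
    counts = Counter(t for t in token_traits if t is not None)
    return {trait: counts[trait] / n_tokens for trait in traits}

# Hypothetical 8-token description (illustrative assignments, not real output).
traits = ["Love", "Bravery", "Hope"]
token_traits = ["Love", None, None, "Bravery", "Love", None, None, None]
weights = ct_weights(token_traits, traits)
print(weights)
```

With 24 traits instead of three, the same dictionary would be the 24-element weight vector for one movie.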


2,667 Seed Words Mapped to 24 Character Traits

| Character Trait | Representative Seed Words | Representative Movie | Average Weight (×1000) |
|---|---|---|---|
| Creativity | story, dance, mind | Saving Mr. Banks | 2.9 |
| Curiosity | discover, search, learn | My Life in Ruins | 3.7 |
| Open Mindedness | information, data, interest | Interstellar | 1.3 |
| Love of Learning | school, learns, book | Bad Teacher | 3.9 |
| Wisdom | age, advice, research | The Counselor | 1.4 |
| Bravery | fight, war, military | Last Ounce of Courage | 2.1 |
| Persistence | fight, insist, training | The Fighter | 1.5 |
| Integrity | united, true, fake | The Invention of Lying | 1.9 |
| Vitality | life, happy, energy | The Artist | 1.9 |
| Love | mother, love, wife | Whip It | 9.0 |
| Kindness | friend, heart, care | Unconditional | 1.6 |
| Social Intelligence | relationship, human, computer | An Education | 1.8 |
| Citizenship | group, family, together | The Joneses | 4.4 |
| Fairness | king, queen, jail | Robin Hood | 1.9 |
| Leadership | team, head, power | G.I. Joe: Retaliation | 4.2 |
| Forgiveness and Mercy | revenge, war, crime | Kung Fu Panda 2 | 1.5 |
| Humility and Modesty | grand, rich, proud | The Iron Lady | 0.7 |
| Prudence | secret, safe, accident | Tower Heist | 1.8 |
| Self Regulation | drug, drunk, alcoholic | The Sitter | 1.5 |
| Appreciation of Beauty and Excellence | nature, beautiful, art | The Monuments Men | 1.0 |
| Gratitude | gift, honor, respect | Avatar | 0.8 |
| Hope | dream, pursue, hope | Post Grad | 1.4 |
| Humor | game, play, fun | Wreck-It Ralph | 1.4 |
| Spirituality | church, lord, evil | Priest | 2.2 |


Each movie is represented by a vector of 24 CT weights - a few examples

| Character Trait | How to Train Your Dragon | The Fault in our Stars | Captain Phillips |
|---|---|---|---|
| Creativity | 0.00 | 17.63 | 2.56 |
| Curiosity | 5.70 | 5.91 | 5.94 |
| Open.mindedness | 0.00 | 0.00 | 1.77 |
| Love.of.learning | 6.58 | 16.73 | 0.00 |
| Perspective | 4.94 | 3.87 | 0.00 |
| Bravery | 3.07 | 0.00 | 1.77 |
| Persistence | 9.13 | 0.00 | 0.01 |
| Integrity | 0.00 | 1.07 | 0.01 |
| Vitality | 0.00 | 0.02 | 1.77 |
| Love | 0.00 | 15.15 | 3.23 |
| Kindness | 0.00 | 5.24 | 0.00 |
| Social.Intelligence | 4.93 | 0.00 | 0.00 |
| Citizenship | 0.00 | 3.12 | 11.52 |
| Fairness | 0.00 | 0.00 | 0.00 |
| Leadership | 0.00 | 0.01 | 13.19 |
| Forgiveness.and.Mercy | 2.47 | 0.00 | 3.51 |
| Humility.and.Modesty | 0.00 | 0.00 | 0.00 |
| Prudence | 2.45 | 0.00 | 3.53 |
| Self.Regulation | 0.00 | 1.12 | 0.00 |
| Appreciation.of.beauty.and.excellence | 0.00 | 1.08 | 0.00 |
| Gratitude | 0.00 | 3.17 | 0.00 |
| Hope | 0.00 | 0.00 | 1.75 |
| Humor | 0.00 | 3.22 | 0.00 |
| Spirituality | 0.00 | 0.00 | 0.00 |


Clustering Movies Based on CT Weights Using K-Means Clustering

- Group movies together based on their CT weights
- 11 "communities" (clusters) emerge
- Details in "kmeans clustering.pdf"
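As an illustration of the clustering step, here is a minimal sketch of standard k-means (Lloyd's algorithm) in plain Python. It is not the authors' code: the initialization, iteration count, and the four toy weight vectors are assumptions, and the report's analysis uses 24-dimensional vectors with 11 clusters.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two weight vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=50, seed=0):
    """Cluster points into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # start from k random points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each movie joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy data: 4 hypothetical movies described by 3 CT weights each.
movies = [[9.0, 0.0, 0.0], [8.0, 1.0, 0.0], [0.0, 0.0, 9.0], [0.0, 1.0, 8.0]]
labels, centers = kmeans(movies, k=2)
print(labels)
```

The first two toy movies end up in one cluster and the last two in the other; which cluster gets index 0 depends on the random start.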


Simplify CT Taxonomy / Find Key Relevant Groups of CT

- Simplest approach: only keep CTs with the largest variance across movies (i.e., that best describe how movies differ from one another)
- Non-negative Matrix Factorization (NMF) generalizes and extends this intuition
  - Identify latent factors that best explain variance between movies
  - Each factor is a composite score, i.e., weighted average of original CTs
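To make the factorization concrete, the sketch below approximates a non-negative movies-by-traits matrix V by W times H using the well-known multiplicative update rules for squared error. The slides do not state which NMF algorithm or software was used, so this is an assumed, illustrative variant; the matrix values are made up.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for rv, ra in zip(v, approx) for a, b in zip(rv, ra)) ** 0.5

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor v (movies x traits) into w (movies x rank) and h (rank x traits)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h

# Toy matrix: 4 hypothetical movies x 3 traits.
V = [[9.0, 1.0, 0.0], [8.0, 2.0, 0.0], [0.0, 1.0, 9.0], [1.0, 0.0, 8.0]]
W, H = nmf(V, rank=2)
print(frobenius_error(V, W, H))
```

Each row of H is one factor expressed as non-negative weights on the original traits, and each row of W gives a movie's scores on the factors.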


NMF based on CT weights

- Perform NMF on the CT weights estimated from text mining
- 4 factors emerge that are respectively dominated by Love, Citizenship, Love of Learning, and Leadership
- Details in "NMF on CT weights.pdf"


NMF Based on Raw Occurrence Data

- NMF may also be applied to the raw occurrence data, i.e., which seed words appear in which movie descriptions
- 3 factors emerge that respectively capture Love, Citizenship / Humanism, and Spirituality / Values
- Details in "NMF on raw occurrence data.pdf"


Linking CT with Movie Performance using Regressions: Dependent Variables

- Widest Release
- Log(Marketability), where Marketability = (Opening Box Office Adjusted for Inflation) / (Number of Theaters in the Opening Week)
- Log(Playability), where Playability = (Domestic Box Office) / (Highest Weekly Box Office)
- Log(Domestic Box Office)
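The dependent variables defined above can be computed as in the following sketch. The function and argument names are hypothetical, the input figures are invented, and the natural logarithm is an assumption (the slides write "Log" without stating a base).

```python
import math

def performance_measures(opening_box_office_adj, opening_theaters,
                         domestic_box_office, highest_weekly_box_office):
    """Return the three log-transformed performance measures for one movie."""
    marketability = opening_box_office_adj / opening_theaters
    playability = domestic_box_office / highest_weekly_box_office
    return {
        "log_marketability": math.log(marketability),  # natural log assumed
        "log_playability": math.log(playability),
        "log_dbo": math.log(domestic_box_office),
    }

# Invented figures for one hypothetical movie (in dollars and theaters).
measures = performance_measures(
    opening_box_office_adj=30_000_000,
    opening_theaters=3_000,
    domestic_box_office=90_000_000,
    highest_weekly_box_office=45_000_000,
)
print(measures)
```

In this example marketability is 10,000 dollars per opening theater and playability is 2, i.e., the full domestic run earned twice the best week.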


Independent Variables (for each movie)

- Non CT-related:
  - Widest Release
  - Production Budget
  - Genre. Documentary and Foreign were removed because there were too few cases; the 8 remaining genres → 7 dummy variables (effects coding)
- CT-related:
  - CT weights (24 continuous variables)
  - Community membership based on k-means clustering
    - 11 communities → 10 dummy variables (effects coding, i.e., average effect set to 0)
  - Weights on factors identified by Non-negative Matrix Factorization
  - Characteristics of the movie's CT weight distribution (next slide)
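Effects coding, as used above for genres and communities, can be sketched as follows: with k categories there are k - 1 columns, and the omitted category is coded -1 on every column so that the implied effects average to zero. The helper name and the placeholder label for the eighth genre are hypothetical; the slides list only seven genre coefficients and do not name the omitted genre.

```python
def effects_code(category, categories):
    """Effects-code one observation; the last entry of categories is the omitted one."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    *coded, omitted = categories
    if category == omitted:
        return [-1] * len(coded)  # omitted category: -1 on every column
    return [1 if category == c else 0 for c in coded]

# Seven genres named in the regression tables plus a placeholder for the
# omitted eighth genre, whose name the slides do not give.
genres = ["Action", "Comedy", "Drama", "Horror", "Kids", "Sci-Fi", "Thriller",
          "<omitted genre>"]
drama_row = effects_code("Drama", genres)
omitted_row = effects_code("<omitted genre>", genres)
print(drama_row)
print(omitted_row)
```

Under this coding each reported genre coefficient is a deviation from the average across genres rather than from a single baseline genre.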


Characteristics of the Movie’s CT Weight Distribution

- Average weight (across CT, for each movie): overall representation of CT in that movie
- Coefficient of variation (CV) of the weights (standard deviation / mean): dispersion of CT in that movie
- Average distribution of weights (across movies): prototypical distribution of 24 CT weights
- (Euclidean) distance between the CT distribution for each movie vs. the prototypical distribution
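The four quantities above can be computed as in this sketch. The function name and toy weights are hypothetical, and the use of the population standard deviation in the CV is an assumption, since the slides do not say which variant was used.

```python
import math
from statistics import mean, pstdev

def ct_distribution_features(weights_by_movie):
    """Return the prototypical distribution and per-movie summary features."""
    # Prototype: average weight of each trait across movies.
    prototype = [mean(col) for col in zip(*weights_by_movie)]
    features = []
    for w in weights_by_movie:
        avg = mean(w)
        features.append({
            "average": avg,
            # CV = standard deviation / mean (population SD assumed).
            "cv": pstdev(w) / avg if avg > 0 else float("nan"),
            # Euclidean distance to the prototypical distribution.
            "distance": math.dist(w, prototype),
        })
    return prototype, features

# Two hypothetical movies with 3 traits each (the report uses 24 traits).
toy_weights = [[2.0, 0.0, 4.0], [4.0, 0.0, 2.0]]
prototype, features = ct_distribution_features(toy_weights)
print(prototype)
print(features[0])
```

Here the prototype is [3, 0, 3], and both toy movies sit at the same distance from it because they mirror each other.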


Regression Results: 24 CT Weights

| Variable | log(Marketability) | log(Playability) | log(DBO) | Widest(1000s) |
|---|---|---|---|---|
| Creativity | 10.348 | 1.647 | 2.988 | -3.900 |
| Curiosity | -1.511 | 4.560 | 6.864 | -6.202 |
| Open mindedness | 10.401 | 4.560 | 20.544* | -14.764 |
| Love of learning | -2.715 | -1.033 | -6.796* | 6.979 |
| Perspective | -6.780 | -5.373 | -12.527 | -1.586 |
| Bravery | -13.171 | -5.213 | -22.081** | -21.882** |
| Persistence | -1.289 | 9.016* | 9.110 | 30.315** |
| Integrity | -26.911** | -9.732** | -19.572** | -14.415 |
| Vitality | 18.020 | 6.587 | 0.817 | -25.048** |
| Love | 8.961** | 0.074 | 2.999 | -13.436** |
| Kindness | -31.516** | 0.894 | -11.812 | 4.730 |
| Social Intelligence | 15.065 | 1.233 | -2.162 | -16.723* |
| Citizenship | 11.201 | -0.669 | -0.497 | -13.142** |
| Fairness | 2.152 | 5.484** | 1.947 | -10.690* |
| Leadership | -8.638 | -2.951 | -9.489** | -0.154 |
| Forgiveness and Mercy | -19.244 | -12.849** | -13.363 | -4.749 |
| Humility and Modesty | 18.557 | -5.197 | -11.148 | -45.644** |
| Prudence | -3.439 | -0.966 | -13.149 | -7.420 |
| Self Regulation | 7.043 | 1.367 | 4.514 | 7.770 |
| Appreciation of beauty and excellence | -4.603 | 14.917** | 3.519 | -45.488** |
| Gratitude | -11.291 | -1.574 | 1.907 | 4.608 |
| Hope | -19.454 | -0.243 | -9.848 | 4.281 |
| Humor | -5.164 | -5.987* | -8.221 | -13.297 |
| Spirituality | -4.564 | -5.326** | -2.016 | -8.201 |
| widest release(1000s) | -0.202** | -0.182** | 1.004** | N/A |
| production budget($M) | 0.009** | 0.002** | 0.004** | 0.013** |
| Action | -0.024 | -0.052* | -0.053 | 0.122 |
| Comedy | 0.126 | 0.050* | 0.200** | -0.064 |
| Drama | 0.249** | 0.091** | 0.301** | -0.635** |
| Horror | 0.060 | -0.188** | -0.133* | 0.401** |
| Kids | -0.207** | 0.167** | -0.167** | 0.344** |
| Sci-Fi | -0.140 | -0.037 | -0.126* | 0.030 |
| Thriller | -0.011 | -0.012 | -0.053 | -0.256** |
| Number of observations | 866 | 866 | 866 | 866 |
| Number of parameters | 34 | 34 | 34 | 33 |
| R² | 0.142 | 0.314 | 0.816 | 0.537 |

**: p < 0.05. *: p < 0.10.


Regression Results: Community Membership (Based on K-means Clustering)

| Variable | log(Marketability) | log(Playability) | log(DBO) | Widest(1000s) |
|---|---|---|---|---|
| Leadership community | 0.096 | 0.057 | -0.005 | 0.060 |
| Citizenship community | 0.046 | 0.039 | 0.020 | -0.033 |
| Creativity community | 0.145 | -0.008 | 0.230** | -0.048 |
| Curiosity community | 0.059 | 0.022 | -0.004 | 0.161 |
| Love of Learning community | -0.031 | 0.076* | -0.055 | 0.265** |
| Fairness community | -0.285* | -0.032 | -0.120 | -0.199 |
| Spirituality community | -0.014 | -0.034 | 0.044 | -0.053 |
| Bravery community | -0.113 | -0.063 | -0.187** | -0.229** |
| Love community | 0.117 | 0.048* | 0.113** | -0.038 |
| Love-Learning community | -0.075 | -0.012 | -0.073 | 0.103 |
| widest release(1000s) | -0.219** | -0.186** | 1.013** | N/A |
| production budget($M) | 0.009** | 0.002** | 0.004** | 0.014** |
| Action | -0.066 | -0.064** | -0.069 | 0.180** |
| Comedy | 0.140* | 0.052* | 0.189** | -0.145** |
| Drama | 0.246** | 0.078** | 0.258** | -0.748** |
| Horror | 0.042 | -0.190** | -0.093 | 0.551** |
| Kids | -0.193* | 0.176** | -0.158** | 0.374** |
| Sci-Fi | -0.149 | -0.025 | -0.124* | 0.053 |
| Thriller | -0.049 | -0.026 | -0.056 | -0.194** |
| Number of observations | 866 | 866 | 866 | 866 |
| Number of parameters | 20 | 20 | 20 | 19 |
| R² | 0.113 | 0.286 | 0.809 | 0.507 |

**: p < 0.05. *: p < 0.10.


Regression Results: NMF Factor Scores Based on CT Weights

| Variable | log(Marketability) | log(Playability) | log(DBO) | Widest(1000s) |
|---|---|---|---|---|
| Intercept | 8.864** | 1.221** | 18.114** | 1.575** |
| NMF Factor 1 ("Love") | 2.775** | 0.206 | 1.133 | -0.473 |
| NMF Factor 2 ("Citizenship") | 1.974 | -0.063 | 0.042 | 0.943 |
| NMF Factor 3 ("Love of Learning") | 0.853 | 0.168 | -0.466 | 1.948** |
| NMF Factor 4 ("Leadership") | -0.206 | -0.262 | -1.065* | 1.328 |
| widest release(1000s) | -0.209** | -0.182** | 1.020** | N/A |
| production budget($M) | 0.009** | 0.002** | 0.004** | 0.013** |
| Action | -0.024 | -0.052* | -0.045 | 0.126 |
| Comedy | 0.131* | 0.048* | 0.182** | -0.098 |
| Drama | 0.237** | 0.080** | 0.251** | -0.741** |
| Horror | 0.025 | -0.201** | -0.100 | 0.560** |
| Kids | -0.194* | 0.175** | -0.163** | 0.368** |
| Sci-Fi | -0.116 | -0.019 | -0.103 | 0.025 |
| Thriller | -0.009 | -0.015 | -0.023 | -0.226** |
| Number of observations | 866 | 866 | 866 | 866 |
| Number of parameters | 14 | 14 | 14 | 13 |
| R² | 0.115 | 0.278 | 0.808 | 0.504 |

**: p < 0.05. *: p < 0.10.


Regression Results: NMF Factor Scores Based on Raw Occurrence Data

| Variable | log(Marketability) | log(Playability) | log(DBO) | Widest(1000s) |
|---|---|---|---|---|
| NMF Factor 1 ("Love") | 411.911 | 316.338** | 180.352 | -314.930 |
| NMF Factor 2 ("Citizenship+Humanism") | 272.817 | 188.142** | 112.699 | -256.833 |
| NMF Factor 3 ("Spirituality+Values") | 228.811 | 178.638** | 165.292 | -223.529 |
| widest release(1000s) | -0.215** | -0.180** | 1.013** | N/A |
| production budget($M) | 0.008** | 0.002** | 0.004** | 0.013** |
| Action | -0.056 | -0.046 | -0.081 | 0.156** |
| Comedy | 0.132* | 0.027 | 0.207** | -0.125* |
| Drama | 0.220** | 0.071** | 0.241** | -0.732** |
| Horror | 0.069 | -0.181** | -0.091 | 0.537** |
| Kids | -0.209** | 0.176** | -0.164** | 0.382** |
| Sci-Fi | -0.151 | -0.014 | -0.125* | 0.070 |
| Thriller | -0.032 | -0.013 | -0.061 | -0.209** |
| Number of observations | 866 | 866 | 866 | 866 |
| Number of parameters | 13 | 13 | 13 | 12 |
| R² | 0.107 | 0.287 | 0.805 | 0.499 |

**: p < 0.05. *: p < 0.10.


Regression Results: Characteristics of CT Weight Distribution

Some movies receive less distribution but perform better (opportunity?):
- Movies with stronger average CT weights
- Movies with more dispersed CT weights
- Movies with more prototypical distribution of CT weights

| Variable | log(Marketability) | log(Playability) | log(DBO) | Widest(1000s) |
|---|---|---|---|---|
| average CT weight | 227.935** | 48.767 | 143.919** | -276.268** |
| CV of CT weights | 0.275** | -0.008 | 0.226** | -0.397** |
| distance to prototypical distribution of CT weights | -30.607** | -6.622** | -26.208** | 14.881* |
| widest release(1000s) | -0.216** | -0.187** | 1.005** | N/A |
| production budget($M) | 0.008** | 0.002** | 0.004** | 0.013** |
| Action | -0.053 | -0.067** | -0.077 | 0.104 |
| Comedy | 0.126* | 0.048* | 0.197** | -0.068 |
| Drama | 0.232** | 0.083** | 0.270** | -0.638** |
| Horror | 0.041 | -0.196** | -0.111 | 0.472** |
| Kids | -0.187* | 0.173** | -0.158** | 0.322** |
| Sci-Fi | -0.130 | -0.028 | -0.121* | 0.004 |
| Thriller | -0.022 | -0.022 | -0.044 | -0.265** |
| Number of observations | 866 | 866 | 866 | 866 |
| Number of parameters | 13 | 13 | 13 | 12 |
| R² | 0.118 | 0.287 | 0.811 | 0.519 |

**: p < 0.05. *: p < 0.10.


Conclusions Based on Current Analysis

- Character Traits appear to be a relevant way to describe movies above and beyond genres
  - Driven by Media Psychology and Positive Psychology literatures
  - Intuitive and easy to interpret
  - Linked to movie performance (stronger average character traits with more dispersion and distribution similar to "prototype" → higher performance)
- Identification of gaps in the market? (widest release not aligned with performance)
- Movies may be clustered based on their CT weights
- The CT taxonomy may be simplified / a smaller set of relevant factors may be identified


Future Steps

- Interpretation of current results
- Other types of regressions that allow looking at high-order interactions and non-linear effects (e.g., random forests, SVM)
- Use scripts as text input (rather than Wikipedia descriptions)
- Unsupervised topic modeling


Appendix


Regular LDA: Data Generating Process

- Each token (word) in each document is independently assigned to a topic according to a multinomial distribution (document-topic)

  $$\mathrm{Prob}(z_i^d) \sim \mathrm{Multinomial}(\theta_d) \tag{1}$$

- Token is assigned to a particular word according to another multinomial distribution (topic-word)

  $$\mathrm{Prob}(w_i^d \mid z_i^d = k) \sim \mathrm{Multinomial}(\phi_k) \tag{2}$$

- $\{z_i^d\}$, $\{\phi_k\}$, $\{\theta_d\}$ estimated using Gibbs sampling (MCMC)
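The data generating process in (1) and (2) can be simulated directly. The sketch below draws a topic for each token from the document-topic distribution and then a word from that topic's word distribution; the function name, vocabulary, and probabilities are invented for illustration.

```python
import random

def generate_document(theta_d, phi, vocabulary, n_tokens, seed=0):
    """Simulate one document as a list of (topic, word) pairs.

    theta_d: topic probabilities for this document, as in (1).
    phi: phi[k] holds word probabilities for topic k, as in (2).
    """
    rng = random.Random(seed)
    topics = list(range(len(theta_d)))
    tokens = []
    for _ in range(n_tokens):
        z = rng.choices(topics, weights=theta_d)[0]     # topic draw, eq. (1)
        w = rng.choices(vocabulary, weights=phi[z])[0]  # word draw, eq. (2)
        tokens.append((z, w))
    return tokens

# Invented example: 2 topics over a 3-word vocabulary.
vocabulary = ["love", "fight", "the"]
phi = [[0.8, 0.0, 0.2],   # topic 0 never produces "fight"
       [0.0, 0.7, 0.3]]   # topic 1 never produces "love"
document = generate_document([0.6, 0.4], phi, vocabulary, n_tokens=10)
print(document)
```

Estimation runs this logic in reverse: given only the observed words, the topic assignments and both sets of probabilities are inferred.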


Seeded LDA

- Constrain some topics to have zero weights on some words in the dictionary
  - $l_k$: set of words on which topic $k$ is allowed to have positive weights
- We allow $n$ topics per character strength (e.g., $n$ versions of Love)
  - Number of topics $K = 24n + 1$
  - Topic $K$ = baseline topic
    - may have positive weights on all words
    - controls for the baseline occurrence of words
- Dictionary = set of seed words + "all other" word
  - "all other" word controls for number of words in document (positive weight only on baseline topic)
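The topic layout described above can be sketched as a small data structure: n copies of each trait topic restricted to that trait's seed words, plus one baseline topic allowed on every dictionary entry. The function name, the `<all other>` token, and the two-trait seed lists (taken from the representative seed words earlier in the deck) are illustrative assumptions, not the authors' code.

```python
ALL_OTHER = "<all other>"  # stand-in for every non-seed word

def build_allowed_words(seed_words_by_trait, n):
    """Return (dictionary, topics), where topics is a list of (label, allowed set)."""
    seeds = {w for words in seed_words_by_trait.values() for w in words}
    dictionary = sorted(seeds) + [ALL_OTHER]
    topics = []
    for trait, words in seed_words_by_trait.items():
        for version in range(1, n + 1):
            # Each version of a trait topic may only weight that trait's seeds.
            topics.append((f"{trait} #{version}", set(words)))
    # Baseline topic: positive weight allowed on all words, incl. "all other".
    topics.append(("baseline", set(dictionary)))
    return dictionary, topics

# Two traits only, for brevity; a seed word may belong to several traits.
seed_words = {"Bravery": ["fight", "war", "military"],
              "Persistence": ["fight", "insist", "training"]}
dictionary, topics = build_allowed_words(seed_words, n=3)
print(len(topics))  # 2 traits * 3 versions + 1 baseline
```

With all 24 traits the same construction gives K = 24n + 1 topics, matching the count on the slide.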


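Seeded LDA: Estimation

- Priors:

  $$\theta_d \sim \mathrm{Dirichlet}(\alpha_1 \mathbf{1}_K) \tag{3}$$

  $$\phi_k \sim \mathrm{Dirichlet}(\alpha_2 l_k) \tag{4}$$

  Here $l_k$ is used as an indicator vector over the $W$ dictionary words: $l_k^w = 1$ if topic $k$ is allowed positive weight on word $w$, and $0$ otherwise.

- Posteriors:

  $$\mathrm{Prob}(z_i^d = k \mid w_i^d, \{\phi_k\}, \theta_d) = \frac{\phi_k(w_i^d)\,\theta_d(k)}{\sum_{k'} \phi_{k'}(w_i^d)\,\theta_d(k')} \tag{5}$$

  $$\mathrm{Prob}(\phi_k \mid \{z_i^d\}, \{w_i^d\}) = \mathrm{Dirichlet}\Big(\alpha_2 l_k^1 + \sum_{(i,d):\, z_i^d = k} \mathbf{1}(w_i^d = 1),\ \ldots,\ \alpha_2 l_k^W + \sum_{(i,d):\, z_i^d = k} \mathbf{1}(w_i^d = W)\Big) \tag{6}$$

  $$\mathrm{Prob}(\theta_d \mid \{z_i^d\}) = \mathrm{Dirichlet}\Big(\alpha_1 + \sum_i \mathbf{1}(z_i^d = 1),\ \ldots,\ \alpha_1 + \sum_i \mathbf{1}(z_i^d = K)\Big) \tag{7}$$

- Markov Chain Monte Carlo (MCMC) with $n = 1$ to $4$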
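One Gibbs sweep over the three posteriors (5), (6), and (7) might look like the sketch below. It is a toy illustration under stated assumptions rather than the authors' implementation: words are integer indices, `allowed[k][w]` plays the role of the indicator for topic k and word w, Dirichlet draws are made by normalizing gamma draws (with disallowed words fixed at zero), and the two-topic corpus, starting values, and hyperparameters are invented.

```python
import random

def sample_dirichlet(alphas, rng):
    """Dirichlet draw via normalized gamma draws; zero-alpha entries stay 0."""
    draws = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def gibbs_sweep(docs, z, phi, theta, allowed, alpha1, alpha2, rng):
    """Update z, phi and theta in place with one pass over (5), (6), (7)."""
    n_topics, n_words = len(phi), len(phi[0])
    topics = list(range(n_topics))
    # (5): resample each token's topic given its word, phi and theta.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            probs = [phi[k][w] * theta[d][k] for k in topics]
            z[d][i] = rng.choices(topics, weights=probs)[0]
    # (6): resample each topic's word distribution from its Dirichlet posterior.
    for k in topics:
        counts = [0] * n_words
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                if z[d][i] == k:
                    counts[w] += 1
        phi[k] = sample_dirichlet(
            [alpha2 * allowed[k][w] + counts[w] for w in range(n_words)], rng)
    # (7): resample each document's topic distribution.
    for d in range(len(docs)):
        theta[d] = sample_dirichlet(
            [alpha1 + z[d].count(k) for k in topics], rng)

# Toy setup: word 0 is a seed word, word 1 another word, word 2 "all other".
# Topic 0 is a trait topic allowed only on word 0; topic 1 is the baseline.
allowed = [[1, 0, 0], [1, 1, 1]]
docs = [[0, 0, 2, 2, 1], [2, 2, 2, 1, 0]]
z = [[1] * len(doc) for doc in docs]           # start every token in baseline
phi = [[1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3]]
theta = [[0.5, 0.5], [0.5, 0.5]]
rng = random.Random(0)
for _ in range(20):
    gibbs_sweep(docs, z, phi, theta, allowed, alpha1=1.0, alpha2=1.0, rng=rng)
print(phi[0], theta)
```

Because a token can only move to a topic with positive weight on its word, the zero constraints on the trait topics are preserved across sweeps.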


Descriptive Statistics

| Statistic | Unit of analysis | Mean | Standard deviation | Min | Max |
|---|---|---|---|---|---|
| Number of words (including "all others") | Movie descriptions (N=866) | 702.96 | 296.91 | 22 | 2621 |
| Number of occurrences of seed words | Movie descriptions (N=866) | 38.20 | 1.97 | 0 | 122 |
| Number of unique seed words | Movie descriptions (N=866) | 27.19 | 10.91 | 0 | 77 |
| Number of character traits with at least one seed word occurrence | Movie descriptions (N=866) | 15.65 | 4.06 | 0 | 24 |
| Total number of occurrences across movie descriptions | Seed words (N=2667) | 12.40 | 41.94 | 0 | 730 |
| Proportion of movie descriptions with at least one occurrence | Seed words (N=2667) | 0.01 | 0.03 | 0 | 0.42 |
| Total number of occurrences across movie descriptions | Seed words with at least one occurrence (N=1595) | 20.73 | 52.63 | 1 | 730 |
| Proportion of movie descriptions with at least one occurrence | Seed words with at least one occurrence (N=1595) | 0.02 | 0.04 | 0.001 | 0.42 |
| Average number of seed word occurrences per movie description | Character traits (N=24) | 2.17 | 1.65 | 0.51 | 7.99 |
| Proportion of movie descriptions with at least one seed word occurrence | Character traits (N=24) | 0.65 | 0.17 | 0.33 | 0.95 |
