
Post on 18-Dec-2021


DON’T REMOVE MY STOP WORDS: IDENTIFYING PERSONALITY TRAITS FROM QUORA ANSWERS

ASHUTOSH BAHETI, 12CS10012

RAHUL GURNANI, 12CS10039

DHRUV JAIN, 12CS30043

NISHKARSH SHASTRI, 12CS10034

SABYASACHEE BARAUH, 12CS30029


OBJECTIVE

● Identify the personality of Quora users with respect to the Big Five personality traits through linguistic-feature-based analysis of their answers

● Openness To Experience

● Conscientiousness

● Extraversion

● Agreeableness

● Neuroticism


RELATED WORK

Tausczik, Yla R., and James W. Pennebaker. "The psychological meaning of words: LIWC and computerized text analysis methods."

Mairesse, François, et al. "Using linguistic cues for the automatic recognition of personality in conversation and text."

Celli, Fabio, Fabio Pianesi, David Stillwell, and Michal Kosinski. "Workshop on Computational Personality Recognition."


Project Timeline


Classifying essay data using LIWC features

Identifying the linguistic features for the Big Five personality traits

Extraction of textual features from the essays

Classifying based on new features and LIWC

Surveying Quora users to get a labelled dataset

Crawling the answers of the surveyed users

Using the Quora dump to expand LIWC

Training the model on the labelled Quora dataset

Calculating the accuracy of the trained model


Classification of Essay Data

Straightforward ML approach

Labelled essays with binary values for each personality trait

Sanitized the data present in the essays

Created a trie structure for LIWC prefix matching

Extracted features based on the LIWC word count for each category

Applied SVM to the data using WEKA

Accuracy of the model was found to be 53%
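
The trie-based LIWC matching and per-category counting described on this slide could look roughly like the sketch below. It assumes LIWC-style entries in which a trailing `*` marks a prefix pattern; the dictionary entries, category names, and the function `liwc_features` are hypothetical and only illustrate the idea, not the authors' actual script.

```python
import re
from collections import Counter


class PrefixTrie:
    """Trie supporting exact entries ("happy") and prefix entries ("happi*")."""

    def __init__(self):
        self.children = {}
        self.exact = set()   # categories for words that end exactly at this node
        self.prefix = set()  # categories for every word that passes through this node

    def insert(self, pattern, category):
        is_prefix = pattern.endswith("*")
        node = self
        for ch in pattern.rstrip("*"):
            node = node.children.setdefault(ch, PrefixTrie())
        (node.prefix if is_prefix else node.exact).add(category)

    def lookup(self, word):
        """Return every category whose pattern matches the word."""
        found = set()
        node = self
        for ch in word:
            found |= node.prefix
            node = node.children.get(ch)
            if node is None:
                return found
        return found | node.prefix | node.exact


def liwc_features(text, trie, categories):
    """Count, per category, how many tokens of the text match the dictionary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        counts.update(trie.lookup(token))
    return [counts[c] for c in categories]


# Hypothetical dictionary entries, for illustration only.
TOY_DICTIONARY = {
    "happi*": "posemo",
    "happy": "posemo",
    "sad*": "negemo",
    "i": "pronoun_i",
    "my": "pronoun_i",
}
CATEGORIES = ["posemo", "negemo", "pronoun_i"]

trie = PrefixTrie()
for pattern, category in TOY_DICTIONARY.items():
    trie.insert(pattern, category)

features = liwc_features("I am happy and my happiness is not sadness", trie, CATEGORIES)
```

The resulting count vector is what would then be handed to the classifier; whether the counts were normalized by essay length is not stated on the slide.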


Features Identified for Extraversion

Word Variance (repetitivity)

Type/Token Ratio

Formality Measure and Informality Measure

F-Measure = (noun freq + adjective freq + preposition freq + article freq - pronoun freq - verb freq - adverb freq - interjection freq + 100) / 2

I-Measure = (wrong-typed words freq + interjections freq + emoticon freq) * 100

Positivity of Text and Negativity Of Text

Rich Vocabulary, use of difficult words

Concrete and Frequent Words

Use of more social words
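
A minimal sketch of the Type/Token Ratio, F-Measure and I-Measure from this slide, assuming the text has already been POS-tagged. The tag names (`NOUN`, `ADJ`, `ADP`, `DET`, `PRON`, `VERB`, `ADV`, `INTJ`) are placeholders for whatever tagset is in use, `DET` only approximates "article", and the frequencies are taken as percentages of all tokens, which is an assumption the slide does not spell out.

```python
from collections import Counter

# Assumed tag names; adapt to the tagset actually produced by the tagger.
PLUS_TAGS = ("NOUN", "ADJ", "ADP", "DET")
MINUS_TAGS = ("PRON", "VERB", "ADV", "INTJ")


def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def f_measure(tagged):
    """Formality score from (word, tag) pairs, following the slide's formula."""
    n = len(tagged)
    if n == 0:
        return 50.0  # all frequencies are zero, so the formula reduces to 100 / 2
    freq = Counter(tag for _, tag in tagged)
    plus = sum(100.0 * freq[tag] / n for tag in PLUS_TAGS)
    minus = sum(100.0 * freq[tag] / n for tag in MINUS_TAGS)
    return (plus - minus + 100.0) / 2.0


def i_measure(n_misspelled, n_interjections, n_emoticons, n_tokens):
    """Informality score: share of informal tokens, scaled by 100."""
    if n_tokens == 0:
        return 0.0
    return 100.0 * (n_misspelled + n_interjections + n_emoticons) / n_tokens


tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]
formality = f_measure(tagged)
ttr = type_token_ratio([word for word, _ in tagged])
informality = i_measure(1, 0, 1, 10)
```

How misspelled words and emoticons are detected is left open here, since the slide does not describe it.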


Features Identified for Openness

Preference for longer words

Words expressing tentativeness

Avoidance of 1st person singular pronouns

Present tense forms

The avoidance of past tense indicates


Features Identified for Conscientiousness

Avoid negations

Avoid words reflecting discrepancies (e.g., should and would)

2nd person pronouns

Filler words (in males and not in females): more useful in speech analysis


Features Identified for Agreeableness

More positive emotions, fewer negative emotions

Few articles

Negative and Positive emotion words

Leisure activity


Features Identified for Neuroticism

1st person singular pronouns

Noun Negative

Multiple punctuations

Fewer references to occupation


Extraction of Features

Python scripts using NLTK extract the features listed in the previous five slides

Speech-based features were not extracted
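
As an illustration of this kind of extraction, the sketch below computes a few of the surface features named on the trait slides (word length, 1st person singular pronouns, 2nd person pronouns, negations, multiple punctuation). It uses only the standard `re` module rather than NLTK so that it runs without downloads; the word lists and the function name `surface_features` are assumptions made for the example.

```python
import re

# Small illustrative word lists; a real script would use fuller lists.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
NEGATIONS = {"no", "not", "never", "none", "nothing", "cannot"}


def surface_features(text):
    """Return a dict of simple per-text linguistic features."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    n = max(len(words), 1)  # avoid division by zero on empty text

    def rate(vocabulary):
        return sum(w in vocabulary for w in words) / n

    negations = sum(w in NEGATIONS or w.endswith("n't") for w in words)
    return {
        "mean_word_length": sum(len(w) for w in words) / n,
        "first_person_singular_rate": rate(FIRST_PERSON_SINGULAR),
        "second_person_rate": rate(SECOND_PERSON),
        "negation_rate": negations / n,
        # Runs of two or more sentence-final marks, e.g. "!!" or "?!"
        "multiple_punctuation": len(re.findall(r"[!?.]{2,}", text)),
    }


example = surface_features("I never said you were wrong!! Why not??")
```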


NLP-Technique-Based Features: Discourse Parsing

Ran discourse parsing on all the essay data

Created RST-style discourse trees

Extracted the main nucleus text from the data

Extracted the relation counts from the RST trees

Normalized the relation counts

Constructed the feature vector to include the discourse relation counts
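
The counting and normalization steps might be sketched as follows, taking the relation labels read off an RST tree as input (the parser itself is outside this example). The relation inventory shown is a hypothetical subset, and dividing by the total number of relations in the essay is one plausible normalization, not necessarily the one the authors used.

```python
from collections import Counter

# Hypothetical subset of RST relation labels, fixed so every essay yields
# a vector with the same dimensions.
RELATIONS = ["Elaboration", "Contrast", "Cause", "Attribution", "Condition"]


def relation_features(relation_labels, inventory=RELATIONS):
    """Turn the relation labels of one essay into normalized counts."""
    counts = Counter(relation_labels)
    total = sum(counts[r] for r in inventory)
    if total == 0:
        return [0.0] * len(inventory)
    return [counts[r] / total for r in inventory]


def extend_feature_vector(base_features, relation_labels):
    """Append the discourse features to an existing feature vector."""
    return list(base_features) + relation_features(relation_labels)


labels = ["Elaboration", "Elaboration", "Contrast", "Cause"]
discourse_vector = relation_features(labels)
full_vector = extend_feature_vector([2, 1, 2], labels)
```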


Expansion of LIWC Word Set

Seeded LDA and Word2Vec Methods


Expansion of LIWC: Seeded LDA

Seeded LDA treats each document as a mixture of topics.

It treats topics as a probability distribution of words.

We can give an asymmetric prior probability to a word-topic pair to seed the topic with the given word.

We used the gensim package and its eta parameter to implement seeded LDA. However, it did not give better results, due to overfitting.
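
One way to build such an asymmetric prior is sketched below: a topics-by-vocabulary matrix in which seed words receive a larger prior weight in their own topic. The function name and the `base` and `boost` values are assumptions chosen for illustration; the slide does not give the values that were used.

```python
def build_seeded_eta(vocab, seeds, base=0.01, boost=1.0):
    """Build a (num_topics x vocab_size) prior with boosted seed words.

    vocab: list of words, in the column order of the topic-word matrix.
    seeds: list of seed-word lists, one list per topic.
    """
    index = {word: i for i, word in enumerate(vocab)}
    eta = [[base] * len(vocab) for _ in seeds]
    for topic, words in enumerate(seeds):
        for word in words:
            if word in index:  # seed words missing from the vocabulary are skipped
                eta[topic][index[word]] = boost
    return eta


vocab = ["happy", "sad", "work"]
seeds = [["happy"], ["sad", "angry"]]
eta = build_seeded_eta(vocab, seeds)
```

With gensim, a matrix like this (converted to a NumPy array, with columns ordered by the dictionary's token ids) can be passed as the `eta` argument of `LdaModel` together with `num_topics=len(seeds)`.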


Expansion of LIWC: Word2Vec

Applied Word2Vec modelling to the Quora dump

Found the most similar words for each word present under each tag

Compared the similarity with 1 Billion Wiki Text

Added the most similar words thus found to the new LIWC dictionary

Trained the models on new LIWC dictionary
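
The expansion step can be sketched with plain cosine similarity over word vectors. The toy vectors, the `topn` cut-off and the `min_sim` threshold below are all assumptions for illustration; with a trained gensim model, `model.wv.most_similar(word, topn=k)` would take the place of the neighbour search.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_category(seed_words, vectors, topn=2, min_sim=0.8):
    """Add to a category the nearest neighbours of each of its seed words."""
    expanded = set(seed_words)
    for seed in seed_words:
        if seed not in vectors:
            continue
        scored = sorted(
            ((cosine(vectors[seed], vec), word)
             for word, vec in vectors.items() if word != seed),
            reverse=True,
        )
        expanded.update(word for sim, word in scored[:topn] if sim >= min_sim)
    return expanded


# Toy 2-dimensional vectors standing in for trained embeddings.
toy_vectors = {
    "happy": [1.0, 0.1],
    "joyful": [0.9, 0.2],
    "table": [0.0, 1.0],
}
posemo = expand_category(["happy"], toy_vectors)
```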


Expansion of Posemo, Negemo, Funct-words

Added more positive words and negative words [1]

Added more function words [2]

1. Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."

2. Leah Gilner and Franc Morales at Sequence Publishing (http://www.sequencepublishing.com) for listing English function words


User Survey


Survey Method

Used a 10-question questionnaire: BFI-10

Contacted Quora users having more than 30 answers

50 users filled in the survey

Calculated a personality score between 1 and 10 for each of the 5 personality traits
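
A possible scoring routine for such a questionnaire is sketched below. It assumes 5-point Likert answers and the commonly published BFI-10 item order, with one reverse-keyed item per trait; that key, and the choice to sum the two items, are assumptions and may differ from the authors' exact scaling (sums here range from 2 to 10).

```python
# Assumed key: trait -> [(item number, reverse-scored?)]. Verify against the
# questionnaire actually administered before relying on it.
BFI10_KEY = {
    "Extraversion": [(1, True), (6, False)],
    "Agreeableness": [(2, False), (7, True)],
    "Conscientiousness": [(3, True), (8, False)],
    "Neuroticism": [(4, True), (9, False)],
    "Openness": [(5, True), (10, False)],
}


def score_bfi10(responses):
    """Map {item number: answer in 1..5} to a per-trait score."""
    scores = {}
    for trait, items in BFI10_KEY.items():
        total = 0
        for item, reverse in items:
            answer = responses[item]
            if not 1 <= answer <= 5:
                raise ValueError(f"item {item}: answer {answer} outside 1..5")
            total += (6 - answer) if reverse else answer
        scores[trait] = total
    return scores


neutral = score_bfi10({item: 3 for item in range(1, 11)})
extravert = score_bfi10({**{item: 3 for item in range(1, 11)}, 1: 1, 6: 5})
```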


Extraction of Data

Wrote a Python script to crawl all the answers of these users

Sanitized the answers

Pruned all the answers with fewer than 200 words

Labelled the dataset thus obtained with the survey results
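
The pruning and labelling steps could be sketched as follows, assuming the crawled answers are already available as a mapping from user to answer texts. The data layout and the function name are hypothetical, and whitespace splitting is just one simple way to count words.

```python
def build_labelled_dataset(answers_by_user, survey_scores, min_words=200):
    """Keep answers with at least min_words words and attach survey labels.

    answers_by_user: {user id: [answer text, ...]}
    survey_scores:   {user id: {trait name: score}}
    """
    dataset = []
    for user, answers in answers_by_user.items():
        if user not in survey_scores:
            continue  # no survey response, so no label for this user
        for text in answers:
            if len(text.split()) >= min_words:
                dataset.append({
                    "user": user,
                    "text": text,
                    "labels": survey_scores[user],
                })
    return dataset


answers = {
    "user_a": ["word " * 200, "too short to keep"],
    "user_b": ["word " * 250],  # never answered the survey
}
scores = {"user_a": {"Openness": 7}}
dataset = build_labelled_dataset(answers, scores)
```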


Results


Only LIWC Features on Labelled Essays (accuracy, %)

Trait              SMO      Logistic  Adaboost  SVM      Random Forest
Openness           60.5348  59.9271   59.1167   51.5397  55.1053
Conscientiousness  55.4295  55.3485   55.3485   50.8104  53.4441
Extraversion       54.5786  54.7812   54.8622   51.7423  53.201
Agreeableness      55.1459  53.7682   56.0778   53.0794  54.4165
Neuroticism        55.9968  56.1183   54.3355   50.0405  52.5932


LIWC Features + New Extracted Features on Labelled Essays (accuracy, %)

Trait              SMO      Logistic  Adaboost  SVM      Random Forest
Openness           60.5348  60.3728   58.59     51.9854  57.7391
Conscientiousness  56.3614  55.0243   55.2269   51.2156  53.282
Extraversion       55.1864  55.5105   55.5105   51.2561  52.5122
Agreeableness      54.9028  53.6872   56.7261   53.0389  52.107
Neuroticism        56.9692  57.7391   54.0924   50.6888  51.7828


Expanded LIWC + New Extracted Features on Labelled Essays (accuracy, %)

Trait              SMO      Logistic  Adaboost  SVM      Random Forest
Openness           61.1831  61.7504   59.8865   53.0794  56.3209
Conscientiousness  55.5105  54.6191   53.8088   51.5802  51.6613
Extraversion       54.2139  54.3355   55.8752   52.0259  50.6078
Agreeableness      55.3485  54.2139   54.2139   51.6613  51.7423
Neuroticism        57.577   56.6856   54.8622   51.2561  51.9449


Expanded LIWC + New Extracted Features + Discourse Relations on Labelled Essays (accuracy, %)

Trait              SMO      Logistic  Adaboost  SVM      Random Forest
Openness           61.4336  60.2726   58.9601   52.3473  57.2943
Conscientiousness  56.4866  55.679    53.054    51.5901  51.2367
Extraversion       54.1141  53.4578   55.5275   52.5492  53.054
Agreeableness      56.7895  56.9914   54.5684   53.8617  54.8208
Neuroticism        56.84    57.0419   53.8617   53.5083  53.3569


Only LIWC Features on Labelled Quora Dataset (accuracy, %)

Trait              SMO      Logistic  Adaboost  SVM      Random Forest
Openness           74.8971  74.8971   70.3704   70.535   71.6049
Conscientiousness  68.9712  66.9136   68.9712   68.9712  69.7942
Extraversion       76.2963  76.7078   76.2963   76.2963  78.93
Agreeableness      67.8189  66.5844   63.4568   63.4568  66.1728
Neuroticism        72.9218  71.8519   72.9218   72.9218  71.7695


Expanded LIWC + Features on Quora Dataset (accuracy, %)

Trait              SMO      Logistic  Adaboost  Adaboost (random forest)  SVM      Random Forest
Openness           75.3909  73.9095   72.7572   77.284                    71.0288  74.7325
Conscientiousness  70.1235  67.572    68.9712   73.6626                   68.5597  71.2757
Extraversion       76.3786  77.284    76.2963   80.4115                   77.9424  79.7531
Agreeableness      66.9959  67.1605   63.4568   69.3827                   64.1975  66.0905
Neuroticism        73.0041  70.0412   72.9218   75.2263                   71.9342  72.3457


Future Work

Expand LIWC by taking more unlabelled Quora data

Gather richer labelled Quora data by conducting paid personality surveys

Evaluate on more labelled Quora data

Leverage discourse output to generate better discourse features

Add more linguistic features by identifying patterns in Quora answers


Thank You