text to data to insight
Data and the Digital World
August 23, 2017
Whitt Kilburn, [email protected], Dept. of Political Science
Matt Schultz, [email protected], University Libraries



An overview of Natural Language Processing: https://monkeylearn.com/blog/definitive-guide-natural-language-processing/

Wincenty Lutoslawski, 1863-1954, coined the term “stylometry”

Source: https://archive.org/details/origingrowthofpl00lutoiala
Source: https://commons.wikimedia.org/wiki/Main_Page

‘If handwriting can be so exactly determined as to afford certainty as to its identity, so also with style, since style is more personal and characteristic than handwriting’

The Origin and Growth, 1897 (p. 60)

Modern stylometry: relationship between text style and metacharacteristics

writing style: patterns of word usage, especially function (‘stop’) word usage

metacharacteristics: text author, gender, chronology, time period, identity, etc.

https://www.nytimes.com/2017/07/06/upshot/the-word-choices-that-explain-why-jane-austen-endures.html

‘Function’ or ‘Stop’ words in Jane Austen, Pride and Prejudice

Romeo and Juliet: A word cloud (of mostly stop or function words)

Patterns of usage of the most frequent words (the function words) reveal the author’s “fingerprint”

Zipf’s 1st Law: Rank-Frequency dependence

such as modern English ‘the’, ‘in’, ‘of’, ‘or’, ‘I’, ‘is’

Stylometry: usually beginning with a table of frequencies
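
A minimal sketch of such a frequency table, in Python rather than the R tooling these slides rely on. The function name `frequency_table`, the simple word pattern, and the sample sentence are illustrative assumptions, not part of the original material. Ranking words by count gives the rank-frequency view behind Zipf's law, and the top of the table is typically dominated by function words.

```python
import re
from collections import Counter

def frequency_table(text, top_n=5):
    """Return (rank, word, count, relative_frequency) rows for the top_n words.

    Hypothetical helper: words are lowercased runs of letters and apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(top_n), start=1):
        rows.append((rank, word, count, count / total))
    return rows

# Made-up sample text, used only to show the shape of the table.
sample = "the cat and the dog and the bird saw the cat"
for rank, word, count, rel in frequency_table(sample):
    print(f"{rank:>4}  {word:<6} {count:>3}  {rel:.3f}")
```

Relative frequencies (count divided by total words) are used instead of raw counts so that texts of different lengths can be compared.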

The mystery of Mr. Galbraith and The Cuckoo’s Calling

Sentiment Analysis: Trump Tweets. Negative tweets are from one phone type, Android

Source: http://varianceexplained.org/r/trump-tweets/

Tokenization: the process of breaking up a text into units of analysis, whether characters, words, sentences, or paragraphs….
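
As a rough illustration of those four units of analysis, here is a small Python sketch. The `tokenize` function and its regular expressions are assumptions made for this example. Real tokenizers handle abbreviations, hyphenation, and punctuation far more carefully.

```python
import re

def tokenize(text, unit="word"):
    """Split text into characters, words, sentences, or paragraphs.

    Hypothetical helper with deliberately simple rules.
    """
    if unit == "character":
        # Every non-whitespace character is a token.
        return [ch for ch in text if not ch.isspace()]
    if unit == "word":
        # Lowercased runs of letters, digits, and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())
    if unit == "sentence":
        # Break after ., !, or ? when followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [p for p in parts if p]
    if unit == "paragraph":
        # Paragraphs are separated by one or more blank lines.
        parts = re.split(r"\n\s*\n", text.strip())
        return [p.strip() for p in parts if p.strip()]
    raise ValueError(f"unknown unit: {unit}")

text = "It is a truth. Universally acknowledged!\n\nA second paragraph."
for unit in ("word", "sentence", "paragraph"):
    print(unit, tokenize(text, unit))
```

The choice of unit determines what gets counted downstream: word tokens feed a frequency table, while sentence or paragraph tokens are more useful for scoring sentiment passage by passage.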

Where text mining usually begins:

Quick Notes:

Cluster method in stylo() is hierarchical clustering.

Distance metrics: Burrows’s Delta (classic Delta) is Manhattan distance on standardized (z-distributed) scores.

Other metrics: Euclidean.

Consult the stylo() manual for further info…. And Google!
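
The Delta note above can be sketched in a few lines of Python. This is not stylo()'s implementation, only an illustration under stated assumptions. The function names are hypothetical, the relative frequencies are made up, the sample standard deviation is used for standardizing, and the Manhattan distance is divided by the number of words so the result is a mean absolute difference.

```python
from statistics import mean, stdev

def z_scores(freq_table):
    """Standardize each word's relative frequency across all texts.

    freq_table: {text_name: {word: relative_frequency}}
    Returns the same shape with z-scores in place of frequencies.
    """
    names = list(freq_table)
    words = sorted({w for freqs in freq_table.values() for w in freqs})
    z = {name: {} for name in names}
    for w in words:
        column = [freq_table[name].get(w, 0.0) for name in names]
        mu, sigma = mean(column), stdev(column)
        for name, value in zip(names, column):
            # A word with no variation across texts carries no signal.
            z[name][w] = (value - mu) / sigma if sigma > 0 else 0.0
    return z

def burrows_delta(z, a, b):
    """Mean absolute difference of z-scores (scaled Manhattan distance)."""
    words = z[a].keys()
    return sum(abs(z[a][w] - z[b][w]) for w in words) / len(words)

# Made-up relative frequencies for three hypothetical texts.
freqs = {
    "text_a": {"the": 0.060, "of": 0.030, "and": 0.028},
    "text_b": {"the": 0.062, "of": 0.031, "and": 0.027},
    "text_c": {"the": 0.040, "of": 0.045, "and": 0.035},
}
z = z_scores(freqs)
print("a vs b:", round(burrows_delta(z, "text_a", "text_b"), 3))
print("a vs c:", round(burrows_delta(z, "text_a", "text_c"), 3))
```

A smaller Delta means more similar function-word habits. A matrix of such pairwise distances is the kind of input a hierarchical clustering step works from.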

Brief overview of sentiment analysis: extraction of an author’s emotional intent from a text

http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm

Popular Emotional Lexicon: Mohammad’s NRC Dictionary of crowd-sourced terms associated with 8 emotional states, from theory of psychologist Robert Plutchik:

1) anger 2) fear 3) sadness 4) disgust 5) surprise 6) anticipation 7) trust 8) joy

In the bag of words approach, a text’s sentiment is scored by the presence of words associated with an emotional state.
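
A minimal Python sketch of bag-of-words scoring follows. The lexicon here is a tiny made-up stand-in: its word-to-emotion associations are invented for illustration and are not taken from the NRC lexicon. The name `score_emotions` is also hypothetical.

```python
import re
from collections import Counter

# Toy lexicon: invented associations, NOT entries from the NRC data.
TOY_LEXICON = {
    "happy": {"joy"},
    "love": {"joy", "trust"},
    "afraid": {"fear"},
    "hope": {"anticipation", "trust"},
    "awful": {"disgust", "sadness"},
}

def score_emotions(text, lexicon):
    """Count how many words in the text are associated with each emotion.

    Word order is ignored: only the presence of lexicon words matters.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = Counter()
    for word in words:
        for emotion in lexicon.get(word, ()):
            scores[emotion] += 1
    return scores

scores = score_emotions("I love this happy day but I am afraid", TOY_LEXICON)
for emotion, count in sorted(scores.items()):
    print(emotion, count)
```

Because word order is discarded, negation ("not happy") and sarcasm are scored the same as the plain word, which is a known limit of the approach.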