Craig Evans CAS587 – Culture as Data Project Results, 4 December 2012
TRANSCRIPT
Challenge: Finding the Right Data Set
• Wide variety of data types presented: global, national, local; big data, personal data
• Discussed varying technologies: data mining, text mining, machine learning, visualisation
• All very abstract …
Motivation: Something Personal/Relatable
• Never lose sight of the data
• It's not about the technology
• Technology is a tool, not an endpoint
• Choose data that we can all see something in
So …
Goal
• CAS587 is an interdisciplinary class
• We have different interests/focus – do they come out through our readings analysis?
• Analyse the writings of the CAS587 class, and see if there is any apparent trend in their writing.
Importance …
To the student:
• Who else in the class has a similar interest?
• Who has expressed skills that are complementary?
• Who would you reach out to in order to build a team later?
To the instructor:
• Has the right message been communicated?
• Have your goals in educating the class been met?
To the wider population:
• This is an example of how data can get used in an unintended way.
• Would you write differently if you knew the text was going to be used for this purpose?
• Would you choose to post anonymously instead?
Data Appropriateness
• It is a “raw” data set
• No previous preprocessing
• It is not what the data was intended for
• It is a little “random” in nature – not a traditional structured dataset found in an online repository
CAS587 – The Data Set
• Starts as a PDF file
• Converted to standard ASCII text file
• Manual cleanup of data required
• Removal of heading/footer information
• Result? 150 files, 96677 words, 1150150 chars
The Process
1. PDFs submitted to CAS587 website
2. Results exported to plain text files
   Used trial version of publicly available PDF2Text tool
3. Results imported to database
   MySQL
4. Results analyzed in custom Java application
   • Text parsed to individual words
   • Text stemmed using WordNet
   • tf*idf weightings used to generate keywords per person/article
   • If time permits – run sentiment analysis over corpus
5. Results returned to database
6. Results returned to Excel / visualisation tool
   Excel is easy, but once the data is processed, I can have some fun with the visualisation
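The analysis steps named on this slide (parse text to words, weight by tf*idf, keep the top terms per person/article) can be sketched roughly as follows. This is a minimal illustration in Python, not the author's Java application: the function names and the document ids are hypothetical, the WordNet stemming step is omitted, and a base-10 logarithm is assumed for the idf term.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs, k=3):
    """Return the k highest-scoring tf*idf terms for each document.

    docs maps a document id (hypothetical, e.g. a week or an author)
    to its raw text. Stemming is deliberately left out of this sketch.
    """
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    n_docs = len(docs)
    result = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        total = len(words)
        scores = {
            term: (count / total) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[doc_id] = ranked[:k]
    return result
```

A word that appears in every document gets an idf of zero, so common words drop to the bottom of each ranking without needing a separate stop-word list.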
tf*idf … term frequency × inverse document frequency (from Wikipedia)
… a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is used as a weighting factor in information retrieval and text mining. The tf*idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
Example
Consider a document containing 100 words where the word cow appears 3 times. The term frequency (TF) for cow is then (3 / 100) = 0.03. Now, assume we have 10 million documents and cow appears in one thousand of these. Then, the inverse document frequency is calculated as log10(10 000 000 / 1 000) = 4 (base-10 logarithm). The tf*idf score is the product of these quantities: 0.03 × 4 = 0.12.
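The arithmetic in the cow example can be checked with a few lines of Python. The function name and parameter names are illustrative only, and the base-10 logarithm is the assumption that makes the quoted idf come out as 4.

```python
import math


def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """tf*idf for one term in one document, using a base-10 logarithm."""
    tf = term_count / doc_length                # 3 / 100 in the example
    idf = math.log10(n_docs / docs_with_term)   # log10(10 000 000 / 1 000)
    return tf * idf


# The cow example: 3 occurrences in a 100-word document,
# 1 000 of 10 000 000 documents contain the word.
score = tf_idf(3, 100, 10_000_000, 1_000)
print(round(score, 2))
```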
The Class Week 2-6
Week 2: What is Culture as Data?
filter, comparative, autism, scholarship, writer, closed, overload, library, net, outward, air, inside, coin, ecology, region
Week 3: Social Media - culture, trends, and data
activism, movie, stock, flu, market, happy, mood, tweet, Trends, predict, happier, weak, Democrats, happiness, television
Week 4: Visualization, the challenges of visualizing culture - the challenges of manipulating large amounts of data
visualization, template, analyst, analytic, seer, visual, computing, dot, cloud, distort, manipulate, viewer, map, lie, trap
Week 5: Books, Music, Images, Movies
music, dementia, rating, alzheimer, movie, taste, playlist, political, novel, Books, musical, affiliation, preference, listen, writing
Week 6: Data as Culture: Curating, Scrubbing, and Sampling
classification, hire, narrative, card, database, replicate, icd, scientific, decline, finding, poetic, viscosity, replication, solution, electronic
The Class Week 7-11
Week 7: Prediction
customer, habit, pregnant, economy, economics, coupon, routine, cue, prediction, purchasing, evaluation, trigger
Week 8: Personal data online. Conversations and Persistence. Interpretations of personal data.
Spider, thesis, speaker, oatmeal, report, communicative, annual, persona, public, email, private, eat, analyzeword, mouth, wife
Week 9: History of Big Data Critiques
skull, friction, reductionism, craniology, maturity, downfall, shimmering, positivism, introspectometer, domain, inaccurate, conflict, economics, igy, dominate
Week 10: Life After Privacy
obfuscation, protect, privacy, car, policy, setting, private, default, public, option, breach, anonymize, identifiable, regulation, photo
Week 11: Art as Data; Data as Art
art, wind, transfinite, installation, artistic, cascade, choir, hint, visualization, rose, color, contents, flow, beautiful
Picking on an Individual
Craig Evans – Total Corpus
Keywords from total corpus
cent, visualisation, suspect, secondary, teach, zip, irb, material, illustrate, interestingly, openly, playlist, artwork, profile, century, experience, lose, computationally, reuse
Most negative sentiment …
not, lose, suspect, base, dementia, secondary, paranoid, bias, present, present, disturbing, insufficient, paranoia, difficult, number
Most positive sentiment …
model, interesting, good, well, better, researcher, accurate, aware, time, time, beneficial, enable, teach, illustrate, find, method, read, add, excellent, art
Picking on an Individual
Craig Evans – Week 7
Week 7: Prediction … Keywords
customer, habit, pregnant, economy, economics, coupon, routine, cue, prediction, purchasing, evaluation, trigger
Keywords against rest of corpus
model, influence, buying, paper, predictive, joke, series, economist, valid, pregnant, resource, woman, link
Most negative sentiment …
bias, difficult, not, base, nefarious, invalid, defunct, savage, hard, blue, miss, number, scale, pregnant
Most positive sentiment …
model, find, color, joke, read, sound, accurate, interesting, valid, valid, privacy, improve, influence, compare, reasoning, group, improvement, absolute