Project 2: Text Analysis with Python: Header Comments. Oct 22, 2015. CSCI 0931, Intro. to Comp. for the Humanities and Social Sciences.

Page 1 (source: cs.brown.edu/courses/csci0931/2015-fall/2-text_analysis/LEC2-7.pdf)

Project 2: Text Analysis with Python

Header Comments
Oct 22, 2015

CSCI 0931 - Intro. to Comp. for the Humanities and Social Sciences

Page 2

Python Dictionaries

(All functions operate on dictionaries.)

Function               Input          Output           Example
keys()                 None           List of keys     >>> freq2.keys()
                                                       ['the', 'cat']
values()               None           List of values   >>> freq2.values()
                                                       [3, 2]
<key> in <dict>        Key            Boolean          >>> 'zebra' in freq2
                                                       False
<key> in <dict>        (same as above)                 >>> 'cat' in freq2
                                                       True
del(<dict>[<key>])     Dict. entry    None             >>> del(freq2['cat'])

•  Keys are unique!
•  Assigning/getting any value is very fast
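The operations in the table can be tried in a short script. This is only a sketch: freq2 is assumed to hold the two entries shown in the examples, and list(...) is wrapped around keys() and values() because newer Python versions return a view rather than a plain list.

```python
# freq2 is the example dictionary from the table (word -> count).
freq2 = {'the': 3, 'cat': 2}

# list(...) gives a plain list whether keys()/values() return lists or views.
keys = list(freq2.keys())
values = list(freq2.values())

has_zebra = 'zebra' in freq2    # membership tests look at the keys
has_cat = 'cat' in freq2

del freq2['cat']                # remove the entry for 'cat'
has_cat_after = 'cat' in freq2

print(keys, values, has_zebra, has_cat, has_cat_after)
```

Note that "in" checks keys only, so 3 in freq2 would be False even though 3 is stored as a value.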

Page 3

Python Dictionaries

(All functions operate on dictionaries.)

Function               Input          Output           Example
keys()                 None           List of keys     >>> freq2.keys()
                                                       ['the', 'cat']
values()               None           List of values   >>> freq2.values()
                                                       [3, 2]
<key> in <dict>        Key            Boolean          >>> 'zebra' in freq2
                                                       False
<key> in <dict>        (same as above)                 >>> 'the' in freq2
                                                       True
del(<dict>[<key>])     Dict. entry    None             >>> del(freq2['cat'])

Let's try it!

Page 4

The Big Picture

Overall Goal: Build a Concordance of a text
•  Locations of words
•  Frequency of words

Today: Get Word Frequencies
•  Define the inputs and the outputs
•  Learn a new data structure
•  Write a function to get word frequencies
•  Go from word frequencies to a concordance (finally!)

Page 7

Python Dictionaries

I want to write a wordFreq function

•  What is the input to wordFreq?
•  What is the output of wordFreq?

The cat had a hat. The cat sat on the hat.

Word   Freq.
the    3
cat    2
had    1
a      1
hat    2
sat    1
on     1

Let's try it!
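One possible answer to the two questions, as a rough sketch rather than the in-class solution: the input is a text (a string) and the output is a dictionary from each word to its count. Lowercasing the words and stripping punctuation are assumptions made here so that the result matches the table.

```python
import string

def wordFreq(text):
    '''text -> (string, int) dict
    Counts how often each word occurs, ignoring case and punctuation.'''
    freq = {}
    for raw in text.split():
        # "hat." and "The" should count as "hat" and "the".
        word = raw.strip(string.punctuation).lower()
        if word == '':
            continue
        if word in freq:
            freq[word] = freq[word] + 1
        else:
            freq[word] = 1
    return freq

freq = wordFreq('The cat had a hat. The cat sat on the hat.')
print(freq)
```

The "word in freq" test is what keeps the first occurrence of a word from raising an error: a key has to exist before its value can be read.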

Page 9

Building a Concordance

The  cat  had  a  hat.  The  cat  sat  on  the  hat.
0    1    2    3  4     5    6    7    8   9    10

Word   List of Positions   Frequency
the    [0,5,9]             3
cat    [1,6]               2
had    [2]                 1
a      [3]                 1
hat    [4,10]              2
sat    [7]                 1
on     [8]                 1
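The table can be reproduced with a dictionary whose values are lists of positions. The sketch below is one way to do it, not the homework solution; the name buildConcordance and the choice to lowercase words and strip punctuation are assumptions.

```python
import string

def buildConcordance(text):
    '''text -> (string, int list) dict
    Maps each word to the list of positions where it occurs.'''
    conc = {}
    words = text.split()
    for position in range(len(words)):
        word = words[position].strip(string.punctuation).lower()
        if word == '':
            continue
        if word not in conc:
            conc[word] = []
        conc[word].append(position)
    return conc

conc = buildConcordance('The cat had a hat. The cat sat on the hat.')
for word in conc:
    # The frequency is just the length of the position list.
    print(word, conc[word], len(conc[word]))
```

Storing the positions makes a separate frequency table unnecessary, since len(conc[word]) gives the count.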

Page 10

The Big Picture

Overall Goal: Build a Concordance of a text
•  Locations of words
•  Frequency of words

Today: Get Word Frequencies
•  Define the inputs and the outputs
•  Learn a new data structure
•  Write a function to get word frequencies
•  Go from word frequencies to a concordance (finally!)

This will be part of your next HW

Page 11

Long timeline…

Sun     Mon     Tues    Wed     Thurs   Fri     Sat     This week
10/18   10/19   10/20   10/21   10/22   10/23   10/24   Project 2: Proposal out
10/25   10/26   10/27   10/28   10/29   10/30   10/31   HW 2-6 due; HW 2-7 due; Initial Proposal due
11/1    11/2    11/3    11/4    11/5    11/6    11/7    HW 2-8 due; Revised Proposal due
11/8    11/9    11/10   11/11   11/12   11/13   11/14   Project due

Page 12

Today's first topic: Project 2

•  Reminders
•  Data Sources
   – Project Gutenberg
   – English Dictionary
   – Debate Transcripts
•  Project 2 Description
•  Example Project 2 Proposal

Page 15

Data Sources

•  Looking at a few examples today
•  Think about what hypotheses you could explore using these data sources
•  What other sources are you interested in?
   – What are the important data you want to compute by extracting pieces of the text?

Page 16

Data Sources

•  Open the "Text Data Sources" link on the webpage

Page 18

Project Gutenberg

http://www.gutenberg.org/

1. Find a book. Any book.
2. How large is the Plain Text UTF-8 file?
   1. MB = Megabyte
   2. KB = Kilobyte
   (1024 KB = 1 MB)
3. Find a book that is < 1 MB. Download it.
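Once a book is downloaded, the size check can also be done from Python. This is a small sketch using the standard os module; the temporary file stands in for a downloaded book, and the function names are made up for this example.

```python
import os
import tempfile

def sizeInKb(path):
    '''string -> float
    Returns the size of the file at path in kilobytes (1 KB = 1024 bytes).'''
    return os.path.getsize(path) / 1024.0

def isUnderOneMb(path):
    '''string -> bool
    True if the file is smaller than 1 MB (1024 KB).'''
    return sizeInKb(path) < 1024

# Stand-in for a downloaded book: a temporary file of 2048 bytes.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b'x' * 2048)
    samplePath = handle.name

sampleKb = sizeInKb(samplePath)
sampleSmall = isUnderOneMb(samplePath)
os.remove(samplePath)
print(sampleKb, sampleSmall)
```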

Page 19

Project Gutenberg

http://www.gutenberg.org/

Look at the function removeLicenseFromProjectGutenberg in DataImport.py
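DataImport.py itself is not reproduced in these slides, so the following is not the course's function. It is a hypothetical sketch of the general idea, assuming the book text marks its body with lines that begin with "*** START OF" and "*** END OF"; the exact marker wording varies between files, so check the one you download.

```python
def stripGutenbergLicense(text):
    '''text -> text
    Keeps only the lines between the START and END marker lines.
    If no START marker is found, the text is returned unchanged.'''
    lines = text.split('\n')
    start = None
    end = len(lines)
    for i in range(len(lines)):
        if lines[i].startswith('*** START OF'):
            start = i + 1          # body begins on the next line
        elif lines[i].startswith('*** END OF'):
            end = i                # body stops before this line
            break
    if start is None:
        return text
    return '\n'.join(lines[start:end]).strip()

# Tiny made-up file; a real book has long license text at both ends.
sample = ('License header\n'
          '*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'Call me Ishmael.\n'
          '*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'License footer')
body = stripGutenbergLicense(sample)
print(body)
```

Removing the license matters for word counts: otherwise words from the legal text are counted as if they were part of the book.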

Page 20

Today's first topic: Project 2

•  Data Sources
   – Project Gutenberg
   – English Dictionary
   – Debate Transcripts
•  Project 2 Description
•  Example Project 2 Proposal

Page 22

Webster's Unabridged Dictionary

http://www.mso.anu.edu.au/~ralph/OPTED/

1. According to the homepage, what does each line contain?
2. What letter is the smallest file?
   1. MB = Megabyte
   2. KB = Kilobyte
   (1024 KB = 1 MB)
3. Click on it. Right-click and select View Page Source...

Page 23

Webster's Unabridged Dictionary

http://www.mso.anu.edu.au/~ralph/OPTED/

Look at the function getWebsterDictionary in DataImport.py

Page 25

The American Presidency Project

http://www.presidency.ucsb.edu/

Click on Republican Candidates Debate in Mesa, AZ

Page 26

The American Presidency Project

http://www.presidency.ucsb.edu/

Look at the function getTranscript in DataImport.py

Page 28

Project 2 Rubric

Page 30

Anna Ritz
Project 2 Proposal

Background: After each debate, there's lots of talk about who "won" it, i.e.

I will define the "winner" as the person who received applause the most frequently during the debate.

Claim: I claim that in the AZ debate, Romney "won" and Santorum "lost", that is, Romney received applause the most and Santorum received applause the least.

....

http://www.washingtonpost.com/blogs/the-fix/post/arizona-republican-debate-winners-and-losers/2012/02/22/gIQAsKkVUR_blog.html

http://blogs.phoenixnewtimes.com/valleyfever/2012/02/who_won_last_nights_arizona_re.php
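A claim like this one can be tested by counting applause markers per speaker. The sketch below is only an illustration: it assumes each speaker turn is a single line of the form "NAME: words..." with applause written as "[applause]", which may not match the real transcript, and the sample lines are invented.

```python
def countApplause(transcript):
    '''text -> (string, int) dict
    Counts "[applause]" markers per speaker. Assumes each turn is one
    line of the form "NAME: words..." (an assumed format).'''
    counts = {}
    for line in transcript.split('\n'):
        if ':' not in line:
            continue
        speaker = line.split(':')[0].strip()
        applause = line.lower().count('[applause]')
        if speaker not in counts:
            counts[speaker] = 0
        counts[speaker] = counts[speaker] + applause
    return counts

# Made-up lines for illustration only; not taken from any real debate.
sample = ('SPEAKER A: Thank you. [applause] Good evening. [applause]\n'
          'SPEAKER B: Let me respond.\n'
          'SPEAKER A: One more point. [applause]')
counts = countApplause(sample)
mostApplauded = max(counts, key=counts.get)
print(counts, mostApplauded)
```

With the definition of "winner" from the proposal, the winner is the key with the largest count and the loser is the key with the smallest.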

Page 31

Look at the file structure…

Page 32

Skeleton Code

Page 34

Anna Ritz
Project 2 Proposal

...

Claim: I claim that in the AZ debate, Romney "won" and Santorum "lost", that is, Romney received applause the most and Santorum received applause the least.

....

Backup Plan: ???

Increasing Degree of Difficulty: ???

http://blogs.phoenixnewtimes.com/valleyfever/2012/02/who_won_last_nights_arizona_re.php

Page 35

What else can I do?

•  Count presence of characters in different chapters in a book.
   – Generate CSV, plot graph on Google Spreadsheets
•  Look at the Sherlock Holmes stories
   – Search for "elementary" and "Watson" close together
   – Get all variations of the famous quote (which some people claim was never said in the books)
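Searching for two words "close together" can be done with word positions. Here is a minimal sketch, assuming "close" means within a chosen number of words on either side; the function name, the window size, and the sample sentence are all invented for illustration.

```python
import string

def findNearby(text, first, second, window):
    '''text * word * word * int -> int list
    Returns the positions of first that have second within
    window words on either side.'''
    words = []
    for raw in text.split():
        words.append(raw.strip(string.punctuation).lower())
    hits = []
    for i in range(len(words)):
        if words[i] == first:
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            if second in words[lo:hi]:
                hits.append(i)
    return hits

# Made-up sentence for illustration; not a quotation from the stories.
sample = 'It is elementary, said he. Watson nodded. Later nothing was elementary at all.'
hits = findNearby(sample, 'elementary', 'watson', 3)
print(hits)
```

Printing the words around each hit would then show every variation of the quote that actually appears in the text.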

Page 36

What else can I do?

•  Get tweets from Western US and Eastern US
   – Check whether "Pepsi" shows up more than "Coke"
   – Soda vs. Pop "issue"
•  Right now, we give you tweets in a CSV file
•  Later in the course, you'll get your own tweets
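The Pepsi vs. Coke question comes down to counting matching rows in the CSV file. The sketch below uses Python's csv module; the column names "region" and "text" are hypothetical, since the layout of the course's tweet file is not shown here, and the rows are made up.

```python
import csv
import io

def countMentions(csvFile, word):
    '''file * word -> (string, int) dict
    Counts, per region, the tweets whose text contains the word.
    Assumes columns named "region" and "text" (hypothetical names).'''
    counts = {}
    for row in csv.DictReader(csvFile):
        region = row['region']
        if region not in counts:
            counts[region] = 0
        if word.lower() in row['text'].lower():
            counts[region] = counts[region] + 1
    return counts

# Made-up rows standing in for the course's tweet file.
sampleCsv = ('region,text\n'
             'West,I would like a Pepsi\n'
             'East,Coke with lunch\n'
             'East,Pepsi or Coke?\n')
pepsi = countMentions(io.StringIO(sampleCsv), 'pepsi')
coke = countMentions(io.StringIO(sampleCsv), 'coke')
print(pepsi, coke)
```

For a real file, open(filename, "r") would take the place of io.StringIO(sampleCsv).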

Page 38

HW: Building a Concordance

The  cat  had  a  hat.  The  cat  sat  on  the  hat.
0    1    2    3  4     5    6    7    8   9    10

Word   List of Positions   Frequency
the    [0,5,9]             3
cat    [1,6]               2
had    [2]                 1
a      [3]                 1
hat    [4,10]              2
sat    [7]                 1
on     [8]                 1

Page 39

Lists as values in a dictionary

Page 42

Lists as values of a dictionary

The cat had a hat. The cat sat on the hat.

Key   Value
cat   [1,6]
hat   [4,10]

>>> conc = {}
>>> conc
{}
>>> conc['cat'] = [1,6]
>>> conc
{'cat':[1,6]}
>>> conc['hat'] = [4,10]
>>> conc
{'hat':[4,10], 'cat':[1,6]}

Page 43

Lists as values of a dictionary

The cat had a hat. The cat sat on the hat.

Key   Value
cat   [1,6,400]
hat   [4,10]

>>> conc['cat'] = conc['cat'] + [400]
>>> conc
{'cat':[1,6,400], 'hat':[4,10]}
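The update above works only when the key already exists. A small helper can cover both cases; this is a sketch with an invented name (addPosition), using append, which changes the list in place instead of building a new one with +.

```python
def addPosition(conc, word, position):
    '''(string, int list) dict * word * int -> .
    Records one more position for word, creating the list if needed.'''
    if word not in conc:
        conc[word] = []
    conc[word].append(position)

conc = {'cat': [1, 6], 'hat': [4, 10]}
addPosition(conc, 'cat', 400)   # existing key: the list grows
addPosition(conc, 'sat', 7)     # new key: the list is created first
print(conc)
```

Without the "word not in conc" check, the second call would fail with a KeyError because 'sat' has no list yet.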

Page 44

Header Comments

Page 45

Header Comments

def addOne(t):
    '''Receives a number and returns that number plus one'''

def addOne(t):
    '''num -> num
    Receives a number and returns that number plus one'''

Page 46

Header Comments

def sumThem(a, b):
    '''Receives two integers and returns their sum'''

def sumThem(a, b):
    '''int * int -> int
    Receives two integers and returns their sum'''

Page 47

Header Comments

def buildFreqTable(text):
    '''Receives a text and returns a dictionary mapping each word to its frequency'''

def buildFreqTable(text):
    '''string -> (string,int)dict
    Receives a text and returns a dictionary mapping each word to its frequency'''

Page 48

Header Comments

def addPassword(dictionary,key,value):
    '''Adds the (key,value) pair to the dictionary and returns the new dictionary'''

def addPassword(dictionary,key,value):
    '''(string,string)dict * string * string -> (string, string)dict
    Adds the (key,value) pair to the dictionary and returns the new dictionary'''

Page 49

Header Comments

def isElementOf(element, listOfElems):
    '''Checks if element is part of the provided list'''

def isElementOf(element, listOfElems):
    '''int * int list -> bool
    Checks if element is part of the provided list'''

Page 50

Header Comments

def isElementOf(element, listOfElems):
    '''Checks if element is part of the provided list'''

def isElementOf(element, listOfElems):
    '''object * list -> bool
    Checks if element is part of the provided list'''

Page 51

Header Comments

•  Notation for describing types: int, float, string, bool
•  Separate multiple arguments with "*":
   open(filename, "r")
   string * string -> file

Page 52

Header Comments

•  Also say what the function produces via its return statement:

def printMovieRevenues(movie_dict):
    '''(string, int) dict -> .'''
    #some print commands here…
    #some extra stuff particular to the function…

•  Use "." to mean "nothing at all"
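To see header comments next to working code, here are three of the functions above with bodies filled in. The bodies are assumptions written for this example (the slides show only the headers), and the movie titles and numbers are placeholders.

```python
def sumThem(a, b):
    '''int * int -> int
    Receives two integers and returns their sum'''
    return a + b

def isElementOf(element, listOfElems):
    '''object * list -> bool
    Checks if element is part of the provided list'''
    return element in listOfElems

def printMovieRevenues(movie_dict):
    '''(string, int) dict -> .
    Prints one line per movie; returns nothing.'''
    for title in movie_dict:
        print(title, movie_dict[title])

total = sumThem(2, 3)
found = isElementOf('cat', ['the', 'cat'])
# "Movie A"/"Movie B" and the numbers are placeholders, not real data.
nothing = printMovieRevenues({'Movie A': 10, 'Movie B': 20})
# The header comment is stored on the function and can be read back.
firstLine = sumThem.__doc__.split('\n')[0]
print(total, found, nothing, firstLine)
```

A function marked "-> ." has no return statement, so calling it gives back None, which is what the variable nothing holds here.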

Page 53

More complicated types

•  Dictionaries
   (string, int)dict
   (string, string list)dict
•  Lists
   int list            [2, 3, 4]
   string list         ['cat', 'zebra']
   string list list    [['a', 'b'],['cat', 'h']]
•  Use parentheses to clarify as needed
   (string list) list
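As a quick illustration of the notation, the sketch below pairs a few values with the type each would be given in a header comment. The function longestList and the sample values are invented for this example.

```python
def longestList(table):
    '''(string, string list)dict -> string
    Returns the key whose list has the most entries
    ('' if the table is empty).'''
    best = ''
    bestLength = -1
    for key in table:
        if len(table[key]) > bestLength:
            best = key
            bestLength = len(table[key])
    return best

counts = {'the': 3, 'cat': 2}                   # (string, int)dict
groups = {'pets': ['cat', 'zebra'],
          'articles': ['the']}                  # (string, string list)dict
nested = [['a', 'b'], ['cat', 'h']]             # (string list) list
winner = longestList(groups)
print(winner, len(nested), counts['the'])
```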

Page 54

Synonyms

•  OK to use "text" for a long string that represents a whole sentence or book, etc.
•  OK to use "word" for a string containing an individual word.

def getMobyWords(fileString):
    '''text -> string list
    split text of Moby Dick into individual words'''
    return fileString.split()

Page 55

Next Classes

•  String functions in Python (split, search, etc)
•  Get input from the user's keyboard!
•  Generate Files
•  Using Python to compute a similarity score between books
   – "Which book might have been authored by someone different than the rest?"