Measuring Reliability and Validity in Human Coding and Machine Classification


DESCRIPTION

Slides delivered as a part of #CAQDAS14. In 1989 the Department of Sociology at the University of Surrey convened the world's first conference on qualitative software, which brought together qualitative methodologists and software developers to debate the pros and cons of using technology for qualitative data analysis. The results were a book (Fielding & Lee, 1991, Using Computers in Qualitative Research, Sage Publications), the setting-up of the CAQDAS Networking Project, and many other conferences on these topics over the years. This conference will be another opportunity for methodologists, developers and researchers to come together and debate the issues. There will be keynote papers by leading experts in the field, software support clinics and opportunities to present work in progress. http://www.surrey.ac.uk/sociology/files/Programme%20.pdf

TRANSCRIPT

Page 1

Measuring Reliability and Validity in Human Coding and Machine Classification

Dr. Stuart Shulman

May 2, 2014, CAQDAS Conference 2014

"…a wealth of information creates a poverty of attention." – Herbert Simon, 1971

Page 2 (image)

Page 3

Acknowledgements

• This research has been supported by grants from the National Science Foundation (NSF) and was supplemented through interagency agreements between the US Environmental Protection Agency, the US Fish & Wildlife Service, and the NSF.
  – EIA 0089892 (2001-2002): "SGER Citizen Agenda-Setting in the Regulatory Process: Electronic Collection and Synthesis of Public Commentary"
  – EIA 0327979 (2003-2004): "SGER Collaborative: A Testbed for eRulemaking Data"
  – SES 0322662 (2003-2005): "Democracy and E-Rulemaking: Comparing Traditional vs. Electronic Comment from a Discursive Democratic Framework"
  – IIS 0429293 (2004-2007): "Collaborative Research: Language Processing Technology for Electronic Rulemaking"
  – SES-0620673 (2007): "Coding across the Disciplines: A Project-Based Workshop on Manual Text Annotation Techniques"
  – IIS-0705566 (2007-2010): "Collaborative Research III-COR: From a Pile of Documents to a Collection of Information: A Framework for Multi-Dimensional Text Analysis"
• Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation.

Page 4

An Incredibly Important Book

Page 5

Qualitative Methods: Genes, Taste, or Tactic?

• Qualitative by birth or choice?
  – Some look to words as an alternative to number crunching
  – Others are rooted in rich and meaningful interpretive traditions
• Another group is fluent in both qual & quant
  – Mixed methods open up rather than limit fields of knowledge
• One central goal is valid inferences about phenomena
  – Replicable and transparent methods
  – Attention to error and corrective measures
  – Internal and external validation of results (see the sketch below)
• Using computers for qualitative data analysis helps, but…
  – Rigor still originates with the research design, not the technology
  – Software makes better organization and efficiency possible
  – Coders enable the researcher to step back while scaling up
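As one concrete reading of the validation bullet above: when machine classification is used to scale up human coding, internal validation can be as simple as holding out a set of human-coded items and scoring the machine's output against them. The following is a minimal Python sketch with invented labels, not the evaluation code from any of the projects in these slides.

# Minimal sketch: validate machine-assigned codes against held-out human codes.
# Both label lists below are invented for illustration.
gold      = ["pro", "con", "pro", "con", "con", "pro", "con", "pro"]
predicted = ["pro", "con", "con", "con", "pro", "pro", "con", "pro"]

# Overall accuracy: fraction of items where machine and human agree.
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Precision and recall for the "pro" code.
true_pos  = sum(g == p == "pro" for g, p in zip(gold, predicted))
precision = true_pos / predicted.count("pro")
recall    = true_pos / gold.count("pro")

print(f"accuracy={accuracy:.2f}  precision(pro)={precision:.2f}  recall(pro)={recall:.2f}")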

Page 6

A spectrum of approaches to working with qualitative data
Different types of knowledge claims depending on where you sit

• Purist: deep immersion, closeness to data, antipathy to numbers, credible interpretation, in-depth analysis, contextual, subjective
• Pluralist: experimental, mixed method, adaptive, hybrid, flexible approach, interdisciplinary
• Positivist: quantitative, focus on error, measurement critical, validity and reliability, replication & objectivity, generalization, hypotheses

These choices are philosophical, ideological, political and ethical

Page 7

Emergent properties found in a very well-read text, such as the character type "extremist agent of the law"

Page 8

Agenda-setting in the press

Page 9

Relations between Classes

[Diagram relating the classes: Rates and Terms for Credit, Farm Profitability, Cost of Living, Soil Fertility, Education]

Process: Exploration → Speculation → Coding → Validation

Page 10

Skip Ahead 10 Years: Display Ideas Using IR & NLP Techniques

• Information Retrieval (IR)
  – Search and cluster topics and cross-correlate by stakeholders (see the clustering sketch at the end of this page)
• Natural Language Processing (NLP)
  – Grouped by opinion and writer type

[Bar chart of comment volume, Con vs. Pro, with a count axis from 5,000 to 25,000]

Par 2.2(a1)
• Con:
  – 150, 818: "impossible to maintain"
  – 272: "too expensive for elderly"
• Pro:
  – 169, 213, 391, 392, 394: "already being done in Alaska"
  – 18: "extend to children"

[Blurred sample comment excerpts]
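To make the IR bullet above concrete, here is a minimal Python sketch of topic clustering over public comments using TF-IDF vectors and k-means. It is an illustration only, not the project's pipeline: the sample comments are invented (loosely echoing the pro/con excerpts above), the cluster count is arbitrary, and it assumes scikit-learn is installed.

# Minimal sketch: cluster public comments by topic with TF-IDF + k-means.
# The comments are invented examples; the eRulemaking corpora described in
# these slides contained tens of thousands of documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "The proposed standard would be impossible to maintain on a small farm.",
    "This rule is too expensive for elderly residents on fixed incomes.",
    "Similar monitoring is already being done in Alaska with good results.",
    "Please extend the protections in this rule to children as well.",
    "Maintenance costs under the proposal are unreasonable for rural operators.",
]

# Represent each comment as a TF-IDF vector over its word features.
vectors = TfidfVectorizer(stop_words="english").fit_transform(comments)

# Group the comments into two topical clusters (k is chosen arbitrarily here).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, comment in sorted(zip(labels, comments)):
    print(label, comment)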

Page 11

Stuart W. Shulman. 2003. "An Experiment in Digital Government at the United States National Organic Program," Agriculture and Human Values 20(3), 253-265.

Pages 12-17 (images)

Page 18

Coding Web Sites and Focus Groups to Study Agenda-Setting

Page 19

Annotation to Improve Optical Character Recognition

Page 20

Over 13,000 hours of video and audio were recorded of the public spaces in a long-term care facility's dementia unit in suburban Pittsburgh, PA. A codebook of 80+ codes was developed to categorize the behavior of the consenting residents and staff (staff only in relation to patients). 22 coders spent more than 4,400 hours over a period of 22 months coding the video data. The data were coded using the Informedia Digital Video Library (IDVL), an interface designed by computer scientists at Carnegie Mellon University.
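With 22 coders working over 22 months, reliability is typically reported with chance-corrected agreement statistics rather than raw percent agreement. Below is a minimal Python sketch of Cohen's kappa for one pair of coders, with invented labels; a real study of this scale would score every coder pair, or use a statistic such as Krippendorff's alpha that accommodates many coders and missing data.

# Minimal sketch: Cohen's kappa for two coders labeling the same items.
# The label sequences are invented for illustration.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    n = len(coder_a)
    # Observed agreement: fraction of items where the two coders match.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, from each coder's marginal label rates.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

coder_a = ["pro", "con", "con", "pro", "neutral", "con", "pro", "pro"]
coder_b = ["pro", "con", "pro", "pro", "neutral", "con", "con", "pro"]
print(f"kappa = {cohens_kappa(coder_a, coder_b):.3f}")  # ~0.58 for these labels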

Page 21 (image)

Page 22

http://cat.ucsur.pitt.edu

Pages 23-28 (images)

Page 29

Dr. Stuart W. Shulman
Founder & CEO, Texifter, LLC
Research Associate Professor, Department of Political Science, University of Massachusetts Amherst
Director, Qualitative Data Analysis Program (QDAP)
Associate Director, National Center for Digital Government
Editor Emeritus, Journal of Information Technology & Politics
[email protected]
http://people.umass.edu/stu/
@stuartwshulman