Discovery

Gary King

Institute for Quantitative Social Science

Harvard University

covering joint work with

Justin Grimmer (Harvard) and Eleanor Powell (Harvard Yale)

(talk @ Institute for Qualitative and Multimethod Research, Syracuse University, 5/26/09)

Reading

Gary King and Eleanor Neff Powell. 2008. “How Not to Lie Without Statistics”

Justin Grimmer and Gary King. 2009. “Quantitative Discovery of Qualitative Information: A General Purpose Document Clustering Methodology”

http://GKing.harvard.edu

Gary King (Harvard, IQSS) 2 / 19

Is Qualitative or Quantitative Research Better?

This question has an answer (hard to know from the qualitative methods literature!)

If insufficient information is quantified, qualitative judgment wins

The vast majority of human decisions are qualitative

If not, we wouldn’t have survived as a species

If sufficient information is quantified, statistics wins

The information required: shockingly little

6 crude variables beat 83 law professors (Martin et al., 2004)

Simple quantitative models forecast elections better than expert pundits (Gelman and King, 1993; Campbell, 2005)

284 experts forecast the political future with “less skill than simple extrapolation” (Tetlock, 2005)

Statistical model outperforms physicians in determining cause-specific mortality rates (King and Lu, 2008)

Hundreds of head-to-head contests, mostly with the same conclusion (Meehl, 1954; Grove, 2005)

The march of quantification across fields of academia, professions, commerce, sports, etc. (Moneyball, Super Crunchers, Numerati)

Discovery

3 Steps

1 Conceptualization (e.g., a categorization scheme)

2 Measurement (e.g., classifying objects into categories)

3 Verification (e.g., testing a hypothesis that could be wrong)

Styles & Approaches (the cause of many misunderstandings)

Quantitative: focus on measurement and verification; assumes & underplays conceptualization

Qualitative: focus on (& iterate between) conceptualization and measurement; ignores or underplays verification

Proposed Solutions

Do both qualitative & quantitative analysis, separately

Atheoretical unsupervised learning algorithms

Integrated quant/qual: Computer-assisted qualitative analysis

The Problem: Discovery from Unstructured Text

Examples: scholarly literature, news stories, medical information, blog posts, comments, product reviews, emails, social media updates, audio-to-text summaries, speeches, press releases, legal decisions, etc.

10 minutes of worldwide email = 1 LOC equivalent

An essential part of discovery is classification: “one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or, for that matter, social science research.” (Bailey, 1994).

We focus on cluster analysis: discovery through (1) classification and (2) simultaneously inventing a classification scheme

(We analyze text; our methods apply more generally)
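
A minimal sketch of what clustering a handful of texts might look like, offered only as an illustration and not as the method developed in the talk or the papers. It assumes a crude bag-of-words representation, cosine similarity, and average-link agglomerative merging; the names `bag_of_words`, `cosine`, and `cluster_documents` and the example documents are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (a deliberately crude representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_documents(docs, k):
    """Merge the two most similar clusters (average link) until k clusters remain."""
    vectors = [bag_of_words(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document

    def average_similarity(left, right):
        pairs = [(i, j) for i in left for j in right]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > max(k, 1):
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_similarity(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


if __name__ == "__main__":
    example_docs = [
        "senate votes on health care bill",
        "house debates health care bill",
        "team wins championship game",
        "coach praises team after game",
    ]
    # Each inner list holds the indices of the documents grouped together.
    print(cluster_documents(example_docs, 2))
```

The grouping is never given in advance: the partition and the implied categories both emerge from the similarity measure, which is why the choice of representation and merge rule matters so much.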

Why Johnny Can’t Classify (Optimally)

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈ 10^28 × Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand

Qualitative-only approaches are hopeless

That we think of all this as astonishing . . . is astonishing
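
The small Bell numbers quoted above can be checked directly. The sketch below computes Bell numbers with the Bell triangle and enumerates every partition of a small set; `bell` and `set_partitions` are illustrative names, not taken from the talk.

```python
def bell(n):
    """Return the n-th Bell number using the Bell triangle."""
    row = [1]  # row 0 of the triangle; its first entry is Bell(0)
    for _ in range(n):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def set_partitions(items):
    """Yield every partition of a list of items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # Put the first item into each existing block in turn...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + smaller


if __name__ == "__main__":
    for blocks in set_partitions(list("ABC")):
        print(" ".join("".join(block) for block in blocks))
    print(bell(2), bell(3), bell(5))
    print(len(str(bell(100))), "digits in Bell(100)")
```

Enumeration is only feasible for tiny sets; for 100 objects even counting the partitions yields a 116-digit number, so searching them exhaustively is out of the question.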

Why HAL Can’t Classify Either

The Goal (an optimal application-independent cluster analysis method) is mathematically impossible:

No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications

Existing methods:

Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps, . . .

Well-defined statistical, data analytic, or machine learning foundations

How to add substantive knowledge: With few exceptions, who knows?!

The literature: little guidance on when methods apply

Deriving such guidance: difficult or impossible

(Perhaps true by definition in unsupervised learning: If we knew the DGP, we wouldn’t be at the discovery stage.)

Gary King (Harvard, IQSS) 7 / 19

Page 52: Discovery - Harvard University

If Ex Ante doesn’t work, try Ex Post

Methods and substance must be connected (no free lunch theorem)

The usual approach fails: can’t do it by understanding the model

We do it ex post (by qualitative choice)

For discovery (our goal): No problem

For estimation & confirmation: more difficult or biased

Complicated concepts are easier to define ex post:

“I know it when I see it” (Justice Stewart’s definition of obscenity)

Anchoring Vignettes (on defining concepts by example)

But how to choose from an enormous list of clusterings?

Gary King (Harvard, IQSS) 8 / 19

Page 62: Discovery - Harvard University

Our Idea: Meaning Through Geography

We develop a (conceptual) geography of clusterings

Gary King (Harvard, IQSS) 9 / 19

Page 66: Discovery - Harvard University

A New Strategy

1 Code text as numbers (in one or more of several ways)

2 Apply all existing clustering methods (that have been used by at least one person other than the author) to the data, each representing different substantive assumptions (<15 mins)

3 Develop an application-independent distance metric between clusterings

4 Create a metric space of clusterings, and a 2D projection

5 Introduce the local cluster ensemble to summarize any point, including points with no existing clustering

6 Propose a new animated visualization: use the local cluster ensemble to explore the space of clusterings (smoothly morphing from one into others)

meaning revealed through a geography of clusterings
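Step 1 of the strategy above, coding text as numbers, can be illustrated with a plain word-count representation. This is only one of the "several ways" the slide alludes to and is not necessarily the coding used in the talk; the function name `document_term_matrix` and the three example documents are hypothetical.

```python
import re
from collections import Counter


def document_term_matrix(documents):
    """Code each document as a vector of word counts over a shared vocabulary."""
    # Lowercase and split into simple word tokens (an illustrative choice).
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical documents, made up for this sketch.
docs = [
    "Tax cuts help the economy",
    "Veterans deserve better health care",
    "The economy needs tax reform",
]
vocab, X = document_term_matrix(docs)
for row in X:
    print(row)
```

Each row of `X` is one document and each column one vocabulary word, which is the kind of numeric input the clustering methods in step 2 would take.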

Gary King (Harvard, IQSS) 10 / 19

Page 74: Discovery - Harvard University

Many Thousands of Clusterings, Sorted & Organized

You choose one (or more), based on insight, discovery, useful information, . . .

[Figure: “Space of Cluster Solutions”, a 2D map in which each point is one clustering, labeled by the method that produced it (biclust_spectral, clust_convex, mult_dirproc, dismea, rock, som, mixvmf, mixvmfVA, and variants of spec, mspec, affprop, divisive, kmedoids, kmeans, and hclust under different distance measures and linkage rules).]

[Panel “Cluster Solution 1”: “Reagan Republicans” (Bush, HWBush, Reagan) versus “Other Presidents” (Carter, Clinton, Eisenhower, Ford, Johnson, Kennedy, Nixon, Obama, Roosevelt, Truman).]

[Panel “Cluster Solution 2”: “Roosevelt To Carter” (Carter, Eisenhower, Ford, Johnson, Kennedy, Nixon, Roosevelt, Truman) versus “Reagan To Obama” (Bush, Clinton, HWBush, Obama, Reagan).]

Gary King (Harvard, IQSS) 11 / 19

Page 75: Discovery - Harvard University

Application-Independent Distance Metric: Axioms

1 Clusterings with more pairwise document agreements are closer (we prove: pairwise agreements encompass triples, quadruples, etc.)

2 Invariance: Distance is invariant to the number of documents (for any fixed number of clusters)

3 Scale: the maximum distance is set to log(num clusters)

Only one measure satisfies all three (the “variation of information”)
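To make the distance concrete, here is a minimal sketch of the variation of information between two clusterings of the same documents, using the standard information-theoretic definition VI(A, B) = H(A) + H(B) − 2 I(A; B) in nats. It does not apply the rescaling mentioned in axiom 3, and the function name is ours. The two example clusterings are the president groupings as read off the earlier "Cluster Solution 1" and "Cluster Solution 2" panels.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A|B) + H(B|A), computed from cluster labels (in nats)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same documents")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # -sum p_ab * [log(p_ab / p_a) + log(p_ab / p_b)]
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


presidents = ["Bush", "Carter", "Clinton", "Eisenhower", "Ford", "HWBush",
              "Johnson", "Kennedy", "Nixon", "Obama", "Reagan", "Roosevelt",
              "Truman"]
reagan_republicans = {"Bush", "HWBush", "Reagan"}
reagan_to_obama = {"Bush", "Clinton", "HWBush", "Obama", "Reagan"}

# Label each president 1 if in the named cluster, 0 otherwise.
solution_1 = [int(p in reagan_republicans) for p in presidents]
solution_2 = [int(p in reagan_to_obama) for p in presidents]

print(variation_of_information(solution_1, solution_1))  # identical clusterings
print(variation_of_information(solution_1, solution_2))
```

Only the co-occurrence counts of labels matter, so renaming the clusters leaves the distance unchanged, which is what a distance between clusterings (rather than between labelings) needs.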

Gary King (Harvard, IQSS) 12 / 19

Page 80: Discovery - Harvard University

Available March 2009: 304pp

Pb: 978-0-415-99701-0: $24.95

www.routledge.com/politics

“The list of authors in The Future of Political Science is a 'who’s who' of political science. As I was reading it, I came to think of it as a platter of tasty hors d’oeuvres. It hooked me thoroughly.”

—Peter Kingstone, University of Connecticut

“In this one-of-a-kind collection, an eclectic set of contributors offer short but forceful forecasts about the future of the discipline. The resulting assortment is captivating, consistently thought-provoking, often intriguing, and sure to spur discussion and debate.”

—Wendy K. Tam Cho, University of Illinois at Urbana-Champaign

“King, Schlozman, and Nie have created a visionary and stimulating volume. The organization of the essays strikes me as nothing less than brilliant. . . It is truly a joy to read.”

—Lawrence C. Dodd, Manning J. Dauer Eminent Scholar in Political Science, University of Florida

The Future of Political Science: 100 Perspectives

Edited by Gary King, Harvard University, Kay Lehman Schlozman, Boston College, and Norman H. Nie, Stanford University

Gary King (Harvard, IQSS) 13 / 19

Page 81: Discovery - Harvard University

Evaluators Rate Machine Choices Better Than Their Own

Scale: (1) unrelated, (2) loosely related, or (3) closely related

Table reports: mean(scale)

Pairs from            Overall Mean   Evaluator 1   Evaluator 2
Random Selection      1.38           1.16          1.60
Hand-Coded Clusters   1.58           1.48          1.68
Hand-Coding           2.06           1.88          2.24
Machine               2.24           2.08          2.40

p.s. The hand-coders did the evaluation!
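As a quick arithmetic check on the table, the snippet below recomputes the "Overall Mean" column as the simple average of the two evaluators. That reading is our assumption, though it is consistent with every reported row. The variable names are ours.

```python
# (Evaluator 1, Evaluator 2) mean ratings, copied from the table above.
ratings = {
    "Random Selection": (1.16, 1.60),
    "Hand-Coded Clusters": (1.48, 1.68),
    "Hand-Coding": (1.88, 2.24),
    "Machine": (2.08, 2.40),
}

# Overall mean per row, assuming equal weight for the two evaluators.
overall = {name: round(sum(pair) / len(pair), 2) for name, pair in ratings.items()}

# Gap between machine-chosen pairs and the evaluators' own hand coding.
machine_gain = round(overall["Machine"] - overall["Hand-Coding"], 2)

for name, value in overall.items():
    print(f"{name}: {value:.2f}")
print(f"Machine minus Hand-Coding: {machine_gain:.2f}")
```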

Gary King (Harvard, IQSS) 14 / 19

Page 92: Discovery - Harvard University

Cluster Quality Experiments

Scale: mean(within clusters) − mean(between clusters)

[Figure: (Our Method) − (Human Coders) on this scale, horizontal axis from −0.3 to 0.3, one row each for Lautenberg Press Releases, Policy Agendas Project, and Reuters Gold Standard]

Lautenberg: 200 Senate Press Releases (appropriations, economy, education, tax, veterans, . . . )

Policy Agendas: 213 quasi-sentences from Bush’s State of the Union (agriculture, banking & commerce, civil rights/liberties, defense, . . . )

Reuters: financial news (trade, earnings, copper, gold, coffee, . . . ); “gold standard” for supervised learning studies
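The scale used in these experiments, mean(within clusters) − mean(between clusters), can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the data layout, and the toy ratings are all hypothetical, and the ratings are assumed to be on the 1 to 3 relatedness scale used by the evaluators.

```python
from statistics import mean

def cluster_quality(ratings, labels):
    """mean(rating of same-cluster pairs) - mean(rating of different-cluster pairs).

    ratings: dict mapping a document pair (i, j) to its relatedness rating.
    labels:  list where labels[i] is the cluster assigned to document i.
    """
    within = [r for (i, j), r in ratings.items() if labels[i] == labels[j]]
    between = [r for (i, j), r in ratings.items() if labels[i] != labels[j]]
    if not within or not between:
        raise ValueError("need at least one within pair and one between pair")
    return mean(within) - mean(between)

# Made-up ratings for all pairs of 4 documents (not data from the talk).
ratings = {(0, 1): 3, (0, 2): 1, (0, 3): 1, (1, 2): 2, (1, 3): 1, (2, 3): 3}

clustering_a = [0, 0, 1, 1]  # groups documents {0, 1} and {2, 3}
clustering_b = [0, 1, 0, 1]  # groups documents {0, 2} and {1, 3}

quality_a = cluster_quality(ratings, clustering_a)
quality_b = cluster_quality(ratings, clustering_b)

# Analogue of the plotted quantity: one clustering's quality minus another's.
difference = quality_a - quality_b
print(quality_a, quality_b, difference)
```

A positive difference means the first clustering puts closely related pairs together, and unrelated pairs apart, more often than the second.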

Gary King (Harvard, IQSS) 15 / 19

Page 96: Discovery - Harvard University

What do Members of Congress Do?
Substantive example of a finding, using our approach

David Mayhew’s (1974) famous typology

1 Advertising
2 Credit Claiming
3 Position Taking

We find one more: Partisan Taunting

“Senator Lautenberg Blasts Republicans as ‘Chicken Hawks’” [Government Oversight]

“The scopes trial took place in 1925. Sadly, President Bush’s veto today shows that we haven’t progressed much since then.” [Healthcare]

“John Kerry had enough conviction to sign up for the military during wartime, unlike the Vice President, who had a deep conviction to avoid military service” [Government Oversight]

Is this what it means to be a member of a political party?

Gary King (Harvard, IQSS) 16 / 19

Page 106: Discovery - Harvard University

More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

For each: created 2 clusterings from each of 3 methods, including ours

Created info packet on each clustering (for each cluster: exemplar document, automated content summary)

Asked for $\binom{6}{2} = 15$ pairwise comparisons

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 ≻ vMF 1 ≻ vMF 2 ≻ Our Method 2 ≻ K-Means 1 ≻ K-Means 2

“Genetic testing”:

Our Method 1 ≻ {Our Method 2, K-Means 1, K-Means 2} ≻ Dir Proc. 1 ≻ Dir Proc. 2
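The pairwise-comparison design can be illustrated with a short sketch: six clusterings give 15 unordered pairs, and a Condorcet winner is the one preferred in every head-to-head comparison it takes part in. The talk reports only the resulting orderings, not the raw judgments, so the judgments below are an assumption (taken to agree with the reported “Immigration” ordering), and the function name is hypothetical.

```python
from itertools import combinations

def condorcet_winner(candidates, prefers):
    """Return the candidate preferred in all of its pairwise comparisons, else None.

    prefers maps each unordered pair (a frozenset of two candidates)
    to the candidate the evaluator preferred in that comparison.
    """
    for c in candidates:
        others = [o for o in candidates if o != c]
        if all(prefers[frozenset((c, o))] == c for o in others):
            return c
    return None

# Ordering reported on the slide for "Immigration", best first.
immigration_order = [
    "Our Method 1", "vMF 1", "vMF 2", "Our Method 2", "K-Means 1", "K-Means 2",
]

# All unordered pairs of the 6 clusterings.
pairs = list(combinations(immigration_order, 2))

# Assumption: every judgment agrees with the reported ordering, so the
# earlier-listed member of each pair is the preferred one.
prefers = {frozenset(pair): pair[0] for pair in pairs}

winner = condorcet_winner(immigration_order, prefers)
print(len(pairs), winner)
```

With real judgments a Condorcet winner need not exist, because pairwise preferences can cycle; in that case the function returns None.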

Gary King (Harvard, IQSS) 17 / 19

Page 114: Discovery - Harvard University

Intended contributions

An encompassing cluster analytic approach for discovery

A new approach to evaluating results in unsupervised learning

Especially useful for the ongoing spectacular increase in the production and availability of unstructured text

Multiple approaches: Integrated, not separate

Gary King (Harvard, IQSS) 18 / 19

Page 119: Discovery - Harvard University

For more information:

http://GKing.Harvard.edu

Gary King (Harvard, IQSS) 19 / 19