TAGGING AND OTHER MICROTASKS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Paul Brian Heymann
January 2011
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/vb525jb6753
© 2011 by Paul Brian Heymann. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Hector Garcia-Molina, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jurij Leskovec
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Andreas Paepcke
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Over the past decade, the web has become increasingly participatory. Many web sites
would be non-functional without the contribution of many tiny units of work by users
and workers around the world. We call such tiny units of work microtasks. Microtasks
usually represent less than five minutes of someone’s time. However, microtasks
can produce massive effects when pooled together. Examples of microtasks include
tagging a photo with a descriptive keyword, rating a movie, or categorizing a product.
This thesis explores tagging systems, one of the first places where unpaid microtasks became common. Tagging systems allow regular users to annotate objects like URLs, photos, and videos with keywords ("tags"). We begin by looking at social bookmarking systems, tagging systems where users tag URLs. We consider whether
social bookmarking tags are useful for web search, finding that they often mirror other
available metadata. We also show that social bookmarking tags can be predicted to
varying degrees with two techniques: support vector machines and market basket
data mining.
To expand our understanding of tags, we look at social cataloging systems, tagging systems where users tag books. Social cataloging systems allow us to compare user-generated tags and expert library terms that were created in parallel. We find that tags have important features like consistency, quality, and completeness in common with expert library terms. We also find that paid tagging can be an effective supplement to a tagging system.
Finally, our work expands to all microtasks, rather than tagging alone. We propose
a framework called Human Processing for programming with and studying paid and
unpaid microtasks. We then develop a tool called HPROC for programming within
this framework, primarily on top of a paid microtask marketplace called Amazon
Mechanical Turk (AMT). Lastly, we describe Turkalytics, a system for monitoring workers completing paid microtasks on AMT.
We cover tagging from web search, machine learning, and library science perspectives, and work extensively with both the paid and unpaid microtasks that are becoming a fixture of the modern web.
Acknowledgments
This thesis would not exist without my advisor, Hector Garcia-Molina. Hector shares
everything with his advisees, and always has their best interests at heart. He has given
me the freedom and support to pursue varied interests, while insisting on technical
rigor and intellectual clarity along the way. He is a model advisor and great friend.
Aside from my advisor, I am indebted to my reading and orals committees, including Andreas Paepcke, Jure Leskovec, Jennifer Widom, and Ashish Goel. Their
comments and words have improved both this document and my time at Stanford.
I have been lucky to have had fruitful collaborations over the years. In chronological order, Georgia Koutrika, Dan Ramage, and Andreas Paepcke have been my primary co-authors and helped immensely with the tagging work that makes up Chapters 2–5. Georgia is an effective and delightful collaborator. Dan and I always seem
to have the same thoughts at the same time. Andreas’ enthusiasm is boundless.
Several other people have played key roles in chapters of this thesis. Chapter 2
benefited from both Zhichen Xu and Mark Lucovsky. Zhichen informed my understanding of tags and Mark provided infrastructure support in the form of millions of backlink queries. (Chapters 2 and 3 were also supported by an NSF Graduate Research Fellowship and the School of Engineering Finch Family Fellowship.) Chapters 4 and 5 would not have been possible without James Jacobs and Philip Schreur.
Among other things, James and Philip pointed me to the Scriblio MARC records
used in those chapters. Chapter 6 is illustrated by Caitlin Hogan. Chapter 7 was
clarified by discussion with Greg Little about execution models. Chapter 8 would not
exist without the encouragement of Aleksandra Korolova.
This work has benefited from interactions throughout the Gates Computer Science
building. In particular, members of the InfoLab, Artificial Intelligence, and Theory
groups have given me numerous insights into my work over the years. While there
are far too many people to name here, I would like to especially thank the members
of boot camp, the hack circle, wafflers, and various residents of Gates 424.
My academic career before Stanford benefited from a series of excellent mentors
at Duke and Harvard. At Duke, Alexander Hartemink introduced me to research
and served as an amazing research advisor. At Harvard, Barbara Grosz and Stuart
Shieber gave me advice and numerous opportunities within the Colored Trails project.
Jody Heymann, Cynthia LuBien, and Jenny Finkel taught me most of the key
insights for surviving and thriving in the Computer Science Ph.D. program.
Thanks to my wife, sister, parents, and the rest of my family for their continuous
support in all of my endeavors, wherever they take me.
Contents
Abstract
Acknowledgments
1 Introduction
  1.1 Overview: Social Bookmarking (Part I)
  1.2 Overview: Social Cataloging (Part II)
  1.3 Overview: Paid Microtasks (Part III)
  1.4 Research Contributions
2 Social Bookmarking and Web Search
  2.1 Social Bookmarking Terms and Notation
  2.2 Creating a Social Bookmarking Dataset
    2.2.1 Interfaces
    2.2.2 Realtime Processing Pipeline
    2.2.3 Datasets
    2.2.4 Tradeoffs
  2.3 Positive Factors
    2.3.1 URLs
    2.3.2 Tags
  2.4 Negative Factors
    2.4.1 URLs
    2.4.2 Tags
  2.5 Related Work
  2.6 Conclusion
3 Social Tag Prediction
  3.1 Tag Prediction Terms and Notation
  3.2 Creating a Prediction Dataset
  3.3 Two Tag Prediction Methods
    3.3.1 Tag Prediction Using Page Information
    3.3.2 Tag Prediction Using Tags
  3.4 Related Work
  3.5 Conclusion
4 Tagging Human Knowledge
  4.1 Social Cataloging Terms and Notation
    4.1.1 Library Terms
  4.2 Creating a Social Cataloging Dataset
  4.3 Experiments: Consistency
    4.3.1 Synonymy
    4.3.2 Cross-System Annotation Use
    4.3.3 Cross-System Object Annotation
    4.3.4 $-tag Annotation Overlap
  4.4 Experiments: Quality
    4.4.1 Objective, Content-based Groups
    4.4.2 Quality Paid Annotations
    4.4.3 Finding Quality User Tags
  4.5 Experiments: Completeness
    4.5.1 Coverage
    4.5.2 Recall
  4.6 Related Work
  4.7 Conclusion
5 Fallibility of Experts
  5.1 Notes on LCSH
  5.2 Experiments
    5.2.1 Syntactic Equivalence
    5.2.2 Rank Correlation of Syntactic Equivalents
    5.2.3 Expert/User Annotator Agreement
    5.2.4 Semantic Equivalence
  5.3 Conclusion
6 Human Processing
  6.1 Motivating Example
  6.2 Basic Buyer
  6.3 Game Maker
  6.4 Human Processing
  6.5 Discussion
7 Programming with HPROC
  7.1 HPROC Motivation
  7.2 Preliminaries: TurKit
  7.3 HPROC Subsystems
  7.4 HPROC Hprocesses
  7.5 HPROC Walkthrough
    7.5.1 Making a Remote Connection
    7.5.2 Uploading Code
    7.5.3 Introspection
    7.5.4 Hprocess Creation
    7.5.5 Polling
    7.5.6 Executable Environment
    7.5.7 Dispatch Handling
    7.5.8 Remote Function Calling
    7.5.9 Local Hprocess Instantiation
    7.5.10 Form Creation
    7.5.11 Form Parts
    7.5.12 Form Recruiting
  7.6 HPROC Walkthrough Summary
  7.7 Case Study
    7.7.1 Stanford University Shoe Dataset 2010
    7.7.2 Sorting Task
    7.7.3 Comparison Interfaces
  7.8 H-Merge-Sort
    7.8.1 Classical Merge-Sort
    7.8.2 Convenience Functions
    7.8.3 H-Merge-Sort Overview
    7.8.4 H-Merge-Sort Functions
    7.8.5 H-Merge-Sort Walkthrough
  7.9 H-Quick-Sort
    7.9.1 Classical Quick-Sort
    7.9.2 H-Quick-Sort Overview
    7.9.3 H-Quick-Sort Functions
    7.9.4 H-Quick-Sort Walkthrough
  7.10 Human Algorithm Evaluation
  7.11 Case Study Evaluation
    7.11.1 H-Merge-Sort Interfaces
    7.11.2 H-Quick-Sort Median Pivot
    7.11.3 H-Merge-Sort versus H-Quick-Sort
    7.11.4 Complete Data Table
  7.12 Conclusion
8 Worker Monitoring with Turkalytics
  8.1 Worker Monitoring Terms and Notation
    8.1.1 Interaction Model
    8.1.2 Data Model
  8.2 Implementation
    8.2.1 Client-Side JavaScript
    8.2.2 Log Server
    8.2.3 Analysis Server
    8.2.4 Design Choices
  8.3 Requester Usage
    8.3.1 Installation
    8.3.2 Reporting
  8.4 Results: System Aspects
    8.4.1 Client
    8.4.2 Logging Server
    8.4.3 Analysis Server
  8.5 Results: Worker Aspects
  8.6 Results: Activity Aspects
    8.6.1 What States/Actions Occur in Practice?
    8.6.2 When Do Previews Occur?
    8.6.3 Does Activity Help?
  8.7 Related Work
  8.8 Conclusion
9 Conclusion
  9.1 Summary
  9.2 Future Work
Bibliography
List of Tables
2.1 Top tags and their rank as terms in AOL queries.
2.2 This example lists the five hosts in Dataset C with the most URLs annotated with the tag java.
2.3 Average accuracy for different values of τ.
3.1 The top 15 tags account for more than 1/3 of top 100 tags added to URLs after the 100th bookmark. Most are relatively ambiguous and personal. The bottom 15 tags account for very few of the top 100 tags added to URLs after the 100th bookmark. Most are relatively unambiguous and impersonal.
3.2 Association Rules: A selection of the top 30 tag pair association rules. All of the top 30 rules appear to be valid; these rules are representative.
3.3 Association Rules: A random sample of association rules of length ≤ 3 and support > 1000.
3.4 Association Rules: Tradeoffs between number of original sampled bookmarks, minimum confidence, and resulting tag expansions.
3.5 Association Rules: Tradeoffs between number of original sampled bookmarks, minimum confidence, estimated precision, and actual precision.
3.6 Association Rules: Tradeoffs between number of original sampled bookmarks, minimum confidence, recall, and precision.
4.1 Tag types for top 2000 LibraryThing and top 1000 GoodReads tags as percentages.
4.2 Basic statistics for the mean h-score assigned by evaluators to each annotation type. Mean (µ) and standard deviation (SD) are abbreviated.
4.3 Basic statistics for the mean h-score assigned to a particular annotation type with user tags split by frequency. Mean (µ) and standard deviation (SD) are abbreviated.
4.4 Randomly sampled containment and equivalence relationships for illustration.
4.5 Dewey Decimal Classification coverage by tags.
5.1 Sampled (ti, lj) pairs with Wikipedia ESA values.
7.1 The code descriptors table within the MySQL database in the HPROC system, after walkthroughscript.py has been introspected. Some columns have been removed, edu.stanford.thesis has been abbreviated to e.s.t, and default poll seconds has been abbreviated to "Poll (s)."
7.2 The process descriptors table within the MySQL database in the HPROC system, after a new hprocess with the edu.stanford.thesis.sa code descriptor of walkthroughscript.py has been created. Some columns have been removed, and edu.stanford.thesis has been abbreviated to e.s.t. The HPID is the process identifier for the hprocess.
7.3 The row of the variable storage table corresponding to the compareItems function call.
7.4 Comparison of different sorting strategies and interfaces. Sorting dataset is the Stanford University Shoe Dataset 2010. All runs done during the week of November 8th, 2010. Results listed are the mean over ten runs, with standard deviation in parentheses.
8.1 Top ten countries of Turkers (by number of workers). 2,884 workers, 8,216 IPs total.
8.2 The number of user agents, IP addresses, cookies, and views for top workers by page views.
List of Figures
1.1 Two interfaces to the del.icio.us social bookmarking system.
1.2 Two social cataloging systems.
1.3 Screenshot of a Mechanical Turk paid microtask for sorting photos.
2.1 Realtime Processing Pipeline: (1) shows where the post metadata is acquired, (2) and (4) show where the page text and forward link page text is acquired, and (3) shows where the backlink page text is acquired.
2.2 Number of times URLs had been posted and whether they appeared in the recent feed or not. Each increase in height in "Found URLs" is a single URL ("this URL") that was retrieved from a user's bookmarks and was found in the recent feed. Each increase in height in "Missing URLs" is a single URL ("this URL") that was retrieved from a user's bookmarks and was not found in the recent feed. "Combined" shows these two URL groups together.
2.3 Histograms showing the relative distribution of ages of pages in del.icio.us, Yahoo! Search results, and ODP.
2.4 Cumulative portion of del.icio.us posts covered by users.
2.5 How many times has a URL just posted been posted to del.icio.us?
2.6 A scatter plot of tag count versus query count for top tags and queries in del.icio.us and the AOL query dataset. r ≈ 0.18. For the overlap between the top 1000 tags and queries by rank, τ ≈ 0.07.
2.7 Posts per hour and comparison to Philipp Keller.
2.8 Details of Keller's post per hour data.
2.9 Host Classifier: The accuracy for the first 130 tags by rank for a host-based classifier.
3.1 Average new tags versus number of posts.
3.2 Tags in T100 in increasing order of predictability from left to right. "cool" is the least predictable tag; "recipes" is the most predictable tag.
3.3 When the rarity of a tag is controlled in 200/200, entropy is negatively correlated with predictability.
3.4 When the rarity of a tag is controlled in 200/200, occurrence rate is negatively correlated with predictability.
3.5 When the rarity of a tag is not controlled, in Full/Full, additional examples are more important than the vagueness of a tag, and more common tags are more predictable.
4.1 Synonym set frequencies. ("Frequency of Count" is the number of times synonym sets of the given size occur.)
4.2 Tag frequency versus synonym set size.
4.3 H(ti) (Top 2000, ≠ 0)
4.4 Distribution of same book similarities using Jaccard similarity over all tags.
4.5 Distribution of same book similarities using Jaccard similarity over the top twenty tags.
4.6 Distribution of same book similarities using cosine similarity over all tags.
4.7 Overlap Rate Distribution.
4.8 Conditional density plot [39] showing probability of (1) annotators agreeing a tag is objective, content-based, (2) annotators agreeing on another tag type, or (3) no majority of annotators agreeing.
4.9 Recall for 603 tags in the full dataset.
4.10 Recall for 603 tags in the "min100" dataset.
4.11 Jaccard for 603 tags in the full dataset.
5.1 Spinogram [40] [39] showing probability of an LCSH keyword having a corresponding tag based on the frequency of the LCSH keyword. (Log-scale.)
5.2 Symmetric Jaccard Similarity.
5.3 Asymmetric Jaccard Similarity.
5.4 Conditional density plot showing probability of a (ti, lj) pair meaning that (ti, lj) could annotate {none, few, some, many, almost all, all} of the same books according to human annotators based on Wikipedia ESA score of the pair.
5.5 Histogram of top Wikipedia ESA for missing LCSH and all tags.
6.1 Basic Buyer human programming environment. A human program generates forms. These forms are advertised through a marketplace. Workers look at posts advertising the forms, and then complete the forms for compensation.
6.2 Game Maker human programming environment. The programmer writes a human program and a game. The game implements features to make it fun and difficult to cheat. The human program loads and dumps data from the game.
6.3 Human Processing programming environment. HP is a generalization of BB and GM. It provides abstractions so that algorithms can be written, tasks can be defined, and marketplaces can be swapped out. It provides separation of concerns so that the programmer can focus on the current need, while the environment designer focuses on recruiting workers and designing tasks.
7.1 Graphical overview of the full HPROC system.
7.2 Shoes from the Stanford University Shoe Dataset 2010 blurred to varying degrees.
7.3 Two different human comparison interfaces.
7.4 Comparison of total cost of three variations of sorting.
7.5 Comparison of wall clock time for three variations of sorting.
7.6 Comparison of accuracy for three variations of sorting.
8.1 Search-Preview-Accept (SPA) model.
8.2 Search-Continue-RapidAccept-Accept-Preview (SCRAP) model.
8.3 Turkalytics data model (Entity/Relationship diagram).
8.4 Number of transitions between different states in our dataset. Note: These numbers are approximate and unload states are unlabeled.
8.5 Number of new previewers visiting three task groups over time.
8.6 Plot of average active and total seconds for each worker who completed the NER task.
8.7 Two activity signatures showing different profiles for completing a task. Key: a=activity, i=inactivity, d=DOM load, s=submit, b=beforeunload, u=unload.
Chapter 1
Introduction
Over the past two decades, the web has experienced explosive growth. There are now
over a billion people connected to the Internet and over a trillion web pages. This
rapid growth has led to huge challenges as well as huge opportunities. For instance,
how can we organize over a trillion web pages? How can we utilize the collective output of billions of Internet-connected users? This thesis takes steps towards answering
these important questions.
In particular, we investigate microtasks, which are tiny units of work performed
by humans usually lasting less than five minutes. Examples of microtasks include
tagging a photo with a descriptive keyword, rating a movie, or categorizing a product.
Microtasks hit a sort of sweet spot for the web. On one hand, microtasks are so short
that users and workers are often willing to perform them for cheap or free. On the
other hand, the sum of many microtasks can be a significant source of labor.
Chapters 2–5 focus on a specific type of microtask called tagging, while Chapters
6–8 focus on tools to program microtasks more generally. In a tagging system, regular
users annotate objects (they tag objects) with uncontrolled keywords (tags) of their
choosing. By contrast, library systems (i.e., libraries) only allow expert taxonomists
(rather than regular users) to annotate objects, and those objects may usually only
be annotated with terms from a controlled vocabulary that has been determined
beforehand.
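The contrast between the two annotation models can be sketched as a minimal data model. This is an illustration only; the class and method names below are hypothetical and not drawn from any system studied in this thesis.

```python
# Illustrative sketch only: names are hypothetical, not from the thesis.

class TaggingSystem:
    """Any user may attach any free-form keyword (tag) to any object."""

    def __init__(self):
        self.annotations = {}  # object id -> set of (user, tag) pairs

    def tag(self, user, obj, tag):
        # No vocabulary check and no user check: tags are uncontrolled.
        self.annotations.setdefault(obj, set()).add((user, tag))


class LibrarySystem:
    """Only expert annotators may apply terms from a fixed vocabulary."""

    def __init__(self, controlled_vocabulary, experts):
        self.vocabulary = set(controlled_vocabulary)
        self.experts = set(experts)
        self.annotations = {}  # object id -> set of terms

    def annotate(self, annotator, obj, term):
        if annotator not in self.experts:
            raise PermissionError("only expert taxonomists may annotate")
        if term not in self.vocabulary:
            raise ValueError("term not in the controlled vocabulary")
        self.annotations.setdefault(obj, set()).add(term)
```

The structural difference is just the two checks in annotate; the scaling and trust tradeoffs discussed below follow from that difference.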
Once enough users annotate objects with tags, patterns tend to emerge, even
(a) Tag Cloud (b) Query by Tag
Figure 1.1: Two interfaces to the del.icio.us social bookmarking system.
though the tags are from an uncontrolled vocabulary. In particular, some tags become
more or less popular, and some tags become commonly used to annotate different
types of objects. Users of a tagging system then browse the system using interfaces
designed to take advantage of these patterns. Two common interfaces for tagging
systems are shown in Figure 1.1. Figure 1.1(a) shows a tag cloud, an interface which
shows popular tags by increasing the size of the font for a tag based on its frequency.
Tag clouds give users an idea of what the most prevalent tags are within a tagging
system. Figure 1.1(b) shows a query by tag interface, which displays objects which
have been annotated with a particular tag. Figure 1.1(b) shows objects annotated
with the tag “thesis,” including a URL with the title “Useful Things to Know About
Ph.D. Thesis Research.”
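The font scaling behind a tag cloud is commonly implemented as an interpolation on log tag frequency. The sketch below illustrates that common approach; the exact formula del.icio.us used is not specified here, and the tag counts are invented.

```python
import math

def tag_cloud_sizes(tag_counts, min_px=10, max_px=32):
    """Map each tag's frequency to a font size in pixels.

    Log-scaling keeps a handful of very popular tags from dwarfing the
    rest, which matters given the heavy-tailed frequency distributions
    typical of tagging systems.
    """
    lo = math.log(min(tag_counts.values()))
    hi = math.log(max(tag_counts.values()))
    span = (hi - lo) or 1.0  # all counts equal -> avoid division by zero
    return {
        tag: round(min_px + (math.log(n) - lo) / span * (max_px - min_px))
        for tag, n in tag_counts.items()
    }

# Counts are invented for illustration.
sizes = tag_cloud_sizes({"thesis": 12, "linux": 480, "cool": 95})
```

Here the most frequent tag renders at the maximum size and the least frequent at the minimum, with the rest interpolated between them on a log scale.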
The rest of this chapter gives a high level overview of the three major parts of
this thesis.
Part I Social bookmarking systems (Section 1.1).
Part II Contrasting tagging and library systems (Section 1.2).
Part III Programming (paid) microtasks (Section 1.3).
Lastly, we summarize our research contributions (Section 1.4). (Note that we do not
include related work in this chapter, instead including it in each individual chapter.)
1.1 Overview: Social Bookmarking (Part I)
Part I begins our study of tagging by studying social bookmarking systems. Social
bookmarking systems are tagging systems where the specific type of object being
annotated is a URL. Social bookmarking systems were one of the first places that
tagging became popular. It makes sense to start our study of tagging and microtasks
with social bookmarking systems for two reasons.
The first reason for studying social bookmarking systems is that the challenges
faced by social bookmarking systems had a major impact on the evolution of tagging
systems. Specifically, systems like Yahoo! Directory, the Open Directory Project
(ODP), and del.icio.us all try to organize and classify URLs on the web. However,
Yahoo! Directory and ODP take a substantially different approach, using trusted
taxonomists and taxonomies to determine the organization of URLs, rather than
regular users. The Yahoo! Directory and ODP approach seems to have significant
scaling problems because expert, trusted labor is scarce and expensive. By contrast,
del.icio.us represents an alternative model for human labeling of vast numbers of
URLs with descriptive metadata. This alternative model can help solve the scaling
problem, but presents other challenges in terms of using non-expert, untrusted data
from regular users.
The second reason for studying social bookmarking systems is that such systems
are among the largest and most mature tagging systems today. Over the course of
nearly a decade, these systems have grown to the point where their users now tag
hundreds of thousands of URLs each day. The size and maturity of a tagging system
matters for our study because of the dependence of tagging systems on the uncontrolled tags contributed by regular users. At an early, smaller phase, this dependence
can mean that tagging systems are dominated by a few prolific users or by spam. By
studying social bookmarking, we see how late phase tagging systems work in the large,
rather than falling prey to the peculiarities of a given early stage tagging system.
Our study of social bookmarking in Part I is made up of Chapters 2 and 3. Both
chapters look at the del.icio.us social bookmarking system, specifically its relationship
to the web. Both chapters also rely on the size and maturity of del.icio.us to make
claims about social bookmarking as a whole.
Chapter 2 looks at a very specific, but very important, potential application of
social bookmarking systems: web search. Web search engines like Google depend
on page content, link structure, and query or clickthrough log data to provide users
with relevant retrieved results. All of these types of data are somewhat indirect
descriptions of web pages. By contrast, Chapter 2 asks whether the direct, human
annotated tags in social bookmarking systems can help in the task of web search. To
evaluate “helpfulness,” we consider various features of the URLs and tags posted to
del.icio.us, and we ask whether each is likely to provide additional information above
and beyond the data already available to web search engines.
Chapter 3 asks a more general question about social bookmarking systems: can
frequent tags in these systems be predicted? We attempt to predict tags based on
both data specific to social bookmarking systems (e.g., page text, anchor text) and
data general to all tagging systems (e.g., predicting tags based on other tags). For
example, can we predict the tag “linux” based on Linux related terms in the page
text of a URL? Can we predict the tag “linux” based on other tags, such as the tag
“debian” which refers to a specific Linux distribution?
Predictability of tags is both good and bad. If tags can be predicted successfully,
then tagging systems can be enhanced in various ways. For example, when a tagging
system is just getting started, a system owner might provide automatic tags produced
by a machine to make the system more useful at first. Even for a late phase tagging
system, tag prediction may help increase recall for interfaces like query by tag, because
often different users use different tags to mean the same thing. For example, in our
“debian” and “linux” example above, a modified query by tag might return a URL
labeled with only “debian” when a user queries for the tag “linux.” On the other
hand, if tags are predictable, users may not be adding any information to the system
when they tag objects.
Figure 1.2: Two social cataloging systems: (a) LibraryThing, (b) Goodreads.
1.2 Overview: Social Cataloging (Part II)
While social bookmarking systems are one of the most important applications of
tagging, they may not be the best place to study tagging itself. For instance, Chapter
3 asks whether tags can be predicted, but are the tags themselves any good? The
trouble is that the notion of “good” is quite subjective, depends on the objects being
annotated, and often takes a subject matter expert years to develop. Ideally, we
would compare against the ground truth produced by experts, but social bookmarking
systems do not really have a good source of ground truth. In fact, as we saw in the
last section, a major reason for the development of social bookmarking systems was
that it was so difficult for experts to annotate the web in a scalable way. Luckily,
there is a different type of tagging system called a social cataloging system where we
can evaluate tagging using ground truth from experts.
Part II expands our study of tagging to include social cataloging systems. So-
cial cataloging systems are tagging systems where the specific type of object being
annotated is a book. Figures 1.2(a) and 1.2(b) show web pages for the book “The
Indispensable Calvin and Hobbes” at the two social cataloging sites for which we have
data, LibraryThing and Goodreads. Social cataloging systems are a perfect place to
contrast tags to ground truth from experts. What makes social cataloging systems
perfect is that books are simultaneously annotated both with tags from regular users
and with library terms assigned by experts.
Libraries organize books into massive hierarchies called classifications, like the
Dewey Decimal Classification (DDC) and the Library of Congress Classification (LCC).
Do tags correlate with nodes at different levels of these hierarchies? For example, one
of the top level nodes in the LCC hierarchy is the node “medicine.” Do users fre-
quently annotate books with a “medicine” tag? Similarly, libraries have controlled
vocabularies consisting of predefined terms, like the Library of Congress Subject Head-
ings (LCSH). How are tags used in comparison to these controlled vocabularies?
Chapter 4 evaluates the tags in social cataloging systems by assuming that library
annotations like LCSH, LCC, and DDC are a gold standard. In particular, we argue
that library terms are consistent, complete, and uniformly high quality. (We define
the terms consistent, complete and high quality more explicitly in the chapter.) To
what degree are tags similar to these consistent, complete, and high quality library
terms?
Tags in LibraryThing and Goodreads were, in effect, donated by users of those sites.
One interesting aspect of our study of social cataloging is that we also develop the idea
of paid tagging. Paid tagging is a paid microtask where we pay workers to provide
tags for objects, rather than relying on the benevolence of users. Overall, this allows
us to compare three types of annotations: expert library terms, unpaid regular user
tags, and paid worker tags. In addition to informing the use of tagging systems, this
is one of the few places where one can compare unpaid microtasks (tags by users),
paid microtasks (paid tags), and classical “work” (annotations by experts).
Chapter 5 drops the assumption that data created by experts should be a gold
standard. Instead, we compare tags to controlled vocabularies created by experts,
but do not assume that either is a priori correct. Do experts and regular users tend
to use the same terms? Do experts and regular users tend to apply the same terms
to the same objects?
Figure 1.3: Screenshot of a Mechanical Turk paid microtask for sorting photos.
1.3 Overview: Paid Microtasks (Part III)
With the exception of the paid tags discussed in the previous section, tagging is usu-
ally an unpaid microtask. The big advantage of unpaid microtasks is that they are
free. Unfortunately, the free nature of unpaid microtasks is also their big disadvan-
tage. Users contribute labor at their whim, and unpaid microtasks must be made
fun, easy, in the users’ self interest, or all of the above.
Developing a system—whether a tagging system or otherwise—which is fun, easy,
and useful for potential users can take a long time. What’s more, the system then
needs to be advertised and promoted to develop a user base. Even assuming users
decide to use a system based on unpaid microtasks, neither the owners of a system,
nor we as researchers, have much real control over what users produce. As a result,
research on unpaid microtasks tends to focus on post hoc analysis of data after the mi-
crotasks have been completed. By contrast, recent systems like Amazon’s Mechanical
Turk are allowing researchers (and system developers) to get microtasks accomplished
in a much more directed way—so long as the microtasks are paid.
Mechanical Turk is a marketplace made up of requesters and workers. The re-
questers provide a task (usually through an HTML IFRAME displaying an external
website) and set a price. The workers accept or decline the task. Finally, requesters
pay the set price if the work done was acceptable. Figure 1.3 shows the interface that
a Mechanical Turk worker sees while deciding whether to accept or decline one of our
tasks. At the top, one can see the reward offered—one US cent. At the bottom, one
can see an IFRAME displaying the task. In this case, the task is a web form asking the
worker to choose which of two photos is less blurry.
Microtasks on Mechanical Turk are commonly things like annotating data, tagging
photos, and judging search results. Marketplaces allow requesters to dictate exactly
what tasks they want done, and how they want the work done. Requesters have
greater control, but with that greater control comes a host of other problems. How
does one combine more than one type of paid microtask into a single program? How
should one pay for good work and avoid paying for bad work when machines cannot
evaluate the quality of the work itself?
Part III expands our focus to microtasks in general. We build out a complete
framework for building systems on top of the Mechanical Turk and similar market-
places. Our goal is to simplify and formalize the process of building such systems.
Chapter 6 develops a conceptual model for such systems. Chapter 7 describes our
implementation of that model, called the HPROC system. Chapter 8 describes a tool
for worker monitoring, in order to better understand how workers are completing
tasks and using the marketplace.
Chapter 6 proposes our conceptual model for writing programs where use of micro-
tasks is common (“human programming”), called the Human Processing model. Hu-
man Processing provides for separation of concerns (separating pricing from program
operation, for example), reduces redundant code (by enabling libraries of functional-
ity based on microtasks), and generally makes human programming easier. Human
processing also aims to make the experimental analysis of algorithms using humans
more controlled, a topic which is returned to in Chapter 7.
Chapter 7 describes an implementation of the Human Processing model, called
the HPROC system. HPROC is a large and comprehensive system. HPROC aims to
make it easy to build complex workflows involving multiple types of tasks. HPROC
also aims to make more natural the interaction between processes performing compu-
tation and web processes that interact directly with workers. Lastly, HPROC aims to
separate out recruiting functionality, wherein specialized programs ensure that paid
microtasks are advertised and priced correctly on the marketplace.
Chapter 7 also gives a brief case study showing how to use HPROC for analyzing
sorting algorithms. We demonstrate analogues of classical Merge-Sort and Quick-Sort.
These sorting algorithms allow us to demonstrate the importance of interfaces
to workers in the design of algorithms meant to interact with humans. For example,
should we implement sorting with a binary interface where a worker chooses which
item is less, or should we implement sorting with a ranking interface where a worker
orders multiple items at a time? Our sorting case study also allows us to demonstrate
how we believe such algorithms should be evaluated.
Chapter 8 describes an analytics system for worker monitoring. Unlike with unpaid
microtasks, when dealing with paid microtasks it is quite important to detect bad
workers, and to detect them early. Our system, called Turkalytics, is a realtime system
which monitors a wide variety of actions by workers as they complete microtasks.
These actions include clicks, pressing of keys, form submissions, and others. Turka-
lytics can be seen as part of the Human Processing framework, though it is also useful
as a standalone tool.
1.4 Research Contributions
In summary, the high level research contributions in this thesis are:
• A characterization of social bookmarking, especially as it relates to web search
(Chapter 2).
• Methods for predicting tags, and evaluation of those methods (Chapter 3).
• A comparison of tagging to established methods of organization in library sci-
ence (Chapters 4 and 5).
• A full framework and programming system for paid microtasks, including a
model (Chapter 6), system (Chapter 7), and monitoring tool (Chapter 8).
Chapter 2
Social Bookmarking and Web
Search
For most of the history of the web, search engines have only had access to three
major types of data describing pages. These types are page content, link structure,
and query or clickthrough log data. Today a fourth type of data is becoming available:
user generated content (e.g., tags, bookmarks) describing the pages directly. Unlike
the three previous types of data, this new source of information is neither well studied
nor well understood. Our aim in this chapter is to quantify the size of this data source,
characterize what information it contains, and to determine the potential impact it
may have on improving web search.
This chapter also begins our study of microtasks by looking at tagging, more
specifically, social bookmarking systems. In particular, this chapter represents a de-
tailed analysis of the potential impact of social bookmarking on arguably the web’s
most important application: web search. Our analysis centers around a series of
experiments conducted on the social bookmarking site del.icio.us.1 However, we be-
lieve that many of the insights apply more generally, both to social systems centered
around URLs (e.g., Twitter) and to other tagging systems with textual objects (e.g.,
tagging systems for books and academic papers).
1 In the course of this work, del.icio.us changed its name from “del.icio.us” to “Delicious.” For clarity, we refer to it as del.icio.us throughout.
11
In Section 2.1 we introduce the terminology for our experiments on del.icio.us
and tagging systems more generally. Section 2.2 explains the complex process of
creating one of the biggest social bookmarking datasets ever studied, as well as the
methodological concerns that motivated it. The core of this chapter, Sections 2.3
and 2.4, gives two sets of results. Section 2.3 contains results that suggest that social
bookmarking will be useful for web search, while Section 2.4 contains those results that
suggest it will not. Both sections are divided into “URL” and “tag” subsections which
focus on the two major types of data that social bookmarking provides. In Section 2.5
we point to related work in web search and social bookmarking. Finally, in Section
2.6 we conclude with our thoughts on the overall picture of social bookmarking, its
ability to augment web search, and how our study generalizes to tagging in general.
(This chapter draws on material from Heymann et al. [36] which is primarily the work
of the thesis author.)
2.1 Social Bookmarking Terms and Notation
A social tagging system consists of users u ∈ U , tags t ∈ T , and objects o ∈ O. We
call an annotation of a set of tags to an object by a user a post. A post is made up
of one or more (ti, uj , ok) triples. A label is a (ti, ok) pair that signifies that at least
one triple containing tag i and object k exists in the system.
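To make these definitions concrete, labels can be derived from triples by projecting away the user. A minimal sketch in Python (the sample posts and helper names are illustrative, not part of any tagging system's interface):

```python
def posts_to_triples(posts):
    """Expand posts [(user, object, [tags])] into (tag, user, object) triples."""
    return [(t, u, o) for (u, o, tags) in posts for t in tags]

def labels(triples):
    """A label (t, o) exists if at least one triple (t, u, o) exists."""
    return {(t, o) for (t, u, o) in triples}

# Two users post the same URL; the tag "linux" overlaps.
posts = [
    ("alice", "http://debian.org", ["linux", "debian"]),
    ("bob",   "http://debian.org", ["linux"]),
]
triples = posts_to_triples(posts)
# Three triples, but only two distinct labels, since a label ignores the user.
```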
Social bookmarking systems are social tagging systems where the objects are
URLs. Each post signifies that a user has bookmarked a particular URL, and may
also include some information like a user comment.
In this chapter, we use term to describe a unit of text, whether it is a tag or part
of a query. Terms are usually words, but are also sometimes acronyms, numbers, or
other tokens.
We use host to mean the full host part of a URL, and domain to mean the
“effective” institutional level part of the host. For instance, in http://i.stanford.
edu/index.html, we call i.stanford.edu the host, and stanford.edu the domain.
Likewise, in http://www.cl.cam.ac.uk/, we call www.cl.cam.ac.uk the host, and
cam.ac.uk the domain. We use the effective top level domain (TLD) list from the
Mozilla Foundation to determine the effective “domain” of a particular host.2
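The host-to-domain mapping amounts to a longest-suffix match against the effective TLD list. The sketch below hardcodes a tiny subset of suffixes for illustration only; a real implementation would load the full Mozilla list (and handle its wildcard and exception rules):

```python
from urllib.parse import urlparse

# Tiny illustrative subset of the Mozilla effective-TLD list.
PUBLIC_SUFFIXES = {"com", "edu", "org", "uk", "ac.uk"}

def host_and_domain(url):
    """Return (host, effective domain) for a URL, per the text's definitions."""
    host = urlparse(url).hostname
    parts = host.split(".")
    # Find the longest matching public suffix, then keep one extra label.
    for i in range(len(parts)):
        if ".".join(parts[i:]) in PUBLIC_SUFFIXES:
            return host, ".".join(parts[max(i - 1, 0):])
    return host, host

# The examples from the text:
# i.stanford.edu has domain stanford.edu; www.cl.cam.ac.uk has domain cam.ac.uk.
```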
2.2 Creating a Social Bookmarking Dataset
The companies that control social sites often run a number of internal analyses,
but are usually reluctant to release specific results. This can be for competitive
reasons, or perhaps simply to ensure the privacy of their users. As a result, we worked
independently and through public interfaces to gather the social bookmarking data
for this chapter and the next. Doing so presented a number of challenges.
2.2.1 Interfaces
del.icio.us offers a variety of interfaces to interested parties, but each of these has
its own caveats and potential problems. For instance, the “recent” feed provides the
most recent bookmarks posted to del.icio.us in real time. However, while we found
that the majority of public posts by users were present in the feed, some posts were
missing (due to filtering, see Section 2.2.4). Interfaces also exist which show all posts
of a given URL, all posts by a given user, and the most recent posts with a given
tag. We believe that at least the posts-by-a-given-user interface is unfiltered, because
users often share this interface with other users to give them an idea of their current
bookmarks.
These interfaces allow for two different strategies in gathering datasets from
del.icio.us. One can monitor the recent feed. The advantage of this is that the recent
feed is in real time. This strategy also does not provide a mechanism for gathering
older posts. Alternatively, one can crawl del.icio.us, treating it as a tripartite graph.
One starts with some set of seeds—tags, URLs, or users. At each tag, all URLs tagged
with that tag and all users who had used the tag are added to the queue. At each
URL, all tags which had been annotated to the URL (e.g., all labels) and all users
who had posted the URL are added to the queue. At each user, all URLs posted or
tags used by the user are added to the queue. The advantage of this strategy is that
2 Available at http://publicsuffix.org/.
Figure 2.1: Realtime Processing Pipeline: (1) shows where the post metadata is acquired, (2) and (4) show where the page text and forward link page text are acquired, and (3) shows where the backlink page text is acquired.
it provides a relatively unfiltered view of the data. However, the disadvantage is that
doing a partial crawl of a small world graph like del.icio.us can lead to data which
is highly biased towards popular tags, users, and URLs. Luckily, these two methods
complement each other. Monitoring is biased against popular pages, while crawling
tends to be biased toward these pages (we further explore the sources of these biases
in Section 2.2.4). As a result, we created datasets based on both strategies.
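The crawl strategy can be sketched as a breadth-first traversal of the tripartite graph. The toy graph below stands in for the live per-tag, per-URL, and per-user interfaces:

```python
from collections import deque

def crawl(seeds, neighbors, max_nodes=10000):
    """Breadth-first crawl over the tag/URL/user tripartite graph.

    `seeds` is a list of (kind, id) nodes; `neighbors` maps a node to the
    adjacent nodes its interface exposes (a tag page lists URLs and users,
    a URL page lists tags and users, a user page lists URLs and tags).
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(seen) < max_nodes:
        node = queue.popleft()
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Toy graph standing in for the live interfaces:
graph = {
    ("tag", "web"): [("url", "u1"), ("user", "alice")],
    ("url", "u1"): [("tag", "web"), ("user", "bob")],
    ("user", "alice"): [("url", "u1")],
    ("user", "bob"): [],
}
nodes = crawl([("tag", "web")], lambda n: graph.get(n, []))
```

Because high-degree nodes (popular tags and URLs) are reached first, stopping such a crawl early produces exactly the bias toward popular items discussed above.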
2.2.2 Realtime Processing Pipeline
For certain analyses (see Result 10), we need to have not just the URL being book-
marked, but also the content of the page, as well as the forward links from the page.
We also wanted to have the backlinks from those pages, and the pagetext content of
those backlinks. We wanted to have this page text data as soon as possible after a
URL was posted.
As a result, for a one month period we set up a real time processing pipeline
(shown in Figure 2.1). Every 20 to 40 seconds, we polled del.icio.us to see the most
recently added posts. For each post, we added the URL of the post to two queues, a
pre-page-crawl queue and a pre-backlink queue.
Every two hours, we ran an 80 minute Heritrix web crawl seeded with the pages in
the pre-page-crawl queue.3 We crawled the seeds themselves, plus pages linked from
those seeds up until the 80 minute time limit elapsed.4
Meanwhile, we had a set of processes which periodically checked the pre-backlink
queue. These processes got URLs from the queue and then ran between one and
three link: queries against one of Google’s internal APIs. This resulted in 0-60
backlink URLs which we then added to a pre-backlink-crawl queue. Finally, once
every two hours, we ran a 30 minute Heritrix crawl which crawled only the pages in
the pre-backlink-crawl queue. In terms of scale, our pipeline produced around 2GB of
(compressed) data per hour in terms of crawled pages and crawled backlinks.
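The control flow of the pipeline amounts to a polling producer feeding two queues that are drained on fixed schedules. A schematic sketch (the URLs are placeholders; the actual polling, Heritrix crawls, and link: queries ran as separate processes):

```python
import queue

pre_page_crawl = queue.Queue()
pre_backlink = queue.Queue()

def on_poll(urls):
    """Called every 20 to 40 seconds with URLs from newly seen posts."""
    for url in urls:
        pre_page_crawl.put(url)  # seeds the bi-hourly 80-minute Heritrix crawl
        pre_backlink.put(url)    # input to the link: query processes

def drain(q):
    """Take everything currently queued, as the periodic crawl jobs do."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items

on_poll(["http://example.com/a", "http://example.com/b"])
page_seeds = drain(pre_page_crawl)       # every two hours: seed Heritrix
backlink_inputs = drain(pre_backlink)    # run link: queries, feed a third queue
```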
2.2.3 Datasets
Over the course of nine months starting in September 2006 and ending in July 2007,
we collected three datasets from del.icio.us:
Dataset C(rawl) This dataset consists of a large scale crawl of del.icio.us in Septem-
ber 2006. The crawl was breadth first from the tag “web”, with the crawling
performed as described above. This dataset consists of 22,588,354 posts and
1,371,941 unique URLs.
Dataset R(ecent) This dataset consists of approximately 8 months of data begin-
ning September 28th, 2006. The data was gathered from the del.icio.us recent
feed. This dataset consists of 11,613,913 posts and 3,004,998 unique URLs.
Dataset M(onth) This dataset consists of one contiguous month of data starting
May 25th 2007. This data was gathered from the del.icio.us recent feed. For
each URL posted to the recent feed, Dataset M also contains a crawl of that
URL within 2 hours of its posting, pages linked from that URL, and inlinks to
the URL. This page content was acquired in the manner described in Section
3 Heritrix software available at http://crawler.archive.org/.
4 The reason for running 80 minutes every two hours is that we used a single machine for crawling. The single machine would spend 80 minutes crawling forward links, 30 minutes crawling backlinks, and we left two five-minute buffers between the crawls, leading to 120 minutes.
2.2.2. Unlike Dataset R, the gathering process was enhanced so that changes
in the feed were detected more quickly. As a result, we believe that Dataset M
has within 1% of all of the posts that were present in the recent feed during the
month long period. This dataset consists of 3,630,250 posts, 2,549,282 unique
URLs, 301,499 active unique usernames, and about 2 TB of crawled data.
We are unaware of any analysis of del.icio.us of a similar scale either in terms of
duration, size, or depth.
We also use the AOL query dataset [58] for certain analyses (Results 1, 3, and 6).
The AOL query dataset consists of about 20 million search queries corresponding to
about 650,000 users. We use this dataset to represent the distribution of queries a
search engine might receive.
2.2.4 Tradeoffs
As we will see, del.icio.us data is large and grows rapidly. The web pages del.icio.us
refers to are also changing and evolving. Thus, any “snapshot” will be imprecise in
one way or another. For instance, a URL in del.icio.us may refer to a deleted page,
or a forward link may point to a deleted page. Some postings, users, or tags may be
missing due to filtering or the crawl process. Lastly, the data may be biased, e.g.,
unpopular URLs or popular tags may be over-represented.
Datasets C, R, and M each have bias due to the ways in which they were gathered.
Dataset C appears to be heavily biased towards popular tags, popular users, and
popular URLs due to its crawling methodology. Dataset R may be missing data due
to incomplete gathering of data from the recent feed. Datasets R and M are both
missing data due to filtering of the recent feed. In this chapter, we analyze Dataset
M because we believe it is the most complete and unbiased. We use Datasets C and
R to supplement Dataset M for certain analyses.
It was important for the analyses that follow not just to know that the recent feed
(and thus Datasets R and M) was filtered, but also to have a rough idea of exactly
how it was filtered. We analyzed over 2,000 randomly sampled users, and came to two
conclusions. First, on average, about 20% of public posts fail to appear in the recent
(a) Found URLs (b) Combined (c) Missing URLs
Figure 2.2: Number of times URLs had been posted and whether they appeared in the recent feed or not. (Each panel is a histogram; the x-axis is the number of posts of a given URL in the system, and the y-axis is the number of posts.) Each increase in height in “Found URLs” is a single URL (“this URL”) that was retrieved from a user’s bookmarks and was found in the recent feed. Each increase in height in “Missing URLs” is a single URL (“this URL”) that was retrieved from a user’s bookmarks and was not found in the recent feed. “Combined” shows these two URL groups together.
feed (as opposed to the posts-by-user interface, for example). Second, popular URLs,
URLs from popular domains (e.g., youtube.com), posts using automated methods
(e.g., programmatic APIs), and spam will often not appear in the recent feed. Figure
2.2 shows this second conclusion for popular URLs. It shows three histograms of
URL popularity for URLs which appeared in the recent feed (“found”), those that
did not (“missing”), and the combination of the two (i.e., the “real” distribution,
“combined”). Missing posts on the whole refer to noticeably more popular URLs, but
the effect of their absence seems minimal. In other words, the “combined” distribution
is not substantially different from the “found” distribution.
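The per-user filtering estimate amounts to comparing each sampled user's public posts (from the posts-by-user interface) against the subset observed in the recent feed. A sketch with hypothetical data:

```python
def missing_fraction(user_posts, feed_posts):
    """Fraction of a user's public posts absent from the recent feed."""
    missing = [p for p in user_posts if p not in feed_posts]
    return len(missing) / len(user_posts)

# Hypothetical sampled user: 10 public posts, of which 8 appeared in the feed.
user_posts = ["http://example.com/%d" % i for i in range(10)]
feed_posts = set(user_posts[:8])
# missing_fraction(user_posts, feed_posts) gives 0.2, i.e., 20% filtered.
```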
2.3 Positive Factors
Bookmarks are useful in two major ways. First, they can allow an individual to
remember URLs visited. For example, if a user tags a page with their mother’s name,
this tag might be useful to them, but is unlikely to be useful to others. Second, tags
can be made by the community to guide users to valuable content. For example, the
tag “katrina” might be valuable before search engine indices update with Hurricane
Katrina web sites. Non-obvious tags like “analgesic” on a page about painkillers
Figure 2.3: Histograms showing the relative distribution of ages of pages in del.icio.us, Yahoo! Search results, and ODP.
might also help users who know content by different names locate content of interest.
In this chapter, our focus is on the second use. Will bookmarks and tags really
be useful in the ways described above? How often do we find “non-obvious” tags? Is
del.icio.us really more up-to-date than a search engine? What coverage does del.icio.us
have of the web? Sections 2.3 and 2.4 try to answer questions like these. At the
beginning of each result in these sections, we highlight the main result in “capsule
form” and we summarize the high level conclusion we think can be reached. In this
section, we provide positive factors which suggest that social bookmarking might help
with various aspects of web search.
2.3.1 URLs
Summary
Result 1: Pages posted to del.icio.us are often recently modified.
Conclusion: del.icio.us users post interesting pages that are actively updated or
have been recently created.
Details
Determining the approximate age of a web page is fraught with challenges. Many
pages corresponding to on disk documents will return the HTTP/1.1 Last-Modified
header accurately. However, many dynamic web sites will return a Last-Modified
date which is the current time (or another similar time for caching purposes), and
about 2/3 of pages in Dataset M do not return the header at all! Fortunately, search
engines need to solve this problem for crawl ordering. They likely use a variety of
heuristics to determine if page content has changed significantly. As a result, the
Yahoo! Search API gives a ModificationDate for all result URLs which it returns.
While the specifics are unknown, ModificationDate appears to be a combination of
the Last-Modified HTTP/1.1 header, the time at which a particular page was last
crawled and its page content. We used this API to test the recency of five groups of
pages:
del.icio.us Pages sampled from the del.icio.us recent feed as they were posted.
Yahoo! 1, 10, and 100 The top 1, 10, and 100 results (respectively) of Ya-
hoo! searches for queries sampled from the AOL query dataset.
ODP Pages sampled from the Open Directory Project (dmoz.org).
Rather than compare the age of del.icio.us pages to random pages from the web (which
would neither be possible nor meaningful), we chose the four comparison groups to
represent groups of pages a user might encounter. The Yahoo! 1, 10, and 100 groups
represent pages a user might encounter as a result of searches. ODP represents
pages a user might encounter using an Internet directory, and is also probably more
representative of the web more broadly. For each URL in each set, we recorded the
time since the page was last modified. In order to avoid bias by time, we ran equal
proportions of queries for each set at similar times.
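For reference, the header-based baseline that motivated using the search API instead can be sketched as follows; as noted above, it fails outright for roughly two thirds of pages, which return no Last-Modified header at all:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def age_days(last_modified_header, now=None):
    """Approximate page age in days from an HTTP Last-Modified header.

    Returns None when the header is absent; as noted in the text, many
    dynamic sites instead return the current time, which this cannot detect.
    """
    if not last_modified_header:
        return None
    modified = parsedate_to_datetime(last_modified_header)
    now = now or datetime.now(timezone.utc)
    return (now - modified).total_seconds() / 86400

hdr = "Wed, 27 Jun 2007 09:00:00 GMT"
ref = datetime(2007, 7, 4, 9, 0, 0, tzinfo=timezone.utc)
# age_days(hdr, ref) is 7.0 days.
```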
Figure 2.3 shows the results. Each bar represents the number of pages in the group
with the given (x-axis) age. We found that pages from del.icio.us were usually more
recently modified than ODP, which tends to have older pages. We also found that
there is a correlation between a search result being ranked higher and a result having
been modified more recently. However, most interestingly, we found that the top 10
results from Yahoo! Search were about the same age as the pages found bookmarked
in del.icio.us. This could be interpreted in one of two ways: (i) del.icio.us is getting
recent, topical bookmarks which Yahoo! Search is trying to emulate, or (ii) del.icio.us
is getting bookmarks which are a result of searches, and thus have the same recency
as the top 10.
Summary
Result 2: Approximately 25% of URLs posted by users are new, unindexed pages.
Conclusion: del.icio.us can serve as a (small) data source for new web pages and to
help crawl ordering.
Details
We next looked at what proportion of pages were “new” in the sense that they were
not yet indexed by a search engine at the time they were posted to del.icio.us. We
sampled pages from the del.icio.us recent feed as they were posted, and then ran
Yahoo! searches for those pages immediately after. Of those pages, about 42.5% were
not found. This could be for a variety of reasons—the pages could be indexed under
another canonicalized URL, they could be spam, they could be an odd MIME-type
(an image, for instance) or the page could have not been found yet. Anecdotally, all
four of these causes appear to be fairly common in the set of sampled missing URLs.
As a result, we next followed up by continuously searching for the missing pages over
the course of the following five months. When a missing page appears in a later
result, we argue that the most likely reason is that the page was not indexed but was
later crawled. This methodology seems to eliminate the possibility that spam and
canonicalization issues are the reason for missing URLs, but does not eliminate the
possibility, for instance, that multiple datacenters give out different results.
We found that of the 5,724 URLs which we sampled and were missing from the
week beginning June 22, 3,427 were later found and 1,750 were found within four
weeks. This implies that roughly 60% of the missing URLs were in fact new URLs,
or roughly 25% of del.icio.us posts (i.e., 42.5% × 60%). This works out to roughly
30,000 new pages per day.
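The arithmetic behind this estimate, restated directly from the numbers above:

```python
missing_rate = 0.425      # fraction of sampled posts not found at posting time
sampled_missing = 5724    # missing URLs sampled in the week beginning June 22
later_found = 3427        # of those, found over the following five months

new_fraction_of_missing = later_found / sampled_missing         # roughly 0.60
new_fraction_of_posts = missing_rate * new_fraction_of_missing  # roughly 0.25
```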
Social bookmarking seems to be a good source of new and active pages. As a
source of new pages, social bookmarking may help a search engine discover pages
it might not otherwise. For instance, Dasgupta et al. [22] suggest that 25% of new
pages are not discoverable using historical information about old pages. As a source
of both new and active pages, social bookmarking may also help more generally with
the “crawl ordering” problem—should we update old pages, or try to discover new
pages? To the extent to which social bookmarks represent “interesting” changes to
pages, they should be weighted in crawl ordering schemes.
Summary
Result 3: Roughly 9% of results for search queries are URLs present in del.icio.us.
Conclusion: del.icio.us URLs are disproportionately common in search results com-
pared to their coverage.
Details
Similarly to the recently modified pages discussion above, we used queries chosen by
sampling from the AOL query dataset to check the coverage of results by del.icio.us.
Specifically, we randomly sampled queries from the query dataset, ran them on Ya-
hoo! Search, and then cross-referenced them with the millions of unique URLs present
in Datasets C, M, and R. When we randomly sample, we sample over query events
rather than unique query strings. This means that the query “american idol” which
occurs roughly 15,000 times, is about five times more likely to be picked than
“powerball” which occurs roughly 3,000 times.
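Sampling over query events rather than unique query strings weights each query by its frequency. A minimal sketch, using a hypothetical query log with the two example frequencies from the text:

```python
import random

# Hypothetical query log: one entry per *query event*, so repeated queries
# appear multiple times. Uniform sampling over events therefore picks a
# query string with probability proportional to its frequency.
query_events = ["american idol"] * 15000 + ["powerball"] * 3000

random.seed(0)  # for reproducibility of this sketch
sample = [random.choice(query_events) for _ in range(10000)]

ratio = sample.count("american idol") / sample.count("powerball")
print(ratio)  # ≈ 5, matching the 15,000 : 3,000 frequency ratio
```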
We found that despite the fact that del.icio.us covers a relatively small portion
of the web (see discussion below in Result 9), it covers a disproportionately high
proportion of search results. For the top 100 results of the queries, del.icio.us covers
9% of results returned for a set of over 30,000 queries. For the top 10 results, this
coverage is about double: 19% of results returned are in del.icio.us. This set of queries
is weighted towards more popular queries, which can explain part of this effect. By
comparison, we might expect 1/1000 of URLs in query results to be in del.icio.us if they
were selected at random from the web (again, see Result 9). This suggests that to
whatever extent del.icio.us gives us additional metadata about web pages, it may lead
to result reordering for queries.
Summary
Result 4: While some users are more prolific than others, the top 10% of users only
account for 56% of posts.
Conclusion: del.icio.us is not highly reliant on a relatively small group of users (e.g.,
< 30,000 users).
Details
Figure 2.4 shows the extent to which the most prolific users are responsible for large
numbers of posts. While there are some URLs, domains, users, and tags that cover
many posts or triples, the distributions do not seem so condensed as to be problematic.
For instance, on social news sites, it is commonly cited that the majority of front page
posts come from a dedicated group of less than 100 users. However, the majority of
posts in Dataset M instead come from tens of thousands of users. Nonetheless, the
distribution is still power law shaped and there is a core group of relatively active
users and a long tail of relatively inactive users.
Figure 2.4: Cumulative Portion of del.icio.us Posts Covered by Users
Figure 2.5: How many times has a URL just posted been posted to del.icio.us?
Summary
Result 5: 30-40% of URLs and approximately one in eight domains posted were not
previously in del.icio.us.
Conclusion: del.icio.us has relatively little redundancy in page information.
Details
The recent feed states for each post how many times the URL in that post is already
in del.icio.us. Figure 2.5 shows the distribution of this value. A new post in Dataset
M is of a new URL not yet in the system about 40% of the time. This proportion
might be 30% of total posts to del.icio.us if we adjust for filtering. In Dataset M, a
majority of the URLs posted were only posted once during the time period.
Another way to look at new URLs being added to del.icio.us is in terms of how of-
ten a completely new domain is added (as opposed to just another URL at an existing
domain). Unfortunately, we do not know the exact set of domains in del.icio.us. How-
ever, we can provide an upper-bound by comparing against the domains in Datasets
C and R. We found that about 12% of posts in Dataset M were URLs whose domains
were not in either Dataset C or R. This suggests that about one eighth of the time,
a new URL is not just a new page to be crawled, but may also suggest an entire new
domain to crawl.
This result coupled with Result 4 may impact the potential actions one might use
to fight tag spam. Because of the relatively high number of new pages, it may be
more difficult to determine the quality of labels placed on those new pages. Furthermore,
due to the relatively low number of label redundancies, it may be difficult
to determine the trustworthiness of a user based on coincident labels with other users
(as in, e.g., [47]). For instance, 85% of the labels in Dataset M are non-redundant.
As a result, it may become increasingly important to use interface-based methods
to keep attackers out rather than analyzing the data that they add to the system.
However, on the other hand, the low level of redundancy does mean that users are
relatively efficient in labeling the parts of the web that they label.
Figure 2.6: A scatter plot of tag count versus query count for top tags and queries in del.icio.us and the AOL query dataset. r ≈ 0.18. For the overlap between the top 1000 tags and queries by rank, τ ≈ 0.07.
2.3.2 Tags
Summary
Result 6: Popular query terms and tags overlap significantly (though tags and query
terms are not correlated).
Conclusion: del.icio.us may be able to help with queries where tags overlap with
query terms.
Details
One important question is whether the metadata attached to bookmarks is actually
relevant to web searches. That is, if popular query terms often appear as tags, then
we would expect the tags to help guide users to relevant pages. SocialSimRank [14]
suggests an easy way to make use of this information. We opted to look at tag–query
overlap between the tags in Dataset M and the query terms in the AOL query dataset.
Tag (Rank)          # Queries (Rank)   Tag (Rank)          # Queries (Rank)
design (#1)         10318 (#545)       tutorial (#16)      779 (#7098)
blog (#2)           3367 (#1924)       news (#17)          63916 (#40)
imported (#3)       215 (#18292)       blogs (#18)         1478 (#4205)
music (#4)          63250 (#41)        howto (#19)         152 (#23341)
software (#5)       10823 (#506)       shopping (#20)      5394 (#1222)
reference (#6)      1312 (#4655)       travel (#21)        20703 (#227)
art (#7)            29558 (#130)       free (#22)          184569 (#9)
programming (#8)    478 (#10272)       css (#23)           456 (#10624)
tools (#9)          6811 (#921)        education (#24)     15546 (#335)
web2.0 (#10)        0 (None)           business (#25)      21970 (#212)
web (#11)           24992 (#184)       flash (#26)         5170 (#1274)
video (#12)         29833 (#127)       games (#27)         59480 (#49)
webdesign (#13)     11 (#155992)       mac (#28)           3440 (#1873)
linux (#14)         178 (#20937)       google (#29)        191670 (#8)
photography (#15)   4711 (#1384)       books (#30)         16643 (#296)
Table 2.1: Top tags and their rank as terms in AOL queries.
For this analysis, we did not attempt to remove “stop tags”—tags like “imported” that
were automatically added by the system or otherwise not very meaningful. Figure
2.6 shows the number of times a tag occurs in Dataset M versus the number of times
it occurs in the AOL query dataset. Table 2.1 shows the corresponding query term
rank for the top 30 del.icio.us tags in Dataset M. Both show that while there was
a reasonable degree of overlap between query terms and tags, there was no positive
correlation between popular tags and popular query terms.
One likely reason the two are uncorrelated is that search queries are primarily
navigational, while tags tend to be used primarily for browsing or categorizing. For
instance, 21.9% of the AOL query dataset is made up of queries that look like URLs
or domains, e.g., www.google.com or http://i.stanford.edu/ and variations. To
compute the overlap between tags and queries (but not for Figure 2.6), we first
removed these URL or domain-like queries from consideration. We also removed
certain stopword like tags, including “and”, “for”, “the”, and “2.0” and all tags with
less than three characters. We found that at least one of the top 100, 500, and 1000
tags occurred in 8.6%, 25.3% and 36.8% of these non-domain, non-URL queries.
In some sense, overlap both overstates and understates the potential coverage.
On one hand, tags may correlate with but not be identical to particular query terms.
However, on the other, certain tags may overlap with the least salient parts of a query.
We also believe that because AOL and del.icio.us represent substantially different
communities, the query terms are a priori less likely to match tags than if we had a
collection of queries written by del.icio.us users.
Summary
Result 7: In our study, most tags were deemed relevant and objective by users.
Conclusion: Tags are on the whole accurate.
Details
One concern is that tags at social bookmarking sites may be of “low quality.” For
example, perhaps users attach nonsensical tags (e.g., “fi32”) or very subjective tags
(e.g., “cool”). To get a sense of tag quality, we conducted a small user study. We had
a group of ten people, a mix of graduate students and individuals associated with our
department, manually evaluate posts to determine their quality. We sampled one post
out of every five hundred, and then gave blocks of posts to different individuals to
label. Most of the individuals labeled about 100 to 150 posts. For each tag, we asked
whether the tag was “relevant,” “applies to the whole domain,” and/or “subjective.”
For each post, we asked whether the URL was “spam,” “unavailable,” and a few other
questions. We set the bar relatively low for “relevance”: whether a random person
would agree that it was reasonable to say that the tag describes the page. Roughly
7% of tags were deemed “irrelevant” according to this definition. Also, remarkably
few tags were deemed “subjective”: less than one in twenty for all users. Lastly,
there was almost no “spam” in the dataset, either due to low amounts of spam on
del.icio.us, or due to the filtering described in Section 2.2.
[Figure: “Bookmarks Posted in a Given Hour” — number of bookmarks posted per hour (0 to 14,400, equivalently 0 to 4 posts/s) from May 25 through June 24.]
Figure 2.7: Posts per hour and comparison to Philipp Keller.
2.4 Negative Factors
In this section, we present negative factors which suggest that social bookmarking
might not help with various aspects of web search.
2.4.1 URLs
Summary
Result 8: Approximately 120,000 URLs were posted to del.icio.us each day.
Conclusion: The number of posts per day is relatively small; for instance, it represents
about 1/10 of the number of blog posts per day.
Details
Figure 2.7 shows the posts per hour for every hour in Dataset M. The dashed lines
show (where available) the independently sampled data collected by Philipp Keller.5
Keller’s data comes from sampling the recent feed every 10 minutes and extrapolating
based on the difference in age between the youngest and oldest bookmark in the fixed
size feed. Dataset M comes from attempting to capture every post in the recent feed.
The two datasets seem to be mutually reinforcing—our data only differs from Keller’s
5Available at http://deli.ckoma.net/stats.
[Figure: estimated number of posts per day over time, in three panels: (a) August 2005—August 2006, (b) November 2006—July 2007, (c) August 2005—July 2007; y-axes range from 0 to 150,000 posts.]
Figure 2.8: Details of Keller’s post per hour data.
slightly, and this usually occurs at points where the feed “crashed.” At these points,
near June 3rd and June 15th respectively in Figure 2.7, the feed stopped temporarily,
and then restarted, replaying past bookmarks until it caught up to the present.
There are an average of 120,087 posts per day in Dataset M. However, more
relevant for extrapolation is the number of posts in a given week. On average,
92,690 posts occurred per day of each weekend, and 133,133 posts occurred each
weekday. Thus, del.icio.us produced about 851,045 posts per week during our period
of study, or a little more than 44 million posts per year. For comparison, David Sifry
[65] suggests that there were on the order of 1.5 million blog posts per day during
the same time period. This means that for every bookmark posted to del.icio.us, ten
blog entries were posted to blogs on the web.
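The weekly and yearly rates above follow directly from the weekend and weekday averages:

```python
# Reproducing the weekly and yearly post-rate arithmetic from the text.
weekend_posts_per_day = 92_690
weekday_posts_per_day = 133_133

posts_per_week = 2 * weekend_posts_per_day + 5 * weekday_posts_per_day
print(posts_per_week)       # 851045, as reported

posts_per_year = posts_per_week * 52
print(posts_per_year)       # 44254340, "a little more than 44 million"
```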
More important than the rate at which posts were being generated is the rate
at which posts per day accelerate. However, this rate of acceleration is harder to
determine. For instance, Dataset M shows a 50% jump in posts per hour on the
evening of May 30th, when del.icio.us announced a partnership with Adobe. However,
we believe that this may have simply been bouncing back from a previous slump.
Keller’s data, shown in Figure 2.8, seems to tell multiple stories. From August 2005
until August 2006 (including December 2005, when del.icio.us was bought), del.icio.us
seems to have been accelerating at a steady rate. However, from November 2006 to
June 2007, the rate of acceleration seems to be flat. Our Dataset R, while not covering
the same length of time, does not lead us to reject Keller’s data. As a result, we believe
that the history of social bookmarking on del.icio.us seems to be a series of increases
in posting rate followed by relative stability. To the extent to which this is the case,
we believe that future rates of increase in posts per day are highly dependent on
external factors and are thus not easily predictable.
Summary
Result 9: There were roughly 115 million public posts, coinciding with about 30-50
million unique URLs at the time of our study.
Conclusion: The number of total posts is relatively small; for instance, this is a
small portion (perhaps 1/1000) of the web as a whole.
Details
Relatively little is known about the size of social bookmarking sites, and in particular
del.icio.us. In September 2006, del.icio.us announced that they had reached 1 million
users, and in March 2007, they announced they had reached 2 million. The last
official statement on the number of unique posts and URLs was in May of 2004, when
del.icio.us’ creator, Joshua Schachter, stated that there were about 400,000 posts and
200,000 URLs.
One way to estimate the size of del.icio.us is to extrapolate from some set of URLs
or tags. For instance, if the URL http://www.cnn.com/ was posted u_m times in a
one-month period, there were t_m posts total during that month, and the URL had
been posted to the system a total of u_s times, we might estimate the total size t_s of
del.icio.us as t_s = (u_s × t_m) / u_m (assuming u_m / t_m = u_s / t_s). However, we
found that this led to poor estimates—often in the billions of posts.
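The ratio-based extrapolation can be sketched in a few lines. The input counts here are hypothetical, purely to show the formula’s shape:

```python
# Ratio-based size extrapolation: assumes the URL's share of this month's
# posts equals its share of all-time posts (u_m / t_m == u_s / t_s).
u_m = 500          # times http://www.cnn.com/ was posted this month (hypothetical)
t_m = 3_600_000    # total posts system-wide this month (hypothetical)
u_s = 2_000        # all-time posts of that URL (hypothetical)

t_s = u_s * t_m / u_m   # estimated all-time total posts in the system
print(t_s)              # 14400000.0 for these hypothetical inputs
```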
Instead, we assume that the rate of posting of URLs to del.icio.us has been mono-
tonically increasing (given a sufficient time window) since its creation. We then divide
the historical record of del.icio.us into three time periods. The first, t1, is the period
before Schachter’s announcement on May 24th. The second, t2, is between May 24th
and the start of Keller’s data gathering. The third, t3, is from the start of Keller’s
data gathering to the present.
We assume that t1 is equal to 400,000 posts. We estimate that t2 is equal to
the time period (about p1 = 420 days) times the maximum number of posts per day
in the one month period after Keller’s data starts (db = 44,536) times a filtering
factor (f = 1.25) to compensate for the filtering which we observed during our data
gathering. We estimate that t3 is equal to the posts observed by Keller (ok), plus the
posts in the gaps in Keller’s data gathering (gk). ok is nk = 58,194,463 posts, which
we multiply by the filtering factor (f = 1.25). We estimate gk as the number of days
missing (mk = 104) times the highest number of posts for a given day observed by
Keller (dk = 161,937) times the filtering factor (f = 1.25).
Putting this all together, we estimate that the number of posts in del.icio.us as of
late June 2007 was:
t1 + t2 + t3
= (400,000) + (p1 × db × f) + (nk × f + mk × dk × f)
= (400,000) + (420 × 44,536 × 1.25) + (58,194,463 × 1.25 + 104 × 161,937 × 1.25)
≈ 117 million posts
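Plugging in the values defined above reproduces the estimate:

```python
# The three-period estimate of total del.icio.us posts, using the values
# defined in the text (p1, db, f, nk, mk, dk).
t1 = 400_000                # posts before the May 2004 announcement
p1, db = 420, 44_536        # days in t2; max posts/day early in Keller's data
f = 1.25                    # filtering correction factor
nk = 58_194_463             # posts observed by Keller
mk, dk = 104, 161_937       # days missing from Keller's data; max posts/day

t2 = p1 * db * f
t3 = nk * f + mk * dk * f
total = t1 + t2 + t3
print(total)                # 117576288.75, i.e. about 117 million posts
```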
This estimate is likely an over-estimate because we chose upper-bound values for db
and dk. Depending on the real values of {db, dk, f}, one could reasonably estimate
the number of posts anywhere between about 60 and 150 million posts. It should be
noted that this does not, however, include private (rather than public) posts, which
we do not have any easy way to estimate. Finally, we estimate that between about
20 and 50 percent of posts are unique URLs (see discussion in Result 4 and Figure
2.5). This leads us to an estimate of about 12 to 75 million unique URLs.
The indexes of the major search engines are now commonly believed to be in
the billions to hundreds of billions of pages. For instance, Eiron et al. [26] state
in 2004 that after crawling for some period of time, their crawler had explored 1
billion pages and had 4.75 billion pages remaining to be explored. Of course, as
dynamic content has proliferated on the web, such estimates become increasingly
subjective. Nonetheless, the number of unique URLs in del.icio.us is relatively small
as a proportion of the web as a whole.
2.4.2 Tags
Summary
Result 10: Tags are present in the pagetext of 50% of the pages they annotate and
in the titles of 16% of the pages they annotate.
Conclusion: A substantial proportion of tags are obvious in context, and many
tagged pages would be discovered by a search engine.
Details
For a random sampling of over 20,000 posts in Dataset M, we checked whether tags
were in the text of the pages they annotate or related pages. To get plain text from
pages, we used John Cowan’s TagSoup Java package to convert from HTML.6 To
get tokens from plain text, we used the Stanford NLP Group’s implementation of
the Penn Treebank Tokenizer.7 We also checked whether pages were likely to be in
English or not, using Marco Olivo’s lc4j Language Categorization package.8 Finally,
we lowercased all tags and all tokens before doing comparisons.
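A minimal sketch of the occurrence check, with a naive whitespace tokenizer standing in for the Penn Treebank tokenizer used in the study (function name and inputs are illustrative):

```python
# Check whether a tag appears in a page's text or title, after lowercasing
# both tags and tokens as described in the text. The study used the Penn
# Treebank tokenizer; str.split() is a simplification here.
def tag_occurrence(tag, title, page_text):
    tag = tag.lower()
    title_tokens = set(title.lower().split())
    text_tokens = set(page_text.lower().split()) | title_tokens
    return {"in_text": tag in text_tokens, "in_title": tag in title_tokens}

print(tag_occurrence("Java", "Java Tutorials", "learn the java language"))
# {'in_text': True, 'in_title': True}
```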
We found that 50% of the time, if a tag annotates a page, then it is present in the
page text. Furthermore, 16% of the time, the tag is not just anywhere in the page
text, but it is present in the title. We also, looked at the page text of pages that
link to the URL in question (backlinks) and pages that are linked from the URL in
question (forward links). 20% of the time, a tag annotating a particular page will
appear in three places: the page it annotates, at least one of its backlinks, and at
least one of its forward links. 80% of the time, the tag will appear in at least one
of these places: the page, backlinks or forward links. Anecdotally, the tags in the
6 TagSoup is available at http://ccil.org/~cowan/XML/tagsoup/.
7 The PTB Tokenizer is available at http://nlp.stanford.edu/javanlp/—we used the version from the Stanford NER.
8 lc4j is available at http://www.olivo.net/software/lc4j/ and implements algorithms from [15].
% of Tag   Tag % of Host   Host
5.0%       87.7%           java.sun.com
3.2%       81.5%           onjava.com
3.1%       82.0%           javaworld.com
1.6%       67.9%           theserverside.com
1.3%       88.7%           today.java.net

Table 2.2: This example lists the five hosts in Dataset C with the most URLs annotated with the tag java.
missing 20% appear to be “lower quality.” They tend to be mistakes of various kinds
(misspellings or mistypes of tags) or confusing tagging schemes (like “food/dining”).
Overall, this seems to suggest that a search engine, which is already looking at page
text and particularly at titles (and sometimes at linked text), is unlikely to gain much
from tag information in a significant number of cases.
Summary
Result 11: Domains are often highly correlated with particular tags and vice versa.
Conclusion: It may be more efficient to train librarians to label domains than to
ask users to tag pages.
Details
One way in which tags may be predicted is by host. Hosts tend to be created to focus
on certain topics, and certain topics tend to gravitate to a few top sites focusing on
them. For instance, Table 2.2 shows the proportion of the URLs in Dataset C labeled
“java” which are on particular hosts (first column). It also shows the proportion of
the URLs at those hosts which have been labeled “java” (second column). This table
shows that 14 percent of the URLs that are annotated with the tag java come from
five large topical Java sites where the majority of URLs are in turn tagged with java.
Unfortunately, due to the filtering discussed in Section 2.2.4 we could not use
Dataset M for our analysis. Instead, we use Dataset C, with the caveat that based
on our discussions in Section 2.2.4 and Result 9, Dataset C represents about 25% of
Figure 2.9: Host Classifier: The accuracy for the first 130 tags by rank for a host-based classifier: (a) on positive examples; (b) on negative examples.
           Avg Accuracy (+)   Avg Accuracy (-)
τ = 0.33   19.647             99.670
τ = 0.5    7.372              99.943
τ = 0.66   4.704              99.984

Table 2.3: Average accuracy for different values of τ.
the posts in del.icio.us, biased towards more popular URLs, users, and tags. As a
result, one should not assume that the conclusions from this section apply to all of
del.icio.us as opposed to the more concentrated section of Dataset C.
We denote the number of URLs tagged with a tag ti at a given host dj as
tagged(ti, dj), and the total number of URLs at that host in the tagging corpus
as total(dj). We can construct a binary classifier for determining if a particular URL
ok having host dj should be annotated with tag ti with the simple rule:

    classify(ti, dj) = t    if tagged(ti, dj) / total(dj) > τ
                       ¬t   if tagged(ti, dj) / total(dj) ≤ τ
where τ is some threshold. We define the positive accuracy to be the rate at which
our classifier labels positive examples correctly as positives, and negative accuracy
to be the rate at which our classifier labels negative examples correctly as
negatives. Further, we define the macro-averaged positive and negative accuracies,
given in Table 2.3, as the mean of the positive and negative accuracies—with each
tag weighted equally—for the top 130 tags, respectively.
This classifier allows us to predict (simply based on the domain) between about
five and twenty percent of the tag annotations in Dataset C, with between a few
false positives per 1,000 and a few per 10,000. We also show the accuracies on
positive and negative examples in Figure 2.9. All experiments use leave-one-out cross
validation. Our user study (described in Result 7) also supported this conclusion.
About 20% of the tags which were sampled were deemed by our users to “apply
to the whole domain.” Because our user study and our experiments above were
based on differently biased datasets, Datasets C and M, they seem to be mutually
reinforcing in their conclusions. Both experiments suggest that a human librarian
capable of labeling a host with a tag on a host-wide basis (for instance, “java” for
java.sun.com) might be able to make substantial numbers of user contributed labels
redundant.
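The threshold rule above is easy to state as code. The counts below are hypothetical (chosen to resemble java.sun.com’s 87.7% row in Table 2.2); in the study, tagged(ti, dj) and total(dj) would come from Dataset C:

```python
# Host-based threshold classifier: predict tag t for any URL at host d iff
# the fraction of the host's URLs already tagged t exceeds the threshold tau.
def classify(tagged_t_d, total_d, tau):
    return (tagged_t_d / total_d) > tau

# Hypothetical host where 87.7% of 1,000 URLs are tagged "java".
print(classify(tagged_t_d=877, total_d=1000, tau=0.5))   # True
print(classify(tagged_t_d=877, total_d=1000, tau=0.9))   # False
```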
2.5 Related Work
Since the beginning of the web, people have used page content to aid in navigation
and searching. However, almost as early—Eiron and McCurley [25] suggest as early
as 1994—users were suggesting the use of anchortext and link structure to improve
web search. Craswell et al. [21] also give some early justification for use of anchortext
to augment web search.
Meanwhile, there has also been a current of users attempting to annotate their own
pages with metadata. This began with the <meta> tag which allowed for keywords on
a web page to aid search engines. However, due to search engine spam, this practice
has lost favor. The most recent instance of this idea is Google Co-op,9 where Google
encourages site owners to label their sites with “topics.” Co-op allows Google to
refine search results based on this additional information. However, unlike social
bookmarking, these metadata approaches require site owners to know all of the labels
a user might attach to their site. This leads to the well studied “vocabulary problem”
(see [28], [17]), whereby users have many different types of terminology for the same
resources. Ultimately, unlike previous metadata, social bookmarking systems have
the potential to overcome the vocabulary problem by presenting many terms for the
same content created by many disparate users.
Golder and Huberman [31] were two of the earliest researchers to look at the dy-
namics of tagging in del.icio.us. While a number of papers have looked at del.icio.us,
only a few have looked at its relationship to web search. Both Bao et al. [14] and
Yanbe et al. [72] propose methods to modify web search to include tagging data.
However, neither looked at whether del.icio.us (or any other social bookmarking site)
9See http://www.google.com/coop/.
was producing data of a sufficient quantity, quality or variety to support their meth-
ods. Both also use relatively small datasets—Bao et al. use 1, 736, 268 web pages and
269, 566 annotations, while Yanbe et al. use several thousand unique URLs. Also,
both of these papers are primarily interested in the popularity and tags of the URLs
studied, rather than other possible uses of the data.
The ultimate test of whether social bookmarking can aid web search would be to
implement systems like those of Bao et al. or Yanbe et al. and see if they improve
search results at a major search engine.
2.6 Conclusion
The eleven results presented in Sections 2.3 and 2.4 paint a mixed picture for web
search. We found that social bookmarking as a data source for search has URLs
that are often actively updated and prominent in search results. We also found that
tags were overwhelmingly relevant and objective. However, del.icio.us produces small
amounts of data on the scale of the web. Furthermore, the tags which annotate URLs,
while relevant, are often functionally determined by context. Nearly one in six tags
are present in the title of the page they annotate, and one in two tags are present in
the page text. Aside from page content, many tags are determined by the domain of
the URL that they annotate, as is the case with the tag “java” for “java.sun.com.”
These results suggest that URLs produced by social bookmarking are unlikely to be
numerous enough to impact the crawl ordering of a major search engine, and the tags
produced are unlikely to be much more useful than a full text search emphasizing
page titles.
This chapter represented our first large study of a tagging system. While the
results were mixed for web search, many of the insights are quite general. For example,
our user study in Result 7 foreshadows later, similar results about objective, relevant
tags in Section 4.4.1. Our analysis of the dangers and potential responses to spam
also led to a variety of later tag spam work (e.g., [35] and [48]). Overall, we hope
this chapter has given a taste for the type of data, and challenges, of a real tagging
system at scale.
Chapter 3
Social Tag Prediction
In Chapter 2, we conducted a broad analysis of the social bookmarking system
del.icio.us. In particular, we focused on properties which we believe are important
to web search. This chapter drills down and focuses on one property, predictability.
In particular, we focus on a problem which we call the social tag prediction problem,
asking, “given a set of objects, and a set of tags applied to those objects by users,
can we predict whether a given tag could/should be applied to a particular object?”
In this chapter, we look at how effective different types of data are at predicting tags
in a tagging system.
Solving the social tag prediction problem has two benefits. At a fundamental
level, we gain insights into the “information content” of tags: that is, if tags are easy
to predict from other content, they add little value. At a practical level, we can use a
tag predictor to enhance a social tagging site. These enhancements can take a variety
of forms:
Increase Recall of Single Tag Queries/Feeds Many, if not most, queries in tag-
ging systems are for objects labeled with a particular tag. Similarly, many
tagging systems allow users to monitor a feed of items tagged with a particular
tag. For example, a user of a social bookmarking site might set up a feed of
all “photography” related web pages. Tag prediction could serve as a recall
enhancing device for such queries and feeds. In Section 3.3.2, we set up such a
recall enhancing tag prediction task.
Inter-User Agreement Many users have similar interests, but different vocabular-
ies. Tag prediction would ease sharing of objects despite vocabulary differences.
Tag Disambiguation Many tags are polysemous, that is, they have different mean-
ings. For example, “apple” might mean the fruit, or the computer company.
Predicting additional tags (like “macos” or “computer”) might aid in disam-
biguating what a user meant when annotating an object. Past work by Aurn-
hammer et al. [13] looks at similar issues in photo tagging.
Bootstrapping Sen et al. [64] find that the way users use tags is determined by
previous experience with tags in the system. For example, in systems with low
tag usage, fewer users will apply tags. If tag usage in the system is mostly
personal tags, users tend to apply more personal tags. Using tag prediction,
a system designer could pre-seed a system with appropriate tags to encourage
quality contributions from users.
System Suggestion Some tagging systems provide tag suggestions when a user is
annotating an object (see for example, Xu et al. [71]). Predicted tags might
be reasonable to suggest to users in such a system. However, unlike the other
applications in this list, it might be more informative for the system to suggest
tags that it is unsure of to see if the user selects them.
We examine whether tags are predictable based on the page text, anchor text, and
surrounding domains of pages they annotate. We find that there is a high variance in
the predictability of tags, and we look at metrics associated with predictability. One
such metric, a novel entropy measure, captures a notion of generality that we think
might be helpful for other tasks in tagging systems. Next, we look at how to predict
tags based on other tags annotating a URL. We find that we can expand a small set
of tags with high confidence. We conclude with a summary of our findings and their
broader implications for tagging systems and web search. (This chapter draws on
material from Heymann et al. [38] which is primarily the work of the thesis author.)
3.1 Tag Prediction Terms and Notation
We use the same terms and notation from Section 2.1, with some additions. We
imagine that every object o has a vast set of tags that do not describe it, a smaller set
of tags which do describe it, and an even smaller set of tags which users have actually
chosen to input into the system as applicable to the object. We say that the first
set of tags negatively describes the object, the second set of tags positively describes
the object, and the last set of tags currently annotates the object. We model each of
these three relationships as relations or tables:
Rp: A set of (t, o) pairs; each pair means that tag t positively describes object o.
Rn: A set of (t, o) pairs; each pair means that tag t negatively describes object o.
Ra: A set of (t, u, o) triples; each triple means that user u annotated object o with
tag t.
In practice, the system owner only has access to Ra.
We manipulate the relations Rp, Rn, and Ra using two standard relational algebra
operators with set semantics. Selection, or σc selects tuples from a relation where a
particular condition c holds. Projection, or πp projects a relation into a smaller
number of attributes. σc is equivalent to the WHERE c clause in SQL whereas πp
is equivalent to the SELECT p clause in SQL. σc can be read as “select all tuples
satisfying c.” πp can be read as “show only the attributes in p from each tuple.”
Suppose a tagging system had only two objects, a web page obagels about a down-
town bagel shop and a web page opizza about a pizzeria next door. We might have:
Rp = {(tbagels, obagels), (tshop, obagels), (tdowntown, obagels),
(tpizza, opizza), (tpizzeria, opizza)}
Rn = {(tpizzeria, obagels), (tpizza, obagels), (tbagels, opizza), . . .}
If we want to know all of the tags which positively describe obagels, we would write
πt(σobagels(Rp)) and the result would be (tbagels, tshop, tdowntown). If we want all (t, o)
pairs which do not describe opizza, we would write π(t,o)(σopizza(Rn)). Suppose also
that a user usally has annotated the pizzeria web page with the tag tpizzeria:
Ra = {(tpizzeria, usally, opizza)}
If we want to know all users who have tagged opizza, we would write πu(σopizza(Ra))
and the result would be (usally).
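The two operators can be sketched directly over Python sets of tuples; the relation contents below mirror the bagels/pizza example, and the helper names `select` and `project` are ours, not the thesis's:

```python
# sigma (selection) and pi (projection) over tag relations modeled as
# Python sets of tuples, mirroring the bagels/pizza example.

def select(relation, predicate):
    """sigma_c: keep only the tuples satisfying condition c."""
    return {row for row in relation if predicate(row)}

def project(relation, *indices):
    """pi_p: keep only the attributes (tuple positions) in p."""
    if len(indices) == 1:
        return {row[indices[0]] for row in relation}
    return {tuple(row[i] for i in indices) for row in relation}

# R_p: (tag, object) pairs where the tag positively describes the object.
Rp = {("bagels", "o_bagels"), ("shop", "o_bagels"),
      ("downtown", "o_bagels"), ("pizza", "o_pizza"),
      ("pizzeria", "o_pizza")}

# R_a: (tag, user, object) triples of actual annotations.
Ra = {("pizzeria", "sally", "o_pizza")}

# pi_t(sigma_{o_bagels}(R_p)): all tags positively describing o_bagels.
tags = project(select(Rp, lambda row: row[1] == "o_bagels"), 0)
print(sorted(tags))  # ['bagels', 'downtown', 'shop']

# pi_u(sigma_{o_pizza}(R_a)): all users who tagged o_pizza.
users = project(select(Ra, lambda row: row[2] == "o_pizza"), 1)
print(users)  # {'sally'}
```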
3.2 Creating a Prediction Dataset
In this chapter, we continue to use the del.icio.us social bookmarking dataset de-
scribed in Section 2.2. For our current purposes, we are most interested in very
common tags in that dataset. We call the set of the top 100 tags in the dataset by
frequency T100 for short (shown in Figure 3.2).
We wanted to construct a dataset approximating Rp and Rn for our prediction
experiments. However, we only know Ra. Section 2.3 suggested that if (ti, ok) ∈
π(t,o)(Ra) then (ti, ok) ∈ Rp. In other words, annotated tags tend to be accurate.
However, the reverse is not true. The case where (ti, ok) ∉ π(t,o)(Ra) and (ti, ok) ∈ Rp
occurs sufficiently often that measures of precision, recall, and accuracy can be heavily
skewed. In early experiments on a naively created dataset, we found that as many
as 3/4 of false positives were erroneous according to manual reviews we conducted. By
“erroneous false positives,” we mean that our classifiers had accurately predicted for
a given (ti, ok) pair that (ti, ok) ∈ Rp, but (ti, ok) ∉ π(t,o)(Ra).
When comparing systems, it is reasonable to use a partially labeled dataset, be-
cause the true relative ranking of the systems is likely to be preserved. Pooling [45],
for example, makes this assumption. However, for this work, we wanted to give
absolute numbers for how accurately tags can be predicted, rather than comparing
systems.
We decided to filter our dataset by looking at the total number of posts for a given
Rank    #      Tag
1       4225   reference
2       3794   toread
3       3788   resources
4       3677   cool
5       3593   work
6       3469   technology
7       3366   tools
8       3365   internet
9       3205   computer
10      3016   blog
11      3012   web
12      2996   web2.0
13      2879   online
14      2759   free
15      2661   software
...     ...    ...
86      396    politics
87      396    mobile
88      351    game
89      343    jobs
90      341    wordpress
91      328    mp3
92      326    health
93      310    environment
94      266    finance
95      233    ruby
96      226    fashion
97      216    rails
98      135    food
99      74     recipes
100     5      fic
Table 3.1: The top 15 tags account for more than 1/3 of the top 100 tags added to URLs
after the 100th bookmark. Most are relatively ambiguous and personal. The bottom
15 tags account for very few of the top 100 tags added to URLs after the 100th
bookmark. Most are relatively unambiguous and impersonal.
Figure 3.1: Average new tags versus number of posts.
URL:
postcount(ok) = |πu(σok(Ra))|
As postcount(ok) increases, we expect the probability for any given ti that (ti, ok) ∉
π(t,o)(Ra) and (ti, ok) ∈ Rp to decrease.1 We chose a cutoff of 100, which leads us to
approximate Rp and Rn as:
(ti, ok) ∈ Rp iff 100 ≤ postcount(ok) < 3000 and |πu(σti,ok(Ra))| ≥ postcount(ok)/100
(ti, ok) ∈ Rn iff 100 ≤ postcount(ok) < 3000 and σti,ok(Ra) = ∅
This results in a filtered set of |πo(Rp ∪ Rn)| ≈ 62,000 URLs and their corresponding
tags.
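A minimal sketch of this filter, assuming Ra is held in memory as a set of (tag, user, object) triples; the function names are illustrative, not the thesis's code:

```python
# Sketch of the dataset filter: an object is kept if it has between 100
# and 3000 posts; a (tag, object) pair goes into the approximated R_p if
# at least postcount(o)/100 distinct users applied the tag, and into R_n
# if no user applied the tag at all.

def postcount(Ra, o):
    """Number of distinct users who posted (bookmarked) object o."""
    return len({u for (t, u, obj) in Ra if obj == o})

def in_Rp(Ra, tag, o):
    pc = postcount(Ra, o)
    if not (100 <= pc < 3000):
        return False
    taggers = {u for (t, u, obj) in Ra if t == tag and obj == o}
    return len(taggers) >= pc / 100

def in_Rn(Ra, tag, o):
    pc = postcount(Ra, o)
    if not (100 <= pc < 3000):
        return False
    return not any(t == tag and obj == o for (t, u, obj) in Ra)

# Toy data: 100 distinct users bookmark "url1"; five of them tag it "web".
Ra = {("misc", f"u{i}", "url1") for i in range(100)}
Ra |= {("web", f"u{i}", "url1") for i in range(5)}

print(postcount(Ra, "url1"))      # 100
print(in_Rp(Ra, "web", "url1"))   # True  (5 taggers >= 100/100)
print(in_Rn(Ra, "food", "url1"))  # True  (nobody used "food")
```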
Our reasoning for the 100 post minimum is based on the rate at which new unique
tags from the top 100 tags, T100, are added to a URL. Figure 3.1 shows the average
number of new tags ti ∈ T100 that are added to a URL by the nth post. This
1Using postcount(ok) for filtering relies to a certain extent on evaluating popular tags. While we do not examine it here, for lower frequency tags we suggest vocabcount(ti, ok) = Σuj∈πu(σok(Ra)) min(|σ(ti,uj)(Ra)|, 1), which should behave similarly to postcount(ok) but relies on users’ vocabularies rather than raw number of posts.
information is computed over all URLs which occur at least 200 times in our dataset.
On average, the first person to post a URL adds one tag from T100, the second adds
0.6 of a tag from T100, and so on. By the 100th post, the probability of a post adding
a new tag from T100 is less than 5% and remains relatively flat. Furthermore, the
top tags which are added later tend to be much more ambiguous or personal. Table
3.1 shows the fifteen tags which most and least commonly get added after the 100th
post. Tags like “mp3” and “food” are relatively clear in meaning, whereas tags like
“internet” and “toread” are much more ambiguous and personal. While we cannot
completely eliminate the possibility of erroneous tuples in Rp and Rn, our approach
is most accurate for unambiguous or impersonal tags and does not require creating a
gold standard based on human annotation. Such a gold standard would be especially
difficult to create for subjective tags like “cool” or “toread.”
3.3 Two Tag Prediction Methods
In the two sections that follow, we look at the predictability of tags given two broad
types of data. In Section 3.3.1 we look at the predictability of the tags in T100 given
information we have about the web pages in our dataset. We look at page text,
anchor text, and surrounding hosts to try to determine whether particular tags apply
to objects in our dataset. This task is specific to social bookmarking systems because
the data we use for prediction is specific to web pages. However, the predictability
of tags for web pages may also be important for web search, which may want to
determine if tags provide information above and beyond page text, anchor text, and
surrounding hosts, and to vertical (web) search, which may want to categorize parts
of the web by tags. Chapter 2 provides some initial answers to these questions, but
does not address predictability directly, nor does it look specifically at anchor text.
“Predictability” is approximated by the predictive power of a support vector machine.
While classifiers differ, we believe our results enable qualitative conclusions about the
machine predictability of tags for state of the art text classifiers.
In Section 3.3.2 we look at the predictability of tags based on other tags already
annotating an object. (In Section 3.3.1, we make the simplifying “cold start” as-
sumption that no other tags are available, using only page text, anchor text, and
surrounding hosts for prediction.) The task of predicting tags given other tags has
many potential applications within tagging systems, as discussed at the beginning of
this chapter. Unlike the task in Section 3.3.1, our work in Section 3.3.2 is applica-
ble to tagging systems in general (including video, photo and other tagging systems)
rather than solely social bookmarking systems because it does not rely on any par-
ticular type of object (e.g., web pages). We also consider the problem of ranking the
additional tags in order of how likely they are to annotate an object.2
3.3.1 Tag Prediction Using Page Information
We chose to evaluate prediction accuracy using page information on the top 100 tags
in our dataset (i.e., T100). These tags collectively represent 2,145,593 of 9,414,275
triples, meaning they make up about 22.7% of the user contributed tags in the full
Stanford Tag Crawl dataset. The dataset contains crawled page text and additional
information for about 60,000 of the URLs in πo(Rp) ∪ πo(Rn) (about 95%).
We treated the prediction of each tag ti ∈ T100 as a binary classification task. For
each tag ti ∈ T100, our positive examples were all ok ∈ πo(σti(Rp)) and our negative
examples were all ok ∈ πo(σti(Rn)). For each task, we defined two different divisions
of the data into train/test splits. In the first division, which we call Full/Full, we
randomly select 11/16 of the positive examples and 11/16 of the negative examples to be our training set. The other 5/16 of each is our test set. For each Full/Full task, the
number of training, test, positive, and negative examples varies depending on the tag.
However, usually the training set is between 30,000 and 35,000 examples and the test
set is about 15,000 examples. The proportion of positive examples can vary between
1% and 60% with a median of 14% and a mean of 9%. In the second division, which
we call 200/200, we randomly select 200 positive and 200 negative examples for our
training set, and the same for our test set.
How well we do on the Full/Full split implies how well we can predict tags on
2Note that the techniques from Section 3.3.1 could be expanded to not assume cold start and to handle ranking, but we do not do so here.
the naturally occurring distribution of tagged pages. (We call it Full/Full because
the union of positive and negative examples is the full set of URLs in Rp and Rn.)
However, we can get high accuracy (if not high precision) on Full/Full by biasing
towards guessing negative examples for rare tags. For example, because “recipes” only
naturally occurs on 1.2% of pages, we could achieve 98.8% accuracy by predicting all
negative on the “recipes” binary classification task. One solution to this problem is
to change metrics to precision-recall break even point (PRBEP) or F1 (we report the
former later). However, these measures are still highly impacted by the proportion of
positive examples. We provide 200/200 as an imperfect indication of how predictable
a tag is due to its “information content” rather than the distribution of examples in
the system.
Each example represented one URL and had one of three different feature repre-
sentations depending on whether we were predicting tags based on page text, anchor
text, or surrounding hosts. Page text means all text present at the URL. Anchor
text means all text within fifteen words of inlinks to the URL (similar to Haveliwala
et al. [32]). Surrounding hosts means the sites linked to and from the URL, as well
as the site of the URL itself. For both page text and anchor text, our feature repre-
sentation was a bag of words. We tokenized pages and anchor text using the Penn
TreeBank Tokenizer, dropped infrequent tokens (those outside the 10 million most
frequent tokens), and then converted tokens to token ids. For anchor text tasks, we only used
URLs as examples which had at least 100 inlinks.3 The value of each feature was the
number of times the token occurred. For surrounding hosts, we constructed six types
of features. These features were: the hosts of backlinks, the domains of backlinks,
the host of the URL of the example, the domain of the URL of the example, the
hosts of the forward links, and the domains of the forward links. The value of each
feature was one if the domain or host in question was a backlink/forwardlink/current
domain/host and zero if not.
We chose to evaluate page text, anchor text, and host structure rather than just
combining all text of pages linked to or from the URL of each example because
3We found that the difference between 10 and 100 inlinks as the cutoff was negligible. More data about a particular URL improves classification accuracy for that URL, but having more URLs in the training set improves classification accuracy in general.
cool, online, resources, community, work, culture, portfolio, social, technology,history, advertising, writing, architecture, flash, inspiration, humor, search, funny,tools, fun, internet, home, media, free, illustration, fashion, library, research, ajax,marketing, books, computer, environment, firefox, art, jobs, productivity, free-ware, business, download, education, news, web2.0, language, tips, wiki, word-press, graphics, mobile, video, google, php, article, blogs, mp3, travel, security,science, shopping, hardware, photography, games, reference, tutorials, toread, au-dio, photos, movies, javascript, tv, maps, blog, mac, howto, game, health, photo,design, music, opensource, osx, politics, photoshop, java, web, windows, finance,tutorial, webdesign, css, software, apple, development, food, linux, ruby, program-ming, rails, recipes
Figure 3.2: Tags in T100 in increasing order of predictability from left to right. “cool” is the least predictable tag; “recipes” is the most predictable tag.
Yang et al. [74] state that including all surrounding text may reduce accuracy. For
all representations (page text, anchor text, and surrounding hosts), we engineered
our features by applying Term Frequency Inverse Document Frequency (TFIDF),
normalizing to unit length, and then feature selected down to the top 1000 features
by mutual information. We chose mutual information due to discussion in Yang and
Pedersen [73]. In previous experiments, we found that the impact of more features was
negligible, and reducing the feature space helped simplify and speed up the training
process.4
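The TFIDF-and-normalize step can be sketched as follows. This is a toy version with natural-log IDF that omits the mutual-information feature selection; it illustrates the transformation, not the thesis's actual pipeline:

```python
import math

# Sketch of the feature engineering step: raw token counts -> TFIDF ->
# unit-length (L2) normalization. Mutual-information feature selection
# is omitted for brevity.

def tfidf_unit(docs):
    """docs: list of {token: count}. Returns list of {token: weight}."""
    n = len(docs)
    # document frequency of each token
    df = {}
    for d in docs:
        for tok in d:
            df[tok] = df.get(tok, 0) + 1
    out = []
    for d in docs:
        # TF * IDF; a token in every document gets weight zero
        w = {tok: cnt * math.log(n / df[tok]) for tok, cnt in d.items()}
        # normalize each vector to unit length (guard against zero vectors)
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        out.append({tok: v / norm for tok, v in w.items()})
    return out

docs = [{"css": 2, "web": 1}, {"web": 3}, {"recipes": 4}]
vecs = tfidf_unit(docs)
```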
For our experiments, we used support vector machines for classification. Specif-
ically we used Thorsten Joachims’ SVMlight package with a linear kernel and the
default regularization parameter (see [43]) and his SVMperf package with a linear
kernel and regularization parameters of 4 and 150 (see [44]). With SVMlight, we
trained to minimize average error, with SVMperf, we trained to minimize PRBEP.
Given that we had 100 tags, 2 splits (200/200 and Full/Full), and 3 feature types
for examples (page text, anchor text, and surrounding hosts), we conducted 600
binary classification tasks total. Assuming only a few evaluation metrics for each
binary classification task, we could have thousands of numbers to report. Instead,
4Gabrilovich and Markovitch [29] actually find that aggressive feature selection is necessary for SVM to be competitive with decision trees for certain types of hypertext data.
in the rest of this section, we ask several questions intended to give an idea of the
highlights of our analysis. Apart from the questions answered below, Figure 3.2 gives
a quick at-a-glance view of which tags are more or less predictable in T100 ranked by
the sum of PRBEP (Full/Full), Prec@10% (Full/Full) and Accuracy (200/200).5 See
discussion below for description of each metric. In the analysis below, when we give
the mean of the values of tags, we mean the macro-averaged value.
What precision can we get at the PRBEP?
For applications like vertical search (or search enhanced by topics), one natural ques-
tion is what our precision-recall curve looks like at reasonably high recall. PRBEP
gives a good single number measurement of how we can tradeoff precision for recall.
For the Full/Full split, we calculated the PRBEP for each of the 600 binary classifi-
cation tasks. On average, the PRBEP for page text was about 60%, for anchor text
was about 58%, and for surrounding hosts was about 51% with a standard deviation
of between 8% and 10%. This suggests that on realistic data, we can get about 2/3 of
the URLs labeled with a particular tag with about 1/3 erroneous URLs in our resulting
set. This is pretty good—we are doing much better than chance given that a majority
of tags in T100 occur on less than 15% of documents.
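One common way to compute PRBEP from a classifier's ranked output is to cut the ranking at the number of true positives, where precision and recall coincide; the following is a sketch of that approach, not necessarily the exact procedure used here:

```python
# Sketch: precision-recall break-even point (PRBEP) for a ranked list.
# Scores are classifier outputs; labels are 1 (tag applies) / 0 (not).
# At cutoff k in the ranking, precision equals recall exactly when k is
# the number of positives P, so PRBEP = precision among the top-P items.

def prbep(scores, labels):
    P = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)
    top = ranked[:P]
    return sum(lab for _, lab in top) / P

# Two positives; one of them ranks in the top two, so PRBEP = 1/2.
print(prbep([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))  # 0.5
```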
What precision can we get with low recall?
For applications like bootstrapping or single tag queries, we may care less about
overall recall (because the web is huge), but we may want high precision. We used
the Full/Full split to look at this question. For each binary classification task, we
calculated the precision at 10% recall (i.e., Prec@10%). With all of our feature types
(page text, anchor text, and surrounding hosts), we were able to get a mean Prec@10%
value of over 90%. The page text Prec@10% was slightly higher, at 92.5%, and all
feature types had a standard deviation of between 7% and 9%. This suggests that
whatever our feature representation, if we have many more examples than we need for
our system, we can get high precision by reducing the recall. Furthermore, it suggests
5Two tags are missing, “system:imported” (a system generated tag) and “fic” (which is common in the full dataset but uncommon for top URLs and was removed as an outlier).
that there are some examples of most tags that our classifiers are much more certain
about, rather than a relatively uniform distribution of certainty.
Which page information is best for predicting tags?
According to all evaluation metrics, we found a strict ordering among our feature
types. Page text was strictly more informative than anchor text which was strictly
more informative than surrounding hosts. For example, for PRBEP, the ordering is
(60, 58, 51), for Prec@10% it is (92.5, 90, 90), for accuracy on the 200/200 split, it
is (75, 73, 66). Usually, page text was incrementally better than anchor text, while
both were much better than surrounding hosts. This may have been due to our
representation or usage of our surrounding hosts, or it could simply be that text is a
particularly strong predictor of the topic of a page.
Is anchor text particularly predictive of tags?
One common complaint about tags is that they should be highly predictable based
on anchor text, because both serve as commentary on a particular URL. While both
page text and anchor text are predictive of tags, we did not find anchor text to
be more predictive on average than page text for any of our split/evaluation metric
combinations.
What makes a tag predictable?
A more general question than those above is what makes a tag predictable. Pre-
dictability may give clues as to the “information content” of a tag, but it may also be
practically useful for tasks like deciding which tags to suggest to users. In order to
try to quantify this, we defined an entropy measure to try to mirror the “generality”
of a tag. Specifically, we call the distribution of tag co-occurrence events with a given
tag ti, P (T |ti). Given this distribution, we define the entropy of a tag ti to be:
H(ti) = −Σtj∈T,tj≠ti P(tj|ti) log P(tj|ti)
Figure 3.3: When the rarity of a tag is controlled in 200/200, entropy is negatively correlated with predictability.
Figure 3.4: When the rarity of a tag is controlled in 200/200, occurrence rate is negatively correlated with predictability.
For example, if the tag tcar co-occurs with tauto 3 times, with tvehicle 1 time, and with
tautomobile 1 time, we would say its entropy was equal to:
H(tcar) = −(3/5) log(3/5) − (1/5) log(1/5) − (1/5) log(1/5) ≈ 1.37
The intuition for entropy in this case is that tags which co-occur with a broad base
of other tags tend to be more general than those tags which primarily co-occur with
a small group of related tags.
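A sketch of the entropy measure; with log base 2 it reproduces the tcar example above:

```python
import math

# Entropy H(t_i) of a tag from its co-occurrence counts with other tags,
# using log base 2.

def tag_entropy(cooccur):
    """cooccur: {other_tag: number of co-occurrences with t_i}."""
    total = sum(cooccur.values())
    return -sum((c / total) * math.log2(c / total)
                for c in cooccur.values())

# The t_car example: co-occurs with "auto" 3 times, "vehicle" once,
# and "automobile" once.
h = tag_entropy({"auto": 3, "vehicle": 1, "automobile": 1})
print(round(h, 2))  # 1.37
```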
Because the relative rarity of a tag heavily impacts its predictability, we used the
200/200 split to try to evaluate predictability of tags in the abstract. For this split,
Figure 3.5: When the rarity of a tag is not controlled, in Full/Full, additional examples are more important than the vagueness of a tag, and more common tags are more predictable.
we found a significant correlation between our entropy measure H(ti) and accuracy
of a classifier on 200/200 (see Figure 3.3). For page text, we had a Pearson product-
moment correlation coefficient of r = −0.46, for anchor text r = −0.51, and for
surrounding hosts r = −0.54. All p-values were less than 10^−5.6 However, for the
same split, we also found that the popularity of a tag was highly negatively correlated
with our accuracy (see Figure 3.4). Specifically, for page text, we had r = −0.53, for
anchor text r = −0.51, and for domains r = −0.27. In other words, the popularity
of a tag seems to be as good a proxy for “generality” as a more complex entropy
measure. The two are not exclusive—a linear model fit to accuracy based on both
popularity and entropy does better than a model trained on either one alone.
For the Full/Full split, we found that the commonality of a tag (and hence the com-
monality of positive examples) was highly positively correlated with high PRBEP (see
Figure 3.5). However, perhaps because the recall was relatively low, we found no corre-
lation between the commonality of a tag and our performance on Prec@10% (though
we did find some low but significant correlation between PRBEP and Prec@10%).
The entropy measure was uncorrelated with PRBEP or Prec@10% for the Full/Full
split.
6Though we do not quote them here, we also computed Kendall’s τ and Spearman’s ρ values which gave similarly strong p-values.
3.3.2 Tag Prediction Using Tags
Between about 30 and 50 percent of URLs posted to del.icio.us have only been book-
marked once or twice. Given that the average bookmark has about 2.5 tags, the odds
that a query for a particular tag will return a bookmark only posted once or twice
are low. In other words, our recall for single tag queries is heavily limited by the high
number of rare URLs with few tags. For example, a user labeling a new software tool
for Apple’s Mac OS X operating system might annotate it with “software,” “tool,”
and “osx.” A second user looking for this content with the single tag query (or feed)
“mac” would miss this content, even though a human might easily realize that “osx”
implies “mac.” The question in this section is given a small number of tags, how
much can we expand this set of tags in a high precision manner? The better we do at
this task, the less likely we are to have situations like the “osx”/“mac” case because
we will be able to expand tags like “osx” into implied tags like “mac.”
A natural approach to this problem is market-basket data mining. In the market-
basket model, there are a large set of items and a large set of baskets each of which
contains a small set of items. The goal is to find correlations between sets of items
in the baskets. Market-basket data mining produces association rules of the form
X → Y . Association rules commonly have three values associated with them:
Support The number of baskets containing both X and Y .
Confidence P (Y |X). (How likely is Y given X?)
Interest P(Y|X) − P(Y), or alternatively P(Y|X)/P(Y). (How much more common is X & Y than expected by chance?)
Given a minimum support, Agrawal et al. [11] provide an algorithm for computing
association rules from a dataset.
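For tag-pair rules, these statistics can be computed by brute force on small data. The sketch below uses the difference form of interest; a real miner such as the Apriori algorithm of Agrawal et al. [11] prunes candidates by minimum support instead of enumerating everything:

```python
from itertools import combinations

# Market-basket statistics for tag-pair rules X -> Y:
#   support    = number of baskets containing both X and Y
#   confidence = P(Y|X)
#   interest   = P(Y|X) - P(Y)

def pair_rules(baskets, min_support=2):
    n = len(baskets)
    count = {}  # itemset (tuple) -> number of baskets containing it
    for b in baskets:
        for t in b:
            count[(t,)] = count.get((t,), 0) + 1
        for x, y in combinations(sorted(b), 2):
            count[(x, y)] = count.get((x, y), 0) + 1
    rules = []
    for pair, supp in count.items():
        if len(pair) != 2 or supp < min_support:
            continue
        x, y = pair
        for lhs, rhs in ((x, y), (y, x)):
            conf = supp / count[(lhs,)]
            interest = conf - count[(rhs,)] / n
            rules.append((lhs, rhs, supp, conf, interest))
    return rules

# Toy baskets echoing the "osx" implies "mac" example.
baskets = [{"osx", "mac", "software"}, {"osx", "mac"}, {"mac"}, {"web"}]
for lhs, rhs, supp, conf, interest in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: supp={supp} conf={conf:.2f} int={interest:.2f}")
```

Here "osx" → "mac" comes out with confidence 1.0, while the reverse rule is weaker, matching the intuition that "osx" implies "mac" more strongly than the converse.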
In our case, the baskets are URLs, and the items are tags. Specifically, for each
ok ∈ πo(Rp), we construct a basket πt(σok(Rp)). We constructed three sets of rules:
rules with support > 500 and length 2, rules with support > 1000 and length 3, and
rules with support > 2000 of any length. (The length is the number of distinct tags
Int. Conf. Supp. Rule
0.59 0.994 634 graphic-design → design
0.69 0.992 644 oop → programming
0.56 0.992 654 macsoftware → software
0.89 0.990 605 photographer → photography
0.44 0.990 1780 webstandards → web
0.44 0.990 786 w3c → web
0.58 0.989 2144 designer → design
0.85 0.987 669 windowsxp → windows
0.44 0.987 1891 dhtml → web
0.85 0.986 872 debian → linux
0.58 0.986 1092 illustrator → design
0.56 0.986 707 sourceforge → software
0.85 0.985 1146 gnu/linux → linux
0.61 0.985 539 bloggers → blog
0.58 0.985 597 ilustracion → design
0.44 0.985 1794 web-development → web
0.44 0.985 3366 xhtml → web
0.68 0.984 730 disenoweb → webdesign
0.87 0.983 648 macsoftware → mac
Table 3.2: Association Rules: A selection of the top 30 tag pair association rules. All of the top 30 rules appear to be valid; these rules are representative.
involved in the rule.) We merged these three rule sets into a single association rule
set for our experiments.
Found Association Rules
We found a surprising number of high quality association rules in our data. Table 3.2
shows some of the top association rules of length two. The rules capture a number of
different relationships between tags in the data. Some rules correspond to a “type-
of” style of relationship, for example, “graphic-design” is a type of “design.” Others
correspond to different word forms, for example, “photographer” and “photography.”
Some association rules correspond to translations of a tag, for example, “disenoweb”
is Spanish for “webdesign.” Some of the relationships are surprisingly deep, for
example, the “w3c” is a consortium that develops “web” standards. Arguably, one
Int. Conf. Supp. Rule
0.81 0.989 1097 open source & source → opensource
0.55 0.979 1003 downloads & os → software
0.42 0.967 1686 free & webservice → web
0.73 0.964 1134 accessibility & css → webdev
0.84 0.952 1305 app & osx → mac
0.40 0.950 2162 webdesign & websites → web
0.47 0.947 2269 technology & webtools → tools
0.40 0.945 1662 php & resources → web
0.63 0.937 2754 html & tips → webdesign
0.50 0.934 1914 xp → software
0.45 0.928 1332 freeware & system → tools
0.62 0.919 1513 cool & socialsoftware → web2.0
0.61 0.915 1165 business & css → webdesign
0.61 0.912 2231 tips & webdevelopment → development
0.35 0.900 6337 toread & web2.0 → web
0.69 0.897 1010 fotografia & inspiration → art
0.33 0.895 1723 help & useful → reference
Table 3.3: Association Rules: A random sample of association rules of length ≤ 3 and support > 1000.
might suggest that if both ti → tj and tj → ti with high confidence and high interest,
ti and tj are probably synonymous.
Depending on computational resources, numbers of association rules in the mil-
lions or billions can be generated with reasonable support. However, in practice, the
most intuitive rules seem to be rules of length four or less. In order to give an idea
of the rules in general, rather than picking the top rules, we give a random sampling
of the top 8000 rules of length three or less. This information is shown in Table 3.3.
There is sometimes redundancy in longer rules, for example, one might suggest that
rather than “webdesign & websites → web” we should instead have “webdesign →
web” and “websites → web”. This is a minor issue, however, and it is relatively rare
for a rule with high confidence to be outright incorrect. Furthermore, given their ease
of interpretation, it would not be unreasonable to have human moderators look over
low length, high support and high confidence rules.
Tag Application Simulation
For our evaluation, we simulate rare URLs with few bookmarks. We separated πo(Rp)
into a training set of about 50,000 URLs and a test set of about 10,000 URLs.
We generated our three sets of association rules based only on baskets from the
training set. We then sampled n bookmarks from each of the the URLs in the test
set, pretending these were the only bookmarks available. Given this set of sampled
bookmarks, we attempted to apply association rules in decreasing order of confidence
to expand the set of known tags. We stopped applying association rules once we had
reached a particular minimum confidence c.
For example, suppose we have a URL which has a recipe for oven-cooked pizza
bagels with three bookmarks corresponding to three sets of tags:
{(food, recipe), (food, recipes), (pizza, bagels)}
For n = 1, we might sample the bookmark (pizza, bagels). Assuming we had two
association rules:
• pizza → food (confidence = 0.9)
Orig. Sampled  Min.              # Tag Expansions
Bookmarks      Conf.    0     1     2     3     4     5+
1              0.50     2096  100   153   435   486   7667
1              0.75     4015  1717  1422  1263  866   1654
1              0.90     7898  1845  709   291   116   78
2              0.50     545   78    115   283   300   9616
2              0.75     2067  1630  1582  1664  1145  2849
2              0.90     6520  2491  1113  473   208   132
3              0.50     216   61    62    172   164   10262
3              0.75     1397  1415  1545  1558  1363  3659
3              0.90     5913  2746  1265  596   249   168
5              0.50     71    31    29    101   80    10625
5              0.75     810   1070  1360  1507  1485  4705
5              0.90     5145  3065  1427  692   366   242
Table 3.4: Association Rules: Tradeoffs between number of original sampled book-marks, minimum confidence and resulting tag expansions.
• bagels → bagel (confidence = 0.8)
We would first apply the confidence 0.9 rule and then the confidence 0.8 rule. Ap-
plying both rules (i.e., two applications), would result in (pizza, bagels, food, bagel).
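The expansion step sketched in code, using the two hypothetical rules above; rules are applied in decreasing order of confidence, and the mean confidence of the applied rules serves as the precision estimate:

```python
# Sketch of the tag-expansion step: apply association rules in decreasing
# order of confidence until confidence drops below the minimum c.
# Rules are (lhs_tags, rhs_tag, confidence) triples.

def expand_tags(tags, rules, min_conf):
    tags = set(tags)
    applied = []  # confidences of the rules we actually fired
    for lhs, rhs, conf in sorted(rules, key=lambda r: -r[2]):
        if conf < min_conf:
            break
        if set(lhs) <= tags and rhs not in tags:
            tags.add(rhs)
            applied.append(conf)
    return tags, applied

# The pizza-bagels example: two rules, sampled tags (pizza, bagels).
rules = [(("pizza",), "food", 0.9), (("bagels",), "bagel", 0.8)]
tags, applied = expand_tags({"pizza", "bagels"}, rules, 0.5)
print(sorted(tags))                       # ['bagel', 'bagels', 'food', 'pizza']
print(round(sum(applied) / len(applied), 2))  # 0.85, the estimated precision
```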
Number and Precision of Tag Expansions
We ran a simulation as described above for each number of original sampled book-
marks n ∈ {1, 2, 3, 5} and for each minimum confidence c ∈ {0.5, 0.75, 0.9}. Our
results are shown in Tables 3.4, 3.5, and 3.6. Each row of each table represents
one setting of n and c. We asked two initial questions: “How many tags were added?”
and “How accurate were our applications of tags?” The column in Table 3.4 labeled
“# Tag Expansions” shows, for each simulation, the number of URLs to which we
were able to add 0, 1, 2, 3, 4 or 5+ tags. The column in Table 3.5 labeled “Ac-
tual Precision” shows the percentage of tag applications which were correct (given
the other information we had about each URL in Rp). For each simulation in Table
3.5, we also computed our estimate (“Estimated Precision”) of what our precision
should have been based on the confidence values of applied rules. Our estimate is
Orig. Sampled  Min.     Exp. Precision
Bookmarks      Conf.    Est.    Actual
1              0.50     0.650   0.633
1              0.75     0.844   0.854
1              0.90     0.941   0.954
2              0.50     0.652   0.590
2              0.75     0.844   0.811
2              0.90     0.942   0.931
3              0.50     0.653   0.559
3              0.75     0.844   0.779
3              0.90     0.943   0.917
5              0.50     0.654   0.509
5              0.75     0.842   0.732
5              0.90     0.943   0.873
Table 3.5: Association Rules: Tradeoffs between number of original sampled book-marks, minimum confidence, estimated precision and actual precision.
Orig. Sampled  Min.     Mean Recall (T100)   New Precision (T100)
Bookmarks      Conf.    Orig.   Expd.        Mean    Median
1              0.50     0.099   0.271        0.629   0.677
1              0.75     0.099   0.153        0.929   0.963
1              0.90     0.100   0.113        0.993   1.000
2              0.50     0.160   0.386        0.585   0.626
2              0.75     0.164   0.237        0.909   0.949
2              0.90     0.161   0.180        0.989   1.000
3              0.50     0.205   0.451        0.550   0.577
3              0.75     0.207   0.289        0.900   0.942
3              0.90     0.204   0.226        0.988   1.000
5              0.50     0.265   0.524        0.497   0.496
5              0.75     0.268   0.358        0.881   0.931
5              0.90     0.271   0.294        0.983   0.996
Table 3.6: Association Rules: Tradeoffs between number of original sampled book-marks, minimum confidence, recall, and precision.
the average of the confidence of all applied rules. For example, in our oven-cooked
pizza bagel example above (assuming the URL was the only URL in our simulation),
we would have an actual precision of 0.5 because “food” is a tag which appears in
other bookmarks annotating the URL, whereas “bagel” is not. Our estimate of our
precision would be (0.9 + 0.8)/2 = 0.85. We would also increment the “2” column of “#
Tag Expansions” because the URL was expanded twice.
Thus, the first row of Table 3.4 says that we ran a simulation with n = 1, c = 0.5
and 10,937 URLs. In 2,096 cases, we were not able to add any tags (anecdotally,
this usually happens when a bookmark only has one tag). In 7,667 cases, we were
able to add five or more tags. The first row of Table 3.5, which shows data about
the same simulation (n = 1 and c = 0.5), says that our estimate of our precision was
0.650 while our actual precision was a little lower, 0.633.
The results in Table 3.4 show that with only a single bookmark, we can expand
anywhere from 10 to 80 percent of our URLs by at least one tag depending on our
desired precision. With larger numbers of bookmarks, we can do better, though
the most pertinent tags for a URL are applied quickly. Table 3.5 shows that as the
number of bookmarks increases, the difference between estimated and actual precision
increases. This means that as a URL receives more and more annotations, we become
increasingly unsure of the effectiveness of association rules for unapplied tags.
How Useful are Predicted Tags?
As we argued at the beginning of this chapter, predicted tags can be used by a system
in many ways. Here we briefly explore one such use: increasing recall for single tag
queries. For instance, if the user searches for “food,” the system can return objects
annotated with “food” as well as objects which we predict “food” annotates. Using
term co-occurrence to expand query results is a well known IR technique; here we
want to know how well it works for tags.
For evaluation, we consider each tag ti ∈ T100 to be a query qti . For each query
qti , the result set s contains the URLs annotated with the tag, and the result set s′
contains the URLs annotated with the tag plus URLs which we predict are anno-
tated with the tag using association rules. We then compare the recall and precision
achieved by s and s′. For example, suppose five objects are positively described by
“food.” In our simulation suppose only two of the objects are known to have “food.”
Suppose that we correctly predict that one of the remaining three objects is labeled
“food” (perhaps using our “bagels” → “food” rule above), and we incorrectly predict
that two other objects are labeled “food.” Without expansion, query q retrieves s
which has two known bagel objects, so recall is 2/5. With expansion, s′ returns three
additional objects, one of which was correct, for a recall of (2+1)/5 and a precision
of (2+1)/(2+3).
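The recall and precision arithmetic of this worked example can be sketched as follows; the object ids are invented placeholders.

```python
# The single-tag query expansion arithmetic from the "food" example above.
relevant = {1, 2, 3, 4, 5}    # objects truly described by "food"
known = {1, 2}                # objects already known to be tagged "food"
predicted = {3, 6, 7}         # rules add one correct and two incorrect objects
expanded = known | predicted

recall_before = len(known & relevant) / len(relevant)       # 2/5 = 0.4
recall_after = len(expanded & relevant) / len(relevant)     # (2+1)/5 = 0.6
precision_after = len(expanded & relevant) / len(expanded)  # (2+1)/(2+3) = 0.6

print(recall_before, recall_after, precision_after)  # 0.4 0.6 0.6
```

(Without expansion, precision is trivially 1, since every returned object really carries the tag.)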
Table 3.6 shows the simulation details just described. The first two columns are
n and c, as in Tables 3.4 and 3.5. The simulation results are shown in the last four
columns. For each simulation (row), we give (a) the mean recall before expansion
(macro average over all tags in T100); (b) the mean recall after expansion; and (c)
the mean and median of the precision after expansion. (Note that without expansion
precision is always 1.) For instance, if we sample one bookmark per URL (n = 1) and
use 50% confidence (c = 0.5), we see that tag expansion improves mean recall from
0.099 to 0.271, a factor of 3 improvement! Of course, our average precision drops
from 1 to 0.629. For both the one and two tag cases (i.e., n = 1 and n = 2) with
confidence c = 0.75 we can increase recall by 50% while keeping precision above 90%.
3.4 Related Work
Previous work has looked at the nature of tags chosen by users [31, 64]. We do not
know of any work explicitly looking at how to construct a reasonable dataset for
prediction in tagging systems as we do in Section 3.2. While our hypertext classification
task in Section 3.3.1 is inspired by a long line of work, usefully surveyed by Yang et
al. [74], we believe the application to tags is new. Chakrabarti et al. [16] suggest a
different way to use local link information for classification that might prove more
effective than our domain features; however, we do not evaluate this possibility here.
Our use of an entropy measure for tagging systems is inspired by Chi and Mytkowicz
[18]. Other work has looked at tag suggestion, usually from a collaborative filtering
and UI perspective, for example with URLs [71] and blog posts [56, 68].
Our work in Section 3.3.2 is similar to work by Schmitz et al. [62]. However,
Schmitz et al. is primarily concerned with theoretical properties of mining association
rules in tripartite graphs. Schwarzkopf et al. [63] extend Schmitz’s association rules
work to build full ontologies. However, neither Schmitz et al. nor Schwarzkopf et al.
appear to evaluate the quality of the rules themselves aside from generating ontologies.
Lastly, there is also much previous work in IR studying query expansion and relevance
feedback trying to address similar questions of cross-language and cross-vocabulary
queries (see for example a general reference such as Manning et al. [54]). However, we
believe that association rules may be the most natural approach to these problems in
tagging systems due to user interface issues (for example, feeds, browsing).
3.5 Conclusion
Our tag prediction results suggest three insights.
First, this chapter reinforced evidence from Chapter 2 that many tags on the web
do not contribute substantial additional information beyond page text, anchor text,
and surrounding hosts. All three types of data can be quite predictive of different
tags in our dataset, and if we only want a small recall (e.g., 10%) we can have a
precision above 90%. The predictability of social bookmarking tags influences web
search (by suggesting ways to use tagging information or whether to use it at all),
as well as system designers who might bootstrap tagging systems with initial quality
data (by making it possible to predict such initial data).
Second, the predictability of a tag when our classifiers are given balanced training
data is negatively correlated with its occurrence rate and with its entropy. More
popular tags are harder to predict and higher entropy tags are harder to predict.
When considering tags in their natural (skewed) distributions, data sparsity issues
tend to dominate, so each further example of a tag improves classifier performance.
To the extent that predictability is correlated with the "generality" of a tag,
these measures may serve as building blocks for tagging system designers to produce
new features that rely upon understanding the specificity of tags (for example, system
suggestion and tag browsing). Both of our measures of tag predictability are object
type independent. This suggests that they may be applicable to tagging systems
where photos or video are annotated rather than only social bookmarking systems.
Third, association rules can increase recall on the single tag queries and feeds
which are common in tagging systems today. This suggests that they may serve as a
way to link disparate vocabularies among users. We found association rules linking
languages, super/subconcepts, and other relationships. These rules may also indicate
synonymy and polysemy, two issues that have plagued tagging systems since Golder
and Huberman’s seminal work [31]. (We return to the question of synonymy in the
context of social cataloging systems in Section 4.3.1.)
Chapter 4
Tagging Human Knowledge
As we noted in Chapter 1, tagging evolved in response to pressures to organize massive
numbers of online objects. For example, in 1994, two students organized pages on
the web into what became the Yahoo! Directory. What they did could be caricatured
as the “library approach” to organizing a collection: create a limited taxonomy or
set of terms and then have expert catalogers annotate objects in the collection with
taxonomy nodes or terms from the pre-set vocabulary. In 1998, the Open Directory
Project (ODP) replaced expert catalogers with volunteers, but kept the predetermined
taxonomy. Experts were too expensive, and users of the Internet too numerous to
ignore as volunteers. In 2003, del.icio.us, the subject of Chapters 2 and 3, was started.
del.icio.us uses what we call the “tagging approach” to organizing a collection: ask
users with no knowledge of how the collection is organized to provide terms to organize
the collection. Within a few years, del.icio.us had an order of magnitude more URLs
annotated than either Yahoo! Directory or ODP.
Increasingly, web sites are turning to the "tagging approach" rather than the "library
approach" for organizing the content generated by their users. This is both by
necessity and by choice. For example, the photo tagging site Flickr has thousands
of photos uploaded each second, an untenable amount to have labeled by experts.
Popular websites tend to have many users, unknown future objects, and few re-
sources dedicated up-front to data organization—the perfect recipe for the “tagging
approach.”
However, the “library approach,” even as we have caricatured it above, has many
advantages. In particular, annotations are generally consistent, of uniformly high
quality, and complete (given enough resources). In the tagging approach, who knows
whether two annotators will label the same object the same way? Or whether they
will use useful annotations? Or whether an object will end up with the annotations
needed to describe it? These questions are the subject of this chapter: to what
extent does the tagging approach match the consistency, quality, and completeness
of the library approach? We believe these questions are a good proxy for the general
question of whether the tagging approach organizes data well, a question which affects
some of the most popular sites on the web.
Unfortunately, we cannot really compare the library approach to tagging systems
using social bookmarking data, because librarians have not labeled even a small
fraction of the URLs in social bookmarking systems. Instead, this chapter and the
next look at social cataloging sites—sites where users tag books. By using books as
our objects, we can compare user tags to decades of expert library cataloger metadata.
In this chapter, we primarily treat library metadata as a gold standard. For example,
we test if tags have high coverage of existing library annotations. (In the next chapter,
we consider that library annotations might be faulty or inadequate.) By using two
social cataloging sites (LibraryThing and Goodreads), we can see how consistently
users annotate objects across tagging systems. Overall, we give a comprehensive
picture of the tradeoffs and techniques involved in using the tagging approach for
organizing a collection, though we do focus by necessity on popular tags and topics.
Our investigation proceeds as follows. In Section 4.1 we build a vocabulary to
discuss tagging and library data. In Section 4.2, we describe our datasets. In each of
Sections 4.3, 4.4, and 4.5, we evaluate the tagging approach in terms of consistency,
quality, and completeness. In Section 4.6 we discuss related work, and we conclude
in Section 4.7. (This chapter draws on material from Heymann et al. [37] which is
primarily the work of the thesis author.)
4.1 Social Cataloging Terms and Notation
This chapter and the next use the (ti, oj, uk) representation of tagging systems from
Chapter 2, with some modifications and additions. In contrast to del.icio.us in
Chapters 2 and 3, we focus on social cataloging sites where the objects are books. More
accurately, an object is a work, which represents one or more closely related books
(e.g., the different editions of a book represent a work).
An object o can be annotated in three ways. First, an object o can be annotated
(for free) by a user of the site, in which case we call the annotation a tag or (in some
contexts) a user tag (written ti ∈ T ). For example, the top 10 most popular tags
that users use to annotate their personal books in our LibraryThing social cataloging
dataset are “non-fiction,” “fiction,” “history,” “read,” “unread,” “own,” “reference,”
“paperback,” “biography,” and “novel.” Second, in a variety of experiments, we pay
non-experts to produce “tags” for a given object. These are functionally the same as
tags, but the non-experts may know little about the object they are tagging. As a
result, we call these paid non-experts “paid taggers,” and the annotations they create
“$-tags”, or $i ∈ $ to differentiate them from unpaid user tags. Thirdly, works are
annotated by librarians. For example, the Dewey Decimal Classification may say a
work is in class 811, which, as we will see below, is equivalent to saying the book has
annotations “Language and Literature”, “American and Canadian Literature,” and
“Poetry.” We will call the annotations made by librarians “library terms” (written
li ∈ L).
In a given system, an annotation a implicitly defines a group, i.e., the group of all
objects that have annotation a (we define O(a) to return this set of objects). We call
a the name of such a group. A group also has a size equal to the number of objects
it contains (we define oc(a) to return this size). Since an object can have multiple
annotations, it can belong to many groups. An object o becomes contained in group
a when an annotator annotates o with a. We overload the notation for T , $, and L
such that T (oi), $(oi), and L(oi) return the bag (multiset) of user tags, paid tags, and
library annotations for work oi, respectively.
4.1.1 Library Terms
We look at three types of library terms: classifications, subject headings, and the
contents of MARC 008.1
A classification is a set of annotations arranged as a tree, where each annotation
may contain one or more other annotations. An object is only allowed to have one
position in a classification. This means that an object is associated with one most
specific annotation in the tree and all of its ancestor annotations in the tree.
A subject heading is a library term chosen from a controlled list of annotations.
A controlled list is a predetermined set of annotations. The annotator may not make
up new subject headings. An object may have as many subject headings as desired
by the annotator.
Works are annotated with two classifications, the Library of Congress Classifica-
tion (LCC) and the Dewey Decimal Classification (DDC). A work has a position in
both classifications. LCC and DDC encode their hierarchy information in a short
string annotating a work, for example, GV735 or 811 respectively. The number 811
encodes that the book is about “Language and Literature” because it is in the 800s,
“American and Canadian Literature” because it is in the 810s, and “Poetry” most
specifically, because it is in the 811s. Likewise, “GV735” is about “Recreation and
Leisure” because it is in GV, and “Umpires and Sports officiating” because it is in
GV735. One needs a mapping table to decode the string into its constituent hierarchy
information.
Works are also annotated with zero or more Library of Congress Subject Headings
(LCSH).2 LCSH annotations are structured as one LCSH main topic and zero or
more LCSH subtopics selected from a vocabulary of phrases. For example, a book
about the philosophy of religion might have the heading “Religion” (Main Topic)
and “Philosophy” (Subtopic). In practice, books rarely have more than three LCSH
headings for space, cost, and historical reasons. Commonly only the most specific
1This section gives a brief overview of library terms and library science for this chapter. However,it is necessarily United States-centric, and should not be considered the only way to organize datain a library! For more information, see a general reference such as one by Mann ([53], [52]).
2Strictly speaking, we sometimes use any subject heading in MARC 650, but almost all of theseare LCSH in our dataset.
4.2. CREATING A SOCIAL CATALOGING DATASET 67
LCSH headings are annotated to a book, even if more general headings apply.
We flatten LCC, DDC, and LCSH. For example, in DDC, 811 is treated
as three groups {800, 810, 811}. LCSH is somewhat more complex. For
example, we treat the heading with main topic "Religion" and subtopic "Philosophy" as three groups
{Main:Religion:Sub:Philosophy, Religion, Philosophy}. This is, in some sense, not
fair to LCC, DDC, or LCSH because the structure in the annotations provides ad-
ditional information. However, we also ignore significant strengths of tagging in this
work, for example, its ability to have thousands of unique annotations for a single
work, or its ability to show gradation of meaning (e.g., a work 500 people tag “fan-
tasy” may be more classically “fantasy” than a work that only 10 people have tagged).
In any case, the reader should note that our group model does not fully model the
difference between structured and unstructured terms.
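The flattening step described above can be sketched as follows. The helper names `flatten_ddc` and `flatten_lcsh` are hypothetical, and the mapping from codes to human-readable group names would come from the mapping tables mentioned earlier; here we only flatten the raw codes and headings.

```python
# A sketch of "flattening" structured library terms into unstructured groups.

def flatten_ddc(code: str) -> set:
    """A DDC number like '811' implies its hundreds and tens ancestors."""
    return {code[0] + "00", code[:2] + "0", code}

def flatten_lcsh(main: str, sub: str) -> set:
    """A main topic plus subtopic heading is treated as three groups."""
    return {f"Main:{main}:Sub:{sub}", main, sub}

print(sorted(flatten_ddc("811")))                      # ['800', '810', '811']
print(sorted(flatten_lcsh("Religion", "Philosophy")))
```

A work's final group set is then simply the union of the flattened groups of all its library terms.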
A MARC record is a standard library record that contains library terms for a
particular book. It includes a fixed length string which we call MARC 008 that
states whether the book is a biography, whether the book is fiction, and other details.
We define LLCC , LDDC , LLCSH , LLM , and LMARC008 to be the set of library terms in
LCC, DDC, LCSH, LCSH main topics, and MARC008, respectively.
4.2 Creating a Social Cataloging Dataset
We use a dump of Library of Congress MARC records from the Internet Archive as
the source of our library terms. We chose to use only those 2,218,687 records which
had DDC and LCC library terms as well as an ISBN (a unique identifier for a book).
We also use a list of approximately 6,000 groups in LCC from the Internet Archive,
and a list of approximately 2,000 groups in DDC from a library school board in
Canada as mapping tables for LCC and DDC.
We started crawling LibraryThing in early April 2008, and began crawling
Goodreads in mid June 2008. In both cases, our dataset ends in mid-October 2008.
We crawled a sample of works from each site based on a random selection of ISBNs
from our Library of Congress dataset. LibraryThing focuses on cataloging books (and
has attracted a number of librarians in addition to regular users), whereas Goodreads
focuses on social networking (which means it has sparser tagging data). We gathered
synonym sets (see Section 4.3.1) from LibraryThing on October 19th and 20th.
We use two versions of the LibraryThing dataset, one with all of the works which
were found from our crawl, and one with only those works with at least 100 unique
tags. The former dataset, which we call the "full" dataset, has 309,071 works. The
latter dataset, which we call the "min100" dataset, has 23,396 works. We use only
one version of our Goodreads dataset, a version where every work must have at least
25 tags and there are 7,233 unique ISBNs.
4.3 Experiments: Consistency
In this and the next two sections, we conduct experiments to determine if tagging
systems are consistent, high quality, and complete. Each experiment has a descrip-
tion of a feature of the library approach to be emulated, a summary of the results,
zero or more preliminaries sections, and details about background, methodology, and
outcome.
The experiments in this section look at consistency :
Section 4.3.1 How big a problem is synonymy? That is, how consistent are users
of the same tagging system in choosing the same tag for the same topic?
Section 4.3.2 How consistent is the tag vocabulary chosen, or used, by users across
different tagging systems? That is, do users use the same tags across tagging
systems?
Section 4.3.3 How consistently is a particular tag applied across different tagging
systems? That is, do users use the same tags to describe the same objects?
Section 4.3.4 If paid taggers are asked to annotate objects with $-tags, are those
$-tags consistent with user tags?
4.3.1 Synonymy
Summary
Library Feature: There should not be multiple places to look for a particular object.
This means that we would prefer tags not to have synonyms. When a tag does have
synonyms, we would prefer one of the tags to have many more objects annotated with
it than the others.
Result: Most tags have few or no synonyms appearing in the collection. In a given
synonym set, one tag is usually much more common.
Conclusion: Synonymy is not a major problem for tags.
Preliminaries: Synonymy
A group of users named combiners mark tags as equivalent. We call two tags that are
equivalent according to a combiner synonyms. A set of synonymous tags is called a
synonym set. Combiners are regular users of LibraryThing who do not work directly
for us. While we assume their work to be correct and complete in our analysis, they
do have two notable biases: they are strict in what they consider a synonym (e.g.,
“humour” as British comedy is not a synonym of “humor” as American comedy) and
they may focus more on finding synonyms of popular, mature tags.
We write the synonym set of ti, including itself, as S(ti). We calculate the entropy
H(ti) (based on the probability p(tj) of each tag) of a synonym set S(ti) as:

p(tj) = oc(tj) / Σ_{tk ∈ S(tj)} oc(tk)

H(ti) = − Σ_{tj ∈ S(ti)} p(tj) log2 p(tj)
H(ti) measures the entropy of the probability distribution that we get when we assume
that an annotator will choose a tag at random from a synonym set with probability
in proportion to its object count. For example, if there are two equally likely tags in
a synonym set, H(ti) = 1. If there are four equally likely tags, H(ti) = 2. The higher
the entropy, the more uncertainty that an annotator will have in choosing which tag to
annotate from a synonym set, and the more uncertainty a user will have in determining
which tag to use to find the right objects. We believe low entropy is generally better
than high entropy, though it may be desirable under some circumstances (like query
expansion) to have high entropy synonym sets.

Figure 4.1: Synonym set frequencies. ("Frequency of Count" is the number of times synonym sets of the given size occur.)

Figure 4.2: Tag frequency versus synonym set size.
Details
Due to the lack of a controlled vocabulary, tags will inevitably have synonymous
forms. The best we can hope for is that users ultimately “agree” on a single form,
by choosing one form over the others much more often. For example, we hope that if
the tag “fiction” annotates 500 works about fiction, that perhaps 1 or 2 books might
be tagged “fictionbook” or another uncommon synonym. For this experiment, we use
the top 2000 LibraryThing tags and their synonyms.
Most tags have no synonyms, though a minority have as many as tens of synonyms
(Figure 4.1). The largest synonym set is 70 tags (synonyms of "19th century").
Contrary to what one might expect, |S(ti)| is not strongly correlated with oc(ti),
as shown in Figure 4.2 (Kendall's τ ≈ 0.208).
Figure 4.3: H(ti) (Top 2000, ≠ 0)
Figure 4.3 is a histogram of the entropies of the top 2000 tags, minus those synonym
sets with an entropy of zero. In 85 percent of cases, H(ti) = 0. The highest
entropy synonym set, at H(ti) = 1.56, is the synonym set for the tag "1001bymrbfd,"
or "1001 books you must read before you die." Fewer than fifteen tags (out of
2000) have an entropy above 0.5. The extremely low entropies of most synonym sets
suggest that most tags have a relatively definitive form.
4.3.2 Cross-System Annotation Use
Summary
Library Feature: Across tagging systems, we would like to see the systems use
the same vocabulary of tags because they are annotating the same type of objects—
works.
Result: The top 500 tags of LibraryThing and Goodreads have an intersection of
almost 50 percent.
Conclusion: Similar systems have similar tags, though tagging system owners should
encourage short tags.
Preliminaries: Information Integration
Federation is when multiple sites share data in a distributed fashion allowing them
to combine their collections. Information integration is the process of combining,
de-duplicating, and resolving inconsistencies in the shared data. Two useful features
for information integration are consistent cross-system annotation use and consis-
tent cross-system object annotation. We say two systems have consistent annotation
use if the same annotations are used overall in both systems (this section). We say
two systems have consistent object annotation if the same object in both systems is
annotated similarly (Section 4.3.3). Libraries achieve these two features through “au-
thority control” (the process of creating controlled lists of headings) and professional
catalogers.
Details
For both LibraryThing and Goodreads, we look at the top 500 tags by object count.
Ideally, a substantial portion of these tags would be the same, suggesting similar
tagging practices. Differences in the works and users in the two systems will lead to
some differences in tag distribution. Nonetheless, both are mostly made up of general
interest books and similar demographics.
The overlap between the two sets is 189 tags, or about 38 percent of each top 500
list.3 We can also match by determining if a tag in one list is in the synonym set of
a tag in the other list. This process leads to higher overlap—231 tags, or about 46
percent. The higher overlap suggests “combiners” are more helpful for integrating two
systems than for improving navigation within their own system. An overlap of nearly
50 percent of top tags seems quite high to us, given that tags come from an unlimited
vocabulary, and books can come from the entire universe of human knowledge.
Much of the missed overlap can be accounted for by the prevalence of multi-word
tags on Goodreads. Multi-word tags lead to less overlap with other users, and less
overlap across systems. We compute the number of words in a tag by splitting on
spaces, underscores, and hyphens. On average, tags in the intersection of the two
systems have about 1.4 words. However, tags not in the intersection have an average
of 1.6 words in LibraryThing, and 2.3 words in Goodreads. This implies that for
tagging to be federated across systems, users should be encouraged to use fewer words.
3 Note that comparing sets at the same 500 tag cutoff may unfairly penalize border tags (e.g., "vampires" might be tag 499 in LT but tag 501 in GR). We use the simpler measurement above, but we also conducted an analysis comparing, e.g., the top 500 in one system to the top 1000 in the other system. Doing so increases the overlap by ≈ 40 tags.
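The overlap and word-count measurements above can be sketched as follows, with tiny invented top-tag lists and synonym sets standing in for the real top-500 data.

```python
import re

# Invented stand-ins for the real top tag lists of the two systems.
lt_top = ["fiction", "non-fiction", "history", "sci-fi", "humour"]
gr_top = ["fiction", "history", "fantasy", "to-read", "humor"]
synonyms = {"humour": {"humor"}, "humor": {"humour"}}

# Direct overlap, then overlap allowing a match via a tag's synonym set.
direct = set(lt_top) & set(gr_top)
via_synonyms = {t for t in lt_top
                if t in gr_top or synonyms.get(t, set()) & set(gr_top)}

def word_count(tag):
    # Split on spaces, underscores, and hyphens, as in the text.
    return len([w for w in re.split(r"[ _-]", tag) if w])

print(sorted(direct))                    # ['fiction', 'history']
print(sorted(via_synonyms))              # adds 'humour' via its synonym set
print(word_count("class reading list"))  # 3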
While there are 231 tags in the overlap between the systems (with synonyms),
it is also important to know if these tags are in approximately the same ranking.
Is "fantasy" used substantially more than "humor" in one system? We computed
a Kendall's τ rank correlation of τ ≈ 0.44 between the LibraryThing and Goodreads
rankings of the 231 tags in the overlap. This means that if we choose
any random pair of tags in both rankings, it is a little over twice as likely that the
pair of tags is in the same order in both rankings as it is that the pair will be in a
different order.
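The "twice as likely" reading follows because, for Kendall's τ, a random pair is concordant with probability (1 + τ)/2, giving a concordant-to-discordant ratio of (1 + τ)/(1 − τ). A sketch, with a naive O(n²) τ computation over invented rankings:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Naive Kendall's tau over the items shared by two rankings (no ties)."""
    items = [x for x in rank_a if x in rank_b]
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        same_order = (rank_a.index(x) < rank_a.index(y)) == \
                     (rank_b.index(x) < rank_b.index(y))
        if same_order:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Identical rankings give tau = 1, reversed rankings give tau = -1.
print(kendall_tau(["fantasy", "humor", "history"],
                  ["fantasy", "humor", "history"]))  # 1.0

# At tau = 0.44, the concordant-to-discordant ratio is:
tau = 0.44
print(round((1 + tau) / (1 - tau), 2))  # 2.57
```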
4.3.3 Cross-System Object Annotation
Summary
Library Feature: We would like annotators to be consistent, in particular, the same
work in two different tagging systems should be annotated with the same, or a similar
distribution, of tags. In other words, does “Winnie-the-Pooh” have the same set of
tags in LibraryThing and Goodreads?
Result: Duplicate objects across systems have low Jaccard similarity in annotated
tags, but high cosine similarity.
Conclusion: Annotation practices are similar across systems for the most popular
tags of an object, but often less so for less common tags for that object.
Details
We limited our analysis to works in both LibraryThing and Goodreads, where
Goodreads has at least 25 tags for each book. This results in 787 works. Ideally,
for each work, the tags would be almost the same, implying that given the same
source object, users of different systems will tag similarly.
Figures 4.4, 4.5, and 4.6 show distributions of similarities of tag annotations for
the same works across the systems. We use Jaccard similarity for set similarity (i.e.,
each annotation counts as zero or one), and cosine similarity for similarity with bags
(i.e., counts). Because the distributions are peaked, Jaccard similarity measures how
many annotations are shared, while cosine similarity measures overlap of the main
annotations.

Figure 4.4: Distribution of same book similarities using Jaccard similarity over all tags.

Figure 4.5: Distribution of same book similarities using Jaccard similarity over the top twenty tags.

Figure 4.6: Distribution of same book similarities using cosine similarity over all tags.
Figure 4.4 shows that the Jaccard similarity of the tag sets for a work in the
two systems is quite low. For example, about 150 of the 787 works have a Jaccard
similarity of the two tag sets between 0.02 and 0.03. One might expect that the issue
is that LibraryThing has far more tags than Goodreads, and that these tags increase
the size of the union substantially. To control for this, in Figure 4.5 we take the
Jaccard similarity of the top 20 tags for each work. Nonetheless,
this does not hugely increase the Jaccard value in most cases. Figure 4.6 shows the
distribution of cosine similarity values. (We treat tags as a bag of words and ignore
three special system tags.) Strikingly, the cosine similarity for the same work is
actually quite high. This suggests that for the same work, the most popular tags are
likely to be quite popular in both systems, but that overall relatively few tags for a
given work will overlap.
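The two similarity measures can be sketched as follows; the tag bags are invented to illustrate why cosine stays high while Jaccard stays low for the same work.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Set similarity: shared tags over all tags used for the work."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Bag similarity over tag counts: dominated by the most popular tags."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

# Hypothetical tag bags for one work in two systems: the head of the
# distribution agrees, the long tail does not.
lt = ["fiction"] * 90 + ["bears"] * 10 + ["hundred acre wood"]
gr = ["fiction"] * 80 + ["bears"] * 5 + ["childhood"]

print(round(jaccard(lt, gr), 2))  # 0.5 here; far lower with real long tails
print(round(cosine(lt, gr), 2))   # 1.0: the popular tags dominate
```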
4.3.4 $-tag Annotation Overlap
Summary
Library Feature: We would like paid taggers to be able to annotate objects in a
way that is consistent with users. This reduces dependence on users, and means that
unpopular objects can be annotated for a fee.
Result: $-tags produced by paid taggers overlap with user tags on average 52 percent
of the time.
Conclusion: Tagging systems can use paid taggers.
Preliminaries: $-tag Tagging Setup
This section asks whether the terms used when paid taggers annotate objects with
$-tags are the same as the terms used when regular users annotate objects with
user tags. We randomly selected works from the “min100” dataset with at least three
unique li ∈ LLM . We then showed paid taggers (in our case, Mechanical Turk workers)
a search for the work (by ISBN) on Google Book Search and Google Product Search,
two searches which generally provide a synopsis and reviews, but do not generally
provide library metadata like subject headings.

Figure 4.7: Overlap Rate Distribution.
provide library metadata like subject headings. The paid taggers were asked to add
three $-tags which described the given work. Each work was labeled by at least three
paid taggers, but different paid taggers could annotate more or fewer books (this is
standard on Mechanical Turk). We provided 2,000 works to be tagged with 3
$-tags each. Some paid taggers provided an incomplete set of $-tags, leading to a
total of 16,577 $-tags. Paid taggers spent ≈ 90 seconds per work, and we usually
spent less than $0.01 per $-tag/work pair. (We analyze $-tags in Sections 4.3.4, 4.4.2,
and 4.4.3.)
Details
On average, 52% of $-tags matched a tag ti already applied to the work at least
once (standard deviation 0.21). Thus, paid taggers, who in the vast majority of
cases had not read the book, overlapped with real book readers more than half
the time in the $-tags they applied. A natural followup question is whether some
workers are much better at paid tagging than others. We found a range of “overlap
rates” among paid taggers (shown in Figure 4.7), but we are unsure whether higher
performance could be predicted in advance.
4.4 Experiments: Quality
The experiments in this section look at quality :
Section 4.4.1 Are the bulk of tags of high quality types? For example, are subjective
tags like “stupid” common?
Section 4.4.2 Are $-tags high quality in comparison to library annotations and user
tags?
Section 4.4.3 Can we characterize high quality user tags?
4.4.1 Objective, Content-based Groups
Summary
Library Feature: Works should be organized objectively based on their content.
For example, we would prefer a system with groups of works like “History” and
“Biography,” to one with groups of works like “sucks” and “my stuff.”
Result: Most tags in both of our social cataloging sites were objective and content-
based. Not only are most very popular tags (oc(ti) > 300) objective and content-
based, but so are less popular and rare tags.
Conclusion: Most tags, rather than merely tags that become very popular, are
objective and content-based, even if they are only used a few times by one user.
Preliminaries: Tag Types
We divide tags into six types:
Objective and Content-based Objective means not depending on a particular an-
notator for reference. For example, “bad books” is not an objective tag (because
one needs to know who thought it was bad), whereas “world war II books” is
an objective tag. Content-based means relating to the book contents (e.g., the
story, facts, genre). For example, “books at my house” is not a content-based
tag, whereas “bears” is.
Opinion The tag implies a personal opinion. For example, “sucks” or “excellent.”
Personal The tag relates to personal or community activity or use. For example,
“my book”, “wishlist”, “mike’s reading list”, or “class reading list”.
Physical The tag describes the book physically. For example, “in bedroom” or
“paperback”.
78 CHAPTER 4. TAGGING HUMAN KNOWLEDGE
                                 LT%     GR%
Objective, Content of Book      60.55   57.10
Personal or Related to Owner     6.15   22.30
Acronym                          3.75    1.80
Unintelligible or Junk           3.65    1.00
Physical (e.g., “Hardcover”)     3.55    1.00
Opinion (e.g., “Excellent”)      1.80    2.30
None of the Above                0.20    0.20
No Annotator Majority           20.35   14.30
Total                          100     100

Table 4.1: Tag types for top 2000 LibraryThing and top 1000 Goodreads tags as percentages.
Acronym The tag is an acronym that might mean multiple things. For example,
“sf” or “tbr”.
Junk The tag is meaningless or indecipherable. For example, “b” or “jiowefijowef”.
Details
If a tagging system is primarily made up of objective, content-based tags, then it is
easier for users to find objects. In a library system, all annotations are objective and
content-based in that they do not depend on reference to the annotator, and they
refer to the contents of the book.
To produce an unbiased view of the types of tags in our sites, we used Mechanical
Turk. We submitted the top 2,000 LibraryThing tags and top 1,000 Goodreads tags
by annotation count to be evaluated. We also sampled 1,140 LibraryThing tags,
20 per rounded value of log(oc(ti)), from 2.1 to 7.7. We say a worker provides a
determination of the answer to a task (for example, the tag “favorite” is an opinion).
Overall, 126 workers examined 4,140 tags, five workers to a tag, leading to a total of
20,700 determinations. We say the inter-annotator agreement is the pair-wise fraction
of times two workers provide the same answer. The inter-annotator agreement rate
was about 65 percent.
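The inter-annotator agreement computation can be sketched as follows (a minimal illustration; the function names and the toy labels are ours, not from the study):

```python
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of worker pairs that gave the same answer for one tag."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def mean_agreement(determinations):
    """Average the per-tag pairwise agreement over all tags."""
    rates = [pairwise_agreement(v) for v in determinations.values()]
    return sum(rates) / len(rates)

# Five workers labeled one tag; 6 of the C(5,2) = 10 pairs agree.
labels = ["opinion", "opinion", "opinion", "personal", "opinion"]
print(pairwise_agreement(labels))  # 0.6
```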
Table 4.1 shows the proportion of top tags by type for LibraryThing and Goodreads.
Figure 4.8: Conditional density plot [39] showing the probability of (1) annotators agreeing a tag is objective, content-based, (2) annotators agreeing on another tag type, or (3) no majority of annotators agreeing.
For example, for 60.55% of the top 2000 LibraryThing tags (i.e., 1211/2000), at least three
of five workers agreed that the tag was objective and content-based. The results show
that regardless of the site, a majority of tags tend to be objective, content-based tags.
In both sites, about 60 percent of the tags examined were objective and content-based.
Interestingly, Goodreads has a substantially higher number of “personal” tags than
LibraryThing. We suspect that this is because Goodreads calls tags “bookshelves”
in their system.
Even if we look at tags ranging from oc(ti) = 8 to oc(ti) = 2208, as shown in Figure
4.8, the proportion of objective, content-based tags remains very high. That figure
shows the probability that a tag will be objective and content-based conditioned on
knowing its object count. For example, a tag annotating 55 objects has about a 50
percent chance of being objective and content-based.
4.4.2 Quality Paid Annotations
Summary
Library Feature: We would like to purchase annotations of equal or greater quality
to those provided by users.
Result: Judges like $-tags as much as subject headings.
Conclusion: Paid taggers can annotate old objects where users do a poor job of
providing coverage and new objects which do not yet have tags. Paid taggers can
quickly and inexpensively tag huge numbers of objects.
Preliminaries: $-tag Judging Setup
In this section and the next, we evaluate the relative perceived helpfulness of annota-
tions ti ∈ T , $i ∈ $ and li ∈ LLM . We randomly selected 60 works with at least three
tags ti ∈ T and three LCSH terms li ∈ LLM from our “min100” dataset.
We created tasks on the Mechanical Turk, each of which consisted of 20 subtasks
(a “work set”), one for each of 20 works. Each subtask consisted of a synopsis of the
work oi and an annotation evaluation section. A synopsis consisted of searches over
Google Books and Google Products as in Section 4.3.4. The annotation evaluation
section showed nine annotations in random order, three each from T (oi), $(oi), and
LLM(oi), and asked how helpful the given annotation would be for finding works
similar to the given work oi on a scale of 1 (“not at all helpful”) to 7 (“extremely
helpful”).
We removed three outlier evaluators who either skipped excessive numbers of evaluations
or awarded excessive numbers of the highest score. Remaining missing values
were replaced by group means. That is, a missing value for a work/annotation/evaluator
triplet was replaced by the mean of the helpfulness scores from the evaluators
who had provided scores for that work/annotation pair. We abbreviate “helpfulness score” as h-score
in the following. We say that annotations ti ∈ T , $i ∈ $, and li ∈ LLM differ in their
annotation type.
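The group-mean imputation described above can be sketched as follows (the data layout is our own; we assume the mean is taken over the evaluators who did score the same work/annotation pair):

```python
def impute_group_means(h_scores):
    """h_scores: {(work, annotation): {evaluator: h-score or None}}.
    Fill a missing score with the mean over the evaluators who did
    score that (work, annotation) pair."""
    for per_eval in h_scores.values():
        present = [h for h in per_eval.values() if h is not None]
        group_mean = sum(present) / len(present)
        for evaluator, h in per_eval.items():
            if h is None:
                per_eval[evaluator] = group_mean
    return h_scores

scores = {("o1", "t_history"): {"e1": 6, "e2": None, "e3": 4}}
impute_group_means(scores)
print(scores[("o1", "t_history")]["e2"])  # (6 + 4) / 2 = 5.0
```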
Details
In order to understand the perceived quality of $-tags, we wondered if, given the works
that each evaluator saw, they tended to prefer $-tags, user tags, or LCSH on average.
To answer this question, we produced a mean of means for each annotation type
(i.e., $-tags, user tags, and LCSH main topics) to compare to the other annotation
types. We do so by averaging the annotations of a given type within a given evaluator
H-Scores (by Evaluator)    µ      SD
User Tags                 4.46   0.75
LCSH Main Topics          5.18   0.76
$-tags                    5.22   0.83

Table 4.2: Basic statistics for the mean h-score assigned by evaluators to each annotation type. Mean (µ) and standard deviation (SD) are abbreviated.
(i.e., to determine what that evaluator thought) and then by averaging the averages
produced by each evaluator across all evaluators.
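The mean-of-means computation can be sketched as follows (the triple-based data layout is our own simplification):

```python
from collections import defaultdict

def mean_of_means(scores):
    """scores: list of (evaluator, annotation_type, h_score) triples.
    First average within each (evaluator, type), then average those
    per-evaluator means across evaluators."""
    per_eval = defaultdict(list)
    for evaluator, ann_type, h in scores:
        per_eval[(evaluator, ann_type)].append(h)
    by_type = defaultdict(list)
    for (evaluator, ann_type), hs in per_eval.items():
        by_type[ann_type].append(sum(hs) / len(hs))
    return {t: sum(ms) / len(ms) for t, ms in by_type.items()}

scores = [("e1", "user", 4), ("e1", "user", 6),  # e1 mean for user tags: 5
          ("e2", "user", 3),                     # e2 mean for user tags: 3
          ("e1", "$", 7), ("e2", "$", 5)]
print(mean_of_means(scores))  # {'user': 4.0, '$': 6.0}
```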
Table 4.2 summarizes the basic statistics by annotation type. For example, the
mean evaluator assigned a mean score of 4.46 to user tags, 5.18 to LCSH main topics,
and 5.22 to $-tags. At least for our 60 works, $-tags are perceived as being about
as helpful as LCSH library annotations, and both are perceived as better than user
tags (by about 0.6 h-score). A repeated measures ANOVA showed annotation type
differences in general to be significant, and all differences between mean h-scores by
annotation type were significant (p < 0.001) with the exception of the difference
between $-tags and LCSH main topics.
4.4.3 Finding Quality User Tags
Summary
Library Feature: We would like tag annotations to be viewed as competitive in
terms of perceived helpfulness with annotations provided by expert taxonomists.
Result: Moderately common user tags are perceived as more helpful than both LCSH
and $-tags.
Conclusion: Tags may be competitive with manually entered metadata created by
paid taggers and experts, especially when information like frequency is taken into
account.
H-Scores              µ      SD     µ 95% CI
$-tags               4.93   1.92   (4.69, 5.17)
Rare User Tags       4.23   2.11   (3.97, 4.50)
Moderate User Tags   5.80   1.47   (5.63, 5.98)
Common User Tags     5.27   1.72   (5.05, 5.48)
LCSH Main Topics     5.13   1.83   (4.91, 5.36)

Table 4.3: Basic statistics for the mean h-score assigned to a particular annotation type with user tags split by frequency. Mean (µ) and standard deviation (SD) are abbreviated.
Details
Section 4.4.2 would seem to suggest that tags ti ∈ T are actually the worst possible
annotation type because the average evaluator gave $-tags and LCSH main topics
a mean h-score 0.6 higher than user tags. Nonetheless, in practice we found that
tags ti ∈ T (oi) often had higher h-scores for the same object oi than corresponding
annotations $i ∈ $(oi) and li ∈ LLM(oi). It turns out that this discrepancy can be
explained in large part by the popularity of a user tag.
We define pop(oi, tm) to be the percentage of the time that tag tm is assigned to
object oi. For example, if an object oi has been annotated (“food”, “food”, “cuisine”,
“pizza”) then we would say that pop(oi, tfood) = 2/4. We partitioned the h-scores for T
into three sets based on the value pop(oi, tm) of the annotation. Those sets were user
tag annotations with pop(oi, tm) < 0.11 (“rare”), those with 0.11 ≤ pop(oi, tm) < 0.17
(“moderate”), and those with 0.17 ≤ pop(oi, tm) (“common”).4
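pop(oi, tm) and the three popularity bands can be sketched as follows (function names are ours; the thresholds are the ones given above):

```python
from collections import Counter

def pop(annotations, tag):
    """Fraction of an object's tag annotations that are `tag`."""
    counts = Counter(annotations)
    return counts[tag] / len(annotations)

def popularity_band(p, lo=0.11, hi=0.17):
    # From the text: rare < 0.11 <= moderate < 0.17 <= common.
    if p < lo:
        return "rare"
    return "moderate" if p < hi else "common"

anns = ["food", "food", "cuisine", "pizza"]
print(pop(anns, "food"))                   # 0.5, i.e. 2/4
print(popularity_band(pop(anns, "food")))  # common
```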
Table 4.3 shows the basic statistics with these more fine grained categories on a per
evaluation basis (i.e., not averaging per annotator). For example, the 95% confidence
interval for the mean h-score of moderate popularity user tags is (5.63, 5.98), and the
mean h-score of $-tags is 4.93 in our sample. The ANOVA result, Welch-corrected to
adjust for unequal variances within the five annotation types, is (WelchF (4, 629.6) =
26.2; p < .001). All differences among these finer grained categories are significant,
4H-scores were sampled for the “common” set for analysis due to large frequency differences between rare user tags and more common tags. Values of pop(oi, tj) varied between less than 1 percent and 28 percent in our evaluated works.
with the exception of common user tags versus LCSH, common user tags versus $-
tags, and LCSH main topics versus $-tags.
Using the finer grained categories in Table 4.3 we can now see that moderately
common user tags are perceived as better than all other annotation types. (Fur-
thermore, rare user tags were dragging down the average in the analysis of Section
4.4.2.) We speculate that rare user tags are too personal and common user tags too
general. Despite some caveats (evaluators do not read the work, value of annotations
changes over time, works limited by LibraryThing availability), we are struck by the
fact that evaluators perceive moderately common user tags to be more helpful than
professional, expert-assigned library annotations.
4.5 Experiments: Completeness
The experiments in this section look at completeness:
Section 4.5.1 Do user tag annotations cover many of the same topics as professional
library annotations?
Section 4.5.2 Do user tags and library annotations corresponding to the same topic
annotate the same objects?
4.5.1 Coverage
Summary
Library Feature: We believe that after decades of consensus, libraries have roughly
the right groups of works. A system which attempts to organize works should end up
with groups similar to or a superset of library terms.
Result: Many top tags have equivalent (see below) library terms. Tags contain more
than half of the tens level DDC headings. There is a corresponding LCSH heading
for more than 65 percent of top objective, content-based tags.
Conclusion: Top tags often correspond to library terms.
Preliminaries: Containment and Equivalence
Our goal is to compare the groups formed by user tags and those formed by library
annotations. For instance, is the group defined by tag “History of Europe” equivalent
to the group formed by the library term “European History?” We can take two
approaches to defining equivalence. First, we could say that group g1 is equivalent
to group g2 if they both contain the same objects (in a given tagging system). By
this definition, the group “Art” could be equivalent to the group “cool” if users had
tagged all works annotated with the library term “Art” with the tag “cool.” Note
that this definition is system specific.
A second approach is to say that group g1 is equivalent to group g2 if the names
g1 and g2 “semantically mean the same.” Under this definition, “cool” and “Art”
are not equivalent, but “European History” and “History of Europe” are. The latter
equivalence holds even if there are some books that have one annotation but not the
other. For this definition of equivalence we assume there is a semantic test m(a, b)
that tells us if names a and b “semantically mean the same.” (We implement m by
asking humans to decide.)
In this chapter we use the second definition of equivalence (written g1 = g2). We
do this because we want to know to what extent library terms exist which are seman-
tically equivalent to tags (Section 4.5.1) and to what extent semantically equivalent
groups contain similar objects (Section 4.5.2).
When we compare groups, not only are we interested in equivalence, but also in
“containment.” We again use semantic definitions: We say a group g1 contains group
g2 (written g1 ⊇ g2) if a human that annotates an object o with g2 would agree that o
could also be annotated with g1. Note that even though we have defined equivalence
and containment of groups, we can also say that two annotations are equivalent or
contain one another if the groups they name are equivalent or contain one another.
Preliminaries: Gold Standard (tj, li) Relationships
In this section and the next, we look at the extent to which tags tj and library terms
li satisfy similar information needs. We assume a model where users find objects
(a) Sampled Containment Relationships (con-pairs)

Tag          Contained Library Term
spanish      romance → spanish (lc pc 4001.0-4978.0)
pastoral     pastoral theology (lc bv 4000.0-4471.0)
civil war    united states → civil war period, 1861-1865 → civil war, 1861-1865 → armies. troops (lc e 491.0-587.0)
therapy      psychotherapy (lcsh)
chemistry    chemistry → organic chemistry (lc qd 241.0-442.0)

(b) Sampled Equivalence Relationships (eq-pairs)

Tag              Equivalent Library Term
mammals          zoology → chordates. vertebrates → mammals (lc ql 700.0-740.8)
fitness          physical fitness (lcsh)
catholic church  catholic church (lcsh)
golf             golf (lcsh)
astronomy        astronomy (lc qb 1.0-992.0)

Table 4.4: Randomly sampled containment and equivalence relationships for illustration.
using single annotation queries. If tj = li for a given (tj, li), we say (tj, li) is an
eq-pair. If tj ⊇ li for a given (tj, li), we say (tj, li) is a con-pair. In this section, we
look for and describe eq-pairs (where both annotations define the same information
need) and con-pairs (where a library term defines a subset of an information need
defined by a tag). In Section 4.5.2, we use these pairs to evaluate the recall of single
tag queries—does a query for tag tj return a high proportion of objects labeled with
library terms equivalent or contained by tj? For both sections, we need a set of gold
standard eq- and con-pairs.
Ideally, we would identify all eq- and con-pairs (tj, li) ∈ T × L. However, this is
prohibitively expensive. Instead, we create our gold standard eq- and con-pairs as
follows:
Step 1 We limit the set of tags under consideration. Specifically, we only look at tags
in T738: the 738 tags from the top 2,000 which were unanimously considered
objective and content-based in Section 4.4.1. (These 738 tags are present in
about 35% of tag annotations.)
Step 2 We identify (tj, li) pairs that are likely to be eq- or con-pairs based on how tj
and li are used in our dataset. First, we drop all (tj, li) pairs that do not occur
together on at least 15 works. Second, we look for (tj, li) pairs with high values
of q(tj, li) = (P(tj, li) − P(tj)P(li)) × |O(li)|. q(tj, li) is inspired by leverage
(P(tj, li) − P(tj)P(li)) from the association rule mining community [59], though
with bias (|O(li)|) towards common relationships. We drop all (tj, li) pairs that
do not have q(tj, li) in the top ten for a given tag tj.
Step 3 We (the researchers) manually examine pairs output from Step 2 and judge
if they are indeed eq- or con-pairs. At the end of this step, our gold standard
eq- and con-pairs have been determined.
Step 4 We evaluate our gold standard using Mechanical Turk workers. We do not
change any eq- or con-pair designations based on worker input, but this step
gives us an indication of the quality of our gold standard.
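The Step 2 filter can be sketched as follows (the count-based inputs and function names are ours, not from the original pipeline):

```python
def q_score(n_both, n_tag, n_lib, n_works):
    """Leverage biased toward common library terms:
    q(t, l) = (P(t, l) - P(t)P(l)) * |O(l)|."""
    p_both = n_both / n_works
    p_tag = n_tag / n_works
    p_lib = n_lib / n_works
    return (p_both - p_tag * p_lib) * n_lib

def candidate_pairs(cooccur, tag_counts, lib_counts, n_works,
                    min_cooccur=15, top_k=10):
    """cooccur: {(tag, lib_term): number of works annotated with both}.
    Keep, for each tag, the top_k pairs by q among those co-occurring
    on at least min_cooccur works."""
    per_tag = {}
    for (t, l), n in cooccur.items():
        if n < min_cooccur:
            continue  # first filter: co-occurrence on >= 15 works
        q = q_score(n, tag_counts[t], lib_counts[l], n_works)
        per_tag.setdefault(t, []).append((q, l))
    return {t: sorted(pairs, reverse=True)[:top_k]
            for t, pairs in per_tag.items()}

# e.g., a pair seen together on 20 of 1000 works, with the tag on 100
# works and the library term on 50: q = (0.02 - 0.1 * 0.05) * 50 = 0.75
```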
The filtering procedures in Steps 1 and 2 allowed us to limit our manual evaluation
to 5,090 pairs in Step 3. (Though the filtering procedures mean we are necessarily
providing a lower bound on the eq- and con-pairs present in the data.) In Step 3,
we found 2,924 con-pairs and 524 eq-pairs. (Table 4.4 shows random samples of
relationships produced.)
To evaluate our gold standard in Step 4, we provided Mechanical Turk workers
with a random sample of eq- and con-pairs from Step 3 in two scenarios. In a true/false
validation scenario, the majority of 20 workers agreed with our tj = li and tj ⊇ li
judgments in 64/65 = 98% of cases. However, they said that tj = li when tj ≠ li or tj ⊇ li
when tj ⊉ li in 34/90 = 38% of cases, making our gold standard somewhat conservative.
A χ2 analysis of the relationship between the four testing conditions (true con-pair,
false con-pair, true eq-pair, and false eq-pair) shows a strong correlation between
containment/equivalence examples and true/false participant judgments (χ2(3) =
45.3, p < .001). In a comparison scenario where workers chose which of two pairs
they preferred to be an eq- or con-pair, the majority of 30 workers agreed with our
judgments in 138/150 = 92% of cases.
Details
In this analysis, we ask if tags correspond to library annotations. We ask this question
in two directions: how many top tags have equivalent or contained library annotations,
and how many of the library annotations are contained or equivalent to top tags?
Assuming library annotations represent good topics, the first direction asks if top
tags represent good topics, while the second direction asks what portion of those
good topics are represented by top tags.
In this section and the next, we use an imaginary “System I” to illustrate coverage
and recall. System I has top tags {t1, t2, t3, t4}, library terms {l1, l2, l3, l4, l5, l6}, eq-
pairs {t1 = l1, t2 = l2}, and con-pairs {t3 ⊇ l3, t1 ⊇ l5}. Further, l3 ⊇ l4 based
on hierarchy or other information (perhaps l3 might be “History” and l4 might be
“European History”).
Looking at how well tags represent library terms in System I, we see that 2 of
the 4 unique tags appear in eq-pairs, so 2/4 of the tags have equivalent library terms.
Going in the opposite direction, we see that 2 out of 6 library terms have equivalent
tags, so what we call eq-coverage below is 2/6. We also see that 2 of the library terms
(l3, l5) are directly contained by tags, and in addition another term (l4) is contained
by l3. Thus, a total of 3 library terms are contained by tags. We call this fraction,
3/6, the con-coverage.
We now report these statistics for our real data. Of 738 tags in our data set, 373
appear in eq-pairs. This means at least half (373/738) of the tags have equivalent library
terms.5
To go in the opposite direction, we compute coverage by level in the library term
hierarchy, to gain additional insights. In particular, we use DDC terms which have
an associated value between 0 and 1000. As discussed in Section 2.1, if the value is
5Note that this is a lower bound; based on techniques in Chapter 5, we suspect more than 503/738
tags have equivalents.
               X00    XX0    XXX
Con-Coverage   0.3    0.65   0.677
Eq-Coverage    0.1    0.28   0.021

Table 4.5: Dewey Decimal Classification coverage by tags.
of the form X00, then the term is high level (e.g., 800 is Language and Literature);
if the value is of the form XX0 it is lower level, and so on (e.g., 810 is American
and Canadian Literature). We thus group the library terms into three sets, LX00,
LXX0 and LXXX . (Set LX00 contains all terms with numbers of the form X00).
For Lrs ∈ {LX00, LXX0, LXXX} being one of these groups, we define two metrics for
coverage:
concoverage(Lrs) = ( Σ_{li ∈ Lrs} 1{∃ tj ∈ T s.t. tj ⊇ li} ) / |Lrs|

eqcoverage(Lrs) = ( Σ_{li ∈ Lrs} 1{∃ tj ∈ T s.t. tj = li} ) / |Lrs|
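These two metrics, evaluated on the System I example above, can be sketched as follows (the helper names and the explicit hierarchy argument are ours):

```python
def con_coverage(lib_terms, con_contained, hierarchy_contained=()):
    """Fraction of lib_terms contained by some tag, either directly
    (con_contained) or via a term the hierarchy places under a
    tag-contained term (hierarchy_contained)."""
    covered = set(con_contained) | set(hierarchy_contained)
    return len(covered & set(lib_terms)) / len(lib_terms)

def eq_coverage(lib_terms, eq_terms):
    """Fraction of lib_terms with a semantically equivalent tag."""
    return len(set(eq_terms) & set(lib_terms)) / len(lib_terms)

# System I: eq-pairs t1 = l1, t2 = l2; con-pairs t3 ⊇ l3, t1 ⊇ l5;
# and l3 ⊇ l4 from the hierarchy.
L = ["l1", "l2", "l3", "l4", "l5", "l6"]
print(eq_coverage(L, ["l1", "l2"]))           # 2/6 ≈ 0.333
print(con_coverage(L, ["l3", "l5"], ["l4"]))  # 3/6 = 0.5
```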
Table 4.5 shows these metrics for our data. For example, the first row, second column
says that 65/100 of XX0 DDC terms are contained by a tag. (More specifically, 65 percent
of XX0 terms have this property: the term li is in LXX0 and there is a con-pair (tj, li)
in our gold standard, or there is a (tj, lk) con-pair where lk ∈ LX00 and lk ⊃ li.) The
second row, third column says that 21/1000 DDC ones-level terms li ∈ LXXX have an
eq-pair (tj, li). About one quarter of XX0 DDC terms have equivalent T738 tags.
4.5.2 Recall
Summary
Library Feature: A system should not only have the right groups of works, but it
should have enough works annotated in order to be useful. For example, a system
with exactly the same groups as libraries, but with only one work per group (rather
than, say, thousands) would not be very useful.
Result: Recall is low (10 to 40 percent) using the full dataset. Recall is high (60 to
100 percent) when we focus on popular objects (min100).
Conclusion: Tagging systems provide excellent recall for popular objects, but not
necessarily for unpopular objects.
Preliminaries: Recall
Returning to our System I example, say that l1 annotates {o1, o3}, and l5 annotates
{o4, o5}. Because t1 is equivalent to l1, and contains l5, we expect that any work
labeled with either l1 or l5 could and should be labeled with t1. We call o1, o3, o4, o5
the potential objects for tag t1. Our goal is to see how closely the potential object
set actually matches the set of objects tagged with t1. For instance, suppose that t1
actually annotates {o1, o2}. Since t1 annotates one of the four potential works, we
say that recall(t1) = 1/4.
More formally, if li = tj, then we say li ∈ E(tj). If tj ⊇ li, then we say li ∈ C(tj).
Any object annotated with terms from either E(tj) or C(tj) should also have a tag tj.
Hence, the potential object set for a tag based on its contained or equivalent library
terms is:
Ptj = ⋃_{li ∈ E(tj) ∪ C(tj)} O(li)
We define recall to be the recall of a single tag query on relevant objects according
to our gold standard library data:
recall(tj) = |O(tj) ∩ Ptj| / |Ptj|
The Jaccard similarity between the potential object set and the objects
annotated by the tag is:

J(O(tj), Ptj) = |O(tj) ∩ Ptj| / |O(tj) ∪ Ptj|
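The potential object set, recall, and Jaccard definitions, evaluated on the System I example, can be sketched as follows (the dictionary layouts are ours):

```python
def potential_objects(tag, eq, con, obj_of):
    """Union of objects annotated by library terms equivalent to
    (eq[tag]) or contained by (con[tag]) the given tag.
    obj_of: {lib_term: set of objects annotated with that term}."""
    terms = eq.get(tag, set()) | con.get(tag, set())
    out = set()
    for l in terms:
        out |= obj_of[l]
    return out

def recall(tagged, potential):
    return len(tagged & potential) / len(potential)

def jaccard(tagged, potential):
    return len(tagged & potential) / len(tagged | potential)

# System I: t1 = l1 and t1 ⊇ l5; l1 annotates {o1, o3}, l5 {o4, o5}.
obj_of = {"l1": {"o1", "o3"}, "l5": {"o4", "o5"}}
P = potential_objects("t1", {"t1": {"l1"}}, {"t1": {"l5"}}, obj_of)
tagged = {"o1", "o2"}            # objects actually tagged with t1
print(recall(tagged, P))         # 1/4 = 0.25
print(jaccard(tagged, P))        # 1/5 = 0.2
```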
Details
In this experiment, we ask whether the tags provided by users have good recall of
their contained library terms. An ideal system should have both good coverage (see
Section 4.5.1) and high recall of library terms.
We look at recall for the tags T603 ⊂ T738 that have at least one con-pair. Figures
Figure 4.9: Recall for 603 tags in the full dataset.
Figure 4.10: Recall for 603 tags in the “min100” dataset.
Figure 4.11: Jaccard for 603 tags in the full dataset.
4.9 and 4.10 show the distribution of recall of tags tj ∈ T603 using the full and min100
datasets. Figure 4.9 shows that using the full dataset, most tags have 10 to 40 percent
recall. For example, about 140 tags have recall between 10 and 20 percent. Figure 4.10
shows recall using the “min100” dataset. We can see that when we have sufficient
interest in an object (i.e., many tags), we are very likely to have the appropriate
tags annotated. Recall is often 80 percent and up. Lastly, Figure 4.11 shows the
distribution of Jaccard similarity between O(tj) and Ptj . For most tags, the set of
tag annotated works is actually quite different from the set of library term annotated
works, with the overlap often being 20 percent of the total works in the union or less.
The objects in O(tj)−Ptj are not necessarily incorrectly annotated with tj. Since we
know that many tags are of high quality (Section 4.4), a more likely explanation is
that the library experts missed some valid annotations.
4.6 Related Work
Our synonymy experiment in Section 4.3.1 is similar to previous work on synonymy
and entropy in tagging systems. Clements et al. [20] use LibraryThing synonym sets
to try to predict synonyms. By contrast, our goal was to determine if synonyms were
a problem, rather than to predict them. Chi et al. [18] used entropy to study the
evolution of the navigability of tagging systems. They look at entropy as a global
tool, whereas we use it as a local tool within synonym sets.
Our experiments relating to information integration in Sections 4.3.2 and 4.3.3
(primarily Section 4.3.2, however), share some similarities to Oldenburg et al. [57]
which looked at how to integrate tags across tagging systems, though that work is
fairly preliminary (and focused on the Jaccard measure). That work also focuses on
different sorts of tagging systems, specifically, social bookmarking and research paper
tagging systems, rather than social cataloging systems.
Our tag type experiment in Section 4.4.1 is related to work like Golder and Hu-
berman [31] and Marlow et al. [55] which looked at common tag types in tagging
systems. However, we believe our work is possibly the first to analyze how tag types
change over the long tail of tag usage (i.e., are less popular tags used differently from
more popular tags?).
Like Section 4.4.3, other work has found moderately common terms in a collection
to be useful. For instance, Haveliwala et al. [32] propose Nonmonotonic Document
Frequency (NMDF), a weighting scheme that assigns high weight to moderately frequent terms.
We are not aware of other work that has suggested this particular weighting for tags,
however.
The most related work to our experiments in Sections 4.5.1 and 4.5.2 is our own
later work discussed in Chapter 5. Some older work, for example, DeZelar-Tiedman
[23] and Smith [66] looks at the relationship between tagging and traditional library
metadata. However, these works tend to look at a few hundred books at most,
and focus on whether tags can enhance libraries. Also related to these experiments,
there has been some work on association rules in tagging systems, including work by
Schmitz et al. [62] and our own work in Chapter 3. However, that work focused on
prediction of tags (or other tagging system quantities). We believe our work is the
first to look at relationships between tags and library terms using methods inspired
by association rules.
We are unaware of other work either examining $-tags (or even suggesting paying
for tags) or attempting to understand how tagging works as a data management or
information organization tool (i.e., in the same sense as libraries) in a large-scale,
quantitative way.
4.7 Conclusion
We conducted a series of experiments that suggested that tagging systems tend to be
at least somewhat consistent, high quality, and complete. These experiments found
the tagging approach to be suitable for synonymy, information integration, paid an-
notation, programmatic filtering for quality, and for situations where an objective
and high recall set of annotations covering general topics is needed. In a span of only
a few years, LibraryThing has grown to tens of millions of books, and the groups
developed by taggers are quite close to the groups developed by professional tax-
onomists. This is a testament both to the taxonomists, who did a remarkable job of
choosing consensus controlled lists and classifications to describe books, and to tags
which are unusually adaptable to different types of collections. Strikingly, we found
that a particular type of user tag (moderately common user tags) is perceived as even
more helpful than expert assigned library annotations. These two sets of experiments
are mutually reinforcing. Overall, tags seem to do a remarkably good job of organiz-
ing data when viewed either quantitatively in comparison to “gold standard” library
metadata or qualitatively as viewed by human evaluators.
Chapter 5
Fallibility of Experts
The previous chapter looked at library metadata as a basis for evaluating tagging
systems as an information organization tool. We looked at questions like whether
such systems could be federated and whether tags correspond to taxonomies like the
Dewey Decimal Classification. Here, our focus is instead on the nature of keyword
annotations, and specifically how user and expert keyword annotations differ. (The
more recent experiments in this chapter also apply a semantic relatedness measure in
a novel way which we believe will be valuable more broadly in tagging systems.)
In this chapter, we ask whether a controlled vocabulary of library keywords called
the Library of Congress Subject Headings (LCSH) is different from the vocabulary
developed by the users of LibraryThing. We find that many LCSH keywords corre-
spond to tag keywords used by users of LibraryThing. However, we also find that
even though an LCSH keyword and a tag may be syntactically the same, often the
two keywords may annotate almost completely different groups of books. In our case,
the experts seem to have picked the right keywords, but perhaps annotated them to
the wrong books (from the users’ perspectives). Thus, the common practice on the
web of letting users organize their own data may be more appropriate. (This chapter
draws on material from Heymann et al. [33] which is primarily the work of the thesis
author.)
5.1 Notes on LCSH
In this chapter, we continue to use the library terms introduced in Section 4.1. How-
ever, here a library term lj is always an LCSH keyword. As noted before, LCSH
keywords come from a controlled vocabulary of hundreds of thousands of terms with
main and subtopics. Here, we treat all main and subtopics as separate keywords.
(When we refer to “LCSH keywords,” we mean the value of MARC 650. MARC 650,
strictly speaking, may include expert-assigned keywords from vocabularies other than
LCSH, but in practice is made up almost entirely of that vocabulary in our dataset.)
LCSH has some hierarchical structure. An LCSH keyword li has keywords which
are “broader than” that keyword ({lj, lk, . . .} ⊆ B(li)), “narrower than” that keyword
({lj, lk, . . .} ⊆ N(li)), and “related to” the keyword ({lj, lk, . . .} ⊆ R(li)). Unfortunately,
this structure is not particularly consistent [24], in that if lk ∈ B(lj) and
lj ∈ B(li), it may not be the case that lk ∈ B(li). In practice, books rarely have more
than three to six LCSH keywords, because the scheme was originally designed for card catalogs,
where space was at a premium. It is also common for only the most specific LCSH
keywords to be annotated to a book, even if more general keywords apply. Lastly,
because tags are annotated by regular users, and LCSH keywords are annotated by
paid experts, {oc(lj)|lj ∈ L} and {oc(ti)|ti ∈ T} are quite different. Tags tend to
focus on popular works, while keywords by paid experts annotate more works, less
densely.
5.2 Experiments
In our experiments, we use the dataset from Section 4.2, restricted to works found in
both LibraryThing and the Library of Congress, and consider only the 8,783 unique LCSH
keywords and 47,957 unique tags which annotate at least 10 works. Our research
question is, “how many keywords determined by expert consensus for LCSH are also
used as tags, and are these keywords used in the same way?” In the experiments
below, we divide this question as follows:
1. Section 5.2.1 asks whether LCSH keywords have syntactically equivalent tags.
(For example, tag “java” is equivalent to LCSH “Java.”)
2. Section 5.2.2 asks whether for a given syntactically equivalent (ti, lj) pair, ti
and lj have the same prominence in lists ranked by oc(ti) and oc(lj).
3. Section 5.2.3 asks if syntactically equivalent (ti, lj) pairs are used in the same
way by experts and users.
4. Section 5.2.4 asks whether LCSH keywords have semantically equivalent tags.
(For example, “jewish life” is semantically equivalent to “jewish way of life.”)
We do not replicate the experiments from Sections 5.2.2 and 5.2.3 for semantic
equivalence, but we expect less correlation and less similar usage between non-
syntactically but semantically equivalent keyword pairs.
5.2.1 Syntactic Equivalence
Definition
The tag “painters” and the LCSH keyword “Painters” are obviously equivalent key-
words. But is the tag “american science fiction” equivalent to “Science Fiction, Amer-
ican”? Is the tag “men in black” equivalent to “Men in Black (UFO Phenomenon)”?
We define two types of syntactic equivalence:
Exact The lower-cased tag is identical to the lower-cased LCSH keyword.
Almost Exact The lower-cased tag is identical to the lower-cased LCSH keyword
if the LCSH keyword is modified to remove parenthetical remarks, swap the
ordering of words around a comma, stem, or add or remove an “s.”
Our “painters” example is exactly equivalent, while the other two examples are al-
most exactly equivalent. If there exists a tag ti that is exactly or almost exactly
syntactically equivalent to lj , we say that lj ∈ Slcsh and (ti, lj) ∈ Spair.
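For concreteness, the two equivalence tests can be sketched in Python (a simplified sketch: stemming is omitted for brevity, and all function names are ours, not part of any system described here):

```python
import re

def normalize(s):
    return s.lower().strip()

def variants(lcsh):
    """Normalized variants of an LCSH keyword under the 'almost exact'
    rules: drop parenthetical remarks, swap the words around a comma,
    and add or remove a trailing 's'. (Stemming omitted for brevity.)"""
    base = normalize(lcsh)
    forms = {base}
    # "men in black (ufo phenomenon)" -> "men in black"
    forms.add(re.sub(r"\s*\([^)]*\)", "", base).strip())
    # "science fiction, american" -> "american science fiction"
    for f in list(forms):
        if "," in f:
            head, _, tail = f.partition(",")
            forms.add((tail.strip() + " " + head.strip()).strip())
    # add or remove a trailing "s"
    for f in list(forms):
        forms.add(f + "s")
        if f.endswith("s"):
            forms.add(f[:-1])
    return forms

def exact(tag, lcsh):
    return normalize(tag) == normalize(lcsh)

def almost_exact(tag, lcsh):
    return normalize(tag) in variants(lcsh)
```

Under these definitions, `exact("painters", "Painters")` holds, while the "american science fiction" and "men in black" examples pass only the almost-exact test.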
Results
We found that 3408 of the 8,783 LCSH keywords were exactly equivalent to a
tag, while an additional 838 of the 8,783 were almost exactly equivalent to a
tag. In all, about 48% of LCSH
keywords have equivalents according to one of the above two definitions. Such a high
98 CHAPTER 5. FALLIBILITY OF EXPERTS
Figure 5.1: Spinogram [40, 39] showing probability of an LCSH keyword having a corresponding tag based on the frequency of the LCSH keyword. (Log-scale.)
keyword overlap is all the more surprising given that many of the exactly equiva-
lent LCSH keywords are multiple words, for example, “Vernacular Architecture” or
“Quantum Field Theory.”
Cases where lj ∉ Slcsh are highly correlated with low oc(lj). Figure 5.1 shows the
distribution of syntactic equivalence (y-axis) based on oc(lj) (x-axis). For example, if
10 ≤ oc(lj) ≤ 15, there is about a 30 percent chance that lj ∈ Slcsh (and a 20 percent
chance that lj is exactly equivalent to some tag ti). By contrast, if 63 ≤ oc(lj) ≤ 100,
there is about a 70 percent chance that lj ∈ Slcsh. (We also suspect that longer LCSH
keywords may be less likely to have syntactically equivalent tags because tags tend
to be short.)
5.2.2 Rank Correlation of Syntactic Equivalents
Are syntactically equivalent (ti, lj) pairs equally popular within their respective anno-
tation types? For example, if the “java” tag annotates many works, does the “Java”
LCSH keyword also annotate many works? We create two rankings of {(ti, lj) ∈
Spair}, one ordered by oc(ti), the other ordered by oc(lj). We use Kendall’s tau rank
correlation to determine how similarly ranked the pairs are. For our data, τ ≈ 0.305.
This means that the pairs are somewhat, but not highly, positively correlated. The
experts and regular users have somewhat similar views of what the most important
keywords are, but they do still differ substantially.
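Kendall's τ simply compares the relative order of every pair of items across the two rankings. A minimal version follows (no tie correction; in practice one would use a library routine such as scipy.stats.kendalltau, and the occurrence counts in the example are made up for illustration):

```python
def kendall_tau(xs, ys):
    """Kendall rank correlation of two paired score lists:
    (#concordant pairs - #discordant pairs) / #pairs.
    +1 for identical orderings, -1 for reversed orderings."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical paired counts oc(t_i) (tag side) and oc(l_j) (LCSH side):
tau = kendall_tau([500, 120, 80, 40, 10], [300, 200, 50, 60, 5])  # 0.8
```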
5.2.3 Expert/User Annotator Agreement
Do experts and regular users use the same keywords in the same ways? For example,
many users in our dataset have annotated the book “The Wind in the Willows” with
the tag “children’s stories,” yet no expert has annotated the book with the LCSH
keyword “Children’s Stories.” We investigate the question of how common problems
like these are below, and find that they are quite common.
Jaccard Similarities
We define three measures to try to get an idea of how commonly (ti, lj) pairs annotate
the same books. We define symmetric Jaccard similarity as:
Jsym = |O(ti) ∩ O(lj)| / |O(ti) ∪ O(lj)|
For example, “children’s stories” (above) has Jsym = 0, while “origami” has Jsym =
0.75. We also define two asymmetric Jaccard similarity measures, one for tags and
one for LCSH:
Jtag(ti, lj) = |O(ti) ∩ O(lj)| / |O(ti)|
Jlcsh(ti, lj) = |O(ti) ∩ O(lj)| / |O(lj)|
Jsym gives the ratio of the size of the intersection of two annotations to their union, so
it may be dominated by one annotation if that annotation annotates many works. Jtag
tells us what portion of the tagged works are covered by the LCSH keyword, and Jlcsh
tells us what portion of LCSH annotated works are covered by the tag. For example,
“knitting” has Jlcsh = 0.97 but Jsym = 0.53 because even though almost all works in
O(lknitting) are in O(tknitting), |O(tknitting)| is about twice as large as |O(lknitting)|.
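The three measures are direct set computations; in Python (with hypothetical work sets echoing the "knitting" example, where the tag covers roughly twice as many works as the keyword):

```python
def j_sym(tag_works, lcsh_works):
    """Symmetric Jaccard: |intersection| / |union|."""
    return len(tag_works & lcsh_works) / len(tag_works | lcsh_works)

def j_tag(tag_works, lcsh_works):
    """Portion of tagged works covered by the LCSH keyword."""
    return len(tag_works & lcsh_works) / len(tag_works)

def j_lcsh(tag_works, lcsh_works):
    """Portion of LCSH-annotated works covered by the tag."""
    return len(tag_works & lcsh_works) / len(lcsh_works)

# Hypothetical O(t) and O(l): every keyword-annotated work is also
# tagged, but the tag covers twice as many works in total, so
# j_lcsh is 1.0 while j_sym is only 0.5.
O_t = set(range(100))
O_l = set(range(50, 100))
```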
Results
For most (ti, lj) ∈ Spair, O(ti)∩O(lj) is quite small. Figure 5.2(a) shows the distribu-
tion of Jsym for the 4,246 (ti, lj) pairs in Spair. The vast majority of such pairs have
less than 20% overlap in work coverage.
(a) Histogram of Jsym
(b) Histogram of Jsym, N(li) = ∅
Figure 5.2: Symmetric Jaccard Similarity.
A possible reason for small O(ti) ∩O(lj) could be that librarians only choose the
most specific appropriate LCSH keywords (see Section 5.1). In order to test this
hypothesis, we computed Jsym, but only over LCSH keywords which were at the
bottom of the LCSH hierarchy. In other words, we only chose li where N(li) = ∅.
Jsym values for these pairs, shown in Figure 5.2(b), are very similar to those in Figure
5.2(a). This leads us to believe that specificity is not the core reason user and expert
annotations differ.
Figures 5.3(a) and 5.3(b) show the values of Jlcsh and Jtag for the 4,246 pairs.
Both show predominantly low Jaccard values. Jlcsh does have slightly higher Jaccard
values, but it is still mostly below 0.4. A work labeled with an LCSH keyword is less
than 50 percent likely to be labeled with the corresponding tag. A work labeled with
a tag is even less likely to be labeled with the corresponding LCSH keyword.
(a) Histogram of Jlcsh
(b) Histogram of Jtag
Figure 5.3: Asymmetric Jaccard Similarity.
5.2.4 Semantic Equivalence
Are there semantically, rather than syntactically, equivalent tag/LCSH keyword pairs?
In other words, are there many pairs like “middle ages” and “Middle Ages, 500-1500”
where the meaning is the same, but the phrasing is slightly different? If so, how
many?
Definition
We use semantic relatedness to determine whether (ti, lj) pairs are semantically equiv-
alent. Semantic relatedness is a task where an algorithm gives a number between 0
and 1 for how related two words or phrases (w1, w2) are. For example, “vodka” and
“gin” are highly related (closer to 1) while “rooster” and “voyage” are not (closer to
0). We use an algorithm called Wikipedia Explicit Semantic Analysis (ESA) [30] to
calculate semantic relatedness. Wikipedia ESA represents each of w1 and w2 as a
weighted vector of Wikipedia concepts (articles) and takes the cosine similarity of
the two vectors as their relatedness. We write Wikipedia ESA as
a function sresa(ti, lj) → [0, 1].
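At its core, ESA reduces to a cosine similarity between sparse concept vectors. The sketch below assumes a precomputed word-to-concept weight index; the toy index here stands in for a real Wikipedia-derived one and is purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sr_esa(phrase1, phrase2, concept_index):
    """ESA-style relatedness in [0, 1]: map each phrase to a vector
    of Wikipedia-concept weights, then take the cosine."""
    def vec(phrase):
        v = {}
        for w in phrase.lower().split():
            for c, wt in concept_index.get(w, {}).items():
                v[c] = v.get(c, 0.0) + wt
        return v
    return cosine(vec(phrase1), vec(phrase2))

# Toy word -> {concept: weight} index (hypothetical):
idx = {
    "vodka":   {"Distilled_beverage": 2.0, "Russia": 1.0},
    "gin":     {"Distilled_beverage": 1.8, "Cocktail": 1.2},
    "rooster": {"Chicken": 2.5},
}
```

As in the "vodka"/"gin" example, phrases sharing heavily weighted concepts score high, while unrelated phrases score near zero.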
ESA   (ti, lj) pair
0.1   nature photography, indian baskets
0.2   fiction xxi c, angels in art
0.3   christian walk, women and peace
0.4   novecento/20th century, african american churches
0.5   20th century british literature, indians in literature
0.6   countries: italy, european economic community countries
0.7   medieval christianity, medieval, 500-1500
0.8   christian church, church work with the bereaved
0.9   detective and mystery fiction, detective and mystery stories
Table 5.1: Sampled (ti, lj) pairs with Wikipedia ESA values.
Figure 5.4: Conditional density plot showing probability of a (ti, lj) pair meaning that (ti, lj) could annotate {none, few, some, many, almostall, all} of the same books according to human annotators, based on the Wikipedia ESA score of the pair.
Understanding Wikipedia ESA Values
Table 5.1 shows representative Wikipedia ESA values for LCSH keywords lj ∉ Slcsh.
For example, for tag tma “middle ages” and LCSH keyword lma “Middle Ages, 500-
1500”, sresa(tma, lma) ≈ 0.98 (not shown). By contrast, for tnp “nature photogra-
phy” and lib “Indian Baskets”, sresa(tnp, lib) ≈ 0.1.
Figure 5.4 shows how Wikipedia ESA values translate into real relationships be-
tween (ti, lj) keyword pairs. We uniformly sampled (ti, lj) pairs where lj ∉ Slcsh by
sresa(ti, lj). We then asked human annotators how many books labeled with either ti
or lj would be labeled with both ti and lj. Figure 5.4 shows sresa values on the x-axis
Figure 5.5: Histogram of Top Wikipedia ESA for Missing LCSH and All Tags.
and the distribution of answers ∈ {none, few, some, many, almostall, all} on the y-
axis. For example, at sresa = 0.8, 20 percent of keyword pairs have many, almostall,
or all books in common (top three grays) according to human annotators. Likewise,
more than half of pairs at sresa = 0.8 have at least some books in common by this
measure. sresa is well correlated with how humans see the relationship between two
keywords.
Results
We ran Wikipedia ESA over all (ti, lj) pairs where lj ∉ Slcsh. Figure 5.5 shows
{max({sresa(ti, lj) | ti ∈ T}) | lj ∈ L − Slcsh}. That figure shows that most of the non-
syntactically equivalent LCSH keywords have a fairly semantically similar tag, with
a Wikipedia ESA value between 0.7 and 0.9. By simulation using the probabilities
from Figure 5.4, we estimate that ≈ 21 percent of lj ∉ Slcsh have a tag matching
all or almostall of the keyword and ≈ 56 percent have a tag matching many books
annotated with the keyword.
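The simulation is essentially a weighted expectation: each unmatched keyword contributes the probability, given its best ESA score, that human annotators would judge the pair to share many (or more) books. A sketch, with made-up per-bin probabilities standing in for the curves of Figure 5.4:

```python
# Hypothetical P(annotators judge "many or more books in common" | ESA),
# one value per 0.2-wide ESA bin; these numbers are illustrative only.
p_many_plus = [0.02, 0.10, 0.30, 0.55, 0.80]

def p_for(esa):
    """Look up the probability for the bin containing this ESA score."""
    return p_many_plus[min(4, int(esa * 5))]

def estimate(best_esa_scores):
    """Expected fraction of unmatched LCSH keywords whose best tag
    covers 'many' or more of the keyword's books."""
    return sum(p_for(s) for s in best_esa_scores) / len(best_esa_scores)
```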
5.3 Conclusion
This short chapter contrasted a mature controlled vocabulary built by experts over
decades with an uncontrolled vocabulary developed by hundreds of thousands of users
over a few years. We found that many (about 50 percent) of the keywords in the controlled
vocabulary also appear in the uncontrolled vocabulary, especially the more frequently annotated keywords.
We also found using a semantic relatedness measure that most of the remaining LCSH
keywords have similar, though not exactly equivalent, tags. This suggests that often
the keywords selected as controlled vocabulary keywords are the keywords that users
naturally use to describe works.1
However, we found little agreement as to how to apply shared keywords. Sets of
works annotated by corresponding LCSH keywords and tags rarely intersect signifi-
cantly. This is true even if we merely check whether a corresponding tag annotates
most of the works annotated by an LCSH keyword. This suggests one of three inter-
esting possibilities:
1. Users and experts use many of the same keywords, but ultimately differ heavily
as to how to apply them.
2. Experts are not allowed, or do not have time, to annotate works with all of the
appropriate keywords.
3. Experts only label highly representative works with a term, rather than all
works that might be considered to have the term, leading to low recall.
All of these possibilities are ultimately bad for retrieval using expert assigned con-
trolled vocabularies.
When users and experts differ in how they annotate objects, we believe it is
most reasonable to defer to the users. To say otherwise would be, in essence, to
tell users that they do not know how to organize their own collections of objects.
Ultimately, given that keywords are used by the users for navigation and browsing,
we should evaluate the usefulness of annotations from their perspective, rather than
the perspective of experts.
This chapter also suggests an interesting alternative view on the vocabulary prob-
lem [28], a long standing observation in the world of human-computer interaction.
The vocabulary problem suggests that given an object, people will choose many dif-
ferent names for that object. However, our work suggests that given a name (a tag in
our case), people, whether experts or not, may disagree substantially on what objects
1Keywords can be in one of three groups, but we focus on L ∩ T and L − T in this chapter, ignoring T − L. In Chapter 4, we found that about half of the 47,957 tags ti ∈ T are likely to be non-objective, non-content tags like “funny,” “tbr,” or “jiowef.” We suspect that the balance of T − L that is not syntactically equivalent to LCSH keywords is still either related to the LCSH keywords or describes completely different (objective, content-based) concepts.
that name should annotate.
Chapter 6
Human Processing
So far, we have focused on one particular type of microtask: tagging. Starting in this
chapter, we shift our focus to microtasks in general. In particular, we are interested in
how to write programs that use microtask marketplaces like Mechanical Turk. This
chapter begins by motivating our programming environment and methodology for
human microtasks, called the human processing model. Chapter 7 describes our first
attempt at an implementation of the human processing model, called the HPROC
system, and illustrates its usage through a sorting case study. Finally, Chapter 8
describes worker monitoring, a key part of recruiters (see Section 6.4), which are in
turn a major feature of the human processing model.
Why a programming environment and methodology for microtasks? Developing
a microtask-based application involves a lot of work, e.g., developing a web interface
for the human workers to receive their assignments and return their results, computer
code to divide the overall application into individual tasks to be done by humans,
computer code to collect results, and so on. However, with our programming envi-
ronment many of the programming steps that must be performed can be automated.
We start by giving a simple example (Section 6.1). We show how a programmer
would attack this example using two existing programming environments, which we
call Basic Buyer (Section 6.2) and Game Maker (Section 6.3). Then, we show how
the programmer would attack the same example using our novel proposed environ-
ment, Human Processing (Section 6.4). Finally, we contrast all three environments
and describe remaining challenges in the area (Section 6.5).
6.1 Motivating Example
“Priam,” the editor of a photography magazine, wants to rank photos submitted to
the magazine’s photo contest. For each environment below, we explain how Priam
might go about accomplishing this task.
6.2 Basic Buyer
The premise of the Basic Buyer (BB) environment is that workers do short microtasks
for pay, based on listings on a website (a marketplace). The BB environment is mod-
eled on usage of Amazon’s Mechanical Turk [1],1 though a similar environment could
be used with Gambit Tasks [2] or LiveWork [3]. However, because the programmer in
BB targets a marketplace directly, and interaction patterns with marketplaces vary,
switching marketplaces requires rewriting previous code.
The BB environment (Figure 6.1) works as follows:
1. The programmer (Priam) writes a normal program.
2. That program can, in the course of execution, create HTML forms at one or
more URLs. This creation of forms can happen in any of the usual ways that
people currently generate web forms using web application frameworks.
3. The program can also interact with a marketplace, a website where workers
(users on the Internet visiting the marketplace) look for tasks to complete. The
program can make one of five remote procedure calls to a monetary marketplace:
post(url, price) → taskid Tell the marketplace to display a link to url
with the information that, if completed, the worker will be paid price.
(We do not specify, but post might include other specifications like the
number of desired assignments, task title, or a task description.) The URL
1In particular, correspondences for the operations mentioned in this section are post → CreateHIT, assignments, get → GetAssignmentsForHIT, approve → ApproveAssignment, reject → RejectAssignment. We ignore bonuses and qualifications for ease of exposition.
Figure 6.1: Basic Buyer human programming environment. A human program generates forms. These forms are advertised through a marketplace. Workers look at posts advertising the forms, and then complete the forms for compensation.
url should correspond to a form which performs an HTTP POST to the
marketplace. The returned identifier taskid gives a handle for further
interaction with the marketplace related to this posted task. When a
worker completes the task later, the worker will post the result via the
form to the marketplace. The marketplace will then record a dictionary
containing the posted form contents from url, a workerid unique to the
worker, the taskid, and a unique assignid with which to look up the
dictionary. (By dictionary, we mean a hash table where one can enter and
search for entries on keys.)
assignments(taskid) → assignids Return a list of identifiers assignids for
looking up individual completions of the form associated with taskid.
(Can be called multiple times, perhaps with a special identifier to indicate
that there will be no additional assignments registered.)
get(assignid) → dict Get the dictionary that corresponds to the submit-
ted task with the given assignid. The dictionary contains which worker
completed the task (a workerid) and the results of the form, as key-value
pairs.
approve(assignid) Request that the marketplace pay the worker associated
with assignid the price associated with the taskid that that assignid
corresponds to.
reject(assignid) Request that the marketplace not pay the worker asso-
ciated with assignid the price associated with the taskid that that
assignid corresponds to.
The program posts one or more URLs, waits for assignments, gets the results,
and then approves or rejects the work.
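This post/poll/approve lifecycle might look as follows against any object exposing the five calls above. The `mkt` object, the polling details, and the quality check are all assumptions for illustration, not an actual Mechanical Turk SDK:

```python
import time

def run_task(mkt, url, price, wanted, looks_ok):
    """Post a form URL, poll for completed assignments, and approve
    or reject each one. `mkt` exposes the five Basic Buyer RPCs;
    `looks_ok` is a caller-supplied quality check (e.g., a spammer
    filter rejecting junk answers)."""
    taskid = mkt.post(url, price)
    results, seen = [], set()
    while len(results) < wanted:
        for assignid in mkt.assignments(taskid):
            if assignid in seen:          # skip already-processed work
                continue
            seen.add(assignid)
            d = mkt.get(assignid)
            if looks_ok(d):
                mkt.approve(assignid)     # pay the worker
                results.append(d)
            else:
                mkt.reject(assignid)      # e.g., junk from a spammer
        if len(results) < wanted:
            time.sleep(1)                 # poll interval
    return results
```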
Priam determines that workers are best at ranking five photos at a time, so a
web page is designed to display five photos and provide five entry fields for the ranks
one through five. A computer program now needs to be written to read the photos
from a database and generate multiple posts corresponding to groups of five photos.
The program needs a strategy to do its work: for instance, it may employ a type
of Merge-Sort strategy: divide the photos into disjoint sets of five, and rank each
set. Then the sorted sets (runs) can be merged by repeatedly calling on workers.
(Section 7.8 describes in more detail a similar Merge-Sort in our HPROC system
using ranked comparisons.)
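In outline, this Merge-Sort strategy could look like the following, where `rank_five` and `pick_best` stand in for microtasks answered by workers (both names are ours, not from any system described here):

```python
def human_merge_sort(photos, rank_five, pick_best):
    """rank_five(group) -> the (up to) five photos in ranked order
       pick_best(a, b)  -> the better of two photos
    Both stand in for questions answered by human workers."""
    # Phase 1: rank disjoint groups of five to form sorted runs.
    runs = [rank_five(photos[i:i + 5]) for i in range(0, len(photos), 5)]
    # Phase 2: repeatedly merge pairs of runs until one remains.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(merge(runs[i], runs[i + 1], pick_best))
            else:
                merged.append(runs[i])   # odd run out, carried forward
        runs = merged
    return runs[0] if runs else []

def merge(a, b, pick_best):
    """Merge two sorted runs with one human comparison per step."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if pick_best(a[i], b[j]) == a[i]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]
```

With deterministic stand-ins (`sorted` for ranking a group, `min` for picking the better of two), the routine reduces to an ordinary ascending sort, which makes it easy to test before spending money on workers.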
In addition to the sorting logic itself, there is a lot of other “administrative”
work that needs to be done. Of course, assignments need to be approved (paying
workers for their work), but more importantly Priam needs to determine pricing,
whether and when the work being submitted is good, which workers are good, and so on. For
example, one worker (a “spammer”) might simply fill in junk in order to get paid.
This spammer would need to be caught and their work ignored. Priam also may not
pay enough initially, or may need to change his price over time depending on market
conditions.
6.3 Game Maker
The Game Maker (GM) environment is modeled on the “Games with a Purpose”
(GWAP) literature [70]. The idea of GM is to incorporate one or more desired tasks
(like Priam’s five photo ranking) into a game which regular users on the Internet
Figure 6.2: Game Maker human programming environment. The programmer writes a human program and a game. The game implements features to make it fun and difficult to cheat. The human program loads and dumps data from the game.
find fun to play. In theory this is not much different from BB—why not simply
post the URL of the “game” task so people can find it? However, in practice, GM
is quite different because only some tasks can be made fun, the question of pricing
is completely avoided, and it often takes a long time and a great deal of effort to
make desired work into a game. Many games have been developed, though our
model is based most closely on the ESP Game (a photo captioning game). The GM
environment (Figure 6.2) works as follows:
1. The programmer (Priam) writes two programs: the main program and a “game
with a purpose.”
2. The game is designed to take input items and compute some function fn of each
input item by coercing players to compute the function during game play. For
example, the ESP Game takes photos as input items and produces text labels
as outputs [70].
3. The interaction between the main program and the game is simple:
load(item) → itemid Add a new item for humans to compute the game’s
function on. Return an identifier for the item.
dump() → ((itemid,res),...) Get a list of all results that have been com-
puted up to this point (can be called repeatedly). Each returned tuple
includes an itemid and the result of computing the game’s function on the
original item.
4. While the function fn computed is usually quite simple (e.g., “give some la-
bels for this image”), the game itself is usually quite complex. This complexity
is for two reasons: the game must be fun, and the game must be difficult to
cheat. Making the game fun can be time consuming, requiring features such as
timed game play, multiple players, fake players (via replayed actions), leader-
boards, and quality graphic design. Making the game difficult to cheat can be
equally time consuming, requiring features such as randomization, gold stan-
dards, statistical analysis, and game design according to particular templates
(e.g., “output-agreement,” “input-agreement” [70]).
5. The game may be a Flash game or any other format; the fact that it is used for
human computation does not impact the technical details of how we program
it.
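From the main program's side, the interaction reduces to a load/poll loop over the two operations. A sketch, where the `game` object and `poll` callback are hypothetical:

```python
def label_items(game, items, poll):
    """Feed items into a game-with-a-purpose and collect results.
    `game` exposes load(item) -> itemid and dump() -> [(itemid, res)];
    `poll` is called between dump() calls (e.g., a sleep)."""
    ids = {game.load(item): item for item in items}
    results = {}
    while len(results) < len(ids):
        for itemid, res in game.dump():   # dump() may be called repeatedly
            results[itemid] = res
        poll()
    return {ids[i]: r for i, r in results.items()}
```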
Priam determines that the magazine’s readers might be willing to play a game
where they determine the best photo out of a set of five photos. As with the Basic
Buyer case, Priam needs to write a program to handle the sorting logic. The program
could then use the load and dump operations to get data in and out of the game.
However, he now also needs to write a game where it is fun to sort groups of five
photos, and then promote the game online. Lastly, he needs to make sure that
players cannot cheat, either to make a particular contestant’s photo do well, or for
the player to succeed in the game by inputting bad data.
One problem with the GM environment is that to date, programmers have not
shared interfaces or source code for popular games. For example, even though the ESP
Game serves many players each day, it is not possible for Priam to get the (actual)
ESP Game to label his own images. This means that the programmer usually has to
develop and promote a new game, even if previous examples exist! (Even if the most
popular GWAPs did have open interfaces, it is likely that switching between GWAPs
would require rewriting code and that GWAPs would only cover a small fraction of
potential desired tasks.)
Figure 6.3: Human Processing programming environment. HP is a generalization of BB and GM. It provides abstractions so that algorithms can be written, tasks can be defined, and marketplaces can be swapped out. It provides separation of concerns so that the programmer can focus on the current need, while the environment designer focuses on recruiting workers and designing tasks.
6.4 Human Processing
The Human Processing (HP) environment builds upon the BB and GM environments
through abstraction. The HP environment (Figure 6.3) works as follows:
1. The programmer (Priam) writes a normal program. The programmer may also
write one or more implementations of (see below) human drivers, human tasks,
marketplace drivers, or recruiters. However, the point of the HP environment is
to maximize code reuse, so ideally, existing implementations should cover the
programmers’ common use cases.
2. A human driver is a program that manages an associated web form or other
user interface (so that the main program and other components do not have
to talk directly to the user interface). It is so named because it manages the
interaction with humans, much like a device driver manages a physical device
on a computer. A human driver supports four operations:
open() → driverid Make the associated user interface available to remote
users. By remote users we mean workers in the BB model, players in
the GM model, or other people capable of completing tasks. Returns an
identifier for the driver.
send(driverid, msg) Send message msg to the driver to change its behavior.
In Priam’s case, if he was using a driver for a game like the ESP Game he
would use a send operation to load input photos.
get(driverid) → (d,e) or 0 Get a (result) data object d from the interface,
with execution context e about how that data object was acquired. If no
new data is available, return nothing (i.e. 0). get is how results are
returned from the driver. Both d and e are dictionaries of key-value pairs.
For example, Priam’s photo comparison interface returns d as
{ranks: (1,4,2,3,5), taskid: TID183}
(ranks are the output, and tasks are defined below) and e as
{workerid: WID824}
(a worker who completed the task).
close(driverid) Make the associated user interface unavailable to remote
users.
A human program opens a driver and then sends setup messages. Human
drivers for web forms may only receive one setup message, though those for
games may be sent many messages to load inputs. Execution context comes
from user interaction, for example, how long did the task take and which worker
completed it? Such information can help with quality control in the main
program. Finally, the human driver is closed. Note that by itself, a human
driver can make its associated user interface available to remote users. However,
it does not handle the problem of finding remote users to interact with the user
interface.
3. The programmer reuses or defines structures called human task descriptions. A
human task description consists of an input schema, an output schema, a human
driver, a web form, and possibly other metadata. A human task description can
be instantiated into one or more human task instances. These instances contain
information as key-value pairs such as when the task started, a price if any, and
so on. For example, a task description for Priam’s case might look like ...
{input: (photo1, photo2, photo3, ...),
output: (int, int, int, int, int),
webform: compare.html,
driver: comparer.py}
... while a task instance might look like ...
{start: 20090429,
price: $0.07,
taskid: TID272}.
4. A marketplace driver provides an interface to a marketplace. Marketplaces
are a general term for both monetary marketplaces like Amazon’s Mechanical
Turk [1] (websites where workers are paid in money) and gaming marketplaces
like GWAP (websites where users choose among many games and are paid
in points or enjoyment). The environment may have many drivers for different
marketplaces, and these drivers may have different interfaces depending on what
the marketplaces themselves support.
5. The programmer avoids programming to any particular marketplace driver if
at all possible. Instead, the programmer targets a recruiter, which is a program
that serves as an interface to one or more marketplace drivers.2 Recruiters
support at least one operation:
recruit(taskid) Ensure that the task instance taskid is completed by work-
ers. The recruiter uses the task instance to find out how the user interface
associated with the task is accessed. For example, if it is a web form, the
task instance includes the URL of the web form. Then, the recruiter inter-
acts with one or more marketplace drivers. In the case of the marketplace
from the BB environment, one strategy might be to gradually increase
the price until workers complete the web form. The recruiter also interacts
with the human driver associated with the task instance to determine when
no more workers are needed (e.g., in Section 7.5.12, the recruiter calls an
isDone() method on a human driver).
2In practice, some services like Dolores Labs’ CrowdFlower may also be viewed as a form of recruiter.
In general, quality recruiters need both a strategy and worker monitoring. For
example, if workers are not completing a web form, should the price be increased,
or are there simply not many workers currently awake and available? Chapter 8
addresses the question of worker monitoring, with some consideration of viable
recruiting strategies.
6. The environment includes a library of human algorithms to encourage code
reuse. A human algorithm is a parameterized program which can handle many
possible needs. (For example, it might include algorithms for sorting, clustering,
and iterative improvement [49].) Often, it will be parameterized by human task
descriptions, but other parameters might be used as well. For example, a pair-
wise sort algorithm might take a human task description consisting of a human
driver and web form to compare two items. The human task description would
determine if the items compared were photos, videos, or something else.
The Human Processing environment is the novel environment we propose in this work.
In the HP environment, Priam’s workload is much reduced. A pairwise sorting
algorithm H-Quick-Sort is already included in the library. Priam may define a
human driver and web form for comparing two photos, though these might already
be available. Then, Priam defines a human task consisting of comparing two photos
using the human driver, web form, and appropriate schemas. Lastly, Priam runs
H-Quick-Sort with his human task and a pre-defined recruiter. An example pre-
defined recruiter is one that increases prices one cent each hour using Amazon’s
Mechanical Turk, though more complex recruiters could be built.
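The library's pairwise sorting algorithm might be sketched as follows. Here `compare(a, b)` hides the HP machinery (instantiating a two-photo human task instance and handing it to a recruiter) and simply returns whichever item workers rank first; the sketch is ours, not actual library code:

```python
def h_quick_sort(items, compare):
    """Human quicksort: compare(a, b) returns whichever of a, b
    should rank first. In HP, each call would issue one comparison
    microtask via a recruiter."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    before, after = [], []
    for x in rest:
        # One human comparison per pair; a real implementation would
        # persist each answer so a crashed run never re-buys work.
        winner = compare(x, pivot)
        (before if winner == x else after).append(x)
    return h_quick_sort(before, compare) + [pivot] + h_quick_sort(after, compare)
```

Substituting a deterministic "worker" that always prefers the smaller item (`min`) reduces the algorithm to an ordinary ascending sort, a cheap way to check the control flow before recruiting real workers.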
6.5 Discussion
HP extends BB and GM in compelling ways:
• Cost. BB excels for small numbers of tasks where programmer time is valuable.
GM excels for large numbers of tasks where cheaper work is valuable. HP
excels at both by providing payment optimizing recruiters and the opportunity
to degrade to either BB or GM.
• Ease. BB quickly becomes complicated as the programmer gets bogged down
in trivia like pricing. GM requires heavy attention to game play and cheaters.
HP allows the programmer to focus on the tasks to be completed, rather than
infrastructure.
• Reuse. There are no mechanisms in BB for reusing algorithms, forms, or admin-
istrative functionality. Current GM implementations do not share interfaces,
and games tend to be specialized to specific use cases. By contrast, abstrac-
tions in HP allow for a library of infrastructure. Algorithms can target recruiter
interfaces, recruiters can target market drivers, and so on.
• Independence. Programs in BB tend to be focused on a particular marketplace.
Programs in GM tend to be tied to a particular web site’s gaming user base.
By contrast, programs written to HP have an independence due to marketplace
drivers. (Likewise, algorithms, human drivers, and forms may have a simi-
lar independence.) Switching marketplaces or other infrastructure can require
substantial rewriting in BB or GM, but does not in HP.
• Algorithms. General algorithms can be written to target a higher level interface
in HP, but it is not clear how general algorithms can be reused in BB or GM.
• Separation of Concerns. Researchers or infrastructure writers can focus on
improving recruiters, algorithms, and human drivers in HP, independent of a
main program’s code.
The more environments that implement HP, the easier it will be to leverage disparate
work in algorithms, recruiters, and human drivers.
There are three main challenges in the future for HP.
1. Verification, Quality Control. GM focuses a great deal on verification, but BB
and HP do not. How should we identify bad output? How do we identify
high and low quality workers? Is worker quality task specific? We would like
to see a generic, modular way to handle verification and quality control in an
environment like HP.
2. Recruiters. We would like to see arbitrarily advanced recruiters. For example,
not only would we like to see recruiters that price tasks on monetary mar-
ketplaces, but we would also like to see recruiters that can choose amongst
alternative, equivalent task plans based on price and quality.
3. Algorithms. Algorithms targeted for the HP environment need to be developed
for various purposes. For example, sorting with people is not the same as sorting
with a computer! The HP environment provides a natural way to benchmark
algorithms, based on cost, time, input, and output with a given recruiter.
As we will see, the work in Chapters 7 and 8 goes part of the way towards address-
ing these challenges. Chapter 7 describes our partial implementation of HP, called
HPROC. We use HPROC to explore human sorting algorithms with a simple recruiter.
Chapter 8 demonstrates a worker monitoring tool intended for use by recruiters. This
worker monitoring tool might also be used for various forms of quality control in the
future. However, while we explore the space with one type of algorithm, one type of
recruiter, and one tool for worker monitoring, there is a great deal of potential future
work in algorithms, recruiters, and quality control. We believe that HP is a strong
foundation for this future work in human computation, allowing for much greater
reuse and modularization of common functionality.
Chapter 7
Programming with HPROC
This chapter describes HPROC, a system implementing most of the Human Process-
ing model described in Chapter 6. (The one notable exception is a lack of human
task descriptions.) HPROC makes human programming easier by storing expensive
human results in a database backend and providing an environment for programming
which is more amenable to control flow with humans. HPROC makes evaluation of
human algorithms easier with concepts like recruiters which help to control for the
variability of an underlying marketplace like Mechanical Turk. We describe HPROC
through a sorting case study illustrating both how HPROC works and how we believe
human algorithms should be evaluated.
7.1 HPROC Motivation
Programming systems designed for other problems are rarely a good fit for human
programs. In particular, human programs are:
Long Running With humans as a computational unit, processing can take days or
even weeks.
Costly Paying human workers costs money, which means that the programmer wants
to be extra careful not to lose previously computed results. In particular,
previously computed results should be persistently stored in case a later part
of the program crashes.
Parallel It is usually much slower to post one task to a marketplace at a time in
sequence than to post many tasks in parallel. Making this human parallelism
easy is very important, while computational parallelism is much less important.
Web-related with State Human programs need to interact with workers on the
web, but unlike most web programming, the interactions are often stateful.
While other types of programming like web programming, database programming,
and systems programming with concurrency share some of these features, we do not
know of any type of programming task that shares all of them.
1 quicksort(A)
2   if A.length > 0
3     pivot = A.remove(once A.randomIndex())
4     left = new array
5     right = new array
6     for x in A
7       if compare(x, pivot)
8         left.add(x)
9       else
10        right.add(x)
11    quicksort(left)
12    quicksort(right)
13    A.set(left + pivot + right)
14
15 compare(a, b)
16   hitId = once createHIT(...a...b...)
17   result = once getHITResult(hitId)
18   return (result says a < b)
Listing 7.1: An idealized TurKit Quick-Sort program [51].
7.2 Preliminaries: TurKit
One of the first systems to explicitly aim to solve some of the programming problems
described in Section 7.1 for human programming was the TurKit system ([49], [50]).
TurKit has been used to transcribe blurry text, caption images, and execute genetic
algorithms [51]. We describe the TurKit system first because HPROC makes a number
of design decisions inspired by TurKit. The main contribution of the TurKit system
was a novel programming model which we call TurKit crash-and-rerun. There are
four features of TurKit crash-and-rerun:
Single Program There is one and only one program which is run within the TurKit
environment.
Continuous Rerun The one program is continuously rerun until the program com-
pletes without raising an exception.
Idempotence The single program is made idempotent so that it can be rerun con-
tinuously without causing additional side effects.
Deterministic The single program is deterministic so that each time it reruns the
same execution path will be followed.
In short, TurKit is an environment for continuously rerunning a single, deterministic,
idempotent program written by a human programmer until it completes successfully,
with facilities to make writing and running such programs easier.
For example, Listing 7.1 shows pseudocode for Quick-Sort in the TurKit sys-
tem. Listing 7.1 looks like a classical Quick-Sort, with two key exceptions. The
first key exception is that the TurKit program pseudocode has a separate compare
function for performing binary comparisons. The compare function creates tasks on
the Mechanical Turk (via createHIT on line 16) and gets the results of those tasks
(via getHITResult on line 17). The second key exception is that the TurKit program
pseudocode includes the function once. The first time once is reached, once calls the
function to which it is applied (e.g., A.randomIndex() on line 3—once is applied to
whatever function immediately follows it). If the called function returns successfully,
the result of the function is stored in a database which is part of the TurKit system.
Then, every subsequent time that particular once is reached, the recorded result is
returned from the database, rather than calling the function to which once was
applied. (In effect, once performs memoization.) On line 3, once is used to ensure
that the same pivot is chosen on each run of the program. On lines 16 and 17, once
is used to ensure that we do not repeatedly create duplicate tasks, or repeatedly get
the same results, respectively.
Note that although Listing 7.1 looks like a classical Quick-Sort, the program
does not run in the same manner. The program reads like a single, imperative run
of Quick-Sort. However, in practice, the program will “crash” whenever it reaches
line 17 and results are not yet available from the Mechanical Turk. Then the program
will periodically rerun (e.g., every minute), replaying all of its actions and retrieving
stored results when it reaches once, until the program has all results necessary to
complete. The imperative appearance of TurKit programs is an advantage, and the
crash-and-rerun style avoids problems like memory leaks. The crash-and-rerun style
also does not need any special operating system or language support for suspending
threads or processes. However, a disadvantage of the style (shared, as we will see,
by HPROC) is that the programmer must be careful to make their program idempotent
and deterministic using functions such as once (and unintended behavior can occur
if the programmer does not).
One practical way to implement once is via a counter. The counter is incremented
each time once is called and reset each time the program crashes. once then stores
and looks up the result of the function by a key associated with the current counter
in the database. (TurKit is designed for prototyping only one program at a time, so
we do not worry about conflicts between programs having the same counter keys or
programs colliding with their own past keys in this discussion.)
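The counter scheme just described can be sketched in a few lines of Python. This is a hypothetical reconstruction, not TurKit's actual code: CrashAndRerun and NotYetReady are illustrative names, the "database" is a plain dict standing in for persistent storage, and the rerun loop omits the periodic sleep a real system would use.

```python
# Sketch of TurKit-style crash-and-rerun with a counter-based once.

class NotYetReady(Exception):
    """Simulates a crash: a needed result (e.g., a HIT result) is unavailable."""

class CrashAndRerun:
    def __init__(self):
        self.db = {}        # memoized results, keyed by counter value
        self.counter = 0    # incremented per once call, reset on each rerun

    def once(self, fn, *args):
        key = self.counter
        self.counter += 1
        if key in self.db:          # replay: return the recorded result
            return self.db[key]
        result = fn(*args)          # first run: call the function...
        self.db[key] = result       # ...and record its result
        return result

    def run(self, program):
        """Continuously rerun program until it completes without crashing."""
        while True:
            self.counter = 0        # counter resets at the start of every run
            try:
                return program(self)
            except NotYetReady:
                pass                # a real system would sleep, then rerun
```

Because the program is deterministic, a run that crashes waiting on a result replays the same once calls in the same order on the next run, drawing previously recorded values (such as the pivot under db[0]) from the database.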
For example, suppose we use the Quick-Sort of Listing 7.1 to sort three images,
A, B and C. These images are initially in the order A = [C,B,A]. During the first run
of the program, once is first called on line 3. Supposing A.randomIndex() (line 3)
returned the index 2 (corresponding to image A), once would store the value 2 under
the key 0 in the database (e.g., db[0] = 2). Later on the same initial run, compare
is executed (lines 7 and 15), and the first call to createHIT succeeds, so we would
save its resulting HIT identifier under db[1]. Then, getHITResult is executed, but the result of
comparing image A (the pivot) to C (the first index) is not available, so the program
crashes. On the second run, the initial pivot on line 3 is looked up under db[0]. Then
the initial createHIT would be skipped because the result is available under db[1].
However, supposing the getHITResult was now ready, the program might then save
db[2] = A to signify that A is less than C (according to some criteria, like blurriness).
Then the first compare returns, and the second compare in the for loop executes.
For that second compare, the call to createHIT on line 16 will be saved to db[3].
Note that because the counter is specific to the program, rather than the current
stack frame, this second call to createHIT has a different key (3) than the first (1).
Further, because the Quick-Sort has been made deterministic by calls to once, the
branches taken and recursive calls made always occur in the same order, so each
counter value matches the same once call on every rerun.
7.3 HPROC Subsystems
HPROC is a system which extends TurKit in various ways. One major difference
is that HPROC can run code in response to web requests, which TurKit cannot
do. Running code in response to web requests allows for more natural handling of
interaction between machine and human computation, as well as better handling of
the stateful nature of web requests in human programming. Another major difference
between TurKit and HPROC is that HPROC can have any number of hprocesses
(HPROC’s version of a process), while TurKit can only have one running program.
(We go into much more depth about hprocesses in Section 7.4.)
HPROC helps satisfy the motivations of Section 7.1 by creating a specialized type
of operating system within an operating system. HPROC has its own notions of
processes, state, and interfaces (analogous to, e.g., network interfaces) that are built
on top of the host operating system, with additional features that make those notions
more usable for human programming.¹ A high level graphical overview of HPROC is
shown in Figure 7.1. The five main subsystems from Figure 7.1 are:
¹In this chapter, we have changed the names of notions and code in our HPROC system for ease of exposition. For example, hprocesses are called components in our system, HCDs are called component types, and the HPID environment variable is really COMPONENTID. Nonetheless, there is a one-to-one mapping of notions in this chapter to notions and code in our real, running system.
Figure 7.1: Graphical overview of the full HPROC system.
Database A MySQL database (shown in the middle left of Figure 7.1) contains
descriptors of code and hprocesses within HPROC, event related information,
and variables. Most of the other subsystems, including the programmer remote
API CGI, the web hprocess wrapper CGI, and hprocron (all discussed below)
interact with the database.
Web Server A LigHTTPd web server (shown in the top left of Figure 7.1) serves
as a frontend for the web hprocess wrapper CGI and programmer remote API
CGI.
Hprocron An operating system process (upper right of Figure 7.1) which spawns
UNIX processes (“resumes hprocesses,” see Section 7.4) based on the contents
of the MySQL database. Analogous to UNIX cron. Used to implement events
(discussed in Section 7.4), polling (Section 7.5.5), and TurKit style crash-and-
rerun functionality.
Web Hprocess Wrapper CGI A CGI script which spawns UNIX processes (“re-
sumes hprocesses,” see Section 7.4) based on an HTTP request and the contents
of the MySQL database. (So-called because the processes are executed in an
environment where they are wrapped by the CGI script.) Used to implement a
web interface for hprocesses.
Programmer Remote API CGI A CGI script which allows a remote program-
mer to upload code into the HPROC system, run that code, and get results
out.
7.4 HPROC Hprocesses
Although we go into much more depth in our walkthrough (Section 7.5), this section
gives a brief overview of the key concept in HPROC—the hprocess. An hprocess is
the analogue to a process in our specialized operating system. Every hprocess has
an hpid, or hprocess identifier, analogous to an operating system process identifier
(PID). An hprocess runs regular UNIX code. An hprocess can be in one of three
states: active, waiting, or finished. When an hprocess is active, the UNIX code for
the hprocess is running as a UNIX operating system process. When an hprocess is
waiting or finished, there is no running UNIX operating system process corresponding
to the hprocess.
When an hprocess transitions from active to waiting, we say that the hprocess
was suspended, which is our equivalent to a crash in TurKit crash-and-rerun. When
an hprocess transitions from waiting to active, we say that the hprocess was resumed,
which is our equivalent to a rerun in TurKit crash-and-rerun. Once an hprocess
transitions to the finished state, it will not transition to either of the other states.
(Usually in our system, hprocesses transition to the finished state after completion,
as in TurKit crash-and-rerun—see Section 7.5.7.) Section 7.5.6 goes into more detail
about how hprocesses are suspended and resumed.
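The three states and their named transitions can be summarized in a small sketch. The class and method names here are illustrative, and the assumption that a newly created hprocess begins in the waiting state is ours; in HPROC itself this bookkeeping lives in the MySQL database.

```python
# Sketch of the hprocess lifecycle: waiting <-> active -> finished.
class Hprocess:
    def __init__(self, hpid):
        self.hpid = hpid
        self.state = "waiting"     # assumed initial state: no UNIX process yet

    def resume(self):
        # waiting -> active: equivalent to a rerun in TurKit crash-and-rerun.
        assert self.state == "waiting"
        self.state = "active"      # a UNIX process now runs the hprocess code

    def suspend(self):
        # active -> waiting: equivalent to a crash in TurKit crash-and-rerun.
        assert self.state == "active"
        self.state = "waiting"

    def finish(self):
        # active -> finished: terminal; no further transitions are allowed.
        assert self.state == "active"
        self.state = "finished"
```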
There are two types of hprocesses, standalone hprocesses and web (or webcgi)
hprocesses. A standalone hprocess will be resumed in response to an event in hprocron
(see below). A web hprocess will be resumed in response to a specially crafted web
request to the web hprocess wrapper CGI.
Each hprocess has some associated persistent variable storage. What this means
in practice is that there is a table within the MySQL database where hprocesses can
store variable data. Each row in this table contains an hpid, a name for the variable,
a value to which the variable is currently set, a type, and a status. Any part of
the HPROC system can add variables under any hpid, but hprocesses generally use
this variable storage to store their own information. Adding, deleting, and updating
variables is done via SQL.
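The variable table just described might look like the following sketch. sqlite3 stands in here for the system's MySQL database, and the column names are inferred from the description above rather than taken from HPROC's actual schema.

```python
import sqlite3

# Hypothetical sketch of the hprocess variable-storage table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE variables (
        hpid   INTEGER NOT NULL,   -- owning hprocess identifier
        name   TEXT NOT NULL,      -- variable name
        value  TEXT,               -- current value (serialized)
        type   TEXT,               -- value type
        status TEXT,               -- variable status
        PRIMARY KEY (hpid, name)
    )
""")

# Any part of the system can add a variable under any hpid:
conn.execute(
    "INSERT INTO variables (hpid, name, value, type, status) "
    "VALUES (?, ?, ?, ?, ?)",
    (1003, "results", '["photo1"]', "json", "set"),
)

# ...and an hprocess can read its own variables back:
row = conn.execute(
    "SELECT value FROM variables WHERE hpid = ? AND name = ?",
    (1003, "results"),
).fetchone()
```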
One important use of this persistent variable storage is for memoizing intermediate
results, as in TurKit. HPROC has a function analogous to once, and an implementation
which is functionally similar to the counter implementation described in Section 7.2.²
(Our implementation uses an identifier which maps to the current program counter
within the current stack frame, but is functionally largely the same as the counter
implementation.)
²Also, many HPROC functions use our once equivalent internally to ensure idempotence and determinism. For example, hprocess creation and cross-hprocess function calls will implicitly store state to ensure idempotence and determinism.
Hprocesses communicate primarily via two methods: events and cross-hprocess
function calls.
The hprocron operating system process (introduced in Section 7.3) maintains a list
of hprocesses that are listening to particular events. Events are simply ASCII strings,
like E_POLL_1003. When an event is fired, hprocron is responsible for resuming the
hprocesses listening for the event. Any part of the HPROC system can contact
hprocron to add a listener or fire an event. Every hprocess automatically listens on
a number of default events. For example, an hprocess with an hpid of 1003 would by
default listen for the event E_POLL_1003, which is a polling event (see Section 7.5.5).
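The listener bookkeeping hprocron performs can be sketched as follows. The class and event names (with underscores) are illustrative; the real hprocron reads its listener state from the MySQL database rather than keeping it in memory.

```python
# Sketch of hprocron's event handling: events are ASCII strings, and firing
# an event resumes every hprocess listening for it.
class Hprocron:
    def __init__(self, resume):
        self.listeners = {}     # event string -> set of listening hpids
        self.resume = resume    # callback that resumes an hprocess by hpid

    def add_listener(self, event, hpid):
        self.listeners.setdefault(event, set()).add(hpid)

    def fire(self, event):
        for hpid in self.listeners.get(event, set()):
            self.resume(hpid)
```

For example, registering hpid 1003 on its default polling event and then firing that event resumes hprocess 1003, while firing an event with no listeners does nothing.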
Hprocesses (and other parts of the system) can call other hprocesses via cross-
hprocess function calls. It is best to think of cross-hprocess function calls as a spe-
cialized form of message passing. Hprocesses can leave messages for one another via
variables, but the hprocesses themselves perform whatever actions they desire based
on the variables available to them whenever they next resume. Cross-hprocess func-
tion calls work by placing a variable in the variable storage of the target hprocess.
The variable contains information about the desired function call, as well as a dif-
ferent hpid and variable name in which to put the result. The target hprocess then
computes the function based on the information given in the original variable and
places the result in the new variable requested by the source hprocess. (We give a
full example in Section 7.5.8.)
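The variable-based call protocol just described can be sketched as follows. All names here are illustrative, and a plain dict stands in for the MySQL variable table; the point is only the message-passing shape: the source leaves a call-request variable in the target's storage, and the target services it on its next resume.

```python
# Sketch of cross-hprocess function calls via variable storage.
variables = {}   # (hpid, name) -> value, standing in for the variable table

def call(target_hpid, fn_name, args, reply_hpid, reply_var):
    # Leave a message describing the desired call in the target's storage,
    # including where the result should be written.
    variables[(target_hpid, "call_request")] = {
        "fn": fn_name, "args": args,
        "reply_hpid": reply_hpid, "reply_var": reply_var,
    }

def on_resume(hpid, functions):
    # When the target hprocess next resumes, it computes any pending call
    # and places the result in the variable the source requested.
    req = variables.pop((hpid, "call_request"), None)
    if req is not None:
        result = functions[req["fn"]](*req["args"])
        variables[(req["reply_hpid"], req["reply_var"])] = result
```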
7.5 HPROC Walkthrough
HPROC is a large and complex system made up of over ten thousand lines of custom
Python code. Probably the easiest way to get a feel for working with the system is
a walkthrough of the most common functionality. To demonstrate this functionality,
this section uses the example of a program which asks a worker to compare two photos
and indicate which of the two the worker prefers. Listings 7.3, 7.4, 7.5 and 7.6 make up
a script called walkthroughscript.py which is uploaded into the HPROC system
for this walkthrough. Listing 7.2 is a script called walkthroughuploader.py which
does the uploading of the first script and retrieves results. These two scripts make up
our example program. We follow our example program through the following steps
in this walkthrough:
1. A remote connection is made (Section 7.5.1).
2. The code is uploaded, using the upload script (Section 7.5.2).
3. Introspection is performed on the uploaded code (Section 7.5.3).
4. An hprocess is created remotely using the upload script (Section 7.5.4).
5. The new hprocess is set up for polling (Section 7.5.5).
6. The hprocess is resumed, causing a UNIX operating system process to be
spawned (Section 7.5.6).
7. The upload script calls a remote function (Section 7.5.8) and the script uses
dispatch handling to handle the remote function (Section 7.5.7).
8. Local hprocesses are created within the HPROC system (Section 7.5.9).
9. Web forms and human drivers are created within the HPROC system (Sections
7.5.10 and 7.5.11).
10. A recruiter is asked to recruit for the human driver until a worker completes
the associated form (Section 7.5.12).
We go into more depth about each of these steps below.
7.5.1 Making a Remote Connection
HPROC is a self-contained system, running on a machine remote from the programmer.
For this walkthrough, we will assume that the HPROC system is running on
a computer with the hostname hproc.stanford.edu. We call the computer run-
ning HPROC the HPROC host. Likewise, we will assume that the programmer is
working on a separate computer at test.stanford.edu. We will call the remote
programmer’s computer the remote client.
1 #!/usr/bin/env python
2
3 def main():
4     conn = connect('https://hproc.stanford.edu/remote.cgi')
5
6     ... # removed
7
8     conn.uploadCode('walkthroughscript.py')
9
10     compareItemsProc = conn.newHprocess('edu.stanford.thesis.sa')
11     comparisonResult = compareItemsProc.fn.compareItems(
12         'http://i.stanford.edu/photo1.jpg',
13         'http://i.stanford.edu/photo2.jpg').get()
14
15     print comparisonResult
16
17     ... # removed
18
19 if __name__ == '__main__':
20     main()
Listing 7.2: Walkthrough script upload script (walkthroughuploader.py).
1 #!/usr/bin/env python
2
3 from hp.mop import dispatch, env
4 from hp.shared import exceptions, util
5 from hp.recruit import manturk_recruitd_iface
6 import cgi, SimpleXMLRPCServer
7
8 turk_javascript = ... # removed
9
10 def doThrower(photo_url1, photo_url2, target_url):
11     global turk_javascript
12
13     print """Content-Type: text/html
14
15 <html>
16 <head><title>Photo Comparison</title>%s</head>
17 <body>
18 First Photo URL is<BR/><IMG SRC='%s' /><br/><br/>
19 Second Photo URL is<BR/><IMG SRC='%s' /><br/><br/>
20 Which do you prefer?
21 <form name="thrower" action="%s" method="post">
22 <input type="radio" name="choice" value="photo1" /> First Photo<br />
23 <input type="radio" name="choice" value="photo2" /> Second Photo<br />
24 <input type="hidden" name="assignmentId" id="assignmentId" value="" />
25 <input type="submit" value="Submit" />
26 </form>
27 </body>
28 </html>
29 """ % (turk_javascript, photo_url1, photo_url2, target_url)
30
31 def doCatcher():
32     form = cgi.FieldStorage()
33
34     myHprocess().v['results'] = [form.getfirst("choice", ""),]
35
36     redirect_html = "Location: %s://%s/%s?%s&%s" % (
37         'https', 'www.mturk.com', 'mturk/externalSubmit',
38         'assignmentId=%s' % form.getfirst('assignmentId', ''),
39         'data=none'
40     )
41     print redirect_html
Listing 7.3: Walkthrough HPROC script (walkthroughscript.py), Part I: Initial setup, thrower, and catcher functionality.
42 class CompareFormHandler(object):
43     def getThrower(self):
44         return util.getThrowerUrl(env.getMyHpid())
45
46     def isDone(self):
47         return myHprocess().v.has_key('results')
48
49     def getResults(self):
50         return myHprocess().v['results']
51
52     def getTaskType(self):
53         return {'fqn': 'edu.stanford.hproc.tasktype.compareform.v1'}
54
55 def doXmlRpc():
56     handler = SimpleXMLRPCServer.CGIXMLRPCRequestHandler(
57         allow_none=True)
58     handler.register_introspection_functions()
59     handler.register_instance(CompareFormHandler())
60     handler.handle_request()
61
62 def handleRequest(photo_url1, photo_url2):
63     if util.requestType() == 'thrower':
64         doThrower(photo_url1, photo_url2,
65             util.getCatcherUrl(env.getMyHpid()))
66     elif util.requestType() == 'catcher':
67         doCatcher()
68     elif util.requestType() == 'xmlrpc':
69         doXmlRpc()
70     else:
71         return ''
Listing 7.4: Walkthrough HPROC script (walkthroughscript.py), Part II: XML-RPC and web request handling functionality.
72 def makeForm(photo_url1, photo_url2):
73     formhprocess = newHprocess('edu.stanford.thesis.web')
74     formhprocess.dfn.handleRequest(photo_url1, photo_url2)
75     xmlrpc_url = util.getXmlRpcUrl(formhprocess.id)
76     thrower_url = util.getThrowerUrl(formhprocess.id)
77
78     return {'xmlrpc': xmlrpc_url,
79         'thrower': thrower_url}
80
81 def fillForm(xmlrpc_url):
82     r_iface = manturk_recruitd_iface.getInterface()
83     r_iface.setMemo(True)
84
85     ticket_id = r_iface.getUniqueTicketIdentifier()
86     r_iface.manage(ticket_id, xmlrpc_url)
87     r_iface.setMemo(False)
88
89     if not r_iface.isDone(ticket_id):
90         raise exceptions.HprocIntendedError(
91             "Waiting on ticket %s." % ticket_id)
92     r_iface.setMemo(True)
93
94     res = r_iface.getResults(ticket_id)
95     r_iface.finishTicket(ticket_id)
96     return res
97
98 def compareItems(photo_url1, photo_url2):
99     makeFormProc = newHprocess('edu.stanford.thesis.sa')
100     lazyMakeForm = makeFormProc.fn.makeForm(photo_url1, photo_url2)
101     makeFormResult = lazyMakeForm.get()
102
103     fillFormProc = newHprocess('edu.stanford.thesis.sa')
104     lazyFillForm = fillFormProc.fn.fillForm(
105         makeFormResult['xmlrpc'])
106     fillFormResult = lazyFillForm.get()
107     return fillFormResult
Listing 7.5: Walkthrough HPROC script (walkthroughscript.py), Part III: makeForm, fillForm, and compareItems standalone functions.
108 def runFunc(args):
109     if env.getMyEnvironmentType() == 'webcgi':
110         dispatch.dispatchDefaultFunction(globals())
111     else:
112         dispatch.dispatchSingle(globals())
113
114 def main():
115     CODE_DESCRIPTORS = [
116         {
117             'fqn': 'edu.stanford.thesis.sa',
118             'language': 'python',
119             'args': '--run',
120             'environment': 'standalone',
121             'help': """Code to create comparison form.""",
122             'default_poll_s': 10,
123         },
124         {
125             'fqn': 'edu.stanford.thesis.web',
126             'language': 'python',
127             'args': '--run',
128             'environment': 'webcgi',
129             'help': """A binary comparison form for photos.""",
130             'default_poll_s': 0,
131         }
132     ]
133
134     dispatch.defaultCommandLineHandler(CODE_DESCRIPTORS, runFunc)
135
136 if __name__ == '__main__':
137     main()
Listing 7.6: Walkthrough HPROC script (walkthroughscript.py), Part IV: Dispatch functions, code descriptors (invocations), and main functions.
1 <?xml version="1.0"?>
2 <methodCall>
3   <methodName>getVariable</methodName>
4   <params>
5     <param><value><int>41</int></value></param>
6     <param><value><string>workerformresult</string></value></param>
7   </params>
8 </methodCall>
Listing 7.7: Example HTTP POST in XML-RPC for the call getVariable(41, "workerformresult").
Because HPROC is a self-contained system remote from the programmer, the first
thing that the programmer needs to do is to make a connection from the remote client
to the HPROC host. Listing 7.2 is a program which we will call the upload script
program, because, among other things, it uploads code from the remote client to the
HPROC host. Our walkthrough begins with the upload script program setting up a
connection from the remote client to the HPROC host.
For those unfamiliar with Python, line 1 is the common Python preamble, and lines
19 and 20 call the main function when the script is invoked. Line 4 sets up a connection
object, connected remotely to the HPROC host through the programmer remote API
CGI. Figure 7.1 shows this interaction, where the programmer is connected to the
web server, specifically the programmer remote API CGI, in the upper left corner.
The programmer remote API CGI is a common gateway interface (CGI) script. In
particular, when a request is made for https://hproc.stanford.edu/remote.cgi,
the programmer remote API CGI script is called. We now go on a brief tangent
to describe the programmer remote API CGI implementation, before returning to
walking through Listing 7.2.
The programmer remote API CGI script itself is implemented as an XML Remote
Procedure Call (XML-RPC) handler. XML-RPC is a form of RPC where the remote
client does an HTTP POST with an XML document describing a function to be called,
and then the response is an XML document explaining the result. For example, a
simplified version of an HTTP POST doing a remote procedure call for the call
getVariable(41, "workerformresult") would look like the XML shown in Listing
1 <?xml version="1.0"?>
2 <methodResponse>
3   <params>
4     <param><value><string>photo1</string></value></param>
5   </params>
6 </methodResponse>
Listing 7.8: Example HTTP response in XML-RPC for the call getVariable(41, "workerformresult") where the response is the string value "photo1."
7.7. Meanwhile, the response, assuming the result of the RPC was “photo1”, would be
that shown in Listing 7.8. In short, XML-RPC is just a form of RPC where one uses
HTTP POSTs and responses to perform procedure calls, and can be implemented
using CGI scripts, as we do in the HPROC system. In our case, the programmer
is abstracted from having to deal with the specifics of XML-RPC by the connection
object conn. conn turns any function call made on it into an XML-RPC call to the
programmer remote API CGI on the HPROC host.
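As a concrete illustration, Python's standard library can produce exactly the kind of request body shown in Listing 7.7. Whether HPROC's own connection object was built on this library is an assumption here; the sketch only demonstrates the wire format.

```python
# Marshal the call getVariable(41, "workerformresult") into an XML-RPC
# request body of the form shown in Listing 7.7, using only the standard
# library (xmlrpc.client; this was xmlrpclib in the Python 2 era of HPROC).
import xmlrpc.client

request_xml = xmlrpc.client.dumps((41, "workerformresult"),
                                  methodname="getVariable")
print(request_xml)  # an XML methodCall document for getVariable
```

A connection object like conn can then be little more than an HTTP client that POSTs such documents to remote.cgi and unmarshals the methodResponse.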
There is various additional boilerplate for setting up a remote connection to the
HPROC system via the connection object conn. We have removed that boilerplate,
which would usually be at line 6 of Listing 7.2.
7.5.2 Uploading Code
After making a remote connection (Section 7.5.1), the next step is for the programmer
to upload some code into the HPROC system. The idea is that code (other than the
upload script) runs on the HPROC host, rather than on the remote client controlled by
the programmer. The uploadCode method, on line 8 of Listing 7.2, takes a path on the
remote client file system that corresponds to code that can be executed on the HPROC
host, within the HPROC system. We assume that the file walkthroughscript.py
exists on the file system of the remote client. The upload script (via uploadCode)
reads the walkthroughscript.py file and then posts it via an XML-RPC call to the
programmer remote API CGI. Specifically, the file is sent verbatim via an XML-RPC
function called uploadCode.
7.5.3 Introspection
This section discusses introspection—the process by which code uploaded into the
HPROC system describes itself. First, we discuss when introspection occurs and how
it produces code descriptors. Each code descriptor represents a particular way that
a code file can be invoked, together with restrictions and information about that
invocation. Second, we discuss the format and content of code descriptors. Third, we
discuss the aftermath of introspection.
When the programmer remote API CGI receives the uploadCode call, the CGI
performs three actions.
1. The CGI saves the code that was sent via the RPC to a file.
2. The CGI performs introspection on the uploaded code file.
3. The CGI performs any additional actions that should result from that intro-
spection.
To perform introspection, the code file is made executable on disk (e.g., via chmod
770), and then it is run with the argument --info. The code file is then expected to
produce a list of code descriptors.
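The introspection step might be implemented along these lines. This is a hypothetical sketch, not HPROC's actual code: the function name is ours, and only the described behavior (chmod, run with --info, parse the printed JSON) is taken from the text.

```python
# Sketch of introspection: make the uploaded code file executable, run it
# with --info, and parse the JSON list of code descriptors it prints.
import json
import os
import subprocess

def introspect(code_path):
    os.chmod(code_path, 0o770)               # make the code file executable
    completed = subprocess.run([code_path, "--info"],
                               capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)      # a list of code descriptor objects
```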
Listing 7.6 shows the last of the four pieces of code that make up the
walkthroughscript.py code file uploaded by the walkthroughuploader.py of
Listing 7.2. (Recall that Listings 7.3, 7.4 and 7.5 are the other three pieces
of code in this file.) Lines 115–134 of Listing 7.6 handle producing code de-
scriptors when walkthroughscript.py is called with --info. In particular, the
defaultCommandLineHandler function on line 134 is a convenience function for out-
putting code descriptors.
The defaultCommandLineHandler function takes two arguments. The first argu-
ment to defaultCommandLineHandler will be output as a JavaScript Object Notation
(JSON) list when the executable is run with --info. JSON is a lightweight data-
interchange format inspired by the JavaScript language, made up of arrays (e.g.,
[1,2,3]), objects (e.g., {"x":3, "y":7}) and values (e.g., 7 or "foo"). The sec-
ond argument to defaultCommandLineHandler, a function, will be called when the
1 > ./walkthroughscript.py --info
2 [
3   {
4     "fqn": "edu.stanford.thesis.sa",
5     "help": "Code to create comparison form.",
6     "language": "python",
7     "args": "--run",
8     "environment": "standalone",
9     "default_poll_s": 10
10   },
11   {
12     "fqn": "edu.stanford.thesis.web",
13     "help": "A binary comparison form for photos.",
14     "language": "python",
15     "args": "--run",
16     "environment": "webcgi",
17     "default_poll_s": 0
18   }
19 ]
Listing 7.9: Introspection on walkthroughscript.py by using --info.
executable is run with --run. In our case, walkthroughscript.py will output the
code descriptors specified on lines 115–132 when run with --info, and will return
the result of the function runFunc (shown on line 108) when run with --run.
The output of walkthroughscript.py --info is shown in Listing 7.9. The out-
put consists of an array of two objects, where each object corresponds to one way to
run the executable code file. In particular, the output shown in Listing 7.9 says that
walkthroughscript.py can be invoked in two ways, based on two code descriptors.
The first code descriptor has the following details:
1. The code descriptor has a fully qualified name, or FQN, of
edu.stanford.thesis.sa. This name is used to identify the code de-
scriptor within the HPROC system. (Later we will create hprocesses based on
the code descriptor FQN.)
2. The code descriptor has help text, informing us that the invocation is meant
for creating comparison forms.
3. The code descriptor informs us that the code is written in the Python language.
4. The code descriptor informs us that for this invocation, the executable should
be run as walkthroughscript.py --run.
5. The code descriptor tells us in what circumstances the executable should be
run. In particular, this invocation is intended for a standalone environment,
which means that it should be run by hprocron (see Section 7.5.5).
6. The code descriptor tells us that a default polling time of ten seconds should
be used (see Section 7.5.5).
By contrast, the second code descriptor has the following details:
1. The code descriptor has an FQN of edu.stanford.thesis.web.
2. The code descriptor has help text, informing us that the invocation is meant
to display a binary photo comparison form by outputting HTML to a remote
worker.
3. The code descriptor informs us that the code is written in the Python language.
4. The code descriptor informs us that for this invocation, the executable should
be run as walkthroughscript.py --run.
5. The code descriptor tells us in what circumstances the executable should be
run. In particular, this invocation is intended for a webcgi environment, which
means that it should be run as the result of an HTTP request to the web server
(see Section 7.5.10).
6. The code descriptor tells us not to set a default polling time.
Thus, the walkthroughscript.py code file is intended to be invoked in two different
ways, with two different names and environments.
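The two-descriptor structure described above can be sketched as follows. This is an illustrative reconstruction, not the actual output of Listing 7.9; the field names (fqn, help, language, command, environment, default_poll_s) are assumptions about the introspection schema.

```python
# Hypothetical sketch of the JSON emitted by "walkthroughscript.py --info".
# Field names are illustrative assumptions, not the exact HPROC schema.
import json

descriptors = [
    {
        "fqn": "edu.stanford.thesis.sa",
        "help": "Creates comparison forms.",
        "language": "python",
        "command": "walkthroughscript.py --run",
        "environment": "standalone",   # run periodically by hprocron
        "default_poll_s": 10,
    },
    {
        "fqn": "edu.stanford.thesis.web",
        "help": "Displays a binary photo comparison form as HTML.",
        "language": "python",
        "command": "walkthroughscript.py --run",
        "environment": "webcgi",       # run in response to HTTP requests
        "default_poll_s": 0,           # no polling for web hprocesses
    },
]

print(json.dumps(descriptors, indent=2))
```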
Once the newly uploaded code file has been introspected via --info, the intro-
spection information is added to the MySQL database. Specifically, it is added to a
table of code descriptors that HPROC knows about, shown in Table 7.1. Once the
FQN Environment Command Poll (s) . . .
e.s.t.sa   standalone   /opt/hproc/code/walkthroughscript.py --run   10   . . .
e.s.t.web  webcgi       /opt/hproc/code/walkthroughscript.py --run    0   . . .
Table 7.1: The code descriptors table within the MySQL database in the HPROC system, after walkthroughscript.py has been introspected. Some columns have been removed, edu.stanford.thesis has been abbreviated to e.s.t, and default poll seconds has been abbreviated to “Poll (s).”
code descriptor is in this table, as we will see, other parts of HPROC can use the
code file. Once the code file has been introspected and one or more rows have been
added to the code descriptors table, the uploadCode XML-RPC call (triggered on line
8 of Listing 7.2) returns successfully, signalling that the code has been successfully
uploaded and introspected.
7.5.4 Hprocess Creation
Section 7.5.1 established a connection from the programmer’s remote client machine
to the HPROC system. Section 7.5.2 uploaded a code file, and registered that code
file’s code descriptors in a table in the MySQL database. However, we have not yet
actually done anything with walkthroughscript.py other than sending it to the
HPROC host and introspecting it. The next step is to create an hprocess (analogous
to an operating system process) within the HPROC system, associated with one of
the code descriptors just registered.
Line 10 of Listing 7.2 (i.e., the upload script) creates a new hprocess, using the
newHprocess method of the previously created connection object. In particular,
newHprocess takes the fully qualified name of a code descriptor and creates a new
hprocess associated with that code descriptor. The practical effect of “creating a new
hprocess” is two things:
1. A line is added to a table of process descriptors, with a new identifier, to signify
that there is a new hprocess.
2. Any special aspects of the code descriptor are handled (see Section 7.5.5).
HPID Code Descriptor FQN Status . . .
1003 e.s.t.sa waiting . . .
Table 7.2: The process descriptors table within the MySQL database in the HPROC system, after a new hprocess with the edu.stanford.thesis.sa code descriptor of walkthroughscript.py has been created. Some columns have been removed, and edu.stanford.thesis has been abbreviated to e.s.t. The HPID is the process identifier for the hprocess.
Since newHprocess is called through the connection object, these actions actually
take place through the programmer remote API CGI on the web server.
In our case, after the line is added to the table of process descriptors in the MySQL
database, that table looks something like Table 7.2. Note that this table is keyed on
the HPID column, and every hprocess has a unique hpid identifier (even if multiple
hprocesses have the same code descriptor FQN).
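The process descriptors table can be sketched as follows, using SQLite in place of MySQL for self-containment; the column names follow Table 7.2, but the exact schema is an assumption.

```python
# Illustrative sketch of the process descriptors table (SQLite standing in
# for MySQL); column names follow Table 7.2 and are assumptions.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE process_descriptors (
        hpid    INTEGER PRIMARY KEY,     -- unique per hprocess
        fqn     TEXT NOT NULL,           -- code descriptor FQN
        status  TEXT NOT NULL            -- e.g. waiting, finished
    )""")
db.execute(
    "INSERT INTO process_descriptors VALUES "
    "(1003, 'edu.stanford.thesis.sa', 'waiting')")
row = db.execute(
    "SELECT fqn, status FROM process_descriptors WHERE hpid = 1003"
).fetchone()
```

Because hpid is the primary key, two hprocesses may share a code descriptor FQN but never an hpid.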
7.5.5 Polling
Once the hprocess is created, there may be additional actions that need to occur. In
our case, recall that the edu.stanford.thesis.sa code descriptor (shown in Listing
7.9) had a default poll s value of 10. As a result, when the programmer remote
API CGI creates a new hprocess associated with this code descriptor, in addition to
adding it to the process descriptors table, the CGI also tells hprocron to periodically
poll the new hprocess. (The purpose of having default poll s be a code descriptor
option is to make it easier for a programmer to work in a TurKit crash-and-rerun
style where the hprocess is run periodically.)
Hprocron is responsible for maintaining a list of hprocesses that should be polled
periodically or at a particular time. For example, the hprocess just created in our
walkthrough needs to be polled every ten seconds. Once hprocron is notified by the
programmer remote API CGI that this polling should occur, the hprocron process
will fire an E POLL 1003 event every ten seconds, which will in turn cause the event
handling code to resume the hprocess. (See Section 7.4 for a discussion of event han-
dling by hprocron.) Any part of the HPROC system can request that an hprocess be
polled. Once the hprocron process has been notified to poll the hprocess periodically,
creation of the new edu.stanford.thesis.sa hprocess has finished.
7.5.6 Executable Environment
Section 7.4 stated that hprocron was responsible for resuming hprocesses, but
did not specify how this resuming was done. To resume an hprocess, hprocron
spawns a real UNIX operating system process. This operating system process is
spawned with the command line of the code descriptor for the process. For exam-
ple, our hprocess with hpid 1003 would be resumed by invoking the command line
/opt/hproc/code/walkthroughscript.py --run based on Table 7.1.
This operating system process runs in a UNIX environment where UNIX environ-
mental variables are set. (By UNIX environmental variables, we mean, for example,
$HOME whose value is the user’s home directory.) In particular, the environmental
variable HPID is set, with the value of the hpid (i.e., from Section 7.5.4). For ex-
ample, the process spawned as a result of our hprocess resuming spawns in a UNIX
environment where the environmental variable HPID is set to 1003. Having the hpid
available in the environment of the process allows hprocesses with the same code de-
scriptor to behave differently. In fact, this walkthrough will include three hprocesses
with the same edu.stanford.thesis.sa FQN.
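The spawning step can be sketched as below. This is a minimal illustration of the idea, not HPROC's actual launcher; the function name resume_hprocess is an assumption.

```python
# Minimal sketch of how hprocron might resume an hprocess: spawn the code
# descriptor's command line with the HPID environmental variable set, so
# that hprocesses sharing a code descriptor can behave differently.
import os
import subprocess

def resume_hprocess(hpid, command):
    env = dict(os.environ)        # inherit the normal UNIX environment
    env["HPID"] = str(hpid)       # identify which hprocess this is
    return subprocess.run(command, env=env)

# e.g. resume_hprocess(1003,
#          ["/opt/hproc/code/walkthroughscript.py", "--run"])
```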
When the hprocess suspends, it simply throws an exception and exits. Hprocesses
always suspend themselves, by voluntarily exiting.
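The suspend-by-exception pattern can be sketched as follows; the exception class and storage dictionary are illustrative assumptions, not HPROC's actual names.

```python
# Sketch of the crash-and-rerun suspend pattern: an hprocess "suspends" by
# raising an exception and exiting; when later polled, it reruns from the
# top and tries again. SuspendHprocess is an illustrative name.

class SuspendHprocess(Exception):
    """Raised when a needed result is not yet available."""

def get_or_suspend(storage, name):
    # Return the stored value, or suspend the hprocess if it is missing.
    if name not in storage:
        raise SuspendHprocess(f"variable {name!r} not ready; exiting")
    return storage[name]
```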
7.5.7 Dispatch Handling
Section 7.5.4 created a new hprocess. Section 7.5.5 started hprocron polling
that hprocess. Then Section 7.5.6 showed how that periodic polling leads to a running
UNIX operating system process. As a result, at this stage in the walkthrough,
walkthroughscript.py --run is being invoked and exiting every ten seconds. So
what does walkthroughscript.py --run do?
As we noted in Section 7.5.3, walkthroughscript.py starts in the main() func-
tion of Listing 7.6 (line 114) and calls the defaultCommandLineHandler (line 134).
When run with --run (rather than --info), this handler calls the runFunc function
(line 108).
The runFunc function checks to see whether it is being run by hprocron or by
the web server (we will see how the latter is possible in Section 7.5.10). The goal
of the runFunc function is to serve as a branch between the hprocess acting either
as a standalone hprocess (as here) or as a web hprocess (discussed later). In our
case, the hprocess being resumed every ten seconds is being run by hprocron, so the
dispatchSingle function on line 112 is run. Both the dispatchSingle function
of line 112 and the dispatchDefaultFunction function of line 110 are convenience
functions for implementing cross-hprocess function calls.
Both functions will check the variable storage for variables of the proper type
(func call or func default variables). If they find a variable of the right type,
both parse the variable and determine the desired function and arguments within
the variable. If the desired function exists within the program, that function is
called with the arguments. In the case of dispatchSingle, if the called function
returns successfully, dispatchSingle will place the result in the variable storage of
the source hprocess and set the status of the target hprocess to finished. (Recall
from Table 7.2 that hprocesses have a status in their process descriptor.) Hprocesses
with a finished status will not be resumed (rerun) by hprocron (or by the web
hprocess wrapper CGI, as we will see). The dispatchDefaultFunction function
works similarly to dispatchSingle, but does not set the status to finished, and
does not return a result. Thus, dispatchDefaultFunction is useful for hprocesses
which may have functions called many times (in our case, web hprocesses), while
dispatchSingle is useful for emulating TurKit crash-and-rerun.
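The dispatchSingle behavior just described can be sketched as below. This is a simplified reconstruction under stated assumptions: variables are plain dictionaries, the type name func_call and the JSON keys mirror Table 7.3, and the real implementation reads from the MySQL variable storage rather than an in-memory list.

```python
# Hedged sketch of dispatchSingle-style behavior: find a waiting function
# call variable for this hprocess, invoke the named function, place the
# result where the caller will look, and mark this hprocess finished.
import json

def dispatch_single(variables, functions, statuses, hpid):
    for var in variables:
        if (var["hpid"] == hpid and var["type"] == "func_call"
                and var["status"] == "waiting"):
            call = json.loads(var["value"])
            fn = functions.get(call["fn"])
            if fn is None:
                continue                      # no such function in this file
            result = fn(*call["args"])
            # Place the result in the source hprocess's variable storage.
            variables.append({
                "hpid": call["return_hpid"],
                "name": call["return_varname"],
                "type": "result",
                "status": "done",
                "value": json.dumps(result),
            })
            statuses[hpid] = "finished"       # finished hprocesses not rerun
            return result
    raise RuntimeError("no waiting function call; hprocess suspends")
```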
At this stage in our walkthrough, nothing has put a function call in variable stor-
age for the hprocess with hpid 1003. As a result, hprocron will keep polling hprocess
1003, and hprocess 1003 will keep being resumed, leading to a UNIX operating sys-
tem process running walkthroughscript.py. However, that script will run until
it reaches dispatchSingle, which will throw an exception (causing the hprocess to
suspend) because there is no function call in its variable storage.
HPID     1003
Name     bc7eec4b56f300080fb36682f7c763768d9a7bce
Type     func call
Status   waiting
Value    {"fn": "compareItems",
          "args": ["http://i.stanford.edu/photo1.jpg",
                   "http://i.stanford.edu/photo2.jpg"],
          "return hpid": 1000,
          "return varname": "74f9dd61890dfced29673b6b5ecd7b34f7fe3845"}
. . .    . . .
Table 7.3: The row of the variable storage table corresponding to the compareItems function call.
7.5.8 Remote Function Calling
Sections 7.5.4, 7.5.5, 7.5.6, and 7.5.7 described what happened when the upload script
created a new hprocess on line 10 of Listing 7.2. The return value of newHprocess
on that line is an object which serves as a proxy for that new hprocess, named
compareItemsProc. On lines 11–13, the upload script makes a function call on that
hprocess, requesting the result of the function call compareItems on two URLs,
http://i.stanford.edu/photo1.jpg and http://i.stanford.edu/photo2.jpg,
executed using the new hprocess with hpid 1003.
When we say that the upload script makes a function call on the new hprocess,
what we really mean is that:
1. Lines 11–13 are parsed on the remote programmer’s client. Specifically, the
function to be called (compareItems) and the arguments to that function (the
two URLs) are determined by the compareItemsProc object.
2. The compareItemsProc object determines the hpid of the hprocess which it
corresponds to (1003). This information was saved when the newHprocess call
returned.
3. The compareItemsProc object determines an appropriate hpid to identify the
upload script. (This hpid just needs to not conflict with other hpids within the
HPROC system, and is only used for accessing the variable storage.)
4. The compareItemsProc object then makes an XML-RPC request to the pro-
grammer remote API CGI, requesting that a variable be added to variable stor-
age. The variable is a function call variable requesting the result of compareItems
on the two URLs be placed in variable storage with the upload script’s hpid
and a particular variable name.
5. The programmer remote API CGI processes the XML-RPC request, and places
the function call variable in variable storage within the HPROC system on the
HPROC host.
6. The compareItemsProc object returns a lazy result object (described below)
corresponding to the return variable of the function call.
After these steps, the row of variable storage corresponding to the function call
looks like Table 7.3. (Note that this is a single row of the variable storage table
presented vertically because of the length of some of the content.) The hpid of the
variable is 1003, because that is the hpid of the target hprocess. The name of the
variable is an auto-generated string, designed to be unique. The type of the variable
is func call. The status of the variable is waiting, meaning that the function call
is waiting to be processed. The value of the variable is a JSON object which details
the specifics of the function call. In particular, the value shows the function (fn is
compareItems), the arguments (args is the two URLs), the desired hpid of the return
variable (return hpid is 1000, the hpid of the upload script), and the desired variable
name of the return variable (return varname is a string designed to be unique).
All that is left for the upload script now is to wait. Eventually, the returned
result of compareItems will be in variable storage under the appropriate return hpid
and name. As noted previously, the return value of the function call to the upload
script is a lazy result object. By calling get on this lazy result object on line 13, we
cause the upload script to periodically request the return variable via XML-RPC to
the programmer remote API CGI. These periodic requests continue until the return
variable exists and there is a result returned.
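The lazy result object can be sketched as below. In the real system the lookup is an XML-RPC call to the programmer remote API CGI; here an injected fetch callable stands in for that request, and all names are illustrative.

```python
# Sketch of a lazy result object: get() periodically re-requests the return
# variable until it exists, then returns it.
import time

class LazyResult:
    def __init__(self, fetch, varname, poll_s=0.01):
        self._fetch = fetch        # callable returning the value or None
        self._varname = varname
        self._poll_s = poll_s

    def get(self, timeout_s=5.0):
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            value = self._fetch(self._varname)
            if value is not None:
                return value               # the return variable now exists
            time.sleep(self._poll_s)       # wait, then re-request
        raise TimeoutError(f"no result for {self._varname!r}")
```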
7.5.9 Local Hprocess Instantiation
By this point, we have walked through most of the upload script in Listing 7.2. We
turn our attention back now to walkthroughscript.py and hprocess 1003 which was
previously (Section 7.5.7) suspending and resuming periodically waiting for a function
call in variable storage. Now the variable storage has the function call shown in Table
7.3.
The next time hprocess 1003 is resumed, it will reach dispatchSingle on line
112 of Listing 7.6. This time, dispatchSingle will see the function call in variable
storage. dispatchSingle will then check the rest of the code file to determine if
there is a function called compareItems to call. There is such a function, on line
98 of Listing 7.5. dispatchSingle calls this function with the two URLs that were
contained in the arguments in variable storage.
The compareItems function uses newHprocess, cross-hprocess function calls, and
lazy result get like we saw in the upload script. However, in this case, these constructs
perform slightly differently because they are running locally on the HPROC host
within the HPROC system, rather than externally on the remote client. The key
differences are:
1. Each construct goes directly to the MySQL database, rather than going through
the programmer remote API CGI. However, newHprocess and cross-hprocess
function calls add the same rows to the process descriptors and variables tables
respectively.
2. The get method of the lazy result object returned by a cross-hprocess function
call will suspend the hprocess (throw an exception) rather than continuously
request the return variable if the return variable for the function is not yet
available. (This is in keeping with TurKit crash-and-rerun style.)
The first three lines of compareItems (starting on line 99 of Listing 7.5) create a new
hprocess (with hpid 1004), place a function call in variable storage for the new hpro-
cess (requesting the makeForm function) and then request the result of the makeForm
function (with get). Because the code descriptor for this new hprocess (hpid 1004)
is the same as that for hprocess 1003, it will be set up similarly, including the default
polling period of once every ten seconds. Because the result is not yet available,
hprocess 1003 will suspend. (In fact, hprocess 1003 will fail to make progress beyond
this point until hprocess 1004 returns the result of the makeForm call.)
7.5.10 Form Creation
When hprocess 1004 is next polled by hprocron, it will resume. At that point, like
hprocess 1003, hprocess 1004 will reach the dispatchSingle function on line 112 of
Listing 7.6. However, in this case, dispatchSingle will look in the UNIX environ-
mental variables and find that HPID is set to 1004 rather than 1003. As a result,
it will look under variable storage for function call variables with a waiting status
for hpid 1004 rather than 1003. dispatchSingle will then find a similar variable to
that shown in Table 7.3. However, the function call variable for hprocess 1004 will
be a function call for the function makeForm. As a result, dispatchSingle will call
makeForm on line 72 of Listing 7.5.
makeForm creates a new hprocess on line 73. However, makeForm uses a different
code descriptor from the one that we have been using so far. Specifically, it creates a
new hprocess corresponding to edu.stanford.thesis.web, which is a code descrip-
tor for an hprocess designed to be run in response to HTTP requests to the web
hprocess wrapper CGI through the web server rather than local events. We assume
that this new hprocess has an hpid of 1005. When a worker makes a request to a
specially crafted URL on the web server running on the HPROC host, this hprocess
will be resumed and the output of the hprocess will be sent to the requesting worker.
Specifically, a request to the URL:
https://hproc.stanford.edu/www.cgi/1005/*
... will cause the web hprocess wrapper CGI to resume hprocess 1005. By the
asterisk, we mean that any URL beginning with the text before the asterisk will lead
to hprocess 1005 being run.
There are four additional ways in which web hprocesses differ from the standalone
hprocesses introduced so far.
1. A web hprocess includes the environmental variables described in Section 7.5.6,
but it additionally includes environmental variables specific to the web server,
such as the URL of the HTTP request that caused the hprocess to be resumed.
2. The web hprocess wrapper CGI (via the web server) will return the standard
output produced by the hprocess to the requesting worker.
3. Because every web hprocess is intended to produce standard output, they are
not intended to be used with the TurKit crash-and-rerun model.
4. Standalone hprocesses are resumed one-by-one by hprocron, while many web
hprocesses may be resumed at the same time by executions of the web hprocess
wrapper CGI (via the web server) in response to HTTP requests.
Otherwise, web and standalone hprocesses are quite similar, and are intended to inter-
act in natural ways. In particular, in this walkthrough, both the web and standalone
hprocesses are in the same script, walkthroughscript.py.
After creating the web hprocess with hpid 1005, makeForm sets up a default func-
tion for hprocess 1005 on line 74 of Listing 7.5. As noted in Section 7.5.7, default
functions are similar to regular function calls, except that they do not return a result
to the caller. (Default functions are a special case of the human processing send
operation from Section 6.4.) In this case, the default function for hprocess 1005 is
being set to the handleRequest function with the arguments being the two URLs
supplied to makeForm. As we will see later, this means that when a worker makes an
HTTP request for the URL associated with hprocess 1005, handleRequest will get
called with the two URLs as arguments.
The last thing makeForm does is return two URLs to its caller.
These URLs are both specially crafted to resume hprocess 1005, but one will lead to
the handleRequest function displaying a form to an end user, while the other will be
an endpoint for XML-RPC. (Both of these will be discussed in Section 7.5.11.) When
makeForm returns these URLs, it actually returns them to the dispatchSingle func-
tion of hprocess 1004, which will then return them via the variable storage to hprocess
1003, which was the original compareItems hprocess. Hprocess 1004 will then be set
to have a status of finished, because the hprocess has completed successfully and
used the dispatchSingle function.
7.5.11 Form Parts
Section 7.5.10 created a web hprocess as hpid 1005. However, we have not yet ex-
plained what happens when a worker requests a URL associated with hprocess 1005.
In fact, three separate responses can happen depending on three potential URLs.
These three URLs and associated functionality, which we call thrower, catcher, and
XML-RPC are the standard way of setting up a human driver (see Section 6.4) in
the HPROC system. We trace this functionality below.
Recall that if we are running in a web environment, dispatchDefaultFunction
on line 110 of Listing 7.6 will be called, rather than dispatchSingle. This dispatch
function will check the variable storage for variables of the func default type. It
will then find the variable set by makeForm in hprocess 1004, and call handleRequest
(the default function). This is true for any URL of the form
https://hproc.stanford.edu/www.cgi/1005/*
because any URL of that form will cause hprocess 1005 to resume.
However, handleRequest (line 62 of Listing 7.4) differentiates between three
URLs. In particular,
Thrower URL https://hproc.stanford.edu/www.cgi/1005/thrower
Catcher URL https://hproc.stanford.edu/www.cgi/1005/catcher
XML-RPC URL https://hproc.stanford.edu/www.cgi/1005/xmlrpc
handleRequest differentiates with a utility function called requestType, which sim-
ply looks at the environmental variables of the program to determine the URL
type. Depending on the URL of an HTTP request, handleRequest will call one
of doThrower, doCatcher, or doXmlRpc.
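The routing can be sketched as below. The use of the PATH_INFO environmental variable and the helper names are assumptions for illustration; the actual requestType utility may inspect different variables.

```python
# Sketch of requestType-style dispatch: a web hprocess inspects the request
# URL (here via the CGI PATH_INFO variable, an assumption) and picks one of
# the three handlers.
import os

def request_type(path_info):
    # e.g. "/1005/thrower" -> "thrower"
    return path_info.rstrip("/").rsplit("/", 1)[-1]

def handle_request(handlers, path_info=None):
    if path_info is None:
        path_info = os.environ.get("PATH_INFO", "")
    return handlers[request_type(path_info)]()
```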
The doThrower function (starting on line 10 of Listing 7.3) displays an HTML
form to the worker requesting the URL. The HTML form is parameterized by the two
URLs to compare, and allows the worker to choose between them using radio buttons.
The HTML form is also parameterized by its action. In particular, the action (the
URL which the form will submit to) is the URL for the catcher. The HTML form
includes some necessary JavaScript for compatibility with Mechanical Turk which we
have removed for clarity.
The doCatcher function (starting on line 31 of Listing 7.3) is responsible for
receiving the form submission from the thrower form. First, doCatcher saves the
choice made by the worker. The code on line 34 just parses the form submission by
the worker and then saves the choice made by the worker to variable storage. Note
that there is a special syntax (v[’variable’]) for conveniently accessing variables
persisted to variable storage. Second, once the worker’s choice is persisted, the worker
is redirected back to the Mechanical Turk. Because the standard output of web
hprocesses is directly sent by the web hprocess wrapper CGI (via the web server),
web hprocesses can send redirects and other HTTP headers. In our case, the catcher
sends a Location header to redirect.
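Because standard output is returned verbatim, such a redirect is just printed text. A minimal sketch, with an illustrative URL:

```python
# Sketch of a CGI redirect as the catcher might emit it: printing a
# Location header (followed by a blank line) to standard output redirects
# the requesting worker. The target URL here is illustrative only.
def redirect(url):
    # CGI headers end with a blank line; the body may be empty.
    return f"Location: {url}\r\n\r\n"

# print(redirect("https://www.mturk.com/"), end="")
```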
The doXmlRpc function (starting on line 55 of Listing 7.4) is responsible for com-
municating with the recruiter, discussed in Section 7.5.12. As discussed in Section
7.5.1, XML-RPC is an RPC format that can easily be set up via a CGI script. In our
case, if a program does an HTTP POST to
https://hproc.stanford.edu/www.cgi/1005/xmlrpc
it will execute a method on the object that begins on line 42 of Listing 7.4. The
doXmlRpc function on line 55 simply registers the object to respond to XML-RPC
requests. Because standard output is returned verbatim by the web hprocess wrapper
CGI (via the web server), a web hprocess can just as easily form an RPC end point
as return output to a user.
7.5.12 Form Recruiting
Section 7.5.11 described what happens if each of three URLs associated with the web
hprocess 1005 is requested. However, as it stands, there is not yet any reason for the
URLs to be requested. We now describe how those URLs are advertised and reached.
Recall that hprocess 1003, running compareItems, was previously stuck suspend-
ing at the end of Section 7.5.9 because the result of makeForm was not yet available.
However, in Section 7.5.10, makeForm completed and returned its result via vari-
able storage. As a result, compareItems can now proceed. From lines 103–106,
compareItems does the same thing as with makeForm, creating a new hprocess, call-
ing a cross-hprocess function, and then throwing an exception when the result of the
function is not available. In this case, the cross-hprocess function call is to fillForm,
which takes the XML-RPC URL that was returned by the makeForm call as an argu-
ment.
We assume that the new hprocess which is created to satisfy the fillForm call
has hpid 1006. Once the compareItems hprocess suspends, the new hprocess that
was created will eventually be polled and then resumed by hprocron. This hprocess
will check its variable storage via dispatchSingle and run fillForm as requested.
The purpose of fillForm (starting on line 81 of Listing 7.5) is to interact with the
recruiter. On line 82, a connection object to the recruiter is created, called r iface.
The recruiter is based on a ticketing system, so on line 85, fillForm gets a unique
ticket identifier for the human driver that will be managed. On line 86, fillForm
sends a request, via the connection object, for the recruiter to manage the human
driver associated with the given XML-RPC URL. In this case, the XML-RPC URL is
the one we received from makeForm earlier. (Note that the manage request is similar
to the recruit operation from Section 6.4.) The human driver is also associated with
the unique ticket identifier.
The recruiter is a separate process accessible via the connection object. When
it receives the request to manage the human driver, it uses the XML-RPC URL of
the human driver to get more information. In particular, the recruiter can use any
of the methods in the CompareFormHandler object on line 42 of Listing 7.4. To get
the thrower URL, which will need to be advertised on Mechanical Turk, it can call
getThrower. To determine if workers have filled out the form, the recruiter can call
the handler's isDone (not to be confused with the recruiter's own isDone). Lastly, to get
the current results of the form, the recruiter can call getResults which will return
the results posted by workers from variable storage. (getResults is the HPROC
equivalent of the human processing get(driverid) operation from Section 6.4.) The
recruiter is then in charge of advertising the thrower URL on Mechanical Turk and
checking with isDone to determine if the advertising was successful.
Meanwhile, after requesting that the recruiter manage the human driver, the
fillForm hprocess checks if the recruiter says that the human driver has completed
(i.e., workers have filled out the form). If not, the hprocess throws an exception and
suspends, resuming at some later point, when polled by hprocron, to check again. Eventually,
the hprocess will resume and the recruiter will say that the human driver is complete.
Then fillForm requests the results from the recruiter and marks the ticket complete.
(These are the results of the form associated with the human driver, but could include
additional information from the recruiter.) Then, the fillForm function returns the
form results, leading dispatchSingle to return the form results.
Once the fillForm hprocess returns the results via variable storage, the
compareItems hprocess does the same, and the upload script sees the result remotely
in variable storage. Finally, the upload script prints the result of the worker’s choice
(line 15 of Listing 7.2) and there is various boilerplate to tear down a remote connec-
tion (line 17 of Listing 7.2).
7.6 HPROC Walkthrough Summary
The walkthrough in Section 7.5 illustrated most of the features of the HPROC sys-
tem. As we saw, HPROC is a self-contained system in which to run code. To program
with HPROC, one writes at least two programs, a remote program and a program
to run within the HPROC system. Programs that are run within HPROC are intro-
spected to create code descriptors that can be used by other code within the system.
Special hprocesses can then be created based on a code descriptor. These hprocesses
can emulate TurKit crash-and-rerun style by using the poller in hprocron, but can
also serve as human drivers that handle managing web forms and even the web forms
themselves. Hprocesses have their own variable storage, special environmental vari-
ables, and event system. Hprocesses (as well as external programs) can also make
cross-hprocess function calls to accomplish tasks and split up work. Lastly, we saw
how hprocesses can delegate the advertising of human drivers to recruiters and then
retrieve human results. Overall, HPROC is a powerful system that combines TurKit
style programming, web hprocesses, human drivers, and the recruiter concept in a
natural way.
7.7 Case Study
For the rest of this chapter, we consider a case study of using human comparators
to sort blurry photographs. In this section, we describe the organization of our
case study. Specifically, we describe our dataset to be sorted (Section 7.7.1), our
modifications to the dataset to make it suitable for sorting (Section 7.7.2), and the
human interfaces involved (Section 7.7.3).
In Sections 7.8 and 7.9, we introduce two example sorting algorithms based on
Merge-Sort and Quick-Sort, respectively. Merge-Sort and Quick-Sort are
not necessarily perfect for this problem. In practice, one might use a tournament or an
algorithm which takes into account human variability. However, Merge-Sort and
Quick-Sort are simple algorithms that we expect the reader to be familiar with,
and they illustrate well the challenges of evaluating such algorithms. In particular,
Merge-Sort and Quick-Sort show how important interfaces (such as those in
Section 7.7.3) are to human algorithms, and demonstrate fairly wide differences in
cost, time, and accuracy—the variables that we aim to measure.
7.7.1 Stanford University Shoe Dataset 2010
If researchers are to compare the performance of their human algorithms, they need
standard datasets. We created a dataset of over 100 photographs of single shoes taken
in the same lighting conditions at the same distance using a full-frame digital camera.
(See Figure 7.2 for examples.)
Why photographs of shoes? The dataset naturally has two types of orderings:
objective and subjective orderings. We can create an objective ordering by modifying
the pristine original photographs in some known way. (This is what we do in this case
Figure 7.2: Shoes from the Stanford University Shoe Dataset 2010 blurred to varying degrees.
study, see Section 7.7.2 below.) We can also create subjective orderings by asking
workers which shoes they like best, or which seem dirtiest. These types of orderings
may only be partial, may differ across workers, and are generally more complex to
handle. In general, human sorting has several variables: disinterest by particular
workers, worker quality for a particular task, and how consistently people agree on
the ordering of the dataset, to name a few. We believe that this dataset will be quite
effective at disentangling these various variables in the future.
7.7.2 Sorting Task
For this chapter, we created a sorting task in the following way. We started with
72 photos of different shoes from the Stanford University Shoe Dataset. We then
randomly ordered them and applied a Gaussian blur to each photo in ascending
order using the ImageMagick convert binary [4]. The first photo had no blur at all,
the next had a blur of σ = 0.5 and radius 1.5, the next had a blur of σ = 1.0 and
radius 3, up to the last photo with σ = 35.5 and radius 106.5. (The σ value is the
(a) Binary Comparison (b) Ranked Comparison
Figure 7.3: Two different human comparison interfaces.
primary determinant of how blurry the image becomes.) We resized the photos to
351x234 for presentation on the web.
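The blur schedule described above can be sketched as follows: photo i receives sigma = 0.5i with radius 3×sigma, matching the stated progression (photo 0 unblurred, the last photo at sigma 35.5 and radius 106.5). The exact convert invocation used in the study is not shown in this section, so the command construction below is an assumption; ImageMagick's -blur argument takes the form "{radius}x{sigma}".

```python
# Sketch of the Gaussian blur schedule applied to the randomly ordered
# photos. File names are placeholders.
def blur_command(i, src, dst):
    sigma = 0.5 * i
    if sigma == 0:
        return ["convert", src, dst]          # first photo: no blur at all
    radius = 3 * sigma
    return ["convert", src, "-blur", f"{radius}x{sigma}", dst]
```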
For each sorting task, we randomly select n photos from our set of blurred photos.
We then ask workers on Mechanical Turk to sort them by blurriness using our system
and the algorithms described below. We evaluate the quality of our algorithms based
on Kendall’s τ rank correlation between the output ordering and the true ordering.
In other words, if the workers return the results in the order (σ = 0.5, σ = 3.5, σ =
8.0, σ = 12.0) then the algorithm worked well, whereas if we get (σ = 12.0, σ =
3.5, σ = 8.0, σ = 0.5) then it did not. (In contrast to, for example, sorting pictures of
numbers, we hope that it will not be immediately obvious to the workers that we are
studying them.)
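The Kendall's τ evaluation described above can be sketched in a few lines of plain Python. This is a minimal illustration under a no-ties assumption; the function name kendall_tau is ours and is not part of HPROC:

```python
def kendall_tau(observed, true_order):
    """Kendall's tau rank correlation between an observed ordering and
    the true ordering: (concordant pairs - discordant pairs) divided by
    the total number of pairs, assuming no ties."""
    rank = {item: i for i, item in enumerate(true_order)}
    n = len(observed)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # a pair is concordant if it appears in true-order position
            if rank[observed[i]] < rank[observed[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2.0)

true = [0.5, 3.5, 8.0, 12.0]
print(kendall_tau([0.5, 3.5, 8.0, 12.0], true))   # perfect ordering: 1.0
print(kendall_tau([12.0, 3.5, 8.0, 0.5], true))   # shuffled ordering: -2/3
```

On the two example orderings from the text, this yields τ = 1.0 for the correct ordering and τ ≈ −0.67 for the badly shuffled one.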
7.7.3 Comparison Interfaces
We consider two different human interfaces for comparing photographs in this case
study. In our case, each of the comparison interfaces allows human workers to order
photos by blurriness. The first interface (Figure 7.3(a)), which we call the binary
comparison interface, asks the worker which of two photos is less blurry. The worker
uses a radio button to select the less blurry photo. The second interface (Figure
7.3(b)), which we call the ranked comparison interface, asks the worker to rank photos
from least blurry to most blurry. The worker drags and drops the photos until they
are in the correct order. Both interfaces show the photos vertically in sequence. There
is no limit to the number of photos which can be presented in the ranked comparison
interface, though in our evaluation, we only consider 4 and 8 photo rankings.
7.8 H-Merge-Sort
This section describes our H-Merge-Sort variant of Merge-Sort. We begin in
Section 7.8.1 by describing Merge-Sort. Then, we introduce some convenience
functions for use in our algorithms in Section 7.8.2. In Section 7.8.3, we give an
overview of our new H-Merge-Sort. In Section 7.8.4, we describe the functions we
use in our implementation of H-Merge-Sort. Finally, we walk through our HPROC
implementation of H-Merge-Sort in Section 7.8.5.
7.8.1 Classical Merge-Sort
The traditional Merge-Sort is a bottom-up divide-and-conquer approach to sorting.
Traditional Merge-Sort consists of two alternating functions, Merge-Sort and
Merge.
The Merge function takes two sorted lists, s1 and s2, and produces a single sorted
list s3 by merging the lists item by item. While there are still items in both s1 and
s2, Merge will compare the first item in both lists (i.e., s1[0] vs s2[0]) and append
the minimum item to the final sorted list s3 to be returned. Once either list is empty,
the remaining list is appended to s3, and finally s3 is returned by Merge.
The Merge-Sort function takes an unsorted list u0 and eventually returns a
sorted list. If the unsorted list is of length 1, the list is returned, because the list is
already sorted. If the unsorted list is of length greater than one, the list is split in
half, into two sublists, u1 and u2. Then, Merge-Sort is recursively called on each
of the two sublists u1 and u2, producing two sorted sublists, s1 and s2. Then, Merge
is called on the two sorted sublists, producing a single sorted list s3, which is then
returned.
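In conventional (non-human) code, these two functions can be sketched as follows. This is a generic textbook version for reference, not HPROC code:

```python
def merge(s1, s2):
    """Merge two sorted lists into a single sorted list, item by item."""
    s3 = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        # compare the heads of both lists and append the minimum to s3
        if s1[i] <= s2[j]:
            s3.append(s1[i]); i += 1
        else:
            s3.append(s2[j]); j += 1
    # once either list is exhausted, append the remainder of the other
    return s3 + s1[i:] + s2[j:]

def merge_sort(u0):
    """Recursively split the unsorted list in half, sort, and merge."""
    if len(u0) <= 1:
        return u0
    half = len(u0) // 2
    return merge(merge_sort(u0[:half]), merge_sort(u0[half:]))

print(merge_sort([8, 6, 4, 2, 5, 7, 3, 1]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The human variants below replace the `s1[i] <= s2[j]` comparison with a comparison performed by a worker.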
7.8.2 Convenience Functions
There are three functions that we use in our H-Merge-Sort and H-Quick-Sort
but whose implementations we do not show: getBinaryOrdering,
getRankOrdering, and order2pairs.
getBinaryOrdering takes two items and returns a sorted list of those two items
sorted by a human through a binary comparison form. Likewise, getRankOrdering
takes a list of items and returns a sorted list of those items sorted by a human through
a ranked comparison form. getBinaryOrdering and getRankOrdering are wrappers
around functionality we already saw in our Section 7.5 walkthrough. Specifically,
getBinaryOrdering and getRankOrdering are mostly the same as compareItems
from Listing 7.5. All three functions handle posting tasks to Mechanical Turk, and
all three post photo comparison tasks. However, there are two differences between
these two functions and compareItems. First, the two functions use the two different
interfaces of Section 7.7.3: getBinaryOrdering uses the binary comparison interface,
while getRankOrdering uses the ranked comparison interface. Second, the returned
results of getBinaryOrdering and getRankOrdering differ from that of compareItems.
compareItems returned which of the two photos was the lesser, e.g., photo1;
getBinaryOrdering and getRankOrdering both return a sorted list of the items.
order2pairs converts an ordered list into a dictionary of binary comparisons. For
example, if getRankOrdering returned the ordering (url3, url2, url1), order2pairs
on the returned ordering would yield:
(url1, url2) = 'l>r'    (url2, url3) = 'l>r'
(url1, url3) = 'l>r'    (url3, url1) = 'l<r'
(url2, url1) = 'l<r'    (url3, url2) = 'l<r'
The dictionary stores for each URL pair a string indicating whether the left side l is
less than or greater than the right side r. order2pairs lets us determine the result
of a single binary comparison based on an ordering that contains that comparison.
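A plain-Python sketch of order2pairs consistent with the example above; this is our reconstruction, not the dissertation's actual implementation:

```python
def order2pairs(ordering):
    """Expand a sorted ordering (least item first) into a dictionary
    mapping each ordered pair of items to 'l<r' or 'l>r'."""
    pairs = {}
    for i, a in enumerate(ordering):
        for b in ordering[i + 1:]:
            pairs[(a, b)] = 'l<r'  # a precedes b, so a is the lesser
            pairs[(b, a)] = 'l>r'
    return pairs

# the example from the text: getRankOrdering returned (url3, url2, url1)
print(order2pairs(('url3', 'url2', 'url1'))[('url1', 'url2')])  # l>r
```

For an ordering of k items, this produces k(k − 1) dictionary entries, one per ordered pair.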
7.8.3 H-Merge-Sort Overview
Transitioning from the classical Merge-Sort of Section 7.8.1 to a human version
where the binary comparisons are based on humans is fairly easy. Anywhere we would
do a binary comparison in regular Merge-Sort, we can just request that a worker do
that binary comparison using getBinaryOrdering (from Section 7.8.2). Otherwise,
the original Merge-Sort does not need to be changed to use humans.
However, things get more complex once it is possible to rank up to r items at a time
using humans with a ranked comparison form, a case for which Merge-Sort was
not designed. We make two changes to Merge-Sort to handle ranked comparisons.
The first change involves the base case. Recall that in Section 7.8.1, Merge-
Sort would recurse until reaching a singleton list. By contrast, our ranked version
recurses until it reaches a list of length less than or equal to r, the maximum number
of photos to be compared at a time. Recursing any deeper than this would simply
lead to unnecessary human comparisons, which are expensive.
The second change involves the Merge function. Recall that in Section 7.8.1,
Merge would conduct a binary comparison of the two items at the heads of the
sorted lists s1 and s2, appending the minimum of those two items to the final sorted
list s3. With a ranked comparison, we should be able to append more than one item
at a time to the list s3. Our solution is to do a ranked comparison of the first f1
elements of list s1, and the first f2 elements of list s2. We choose these values as:
f1 = min(⌈r/2⌉, len(s1))        f2 = min(r − f1, len(s2))
Put another way, we choose roughly r/2 items from the front of each of s1 and s2,
and compare the combined list. For example, if we had the lists
s1 = [1, 3, 5, 7, 9] s2 = [2, 4, 6, 8, 10]
and r = 8, we would take the first four items from each list, e.g., f1 = 4 and f2 = 4,
leading us to do a ranked comparison of items:
[1, 3, 5, 7, 2, 4, 6, 8]
The Merge strategy described with ranked comparisons can be shown to always be
able to append at least r/2 items to s3 (assuming that there are at least r/2 items
left in s1 and s2).
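The f1/f2 calculation above can be written as a small helper; the name front_sizes is ours, introduced purely for illustration:

```python
import math

def front_sizes(s1, s2, r):
    """How many items to take from the fronts of sorted lists s1 and s2
    for a single ranked comparison of at most r items:
    f1 = min(ceil(r/2), len(s1)), f2 = min(r - f1, len(s2))."""
    f1 = min(int(math.ceil(r / 2.0)), len(s1))
    f2 = min(r - f1, len(s2))
    return f1, f2

s1, s2 = [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]
f1, f2 = front_sizes(s1, s2, r=8)
print(s1[:f1] + s2[:f2])  # [1, 3, 5, 7, 2, 4, 6, 8]
```

Note that when one list is short, the formula shifts the unused budget to the other list, so a full r items are still ranked whenever enough items remain.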
7.8.4 H-Merge-Sort Functions
Our H-Merge-Sort consists of three functions, addcache (Listing 7.10), merge
(Listing 7.11), and mergesort (Listing 7.12), plus the convenience functions described
in Section 7.8.2. (Both our H-Merge-Sort and H-Quick-Sort are presumed to
have the fully qualified name edu.stanford.sort.)
We implement Merge with ranked comparisons (as discussed in Section 7.8.3)
using a cache. The idea is that the cache stores all pairwise orderings that have been
discovered to date. Merge functions like a regular binary comparison Merge, using
binary comparisons from the cache. The cache itself may be added to using binary or
ranked comparisons, using the addcache function below. As we will see, addcache
gets called when a needed comparison is not available, and will add at least that
comparison to the cache (though possibly more comparisons as well).
The addcache function takes four arguments, a left side (i.e., s1), a right side
(i.e., s2), a configuration dictionary conf for specifying options, and a dictionary
cache to which to add binary comparisons (see below). The addcache function
can be configured using conf to either use binary or ranked comparisons. If con-
figured for binary comparisons, addcache will take the first items from lists s1 and
s2, i.e., s1[0] and s2[0], request a getBinaryOrdering of s1[0] ∪ s2[0], and then place
the order2pairs of the ordering in cache. If configured for ranked comparisons,
addcache will take the first items f1 and f2 (calculated above) from lists s1 and
s2, request a getRankOrdering of s1[0..f1 − 1] ∪ s2[0..f2 − 1], and then place the
order2pairs of the ordering in cache. (In the special case when s1[0..f1 − 1] and
s2[0..f2 − 1] are both singleton lists, the ranked comparison will be downgraded to a
binary comparison.)

1  def addcache(left, right, conf, cache):
2      foundordering = []
3
4      maxranked = 2
5
6      if conf.has_key('maxranked'):
7          maxranked = int(conf['maxranked'])
8
9      if ((maxranked == 2) or ((len(left) == 1) and (len(right) == 1))):
10         findOrderProc = newHprocess('edu.stanford.sort')
11         lazyFindOrder = findOrderProc.fn.getBinaryOrdering(
12             left[0], right[0])
13         foundordering = lazyFindOrder.get()
14
15     else:
16         leftamount = maxranked / 2 + (
17             maxranked / 2 - min(maxranked / 2, len(right)))
18         rightamount = maxranked / 2 + (
19             maxranked / 2 - min(maxranked / 2, len(left)))
20
21         findOrderProc = newHprocess('edu.stanford.sort')
22         lazyFindOrder = findOrderProc.fn.getRankOrdering(
23             left[:leftamount] + right[:rightamount])
24         foundordering = lazyFindOrder.get()
25
26     comps = order2pairs(foundordering)
27     cache.update(comps)

Listing 7.10: addcache function (with FQN "edu.stanford.sort").
The merge function takes four arguments, a left side (i.e., s1), a right side (i.e.,
s2), a configuration dictionary conf for specifying options, and a dictionary cache
containing binary comparisons between items (see below). Our merge function is
more or less the same as the Merge described in Section 7.8.1, with two exceptions:
1. Our merge is tail-call recursive, for system-specific reasons not discussed here.
In other words, rather than a for loop, our merge calls itself with fewer items
in either left or right.
2. Our merge does not compare two items directly. Instead, merge checks whether
the needed comparison of the items s1[0] and s2[0] is available in the cache.
If the comparison is not available, merge calls addcache to add one or more
comparisons to the cache, including at least the currently necessary comparison.
(conf is passed to addcache, but not otherwise used by merge.) Either way,
the comparison is now available, and merge continues the merging process,
appending the minimum item to s3.
In other words, our merge pretends that it is a tail recursive version of our binary
comparison Merge, but keeps a cache so that it can make use of multiple binary
comparisons implicitly produced by a single ranked comparison. Our merge eventually
produces a sorted list s3 of the left (s1) and right (s2), as with Merge.
The mergesort function takes two arguments, an unsorted list l and a configura-
tion dictionary conf for specifying options. Our mergesort is more or less the same
as the Merge-Sort described in Section 7.8.1, with two exceptions:
1. If the configuration variable conf is set to allow ranked comparisons, the base
case will be changed to ranked comparisons of size r, rather than singleton lists.
2. All recursive calls to mergesort are requested lazily. That is, the call is made
to mergesort with separate hprocesses on the left half of the unsorted items,
and on the right half of the unsorted items. Only then is the result requested
of either, using get, potentially causing a crash.
1  def merge(left, right, conf, cache=None):
2      if cache is None:
3          cache = {}
4
5      result = []
6
7      if ((len(left) == 0) or (len(right) == 0)):
8          result.extend(left)
9          result.extend(right)
10         return result
11
12     if not cache.has_key((left[0], right[0])):
13         addcache(left, right, conf, cache)
14
15     rightmerge = newHprocess('edu.stanford.sort')
16     lazymerge = None
17
18     if cache[(left[0], right[0])] == 'l<r':
19         result.append(left[0])
20         lazymerge = rightmerge.fn.merge(
21             left[1:], right, conf, cache)
22     else:
23         result.append(right[0])
24         lazymerge = rightmerge.fn.merge(
25             left, right[1:], conf, cache)
26
27     next = lazymerge.get()
28
29     return result + next

Listing 7.11: merge function (with FQN "edu.stanford.sort").
1  def mergesort(l, conf):
2      if len(l) < 2:
3          return l
4
5      if conf.has_key('maxranked'):
6          maxranked = int(conf['maxranked'])
7
8          if (len(l) <= maxranked) and (maxranked > 2) and (len(l) > 2):
9              findOrderProc = newHprocess('edu.stanford.sort')
10             lazyFindOrder = findOrderProc.fn.getRankOrdering(l)
11             foundordering = lazyFindOrder.get()
12
13             return foundordering
14
15     middle = len(l) / 2
16
17     lazyleft = newHprocess('edu.stanford.sort').fn.mergesort(
18         l[:middle], conf)
19     lazyright = newHprocess('edu.stanford.sort').fn.mergesort(
20         l[middle:], conf)
21
22     left = lazyleft.get()
23     right = lazyright.get()
24
25     lazymerge = newHprocess('edu.stanford.sort').fn.merge(
26         left, right, conf)
27     final = lazymerge.get()
28
29     return final

Listing 7.12: mergesort function (with FQN "edu.stanford.sort").
mergesort eventually produces a sorted list based on merge and addcache.
7.8.5 H-Merge-Sort Walkthrough
We now demonstrate a partial walkthrough of our H-Merge-Sort. We assume that
we are sorting eight photographs that we will number 1–8. We will assume that the
true sort order is
[1, 2, 3, 4, 5, 6, 7, 8]
and that the initial ordering is
[8, 6, 4, 2, 5, 7, 3, 1]
As discussed in Section 7.5, we need an upload script in order to run a program
from a remote client. In our case, we do not show the uploader, but presume that
it is similar to Listing 7.2. However, rather than calling compareItems on an hpro-
cess associated with edu.stanford.thesis.sa, in our case we call mergesort on an
hprocess associated with edu.stanford.sort. Specifically, the call to mergesort is
mergesort([8,6,4,2,5,7,3,1], {'maxranked':4})
The second parameter is the conf configuration parameter, which is a dictionary
containing configuration information. In this case, conf indicates that at most 4
items can be ranked at the same time using the ranked comparison interface. At this
point, we have a single hprocess within the HPROC system periodically running a
dispatchSingle (not shown; see Section 7.5.7) to mergesort.
When this first single hprocess next resumes, the hprocess runs the code shown in
Listing 7.12. The singleton list base case on lines 2–3 does not apply because there
are eight items in the list. Because we have specified maxranked in conf, a second
base case is checked on lines 5–13. Specifically, we check whether the number of
items in the list is at most the maximum number that can be ranked. As it turns
out, there are eight items in the list, and only four can be ranked at a time, so this
base case is skipped as well.
Next, the list is split in half (line 15) and two recursive mergesort calls are made
on the left and right sides, with two separate new hprocesses (lines 17–20). These
recursive calls return lazy results, which are not requested until lines 22–23. This
means that our initial cross-hprocess function call (say, hpid 1001), will produce
two new hprocesses (hpids 1002 and 1003) corresponding to mergesort on the lists
[8, 6, 4, 2] and [5, 7, 3, 1]. Then, when the first lazy result object has its get method
called, the first mergesort hprocess (hpid 1001) crashes.
This allows the other two mergesort hprocesses to run. Both now have less than
or equal to four items, so the base case on lines 5–13 now applies. This means that for
both, a new hprocess will be created, to get a getRankOrdering ranked comparison
for items [8, 6, 4, 2] for hpid 1002, and for items [5, 7, 3, 1] for hpid 1003. Both of these
getRankOrdering calls will eventually create comparison forms via web hprocesses,
in a style similar to our original walkthrough.
All three hprocesses described thus far (hpids 1001, 1002, and 1003) will now
continuously crash-and-rerun waiting for new data. Eventually, workers will fill out
the web forms created by the calls to getRankOrdering, and the hprocesses with hpids
1002 and 1003 will return two ranked orderings, [2, 4, 6, 8] and [1, 3, 5, 7], assuming
the workers compute the correct orderings. Our original hprocess 1001 then creates a
new hprocess to merge these lists (lines 25–26), which are returned to it as left and
right (lines 22–23). The result is again not available on line 27, so hprocess 1001
crashes again waiting on the result of a call to
merge([2,4,6,8], [1,3,5,7], {'maxranked':4})
to hprocess 1010 (a new hprocess). (Note that we choose a later hpid here, because
there have been a number of hprocesses created by getRankOrdering in hpids 1002
and 1003.)
When hprocess 1010 is resumed, the hprocess calls the function merge in Listing
7.11. Neither left nor right is empty, so the case on line 7 is skipped. The cache
is then checked for the initial comparison of the head items of the left and right
lists. The cache is thus checked for the tuple (2, 1), which is not in fact in the cache,
because the cache is empty. As a result, addcache is called. (We do not create a
new hprocess for addcache because human parallelism will not be affected.)
When addcache (Listing 7.10) is called, the arguments are the full left and
right lists. The addcache function checks whether the maximum number of items
that are rankable at a time is two on line 9. There could be only two items rankable
either because the value of conf[’maxranked’] is two, or because there is only one
item each remaining in left and right. In our case, there are more than two items
remaining, and maxranked is four, so addcache skips to the case on line 15. On lines
16–19, addcache makes the f1 and f2 calculation described in Section 7.8.3. In our
case, f1 = 2 and f2 = 2, so addcache creates a new hprocess to call getRankOrdering
on the first two items of both lists, [2, 4] and [1, 3]. Hprocess 1010 then crashes on line
24 periodically until workers return the ordering. Supposing they eventually return
the correct ordering, foundordering is now [1, 2, 3, 4] which order2pairs (line 26)
turns into a dictionary of pairs as described in Section 7.8.2. Finally, the cache is
updated with these pairs on line 27, and addcache returns.
Now the comparison on line 18 can be computed, because it is in the cache. The
comparison is (2, 1) and the cache says that the answer is l>r. As a result, 1 is
appended to the result, and a new hprocess is created to do the rest of the merge,
without 1. In other words,
merge([2,4,6,8],[1,3,5,7], {'maxranked':4})
will return the result
[1, merge([2,4,6,8],[3,5,7], {'maxranked':4}, cache)]
When the new hprocess runs merge, the cache will also include the comparison (2, 3).
In fact, the cache will include all comparisons up until the lists are [4, 6, 8] and [5, 7].
Then, a new hprocess will be created to getRankOrdering of [4, 6, 5, 7]. The
ordering of these values allows merge to progress to [8] and [7], which are then finally
compared using an addcache which in this case does a binary ordering (line 9 of
Listing 7.10). Finally, the merge with hpid 1010 has merged all items, which are then
returned to hpid 1001, which returns the fully sorted list.
Two things should be noted from our walkthrough. First, hprocesses were created,
and recursive mergesorts were called, until we hit base cases (lists small enough to
rank at once) or necessary merges. Second, the hprocess conducting
the merge was effectively “blocked” while waiting for results. Before the result of
the ranked comparison [2, 4, 1, 3] was available, merge did not know it should rank
[6, 8, 5, 7]. The recursive mergesorts mean that there is a reasonable amount of
human parallelism—many tasks will be posted in parallel to the Mechanical Turk.
However, the dependence of comparisons on previous comparisons in merge puts a
limit on this human parallelism.
7.9 H-Quick-Sort
This section describes our H-Quick-Sort variant of Quick-Sort. We begin in
Section 7.9.1 by describing Quick-Sort. In Section 7.9.2, we give an overview of
our new H-Quick-Sort. In Section 7.9.3, we describe the functions we use in our
implementation of H-Quick-Sort. (We presume that the convenience functions
from Section 7.8.2 continue to be available.) Finally, we walk through our HPROC
implementation of H-Quick-Sort in Section 7.9.4.
7.9.1 Classical Quick-Sort
The traditional Quick-Sort is a top-down divide-and-conquer approach to sorting.
Traditional Quick-Sort consists of two alternating functions, Quick-Sort and
Partition.
The Partition function takes an unsorted list u0, and an item ip called the pivot.
The Partition function compares every item in u0 to ip, producing three lists:
ul is the list of items less than the pivot.
ue is the list of items equal to the pivot.
ug is the list of items greater than the pivot.
The return value of Partition is these three lists.
The Quick-Sort function takes an unsorted list u0. If the unsorted list is of
length 0, the list is returned, because the list is already sorted. If the unsorted list is
of length greater than zero, a pivot is chosen. The pivot is a random item within the
list u0. Then, the partition function is called with the pivot and the unsorted list u0,
producing three lists (ul, ue, ug). Finally, Quick-Sort returns the concatenation of
Quick-Sort applied to ul, ue, and Quick-Sort applied to ug.
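A generic (non-human) version of the two functions, again a textbook sketch for reference rather than HPROC code (the name partition_3way is ours):

```python
import random

def partition_3way(u0, pivot):
    """Compare every item in u0 to the pivot, producing the lists of
    items less than (ul), equal to (ue), and greater than (ug) it."""
    ul = [x for x in u0 if x < pivot]
    ue = [x for x in u0 if x == pivot]
    ug = [x for x in u0 if x > pivot]
    return ul, ue, ug

def quick_sort(u0):
    """Pick a random pivot, partition, and recurse on both sides."""
    if len(u0) == 0:
        return []
    pivot = random.choice(u0)
    ul, ue, ug = partition_3way(u0, pivot)
    return quick_sort(ul) + ue + quick_sort(ug)

print(quick_sort([8, 6, 4, 2, 5, 7, 3, 1]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The human variant below replaces the `x < pivot` comparisons in Partition with worker comparisons, all of which can be requested in parallel.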
7.9.2 H-Quick-Sort Overview
Similarly to H-Merge-Sort (Section 7.8.3), transitioning from the classical Quick-
Sort of Section 7.9.1 to one based on human binary comparisons is fairly easy.
Anywhere we would usually do a binary comparison, we instead do a human binary
comparison using getBinaryOrdering. In fact, this conversion is quite natural for
Quick-Sort. When H-Merge-Sort merges, the next items to be compared (the
front items in the lists s1 and s2) always depend on the results of the last comparison.
However, in H-Quick-Sort, in the Partition phase, all items in u0 are compared
to the pivot without dependence on one another. This non-dependence means that
all comparisons for a given Partition can be done at the same time.
We want our H-Quick-Sort to also be able to take advantage of ranked comparisons.
However, for the Partition phase, binary comparisons are already close to optimal,
because all of the comparisons against the pivot can be posted at the same time.
Instead, we modified Quick-Sort for ranked comparisons in our H-Quick-Sort by
using a ranked comparison to select the pivot. The idea is that the choice of pivot can
make a big difference in how effectively the list u0 is split, and we want a pivot which
is as close to the median as possible. Therefore, we request a ranked comparison of
five random items in u0 using getRankOrdering, choosing the median of five as the
pivot.
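The median-of-five pivot choice can be sketched like this, with an ordinary sorted standing in for the human ranked comparison; the function name and the rank_fn parameter are illustrative, not HPROC's API:

```python
import random

def median_of_five_pivot(u0, rank_fn=sorted):
    """Rank five random items from u0 (here with sorted; in HPROC, with
    a human ranked comparison) and return their median as the pivot."""
    sample = random.sample(u0, min(5, len(u0)))
    ordering = rank_fn(sample)
    return ordering[len(ordering) // 2]

print(median_of_five_pivot([8, 6, 4, 2, 5]))  # all five ranked: median is 5
```

Spending one ranked comparison here buys a pivot near the median, which keeps the two partitions roughly balanced.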
7.9.3 H-Quick-Sort Functions
Our H-Quick-Sort consists of two functions, partition (Listing 7.13) and quicksort
(Listing 7.14), plus the convenience functions described in Section 7.8.2.
The partition function takes two arguments, an original unsorted list l and a
pivot. Then, partition will call getBinaryOrdering some number of times (in a
tail-recursive manner), comparing the pivot to all items in the unordered list, and
will eventually return the items found to be less than, equal to, and greater than
the pivot.

1  def partition(l, pivot):
2      if len(l) == 0:
3          return ([], [pivot], [])
4
5      subpartcomp = newHprocess('edu.stanford.sort')
6      subpartfn = subpartcomp.fn.partition(l[1:], pivot)
7
8      head = l[0]
9
10     findOrderProc = newHprocess('edu.stanford.sort')
11     lazyFindOrder = findOrderProc.fn.getBinaryOrdering(head, pivot)
12     foundordering = lazyFindOrder.get()
13     comps = order2pairs(foundordering)
14
15     newl, newe, newg = subpartfn.get()
16
17     if comps[(head, pivot)] == 'l<r':
18         return (newl + [head], newe, newg)
19     else:
20         return (newl, newe, newg + [head])

Listing 7.13: partition function (with FQN "edu.stanford.sort").
The quicksort function takes two arguments, an unordered list and a config-
uration dictionary conf for specifying options. Depending on the value of conf,
quicksort will either choose the first item in the unordered list as a pivot, or it will
do the getRankOrdering median pivot described in Section 7.9.2. Having chosen the
pivot, quicksort will partition using the pivot, and eventually call itself recursively
to order the lesser and greater parts. Lastly, quicksort returns a human sorted list.
7.9.4 H-Quick-Sort Walkthrough
We now demonstrate a partial walkthrough of our H-Quick-Sort. We make the
same assumptions as our H-Merge-Sort walkthrough in Section 7.8.5. Specifically,
we assume:
1. Eight photographs 1–8.
2. True sort order is [1, 2, 3, 4, 5, 6, 7, 8].
3. Initial ordering is [8, 6, 4, 2, 5, 7, 3, 1].
4. An upload script is used, but not shown.
We use the initial cross-hprocess function call:
quicksort([8,6,4,2,5,7,3,1], {'pivot':'fiverank'})
The second argument is the conf parameter, requesting that the median-based pivot
be chosen with a human ranked comparison. At this point, we have a single hprocess
(hpid 1001) within the HPROC system periodically running a dispatchSingle (not
shown) to quicksort.
The quicksort hprocess just described (hpid 1001) runs the code in Listing 7.14.
The given list l is not empty, so the first case (lines 2–3) is skipped. A pivot is chosen
on lines 5–6, but the pivot is overwritten in lines 8–16. Specifically, the condition on
line 8 checks conf and finds that we want a fiverank pivot.

1  def quicksort(l, conf):
2      if len(l) == 0:
3          return []
4      else:
5          pivot = l[0]
6          newlist = l[1:]
7
8          if conf.has_key('pivot') and \
9             conf['pivot'] == 'fiverank' and \
10            len(l) > 4:
11             findOrderProc = newHprocess('edu.stanford.sort')
12             lazyFindOrder = findOrderProc.fn.getRankOrdering(l[:5])
13             foundordering = lazyFindOrder.get()
14
15             pivot = foundordering[2]
16             newlist = [i for i in l if i != pivot]
17
18         partcomp = newHprocess('edu.stanford.sort')
19         partfn = partcomp.fn.partition(newlist, pivot)
20         lesser, equal, greater = partfn.get()
21
22         qsortlesser = newHprocess('edu.stanford.sort')
23         qsortgreater = newHprocess('edu.stanford.sort')
24
25         qsortlfn = qsortlesser.fn.quicksort(lesser, conf)
26         qsortgfn = qsortgreater.fn.quicksort(greater, conf)
27
28         return qsortlfn.get() + equal + qsortgfn.get()

Listing 7.14: quicksort function (with FQN "edu.stanford.sort").

Because the length of the current list l is at least five items (line 10), a ranked
comparison is requested using
getRankOrdering. Specifically, the first five items in the list l are passed to a new
hprocess running getRankOrdering on line 12. These first five items are [8, 6, 4, 2, 5].
Then, hprocess 1001 crashes, waiting for the getRankOrdering hprocess to return
a result, which eventually happens. Presuming that the result from the worker is
correct, foundordering is [2, 4, 5, 6, 8] (line 13) and the median is chosen (line 15),
which is 5 in our case. The median, 5, then replaces the chosen pivot, and 5 is removed
from the list of items to be partitioned later (see below).
Now that a pivot has been chosen, a new hprocess is created and called to partition
the list l with the pivot 5 on lines 18–20. The quicksort hprocess 1001 then crashes,
waiting for partition results. The newly created hprocess for the partition runs the
call
partition([8,6,4,2,7,3,1], 5)
in Listing 7.13. The early lines of partition (lines 5–6) create more hprocesses with
calls to partition. Specifically, they create the calls
partition([6,4,2,7,3,1], 5)
partition([4,2,7,3,1], 5)
partition([2,7,3,1], 5)
partition([7,3,1], 5)
partition([3,1], 5)
partition([1], 5)
partition([], 5)
The final new partition hprocess does not create a new hprocess with a partition
call because the function returns with the base case on lines 2–3.
Each of the hprocesses then proceeds to request a getBinaryOrdering between
the head of its individual passed list l and the pivot. Each hprocess then periodically
crashes on line 12 until the worker binary comparison is returned. When each worker
binary comparison is returned, each hprocess then waits for the sub-hprocess that it
created to return (line 15). Then, each hprocess returns its sub-hprocess’ lesser, equal,
and greater items, together with its own single comparison result. This continues up
the chain until partition returns to the original hprocess 1001 running quicksort.
The result of partition is
([4,2,3,1], [5], [8,6,7])
Note that because we chose a median pivot, the partition of the list is quite equal,
whereas if we had chosen the first item, 8, we would have had a very unequal partition.
Once hprocess 1001 has the partition results, it can run quicksort on the lesser
and greater items. This further recursion is shown on lines 22–26 of Listing 7.14.
Specifically, two new hprocesses are created, the first to quicksort the list [4, 2, 3, 1]
and the second to quicksort the list [8, 6, 7]. (These two quicksorts will function
the same as the previously discussed one, though they will not use the median pivot
because there are not enough items.) Those quicksorts will eventually return sorted
lists, which are then combined with the pivot on line 28, producing a final sorted list.
The most interesting thing to note about H-Quick-Sort is the high level of hu-
man parallelism. Every binary comparison within a partition is handled in parallel,
unlike merge in H-Merge-Sort.
7.10 Human Algorithm Evaluation
Before conducting an evaluation of H-Merge-Sort and H-Quick-Sort, we first
consider how to evaluate algorithms in general using the human processing model.
A human algorithm really consists of a strategy plus an interface. In our case, our
strategies are H-Merge-Sort and H-Quick-Sort. Our interfaces are the binary
comparison and ranked comparison interfaces discussed in Section 7.7.3. Once we
have paired a strategy with one or more appropriate interfaces, we can evaluate the
combination as a complete human algorithm.
There are five main variables in evaluating any human algorithm in the human
processing model: recruiter type, cost, time, accuracy, and algorithm-specific param-
eters. (We consider the recruiter type to be part of the evaluation parameters, rather
than part of the algorithm, though it could arguably be considered either.) In our
case, the recruiter type we consider is a single "basic" recruiter which offers a task on
the Mechanical Turk for one cent and re-posts the comparison every 20 minutes if it
has not been accepted. The recruiter also only hires workers who have maintained an
acceptance rate greater than 95%. The cost is the amount paid to workers in cents over
the runtime of the algorithm. The time is the length of time it took for the algorithm
to complete. The accuracy is algorithm-specific, though in the case of sort, we calcu-
late Kendall’s τ of the sort’s result versus the true ordering. The algorithm-specific
parameters vary by algorithm, though in the case of sort, we are interested in the
total number of items to be sorted (which in turn impacts cost, time, and accuracy).
There are three other aspects that are important for human algorithm evaluation:
time period, dataset, and variation. The first aspect is the time period during which
the evaluation is done. Evaluation conducted in the middle of the night might perform
quite differently from evaluation during the day, because we are dealing with humans.
The second aspect is the dataset used, because humans may be heavily impacted by
dataset choice. As a result, we use the same dataset across our evaluation. The third
aspect is the variation across multiple runs, due to the natural variation of workers
across multiple tasks. We reflect this variation by computing standard deviation
across a number of runs.
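Reflecting run-to-run variation amounts to simple aggregation. A minimal sketch, assuming each run is summarized as a dict with `cost`, `time`, and `accuracy` keys:

```python
from statistics import mean, stdev

def summarize_runs(runs):
    """Summarize repeated runs of a human algorithm. `runs` is a list
    of dicts with 'cost', 'time', and 'accuracy' keys; returns the mean
    and sample standard deviation for each measure, as in Table 7.4."""
    summary = {}
    for measure in ("cost", "time", "accuracy"):
        values = [r[measure] for r in runs]
        summary[measure] = (mean(values), stdev(values))
    return summary
```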
7.11 Case Study Evaluation
We now evaluate H-Merge-Sort and H-Quick-Sort using the evaluation criteria
described in Section 7.10. Specifically, we consider the strategies H-Merge-Sort
and H-Quick-Sort paired with the interfaces for binary comparison and ranked
comparison. We use the basic recruiter described in Section 7.10. We consider dif-
ferent settings of the number of items to be sorted and the impact of this setting
on the cost, time, and accuracy of the sorting algorithms under consideration. Our
evaluation uses the Stanford University Shoe Dataset 2010 (Section 7.7.1). We run
all comparisons between algorithms consecutively over the course of several days in
order to control for the time period and variation of workers. Specifically, we conduct
one run starting November 8th, 2010, comparing all settings of our H-Merge-Sort
174 CHAPTER 7. PROGRAMMING WITH HPROC
Figure 7.4: Comparison of total cost of three variations of sorting. [Panels: (a) H-Merge-Sort (Binary), (b) H-Merge-Sort (Rank 8), (c) H-Quick-Sort (Binary); x-axis: Items Sorted (2–10); y-axis: Cost in Cents (0–30).]
and H-Quick-Sort for n = (5, 10, 20) (where n is the number of items to be sorted).
We conduct the other run starting November 29th, 2010, comparing three settings of
H-Merge-Sort and H-Quick-Sort across n = (2, 3, 4, 5, 6, 7, 8, 9, 10).
We consider three questions. First, we ask how interfaces impact H-Merge-Sort
in Section 7.11.1. Second, we ask whether the median pivot option in H-Quick-Sort
is helpful in Section 7.11.2. Third, we compare H-Merge-Sort to H-Quick-Sort
in Section 7.11.3. Finally, we discuss other observations on the data as a whole in
Section 7.11.4.
7.11.1 H-Merge-Sort Interfaces
How is H-Merge-Sort impacted by the choice of interface between binary com-
parisons and ranked comparisons? A change in interface could impact cost, time, or
accuracy. We consider each below.
Figures 7.4(a) and 7.4(b) show boxplots of the cost of H-Merge-Sort across
Figure 7.5: Comparison of wall clock time for three variations of sorting. [Panels: (a) H-Merge-Sort (Binary), (b) H-Merge-Sort (Rank 8), (c) H-Quick-Sort (Binary); x-axis: Items Sorted (2–10); y-axis: Clock Time in Seconds (0–2000).]
ten runs with binary comparisons and ranked comparisons (eight way), respectively.
(Boxes represent the 25th and 75th percentile of the data from ten runs, with a
horizontal line at the median, “whiskers” are drawn to the maximum and minimum
points within 1.5 times the interquartile range, and circles represent outliers outside
of that range.) For example, sorting nine items with a binary comparison H-Merge-
Sort costs around 17 cents, while sorting nine items with a ranked comparison
H-Merge-Sort costs around 4 cents. We can see that H-Merge-Sort with eight
way ranked comparisons is substantially cheaper than binary comparison H-Merge-
Sort.
Figures 7.5(a) and 7.5(b) show boxplots of the time taken by H-Merge-Sort
across ten runs with binary comparisons and ranked comparisons (eight way), respec-
tively. For example, sorting five items with a binary comparison H-Merge-Sort
takes around 500 seconds. We can see that H-Merge-Sort with eight way ranked
comparisons takes substantially less time than binary comparison H-Merge-Sort.
However, in both cases, H-Merge-Sort has a big spike around 9 or 10 sorted items.
Figure 7.6: Comparison of accuracy for three variations of sorting. [Panels: (a) H-Merge-Sort (Binary), (b) H-Merge-Sort (Rank 8), (c) H-Quick-Sort (Binary); x-axis: Items Sorted (2–10); y-axis: Kendall’s Tau (−1.0 to 1.0).]
This is because H-Merge-Sort spends more time in the Merge phase as the num-
ber of items n increases, regardless of the number of comparisons possible via ranked
comparisons. (Nine items is the point at which we need to Merge when we have
eight-way ranked comparisons, and it is a point at which more work needs to be done
in Merge for binary comparison H-Merge-Sort.)
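One way to see why the Merge phase comes to dominate around n = 9 is to count human tasks under a rough cost model. The accounting below is illustrative, not HPROC's actual bookkeeping: it assumes one k-way ranked task sorts an initial run of up to k items, while merging falls back to worst-case binary comparisons.

```python
import math

def merge_sort_tasks(n, k):
    """Rough count of human tasks for an H-Merge-Sort-style strategy:
    one k-way ranked comparison sorts each initial run of up to k items,
    then runs are merged pairwise using binary comparisons (worst case
    a + b - 1 comparisons to merge runs of sizes a and b). Illustrative
    cost model only, for k >= 2."""
    runs = math.ceil(n / k)
    tasks = runs  # one ranked-comparison task per initial run
    sizes = [min(k, n - i * k) for i in range(runs)]
    while len(sizes) > 1:
        merged = []
        for i in range(0, len(sizes) - 1, 2):
            a, b = sizes[i], sizes[i + 1]
            tasks += a + b - 1       # worst-case binary merge comparisons
            merged.append(a + b)
        if len(sizes) % 2:
            merged.append(sizes[-1])
        sizes = merged
    return tasks
```

Under this model, eight items fit in a single eight-way ranked task, while nine items force a Merge step, so the task count (and hence time) jumps sharply at n = 9.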
Figures 7.6(a) and 7.6(b) show boxplots of the accuracy, in terms of Kendall’s
τ , across ten runs with binary comparisons and ranked comparisons (eight way),
respectively. Kendall’s τ ranges between −1 (if the ordering is the perfect reversal of
the correct ordering) and +1 (if the ordering is the correct ordering). For example,
all orderings of two items are either −1 (the wrong order) or +1 (the correct order)
in Figures 7.6(a) and 7.6(b). For comparison purposes, the author scores a Kendall’s
τ roughly in the range of 0.7–1.0 when manually sorting ten items. Ultimately, it
is difficult to discern patterns in Figures 7.6(a) and 7.6(b), though we will find later
in Section 7.11.4 that they have slightly different accuracies, but primarily different
variance.
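For reference, Kendall’s τ over a result with no ties can be computed directly from concordant and discordant pairs:

```python
from itertools import combinations

def kendall_tau(result, truth):
    """Kendall's tau between a sorted result and the true ordering,
    assuming no ties: +1 for the correct ordering, -1 for its perfect
    reversal."""
    rank = {item: i for i, item in enumerate(truth)}
    concordant = discordant = 0
    for x, y in combinations(result, 2):
        if rank[x] < rank[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs
```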
7.11.2 H-Quick-Sort Median Pivot
How is H-Quick-Sort impacted by the choice of a random versus median pivot based
on a ranked comparison? We found that there was relatively little average difference
in cost, time, or accuracy between the random versus median pivots. However, we
did find that choosing a median pivot based on a ranked comparison substantially
reduced the variance in accuracy. In general, H-Quick-Sort with the median pivot
had about half the variance in accuracy of H-Quick-Sort with a random pivot. The
full numbers are shown in Table 7.4, which is described in Section 7.11.4.
7.11.3 H-Merge-Sort versus H-Quick-Sort
How does H-Merge-Sort compare to H-Quick-Sort? Figures 7.4, 7.5 and 7.6
show this comparison from n = 2 to n = 10. We only compare H-Quick-Sort with
a median pivot to H-Merge-Sort, because Section 7.11.2 showed that the two pivot
choices were similar, with median pivots having lower variance.
Our H-Quick-Sort performs largely the same as binary comparison H-Merge-
Sort in terms of cost, accuracy, and time. (We did not incorporate ranked com-
parisons into the Partition phase of our H-Quick-Sort, which might have made
H-Quick-Sort more competitive with eight way ranked comparison H-Merge-
Sort.) However, there is one big difference, which is that H-Quick-Sort does not
show as substantial a jump around n = 9 for the time taken to sort. This jump
illustrates the lack of the “blocking” Merge behavior described earlier, and suggests
that H-Quick-Sort would perform better at larger n than H-Merge-Sort.
7.11.4 Complete Data Table
Table 7.4 shows our full data for the November 8th, 2010, run described at the
beginning of this section. Here, in addition to computing mean values across ten runs,
we also compute standard deviations. We can see, for example, that the standard
deviation of accuracy for ranked comparisons (≈ 0.4) tends to be substantially higher
than that for binary comparisons (≈ 0.2–0.3).
Strategy     | Interface      | Items (n) | Cost (¢)        | Time (s)              | Accuracy (τ)
H-Merge-Sort | Choice (2-way) |     5     | 6.6 (σ = 1.17)  |  395.272 (σ = 101.44) | 0.460 (σ = 0.30)
H-Merge-Sort | Choice (2-way) |    10     | 22.8 (σ = 1.40) | 1091.386 (σ = 281.94) | 0.649 (σ = 0.22)
H-Merge-Sort | Choice (2-way) |    20     | 62.6 (σ = 4.06) | 3009.043 (σ = 753.61) | 0.702 (σ = 0.11)
H-Merge-Sort | Ranked (4-way) |     5     | 3.5 (σ = 0.53)  |  242.979 (σ = 87.86)  | 0.520 (σ = 0.43)
H-Merge-Sort | Ranked (4-way) |    10     | 10.4 (σ = 0.97) |  630.557 (σ = 243.50) | 0.569 (σ = 0.41)
H-Merge-Sort | Ranked (4-way) |    20     | 29.2 (σ = 1.40) | 1873.899 (σ = 588.48) | 0.661 (σ = 0.17)
H-Merge-Sort | Ranked (8-way) |     5     | 1.0 (σ = 0.00)  |  125.163 (σ = 166.62) | 0.640 (σ = 0.44)
H-Merge-Sort | Ranked (8-way) |    10     | 4.0 (σ = 0.00)  |  351.805 (σ = 151.72) | 0.502 (σ = 0.43)
H-Merge-Sort | Ranked (8-way) |    20     | 11.6 (σ = 0.52) | 1197.461 (σ = 373.74) | 0.494 (σ = 0.35)
H-Quick-Sort | Rand. Pivot    |     5     | 7.4 (σ = 0.97)  |  320.833 (σ = 129.80) | 0.740 (σ = 0.25)
H-Quick-Sort | Rand. Pivot    |    10     | 24.1 (σ = 2.73) |  741.514 (σ = 208.99) | 0.698 (σ = 0.31)
H-Quick-Sort | Rand. Pivot    |    20     | 65.9 (σ = 6.67) | 1688.436 (σ = 261.63) | 0.714 (σ = 0.19)
H-Quick-Sort | Median Pivot   |     5     | 7.5 (σ = 0.97)  |  342.251 (σ = 103.06) | 0.760 (σ = 0.25)
H-Quick-Sort | Median Pivot   |    10     | 22.7 (σ = 1.25) |  701.911 (σ = 152.27) | 0.693 (σ = 0.18)
H-Quick-Sort | Median Pivot   |    20     | 66.3 (σ = 3.02) | 1709.612 (σ = 235.46) | 0.747 (σ = 0.08)
Table 7.4: Comparison of different sorting strategies and interfaces. All rows use the Basic@(1¢, 20, 95%) recruiter. Sorting dataset is the Stanford University Shoe Dataset 2010. All runs done during the week of November 8th, 2010. Results listed are the mean over ten runs, with standard deviation in parentheses.
Seeing the full time values, we can also evaluate whether the scale of the time
values makes sense for crash-and-rerun programming. Crash-and-rerun programming
only makes sense when interfacing with humans takes substantially longer than com-
puting time [51]. In our case, waiting for a human worker can take tens of minutes,
so crash-and-rerun seems like a reasonable design choice.
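The crash-and-rerun pattern itself can be sketched in a few lines. The `store` and `ask_human` callback below are hypothetical, not TurKit's or HPROC's actual interfaces.

```python
class NotDoneYet(Exception):
    """Raised when a human task has no result yet; the program 'crashes'
    and is simply re-run later, replaying memoized steps cheaply."""

class CrashAndRerun:
    """Sketch of TurKit-style crash-and-rerun memoization. `store` is a
    persistent dict of finished results; `ask_human` posts a task and
    returns its result, or None if no worker has answered yet."""

    def __init__(self, store, ask_human):
        self.store = store
        self.ask_human = ask_human

    def once(self, key, task):
        if key in self.store:        # replayed on re-run at no human cost
            return self.store[key]
        result = self.ask_human(task)
        if result is None:
            raise NotDoneYet(key)
        self.store[key] = result
        return result
```

On each re-run, completed steps replay from the store; only the first unfinished human step crashes the program. This is cheap precisely because re-executing the script takes seconds while a worker can take tens of minutes.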
7.12 Conclusion
This chapter introduced our HPROC system implementing most of the human pro-
cessing model of Chapter 6 and then used the system to conduct a short case study
of two sorting algorithms.
We first described the semantics of TurKit, the most closely related system to
HPROC. We then described the main subsystems of HPROC and how they help
implement the core HPROC concept—hprocesses. Because HPROC is such a com-
prehensive system, we illustrate system usage with a short walkthrough. Overall,
the HPROC system meaningfully extends the crash-and-rerun model introduced by
TurKit. HPROC allows a rapid prototyping style of human programming. For exam-
ple, we were able to convert our H-Merge-Sort into an H-Quick-Sort program
in a matter of hours.
However, if we were to improve upon HPROC, we would likely focus on debugging
tools to make programming easier. Crash-and-rerun leads, in effect, to many threads
of control, which can be difficult to debug. As a result, detailed information about
what state is stored in the database, as well as which hprocesses are currently active,
waiting, or finished, can be quite useful.
Having introduced HPROC, we used it to conduct our sorting case study. In our
case study, we used strategies based on Merge-Sort and Quick-Sort. We found
that the choice of human comparison interfaces had a large impact on cost, time and
accuracy in our evaluation. We also found that such interfaces could be used in unique
ways, for example, in the case of our median pivot selection for H-Quick-Sort which
tends to reduce the variance of sorting accuracy.
However, the Merge-Sort and Quick-Sort algorithms that are the basis for
H-Merge-Sort and H-Quick-Sort assume that comparisons are correct, and er-
rors can greatly reduce the quality of output. In the future, the best strategies will
take into account worker uncertainty, for example, by requesting multiple judgments.
However, one cannot request too many judgments, because each judgment adds to cost. We are
often forced to choose between trusting fewer workers to do more work (e.g., ranked
comparisons) or having more workers do less work (e.g., binary comparisons). Future
approaches to this problem might include asking workers to do a test task before
trusting their judgments, or giving workers more advanced interfaces after simpler
ones. (In general, the goal is to reduce the “noise” added by bad workers, both by
avoiding such bad workers, and by taking into account their existence in algorithm
design.) Lastly, our case study used a very basic recruiter because of current issues
with pricing in the Mechanical Turk [19]. Future recruiters will be better at recruiting
at particular times, changing prices to speed execution, and focusing on particular
workers that have provided quality, verified input in the past. Overall, the issues in
this area are quite varied, and most of the likely changes substantially impact cost,
time, accuracy, and variability.
HPROC is designed to make such exploration easy and systematic. Recruiters
and related concepts are designed both to simplify program design and, crucially, to
control for variability of the underlying marketplace, allowing for comparison between
different proposed algorithms. Thus, our contribution of the recruiter concept is a
key part of a methodology for evaluating human algorithms across systems and im-
plementations. We believe that both our evaluation methodology and the HPROC
system should be beneficial to any number of human algorithms, like sorting, cluster-
ing, and summarization. Shared systems and datasets like those introduced in this
chapter can only accelerate the exciting progress that is being rapidly made in the
growing field of human algorithms touched on in our case study.
Chapter 8
Worker Monitoring with Turkalytics
One challenge in the human processing model of Chapters 6 and 7 is the collection
of reliable data about the workers and the tasks they are performing. This data is
needed by our recruiter in particular, but is also needed by any system trying to make
human processing more effective: If a task is not being completed, is it because no
workers are seeing it? Is it because the task is currently being offered at too low a
price? How does the task completion time break down? Do workers spend more time
previewing tasks (see below) or doing them? Do they take long breaks? Which are
the more “reliable” workers?
This chapter addresses the problem of analytics for recruiting workers and study-
ing the performance of ongoing tasks. We describe our prototype system for gathering
analytics, illustrate its use, and give some initial findings on observable worker behav-
ior. We believe our tool for analytics, “Turkalytics,” is the first human computation
analytics tool to be embeddable across human computation systems (see Section 8.1
for the explicit definition of this and other terms). Turkalytics makes analytics orthog-
onal to overall system design and encourages data sharing. Turkalytics can be used
in stand-alone mode by anyone, without need for our full human-processing infras-
tructure (Figure 6.3). Turkalytics functions similarly to tools like Google Analytics
[5], but with a different set of tradeoffs (see Section 8.2.4).
We proceed as follows. Section 8.1 defines terms and describes the interaction and
data models underlying our system. We describe the implementation of Turkalytics
based on these models in Section 8.2. Section 8.3 describes how a requester uses
our system. Sections 8.4, 8.5, and 8.6 present results. Section 8.4 describes the
workload we experienced and shows our architecture to be robust. Section 8.5 gives
some initial findings about workers and their environments. Section 8.6 considers
higher granularity activity data and worker marketplace interactions. Section 8.7
summarizes related work, and we conclude in Section 8.8.
8.1 Worker Monitoring Terms and Notation
We define crowdsourcing to be the process of getting one or more people over the
Internet to perform work via a marketplace. We call the people doing the work
workers. We call the people who need the work completed requesters. A marketplace
is a web site that connects workers to requesters, allowing workers to complete (micro-
)tasks for a monetary, virtual, or emotional reward.
Tasks are grouped in task groups, so that workers can find similar tasks. Mechani-
cal Turk, the marketplace for which our Turkalytics tool is designed, calls tasks HITs
and task groups HITTypes. When a worker completes a task, we call the completed
(task, worker) pair an assignment or work.
Tasks are posted to marketplaces programmatically by requesters using interfaces
provided by the marketplaces. A requester usually builds a program called a human
computation system to ease posting many tasks (e.g., HPROC in Chapter 7). (We
use “system” in both this specific sense and in a colloquial sense, though we try to be
explicit where possible.) The system needs to solve problems like determining when to
post, how to price tasks, and how to assess the quality of completed work. The
system may be based on a framework designed and/or implemented by someone else
to solve some of these tasks, like the human processing model of Chapter 6. The
human computation system may also leave certain problems to outside services, such
as our analytics tool (for analytics) or a full service posting and pricing tool like
CrowdFlower [6].
Figure 8.1: Search-Preview-Accept (SPA) model.
The rest of this section describes two models at the core of our Turkalytics tool.
The worker interaction model of Section 8.1.1 makes it possible to represent (and
report on) the steps taken to perform work. The data model of Section 8.1.2 is key
to understanding what data needs to be collected. As we will see in Section 8.6,
our interaction model helps us present results about worker behavior. Similarly, our
data model helps us describe the implementation (Section 8.2) and requester usage
(Section 8.3).
8.1.1 Interaction Model
Crowdsourcing marketplaces vary. Some focus on areas of expertise (e.g., program-
ming or graphic design) while others are more defined by the average time span of
a task (e.g., one minute microtasks or month long research projects). Different mar-
ketplaces call for different interactions. For example, marketplaces with longer, more
skilled tasks tend to have contests or bidding based on proposals, while marketplaces
for microtasks tend to have a simpler accept or reject style. We first describe a
simple microtask model, then extend it to cover Mechanical Turk.
Simple Model
The Search-Preview-Accept (SPA) model is a simple model for microtasks (Fig-
ure 8.1). Workers initially are in the Search or Browse state, looking for work they
can do at an appropriate price. Workers can then indicate some interest in a task
by entering the Preview state through a preview action. Preview differs from Search
Figure 8.2: Search-Continue-RapidAccept-Accept-Preview (SCRAP) model.
or Browse in that the worker may have a complete view of the task, rather than
some summary information. From Preview, the worker can enter the Accept state by
accepting and actually complete the task. Lastly, the worker can always return to a
previous state, for example, a worker can return an accepted task, or leave behind a
task that he found uninteresting on preview.
The SPA model fits microtasks well because the overhead of a more complex
process like an auction seems to be much too high for tasks that may only pay a few
pennies. However, the SPA model does provide flexibility to allow workers to self-select
for particular tasks and to back out of tasks that they feel unsuited for. The
Accept state also allows greater control over how many workers may complete a given
task, because workers may be prevented from accepting a task.
Mechanical Turk Extensions
Mechanical Turk uses a more complex model than SPA which we call the Search-
Continue-RapidAccept-Accept-Preview (SCRAP) model (Figure 8.2). This model
is similar to SPA, but adds two new states, Continue and RapidAccept. Continue
allows a worker to continue completing a task that was accepted but not submitted
or returned. RapidAccept allows a worker to accept the next task in a task group
without previewing it first. In practice, the actual states and transitions in Mechanical
Figure 8.3: Turkalytics data model (Entity/Relationship diagram).
Turk are much messier than Figure 8.2. However, we will see in Section 8.6.1 that
mapping from Mechanical Turk to SCRAP is usually straightforward.
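Mapping logged page views onto SCRAP can be checked against a transition table. The transitions below are an approximation chosen for illustration; as noted, the actual states and transitions in Mechanical Turk are messier than Figure 8.2.

```python
# Approximate SCRAP transition table (after Figure 8.2). Actual
# Mechanical Turk behavior is messier; this is a simplification
# used only to sanity-check observed state sequences.
SCRAP = {
    "Search":      {"Search", "Preview", "RapidAccept", "Continue"},
    "Preview":     {"Preview", "Accept", "Search"},
    "Accept":      {"Accept", "RapidAccept", "Search"},
    "RapidAccept": {"RapidAccept", "Accept", "Search"},
    "Continue":    {"Continue", "Accept", "Search"},
}

def valid_path(states):
    """Check whether a sequence of observed worker states is consistent
    with this (approximate) SCRAP model."""
    return all(b in SCRAP[a] for a, b in zip(states, states[1:]))
```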
While SCRAP is a reasonable model of Mechanical Turk worker activity, it is
incomplete in two notable ways. First, it ignores certain specialized Mechanical Turk
features like qualifications. This is primarily because Turkalytics, as an unobtrusive
third-party add-on, cannot really observe these states. Second, SCRAP chooses a
particular granularity of activity to describe. As we will see, Turkalytics actually
includes data within a state, for example, form filling activity or mouse movement.
We can think of such data as being attached to a state, which is more or less how it
is represented in our data model.
8.1.2 Data Model
This section uses the terminology of data warehousing and online analytical processing
(OLAP) systems. Data in Turkalytics is organized in a star schema, centered around
a single fact table, Page Views. Each entry in Page Views represents one worker
visiting one web page, in any of the states of Figure 8.2. There are a number of
dimension tables, which can be loosely divided into task, remote user, and activity
tables. The three task tables are:
1. Tasks: The task corresponding to a given page view.
2. Task Groups: The task group containing a given task.
3. Owners: The owner or requester of a given task group.
The four remote user tables are:
1. IPs: The IP address and geolocation information associated with a remote user
who triggered a page view.
2. Cookies: The cookie associated with a given page view.
3. Browser Details: The details of a remote user’s browser, like user agent
(a browser identifier like Mozilla/5.0 (Windows; U; Windows NT 5.1;
en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.99
Safari/533.4,gzip(gfe)) and available plugins (e.g., Flash).
4. Workers: The worker information associated with a given remote user.
The two activity tables are:
1. Activity Signatures: Details of what activity (and inactivity) occurred during a
page view.
2. Form Contents: The contents of forms on the page over the course of a page
view.
Figure 8.3 shows an Entity/Relationship diagram. Entities in Figure 8.3 (the rect-
angles) correspond to actual tables in our database, with the exception of “Remote
Users.” Entities attached to “Remote Users” are dimension tables for “Page Views.”
The circles in the figure represent the attributes or properties of each entity.
There is one set of tables that we have left out for the purpose of clarity. As
we will see in Section 8.2, we need to build up information about a single page view
through many separate logging events. As a result, there are a number of tables,
which we do not enumerate here, that enable us to incrementally build from logging
events into complete logging messages, and then finally into higher level entities like
overall activity signatures and page views.
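A minimal sketch of such a star schema, with illustrative table and column names rather than Turkalytics' actual schema:

```python
import sqlite3

# Minimal star-schema sketch around the Page Views fact table.
# Table and column names are illustrative, not Turkalytics' actual ones.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE tasks       (task_id TEXT PRIMARY KEY, group_id TEXT);
CREATE TABLE task_groups (group_id TEXT PRIMARY KEY, reward_cents INTEGER);
CREATE TABLE workers     (worker_id TEXT PRIMARY KEY);
CREATE TABLE page_views (              -- the fact table
    page_session_id TEXT PRIMARY KEY,
    task_id   TEXT REFERENCES tasks,
    worker_id TEXT REFERENCES workers,
    state     TEXT                     -- e.g. Preview, Accept
);
""")
db.execute("INSERT INTO task_groups VALUES ('1ZSQ...', 1)")
db.execute("INSERT INTO tasks VALUES ('152...', '1ZSQ...')")
db.execute("INSERT INTO workers VALUES ('A1Y9...')")
db.execute("INSERT INTO page_views VALUES ('0.828...', '152...', 'A1Y9...', 'Accept')")

# A typical OLAP-style rollup: page views per worker per state.
rows = db.execute("""
    SELECT worker_id, state, COUNT(*) FROM page_views
    GROUP BY worker_id, state
""").fetchall()
```

Queries against the fact table roll up by any dimension (task group, worker, IP, activity), which is exactly the shape of the reports in Sections 8.4 through 8.6.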
8.2 Implementation
Turkalytics is implemented in three parts: client-side JavaScript code (Section 8.2.1),
a log server (Section 8.2.2), and an analysis server (Section 8.2.3). Section 8.2.4 gives
a broad overview of the design choices we made and limitations of our design.
8.2.1 Client-Side JavaScript
A requester on Mechanical Turk usually creates a HIT (task) based on a URL. The
URL corresponds to an HTML page with a form that the worker completes. Re-
questers add a small snippet of HTML to their HTML page to embed Turkalytics
(see Section 8.3.1). This HTML in turn includes JavaScript code (ta.js) which tracks
details about workers as they complete the HIT.
The ta.js code has two main responsibilities:
1. Monitoring: Detect relevant worker data and actions.
2. Sending: Log events by making image requests to our log server (Section 8.2.2).
ta.js monitors the following:
1. Client Information: What resolution is the worker’s screen? What plugins are
supported? Can ta.js set cookies?
2. DOM Events: Over the course of a page view, the browser emits various events.
ta.js detects the load, submit, beforeunload, and unload events.
3. Activity: ta.js listens on a second by second basis for the mousemove, scroll
and keydown events to determine if the worker is active or inactive. ta.js then
produces an activity signature, e.g., iaaia represents three seconds of activity
and two seconds of inactivity.
4. Form Contents: ta.js determines what forms are on the page and the contents
of those forms. In particular, ta.js logs initial form contents, incremental
updates, and final state.
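Producing an activity signature from per-second samples is straightforward. A sketch (the sampling itself, done in ta.js via DOM event listeners, is elided):

```python
def activity_signature(active_seconds):
    """Compress per-second activity samples into a signature string:
    'a' for an active second, 'i' for an inactive one, e.g. 'iaaia'
    for three active and two inactive seconds."""
    return "".join("a" if active else "i" for active in active_seconds)

def active_fraction(signature):
    """Fraction of the page view spent active, from a signature."""
    return signature.count("a") / len(signature) if signature else 0.0
```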
ta.js sends monitored data to the log server via image requests. We define a
logging event (or event, where the meaning is clear) to be a single image request.
Image requests are necessary to circumvent the same origin policies common in most
mainstream browsers, which block actions like sending data to external sites. Special
care is also needed to send these image requests in less than two kilobytes due to
restrictions in Microsoft Internet Explorer (MSIE). We define a logging message to
be a single piece of logged data split across one or more events in order to satisfy
MSIE’s URL size requirements. For example, logging messages sent by ta.js include
activity signatures, related URLs, client details, and so on (Listing 8.1 is one such
logging message). A single page view can lead to as few as seven or as many as
hundreds of image requests. (For example, the NER task described at the beginning
of Section 8.4 can lead to over one hundred requests as it sends details of its form
contents, because it has over 2,000 form elements.)
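Splitting a message across events can be sketched as follows. The parameter names echo Listing 8.1, but the chunk-size arithmetic is an assumption, not ta.js's actual logic.

```python
from urllib.parse import urlencode

MAX_URL = 2048  # stay under MSIE's roughly two kilobyte URL limit

def message_to_events(base_url, msg_id, payload):
    """Split one logging message into image-request URLs, tagging each
    chunk with its index and the total count (cf. turkaConcatNum and
    turkaConcatLen in Listing 8.1). The chunk-size math is a sketch."""
    overhead = len(base_url) + 80      # rough room for bookkeeping params
    chunk_size = MAX_URL - overhead
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)] or [""]
    return [base_url + "?" + urlencode({
                "turkaMsgId": msg_id,
                "turkaConcatNum": i,
                "turkaConcatLen": len(chunks),
                "data": chunk})
            for i, chunk in enumerate(chunks)]
```

The receiving side can reassemble a message once it holds all `turkaConcatLen` chunks sharing a message identifier, which is one source of the "incomplete entities" discussed in Section 8.2.3.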
8.2.2 Log Server
The log server is an extremely simple web application built on Google’s App Engine.
It receives logging events from clients running ta.js and saves them to a data store.
In addition to saving the events themselves, the log server also records HTTP data
like IP address, user agent, and referer. (We intentionally continue the historical
convention of misspelling “referer” when used in the context of the HTTP referer,
and also do so when referring to the JavaScript document referrer.) A script on the
analysis server (Section 8.2.3) periodically polls the web application, downloading
and deleting any new events that have been received. This simplicity pays off: our
log server has scaled to over one hundred thousand requests per day.
8.2.3 Analysis Server
The analysis server periodically polls the log server for new events. These events are
then inserted into a PostgreSQL database, where they are processed by a network
of triggers. These triggers are arguably the most complex part of the Turkalytics
implementation, for four main reasons:
1. Time Constraints: One of our goals is for the analysis server to be updated, and
queryable, in real time. Currently, the turnaround from client to availability
in the analysis server is less than one minute.
2. Dependencies: What to do when an event is inserted into the analysis server
may depend on one or more other events that may not have even been received
yet.
3. Incomplete Input: When Turkalytics has not yet received all logging events
pertaining to a message, page view, or any other entity from Figure 8.3, we
call that entity incomplete. Nonetheless, requesters should be able to query
as much information as possible, even if some entities are incomplete. In fact,
many entities will remain incomplete forever. (This is one negative result of an
explicit design decision in Section 8.2.4.)
4. Unknown Input: The analysis server may receive unexpected input that conflicts
with our model of how the Mechanical Turk works, yet it must still handle this
input.
These challenges are sufficiently difficult that our current PL/Python triggers are
our second or third attempt at a solution. (One earlier attempt made use of
dependency handling from a build tool, for example.)
Rather than fully describing our triggers here, we give an example of their function-
ality instead. Suppose a worker A1Y9... has just finished previewing a task 152...,
and chooses to accept it. When the worker loads a new page corresponding to the
accept state, ta.js sends a number of events. Listing 8.1 shows one such event,
a related URLs event detailing the page which referred the worker to the current
page. (Our implementation uses JavaScript Object Notation (JSON) as the format
for logging events.)
From the HTTP REFERER, Turkalytics can now learn the identifiers for the as-
signment (1D9...), HIT (152...), and worker (A1Y9...). From the PATH INFO,
Turkalytics can learn what type of data ta.js is sending (relatedUrls). From the
QUERY STRING, Turkalytics gets the actual data being sent by ta.js, in this case,
an escaped referer URL (documentReferrerEsc) which in turn includes the reward
(USD0.01), group identifier (1ZSQ...), and other information. The QUERY STRING
also includes a pageSessionId, which is a unique identifier shared by all events
sent as a result of a single page view. (pageSessionId is the key for the “Page
Views” table.) Note that neither the HTTP REFERER nor the referer sent by ta.js
as documentReferrerEsc represents the worker’s previously visited page. The
{ ...
  "HTTP_REFERER":
    "...?assignmentId=1D9...
        &hitId=152...
        &workerId=A1Y9...",
  "PATH_INFO": "/event/relatedUrls",
  "QUERY_STRING":
    "turkaMsgId=2
     &documentReferrerEsc=https%3A%2F...
        %26prevRequester%3DStanford%2B...
        %26requesterId%3DA2IP5GMJBH7TXJ
        %26prevReward%3DUSD0.01...
        %26groupId%3D1ZSQ...
     &turkaConcatNum=0
     &turkaConcatLen=1
     &targetId=f68daad1
     &timeMillis=127...
     &pageSessionId=0.828...
     &clientHash=150...",
  ... }
Listing 8.1: Excerpt from a related URLs logging event formatted as JSON.
HTTP REFERER seen by our log server is the HIT URL that the worker is currently
viewing, and documentReferrerEsc is the Mechanical Turk URL containing that
HIT URL in an IFRAME.
When the event from Listing 8.1 is inserted into the database, the following func-
tionality is triggered (and more!):
1. The current page view, as specified by pageSessionId, is updated to have
assignment 1D9..., HIT 152..., and worker A1Y9....
2. Other page views which lack a worker identifier may have one inferred based on
the current page view’s worker identifier. (This requires an invariant involving
the distance in time between page views.) For example, the page view associated
with the worker’s previous preview state is updated to have a worker identifier
of A1Y9.... (When the worker was in the preview state, the assignment and
worker identifiers were unknown, but now we can infer them based on this later
information.)
3. If not already known, a new task group 1ZSQ... with a reward of one cent is
added.
4. If not already known, a new mapping from the current HIT 152... to the task
group 1ZSQ... is added.
5. If not already known, the requester name and identifier are added to the task
group and owner entities.
This example shows that incrementally building entities from Figure 8.3 in real-time
requires careful consideration of both event dependencies and appropriate invariants.
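The worker-identifier inference in step 2 can be sketched as follows. The ten-minute window and the flat dict representation are assumptions for illustration, not the actual trigger code or invariant.

```python
BACKFILL_WINDOW = 600  # seconds; the real invariant's window is an assumption

def backfill_worker(page_views, session_id):
    """When a page view learns its worker identifier (e.g. on accept),
    propagate that identifier to nearby earlier page views from the same
    remote user that lack one -- a sketch of the analysis server's
    trigger logic, not the actual PL/Python implementation."""
    pv = page_views[session_id]
    if pv["worker_id"] is None:
        return
    for other in page_views.values():
        if (other["worker_id"] is None
                and other["remote_user"] == pv["remote_user"]
                and 0 <= pv["time"] - other["time"] <= BACKFILL_WINDOW):
            other["worker_id"] = pv["worker_id"]
```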
8.2.4 Design Choices
There are four main considerations in Turkalytics’ design:
1. Ease: We wanted Turkalytics to be easy for requesters to use and install.
2. Unobtrusiveness: We wanted Turkalytics to be as invisible as possible to workers
as they perform work, and not to impact the operation of requesters.
3. Data Collection: We wanted to gather as much worker task completion data as
reasonably possible.
4. Cross-Platform: We wanted Turkalytics to work across a number of different
human computation systems for posting work to Mechanical Turk, because such
systems are currently quite heterogeneous.
Our requirements that Turkalytics be easy, unobtrusive, and cross-platform led us to
build our tool as embeddable JavaScript, and to use simple cross-platform ideas like
sessions and cookies to group events by workers.
It is perhaps worth taking a moment to note why building an analytics tool like
Turkalytics, and in particular building it as embeddable, cross-platform JavaScript,
is nontrivial. First, we do not have direct access to information about the state of
Mechanical Turk. We do not want to access the Mechanical Turk API as each of
our requesters. However, even if we did, the Mechanical Turk API does not allow
us to query fine grained data about worker states, worker activity, or form contents
over time. Nor does it tell us which workers are reliable, or how many workers are
currently using the system. Second, data collected is often incomplete, as discussed
in Section 8.2.3, and we often need to infer additional data based on information
that we have. Third, remote users can change identifiers in a variety of ways, and
in many cases we are more interested in the true remote user than any particular
worker identifier. All of these challenges, in addition to simply writing JavaScript
that works quickly and invisibly across a variety of unknown web browsers with a
variety of security restrictions (same origin policy, third party cookies), make writing
an analytics tool like Turkalytics difficult.
Two of our design considerations, “unobtrusiveness” and “data collection,” are in
direct opposition to one another. For example, consider the following trade-offs:
• ta.js could send more logging messages with more details about the worker’s
browser state, but this may be felt through processor, memory, or bandwidth
usage.
• ta.js could sample workers and only gather data from some of them, improving
the average worker’s experience, but reducing overall data collection.
• ta.js could interfere with the worker’s browser to ensure that all logging events
are sent and received by our logging server, for example, by delaying submission
of forms while logging messages are being sent.
<script type="text/javascript"
        src="https://.../ta.js">
</script>
<script type="text/javascript">
  Turka.Frontend.startAllTracking("...");
</script>
Listing 8.2: Turkalytics embed code.
These options represent a spectrum between unobtrusiveness and data collection.
We chose to send fairly complete logging messages and avoid sampling. This is
because we believe that workers are more motivated than regular web visitors to
tolerate minor performance degradation (on the order of hundreds of milliseconds).
This is quite different from the assumptions behind tools like Google Analytics.
Nonetheless, we draw the line at interfering with worker behavior, which we deem
too obtrusive. A result of this decision is that we may occasionally have incomplete
data from missed logging messages.
A current technical limitation of our implementation is a focus on HTML forms.
HITs that make use of Flash or an IFRAME may produce incomplete activity and form
data. However, nothing in our design prevents such cases from being handled
eventually.
8.3 Requester Usage
Requesters interact with Turkalytics at two points: installation (Section 8.3.1) and
reporting (Section 8.3.2). Our goal in this section is to illustrate how easy our
current Turkalytics tool is to use and how much benefit requesters get.
8.3.1 Installation
In most cases, embedding Turkalytics simply requires adding a snippet of HTML (see
Listing 8.2) to each HTML page corresponding to a posted HIT. (See Section 8.2.1
for more on how HTML pages relate to HITs.) Most systems for displaying HITs
have some form of templating system in place, so this change usually only requires
a copy-and-paste to one file. An important special case of a human computation
system with a templating system is the Mechanical Turk requester bulk loading web
interface [7]. Underlying that interface is a templating system which generates HTML
pages on Amazon’s S3 system [8], so Listing 8.2 works there as well (by adding it to
the bottom of each template). These two cases, requesters posting HTML pages and
requesters using the bulk interface, cover all of our current requesters.
Listing 8.2 has two parts elided. The first “...” is where ta.js is located, on our
server. This does not change across installations. The second “...” is an identifier
identifying the particular requester. Currently, we assign each Turkalytics requester
a hexadecimal identifier, like 7e3f6604. Once the requester has added the snippet
from Listing 8.2 with these changes, they are done. (Requesters lacking SSL also
need to use a simple workaround script due to referer handling, but such requesters
are rare.) In our experience, the process usually takes less than five minutes and is
largely invisible to the requester afterwards.
Two implementation notes about the hexadecimal identifier bear pointing out.
The first is that, because we operate in the web browser as a third party, it
is possible that an “attacker” could send our system fake data. At this stage
there is little incentive to do so, and this weakness is shared by most analytics
systems. The second is that the hexadecimal identifier allows us to easily
partition our data on a per requester basis. Our current analysis server uses
a multitenant database where we query individual requester statistics using this
identifier, but the data could easily be split across multiple databases.
8.3.2 Reporting
Once Turkalytics is installed (Section 8.3.1), all that remains is to later report to
the requester what analysis we have done. Like most data warehousing systems, we
have two ways of doing this. We support ad hoc PostgreSQL queries in SQL and we
are in the process of implementing a simple web reporting system with some of the
more common queries. In fact, most of the data in this chapter was queried from our
SELECT tg.requester_name
     , sum(tg.reward_cents) AS total_cents
     , count(*) AS num_submits
  FROM page_views AS pv
     , task_groups AS tg
 WHERE pv.task_group_id=tg.task_group_id
   AND pv.page_view_type='accept'
   AND pv.page_view_end='submit'
 GROUP BY tg.requester_name;

SELECT tg.requester_name
     , pv.task_group_id
     , sum(tg.reward_cents * 3600)
       / sum(a.active_secs)
     , sum(tg.reward_cents * 3600)
       / sum(a.total_secs)
  FROM page_views AS pv
     , task_groups AS tg
     , activity_signatures AS a
 WHERE pv.task_group_id=tg.task_group_id
   AND pv.page_view_id=a.page_view_id
   AND pv.page_view_type='accept'
   AND pv.page_view_end='submit'
   AND a.is_complete
 GROUP BY tg.requester_name
     , pv.task_group_id;

Listing 8.3: Two SQL reporting queries.
live system, including Tables 8.1 and 8.2, Figures 8.5 and 8.6, and most of the inline
statistics. (The only notable exception is Figure 8.4, where it is somewhat awkward
to compute sequential transitions in SQL.)
To give a flavor for what you can do, Listing 8.3 gives two example queries that
run on our real system. For example, suppose we want to know which requesters in
our system are the heaviest users of Mechanical Turk. The first query computes total
number of tasks and total payout aggregated by requester by joining page view data
with task group data. An example output tuple is ("Petros Venetis", 740, 160).
That output tuple means that the requester Petros Venetis spent $7.40 USD on
160 tasks. The second query computes the hourly rate of workers based on active and
total seconds grouped by task group. (The query does so by joining page views, task
groups, and activity data, and using the activity data to determine amount of time
worked.) We might want to do this, for example, to determine appropriate pricing
of a future task based on an estimate of how long it takes. An example output
tuple is ("Paul H", "1C4...", 6101.695, 122.553). That output tuple means
that for task group 1C4... owned by Paul H, the hourly rate of workers based on
the number of active seconds was ≈$61.02 USD, while the hourly rate based on total
time to completion was ≈$1.23 USD. (Note that the example tuple has a very large
discrepancy between active and total hourly rate, because the task required workers
to upload an image created offline.)
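The rate computation in the second query is simple to verify by hand. The sketch below mirrors the sum(tg.reward_cents * 3600) / sum(a.active_secs) expression from Listing 8.3 (the SQL produces cents per hour; we divide by 100 for dollars). The 10-cent reward and the second counts below are illustrative values, not drawn from our data.

```python
def hourly_rate_dollars(reward_cents: float, secs: float) -> float:
    """Dollars per hour implied by earning reward_cents in secs seconds.
    Scales the per-task reward from seconds to an hour, then cents to dollars."""
    return reward_cents * 3600.0 / secs / 100.0

# Hypothetical task: a 10-cent reward, 60 active seconds, 300 total seconds.
active_rate = hourly_rate_dollars(10, 60)    # 6.0 dollars/hour while active
total_rate = hourly_rate_dollars(10, 300)    # 1.2 dollars/hour wall clock
```

A large gap between the two rates, as in the example tuple above, simply means workers spent most of the wall-clock time inactive on the page.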
8.4 Results: System Aspects
This section, and the two that follow (Sections 8.5 and 8.6) describe our production
experience with the Turkalytics system. Our data for these sections is collected over
the course of about a month and a half starting on June 14th, 2010. The data consists
of 12, 370 tasks, 125 worker days, and a total cost of $543.66. In our discussion below,
we refer to three groups of tasks posted by requesters using our tool:
1. Named Entity Recognition (NER): This task, posted in groups of 200 by a
researcher in Natural Language Processing, asks workers to label words in a
Wikipedia article if they correspond to people, organizations, locations, or de-
monyms. (2,000 HITs, 1 HITType, more than 500 workers.)
2. Turker Count (TC): This task, posted once a week by a professor of business
at U.C. Berkeley, asks workers to push a button, and is designed just to gauge
how many workers are present in the marketplace. (2 HITs, 1 HITType, more
than 1,000 workers each.)
3. Create Diagram (CD): This task, posted by the authors, asked workers to draw
diagrams for this chapter based on hand drawn sketches. In particular, Fig-
ures 8.1, 8.2, and 8.4 were created by worker A1K17L5K2RL3V5 while Figure 8.3
was created by worker ABDDE4BOU86A8. (≈ 5 HITs, 1 HITType, more than 100
workers.)
There are two questions worth asking about our Turkalytics tool itself. The first
is whether the system is performant, i.e., how fast it is and how much load it can
handle. The second is whether it is successfully collecting the intended monitored
information. (Because Turkalytics is designed to be unobtrusive, there are numerous
situations in which Turkalytics might voluntarily lose data in the interest of a better
client experience.) This section answers these questions focusing on the client (Sec-
tion 8.4.1), the logging server (Section 8.4.2), and the analysis server (Section 8.4.3).
8.4.1 Client
Does ta.js effectively send remote logging messages?
We asked some of our requesters to add an image tag corresponding to a one pixel
GIF directly above the Listing 8.2 HTML in their HITs. Based on access to this
“baseline” image, we can determine how many remote users viewed a page versus
how many actually loaded and ran our JavaScript. (This assumes that remote users
did not block our server, and that they waited for pages to load.)
Overall, the baseline image was inserted for 25,744 URLs. Turkalytics received
JavaScript logging messages for all but 88 of those URLs, which means that our loss
rate was less than 0.5%. There is no apparent pattern in which messages are missing
on a per browser or other basis. Our ta.js runs on all modern browsers, though
some features vary in availability based on the browser. (For example, Safari makes
it difficult to set third party cookies and early versions of MSIE slow down form
contents discovery due to DOM speed.)
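The loss-rate figure above can be checked directly:

```python
baseline_urls = 25744   # URLs where the baseline one-pixel GIF was requested
missing = 88            # URLs with no matching JavaScript logging message
loss_rate = missing / baseline_urls
print(f"{loss_rate:.3%}")  # 0.342%, comfortably under 0.5%
```
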
How complete is activity sending?
Activity data is sent periodically as logging messages by ta.js. However, as with
other logging messages, the browser thinks we are loading a series of images from
the logging server rather than sending messages. As a result, if the worker navigates
away from the page, the browser may not bother to finish loading the images. After
all, there is no longer a page for the images to be displayed on!
How commonly are activity logging messages lost? We looked at activity
signatures for 9,884 page views corresponding to NER tasks. Each page view was an
accepted task which the worker later submitted. We computed an expected number
of activity seconds for a given page view by subtracting the timestamp of the first
logging event received by Turkalytics from the timestamp of the last logging event.
If we are within 20 seconds, or within 10% of the total expected number of activity
seconds, whichever is less, we say that we have full activity data. (Activity
monitoring may take time to start, so we leave a buffer before expecting activity
logging messages.) For 8,426 of these page views, or about 85%, we have “full”
activity data in this sense.
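One reading of this criterion can be sketched as follows. The 20-second and 10% thresholds match the text, but the shortfall-based formulation is our interpretation, not the actual analysis server code.

```python
def has_full_activity(expected_secs: float, active_secs: float) -> bool:
    """'Full' activity data: the shortfall between expected and observed
    activity seconds is within 20 seconds, or within 10% of the expected
    total, whichever buffer is smaller."""
    allowed = min(20.0, 0.10 * expected_secs)
    return expected_secs - active_secs <= allowed
```
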
How fast and correct is form content sending?
Checking the form contents to send to the logging server usually takes on the order
of a few hundred milliseconds every ten to thirty seconds. This assumes a modern
browser and a reasonably modern desktop machine. Of the 9,884 NER page views
accepted and submitted from the previous section, 8,049 had complete form data.
8.4.2 Logging Server
Given the simplicity of the logging server, it only makes sense to ask what the peak
load has been and whether there were any failed requests. (Failed requests are logged
for us by the App Engine.) In general, Mechanical Turk traffic is extremely bursty—
at the point of initial post, many workers complete a task, but then traffic falls off
sharply (see Section 8.6.2). However, our architecture handles this gracefully. In
practice, we saw a peak requests/second of about ten, and a peak requests/day of
over 100,000, depending on what tasks were being posted by our requesters on a given
day. However, there is no reason to think that these are anywhere near the limits
of the logging server. During the period of our data gathering, we logged 1,659,944
logging events, and we lost about 20 per day on average due to (relatively short)
outages in Google’s App Engine itself.
8.4.3 Analysis Server
Our analysis server is an Intel Q6600 with four gigabytes of RAM and four regular
SATA hard drives located at Stanford. We batch loaded 1,515,644 JSON logging
events in about 520 minutes to test our trigger system's loading speed. Despite the
fact that our code is currently single threaded and limited to running forty seconds
of every minute, our batch load represents an amortized rate of about 48 logging
events per second. The current JSON data itself is about 2.1 gigabytes compressed,
and our generated database is about 4.6 gigabytes on disk. Both the data format
and the speed of insertion into the analysis server could be optimized: currently the
insertion is mostly CPU bound in Python, most likely due to JSON parsing overhead.
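The amortized figure follows from simple arithmetic:

```python
events = 1_515_644            # JSON logging events batch loaded
minutes = 520                 # wall-clock load time

amortized = events / (minutes * 60)   # about 48.6 events/second overall
# The loader only runs forty seconds of every minute, so its rate while
# actually executing is higher:
while_running = amortized * 60 / 40   # about 72.9 events/second
```
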
8.5 Results: Worker Aspects
Where are workers located?
Most demographic information that is known about Mechanical Turk workers is the
result of surveys on the Mechanical Turk itself. Surveys are necessary because the
workers themselves are kept anonymous by the Mechanical Turk. However, such
surveys can easily be biased, as workers appear to specialize in particular types of
work, and one common specialization is filling out surveys.
Our work with Turkalytics allows us to test the geographic accuracy of these
past surveys. In particular, we use the “GeoLite City” database from MaxMind to
geolocate all remote users by IP address. (MaxMind claims that this database is 99.5%
accurate at the country level [9].) The results are shown in Table 8.1. For example,
the first line of Table 8.1 shows that in our data, the United States represented 2,534
unique IP addresses (29.84% of the total), 1,299 unique workers (44.716% of the
total), 199 unique US workers who did the NER task, and 1,011 unique US workers
who completed the TC task.
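The per-country aggregation is straightforward once a geolocation lookup is available. A minimal sketch follows; the `geolocate` function stands in for the MaxMind GeoLite City lookup, and counting each worker by the first IP we observed for them is a simplifying assumption.

```python
from collections import Counter

def country_counts(worker_ips, geolocate):
    """Count unique workers per country. `worker_ips` is an iterable of
    (worker_id, ip) pairs; `geolocate` maps an IP string to a country name
    (in production, a MaxMind GeoLite City database lookup)."""
    first_ip = {}
    for worker_id, ip in worker_ips:
        first_ip.setdefault(worker_id, ip)   # keep the first IP per worker
    return Counter(geolocate(ip) for ip in first_ip.values())
```
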
There are two groups of countries in the data. The United States and India
Country            #IPs   %IPs     #Workers   %Workers   #Workers  #Workers
                                   (Overall)  (Overall)  (NER)     (TC)
United States      2534   29.840   1299       44.716     199       1011
India              4794   56.453   1116       38.417     227        717
Philippines         127    1.496     52        1.790      15         32
United Kingdom       92    1.083     43        1.480      11         27
Canada               86    1.013     42        1.446      10         33
Germany              50    0.589     16        0.551       6         10
Australia            32    0.377     16        0.551       4         10
Pakistan             49    0.577     15        0.516       5         10
Romania              96    1.130     14        0.482       5          7
Anonymous Proxy      13    0.153     12        0.413       0         13

Table 8.1: Top Ten Countries of Turkers (by Number of Workers). 2,884 Workers, 8,216 IPs total.
are the first group, and they represent about 80% of workers. The second group is
everywhere else, consisting of about 80 other countries, and 20% of the workers. This
second group is more or less power law distributed. We suspect that worker countries
are heavily biased by the availability of payment methods. Mechanical Turk has
very natural payment methods in the United States and India, but not elsewhere
(e.g., even in other English speaking countries like Canada, the United Kingdom, and
Australia).
Comparing the NER and TC tasks, we can see that Indians are more prevalent on
the NER task. However, regardless of grouping, the nationality orderings seem to be
fairly similar, with the caveat that Indians have many more IPs than Americans. This
suggests that previous survey data may be slightly biased based on respondents, but
overall may not be terribly different from the true underlying worker demographics.
What does a “standard” browser look like?
The most common worker screen resolutions are 1024x768 (2,266 workers at some
point), 1280x800 (1,166 workers), 1366x768 (670 workers), 1440x900 (494 workers),
1280x1024 (451 workers), and 800x600 (228 workers). No other resolution has more
than 200 workers. Given an approximate browser chrome height of 170px and a
Mechanical Turk interface height of 230px or more, a huge number of workers are
previewing (and possibly completing) tasks in less than 400px of screen height. As a
caveat, some workers may be double counted as they switch computers or resolutions.
The average is about 1.5 distinct resolutions per worker, so most workers have one
or two distinct resolutions.
About half of our page views are by Firefox users, and about a quarter are by
MSIE users. In terms of plugins, Java and Flash each represent 70–75% of our page
views, while PDF and WMA each represent 50–55%. These may be underestimates
based on our detection mechanism (navigator MIME types).
Are workers identifiable? Do they switch browsers?
It is becoming increasingly common to use the Mechanical Turk as a subject pool for
research studies. A growing body of literature has looked at how to design studies
around the constraints of Mechanical Turk. One key question is how to identify a
remote user uniquely. For example, how do I know that an account for a 30 year old
woman from Kansas is not really owned by a professional survey completer with a
number of accounts in different demographic categories?
One solution is to look at reasonably unique data associated with a remote user.
Table 8.2 shows the number of user agents, IP addresses, and identifier cookies asso-
ciated with a given worker. Ideally, for identification purposes, each of these numbers
would be one. In practice, these possibly unique pieces of data seem to vary heavily
by worker. Common user agent strings, dynamic IPs, and downgrading or blocking of
cookies (particularly third party cookies as Turkalytics uses) are all possible reasons
for this variability. For example, the worker three from the bottom had 84 distinct
Worker   UAs  IPs  Cookies  Views  Country
AXF...     3    1    17      619   India
A1B...     2    9     4      618   Multinational
A1K...     5   23     8      537   India
A3O...     4   13    68      502   India
A2C...     4   47    33      462   India
A3I...     4    2     3      450   United States
A2I...     3    4     2      393   United States
A1V...     4   14     1      314   United States
A1C...     4   10     7      303   India
A31...     3   11     2      288   India
A2H...     8    6     8      268   India
A29...     1   17     2      244   India
A3J...     3   84     2      226   India
A2O...     3    3     4      225   United States
A1E...     1   25     5      225   India

Table 8.2: The number of user agents, IP addresses, cookies, and views for top workers by page views.
IP addresses over the course of 226 page views, but nonetheless kept the same two
tracking cookies throughout. On the other hand, the first worker in the table had 17
tracking cookies over 619 page views, but only had one IP address throughout. On
average, for active workers, the user agent to page view ratio is about 1:25, the IP to
page view ratio is about 1:10, and the cookie to page view ratio is about 1:11. These
numbers are skewed by special cases, however, and the median numbers are usually
lower. (“Cheaters” appear rare, though one remote user with a single cookie seems
to have logged in seven different times, as seven different workers, to complete the
TC task.)
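The per-worker counts in Table 8.2 can be computed with a simple pass over the page view data. A minimal sketch follows; the tuple layout of a page view is hypothetical, not the actual analysis server schema.

```python
from collections import defaultdict

def identifier_stats(page_views):
    """Per-worker counts of distinct user agents, IPs, and tracking cookies,
    plus total page views (the quantities shown in Table 8.2). Each page
    view is a (worker_id, user_agent, ip, cookie) tuple."""
    uas, ips, cookies = defaultdict(set), defaultdict(set), defaultdict(set)
    views = defaultdict(int)
    for worker, ua, ip, cookie in page_views:
        uas[worker].add(ua)
        ips[worker].add(ip)
        cookies[worker].add(cookie)
        views[worker] += 1
    return {w: (len(uas[w]), len(ips[w]), len(cookies[w]), views[w])
            for w in views}
```
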
8.6 Results: Activity Aspects
Section 8.1.1 gave a model for interaction with Mechanical Turk. This section looks
at what behavior that model produces. Section 8.6.1 looks at what states and actions
Figure 8.4: Number of transitions between different states in our dataset. Note: These numbers are approximate and unload states are unlabeled.
occur in practice. Section 8.6.2 looks in particular at previewing interactions. Sec-
tion 8.6.3 looks at activity data generated by workers within page views. Our main
goals are to show that Turkalytics is capable of gathering interesting system interac-
tion data and to illustrate the tradeoffs of Mechanical Turk-like (i.e., SCRAP-like)
systems.
8.6.1 What States/Actions Occur in Practice?
Workers in the SCRAP model can theoretically execute a large number of actions.
However, we found that most transitions were relatively rare. Figure 8.4 shows the
transitions between states that we observed. (Very rare, unclear, or “unload” tran-
sitions are marked with question marks.) For example, we observed 720 transitions
by workers from Accept to RapidAccept, and 344 transitions out of the RapidAccept
state. To generate Figure 8.4, we assume the model described in Section 8.1.1 and
that workers are “single threaded,” that is, they use a single browser with a single
window and no tabs. These assumptions let us infer state transitions based on times-
tamp, which is necessary because of the referer setup described in Section 8.2.3. Over
88% of our observed state transitions are transitions in our mapped SCRAP model,
so our simplifying assumptions appear relatively safe.
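The timestamp-based inference can be sketched as follows, assuming simplified (worker, timestamp, state) records rather than the actual analysis server schema:

```python
from collections import Counter, defaultdict

def count_transitions(page_views):
    """Count state transitions under the 'single threaded' assumption:
    sort each worker's page views by timestamp and treat each consecutive
    pair of states as one transition."""
    by_worker = defaultdict(list)
    for worker, ts, state in page_views:
        by_worker[worker].append((ts, state))
    transitions = Counter()
    for views in by_worker.values():
        views.sort()
        for (_, a), (_, b) in zip(views, views[1:]):
            transitions[(a, b)] += 1
    return transitions
```
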
Figure 8.5: Number of new previewers visiting three task groups over time.
Do workers use the extensions provided by the SCRAP model above and beyond
the SPA model? We found that RapidAccept was used quite commonly, but Continue
was quite rare. About half of the workers who chose to do large numbers of tasks
chose to RapidAccept often rather than continuously moving between the Preview
and Accept states. However, continues represent less than 0.5% of our action data,
and returns about 2%. (We suspect that the transition to Accept from Preview is
particularly common in our data due to the prevalence of the simple Turker Count
task.)
8.6.2 When Do Previews Occur?
A common Mechanical Turk complaint is that the interfaces for searching and brows-
ing constrain workers. In particular, Chilton et al. observe that workers primarily
sort task groups by how recently they were created and how many tasks are available
in the group. This observation appears to be true in our data as well.
Figure 8.5 shows when previously unseen workers preview the NER, TC, and
CD task groups immediately after an instance was posted. For example, in the first
hour of availability of the TC task, almost 150 workers completed the task. The
NER task group has many tasks, but is posted only once. The TC task group has
only one task, and is only posted once. The CD task group has five tasks, but is
artificially kept near the top of the recently created list. In Figure 8.5, both NER and
Figure 8.6: Plot of average active and total seconds for each worker who completed the NER task.
TC show a stark drop off in previews after the first hour when they leave the most
recently created list. NER drops off less, likely because it is near the top of the tasks
available list. CD drops off less than TC, suggesting artificial recency helps. These
examples suggest that researchers should be quite careful when drawing conclusions
about worker interaction (e.g., due to pricing) because the effect due to rankings is
quite strong.
8.6.3 Does Activity Help?
Turkalytics collects activity and inactivity information, but is this information more
useful than lower granularity information like the duration that it took for the worker
to submit the task? It turns out that there are actually two answers to this question.
The first answer is that, as one might expect, the amount of time a worker is active
is highly correlated with the total amount of time a worker spends completing the
task in general. The second answer is that, despite this, signature data does seem to
clarify the way in which a worker is completing a particular task.
We looked at the activity signatures of 321 workers who had at least one complete
signature and had completed the NER task. The Pearson correlation between the
number of active seconds and the total number of seconds for these workers was 0.88
(see Figure 8.6). However, the activity signatures do give a more granular picture
of the work style of different workers. Figure 8.7 shows two quite different activity
signatures, both of which end in completing an accepted task. The first signature
shows a long period of inactive seconds (i) followed by bursts of active seconds (a),
diiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
.............. 300 inactive seconds .............
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
iaaaaaaiiiiaaaaaaaiaiaaaaaaaaaaaaaaaaiaaiaaaaaaaaa
aaaaaaaaaaaaaaaiiiiiiiiiaaaaaaaaaaaaaaaiaaaiiiaaaa
aaaaaiiiiiiiiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaiaaaiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaiiiiiiiiiiiiiiiaiaiiiiiiiaaiiaiaaaaaaaaiiaaaaa
aaiiaiiaaaaaaaaaiaaaaaaaiiiiaaiaaaaaaaasbu
(a) 688 Second Activity Signature
daaaaiaaaaaiaaaaaaaaaaaaiiiiiiaaaaaaaiiiaaaaaaaaaa
aaaaiiiiaaiiiaaaiiiiaaaaaiaaaaiaaaaaiiaaaaiiiasabu
(b) 96 Second Activity Signature
Figure 8.7: Two activity signatures showing different profiles for completing a task. Key: a=activity, i=inactivity, d=DOM load, s=submit, b=beforeunload, u=unload.
while the second signature shows a short period of mostly active seconds. One might
prefer one work completion style or the other for particular tasks.
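Both the signature parsing and the correlation computation above can be sketched in a few lines. Treating only the 'a' and 'i' characters of a signature as seconds (and the d/s/b/u markers as instantaneous events) is our simplifying assumption here.

```python
import math

def signature_seconds(sig):
    """Active and total seconds from an activity signature string,
    reading 'a' as an active second and 'i' as an inactive second."""
    active = sig.count("a")
    return active, active + sig.count("i")

def pearson(xs, ys):
    """Textbook Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applied to per-worker (active, total) pairs, `pearson` reproduces the kind of correlation reported above; the high value reflects that most workers are active for a roughly fixed fraction of their total time.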
8.7 Related Work
Our Turkalytics system, and the results presented, are related to four main areas
of work: human computation systems, analytics, Mechanical Turk demographic re-
search, and general Mechanical Turk work. With respect to human computation
systems, a number have been recently developed [6], [10], [27], [49]. Our intent is for
our tool to improve such systems and make building them easier. With respect to
analytics, numerous analytics tools (e.g., [5]) exist in industry, though there does not
appear to be a great deal of work in the academic literature about such tools. With
respect to demographics, independent work by Ipeirotis [41], [42] and Ross et al. [61]
used worker surveys to illustrate the changing demographics of Mechanical Turk over
time. (Section 8.5 more or less validates these previous results, as well as adding a
more recent data point.) With respect to general Mechanical Turk research, the most
common focuses to date have been conducting controlled experiments [46] and per-
forming data annotation in areas like natural language processing [67], information
retrieval [12], and computer vision [69].
8.8 Conclusion
Turkalytics gathers data about workers completing human computation tasks. We
envision Turkalytics as part of a broader system, in particular a system like HPROC
(Chapter 7) implementing the Human Processing model (Chapter 6). However, one
big advantage of our design for Turkalytics is that it is not tied to any one system.
Turkalytics enables both code sharing among systems (systems need not reimplement
worker monitoring code) and data sharing among systems (requesters benefit from
data gathered from other requesters). Our contributions include interaction and data
models, implementation details, and findings about both our system architecture and
the popular Mechanical Turk marketplace. We showed that our system was scalable to
more than 100,000 requests/day. We also verified previous demographic data about
the Turk, and presented some findings about location and interaction that are unique
to our tool. Overall, Turkalytics is a novel and practical tool for human computation
that has already seen production use.
Chapter 9
Conclusion
Over the course of the previous chapters, the development of this thesis has mirrored
the chronological development of microtasks. We began with social bookmarking
systems. Social bookmarking systems were one of the first places tags appeared, as
an adaptation to help organize a corpus that had grown too large for more classi-
cal annotators like librarians. We then looked at social cataloging systems. Social
cataloging systems and other types of more niche tagging systems were developed as
it became well understood that tags and other microtasks could be applied across a
wide variety of systems, not just bookmarking and multimedia systems. Finally, we
looked at paid microtasks. Once the power of microtasks to produce common web
features (ratings, image classification) was well understood, it made sense to design
systems that could produce microtasks in a repeatable way.
While our study happens to be chronological, it also covers microtasks at multiple
levels of detail. Overall, each of the places we studied microtasks gave us greater
insight into our research goal of better understanding the possibilities and limitations
of microtasks. Our study of social bookmarking uncovered a number of limitations
to unpaid microtasks (redundancy, lack of control), while our usage of the HPROC
system illustrated a vast number of future possibilities (human algorithms). Along
the way, we were also able to develop useful techniques (tag prediction, paid tagging,
and a methodology for human programming) for supplementing or controlling
microtasks.
We summarize the highest level findings below.
9.1 Summary
Chapter 2 looked at social bookmarking as it relates to arguably the most important
application on the web: web search. Our dataset, which led to a great deal of follow-
on work (e.g., [38], [60], [48]), was collected in a methodologically sound way, as well
as being one of the biggest crawls of a social bookmarking site. It thus allowed us
to ask a comprehensive set of questions at a scale where we would not have to worry
about sampling bias invalidating our findings.
Ultimately, we found that tags are often (though not always!) redundant in the
context of social bookmarking. Tags are commonly in the HTML title tag, and it is
also often the case that tags apply to whole domains, rather than simply the URL
being annotated. However, we did find that URLs posted to social bookmarking
systems were often useful. Such URLs tended to be new or recently modified, as
well as commonly returned in the results of web search queries. While this chapter
presented one of the largest studies of a textual tagging site specifically, we also believe
that our findings generalize to more recent systems. In particular, systems continue
to be developed which center around users saving and sharing their favorite URLs,
and interest in such systems (e.g., Twitter, Facebook “likes”) continues to grow.
Chapter 3 looked at tag prediction for social bookmarking systems. The reason
for studying tag prediction was essentially two-fold. First, we wanted to understand
the predictability of tags (based on the objects annotated as well as based on other
tags) in an abstract sense, in order to better understand tags themselves. Second,
we wanted to be able to enhance tagging systems with features that required tag
prediction, ranging from bootstrapping to system suggestion to tag disambiguation.
We proposed and evaluated tag prediction from two different perspectives. The
first perspective looked specifically at the features of the URLs available on social
bookmarking systems like del.icio.us, including features like page text, anchor text,
and surrounding domains. We showed that support vector machines were effective
for prediction in this case. We also proposed two measures, frequency and entropy,
that are correlated with how predictable a tag will be. The second perspective looked
only at predicting tags using other tags. By using only tags, our methods can work
on any tagging system, rather than only social bookmarking systems. We were able
to perform tag to tag prediction with association rules (market basket data mining),
which are both efficient and interpretable by humans.
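The tag-to-tag perspective can be sketched by treating each post's tag set as a market basket and mining pairwise rules with support and confidence. The function name and toy posts below are illustrative, and real mining would handle larger itemsets and far larger datasets:

```python
from collections import Counter
from itertools import permutations

def mine_tag_rules(posts, min_support=0.2, min_confidence=0.5):
    """Mine pairwise association rules (x -> y) from tag sets, market-basket style.
    support = P(x and y together); confidence = P(y | x)."""
    n = len(posts)
    single = Counter(t for post in posts for t in set(post))
    pair = Counter((x, y) for post in posts for x, y in permutations(set(post), 2))
    rules = []
    for (x, y), cnt in pair.items():
        support = cnt / n
        confidence = cnt / single[x]
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules

posts = [
    {"python", "programming"},
    {"python", "programming", "web"},
    {"photography"},
    {"python"},
]
# e.g. the rule programming -> python holds with support 0.5 and confidence 1.0
```

Because each rule is just "users who apply x often also apply y", the output is directly interpretable by humans, which is one reason we favored association rules over opaque models for this task.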
Chapter 4 marked a transition from studying social bookmarking systems to study-
ing social cataloging systems. In social bookmarking systems like del.icio.us, it is
difficult to determine if users are performing microtasks effectively because we lack a
gold standard for comparison. We analyzed whether tags were redundant given infor-
mation like page text and anchor text, but we had no real way to determine if a tag
was intrinsically good. By contrast, social cataloging systems, where users tag books,
have objects which are annotated with library terms by trained librarians. This fact
allowed us to compare tags to another form of organization by treating library terms
as a gold standard.
In a sense, tagging represents a cheap, non-expert model for annotation, in con-
trast to an expensive, expert model in the form of established library science. We
found that tagging actually had many of the features of the expert-annotated library
terms. In particular, we found that tagging was usually consistent, high quality, and
complete. In terms of consistency, we found that tags could be federated across
tagging systems, in our case between the LibraryThing and Goodreads systems. (This
was because tags, and their usage, were similar across systems.) In terms of quality,
we found that medium-frequency tags, as well as paid tags, were competitive with
library terms in side-by-side comparisons. We also found that neither synonymy nor
low-quality tag types were common. In terms of completeness, we found that, at
least for highly tagged objects, tags tended to have good coverage of existing library
terms.
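The completeness comparison can be illustrated with a simple coverage measure: the fraction of expert library terms matched by at least one user tag. The exact token match below is a simplification (real term matching would need normalization well beyond this), and the book data is made up:

```python
def coverage(tags, library_terms):
    """Fraction of expert library terms matched (case-insensitively) by some tag."""
    tagset = {t.lower() for t in tags}
    matched = sum(1 for term in library_terms if term.lower() in tagset)
    return matched / len(library_terms) if library_terms else 1.0

tags = ["fantasy", "fiction", "tolkien", "epic"]
library_terms = ["Fantasy", "Fiction"]
# Both library terms are covered by the tag set, so coverage is 1.0.
```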
Chapter 5 dropped the assumption of Chapter 4 that library data represented
a gold standard. Instead, we simply compared the library terms and tags with no
particular preference for one or the other. We found that by and large, experts appear
to choose good terms for organizing data. However, we found, on the basis of tagging,
that experts tend to annotate different objects with those terms than regular users.
Chapter 6 began a three-chapter sequence on microtasks in general, and on our
Human Processing model in particular. We introduced the model using an extended
example, and contrasted it to two other major models: the Basic Buyer and Game
Maker models. The Human Processing model focuses on modularity and reuse, so
that programmers do not have to constantly reinvent the wheel when writing human
programs. The model also introduces novel features, like the recruiter, which is a
concept meant to make it easier for researchers to compare human algorithms in a
controlled manner.
Chapter 7 discussed our implementation of the Human Processing model in the
form of our HPROC system. HPROC is a large system (over ten thousand lines of
code) with a number of useful features. It has a novel execution model, allowing the
programmer to mix crash-and-rerun and web execution. It supports cross-hprocess
function calls and simple memoization, which are necessary to make crash-and-rerun
easier to work with. It supports recruiters, as required by the Human Processing
model, and it provides a full Mechanical Turk API.
We illustrated the usage of our HPROC system with a case study on human
algorithms for sorting. In particular, we looked at two algorithms: a variation of
Merge-Sort (H-Merge-Sort) and a variation of Quick-Sort (H-Quick-Sort).
We also looked at variations of interfaces to support these algorithms, in particular,
comparing binary and ranked comparison interfaces. This case study shows the prac-
ticality of the Human Processing model for evaluating human algorithms and the
HPROC system for developing them. It also shows the importance that interfaces
play in human algorithms.
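The structure of H-Merge-Sort can be sketched as an ordinary merge sort whose comparison is a pluggable function; in a human algorithm, that function would post a binary-comparison task to workers and wait for the answer. The sketch below substitutes a machine comparator and is not HPROC code:

```python
def h_merge_sort(items, compare):
    """Merge sort whose comparisons are delegated to compare(a, b), which
    returns True if a should come before b. In a human algorithm this call
    would be backed by a binary-comparison task shown to workers."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = h_merge_sort(items[:mid], compare)
    right = h_merge_sort(items[mid:], compare)
    merged = []
    while left and right:
        if compare(left[0], right[0]):
            merged.append(left.pop(0))
        else:
            merged.append(right.pop(0))
    return merged + left + right

# A machine comparator stands in here for the human task interface.
print(h_merge_sort([3, 1, 2], lambda a, b: a <= b))  # prints [1, 2, 3]
```

Keeping the comparison behind a function boundary is what lets the same algorithm be evaluated with different interfaces (e.g., binary versus ranked comparisons), as in the case study.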
Chapter 8 went one step further in making the Human Processing model a pow-
erful, practical model. Human Processing relies on quality recruiters, and recruiters
in turn require a state model of the marketplaces where they work, as well as a strat-
egy for choosing actions based on that state. As a result, we developed an analytics
tool for gathering state about the Mechanical Turk marketplace. This analytics tool,
called Turkalytics, allowed us to better understand the state of the marketplace by
observing a number of tasks that researchers at Stanford posted on the Mechanical
Turk. In addition to engineering a system which was robust to significant load, we
defined a model for workers in Mechanical Turk-like systems which allowed us to map
which actions workers commonly take.
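A recruiter's state model can be built from logged worker events. The sketch below tallies state transitions per worker; the state names and sample events are illustrative, not Turkalytics' actual model or data:

```python
from collections import Counter

def transition_counts(events):
    """Count state transitions per worker from an ordered stream of
    (worker_id, state) events, pooled across workers."""
    last = {}
    counts = Counter()
    for worker, state in events:
        if worker in last:
            counts[(last[worker], state)] += 1
        last[worker] = state
    return counts

events = [
    ("w1", "preview"), ("w2", "preview"), ("w1", "accept"),
    ("w1", "submit"), ("w2", "abandon"),
]
# e.g. counts[("preview", "accept")] == 1
```

Aggregating such transition counts over many tasks is one way to map which actions workers commonly take, and hence to inform a recruiter's strategy.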
9.2 Future Work
The future of microtasks looks very bright. The number of Internet-connected users
keeps growing, and there are no signs that they are losing interest in microtasks.
Today, Twitter gets tens of millions of “tweets” each day, and innumerable Facebook
users promote URLs by “liking” them. Researchers continue to better understand
unpaid microtasks like tags, ratings, tweets and likes. Meanwhile, there is an increas-
ing drive, especially recently, to build systems based on paid microtasks like those of
Mechanical Turk. This thesis has not answered all of the questions involved in unpaid
and paid microtasks, but we hope it has laid the groundwork for a great deal of future
work. We conclude by discussing potential future work in the areas of unpaid and
paid microtasks, respectively.
Probably the biggest future opportunities in unpaid microtasks will be in new
services that capture the imaginations of millions of users. If forced to predict, we
expect services based on realtime information or geography to produce a variety of
donated unpaid microtasks in the near future. For example, services like foursquare
currently collect large amounts of realtime location information from users, together
with advice about those locations (i.e., unpaid microtasks collecting data about the
real world). This data is already being mined in interesting ways in the aggregate.
However, it is hard to predict which such services will succeed, both because they are
often very simple and because they usually depend highly on network effects.
Furthermore, for researchers, it is usually the aggregation of a multitude of
microtasks, rather than any particular one, that makes unpaid microtasks interesting.
In the specific case of tagging, the challenges continue to be in two major areas:
enhancing tagging interfaces, and better post hoc analysis of tagging data. For ex-
ample, in terms of enhancing tagging interfaces, many tagging systems now include
interfaces for suggesting tags to users as they are annotating a particular object. We
described in Chapter 3 how to predict tags, but users might often prefer a non-obvious
tag suggestion from such an interface. In terms of post hoc tagging analysis,
there is still a great need for better tools for tag clouds, taxonomy generation, and
similar aggregate understanding of tags. For example, it is difficult for a user to find
what they are looking for in a tag cloud of even one hundred tags, while systems often
contain millions of tags. However, improving upon tag clouds will require better al-
gorithms for understanding how tags are related, as well as grouping tags by purpose
and usage. The heavy interest in our early paper [34] on creating taxonomies out of
tags neatly illustrates the desire in this area for solutions to the problem of organizing
tags.
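One simple direction, in the spirit of our early taxonomy work, is to grow a tree greedily from tag co-occurrence: visit tags from most to least frequent and attach each to the already placed tag it co-occurs with most. The sketch below, with toy data, is only a rough illustration of that family of algorithms, not the algorithm of [34]:

```python
from collections import Counter, defaultdict

def cooccurrence_taxonomy(posts):
    """Greedy taxonomy sketch: visit tags from most to least frequent and
    attach each to the already placed tag it co-occurs with most; tags with
    no such neighbor become roots (parent None)."""
    freq = Counter(t for post in posts for t in post)
    co = defaultdict(Counter)
    for post in posts:
        for a in post:
            for b in post:
                if a != b:
                    co[a][b] += 1
    parent = {}
    placed = []
    for tag, _ in freq.most_common():
        best = max(placed, key=lambda p: co[tag][p], default=None)
        parent[tag] = best if best is not None and co[tag][best] > 0 else None
        placed.append(tag)
    return parent

posts = [
    {"programming", "python"},
    {"programming", "python", "django"},
    {"programming", "java"},
]
# "programming" becomes a root; "python" and "java" attach under it.
```

Better grouping of tags by purpose and usage would require richer relatedness measures than raw co-occurrence, which is exactly the open problem noted above.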
Meanwhile, the area of paid microtasks is wide open. Studying paid microtasks
overlaps with many other areas, like economics (for pricing), human-computer
interaction (for interfacing with humans), and operations research (for designing
workflows). However, many of the challenges in the area seem to be unique to paid
microtasks themselves, and previously unstudied. We envision much better systems
for programming and developing human programs and human algorithms. While
HPROC does support and simplify many aspects of human programming, two places
it could be improved are in debugging long-running programs and designing interfaces
for workers. We also envision a variety of human algorithms, including analogues of
many classical algorithms, like sorting, clustering, translation, and classification. These
human algorithms may mix and match human and machine processing to take ad-
vantage of the strengths of each. As marketplaces improve, we hope that better
recruiters will evolve that take advantage of better pricing models, and we suspect
that the best tasks for workers may start to look like games. And finally, we hope
that programming tools for microtasks will eventually filter down to the point where
regular users, rather than just computer scientists, can produce useful work using
many human microtasks.
Bibliography
[1] http://www.mturk.com/.
[2] http://getgambit.com/.
[3] http://www.livework.com/.
[4] http://www.imagemagick.org/.
[5] http://www.google.com/analytics/.
[6] http://crowdflower.com/.
[7] http://requester.mturk.com/.
[8] http://s3.amazonaws.com/.
[9] http://www.maxmind.com/app/geolitecity.
[10] http://www.smartsheet.com/.
[11] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining Association Rules
Between Sets of Items in Large Databases. SIGMOD Record, 22:207–216, June
1993.
[12] Omar Alonso and Stefano Mizzaro. Can We Get Rid of TREC Assessors? Using
Mechanical Turk for Relevance Assessment. In SIGIR ’09 Workshop on the
Future of IR Evaluation.
[13] Melanie Aurnhammer, Peter Hanappe, and Luc Steels. Integrating Collaborative
Tagging and Emergent Semantics for Image Retrieval. In Collaborative Web
Tagging Workshop (WWW’06).
[14] Shenghua Bao, Guirong Xue, Xiaoyuan Wu, Yong Yu, Ben Fei, and Zhong Su.
Optimizing Web Search Using Social Annotations. In Proceedings of the 16th
International Conference on World Wide Web, WWW ’07, pages 501–510, New
York, NY, USA, 2007. ACM.
[15] William B. Cavnar and John M. Trenkle. N-Gram-Based Text Categorization.
Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and
Information Retrieval, pages 161–175, 1994.
[16] Soumen Chakrabarti, Byron Dom, and Piotr Indyk. Enhanced Hypertext Cate-
gorization Using Hyperlinks. In Proceedings of the 1998 ACM SIGMOD Interna-
tional Conference on Management of Data, SIGMOD ’98, pages 307–318, New
York, NY, USA, 1998. ACM.
[17] Hsinchun Chen. Collaborative Systems: Solving the Vocabulary Problem. IEEE
Computer, Special Issue on Computer Supported Cooperative Work (CSCW),
27(5):58–66, 1994.
[18] Ed H. Chi and Todd Mytkowicz. Understanding the Efficiency of Social Tag-
ging Systems Using Information Theory. In Proceedings of the Nineteenth ACM
Conference on Hypertext and Hypermedia, HT ’08, pages 81–88, New York, NY,
USA, 2008. ACM.
[19] Lydia B. Chilton, John J. Horton, Robert C. Miller, and Shiri Azenkot. Task
Search in a Human Computation Market. In Proceedings of the ACM SIGKDD
Workshop on Human Computation, HCOMP ’10, pages 1–9, New York, NY,
USA, 2010. ACM.
[20] Maarten Clements, Arjen P. de Vries, and Marcel J.T. Reinders. Detecting Syn-
onyms in Social Tagging Systems to Improve Content Retrieval. In Proceedings
of the 31st Annual International ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, SIGIR ’08, pages 739–740, New York, NY,
USA, 2008. ACM.
[21] Nick Craswell, David Hawking, and Stephen Robertson. Effective Site Finding
Using Link Anchor Information. In Proceedings of the 24th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR ’01, pages 250–257, New York, NY, USA, 2001. ACM.
[22] Anirban Dasgupta, Arpita Ghosh, Ravi Kumar, Christopher Olston, Sandeep
Pandey, and Andrew Tomkins. The Discoverability of the Web. In Proceedings
of the 16th International Conference on World Wide Web, WWW ’07, pages
421–430, New York, NY, USA, 2007. ACM.
[23] Christine DeZelar-Tiedman. Doing the LibraryThing in an Academic Library
Catalog. Metadata for Semantic and Social Applications, page 211.
[24] Mary Dykstra. LC Subject Headings Disguised as a Thesaurus. Library Journal,
113(4):42–46, 1988.
[25] Nadav Eiron and Kevin S. McCurley. Analysis of Anchor Text for Web Search.
In Proceedings of the 26th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR ’03, pages 459–460,
New York, NY, USA, 2003. ACM.
[26] Nadav Eiron, Kevin S. McCurley, and John A. Tomlin. Ranking the Web Fron-
tier. In Proceedings of the 13th International Conference on World Wide Web,
WWW ’04, pages 309–318, New York, NY, USA, 2004. ACM.
[27] Donghui Feng. Talk: Tackling ATTi Business Problems Using Mechanical Turk.
Palo Alto Mechanical Turk Meetup, 2010.
[28] George W. Furnas, Thomas K. Landauer, Louis M. Gomez, and Susan T. Du-
mais. The Vocabulary Problem in Human-System Communication. Communi-
cations of the ACM, 30:964–971, November 1987.
[29] Evgeniy Gabrilovich and Shaul Markovitch. Text Categorization with Many
Redundant Features: Using Aggressive Feature Selection to Make SVMs Com-
petitive with C4.5. In Proceedings of the Twenty-first International Conference
on Machine Learning, ICML ’04, pages 41–, New York, NY, USA, 2004. ACM.
[30] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness
Using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th
International Joint Conference on Artifical Intelligence, pages 1606–1611, San
Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[31] Scott A. Golder and Bernardo A. Huberman. Usage Patterns of Collaborative
Tagging Systems. Journal of Information Science, 32:198–208, April 2006.
[32] Taher H. Haveliwala, Aristides Gionis, Dan Klein, and Piotr Indyk. Evaluating
Strategies for Similarity Search on the Web. In Proceedings of the 11th Inter-
national Conference on the World Wide Web, WWW ’02, pages 432–442, New
York, NY, USA, 2002. ACM.
[33] Paul Heymann and Hector Garcia-Molina. Contrasting Controlled Vocabulary
and Tagging: Experts Choose the Right Names to Label the Wrong Things. In
WSDM ‘09 Late Breaking Results.
[34] Paul Heymann and Hector Garcia-Molina. Collaborative Creation of Communal
Hierarchical Taxonomies in Social Tagging Systems. 2006.
[35] Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina. Fighting Spam
on Social Web Sites: A Survey of Approaches and Future Challenges. IEEE
Internet Computing, 11:36–45, November 2007.
[36] Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina. Can Social Book-
marking Improve Web Search? In Proceedings of the International Conference
on Web Search and Web Data Mining, WSDM ’08, pages 195–206, New York,
NY, USA, 2008. ACM.
[37] Paul Heymann, Andreas Paepcke, and Hector Garcia-Molina. Tagging Human
Knowledge. In Proceedings of the Third ACM International Conference on Web
Search and Data Mining, WSDM ’10, pages 51–60, New York, NY, USA, 2010.
ACM.
[38] Paul Heymann, Daniel Ramage, and Hector Garcia-Molina. Social Tag Predic-
tion. In Proceedings of the 31st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR ’08, pages 531–538,
New York, NY, USA, 2008. ACM.
[39] H. Hofmann and M. Theus. Interactive Graphics for Visualizing Conditional
Distributions. Unpublished Manuscript, 2005.
[40] Jürgen Hummel. Linked Bar Charts: Analysing Categorical Data Graphically.
Computational Statistics, 11(1):23–34, 1996.
[41] Panos Ipeirotis. Mechanical Turk: The Demographics. http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html.
[42] Panos Ipeirotis. The New Demographics of Mechanical Turk. http://behind-the-enemy-lines.blogspot.com/2010/03/new-demographics-of-mechanical-turk.html.
[43] Thorsten Joachims. Making Large-scale Support Vector Machine Learning Prac-
tical, pages 169–184. MIT Press, Cambridge, MA, USA, 1999.
[44] Thorsten Joachims. A Support Vector Method for Multivariate Performance
Measures. In Proceedings of the 22nd International Conference on Machine
Learning, ICML ’05, pages 377–384, New York, NY, USA, 2005. ACM.
[45] Karen S. Jones and C. J. van Rijsbergen. Information Retrieval Test Collections.
Journal of Documentation, 32(1):59–75, 1976.
[46] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing User Studies with
Mechanical Turk. In Proceedings of the Twenty-sixth Annual SIGCHI Conference
on Human Factors in Computing Systems, CHI ’08, pages 453–456, New York,
NY, USA, 2008. ACM.
[47] Georgia Koutrika, Frans Adjie Effendi, Zoltan Gyongyi, Paul Heymann, and
Hector Garcia-Molina. Combating Spam in Tagging Systems. In Proceedings
of the 3rd International Workshop on Adversarial Information Retrieval on the
Web, AIRWeb ’07, pages 57–64, New York, NY, USA, 2007. ACM.
[48] Georgia Koutrika, Frans Adjie Effendi, Zoltan Gyongyi, Paul Heymann, and
Hector Garcia-Molina. Combating Spam in Tagging Systems: An Evaluation.
ACM Transactions on the Web (TWEB), 2:22:1–22:34, October 2008.
[49] Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. TurKit:
Tools for Iterative Tasks on Mechanical Turk. In Proceedings of the ACM
SIGKDD Workshop on Human Computation, HCOMP ’09, pages 29–30, New
York, NY, USA, 2009. ACM.
[50] Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. Exploring
Iterative and Parallel Human Computation Processes. In Proceedings of the ACM
SIGKDD Workshop on Human Computation, HCOMP ’10, pages 68–76, New
York, NY, USA, 2010. ACM.
[51] Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. TurKit:
Human Computation Algorithms on Mechanical Turk. In Proceedings of the 23rd
Annual ACM Symposium on User Interface Software and Technology, UIST ’10,
pages 57–66, New York, NY, USA, 2010. ACM.
[52] Thomas Mann. Library Research Models: A Guide to Classification, Cataloging,
and Computers. Oxford University Press, USA, 1993.
[53] Thomas Mann. The Oxford Guide to Library Research. Oxford University Press,
USA, 2005.
[54] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction
to Information Retrieval. Cambridge University Press, New York, NY, USA,
2008.
[55] Cameron Marlow, Mor Naaman, Danah Boyd, and Marc Davis. HT06, tagging
paper, taxonomy, Flickr, academic article, to read. In Proceedings of the Sev-
enteenth Conference on Hypertext and Hypermedia, HYPERTEXT ’06, pages
31–40, New York, NY, USA, 2006. ACM.
[56] Gilad Mishne. AutoTag: A Collaborative Approach to Automated Tag Assign-
ment for Weblog Posts. In Proceedings of the 15th International Conference on
World Wide Web, WWW ’06, pages 953–954, New York, NY, USA, 2006. ACM.
[57] Steffen Oldenburg, Martin Garbe, and Clemens Cap. Similarity Cross-analysis
of Tag / Co-tag Spaces in Social Classification Systems. In Proceedings of the
2008 ACM Workshop on Search in Social Media, SSM ’08, pages 11–18, New
York, NY, USA, 2008. ACM.
[58] Greg Pass, Abdur Chowdhury, and Cayley Torgeson. A Picture of Search. In
Proceedings of the 1st International Conference on Scalable Information Systems,
InfoScale ’06, New York, NY, USA, 2006. ACM.
[59] Gregory Piatetsky-Shapiro. Discovery, Analysis, and Presentation of Strong
Rules. In G. Piatetsky-Shapiro and W.J. Frawley, editors, Knowledge Discovery
in Databases. AAAI/MIT Press, Cambridge, MA, 1991.
[60] Daniel Ramage, Paul Heymann, Christopher D. Manning, and Hector Garcia-
Molina. Clustering the Tagged Web. In Proceedings of the Second ACM Inter-
national Conference on Web Search and Data Mining, WSDM ’09, pages 54–63,
New York, NY, USA, 2009. ACM.
[61] Joel Ross, Lilly Irani, M. Six Silberman, Andrew Zaldivar, and Bill Tomlinson.
Who are the Crowdworkers?: Shifting Demographics in Mechanical Turk. In
Proceedings of the 28th of the International Conference Extended Abstracts on
Human Factors in Computing Systems, CHI EA ’10, pages 2863–2872, New York,
NY, USA, 2010. ACM.
[62] Christoph Schmitz, Andreas Hotho, Robert Jäschke, and Gerd Stumme. Mining
Association Rules in Folksonomies. IFCS’06.
[63] Eric Schwarzkopf, Dominik Heckmann, Dietmar Dengler, and Alexander Kröner.
Mining the Structure of Tag Spaces for User Modeling. In Workshop on Data
Mining for User Modeling (ICUM’07).
[64] Shilad Sen, Shyong K. Lam, Al Mamunur Rashid, Dan Cosley, Dan Frankowski,
Jeremy Osterhouse, F. Maxwell Harper, and John Riedl. tagging, communities,
vocabulary, evolution. In Proceedings of the 2006 20th Anniversary Conference
on Computer Supported Cooperative Work, CSCW ’06, pages 181–190, New York,
NY, USA, 2006. ACM.
[65] David Sifry. State of the Live Web: April 2007. http://www.sifry.com/stateoftheliveweb/.
[66] Tiffany L. Smith. Cataloging and You: Measuring the Efficacy of a Folksonomy
for Subject Analysis. In Workshop of the ASIST SIG/CR ’07.
[67] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap
and Fast—But is it Good?: Evaluating Non-expert Annotations for Natural
Language Tasks. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, EMNLP ’08, pages 254–263, Morristown, NJ, USA,
2008. Association for Computational Linguistics.
[68] Sanjay Sood, Kristian Hammond, Sara Owsley, and Larry Birnbaum. TagAssist:
Automatic Tag Suggestion for Blog Posts. ICWSM’07.
[69] Alexander Sorokin and David Forsyth. Utility Data Annotation with Amazon
Mechanical Turk. In CVPRW’08.
[70] Luis von Ahn and Laura Dabbish. Designing Games with a Purpose. Commu-
nications of the ACM, 51:58–67, August 2008.
[71] Zhichen Xu, Yun Fu, Jianchang Mao, and Difu Su. Towards the Semantic
Web: Collaborative Tag Suggestions. In Collaborative Web Tagging Workshop
(WWW’06).
[72] Yusuke Yanbe, Adam Jatowt, Satoshi Nakamura, and Katsumi Tanaka. Can
Social Bookmarking Enhance Search in the Web? In Proceedings of the 7th
ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’07, pages 107–
116, New York, NY, USA, 2007. ACM.
[73] Yiming Yang and Jan O. Pedersen. A Comparative Study on Feature Selection in
Text Categorization. In Proceedings of the Fourteenth International Conference
on Machine Learning, ICML ’97, pages 412–420, San Francisco, CA, USA, 1997.
Morgan Kaufmann Publishers Inc.
[74] Yiming Yang, Sean Slattery, and Rayid Ghani. A Study of Approaches to Hy-
pertext Categorization. Journal of Intelligent Information Systems, 18:219–241,
March 2002.