TAGGING AND OTHER MICROTASKS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Paul Brian Heymann
January 2011
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/vb525jb6753
© 2011 by Paul Brian Heymann. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Hector Garcia-Molina, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jurij Leskovec
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Andreas Paepcke
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Over the past decade, the web has become increasingly participatory. Many web sites
would be non-functional without the contribution of many tiny units of work by users
and workers around the world. We call such tiny units of work microtasks. Microtasks
usually represent less than five minutes of someone’s time. However, microtasks
can produce massive effects when pooled together. Examples of microtasks include
tagging a photo with a descriptive keyword, rating a movie, or categorizing a product.
This thesis explores tagging systems, one of the first places where unpaid microtasks became common. Tagging systems allow regular users to annotate objects like URLs, photos, and videos with keywords ("tags"). We begin by looking at social bookmarking systems, tagging systems where users tag URLs. We consider whether
social bookmarking tags are useful for web search, finding that they often mirror other
available metadata. We also show that social bookmarking tags can be predicted to
varying degrees with two techniques: support vector machines and market basket
data mining.
To expand our understanding of tags, we look at social cataloging systems, tagging systems where users tag books. Social cataloging systems allow us to compare user-generated tags and expert library terms that were created in parallel. We find that tags have important features like consistency, quality, and completeness in common with expert library terms. We also find that paid tagging can be an effective supplement to a tagging system.
Finally, our work expands to all microtasks, rather than tagging alone. We propose
a framework called Human Processing for programming with and studying paid and
unpaid microtasks. We then develop a tool called HPROC for programming within
this framework, primarily on top of a paid microtask marketplace called Amazon
Mechanical Turk (AMT). Lastly, we describe Turkalytics, a system for monitoring workers completing paid microtasks on AMT.
We cover tagging from web search, machine learning, and library science perspectives, and work extensively with both the paid and unpaid microtasks that are becoming a fixture of the modern web.
Acknowledgments
This thesis would not exist without my advisor, Hector Garcia-Molina. Hector shares
everything with his advisees, and always has their best interests at heart. He has given
me the freedom and support to pursue varied interests, while insisting on technical
rigor and intellectual clarity along the way. He is a model advisor and great friend.
Aside from my advisor, I am indebted to my reading and orals committees, including Andreas Paepcke, Jure Leskovec, Jennifer Widom, and Ashish Goel. Their
comments and words have improved both this document and my time at Stanford.
I have been lucky to have had fruitful collaborations over the years. In chronological order, Georgia Koutrika, Dan Ramage, and Andreas Paepcke have been my primary co-authors and helped immensely with the tagging work that makes up Chapters 2–5. Georgia is an effective and delightful collaborator. Dan and I always seem
to have the same thoughts at the same time. Andreas’ enthusiasm is boundless.
Several other people have played key roles in chapters of this thesis. Chapter 2
benefited from both Zhichen Xu and Mark Lucovsky. Zhichen informed my understanding of tags and Mark provided infrastructure support in the form of millions of backlink queries. (Chapters 2 and 3 were also supported by an NSF Graduate Research Fellowship and the School of Engineering Finch Family Fellowship.) Chapters 4 and 5 would not have been possible without James Jacobs and Philip Schreur.
Among other things, James and Philip pointed me to the Scriblio MARC records
used in those chapters. Chapter 6 is illustrated by Caitlin Hogan. Chapter 7 was
clarified by discussion with Greg Little about execution models. Chapter 8 would not
exist without the encouragement of Aleksandra Korolova.
This work has benefited from interactions throughout the Gates Computer Science
building. In particular, members of the InfoLab, Artificial Intelligence, and Theory
groups have given me numerous insights into my work over the years. While there
are far too many people to name here, I would like to especially thank the members
of boot camp, the hack circle, wafflers, and various residents of Gates 424.
My academic career before Stanford benefited from a series of excellent mentors
at Duke and Harvard. At Duke, Alexander Hartemink introduced me to research
and served as an amazing research advisor. At Harvard, Barbara Grosz and Stuart
Shieber gave me advice and numerous opportunities within the Colored Trails project.
Jody Heymann, Cynthia LuBien, and Jenny Finkel taught me most of the key
insights for surviving and thriving in the Computer Science Ph.D. program.
Thanks to my wife, sister, parents, and the rest of my family for their continuous
support in all of my endeavors, wherever they take me.
Contents
Abstract
Acknowledgments
1 Introduction
  1.1 Overview: Social Bookmarking (Part I)
  1.2 Overview: Social Cataloging (Part II)
  1.3 Overview: Paid Microtasks (Part III)
  1.4 Research Contributions
2 Social Bookmarking and Web Search
  2.1 Social Bookmarking Terms and Notation
  2.2 Creating a Social Bookmarking Dataset
    2.2.1 Interfaces
    2.2.2 Realtime Processing Pipeline
    2.2.3 Datasets
    2.2.4 Tradeoffs
  2.3 Positive Factors
    2.3.1 URLs
    2.3.2 Tags
  2.4 Negative Factors
    2.4.1 URLs
    2.4.2 Tags
  2.5 Related Work
  2.6 Conclusion
3 Social Tag Prediction
  3.1 Tag Prediction Terms and Notation
  3.2 Creating a Prediction Dataset
  3.3 Two Tag Prediction Methods
    3.3.1 Tag Prediction Using Page Information
    3.3.2 Tag Prediction Using Tags
  3.4 Related Work
  3.5 Conclusion
4 Tagging Human Knowledge
  4.1 Social Cataloging Terms and Notation
    4.1.1 Library Terms
  4.2 Creating a Social Cataloging Dataset
  4.3 Experiments: Consistency
    4.3.1 Synonymy
    4.3.2 Cross-System Annotation Use
    4.3.3 Cross-System Object Annotation
    4.3.4 $-tag Annotation Overlap
  4.4 Experiments: Quality
    4.4.1 Objective, Content-based Groups
    4.4.2 Quality Paid Annotations
    4.4.3 Finding Quality User Tags
  4.5 Experiments: Completeness
    4.5.1 Coverage
    4.5.2 Recall
  4.6 Related Work
  4.7 Conclusion
5 Fallibility of Experts
  5.1 Notes on LCSH
  5.2 Experiments
    5.2.1 Syntactic Equivalence
    5.2.2 Rank Correlation of Syntactic Equivalents
    5.2.3 Expert/User Annotator Agreement
    5.2.4 Semantic Equivalence
  5.3 Conclusion
6 Human Processing
  6.1 Motivating Example
  6.2 Basic Buyer
  6.3 Game Maker
  6.4 Human Processing
  6.5 Discussion
7 Programming with HPROC
  7.1 HPROC Motivation
  7.2 Preliminaries: TurKit
  7.3 HPROC Subsystems
  7.4 HPROC Hprocesses
  7.5 HPROC Walkthrough
    7.5.1 Making a Remote Connection
    7.5.2 Uploading Code
    7.5.3 Introspection
    7.5.4 Hprocess Creation
    7.5.5 Polling
    7.5.6 Executable Environment
    7.5.7 Dispatch Handling
    7.5.8 Remote Function Calling
    7.5.9 Local Hprocess Instantiation
    7.5.10 Form Creation
    7.5.11 Form Parts
    7.5.12 Form Recruiting
  7.6 HPROC Walkthrough Summary
  7.7 Case Study
    7.7.1 Stanford University Shoe Dataset 2010
    7.7.2 Sorting Task
    7.7.3 Comparison Interfaces
  7.8 H-Merge-Sort
    7.8.1 Classical Merge-Sort
    7.8.2 Convenience Functions
    7.8.3 H-Merge-Sort Overview
    7.8.4 H-Merge-Sort Functions
    7.8.5 H-Merge-Sort Walkthrough
  7.9 H-Quick-Sort
    7.9.1 Classical Quick-Sort
    7.9.2 H-Quick-Sort Overview
    7.9.3 H-Quick-Sort Functions
    7.9.4 H-Quick-Sort Walkthrough
  7.10 Human Algorithm Evaluation
  7.11 Case Study Evaluation
    7.11.1 H-Merge-Sort Interfaces
    7.11.2 H-Quick-Sort Median Pivot
    7.11.3 H-Merge-Sort versus H-Quick-Sort
    7.11.4 Complete Data Table
  7.12 Conclusion
8 Worker Monitoring with Turkalytics
  8.1 Worker Monitoring Terms and Notation
    8.1.1 Interaction Model
    8.1.2 Data Model
  8.2 Implementation
    8.2.1 Client-Side JavaScript
    8.2.2 Log Server
    8.2.3 Analysis Server
    8.2.4 Design Choices
  8.3 Requester Usage
    8.3.1 Installation
    8.3.2 Reporting
  8.4 Results: System Aspects
    8.4.1 Client
    8.4.2 Logging Server
    8.4.3 Analysis Server
  8.5 Results: Worker Aspects
  8.6 Results: Activity Aspects
    8.6.1 What States/Actions Occur in Practice?
    8.6.2 When Do Previews Occur?
    8.6.3 Does Activity Help?
  8.7 Related Work
  8.8 Conclusion
9 Conclusion
  9.1 Summary
  9.2 Future Work
Bibliography
List of Tables
2.1 Top tags and their rank as terms in AOL queries.
2.2 This example lists the five hosts in Dataset C with the most URLs annotated with the tag java.
2.3 Average accuracy for different values of τ.
3.1 The top 15 tags account for more than 1/3 of top 100 tags added to URLs after the 100th bookmark. Most are relatively ambiguous and personal. The bottom 15 tags account for very few of the top 100 tags added to URLs after the 100th bookmark. Most are relatively unambiguous and impersonal.
3.2 Association Rules: A selection of the top 30 tag pair association rules. All of the top 30 rules appear to be valid; these rules are representative.
3.3 Association Rules: A random sample of association rules of length ≤ 3 and support > 1000.
3.4 Association Rules: Tradeoffs between number of original sampled bookmarks, minimum confidence, and resulting tag expansions.
3.5 Association Rules: Tradeoffs between number of original sampled bookmarks, minimum confidence, estimated precision, and actual precision.
3.6 Association Rules: Tradeoffs between number of original sampled bookmarks, minimum confidence, recall, and precision.
4.1 Tag types for top 2000 LibraryThing and top 1000 GoodReads tags as percentages.
4.2 Basic statistics for the mean h-score assigned by evaluators to each annotation type. Mean (µ) and standard deviation (SD) are abbreviated.
4.3 Basic statistics for the mean h-score assigned to a particular annotation type with user tags split by frequency. Mean (µ) and standard deviation (SD) are abbreviated.
4.4 Randomly sampled containment and equivalence relationships for illustration.
4.5 Dewey Decimal Classification coverage by tags.
5.1 Sampled (ti, lj) pairs with Wikipedia ESA values.
7.1 The code descriptors table within the MySQL database in the HPROC system, after walkthroughscript.py has been introspected. Some columns have been removed, edu.stanford.thesis has been abbreviated to e.s.t, and default poll seconds has been abbreviated to "Poll (s)."
7.2 The process descriptors table within the MySQL database in the HPROC system, after a new hprocess with the edu.stanford.thesis.sa code descriptor of walkthroughscript.py has been created. Some columns have been removed, and edu.stanford.thesis has been abbreviated to e.s.t. The HPID is the process identifier for the hprocess.
7.3 The row of the variable storage table corresponding to the compareItems function call.
7.4 Comparison of different sorting strategies and interfaces. Sorting dataset is the Stanford University Shoe Dataset 2010. All runs done during the week of November 8th, 2010. Results listed are the mean over ten runs, with standard deviation in parentheses.
8.1 Top ten countries of Turkers (by number of workers). 2,884 workers, 8,216 IPs total.
8.2 The number of user agents, IP addresses, cookies, and views for top workers by page views.
List of Figures
1.1 Two interfaces to the del.icio.us social bookmarking system.
1.2 Two social cataloging systems.
1.3 Screenshot of a Mechanical Turk paid microtask for sorting photos.
2.1 Realtime Processing Pipeline: (1) shows where the post metadata is acquired, (2) and (4) show where the page text and forward link page text is acquired, and (3) shows where the backlink page text is acquired.
2.2 Number of times URLs had been posted and whether they appeared in the recent feed or not. Each increase in height in "Found URLs" is a single URL ("this URL") that was retrieved from a user's bookmarks and was found in the recent feed. Each increase in height in "Missing URLs" is a single URL ("this URL") that was retrieved from a user's bookmarks and was not found in the recent feed. "Combined" shows these two URL groups together.
2.3 Histograms showing the relative distribution of ages of pages in del.icio.us, Yahoo! Search results, and ODP.
2.4 Cumulative portion of del.icio.us posts covered by users.
2.5 How many times has a URL just posted been posted to del.icio.us?
2.6 A scatter plot of tag count versus query count for top tags and queries in del.icio.us and the AOL query dataset. r ≈ 0.18. For the overlap between the top 1000 tags and queries by rank, τ ≈ 0.07.
2.7 Posts per hour and comparison to Philipp Keller.
2.8 Details of Keller's post per hour data.
2.9 Host Classifier: The accuracy for the first 130 tags by rank for a host-based classifier.
3.1 Average new tags versus number of posts.
3.2 Tags in T100 in increasing order of predictability from left to right. "cool" is the least predictable tag; "recipes" is the most predictable tag.
3.3 When the rarity of a tag is controlled in 200/200, entropy is negatively correlated with predictability.
3.4 When the rarity of a tag is controlled in 200/200, occurrence rate is negatively correlated with predictability.
3.5 When the rarity of a tag is not controlled, in Full/Full, additional examples are more important than the vagueness of a tag, and more common tags are more predictable.
4.1 Synonym set frequencies. ("Frequency of Count" is the number of times synonym sets of the given size occur.)
4.2 Tag frequency versus synonym set size.
4.3 H(ti) (Top 2000, ≠ 0)
4.4 Distribution of same book similarities using Jaccard similarity over all tags.
4.5 Distribution of same book similarities using Jaccard similarity over the top twenty tags.
4.6 Distribution of same book similarities using cosine similarity over all tags.
4.7 Overlap Rate Distribution.
4.8 Conditional density plot [39] showing probability of (1) annotators agreeing a tag is objective, content-based, (2) annotators agreeing on another tag type, or (3) no majority of annotators agreeing.
4.9 Recall for 603 tags in the full dataset.
4.10 Recall for 603 tags in the "min100" dataset.
4.11 Jaccard for 603 tags in the full dataset.
5.1 Spinogram [40] [39] showing probability of an LCSH keyword having a corresponding tag based on the frequency of the LCSH keyword. (Log-scale.)
5.2 Symmetric Jaccard Similarity.
5.3 Asymmetric Jaccard Similarity.
5.4 Conditional density plot showing probability of a (ti, lj) pair meaning that (ti, lj) could annotate {none, few, some, many, almost all, all} of the same books according to human annotators based on Wikipedia ESA score of the pair.
5.5 Histogram of top Wikipedia ESA for missing LCSH and all tags.
6.1 Basic Buyer human programming environment. A human program generates forms. These forms are advertised through a marketplace. Workers look at posts advertising the forms, and then complete the forms for compensation.
6.2 Game Maker human programming environment. The programmer writes a human program and a game. The game implements features to make it fun and difficult to cheat. The human program loads and dumps data from the game.
6.3 Human Processing programming environment. HP is a generalization of BB and GM. It provides abstractions so that algorithms can be written, tasks can be defined, and marketplaces can be swapped out. It provides separation of concerns so that the programmer can focus on the current need, while the environment designer focuses on recruiting workers and designing tasks.
7.1 Graphical overview of the full HPROC system.
7.2 Shoes from the Stanford University Shoe Dataset 2010 blurred to varying degrees.
7.3 Two different human comparison interfaces.
7.4 Comparison of total cost of three variations of sorting.
7.5 Comparison of wall clock time for three variations of sorting.
7.6 Comparison of accuracy for three variations of sorting.
8.1 Search-Preview-Accept (SPA) model.
8.2 Search-Continue-RapidAccept-Accept-Preview (SCRAP) model.
8.3 Turkalytics data model (Entity/Relationship diagram).
8.4 Number of transitions between different states in our dataset. Note: These numbers are approximate and unload states are unlabeled.
8.5 Number of new previewers visiting three task groups over time.
8.6 Plot of average active and total seconds for each worker who completed the NER task.
8.7 Two activity signatures showing different profiles for completing a task. Key: a=activity, i=inactivity, d=DOM load, s=submit, b=beforeunload, u=unload.
Chapter 1
Introduction
Over the past two decades, the web has experienced explosive growth. There are now
over a billion people connected to the Internet and over a trillion web pages. This
rapid growth has led to huge challenges as well as huge opportunities. For instance,
how can we organize over a trillion web pages? How can we utilize the collective output of billions of Internet-connected users? This thesis takes steps towards answering
these important questions.
In particular, we investigate microtasks, which are tiny units of work performed
by humans usually lasting less than five minutes. Examples of microtasks include
tagging a photo with a descriptive keyword, rating a movie, or categorizing a product.
Microtasks hit a sort of sweet spot for the web. On one hand, microtasks are so short
that users and workers are often willing to perform them for cheap or free. On the
other hand, the sum of many microtasks can be a significant source of labor.
Chapters 2–5 focus on a specific type of microtask called tagging, while Chapters
6–8 focus on tools to program microtasks more generally. In a tagging system, regular
users annotate objects (they tag objects) with uncontrolled keywords (tags) of their
choosing. By contrast, library systems (i.e., libraries) only allow expert taxonomists
(rather than regular users) to annotate objects, and those objects may usually only
be annotated with terms from a controlled vocabulary that has been determined
beforehand.
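The contrast between the two annotation models can be sketched as a minimal data model. This is an illustration only; the class and method names below are hypothetical and not drawn from any system studied in this thesis.

```python
# Illustrative sketch only: names are hypothetical, not from the thesis.

class TaggingSystem:
    """Any user may attach any free-form keyword (tag) to any object."""

    def __init__(self):
        self.annotations = {}  # object id -> set of (user, tag) pairs

    def tag(self, user, obj, tag):
        # No vocabulary check and no user check: tags are uncontrolled.
        self.annotations.setdefault(obj, set()).add((user, tag))


class LibrarySystem:
    """Only expert annotators may apply terms from a fixed vocabulary."""

    def __init__(self, controlled_vocabulary, experts):
        self.vocabulary = set(controlled_vocabulary)
        self.experts = set(experts)
        self.annotations = {}  # object id -> set of terms

    def annotate(self, annotator, obj, term):
        if annotator not in self.experts:
            raise PermissionError("only expert taxonomists may annotate")
        if term not in self.vocabulary:
            raise ValueError("term not in the controlled vocabulary")
        self.annotations.setdefault(obj, set()).add(term)
```

The structural difference is just the two checks in annotate; the scaling and trust tradeoffs discussed below follow from that difference.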
Once enough users annotate objects with tags, patterns tend to emerge, even
(a) Tag Cloud (b) Query by Tag
Figure 1.1: Two interfaces to the del.icio.us social bookmarking system.
though the tags are from an uncontrolled vocabulary. In particular, some tags become
more or less popular, and some tags become commonly used to annotate different
types of objects. Users of a tagging system then browse the system using interfaces
designed to take advantage of these patterns. Two common interfaces for tagging
systems are shown in Figure 1.1. Figure 1.1(a) shows a tag cloud, an interface which
shows popular tags by increasing the size of the font for a tag based on its frequency.
Tag clouds give users an idea of what the most prevalent tags are within a tagging
system. Figure 1.1(b) shows a query by tag interface, which displays objects which
have been annotated with a particular tag. Figure 1.1(b) shows objects annotated
with the tag “thesis,” including a URL with the title “Useful Things to Know About
Ph.D. Thesis Research.”
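The font scaling behind a tag cloud is commonly implemented as an interpolation on log tag frequency. The sketch below illustrates that common approach; the exact formula del.icio.us used is not specified here, and the tag counts are invented.

```python
import math

def tag_cloud_sizes(tag_counts, min_px=10, max_px=32):
    """Map each tag's frequency to a font size in pixels.

    Log-scaling keeps a handful of very popular tags from dwarfing the
    rest, which matters given the heavy-tailed frequency distributions
    typical of tagging systems.
    """
    lo = math.log(min(tag_counts.values()))
    hi = math.log(max(tag_counts.values()))
    span = (hi - lo) or 1.0  # all counts equal -> avoid division by zero
    return {
        tag: round(min_px + (math.log(n) - lo) / span * (max_px - min_px))
        for tag, n in tag_counts.items()
    }

# Counts are invented for illustration.
sizes = tag_cloud_sizes({"thesis": 12, "linux": 480, "cool": 95})
```

Here the most frequent tag renders at the maximum size and the least frequent at the minimum, with the rest interpolated between them on a log scale.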
The rest of this chapter gives a high level overview of the three major parts of
this thesis.
Part I Social bookmarking systems (Section 1.1).
Part II Contrasting tagging and library systems (Section 1.2).
Part III Programming (paid) microtasks (Section 1.3).
Lastly, we summarize our research contributions (Section 1.4). (Note that we do not
include related work in this chapter, instead including it in each individual chapter.)
1.1 Overview: Social Bookmarking (Part I)
Part I begins our study of tagging by studying social bookmarking systems. Social
bookmarking systems are tagging systems where the specific type of object being
annotated is a URL. Social bookmarking systems were one of the first places that
tagging became popular. It makes sense to start our study of tagging and microtasks
with social bookmarking systems for two reasons.
The first reason for studying social bookmarking systems is that the challenges
faced by social bookmarking systems had a major impact on the evolution of tagging
systems. Specifically, systems like Yahoo! Directory, the Open Directory Project
(ODP), and del.icio.us all try to organize and classify URLs on the web. However,
Yahoo! Directory and ODP take a substantially different approach, using trusted
taxonomists and taxonomies to determine the organization of URLs, rather than
regular users. The Yahoo! Directory and ODP approach seems to have significant
scaling problems because expert, trusted labor is scarce and expensive. By contrast,
del.icio.us represents an alternative model for human labeling of vast numbers of
URLs with descriptive metadata. This alternative model can help solve the scaling
problem, but presents other challenges in terms of using non-expert, untrusted data
from regular users.
The second reason for studying social bookmarking systems is that such systems
are among the largest and most mature tagging systems today. Over the course of
nearly a decade, these systems have grown to the point where their users now tag
hundreds of thousands of URLs each day. The size and maturity of a tagging system
matters for our study because of the dependence of tagging systems on the uncontrolled tags contributed by regular users. At an early, smaller phase, this dependence
can mean that tagging systems are dominated by a few prolific users or by spam. By
studying social bookmarking, we see how late phase tagging systems work in the large,
rather than falling prey to the peculiarities of a given early stage tagging system.
Our study of social bookmarking in Part I is made up of Chapters 2 and 3. Both
chapters look at the del.icio.us social bookmarking system, specifically its relationship
to the web. Both chapters also rely on the size and maturity of del.icio.us to make
claims about social bookmarking as a whole.
Chapter 2 looks at a very specific, but very important, potential application of
social bookmarking systems: web search. Web search engines like Google depend
on page content, link structure, and query or clickthrough log data to provide users
with relevant retrieved results. All of these types of data are somewhat indirect
descriptions of web pages. By contrast, Chapter 2 asks whether the direct, human
annotated tags in social bookmarking systems can help in the task of web search. To
evaluate “helpfulness,” we consider various features of the URLs and tags posted to
del.icio.us, and we ask whether each is likely to provide additional information above
and beyond the data already available to web search engines.
Chapter 3 asks a more general question about social bookmarking systems: can
frequent tags in these systems be predicted? We attempt to predict tags based on
both data specific to social bookmarking systems (e.g., page text, anchor text) and
data general to all tagging systems (e.g., predicting tags based on other tags). For
example, can we predict the tag “linux” based on Linux related terms in the page
text of a URL? Can we predict the tag “linux” based on other tags, such as the tag
“debian” which refers to a specific Linux distribution?
Predictability of tags is both good and bad. If tags can be predicted successfully,
then tagging systems can be enhanced in various ways. For example, when a tagging
system is just getting started, a system owner might provide automatic tags produced
by a machine to make the system more useful at first. Even for a late phase tagging
system, tag prediction may help increase recall for interfaces like query by tag, because
often different users use different tags to mean the same thing. For example, in our
“debian” and “linux” example above, a modified query by tag might return a URL
labeled with only “debian” when a user queries for the tag “linux.” On the other
hand, if tags are predictable, users may not be adding any information to the system
when they tag objects.
Figure 1.2: Two social cataloging systems: (a) LibraryThing, (b) Goodreads.
1.2 Overview: Social Cataloging (Part II)
While social bookmarking systems are one of the most important applications of
tagging, they may not be the best place to study tagging itself. For instance, Chapter
3 asks whether tags can be predicted, but are the tags themselves any good? The
trouble is that the notion of “good” is quite subjective, depends on the objects being
annotated, and often takes a subject matter expert years to develop. Ideally, we
would compare against the ground truth produced by experts, but social bookmarking
systems do not really have a good source of ground truth. In fact, as we saw in the
last section, a major reason for the development of social bookmarking systems was
that it was so difficult for experts to annotate the web in a scalable way. Luckily,
there is a different type of tagging system called a social cataloging system where we
can evaluate tagging using ground truth from experts.
Part II expands our study of tagging to include social cataloging systems. So-
cial cataloging systems are tagging systems where the specific type of object being
annotated is a book. Figures 1.2(a) and 1.2(b) show web pages for the book “The
Indispensable Calvin and Hobbes” at the two social cataloging sites for which we have
data, LibraryThing and Goodreads. Social cataloging systems are a perfect place to
contrast tags to ground truth from experts. What makes social cataloging systems
perfect is that books are simultaneously annotated both with tags from regular users
and with library terms assigned by experts.
Libraries organize books into massive hierarchies called classifications, like the
Dewey Decimal Classification (DDC) and the Library of Congress Classification (LCC).
Do tags correlate with nodes at different levels of these hierarchies? For example, one
of the top level nodes in the LCC hierarchy is the node “medicine.” Do users fre-
quently annotate books with a “medicine” tag? Similarly, libraries have controlled
vocabularies consisting of predefined terms, like the Library of Congress Subject Head-
ings (LCSH). How are tags used in comparison to these controlled vocabularies?
Chapter 4 evaluates the tags in social cataloging systems by assuming that library
annotations like LCSH, LCC, and DDC are a gold standard. In particular, we argue
that library terms are consistent, complete, and uniformly high quality. (We define
the terms consistent, complete and high quality more explicitly in the chapter.) To
what degree are tags similar to these consistent, complete, and high quality library
terms?
Tags in LibraryThing and Goodreads were, in effect, donated by users of those sites.
One interesting aspect of our study of social cataloging is that we also develop the idea
of paid tagging. Paid tagging is a paid microtask where we pay workers to provide
tags for objects, rather than relying on the benevolence of users. Overall, this allows
us to compare three types of annotations: expert library terms, unpaid regular user
tags, and paid worker tags. In addition to informing the use of tagging systems, this
is one of the few places where one can compare unpaid microtasks (tags by users),
paid microtasks (paid tags), and classical “work” (annotations by experts).
Chapter 5 drops the assumption that data created by experts should be a gold
standard. Instead, we compare tags to controlled vocabularies created by experts,
but do not assume that either is a priori correct. Do experts and regular users tend
to use the same terms? Do experts and regular users tend to apply the same terms
to the same objects?
Figure 1.3: Screenshot of a Mechanical Turk paid microtask for sorting photos.
1.3 Overview: Paid Microtasks (Part III)
With the exception of the paid tags discussed in the previous section, tagging is usu-
ally an unpaid microtask. The big advantage of unpaid microtasks is that they are
free. Unfortunately, the free nature of unpaid microtasks is also their big disadvan-
tage. Users contribute labor at their whim, and unpaid microtasks must be made
fun, easy, in the users’ self interest, or all of the above.
Developing a system—whether a tagging system or otherwise—which is fun, easy,
and useful for potential users can take a long time. What’s more, the system then
needs to be advertised and promoted to develop a user base. Even assuming users
decide to use a system based on unpaid microtasks, neither the owners of a system,
nor we as researchers, have much real control over what users produce. As a result,
research on unpaid microtasks tends to focus on post hoc analysis of data after the mi-
crotasks have been completed. By contrast, recent systems like Amazon’s Mechanical
Turk are allowing researchers (and system developers) to get microtasks accomplished
in a much more directed way—so long as the microtasks are paid.
Mechanical Turk is a marketplace made up of requesters and workers. The re-
questers provide a task (usually through an HTML IFRAME displaying an external
website) and set a price. The workers accept or decline the task. Finally, requesters
pay the set price if the work done was acceptable. Figure 1.3 shows the interface that
a Mechanical Turk worker sees while deciding whether to accept or decline one of our
tasks. At the top, one can see the reward offered—one US cent. At the bottom, one
can see an IFRAME displaying the task. In this case, the task is a web form asking the
worker to choose which of two photos is less blurry.
Microtasks on Mechanical Turk are commonly things like annotating data, tagging
photos, and judging search results. Marketplaces allow requesters to dictate exactly
what tasks they want done, and how they want the work done. Requesters have
greater control, but with that greater control comes a host of other problems. How
does one combine more than one type of paid microtask into a single program? How
should one pay for good work and avoid paying for bad work when machines cannot
evaluate the quality of the work itself?
Part III expands our focus to microtasks in general. We build out a complete
framework for building systems on top of the Mechanical Turk and similar market-
places. Our goal is to simplify and formalize the process of building such systems.
Chapter 6 develops a conceptual model for such systems. Chapter 7 describes our
implementation of that model, called the HPROC system. Chapter 8 describes a tool
for worker monitoring, in order to better understand how workers are completing
tasks and using the marketplace.
Chapter 6 proposes our conceptual model for writing programs where use of micro-
tasks is common (“human programming”), called the Human Processing model. Hu-
man Processing provides for separation of concerns (separating pricing from program
operation, for example), reduces redundant code (by enabling libraries of functional-
ity based on microtasks), and generally makes human programming easier. Human
processing also aims to make the experimental analysis of algorithms using humans
more controlled, a topic which is returned to in Chapter 7.
Chapter 7 describes an implementation of the Human Processing model, called
the HPROC system. HPROC is a large and comprehensive system. HPROC aims to
make it easy to build complex workflows involving multiple types of tasks. HPROC
also aims to make more natural the interaction between processes performing compu-
tation and web processes that interact directly with workers. Lastly, HPROC aims to
separate out recruiting functionality, wherein specialized programs ensure that paid
microtasks are advertised and priced correctly on the marketplace.
Chapter 7 also gives a brief case study showing how to use HPROC for analyzing
sorting algorithms. We demonstrate analogues of classical Merge-Sort and Quick-Sort.
These sorting algorithms allow us to demonstrate the importance of interfaces
to workers in the design of algorithms meant to interact with humans. For example,
should we implement sorting with a binary interface where a worker chooses which
item is less, or should we implement sorting with a ranking interface where a worker
orders multiple items at a time? Our sorting case study also allows us to demonstrate
how we believe such algorithms should be evaluated.
Chapter 8 describes an analytics system for worker monitoring. Unlike with unpaid
microtasks, when dealing with paid microtasks it is quite important to detect bad
workers, and to detect them early. Our system, called Turkalytics, is a realtime system
which monitors a wide variety of actions by workers as they complete microtasks.
These actions include clicks, pressing of keys, form submissions, and others. Turka-
lytics can be seen as part of the Human Processing framework, though it is also useful
as a standalone tool.
1.4 Research Contributions
In summary, the high level research contributions in this thesis are:
• A characterization of social bookmarking, especially as it relates to web search
(Chapter 2).
• Methods for predicting tags, and evaluation of those methods (Chapter 3).
• A comparison of tagging to established methods of organization in library sci-
ence (Chapters 4 and 5).
• A full framework and programming system for paid microtasks, including a
model (Chapter 6), system (Chapter 7), and monitoring tool (Chapter 8).
Chapter 2
Social Bookmarking and Web
Search
For most of the history of the web, search engines have only had access to three
major types of data describing pages. These types are page content, link structure,
and query or clickthrough log data. Today a fourth type of data is becoming available:
user generated content (e.g., tags, bookmarks) describing the pages directly. Unlike
the three previous types of data, this new source of information is neither well studied
nor well understood. Our aim in this chapter is to quantify the size of this data source,
characterize what information it contains, and to determine the potential impact it
may have on improving web search.
This chapter also begins our study of microtasks by looking at tagging, more
specifically, social bookmarking systems. In particular, this chapter represents a de-
tailed analysis of the potential impact of social bookmarking on arguably the web’s
most important application: web search. Our analysis centers around a series of
experiments conducted on the social bookmarking site del.icio.us.1 However, we be-
lieve that many of the insights apply more generally, both to social systems centered
around URLs (e.g., Twitter) and to other tagging systems with textual objects (e.g.,
tagging systems for books and academic papers).
1 In the course of this work, del.icio.us changed its name from “del.icio.us” to “Delicious.” For clarity, we refer to it as del.icio.us throughout.
11
In Section 2.1 we introduce the terminology for our experiments on del.icio.us
and tagging systems more generally. Section 2.2 explains the complex process of
creating one of the biggest social bookmarking datasets ever studied, as well as the
methodological concerns that motivated it. The core of this chapter, Sections 2.3
and 2.4, gives two sets of results. Section 2.3 contains results that suggest that social
bookmarking will be useful for web search, while Section 2.4 contains those results that
suggest it will not. Both sections are divided into “URL” and “tag” subsections which
focus on the two major types of data that social bookmarking provides. In Section 2.5
we point to related work in web search and social bookmarking. Finally, in Section
2.6 we conclude with our thoughts on the overall picture of social bookmarking, its
ability to augment web search, and how our study generalizes to tagging in general.
(This chapter draws on material from Heymann et al. [36] which is primarily the work
of the thesis author.)
2.1 Social Bookmarking Terms and Notation
A social tagging system consists of users u ∈ U , tags t ∈ T , and objects o ∈ O. We
call an annotation of a set of tags to an object by a user a post. A post is made up
of one or more (ti, uj , ok) triples. A label is a (ti, ok) pair that signifies that at least
one triple containing tag i and object k exists in the system.
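To make these definitions concrete, labels can be derived from triples by projecting away the user. A minimal sketch in Python (the sample posts and helper names are illustrative, not part of any tagging system's interface):

```python
def posts_to_triples(posts):
    """Expand posts [(user, object, [tags])] into (tag, user, object) triples."""
    return [(t, u, o) for (u, o, tags) in posts for t in tags]

def labels(triples):
    """A label (t, o) exists if at least one triple (t, u, o) exists."""
    return {(t, o) for (t, u, o) in triples}

# Two users post the same URL; the tag "linux" overlaps.
posts = [
    ("alice", "http://debian.org", ["linux", "debian"]),
    ("bob",   "http://debian.org", ["linux"]),
]
triples = posts_to_triples(posts)
# Three triples, but only two distinct labels, since a label ignores the user.
```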
Social bookmarking systems are social tagging systems where the objects are
URLs. Each post signifies that a user has bookmarked a particular URL, and may
also include some information like a user comment.
In this chapter, we use term to describe a unit of text, whether it is a tag or part
of a query. Terms are usually words, but are also sometimes acronyms, numbers, or
other tokens.
We use host to mean the full host part of a URL, and domain to mean the
“effective” institutional level part of the host. For instance, in http://i.stanford.
edu/index.html, we call i.stanford.edu the host, and stanford.edu the domain.
Likewise, in http://www.cl.cam.ac.uk/, we call www.cl.cam.ac.uk the host, and
cam.ac.uk the domain. We use the effective top level domain (TLD) list from the
Mozilla Foundation to determine the effective “domain” of a particular host.2
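The host-to-domain mapping amounts to a longest-suffix match against the effective TLD list. The sketch below hardcodes a tiny subset of suffixes for illustration only; a real implementation would load the full Mozilla list (and handle its wildcard and exception rules):

```python
from urllib.parse import urlparse

# Tiny illustrative subset of the Mozilla effective-TLD list.
PUBLIC_SUFFIXES = {"com", "edu", "org", "uk", "ac.uk"}

def host_and_domain(url):
    """Return (host, effective domain) for a URL, per the text's definitions."""
    host = urlparse(url).hostname
    parts = host.split(".")
    # Find the longest matching public suffix, then keep one extra label.
    for i in range(len(parts)):
        if ".".join(parts[i:]) in PUBLIC_SUFFIXES:
            return host, ".".join(parts[max(i - 1, 0):])
    return host, host

# The examples from the text:
# i.stanford.edu has domain stanford.edu; www.cl.cam.ac.uk has domain cam.ac.uk.
```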
2.2 Creating a Social Bookmarking Dataset
The companies that control social sites often run a number of internal analyses,
but are usually reluctant to release specific results. This can be for competitive
reasons, or perhaps simply to ensure the privacy of their users. As a result, we worked
independently and through public interfaces to gather the social bookmarking data
for this chapter and the next. Doing so presented a number of challenges.
2.2.1 Interfaces
del.icio.us offers a variety of interfaces to interested parties, but each of these has
its own caveats and potential problems. For instance, the “recent” feed provides the
most recent bookmarks posted to del.icio.us in real time. However, while we found
that the majority of public posts by users were present in the feed, some posts were
missing (due to filtering, see Section 2.2.4). Interfaces also exist which show all posts
of a given URL, all posts by a given user, and the most recent posts with a given
tag. We believe that at least the posts-by-a-given-user interface is unfiltered, because
users often share this interface with other users to give them an idea of their current
bookmarks.
These interfaces allow for two different strategies in gathering datasets from
del.icio.us. One can monitor the recent feed. The advantage of this is that the recent
feed is in real time. This strategy also does not provide a mechanism for gathering
older posts. Alternatively, one can crawl del.icio.us, treating it as a tripartite graph.
One starts with some set of seeds—tags, URLs, or users. At each tag, all URLs tagged
with that tag and all users who had used the tag are added to the queue. At each
URL, all tags which had been annotated to the URL (e.g., all labels) and all users
who had posted the URL are added to the queue. At each user, all URLs posted or
tags used by the user are added to the queue. The advantage of this strategy is that
2 Available at http://publicsuffix.org/.
Figure 2.1: Realtime Processing Pipeline: (1) shows where the post metadata is acquired, (2) and (4) show where the page text and forward link page text are acquired, and (3) shows where the backlink page text is acquired.
it provides a relatively unfiltered view of the data. However, the disadvantage is that
doing a partial crawl of a small world graph like del.icio.us can lead to data which
is highly biased towards popular tags, users, and URLs. Luckily, these two methods
complement each other. Monitoring is biased against popular pages, while crawling
tends to be biased toward these pages (we further explore the sources of these biases
in Section 2.2.4). As a result, we created datasets based on both strategies.
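The crawl strategy can be sketched as a breadth-first traversal of the tripartite graph. The toy graph below stands in for the live per-tag, per-URL, and per-user interfaces:

```python
from collections import deque

def crawl(seeds, neighbors, max_nodes=10000):
    """Breadth-first crawl over the tag/URL/user tripartite graph.

    `seeds` is a list of (kind, id) nodes; `neighbors` maps a node to the
    adjacent nodes its interface exposes (a tag page lists URLs and users,
    a URL page lists tags and users, a user page lists URLs and tags).
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(seen) < max_nodes:
        node = queue.popleft()
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Toy graph standing in for the live interfaces:
graph = {
    ("tag", "web"): [("url", "u1"), ("user", "alice")],
    ("url", "u1"): [("tag", "web"), ("user", "bob")],
    ("user", "alice"): [("url", "u1")],
    ("user", "bob"): [],
}
nodes = crawl([("tag", "web")], lambda n: graph.get(n, []))
```

Because high-degree nodes (popular tags and URLs) are reached first, stopping such a crawl early produces exactly the bias toward popular items discussed above.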
2.2.2 Realtime Processing Pipeline
For certain analyses (see Result 10), we need to have not just the URL being book-
marked, but also the content of the page, as well as the forward links from the page.
We also wanted to have the backlinks from those pages, and the pagetext content of
those backlinks. We wanted to have this page text data as soon as possible after a
URL was posted.
As a result, for a one month period we set up a real time processing pipeline
(shown in Figure 2.1). Every 20 to 40 seconds, we polled del.icio.us to see the most
recently added posts. For each post, we added the URL of the post to two queues, a
pre-page-crawl queue and a pre-backlink queue.
Every two hours, we ran an 80 minute Heritrix web crawl seeded with the pages in
the pre-page-crawl queue.3 We crawled the seeds themselves, plus pages linked from
those seeds up until the 80 minute time limit elapsed.4
Meanwhile, we had a set of processes which periodically checked the pre-backlink
queue. These processes got URLs from the queue and then ran between one and
three link: queries against one of Google’s internal APIs. This resulted in 0-60
backlink URLs which we then added to a pre-backlink-crawl queue. Finally, once
every two hours, we ran a 30 minute Heritrix crawl which crawled only the pages in
the pre-backlink-crawl queue. In terms of scale, our pipeline produced around 2GB of
(compressed) data per hour in terms of crawled pages and crawled backlinks.
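The control flow of the pipeline amounts to a polling producer feeding two queues that are drained on fixed schedules. A schematic sketch (the URLs are placeholders; the actual polling, Heritrix crawls, and link: queries ran as separate processes):

```python
import queue

pre_page_crawl = queue.Queue()
pre_backlink = queue.Queue()

def on_poll(urls):
    """Called every 20 to 40 seconds with URLs from newly seen posts."""
    for url in urls:
        pre_page_crawl.put(url)  # seeds the bi-hourly 80-minute Heritrix crawl
        pre_backlink.put(url)    # input to the link: query processes

def drain(q):
    """Take everything currently queued, as the periodic crawl jobs do."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items

on_poll(["http://example.com/a", "http://example.com/b"])
page_seeds = drain(pre_page_crawl)       # every two hours: seed Heritrix
backlink_inputs = drain(pre_backlink)    # run link: queries, feed a third queue
```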
2.2.3 Datasets
Over the course of nine months starting in September 2006 and ending in July 2007,
we collected three datasets from del.icio.us:
Dataset C(rawl) This dataset consists of a large scale crawl of del.icio.us in Septem-
ber 2006. The crawl was breadth first from the tag “web”, with the crawling
performed as described above. This dataset consists of 22,588,354 posts and
1,371,941 unique URLs.
Dataset R(ecent) This dataset consists of approximately 8 months of data begin-
ning September 28th, 2006. The data was gathered from the del.icio.us recent
feed. This dataset consists of 11,613,913 posts and 3,004,998 unique URLs.
Dataset M(onth) This dataset consists of one contiguous month of data starting
May 25th 2007. This data was gathered from the del.icio.us recent feed. For
each URL posted to the recent feed, Dataset M also contains a crawl of that
URL within 2 hours of its posting, pages linked from that URL, and inlinks to
the URL. This page content was acquired in the manner described in Section
3 Heritrix software available at http://crawler.archive.org/.
4 The reason for running 80 minutes every two hours is that we used a single machine for crawling. The single machine would spend 80 minutes crawling forward links, 30 minutes crawling backlinks, and we left two five-minute buffers between the crawls, leading to 120 minutes.
2.2.2. Unlike Dataset R, the gathering process was enhanced so that changes
in the feed were detected more quickly. As a result, we believe that Dataset M
has within 1% of all of the posts that were present in the recent feed during the
month long period. This dataset consists of 3,630,250 posts, 2,549,282 unique
URLs, 301,499 active unique usernames, and about 2 TB of crawled data.
We are unaware of any analysis of del.icio.us of a similar scale either in terms of
duration, size, or depth.
We also use the AOL query dataset [58] for certain analyses (Results 1, 3, and 6).
The AOL query dataset consists of about 20 million search queries corresponding to
about 650,000 users. We use this dataset to represent the distribution of queries a
search engine might receive.
2.2.4 Tradeoffs
As we will see, del.icio.us data is large and grows rapidly. The web pages del.icio.us
refers to are also changing and evolving. Thus, any “snapshot” will be imprecise in
one way or another. For instance, a URL in del.icio.us may refer to a deleted page,
or a forward link may point to a deleted page. Some postings, users, or tags may be
missing due to filtering or the crawl process. Lastly, the data may be biased, e.g.,
unpopular URLs or popular tags may be over-represented.
Datasets C, R, and M each have bias due to the ways in which they were gathered.
Dataset C appears to be heavily biased towards popular tags, popular users, and
popular URLs due to its crawling methodology. Dataset R may be missing data due
to incomplete gathering of data from the recent feed. Datasets R and M are both
missing data due to filtering of the recent feed. In this chapter, we analyze Dataset
M because we believe it is the most complete and unbiased. We use Datasets C and
R to supplement Dataset M for certain analyses.
It was important for the analyses that follow not just to know that the recent feed
(and thus Datasets R and M) was filtered, but also to have a rough idea of exactly
how it was filtered. We analyzed over 2,000 randomly sampled users, and came to two
conclusions. First, on average, about 20% of public posts fail to appear in the recent
(a) Found URLs (b) Combined (c) Missing URLs
Figure 2.2: Number of times URLs had been posted and whether they appeared in the recent feed or not. (Each panel is a histogram; the x-axis is the number of posts of a given URL in the system, and the y-axis is the number of posts.) Each increase in height in “Found URLs” is a single URL (“this URL”) that was retrieved from a user’s bookmarks and was found in the recent feed. Each increase in height in “Missing URLs” is a single URL (“this URL”) that was retrieved from a user’s bookmarks and was not found in the recent feed. “Combined” shows these two URL groups together.
feed (as opposed to the posts-by-user interface, for example). Second, popular URLs,
URLs from popular domains (e.g., youtube.com), posts using automated methods
(e.g., programmatic APIs), and spam will often not appear in the recent feed. Figure
2.2 shows this second conclusion for popular URLs. It shows three histograms of
URL popularity for URLs which appeared in the recent feed (“found”), those that
did not (“missing”), and the combination of the two (i.e., the “real” distribution,
“combined”). Missing posts on the whole refer to noticeably more popular URLs, but
the effect of their absence seems minimal. In other words, the “combined” distribution
is not substantially different from the “found” distribution.
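The per-user filtering estimate amounts to comparing each sampled user's public posts (from the posts-by-user interface) against the subset observed in the recent feed. A sketch with hypothetical data:

```python
def missing_fraction(user_posts, feed_posts):
    """Fraction of a user's public posts absent from the recent feed."""
    missing = [p for p in user_posts if p not in feed_posts]
    return len(missing) / len(user_posts)

# Hypothetical sampled user: 10 public posts, of which 8 appeared in the feed.
user_posts = ["http://example.com/%d" % i for i in range(10)]
feed_posts = set(user_posts[:8])
# missing_fraction(user_posts, feed_posts) gives 0.2, i.e., 20% filtered.
```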
2.3 Positive Factors
Bookmarks are useful in two major ways. First, they can allow an individual to
remember URLs visited. For example, if a user tags a page with their mother’s name,
this tag might be useful to them, but is unlikely to be useful to others. Second, tags
can be made by the community to guide users to valuable content. For example, the
tag “katrina” might be valuable before search engine indices update with Hurricane
Katrina web sites. Non-obvious tags like “analgesic” on a page about painkillers
Figure 2.3: Histograms showing the relative distribution of ages of pages in del.icio.us, Yahoo! Search results, and ODP.
might also help users who know content by different names locate content of interest.
In this chapter, our focus is on the second use. Will bookmarks and tags really
be useful in the ways described above? How often do we find “non-obvious” tags? Is
del.icio.us really more up-to-date than a search engine? What coverage does del.icio.us
have of the web? Sections 2.3 and 2.4 try to answer questions like these. At the
beginning of each result in these sections, we highlight the main result in “capsule
form” and we summarize the high level conclusion we think can be reached. In this
section, we provide positive factors which suggest that social bookmarking might help
with various aspects of web search.
2.3.1 URLs
Summary
Result 1: Pages posted to del.icio.us are often recently modified.
Conclusion: del.icio.us users post interesting pages that are actively updated or
have been recently created.
Details
Determining the approximate age of a web page is fraught with challenges. Many
pages corresponding to on disk documents will return the HTTP/1.1 Last-Modified
header accurately. However, many dynamic web sites will return a Last-Modified
date which is the current time (or another similar time for caching purposes), and
about 2/3 of pages in Dataset M do not return the header at all! Fortunately, search
engines need to solve this problem for crawl ordering. They likely use a variety of
heuristics to determine if page content has changed significantly. As a result, the
Yahoo! Search API gives a ModificationDate for all result URLs which it returns.
While the specifics are unknown, ModificationDate appears to be a combination of
the Last-Modified HTTP/1.1 header, the time at which a particular page was last
crawled and its page content. We used this API to test the recency of five groups of
pages:
del.icio.us Pages sampled from the del.icio.us recent feed as they were posted.
Yahoo! 1, 10, and 100 The top 1, 10, and 100 results (respectively) of Ya-
hoo! searches for queries sampled from the AOL query dataset.
ODP Pages sampled from the Open Directory Project (dmoz.org).
Rather than compare the age of del.icio.us pages to random pages from the web (which
would neither be possible nor meaningful), we chose the four comparison groups to
represent groups of pages a user might encounter. The Yahoo! 1, 10, and 100 groups
represent pages a user might encounter as a result of searches. ODP represents
pages a user might encounter using an Internet directory, and is also probably more
representative of the web more broadly. For each URL in each set, we recorded the
time since the page was last modified. In order to avoid bias by time, we ran equal
proportions of queries for each set at similar times.
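For reference, the header-based baseline that motivated using the search API instead can be sketched as follows; as noted above, it fails outright for roughly two thirds of pages, which return no Last-Modified header at all:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def age_days(last_modified_header, now=None):
    """Approximate page age in days from an HTTP Last-Modified header.

    Returns None when the header is absent; as noted in the text, many
    dynamic sites instead return the current time, which this cannot detect.
    """
    if not last_modified_header:
        return None
    modified = parsedate_to_datetime(last_modified_header)
    now = now or datetime.now(timezone.utc)
    return (now - modified).total_seconds() / 86400

hdr = "Wed, 27 Jun 2007 09:00:00 GMT"
ref = datetime(2007, 7, 4, 9, 0, 0, tzinfo=timezone.utc)
# age_days(hdr, ref) is 7.0 days.
```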
Figure 2.3 shows the results. Each bar represents the number of pages in the group
with the given (x-axis) age. We found that pages from del.icio.us were usually more
recently modified than ODP, which tends to have older pages. We also found that
there is a correlation between a search result being ranked higher and a result having
been modified more recently. However, most interestingly, we found that the top 10
results from Yahoo! Search were about the same age as the pages found bookmarked
in del.icio.us. This could be interpreted in one of two ways: (i) del.icio.us is getting
recent, topical bookmarks which Yahoo! Search is trying to emulate, or (ii) del.icio.us
is getting bookmarks which are a result of searches, and thus have the same recency
as the top 10.
Summary
Result 2: Approximately 25% of URLs posted by users are new, unindexed pages.
Conclusion: del.icio.us can serve as a (small) data source for new web pages and to
help crawl ordering.
Details
We next looked at what proportion of pages were “new” in the sense that they were
not yet indexed by a search engine at the time they were posted to del.icio.us. We
sampled pages from the del.icio.us recent feed as they were posted, and then ran
Yahoo! searches for those pages immediately after. Of those pages, about 42.5% were
not found. This could be for a variety of reasons—the pages could be indexed under
another canonicalized URL, they could be spam, they could be an odd MIME-type
(an image, for instance) or the page could have not been found yet. Anecdotally, all
four of these causes appear to be fairly common in the set of sampled missing URLs.
As a result, we next followed up by continuously searching for the missing pages over
the course of the following five months. When a missing page appears in a later
result, we argue that the most likely reason is that the page was not indexed but was
later crawled. This methodology seems to eliminate the possibility that spam and
canonicalization issues are the reason for missing URLs, but does not eliminate the
possibility, for instance, that multiple datacenters give out different results.
We found that of the 5,724 URLs which we sampled and were missing from the
week beginning June 22, 3,427 were later found and 1,750 were found within four
weeks. This implies that roughly 60% of the missing URLs were in fact new URLs,
or roughly 25% of del.icio.us posts (i.e., 42.5% × 60%). This works out to roughly
30,000 new pages per day.
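The arithmetic behind this estimate, restated directly from the numbers above:

```python
missing_rate = 0.425      # fraction of sampled posts not found at posting time
sampled_missing = 5724    # missing URLs sampled in the week beginning June 22
later_found = 3427        # of those, found over the following five months

new_fraction_of_missing = later_found / sampled_missing         # roughly 0.60
new_fraction_of_posts = missing_rate * new_fraction_of_missing  # roughly 0.25
```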
Social bookmarking seems to be a good source of new and active pages. As a
source of new pages, social bookmarking may help a search engine discover pages
it might not otherwise. For instance, Dasgupta et al. [22] suggest that 25% of new
pages are not discoverable using historical information about old pages. As a source
of both new and active pages, social bookmarking may also help more generally with
the “crawl ordering” problem—should we update old pages, or try to discover new
pages? To the extent to which social bookmarks represent “interesting” changes to
pages, they should be weighted in crawl ordering schemes.
Summary
Result 3: Roughly 9% of results for search queries are URLs present in del.icio.us.
Conclusion: del.icio.us URLs are disproportionately common in search results com-
pared to their coverage.
Details
Similarly to the recently modified pages discussion above, we used queries chosen by
sampling from the AOL query dataset to check the coverage of results by del.icio.us.
Specifically, we randomly sampled queries from the query dataset, ran them on Ya-
hoo! Search, and then cross-referenced them with the millions of unique URLs present
in Datasets C, M, and R. When we randomly sample, we sample over query events
rather than unique query strings. This means that the query “american idol” which
occurs roughly 15,000 times, is about five times more likely to be picked than
“powerball” which occurs roughly 3,000 times.
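Sampling over query events rather than unique query strings weights each query by its frequency. A minimal sketch, using a hypothetical query log with the two example frequencies from the text:

```python
import random

# Hypothetical query log: one entry per *query event*, so repeated queries
# appear multiple times. Uniform sampling over events therefore picks a
# query string with probability proportional to its frequency.
query_events = ["american idol"] * 15000 + ["powerball"] * 3000

random.seed(0)  # for reproducibility of this sketch
sample = [random.choice(query_events) for _ in range(10000)]

ratio = sample.count("american idol") / sample.count("powerball")
print(ratio)  # ≈ 5, matching the 15,000 : 3,000 frequency ratio
```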
We found that despite the fact that del.icio.us covers a relatively small portion
of the web (see discussion below in Result 9), it covers a disproportionately high
proportion of search results. For the top 100 results of the queries, del.icio.us covers
9% of results returned for a set of over 30,000 queries. For the top 10 results, this
coverage is about double: 19% of results returned are in del.icio.us. This set of queries
is weighted towards more popular queries, which can explain part of this effect. By
comparison, we might expect 1/1000 of URLs in query results to be in del.icio.us if they
were selected at random from the web (again, see Result 9). This suggests that to
whatever extent del.icio.us gives us additional metadata about web pages, it may lead
to result reordering for queries.
Summary
Result 4: While some users are more prolific than others, the top 10% of users only
account for 56% of posts.
Conclusion: del.icio.us is not highly reliant on a relatively small group of users (e.g.,
< 30,000 users).
Details
Figure 2.4 shows the extent to which the most prolific users are responsible for large
numbers of posts. While there are some URLs, domains, users, and tags that cover
many posts or triples, the distributions do not seem so condensed as to be problematic.
For instance, on social news sites, it is commonly cited that the majority of front page
posts come from a dedicated group of less than 100 users. However, the majority of
posts in Dataset M instead come from tens of thousands of users. Nonetheless, the
distribution is still power law shaped and there is a core group of relatively active
users and a long tail of relatively inactive users.
Figure 2.4: Cumulative Portion of del.icio.us Posts Covered by Users
Figure 2.5: How many times has a URL just posted been posted to del.icio.us?
Summary
Result 5: 30-40% of URLs and approximately one in eight domains posted were not
previously in del.icio.us.
Conclusion: del.icio.us has relatively little redundancy in page information.
Details
The recent feed states for each post how many times the URL in that post is already
in del.icio.us. Figure 2.5 shows the distribution of this value. A new post in Dataset
M is of a new URL not yet in the system about 40% of the time. This proportion
might be 30% of total posts to del.icio.us if we adjust for filtering. In Dataset M, a
majority of the URLs posted were only posted once during the time period.
Another way to look at new URLs being added to del.icio.us is in terms of how of-
ten a completely new domain is added (as opposed to just another URL at an existing
domain). Unfortunately, we do not know the exact set of domains in del.icio.us. How-
ever, we can provide an upper-bound by comparing against the domains in Datasets
C and R. We found that about 12% of posts in Dataset M were URLs whose domains
were not in either Dataset C or R. This suggests that about one eighth of the time,
a new URL is not just a new page to be crawled, but may also suggest an entire new
domain to crawl.
This result coupled with Result 4 may impact the potential actions one might use
to fight tag spam. Because of the relatively high number of new pages, it may be
more difficult to determine the quality of labels placed on those new pages. Furthermore,
due to the relatively low number of label redundancies, it may be difficult
to determine the trustworthiness of a user based on coincident labels with other users
(as in, e.g., [47]). For instance, 85% of the labels in Dataset M are non-redundant.
As a result, it may become increasingly important to use interface-based methods
to keep attackers out rather than analyzing the data that they add to the system.
However, on the other hand, the low level of redundancy does mean that users are
relatively efficient in labeling the parts of the web that they label.
Figure 2.6: A scatter plot of tag count versus query count for top tags and queries in del.icio.us and the AOL query dataset. r ≈ 0.18. For the overlap between the top 1000 tags and queries by rank, τ ≈ 0.07.
2.3.2 Tags
Summary
Result 6: Popular query terms and tags overlap significantly (though tags and query
terms are not correlated).
Conclusion: del.icio.us may be able to help with queries where tags overlap with
query terms.
Details
One important question is whether the metadata attached to bookmarks is actually
relevant to web searches. That is, if popular query terms often appear as tags, then
we would expect the tags to help guide users to relevant pages. SocialSimRank [14]
suggests an easy way to make use of this information. We opted to look at tag–query
overlap between the tags in Dataset M and the query terms in the AOL query dataset.
Tag (Rank)          # Queries (Rank)   Tag (Rank)          # Queries (Rank)
design (#1)         10318 (#545)       tutorial (#16)      779 (#7098)
blog (#2)           3367 (#1924)       news (#17)          63916 (#40)
imported (#3)       215 (#18292)       blogs (#18)         1478 (#4205)
music (#4)          63250 (#41)        howto (#19)         152 (#23341)
software (#5)       10823 (#506)       shopping (#20)      5394 (#1222)
reference (#6)      1312 (#4655)       travel (#21)        20703 (#227)
art (#7)            29558 (#130)       free (#22)          184569 (#9)
programming (#8)    478 (#10272)       css (#23)           456 (#10624)
tools (#9)          6811 (#921)        education (#24)     15546 (#335)
web2.0 (#10)        0 (None)           business (#25)      21970 (#212)
web (#11)           24992 (#184)       flash (#26)         5170 (#1274)
video (#12)         29833 (#127)       games (#27)         59480 (#49)
webdesign (#13)     11 (#155992)       mac (#28)           3440 (#1873)
linux (#14)         178 (#20937)       google (#29)        191670 (#8)
photography (#15)   4711 (#1384)       books (#30)         16643 (#296)
Table 2.1: Top tags and their rank as terms in AOL queries.
For this analysis, we did not attempt to remove “stop tags”—tags like “imported” that
were automatically added by the system or otherwise not very meaningful. Figure
2.6 shows the number of times a tag occurs in Dataset M versus the number of times
it occurs in the AOL query dataset. Table 2.1 shows the corresponding query term
rank for the top 30 del.icio.us tags in Dataset M. Both show that while there was
a reasonable degree of overlap between query terms and tags, there was no positive
correlation between popular tags and popular query terms.
One likely reason the two are uncorrelated is that search queries are primarily
navigational, while tags tend to be used primarily for browsing or categorizing. For
instance, 21.9% of the AOL query dataset is made up of queries that look like URLs
or domains, e.g., www.google.com or http://i.stanford.edu/ and variations. To
compute the overlap between tags and queries (but not for Figure 2.6), we first
removed these URL or domain-like queries from consideration. We also removed
certain stopword like tags, including “and”, “for”, “the”, and “2.0” and all tags with
less than three characters. We found that at least one of the top 100, 500, and 1000
tags occurred in 8.6%, 25.3% and 36.8% of these non-domain, non-URL queries.
In some sense, overlap both overstates and understates the potential coverage.
On one hand, tags may correlate with but not be identical to particular query terms.
However, on the other, certain tags may overlap with the least salient parts of a query.
We also believe that because AOL and del.icio.us represent substantially different
communities, the query terms are a priori less likely to match tags than if we had a
collection of queries written by del.icio.us users.
Summary
Result 7: In our study, most tags were deemed relevant and objective by users.
Conclusion: Tags are on the whole accurate.
Details
One concern is that tags at social bookmarking sites may be of “low quality.” For
example, perhaps users attach nonsensical tags (e.g., “fi32”) or very subjective tags
(e.g., “cool”). To get a sense of tag quality, we conducted a small user study. We had
a group of ten people, a mix of graduate students and individuals associated with our
department, manually evaluate posts to determine their quality. We sampled one post
out of every five hundred, and then gave blocks of posts to different individuals to
label. Most of the individuals labeled about 100 to 150 posts. For each tag, we asked
whether the tag was “relevant,” “applies to the whole domain,” and/or “subjective.”
For each post, we asked whether the URL was “spam,” “unavailable,” and a few other
questions. We set the bar relatively low for “relevance”: whether a random person
would agree that it was reasonable to say that the tag describes the page. Roughly
7% of tags were deemed “irrelevant” according to this definition. Also, remarkably
few tags were deemed “subjective”: less than one in twenty for all users. Lastly,
there was almost no “spam” in the dataset, either due to low amounts of spam on
del.icio.us, or due to the filtering described in Section 2.2.
[Figure: “Bookmarks Posted in a Given Hour” — number of bookmarks posted per hour (0 to 14,400, equivalently 0 to 4 posts/s) from May 25 through June 24.]
Figure 2.7: Posts per hour and comparison to Philipp Keller.
2.4 Negative Factors
In this section, we present negative factors which suggest that social bookmarking
might not help with various aspects of web search.
2.4.1 URLs
Summary
Result 8: Approximately 120,000 URLs were posted to del.icio.us each day.
Conclusion: The number of posts per day is relatively small; for instance, it represents
about 1/10 of the number of blog posts per day.
Details
Figure 2.7 shows the posts per hour for every hour in Dataset M. The dashed lines
show (where available) the independently sampled data collected by Philipp Keller.5
Keller’s data comes from sampling the recent feed every 10 minutes and extrapolating
based on the difference in age between the youngest and oldest bookmark in the fixed
size feed. Dataset M comes from attempting to capture every post in the recent feed.
The two datasets seem to be mutually reinforcing—our data only differs from Keller’s
5Available at http://deli.ckoma.net/stats.
[Figure: estimated number of posts per day over time, in three panels: (a) August 2005—August 2006, (b) November 2006—July 2007, (c) August 2005—July 2007; y-axes range from 0 to 150,000 posts.]
Figure 2.8: Details of Keller’s post per hour data.
slightly, and this usually occurs at points where the feed “crashed.” At these points,
near June 3rd and June 15th respectively in Figure 2.7, the feed stopped temporarily,
and then restarted, replaying past bookmarks until it caught up to the present.
There are an average of 120,087 posts per day in Dataset M. However, more
relevant for extrapolation is the number of posts in a given week. On average,
92,690 posts occurred per day of each weekend, and 133,133 posts occurred each
weekday. Thus, del.icio.us produced about 851,045 posts per week during our period
of study, or a little more than 44 million posts per year. For comparison, David Sifry
[65] suggests that there were on the order of 1.5 million blog posts per day during
the same time period. This means that for every bookmark posted to del.icio.us, ten
blog entries were posted to blogs on the web.
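The weekly and yearly rates above follow directly from the weekend and weekday averages:

```python
# Reproducing the weekly and yearly post-rate arithmetic from the text.
weekend_posts_per_day = 92_690
weekday_posts_per_day = 133_133

posts_per_week = 2 * weekend_posts_per_day + 5 * weekday_posts_per_day
print(posts_per_week)       # 851045, as reported

posts_per_year = posts_per_week * 52
print(posts_per_year)       # 44254340, "a little more than 44 million"
```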
More important than the rate at which posts were being generated is the rate
at which posts per day accelerate. However, this rate of acceleration is harder to
determine. For instance, Dataset M shows a 50% jump in posts per hour on the
evening of May 30th, when del.icio.us announced a partnership with Adobe. However,
we believe that this may have simply been bouncing back from a previous slump.
Keller’s data, shown in Figure 2.8, seems to tell multiple stories. From August 2005
until August 2006 (including December 2005, when del.icio.us was bought), del.icio.us
seems to have been accelerating at a steady rate. However, from November 2006 to
June 2007, the rate of acceleration seems to be flat. Our Dataset R, while not covering
the same length of time, does not lead us to reject Keller’s data. As a result, we believe
that the history of social bookmarking on del.icio.us seems to be a series of increases
in posting rate followed by relative stability. To the extent to which this is the case,
we believe that future rates of increase in posts per day are highly dependent on
external factors and are thus not easily predictable.
Summary
Result 9: There were roughly 115 million public posts, coinciding with about 30-50
million unique URLs at the time of our study.
Conclusion: The number of total posts is relatively small; for instance, this is a
small portion (perhaps 1/1000) of the web as a whole.
Details
Relatively little is known about the size of social bookmarking sites, and in particular
del.icio.us. In September 2006, del.icio.us announced that they had reached 1 million
users, and in March 2007, they announced they had reached 2 million. The last
official statement on the number of unique posts and URLs was in May of 2004, when
del.icio.us’ creator, Joshua Schachter, stated that there were about 400,000 posts and
200,000 URLs.
One way to estimate the size of del.icio.us is to extrapolate from some set of URLs
or tags. For instance, if the URL http://www.cnn.com/ was posted u_m times in a
one-month period, there were t_m posts total during that month, and the URL had
been posted to the system a total of u_s times, we might estimate the total size t_s of
del.icio.us as t_s = (u_s × t_m) / u_m (assuming u_m / t_m = u_s / t_s). However, we
found that this led to poor estimates—often in the billions of posts.
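The ratio-based extrapolation can be sketched in a few lines. The input counts here are hypothetical, purely to show the formula’s shape:

```python
# Ratio-based size extrapolation: assumes the URL's share of this month's
# posts equals its share of all-time posts (u_m / t_m == u_s / t_s).
u_m = 500          # times http://www.cnn.com/ was posted this month (hypothetical)
t_m = 3_600_000    # total posts system-wide this month (hypothetical)
u_s = 2_000        # all-time posts of that URL (hypothetical)

t_s = u_s * t_m / u_m   # estimated all-time total posts in the system
print(t_s)              # 14400000.0 for these hypothetical inputs
```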
Instead, we assume that the rate of posting of URLs to del.icio.us has been mono-
tonically increasing (given a sufficient time window) since its creation. We then divide
the historical record of del.icio.us into three time periods. The first, t1, is the period
before Schachter’s announcement on May 24th. The second, t2, is between May 24th
and the start of Keller’s data gathering. The third, t3, is from the start of Keller’s
data gathering to the present.
We assume that t1 is equal to 400,000 posts. We estimate that t2 is equal to
the time period (about p1 = 420 days) times the maximum number of posts per day
in the one month period after Keller’s data starts (db = 44,536) times a filtering
factor (f = 1.25) to compensate for the filtering which we observed during our data
gathering. We estimate that t3 is equal to the posts observed by Keller (ok), plus the
posts in the gaps in Keller’s data gathering (gk). ok is nk = 58,194,463 posts, which
we multiply by the filtering factor (f = 1.25). We estimate gk as the number of days
missing (mk = 104) times the highest number of posts for a given day observed by
Keller (dk = 161,937) times the filtering factor (f = 1.25).
Putting this all together, we estimate that the number of posts in del.icio.us as of
late June 2007 was:
t1 + t2 + t3
= (400,000) + (p1 × db × f) + (nk × f + mk × dk × f)
= (400,000) + (420 × 44,536 × 1.25) + (58,194,463 × 1.25 + 104 × 161,937 × 1.25)
≈ 117 million posts
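Plugging in the values defined above reproduces the estimate:

```python
# The three-period estimate of total del.icio.us posts, using the values
# defined in the text (p1, db, f, nk, mk, dk).
t1 = 400_000                # posts before the May 2004 announcement
p1, db = 420, 44_536        # days in t2; max posts/day early in Keller's data
f = 1.25                    # filtering correction factor
nk = 58_194_463             # posts observed by Keller
mk, dk = 104, 161_937       # days missing from Keller's data; max posts/day

t2 = p1 * db * f
t3 = nk * f + mk * dk * f
total = t1 + t2 + t3
print(total)                # 117576288.75, i.e. about 117 million posts
```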
This estimate is likely an over-estimate because we chose upper-bound values for db
and dk. Depending on the real values of {db, dk, f}, one could reasonably estimate
the number of posts anywhere between about 60 and 150 million posts. It should be
noted that this does not, however, include private (rather than public) posts, which
we do not have any easy way to estimate. Finally, we estimate that between about
20 and 50 percent of posts are unique URLs (see discussion in Result 4 and Figure
2.5). This leads us to an estimate of about 12 to 75 million unique URLs.
The indexes of the major search engines are now commonly believed to be in
the billions to hundreds of billions of pages. For instance, Eiron et al. [26] state
in 2004 that after crawling for some period of time, their crawler had explored 1
billion pages and had 4.75 billion pages remaining to be explored. Of course, as
dynamic content has proliferated on the web, such estimates become increasingly
subjective. Nonetheless, the number of unique URLs in del.icio.us is relatively small
as a proportion of the web as a whole.
2.4.2 Tags
Summary
Result 10: Tags are present in the pagetext of 50% of the pages they annotate and
in the titles of 16% of the pages they annotate.
Conclusion: A substantial proportion of tags are obvious in context, and many
tagged pages would be discovered by a search engine.
Details
For a random sampling of over 20,000 posts in Dataset M, we checked whether tags
were in the text of the pages they annotate or related pages. To get plain text from
pages, we used John Cowan’s TagSoup Java package to convert from HTML.6 To
get tokens from plain text, we used the Stanford NLP Group’s implementation of
the Penn Treebank Tokenizer.7 We also checked whether pages were likely to be in
English or not, using Marco Olivo’s lc4j Language Categorization package.8 Finally,
we lowercased all tags and all tokens before doing comparisons.
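A minimal sketch of the occurrence check, with a naive whitespace tokenizer standing in for the Penn Treebank tokenizer used in the study (function name and inputs are illustrative):

```python
# Check whether a tag appears in a page's text or title, after lowercasing
# both tags and tokens as described in the text. The study used the Penn
# Treebank tokenizer; str.split() is a simplification here.
def tag_occurrence(tag, title, page_text):
    tag = tag.lower()
    title_tokens = set(title.lower().split())
    text_tokens = set(page_text.lower().split()) | title_tokens
    return {"in_text": tag in text_tokens, "in_title": tag in title_tokens}

print(tag_occurrence("Java", "Java Tutorials", "learn the java language"))
# {'in_text': True, 'in_title': True}
```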
We found that 50% of the time, if a tag annotates a page, then it is present in the
page text. Furthermore, 16% of the time, the tag is not just anywhere in the page
text, but it is present in the title. We also, looked at the page text of pages that
link to the URL in question (backlinks) and pages that are linked from the URL in
question (forward links). 20% of the time, a tag annotating a particular page will
appear in three places: the page it annotates, at least one of its backlinks, and at
least one of its forward links. 80% of the time, the tag will appear in at least one
of these places: the page, backlinks or forward links. Anecdotally, the tags in the
6 TagSoup is available at http://ccil.org/~cowan/XML/tagsoup/.
7 The PTB Tokenizer is available at http://nlp.stanford.edu/javanlp/—we used the version from the Stanford NER.
8 lc4j is available at http://www.olivo.net/software/lc4j/ and implements algorithms from [15].
% of Tag   Tag % of Host   Host
5.0%       87.7%           java.sun.com
3.2%       81.5%           onjava.com
3.1%       82.0%           javaworld.com
1.6%       67.9%           theserverside.com
1.3%       88.7%           today.java.net

Table 2.2: This example lists the five hosts in Dataset C with the most URLs annotated with the tag java.
missing 20% appear to be “lower quality.” They tend to be mistakes of various kinds
(misspellings or mistypes of tags) or confusing tagging schemes (like “food/dining”).
Overall, this seems to suggest that a search engine, which is already looking at page
text and particularly at titles (and sometimes at linked text), is unlikely to gain much
from tag information in a significant number of cases.
Summary
Result 11: Domains are often highly correlated with particular tags and vice versa.
Conclusion: It may be more efficient to train librarians to label domains than to
ask users to tag pages.
Details
One way in which tags may be predicted is by host. Hosts tend to be created to focus
on certain topics, and certain topics tend to gravitate to a few top sites focusing on
them. For instance, Table 2.2 shows the proportion of the URLs in Dataset C labeled
“java” which are on particular hosts (first column). It also shows the proportion of
the URLs at those hosts which have been labeled “java” (second column). This table
shows that 14 percent of the URLs that are annotated with the tag java come from
five large topical Java sites where the majority of URLs are in turn tagged with java.
Unfortunately, due to the filtering discussed in Section 2.2.4 we could not use
Dataset M for our analysis. Instead, we use Dataset C, with the caveat that based
on our discussions in Section 2.2.4 and Result 9, Dataset C represents about 25% of
Figure 2.9: Host Classifier: The accuracy for the first 130 tags by rank for a host-based classifier: (a) on positive examples; (b) on negative examples.
           Avg Accuracy (+)   Avg Accuracy (-)
τ = 0.33   19.647             99.670
τ = 0.5    7.372              99.943
τ = 0.66   4.704              99.984

Table 2.3: Average accuracy for different values of τ.
the posts in del.icio.us, biased towards more popular URLs, users, and tags. As a
result, one should not assume that the conclusions from this section apply to all of
del.icio.us as opposed to the more concentrated section of Dataset C.
We denote the number of URLs tagged with a tag ti at a given host dj as
tagged(ti, dj), and the total number of URLs at that host in the tagging corpus
as total(dj). We can construct a binary classifier for determining if a particular URL
ok having host dj should be annotated with tag ti with the simple rule:

    classify(ti, dj) = t    if tagged(ti, dj) / total(dj) > τ
                       ¬t   if tagged(ti, dj) / total(dj) ≤ τ
where τ is some threshold. We define the positive accuracy to be the rate at which
our classifier labels positive examples correctly as positives, and negative accuracy
to be the rate at which our classifier labels negative examples correctly as
negatives. Further, we define the macro-averaged positive and negative accuracies,
given in Table 2.3, as the mean of the positive and negative accuracies—with each
tag weighted equally—for the top 130 tags, respectively.
This classifier allows us to predict (simply based on the domain) between about
five and twenty percent of the tag annotations in Dataset C, with between a few
false positives per 1,000 and a few per 10,000. We also show the accuracies on
positive and negative examples in Figure 2.9. All experiments use leave-one-out cross
validation. Our user study (described in Result 7) also supported this conclusion.
About 20% of the tags which were sampled were deemed by our users to “apply
to the whole domain.” Because our user study and our experiments above were
based on differently biased datasets, Datasets C and M, they seem to be mutually
reinforcing in their conclusions. Both experiments suggest that a human librarian
capable of labeling a host with a tag on a host-wide basis (for instance, “java” for
java.sun.com) might be able to make substantial numbers of user contributed labels
redundant.
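The threshold rule above is easy to state as code. The counts below are hypothetical (chosen to resemble java.sun.com’s 87.7% row in Table 2.2); in the study, tagged(ti, dj) and total(dj) would come from Dataset C:

```python
# Host-based threshold classifier: predict tag t for any URL at host d iff
# the fraction of the host's URLs already tagged t exceeds the threshold tau.
def classify(tagged_t_d, total_d, tau):
    return (tagged_t_d / total_d) > tau

# Hypothetical host where 87.7% of 1,000 URLs are tagged "java".
print(classify(tagged_t_d=877, total_d=1000, tau=0.5))   # True
print(classify(tagged_t_d=877, total_d=1000, tau=0.9))   # False
```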
2.5 Related Work
Since the beginning of the web, people have used page content to aid in navigation
and searching. However, almost as early—Eiron and McCurley [25] suggest as early
as 1994—users were suggesting the use of anchortext and link structure to improve
web search. Craswell et al. [21] also give some early justification for use of anchortext
to augment web search.
Meanwhile, there has also been a current of users attempting to annotate their own
pages with metadata. This began with the <meta> tag which allowed for keywords on
a web page to aid search engines. However, due to search engine spam, this practice
has lost favor. The most recent instance of this idea is Google Co-op,9 where Google
encourages site owners to label their sites with “topics.” Co-op allows Google to
refine search results based on this additional information. However, unlike social
bookmarking, these metadata approaches require site owners to know all of the labels
a user might attach to their site. This leads to the well studied “vocabulary problem”
(see [28], [17]), whereby users have many different types of terminology for the same
resources. Ultimately, unlike previous metadata, social bookmarking systems have
the potential to overcome the vocabulary problem by presenting many terms for the
same content created by many disparate users.
Golder and Huberman [31] were two of the earliest researchers to look at the dy-
namics of tagging in del.icio.us. While a number of papers have looked at del.icio.us,
only a few have looked at its relationship to web search. Both Bao et al. [14] and
Yanbe et al. [72] propose methods to modify web search to include tagging data.
However, neither looked at whether del.icio.us (or any other social bookmarking site)
9See http://www.google.com/coop/.
was producing data of a sufficient quantity, quality or variety to support their meth-
ods. Both also use relatively small datasets—Bao et al. use 1, 736, 268 web pages and
269, 566 annotations, while Yanbe et al. use several thousand unique URLs. Also,
both of these papers are primarily interested in the popularity and tags of the URLs
studied, rather than other possible uses of the data.
The ultimate test of whether social bookmarking can aid web search would be to
implement systems like those of Bao et al. or Yanbe et al. and see if they improve
search results at a major search engine.
2.6 Conclusion
The eleven results presented in Sections 2.3 and 2.4 paint a mixed picture for web
search. We found that social bookmarking as a data source for search has URLs
that are often actively updated and prominent in search results. We also found that
tags were overwhelmingly relevant and objective. However, del.icio.us produces small
amounts of data on the scale of the web. Furthermore, the tags which annotate URLs,
while relevant, are often functionally determined by context. Nearly one in six tags
are present in the title of the page they annotate, and one in two tags are present in
the page text. Aside from page content, many tags are determined by the domain of
the URL that they annotate, as is the case with the tag “java” for “java.sun.com.”
These results suggest that URLs produced by social bookmarking are unlikely to be
numerous enough to impact the crawl ordering of a major search engine, and the tags
produced are unlikely to be much more useful than a full text search emphasizing
page titles.
This chapter represented our first large study of a tagging system. While the
results were mixed for web search, many of the insights are quite general. For example,
our user study in Result 7 foreshadows later, similar results about objective, relevant
tags in Section 4.4.1. Our analysis of the dangers and potential responses to spam
also led to a variety of later tag spam work (e.g., [35] and [48]). Overall, we hope
this chapter has given a taste for the type of data, and challenges, of a real tagging
system at scale.
Chapter 3
Social Tag Prediction
In Chapter 2, we conducted a broad analysis of the social bookmarking system
del.icio.us. In particular, we focused on properties which we believe are important
to web search. This chapter drills down and focuses on one property, predictability.
In particular, we focus on a problem which we call the social tag prediction problem,
asking, “given a set of objects, and a set of tags applied to those objects by users,
can we predict whether a given tag could/should be applied to a particular object?”
In this chapter, we look at how effective different types of data are at predicting tags
in a tagging system.
Solving the social tag prediction problem has two benefits. At a fundamental
level, we gain insights into the “information content” of tags: that is, if tags are easy
to predict from other content, they add little value. At a practical level, we can use a
tag predictor to enhance a social tagging site. These enhancements can take a variety
of forms:
Increase Recall of Single Tag Queries/Feeds Many, if not most, queries in tag-
ging systems are for objects labeled with a particular tag. Similarly, many
tagging systems allow users to monitor a feed of items tagged with a particular
tag. For example, a user of a social bookmarking site might set up a feed of
all “photography” related web pages. Tag prediction could serve as a recall
enhancing device for such queries and feeds. In Section 3.3.2, we set up such a
recall enhancing tag prediction task.
Inter-User Agreement Many users have similar interests, but different vocabular-
ies. Tag prediction would ease sharing of objects despite vocabulary differences.
Tag Disambiguation Many tags are polysemous, that is, they have different mean-
ings. For example, “apple” might mean the fruit, or the computer company.
Predicting additional tags (like “macos” or “computer”) might aid in disam-
biguating what a user meant when annotating an object. Past work by Aurn-
hammer et al. [13] looks at similar issues in photo tagging.
Bootstrapping Sen et al. [64] find that the way users use tags is determined by
previous experience with tags in the system. For example, in systems with low
tag usage, fewer users will apply tags. If tag usage in the system is mostly
personal tags, users tend to apply more personal tags. Using tag prediction,
a system designer could pre-seed a system with appropriate tags to encourage
quality contributions from users.
System Suggestion Some tagging systems provide tag suggestions when a user is
annotating an object (see for example, Xu et al. [71]). Predicted tags might
be reasonable to suggest to users in such a system. However, unlike the other
applications in this list, it might be more informative for the system to suggest
tags that it is unsure of to see if the user selects them.
We examine whether tags are predictable based on the page text, anchor text, and
surrounding domains of pages they annotate. We find that there is a high variance in
the predictability of tags, and we look at metrics associated with predictability. One
such metric, a novel entropy measure, captures a notion of generality that we think
might be helpful for other tasks in tagging systems. Next, we look at how to predict
tags based on other tags annotating a URL. We find that we can expand a small set
of tags with high confidence. We conclude with a summary of our findings and their
broader implications for tagging systems and web search. (This chapter draws on
material from Heymann et al. [38] which is primarily the work of the thesis author.)
3.1 Tag Prediction Terms and Notation
We use the same terms and notation from Section 2.1, with some additions. We
imagine that every object o has a vast set of tags that do not describe it, a smaller set
of tags which do describe it, and an even smaller set of tags which users have actually
chosen to input into the system as applicable to the object. We say that the first
set of tags negatively describes the object, the second set of tags positively describes
the object, and the last set of tags currently annotates the object. We model each of
these three relationships as relations or tables:
Rp: A set of (t, o) pairs; each pair means that tag t positively describes object o.
Rn: A set of (t, o) pairs; each pair means that tag t negatively describes object o.
Ra: A set of (t, u, o) triples; each triple means that user u annotated object o with
tag t.
In practice, the system owner only has access to Ra.
We manipulate the relations Rp, Rn, and Ra using two standard relational algebra
operators with set semantics. Selection, or σc selects tuples from a relation where a
particular condition c holds. Projection, or πp projects a relation into a smaller
number of attributes. σc is equivalent to the WHERE c clause in SQL whereas πp
is equivalent to the SELECT p clause in SQL. σc can be read as “select all tuples
satisfying c.” πp can be read as “show only the attributes in p from each tuple.”
Suppose a tagging system had only two objects, a web page obagels about a down-
town bagel shop and a web page opizza about a pizzeria next door. We might have:
Rp = {(tbagels, obagels), (tshop, obagels), (tdowntown, obagels),
(tpizza, opizza), (tpizzeria, opizza)}
Rn = {(tpizzeria, obagels), (tpizza, obagels), (tbagels, opizza), . . .}
If we want to know all of the tags which positively describe obagels, we would write
πt(σobagels(Rp)) and the result would be (tbagels, tshop, tdowntown). If we want all (t, o)
pairs which do not describe opizza, we would write π(t,o)(σopizza(Rn)). Suppose also
that a user usally has annotated the pizzeria web page with the tag tpizzeria:
Ra = {(tpizzeria, usally, opizza)}
If we want to know all users who have tagged opizza, we would write πu(σopizza(Ra))
and the result would be (usally).
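The two operators can be sketched directly over Python sets of tuples; the relation contents below mirror the bagels/pizza example, and the helper names `select` and `project` are ours, not the thesis's:

```python
# sigma (selection) and pi (projection) over tag relations modeled as
# Python sets of tuples, mirroring the bagels/pizza example.

def select(relation, predicate):
    """sigma_c: keep only the tuples satisfying condition c."""
    return {row for row in relation if predicate(row)}

def project(relation, *indices):
    """pi_p: keep only the attributes (tuple positions) in p."""
    if len(indices) == 1:
        return {row[indices[0]] for row in relation}
    return {tuple(row[i] for i in indices) for row in relation}

# R_p: (tag, object) pairs where the tag positively describes the object.
Rp = {("bagels", "o_bagels"), ("shop", "o_bagels"),
      ("downtown", "o_bagels"), ("pizza", "o_pizza"),
      ("pizzeria", "o_pizza")}

# R_a: (tag, user, object) triples of actual annotations.
Ra = {("pizzeria", "sally", "o_pizza")}

# pi_t(sigma_{o_bagels}(R_p)): all tags positively describing o_bagels.
tags = project(select(Rp, lambda row: row[1] == "o_bagels"), 0)
print(sorted(tags))  # ['bagels', 'downtown', 'shop']

# pi_u(sigma_{o_pizza}(R_a)): all users who tagged o_pizza.
users = project(select(Ra, lambda row: row[2] == "o_pizza"), 1)
print(users)  # {'sally'}
```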
3.2 Creating a Prediction Dataset
In this chapter, we continue to use the del.icio.us social bookmarking dataset de-
scribed in Section 2.2. For our current purposes, we are most interested in very
common tags in that dataset. We call the set of the top 100 tags in the dataset by
frequency T100 for short (shown in Figure 3.2).
We wanted to construct a dataset approximating Rp and Rn for our prediction
experiments. However, we only know Ra. Section 2.3 suggested that if (ti, ok) ∈
π(t,o)(Ra) then (ti, ok) ∈ Rp. In other words, annotated tags tend to be accurate.
However, the reverse is not true. The case where (ti, ok) ∉ π(t,o)(Ra) and (ti, ok) ∈ Rp
occurs sufficiently often that measures of precision, recall, and accuracy can be heavily
skewed. In early experiments on a naively created dataset, we found that as many
as 3/4 of false positives were erroneous according to manual reviews we conducted. By
“erroneous false positives,” we mean that our classifiers had accurately predicted for
a given (ti, ok) pair that (ti, ok) ∈ Rp, but (ti, ok) ∉ π(t,o)(Ra).
When comparing systems, it is reasonable to use a partially labeled dataset, be-
cause the true relative ranking of the systems is likely to be preserved. Pooling [45],
for example, makes this assumption. However, for this work, we wanted to give
absolute numbers for how accurately tags can be predicted, rather than comparing
systems.
We decided to filter our dataset by looking at the total number of posts for a given
Rank    #      Tag
1       4225   reference
2       3794   toread
3       3788   resources
4       3677   cool
5       3593   work
6       3469   technology
7       3366   tools
8       3365   internet
9       3205   computer
10      3016   blog
11      3012   web
12      2996   web2.0
13      2879   online
14      2759   free
15      2661   software
...     ...    ...
86      396    politics
87      396    mobile
88      351    game
89      343    jobs
90      341    wordpress
91      328    mp3
92      326    health
93      310    environment
94      266    finance
95      233    ruby
96      226    fashion
97      216    rails
98      135    food
99      74     recipes
100     5      fic
Table 3.1: The top 15 tags account for more than 1/3 of the top 100 tags added to URLs
after the 100th bookmark. Most are relatively ambiguous and personal. The bottom
15 tags account for very few of the top 100 tags added to URLs after the 100th
bookmark. Most are relatively unambiguous and impersonal.
Figure 3.1: Average new tags versus number of posts.
URL:
postcount(ok) = |πu(σok(Ra))|
As postcount(ok) increases, we expect the probability for any given ti that (ti, ok) ∉
π(t,o)(Ra) and (ti, ok) ∈ Rp to decrease.1 We chose a cutoff of 100, which leads us to
approximate Rp and Rn as:
(ti, ok) ∈ Rp iff 100 ≤ postcount(ok) < 3000 and |πu(σti,ok(Ra))| ≥ postcount(ok)/100
(ti, ok) ∈ Rn iff 100 ≤ postcount(ok) < 3000 and σti,ok(Ra) = ∅
This results in a filtered set of |πo(Rp ∪ Rn)| ≈ 62,000 URLs and their corresponding
tags.
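A minimal sketch of this filter, assuming Ra is held in memory as a set of (tag, user, object) triples; the function names are illustrative, not the thesis's code:

```python
# Sketch of the dataset filter: an object is kept if it has between 100
# and 3000 posts; a (tag, object) pair goes into the approximated R_p if
# at least postcount(o)/100 distinct users applied the tag, and into R_n
# if no user applied the tag at all.

def postcount(Ra, o):
    """Number of distinct users who posted (bookmarked) object o."""
    return len({u for (t, u, obj) in Ra if obj == o})

def in_Rp(Ra, tag, o):
    pc = postcount(Ra, o)
    if not (100 <= pc < 3000):
        return False
    taggers = {u for (t, u, obj) in Ra if t == tag and obj == o}
    return len(taggers) >= pc / 100

def in_Rn(Ra, tag, o):
    pc = postcount(Ra, o)
    if not (100 <= pc < 3000):
        return False
    return not any(t == tag and obj == o for (t, u, obj) in Ra)

# Toy data: 100 distinct users bookmark "url1"; five of them tag it "web".
Ra = {("misc", f"u{i}", "url1") for i in range(100)}
Ra |= {("web", f"u{i}", "url1") for i in range(5)}

print(postcount(Ra, "url1"))      # 100
print(in_Rp(Ra, "web", "url1"))   # True  (5 taggers >= 100/100)
print(in_Rn(Ra, "food", "url1"))  # True  (nobody used "food")
```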
Our reasoning for the 100 post minimum is based on the rate at which new unique
tags from the top 100 tags, T100, are added to a URL. Figure 3.1 shows the average
number of new tags ti ∈ T100 that are added to a URL by the nth post. This
1Using postcount(ok) for filtering relies to a certain extent on evaluating popular tags. While we do not examine it here, for lower frequency tags we suggest vocabcount(ti, ok) = Σuj∈πu(σok(Ra)) min(|σ(ti,uj)(Ra)|, 1), which should behave similarly to postcount(ok) but relies on users’ vocabularies rather than raw number of posts.
information is computed over all URLs which occur at least 200 times in our dataset.
On average, the first person to post a URL adds one tag from T100, the second adds
0.6 of a tag from T100, and so on. By the 100th post, the probability of a post adding
a new tag from T100 is less than 5% and remains relatively flat. Furthermore, the
top tags which are added later tend to be much more ambiguous or personal. Table
3.1 shows the fifteen tags which most and least commonly get added after the 100th
post. Tags like “mp3” and “food” are relatively clear in meaning, whereas tags like
“internet” and “toread” are much more ambiguous and personal. While we cannot
completely eliminate the possibility of erroneous tuples in Rp and Rn, our approach
is most accurate for unambiguous or impersonal tags and does not require creating a
gold standard based on human annotation. Such a gold standard would be especially
difficult to create for subjective tags like “cool” or “toread.”
3.3 Two Tag Prediction Methods
In the two sections that follow, we look at the predictability of tags given two broad
types of data. In Section 3.3.1 we look at the predictability of the tags in T100 given
information we have about the web pages in our dataset. We look at page text,
anchor text, and surrounding hosts to try to determine whether particular tags apply
to objects in our dataset. This task is specific to social bookmarking systems because
the data we use for prediction is specific to web pages. However, the predictability
of tags for web pages may also be important for web search, which may want to
determine if tags provide information above and beyond page text, anchor text, and
surrounding hosts, and to vertical (web) search, which may want to categorize parts
of the web by tags. Chapter 2 provides some initial answers to these questions, but
does not address predictability directly, nor does it look specifically at anchor text.
“Predictability” is approximated by the predictive power of a support vector machine.
While classifiers differ, we believe our results enable qualitative conclusions about the
machine predictability of tags for state of the art text classifiers.
In Section 3.3.2 we look at the predictability of tags based on other tags already
annotating an object. (In Section 3.3.1, we make the simplifying “cold start” as-
sumption that no other tags are available, using only page text, anchor text, and
surrounding hosts for prediction.) The task of predicting tags given other tags has
many potential applications within tagging systems, as discussed at the beginning of
this chapter. Unlike the task in Section 3.3.1, our work in Section 3.3.2 is applica-
ble to tagging systems in general (including video, photo and other tagging systems)
rather than solely social bookmarking systems because it does not rely on any par-
ticular type of object (e.g., web pages). We also consider the problem of ranking the
additional tags in order of how likely they are to annotate an object.2
3.3.1 Tag Prediction Using Page Information
We chose to evaluate prediction accuracy using page information on the top 100 tags
in our dataset (i.e., T100). These tags collectively represent 2,145,593 of 9,414,275
triples, meaning they make up about 22.7% of the user contributed tags in the full
Stanford Tag Crawl dataset. The dataset contains crawled page text and additional
information for about 60,000 of the URLs in πo(Rp) ∪ πo(Rn) (about 95%).
We treated the prediction of each tag ti ∈ T100 as a binary classification task. For
each tag ti ∈ T100, our positive examples were all ok ∈ πo(σti(Rp)) and our negative
examples were all ok ∈ πo(σti(Rn)). For each task, we defined two different divisions
of the data into train/test splits. In the first division, which we call Full/Full, we
randomly select 11/16 of the positive examples and 11/16 of the negative examples to be our training set. The other 5/16 of each is our test set. For each Full/Full task, the
number of training, test, positive, and negative examples varies depending on the tag.
However, usually the training set is between 30,000 and 35,000 examples and the test
set is about 15,000 examples. The proportion of positive examples can vary between
1% and 60% with a median of 14% and a mean of 9%. In the second division, which
we call 200/200, we randomly select 200 positive and 200 negative examples for our
training set, and the same for our test set.
How well we do on the Full/Full split implies how well we can predict tags on
2Note that the techniques from Section 3.3.1 could be expanded to not assume cold start and to handle ranking, but we do not do so here.
the naturally occurring distribution of tagged pages. (We call it Full/Full because
the union of positive and negative examples is the full set of URLs in Rp and Rn.)
However, we can get high accuracy (if not high precision) on Full/Full by biasing
towards guessing negative examples for rare tags. For example, because “recipes” only
naturally occurs on 1.2% of pages, we could achieve 98.8% accuracy by predicting all
negative on the “recipes” binary classification task. One solution to this problem is
to change metrics to precision-recall break even point (PRBEP) or F1 (we report the
former later). However, these measures are still highly impacted by the proportion of
positive examples. We provide 200/200 as an imperfect indication of how predictable
a tag is due to its “information content” rather than the distribution of examples in
the system.
Each example represented one URL and had one of three different feature repre-
sentations depending on whether we were predicting tags based on page text, anchor
text, or surrounding hosts. Page text means all text present at the URL. Anchor
text means all text within fifteen words of inlinks to the URL (similar to Haveliwala
et al. [32]). Surrounding hosts means the sites linked to and from the URL, as well
as the site of the URL itself. For both page text and anchor text, our feature repre-
sentation was a bag of words. We tokenized pages and anchor text using the Penn
TreeBank Tokenizer, dropped infrequent tokens (those outside the 10 million most
frequent tokens), and then converted tokens to token ids. For anchor text tasks, we only used
URLs as examples which had at least 100 inlinks.3 The value of each feature was the
number of times the token occurred. For surrounding hosts, we constructed six types
of features. These features were: the hosts of backlinks, the domains of backlinks,
the host of the URL of the example, the domain of the URL of the example, the
hosts of the forward links, and the domains of the forward links. The value of each
feature was one if the domain or host in question was a backlink/forwardlink/current
domain/host and zero if not.
We chose to evaluate page text, anchor text, and host structure rather than just
combining all text of pages linked to or from the URL of each example because
3We found that the difference between 10 and 100 inlinks as the cutoff was negligible. More data about a particular URL improves classification accuracy for that URL, but having more URLs in the training set improves classification accuracy in general.
cool, online, resources, community, work, culture, portfolio, social, technology,history, advertising, writing, architecture, flash, inspiration, humor, search, funny,tools, fun, internet, home, media, free, illustration, fashion, library, research, ajax,marketing, books, computer, environment, firefox, art, jobs, productivity, free-ware, business, download, education, news, web2.0, language, tips, wiki, word-press, graphics, mobile, video, google, php, article, blogs, mp3, travel, security,science, shopping, hardware, photography, games, reference, tutorials, toread, au-dio, photos, movies, javascript, tv, maps, blog, mac, howto, game, health, photo,design, music, opensource, osx, politics, photoshop, java, web, windows, finance,tutorial, webdesign, css, software, apple, development, food, linux, ruby, program-ming, rails, recipes
Figure 3.2: Tags in T100 in increasing order of predictability from left to right. “cool” is the least predictable tag; “recipes” is the most predictable tag.
Yang et al. [74] state that including all surrounding text may reduce accuracy. For
all representations (page text, anchor text, and surrounding hosts), we engineered
our features by applying Term Frequency Inverse Document Frequency (TFIDF),
normalizing to unit length, and then feature selected down to the top 1000 features
by mutual information. We chose mutual information due to discussion in Yang and
Pedersen [73]. In previous experiments, we found that the impact of more features was
negligible, and reducing the feature space helped simplify and speed up the training
process.4
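The TFIDF-and-normalize step can be sketched as follows. This is a toy version with natural-log IDF that omits the mutual-information feature selection; it illustrates the transformation, not the thesis's actual pipeline:

```python
import math

# Sketch of the feature engineering step: raw token counts -> TFIDF ->
# unit-length (L2) normalization. Mutual-information feature selection
# is omitted for brevity.

def tfidf_unit(docs):
    """docs: list of {token: count}. Returns list of {token: weight}."""
    n = len(docs)
    # document frequency of each token
    df = {}
    for d in docs:
        for tok in d:
            df[tok] = df.get(tok, 0) + 1
    out = []
    for d in docs:
        # TF * IDF; a token in every document gets weight zero
        w = {tok: cnt * math.log(n / df[tok]) for tok, cnt in d.items()}
        # normalize each vector to unit length (guard against zero vectors)
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        out.append({tok: v / norm for tok, v in w.items()})
    return out

docs = [{"css": 2, "web": 1}, {"web": 3}, {"recipes": 4}]
vecs = tfidf_unit(docs)
```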
For our experiments, we used support vector machines for classification. Specif-
ically we used Thorsten Joachims’ SVMlight package with a linear kernel and the
default regularization parameter (see [43]) and his SVMperf package with a linear
kernel and regularization parameters of 4 and 150 (see [44]). With SVMlight, we
trained to minimize average error, with SVMperf, we trained to minimize PRBEP.
Given that we had 100 tags, 2 splits (200/200 and Full/Full), and 3 feature types
for examples (page text, anchor text, and surrounding hosts), we conducted 600
binary classification tasks total. Assuming only a few evaluation metrics for each
binary classification task, we could have thousands of numbers to report. Instead,
4Gabrilovich and Markovitch [29] actually find that aggressive feature selection is necessary for SVM to be competitive with decision trees for certain types of hypertext data.
in the rest of this section, we ask several questions intended to give an idea of the
highlights of our analysis. Apart from the questions answered below, Figure 3.2 gives
a quick at-a-glance view of which tags are more or less predictable in T100 ranked by
the sum of PRBEP (Full/Full), Prec@10% (Full/Full) and Accuracy (200/200).5 See
discussion below for description of each metric. In the analysis below, when we give
the mean of the values of tags, we mean the macro-averaged value.
What precision can we get at the PRBEP?
For applications like vertical search (or search enhanced by topics), one natural ques-
tion is what our precision-recall curve looks like at reasonably high recall. PRBEP
gives a good single number measurement of how we can tradeoff precision for recall.
For the Full/Full split, we calculated the PRBEP for each of the 600 binary classifi-
cation tasks. On average, the PRBEP for page text was about 60%, for anchor text
was about 58%, and for surrounding hosts was about 51% with a standard deviation
of between 8% and 10%. This suggests that on realistic data, we can get about 2/3 of
the URLs labeled with a particular tag with about 1/3 erroneous URLs in our resulting
set. This is pretty good—we are doing much better than chance given that a majority
of tags in T100 occur on less than 15% of documents.
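One common way to compute PRBEP from a classifier's ranked output is to cut the ranking at the number of true positives, where precision and recall coincide; the following is a sketch of that approach, not necessarily the exact procedure used here:

```python
# Sketch: precision-recall break-even point (PRBEP) for a ranked list.
# Scores are classifier outputs; labels are 1 (tag applies) / 0 (not).
# At cutoff k in the ranking, precision equals recall exactly when k is
# the number of positives P, so PRBEP = precision among the top-P items.

def prbep(scores, labels):
    P = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)
    top = ranked[:P]
    return sum(lab for _, lab in top) / P

# Two positives; one of them ranks in the top two, so PRBEP = 1/2.
print(prbep([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))  # 0.5
```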
What precision can we get with low recall?
For applications like bootstrapping or single tag queries, we may care less about
overall recall (because the web is huge), but we may want high precision. We used
the Full/Full split to look at this question. For each binary classification task, we
calculated the precision at 10% recall (i.e., Prec@10%). With all of our feature types
(page text, anchor text, and surrounding hosts), we were able to get a mean Prec@10%
value of over 90%. The page text Prec@10% was slightly higher, at 92.5%, and all
feature types had a standard deviation of between 7% and 9%. This suggests that
whatever our feature representation, if we have many more examples than we need for
our system, we can get high precision by reducing the recall. Furthermore, it suggests
5Two tags are missing, “system:imported” (a system generated tag) and “fic” (which is common in the full dataset but uncommon for top URLs and was removed as an outlier).
that there are some examples of most tags that our classifiers are much more certain
about, rather than a relatively uniform distribution of certainty.
Which page information is best for predicting tags?
According to all evaluation metrics, we found a strict ordering among our feature
types. Page text was strictly more informative than anchor text which was strictly
more informative than surrounding hosts. For example, for PRBEP, the ordering is
(60, 58, 51), for Prec@10% it is (92.5, 90, 90), for accuracy on the 200/200 split, it
is (75, 73, 66). Usually, page text was incrementally better than anchor text, while
both were much better than surrounding hosts. This may have been due to our
representation or usage of our surrounding hosts, or it could simply be that text is a
particularly strong predictor of the topic of a page.
Is anchor text particularly predictive of tags?
One common complaint about tags is that they should be highly predictable based
on anchor text, because both serve as commentary on a particular URL. While both
page text and anchor text are predictive of tags, we did not find anchor text to
be more predictive on average than page text for any of our split/evaluation metric
combinations.
What makes a tag predictable?
A more general question than those above is what makes a tag predictable. Pre-
dictability may give clues as to the “information content” of a tag, but it may also be
practically useful for tasks like deciding which tags to suggest to users. In order to
try to quantify this, we defined an entropy measure to try to mirror the “generality”
of a tag. Specifically, we call the distribution of tag co-occurrence events with a given
tag ti, P (T |ti). Given this distribution, we define the entropy of a tag ti to be:
H(ti) = −Σtj∈T,tj≠ti P(tj|ti) log P(tj|ti)
Figure 3.3: When the rarity of a tag is controlled in 200/200, entropy is negatively correlated with predictability.
Figure 3.4: When the rarity of a tag is controlled in 200/200, occurrence rate is negatively correlated with predictability.
For example, if the tag tcar co-occurs with tauto 3 times, with tvehicle 1 time, and with
tautomobile 1 time, we would say its entropy was equal to:
H(tcar) = −(3/5) log(3/5) − (1/5) log(1/5) − (1/5) log(1/5) ≈ 1.37
The intuition for entropy in this case is that tags which co-occur with a broad base
of other tags tend to be more general than those tags which primarily co-occur with
a small group of related tags.
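A sketch of the entropy measure; with log base 2 it reproduces the tcar example above:

```python
import math

# Entropy H(t_i) of a tag from its co-occurrence counts with other tags,
# using log base 2.

def tag_entropy(cooccur):
    """cooccur: {other_tag: number of co-occurrences with t_i}."""
    total = sum(cooccur.values())
    return -sum((c / total) * math.log2(c / total)
                for c in cooccur.values())

# The t_car example: co-occurs with "auto" 3 times, "vehicle" once,
# and "automobile" once.
h = tag_entropy({"auto": 3, "vehicle": 1, "automobile": 1})
print(round(h, 2))  # 1.37
```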
Because the relative rarity of a tag heavily impacts its predictability, we used the
200/200 split to try to evaluate predictability of tags in the abstract. For this split,
Figure 3.5: When the rarity of a tag is not controlled, in Full/Full, additional examples are more important than the vagueness of a tag, and more common tags are more predictable.
we found a significant correlation between our entropy measure H(ti) and accuracy
of a classifier on 200/200 (see Figure 3.3). For page text, we had a Pearson product-
moment correlation coefficient of r = −0.46, for anchor text r = −0.51, and for
surrounding hosts r = −0.54. All p-values were less than 10^−5.6 However, for the
same split, we also found that the popularity of a tag was highly negatively correlated
with our accuracy (see Figure 3.4). Specifically, for page text, we had r = −0.53, for
anchor text r = −0.51, and for domains r = −0.27. In other words, the popularity
of a tag seems to be as good a proxy for “generality” as a more complex entropy
measure. The two are not exclusive—a linear model fit to accuracy based on both
popularity and entropy does better than a model trained on either one alone.
For the Full/Full split, we found that the commonality of a tag (and hence the com-
monality of positive examples) was highly positively correlated with high PRBEP (see
Figure 3.5). However, perhaps because the recall was relatively low, we found no corre-
lation between the commonality of a tag and our performance on Prec@10% (though
we did find some low but significant correlation between PRBEP and Prec@10%).
The entropy measure was uncorrelated with PRBEP or Prec@10% for the Full/Full
split.
6Though we do not quote them here, we also computed Kendall’s τ and Spearman’s ρ values which gave similarly strong p-values.
3.3.2 Tag Prediction Using Tags
Between about 30 and 50 percent of URLs posted to del.icio.us have only been book-
marked once or twice. Given that the average bookmark has about 2.5 tags, the odds
that a query for a particular tag will return a bookmark only posted once or twice
are low. In other words, our recall for single tag queries is heavily limited by the high
number of rare URLs with few tags. For example, a user labeling a new software tool
for Apple’s Mac OS X operating system might annotate it with “software,” “tool,”
and “osx.” A second user looking for this content with the single tag query (or feed)
“mac” would miss this content, even though a human might easily realize that “osx”
implies “mac.” The question in this section is given a small number of tags, how
much can we expand this set of tags in a high precision manner? The better we do at
this task, the less likely we are to have situations like the “osx”/“mac” case because
we will be able to expand tags like “osx” into implied tags like “mac.”
A natural approach to this problem is market-basket data mining. In the market-
basket model, there are a large set of items and a large set of baskets each of which
contains a small set of items. The goal is to find correlations between sets of items
in the baskets. Market-basket data mining produces association rules of the form
X → Y . Association rules commonly have three values associated with them:
Support The number of baskets containing both X and Y .
Confidence P (Y |X). (How likely is Y given X?)
Interest P(Y|X) − P(Y), or alternatively P(Y|X)/P(Y). (How much more common is X & Y than expected by chance?)
Given a minimum support, Agrawal et al. [11] provide an algorithm for computing
association rules from a dataset.
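For tag-pair rules, these statistics can be computed by brute force on small data. The sketch below uses the difference form of interest; a real miner such as the Apriori algorithm of Agrawal et al. [11] prunes candidates by minimum support instead of enumerating everything:

```python
from itertools import combinations

# Market-basket statistics for tag-pair rules X -> Y:
#   support    = number of baskets containing both X and Y
#   confidence = P(Y|X)
#   interest   = P(Y|X) - P(Y)

def pair_rules(baskets, min_support=2):
    n = len(baskets)
    count = {}  # itemset (tuple) -> number of baskets containing it
    for b in baskets:
        for t in b:
            count[(t,)] = count.get((t,), 0) + 1
        for x, y in combinations(sorted(b), 2):
            count[(x, y)] = count.get((x, y), 0) + 1
    rules = []
    for pair, supp in count.items():
        if len(pair) != 2 or supp < min_support:
            continue
        x, y = pair
        for lhs, rhs in ((x, y), (y, x)):
            conf = supp / count[(lhs,)]
            interest = conf - count[(rhs,)] / n
            rules.append((lhs, rhs, supp, conf, interest))
    return rules

# Toy baskets echoing the "osx" implies "mac" example.
baskets = [{"osx", "mac", "software"}, {"osx", "mac"}, {"mac"}, {"web"}]
for lhs, rhs, supp, conf, interest in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: supp={supp} conf={conf:.2f} int={interest:.2f}")
```

Here "osx" → "mac" comes out with confidence 1.0, while the reverse rule is weaker, matching the intuition that "osx" implies "mac" more strongly than the converse.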
In our case, the baskets are URLs, and the items are tags. Specifically, for each
ok ∈ πo(Rp), we construct a basket πt(σok(Rp)). We constructed three sets of rules:
rules with support > 500 and length 2, rules with support > 1000 and length 3, and
rules with support > 2000 of any length. (The length is the number of distinct tags
Int. Conf. Supp. Rule
0.59 0.994 634 graphic-design → design
0.69 0.992 644 oop → programming
0.56 0.992 654 macsoftware → software
0.89 0.990 605 photographer → photography
0.44 0.990 1780 webstandards → web
0.44 0.990 786 w3c → web
0.58 0.989 2144 designer → design
0.85 0.987 669 windowsxp → windows
0.44 0.987 1891 dhtml → web
0.85 0.986 872 debian → linux
0.58 0.986 1092 illustrator → design
0.56 0.986 707 sourceforge → software
0.85 0.985 1146 gnu/linux → linux
0.61 0.985 539 bloggers → blog
0.58 0.985 597 ilustracion → design
0.44 0.985 1794 web-development → web
0.44 0.985 3366 xhtml → web
0.68 0.984 730 disenoweb → webdesign
0.87 0.983 648 macsoftware → mac
Table 3.2: Association Rules: A selection of the top 30 tag pair association rules. All of the top 30 rules appear to be valid; these rules are representative.
involved in the rule.) We merged these three rule sets into a single association rule
set for our experiments.
Found Association Rules
We found a surprising number of high quality association rules in our data. Table 3.2
shows some of the top association rules of length two. The rules capture a number of
different relationships between tags in the data. Some rules correspond to a “type-
of” style of relationship, for example, “graphic-design” is a type of “design.” Others
correspond to different word forms, for example, “photographer” and “photography.”
Some association rules correspond to translations of a tag, for example, “disenoweb”
is Spanish for “webdesign.” Some of the relationships are surprisingly deep, for
example, the “w3c” is a consortium that develops “web” standards. Arguably, one
Int. Conf. Supp. Rule
0.81 0.989 1097 open source & source → opensource
0.55 0.979 1003 downloads & os → software
0.42 0.967 1686 free & webservice → web
0.73 0.964 1134 accessibility & css → webdev
0.84 0.952 1305 app & osx → mac
0.40 0.950 2162 webdesign & websites → web
0.47 0.947 2269 technology & webtools → tools
0.40 0.945 1662 php & resources → web
0.63 0.937 2754 html & tips → webdesign
0.50 0.934 1914 xp → software
0.45 0.928 1332 freeware & system → tools
0.62 0.919 1513 cool & socialsoftware → web2.0
0.61 0.915 1165 business & css → webdesign
0.61 0.912 2231 tips & webdevelopment → development
0.35 0.900 6337 toread & web2.0 → web
0.69 0.897 1010 fotografia & inspiration → art
0.33 0.895 1723 help & useful → reference
Table 3.3: Association Rules: A random sample of association rules of length ≤ 3 and support > 1000.
might suggest that if both ti → tj and tj → ti with high confidence and high interest,
ti and tj are probably synonymous.
Depending on computational resources, numbers of association rules in the mil-
lions or billions can be generated with reasonable support. However, in practice, the
most intuitive rules seem to be rules of length four or less. In order to give an idea
of the rules in general, rather than picking the top rules, we give a random sampling
of the top 8000 rules of length three or less. This information is shown in Table 3.3.
There is sometimes redundancy in longer rules, for example, one might suggest that
rather than “webdesign & websites → web” we should instead have “webdesign →
web” and “websites → web”. This is a minor issue, however, and it is relatively rare
for a rule with high confidence to be outright incorrect. Furthermore, given their ease
of interpretation, it would not be unreasonable to have human moderators look over
low length, high support and high confidence rules.
Tag Application Simulation
For our evaluation, we simulate rare URLs with few bookmarks. We separated πo(Rp)
into a training set of about 50,000 URLs and a test set of about 10,000 URLs.
We generated our three sets of association rules based only on baskets from the
training set. We then sampled n bookmarks from each of the the URLs in the test
set, pretending these were the only bookmarks available. Given this set of sampled
bookmarks, we attempted to apply association rules in decreasing order of confidence
to expand the set of known tags. We stopped applying association rules once we had
reached a particular minimum confidence c.
For example, suppose we have a URL which has a recipe for oven-cooked pizza
bagels with three bookmarks corresponding to three sets of tags:
{(food, recipe), (food, recipes), (pizza, bagels)}
For n = 1, we might sample the bookmark (pizza, bagels). Assuming we had two
association rules:
• pizza → food (confidence = 0.9)
Orig. Sampled  Min.              # Tag Expansions
Bookmarks      Conf.    0     1     2     3     4     5+
1              0.50     2096  100   153   435   486   7667
1              0.75     4015  1717  1422  1263  866   1654
1              0.90     7898  1845  709   291   116   78
2              0.50     545   78    115   283   300   9616
2              0.75     2067  1630  1582  1664  1145  2849
2              0.90     6520  2491  1113  473   208   132
3              0.50     216   61    62    172   164   10262
3              0.75     1397  1415  1545  1558  1363  3659
3              0.90     5913  2746  1265  596   249   168
5              0.50     71    31    29    101   80    10625
5              0.75     810   1070  1360  1507  1485  4705
5              0.90     5145  3065  1427  692   366   242
Table 3.4: Association Rules: Tradeoffs between number of original sampled book-marks, minimum confidence and resulting tag expansions.
• bagels → bagel (confidence = 0.8)
We would first apply the confidence 0.9 rule and then the confidence 0.8 rule. Ap-
plying both rules (i.e., two applications), would result in (pizza, bagels, food, bagel).
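The expansion step sketched in code, using the two hypothetical rules above; rules are applied in decreasing order of confidence, and the mean confidence of the applied rules serves as the precision estimate:

```python
# Sketch of the tag-expansion step: apply association rules in decreasing
# order of confidence until confidence drops below the minimum c.
# Rules are (lhs_tags, rhs_tag, confidence) triples.

def expand_tags(tags, rules, min_conf):
    tags = set(tags)
    applied = []  # confidences of the rules we actually fired
    for lhs, rhs, conf in sorted(rules, key=lambda r: -r[2]):
        if conf < min_conf:
            break
        if set(lhs) <= tags and rhs not in tags:
            tags.add(rhs)
            applied.append(conf)
    return tags, applied

# The pizza-bagels example: two rules, sampled tags (pizza, bagels).
rules = [(("pizza",), "food", 0.9), (("bagels",), "bagel", 0.8)]
tags, applied = expand_tags({"pizza", "bagels"}, rules, 0.5)
print(sorted(tags))                       # ['bagel', 'bagels', 'food', 'pizza']
print(round(sum(applied) / len(applied), 2))  # 0.85, the estimated precision
```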
Number and Precision of Tag Expansions
We ran a simulation as described above for each number of original sampled book-
marks n ∈ {1, 2, 3, 5} and for each minimum confidence c ∈ {0.5, 0.75, 0.9}. Our
results are shown in Tables 3.4, 3.5, and 3.6. Each row of each table represents
one setting of n and c. We asked two initial questions: “How many tags were added?”
and “How accurate were our applications of tags?” The column in Table 3.4 labeled
“# Tag Expansions” shows, for each simulation, the number of URLs to which we
were able to add 0, 1, 2, 3, 4 or 5+ tags. The column in Table 3.5 labeled “Ac-
tual Precision” shows the percentage of tag applications which were correct (given
the other information we had about each URL in Rp). For each simulation in Table
3.5, we also computed our estimate (“Estimated Precision”) of what our precision
should have been based on the confidence values of applied rules. Our estimate is
Orig. Sampled  Min.     Exp. Precision
Bookmarks      Conf.    Est.    Actual
1              0.50     0.650   0.633
1              0.75     0.844   0.854
1              0.90     0.941   0.954
2              0.50     0.652   0.590
2              0.75     0.844   0.811
2              0.90     0.942   0.931
3              0.50     0.653   0.559
3              0.75     0.844   0.779
3              0.90     0.943   0.917
5              0.50     0.654   0.509
5              0.75     0.842   0.732
5              0.90     0.943   0.873
Table 3.5: Association Rules: Tradeoffs between number of original sampled book-marks, minimum confidence, estimated precision and actual precision.
Orig. Sampled  Min.     Mean Recall (T100)   New Precision (T100)
Bookmarks      Conf.    Orig.   Expd.        Mean    Median
1              0.50     0.099   0.271        0.629   0.677
1              0.75     0.099   0.153        0.929   0.963
1              0.90     0.100   0.113        0.993   1.000
2              0.50     0.160   0.386        0.585   0.626
2              0.75     0.164   0.237        0.909   0.949
2              0.90     0.161   0.180        0.989   1.000
3              0.50     0.205   0.451        0.550   0.577
3              0.75     0.207   0.289        0.900   0.942
3              0.90     0.204   0.226        0.988   1.000
5              0.50     0.265   0.524        0.497   0.496
5              0.75     0.268   0.358        0.881   0.931
5              0.90     0.271   0.294        0.983   0.996
Table 3.6: Association Rules: Tradeoffs between number of original sampled book-marks, minimum confidence, recall, and precision.
the average of the confidence of all applied rules. For example, in our oven-cooked
pizza bagel example above (assuming the URL was the only URL in our simulation),
we would have an actual precision of 0.5 because “food” is a tag which appears in
other bookmarks annotating the URL, whereas “bagel” is not. Our estimate of our
precision would be (0.9 + 0.8)/2 = 0.85. We would also increment the “2” column of “#
Tag Expansions” because the URL was expanded twice.
Thus, the first row of Table 3.4 says that we ran a simulation with n = 1, c = 0.5
and 10,937 URLs. In 2,096 cases, we were not able to add any tags (anecdotally,
this usually happens when a bookmark only has one tag). In 7,667 cases, we were
able to add five or more tags. The first row of Table 3.5, which shows data about
the same simulation (n = 1 and c = 0.5), says that our estimate of our precision was
0.650 while our actual precision was a little lower, 0.633.
The results in Table 3.4 show that with only a single bookmark, we can expand
anywhere from 10 to 80 percent of our URLs by at least one tag depending on our
desired precision. With larger numbers of bookmarks, we can do better, though
the most pertinent tags for a URL are applied quickly. Table 3.5 shows that as the
number of bookmarks increases, the difference between estimated and actual precision
increases. This means that as a URL receives more and more annotations, we become
increasingly unsure of the effectiveness of association rules for unapplied tags.
How Useful are Predicted Tags?
As we argued at the beginning of this chapter, predicted tags can be used by a system
in many ways. Here we briefly explore one such use: increasing recall for single tag
queries. For instance, if the user searches for “food,” the system can return objects
annotated with “food” as well as objects which we predict “food” annotates. Using
term co-occurrence to expand query results is a well known IR technique; here we
want to know how well it works for tags.
For evaluation, we consider each tag ti ∈ T100 to be a query qti . For each query
qti , the result set s contains the URLs annotated with the tag, and the result set s′
contains the URLs annotated with the tag plus URLs which we predict are anno-
tated with the tag using association rules. We then compare the recall and precision
achieved by s and s′. For example, suppose five objects are positively described by
“food.” In our simulation suppose only two of the objects are known to have “food.”
Suppose that we correctly predict that one of the remaining three objects is labeled
“food” (perhaps using our “bagels” → “food” rule above), and we incorrectly predict
that two other objects are labeled “food.” Without expansion, query q retrieves s
which has two known bagel objects, so recall is 2/5. With expansion, s′ returns three
additional objects, one of which was correct, for a recall of (2+1)/5 and a precision
of (2+1)/(2+3).
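The recall and precision arithmetic of this worked example can be sketched as follows; the object ids are invented placeholders.

```python
# The single-tag query expansion arithmetic from the "food" example above.
relevant = {1, 2, 3, 4, 5}    # objects truly described by "food"
known = {1, 2}                # objects already known to be tagged "food"
predicted = {3, 6, 7}         # rules add one correct and two incorrect objects
expanded = known | predicted

recall_before = len(known & relevant) / len(relevant)       # 2/5 = 0.4
recall_after = len(expanded & relevant) / len(relevant)     # (2+1)/5 = 0.6
precision_after = len(expanded & relevant) / len(expanded)  # (2+1)/(2+3) = 0.6

print(recall_before, recall_after, precision_after)  # 0.4 0.6 0.6
```

(Without expansion, precision is trivially 1, since every returned object really carries the tag.)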
Table 3.6 shows the simulation details just described. The first two columns are
n and c, as in Tables 3.4 and 3.5. The simulation results are shown in the last four
columns. For each simulation (row), we give (a) the mean recall before expansion
(macro average over all tags in T100); (b) the mean recall after expansion; and (c)
the mean and median of the precision after expansion. (Note that without expansion
precision is always 1.) For instance, if we sample one bookmark per URL (n = 1) and
use 50% confidence (c = 0.5), we see that tag expansion improves mean recall from
0.099 to 0.271, a factor of 3 improvement! Of course, our average precision drops
from 1 to 0.629. For both the one and two tag cases (i.e., n = 1 and n = 2) with
confidence c = 0.75 we can increase recall by 50% while keeping precision above 90%.
3.4 Related Work
Previous work has looked at the nature of tags chosen by users [31, 64]. We do not
know of any work explicitly looking at how to construct a reasonable dataset for
prediction in tagging systems as we do in Section 3.2. While our hypertext classification
task in Section 3.3.1 is inspired by a long line of work, usefully surveyed by Yang et
al. [74], we believe the application to tags is new. Chakrabarti et al. [16] suggest a
different way to use local link information for classification that might prove more
effective than our domain features; however, we do not evaluate this possibility here.
Our use of an entropy measure for tagging systems is inspired by Chi and Mytkowicz
[18]. Other work has looked at tag suggestion, usually from a collaborative filtering
and UI perspective, for example with URLs [71] and blog posts [56, 68].
Our work in Section 3.3.2 is similar to work by Schmitz et al. [62]. However,
Schmitz et al. is primarily concerned with theoretical properties of mining association
rules in tripartite graphs. Schwarzkopf et al. [63] extend Schmitz’s association rules
work to build full ontologies. However, neither Schmitz et al. nor Schwarzkopf et al.
appear to evaluate the quality of the rules themselves aside from generating ontologies.
Lastly, there is also much previous work in IR studying query expansion and relevance
feedback trying to address similar questions of cross-language and cross-vocabulary
queries (see for example a general reference such as Manning et al. [54]). However, we
believe that association rules may be the most natural approach to these problems in
tagging systems due to user interface issues (for example, feeds, browsing).
3.5 Conclusion
Our tag prediction results suggest three insights.
First, this chapter reinforced evidence from Chapter 2 that many tags on the web
do not contribute substantial additional information beyond page text, anchor text,
and surrounding hosts. All three types of data can be quite predictive of different
tags in our dataset, and if we only want a small recall (e.g., 10%) we can have a
precision above 90%. The predictability of social bookmarking tags influences web
search (by suggesting ways to use tagging information or whether to use it at all),
as well as system designers who might bootstrap tagging systems with initial quality
data (by making it possible to predict such initial data).
Second, the predictability of a tag when our classifiers are given balanced training
data is negatively correlated with its occurrence rate and with its entropy. More
popular tags are harder to predict and higher entropy tags are harder to predict.
When considering tags in their natural (skewed) distributions, data sparsity issues
tend to dominate, so each further example of a tag improves classifier performance.
To the extent that predictability is correlated with the "generality" of a tag,
these measures may serve as building blocks for tagging system designers to produce
new features that rely upon understanding the specificity of tags (for example, system
suggestion and tag browsing). Both of our measures of tag predictability are object
type independent. This suggests that they may be applicable to tagging systems
where photos or video are annotated rather than only social bookmarking systems.
Third, association rules can increase recall on the single tag queries and feeds
which are common in tagging systems today. This suggests that they may serve as a
way to link disparate vocabularies among users. We found association rules linking
languages, super/subconcepts, and other relationships. These rules may also indicate
synonymy and polysemy, two issues that have plagued tagging systems since Golder
and Huberman’s seminal work [31]. (We return to the question of synonymy in the
context of social cataloging systems in Section 4.3.1.)
Chapter 4
Tagging Human Knowledge
As we noted in Chapter 1, tagging evolved in response to pressures to organize massive
numbers of online objects. For example, in 1994, two students organized pages on
the web into what became the Yahoo! Directory. What they did could be caricatured
as the “library approach” to organizing a collection: create a limited taxonomy or
set of terms and then have expert catalogers annotate objects in the collection with
taxonomy nodes or terms from the pre-set vocabulary. In 1998, the Open Directory
Project (ODP) replaced expert catalogers with volunteers, but kept the predetermined
taxonomy. Experts were too expensive, and users of the Internet too numerous to
ignore as volunteers. In 2003, del.icio.us, the subject of Chapters 2 and 3, was started.
del.icio.us uses what we call the “tagging approach” to organizing a collection: ask
users with no knowledge of how the collection is organized to provide terms to organize
the collection. Within a few years, del.icio.us had an order of magnitude more URLs
annotated than either Yahoo! Directory or ODP.
Increasingly, web sites are turning to the "tagging approach" rather than the "library
approach" for organizing the content generated by their users. This is both by
necessity and by choice. For example, the photo tagging site Flickr has thousands
of photos uploaded each second, an untenable amount to have labeled by experts.
Popular websites tend to have many users, unknown future objects, and few re-
sources dedicated up-front to data organization—the perfect recipe for the “tagging
approach.”
However, the “library approach,” even as we have caricatured it above, has many
advantages. In particular, annotations are generally consistent, of uniformly high
quality, and complete (given enough resources). In the tagging approach, who knows
whether two annotators will label the same object the same way? Or whether they
will use useful annotations? Or whether an object will end up with the annotations
needed to describe it? These questions are the subject of this chapter: to what
extent does the tagging approach match the consistency, quality, and completeness
of the library approach? We believe these questions are a good proxy for the general
question of whether the tagging approach organizes data well, a question which affects
some of the most popular sites on the web.
Unfortunately, we cannot really compare the library approach to tagging systems
using social bookmarking data, because librarians have not labeled even a small
fraction of the URLs in social bookmarking systems. Instead, this chapter and the
next look at social cataloging sites—sites where users tag books. By using books as
our objects, we can compare user tags to decades of expert library cataloger metadata.
In this chapter, we primarily treat library metadata as a gold standard. For example,
we test if tags have high coverage of existing library annotations. (In the next chapter,
we consider that library annotations might be faulty or inadequate.) By using two
social cataloging sites (LibraryThing and Goodreads), we can see how consistently
users annotate objects across tagging systems. Overall, we give a comprehensive
picture of the tradeoffs and techniques involved in using the tagging approach for
organizing a collection, though we do focus by necessity on popular tags and topics.
Our investigation proceeds as follows. In Section 4.1 we build a vocabulary to
discuss tagging and library data. In Section 4.2, we describe our datasets. In each of
Sections 4.3, 4.4, and 4.5, we evaluate the tagging approach in terms of consistency,
quality, and completeness. In Section 4.6 we discuss related work, and we conclude
in Section 4.7. (This chapter draws on material from Heymann et al. [37] which is
primarily the work of the thesis author.)
4.1 Social Cataloging Terms and Notation
This chapter and the next use the (ti, oj, uk) representation of tagging systems from
Chapter 2, with some modifications and additions. In contrast to del.icio.us in
Chapters 2 and 3, we focus on social cataloging sites where the objects are books. More
accurately, an object is a work, which represents one or more closely related books
(e.g., the different editions of a book represent a work).
An object o can be annotated in three ways. First, an object o can be annotated
(for free) by a user of the site, in which case we call the annotation a tag or (in some
contexts) a user tag (written ti ∈ T ). For example, the top 10 most popular tags
that users use to annotate their personal books in our LibraryThing social cataloging
dataset are “non-fiction,” “fiction,” “history,” “read,” “unread,” “own,” “reference,”
“paperback,” “biography,” and “novel.” Second, in a variety of experiments, we pay
non-experts to produce “tags” for a given object. These are functionally the same as
tags, but the non-experts may know little about the object they are tagging. As a
result, we call these paid non-experts “paid taggers,” and the annotations they create
“$-tags”, or $i ∈ $ to differentiate them from unpaid user tags. Thirdly, works are
annotated by librarians. For example, the Dewey Decimal Classification may say a
work is in class 811, which, as we will see below, is equivalent to saying the book has
annotations “Language and Literature”, “American and Canadian Literature,” and
“Poetry.” We will call the annotations made by librarians “library terms” (written
li ∈ L).
In a given system, an annotation a implicitly defines a group, i.e., the group of all
objects that have annotation a (we define O(a) to return this set of objects). We call
a the name of such a group. A group also has a size equal to the number of objects
it contains (we define oc(a) to return this size). Since an object can have multiple
annotations, it can belong to many groups. An object o becomes contained in group
a when an annotator annotates o with a. We overload the notation for T , $, and L
such that T (oi), $(oi), and L(oi) return the bag (multiset) of user tags, paid tags, and
library annotations for work oi, respectively.
4.1.1 Library Terms
We look at three types of library terms: classifications, subject headings, and the
contents of MARC 008.1
A classification is a set of annotations arranged as a tree, where each annotation
may contain one or more other annotations. An object is only allowed to have one
position in a classification. This means that an object is associated with one most
specific annotation in the tree and all of its ancestor annotations in the tree.
A subject heading is a library term chosen from a controlled list of annotations.
A controlled list is a predetermined set of annotations. The annotator may not make
up new subject headings. An object may have as many subject headings as desired
by the annotator.
Works are annotated with two classifications, the Library of Congress Classifica-
tion (LCC) and the Dewey Decimal Classification (DDC). A work has a position in
both classifications. LCC and DDC encode their hierarchy information in a short
string annotating a work, for example, GV735 or 811 respectively. The number 811
encodes that the book is about “Language and Literature” because it is in the 800s,
“American and Canadian Literature” because it is in the 810s, and “Poetry” most
specifically, because it is in the 811s. Likewise, “GV735” is about “Recreation and
Leisure” because it is in GV, and “Umpires and Sports officiating” because it is in
GV735. One needs a mapping table to decode the string into its constituent hierarchy
information.
Works are also annotated with zero or more Library of Congress Subject Headings
(LCSH).2 LCSH annotations are structured as one LCSH main topic and zero or
more LCSH subtopics selected from a vocabulary of phrases. For example, a book
about the philosophy of religion might have the heading “Religion” (Main Topic)
and “Philosophy” (Subtopic). In practice, books rarely have more than three LCSH
headings for space, cost, and historical reasons. Commonly only the most specific
1This section gives a brief overview of library terms and library science for this chapter. However,it is necessarily United States-centric, and should not be considered the only way to organize datain a library! For more information, see a general reference such as one by Mann ([53], [52]).
2Strictly speaking, we sometimes use any subject heading in MARC 650, but almost all of theseare LCSH in our dataset.
4.2. CREATING A SOCIAL CATALOGING DATASET 67
LCSH headings are annotated to a book, even if more general headings apply.
We flatten LCC, DDC, and LCSH. For example, in DDC, 811 is treated
as three groups {800, 810, 811}. LCSH is somewhat more complex. For
example, we treat the heading with main topic "Religion" and subtopic "Philosophy" as three groups
{Main:Religion:Sub:Philosophy, Religion, Philosophy}. This is, in some sense, not
fair to LCC, DDC, or LCSH because the structure in the annotations provides ad-
ditional information. However, we also ignore significant strengths of tagging in this
work, for example, its ability to have thousands of unique annotations for a single
work, or its ability to show gradation of meaning (e.g., a work 500 people tag “fan-
tasy” may be more classically “fantasy” than a work that only 10 people have tagged).
In any case, the reader should note that our group model does not fully model the
difference between structured and unstructured terms.
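The flattening step described above can be sketched as follows. The helper names `flatten_ddc` and `flatten_lcsh` are hypothetical, and the mapping from codes to human-readable group names would come from the mapping tables mentioned earlier; here we only flatten the raw codes and headings.

```python
# A sketch of "flattening" structured library terms into unstructured groups.

def flatten_ddc(code: str) -> set:
    """A DDC number like '811' implies its hundreds and tens ancestors."""
    return {code[0] + "00", code[:2] + "0", code}

def flatten_lcsh(main: str, sub: str) -> set:
    """A main topic plus subtopic heading is treated as three groups."""
    return {f"Main:{main}:Sub:{sub}", main, sub}

print(sorted(flatten_ddc("811")))                      # ['800', '810', '811']
print(sorted(flatten_lcsh("Religion", "Philosophy")))
```

A work's final group set is then simply the union of the flattened groups of all its library terms.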
A MARC record is a standard library record that contains library terms for a
particular book. It includes a fixed length string which we call MARC 008 that
states whether the book is a biography, whether the book is fiction, and other details.
We define LLCC , LDDC , LLCSH , LLM , and LMARC008 to be the set of library terms in
LCC, DDC, LCSH, LCSH main topics, and MARC008, respectively.
4.2 Creating a Social Cataloging Dataset
We use a dump of Library of Congress MARC records from the Internet Archive as
the source of our library terms. We chose to use only those 2,218,687 records which
had DDC and LCC library terms as well as an ISBN (a unique identifier for a book).
We also use a list of approximately 6,000 groups in LCC from the Internet Archive,
and a list of approximately 2,000 groups in DDC from a library school board in
Canada as mapping tables for LCC and DDC.
We started crawling LibraryThing in early April 2008, and began crawling
Goodreads in mid June 2008. In both cases, our dataset ends in mid-October 2008.
We crawled a sample of works from each site based on a random selection of ISBNs
from our Library of Congress dataset. LibraryThing focuses on cataloging books (and
has attracted a number of librarians in addition to regular users), whereas Goodreads
focuses on social networking (which means it has sparser tagging data). We gathered
synonym sets (see Section 4.3.1) from LibraryThing on October 19th and 20th.
We use two versions of the LibraryThing dataset, one with all of the works which
were found from our crawl, and one with only those works with at least 100 unique
tags. The former dataset, which we call the "full" dataset, has 309,071 works. The
latter dataset, which we call the "min100" dataset, has 23,396 works. We use only
one version of our Goodreads dataset, a version where every work must have at least
25 tags and there are 7,233 unique ISBNs.
4.3 Experiments: Consistency
In this and the next two sections, we conduct experiments to determine if tagging
systems are consistent, high quality, and complete. Each experiment has a descrip-
tion of a feature of the library approach to be emulated, a summary of the results,
zero or more preliminaries sections, and details about background, methodology, and
outcome.
The experiments in this section look at consistency :
Section 4.3.1 How big a problem is synonymy? That is, how consistent are users
of the same tagging system in choosing the same tag for the same topic?
Section 4.3.2 How consistent is the tag vocabulary chosen, or used, by users across
different tagging systems? That is, do users use the same tags across tagging
systems?
Section 4.3.3 How consistently is a particular tag applied across different tagging
systems? That is, do users use the same tags to describe the same objects?
Section 4.3.4 If paid taggers are asked to annotate objects with $-tags, are those
$-tags consistent with user tags?
4.3.1 Synonymy
Summary
Library Feature: There should not be multiple places to look for a particular object.
This means that we would prefer tags not to have synonyms. When a tag does have
synonyms, we would prefer one of the tags to have many more objects annotated with
it than the others.
Result: Most tags have few or no synonyms appearing in the collection. In a given
synonym set, one tag is usually much more common.
Conclusion: Synonymy is not a major problem for tags.
Preliminaries: Synonymy
A group of users named combiners mark tags as equivalent. We call two tags that are
equivalent according to a combiner synonyms. A set of synonymous tags is called a
synonym set. Combiners are regular users of LibraryThing who do not work directly
for us. While we assume their work to be correct and complete in our analysis, they
do have two notable biases: they are strict in what they consider a synonym (e.g.,
“humour” as British comedy is not a synonym of “humor” as American comedy) and
they may focus more on finding synonyms of popular, mature tags.
We write the synonym set of ti, including itself, as S(ti). We calculate the entropy
H(ti) (based on the probability p(tj) of each tag) of a synonym set S(ti) as:

p(tj) = oc(tj) / Σ_{tk ∈ S(tj)} oc(tk)

H(ti) = − Σ_{tj ∈ S(ti)} p(tj) log2 p(tj)
H(ti) measures the entropy of the probability distribution that we get when we assume
that an annotator will choose a tag at random from a synonym set with probability
in proportion to its object count. For example, if there are two equally likely tags in
a synonym set, H(ti) = 1. If there are four equally likely tags, H(ti) = 2. The higher
the entropy, the more uncertainty that an annotator will have in choosing which tag to
annotate from a synonym set, and the more uncertainty a user will have in determining
which tag to use to find the right objects. We believe low entropy is generally better
than high entropy, though it may be desirable under some circumstances (like query
expansion) to have high entropy synonym sets.

Figure 4.1: Synonym set frequencies. ("Frequency of Count" is the number of times synonym sets of the given size occur.)

Figure 4.2: Tag frequency versus synonym set size.
Details
Due to the lack of a controlled vocabulary, tags will inevitably have synonymous
forms. The best we can hope for is that users ultimately “agree” on a single form,
by choosing one form over the others much more often. For example, we hope that if
the tag “fiction” annotates 500 works about fiction, that perhaps 1 or 2 books might
be tagged “fictionbook” or another uncommon synonym. For this experiment, we use
the top 2000 LibraryThing tags and their synonyms.
Most tags have no synonyms, though a minority have as many as tens of synonyms
(Figure 4.1). The largest synonym set is 70 tags (synonyms of "19th century").
Contrary to what one might expect, |S(ti)| is not strongly correlated with oc(ti),
as shown in Figure 4.2 (Kendall's τ ≈ 0.208).
Figure 4.3: H(ti) (Top 2000, ≠ 0)
Figure 4.3 is a histogram of the entropies of the top 2000 tags, minus those synonym
sets with an entropy of zero. In 85 percent of cases, H(ti) = 0. The highest
entropy synonym set, at H(ti) = 1.56, is the synonym set for the tag "1001bymrbfd,"
or "1001 books you must read before you die." Fewer than fifteen tags (out of
2000) have an entropy above 0.5. The extremely low entropies of most synonym sets
suggest that most tags have a relatively definitive form.
4.3.2 Cross-System Annotation Use
Summary
Library Feature: Across tagging systems, we would like to see the systems use
the same vocabulary of tags because they are annotating the same type of objects—
works.
Result: The top 500 tags of LibraryThing and Goodreads have an intersection of
almost 50 percent.
Conclusion: Similar systems have similar tags, though tagging system owners should
encourage short tags.
Preliminaries: Information Integration
Federation is when multiple sites share data in a distributed fashion allowing them
to combine their collections. Information integration is the process of combining,
de-duplicating, and resolving inconsistencies in the shared data. Two useful features
for information integration are consistent cross-system annotation use and consis-
tent cross-system object annotation. We say two systems have consistent annotation
use if the same annotations are used overall in both systems (this section). We say
two systems have consistent object annotation if the same object in both systems is
annotated similarly (Section 4.3.3). Libraries achieve these two features through “au-
thority control” (the process of creating controlled lists of headings) and professional
catalogers.
Details
For both LibraryThing and Goodreads, we look at the top 500 tags by object count.
Ideally, a substantial portion of these tags would be the same, suggesting similar
tagging practices. Differences in the works and users in the two systems will lead to
some differences in tag distribution. Nonetheless, both are mostly made up of general
interest books and similar demographics.
The overlap between the two sets is 189 tags, or about 38 percent of each top 500
list.3 We can also match by determining if a tag in one list is in the synonym set of
a tag in the other list. This process leads to higher overlap—231 tags, or about 46
percent. The higher overlap suggests “combiners” are more helpful for integrating two
systems than for improving navigation within their own system. An overlap of nearly
50 percent of top tags seems quite high to us, given that tags come from an unlimited
vocabulary, and books can come from the entire universe of human knowledge.
Much of the missed overlap can be accounted for by the prevalence of multi-word
tags on Goodreads. Multi-word tags lead to less overlap with other users, and less
overlap across systems. We compute the number of words in a tag by splitting on
spaces, underscores, and hyphens. On average, tags in the intersection of the two
systems have about 1.4 words. However, tags not in the intersection have an average
of 1.6 words in LibraryThing, and 2.3 words in Goodreads. This implies that for
tagging to be federated across systems, users should be encouraged to use fewer words.
3 Note that comparing sets at the same 500 tag cutoff may unfairly penalize border tags (e.g., "vampires" might be tag 499 in LT but tag 501 in GR). We use the simpler measurement above, but we also conducted an analysis comparing, e.g., the top 500 in one system to the top 1000 in the other system. Doing so increases the overlap by ≈ 40 tags.
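The overlap and word-count measurements above can be sketched as follows, with tiny invented top-tag lists and synonym sets standing in for the real top-500 data.

```python
import re

# Invented stand-ins for the real top tag lists of the two systems.
lt_top = ["fiction", "non-fiction", "history", "sci-fi", "humour"]
gr_top = ["fiction", "history", "fantasy", "to-read", "humor"]
synonyms = {"humour": {"humor"}, "humor": {"humour"}}

# Direct overlap, then overlap allowing a match via a tag's synonym set.
direct = set(lt_top) & set(gr_top)
via_synonyms = {t for t in lt_top
                if t in gr_top or synonyms.get(t, set()) & set(gr_top)}

def word_count(tag):
    # Split on spaces, underscores, and hyphens, as in the text.
    return len([w for w in re.split(r"[ _-]", tag) if w])

print(sorted(direct))                    # ['fiction', 'history']
print(sorted(via_synonyms))              # adds 'humour' via its synonym set
print(word_count("class reading list"))  # 3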
While there are 231 tags in the overlap between the systems (with synonyms),
it is also important to know if these tags are in approximately the same ranking.
Is "fantasy" used substantially more than "humor" in one system? We computed
a Kendall's τ rank correlation of τ ≈ 0.44 between the LibraryThing and Goodreads
rankings of the 231 tags in the overlap. This means that if we choose
any random pair of tags in both rankings, it is a little over twice as likely that the
pair of tags is in the same order in both rankings as it is that the pair will be in a
different order.
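The "twice as likely" reading follows because, for Kendall's τ, a random pair is concordant with probability (1 + τ)/2, giving a concordant-to-discordant ratio of (1 + τ)/(1 − τ). A sketch, with a naive O(n²) τ computation over invented rankings:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Naive Kendall's tau over the items shared by two rankings (no ties)."""
    items = [x for x in rank_a if x in rank_b]
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        same_order = (rank_a.index(x) < rank_a.index(y)) == \
                     (rank_b.index(x) < rank_b.index(y))
        if same_order:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Identical rankings give tau = 1, reversed rankings give tau = -1.
print(kendall_tau(["fantasy", "humor", "history"],
                  ["fantasy", "humor", "history"]))  # 1.0

# At tau = 0.44, the concordant-to-discordant ratio is:
tau = 0.44
print(round((1 + tau) / (1 - tau), 2))  # 2.57
```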
4.3.3 Cross-System Object Annotation
Summary
Library Feature: We would like annotators to be consistent, in particular, the same
work in two different tagging systems should be annotated with the same, or a similar
distribution, of tags. In other words, does “Winnie-the-Pooh” have the same set of
tags in LibraryThing and Goodreads?
Result: Duplicate objects across systems have low Jaccard similarity in annotated
tags, but high cosine similarity.
Conclusion: Annotation practices are similar across systems for the most popular
tags of an object, but often less so for less common tags for that object.
Details
We limited our analysis to works in both LibraryThing and Goodreads, where
Goodreads has at least 25 tags for each book. This results in 787 works. Ideally,
for each work, the tags would be almost the same, implying that given the same
source object, users of different systems will tag similarly.
Figures 4.4, 4.5, and 4.6 show distributions of similarities of tag annotations for
the same works across the systems. We use Jaccard similarity for set similarity (i.e.,
each annotation counts as zero or one), and cosine similarity for similarity with bags
(i.e., counts). Because the distributions are peaked, Jaccard similarity measures how
many annotations are shared, while cosine similarity measures overlap of the main
annotations.

Figure 4.4: Distribution of same book similarities using Jaccard similarity over all tags.

Figure 4.5: Distribution of same book similarities using Jaccard similarity over the top twenty tags.

Figure 4.6: Distribution of same book similarities using cosine similarity over all tags.
Figure 4.4 shows that the Jaccard similarity of the tag sets for a work in the
two systems is quite low. For example, about 150 of the 787 works have a Jaccard
similarity of the two tag sets between 0.02 and 0.03. One might expect that the issue
is that LibraryThing has far more tags than Goodreads, and that these tags increase
the size of the union substantially. To control for this, in Figure 4.5 we take the
Jaccard similarity of the top 20 tags for each work. Nonetheless,
this does not hugely increase the Jaccard value in most cases. Figure 4.6 shows the
distribution of cosine similarity values. (We treat tags as a bag of words and ignore
three special system tags.) Strikingly, the cosine similarity for the same work is
actually quite high. This suggests that for the same work, the most popular tags are
likely to be quite popular in both systems, but that overall relatively few tags for a
given work will overlap.
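The two similarity measures can be sketched as follows; the tag bags are invented to illustrate why cosine stays high while Jaccard stays low for the same work.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Set similarity: shared tags over all tags used for the work."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Bag similarity over tag counts: dominated by the most popular tags."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

# Hypothetical tag bags for one work in two systems: the head of the
# distribution agrees, the long tail does not.
lt = ["fiction"] * 90 + ["bears"] * 10 + ["hundred acre wood"]
gr = ["fiction"] * 80 + ["bears"] * 5 + ["childhood"]

print(round(jaccard(lt, gr), 2))  # 0.5 here; far lower with real long tails
print(round(cosine(lt, gr), 2))   # 1.0: the popular tags dominate
```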
4.3.4 $-tag Annotation Overlap
Summary
Library Feature: We would like paid taggers to be able to annotate objects in a
way that is consistent with users. This reduces dependence on users, and means that
unpopular objects can be annotated for a fee.
Result: $-tags produced by paid taggers overlap with user tags on average 52 percent
of the time.
Conclusion: Tagging systems can use paid taggers.
Preliminaries: $-tag Tagging Setup
This section asks whether the terms used when paid taggers annotate objects with
$-tags are the same as the terms used when regular users annotate objects with
user tags. We randomly selected works from the “min100” dataset with at least three
unique li ∈ LLM . We then showed paid taggers (in our case, Mechanical Turk workers)
a search for the work (by ISBN) on Google Book Search and Google Product Search,
two searches which generally provide a synopsis and reviews, but do not generally
provide library metadata like subject headings.

Figure 4.7: Overlap Rate Distribution.
provide library metadata like subject headings. The paid taggers were asked to add
three $-tags which described the given work. Each work was labeled by at least three
paid taggers, but different paid taggers could annotate more or fewer books (this is
standard on Mechanical Turk). We provided 2,000 works to be tagged with 3
$-tags each. Some paid taggers provided an incomplete set of $-tags, leading to a
total of 16,577 $-tags. Paid taggers spent ≈ 90 seconds per work, and we usually
spent less than $0.01 per $-tag/work pair. (We analyze $-tags in Sections 4.3.4, 4.4.2,
and 4.4.3.)
Details
On average, 52% of $-tags matched a tag ti already applied to the work at least
once (standard deviation 0.21). Thus, paid taggers, who in the vast majority of
cases had not read the book, overlapped with real book readers more than half
the time in the $-tags they applied. A natural followup question is whether some
workers are much better at paid tagging than others. We found a range of “overlap
rates” among paid taggers (shown in Figure 4.7), but we are unsure whether higher
performance could be predicted in advance.
4.4 Experiments: Quality
The experiments in this section look at quality :
Section 4.4.1 Are the bulk of tags of high quality types? For example, are subjective
tags like “stupid” common?
Section 4.4.2 Are $-tags high quality in comparison to library annotations and user
tags?
Section 4.4.3 Can we characterize high quality user tags?
4.4.1 Objective, Content-based Groups
Summary
Library Feature: Works should be organized objectively based on their content.
For example, we would prefer a system with groups of works like “History” and
“Biography,” to one with groups of works like “sucks” and “my stuff.”
Result: Most tags in both of our social cataloging sites were objective and content-
based. Not only are most very popular tags (oc(ti) > 300) objective and content-
based, but so are less popular and rare tags.
Conclusion: Most tags, rather than merely tags that become very popular, are
objective and content-based, even if they are only used a few times by one user.
Preliminaries: Tag Types
We divide tags into six types:
Objective and Content-based Objective means not depending on a particular an-
notator for reference. For example, “bad books” is not an objective tag (because
one needs to know who thought it was bad), whereas “world war II books” is
an objective tag. Content-based means relating to the book contents (e.g., the
story, facts, genre). For example, “books at my house” is not a content-based
tag, whereas “bears” is.
Opinion The tag implies a personal opinion. For example, “sucks” or “excellent.”
Personal The tag relates to personal or community activity or use. For example,
“my book”, “wishlist”, “mike’s reading list”, or “class reading list”.
Physical The tag describes the book physically. For example, “in bedroom” or
“paperback”.
78 CHAPTER 4. TAGGING HUMAN KNOWLEDGE
                                 LT%     GR%
Objective, Content of Book      60.55   57.10
Personal or Related to Owner     6.15   22.30
Acronym                          3.75    1.80
Unintelligible or Junk           3.65    1.00
Physical (e.g., “Hardcover”)     3.55    1.00
Opinion (e.g., “Excellent”)      1.80    2.30
None of the Above                0.20    0.20
No Annotator Majority           20.35   14.30
Total                          100     100

Table 4.1: Tag types for top 2000 LibraryThing and top 1000 Goodreads tags as percentages.
Acronym The tag is an acronym that might mean multiple things. For example,
“sf” or “tbr”.
Junk The tag is meaningless or indecipherable. For example, “b” or “jiowefijowef”.
Details
If a tagging system is primarily made up of objective, content-based tags, then it is
easier for users to find objects. In a library system, all annotations are objective and
content-based in that they do not depend on reference to the annotator, and they
refer to the contents of the book.
To produce an unbiased view of the types of tags in our sites, we used Mechanical
Turk. We submitted the top 2,000 LibraryThing tags and top 1,000 Goodreads tags
by annotation count to be evaluated. We also sampled 1,140 LibraryThing tags,
20 per rounded value of log(oc(ti)), from 2.1 to 7.7. We say a worker provides a
determination of the answer to a task (for example, the tag “favorite” is an opinion).
Overall, 126 workers examined 4,140 tags, five workers to a tag, leading to a total of
20,700 determinations. We say the inter-annotator agreement is the pair-wise fraction
of times two workers provide the same answer. The inter-annotator agreement rate
was about 65 percent.
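The inter-annotator agreement computation can be sketched as follows (a minimal illustration; the function names and the toy labels are ours, not from the study):

```python
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of worker pairs that gave the same answer for one tag."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def mean_agreement(determinations):
    """Average the per-tag pairwise agreement over all tags."""
    rates = [pairwise_agreement(v) for v in determinations.values()]
    return sum(rates) / len(rates)

# Five workers labeled one tag; 6 of the C(5,2) = 10 pairs agree.
labels = ["opinion", "opinion", "opinion", "personal", "opinion"]
print(pairwise_agreement(labels))  # 0.6
```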
Table 4.1 shows the proportion of top tags by type for LibraryThing and Goodreads.
Figure 4.8: Conditional density plot [39] showing the probability of (1) annotators agreeing a tag is objective, content-based, (2) annotators agreeing on another tag type, or (3) no majority of annotators agreeing.
For example, for 60.55% of the top 2000 LibraryThing tags (i.e., 1211/2000), at least three
of five workers agreed that the tag was objective and content-based. The results show
that regardless of the site, a majority of tags tend to be objective, content-based tags.
In both sites, about 60 percent of the tags examined were objective and content-based.
Interestingly, Goodreads has a substantially higher number of “personal” tags than
LibraryThing. We suspect that this is because Goodreads calls tags “bookshelves”
in their system.
Even if we look at tags ranging from oc(ti) = 8 to oc(ti) = 2208, as shown in Figure
4.8, the proportion of objective, content-based tags remains very high. That figure
shows the probability that a tag will be objective and content-based conditioned on
knowing its object count. For example, a tag annotating 55 objects has about a 50
percent chance of being objective and content-based.
4.4.2 Quality Paid Annotations
Summary
Library Feature: We would like to purchase annotations of equal or greater quality
to those provided by users.
Result: Judges like $-tags as much as subject headings.
Conclusion: Paid taggers can annotate old objects where users do a poor job of
providing coverage and new objects which do not yet have tags. Paid taggers can
quickly and inexpensively tag huge numbers of objects.
Preliminaries: $-tag Judging Setup
In this section and the next, we evaluate the relative perceived helpfulness of annota-
tions ti ∈ T , $i ∈ $ and li ∈ LLM . We randomly selected 60 works with at least three
tags ti ∈ T and three LCSH terms li ∈ LLM from our “min100” dataset.
We created tasks on the Mechanical Turk, each of which consisted of 20 subtasks
(a “work set”), one for each of 20 works. Each subtask consisted of a synopsis of the
work oi and an annotation evaluation section. A synopsis consisted of searches over
Google Books and Google Products as in Section 4.3.4. The annotation evaluation
section showed nine annotations in random order, three each from T (oi), $(oi), and
LLM(oi), and asked how helpful the given annotation would be for finding works
similar to the given work oi on a scale of 1 (“not at all helpful”) to 7 (“extremely
helpful”).
We removed three outlier evaluators who either skipped excessive numbers of evaluations
or awarded excessive numbers of the highest score. Remaining missing values
were replaced by group means. That is, a missing value for a work/annotation/evaluator
triplet was replaced by the mean of the helpfulness scores from the evaluators
who had provided scores for that work/annotation pair. We abbreviate “helpfulness score” as h-score
in the following. We say that annotations ti ∈ T , $i ∈ $, and li ∈ LLM differ in their
annotation type.
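The group-mean imputation described above can be sketched as follows (the data layout is our own; we assume the mean is taken over the evaluators who did score the same work/annotation pair):

```python
def impute_group_means(h_scores):
    """h_scores: {(work, annotation): {evaluator: h-score or None}}.
    Fill a missing score with the mean over the evaluators who did
    score that (work, annotation) pair."""
    for per_eval in h_scores.values():
        present = [h for h in per_eval.values() if h is not None]
        group_mean = sum(present) / len(present)
        for evaluator, h in per_eval.items():
            if h is None:
                per_eval[evaluator] = group_mean
    return h_scores

scores = {("o1", "t_history"): {"e1": 6, "e2": None, "e3": 4}}
impute_group_means(scores)
print(scores[("o1", "t_history")]["e2"])  # (6 + 4) / 2 = 5.0
```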
Details
In order to understand the perceived quality of $-tags, we wondered if, given the works
that each evaluator saw, they tended to prefer $-tags, user tags, or LCSH on average.
To answer this question, we produced a mean of means for each annotation type
(i.e., $-tags, user tags, and LCSH main topics) to compare to the other annotation
types. We do so by averaging the annotations of a given type within a given evaluator
H-Scores (by Evaluator)    µ      SD
User Tags                 4.46   0.75
LCSH Main Topics          5.18   0.76
$-tags                    5.22   0.83

Table 4.2: Basic statistics for the mean h-score assigned by evaluators to each annotation type. Mean (µ) and standard deviation (SD) are abbreviated.
(i.e., to determine what that evaluator thought) and then by averaging the averages
produced by each evaluator across all evaluators.
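The mean-of-means computation can be sketched as follows (the triple-based data layout is our own simplification):

```python
from collections import defaultdict

def mean_of_means(scores):
    """scores: list of (evaluator, annotation_type, h_score) triples.
    First average within each (evaluator, type), then average those
    per-evaluator means across evaluators."""
    per_eval = defaultdict(list)
    for evaluator, ann_type, h in scores:
        per_eval[(evaluator, ann_type)].append(h)
    by_type = defaultdict(list)
    for (evaluator, ann_type), hs in per_eval.items():
        by_type[ann_type].append(sum(hs) / len(hs))
    return {t: sum(ms) / len(ms) for t, ms in by_type.items()}

scores = [("e1", "user", 4), ("e1", "user", 6),  # e1 mean for user tags: 5
          ("e2", "user", 3),                     # e2 mean for user tags: 3
          ("e1", "$", 7), ("e2", "$", 5)]
print(mean_of_means(scores))  # {'user': 4.0, '$': 6.0}
```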
Table 4.2 summarizes the basic statistics by annotation type. For example, the
mean evaluator assigned a mean score of 4.46 to user tags, 5.18 to LCSH main topics,
and 5.22 to $-tags. At least for our 60 works, $-tags are perceived as being about
as helpful as LCSH library annotations, and both are perceived as better than user
tags (by about 0.6 h-score). A repeated measures ANOVA showed annotation type
differences in general to be significant, and all differences between mean h-scores by
annotation type were significant (p < 0.001) with the exception of the difference
between $-tags and LCSH main topics.
4.4.3 Finding Quality User Tags
Summary
Library Feature: We would like tag annotations to be viewed as competitive in
terms of perceived helpfulness with annotations provided by expert taxonomists.
Result: Moderately common user tags are perceived as more helpful than both LCSH
and $-tags.
Conclusion: Tags may be competitive with manually entered metadata created by
paid taggers and experts, especially when information like frequency is taken into
account.
H-Scores              µ      SD     µ 95% CI
$-tags               4.93   1.92   (4.69, 5.17)
Rare User Tags       4.23   2.11   (3.97, 4.50)
Moderate User Tags   5.80   1.47   (5.63, 5.98)
Common User Tags     5.27   1.72   (5.05, 5.48)
LCSH Main Topics     5.13   1.83   (4.91, 5.36)

Table 4.3: Basic statistics for the mean h-score assigned to a particular annotation type with user tags split by frequency. Mean (µ) and standard deviation (SD) are abbreviated.
Details
Section 4.4.2 would seem to suggest that tags ti ∈ T are actually the worst possible
annotation type because the average evaluator gave $-tags and LCSH main topics
a mean h-score 0.6 higher than user tags. Nonetheless, in practice we found that
tags ti ∈ T (oi) often had higher h-scores for the same object oi than corresponding
annotations $i ∈ $(oi) and li ∈ LLM(oi). It turns out that this discrepancy can be
explained in large part by the popularity of a user tag.
We define pop(oi, tm) to be the percentage of the time that tag tm is assigned to
object oi. For example, if an object oi has been annotated (“food”, “food”, “cuisine”,
“pizza”) then we would say that pop(oi, tfood) = 2/4. We partitioned the h-scores for T
into three sets based on the value pop(oi, tm) of the annotation. Those sets were user
tag annotations with pop(oi, tm) < 0.11 (“rare”), those with 0.11 ≤ pop(oi, tm) < 0.17
(“moderate”), and those with 0.17 ≤ pop(oi, tm) (“common”).4
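pop(oi, tm) and the three popularity bands can be sketched as follows (function names are ours; the thresholds are the ones given above):

```python
from collections import Counter

def pop(annotations, tag):
    """Fraction of an object's tag annotations that are `tag`."""
    counts = Counter(annotations)
    return counts[tag] / len(annotations)

def popularity_band(p, lo=0.11, hi=0.17):
    # From the text: rare < 0.11 <= moderate < 0.17 <= common.
    if p < lo:
        return "rare"
    return "moderate" if p < hi else "common"

anns = ["food", "food", "cuisine", "pizza"]
print(pop(anns, "food"))                   # 0.5, i.e. 2/4
print(popularity_band(pop(anns, "food")))  # common
```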
Table 4.3 shows the basic statistics with these more fine grained categories on a per
evaluation basis (i.e., not averaging per annotator). For example, the 95% confidence
interval for the mean h-score of moderate popularity user tags is (5.63, 5.98), and the
mean h-score of $-tags is 4.93 in our sample. The ANOVA result, Welch-corrected to
adjust for unequal variances within the five annotation types, is (WelchF (4, 629.6) =
26.2; p < .001). All differences among these finer grained categories are significant,
4H-scores were sampled for the “common” set for analysis due to large frequency differences between rare user tags and more common tags. Values of pop(oi, tj) varied between less than 1 percent and 28 percent in our evaluated works.
with the exception of common user tags versus LCSH, common user tags versus $-
tags, and LCSH main topics versus $-tags.
Using the finer grained categories in Table 4.3 we can now see that moderately
common user tags are perceived as better than all other annotation types. (Fur-
thermore, rare user tags were dragging down the average in the analysis of Section
4.4.2.) We speculate that rare user tags are too personal and common user tags too
general. Despite some caveats (evaluators do not read the work, value of annotations
changes over time, works limited by LibraryThing availability), we are struck by the
fact that evaluators perceive moderately common user tags to be more helpful than
professional, expert-assigned library annotations.
4.5 Experiments: Completeness
The experiments in this section look at completeness:
Section 4.5.1 Do user tag annotations cover many of the same topics as professional
library annotations?
Section 4.5.2 Do user tags and library annotations corresponding to the same topic
annotate the same objects?
4.5.1 Coverage
Summary
Library Feature: We believe that after decades of consensus, libraries have roughly
the right groups of works. A system which attempts to organize works should end up
with groups similar to or a superset of library terms.
Result: Many top tags have equivalent (see below) library terms. Tags contain more
than half of the tens level DDC headings. There is a corresponding LCSH heading
for more than 65 percent of top objective, content-based tags.
Conclusion: Top tags often correspond to library terms.
Preliminaries: Containment and Equivalence
Our goal is to compare the groups formed by user tags and those formed by library
annotations. For instance, is the group defined by tag “History of Europe” equivalent
to the group formed by the library term “European History?” We can take two
approaches to defining equivalence. First, we could say that group g1 is equivalent
to group g2 if they both contain the same objects (in a given tagging system). By
this definition, the group “Art” could be equivalent to the group “cool” if users had
tagged all works annotated with the library term “Art” with the tag “cool.” Note
that this definition is system specific.
A second approach is to say that group g1 is equivalent to group g2 if the names
g1 and g2 “semantically mean the same.” Under this definition, “cool” and “Art”
are not equivalent, but “European History” and “History of Europe” are. The latter
equivalence holds even if there are some books that have one annotation but not the
other. For this definition of equivalence we assume there is a semantic test m(a, b)
that tells us if names a and b “semantically mean the same.” (We implement m by
asking humans to decide.)
In this chapter we use the second definition of equivalence (written g1 = g2). We
do this because we want to know to what extent library terms exist which are seman-
tically equivalent to tags (Section 4.5.1) and to what extent semantically equivalent
groups contain similar objects (Section 4.5.2).
When we compare groups, not only are we interested in equivalence, but also in
“containment.” We again use semantic definitions: We say a group g1 contains group
g2 (written g1 ⊇ g2) if a human that annotates an object o with g2 would agree that o
could also be annotated with g1. Note that even though we have defined equivalence
and containment of groups, we can also say that two annotations are equivalent or
contain one another if the groups they name are equivalent or contain one another.
Preliminaries: Gold Standard (tj, li) Relationships
In this section and the next, we look at the extent to which tags tj and library terms
li satisfy similar information needs. We assume a model where users find objects
(a) Sampled Containment Relationships (con-pairs)

Tag          Contained Library Term
spanish      romance → spanish (lc pc 4001.0-4978.0)
pastoral     pastoral theology (lc bv 4000.0-4471.0)
civil war    united states → civil war period, 1861-1865 → civil war, 1861-1865 → armies. troops (lc e 491.0-587.0)
therapy      psychotherapy (lcsh)
chemistry    chemistry → organic chemistry (lc qd 241.0-442.0)

(b) Sampled Equivalence Relationships (eq-pairs)

Tag              Equivalent Library Term
mammals          zoology → chordates. vertebrates → mammals (lc ql 700.0-740.8)
fitness          physical fitness (lcsh)
catholic church  catholic church (lcsh)
golf             golf (lcsh)
astronomy        astronomy (lc qb 1.0-992.0)

Table 4.4: Randomly sampled containment and equivalence relationships for illustration.
using single annotation queries. If tj = li for a given (tj, li), we say (tj, li) is an
eq-pair. If tj ⊇ li for a given (tj, li), we say (tj, li) is a con-pair. In this section, we
look for and describe eq-pairs (where both annotations define the same information
need) and con-pairs (where a library term defines a subset of an information need
defined by a tag). In Section 4.5.2, we use these pairs to evaluate the recall of single
tag queries—does a query for tag tj return a high proportion of objects labeled with
library terms equivalent or contained by tj? For both sections, we need a set of gold
standard eq- and con-pairs.
Ideally, we would identify all eq- and con-pairs (tj, li) ∈ T × L. However, this is
prohibitively expensive. Instead, we create our gold standard eq- and con-pairs as
follows:
Step 1 We limit the set of tags under consideration. Specifically, we only look at tags
in T738: the 738 tags from the top 2,000 which were unanimously considered
objective and content-based in Section 4.4.1. (These 738 tags are present in
about 35% of tag annotations.)
Step 2 We identify (tj, li) pairs that are likely to be eq- or con-pairs based on how tj
and li are used in our dataset. First, we drop all (tj, li) pairs that do not occur
together on at least 15 works. Second, we look for (tj, li) pairs with high values
of q(tj, li) = (P(tj, li) − P(tj)P(li)) × |O(li)|. q(tj, li) is inspired by leverage
(P(tj, li) − P(tj)P(li)) from the association rule mining community [59], though
with bias (|O(li)|) towards common relationships. We drop all (tj, li) pairs that
do not have q(tj, li) in the top ten for a given tag tj.
Step 3 We (the researchers) manually examine pairs output from Step 2 and judge
if they are indeed eq- or con-pairs. At the end of this step, our gold standard
eq- and con-pairs have been determined.
Step 4 We evaluate our gold standard using Mechanical Turk workers. We do not
change any eq- or con-pair designations based on worker input, but this step
gives us an indication of the quality of our gold standard.
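The Step 2 filter can be sketched as follows (the count-based inputs and function names are ours, not from the original pipeline):

```python
def q_score(n_both, n_tag, n_lib, n_works):
    """Leverage biased toward common library terms:
    q(t, l) = (P(t, l) - P(t)P(l)) * |O(l)|."""
    p_both = n_both / n_works
    p_tag = n_tag / n_works
    p_lib = n_lib / n_works
    return (p_both - p_tag * p_lib) * n_lib

def candidate_pairs(cooccur, tag_counts, lib_counts, n_works,
                    min_cooccur=15, top_k=10):
    """cooccur: {(tag, lib_term): number of works annotated with both}.
    Keep, for each tag, the top_k pairs by q among those co-occurring
    on at least min_cooccur works."""
    per_tag = {}
    for (t, l), n in cooccur.items():
        if n < min_cooccur:
            continue  # first filter: co-occurrence on >= 15 works
        q = q_score(n, tag_counts[t], lib_counts[l], n_works)
        per_tag.setdefault(t, []).append((q, l))
    return {t: sorted(pairs, reverse=True)[:top_k]
            for t, pairs in per_tag.items()}

# e.g., a pair seen together on 20 of 1000 works, with the tag on 100
# works and the library term on 50: q = (0.02 - 0.1 * 0.05) * 50 = 0.75
```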
The filtering procedures in Steps 1 and 2 allowed us to limit our manual evaluation
to 5,090 pairs in Step 3. (Though the filtering procedures mean we are necessarily
providing a lower bound on the eq- and con-pairs present in the data.) In Step 3,
we found 2,924 con-pairs and 524 eq-pairs. (Table 4.4 shows random samples of
relationships produced.)
To evaluate our gold standard in Step 4, we provided Mechanical Turk workers
with a random sample of eq- and con-pairs from Step 3 in two scenarios. In a true/false
validation scenario, the majority of 20 workers agreed with our tj = li and tj ⊇ li
judgments in 64/65 = 98% of cases. However, they said that tj = li when tj ≠ li or tj ⊇ li
when tj ⊉ li in 34/90 = 38% of cases, making our gold standard somewhat conservative.
A χ2 analysis of the relationship between the four testing conditions (true con-pair,
false con-pair, true eq-pair, and false eq-pair) shows a strong correlation between
containment/equivalence examples and true/false participant judgments (χ2(3) =
45.3, p < .001). In a comparison scenario where workers chose which of two pairs
they preferred to be an eq- or con-pair, the majority of 30 workers agreed with our
judgments in 138/150 = 92% of cases.
Details
In this analysis, we ask if tags correspond to library annotations. We ask this question
in two directions: how many top tags have equivalent or contained library annotations,
and how many of the library annotations are contained or equivalent to top tags?
Assuming library annotations represent good topics, the first direction asks if top
tags represent good topics, while the second direction asks what portion of those
good topics are represented by top tags.
In this section and the next, we use an imaginary “System I” to illustrate coverage
and recall. System I has top tags {t1, t2, t3, t4}, library terms {l1, l2, l3, l4, l5, l6}, eq-
pairs {t1 = l1, t2 = l2}, and con-pairs {t3 ⊇ l3, t1 ⊇ l5}. Further, l3 ⊇ l4 based
on hierarchy or other information (perhaps l3 might be “History” and l4 might be
“European History”).
Looking at how well tags represent library terms in System I, we see that 2 of
the 4 unique tags appear in eq-pairs, so 2/4 of the tags have equivalent library terms.
Going in the opposite direction, we see that 2 out of 6 library terms have equivalent
tags, so what we call eq-coverage below is 2/6. We also see that 2 of the library terms
(l3, l5) are directly contained by tags, and in addition another term (l4) is contained
by l3. Thus, a total of 3 library terms are contained by tags. We call this fraction,
3/6, the con-coverage.
We now report these statistics for our real data. Of 738 tags in our data set, 373
appear in eq-pairs. This means at least half (373/738) of the tags have equivalent library
terms.5
To go in the opposite direction, we compute coverage by level in the library term
hierarchy, to gain additional insights. In particular, we use DDC terms which have
an associated value between 0 and 1000. As discussed in Section 2.1, if the value is
5Note that this is a lower bound; based on techniques in Chapter 5, we suspect more than 503/738
tags have equivalents.
               X00    XX0    XXX
Con-Coverage   0.3    0.65   0.677
Eq-Coverage    0.1    0.28   0.021

Table 4.5: Dewey Decimal Classification coverage by tags.
of the form X00, then the term is high level (e.g., 800 is Language and Literature);
if the value is of the form XX0 it is lower level, and so on (e.g., 810 is American
and Canadian Literature). We thus group the library terms into three sets, LX00,
LXX0 and LXXX . (Set LX00 contains all terms with numbers of the form X00).
For Lrs ∈ {LX00, LXX0, LXXX} being one of these groups, we define two metrics for
coverage:
concoverage(Lrs) = ( Σ_{li ∈ Lrs} 1{∃ tj ∈ T s.t. tj ⊇ li} ) / |Lrs|

eqcoverage(Lrs) = ( Σ_{li ∈ Lrs} 1{∃ tj ∈ T s.t. tj = li} ) / |Lrs|
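These two metrics, evaluated on the System I example above, can be sketched as follows (the helper names and the explicit hierarchy argument are ours):

```python
def con_coverage(lib_terms, con_contained, hierarchy_contained=()):
    """Fraction of lib_terms contained by some tag, either directly
    (con_contained) or via a term the hierarchy places under a
    tag-contained term (hierarchy_contained)."""
    covered = set(con_contained) | set(hierarchy_contained)
    return len(covered & set(lib_terms)) / len(lib_terms)

def eq_coverage(lib_terms, eq_terms):
    """Fraction of lib_terms with a semantically equivalent tag."""
    return len(set(eq_terms) & set(lib_terms)) / len(lib_terms)

# System I: eq-pairs t1 = l1, t2 = l2; con-pairs t3 ⊇ l3, t1 ⊇ l5;
# and l3 ⊇ l4 from the hierarchy.
L = ["l1", "l2", "l3", "l4", "l5", "l6"]
print(eq_coverage(L, ["l1", "l2"]))           # 2/6 ≈ 0.333
print(con_coverage(L, ["l3", "l5"], ["l4"]))  # 3/6 = 0.5
```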
Table 4.5 shows these metrics for our data. For example, the first row, second column
says that 65/100 of XX0 DDC terms are contained by a tag. (More specifically, 65 percent
of XX0 terms have this property: the term li is in LXX0 and there is a con-pair (tj, li)
in our gold standard, or there is a (tj, lk) con-pair where lk ∈ LX00 and lk ⊃ li.) The
second row, third column says that 21/1000 DDC ones-level terms li ∈ LXXX have an
eq-pair (tj, li). About one quarter of XX0 DDC terms have equivalent T738 tags.
4.5.2 Recall
Summary
Library Feature: A system should not only have the right groups of works, but it
should have enough works annotated in order to be useful. For example, a system
with exactly the same groups as libraries, but with only one work per group (rather
than, say, thousands) would not be very useful.
Result: Recall is low (10 to 40 percent) using the full dataset. Recall is high (60 to
100 percent) when we focus on popular objects (min100).
Conclusion: Tagging systems provide excellent recall for popular objects, but not
necessarily for unpopular objects.
Preliminaries: Recall
Returning to our System I example, say that l1 annotates {o1, o3}, and l5 annotates
{o4, o5}. Because t1 is equivalent to l1, and contains l5, we expect that any work
labeled with either l1 or l5 could and should be labeled with t1. We call o1, o3, o4, o5
the potential objects for tag t1. Our goal is to see how closely the potential object
set actually matches the set of objects tagged with t1. For instance, suppose that t1
actually annotates {o1, o2}. Since t1 annotates one of the four potential works, we
say that recall(t1) = 1/4.
More formally, if li = tj, then we say li ∈ E(tj). If tj ⊇ li, then we say li ∈ C(tj).
Any object annotated with terms from either E(tj) or C(tj) should also have a tag tj.
Hence, the potential object set for a tag based on its contained or equivalent library
terms is:
Ptj = ⋃_{li ∈ E(tj) ∪ C(tj)} O(li)
We define recall to be the recall of a single tag query on relevant objects according
to our gold standard library data:
recall(tj) = |O(tj) ∩ Ptj| / |Ptj|
The Jaccard similarity between the potential object set and the objects
annotated by the tag is:

J(O(tj), Ptj) = |O(tj) ∩ Ptj| / |O(tj) ∪ Ptj|
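The potential object set, recall, and Jaccard definitions, evaluated on the System I example, can be sketched as follows (the dictionary layouts are ours):

```python
def potential_objects(tag, eq, con, obj_of):
    """Union of objects annotated by library terms equivalent to
    (eq[tag]) or contained by (con[tag]) the given tag.
    obj_of: {lib_term: set of objects annotated with that term}."""
    terms = eq.get(tag, set()) | con.get(tag, set())
    out = set()
    for l in terms:
        out |= obj_of[l]
    return out

def recall(tagged, potential):
    return len(tagged & potential) / len(potential)

def jaccard(tagged, potential):
    return len(tagged & potential) / len(tagged | potential)

# System I: t1 = l1 and t1 ⊇ l5; l1 annotates {o1, o3}, l5 {o4, o5}.
obj_of = {"l1": {"o1", "o3"}, "l5": {"o4", "o5"}}
P = potential_objects("t1", {"t1": {"l1"}}, {"t1": {"l5"}}, obj_of)
tagged = {"o1", "o2"}            # objects actually tagged with t1
print(recall(tagged, P))         # 1/4 = 0.25
print(jaccard(tagged, P))        # 1/5 = 0.2
```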
Details
In this experiment, we ask whether the tags provided by users have good recall of
their contained library terms. An ideal system should have both good coverage (see
Section 4.5.1) and high recall of library terms.
We look at recall for the tags T603 ⊂ T738 that have at least one con-pair. Figures
Figure 4.9: Recall for 603 tags in the full dataset.
Figure 4.10: Recall for 603 tags in the “min100” dataset.
Figure 4.11: Jaccard for 603 tags in the full dataset.
4.9 and 4.10 show the distribution of recall of tags tj ∈ T603 using the full and min100
datasets. Figure 4.9 shows that using the full dataset, most tags have 10 to 40 percent
recall. For example, about 140 tags have recall between 10 and 20 percent. Figure 4.10
shows recall using the “min100” dataset. We can see that when we have sufficient
interest in an object (i.e., many tags), we are very likely to have the appropriate
tags annotated. Recall is often 80 percent and up. Lastly, Figure 4.11 shows the
distribution of Jaccard similarity between O(tj) and Ptj . For most tags, the set of
tag annotated works is actually quite different from the set of library term annotated
works, with the overlap often being 20 percent of the total works in the union or less.
The objects in O(tj)−Ptj are not necessarily incorrectly annotated with tj. Since we
know that many tags are of high quality (Section 4.4), a more likely explanation is
that the library experts missed some valid annotations.
4.6 Related Work
Our synonymy experiment in Section 4.3.1 is similar to previous work on synonymy
and entropy in tagging systems. Clements et al. [20] use LibraryThing synonym sets
to try to predict synonyms. By contrast, our goal was to determine if synonyms were
a problem, rather than to predict them. Chi et al. [18] used entropy to study the
evolution of the navigability of tagging systems. They look at entropy as a global
tool, whereas we use it as a local tool within synonym sets.
Our experiments relating to information integration in Sections 4.3.2 and 4.3.3
(primarily Section 4.3.2, however), share some similarities to Oldenburg et al. [57]
which looked at how to integrate tags across tagging systems, though that work is
fairly preliminary (and focused on the Jaccard measure). That work also focuses on
different sorts of tagging systems, specifically, social bookmarking and research paper
tagging systems, rather than social cataloging systems.
Our tag type experiment in Section 4.4.1 is related to work like Golder and Hu-
berman [31] and Marlow et al. [55] which looked at common tag types in tagging
systems. However, we believe our work is possibly the first to analyze how tag types
change over the long tail of tag usage (i.e., are less popular tags used differently from
more popular tags?).
Like Section 4.4.3, other work has found moderately common terms in a collection
to be useful. For instance, Haveliwala et al. [32] propose Nonmonotonic Document
Frequency (NMDF), a weighting scheme that assigns high weight to moderately frequent terms.
We are not aware of other work that has suggested this particular weighting for tags,
however.
The most related work to our experiments in Sections 4.5.1 and 4.5.2 is our own
later work discussed in Chapter 5. Some older work, for example, DeZelar-Tiedman
[23] and Smith [66] looks at the relationship between tagging and traditional library
metadata. However, these works tend to look at a few hundred books at most,
and focus on whether tags can enhance libraries. Also related to these experiments,
there has been some work on association rules in tagging systems, including work by
Schmitz et al. [62] and our own work in Chapter 3. However, that work focused on
prediction of tags (or other tagging system quantities). We believe our work is the
first to look at relationships between tags and library terms using methods inspired
by association rules.
We are unaware of other work either examining $-tags (or even suggesting paying
for tags) or attempting to understand how tagging works as a data management or
information organization tool (i.e., in the same sense as libraries) in a large-scale,
quantitative way.
4.7 Conclusion
We conducted a series of experiments that suggested that tagging systems tend to be
at least somewhat consistent, high quality, and complete. These experiments found
the tagging approach to be suitable for synonymy, information integration, paid an-
notation, programmatic filtering for quality, and for situations where an objective
and high recall set of annotations covering general topics is needed. In a span of only
a few years, LibraryThing has grown to tens of millions of books, and the groups
developed by taggers are quite close to the groups developed by professional tax-
onomists. This is a testament both to the taxonomists, who did a remarkable job of
choosing consensus controlled lists and classifications to describe books, and to tags
which are unusually adaptable to different types of collections. Strikingly, we found
that a particular type of user tag (moderately common user tags) is perceived as even
more helpful than expert assigned library annotations. These two sets of experiments
are mutually reinforcing. Overall, tags seem to do a remarkably good job of organiz-
ing data when viewed either quantitatively in comparison to “gold standard” library
metadata or qualitatively as viewed by human evaluators.
Chapter 5
Fallibility of Experts
The previous chapter looked at library metadata as a basis for evaluating tagging
systems as an information organization tool. We looked at questions like whether
such systems could be federated and whether tags correspond to taxonomies like the
Dewey Decimal Classification. Here, our focus is instead on the nature of keyword
annotations, and specifically how user and expert keyword annotations differ. (The
more recent experiments in this chapter also apply a semantic relatedness measure in
a novel way which we believe will be valuable more broadly in tagging systems.)
In this chapter, we ask whether a controlled vocabulary of library keywords called
the Library of Congress Subject Headings (LCSH) is different from the vocabulary
developed by the users of LibraryThing. We find that many LCSH keywords corre-
spond to tag keywords used by users of LibraryThing. However, we also find that
even though an LCSH keyword and a tag may be syntactically the same, often the
two keywords may annotate almost completely different groups of books. In our case,
the experts seem to have picked the right keywords, but perhaps annotated them to
the wrong books (from the users’ perspectives). Thus, the common practice on the
web of letting users organize their own data may be more appropriate. (This chapter
draws on material from Heymann et al. [33] which is primarily the work of the thesis
author.)
5.1 Notes on LCSH
In this chapter, we continue to use the library terms introduced in Section 4.1. How-
ever, here a library term lj is always an LCSH keyword. As noted before, LCSH
keywords come from a controlled vocabulary of hundreds of thousands of terms with
main and subtopics. Here, we treat all main and subtopics as separate keywords.
(When we refer to “LCSH keywords,” we mean the value of MARC 650. MARC 650,
strictly speaking, may include expert-assigned keywords from vocabularies other than
LCSH, but in practice is made up almost entirely of that vocabulary in our dataset.)
LCSH has some hierarchical structure. An LCSH keyword li has keywords which
are “broader than” that keyword ({lj, lk, . . .} ⊆ B(li)), “narrower than” that keyword
({lj, lk, . . .} ⊆ N(li)), and “related to” the keyword ({lj, lk, . . .} ⊆ R(li)). Unfortunately,
this structure is not particularly consistent [24], in that if lk ∈ B(lj) and
lj ∈ B(li), it may not be the case that lk ∈ B(li). In practice, books rarely have more
than three to six LCSH keywords, because the scheme was originally designed for card catalogs,
where space was at a premium. It is also common for only the most specific LCSH
keywords to be annotated to a book, even if more general keywords apply. Lastly,
because tags are annotated by regular users, and LCSH keywords are annotated by
paid experts, {oc(lj)|lj ∈ L} and {oc(ti)|ti ∈ T} are quite different. Tags tend to
focus on popular works, while keywords by paid experts annotate more works, less
densely.
5.2 Experiments
In our experiments, we use the dataset from Section 4.2, restricted to works found in
both LibraryThing and the Library of Congress, and consider only the 8,783 unique LCSH
keywords and 47,957 unique tags which annotate at least 10 works. Our research
question is, “how many keywords determined by expert consensus for LCSH are also
used as tags, and are these keywords used in the same way?” In the experiments
below, we divide this question as follows:
1. Section 5.2.1 asks whether LCSH keywords have syntactically equivalent tags.
(For example, tag “java” is equivalent to LCSH “Java.”)
2. Section 5.2.2 asks whether for a given syntactically equivalent (ti, lj) pair, ti
and lj have the same prominence in lists ranked by oc(ti) and oc(lj).
3. Section 5.2.3 asks if syntactically equivalent (ti, lj) pairs are used in the same
way by experts and users.
4. Section 5.2.4 asks whether LCSH keywords have semantically equivalent tags.
(For example, “jewish life” is semantically equivalent to “jewish way of life.”)
We do not replicate the experiments from Sections 5.2.2 and 5.2.3 for semantic
equivalence, but we expect less correlation and less similar usage between non-
syntactically but semantically equivalent keyword pairs.
5.2.1 Syntactic Equivalence
Definition
The tag “painters” and the LCSH keyword “Painters” are obviously equivalent key-
words. But is the tag “american science fiction” equivalent to “Science Fiction, Amer-
ican”? Is the tag “men in black” equivalent to “Men in Black (UFO Phenomenon)”?
We define two types of syntactic equivalence:
Exact The lower-cased tag is identical to the lower-cased LCSH keyword.
Almost Exact The lower-cased tag is identical to the lower-cased LCSH keyword
if the LCSH keyword is modified to remove parenthetical remarks, swap the
ordering of words around a comma, stem, or add or remove an “s.”
Our “painters” example is exactly equivalent, while the other two examples are al-
most exactly equivalent. If there exists a tag ti that is exactly or almost exactly
syntactically equivalent to lj , we say that lj ∈ Slcsh and (ti, lj) ∈ Spair.
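For concreteness, the two equivalence tests can be sketched in Python (a simplified sketch: stemming is omitted for brevity, and all function names are ours, not part of any system described here):

```python
import re

def normalize(s):
    return s.lower().strip()

def variants(lcsh):
    """Normalized variants of an LCSH keyword under the 'almost exact'
    rules: drop parenthetical remarks, swap the words around a comma,
    and add or remove a trailing 's'. (Stemming omitted for brevity.)"""
    base = normalize(lcsh)
    forms = {base}
    # "men in black (ufo phenomenon)" -> "men in black"
    forms.add(re.sub(r"\s*\([^)]*\)", "", base).strip())
    # "science fiction, american" -> "american science fiction"
    for f in list(forms):
        if "," in f:
            head, _, tail = f.partition(",")
            forms.add((tail.strip() + " " + head.strip()).strip())
    # add or remove a trailing "s"
    for f in list(forms):
        forms.add(f + "s")
        if f.endswith("s"):
            forms.add(f[:-1])
    return forms

def exact(tag, lcsh):
    return normalize(tag) == normalize(lcsh)

def almost_exact(tag, lcsh):
    return normalize(tag) in variants(lcsh)
```

Under these definitions, `exact("painters", "Painters")` holds, while the "american science fiction" and "men in black" examples pass only the almost-exact test.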
Results
We found that 3408 of the 8,783 LCSH keywords were exactly equivalent to a
tag, while an additional 838 of the 8,783 were almost exactly equivalent to a
tag. In all, about 48% of LCSH
keywords have equivalents according to one of the above two definitions. Such a high
98 CHAPTER 5. FALLIBILITY OF EXPERTS
Figure 5.1: Spinogram [40, 39] showing probability of an LCSH keyword having a corresponding tag based on the frequency of the LCSH keyword. (Log-scale.)
keyword overlap is all the more surprising given that many of the exactly equiva-
lent LCSH keywords are multiple words, for example, “Vernacular Architecture” or
“Quantum Field Theory.”
Cases where lj ∉ Slcsh are highly correlated with low oc(lj). Figure 5.1 shows the
distribution of syntactic equivalence (y-axis) based on oc(lj) (x-axis). For example, if
10 ≤ oc(lj) ≤ 15, there is about a 30 percent chance that lj ∈ Slcsh (and a 20 percent
chance that lj is exactly equivalent to some tag ti). By contrast, if 63 ≤ oc(lj) ≤ 100,
there is about a 70 percent chance that lj ∈ Slcsh. (We also suspect that longer LCSH
keywords may be less likely to have syntactically equivalent tags because tags tend
to be short.)
5.2.2 Rank Correlation of Syntactic Equivalents
Are syntactically equivalent (ti, lj) pairs equally popular within their respective anno-
tation types? For example, if the “java” tag annotates many works, does the “Java”
LCSH keyword also annotate many works? We create two rankings of {(ti, lj) ∈
Spair}, one ordered by oc(ti), the other ordered by oc(lj). We use Kendall’s tau rank
correlation to determine how similarly ranked the pairs are. For our data, τ ≈ 0.305.
This means that the pairs are somewhat, but not highly, positively correlated. The
experts and regular users have somewhat similar views of what the most important
keywords are, but they do still differ substantially.
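Kendall's τ simply compares the relative order of every pair of items across the two rankings. A minimal version follows (no tie correction; in practice one would use a library routine such as scipy.stats.kendalltau, and the occurrence counts in the example are made up for illustration):

```python
def kendall_tau(xs, ys):
    """Kendall rank correlation of two paired score lists:
    (#concordant pairs - #discordant pairs) / #pairs.
    +1 for identical orderings, -1 for reversed orderings."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical paired counts oc(t_i) (tag side) and oc(l_j) (LCSH side):
tau = kendall_tau([500, 120, 80, 40, 10], [300, 200, 50, 60, 5])  # 0.8
```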
5.2.3 Expert/User Annotator Agreement
Do experts and regular users use the same keywords in the same ways? For example,
many users in our dataset have annotated the book “The Wind in the Willows” with
the tag “children’s stories,” yet no expert has annotated the book with the LCSH
keyword “Children’s Stories.” We investigate the question of how common problems
like these are below, and find that they are quite common.
Jaccard Similarities
We define three measures to try to get an idea of how commonly (ti, lj) pairs annotate
the same books. We define symmetric Jaccard similarity as:
Jsym = |O(ti) ∩ O(lj)| / |O(ti) ∪ O(lj)|
For example, “children’s stories” (above) has Jsym = 0, while “origami” has Jsym =
0.75. We also define two asymmetric Jaccard similarity measures, one for tags and
one for LCSH:
Jtag(ti, lj) = |O(ti) ∩ O(lj)| / |O(ti)|
Jlcsh(ti, lj) = |O(ti) ∩ O(lj)| / |O(lj)|
Jsym gives the ratio of the size of the intersection of two annotations to their union, so
it may be dominated by one annotation if that annotation annotates many works. Jtag
tells us what portion of the tagged works are covered by the LCSH keyword, and Jlcsh
tells us what portion of LCSH annotated works are covered by the tag. For example,
“knitting” has Jlcsh = 0.97 but Jsym = 0.53 because even though almost all works in
O(lknitting) are in O(tknitting), |O(tknitting)| is about twice as large as |O(lknitting)|.
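The three measures are direct set computations; in Python (with hypothetical work sets echoing the "knitting" example, where the tag covers roughly twice as many works as the keyword):

```python
def j_sym(tag_works, lcsh_works):
    """Symmetric Jaccard: |intersection| / |union|."""
    return len(tag_works & lcsh_works) / len(tag_works | lcsh_works)

def j_tag(tag_works, lcsh_works):
    """Portion of tagged works covered by the LCSH keyword."""
    return len(tag_works & lcsh_works) / len(tag_works)

def j_lcsh(tag_works, lcsh_works):
    """Portion of LCSH-annotated works covered by the tag."""
    return len(tag_works & lcsh_works) / len(lcsh_works)

# Hypothetical O(t) and O(l): every keyword-annotated work is also
# tagged, but the tag covers twice as many works in total, so
# j_lcsh is 1.0 while j_sym is only 0.5.
O_t = set(range(100))
O_l = set(range(50, 100))
```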
Results
For most (ti, lj) ∈ Spair, O(ti)∩O(lj) is quite small. Figure 5.2(a) shows the distribu-
tion of Jsym for the 4,246 (ti, lj) pairs in Spair. The vast majority of such pairs have
less than 20% overlap in work coverage.
(a) Histogram of Jsym
(b) Histogram of Jsym, N(li) = ∅
Figure 5.2: Symmetric Jaccard Similarity.
A possible reason for small O(ti) ∩O(lj) could be that librarians only choose the
most specific appropriate LCSH keywords (see Section 5.1). In order to test this
hypothesis, we computed Jsym, but only over LCSH keywords which were at the
bottom of the LCSH hierarchy. In other words, we only chose li where N(li) = ∅.
Jsym values for these pairs, shown in Figure 5.2(b), are very similar to those in Figure
5.2(a). This leads us to believe that specificity is not the core reason user and expert
annotations differ.
Figures 5.3(a) and 5.3(b) show the values of Jlcsh and Jtag for the 4,246 pairs.
Both show predominantly low Jaccard values. Jlcsh does have slightly higher Jaccard
values, but it is still mostly below 0.4. A work labeled with an LCSH keyword is less
than 50 percent likely to be labeled with the corresponding tag. A work labeled with
a tag is even less likely to be labeled with the corresponding LCSH keyword.
(a) Histogram of Jlcsh
(b) Histogram of Jtag
Figure 5.3: Asymmetric Jaccard Similarity.
5.2.4 Semantic Equivalence
Are there semantically, rather than syntactically, equivalent tag/LCSH keyword pairs?
In other words, are there many pairs like “middle ages” and “Middle Ages, 500-1500”
where the meaning is the same, but the phrasing is slightly different? If so, how
many?
Definition
We use semantic relatedness to determine whether (ti, lj) pairs are semantically equiv-
alent. Semantic relatedness is a task where an algorithm gives a number between 0
and 1 for how related two words or phrases (w1, w2) are. For example, “vodka” and
“gin” are highly related (closer to 1) while “rooster” and “voyage” are not (closer to
0). We use an algorithm called Wikipedia Explicit Semantic Analysis (ESA) [30] to
calculate semantic relatedness. Wikipedia ESA represents each of w1 and w2 as a
weighted vector of Wikipedia concepts (articles) and takes the cosine similarity of
the two vectors as their relatedness. We write Wikipedia ESA as
a function sresa(ti, lj) → [0, 1].
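At its core, ESA reduces to a cosine similarity between sparse concept vectors. The sketch below assumes a precomputed word-to-concept weight index; the toy index here stands in for a real Wikipedia-derived one and is purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sr_esa(phrase1, phrase2, concept_index):
    """ESA-style relatedness in [0, 1]: map each phrase to a vector
    of Wikipedia-concept weights, then take the cosine."""
    def vec(phrase):
        v = {}
        for w in phrase.lower().split():
            for c, wt in concept_index.get(w, {}).items():
                v[c] = v.get(c, 0.0) + wt
        return v
    return cosine(vec(phrase1), vec(phrase2))

# Toy word -> {concept: weight} index (hypothetical):
idx = {
    "vodka":   {"Distilled_beverage": 2.0, "Russia": 1.0},
    "gin":     {"Distilled_beverage": 1.8, "Cocktail": 1.2},
    "rooster": {"Chicken": 2.5},
}
```

As in the "vodka"/"gin" example, phrases sharing heavily weighted concepts score high, while unrelated phrases score near zero.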
ESA   (ti, lj) pair
0.1   nature photography, indian baskets
0.2   fiction xxi c, angels in art
0.3   christian walk, women and peace
0.4   novecento/20th century, african american churches
0.5   20th century british literature, indians in literature
0.6   countries: italy, european economic community countries
0.7   medieval christianity, medieval, 500-1500
0.8   christian church, church work with the bereaved
0.9   detective and mystery fiction, detective and mystery stories
Table 5.1: Sampled (ti, lj) pairs with Wikipedia ESA values.
Figure 5.4: Conditional density plot showing probability of a (ti, lj) pair meaning that (ti, lj) could annotate {none, few, some, many, almostall, all} of the same books according to human annotators, based on the Wikipedia ESA score of the pair.
Understanding Wikipedia ESA Values
Table 5.1 shows representative Wikipedia ESA values for LCSH keywords lj ∉ Slcsh.
For example, for tag tma “middle ages” and LCSH keyword lma “Middle Ages, 500-
1500”, sresa(tma, lma) ≈ 0.98 (not shown). By contrast, for tnp “nature photogra-
phy” and lib “Indian Baskets”, sresa(tnp, lib) ≈ 0.1.
Figure 5.4 shows how Wikipedia ESA values translate into real relationships be-
tween (ti, lj) keyword pairs. We uniformly sampled (ti, lj) pairs where lj ∉ Slcsh by
sresa(ti, lj). We then asked human annotators how many books labeled with either ti
or lj would be labeled with both ti and lj. Figure 5.4 shows sresa values on the x-axis
Figure 5.5: Histogram of Top Wikipedia ESA for Missing LCSH and All Tags.
and the distribution of answers ∈ {none, few, some, many, almostall, all} on the y-
axis. For example, at sresa = 0.8, 20 percent of keyword pairs have many, almostall,
or all books in common (top three grays) according to human annotators. Likewise,
more than half of pairs at sresa = 0.8 have at least some books in common by this
measure. sresa is well correlated with how humans see the relationship between two
keywords.
Results
We ran Wikipedia ESA over all (ti, lj) pairs where lj ∉ Slcsh. Figure 5.5 shows
{max({sresa(ti, lj) | ti ∈ T}) | lj ∈ L − Slcsh}. That figure shows that most of the non-
syntactically equivalent LCSH keywords have a fairly semantically similar tag, with
a Wikipedia ESA value between 0.7 and 0.9. By simulation using the probabilities
from Figure 5.4, we estimate that ≈ 21 percent of lj ∉ Slcsh have a tag matching
all or almostall of the keyword and ≈ 56 percent have a tag matching many books
annotated with the keyword.
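The simulation is essentially a weighted expectation: each unmatched keyword contributes the probability, given its best ESA score, that human annotators would judge the pair to share many (or more) books. A sketch, with made-up per-bin probabilities standing in for the curves of Figure 5.4:

```python
# Hypothetical P(annotators judge "many or more books in common" | ESA),
# one value per 0.2-wide ESA bin; these numbers are illustrative only.
p_many_plus = [0.02, 0.10, 0.30, 0.55, 0.80]

def p_for(esa):
    """Look up the probability for the bin containing this ESA score."""
    return p_many_plus[min(4, int(esa * 5))]

def estimate(best_esa_scores):
    """Expected fraction of unmatched LCSH keywords whose best tag
    covers 'many' or more of the keyword's books."""
    return sum(p_for(s) for s in best_esa_scores) / len(best_esa_scores)
```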
5.3 Conclusion
This short chapter contrasted a mature controlled vocabulary built by experts over
decades with an uncontrolled vocabulary developed by hundreds of thousands of users
over a few years. We found that many (about 50 percent) of the keywords in the controlled
vocabulary also appear in the uncontrolled vocabulary, especially the more frequently annotated keywords.
We also found using a semantic relatedness measure that most of the remaining LCSH
keywords have similar, though not exactly equivalent, tags. This suggests that often
the keywords selected as controlled vocabulary keywords are the keywords that users
naturally use to describe works.1
However, we found little agreement as to how to apply shared keywords. Sets of
works annotated by corresponding LCSH keywords and tags rarely intersect signifi-
cantly. This is true even if we merely check whether a corresponding tag annotates
most of the works annotated by an LCSH keyword. This suggests one of three inter-
esting possibilities:
1. Users and experts use many of the same keywords, but ultimately differ heavily
as to how to apply them.
2. Experts are not allowed, or do not have time, to annotate works with all of the
appropriate keywords.
3. Experts only label highly representative works with a term, rather than all
works that might be considered to have the term, leading to low recall.
All of these possibilities are ultimately bad for retrieval using expert assigned con-
trolled vocabularies.
When users and experts differ in how they annotate objects, we believe it is
most reasonable to defer to the users. To say otherwise would be, in essence, to
tell users that they do not know how to organize their own collections of objects.
Ultimately, given that keywords are used by the users for navigation and browsing,
we should evaluate the usefulness of annotations from their perspective, rather than
the perspective of experts.
This chapter also suggests an interesting alternative view on the vocabulary prob-
lem [28], a long standing observation in the world of human-computer interaction.
The vocabulary problem suggests that given an object, people will choose many dif-
ferent names for that object. However, our work suggests that given a name (a tag in
our case), people, whether experts or not, may disagree substantially on what objects
1Keywords can be in one of three groups, but we focus on L ∩ T and L − T in this chapter, ignoring T − L. In Chapter 4, we found that about half of the 47,957 tags ti ∈ T are likely to be non-objective, non-content tags like “funny,” “tbr,” or “jiowef.” We suspect that the balance of T − L that is not syntactically equivalent to LCSH keywords is still either related to the LCSH keywords or describes completely different (objective, content-based) concepts.
that name should annotate.
Chapter 6
Human Processing
So far, we have focused on one particular type of microtask: tagging. Starting in this
chapter, we shift our focus to microtasks in general. In particular, we are interested in
how to write programs that use microtask marketplaces like Mechanical Turk. This
chapter begins by motivating our programming environment and methodology for
human microtasks, called the human processing model. Chapter 7 describes our first
attempt at an implementation of the human processing model, called the HPROC
system, and illustrates its usage through a sorting case study. Finally, Chapter 8
describes worker monitoring, a key part of recruiters (see Section 6.4), which are in
turn a major feature of the human processing model.
Why a programming environment and methodology for microtasks? Developing
a microtask-based application involves a lot of work, e.g., developing a web interface
for the human workers to receive their assignments and return their results, computer
code to divide the overall application into individual tasks to be done by humans,
computer code to collect results, and so on. However, with our programming envi-
ronment many of the programming steps that must be performed can be automated.
We start by giving a simple example (Section 6.1). We show how a programmer
would attack this example using two existing programming environments, which we
call Basic Buyer (Section 6.2) and Game Maker (Section 6.3). Then, we show how
the programmer would attack the same example using our novel proposed environ-
ment, Human Processing (Section 6.4). Finally, we contrast all three environments
and describe remaining challenges in the area (Section 6.5).
6.1 Motivating Example
“Priam,” the editor of a photography magazine, wants to rank photos submitted to
the magazine’s photo contest. For each environment below, we explain how Priam
might go about accomplishing this task.
6.2 Basic Buyer
The premise of the Basic Buyer (BB) environment is that workers do short microtasks
for pay, based on listings on a website (a marketplace). The BB environment is mod-
eled on usage of Amazon’s Mechanical Turk [1],1 though a similar environment could
be used with Gambit Tasks [2] or LiveWork [3]. However, because the programmer in
BB targets a marketplace directly, and interaction patterns with marketplaces vary,
switching marketplaces requires rewriting previous code.
The BB environment (Figure 6.1) works as follows:
1. The programmer (Priam) writes a normal program.
2. That program can, in the course of execution, create HTML forms at one or
more URLs. This creation of forms can happen in any of the usual ways that
people currently generate web forms using web application frameworks.
3. The program can also interact with a marketplace, a website where workers
(users on the Internet visiting the marketplace) look for tasks to complete. The
program can make one of five remote procedure calls to a monetary marketplace:
post(url, price) → taskid Tell the marketplace to display a link to url
with the information that, if completed, the worker will be paid price.
(We do not specify, but post might include other specifications like the
number of desired assignments, task title, or a task description.) The URL
1In particular, correspondences for the operations mentioned in this section are post → CreateHIT, assignments, get → GetAssignmentsForHIT, approve → ApproveAssignment, reject → RejectAssignment. We ignore bonuses and qualifications for ease of exposition.
Figure 6.1: Basic Buyer human programming environment. A human program generates forms. These forms are advertised through a marketplace. Workers look at posts advertising the forms, and then complete the forms for compensation.
url should correspond to a form which performs an HTTP POST to the
marketplace. The returned identifier taskid gives a handle for further
interaction with the marketplace related to this posted task. When a
worker completes the task later, the worker will post the result via the
form to the marketplace. The marketplace will then record a dictionary
containing the posted form contents from url, a workerid unique to the
worker, the taskid, and a unique assignid with which to look up the
dictionary. (By dictionary, we mean a hash table where one can enter and
search for entries on keys.)
assignments(taskid) → assignids Return a list of identifiers assignids for
looking up individual completions of the form associated with taskid.
(Can be called multiple times, perhaps with a special identifier to indicate
that there will be no additional assignments registered.)
get(assignid) → dict Get the dictionary that corresponds to the submit-
ted task with the given assignid. The dictionary contains which worker
completed the task (a workerid) and the results of the form, as key-value
pairs.
approve(assignid) Request that the marketplace pay the worker associated
with assignid the price associated with the taskid that that assignid
corresponds to.
reject(assignid) Request that the marketplace not pay the worker asso-
ciated with assignid the price associated with the taskid that that
assignid corresponds to.
The program posts one or more URLs, waits for assignments, gets the results,
and then approves or rejects the work.
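This post/poll/approve lifecycle might look as follows against any object exposing the five calls above. The `mkt` object, the polling details, and the quality check are all assumptions for illustration, not an actual Mechanical Turk SDK:

```python
import time

def run_task(mkt, url, price, wanted, looks_ok):
    """Post a form URL, poll for completed assignments, and approve
    or reject each one. `mkt` exposes the five Basic Buyer RPCs;
    `looks_ok` is a caller-supplied quality check (e.g., a spammer
    filter rejecting junk answers)."""
    taskid = mkt.post(url, price)
    results, seen = [], set()
    while len(results) < wanted:
        for assignid in mkt.assignments(taskid):
            if assignid in seen:          # skip already-processed work
                continue
            seen.add(assignid)
            d = mkt.get(assignid)
            if looks_ok(d):
                mkt.approve(assignid)     # pay the worker
                results.append(d)
            else:
                mkt.reject(assignid)      # e.g., junk from a spammer
        if len(results) < wanted:
            time.sleep(1)                 # poll interval
    return results
```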
Priam determines that workers are best at ranking five photos at a time, so a
web page is designed to display five photos and provide five entry fields for the ranks
one through five. A computer program now needs to be written to read the photos
from a database and generate multiple posts corresponding to groups of five photos.
The program needs a strategy to do its work: for instance, it may employ a type
of Merge-Sort strategy: divide the photos into disjoint sets of five, and rank each
set. Then the sorted sets (runs) can be merged by repeatedly calling on workers.
(Section 7.8 describes in more detail a similar Merge-Sort in our HPROC system
using ranked comparisons.)
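In outline, this Merge-Sort strategy could look like the following, where `rank_five` and `pick_best` stand in for microtasks answered by workers (both names are ours, not from any system described here):

```python
def human_merge_sort(photos, rank_five, pick_best):
    """rank_five(group) -> the (up to) five photos in ranked order
       pick_best(a, b)  -> the better of two photos
    Both stand in for questions answered by human workers."""
    # Phase 1: rank disjoint groups of five to form sorted runs.
    runs = [rank_five(photos[i:i + 5]) for i in range(0, len(photos), 5)]
    # Phase 2: repeatedly merge pairs of runs until one remains.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(merge(runs[i], runs[i + 1], pick_best))
            else:
                merged.append(runs[i])   # odd run out, carried forward
        runs = merged
    return runs[0] if runs else []

def merge(a, b, pick_best):
    """Merge two sorted runs with one human comparison per step."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if pick_best(a[i], b[j]) == a[i]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]
```

With deterministic stand-ins (`sorted` for ranking a group, `min` for picking the better of two), the routine reduces to an ordinary ascending sort, which makes it easy to test before spending money on workers.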
In addition to the sorting logic itself, there is a lot of other “administrative”
work that needs to be done. Of course, assignments need to be approved (paying
workers for their work), but more importantly Priam needs to determine pricing,
whether and when the work being submitted is good, which workers are good, and so on. For
example, one worker (a “spammer”) might simply fill in junk in order to get paid.
This spammer would need to be caught and their work ignored. Priam also may not
pay enough initially, or may need to change his price over time depending on market
conditions.
6.3 Game Maker
The Game Maker (GM) environment is modeled on the “Games with a Purpose”
(GWAP) literature [70]. The idea of GM is to incorporate one or more desired tasks
(like Priam’s five photo ranking) into a game which regular users on the Internet
Figure 6.2: Game Maker human programming environment. The programmer writes a human program and a game. The game implements features to make it fun and difficult to cheat. The human program loads and dumps data from the game.
find fun to play. In theory this is not much different from BB—why not simply
post the URL of the “game” task so people can find it? However, in practice, GM
is quite different because only some tasks can be made fun, the question of pricing
is completely avoided, and it often takes a long time and a great deal of effort to
make desired work into a game. Many games have been developed, though our
model is based most closely on the ESP Game (a photo captioning game). The GM
environment (Figure 6.2) works as follows:
1. The programmer (Priam) writes two programs: the main program and a “game
with a purpose.”
2. The game is designed to take input items and compute some function fn of each
input item by coercing players to compute the function during game play. For
example, the ESP Game takes photos as input items and produces text labels
as outputs [70].
3. The interaction between the main program and the game is simple:
load(item) → itemid Add a new item for humans to compute the game’s
function on. Return an identifier for the item.
dump() → ((itemid,res),...) Get a list of all results that have been com-
puted up to this point (can be called repeatedly). Each returned tuple
includes an itemid and the result of computing the game’s function on the
original item.
4. While the function fn computed is usually quite simple (e.g., “give some la-
bels for this image”), the game itself is usually quite complex. This complexity
is for two reasons: the game must be fun, and the game must be difficult to
cheat. Making the game fun can be time consuming, requiring features such as
timed game play, multiple players, fake players (via replayed actions), leader-
boards, and quality graphic design. Making the game difficult to cheat can be
equally time consuming, requiring features such as randomization, gold stan-
dards, statistical analysis, and game design according to particular templates
(e.g., “output-agreement,” “input-agreement” [70]).
5. The game may be a Flash game or any other format; the fact that it is used for
human computation does not impact the technical details of how we program
it.
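From the main program's side, the interaction reduces to a load/poll loop over the two operations. A sketch, where the `game` object and `poll` callback are hypothetical:

```python
def label_items(game, items, poll):
    """Feed items into a game-with-a-purpose and collect results.
    `game` exposes load(item) -> itemid and dump() -> [(itemid, res)];
    `poll` is called between dump() calls (e.g., a sleep)."""
    ids = {game.load(item): item for item in items}
    results = {}
    while len(results) < len(ids):
        for itemid, res in game.dump():   # dump() may be called repeatedly
            results[itemid] = res
        poll()
    return {ids[i]: r for i, r in results.items()}
```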
Priam determines that the magazine’s readers might be willing to play a game
where they determine the best photo out of a set of five photos. As with the Basic
Buyer case, Priam needs to write a program to handle the sorting logic. The program
could then use the load and dump operations to get data in and out of the game.
However, he now also needs to write a game where it is fun to sort groups of five
photos, and then promote the game online. Lastly, he needs to make sure that
players cannot cheat, either to make a particular contestant’s photo do well, or for
the player to succeed in the game by inputting bad data.
One problem with the GM environment is that to date, programmers have not
shared interfaces or source code for popular games. For example, even though the ESP
Game serves many players each day, it is not possible for Priam to get the (actual)
ESP Game to label his own images. This means that the programmer usually has to
develop and promote a new game, even if previous examples exist! (Even if the most
popular GWAPs did have open interfaces, it is likely that switching between GWAPs
would require rewriting code and that GWAPs would only cover a small fraction of
potential desired tasks.)
Figure 6.3: Human Processing programming environment. HP is a generalization of BB and GM. It provides abstractions so that algorithms can be written, tasks can be defined, and marketplaces can be swapped out. It provides separation of concerns so that the programmer can focus on the current need, while the environment designer focuses on recruiting workers and designing tasks.
6.4 Human Processing
The Human Processing (HP) environment builds upon the BB and GM environments
through abstraction. The HP environment (Figure 6.3) works as follows:
1. The programmer (Priam) writes a normal program. The programmer may also
write one or more implementations of (see below) human drivers, human tasks,
marketplace drivers, or recruiters. However, the point of the HP environment is
to maximize code reuse, so ideally, existing implementations should cover the
programmers’ common use cases.
2. A human driver is a program that manages an associated web form or other
user interface (so that the main program and other components do not have
to talk directly to the user interface). It is so named because it manages the
interaction with humans, much like a device driver manages a physical device
on a computer. A human driver supports four operations:
open() → driverid Make the associated user interface available to remote
users. By remote users we mean workers in the BB model, players in
the GM model, or other people capable of completing tasks. Returns an
identifier for the driver.
send(driverid, msg) Send message msg to the driver to change its behavior.
In Priam’s case, if he was using a driver for a game like the ESP Game he
would use a send operation to load input photos.
get(driverid) → (d,e) or 0 Get a (result) data object d from the interface,
with execution context e about how that data object was acquired. If no
new data is available, return nothing (i.e. 0). get is how results are
returned from the driver. Both d and e are dictionaries of key-value pairs.
For example, Priam’s photo comparison interface returns d as
{ranks: (1,4,2,3,5), taskid: TID183}
(ranks are the output, and tasks are defined below) and e as
{workerid: WID824}
(a worker who completed the task).
close(driverid) Make the associated user interface unavailable to remote
users.
A human program opens a driver and then sends setup messages. Human
drivers for web forms may only receive one setup message, though those for
games may be sent many messages to load inputs. Execution context comes
from user interaction, for example, how long did the task take and which worker
completed it? Such information can help with quality control in the main
program. Finally, the human driver is closed. Note that by itself, a human
driver can make its associated user interface available to remote users. However,
it does not handle the problem of finding remote users to interact with the user
interface.
3. The programmer reuses or defines structures called human task descriptions. A
human task description consists of an input schema, an output schema, a human
driver, a web form, and possibly other metadata. A human task description can
be instantiated into one or more human task instances. These instances contain
information as key-value pairs such as when the task started, a price if any, and
so on. For example, a task description for Priam’s case might look like ...
{input: (photo1, photo2, photo3, ...),
output: (int, int, int, int, int),
webform: compare.html,
driver: comparer.py}
... while a task instance might look like ...
{start: 20090429,
price: $0.07,
taskid: TID272}.
4. A marketplace driver provides an interface to a marketplace. Marketplaces
are a general term for both monetary marketplaces like Amazon’s Mechanical
Turk [1] (websites where workers are paid in money) and gaming marketplaces
like GWAP (websites where users choose among many games and are paid
in points or enjoyment). The environment may have many drivers for different
marketplaces, and these drivers may have different interfaces depending on what
the marketplaces themselves support.
5. The programmer avoids programming to any particular marketplace driver if
at all possible. Instead, the programmer targets a recruiter, which is a program
that serves as an interface to one or more marketplace drivers.2 Recruiters
support at least one operation:
recruit(taskid) Ensure that the task instance taskid is completed by work-
ers. The recruiter uses the task instance to find out how the user interface
associated with the task is accessed. For example, if it is a web form, the
task instance includes the URL of the web form. Then, the recruiter inter-
acts with one or more marketplace drivers. In the case of the marketplace
from the BB environment, one strategy might be to gradually increase
the price until workers complete the web form. The recruiter also interacts
with the human driver associated with the task instance to determine when
no more workers are needed (e.g., in Section 7.5.12, the recruiter calls an
isDone() method on a human driver).
2In practice, some services like Dolores Labs’ CrowdFlower may also be viewed as a form of recruiter.
In general, quality recruiters need both a strategy and worker monitoring. For
example, if workers are not completing a web form, should the price be increased,
or are there simply not many workers currently awake and available? Chapter 8
addresses the question of worker monitoring, with some consideration of viable
recruiting strategies.
6. The environment includes a library of human algorithms to encourage code
reuse. A human algorithm is a parameterized program which can handle many
possible needs. (For example, it might include algorithms for sorting, clustering,
and iterative improvement [49].) Often, it will be parameterized by human task
descriptions, but other parameters might be used as well. For example, a pair-
wise sort algorithm might take a human task description consisting of a human
driver and web form to compare two items. The human task description would
determine if the items compared were photos, videos, or something else.
The Human Processing environment is the novel environment we propose in this work.
In the HP environment, Priam’s workload is much reduced. A pairwise sorting
algorithm H-Quick-Sort is already included in the library. Priam may define a
human driver and web form for comparing two photos, though these might already
be available. Then, Priam defines a human task consisting of comparing two photos
using the human driver, web form, and appropriate schemas. Lastly, Priam runs
H-Quick-Sort with his human task and a pre-defined recruiter. An example pre-
defined recruiter is one that increases prices one cent each hour using Amazon’s
Mechanical Turk, though more complex recruiters could be built.
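The library's pairwise sorting algorithm might be sketched as follows. Here `compare(a, b)` hides the HP machinery (instantiating a two-photo human task instance and handing it to a recruiter) and simply returns whichever item workers rank first; the sketch is ours, not actual library code:

```python
def h_quick_sort(items, compare):
    """Human quicksort: compare(a, b) returns whichever of a, b
    should rank first. In HP, each call would issue one comparison
    microtask via a recruiter."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    before, after = [], []
    for x in rest:
        # One human comparison per pair; a real implementation would
        # persist each answer so a crashed run never re-buys work.
        winner = compare(x, pivot)
        (before if winner == x else after).append(x)
    return h_quick_sort(before, compare) + [pivot] + h_quick_sort(after, compare)
```

Substituting a deterministic "worker" that always prefers the smaller item (`min`) reduces the algorithm to an ordinary ascending sort, a cheap way to check the control flow before recruiting real workers.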
6.5 Discussion
HP extends BB and GM in compelling ways:
• Cost. BB excels for small numbers of tasks where programmer time is valuable.
GM excels for large numbers of tasks where cheaper work is valuable. HP
excels at both by providing payment optimizing recruiters and the opportunity
to degrade to either BB or GM.
• Ease. BB quickly becomes complicated as the programmer gets bogged down
in trivia like pricing. GM requires heavy attention to game play and cheaters.
HP allows the programmer to focus on the tasks to be completed, rather than
infrastructure.
• Reuse. There are no mechanisms in BB for reusing algorithms, forms, or admin-
istrative functionality. Current GM implementations do not share interfaces,
and games tend to be specialized to specific use cases. By contrast, abstrac-
tions in HP allow for a library of infrastructure. Algorithms can target recruiter
interfaces, recruiters can target market drivers, and so on.
• Independence. Programs in BB tend to be focused on a particular marketplace.
Programs in GM tend to be tied to a particular web site’s gaming user base.
By contrast, programs written to HP have an independence due to marketplace
drivers. (Likewise, algorithms, human drivers, and forms may have a simi-
lar independence.) Switching marketplaces or other infrastructure can require
substantial rewriting in BB or GM, but does not in HP.
• Algorithms. General algorithms can be written to target a higher level interface
in HP, but it is not clear how general algorithms can be reused in BB or GM.
• Separation of Concerns. Researchers or infrastructure writers can focus on
improving recruiters, algorithms, and human drivers in HP, independent of a
main program’s code.
The more environments that implement HP, the easier it will be to leverage disparate
work in algorithms, recruiters, and human drivers.
There are three main challenges in the future for HP.
1. Verification, Quality Control. GM focuses a great deal on verification, but BB
and HP do not. How should we identify bad output? How do we identify
high and low quality workers? Is worker quality task specific? We would like
to see a generic, modular way to handle verification and quality control in an
environment like HP.
2. Recruiters. We would like to see arbitrarily advanced recruiters. For example,
not only would we like to see recruiters that price tasks on monetary mar-
ketplaces, but we would also like to see recruiters that can choose amongst
alternative, equivalent task plans based on price and quality.
3. Algorithms. Algorithms targeted for the HP environment need to be developed
for various purposes. For example, sorting with people is not the same as sorting
with a computer! The HP environment provides a natural way to benchmark
algorithms, based on cost, time, input, and output with a given recruiter.
As we will see, the work in Chapters 7 and 8 goes part of the way towards address-
ing these challenges. Chapter 7 describes our partial implementation of HP, called
HPROC. We use HPROC to explore human sorting algorithms with a simple recruiter.
Chapter 8 demonstrates a worker monitoring tool intended for use by recruiters. This
worker monitoring tool might also be used for various forms of quality control in the
future. However, while we explore the space with one type of algorithm, one type of
recruiter, and one tool for worker monitoring, there is a great deal of potential future
work in algorithms, recruiters, and quality control. We believe that HP is a strong
foundation for this future work in human computation, allowing for much greater
reuse and modularization of common functionality.
Chapter 7
Programming with HPROC
This chapter describes HPROC, a system implementing most of the Human Process-
ing model described in Chapter 6. (The one notable exception is a lack of human
task descriptions.) HPROC makes human programming easier by storing expensive
human results in a database backend and providing an environment for programming
which is more amenable to control flow with humans. HPROC makes evaluation of
human algorithms easier with concepts like recruiters which help to control for the
variability of an underlying marketplace like Mechanical Turk. We describe HPROC
through a sorting case study illustrating both how HPROC works and how we believe
human algorithms should be evaluated.
7.1 HPROC Motivation
Programming systems designed for other problems are rarely a good fit for human
programs. In particular, human programs are:
Long Running With humans as a computational unit, processing can take days or
even weeks.
Costly Paying human workers costs money, which means that the programmer wants
to be extra careful not to lose previously computed results. In particular,
previously computed results should be persistently stored in case a later part
of the program crashes.
Parallel It is usually much slower to post one task to a marketplace at a time in
sequence than to post many tasks in parallel. Making this human parallelism
easy is very important, while computational parallelism is much less important.
Web-related with State Human programs need to interact with workers on the
web, but unlike most web programming, the interactions are often stateful.
While other types of programming like web programming, database programming,
and systems programming with concurrency share some of these features, we do not
know of any type of programming task that shares all of them.
1 quicksort(A)
2   if A.length > 0
3     pivot = A.remove(once A.randomIndex())
4     left = new array
5     right = new array
6     for x in A
7       if compare(x, pivot)
8         left.add(x)
9       else
10        right.add(x)
11    quicksort(left)
12    quicksort(right)
13    A.set(left + pivot + right)
14
15 compare(a, b)
16   hitId = once createHIT(...a...b...)
17   result = once getHITResult(hitId)
18   return (result says a < b)
Listing 7.1: An idealized TurKit Quick-Sort program [51].
7.2 Preliminaries: TurKit
One of the first systems to explicitly aim to solve some of the programming problems
described in Section 7.1 for human programming was the TurKit system ([49], [50]).
TurKit has been used to transcribe blurry text, caption images, and execute genetic
algorithms [51]. We describe the TurKit system first because HPROC makes a number
of design decisions inspired by TurKit. The main contribution of the TurKit system
was a novel programming model which we call TurKit crash-and-rerun. There are
four features of TurKit crash-and-rerun:
Single Program There is one and only one program which is run within the TurKit
environment.
Continuous Rerun The one program is continuously rerun until the program com-
pletes without raising an exception.
Idempotence The single program is made idempotent so that it can be rerun con-
tinuously without causing additional side effects.
Deterministic The single program is deterministic so that each time it reruns the
same execution path will be followed.
In short, TurKit is an environment for continuously rerunning a single, deterministic,
idempotent program written by a human programmer until it completes successfully,
with facilities to make writing and running such programs easier.
For example, Listing 7.1 shows pseudocode for Quick-Sort in the TurKit sys-
tem. Listing 7.1 looks like a classical Quick-Sort, with two key exceptions. The
first key exception is that the TurKit program pseudocode has a separate compare
function for performing binary comparisons. The compare function creates tasks on
the Mechanical Turk (via createHIT on line 16) and gets the results of those tasks
(via getHITResult on line 17). The second key exception is that the TurKit program
pseudocode includes the function once. The first time once is reached, once calls the
function to which it is applied (e.g., A.randomIndex() on line 3—once is applied to
whatever function immediately follows it). If the called function returns successfully,
the result of the function is stored in a database which is part of the TurKit system.
Then, every subsequent time that particular once is reached, the recorded result is
returned from the database, rather than calling the function to which once was
applied. (In effect, once performs memoization.) On line 3, once is used to ensure
that the same pivot is chosen on each run of the program. On lines 16 and 17, once
is used to ensure that we do not repeatedly create duplicate tasks, or repeatedly get
the same results, respectively.
Note that although Listing 7.1 looks like a classical Quick-Sort, the program
does not run in the same manner. The program reads like a single, imperative run
of Quick-Sort. However, in practice, the program will “crash” whenever it reaches
line 17 and results are not yet available from the Mechanical Turk. Then the program
will periodically rerun (e.g., every minute), replaying all of its actions and retrieving
stored results when it reaches once, until the program has all results necessary to
complete. The imperative appearance of TurKit programs is an advantage, and the
crash-and-rerun style avoids problems like memory leaks. The crash-and-rerun style
also does not need any special operating system or language support for suspending
threads or processes. However, a disadvantage of the style (shared, as we will see,
by HPROC) is that the programmer must be careful to make their program idempotent
and deterministic using functions such as once (and unintended behavior can occur
if the programmer does not).
One practical way to implement once is via a counter. The counter is incremented
each time once is called and reset each time the program crashes. once then stores
and looks up the result of the function by a key associated with the current counter
in the database. (TurKit is designed for prototyping only one program at a time, so
we do not worry about conflicts between programs having the same counter keys or
programs colliding with their own past keys in this discussion.)
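The counter scheme just described can be sketched in a few lines of Python. This is a hypothetical reconstruction, not TurKit's actual code: CrashAndRerun and NotYetReady are illustrative names, the "database" is a plain dict standing in for persistent storage, and the rerun loop omits the periodic sleep a real system would use.

```python
# Sketch of TurKit-style crash-and-rerun with a counter-based once.

class NotYetReady(Exception):
    """Simulates a crash: a needed result (e.g., a HIT result) is unavailable."""

class CrashAndRerun:
    def __init__(self):
        self.db = {}        # memoized results, keyed by counter value
        self.counter = 0    # incremented per once call, reset on each rerun

    def once(self, fn, *args):
        key = self.counter
        self.counter += 1
        if key in self.db:          # replay: return the recorded result
            return self.db[key]
        result = fn(*args)          # first run: call the function...
        self.db[key] = result       # ...and record its result
        return result

    def run(self, program):
        """Continuously rerun program until it completes without crashing."""
        while True:
            self.counter = 0        # counter resets at the start of every run
            try:
                return program(self)
            except NotYetReady:
                pass                # a real system would sleep, then rerun
```

Because the program is deterministic, a run that crashes waiting on a result replays the same once calls in the same order on the next run, drawing previously recorded values (such as the pivot under db[0]) from the database.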
For example, suppose we use the Quick-Sort of Listing 7.1 to sort three images,
A, B and C. These images are initially in the order A = [C,B,A]. During the first run
of the program, once is first called on line 3. Supposing A.randomIndex() (line 3)
returned the index 2 (corresponding to image A), once would store the value 2 under
the key 0 in the database (e.g., db[0] = 2). Later on the same initial run, compare
is executed (lines 7 and 15), and the first call to createHIT succeeds, so we would
save its resulting HIT identifier under db[1]. Then, getHITResult is executed, but the result of
comparing image A (the pivot) to C (the first index) is not available, so the program
crashes. On the second run, the initial pivot on line 3 is looked up under db[0]. Then
the initial createHIT would be skipped because the result is available under db[1].
However, supposing the getHITResult was now ready, the program might then save
db[2] = A to signify that A is less than C (according to some criteria, like blurriness).
Then the first compare returns, and the second compare in the for loop executes.
For that second compare, the call to createHIT on line 16 will be saved to db[3].
Note that because the counter is specific to the program, rather than the current
stack frame, this second call to createHIT has a different key (3) than the first (1).
Further, because the Quick-Sort has been made deterministic by calls to once, the
branches taken and recursive calls made always occur in the same order, so each
counter value matches the same once call on every rerun.
7.3 HPROC Subsystems
HPROC is a system which extends TurKit in various ways. One major difference
is that HPROC can run code in response to web requests, which TurKit cannot
do. Running code in response to web requests allows for more natural handling of
interaction between machine and human computation, as well as better handling of
the stateful nature of web requests in human programming. Another major difference
between TurKit and HPROC is that HPROC can have any number of hprocesses
(HPROC’s version of a process), while TurKit can only have one running program.
(We go into much more depth about hprocesses in Section 7.4.)
HPROC helps satisfy the motivations of Section 7.1 by creating a specialized type
of operating system within an operating system. HPROC has its own notions of
processes, state, and interfaces (analogous to, e.g., network interfaces) that are built
on top of the host operating system, with additional features that make those notions
more usable for human programming.¹ A high level graphical overview of HPROC is
shown in Figure 7.1. The five main subsystems from Figure 7.1 are:
¹In this chapter, we have changed the names of notions and code in our HPROC system for ease of exposition. For example, hprocesses are called components in our system, HCDs are called component types, and the HPID environment variable is really COMPONENTID. Nonetheless, there is a one-to-one mapping of notions in this chapter to notions and code in our real, running system.
Figure 7.1: Graphical overview of the full HPROC system.
Database A MySQL database (shown in the middle left of Figure 7.1) contains
descriptors of code and hprocesses within HPROC, event related information,
and variables. Most of the other subsystems, including the programmer remote
API CGI, the web hprocess wrapper CGI, and hprocron (all discussed below)
interact with the database.
Web Server A LigHTTPd web server (shown in the top left of Figure 7.1) serves
as a frontend for the web hprocess wrapper CGI and programmer remote API
CGI.
Hprocron An operating system process (upper right of Figure 7.1) which spawns
UNIX processes (“resumes hprocesses,” see Section 7.4) based on the contents
of the MySQL database. Analogous to UNIX cron. Used to implement events
(discussed in Section 7.4), polling (Section 7.5.5), and TurKit style crash-and-
rerun functionality.
Web Hprocess Wrapper CGI A CGI script which spawns UNIX processes (“re-
sumes hprocesses,” see Section 7.4) based on an HTTP request and the contents
of the MySQL database. (So-called because the processes are executed in an
environment where they are wrapped by the CGI script.) Used to implement a
web interface for hprocesses.
Programmer Remote API CGI A CGI script which allows a remote program-
mer to upload code into the HPROC system, run that code, and get results
out.
7.4 HPROC Hprocesses
Although we go into much more depth in our walkthrough (Section 7.5), this section
gives a brief overview of the key concept in HPROC—the hprocess. An hprocess is
the analogue to a process in our specialized operating system. Every hprocess has
an hpid, or hprocess identifier, analogous to an operating system process identifier
(PID). An hprocess runs regular UNIX code. An hprocess can be in one of three
states: active, waiting, or finished. When an hprocess is active, the UNIX code for
the hprocess is running as a UNIX operating system process. When an hprocess is
waiting or finished, there is no running UNIX operating system process corresponding
to the hprocess.
When an hprocess transitions from active to waiting, we say that the hprocess
was suspended, which is our equivalent to a crash in TurKit crash-and-rerun. When
an hprocess transitions from waiting to active, we say that the hprocess was resumed,
which is our equivalent to a rerun in TurKit crash-and-rerun. Once an hprocess
transitions to the finished state, it will not transition to either of the other states.
(Usually in our system, hprocesses transition to the finished state after completion,
as in TurKit crash-and-rerun—see Section 7.5.7.) Section 7.5.6 goes into more detail
about how hprocesses are suspended and resumed.
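The three states and their named transitions can be summarized in a small sketch. The class and method names here are illustrative, and the assumption that a newly created hprocess begins in the waiting state is ours; in HPROC itself this bookkeeping lives in the MySQL database.

```python
# Sketch of the hprocess lifecycle: waiting <-> active -> finished.
class Hprocess:
    def __init__(self, hpid):
        self.hpid = hpid
        self.state = "waiting"     # assumed initial state: no UNIX process yet

    def resume(self):
        # waiting -> active: equivalent to a rerun in TurKit crash-and-rerun.
        assert self.state == "waiting"
        self.state = "active"      # a UNIX process now runs the hprocess code

    def suspend(self):
        # active -> waiting: equivalent to a crash in TurKit crash-and-rerun.
        assert self.state == "active"
        self.state = "waiting"

    def finish(self):
        # active -> finished: terminal; no further transitions are allowed.
        assert self.state == "active"
        self.state = "finished"
```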
There are two types of hprocesses, standalone hprocesses and web (or webcgi)
hprocesses. A standalone hprocess will be resumed in response to an event in hprocron
(see below). A web hprocess will be resumed in response to a specially crafted web
request to the web hprocess wrapper CGI.
Each hprocess has some associated persistent variable storage. What this means
in practice is that there is a table within the MySQL database where hprocesses can
store variable data. Each row in this table contains an hpid, a name for the variable,
a value to which the variable is currently set, a type, and a status. Any part of
the HPROC system can add variables under any hpid, but hprocesses generally use
this variable storage to store their own information. Adding, deleting, and updating
variables is done via SQL.
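The variable table just described might look like the following sketch. sqlite3 stands in here for the system's MySQL database, and the column names are inferred from the description above rather than taken from HPROC's actual schema.

```python
import sqlite3

# Hypothetical sketch of the hprocess variable-storage table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE variables (
        hpid   INTEGER NOT NULL,   -- owning hprocess identifier
        name   TEXT NOT NULL,      -- variable name
        value  TEXT,               -- current value (serialized)
        type   TEXT,               -- value type
        status TEXT,               -- variable status
        PRIMARY KEY (hpid, name)
    )
""")

# Any part of the system can add a variable under any hpid:
conn.execute(
    "INSERT INTO variables (hpid, name, value, type, status) "
    "VALUES (?, ?, ?, ?, ?)",
    (1003, "results", '["photo1"]', "json", "set"),
)

# ...and an hprocess can read its own variables back:
row = conn.execute(
    "SELECT value FROM variables WHERE hpid = ? AND name = ?",
    (1003, "results"),
).fetchone()
```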
One important use of this persistent variable storage is for memoizing intermediate
results, as in TurKit. HPROC has a function analogous to once, and an implementation
which is functionally similar to the counter implementation described in Section 7.2.²
(Our implementation uses an identifier which maps to the current program counter
within the current stack frame, but is functionally largely the same as the counter
implementation.)
²Also, many HPROC functions use our once equivalent internally to ensure idempotence and determinism. For example, hprocess creation and cross-hprocess function calls will implicitly store state to ensure idempotence and determinism.
Hprocesses communicate primarily via two methods: events and cross-hprocess
function calls.
The hprocron operating system process (introduced in Section 7.3) maintains a list
of hprocesses that are listening to particular events. Events are simply ASCII strings,
like E_POLL_1003. When an event is fired, hprocron is responsible for resuming the
hprocesses listening for the event. Any part of the HPROC system can contact
hprocron to add a listener or fire an event. Every hprocess automatically listens on
a number of default events. For example, an hprocess with an hpid of 1003 would by
default listen for the event E_POLL_1003, which is a polling event (see Section 7.5.5).
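The listener bookkeeping hprocron performs can be sketched as follows. The class and event names (with underscores) are illustrative; the real hprocron reads its listener state from the MySQL database rather than keeping it in memory.

```python
# Sketch of hprocron's event handling: events are ASCII strings, and firing
# an event resumes every hprocess listening for it.
class Hprocron:
    def __init__(self, resume):
        self.listeners = {}     # event string -> set of listening hpids
        self.resume = resume    # callback that resumes an hprocess by hpid

    def add_listener(self, event, hpid):
        self.listeners.setdefault(event, set()).add(hpid)

    def fire(self, event):
        for hpid in self.listeners.get(event, set()):
            self.resume(hpid)
```

For example, registering hpid 1003 on its default polling event and then firing that event resumes hprocess 1003, while firing an event with no listeners does nothing.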
Hprocesses (and other parts of the system) can call other hprocesses via cross-
hprocess function calls. It is best to think of cross-hprocess function calls as a spe-
cialized form of message passing. Hprocesses can leave messages for one another via
variables, but the hprocesses themselves perform whatever actions they desire based
on the variables available to them whenever they next resume. Cross-hprocess func-
tion calls work by placing a variable in the variable storage of the target hprocess.
The variable contains information about the desired function call, as well as a dif-
ferent hpid and variable name in which to put the result. The target hprocess then
computes the function based on the information given in the original variable and
places the result in the new variable requested by the source hprocess. (We give a
full example in Section 7.5.8.)
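The variable-based call protocol just described can be sketched as follows. All names here are illustrative, and a plain dict stands in for the MySQL variable table; the point is only the message-passing shape: the source leaves a call-request variable in the target's storage, and the target services it on its next resume.

```python
# Sketch of cross-hprocess function calls via variable storage.
variables = {}   # (hpid, name) -> value, standing in for the variable table

def call(target_hpid, fn_name, args, reply_hpid, reply_var):
    # Leave a message describing the desired call in the target's storage,
    # including where the result should be written.
    variables[(target_hpid, "call_request")] = {
        "fn": fn_name, "args": args,
        "reply_hpid": reply_hpid, "reply_var": reply_var,
    }

def on_resume(hpid, functions):
    # When the target hprocess next resumes, it computes any pending call
    # and places the result in the variable the source requested.
    req = variables.pop((hpid, "call_request"), None)
    if req is not None:
        result = functions[req["fn"]](*req["args"])
        variables[(req["reply_hpid"], req["reply_var"])] = result
```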
7.5 HPROC Walkthrough
HPROC is a large and complex system made up of over ten thousand lines of custom
Python code. Probably the easiest way to get a feel for working with the system is
a walkthrough of the most common functionality. To demonstrate this functionality,
this section uses the example of a program which asks a worker to compare two photos
and indicate which of the two the worker prefers. Listings 7.3, 7.4, 7.5 and 7.6 make up
a script called walkthroughscript.py which is uploaded into the HPROC system
for this walkthrough. Listing 7.2 is a script called walkthroughuploader.py which
does the uploading of the first script and retrieves results. These two scripts make up
our example program. We follow our example program through the following steps
in this walkthrough:
1. A remote connection is made (Section 7.5.1).
2. The code is uploaded, using the upload script (Section 7.5.2).
3. Introspection is performed on the uploaded code (Section 7.5.3).
4. An hprocess is created remotely using the upload script (Section 7.5.4).
5. The new hprocess is set up for polling (Section 7.5.5).
6. The hprocess is resumed, causing a UNIX operating system process to be
spawned (Section 7.5.6).
7. The upload script calls a remote function (Section 7.5.8) and the script uses
dispatch handling to handle the remote function (Section 7.5.7).
8. Local hprocesses are created within the HPROC system (Section 7.5.9).
9. Web forms and human drivers are created within the HPROC system (Sections
7.5.10 and 7.5.11).
10. A recruiter is asked to recruit for the human driver until a worker completes
the associated form (Section 7.5.12).
We go into more depth about each of these steps below.
7.5.1 Making a Remote Connection
HPROC is a self-contained system, running on a machine remote from the programmer.
For this walkthrough, we will assume that the HPROC system is running on
a computer with the hostname hproc.stanford.edu. We call the computer run-
ning HPROC the HPROC host. Likewise, we will assume that the programmer is
working on a separate computer at test.stanford.edu. We will call the remote
programmer’s computer the remote client.
1 #!/usr/bin/env python
2
3 def main():
4     conn = connect('https://hproc.stanford.edu/remote.cgi')
5
6     ... # removed
7
8     conn.uploadCode('walkthroughscript.py')
9
10     compareItemsProc = conn.newHprocess('edu.stanford.thesis.sa')
11     comparisonResult = compareItemsProc.fn.compareItems(
12         'http://i.stanford.edu/photo1.jpg',
13         'http://i.stanford.edu/photo2.jpg').get()
14
15     print comparisonResult
16
17     ... # removed
18
19 if __name__ == '__main__':
20     main()
Listing 7.2: Walkthrough script upload script (walkthroughuploader.py).
1 #!/usr/bin/env python
2
3 from hp.mop import dispatch, env
4 from hp.shared import exceptions, util
5 from hp.recruit import manturk_recruitd_iface
6 import cgi, SimpleXMLRPCServer
7
8 turk_javascript = ... # removed
9
10 def doThrower(photo_url1, photo_url2, target_url):
11     global turk_javascript
12
13     print """Content-Type: text/html
14
15 <html>
16 <head><title>Photo Comparison</title>%s</head>
17 <body>
18 First Photo URL is<BR/><IMG SRC='%s' /><br/><br/>
19 Second Photo URL is<BR/><IMG SRC='%s' /><br/><br/>
20 Which do you prefer?
21 <form name="thrower" action="%s" method="post">
22 <input type="radio" name="choice" value="photo1" /> First Photo<br />
23 <input type="radio" name="choice" value="photo2" /> Second Photo<br />
24 <input type="hidden" name="assignmentId" id="assignmentId" value="" />
25 <input type="submit" value="Submit" />
26 </form>
27 </body>
28 </html>
29 """ % (turk_javascript, photo_url1, photo_url2, target_url)
30
31 def doCatcher():
32     form = cgi.FieldStorage()
33
34     myHprocess().v['results'] = [form.getfirst("choice", ""),]
35
36     redirect_html = "Location: %s://%s/%s?%s&%s" % (
37         'https', 'www.mturk.com', 'mturk/externalSubmit',
38         'assignmentId=%s' % form.getfirst('assignmentId', ''),
39         'data=none'
40     )
41     print redirect_html
Listing 7.3: Walkthrough HPROC script (walkthroughscript.py), Part I: Initial setup, thrower, and catcher functionality.
42 class CompareFormHandler(object):
43     def getThrower(self):
44         return util.getThrowerUrl(env.getMyHpid())
45
46     def isDone(self):
47         return myHprocess().v.has_key('results')
48
49     def getResults(self):
50         return myHprocess().v['results']
51
52     def getTaskType(self):
53         return {'fqn': 'edu.stanford.hproc.tasktype.compareform.v1'}
54
55 def doXmlRpc():
56     handler = SimpleXMLRPCServer.CGIXMLRPCRequestHandler(
57         allow_none=True)
58     handler.register_introspection_functions()
59     handler.register_instance(CompareFormHandler())
60     handler.handle_request()
61
62 def handleRequest(photo_url1, photo_url2):
63     if util.requestType() == 'thrower':
64         doThrower(photo_url1, photo_url2,
65             util.getCatcherUrl(env.getMyHpid()))
66     elif util.requestType() == 'catcher':
67         doCatcher()
68     elif util.requestType() == 'xmlrpc':
69         doXmlRpc()
70     else:
71         return ''
Listing 7.4: Walkthrough HPROC script (walkthroughscript.py), Part II: XML-RPC and web request handling functionality.
72 def makeForm(photo_url1, photo_url2):
73     formhprocess = newHprocess('edu.stanford.thesis.web')
74     formhprocess.dfn.handleRequest(photo_url1, photo_url2)
75     xmlrpc_url = util.getXmlRpcUrl(formhprocess.id)
76     thrower_url = util.getThrowerUrl(formhprocess.id)
77
78     return {'xmlrpc': xmlrpc_url,
79         'thrower': thrower_url}
80
81 def fillForm(xmlrpc_url):
82     r_iface = manturk_recruitd_iface.getInterface()
83     r_iface.setMemo(True)
84
85     ticket_id = r_iface.getUniqueTicketIdentifier()
86     r_iface.manage(ticket_id, xmlrpc_url)
87     r_iface.setMemo(False)
88
89     if not r_iface.isDone(ticket_id):
90         raise exceptions.HprocIntendedError(
91             "Waiting on ticket %s." % ticket_id)
92     r_iface.setMemo(True)
93
94     res = r_iface.getResults(ticket_id)
95     r_iface.finishTicket(ticket_id)
96     return res
97
98 def compareItems(photo_url1, photo_url2):
99     makeFormProc = newHprocess('edu.stanford.thesis.sa')
100     lazyMakeForm = makeFormProc.fn.makeForm(photo_url1, photo_url2)
101     makeFormResult = lazyMakeForm.get()
102
103     fillFormProc = newHprocess('edu.stanford.thesis.sa')
104     lazyFillForm = fillFormProc.fn.fillForm(
105         makeFormResult['xmlrpc'])
106     fillFormResult = lazyFillForm.get()
107     return fillFormResult
Listing 7.5: Walkthrough HPROC script (walkthroughscript.py), Part III: makeForm, fillForm, and compareItems standalone functions.
108 def runFunc(args):
109     if env.getMyEnvironmentType() == 'webcgi':
110         dispatch.dispatchDefaultFunction(globals())
111     else:
112         dispatch.dispatchSingle(globals())
113
114 def main():
115     CODE_DESCRIPTORS = [
116         {
117             'fqn': 'edu.stanford.thesis.sa',
118             'language': 'python',
119             'args': '--run',
120             'environment': 'standalone',
121             'help': """Code to create comparison form.""",
122             'default_poll_s': 10,
123         },
124         {
125             'fqn': 'edu.stanford.thesis.web',
126             'language': 'python',
127             'args': '--run',
128             'environment': 'webcgi',
129             'help': """A binary comparison form for photos.""",
130             'default_poll_s': 0,
131         }
132     ]
133
134     dispatch.defaultCommandLineHandler(CODE_DESCRIPTORS, runFunc)
135
136 if __name__ == '__main__':
137     main()
Listing 7.6: Walkthrough HPROC script (walkthroughscript.py), Part IV: Dispatch functions, code descriptors (invocations), and main functions.
1 <?xml version="1.0"?>
2 <methodCall>
3   <methodName>getVariable</methodName>
4   <params>
5     <param><value><int>41</int></value></param>
6     <param><value><string>workerformresult</string></value></param>
7   </params>
8 </methodCall>
Listing 7.7: Example HTTP POST in XML-RPC for the call getVariable(41, "workerformresult").
Because HPROC is a self-contained system remote from the programmer, the first
thing that the programmer needs to do is to make a connection from the remote client
to the HPROC host. Listing 7.2 is a program which we will call the upload script
program, because, among other things, it uploads code from the remote client to the
HPROC host. Our walkthrough begins with the upload script program setting up a
connection from the remote client to the HPROC host.
For those unfamiliar with Python, line 1 is the common Python preamble, and lines
19 and 20 call the main function when the script is invoked. Line 4 sets up a connection
object, connected remotely to the HPROC host through the programmer remote API
CGI. Figure 7.1 shows this interaction, where the programmer is connected to the
web server, specifically the programmer remote API CGI, in the upper left corner.
The programmer remote API CGI is a common gateway interface (CGI) script. In
particular, when a request is made for https://hproc.stanford.edu/remote.cgi,
the programmer remote API CGI script is called. We now go on a brief tangent
to describe the programmer remote API CGI implementation, before returning to
walking through Listing 7.2.
The programmer remote API CGI script itself is implemented as an XML Remote
Procedure Call (XML-RPC) handler. XML-RPC is a form of RPC where the remote
client does an HTTP POST with an XML document describing a function to be called,
and then the response is an XML document explaining the result. For example, a
simplified version of an HTTP POST doing a remote procedure call for the call
getVariable(41, "workerformresult") would look like the XML shown in Listing
1 <?xml version="1.0"?>
2 <methodResponse>
3   <params>
4     <param><value><string>photo1</string></value></param>
5   </params>
6 </methodResponse>
Listing 7.8: Example HTTP response in XML-RPC for the call getVariable(41, "workerformresult") where the response is the string value "photo1."
7.7. Meanwhile, the response, assuming the result of the RPC was “photo1”, would be
that shown in Listing 7.8. In short, XML-RPC is just a form of RPC where one uses
HTTP POSTs and responses to perform procedure calls, and can be implemented
using CGI scripts, as we do in the HPROC system. In our case, the programmer
is abstracted from having to deal with the specifics of XML-RPC by the connection
object conn. conn turns any function call made on it into an XML-RPC call to the
programmer remote API CGI on the HPROC host.
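As a concrete illustration, Python's standard library can produce exactly the kind of request body shown in Listing 7.7. Whether HPROC's own connection object was built on this library is an assumption here; the sketch only demonstrates the wire format.

```python
# Marshal the call getVariable(41, "workerformresult") into an XML-RPC
# request body of the form shown in Listing 7.7, using only the standard
# library (xmlrpc.client; this was xmlrpclib in the Python 2 era of HPROC).
import xmlrpc.client

request_xml = xmlrpc.client.dumps((41, "workerformresult"),
                                  methodname="getVariable")
print(request_xml)  # an XML methodCall document for getVariable
```

A connection object like conn can then be little more than an HTTP client that POSTs such documents to remote.cgi and unmarshals the methodResponse.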
There is various additional boilerplate for setting up a remote connection to the
HPROC system via the connection object conn. We have removed that boilerplate,
which would usually be at line 6 of Listing 7.2.
7.5.2 Uploading Code
After making a remote connection (Section 7.5.1), the next step is for the programmer
to upload some code into the HPROC system. The idea is that code (other than the
upload script) runs on the HPROC host, rather than on the remote client controlled by
the programmer. The uploadCode method, on line 8 of Listing 7.2, takes a path on the
remote client file system that corresponds to code that can be executed on the HPROC
host, within the HPROC system. We assume that the file walkthroughscript.py
exists on the file system of the remote client. The upload script (via uploadCode)
reads the walkthroughscript.py file and then posts it via an XML-RPC call to the
programmer remote API CGI. Specifically, the file is sent verbatim via an XML-RPC
function called uploadCode.
7.5.3 Introspection
This section discusses introspection—the process by which code uploaded into the
HPROC system describes itself. First, we discuss when introspection occurs and how
it produces code descriptors. Each code descriptor represents a particular way that
a code file can be invoked, together with restrictions and information about that
invocation. Second, we discuss the format and content of code descriptors. Third, we
discuss the aftermath of introspection.
When the programmer remote API CGI receives the uploadCode call, the CGI
performs three actions.
1. The CGI saves the code that was sent via the RPC to a file.
2. The CGI performs introspection on the uploaded code file.
3. The CGI performs any additional actions that should result from that intro-
spection.
To perform introspection, the code file is made executable on disk (e.g., via chmod
770), and then it is run with the argument --info. The code file is then expected to
produce a list of code descriptors.
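The introspection step might be implemented along these lines. This is a hypothetical sketch, not HPROC's actual code: the function name is ours, and only the described behavior (chmod, run with --info, parse the printed JSON) is taken from the text.

```python
# Sketch of introspection: make the uploaded code file executable, run it
# with --info, and parse the JSON list of code descriptors it prints.
import json
import os
import subprocess

def introspect(code_path):
    os.chmod(code_path, 0o770)               # make the code file executable
    completed = subprocess.run([code_path, "--info"],
                               capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)      # a list of code descriptor objects
```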
Listing 7.6 shows the last of the four pieces of code that make up the
walkthroughscript.py code file uploaded by the walkthroughuploader.py of
Listing 7.2. (Recall that Listings 7.3, 7.4 and 7.5 are the other three pieces
of code in this file.) Lines 115–134 of Listing 7.6 handle producing code de-
scriptors when walkthroughscript.py is called with --info. In particular, the
defaultCommandLineHandler function on line 134 is a convenience function for out-
putting code descriptors.
The defaultCommandLineHandler function takes two arguments. The first argu-
ment to defaultCommandLineHandler will be output as a JavaScript Object Notation
(JSON) list when the executable is run with --info. JSON is a lightweight data-
interchange format inspired by the JavaScript language, made up of arrays (e.g.,
[1,2,3]), objects (e.g., {"x":3, "y":7}) and values (e.g., 7 or "foo"). The sec-
ond argument to defaultCommandLineHandler, a function, will be called when the
1 > ./walkthroughscript.py --info
2 [
3   {
4     "fqn": "edu.stanford.thesis.sa",
5     "help": "Code to create comparison form.",
6     "language": "python",
7     "args": "--run",
8     "environment": "standalone",
9     "default_poll_s": 10
10   },
11   {
12     "fqn": "edu.stanford.thesis.web",
13     "help": "A binary comparison form for photos.",
14     "language": "python",
15     "args": "--run",
16     "environment": "webcgi",
17     "default_poll_s": 0
18   }
19 ]
Listing 7.9: Introspection on walkthroughscript.py by using --info.
executable is run with --run. In our case, walkthroughscript.py will output the
code descriptors specified on lines 115–132 when run with --info, and will return
the result of the function runFunc (shown on line 108) when run with --run.
The output of walkthroughscript.py --info is shown in Listing 7.9. The out-
put consists of an array of two objects, where each object corresponds to one way to
run the executable code file. In particular, the output shown in Listing 7.9 says that
walkthroughscript.py can be invoked in two ways, based on two code descriptors.
The first code descriptor has the following details:
1. The code descriptor has a fully qualified name, or FQN, of
edu.stanford.thesis.sa. This name is used to identify the code de-
scriptor within the HPROC system. (Later we will create hprocesses based on
the code descriptor FQN.)
2. The code descriptor has help text, informing us that the invocation is meant
for creating comparison forms.
3. The code descriptor informs us that the code is written in the Python language.
4. The code descriptor informs us that for this invocation, the executable should
be run as walkthroughscript.py --run.
5. The code descriptor tells us in what circumstances the executable should be
run. In particular, this invocation is intended for a standalone environment,
which means that it should be run by hprocron (see Section 7.5.5).
6. The code descriptor tells us that a default polling time of ten seconds should
be used (see Section 7.5.5).
By contrast, the second code descriptor has the following details:
1. The code descriptor has an FQN of edu.stanford.thesis.web.
2. The code descriptor has help text, informing us that the invocation is meant
to display a binary photo comparison form by outputting HTML to a remote
worker.
3. The code descriptor informs us that the code is written in the Python language.
4. The code descriptor informs us that for this invocation, the executable should
be run as walkthroughscript.py --run.
5. The code descriptor tells us in what circumstances the executable should be
run. In particular, this invocation is intended for a webcgi environment, which
means that it should be run as the result of an HTTP request to the web server
(see Section 7.5.10).
6. The code descriptor tells us not to set a default polling time.
Thus, the walkthroughscript.py code file is intended to be invoked in two different
ways, with two different names and environments.
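The two-descriptor structure described above can be sketched as follows. This is an illustrative reconstruction, not the actual output of Listing 7.9; the field names (fqn, help, language, command, environment, default_poll_s) are assumptions about the introspection schema.

```python
# Hypothetical sketch of the JSON emitted by "walkthroughscript.py --info".
# Field names are illustrative assumptions, not the exact HPROC schema.
import json

descriptors = [
    {
        "fqn": "edu.stanford.thesis.sa",
        "help": "Creates comparison forms.",
        "language": "python",
        "command": "walkthroughscript.py --run",
        "environment": "standalone",   # run periodically by hprocron
        "default_poll_s": 10,
    },
    {
        "fqn": "edu.stanford.thesis.web",
        "help": "Displays a binary photo comparison form as HTML.",
        "language": "python",
        "command": "walkthroughscript.py --run",
        "environment": "webcgi",       # run in response to HTTP requests
        "default_poll_s": 0,           # no polling for web hprocesses
    },
]

print(json.dumps(descriptors, indent=2))
```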
Once the newly uploaded code file has been introspected via --info, the intro-
spection information is added to the MySQL database. Specifically, it is added to a
table of code descriptors that HPROC knows about, shown in Table 7.1. Once the
FQN Environment Command Poll (s) . . .
e.s.t.sa   standalone   /opt/hproc/code/walkthroughscript.py --run   10   . . .
e.s.t.web  webcgi       /opt/hproc/code/walkthroughscript.py --run    0   . . .
Table 7.1: The code descriptors table within the MySQL database in the HPROC system, after walkthroughscript.py has been introspected. Some columns have been removed, edu.stanford.thesis has been abbreviated to e.s.t, and default poll seconds has been abbreviated to “Poll (s).”
code descriptor is in this table, as we will see, other parts of HPROC can use the
code file. Once the code file has been introspected and one or more rows have been
added to the code descriptors table, the uploadCode XML-RPC call (triggered on line
8 of Listing 7.2) returns successfully, signalling that the code has been successfully
uploaded and introspected.
7.5.4 Hprocess Creation
Section 7.5.1 established a connection from the programmer’s remote client machine
to the HPROC system. Section 7.5.2 uploaded a code file, and registered that code
file’s code descriptors in a table in the MySQL database. However, we have not yet
actually done anything with walkthroughscript.py other than sending it to the
HPROC host and introspecting it. The next step is to create an hprocess (analogous
to an operating system process) within the HPROC system, associated with one of
the code descriptors just registered.
Line 10 of Listing 7.2 (i.e., the upload script) creates a new hprocess, using the
newHprocess method of the previously created connection object. In particular,
newHprocess takes the fully qualified name of a code descriptor and creates a new
hprocess associated with that code descriptor. The practical effect of “creating a new
hprocess” is two things:
1. A line is added to a table of process descriptors, with a new identifier, to signify
that there is a new hprocess.
2. Any special aspects of the code descriptor are handled (see Section 7.5.5).
HPID Code Descriptor FQN Status . . .
1003 e.s.t.sa waiting . . .
Table 7.2: The process descriptors table within the MySQL database in the HPROC system, after a new hprocess with the edu.stanford.thesis.sa code descriptor of walkthroughscript.py has been created. Some columns have been removed, and edu.stanford.thesis has been abbreviated to e.s.t. The HPID is the process identifier for the hprocess.
Since newHprocess is called through the connection object, these actions actually
take place through the programmer remote API CGI on the web server.
In our case, after the line is added to the table of process descriptors in the MySQL
database, that table looks something like Table 7.2. Note that this table is keyed on
the HPID column, and every hprocess has a unique hpid identifier (even if multiple
hprocesses have the same code descriptor FQN).
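The process descriptors table can be sketched as follows, using SQLite in place of MySQL for self-containment; the column names follow Table 7.2, but the exact schema is an assumption.

```python
# Illustrative sketch of the process descriptors table (SQLite standing in
# for MySQL); column names follow Table 7.2 and are assumptions.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE process_descriptors (
        hpid    INTEGER PRIMARY KEY,     -- unique per hprocess
        fqn     TEXT NOT NULL,           -- code descriptor FQN
        status  TEXT NOT NULL            -- e.g. waiting, finished
    )""")
db.execute(
    "INSERT INTO process_descriptors VALUES "
    "(1003, 'edu.stanford.thesis.sa', 'waiting')")
row = db.execute(
    "SELECT fqn, status FROM process_descriptors WHERE hpid = 1003"
).fetchone()
```

Because hpid is the primary key, two hprocesses may share a code descriptor FQN but never an hpid.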
7.5.5 Polling
Once the hprocess is created, there may be additional actions that need to occur. In
our case, recall that the edu.stanford.thesis.sa code descriptor (shown in Listing
7.9) had a default poll s value of 10. As a result, when the programmer remote
API CGI creates a new hprocess associated with this code descriptor, in addition to
adding it to the process descriptors table, the CGI also tells hprocron to periodically
poll the new hprocess. (The purpose of having default poll s be a code descriptor
option is to make it easier for a programmer to work in a TurKit crash-and-rerun
style where the hprocess is run periodically.)
Hprocron is responsible for maintaining a list of hprocesses that should be polled
periodically or at a particular time. For example, the hprocess just created in our
walkthrough needs to be polled every ten seconds. Once hprocron is notified by the
programmer remote API CGI that this polling should occur, the hprocron process
will fire an E POLL 1003 event every ten seconds, which will in turn cause the event
handling code to resume the hprocess. (See Section 7.4 for a discussion of event han-
dling by hprocron.) Any part of the HPROC system can request that an hprocess be
polled. Once the hprocron process has been notified to poll the hprocess periodically,
creation of the new edu.stanford.thesis.sa hprocess has finished.
7.5.6 Executable Environment
Section 7.4 stated that hprocron was responsible for resuming hprocesses, but
did not specify how this resuming was done. To resume an hprocess, hprocron
spawns a real UNIX operating system process. This operating system process is
spawned with the command line of the code descriptor for the process. For exam-
ple, our hprocess with hpid 1003 would be resumed by invoking the command line
/opt/hproc/code/walkthroughscript.py --run based on Table 7.1.
This operating system process runs in a UNIX environment where UNIX environ-
mental variables are set. (By UNIX environmental variables, we mean, for example,
$HOME whose value is the user’s home directory.) In particular, the environmental
variable HPID is set, with the value of the hpid (i.e., from Section 7.5.4). For ex-
ample, the process spawned as a result of our hprocess resuming spawns in a UNIX
environment where the environmental variable HPID is set to 1003. Having the hpid
available in the environment of the process allows hprocesses with the same code de-
scriptor to behave differently. In fact, this walkthrough will include three hprocesses
with the same edu.stanford.thesis.sa FQN.
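The spawning step can be sketched as below. This is a minimal illustration of the idea, not HPROC's actual launcher; the function name resume_hprocess is an assumption.

```python
# Minimal sketch of how hprocron might resume an hprocess: spawn the code
# descriptor's command line with the HPID environmental variable set, so
# that hprocesses sharing a code descriptor can behave differently.
import os
import subprocess

def resume_hprocess(hpid, command):
    env = dict(os.environ)        # inherit the normal UNIX environment
    env["HPID"] = str(hpid)       # identify which hprocess this is
    return subprocess.run(command, env=env)

# e.g. resume_hprocess(1003,
#          ["/opt/hproc/code/walkthroughscript.py", "--run"])
```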
When the hprocess suspends, it simply throws an exception and exits. Hprocesses
always suspend themselves, by voluntarily exiting.
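The suspend-by-exception pattern can be sketched as follows; the exception class and storage dictionary are illustrative assumptions, not HPROC's actual names.

```python
# Sketch of the crash-and-rerun suspend pattern: an hprocess "suspends" by
# raising an exception and exiting; when later polled, it reruns from the
# top and tries again. SuspendHprocess is an illustrative name.

class SuspendHprocess(Exception):
    """Raised when a needed result is not yet available."""

def get_or_suspend(storage, name):
    # Return the stored value, or suspend the hprocess if it is missing.
    if name not in storage:
        raise SuspendHprocess(f"variable {name!r} not ready; exiting")
    return storage[name]
```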
7.5.7 Dispatch Handling
Section 7.5.4 created a new hprocess. Section 7.5.5 started hprocron polling
that hprocess. Then Section 7.5.6 showed how that periodic polling leads to a running
UNIX operating system process. As a result, at this stage in the walkthrough,
walkthroughscript.py --run is being invoked and exiting every ten seconds. So
what does walkthroughscript.py --run do?
As we noted in Section 7.5.3, walkthroughscript.py starts in the main() func-
tion of Listing 7.6 (line 114) and calls the defaultCommandLineHandler (line 134).
When run with --run (rather than --info), this handler calls the runFunc function
(line 108).
The runFunc function checks to see whether it is being run by hprocron or by
the web server (we will see how the latter is possible in Section 7.5.10). The goal
of the runFunc function is to serve as a branch between the hprocess acting either
as a standalone hprocess (as here) or as a web hprocess (discussed later). In our
case, the hprocess being resumed every ten seconds is being run by hprocron, so the
dispatchSingle function on line 112 is run. Both the dispatchSingle function
of line 112 and the dispatchDefaultFunction function of line 110 are convenience
functions for implementing cross-hprocess function calls.
Both functions will check the variable storage for variables of the proper type
(func call or func default variables). If they find a variable of the right type,
both parse the variable and determine the desired function and arguments within
the variable. If the desired function exists within the program, that function is
called with the arguments. In the case of dispatchSingle, if the called function
returns successfully, dispatchSingle will place the result in the variable storage of
the source hprocess and set the status of the target hprocess to finished. (Recall
from Table 7.2 that hprocesses have a status in their process descriptor.) Hprocesses
with a finished status will not be resumed (rerun) by hprocron (or by the web
hprocess wrapper CGI, as we will see). The dispatchDefaultFunction function
works similarly to dispatchSingle, but does not set the status to finished, and
does not return a result. Thus, dispatchDefaultFunction is useful for hprocesses
which may have functions called many times (in our case, web hprocesses), while
dispatchSingle is useful for emulating TurKit crash-and-rerun.
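The dispatchSingle behavior just described can be sketched as below. This is a simplified reconstruction under stated assumptions: variables are plain dictionaries, the type name func_call and the JSON keys mirror Table 7.3, and the real implementation reads from the MySQL variable storage rather than an in-memory list.

```python
# Hedged sketch of dispatchSingle-style behavior: find a waiting function
# call variable for this hprocess, invoke the named function, place the
# result where the caller will look, and mark this hprocess finished.
import json

def dispatch_single(variables, functions, statuses, hpid):
    for var in variables:
        if (var["hpid"] == hpid and var["type"] == "func_call"
                and var["status"] == "waiting"):
            call = json.loads(var["value"])
            fn = functions.get(call["fn"])
            if fn is None:
                continue                      # no such function in this file
            result = fn(*call["args"])
            # Place the result in the source hprocess's variable storage.
            variables.append({
                "hpid": call["return_hpid"],
                "name": call["return_varname"],
                "type": "result",
                "status": "done",
                "value": json.dumps(result),
            })
            statuses[hpid] = "finished"       # finished hprocesses not rerun
            return result
    raise RuntimeError("no waiting function call; hprocess suspends")
```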
At this stage in our walkthrough, nothing has put a function call in variable stor-
age for the hprocess with hpid 1003. As a result, hprocron will keep polling hprocess
1003, and hprocess 1003 will keep being resumed, leading to a UNIX operating sys-
tem process running walkthroughscript.py. However, that script will run until
it reaches dispatchSingle, which will throw an exception (causing the hprocess to
suspend) because there is no function call in its variable storage.
HPID     1003
Name     bc7eec4b56f300080fb36682f7c763768d9a7bce
Type     func call
Status   waiting
Value    {"fn": "compareItems",
          "args": ["http://i.stanford.edu/photo1.jpg",
                   "http://i.stanford.edu/photo2.jpg"],
          "return hpid": 1000,
          "return varname": "74f9dd61890dfced29673b6b5ecd7b34f7fe3845"}
. . .    . . .
Table 7.3: The row of the variable storage table corresponding to the compareItems function call.
7.5.8 Remote Function Calling
Sections 7.5.4, 7.5.5, 7.5.6, and 7.5.7 described what happened when the upload script
created a new hprocess on line 10 of Listing 7.2. The return value of newHprocess
on that line is an object which serves as a proxy for that new hprocess, named
compareItemsProc. On lines 11–13, the upload script makes a function call on that
hprocess, requesting the result of the function call compareItems on two URLs,
http://i.stanford.edu/photo1.jpg and http://i.stanford.edu/photo2.jpg,
executed using the new hprocess with hpid 1003.
When we say that the upload script makes a function call on the new hprocess,
what we really mean is that:
1. Lines 11–13 are parsed on the remote programmer’s client. Specifically, the
function to be called (compareItems) and the arguments to that function (the
two URLs) are determined by the compareItemsProc object.
2. The compareItemsProc object determines the hpid of the hprocess which it
corresponds to (1003). This information was saved when the newHprocess call
returned.
3. The compareItemsProc object determines an appropriate hpid to identify the
upload script. (This hpid just needs to not conflict with other hpids within the
HPROC system, and is only used for accessing the variable storage.)
4. The compareItemsProc object then makes an XML-RPC request to the pro-
grammer remote API CGI, requesting that a variable be added to variable stor-
age. The variable is a function call variable requesting the result of compareItems
on the two URLs be placed in variable storage with the upload script’s hpid
and a particular variable name.
5. The programmer remote API CGI processes the XML-RPC request, and places
the function call variable in variable storage within the HPROC system on the
HPROC host.
6. The compareItemsProc object returns a lazy result object (described below)
corresponding to the return variable of the function call.
After these steps, the row of variable storage corresponding to the function call
looks like Table 7.3. (Note that this is a single row of the variable storage table
presented vertically because of the length of some of the content.) The hpid of the
variable is 1003, because that is the hpid of the target hprocess. The name of the
variable is an auto-generated string, designed to be unique. The type of the variable
is func call. The status of the variable is waiting, meaning that the function call
is waiting to be processed. The value of the variable is a JSON object which details
the specifics of the function call. In particular, the value shows the function (fn is
compareItems), the arguments (args is the two URLs), the desired hpid of the return
variable (return hpid is 1000, the hpid of the upload script), and the desired variable
name of the return variable (return varname is a string designed to be unique).
All that is left for the upload script now is to wait. Eventually, the returned
result of compareItems will be in variable storage under the appropriate return hpid
and name. As noted previously, the return value of the function call to the upload
script is a lazy result object. By calling get on this lazy result object on line 13, we
cause the upload script to periodically request the return variable via XML-RPC to
the programmer remote API CGI. These periodic requests continue until the return
variable exists and there is a result returned.
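The lazy result object can be sketched as below. In the real system the lookup is an XML-RPC call to the programmer remote API CGI; here an injected fetch callable stands in for that request, and all names are illustrative.

```python
# Sketch of a lazy result object: get() periodically re-requests the return
# variable until it exists, then returns it.
import time

class LazyResult:
    def __init__(self, fetch, varname, poll_s=0.01):
        self._fetch = fetch        # callable returning the value or None
        self._varname = varname
        self._poll_s = poll_s

    def get(self, timeout_s=5.0):
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            value = self._fetch(self._varname)
            if value is not None:
                return value               # the return variable now exists
            time.sleep(self._poll_s)       # wait, then re-request
        raise TimeoutError(f"no result for {self._varname!r}")
```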
7.5.9 Local Hprocess Instantiation
By this point, we have walked through most of the upload script in Listing 7.2. We
turn our attention back now to walkthroughscript.py and hprocess 1003 which was
previously (Section 7.5.7) suspending and resuming periodically waiting for a function
call in variable storage. Now the variable storage has the function call shown in Table
7.3.
The next time hprocess 1003 is resumed, it will reach dispatchSingle on line
112 of Listing 7.6. This time, dispatchSingle will see the function call in variable
storage. dispatchSingle will then check the rest of the code file to determine if
there is a function called compareItems to call. There is such a function, on line
98 of Listing 7.5. dispatchSingle calls this function with the two URLs that were
contained in the arguments in variable storage.
The compareItems function uses newHprocess, cross-hprocess function calls, and
lazy result get like we saw in the upload script. However, in this case, these constructs
perform slightly differently because they are running locally on the HPROC host
within the HPROC system, rather than externally on the remote client. The key
differences are:
1. Each construct goes directly to the MySQL database, rather than going through
the programmer remote API CGI. However, newHprocess and cross-hprocess
function calls add the same rows to the process descriptors and variables tables
respectively.
2. The get method of the lazy result object returned by a cross-hprocess function
call will suspend the hprocess (throw an exception) rather than continuously
request the return variable if the return variable for the function is not yet
available. (This is in keeping with TurKit crash-and-rerun style.)
The first three lines of compareItems (starting on line 99 of Listing 7.5) create a new
hprocess (with hpid 1004), place a function call in variable storage for the new hpro-
cess (requesting the makeForm function) and then request the result of the makeForm
function (with get). Because the code descriptor for this new hprocess (hpid 1004)
is the same as that for hprocess 1003, it will be set up similarly, including the default
polling period of once every ten seconds. Because the result is not yet available,
hprocess 1003 will suspend. (In fact, hprocess 1003 will fail to make progress beyond
this point until hprocess 1004 returns the result of the makeForm call.)
7.5.10 Form Creation
When hprocess 1004 is next polled by hprocron, it will resume. At that point, like
hprocess 1003, hprocess 1004 will reach the dispatchSingle function on line 112 of
Listing 7.6. However, in this case, dispatchSingle will look in the UNIX environ-
mental variables and find that HPID is set to 1004 rather than 1003. As a result,
it will look under variable storage for function call variables with a waiting status
for hpid 1004 rather than 1003. dispatchSingle will then find a similar variable to
that shown in Table 7.3. However, the function call variable for hprocess 1004 will
be a function call for the function makeForm. As a result, dispatchSingle will call
makeForm on line 72 of Listing 7.5.
makeForm creates a new hprocess on line 73. However, makeForm uses a different
code descriptor from the one that we have been using so far. Specifically, it creates a
new hprocess corresponding to edu.stanford.thesis.web, which is a code descrip-
tor for an hprocess designed to be run in response to HTTP requests to the web
hprocess wrapper CGI through the web server rather than local events. We assume
that this new hprocess has an hpid of 1005. When a worker makes a request to a
specially crafted URL on the web server running on the HPROC host, this hprocess
will be resumed and the output of the hprocess will be sent to the requesting worker.
Specifically, a request to the URL:
https://hproc.stanford.edu/www.cgi/1005/*
... will cause the web hprocess wrapper CGI to resume hprocess 1005. By the
asterisk, we mean that any URL beginning with the text before the asterisk will lead
to hprocess 1005 being run.
There are four additional ways in which web hprocesses differ from the standalone
hprocesses introduced so far.
1. A web hprocess includes the environmental variables described in Section 7.5.6,
but it additionally includes environmental variables specific to the web server,
such as the URL of the HTTP request that caused the hprocess to be resumed.
2. The web hprocess wrapper CGI (via the web server) will return the standard
output produced by the hprocess to the requesting worker.
3. Because every web hprocess is intended to produce standard output, they are
not intended to be used with the TurKit crash-and-rerun model.
4. Standalone hprocesses are resumed one-by-one by hprocron, while many web
hprocesses may be resumed at the same time by executions of the web hprocess
wrapper CGI (via the web server) in response to HTTP requests.
Otherwise, web and standalone hprocesses are quite similar, and are intended to inter-
act in natural ways. In particular, in this walkthrough, both the web and standalone
hprocesses are in the same script, walkthroughscript.py.
After creating the web hprocess with hpid 1005, makeForm sets up a default func-
tion for hprocess 1005 on line 74 of Listing 7.5. As noted in Section 7.5.7, default
functions are similar to regular function calls, except that they do not return a result
to the caller. (Default functions are a special case of the human processing send
operation from Section 6.4.) In this case, the default function for hprocess 1005 is
being set to the handleRequest function with the arguments being the two URLs
supplied to makeForm. As we will see later, this means that when a worker makes an
HTTP request for the URL associated with hprocess 1005, handleRequest will get
called with the two URLs as arguments.
The last thing makeForm does is return two URLs to its caller.
These URLs are both specially crafted to resume hprocess 1005, but one will lead to
the handleRequest function displaying a form to an end user, while the other will be
an endpoint for XML-RPC. (Both of these will be discussed in Section 7.5.11.) When
makeForm returns these URLs, it actually returns them to the dispatchSingle func-
tion of hprocess 1004, which will then return them via the variable storage to hprocess
1003, which was the original compareItems hprocess. Hprocess 1004 will then be set
to have a status of finished, because the hprocess has completed successfully and
used the dispatchSingle function.
7.5.11 Form Parts
Section 7.5.10 created a web hprocess as hpid 1005. However, we have not yet ex-
plained what happens when a worker requests a URL associated with hprocess 1005.
In fact, three separate responses can happen depending on three potential URLs.
These three URLs and associated functionality, which we call thrower, catcher, and
XML-RPC are the standard way of setting up a human driver (see Section 6.4) in
the HPROC system. We trace this functionality below.
Recall that if we are running in a web environment, dispatchDefaultFunction
on line 110 of Listing 7.6 will be called, rather than dispatchSingle. This dispatch
function will check the variable storage for variables of the func default type. It
will then find the variable set by makeForm in hprocess 1004, and call handleRequest
(the default function). This is true for any URL of the form
https://hproc.stanford.edu/www.cgi/1005/*
because any URL of that form will cause hprocess 1005 to resume.
However, handleRequest (line 62 of Listing 7.4) differentiates between three
URLs. In particular,
Thrower URL https://hproc.stanford.edu/www.cgi/1005/thrower
Catcher URL https://hproc.stanford.edu/www.cgi/1005/catcher
XML-RPC URL https://hproc.stanford.edu/www.cgi/1005/xmlrpc
handleRequest differentiates with a utility function called requestType, which sim-
ply looks at the environmental variables of the program to determine the URL
type. Depending on the URL of an HTTP request, handleRequest will call one
of doThrower, doCatcher, or doXmlRpc.
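The routing can be sketched as below. The use of the PATH_INFO environmental variable and the helper names are assumptions for illustration; the actual requestType utility may inspect different variables.

```python
# Sketch of requestType-style dispatch: a web hprocess inspects the request
# URL (here via the CGI PATH_INFO variable, an assumption) and picks one of
# the three handlers.
import os

def request_type(path_info):
    # e.g. "/1005/thrower" -> "thrower"
    return path_info.rstrip("/").rsplit("/", 1)[-1]

def handle_request(handlers, path_info=None):
    if path_info is None:
        path_info = os.environ.get("PATH_INFO", "")
    return handlers[request_type(path_info)]()
```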
The doThrower function (starting on line 10 of Listing 7.3) displays an HTML
form to the worker requesting the URL. The HTML form is parameterized by the two
URLs to compare, and allows the worker to choose between them using radio buttons.
The HTML form is also parameterized by its action. In particular, the action (the
URL which the form will submit to) is the URL for the catcher. The HTML form
includes some necessary JavaScript for compatibility with Mechanical Turk which we
have removed for clarity.
The doCatcher function (starting on line 31 of Listing 7.3) is responsible for
receiving the form submission from the thrower form. First, doCatcher saves the
choice made by the worker. The code on line 34 just parses the form submission by
the worker and then saves the choice made by the worker to variable storage. Note
that there is a special syntax (v[’variable’]) for conveniently accessing variables
persisted to variable storage. Second, once the worker’s choice is persisted, the worker
is redirected back to the Mechanical Turk. Because the standard output of web
hprocesses is directly sent by the web hprocess wrapper CGI (via the web server),
web hprocesses can send redirects and other HTTP headers. In our case, the catcher
sends a Location header to redirect.
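Because standard output is returned verbatim, such a redirect is just printed text. A minimal sketch, with an illustrative URL:

```python
# Sketch of a CGI redirect as the catcher might emit it: printing a
# Location header (followed by a blank line) to standard output redirects
# the requesting worker. The target URL here is illustrative only.
def redirect(url):
    # CGI headers end with a blank line; the body may be empty.
    return f"Location: {url}\r\n\r\n"

# print(redirect("https://www.mturk.com/"), end="")
```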
The doXmlRpc function (starting on line 55 of Listing 7.4) is responsible for com-
municating with the recruiter, discussed in Section 7.5.12. As discussed in Section
7.5.1, XML-RPC is an RPC format that can easily be set up via a CGI script. In our
case, if a program does an HTTP POST to
https://hproc.stanford.edu/www.cgi/1005/xmlrpc
it will execute a method on the object that begins on line 42 of Listing 7.4. The
doXmlRpc function on line 55 simply registers the object to respond to XML-RPC
requests. Because standard output is returned verbatim by the web hprocess wrapper
CGI (via the web server), a web hprocess can just as easily form an RPC end point
as return output to a user.
7.5.12 Form Recruiting
Section 7.5.11 described what happens if each of three URLs associated with the web
hprocess 1005 is requested. However, as it stands, there is not yet any reason for the
URLs to be requested. We now describe how those URLs are advertised and reached.
Recall that hprocess 1003, running compareItems, was previously stuck suspend-
ing at the end of Section 7.5.9 because the result of makeForm was not yet available.
However, in Section 7.5.10, makeForm completed and returned its result via vari-
able storage. As a result, compareItems can now proceed. From lines 103–106,
compareItems does the same thing as with makeForm, creating a new hprocess, call-
ing a cross-hprocess function, and then throwing an exception when the result of the
function is not available. In this case, the cross-hprocess function call is to fillForm,
which takes the XML-RPC URL that was returned by the makeForm call as an argu-
ment.
We assume that the new hprocess which is created to satisfy the fillForm call
has hpid 1006. Once the compareItems hprocess suspends, the new hprocess that
was created will eventually be polled and then resumed by hprocron. This hprocess
will check its variable storage via dispatchSingle and run fillForm as requested.
The purpose of fillForm (starting on line 81 of Listing 7.5) is to interact with the
recruiter. On line 82, a connection object to the recruiter is created, called r iface.
The recruiter is based on a ticketing system, so on line 85, fillForm gets a unique
ticket identifier for the human driver that will be managed. On line 86, fillForm
sends a request, via the connection object, for the recruiter to manage the human
driver associated with the given XML-RPC URL. In this case, the XML-RPC URL is
the one we received from makeForm earlier. (Note that the manage request is similar
to the recruit operation from Section 6.4.) The human driver is also associated with
the unique ticket identifier.
The recruiter is a separate process accessible via the connection object. When
it receives the request to manage the human driver, it uses the XML-RPC URL of
the human driver to get more information. In particular, the recruiter can use any
of the methods in the CompareFormHandler object on line 42 of Listing 7.4. To get
the thrower URL, which will need to be advertised on Mechanical Turk, it can call
getThrower. To determine if workers have filled out the form, the recruiter can call
the handler's isDone (not to be confused with the recruiter's own isDone). Lastly, to get
the current results of the form, the recruiter can call getResults which will return
the results posted by workers from variable storage. (getResults is the HPROC
equivalent of the human processing get(driverid) operation from Section 6.4.) The
recruiter is then in charge of advertising the thrower URL on Mechanical Turk and
checking with isDone to determine if the advertising was successful.
Meanwhile, after requesting that the recruiter manage the human driver, the
fillForm hprocess checks if the recruiter says that the human driver has completed
(i.e., workers have filled out the form). If not, the hprocess throws an exception and
suspends, resuming at some later point, when polled by hprocron, to check again. Eventually,
the hprocess will resume and the recruiter will say that the human driver is complete.
Then fillForm requests the results from the recruiter and marks the ticket complete.
(These are the results of the form associated with the human driver, but could include
additional information from the recruiter.) Then, the fillForm function returns the
form results, leading dispatchSingle to return the form results.
Once the fillForm hprocess returns the results via variable storage, the
compareItems hprocess does the same, and the upload script sees the result remotely
in variable storage. Finally, the upload script prints the result of the worker’s choice
(line 15 of Listing 7.2) and there is various boilerplate to tear down a remote connec-
tion (line 17 of Listing 7.2).
7.6 HPROC Walkthrough Summary
The walkthrough in Section 7.5 illustrated most of the features of the HPROC sys-
tem. As we saw, HPROC is a self-contained system in which to run code. To program
with HPROC, one writes at least two programs, a remote program and a program
to run within the HPROC system. Programs that are run within HPROC are intro-
spected to create code descriptors that can be used by other code within the system.
Special hprocesses can then be created based on a code descriptor. These hprocesses
can emulate TurKit crash-and-rerun style by using the poller in hprocron, but can
also serve as human drivers that handle managing web forms and even the web forms
themselves. Hprocesses have their own variable storage, special environmental vari-
ables, and event system. Hprocesses (as well as external programs) can also make
cross-hprocess function calls to accomplish tasks and split up work. Lastly, we saw
how hprocesses can delegate the advertising of human drivers to recruiters and then
retrieve human results. Overall, HPROC is a powerful system that combines TurKit
style programming, web hprocesses, human drivers, and the recruiter concept in a
natural way.
7.7 Case Study
For the rest of this chapter, we consider a case study of using human comparators
to sort blurry photographs. In this section, we describe the organization of our
case study. Specifically, we describe our dataset to be sorted (Section 7.7.1), our
modifications to the dataset to make it suitable for sorting (Section 7.7.2), and the
human interfaces involved (Section 7.7.3).
In Sections 7.8 and 7.9, we introduce two example sorting algorithms based on
Merge-Sort and Quick-Sort, respectively. Merge-Sort and Quick-Sort are
not necessarily perfect for this problem. In practice, one might use a tournament or an
algorithm which takes into account human variability. However, Merge-Sort and
Quick-Sort are simple algorithms that we expect the reader to be familiar with,
and they illustrate well the challenges of evaluating such algorithms. In particular,
Merge-Sort and Quick-Sort show how important interfaces (such as those in
Section 7.7.3) are to human algorithms, and demonstrate fairly wide differences in
cost, time, and accuracy—the variables that we aim to measure.
7.7.1 Stanford University Shoe Dataset 2010
If researchers are to compare the performance of their human algorithms, they need
standard datasets. We created a dataset of over 100 photographs of single shoes taken
in the same lighting conditions at the same distance using a full-frame digital camera.
(See Figure 7.2 for examples.)
Why photographs of shoes? The dataset naturally has two types of orderings:
objective and subjective orderings. We can create an objective ordering by modifying
the pristine original photographs in some known way. (This is what we do in this case
Figure 7.2: Shoes from the Stanford University Shoe Dataset 2010 blurred to varying degrees.
study, see Section 7.7.2 below.) We can also create subjective orderings by asking
workers which shoes they like best, or which seem dirtiest. These types of orderings
may only be partial, may differ across workers, and are generally more complex to
handle. In general, human sorting has several variables: disinterest by particular
workers, worker quality for a particular task, and how consistently people agree on
the ordering of the dataset, to name a few. We believe that this dataset will be quite
effective at disentangling these various variables in the future.
7.7.2 Sorting Task
For this chapter, we created a sorting task in the following way. We started with
72 photos of different shoes from the Stanford University Shoe Dataset. We then
randomly ordered them and applied a Gaussian blur to each photo in ascending
order using the ImageMagick convert binary [4]. The first photo had no blur at all,
the next had a blur of σ = 0.5 and radius 1.5, the next had a blur of σ = 1.0 and
radius 3, up to the last photo with σ = 35.5 and radius 106.5. (The σ value is the
(a) Binary Comparison (b) Ranked Comparison
Figure 7.3: Two different human comparison interfaces.
primary determinant of how blurry the image becomes.) We resized the photos to
351x234 for presentation on the web.
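The blur schedule described above can be sketched as follows: photo i receives sigma = 0.5i with radius 3×sigma, matching the stated progression (photo 0 unblurred, the last photo at sigma 35.5 and radius 106.5). The exact convert invocation used in the study is not shown in this section, so the command construction below is an assumption; ImageMagick's -blur argument takes the form "{radius}x{sigma}".

```python
# Sketch of the Gaussian blur schedule applied to the randomly ordered
# photos. File names are placeholders.
def blur_command(i, src, dst):
    sigma = 0.5 * i
    if sigma == 0:
        return ["convert", src, dst]          # first photo: no blur at all
    radius = 3 * sigma
    return ["convert", src, "-blur", f"{radius}x{sigma}", dst]
```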
For each sorting task, we randomly select n photos from our set of blurred photos.
We then ask workers on Mechanical Turk to sort them by blurriness using our system
and the algorithms described below. We evaluate the quality of our algorithms based
on Kendall’s τ rank correlation between the output ordering and the true ordering.
In other words, if the workers return the results in the order (σ = 0.5, σ = 3.5, σ =
8.0, σ = 12.0) then the algorithm worked well, whereas if we get (σ = 12.0, σ =
3.5, σ = 8.0, σ = 0.5) then it did not. (In contrast to, for example, sorting pictures of
numbers, we hope that it will not be immediately obvious to the workers that we are
studying them.)
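The Kendall's τ evaluation described above can be sketched in a few lines of plain Python. This is a minimal illustration under a no-ties assumption; the function name kendall_tau is ours and is not part of HPROC:

```python
def kendall_tau(observed, true_order):
    """Kendall's tau rank correlation between an observed ordering and
    the true ordering: (concordant pairs - discordant pairs) divided by
    the total number of pairs, assuming no ties."""
    rank = {item: i for i, item in enumerate(true_order)}
    n = len(observed)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # a pair is concordant if it appears in true-order position
            if rank[observed[i]] < rank[observed[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2.0)

true = [0.5, 3.5, 8.0, 12.0]
print(kendall_tau([0.5, 3.5, 8.0, 12.0], true))   # perfect ordering: 1.0
print(kendall_tau([12.0, 3.5, 8.0, 0.5], true))   # shuffled ordering: -2/3
```

On the two example orderings from the text, this yields τ = 1.0 for the correct ordering and τ ≈ −0.67 for the badly shuffled one.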
7.7.3 Comparison Interfaces
We consider two different human interfaces for comparing photographs in this case
study. In our case, each of the comparison interfaces allows human workers to order
photos by blurriness. The first interface (Figure 7.3(a)), which we call the binary
comparison interface, asks the worker which of two photos is less blurry. The worker
uses a radio button to select the less blurry photo. The second interface (Figure
7.3(b)), which we call the ranked comparison interface, asks the worker to rank photos
from least blurry to most blurry. The worker drags and drops the photos until they
are in the correct order. Both interfaces show the photos vertically in sequence. There
is no limit to the number of photos which can be presented in the ranked comparison
interface, though in our evaluation, we only consider 4 and 8 photo rankings.
7.8 H-Merge-Sort
This section describes our H-Merge-Sort variant of Merge-Sort. We begin in
Section 7.8.1 by describing Merge-Sort. Then, we introduce some convenience
functions for use in our algorithms in Section 7.8.2. In Section 7.8.3, we give an
overview of our new H-Merge-Sort. In Section 7.8.4, we describe the functions we
use in our implementation of H-Merge-Sort. Finally, we walk through our HPROC
implementation of H-Merge-Sort in Section 7.8.5.
7.8.1 Classical Merge-Sort
The traditional Merge-Sort is a bottom-up divide-and-conquer approach to sorting.
Traditional Merge-Sort consists of two alternating functions, Merge-Sort and
Merge.
The Merge function takes two sorted lists, s1 and s2, and produces a single sorted
list s3 by merging the lists item by item. While there are still items in both s1 and
s2, Merge will compare the first item in both lists (i.e., s1[0] vs s2[0]) and append
the minimum item to the final sorted list s3 to be returned. Once either list is empty,
the remaining list is appended to s3, and finally s3 is returned by Merge.
The Merge-Sort function takes an unsorted list u0 and eventually returns a
sorted list. If the unsorted list is of length 1, the list is returned, because the list is
already sorted. If the unsorted list is of length greater than one, the list is split in
half, into two sublists, u1 and u2. Then, Merge-Sort is recursively called on each
of the two sublists u1 and u2, producing two sorted sublists, s1 and s2. Then, Merge
is called on the two sorted sublists, producing a single sorted list s3, which is then
returned.
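In conventional (non-human) code, these two functions can be sketched as follows. This is a generic textbook version for reference, not HPROC code:

```python
def merge(s1, s2):
    """Merge two sorted lists into a single sorted list, item by item."""
    s3 = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        # compare the heads of both lists and append the minimum to s3
        if s1[i] <= s2[j]:
            s3.append(s1[i]); i += 1
        else:
            s3.append(s2[j]); j += 1
    # once either list is exhausted, append the remainder of the other
    return s3 + s1[i:] + s2[j:]

def merge_sort(u0):
    """Recursively split the unsorted list in half, sort, and merge."""
    if len(u0) <= 1:
        return u0
    half = len(u0) // 2
    return merge(merge_sort(u0[:half]), merge_sort(u0[half:]))

print(merge_sort([8, 6, 4, 2, 5, 7, 3, 1]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The human variants below replace the `s1[i] <= s2[j]` comparison with a comparison performed by a worker.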
7.8.2 Convenience Functions
There are three functions that we use in our H-Merge-Sort and H-Quick-Sort
but whose implementations we do not show: getBinaryOrdering,
getRankOrdering, and order2pairs.
getBinaryOrdering takes two items and returns a sorted list of those two items
sorted by a human through a binary comparison form. Likewise, getRankOrdering
takes a list of items and returns a sorted list of those items sorted by a human through
a ranked comparison form. getBinaryOrdering and getRankOrdering are wrappers
around functionality we already saw in our Section 7.5 walkthrough. Specifically,
getBinaryOrdering and getRankOrdering are mostly the same as compareItems
from Listing 7.5. All three functions handle posting tasks to Mechanical Turk, and
all three post photo comparison tasks. However, there are two differences between
these two functions and compareItems. First, the two functions use the two different
interfaces of Section 7.7.3: getBinaryOrdering uses the binary comparison interface,
while getRankOrdering uses the ranked comparison interface. Second, the returned
results of getBinaryOrdering and getRankOrdering differ from that of compareItems.
compareItems returned which of the two photos was the lesser, e.g., photo1;
getBinaryOrdering and getRankOrdering both return a sorted list of the items.
order2pairs converts an ordered list into a dictionary of binary comparisons. For
example, if getRankOrdering returned the ordering (url3, url2, url1), order2pairs
on the returned ordering would yield:
(url1, url2) = 'l>r'    (url2, url3) = 'l>r'
(url1, url3) = 'l>r'    (url3, url1) = 'l<r'
(url2, url1) = 'l<r'    (url3, url2) = 'l<r'
The dictionary stores for each URL pair a string indicating whether the left side l is
less than or greater than the right side r. order2pairs lets us determine the result
of a single binary comparison based on an ordering that contains that comparison.
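A plain-Python sketch of order2pairs consistent with the example above; this is our reconstruction, not the dissertation's actual implementation:

```python
def order2pairs(ordering):
    """Expand a sorted ordering (least item first) into a dictionary
    mapping each ordered pair of items to 'l<r' or 'l>r'."""
    pairs = {}
    for i, a in enumerate(ordering):
        for b in ordering[i + 1:]:
            pairs[(a, b)] = 'l<r'  # a precedes b, so a is the lesser
            pairs[(b, a)] = 'l>r'
    return pairs

# the example from the text: getRankOrdering returned (url3, url2, url1)
print(order2pairs(('url3', 'url2', 'url1'))[('url1', 'url2')])  # l>r
```

For an ordering of k items, this produces k(k − 1) dictionary entries, one per ordered pair.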
7.8.3 H-Merge-Sort Overview
Transitioning from the classical Merge-Sort of Section 7.8.1 to a human version
where the binary comparisons are based on humans is fairly easy. Anywhere we would
do a binary comparison in regular Merge-Sort, we can just request that a worker do
that binary comparison using getBinaryOrdering (from Section 7.8.2). Otherwise,
the original Merge-Sort does not need to be changed to use humans.
However, things get more complex once it is possible to rank up to r items at a time
using humans with a ranked comparison form, a case for which Merge-Sort was
not designed. We make two changes to Merge-Sort to handle ranked comparisons.
The first change involves the base case. Recall that in Section 7.8.1, Merge-
Sort would recurse until reaching a singleton list. By contrast, our ranked version
recurses until it reaches a list of length less than or equal to r, the maximum number
of photos to be compared at a time. Recursing any deeper than this would simply
lead to unnecessary human comparisons, which are expensive.
The second change involves the Merge function. Recall that in Section 7.8.1,
Merge would conduct a binary comparison of the two items at the heads of the
sorted lists s1 and s2, appending the minimum of those two items to the final sorted
list s3. With a ranked comparison, we should be able to append more than one item
at a time to the list s3. Our solution is to do a ranked comparison of the first f1
elements of list s1, and the first f2 elements of list s2. We choose these values as:
f1 = min(⌈r/2⌉, len(s1))        f2 = min(r − f1, len(s2))
Put another way, we choose roughly r/2 items from the front of each of s1 and s2,
and compare the combined list. For example, if we had the lists
s1 = [1, 3, 5, 7, 9] s2 = [2, 4, 6, 8, 10]
and r = 8, we would take the first four items from each list, e.g., f1 = 4 and f2 = 4,
leading us to do a ranked comparison of items:
[1, 3, 5, 7, 2, 4, 6, 8]
The Merge strategy described with ranked comparisons can be shown to always be
able to append at least r/2 items to s3 (assuming that there are at least r/2 items
left in s1 and s2).
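The f1/f2 calculation above can be written as a small helper; the name front_sizes is ours, introduced purely for illustration:

```python
import math

def front_sizes(s1, s2, r):
    """How many items to take from the fronts of sorted lists s1 and s2
    for a single ranked comparison of at most r items:
    f1 = min(ceil(r/2), len(s1)), f2 = min(r - f1, len(s2))."""
    f1 = min(int(math.ceil(r / 2.0)), len(s1))
    f2 = min(r - f1, len(s2))
    return f1, f2

s1, s2 = [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]
f1, f2 = front_sizes(s1, s2, r=8)
print(s1[:f1] + s2[:f2])  # [1, 3, 5, 7, 2, 4, 6, 8]
```

Note that when one list is short, the formula shifts the unused budget to the other list, so a full r items are still ranked whenever enough items remain.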
7.8.4 H-Merge-Sort Functions
Our H-Merge-Sort consists of three functions, addcache (Listing 7.10), merge
(Listing 7.11), and mergesort (Listing 7.12), plus the convenience functions described
in Section 7.8.2. (Both our H-Merge-Sort and H-Quick-Sort are presumed to
have the fully qualified name edu.stanford.sort.)
We implement Merge with ranked comparisons (as discussed in Section 7.8.3)
using a cache. The idea is that the cache stores all pairwise orderings that have been
discovered to date. Merge functions like a regular binary comparison Merge, using
binary comparisons from the cache. The cache itself may be added to using binary or
ranked comparisons, using the addcache function below. As we will see, addcache
gets called when a needed comparison is not available, and will add at least that
comparison to the cache (though possibly more comparisons as well).
The addcache function takes four arguments, a left side (i.e., s1), a right side
(i.e., s2), a configuration dictionary conf for specifying options, and a dictionary
cache to which to add binary comparisons (see below). The addcache function
can be configured using conf to either use binary or ranked comparisons. If con-
figured for binary comparisons, addcache will take the first items from lists s1 and
s2, i.e., s1[0] and s2[0], request a getBinaryOrdering of s1[0] ∪ s2[0], and then place
the order2pairs of the ordering in cache. If configured for ranked comparisons,
addcache will take the first items f1 and f2 (calculated above) from lists s1 and
s2, request a getRankOrdering of s1[0..f1 − 1] ∪ s2[0..f2 − 1], and then place the
order2pairs of the ordering in cache. (In the special case when s1[0..f1 − 1] and
s2[0..f2 − 1] are both singleton lists, the ranked comparison will be downgraded to a
binary comparison.)

1  def addcache(left, right, conf, cache):
2      foundordering = []
3
4      maxranked = 2
5
6      if conf.has_key('maxranked'):
7          maxranked = int(conf['maxranked'])
8
9      if ((maxranked == 2) or ((len(left) == 1) and (len(right) == 1))):
10         findOrderProc = newHprocess('edu.stanford.sort')
11         lazyFindOrder = findOrderProc.fn.getBinaryOrdering(
12             left[0], right[0])
13         foundordering = lazyFindOrder.get()
14
15     else:
16         leftamount = maxranked / 2 + (
17             maxranked / 2 - min(maxranked / 2, len(right)))
18         rightamount = maxranked / 2 + (
19             maxranked / 2 - min(maxranked / 2, len(left)))
20
21         findOrderProc = newHprocess('edu.stanford.sort')
22         lazyFindOrder = findOrderProc.fn.getRankOrdering(
23             left[:leftamount] + right[:rightamount])
24         foundordering = lazyFindOrder.get()
25
26     comps = order2pairs(foundordering)
27     cache.update(comps)

Listing 7.10: addcache function (with FQN "edu.stanford.sort").
The merge function takes four arguments, a left side (i.e., s1), a right side (i.e.,
s2), a configuration dictionary conf for specifying options, and a dictionary cache
containing binary comparisons between items (see below). Our merge function is
more or less the same as the Merge described in Section 7.8.1, with two exceptions:
1. Our merge is tail-call recursive, for system-specific reasons not discussed here.
In other words, rather than a for loop, our merge calls itself with fewer items
in either left or right.
2. Our merge does not compare two items directly. Instead, merge checks whether
the needed comparison of the items s1[0] and s2[0] is available in the cache.
If the comparison is not available, merge calls addcache to add one or more
comparisons to the cache, including at least the currently necessary comparison.
(conf is passed to addcache, but not otherwise used by merge.) Either way,
the comparison is now available, and merge continues the merging process,
appending the minimum item to s3.
In other words, our merge pretends that it is a tail recursive version of our binary
comparison Merge, but keeps a cache so that it can make use of multiple binary
comparisons implicitly produced by a single ranked comparison. Our merge eventually
produces a sorted list s3 of the left (s1) and right (s2), as with Merge.
The mergesort function takes two arguments, an unsorted list l and a configura-
tion dictionary conf for specifying options. Our mergesort is more or less the same
as the Merge-Sort described in Section 7.8.1, with two exceptions:
1. If the configuration variable conf is set to allow ranked comparisons, the base
case will be changed to ranked comparisons of size r, rather than singleton lists.
2. All recursive calls to mergesort are requested lazily. That is, the call is made
to mergesort with separate hprocesses on the left half of the unsorted items,
and on the right half of the unsorted items. Only then is the result requested
of either, using get, potentially causing a crash.
1  def merge(left, right, conf, cache=None):
2      if cache is None:
3          cache = {}
4
5      result = []
6
7      if ((len(left) == 0) or (len(right) == 0)):
8          result.extend(left)
9          result.extend(right)
10         return result
11
12     if not cache.has_key((left[0], right[0])):
13         addcache(left, right, conf, cache)
14
15     rightmerge = newHprocess('edu.stanford.sort')
16     lazymerge = None
17
18     if cache[(left[0], right[0])] == 'l<r':
19         result.append(left[0])
20         lazymerge = rightmerge.fn.merge(
21             left[1:], right, conf, cache)
22     else:
23         result.append(right[0])
24         lazymerge = rightmerge.fn.merge(
25             left, right[1:], conf, cache)
26
27     next = lazymerge.get()
28
29     return result + next

Listing 7.11: merge function (with FQN "edu.stanford.sort").
1  def mergesort(l, conf):
2      if len(l) < 2:
3          return l
4
5      if conf.has_key('maxranked'):
6          maxranked = int(conf['maxranked'])
7
8          if (len(l) <= maxranked) and (maxranked > 2) and (len(l) > 2):
9              findOrderProc = newHprocess('edu.stanford.sort')
10             lazyFindOrder = findOrderProc.fn.getRankOrdering(l)
11             foundordering = lazyFindOrder.get()
12
13             return foundordering
14
15     middle = len(l) / 2
16
17     lazyleft = newHprocess('edu.stanford.sort').fn.mergesort(
18         l[:middle], conf)
19     lazyright = newHprocess('edu.stanford.sort').fn.mergesort(
20         l[middle:], conf)
21
22     left = lazyleft.get()
23     right = lazyright.get()
24
25     lazymerge = newHprocess('edu.stanford.sort').fn.merge(
26         left, right, conf)
27     final = lazymerge.get()
28
29     return final

Listing 7.12: mergesort function (with FQN "edu.stanford.sort").
mergesort eventually produces a sorted list based on merge and addcache.
7.8.5 H-Merge-Sort Walkthrough
We now demonstrate a partial walkthrough of our H-Merge-Sort. We assume that
we are sorting eight photographs that we will number 1–8. We will assume that the
true sort order is
[1, 2, 3, 4, 5, 6, 7, 8]
and that the initial ordering is
[8, 6, 4, 2, 5, 7, 3, 1]
As discussed in Section 7.5, we need an upload script in order to run a program
from a remote client. In our case, we do not show the uploader, but presume that
it is similar to Listing 7.2. However, rather than calling compareItems on an hpro-
cess associated with edu.stanford.thesis.sa, in our case we call mergesort on an
hprocess associated with edu.stanford.sort. Specifically, the call to mergesort is
mergesort([8,6,4,2,5,7,3,1], {'maxranked':4})
The second parameter is the conf configuration parameter, which is a dictionary
containing configuration information. In this case, conf indicates that at most 4
items can be ranked at the same time using the ranked comparison interface. At this
point, we have a single hprocess within the HPROC system periodically running a
dispatchSingle (not shown; see Section 7.5.7) to mergesort.
When this first single hprocess next resumes, the hprocess runs the code shown in
Listing 7.12. The singleton list base case on lines 2–3 does not apply because there
are eight items in the list. Because we have specified maxranked in conf, a second
base case is checked on lines 5–13. Specifically, we check whether the number of
items in the list is at most the maximum number that can be ranked. As it turns
out, there are eight items in the list, and only four can be ranked at a time, so this
base case is skipped as well.
Next, the list is split in half (line 15) and two recursive mergesort calls are made
on the left and right sides, with two separate new hprocesses (lines 17–20). These
recursive calls return lazy results, which are not requested until lines 22–23. This
means that our initial cross-hprocess function call (say, hpid 1001), will produce
two new hprocesses (hpids 1002 and 1003) corresponding to mergesort on the lists
[8, 6, 4, 2] and [5, 7, 3, 1]. Then, when the first lazy result object has its get method
called, the first mergesort hprocess (hpid 1001) crashes.
This allows the other two mergesort hprocesses to run. Both now have less than
or equal to four items, so the base case on lines 5–13 now applies. This means that for
both, a new hprocess will be created, to get a getRankOrdering ranked comparison
for items [8, 6, 4, 2] for hpid 1002, and for items [5, 7, 3, 1] for hpid 1003. Both of these
getRankOrdering calls will eventually create comparison forms via web hprocesses,
in a style similar to our original walkthrough.
All three hprocesses described thus far (hpids 1001, 1002, and 1003) will now
continuously crash-and-rerun waiting for new data. Eventually, workers will fill out
the web forms created by the calls to getRankOrdering, and the hprocesses with hpids
1002 and 1003 will return two ranked orderings, [2, 4, 6, 8] and [1, 3, 5, 7], assuming
the workers compute the correct orderings. Our original hprocess 1001 then creates a
new hprocess to merge these lists (lines 25–26), which are returned to it as left and
right (lines 22–23). The result is again not available on line 27, so hprocess 1001
crashes again waiting on the result of a call to
merge([2,4,6,8], [1,3,5,7], {'maxranked':4})
to hprocess 1010 (a new hprocess). (Note that we choose a later hpid here, because
there have been a number of hprocesses created by getRankOrdering in hpids 1002
and 1003.)
When hprocess 1010 is resumed, the hprocess calls the function merge in Listing
7.11. Neither left nor right is empty, so the case on line 7 is skipped. The cache
is then checked for the initial comparison of the head items of the left and right
lists. The cache is thus checked for the tuple (2, 1), which is not in fact in the cache,
because the cache is empty. As a result, addcache is called. (We do not create a
new hprocess for addcache because human parallelism will not be affected.)
When addcache (Listing 7.10) is called, the arguments are the full left and
right lists. The addcache function checks whether the maximum number of items
that are rankable at a time is two on line 9. There could be only two items rankable
either because the value of conf[’maxranked’] is two, or because there is only one
item each remaining in left and right. In our case, there are more than two items
remaining, and maxranked is four, so addcache skips to the case on line 15. On lines
16–19, addcache makes the f1 and f2 calculation described in Section 7.8.3. In our
case, f1 = 2 and f2 = 2, so addcache creates a new hprocess to call getRankOrdering
on the first two items of both lists, [2, 4] and [1, 3]. Hprocess 1010 then crashes on line
24 periodically until workers return the ordering. Supposing they eventually return
the correct ordering, foundordering is now [1, 2, 3, 4] which order2pairs (line 26)
turns into a dictionary of pairs as described in Section 7.8.2. Finally, the cache is
updated with these pairs on line 27, and addcache returns.
Now the comparison on line 18 can be computed, because it is in the cache. The
comparison is (2, 1) and the cache says that the answer is l>r. As a result, 1 is
appended to the result, and a new hprocess is created to do the rest of the merge,
without 1. In other words,
merge([2,4,6,8],[1,3,5,7], {'maxranked':4})
will return the result
[1, merge([2,4,6,8],[3,5,7], {'maxranked':4}, cache)]
When the new hprocess runs merge, the cache will also include the comparison (2, 3).
In fact, the cache will include all comparisons up until the lists are [4, 6, 8] and [5, 7].
Then, a new hprocess will be created to getRankOrdering of [4, 6, 5, 7]. The
ordering of these values allows merge to progress to [8] and [7], which are then finally
compared using an addcache which in this case does a binary ordering (line 9 of
Listing 7.10). Finally, the merge with hpid 1010 has merged all items, which are then
returned to hpid 1001, which returns the fully sorted list.
Two things should be noted from our walkthrough. First, hprocesses were created,
and recursive mergesorts were called, until we hit base cases (lists small enough to
rank at once) or necessary merges. Second, the hprocess conducting
the merge was effectively “blocked” while waiting for results. Before the result of
the ranked comparison [2, 4, 1, 3] was available, merge did not know it should rank
[6, 8, 5, 7]. The recursive mergesorts mean that there is a reasonable amount of
human parallelism—many tasks will be posted in parallel to the Mechanical Turk.
However, the dependence of comparisons on previous comparisons in merge puts a
limit on this human parallelism.
7.9 H-Quick-Sort
This section describes our H-Quick-Sort variant of Quick-Sort. We begin in
Section 7.9.1 by describing Quick-Sort. In Section 7.9.2, we give an overview of
our new H-Quick-Sort. In Section 7.9.3, we describe the functions we use in our
implementation of H-Quick-Sort. (We presume that the convenience functions
from Section 7.8.2 continue to be available.) Finally, we walk through our HPROC
implementation of H-Quick-Sort in Section 7.9.4.
7.9.1 Classical Quick-Sort
The traditional Quick-Sort is a top-down divide-and-conquer approach to sorting.
Traditional Quick-Sort consists of two alternating functions, Quick-Sort and
Partition.
The Partition function takes an unsorted list u0, and an item ip called the pivot.
The Partition function compares every item in u0 to ip, producing three lists:
ul is the list of items less than the pivot.
ue is the list of items equal to the pivot.
ug is the list of items greater than the pivot.
The return value of Partition is these three lists.
The Quick-Sort function takes an unsorted list u0. If the unsorted list is of
length 0, the list is returned, because the list is already sorted. If the unsorted list is
of length greater than zero, a pivot is chosen. The pivot is a random item within the
list u0. Then, the partition function is called with the pivot and the unsorted list u0,
producing three lists (ul, ue, ug). Finally, Quick-Sort returns the concatenation of
Quick-Sort applied to ul, ue, and Quick-Sort applied to ug.
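A generic (non-human) version of the two functions, again a textbook sketch for reference rather than HPROC code (the name partition_3way is ours):

```python
import random

def partition_3way(u0, pivot):
    """Compare every item in u0 to the pivot, producing the lists of
    items less than (ul), equal to (ue), and greater than (ug) it."""
    ul = [x for x in u0 if x < pivot]
    ue = [x for x in u0 if x == pivot]
    ug = [x for x in u0 if x > pivot]
    return ul, ue, ug

def quick_sort(u0):
    """Pick a random pivot, partition, and recurse on both sides."""
    if len(u0) == 0:
        return []
    pivot = random.choice(u0)
    ul, ue, ug = partition_3way(u0, pivot)
    return quick_sort(ul) + ue + quick_sort(ug)

print(quick_sort([8, 6, 4, 2, 5, 7, 3, 1]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The human variant below replaces the `x < pivot` comparisons in Partition with worker comparisons, all of which can be requested in parallel.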
7.9.2 H-Quick-Sort Overview
Similarly to H-Merge-Sort (Section 7.8.3), transitioning from the classical Quick-
Sort of Section 7.9.1 to one based on human binary comparisons is fairly easy.
Anywhere we would usually do a binary comparison, we instead do a human binary
comparison using getBinaryOrdering. In fact, this conversion is quite natural for
Quick-Sort. When H-Merge-Sort merges, the next items to be compared (the
front items in the lists s1 and s2) always depend on the results of the last comparison.
However, in H-Quick-Sort, in the Partition phase, all items in u0 are compared
to the pivot without dependence on one another. This non-dependence means that
all comparisons for a given Partition can be done at the same time.
We want our H-Quick-Sort to also be able to take advantage of ranked comparisons.
However, for the Partition phase, binary comparisons are already close to optimal,
because all of the comparisons against the pivot can be posted at the same time.
Instead, we modified Quick-Sort for ranked comparisons in our H-Quick-Sort by
using a ranked comparison to select the pivot. The idea is that the choice of pivot can
make a big difference in how effectively the list u0 is split, and we want a pivot which
is as close to the median as possible. Therefore, we request a ranked comparison of
five random items in u0 using getRankOrdering, choosing the median of five as the
pivot.
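The median-of-five pivot choice can be sketched like this, with an ordinary sorted standing in for the human ranked comparison; the function name and the rank_fn parameter are illustrative, not HPROC's API:

```python
import random

def median_of_five_pivot(u0, rank_fn=sorted):
    """Rank five random items from u0 (here with sorted; in HPROC, with
    a human ranked comparison) and return their median as the pivot."""
    sample = random.sample(u0, min(5, len(u0)))
    ordering = rank_fn(sample)
    return ordering[len(ordering) // 2]

print(median_of_five_pivot([8, 6, 4, 2, 5]))  # all five ranked: median is 5
```

Spending one ranked comparison here buys a pivot near the median, which keeps the two partitions roughly balanced.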
7.9.3 H-Quick-Sort Functions
Our H-Quick-Sort consists of two functions, partition (Listing 7.13) and quicksort
(Listing 7.14), plus the convenience functions described in Section 7.8.2.
The partition function takes two arguments, an original unsorted list l and a
pivot. Then, partition will call getBinaryOrdering some number of times (in a
tail-recursive manner), comparing the pivot to all items in the unordered list, and
will eventually return the items found to be less than, equal to, and greater than
the pivot.

1  def partition(l, pivot):
2      if len(l) == 0:
3          return ([], [pivot], [])
4
5      subpartcomp = newHprocess('edu.stanford.sort')
6      subpartfn = subpartcomp.fn.partition(l[1:], pivot)
7
8      head = l[0]
9
10     findOrderProc = newHprocess('edu.stanford.sort')
11     lazyFindOrder = findOrderProc.fn.getBinaryOrdering(head, pivot)
12     foundordering = lazyFindOrder.get()
13     comps = order2pairs(foundordering)
14
15     newl, newe, newg = subpartfn.get()
16
17     if comps[(head, pivot)] == 'l<r':
18         return (newl + [head], newe, newg)
19     else:
20         return (newl, newe, newg + [head])

Listing 7.13: partition function (with FQN "edu.stanford.sort").
The quicksort function takes two arguments, an unordered list and a config-
uration dictionary conf for specifying options. Depending on the value of conf,
quicksort will either choose the first item in the unordered list as a pivot, or it will
do the getRankOrdering median pivot described in Section 7.9.2. Having chosen the
pivot, quicksort will partition using the pivot, and eventually call itself recursively
to order the lesser and greater parts. Lastly, quicksort returns a human sorted list.
7.9.4 H-Quick-Sort Walkthrough
We now demonstrate a partial walkthrough of our H-Quick-Sort. We make the
same assumptions as our H-Merge-Sort walkthrough in Section 7.8.5. Specifically,
we assume:
1. Eight photographs 1–8.
2. True sort order is [1, 2, 3, 4, 5, 6, 7, 8].
3. Initial ordering is [8, 6, 4, 2, 5, 7, 3, 1].
4. An upload script is used, but not shown.
We use the initial cross-hprocess function call:
quicksort([8,6,4,2,5,7,3,1], {'pivot':'fiverank'})
The second argument is the conf parameter, requesting that the median-based pivot
be chosen with a human ranked comparison. At this point, we have a single hprocess
(hpid 1001) within the HPROC system periodically running a dispatchSingle (not
shown) to quicksort.
The quicksort hprocess just described (hpid 1001) runs the code in Listing 7.14.
The given list l is not empty, so the first case (lines 2–3) is skipped. A pivot is chosen
on lines 5–6, but the pivot is overwritten in lines 8–16. Specifically, the condition on
line 8 checks conf and finds that we want a fiverank pivot.

1  def quicksort(l, conf):
2      if len(l) == 0:
3          return []
4      else:
5          pivot = l[0]
6          newlist = l[1:]
7
8          if conf.has_key('pivot') and \
9             conf['pivot'] == 'fiverank' and \
10            len(l) > 4:
11             findOrderProc = newHprocess('edu.stanford.sort')
12             lazyFindOrder = findOrderProc.fn.getRankOrdering(l[:5])
13             foundordering = lazyFindOrder.get()
14
15             pivot = foundordering[2]
16             newlist = [i for i in l if i != pivot]
17
18         partcomp = newHprocess('edu.stanford.sort')
19         partfn = partcomp.fn.partition(newlist, pivot)
20         lesser, equal, greater = partfn.get()
21
22         qsortlesser = newHprocess('edu.stanford.sort')
23         qsortgreater = newHprocess('edu.stanford.sort')
24
25         qsortlfn = qsortlesser.fn.quicksort(lesser, conf)
26         qsortgfn = qsortgreater.fn.quicksort(greater, conf)
27
28         return qsortlfn.get() + equal + qsortgfn.get()

Listing 7.14: quicksort function (with FQN "edu.stanford.sort").

Because the length of the current list l is at least five items (line 10), a ranked
comparison is requested using
getRankOrdering. Specifically, the first five items in the list l are passed to a new
hprocess running getRankOrdering on line 12. These first five items are [8, 6, 4, 2, 5].
Then, hprocess 1001 crashes, waiting for the getRankOrdering hprocess to return
a result, which eventually happens. Presuming that the result from the worker is
correct, foundordering is [2, 4, 5, 6, 8] (line 13) and the median is chosen (line 15),
which is 5 in our case. The median, 5, then replaces the chosen pivot, and 5 is removed
from the list of items to be partitioned later (see below).
Now that a pivot has been chosen, a new hprocess is created and called to partition
the list l with the pivot 5 on lines 18–20. The quicksort hprocess 1001 then crashes,
waiting for partition results. The newly created hprocess for the partition runs the
call
partition([8,6,4,2,7,3,1], 5)
in Listing 7.13. The early lines of partition (lines 5–6) create more hprocesses with
calls to partition. Specifically, they create the calls
partition([6,4,2,7,3,1], 5)
partition([4,2,7,3,1], 5)
partition([2,7,3,1], 5)
partition([7,3,1], 5)
partition([3,1], 5)
partition([1], 5)
partition([], 5)
The final new partition hprocess does not create a new hprocess with a partition
call because the function returns with the base case on lines 2–3.
Each of the hprocesses then proceeds to request a getBinaryOrdering between
the head of its individual passed list l and the pivot. Each hprocess then periodically
crashes on line 12 until the worker binary comparison is returned. When each worker
binary comparison is returned, each hprocess then waits for the sub-hprocess that it
created to return (line 15). Then, each hprocess returns its sub-hprocess’ lesser, equal,
and greater items, together with its own single comparison result. This continues up
the chain until partition returns to the original hprocess 1001 running quicksort.
The result of partition is
([4,2,3,1], [5], [8,6,7])
Note that because we chose a median pivot, the partition of the list is quite equal,
whereas if we had chosen the first item, 8, we would have had a very unequal partition.
Once hprocess 1001 has the partition results, it can run quicksort on the lesser
and greater items. This further recursion is shown on lines 22–26 of Listing 7.14.
Specifically, two new hprocesses are created, the first to quicksort the list [4, 2, 3, 1]
and the second to quicksort the list [8, 6, 7]. (These two quicksorts will function
the same as the previously discussed one, though they will not use the median pivot
because there are not enough items.) Those quicksorts will eventually return sorted
lists, which are then combined with the pivot on line 28, producing a final sorted list.
The most interesting thing to note about H-Quick-Sort is the high level of hu-
man parallelism. Every binary comparison within a partition is handled in parallel,
unlike merge in H-Merge-Sort.
7.10 Human Algorithm Evaluation
Before conducting an evaluation of H-Merge-Sort and H-Quick-Sort, we first
consider how to evaluate algorithms in general using the human processing model.
A human algorithm really consists of a strategy plus an interface. In our case, our
strategies are H-Merge-Sort and H-Quick-Sort. Our interfaces are the binary
comparison and ranked comparison interfaces discussed in Section 7.7.3. Once we
have paired a strategy with one or more appropriate interfaces, we can evaluate the
combination as a complete human algorithm.
There are five main variables in evaluating any human algorithm in the human
processing model: recruiter type, cost, time, accuracy, and algorithm-specific param-
eters. (We consider the recruiter type to be part of the evaluation parameters, rather
than part of the algorithm, though it could arguably be considered either.) In our
case, the recruiter type we consider is a single "basic" recruiter which offers a task on
the Mechanical Turk for one cent and re-posts the comparison every 20 minutes if it
has not been accepted. The recruiter also only hires workers who have maintained an
acceptance rate greater than 95%. The cost is the amount paid to workers in cents over
the runtime of the algorithm. The time is the length of time it took for the algorithm
to complete. The accuracy is algorithm-specific, though in the case of sort, we calcu-
late Kendall’s τ of the sort’s result versus the true ordering. The algorithm-specific
parameters vary by algorithm, though in the case of sort, we are interested in the
total number of items to be sorted (which in turn impacts cost, time, and accuracy).
There are three other aspects that are important for human algorithm evaluation:
time period, dataset, and variation. The first aspect is the time period during which
the evaluation is done. Evaluation conducted in the middle of the night might perform
quite differently from evaluation during the day, because we are dealing with humans.
The second aspect is the dataset used, because humans may be heavily impacted by
dataset choice. As a result, we use the same dataset across our evaluation. The third
aspect is the variation across multiple runs, due to the natural variation of workers
across multiple tasks. We reflect this variation by computing standard deviation
across a number of runs.
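Reflecting run-to-run variation amounts to simple aggregation. A minimal sketch, assuming each run is summarized as a dict with `cost`, `time`, and `accuracy` keys:

```python
from statistics import mean, stdev

def summarize_runs(runs):
    """Summarize repeated runs of a human algorithm. `runs` is a list
    of dicts with 'cost', 'time', and 'accuracy' keys; returns the mean
    and sample standard deviation for each measure, as in Table 7.4."""
    summary = {}
    for measure in ("cost", "time", "accuracy"):
        values = [r[measure] for r in runs]
        summary[measure] = (mean(values), stdev(values))
    return summary
```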
7.11 Case Study Evaluation
We now evaluate H-Merge-Sort and H-Quick-Sort using the evaluation criteria
described in Section 7.10. Specifically, we consider the strategies H-Merge-Sort
and H-Quick-Sort paired with the interfaces for binary comparison and ranked
comparison. We use the basic recruiter described in Section 7.10. We consider dif-
ferent settings of the number of items to be sorted and the impact of this setting
on the cost, time, and accuracy of the sorting algorithms under consideration. Our
evaluation uses the Stanford University Shoe Dataset 2010 (Section 7.7.1). We run
all comparisons between algorithms consecutively over the course of several days in
order to control for the time period and variation of workers. Specifically, we conduct
one run starting November 8th, 2010, comparing all settings of our H-Merge-Sort
174 CHAPTER 7. PROGRAMMING WITH HPROC
Figure 7.4: Comparison of total cost of three variations of sorting. [Panels: (a) H-Merge-Sort (Binary), (b) H-Merge-Sort (Rank 8), (c) H-Quick-Sort (Binary); x-axis: Items Sorted (2–10); y-axis: Cost in Cents (0–30).]
and H-Quick-Sort for n = (5, 10, 20) (where n is the number of items to be sorted).
We conduct the other run starting November 29th, 2010, comparing three settings of
H-Merge-Sort and H-Quick-Sort across n = (2, 3, 4, 5, 6, 7, 8, 9, 10).
We consider three questions. First, we ask how interfaces impact H-Merge-Sort
in Section 7.11.1. Second, we ask whether the median pivot option in H-Quick-Sort
is helpful in Section 7.11.2. Third, we compare H-Merge-Sort to H-Quick-Sort
in Section 7.11.3. Finally, we discuss other observations on the data as a whole in
Section 7.11.4.
7.11.1 H-Merge-Sort Interfaces
How is H-Merge-Sort impacted by the choice of interface between binary com-
parisons and ranked comparisons? A change in interface could impact cost, time, or
accuracy. We consider each below.
Figures 7.4(a) and 7.4(b) show boxplots of the cost of H-Merge-Sort across
Figure 7.5: Comparison of wall clock time for three variations of sorting. [Panels: (a) H-Merge-Sort (Binary), (b) H-Merge-Sort (Rank 8), (c) H-Quick-Sort (Binary); x-axis: Items Sorted (2–10); y-axis: Clock Time in Seconds (0–2000).]
ten runs with binary comparisons and ranked comparisons (eight way), respectively.
(Boxes represent the 25th and 75th percentile of the data from ten runs, with a
horizontal line at the median, “whiskers” are drawn to the maximum and minimum
points within 1.5 times the interquartile range, and circles represent outliers outside
of that range.) For example, sorting nine items with a binary comparison H-Merge-
Sort costs around 17 cents, while sorting nine items with a ranked comparison
H-Merge-Sort costs around 4 cents. We can see that H-Merge-Sort with eight
way ranked comparisons is substantially cheaper than binary comparison H-Merge-
Sort.
Figures 7.5(a) and 7.5(b) show boxplots of the time taken by H-Merge-Sort
across ten runs with binary comparisons and ranked comparisons (eight way), respec-
tively. For example, sorting five items with a binary comparison H-Merge-Sort
takes around 500 seconds. We can see that H-Merge-Sort with eight way ranked
comparisons takes substantially less time than binary comparison H-Merge-Sort.
However, in both cases, H-Merge-Sort has a big spike around 9 or 10 sorted items.
Figure 7.6: Comparison of accuracy for three variations of sorting. [Panels: (a) H-Merge-Sort (Binary), (b) H-Merge-Sort (Rank 8), (c) H-Quick-Sort (Binary); x-axis: Items Sorted (2–10); y-axis: Kendall’s Tau (−1.0 to 1.0).]
This is because H-Merge-Sort spends more time in the Merge phase as the num-
ber of items n increases, regardless of the number of comparisons possible via ranked
comparisons. (Nine items is the point at which we need to Merge when we have
eight-way ranked comparisons, and it is a point at which more work needs to be done
in Merge for binary comparison H-Merge-Sort.)
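One way to see why the Merge phase comes to dominate around n = 9 is to count human tasks under a rough cost model. The accounting below is illustrative, not HPROC's actual bookkeeping: it assumes one k-way ranked task sorts an initial run of up to k items, while merging falls back to worst-case binary comparisons.

```python
import math

def merge_sort_tasks(n, k):
    """Rough count of human tasks for an H-Merge-Sort-style strategy:
    one k-way ranked comparison sorts each initial run of up to k items,
    then runs are merged pairwise using binary comparisons (worst case
    a + b - 1 comparisons to merge runs of sizes a and b). Illustrative
    cost model only, for k >= 2."""
    runs = math.ceil(n / k)
    tasks = runs  # one ranked-comparison task per initial run
    sizes = [min(k, n - i * k) for i in range(runs)]
    while len(sizes) > 1:
        merged = []
        for i in range(0, len(sizes) - 1, 2):
            a, b = sizes[i], sizes[i + 1]
            tasks += a + b - 1       # worst-case binary merge comparisons
            merged.append(a + b)
        if len(sizes) % 2:
            merged.append(sizes[-1])
        sizes = merged
    return tasks
```

Under this model, eight items fit in a single eight-way ranked task, while nine items force a Merge step, so the task count (and hence time) jumps sharply at n = 9.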
Figures 7.6(a) and 7.6(b) show boxplots of the accuracy, in terms of Kendall’s
τ , across ten runs with binary comparisons and ranked comparisons (eight way),
respectively. Kendall’s τ ranges between −1 (if the ordering is the perfect reversal of
the correct ordering) and +1 (if the ordering is the correct ordering). For example,
all orderings of two items are either −1 (the wrong order) or +1 (the correct order)
in Figures 7.6(a) and 7.6(b). For comparison purposes, the author scores a Kendall’s
τ roughly in the range of 0.7–1.0 when manually sorting ten items. Ultimately, it
is difficult to discern patterns in Figures 7.6(a) and 7.6(b), though we will find later
in Section 7.11.4 that they have slightly different accuracies, but primarily different
variance.
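For reference, Kendall’s τ over a result with no ties can be computed directly from concordant and discordant pairs:

```python
from itertools import combinations

def kendall_tau(result, truth):
    """Kendall's tau between a sorted result and the true ordering,
    assuming no ties: +1 for the correct ordering, -1 for its perfect
    reversal."""
    rank = {item: i for i, item in enumerate(truth)}
    concordant = discordant = 0
    for x, y in combinations(result, 2):
        if rank[x] < rank[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs
```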
7.11.2 H-Quick-Sort Median Pivot
How is H-Quick-Sort impacted by the choice of a random versus median pivot based
on a ranked comparison? We found that there was relatively little average difference
in cost, time, or accuracy between the random versus median pivots. However, we
did find that choosing a median pivot based on a ranked comparison substantially
reduced the variance in accuracy. In general, H-Quick-Sort with the median pivot
had about half the variance in accuracy of H-Quick-Sort with a random pivot. The
full numbers are shown in Table 7.4, which is described in Section 7.11.4.
7.11.3 H-Merge-Sort versus H-Quick-Sort
How does H-Merge-Sort compare to H-Quick-Sort? Figures 7.4, 7.5 and 7.6
show this comparison from n = 2 to n = 10. We only compare H-Quick-Sort with
a median pivot to H-Merge-Sort, because Section 7.11.2 showed that the two pivot
choices were similar, with median pivots having lower variance.
Our H-Quick-Sort performs largely the same as binary comparison H-Merge-
Sort in terms of cost, accuracy, and time. (We did not incorporate ranked com-
parisons into the Partition phase of our H-Quick-Sort, which might have made
H-Quick-Sort more competitive with eight way ranked comparison H-Merge-
Sort.) However, there is one big difference, which is that H-Quick-Sort does not
show as substantial a jump around n = 9 for the time taken to sort. This jump
illustrates the lack of the “blocking” Merge behavior described earlier, and suggests
that H-Quick-Sort would perform better at larger n than H-Merge-Sort.
7.11.4 Complete Data Table
Table 7.4 shows our full data for the November 8th, 2010, run described at the
beginning of this section. Here, in addition to computing mean values across ten runs,
we also compute standard deviations. We can see, for example, that the standard
deviation of accuracy for ranked comparisons (≈ 0.4) tends to be substantially higher
than that for binary comparisons (≈ 0.2–0.3).
Strategy     | Interface      | Items (n) | Cost (¢)        | Time (s)              | Accuracy (τ)
H-Merge-Sort | Choice (2-way) |     5     | 6.6 (σ = 1.17)  |  395.272 (σ = 101.44) | 0.460 (σ = 0.30)
H-Merge-Sort | Choice (2-way) |    10     | 22.8 (σ = 1.40) | 1091.386 (σ = 281.94) | 0.649 (σ = 0.22)
H-Merge-Sort | Choice (2-way) |    20     | 62.6 (σ = 4.06) | 3009.043 (σ = 753.61) | 0.702 (σ = 0.11)
H-Merge-Sort | Ranked (4-way) |     5     | 3.5 (σ = 0.53)  |  242.979 (σ = 87.86)  | 0.520 (σ = 0.43)
H-Merge-Sort | Ranked (4-way) |    10     | 10.4 (σ = 0.97) |  630.557 (σ = 243.50) | 0.569 (σ = 0.41)
H-Merge-Sort | Ranked (4-way) |    20     | 29.2 (σ = 1.40) | 1873.899 (σ = 588.48) | 0.661 (σ = 0.17)
H-Merge-Sort | Ranked (8-way) |     5     | 1.0 (σ = 0.00)  |  125.163 (σ = 166.62) | 0.640 (σ = 0.44)
H-Merge-Sort | Ranked (8-way) |    10     | 4.0 (σ = 0.00)  |  351.805 (σ = 151.72) | 0.502 (σ = 0.43)
H-Merge-Sort | Ranked (8-way) |    20     | 11.6 (σ = 0.52) | 1197.461 (σ = 373.74) | 0.494 (σ = 0.35)
H-Quick-Sort | Rand. Pivot    |     5     | 7.4 (σ = 0.97)  |  320.833 (σ = 129.80) | 0.740 (σ = 0.25)
H-Quick-Sort | Rand. Pivot    |    10     | 24.1 (σ = 2.73) |  741.514 (σ = 208.99) | 0.698 (σ = 0.31)
H-Quick-Sort | Rand. Pivot    |    20     | 65.9 (σ = 6.67) | 1688.436 (σ = 261.63) | 0.714 (σ = 0.19)
H-Quick-Sort | Median Pivot   |     5     | 7.5 (σ = 0.97)  |  342.251 (σ = 103.06) | 0.760 (σ = 0.25)
H-Quick-Sort | Median Pivot   |    10     | 22.7 (σ = 1.25) |  701.911 (σ = 152.27) | 0.693 (σ = 0.18)
H-Quick-Sort | Median Pivot   |    20     | 66.3 (σ = 3.02) | 1709.612 (σ = 235.46) | 0.747 (σ = 0.08)
Table 7.4: Comparison of different sorting strategies and interfaces. All rows use the Basic@(1¢, 20, 95%) recruiter. Sorting dataset is the Stanford University Shoe Dataset 2010. All runs done during the week of November 8th, 2010. Results listed are the mean over ten runs, with standard deviation in parentheses.
Seeing the full time values, we can also evaluate whether the scale of the time
values makes sense for crash-and-rerun programming. Crash-and-rerun programming
only makes sense when interfacing with humans takes substantially longer than com-
puting time [51]. In our case, waiting for a human worker can take tens of minutes,
so crash-and-rerun seems like a reasonable design choice.
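The crash-and-rerun pattern itself can be sketched in a few lines. The `store` and `ask_human` callback below are hypothetical, not TurKit's or HPROC's actual interfaces.

```python
class NotDoneYet(Exception):
    """Raised when a human task has no result yet; the program 'crashes'
    and is simply re-run later, replaying memoized steps cheaply."""

class CrashAndRerun:
    """Sketch of TurKit-style crash-and-rerun memoization. `store` is a
    persistent dict of finished results; `ask_human` posts a task and
    returns its result, or None if no worker has answered yet."""

    def __init__(self, store, ask_human):
        self.store = store
        self.ask_human = ask_human

    def once(self, key, task):
        if key in self.store:        # replayed on re-run at no human cost
            return self.store[key]
        result = self.ask_human(task)
        if result is None:
            raise NotDoneYet(key)
        self.store[key] = result
        return result
```

On each re-run, completed steps replay from the store; only the first unfinished human step crashes the program. This is cheap precisely because re-executing the script takes seconds while a worker can take tens of minutes.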
7.12 Conclusion
This chapter introduced our HPROC system implementing most of the human pro-
cessing model of Chapter 6 and then used the system to conduct a short case study
of two sorting algorithms.
We first described the semantics of TurKit, the most closely related system to
HPROC. We then described the main subsystems of HPROC and how they help
implement the core HPROC concept—hprocesses. Because HPROC is such a com-
prehensive system, we illustrate system usage with a short walkthrough. Overall,
the HPROC system meaningfully extends the crash-and-rerun model introduced by
TurKit. HPROC allows a rapid prototyping style of human programming. For exam-
ple, we were able to convert our H-Merge-Sort into an H-Quick-Sort program
in a matter of hours.
However, if we were to improve upon HPROC, we would likely focus on debugging
tools to make programming easier. Crash-and-rerun leads, in effect, to many threads
of control, which can be difficult to debug. As a result, detailed information about
what state is stored in the database, as well as which hprocesses are currently active,
waiting, or finished, can be quite useful.
Having introduced HPROC, we used it to conduct our sorting case study. In our
case study, we used strategies based on Merge-Sort and Quick-Sort. We found
that the choice of human comparison interfaces had a large impact on cost, time and
accuracy in our evaluation. We also found that such interfaces could be used in unique
ways, for example, in the case of our median pivot selection for H-Quick-Sort which
tends to reduce the variance of sorting accuracy.
However, the Merge-Sort and Quick-Sort algorithms that are the basis for
H-Merge-Sort and H-Quick-Sort assume that comparisons are correct, and er-
rors can greatly reduce the quality of output. In the future, the best strategies will
take into account worker uncertainty, for example, by requesting multiple judgments.
However, one cannot request too many judgments, because each judgment adds to cost. We are
often forced to choose between trusting fewer workers to do more work (e.g., ranked
comparisons) or having more workers do less work (e.g., binary comparisons). Future
approaches to this problem might include asking workers to do a test task before
trusting their judgments, or giving workers more advanced interfaces after simpler
ones. (In general, the goal is to reduce the “noise” added by bad workers, both by
avoiding such bad workers, and by taking into account their existence in algorithm
design.) Lastly, our case study used a very basic recruiter because of current issues
with pricing in the Mechanical Turk [19]. Future recruiters will be better at recruiting
at particular times, changing prices to speed execution, and focusing on particular
workers that have provided quality, verified input in the past. Overall, the issues in
this area are quite varied, and most of the likely changes substantially impact cost,
time, accuracy, and variability.
HPROC is designed to make such exploration easy and systematic. Recruiters
and related concepts are designed both to simplify program design and, crucially, to
control for variability of the underlying marketplace, allowing for comparison between
different proposed algorithms. Thus, our contribution of the recruiter concept is a
key part of a methodology for evaluating human algorithms across systems and im-
plementations. We believe that both our evaluation methodology and the HPROC
system should be beneficial to any number of human algorithms, like sorting, cluster-
ing, and summarization. Shared systems and datasets like those introduced in this
chapter can only accelerate the exciting progress that is being rapidly made in the
growing field of human algorithms touched on in our case study.
Chapter 8
Worker Monitoring with Turkalytics
One challenge in the human processing model of Chapters 6 and 7 is the collection
of reliable data about the workers and the tasks they are performing. This data is
needed by our recruiter in particular, but is also needed by any system trying to make
human processing more effective: If a task is not being completed, is it because no
workers are seeing it? Is it because the task is currently being offered at too low a
price? How does the task completion time break down? Do workers spend more time
previewing tasks (see below) or doing them? Do they take long breaks? Which are
the more “reliable” workers?
This chapter addresses the problem of analytics for recruiting workers and study-
ing the performance of ongoing tasks. We describe our prototype system for gathering
analytics, illustrate its use, and give some initial findings on observable worker behav-
ior. We believe our tool for analytics, “Turkalytics,” is the first human computation
analytics tool to be embeddable across human computation systems (see Section 8.1
for the explicit definition of this and other terms). Turkalytics makes analytics orthog-
onal to overall system design and encourages data sharing. Turkalytics can be used
in stand-alone mode by anyone, without need for our full human-processing infras-
tructure (Figure 6.3). Turkalytics functions similarly to tools like Google Analytics
[5], but with a different set of tradeoffs (see Section 8.2.4).
We proceed as follows. Section 8.1 defines terms and describes the interaction and
data models underlying our system. We describe the implementation of Turkalytics
based on these models in Section 8.2. Section 8.3 describes how a requester uses
our system. Sections 8.4, 8.5, and 8.6 present results. Section 8.4 describes the
workload we experienced and shows our architecture to be robust. Section 8.5 gives
some initial findings about workers and their environments. Section 8.6 considers
higher granularity activity data and worker marketplace interactions. Section 8.7
summarizes related work, and we conclude in Section 8.8.
8.1 Worker Monitoring Terms and Notation
We define crowdsourcing to be the process of getting one or more people over the
Internet to perform work via a marketplace. We call the people doing the work
workers. We call the people who need the work completed requesters. A marketplace
is a web site that connects workers to requesters, allowing workers to complete (micro-
)tasks for a monetary, virtual, or emotional reward.
Tasks are grouped in task groups, so that workers can find similar tasks. Mechani-
cal Turk, the marketplace for which our Turkalytics tool is designed, calls tasks HITs
and task groups HITTypes. When a worker completes a task, we call the completed
(task, worker) pair an assignment or work.
Tasks are posted to marketplaces programmatically by requesters using interfaces
provided by the marketplaces. A requester usually builds a program called a human
computation system to ease posting many tasks (e.g., HPROC in Chapter 7). (We
use “system” in both this specific sense and in a colloquial sense, though we try to be
explicit where possible.) The system needs to solve problems like determining when to
post, how to price tasks, and how to assess the quality of completed work. The
system may be based on a framework designed and/or implemented by someone else
to solve some of these tasks, like the human processing model of Chapter 6. The
human computation system may also leave certain problems to outside services, such
as our analytics tool (for analytics) or a full service posting and pricing tool like
CrowdFlower [6].
Figure 8.1: Search-Preview-Accept (SPA) model.
The rest of this section describes two models at the core of our Turkalytics tool.
The worker interaction model of Section 8.1.1 makes it possible to represent (and
report on) the steps taken to perform work. The data model of Section 8.1.2 is key
to understanding what data needs to be collected. As we will see in Section 8.6,
our interaction model helps us present results about worker behavior. Similarly, our
data model helps us describe the implementation (Section 8.2) and requester usage
(Section 8.3).
8.1.1 Interaction Model
Crowdsourcing marketplaces vary. Some focus on areas of expertise (e.g., program-
ming or graphic design) while others are more defined by the average time span of
a task (e.g., one minute microtasks or month long research projects). Different mar-
ketplaces call for different interactions. For example, marketplaces with longer, more
skilled tasks tend to have contests or bidding based on proposals, while marketplaces
for microtasks tend to have a simpler accept or reject style. We first describe a
simple microtask model, then extend it to cover Mechanical Turk.
Simple Model
The Search-Preview-Accept (SPA) model is a simple model for microtasks (Fig-
ure 8.1). Workers initially are in the Search or Browse state, looking for work they
can do at an appropriate price. Workers can then indicate some interest in a task
by entering the Preview state through a preview action. Preview differs from Search
Figure 8.2: Search-Continue-RapidAccept-Accept-Preview (SCRAP) model.
or Browse in that the worker may have a complete view of the task, rather than
some summary information. From Preview, the worker can enter the Accept state by
accepting and actually complete the task. Lastly, the worker can always return to a
previous state, for example, a worker can return an accepted task, or leave behind a
task that he found uninteresting on preview.
The SPA model fits microtasks well because the overhead of a more complex
process like an auction seems to be much too high for tasks that may only pay a few
pennies. However, the SPA model does provide flexibility to allow workers to self-select
for particular tasks and to back out of tasks that they feel unsuited for. The
Accept state also allows greater control over how many workers may complete a given
task, because workers may be prevented from accepting a task.
Mechanical Turk Extensions
Mechanical Turk uses a more complex model than SPA which we call the Search-
Continue-RapidAccept-Accept-Preview (SCRAP) model (Figure 8.2). This model
is similar to SPA, but adds two new states, Continue and RapidAccept. Continue
allows a worker to continue completing a task that was accepted but not submitted
or returned. RapidAccept allows a worker to accept the next task in a task group
without previewing it first. In practice, the actual states and transitions in Mechanical
Figure 8.3: Turkalytics data model (Entity/Relationship diagram).
Turk are much messier than Figure 8.2. However, we will see in Section 8.6.1 that
mapping from Mechanical Turk to SCRAP is usually straightforward.
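Mapping logged page views onto SCRAP can be checked against a transition table. The transitions below are an approximation chosen for illustration; as noted, the actual states and transitions in Mechanical Turk are messier than Figure 8.2.

```python
# Approximate SCRAP transition table (after Figure 8.2). Actual
# Mechanical Turk behavior is messier; this is a simplification
# used only to sanity-check observed state sequences.
SCRAP = {
    "Search":      {"Search", "Preview", "RapidAccept", "Continue"},
    "Preview":     {"Preview", "Accept", "Search"},
    "Accept":      {"Accept", "RapidAccept", "Search"},
    "RapidAccept": {"RapidAccept", "Accept", "Search"},
    "Continue":    {"Continue", "Accept", "Search"},
}

def valid_path(states):
    """Check whether a sequence of observed worker states is consistent
    with this (approximate) SCRAP model."""
    return all(b in SCRAP[a] for a, b in zip(states, states[1:]))
```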
While SCRAP is a reasonable model of Mechanical Turk worker activity, it is
incomplete in two notable ways. First, it ignores certain specialized Mechanical Turk
features like qualifications. This is primarily because Turkalytics, as an unobtrusive
third-party add-on, cannot really observe these states. Second, SCRAP chooses a
particular granularity of activity to describe. As we will see, Turkalytics actually
includes data within a state, for example, form filling activity or mouse movement.
We can think of such data as being attached to a state, which is more or less how it
is represented in our data model.
8.1.2 Data Model
This section uses the terminology of data warehousing and online analytical processing
(OLAP) systems. Data in Turkalytics is organized in a star schema, centered around
a single fact table, Page Views. Each entry in Page Views represents one worker
visiting one web page, in any of the states of Figure 8.2. There are a number of
dimension tables, which can be loosely divided into task, remote user, and activity
tables. The three task tables are:
1. Tasks: The task corresponding to a given page view.
2. Task Groups: The task group containing a given task.
3. Owners: The owner or requester of a given task group.
The four remote user tables are:
1. IPs: The IP address and geolocation information associated with a remote user
who triggered a page view.
2. Cookies: The cookie associated with a given page view.
3. Browser Details: The details of a remote user’s browser, like user agent
(a browser identifier like Mozilla/5.0 (Windows; U; Windows NT 5.1;
en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.99
Safari/533.4,gzip(gfe)) and available plugins (e.g., Flash).
4. Workers: The worker information associated with a given remote user.
The two activity tables are:
1. Activity Signatures: Details of what activity (and inactivity) occurred during a
page view.
2. Form Contents: The contents of forms on the page over the course of a page
view.
Figure 8.3 shows an Entity/Relationship diagram. Entities in Figure 8.3 (the rect-
angles) correspond to actual tables in our database, with the exception of “Remote
Users.” Entities attached to “Remote Users” are dimension tables for “Page Views.”
The circles in the figure represent the attributes or properties of each entity.
There is one set of tables that we have left out for the purpose of clarity. As
we will see in Section 8.2, we need to build up information about a single page view
through many separate logging events. As a result, there are a number of tables,
which we do not enumerate here, that enable us to incrementally build from logging
events into complete logging messages, and then finally into higher level entities like
overall activity signatures and page views.
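A minimal sketch of such a star schema, with illustrative table and column names rather than Turkalytics' actual schema:

```python
import sqlite3

# Minimal star-schema sketch around the Page Views fact table.
# Table and column names are illustrative, not Turkalytics' actual ones.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE tasks       (task_id TEXT PRIMARY KEY, group_id TEXT);
CREATE TABLE task_groups (group_id TEXT PRIMARY KEY, reward_cents INTEGER);
CREATE TABLE workers     (worker_id TEXT PRIMARY KEY);
CREATE TABLE page_views (              -- the fact table
    page_session_id TEXT PRIMARY KEY,
    task_id   TEXT REFERENCES tasks,
    worker_id TEXT REFERENCES workers,
    state     TEXT                     -- e.g. Preview, Accept
);
""")
db.execute("INSERT INTO task_groups VALUES ('1ZSQ...', 1)")
db.execute("INSERT INTO tasks VALUES ('152...', '1ZSQ...')")
db.execute("INSERT INTO workers VALUES ('A1Y9...')")
db.execute("INSERT INTO page_views VALUES ('0.828...', '152...', 'A1Y9...', 'Accept')")

# A typical OLAP-style rollup: page views per worker per state.
rows = db.execute("""
    SELECT worker_id, state, COUNT(*) FROM page_views
    GROUP BY worker_id, state
""").fetchall()
```

Queries against the fact table roll up by any dimension (task group, worker, IP, activity), which is exactly the shape of the reports in Sections 8.4 through 8.6.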
8.2 Implementation
Turkalytics is implemented in three parts: client-side JavaScript code (Section 8.2.1),
a log server (Section 8.2.2), and an analysis server (Section 8.2.3). Section 8.2.4 gives
a broad overview of the design choices we made and limitations of our design.
8.2.1 Client-Side JavaScript
A requester on Mechanical Turk usually creates a HIT (task) based on a URL. The
URL corresponds to an HTML page with a form that the worker completes. Re-
questers add a small snippet of HTML to their HTML page to embed Turkalytics
(see Section 8.3.1). This HTML in turn includes JavaScript code (ta.js) which tracks
details about workers as they complete the HIT.
The ta.js code has two main responsibilities:
1. Monitoring: Detect relevant worker data and actions.
2. Sending: Log events by making image requests to our log server (Section 8.2.2).
ta.js monitors the following:
1. Client Information: What resolution is the worker’s screen? What plugins are
supported? Can ta.js set cookies?
2. DOM Events: Over the course of a page view, the browser emits various events.
ta.js detects the load, submit, beforeunload, and unload events.
3. Activity: ta.js listens on a second by second basis for the mousemove, scroll
and keydown events to determine if the worker is active or inactive. ta.js then
produces an activity signature, e.g., iaaia represents three seconds of activity
and two seconds of inactivity.
4. Form Contents: ta.js determines what forms are on the page and the contents
of those forms. In particular, ta.js logs initial form contents, incremental
updates, and final state.
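Producing an activity signature from per-second samples is straightforward. A sketch (the sampling itself, done in ta.js via DOM event listeners, is elided):

```python
def activity_signature(active_seconds):
    """Compress per-second activity samples into a signature string:
    'a' for an active second, 'i' for an inactive one, e.g. 'iaaia'
    for three active and two inactive seconds."""
    return "".join("a" if active else "i" for active in active_seconds)

def active_fraction(signature):
    """Fraction of the page view spent active, from a signature."""
    return signature.count("a") / len(signature) if signature else 0.0
```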
ta.js sends monitored data to the log server via image requests. We define a
logging event (or event, where the meaning is clear) to be a single image request.
Image requests are necessary to circumvent the same origin policies common in most
mainstream browsers, which block actions like sending data to external sites. Special
care is also needed to send these image requests in less than two kilobytes due to
restrictions in Microsoft Internet Explorer (MSIE). We define a logging message to
be a single piece of logged data split across one or more events in order to satisfy
MSIE’s URL size requirements. For example, logging messages sent by ta.js include
activity signatures, related URLs, client details, and so on (Listing 8.1 is one such
logging message). A single page view can lead to as few as seven or as many as
hundreds of image requests. (For example, the NER task described at the beginning
of Section 8.4 can lead to over one hundred requests as it sends details of its form
contents, because it has over 2,000 form elements.)
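Splitting a message across events can be sketched as follows. The parameter names echo Listing 8.1, but the chunk-size arithmetic is an assumption, not ta.js's actual logic.

```python
from urllib.parse import urlencode

MAX_URL = 2048  # stay under MSIE's roughly two kilobyte URL limit

def message_to_events(base_url, msg_id, payload):
    """Split one logging message into image-request URLs, tagging each
    chunk with its index and the total count (cf. turkaConcatNum and
    turkaConcatLen in Listing 8.1). The chunk-size math is a sketch."""
    overhead = len(base_url) + 80      # rough room for bookkeeping params
    chunk_size = MAX_URL - overhead
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)] or [""]
    return [base_url + "?" + urlencode({
                "turkaMsgId": msg_id,
                "turkaConcatNum": i,
                "turkaConcatLen": len(chunks),
                "data": chunk})
            for i, chunk in enumerate(chunks)]
```

The receiving side can reassemble a message once it holds all `turkaConcatLen` chunks sharing a message identifier, which is one source of the "incomplete entities" discussed in Section 8.2.3.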
8.2.2 Log Server
The log server is an extremely simple web application built on Google’s App Engine.
It receives logging events from clients running ta.js and saves them to a data store.
In addition to saving the events themselves, the log server also records HTTP data
like IP address, user agent, and referer. (We intentionally continue the historical
convention of misspelling “referer” when used in the context of the HTTP referer,
and also do so when referring to the JavaScript document referrer.) A script on the
analysis server (Section 8.2.3) periodically polls the web application, downloading
and deleting any new events that have been received. This simplicity pays off: our
log server has scaled to over one hundred thousand requests per day.
8.2.3 Analysis Server
The analysis server periodically polls the log server for new events. These events are
then inserted into a PostgreSQL database, where they are processed by a network
of triggers. These triggers are arguably the most complex part of the Turkalytics
implementation, for four main reasons:
1. Time Constraints: One of our goals is for the analysis server to be updated, and
queryable, in real time. Currently, the turnaround from client to availability
in the analysis server is less than one minute.
2. Dependencies: What to do when an event is inserted into the analysis server
may depend on one or more other events that may not have even been received
yet.
3. Incomplete Input: When Turkalytics has not yet received all logging events
pertaining to a message, page view, or any other entity from Figure 8.3, we
call that entity incomplete. Nonetheless, requesters should be able to query
as much information as possible, even if some entities are incomplete. In fact,
many entities will remain incomplete forever. (This is one negative result of an
explicit design decision in Section 8.2.4.)
4. Unknown Input: The analysis server may receive unexpected input that conflicts
with our model of how the Mechanical Turk works, yet it must still handle this
input.
These challenges are sufficiently difficult that our current PL/Python triggers are
our second or third attempt at a solution. (One earlier attempt made use of
dependency handling from a build tool, for example.)
Rather than fully describing our triggers here, we give an example of their function-
ality instead. Suppose a worker A1Y9... has just finished previewing a task 152...,
and chooses to accept it. When the worker loads a new page corresponding to the
accept state, ta.js sends a number of events. Listing 8.1 shows one such event,
a related URLs event detailing the page which referred the worker to the current
page. (Our implementation uses JavaScript Object Notation (JSON) as the format
for logging events.)
From the HTTP REFERER, Turkalytics can now learn the identifiers for the as-
signment (1D9...), HIT (152...), and worker (A1Y9...). From the PATH INFO,
Turkalytics can learn what type of data ta.js is sending (relatedUrls). From the
QUERY STRING, Turkalytics gets the actual data being sent by ta.js, in this case,
an escaped referer URL (documentReferrerEsc) which in turn includes the reward
(USD0.01), group identifier (1ZSQ...), and other information. The QUERY STRING
also includes a pageSessionId, which is a unique identifier shared by all events
sent as a result of a single page view. (pageSessionId is the key for the “Page
Views” table.) Note that neither the HTTP REFERER nor the referer sent by ta.js
as documentReferrerEsc represents the worker’s previously visited page. The
{ ...
  "HTTP_REFERER":
    "...?assignmentId=1D9...
        &hitId=152...
        &workerId=A1Y9...",
  "PATH_INFO": "/event/relatedUrls",
  "QUERY_STRING":
    "turkaMsgId=2
     &documentReferrerEsc=https%3A%2F...
        %26prevRequester%3DStanford%2B...
        %26requesterId%3DA2IP5GMJBH7TXJ
        %26prevReward%3DUSD0.01...
        %26groupId%3D1ZSQ...
     &turkaConcatNum=0
     &turkaConcatLen=1
     &targetId=f68daad1
     &timeMillis=127...
     &pageSessionId=0.828...
     &clientHash=150...",
  ... }
Listing 8.1: Excerpt from a related URLs logging event formatted as JSON.
HTTP REFERER seen by our log server is the HIT URL that the worker is currently
viewing, and documentReferrerEsc is the Mechanical Turk URL containing that
HIT URL in an IFRAME.
When the event from Listing 8.1 is inserted into the database, the following func-
tionality is triggered (and more!):
1. The current page view, as specified by pageSessionId, is updated to have
assignment 1D9..., HIT 152..., and worker A1Y9....
2. Other page views which lack a worker identifier may have one inferred based on
the current page view’s worker identifier. (This requires an invariant involving
the distance in time between page views.) For example, the page view associated
with the worker’s previous preview state is updated to have a worker identifier
of A1Y9.... (When the worker was in the preview state, the assignment and
worker identifiers were unknown, but now we can infer them based on this later
information.)
3. If not already known, a new task group 1ZSQ... with a reward of one cent is
added.
4. If not already known, a new mapping from the current HIT 152... to the task
group 1ZSQ... is added.
5. If not already known, the requester name and identifier are added to the task
group and owner entities.
This example shows that incrementally building entities from Figure 8.3 in real-time
requires careful consideration of both event dependencies and appropriate invariants.
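The worker-identifier inference in step 2 can be sketched as follows. The ten-minute window and the flat dict representation are assumptions for illustration, not the actual trigger code or invariant.

```python
BACKFILL_WINDOW = 600  # seconds; the real invariant's window is an assumption

def backfill_worker(page_views, session_id):
    """When a page view learns its worker identifier (e.g. on accept),
    propagate that identifier to nearby earlier page views from the same
    remote user that lack one -- a sketch of the analysis server's
    trigger logic, not the actual PL/Python implementation."""
    pv = page_views[session_id]
    if pv["worker_id"] is None:
        return
    for other in page_views.values():
        if (other["worker_id"] is None
                and other["remote_user"] == pv["remote_user"]
                and 0 <= pv["time"] - other["time"] <= BACKFILL_WINDOW):
            other["worker_id"] = pv["worker_id"]
```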
8.2.4 Design Choices
There are four main considerations in Turkalytics’ design:
1. Ease: We wanted Turkalytics to be easy for requesters to use and install.
2. Unobtrusiveness: We wanted Turkalytics to be as invisible as possible to workers
as they perform work, and not to impact the operation of requesters.
3. Data Collection: We wanted to gather as much worker task completion data as
reasonably possible.
4. Cross-Platform: We wanted Turkalytics to work across a number of different
human computation systems for posting work to Mechanical Turk, because such
systems are currently quite heterogeneous.
Our requirements that Turkalytics be easy, unobtrusive, and cross-platform led us to
build our tool as embeddable JavaScript, and to use simple cross-platform ideas like
sessions and cookies to group events by workers.
It is perhaps worth taking a moment to note why building an analytics tool like
Turkalytics, and in particular building it as embeddable, cross-platform JavaScript,
is nontrivial. First, we do not have direct access to information about the state of
Mechanical Turk. We do not want to access the Mechanical Turk API as each of
our requesters. However, even if we did, the Mechanical Turk API does not allow
us to query fine grained data about worker states, worker activity, or form contents
over time. Nor does it tell us which workers are reliable, or how many workers are
currently using the system. Second, data collected is often incomplete, as discussed
in Section 8.2.3, and we often need to infer additional data based on information
that we have. Third, remote users can change identifiers in a variety of ways, and
in many cases we are more interested in the true remote user than any particular
worker identifier. All of these challenges, in addition to simply writing JavaScript
that works quickly and invisibly across a variety of unknown web browsers with a
variety of security restrictions (same origin policy, third party cookies), make writing
an analytics tool like Turkalytics difficult.
Two of our design considerations, “unobtrusiveness” and “data collection,” are in
direct opposition to one another. For example, consider the following trade-offs:
• ta.js could send more logging messages with more details about the worker’s
browser state, but this may be felt through processor, memory, or bandwidth
usage.
• ta.js could sample workers and only gather data from some of them, improving
the average worker’s experience, but reducing overall data collection.
• ta.js could interfere with the worker’s browser to ensure that all logging events
are sent and received by our logging server, for example, by delaying submission
of forms while logging messages are being sent.
<script type="text/javascript"
        src="https://.../ta.js">
</script>
<script type="text/javascript">
  Turka.Frontend.startAllTracking("...");
</script>
Listing 8.2: Turkalytics embed code.
These options represent a spectrum between unobtrusiveness and data collection.
We chose to send fairly complete logging messages and avoid sampling. This is
because we believe that workers are more motivated than regular web visitors to
tolerate minor performance degradation (on the order of hundreds of milliseconds).
This is quite different from the assumptions behind tools like Google Analytics.
Nonetheless, we draw the line at interfering with worker behavior, which we deem
too obtrusive. A result of this decision is that we may occasionally have incomplete
data from missed logging messages.
A current technical limitation of our implementation is a focus on HTML forms.
HITs that make use of Flash or an IFRAME may produce incomplete activity and form
data. However, nothing in our design prevents such cases from being handled
eventually.
8.3 Requester Usage
Requesters interact with Turkalytics at two points: installation (Section 8.3.1) and
reporting (Section 8.3.2). Our goal in this section is to illustrate how easy our
current Turkalytics tool is to use and how much benefit requesters get.
8.3.1 Installation
In most cases, embedding Turkalytics simply requires adding a snippet of HTML (see
Listing 8.2) to each HTML page corresponding to a posted HIT. (See Section 8.2.1
for more on how HTML pages relate to HITs.) Most systems for displaying HITs
have some form of templating system in place, so this change usually only requires
a copy-and-paste to one file. An important special case of a human computation
system with a templating system is the Mechanical Turk requester bulk loading web
interface [7]. Underlying that interface is a templating system which generates HTML
pages on Amazon’s S3 system [8], so Listing 8.2 works there as well (by adding it to
the bottom of each template). These two cases, requesters posting HTML pages and
requesters using the bulk interface, cover all of our current requesters.
Listing 8.2 has two parts elided. The first “...” is where ta.js is located, on our
server. This does not change across installations. The second “...” is an identifier
identifying the particular requester. Currently, we assign each Turkalytics requester
a hexadecimal identifier, like 7e3f6604. Once the requester has added the snippet
from Listing 8.2 with these changes, they are done. (Requesters lacking SSL also
need to use a simple workaround script due to referer handling, but such requesters
are rare.) In our experience, the process usually takes less than five minutes and is
largely invisible to the requester afterwards.
Two implementation notes about the hexadecimal identifier bear pointing out.
The first is that, because we operate in the web browser as a third party, it
is possible that an “attacker” could send our system fake data. At this stage
there is little incentive to do so, and this weakness is shared by most analytics
systems. The second is that the hexadecimal identifier allows us to easily
partition our data on a per requester basis. Our current analysis server uses
a multitenant database where we query individual requester statistics using this
identifier, but the data could easily be split across multiple databases.
8.3.2 Reporting
Once Turkalytics is installed (Section 8.3.1), all that remains is to later report to
the requester what analysis we have done. Like most data warehousing systems, we
have two ways of doing this. We support ad hoc PostgreSQL queries in SQL and we
are in the process of implementing a simple web reporting system with some of the
more common queries. In fact, most of the data in this chapter was queried from our
SELECT tg.requester_name
     , sum(tg.reward_cents) AS total_cents
     , count(*) AS num_submits
  FROM page_views AS pv
     , task_groups AS tg
 WHERE pv.task_group_id=tg.task_group_id
   AND pv.page_view_type='accept'
   AND pv.page_view_end='submit'
 GROUP BY tg.requester_name;

SELECT tg.requester_name
     , pv.task_group_id
     , sum(tg.reward_cents * 3600)
       / sum(a.active_secs)
     , sum(tg.reward_cents * 3600)
       / sum(a.total_secs)
  FROM page_views AS pv
     , task_groups AS tg
     , activity_signatures AS a
 WHERE pv.task_group_id=tg.task_group_id
   AND pv.page_view_id=a.page_view_id
   AND pv.page_view_type='accept'
   AND pv.page_view_end='submit'
   AND a.is_complete
 GROUP BY tg.requester_name
     , pv.task_group_id;

Listing 8.3: Two SQL reporting queries.
live system, including Tables 8.1 and 8.2, Figures 8.5 and 8.6, and most of the inline
statistics. (The only notable exception is Figure 8.4, where it is somewhat awkward
to compute sequential transitions in SQL.)
To give a flavor for what you can do, Listing 8.3 gives two example queries that
run on our real system. For example, suppose we want to know which requesters in
our system are the heaviest users of Mechanical Turk. The first query computes total
number of tasks and total payout aggregated by requester by joining page view data
with task group data. An example output tuple is ("Petros Venetis", 740, 160).
That output tuple means that the requester Petros Venetis spent $7.40 USD on
160 tasks. The second query computes the hourly rate of workers based on active and
total seconds grouped by task group. (The query does so by joining page views, task
groups, and activity data, and using the activity data to determine amount of time
worked.) We might want to do this, for example, to determine appropriate pricing
of a future task based on an estimate of how long it takes. An example output
tuple is ("Paul H", "1C4...", 6101.695, 122.553). That output tuple means
that for task group 1C4... owned by Paul H, the hourly rate of workers based on
the number of active seconds was ≈$61.02 USD, while the hourly rate based on total
time to completion was ≈$1.23 USD. (Note that the example tuple has a very large
discrepancy between active and total hourly rate, because the task required workers
to upload an image created offline.)
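The rate computation in the second query is simple to verify by hand. The sketch below mirrors the sum(tg.reward_cents * 3600) / sum(a.active_secs) expression from Listing 8.3 (the SQL produces cents per hour; we divide by 100 for dollars). The 10-cent reward and the second counts below are illustrative values, not drawn from our data.

```python
def hourly_rate_dollars(reward_cents: float, secs: float) -> float:
    """Dollars per hour implied by earning reward_cents in secs seconds.
    Scales the per-task reward from seconds to an hour, then cents to dollars."""
    return reward_cents * 3600.0 / secs / 100.0

# Hypothetical task: a 10-cent reward, 60 active seconds, 300 total seconds.
active_rate = hourly_rate_dollars(10, 60)    # 6.0 dollars/hour while active
total_rate = hourly_rate_dollars(10, 300)    # 1.2 dollars/hour wall clock
```

A large gap between the two rates, as in the example tuple above, simply means workers spent most of the wall-clock time inactive on the page.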
8.4 Results: System Aspects
This section, and the two that follow (Sections 8.5 and 8.6) describe our production
experience with the Turkalytics system. Our data for these sections is collected over
the course of about a month and a half starting on June 14th, 2010. The data consists
of 12, 370 tasks, 125 worker days, and a total cost of $543.66. In our discussion below,
we refer to three groups of tasks posted by requesters using our tool:
1. Named Entity Recognition (NER): This task, posted in groups of 200 by a
researcher in Natural Language Processing, asks workers to label words in a
Wikipedia article if they correspond to people, organizations, locations, or de-
monyms. (2,000 HITs, 1 HITType, more than 500 workers.)
2. Turker Count (TC): This task, posted once a week by a professor of business
at U.C. Berkeley, asks workers to push a button, and is designed just to gauge
how many workers are present in the marketplace. (2 HITs, 1 HITType, more
than 1,000 workers each.)
3. Create Diagram (CD): This task, posted by the authors, asked workers to draw
diagrams for this chapter based on hand drawn sketches. In particular, Fig-
ures 8.1, 8.2, and 8.4 were created by worker A1K17L5K2RL3V5 while Figure 8.3
was created by worker ABDDE4BOU86A8. (≈ 5 HITs, 1 HITType, more than 100
workers.)
There are two questions worth asking about our Turkalytics tool itself. The first
is whether the system is performant, i.e., how fast it is and how much load it can
handle. The second is whether it is successfully collecting the intended monitored
information. (Because Turkalytics is designed to be unobtrusive, there are numerous
situations in which Turkalytics might voluntarily lose data in the interest of a better
client experience.) This section answers these questions focusing on the client (Sec-
tion 8.4.1), the logging server (Section 8.4.2), and the analysis server (Section 8.4.3).
8.4.1 Client
Does ta.js effectively send remote logging messages?
We asked some of our requesters to add an image tag corresponding to a one pixel
GIF directly above the Listing 8.2 HTML in their HITs. Based on access to this
“baseline” image, we can determine how many remote users viewed a page versus
how many actually loaded and ran our JavaScript. (This assumes that remote users
did not block our server, and that they waited for pages to load.)
Overall, the baseline image was inserted for 25,744 URLs. Turkalytics received
JavaScript logging messages for all but 88 of those URLs, which means that our loss
rate was less than 0.5%. There is no apparent pattern in which messages are missing
on a per browser or other basis. Our ta.js runs on all modern browsers, though
some features vary in availability based on the browser. (For example, Safari makes
it difficult to set third party cookies and early versions of MSIE slow down form
contents discovery due to DOM speed.)
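The loss-rate figure above can be checked directly:

```python
baseline_urls = 25744   # URLs where the baseline one-pixel GIF was requested
missing = 88            # URLs with no matching JavaScript logging message
loss_rate = missing / baseline_urls
print(f"{loss_rate:.3%}")  # 0.342%, comfortably under 0.5%
```
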
How complete is activity sending?
Activity data is sent periodically as logging messages by ta.js. However, as with
other logging messages, the browser thinks we are loading a series of images from
the logging server rather than sending messages. As a result, if the worker navigates
away from the page, the browser may not bother to finish loading the images. After
all, there is no longer a page for the images to be displayed on!
How commonly are activity logging messages lost? We looked at activity
signatures for 9,884 page views corresponding to NER tasks. Each page view was an
accepted task which the worker later submitted. We computed an expected number
of activity seconds for a given page view by subtracting the timestamp of the first
logging event received by Turkalytics from the timestamp of the last logging event.
If we are within 20 seconds, or within 10% of the total expected number of activity
seconds, whichever is less, we say that we have full activity data. (Activity
monitoring may take time to start, so we leave a buffer before expecting activity
logging messages.) For 8,426 of these page views, or about 85%, we have “full”
activity data in this sense.
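One reading of this criterion can be sketched as follows. The 20-second and 10% thresholds match the text, but the shortfall-based formulation is our interpretation, not the actual analysis server code.

```python
def has_full_activity(expected_secs: float, active_secs: float) -> bool:
    """'Full' activity data: the shortfall between expected and observed
    activity seconds is within 20 seconds, or within 10% of the expected
    total, whichever buffer is smaller."""
    allowed = min(20.0, 0.10 * expected_secs)
    return expected_secs - active_secs <= allowed
```
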
How fast and correct is form content sending?
Checking the form contents to send to the logging server usually takes on the order
of a few hundred milliseconds every ten to thirty seconds. This assumes a modern
browser and a reasonably modern desktop machine. Of the 9,884 NER page views
accepted and submitted from the previous section, 8,049 had complete form data.
8.4.2 Logging Server
Given the simplicity of the logging server, it only makes sense to ask what the peak
load has been and whether there were any failed requests. (Failed requests are logged
for us by the App Engine.) In general, Mechanical Turk traffic is extremely bursty—
at the point of initial post, many workers complete a task, but then traffic falls off
sharply (see Section 8.6.2). However, our architecture handles this gracefully. In
practice, we saw a peak requests/second of about ten, and a peak requests/day of
over 100,000, depending on what tasks were being posted by our requesters on a given
day. However, there is no reason to think that these are anywhere near the limits
of the logging server. During the period of our data gathering, we logged 1,659,944
logging events, and we lost about 20 per day on average due to (relatively short)
outages in Google’s App Engine itself.
8.4.3 Analysis Server
Our analysis server is an Intel Q6600 with four gigabytes of RAM and four regular
SATA hard drives located at Stanford. We batch loaded 1,515,644 JSON logging
events in about 520 minutes to test our trigger system's loading speed. Despite the
fact that our code is currently single threaded and limited to running forty seconds
of every minute, our batch load represents an amortized rate of about 48 logging
events per second. The current JSON data itself is about 2.1 gigabytes compressed,
and our generated database is about 4.6 gigabytes on disk. Both the data format
and the speed of insertion into the analysis server could be optimized: currently the
insertion is mostly CPU bound in Python, most likely due to JSON parsing overhead.
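The amortized figure follows from simple arithmetic:

```python
events = 1_515_644            # JSON logging events batch loaded
minutes = 520                 # wall-clock load time

amortized = events / (minutes * 60)   # about 48.6 events/second overall
# The loader only runs forty seconds of every minute, so its rate while
# actually executing is higher:
while_running = amortized * 60 / 40   # about 72.9 events/second
```
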
8.5 Results: Worker Aspects
Where are workers located?
Most demographic information that is known about Mechanical Turk workers is the
result of surveys on the Mechanical Turk itself. Surveys are necessary because the
workers themselves are kept anonymous by the Mechanical Turk. However, such
surveys can easily be biased, as workers appear to specialize in particular types of
work, and one common specialization is filling out surveys.
Our work with Turkalytics allows us to test the geographic accuracy of these
past surveys. In particular, we use the “GeoLite City” database from MaxMind to
geolocate all remote users by IP address. (MaxMind claims that this database is 99.5%
accurate at the country level [9].) The results are shown in Table 8.1. For example,
the first line of Table 8.1 shows that in our data, the United States represented 2,534
unique IP addresses (29.84% of the total), 1,299 unique workers (44.716% of the
total), 199 unique US workers who did the NER task, and 1,011 unique US workers
who completed the TC task.
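The per-country aggregation is straightforward once a geolocation lookup is available. A minimal sketch follows; the `geolocate` function stands in for the MaxMind GeoLite City lookup, and counting each worker by the first IP we observed for them is a simplifying assumption.

```python
from collections import Counter

def country_counts(worker_ips, geolocate):
    """Count unique workers per country. `worker_ips` is an iterable of
    (worker_id, ip) pairs; `geolocate` maps an IP string to a country name
    (in production, a MaxMind GeoLite City database lookup)."""
    first_ip = {}
    for worker_id, ip in worker_ips:
        first_ip.setdefault(worker_id, ip)   # keep the first IP per worker
    return Counter(geolocate(ip) for ip in first_ip.values())
```
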
There are two groups of countries in the data. The United States and India
Country            #IPs   %IPs     #Workers   %Workers   #Workers  #Workers
                                   (Overall)  (Overall)  (NER)     (TC)
United States      2534   29.840   1299       44.716     199       1011
India              4794   56.453   1116       38.417     227        717
Philippines         127    1.496     52        1.790      15         32
United Kingdom       92    1.083     43        1.480      11         27
Canada               86    1.013     42        1.446      10         33
Germany              50    0.589     16        0.551       6         10
Australia            32    0.377     16        0.551       4         10
Pakistan             49    0.577     15        0.516       5         10
Romania              96    1.130     14        0.482       5          7
Anonymous Proxy      13    0.153     12        0.413       0         13

Table 8.1: Top Ten Countries of Turkers (by Number of Workers). 2,884 Workers, 8,216 IPs total.
are the first group, and they represent about 80% of workers. The second group is
everywhere else, consisting of about 80 other countries, and 20% of the workers. This
second group is more or less power law distributed. We suspect that worker countries
are heavily biased by the availability of payment methods. Mechanical Turk has
very natural payment methods in the United States and India, but not elsewhere
(e.g., even in other English speaking countries like Canada, the United Kingdom, and
Australia).
Comparing the NER and TC tasks, we can see that Indians are more prevalent on
the NER task. However, regardless of grouping, the nationality orderings seem to be
fairly similar, with the caveat that Indians have many more IPs than Americans. This
suggests that previous survey data may be slightly biased based on respondents, but
overall may not be terribly different from the true underlying worker demographics.
What does a “standard” browser look like?
The most common worker screen resolutions are 1024x768 (2,266 workers at some
point), 1280x800 (1,166 workers), 1366x768 (670 workers), 1440x900 (494 workers),
1280x1024 (451 workers), and 800x600 (228 workers). No other resolution has more
than 200 workers. Given an approximate browser chrome height of 170px and a
Mechanical Turk interface height of 230px or more, a huge number of workers are
previewing (and possibly completing) tasks in less than 400px of screen height. As a
caveat, some workers may be double counted as they switch computers or resolutions.
The average is about 1.5 distinct resolutions per worker, so most workers have one
or two distinct resolutions.
About half of our page views are by Firefox users, and about a quarter are by
MSIE users. In terms of plugins, Java and Flash each represent 70–75% of our page
views, while PDF and WMA each represent 50–55%. These may be underestimates
based on our detection mechanism (navigator MIME types).
Are workers identifiable? Do they switch browsers?
It is becoming increasingly common to use the Mechanical Turk as a subject pool for
research studies. A growing body of literature has looked at how to design studies
around the constraints of Mechanical Turk. One key question is how to identify a
remote user uniquely. For example, how do I know that an account for a 30 year old
woman from Kansas is not really owned by a professional survey completer with a
number of accounts in different demographic categories?
One solution is to look at reasonably unique data associated with a remote user.
Table 8.2 shows the number of user agents, IP addresses, and identifier cookies asso-
ciated with a given worker. Ideally, for identification purposes, each of these numbers
would be one. In practice, these possibly unique pieces of data seem to vary heavily
by worker. Common user agent strings, dynamic IPs, and downgrading or blocking of
cookies (particularly third party cookies as Turkalytics uses) are all possible reasons
for this variability. For example, the worker three from the bottom had 84 distinct
Worker   UAs  IPs  Cookies  Views  Country
AXF...     3    1    17      619   India
A1B...     2    9     4      618   Multinational
A1K...     5   23     8      537   India
A3O...     4   13    68      502   India
A2C...     4   47    33      462   India
A3I...     4    2     3      450   United States
A2I...     3    4     2      393   United States
A1V...     4   14     1      314   United States
A1C...     4   10     7      303   India
A31...     3   11     2      288   India
A2H...     8    6     8      268   India
A29...     1   17     2      244   India
A3J...     3   84     2      226   India
A2O...     3    3     4      225   United States
A1E...     1   25     5      225   India

Table 8.2: The number of user agents, IP addresses, cookies, and views for top workers by page views.
IP addresses over the course of 226 page views, but nonetheless kept the same two
tracking cookies throughout. On the other hand, the first worker in the table had 17
tracking cookies over 619 page views, but only had one IP address throughout. On
average, for active workers, the user agent to page view ratio is about 1:25, the IP to
page view ratio is about 1:10, and the cookie to page view ratio is about 1:11. These
numbers are skewed by special cases, however, and the median numbers are usually
lower. (“Cheaters” appear rare, though one remote user with a single cookie seems
to have logged in seven different times, as seven different workers, to complete the
TC task.)
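The per-worker counts in Table 8.2 can be computed with a simple pass over the page view data. A minimal sketch follows; the tuple layout of a page view is hypothetical, not the actual analysis server schema.

```python
from collections import defaultdict

def identifier_stats(page_views):
    """Per-worker counts of distinct user agents, IPs, and tracking cookies,
    plus total page views (the quantities shown in Table 8.2). Each page
    view is a (worker_id, user_agent, ip, cookie) tuple."""
    uas, ips, cookies = defaultdict(set), defaultdict(set), defaultdict(set)
    views = defaultdict(int)
    for worker, ua, ip, cookie in page_views:
        uas[worker].add(ua)
        ips[worker].add(ip)
        cookies[worker].add(cookie)
        views[worker] += 1
    return {w: (len(uas[w]), len(ips[w]), len(cookies[w]), views[w])
            for w in views}
```
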
8.6 Results: Activity Aspects
Section 8.1.1 gave a model for interaction with Mechanical Turk. This section looks
at what behavior that model produces. Section 8.6.1 looks at what states and actions
Figure 8.4: Number of transitions between different states in our dataset. Note: These numbers are approximate and unload states are unlabeled.
occur in practice. Section 8.6.2 looks in particular at previewing interactions. Sec-
tion 8.6.3 looks at activity data generated by workers within page views. Our main
goals are to show that Turkalytics is capable of gathering interesting system interac-
tion data and to illustrate the tradeoffs of Mechanical Turk-like (i.e., SCRAP-like)
systems.
8.6.1 What States/Actions Occur in Practice?
Workers in the SCRAP model can theoretically execute a large number of actions.
However, we found that most transitions were relatively rare. Figure 8.4 shows the
transitions between states that we observed. (Very rare, unclear, or “unload” tran-
sitions are marked with question marks.) For example, we observed 720 transitions
by workers from Accept to RapidAccept, and 344 transitions out of the RapidAccept
state. To generate Figure 8.4, we assume the model described in Section 8.1.1 and
that workers are “single threaded,” that is, they use a single browser with a single
window and no tabs. These assumptions let us infer state transitions based on times-
tamp, which is necessary because of the referer setup described in Section 8.2.3. Over
88% of our observed state transitions are transitions in our mapped SCRAP model,
so our simplifying assumptions appear relatively safe.
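The timestamp-based inference can be sketched as follows, assuming simplified (worker, timestamp, state) records rather than the actual analysis server schema:

```python
from collections import Counter, defaultdict

def count_transitions(page_views):
    """Count state transitions under the 'single threaded' assumption:
    sort each worker's page views by timestamp and treat each consecutive
    pair of states as one transition."""
    by_worker = defaultdict(list)
    for worker, ts, state in page_views:
        by_worker[worker].append((ts, state))
    transitions = Counter()
    for views in by_worker.values():
        views.sort()
        for (_, a), (_, b) in zip(views, views[1:]):
            transitions[(a, b)] += 1
    return transitions
```
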
Figure 8.5: Number of new previewers visiting three task groups over time.
Do workers use the extensions provided by the SCRAP model above and beyond
the SPA model? We found that RapidAccept was used quite commonly, but Continue
was quite rare. About half of the workers who chose to do large numbers of tasks
chose to RapidAccept often rather than continuously moving between the Preview
and Accept states. However, continues represent less than 0.5% of our action data,
and returns about 2%. (We suspect that the transition to Accept from Preview is
particularly common in our data due to the prevalence of the simple Turker Count
task.)
8.6.2 When Do Previews Occur?
A common Mechanical Turk complaint is that the interfaces for searching and brows-
ing constrain workers. In particular, Chilton et al. observe that workers primarily
sort task groups by how recently they were created and how many tasks are available
in the group. This observation appears to be true in our data as well.
Figure 8.5 shows when previously unseen workers preview the NER, TC, and
CD task groups immediately after an instance was posted. For example, in the first
hour of availability of the TC task, almost 150 workers completed the task. The
NER task group has many tasks, but is posted only once. The TC task group has
only one task, and is only posted once. The CD task group has five tasks, but is
artificially kept near the top of the recently created list. In Figure 8.5, both NER and
Figure 8.6: Plot of average active and total seconds for each worker who completed the NER task.
TC show a stark drop off in previews after the first hour when they leave the most
recently created list. NER drops off less, likely because it is near the top of the tasks
available list. CD drops off less than TC, suggesting artificial recency helps. These
examples suggest that researchers should be quite careful when drawing conclusions
about worker interaction (e.g., due to pricing) because the effect due to rankings is
quite strong.
8.6.3 Does Activity Help?
Turkalytics collects activity and inactivity information, but is this information more
useful than lower granularity information like the duration that it took for the worker
to submit the task? It turns out that there are actually two answers to this question.
The first answer is that, as one might expect, the amount of time a worker is active
is highly correlated with the total amount of time a worker spends completing the
task in general. The second answer is that, despite this, signature data does seem to
clarify the way in which a worker is completing a particular task.
We looked at the activity signatures of 321 workers who had at least one complete
signature and had completed the NER task. The Pearson correlation between the
number of active seconds and the total number of seconds for these workers was 0.88
(see Figure 8.6). However, the activity signatures do give a more granular picture
of the work style of different workers. Figure 8.7 shows two quite different activity
signatures, both of which end in completing an accepted task. The first signature
shows a long period of inactive seconds (i) followed by bursts of active seconds (a),
diiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
.............. 300 inactive seconds .............
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
iaaaaaaiiiiaaaaaaaiaiaaaaaaaaaaaaaaaaiaaiaaaaaaaaa
aaaaaaaaaaaaaaaiiiiiiiiiaaaaaaaaaaaaaaaiaaaiiiaaaa
aaaaaiiiiiiiiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaiaaaiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaiiiiiiiiiiiiiiiaiaiiiiiiiaaiiaiaaaaaaaaiiaaaaa
aaiiaiiaaaaaaaaaiaaaaaaaiiiiaaiaaaaaaaasbu
(a) 688 Second Activity Signature
daaaaiaaaaaiaaaaaaaaaaaaiiiiiiaaaaaaaiiiaaaaaaaaaa
aaaaiiiiaaiiiaaaiiiiaaaaaiaaaaiaaaaaiiaaaaiiiasabu
(b) 96 Second Activity Signature
Figure 8.7: Two activity signatures showing different profiles for completing a task. Key: a=activity, i=inactivity, d=DOM load, s=submit, b=beforeunload, u=unload.
while the second signature shows a short period of mostly active seconds. One might
prefer one work completion style or the other for particular tasks.
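Both the signature parsing and the correlation computation above can be sketched in a few lines. Treating only the 'a' and 'i' characters of a signature as seconds (and the d/s/b/u markers as instantaneous events) is our simplifying assumption here.

```python
import math

def signature_seconds(sig):
    """Active and total seconds from an activity signature string,
    reading 'a' as an active second and 'i' as an inactive second."""
    active = sig.count("a")
    return active, active + sig.count("i")

def pearson(xs, ys):
    """Textbook Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applied to per-worker (active, total) pairs, `pearson` reproduces the kind of correlation reported above; the high value reflects that most workers are active for a roughly fixed fraction of their total time.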
8.7 Related Work
Our Turkalytics system, and the results presented, are related to four main areas
of work: human computation systems, analytics, Mechanical Turk demographic re-
search, and general Mechanical Turk work. With respect to human computation
systems, a number have been recently developed [6], [10], [27], [49]. Our intent is for
our tool to improve such systems and make building them easier. With respect to
analytics, numerous analytics tools (e.g., [5]) exist in industry, though there does not
appear to be a great deal of work in the academic literature about such tools. With
respect to demographics, independent work by Ipeirotis [41], [42] and Ross et al. [61]
used worker surveys to illustrate the changing demographics of Mechanical Turk over
time. (Section 8.5 more or less validates these previous results, as well as adding a
more recent data point.) With respect to general Mechanical Turk research, the most
common focuses to date have been conducting controlled experiments [46] and per-
forming data annotation in areas like natural language processing [67], information
retrieval [12], and computer vision [69].
8.8 Conclusion
Turkalytics gathers data about workers completing human computation tasks. We
envision Turkalytics as part of a broader system, in particular a system like HPROC
(Chapter 7) implementing the Human Processing model (Chapter 6). However, one
big advantage of our design for Turkalytics is that it is not tied to any one system.
Turkalytics enables both code sharing among systems (systems need not reimplement
worker monitoring code) and data sharing among systems (requesters benefit from
data gathered from other requesters). Our contributions include interaction and data
models, implementation details, and findings about both our system architecture and
the popular Mechanical Turk marketplace. We showed that our system was scalable to
more than 100,000 requests/day. We also verified previous demographic data about
the Turk, and presented some findings about location and interaction that are unique
to our tool. Overall, Turkalytics is a novel and practical tool for human computation
that has already seen production use.
Chapter 9
Conclusion
Over the course of the previous chapters, the development of this thesis has mirrored
the chronological development of microtasks. We began with social bookmarking
systems. Social bookmarking systems were one of the first places tags appeared, as
an adaptation to help organize a corpus that had grown too large for more classi-
cal annotators like librarians. We then looked at social cataloging systems. Social
cataloging systems and other types of more niche tagging systems were developed as
it became well understood that tags and other microtasks could be applied across a
wide variety of systems, not just bookmarking and multimedia systems. Finally, we
looked at paid microtasks. Once the power of microtasks to produce common web
features (ratings, image classification) was well understood, it made sense to design
systems that could produce microtasks in a repeatable way.
While our study happens to be chronological, it also covers microtasks at multiple
levels of detail. Overall, each of the places we studied microtasks gave us greater
insight into our research goal of better understanding the possibilities and limitations
of microtasks. Our study of social bookmarking uncovered a number of limitations
to unpaid microtasks (redundancy, lack of control), while our usage of the HPROC
system illustrated a vast number of future possibilities (human algorithms). Along
the way, we were also able to develop useful techniques (tag prediction, paid tagging,
and a methodology for human programming) for supplementing or controlling
microtasks.
We summarize the highest level findings below.
9.1 Summary
Chapter 2 looked at social bookmarking as it relates to arguably the most important
application on the web: web search. Our dataset, which led to a great deal of follow-
on work (e.g., [38], [60], [48]), was collected in a methodologically sound way, as well
as being one of the biggest crawls of a social bookmarking site. It thus allowed us
to ask a comprehensive set of questions at a scale where we would not have to worry
about sampling bias invalidating our findings.
Ultimately, we found that tags are often (though not always!) redundant in the
context of social bookmarking. Tags are commonly in the HTML title tag, and it is
also often the case that tags apply to whole domains, rather than simply the URL
being annotated. However, we did find that URLs posted to social bookmarking
systems were often useful. Such URLs tended to be new or recently modified, as
well as commonly returned in the results of web search queries. While this chapter
presented one of the largest studies of a textual tagging site specifically, we also believe
that our findings generalize to more recent systems. In particular, systems continue
to be developed which center around users saving and sharing their favorite URLs,
and interest in such systems (e.g., Twitter, Facebook “likes”) continues to grow.
Chapter 3 looked at tag prediction for social bookmarking systems. The reason
for studying tag prediction was essentially two-fold. First, we wanted to understand
the predictability of tags (based on the objects annotated as well as based on other
tags) in an abstract sense, in order to better understand tags themselves. Second,
we wanted to be able to enhance tagging systems with features that required tag
prediction, ranging from bootstrapping to system suggestion to tag disambiguation.
We proposed and evaluated tag prediction from two different perspectives. The
first perspective looked specifically at the features of the URLs available on social
bookmarking systems like del.icio.us, including features like page text, anchor text,
and surrounding domains. We showed that support vector machines were effective
for prediction in this case. We also proposed two measures, frequency and entropy,
that are correlated with how predictable a tag will be. The second perspective looked
only at predicting tags using other tags. By using only tags, our methods can work
on any tagging system, rather than only social bookmarking systems. We were able
to perform tag to tag prediction with association rules (market basket data mining),
which are both efficient and interpretable by humans.
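The tag-to-tag perspective can be sketched by treating each post's tag set as a market basket and mining pairwise rules with support and confidence. The function name and toy posts below are illustrative, and real mining would handle larger itemsets and far larger datasets:

```python
from collections import Counter
from itertools import permutations

def mine_tag_rules(posts, min_support=0.2, min_confidence=0.5):
    """Mine pairwise association rules (x -> y) from tag sets, market-basket style.
    support = P(x and y together); confidence = P(y | x)."""
    n = len(posts)
    single = Counter(t for post in posts for t in set(post))
    pair = Counter((x, y) for post in posts for x, y in permutations(set(post), 2))
    rules = []
    for (x, y), cnt in pair.items():
        support = cnt / n
        confidence = cnt / single[x]
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules

posts = [
    {"python", "programming"},
    {"python", "programming", "web"},
    {"photography"},
    {"python"},
]
# e.g. the rule programming -> python holds with support 0.5 and confidence 1.0
```

Because each rule is just "users who apply x often also apply y", the output is directly interpretable by humans, which is one reason we favored association rules over opaque models for this task.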
Chapter 4 marked a transition from studying social bookmarking systems to study-
ing social cataloging systems. In social bookmarking systems like del.icio.us, it is
difficult to determine if users are performing microtasks effectively because we lack a
gold standard for comparison. We analyzed whether tags were redundant given infor-
mation like page text and anchor text, but we had no real way to determine if a tag
was intrinsically good. By contrast, social cataloging systems, where users tag books,
have objects which are annotated with library terms by trained librarians. This fact
allowed us to compare tags to another form of organization by treating library terms
as a gold standard.
In a sense, tagging represents a cheap, non-expert model for annotation, in con-
trast to an expensive, expert model in the form of established library science. We
found that tagging actually had many of the features of the expert-annotated library
terms. In particular, we found that tagging was usually consistent, high quality, and
complete. In terms of consistency, we found that tags could be federated across
tagging systems, in our case between the LibraryThing and Goodreads systems. (This
was because tags, and their usage, were similar across systems.) In terms of quality,
we found that medium-frequency tags, as well as paid tags, were competitive with
library terms in side-by-side comparisons. We also found that neither synonymy nor
low-quality tag types were common. In terms of completeness, we found that, at
least for highly tagged objects, tags tended to have good coverage of existing library
terms.
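The completeness comparison can be illustrated with a simple coverage measure: the fraction of expert library terms matched by at least one user tag. The exact token match below is a simplification (real term matching would need normalization well beyond this), and the book data is made up:

```python
def coverage(tags, library_terms):
    """Fraction of expert library terms matched (case-insensitively) by some tag."""
    tagset = {t.lower() for t in tags}
    matched = sum(1 for term in library_terms if term.lower() in tagset)
    return matched / len(library_terms) if library_terms else 1.0

tags = ["fantasy", "fiction", "tolkien", "epic"]
library_terms = ["Fantasy", "Fiction"]
# Both library terms are covered by the tag set, so coverage is 1.0.
```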
Chapter 5 dropped the assumption of Chapter 4 that library data represented
a gold standard. Instead, we simply compared the library terms and tags with no
particular preference for one or the other. We found that by and large, experts appear
to choose good terms for organizing data. However, we found, on the basis of tagging,
that experts tend to annotate different objects with those terms than regular users.
Chapter 6 began a three-chapter sequence on microtasks in general, and on our
Human Processing model in particular. We introduced the model using an extended
example, and contrasted it to two other major models: the Basic Buyer and Game
Maker models. The Human Processing model focuses on modularity and reuse, so
that programmers do not have to constantly reinvent the wheel when writing human
programs. The model also introduces novel features, like the recruiter, which is a
concept meant to make it easier for researchers to compare human algorithms in a
controlled manner.
Chapter 7 discussed our implementation of the Human Processing model in the
form of our HPROC system. HPROC is a large system (over ten thousand lines of
code) with a number of useful features. It has a novel execution model, allowing the
programmer to mix crash-and-rerun and web execution. It supports cross-hprocess
function calls and simple memoization, which are necessary to make crash-and-rerun
easier to work with. It supports recruiters, as required by the Human Processing
model, and it provides a full Mechanical Turk API.
We illustrated the usage of our HPROC system with a case study on human
algorithms for sorting. In particular, we looked at two algorithms: a variation of
Merge-Sort (H-Merge-Sort) and a variation of Quick-Sort (H-Quick-Sort).
We also looked at variations of interfaces to support these algorithms, in particular,
comparing binary and ranked comparison interfaces. This case study shows the prac-
ticality of the Human Processing model for evaluating human algorithms and the
HPROC system for developing them. It also shows the importance that interfaces
play in human algorithms.
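The structure of H-Merge-Sort can be sketched as an ordinary merge sort whose comparison is a pluggable function; in a human algorithm, that function would post a binary-comparison task to workers and wait for the answer. The sketch below substitutes a machine comparator and is not HPROC code:

```python
def h_merge_sort(items, compare):
    """Merge sort whose comparisons are delegated to compare(a, b), which
    returns True if a should come before b. In a human algorithm this call
    would be backed by a binary-comparison task shown to workers."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = h_merge_sort(items[:mid], compare)
    right = h_merge_sort(items[mid:], compare)
    merged = []
    while left and right:
        if compare(left[0], right[0]):
            merged.append(left.pop(0))
        else:
            merged.append(right.pop(0))
    return merged + left + right

# A machine comparator stands in here for the human task interface.
print(h_merge_sort([3, 1, 2], lambda a, b: a <= b))  # prints [1, 2, 3]
```

Keeping the comparison behind a function boundary is what lets the same algorithm be evaluated with different interfaces (e.g., binary versus ranked comparisons), as in the case study.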
Chapter 8 went one step further in making the Human Processing model a pow-
erful, practical model. Human Processing relies on quality recruiters, and recruiters
in turn require a state model of the marketplaces where they work, as well as a strat-
egy for choosing actions based on that state. As a result, we developed an analytics
tool for gathering state about the Mechanical Turk marketplace. This analytics tool,
called Turkalytics, allowed us to better understand the state of the marketplace by
observing a number of tasks that researchers at Stanford posted on the Mechanical
Turk. In addition to engineering a system which was robust to significant load, we
defined a model for workers in Mechanical Turk-like systems which allowed us to map
which actions workers commonly take.
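A recruiter's state model can be built from logged worker events. The sketch below tallies state transitions per worker; the state names and sample events are illustrative, not Turkalytics' actual model or data:

```python
from collections import Counter

def transition_counts(events):
    """Count state transitions per worker from an ordered stream of
    (worker_id, state) events, pooled across workers."""
    last = {}
    counts = Counter()
    for worker, state in events:
        if worker in last:
            counts[(last[worker], state)] += 1
        last[worker] = state
    return counts

events = [
    ("w1", "preview"), ("w2", "preview"), ("w1", "accept"),
    ("w1", "submit"), ("w2", "abandon"),
]
# e.g. counts[("preview", "accept")] == 1
```

Aggregating such transition counts over many tasks is one way to map which actions workers commonly take, and hence to inform a recruiter's strategy.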
9.2 Future Work
The future of microtasks looks very bright. The number of Internet-connected users
keeps growing, and there are no signs that they are losing interest in microtasks.
Today, Twitter gets tens of millions of “tweets” each day, and innumerable Facebook
users promote URLs by “liking” them. Researchers continue to better understand
unpaid microtasks like tags, ratings, tweets and likes. Meanwhile, there is an increas-
ing drive, especially recently, to build systems based on paid microtasks like those of
Mechanical Turk. This thesis has not answered all of the questions involved in unpaid
and paid microtasks, but we hope it has laid the groundwork for a great deal of future
work. We conclude by discussing potential future work in the areas of unpaid and
paid microtasks, respectively.
Probably the biggest future opportunities in unpaid microtasks will be in new
services that capture the imaginations of millions of users. If forced to predict, we
expect services based on realtime information or geography to produce a variety of
donated unpaid microtasks in the near future. For example, services like foursquare
currently collect large amounts of realtime location information from users, together
with advice about those locations (i.e., unpaid microtasks collecting data about the
real world). This data is already being mined in interesting ways in the aggregate.
However, it is hard to predict which such services will succeed, both because they are
often very simple and because they usually depend highly on network effects.
Furthermore, for researchers, it is usually the aggregation of a multitude of
microtasks, rather than any particular one, that makes unpaid microtasks interesting.
In the specific case of tagging, the challenges continue to be in two major areas:
enhancing tagging interfaces, and better post hoc analysis of tagging data. For ex-
ample, in terms of enhancing tagging interfaces, many tagging systems now include
interfaces for suggesting tags to users as they are annotating a particular object. We
described in Chapter 3 how to predict tags, but users might often prefer a non-obvious
tag suggestion from such an interface. In terms of post hoc tagging analysis,
there is still a great need for better tools for tag clouds, taxonomy generation, and
similar aggregate understanding of tags. For example, it is difficult for a user to find
what they are looking for in a tag cloud of even one hundred tags, while systems often
contain millions of tags. However, improving upon tag clouds will require better al-
gorithms for understanding how tags are related, as well as grouping tags by purpose
and usage. The heavy interest in our early paper [34] on creating taxonomies out of
tags neatly illustrates the desire in this area for solutions to the problem of organizing
tags.
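One simple direction, in the spirit of our early taxonomy work, is to grow a tree greedily from tag co-occurrence: visit tags from most to least frequent and attach each to the already placed tag it co-occurs with most. The sketch below, with toy data, is only a rough illustration of that family of algorithms, not the algorithm of [34]:

```python
from collections import Counter, defaultdict

def cooccurrence_taxonomy(posts):
    """Greedy taxonomy sketch: visit tags from most to least frequent and
    attach each to the already placed tag it co-occurs with most; tags with
    no such neighbor become roots (parent None)."""
    freq = Counter(t for post in posts for t in post)
    co = defaultdict(Counter)
    for post in posts:
        for a in post:
            for b in post:
                if a != b:
                    co[a][b] += 1
    parent = {}
    placed = []
    for tag, _ in freq.most_common():
        best = max(placed, key=lambda p: co[tag][p], default=None)
        parent[tag] = best if best is not None and co[tag][best] > 0 else None
        placed.append(tag)
    return parent

posts = [
    {"programming", "python"},
    {"programming", "python", "django"},
    {"programming", "java"},
]
# "programming" becomes a root; "python" and "java" attach under it.
```

Better grouping of tags by purpose and usage would require richer relatedness measures than raw co-occurrence, which is exactly the open problem noted above.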
Meanwhile, the area of paid microtasks is wide open. Studying paid microtasks
overlaps with many other areas, like economics (for pricing), human-computer
interaction (for interfacing with humans), and operations research (for designing
workflows). However, many of the challenges in the area seem to be unique to paid
microtasks themselves, and previously unstudied. We envision much better systems
for programming and developing human programs and human algorithms. While
HPROC does support and simplify many aspects of human programming, two places
it could be improved are in debugging long-running programs and designing interfaces
for workers. We also envision a variety of human algorithms, including analogues of
many classical algorithms, like sorting, clustering, translation, and classification. These
human algorithms may mix and match human and machine processing to take ad-
vantage of the strengths of each. As marketplaces improve, we hope that better
recruiters will evolve that take advantage of better pricing models, and we suspect
that the best tasks for workers may start to look like games. And finally, we hope
that programming tools for microtasks will eventually filter down to the point where
regular users, rather than just computer scientists, can produce useful work using
many human microtasks.
Bibliography
[1] http://www.mturk.com/.
[2] http://getgambit.com/.
[3] http://www.livework.com/.
[4] http://www.imagemagick.org/.
[5] http://www.google.com/analytics/.
[6] http://crowdflower.com/.
[7] http://requester.mturk.com/.
[8] http://s3.amazonaws.com/.
[9] http://www.maxmind.com/app/geolitecity.
[10] http://www.smartsheet.com/.
[11] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining Association Rules
Between Sets of Items in Large Databases. SIGMOD Record, 22:207–216, June
1993.
[12] Omar Alonso and Stefano Mizzaro. Can We Get Rid of TREC Assessors? Using
Mechanical Turk for Relevance Assessment. In SIGIR ’09 Workshop on the
Future of IR Evaluation.
[13] Melanie Aurnhammer, Peter Hanappe, and Luc Steels. Integrating Collaborative
Tagging and Emergent Semantics for Image Retrieval. In Collaborative Web
Tagging Workshop (WWW’06).
[14] Shenghua Bao, Guirong Xue, Xiaoyuan Wu, Yong Yu, Ben Fei, and Zhong Su.
Optimizing Web Search Using Social Annotations. In Proceedings of the 16th
International Conference on World Wide Web, WWW ’07, pages 501–510, New
York, NY, USA, 2007. ACM.
[15] William B. Cavnar and John M. Trenkle. N-Gram-Based Text Categorization.
Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and
Information Retrieval, pages 161–175, 1994.
[16] Soumen Chakrabarti, Byron Dom, and Piotr Indyk. Enhanced Hypertext Cate-
gorization Using Hyperlinks. In Proceedings of the 1998 ACM SIGMOD Interna-
tional Conference on Management of Data, SIGMOD ’98, pages 307–318, New
York, NY, USA, 1998. ACM.
[17] Hsinchun Chen. Collaborative Systems: Solving the Vocabulary Problem. IEEE
Computer, Special Issue on Computer Supported Cooperative Work (CSCW),
27(5):58–66, 1994.
[18] Ed H. Chi and Todd Mytkowicz. Understanding the Efficiency of Social Tag-
ging Systems Using Information Theory. In Proceedings of the Nineteenth ACM
Conference on Hypertext and Hypermedia, HT ’08, pages 81–88, New York, NY,
USA, 2008. ACM.
[19] Lydia B. Chilton, John J. Horton, Robert C. Miller, and Shiri Azenkot. Task
Search in a Human Computation Market. In Proceedings of the ACM SIGKDD
Workshop on Human Computation, HCOMP ’10, pages 1–9, New York, NY,
USA, 2010. ACM.
[20] Maarten Clements, Arjen P. de Vries, and Marcel J.T. Reinders. Detecting Syn-
onyms in Social Tagging Systems to Improve Content Retrieval. In Proceedings
of the 31st Annual International ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, SIGIR ’08, pages 739–740, New York, NY,
USA, 2008. ACM.
[21] Nick Craswell, David Hawking, and Stephen Robertson. Effective Site Finding
Using Link Anchor Information. In Proceedings of the 24th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR ’01, pages 250–257, New York, NY, USA, 2001. ACM.
[22] Anirban Dasgupta, Arpita Ghosh, Ravi Kumar, Christopher Olston, Sandeep
Pandey, and Andrew Tomkins. The Discoverability of the Web. In Proceedings
of the 16th International Conference on World Wide Web, WWW ’07, pages
421–430, New York, NY, USA, 2007. ACM.
[23] Christine DeZelar-Tiedman. Doing the LibraryThing in an Academic Library
Catalog. Metadata for Semantic and Social Applications, page 211.
[24] Mary Dykstra. LC Subject Headings Disguised as a Thesaurus. Library Journal,
113(4):42–46, 1988.
[25] Nadav Eiron and Kevin S. McCurley. Analysis of Anchor Text for Web Search.
In Proceedings of the 26th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR ’03, pages 459–460,
New York, NY, USA, 2003. ACM.
[26] Nadav Eiron, Kevin S. McCurley, and John A. Tomlin. Ranking the Web Fron-
tier. In Proceedings of the 13th International Conference on World Wide Web,
WWW ’04, pages 309–318, New York, NY, USA, 2004. ACM.
[27] Donghui Feng. Talk: Tackling ATTi Business Problems Using Mechanical Turk.
Palo Alto Mechanical Turk Meetup, 2010.
[28] George W. Furnas, Thomas K. Landauer, Louis M. Gomez, and Susan T. Du-
mais. The Vocabulary Problem in Human-System Communication. Communi-
cations of the ACM, 30:964–971, November 1987.
[29] Evgeniy Gabrilovich and Shaul Markovitch. Text Categorization with Many
Redundant Features: Using Aggressive Feature Selection to Make SVMs Com-
petitive with C4.5. In Proceedings of the Twenty-first International Conference
on Machine Learning, ICML ’04, pages 41–, New York, NY, USA, 2004. ACM.
[30] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness
Using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th
International Joint Conference on Artifical Intelligence, pages 1606–1611, San
Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[31] Scott A. Golder and Bernardo A. Huberman. Usage Patterns of Collaborative
Tagging Systems. Journal of Information Science, 32:198–208, April 2006.
[32] Taher H. Haveliwala, Aristides Gionis, Dan Klein, and Piotr Indyk. Evaluating
Strategies for Similarity Search on the Web. In Proceedings of the 11th Inter-
national Conference on the World Wide Web, WWW ’02, pages 432–442, New
York, NY, USA, 2002. ACM.
[33] Paul Heymann and Hector Garcia-Molina. Contrasting Controlled Vocabulary
and Tagging: Experts Choose the Right Names to Label the Wrong Things. In
WSDM ‘09 Late Breaking Results.
[34] Paul Heymann and Hector Garcia-Molina. Collaborative Creation of Communal
Hierarchical Taxonomies in Social Tagging Systems. 2006.
[35] Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina. Fighting Spam
on Social Web Sites: A Survey of Approaches and Future Challenges. IEEE
Internet Computing, 11:36–45, November 2007.
[36] Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina. Can Social Book-
marking Improve Web Search? In Proceedings of the International Conference
on Web Search and Web Data Mining, WSDM ’08, pages 195–206, New York,
NY, USA, 2008. ACM.
[37] Paul Heymann, Andreas Paepcke, and Hector Garcia-Molina. Tagging Human
Knowledge. In Proceedings of the Third ACM International Conference on Web
Search and Data Mining, WSDM ’10, pages 51–60, New York, NY, USA, 2010.
ACM.
[38] Paul Heymann, Daniel Ramage, and Hector Garcia-Molina. Social Tag Predic-
tion. In Proceedings of the 31st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR ’08, pages 531–538,
New York, NY, USA, 2008. ACM.
[39] H. Hofmann and M. Theus. Interactive Graphics for Visualizing Conditional
Distributions. Unpublished Manuscript, 2005.
[40] Jürgen Hummel. Linked Bar Charts: Analysing Categorical Data Graphically.
Computational Statistics, 11(1):23–34, 1996.
[41] Panos Ipeirotis. Mechanical Turk: The Demographics. http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html.
[42] Panos Ipeirotis. The New Demographics of Mechanical Turk. http://behind-the-enemy-lines.blogspot.com/2010/03/new-demographics-of-mechanical-turk.html.
[43] Thorsten Joachims. Making Large-scale Support Vector Machine Learning Prac-
tical, pages 169–184. MIT Press, Cambridge, MA, USA, 1999.
[44] Thorsten Joachims. A Support Vector Method for Multivariate Performance
Measures. In Proceedings of the 22nd International Conference on Machine
Learning, ICML ’05, pages 377–384, New York, NY, USA, 2005. ACM.
[45] Karen S. Jones and C. J. van Rijsbergen. Information Retrieval Test Collections.
Journal of Documentation, 32(1):59–75, 1976.
[46] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing User Studies with
Mechanical Turk. In Proceedings of the Twenty-sixth Annual SIGCHI Conference
on Human Factors in Computing Systems, CHI ’08, pages 453–456, New York,
NY, USA, 2008. ACM.
[47] Georgia Koutrika, Frans Adjie Effendi, Zoltan Gyongyi, Paul Heymann, and
Hector Garcia-Molina. Combating Spam in Tagging Systems. In Proceedings
of the 3rd International Workshop on Adversarial Information Retrieval on the
Web, AIRWeb ’07, pages 57–64, New York, NY, USA, 2007. ACM.
[48] Georgia Koutrika, Frans Adjie Effendi, Zoltan Gyongyi, Paul Heymann, and
Hector Garcia-Molina. Combating Spam in Tagging Systems: An Evaluation.
ACM Transactions on the Web (TWEB), 2:22:1–22:34, October 2008.
[49] Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. TurKit:
Tools for Iterative Tasks on Mechanical Turk. In Proceedings of the ACM
SIGKDD Workshop on Human Computation, HCOMP ’09, pages 29–30, New
York, NY, USA, 2009. ACM.
[50] Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. Exploring
Iterative and Parallel Human Computation Processes. In Proceedings of the ACM
SIGKDD Workshop on Human Computation, HCOMP ’10, pages 68–76, New
York, NY, USA, 2010. ACM.
[51] Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. TurKit:
Human Computation Algorithms on Mechanical Turk. In Proceedings of the 23rd
Annual ACM Symposium on User Interface Software and Technology, UIST ’10,
pages 57–66, New York, NY, USA, 2010. ACM.
[52] Thomas Mann. Library Research Models: A Guide to Classification, Cataloging,
and Computers. Oxford University Press, USA, 1993.
[53] Thomas Mann. The Oxford Guide to Library Research. Oxford University Press,
USA, 2005.
[54] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction
to Information Retrieval. Cambridge University Press, New York, NY, USA,
2008.
[55] Cameron Marlow, Mor Naaman, Danah Boyd, and Marc Davis. HT06, tagging
paper, taxonomy, Flickr, academic article, to read. In Proceedings of the Sev-
enteenth Conference on Hypertext and Hypermedia, HYPERTEXT ’06, pages
31–40, New York, NY, USA, 2006. ACM.
[56] Gilad Mishne. AutoTag: A Collaborative Approach to Automated Tag Assign-
ment for Weblog Posts. In Proceedings of the 15th International Conference on
World Wide Web, WWW ’06, pages 953–954, New York, NY, USA, 2006. ACM.
[57] Steffen Oldenburg, Martin Garbe, and Clemens Cap. Similarity Cross-analysis
of Tag / Co-tag Spaces in Social Classification Systems. In Proceedings of the
2008 ACM Workshop on Search in Social Media, SSM ’08, pages 11–18, New
York, NY, USA, 2008. ACM.
[58] Greg Pass, Abdur Chowdhury, and Cayley Torgeson. A Picture of Search. In
Proceedings of the 1st International Conference on Scalable Information Systems,
InfoScale ’06, New York, NY, USA, 2006. ACM.
[59] Gregory Piatetsky-Shapiro. Discovery, Analysis, and Presentation of Strong
Rules. In G. Piatetsky-Shapiro and W.J. Frawley, editors, Knowledge Discovery
in Databases. AAAI/MIT Press, Cambridge, MA, 1991.
[60] Daniel Ramage, Paul Heymann, Christopher D. Manning, and Hector Garcia-
Molina. Clustering the Tagged Web. In Proceedings of the Second ACM Inter-
national Conference on Web Search and Data Mining, WSDM ’09, pages 54–63,
New York, NY, USA, 2009. ACM.
[61] Joel Ross, Lilly Irani, M. Six Silberman, Andrew Zaldivar, and Bill Tomlinson.
Who are the Crowdworkers?: Shifting Demographics in Mechanical Turk. In
Proceedings of the 28th of the International Conference Extended Abstracts on
Human Factors in Computing Systems, CHI EA ’10, pages 2863–2872, New York,
NY, USA, 2010. ACM.
[62] Christoph Schmitz, Andreas Hotho, Robert Jäschke, and Gerd Stumme. Mining
Association Rules in Folksonomies. IFCS’06.
[63] Eric Schwarzkopf, Dominik Heckmann, Dietmar Dengler, and Alexander Kröner.
Mining the Structure of Tag Spaces for User Modeling. In Workshop on Data
Mining for User Modeling (ICUM’07).
[64] Shilad Sen, Shyong K. Lam, Al Mamunur Rashid, Dan Cosley, Dan Frankowski,
Jeremy Osterhouse, F. Maxwell Harper, and John Riedl. tagging, communities,
vocabulary, evolution. In Proceedings of the 2006 20th Anniversary Conference
on Computer Supported Cooperative Work, CSCW ’06, pages 181–190, New York,
NY, USA, 2006. ACM.
[65] David Sifry. State of the Live Web: April 2007. http://www.sifry.com/stateoftheliveweb/.
[66] Tiffany L. Smith. Cataloging and You: Measuring the Efficacy of a Folksonomy
for Subject Analysis. In Workshop of the ASIST SIG/CR ’07.
[67] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap
and Fast—But is it Good?: Evaluating Non-expert Annotations for Natural
Language Tasks. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, EMNLP ’08, pages 254–263, Morristown, NJ, USA,
2008. Association for Computational Linguistics.
[68] Sanjay Sood, Kristian Hammond, Sara Owsley, and Larry Birnbaum. TagAssist:
Automatic Tag Suggestion for Blog Posts. ICWSM’07.
[69] Alexander Sorokin and David Forsyth. Utility Data Annotation with Amazon
Mechanical Turk. In CVPRW’08.
[70] Luis von Ahn and Laura Dabbish. Designing Games with a Purpose. Commu-
nications of the ACM, 51:58–67, August 2008.
[71] Zhichen Xu, Yun Fu, Jianchang Mao, and Difu Su. Towards the Semantic
Web: Collaborative Tag Suggestions. In Collaborative Web Tagging Workshop
(WWW’06).
[72] Yusuke Yanbe, Adam Jatowt, Satoshi Nakamura, and Katsumi Tanaka. Can
Social Bookmarking Enhance Search in the Web? In Proceedings of the 7th
ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’07, pages 107–
116, New York, NY, USA, 2007. ACM.
[73] Yiming Yang and Jan O. Pedersen. A Comparative Study on Feature Selection in
Text Categorization. In Proceedings of the Fourteenth International Conference
on Machine Learning, ICML ’97, pages 412–420, San Francisco, CA, USA, 1997.
Morgan Kaufmann Publishers Inc.
[74] Yiming Yang, Sean Slattery, and Rayid Ghani. A Study of Approaches to Hy-
pertext Categorization. Journal of Intelligent Information Systems, 18:219–241,
March 2002.