You Are Where You Edit: Locating Wikipedia Contributors Through Edit Histories

Michael D. Lieberman

[email protected]

Jimmy Lin

[email protected]

Department of Computer Science, University of Maryland

College Park, MD 20742 USA

You Are Where You Edit — Michael D. Lieberman, [email protected] ICWSM 2009, San Jose, CA

Geographic Data Mining

• Geography has an increasingly prevalent role in public, communal, and collaborative Web projects
  – Manual contributions (Wikipedia, Flickr, …)
  – Automated annotation (geocoding, geotagging, …)

• Spatial and geographic mining methods are increasingly relevant as annotation standards and automated metadata extraction mature

• Allows geographically informed content retrieval, filtering, ranking, community identification, …
  – Low dimensional space!

• Potential for invasion of privacy

Mining Wikipedia

• Contributors tend to add what they know and self-organize into groups based on interest

• Want to see whether contributors can be further categorized based on their edits to geographic pages
  – Pages that correspond to a physical location in the real world with lat/lon coordinates
  – Termed geopages

• Identify Wikipedia contributors who:
  – Edit geopages in a constrained geographic area
  – Mostly edit one or two “pet” geopages

• Identify reasons for the above patterns

Wikipedia Data

• Complete Wikipedia content freely downloadable in several XML formats

1. Current and previous versions of pages

2. Images, media

3. Edit histories

4. Contributor metadata and user pages

• Only the English Wikipedia dump was included in our analyses
  – Extensible to other languages

Wikipedia Content

• Page content written in evolving Wiki markup language

• Content consists of freeform text and structured data
  – HTML and XML
  – Infoboxes, templates, categories, images, …

• Geopage content contains a parameterized geographic coordinate template

Geopage Example

{{Infobox Settlement
…
|latd = 37 |latm = 18 |lats = 15 |latNS = N
|longd = 121 |longm = 52 |longs = 22 |longEW = W
…
}}
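As a rough illustration of how the coordinate parameters in such a template could be turned into a decimal lat/lon point, here is a minimal Python sketch. The function names are hypothetical, the regular expression only handles simple `name = value` parameters, and the elided (…) parts of the template are simply left out; it is not the extraction method used in the talk.

```python
import re

def parse_coord_params(template_text):
    """Collect simple `|name = value` parameters from template markup."""
    params = {}
    for name, value in re.findall(r"\|\s*(\w+)\s*=\s*([^|}\n]*)", template_text):
        params[name] = value.strip()
    return params

def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = float(degrees) + float(minutes) / 60.0 + float(seconds) / 3600.0
    # South latitudes and west longitudes are negative by convention.
    return -value if hemisphere.upper() in ("S", "W") else value

def geopage_coordinates(template_text):
    """Return (lat, lon) in decimal degrees for an Infobox-style coordinate template."""
    p = parse_coord_params(template_text)
    lat = dms_to_decimal(p["latd"], p.get("latm", 0), p.get("lats", 0), p["latNS"])
    lon = dms_to_decimal(p["longd"], p.get("longm", 0), p.get("longs", 0), p["longEW"])
    return lat, lon

example = (
    "{{Infobox Settlement"
    "|latd = 37 |latm = 18 |lats = 15 |latNS = N"
    "|longd = 121 |longm = 52 |longs = 22 |longEW = W}}"
)
lat, lon = geopage_coordinates(example)
print(round(lat, 3), round(lon, 3))  # 37.304 -121.873
```

The result agrees with the San Jose, CA coordinates (37.304, -121.873) quoted later in the deck.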

Identifying Geopages

• Must process page Wiki markup to identify geographic templates and extract coordinates

1. Wiki markup language continually evolves

2. Geographic templates continually evolve

3. Over 20 distinct template forms at this time for different coordinate systems and feature types

• Shortcut: DBpedia
  – Public ontology derived from Wikipedia, including extracted geographic coordinates
  – Amounts to a primitive gazetteer of geographic entities in Wikipedia

Wikipedia Geo Coverage

Basic Observations

• Vast majority of geopages tagged to the US and Europe

• Possibly reflects the geographic distribution of contributors to the English Wikipedia

Features With Extent

• All geopages are tagged with a single lat/lon point
  – Tradeoff between simplicity and accuracy
  – Examples: Country or state → Center or capital city, Road → Midpoint, River → Source

• Want to distinguish these features, as the tagged point may be geographically distant from other contributor edits

• In Wikipedia, more precise coordinates generally indicate a smaller extent
  – California: (37, -120)
  – San Jose, CA: (37.304, -121.873)
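One way to act on the precision observation is to count how many decimal places a coordinate carries as written and to treat coarse coordinates as a hint that the feature has extent. The sketch below is only an assumed heuristic: the function names and the `min_places` threshold are hypothetical, and the slides do not say how the distinction was actually made.

```python
def decimal_places(coord_text):
    """Number of digits after the decimal point in a coordinate as written."""
    text = coord_text.strip()
    return len(text.split(".", 1)[1]) if "." in text else 0

def likely_has_extent(lat_text, lon_text, min_places=1):
    """Flag a geopage as a probable feature with extent when both coordinates are coarse.

    min_places is an assumed cutoff, not a value taken from the talk.
    """
    return max(decimal_places(lat_text), decimal_places(lon_text)) < min_places

print(likely_has_extent("37", "-120"))          # California -> True
print(likely_has_extent("37.304", "-121.873"))  # San Jose, CA -> False
```

Working on the coordinate text rather than on parsed floats matters here, since trailing zeros and the written precision are lost once the value is converted to a number.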

Wikipedia Edit Histories

• Easily parsed XML format

• Information saved for each edit:

1. Username (or IP address, if anonymous)

2. Timestamp

3. Whether edit is “minor” (spelling, formatting)

• Excluded anonymous edits
  – Not allowed to be marked minor, to avoid abuse
  – Most Wikipedia vandalism perpetrated anonymously

• Also excluded minor edits
  – Geopages tend to have mostly non-minor edits
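The filtering described above (keep only named, non-minor edits) can be sketched in Python as follows. The element names are modeled on the MediaWiki XML export layout (`revision`, `contributor`, `username`, `ip`, `minor`), but the sample is made up and simplified: real dumps put these elements in an XML namespace and carry many more fields, and `named_major_edits` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified page history; real dumps use an XML namespace.
SAMPLE = """
<page>
  <title>San Jose, California</title>
  <revision>
    <timestamp>2008-01-05T10:00:00Z</timestamp>
    <contributor><username>ExampleUser</username></contributor>
  </revision>
  <revision>
    <timestamp>2008-01-06T11:30:00Z</timestamp>
    <contributor><ip>192.0.2.1</ip></contributor>
  </revision>
  <revision>
    <timestamp>2008-01-07T09:15:00Z</timestamp>
    <contributor><username>ExampleUser</username></contributor>
    <minor/>
  </revision>
</page>
"""

def named_major_edits(page_xml):
    """Return (username, page title, timestamp) for each named, non-minor edit."""
    page = ET.fromstring(page_xml.strip())
    title = page.findtext("title")
    edits = []
    for rev in page.iter("revision"):
        username = rev.findtext("contributor/username")
        if username is None:
            continue  # anonymous edit: only an IP address is recorded
        if rev.find("minor") is not None:
            continue  # edit flagged as minor
        edits.append((username, title, rev.findtext("timestamp")))
    return edits

for edit in named_major_edits(SAMPLE):
    print(edit)
```

Of the three sample revisions, only the first survives: the second is anonymous and the third is marked minor.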

Basic Observations

• A considerable number of pages (~330k) are tagged with geographic coordinates

• Named contributors are outnumbered by anonymous ones by about 5 to 1, but are responsible for 2–3 times as many geopage edits

• A nontrivial number of named contributors have made at least one non-minor edit to a geopage (14.6%)

• Most edits to geopages are non-minor edits (58.7%)

Sample Edit Patterns

Mining Contributor Locales

• Intuitively, want to find contributors with:

1. Large number of edits to geopages

2. Geopage edits constrained to a small area

• Select contributors with:

1. At least K edited geopages

2. Area α of the convex hull of edited geopage coordinates smaller than A (termed the edit area)

• We used K = 3 and A = 1 deg² ≈ 70 × 70 mi
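The selection rule can be sketched in Python as below. This is an illustrative sketch under stated assumptions: coordinates are treated as planar (lon, lat) points so that the area comes out in square degrees, the hull uses Andrew's monotone chain, and the names `edit_area` and `is_local` are hypothetical. The slides do not say which hull algorithm was used.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula; returns 0 for fewer than three vertices."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice += x1 * y2 - x2 * y1
    return abs(twice) / 2.0

def edit_area(coords):
    """Area (in square degrees) of the convex hull of a contributor's geopage coordinates."""
    return polygon_area(convex_hull(coords))

def is_local(coords, k=3, max_area=1.0):
    """True if the contributor edited at least k geopages within an edit area below max_area.

    Assumes coords holds one (lon, lat) pair per distinct edited geopage.
    """
    return len(coords) >= k and edit_area(coords) < max_area

# Hypothetical contributor with three nearby geopages.
print(is_local([(-121.9, 37.3), (-122.0, 37.4), (-121.8, 37.5)]))
```

Treating degrees as planar units ignores the shrinking of longitude with latitude, which is consistent with the deg² unit on the slide but is only an approximation of ground area.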

Accounting For Outliers

• Local edit patterns may be muddled by “outlier” edits

• For each contributor, select a fraction F of edited geopages with smallest convex hull area

• Simple approximation scheme:

1. For each geopage P:

a. Sort edited geopages by distance from P

b. Compute convex hull H_P of the nearest fraction F of geopages

2. Select the H_P with smallest area α

• Example: 71 deg² → 10 deg² (5k × 5k mi → 700 × 700 mi)
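The approximation scheme above can be sketched in Python as follows. Several details are assumptions rather than facts from the slides: distances are plain Euclidean distances in degree space, the fraction F is rounded up to a whole number of geopages, and `hull_area` and `min_edit_area` are hypothetical names.

```python
import math

def hull_area(points):
    """Area of the convex hull of 2D points (monotone chain + shoelace formula)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    hull = half_hull(pts) + half_hull(reversed(pts))
    twice = sum(x1 * y2 - x2 * y1
                for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice) / 2.0

def min_edit_area(coords, fraction=0.95):
    """Smallest hull area over the `fraction` of geopages nearest to each geopage P."""
    if not coords:
        return None
    keep = max(1, math.ceil(fraction * len(coords)))  # rounding up is an assumption
    best = None
    for p in coords:
        # Step 1a: sort edited geopages by distance from P.
        nearest = sorted(coords, key=lambda q: math.dist(p, q))[:keep]
        # Step 1b: hull area of the nearest fraction.
        area = hull_area(nearest)
        # Step 2: keep the smallest area seen so far.
        if best is None or area < best:
            best = area
    return best

# Four clustered geopages plus one distant outlier (made-up coordinates).
coords = [(0, 0), (1, 0), (1, 1), (0, 1), (50, 50)]
print(hull_area(coords))           # 50.0 with the outlier included
print(min_edit_area(coords, 0.8))  # 1.0 once the outlier is dropped
```

Each candidate P costs a sort and a hull computation, so the scheme is roughly quadratic in the number of edited geopages per contributor, which is cheap for typical edit counts.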

Contributor Locality

• Computed minimum edit area sizes for F = {95%, 80%}, both (a) with and (b) without features with extent

• 30–35% of contributors have edit areas smaller than 1 deg²

• Over 50% of contributors with fewer than 5 geopage edits are highly local

Pet Geopages

• Want to identify contributors with a preference for editing particular geopages: “pet” geopages
  – Contributors with a large number of edits to a small number of geopages

• For each contributor:
  – Determine the frequencies of the first- and second-most-edited geopages, F1 and F2
  – Select the contributor if F1 or F1+F2 clears a frequency threshold Fmin

• We used Fmin = 0.80
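A minimal Python sketch of this selection rule follows. It assumes that "clears" means greater than or equal to the threshold and that the input is a list with one page title per geopage edit; `has_pet_geopages` and the sample titles are hypothetical.

```python
from collections import Counter

def has_pet_geopages(edited_titles, f_min=0.80):
    """True if the top one or two geopages account for at least f_min of the edits.

    edited_titles: one entry per geopage edit made by a single contributor.
    """
    if not edited_titles:
        return False
    total = len(edited_titles)
    top_two = [count / total for _, count in Counter(edited_titles).most_common(2)]
    f1 = top_two[0]
    f2 = top_two[1] if len(top_two) > 1 else 0.0
    return f1 >= f_min or f1 + f2 >= f_min

# Made-up edit histories.
focused = ["San Jose"] * 7 + ["Santa Clara"] * 2 + ["Paris"]
spread = ["A", "B", "C", "D", "E"]
print(has_pet_geopages(focused))  # True: F1 + F2 is about 0.9
print(has_pet_geopages(spread))   # False: F1 + F2 = 0.4
```

Since F2 is never negative, the F1 test is implied by the F1+F2 test; both are kept here only to mirror the wording of the slide.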

Pet Geopages

• Statistics for users with:

a) 5–20 edits (~93k)

b) over 20 edits (~28k)

• Over 50% of contributors with 5–20 edits, and 25% of contributors with over 20 edits, have over 80% of geopage edits confined to two geopages

Reasons for Tight Edit Areas

• Randomly selected 100 contributors with at least 10 edits to geopages and small edit areas

• Concurrently examined contributors’ user pages and the set of edited geopages to determine an interest

• Contributors with small edit areas tend to have been born in, or to be living in, the region defined by their edit areas

Future Work

• Using alternate measures to determine the significance of geopage edits, such as:
  – Page size before and after edit
  – Whether edit was undone by another editor

• Characterizing contributors’ geographic interests from supposedly minor edits

• Tracking evolving geographic interests over time

• Mining other geographical data sources, such as:
  – Flickr
  – Twitter

• Finding similar contributors based on geographic interest

Summary

• A sizable number of Wikipedia contributors exhibit constrained geographic focus

• Mined geographic focus can be applied to enhance user experience in collaborative, Web-based systems

• Users should be aware that information about them and their interests can be gleaned easily and perhaps unexpectedly
  – Dangerous if combined with other online databases

Thanks!