measuring the new wikipedia community (pydata sv 2013)

Measuring the New Wikipedia Community PyData 2013 Ryan Faulkner ([email protected]) Wikimedia Foundation

Post on 08-May-2015


DESCRIPTION

Talk given by Ryan Faulkner at PyData Silicon Valley 2013

TRANSCRIPT

Page 1: Measuring the New Wikipedia Community (PyData SV 2013)

Measuring the New Wikipedia Community

PyData 2013

Ryan Faulkner ([email protected])

Wikimedia Foundation

Page 2: Measuring the New Wikipedia Community (PyData SV 2013)

Overview

Introduction

Problem & Motivation

Proposed Solution

User Metrics

A Short Example

Extending the Solution

Using the Tool

Live Demo!!

Page 3: Measuring the New Wikipedia Community (PyData SV 2013)

Introduction

Me: Data Analyst at Wikimedia

Machine Learning @ McGill

Fundraising - A/B testing

Editor Experiments - increasing the number of active editors

Editor Engagement Experiments (E3) team @ the Wikimedia Foundation

Micro-feature experimentation

Page 4: Measuring the New Wikipedia Community (PyData SV 2013)

Problem

What's wrong with Wikipedia?

Page 5: Measuring the New Wikipedia Community (PyData SV 2013)

Problem - Editor Decline

http://strategy.wikimedia.org/wiki/Editor_Trends_Study

Page 6: Measuring the New Wikipedia Community (PyData SV 2013)

Problem - Approach

Can we stimulate the community of users to become more numerous and productive?

○ Focus on new users
■ Encourage contribution, make it easier

○ Lower the threshold for account creation
■ Bring more people in

○ Rapid experimentation on features that retain more users and stimulate increased participation
■ This will help us determine what works at lower cost

Page 7: Measuring the New Wikipedia Community (PyData SV 2013)

Problem - Evaluation

○ Data Consistency

■ Anomaly Detection

■ Auto-correlation (seasonality)

○ "A/B" testing

■ Hypothesis testing - Student's t, chi-square

■ Linear / Logistic regression

○ Multivariate testing

■ Analysis of variance
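As a rough illustration of the A/B evaluation step, the sketch below computes a pooled two-sample Student's t statistic using only the standard library. The function name and the sample edit counts are invented for illustration; the talk does not show the team's actual analysis code, and a real analysis would likely rely on a statistics package to obtain p-values.

```python
import math
import statistics


def student_t(control, test):
    """Pooled two-sample Student's t statistic (assumes equal variances)."""
    n_c, n_t = len(control), len(test)
    var_c = statistics.variance(control)  # sample variance (n - 1 denominator)
    var_t = statistics.variance(test)
    pooled_var = ((n_c - 1) * var_c + (n_t - 1) * var_t) / (n_c + n_t - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_c + 1 / n_t))
    return (statistics.mean(test) - statistics.mean(control)) / standard_error


# Hypothetical per-user edit counts for a control group and a test group.
control_edits = [1, 2, 3]
test_edits = [2, 3, 4]
t_statistic = student_t(control_edits, test_edits)
```

The statistic is then compared against a t distribution with n_c + n_t - 2 degrees of freedom.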

Page 8: Measuring the New Wikipedia Community (PyData SV 2013)

Problem - What we need

Currently a lot of the work around analysis is done manually and is a large drain on resources:

○ Faster Data gathering

○ Knowing what we're logging and measuring & faster ETL

○ Faster Analysis

○ Broadening Service and iterating on results

Page 9: Measuring the New Wikipedia Community (PyData SV 2013)

Problem - What we need

Build better infrastructure around how we interpret and analyze our data.

○ Determine what to measure
■ Rigorously define relevant metrics

○ Expose the metrics from our data store
■ Python is great for writing code quickly to handle tasks with data
■ Library support for data analysis (pandas, numpy)

Page 10: Measuring the New Wikipedia Community (PyData SV 2013)

Solution

The tools to build.

Page 11: Measuring the New Wikipedia Community (PyData SV 2013)

Solution - Proposed

We need to measure User Behaviour

"User Metrics" & "UMAPI"

User Metrics & UMAPI

Python implementation for gathering data from MediaWiki data stores, producing well-defined metrics, and facilitating subsequent modelling and analysis. This includes an interface for making different types of requests and returning standard responses.

Page 12: Measuring the New Wikipedia Community (PyData SV 2013)

Solution - Why Bother

What exactly do we gain by building these classes? Why not just query the database?

1. Reproducibility & Standardization

2. Extensibility

3. Concise definition

4. Faster turnaround

a. Multiprocessing to optimize metrics generation (e.g. revert rate on 100K users: via MySQL = 24 hrs, via User Metrics < 10 mins)
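The speed-up from multiprocessing comes from splitting the user list into chunks and computing each chunk in its own process. The following is only a minimal sketch of that idea: `chunk`, `revision_counts`, `run_parallel`, and the in-memory `FAKE_REVISIONS` table are all hypothetical stand-ins, not code from User Metrics, where each chunk would instead query the data store.

```python
from multiprocessing import Pool

# Hypothetical stand-in for the data store: user ID -> list of revision IDs.
FAKE_REVISIONS = {1: ["r1", "r2"], 2: ["r3"], 3: []}


def chunk(user_ids, size):
    """Split a list of user IDs into consecutive chunks of at most `size`."""
    return [user_ids[i:i + size] for i in range(0, len(user_ids), size)]


def revision_counts(user_ids):
    """Compute one (user ID, revision count) row per user in a chunk."""
    return [(uid, len(FAKE_REVISIONS.get(uid, []))) for uid in user_ids]


def run_parallel(user_ids, workers=4, size=2):
    """Process chunks in separate worker processes and flatten the results."""
    with Pool(processes=workers) as pool:
        parts = pool.map(revision_counts, chunk(user_ids, size))
    return [row for part in parts for row in part]
```

Calling `run_parallel([1, 2, 3])` from a script's `if __name__ == "__main__":` block would return the same rows as `revision_counts([1, 2, 3])`, computed across worker processes.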

Page 13: Measuring the New Wikipedia Community (PyData SV 2013)

Solution - Why Python?

Why not C++, Java, or PHP?

1. Speed of development

2. Simplify the code base & easy extensibility

a. more "Scientist Friendly"

3. Good support for data processing

4. Better integration for downstream data analysis

5. The way that metrics work lends itself to "Pythonic" constructs: list comprehensions, decorator patterns, duck typing, RESTful API.

Page 14: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics

How do we form a picture about what happens on Wikipedia?

Page 15: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics - User activity

Events (not exhaustive):

■ Registration

■ Making an edit

■ Contributions across namespaces

■ Reverting edits

■ Blocking

Page 16: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics - What do we want to know about users?

○ How much do they contribute?

○ How often do they contribute?

○ Potential vandals. Do they go on to be reverted, blocked, banned?

Page 17: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics - Metrics Definitions

https://meta.wikimedia.org/wiki/Research:Metrics

Retention Metrics

Survival(t): Boolean measure of an editor surviving beyond t

Threshold(t,n): Boolean measure of an editor reaching activity threshold n by time t

Live Account(t): Boolean measure of whether the new user clicked the edit button

Volume Metrics

Edit Rate: Float result of user's rate of contribution

Content: Integer bytes added by revision, and edit count

Sessions: Average session length (future)

Time to Threshold: Time to reach a threshold (e.g. first edit)
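Two of the volume metrics can be sketched directly from a list of edit timestamps. This is an illustrative reading of the definitions, not the User Metrics implementation: the function names, the per-hour unit for edit rate, and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta


def edit_rate(edit_times, start, end):
    """Edits per hour made within the half-open window [start, end)."""
    hours = (end - start).total_seconds() / 3600
    edits_in_window = [ts for ts in edit_times if start <= ts < end]
    return len(edits_in_window) / hours


def time_to_threshold(registration, edit_times, n=1):
    """Time from registration to the n-th edit, or None if never reached."""
    ordered = sorted(edit_times)
    if len(ordered) < n:
        return None
    return ordered[n - 1] - registration


# Hypothetical user: registered on 1 Jan 2013, edited 2 and 30 hours later.
registered = datetime(2013, 1, 1)
edits = [registered + timedelta(hours=2), registered + timedelta(hours=30)]
```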

Page 18: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics - Metrics Definitions

Content Quality

Revert Rate: Float representing the proportion of revisions reverted

Block: Boolean indicating a block event on the user

Content Persistence: Integer indicating how long this user's edits survive (future)

Contribution Type

Namespace of Edits: Integer edit counts in all namespaces

Scale of Change: Float representation of fraction of total page content modified (future)

Page 19: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics - Bytes Added

user revision history (over a predefined period)

Revision k: byte increase

Result: (user ID, bytes_added, bytes_removed, edit count)
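A minimal sketch of the bytes-added computation, assuming each revision is available as a pair of byte lengths (the revision's length and its parent's length). That input shape and the function name are assumptions made for illustration; only the output tuple comes from the slide.

```python
def bytes_added(user_id, revisions):
    """Sum byte increases and decreases over a user's revisions.

    `revisions` is a list of (revision_length, parent_length) pairs.
    Returns (user ID, bytes_added, bytes_removed, edit count).
    """
    added = 0
    removed = 0
    for revision_length, parent_length in revisions:
        delta = revision_length - parent_length
        if delta >= 0:
            added += delta
        else:
            removed += -delta
    return (user_id, added, removed, len(revisions))


# Hypothetical history: +20 bytes, -30 bytes, and one no-op edit.
result = bytes_added(42, [(120, 100), (90, 120), (90, 90)])
```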

Page 20: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics - Threshold

user revision history (over a predefined period)

registration

Events since registration up to time "t"

if len(event_list) >= n:
    threshold_reached = True
else:
    threshold_reached = False

(user ID, threshold_reached={0,1})
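The threshold check above can be wrapped into a small self-contained function. This is a sketch under assumptions: "t" is taken to be in hours since registration (as on the metric periods slide), and the function name and argument layout are hypothetical.

```python
from datetime import datetime, timedelta


def threshold(user_id, registration, event_times, t_hours, n):
    """Return (user ID, 1) if the user has at least n events within
    t_hours of registration, else (user ID, 0)."""
    cutoff = registration + timedelta(hours=t_hours)
    event_list = [ts for ts in event_times if registration <= ts <= cutoff]
    return (user_id, int(len(event_list) >= n))


# Hypothetical user with edits 1 hour and 30 hours after registration.
registered = datetime(2013, 1, 1)
events = [registered + timedelta(hours=1), registered + timedelta(hours=30)]
```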

Page 21: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics - Revert Rate

user revision history (over a predefined period)

for each revision, look at the page history (past revisions and future revisions)

if checksum i == checksum k: # reverted!

(user ID, revert_rate, total_revisions)
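One way to read the checksum comparison: a revision counts as reverted when some later revision of the page has the same checksum as some earlier one, meaning the page was restored to a prior state. The sketch below follows that reading. The look-ahead and look-back window sizes, the function names, and the input format are assumptions, not details given in the talk.

```python
def is_reverted(page_checksums, index, look_back=15, look_ahead=15):
    """True if a revision after `index` matches a checksum from before it.

    `page_checksums` is the page history as a list of content checksums,
    oldest first; `index` is the position of the user's revision.
    """
    past = set(page_checksums[max(0, index - look_back):index])
    future = page_checksums[index + 1:index + 1 + look_ahead]
    return any(checksum in past for checksum in future)


def revert_rate(user_id, user_revisions):
    """`user_revisions` is a list of (page_checksums, index) pairs.

    Returns (user ID, revert_rate, total_revisions).
    """
    total = len(user_revisions)
    reverted = sum(is_reverted(history, i) for history, i in user_revisions)
    rate = reverted / total if total else 0.0
    return (user_id, rate, total)


# Hypothetical histories: the first edit is undone, the second one sticks.
reverted_history = ["a", "b", "a"]
kept_history = ["a", "b", "c"]
result = revert_rate(7, [(reverted_history, 1), (kept_history, 1)])
```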

Page 22: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics - Implementation

https://github.com/wikimedia/user_metrics

1. MySQL & Redis (future) data store

a. All of the backend dependency is abstracted out of metrics classes

2. Python implementation - MySQLdb (SQLAlchemy)

3. Strategy Pattern of parent user metrics class

4. Metrics built mainly from four core MediaWiki tables:

a. revision, user, page, logging

5. Python decorator methods for handling metric aggregation
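To make points 3 and 5 concrete, here is a minimal sketch of a parent metric class whose subclasses supply the per-user computation, plus a decorator that turns a plain function over values into an aggregator over a metric. All class and function names are hypothetical and the data is held in memory; the actual classes live in the user_metrics repository and read from MediaWiki tables.

```python
from functools import wraps
from statistics import mean


class UserMetric:
    """Parent class: subclasses provide the per-user strategy in compute()."""

    def process(self, users):
        return [(user, self.compute(user)) for user in users]

    def compute(self, user):
        raise NotImplementedError


class EditCount(UserMetric):
    """Example strategy: number of edits per user (in-memory stand-in)."""

    def __init__(self, edits_by_user):
        self.edits_by_user = edits_by_user

    def compute(self, user):
        return len(self.edits_by_user.get(user, []))


def aggregator(func):
    """Decorator: apply `func` to the values a metric yields for a cohort."""
    @wraps(func)
    def wrapper(metric, users):
        values = [value for _, value in metric.process(users)]
        return func(values)
    return wrapper


@aggregator
def average(values):
    return mean(values)


metric = EditCount({"alice": [1, 2], "bob": [3]})
cohort = ["alice", "bob", "carol"]
```

With this layout a new metric only needs a `compute` method, and any aggregator works with any metric.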

Page 23: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics

Page 24: Measuring the New Wikipedia Community (PyData SV 2013)

A Concrete Example

How can we use this framework?

Page 25: Measuring the New Wikipedia Community (PyData SV 2013)

Example - Post Edit Feedback

What effect does editing feedback (confirmation/gratitude) have on new editors?

Page 26: Measuring the New Wikipedia Community (PyData SV 2013)

Example - Results

Page 27: Measuring the New Wikipedia Community (PyData SV 2013)

An Extended Solution

Turn the data machine into a service.

Page 28: Measuring the New Wikipedia Community (PyData SV 2013)

Editor Metrics go beyond feature experimentation ...

It became clear that...

● We needed a service to let clients generate their own user metrics data sets

● We wanted to add a way for this methodology to extend beyond E3 and potentially WMF

● A force multiplier was necessary to iterate on editor data in more interesting ways (Machine Learning & more sophisticated analyses)

Page 29: Measuring the New Wikipedia Community (PyData SV 2013)

User Metrics API [UMAPI]

Open Source (almost) RESTful API (Flask)

Computes metrics per user (User Metrics)

Combines metrics in different ways depending on request types

HTTP response in JSON with resulting data

Store data internally for reuse

Page 30: Measuring the New Wikipedia Community (PyData SV 2013)

UMAPI

http://metrics.wikimedia.org/

https://github.com/wikimedia/user_metrics

https://github.com/rfaulkner/E3_analysis

https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev

Page 31: Measuring the New Wikipedia Community (PyData SV 2013)

UMAPI - Overview

Service GET requests based on a combination of URL paths + query params

e.g. /cohort/metric?date_start=..&date_end=...&...

Define user "cohorts" on which to operate

The API engine maps each request to a metrics request object (Mediator Pattern), which is handed off to a request manager that builds and runs the request

JSON response
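The mapping from URL path and query parameters to a request can be sketched with the standard library alone. The `/cohorts/<cohort>/<metric>` layout is taken from the demo URLs in this deck; the `parse_request` function and the dictionary it returns are hypothetical, and the real service does this routing inside Flask.

```python
from urllib.parse import parse_qs, urlparse


def parse_request(url):
    """Split a UMAPI-style URL into cohort, metric, and query parameters."""
    parsed = urlparse(url)
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2 or parts[0] != "cohorts":
        raise ValueError("expected a path like /cohorts/<cohort>/<metric>")
    # keep_blank_values keeps bare flags such as "time_series".
    query = parse_qs(parsed.query, keep_blank_values=True)
    return {
        "cohort": parts[1],
        "metric": parts[2] if len(parts) > 2 else None,
        "params": {key: values[-1] for key, values in query.items()},
    }


request = parse_request(
    "http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold"
    "?aggregator=average"
)
```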

Page 32: Measuring the New Wikipedia Community (PyData SV 2013)

UMAPI - Overview

Basic cPickle file cache for responses
○ Can substitute caching system (e.g. memcached)

Reusing request data where it overlaps

Request Types:
○ "Raw" - metrics per user
○ Aggregation over cohorts: mean, sum, median, etc.
○ Time series requests
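A time series request can be thought of as the same metric summed over consecutive fixed-width intervals. The sketch below shows that bucketing for a sum aggregator. The function name, the (timestamp, value) input format, and the choice of hours as the interval unit are assumptions for illustration.

```python
from datetime import datetime, timedelta


def time_series(events, start, end, interval_hours):
    """Sum event values in consecutive buckets covering [start, end).

    `events` is a list of (timestamp, value) pairs.
    Returns a list of (bucket_start, total) pairs.
    """
    step = timedelta(hours=interval_hours)
    totals = []
    bucket_start = start
    while bucket_start < end:
        bucket_end = min(bucket_start + step, end)
        total = sum(value for ts, value in events
                    if bucket_start <= ts < bucket_end)
        totals.append((bucket_start, total))
        bucket_start = bucket_end
    return totals


# Hypothetical bytes-added events over two days.
start = datetime(2012, 1, 1)
events = [
    (start + timedelta(hours=1), 10),
    (start + timedelta(hours=25), 5),
    (start + timedelta(hours=26), 7),
]
series = time_series(events, start, start + timedelta(hours=48), 24)
```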

Page 33: Measuring the New Wikipedia Community (PyData SV 2013)

UMAPI Architecture

[Architecture diagram. Labels: HTTP GET request; JSON response; Apache (mod_wsgi); Flask / App Server; User Metrics API; Listeners: Request Notifications, Request Control, Response Control; Cache; Messaging Queues; Asynchronous Callbacks; Metrics objects - Separate Processes; MediaWiki Slaves]

Page 34: Measuring the New Wikipedia Community (PyData SV 2013)

UMAPI Architecture - Listeners

Request Notifications Callback
○ Handles job management and notifications on job status

Request Controller
○ Queues requests
○ Spawns jobs from metrics objects
○ Coordinates parameters

Response Controller
○ Reconstructs response data
○ Writes to cache

Page 35: Measuring the New Wikipedia Community (PyData SV 2013)

UMAPI - User Cohorts

We will want to consider large groups of users, for instance, a test or control group in some experiment:

Aggregate groups of users
○ lists of user IDs

Cohort registration (under construction)
○ adding new cohorts to the model

Single user endpoint

Boolean expressions over cohorts supported

Page 36: Measuring the New Wikipedia Community (PyData SV 2013)

User Metric Periods

How do we define the periods over which metrics are measured?

Registration
○ Look "t" hours since user registration

User Defined
○ User supplied start and end dates

Conditional Registration
○ Registration as above with condition that registration falls within input
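The three period types can be sketched as functions that each return a (start, end) window for one user. The function names are hypothetical, and the conditional case assumes the condition is that the registration date falls inside a supplied date range, which the slide text does not fully spell out.

```python
from datetime import datetime, timedelta


def registration_period(registration, t_hours):
    """Window covering the first t_hours after the user registered."""
    return (registration, registration + timedelta(hours=t_hours))


def user_defined_period(start, end):
    """Window given directly by the caller's start and end dates."""
    return (start, end)


def conditional_registration_period(registration, t_hours, start, end):
    """Registration window, but only for users registered within
    [start, end] (assumed meaning of the condition); otherwise None."""
    if start <= registration <= end:
        return registration_period(registration, t_hours)
    return None


registered = datetime(2013, 1, 10)
january = (datetime(2013, 1, 1), datetime(2013, 1, 31))
february = (datetime(2013, 2, 1), datetime(2013, 2, 28))
```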

Page 37: Measuring the New Wikipedia Community (PyData SV 2013)

UMAPI - RequestMeta Module

Mediator Pattern to handle passing request data among different portions of the architecture

Abstraction allows for easy filtering and default behaviour of request parameters

Requests can easily be turned into reproducible and unique hashes for caching
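A reproducible request hash can be obtained by serialising the request parameters in a canonical order before hashing, and the hash then serves as the key of a simple pickle file cache. This is a sketch of that idea only: `request_hash`, `FileCache`, the use of SHA-1, and the sample response are assumptions, not the RequestMeta module's actual interface.

```python
import hashlib
import json
import os
import pickle
import tempfile


def request_hash(cohort, metric, params):
    """Hash a request so that equal requests always map to the same key."""
    canonical = json.dumps(
        {"cohort": cohort, "metric": metric, "params": params},
        sort_keys=True,  # parameter order must not change the hash
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class FileCache:
    """Minimal pickle-on-disk cache keyed by request hash."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, key):
        return os.path.join(self.directory, key + ".pkl")

    def get(self, key):
        try:
            with open(self._path(key), "rb") as handle:
                return pickle.load(handle)
        except FileNotFoundError:
            return None

    def put(self, key, response):
        with open(self._path(key), "wb") as handle:
            pickle.dump(response, handle)


cache = FileCache(tempfile.mkdtemp())
key = request_hash("e3_pef1_confirmation", "threshold", {"t": 24, "n": 1})
cache.put(key, {"threshold": 0.25})  # hypothetical response payload
```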

Page 38: Measuring the New Wikipedia Community (PyData SV 2013)

How the Service Works

The user experience with user metrics.

Page 39: Measuring the New Wikipedia Community (PyData SV 2013)

UMAPI - Pipeline

[Pipeline diagram. Labels: Cohort or combo; Raw; Time Series; Aggregator; Params; Aggregator Params; JSON (three outputs)]

Page 40: Measuring the New Wikipedia Community (PyData SV 2013)

UMAPI - Frontend Flow

Page 41: Measuring the New Wikipedia Community (PyData SV 2013)

Job Queue

As you fire off requests the queue tracks what's running:

Page 42: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Bytes Added

Page 43: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Threshold

Page 44: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Edit Rate

Page 45: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Threshold w/ params

Page 46: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Aggregation

Page 47: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Aggregation

Page 48: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Time series

Page 49: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Combining Cohorts

"usertags_meta" - cohort definitions

Page 50: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Combining Cohorts

Two intersecting cohorts:

Page 51: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Combining Cohorts

AND (&)

Page 52: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Combining Cohorts

OR (~)
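If cohorts are treated as sets of user IDs, combining them reduces to set operations. The sketch below uses the operator symbols shown on these slides ("&" for AND, "~" for OR); the `combine` function and the sample cohorts are hypothetical and do not reflect how UMAPI parses cohort expressions.

```python
def combine(cohort_a, cohort_b, operator):
    """Combine two cohorts (sets of user IDs) with a boolean operator."""
    if operator == "&":   # AND: users in both cohorts
        return cohort_a & cohort_b
    if operator == "~":   # OR: users in either cohort
        return cohort_a | cohort_b
    raise ValueError("unsupported operator: " + operator)


# Two hypothetical intersecting cohorts.
test_group = {101, 102, 103}
new_editors = {103, 104}
```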

Page 53: Measuring the New Wikipedia Community (PyData SV 2013)

Response - Single user endpoint

e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000

Page 54: Measuring the New Wikipedia Community (PyData SV 2013)

Looking ahead ...

Connectivity metrics (additional metrics)
○ Graph database? (Neo4j, Gremlin w/ PostgreSQL)
○ User talk and common article edits

Better in-memory modelling
○ python-memcached
○ better reuse of generated data based on request data

Beyond English Wikipedia
○ Implemented!

Page 55: Measuring the New Wikipedia Community (PyData SV 2013)

Looking ahead ...

More sophisticated and robust data modelling

○ Modelling richer data: contribution histories, articles edited, aggregate metrics

○ Classification: Logistic classifiers, Support Vector Machines, Deep Belief Networks, Dimensionality Reduction

○ Modelling revision text - Neural Networks, Hidden Markov Models

Page 56: Measuring the New Wikipedia Community (PyData SV 2013)

DEMO!!

http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold

http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold?aggregator=average

http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate

http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate?aggregator=dist

http://metrics.wikimedia.org/cohorts/ryan_test_2/bytes_added?time_series&start=20120101&end=20130101&aggregator=sum&group=input&interval=720

Page 57: Measuring the New Wikipedia Community (PyData SV 2013)

The End

http://metrics.wikimedia.org/

stat1.wikimedia.org:4000

https://github.com/wikimedia/user_metrics

https://github.com/rfaulkner/E3_analysis

https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev

Questions?