Measuring the New Wikipedia Community (PyData SV 2013)


DESCRIPTION

Talk given by Ryan Faulkner at PyData Silicon Valley 2013

TRANSCRIPT

Measuring the New Wikipedia Community

PyData 2013

Ryan Faulkner (rfaulkner@wikimedia.org)

Wikimedia Foundation

Overview

Introduction

Problem & Motivation

Proposed Solution

User Metrics

A Short Example

Extending the Solution

Using the Tool

Live Demo!!

Introduction

Me: Data Analyst at Wikimedia

Machine Learning @ McGill

Fundraising - A/B testing

Editor Experiments - increasing the number of active editors

Editor Engagement Experiments (E3) team @ the Wikimedia Foundation

Micro-feature experimentation

Problem

What's wrong with Wikipedia?

Problem - Editor Decline

http://strategy.wikimedia.org/wiki/Editor_Trends_Study

Problem - Approach

Can we stimulate the community of users to become more numerous and productive?

○ Focus on new users

■ Encourage contribution, make it easier

○ Lower the threshold for account creation

■ Bring more people in.

○ Rapid experimentation on features that retain more users and stimulate increased participation.

■ This will help us determine what works at lower cost

Problem - Evaluation

○ Data Consistency

■ Anomaly Detection

■ Auto-correlation (seasonality)

○ "A/B" testing

■ Hypothesis testing - Student's t, chi-square

■ Linear / Logistic regression

○ Multivariate testing

■ Analysis of variance

Problem - What we need

Currently a lot of the analysis work is done manually and is a large drain on resources:

○ Faster Data gathering

○ Knowing what we're logging and measuring &

faster ETL

○ Faster Analysis

○ Broadening Service and iterating on results

Problem - What we need

Build better infrastructure around how we interpret and analyze our data.

○ Determine what to measure

■ Rigorously define relevant metrics

○ Expose the metrics from our data store

■ Python is great for writing code quickly to handle tasks with data

■ Library support for data analysis (pandas, numpy)

Solution

The tools to build.

Solution - Proposed

We need to measure user behaviour: "User Metrics" & "UMAPI"

User Metrics & UMAPI

A Python implementation for gathering data from MediaWiki data stores, producing well-defined metrics, and facilitating subsequent modelling and analysis. This includes an interface for making different types of requests and returning standard responses.

Solution - Why Bother

What exactly do we gain by building these classes? Why not just query the database?

1. Reproducibility & standardization

2. Extensibility

3. Concise definitions

4. Faster turnaround

a. Multiprocessing to optimize metrics generation (e.g. revert rate on 100K users: via MySQL = 24 hrs, via User Metrics < 10 mins)

Solution - Why Python?

Why not C++, Java, or PHP?

1. Speed of development

2. Simpler code base & easy extensibility

a. More "scientist friendly"

3. Good support for data processing

4. Better integration for downstream data analysis

5. The way metrics work lends itself to "Pythonic" artifacts: list comprehensions, decorator patterns, duck typing, a RESTful API.

User Metrics

How do we form a picture about what happens on Wikipedia?

User Metrics - User activity

Events (not exhaustive):

■ Registration

■ Making an edit

■ Contributions by namespace

■ Reverting edits

■ Blocking

User Metrics - What do we want to know about users?

○ How much do they contribute?

○ How often do they contribute?

○ Potential vandals: do they go on to be reverted, blocked, banned?

User Metrics - Metrics Definitions

https://meta.wikimedia.org/wiki/Research:Metrics

Retention Metrics

Survival(t) Boolean measure of an editor surviving beyond t

Threshold(t,n) Boolean measure of an editor reaching activity threshold n by time t

Live Account(t) Boolean measure of whether the new user clicked the edit button

Volume Metrics

Edit Rate Float result of user's rate of contribution.

Content Integer bytes added by revision and edit count.

Sessions Average session length (future)

Time to Threshold Time to reach a threshold (e.g. first edit)

User Metrics - Metrics Definitions

Content Quality

Revert Rate Float representing the proportion of revisions reverted.

Block Boolean indicating a block event on the user.

Content Persistence Integer indicating how long this user's edits survive (future)

Contribution Type

Namespace of Edits Integer edit counts in all namespaces.

Scale of Change Float representation of fraction of total page content modified (future)

User Metrics - Bytes Added

User revision history (over a predefined period); for each revision k, measure the byte increase.

Output: (user ID, bytes_added, bytes_removed, edit count)
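The Bytes Added computation can be sketched in a few lines. This is an illustrative reimplementation rather than the user_metrics code; the (timestamp, parent_len, new_len) tuple shape is an assumption for the example.

```python
# Illustrative sketch of the Bytes Added metric: sum positive and negative
# byte deltas over a user's revisions within a predefined period. The
# (timestamp, parent_len, new_len) tuple shape is assumed, not the real schema.
from datetime import datetime

def bytes_added(revisions, start, end):
    """Return (bytes_added, bytes_removed, edit_count) for one user."""
    added = removed = edit_count = 0
    for ts, parent_len, new_len in revisions:
        if not (start <= ts < end):
            continue  # revision falls outside the measurement period
        delta = new_len - parent_len
        if delta >= 0:
            added += delta
        else:
            removed += -delta
        edit_count += 1
    return added, removed, edit_count

revs = [
    (datetime(2013, 1, 5), 100, 250),  # +150 bytes
    (datetime(2013, 1, 9), 250, 200),  # -50 bytes
]
print(bytes_added(revs, datetime(2013, 1, 1), datetime(2013, 2, 1)))
# (150, 50, 2)
```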

User Metrics - Threshold

User revision history (over a predefined period): events since registration up to time "t".

if len(event_list) >= n:
    threshold_reached = True
else:
    threshold_reached = False

Output: (user ID, threshold_reached={0,1})
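Expanding the pseudocode into a runnable form, a minimal sketch of Threshold(t, n) counts edit events within "t" hours of registration and compares the count to "n". The function and argument names are illustrative, not the user_metrics API.

```python
# Minimal sketch of the Threshold(t, n) metric: did the user make at least
# n edits within t hours of registration? Names are illustrative.
from datetime import datetime, timedelta

def threshold(registration, edit_timestamps, t_hours, n):
    """Return 1 if the user made >= n edits within t_hours of registration."""
    cutoff = registration + timedelta(hours=t_hours)
    event_list = [ts for ts in edit_timestamps if registration <= ts < cutoff]
    return 1 if len(event_list) >= n else 0

reg = datetime(2013, 3, 1, 12, 0)
edits = [reg + timedelta(hours=2), reg + timedelta(hours=30)]
print(threshold(reg, edits, t_hours=24, n=1))  # 1 (one edit inside 24h)
print(threshold(reg, edits, t_hours=24, n=2))  # 0 (second edit is at 30h)
```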

User Metrics - Revert Rate

User revision history (over a predefined period); for each revision, look at the page history and compare checksums of past revisions (i) against future revisions (k):

if checksum i == checksum k:
    # reverted!

Output: (user ID, revert_rate, total_revisions)
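The checksum comparison above amounts to identity-revert detection: an edit is treated as reverted when a later revision's content checksum matches an earlier one, i.e. the page was restored to a pre-edit state. A hedged sketch, with illustrative function names and toy revision texts:

```python
# Illustrative identity-revert detection: an edit counts as reverted when
# a checksum from before the edit reappears in the page history after it.
import hashlib

def sha1(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def is_reverted(past_texts, future_texts):
    """True if any pre-edit page state reappears after the edit."""
    past = {sha1(t) for t in past_texts}
    future = {sha1(t) for t in future_texts}
    return bool(past & future)

def revert_rate(user_revisions):
    """user_revisions: list of (past_texts, future_texts) pairs, one per revision."""
    total = len(user_revisions)
    reverted = sum(is_reverted(p, f) for p, f in user_revisions)
    return reverted / total if total else 0.0

# One edit that was reverted (page restored to "old text"), one that was not.
revs = [
    (["old text"], ["new text", "old text"]),
    (["base"], ["improved"]),
]
print(revert_rate(revs))  # 0.5
```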

User Metrics - Implementation

https://github.com/wikimedia/user_metrics

1. MySQL & Redis (future) data store

a. All backend dependencies are abstracted out of the metrics classes

2. Python implementation - MySQLdb (SQLAlchemy)

3. Strategy pattern for the parent user metrics class

4. Metrics built mainly from four core MediaWiki tables:

a. revision, user, page, logging

5. Python decorator methods for handling metric aggregation
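One way decorator-based aggregation can work is a decorator that registers an aggregator function under a name, so the API layer can look it up from a request parameter (e.g. ?aggregator=average). This sketch is illustrative; the registry and function names are not the actual user_metrics internals.

```python
# Illustrative decorator-based aggregator registry: each decorated function
# collapses per-user metric values and is looked up by name at request time.
AGGREGATORS = {}

def aggregator(name):
    """Register an aggregation function under a request-visible name."""
    def decorator(func):
        AGGREGATORS[name] = func
        return func
    return decorator

@aggregator("sum")
def agg_sum(values):
    return sum(values)

@aggregator("average")
def agg_average(values):
    return sum(values) / len(values) if values else 0.0

# Per-user bytes_added values collapsed by the requested aggregator.
per_user = [150, 40, 10]
print(AGGREGATORS["sum"](per_user))  # 200
```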

User Metrics

A Concrete Example

How can we use this framework?

Example - Post Edit Feedback

What effect does editing feedback (confirmation/gratitude) have on new editors?

Example - Results

An Extended Solution

Turn the data machine into a service.

Editor Metrics go beyond feature experimentation ...

It became clear that...

● We needed a service to let clients generate their own user metrics data sets

● We wanted to add a way for this methodology to extend beyond E3 and potentially WMF

● A force multiplier was necessary to iterate on editor data in more interesting ways (Machine Learning & more sophisticated analyses)

User Metrics API [UMAPI]

Open source (almost) RESTful API (Flask)

Computes metrics per user (User Metrics)

Combines metrics in different ways depending on request types

Returns HTTP responses in JSON with the resulting data

Stores data internally for reuse

UMAPI

http://metrics.wikimedia.org/

https://github.com/wikimedia/user_metrics

https://github.com/rfaulkner/E3_analysis

https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev

UMAPI - Overview

Serves GET requests based on a combination of URL paths + query params

e.g. /cohort/metric?date_start=..&date_end=...&...

Define user "cohorts" on which to operate

The API engine maps the request to a metrics request object (mediator pattern), which is handed off to a request manager that builds and runs the request

JSON response
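The request shape above (cohort and metric in the URL path, everything else as query params) can be shown by reconstructing one of the demo URLs from later in the talk. The build_request helper is hypothetical, purely client-side illustration:

```python
# Hypothetical client-side helper illustrating the UMAPI request shape:
# /cohorts/<cohort>/<metric>?<query params>. Not part of the actual service.
import urllib.parse

def build_request(base, cohort, metric, **params):
    query = urllib.parse.urlencode(sorted(params.items()))
    return "{0}/cohorts/{1}/{2}?{3}".format(base, cohort, metric, query)

url = build_request(
    "http://metrics.wikimedia.org",
    "e3_pef1_confirmation",
    "threshold",
    aggregator="average",
)
print(url)
# http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold?aggregator=average
```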

UMAPI - Overview

Basic cPickle file cache for responses

Can substitute another caching system (e.g. memcached)

Reusing request data where it overlaps

Request types:

"Raw" - metrics per user

Aggregation over cohorts: mean, sum, median, etc.

Time series requests

UMAPI Architecture

HTTP GET request in, JSON response out, via Apache / mod_wsgi and the Flask app server

Request notifications go to a listener handling request control, response control, and the cache

The User Metrics API reads from MediaWiki slaves

Messaging queues coordinate metrics objects running in separate processes with asynchronous callbacks

UMAPI Architecture - Listeners

Request Notifications Callback: manages and issues notifications on job status

Request Controller: queues requests, spawns jobs from metrics objects, coordinates parameters

Response Controller: reconstructs response data, writes to cache

UMAPI - User Cohorts

We will want to consider large groups of users, for instance, a test or control group in some experiment:

Aggregate groups of users: lists of user IDs

Cohort registration (under construction): adding new cohorts to the model

Single user endpoint

Boolean expressions over cohorts supported

User Metric Periods

How do we define the periods over which metrics are measured?

Registration: look "t" hours from user registration

User defined: user-supplied start and end dates

Conditional registration: as above, with the condition that registration falls within the input range

UMAPI - RequestMeta Module

Mediator Pattern to handle passing request data among different portions of the architecture

Abstraction allows for easy filtering and default behaviour of request parameters

Requests can easily be turned into reproducible and unique hashes for caching
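The hashing idea can be sketched as follows: normalize the request parameters, derive a reproducible key, and use it for a simple pickle-file response cache (the talk mentions a basic cPickle cache; all names here are illustrative, not the RequestMeta internals).

```python
# Illustrative request hashing + file cache: the same request parameters,
# in any order, produce the same key, so cached responses can be reused.
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def request_hash(cohort, metric, **params):
    """Deterministic key: sorted params mean order does not matter."""
    canonical = "|".join(
        [cohort, metric] + sorted("%s=%s" % kv for kv in params.items())
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def cache_get(key):
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None

def cache_put(key, response):
    with open(os.path.join(CACHE_DIR, key + ".pkl"), "wb") as f:
        pickle.dump(response, f)

key = request_hash("e3_pef1_confirmation", "threshold", t=24, n=1)
# Parameter order does not change the hash:
assert key == request_hash("e3_pef1_confirmation", "threshold", n=1, t=24)
cache_put(key, {"cohort_size": 1000})
print(cache_get(key))  # {'cohort_size': 1000}
```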

How the Service Works

The user experience with user metrics.

UMAPI - Pipeline

A request names a cohort (or combo) plus parameters; each path returns JSON:

○ Raw params → JSON

○ Time series params → JSON

○ Aggregator + aggregator params → JSON

UMAPI - Frontend Flow

Job Queue

As you fire off requests, the queue tracks what's running:

Response - Bytes Added

Response - Threshold

Response - Edit Rate

Response - Threshold w/ params

Response - Aggregation

Response - Aggregation

Response - Time series

Response - Combining Cohorts

"usertags_meta" - cohort definitions

Response - Combining Cohorts

Two intersecting cohorts:

Response - Combining Cohorts

AND (&)

Response - Combining Cohorts

OR (~)

Response - Single user endpoint

e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000

Looking ahead ...

Connectivity metrics (additional metrics)

○ Graph database? (Neo4j, Gremlin w/ PostgreSQL)

○ User talk and common article edits

Better in-memory modelling

○ python-memcached

○ Better reuse of generated data based on request data

Beyond English Wikipedia - implemented!

Looking ahead ...

More sophisticated and robust data modelling

○ Modelling richer data: contribution histories, articles edited, aggregate metrics

○ Classification: logistic classifiers, support vector machines, deep belief networks, dimensionality reduction

○ Modelling revision text: neural networks, hidden Markov models

DEMO!!

http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold

http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold?aggregator=average

http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate

http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate?aggregator=dist

http://metrics.wikimedia.org/cohorts/ryan_test_2/bytes_added?time_series&start=20120101&end=20130101&aggregator=sum&group=input&interval=720

The End

http://metrics.wikimedia.org/

stat1.wikimedia.org:4000

https://github.com/wikimedia/user_metrics

https://github.com/rfaulkner/E3_analysis

https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev

Questions?
