measuring the new wikipedia community (pydata sv 2013)
DESCRIPTION
Talk given by Ryan Faulkner at PyData Silicon Valley 2013
TRANSCRIPT
Measuring the New Wikipedia Community
PyData 2013
Ryan Faulkner ([email protected])
Wikimedia Foundation
Overview
Introduction
Problem & Motivation
Proposed Solution
User Metrics
A Short Example
Extending the Solution
Using the Tool
Live Demo!!
Introduction
Me: Data Analyst at Wikimedia
Machine Learning @ McGill
Fundraising - A/B testing
Editor Experiments - increasing the number of active editors
Editor Engagement Experiments (E3) team @ the Wikimedia Foundation
Micro-feature experimentation
Problem
What's wrong with Wikipedia?
Problem - Editor Decline
http://strategy.wikimedia.org/wiki/Editor_Trends_Study
Problem - Approach
Can we stimulate the community of users to become more numerous and productive?
○ Focus on new users
■ Encourage contribution, make it easier
○ Lower the threshold for account creation
■ Bring more people in
○ Rapid experimentation on features that retain more users and stimulate increased participation
■ This will help us determine what works at lower cost
Problem - Evaluation
○ Data consistency
■ Anomaly detection
■ Auto-correlation (seasonality)
○ "A/B" testing
■ Hypothesis testing - Student's t, chi-square
■ Linear / logistic regression
○ Multivariate testing
■ Analysis of variance
Problem - What we need
Currently much of the analysis work is done manually and is a large drain on resources:
○ Faster Data gathering
○ Knowing what we're logging and measuring &
faster ETL
○ Faster Analysis
○ Broadening Service and iterating on results
Problem - What we need
Build better infrastructure around how we interpret and analyze our data.
○ Determine what to measure
■ Rigorously define relevant metrics
○ Expose the metrics from our data store
■ Python is great for writing code quickly to handle tasks with data
■ Library support for data analysis (pandas, numpy)
Solution
The tools to build.
Solution - Proposed
We need to measure user behaviour:
"User Metrics" & "UMAPI"
User Metrics & UMAPI
Python implementation for gathering data from MediaWiki data stores, producing well defined metrics, and facilitating subsequent modelling and
analysis. This includes a way to provide an interface for making different types of requests and returning standard responses.
Solution - Why Bother
What exactly do we gain by building these classes? Why not just query the database?
1. Reproducibility & standardization
2. Extensibility
3. Concise definitions
4. Faster turnaround
a. Multiprocessing to optimize metrics generation (e.g. revert rate on 100K users: via MySQL = 24 hrs, via User Metrics < 10 mins)
Solution - Why Python?
Why not C++, Java, or PHP?
1. Speed of development
2. Simpler code base & easy extensibility
a. more "scientist friendly"
3. Good support for data processing
4. Better integration for downstream data analysis
5. The way that metrics work lends itself to "Pythonic" constructs: list comprehensions, decorator patterns, duck typing, a RESTful API.
User Metrics
How do we form a picture about what happens on Wikipedia?
User Metrics - User activity
Events (not exhaustive):
■ Registration
■ Making an edit
■ Contributions by namespace
■ Reverting edits
■ Blocking
User Metrics - What do we want to know about users?
○ How much do they contribute?
○ How often do they contribute?
○ Potential vandals. Do they go on to be reverted,
blocked, banned?
User Metrics - Metrics Definitions
https://meta.wikimedia.org/wiki/Research:Metrics
Retention Metrics
Survival(t) Boolean measure of an editor surviving beyond t
Threshold(t,n) Boolean measure of an editor reaching activity threshold n by time t
Live Account(t) Boolean measure of whether the new user clicked the edit button
Volume Metrics
Edit Rate Float result of user's rate of contribution.
Content Integer bytes added by revision and edit count.
Sessions Average session length (future)
Time to Threshold Time to reach a threshold (e.g. first edit)
User Metrics - Metrics Definitions
Content Quality
Revert Rate Float representing the proportion of revisions reverted.
Block Boolean indicating a block event on the user.
Content Persistence Integer indicating how long this user's edits survive (future)
Contribution Type
Namespace of Edits Integer edit counts in all namespaces.
Scale of Change Float representation of fraction of total page content modified (future)
User Metrics - Bytes Added
user revision history (over a predefined period)
Revision k: byte increase
(user ID, bytes_added, bytes_removed, edit count)
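The Bytes Added computation above can be sketched in a few lines. This is an illustrative reconstruction, not the actual user_metrics code: the function name and the (parent_size, new_size) revision shape are assumptions standing in for what the real implementation reads from MediaWiki's revision table.

```python
def bytes_added(revisions):
    """Given a user's revisions over a predefined period, each a
    (parent_size, new_size) pair of page sizes in bytes, return
    (bytes_added, bytes_removed, edit_count)."""
    added = removed = 0
    for parent_size, new_size in revisions:
        delta = new_size - parent_size
        if delta >= 0:
            added += delta       # revision grew the page
        else:
            removed += -delta    # revision shrank the page
    return added, removed, len(revisions)
```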
User Metrics - Threshold
user revision history (over a predefined period)
registration
Events since registration up to time "t":
if len(event_list) >= n:
    threshold_reached = True
else:
    threshold_reached = False
(user ID, threshold_reached={0,1})
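As a minimal sketch of the Threshold(t, n) metric described on this slide (names are hypothetical; the real metric pulls events from the MediaWiki data store):

```python
from datetime import datetime, timedelta

def threshold(event_times, registration, t_hours, n):
    """Threshold(t, n): True if the user logged at least n events
    within t hours of registration."""
    cutoff = registration + timedelta(hours=t_hours)
    event_list = [e for e in event_times if registration <= e <= cutoff]
    return len(event_list) >= n
```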
User Metrics - Revert Rate
user revision history (over a predefined period)
For each revision, look at the page history:
Past revisions: checksum i
Future revisions: checksum k
if checksum i == checksum k:
    # reverted!
(user ID, revert_rate, total_revisions)
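The checksum comparison above can be sketched as follows. This is an assumed simplification: each of the user's revisions is represented by the checksums of the page revisions before and after it, where a match means some later revision restored a pre-edit state (in MediaWiki the checksum would come from the revision table's SHA-1 column).

```python
def is_reverted(past_checksums, future_checksums):
    # The edit was reverted if a future revision's checksum matches a
    # past revision's checksum, i.e. the page returned to a prior state.
    return bool(set(past_checksums) & set(future_checksums))

def revert_rate(revisions):
    """revisions: list of (past_checksums, future_checksums), one pair
    per edit by the user. Returns (revert_rate, total_revisions)."""
    total = len(revisions)
    reverted = sum(is_reverted(p, f) for p, f in revisions)
    return (reverted / total if total else 0.0, total)
```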
User Metrics - Implementation
https://github.com/wikimedia/user_metrics
1. MySQL & Redis (future) data store
a. All backend dependencies are abstracted out of the metrics classes
2. Python implementation - MySQLdb (SQLalchemy)
3. Strategy Pattern of Parent user metrics class
4. Metrics built mainly from four core MediaWiki tables:
a. revision, user, page, logging
5. Python Decorator methods for handling metric
aggregation
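Points 3 and 5 above (a parent metric class plus decorator-based aggregation) might look roughly like this. All names here are hypothetical stand-ins for the actual user_metrics classes, and the in-memory dict stands in for a database query:

```python
class UserMetric:
    """Parent class (Strategy Pattern): backend access is abstracted
    here, so concrete metrics only implement process()."""
    def process(self, user_ids):
        raise NotImplementedError

def aggregator(agg_func):
    """Class decorator attaching an aggregation method that reduces
    per-user metric values with agg_func."""
    def decorate(metric_cls):
        def aggregate(self, user_ids):
            return agg_func([value for _, value in self.process(user_ids)])
        metric_cls.aggregate = aggregate
        return metric_cls
    return decorate

@aggregator(lambda values: sum(values) / len(values))  # mean
class EditRate(UserMetric):
    def __init__(self, edit_counts, period_hours):
        self.edit_counts = edit_counts      # stand-in for a DB query
        self.period_hours = period_hours
    def process(self, user_ids):
        # edits per hour for each user over the period
        return [(u, self.edit_counts.get(u, 0) / self.period_hours)
                for u in user_ids]
```

The payoff of the decorator approach is that each new metric class gets aggregation for free instead of re-implementing mean/sum/median per metric.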
User Metrics
A Concrete Example
How can we use this framework?
Example - Post Edit Feedback
What effect does editing feedback (confirmation/gratitude) have on new editors?
Example - Results
An Extended Solution
Turn the data machine into a service.
Editor Metrics go beyond feature experimentation ...
It became clear that...
● We needed a service to let clients generate their own user metrics data sets
● We wanted to add a way for this methodology to extend beyond E3 and potentially WMF
● A force multiplier was necessary to iterate on editor data in more interesting ways (Machine Learning & more sophisticated analyses)
User Metrics API [UMAPI]
Open Source (almost) RESTful API (Flask)
Computes metrics per user (User Metrics)
Combines metrics in different ways depending on request types
HTTP response in JSON with resulting data
Store data internally for reuse
UMAPI
http://metrics.wikimedia.org/
https://github.com/wikimedia/user_metrics
https://github.com/rfaulkner/E3_analysis
https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
UMAPI - Overview
Services GET requests based on a combination of URL paths + query params
e.g. /cohort/metric?date_start=..&date_end=...&...
Define user "cohorts" on which to operate
API engine maps to metrics request object (Mediator Pattern) which is handed off to a request manager which builds and runs request
JSON response
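Building a request URL from the path-plus-query-params scheme above is straightforward. A small sketch, using the `/cohorts/<cohort>/<metric>` path shape seen in the demo URLs later in the talk (the helper name is invented):

```python
from urllib.parse import urlencode

BASE = "http://metrics.wikimedia.org"  # service root from the slides

def metric_url(cohort, metric, **params):
    """Build a UMAPI request URL: /cohorts/<cohort>/<metric>?<params>.
    Parameters are sorted so equivalent requests produce identical URLs."""
    url = f"{BASE}/cohorts/{cohort}/{metric}"
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url
```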
UMAPI - Overview
Basic cPickle file cache for responses
Can substitute the caching system (e.g. memcached)
Reusing request data where it overlaps
Request types:
"Raw" - metrics per user
Aggregation over cohorts: mean, sum, median, etc.
Time series requests
UMAPI Architecture
[Architecture diagram: an HTTP GET request enters the Apache / mod_wsgi / Flask app server and a JSON response is returned; a listener handles request notifications; request control and response control sit in front of a cache; the User Metrics API reads from MediaWiki slave databases; metrics objects run as separate processes coordinated through messaging queues and asynchronous callbacks.]
UMAPI Architecture - Listeners
Request Notifications Callback: manages jobs and notifications on job status
Request Controller: queues requests, spawns jobs from metrics objects, coordinates parameters
Response Controller: reconstructs response data, writes to cache
UMAPI - User Cohorts
We will want to consider large groups of users, for instance, a test or control group in some experiment:
Aggregate groups of users: lists of user IDs
Cohort registration (under construction): adding new cohorts to the model
Single user endpoint
Boolean expressions over cohorts supported
User Metric Periods
How do we define the periods over which metrics are measured?
Registration: look "t" hours from user registration
User Defined: user-supplied start and end dates
Conditional Registration: as above, with the condition that registration falls within the input range
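The three period types can be sketched as one window-resolving helper. This is an assumed illustration of the idea, not the actual user_metrics period code:

```python
from datetime import datetime, timedelta

def metric_period(mode, registration=None, t_hours=None,
                  start=None, end=None):
    """Resolve the (start, end) window a metric is measured over."""
    if mode == "registration":
        # t hours from user registration
        return registration, registration + timedelta(hours=t_hours)
    if mode == "user_defined":
        # user-supplied start and end dates
        return start, end
    if mode == "conditional_registration":
        # registration-relative window, but only if registration
        # falls within the supplied input range
        if start <= registration <= end:
            return registration, registration + timedelta(hours=t_hours)
        return None
    raise ValueError(f"unknown period mode: {mode}")
```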
UMAPI - RequestMeta Module
Mediator Pattern to handle passing request data among different portions of the architecture
Abstraction allows for easy filtering and default behaviour of request parameters
Requests can easily be turned into reproducible and unique hashes for caching
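The reproducible-hash idea can be sketched as follows: canonicalize the request (sorted parameters) so that equivalent requests map to the same cache key. Function and parameter names are hypothetical, not the actual RequestMeta API:

```python
import hashlib

def request_hash(cohort, metric, params):
    """Reduce a request to a reproducible, unique cache key by
    serializing its parts in a canonical (sorted) order."""
    canonical = "|".join([cohort, metric] +
                         [f"{k}={params[k]}" for k in sorted(params)])
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```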
How the Service Works
The user experience with user metrics.
UMAPI - Pipeline
[Pipeline diagram: a cohort (or combination of cohorts) plus raw params flows into an optional time series stage and/or an aggregator (with its own aggregator params); each path emits JSON.]
UMAPI - Frontend Flow
Job Queue
As you fire off requests, the queue tracks what's running:
Response - Bytes Added
Response - Threshold
Response - Edit Rate
Response - Threshold w/ params
Response - Aggregation
Response - Aggregation
Response - Time series
Response - Combining Cohorts
"usertags_meta" - cohort definitions
Response - Combining Cohorts
Two intersecting cohorts:
Response - Combining Cohorts
AND (&)
Response - Combining Cohorts
OR (~)
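Treating cohorts as sets of user IDs, the boolean combinations shown in these responses reduce to set operations. A minimal sketch using the operators from the slides ('&' for AND, '~' for OR); the function name is invented:

```python
def combine_cohorts(cohort_a, cohort_b, op):
    """Combine two cohorts (sets of user IDs) with a boolean operator.
    The slides use '&' for AND (intersection) and '~' for OR (union)."""
    if op == "&":
        return cohort_a & cohort_b
    if op == "~":
        return cohort_a | cohort_b
    raise ValueError(f"unsupported operator: {op}")
```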
Response - Single user endpoint
e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000
Looking ahead ...
Connectivity metrics (additional metrics)
○ Graph database? (Neo4j, Gremlin w/ PostgreSQL)
○ User talk and common article edits
Better in-memory modelling
○ python-memcached
○ better reuse of generated data based on request data
Beyond English Wikipedia
Implemented!
Looking ahead ...
More sophisticated and robust data modelling
○ Modelling richer data: contribution histories, articles edited, aggregate metrics
○ Classification: logistic classifiers, Support Vector Machines, Deep Belief Networks, dimensionality reduction
○ Modelling revision text - neural networks, Hidden Markov Models
DEMO!!
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold?aggregator=average
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate?aggregator=dist
http://metrics.wikimedia.org/cohorts/ryan_test_2/bytes_added?time_series&start=20120101&end=20130101&aggregator=sum&group=input&interval=720
The End
http://metrics.wikimedia.org/
stat1.wikimedia.org:4000
https://github.com/wikimedia/user_metrics
https://github.com/rfaulkner/E3_analysis
https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
Questions?