The Lean Data Diet - USENIX

Post on 10-Mar-2021


The Lean Data Diet

analytics@wikimedia.org @pantojacoder


Free Knowledge Movement

“Imagine a world in which every single human being can freely share in the sum of all knowledge”.

Nobody should have to provide any information to participate in the free knowledge movement.

There cannot be access to free knowledge without a strong guarantee of privacy.

Free Knowledge Movement: Core Beliefs around Privacy

Anyone can edit!

How is this guarantee of Privacy Expressed?

https://foundation.wikimedia.org/wiki/Privacy_policy

Build the wiki way

Discussion: 150,000 words in total

Read or edit without account.

Register account without name, email or any other info.

Never selling/sharing your info with third parties.

After at most 90 days, data will be deleted, aggregated, or de-identified

In practice, the privacy policy has strong implications for how we do engineering.


Wikipedia runs on-prem

https://github.com/wikimedia/puppet

We compute metrics in privacy-conscious ways, aggregate them, release them publicly, and delete a lot of data.

Deleting Data

Sanitizing Data

Privacy Culture

Deleting Data: At Scale!

Usage Data - Web Request

project     es.wikipedia
ip_address  31.214.189.167
user_agent  Mozilla/5.0 (X11; Linux ...
page        COVID-19

Usage Data - Behavioural

username    pepito_editor
ip_address  3x.214.189.167
user_agent  Mozilla/5.0 (X11; Linux ...
session_id  8c878625792be023
edit_count  4257
ui_skin     minerva

200,000 web requests per second (at peak)

2,000 events per second

Deleting Data: Are you sure?

Cancel / Delete

First design (unsafe defaults):

--dry-run           undef -> execute
--tables-to-delete  undef -> all

Second design (safe defaults):

--execute           undef -> dry-run
--tables-to-delete  undef -> none, * -> all

--database=event
--tables=menuClicks
--wikis=en.wikipedia
--older-than=90
--skip-trash=true

Executing tests…
Tests passed.
Starting DRY-RUN.
Checking partitions to delete…
Partitions that would be deleted by execution:
- year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
- year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
- year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
- year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
- year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
- year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
- year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
- year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
- year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
DRY-RUN finished.

Parameter checksum: 57ca7987d987e9e98a6c79

--execute=<checksum>


#1 Dry-run

#2 Execute
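The two-step flow above (dry-run first, then execute with the checksum the dry-run printed) can be sketched in a few lines. This is a minimal illustration, not the actual deletion script: the function names, the use of SHA-256, the 22-character truncation, and the string return values are all assumptions made for the example. Only the flag names come from the slides.

```python
import argparse
import hashlib


def parameter_checksum(params: dict) -> str:
    """Hash the deletion parameters (assumed scheme: SHA-256 over sorted key=value pairs)."""
    canonical = "&".join(f"{key}={params[key]}" for key in sorted(params))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:22]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical partition deletion tool")
    parser.add_argument("--database", required=True)
    parser.add_argument("--tables", default=None)    # undefined -> delete nothing
    parser.add_argument("--wikis", default=None)
    parser.add_argument("--older-than", type=int, required=True)
    parser.add_argument("--skip-trash", default="false")
    parser.add_argument("--execute", default=None, metavar="CHECKSUM")  # undefined -> dry-run
    return parser


def plan(argv: list) -> str:
    """Decide what a run with these arguments is allowed to do."""
    params = vars(build_parser().parse_args(argv))
    supplied = params.pop("execute")
    expected = parameter_checksum(params)
    if params["tables"] is None:
        return "NOTHING-TO-DO"
    if supplied is None:
        # Safe default: only report what would be deleted, plus the checksum.
        return f"DRY-RUN checksum={expected}"
    if supplied != expected:
        # The parameters differ from the ones that were dry-run: refuse.
        return "REFUSED"
    return "EXECUTE"
```

Because the checksum depends on every parameter, changing any flag after the dry-run (for example `--older-than`) invalidates it, so an operator can only execute exactly what was previewed.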

Sanitizing Data in Advance

[Diagram: behavioural data pipeline. Components: Clients, HTTP Beacon Endpoint, Varnish, Varnishkafka, Kafka, Event Processor (Spark), HDFS. In HDFS, an Allow-list turns Events <90 days into Sanitized Events.]

https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-job

Behavioural data

Do-not-allow-list

field       Unsanitized                  Sanitized
date        2019-01-01                   2019-01-01
ip          31.214.189.167               NULL
user_agent  Mozilla/5.0 (X11; Linux ...  NULL
wiki        en.wikipedia                 en.wikipedia
action      click                        click
target      menu                         menu

After a new field (cookie_id) is added to the event, it passes through untouched:

field       Unsanitized                  Sanitized
date        2019-01-01                   2019-01-01
ip          31.214.189.167               NULL
user_agent  Mozilla/5.0 (X11; Linux ...  NULL
wiki        en.wikipedia                 en.wikipedia
action      click                        click
target      menu                         menu
cookie_id   724310                       724310
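The failure mode of a do-not-allow-list is easy to reproduce. The sketch below is illustrative only: `DENY_LIST` and `sanitize_with_deny_list` are hypothetical names, and the real pipeline runs in Spark rather than over Python dictionaries.

```python
# Fields that were known to be sensitive when the list was written (assumed for the example).
DENY_LIST = {"ip", "user_agent"}


def sanitize_with_deny_list(event: dict) -> dict:
    """Null out the listed fields; every other field passes through unchanged."""
    return {field: (None if field in DENY_LIST else value) for field, value in event.items()}


event = {
    "date": "2019-01-01",
    "ip": "31.214.189.167",
    "user_agent": "Mozilla/5.0 (X11; Linux ...",
    "wiki": "en.wikipedia",
    "action": "click",
    "target": "menu",
    "cookie_id": 724310,  # added to the schema after the deny-list was written
}

sanitized = sanitize_with_deny_list(event)
```

Here `sanitized["ip"]` and `sanitized["user_agent"]` are `None`, but `sanitized["cookie_id"]` still holds the original identifier, because nobody updated the list when the field was introduced.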

Allow-list

Fields that are not on the allow-list are set to NULL, including newly added ones such as cookie_id. Listed fields can be kept as they are or kept in a transformed, less identifying form.

field       Unsanitized                  Sanitized (not listed)  Sanitized (listed with transformation)
date        2019-01-01                   2019-01-01              2019-01-01
ip          31.214.189.167               NULL                    Spain
user_agent  Mozilla/5.0 (X11; Linux ...  NULL                    Linux
wiki        en.wikipedia                 en.wikipedia            en.wikipedia
action      click                        click                   click
target      menu                         menu                    menu
cookie_id   724310                       NULL                    8d56ab209e10
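An allow-list inverts the default: anything not explicitly listed becomes NULL. The sketch below assumes a mapping from field name to a transformation function. All names are hypothetical, the country lookup and operating-system detection are crude stand-ins for real geolocation and user-agent parsing, and the salted HMAC is only one possible way to obtain a value like the hashed cookie_id shown above; the slides do not say which hash is used.

```python
import hashlib
import hmac


def keep(value):
    """Keep the field unchanged."""
    return value


def os_family(user_agent: str):
    """Stand-in for a real user-agent parser: return a coarse OS name or None."""
    for name in ("Android", "Windows", "Linux"):
        if name in user_agent:
            return name
    return None


def keyed_hash(value, salt: bytes) -> str:
    """Assumed scheme: salted HMAC-SHA256, truncated to 12 hex characters."""
    return hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def make_allow_list(country_by_ip: dict, salt: bytes) -> dict:
    """Map each allowed field to the transformation applied before it is kept."""
    return {
        "date": keep,
        "wiki": keep,
        "action": keep,
        "target": keep,
        "ip": lambda ip: country_by_ip.get(ip),            # keep only the country
        "user_agent": os_family,                            # keep only the OS family
        "cookie_id": lambda cookie: keyed_hash(cookie, salt),
    }


def sanitize_with_allow_list(event: dict, allow_list: dict) -> dict:
    """Fields missing from the allow-list are nulled by default."""
    return {
        field: (allow_list[field](value) if field in allow_list else None)
        for field, value in event.items()
    }
```

With this shape, a field added to the schema tomorrow is dropped until someone deliberately adds it to the list, which is the opposite of the deny-list behaviour.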


Privacy Culture

Privacy is not the responsibility of one team.

All processes and metrics take privacy into account from the beginning until the end.

Unique Device - DAU or MAU

SELECT COUNT(DISTINCT uuid)
FROM database.table
WHERE date = '2019-01-01';

[Diagram: every request (REQ) carries a UUID, and the UUID is stored with the request.]

SELECT page_title, uuid
FROM database.table
WHERE date = '2019-01-01' AND uuid = <some>;

Unique Device: Last Access

https://diff.wikimedia.org/2016/03/30/unique-devices-dataset/

(today: 2020-10-15)

[Diagram: every request (REQ) carries a Last-Access (LA) cookie that holds only a date. The device sends Last-Access=2020-09-01 with its first request of the day; the cookie is then set to 2020-10-15 and sent with later requests.]

Timestamp   IP       Page     Cookies
2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
2020-10-15  776.9.*  Everest  Last-Access=2020-10-15

SELECT COUNT(*)
FROM database.table
WHERE (last_access_date IS NULL OR last_access_date < date)
  AND date = '2020-10-15';

Only the first request of the day matches, so the device is counted once:

   Timestamp   IP       Page     Cookies
-> 2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
   2020-10-15  776.9.*  Everest  Last-Access=2020-10-15
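The last-access approach can be simulated end to end without any per-device identifier. The sketch below is an illustration under assumptions: `browse` and `count_unique_devices` are hypothetical helpers, the request log is a plain list of dictionaries, and a missing cookie is modelled as `None`.

```python
from datetime import date


def browse(pages: list, today: date, cookie):
    """Simulate one device: log each request with the cookie it sent, then refresh the cookie."""
    rows = []
    for page in pages:
        rows.append({"date": today, "page": page, "last_access": cookie})
        cookie = today  # the response sets Last-Access to today's date
    return rows, cookie


def count_unique_devices(rows: list, day: date) -> int:
    """Count requests whose cookie is missing or older than the day, as in the query above."""
    return sum(
        1
        for row in rows
        if row["date"] == day and (row["last_access"] is None or row["last_access"] < day)
    )


today = date(2020, 10, 15)
returning, _ = browse(["Titanic", "Everest"], today, cookie=date(2020, 9, 1))
first_time, _ = browse(["Everest"], today, cookie=None)
log = returning + first_time
```

For this log `count_unique_devices(log, today)` is 2: the returning device is counted on its Titanic request only, and the first-time device on its single request. The cookie holds nothing but a date, so the log cannot be used to link one device's page views across days.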

The Lean Data Diet

Pros

Less work related to data requests

Easier to make data public

Guarantee of privacy

Cons

Extra work

Privacy culture needs time

Data analysis needs a different mindset

Privacy is a Feature

Questions?

https://xkcd.com/285

analytics@wikimedia.org @pantojacoder

All pictures: https://creativecommons.org/publicdomain/zero/1.0/
