Data applications and infrastructure at Coursera

Pierre Barthelemy, Roshan Sumbaly

Post on 27-Mar-2018

Who are we?

Engineering Lead, Data Infrastructure

Engineering Lead, University & Enterprise Products

Our mission:

Universal access to the world’s best education

Partners

40% of our active learners come from emerging economies

Learners

18M+ learners

We have learners from every country!

Marketplace

Learners

University Partners

Coursera

Agenda

● Data profile
● Lessons: 3 Cs

Data by the numbers...

● 18M+ learners
● 3.3M course completions
● 140M+ quizzes submitted
● 28K+ years of video watched
● 350K+ financial aids given out

Type of data

Structured data: quiz submissions, item progress, course content

Unstructured data: forum data, peer submission data, programming submissions

Internal usage

● Dashboards for strategic metrics
● Reports for operational use cases
● A/B testing of new features
● Surveys

External usage

● Instructor facing
  ○ Instructor dashboards
  ○ Data exports
● Learner facing
  ○ Recommendations
  ○ Personalized emails

3 Cs

Collecting Curating Capitalizing

In practice: Instructor Dashboards

Student demographics

Highlight learner dropout points

Identify learner misconceptions

In practice: Instructor dashboards

Eventing data (Student progress)

Cassandra data (Course content)

Redshift: Raw tables

Learner

Collecting

Lesson 1: Play an active role in the systems that collect data (Eventing)

services / web / mobile → Eventing service → Redshift

Challenges with distributed ownership
1. Lots of useless data
2. Unclear lossiness
3. No standardization
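One way to picture tackling the "no standardization" and "unclear lossiness" challenges is a shared event envelope that every producer is validated against, with rejected events counted instead of silently dropped. The sketch below is only an illustration: the `Event` fields, the `area.action` key convention and the `EventValidation` object are assumptions made for this example, not Coursera's actual eventing schema.

```scala
// Hypothetical event envelope; field names are illustrative assumptions.
final case class Event(
  key: String,
  userId: Option[Long],
  timestampMs: Long,
  payload: Map[String, String]
)

object EventValidation {
  // Assumed convention: keys look like "area.action", lowercase with underscores.
  private val KeyRegex = "[a-z_]+\\.[a-z_]+"

  // Returns the event unchanged when it is well formed, or a reason for rejection.
  def validate(e: Event): Either[String, Event] =
    if (!e.key.matches(KeyRegex)) Left(s"non-standard key: ${e.key}")
    else if (e.timestampMs <= 0L) Left("missing timestamp")
    else Right(e)

  // Counts (accepted, rejected) so that rejections are measured rather than guessed.
  def summarize(events: Seq[Event]): (Int, Int) = {
    val (ok, bad) = events.map(validate).partition(_.isRight)
    (ok.size, bad.size)
  }
}
```

Running the validation centrally, at the point where events enter the pipeline, keeps the rule in one place even when many teams emit events.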

Collecting

Lesson 2: If you create the data, ETL the data

Netflix Aegisthus

Dataduct + Redshift

Lesson 2: If you create the data, ETL the data (Dataduct)

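steps:
# Extract the instructor/course mapping from the relational database.
- type: extract-from-rds
  sql: |
    SELECT instructor_id
      ,course_id
      ,rank
    FROM courses_instructorincourse;
  hostname: host_db_1
  database: master
# Load the extracted rows into a staging table.
- type: load-into-staging-table
  table: staging.instructors_sessions
# Replace the production table with the staged data.
- type: reload-prod-table
  source: staging.instructors_sessions
  destination: prod.instructors_sessions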

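Collecting

Lesson 2: If you create the data, ETL the data (Scalding)

import com.twitter.scalding._

// Writes one TSV row per course branch: (branchId, courseId, changesDescription).
// BranchModel, StringKey and COURSE_BRANCHES_OUTPUT_FORMAT are not shown on the slide.
// Assumes this is defined inside a Scalding Job, which supplies the implicit
// FlowDef and Mode that write needs.
def processBranches(branchPipe: TypedPipe[BranchModel], outputPath: String): Unit = {
  branchPipe
    .map { branch =>
      (StringKey(branch.branchId).key,
       StringKey(branch.courseId).key,
       branch.changesDescription.map(_.value).getOrElse(""))
    }
    .write(TypedTsv[COURSE_BRANCHES_OUTPUT_FORMAT](outputPath))
}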

Collecting

Lesson 1: Play an active role in the systems that collect data

Lesson 2: If you create the data, ETL the data

Curating

Data quality

“If you're not thinking about how to keep your data clean from the very beginning, you're fucked. I guarantee it.”

- Everything We Wish We'd Known About Building Data Products

1. Correctness
2. Completeness
3. Interpretability
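Two of the three data quality dimensions above, correctness and completeness, lend themselves to automated checks. The sketch below shows roughly what such checks could look like. The `ProgressRow` shape and the rule that progress is a fraction between 0 and 1 are assumptions made for illustration, not details from the talk.

```scala
// Hypothetical row shape for a progress table; not taken from the talk.
final case class ProgressRow(userId: Option[Long], courseId: Option[String], progress: Double)

object QualityChecks {
  // Completeness: share of rows in which both identifiers are present.
  def completeness(rows: Seq[ProgressRow]): Double =
    if (rows.isEmpty) 1.0
    else rows.count(r => r.userId.isDefined && r.courseId.isDefined).toDouble / rows.size

  // Correctness: progress is assumed to be a fraction in [0, 1];
  // rows outside that range are returned for inspection.
  def incorrect(rows: Seq[ProgressRow]): Seq[ProgressRow] =
    rows.filter(r => r.progress < 0.0 || r.progress > 1.0)
}
```

Interpretability is harder to check in code and is more naturally addressed by naming and documentation.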

In practice: Instructor dashboards

Redshift: BI tables

Cumulative progress per student & course

Dataduct

Eventing data (Student progress)

Cassandra data (Course content)

Redshift: Raw tables

Learner

Curating

Lesson 1: Solve data correctness at the data source

Lesson 2: Centralize the raw data

Lesson 3: Standardize the business definitions

BI schema

prod schema

instructor dashboards

analyses

bi.[grain]__[purpose]

e.g. bi.users__activity, bi.users__ltv
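A naming convention such as bi.[grain]__[purpose] is easier to keep consistent when it can be checked mechanically. The sketch below parses a table name into its grain and purpose. The exact character rules (lowercase letters, digits, single underscores inside each part) and the `BiNaming` object are assumptions for this example. The slide only shows the overall shape and two sample names.

```scala
object BiNaming {
  // Assumed reading of the convention: "bi." + grain + "__" + purpose, where each
  // part is lowercase letters or digits, optionally joined by single underscores.
  private val Pattern =
    "^bi\\.([a-z0-9]+(?:_[a-z0-9]+)*)__([a-z0-9]+(?:_[a-z0-9]+)*)$".r

  // Returns (grain, purpose) when the table name follows the convention.
  def parse(table: String): Option[(String, String)] = table match {
    case Pattern(grain, purpose) => Some((grain, purpose))
    case _                       => None
  }
}
```

For instance, `BiNaming.parse("bi.users__activity")` yields the grain `users` and the purpose `activity`, while a name outside the `bi` schema yields `None`.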

1. Parsimony

2. Documentation

3. Syntactic consistency

Curating

Lesson 1: Solve data correctness at the data source

Lesson 2: Centralize the raw data

Lesson 3: Standardize the business definitions

In practice: Instructor dashboards

Materialized progress per course (KVS)

Nostos

Instructor

Redshift: BI tables

Cumulative progress per student & course

Dataduct

Eventing data (Student progress)

Cassandra data (Course content)

Redshift: Raw tables

Learner

Capitalizing

Lesson 1: Make analysis as easy as possible (MEGA)

Lesson 2: Make egress as easy as possible (Nostos)

Nostos service

Key/Value access
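The idea of serving warehouse results through key/value access can be sketched as a small aggregation: rows at the student-and-course grain are rolled up into one value per course, keyed by course id. Everything named below (`StudentProgress`, `CourseProgress`, `Materialize`) is hypothetical, and an in-memory `Map` stands in for the key/value store. This is not a description of how Nostos itself works.

```scala
// Hypothetical row of a "cumulative progress per student & course" table.
final case class StudentProgress(courseId: String, userId: Long, itemsCompleted: Int)

// Hypothetical value served to a dashboard for one course.
final case class CourseProgress(learners: Int, totalItemsCompleted: Int)

object Materialize {
  // Groups warehouse rows by course so that each course becomes one key/value entry.
  def byCourse(rows: Seq[StudentProgress]): Map[String, CourseProgress] =
    rows.groupBy(_.courseId).map { case (courseId, rs) =>
      courseId -> CourseProgress(
        learners = rs.map(_.userId).distinct.size,
        totalItemsCompleted = rs.map(_.itemsCompleted).sum
      )
    }
}
```

A dashboard backend would then read one course's value with a single key lookup instead of running a warehouse query per page view.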

Capitalizing

Lesson 1: Make analysis as easy as possible

Lesson 2: Make egress as easy as possible

Collecting
1. Play an active role in your data collection
2. If you create the data, ETL the data

Curating
1. Solve data correctness at the data source
2. Centralize the raw data
3. Standardize the business definitions

Capitalizing
1. Make analysis as easy as possible
2. Make egress as easy as possible

Team

coursera.org/jobs

building.coursera.org

@CourseraEng