infrastructure at coursera data applications and at coursera. who are we? engineering lead, data...
TRANSCRIPT
Who are we?
Engineering Lead, Data Infrastructure
Engineering Lead, University & Enterprise Products
40% of our active learners come from emerging economies
Learners
18M+ Learners
We have learners from every country!
Data by the numbers...
● 18M+ learners
● 3.3M course completions
● 140M+ quizzes submitted
● 28K+ years of video watched
● 350K+ financial aids given out
Internal usage
● Dashboards for strategic metrics
● Reports for operational use cases
● A/B testing of new features
● Surveys
External usage
● Instructor facing
  ○ Instructor dashboards
  ○ Data exports
● Learner facing
  ○ Recommendations
  ○ Personalized emails
In practice: Instructor dashboards
Student demographics
Highlight learner dropout points
Identify learner misconceptions
In practice: Instructor dashboards
Collecting Curating Capitalizing
[Diagram: Learner → Eventing data (student progress) and Cassandra data (course content) → Redshift raw tables]
Collecting
Lesson 1: Play an active role in the systems that collect data
Collecting
Lesson 1: Play an active role in the systems that collect data (Eventing)
[Diagram: services, web, and mobile → Eventing service → Redshift]
Challenges with distributed ownership
1. Lots of useless data
2. Unclear lossiness
3. No standardization
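One way to picture a response to these three challenges is a small shared envelope that every event must satisfy before it is accepted, with rejected events counted rather than silently dropped. The sketch below is only an illustration, not the eventing service described in the talk; the field names (`key`, `user_id`, `timestamp_ms`) and both helper functions are assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical standard envelope: every event must carry these typed fields.
REQUIRED_FIELDS = {"key": str, "user_id": int, "timestamp_ms": int}


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event is accepted."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(f"wrong type for {name}")
    return problems


def partition_events(events: List[Dict[str, Any]]) -> Tuple[list, list]:
    """Split events into accepted and rejected so that loss is measurable."""
    accepted, rejected = [], []
    for event in events:
        (rejected if validate_event(event) else accepted).append(event)
    return accepted, rejected
```

Keeping the rejected list around, instead of discarding bad events, is what makes the loss rate visible to the teams that emit the data.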
Netflix Aegisthus
Collecting
Lesson 2: If you create the data, ETL the data
Dataduct +
Redshift
Collecting
Lesson 2: If you create the data, ETL the data (Dataduct)
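steps:
# 1. Pull the source rows out of the relational database.
-   type: extract-from-rds
    sql: |
        SELECT instructor_id
              ,course_id
              ,rank
        FROM courses_instructorincourse;
    hostname: host_db_1
    database: master
# 2. Load the extracted rows into a staging table.
-   type: load-into-staging-table
    table: staging.instructors_sessions
# 3. Replace the production table with the staged rows.
-   type: reload-prod-table
    source: staging.instructors_sessions
    destination: prod.instructors_sessions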
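The last two steps follow a common pattern: load into a scratch table first, then replace the production table's contents inside one transaction, so readers never see a half-loaded table. The sketch below illustrates that pattern with Python's built-in sqlite3 module rather than Dataduct or Redshift; the underscore-style table names and the `reload_prod_table` helper are assumptions made for the example.

```python
import sqlite3


def reload_prod_table(conn, rows):
    """Load rows into a staging table, then replace prod in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM staging_instructors_sessions")
        conn.executemany(
            "INSERT INTO staging_instructors_sessions VALUES (?, ?, ?)", rows
        )
        conn.execute("DELETE FROM prod_instructors_sessions")
        conn.execute(
            "INSERT INTO prod_instructors_sessions "
            "SELECT * FROM staging_instructors_sessions"
        )


# In-memory database standing in for the warehouse (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
for table in ("staging_instructors_sessions", "prod_instructors_sessions"):
    conn.execute(
        f"CREATE TABLE {table} "
        "(instructor_id INTEGER, course_id INTEGER, rank INTEGER)"
    )

reload_prod_table(conn, [(1, 10, 1), (2, 10, 2)])
prod_rows = conn.execute(
    "SELECT * FROM prod_instructors_sessions ORDER BY instructor_id"
).fetchall()
```

If the insert into staging fails, the transaction rolls back and the production table keeps its previous contents.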
Collecting
Lesson 2: If you create the data, ETL the data (Scalding)
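import com.twitter.scalding._

// BranchModel, StringKey and COURSE_BRANCHES_OUTPUT_FORMAT are defined elsewhere in the codebase.
// Flattens each branch into a (branchId, courseId, changesDescription) row and writes it as TSV.
def processBranches(branchPipe: TypedPipe[BranchModel], outputPath: String): Unit = {
  branchPipe
    .map { branch =>
      (StringKey(branch.branchId).key,
        StringKey(branch.courseId).key,
        // Fall back to an empty string when a branch has no description.
        branch.changesDescription.map(_.value).getOrElse(""))
    }
    .write(TypedTsv[COURSE_BRANCHES_OUTPUT_FORMAT](outputPath))
}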
Collecting
Lesson 1: Play an active role in the systems that collect data
Lesson 2: If you create the data, ETL the data
Curating
Data quality
“If you're not thinking about how to keep your data clean from the very beginning, you're fucked. I guarantee it.”
- Everything We Wish We'd Known About Building Data Products
1. Correctness
2. Completeness
3. Interpretability
In practice: Instructor dashboards
[Diagram: Learner → Eventing data (student progress) and Cassandra data (course content) → Redshift raw tables → Dataduct → Redshift BI tables (cumulative progress per student & course)]
Curating
Lesson 3: Standardize the business definitions
[Diagram: prod schema → BI schema → instructor dashboards and analyses]
BI table naming: bi.[grain]__[purpose], e.g. bi.users__activity, bi.users__ltv
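A naming convention like this is easiest to keep when it is checked mechanically. The sketch below parses a table name into its grain and purpose and rejects names that do not follow the bi.[grain]__[purpose] form. The exact character rules (lowercase letters, digits, and single underscores inside each part) and the `parse_bi_table_name` helper are assumptions of this example, not something stated in the talk.

```python
import re
from typing import NamedTuple, Optional

# Assumed character rules: lowercase alphanumerics, single underscores
# allowed inside a part, and a double underscore separating grain from purpose.
_PART = r"[a-z0-9]+(?:_[a-z0-9]+)*"
BI_TABLE_PATTERN = re.compile(rf"^bi\.({_PART})__({_PART})$")


class BiTableName(NamedTuple):
    grain: str
    purpose: str


def parse_bi_table_name(name: str) -> Optional[BiTableName]:
    """Return the (grain, purpose) of a BI table name, or None if malformed."""
    match = BI_TABLE_PATTERN.match(name)
    if match is None:
        return None
    return BiTableName(grain=match.group(1), purpose=match.group(2))
```

A check like this could run in code review or in the ETL job that creates BI tables, so that a malformed name never reaches the schema.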
1. Parsimony
2. Documentation
3. Syntactic consistency
Curating
Lesson 1: Solve data correctness at the data source
Lesson 2: Centralize the raw data
Lesson 3: Standardize the business definitions
In practice: Instructor dashboards
[Diagram: Learner → Eventing data (student progress) and Cassandra data (course content) → Redshift raw tables → Dataduct → Redshift BI tables (cumulative progress per student & course) → Nostos → Materialized progress per course (KVS) → Instructor]
Capitalizing
Lesson 2: Make egress as easy as possible (Nostos)
Nostos service
Key/Value access
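The egress step can be pictured as turning warehouse rows into key/value pairs that a product service reads by key. The sketch below is not the Nostos implementation; the row shape, the use of the course ID as the key, and the `materialize_progress_by_course` helper are assumptions made for illustration, and a plain dict stands in for the key/value store.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical BI row: (course_id, student_id, cumulative progress in [0, 1]).
ProgressRow = Tuple[str, str, float]


def materialize_progress_by_course(rows: Iterable[ProgressRow]) -> Dict[str, dict]:
    """Collapse per-student rows into one value per course key."""
    by_course: Dict[str, List[float]] = defaultdict(list)
    for course_id, _student_id, progress in rows:
        by_course[course_id].append(progress)
    return {
        course_id: {
            "learners": len(values),
            "mean_progress": sum(values) / len(values),
        }
        for course_id, values in by_course.items()
    }


rows = [("ml-101", "s1", 0.5), ("ml-101", "s2", 1.0), ("algo-1", "s3", 0.25)]
kv_store = materialize_progress_by_course(rows)
```

A dashboard backend would then look up `kv_store["ml-101"]` by key instead of querying the warehouse on every page load.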
Capitalizing
Lesson 1: Make analysis as easy as possible
Lesson 2: Make egress as easy as possible
Collecting
1. Play an active role in your data collection
2. If you create the data, ETL the data
Curating
1. Solve data correctness at the data source
2. Centralize the raw data
3. Standardize the business definitions
Capitalizing
1. Make analysis as easy as possible
2. Make egress as easy as possible