Developing standards and systems for MOOC data science. Kalyan Veeramachaneni, Franck Dernoncourt, Colin Taylor, Zach Pardos, Una-May O’Reilly. Any Scale Learning for All, CSAIL, MIT. Tuesday, July 9, 13


Page 1: Developing standards and systems for MOOC data science (people.csail.mit.edu/zp/moocshop2013/presentation_14.pdf)

Developing standards and systems for MOOC data science

Kalyan Veeramachaneni, Franck Dernoncourt, Colin Taylor, Zach Pardos, Una-May O’Reilly

Any Scale Learning for All, CSAIL, MIT

Tuesday, July 9, 13


Overview

• Background and motivation

• What did the data look like?

• Where are the bottlenecks?

• Data science @ scale

• Organize: MOOCdb

• Create multiple views: MOOC En Images

• Provide APIs: MOOCdb Access

• Why standardize?


What did the data look like?

MOOC data sources:
• Server logs (JSON lines)
• Browser logs (JSON lines)
• Course information (XML files)
• Course website
• Emails
• Surveys
• SQL dump (production server)
• Forums data


Example research question

RQ1: How does the amount of time spent on videos correlate with performance on the homework?

• Which video URLs correspond to these homeworks?
• What is the time period between two consecutive homeworks?
• How much time did the user spend on these URLs?
• What answers did the user submit during this period?
• What are the correct answers for those problems?
• How many did the student get right?
• Run the corr function in MATLAB
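The last step of RQ1 is a plain correlation once the two covariates are assembled. As a rough sketch in Python rather than MATLAB, assuming the per-student totals have already been extracted from the logs, the Pearson coefficient can be computed as below. The student records and their values are made up for illustration and are not data from the course.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical, hand-written covariates (NOT real course data):
# (student id, seconds spent on the relevant videos, homework answers right)
students = [
    ("s1", 600, 2),
    ("s2", 1200, 3),
    ("s3", 1800, 6),
    ("s4", 2400, 7),
]

video_seconds = [row[1] for row in students]
homework_correct = [row[2] for row in students]
r = pearson(video_seconds, homework_correct)
```

The correlation itself is a few lines; every step before it (mapping URLs to homeworks, computing time spent, grading submissions) is where the effort goes.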


Where are the bottlenecks?

A typical process: Preprocess → Assemble covariates → Build models → Analyze


[Figure: the four stages (Preprocess, Assemble covariates, Build models, Analyze) drawn twice, once under "What is now?" and once under "What we desire?". Box sizes represent the time spent on each activity.]


How about we organize the data using a concise schema that we will always adhere to?

Then all the scripts we write can be reused.

Challenge: the schema must be generalizable and lossless.


Steps to data science @ scale

Step 1: Develop a generalizable, lossless schema for the data

Step 2: Define and design APIs to extract multiple views of data

Step 3: Design user-friendly APIs

Step 4: Design a platform where users can contribute and share the scripts


Step 1: Develop a generalizable, lossless schema

Students interact with the system in the following ways:
• Observe
• Submit
• Collaborate
• Give feedback
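To make the four interaction modes concrete, here is a minimal sketch of how raw platform events might be tagged with a mode before being stored. The record fields, the raw event names, and the mapping are all hypothetical illustrations and are not taken from the actual MOOCdb schema.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """The four ways a student interacts with the system."""
    OBSERVE = "observe"
    SUBMIT = "submit"
    COLLABORATE = "collaborate"
    FEEDBACK = "feedback"


@dataclass(frozen=True)
class Interaction:
    student_id: str
    mode: Mode
    resource: str
    timestamp: str  # ISO 8601 string, kept exactly as logged


# Hypothetical raw event names; a real platform would define its own.
RAW_EVENT_TO_MODE = {
    "play_video": Mode.OBSERVE,
    "page_view": Mode.OBSERVE,
    "problem_check": Mode.SUBMIT,
    "forum_post": Mode.COLLABORATE,
    "survey_response": Mode.FEEDBACK,
}


def to_interaction(raw):
    """Map one raw event dict onto a mode-tagged Interaction."""
    mode = RAW_EVENT_TO_MODE.get(raw["event_type"])
    if mode is None:
        # Fail loudly instead of dropping the event, so nothing is lost silently.
        raise ValueError(f"unmapped event type: {raw['event_type']!r}")
    return Interaction(raw["user"], mode, raw["resource"], raw["time"])


example = to_interaction({
    "event_type": "play_video",
    "user": "s1",
    "resource": "/video/1",
    "time": "2013-03-05T10:15:00",
})
```

Raising on an unmapped event type is a deliberate choice in this sketch: a schema that aims to lose no information should surface events it cannot classify.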


Step 1: Develop a generalizable, lossless schema

The observing mode

• Is this generalizable? Yes
• Is this lossless? Yes


MOOCdb


Step 2: Define and design APIs to extract multiple views of data

MOOC En Images

• Student cohorts: by grade, by access patterns, by certificate winners, custom cohort
• Time: by week, by deadline, by midterm, by homework, by modules, custom time points
• Space: by country, by IP, by city
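One way to think about these views is as the same events grouped by different keys. The sketch below assumes events are already available as simple records; the field names, the course start date, and the sample events are hypothetical and do not come from MOOC En Images.

```python
from collections import defaultdict
from datetime import date


def extract_view(events, key_fn):
    """Group events into a view, keyed by whatever key_fn returns."""
    view = defaultdict(list)
    for event in events:
        view[key_fn(event)].append(event)
    return dict(view)


COURSE_START = date(2013, 1, 7)  # hypothetical start date


def by_week(event):
    """Time view: 1-based course week of the event."""
    return (event["day"] - COURSE_START).days // 7 + 1


def by_country(event):
    """Space view: country attached to the event."""
    return event["country"]


# Made-up events for illustration only.
events = [
    {"student": "s1", "day": date(2013, 1, 8), "country": "US"},
    {"student": "s2", "day": date(2013, 1, 15), "country": "IN"},
    {"student": "s1", "day": date(2013, 1, 16), "country": "US"},
]

weekly = extract_view(events, by_week)
per_country = extract_view(events, by_country)
```

A custom cohort or custom time point is then just another key function passed to the same routine.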


MOOC En Images

[Figure: number of submissions per day. x-axis: day (1 to 154); y-axis: number of submissions (0 to 800,000). Source: francky.me/mit/moocdb/all/submissions_per_day.html, 6/30/13]
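A submissions-per-day view like this one reduces to counting timestamps by course day. The following is only a sketch of that aggregation, assuming ISO 8601 timestamps and a known course start date; the function name, the start date, and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import date, datetime


def submissions_per_day(timestamps, course_start):
    """Count submissions per course day, where day 1 is the course start date."""
    counts = Counter()
    for ts in timestamps:
        day = (datetime.fromisoformat(ts).date() - course_start).days + 1
        counts[day] += 1
    return dict(sorted(counts.items()))


# Made-up timestamps for illustration only.
sample = [
    "2013-03-05T10:15:00",
    "2013-03-05T23:59:59",
    "2013-03-07T08:00:00",
]
per_day = submissions_per_day(sample, date(2013, 3, 5))
```

Days with no submissions are simply absent from the result, so a plotting script would need to fill them with zero.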


Step 3: Design user-friendly APIs

MOOCdb Access: MATLAB, Python
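As an illustration of what a user-friendly API over a standardized database could look like, the sketch below hides SQL behind named methods. It is not the MOOCdb Access interface: the class name, the table, and the columns are assumptions made for this example, and an in-memory SQLite database stands in for the real store.

```python
import sqlite3


class MoocdbAccess:
    """Hypothetical thin wrapper that hides SQL behind named methods."""

    def __init__(self, connection):
        self._conn = connection

    def submissions_by_student(self, student_id):
        """Return (problem_id, is_correct) pairs for one student."""
        rows = self._conn.execute(
            "SELECT problem_id, is_correct FROM submissions "
            "WHERE student_id = ? ORDER BY problem_id",
            (student_id,),
        )
        return [(problem_id, bool(ok)) for problem_id, ok in rows]

    def fraction_correct(self, student_id):
        """Share of the student's submissions that were correct, or None."""
        subs = self.submissions_by_student(student_id)
        if not subs:
            return None
        return sum(ok for _, ok in subs) / len(subs)


# Assumed table layout and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (student_id TEXT, problem_id TEXT, is_correct INTEGER)"
)
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?, ?)",
    [("s1", "p1", 1), ("s1", "p2", 0), ("s2", "p1", 1)],
)
db = MoocdbAccess(conn)
```

Because every course would share one schema, a wrapper like this could be written once and reused across courses, which is the point of the earlier slides.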


Step 4: Design a platform where the community can share scripts

[Diagram: components of the platform]
• MOOCdb
• MOOC En Images: multiple views of data
• IRT, BKT, mobility, who talked to whom?
• User-friendly APIs: MATLAB, Python, R
• Query optimizers
• Scripts archive
• Scientific community: analysts, database experts, privacy experts


Leading to standardization

[Diagram: workflow toward a shared, standardized schema]
• MOOC curators: insert logs into a database, then transform the database into the standard format using SQL scripts
• Shared, standardized schema: MOOCdb
• Public access platform: scripts archive, crowd
• Scientific community: analysts, database experts, privacy experts
• Feedback to MOOC organizers: sharing of results, predictions, descriptive analytics
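The curator-side steps (insert logs into a database, then transform into the standard format) might be sketched as follows for JSON-lines logs. The raw field names, the target table, and its columns are hypothetical; the actual MOOCdb tables are not reproduced here. Lines that cannot be parsed are returned rather than discarded, in keeping with the lossless goal.

```python
import json
import sqlite3

# Made-up raw log lines; the field names are assumptions for illustration.
RAW_LOG = "\n".join([
    '{"username": "s1", "event_type": "play_video", "time": "2013-03-05T10:15:00", "page": "/video/1"}',
    '{"username": "s2", "event_type": "page_view", "time": "2013-03-05T10:20:00", "page": "/wiki"}',
    "not valid json",
])


def load_observed_events(conn, raw_lines):
    """Insert parseable log lines into a standard table; return (inserted, rejected)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observed_events "
        "(user_id TEXT, url TEXT, observed_at TEXT)"
    )
    inserted = 0
    rejected = []  # kept for later inspection so no record is silently lost
    for line in raw_lines:
        try:
            record = json.loads(line)
            row = (record["username"], record["page"], record["time"])
        except (json.JSONDecodeError, KeyError, TypeError):
            rejected.append(line)
            continue
        conn.execute("INSERT INTO observed_events VALUES (?, ?, ?)", row)
        inserted += 1
    conn.commit()
    return inserted, rejected


conn = sqlite3.connect(":memory:")
inserted, rejected = load_observed_events(conn, RAW_LOG.splitlines())
row_count = conn.execute("SELECT COUNT(*) FROM observed_events").fetchone()[0]
```

Once the data sits in the standard table, any script written against that table can be shared and rerun on another course's data.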


Benefits of standardization

• Reduction in time to analyze (TTA)

• Publicly available scripts to create data views

• Unified representation to DB/Privacy experts

• Crowd sourcing of feature engineering

• Crowd sourcing of data analytics

• Elimination of entry barriers


Thank you!

Look for MOOC En Images, updates on MOOCdb and open release of all the tools.
