Could a data science program use data science insights?

Mining the Unseen: Educational Data!

GALVANIZE CAPSTONE PROJECT

ZACHARY THOMAS

[email protected] | LINKEDIN: THOMASZI | GITHUB: THOMZI12 | KAGGLE: ZTHOMAS

11/30/2016

https://www.linkedin.com/in/thomaszi

https://github.com/thomzi12

Could a data science program use data science insights?

How can we extract meaning from unstructured data?

Data Sources & Tools

Data Science Program Student Data

Assessment Data

Experience & Background

Daily Exercise Data

Unsupervised Learning

Data Collection

Student pairs (hand-typed)

Student assignments (semi-unstructured)

Assessment data (structured)

+ creativity!

Tabular data, ready for analysis

What next?

Clustering Algorithms

K-means clustering

Hierarchical Clustering

Multi-dimensional Scaling

DBSCAN

OPTICS

Multi-Dimensional Scaling

First we obtain distances, then use MDS to visualize points and potential clusters.

26 dimensions → 2 dimensions

Students furthest apart: 59 and 75 (10.386)
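
The "obtain distances" step can be sketched as below. This is a minimal, dependency-free illustration, assuming Euclidean distance on the feature rows; the toy `students` data and the helper names are hypothetical, not the project's actual code. The resulting matrix is what would then be handed to an MDS implementation for the 2-D projection.

```python
import math
from itertools import combinations


def distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between feature rows."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = math.dist(rows[i], rows[j])
    return d


def furthest_pair(d):
    """Indices (i, j) of the two rows with the largest pairwise distance."""
    return max(combinations(range(len(d)), 2), key=lambda ij: d[ij[0]][ij[1]])


# Hypothetical toy data: 4 students, 3 features each (the real data had 26).
students = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 4.0, 0.0]]
dist = distance_matrix(students)
i, j = furthest_pair(dist)
print(f"Students furthest apart: {i} and {j} ({dist[i][j]:.3f})")
```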

K-Means Clustering

Cluster   Percent of Data   Characteristics
1         23%               Above avg. experience, below avg. scores
2         62%               Avg. experience, above avg. scores
3         15%               Below avg. experience, good at coding

An inertia plot + human intuition help us determine the right number of clusters.
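
For reference, the numbers behind an inertia plot can be produced roughly as follows. This is a sketch only: it uses a plain Lloyd's-algorithm K-means with a deterministic, evenly spaced initialization (a simplification chosen for reproducibility; library implementations typically use random or k-means++ starts), and the toy `points` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans_inertia(points, k, n_iter=50):
    """Run Lloyd's algorithm and return the final inertia
    (sum of squared distances from each point to its nearest centroid)."""
    # Simplified deterministic start: k evenly spaced data points.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Hypothetical toy data with three obvious groups.
points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [20, 0], [20, 1], [21, 0]]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On data like this the inertia drops sharply up to k = 3 and flattens afterwards; that "elbow" is what the plot is read for, with human judgment as the tiebreaker.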

Action Steps

Specialized Program Offerings

Scoping out entry-level positions

DS management training

Targeted Marketing

Insights Dashboard

Next Steps for this Project

• Build cluster classifier for incoming students

• Obtain feedback from end users of data

Questions?

[email protected] | LinkedIn: thomaszi | GitHub: thomzi12 | Kaggle: zthomas

Thank you!

Appendix

Dataset Insights

Cohort    Correlation b/t Initial and Final Student Rank
sf_15     0.435
sf_16     0.805
s_sept    0.55

den 0.17s 0.7

*Done by analyzing rank according to first (1) and last (5) assessment scores
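
One plausible way to compute such a figure is a rank correlation, sketched below. This assumes the correlation is a Pearson correlation taken over ranks (i.e. Spearman's rho), which the slide does not state explicitly; the scores are hypothetical.

```python
import math


def ranks(scores):
    """Rank 1 = highest score; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = avg_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical scores for five students on assessment 1 and assessment 5.
first = [90, 80, 70, 60, 50]
last = [85, 95, 60, 70, 40]
rho = pearson(ranks(first), ranks(last))
print(round(rho, 3))
```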

Summary Visualizations

Relative performance, math and coding

Student Background

Students in the orange box are relatively better at math than coding; students in the green box are the opposite

Engineering, mathematics, and economics are the three most common academic backgrounds of students

Work/Grad School Experience

Median: 6 years | 1st quartile: 3 years | 3rd quartile: 9 years

`Pvclust` Dendrogram

• Package used: pvclust

• Distance: Euclidean

• Linkage: complete

• Clusters: 8
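
pvclust is an R package; as a rough Python stand-in (without pvclust's bootstrap p-values), the effect of the linkage choice can be illustrated with a naive agglomerative routine. Everything below is a hypothetical sketch on toy 1-D data, not the project's code.

```python
import math
from itertools import combinations


def linkage_distance(c1, c2, points, method):
    """Distance between two clusters of point indices.
    'complete' uses the furthest pair, 'single' the closest pair."""
    dists = [math.dist(points[i], points[j]) for i in c1 for j in c2]
    return max(dists) if method == "complete" else min(dists)


def agglomerate(points, n_clusters, method="complete"):
    """Naive agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage_distance(
                clusters[ab[0]], clusters[ab[1]], points, method
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical points on a line with slowly growing gaps.
points = [[0.0], [1.0], [2.1], [3.3], [4.6], [6.0]]
complete = agglomerate(points, 2, method="complete")
single = agglomerate(points, 2, method="single")
print("complete:", complete)
print("single:  ", single)
```

With single linkage each point is absorbed into one growing chain, leaving a lone point as the second cluster; complete linkage produces more balanced groups on the same data.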

Hierarchical Clustering Bar plots

DBSCAN

Need to tune DBSCAN better?

• KNN plot on data after scaling into two dimensions

• Clear knee at eps = 1.5
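
The values behind a KNN distance plot can be computed along these lines. This is a sketch under assumptions: the choice of k, the helper name, and the toy 2-D points are all hypothetical.

```python
import math


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending.
    Plotting these values gives the KNN distance curve used to pick eps."""
    result = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)


# Hypothetical 2-D data: one tight square of points plus one outlier.
points = [[0, 0], [0, 1], [1, 0], [1, 1], [8, 8]]
curve = k_distances(points, k=2)
print([round(v, 3) for v in curve])
```

The curve stays flat for points inside dense regions and jumps for outliers; eps is read off at the knee, just before the jump.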

OPTICS

Choose an epsilon (i.e. the reachability threshold, on the vertical axis) of 1, producing two clusters. (Upper bound of 1.5)

Very reasonable clusters! Used data scaled into two dimensions (above) for model

K=3 K-means Inertia Plot

     [,1]  [,2]  [,3]
[1,] 1.000 0.130 0.468
[2,] 0.130 1.000 0.606
[3,] 0.468 0.606 1.000

Cluster Similarity Matrix, using ’shadow’ method

K=3 K-means Clustering Bar plot

Data Pipeline

• Motivation: Real-life data is messy, as I found out with this project.

• Data Sources:

  • Handwritten pair assignments for student group assignments

  • Student LinkedIn profiles

  • Student GitHub profiles (list of usernames)

  • Student assessment data

• Goal: Find the lengths of selected pair programming assignments to use as a proxy for work ethic.

• Selected 9 assignments that were unusually long or difficult; getting through them completely would require more effort or a quick worker.

• Process:

  • Used a script (repo_scraper_optimized.py) to download a zip file from each student's GitHub profile, look for .py or .ipynb files, and record the number of characters in each file, ignoring comment lines.

  • Since hitting GitHub for data is I/O bound, used threading to run multiple requests simultaneously.

  • Parallelized collection (at one point, using AWS) to speed things up even further.
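
The threaded collection step might look roughly like this. It is a sketch, not the contents of repo_scraper_optimized.py: the network call is replaced by a hypothetical `fetch_source` that sleeps to simulate latency, and `FAKE_REPOS` stands in for downloaded files.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for files downloaded from GitHub.
FAKE_REPOS = {
    "student_a": "# header comment\nimport math\nprint(math.pi)\n",
    "student_b": "x = 1\n# note\ny = x + 1\n",
}


def fetch_source(username):
    """Pretend to download a student's file; the sleep simulates network I/O."""
    time.sleep(0.05)
    return FAKE_REPOS[username]


def count_code_chars(source):
    """Count characters on lines that are not whole-line comments."""
    return sum(
        len(line)
        for line in source.splitlines()
        if not line.strip().startswith("#")
    )


def measure(username):
    """Fetch one student's source and return (username, character count)."""
    return username, count_code_chars(fetch_source(username))


# Threads overlap the waiting time of the I/O-bound fetches.
with ThreadPoolExecutor(max_workers=8) as pool:
    lengths = dict(pool.map(measure, FAKE_REPOS))

print(lengths)
```

Threads suit this job because each worker spends most of its time waiting on the network rather than computing.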

Data Pipeline (Part 2)

Process:

• In another script, used fuzzy matching on the student assignment pairs to determine file attribution. Looked at files with the word “pair” in the name to single out pair assignments.

• Wrote results to a master spreadsheet with the assessment data. The data is now in a format where each row is a different student, with that student’s data in columns.

Challenges:

• Missing values: how to recognize and handle them

• Finding pair assignments rather than individual assignments

• Pair assignments were not in a format where each row is a unique student (each row held two different students)

Next Steps:

• Improve recall: 10-30% of files for an assignment are missing. Granted, those files may not exist.

• Incorporate data into clustering analysis
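
The name-matching idea in this pipeline can be illustrated without any third-party package. The sketch below uses a plain Levenshtein distance as a stand-in for FuzzyWuzzy (which reports a similarity score rather than a raw distance); the roster is hypothetical apart from the author's own name and username from the title slide.

```python
def levenshtein(a, b):
    """Minimum number of deletions, insertions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


def best_match(name, roster):
    """Roster name closest to the hand-typed name, ignoring case."""
    return min(roster, key=lambda cand: levenshtein(name.lower(), cand.lower()))


# Hypothetical roster mapping names on file to GitHub usernames.
ROSTER = {"Zachary Thomas": "thomzi12", "Jane Doe": "jdoe-example"}

typed = "Zachery Tomas"  # hand-typed, with typos
matched = best_match(typed, ROSTER)
print(matched, "->", ROSTER[matched])
```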

Glossary

• Threading – Used in Python to run multiple I/O-bound tasks concurrently.

• Fuzzy Matching – This project used the Python package FuzzyWuzzy, which is based on Levenshtein distance (when comparing strings A and B, the minimum number of deletions, insertions, or substitutions needed to transform A into B). This was used to match hand-typed names with those on file, which were then matched to a GitHub username for further processing.

• Hierarchical Clustering – Tried out different linkage methods, scaled vs. unscaled data, and different distances. Ended up using ‘complete’ linkage, which looks at the elements furthest from each other when deciding whether to combine two clusters; ‘single’ linkage, which looks at the closest elements between two clusters, gave me chained clusters. Unscaled data gave more sensible results. Tried cosine distance but went with Euclidean. The pvclust package gave me a result with 8 clusters, some of which were sensible, though many were small.

• Non-metric Multidimensional Scaling – the algorithm gives a projection that minimizes stress: the square root of the ratio of the sum of squared differences between the (monotonically transformed) input distances and the configuration distances to the sum of squared configuration distances.
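
Written out, the stress described above is commonly given in Kruskal's stress-1 form. The symbols are assumptions introduced here: d_ij is the distance between points i and j in the low-dimensional configuration, and \hat{d}_ij is the monotonically transformed input distance (the "disparity").

```latex
\text{Stress} = \sqrt{\frac{\sum_{i<j} \left( d_{ij} - \hat{d}_{ij} \right)^2}{\sum_{i<j} d_{ij}^2}}
```

A stress of 0 means the configuration reproduces the rank order of the input distances perfectly; larger values mean more distortion.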

Glossary – DBSCAN

• DBSCAN – Short for “Density-Based Spatial Clustering of Applications with Noise”. It works by looking for dense regions (points close together) that are denser than the surrounding noise. (Unlike K-means, which takes a centroid approach that assigns all points to a cluster.)

• DBSCAN Algorithm:

1. Compute distances between x_i and all other points. Find all neighbor points within distance eps of the point. Each point with a neighbor count greater than or equal to MinPts is marked as a core point or visited.

2. For each core point, assign it to a new cluster if it hasn’t been assigned already. Recursively find all its density-connected points and assign them to the same cluster as the core point.

3. Iterate through all points. Points that do not belong to a cluster are treated as noise.

• DBSCAN (tuning) parameters epsilon (eps) and minimum points (MinPts) – eps is determined using a KNN distance plot and looking for the knee point. MinPts is chosen by the user. Smaller values will produce more clusters, larger values fewer.
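
The three steps above can be sketched in a few lines of plain Python. This is an illustrative, brute-force version (not the implementation used in the project), assuming a point counts as its own neighbour; the toy `points` are hypothetical, with eps = 1.5 borrowed from the DBSCAN slide.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks unclustered points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is also a core point
                queue.extend(j_neighbours)
    return labels


# Hypothetical 2-D data: two dense squares and one isolated point.
points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [5, 5], [5, 6], [6, 5], [6, 6],
          [10, 0]]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```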

Glossary – OPTICS

• OPTICS – Short for “Ordering Points To Identify the Clustering Structure”. A generalization of DBSCAN that addresses that algorithm’s weakness in detecting clusters of differing densities.

• Reachability Plot – Created by the OPTICS algorithm. Valleys in the plot represent points in clusters, since their reachability distances are low. (The reachability distance of a point from a core point is the larger of the core point’s core distance, i.e. the distance to its MinPts-th nearest neighbor, and the distance between the two points.) Points are ordered so that they sit next to their neighbors. Clusters are identified, as in hierarchical clustering, by ’cutting’ the reachability plot.
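
To make the reachability idea concrete, here is a small sketch of the two quantities OPTICS plots. It assumes an unlimited search radius (so every point has a core distance) and counts a point as its own neighbour; the function names and toy data are hypothetical.

```python
import math


def core_distance(points, o, min_pts):
    """Distance from point o to its min_pts-th nearest point, counting o itself."""
    d = sorted(math.dist(points[o], q) for q in points)
    return d[min_pts - 1]


def reachability_distance(points, p, o, min_pts):
    """Reachability of p from o: never smaller than o's core distance."""
    return max(core_distance(points, o, min_pts), math.dist(points[o], points[p]))


# Hypothetical 2-D data: a tight group of three and one far-away point.
points = [[0, 0], [0, 1], [1, 0], [5, 5]]
near = reachability_distance(points, p=1, o=0, min_pts=3)
far = reachability_distance(points, p=3, o=0, min_pts=3)
print(round(near, 3), round(far, 3))
```

Points inside a dense group get small values (the valleys of the plot), while a point far from any group gets a large one (a peak).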
