Could a data science program use data science insights?
Mining the Unseen: Educational Data!
GALVANIZE CAPSTONE PROJECT
ZACHARY THOMAS
[email protected] | LINKEDIN: THOMASZI | GITHUB: THOMZI12 | KAGGLE: ZTHOMAS
11/30/2016
Could a data science program use data science insights?
How can we extract meaning from unstructured data?
Data Sources & Tools
Data Science Program Student Data
Assessment Data
Experience & Background
Daily Exercise Data
Unsupervised Learning
Data Collection
Student pairs (hand-typed)
Student assignments (semi-unstructured)
Assessment data (structured)
+ creativity!
Tabular data, ready for analysis
What next?
Clustering Algorithms
K-means clustering
Hierarchical Clustering
Multi-dimensional Scaling
DBSCAN
OPTICS
Multi-Dimensional Scaling
First we obtain distances, then use MDS to visualize points and
potential clusters.
26 dimensions → 2 dimensions
Students furthest apart: 59 and 75 (10.386)
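The "obtain distances first" step can be sketched as below. This is a minimal illustration, assuming Euclidean distance and a made-up set of short feature vectors (the project used 26 dimensions); the function names are hypothetical and the MDS projection itself is not shown.

```python
import math
from itertools import combinations


def pairwise_distances(rows):
    """Return {(i, j): Euclidean distance} for every pair of rows with i < j."""
    return {
        (i, j): math.dist(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


def farthest_pair(rows):
    """Return the index pair with the largest distance, and that distance."""
    distances = pairwise_distances(rows)
    pair = max(distances, key=distances.get)
    return pair, distances[pair]


# Made-up student feature vectors, for illustration only.
students = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [3.0, 4.0, 1.0],
    [0.5, 0.5, 1.0],
]
pair, gap = farthest_pair(students)
print(pair, round(gap, 3))
```

The same distance table is what an MDS routine would take as input before projecting down to two dimensions.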
K-Means Clustering
Cluster  Percent of Data  Characteristics
1        23%              Above avg. experience, below avg. scores
2        62%              Avg. experience, above avg. scores
3        15%              Below avg. experience, good at coding
An inertia plot + human intuition help us determine the right number of clusters.
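A rough sketch of where the numbers in an inertia plot come from: run K-means for several values of k and record the within-cluster sum of squared distances. This is a plain-Python illustration on made-up 2-D points, not the project's code; the farthest-point initialization is an assumption chosen to keep the example deterministic.

```python
import math


def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far (deterministic, for illustration)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the final inertia
    (sum of squared distances from each point to its assigned center)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Three made-up, well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (0, 10), (0, 11), (1, 10), (1, 11),
]
inertias = {k: kmeans_inertia(points, k) for k in (1, 2, 3, 4)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On data like this the inertia drops sharply up to k = 3 and then flattens, which is the "elbow" that gets combined with human judgment.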
Action Steps
Specialized Program Offerings
Scoping out entry-level positions
DS management training
Targeted Marketing
Insights Dashboard
Next Steps for this Project
• Build cluster classifier for incoming students
• Obtain feedback from end users of data
Appendix
Dataset Insights
Cohort  Correlation between Initial and Final Student Rank
sf_15 0.435
Sf_16 0.805
s_sept 0.55
den 0.17s 0.7
*Done by analyzing rank according to first (1) and last (5) assessment scores
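One way to compute a correlation between initial and final rank is to rank students on each assessment and take the Pearson correlation of the two rank lists (a Spearman-style correlation). The slides do not say which formula was used, so treat this as an assumption; the scores below are made up.

```python
import math


def ranks(scores):
    """Rank scores from best (1) to worst; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rank_correlation(first_scores, last_scores):
    """Correlation between student rank on the first and last assessment."""
    return pearson(ranks(first_scores), ranks(last_scores))


# Made-up scores for four students on assessment 1 and assessment 5.
first = [90, 80, 70, 60]
last_same_order = [95, 85, 75, 65]
last_shuffled = [70, 95, 60, 80]
print(rank_correlation(first, last_same_order))
print(rank_correlation(first, last_shuffled))
```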
Summary Visualizations
Relative performance, math and coding: students in the orange box are relatively better at math than coding; students in the green box are the opposite.
Student background: engineering, mathematics, and economics are the three most common academic backgrounds of students.
Work/Grad School Experience
Median: 6 years | 1st quartile: 3 years | 3rd quartile: 9 years
`Pvclust` Dendrogram
• Package used: pvclust
• Distance: Euclidean
• Linkage: complete
• Clusters: 8
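To make the linkage setting concrete, here is a small sketch of how the distance between two clusters differs under complete and single linkage with Euclidean distance. It is a plain-Python illustration, not the pvclust (R) implementation; the function name and the points are made up.

```python
import math
from itertools import product


def cluster_distance(cluster_a, cluster_b, linkage="complete"):
    """Distance between two clusters of points under the given linkage.

    complete: distance between the two farthest members
    single:   distance between the two closest members
    """
    pair_distances = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if linkage == "complete":
        return max(pair_distances)
    if linkage == "single":
        return min(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


# Two made-up clusters on a line.
left = [(0, 0), (1, 0)]
right = [(3, 0), (6, 0)]
print(cluster_distance(left, right, "single"))
print(cluster_distance(left, right, "complete"))
```

Because single linkage only needs one close pair to merge clusters, it tends to chain clusters together, while complete linkage requires all members to be reasonably close.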
Hierarchical Clustering Bar plots
DBSCAN
Need to tune DBSCAN better?
• KNN plot on data after scaling into two dimensions
• Clear knee at eps = 1.5
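The values behind a KNN distance plot can be sketched as follows: for every point, take the distance to its k-th nearest neighbor, then sort those distances. A sudden jump in the sorted list is the knee that suggests a value for eps. The points and function name are made up for illustration.

```python
import math


def k_distances(points, k):
    """Sorted list of each point's distance to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


# Three close points and one far-away point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10)]
curve = k_distances(points, k=1)
print([round(d, 2) for d in curve])
```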
OPTICS
Choose epsilon (i.e. reachability, vertical axis) of 1, producing two clusters. (Upper bound of 1.5)
Very reasonable clusters! Used data scaled into two dimensions (above) for model
K=3 K-means Inertia Plot
      [,1]   [,2]   [,3]
[1,]  1.000  0.130  0.468
[2,]  0.130  1.000  0.606
[3,]  0.468  0.606  1.000
Cluster Similarity Matrix, using ‘shadow’ method
K=3 K-means Clustering Bar plot
Data Pipeline
• Motivation: Real-life data is messy, as I found out with this project.
• Data Sources:
  • Handwritten pair assignments for student group assignments
  • Student LinkedIn profiles
  • Student Github profiles (list of usernames)
  • Student assessment data
• Goal: Find the lengths of selected pair programming assignments to use as a proxy for work ethic.
  • Selected 9 assignments that were unusually long or difficult; getting through them completely would require more effort or a quick worker.
• Process:
  • Used a script (repo_scraper_optimized.py) to download a zip file of each student’s Github profile, look for .py or .ipynb files, and record the number of characters in each file, ignoring lines that are comments.
  • Since hitting Github for data is I/O bound, used threading to run multiple requests simultaneously.
  • Parallelized collection (at one point, using AWS) to speed it up even further.
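The process above can be sketched roughly as follows. This is not repo_scraper_optimized.py: the download step is replaced by a hypothetical in-memory lookup (`FAKE_REPOS`, `fetch_source`) so the example runs without network access, and only whole-line `#` comments are ignored.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for downloaded assignment files, keyed by username.
FAKE_REPOS = {
    "student_a": "# header comment\nx = 1\n",
    "student_b": "y = 2\n    # indented comment\nz = 3\n",
}


def count_code_characters(source):
    """Count characters on lines that are not whole-line '#' comments."""
    return sum(
        len(line)
        for line in source.splitlines()
        if not line.lstrip().startswith("#")
    )


def fetch_source(username):
    """Placeholder for the I/O-bound download step (assumed, not the real scraper)."""
    return FAKE_REPOS[username]


def collect_lengths(usernames, max_workers=8):
    """Fetch each student's file in a thread pool and record its code length."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sources = pool.map(fetch_source, usernames)  # results keep input order
        return {
            user: count_code_characters(source)
            for user, source in zip(usernames, sources)
        }


lengths = collect_lengths(["student_a", "student_b"])
print(lengths)
```

Threads help here because each worker spends most of its time waiting on the network rather than computing.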
Data Pipeline (Part 2)
• Process:
  • In another script, used fuzzy matching on the student assignment pairs to determine file attribution. Looked at files with the word “pair” in the name to single out pair assignments.
  • Wrote results to a master spreadsheet with assessment data. The data is now in a format where each row is a different student, with that student’s data in the columns.
• Challenges:
  • Missing values: how to recognize and handle them
  • Finding pair assignments and not individual assignments
  • Pair assignments were not in a format where each row is a unique student (each row is two different students)
• Next Steps:
  • Improve recall: 10-30% of files for an assignment are missing. Granted, those files may not exist.
  • Incorporate data into clustering analysis
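The reshaping challenge (one row per pair to one row per student) could look something like this. The row layout, names, and lengths are made up; the real spreadsheet columns are not shown in the slides.

```python
from collections import defaultdict


def pairs_to_student_rows(pair_rows):
    """Turn (assignment, student_a, student_b, length) rows into
    {student: {assignment: length}}, so each student gets one record."""
    students = defaultdict(dict)
    for assignment, student_a, student_b, length in pair_rows:
        for student in (student_a, student_b):
            students[student][assignment] = length
    return dict(students)


# Made-up pair data: each row covers two students.
pair_rows = [
    ("pair_01", "ana", "ben", 1200),
    ("pair_01", "cy", "dee", 900),
    ("pair_02", "ana", "cy", 1500),
]
by_student = pairs_to_student_rows(pair_rows)
print(by_student["ana"])
print(by_student["ben"].get("pair_02"))  # missing value shows up as None
```

Students with no recorded file for an assignment simply lack that key, which is one way the missing values mentioned above surface.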
Glossary
• Threading – Used in Python to run multiple (I/O-bound) tasks concurrently.
• Fuzzy Matching – This project used the Python package FuzzyWuzzy. It uses Levenshtein distance (when comparing strings A and B, the number of deletions, insertions, or substitutions needed to transform A into B). This was used to match hand-typed names with those on file, which were then matched to a Github username for further processing.
• Hierarchical Clustering – Tried out different linkage methods, scaled and unscaled data, and different kinds of distances. Ended up using ‘complete’ linkage, which looks at the elements furthest away from each other when deciding whether to combine two clusters; ‘single’ linkage, which looks at the closest elements between two clusters, gave me chained clusters. Unscaled data gave more sensible results. Tried out cosine distance, but went with Euclidean. The pvclust package gave me a result with 8 clusters, some of which were sensible, though many were small.
• Non-metric Multidimensional Scaling – The algorithm gives a projection that minimizes stress: the square root of the ratio of the sum of squared differences between the (monotonically transformed) input distances and the configuration distances to the sum of squared configuration distances.
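As an illustration of the Levenshtein distance described in the Fuzzy Matching entry, here is a plain-Python version with a simple closest-name lookup. It does not call FuzzyWuzzy; the helper `best_match` and the names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # delete char_a
                    current[j - 1] + 1,                    # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def best_match(typed_name, names_on_file):
    """Return the name on file with the smallest edit distance to the typed name."""
    return min(
        names_on_file,
        key=lambda name: levenshtein(typed_name.lower(), name.lower()),
    )


print(levenshtein("kitten", "sitting"))
print(best_match("Zach Thomas", ["Thomas Zane", "Zachary Thomas"]))
```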
Glossary – DBSCAN
• DBSCAN – Short for “Density-Based Spatial Clustering of Applications with Noise”. Works by looking for dense regions (points close together) that are separated by sparser regions of noise. (Unlike K-means, which takes a centroid approach that assigns every point to a cluster.)
• DBSCAN Algorithm:
  1. Compute distances between x_i and all other points. Find all neighbor points within distance eps of the point. Each point with a neighbor count greater than or equal to MinPts is marked as a core point, and each examined point is marked as visited.
  2. For each core point, assign it to a new cluster if it hasn’t been assigned already. Recursively find all its density-connected points and assign them to the same cluster as the core point.
  3. Iterate through all points. Points that do not belong to a cluster are treated as noise.
• DBSCAN (tuning) parameters epsilon (eps) and minimum points (MinPts) – eps is determined using a KNN distance plot and looking for the knee point. MinPts is chosen by the user. Smaller values will produce more clusters, larger values fewer.
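The algorithm steps above can be sketched in plain Python as follows. This is a minimal illustration on made-up 2-D points, not the implementation used in the project; a point counts as its own neighbor, and the eps and MinPts values are chosen only for the example.

```python
import math
from collections import deque

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks unclustered points."""
    labels = [None] * len(points)  # None means not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels


# Two made-up dense groups plus one isolated point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```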
Glossary – OPTICS
• OPTICS – Short for “Ordering Points To Identify the Clustering Structure.” A generalization of DBSCAN that addresses that algorithm’s weakness in detecting clusters of differing densities.
• Reachability Plot – Created by the OPTICS algorithm. Valleys in the plot represent points in clusters, since their reachability distance (roughly, the distance needed to reach the point from a core point already in the ordering) is low. Points are ordered so that they sit next to their neighbors. Clusters are identified by ‘cutting’ the reachability plot, much as hierarchical clustering cuts a dendrogram.
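Cutting a reachability plot at a chosen height can be sketched as below, following the DBSCAN-style extraction step described in the OPTICS paper. The OPTICS ordering itself is not computed here: the reachability and core-distance values are made up and assumed to already be in OPTICS order.

```python
import math

NOISE = -1


def extract_clusters(reachability, core_distance, eps_cut):
    """Label points (given in OPTICS order) by cutting the plot at eps_cut.

    A point whose reachability exceeds the cut starts a new cluster if its own
    core distance is within the cut; otherwise it is noise. Points below the
    cut join the cluster currently being built.
    """
    labels = []
    cluster_id = NOISE
    for reach, core in zip(reachability, core_distance):
        if reach > eps_cut:
            if core <= eps_cut:
                cluster_id += 1  # first point of a new valley
                labels.append(cluster_id)
            else:
                labels.append(NOISE)
        else:
            labels.append(cluster_id)
    return labels


# Made-up values; math.inf stands for an undefined reachability.
reachability = [math.inf, 0.5, 0.4, 2.0, 0.3, 0.6, 3.0]
core_distance = [0.4, 0.4, 0.5, 0.3, 0.3, 0.5, 5.0]
labels = extract_clusters(reachability, core_distance, eps_cut=1.0)
print(labels)
```

Raising or lowering the cut merges or splits valleys, which is why a single OPTICS run can be explored at several epsilon values.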