Using R for Social Media and Sports Analytics
TRANSCRIPT
© 2013 Sqor, Inc.
Sqor
Athletes
Traditional sponsorship, contracts/salary SqorFunding Increasing cash flow for athletes Success Athletes focus on what they do best
Using R For Social Media and Sports Data
Athletes Success Data
Noah Gift: CTO @ Sqor
What is Sqor?
• Social Network hyper-focused on enhancing fan/athlete relationships. We only do Sports!: Now
• Marketplace for athletes to build and market their digital brand: Now
• Social Analytics and Prediction Engine as a Service: Q1 2015
• Micro-endorsement platform: Q1 2015
• Crowdfunding for athletes: Now
• Game platform: First Homegrown game featuring Brett Favre: Now
• Cross-Social Network Publishing Platform: Facebook, Twitter, Embeddable posts: Now
• Website, Android App, and iOS App:
Key Aspects of Data Pipeline
• Multiple languages involved: Python, R, Erlang, C#, SQL and JavaScript.
• Multiple persistence options: SQL Server (RDS), Riak (NoSQL), CSV files, Mnesia (distributed soft real-time DB).
• RabbitMQ and Erlang handle messaging and job communication.
• Easy to debug: daily and nightly scripts, intermediate CSV files, deep storage in a K/V store, and reports that live in RDS.
• R is used exclusively for machine learning and statistics (although recommendation engine v1 was written in Python; we are going to replace it with R/Erlang code).
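The "intermediate CSV files" idea can be sketched roughly as follows: each stage writes its output to a named CSV so that any step can be inspected or re-run on its own. This is only an illustration in Python (one of the pipeline languages listed above). The stage names, the column names, the sample rows, and the "engagement" formula are all hypothetical and do not come from Sqor's pipeline.

```python
import csv
import os
import tempfile


def write_stage(name, rows, fieldnames, out_dir):
    """Write one stage's output to <out_dir>/<name>.csv and return the path."""
    path = os.path.join(out_dir, f"{name}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


def read_stage(path):
    """Read a stage file back as a list of dicts (all values are strings)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def add_engagement(rows):
    """Hypothetical transform: engagement = mentions / followers."""
    scored = []
    for row in rows:
        followers = int(row["followers"])
        mentions = int(row["mentions"])
        engagement = mentions / followers if followers else 0.0
        scored.append({"athlete": row["athlete"], "engagement": engagement})
    return scored


def run_pipeline(out_dir):
    """Run two stages, leaving one CSV per stage behind for debugging."""
    # Made-up sample rows, for illustration only.
    raw = [
        {"athlete": "athlete_a", "followers": 1000, "mentions": 50},
        {"athlete": "athlete_b", "followers": 0, "mentions": 3},
    ]
    raw_path = write_stage(
        "01_raw", raw, ["athlete", "followers", "mentions"], out_dir
    )
    # The second stage reads the first stage's file rather than its
    # in-memory data, so it can be re-run from the CSV alone.
    scored = add_engagement(read_stage(raw_path))
    scored_path = write_stage(
        "02_scored", scored, ["athlete", "engagement"], out_dir
    )
    return raw_path, scored_path


if __name__ == "__main__":
    print(run_pipeline(tempfile.mkdtemp()))
```

Because every stage reads the previous stage's file, a bad nightly run can be diagnosed by opening the CSVs in order and finding the first one that looks wrong.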
Things They Don’t Tell You About Building A Data Pipeline From Scratch (Or You Should Have Paid Attention To)
• Getting the data into the right format and making sure it is accurate is back-breaking work. It truly is horrible.
• Keeping track of model prediction accuracy over time, both with new data and new models, is really important.
• Non-linear regression is non-trivial.
• Automation and debuggability of every step is very important. Think Unix tools.
• Expensive, exotic solutions (weird databases, etc.) sometimes aren’t worth it at first… or maybe ever.
• Making predictions involving real money with limited data is scary and really hard. If you’re not scared about this, you should be.
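One way to keep track of prediction accuracy over time is to log an error metric per model per run and compare the latest run with the one before it. The sketch below is a minimal illustration in Python. The choice of mean absolute error, the class and method names, and the sample numbers are assumptions for the example, not Sqor's actual tooling.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


class AccuracyLog:
    """Hypothetical helper: stores (day, error) pairs per model name."""

    def __init__(self):
        self._history = {}

    def record(self, model, day, actual, predicted):
        """Score one run and append it to the model's history."""
        error = mean_absolute_error(actual, predicted)
        self._history.setdefault(model, []).append((day, error))
        return error

    def history(self, model):
        """Runs for a model, sorted by day (ISO dates sort as strings)."""
        return sorted(self._history.get(model, []))

    def is_degrading(self, model, tolerance=0.0):
        """True if the latest run's error exceeds the previous run's."""
        runs = self.history(model)
        if len(runs) < 2:
            return False
        return runs[-1][1] > runs[-2][1] + tolerance


if __name__ == "__main__":
    log = AccuracyLog()
    # Made-up numbers, for illustration only.
    log.record("model_v1", "2013-01-01", [10, 20], [12, 18])
    log.record("model_v1", "2013-01-02", [10, 20], [15, 14])
    print(log.history("model_v1"), log.is_degrading("model_v1"))
```

Keying the log by model name covers both cases named above: the same model scored on new data, and a new model scored on the same data.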
Predicting Top Athletic Performers in Social Media
• Sqor finds influential athletes and collaborates with them using our prediction algorithms
Our Prediction Algorithms Appear To Work
• Or we got really lucky….
Clustering
• We use R clustering packages for classification, visualization of patterns and diagnostics for predictions
Clustering
• We use kNN clustering for NBA and MLB. We plan on expanding this further in the near future.
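For readers unfamiliar with kNN: in its usual form, k-nearest neighbors labels a new point by majority vote among the k closest labeled points. The slide does not say which R package, features, or labels are used, so the following is only a generic sketch in plain Python. The two features and the "high"/"low" labels are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs.
    """
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the training set size")
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up points: (feature_1, feature_2) -> hypothetical label.
    train = [
        ((1.0, 1.0), "low"),
        ((1.5, 2.0), "low"),
        ((2.0, 1.0), "low"),
        ((8.0, 9.0), "high"),
        ((9.0, 8.0), "high"),
        ((8.5, 9.5), "high"),
    ]
    print(knn_predict(train, (8.0, 8.0), k=3))
    print(knn_predict(train, (1.0, 2.0), k=3))
```

With two labels, an odd k avoids tied votes. On real data the features would normally be scaled first so that no single feature dominates the distance.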
Erlang/R Bridge
• Sqor is a heavy user of Erlang
• We like Erlang because it has unique concurrency abilities and high uptime (and also because I had a lot of bosses who told me I couldn’t use it).
• ➜ ~ curl -v -X PUT -H 'content-type: application/json' http://127.0.0.1:8080/api/script/foo -d '{"script":"execute <- function (A) { A * 2 }", "docs":"this doubles stuff"}'
• ➜ ~ curl -v http://127.0.0.1:8080/api/script/foo -X POST -H 'content-type: application/json' -d '[25]'
• Returns: [50.0]
• We plan on open sourcing this in the next 2 months: it runs scripts, runs jobs, and scales R.
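The two curl calls can be mirrored from a script using only the Python standard library. This is a sketch of a client for the endpoint shown on the slide, not part of the bridge itself. The helper names are made up, and the only assumptions about the service are what the curl examples show: a PUT with a JSON body containing "script" and "docs" registers a script, and a POST with a JSON array of arguments runs it.

```python
import json
import urllib.request

# Base URL taken from the curl examples on the slide.
BASE_URL = "http://127.0.0.1:8080/api/script"
JSON_HEADERS = {"Content-Type": "application/json"}


def build_register_request(name, script, docs, base_url=BASE_URL):
    """Build the PUT request that uploads an R script under `name`."""
    body = json.dumps({"script": script, "docs": docs}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{name}", data=body, headers=JSON_HEADERS, method="PUT"
    )


def build_call_request(name, args, base_url=BASE_URL):
    """Build the POST request that runs script `name` with `args`."""
    body = json.dumps(list(args)).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{name}", data=body, headers=JSON_HEADERS, method="POST"
    )


def send(request, timeout=5):
    """Send a request and return the raw response text.

    The response format of the PUT call is not shown on the slide, so the
    body is returned undecoded for the caller to interpret.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires the bridge to be listening on 127.0.0.1:8080.
    send(build_register_request(
        "foo", "execute <- function (A) { A * 2 }", "this doubles stuff"
    ))
    print(send(build_call_request("foo", [25])))
```

Building each request separately from sending it lets the method, URL, and payload be checked without a running server.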