Open Source Software for Data Scientists -- BigConf 2014
Altamira Technologies Corporation 2014
Agenda
■ What is a Data Scientist? ■ Why use Open Source Software? ■ Survey of Open Source Software Tools:
¤ Statistical Analysis ¤ Data Mining ¤ Machine Learning ¤ Natural Language Processing ¤ Social Network Analysis ¤ Data Visualization
About me: @greenbacker Theories: popular tripe Methods: sloppy Conclusions: highly questionable
photo: Columbia Pictures
http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
Paul Cooper, ITProPortal.com
“A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”
Computer Programming
Mathematics & Analytic Methodology
Distributed Computing & Big Data
Data Science
¤ Statistical Analysis ¤ Data Mining ¤ Machine Learning ¤ Natural Language Processing ¤ Social Network Analysis ¤ Data Visualization
Domain Knowledge & Communication Skills
etc.
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
"IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT."
Statistical Analysis
■ Name: R ■ Creator: Gentleman, Ihaka, et al. ■ License: GPL Version 2 ■ Website: r-project.org ■ Source: cran.us.r-project.org/src/base/ ■ Features:
¤ Language & environment for statistical computing & viz ¤ Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more… ¤ 5000+ packages available in CRAN repository
Data Mining
■ Name: Pandas ■ Creator: Wes McKinney, et al. ■ License: BSD 3-Clause License ■ Website: pandas.pydata.org ■ Source: github.com/pydata/pandas ■ Features:
¤ Data analysis workflow in Python ¤ DataFrame object for fast manipulation & indexing ¤ Tools for reading & writing data between formats ¤ Label-based slicing, indexing, and subsetting of data
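The DataFrame features listed above can be illustrated with a minimal sketch. The city data below is hypothetical and exists only for this example; the calls assume a reasonably recent pandas release.

```python
import io

import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame(
    {
        "city": ["Austin", "Boston", "Chicago", "Denver"],
        "state": ["TX", "MA", "IL", "CO"],
        "population_k": [885, 646, 2719, 649],
    }
).set_index("city")

# Label-based slicing: .loc includes both endpoints of the label range.
subset = df.loc["Boston":"Chicago", ["population_k"]]

# Boolean subsetting: keep only rows that satisfy a condition.
large = df[df["population_k"] > 800]

# Reading and writing between formats: round-trip through CSV text.
csv_text = df.to_csv()
restored = pd.read_csv(io.StringIO(csv_text), index_col="city")
```

Here `subset` holds the Boston and Chicago rows, `large` holds the cities above the threshold, and `restored` is the same table rebuilt from its CSV form.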
Data Mining
■ Name: Impala ■ Creator: Cloudera ■ License: Apache License 2.0 ■ Website: impala.io ■ Source: github.com/cloudera/impala ■ Features:
¤ MPP query engine implemented on Hadoop ¤ Low latency, high concurrency SQL & BI queries ¤ Same interfaces as Apache Hive, but ~24x faster ¤ Written in C++; does not use MapReduce
Machine Learning
■ Name: Mahout ■ Creator: ASF ■ License: Apache License 2.0 ■ Website: mahout.apache.org ■ Source: svn.apache.org/viewvc/mahout ■ Features:
¤ Distributed/scalable ML library for Hadoop ¤ Classification, Clustering, Collaborative filtering ¤ Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
Machine Learning
■ Name: Scikit-learn ■ Creator: Cournapeau, et al. ■ License: BSD 3-Clause License ■ Website: scikit-learn.org ■ Source: github.com/scikit-learn/scikit-learn ■ Features:
¤ ML library for Python built on NumPy, SciPy, matplotlib ¤ Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing ¤ SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
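As a rough sketch of how preprocessing, classification, and cross-validation fit together, the example below scores a k-NN classifier on the iris dataset that ships with scikit-learn. The dataset and the choice of 5 neighbors and 5 folds are assumptions made for illustration, and the import paths assume a recent scikit-learn release (older versions kept cross-validation in a different module).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Preprocessing and classification chained into one estimator, so the
# scaler is refit on the training part of every fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validation; each entry is the accuracy on one held-out fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Wrapping the scaler in the pipeline keeps held-out data from leaking into the preprocessing step, which is the main reason to prefer it over scaling the whole dataset up front.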
Machine Learning + NLP
■ Name: Mallet ■ Creator: UMass (McCallum, et al.) ■ License: Common Public License 1.0 ■ Website: mallet.cs.umass.edu ■ Source: hg-iesl.cs.umass.edu/hg/mallet ■ Features:
¤ Java-based “Machine Learning for Language Toolkit” ¤ Document classification, clustering, topic modeling, information extraction & sequence tagging, etc. ¤ Efficient implementation of LDA for topic modeling
Natural Language Processing
■ Name: NLTK ■ Creator: Bird, Loper, et al. ■ License: Apache License 2.0 ■ Website: nltk.org ■ Source: github.com/nltk/nltk ■ Features:
¤ Natural Language Toolkit for Python ¤ Built-in support for dozens of corpora & trained models ¤ Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
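Tokenization and stemming, two of the features listed above, can be sketched in a few lines. The sentence is made up for this example. The sketch uses `wordpunct_tokenize` and `PorterStemmer` because neither needs a separate corpus or model download; other NLTK tokenizers and stemmers would work similarly.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical example sentence.
text = "Data scientists are mining and visualizing networks."

# Regex-based tokenizer: splits on whitespace and separates punctuation.
tokens = wordpunct_tokenize(text)

# Porter stemming strips common English suffixes from each token.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
```

The stems are not guaranteed to be dictionary words; stemming only normalizes related word forms to a shared prefix so they can be counted together.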
Natural Language Processing
■ Name: Stanford CoreNLP ■ Creator: Stanford NLP Group ■ License: GPL Version 2 ■ Website: nlp.stanford.edu/software/corenlp.shtml ■ Source: github.com/stanfordnlp/CoreNLP ■ Features:
¤ Suite of high-quality, Java-based NLP tools ¤ Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc. ¤ Includes models for English, Chinese, Arabic, German
NLP + Geospatial Analysis
■ Name: CLAVIN ■ Creator: Berico Technologies ■ License: Apache License 2.0 ■ Website: clavin.io ■ Source: github.com/Berico-Technologies/CLAVIN ■ Features:
¤ Extracts location names from text, resolves to gazetteer ¤ Employs context-based geospatial entity resolution ¤ ~75% accuracy, processes 1M documents per hour ¤ Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Social Network Analysis
■ Name: Gephi ■ Creator: UTC France ■ License: GPL Version 3 ■ Website: gephi.org ■ Source: github.com/gephi/gephi ■ Features:
¤ Network analysis and visualization package for Java ¤ Dynamic network analysis with temporal filtering ¤ Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
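To give a feel for one of the metrics named above, here is a minimal pure-Python sketch of PageRank by power iteration. It is not Gephi's implementation or API (Gephi is a Java application); the function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-node graph are all assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps each node to a list of the nodes it links to; every
    link target must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node passes its rank in equal shares along its out-links.
        for v in nodes:
            for w in graph[v]:
                new_rank[w] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


# Hypothetical directed graph: "c" receives links from a, b, and d.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
top = max(ranks, key=ranks.get)
```

The ranks sum to 1, and the node with the most incoming weight (`"c"` in this toy graph) ends up with the highest score.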
Data Visualization
■ Name: D3.js ■ Creator: Mike Bostock ■ License: BSD 3-Clause License ■ Website: d3js.org ■ Source: github.com/mbostock/d3 ■ Features:
¤ JavaScript library based on HTML, SVG, and CSS ¤ Binds data to DOM & enables transformations ¤ ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
Fusion, Analysis, and Visualization
■ Name: Lumify ■ Creator: Altamira ■ License: Apache License 2.0 ■ Website: lumify.io ■ Source: github.com/altamiracorp/lumify ■ Features:
¤ Built on Hadoop, Storm, Accumulo, Elasticsearch, etc. ¤ Integrates structured data, text, images, video ¤ Cell-level security & access controls ¤ Live, shared collaborative workspaces
Final Thought…
Save your $$$ for: ■ People ¤ salaries, training, etc. ■ Resources ¤ hardware, AWS, etc. ■ Proprietary software ¤ if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
Springer’s FINAL THOUGHT