Data Science Toolchain
presented by Jie-Han Chen
slide: https://goo.gl/1hXBGk
Language & Software
- Python
- R
- Java
- Matlab
- Octave
- Jupyter Notebook
Python
- Open Source Community
- Package
- Web Service
- Good Readability
- Machine Learning
R
- Open Source Community
- Built-in Statistics Package
- Standalone computing & data analysis
- Slower than Python
Java
- High Performance
- Big Data
- Poor Visualization, Modeling
Matlab & Octave
- Powerful built-in math functions
- Simple Data Visualization tool
- Prototyping
Jupyter Notebook
- Supports 40+ programming languages, e.g. Python, R, Scala...
- Excellent for sharing your experiments
- Markdown, LaTeX
- example1, example2
Data Science Roadmap
Data Science Toolchains
- Data Collection
- Data Visualization
- Data Storage
- Algorithm & Modeling
Data Collection
- Using API: Facebook, Wikipedia
- Web Scraper
Web Scraper
- HTTP request + HTML Parser
HTTP: python-requests
- Better than built-in urllib
- Sessions with Cookie Persistence
- Thread-safety
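A minimal sketch of the session feature, assuming the `requests` package is installed. The URL, the `User-Agent` string, and the helper name `build_search_request` are placeholders made up for illustration. The request is only prepared, not sent, so the snippet runs offline.

```python
import requests


def build_search_request(session, query):
    """Prepare (but do not send) a GET request through a shared session."""
    # Hypothetical endpoint, used only for illustration.
    request = requests.Request(
        "GET", "https://example.com/search", params={"q": query}
    )
    # prepare_request merges session-level headers and cookies into the request.
    return session.prepare_request(request)


session = requests.Session()
session.headers.update({"User-Agent": "toolchain-demo/0.1"})

prepared = build_search_request(session, "python")
print(prepared.method, prepared.url)
# response = session.send(prepared, timeout=10)  # uncomment to hit the network
```

Reusing one `Session` object keeps cookies and default headers across requests, which is the usual reason to prefer it over separate `requests.get` calls when scraping.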
Parser!
Regular Expression?
BeautifulSoup
- HTML/XML parser
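A small sketch of parsing with BeautifulSoup instead of regular expressions, assuming the `beautifulsoup4` package is installed. The HTML snippet and the class names in it are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded page (made-up content).
html = """
<html><body>
  <div class="post"><a href="/post/1">First post</a></div>
  <div class="post"><a href="/post/2">Second post</a></div>
  <div class="ad"><a href="/ad">Sponsored</a></div>
</body></html>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# Collect (title, href) pairs from links inside <div class="post"> only.
posts = []
for div in soup.find_all("div", class_="post"):
    link = div.find("a")
    posts.append((link.get_text(strip=True), link["href"]))

print(posts)
```

Selecting by tag and class skips the sponsored link without any pattern matching on raw text.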
BeautifulSoup
Ptt
More Powerful Tool?
Scrapy
An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
Scrapy
$ scrapy startproject tutorial
path: /scrapy/dmoz.py
crawler name: dmoz
Scrapy
$ scrapy crawl dmoz
Scrapy
youtube.com/robots.txt
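A site's robots.txt states which paths crawlers may visit. The sketch below checks URLs against such rules with the standard library's `urllib.robotparser`. The rules, the `demo-bot` user agent, and the example.com URLs are made up; they are not the contents of youtube.com/robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules. A real crawler would instead call
# parser.set_url("<site>/robots.txt") followed by parser.read().
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# "demo-bot" is a hypothetical user-agent name.
allowed = parser.can_fetch("demo-bot", "https://example.com/index.html")
blocked = parser.can_fetch("demo-bot", "https://example.com/private/data.html")
print(allowed, blocked)
```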
"I believe that visualization is one of the most
powerful means of achieving personal goals."
Harvey Mackay
Data Visualization
Data Visualization
- Matplotlib, ggplot2
- D3.js
- Bokeh
- Tableau
- PlotDB
- Leaflet
D3.js
- Data Visualization Project
- Interactive
- Web frontend
- example1, example2
Bokeh
- Python, R, Scala, Julia
- Interactive
- Jupyter Notebook
Tableau
code Data Source
Data Visualization
Code
Using GeoJSON with Leaflet
Configurable
Using GeoJSON with Leaflet
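A sketch of producing GeoJSON on the Python side so that a Leaflet page can load it. The helper name `point_feature`, the place names, and the coordinates are illustrative assumptions.

```python
import json


def point_feature(name, lon, lat):
    """Build a GeoJSON Feature for a single point.

    GeoJSON stores coordinates as [longitude, latitude], the reverse of the
    [lat, lng] order used by Leaflet's own marker calls.
    """
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name},
    }


# Example coordinates are illustrative only.
collection = {
    "type": "FeatureCollection",
    "features": [
        point_feature("Taipei", 121.5654, 25.0330),
        point_feature("Tainan", 120.2270, 22.9999),
    ],
}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

On the web page, the resulting object can be handed to Leaflet's GeoJSON layer (`L.geoJSON`) to draw the points on the map.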
S3
1. Key-value
2. Permission
3. Data Visualization
4. Big Data (Spark)
Algorithm & Modeling
- python-numpy + python-pandas + scikit-learn
- libsvm
- Spark MLlib
- Weka
- Deep Learning
Numpy + Pandas + Scikit-learn
Numpy - data structure
ndarray (n-dim array)
- ndim
- size
- shape
- dtype
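A short sketch of the four ndarray attributes, assuming NumPy is installed. The array values are arbitrary.

```python
import numpy as np

# A 2 x 3 array of 64-bit floats.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

print(a.ndim)   # number of dimensions (axes)
print(a.size)   # total number of elements
print(a.shape)  # length of each axis, as a tuple
print(a.dtype)  # element type shared by every entry
```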
Numpy
generate matrix
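A few common ways to generate matrices with NumPy, shown as a sketch. The shapes and the random seed are arbitrary choices.

```python
import numpy as np

zeros = np.zeros((2, 3))            # 2 x 3 matrix of 0.0
ones = np.ones((2, 3))              # 2 x 3 matrix of 1.0
identity = np.eye(3)                # 3 x 3 identity matrix
grid = np.arange(6).reshape(2, 3)   # 0..5 laid out row by row
points = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced values from 0 to 1

# Seeded generator so the random matrix is reproducible.
rng = np.random.default_rng(seed=0)
noise = rng.random((2, 3))          # uniform samples in [0, 1)

print(grid)
print(points)
```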
Pandas: built on numpy; data structures: Series, DataFrame
Input formats: csv, json ...; missing values: nan
DataFrame - Series
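A sketch of how Series and DataFrame relate, and of how missing values show up as NaN when loading CSV, assuming pandas and NumPy are installed. The CSV text and column names are made up.

```python
import io

import numpy as np
import pandas as pd

# A Series is a labeled 1-D array; it can hold NaN for missing entries.
s = pd.Series([1.5, np.nan, 3.0], index=["a", "b", "c"])

# A DataFrame is a table whose columns are Series; here it is read from CSV text.
csv_text = "name,score\nalice,90\nbob,\ncarol,75\n"
df = pd.read_csv(io.StringIO(csv_text))

# The empty cell is loaded as NaN; count it, then fill it with the column mean.
missing = int(df["score"].isna().sum())
filled = df["score"].fillna(df["score"].mean())

print(type(df["score"]).__name__, missing, filled.tolist())
```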
Pandas - operation
- Merge
- Grouping
- Reshaping
- ...
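The three operations above can be sketched on two small tables, assuming pandas is installed. The table contents and column names are invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item": ["book", "pen", "book", "pen"],
    "amount": [12.0, 2.0, 15.0, 3.0],
})
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "city": ["Taipei", "Tainan", "Taipei"],
})

# Merge: SQL-style join on a shared key column.
merged = orders.merge(users, on="user_id", how="left")

# Grouping: total amount per city.
per_city = merged.groupby("city")["amount"].sum()

# Reshaping: one row per city, one column per item.
table = merged.pivot_table(
    index="city", columns="item", values="amount", aggfunc="sum", fill_value=0
)

print(per_city)
print(table)
```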
- Dataset
- Feature Engineering
- Modeling
- Evaluation
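One possible walk through these four stages with scikit-learn, as a sketch. The iris dataset, the scaler standing in for feature engineering, the logistic regression model, and the 70/30 split are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: 150 iris samples with 4 numeric features each.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Feature engineering + modeling: scale the features, then fit a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Evaluation: accuracy on the held-out samples.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training split only, which avoids leaking test data into the features.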
LIBSVM
- C
- Easy to use
- Supports many programming languages
Dataset
LIBSVM - install
$ git clone
$ make
LIBSVM - workflow
LIBSVM - data format
<label> <index>:<value> <index>:<value> ...
label: class label; index: attribute index; value: attribute value
LIBSVM - data format
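A sketch of writing and reading one line of the LIBSVM text format in plain Python. The helper names `format_libsvm_line` and `parse_libsvm_line` are hypothetical and not part of LIBSVM. The format is sparse: indices start at 1 and zero-valued attributes are normally left out.

```python
def format_libsvm_line(label, features):
    """Format one sample; `features` is a dense list, zeros are skipped."""
    pairs = [
        f"{index}:{value:g}"
        for index, value in enumerate(features, start=1)  # indices are 1-based
        if value != 0
    ]
    return " ".join([str(label)] + pairs)


def parse_libsvm_line(line):
    """Parse one line back into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


line = format_libsvm_line(1, [0.5, 0, 2.0])
print(line)
print(parse_libsvm_line(line))
```

For real datasets, scikit-learn's `load_svmlight_file` reads the same format directly into a sparse matrix.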
MLlib, Hadoop
Java, Scala, R, Python
MLlib
- Classification: logistic regression, naive Bayes, ...
- Regression: generalized linear regression, survival regression, ...
- Decision trees, random forests, and gradient-boosted trees
- Recommendation: alternating least squares (ALS)
- Clustering: K-means, Gaussian mixtures (GMMs), ...
- Topic modeling: latent Dirichlet allocation (LDA)
- Frequent itemsets, association rules, and sequential pattern mining
MLlib
- Feature transformations: standardization, normalization, hashing, ...
- ML Pipeline construction
- Model evaluation and hyper-parameter tuning
- ML persistence: saving and loading models and Pipelines
Weka
- Java library
- Big Data
- Support GUI
Deep Learning
- Theano
- Pylearn2
- Keras
- Tensorflow
- Caffe
- Deeplearning4J
- ...
Theano
A CPU and GPU Math Compiler in Python
- Based on Numpy
- Implemented by Cython
- Dynamic C code generation
- GPU & CUDA
- tensor, math expression
Theano tutorial: http://www.slideshare.net/SergiiGavrylov/theano-tutorial
Keras
- Theano, Tensorflow
- Supports GPU
prototype
High-level neural networks library
Homework Github repo Data science
Database, Social Network Analytics, ML library, Deep Learning Platform ...
README.md: Repo Demo Code
email: [email protected]
Google https://goo.gl/forms/PQPz8u2glyunQvfM2