CIS 602-01: Scalable Data Analysis
Data Analysis Tools
Dr. David Koop
D. Koop, CIS 602-01, Fall 2017
Analysis is a cycle
• Start with basic computations, analyze results, and ask new questions
[Diagram: the analysis cycle connecting Data, Specification, Computation, Data Products, Perception & Cognition, Knowledge, and Exploration]
Types of EDA
• Univariate (one attribute) vs. multivariate (2+ attributes)
• Non-graphical vs. graphical
 - Non-graphical ~ statistics
 - Graphical ~ visualizations
• All are important!
Univariate Non-Graphical: Word Counts
• "This is the house that Jack built. This is the malt that lay in the house that Jack built. This is the rat that ate the malt That lay in the house that Jack built."

Word    Count   Frequency
the     6       0.171
that    5       0.143
This    3       0.086
is      3       0.086
house   3       0.086
Jack    3       0.086
built.  3       0.086
malt    2       0.057
lay     2       0.057
in      2       0.057
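These counts can be reproduced in a few lines of Python; this sketch tokenizes by whitespace and keeps the case and punctuation of the original tokens, matching the table (e.g., "built." retains its period):

```python
from collections import Counter

text = ("This is the house that Jack built. This is the malt that lay in the "
        "house that Jack built. This is the rat that ate the malt That lay in "
        "the house that Jack built.")

# Whitespace tokenization: case-sensitive, punctuation attached
words = text.split()
counts = Counter(words)
total = len(words)  # 35 tokens

# Count and relative frequency for the most common words
for word, count in counts.most_common(5):
    print(word, count, round(count / total, 3))
# the 6 0.171
# that 5 0.143
# ...
```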
Univariate Graphical: Histograms
["The reddit Front Page is Not a Meritocracy", T. W. Schneider]
Observed ranks of posts by subreddit
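As an illustration of how such a histogram is built, here is a minimal matplotlib sketch over synthetic rank data (the values are invented, not the reddit dataset from the figure):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for "observed rank" values
rng = np.random.default_rng(0)
ranks = rng.integers(1, 26, size=1000)

# One bar per bin; the counts come back alongside the bin edges
counts, edges, _ = plt.hist(ranks, bins=25)
plt.xlabel("Observed rank")
plt.ylabel("Number of posts")
plt.savefig("rank_histogram.png")
```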
Multivariate Non-Graphical: Crosstabs
• Count groups and subgroups
• At least two different attributes
• Can subdivide vertically and horizontally for more subgroups
• Sometimes totals are useful
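A crosstab is a single call in pandas; a small sketch with made-up car data (the attribute names and values are invented for illustration):

```python
import pandas as pd

# Toy data: two categorical-style attributes per record
df = pd.DataFrame({
    "origin":    ["US", "US", "Europe", "Japan", "US", "Japan"],
    "cylinders": [8, 4, 4, 4, 8, 6],
})

# Rows = one attribute, columns = another; margins=True adds totals
table = pd.crosstab(df["origin"], df["cylinders"], margins=True)
print(table)
```

Swapping in a list of columns for either argument subdivides the table further, matching the "subdivide vertically and horizontally" idea above.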
Multivariate Graphical: Parallel Coordinates
[Figure: parallel coordinates plot of the cars dataset, with one axis per attribute: economy (mpg), cylinders, displacement (cc), power (hp), weight (lb), 0-60 mph (s), and year; M. Bostock]
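pandas ships a simple parallel-coordinates helper; a minimal sketch over a tiny hand-made stand-in for the cars dataset (invented values, not Bostock's D3 version or the real data):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Tiny hand-made stand-in for the cars dataset
df = pd.DataFrame({
    "economy (mpg)": [18, 32, 24, 15],
    "cylinders":     [8, 4, 6, 8],
    "power (hp)":    [150, 70, 95, 200],
    "origin":        ["US", "Japan", "Europe", "US"],
})

# Each row becomes one polyline crossing a vertical axis per attribute
ax = parallel_coordinates(df, class_column="origin")
plt.savefig("parallel_coords.png")
```

Unlike Bostock's version, this helper does not normalize each axis, so attributes on very different scales may need rescaling first.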
Progressive Visualization
Blocking vs. Progressive Visualization
…the currently applied brush, will force all other visualizations (i.e., v1, v3, and v4) to recompute. Depending on the visualization condition, v1, v3, and v4 will either show the results of the new brushing action instantaneously, a loading animation until the full result is available, or incrementally updated progressive visualizations. The system does not perform any form of result caching.

Upon startup, the system loads the selected dataset into memory. The system randomly shuffles the data points in order to avoid artifacts caused by the natural order of the data and to improve convergence of progressive computations [46]. We approximate the instantaneous visualization condition by computing queries over the entire dataset as quickly as possible. Initial microbenchmarks for a variety of queries over the small datasets used in the experiment yielded the following measurements: time to compute a result ≈ 100 ms and time to render a visualization ≈ 30 ms. Although not truly instantaneous, a total delay of only ≈ 130 ms is well below the guideline of one second, therefore allowing us to simulate the instantaneous condition. We simulate the blocking condition by artificially delaying the rendering of a visualization by the number of seconds specified through the dataset-delay factor. In other words, we synthetically prolong the time necessary to compute a result in order to simulate larger datasets. While a blocking computation is ongoing, we display a simple loading animation, but users can still interact with other visualization panels, change the ongoing computation (e.g., selecting a different subset), or replace the ongoing computation with a new query (e.g., changing an axis). Fig. 4 (top) shows an example of a blocking visualization over time.

We implemented the progressive condition by processing the data in chunks of 1,000 data points at a time, with approximate results displayed after each chunk. In total, we refresh the progressive visualization 10 times, independent of the dataset-delay factor. We display the first visualization as quickly as possible, with subsequent updates appropriately delayed so that the final accurate visualization is displayed after the specified dataset-delay condition. Note that the initial min and max estimates might change after seeing additional data, in which case we extend the visualization by adding bins of the same width to accommodate new incoming data. Throughout the incremental computation, we display a progress indication in the bottom left corner of a visualization (Fig. 3d). Progressive visualizations are augmented with error metrics indicating that the current view is only an approximation of the final result. Even though 95 percent confidence intervals based on the standard error have been shown to be problematic to comprehend in certain cases [47], we still opted to use them for bar charts due to their wide usage and familiarity. We render labels with margins of error (e.g., "±3%") in each bin of a 2D-histogram. Fig. 4 (bottom) shows an example of a progressive visualization and how it changes over time. Note that the confidence intervals in the example are rather small, but their size can change significantly based on the dataset and query.

Our system is implemented in C# / Direct2D. We tested our system and ran all sessions on a quad-core 3.60 GHz, 16 GB RAM, Microsoft Windows 10 desktop machine with a 16:9 format, 1,920 × 1,080 pixel display.

3.4 Procedure
We recruited 24 participants from a research university in the US. All participants were students (22 undergraduate, two graduate), all of whom had some experience with data exploration or analysis tools (e.g., Excel, R, Pandas). 21 of the participants were currently enrolled in and halfway through an introductory data science course. Our experiment included visualization condition as a within-subject factor and dataset-delay and dataset-order as between-subject factors. Note that the dataset-delay has no direct influence on the instantaneous condition, but it is used as a between-subject factor to test if it affects the other visualization conditions. To control against ordering and learning effects, we fully randomized the sequence in which we presented the different visualization conditions to the user and counterbalanced across dataset-delay conditions. Instead of fully randomizing the ordering of datasets, we opted to create two predefined dataset-orderings and factored them into our analysis. The two possible dataset-ordering values were 123 and 312 (i.e., DS1 followed by DS2 followed by DS3, and DS3 followed by DS1 followed by DS2, respectively), and we again counterbalanced across dataset-ordering. We seed our system's random number generator differently for each session to account for possible effects in the progressive condition caused by the sequence in which data points are processed. While each participant did not experience all possible combinations of visualization, dataset-ordering, and dataset-delay conditions, all users saw all visualization conditions with one specific dataset-delay and dataset-ordering. For example, participant P14 was assigned dataset-ordering 312, dataset-delay 6 s, and visualization condition ordering [blocking, instantaneous, progressive].

In total, this adds up to four trial sub-groups, each with six participants: (dataset-delay = 6 s & dataset-order = 123), (dataset-delay = 12 s & dataset-order = 123), (dataset-delay = 6 s & dataset-order = 312), and (dataset-delay = 12 s & dataset-order = 312). Within each subgroup, we fully randomized the order in which we presented the different visualization conditions.

After a 15-minute tutorial of the system, including a summary of the visualization conditions and how to interpret confidence intervals and margins of error, we instructed the participants to perform three exploration sessions. In the case of participant P14, these three sessions included (1) blocking visualizations on DS3, (2) instantaneous visualizations on DS1, and (3) progressive visualizations on DS2, all with 6 s
Fig. 4. The blocking and progressive visualization conditions.
[Zgraggen et al., "How Progressive Visualizations Affect Exploratory Analysis", p. 1981]
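The chunked progressive scheme the excerpt describes can be illustrated with a simplified running-mean estimator in Python; this is a sketch of the general idea, not the paper's C#/Direct2D implementation:

```python
import random

def progressive_mean(data, chunk_size=1000):
    """Yield an updated approximate mean after each chunk of data."""
    total, count = 0.0, 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        total += sum(chunk)
        count += len(chunk)
        yield total / count  # current approximation of the final answer

random.seed(0)
data = [random.random() for _ in range(10_000)]
random.shuffle(data)  # shuffling aids convergence, as the paper notes

# 10,000 points in chunks of 1,000 -> 10 progressive refreshes,
# with the last estimate equal to the exact mean
estimates = list(progressive_mean(data))
```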
Experiment Results
delay. Each session was open-ended and we asked the participants to explore the dataset at their own liking and pace. We allotted 12 minutes per session, but participants were free to stop earlier. We used a think-aloud protocol [48] where participants were instructed to report anything they found interesting while exploring the dataset. Throughout each session, we captured both screen and audio recordings and logged low-level interactions (mouse events) as well as higher-level events ("axis changed", "aggregation changed", "textbox brush/filter", "visualization brush," and "visualization updated/stopped/completed"). An experimenter was present throughout the session, and participants were free to ask any technical questions or questions about the meaning of dataset attributes. At the end of the three sessions, we asked participants to give feedback about the tool and whether they had any thoughts regarding the different visualization conditions.

3.5 Statistical Analysis
Our study is designed as a full-factorial 3 (visualization conditions) × 2 (dataset-delay conditions) × 2 (dataset-order conditions) experiment. We applied mixed design analysis of variance tests (ANOVA) with visualization condition as the within-subject factor and dataset-delay and dataset-order as the between-subject factors to assess the effects of our factors on the various metrics we computed. Note that the dataset-delay factor should have no influence on trials where the visualization condition is set to instantaneous.

We tested the assumption of sphericity using Mauchly's test, and we report results with corrected degrees of freedom using Greenhouse-Geisser estimates if violated. We report all significant (i.e., p < 0.05) main and interaction effects of these tests. For significant main effects, we conducted further analysis through Bonferroni-corrected post hoc tests, and for more nuanced interpretation, we opted to include Bayes factors for certain results and report BF10 factors along with corresponding significance labels [49]. Effect sizes are reported through Pearson's r coefficient, and significance levels are encoded in plots using the following notation: * p < 0.05; ** p < 0.01; *** p < 0.001.
4 ANALYSIS OF VERBAL DATA
Inspired by previous insight-based studies [40], [44], [50], we manually coded insights from the audio recordings and screen captures. An insight is a nugget of knowledge extracted from the data, such as "France produces more wines than the US". We followed the verbal data segmentation approach proposed by Liu et al. [40] and adopted their coding process in which the first author did the bulk of the coding, but we iteratively revised finished codes with collaborators to reduce bias. In our case, we decided to count observations that are within the same visualization, have the same semantics, and are on the same level of granularity as one insight. For example: "It looks like country 1 makes the most cars, followed by country 2 and country 3" was coded as one single insight, whereas an observation across two visualizations (through brushing) such as "Country 1 makes the most cars and seems to have the heaviest cars" was counted as two separate insights. We did not categorize insights or assign any quality scores or weights to insights, nor did we quantify accuracy or validity of insights. For our analysis, all insights were treated equally.

4.1 Number of Insights Per Minute
In order to get a quantifiable metric for knowledge discovery, we normalized the total insight count for each session by the duration of the session. Fig. 5 shows this resulting "number of insights per minute" metric across different factors. Our results show that this metric was significantly affected by the type of visualization condition (F(1.452, 29.039) = 6.701, p < 0.01).

Post hoc tests revealed that the instantaneous condition showed a slight increase in number of insights per minute over the progressive condition (1.477 ± 0.568 versus 1.361 ± 0.492, respectively), which was not statistically significant (p = 1.0, r = 0.080). However, the blocking condition reduces the number of insights per minute to 1.069 ± 0.408, which differed significantly from both the progressive (p < 0.05, r = 0.292) and instantaneous (p < 0.001, r = 0.383) conditions. A Bayesian Paired Samples T-Test that tested if measure1 < measure2 revealed strong evidence for an increase in insights per minute from the blocking condition to the progressive condition (BF10 = 13.816), extreme evidence for an increase from blocking to instantaneous (BF10 = 134.561), and moderate evidence for no change or a decrease between instantaneous and progressive (BF10 = 0.130). In summary, blocking visualizations produced the fewest number of insights per minute, whereas the instantaneous and progressive visualizations performed equally well.

4.2 Insight Originality
Similar to [44], we grouped insights that encode the same nugget of knowledge together and computed the originality of an insight as the inverse of the number of times that insight was reported by any participant. That is, insights reported more frequently by participants received lower overall originality scores. A participant received an insight originality score of 1 if he or she had only unique insights (i.e., insights found by no other users). The lower the score,
Fig. 5. Insights per minute: Boxplot (showing median and whiskers at 1.5 interquartile range) and overlaid swarmplot for insights per minute (left) overall, (middle) by dataset-delay, and (right) by dataset-order. Higher values indicate better.
[IEEE Transactions on Visualization and Computer Graphics, Vol. 23, No. 8, August 2017, p. 1982]
Assignment 1
• http://www.cis.umassd.edu/~dkoop/cis602-2017fa/assignment1.html
• Boston Property Assessments
 - Initial exploratory analysis
 - Use a Python Notebook
 - May use pandas
 - Label subproblems and answers
 - Show work (even if it's not your final answer)
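A notebook for this assignment might begin along these lines; the column names and values below are invented placeholders, not the actual Boston assessment schema:

```python
import pandas as pd

# In the notebook you would load the real file, e.g.:
#   df = pd.read_csv("boston_property_assessments.csv")
# Here a tiny made-up frame keeps the sketch self-contained.
df = pd.DataFrame({
    "neighborhood": ["Dorchester", "Back Bay", "Dorchester", "Allston"],
    "land_use":     ["R1", "CM", "R2", "R1"],
    "total_value":  [350_000, 1_200_000, 410_000, 515_000],
})

# Subproblem 1: overall shape and summary statistics
print(df.shape)
print(df["total_value"].describe())

# Subproblem 2: univariate non-graphical EDA on a categorical attribute
counts = df["land_use"].value_counts()
print(counts)
```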
[Google Maps]
Test 1
• Thursday, October 5, 5-6:15pm, Dion 109 (here)
• Format:
 - Multiple Choice
 - Free Response
• Covers papers and topics discussed through this Thursday
• Next week:
 - No class on Tuesday, Oct. 3
 - Exam on Thursday, Oct. 5
 - No office hours
 - Email me with questions
Data Analysis Tools
Lots and lots of tools
• Tableau
• QlikView
• Spotfire
• Trifacta
• KNIME
• OpenRefine
• Octoparse
• Watson
• SAS
• Mahout
• Spark
• scikit-learn
Tables of Tools
• Data Architecture/Storage
• Data Cleaning
• Data Visualization
• Programming Languages
• Spreadsheets
• Machine Learning
O'Reilly's Data Science Salary Survey (2016)
[Chart: distribution of base salary (US dollars, 0K to >200K) versus share of respondents]
[O'Reilly]
Data Science Tasks
Tasks (major involvement only):
 - Basic exploratory data analysis: 69%
 - Conducting data analysis to answer research questions: 61%
 - Communicating findings to business decision-makers: 58%
 - Data cleaning: 53%
 - Creating visualizations: 49%
 - Identifying business problems to be solved with analytics: 47%
 - Feature extraction: 43%
 - Developing prototype models: 43%
 - Organizing and guiding team projects: 39%
 - Implementing models/algorithms into production: 36%
 - Collaborating on code projects (reading/editing others' code, using git): 32%
 - Teaching/training others: 31%
 - Planning large software projects or data systems: 30%
 - Developing dashboards: 30%
 - ETL: 29%
 - Communicating with people outside your company: 28%
 - Setting up / maintaining data platforms: 24%
 - Developing data analytics software: 20%
 - Developing products that depend on real-time data analytics: 19%
 - Using dashboards and spreadsheets (made by others) to make decisions: 19%
 - Developing hardware (or working on software projects that require expert knowledge of hardware): 5%
[O'Reilly]
Data Science Tasks
[Chart: salary range and median (US dollars, 30K-150K) by task, for the tasks listed above]
[O'Reilly]
Programming Languages
Programming Languages (share of respondents):
 - SQL: 70%
 - R: 57%
 - Python: 54%
 - Bash: 24%
 - Java: 18%
 - JavaScript: 17%
 - Visual Basic / VBA: 13%
 - C++: 9%
 - MATLAB: 9%
 - C#: 8%
 - Scala: 8%
 - C: 8%
 - SAS: 5%
 - Perl: 5%
 - Ruby: 3%
 - Octave: 2%
 - Go: 1%
[Chart: salary median and IQR (US dollars, 0-200K) by language]
[O'Reilly]
Database Tools
Relational Databases (share of respondents):
 - MySQL: 37%
 - Microsoft SQL Server: 33%
 - Oracle: 23%
 - PostgreSQL: 22%
 - SQLite: 11%
 - Teradata: 10%
 - IBM DB2: 5%
 - Vertica: 4%
 - Netezza (IBM): 4%
 - EMC / Greenplum: 2%
 - Aster Data (Teradata): 2%
 - SAP HANA: 1%
 - Redshift: 1%
 - Oracle Exascale: 1%
[Chart: salary median and IQR (US dollars, 50K-250K) by database]
[O'Reilly]
Hadoop and Search Tools
Hadoop (share of respondents):
 - Apache Hadoop: 17%
 - Cloudera: 12%
 - Hortonworks: 8%
 - Amazon Elastic MapReduce (EMR): 7%
 - MapR: 4%
 - Oracle: 2%
 - EMC / Greenplum: 1%
 - IBM: 1%
Search (share of respondents):
 - Elasticsearch: 10%
 - Solr: 5%
 - Lucene: 4%
[Charts: salary median and IQR (US dollars, 0-200K) by Hadoop distribution and by search tool]
[O'Reilly]
Data Management Tools
Data Management, Big Data Platforms (share of respondents):
 - Spark: 21%
 - Hive: 20%
 - MongoDB: 10%
 - Amazon Redshift: 9%
 - HBase: 7%
 - Kafka: 7%
 - Pig: 7%
 - Impala: 6%
 - Toad: 5%
 - Cassandra: 4%
 - Redis: 4%
 - ZooKeeper: 4%
 - Google BigQuery/Fusion Tables: 3%
 - Neo4j: 3%
 - Splunk: 3%
 - Amazon DynamoDB: 3%
 - Storm: 2%
 - Couchbase: 1%
[Chart: salary median and IQR (US dollars, 0-200K) by platform]
[O'Reilly]
Spreadsheet & Business Intelligence Tools
Spreadsheets, BI, Reporting (share of respondents):
 - Excel: 69%
 - PowerPivot: 10%
 - Power BI: 8%
 - QlikView: 7%
 - BusinessObjects: 6%
 - Cognos: 6%
 - Oracle BI: 5%
 - Spotfire: 4%
 - Adobe Analytics: 3%
 - MicroStrategy: 3%
 - Pentaho: 3%
 - Alteryx: 2%
 - Jaspersoft: 1%
 - Datameer: 1%
[Chart: salary median and IQR (US dollars, 30K-150K) by tool]
[O'Reilly]
Visualization Tools
Visualization Tools (share of respondents):
 - ggplot: 35%
 - Tableau: 33%
 - matplotlib: 26%
 - Shiny: 16%
 - D3: 16%
 - Google Charts: 8%
 - Bokeh: 6%
 - Processing: 1%
 - JavaScript InfoVis Toolkit: 1%
[Chart: salary median and IQR (US dollars, 30K-150K) by tool]
[O'Reilly]
Machine Learning Tools
Machine Learning, Statistics (share of respondents):
 - scikit-learn: 31%
 - Spark MLlib: 13%
 - Weka: 9%
 - H2O: 5%
 - RapidMiner: 4%
 - LIBSVM: 4%
 - Mahout: 3%
 - Mathematica: 3%
 - Stata: 2%
 - Dato / GraphLab: 2%
 - KNIME: 2%
 - Vowpal Wabbit: 2%
 - BigML: 1%
 - IBM Big Insights: 1%
 - Google Prediction: 1%
[Chart: salary median and IQR (US dollars, 30K-150K) by tool]
[O'Reilly]
Trends in Data Analysis Tool Use
[KDNuggets]
Trends in Data Analysis Tool Usage
Table 1: Top Analytics/Data Science Tools in 2017 KDnuggets Poll
Tool            2017 % usage   % change (2017 vs. 2016)   % alone
Python          52.6%          15%                        0.2%
R language      52.1%          6.4%                       3.3%
SQL language    34.9%          -1.8%                      0%
RapidMiner      32.8%          0.7%                       13.6%
Excel           28.1%          -16%                       0.1%
Spark           22.7%          5.3%                       0.2%
Anaconda        21.8%          37%                        0.8%
Tensorflow      20.2%          195%                       0%
scikit-learn    19.5%          13%                        0%
Tableau         19.4%          5.0%                       0.4%
KNIME           19.1%          6.3%                       2.4%
[KDNuggets]
Top Increases
Table 2: Major Analytics/Data Science Tools with the largest increase in usage
Tool                                     % change   2017 % usage   2016 % usage
Microsoft CNTK                           294%       3.4%           0.9%
Tensorflow                               195%       20.2%          6.8%
Microsoft Power BI                       84%        10.2%          5.6%
Alteryx                                  76%        5.3%           3.0%
SQL on Hadoop tools                      42%        10.3%          7.3%
Microsoft other ML/Data Science tools    40%        2.2%           1.6%
Anaconda                                 37%        21.8%          16.0%
Caffe                                    32%        3.1%           2.3%
Orange                                   30%        4.0%           3.1%
DL4J                                     30%        2.2%           1.7%
Other Deep Learning Tools                30%        4.8%           3.7%
Microsoft Azure Machine Learning         26%        6.4%           5.1%
[KDNuggets]
Top Decreases
Table 3: Major Analytics/Data Science Tools with the largest decline in usage
Tool                                      % change   2017 % usage   2016 % usage
Turi (former Dato/GraphLab)               -93%       0.2%           2.4%
RapidInsight/Veera                        -92%       0.2%           3.0%
Salford SPM/CART/RF/MARS/TreeNet          -89%       0.4%           3.5%
MLlib                                     -61%       4.5%           11.6%
C4.5/C5.0/See5                            -38%       1.2%           2.0%
Hadoop: Open Source Tools                 -32%       15.0%          22.1%
Other free analytics/data mining tools    -29%       4.8%           6.8%
Rattle                                    -28%       2.6%           3.6%
Perl                                      -27%       1.7%           2.3%
Pentaho                                   -23%       1.8%           2.3%
Gnu Octave                                -22%       2.4%           3.1%
QlikView                                  -21%       4.2%           5.3%
[KDNuggets]
Data Science & Machine Learning Associations
[KDNuggets]
Python vs. R
[KDNuggets]
Hadoop/Spark Association
[KDNuggets]
Structured Query Language (SQL)
• Used with relational database management systems (RDBMS)
• Based on relational algebra
• Supports:
 - Data schema definitions
 - Add/delete/update data
 - Data query
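All three capabilities can be exercised from Python's built-in sqlite3 module; the table and rows here are made up for illustration:

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data schema definition
cur.execute("CREATE TABLE cars (name TEXT, cylinders INTEGER, mpg REAL)")

# Adding data
cur.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [("compact", 4, 32.0), ("sedan", 6, 24.0), ("truck", 8, 15.0)],
)

# Data query: average mpg for cars with fewer than 8 cylinders
cur.execute("SELECT AVG(mpg) FROM cars WHERE cylinders < 8")
avg_mpg = cur.fetchone()[0]
print(avg_mpg)  # 28.0
conn.close()
```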
Excel
• One of the most universal data analysis tools
• Supports:
 - Spreadsheets
 - Calculated cells
 - Visualizations
• Many similar products
Tableau
• Grew out of research at Stanford University on how to explore multidimensional datasets & relational databases
• Tableau Desktop: standalone (free trial, student license)
• Tableau Public: cloud-based system (free)
• Tableau Vizable: mobile app
• High-level GUI that connects to data, helps organize it, and provides intuitive routines for visualizing it plus customization
• Lots of possibilities for exploring data
Trifacta
• Grew out of research at Stanford University and UC Berkeley
• Data wrangling: a data cleaning tool
• Prepares data for analysis
• Trifacta Wrangler: free desktop app, integrates with Tableau
• Trifacta Enterprise Wrangler: integration with Hadoop
• Google's Cloud Dataprep integrates Trifacta Enterprise
Python Tools
• matplotlib:
 - Python's "base" visualization library
 - Originally mimicked the functions in MATLAB
 - Integrated with pandas .plot calls
 - Jupyter Notebook: %matplotlib inline or notebook
 - seaborn adds extras on top of matplotlib
• scikit-learn:
 - Machine learning library
 - Fit and predict using models
 - Classification and regression
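The fit/predict pattern looks like this in practice; a minimal sketch using scikit-learn's LinearRegression on made-up, exactly linear data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: y = 2x + 1 with no noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Fit a model, then predict on unseen input
model = LinearRegression()
model.fit(X, y)
pred = model.predict(np.array([[4.0]]))
print(pred[0])  # 9.0 (up to floating-point error)
```

Classifiers in scikit-learn follow the same fit/predict interface, which is why swapping models in and out is usually a one-line change.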