Page 1: CIS 602-01: Scalable Data Analysis

CIS 602-01: Scalable Data Analysis

Data Analysis Tools Dr. David Koop

D. Koop, CIS 602-01, Fall 2017

Page 2: CIS 602-01: Scalable Data Analysis

Analysis is a cycle

Start with basic computations, analyze results, and ask new questions

[Figure: the analysis cycle, connecting Data, Specification, Computation, Data Products, Perception & Cognition, Knowledge, and Exploration]

Page 3: CIS 602-01: Scalable Data Analysis

Types of EDA
• Univariate (one attribute) vs. multivariate (2+ attributes)
• Non-graphical vs. graphical
  - Non-graphical ~ statistics
  - Graphical ~ visualizations
• All are important!

Page 4: CIS 602-01: Scalable Data Analysis

Univariate Non-Graphical: Word Counts
• "This is the house that Jack built. This is the malt that lay in the house that Jack built. This is the rat that ate the malt That lay in the house that Jack built."

Word    Count  Frequency
the     6      0.171
that    5      0.143
This    3      0.086
is      3      0.086
house   3      0.086
Jack    3      0.086
built.  3      0.086
malt    2      0.057
lay     2      0.057
in      2      0.057
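The counts above are consistent with a case-sensitive split on whitespace (so "That" and "built." are separate tokens). A minimal sketch of that computation, assuming this tokenization; the function name `word_frequencies` is illustrative:

```python
from collections import Counter

text = ("This is the house that Jack built. This is the malt that lay in the "
        "house that Jack built. This is the rat that ate the malt That lay in "
        "the house that Jack built.")

def word_frequencies(text):
    """Return (word, count, relative frequency) rows, most common first."""
    words = text.split()      # whitespace split; case and punctuation are kept
    total = len(words)
    counts = Counter(words)
    return [(word, n, n / total) for word, n in counts.most_common()]

rows = word_frequencies(text)
for word, n, freq in rows[:10]:
    print(f"{word:<8}{n:<7}{freq:.3f}")
```

Lowercasing the text or stripping punctuation before splitting would merge "That" with "that" and "built." with "built", which changes the counts.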

Page 5: CIS 602-01: Scalable Data Analysis

Univariate Graphical: Histograms

[Figure: Observed ranks of posts by subreddit]

["The reddit Front Page is Not a Meritocracy", T. W. Schneider]
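A histogram reduces one attribute to counts per bin. A minimal sketch using only the standard library; the `ranks` values are made up for illustration and are not the reddit data shown on the slide:

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per half-open bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical observed ranks (illustrative only)
ranks = [1, 3, 4, 7, 8, 12, 15, 21, 22, 24]
width = 5
hist = histogram(ranks, width)

# Text rendering: one row per bin, one '#' per observation
for start, count in hist.items():
    print(f"{start:>3}-{start + width - 1:<3} {'#' * count}")
```

The choice of bin width changes what the histogram reveals, so it is worth trying more than one value during exploration.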

Page 6: CIS 602-01: Scalable Data Analysis

Multivariate Non-Graphical: Crosstabs
• Count groups and subgroups
• At least two different attributes
• Can subdivide vertically and horizontally for more subgroups
• Sometimes totals are useful
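A crosstab counts records for each combination of two attributes, optionally with row and column totals. A minimal sketch with the standard library; the `records` (cylinders, origin) are hypothetical:

```python
from collections import Counter

# Hypothetical records: (cylinders, origin), for illustration only
records = [(4, "Japan"), (4, "USA"), (8, "USA"), (6, "USA"),
           (4, "Japan"), (8, "USA"), (4, "Europe")]

def crosstab(pairs):
    """Return (table, row_totals, col_totals) for a list of (row, col) pairs."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur
    table = {r: {c: cells[(r, c)] for c in cols} for r in rows}
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    return table, row_totals, col_totals

table, row_totals, col_totals = crosstab(records)
for r, cells in table.items():
    print(r, cells, "total:", row_totals[r])
print("column totals:", col_totals)
```

With pandas available, `pandas.crosstab` produces the same kind of table directly from two columns.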

Page 7: CIS 602-01: Scalable Data Analysis

Multivariate Graphical: Parallel Coordinates

[Figure: parallel coordinates plot of cars, with one axis each for economy (mpg), cylinders, displacement (cc), power (hp), weight (lb), 0-60 mph (s), and year]

[M. Bostock]

Page 8: CIS 602-01: Scalable Data Analysis

Progressive Visualization


Page 9: CIS 602-01: Scalable Data Analysis

Blocking vs. Progressive Visualization


the currently applied brush, will force all other visualizations (i.e., v1, v3, and v4) to recompute. Depending on the visualization condition, v1, v3, and v4 will either show the results of the new brushing action instantaneously, a loading animation until the full result is available, or incrementally updated progressive visualizations. The system does not perform any form of result caching.

Upon startup, the system loads the selected dataset into memory. The system randomly shuffles the data points in order to avoid artifacts caused by the natural order of the data and to improve convergence of progressive computations [46]. We approximate the instantaneous visualization condition by computing queries over the entire dataset as quickly as possible. Initial microbenchmarks for a variety of queries over the small datasets used in the experiment yielded the following measurements: time to compute a result ≈ 100 ms and time to render a visualization ≈ 30 ms. Although not truly instantaneous, a total delay of only ≈ 130 ms is well below the guideline of one second, therefore allowing us to simulate the instantaneous condition. We simulate the blocking condition by artificially delaying the rendering of a visualization by the number of seconds specified through the dataset-delay factor. In other words, we synthetically prolong the time necessary to compute a result in order to simulate larger datasets. While a blocking computation is ongoing, we display a simple loading animation, but users can still interact with other visualization panels, change the ongoing computation (e.g., selecting a different subset), or replace the ongoing computation with a new query (e.g., changing an axis). Fig. 4 (top) shows an example of a blocking visualization over time.

We implemented the progressive condition by processing the data in chunks of 1,000 data points at a time, with approximate results displayed after each chunk. In total, we refresh the progressive visualization 10 times, independent of the dataset-delay factor. We display the first visualization as quickly as possible, with subsequent updates appropriately delayed so that the final accurate visualization is displayed after the specified dataset-delay condition. Note that the initial min and max estimates might change after seeing additional data, in which case we extend the visualization by adding bins of the same width to accommodate new incoming data. Throughout the incremental computation, we display a progress indication in the bottom left corner of a visualization (Fig. 3d). Progressive visualizations are augmented with error metrics indicating that the current view is only an approximation of the final result. Even though 95 percent confidence intervals based on the standard error have been shown to be problematic to comprehend in certain cases [47], we still opted to use them for bar charts due to their wide usage and familiarity. We render labels with margins of error (e.g., “±3%”) in each bin of a 2D-histogram. Fig. 4 (bottom) shows an example of a progressive visualization and how it changes over time. Note that the confidence intervals in the example are rather small, but their size can change significantly based on the dataset and query.
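The chunked processing described above can be sketched in a few lines. This is not the authors' implementation (their system is written in C#); it is a minimal illustration that assumes the aggregate is a mean and that the 95 percent margin of error is 1.96 times the standard error (a normal approximation). The function name and the synthetic data are hypothetical.

```python
import math
import random

def progressive_mean(values, chunk_size=1000, seed=0):
    """Yield (n_seen, running_mean, margin_of_error) after each chunk."""
    data = list(values)
    random.Random(seed).shuffle(data)   # shuffle to avoid ordering artifacts
    n, total, total_sq = 0, 0.0, 0.0
    for start in range(0, len(data), chunk_size):
        for x in data[start:start + chunk_size]:
            n += 1
            total += x
            total_sq += x * x
        mean = total / n
        # Sample variance from running sums (assumption: adequate for a sketch)
        var = (total_sq - n * mean * mean) / (n - 1) if n > 1 else 0.0
        margin = 1.96 * math.sqrt(max(var, 0.0) / n)   # 95% CI half-width
        yield n, mean, margin

# Synthetic data: 10,000 points, so chunks of 1,000 give 10 refreshes
values = [float(i % 100) for i in range(10_000)]
updates = list(progressive_mean(values))
for n, mean, margin in updates:
    print(f"{n:>6} points: mean = {mean:.2f} ± {margin:.2f}")
```

Each yielded tuple corresponds to one refresh of the visualization; the margin shrinks roughly with the square root of the number of points seen.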

Our system is implemented in C# / Direct2D. We tested our system and ran all sessions on a quad-core 3.60 GHz, 16 GB RAM, Microsoft Windows 10 desktop machine with a 16:9 format, 1,920 × 1,080 pixel display.

3.4 Procedure
We recruited 24 participants from a research university in the US. All participants were students (22 undergraduate, two graduate), all of whom had some experience with data exploration or analysis tools (e.g., Excel, R, Pandas). 21 of the participants were currently enrolled in and halfway through an introductory data science course. Our experiment included visualization condition as a within-subject factor and dataset-delay and dataset-order as between-subject factors. Note that the dataset-delay has no direct influence on the instantaneous condition, but it is used as a between-subject factor to test if it affects the other visualization conditions. To control against ordering and learning effects, we fully randomized the sequence in which we presented the different visualization conditions to the user and counterbalanced across dataset-delay conditions. Instead of fully randomizing the ordering of datasets, we opted to create two predefined dataset-orderings and factored them into our analysis. The two possible dataset-ordering values were 123 and 312 (i.e., DS1 followed by DS2 followed by DS3 and DS3 followed by DS1 followed by DS2, respectively), and we again counterbalanced across dataset-ordering. We seed our system’s random number generator differently for each session to account for possible effects in the progressive condition caused by the sequence in which data points are processed. While each participant did not experience all possible combinations of visualization, dataset-ordering, and dataset-delay conditions, all users saw all visualization conditions with one specific dataset-delay and dataset-ordering. For example, participant P14 was assigned dataset-ordering 312, dataset-delay 6 s, and visualization condition ordering [blocking, instantaneous, progressive]. In total, this adds up to four trial sub-groups, each with six participants: (dataset-delay = 6 s & dataset-order = 123), (dataset-delay = 12 s & dataset-order = 123), (dataset-delay = 6 s & dataset-order = 312) and (dataset-delay = 12 s & dataset-order = 312). Within each subgroup, we fully randomized the order in which we presented the different visualization conditions.

After a 15 minute tutorial of the system, including a summary of the visualization conditions and how to interpret confidence intervals and margins of error, we instructed the participants to perform three exploration sessions. In the case of participant P14, these three sessions included (1) blocking visualizations on DS3, (2) instantaneous visualizations on DS1, and (3) progressive visualizations on DS2 all with 6 s

Fig. 4. The blocking and progressive visualization conditions.

[Zgraggen et al.: How Progressive Visualizations Affect Exploratory Analysis, p. 1981]
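The design in the excerpt (2 dataset-delays × 2 dataset-orders, six participants per subgroup, randomized condition order) can be sketched as follows. The round-robin assignment to subgroups and all names are assumptions made for illustration; the paper does not say how participants were allocated.

```python
import itertools
import random
from collections import Counter

CONDITIONS = ["instantaneous", "blocking", "progressive"]
DELAYS_S = [6, 12]               # dataset-delay values from the excerpt
DATASET_ORDERS = ["123", "312"]  # dataset-order values from the excerpt

def assign_participants(n_participants=24, seed=0):
    """Assign participants to the four subgroups and shuffle condition order."""
    rng = random.Random(seed)
    subgroups = list(itertools.product(DELAYS_S, DATASET_ORDERS))
    plan = []
    for i in range(n_participants):
        delay, order = subgroups[i % len(subgroups)]   # round-robin (assumed)
        plan.append({
            "participant": f"P{i + 1}",
            "delay_s": delay,
            "dataset_order": order,
            # Within-subject factor: every participant sees all conditions
            "condition_order": rng.sample(CONDITIONS, k=len(CONDITIONS)),
        })
    return plan

plan = assign_participants()
sizes = Counter((p["delay_s"], p["dataset_order"]) for p in plan)
print(sizes)
```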

Page 10: CIS 602-01: Scalable Data Analysis

Experiment Results


delay. Each session was open-ended and we asked the participants to explore the dataset at their own liking and pace. We allotted 12 minutes per session, but participants were free to stop earlier. We used a think-aloud protocol [48] where participants were instructed to report anything they found interesting while exploring the dataset. Throughout each session, we captured both screen and audio recordings and logged low-level interactions (mouse events) as well as higher-level events (“axis changed”, “aggregation changed”, “textbox brush/filter”, “visualization brush,” and “visualization updated/stopped/completed”). An experimenter was present throughout the session, and participants were free to ask any technical questions or questions about the meaning of dataset attributes. At the end of the three sessions, we asked participants to give feedback about the tool and whether they had any thoughts regarding the different visualization conditions.

3.5 Statistical Analysis
Our study is designed as a full-factorial 3 (visualization conditions) × 2 (dataset-delay conditions) × 2 (dataset-order conditions) experiment. We applied mixed design analysis of variance tests (ANOVA) with visualization condition as the within-subject factor and dataset-delay and dataset-order as the between-subject factors to assess the effects of our factors on the various metrics we computed. Note that the dataset-delay factor should have no influence on trials where the visualization condition is set to instantaneous.

We tested the assumption of sphericity using Mauchly’s test, and we report results with corrected degrees of freedom using Greenhouse-Geisser estimates if violated. We report all significant (i.e., p < 0.05) main and interaction effects of these tests. For significant main effects, we conducted further analysis through Bonferroni corrected post hoc tests and for more nuanced interpretation, we opted to include Bayes factors for certain results and report BF10 factors along with corresponding significance labels [49]. Effect sizes are reported through Pearson’s r coefficient, and significance levels are encoded in plots using the following notation: * p < 0.05; ** p < 0.01; *** p < 0.001.
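For readers unfamiliar with the Bonferroni correction mentioned above: with m pairwise comparisons, each raw p-value is multiplied by m and capped at 1. A minimal sketch; the raw p-values below are hypothetical and are not taken from the paper.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values: min(1, p * number_of_tests)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Three visualization conditions give three pairwise comparisons.
# Hypothetical raw p-values, for illustration only:
raw = {
    "blocking vs. progressive": 0.012,
    "blocking vs. instantaneous": 0.0002,
    "instantaneous vs. progressive": 0.48,
}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
for pair, p in adjusted.items():
    print(f"{pair}: adjusted p = {p:.4f}")
```

The cap at 1 is why an adjusted value of exactly 1.0 can appear for a comparison that is far from significant.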

4 ANALYSIS OF VERBAL DATA

Inspired by previous insight-based studies [40], [44], [50], we manually coded insights from the audio recordings and screen captures. An insight is a nugget of knowledge extracted from the data, such as “France produces more wines than the US”. We followed the verbal data segmentation approach proposed by Liu et al. [40] and adopted their coding process in which the first author did the bulk of the coding, but we iteratively revised finished codes with collaborators to reduce bias. In our case, we decided to count observations that are within the same visualization, have the same semantics, and are on the same level of granularity as one insight. For example: “It looks like country 1 makes the most cars, followed by country 2 and country 3” was coded as one single insight, whereas an observation across two visualizations (through brushing) such as “Country 1 makes the most cars and seems to have the heaviest cars” was counted as two separate insights. We did not categorize insights or assign any quality scores or weights to insights nor did we quantify accuracy or validity of insights. For our analysis, all insights were treated equally.

4.1 Number of Insights Per Minute
In order to get a quantifiable metric for knowledge discovery, we normalized the total insight count for each session by the duration of the session. Fig. 5 shows this resulting “number of insights per minute” metric across different factors. Our results show that this metric was significantly affected by the type of visualization condition (F(1.452, 29.039) = 6.701; p < 0.01).

Post hoc tests revealed that the instantaneous condition showed a slight increase of number of insights per minute over the progressive condition (1.477 ± 0.568 versus 1.361 ± 0.492, respectively), which was not statistically significant (p = 1.0; r = 0.080). However, the blocking condition reduces the number of insights per minute to 1.069 ± 0.408, which differed significantly from both the progressive (p < 0.05; r = 0.292) and instantaneous (p < 0.001; r = 0.383) conditions. A Bayesian Paired Samples T-Test that tested if measure1 < measure2 revealed strong evidence for an increase in insights per minute from the blocking condition to the progressive condition (BF10 = 13.816), extreme evidence for an increase from blocking to instantaneous (BF10 = 134.561), and moderate evidence for no change or a decrease between instantaneous and progressive (BF10 = 0.130). In summary, blocking visualizations produced the fewest number of insights per minute, whereas the instantaneous and progressive visualizations performed equally well.

4.2 Insight Originality
Similar to [44], we grouped insights that encode the same nugget of knowledge together and computed the originality of an insight as the inverse of the number of times that insight was reported by any participant. That is, insights reported more frequently by participants received lower overall originality scores. A participant received an insight originality score of 1 if he or she had only unique insights (i.e., insights found by no other users). The lower the score,

Fig. 5. Insights per minute: Boxplot (showing median and whiskers at 1.5 interquartile range) and overlaid swarmplot for insights per minute (left) overall, (middle) by dataset-delay, and (right) by dataset-order. Higher values indicate better.

[IEEE Transactions on Visualization and Computer Graphics, Vol. 23, No. 8, August 2017, p. 1982]
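The two metrics in the excerpt can be sketched as follows. The excerpt defines an insight's originality as the inverse of how often it was reported; averaging those values per participant is an assumption here (it matches the stated property that a participant with only unique insights scores 1). The insight labels and participants are hypothetical.

```python
from collections import Counter

def insights_per_minute(n_insights, duration_minutes):
    """Normalize a session's insight count by its duration."""
    return n_insights / duration_minutes

def originality_scores(insights_by_participant):
    """Mean of 1 / (number of participants reporting the insight), per participant."""
    reported = Counter(
        insight
        for insights in insights_by_participant.values()
        for insight in set(insights)
    )
    scores = {}
    for participant, insights in insights_by_participant.items():
        unique = set(insights)
        scores[participant] = (
            sum(1 / reported[i] for i in unique) / len(unique) if unique else 0.0
        )
    return scores

# Hypothetical coded insights (labels stand in for grouped insights)
coded = {"P1": ["A", "B"], "P2": ["A"], "P3": ["C"]}
scores = originality_scores(coded)
print(scores)
print(insights_per_minute(n_insights=12, duration_minutes=10))
```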

Page 11: CIS 602-01: Scalable Data Analysis

Assignment 1
• http://www.cis.umassd.edu/~dkoop/cis602-2017fa/assignment1.html
• Boston Property Assessments
  - Initial exploratory analysis
  - Use a Python Notebook
  - May use pandas
  - Label subproblems and answers
  - Show work (even if it's not your final answer)

[Google Maps]

Page 12: CIS 602-01: Scalable Data Analysis

Test 1
• Thursday, October 5, 5-6:15pm, Dion 109 (here)
• Format:
  - Multiple Choice
  - Free Response
• Covers papers and topics discussed through this Thursday
• Next week:
  - No class on Tuesday, Oct. 3
  - Exam on Thursday, Oct. 5
  - No office hours
  - Email me with questions

Page 13: CIS 602-01: Scalable Data Analysis

Data Analysis Tools


Page 14: CIS 602-01: Scalable Data Analysis

Lots and lots of tools
• Tableau
• QlikView
• Spotfire
• Trifacta
• KNIME
• OpenRefine
• Octoparse
• Watson
• SAS
• Mahout
• Spark
• scikit-learn

Page 15: CIS 602-01: Scalable Data Analysis

Tables of Tools
• Data Architecture/Storage
• Data Cleaning
• Data Visualization
• Programming Languages
• Spreadsheets
• Machine Learning

Page 16: CIS 602-01: Scalable Data Analysis

O'Reilly's Data Science Salary Survey (2016)

[Figure: histogram of base salary (US dollars, 0K to >200K in 20K steps) against share of respondents (0 to 20%)]

[O'Reilly]

Page 17: CIS 602-01: Scalable Data Analysis

Data Science Tasks

TASKS (major involvement only), share of respondents

69%  Basic exploratory data analysis
61%  Conducting data analysis to answer research questions
58%  Communicating findings to business decision-makers
53%  Data cleaning
49%  Creating visualizations
47%  Identifying business problems to be solved with analytics
43%  Feature extraction
43%  Developing prototype models
39%  Organizing and guiding team projects
36%  Implementing models/algorithms into production
32%  Collaborating on code projects (reading/editing others' code, using git)
31%  Teaching/training others
30%  Planning large software projects or data systems
30%  Developing dashboards
29%  ETL
28%  Communicating with people outside your company
24%  Setting up / maintaining data platforms
20%  Developing data analytics software
19%  Developing products that depend on real-time data analytics
19%  Using dashboards and spreadsheets (made by others) to make decisions
 5%  Developing hardware (or working on software projects that require expert knowledge of hardware)

[O'Reilly]

Page 18: CIS 602-01: Scalable Data Analysis

Data Science Tasks

[Figure: salary range and median (US dollars, axis 30K to 150K) for each task, from Basic exploratory data analysis at the top to Developing hardware at the bottom]

[O'Reilly]

Page 19: CIS 602-01: Scalable Data Analysis

Programming Languages

PROGRAMMING LANGUAGES, share of respondents

70%  SQL
57%  R
54%  Python
24%  Bash
18%  Java
17%  JavaScript
13%  Visual Basic / VBA
 9%  C++
 9%  Matlab
 8%  Scala
 8%  C
 8%  C#
 5%  Perl
 5%  SAS
 3%  Ruby
 2%  Octave
 1%  Go

[Figure: salary median and IQR (US dollars, axis 0 to 200K) for each language]

[O'Reilly]

Page 20: CIS 602-01: Scalable Data Analysis

Database Tools

RELATIONAL DATABASES, share of respondents

37%  MySQL
33%  Microsoft SQL Server
23%  Oracle
22%  PostgreSQL
11%  SQLite
10%  Teradata
 5%  IBM DB2
 4%  Vertica
 4%  Netezza (IBM)
 2%  EMC / Greenplum
 2%  Aster Data (Teradata)
 1%  SAP HANA
 1%  Redshift
 1%  Oracle Exascale

[Figure: salary median and IQR (US dollars, axis 50K to 250K) for each relational database]

[O'Reilly]

Page 21: CIS 602-01: Scalable Data Analysis

Hadoop and Search Tools

HADOOP, share of respondents

17%  Apache Hadoop
12%  Cloudera
 8%  Hortonworks
 7%  Amazon Elastic MapReduce (EMR)
 4%  MapR
 2%  Oracle
 1%  EMC / Greenplum
 1%  IBM

SEARCH, share of respondents

10%  ElasticSearch
 5%  Solr
 4%  Lucene

[Figures: salary median and IQR (US dollars, axis 0 to 200K) for each Hadoop distribution and each search tool]

[O'Reilly]

Page 22: CIS 602-01: Scalable Data Analysis

Data Management Tools

DATA MANAGEMENT, BIG DATA PLATFORMS, share of respondents

21%  Spark
20%  Hive
10%  MongoDB
 9%  Amazon Redshift
 7%  HBase
 7%  Kafka
 7%  Pig
 6%  Impala
 5%  Toad
 4%  Cassandra
 4%  ZooKeeper
 4%  Redis
 3%  Neo4J
 3%  Google BigQuery/Fusion Tables
 3%  Splunk
 3%  Amazon DynamoDB
 2%  Storm
 1%  Couchbase

[Figure: salary median and IQR (US dollars, axis 0 to 200K) for each big data platform]

[O'Reilly]

Page 23: CIS 602-01: Scalable Data Analysis

Spreadsheet & Business Intelligence Tools

SPREADSHEETS, BI, REPORTING, share of respondents

69%  Excel
10%  PowerPivot
 8%  Power BI
 7%  QlikView
 6%  BusinessObjects
 6%  Cognos
 5%  Oracle BI
 4%  Spotfire
 3%  Pentaho
 3%  Adobe Analytics
 3%  MicroStrategy
 2%  Alteryx
 1%  Jaspersoft
 1%  Datameer

[Figure: salary median and IQR (US dollars, axis 30K to 150K) for each spreadsheet, BI, and reporting tool]

[O'Reilly]

Page 24: CIS 602-01: Scalable Data Analysis

Visualization Tools

VISUALIZATION TOOLS, share of respondents

35%  ggplot
33%  Tableau
26%  Matplotlib
16%  Shiny
16%  D3
 8%  Google Charts
 6%  Bokeh
 1%  Processing
 1%  JavaScript InfoVis Toolkit

[Figure: salary median and IQR (US dollars, axis 30K to 150K) for each visualization tool]

[O'Reilly]

Page 25: CIS 602-01: Scalable Data Analysis

Machine Learning Tools

MACHINE LEARNING, STATISTICS, share of respondents

31%  scikit-learn
13%  Spark MLlib
 9%  Weka
 5%  H2O
 4%  RapidMiner
 4%  LIBSVM
 3%  Mahout
 3%  Mathematica
 2%  Stata
 2%  Dato / GraphLab
 2%  KNIME
 2%  Vowpal Wabbit
 1%  BigML
 1%  IBM Big Insights
 1%  Google Prediction

[Figure: salary median and IQR (US dollars, axis 30K to 150K) for each machine learning and statistics tool]

[O'Reilly]

Page 26: CIS 602-01: Scalable Data Analysis

Trends in Data Analysis Tool Use


[KDNuggets]

Page 27: CIS 602-01: Scalable Data Analysis

Trends in Data Analysis Tool Usage

Table 1: Top Analytics/Data Science Tools in 2017 KDnuggets Poll

Tool          2017 % usage  % change 2017 vs 2016  % alone
Python        52.6%         15%                    0.2%
R language    52.1%         6.4%                   3.3%
SQL language  34.9%         -1.8%                  0%
RapidMiner    32.8%         0.7%                   13.6%
Excel         28.1%         -16%                   0.1%
Spark         22.7%         5.3%                   0.2%
Anaconda      21.8%         37%                    0.8%
Tensorflow    20.2%         195%                   0%
scikit-learn  19.5%         13%                    0%
Tableau       19.4%         5.0%                   0.4%
KNIME         19.1%         6.3%                   2.4%

[KDNuggets]

Page 28: CIS 602-01: Scalable Data Analysis

Top Increases

Table 2: Major Analytics/Data Science Tools with the largest increase in usage

Tool                                   % change  2017 % usage  2016 % usage
Microsoft, CNTK                        294%      3.4%          0.9%
Tensorflow                             195%      20.2%         6.8%
Microsoft Power BI                     84%       10.2%         5.6%
Alteryx                                76%       5.3%          3.0%
SQL on Hadoop tools                    42%       10.3%         7.3%
Microsoft other ML/Data Science tools  40%       2.2%          1.6%
Anaconda                               37%       21.8%         16.0%
Caffe                                  32%       3.1%          2.3%
Orange                                 30%       4.0%          3.1%
DL4J                                   30%       2.2%          1.7%
Other Deep Learning Tools              30%       4.8%          3.7%
Microsoft Azure Machine Learning       26%       6.4%          5.1%

[KDNuggets]
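The "% change" column appears to be a relative change in usage share. A small sketch, assuming % change = (2017 share − 2016 share) / 2016 share; the inputs are the rounded percentages from Table 2, so the results only approximate the published column, which was presumably computed from unrounded figures.

```python
def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# (2017 % usage, 2016 % usage) as listed in Table 2
usage = {
    "Tensorflow": (20.2, 6.8),
    "Microsoft Power BI": (10.2, 5.6),
    "Anaconda": (21.8, 16.0),
}
changes = {tool: percent_change(u2017, u2016)
           for tool, (u2017, u2016) in usage.items()}
for tool, change in changes.items():
    print(f"{tool}: {change:+.0f}%")
```

A large relative change can come from a small base: a tool going from 0.9% to 3.4% usage nearly quadruples while remaining rare in absolute terms.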

Page 29: CIS 602-01: Scalable Data Analysis

Top Decreases

Table 3: Major Analytics/Data Science Tools with the largest decline in usage

Tool                                    % change  2017 % usage  2016 % usage
Turi (former Dato/GraphLab)             -93%      0.2%          2.4%
RapidInsight/Veera                      -92%      0.2%          3.0%
Salford SPM/CART/RF/MARS/TreeNet        -89%      0.4%          3.5%
MLlib                                   -61%      4.5%          11.6%
C4.5/C5.0/See5                          -38%      1.2%          2.0%
Hadoop: Open Source Tools               -32%      15.0%         22.1%
Other free analytics/data mining tools  -29%      4.8%          6.8%
Rattle                                  -28%      2.6%          3.6%
Perl                                    -27%      1.7%          2.3%
Pentaho                                 -23%      1.8%          2.3%
Gnu Octave                              -22%      2.4%          3.1%
QlikView                                -21%      4.2%          5.3%

[KDNuggets]

Page 30: CIS 602-01: Scalable Data Analysis

Data Science & Machine Learning Associations


[KDNuggets]

Page 31: CIS 602-01: Scalable Data Analysis

Python vs. R


[KDNuggets]

Page 32: CIS 602-01: Scalable Data Analysis

Hadoop/Spark Association


[KDNuggets]

Page 33: CIS 602-01: Scalable Data Analysis

Structured Query Language (SQL)
• Used with relational database management systems (RDBMS)
• Based on relational algebra
• Supports:
  - Data schema definitions
  - Add/delete/update data
  - Data query
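The three kinds of SQL statements listed above can be tried from Python with the standard-library `sqlite3` module. A minimal sketch using an in-memory database; the `tools` table and its columns are made up for illustration, and the usage values are taken from the 2017 KDnuggets table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# Data schema definition
cur.execute("""
    CREATE TABLE tools (
        name      TEXT PRIMARY KEY,
        category  TEXT NOT NULL,
        usage_pct REAL
    )
""")

# Add, update, and delete data
cur.executemany(
    "INSERT INTO tools (name, category, usage_pct) VALUES (?, ?, ?)",
    [("Python", "language", 52.6), ("R", "language", 52.1),
     ("Excel", "spreadsheet", 28.1), ("Tableau", "visualization", 19.4)],
)
cur.execute("UPDATE tools SET category = 'BI' WHERE name = 'Tableau'")
cur.execute("DELETE FROM tools WHERE name = 'Excel'")
conn.commit()

# Data query: count and average usage per category
rows = cur.execute("""
    SELECT category, COUNT(*), AVG(usage_pct)
    FROM tools
    GROUP BY category
    ORDER BY category
""").fetchall()
conn.close()

for category, n, avg in rows:
    print(f"{category}: {n} tool(s), mean usage {avg:.2f}%")
```

The `?` placeholders let the driver handle quoting, which is safer than building SQL strings by hand.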

Page 34: CIS 602-01: Scalable Data Analysis

Excel
• One of the most universal data analysis tools
• Supports:
  - Spreadsheets
  - Calculated cells
  - Visualizations
• Many similar products

Page 35: CIS 602-01: Scalable Data Analysis

Tableau
• Grew out of research at Stanford University on how to explore multidimensional datasets & relational databases
• Tableau Desktop: standalone (free trial, student license)
• Tableau Public: cloud-based system (free)
• Tableau Vizable: mobile app
• High-level GUI that connects to data, helps organize it, and provides intuitive routines for visualizing it plus customization
• Lots of possibilities for exploring data

Page 36: CIS 602-01: Scalable Data Analysis

Trifacta
• Grew out of research at Stanford University and UC-Berkeley
• Data Wrangling: Data Cleaning Tool
• Prepare data for analysis
• Trifacta Wrangler: free desktop app, integrates with Tableau
• Trifacta Enterprise Wrangler: integration with Hadoop
• Google's Cloud Dataprep integrates Trifacta Enterprise

Page 37: CIS 602-01: Scalable Data Analysis

Python Tools
• matplotlib:
  - Python's "base" visualization library
  - Originally mimicked the functions in MATLAB
  - Integrated with pandas .plot calls
  - Jupyter Notebook: %matplotlib inline or notebook
  - seaborn adds extras on top of matplotlib
• scikit-learn
  - Machine learning library
  - Fit and predict using models
  - Classification and regression