Why Exploring Big Data Is Hard - Danyel Fisher

Post on 22-Jan-2018

WHY EXPLORING BIG DATA IS HARD (& WHAT WE CAN DO ABOUT IT)

DANYEL FISHER, MICROSOFT RESEARCH

/tiles/r02123002133111.png

One of the most popular spots in the world.

Based on a table with a few billion rows

Can you distinguish American users from international?

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

By hand!

SLOW!

NETWORK!

Defining “Big” Volume

“…200,000 magnetic tape reels which represent over 900 billion characters of data”

1975

“the size of the dataset is part of the problem”

Why is Big Data different?

REPRESENTATION

What visualizations are suitable for big data?

INTERACTION

What do we need to do to make that visualization useful for interaction?

And it’s costly!

Big data has the potential to cost unlimited amounts of money

A query on 100 cores for an hour costs 100 core-hours … and an analyst-hour.

Massive savings for doing less, or early termination

A Note on Infrastructure

You Won’t Plot Every Point…

Screen space to draw each data point [10^6 points]

Every data point in memory [10^9 bytes]

Store all the data points [10^12 bytes]

… Even If You Tried

x

y

Scatterplot (at least one pixel per point)

Network Diagram

Parallel Coordinates

(individual lines)

Aggregation

What is the aggregation equivalent of a bar graph?

What is an aggregated line chart, or a scatterplot?

N. Elmqvist and J.-D. Fekete. Hierarchical aggregation for information visualization: Overview, techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graphics, 16(3):439–454, May 2010.

Some things aggregate well

Daily values [chart: x-axis 3/1/1986 to 3/1/2014, y-axis 0 to 200]

Monthly aggregate: min and max [chart: y-axis 0 to 200]
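
A minimal sketch of the monthly min/max aggregate, assuming the daily series arrives as (date, value) pairs. The function name and the sample values are illustrative, not taken from the slides.

```python
from datetime import date


def monthly_min_max(daily):
    """Collapse (date, value) pairs into one (min, max) band per month."""
    bands = {}
    for day, value in daily:
        key = (day.year, day.month)
        lo, hi = bands.get(key, (value, value))
        bands[key] = (min(lo, value), max(hi, value))
    return bands


# Hypothetical daily values, for illustration only.
daily = [
    (date(2014, 1, 2), 110.0),
    (date(2014, 1, 15), 95.5),
    (date(2014, 1, 30), 130.25),
    (date(2014, 2, 3), 80.0),
    (date(2014, 2, 20), 120.0),
]
bands = monthly_min_max(daily)
```

Each month becomes a single vertical band instead of roughly thirty points, which is why a series like this aggregates well.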

Multiple dimensions

Liu, Jiang, Heer: imMens (2013)

Wattenberg: PivotGraph (2005)

Treemaps (mostly)

“Generalized Histograms”

Select buckets on data

then

Examine points, placing them into buckets

then

Create shapes based on buckets

Hadley Wickham: "Bin, Summarize, Smooth: A Framework for Visualizing Large Data"
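
The three steps (select buckets, place points into buckets, create shapes) might look like this for a one-dimensional histogram. This is a sketch under assumptions: equal-width buckets, a count as the summary, and shapes represented as plain dictionaries. All function names are hypothetical.

```python
def choose_buckets(lo, hi, count):
    """Step 1: select equal-width bucket bounds over [lo, hi]."""
    width = (hi - lo) / count
    return [(lo + i * width, lo + (i + 1) * width) for i in range(count)]


def aggregate(points, lo, hi, count):
    """Step 2: examine each point once and place it into a bucket."""
    counts = [0] * count
    for x in points:
        i = int((x - lo) / (hi - lo) * count)
        counts[min(max(i, 0), count - 1)] += 1  # clamp the top edge into the last bucket
    return counts


def make_shapes(buckets, counts):
    """Step 3: create one rectangle per bucket; scales are applied later."""
    return [{"x0": b[0], "x1": b[1], "height": c} for b, c in zip(buckets, counts)]


points = [0.5, 2.5, 2.7, 3.1, 9.0, 10.0]
buckets = choose_buckets(0.0, 10.0, 5)
counts = aggregate(points, 0.0, 10.0, 5)
shapes = make_shapes(buckets, counts)
```

The number of shapes depends on the bucket count, not on the number of points, so the rendering cost stays fixed as the data grows.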

Big Data Exploration

EXPLORATION

Learn about the dataset

Explore multiple hypotheses

Manipulate data freely

May be discarded after completion

Rapid iteration

Examples: parts of Tableau, Power View, ggplot, etc.

PRESENTATION

Communicate a specific view

Constrain interaction

Visual style important

Examples: visual dashboards, data storytelling

The Story of Walt

the hypothetical histogram

ASSUMPTION

The dataset is too big to fit into memory

ASSUMPTION

Every query takes a full minute

Creating Walt

Pass 1: find (Min, Max)

Pass 2: bucket all points

Total time: 2 minutes
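
A sketch of why building the histogram costs two passes when the data cannot be held in memory: the bucket bounds depend on the range, and the range is only known after a full scan. `make_stream` is a hypothetical callable that re-reads the data from the start each time it is called.

```python
def build_histogram(make_stream, bucket_count):
    """Build an equal-width histogram from data that must be streamed."""
    # Pass 1: find (min, max).
    lo = hi = None
    for x in make_stream():
        lo = x if lo is None or x < lo else lo
        hi = x if hi is None or x > hi else hi
    if lo is None:
        return None, None, [0] * bucket_count

    # Pass 2: bucket all points.
    counts = [0] * bucket_count
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    for x in make_stream():
        i = min(int((x - lo) / span * bucket_count), bucket_count - 1)
        counts[i] += 1
    return lo, hi, counts


passes = [0]  # counts how many times the data is scanned


def make_stream():
    passes[0] += 1
    return iter([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])


lo, hi, counts = build_histogram(make_stream, 4)
```

With the assumption that every scan takes a full minute, `passes[0]` is the number of minutes the analyst waits.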

Interact With Walt

CHANGE BUCKET COUNT

One pass.

Re-bucket every point

Or maybe we were clever…
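
One way to be clever about changing the bucket count, offered as an assumption rather than as the approach in the talk: keep counts at a finer resolution than the display needs, then merge adjacent fine buckets on demand. `rebucket` is a hypothetical helper, and it only handles bucket counts that divide the fine resolution evenly.

```python
def rebucket(fine_counts, new_count):
    """Merge adjacent fine buckets into new_count coarser ones, with no data pass."""
    if new_count <= 0 or len(fine_counts) % new_count != 0:
        raise ValueError("new_count must evenly divide the number of fine buckets")
    group = len(fine_counts) // new_count
    return [sum(fine_counts[i * group:(i + 1) * group]) for i in range(new_count)]


# Hypothetical fine-grained counts computed once, during the bucketing pass.
fine = [1, 0, 2, 3, 5, 8, 13, 8, 5, 3, 2, 1]
four_bars = rebucket(fine, 4)
three_bars = rebucket(fine, 3)
```

The interaction now touches twelve numbers instead of every point, at the cost of restricting which bucket counts are available.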

CROSS-FILTER WALT WITH ANOTHER HISTOGRAM

One pass.

Check filter on every point

Or maybe we were clever…
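
For cross-filtering, one possible form of cleverness is to pre-compute joint counts over both histograms' buckets, so that a selection in one histogram becomes a sum over a small table. This is a sketch assuming each row has already been reduced to a pair of bucket indices. The names are hypothetical.

```python
def build_cube(rows, bins_a, bins_b):
    """One pass over the data: count rows per (bucket in A, bucket in B)."""
    cube = [[0] * bins_b for _ in range(bins_a)]
    for a, b in rows:
        cube[a][b] += 1
    return cube


def cross_filter(cube, selected_b):
    """Bars of histogram A restricted to the selected buckets of histogram B."""
    return [sum(row[b] for b in selected_b) for row in cube]


# Hypothetical rows, already mapped to (bucket in A, bucket in B).
rows = [(0, 0), (0, 1), (1, 1), (2, 0), (2, 2), (2, 2)]
cube = build_cube(rows, 3, 3)
filtered = cross_filter(cube, {1, 2})
unfiltered = cross_filter(cube, range(3))
```

The table grows with the product of the bucket counts, so this only pays off for a few dimensions at modest resolution.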

How clever do we have to be?

Which operations are worth pre-caching?

◦ Change number of buckets, or their size

◦ Zoom in on a single bar

◦ Filter out some data

◦ Cross-filter into other visualizations

◦ Cross-filter from other visualizations

◦ Show sample rows from the histogram

OLAP!

The Moral of Walt’s Story

Decide which operations we will support rapidly … and which we’ll tolerate being slow

Solution Space

◦ Work Offline

◦ Index

OLAP: Pentaho

imMens, Nanocubes

◦ Restrict Data

◦ Sample (or Stream)

◦ Divide & Conquer

◦ Multiple passes across the data in parallel

Limited exploration!

Trade accuracy for latency

[Figure: progress toward 100% over time, Online vs. Traditional. Image adapted from Hellerstein]

Computing Confidence Bounds

bounds ~ variance / samples
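
The slide gives only a proportionality: bounds tighten as samples accumulate and widen with variance. A common concrete form, assumed here and not taken from the talk, is the normal-approximation interval for a mean, whose half-width is z * sqrt(variance / samples). The sketch uses z = 1.96 for a roughly 95% interval.

```python
import math
import statistics


def half_width(variance, samples, z=1.96):
    """Half-width of a normal-approximation confidence interval for a mean."""
    return z * math.sqrt(variance / samples)


def mean_with_bounds(sample, z=1.96):
    """Estimate the mean of the full data from a sample, with +/- bounds."""
    mean = statistics.mean(sample)
    variance = statistics.variance(sample)  # sample variance; needs at least 2 values
    return mean, half_width(variance, len(sample), z)


estimate, bound = mean_with_bounds([2, 4, 4, 4, 5, 5, 7, 9])

# Same variance, four times the samples: the bound halves.
early = half_width(4.0, 100)
later = half_width(4.0, 400)
```

Under this form, halving the bound takes four times the data, which is why early answers are cheap and the last digits of precision are expensive.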

The Progressive Pitch

“Trust Me, I'm Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster”

CHI 2012

What We Learned

Users made lots of mistakes

…carried out lots of queries

…and cut them off early

Users were fearless about exploration

Most numbers are rough

Randomness in databases is a pain

Supporting Streams

RESERVOIR SAMPLE

Keep a sample of k elements of the data such that each element has a k/size chance of being in the reservoir
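
A sketch of a reservoir sample using the classic replacement scheme often called Algorithm R. The slide does not name a specific algorithm, so the choice is an assumption. After `seen` elements, each one is in the reservoir with probability k/seen.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for seen, item in enumerate(stream, start=1):
        if seen <= k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(seen)  # uniform in [0, seen)
            if j < k:  # happens with probability k / seen
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(0))
short = reservoir_sample(range(3), 10)
```

The sample is valid at every moment of the stream, so a visualization can be drawn from it before the data ends.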

EQUI-DEPTH HISTOGRAM

Good one-pass algorithms exist

… but we have no idea how to visualize them
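
For illustration only, equi-depth bucket bounds can be approximated from a sample that fits in memory by cutting the sorted sample at equal ranks. This is not one of the one-pass streaming algorithms the slide refers to, and `equi_depth_bounds` is a hypothetical helper.

```python
def equi_depth_bounds(sample, bucket_count):
    """Bounds such that each bucket holds about the same number of sampled points."""
    ordered = sorted(sample)
    n = len(ordered)
    cuts = [ordered[(i * n) // bucket_count] for i in range(1, bucket_count)]
    return [ordered[0]] + cuts + [ordered[-1]]


# Hypothetical skewed sample: most values are small, a few are very large.
sample_values = [1, 2, 3, 4, 5, 6, 7, 8, 100, 200, 300, 400]
bounds = equi_depth_bounds(sample_values, 4)
```

Every bucket holds three points here, but the bucket widths range from 3 to 200, which hints at why these histograms are awkward to draw as ordinary bars.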

Incremental changes the rules

Categorical: add categories on the fly

Numerical: changing bounds

Any color map or scale can change
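
A small sketch of what changing rules means for a bar chart fed in batches: both the set of categories and the top of the value scale can move after any update, so the view must be prepared to re-lay-out. The class and its methods are hypothetical.

```python
from collections import Counter


class IncrementalBars:
    """Bar-chart state that is updated one batch of category labels at a time."""

    def __init__(self):
        self.counts = Counter()

    def update(self, batch):
        self.counts.update(batch)

    def categories(self):
        return sorted(self.counts)  # may grow after any batch

    def scale_max(self):
        return max(self.counts.values(), default=0)  # may grow after any batch


bars = IncrementalBars()
bars.update(["a", "b", "a"])
first = (bars.categories(), bars.scale_max())
bars.update(["c", "c", "c"])
second = (bars.categories(), bars.scale_max())
```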

SAMPLING: You’ll never know it all

TASKS

Find extreme

Compare bars

Bar to constant

Bar to range

Order (top-K)

SAMPLING: Probabilistic Views

“Sample-Oriented Task-Driven Visualizations: Allowing Users to Make Better, More Confident Decisions”

Design Goals

Easy to interpret

Consistency across tasks

Spatial Stability

Minimize Visual Noise (overhead)

“Is Bar A > Bar B”
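
One way to turn "Is Bar A > Bar B" into a number, assuming each bar is a sample estimate with a known standard error and that the two estimates are independent and approximately normal. This is an illustrative sketch, not necessarily the computation used in the cited paper.

```python
import math


def prob_a_greater_than_b(mean_a, se_a, mean_b, se_b):
    """P(A > B) for two independent, approximately normal estimates."""
    diff = mean_a - mean_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)  # standard error of the difference
    if se == 0:
        return 1.0 if diff > 0 else (0.0 if diff < 0 else 0.5)
    # Standard normal CDF evaluated at diff / se.
    return 0.5 * (1.0 + math.erf(diff / (se * math.sqrt(2.0))))


tie = prob_a_greater_than_b(10.0, 1.0, 10.0, 1.0)
clear = prob_a_greater_than_b(10.0, 1.0, 5.0, 1.0)
reverse = prob_a_greater_than_b(5.0, 1.0, 10.0, 1.0)
```

A single probability per comparison can be shown directly, instead of asking the analyst to judge overlapping error bars by eye.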

Compare to constant

Other Tasks

Find extreme

Compare to Range

A Tentative Framework

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

CACHE or INDEX

NETWORK!

SAMPLE

Place These!

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

Using the Framework

HOTMAP

CACHE

NETWORK!

CACHE

D3

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

NETWORK!

CACHE

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

Using the Framework

SERVER-SIDE RENDER

NETWORK!

D3

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

NETWORK!

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

Using the Framework

OLAP & PRE-INDEX

SAMPLE ACTION

Raw data

Relevant dimensions

Filter data

Choose bucket bounds

Aggregate data

Create shapes

Assign scales to shapes

Render to screen

CACHE

NETWORK!

NETWORK!

SAMPLE

OLAP: Pentaho, Mondrian

imMens, Nanocubes

Cross-Disciplinarity

This isn’t the way SQL, or Hadoop, works today

Infovis needs to be very integrated with the back-end

New skills, new training

Close collaboration across fields

Let’s Build Cool Stuff!

@fisherdanyel

http://research.microsoft.com/bigdataux
