Visualizing Billions of Data Points: Doing It Right

Upload: continuum-analytics

Post on 05-Jan-2017


TRANSCRIPT

Page 1: Visualizing Billions of Data Points: Doing it Right

Visualizing Billions of Data Points: Doing It Right

Page 2

About Us

Peter Wang is the CTO and Co-founder of Continuum Analytics and the creator of Bokeh.

He has been developing commercial scientific computing and visualization software for over 15 years.

As a creator of the PyData conference, he devotes time and energy to growing the Python data community and to advocating for and teaching Python at conferences worldwide.

Jim Bednar is a Solutions Architect at Continuum Analytics, and an Honorary Fellow at the University of Edinburgh, Scotland.

He holds a Ph.D. in Computer Science from the University of Texas, as well as undergraduate degrees in Electrical Engineering and Philosophy.

He has published more than 50 papers and books about the visual system and software development, and manages the open-source Python projects HoloViews, ImaGen, Param, and Datashader.

Page 3

Overview

• Visualizing big data: what is the problem?

• Datashading

• Dataset 1: 10 million points of NYC Taxi data

• Dataset 2: 3 billion points of OpenStreetMap data

• Dataset 3: 300 million points of US Census data

• Sneak preview: Data Explorer web interface

• Q&A

Page 4

Visualizing big data: What is the problem?

Page 5

Billions and billions…

Page 6

Big data magnifies small problems

• Of course, big data presents storage and computation problems

• More importantly, standard plotting tools have problems that are magnified by big data:
  • Overdrawing/overplotting
  • Saturation
  • Undersaturation
  • Binning issues

• We'll first explain these problems, and then present a new technique called datashading that addresses them head-on.

Page 7

Overdrawing

• For a scatterplot, the order in which points are drawn is very important

• The same distribution can look entirely different depending on plotting order

• The last data plotted overplots everything drawn before it

Page 8

Overdrawing

• The underlying issue is simply occlusion

• The same problem happens with a single category, but it is less obvious

• Occlusion can be reduced by using transparency (alpha blending)

Page 9

Saturation

• E.g. for alpha = 0.1, up to 10 points can overlap before saturating the available brightness

• Now the order of plotting matters less

• But after 10 points, the first-plotted data is still lost

• For one category, 10, 20, or 2000 overlapping points will look identical
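The saturation arithmetic above is easy to reproduce numerically. This sketch assumes a simple additive "ink" model of alpha compositing (an illustration of the idea, not any particular renderer's blend mode):

```python
# Simple additive alpha model (assumed for illustration): each
# overlapping point adds `alpha` of ink, clipped at full brightness 1.0.

def stacked_opacity(n_points, alpha=0.1):
    """Opacity of one pixel after n_points overlapping draws."""
    return min(1.0, n_points * alpha)

# Up to 10 points remain distinguishable at alpha = 0.1 ...
print(stacked_opacity(5))     # 0.5 -- half saturated
print(stacked_opacity(10))    # 1.0 -- fully saturated
# ... but beyond that, wildly different densities look identical:
print(stacked_opacity(20))    # 1.0
print(stacked_opacity(2000))  # 1.0
```

Once a pixel saturates, every additional point is invisible, which is why 10 and 2000 overlapping points render identically.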

Page 10

Saturation

• Same alpha value, more points: now the plot is highly misleading

• The right alpha value depends on the size and overlap of the dataset

• It is a difficult-to-set parameter, and it is hard to know when the data is being misrepresented

Page 11

Saturation

• Can try to reduce point size to reduce overplotting and saturation

• Now points are hard to see, with no guarantee of avoiding the problems

• Point size is another difficult-to-set parameter

• For really big data, scatterplots become very inefficient: with many data points per pixel, you may as well be binning by pixel

Page 12

Binning issues

• Can use a heatmap instead of a scatterplot

• Avoids saturation by auto-ranging on bins

• Result is independent of data size

• Here, two merged normal distributions look very different at different bin sizes

• Bin size is another difficult-to-set parameter
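The bin-size sensitivity is easy to demonstrate. The sketch below (function names are made up for illustration) bin-averages an equal mixture of two normal distributions: fine bins show both modes, while coarse bins swallow them into a single peak:

```python
import numpy as np

def mixture_density(x):
    """Equal mixture of two (unnormalized) normals at 0 and 2, sigma = 0.5."""
    g = lambda mu: np.exp(-(x - mu) ** 2 / (2 * 0.5 ** 2))
    return g(0.0) + g(2.0)

def bin_average(n_bins, lo=-2.0, hi=4.0, oversample=1000):
    """Average the density inside each of n_bins equal-width bins."""
    x = np.linspace(lo, hi, n_bins * oversample)
    return mixture_density(x).reshape(n_bins, oversample).mean(axis=1)

def n_peaks(h):
    """Count strict interior local maxima of a binned histogram."""
    return int(np.sum((h[1:-1] > h[:-2]) & (h[1:-1] > h[2:])))

fine   = bin_average(60)  # two clear modes
coarse = bin_average(3)   # both modes fall into one wide middle bin
print(n_peaks(fine), n_peaks(coarse))  # 2 1
```

The same data looks bimodal or unimodal purely as a function of the bin width, which is exactly the hard-to-set parameter the slide warns about.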

Page 13

Plotting big data

• When exploring really big data, the visualization is all you have: there's no way to look at each of the individual data points

• Common plotting problems can lead to completely incorrect conclusions based on misleading visualizations

• Slow processing makes a trial-and-error approach ineffective

When data is large, you don't know when the viz is lying.

Page 14

Datashading

Page 15

Datashading

• Flexible, configurable pipeline for automatic plotting

• Provides flexible plugins for visualization stages, as in graphics shaders

• Completely prevents overplotting, saturation, and undersaturation

• Mitigates binning issues by providing fully interactive exploration in web browsers, even of very large datasets on ordinary machines

• Statistical transformations of the data are a first-class aspect of the visualization

• Allows rapid iteration over visual styles and configurations, plus interactive selection and filtering, to support data exploration

Page 16

Datashading Pipeline: Projection

Data → Project/Synthesize → Scene

• Stage 1: select variables (columns) to project onto the screen

• Data is often filtered at this stage
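The projection stage amounts to picking x/y columns and filtering rows. A minimal sketch, with hypothetical column names and a made-up filter (not Datashader's actual API):

```python
import numpy as np

# Hypothetical taxi-like columns, purely for illustration.
trips = {
    "pickup_x": np.array([0.1, 0.4, 0.8, 0.9]),
    "pickup_y": np.array([0.2, 0.5, 0.1, 0.7]),
    "fare":     np.array([5.0, 0.0, 12.5, 7.0]),
}

def project(data, xcol, ycol, keep):
    """Select (x, y) columns, keeping only rows where keep(data) is True."""
    mask = keep(data)
    return data[xcol][mask], data[ycol][mask]

# Project pickup locations, dropping bogus zero-fare records:
x, y = project(trips, "pickup_x", "pickup_y", lambda d: d["fare"] > 0)
print(len(x))  # 3
```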

Page 17

Datashading Pipeline: Aggregation

Data → Project/Synthesize → Scene → Sample/Raster → Aggregates

• Stage 2: aggregate the data into a fixed set of bins

• Each bin yields one or more scalars (total count, mean, stddev, etc.)
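The aggregation stage can be sketched with plain NumPy: bin the projected points onto a fixed "pixel" grid and compute one scalar per bin. Here the reduction is a count; a mean or stddev reduction would work the same way on a value column (this is a simplified stand-in, not Datashader's implementation):

```python
import numpy as np

def aggregate_count(x, y, width=4, height=4, x_range=(0, 1), y_range=(0, 1)):
    """Count points per bin on a fixed width x height grid."""
    counts, _, _ = np.histogram2d(
        x, y, bins=(width, height), range=(x_range, y_range))
    return counts  # grid shape is fixed, independent of len(x)

x = np.array([0.1, 0.1, 0.9, 0.6])
y = np.array([0.1, 0.15, 0.9, 0.4])
agg = aggregate_count(x, y)
print(agg.sum())  # 4.0 -- every point lands in exactly one bin
print(agg.max())  # 2.0 -- two points share the densest bin
```

Because the grid size is fixed, the cost of later stages no longer grows with the number of data points.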

Page 18

Datashading Pipeline: Transfer

Data → Project/Synthesize → Scene → Sample/Raster → Aggregates → Transfer → Image

• Stage 3: transform the aggregates using one or more transfer functions, culminating in a function that yields a visible image

• Each stage can be replaced and configured separately
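One useful transfer function maps aggregate counts to intensity by rank, in the spirit of Datashader's histogram-equalization option. The function below is a simplified stand-in, not the library's actual implementation:

```python
import numpy as np

def eq_hist_shade(agg):
    """Rank-based (histogram-equalized) intensity transfer.

    Zero-count pixels stay fully dark, and every other count is mapped
    to an evenly spaced intensity in (0, 1], so neither a few huge bins
    nor many tiny ones can wash the image out.
    """
    out = np.zeros_like(agg, dtype=float)
    nz = agg > 0
    ranks = np.searchsorted(np.sort(agg[nz]), agg[nz], side="right")
    out[nz] = ranks / ranks.max()
    return out

counts = np.array([[0, 1, 1],
                   [2, 1000, 0]])
img = eq_hist_shade(counts)
print(img)  # the 1000-count bin gets 1.0 but does not crush the 1s and 2s
```

With a linear transfer, the 1000-count bin would force the 1- and 2-count bins to near-zero intensity; rank-based shading keeps all of them visible.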

Page 19

Datashader software architecture

• Open-source, BSD-licensed library, written in Python

• Download from https://github.com/bokeh/datashader

• conda install -c bokeh datashader

• Works well with the Bokeh library for interactive visualization of large data in the browser, but is fully independent

• Supports large parallel machines, but works well on desktops and laptops; all demos shown here run on a standard MacBook Pro

Page 20

Dataset 1: NYC Taxi data

Page 21

Dataset 1: Overview

This demo shows how traditional plotting tools break down for large datasets, and how datashading makes even large datasets practical to explore interactively.

• Data for 10 million New York City taxi trips

• Even 100,000 points gets slow for a scatterplot

• Parameters usually need adjusting for every zoom level

• True relationships within the data are not visible in a standard plot

Datashading automatically reveals the entire dataset, including outliers, hot spots, and missing data.

Page 22

Dataset 2: 3 billion OSM points

Page 23

Dataset 2: Overview

Because Datashader decouples the data processing from the visualization, it can handle arbitrarily large data.

• E.g. OpenStreetMap data:
  • About 3 billion GPS coordinates
  • https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/

• This image was rendered in one minute on a standard MacBook with 16 GB of RAM

• It renders in 7 seconds on a 128 GB Amazon EC2 instance

Page 24

Out-of-core operation

• The OSM database is 40 GB, larger than the 16 GB of RAM on this machine

• Fast out-of-core operation is powered by:
  • Numba: generates fast CPU and GPU machine code from Python source
  • Dask: parallelizes the tasks

• The Datashader source code is all Python
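The reason out-of-core and parallel operation works at all is that per-bin aggregates compose: each chunk of the data can be binned independently and the partial grids summed. Dask schedules exactly this kind of chunked reduction; the following is a serial NumPy sketch of the principle, not Datashader's actual code path:

```python
import numpy as np

def aggregate(x, y, bins=8):
    """Count-per-bin aggregation on a fixed 8x8 grid over the unit square."""
    h, _, _ = np.histogram2d(x, y, bins=bins, range=((0, 1), (0, 1)))
    return h

rng = np.random.default_rng(0)
x, y = rng.random(10_000), rng.random(10_000)

# Aggregate everything at once...
full = aggregate(x, y)

# ...or in 1000-point chunks that never all need to be in memory together,
# summing the partial grids as we go:
chunked = sum(aggregate(x[i:i + 1000], y[i:i + 1000])
              for i in range(0, 10_000, 1000))

print(np.array_equal(full, chunked))  # True -- partial aggregates compose
```

Since only the fixed-size grid must stay resident, the raw data can be streamed from disk in chunks far larger than RAM.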

Page 25

Dataset 3: 300 million points of US Census data

Page 26

Categorical data: 2010 US Census

• One point per person

• 300 million points in total

• Categorized by race

• Datashading shows a faithful distribution per pixel
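Keeping a faithful per-pixel distribution means aggregating one count per (pixel, category) rather than a single color per pixel. A sketch of that count-by-category idea in plain NumPy (a simplified stand-in for Datashader's categorical reduction):

```python
import numpy as np

def count_by_category(x, y, cat, n_cats, bins=4):
    """Per-pixel counts for each category on a bins x bins grid over [0, 1)."""
    grid = np.zeros((bins, bins, n_cats), dtype=int)
    ix = np.clip((x * bins).astype(int), 0, bins - 1)
    iy = np.clip((y * bins).astype(int), 0, bins - 1)
    np.add.at(grid, (ix, iy, cat), 1)  # unbuffered: repeated pixels accumulate
    return grid

x   = np.array([0.10, 0.12, 0.13, 0.90])
y   = np.array([0.10, 0.11, 0.14, 0.90])
cat = np.array([0, 0, 1, 1])  # e.g. two census race categories
g = count_by_category(x, y, cat, n_cats=2)
print(g[0, 0])  # [2 1] -- this pixel keeps both categories' counts
```

A last-drawn-wins scatterplot would show only one category at that pixel; the per-category counts let the shading stage mix colors in proportion to the true local distribution.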

Page 27

Categorical data: Illinois Census

• Zooming in doesn't require new parameter settings

• All ranges, etc. update automatically

Page 28

Categorical data: Chicago Census

• The Chicago area is highly racially segregated, on average

• Population density differences are still visible, automatically

Page 29

Categorical data: Chinatown Census

• At the neighborhood level, the full racial distribution is clear

• The size of the dots increases automatically for visibility

Page 30

Categorical data: Chinatown Census

• Bokeh supports geographic data tiles

• Borders between race-based neighborhoods are related to geographic features

Page 31

Go Further with Datashader

Page 32

© 2016 Continuum Analytics - Confidential & Proprietary

Use DataShader Today!

• Go to https://github.com/bokeh/datashader

• Please Go Try It Out!

• Star It, Fork It, Open Issues, Let Us Know What You Think!


Page 33


Support & Indemnification

• We offer support & indemnification for DataShader and many other Python packages through our Anaconda Pro offering.

• You can learn more at continuum.io/anaconda-subscriptions


Page 34


Anaconda Mosaic

• Create PORTABLE transformations

• Interactively EXPLORE heterogeneous data visually with Datashader

• Easily ANALYZE large flat-file repositories

• ELIMINATE data movement and redundant storage

• CATALOG datasets and transformations

• Learn more about Anaconda Mosaic in our webinar: https://go.continuum.io/visualize-explore-transform/

• Get a personalized demo of Anaconda Mosaic: https://www.continuum.io/anaconda-subscriptions

Page 35


Training & Consulting

• We also provide:
  • Training for your team on Datashader, Bokeh, matplotlib, and many other visualization packages
  • Consulting and custom engineering help taking visualizations to scale

Page 36

Contact Information and Additional Details

• Contact [email protected] for more information about Anaconda subscriptions and about becoming an early adopter for Data Explorer: help make sure our product fits your needs!

• View documentation and examples at github.com/bokeh/datashader and bokeh.pydata.org

• View demo notebooks on Anaconda Cloud: notebooks.anaconda.org/jbednar/

Page 37

Thank you

Email: [email protected]

Twitter: @ContinuumIO

Peter Wang, Twitter: @pwang

Jim Bednar, Twitter: @JamesABednar