
WHITEPAPER

Dr. James Bednar, Open Source Tech Lead & Solutions Architect, Continuum Analytics August 2016

BIG DATA VISUALIZATION WITH DATASHADER


Visualization in the Era of Big Data: Getting It Right Is Not Always Easy

Some of the problems related to the abundance of data can be

overcome simply by using more or better hardware. For instance,

larger data sets can be processed in a given amount of time by

increasing the amount of computer memory, CPU cores or network

bandwidth. But, other problems are much less tractable, such as

what might be called the ‘points-per-pixel problem’—which is

anything but trivially easy to solve and requires fundamentally

different approaches.

The ‘points-per-pixel’ problem is having more data points than is

possible to represent as pixels on a computer monitor. If your data

set has hundreds of millions or billions of data points—easily

imaginable for Big Data—there are far more than can be displayed

on a typical high-end 1920x1080 monitor with 2 million pixels, or

even on a bleeding edge 8K monitor, which can display only 33

million pixels. And yet, data scientists must accurately convey, if not

all the data, at least the shape or scope of the Big Data, despite these

hard limitations.

Very small data sets do not have this problem. For a scatterplot with

only ten or a hundred points, it is easy to display all points, and

observers can instantly perceive an outlier off to the side of the data’s

cluster. But as you increase the data set’s size or sampling density,

you begin to experience difficulties. With as few as 500 data points, it

is much more likely that there will be a large cluster of points that

mostly overlap each other (a problem known as ‘overplotting’) and obscure the structure of the data within the cluster. Also, as they grow, data sets

can quickly approach the points-per-pixel problem, either overall or

in specific dense clusters of data points.

Technical ‘solutions’ are frequently proposed to head off these issues,

but too often these are misapplied. One example is downsampling,

where the number of data points is algorithmically reduced, which can result in missing important aspects of your data. Another

approach is to make data points partially transparent, so that they

add up, rather than overplot. However, setting the amount of

transparency correctly is difficult, error-prone and leaves

unavoidable tradeoffs between visibility of isolated samples and

overplotting of dense clusters. Neither approach properly addresses

the key problem in visualization of large data sets: systematically

and objectively displaying large amounts of data in a way that can be

presented effectively to the human visual system.

In This Whitepaper

In this paper, you’ll learn why Open Data Science is the foundation to modernizing data analytics, and:

• The complexity of visualizing large amounts of data

• How datashading helps tame this complexity

• The power of adding interactivity to your visualization

Data science is about using data to provide insight and evidence that can lead business, government and academic leaders to make better decisions. However, making sense of the large data sets now becoming ubiquitous is difficult, and it is crucial to use appropriate tools that will drive smart decisions.

The beginning and end of nearly any problem in data science is a visualization—first, for understanding the shape and structure of the raw data and, second, for communicating the final results to drive decision making. In either case, the goal is to expose the essential properties of the data in a way that can be perceived and understood by the human visual system.

Traditional visualization systems and techniques were designed in an era of data scarcity, but in today’s Big Data world of an incredible abundance of information, understanding is the key commodity. Older approaches focused on rendering individual data points faithfully, which was appropriate for the small data sets previously available. However, when inappropriately applied to large data sets, these techniques suffer from systematic problems like overplotting, oversaturation, undersaturation, undersampling and underutilized dynamic range, all of which obscure the true properties of large data sets and lead to incorrect data-driven decisions. Fortunately, Anaconda is here to help with datashading technology that is designed to solve these problems head-on.



Let’s take a deeper dive into five major ‘plotting pitfalls’ and how they are typically addressed, focusing on problems that are minor inconveniences with small data sets but very serious problems with larger ones:

1. Overplotting

2. Oversaturation

3. Undersampling

4. Undersaturation

5. Underutilized range

OVERPLOTTING. Let’s consider plotting some 2D data points that

come from two separate categories, plotted as blue and red in A and

B below in Figure 1. When the two categories are overlaid, the

appearance of the result can be very different, depending on which

one is plotted first.

Plots C and D shown in the overplotting example are the same

distribution of points, yet they give a very different impression of

which category is more common, which can lead to incorrect decisions

based on this data. Of course, both categories are equally common in this case.

The cause for this problem is simply occlusion. Occlusion of data by

other data is called overplotting or overdrawing, and it occurs

whenever a data point or curve is plotted on top of another data

point or curve, obscuring it. Overplotting is a problem not just for

scatterplots, as shown below, but for curve plots, 3D surface plots, 3D

bar graphs and any other plot type where data can be occluded.

Overplotting is tricky to avoid, because it depends not only on the

number of data points, but on how much they happen to overlap in a

given data set, which is difficult to know before visualization.

Even worse, the visualizations themselves can be highly misleading,

as shown in C and D in Figure 1, so that even after visualization, it

can be difficult to detect overplotting.

OVERSATURATION. You can reduce problems with overplotting

by using transparency or opacity, via the alpha parameter provided

to control opacity in most plotting programs. For example, if alpha is

0.1, full color saturation will be achieved only when 10 points overlap,

which reduces the effects of plot ordering but can make it harder to

see individual points.

In the example in Figure 2, C and D look very similar (as they should,

since the distributions are identical), but there are still a few specific

locations with oversaturation, a problem that will occur when more

than 10 points overlap. The oversaturated points are located near the

middle of the plot, but the only way to know whether they are there

would be to plot both versions and compare, or to examine the pixel

values to see if any have reached full saturation—a necessary, but not

sufficient, condition for oversaturation. Locations where saturation

has been reached have problems similar to overplotting, because only

the last 10 points plotted will affect the final color, for alpha of 0.1.
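To make the discussion concrete, here is a minimal sketch of alpha-based blending in Python using matplotlib (the data and parameter values are illustrative assumptions, not taken from the figures): with alpha set to 0.1, roughly ten points must overlap before a pixel reaches full saturation, so plotting order matters far less than in Figure 1, at the cost of making isolated points fainter.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    blue = rng.normal(loc=(-1, 0), scale=0.7, size=(5000, 2))   # category 1
    red = rng.normal(loc=(1, 0), scale=0.7, size=(5000, 2))     # category 2

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(*blue.T, s=4, color="blue", alpha=0.1)
    ax.scatter(*red.T, s=4, color="red", alpha=0.1)   # drawn last, but no longer hides blue
    plt.show()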



Figure 1. Overplotting

Figure 2. Using Transparency to Avoid Overplotting



Even worse, even if you have set the alpha value so that it approximately or usually avoids oversaturation, as in the previous plot, the correct value still depends on the data set. If there are more points overlapping in a particular region, a manually adjusted alpha setting that worked well for a previous data set will systematically misrepresent the new data set.

In the example in Figure 3, C and D again look qualitatively different,

yet still represent the same distributions, just with more points. Since

we are assuming that the goal of the visualization is to faithfully

reveal the underlying distributions, having to tune visualization

parameters manually based on the properties of the data set itself is a

fundamental problem that wastes time and leads to errors in

judgment that could be very costly.

To make it even more complicated, the correct alpha also depends

on the dot size, because smaller dots have less overlap for the same

data set. With smaller dots, as shown in Figure 4, C and D look more

similar, as desired, but the color of the dots is now difficult to see in

all cases, because the dots are too transparent for this size.

As you can see in Figure 4, it is very difficult to find settings for the

dot size and alpha parameters that correctly reveal the data, even for

relatively small and obvious data sets like these. With larger data sets

with unknown content, it is often impossible to detect that such

problems are occurring, leading to false conclusions based on

inappropriately visualized data.

UNDERSAMPLING. With a single category instead of the multiple

categories shown previously, oversaturation simply obscures spatial

differences in density. For instance, 10, 20 and 2000 single-category

points overlapping will all look the same visually, for alpha=0.1.

In Figure 5, on the next page, let’s first look at another example that

has a sum of two normal distributions slightly offset from one

another but no longer uses color to separate them into categories.

As shown in the examples in the previous sections, finding settings to

avoid overplotting and oversaturation is difficult. The ‘small dots’

parameters used in panels A and B (size 0.1, full alpha) of the undersampling vs. overplotting example (Figure 5) work fairly well for a sample

of 600 points (A), but those parameters lead to serious overplotting

issues for larger data sets, obscuring the shape and density of the

distribution (B). Switching to 10 times smaller dots with alpha 0.1 to

allow overlap (‘tiny dots’) works well for the larger data set D, but not

at all for the 600 point data set C. Clearly, not all of these settings are

accurately conveying the underlying distribution, as they all appear

quite different from one another, but in each case they are plotting

samples from the same distribution. Similar problems occur for the

same size data set, but with greater or lesser levels of overlap

between points, which varies with every new data set.

In any case, as data set size increases, at some point plotting a full

scatterplot like any of these will become impractical with current

plotting technology. At this point, people often simply subsample

their data set, plotting 10,000 or perhaps 100,000 randomly selected

data points. But, as Figure 5 panel A shows, the shape of an

undersampled distribution can be very difficult or impossible to

make out, leading to incorrect conclusions about the distribution.

Such problems can occur even when taking very large numbers of

samples and examining sparsely populated regions of the space,

which will approximate panel A for some plot settings and panel C

for others. The actual shape of the distribution is only visible if

sufficient data points are available in that region and appropriate

plot settings are used, as in D, but ensuring that both conditions are

Figure 3. Oversaturation Due to More Overlapping Points

Figure 4. Reducing Oversaturation by Decreasing Dot Size




true is quite a difficult process of trial and error, making it very

likely that important features of the data set will be missed.

To avoid undersampling large data sets, researchers often use 2D

histograms visualized as heatmaps, rather than scatterplots showing

individual points. A heatmap has a fixed size grid regardless of the

data set size, so that it can make use of all the data. Heatmaps

effectively approximate a probability density function over the

specified space, with coarser heatmaps averaging out noise or

irrelevant variations to reveal an underlying distribution, and finer heatmaps representing more detail in the distribution, as long as the distribution is sufficiently densely sampled.

Let’s look at some heatmaps in Figure 6 with different numbers of

bins for the same two-Gaussians distribution.

As you can see, binning that is too coarse, like grid A, cannot represent this distribution faithfully, but with enough bins, like grid C, the heatmap will approximate a tiny-dot scatterplot like plot D of the undersampling example (Figure 5). For intermediate grid sizes like B, the

heatmap can average out the effects of undersampling. Grid B is

actually a more faithful representation of the distribution than C,

given that we know this distribution is two offset 2D Gaussians, while

C more faithfully represents the sampling—the individual points

drawn from this distribution. Therefore, choosing a good binning

grid size for a heatmap does take some expertise and knowledge of

the goals of the visualization, and it is always useful to look at multiple binning-grid spacings for comparison. Still, the binning parameter is meaningful at the data level (how coarse a view of the data is desired?), rather than just a plotting detail (what size and transparency should I use for the points?) that would need to be determined arbitrarily.
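As a rough illustration of this binning parameter, the sketch below (assumed data and grid sizes, not the whitepaper’s own code) builds 2D histograms of a two-Gaussian sample at coarse, intermediate and fine grids using NumPy and matplotlib:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    pts = np.vstack([rng.normal((-1, 0), 0.7, (50000, 2)),
                     rng.normal((1, 0), 0.7, (50000, 2))])   # two offset Gaussians

    for bins in (8, 40, 200):   # coarse, intermediate and fine binning grids
        counts, xedges, yedges = np.histogram2d(pts[:, 0], pts[:, 1], bins=bins)
        plt.figure()
        plt.imshow(counts.T, origin="lower", cmap="gray_r",
                   extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
        plt.title(f"{bins}x{bins} bins")
    plt.show()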

In principle, the heatmap approach can entirely avoid the first three

problems above:

1. Overplotting, since multiple data points sum arithmetically into

the grid cell, without obscuring one another

2. Oversaturation, because the minimum and maximum counts

observed can automatically be mapped to the two ends of a

visible color range

3. Undersampling, since the resulting plot size is independent of

the number of data points, allowing it to use an unbounded

amount of incoming data

UNDERSATURATION. Heatmaps come with their own plotting

pitfalls. One rarely appreciated issue common to both heatmaps and

alpha-based scatterplots is undersaturation, where large numbers of

data points can be missed entirely because they are spread over

many different heatmap bins or many nearly-transparent scatter

points. To look at this problem, we can construct a data set

combining multiple 2D Gaussians, each at a different location and

with a different amount of spread (standard deviation):

Location:            (2,2)    (2,-2)    (-2,-2)    (-2,2)    (0,0)
Standard deviation:  0.01     0.1       0.5        1.0       2.0

Even though this is still a very simple data set, it has properties

shared with many real world data sets, namely that there are some

areas of the space that will be very densely populated with points,

while others are only sparsely populated. On the next page we’ll look

at some scatterplots for this data in Figure 7.

Which one of the plots in the undersaturation scatterplot figure shows

the ‘real’ overall distribution that we know is there? None of them—at

least not very well. In Figure 7 plot A, the cluster with the widest

spread (standard deviation of 2.0) covers up everything else,

completely obscuring the structure of this data set by overplotting.

Plots B and C reveal the structure better, but they required hand-tuning

Figure 5. Undersampling vs Overplotting

Figure 6. Binning into Heatmaps





and neither one is particularly satisfactory. In B, there are four

clearly visible Gaussians, but all but the largest appear to have the

same density of points per pixel, which we know is not the case from

how the data set was constructed, plus the smallest is nearly invisible.

In addition, each of the five Gaussians has the same number of data

points (10,000), but the second largest looks like it has more than the

ones with smaller spreads, and the narrowest one is likely to be

overlooked altogether, which is the clearest example of oversaturation

obscuring important features. Yet, if we try to combat the

oversaturation by using transparency as in Figure 7 plot C, we now

get a clear problem with undersaturation—the ‘very large spread’

Gaussian is now essentially invisible. Again, there are just as many

data points in the widest-spread cluster as in each of the others, but

we would never even know any of those points were there if we were

only looking at C.

To put it in a real-world context, with plot settings like those in plot C, a large rural population spread over a wide region will entirely fail to show up on the visualization compared to a densely populated area, while with the plot settings in A it will entirely dominate the plot; either outcome could lead to a completely inappropriate decision when making a judgment about that real-world data. Similar problems

occur for a heatmap view of the same data, as shown in Figure 8.

Here, the narrowly spread distributions lead to single pixels that

have a very high count compared to the rest. If all the pixels’ counts

are linearly ramped into the available color range, from zero to that

high count value, then the wider spread values are obscured, as in B,

or entirely invisible, as in C.

To avoid undersaturation, you can add an offset to ensure that low count,

but nonzero, bins are mapped into a visible color, with the remaining

intensity scale used to indicate differences in counts (Figure 9).

Such mapping entirely avoids undersaturation, since all pixels are

either clearly zero, in the background color, white in this case, or a

non-background color taken from the colormap. The widest-spread

Gaussian is now clearly visible in all cases.
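A minimal NumPy sketch of this offset mapping (the function name and offset value are assumptions for illustration) is shown below; zero-count bins keep the background color, while every nonzero bin is lifted to at least a fixed fraction of the visible intensity range:

    import numpy as np

    def offset_intensity(counts, offset=0.2):
        """Map bin counts into [0, 1], reserving 0 for empty bins."""
        counts = np.asarray(counts, dtype=float)
        scaled = counts / counts.max()                 # plain linear ramp, 0..1
        lifted = offset + (1.0 - offset) * scaled      # nonzero bins start at `offset`
        return np.where(counts > 0, lifted, 0.0)       # empty bins stay at background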

However, despite these plots avoiding overplotting, oversaturation,

undersampling and undersaturation, the actual structure of this data

is still not visible. In Figure 9, plot A, the problem is clearly too-

coarse binning, but, even B is somewhat too coarsely binned for this

data, since the ‘very narrow spread’ and ‘narrow spread’ Gaussians

show up identically, each mapping entirely into a single bin (the two

black pixels). Plot C does not suffer from too-coarse binning, yet it

still looks more like a plot of the ‘very large spread’ distribution

alone, rather than a plot of these five distributions that have different

spreads, and it is thus still highly misleading, despite the correction

for undersaturation.

UNDERUTILIZED RANGE. So, what is the problem in Figure 9,

plot C? By construction, we’ve avoided the first four pitfalls:

overplotting, oversaturation, undersampling and undersaturation.

But the problem is now more subtle—differences in data point

density are not visible between the five Gaussians, because all, or

nearly all, pixels end up being mapped into either the bottom end of

the visible range (light gray), or the top end (pure black, used only for

the single pixel holding the ‘very narrow spread’ distribution). The

rest of the visible colors in this gray colormap are unused, conveying

no information to the viewer about the rich structure that we know

this distribution contains. If the data were uniformly distributed

over the range from minimum to maximum counts per pixel (0 to

10,000 in this case), then the plot would work well, but that’s not the

case for this data set and for many real-world data sets.

So, let’s try transforming the data from its default linear

representation of integer count values into something that reveals

relative differences in count values by mapping them into visually

distinct colors. A logarithmic transformation is one common choice

as shown on the next page in Figure 10.
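A sketch of that transformation, assuming a 2D array of bin counts (the Poisson stand-in data here is purely illustrative):

    import numpy as np

    counts = np.random.default_rng(3).poisson(lam=2.0, size=(100, 100))  # stand-in bin counts
    log_counts = np.log1p(counts)                  # log(1 + count) keeps empty bins at zero
    intensity = log_counts / log_counts.max()      # rescale into the visible 0..1 range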

Aha! We can now see the full structure of the data set, with all five

Gaussians clearly visible in B and C and the relative spreads also

clearly visible in C. However, we still have a problem. Unlike

the solutions to the first four pitfalls, the choice of a logarithmic

transformation to address the fifth problem was arbitrary and

Figure 7. Undersaturation with Scatterplots

Figure 8. Undersaturation with Heatmaps

Figure 9. Avoiding Undersaturation Using an Offset

dependent on the specifics of this data set. The logarithmic

transformation mainly works well because we happened to have

used an approximately geometric progression of spread sizes when

constructing the example. For large data sets with truly unknown

structure, can we have a more principled approach to mapping the

data set values into a visible range that will work across data sets?

Yes, if we think of the visualization problem in a different way. The

underlying difficulty in plotting this data set, as for many real-world

data sets, is that the values in each bin are numerically very different,

ranging from 10,000 in the bin for the ‘very narrow spread’

Gaussian to 0 or 1 for single data points from the ‘very large spread’

Gaussian. Given the 256 gray levels available in a normal monitor

and the similarly limited human ability to detect differences in gray

values, numerically mapping the data values into the visible range

linearly is clearly not going to work well. But, given that we are

already backing off from a direct numerical mapping in the above

approaches for correcting undersaturation and for doing log

transformations, what if we entirely abandon the numerical

mapping approach, using the numbers only to form an ordering of

the data values and plotting that rather than the magnitudes? Such

an approach would be a rank-order plot, preserving relative order

while discarding specific magnitudes. For 100 gray values, you can

think of it as a percentile based plot, with the lowest 1% of the data

values mapping to the first visible gray value, the next 1% mapping to

the next visible gray value, and so on to the top 1% of the data values

mapping to the highest visible gray value (black, in this case). The actual

data values would be ignored in such plots, but their relative

magnitudes would still determine how they map onto colors on the

screen, preserving the structure of the distribution, rather than the

numerical values.

We can approximate such a rank-order or percentile encoding using

the histogram equalization function from an image processing

package, which makes sure that each gray level is used for about the

same number of pixels in the plot as shown in Figure 11.
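The sketch below is one way to approximate that behavior in plain NumPy (it is not the image-processing library’s own implementation): counts are replaced by their rank, so each of the available gray levels ends up covering roughly the same number of nonzero bins regardless of how skewed the counts are.

    import numpy as np

    def eq_hist_intensity(counts, n_levels=256):
        """Rank-order (histogram-equalized) mapping of bin counts into (0, 1]."""
        counts = np.asarray(counts, dtype=float)
        nonzero = counts > 0
        sorted_vals = np.sort(counts[nonzero])
        ranks = np.searchsorted(sorted_vals, counts[nonzero], side="right")  # 1..N
        levels = np.ceil(ranks * n_levels / ranks.max())                     # 1..n_levels
        out = np.zeros_like(counts)
        out[nonzero] = levels / n_levels          # empty bins stay at 0 (background)
        return out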

Figure 11 plot C, the rank-order plotting example, now reveals the

full structure that we know was in this data set, i.e. five Gaussians

with different spreads, with no arbitrary parameter choices. The

differences in counts between pixels are now very clearly visible,

across the full and very wide range of counts in the original data.

Of course, we’ve lost the actual counts themselves, so we can no

longer tell just how many data points are in the ‘very narrow spread’

pixel in this case. So, plot C is accurately conveying the structure, but

additional information would need to be provided to show the actual

counts, by adding a color key mapping from the visible gray values

into the actual counts and/or by providing hovering value

information. Interactive approaches also work well at this point, with

the initial view showing where to investigate, at which point the

numerical values can be examined in each area of interest; actually

showing the full range in a single plot will not work well, but in each

local area it can be useful.

At this point, one could also consider explicitly highlighting hotspots

so that they cannot be overlooked. In plots B and C in Figure 11, the

two highest density pixels are mapped to the two darkest pixel colors,

and with many monitor settings chosen to make black text ‘look

better,’ those values may not be clearly distinguishable from each

other or from nearby gray values. Once the data is reliably and automatically mapped into a good range for display, explicit adjustments to make hotspots particularly clear can be done in a principled way that does not depend on the actual data distribution, for example by simply mapping the top few pixel values into a different color, highlighting the top few percentile ranges of the data.

If we step back a bit, we can see that by starting with plots of specific

data points, we showed how typical visualization techniques will

systematically misrepresent the distribution of those points. With

Big Data, these problems are incredibly serious for businesses,

because the visualization is often the only way that we can

understand the properties of the data set, leading to potentially costly

missed opportunities and incorrect decisions based on the data.

Figure 10. Dynamic Range with a Logarithmic Transformation

Figure 11. Parameter-Free Visualization Using Rank-Order Plotting


Visualizing Big Data Effectively

Fortunately, there is now an approach to Big Data visualizations that

provides an optimized interaction between the data and the human

visual system, automatically avoiding all of the above plotting

pitfalls. The approach, in which raw data is ultimately rendered into

an image, is a three-part operation:

1. Synthesize

2. Rasterize

3. Transfer

SYNTHESIZE. The first step is to project or synthesize your data

onto a scene. One starts with free-form data, and then needs to make

decisions as to how best to initially lay out that data on the monitor.

An example might be a basic graph of price vs. sales for a product. In

the past, this would be the final step of the visualization process,

leading to any of the serious problems of visualizing Big Data that we

discussed above. In our approach, however, this is only the first step;

it is about making a decision about what to visualize, which will then

be rendered automatically onto the screen in the subsequent steps.

RASTERIZE. The second step is rasterization, which can be thought

of as replotting all of the data on a grid, so that any square of that

grid serves as a finite subsection of the data space; within each square

of this grid you then count the data points that fall there or do other

operations like averaging or measuring standard deviation. One

square may contain no data, another square may contain two points,

and others may contain many points. This step results in an

‘aggregate’ view of the data, binned into a fixed-sized data structure.

TRANSFER. The final step is transfer, which really exploits how the

human visual system works. In this step, the aggregates are

transformed into squares of color, producing an image. The

colormapping will represent the data that lies within that subsection

of the grid and ought to be chosen carefully based on what we know

about how our brains process colors. This step is easy to grasp

intuitively, but to do it well requires introducing some sophisticated

statistical operations that drive the most appropriate transformation

of the data. Luckily, these steps can be automated so that they do not

depend on human judgment about unknown data sets.

Despite the automation, it is important to emphasize that the data

scientist should retain fine-grained control over each of these three steps. If the plots are to be interpreted, there must be no ‘black boxes’ in any of the transformations — it should be clear both what

processing is being done and how to change that processing to

highlight specific aspects of the data that are needed for a decision.

By contrast, traditional plotting was, at best, a two-step black box

process, going from raw data to an image of a plot, with at most

some highly indirect control available to the analyst, such as

selecting transparency, dot size and a color scheme. Because those

choices are not directly expressible in terms of the data set itself,

they can only reveal the true picture of the underlying data after a

process of manual adjustment that requires significant domain

expertise and time for parameter adjustment for every plot.

Our solution provides several key advantages. Statistical

transformations of data are now a first-class feature of the

visualization—the data is processed according to a fully specified,

rigorous criterion, not subject to human judgment. Algorithmic

processing of intermediate stages in the visualization pipeline is

used both to reduce time-consuming manual interventions and the

likelihood of covering up data accidentally. In traditional

approaches, these steps are done by trial and error; our approach

automates them and also makes those automation parameters easily

accessible for final tweaking. Rapid iteration of visual styles and

configurations, as well as interactive selections and filtering,

encourages open-minded data exploration, rather than the older

approaches of having to repeatedly adjust the plot before it will show

any useful data at all.

All of these stages remain open for readjustment in an iterative

process of tuning one’s models and how best to display the data, in

which the data scientist can control how data is best transformed and

visualized at each step, starting from a first plot that already

faithfully reveals the overall data set.

Datashader for Big Data Visualization

Anaconda provides all of the functionality described above with its

open source and freely available datashader library. The datashader

library can be used in conjunction with Bokeh, another free, open

source library, to create richly interactive browser-based visualizations.


Figure 12. Stages of a Datashader Pipeline: Data → Scene → Aggregate(s) → Image → Plot, via the Projection, Aggregation, Transformation, Colormapping and Embedding steps


The datashader library overcomes all of the pitfalls above, both by

automatically calculating appropriate parameters based on the data

itself and by allowing interactive visualizations of truly large data

sets with millions or billions of data points so that their structure can

be revealed. The above techniques can be applied ‘by hand’, but

datashader lets you do this easily, by providing a high performance

and flexible modular visualization pipeline, making it simple to do

automatic processing, such as auto-ranging and histogram

equalization, to faithfully reveal the properties of the data.

The datashader library has been designed to expose the stages

involved in generating a visualization. These stages can then be

automated, configured, customized or replaced wherever

appropriate for a data analysis task. The five main stages in a

datashader pipeline are an elaboration of the three main stages

above, after allowing for user control in between processing steps as

shown in Figure 12.

Figure 12 illustrates a datashader pipeline with computational steps

listed across the top of the diagram, while the data structures, or

objects, are listed along the bottom. Breaking up the computation

into this set of stages is what gives datashader its power, because only

the first couple of stages require the full data set while the remaining

stages use a fixed-size data structure regardless of the input data set,

making it practical to work with even extremely large data sets.

To demonstrate, we’ll construct a synthetic data set made of the

same five overlapping 2D normal distributions introduced in the

undersaturation example shown previously in Figure 7.

Centered on each location shown are 10,000 randomly chosen

points, drawn from a distribution with the indicated standard

deviation. Datashader is able to faithfully reveal the overall shape of

this 50,000-point distribution, without needing to adjust or tune any

parameters, in only 15 milliseconds.
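A minimal sketch of this end-to-end pipeline is shown below, using the public datashader API (ds.Canvas, Canvas.points and tf.shade in current releases; the 2016 release spelled some of these functions differently). The column names, image size and random seed are illustrative assumptions rather than the whitepaper’s own code.

    import numpy as np
    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    # Synthesize the five-Gaussians data set described earlier (50,000 points)
    rng = np.random.default_rng(42)
    centers = [(2, 2), (2, -2), (-2, -2), (-2, 2), (0, 0)]
    stds = [0.01, 0.1, 0.5, 1.0, 2.0]
    df = pd.concat([pd.DataFrame(rng.normal(c, s, size=(10000, 2)), columns=["x", "y"])
                    for c, s in zip(centers, stds)], ignore_index=True)

    cvs = ds.Canvas(plot_width=400, plot_height=400,
                    x_range=(-8, 8), y_range=(-8, 8))    # projection: declare the scene
    agg = cvs.points(df, "x", "y", agg=ds.count())       # aggregation: counts per pixel
    img = tf.shade(agg, cmap=["lightblue", "darkblue"],  # colormapping, with
                   how="eq_hist")                        # histogram equalization

Only the aggregation step touches all 50,000 points; every later stage works on the fixed-size grid, which is what keeps the pipeline practical for much larger data sets.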

In Figure 13, you can see each of the five underlying distributions

clearly, which have been manually labeled in the version on the right,

for clarity.

The stages involved in these computations will be laid out one by

one below, showing both how the steps are automated and how they

can be customized by the user when desired.

PROJECTION. Datashader is designed to render data sets projected

onto a 2D rectangular grid, eventually generating an image where

each pixel corresponds to one cell in that grid. The projection stage

includes several steps:

1. Select which variable you want to have on the x axis and which

one for the y axis. If those variables are not already columns in your dataframe (for instance, if you need to do a coordinate transformation), you’ll first need to create suitable columns mapping directly to x and y for use in the next step.

2. Choose a glyph, which determines how an incoming data

point maps onto the chosen rectangular grid. There are three

glyphs currently provided with the library:

a. A Point glyph that maps the data point into the

single closest grid cell

b. A Line glyph that maps that point into every grid

cell falling between this point and the next

c. A Raster glyph that treats each point as a square

in a regular grid covering a continuous space

3. Although new glyph types are somewhat difficult to create and

rarely needed, you can design your own if desired, to shade a

point onto a set of bins according to some kernel function or

some uncertainty value.

4. Decide what size of final image you want in pixels and what range of the data to plot (in whatever units x and y are stored), and create a canvas object to hold this information.

At this stage, no computation has actually been done—the glyph and

canvas objects are purely declarative objects that record your

preferences, which won’t actually be applied until the next stage.

Thus, the projection stage is primarily conceptual: how do you want your data to be mapped when it is aggregated?

The scene object suggested above is not actually constructed in

memory, but conceptually corresponds to what other plotting

packages would render directly to the screen at this stage.

Location:            (2,2)    (2,-2)    (-2,-2)    (-2,2)    (0,0)
Standard deviation:  0.01     0.1       0.5        1.0       2.0

Figure 13. Datashader Rendering of the Five-Gaussians Example


AGGREGATION. Once a conceptual scene object has been specified,

it can then be used to guide aggregating the data into a fixed-sized

grid. All of the aggregation options currently supported are

implemented as incremental reduction operators. Using incremental

operations means that we can efficiently process data sets in a single

pass, which is particularly important for data sets larger than the

memory available. Given an aggregate bin to update, typically

corresponding to one eventual pixel, and a new data point, the

reduction operator updates the state of the bin in some way.

Data points are normally processed in batches for efficiency, but it is

simplest to think about the operator as being applied per data point,

and the mathematical result should be the same.

Figure 14 shows four examples using different aggregation functions.
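A sketch of those four reductions, reusing the hypothetical `cvs` canvas and `df` dataframe from the earlier sketch and adding an assumed numeric column named `val`:

    import datashader as ds

    agg_count = cvs.points(df, "x", "y", agg=ds.count())         # counts per bin (Figure 14 A)
    agg_any = cvs.points(df, "x", "y", agg=ds.any())              # True wherever any point falls (B)
    agg_mean_y = cvs.points(df, "x", "y", agg=ds.mean("y"))       # mean of the y column per bin (C)
    df["val"] = df["x"] ** 2 + df["y"] ** 2                       # assumed extra value column
    agg_mean_val = cvs.points(df, "x", "y", agg=ds.mean("val"))   # mean of that column per bin (D)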

TRANSFORMATION. Now that the data has been projected and

aggregated into a gridded data structure, it can be processed in any

way you like, before converting it to an image, which will be

described in the following section. At this stage, the data is still

stored as bin data, not pixels, which makes a wide variety of

operations and transformations simple to express.

For instance, in Figure 15, instead of plotting all the data, we can

easily find hotspots by plotting only those bins at or above the 99th percentile by count, or we can apply any NumPy ufunc to the bin values, whether or not it is meaningful.
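Both of the operations in Figure 15 are ordinary xarray/NumPy expressions applied to the aggregate; a sketch, assuming the `agg_count` aggregate from the previous step:

    import numpy as np
    import datashader.transfer_functions as tf

    hotspots = agg_count.where(agg_count >= np.percentile(agg_count, 99))  # top 1% of bins only
    wrapped = np.sin(agg_count)        # any NumPy ufunc applies elementwise, meaningful or not

    img_hotspots = tf.shade(hotspots)  # bins filtered out become NaN and render as empty
    img_wrapped = tf.shade(wrapped)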

COLORMAPPING. As you can see in Figures 13-15, the typical way to visualize an aggregate array is to map each array bin into a color for a corresponding pixel in an image. Each example maps a scalar

value. By default, the colors are chosen from the colormap

[‘lightblue’,’darkblue’] (#ADD8E6 to #00008B), with intermediate

colors chosen as a linear interpolation independently for the red,

green and blue color channels (AD to 00 for the red channel, in this

case). The alpha (opacity) value is set to 0 for empty bins and 1 for

non-empty bins, allowing the page background to show through

wherever there is no data. You can supply any colormap you like as

shown in Figure 16, including Bokeh palettes, matplotlib colormaps

or a list of colors using the color names from ds.colors, integer

triples or hexadecimal strings.
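For example, the following sketch (again assuming the `agg_count` aggregate from earlier) passes several kinds of colormaps to tf.shade; Viridis256 is a standard Bokeh palette and cm.viridis a standard matplotlib colormap:

    import datashader.transfer_functions as tf
    from bokeh.palettes import Viridis256
    from matplotlib import cm

    img_default = tf.shade(agg_count)                              # ['lightblue', 'darkblue']
    img_bokeh = tf.shade(agg_count, cmap=Viridis256)               # a Bokeh palette
    img_mpl = tf.shade(agg_count, cmap=cm.viridis)                 # a matplotlib colormap
    img_list = tf.shade(agg_count, cmap=["white", "pink", "red"])  # an explicit color list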

Figure 14. Visualization of Various Aggregations Using Datashader: (A) count aggregation, (B) any aggregation, (C) mean y aggregation, (D) mean val aggregation

Figure 15. Single-Line Operations Using xarray/NumPy Functions: (A) agg.where(agg >= np.percentile(agg, 99)), (B) numpy.sin(agg)


EMBEDDING. In Figure 16, the stages all eventually lead to a raster

image, displayed here as PNG images. However, these bare images

do not show the data ranges, axis labels and so on, nor do they

support the dynamic zooming and panning necessary to understand

data sets across scales. To add these features, the datashader output

can be embedded into plots in a variety of plotting programs, such

as an interactive Bokeh plot as illustrated in Figure 17.
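The whitepaper predates the current HoloViews integration, but as of recent releases one common way to get the interactive, re-rasterizing Bokeh behavior described here is the datashade operation from HoloViews; a sketch, assuming the `df` dataframe from the earlier sketch:

    import holoviews as hv
    from holoviews.operation.datashader import datashade
    hv.extension("bokeh")

    points = hv.Points(df, kdims=["x", "y"])
    interactive = datashade(points, cmap=["lightblue", "darkblue"])  # re-aggregates on zoom/pan
    hv.save(interactive, "five_gaussians.html")   # or display directly in a Jupyter notebook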

On a live server, you can zoom and pan to explore each of the

different regions of this data set. For instance, if you zoom in far

enough on the blue dot, you’ll see that it does indeed include 10,000

points, they are just so close together that they show up as only a

single tiny blue spot in the above plot. Such exploration is crucial for

understanding data sets with rich structure across different scales, as

in most real world data.

To illustrate the power of visualizing rich structures at a very large scale,

we will take a look at two data rich examples on the following pages.

Figure 16. Examples of Colormapping Using Datashader

Figure 17. Datashader Embedded in Interactive Bokeh Visualizations


EXAMPLE 1: 2010 CENSUS DATA. The 2010 Census collected a

variety of demographic information for all of the more than 300

million people in the United States. Here, we’ll focus on the subset of

the data selected by the Cooper Center, who produced a map of the

population density and the racial/ethnic makeup of the USA

(http://www.coopercenter.org/demographics/Racial-Dot-Map). Each

dot in this map corresponds to a specific person counted in the

census, located approximately at their residence. To protect privacy,

the precise locations have been randomized at the block level, so that

the racial category can only be determined to within a rough

geographic precision. In this map, we show the results of running

novel analyses focusing on various aspects of the data, rendered

dynamically as requested using the datashader library, rather than precomputed and pre-rendered as in the map at the URL above.

For instance, we can look at the population density by plotting the

x,y locations of each person, using all the default plotting values,

apart from selecting a more colorful colormap in Figure 18.

Patterns relating to geography (like mountain ranges), infrastructure (like roads in the Midwest) and history (such as high population density along the East Coast) are all clearly visible, and additional structure becomes visible interactively when zooming into any local region.

For this data set, we can add additional information by colorizing

each pixel by the racial/ethnic category reported on the census data

for that person, using a key of:

• Purple: Hispanic/Latino

• Cyan: Caucasian/White

• Green: African American/Black

• Red: Asian/Pacific Islander

• Yellow: Other including Native American

Datashader will then merge all the categories present in each pixel to

show the average racial/ethnic makeup of that pixel, showing clear

levels of segregation at the national level, again using only the

default parameter settings with no custom tuning or adjustment as

shown in Figure 19.
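A sketch of this kind of categorical rendering (the dataframe, column name and category labels are assumptions about the census data, not its actual schema) uses a per-category count aggregate and a color key:

    import datashader as ds
    import datashader.transfer_functions as tf

    color_key = {"hispanic": "purple", "white": "cyan", "black": "green",
                 "asian": "red", "other": "yellow"}

    census_df["race"] = census_df["race"].astype("category")   # assumed pre-loaded dataframe
    cvs = ds.Canvas(plot_width=900, plot_height=525)
    agg = cvs.points(census_df, "easting", "northing", agg=ds.count_cat("race"))
    img = tf.shade(agg, color_key=color_key, how="eq_hist")    # blends the categories per pixel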

Here, “segregation” means only that persons of different races or

ethnicities are grouped differently geographically, which could have

a very wide variety of underlying historical, social or political causes.

Even greater levels of segregation are visible when zooming into any

major population center, such as those shown in Figure 20.

In these examples, Chicago’s and Manhattan’s historic Chinatown neighborhoods are clearly visible (colored in red), and other neighborhoods are very clearly segregated by race/ethnicity.

Datashader supports interactive zooming all the way in to see

individual data points, so that the amount of segregation can be seen

very clearly at a local level, such as in Chicago’s Chinatown and

nearby neighborhoods.

Here, datashader has been told to automatically increase the size of

each point when zooming in so far that data points become sparse,

making individual points more visible.

Figure 18. Visualizing US Population Density with Datashader

Figure 19. Visualizing US Population by Race with Datashader

Figure 20. Race & Ethnicity with Datashader: (A) zooming in to view race/ethnicity data in Chicago, (B) NYC, (C) Los Angeles, (D) Chicago


Figure 21. Plotting NYC Taxi Dropoffs with Bokeh

Figure 22. Plotting NYC Taxi Dropoffs with Datashader

Figure 23. NYC Taxi Pickup Times

EXAMPLE 2: NYC TAXI DATA SET. For this example, we’ll use

part of the well-studied NYC taxi trip database, with the locations of

all New York City taxicab pickups and dropoffs from January 2015.

The data set contains 12 million pickup and dropoff locations (in

Web Mercator coordinates), with passenger counts and times of day.

First, let’s look at a scatterplot of the dropoff locations, as would be

rendered by subsampling with Bokeh, Figure 21.

Here, the location of Manhattan can be seen clearly, as can the

rectangular Central Park area with few dropoffs, but there are

serious overplotting issues that obscure any more detailed structure.

With the default settings of datashader, apart from the colormap, all

of the data can be shown with no subsampling required, revealing

much richer structure. In Figure 22, the entire street grid of the New

York City area is now clearly visible, with increasing levels of detail

available by zooming in to particular regions, without needing any

specially tuned or adjusted parameters.

By analogy to the US census race data, you can also treat each hour

of the day as a category and color them separately, revealing

additional temporal patterns using the color key of:

• Red: 12 a.m. Midnight

• Yellow: 4 a.m.

• Green: 8 a.m.

• Cyan: 12 p.m. Noon

• Blue: 4 p.m.

• Purple: 8 p.m.

In Figure 23, there are definitely different regions of the city where

pickups happen at specific times of day, with rich structure that can

be revealed by zooming in to see local patterns and relate them to

the underlying geographical map as shown in Figure 24.

Figure 24. Taxi Pickup Times Zoomed with Overlay


OPERATIONS IN VISUALIZATION. Once the data is in

datashader, it becomes very simple to perform even quite

sophisticated computations on the visualization, not just on the

original data. For instance, we can easily plot all the locations in

NYC where there are more pickups than dropoffs in shades of red,

and all locations where there are more dropoffs than pickups in

shades of blue, as shown in Figure 25.

Plotted in this way, it is clear that pickups are much more likely along the main arteries (presumably where a taxi can be hailed successfully), while dropoffs are more likely along side streets.

LaGuardia Airport (circled) also shows clearly segregated pickup

and dropoff areas, with pickups being more widespread,

presumably because pickups occur on a lower level and thus have lower

GPS accuracy due to occlusion of the satellites.

With datashader, building a plot like this is very simple, once the

data has been aggregated. An aggregate is an xarray data structure (see xarray.pydata.org) and, if we create an aggregate named

drops that contains the dropoff locations and one named picks

that contains the pickup locations, then drops.where(drops>picks)

will be a new aggregate holding all the areas with more dropoffs,

and picks.where(picks>drops) will hold all those with more

pickups. These can then be merged to make the plot above, in one

line of datashader code. Making a plot like this in another plotting

package would essentially require replicating the aggregation step

of datashader, which would require far more code.
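A sketch of that comparison (the dataframe and its column names are assumptions about the taxi data’s layout):

    import datashader as ds
    import datashader.transfer_functions as tf

    cvs = ds.Canvas(plot_width=900, plot_height=600)
    drops = cvs.points(taxi_df, "dropoff_x", "dropoff_y", agg=ds.count())
    picks = cvs.points(taxi_df, "pickup_x", "pickup_y", agg=ds.count())

    more_drops = tf.shade(drops.where(drops > picks), cmap=["lightblue", "blue"])
    more_picks = tf.shade(picks.where(picks > drops), cmap=["lightpink", "red"])
    img = tf.stack(more_picks, more_drops)    # overlay the two shaded images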

Similarly, referring back to the US census data, it only takes one

line of datashader code to filter the race/ethnicity data to show only

those pixels containing at least one person of every category in

Figure 26, plot A.

The color then indicates the predominant race/ethnicity, but only

for those areas—mainly major metropolitan areas—with all races

and ethnicities included. Another single line of code will select only

those areas where the number of African Americans/Blacks is

larger than the number of Caucasians/Whites as shown in Figure

26, plot B.

Here, the predominantly African American/Black neighborhoods

of major cities have been selected, along with many rural areas in the Southeast and a few largely Hispanic neighborhoods on the West Coast that nonetheless have more Blacks than Whites.

Alternatively, we can simply highlight the top 1% of the pixels by

population density, in this case by using a color range with 100

shades of gray and then changing the top one to red in Figure 26,

plot C.
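Sketches of those three pixel-level selections, applied to the categorical census aggregate `agg` from the earlier sketch (the category labels and the exact percentile construction are assumptions):

    import numpy as np
    import datashader.transfer_functions as tf

    # A: only pixels containing at least one person of every race/ethnicity category
    all_races = agg.where((agg > 0).all(dim="race"))

    # B: only pixels where African Americans/Blacks outnumber Caucasians/Whites
    more_black = agg.where(agg.sel(race="black") > agg.sel(race="white"))

    # C: 100 shades of gray, with the darkest (roughly the densest 1% of pixels
    # under histogram equalization) replaced by red
    total = agg.sum(dim="race")
    grays = [f"#{v:02x}{v:02x}{v:02x}" for v in np.linspace(240, 0, 100).astype(int)]
    grays[-1] = "red"
    img_top1 = tf.shade(total, cmap=grays, how="eq_hist")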

Nearly any such query or operation that can be expressed at the

level of pixels (locations) can be expressed similarly simply,

providing a powerful counterpart to queries that are easy to

perform at the raw data level, or to filter by criteria already

provided as columns in the data set.

Figure 25. Visualizing Drop-Off Locations: dropoffs (blue) vs. pickups (red)

Figure 26. Filtering US Census Data: (A) US census data, only including pixels with every race/ethnicity included; (B) US census data, only including pixels where African Americans/Blacks outnumber Caucasians/Whites; (C) US population density, with the 1% most dense pixels colored in red


OTHER DATA TYPES. The previous examples focus on scatterplots,

but datashader also supports line plots, trajectories and raster plots.

Line plots behave similarly to datashader scatter plots, avoiding the

very serious overplotting and occlusion effects that happen for plots

of multiple overlaid time-series curves, by ensuring that overlapping

lines are combined in a principled way, as shown in Figure 27.

With datashader, time series data with millions or billions of points

can be plotted easily, with no downsampling required, allowing

isolated anomalies to be detected easily and making it simple to

zoom in to see lower-level substructure.

Trajectory plots (ordered GPS data coordinates) can similarly use all

the data available even for millions or billions of points, without

downsampling and with no parameter tuning, revealing

substructure at every level of detail, as in Figure 28.
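A sketch of a trajectory plot along these lines, using Canvas.line on a synthetic random walk (the data here is a stand-in, not the whitepaper’s own example):

    import numpy as np
    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    rng = np.random.default_rng(7)
    steps = rng.normal(size=(1_000_000, 2)).cumsum(axis=0)   # one-million-point random walk
    traj = pd.DataFrame(steps, columns=["x", "y"])

    cvs = ds.Canvas(plot_width=800, plot_height=800)
    agg = cvs.line(traj, "x", "y", agg=ds.count())   # rasterize every connecting segment
    img = tf.shade(agg, how="eq_hist")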

In Figure 28, using one million points, there is an overall synthetic

random-walk trajectory, but a cyclic ‘wobble’ can be seen when

zooming in partially, and small local noisy values can be seen when

zooming in fully. These patterns could be very important if, for example, one is summing up total path length, and they are easily discoverable

interactively with datashader, because the full data set is available,

with no downsampling required.

Figure 27. Multiple Overlapping Time Series Curves

Figure 28. Zooming in on the Data (zoom levels 0, 1 and 2)

About Continuum Analytics

Continuum Analytics’ Anaconda is the leading open data science platform powered by Python. We put superpowers into the hands of people who are changing the world. Anaconda is trusted by leading businesses worldwide and across industries – financial services, government, health and life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda helps data science teams discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage open data science environments and harness the power of the latest open source analytic and technology innovations. Visit www.continuum.io.

Summary

In this paper, we have shown some of the major challenges in

presenting Big Data visualizations, the failures of traditional

approaches to overcome these challenges and how a new approach

surmounts them. This new approach is a three-step process that:

• optimizes the display of the data to fit how the human visual system works;

• employs statistical sophistication to ensure that data is transformed and scaled appropriately;

• encourages exploration of data with ease of iteration, by providing defaults that reveal the data automatically; and

• allows full customization that lets data scientists adjust every step of the process between data and visualization.

We have also introduced the datashader library available with

Anaconda, which supports all of this functionality. Datashader uses

Python code to build visualizations and powers the plotting

capabilities of Anaconda Mosaic, which explores, visualizes and

transforms heterogeneous data and lets you make datashader plots

out-of-the-box, without the need for custom coding.

The serious limitations of traditional approaches to visualizing Big

Data are no longer an issue. The datashader library is now available

to usher in a new era of seeing the truth in your data, to help you

make smart, data-driven decisions.