

WHITEPAPER

Dr. James Bednar, Open Source Tech Lead & Senior Solutions Architect, Anaconda

Big Data Visualization With Datashader


In This Whitepaper

The best way to explore and communicate insights about data is through interactive visualization. Whether you are dealing with geospatial, time series, or tabular data, interactive graphics allow everyone on your team, from analysts to executives, to understand the patterns in your data.

In this whitepaper, we will examine:

• The complexity of visualization in the era of Big Data

• How Datashader helps tame this complexity

• The power of adding interactivity to your visualization

Traditional visualization systems and techniques were designed in an era of data scarcity, but in today's Big Data world of abundant information, understanding is the scarce commodity. As data grows to include millions and billions of points, traditional visualization techniques break down. Whether you're loading the data into limited memory or separating the signal from the noise, as data gets big, visualization gets challenging.

Older approaches focused on rendering individual data points faithfully, which was appropriate for the small data sets previously available. However, when inappropriately applied to large data sets, these techniques suffer from systematic problems like overplotting, oversaturation, undersaturation, undersampling, and underutilized dynamic range, all of which obscure the true properties of large data sets and lead to incorrect data-driven decisions and missed business opportunities.

Fortunately, Anaconda is here to help with Datashader, a Big Data visualization tool specifically designed to solve these problems head-on.




Visualization In The Era Of Big Data: Getting It Right Is Not Always Easy

The role of a data scientist is to collect data, extract useful knowledge from the data, and accurately communicate this knowledge to business stakeholders so they can make informed, data-driven decisions. But in a Big Data world, accurately conveying all the data, or even just the overall structure or shape of this data, can be challenging. In addition to presenting various storage and computational issues, Big Data also magnifies significant problems with standard plotting tools.

For a simple scatterplot with only ten or a hundred points, it is easy to display all points, and observers can instantly perceive an outlier off to the side of the data's cluster. But if your data set has hundreds of millions or billions of data points, easily imaginable for Big Data, there are far more points than can be displayed as pixels on a typical computer monitor, leading to what might be called the "points-per-pixel problem." You are also much more likely to run the risk of overplotting, where points in a large cluster overlap each other and obscure the structure of the data within the cluster. As seen below in Figures 1A and 1B, when two categories of points are overlaid, the appearance of the result can be very different depending on which one is plotted first.

These common plotting problems, as well as oversaturation, undersaturation, and various binning issues, can lead your business to make incorrect conclusions based on misleading visualizations. But how are you to know when a visualization is lying to you?

Faced with so many challenging limitations, technical "solutions" are frequently proposed to head off issues like overplotting, but too often these so-called solutions are misapplied. One approach is downsampling, where the number of data points is algorithmically reduced, but this can result in missing important aspects of your data. Another approach is to make data points partially transparent, so that they add up rather than overplot. However, setting the amount of transparency correctly is difficult and error-prone, and leaves unavoidable tradeoffs between the visibility of isolated samples and overplotting inside dense clusters.
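The transparency tradeoff can be made concrete with a little arithmetic. Under standard alpha compositing, n identical overlapping points reach an opacity of 1 - (1 - alpha)^n, so any alpha low enough to keep dense clusters from saturating also makes isolated samples invisible. A small illustrative sketch (the specific alpha and counts are invented for the example):

```python
# Illustration of the alpha-transparency tradeoff: the opacity reached
# after n identical overlapping points is 1 - (1 - alpha)^n, so no
# single alpha value works at every point density.

def opacity(n, alpha):
    return 1.0 - (1.0 - alpha) ** n

# An alpha small enough to keep a 100,000-point cluster from
# saturating completely...
alpha = 1e-5
print(round(opacity(100_000, alpha), 2))  # 0.63: cluster still readable

# ...renders an isolated sample essentially invisible (about 1e-5
# opacity, far below what the eye can detect on screen).
print(opacity(1, alpha))
```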

Neither approach properly addresses the key problem in visualization of large data sets: systematically and objectively displaying large amounts of data in a way that can be perceived and understood by the human visual system. And when exploring really Big Data, the visualization is all you have, so it is crucial to get it right!

Figure 1 (A, B). Overplotting: Big Data magnifies significant problems with standard plotting tools.


Visualizing Big Data Effectively

Recognizing the growing need to rethink traditional, outmoded visualization processes, Anaconda developed a new approach to Big Data visualization that provides an optimized interaction between the data and the human visual system, automatically avoiding all of the typical plotting pitfalls previously discussed.

This new approach optimizes the display of the data to fit how the human visual system works; employs statistical sophistication to ensure that data is transformed and scaled appropriately; encourages easy and accurate iterative exploration by providing defaults that reveal the data automatically at every scale; and allows full customization, letting data scientists adjust every step of the process between data and visualization.

The approach is a three-part operation:

• Synthesize
The first step is to project or synthesize your data onto a scene. One begins with free-form data, then decides how best to lay that data out on the monitor. This step is about a human being deciding what to visualize, which will then be rendered automatically onto the screen by Datashader in the subsequent steps.

• Rasterize
The second step is rasterization, which can be thought of as replotting all of the data on a grid, so that each square of that grid serves as a finite subsection of the data space. In the simplest case, you then count all the data points that fall within each square of this grid. In other cases, you can perform a variety of user-selectable operations like averaging or measuring standard deviation. This step results in an "aggregate" view, binned into a fixed-size data structure regardless of the size of the original dataset.

• Transfer
In this step, the aggregates are transformed into squares of color, producing an image. The colormapping represents the data that lies within each subsection of the grid and needs to be chosen carefully based on what we know about how our brains process patterns of light.
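The three steps above can be sketched in plain NumPy, standing in for Datashader's own optimized implementation (the grid size, synthetic data, and log-scaled grayscale colormap here are illustrative choices, not Datashader defaults):

```python
import numpy as np

# A minimal NumPy sketch of the Synthesize -> Rasterize -> Transfer
# pipeline described above. Datashader performs these steps far more
# efficiently; this only illustrates the shape of the computation.

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)  # synthetic point cloud
y = rng.normal(0.0, 1.0, 100_000)

# Step 1, Synthesize: choose the scene, i.e. the region of data space
# and the pixel grid it will be projected onto.
x_range, y_range = (-3, 3), (-3, 3)
width, height = 300, 200

# Step 2, Rasterize: bin every point into the grid and count the
# points per bin. The aggregate's size depends only on the grid,
# never on the number of input points.
agg, _, _ = np.histogram2d(
    y, x, bins=(height, width), range=(y_range, x_range))

# Step 3, Transfer: map aggregate values to color; here a simple
# log-scaled grayscale image with values in 0..255.
shaded = np.log1p(agg)
img = (255 * shaded / shaded.max()).astype(np.uint8)

print(img.shape)  # one value per pixel: (200, 300)
```

Note how each step is an explicit, inspectable operation on ordinary arrays; in Datashader the same stages are exposed as configurable pipeline components rather than hidden inside the renderer.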

This three-step solution provides several key advantages. First, transforming the aggregated data into a visualization is now a fully exposed operation, unlike the implicit and error-prone transformations in traditional plotting programs. With Datashader, the aggregated data is processed according to a fully specified, rigorous criterion, not subject to human judgment. Algorithmic processing of intermediate stages in the visualization pipeline can now be used both to reduce time-consuming manual interventions and, even more importantly, to reduce the likelihood of accidentally covering up your data.

In traditional approaches, these steps are each done by trial and error, and you typically have to adjust the plot repeatedly before it will show any useful data at all. Our approach, on the other hand, automates these steps to reveal your data the very first time it is processed, while keeping the automation parameters easily accessible for final tweaking if needed.


Datashader Data Optimization in Action

10-million-datapoint trajectories from the OpenSky project, showing flight paths between European airports. Without any parameter adjustment, a wealth of additional detail is visible when zooming in. Eventually every sample can be revealed, point by point, making it simple for users to discover patterns in their data at any level.


Datashader For Big Data Visualization

To provide all of the functionality described above, Anaconda created Datashader: a flexible and configurable visualization pipeline designed to reveal the patterns in large data sets automatically. Datashader provides statistically accurate transformation of data at every stage, allowing for rapid iteration of visual styles and configurations, as well as interactive selection and filtering. It completely prevents the typical plotting pitfalls previously discussed, including overplotting, saturation, undersampling, and underutilized dynamic range.

Data can still be obscured by being binned into the resolution of your monitor, but Datashader addresses this important issue by providing fully interactive exploration in web browsers for billions of datapoints on an ordinary laptop, or for even larger datasets using large compute clusters. Achieving the required performance comes from two key features of Datashader: using Python with Numba and Dask to compute the aggregates at bare-metal speeds, then transferring only the aggregate to the web browser at any instant, so that visualization speed is independent of the dataset size.
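The reason only the aggregate needs to cross the wire can be seen in a small sketch: whatever the input size, the binned grid sent to the browser occupies a constant amount of memory. (NumPy stands in here for the Numba/Dask aggregation Datashader actually uses; the grid dimensions are arbitrary.)

```python
import numpy as np

# Demonstrate that an aggregate's size is fixed by the pixel grid,
# not by the number of points that went into it.

def aggregate(x, y, width=400, height=300):
    agg, _, _ = np.histogram2d(y, x, bins=(height, width),
                               range=((0, 1), (0, 1)))
    return agg

rng = np.random.default_rng(1)
small = aggregate(rng.random(1_000), rng.random(1_000))
large = aggregate(rng.random(5_000_000), rng.random(5_000_000))

# Both aggregates occupy exactly the same, fixed amount of memory,
# so transfer time to the browser is independent of dataset size.
print(small.nbytes == large.nbytes)  # True
```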

The web-browser interface is provided by the separate, general-purpose HoloViews and Bokeh open-source libraries, which make it simple to create richly interactive browser-based visualizations.

Figure 2 illustrates the complete Datashader pipeline from data to web browser, with each computational step listed across the top of the diagram and each data structure or object involved listed along the bottom. Breaking the computation into this set of stages is what gives Datashader its power, because only the first couple of stages require the full data set, while the remaining stages use a fixed-size data structure regardless of the input, making it practical to work with even extremely large datasets.

To illustrate the power of visualizing rich structures at very large scale, let's take a look at two illustrative examples.


Figure 2. Stages of a Datashader Pipeline: Data → Projection → Scene → Aggregation → Aggregate(s) → Transformation → Colormapping → Image → Embedding → Plot


Datashader Example 1: 2010 Census Data

The 2010 Census collected a variety of demographic information for the 300+ million people in the US. Here, we'll focus on the subset of the data selected by the Cooper Center, who produced a map of US population density and racial/ethnic makeup. Each dot in the map corresponds to a specific person counted in the census, located approximately at their residence.

In this example, we'll show the results of running novel Datashader analyses focusing on various aspects of the data. As a first "dump" of the data, using no changes to any of the parameters apart from choosing a colorful colormap, Figure 3 shows the overall population density by plotting the x,y location of each person.

As we can see, patterns relating to geography (such as mountain ranges), infrastructure (such as roads in the Midwest), and history (such as high population density along the East Coast) are all clearly visible, and additional structure becomes visible at every scale when zooming into any local region.

For this data set, we can add additional information by colorizing each pixel by the racial/ethnic category reported on the census form for that person, using a key of:

• Purple: Hispanic/Latino

• Cyan: Caucasian/White

• Green: African American/Black

• Red: Asian/Pacific Islander

• Yellow: Other, including Native American

Figure 3. Visualizing US Population Density with Datashader

Figure 4. Visualizing US Population by Race with Datashader


Datashader will then merge all the categories present in each pixel to show the average racial/ethnic makeup of that pixel. The results in Figure 4 show clear patterns of segregation at the national level, again using only the default parameter settings with no custom tuning or adjustment. With other approaches, visualizing such widely ranging levels of population density between rural and urban areas would require extensive domain knowledge and experimentation to choose appropriate parameters to reveal the patterns.
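The per-pixel category merging described above can be sketched with plain NumPy. Here we keep one count grid per category and color each pixel by its dominant category; this is a simplified stand-in for Datashader's categorical aggregation and proportional color mixing, with the grid size and synthetic categories invented for illustration:

```python
import numpy as np

# Sketch of per-category aggregation: one count grid per category,
# later combined into a single color per pixel.

rng = np.random.default_rng(2)
n, n_cats = 50_000, 5          # 5 stand-ins for the race/ethnicity key
x = rng.random(n)
y = rng.random(n)
cat = rng.integers(0, n_cats, n)

width = height = 100
ix = np.minimum((x * width).astype(int), width - 1)
iy = np.minimum((y * height).astype(int), height - 1)

# counts[c, row, col] = number of category-c points in that pixel.
counts = np.zeros((n_cats, height, width), dtype=np.int64)
np.add.at(counts, (cat, iy, ix), 1)

# Color each pixel by its dominant category (a simpler rule than
# Datashader's averaged color mixing, but the same aggregate feeds both).
dominant = counts.argmax(axis=0)
print(counts.sum())  # 50000: every point lands in exactly one bin
```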

Even greater levels of segregation are visible when zooming into any major population center, such as those shown in Figure 5. The historic Chinatown neighborhoods of Chicago and Manhattan are clearly visible (colored in red), and other neighborhoods are clearly segregated by race/ethnicity.

Datashader supports interactive zooming all the way in to see individual data points, so that the amount of segregation can be seen very clearly at a local level. Here, Datashader has been told to automatically increase the size of each point when zooming in so far that data points become sparse, making individual points more visible.

Figure 5. Race & Ethnicity with Datashader

A. Zooming in to view race/ethnicity data in Chicago

B. Zooming in to view race/ethnicity data in NYC

C. Zooming in to view race/ethnicity data in Los Angeles

D. Zooming in to view race/ethnicity data in Chicago


Datashader Example 2: NYC Taxi Data Set

For this example, we'll use part of the well-studied NYC taxi trip database: the locations of all New York City taxicab pickups and dropoffs from January 2015. The data set contains 12 million pickup and dropoff locations, with passenger counts and times of day. First, let's look at a scatterplot of the dropoff locations as it would be rendered, by subsampling, in a traditional plotting program (Figure 6).

Here, the location of Manhattan can be seen clearly, as can the rectangular Central Park area with few dropoffs, but serious overplotting issues obscure any more detailed structure. With the default settings of Datashader, apart from the colormap, all of the data can be shown with no subsampling required, revealing much richer structure. In Figure 7, the entire street grid of the New York City area is now clearly visible, with increasing levels of detail available by zooming in to particular regions, without needing any specially tuned or adjusted parameters.

By analogy to the US census race data, you can also treat each hour of the day as a category and color them separately, revealing additional temporal patterns using the color key of:

• Red: 12 a.m. (midnight)

• Yellow: 4 a.m.

• Green: 8 a.m.

• Cyan: 12 p.m. (noon)

• Blue: 4 p.m.

• Purple: 8 p.m.

In Figure 8, there are clearly different regions of the city where pickups happen at specific times of day, with rich structure that can be revealed by zooming in to see local patterns and relate them to the underlying geographical map, as shown in Figure 9.

Figure 6. Plotting NYC Taxi Dropoffs with Bokeh

Figure 7. Plotting NYC Taxi Dropoffs with Datashader

Figure 8. NYC Taxi Pickup Times

Figure 9. Taxi Pickup Times Zoomed with Overlay


Operations In Visualization

Once the data is in Datashader, it becomes very simple to perform even quite sophisticated computations on the visualization, not just on the original data. For instance, we can easily plot all the locations in NYC where there are more pickups than dropoffs in shades of red, and all locations where there are more dropoffs than pickups in shades of blue, as shown in Figure 10.

Plotted in this way, it is clear that pickups are much more likely along the main arteries (presumably where a taxi can be hailed successfully), while dropoffs are more likely along side streets. LaGuardia Airport (circled) also shows clearly segregated pickup and dropoff areas, with pickups being more widespread, presumably because the pickup zones are on a lower level and thus have lower GPS accuracy due to occlusion of the satellites.

With Datashader, building a plot like this is very simple once the data has been aggregated. An aggregate is an xarray data structure, so if we create an aggregate named "drops" that contains the dropoff locations and one named "picks" that contains the pickup locations, then drops.where(drops>picks) will be a new aggregate holding all the areas with more dropoffs, and picks.where(picks>drops) will hold all those with more pickups. These can then be merged to make the plot in Figure 10, in one line of Datashader code. Making a plot like this in another plotting package would require far more code, essentially requiring replicating Datashader's aggregation step.
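The pickup/dropoff comparison can be sketched as follows, using NumPy arrays and np.where in place of the xarray aggregates Datashader produces (xarray's .where() behaves the same way: values where the condition holds, NaN elsewhere; the Poisson counts are synthetic stand-ins for the real aggregates):

```python
import numpy as np

# Sketch of comparing two aggregates cell by cell, as in the
# pickup-vs-dropoff plot described above.

rng = np.random.default_rng(3)
picks = rng.poisson(3.0, size=(200, 200)).astype(float)
drops = rng.poisson(3.0, size=(200, 200)).astype(float)

# Keep only the cells where one kind of event outnumbers the other;
# everything else becomes NaN and would be left transparent when shaded.
more_drops = np.where(drops > picks, drops, np.nan)
more_picks = np.where(picks > drops, picks, np.nan)

# A cell can favor dropoffs, favor pickups, or tie, never both, so the
# two selections can be merged into one plot without conflicts.
overlap = ~np.isnan(more_drops) & ~np.isnan(more_picks)
print(overlap.any())  # False
```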

Similarly, for the 2010 Census example, it only takes one line of Datashader code to filter the race/ethnicity data to show only those pixels containing at least one person of every category (Figure 11A). The color then indicates the predominant race/ethnicity, but only for those areas (mainly major metropolitan areas) that have all races and ethnicities included. Another single line of code will select only those areas where the number of African Americans/Blacks is larger than the number of Caucasians/Whites, as shown in Figure 11B. Here, the predominantly African American/Black neighborhoods of major cities have been selected, along with many rural areas in the Southeast, as well as a few largely Hispanic neighborhoods on the West Coast that nonetheless have more Blacks than Whites.

Alternatively, we can simply highlight the top 1% of pixels by population density, in this case by using a color range with 100 shades of grey and then changing the top one to red (Figure 11C).

Nearly any such query or operation that can be expressed at the level of pixels (locations) can be expressed similarly simply, providing a powerful counterpart to queries that are already easy to perform at the raw data level, such as filtering by criteria provided as columns in the data set.
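Each of these pixel-level queries reduces to a one-line mask over a per-category count grid. A NumPy sketch on synthetic counts (the category indices, grid size, and Poisson data are invented for illustration; the real aggregates would come from Datashader):

```python
import numpy as np

# Pixel-level queries on a per-category count grid,
# shaped (categories, rows, cols).

rng = np.random.default_rng(4)
counts = rng.poisson(1.0, size=(5, 100, 100))  # 5 synthetic categories
density = counts.sum(axis=0)

# Figure 11A analog: pixels with at least one person of every category.
all_present = (counts > 0).all(axis=0)

# Figure 11B analog: pixels where category 2 outnumbers category 1
# (stand-ins for the two categories compared in the text).
cat2_majority = counts[2] > counts[1]

# Figure 11C analog: the densest ~1% of pixels by total population.
top1 = density >= np.percentile(density, 99)

print(all_present.shape, cat2_majority.shape, top1.shape)
```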

Figure 10. Visualizing Dropoff (blue) vs. Pickup (red) Locations

Figure 11. Filtering US Census Data

A. US census data, only including pixels with every race/ethnicity included

B. US census data, only including pixels where African Americans/Blacks outnumber Caucasians/Whites

C. US population density, with the 1% most dense pixels colored in red


Other Data Types

Although the previous examples focus on scatterplots, Datashader also supports line plots (such as time series), trajectories, and raster plots.

Line plots behave similarly to Datashader scatterplots, avoiding the very serious overplotting and occlusion effects that occur for plots of multiple overlaid time-series curves by ensuring that overlapping lines are combined in a principled way (see Figure 12).

With Datashader, time-series data with millions or billions of points can be plotted easily, with no downsampling required, allowing isolated anomalies to be detected easily and making it simple to zoom in to see lower-level substructure.
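The principled combination of overlapping curves amounts to counting how many curves pass through each pixel instead of painting them opaquely on top of one another. A simplified sketch (sampling each curve at one point per pixel column, where Datashader's line aggregation rasterizes every segment; the sine-wave data is synthetic):

```python
import numpy as np

# Count-based line aggregation: dense bundles and lone outliers both
# stay visible, because every curve contributes to the counts rather
# than occluding the curves drawn before it.

rng = np.random.default_rng(5)
width, height = 400, 100
t = np.linspace(0, 2 * np.pi, width)

counts = np.zeros((height, width), dtype=np.int64)
for _ in range(1000):  # 1,000 overlapping noisy sine curves
    curve = np.sin(t + rng.uniform(0, 2 * np.pi)) \
        + rng.normal(0, 0.05, width)
    # Map each curve value into a pixel row and bump that cell's count.
    rows = np.clip(((curve + 2) / 4 * height).astype(int), 0, height - 1)
    counts[rows, np.arange(width)] += 1

# Every column accumulates exactly one hit per curve.
print((counts.sum(axis=0) == 1000).all())  # True
```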

Trajectory plots can similarly use all the data available, even for millions or billions of points, without downsampling and with no parameter tuning, revealing substructure at every level of detail, as in Figure 13. Using one million points, this example shows a large-scale synthetic random-walk trajectory; a cyclic "wobble" can be seen when zooming in partially, and small local noisy values can be seen when zooming in fully. These patterns could be very important if, for example, summing up total path length, and they are easily discoverable interactively with Datashader, because the full data set is available with no downsampling required.

Figure 12. Multiple Overlapping Time Series Curves

Figure 13. Zooming in on the Data (zoom levels 0, 1, and 2)


Summary

In this paper, we have shown some of the major challenges in creating visualizations of Big Data, the failures of traditional approaches to overcome these challenges, and how a new approach to visualization surmounts them. We have also introduced the Datashader library available with Anaconda Enterprise, and demonstrated how the serious limitations of traditional approaches to visualizing Big Data no longer need to be an issue.

Datashader is here to usher in a new era of seeing the truth in Big Data. Now everyone on your team, from analysts to executives, can properly perceive and understand the patterns in your data, in order to make smart, data-driven business decisions.


© Copyright 2017 Anaconda, Inc. All Rights Reserved.

About Anaconda, Inc.

With over 4.5 million users, Anaconda is the world’s most popular Python data science platform. Anaconda, Inc. continues to lead open source projects like Anaconda, NumPy and SciPy that form the foundation of modern data science. Anaconda’s flagship product, Anaconda Enterprise, allows organizations to secure, govern, scale and extend Anaconda to deliver actionable insights that drive businesses and industries forward.

Edited by Rory Merritt, Technical Marketing Copywriter, Anaconda

The Most Popular Python Data Science Platform

Anaconda.com