WHITEPAPER
Dr. James Bednar, Open Source Tech Lead & Senior Solutions Architect, Anaconda
Big Data Visualization With Datashader
In This Whitepaper

In this whitepaper, we will examine:

• The complexity of visualization in the era of Big Data
• How Datashader helps tame this complexity
• The power of adding interactivity to your visualization

The best way to explore and communicate insights about data is through interactive visualization. Whether you are dealing with geospatial, time series, or tabular data, interactive graphics allow everyone on your team—from analysts to executives—to understand the patterns in your data.
Traditional visualization systems and techniques were designed in an era of data scarcity, but in today’s Big Data world of an incredible abundance of information, understanding is the key commodity. As data grows to include millions and billions of points, traditional visualization techniques break down. Whether you’re loading the data into limited memory or separating the signal from the noise, as data gets big, visualization gets challenging.
Older approaches focused on rendering individual data points faithfully, which was appropriate for the small data sets previously available. However, when inappropriately applied to large data sets, these techniques suffer from systematic problems like overplotting, oversaturation, undersaturation, undersampling, and underutilized dynamic range, all of which obscure the true properties of large data sets and lead to incorrect data-driven decisions and missed business opportunities.
Fortunately, Anaconda is here to help with Datashader, a Big Data visualization tool specifically designed to solve these problems head-on.
2 ANACONDA WHITEPAPER • Big Data Visualization With Datashader
Visualization In The Era Of Big Data: Getting It Right Is Not Always Easy
The role of a data scientist is to collect data, extract useful knowledge
from the data, and accurately communicate this knowledge to
business stakeholders so they can make informed, data-driven
decisions. But in a Big Data world, accurately conveying all the data,
or even just the overall structure or shape of this data, can be
challenging. In addition to presenting various storage and
computational issues, Big Data also magnifies significant problems
with standard plotting tools.
For a simple scatterplot with only ten or a hundred points, it is easy
to display all points, and observers can instantly perceive an outlier
off to the side of the data’s cluster. But if your data set has hundreds
of millions or billions of data points—easily imaginable for Big
Data—there are far more points than can be displayed as pixels on a
typical computer monitor, leading to what might be called the
“points-per-pixel problem.” You are also much more likely to run the
risk of overplotting, where points in a large cluster overlap each
other and obscure the structure of the data within the cluster. As
seen below in Figures 1A and 1B, when the two categories are
overlaid, the appearance of the result can be very different,
depending on which one is plotted first.
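The order dependence in Figure 1 can be reproduced with a toy "painter's algorithm," the strategy traditional plotting programs use: whichever category is drawn last overwrites the pixels beneath it. The sketch below is purely illustrative (hypothetical clusters on a tiny grid, not Datashader code) and shows that the same data yields different images depending on draw order.

```python
import numpy as np

# Illustrative sketch (hypothetical data): two overlapping point categories
# rasterized with a painter's algorithm, where the category drawn last
# overwrites each pixel it touches.
rng = np.random.default_rng(0)
grid = 20  # a tiny 20x20 "screen"

def rasterize(points_by_category, order):
    """Paint each category's points onto the grid in the given order.
    0 marks an empty pixel; later categories overwrite earlier ones."""
    img = np.zeros((grid, grid), dtype=int)
    for cat in order:
        xs, ys = points_by_category[cat]
        img[ys, xs] = cat
    return img

def cluster(center, n=5000):
    """A dense Gaussian cluster of n points, snapped to pixel coordinates."""
    pts = rng.normal(center, 3.0, size=(n, 2)).astype(int)
    pts = np.clip(pts, 0, grid - 1)
    return pts[:, 0], pts[:, 1]

points = {1: cluster(8), 2: cluster(12)}  # two heavily overlapping clusters

img_a = rasterize(points, order=[1, 2])  # category 2 painted last
img_b = rasterize(points, order=[2, 1])  # category 1 painted last

# Identical data, different pictures, purely from draw order:
print((img_a != img_b).sum(), "pixels differ")
```

Every pixel covered by both categories takes on whichever category happened to be painted last, which is exactly the overplotting hazard described above.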
These common plotting problems—as well as oversaturation,
undersaturation, and various binning issues—can lead your business
to make incorrect conclusions based on misleading visualizations.
But how are you to know when the visualization is lying to you?
Faced with so many challenging limitations, technical “solutions” are
frequently proposed to head off issues like overplotting, but too often
these so-called solutions are misapplied. One approach is
downsampling, where the number of data points is algorithmically
reduced, but which can result in missing important aspects of your
data. Another approach is to make data points partially transparent,
so that they add up, rather than overplot. However, setting the
amount of transparency correctly is difficult and error-prone, and
leaves unavoidable tradeoffs between visibility of isolated samples
and overplotting inside dense clusters.
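The transparency tradeoff can be made concrete with a little arithmetic: a pixel covered by n points, each drawn with opacity alpha, reaches a combined opacity of 1 − (1 − alpha)^n under standard alpha compositing. The sketch below (illustrative values, not Datashader code) shows why no single alpha serves both sparse and dense regions.

```python
# Combined opacity of a pixel covered by n overlapping points, each drawn
# with opacity alpha, under standard alpha compositing.
def combined_opacity(alpha, n):
    return 1 - (1 - alpha) ** n

alpha = 0.05
faint = combined_opacity(alpha, 1)          # isolated point: barely visible
dense_100 = combined_opacity(alpha, 100)    # dense cluster: near-opaque
dense_1000 = combined_opacity(alpha, 1000)  # even denser: also near-opaque

print(f"{faint:.3f} {dense_100:.3f} {dense_1000:.3f}")
# Beyond roughly 100 overlaps the pixel is effectively opaque, so a
# 100-point cluster and a 1,000-point cluster become indistinguishable,
# while isolated points remain nearly invisible.
```

Raising alpha makes isolated points visible but saturates clusters even sooner; lowering it preserves cluster gradation but erases outliers, which is the unavoidable tradeoff described above.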
Neither approach properly addresses the key problem in
visualization of large data sets: systematically and objectively
displaying large amounts of data in a way that can be perceived and
understood by the human visual system. And when exploring really
Big Data, the visualization is all you have, so it is crucial to get it right!
Figure 1. Overplotting (A and B: the same two categories plotted in different orders)
Visualizing Big Data Effectively
Recognizing the growing need to rethink traditional, outmoded
visualization processes, Anaconda developed a new approach to Big
Data visualizations that provides an optimized interaction between
the data and the human visual system, automatically avoiding all of
the typical plotting pitfalls previously discussed.
This new approach optimizes the display of the data to fit how the
human visual system works, employs statistical sophistication to
ensure that data is transformed and scaled appropriately, encourages
easy and accurate iterative exploration of data by providing defaults
that reveal the data automatically at every scale, and allows full
customization that lets data scientists adjust every step of the process
between data and visualization.
The approach is a three-part operation:
• Synthesize
The first step is to project or synthesize your data onto a scene.
One begins with free-form data, then makes decisions as to
how best to initially lay out that data on the monitor. This step
is about a human being deciding what to visualize, which will
then be rendered automatically onto the screen by Datashader
in the subsequent steps.
• Rasterize
The second step is rasterization, which can be thought of as
replotting all of the data on a grid, so that any square of that
grid serves as a finite subsection of the data space. In the
simplest case, you then count all the data points that fall
within each square of this grid. In other cases, you can
perform a variety of user-selectable operations like averaging
or measuring standard deviation. This step results in an
“aggregate” view, binned into a fixed-size data structure
regardless of the size of the original dataset.
• Transfer
In this step, the aggregates are transformed into squares of
color, producing an image. The colormapping will represent
the data that lies within that subsection of the grid and needs
to be chosen carefully based on what we know about how our
brains process patterns of light.
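The rasterize and transfer steps can be sketched in a few lines of NumPy. This is an illustrative stand-in, not Datashader's actual implementation (the data, grid size, and log scaling below are arbitrary choices for the example): points are binned onto a fixed grid, and the per-cell counts are then mapped to grayscale pixel values.

```python
import numpy as np

# Minimal sketch of the rasterize and transfer steps, using synthetic data.
rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 100_000)
y = rng.normal(0.0, 1.0, 100_000)

# Rasterize: count the points falling in each cell of a 50x50 grid.
# The resulting "aggregate" has a fixed size regardless of the input size.
counts, _, _ = np.histogram2d(x, y, bins=50, range=[[-3, 3], [-3, 3]])

# Transfer: log-scale the counts, then normalize to 0-255 grayscale.
scaled = np.log1p(counts)
image = (255 * scaled / scaled.max()).astype(np.uint8)

print(image.shape)  # (50, 50) -- the same shape for any number of points
```

The Synthesize step corresponds to the human choices made before this code runs: which columns become x and y, and which region of the data space the grid covers.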
This three-step solution provides several key advantages. First,
transforming the aggregated data into a visualization is now a fully
exposed operation, unlike the implicit and error-prone
transformations in traditional plotting programs. With Datashader,
the aggregated data is processed according to a fully specified,
rigorous criterion, not subject to human judgment. Algorithmic
processing of intermediate stages in the visualization pipeline can
now be used both to reduce time-consuming manual interventions,
and even more importantly, reduce the likelihood of covering up
your data accidentally.
In traditional approaches, these steps are each done by trial and
error, and you typically have to repeatedly adjust the plot before it
will show any useful data at all. Our approach, on the other hand,
automates these steps to reveal your data the very first time it is
processed, while making the automation parameters easily accessible
for final tweaking if needed.
Anaconda has developed a new approach to Big Data visualizations that automatically avoids all of the typical plotting pitfalls.
Datashader Data Optimization in Action
10-million-datapoint trajectories from the
OpenSky project, showing flight paths between
European airports.
Without any parameter adjustment, a wealth of
additional details is visible when zooming in.
Eventually every sample can be revealed, point by
point, making it simple for users to discover
patterns in their data at any level.
Datashader For Big Data Visualization
To provide all of the functionality described above, Anaconda
created Datashader—a flexible and configurable visualization
pipeline designed to reveal the patterns in large data sets
automatically. Datashader provides statistically accurate
transformation of data at every stage, allowing for rapid iteration of
visual styles and configurations, as well as interactive selections and
filtering. It completely prevents the typical plotting pitfalls
previously discussed, including overplotting, saturation,
undersampling, and underutilized dynamic range.
Data can still be obscured by being binned into the resolution of
your monitor, but Datashader addresses this important issue by
providing fully interactive exploration in web browsers for billions
of datapoints on an ordinary laptop, or for even larger datasets using
large compute clusters. Achieving the performance levels required
comes from two unique features of Datashader: using Python with
Numba and Dask to compute the aggregates at bare-metal speeds,
then transferring only the aggregate to the web browser at any
instant so that visualization speed is independent of the dataset size.
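The second feature, transferring only the aggregate, works because the aggregate's size depends only on the grid resolution, never on the dataset. The sketch below illustrates that property with plain NumPy (Datashader itself computes these aggregates with Numba and Dask; the random data and 600x400 grid here are arbitrary choices for the example).

```python
import numpy as np

# Whatever the dataset size, the aggregate shipped to the browser is a
# fixed-size grid, so interaction speed does not depend on the point count.
def aggregate(n_points, width=600, height=400):
    rng = np.random.default_rng(1)
    x = rng.random(n_points)
    y = rng.random(n_points)
    counts, _, _ = np.histogram2d(y, x, bins=(height, width))
    return counts

for n in (1_000, 1_000_000):
    agg = aggregate(n)
    print(f"{n:>9,} points -> aggregate {agg.shape}, {agg.nbytes:,} bytes")
```

A thousandfold increase in data volume leaves the payload sent to the browser unchanged, which is why panning and zooming stay responsive even for billions of points.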
The web-browser interface is provided by the separate, general-
purpose HoloViews and Bokeh open-source libraries, which make it
simple to create richly interactive browser-based visualizations.
Figure 2 illustrates the complete Datashader pipeline from data to
web browser, with each computational step listed across the top of
the diagram and each data structure or object involved listed along
the bottom. Breaking up the computation into this set of stages is
what gives Datashader its power, because only the first couple of
stages require the full data set while the remaining stages use a
fixed-size data structure regardless of the input data set, making it
practical to work with even extremely large datasets.
To illustrate the power of visualizing rich structures at a very large
scale, let’s take a look at two illustrative examples.
Figure 2. Stages of a Datashader Pipeline. Computational steps (top): Projection → Aggregation → Transformation → Colormapping → Embedding. Data structures (bottom): Data → Scene → Aggregate(s) → Image → Plot.
Datashader Example 1: 2010 Census Data
The 2010 Census collected a variety of demographic information for
the 300+ million people in the US. Here, we’ll focus on the subset of
the data selected by the Cooper Center, who produced a map of US
population density and racial/ethnic makeup. Each dot in the map
corresponds to a specific person counted in the census, located
approximately at their residence.
In this example, we’ll show the results of running novel Datashader
analyses focusing on various aspects of the data. As a first “dump” of
the data, using no changes to any of the parameters apart from
choosing a colorful colormap, Figure 3 shows the overall population
density by plotting the x,y locations of each person.
As we can see, patterns relating to geography such as mountain
ranges, infrastructure such as roads in the Midwest, and history such
as high population density along the East coast are all clearly visible,
and additional structures will be visible at every scale when zooming
into any local region.
For this data set, we can add additional information by colorizing
each pixel by the racial/ethnic category reported on the census data
for that person, using a key of:
• Purple: Hispanic/Latino
• Cyan: Caucasian/White
• Green: African American/Black
• Red: Asian/Pacific Islander
• Yellow: Other including Native American
Figure 3. Visualizing US Population Density with Datashader
Figure 4. Visualizing US Population by Race with Datashader
Datashader will then merge all the categories present in each pixel to
show the average racial/ethnic makeup of that pixel. The results in
Figure 4 show clear patterns of segregation at the national level, again
using only the default parameter settings with no custom tuning or
adjustment. With other approaches, trying to visualize such widely
ranging levels of population density between rural and urban areas
would require extensive domain knowledge and experimentation to
choose appropriate parameters to reveal the patterns.
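Handling such widely ranging densities automatically comes from histogram-equalized colormapping, Datashader's default shading. The sketch below loosely reimplements the idea in NumPy (it is not Datashader's actual code, and the toy counts are hypothetical): each nonzero count is mapped to its rank, so every shade of the colormap gets used no matter how skewed the counts are.

```python
import numpy as np

# Loose sketch of histogram-equalized colormapping: map each nonzero count
# to its rank among all nonzero counts, normalized onto 0..nbins-1 shades.
def eq_hist(counts, nbins=256):
    out = np.zeros_like(counts, dtype=float)
    nz = counts > 0
    ranks = counts[nz].argsort().argsort()  # 0..k-1 rank of each nonzero cell
    out[nz] = ranks / max(ranks.max(), 1)   # normalize ranks to 0..1
    return (out * (nbins - 1)).astype(np.uint8)

# Heavily skewed counts: one "urban" cell dwarfs the "rural" ones.
counts = np.array([[0, 1,  2],
                   [3, 10, 1_000_000]])
print(eq_hist(counts))
# Linear scaling would render every cell except the largest as nearly
# black; equalization spreads the cells across the full range of shades.
```

This rank-based mapping is why rural roads and dense city centers are simultaneously visible in Figures 3 and 4 without any manual tuning.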
Even greater levels of segregation are visible when zooming into any
major population center, such as those shown in Figure 5. The historic
Chinatown neighborhoods of Chicago and Manhattan are clearly visible
(colored in red), and other neighborhoods are sharply segregated by
race/ethnicity.
Datashader supports interactive zooming all the way in to see
individual data points, so that the amount of segregation can
be seen very clearly at a local level. Here, Datashader has been
told to automatically increase the size of each point when zooming
in so far that data points become sparse, making individual points
more visible.
Figure 5. Race & Ethnicity with Datashader
A: Zooming in to view race/ethnicity data in Chicago
B: Zooming in to view race/ethnicity data in NYC
C: Zooming in to view race/ethnicity data in Los Angeles
D: Zooming in to view race/ethnicity data in Chicago
Datashader Example 2: NYC Taxi Data Set
For this example, we’ll use part of the well-studied NYC taxi trip
database, with the locations of all New York City taxicab pickups and
dropoffs from January 2015. The data set contains 12 million pickup
and dropoff locations, with passenger counts and times of day. First,
let’s look at a scatterplot of the dropoff locations, as would be
rendered by subsampling for a traditional plotting program, in
Figure 6.
Here, the location of Manhattan can be seen clearly, as can the
rectangular Central Park area with few dropoffs, but there are
serious overplotting issues that obscure any more detailed structure.
With the default settings of Datashader, apart from the colormap, all
of the data can be shown with no subsampling required, revealing
much richer structure. In Figure 7, the entire street grid of the New
York City area is now clearly visible, with increasing levels of detail
available by zooming in to particular regions, without needing any
specially tuned or adjusted parameters.
By analogy to the US census race data, you can also treat each hour
of the day as a category and color them separately, revealing
additional temporal patterns using the color key of:
• Red: 12 a.m. midnight
• Yellow: 4 a.m.
• Green: 8 a.m.
• Cyan: 12 p.m. noon
• Blue: 4 p.m.
• Purple: 8 p.m.
In Figure 8, there are clearly different regions of the city where
pickups happen at specific times of day, with rich structure that can
be revealed by zooming in to see local patterns and relate them to
the underlying geographical map as shown in Figure 9.
Figure 6. Plotting NYC Taxi Dropoffs with Bokeh
Figure 7. Plotting NYC Taxi Dropoffs with Datashader
Figure 8. NYC Taxi Pickup Times
Figure 9. Taxi Pickup Times Zoomed with Overlay
Operations In Visualization
Once the data is in Datashader, it becomes very simple to perform
even quite sophisticated computations on the visualization, not just
on the original data. For instance, we can easily plot all the locations
in NYC where there are more pickups than dropoffs in shades of red,
and all locations where there are more dropoffs than pickups in
shades of blue, as shown in Figure 10.
Plotted in this way, it is clear that pickups are much more likely along
the main arteries—presumably where a taxi can be hailed
successfully, while dropoffs are more likely along side streets.
LaGuardia Airport (circled) also shows clearly segregated pickup and
dropoff areas, with pickups being more widespread, presumably
because the pickup areas are on a lower level and thus have lower
GPS accuracy due to occlusion of the satellites.
With Datashader, building a plot like this is very simple, once the
data has been aggregated. An aggregate is an xarray data structure
and, if we create an aggregate named “drops” that contains the
dropoff locations and one named “picks” that contains the pickup
locations, then drops.where(drops>picks) will be a new aggregate
holding all the areas with more dropoffs, and picks.where(picks>drops)
will hold all those with more pickups. These can
then be merged to make the plot in Figure 10, in one line of
Datashader code. Making a plot like this in another plotting package
would require far more code, essentially requiring replicating the
aggregation step of Datashader.
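The comparison can be sketched with NumPy arrays standing in for Datashader's xarray aggregates (the tiny counts below are hypothetical; xarray's .where() keeps values where the condition holds and masks the rest, and np.where behaves analogously here).

```python
import numpy as np

# NumPy stand-ins for the "picks" and "drops" aggregates described above:
# hypothetical per-pixel pickup and dropoff counts on a 2x3 grid.
picks = np.array([[5, 0, 2],
                  [1, 9, 3]])
drops = np.array([[2, 4, 2],
                  [6, 1, 3]])

# Keep a pixel's count where its condition holds; mask the rest with NaN.
more_drops = np.where(drops > picks, drops, np.nan)  # shade these blue
more_picks = np.where(picks > drops, picks, np.nan)  # shade these red

print(more_drops)
print(more_picks)
```

Merging the two masked grids under different colormaps reproduces the red/blue plot of Figure 10; pixels where the counts tie are masked in both.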
Similarly, for the 2010 Census Data example, it only takes one line of
Datashader code to filter the race/ethnicity data to show only those
pixels containing at least one person of every category (Figure 11A).
The color then indicates the predominant race/ethnicity, but only for
those areas—mainly major metropolitan areas—that have all races
and ethnicities included. Another single line of code will select only
those areas where the number of African Americans/Blacks is larger
than the number of Caucasians/Whites, as shown in Figure 11B.
Here, the predominantly African American/Black neighborhoods of
major cities have been selected, along with many rural areas in the
Southeast, as well as a few largely Hispanic neighborhoods on the
West Coast that nonetheless have more Blacks than Whites.
Alternatively, we can simply highlight the top 1% of the pixels by
population density, in this case by using a color range with 100
shades of grey and then changing the top one to red (Figure 11C).
Nearly any such query or operation that can be expressed at the level
of pixels (locations) can be expressed just as simply, providing a
powerful counterpart to queries that are already easy to perform at
the raw data level, such as filtering by criteria provided as columns
in the data set.
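The top-1% highlight of Figure 11C, for instance, reduces to a percentile threshold over the aggregate. The sketch below uses a synthetic density grid in place of the census aggregate (the exponential distribution and grid size are arbitrary choices for the example).

```python
import numpy as np

# Sketch of the "top 1% of pixels" highlight: find the 99th-percentile
# count and flag every pixel at or above it (Figure 11C colors these red).
rng = np.random.default_rng(7)
density = rng.exponential(scale=10.0, size=(200, 200))  # stand-in counts

threshold = np.percentile(density, 99)
top_pixels = density >= threshold  # boolean mask of pixels to recolor

print(f"{top_pixels.sum()} of {density.size} pixels highlighted")
```

In Datashader this becomes a colormap choice, using 100 shades of grey and swapping the top shade for red, so the operation stays a single line.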
Figure 10. Visualizing Drop-Off Location
Dropoff (blue) vs pick-up (red) locations
Figure 11. Filtering US Census Data
A US census data, only including pixels with
every race/ethnicity included
B US census data, only including pixels where
African Americans/Blacks outnumber
Caucasians/Whites
C
US population density, with the 1% most
dense pixels colored in red
Other Data Types
Although the previous examples focus on scatterplots, Datashader
also supports line plots (such as time series), trajectories, and
raster plots.
Line plots behave similarly to Datashader scatter plots, avoiding the
very serious overplotting and occlusion effects that happen for plots
of multiple overlaid time-series curves, by ensuring that overlapping
lines are combined in a principled way (see Figure 12).
With Datashader, time-series data with millions or billions of points
can be plotted easily, with no downsampling required, allowing
isolated anomalies to be detected easily and making it simple to
zoom in to see lower-level substructure.
Trajectory plots can similarly use all the data available even for
millions or billions of points, without downsampling and with no
parameter tuning, revealing substructure at every level of detail, as
in Figure 13. Using one million points, this example shows a
large-scale synthetic random-walk trajectory, but a cyclic “wobble”
can be seen when zooming in partially, and small local noisy values
can be seen when zooming in fully. These patterns could be very
important, if, for example, summing up total path length, and are
easily discoverable interactively with Datashader, because the full
data set is available interactively, with no downsampling required.
Figure 13. Zooming in on the Data
Zoom level 0 Zoom level 1 Zoom level 2
Figure 12. Multiple Overlapping Time Series Curves
Summary

In this paper, we have shown some of the major challenges in creating
visualizations of Big Data, the failures of traditional approaches to
overcome these challenges, and how a new approach to visualization
surmounts them. We have also introduced the Datashader library
available with Anaconda Enterprise, and demonstrated how the serious
limitations of traditional approaches to visualizing Big Data no longer
need to be an issue.
Datashader is here to usher in a new era of seeing the truth in Big Data.
Now everyone on your team—from analysts to executives—can
properly perceive and understand the patterns in your data, in order to
make smart, data-driven business decisions.
© Copyright 2017 Anaconda, Inc. All Rights Reserved.
About Anaconda, Inc.
With over 4.5 million users, Anaconda is the world’s most popular Python data science platform. Anaconda, Inc. continues to lead open source projects like Anaconda, NumPy and SciPy that form the foundation of modern data science. Anaconda’s flagship product, Anaconda Enterprise, allows organizations to secure, govern, scale and extend Anaconda to deliver actionable insights that drive businesses and industries forward.
Edited by Rory Merritt, Technical Marketing Copywriter, Anaconda
The Most Popular Python Data Science Platform
Anaconda.com