scalable data analysis (cis 602-02)dkoop/cis602-2015fa/lectures/lecture01.pdfscalable data analysis...

D. Koop, CIS 602-02, Fall 2015

Scalable Data Analysis (CIS 602-02)

Introduction

Dr. David Koop

Fig. 2. Taxis as sensors of city life. The plot on the top shows how the number of trips varies over 2011 and 2012. While some patterns are regularand appear on both years, some anomalies are clear, e.g., the drops in August 2011 (Hurricane Irene) and in October 2012 (Hurricane Sandy). Inthe bottom, we show pickups (blue) and dropoffs (orange) in Manhattan on May 1st from 7am to 11am. Notice that from 8-10am, there are virtuallyno trips along 6th Avenue, indicating the traffic was blocked.

over 2011 and 2012. There is a lot of regularity: the plot lines are verysimilar for the two years. For example, on Thanksgiving, Christmasand New Year’s eve, there is a substantial drop in the number of trips.But the plot also shows some anomalies. There are big drops in Au-gust 2011 and October 2012, which correspond to hurricanes Irene andSandy, respectively. Looking at the data at a finer grain, other inter-esting patterns emerge. The maps in Fig. 2 show the density of taxisacross Manhattan from 7am to 11am, on May 1st, 2011. From 8amto 10am, taxis disappear along 6th avenue, from Midtown to Down-town; and then, at 10am they reappear. As it turns out, during thisperiod, the streets were closed to traffic for the Five Boro Bike Tour.1As we discuss later, other useful information can be discovered byanalyzing these data, from popular night spots and economically dis-advantaged neighborhoods that are underserved by taxis, to mobilitypatterns across regions at different times and days (see Fig. 1).

Like many urban data sets, taxi trips contain geographical and tem-poral components. In addition, they encode information about move-ment: a trip is associated with pickup and dropoff locations and times.A trip also contains other attributes including the taxi id, distance trav-eled, fare and tip amount, which enable, for example, the study of theeconomics of fare structure and optimal fleet size. Not surprisingly,exploring these data is challenging. We have carried out interviewswith social scientists and engineers that have used this data set in theirresearch to better understand their needs. Their analyses can be com-plex and have been greatly limited by the size of the data. Tools thatare commonly used, including R, MatLab, Stata, ArcGIS and Excel,cannot handle large data sets. This prevents scientists from analyz-ing the whole data. Instead, they first select small slices and then loadthem into these tools for analysis. This process is both tedious and timeconsuming. Furthermore, while these tools support complex analysis,users must be familiar with the underlying languages. For example, inArcGIS, users have to construct SQL-like queries, a task that is out ofreach for scientists without database training.

To address these limitations, in this paper, we propose a new vi-sual query model that supports complex spatio-temporal queries overorigin-destination (OD) data. Users need not be experts in any textual

1http://www.nycbikemaps.com/spokes/five-boro-bike-tour-sunday-may-1st-2011

query language: they can directly query the data using visual opera-tions. We show that this model supports a wide range of queries, andin particular, the three classes of queries in Peuquet’s typology [29]:identify a set of objects at a given location and time; given a time anda set of objects, describe the locations occupied by the objects; anddescribe the times a set of objects occupied a given set of locations.This model also supports origin-destination queries that are needed toexplore taxi movement. While visual languages have been proposedfor spatio-temporal data and moving objects, they targeted a differentkind of data—continuous data (i.e., complete trajectories), and madeuse of a pictorial language [18, 12]. Instead of requiring users to sketchqueries, our model allows them to directly (and visually) manipulatethe data. The ability to specify queries using graphical widgets andvisualize their results was in part inspired by the seminal work byAhlberg et al. [2] on dynamic queries. Our focus, however, is dif-ferent: we aim to support the exploration of large, spatio-temporalOD data, and provide visualization services that are both usable andefficient. Another important feature of our model is that each queryis associated with a set of trips. As a result, not only can queries becomposed and refined, but also query results can be aggregated anddifferent visual representations can be applied while still maintainingthe spatial and temporal contexts. Query composition also enables theuse of cross-filtered views [37], which is key in our model to supportthe query classes in Peuquet’s typology.

We have built TaxiVis, an analysis environment that implementsthis model. It combines a number of interaction capabilities that enableusers to pose queries over all the dimensions of the data and flexiblyexplore the attributes associated with the taxi trips. Another importantfeature of the system is the ability to compare spatio-temporal slicesthrough multiple coordinated views. Users can interactively composeand refine queries as well as generalize them by performing parametersweeps. To deal with scalability, the system implements a number ofstrategies to support interactive response times and the rendering of alarge number of graphical primitives on a map. As discussed in Sec. 5,these include efficient data storage, and the use of adaptive level-of-detail rendering to provide clutter-free visualization of the results.

We demonstrate the usefulness of our system through a series ofcase studies motivated by traffic engineers and economists whoseneeds have driven our design. These case studies show how our model

NYC Taxi Data

2D. Koop, CIS 602-02, Fall 2015

[Ferreira et al., 2013]

Fig. 8. Comparison of taxi pickups (left) and dropoff (right) in different neighborhoods over the first week of May 2011. The plots show that Midtownand the Upper East side are the most active areas. But over the weekend, there is an increased number of dropoffs in Downtown. The figure alsohighlights the fact the Harlem is underserved by taxis.

and we see greater activity in Downtown. Note the increase in thenumber of trips that starts to happen on Thursday (May 5), with bigpeak for pickups on Friday (May 6) in the evening—this indicates thatthe nightlife on weekends is very lively in Downtown.

This one-week overview provides an accurate overview of city life,where people go and when. It also highlights social inequalities. Peo-ple who live in Harlem have long complained about the lack of taxiservice in their neighborhood. The plot clearly shows that their discon-tent is well justified. There is over one order of magnitude differencein the number of trips to/from Harlem compared to other more afflu-ent neighborhoods. The heat map also shows that while people taketaxis to Harlem, there are barely any pickups there. Exploring otherparameters associated with the trips we found one surprising fact: thetips per trip originating in Harlem are higher than for the other neigh-borhoods (see Fig. 9). Further analysis also showed that the fare permile is lower for Harlem, and thus, there is less economic incentivefor taxis to be in that area. The higher tips may be a means to rewarddrivers that go to Harlem.

6.2 Exploring Movement: Transportation HubsAirports and major train stations (i.e. Penn Station and Grand Central)are key transportation hubs in NYC. By analyzing taxi movement toand from these locations, we can obtain insights into how people moveinto and out of the city. To compare the number of trips originating atJFK and La Guardia, we select the regions in their vicinity and exam-ine a 1-week period (05/01/2011 through 05/07/2011). As the plot inthe top of Fig. 10 shows, there are more pickups at La Guardia than atJFK on most days. Another interesting question is where passengersgo. The choropleth (Fig. 10 top) that highlights NYC neighborhoods,shows that most people go to Midtown (the darkest region), followedby the Upper West Side.

By hovering the mouse over a neighborhood, the system displaysthe exact number of trips ending in that neighborhood. We can also ob-tain more fine-grained information about the exact dropoff locations—the popular destinations, using a heat map.

In order to study the movement patterns for airports and train sta-tions, we can group them (Fig. 10 bottom) . We select the regionsaround Penn Station and Grand Central, and group them using theGroup/Ungroup button (note the two green outlines); we also groupthe trips that start at the airports (blue outline). Immediately, the plotis updated to show the number of pickups in the two regions. Notethat there are many more pickups around the train stations. Anotherinteresting observation is that the number of trips originating at thetrain stations remains roughly constant from Monday through Thurs-day, and starts to decrease on Friday, hitting a low on Saturday. Thisreflects the behavior of many commuters who go to the City duringthe week, but not on weekends. Note that, while in this example wehave focused on pickups, i.e., people arriving, it is easy to also studydropoffs. Starting from the map view shown in Fig. 10, we can simplyselect the airport and train regions (by double-clicking on them), andthen click on the “Dropoff” button.

Using the summary view, we can further explore features of theselected trips. For example, by examining the average cost of trip per

Fig. 10. Comparing movement across NYC transportation hubs. On thetop, we examine trips starting at the two major airports in NYC: JFK andLa Guardia. In the bottom, we refine the query to compare trips startingat the airports with trips starting at the train stations, Penn Station andGrand Central.

mile, we can see that it is higher within Manhattan. This providesan incentive for taxi companies to stay within Manhattan and avoidtrips to the airport. Note that while it is illegal for taxis to reject rides,this is a common practice when the destination is JFK.2 This problemis accentuated during rush hour on weekdays, when trips take muchlonger (see Fig. 1) and lead to a potential reduction in revenue.

6.3 Studying Behavior over TimeTaxi Demand Patterns. Studying how taxi demand varies over timecan be useful to understand city dynamics. For taxi companies, thisinformation can help in decision making, both to schedule driver shiftsand maximize profits. To simplify the process of comparing multipletimes slices, TaxiVis provides a time space exploration mechanism.The user first selects the time slices of interest. This can be done usingthe time selection widgets (Fig. 5). In the regular selection mode, theslices are selected by specifying a time range, a step size (e.g. an hour,a day, a week), and the number of steps. In the recurrent selectionmode, the list of time ranges is already expressed and generated bythe widget. For example, by selecting 2011, May and Sunday, 5 timesranges are returned–each corresponding to a Sunday in the month ofMay, 2011. Given a list of time ranges, the result of a time spaceexploration is a multi-view visualization displaying one map per timeinterval, and a data summary view that aggregates the results for thetime intervals. Each map view and plot line is associated with a colorassigned to its time range. This is illustrated in Fig. 11. Here, weexamined all Mondays in May 2011 and May 2012. The number oftrips for the two years is very similar, including the significant drop

2http://cityroom.blogs.nytimes.com/2011/02/24/taxi-panel-focuses-on-destination-discrimination.

NYC Taxi Data

3D. Koop, CIS 602-02, Fall 2015

[Ferreira et al., 2013]

++

--

Leaflet | Map data © OpenStreetMap contributors

Marine Traffic Data

4D. Koop, CIS 602-02, Fall 2015

++

--

Leaflet | Map data © OpenStreetMap contributors

Marine Traffic Data

5D. Koop, CIS 602-02, Fall 2015

Fig. 9: Ball and fielder trajectories vary for hits that arise from differ-ent pitch types.

on hitter characteristics, pitch types, and weather conditions. WithBaseball4D, we show that full-field tracking presents opportunities tocontinue both aggregation and filtering for other parts of the game,including hit trajectories, fielding, and baserunning. In addition, thisfiltering can be extended to utilize pitch information as well as theother game characteristics.

One feature that Baseball4D provides is the opportunity to selec-tively view and analyze multiple plays at once. Coaches and playersoften watch video to improve their own play or scout their opponents.For example, an analyst might view a set of ground balls hit to theshortstop and notice that when he moves to his left, his throws to firstbase tend to more accurate than when he moves to his right. We couldalso time-shift plays so that we could see balls as they are caught toevaluate the different trajectories and speeds. Figure 1 (right) showshow Baseball4D can present both multiple hits by the same batter andmultiple plays by the same fielder.

As with pitch data, we can aggregate all plays to generate heatmapsof hit ball trajectories. The heat map shown in Figure 7, for example,shows all hits and suggests that the 3rd baseman and the shortstop fieldmost of the hits, perhaps suggesting a larger number of right-handed atbats. However, further investigations might combine this aggregationwith filtering to confirm differences between left- and right-handedbatters or differences between pitchers. Such selection allows morefine-grained analysis and presents the opportunity to find new correla-tions between aspects of the game.

One interesting direction that uses filtering is to combine pitch char-acteristics with hit ball trajectories and fielder trajectories. If a man-ager wants to induce a double-play, for example, he may signal in aparticular type of pitch. Analysis of the hit trajectories of that type of

Fig. 10: A data-based partition of which regions a particular fielder isresponsible for. Each field point is assigned to the player that made themost catches in that location during the 2013 season. The availabilityof actual fielding data may help to determine the regions each playeris responsible for.

pitch might enhance the understanding of how effective such strate-gies are. Note that a successful double-play depends on the quality ofthe pitch, the hitter’s decision, the ball trajectory, the fielders’ initialpositions, the baserunners’ movements, and the execution of all of thefielders’ throws. Thus, understanding hit ball trajectories and fieldermovements for different pitch types (see Figure 9) presents a step to-ward improving our understanding of such complicated plays. In thefigures, heat maps show the outcomes of different pitch types, andspray charts show the trajectories of the balls and fielders. Both visu-alizations may be also filtered by other gameplay attributes includingthe pitcher name, the batter name, the game, inning, number of outs,pitch result, and pre-pitch balls and strikes.

6.3 True Defensive Range: Expanding Defensive MetricsDespite the large number of statistics that allow in-depth analysis ofpitching and hitting, it has been much more difficult to evaluate perfor-mance once the ball is in play. Often, the number of errors is the onlymeasure available to compare fielding performance. Perhaps someevaluations can be made based on number of outs versus times theball was hit to a particular region the player was responsible for, butthese statistics often lack information about where the ball landed orif the player was positioned according to a particular strategy (e.g., totake away potential extra-base hits). With the reconstruction of entiregameplays, we can compute defensive metrics that evaluate a player’sfielding performance. Not only can we see where the ball is hit to butalso the speed with which a fielder reacts.

The measurement of defensive skills has been significantly im-proved in recent years. Defensive metrics can be derived from bothdefensive events (putout, assists, errors, total chances) and fielding in-formation. The latter has proven to be more a reliable indicator of thefielding ability, but suffers from the lack of accurate data on battedballs (this data is only available since 1989) and players’ positions.The early tracking approaches (such as the zone rating of STATS Inc.and Baseball Info Solutions) were based on a discretization of the fieldinto zones. The zones helped the reporters (zone rating operators) tovisually determine where the ball landed and what player was sup-posed to field it (zones were assigned to specific players). The assign-ment of zones to specific fielders was put to discussion later, however,and the new data shows that players’ zones may overlap significantlyon the field (see, e.g., Figure 10).

With new tracking abilities, we can better track fielding perfor-mance, and this has already led to more discussion and new metricsabout defensive skills. Greg Rybarczyk has proposed a measure of

Baseball Data

6D. Koop, CIS 602-02, Fall 2015

[Deitrich et al., 2014]

Fig. 9: Ball and fielder trajectories vary for hits that arise from differ-ent pitch types.

on hitter characteristics, pitch types, and weather conditions. WithBaseball4D, we show that full-field tracking presents opportunities tocontinue both aggregation and filtering for other parts of the game,including hit trajectories, fielding, and baserunning. In addition, thisfiltering can be extended to utilize pitch information as well as theother game characteristics.

One feature that Baseball4D provides is the opportunity to selec-tively view and analyze multiple plays at once. Coaches and playersoften watch video to improve their own play or scout their opponents.For example, an analyst might view a set of ground balls hit to theshortstop and notice that when he moves to his left, his throws to firstbase tend to more accurate than when he moves to his right. We couldalso time-shift plays so that we could see balls as they are caught toevaluate the different trajectories and speeds. Figure 1 (right) showshow Baseball4D can present both multiple hits by the same batter andmultiple plays by the same fielder.

As with pitch data, we can aggregate all plays to generate heatmapsof hit ball trajectories. The heat map shown in Figure 7, for example,shows all hits and suggests that the 3rd baseman and the shortstop fieldmost of the hits, perhaps suggesting a larger number of right-handed atbats. However, further investigations might combine this aggregationwith filtering to confirm differences between left- and right-handedbatters or differences between pitchers. Such selection allows morefine-grained analysis and presents the opportunity to find new correla-tions between aspects of the game.

One interesting direction that uses filtering is to combine pitch char-acteristics with hit ball trajectories and fielder trajectories. If a man-ager wants to induce a double-play, for example, he may signal in aparticular type of pitch. Analysis of the hit trajectories of that type of

Fig. 10: A data-based partition of which regions a particular fielder isresponsible for. Each field point is assigned to the player that made themost catches in that location during the 2013 season. The availabilityof actual fielding data may help to determine the regions each playeris responsible for.

pitch might enhance the understanding of how effective such strate-gies are. Note that a successful double-play depends on the quality ofthe pitch, the hitter’s decision, the ball trajectory, the fielders’ initialpositions, the baserunners’ movements, and the execution of all of thefielders’ throws. Thus, understanding hit ball trajectories and fieldermovements for different pitch types (see Figure 9) presents a step to-ward improving our understanding of such complicated plays. In thefigures, heat maps show the outcomes of different pitch types, andspray charts show the trajectories of the balls and fielders. Both visu-alizations may be also filtered by other gameplay attributes includingthe pitcher name, the batter name, the game, inning, number of outs,pitch result, and pre-pitch balls and strikes.

6.3 True Defensive Range: Expanding Defensive MetricsDespite the large number of statistics that allow in-depth analysis ofpitching and hitting, it has been much more difficult to evaluate perfor-mance once the ball is in play. Often, the number of errors is the onlymeasure available to compare fielding performance. Perhaps someevaluations can be made based on number of outs versus times theball was hit to a particular region the player was responsible for, butthese statistics often lack information about where the ball landed orif the player was positioned according to a particular strategy (e.g., totake away potential extra-base hits). With the reconstruction of entiregameplays, we can compute defensive metrics that evaluate a player’sfielding performance. Not only can we see where the ball is hit to butalso the speed with which a fielder reacts.

The measurement of defensive skills has been significantly im-proved in recent years. Defensive metrics can be derived from bothdefensive events (putout, assists, errors, total chances) and fielding in-formation. The latter has proven to be more a reliable indicator of thefielding ability, but suffers from the lack of accurate data on battedballs (this data is only available since 1989) and players’ positions.The early tracking approaches (such as the zone rating of STATS Inc.and Baseball Info Solutions) were based on a discretization of the fieldinto zones. The zones helped the reporters (zone rating operators) tovisually determine where the ball landed and what player was sup-posed to field it (zones were assigned to specific players). The assign-ment of zones to specific fielders was put to discussion later, however,and the new data shows that players’ zones may overlap significantlyon the field (see, e.g., Figure 10).

With new tracking abilities, we can better track fielding perfor-mance, and this has already led to more discussion and new metricsabout defensive skills. Greg Rybarczyk has proposed a measure of

Baseball Data

7D. Koop, CIS 602-02, Fall 2015


Fig. 13: Plot of TDR difficulty according to the position where theball landed. Note that the difficulty levels are not linked to specificfield positions.

Fig. 14: An alternative characterization of play difficulty might involvethe fielder’s running speed. Here we plot plays where the fielder ran atleast 15 feet/sec (10.2 mph) as hard.

batted ball with a given trajectory and hit location (a contribution ofthe plus/minus metric). With the advent of fielder tracking technol-ogy, the defensive metrics can be computed with large amounts ofconsistent and accurate field data, and their significance can then bereviewed.

7 CONCLUSION

In this paper, we present Baseball4D, a new visual analytics tool thathas been designed to enable the analysis of high-resolution, time-varying player- and ball-tracking data streams that are becoming avail-able as 3D tracking systems are deployed throughout the world. Base-ball4D takes several data streams and consolidates, filters, and en-riches them so that gameplay can be more easily studied. In this paper,we show that the consolidated gameplays can be used to generate non-trivial statistics and visualizations that were not possible to be computebefore (or at least very hard to do). We have also shown that Base-ball4D’s intuitive user interface allows users to “slice and dice” gamesin many ways that were previously impossible.

There are several avenues for future work. We would like to ex-plore adding more metrics to the system, and also develop ways todepict them in an intuitive way. We are also looking forward to havingcomplete coverage of a season. As the amount of data will increasesubstantially, we will need to revisit some of the components of the

system. In particular, Baseball4D currently incrementally loads dataas needed, caching it in memory. This approach is not scalable movingforward, and we will need to use an alternative approach. Queryingwill also need to be improved, and we plan to explore how to cou-ple our system with an efficient database backend. As the data sizegets larger, another area that will need to be explored is how to main-tain interactive rendering performance. We plan to explore automaticlevel-of-detail techniques. Also, we would like to add a scripting APIto the system that allows the users to implement new analysis tech-niques without having to use the low-level C++ code that composesmost of the system.

Throughout the development of the techniques and tools embed-ded in Baseball4D, we have been collaborating closely with baseballexperts. On purpose, we have refrained from making any mention ofspecific plays, players, or teams. We do believe that teams and coachesrecognize the value of objective data like accurate release points andmovement trajectories, and we believe that our work is a step in theright direction toward building tools that will allow these experts toimprove baseball statistics and unveil useful information about play-ers and teams.

ACKNOWLEDGMENTS

We would like to thank Dirk Van Dall, Cory Schwartz, Andrew Pinter,and Joe Inzerillo and the rest of the talented engineering and stats teamat MLB Advanced Media. Without their help and support, this workwould not have been possible. We would also like to acknowledge theanonymous reviewers for many helpful suggestions.

REFERENCES

[1] P.-A. Albinsson and D. Andersson. Extending the attribute explorerto support professional team-sport analysis. Information Visualization,7(2):163–169, 2008.

[2] D. Appelman. Customizable heat maps, Jan 31, 2011. RetrievedJuly 30, 2014 from http://www.fangraphs.com/blogs/

customizable-heat-maps/.[3] D. Basco and J. Zimmerman. The Baseball Research Journal, volume 39,

chapter Measuring Defense: Entering the Zones of Fielding Statistics.Society for American Baseball Research, 2010.

[4] T. Bebie and H. Bieri. A video-based 3d-reconstruction of soccer games.Computer Graphics Forum, 19(3):391–400, 2000.

[5] I. Boudway. Baseball: Running the new numbers. BloombergBusinessWeek, March 31, 2011. Retrieved July 30, 2014 fromhttp://www.businessweek.com/magazine/content/11_

15/b4223072802462.htm.[6] C. Bregler. Motion capture technology for entertainment [in the spot-

light]. Signal Processing Magazine, IEEE, 24(6):160–158, Nov. 2007.[7] D. H. S. Chung, P. A. Legg, M. L. Parry, I. W. Griffiths, R. Brown, R. S.

Laramee, and M. Chen. Visual analytics for multivariate sorting of sportevent data. In Workshop on Sports Data Visualization, 2013.

[8] Chyronhego. http://chyronhego.com/.[9] A. Cox and J. Stasko. Sportsvis: Discovering meaning in sports statis-

tics through information visualization. In Proceedings of Symposium onInformation Visualization, pages 114–115. Citeseer, 2006.

[10] FIELDf/x, Sportvision. http://www.sportvision.com/

baseball/fieldfx.[11] K. Goldsberry. Courtvision: New visual and spatial analytics for the nba.

In MIT Sloan Sports Analytics Conference. MIT Sloan Sports AnalyticsConference, 2012.

[12] A. Gueziec. Tracking pitches for broadcast television. Computer,35(3):38–43, Mar. 2002.

[13] A. Hervieu, P. Bouthemy, and J.-P. L. Cadre. Trajectory-based handballvideo understanding. In Proceedings of the ACM International Confer-ence on Image and Video Retrieval, CIVR ’09, pages 43:1–43:8, NewYork, NY, USA, 2009. ACM.

[14] M. Hferlin, E. Grundy, R. Borgo, D. Weiskopf, M. Chen, I. W. Griffiths,and W. Griffiths. Video visualization for snooker skill training. ComputerGraphics Forum, 29(3):1053–1062, 2010.

[15] P. A. Legg, D. H. S. Chung, M. L. Parry, M. W. Jones, R. Long, I. W.Griffiths, and M. Chen. Matchpad: Interactive glyph-based visualizationfor real-time sports performance analysis. Computer Graphics Forum,31(3pt4):1255–1264, 2012.

Baseball Data

8D. Koop, CIS 602-02, Fall 2015


Data Science Venn Diagram

9D. Koop, CIS 602-02, Fall 2015

[D. Conway, The Data Science Venn Diagram, 2013]

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Questions are important!• Having data is great, but most of the time it just sits waiting for

someone to analyze it • The reason data analysis is not completely automated is that there

are so many potential questions • Humans need to stay involved in the loop • Interaction and visualization can be important, especially early in

data analysis

10D. Koop, CIS 602-02, Fall 2015

Data• What is this data?

• Semantics: real-world meaning of the data • Type: structural or mathematical interpretation • Both often require metadata

- Sometimes we can infer some of this information - Line between data and metadata isn’t always clear

11D. Koop, CIS 602-02, Fall 2015

Data

12D. Koop, CIS 602-02, Fall 2015

Data Types• Items

- An item is an individual discrete entity - e.g. row in a table, node in a network

• Attributes - An attribute is some specific property that can be measured,

observed, or logged - a.k.a. variable, (data) dimension

13D. Koop, CIS 602-02, Fall 2015

22

Fieldattribute

item

Items & Attributes

14D. Koop, CIS 602-02, Fall 2015

Data Types• Nodes

- Synonym for item but in the context of networks (graphs) • Links

- A link is a relation between two items - e.g. social network friends, computer network links

15D. Koop, CIS 602-02, Fall 2015

Items & Links

16D. Koop, CIS 602-02, Fall 2015

[Bostock, 2011]

Item

Links

Dataset Types

17D. Koop, CIS 602-02, Fall 2015

Tables

Attributes (columns)

Items (rows)

Cell containing value

Networks

Link

Node (item)

Trees

Fields (Continuous)

Attributes (columns)

Value in cell

Cell

Multidimensional Table

Value in cell

Grid of positions

Geometry (Spatial)

Position

Dataset Types

[Munzner (ill. Maguire), 2014]

Attribute Types

18D. Koop, CIS 602-02, Fall 2015

Attribute Types

Ordering Direction

Categorical Ordered

Ordinal Quantitative

Sequential Diverging Cyclic


231 = Quantitative2 = Nominal3 = Ordinal

quantitative ordinal categorical

Categorial, Ordinal, and Quantitative

19D. Koop, CIS 602-02, Fall 2015

241 = Quantitative2 = Nominal3 = Ordinal

quantitative ordinal categorical

Categorial, Ordinal, and Quantitative

20D. Koop, CIS 602-02, Fall 2015

Semantics• The type of data does not tell us what the data means or how it

should be interpreted • Tables have keys/values, fields have independent/dependent vars

21D. Koop, CIS 602-02, Fall 2015

Attribute SemanticsKeys vs. Values (Tables) or Independent vs. Dependent (Fields)

Flat

Multidimensional

Tabl

es

Fiel

ds


Analysis

22D. Koop, CIS 602-02, Fall 2015

Trends

Actions

Analyze

Search

Query

Why?

All Data

Outliers Features

Attributes

One ManyDistribution Dependency Correlation Similarity

Network Data

Spatial DataShape

Topology

Paths

Extremes

ConsumePresent EnjoyDiscover

ProduceAnnotate Record Derive

Identify Compare Summarize

tag

Target known Target unknown

Location knownLocation unknown

Lookup

Locate

Browse

Explore

Targets

Why?

How?

What?


Analysis: Consume & Produce• Consume

–Exploration –Explanation –Enjoyment

• Produce –Annotation –Record –Derivation

•Leads to new directions/ideas

23D. Koop, CIS 602-02, Fall 2015

Analyze

ConsumePresent EnjoyDiscover

ProduceAnnotate Record Derive

tag


Analysis: Search and Query• Search based on what

a user knows - Target - Location

• Query depends onwhat data matters - One - Some (Often Two) - All

24D. Koop, CIS 602-02, Fall 2015

Search

Query

Identify Compare Summarize

Target known Target unknown

Location known

Location unknown

Lookup

Locate

Browse

Explore


Targets

25D. Koop, CIS 602-02, Fall 2015

Trends

ALL DATA

Outliers Features

ATTRIBUTES

One ManyDistribution Dependency Correlation Similarity

Extremes

NETWORK DATA

SPATIAL DATA

Shape

Topology

Paths


Scalability• “Big Data”

- What is “big”? For whom is it “big”? - variety, velocity, volume, …

• Lots of data that was big is not an issue now • Understanding the scalability of techniques is important • There will always be larger datasets, but we should be able to

understand performance, timing based on the scalability of the algorithm

26D. Koop, CIS 602-02, Fall 2015

About Me• Research Interests

- Visualization - Computational Provenance - Data Science

• Research Projects - VisTrails: www.vistrails.org - VisComplete, UV-CDAT, SAHM

• See my web page for more information - http://www.cis.umassd.edu/~dkoop/

27D. Koop, CIS 602-02, Fall 2015

http://www.vistrails.org

http://www.cis.umassd.edu/~dkoop/

About You• Previous topics course (CIS 602)? (visualization, bioinformatics,

provenance) • Research Papers? • Data Science? • Python? • Database Experience? • Analytics Experience? • Anything you want to see covered?

28D. Koop, CIS 602-02, Fall 2015

About this course• Course web page is authoritative:

- http://www.cis.umassd.edu/~dkoop/cis602/ - Schedule, Readings, Assignments will be posted online - Check the web site before emailing me

• Topics course - A current research area the professor works in - A chance to be on the “cutting edge” of research

• Requires student participation - Reading responses - Reading presentation - Course project

29D. Koop, CIS 602-02, Fall 2015

About this course• Balance of techniques and research ideas • Some background (Python, writing) followed by specific topics and

readings with presentations • Assignments at the beginning of course, project at end • One exam near midterm

30D. Koop, CIS 602-02, Fall 2015

About this course• Course Registration:

- Make sure you have registered in COIN for the course - Email me if you are not registered but are interested in taking the

course • Review of course policies:

- Plagiarism and academic honesty - If you have any concerns or questions, please email me as soon

as possible • If you are not sure if this course is a good fit, please email me or talk

to me

31D. Koop, CIS 602-02, Fall 2015

Next Class• Introduction to/review of Python • Download anaconda distribution (continuum.io)

- I am planning to use Python 3 (3.4) - You may choose to use Python 2 if you already know it

32D. Koop, CIS 602-02, Fall 2015

scalable data analysis (cis 602-02)dkoop/cis602-2015fa/lectures/lecture01.pdfscalable data analysis...

Documents