
Page 1:

CIS 602-01: Scalable Data Analysis

Visualization
Dr. David Koop


Page 2:

Growth of Data


Page 3:

Usefulness of Data


Page 4:

Analyzed Data


Page 5:

Example Data Sources
• Radio Telescopes
• Twitter
• Wind Turbine Sensors
• Surveillance Cameras
• Cars & Airplanes
• Dog Collars
• Dishwashers
• Traffic Lights
• MRI Scanners
• NFL Football Players
• Farming


[Zebra MotionWorks]

[CC-SA 2.0, Stephan Trebs]

Page 6:

Large Synoptic Survey Telescope (LSST)


[http://www.lsst.org]

• Image every 15 seconds
• 100PB over 10 years

Page 7:

Large Numerical Simulations
• Millennium simulation: dark matter, 30TB raw data


Figure 1: The dark matter density field on various scales. Each individual image shows the projected dark matter density field in a slab of thickness 15 h⁻¹ Mpc (sliced from the periodic simulation volume at an angle chosen to avoid replicating structures in the lower two images), colour-coded by density and local dark matter velocity dispersion. The zoom sequence displays consecutive enlargements by factors of four, centred on one of the many galaxy cluster halos present in the simulation.


[V. Springel et al., 2005]

Page 8:

More Data Sources
• Awesome Public Datasets
• Kaggle Datasets
• Government Data: data.gov
• Customer Data: (see https://aboutthedata.com)
• Internal Business Data


Page 9:

Table 1: Set of dimensions.

Objective dimensions:

| Dimension | Categories | Question to be answered |
| Type | Web Crawler, Customizable Crawler, Search Engine, Pure Data Vendor, Complex Data Vendor, Matching Vendor, Enrichment Tagging, Enrichment Sentiment, Enrichment Analysis, Data Market Place | What is the type of the core offering? |
| Time Frame | Static/Factual, Up To Date | Is the data static or real-time? |
| Domain | All, Finance/Economy, Bio Medicine, Social Media, Geo Data, Address Data | What is the data about? |
| Data Origin | Internet, Self-Generated, User, Community, Government, Authority | Where does the data come from? Who is the author? |
| Pricing Model | Free, Freemium, Pay-Per-Use, Flat Rate | Is the offer free, pay-per-use or usable with a flat rate? |
| Data Access | API, Download, Specialized Software, Web Interface | What technical means are offered to access the data? |
| Data Output | XML, CSV/XLS, JSON, RDF, Report | In what way is the data formatted for the user? |
| Language | English, German, More | What is the language of the website? Does it differ from the language of the data? |
| Target Audience | Business, Customer | Towards whom is the product geared? |

Subjective dimensions:

| Dimension | Categories | Question to be answered |
| Trustworthiness | Low, Medium, High | How trustworthy is the vendor? Can the original data source be tracked or verified? |
| Size of Vendor | Startup, Medium, Big, Global Player | How big is the vendor? |
| Maturity | Research Project, Beta, Medium, High | Is the product still in beta or already established? |

However, if no more companies were found, the category definitions were reconsidered and updated.

2.3 Limitations

The information we used was taken directly from the website of each vendor. This may limit the accuracy of our findings in some cases, where the description of a product exceeds the actual functionality. Verifying that every product fulfills its own description is a task that goes beyond the purpose of this survey. Random samples, however, indicate that the descriptions commonly match the services provided. Nevertheless, there are also cases where the information provided on a vendor's website was not sufficient to categorize all dimensions. This was particularly the case for B2B vendors, which only reveal their pricing models upon request. We chose to leave these dimensions out rather than to speculate about their value. As a result, however, the numbers for these dimensions are minimally skewed.

The market of data vendors and data marketplaces is highly active, i.e., new actors emerge and others disappear, and the market as such is growing rapidly. Therefore, it cannot be guaranteed that this study is fully exhaustive with regard to the number of vendors in the market. That said, we are confident that during our observation period from April to July 2012 we have obtained a representative sample that allows for a meaningful analysis. Furthermore, it has to be stated that data trading channels are not necessarily made public. This means that we are aware of the fact that a certain amount of data is traded directly between (large) corporations or within a certain ecosystem (such as social networks) without the use of intermediaries. It is obvious that it is impossible to investigate those forms of data trading using our Web survey approach.

3. FINDINGS

As stated in the previous section, the following twelve dimensions have been examined: Type, Time Frame, Domain, Data Origin, Pricing Model, Data Access, Data Output, Language, Target Audience, Trustworthiness, Size of Vendor, and Maturity. To structure these dimensions we have categorized them into objective and subjective measures, i.e., whether the classification within each dimension can be easily verified or whether the classification is down to the researcher's judgement.

3.1 Objective Dimensions

3.1.1 Type

The first dimension, type, is used to classify vendors based on what their core product is. In order to form a common understanding of the different categories, these are explained below:

• (Focused) Web Crawler: Services that are specifically designed to crawl a particular website or set of websites. These are always bound to one domain, e.g., spinn3r is a service that is specialized in indexing the blogosphere.

• Customizable Crawler: General purpose crawlers that can be set up by the customer to crawl


Dimensions of Data


[Schomm et al., 2013]

Page 10:

Big Data or Small Data?
• Many companies feel the need to overclaim the amount of data
• "when you take a normal tech company and sprinkle on data, you get the next Google" — [C. O'Neil]
• Many large datasets are not useful
• Twitter processes 8TB, but the tweets only take about 30GB…
• Wikipedia can be downloaded onto a USB drive
• All MP3s can be stored on a moderately sized disk array
• Can learn a lot from a "small" dataset, e.g. sensors from a single turbine, grocery store, Apple Watch
• Small data focused on end-user, more timely insights?


Page 11:

Jobs on a Large Analytics Cluster


tions to Hadoop that improve scale-up performance without compromising the ability to scale out.

While vanilla Hadoop performs poorly in a scale-up configuration, a series of optimizations makes it competitive with scale-out. Broadly, we remove the initial data load bottleneck by showing that it is cost-effective to replace disk by SSDs for local storage. We then show that simple tuning of memory heap sizes results in dramatic improvements in performance. Finally, we show several small optimizations that eliminate the "shuffle bottleneck".

This paper makes two contributions. First, it shows through an analysis of real-world job sizes as well as an evaluation on a range of jobs, that scale-up is a competitive option for the majority of Hadoop MapReduce jobs. Of course, this is not true for petascale or multi-terabyte scale jobs. However, there is a large number of jobs, in fact the majority, that are sub-terabyte in size. For these jobs we claim that processing them on clusters of 10s or even 100s of commodity machines, as is commonly done today, is sub-optimal. Our second contribution is a set of transparent optimizations to Hadoop that enable good scale-up performance. Our results show that with these optimizations, raw performance on a single scale-up server is better than scale-out on an 8-node cluster for 9 out of 11 jobs, and within 5% for the other 2. Larger cluster sizes give better performance but incur other costs. Compared to a 16-node cluster, a scale-up server provides better performance per dollar for all jobs. When power and server density are considered, scale-up performance per watt and per rack unit are significantly better for all jobs compared to either size of cluster.

Our results have implications both for data center provisioning and for software infrastructures. Broadly, we believe it is cost-effective for providers supporting "big data" analytic workloads to provision "big memory" servers (or a mix of big and small servers) with a view to running jobs entirely within a single server. Second, it is then important that the Hadoop infrastructure support both scale-up and scale-out efficiently and transparently to provide good performance for both scenarios.

The rest of this paper is organized as follows. Section 2 shows an analysis of job sizes from real-world MapReduce deployments that demonstrates that most jobs are under 100 GB in size. It then describes 11 example Hadoop jobs across a range of application domains that we use as concrete examples in this paper. Section 3 then briefly describes the optimizations and tuning required to deliver good scale-up performance on Hadoop. Section 4 compares scale-up and scale-out for Hadoop for the 11 jobs on several metrics: performance, cost, power, and server density. Section 5 discusses some implications for analytics in the cloud as well as the crossover point between scale-up and scale-out. Section 6 describes related work, and Section 7 concludes the paper.

Figure 1: Distribution of input job sizes for a large analytics cluster

2 Job sizes and example jobs

A key claim of this paper is that the majority of real-world analytic jobs can fit into a single "scale-up" server with up to 512 GB of memory. We analyzed 174,000 jobs submitted to a production analytics cluster in Microsoft in a single month in 2011 and recorded the size of their input data sets. Figure 1 shows the CDF of input data sizes across these jobs. The median job input data set size was less than 14 GB, and 80% of the jobs had an input size under 1 TB. Thus although there are multi-terabyte and petabyte-scale jobs which would require a scale-out cluster, these are the minority.
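[Editor's illustration, not from the paper: the empirical CDF in Figure 1 is straightforward to reproduce for any job trace. A minimal Python sketch using synthetic sizes, since the Microsoft trace is not public; the log-normal parameters are arbitrary placeholders, not fitted to the paper's data.]

```python
import numpy as np

# Hypothetical per-job input sizes in GB; stand-in for a real job trace.
job_input_gb = np.random.default_rng(0).lognormal(mean=2.5, sigma=2.5, size=174_000)

sizes = np.sort(job_input_gb)
cdf = np.arange(1, len(sizes) + 1) / len(sizes)  # empirical CDF, ready to plot

print(f"median: {np.percentile(sizes, 50):.1f} GB")
print(f"fraction of jobs under 1 TB: {(sizes < 1024).mean():.0%}")
```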

Of course, these job sizes are from a single cluster running a MapReduce-like framework. However we believe our broad conclusions on job sizes are valid for MapReduce installations in general and Hadoop installations in particular. For example, Elmeleegy [10] analyzes the Hadoop jobs run on the production clusters at Yahoo. Unfortunately, the median input data set size is not given but, from the information in the paper we can estimate that the median job input size is less than 12.5 GB.¹ Ananthanarayanan et al. [4] show that Facebook jobs follow a power-law distribution with small jobs dominating; from their graphs it appears that at least 90% of the jobs have input sizes under 100 GB. Chen et al. [7] present a detailed study of Hadoop workloads for Facebook as

¹The paper states that input block sizes are usually 64 or 128 MB with one map task per block, that over 80% of the jobs finish in 10 minutes or less, and that 70% of these jobs very clearly use 100 or fewer mappers (Figure 2 in [10]). Therefore conservatively assuming 128 MB per block, 56% of the jobs have an input data set size of under 12.5 GB.


[R. Appuswamy et al., 2013]

Page 12:

Reading Quiz


Page 13:

Assignment 1
• http://www.cis.umassd.edu/~dkoop/cis602-2017fa/assignment1.html
• Boston Property Assessments
  - Initial exploratory analysis
  - Use a Python Notebook
  - May use pandas
  - Label subproblems and answers
  - Show work (even if it's not your final answer)


[Google Maps]

Page 14:

Big Data Visualization

(Slides from Dr. Nan Cao via Dr. Ching-Yung Lin)


Page 15:

Big Data Visualization
• What is Visualization and Why Visualization?
• Big Data Visualization Challenges and Techniques
• Visualizing Big Data
• Visual Analytics and Big Data


Page 16:

Whisper: Tracing Information Diffusion in Real Time
• https://www.youtube.com/watch?v=ou8L0MzGvOU


Page 17:

Customizing Computational Methods for Visual Analytics with Big Data

J. Choo and H. Park


Page 18:

Complexities of Visual Analytics of Big Data
• Human perception and large numbers of items
  - locating items
  - tracking items
• Limited screen space:
  - clutter
  - overlapping items


Page 19:

Use Computational Methods
• Methods:
  - Dimensionality reduction
  - Clustering
  - Machine learning & data mining
• Issues with using these methods:
  - What's going on?
  - Waiting time…
• Goal:
  - Interactive
  - Faster


Page 20:

Exploiting Discrepancies
• Precision: use knowledge of screen resolution to set precision
• Convergence: don't worry about minor changes that may be imperceptible
  - Human perception
  - Screen resolution constraints


Page 21:

Changes in Cluster Membership in k-Means



1. Compute each cluster’s centroid by averaging the feature vectors of that cluster’s data items.

2. Update each data item’s cluster assignment on the basis of its closest cluster centroid.

The iteration terminates when no membership changes occur.

For instance, we used k-means clustering to cluster 50,000 Reuter newswire articles into 20 clusters. Figure 1 shows how many cluster membership changes occurred throughout the 40 iterations. Major changes occurred in only the first few iterations. For instance, fewer than 5 percent of the data items changed their memberships after the fifth iteration, as the blue line shows. After the seventh iteration, more than 90 percent of the data items had been correctly clustered, as the red line shows. In addition, each iteration of the k-means algorithm requires an equal amount of time. Therefore, most of the time for running the algorithm could be curtailed because it doesn't contribute much to a human's perception in VA.
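[Editor's illustration, not the authors' code: the per-iteration membership-change count plotted in Figure 1 is easy to instrument in any k-means implementation. A minimal sketch on random placeholder data:]

```python
import numpy as np

def kmeans_change_counts(X, k, max_iter=40, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    changes_per_iter = []
    for _ in range(max_iter):
        # Step 2: assign each item to its closest cluster centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        changes_per_iter.append(int((new_labels != labels).sum()))
        labels = new_labels
        if changes_per_iter[-1] == 0:  # terminate when no membership changes occur
            break
        # Step 1: recompute each centroid as the mean of its members.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, changes_per_iter

X = np.random.default_rng(1).normal(size=(5_000, 20))  # placeholder data
_, changes = kmeans_change_counts(X, k=20)
print(changes)  # large counts in the first few iterations, then a tail of tiny ones
```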

Screen-space-wise convergence. The coarse-grained quantization due to the screen space's limited resolution can also affect convergence. To better describe the idea, we show an example using multidimensional scaling (MDS), a common computational method. Basically, MDS tries to preserve all the pairwise distances or relationships of data items in the lower-dimensional space, which is typically a 2D or 3D screen space.

In particular, our example uses nonmetric MDS, which tries to preserve the distance values' orderings instead of their actual values. Nonmetric MDS is often better suited to VA than the original MDS because humans care more about data items' ordering. However, it requires much more intensive computation than metric MDS.

We used nonmetric MDS on 2,000 data items consisting of handwritten numbers. Of the 169 iterations, major changes occurred in only the first few (see Figure 2a). After the fourth iteration, a data item's average pixel-wise coordinate change was fewer than 10 pixels from the item's coordinates in the previous iteration, as the blue line shows. After the 30th iteration, each data item was on average fewer than 10 pixels away from the final converged coordinate, as the red line shows. Scatterplots generated by the fifth iteration and the converged result (see Figures 2b and 2c, respectively) confirm that the changes between the two are indeed minor.
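[Editor's illustration, not from the article: a screen-space convergence test of this kind can be bolted onto any iterative layout method. A sketch that quantizes 2-D coordinates to an assumed 1,024 × 768 display and measures the average pixel movement between iterations; the 10-pixel threshold mirrors the example above.]

```python
import numpy as np

def to_pixels(coords, width=1024, height=768):
    # Normalize 2-D coordinates to [0, 1] per axis, then quantize to the pixel grid.
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return np.round((coords - lo) / span * np.array([width - 1, height - 1]))

def mean_pixel_shift(prev, curr):
    return float(np.linalg.norm(to_pixels(prev) - to_pixels(curr), axis=1).mean())

# Hypothetical use inside the iteration loop of, e.g., nonmetric MDS:
#     if mean_pixel_shift(prev_coords, coords) < 10:  # imperceptible movement
#         break
```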

Customizing Computational Methods

Here, we suggest how to customize computational methods by tackling the precision and convergence discrepancies.

Low-Precision Computation

One of the easiest ways to lower precision and thus accelerate computation is to change double precision to single precision.

Figure 3 shows the results of using principal component analysis (PCA)⁴ to generate a scatterplot of facial-image data with single and double precision. Single precision took much less time, but the two cases generated almost identical scatterplots. After analyzing the exact pixel-wise coordinates at 1,024 × 768 resolution, we found only two pixel-wise displacements between the two cases.

You could more carefully determine the computational precision on the basis of human perception and the screen resolution. For instance, you could conduct a user study on how significantly a human's perception of the results degrades as the precision decreases. Conversely, you could formulate the minimum precision required for a given resolution of the screen space.
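[Editor's illustration along the lines of the experiment above, not the authors' code: synthetic data stands in for the facial images, and plain SVD-based PCA is used so that the input dtype controls the working precision.]

```python
import numpy as np

def pca_2d(X):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # computed in X's dtype
    V = Vt[:2]
    # Fix SVD's sign ambiguity so both precisions project onto the same axes.
    signs = np.sign(V[np.arange(2), np.abs(V).argmax(axis=1)])
    return Xc @ (V * signs[:, None]).T

def to_pixels(coords, width=1024, height=768):
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    return np.round((coords - lo) / (hi - lo) * np.array([width - 1, height - 1]))

X = np.random.default_rng(0).normal(size=(2_000, 100))  # placeholder data
px64 = to_pixels(pca_2d(X.astype(np.float64)))
px32 = to_pixels(pca_2d(X.astype(np.float32)).astype(np.float64))
print("items displaced at pixel level:", int((px64 != px32).any(axis=1).sum()))
```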

So far, little research has focused on adopting a lower precision than the standard double precision to save computation time. Researchers have studied computation at an arbitrary precision, but their primary purpose was to support much higher precision than modern CPUs can handle.⁵

However, considerably decreasing the precision might not always achieve computational efficiency owing to hardware issues. Specifically, most CPUs have a floating-point unit (FPU)—a dedicated

[Figure 1 plot: x-axis "No. of iterations" (0–40); y-axis "No. of per-iteration changes/accuracy (%)" (0–100); series: "Per-iteration changes" and "Accuracy against final solution".]

Figure 1. The relative changes of cluster memberships between iterations, and the cluster membership accuracy with respect to the final converged solution. This example used k-means clustering to cluster 50,000 Reuter newswire articles into 20 clusters. Most of the time for running the algorithm could be curtailed because it doesn’t contribute much to a human’s perception in visual analytics (VA).

Page 22:

Customizing Computations
• Use lower precision computation
• Use interactive visualization that shows iterations
• Refine results iteratively
• Data scale confinement


Page 23:

Iteration-level Interactive Visualization



Data scale confinement is particularly useful for dealing with computational complexity. In principle, as the number of data items increases, the algorithm complexity can't be more efficient than O(n), which assumes that every data item is processed at least once. Even such ideal complexity can cause a computational bottleneck in real-time VA. Having a fixed number of available pixels can turn algorithmic complexity into O(1), in that you can visualize only a specific number of data items at most. One of the easiest ways to select this data subset is random sampling, although you could adopt other more carefully designed sampling methods that better represent the entire dataset.
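[Editor's illustration, not from the article: a minimal sketch of that idea caps the number of items processed and drawn at the pixel count of an assumed display, making the per-frame cost independent of n.]

```python
import numpy as np

def confine_to_screen(X, width=1024, height=768, seed=0):
    budget = width * height  # at most one visible item per pixel
    if len(X) <= budget:
        return X
    idx = np.random.default_rng(seed).choice(len(X), size=budget, replace=False)
    return X[idx]  # random sample; a more carefully designed sampler could go here
```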

Some user interaction such as zoom-in or zoom-out might require the computational results for data items that haven't yet been processed. In this case, you can handle the situation through a different kind of efficient computation.

For example, suppose you have a large-scale dataset for which only a certain subset of the data has been clustered. To obtain the remaining data items' cluster labels, you can apply a simple classification method based on the already computed clusters.
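[Editor's illustration: one minimal such classifier, assuming the clusters came from a centroid-based method like k-means, assigns each unprocessed item the label of its nearest centroid.]

```python
import numpy as np

def label_remaining(X_rest, centroids):
    # Nearest-centroid classification: one distance computation per item,
    # with no re-clustering of the full dataset.
    dists = ((X_rest[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)
```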

Or, in the case of dimension reduction, suppose PCA has been computed on a data subset. You can project the remaining data onto the same space via a linear transformation matrix given by PCA. This is much more efficient than computing PCA on the entire dataset.
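[Editor's illustration, not the authors' code: with scikit-learn this pattern is two calls. Fit the projection on the subset, then apply the stored linear transformation to everything else; the sizes below are placeholders.]

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 50))                  # stand-in for a large dataset
subset = X[rng.choice(len(X), 5_000, replace=False)]

pca = PCA(n_components=2).fit(subset)  # expensive step runs on the subset only
coords_all = pca.transform(X)          # cheap matrix multiply for all items
```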

Although these approximated approaches can't give the exact same results as those generated by using the entire data from the beginning, they're a viable way to ensure real-time VA for big data.

To achieve tight integration between computational methods and VA, researchers from each side must care more about the other side. In particular, researchers who design computational methods must realize that making an algorithm more interactive and interpretable in practical data analysis scenarios is just as important as addressing practical concerns such as the data's maximum applicable size, computation time, and memory requirements. On the other side, researchers who apply computational methods to VA need to understand the algorithm details as much as possible and tailor them to make them blend well in real-time VA.



[Figure 4 diagram: (a) a computational module (subroutines 1…k) iterates to completion, then its output flows to visualization/summarization; (b) the same pipeline, with interaction between the visualization and the computational module during iterations.]

Figure 4. Two approaches to applying computational methods to VA. (a) In the standard approach, visualization and interaction occur only after the computational module finishes its iterations. (b) In iteration-level interactive visualization, intermediate results are visualized dynamically; users can interact with the computational module during iterations.
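[Editor's illustration, not from the article: one common way to realize the Figure 4b pattern in code is a generator-style computational module that yields after every iteration, so the visualization layer can redraw and check for interaction between iterations. The step and UI functions below are hypothetical placeholders.]

```python
def iterative_module(data, step, init=None, max_iter=100):
    """Yield the intermediate state after each iteration (Figure 4b pattern)."""
    state = init
    for i in range(max_iter):
        state = step(data, state)  # one pass through subroutines 1..k
        yield i, state             # hand control back to the visualization

# Hypothetical event loop: redraw each intermediate result, allow early stop.
# for i, state in iterative_module(X, step=kmeans_step):
#     redraw(state)
#     if user_requested_stop():
#         break
```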

Page 24:

Next…
• Progressive Visualization
• Read: How Progressive Visualizations Affect Exploratory Analysis
• Write:
  - Critique of Paper
  - < 1 paragraph summary, 2 paragraphs critique
    • Which ideas in the paper are interesting and why?
    • Which ideas do you have related to the paper?
    • Which ideas seem problematic? Can you suggest alternatives?
  - Turn in via myCourses
  - Due Tuesday before class
