
Post on 03-Feb-2021


Unit 5

UNIT – V Big Data Visualization

Data visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

Data visualization objectives

1. Improved Insight

2. Faster Decision Making

3. Information Processing

4. To illustrate or hide the data

5. To make comparison between some statistical data or predict outcomes.

6. To find pattern or relationship among the data.


Challenges of Big Data Visualization

Scalability and dynamics are two major challenges in visual analytics.

The visualization-based methods take the challenges presented by the “four Vs” of big data and turn them into the following opportunities [2].

• Volume: The methods are developed to work with an immense number of datasets and make it possible to derive meaning from large volumes of data.

• Variety: The methods are developed to combine as many data sources as needed.

• Velocity: With the methods, businesses can replace batch processing with real-time stream processing.

• Value: The methods not only enable users to create attractive infographics and heatmaps, but also create business value by gaining insights from big data.

There are also the following problems for big data visualization:

•  Visual noise: Most of the objects in a dataset are too closely related to one another, so users cannot distinguish them as separate objects on the screen.

•  Information loss: Reduction of visible data sets can be used, but leads to information loss.

•  Large image perception: Data visualization methods are not only limited by aspect ratio and resolution of device, but also by physical perception limits.

•  High rate of image change: Users who are observing the data cannot react to the number of data changes or their intensity on the display.

•  High performance requirements: These are hardly noticeable in static visualization, where visualization speed requirements are lower, but they become significant when the display must update quickly.

Perceptual and interactive scalability are also challenges of big data visualization. Visualizing every data point can lead to over-plotting and may overwhelm users’ perceptual and cognitive capacities; reducing the data through sampling or filtering can elide interesting structures or outliers. Querying large data stores can result in high latency, disrupting fluent interaction [13].
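One way to picture the trade-off above is grid binning, sketched here in Python. This is only an illustration and is not taken from the cited work: the function name `bin_points`, the sample points, and the cell size are all assumptions. Instead of drawing every point, the points are counted per grid cell, so dense regions become a single count while an isolated point still keeps its own cell.

```python
from collections import Counter

def bin_points(points, cell_size):
    """Aggregate (x, y) points into square grid cells and count points per cell."""
    counts = Counter()
    for x, y in points:
        # Floor division maps each coordinate to the index of its grid cell.
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

# Hypothetical sample data: two dense regions and one isolated point.
points = [(0.1, 0.2), (0.4, 0.9), (1.5, 0.3), (1.7, 0.8), (1.9, 0.1), (5.0, 5.0)]
density = bin_points(points, cell_size=1.0)

for cell, n in sorted(density.items()):
    print(cell, n)
```

A renderer would then draw one mark per cell (for example, a heat map tile shaded by the count) rather than one mark per data point.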

In Big Data applications, it is difficult to conduct data visualization because of the large size and high dimensionality of big data. Most current Big Data visualization tools perform poorly in scalability, functionality, and response time. Uncertainty can arise during a visual analytics process and poses a great challenge to effective uncertainty-aware visualization [5].

Potential solutions to some challenges or problems about visualization and big data were presented [14]:

1. Meeting the need for speed: One possible solution is hardware. Increased memory and powerful parallel processing can be used. Another method is putting data in-memory but using a grid computing approach, where many machines are used.

2. Understanding the data: One solution is to have the proper domain expertise in place.

3. Addressing data quality: It is necessary to ensure the data is clean through the process of data governance or information management.

4. Displaying meaningful results: One way is to cluster data into a higher-level view where smaller groups of data are visible and the data can be effectively visualized.

5. Dealing with outliers: Possible solutions are to remove the outliers from the data or create a separate chart for the outliers.
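As a rough sketch of item 5, the outliers can be split from the rest of the data before charting, so that each group can be drawn separately. The interquartile-range rule used here is an assumption chosen for illustration (the source does not name a detection method), and `split_outliers` and the sample values are hypothetical.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the interquartile-range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers

# Hypothetical measurements with one extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 95]
inliers, outliers = split_outliers(values)

print("main chart:", inliers)
print("outlier chart:", outliers)
```

Plotting `inliers` on the main chart keeps its axis range readable, while `outliers` can go on a separate chart as the text suggests.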

Common general types of data visualization:

· Charts

· Tables

· Graphs

· Maps

· Infographics

· Dashboards

More specific examples of methods to visualize data:

· Area Chart

· Bar Chart

· Box-and-whisker Plots

· Bubble Cloud

· Bullet Graph

· Cartogram

· Circle View

· Dot Distribution Map

· Gantt Chart

· Heat Map

· Highlight Table

· Histogram

· Matrix

· Network

· Polar Area

· Radial Tree

· Scatter Plot (2D or 3D)

· Streamgraph

· Text Tables

· Timeline

· Treemap

· Wedge Stack Graph

· Word Cloud

· And any mix-and-match combination in a dashboard!

Data Visualization techniques / methods

1. Data Visualization

2. Information Visualization

3. Concept Visualization

4. Strategic Visualization

5. Metaphor Visualization

6. Compound Visualization

Tools used in Data Visualization / Proprietary Data Visualization Tools

Multiple line graphs

Line graphs are used for one-dimensional data. On the horizontal axis (Ox) the values are not repeated (e.g., time or the ordering of the table). The vertical axis (Oy) shows the values of the variable of interest. Multiple line graphs can be used to show more than two variables or dimensions (x, y1, y2, y3, etc.).

Wordle :

Wordle is a toy for generating “word clouds” from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes. The images you create with Wordle are yours to use however you like. You can print them out, or save them to your own desktop to use as you wish.
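The idea of giving frequent words greater prominence can be sketched as follows. This is not Wordle's actual algorithm, which the text does not describe; the linear scaling, the size limits, and the name `word_weights` are assumptions for illustration.

```python
import re
from collections import Counter

def word_weights(text, min_size=10, max_size=40):
    """Map each word to a font size that grows with its frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # Linear scaling: the most frequent word gets max_size,
    # and no word is drawn smaller than min_size.
    return {
        word: max(min_size, round(max_size * n / top))
        for word, n in counts.items()
    }

sizes = word_weights("data data data chart chart map")
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(word, size)
```

A layout step, not shown here, would then place the words on the canvas at these sizes.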

pie chart

Tools: most statistical and charting software, Many Eyes, Google Charts, Tableau Public, Google Fusion Tables

Image created in Excel with randomized data.

tree map :

Tools: d3/Protovis, Many Eyes, Google Charts, Network Workbench/Sci2

Image created by code in d3 "examples/treemap/" directory.

bar chart, radial bar chart

Tools: most statistical and charting software, Many Eyes, Google Charts, Tableau Public, High Charts, Google Fusion Tables

Image created in Excel with data from Anscombe's quartet.

histogram

Tools: most statistical and charting software, Protovis, Many Eyes

Image: Pyrsmis. (2008). Black cherry tree histogram. CC BY-SA 3.0

Tools for Multidimensional Visualizations

Google Charts

Display live data on your website.  Includes Introduction, Quick Start, and Chart Gallery for ideas.

Many Eyes

An experiment by IBM Research and the IBM Cognos software group.  View others' visualizations, upload your own data and create your own visualizations.

Tableau Public

Tableau Public is a free tool that "brings data to life" (according to their website). View others' visualizations or create your own.  Tutorial included.

Weave

Web-based Analysis and Visualization Environment is designed to enable visualization of any available data.  WEAVE has a wide array of options for working with different data types.

Wordle

Generates “word clouds” from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes.

What are Hierarchical Data Visualizations?

Hierarchical Visualizations or Trees are collections of items with each item having a link to one parent item (except the root). Items and the links between parent and child can have multiple attributes. These can be applied to items and links. Tasks related to structural properties become interesting -  how many levels in the tree? or how many children does an item have?
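The two structural questions above can be answered directly from a parent-link representation. The sketch below is a minimal illustration; the item names and the helper functions `children_index`, `depth`, and `level_count` are hypothetical.

```python
from collections import defaultdict

# Each item links to exactly one parent; the root's parent is None.
parent_of = {
    "root": None,
    "a": "root",
    "b": "root",
    "a1": "a",
    "a2": "a",
    "a2x": "a2",
}

def children_index(parent_of):
    """Invert the parent links into a parent -> list of children mapping."""
    children = defaultdict(list)
    for item, parent in parent_of.items():
        if parent is not None:
            children[parent].append(item)
    return children

def depth(item, parent_of):
    """Number of links between the item and the root."""
    d = 0
    while parent_of[item] is not None:
        item = parent_of[item]
        d += 1
    return d

def level_count(parent_of):
    """Number of levels in the tree, counting the root as level one."""
    return 1 + max(depth(item, parent_of) for item in parent_of)

children = children_index(parent_of)
print("levels:", level_count(parent_of))
print("children of a:", len(children["a"]))
```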

Examples include: dendrogram, phylogenetic tree, radial tree, hyperbolic tree, tree map, cone tree, radial hierarchy, and decision tree/flow chart. What do these look like?  They are among the light green elements on the Periodic Table of Visualization.

Tools for Hierarchical Visualizations

Network Workbench

This is a large-scale network analysis, modeling, and visualization toolkit for biomedical, social science and physics research.  It designs, evaluates, and operates a unique distributed, shared resources environment for large-scale network analysis, modeling, and visualization.

Protovis

A graphical approach to visualization, Protovis composes custom views of data with simple marks such as bars and dots, and defines marks through dynamic properties that encode data. Protovis is mostly declarative and designed to be learned by example.  It is no longer under active development.

d3.js

From the developers of Protovis, d3.js is a small, free JavaScript library for manipulating documents based on data.  It can produce choropleth, motion chart, hive plot, and fisheye distortion visualizations.

Many Eyes

An experiment by IBM Research and the IBM Cognos software group.  View others' visualizations, upload your own data and create your own visualizations.

Open-Source Data Visualization Tools

1. Candela

Candela is a data visualization package made available through the Resonant platform. Candela separates itself from other tools by providing a full suite of data visualization components. The training documentation provides for a quick start for novices to get up to speed, and code can be used via JavaScript, Python, or R. Installation of Candela locally can be done via the latest public release package through a repository, though tool documentation suggests installing the package from source as it will allow for the latest development release.

2. Charted

Charted is perhaps one of the easiest data visualization tools around, as it simply requires a link to a .csv file or a Google Sheets location; hit GO and Charted creates a visual display using a bar or line chart. According to the developers of Charted (created by the Product Science Team at Medium), the tool was built around three principles: it does not store data, does not transform data, and is not a formatting tool. It pulls data on a regular cadence (refreshes every 30 minutes) so changes made to the underlying sheet are always up-to-date in the chart. It also supports tab-delimited files and Dropbox links. Training? Non-existent, though neither is it required.

3. Datawrapper

Datawrapper is a tool that has been in existence since 2011 and is primarily used by journalists, though is comprehensive enough to be useful to any data scientist or researcher. In contrast to most of the tools profiled here, Datawrapper has free and paid versions. It’s also not technically open-source because no coding skills are needed. As the site home page explains, you simply cut & paste, visualize, and publish. Charts are interactive, meaning viewers can see underlying values, and the visualizations can also be embedded on a website. There is a wide range of charting options from simple bar charts to scatter plots, as well as mapping functionality.

4. Leaflet

Leaflet is all about maps. In fact, it has no charting capabilities but touts itself as the “leading open-source JavaScript library for mobile-friendly interactive maps”. The tool provides for a variety of mapping layers, and interaction features such as zoom controls, and mouseover functionality. There is also customization capability such as map projections and easy CSS3 restyling. Additional features can be provided via plugins, and users can vote for additional plug-ins if one is not available. There are both basic tutorials such as a quick start guide as well as more advanced training for plugin development. Install files can be accessed through a repository (both stable and in-progress versions) as well as through source code.

5. RawGraphs

Similar in some respects to Charted and Datawrapper, RawGraphs, whose tagline is the missing link between spreadsheets and data visualizations, simply requires the user to either cut/paste data, upload, or provide a link to create a wide variety of charts. One feature that differentiates RawGraphs is that a number of unconventional visualization models are provided (e.g. sunburst, alluvial diagrams, dendrograms for hierarchical clustering, etc.). Don’t fret, novices – the usual suspects (bar, line, pie, scatter) are also included. For advanced users, new chart types can also be created. Visual creations can be exported as vector or raster images for display on your website, and the tutorials, while not extensive, can be completed quickly so you can get right to work on that visual magnum opus.

6. Chartist.js

Chartist.js is another JavaScript library that embodies its tagline as Simple Responsive Charts. Indeed. No waterfalls or boxplots here, but what Chartist.js loses in diversity it more than makes up for in customization. Style sheets (CSS) can be customized to a great degree in this tool with customization allowing for animation of visualizations, some using SVG. What is SVG? SVG is scalable vector graphics, a format that allows for interactivity and animation, as well as being scalable (without loss of resolution quality). Chartist.js sees SVG as a cutting-edge technology, a vision apparently shared by others. There are some browser compatibility issues, but the site provides a concise table indicating compatible browsers.
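To make the SVG description concrete, here is a small Python sketch that writes a bar chart as SVG markup by hand. It does not use Chartist.js and is not how that library works internally; `svg_bar_chart`, the bar dimensions, and the sample values are assumptions. The `viewBox` attribute is what lets a browser scale the drawing to any display size without loss of quality.

```python
def svg_bar_chart(values, bar_width=20, height=100):
    """Return an SVG document string with one rectangle per value."""
    top = max(values)
    bars = []
    for i, v in enumerate(values):
        h = round(height * v / top)
        # SVG's y axis points down, so a bar of height h starts at height - h.
        bars.append(
            f'<rect x="{i * bar_width}" y="{height - h}" '
            f'width="{bar_width - 2}" height="{h}"/>'
        )
    width = len(values) * bar_width
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">'
        + "".join(bars)
        + "</svg>"
    )

svg = svg_bar_chart([1, 2, 4])
print(svg)
```

Because each bar is an ordinary element in the markup, it can be styled with CSS or animated, which is the kind of customization the paragraph describes.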

7. D3.js

D3.js is yet another JavaScript library that develops data visualizations through the use of HTML, SVG, and CSS. D3 stands for Data-Driven Documents, where the document is the Document Object Model (DOM). The core idea behind D3.js is to leverage the full capability of the modern browser for the development of visualizations through web standards, without “tying yourself to a proprietary framework”. In terms of learning curve, this is the polar opposite of the cut-and-paste tools, so D3.js is decidedly not for those who want to avoid writing code. That said, if you are looking for a tool that provides nearly unlimited functionality in terms of design creativity and charting options, D3.js might be just the ticket!

8. Plotly

Plotly is another example of a tool that has both open-source and proprietary (paid) products, each tier containing its own functionality. Offerings can be grouped into two platforms (Plotly On-Premises and Plotly Cloud) with four primary business intelligence tools covering charting, dashboards, slide decks, and SQL client. The SQL client is free, while Plotly libraries are available as open-source through JavaScript, Python, and R. One of the oft-marketed features of Plotly (at least in the paid tools) includes the ability to collaborate and share data visualizations with other team members.

9. Polymaps

Similar to Leaflet, and as the name suggests, Polymaps is a tool consisting of a JavaScript library for “making dynamic, interactive maps in modern web browsers”. Polymaps is another tool that leverages SVG functionality, facilitating styling through CSS, and allows for increased interactivity. Examples of mapping visualizations include general street layer mapping, choropleth maps (for instance, comparing state-level data), population density, and even the use of k-means clustering.

10. OpenHeatMap

In the category of upload and create, OpenHeatMap is a fairly basic tool that allows users to upload a CSV, Excel, or Google Sheets file and create a map instantly. OpenHeatMap can also be used by developers (as a jQuery plugin) to provide mapping functionality within their own website. Users uploading a file for rendering are advised to include a full street address in one field, with values represented in another field (for instance, housing value, sales price, number of employees, etc.). Geographies can be point-based (i.e. one address), or aggregates such as city, county, state, etc.

11. DyGraphs

DyGraphs claims as one of its primary features the ability to handle huge data sets, plotting millions of data points without “getting bogged down”. Another feature, for those who consider themselves stats nerds, is the ability to display error bars and/or confidence intervals. To use these, one standard deviation must be specified in the data file. The tutorial demonstrations are fairly basic but should serve to get someone started fairly quickly in creating their own visualizations.
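A data file with standard deviations could be prepared along these lines. This sketch assumes one row per x value in the order x, value, standard deviation; that column layout is an assumption and should be checked against the dygraphs documentation before use. The helper names and the sample measurements are hypothetical.

```python
import csv
import io
import statistics

def error_bar_rows(samples_by_x):
    """Reduce repeated measurements to (x, mean, sample standard deviation)."""
    rows = []
    for x in sorted(samples_by_x):
        samples = samples_by_x[x]
        rows.append((x, statistics.mean(samples), statistics.stdev(samples)))
    return rows

def to_csv(rows):
    """Write rows as CSV text; the column order x, value, stddev is assumed."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for x, mean, sd in rows:
        writer.writerow([x, f"{mean:.2f}", f"{sd:.2f}"])
    return buf.getvalue()

# Hypothetical repeated measurements for two x values.
samples_by_x = {1: [9.0, 10.0, 11.0], 2: [19.0, 20.0, 21.0, 20.0]}
csv_text = to_csv(error_bar_rows(samples_by_x))
print(csv_text)
```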

Analytical techniques used in big data visualization

https://www.youtube.com/watch?v=e6Nc-P76yMQ

1. Classification

Decision Tree (prediction)

2. Regression

Linear and Logistic

3. Clustering (unsupervised)

Prediction

Grouping

K-means

4. Association rule

Unsupervised learning

Prediction

Relationship

Rules/items
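Of the techniques listed above, K-means is easy to sketch in a few lines. The version below works on one-dimensional values with fixed starting centroids so that the result is reproducible; the function name `kmeans_1d`, the sample values, and the starting centroids are assumptions, and a real analysis would normally use a library implementation on multi-dimensional data.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group 1-D values around centroids by repeated assign-and-update steps."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Converged: the assignments no longer change.
        centroids = new_centroids
    return centroids, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

The resulting groups are what a visualization would then encode, for example by colouring each point according to its cluster.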

Pentaho Reporting

Pentaho reporting depends on the JFreeReport project. It helps you to fulfill your business reporting needs. This component also offers both scheduled and on-demand report publishing in popular formats such as XLS, PDF, TXT, and HTML.

Analysis

It offers a wide range of analysis features, including a pivot table view. The tool provides enhanced GUI features (using Flash or SVG), integrated dashboard widgets, portal, and workflow integration.

Moreover, Pentaho Spreadsheet Services allows a user to browse, pivot, and chart data from within MS Excel.

Dashboards

The dashboard offers Reporting and Analysis, which contribute content to Pentaho Dashboards. The self-service dashboard designer includes extensive built-in dashboard templates and layout. It allows business users to build personalized dashboards with little training.

Data Mining

The data mining tool discovers hidden patterns and indicators of future performance. It offers a comprehensive set of machine learning algorithms from the Weka project, including clustering, decision trees, random forests, principal component analysis, and neural networks.

It allows you to view data graphically, interact with it programmatically, or use multiple data sources for reports, further analysis, and other processes.

Pentaho Data Integration

This component is used to integrate data wherever it exists.

It provides a rich transformation library with over 150 out-of-the-box mapping objects.

It supports a wide range of data sources, including more than 30 open source and proprietary database platforms as well as flat files. It also helps Big Data analytics with integration and management of Hadoop data.

Who is using Pentaho BI?

Pentaho BI is a tool widely used by many software professionals, such as:

· Open source software programmers

· Business analysts and researchers

· College students

· Business intelligence counselors

Flare Data Visualization for the Web

· Flare is an ActionScript library for creating visualizations that run in the Adobe Flash Player. From basic charts and graphs to complex interactive graphics, the toolkit supports data management, visual encoding, animation, and interaction techniques. Even better, flare features a modular design that lets developers create customized visualization techniques without having to reinvent the wheel.

· View the demos and sample applications to see a few of the visualizations that flare makes it easy to build.

· To begin making your own visualizations, download flare and work through the tutorial. You should also get familiar with the API documentation. Need more help? Visit the help forum (you'll need a SourceForge login to post).

Jasper Reports – Open Source Reporting Tool

JasperReports is an open source reporting engine. It provides the ability to deliver rich content to the printer, the screen, or into various formats such as PDF, HTML, XLS, RTF, ODT, CSV, TXT and XML files. It is a Java library and can be used in a variety of Java-enabled applications to generate dynamic content. Its main purpose is to help create page-oriented, ready-to-print documents in a simple and flexible manner. JasperReports can also be used to provide reporting capabilities in our applications.

As it is not a standalone tool, it cannot be installed on its own. Instead, it is embedded into Java applications by including its library in the application's CLASSPATH.

Dygraphs

dygraphs is a fast, flexible open source JavaScript charting library.

It allows users to explore and interpret dense data sets. Here's how it works:

The chart is interactive: you can mouse over to highlight individual values. You can click and drag to zoom. Double-clicking will zoom you back out. Shift-drag will pan. You can change the number and hit enter to adjust the averaging period.
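The averaging period mentioned above can be understood as a moving average over the most recent points. The sketch below assumes a trailing window that simply shrinks at the start of the series; whether dygraphs handles the first points exactly this way is an assumption, and `rolling_average` is a hypothetical name.

```python
def rolling_average(values, period):
    """Average each point with up to period - 1 of the points before it."""
    if period < 1:
        raise ValueError("period must be at least 1")
    smoothed = []
    for i in range(len(values)):
        # The window is shorter than `period` for the first few points.
        window = values[max(0, i - period + 1): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

raw = [2, 4, 6, 8, 10]
smoothed = rolling_average(raw, period=3)
print(smoothed)
```

A larger period gives a smoother line at the cost of hiding short-lived spikes.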

Features

· Handles huge data sets: dygraphs plots millions of points without getting bogged down.

· Interactive out of the box: zoom, pan and mouseover are on by default.

· Strong support for error bars / confidence intervals.

· Highly customizable: using options and custom callbacks, you can make dygraphs do almost anything.

· dygraphs works in all recent browsers. You can even pinch to zoom on mobile/tablet devices!

· There's an active community developing and supporting dygraphs.

Datameer Analytics Solution and Cloudera

NodeBox

NodeBox is a node-based software application for generative design. It helps designers, and anyone else who uses it, to automate tedious production challenges, visualise large sets of data, and harness the raw power of the computer without using the mechanical language of machines. The tools integrate with traditional design applications and are cross-platform.

Gephi

Gephi is open-source software for visualizing and analysing large network graphs. Gephi uses a 3D render engine to display graphs in real time and speed up exploration. You can use it to explore, analyse, spatialise, filter, cluster, manipulate, and export all types of graphs.

Google Chart API

The Google Chart API is an interactive Web service (now deprecated) that creates graphical charts from user-supplied data. Google servers create a PNG (Portable Network Graphics) image of a chart from data and formatting parameters specified by a user's HTTP request. The service supports a wide variety of chart information and formatting. Users may conveniently embed these charts in a Web page by using a simple image tag.
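The request-and-embed pattern described above can be sketched as follows. The endpoint and the parameter names (`cht`, `chs`, `chd`, `chl`) are recalled from the deprecated image-chart service and should be treated as assumptions; since the service is deprecated, the URL may no longer return an image. The helper names are hypothetical.

```python
from html import escape
from urllib.parse import urlencode

# Endpoint recalled from the deprecated image-chart service (an assumption);
# the service may no longer respond.
BASE_URL = "https://chart.googleapis.com/chart"

def chart_url(chart_type, size, data, labels):
    """Encode chart data and formatting as HTTP query parameters."""
    params = {
        "cht": chart_type,                                # chart type
        "chs": size,                                      # size in pixels
        "chd": "t:" + ",".join(str(v) for v in data),     # data as text
        "chl": "|".join(labels),                          # labels
    }
    # Keep ':', ',' and '|' readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(params, safe=":,|")

def image_tag(url, alt):
    """Build the simple image tag used to embed the chart in a page."""
    return f'<img src="{escape(url)}" alt="{escape(alt)}">'

url = chart_url("p3", "250x100", [60, 40], ["Hello", "World"])
tag = image_tag(url, "Pie chart")
print(tag)
```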

Flot

Flot is a JavaScript plotting library. It is small, performs well, and supports many chart types. Plugins are also available for Flot.

The available chart types include line, pie, bar, area, and stacked charts. Flot also supports real-time and Ajax chart updates. If you know a little JavaScript and jQuery, you can get started with Flot easily.

Flot can handle hundreds of data points easily; even when a real-time chart is redrawn every 100 milliseconds, it still runs very fast. Flot runs on IE, Firefox, Chrome, Safari, and Opera. Because Flot uses HTML5 Canvas, on IE8 and below you can use excanvas to make IE emulate HTML5 Canvas so that Flot works properly.

Visual.ly

Visual.ly is a community platform for data visualization and infographics (a clipped compound of "information" and "graphics").