Introduction to Big Data DEENA DAYALAN

Upload: deenadayalancs

Post on 27-Jan-2015

Page 1: Introduction to Big Data

Introduction to Big Data DEENA DAYALAN

Page 2: Introduction to Big Data

What is Big Data?

According to Wikipedia, big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.

According to Gartner, “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

Any data that cannot be processed in some manner using current relational database technologies can be considered big data.

Page 3: Introduction to Big Data

Big Data Age

We are living in the age of big data.

Data is collected from everywhere and stored in data warehouses.

Social networks and web services – Facebook, Twitter, Flickr, Google, Amazon, etc.

Bank / credit card transactions – used, for example, for sending promotional offers.

E-commerce – eBay, Amazon.

Page 4: Introduction to Big Data

Big Data Vectors (3Vs)

Page 5: Introduction to Big Data

1 - Data Volume

The Large Hadron Collider (LHC) at CERN generates approximately one petabyte of data per second. CERN stores 25 petabytes of data per year.

Large Synoptic Survey Telescope (LSST) http://lsst.org/lsst/ Over 30 thousand gigabytes (30 TB) of images will be generated every night during the decade-long LSST sky survey.

eBay has a 90-petabyte data warehouse.

Petabyte-scale data sets are common these days, and exabyte-scale data sets are not far away.

Data volume is increasing exponentially.
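To put the figures on this slide side by side, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 PB = 1000 TB) and, as a simplification not stated in the slides, that the LSST observes on all 365 nights of every year, so the survey total is an upper bound. The function names are illustrative only.

```python
# Decimal (SI) units are assumed: 1 PB = 1000 TB.
TB_PER_PB = 1000


def lsst_survey_total_pb(tb_per_night=30, nights_per_year=365, years=10):
    """Upper-bound image volume of the survey in PB.

    Assumes observation on every night of the year, which is a
    simplification made for this sketch.
    """
    return tb_per_night * nights_per_year * years / TB_PER_PB


def seconds_to_produce(stored_pb, rate_pb_per_second=1.0):
    """Seconds of raw output needed to produce `stored_pb` petabytes."""
    return stored_pb / rate_pb_per_second


print(lsst_survey_total_pb())   # 109.5 (PB over ten years)
print(seconds_to_produce(25))   # 25.0 (seconds of LHC output at 1 PB/s)
```

Under these assumptions, the LSST survey would total roughly 110 PB, and the 25 PB that CERN stores per year corresponds to only about 25 seconds of raw LHC output, which suggests how heavily the data must be filtered before storage.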

Page 6: Introduction to Big Data

2 - Data Velocity

Initially, companies analyzed data using a batch process. One takes a chunk of data, submits a job to the server and waits for delivery of the result.

This works only when the incoming data rate is slower than the batch processing rate and the data is still useful despite the delay.

With the new sources of data such as social and mobile applications, the batch process breaks down.

The data is now streaming into the server in real time, in a continuous fashion, and the result is only useful if the delay is very short.

Twitter users send over 400 million tweets per day.
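The condition above (batch processing keeps up only while data arrives more slowly than it is processed) can be illustrated with a minimal simulation. The rates of 80, 100 and 120 records per time step are hypothetical numbers chosen for the sketch, and `backlog_over_time` is an illustrative helper, not part of any tool named in these slides.

```python
def backlog_over_time(arrival_rate, processing_rate, steps):
    """Return the number of unprocessed records after each time step."""
    backlog = 0
    history = []
    for _ in range(steps):
        backlog += arrival_rate                   # new records arrive
        backlog -= min(backlog, processing_rate)  # process as many as possible
        history.append(backlog)
    return history


# Arrival slower than processing: the backlog stays empty.
print(backlog_over_time(arrival_rate=80, processing_rate=100, steps=5))
# [0, 0, 0, 0, 0]

# Arrival faster than processing: the backlog grows without bound.
print(backlog_over_time(arrival_rate=120, processing_rate=100, steps=5))
# [20, 40, 60, 80, 100]

# Average rate implied by 400 million tweets per day (about 4,630 per second).
print(400_000_000 / (24 * 60 * 60))
```

Once the arrival rate exceeds the processing rate, every step adds to the delay, which is why results from a batch pipeline become stale for streaming sources.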

Page 7: Introduction to Big Data

3 - Data Variety

Previously, data was stored in relational database tables and Excel files.

Formats now include plain text, photos, audio, video, web pages, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, social media data, etc.

One no longer has control over the input data format. Structure can no longer be imposed, as it was in the past, in order to keep control over the analysis.

On Facebook, we post images, audio files, comments (text), etc.

Google uses smartphones as sensors to determine traffic conditions from their GPS data and internet connectivity. Related traffic data sources include traffic cameras and RFID tags in electronic payment systems.

Page 8: Introduction to Big Data

Importance of Big Data

Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and as of 2005 they had the world’s three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.

Facebook handles 50 billion photos from its user base.

In 2012, the Obama administration announced the Big Data Research and Development Initiative, which explored how big data could be used to address important problems faced by the government. The initiative was composed of 84 different big data programs spread across six departments.

At the LHC, even when working with less than 0.001% of the sensor stream data, the data flow from all four experiments represents an annual rate of 25 petabytes before replication (as of 2012). This becomes nearly 200 petabytes after replication.

Page 9: Introduction to Big Data

Tools used in Big Data Scenarios

NoSQL Databases

MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper

MapReduce

Hadoop, Hive, Pig, Cascading, Cascalog, MapR, Flume, Kafka, Azkaban, Oozie, Greenplum
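As a rough illustration of the MapReduce model that these tools build on, the sketch below counts words in plain Python on a single machine. It does not use Hadoop or any other tool listed here; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only to mirror the usual map, shuffle and reduce stages.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data is big", "data is everywhere"]))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster, the map and reduce calls would run in parallel on many machines and the shuffle would move data over the network; this sketch only shows the shape of the computation.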

Storage

S3, Hadoop Distributed File System

Servers

EC2, Google App Engine, Elastic Beanstalk, Heroku

Page 10: Introduction to Big Data

Gartner Hype Cycle 2013

Page 12: Introduction to Big Data

Questions?