Big Data & Hadoop


Page 1: Big Data & Hadoop

Page 2: Big Data & Hadoop

Course Topics

Week 1
- Understanding Big Data
- Introduction to HDFS
- Playing around with a cluster
- Data loading techniques

Week 2
- MapReduce basics, types, and formats
- Use cases for MapReduce
- Analytics using Pig
- Understanding Pig Latin

Week 3
- Analytics using Hive
- Understanding HiveQL
- NoSQL databases
- Understanding HBase

Week 4
- Zookeeper, Sqoop, Flume
- Debugging MapReduce programs in Eclipse
- Real-world datasets and analysis
- Planning a career in Big Data

Page 3: Big Data & Hadoop

What is Big Data?

Page 4: Big Data & Hadoop

Facebook Example

Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network. On average, 3.2 billion likes and comments are posted every day.

Page 5: Big Data & Hadoop

Twitter Example

- Twitter has over 500 million registered users.
- The USA leads with 141.8 million accounts, 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK, and Indonesia.
- 79% of US Twitter users are more likely to recommend brands they follow.
- 67% of US Twitter users are more likely to buy from brands they follow.
- 57% of all companies that use social media for business use Twitter.

Page 6: Big Data & Hadoop

Other Industry Use Cases

- Insurance
- Healthcare
- Genome sequencing
- Utilities

Page 7: Big Data & Hadoop

Hadoop Users

http://wiki.apache.org/hadoop/PoweredBy

Page 8: Big Data & Hadoop

Data volume is growing exponentially

- Estimated global data volume:
  - 2011: 1.8 ZB
  - 2015: 7.9 ZB
- The world's information doubles every two years.
- Over the next 10 years:
  - The number of servers worldwide will grow by 10x.
  - The amount of information managed by enterprise data centers will grow by 50x.
  - The number of "files" enterprise data centers handle will grow by 75x.

Source: http://www.emc.com/leadership/programs/digital-universe.htm, which was based on the 2011 IDC Digital Universe Study

Page 9: Big Data & Hadoop

Unstructured data is exploding

Page 10: Big Data & Hadoop

Why DFS?

Read 1 TB of data:
- 1 machine: 4 I/O channels, 100 MB/s per channel
- 10 machines: 4 I/O channels each, 100 MB/s per channel

Page 11: Big Data & Hadoop

Why DFS?

Read 1 TB of data:
- 1 machine (4 I/O channels at 100 MB/s each): 45 minutes
- 10 machines (4 I/O channels each, 100 MB/s per channel)

Page 12: Big Data & Hadoop

Why DFS?

Read 1 TB of data:
- 1 machine (4 I/O channels at 100 MB/s each): 45 minutes
- 10 machines (4 I/O channels each, 100 MB/s per channel): 4.5 minutes
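These timings follow from simple throughput arithmetic; as a back-of-the-envelope check (treating 1 TB as roughly $10^6$ MB):

$$
t_{1\ \text{machine}} = \frac{10^6\ \text{MB}}{4 \times 100\ \text{MB/s}} = 2500\ \text{s} \approx 42\ \text{min},
\qquad
t_{10\ \text{machines}} \approx \frac{2500\ \text{s}}{10} \approx 4.2\ \text{min},
$$

which the slides round to 45 and 4.5 minutes. Ten machines reading disjoint tenths of the data in parallel deliver roughly ten times the aggregate I/O throughput, and that is the core motivation for a distributed file system.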

Page 13: Big Data & Hadoop

What is a Distributed File System (DFS)?

Page 14: Big Data & Hadoop

What is Hadoop?

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

Companies using Hadoop:

- Yahoo
- Google
- Facebook
- Amazon
- AOL
- IBM
- And many more at http://wiki.apache.org/hadoop/PoweredBy
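To make the "simple programming model" concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The class names and input/output paths are illustrative, not part of the slides:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on mappers
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It would be packaged into a jar and submitted with something like "hadoop jar wordcount.jar WordCount /input /output", where the two paths (illustrative here) live in HDFS.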

Page 15: Big Data & Hadoop

Hadoop Ecosystem

Page 16: Big Data & Hadoop

Hadoop Core Components:

- HDFS (Hadoop Distributed File System): storage
- MapReduce: processing
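To give the storage component a concrete face, here is a minimal sketch of writing and then reading a file through the HDFS Java API. The path is illustrative, and the Configuration is assumed to pick up a running cluster's settings from core-site.xml on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS and related settings from the cluster configuration.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative path; any HDFS location the user can write to works.
    Path file = new Path("/user/demo/hello.txt");

    // Write a small file, overwriting it if it already exists.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("hello, HDFS\n");
    }

    // Read the file back and print its first line.
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}

The design point worth noticing is that the client code never says which machines hold the data: HDFS presents one file system namespace, and the framework handles block placement and replication underneath.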

Page 17: Big Data & Hadoop

Any questions? See you in the next class.

Thank you.
Sainagaraju Vaduka