Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem
DESCRIPTION
This is the first introductory session on Big Data & Hadoop. It clears up many myths and much of the confusion about Big Data: why handling Big Data is a challenge and how Hadoop addresses those problems. It is typically a 3-hour session. The first half covers Big Data, its challenges, and how Hadoop solves them. The second half covers the Hadoop component stack, the HDFS architecture, and the MapReduce paradigm and architecture. By the end of this presentation you should be fairly clear about Hadoop and Big Data. To watch the video or learn more about the course, please visit http://www.knowbigdata.com/page/big-data-and-hadoop
TRANSCRIPT
![Page 1: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/1.jpg)
Sandeep Giri - Hadoop
Session 1 - Introduction
Welcome to Big Data & Hadoop Session
[email protected]
+91-9538998962
Use Q&A to Communicate With Instructor
![Page 2: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/2.jpg)
WELCOME - KNOWBIGDATA
Interact - Ask Questions
Lifetime access to content
Class Recording
Cluster Access
24x7 support
Real Life Project
Quizzes & Certification Test
10 x (3hr class)
Socio-Pro Visibility
Mock Interviews
![Page 3: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/3.jpg)
ABOUT ME
2014: KnowBigData - Founded
2014: Amazon - Built high-throughput systems for the Amazon.com site using in-house NoSQL.
2012: InMobi - Built a recommender after churning 200 TB
2011: tBits Global - Founded tBits Global; built an enterprise-grade Document Management System
2006: D.E. Shaw - Built big data systems before the term was coined
2002: IIT Roorkee - Finished B.Tech somehow.
![Page 4: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/4.jpg)
COURSE CONTENT
I Understanding Big Data, Hadoop Architecture
II Cluster Setup, ETL, Project Environment
III MapReduce framework
IV Adv MapReduce & Testing
V Analytics using Pig
VI Analytics using Hive
VII NoSQL, HBASE
VIII Zookeeper, Oozie, Mahout
IX Apache Flume, Apache Spark
X YARN, Big Data Sets & Project Assignment
![Page 5: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/5.jpg)
TODAY’S CLASS
What/why of Big Data?
Why Now?
Example Customers
What is Hadoop?
Hadoop Components
HDFS Architecture
NameNode
Secondary NameNode
Job Tracker
File Read/Write
Replication & Rack Awareness
Further Reading/Assignment
![Page 7: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/7.jpg)
WHAT IS BIG DATA?
Simply: Data of Very Big Size
Can’t process with usual tools
Distributed Architecture Needed
Structured / Unstructured
![Page 8: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/8.jpg)
WHAT IS BIG DATA?
Facebook: 500 TB / day
Boeing 737: 240 TB / flight
Clickstreams: ~1m events / sec
Geospatial data
3D data
Audio & video
Unstructured text
![Page 9: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/9.jpg)
How many bytes in a petabyte?
![Page 10: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/10.jpg)
![Page 11: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/11.jpg)
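How many bytes in a petabyte?

| Prefix | Equals | In bytes |
|---|---|---|
| Kilo | 1024 Bytes | 1024^1 Bytes |
| Mega | 1024 KB | 1024^2 Bytes |
| Giga | 1024 MB | 1024^3 Bytes |
| Tera | 1024 GB | 1024^4 Bytes |
| Peta | 1024 TB | 1024^5 Bytes |
| Exa | 1024 PB | 1024^6 Bytes |
| Zetta | 1024 EB | 1024^7 Bytes |
| Yotta | 1024 ZB | 1024^8 Bytes |

1 byte = 8 bits = can store 256 states
1 petabyte = 1024^5 bytes ≈ 1.1259x10^15 bytes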
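
The petabyte figure on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes the 1024-based (binary) units used in the slide; the helper name `bytes_in` is made up for illustration.

```python
# Binary (1024-based) unit prefixes, in increasing order.
UNITS = ["Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]

def bytes_in(unit: str) -> int:
    """Return the number of bytes in one <unit>byte using powers of 1024."""
    power = UNITS.index(unit) + 1  # Kilo -> 1024**1, Mega -> 1024**2, ...
    return 1024 ** power

petabyte = bytes_in("Peta")
print(petabyte)           # 1125899906842624
print(f"{petabyte:.4e}")  # 1.1259e+15
```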
![Page 13: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/13.jpg)
WHY IS IT IMPORTANT NOW?
Smart Phones
4.6 billion mobile phones. 1 - 2 billion people accessing the internet.
Facebook: 1.06 bn monthly active users, 30 billion pieces of content shared monthly.
~175 million tweets every day
Connectivity: Internet of Things
Connectivity: Social Networks
Moore's Law has failed. The data bus is the bottleneck.
![Page 14: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/14.jpg)
BIG DATA PROBLEM
To process & store data we need:
1. CPU Speed
2. RAM - Speed & Size
3. Disk Size + Speed
4. Network
![Page 15: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/15.jpg)
BIG DATA PROBLEM
To process & store data we need:
1. Disk Size + Speed
2. Speed
3. RAM - Speed & Size
And at least one of these becomes the bottleneck.
![Page 16: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/16.jpg)
![Page 17: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/17.jpg)
![Page 18: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/18.jpg)
![Page 19: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/19.jpg)
BIG DATA PROBLEM - STORAGE
Q: If you have 100 TB of data, how would you store it?
A: Use 100 drives of 1 TB each and mount them as 100 subfolders.
Challenges?
• What about failovers & backups?
• How would you distribute the data uniformly?
• Is this the best value for money?
• Is this the best use of resources? We might have hundreds of smaller drives already.
• What about increasing accessibility?
Then?
Hadoop Distributed File System or HDFS
![Page 20: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/20.jpg)
BIG DATA PROBLEM - PROCESSING
Q: How fast can a 1 GHz processor sort 1 TB of data?
1 TB == 10 billion names of 100 characters each
RAM = 2 GB
![Page 21: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/21.jpg)
![Page 22: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/22.jpg)
BIG DATA PROBLEM - PROCESSING
Q: How fast can a 1 GHz processor sort 1 TB of data?
A: Around 5 hours
Google, 8 Sept 2011: Sorting 10 PB took 6.5 hrs on 8000 computers
We need:
1. Faster Sort
2. Bigger Data Sorting
3. More often
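
To put the Google figure in perspective, the average sort rate per machine can be worked out from the three numbers quoted on the slide. This is a rough back-of-the-envelope sketch: it assumes decimal units (1 PB = 10^15 bytes) and an even spread of work across machines, neither of which the slide states.

```python
# Figures quoted on the slide. Decimal units (1 PB = 10**15 bytes) are an assumption.
DATA_BYTES = 10 * 10**15
MACHINES = 8000
HOURS = 6.5

def per_machine_throughput_mb_s(data_bytes, machines, hours):
    """Average data sorted per machine per second, in MB/s (1 MB = 10**6 bytes)."""
    seconds = hours * 3600
    return data_bytes / machines / seconds / 10**6

rate = per_machine_throughput_mb_s(DATA_BYTES, MACHINES, HOURS)
print(f"{rate:.1f} MB/s per machine")  # 53.4 MB/s per machine
```

Each machine handles about 1.25 TB at a modest rate; the speed comes from running thousands of them in parallel rather than from any single fast machine.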
![Page 23: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/23.jpg)
Components
Break for 5 Minutes. Come back at 10:08am IST.
![Page 24: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/24.jpg)
EXAMPLE BIG DATA CUSTOMERS
Web and e-tailing
• Recommendation Engines
• Ad Targeting
• Search Quality
• Sentiment Analysis
• Abuse and Click Fraud Detection
Telecommunications
• Customer Churn Prevention
• Network Performance Optimization
• Call Detail Record (CDR) Analysis
• Analyzing Network to Predict Failure
![Page 25: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/25.jpg)
EXAMPLE BIG DATA CUSTOMERS
Government
• Fraud Detection
• Cyber Security
• Welfare
• Justice
Healthcare & Life Sciences
• Health information exchange
• Gene sequencing
• Healthcare improvements
• Drug Safety
![Page 26: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/26.jpg)
AND MANY MORE…
kaggle.com
![Page 27: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/27.jpg)
11 COMMON MYTHS
BIG DATA
• Always means data above or in the range of TB
• Is always about social media. Doesn't apply to me.
• Will replace EDW
• Is just a buzzword. No practical applications
• Is a new concept
• Will be the future.
• Is expensive
• Is only for data scientists. Or is magic.
• We have enough hardware. Don't need any more.
• We will build it when we need it.
• Big Data is about Hadoop.
![Page 28: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/28.jpg)
Q1: How is analysis of Big Data saving money for businesses? - Souvik
A1: How is analysis of data saving money for business?
• Examine our assumptions before building a product.
• Evaluate sales / marketing strategies
• Retain existing customers / attain new ones
• A/B Testing
• Historical data analysis is better than surveys
A2: What is the expense of "Big Data" handling?
• Cost: All tools are free/open.
• Cheap H/W needed.
![Page 29: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/29.jpg)
Q2: How important is it to know big data to build a fast-growing IT career?
Short Answer: Very Important

| Number of Jobs | indeed.com | naukri.com |
|---|---|---|
| Hadoop | 1,102 | 2,312 |
| Big Data | 1,255 | 659 |

Analyst Reports
Gartner Top 10, 2014: Point #1, #4, #9, #10
Forbes top 10 tech trends: Point #5, #6, #8, #9
![Page 30: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/30.jpg)
WHAT IS HADOOP?
A. Cute Little Yellow Toy Elephant
B. Framework to handle Big Data
C. Open Source - Apache
D. Powerful, Popular & Supported
E. For reliable, scalable, distributed computing
F. Created by Doug Cutting (of Yahoo) and Mike Cafarella
G. Built for the Nutch search engine project
H. Written in Java
![Page 31: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/31.jpg)
CORE OF HADOOP?
MapReduce engine: Executes your logic on multiple computers in parallel.
Hadoop Distributed File System (HDFS): Stores multiple copies of file parts on multiple machines.
![Page 34: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/34.jpg)
![Page 35: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/35.jpg)
Components
Workflow
SQL Interface
New Language
Machine learning / STATS
NoSQL Database
Compute Engine
Main Component
![Page 36: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/36.jpg)
CORE COMPONENTS
Machine 1 Machine 2 Machine 3 Machine 4 Machine 5
![Page 37: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/37.jpg)
CORE OF HADOOP?
HDFS: Hadoop Distributed File System (Storage)
1. Distributed across “nodes”
2. Fault Tolerant & High Throughput
3. Low Cost Hardware
4. NameNode tracks locations.
MapReduce (Processing)
1. Splits a task across processors / nodes
2. “near” the data & assembles results
3. Self-Healing, High Bandwidth
4. Clustered storage
5. JobTracker manages the TaskTrackers
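
The MapReduce idea of splitting a task and then assembling results can be illustrated with a word count on a single machine. This is only a sketch of the programming model in plain Python, not Hadoop's Java API: the function names are made up for illustration, and on a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Each line could be mapped on a different node; here we loop locally.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data and hadoop", "hadoop stores big data"])
print(counts)  # {'big': 2, 'data': 2, 'and': 1, 'hadoop': 2, 'stores': 1}
```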
![Page 38: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/38.jpg)
WHAT IS HDFS?
Distributed file system to run on commodity hardware
Design Goals
1. Hardware Failure
2. Streaming Data Access
3. Large Data Sets
4. Simple Coherency Model
5. Moving Code is Cheaper than Moving Data
6. Portable - Java
![Page 40: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/40.jpg)
MAIN COMPONENTS OF HDFS?
NameNode:
1. Master of the system
2. Maintains and manages the blocks which are present on the DataNodes
DataNodes:
1. Slaves which are deployed on each machine and provide the actual storage
2. Responsible for serving read and write requests for the clients
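
How a file ends up as replicated blocks on several DataNodes can be sketched as follows. This is a simplified illustration: the round-robin placement, the 64 MB block size, the replication factor of 3, and the machine names are all assumptions chosen for the example. Real HDFS uses a rack-aware placement policy and configurable block size and replication.

```python
import math

def place_blocks(file_size, block_size, datanodes, replication=3):
    """Return {block_id: [datanode, ...]} using simple round-robin placement.

    Hypothetical policy for illustration only; not the HDFS placement algorithm.
    """
    if replication > len(datanodes):
        raise ValueError("replication cannot exceed the number of DataNodes")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive nodes, so every replica of a block is on a distinct machine.
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

MB = 1024 * 1024
nodes = ["machine1", "machine2", "machine3", "machine4", "machine5"]
layout = place_blocks(file_size=200 * MB, block_size=64 * MB, datanodes=nodes)
for block_id, replicas in layout.items():
    print(block_id, replicas)
```

A 200 MB file becomes four blocks here, and losing any single machine still leaves at least two copies of every block.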
![Page 41: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/41.jpg)
ANATOMY OF A FILE WRITE
![Page 42: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/42.jpg)
ANATOMY OF A FILE READ
![Page 43: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/43.jpg)
NAMENODE METADATA
Meta-data in Memory
1. The entire metadata is in main memory.
2. No demand paging of FS meta-data
Types of Metadata
1. List of files
2. List of blocks for each file
3. List of DataNodes for each block
4. File attributes, e.g. access time, replication factor
A Transaction Log
1. Records file creations, file deletions, etc.
NameNode (File Name, numReplicas, block-ids, …)
/users/sgiri/data/part-0, r:2, {1,3}, …
/users/sgiri/data/part-1, r:3, {2,4,5}, …
NameNode keeps track of the overall file directory structure and the placement of data blocks.
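
The metadata kept by the NameNode can be pictured as two in-memory lookup tables. The sketch below reuses the two example entries from the slide; the DataNode names (`dn1` to `dn5`), the block locations, and the helper functions are hypothetical and only illustrate the idea, not the NameNode's actual data structures.

```python
# File -> replication factor and block ids, mirroring the slide's two example entries.
namespace = {
    "/users/sgiri/data/part-0": {"replication": 2, "blocks": [1, 3]},
    "/users/sgiri/data/part-1": {"replication": 3, "blocks": [2, 4, 5]},
}

# Block -> DataNodes holding a replica. These host names are hypothetical.
block_locations = {
    1: ["dn1", "dn2"],
    3: ["dn2", "dn3"],
    2: ["dn1", "dn3", "dn4"],
    4: ["dn2", "dn4", "dn5"],
    5: ["dn1", "dn3", "dn5"],
}

def locate(path):
    """Return [(block_id, [datanodes])] for a file, as a reading client would need."""
    entry = namespace[path]
    return [(b, block_locations[b]) for b in entry["blocks"]]

def under_replicated(path):
    """Return block ids that have fewer replicas than the file's replication factor."""
    entry = namespace[path]
    return [b for b in entry["blocks"]
            if len(block_locations[b]) < entry["replication"]]

print(locate("/users/sgiri/data/part-0"))            # [(1, ['dn1', 'dn2']), (3, ['dn2', 'dn3'])]
print(under_replicated("/users/sgiri/data/part-1"))  # []
```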
![Page 44: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/44.jpg)
JOB TRACKER
![Page 45: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/45.jpg)
CORE COMPONENTS
Machine 1 Machine 2 Machine 3 Machine 4 Machine 5
![Page 46: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/46.jpg)
JOB TRACKER (DETAILED)
![Page 47: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/47.jpg)
JOB TRACKER (CONT.)
![Page 48: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/48.jpg)
JOB TRACKER (CONT.)
![Page 49: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/49.jpg)
MAP REDUCE
![Page 50: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/50.jpg)
FURTHER READING
http://en.wikipedia.org/wiki/Apache_Hadoop
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
![Page 51: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/51.jpg)
ASSIGNMENT / PRE-WORK
• Set up the Hadoop environment based on the virtual machines present in the LMS & cluster
• Access the cluster. Details are in the LMS.
• Finish the quiz from the LMS
• See the Assignment section on the LMS
![Page 52: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/52.jpg)
FULL COURSE
www.KnowBigData.com
• Second class onwards from 16 Aug, 7:00am
• Every Saturday-Sunday - 3 hours
• 30 hours - 3 hr x 10 classes
• ₹ 19995 ₹ 14996 (25% off) (Incl. Taxes)
• ₹ 1880/- per class
• Cluster for Hands-On Experiments
![Page 54: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem](https://reader033.vdocument.in/reader033/viewer/2022042518/554f36f6b4c905cd048b4dcd/html5/thumbnails/54.jpg)
COLLECT YOUR CERTIFICATE
www.KnowBigData.com/dell
• Permanent URL / Link
• Crawl-able
• Verifiable
• Display on LinkedIn
• Put in Social Profile