TRANSCRIPT
Distributed File Systems & Hadoop
Kevin Queenan
What is a Distributed File System (DFS)?
Simply...
A distributed file system (DFS) is a file system whose data is stored on one or more remote servers, yet accessed and processed as if it were stored on the local client machine.
What is Hadoop?
Apache Hadoop is...
A framework, or ecosystem, of open-source software tools that allows extremely large data sets to be stored and processed across numerous clusters of commodity-grade hardware.
Why does Hadoop exist?
Consider current industry trends...
Data at a massive scale -> TB and PB
Facebook ingested 20 TB of data per day in 2011
NYSE generated 1TB of data per day in 2010
This data is also heterogeneous:
Images, social network activity, log files, IOT sensors, etc
TB and PB
80% unstructured, 20% structured
Heterogeneous data consisting of log files, audio, video, images, etc.
Good, bad, undefined, incomplete?
Time sensitive, real-time, etc
Challenge: Read 1TB of data
1 machine
4 I/O channels
Each channel operates @ 100 MB/s
Time taken?
45 minutes
10 machines
4 I/O channels
Each channel operates @ 100 MB/s
Time taken?
4.5 minutes
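The figures above follow from simple bandwidth arithmetic; a quick sketch (assuming 1 TB = 1,000,000 MB and all channels reading in parallel at full speed):

```python
# Back-of-the-envelope check of the read-time challenge above.
# Assumption: 1 TB = 1,000,000 MB; all I/O channels read in parallel.

def read_time_minutes(data_mb, machines, channels_per_machine, mb_per_sec):
    """Minutes to scan the data with every channel reading in parallel."""
    total_bandwidth = machines * channels_per_machine * mb_per_sec  # MB/s
    return data_mb / total_bandwidth / 60

one_machine = read_time_minutes(1_000_000, 1, 4, 100)
ten_machines = read_time_minutes(1_000_000, 10, 4, 100)
print(round(one_machine, 1))   # ~41.7 minutes (the slide rounds up to 45)
print(round(ten_machines, 1))  # ~4.2 minutes (the slide rounds up to 4.5)
```

The point stands regardless of rounding: aggregate bandwidth, and therefore scan time, scales linearly with the number of machines.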
Where was Hadoop developed?
Hadoop Origins
Three Google white papers, each with a Hadoop counterpart:
1. GFS -> HDFS
2. MapReduce -> MapReduce
3. BigTable -> HBase
Hadoop is the faithful, open-source implementation of Google’s MapReduce, GFS, and BigTable
Hadoop’s primary architect is Doug Cutting who is also credited with creating Apache Lucene
The project grew out of Nutch, an open-source web-crawler project Cutting had started; Hadoop was split out as its own project after Cutting joined Yahoo!
Cutting's son named a yellow stuffed elephant "Hadoop," and Doug adopted the name for the project
Hadoop’s Design Axioms
1. Store and process massive amounts of data (on the order of PB)
2. Performance must scale linearly
3. Failure is expected
4. Easily manageable
5. Self-healing file system
6. Run on commodity, off-the-shelf hardware
A fundamental tenet of relational databases is the schema -> data is inherently structured
What about the massive amount of unstructured data we need to house and process?
Scaling commercial relational databases is incredibly expensive and limited
Hadoop costs approximately $250/TB
A commercial RDBMS costs approximately $100,000 - $200,000/TB
Hadoop vs RDBMS
Hadoop Architecture
Master/Slave Model
Master
NameNode (HDFS)
JobTracker (MapReduce)
Slave
DataNode (HDFS)
TaskTracker (MapReduce)
NameNode
File metadata: /user/kevin/data1.txt -> blocks 1, 2, 3
Replication factor r = 3, set in hdfs-site.xml
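The replication factor is controlled by the `dfs.replication` property; a minimal hdfs-site.xml fragment:

```xml
<!-- hdfs-site.xml: replicate each HDFS block to 3 DataNodes -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```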
DataNode 1 holds blocks: 2, 3
DataNode 2 holds blocks: 1, 3
DataNode 3 holds blocks: 1, 2, 3
DataNode 4 holds blocks: 1, 2
(with r = 3, each of the three blocks is stored on three different DataNodes)
Underlying Filesystem
Each physical drive in each slave DataNode machine is formatted with a standard local filesystem such as ext3 or ext4
HDFS is an abstract filesystem layered on top: files are split into fixed-size blocks that clients write directly to slave DataNodes, while the master NameNode tracks only the metadata recording which DataNodes hold which blocks
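The fixed-block abstraction can be illustrated with a toy sketch (this is not the Hadoop API; it assumes the 128 MB block size that is the default in modern HDFS, though earlier versions used 64 MB):

```python
# Toy illustration of HDFS's fixed-block abstraction (not the Hadoop API).
# HDFS splits a file into fixed-size blocks; the NameNode records only the
# block list (metadata) while DataNodes store the block contents themselves.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default in modern HDFS

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, length) pairs for a file of the given size."""
    blocks = []
    offset, block_id = 0, 1
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((block_id, length))
        offset += length
        block_id += 1
    return blocks

# A 300 MB file maps to three blocks: two full 128 MB blocks and a 44 MB tail.
print(split_into_blocks(300 * 1024 * 1024))
```

Only the last block of a file may be smaller than the block size, which keeps the NameNode's metadata per file small and fixed regardless of file contents.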
MapReduce
Data Processing Paradigm
MapReduce is a framework for high-performance distributed data processing using the divide-and-aggregate programming paradigm: a map phase divides the work across the cluster, and a reduce phase aggregates the results
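The paradigm can be sketched in a few lines with the classic word-count example. Real Hadoop jobs implement Mapper and Reducer classes in Java; this in-memory sketch only illustrates the map, shuffle, and reduce phases:

```python
# Minimal in-memory sketch of the MapReduce paradigm (word count).
# Not the Hadoop API: real jobs implement Mapper/Reducer classes in Java.
from collections import defaultdict

def map_phase(lines):
    """Map (divide): emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce (aggregate): sum the values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores big data", "hadoop processes big data"]
print(reduce_phase(shuffle(map_phase(lines))))
# → {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

In a real cluster, map tasks run on the DataNodes that already hold the input blocks (moving computation to the data), and the shuffle redistributes intermediate pairs across the network to the reduce tasks.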
Thanks for your time!