hadoop_ppt.pptx

TRANSCRIPT


SCHEDULE

MODULE 1. Introduction to Big Data and Hadoop
MODULE 2. HDFS internals, Hadoop configuration, and data loading
MODULE 3. Introduction to MapReduce
MODULE 4. Advanced MapReduce concepts
MODULE 5. Introduction to Pig
MODULE 6. Advanced Pig and introduction to Hive
MODULE 7. Advanced Hive
MODULE 8. Extending Hive and introduction to HBase
MODULE 9. Advanced HBase and Oozie
MODULE 10. Project set-up discussion

MODULE 1. Introduction to Big Data and Hadoop

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.

More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.

The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smarter business decision making.

Big data analytics is often associated with cloud computing because the analysis of large data sets in real time requires a platform like Hadoop to store large data sets across a distributed cluster and MapReduce to coordinate, combine and process data from multiple sources.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes of data. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of system failure, even if a significant number of nodes become inoperative.

Companies that need to process large and varied data sets frequently look to Apache Hadoop as a potential tool, because it offers the ability to process, store and manage huge amounts of both structured and unstructured data. The open source Hadoop framework is built on top of a distributed file system and a cluster architecture that enable it to transfer data rapidly and continue operating even if one or more compute nodes fail. But Hadoop isn't a cure-all for big data application needs as a whole. And while big-name Internet companies like Yahoo, Facebook, Twitter, eBay and Google are prominent users of the technology, Hadoop projects are new undertakings for many other types of organizations.

Big data also comes from scientific domains such as genomics and astronomy.

WHAT IS BIG DATA?

- Huge amounts of data (terabytes or petabytes)
- Big in volume
- Data that is constantly in motion

- Velocity: real-time capture and real-time analytics
- Volume: petabytes per day or per week
- Variety: unstructured data, web logs, audio, video, images, structured data

Data at this scale cannot be stored affordably on a single physical machine, so storage is distributed across multiple systems and the file system itself has to be a distributed one.

Big data in industry:
1. Financial services: 22%
2. Technology: 16%
3. Telecommunications: 14%
4. Retail: 9%
5. Government: 7%

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or tool; rather, it involves many areas of business and technology.

The data involved is of three types:
- Structured data: relational data.
- Semi-structured data: XML data.
- Unstructured data: Word, PDF, text, media logs.

Benefits of Big Data
- Using information from social media, such as the preferences and product perception of their consumers, product companies and retail organizations are planning their production.
- Using data on the previous medical history of patients, hospitals are providing better and quicker service.

Big Data Technologies
Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business. There are various technologies in the market from different vendors, including Amazon, IBM, Microsoft, and others, to handle big data.

CHALLENGES ASSOCIATED WITH BIG DATA

- Storage
- Capture
- Sharing
- Visualization
- Curation

Storage: Some vendors are using increased memory and powerful parallel processing to crunch large volumes of data extremely quickly. Another method is putting data in memory while using a grid computing approach, where many machines are used to solve a problem. Both approaches allow organizations to explore huge data volumes and gain business insights in near real time.

Capture: Even if you can capture and analyze data quickly and put it in the proper context for the audience that will be consuming the information, the value of the data for decision-making purposes is undermined if the data is not accurate or timely. This is a challenge with any data analysis, but when considering the volumes of information involved in big data projects, it becomes even more pronounced.

Sharing:

REAL-TIME PROCESSING
- Real-time data processing involves a continual input, process and output of data.
- Data processing time is typically very small (a fraction of a second).
- Examples: A complex event processing (CEP) platform, which combines data from multiple sources to detect patterns and attempts to identify either opportunities or threats. An operational intelligence (OI) platform, which uses real-time data processing and CEP to gain insight into operations by running query analysis against live feeds and event data. OI is near-real-time analytics over operational data across many data sources; the goal is to obtain near-real-time insight through continuous analytics, allowing the organization to take immediate action.

BATCH PROCESSING
- Executing a series of non-interactive jobs all at one time.
- Batch jobs can be stored up during working hours and then executed during the evening or whenever the computer is idle.
- Batch processing is an efficient and preferred way of processing high volumes of data.
- Data processing programs are run over a group of transactions collected over a business-agreed time period.
- Data is collected, entered and processed, and then the batch results are produced for every batch window. Batch processing requires separate programs for input, process and output.
- Examples: An example of batch processing is the way credit card companies process billing. The customer does not receive a bill for each separate credit card purchase but one monthly bill for all of that month's purchases. The bill is created through batch processing, where all of the data are collected and held until the bill is processed as a batch at the end of the billing cycle. Financial reporting and forecasting are other examples.

BIG DATA BATCH PROCESSING
(Diagram: operational data, social data, historic data and service data are extracted, transformed and loaded into a big data analysis / BI platform.)

HOW HADOOP CAME INTO EXISTENCE
Hadoop and Big Data have become synonymous, but they are two different things. Hadoop is a parallel programming model implemented on a bunch of low-cost clustered processors, and it is intended to support data-intensive distributed applications. That is what Hadoop is all about.

Due to the advent of new technologies, devices, and communication means like social networking sites, the amount of data produced by mankind is growing rapidly every year.

Traditionally, an enterprise would have a computer to store and process big data. For storage purposes, the programmers would take the help of their choice of database vendor, such as Oracle or IBM. In this approach, the user interacts with the application, which in turn handles the data storage and analysis. This approach works fine for applications that process less voluminous data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of scalable data, pushing everything through a single database becomes a bottleneck.

Google solved this problem using an algorithm called MapReduce. The algorithm divides the task into small parts, assigns them to many computers, and collects the results from them, which when integrated form the result dataset. Using the solution provided by Google, Doug Cutting and his team developed an open source project called HADOOP.

WHAT IS HADOOP?

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

- Hadoop supports any type of data.
- Hadoop is best at solving big data batch-processing problems.
- Hadoop = storage + compute grid.
- As mentioned above, Hadoop is a place to store data, known as a distributed file system.
- Concepts that come under Hadoop: HDFS, MapReduce, Pig, Hive, Sqoop, Flume, HBase.

Core components of Hadoop:
- HDFS
- MapReduce

HADOOP KEY CHARACTERISTICS
- Reliable
- Scalable
- Economical
- Flexible

HADOOP KEY DIFFERENTIATORS
- Robust
- Accessible
- Simple
- Scalable

Hadoop is a system for large-scale data processing. It has two main components:

HDFS
- Distributed across nodes
- Natively redundant
- The Name Node tracks block locations

MapReduce
- Splits a task across processors, near the data, and assembles the results
- Self-healing, high bandwidth, clustered storage
- The JobTracker manages the TaskTrackers

MODULE 2. Hadoop Distributed File System

The most common file system used by Hadoop is the Hadoop Distributed File System (HDFS). It is designed to run on large clusters (thousands of computers) of small commodity machines in a reliable, fault-tolerant manner.

HDFS uses a master/slave architecture where the master consists of a single Name Node that manages the file system metadata, and one or more slave Data Nodes store the actual data.

A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of Data Nodes. The Name Node determines the mapping of blocks to the Data Nodes. The Data Nodes take care of read and write operations within the file system. They also take care of block creation, deletion and replication based on instructions given by the Name Node.

HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system (a short sketch of these commands follows below). The file system is the storage-side component.

HDFS COMPONENTS

Name Node: http://localhost:50070/dfshealth.jsp
- On the storage side it acts as the master of the system; HDFS has only one Name Node.
- It maintains, manages and administers the data blocks present on the Data Nodes.
- The default data block size is 64 MB (it can be changed).
- The Name Node determines the mapping of blocks to the Data Nodes.
- Read/write operations need to be fast, so seek time should be small. Increasing the block size reduces the share of time spent seeking relative to streaming, so read/write operations on large files become faster.
- The Name Node keeps track of the overall file directory.
- If the Name Node fails, a backup of its metadata is essential, so a Secondary Name Node acts as that backup. We have to configure a Secondary Name Node.
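As a rough illustration of that shell (the directory and file names here are made-up examples), a few of the standard file system commands look like this:

    hadoop fs -mkdir /user/training/input                              # create a directory in HDFS
    hadoop fs -put /home/training/sample.txt /user/training/input/     # copy a local file into HDFS
    hadoop fs -ls /user/training/input                                  # list the directory
    hadoop fs -cat /user/training/input/sample.txt                      # print the file contents
    hadoop fs -get /user/training/input/sample.txt /tmp/sample.txt      # copy the file back to local disk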

Secondary Name Node:
- The Secondary Name Node gets the metadata from the Name Node every hour.
- If the Name Node fails in the middle of an hour, we cannot trace back the changes made since the last checkpoint.

- It takes a copy of the metadata every hour and keeps it safe.
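Both the block size mentioned above and this checkpoint interval are normally set in hdfs-site.xml. A minimal sketch, assuming Hadoop 1.x-era property names (later releases renamed them to dfs.blocksize and dfs.namenode.checkpoint.period); the values shown are examples only:

    <!-- hdfs-site.xml (sketch) -->
    <property>
      <name>dfs.block.size</name>        <!-- block size in bytes; 134217728 = 128 MB -->
      <value>134217728</value>
    </property>
    <property>
      <name>fs.checkpoint.period</name>  <!-- Secondary Name Node checkpoint interval in seconds (3600 = one hour) -->
      <value>3600</value>
    </property>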

HDFS ARCHITECTURE
(Diagram: the client performs metadata operations against the Name Node, which exchanges metadata with the Secondary Name Node and issues block operations to the Data Nodes; clients read from and write to the Data Nodes directly, and blocks are replicated between Data Nodes on Rack 1 and Rack 2.)

RACK AWARENESS
(Diagram: blocks of File 1, File 2 and File 3 are replicated across Rack 1, Rack 2 and Rack 3 so that no single rack holds all replicas of a block.)

HDFS FILE WRITE OPERATION
(Diagram: user, Distributed File System client, Name Node and Data Nodes.)
1. Open
2. Create
3. The Name Node shows the block locations
4. Write the data to the Data Nodes
5. The Data Nodes acknowledge each packet (ack packet)
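A minimal Java sketch of the write path shown above, using the public FileSystem API (the path and the text written are placeholders for the example):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);              // obtain the distributed file system client
            Path file = new Path("/user/training/demo.txt");   // example path

            // create() contacts the Name Node, which records the file and supplies block locations;
            // the bytes themselves are streamed to the Data Nodes, which acknowledge each packet.
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }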

HDFS FILE READ OPERATION
(Diagram: user, Distributed File System client, FSDataInputStream, Name Node and Data Nodes.)
1. Open
2. Get the block locations from the Name Node

3. Read the blocks from the Data Nodes
4. Read
5. Close
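A matching read sketch, again against the public FileSystem API with an example path; FSDataInputStream is the stream named in the diagram above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/training/demo.txt");   // example path

            // open() asks the Name Node for the block locations, then the returned
            // FSDataInputStream reads the blocks directly from the Data Nodes.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }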

MAP REDUCE FRAMEWORK
- MapReduce takes the processing to the data.
- It allows processing of data in parallel.
- The Reducer is responsible for processing one or more values that share a common key.

(Diagram: persistent input data → input key-value pairs → Map tasks → transient intermediate data → Reduce tasks → output key-value pairs → persistent output data.)
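To make the Mapper/Reducer split concrete, here is the classic word-count job as a sketch against the org.apache.hadoop.mapreduce API; the class names and the command-line arguments (input path, output path) are just illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: turns each input line into transient (word, 1) key-value pairs.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: receives all values that share a key (the word) and sums them.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));   // persistent output key-value pair
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }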