
What is Hadoop?

Hadoop is an open source framework well suited to building software applications for reliable, distributed, scalable computing. Hadoop's distributed computing model supports the growth of voluminous data, and it is well suited to storing, processing, and analyzing big data files across clusters. The techniques of the big data ecosystem sit at the heart of Hadoop; they support advanced big data analytics such as predictive analytics, machine learning, and data mining.

List of Hadoop Frameworks

The major components of the Hadoop framework are HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), and MapReduce.

HDFS is used by top companies such as IBM and EMC to provide high-throughput access to application data. Its goal is to distribute large datasets uniformly across clusters of low-cost computer systems.

Similarly, MapReduce jobs let users distribute voluminous data across a cluster of computers for parallel processing of large datasets.

Finally, Hadoop YARN is the framework that manages job scheduling and cluster resources.
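As a rough illustration of the idea behind HDFS (not its actual implementation), the short Python sketch below splits a file into fixed-size blocks and assigns each block to several DataNodes. The function names and the round-robin placement are invented for this example; real HDFS uses large blocks (128 MB by default) and rack-aware replica placement.

from itertools import cycle

def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks, the way HDFS divides files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_replicas(blocks, datanodes, replication: int = 3):
    """Place each block on `replication` DataNodes, round-robin (real HDFS is rack-aware)."""
    ring = cycle(datanodes)
    return {i: [next(ring) for _ in range(replication)] for i in range(len(blocks))}

# Toy demo: a 26-byte "file", 8-byte blocks, four DataNodes.
blocks = split_into_blocks(b"abcdefghijklmnopqrstuvwxyz", block_size=8)
print(assign_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"]))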

How is Hadoop connected with the Cloud?

Cloud computing is a model in which applications are installed on centralized servers and can be accessed from anywhere, at any time, over the network. Because the cloud is a powerful way to deliver the computation that big data applications need, combining the Hadoop framework with the cloud greatly eases the parallel processing of large datasets.

The power to compute large data sets is considerable, and it can be scaled to your requirements. When Hadoop runs in the cloud, it can provide users with distributed computing, data mining, data analytics, and cloud infrastructure.

Top six reasons why the combination of cloud and Hadoop makes sense

The combination of cloud and Hadoop has become a trendy topic. Below, we discuss the top six reasons why pairing these two popular technologies makes sense and why its value keeps increasing in the IT marketplace.

1. Innovation costs are reduced
2. Resources are procured faster
3. Batch workloads are managed efficiently
4. Variable resources are managed perfectly
5. Runs closer to the data
6. Hadoop operations are simplified

1). Innovation Costs are Reduced

When Hadoop runs in the cloud, extra infrastructure investments can be reduced immediately. Whenever big data computation is needed for a project, the cloud makes sense in this context. The approach was introduced by leading technology experts once companies realized the benefits of combining the two.

2). Resources are procured faster

Quick resource procurement is a major need for organizations. Hadoop needs large storage drives to store and compute large data sets. Small companies cannot procure all of these resources quickly, so the best solution is to run Hadoop in the cloud, where heavy or expensive resources can be procured automatically whenever needed. You can also release the resources again once the objective is met.

As organizations' demand for data analytics grew, there was an urgent need to expand Hadoop cluster nodes as well. Here, the cloud platform itself handles linear scaling, especially when it comes to innovation and growth. With cloud computing, adding hardware to a Hadoop deployment becomes much easier than it was before. Resource procurement used to take companies weeks or months, but the cloud solved this problem dramatically: resources can be hired within minutes.

3). Batch Workloads are managed efficiently

The Hadoop framework's main objectives include job scheduling and processing data on a fixed schedule. Companies usually collect data from different sources, and that data must be analyzed carefully to derive meaningful insights. For this purpose, workloads are divided into batches, and those batches are managed more efficiently when computed in the cloud.

The cloud helps analyze usage patterns effectively, and clusters can be resized appropriately at the right time when needed. On top of data analysis, companies can schedule cloud-based clusters for the specific time period in which data should be crunched.

4). Variable Resources managed perfectly

Pairing the cloud with Hadoop is a popular choice these days. Not all MapReduce jobs in Hadoop are created equal: some demand more computing resources and investment than others, so it is necessary to manage that diversity in Hadoop jobs.

One effective solution is running Hadoop in the cloud, which supports proper scheduling techniques and makes the necessary computing resources available. Intuitively, the cloud is a more adaptable way to handle variable resource requirements than other IT approaches.

5). Runs Closer to the Data

Whenever businesses plan to move their data to the cloud, they have to follow a standard process for a successful migration. At the same time, proper data analytics techniques are needed so that large data sets can be managed efficiently and the overall migration time can be reduced.

Here, running Hadoop clusters in the cloud environment is an excellent way to solve the problem: it combines the benefits of the cloud with Hadoop's data-locality principle at the macro level.

6). Hadoop Operations are Simplified

Once an organization consolidates its clusters, one requirement remains the same: isolating resources among multiple sets of users. Users need to run their MapReduce jobs together in a shared cluster, and it is the administrators' job to handle multi-tenancy issues such as jobs interfering with one another and managing security constraints.

The most typical solution to the problem is to enforce data policies at the cluster level that prevent users from performing any harmful activities against other users' jobs. With this approach, secured use cases also remain safe. It is common for administrators to carry out this work to protect data from harm and unwanted access. Companies also need to spend heavily to manage resources in a clustered manner.

With the help of the cloud, users can configure clusters with varied characteristics and features. Each cluster is suited to a particular set of jobs, and complicated cluster policies can be managed without any multi-tenancy issues. In brief, the right configuration is available for each kind of job.

It is clear from this discussion that pairing the cloud with Hadoop is a popular choice these days. Combining the two will result in strong solutions for big data analytics while avoiding many potential problems. To learn more about the Hadoop framework, join the big data certification program at JanBask Training.

MapReduce

MapReduce is a core component of the Apache Hadoop software framework.

Hadoop enables resilient, distributed processing of massive unstructured data sets across commodity computer clusters, in which each node of the cluster includes its own storage. MapReduce serves two essential functions: it filters and parcels out work to the various nodes within the cluster (the map step, handled by a function often called the mapper), and it organizes and reduces the results from each node into a cohesive answer to a query (the reduce step, handled by the reducer).
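To make the mapper/reducer split concrete, here is a minimal, purely local sketch in Python (not the Hadoop API itself). The mapper, reducer and run_mapreduce names are invented for this illustration; the code only mimics how the framework maps records to key-value pairs, shuffles them by key, and reduces each group, here finding the highest temperature reported per city.

from collections import defaultdict

# Each input record is "city,temperature"; the mapper emits (key, value) pairs
# and the reducer collapses all values for one key into a single answer.
def mapper(record):
    city, temp = record.split(",")
    yield city, int(temp)

def reducer(city, temps):
    return city, max(temps)

def run_mapreduce(records):
    groups = defaultdict(list)          # the "shuffle": group values by key
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(run_mapreduce(["oslo,12", "cairo,35", "oslo,17", "cairo,31"]))
# {'oslo': 17, 'cairo': 35}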

How MapReduce works

The original version of MapReduce involved several component daemons, including:

JobTracker -- the master node that manages all the jobs and resources in a cluster;

TaskTrackers -- agents deployed to each machine in the cluster to run the map and reduce tasks; and

JobHistory Server -- a component that tracks completed jobs and is typically deployed as a separate function or with JobTracker.

With the introduction of MapReduce 2 and Hadoop version 2, the earlier JobTracker and TaskTracker daemons were replaced by components of Yet Another Resource Negotiator (YARN), called ResourceManager and NodeManager.

ResourceManager runs on a master node and handles the submission and scheduling of jobs on the cluster. It also monitors jobs and allocates resources.

NodeManager runs on slave nodes and works with ResourceManager to run tasks and track resource usage. NodeManager can employ other daemons to assist with task execution on the slave node.

To distribute input data and collate results, MapReduce operates in parallel across massive cluster sizes. Because cluster size doesn't affect a processing job's final results, jobs can be split across almost any number of servers. Therefore, MapReduce and the overall Hadoop framework simplify software development.

MapReduce is available in several languages, including C, C++, Java, Ruby, Perl and Python. Programmers can use MapReduce libraries to create tasks without dealing with communication or coordination between nodes.

MapReduce is also fault-tolerant, with each node periodically reporting its status to a master node. If a node doesn't respond as expected, the master node reassigns that piece of the job to other available nodes in the cluster. This creates resiliency and makes it practical for MapReduce to run on inexpensive commodity servers.

MapReduce examples and uses

The power of MapReduce is in its ability to tackle huge data sets by distributing processing across many nodes, and then combining or reducing the results of those nodes.

As a basic example, users could list and count the number of times every word appears in a novel as a single server application, but that is time-consuming. By contrast, users can split the task among 26 people, so each takes a page, writes a word on a separate sheet of paper and takes a new page when they're finished. This is the map aspect of MapReduce. And if a person leaves, another person takes his or her place. This exemplifies MapReduce's fault-tolerant element.

When all the pages are processed, users sort their single-word pages into 26 boxes, each representing the first letter of a word. Each user takes a box and sorts each word in the stack alphabetically. Counting the number of pages with the same word is an example of the reduce aspect of MapReduce.
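The same word-count idea is often expressed with Hadoop Streaming, which lets any executable that reads standard input and writes standard output act as the mapper or reducer. Below is a sketch of two hypothetical scripts, mapper.py and reducer.py; it relies on the fact that Hadoop Streaming sorts the mapper output by key before the reducer sees it.

#!/usr/bin/env python3
# mapper.py -- hypothetical Hadoop Streaming mapper: reads lines from stdin
# and emits "word<TAB>1" for each word. (Save as its own file in practice.)
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word.lower()}\t1")

#!/usr/bin/env python3
# reducer.py -- hypothetical Hadoop Streaming reducer: because the framework
# sorts mapper output by key, identical words arrive on consecutive lines.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

On a cluster, these two files would typically be passed to the Hadoop Streaming jar as the -mapper and -reducer programs; the exact invocation depends on the Hadoop installation.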

There is a broad range of real-world uses for MapReduce involving complex and seemingly unrelated data sets. For example, a social networking site could use MapReduce to determine users' potential friends, colleagues and other contacts based on site activity, names, locations, employers and many other data elements. A booking website could use MapReduce to examine the search criteria and historical behaviors of users, and can create customized offerings for each. An industrial facility could collect equipment data from different sensors across the installation and use MapReduce to tailor maintenance schedules or predict equipment failures to improve overall uptime and cost-savings.

MapReduce services and alternatives

One challenge with MapReduce is the infrastructure it requires to run. Many businesses that could benefit from big data tasks can't sustain the capital and overhead needed for such an infrastructure. As a result, some organizations rely on public cloud services for Hadoop and MapReduce, which offer enormous scalability with minimal capital costs or maintenance overhead.

For example, Amazon Web Services (AWS) provides Hadoop as a service through its Amazon Elastic MapReduce (EMR) offering. Microsoft Azure offers its HDInsight service, which enables users to provision Hadoop, Apache Spark and other clusters for data processing tasks. Google Cloud Platform provides its Cloud Dataproc service to run Spark and Hadoop clusters.
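As an example of how such a managed service can be provisioned programmatically, the following sketch uses the boto3 AWS SDK to request a small EMR cluster. It assumes valid AWS credentials and the default EMR IAM roles; the cluster name, release label and instance types are placeholders.

import boto3  # assumes the boto3 AWS SDK is installed and credentials are configured

emr = boto3.client("emr", region_name="us-east-1")

# Provision a small, transient Hadoop cluster; the values here are illustrative only.
response = emr.run_job_flow(
    Name="demo-hadoop-cluster",
    ReleaseLabel="emr-6.10.0",                  # an example EMR release; pick a current one
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when the work is done
    },
    JobFlowRole="EMR_EC2_DefaultRole",          # assumes the default EMR roles exist
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])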

For organizations that prefer to build and maintain private, on-premises big data infrastructures, Hadoop and MapReduce represent only one option. Organizations can opt to deploy other platforms, such as Apache Spark, High-Performance Computing Cluster and Hydra. The big data framework an enterprise chooses will depend on the types of processing tasks required, supported programming languages, and performance and infrastructure demands.

VirtualBox

VirtualBox (VB) is a software virtualization package that installs on an operating system as an application. VirtualBox allows additional operating systems to be installed on it, as guest OSes, and run in a virtual environment. In 2010, VirtualBox was the most popular virtualization software application. Supported operating systems include Windows XP, Windows Vista, Windows 7, Mac OS X, Linux, Solaris, and OpenSolaris. VirtualBox was originally developed by Innotek GmbH and released in 2007 as an open-source software package. The company was later purchased by Sun Microsystems. Oracle Corporation now develops the software package and titles it Oracle VM VirtualBox.

Google App Engine

Google App Engine is a Platform as a Service (PaaS) product that provides Web app developers and enterprises with access to Google's scalable hosting and tier 1 Internet service. 

The App Engine requires that apps be written in Java or Python, store data in Google BigTable and use the Google query language. Non-compliant applications require modification to use App Engine.

Google App Engine provides more infrastructure than other scalable hosting services, such as Amazon Elastic Compute Cloud (EC2). App Engine also eliminates some system administration and development tasks to make it easier to write scalable applications.

Google App Engine is free up to a certain amount of resource usage. Users exceeding the per-day or per-minute usage rates for CPU resources, storage, number of API calls or requests and concurrent requests can pay for more of these resources.

Programming Support for Google App Engine

Google App Engine (GAE) is a Platform as a Service (PaaS) cloud-based Web hosting service on Google's infrastructure. For an application to run on GAE, it must comply with Google's platform standards, which narrows the range of applications that can be run and severely limits those applications' portability.

GAE supports the following major features:

1. Dynamic Web services based on common standards
2. Automatic scaling and load balancing
3. Authentication using Google's Accounts API
4. Persistent storage, with query access, sorting, and transaction management features
5. Task queues and task scheduling
6. A client-side development environment for simulating GAE on your local system
7. One of two runtime environments: Java or Python (a minimal Python sketch follows this list)
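For illustration, here is a minimal sketch of a handler for the legacy Python 2.7 App Engine standard environment, which bundled the webapp2 framework; the route and class names are placeholders, and newer App Engine runtimes use ordinary WSGI frameworks instead.

# main.py -- minimal handler for the legacy Python 2.7 App Engine standard
# environment, which bundled the webapp2 framework. Illustrative only.
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Respond to GET / with a plain-text greeting.
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello from Google App Engine!')

# The module-level WSGI application that App Engine serves.
app = webapp2.WSGIApplication([('/', MainPage)], debug=True)

A companion app.yaml file declares the runtime and points incoming requests at the app object; its exact contents depend on the chosen runtime.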

Google File System:

Abbreviated as GFS, the Google File System is a proprietary distributed file system developed by Google to give large clusters of commodity machines reliable, efficient access to data.

Files in GFS are divided into large, fixed-size chunks that are replicated across many chunkservers, while a single master server holds the file system metadata and coordinates access.

Clients read and write chunks directly from the chunkservers, and the master coordinates their I/O so that the file system stays consistent across the cluster.

GFS provides fault tolerance, reliability, scalability, availability and performance to large networks and connected nodes. GFS is made up of several storage systems built from low-cost commodity hardware components.

It is optimized to accommodate Google's different data use and storage needs, such as its search engine, which generates huge amounts of data that must be stored.

Bigtable and the Google NoSQL System:

Google Cloud Bigtable is a productized version of the NoSQL database that stores Google's bits and bytes.

The big selling point is that it doesn't require the maintenance traditionally needed for compatible on-premises NoSQL solutions.

Bigtable is a compressed, high performance, and proprietary data storage system built on Google File System, Chubby Lock Service and a few other Google technologies.

Bigtable maps two arbitrary string values (row key and column key) and timestamp (hence three-dimensional mapping) into an associated arbitrary byte array.

It is not a relational database and can be better defined as a sparse, distributed multi-dimensional sorted map.

Bigtable is designed to scale into the petabyte range across "hundreds or thousands of machines, and to make it easy to add more machines [to] the system and automatically start taking advantage of those resources without any reconfiguration".
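To show how Cloud Bigtable's row-key/column-family/qualifier model looks in practice, here is a sketch using the google-cloud-bigtable Python client. It assumes an existing instance and table with a column family cf1; the project, instance, table and key names are all placeholders.

# Write and read one row with the google-cloud-bigtable client library.
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("my-instance").table("my-table")

# Write: row key + column family + column qualifier address a single cell.
row = table.direct_row(b"user#42")
row.set_cell("cf1", b"name", b"Ada")
row.commit()

# Read the row back and print the newest cell in cf1:name.
result = table.read_row(b"user#42")
print(result.cells["cf1"][b"name"][0].value)   # b'Ada'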

Google’s Distributed Lock Service (Chubby):

Chubby is a distributed lock service intended for coarse-grained synchronization of activities within Google's distributed systems.

Chubby has become Google's primary internal name service; it is a common rendezvous mechanism for systems such as MapReduce; the storage systems GFS and Bigtable use Chubby to elect a primary from redundant replicas; and it is a standard repository for files that require high availability, such as access control lists.

Chubby is a relatively heavy-weight system intended for coarse-grained locks, locks held for "hours or days", not "seconds or less."

OpenStack

OpenStack is a collection of open source software modules that provides a framework to create and manage both public cloud and private cloud infrastructure.

What OpenStack does

To create a cloud computing environment, an organization typically builds off of its existing virtualized infrastructure, using a well-established hypervisor such as VMware vSphere, Microsoft Hyper-V or KVM. But cloud computing goes beyond just virtualization. A public or private cloud also provides a high level of provisioning and lifecycle automation, user self-service, cost reporting and billing, orchestration and other features.

When an organization installs OpenStack software on top of its virtualized environment, this forms a "cloud operating system" that can organize, provision and manage large pools of heterogeneous compute, storage and network resources. While an IT administrator is typically called on to provision and manage resources in a more traditional virtualized environment, OpenStack enables individual users to provision resources through management dashboards and the OpenStack application programming interface (API).
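As an illustration of that self-service model, the sketch below uses the openstacksdk Python library to boot a VM through the compute API. It assumes a clouds.yaml entry named mycloud and that the image, flavor and network names below exist; all of them are placeholders.

# Self-service provisioning through the OpenStack API with openstacksdk.
import openstack

conn = openstack.connect(cloud="mycloud")

image = conn.compute.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("m1.small")
network = conn.network.find_network("private")

server = conn.compute.create_server(
    name="demo-vm",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)   # block until the VM is ACTIVE
print(server.name, server.status)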

An organization can use OpenStack to deploy and manage cloud-based infrastructure that supports an array of use cases, including web hosting, big data projects, software as a service (SaaS) delivery, and deploying high volumes of containers.

OpenStack competes most directly with other open source cloud platforms, including Eucalyptus and Apache CloudStack. Some also see it as an alternative to public cloud platforms like Amazon Web Services or Microsoft Azure.

OpenStack components

The OpenStack cloud platform is not a single thing, but an amalgam of software modules that serve different purposes. OpenStack components are shaped by open source contributions from the developer community, and adopters can implement some or all of these components. Key OpenStack components, by category, include:

Compute

Glance -- a service that discovers, registers and retrieves virtual machine (VM) images;

Ironic -- a bare-metal provisioning service;

Magnum -- a container orchestration and provisioning engine;

Nova -- a service that provides scalable, on-demand and self-service access to compute resources, such as VMs and containers;

Storlets -- a computable object storage service;

Zun -- a service that provides an API to launch and manage containers.

Storage

Cinder -- a block storage service;

Swift -- an object storage service;

Freezer -- a backup, restore and disaster recovery service;

Karbor -- an application and data protection service;

Manila -- a shared file system.

Networking and content delivery

Designate -- a DNS service for the network;

Neutron -- a software-defined networking (SDN) service for virtual compute environments;

Dragonflow -- a distributed control plane implementation of Neutron;

Kuryr -- a service that connects containers and storage;

Octavia -- a load balancer;

Tacker -- an orchestration service for network functions virtualization (NFV);

Tricircle -- a network automation service for multi-region cloud deployments.

Data and analytics

Sahara -- a provisioning service for big data projects;

Searchlight -- a data indexing and search service;

Trove -- a database as a service (DBaaS).

Security and compliance

Barbican -- a management service for passwords, encryption keys and X.509 Certificates;

Congress -- an IT governance service;

Keystone -- an authentication and multi-tenant authorization service;

Mistral -- a workflow management and enforcement service.

Deployment

Ansible OpenStack -- a service that provides Ansible playbooks for OpenStack;

Chef OpenStack -- a service that provides Chef cookbooks for OpenStack;

Kolla -- a service for container deployment;

Charms -- a service that offers Juju charms for OpenStack;

Puppet OpenStack -- a service that provides Puppet modules for OpenStack;

TripleO -- a service to deploy OpenStack in production.

Management

Horizon -- a management dashboard and web-based user interface for OpenStack services;

OpenStack Client -- the OpenStack command-line interface (CLI);

Rally -- an OpenStack benchmark service;

Senlin -- a clustering service;

Vitrage -- a root cause analysis (RCA) service for troubleshooting;

Watcher -- a performance optimization service.

Applications

Heat -- orchestration and autoscaling services;

Murano -- an application catalog;

Solum -- a software development tool;

Zaqar -- a messaging service.

Monitoring

Aodh -- an alarming service that takes actions based on rules;

Ceilometer -- a metering and data collection service;

CloudKitty -- a billing and chargeback service;

Monasca -- a high-speed metrics monitoring and alerting service;

Panko -- a service for metadata indexing and event storage to aid auditing and troubleshooting.

OpenStack pros and cons

OpenStack is available freely as open source software released under the Apache 2.0 license. This means there is no upfront cost to acquire and use OpenStack. Considering all of its modular components, OpenStack provides a comprehensive and production-ready platform upon which an enterprise can build and operate a private or public cloud. Because of its open source nature, some organizations also see OpenStack as a way to avoid vendor lock-in.

But potential enterprise adopters must also consider some drawbacks. Perhaps the biggest disadvantage of OpenStack is its very size and scope -- such complexity requires an IT staff to have significant knowledge to deploy the platform and make it work. In some cases, an organization might require additional staff or a consulting firm to deploy OpenStack, which adds time and cost.

As open source software, OpenStack is not owned or directed by any one vendor or team. This can make it difficult to obtain support for the technology -- other than support from the open source community.

To reduce the complexity of an OpenStack deployment, and get more direct access to technical support, an organization can choose to adopt an OpenStack distribution from a vendor. An OpenStack distribution is a version of the open source platform that is packaged with other components, such as an installation program and management tools, and often comes with technical support options. Common OpenStack distributions include the Red Hat OpenStack Platform, the Mirantis Cloud Platform and the Rackspace OpenStack private cloud.

OpenStack releases

OpenStack has followed an alphabetical naming scheme for its version releases since its initial Austin release in October 2010. The original Austin, Bexar and Cactus releases have since been deprecated and are no longer available. More recent releases, between 2012 and 2016, include Diablo, Essex, Folsom, Grizzly, Havana, Icehouse, Juno, Kilo, Liberty, Mitaka and Newton, which are all at end-of-life (EOL).

These were followed by the Ocata release in February 2017, and the Pike release in August 2017. Pike added a variety of new features, including support for Python 3.5, a revert-to-snapshot feature in Cinder and support for globally distributed erasure codes in Swift.

OpenStack Foundation

The National Aeronautics and Space Administration (NASA) worked with Rackspace, a managed hosting and cloud computing service provider, to develop OpenStack.

OpenStack officially became an independent non-profit organization in September 2012. The OpenStack Foundation, which is overseen by a board of directors, comprises many direct and indirect competitors, including IBM, Intel and VMware.