

Refresh Your Data Lake to Cisco Data Intelligence Platform

Solution overview | Cisco public

Considerations in the journey of a Hadoop refresh

Despite the capability gap between Hadoop 2.x and 3.x, it is estimated that more than 80 percent of the Hadoop installed base is still on either HDP2 or CDH5, which are built on Apache Hadoop 2.0 and are approaching end of support at the end of 2020.

Given these feature enhancements, the new specialized computing resources, and the looming end of support, a Hadoop upgrade is a value-added refresh. It is therefore imperative to take a more holistic approach while refreshing your data lake, such as conjoining various frameworks and open-source technologies with the Hadoop ecosystem.


The evolving Hadoop landscape

In early 2019, the leading Hadoop distribution providers, Hortonworks and Cloudera, merged. The merger raised the bar on innovation in the big data space, and the new “Cloudera” launched Cloudera Data Platform (CDP), which combined the best of Hortonworks’ and Cloudera’s technologies to deliver the industry’s first enterprise data cloud. Cloudera has since released CDP Private Cloud Base, the on-premises version of CDP. This unified distribution brings several new features, optimizations, and integrated analytics.

CDP Private Cloud Base is built on the Hadoop 3.x distribution. Hadoop has gained many capabilities since its inception, but Hadoop 3.0 is an eagerly awaited major release with several new features and optimizations. Upgrading from Hadoop 2.x to 3.0 is a paradigm shift: it enables diverse computing resources (CPU, GPU, and FPGA) to work on data and leverage AI/ML methodologies, and it supports flexible, elastic containerized workloads managed by either YARN or Kubernetes, distributed deep learning, GPU-enabled Spark workloads, and more. Hadoop 3.0 also offers better reliability and availability of metadata through multiple standby NameNodes, disk balancing for evenly utilized DataNodes, enhanced workload scheduling with YARN 3.0, and improved overall operational efficiency.
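As a concrete sketch of the diverse-resource scheduling described above (not taken from this document), YARN 3.x models GPUs as a countable resource that containers request alongside CPU and memory. The property names follow the Apache Hadoop YARN resource-model documentation; the values and the helper function are illustrative assumptions:

```python
# Sketch: YARN 3.x settings that advertise GPUs as a schedulable resource.
# Property names follow the Apache Hadoop YARN documentation; the specific
# values below are illustrative assumptions, not sizing recommendations.
yarn_site = {
    # Declare "yarn.io/gpu" as an additional cluster resource type.
    "yarn.resource-types": "yarn.io/gpu",
    # Let NodeManagers discover and manage the GPUs attached to each node.
    "yarn.nodemanager.resource-plugins": "yarn.io/gpu",
}

def container_request(vcores, memory_mb, gpus=0):
    """Build a resource request for one YARN container (hypothetical helper)."""
    req = {"vcores": vcores, "memory-mb": memory_mb}
    if gpus:
        # GPUs are requested like any other countable resource in YARN 3.x.
        req["yarn.io/gpu"] = gpus
    return req

req = container_request(vcores=4, memory_mb=16384, gpus=2)
```

A deep learning container would then receive isolated access to the two GPUs it requested, while CPU-only containers on the same node are unaffected.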

Going forward, the Ozone initiative lays the foundation of the next generation of storage architecture for HDFS, in which data blocks are organized in storage containers for higher scale and better handling of small objects. The Ozone project also includes an object store implementation to support several new use cases.

© 2020 Cisco and/or its affiliates. All rights reserved.


As the Hadoop journey continues, ever more impressive software frameworks and technologies are being introduced for crunching big data, and they will continue to evolve and integrate in a modular fashion. Furthermore, specialized hardware such as GPUs and FPGAs is becoming the de facto standard for the deep learning needed to process gigantic datasets expeditiously. Figure 1 shows how AI/ML frameworks and containerization are augmenting the Hadoop ecosystem.

Figure 1. Hadoop 3.0 refresh with AI included

[The figure shows how Hadoop meets AI with Hadoop 3.0: Apache Hadoop 3.1 brings AI support to the data lake, with the YARN scheduler managing CPU, Nvidia GPU, and memory, and Cisco UCS enabling these next-generation workloads. Apache Submarine launches and schedules distributed jobs (workers, parameter servers, and TensorBoard) on YARN, alongside Apache Spark 2.3, Apache Spark 3.0 (tech preview), and Apache Ozone (tech preview) with S3-protocol and Hadoop FS access.]


Containerization

Hadoop 3.0 introduces production-ready Docker container support on YARN with GPU isolation and scheduling. This opens up a plethora of opportunities for modern applications, such as microservices and distributed application frameworks comprising thousands of containers, to execute AI/ML algorithms on petabytes of data quickly and easily.
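To illustrate, the Docker runtime on YARN is selected per application through environment variables. The variable names below follow the Apache Hadoop documentation for launching applications in Docker containers; the image name and the helper function are placeholder assumptions:

```python
# Sketch: environment variables that ask YARN's container executor to run
# an application inside a Docker image. Variable names follow the Apache
# Hadoop docs on Docker containers; the image is a placeholder assumption.
def docker_runtime_env(image):
    """Hypothetical helper returning the per-application runtime settings."""
    return {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }

env = docker_runtime_env("library/python:3.8")
# Flattened form suitable for passing to a YARN application's launch options.
shell_env = ",".join(f"{k}={v}" for k, v in env.items())
```

With these variables set on an application's containers, YARN launches the workload inside the named image instead of a bare process, which is what enables the GPU isolation and packaging benefits described above.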

Distributed deep learning with Apache Submarine

The Hadoop community initiated the Apache Submarine project to make distributed deep learning/machine learning applications easy to launch, manage, and monitor. These improvements make running distributed deep learning/machine learning frameworks (TensorFlow, PyTorch, MXNet, etc.) on Apache Hadoop YARN as simple as running them locally, letting data scientists focus on algorithms instead of the underlying infrastructure.
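As a hedged sketch of what such a launch looks like, the flags below mirror early Apache Submarine examples (`job run` with worker and parameter-server resources); the jar name, job name, image, and training command are placeholder assumptions:

```python
# Sketch: assembling a Submarine TensorFlow job submission for YARN.
# Flag names mirror early Apache Submarine examples; the jar path, image,
# and launch command here are placeholder assumptions for illustration.
def submarine_job(name, image, workers, ps, worker_res, ps_res, launch_cmd):
    """Build the argument list for a hypothetical Submarine job submission."""
    return [
        "yarn", "jar", "hadoop-yarn-submarine.jar", "job", "run",
        "--name", name,
        "--docker_image", image,
        "--num_workers", str(workers),
        "--worker_resources", worker_res,   # e.g. memory=8G,vcores=2,gpu=1
        "--num_ps", str(ps),
        "--ps_resources", ps_res,
        "--worker_launch_cmd", launch_cmd,
    ]

cmd = submarine_job("tf-mnist", "tf-1.13-gpu:latest", workers=2, ps=1,
                    worker_res="memory=8G,vcores=2,gpu=1",
                    ps_res="memory=4G,vcores=2",
                    launch_cmd="python train.py")
```

The point of the sketch is the shape of the request: workers, parameter servers, and their resources (including GPUs) are declared up front, and YARN handles placement and lifecycle.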

Apache Spark 3.0

Apache Spark 3.0 is something every data scientist and data engineer has been awaiting with anticipation. Spark is no longer limited to CPUs for its workloads; it now offers GPU isolation and acceleration, and YARN launches Spark 3.0 applications with GPUs to keep the deep learning environment easy to manage. This paves the way for other Spark workloads, such as machine learning and ETL, to be GPU-accelerated as well. Learn more by reading the Cisco blog on Apache Spark 3.0.
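A minimal sketch of the GPU request, assuming a YARN-managed Spark 3.0 cluster: the property names follow the Spark 3.0 resource-scheduling documentation, while the amounts and the discovery-script path are illustrative assumptions:

```python
# Sketch: Spark 3.0 properties that request GPUs for executors under YARN.
# Property names follow the Spark 3.0 resource-scheduling docs; the amounts
# and the discovery-script path are illustrative assumptions.
spark_conf = {
    "spark.master": "yarn",
    # One GPU per executor; Spark asks the cluster manager for it.
    "spark.executor.resource.gpu.amount": "1",
    # One GPU per task, so tasks are isolated onto their own device.
    "spark.task.resource.gpu.amount": "1",
    # Script that reports which GPU addresses an executor was assigned.
    "spark.executor.resource.gpu.discoveryScript": "/opt/spark/getGpus.sh",
}

# Flatten the properties into spark-submit arguments.
submit = ["spark-submit"] + [
    arg for k, v in sorted(spark_conf.items()) for arg in ("--conf", f"{k}={v}")
]
```

Application code is unchanged; the scheduler-level properties alone decide whether executors receive GPUs, which is what makes GPU acceleration of existing ML and ETL jobs attractive.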

Cloudera Data Platform Private Cloud Base (PvC)

With the merger of Cloudera and Hortonworks, the new “Cloudera” software, Cloudera Data Platform (CDP), combined the best of Hortonworks’ and Cloudera’s technologies to deliver the industry’s first enterprise data cloud. CDP Private Cloud Base is the on-premises version of CDP. This unified distribution is a scalable and customizable platform where workloads can be securely provisioned. CDP gives a clear path for extending or refreshing your existing HDP and CDH deployments and sets the stage for cloud-native architecture.

Cloudera Data Platform Private Cloud

Shadow IT can be eliminated when CDP Private Cloud is implemented on the Cisco® Data Intelligence Platform. CDP Private Cloud offers a cloud-like experience in a customer’s on-premises environment. With disaggregated compute and storage, a complete multi-tenant, self-service analytics environment can be implemented, offering better infrastructure utilization.

CDP Private Cloud also offers data scientist, data engineer, and data analyst personas, bringing the right tools to each user and improving time to value.

Red Hat OpenShift Container Platform (RHOCP) cluster

Cloudera has selected Red Hat OpenShift as the preferred container platform for CDP Private Cloud. With RHOCP, CDP Private Cloud delivers powerful, self-service analytics and enterprise-grade performance with the granular security and governance policies that IT leaders demand.

Apache Hadoop Ozone object store

Apache Hadoop Ozone is a scalable, redundant, distributed object store for Hadoop. Apart from scaling to billions of objects of varying sizes, Ozone functions effectively in containerized environments such as Kubernetes and YARN. Applications using frameworks like Apache Spark, YARN, and Hive work natively without modification. Ozone is built on a highly available, replicated block storage layer called Hadoop Distributed Data Store (HDDS).

Ozone is a scale-out architecture with minimal operational overhead and long-term maintenance effort. It can be co-located with HDFS under a single set of security and governance policies for easy data exchange or migration, and it offers seamless application portability. Ozone enables separation of compute and storage via the S3 API and, like HDFS, supports data locality for applications that choose to use it.
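To make the dual access paths concrete, the sketch below builds two addresses for the same object: a Hadoop-compatible `ofs://` path and an S3-gateway URL. The URI shapes follow the Apache Ozone documentation; the host, volume, and bucket names are placeholder assumptions:

```python
# Sketch: one Ozone bucket addressed two ways. URI shapes follow the Apache
# Ozone docs; host, volume, and bucket names are placeholder assumptions.
def ofs_uri(om_host, volume, bucket, key):
    """Hadoop-compatible path (usable by Spark, Hive, and YARN jobs)."""
    return f"ofs://{om_host}/{volume}/{bucket}/{key}"

def s3_url(gateway, bucket, key):
    """The same object through Ozone's S3 gateway for S3-API clients."""
    return f"http://{gateway}/{bucket}/{key}"

hadoop_path = ofs_uri("om.example.local", "vol1", "lake", "events/2020/part-0")
s3_path = s3_url("s3g.example.local:9878", "lake", "events/2020/part-0")
```

This is the practical meaning of compute/storage separation via the S3 API: Hadoop-native jobs and S3-style applications can share one object store without copying data between them.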


The design of Ozone was guided by the key principles listed in Figure 2.

Figure 2. Ozone design principles

• Highly scalable: tens of billions of files and blocks

• Layered architecture: separate namespace and block-management layers

• Data locality: inherits the power of HDFS’s data locality

• Side-by-side deployment: shares storage disks with HDFS

• Highly available: fully replicated to survive multiple failures

• Cloud native: works in containerized environments like YARN and Kubernetes

• Multiprotocol: HDFS and S3-compliant APIs

• Secure: access control and on-wire encryption

Kubernetes

Extracting intelligence from data lakes in a timely and speedy fashion is key to finding emerging business opportunities, accelerating time-to-market efforts, gaining market share, and increasing overall business agility.

In today’s fast-paced digitization, Kubernetes enables enterprises to rapidly deploy new updates and features at scale while maintaining consistency across testing, development, and production environments. Kubernetes lays the foundation for cloud-native apps, which can be packaged in container images and ported to diverse platforms. Containers with a microservices architecture, managed and orchestrated by Kubernetes, help organizations embark on a modern development pattern. Moreover, Kubernetes has become the de facto standard for container orchestration and forms the core of the on-premises container cloud for enterprises. It is a single cloud-agnostic infrastructure with a rich open-source ecosystem that allocates, isolates, and manages resources across many tenants at scale, elastically and as needed, thereby giving efficient infrastructure resource utilization. Figure 3 shows how Kubernetes is transforming the use of compute and becoming the de facto standard for running applications.


Hybrid architecture

Red Hat OpenShift, the market-leading Kubernetes-powered container platform, is the preferred container cloud platform for CDP Private Cloud. This combination completes the vision of the very first enterprise data cloud: a powerful hybrid architecture that decouples compute and storage for greater agility, ease of use, and more efficient use of private- and multi-cloud infrastructure resources. With Cloudera’s Shared Data Experience (SDX), security and governance policies can be easily and consistently enforced across data and analytics in private as well as multi-cloud deployments. This hybridity opens myriad opportunities for multi-function integration with other frameworks, such as streaming data, batch workloads, analytics, data pipelining/engineering, and machine learning.

Cloud-native architecture for data lakes and AI

Cisco Data Intelligence Platform with CDP Private Cloud accelerates the journey to cloud-native for your data lake and AI/ML workloads. By leveraging a Kubernetes-powered container cloud, enterprises can quickly break the silos of monolithic application frameworks and embrace continuous innovation with a microservices architecture and a CI/CD approach. With a cloud-native ecosystem, enterprises can build scalable, elastic, modern applications that extend the boundaries from a private cloud to a hybrid infrastructure.

Figure 3. Compute on Kubernetes is exciting

• Environmental consistency: test, dev, and production on the same or similar infrastructure

• Hybridity: stretch to cloud and data anywhere

• Personalities: onboarding of data engineers, data scientists, and analysts

• Utilization: containers allow packing many more tenants and workloads on the same infrastructure, enabling better utilization

• Application portability: written to the Kubernetes/container ecosystem, which provides portability

• Observability and monitoring: higher granularity

• Agility: rapid time to market

Spark on Kubernetes

Spark 2.3 introduced full support for Apache Spark on Kubernetes, enabling a Kubernetes cluster to act as compute for the data lake, much of which is used by Cloudera Private Cloud applications. Spark 2.3 jobs can be submitted directly to a Kubernetes cluster.

Spark on Kubernetes is a great stride for the Hadoop ecosystem, as it opens the door for many public-cloud-specific applications and framework use cases to be deployed on premises, providing the hybridity to stretch to any cloud. Kubernetes addresses gaps that existed in YARN, such as a lack of isolation and reproducibility, and allows workloads to be packaged in Docker images. Spark on Kubernetes also inherits built-in Kubernetes features, such as auto-scaling, detailed metrics, advanced container networking, security, and so on.
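A minimal sketch of such a submission, assuming Spark 2.3+ with Kubernetes support: the flags follow the Apache Spark "Running on Kubernetes" documentation, while the API-server address, container image, and application jar are placeholder assumptions:

```python
# Sketch: a Spark-on-Kubernetes submission. Flags follow the Apache Spark
# "Running on Kubernetes" docs; the API-server address, image, and jar are
# placeholder assumptions for illustration.
def k8s_submit(api_server, image, app_jar, executors=2):
    """Build the spark-submit argument list for a Kubernetes master."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",        # Kubernetes API server
        "--deploy-mode", "cluster",               # driver runs in a pod
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_jar,
    ]

cmd = k8s_submit("https://k8s.example.local:6443", "spark:2.3.0",
                 "local:///opt/spark/examples/jars/spark-examples.jar")
```

The driver and executors run as pods built from the named image, which is exactly the packaging and reproducibility benefit the paragraph above describes.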


Upgrade your Data Lake with Cisco Data Intelligence Platform (CDIP)

The Cisco Data Intelligence Platform (CDIP) delivers:

• The latest generation of CPUs from Intel (2nd Gen Intel Xeon Scalable family, with Cascade Lake CLXR) and AMD (EPYC Rome)

• Cloud scale and a fully modular architecture, in which big data, the AI/compute farm, and massive storage tiers work together as a single entity while each CDIP component can also scale independently to address the IT issues of the modern data center

• World-record Hadoop performance for both the MapReduce and Spark frameworks, published in the TPCx-HS benchmark

• An AI compute farm that offers different AI frameworks and compute types (GPU, CPU, FPGA) to work on data for analytics

• A massive storage tier that enables customers to gradually retire data and quickly retrieve it when needed, on storage-dense subsystems with a lower cost per TB for a better TCO

• Data compression with FPGA, which allows customers to offload compute-heavy compression tasks to FPGAs, relieving CPUs to perform other tasks and gaining significant performance

• Seamless scaling of the architecture to thousands of nodes

• Single-pane-of-glass management with Cisco Intersight™

• An ISV partner ecosystem: a best-in-class ecosystem of vendors offering best-of-breed, end-to-end validated architectures

• A pre-validated and fully supported platform

• A disaggregated architecture, supporting separation of storage and compute for a data lake

Platform upgrades are exciting because they bring long-awaited features and capabilities. However, successful upgrades require planning, and that planning involves the underlying infrastructure as much as the software being refreshed.

Given all these new long-awaited frameworks and functionalities—unified distribution (CDP), S3-compatible object store, CDP Private Cloud, and most importantly, end of support for older releases—now is the time to consider a risk-free system refresh with Cisco Data Intelligence Platform.

Enterprises have reached a true consensus that data science initiatives effectively drive business value. However, exponential data growth and the need to analyze enormous amounts of data—whether at rest or in motion—at ever-higher rates pose several challenges, such as I/O bottlenecks, numerous management touchpoints, growing cluster complexity, and performance degradation. Cisco Data Intelligence Platform is thoughtfully designed and engineered with an ecosystem-driven data strategy, providing high-quality collaboration between data science and IT.

Figure 4 illustrates how CDP and HDP integrate with Cisco Data Intelligence Platform.

Figure 4. Ecosystem-driven data strategy

[The figure shows HDP (Hortonworks Data Platform, powered by Apache Hadoop) and CDH (Cloudera’s open-source platform distribution, including Apache Hadoop) converging on Cisco Data Intelligence Platform with Cloudera Data Platform.]


Figure 5 illustrates how the ecosystem of vendors’ technologies work together.

Figure 5. Partner ecosystem

[The figure depicts the CDIP partner ecosystem: a pre-validated, fully supported platform with architectural innovations; world-record performance (20-plus TPCx-HS results, proven linear scaling, and the only vendor to publish a 300 TB test); centralized infrastructure management; and independent scaling of storage and compute with data tiering. The AI/compute farm runs compute-intensive workloads in containers (CPU/GPU/FPGA) on Cisco UCS C-Series Rack Servers with Red Hat OpenShift Container Platform; the data lake (Hadoop) runs data-intensive workloads on Cisco UCS C-Series Rack Servers; and the massive storage tier uses the Apache Ozone object store on Cisco UCS S3260 servers. Cisco Intersight manages CDIP, offering a powerful cloud-based management experience.]


Cisco Intersight is a lifecycle management platform for Cisco Data Intelligence Platform, regardless of where it resides (see Figure 6). Cisco Intersight features SaaS-based management, proactive guidance, security and extensibility, enhanced support with connected Cisco Technical Assistance Center (TAC) integration, visibility anywhere, and much more.

Figure 6. Cisco Intersight features

• SaaS delivered (hosted management or connected appliance)

• Platform compliance (HW/FW compatibility checks)

• Connected TAC (Technical Assistance Center)

• Unified management (dashboard and data collection)

• Cisco security advisories (CVEs)

• Centralized management

• Global policies

• Comprehensive automation

• Single pane of glass

• Actionable intelligence

• Connect everything (UCS Director, UCS Central, UCS Manager, and IMC)

Organizations deploying a container cloud in production need a platform strategy that encompasses the key elements: security, governance, monitoring and logging, data protection and persistence, container lifecycle management, process priority and isolation, and end-to-end automation and orchestration from the application layer to the infrastructure layer. Cisco Data Intelligence Platform delivers a cloud infrastructure strategy for data and applications that is aligned with long-term business strategy and carries a clear vision of becoming cloud-native, from on-premises to hybrid in the long run. The table below outlines CDIP features.


Table 1. Cisco Data Intelligence Platform Features

Features | Phase 1 | Phase 2

Infrastructure
• Hardware support: CPU/GPU (Phase 1); FPGA (Phase 2)
• Security: Cloudera security (authorization, authentication, RBAC, encryption)
• Networking: 25G/40G, thousands of nodes (Phase 1); 100G, thousands of nodes (Phase 2)
• Compute: 2nd Gen Intel Xeon Scalable family, AMD EPYC Rome

Storage (data locality + decoupled compute and storage)
• Storage: HDFS; S3 storage for Hadoop with Apache Ozone (Phase 1); Minio, S3-compatible (Phase 2)
• Storage FPGA compression: HDFS compression with Xilinx
• Centralized management: Cisco Intersight™

AI/ML: deep learning on Kubernetes
• Deep learning: Cloudera Machine Learning
• Distributed deep learning: Apache Submarine

AI/ML: inference and model management
• Inference on CPU: Cloudera Machine Learning
• Inference on GPU: Triton Inference Server
• Inference on FPGA: deploying models to Xilinx FPGA

Data processing and AI/ML
• Applications and services: Spark 2.x / Spark 3.0

Monitoring for Kubernetes
• Application and K8s infrastructure: AppDynamics

Workload Optimizer for Kubernetes
• Workload optimizer: Intersight Workload Optimizer (IWO)

Kubernetes
• Kubernetes: Red Hat OpenShift

Multi-cloud
• Hybrid/multi-cloud: AWS/Google


© 2020 Cisco and/or its affiliates. All rights reserved. Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks, go to this URL: https://www.cisco.com/go/trademarks. Third-party trademarks mentioned are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (1110R) C22-744005-00 09/20

For more information

• Optimizing Analytics Workloads with the Cisco Data Intelligence Platform

• To learn more about Cisco Data Intelligence Platform, visit https://www.cisco.com/c/dam/en/us/products/servers-unified-computing/ucs-c-series-rack-servers/solution-overview-c22-742432.pdf

• To find out more about Cisco UCS® big data solutions, visit https://www.cisco.com/go/bigdata

• To find out more about Cisco UCS big data validated designs, visit https://www.cisco.com/go/bigdata_design

• To find out more about Cisco UCS AI/ML solutions, visit https://www.cisco.com/go/ai-compute

• To find out more about Cisco validated solutions based on Cloudera Data Platform, visit https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/cisco_ucs_cdip_cloudera.html

• To learn more about Cisco Intersight, visit https://www.cisco.com/c/en/us/products/servers-unified-computing/intersight/index.html

Table 1 (continued). Cisco Data Intelligence Platform Features

Governance
• Governance: Cloudera SDX

Backup and disaster recovery
• Backup/point-in-time recovery: Cloudera BDR and Ozone (Phase 1); Hybrid (Phase 2)

Experiences (persona-driven)
• Data warehouse (data analyst): Cloudera Data Warehouse
• Data lake (data engineer): Cloudera Data Engineering
• Data science (data scientist): Cloudera Machine Learning
• Edge use cases: Cloudera DataFlow

Conclusion

Cisco Data Intelligence Platform is a robust platform that lays the foundation for all the exciting new architectural and technological innovation happening in the data lake world. It sets the stage for Cloudera Data Platform Private Cloud and for all of the upcoming enhancements for increased flexibility. By design, Cisco Data Intelligence Platform is a disaggregated architecture, which makes the big data journey easy and removes complexity from refresh and upgrade cycles. With Cisco Data Intelligence Platform, each component can not only scale but also be refreshed or upgraded independently.