Solution overview | Cisco public
Considerations in the journey of a Hadoop refresh
Despite the capability gap between Hadoop 2.x and 3.x, it is estimated that more than 80 percent of the Hadoop installed base still runs either HDP2 or CDH5, both of which are built on Apache Hadoop 2.0 and are approaching end of support at the end of 2020.
Amid these feature enrichments, specialized computing resources, and the looming end of support, a Hadoop upgrade is a value-added refresh. Given these enhancements, it is imperative to take a more holistic approach when refreshing your data lake, such as conjoining various frameworks and open-source technologies with the Hadoop ecosystem.
Refresh Your Data Lake to Cisco Data Intelligence Platform
The evolving Hadoop landscape
At the beginning of 2019, the providers of the leading Hadoop distributions, Hortonworks and Cloudera, merged. This merger raised the bar on innovation in the big data space, and the new Cloudera launched the Cloudera Data Platform (CDP), which combines the best of Hortonworks' and Cloudera's technologies to deliver the industry's first enterprise data cloud. Recently, Cloudera released CDP Private Cloud Base, the on-premises version of CDP. This unified distribution brings several new features, optimizations, and integrated analytics.
CDP Private Cloud Base is built on the Hadoop 3.x distribution. Hadoop has developed many capabilities since its inception, but Hadoop 3.0 is an eagerly awaited major release with several new features and optimizations. Upgrading from Hadoop 2.x to 3.0 is a paradigm shift: it enables diverse computing resources (CPU, GPU, and FPGA) to work on data and leverage AI/ML methodologies. It supports flexible, elastic containerized workloads managed by either the Hadoop scheduler (YARN) or Kubernetes, distributed deep learning, GPU-enabled Spark workloads, and more. In addition, Hadoop 3.0 offers better reliability and availability of metadata through multiple standby NameNodes, disk balancing for evenly utilized DataNodes, enhanced workload scheduling with YARN 3.0, and overall improved operational efficiency.
Going forward, the Ozone initiative lays the foundation of the next generation of storage architecture for HDFS, where data blocks are organized in storage containers for higher scale and handling of small objects in HDFS. The Ozone project also includes an object store implementation to support several new use cases.
© 2020 Cisco and/or its affiliates. All rights reserved.
As the Hadoop journey continues, ever more impressive software frameworks and technologies are being introduced for crunching big data, and they will continue to evolve and integrate in a modular fashion. Furthermore, specialized hardware such as GPUs and FPGAs is becoming the de facto standard for deep learning on gigantic datasets. Figure 1 shows how AI/ML frameworks and containerization are augmenting the Hadoop ecosystem.
Figure 1. Hadoop 3.0 refresh with AI included
[Figure: "Hadoop meets AI with Hadoop 3.0." Apache Hadoop 3.1 brings AI support to the data lake: the YARN scheduler manages CPU, Nvidia GPU, and memory resources; Apache Submarine launches and schedules distributed training jobs (workers, parameter servers, TensorBoard) on YARN; Apache Spark 2.3, Apache Spark 3.0 (tech preview), and Apache Ozone (tech preview) with Hadoop FS and S3-protocol access round out the stack, with Cisco UCS enabling these next-generation workloads.]
Containerization
Hadoop 3.0 introduces production-ready Docker container support on YARN with GPU isolation and scheduling. This opens up a plethora of opportunities for modern applications, such as microservices and distributed application frameworks comprising thousands of containers, to execute AI/ML algorithms on petabytes of data quickly and easily.
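As a concrete sketch of what Docker-on-YARN looks like to an application, the snippet below builds the container-launch environment variables that the Hadoop 3.x YARN documentation defines for selecting the Docker runtime; the image name is an illustrative placeholder, not something from this document.

```python
def docker_on_yarn_env(image):
    """Env vars a YARN application sets so its containers run inside Docker.

    Variable names are from the Hadoop 3.x 'Launching Applications Using
    Docker Containers' documentation; the image is a hypothetical example.
    """
    return {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",          # use the Docker runtime
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,     # image to launch
    }

# Example: a hypothetical GPU-enabled TensorFlow image
env = docker_on_yarn_env("mycompany/tf-gpu:latest")
```

An application would pass this environment in its container-launch context; YARN's GPU isolation is requested separately through the `yarn.io/gpu` resource type.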
Distributed deep learning with Apache Submarine
The Hadoop community initiated the Apache Submarine project to make distributed deep learning/machine learning applications easy to launch, manage, and monitor. These improvements make running distributed deep learning/machine learning applications (TensorFlow, PyTorch, MXNet, and so on) on Apache Hadoop YARN as simple as running them locally, letting data scientists focus on algorithms instead of underlying infrastructure.
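To give a feel for what "simple to launch" means, the sketch below assembles a Submarine job-submission command of the general shape shown in the Apache Submarine on-YARN documentation. Treat it as illustrative only: the jar name, flag names, and resource strings vary by Hadoop/Submarine version, and the training script is hypothetical.

```python
def submarine_job_cmd(name, image, num_workers, worker_resources, launch_cmd):
    """Build an illustrative Submarine 'job run' command line (version-dependent)."""
    return [
        "yarn", "jar", "hadoop-yarn-submarine.jar", "job", "run",
        "--name", name,                          # job name shown in YARN
        "--docker_image", image,                 # training container image
        "--num_workers", str(num_workers),       # distributed workers
        "--worker_resources", worker_resources,  # e.g. memory/vcores/gpu per worker
        "--worker_launch_cmd", launch_cmd,       # command run inside each worker
    ]

cmd = submarine_job_cmd("tf-mnist", "tf-1.13-gpu:latest", 2,
                        "memory=8G,vcores=2,gpu=1",
                        "python train.py --epochs 5")
```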
Apache Spark 3.0
Apache Spark 3.0 is something every data scientist and data engineer has been awaiting with anticipation. Spark is no longer limited to CPUs for its workloads; it now offers GPU isolation and acceleration, and YARN can launch Spark 3.0 applications with GPUs to make the deep learning environment easy to manage. This paves the way for other Spark workloads, such as machine learning and ETL, to be GPU-accelerated as well. Learn more by reading the Cisco blog on Apache Spark 3.0.
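A minimal sketch of the configuration side of this: Spark 3.0's resource-scheduling properties let a job request GPUs per executor and per task. The property names below come from the Spark 3.0 documentation; the discovery-script path is an assumption for illustration.

```python
def spark3_gpu_confs(gpus_per_executor=1, gpus_per_task=0.5,
                     discovery_script="/opt/spark/scripts/getGpusResources.sh"):
    """Spark 3.0 GPU-scheduling properties (values here are illustrative)."""
    return {
        # GPUs allocated to each executor
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # GPU share each task claims (0.5 lets two tasks share one GPU)
        "spark.task.resource.gpu.amount": str(gpus_per_task),
        # script the executor runs to discover its GPU addresses
        "spark.executor.resource.gpu.discoveryScript": discovery_script,
    }

confs = spark3_gpu_confs()
```

These would be passed as `--conf key=value` pairs to spark-submit (or set on a SparkSession builder).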
Cloudera Data Platform Private Cloud Base (PvC)
With the merger of Cloudera and Hortonworks, the new Cloudera software named Cloudera Data Platform (CDP) combines the best of Hortonworks' and Cloudera's technologies to deliver the industry's first enterprise data cloud. CDP Private Cloud Base is the on-premises version of CDP. This unified distribution is a scalable and customizable platform where workloads can be securely provisioned. CDP gives a clear path for extending or refreshing your existing HDP and CDH deployments and sets the stage for cloud-native architecture.
Cloudera Data Platform Private Cloud
Shadow IT can be eliminated when CDP Private Cloud is implemented on the Cisco® Data Intelligence Platform. CDP Private Cloud offers a cloud-like experience in a customer's on-premises environment. With disaggregated compute and storage, a complete multi-tenant, self-service analytics environment can be implemented, offering better infrastructure utilization.
Also, CDP Private Cloud offers data scientist, data engineer, and data analyst personas, bringing the right tools to each user and improving time to value.
Red Hat OpenShift Container Platform (RHOCP) cluster
Cloudera has selected Red Hat OpenShift as the preferred container platform for CDP Private Cloud. With RHOCP, CDP Private Cloud delivers powerful, self-service analytics and enterprise-grade performance with the granular security and governance policies that IT leaders demand.
Apache Hadoop Ozone object store
Apache Hadoop Ozone is a scalable, redundant, distributed object store for Hadoop. Apart from scaling to billions of objects of varying sizes, Ozone functions effectively in containerized environments such as Kubernetes and YARN. Applications using frameworks like Apache Spark, YARN, and Hive work natively without modification. Ozone is built on a highly available, replicated block storage layer called Hadoop Distributed Data Store (HDDS).
Ozone has a scale-out architecture with minimal operational overhead and long-term maintenance effort. It can be co-located with HDFS under a single set of security and governance policies for easy data exchange or migration, and it offers seamless application portability. Ozone enables the separation of compute and storage via the S3 API, and, like HDFS, it supports data locality for applications that choose to use it.
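Ozone's multi-protocol access can be sketched as follows: the same bucket and key are reachable both through the Hadoop-compatible `ofs://` filesystem and through any S3 client pointed at Ozone's S3 gateway. The host names and volume here are hypothetical; 9878 is the S3 gateway's default port in the Ozone documentation.

```python
def ozone_paths(om_host, volume, bucket, key, s3g_host, s3g_port=9878):
    """Two views of one Ozone object: Hadoop filesystem path and S3 URL.

    Hosts/volume are illustrative placeholders; 9878 is the default
    Ozone S3 gateway port.
    """
    return {
        # Hadoop-compatible filesystem path (used by Spark, Hive, etc.)
        "ofs": f"ofs://{om_host}/{volume}/{bucket}/{key}",
        # URL an S3 SDK would reach via endpoint_url=http://{s3g_host}:{s3g_port}
        "s3": f"http://{s3g_host}:{s3g_port}/{bucket}/{key}",
    }

paths = ozone_paths("om.example.com", "vol1", "logs", "2020/09/app.log",
                    "s3g.example.com")
```

An S3 SDK such as boto3 would simply point its endpoint at the gateway; Hadoop applications use the `ofs://` path unchanged.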
The design of Ozone was guided by the key principles listed in Figure 2.
Figure 2. Ozone design principles
• Highly scalable: tens of billions of files and blocks
• Layered architecture: separate namespace and block management layers
• Data locality: inherits the power of HDFS's data locality
• Side-by-side deployment: shares storage disks with HDFS
• Highly available: fully replicated to survive multiple failures
• Cloud-native: works in containerized environments such as YARN and Kubernetes
• Multi-protocol: HDFS and S3-compliant APIs
• Secure: access control and on-wire encryption
Kubernetes
Extracting intelligence from data lakes in a timely and speedy fashion is key to finding emerging business opportunities, accelerating time-to-market efforts, gaining market share, and increasing overall business agility.
In today's fast-paced digitization, Kubernetes enables enterprises to rapidly deploy new updates and features at scale while maintaining consistency across testing, development, and production environments. Kubernetes lays the foundation for cloud-native applications, which can be packaged in container images and ported to diverse platforms. Containers with a microservices architecture, managed and orchestrated by Kubernetes, help organizations embark on a modern development pattern. Moreover, Kubernetes has become the de facto standard for container orchestration and forms the core of on-premises container clouds for enterprises. It is a cloud-agnostic infrastructure with a rich open-source ecosystem that allocates, isolates, and manages resources across many tenants at scale in an elastic fashion, thereby giving efficient infrastructure resource utilization. Figure 3 shows how Kubernetes is transforming the use of compute and becoming the de facto standard for running applications.
Hybrid architecture
Red Hat OpenShift, the preferred container cloud platform for CDP Private Cloud, is the market-leading Kubernetes-powered container platform. This combination completes the vision of the very first enterprise data cloud, with a powerful hybrid architecture that decouples compute and storage for greater agility, ease of use, and more efficient use of private and multi-cloud infrastructure resources. With Cloudera's Shared Data Experience (SDX), security and governance policies can be easily and consistently enforced across data and analytics in private as well as multi-cloud deployments. This hybridity opens myriad opportunities for multi-function integration with other frameworks, such as streaming data, batch workloads, analytics, data pipelining/engineering, and machine learning.
Cloud-native architecture for data lakes and AI
Cisco Data Intelligence Platform with CDP Private Cloud accelerates the journey to cloud-native for your data lake and AI/ML workloads. By leveraging a Kubernetes-powered container cloud, enterprises can quickly break the silos of monolithic application frameworks and embrace continuous innovation with a microservices architecture and a CI/CD approach. With a cloud-native ecosystem, enterprises can build scalable, elastic modern applications that extend the boundaries from a private cloud to a hybrid infrastructure.
Figure 3. Compute on Kubernetes is exciting
• Environmental consistency: test/dev/production on the same or similar infrastructure
• Hybridity: stretch to cloud and data anywhere
• Personalities: onboarding of data engineers, data scientists, and analysts
• Utilization: containers allow packing many more tenants and workloads on the same infrastructure, enabling better utilization
• Application portability: written to the Kubernetes/container ecosystem, which provides portability
• Observability and monitoring: higher granularity
• Agility: rapid time to market
Spark on Kubernetes
Spark 2.3 introduced full support for running Apache Spark on Kubernetes, enabling a Kubernetes cluster to act as compute for the data lake, much of which is used in Cloudera Private Cloud applications. Spark jobs can be submitted directly to a Kubernetes cluster.
Spark on Kubernetes is a great stride for the Hadoop ecosystem, as it opens the door for many public cloud-specific applications and framework use cases to be deployed on premises, providing hybridity to stretch to the cloud anywhere. Kubernetes addresses gaps that existed in YARN, such as lack of isolation and reproducibility, and it allows workloads to be packaged in Docker images. Spark on Kubernetes also inherits Kubernetes' other built-in features, such as auto-scaling, detailed metrics, advanced container networking, and security.
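Submitting Spark directly to a Kubernetes cluster can be sketched as below. The option names come from the Spark-on-Kubernetes documentation (available since Spark 2.3); the API-server URL, container image, and application jar are placeholders, not values from this document.

```python
def spark_on_k8s_submit(api_server, image, app_jar, executors=4):
    """Assemble an illustrative spark-submit command targeting Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",       # Kubernetes API server as master
        "--deploy-mode", "cluster",              # driver runs inside the cluster
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_jar,
    ]

cmd = spark_on_k8s_submit("https://k8s-api.example.com:6443",
                          "spark:3.0", "local:///opt/app.jar")
```

Each executor then runs as a pod built from the named image, which is what gives Spark the isolation and reproducibility noted above.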
Upgrade your Data Lake with Cisco Data Intelligence Platform (CDIP)
The Cisco Data Intelligence Platform (CDIP) delivers:
• The latest-generation CPUs from Intel (2nd Gen Intel Xeon Scalable family, Cascade Lake Refresh) and AMD (EPYC Rome)
• Cloud scale and a fully modular architecture, in which big data, the AI/compute farm, and massive storage tiers work together as a single entity while each CDIP component can also scale independently to address the IT issues of the modern data center
• World-record Hadoop performance for both MapReduce and Spark frameworks, published in the TPCx-HS benchmark
• An AI compute farm that offers different AI frameworks and compute types (GPU, CPU, FPGA) to work on data for analytics
• A massive storage tier that enables customers to gradually retire data and quickly retrieve it when needed, on storage-dense subsystems with a lower cost per TB for better TCO
• Data compression with FPGA, which allows customers to offload compute-heavy compression tasks to the FPGA, freeing the CPU for other tasks and gaining significant performance
• Seamless scaling of the architecture to thousands of nodes
• Single-pane-of-glass management with Cisco Intersight™
• An ISV partner ecosystem: a best-in-class ecosystem of vendors offering best-of-breed, end-to-end validated architectures
• A pre-validated and fully supported platform
• A disaggregated architecture, supporting separation of storage and compute for a data lake
Platform upgrades are exciting because they bring long-awaited features and capabilities. However, successful upgrades require planning, and that planning involves the underlying infrastructure as much as the software being refreshed.
Given all these new long-awaited frameworks and functionalities—unified distribution (CDP), S3-compatible object store, CDP Private Cloud, and most importantly, end of support for older releases—now is the time to consider a risk-free system refresh with Cisco Data Intelligence Platform.
Enterprises have reached a true consensus that data science initiatives effectively drive business value. However, exponential data growth and the need to analyze enormous amounts of data, whether at rest or in motion, at ever higher rates pose several challenges: I/O bottlenecks, multiple management touchpoints, growing cluster complexity, performance degradation, and so on. Cisco Data Intelligence Platform is thoughtfully designed and engineered with an ecosystem-driven data strategy, providing high-quality collaboration between data science and IT.
Figure 4 illustrates how CDP and HDP integrate with Cisco Data Intelligence Platform.
Figure 4. Ecosystem-driven data strategy
[Figure: HDP (Hortonworks Data Platform, powered by Apache Hadoop) and CDH (Cloudera's open-source platform distribution, including Apache Hadoop) converge into the Cisco Data Intelligence Platform with Cloudera Data Platform.]
Figure 5 illustrates how the ecosystem of vendors’ technologies work together.
Figure 5. Partner ecosystem
[Figure: a pre-validated, fully supported architecture with vendor architectural innovations. The data lake (Hadoop) runs data-intensive workloads on Cisco UCS C-Series Rack Servers; the AI/compute farm runs compute-intensive containerized workloads (CPU/GPU/FPGA) on Red Hat OpenShift Container Platform; massive storage on the Cisco UCS S3260 hosts the Apache Ozone object store for data anywhere; storage and compute scale independently, with data tiering. World-record performance: TPCx-HS (20-plus records), proven linear scaling, and the only published 300-TB test. Cisco Intersight manages CDIP, offering a powerful cloud-based management experience.]
Cisco Intersight is a lifecycle management platform for the Cisco Data Intelligence Platform, regardless of where it resides. Cisco Intersight features SaaS-based management, proactive guidance, security and extensibility, enhanced support with connected Cisco Technical Assistance Center (TAC) integration, visibility anywhere, and much more (see Figure 6).
Figure 6. Cisco Intersight features
[Figure: Cisco Intersight provides a single pane of glass with centralized management, global policies, comprehensive automation, and actionable intelligence. Features include SaaS delivery (hosted management or connected appliance), platform compliance (hardware/firmware compatibility checks), connected TAC (Technical Assistance Center), unified management (dashboard and data collection), Cisco security advisories (CVEs), and connectivity to everything (UCS Director, UCS Central, UCS Manager, and IMC).]
Organizations deploying a container cloud in production need a platform strategy that encompasses key elements: security, governance, monitoring and logging, data protection and persistence, container lifecycle management, process priority and isolation, and end-to-end automation and orchestration from the application layer to the infrastructure layer. Cisco Data Intelligence Platform delivers a cloud infrastructure strategy for data and applications that is aligned with long-term business strategy and has a clear vision of becoming cloud-native, from on-premises to hybrid, in the long run. Table 1 outlines CDIP features.
Table 1. Cisco Data Intelligence Platform Features
Infrastructure
• Hardware support: CPU/GPU (Phase 1); FPGA (Phase 2)
• Security: Cloudera security (authorization, authentication, RBAC, encryption)
• Networking: 25G/40G, thousands of nodes (Phase 1); 100G, thousands of nodes (Phase 2)
• Compute: 2nd Gen Intel Xeon Scalable family, AMD EPYC Rome

Storage (data locality + decoupled compute and storage)
• Storage: HDFS (Phase 1); S3 storage for Hadoop with Apache Ozone, MinIO S3-compatible (Phase 2)
• FPGA compression: HDFS compression with Xilinx
• Centralized management: Cisco Intersight™

AI/ML: deep learning on Kubernetes
• Deep learning: Cloudera Machine Learning
• Distributed deep learning: Apache Submarine

AI/ML: inference and model management
• Inference on CPU: Cloudera Machine Learning
• Inference on GPU: Triton Inference Server
• Inference on FPGA: deploying models to Xilinx FPGA

Data processing and AI/ML
• Applications and services: Spark 2.x/Spark 3.0

Monitoring for Kubernetes
• Application and Kubernetes infrastructure: AppDynamics

Workload optimizer for Kubernetes
• Workload optimizer: Intersight Workload Optimizer (IWO)

Kubernetes
• Kubernetes: Red Hat OpenShift

Multi-cloud
• Hybrid/multi-cloud: AWS/Google
© 2020 Cisco and/or its affiliates. All rights reserved. Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks, go to this URL: https://www.cisco.com/go/trademarks. Third-party trademarks mentioned are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (1110R) C22-744005-00 09/20
For more information
• Optimizing Analytics Workloads with the Cisco Data Intelligence Platform
• To learn more about Cisco Data Intelligence Platform, visit https://www.cisco.com/c/dam/en/us/products/servers-unified-computing/ucs-c-series-rack-servers/solution-overview-c22-742432.pdf
• To find out more about Cisco UCS® big data solutions, visit https://www.cisco.com/go/bigdata
• To find out more about Cisco UCS big data validated designs, visit https://www.cisco.com/go/bigdata_design
• To find out more about Cisco UCS AI/ML solutions, visit https://www.cisco.com/go/ai-compute
• To find out more about Cisco validated solutions based on Cloudera Data Platform, visit https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/cisco_ucs_cdip_cloudera.html
• To learn more about Cisco Intersight, visit https://www.cisco.com/c/en/us/products/servers-unified-computing/intersight/index.html
Table 1 (continued)

Governance
• Governance: Cloudera SDX

Backup and disaster recovery
• Backup/point-in-time recovery: Cloudera BDR, Ozone, hybrid

Experiences (persona-driven)
• Data warehouse (data analyst): Cloudera Data Warehouse
• Data lake (data engineer): Cloudera Data Engineering
• Data science (data scientist): Cloudera Machine Learning
• Edge use cases: Cloudera DataFlow
Conclusion
Cisco Data Intelligence Platform is a robust platform that lays the foundation for all the exciting new architectural and technological innovation happening in the data lake world. It sets the stage for Cloudera Data Platform Private Cloud and for all of the upcoming enhancements for increased flexibility. By design, Cisco Data Intelligence Platform is a disaggregated architecture, which makes the big data journey easy and removes complexity from refresh and upgrade cycles. With Cisco Data Intelligence Platform, each component can not only scale independently but also be refreshed or upgraded independently.