Hadoop and Spark analytics over better storage

Create a Colder Storage Tier for Hadoop & Spark Using IBM Elastic Storage Server & HDFS Transparency
Ted Hoover / September 19, 2016
#ibmedge © 2016 IBM Corporation

Upload: sandeep-patil

Post on 21-Mar-2017



The History of Spectrum Scale

This infographic is the genealogy of IBM Spectrum Scale, from its birth as a digital media server and HPC research project to its place as a foundational element in the IBM Spectrum Storage family. It highlights key milestones in the product history, usage, and industry to convey that Spectrum Scale may have started as GPFS, but it is so much more now. IBM has invested in the enterprise features that make it easy to use, reliable and suitable for mission critical storage of all types.

Enterprise Features & Flexibility

Unified data access with File and Object Based Storage

Spectrum Scale services:
Rolling Upgrades
File Placement Optimization
Global Active File Management
Advanced Routing & Caching Services
Commodity Hardware
Sync & Async Replication
Flash Acceleration
Network performance monitoring
Native Encryption and Secure Erase
Common Management
Cloud Ready
High-speed scanning engine
Transparent policy-driven data migration

Storage Resource Pools

Protocols: POSIX, HDFS, SMB/CIFS, NFS, Swift/S3

Global Namespace
Archive Integration
Simplify Management
Software-Defined Agility
Enable Global Collaboration

Reduce Complexity
Redefining Unified Storage

Challenge:
Managing data growth
Lowering data costs
Managing data retrieval & app support
Protecting business data

Unified Scale-out Data Lake:
File In/Out, Object In/Out; Analytics on demand.
High-performance native protocols
Single Management Plane
Cluster replication & global namespace
Enterprise storage features across file, object & HDFS

Storage tiers: SSD, Fast Disk, Slow Disk, Tape

Spectrum Scale

Protocols: NFS, SMB, POSIX, Swift/S3, HDFS

Spectrum Scale's core DNA is highly parallel access to large data. We do this by scaling out.

Store everywhere. Run anywhere.
Analytics without complexity

Challenge:
Separate storage systems for ingest, analysis, results
HDFS requires locality aware storage (namenode)
Data transfer slows time to results
Different frameworks & analytics tools use data differently

HDFS Transparency:
Map/Reduce on shared, or shared nothing storage
No waiting for data transfer between storage systems
Immediately share results
Single Data Lake for all applications
Enterprise data management
Archive and Analysis in-place

Ingest: Object, File, Direct Access (POSIX), Raw Data

Analysis

IBM Systems

Valid reasons for using HDFS/storage-rich servers:
Great for workloads that need locality
Great for workloads that are read dominated
Easy incremental growth option

But has downsides:
Low compute and memory utilization
3-way replication overhead (high TCO)
Compute & storage are tied together
Hard to integrate into enterprise (backup, archive, DR, security, etc.)

Spectrum Scale 4.2.1 for Big Data Oceans: extending HDFS for the enterprise

An enterprise HDFS filesystem

Expand use of Shared Nothing Clusters
Simplicity of Storage Rich Servers with enterprise features
Advanced Routing (AFM), encryption, QoS, compression
Mix cluster types
Shared Nothing = traditional HDFS style
Centralized Storage = traditional enterprise

Other clients

Storage Servers

Storage

Store Everywhere. Run Anywhere. Standard commands & protocols

Shared Nothing Clusters
Application specifies hdfs://namenode:9001

Spectrum Scale 4.2.1 for Big Data Oceans: extending HDFS across Clusters

Extending the Filesystem

Run analytics across multiple HDFS and/or Spectrum Scale clusters
No need to move the data
Build Data Oceans on demand

Store Everywhere. Run Anywhere.

IBM Spectrum Scale HDFS Transparency Connector

viewfs://clusterX:/hadoop/hdfs/file1
viewfs://clusterY:/hadoop/gpfs/file1
hdfs://nn2.node.net:8020/hadoop/gpfs
hdfs://nn1.node.net:8020/hadoop/hdfs
Cluster X
Cluster Y
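The viewfs:// paths above are client-side views that resolve to a concrete hdfs:// NameNode address. The following is a minimal Python sketch of that kind of longest-prefix mount-table lookup. It is not Hadoop's ViewFs implementation: MOUNT_TABLE and resolve are hypothetical names, and the pairing of /hadoop/hdfs with nn1 and /hadoop/gpfs with nn2 is an assumption inferred from the matching path names on the slide.

```python
# Hypothetical mount table: view path prefix -> backing HDFS URI.
# The prefix-to-NameNode pairing is an assumption inferred from the path names.
MOUNT_TABLE = {
    "/hadoop/hdfs": "hdfs://nn1.node.net:8020/hadoop/hdfs",
    "/hadoop/gpfs": "hdfs://nn2.node.net:8020/hadoop/gpfs",
}


def resolve(view_path: str) -> str:
    """Map a path in the federated view to its backing HDFS URI."""
    # Keep only mount prefixes that match on a path-component boundary.
    matches = [
        prefix for prefix in MOUNT_TABLE
        if view_path == prefix or view_path.startswith(prefix + "/")
    ]
    if not matches:
        raise KeyError(f"no mount point covers {view_path}")
    # The longest matching prefix wins, as in a typical mount table.
    prefix = max(matches, key=len)
    return MOUNT_TABLE[prefix] + view_path[len(prefix):]


print(resolve("/hadoop/hdfs/file1"))
print(resolve("/hadoop/gpfs/file1"))
```

The application only ever names the view path; which cluster serves the data is decided by the table, which is why data can change tiers without the application changing its storage logic.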

Challenge:
Separate storage systems for ingest, analysis, results
HDFS requires locality aware storage (namenode)
Data transfer slows time to results
Different frameworks & analytics tools use data differently

Native HDFS support:
Map/Reduce on shared, or shared nothing storage
No waiting for data transfer between storage systems
Immediately share results
Single Data Lake for all applications

Use Case 1: Federate ESS with Existing HDFS Filesystem

viewfs://clusterX:/hadoop/hdfs/file1
viewfs://clusterY:/hadoop/gpfs/file1
hdfs://nn1.node.net:8020/hadoop/hdfs

IBM Spectrum Scale HDFS Transparency Connector

hdfs://nn2.node.net:8020/hadoop/gpfs

Improve Hadoop Cluster Utilization
Manually move less frequently accessed data to an ESS tier
Applications can still seamlessly access data that has been moved
Commands:
$ hadoop distcp viewfs://clusterX:/hadoop/hdfs/file1 viewfs://clusterY:/hadoop/gpfs/file1
$ hadoop fs -rm viewfs://clusterX:/hadoop/hdfs/file1
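The manual move is a copy (hadoop distcp) followed by a delete of the source (hadoop fs -rm), and the delete is only safe once the copy has succeeded. A minimal Python sketch of that ordering, under the assumption that the hadoop CLI is on the PATH; tier_commands, move_to_tier and the injectable run parameter are hypothetical helpers for illustration, not part of Hadoop or Spectrum Scale:

```python
import subprocess
from typing import Callable, List, Sequence


def tier_commands(src: str, dst: str) -> List[List[str]]:
    """Return the copy and delete commands, in the order they must run."""
    return [
        ["hadoop", "distcp", src, dst],
        ["hadoop", "fs", "-rm", src],
    ]


def _run(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def move_to_tier(src: str, dst: str,
                 run: Callable[[Sequence[str]], int] = _run) -> bool:
    """Copy src to dst, then remove src only if the copy succeeded."""
    copy_cmd, remove_cmd = tier_commands(src, dst)
    if run(copy_cmd) != 0:
        return False  # keep the source when the copy fails
    return run(remove_cmd) == 0


# Dry run: show the commands without executing them.
for cmd in tier_commands("viewfs://clusterX:/hadoop/hdfs/file1",
                         "viewfs://clusterY:/hadoop/gpfs/file1"):
    print(" ".join(cmd))
```

Passing a different run callable makes it possible to log or dry-run the commands before letting them touch a real cluster.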


Use Case 2: Federate ESS with an Existing Spectrum Scale Filesystem

/gpfs/fs1

Extending a Spectrum Scale Filesystem

Add an ESS tier to an existing FPO cluster and use ILM policies to migrate data to the ESS tier
Data is still accessible from the FPO cluster

RULE 'FPO_USE' SET POOL 'fpodata' REPLICATE (2) FOR FILESET ('fpodata')
RULE 'FPO_TO_SHARESTORAGE' MIGRATE FROM POOL 'fpodata' TO POOL 'datapool' WHERE CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '10' MINUTES
RULE 'default' SET POOL 'datapool'
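The migration rule selects files by age: a file qualifies once the time since its last modification exceeds 10 minutes. A minimal Python sketch of that selection test, for intuition only; the real evaluation is done by the Spectrum Scale policy engine, and migration_candidates, AGE_THRESHOLD and the file paths below are hypothetical:

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Mirrors INTERVAL '10' MINUTES in the migration rule.
AGE_THRESHOLD = timedelta(minutes=10)


def migration_candidates(mtimes: Dict[str, datetime], now: datetime) -> List[str]:
    """Return paths whose age (now - modification time) exceeds the threshold."""
    return sorted(path for path, mtime in mtimes.items()
                  if now - mtime > AGE_THRESHOLD)


now = datetime(2016, 9, 19, 12, 0, 0)
files = {
    # Hypothetical paths, used only to exercise the age test.
    "/gpfs/fs1/fpodata/a": now - timedelta(minutes=30),  # older than 10 minutes
    "/gpfs/fs1/fpodata/b": now - timedelta(minutes=5),   # modified recently
}
print(migration_candidates(files, now))
```

The comparison is strict, matching the `>` in the rule, so a file modified exactly 10 minutes ago is not yet selected.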

/gpfs/fs2

Single Name Space

Demo

IBM Spectrum Protect

Use Case 3: Spectrum Scale Provides Easy Integration with Enterprise Backup Tools

IBM Spectrum Scale HDFS Transparency Connector

Protecting Business Data

Use ESS warm data tier with Spectrum Protect for backup
Simplified backup administration tools
Scalable performance
Optimized data protection

viewfs://clusterX:/hadoop/hdfs/file1
viewfs://clusterY:/hadoop/gpfs/file1
hdfs://nn1.node.net:8020/hadoop/hdfs
hdfs://nn2.node.net:8020/hadoop/gpfs

Use Case 4: Spectrum Scale Provides Easy Integration with Enterprise Archiving Tools

Protecting Business Data

Use ESS warm data tier with Spectrum Archive to tape
Powerful policy engine
Information Lifecycle Management
Fast metadata scanning and data movement
Automated data migration based on threshold
Users not affected by data migration
Single namespace

IBM Spectrum Scale HDFS Transparency Connector

viewfs://clusterX:/hadoop/hdfs/file1
viewfs://clusterY:/hadoop/gpfs/file1
hdfs://nn1.node.net:8020/hadoop/hdfs
hdfs://nn2.node.net:8020/hadoop/gpfs

IBM Spectrum Archive

IBM Spectrum Protect

Use Case 5: Spectrum Scale Provides Easy Integration with Enterprise Backup and Archiving tools

/gpfs/fs2

Protecting Business Data

Optionally, Spectrum Protect and Spectrum Archive can be used directly with Spectrum Scale FPO

/gpfs/fs1

Single Name Space

IBM Spectrum Archive

Client Use Case 1: Unified Analytic/Workflow Pipelines

[Figure: pipeline with stages Ingest, Analyze, Export, Visualize. Labels: POSIX, HDFS, NFS, SMB, Object; RDBMS; Data Lake; Dashboard; Report; DB2 on Shared Nothing Cluster (FPO); Hadoop on Shared Nothing Cluster (FPO); ESS storage; structured data warehouse; warehouse extension for unstructured data; compute cluster; SAS Analytics; Dashboard & Reporting.]

Client Use Case 2: Life Sciences with HPC and Hadoop/Spark

[Figure: ESS-based shared storage cluster and HPC compute cluster. LSF jobs (Job 1, Job 2, Job 5, an LSF Hadoop job, an LSF Spark job) work on files A and B on the shared pool and files C, D, E and F on the FPO pool; File F is replicated to a remote Spectrum Scale server.]

Client Use Case 3: HPC and Data Analytics

[Figure: ESS-based shared storage cluster and HPC compute cluster. Workflow stages Ingest, Analyze and Simulate, iterated; data is accessed through POSIX & Object, POSIX, and HDFS.]

Summary: Big Data Oceans extending HDFS across Clusters

Unified Data Repository, Support Multiple Analytics

Federate ESS with Existing HDFS Filesystem
Federate ESS with Existing Spectrum Scale Filesystem

HDFS Transparency Connector

Single Name Space

Improve Hadoop Cluster Utilization:
Manually move less frequently accessed data to an ESS tier
Applications can still seamlessly access data that has been moved

Expand use of Shared Nothing Clusters:
Simplicity of Storage Rich Servers with enterprise features
Advanced Routing (AFM), encryption, QoS, compression
Mix cluster types
Backup and Archive Support

Extending the Filesystem:
Run analytics across multiple HDFS and/or Spectrum Scale clusters
No need to move the data
Build Data Oceans on demand

Spectrum Scale User Group

The Spectrum Scale User Group is free to join and open to all using, interested in using, or integrating Spectrum Scale. Join the User Group activities to meet your peers and get access to experts from partners and IBM.

Next meetings:
APAC: October 14, Melbourne
Global at SC16: November 13, 1pm to 5pm, Salt Lake City

Web page: http://www.spectrumscale.org/
Presentations: http://www.spectrumscale.org/presentations/
Mailing list: http://www.spectrumscale.org/join/
Contact: http://www.spectrumscale.org/committee/

Meet Bob Oesterlin (US Co-Principal) at Edge2016: [email protected]

Thank You
