Post on 14-Mar-2022


10 key considerations to build a sustainable, robust data lake on the cloud

E-BOOK

Maximizing speed, scalability, and agility in today's data-driven world

Abstract

To tap the unmatched agility, scalability, and advanced capabilities of the cloud, enterprises across the globe are considering the move to cloud-based data lakes. This e-book explores the critical levers of success in this journey of transformation, taking a deep dive into the people, processes, and technology involved. It aims to help decision-makers overcome challenges and adopt the right strategy to create a sustainable, robust data lake on the cloud.

Disclaimer: All references to AWS, GCP, and Azure as cloud providers in this e-book are purely for illustrative purposes.

Contents

1. Why are businesses creating their data lakes on the cloud?
2. What challenges do businesses face while adopting a data lake on the cloud?
3. Key influencers for success
• People
• Process
• Technology
4. 10 key considerations to build a sustainable and robust data lake on the cloud
• Organizational point of view
• Value delivered
• Foundational capabilities
• Tooling
• Validation
• Lock-ins
• Expansion plan
• Speed of change
• ROI and TCO
• Measures and metrics
5. How can Impetus help?
• Our cloud competencies
• Customer success stories

Why are businesses creating their data lakes on the cloud?

Reports reveal that the data lakes market is growing at a CAGR of 27.8% and is expected to reach $12.01 billion by 2024 [1]. Enterprises are creating their data lakes on the cloud to take advantage of advanced capabilities for discovering new information models and diverse query capabilities. Cloud-based data lakes help them avoid upfront hardware investments for complex computing and analytics, which is why they are gaining popularity.

The worldwide public cloud service market will grow to $331.2 billion by 2022 [2].

Business benefits

AGILITY

The agile nature of the cloud helps enterprises rapidly deploy new products and solutions, quickly test new ideas, and become more responsive to changes in the market.

SCALABILITY

The cloud enables enterprises to scale resources up or down based on computing and infrastructure needs.

SECURITY

Cloud service providers adhere strictly to high-level security protocols to ensure data protection. In addition to security audits, the cloud has layered security consisting of data encryption, key management, strong access controls, and security intelligence.

ECONOMICS

Enterprises can save substantial capital and operating costs on equipment, infrastructure, and software by moving the data lake to the cloud. Moreover, with a pay-as-you-go, subscription-based cost structure, the cloud is more cost-effective than on-premise models, given careful planning and execution.

RELIABILITY

With replication and availability across multiple geographic locations, data lakes on the cloud are robust and reliable. The cloud offers redundancy at the region, cross-region, and cross-country levels, enabling users to create highly available and fault-tolerant applications, along with the ability to define effective data recovery strategies.

ADVANCED ANALYTICS

Enterprises miss potential opportunities because of a lack of accessibility to data and a single source of truth. Cloud-based data lakes enable easy access and advanced analysis of data from multiple touchpoints, making it easier for enterprises to mine data, collaborate, and understand trends better. The cloud's ready-to-use tooling, prototyped solutions, and AI/ML capabilities enable a deeper focus on analytics.

OPERATIONAL EXCELLENCE

The cloud helps to reduce effort, improve business processes, and anticipate system failures. It provides multiple built-in tools for engineers to prepare, operate, and evolve. Users can perform operations as code and generate annotated documentation automatically to get the most current view of a system. They can make small, frequent, reversible changes with ease, and measure and act on a large variety of data points.

What challenges do businesses face while adopting a data lake on the cloud?

• Lack of long-term data governance
• Lack of proper data lake architecture
• Funding and cost
• Lack of planning

Key influencers for success

PEOPLE

Aligning all stakeholders to common goals

People at every level across the organization, like business and finance managers, budget owners, and strategy stakeholders, need to align on goals based on the business value for budgeting. To ensure successful adoption of the cloud data lake, all stakeholders across functions like finance, program, engineering, IT, and management must have a holistic view of the journey and understand their responsibilities.

Relevant personnel should be aware of the industry's latest tools, best practices, and reference implementations. Re-skilling and upskilling also need to be continuous processes.

PROCESS

Ensuring strategic planning

A successful data lake shift to the cloud involves effective program and portfolio management, integrating IT governance with organizational governance, and aligning IT and business goals.

CIOs, project managers, enterprise architects, business analysts, and portfolio managers must prioritize KPIs and define new processes in line with these objectives. Additionally, the impact of both existing and new processes must be assessed at appropriate intervals.

Advancement in the cloud has replaced long planning, budgeting, and procurement cycles with agility. Enterprises are adopting agile to start small, fail fast, and recover quickly. Leveraging the newest and most potent cloud tools is the key to driving automation-led efficiencies and delivering business outcomes.

TECHNOLOGY

Making smarter choices

Cloud has become a catalyst in the evolution of technology. Cloud vendors are experimenting to make the cloud more adaptable with more choices and zero lock-ins.

While the basic design principles of a data lake will still hold, it is crucial to understand that the solutions need to be more agile to take advantage of technologies that become available in the future.

Similarly, it is crucial to ensure that the organization has people with cloud skills, as traditional on-premise skills would not suffice.

For example, to provision data lake storage, one would use cloud-native object/file-based storage, which is different from provisioning SAN, NAS, etc. and requires a new set of skills.

Additional Resources

Learn about the keys to formulating an effective blueprint for DW transformation to the cloud and the best practices for driving tangible business outcomes through cloud-scale analytics.

10 key considerations to build a sustainable and robust data lake on the cloud

1 Organizational point of view

Different people within the organization have different objectives and needs. For example, program managers, portfolio managers, and Chief Information Security Officers (CISOs) will focus on data security and compliance-related governance of the data lake. In contrast, business and finance managers will focus on maximizing the business value of investments in creating and hydrating the data lake.

Therefore, it is essential to understand, align, and address the needs of various stakeholders to have a common enterprise goal. It is important to understand the enterprise data strategy and align with the security, compliance, and governance needs to build a robust, sustainable, and secure data lake.

To realize the benefits of the cloud, enterprises need to adopt new processes and policies. For example, finance managers need to budget as per the cloud consumption model, configure chargeback models for incoming and outgoing data, and keep track of operational expenses to assess ROI. Effective chargeback models enable users to trace consumption to a granular level and manage costs more effectively. For example, AWS Organizations can be configured to manage enterprise-level cloud expenditure centrally.

This, coupled with project/resource/user-level tagging and CloudWatch alerts, can help gather both aggregated and granular consumption details. AWS SDK Cost APIs can further be leveraged to build custom reports or analytics.

Similarly, in GCP, the Organization resource provides central visibility and control over all Google Cloud resources further down in the hierarchy. Enterprises can monitor Google Cloud spend against the planned budget and automate cost control responses based on budget notifications using Cloud Billing. A GCP Cloud Function can be leveraged to call the GCP Billing API and decide whether to cap overall expenses.

For Azure, enterprises can use Azure Cost Management + Billing to set spending thresholds, proactively apply data analysis to costs, and identify opportunities for workload changes to optimize spend.
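The chargeback model described for the organizational point of view can be sketched in a few lines. The example below is a minimal illustration, not any provider's API: the record layout, the tag keys (`team`, `project`), and the budget figures are all assumptions made up for this sketch. In practice the records would come from a provider's billing export.

```python
from collections import defaultdict

# Hypothetical usage records. Field names and values are assumptions,
# standing in for rows of a cloud billing export with resource tags.
USAGE_RECORDS = [
    {"service": "object-storage", "cost_usd": 120.0,
     "tags": {"team": "finance", "project": "lake-ingest"}},
    {"service": "compute", "cost_usd": 340.5,
     "tags": {"team": "marketing", "project": "churn-model"}},
    {"service": "compute", "cost_usd": 89.5,
     "tags": {"team": "finance", "project": "lake-ingest"}},
    {"service": "network-egress", "cost_usd": 15.0, "tags": {}},
]


def chargeback(records, tag_key):
    """Sum cost per value of `tag_key`; untagged spend is grouped separately."""
    totals = defaultdict(float)
    for record in records:
        owner = record["tags"].get(tag_key, "untagged")
        totals[owner] += record["cost_usd"]
    return dict(totals)


def over_budget(totals, budgets):
    """Return owners whose spend exceeds their budget (no budget means no limit)."""
    return sorted(
        owner for owner, spent in totals.items()
        if spent > budgets.get(owner, float("inf"))
    )


totals_by_team = chargeback(USAGE_RECORDS, "team")
flagged = over_budget(totals_by_team, {"finance": 200.0, "marketing": 500.0})
```

Grouping untagged spend under its own key makes gaps in the tagging policy visible, which is usually the first thing a finance manager needs to see.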

2 Value delivered

Once the goals are aligned, it is crucial to prioritize your cloud adoption strategy to ensure it adds value to the organization from day one.

Assess the plan at every phase to identify gaps. Enterprises need to be flexible to adopt new technologies and consider data lake accelerators, data warehouse modernization solutions, and self-service analytics tools available on the cloud marketplace to accelerate the cloud adoption journey.

The diagram below depicts a typical change management lifecycle:

[Figure: Change management lifecycle]

3 Foundational capabilities

Enterprises tend to focus on building core functionalities and de-prioritize the important non-functional aspects. While building a data lake on the cloud, focus on building foundational capabilities and extending them as a part of the overall governance model.

Security should be a priority at both the infrastructure and data levels. With legacy data warehouses, security was rarely a primary concern because data did not leave the firewall. However, a data lake on the cloud demands 360-degree, ironclad security. Many enterprises still refrain from migrating to the cloud, considering it to be vulnerable to attacks and hacks. In reality, cloud providers offer secure, systematic, and dedicated services for on-the-fly detection and prevention of security breaches.

Additional Resources

While the cloud offers unmatched speed, flexibility, and cost savings, security remains a major concern.

Learn about the key pillars of cloud security and how a holistic approach can help enterprises protect the confidentiality, integrity, and availability of their data.

For instance, both AWS and GCP offer a well-thought-out security model covering key aspects of platform and data security, as detailed below:

Google Cloud Platform Services

• Compute: Compute Engine, Cloud Functions
• Networking: Cloud Virtual Network
• Identity & Security: Cloud IAM, Key Management Service
• Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub
• Storage & Databases: Cloud Storage, Cloud SQL
• Management Tools: Stackdriver Monitoring, Logging, Deployment Manager, Cloud Console, Cloud Shell, Cloud SDK

4 Tooling

Perform an exhaustive assessment to identify all the capabilities and tools required for an end-to-end solution. Some key considerations to select the right tooling are:

• Assess the existing data sources/targets and consider the future before choosing the tool
• Check for cloud-agnostic tools
• Consider tools like Jupyter Notebook that offer automated discovery, AI/ML-powered predictive insight relationship discovery, and business association with technical information. Amazon SageMaker, Google Cloud DataLab, and Azure Notebooks all offer fully managed Jupyter Notebook services.
• Assess the tool's ability to configure CDC or incremental loads, along with all performance aspects
• Pick a tool that offers easy-to-use, drag-and-drop functionality for basic transformations. For example, GCP Cloud Data Fusion provides a visual point-and-click interface enabling code-free deployment of ETL data pipelines.
• Choose a tool that addresses all aspects of data integration: ingestion, curation, quality, cataloging, and governance
• Perform an exhaustive evaluation to ensure that the cloud-native service provider addresses all your requirements, and ensure that the chosen tool integrates with the cloud of your choice
• When choosing between a cloud-native service and an enterprise tool (build vs. buy), ensure that you perform a POC to mitigate business risks

To build a robust data lake on the cloud, identify and address the challenges first. Plan and design for compliance, identify the right tools and integration endpoints, consider using cloud-agnostic reusable templates, enable logging and monitoring, and understand the costing to have better control over ROI.
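To make the "CDC or incremental loads" consideration concrete, the sketch below shows the simplest form of incremental loading: a timestamp watermark. It is an illustration under stated assumptions, not how any particular tool works. The row layout (`id`, `updated_at`) and the in-memory `target` dictionary are hypothetical, and log-based CDC tools capture changes from the database log rather than by comparing timestamps.

```python
from datetime import datetime


def incremental_load(source_rows, target, watermark):
    """Upsert rows modified after `watermark` into `target`.

    `source_rows` is an iterable of dicts with hypothetical keys
    `id` and `updated_at`; `target` is a dict keyed by row id.
    Returns the new watermark to persist for the next run.
    """
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = row  # insert or overwrite (upsert)
            if row["updated_at"] > new_watermark:
                new_watermark = row["updated_at"]
    return new_watermark


# First run: everything is newer than the initial watermark.
source = [
    {"id": 1, "updated_at": datetime(2020, 1, 1), "value": "a"},
    {"id": 2, "updated_at": datetime(2020, 1, 2), "value": "b"},
]
lake = {}
checkpoint = incremental_load(source, lake, datetime.min)

# Second run: only the row changed since the checkpoint is applied.
source.append({"id": 1, "updated_at": datetime(2020, 1, 3), "value": "c"})
checkpoint = incremental_load(source, lake, checkpoint)
```

When evaluating a tool, it is worth checking where it stores the equivalent of `checkpoint` and how it behaves when a load fails halfway, since those two points determine whether reruns are safe.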

Additional Resources

A successful cloud journey requires a mature foundation built through a well-architected, phased, and incremental approach.

It is important to ensure a seamless transition to the cloud including exploration, migration, application transformation and maintenance.


5 Validation

The key to a successful data lake on the cloud is to start small. Have SMART (specific, measurable, attainable, relevant, and time-based) goals, perform a proof-of-concept before implementation, and extend it once the POC is successful. It will help you recover fast in case of failure, assess and measure success, and make changes quickly without disrupting business.

Some of the best practices for creating an initial version of the data lake on the cloud are:

• Choose simple sources like relational, structured, or semi-structured files
• Limit the number of supported file formats
• Apply necessary standardization and validations
• Apply fewer complex transformations to validate a base case
• Take a step-by-step approach
• Measure each step for compliance, security, and quality
• Iterate the plan to improve, extend, and address more complex scenarios
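Two of the validation practices, limiting supported file formats and applying standardization and validations, can be sketched with the Python standard library alone. This is a minimal example under assumptions: the allowed extensions, the header clean-up rule, and the "required column must be non-empty" check are illustrative choices, not rules prescribed by the e-book.

```python
import csv
import io

# Assumption: a first version of the lake accepts only these formats.
SUPPORTED_FORMATS = (".csv", ".json")


def is_supported(filename):
    """Accept a file only if its extension is on the short supported list."""
    return filename.lower().endswith(SUPPORTED_FORMATS)


def validate_csv(text, required_columns):
    """Standardize header names, then check required columns row by row.

    Returns (valid_rows, errors). Header names are trimmed and lower-cased
    so that "ID" and " Amount" become "id" and "amount".
    """
    reader = csv.DictReader(io.StringIO(text))
    reader.fieldnames = [name.strip().lower() for name in (reader.fieldnames or [])]

    missing = [col for col in required_columns if col not in reader.fieldnames]
    if missing:
        return [], ["missing columns: " + ", ".join(missing)]

    valid_rows, errors = [], []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        empty = [col for col in required_columns if not (row.get(col) or "").strip()]
        if empty:
            errors.append(f"line {line_no}: empty value for {', '.join(empty)}")
        else:
            valid_rows.append(row)
    return valid_rows, errors


rows, problems = validate_csv("ID, Amount\n1,10\n2,\n", ["id", "amount"])
```

Returning the rejected lines as messages, rather than raising on the first failure, lets each step be measured for quality as the list above suggests.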

6 Lock-ins

Assess the level and layer of lock-ins before choosing a cloud provider. Ask questions to decide and devise a forward-looking strategy before you select which layer to lock in: tool, framework, platform, cloud, or solution.

Avoiding lock-ins altogether is difficult. The assessment criteria will also vary depending on the business.

Some questions to ask to assess and decide which layer to lock in:

• Would the data lake be spread across multiple clouds?
• Would there be any requirement for a cloud-agnostic development approach?

Additional Resources

Learn how an e-learning platform operationalized a serverless, secure platform using automated DevOps.

Learn how a digital customer journey experience company enabled 2x faster application deployment with AWS Kubernetes.

7 Expansion plan

Choose a cloud vendor and develop a strategy that is scalable and designed to meet your business expansion goals. Consider the regional compliance requirements of the new territories that you plan to explore and ensure the platform can integrate with other tools and platforms to address future use cases. For example, expanding operations from the US to Europe would require reconsidering compliance strategies.

Some key considerations while building the foundation of a data lake on the cloud are:

• Can the business units onboard new use cases on the existing foundation?
• Does the foundation help to accelerate the journey to the cloud?

8 Speed of change

Enterprises are moving their data lakes to the cloud to embrace speed and agility. To maintain the pace of development and speed of delivery that comes with the cloud, enterprises need to invest in DevOps.

Some factors to consider while investing are as follows:

• Aim for 100% automation in deployment, testing, log analysis, and vulnerability analysis
• Use AI-based capabilities to enhance the effectiveness of the automated processes

Additional Resources

Cloud adoption is essential for implementing digital transformation and meeting customer expectations.

But what are the keys to establishing sustainable data warehousing and analytics on the cloud?

Learn the best practices for driving tangible business outcomes through cloud-scale analytics.

9 ROI and TCO

While planning the migration, enterprises often ignore hidden costs. To calculate and optimize the total cost of ownership (TCO), take all expenses into account.

These include infrastructure cost, service cost, and hidden costs like operations, maintenance, training, licenses, POCs, audit penalties, security breaches, etc.

Other factors to consider for optimizing expenditure:

• Assess soft cost and agility
• Assess how agile infrastructure setup has reduced time-to-market and impacted the overall productivity
• Measure, monitor, and optimize spends at regular intervals to measure the ROI
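The TCO and ROI reasoning above amounts to simple arithmetic once every cost category is listed. The sketch below is a worked example with entirely hypothetical figures; the category names follow the text, but the amounts and the benefit estimate are assumptions chosen only to make the calculation easy to follow.

```python
# Hypothetical annual figures in USD; every number here is an assumption.
VISIBLE_COSTS = {
    "infrastructure": 400_000,
    "services": 150_000,
}
HIDDEN_COSTS = {
    "operations_and_maintenance": 80_000,
    "training": 30_000,
    "licenses": 40_000,
}


def total_cost_of_ownership(*cost_groups):
    """TCO is the sum of every cost category, visible and hidden."""
    return sum(sum(group.values()) for group in cost_groups)


def return_on_investment(benefit, tco):
    """ROI expressed as a fraction of TCO: (benefit - cost) / cost."""
    return (benefit - tco) / tco


tco = total_cost_of_ownership(VISIBLE_COSTS, HIDDEN_COSTS)   # 700,000
naive_tco = total_cost_of_ownership(VISIBLE_COSTS)           # 550,000
estimated_benefit = 910_000  # assumed business value delivered

roi = return_on_investment(estimated_benefit, tco)
naive_roi = return_on_investment(estimated_benefit, naive_tco)
```

With these numbers, ignoring the hidden costs overstates ROI considerably (roughly 65% against 30%), which is the planning error the section warns about.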

10 Measures and metrics

Apart from expenditure, it is also important to assess other metrics and configure all the necessary parameters for measurement.

Some questions that enterprises need to answer to keep the data lake robust and the cost under control are:

• Can the system detect user access details, including purpose, frequency, and duration, and identify unauthorized access?
• Can the system identify security vulnerabilities and fix those without impacting business?
• Is there any unwarranted peak in cost or service usage?
• Does the existing team need reskilling or additional experts?
• Is the data meeting the desired quality standard?

Enterprises can consider a data lake on the cloud as successful when it has a reasonable adoption rate across the organization and is not siloed in the IT department.
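The question about unwarranted peaks in cost or service usage can be answered with a simple statistical check. The sketch below flags a day as a spike when its cost exceeds the mean of the preceding days by more than a chosen number of standard deviations. The window size, the threshold, and the sample figures are assumptions for illustration; a production setup would more likely rely on the cloud provider's own budget alerts.

```python
from statistics import mean, stdev


def find_cost_spikes(daily_costs, window=7, threshold=3.0):
    """Return indexes of days whose cost exceeds
    mean + threshold * stdev of the preceding `window` days."""
    spikes = []
    for day in range(window, len(daily_costs)):
        history = daily_costs[day - window:day]
        limit = mean(history) + threshold * stdev(history)
        if daily_costs[day] > limit:
            spikes.append(day)
    return spikes


# Hypothetical daily spend in USD; day 7 is an obvious outlier.
daily_spend = [100, 102, 98, 101, 99, 100, 103, 250, 101]
spike_days = find_cost_spikes(daily_spend)
```

Comparing each day only with the days before it means the check can run daily on fresh billing data without waiting for the month to close.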

Some measures for the success of a data lake on the cloud:

• Pace of adoption
• ROI and TCO
• Time-to-market
• Innovation
• Footprint

How can Impetus help?

Over the last two decades, Impetus has helped several Fortune 100 enterprises transform their data-driven business with modern ETL, BI, data lakes, and AI/ML.

Leveraging a cloud-first approach and accelerators for adoption and management, Impetus can help you identify and address the challenges of building a data lake on the cloud. We help enterprises accelerate data lake creation (in days versus months), large-scale migration (in months versus years), and management of workloads on the cloud, helping them reduce time-to-market and manage overall costs.

Our cloud competencies

• Advisory, strategy, TCO
• Cloud cost optimization
• Automation and orchestration
• Security and governance
• Workload assessment and transformation
• Capacity planning and DevOps
• Maintenance and administration
• Architecture evaluation
• Cloud infrastructure realization

Customer success stories

A centralized data lake on the cloud enabled a single source of truth with real-time integration of sources

Data lake on AWS resulted in 60% infrastructure cost reduction and improved application performance by rearchitecting for the cloud

Fortune 500 healthcare service provider prevented a legacy data lake renewal with Google Cloud Platform

Reduced time-to-market and operations cost for CDH data lake migration to cloud data lake

A future-proof, centralized data warehouse with Snowflake on Azure

Data ingestion from complex sources like SAP Hana to create a single source of truth

Single source of truth from 500+ data feeds - Fortune 500 firm implemented an enterprise data lake on cloud (AWS)

A scalable, one-click data ingestion solution for data pipelines and use cases with built-in robust security, governance, and metadata management

and many more...

[1] https://www.advancemarketanalytics.com/reports/4219-global-data-lakes-market

[2] https://www.forbes.com/sites/louiscolumbus/2019/04/07/public-cloud-soaring-to-331b-by-2022-according-to-gartner/#225ca1305739

Impetus Technologies focuses on creating a unified, clear, and present view for the intelligent enterprise by enabling data warehouse modernization, unification of data sources, self-service ETL, advanced analytics, and BI consumption. For more than a decade, Impetus has been the 'Partner of Choice' for several Fortune 500 enterprises in transforming their data and analytics lifecycle. The company brings together a unique mix of software products, consulting services, and technology expertise. Our solutions include the industry's only platform for the automated transformation of legacy systems to the cloud/big data environment and StreamAnalytix, a self-service ETL and machine learning platform.

To learn more, visit www.impetus.com or write to [email protected].

© 2020 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies. Nov 2020