
Data Lake Architecture: Delivering Insight and Scale from Hadoop as an Enterprise-Wide Shared Service

Memo from Analytix.is

EXECUTIVE SUMMARY

Business Drivers for Hadoop 2.x

Our whitepaper “Data Lake Business Value” laid out why Apache Hadoop™ 2.x [1] is well suited to enable revenue growth and cost savings across the enterprise. Traditional solutions were not designed to extract full value from the flood of data arriving in the enterprise today. Apache Hadoop 2.x overcomes the limitations of traditional solutions by delivering unprecedented insight and scale. This creates significant value both at the application and infrastructure level:

• Application level: Allows simultaneous access and timely insights for all your users across all your data irrespective of the processing engine. This is possible because Hadoop 2.x allows you to store data first and query it in the moment or later in a flexible fashion.

• Infrastructure level: Allows you to acquire all data in its original format and store it in one place, cost effectively and for an unlimited time. This is possible because Hadoop 2.x delivers 20x cheaper data storage than alternatives.

This step change in effectiveness and efficiency allows you to extract maximum business value from the rapid growth in data volume, variety and velocity.

Business Drivers for a Data Lake

We posit that deploying Hadoop 2.x as an enterprise-wide shared service is the best way of turning data into profit. We call this shared service a data lake. The value created by Hadoop 2.x grows exponentially as data from more applications lands in the data lake. More and more of that data will be retained for decades. For many enterprises, data becomes possibly as important as capital and talent in the quest for profit. Therefore, it is important to future-proof your investments in big data. Even your first Hadoop 2.x project should consider the data lake as the target architecture.

In this white paper, you will learn about the technology that makes the data lake a reality in your environment. We will introduce the modular target architecture for the data lake, detail its functional requirements and highlight how the enterprise-grade capabilities of Hadoop 2.x deliver on these expectations. As more use cases join the data lake, more of the enterprise-grade functionality that the second generation of Hadoop brings comes into focus.

[1] Apache, Apache Hadoop and any Apache projects are trademarks of the Apache Software Foundation.


Data Lake Architecture

Deploying Apache Hadoop 2.x as an enterprise-wide shared service requires more than just the ability to deliver insights and scale. Five enabling capabilities of Hadoop 2.x deliver on the functional requirements of a data lake:

1. Allows a seamless, well-orchestrated flow of data between existing environments and Hadoop 2.x.

2. Implements enterprise-grade security requirements such as authentication, authorization, accountability and encryption.

3. Provides highly automated operations, including multi-tenancy, both natively in Hadoop 2.x and integrated with existing management tools.

4. Runs anywhere, across operating systems, deployment form factors, on-premise and in the cloud.

5. Enables both existing and new applications to provide value to the organization.

Benefits of a Data Lake

Deploying an enterprise-wide Apache Hadoop data lake brings a series of benefits relative to spinning up dedicated clusters for each project. Most importantly, larger questions can be asked, yielding deeper insights, because any authorized user can interact with the pool of data in multiple ways. More data typically leads to better answers. The data lake, as a shared service across the organization, also brings operational benefits that are very similar to a private infrastructure cloud:

• Speed of provisioning and de-provisioning.
• Fast learning curve and reduced operational complexity.
• Consistent enforcement of data security, privacy and governance.
• Optimal capital efficiency.


BUSINESS DRIVERS FOR HADOOP 2.X

Many enterprises are overwhelmed by the variety, velocity and volume of data. Extracting value from that data requires the ability to store data first and query it in the moment or later, in a flexible fashion. Traditional solutions, however, do not provide timely or actionable insights. Questions have to be determined before new data even arrives. Defining schemas on write severely limits your ability to take advantage of the wealth of new data types available. Moreover, traditional solutions do not scale well technically or economically. Data is locked into operational silos of proprietary technology. Retaining all your historical data for analysis is unaffordable. Therefore, enterprises run out of budget long before they run out of useful data.

For the business, extracting value from big data is all about asking the right questions. Often the combination of data sources that is most valuable, and the questions that yield real insights, only become apparent after multiple rounds of exploration and iteration. In other words, questions are not always known in advance, and the data to answer these questions can be of varying types, likely including unstructured data. What the business requires is the ability to store data first and query it in the moment or later, in a flexible fashion. Apache Hadoop 2.x excels at doing exactly that. It is the open source technology at the center of the big data revolution. Apache Hadoop 2.x has become the leading platform for capturing, storing, managing and processing vast quantities of structured and unstructured data in a cost-efficient and scalable manner, unleashing analytic opportunities to grow revenue and reduce cost.

Apache Hadoop 2.x overcomes some of the challenges of traditional solutions by delivering unprecedented insight and scale. Hadoop 2.x allows you to store data first and query it in the moment or later in a flexible fashion. The key reason Hadoop 2.x allows you to capture more data is that the marginal cost of retaining data is less than its marginal value. In fact, Hadoop provides 20x cheaper data storage than alternatives. That results in order-of-magnitude better insights and business value compared to legacy solutions. This step change in effectiveness and efficiency completely transforms the value of your data, allowing you to take full advantage of the rapid growth in data volume, variety and velocity. Exceptionally powerful analyses are now possible, whether using existing tools or entirely new applications. Therefore, only Hadoop 2.x is able to provide distributed storage and processing that is…

• Affordable and performing well into the 100+ petabyte scale.
• Flexible enough to deal with increasing data variety and velocity.


BUSINESS DRIVERS FOR A DATA LAKE

We posit that deploying Hadoop 2.x as an enterprise-wide shared service is the best way of turning data into profit. We call this shared service a data lake. The attributes of a data lake are:

• Delivers maximum scale and insight to the entire enterprise.
• Allows authorized users across all business units to refine, explore and enrich data.
• Contains data from transactions, interactions and observations.
• Enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.

Many enterprises are making Hadoop 2.x available throughout the organization in a very thoughtful manner to avoid the fragmentation of the legacy world. Instead of deploying one Hadoop 2.x cluster per project, they are moving to Hadoop 2.x as an enterprise-wide shared service for all projects, all analytic workloads and all business units. Clearly it does not make sense for an enterprise to have one cluster per team, or one cluster for each of Storm, HBase, MapReduce and Lucene. In other words, these customers are avoiding data ponds and moving toward a unified data lake.

The value created by Hadoop 2.x grows exponentially as data from more applications lands in the data lake. More and more of that data will be retained for decades. For many enterprises, data becomes possibly as important as capital and talent in the quest for profit. Therefore, it is important to future-proof your investments in big data. Even your first Hadoop 2.x project should consider the data lake as the target architecture. In this white paper, you will learn about the technology that makes the data lake a reality in your environment. We will introduce the modular target architecture for the data lake, detail its functional requirements and highlight how the enterprise-grade capabilities of Hadoop 2.x deliver on these expectations.


DATA LAKE ARCHITECTURE

Hadoop 2.x creates significant value both at the application and infrastructure level:

• Application level: Allows simultaneous access and timely insights for all your users across all your data irrespective of the processing engine.

• Infrastructure level: Allows you to acquire all data in its original format and store it in one place, cost effectively and for an unlimited time, subject only to constraints imposed by your compliance policies.

Two essential capabilities of Hadoop make this possible:

1. Infrastructure-level data management

Storage: Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers. In production clusters, HDFS has demonstrated scalability of up to 200 PB of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks. In other words, HDFS provides distributed, redundant storage. It is now in its second major release, called HDFS2. HDFS is also able to distinguish between different storage media.

Processing operating system: Apache Hadoop YARN is the data operating system for Hadoop 2.x. YARN enables a user to interact with all data in multiple ways simultaneously, making Hadoop 2.x a true multi-use data platform. In essence, YARN is the operating system responsible for resource management in the cluster. YARN plays a fundamental role in the data lake since it achieves two objectives: scheduling resource usage across a cluster and accommodating any data processing engine.

2. Data access for a variety of applications

As we said above, Hadoop 2.x allows simultaneous access and timely insights for all your users across all your data, irrespective of the processing engine or analytical application. In other words, one pool of data can be analyzed for many different purposes using a wide range of processing engines such as batch, interactive, streaming, search, graph, machine learning, and NoSQL databases. Much more than just a stagnant pool of stored data, the Hadoop 2.x data lake comes alive through analytics. One important reason these processing engines allow enterprises to ask larger questions and obtain fresher insights is that semantic schemas are only defined on read. Any and all data can flow into the lake without the upfront effort that would be required in a legacy solution to define which questions will be asked of the data. With Hadoop 2.x, enterprises can “store first, ask questions later”. More recently, customers can even “store first, ask questions in the moment”, as the second generation of Hadoop enables low-latency queries.
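
To make “store first” concrete at the infrastructure level, landing data in its original format is just a copy into HDFS. The sketch below uses the HDFS FileSystem API in Java; the cluster configuration, directory layout and file names are illustrative assumptions rather than part of the reference architecture:

    // Minimal sketch: landing a raw file in the data lake via the HDFS FileSystem API.
    // Paths and file names are hypothetical; the cluster address comes from core-site.xml.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LandRawData {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            // Copy a local log file into the lake in its original format; no schema is defined up front.
            fs.copyFromLocalFile(new Path("/tmp/clickstream-2014-05-01.log"),
                                 new Path("/data/raw/clickstream/2014/05/01/clickstream.log"));
            fs.close();
        }
    }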


This second essential capability is enabled by the following Apache modules at the data access layer:

Metadata management: HCatalog is a table management layer that exposes SQL metadata to a wide range of Hadoop 2.x applications and tools, thereby enabling users with different data processing tools (such as Pig and MapReduce, among others) to more easily read and write data on the data lake. HCatalog's table abstraction presents users with a relational view of data in HDFS and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from ORCFile or RCFile formats, JSON (text) files, CSV files and sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables' metadata.

Batch processing: Apache Hadoop MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.

Interactive processing: Apache Tez. While Apache Hadoop MapReduce has been the batch-oriented data processing backbone for Hadoop 2.x, Tez enables projects in the Apache Hadoop ecosystem such as Apache Hive and Apache Pig, as well as commercial solutions originally built on Apache Hadoop MapReduce, to meet the demands for fast response times and extreme throughput at petabyte scale. Together with HDFS2, YARN and MapReduce described above, Tez extends the speed of data processing that is possible from the core of Apache Hadoop.

SQL: Apache Hive is a data warehouse built on top of Apache Hadoop for providing data summarization, ad-hoc queries, and analysis of large datasets. Hive provides a mechanism to project structure onto the data in Hadoop 2.x and to query that data using standard SQL syntax, easing integration between Hadoop 2.x and mainstream tools for business intelligence and visualization.

Scripting: Apache Pig is a scripting platform for processing and analyzing large data sets. Apache Pig allows customers to write complex data processing transformations using a simple scripting language. Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort. Pig translates the Pig Latin script into Apache Hadoop MapReduce jobs (and soon Apache Tez jobs) so that it can be executed within Hadoop. Pig is ideal for standard extract-transform-load (ETL) data pipelines, which enables the enterprise data warehouse rightsizing discussed below. It is also used for research on raw data and iterative processing of data.

Online database: Apache HBase is a non-relational (NoSQL) database that runs on top of HDFS. It is columnar and provides quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop 2.x, allowing users to conduct updates, inserts and deletes. Because of this capability, it is often used for online applications.
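
Returning to the raw clickstream files landed in the earlier sketch: with the SQL access described above, asking a question is a matter of projecting a schema onto those files at read time. A hedged example using the HiveServer2 JDBC driver follows; the connection URL, credentials, table name and column names are illustrative assumptions:

    // Hedged sketch: schema on read via Hive's JDBC interface (HiveServer2).
    // Host, port, credentials, table and column names are hypothetical.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class QueryClickstream {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
            Statement stmt = conn.createStatement();
            // Project a schema onto the raw files already sitting in the lake (schema on read).
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS clickstream "
                    + "(ts STRING, user_id STRING, url STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                    + "LOCATION '/data/raw/clickstream'");
            ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, COUNT(*) AS hits FROM clickstream GROUP BY user_id");
            while (rs.next()) {
                System.out.println(rs.getString("user_id") + "\t" + rs.getLong("hits"));
            }
            conn.close();
        }
    }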


Online database: Hortonworks also supports Apache Accumulo, a sorted, distributed key-value store with cell-based access control. Cell-level access control is important for organizations with complex policies governing who is allowed to see data. It enables the intermingling of different data sets with different access control policies and proper handling of individual data sets that have some sensitive portions.

Streaming: Apache Storm is a distributed system for quick processing of large streams of data in real time.

Machine learning: Apache Mahout is a library of scalable machine-learning algorithms. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed. Once data is stored on HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.
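
As a small illustration of how the streaming engine described above plugs into the lake, the hedged sketch below wires up a Storm topology in Java. ClickStreamSpout and SessionizeBolt are hypothetical placeholder classes you would implement yourself, and the org.apache.storm package names correspond to more recent Storm releases (earlier releases used the backtype.storm namespace):

    // Hedged sketch: submitting a simple Storm topology that sessionizes a click stream.
    // ClickStreamSpout and SessionizeBolt are hypothetical placeholder classes.
    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class ClickstreamTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("clicks", new ClickStreamSpout(), 2);      // reads raw click events
            builder.setBolt("sessionize", new SessionizeBolt(), 4)
                   .shuffleGrouping("clicks");                          // groups events into sessions
            Config conf = new Config();
            conf.setNumWorkers(2);
            StormSubmitter.submitTopology("clickstream-sessionizer", conf, builder.createTopology());
        }
    }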

But deploying Apache Hadoop 2.x as an enterprise-wide shared service requires more than just the ability to deliver insights and scale. Five enabling capabilities of Hadoop 2.x deliver on the functional requirements of a data lake. These enabling capabilities create the trust that is needed for Hadoop 2.x to become the standard platform for enterprise data. We will refer to this target architecture by the term data lake.

1. Data integration & governance

Batch ingestion: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres and HSQLDB.

API access to Hadoop: WebHDFS defines a public HTTP REST API, which permits clients to access Hadoop 2.x from multiple languages without installing Hadoop 2.x.

NFS access to Hadoop: With NFS access to HDFS, customers can mount the HDFS cluster as a volume on client machines and use native command line tools, scripts or a file explorer UI to view HDFS files and load data into HDFS. NFS thus enables file-based applications to perform file read and write operations directly to Hadoop 2.x.

Streaming: Apache Flume is a service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS.
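
As an example of the WebHDFS API access described above, the hedged sketch below lists a directory in the lake over plain HTTP. The NameNode host and path are illustrative, port 50070 is the Hadoop 2.x default for the NameNode web interface, and simple authentication via the user.name parameter is assumed rather than Kerberos:

    // Hedged sketch: listing files in the lake through the WebHDFS REST API.
    // Host, port, path and user name are hypothetical; the response is JSON (FileStatuses).
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ListRawClickstream {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://namenode.example.com:50070/webhdfs/v1"
                    + "/data/raw/clickstream?op=LISTSTATUS&user.name=analyst");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);   // prints the JSON directory listing
                }
            }
        }
    }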


Real-time ingestion: Apache Storm is a distributed system for processing fast, large streams of data in real time (see above).

Data workflow and lifecycle: Apache Falcon is a framework that enables users to automate the movement and processing of datasets for ingest, pipelines, disaster recovery and data retention use cases. Crucially for the data lake, Falcon enables import, scheduling and coordination, replication, lifecycle policies, SLA and multi-cluster management, and compaction. Falcon will in the near future also address lineage and audit trails, implemented via a time-series traversal of logs and data. These capabilities are complemented by data integration tools, for instance connectors enabling data transfer between HDFS and enterprise data warehouses, in-memory databases, relational databases, scale-out relational databases and NoSQL databases.

2. Security

Hadoop 2.x supports enterprise-grade security requirements such as authentication, authorization, accountability, and data protection / encryption. Multi-tenancy is also supported and is a core security requirement. Given the prominent role of security in discussions around any shared service in IT, we will go into a bit more detail here than on some of the other capabilities. We discuss multi-tenancy in the section on Operations below.

Authentication and authorization: The Knox Gateway (“Knox”) is a single security perimeter for REST/HTTP access to one or more Hadoop clusters. It provides secure authentication and authorization in a way that is fully integrated with enterprise and cloud identity management environments.

Authentication verifies the identity of a system or user accessing the system. Hadoop 2.x provides two modes of authentication. The first, simple or pseudo authentication, essentially places trust in the user's assertion about who they are. The second, Kerberos, provides a fully secure Hadoop cluster. In line with best practice, Hadoop 2.x provides these capabilities while relying on widely accepted corporate user identity stores (such as LDAP or Active Directory), so that a single source can serve as the credential catalog across Hadoop 2.x and existing systems.

Authorization specifies access privileges for a user or system. Hadoop 2.x provides fine-grained authorization via file permissions in HDFS and resource-level access control (via ACLs) for Apache Hadoop MapReduce, as well as coarser-grained access control at the service level. For data, HBase provides authorization with ACLs on tables and column families, and Accumulo extends this even further to cell-level control. Apache Hive provides coarse-grained access control on tables.
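
To illustrate the HDFS file permissions mentioned above, the hedged sketch below assigns an owning user, a group and POSIX-style permissions to a dataset directory. The account, group and path names are hypothetical, and the calls would typically be issued by an HDFS administrator:

    // Hedged sketch: restricting a dataset to its owning service account and an analyst group.
    // User, group and path are hypothetical; setOwner normally requires HDFS superuser rights.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class SecureDataset {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dataset = new Path("/data/refined/customer_360");
            fs.setOwner(dataset, "etl_svc", "marketing_analysts");       // owning user and group
            fs.setPermission(dataset, new FsPermission((short) 0750));   // owner rwx, group r-x, others none
            fs.close();
        }
    }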


Accounting provides the ability to track resource use within a system. Within Hadoop 2.x, insight into usage and data access is critical for compliance or forensics. As part of core Apache Hadoop 2.x, HDFS and MapReduce provide base audit support. Additionally, the Apache Hive metastore records audit (who/when) information for Hive interactions. Finally, Apache Oozie, the workflow engine, provides an audit trail for services.

Data protection ensures privacy and confidentiality of information. HDP provides encryption for various channels such as Remote Procedure Call (RPC), HTTP, JDBC/ODBC and the Data Transfer Protocol (DTP) to protect data in motion. HDFS also supports operating system-level encryption. Current capabilities include encryption with SSL for the NameNode and JobTracker, and data encryption for HDFS, HBase and Hive. In the near future, HDP will support wire encryption for Shuffle, HDFS data transfer and JDBC/ODBC access to Hive. As Hadoop 2.x evolves, so do the solutions that support enterprise security requirements. Much of the focus is centered on weaving the security frameworks together and making them even simpler to manage.

3. Operations

Hadoop 2.x supports highly automated operations, whether natively in Hadoop or integrated with existing management tools. In a data lake architecture, the data as well as the underlying infrastructure are shared across the enterprise. This shared infrastructure pool can span different data centers and is managed by a unified data management “operating system.” The data in this pool is accessible to all authorized users across the enterprise through advanced multi-tenancy features. Every department or user in the organization connects via their own sandbox, with dedicated SLAs and differentiated views of the data, which are defined by data protection rules.

Provisioning, management, monitoring: Apache Ambari is a 100-percent open source operational framework for provisioning, managing and monitoring Apache Hadoop 2.x clusters. Ambari includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters. The APIs provide connectivity to market-leading management tools including Microsoft System Center, Teradata Viewpoint and OpenStack.

Scheduling: Apache Oozie is a Java web application used to schedule Apache Hadoop jobs on a recurring basis. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop 2.x stack and supports Hadoop jobs for Apache Hadoop MapReduce, Apache Pig, Apache Hive and Apache Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts.
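
On the scheduling side, the hedged sketch below submits a workflow through the Oozie Java client API. The Oozie server URL, the HDFS path of the workflow application and the cluster endpoints are illustrative assumptions:

    // Hedged sketch: submitting an ETL workflow via the Oozie client API.
    // The Oozie server URL, HDFS application path and cluster endpoints are hypothetical.
    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/etl/refine-clickstream");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("jobTracker", "resourcemanager:8032");   // YARN ResourceManager in Hadoop 2.x
            String jobId = oozie.run(conf);
            System.out.println("Submitted workflow " + jobId + ", status: "
                    + oozie.getJobInfo(jobId).getStatus());
        }
    }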


Multi-tenancy - Apps: YARN. A multi-tenant architecture shares core resources while isolating services and data. Apache Hadoop YARN is a multi-application, multi-workload, general-purpose data operating system which enables multi-tenancy. One important feature of YARN that avoids contention between workloads is the Capacity Scheduler. The Capacity Scheduler assigns minimum guaranteed capacities to users, applications or entire business units sharing a cluster. At the same time, if the cluster has some idle capacity, users and applications are allowed to consume more of the cluster than their guaranteed minimum share, maximizing overall cluster utilization. This approach is superior to both the first-in-first-out (FIFO) job scheduler and the Fair Scheduler found in other Hadoop distributions, since the latter can result in catastrophic cluster failures (for instance due to user error) whereby all workloads are stuck waiting rather than running.

Multi-tenancy - Data: HDFS. HDFS is a scalable, flexible and fault-tolerant data storage system that enables data of varying types to co-exist in the same architecture. HDFS provides mechanisms for data access control and storage capacity management to enable multiple tenants to manage data in the same physical cluster. HDFS utilizes a familiar POSIX-based permission model to assign read and write access privileges for each dataset to a set of users. This enables strong data access isolation across datasets in the data lake. HDFS also provides quota management functionality to set limits on the storage capacity used by tenants' datasets. This ensures that all tenants in the system get their share of storage space in the shared storage system.

High availability & disaster recovery: YARN, HDFS, Falcon. YARN enables re-use of key platform services for reliability, redundancy and security across multiple workloads. The raw storage layer in HDFS is fully distributed and fault-tolerant. Prior to Hadoop 2.x, however, the file system metadata was stored in a single master server called the NameNode. Hadoop 2.x provides hot standby, automatic failover and file system journaling for the NameNode. This improves overall service availability, both for planned and unplanned downtime of the NameNode. It also eliminates the need for external HA frameworks and highly available external shared storage for storing journals. Disaster recovery is orchestrated via Falcon and leverages the capability of distcp, which distributes a copy from one cluster to another. Vendors like WANdisco complement this.

4. Environment and deployment model

Enterprises evaluating a Hadoop distribution should look at the broadest range of deployment options for Hadoop, from Windows Server or Linux to virtualized cloud deployments. A portable Hadoop distribution allows you to easily and reliably migrate from one deployment type to another. It also delivers choice and flexibility irrespective of physical location, deployment form factor or storage medium, augmenting your current operational data stores.

5. Presentation and application

Hadoop 2.x is of value to both existing and new applications. It allows all authorized users in the organization to ask larger questions and get deeper insights using familiar business intelligence and data warehouse solutions.


At the presentation layer, it is easy for business users and analysts to access and visualize all the data via familiar tools ranging from Microsoft Excel to Tableau. In Excel, for instance, you simply set the data source to Hadoop. At the application layer, it is easy for developers to write applications that take advantage of one enterprise-wide shared pool of data, the data lake. Jointly, these five categories of enabling capabilities form the reference architecture for the data lake.

BENEFITS

Hadoop 2.x is the foundation for the big data and analytics revolution that is unfolding, because of its superior economics and transformational capabilities compared to legacy solutions. Enterprises moving away from fragmented data storage and analytics silos stand to realize numerous benefits:

Deeper insights. Larger questions can be asked because all of the data is available for analysis, including all time periods, all data formats and all sources. And semantic schemas are only defined at the time of analysis. The combinatorial effect of analyzing all data produces a much deeper level of insight than what is available in the stove-piped legacy world. Having a single 360-degree view of the customer relationship across time, interaction channels, products and organizational boundaries is a perfect example.

Actionable insights. Hadoop 2.x enables closed-loop analytics by bringing the time to insight as close to real time as possible.

The data lake, as a shared service across the organization, also brings benefits that are very similar to a private infrastructure cloud:

Agile provisioning and de-provisioning of both capacity and users. With the data lake, your business users can be up and running in minutes. They simply tap into the pool of available capacity, instead of setting up a dedicated Hadoop cluster for every project. Moreover, should a project no longer be required, the capex is not stranded, since the physical assets can be repurposed. This means no minimum investment is required to get started. Likewise, on-boarding a new user is a matter of minutes, both operationally and skills-wise. The single infrastructure pool works seamlessly with a range of BI and analytics tools. For a new user, this means they may already have different analytic lenses configured for them the moment they log in for the first time. And they can use Microsoft Excel as a familiar tool, simply selecting Hadoop as the data source.

Faster learning curve and reduced operational complexity. Hadoop is a new topic for many IT professionals. Many organizations have found it beneficial to create a Hadoop center of excellence, forming either a central pool of expertise or a coordinated set of teams that can easily learn from each other. A data lake provides the natural setting for this to happen.


Having dedicated technical and organizational Hadoop silos for every project would multiply the effort of supporting and managing Hadoop and of ensuring currency with one release across the enterprise. Interoperability with existing environments would suffer due to additional testing and integration, notably since not every project would run on the latest stable release. A data lake, on the other hand, avoids inconsistent approaches and a skills mismatch between projects.

Consistent data security, privacy and governance. Physically distinct silos would make it very hard in practice to enforce policies across boundaries, even within one large organization.

Optimal capital efficiency, driven by scale and the ability to load balance. The opex and capex required to deploy and manage several small ponds of Hadoop are higher than those for one big lake of identical capacity. Not all Hadoop jobs are busy at the same time, and some may work on the same data sets. Therefore, a common pool of capacity can be smaller than capacity spread over several smaller ponds. Silos will show low average utilization. In one lake, data can be de-duplicated.

CONCLUSION

To summarize, a data lake allows you to…

• Store every shred of data throughout the enterprise in original fidelity.
• Provide access to this data across the organization.
• Explore, discover and mine your data for value using multiple engines.
• Secure and govern all data in accordance with enterprise policies.

This is enabled by the core value proposition of Hadoop 2.x:

• Performance scales in linear fashion due to its distributed compute and storage model, making exabyte-scale possible.

• Cost is 20x lower than traditional solutions due to open source software and industry standard hardware.

• Integration with your existing environment makes it easy to ingest, analyze and visualize data, and to develop applications that take advantage of Hadoop 2.x.