Optimize Your Data Warehouse with Hadoop

DESCRIPTION

Enterprise IT teams face enormous challenges as terabytes of new data flow into analytic systems on a daily basis. Very large data warehouses, from hundreds of terabytes to a petabyte, are becoming commonplace. How can your organization optimize its data warehouse and contain costs?

TRANSCRIPT
Optimize Your Data Warehouse with Hadoop
Introduction

Gartner predicts that enterprise data will grow 650 percent over the next five years. IDC claims the entire world’s data volume doubles every two years. EMC estimates that its customers’ storage will grow 100x by 2020, with more than 100,000 customers storing and managing over a petabyte of data. This data tsunami represents potential competitive value for enterprises but presents new challenges for IT and business leaders.
The challenge to enterprise IT teams is enormous and growing exponentially as terabytes of new data flow into analytic systems on a daily basis. Data warehouses at the hundreds-of-terabytes and petabyte scale are becoming commonplace. New technologies and approaches have been developed under the “Big Data” umbrella to help enterprises leverage the potential of the data tsunami. One of these, Hadoop, has emerged as a cost-effective platform for processing and storing “Big Data.”
Estimated costs for 30TB of additional data warehouse infrastructure are $2-$4 million. In addition, a large amount of computing resources is often consumed by batch processes loading data, much of which is infrequently used, inactive or dormant.
- Data growth is exploding to hundreds of terabytes in data warehouses
- Every 30TB of growth costs $2-$4 million in incremental infrastructure
- Unused or infrequently used data is unnecessarily maintained on expensive engineered systems
- Valuable capacity is consumed by unnecessary and inefficient batch load processes
Enterprises are adopting Hadoop as a complement to the Enterprise Data Warehouse to control rising infrastructure costs and enable new analytic capabilities. The benefits of this approach are to:

- Lower the cost of data management
- Extend the capacity of existing data warehouse infrastructure
- Retain all data for user access and analytics
- Improve the query performance of the data warehouse
Use Hadoop for inactive and dormant data

Data warehouses are getting bloated with hundreds of terabytes of data, rapidly consuming the capacity of existing infrastructure and requiring organizations to spend millions on upgrades. Yet a large amount of this data is inactive, unused and dormant, while still consuming significant capacity on expensive engineered data warehouse systems.

Source: Appfluent Visibility Report (www.appfluent.com)

Organizations can significantly reduce the infrastructure costs of data warehousing by identifying inactive data and offloading it onto inexpensive Hadoop clusters. Since most organizations are mandated to keep several years’ worth of historical data, Hadoop offers a true “active archive” by enabling users to quickly and easily access data on an ongoing basis without unnecessarily consuming capacity on expensive data warehouse systems. A large financial organization, for example, eliminated over 100 terabytes of data from a data warehouse system, saving over $15 million in infrastructure, by deploying a Hadoop cluster that stores the inactive data at a fraction of the cost of adding capacity to the existing system.

Another source of the data deluge is the regulatory and compliance mandates that require organizations to maintain several years of historical data. A large amount of historical data is seldom accessed or used. However, identifying usage patterns based on the dates used in queries can be nearly impossible to do manually.
By automating the analysis of the calendar dates users reference when querying data, organizations can tier historical data more efficiently.

Source: Appfluent Visibility Report (www.appfluent.com)
By moving unused or infrequently accessed historical data to Hadoop, enterprises can extend the capacity of the existing data warehouse and provide access to the historical data on more cost-effective Hadoop clusters.

Appfluent Visibility, a software solution from Appfluent Technology, enables organizations to monitor data usage across their data warehouses and analyze the dates used in queries, either explicitly or via date-dimension table lookups.
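The tiering idea above can be sketched in a few lines. This is a minimal, hypothetical example (not Appfluent's actual method): assume a query log has already been reduced to pairs of (table, latest calendar date referenced by the query's filters). A table whose queries never reach past a cutoff date is a candidate for offloading to Hadoop.

```python
from collections import defaultdict
from datetime import date

# Hypothetical reduced query log: (table, latest calendar date each query touched)
query_log = [
    ("sales_fact", date(2014, 6, 1)),
    ("sales_fact", date(2014, 5, 15)),
    ("orders_hist", date(2010, 2, 1)),
    ("orders_hist", date(2009, 11, 3)),
]

# Data never queried past this date is treated as dormant (assumed policy cutoff)
ARCHIVE_CUTOFF = date(2013, 1, 1)

# Most recent date each table's queries actually reached
latest_ref = defaultdict(lambda: date.min)
for table, ref_date in query_log:
    latest_ref[table] = max(latest_ref[table], ref_date)

# Tables whose queried date ranges stop before the cutoff are offload candidates
candidates = sorted(t for t, d in latest_ref.items() if d < ARCHIVE_CUTOFF)
print(candidates)  # → ['orders_hist']
```

In practice the query log would come from the warehouse's own audit tables or a monitoring tool, and the cutoff would reflect the organization's retention and service-level policies.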
Store and access infrequently used data on Hadoop
In addition to identifying inactive data, organizations should analyze how all data is being used in a data warehouse to determine which workloads, and their associated data, can be moved to inexpensive Hadoop clusters. As data volumes continue to explode and business users clamor for access to more data, IT teams need to understand how users interact with data: which business information is most relevant and should be kept optimized and readily available on the data warehouse, and which can be moved to inexpensive Hadoop clusters.
Source: Appfluent Visibility Report (www.appfluent.com)

For example, much of the data in data warehouses is used on a monthly or quarterly basis and is processed as batch jobs. These workloads and their associated data are ideal for processing on Hadoop at a fraction of the cost of running them on the data warehouse. Additionally, IT organizations should understand data utilization so they can focus on optimizing the most business-relevant data, maximizing existing data warehouse investments. Data usage insights lead to better alignment of business and IT while reducing the costs of storage and data management.
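The workload triage described above can be illustrated with a small sketch. This is an assumed, simplified heuristic (job names and the twelve-runs-per-year threshold are made up for the example): jobs that run monthly or less often are batch-style and are candidates for Hadoop, while frequently run jobs stay on the warehouse.

```python
from collections import Counter
from datetime import date

# Hypothetical workload log: (job_name, run_date) over roughly one year
runs = [("daily_dashboard", date(2014, 1, d)) for d in range(1, 31)]
runs += [("quarterly_rollup", date(2014, m, 1)) for m in (1, 4, 7, 10)]

runs_per_job = Counter(job for job, _ in runs)

def classify(job):
    # At most one run per month over the window → batch-style, Hadoop candidate
    return "hadoop_candidate" if runs_per_job[job] <= 12 else "keep_on_warehouse"

tiering = {job: classify(job) for job in runs_per_job}
print(tiering)
# → {'daily_dashboard': 'keep_on_warehouse', 'quarterly_rollup': 'hadoop_candidate'}
```

A real analysis would also weigh each job's CPU and I/O cost, not just its frequency, before deciding what to move.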
Offload expensive batch processing and ETL to Hadoop
A large amount of data warehouse capacity is consumed by batch processing and
ETL/ELT transformations. For example, a large enterprise discovered that less than 2% of
the ETL processes were consuming over 60% of CPU and I/O resources on a high-end
data warehouse system.
Source: Appfluent Visibility Report (www.appfluent.com)
Hadoop offers an inexpensive platform ideally suited for ETL/ELT batch processing.
Organizations should identify expensive batch processes that consume system capacity
and resources on existing data warehouse systems and move those workloads to a
Hadoop cluster for processing.
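The "2% of processes consume 60% of resources" pattern suggests ranking ETL jobs by resource use and walking down the list until the bulk of consumption is explained. The sketch below is hypothetical (process names and CPU figures are invented) and shows one simple way to pick the first offload targets.

```python
# Hypothetical per-process CPU seconds from a warehouse workload report
cpu_by_process = {
    "etl_load_sales": 5400,
    "etl_load_orders": 4100,
    "etl_dim_refresh": 300,
    "etl_lookup_sync": 120,
    "etl_audit": 80,
}

total = sum(cpu_by_process.values())
ranked = sorted(cpu_by_process.items(), key=lambda kv: kv[1], reverse=True)

# Walk down the ranking until at least 60% of total CPU is accounted for
running, offload = 0, []
for name, cpu in ranked:
    if running / total >= 0.60:
        break
    offload.append(name)
    running += cpu

print(offload)  # → ['etl_load_sales', 'etl_load_orders']
```

Here just two of five processes account for 95% of CPU, so moving those batch transformations to Hadoop first recovers the most warehouse capacity for the least migration effort.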
Summary

Organizations cannot economically sustain adding capacity to existing data warehouse infrastructure at the current rate of data growth and the associated deceleration in system performance. To continue meeting acceptable service levels and supporting new projects, applications and data sources while controlling costs, organizations should begin to leverage low-cost Hadoop clusters to complement existing data warehouse implementations.
Organizations should begin offloading low-value data, such as unused, dormant or infrequently used data, to Hadoop, using Impala to provide access, rather than moving it to tiered storage or tape. By identifying resource-consuming processes on the data warehouse that can run more efficiently on low-cost, fast Hadoop clusters, organizations can recover costly capacity on existing data warehouses and eliminate millions of dollars in infrastructure costs.
About Appfluent Technology

Appfluent Technology, Inc. is the leading provider of data usage analytics. Data-intensive organizations are challenged with managing exploding data volumes, increasing analytic complexity and spiraling costs. Appfluent enables enterprises to maximize the value of analytic and big data warehouse systems. Appfluent’s software reduces data management costs, improves IT efficiency, boosts performance and ensures trusted usage of data. For more information, please visit www.appfluent.com