Optimize Your Data Warehouse with Hadoop

DESCRIPTION

Enterprise IT teams face enormous challenges as terabytes of new data flow into analytic systems on a daily basis. Very large data warehouses, from hundreds of terabytes to a petabyte, are becoming commonplace. How can your organization optimize its data warehouse and contain costs?

TRANSCRIPT
Optimize Your Data Warehouse with Hadoop
Introduction

Gartner predicts that enterprise data will grow 650 percent over the next five years. IDC claims the entire world’s data volume doubles every two years. EMC estimates that its customers’ storage will grow 100x by 2020, with more than 100,000 customers storing and managing over a petabyte of data. This data tsunami represents potential competitive value for enterprises but presents new challenges for IT and business leaders.
The challenge to enterprise IT teams is enormous and growing exponentially as terabytes of new data flow into analytic systems on a daily basis. Data warehouses at the hundreds-of-terabytes and petabyte scale are becoming commonplace. New technologies and approaches have been developed under the “Big Data” umbrella to help enterprises leverage the potential of the data tsunami. One of these, Hadoop, has emerged as a cost-effective platform for processing and storing “Big Data.”
Estimated costs for 30TB of additional data warehouse infrastructure are $2-$4 million. In addition, a large amount of computing resources is often consumed by batch processes loading data, much of which is infrequently used, inactive or dormant.
- Data growth is exploding to hundreds of terabytes in data warehouses
- Every 30TB of growth costs $2-$4 million in incremental infrastructure
- Unused or infrequently used data is unnecessarily maintained on expensive engineered systems
- Valuable capacity is consumed by unnecessary and inefficient batch load processes
Enterprises are adopting Hadoop as a complement to the Enterprise Data Warehouse to control rising infrastructure costs and enable new analytic capabilities. The benefits of this approach are to:

- Lower the cost of data management
- Extend the capacity of existing data warehouse infrastructure
- Retain all data for user access and analytics
- Improve the query performance of the data warehouse
Use Hadoop for inactive and dormant data

Data warehouses are getting bloated with hundreds of terabytes of data, rapidly consuming the capacity of existing infrastructure and requiring organizations to spend millions on upgrades. Yet a large amount of this data is inactive, unused and dormant, while still consuming significant capacity on expensive engineered data warehouse systems.

Source: Appfluent Visibility Report (www.appfluent.com)

Organizations can significantly reduce the infrastructure costs of data warehousing by identifying inactive data and offloading it onto inexpensive Hadoop clusters. Since most organizations are mandated to keep several years’ worth of historical data, Hadoop offers a true “active archive” by enabling users to quickly and easily access data on an ongoing basis without unnecessarily consuming capacity on expensive data warehouse systems. A large financial organization, for example, eliminated over 100 terabytes of data from a data warehouse system, saving over $15 million in infrastructure, by deploying a Hadoop cluster that stores the inactive data at a fraction of the cost of adding capacity to the existing system.

Another source of the data deluge is the regulatory and compliance mandates that require organizations to maintain several years of historical data. A large amount of historical data is seldom accessed or used. However, identifying usage patterns based on the dates used in queries can be nearly impossible to do manually.
By automating the analysis of the calendar dates users reference when querying data, organizations can tier historical data more efficiently.

Source: Appfluent Visibility Report (www.appfluent.com)
By moving unused or infrequently accessed historical data to Hadoop, enterprises can extend the capacity of the existing data warehouse and provide access to the historical data on more cost-effective Hadoop clusters.

Appfluent Visibility, a software solution from Appfluent Technology, enables organizations to monitor data usage across their data warehouses and analyze the dates used in queries, either explicitly or via date-dimension table lookups.
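The tiering idea above can be sketched in a few lines. This is a minimal, hypothetical example (not Appfluent's actual method): assume a query log has already been reduced to pairs of (table, latest calendar date referenced by the query's filters). A table whose queries never reach past a cutoff date is a candidate for offloading to Hadoop.

```python
from collections import defaultdict
from datetime import date

# Hypothetical reduced query log: (table, latest calendar date each query touched)
query_log = [
    ("sales_fact", date(2014, 6, 1)),
    ("sales_fact", date(2014, 5, 15)),
    ("orders_hist", date(2010, 2, 1)),
    ("orders_hist", date(2009, 11, 3)),
]

# Data never queried past this date is treated as dormant (assumed policy cutoff)
ARCHIVE_CUTOFF = date(2013, 1, 1)

# Most recent date each table's queries actually reached
latest_ref = defaultdict(lambda: date.min)
for table, ref_date in query_log:
    latest_ref[table] = max(latest_ref[table], ref_date)

# Tables whose queried date ranges stop before the cutoff are offload candidates
candidates = sorted(t for t, d in latest_ref.items() if d < ARCHIVE_CUTOFF)
print(candidates)  # → ['orders_hist']
```

In practice the query log would come from the warehouse's own audit tables or a monitoring tool, and the cutoff would reflect the organization's retention and service-level policies.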
Store and access infrequently used data on Hadoop
In addition to identifying inactive data, organizations should analyze how all data is being used in a data warehouse to determine which workloads, and their associated data, can be moved to inexpensive Hadoop clusters. As data volumes continue to explode and business users clamor for access to more data, IT teams need to understand how users interact with data: which business information is most relevant and should be kept optimized and readily available on the data warehouse, and which can be moved to inexpensive Hadoop clusters.
Source: Appfluent Visibility Report (www.appfluent.com)

For example, much of the data in data warehouses is used on a monthly or quarterly basis and is processed as batch jobs. These workloads and their associated data are ideal for processing on Hadoop at a fraction of the cost of running them on the data warehouse. Additionally, IT organizations should understand data utilization so they can focus on optimizing the most business-relevant data, maximizing existing data warehouse investments. Data usage insights lead to better alignment of business and IT while reducing the costs of storage and data management.
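The workload triage described above can be illustrated with a small sketch. This is an assumed, simplified heuristic (job names and the twelve-runs-per-year threshold are made up for the example): jobs that run monthly or less often are batch-style and are candidates for Hadoop, while frequently run jobs stay on the warehouse.

```python
from collections import Counter
from datetime import date

# Hypothetical workload log: (job_name, run_date) over roughly one year
runs = [("daily_dashboard", date(2014, 1, d)) for d in range(1, 31)]
runs += [("quarterly_rollup", date(2014, m, 1)) for m in (1, 4, 7, 10)]

runs_per_job = Counter(job for job, _ in runs)

def classify(job):
    # At most one run per month over the window → batch-style, Hadoop candidate
    return "hadoop_candidate" if runs_per_job[job] <= 12 else "keep_on_warehouse"

tiering = {job: classify(job) for job in runs_per_job}
print(tiering)
# → {'daily_dashboard': 'keep_on_warehouse', 'quarterly_rollup': 'hadoop_candidate'}
```

A real analysis would also weigh each job's CPU and I/O cost, not just its frequency, before deciding what to move.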
Offload expensive batch processing and ETL to Hadoop
A large amount of data warehouse capacity is consumed by batch processing and
ETL/ELT transformations. For example, a large enterprise discovered that less than 2% of
the ETL processes were consuming over 60% of CPU and I/O resources on a high-end
data warehouse system.
Source: Appfluent Visibility Report (www.appfluent.com)
Hadoop offers an inexpensive platform ideally suited for ETL/ELT batch processing.
Organizations should identify expensive batch processes that consume system capacity
and resources on existing data warehouse systems and move those workloads to a
Hadoop cluster for processing.
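The "2% of processes consume 60% of resources" pattern suggests ranking ETL jobs by resource use and walking down the list until the bulk of consumption is explained. The sketch below is hypothetical (process names and CPU figures are invented) and shows one simple way to pick the first offload targets.

```python
# Hypothetical per-process CPU seconds from a warehouse workload report
cpu_by_process = {
    "etl_load_sales": 5400,
    "etl_load_orders": 4100,
    "etl_dim_refresh": 300,
    "etl_lookup_sync": 120,
    "etl_audit": 80,
}

total = sum(cpu_by_process.values())
ranked = sorted(cpu_by_process.items(), key=lambda kv: kv[1], reverse=True)

# Walk down the ranking until at least 60% of total CPU is accounted for
running, offload = 0, []
for name, cpu in ranked:
    if running / total >= 0.60:
        break
    offload.append(name)
    running += cpu

print(offload)  # → ['etl_load_sales', 'etl_load_orders']
```

Here just two of five processes account for 95% of CPU, so moving those batch transformations to Hadoop first recovers the most warehouse capacity for the least migration effort.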
Summary

Organizations cannot economically sustain adding capacity to existing data warehouse infrastructure at the current rate of data growth and the associated deceleration in system performance. To continue meeting acceptable service levels and supporting new projects, applications and data sources while controlling costs, organizations should begin to leverage low-cost Hadoop clusters to complement existing data warehouse implementations.
Organizations should begin offloading low-value data, such as unused, dormant or infrequently used data, to Hadoop, using Impala to provide access, rather than moving it to tiered storage or tape. By identifying resource-consuming processes on the data warehouse that can run more efficiently on low-cost, fast Hadoop clusters, organizations can recover costly capacity on existing data warehouses and eliminate millions of dollars in infrastructure costs.
About Appfluent Technology

Appfluent Technology, Inc. is the leading provider of data usage analytics. Data-intensive organizations are challenged with managing exploding data volumes, increasing analytic complexity and spiraling costs. Appfluent enables enterprises to maximize the value of analytic and big data warehouse systems. Appfluent’s software reduces data management costs, improves IT efficiency, boosts performance and ensures trusted usage of data. For more information, please visit www.appfluent.com