High Performance Analytics: SAS and Greenplum (SUNZ 2012)
Post on 19-Oct-2014
1 © Copyright 2011 EMC Corporation. All rights reserved.
Data Computing Division
EMC ACQUIRES GREENPLUM
“Greenplum, with expertise in the massively parallel arena, will give the storage giant a boost in big-data computing.”
– InformationWeek –
Greenplum Becomes the Foundation of EMC’s Data Computing Division
“For three years, Gartner has identified Greenplum as the most advanced vendor in the visionary
quadrant of its data warehouse DBMS Magic Quadrant….” – Gartner
New Realities… New Demands!
• Do it faster
– Ingest more data
– Ingest it faster
– Keep it unsummarised, keep it for longer
• Be more responsive
– Unpredictable queries, rapidly evolving bespoke analytics
– New tools: Hadoop, MapReduce, Hive, HBase, “R”
• Manage new data types
– Manage and allow queries across structured, semi-structured and unstructured data
• Do it at a lower cost
Big Data will revolutionise Data Warehousing and analytics.
Why Greenplum?
Fast Data Loading
Extreme Performance & Elastic Scalability
Unified Data Access
• EMC Greenplum is a shared nothing, massively parallel processing (MPP) data warehouse system
• Core principle of data computing is to move the processing dramatically closer to the data and to the people
[Diagram: Greenplum architecture]
• Data Sources: loading, streaming, etc. from External Files, URLs, Hadoop (HDFS), WebServices (including from other DBs), O/S Pipes (including from other DBs)
• Master Server: query planning & dispatch (primary server, plus hot failover)
• Network Interconnect
• Segment Servers: query processing & data storage
• Clients: standard Business Intelligence and analytical tools (SQL BI tools, analytical tools), Hadoop MapReduce
• Structured Analytics, Unstructured Analytics
Callouts:
– Clients see a single database
– Queries distributed across all available resources
– Shared Nothing, Massively Parallel Processing means no bottlenecks and linear scalability.
– Data loading also takes advantage of MPP architecture
– Greenplum handles structured, semi-structured and unstructured data
Why is MPP different?
Greenplum is a Scale-Out Architecture on standard commodity hardware
MPP
• Queries shipped to each node simultaneously
• Execute in parallel on each segment instance
• Multiple pipelines of data
• Highly scalable topology
• Locks and buffers not shared
Traditional
• Single database buffer used by all user operations
• More locks means a more complex lock management system
• Single pipe to data
• Limited scalability
Partitioning: The Key to Parallelism
Strategy: Spread data evenly across as many nodes (and disks) as possible

Greenplum Database High Speed Loader

Table ‘Order’
Order #   Order Date    Customer ID
43        Oct 20 2005   12
64        Oct 20 2005   111
45        Oct 20 2005   42
46        Oct 20 2005   64
77        Oct 20 2005   32
48        Oct 20 2005   12
50        Oct 20 2005   34
56        Oct 20 2005   213
63        Oct 20 2005   15
44        Oct 20 2005   102
53        Oct 20 2005   82
55        Oct 20 2005   55
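The idea of spreading rows evenly across nodes can be sketched in a few lines of Python. This is only an illustration, not Greenplum's implementation: the function name `distribute`, the choice of four segments, and the use of a simple modulo on the order number in place of the database's real hash function are all assumptions made for the example. The rows are the twelve orders shown on the slide.

```python
from collections import defaultdict

# Rows from the slide's Order table: (order number, order date, customer ID).
ORDERS = [
    (43, "Oct 20 2005", 12), (64, "Oct 20 2005", 111),
    (45, "Oct 20 2005", 42), (46, "Oct 20 2005", 64),
    (77, "Oct 20 2005", 32), (48, "Oct 20 2005", 12),
    (50, "Oct 20 2005", 34), (56, "Oct 20 2005", 213),
    (63, "Oct 20 2005", 15), (44, "Oct 20 2005", 102),
    (53, "Oct 20 2005", 82), (55, "Oct 20 2005", 55),
]


def distribute(rows, num_segments):
    """Assign each row to a segment by its distribution key (order number).

    Assumption: the modulo stands in for the database's real hash function.
    """
    segments = defaultdict(list)
    for row in rows:
        segments[row[0] % num_segments].append(row)
    return dict(segments)


# Hypothetical cluster of four segments.
placement = distribute(ORDERS, num_segments=4)
for seg in sorted(placement):
    print(f"segment {seg}: orders {[r[0] for r in placement[seg]]}")
```

Every row lands on exactly one segment, so each segment can later scan its own share of the table without coordinating with the others.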
Greenplum Database Powerful Data Loading Capabilities
• Industry leading performance:
– >10TB per hour per rack
• Innovative, parallel-everything architecture:
– Scatter-Gather Streaming™ provides true linear scaling
– Support for both large-batch and continuous real-time loading strategies
– Enable complex data transformations “in-flight”
– Transparent interfaces to loading via support files, application and services
Traditional Loading vs Greenplum DB Parallel Loading
[Diagram: Conventional Loading compared with parallel loading; in both, ETL Servers feed Segment nodes through the Interconnect]
Advanced pipeline process for fast operation
[Diagram: Client, Master Server and Segment Servers; a Sort Request goes to each segment; unsorted values 1 4 2 9 / 7 3 11 6 / 8 12 5 10]
Advanced pipeline process for fast operation
[Diagram: Segment Servers, Master Server and Client; the values are returned in sorted order: 12 11 10 9 8 7 6 5 4 3 2 1]
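The two pipeline diagrams can be read as a parallel sort followed by a merge. The sketch below illustrates that reading under stated assumptions: the values on the slide are taken to be grouped one row per segment, and the helper names `segment_sort` and `master_merge` are hypothetical. It is not a description of Greenplum's internal executor.

```python
import heapq

# Unsorted values from the slide, assumed to be grouped one row per segment.
SEGMENT_DATA = [
    [1, 4, 2, 9],
    [7, 3, 11, 6],
    [8, 12, 5, 10],
]


def segment_sort(values):
    """Each segment sorts only its own slice of the data."""
    return sorted(values)


def master_merge(sorted_runs):
    """The master merges the pre-sorted streams arriving from the segments."""
    return list(heapq.merge(*sorted_runs))


runs = [segment_sort(values) for values in SEGMENT_DATA]
result = master_merge(runs)
print(result)
```

Because each run is already sorted, the merge step only ever compares the current head of each stream, which is why the final ordering can be streamed to the client rather than computed in one place.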
Greenplum Database Extreme Performance
• Optimized for BI and Analytics
– Rich eco-system of partners
• Provides automatic parallelization
– Just load and query like any database
– Tables are automatically distributed across nodes
– No need for manual partitioning or tuning
• Extremely scalable MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
Platform Independence Delivers Choice and Flexibility
Software-Only
• On your x86 hardware
• Flexibility for any workload
Virtualized Infrastructure
• Pool resources
• Elastic scalability
Data Computing Appliance
• Optimized Price/Performance
• Minimum time-to-value
• Ideal for Production Environments
Table ‘Customer’
Jan ’09 Feb ’09 Mar ’09 Apr ’09 May ’09 Jun ’09 Jul ’09 Aug ’09 Sept ’09 Oct ’09 Nov ’09
Column-Oriented Archival Compression
Column-Oriented Fast Compression
Row-Oriented Fast Compression
Greenplum Polymorphic Data Storage
• Greenplum Database’s engine provides a flexible storage model
– Four table types: heap, row-oriented, column-oriented, external
– Block compression: Gzip (levels 1-9), QuickLZ
• Storage types can be mixed within a database, and even within a table
– Fully configurable via table DDL and partitioning syntax
– You may also choose to index some partitions and not others
• Gives customers the choice of processing model for any table or partition
– Supports ILM scenarios – denser packing of older partitions, etc.
– Tables/partitions of different storage types can be joined together without restriction
– Highly tuned – e.g. columnar does efficient pre-projection and parallel execution
Unified Data Access Across The Enterprise
• Workload Management
– Connection management controls how many users can be connected and assigns them to a queue
– User-based resource queues allow for control of the total number or cost of queries allowed at any point in time
• Dynamic Query Prioritization
– Patent-pending technique of dynamically balancing resources across running queries
– Allows DBAs to control query priorities in real time, or determine default priorities by resource queue
Highly interactive web-based performance monitoring
Real-time and historic views of:
• Resource utilization
• Queries and query internals
Greenplum Performance Monitor
Key Technical Requirements for HPA
➢ Technical Values
✓ Performance - Massively parallel architecture
✓ Load speeds – 10TB/hr
✓ Integration with SAS
✓ In-database analytics using Java, PL/R, etc.
✓ Integration with many more BI and analytical tools
✓ Integration with Hadoop for unstructured data analysis
➢ Financial Value
✓ Lower total cost of ownership
✓ Best price/performance ratio in the industry for EDW/analytical appliance
➢ Operational Values
✓ No index maintenance
✓ Backup recovery solution
✓ Most robust Disaster Recovery solution in the industry
✓ Best technical and customer support organization backing
A Few SAS Generalisations
➢ Large sequential reads and writes
➢ Reading and writing of data is done via the OS’s file cache
➢ I/O throughput rate is restricted by how fast the OS’s file cache can process the data
➢ A lot of temporary files can be created.
An MPP SQL query – just for fun
• Take a 44TB table on which the query planner executes a sequential scan. There are 1,218 million rows of data and 1,000 columns. 5 concurrent users run the same query on a monthly data set.
• As a baseline: a single node on a typical high-end server with a single controller can read about 1.5GB per second into the database. So a DBMS deployed on a single node can scan our 44TB in about 8.1 hours for one query, or 40.7 hours for the five concurrent queries.
• If we deploy over 8 nodes on a Greenplum cluster the aggregate I/O bandwidth increases linearly to 12GB/sec. Our query will complete in 61 minutes.
• If we compress the rows then we can read more data with each I/O. Compression varies but 2.5X is a reasonable estimate. So our effective scan rate improves by 2.5 and our query completes in 24.4 minutes.
• Partitioning allows us to split the data on each segment by a known value (by month in our example) and, if possible, read only the partitions selected. We scan only 1/84th (7 x 12 months) of the table. Our query completes in 17.4 seconds.
• Columnar-based compression is more effective than row-based compression. 10X columnar-based compression is a conservative estimate. 10X is 4 times better than the 2.5X row compression already built into our example. So now our table scan completes in 4.35 seconds.
• Columnar projection lets us perform I/O on only the columns we are interested in. Let's assume 500 of the 1,000 columns in our example. By reading only 50% of the data we reduce our I/O by 50%, and our table scan completes in 2.175 seconds. If 5 people were executing the same query concurrently and each person was configured to have an equal share of the system resources, then each person's query would complete in 10.9 seconds.
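The chain of estimates in this example can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, assuming 1TB = 1,000GB and perfectly linear scaling at every step; the variable names are illustrative only. Working from unrounded intermediate values, the single-node scan for one query comes to about 8.1 hours (40.7 hours corresponds to five such scans back to back), and the later steps land within a few hundredths of a second of the slide's figures, which are rounded at each stage.

```python
TABLE_GB = 44_000            # 44TB, assuming 1TB = 1,000GB
NODE_GB_PER_SEC = 1.5        # single-node read rate quoted on the slide
NODES = 8
ROW_COMPRESSION = 2.5
PARTITIONS = 7 * 12          # 84 monthly partitions
COLUMNAR_COMPRESSION = 10
COLUMN_FRACTION = 500 / 1000
USERS = 5

single_node = TABLE_GB / NODE_GB_PER_SEC                   # seconds, one query
mpp = single_node / NODES                                  # 8 nodes, 12GB/sec aggregate
compressed = mpp / ROW_COMPRESSION                         # 2.5X row compression
partitioned = compressed / PARTITIONS                      # scan one month only
columnar = partitioned / (COLUMNAR_COMPRESSION / ROW_COMPRESSION)  # 10X instead of 2.5X
projected = columnar * COLUMN_FRACTION                     # read half the columns
concurrent = projected * USERS                             # five equal shares

print(f"single node:      {single_node / 3600:.2f} hours")
print(f"8 nodes:          {mpp / 60:.1f} minutes")
print(f"row compression:  {compressed / 60:.1f} minutes")
print(f"one partition:    {partitioned:.2f} seconds")
print(f"columnar 10X:     {columnar:.2f} seconds")
print(f"500 columns:      {projected:.2f} seconds")
print(f"5 concurrent:     {concurrent:.2f} seconds")
```

Each step divides the previous time by a constant factor, so the order in which the optimisations are applied does not change the final estimate.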
• Note that queries that touch two months touch twice as much data and would complete in 4.35 seconds, four months in 8.7 seconds, and so on: it is scalable and robust.
• Also note that joins are implemented using a shared-nothing approach, meaning that they scale up as well.
• We can apply indexes if necessary to further improve query performance.
An MPP SQL query – Summary
Multiple options for SAS & GP Deployments
SAS Grid
SAS In-Database
SAS In-Memory
SAS Access, Greenplum database
SAS Access, Greenplum database
• Provides integration capability to Greenplum
• Allows for increased performance of Base SAS Procs when using the latest SAS v 9.3 release
• Products: SAS Access for Greenplum
• libname myGP ODBC server=gplum04 db=customers port=5432 user=gpusr1 password=gppwd1;
SAS In-Database
In-Database Scoring / In-Database Analytics
• Enables SAS Enterprise Miner models to execute within the Greenplum database.
• Automatically translates and publishes the model as a scoring function inside the database.
• High-performance model scoring with faster time to results
• Products: SAS Scoring Accelerator
Note: For Greenplum, this will only be available in the next version release of 9.3, slated for the end of this year.
SAS In-Database
In-Database Scoring / In-Database Analytics
• Execution of key SAS analytical, data discovery and data summarization tasks in database.
• Reduces the time needed to build, execute and deploy powerful predictive models.
• Improve data governance on predictive analytics projects and produce faster, better results.
• Products: SAS Analytics Accelerator
Note: This is currently on the roadmap for Greenplum and will be available with future SAS versions.
SAS Grid
• SAS running on a cluster of servers for better performance
• This can provide some acceleration on the base procs with Greenplum as the database, as it allows the database to make use of parallel processing
• Products: SAS Access for Greenplum
SAS In-Memory
• This is a complete 'big data' stack offering fast-loading, robust data management and complex analytics in a purpose-built environment.
• Very high performance for business users that can significantly increase revenues or decrease costs as a result of improved performance
• Products: GP & SAS HPA
Note: Available in Q4 2011
SAS / Greenplum Product Overview
SAS High Performance Computing
SAS Access
• Provides integration capability to a number of databases
• Allows for increased performance of Base SAS Procs when using the latest SAS v 9.3 release
• Products: SAS Access for Greenplum
SAS Grid
• Utilized to run SAS on a grid of commodity servers instead of large UNIX or Mainframe
• Limited impact to SAS jobs and users, but simplified operations. Generally uses more CPUs for improved performance
• Products: SAS Access for Greenplum, SAS Grid
SAS In-Database
• Allows certain models to be pushed into the database for execution. Requires SAS Enterprise Miner in order to be utilized
• Will lead to significant (20x or more) improvement in performance versus non-database deployments
• Products: SAS Access for Greenplum, SAS Grid, SAS Enterprise Miner, SAS Scoring Accelerator for Greenplum
SAS In-Memory (HPA)
• New functionality from SAS that requires a dedicated database appliance
• Very high performance for business users that can significantly increase revenues or decrease costs as a result of improved performance
• Products: SAS Access for Greenplum, SAS Grid, SAS High Performance Analytics
In-Database Roadmap for Greenplum
Greenplum
SAS Product | Capability | Status
Base SAS® | Descriptive Statistics / Query and Reporting – SQL Pushdown | Available in 2011 Q4 (9.3 M)
SAS/Access® Interface | Database Specific Integration and Connectivity | Available
 | Support for SAS Format Function | Available in 2011 Q4 (9.3 M)
SAS® Data Integration Studio | Data Extraction, Load and transformation | Available
SAS® Scoring Accelerator* | Production Batch Scoring / Real Time Scoring | Available in 2011 Q4 (9.3 M)
What is SAS High Performance Analytics for GP?
• It’s software (GP DB, SAS HPA)
• It combines parallel execution with in-memory
• It allows large volumes of data to be handled quickly
• A select set of procedures from the following SAS products: Base SAS, SAS/STAT, SAS/ETS, SAS/OR and SAS Enterprise Miner.
Why are GP & SAS a good match?
✓ Greenplum & SAS already work well together via SAS/Access and the Scoring Accelerator
✓ GP & SAS represent an end-to-end analytics infrastructure, including rapid data loads, powerful ETL, and parallel data computing for reports and analytics
✓ Greenplum delivers extreme performance via the MPP architecture that is optimized for faster query execution and unmatched data loading
✓ Rapidly deployable and designed for massive growth
✓ SAS & GP are working to develop advanced solutions with deeper connectivity; this solution will represent the state of the art in high-performance, scalable, advanced analytics
Some Greenplum Big Data References
• The Greenplum Database supports up to 2^48 (2 to the power of 48) rows per table. One Greenplum customer, Fox Interactive Media, has a trillion-row fact table and is adding a further 3TB per day in a true mixed-workload environment supporting production reporting, ad-hoc data mining, and operational data services.
• Another online eCommerce client had, at the last site visit, approximately 21TB in their Greenplum instance with 10 nodes. They load between 10 and 30 million rows a day, but the issue is frequency and complexity rather than size. There are 2,000 Informatica workflows per day and complex hourly loads (up to 300 Greenplum loads per batch, with 9,000 Greenplum loads every day).
• They have 5,000 tables, 350,000 columns, 4,000 views and 1,600 indexes, with relational and dimensional models, heavily relational/3NF as they had a legacy Teradata DW that Greenplum replaced. There are hourly metadata/schema/table changes in response to the hourly data loads.
• This client is averaging around a million SQL statements per day. They have heavy spikes during peak hours and maintain a Cognos reporting SLA of 100k queries per hour. They have over 1,000 Cognos users and 50% of the workload is Cognos; these are mostly small statements. 25% is financial reporting, 10% is CRM. The remaining 15% is ad-hoc work by power users and analysts, with lots of significantly large 25-50 slice queries (and up to 100 slices). They have dependent views to 4 levels of nesting: view (great-grandchild) -> view (grandchild) -> view (child) -> view -> table.
The Australian Tax Office uses Greenplum as an investigatory tool in their Compliance and Audit Logging Unit. They are an extremely happy reference customer, citing Greenplum's ability to pull in data from multiple sources and quickly analyse the data without needing to create complex data models or even indices.
Some SAS & Greenplum Customers
RWS, in Singapore, used MS SQL Server as their reporting environment. Their reporting & ETL processes were very slow and the DWH environment was limited in terms of scalability. They were looking for an in-database platform that could work with SAS. We won a competitive PoC last quarter and the system is currently being implemented. They will be using GP & SAS as an EDW to store and analyze customer trends.
AIS, a Telco in Thailand, migrated a Teradata DWH as well as 2 Oracle DWHs onto a single Greenplum cluster, demonstrating the schema independence of the database. The system has expanded to 70 TB across 32 servers. AIS uses SAS as their analytical platform.
Inland Revenue Service was running on an Oracle DWH and had problems with analytical report processing time. We won this deal in Q3 and it is currently in the implementation phase.
Samsung Life Insurance had a 50TB Sybase DWH that they had spent 8 years building. They ran out of performance but were able to migrate the entire environment to Greenplum in 3 months. They had approx. 400,000 reports across 4 tools (SAS, Webfocus, MSTR, OLAP); only about 100 required tuning.
Greenplum Customers -- Government
• Pacific Northwest National Labs (Dept. of Energy) does cyberanalytics.
• USAspending.gov traces the outlays of the US Federal Government.
• The Federal Reserve Bank of Kansas City does economic analysis mostly related to the housing market.
• Recently, the Internal Revenue Service purchased a DCA to do work related to Fraudulent Tax returns.
• ATO uses GP as an investigatory tool in their Compliance and Audit Logging Unit.
SAS AND EMC GREENPLUM INTEGRATED ARCHITECTURE
DATA SCIENCE TEAM: Data Scientist, Data Engineer, Data Analyst, BI Analyst, LOB User, Data Platform Admin
Greenplum Chorus - Analytic Productivity Layer
SAS Analytics, SAS Business Intelligence, SAS Information Management
Data Access & Query Layer
Greenplum Database, Greenplum Hadoop
Private/Hybrid Cloud Infrastructure or Appliance
High Performance Analytics
‘The power to know fast’
Questions?