(BDT309) Delivering Results with Amazon Redshift, One Petabyte at a Time | AWS re:Invent 2014
DESCRIPTION
The Amazon Enterprise Data Warehouse team, responsible for data warehousing across all of Amazon's divisions, spent 2014 working with Amazon Redshift on its largest datasets, including web log traffic. The key goal of this project was to provide a viable, enterprise-grade solution that enabled full scans of 2 trillion rows in under an hour at load. Key to success was automating routine DW tasks that become complicated at scale: backfilling erroneous data, recalculating statistics, re-sorting daily additions, and so forth. In this session, we discuss the scale and performance of a 100-node, 1 PB Amazon Redshift cluster, and describe some of the technical aspects and best practices of running 100-node clusters in an enterprise environment.
TRANSCRIPT
Use Case                               Goal      Benchmark
Scan 2.25 trillion rows (15 months)    60 min    14 min
Load 5 billion rows (1 day)            60 min    10 min
Load 150 billion rows (30 days)        24 hours  9.75 hours
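The benchmark figures above imply some striking sustained rates. A quick arithmetic check (my calculation, not a number quoted in the talk):

```python
# Average throughput implied by the benchmark table above.
# These are back-of-the-envelope rates, not figures from the session.
def rows_per_second(rows, minutes):
    """Average rate in rows per second for a job of `rows` rows over `minutes` minutes."""
    return rows / (minutes * 60)

# Scan: 2.25 trillion rows in 14 minutes
scan_rate = rows_per_second(2.25e12, 14)
# Daily load: 5 billion rows in 10 minutes
load_rate = rows_per_second(5e9, 10)
# Monthly load: 150 billion rows in 9.75 hours
monthly_rate = rows_per_second(150e9, 9.75 * 60)

print(f"scan:         {scan_rate:,.0f} rows/sec")   # roughly 2.7 billion rows/sec
print(f"daily load:   {load_rate:,.0f} rows/sec")   # roughly 8.3 million rows/sec
print(f"monthly load: {monthly_rate:,.0f} rows/sec")
```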
– VACUUM is slow; physical partitions do not exist
  • Doesn't allow for parallel loads into the same table
• 15 concurrent queries
  – "Bad" queries can impact the entire cluster
– COMPUPDATE (samples the data) – fast but not optimal
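The routine the bullets above describe (load, then re-sort with VACUUM, then refresh statistics) can be sketched as SQL generation. This is a hypothetical helper, not code from the talk: the table and S3 path are placeholders and the COPY credentials clause is omitted, but COMPUPDATE OFF, STATUPDATE OFF, VACUUM, and ANALYZE are real Redshift syntax.

```python
# Hypothetical helper that emits the SQL for one daily load cycle on Redshift.
def daily_load_sql(table, s3_path):
    return [
        # Load without re-sampling compression encodings or table statistics
        # (encodings are assumed to be set up front; credentials clause omitted).
        f"COPY {table} FROM '{s3_path}' COMPUPDATE OFF STATUPDATE OFF;",
        # Re-sort the day's additions into the table's sort order.
        f"VACUUM {table};",
        # Recompute planner statistics explicitly rather than via sampling at load.
        f"ANALYZE {table};",
    ]

for stmt in daily_load_sql("weblog_traffic", "s3://example-bucket/2014-11-13/"):
    print(stmt)
```

Disabling COMPUPDATE and STATUPDATE keeps the load fast; the explicit VACUUM and ANALYZE afterwards are the steps whose durations appear in the node-count comparison later in this deck.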
[Chart: query speed distribution #1]
FASTER                86.35%
  GREATER THAN 15X    14.91%
  10X TO 15X          18.42%
  5X TO 10X           25.73%
  3X TO 5X            19.88%
  2X TO 3X             7.02%
  1X TO 2X             3.80%
SAME                   8.47%
SLOWER                 5.65%
  1X TO 2X             1.75%

[Chart: query speed distribution #2]
FASTER                14.85%
  3X TO 5X             0.56%
  2X TO 3X             3.64%
  1X TO 2X            10.64%
SAME                  19.05%
SLOWER                66.11%
  1X TO 2X            18.49%
  2X TO 3X             8.96%
  3X TO 5X             9.80%
  5X TO 10X           10.08%
  10X TO 15X           5.04%
  SLOWER THAN 15X     13.73%
                      40 8XL nodes   100 8XL nodes
Daily (6B)
  Vacuum              80 min         30 min
  Stats Collection    90 sec         50 sec
Monthly (150B)
  Vacuum (Deep Copy)  380 min        201 min
  Stats Collection    22 min         4 min
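The node-count comparison above can be checked against linear scaling: going from 40 to 100 nodes is a 2.5x increase, so a perfectly linear task would finish in 1/2.5 of the time. This is my arithmetic, not an analysis from the talk:

```python
# Observed speedup for each maintenance task vs. the 2.5x node-count increase
# (40 -> 100 8XL nodes). Timings copied from the table above.
ratio = 100 / 40  # 2.5x more nodes

timings = {
    "daily vacuum (min)":      (80, 30),
    "daily stats (sec)":       (90, 50),
    "monthly deep copy (min)": (380, 201),
    "monthly stats (min)":     (22, 4),
}

for task, (before, after) in timings.items():
    speedup = before / after
    kind = "super" if speedup > ratio else "sub"
    print(f"{task}: {speedup:.2f}x speedup ({kind}-linear vs {ratio}x nodes)")
```

The tasks split both ways: the daily vacuum and monthly stats collection sped up by more than 2.5x, while the daily stats collection and the deep-copy vacuum sped up by less.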
http://bit.ly/awsevals