FlyData: Amazon Redshift BENCHMARK Series 01

Amazon Redshift is 10x faster and cheaper than Hadoop + Hive

Comparisons of speed and cost efficiency

www.flydata.com

Amazon Redshift took 155 seconds to run our queries for 1.2TB data

Hadoop + Hive took 1491 seconds to run our queries for 1.2TB data

Amazon Redshift was 10X faster

Amazon Redshift cost $20 to run a query every 30 minutes

Hadoop + Hive cost $210 to run a query every 30 minutes

Amazon Redshift was 10X more cost effective

Amazon Redshift is a new data warehouse for big data on the cloud. Before Redshift, users had to turn to Hadoop for querying over TBs of data.

We have run benchmarks to compare Redshift to Hadoop (Amazon Elastic MapReduce), both on AWS environments, specifically to show the differences for the typical workload of advertising agencies:

• Between 100GB and ~50TB of data
• Frequent queries (more than once an hour)
• Short turnaround time required

Prerequisite - Data

TSV files, gzip compressed

imp_log
1) 300GB / 300M records
2) 1.2TB / 1.2B records
Columns: date datetime, publisher_id integer, ad_campaign_id integer, bid_price real, country varchar(30), attr1-4 varchar(255)

click_log
1) 1.4GB / 1.5M records
2) 5.6GB / 6M records
Columns: date datetime, publisher_id integer, ad_campaign_id integer, country varchar(30), attr1-4 varchar(255)

Data set 1) covers 1 month, data set 2) covers 4 months.

ad_campaign: 100MB / 100k records
publisher: 10MB / 10k records
advertiser: 10MB / 10k records

We use 5 tables to run a query which joins tables and creates a report.

1. Query Speed

• Redshift takes 155 seconds to complete our query for 1.2TB
• Hadoop takes 1491 seconds to complete our query for 1.2TB
• Redshift is about 10 times faster than Hadoop for this query

Here, we are comparing Hadoop and Redshift servers of the same cost. (Hadoop: c1.xlarge vs Redshift: dw.hs1.xlarge).

[Chart: query processing time. 300GB: Hadoop 672sec, Redshift 38sec. 1.2TB: Hadoop 1491sec, Redshift 155sec.]

* The query used can be referenced in our Appendix
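
The "about 10 times faster" figure can be checked directly from the reported times. Here is a minimal sketch; the names `TIMES` and `speedup` are ours, and the 300GB figures are the 38s Redshift average and the 11m 12s (672sec) Hadoop run from the result tables.

```python
# Processing times in seconds, as reported in this deck.
TIMES = {
    "300GB": {"hadoop": 672, "redshift": 38},
    "1.2TB": {"hadoop": 1491, "redshift": 155},
}


def speedup(hadoop_sec, redshift_sec):
    """How many times faster Redshift finished than Hadoop."""
    return hadoop_sec / redshift_sec


# Speedup per data size, rounded to one decimal place.
speedups = {
    size: round(speedup(t["hadoop"], t["redshift"]), 1)
    for size, t in TIMES.items()
}
print(speedups)
```

On these numbers the 1.2TB ratio is a little under 10 and the 300GB ratio is noticeably higher, so "about 10 times" is the more conservative of the two.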

2. Total Cost

• Redshift costs $20 per day to run queries every 30 minutes
• Hadoop costs $210 per day to run queries every 30 minutes
• Redshift is about 10 times cheaper than Hadoop to run this job

Here, we are comparing Hadoop and Redshift servers running the same query for the same duration of time.
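
A schedule of one query every 30 minutes only works if each run finishes before the next one is due. The sketch below checks that against the 1.2TB processing times reported in this deck and computes the cost ratio from the quoted $20 and $210. The helper `fits_schedule` and the configuration labels are our own names, and the check is a simplification that ignores cluster start-up time.

```python
INTERVAL_SEC = 30 * 60  # one query every 30 minutes

# 1.2TB processing times in seconds, from the result tables.
RUN_TIMES = {
    "Redshift dw.hs1.xlarge x1": 155,
    "Hadoop c1.xlarge x10": 37 * 60 + 7,
    "Hadoop c1.xlarge x20": 24 * 60 + 51,
}


def fits_schedule(run_sec, interval_sec=INTERVAL_SEC):
    """True if a run finishes before the next one is due."""
    return run_sec < interval_sec


runs_per_day = 24 * 3600 // INTERVAL_SEC
fits = {name: fits_schedule(sec) for name, sec in RUN_TIMES.items()}

# Hadoop cost divided by Redshift cost, using the figures quoted above.
cost_ratio = 210 / 20

print(runs_per_day, fits, cost_ratio)
```

Under these assumptions the 10-node Hadoop cluster cannot keep up with a 30-minute schedule, while the 20-node cluster and the single Redshift node can.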


Redshift Query Result

Data Size  Instance Type  Number of Instances  Trial  Processing Time  Average  Server Cost Per Day
300GB      dw.hs1.xlarge  1                    1      58s              38s      $20.40
                                               2      43s
                                               3      31s
                                               4      30s
                                               5      30s
1.2TB      dw.hs1.xlarge  1                    1      164s             155s     $20.40
                                               2      149s
                                               3      158s
                                               4      156s
                                               5      150s
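
The averages in the table can be reproduced from the five trials. This is a small sketch using only the trial times listed above; the name `TRIALS` is ours.

```python
from statistics import mean

# Processing time of each trial in seconds, from the Redshift result table.
TRIALS = {
    "300GB": [58, 43, 31, 30, 30],
    "1.2TB": [164, 149, 158, 156, 150],
}

# Mean processing time per data size.
averages = {size: mean(times) for size, times in TRIALS.items()}
print({size: round(avg) for size, avg in averages.items()})
```

Rounded to whole seconds, the means match the 38s and 155s reported as averages.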


Hadoop Query Result

Data Size  Instance Type  Instance Number  Processing Time  Server Cost Per Day
300GB      c1.xlarge      1                1h 23m 2s        $0.80
300GB      c1.medium      10               37m 48s          $0.89
300GB      c1.xlarge      10               11m 12s          $1.06
1.2TB      m1.xlarge      1                6h 43m 24s       $3.22
1.2TB      c1.medium      4                5h 14m 0s        $3.04
1.2TB      c1.xlarge      10               37m 7s           $3.58
1.2TB      c1.xlarge      20               24m 51s          $4.64
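
The Hadoop times are given in hours, minutes and seconds, while the Redshift times and the headline figures are in seconds. The sketch below converts between the two; `to_seconds` is a hypothetical helper of ours, not part of the benchmark code.

```python
import re

SECONDS_PER_UNIT = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '1h 23m 2s' into seconds."""
    parts = re.findall(r"(\d+)([hms])", text)
    return sum(int(number) * SECONDS_PER_UNIT[unit] for number, unit in parts)


# The two runs quoted in the query speed comparison.
print(to_seconds("11m 12s"), to_seconds("24m 51s"))
```

The 11m 12s and 24m 51s runs correspond to the 672sec and 1491sec figures used in the query speed comparison.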


Discussion

• Consider Redshift
– If your data is big (>TB) and you need to run your queries more than once an hour
– If you want to get quick results

• Consider Hadoop (EMR)
– If your data is too big (>PB)
– If your job queries are once a day, week or month
– If you already have invested in Hadoop technology specialists

APPENDIX – Sample Query

select
  ac.ad_campaign_id as ad_campaign_id,
  adv.advertiser_id as advertiser_id,
  cs.spending as spending,
  ims.imp_total as imp_total,
  cs.click_total as click_total,
  click_total/imp_total as CTR,
  spending/click_total as CPC,
  spending/(imp_total/1000) as CPM
from ad_campaigns ac
join advertisers adv
  on (ac.advertiser_id = adv.advertiser_id)
join (
  select il.ad_campaign_id, count(*) as imp_total
  from imp_logs il
  group by il.ad_campaign_id
) ims on (ims.ad_campaign_id = ac.ad_campaign_id)
join (
  select cl.ad_campaign_id, sum(cl.bid_price) as spending, count(*) as click_total
  from click_logs cl
  group by cl.ad_campaign_id
) cs on (cs.ad_campaign_id = ac.ad_campaign_id);

The query generates a basic report on ad campaign performance: impression and click counts, advertiser spending, CTR, CPC and CPM.
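
To see what the report looks like, the same joins and aggregates can be tried on a few rows in an in-memory SQLite database. This is only a sketch under several assumptions. SQLite stands in for Redshift and Hive. The toy tables keep only the columns the query touches. bid_price is placed on click_logs because the query reads cl.bid_price. The "1.0 *" and "1000.0" factors are our addition so that SQLite does not truncate integer division. All row values are made up.

```python
import sqlite3

# Toy schema: only the columns the sample query touches (an assumption).
SCHEMA = """
create table advertisers (advertiser_id integer);
create table ad_campaigns (ad_campaign_id integer, advertiser_id integer);
create table imp_logs (ad_campaign_id integer);
create table click_logs (ad_campaign_id integer, bid_price real);
"""

# Same joins and aggregates as the sample query; the float factors are ours.
REPORT_SQL = """
select ac.ad_campaign_id, adv.advertiser_id, cs.spending, ims.imp_total,
       cs.click_total,
       1.0 * cs.click_total / ims.imp_total as ctr,
       cs.spending / cs.click_total as cpc,
       cs.spending / (ims.imp_total / 1000.0) as cpm
from ad_campaigns ac
join advertisers adv on (ac.advertiser_id = adv.advertiser_id)
join (select il.ad_campaign_id, count(*) as imp_total
      from imp_logs il group by il.ad_campaign_id) ims
  on (ims.ad_campaign_id = ac.ad_campaign_id)
join (select cl.ad_campaign_id, sum(cl.bid_price) as spending,
             count(*) as click_total
      from click_logs cl group by cl.ad_campaign_id) cs
  on (cs.ad_campaign_id = ac.ad_campaign_id)
"""


def run_report():
    """Load made-up rows and return the report rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("insert into advertisers values (10)")
    conn.execute("insert into ad_campaigns values (1, 10)")
    # 2000 impressions and 2 clicks for campaign 1.
    conn.executemany("insert into imp_logs values (?)", [(1,)] * 2000)
    conn.executemany(
        "insert into click_logs values (?, ?)", [(1, 0.5), (1, 1.5)]
    )
    rows = conn.execute(REPORT_SQL).fetchall()
    conn.close()
    return rows


report = run_report()
print(report)
```

Each output row is one campaign with its advertiser, spending, impression total, click total, CTR, CPC and CPM, in that order.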

APPENDIX - Additional Comments

• Redshift is good for aggregate calculations such as sum, average, max, min, etc. because it is a columnar database
• Importing large amounts of data takes a lot of time
– 17 hours for 1.2TB in our case
– Continuous importing is useful
• Redshift supports only “Separated” formats like CSV, TSV
– JSON is not supported
• Redshift supports only primitive data types
– 11 types, INT, DOUBLE, BOOLEAN, VARCHAR, DATE..

(as of Feb. 17, 2013)
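
The import figure above (17 hours for 1.2TB) translates into a rough load rate. The sketch below is a back-of-the-envelope calculation that assumes 1TB = 1000GB. The helper `estimated_load_hours` is hypothetical and assumes load time scales linearly with data size, which the benchmark did not measure.

```python
# Figures from the comment above: 1.2TB imported in 17 hours.
DATA_GB = 1200   # assumes 1TB = 1000GB
LOAD_HOURS = 17

gb_per_hour = DATA_GB / LOAD_HOURS
mb_per_sec = DATA_GB * 1000 / (LOAD_HOURS * 3600)


def estimated_load_hours(size_gb):
    """Rough estimate, assuming load time scales linearly with size."""
    return size_gb / gb_per_hour


print(round(gb_per_hour, 1), round(mb_per_sec, 1))
```

At that rate a one-off bulk load of a large data set takes the better part of a day, which is why continuous importing is useful.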

APPENDIX – Additional Information

• All resources for our benchmark are on our GitHub repository
– https://github.com/hapyrus/redshift-benchmark
– The dataset we use is open on S3, so you can reproduce the benchmark


About Us - FlyData

• FlyData Enterprise

– Enables continuous loading to Amazon Redshift, with real-time data loading

– Automated ETL process with multiple supported data formats

– Auto scaling, data Integrity and high durability

– FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift

Contact us at: [email protected]

We are an official data integration partner of Amazon Redshift

Formerly known as Hapyrus



Check us out! -> http://flydata.com

[email protected]

Toll Free: 1-855-427-9787
