15-319 / 15-619 cloud computing - cs.cmu.edumsakr/15619-f16/recitations/f16_recitation12.pdf ·...

15-319 / 15-619 Cloud Computing Recitation 12 November 15 th 2016



Page 1: 15-319 / 15-619 Cloud Computing - cs.cmu.edumsakr/15619-f16/recitations/F16_Recitation12.pdf · –Run spark jobs against streaming data • MLlib –Machine learning library •

15-319 / 15-619 Cloud Computing

Recitation 12

November 15th 2016


Overview

• Last week’s reflection
– Team project phase 2
– Quiz 10

• This week’s schedule
– Project 4.2
– Quiz 11

• Twitter Analytics: The Team Project
– Phase 2 report due
– Phase 3 out




● Monitor AWS expenses regularly and tag all resources
○ Check your bill (Cost Explorer > filter by tags)

● Piazza Guidelines
○ Please tag your questions appropriately
○ Search for an existing answer first

● Provide clean, modular and well documented code
○ Large penalties for not doing so
○ Double check that your code is submitted!! (verify by downloading it from TPZ from the submissions page)

● Utilize Office Hours
○ We are here to help (but not to give solutions)

● Use the team AWS account and tag the Team Project resources carefully


Conceptual Modules to Read on OLI

• UNIT 5: Distributed Programming and Analytics Engines for the Cloud
– Module 18: Introduction to Distributed Programming for the Cloud
– Module 19: Distributed Analytics Engines for the Cloud: MapReduce
– Module 20: Distributed Analytics Engines for the Cloud: Spark
– Module 21: Distributed Analytics Engines for the Cloud: GraphLab
– Module 22: Message Queues and Stream Processing



Project 4

• Project 4.1, Batch Processing with MapReduce

– MapReduce Programming Using YARN

• Project 4.2

– Iterative Batch Processing Using Apache Spark

• Project 4.3

– Stream Processing using Kafka/Samza



Typical MapReduce Batch Job

• Simplistic view of a MapReduce job

• You simply write code for the
– Mapper
– Reducer

• Inputs are read from disk and outputs are written to disk

– Intermediate data is spilled to local disk


[Figure: HDFS → Mapper → Reducer → HDFS]




Iterative MapReduce Jobs

• Some applications require iterative processing
• E.g., machine learning

• MapReduce: Data is always spilled to disk

– This adds overhead to each iteration

– Can we keep data in memory? Across Iterations?

– How do you manage this?


[Figure: HDFS → Mapper → Reducer → HDFS, then "Prepare data for the next iteration"]



Resilient Distributed Datasets (RDDs)

● RDDs
○ can be in-memory or on disk
○ are read-only objects
○ are partitioned across the cluster
■ partitioned across machines based on a range or the hash of a key in each record
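
The two partitioning strategies named above can be sketched in plain Python, without Spark. The function names, the CRC32 hash and the boundary convention are assumptions made for illustration; this is not how Spark's own partitioners are implemented.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Pick a partition from a stable hash of the key (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Pick a partition from sorted upper boundaries.

    With boundaries [10, 20]: keys below 10 go to partition 0,
    keys from 10 up to (but not including) 20 go to partition 1,
    and keys of 20 or more go to partition 2.
    """
    return bisect.bisect_right(boundaries, key)


# Hypothetical records keyed by user id.
records = [(3, "a"), (12, "b"), (25, "c"), (10, "d")]
by_range = [range_partition(key, [10, 20]) for key, _ in records]
by_hash = [hash_partition(key, 4) for key, _ in records]
```

Hash partitioning tends to spread keys evenly, while range partitioning keeps neighbouring keys on the same machine, which can help with sorted or range-based access.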



Operations on RDDs

• Loading
>>> input_RDD = sc.textFile("text.file")

• Transformation
– Apply an operation and derive a new RDD
>>> transform_RDD = input_RDD.filter(lambda x: "abcd" in x)

• Action
– Run a computation on an RDD and return a single object
>>> print("Number of 'abcd': " + str(transform_RDD.count()))



RDDs and Fault Tolerance

• Transformations create new RDDs
• Instead of replication, recreate RDDs on failure
• Recreate RDDs using lineage

– RDDs store the transformations required to bring them to current state

– Provides a form of resilience even though they can be in-memory
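
The lineage idea can be illustrated with a toy class in plain Python. `TinyRDD` and its methods are hypothetical names invented for this sketch; it only mimics the concept on a single machine and is not Spark's implementation.

```python
class TinyRDD:
    """Toy stand-in for an RDD: remembers how it was derived instead of replicating data."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable that re-reads the base data
        self._lineage = tuple(lineage)   # recorded transformations, in order
        self._cache = None               # in-memory copy, may be lost

    def map(self, fn):
        # Transformation: returns a new object, nothing is computed yet.
        return TinyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new object, nothing is computed yet.
        return TinyRDD(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        if self._cache is None:
            data = list(self._source())
            for kind, fn in self._lineage:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._cache = data
        return self._cache

    def lose_memory(self):
        # Simulate a failure that wipes the in-memory data.
        self._cache = None

    def count(self):
        # Action: triggers computation and returns a single value.
        return len(self._compute())


base = TinyRDD(lambda: ["abcd", "xyz", "abcdef"])
matches = base.filter(lambda line: "abcd" in line)
before_failure = matches.count()
matches.lose_memory()
after_failure = matches.count()   # rebuilt by replaying the recorded lineage
```

Because each derived dataset carries its own recipe, losing the in-memory copy costs recomputation time rather than data.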



The Spark Framework



Spark Ecosystem

• Spark SQL
– Allows running of SQL-like queries against RDDs

• Spark Streaming
– Run Spark jobs against streaming data

• MLlib
– Machine learning library

• GraphX
– Graph-parallel framework



Project 4.2, Three Tasks

● Use Spark to analyze the Twitter social graph
○ Task 1
■ Number of nodes and edges
■ Number of followers for each user
○ Task 2
■ Run PageRank to compute the influence of users
■ Fast runs get a bonus
○ Task 3
■ Friend recommendation on the graph using GraphX



Project 4.2 - Three Tasks

1. Enumerate the Twitter Social Graph
– Find the number of nodes and edges
– Edges in the graph are directed. (u, v) and (v, u) should be counted as two edges
– Find the number of followers for each user

2. Rank each user by influence
– Run PageRank with 10 iterations
– Need to deal with dangling nodes

3. Friend recommendation
– Need to use GraphX and Scala.
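
The counting in Task 1 can be sketched on a tiny in-memory edge list before moving to Spark. This is plain Python, not the Spark program the project asks for. It assumes each edge is a (follower, followee) pair and that duplicate edges count once; neither assumption comes from the slides, so check the writeup.

```python
from collections import Counter


def graph_stats(edges):
    """Return (number of nodes, number of directed edges, followers per user)."""
    unique_edges = set(edges)      # assumption: repeated edges are counted once
    nodes = set()
    followers = Counter()
    for follower, followee in unique_edges:
        nodes.add(follower)
        nodes.add(followee)
        followers[followee] += 1   # an edge (u, v) gives v one follower
    return len(nodes), len(unique_edges), followers


# (1, 2) and (2, 1) are two distinct directed edges; (3, 1) appears twice.
sample_edges = [(1, 2), (2, 1), (3, 1), (3, 1)]
num_nodes, num_edges, follower_counts = graph_stats(sample_edges)
```

Users with no followers never appear as a followee, so they are absent from the counter and read as zero.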



Task 2: The PageRank Algorithm

• Give pages ranks (scores) based on links to them
• A page that has:
– Links from many pages ⇒ high rank
– Link from a high-ranking page ⇒ high rank

"PageRank-hi-res". Licensed under CC BY-SA 2.5 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:PageRank-hi-res.png#/media/File:PageRank-hi-res.png



The PageRank Algorithm

● For each Page i in dataset, Rank of i can be computed:

● Iterate for 10 iterations
● The formula you will implement in P4.2 is slightly more complex. Read the writeup carefully!!!
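
A simplified update rule, written to agree with the Scala snippet in this recitation rather than with the exact P4.2 formula, might look like the following. The symbols are assumptions chosen here: $B(i)$ is the set of pages linking to page $i$, $L(j)$ is the number of outgoing links of page $j$, $N$ is the total number of pages, and $a$ is the random-jump probability.

```latex
R(i) = \frac{a}{N} + (1 - a) \sum_{j \in B(i)} \frac{R(j)}{L(j)}
```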

Page 17: 15-319 / 15-619 Cloud Computing - cs.cmu.edumsakr/15619-f16/recitations/F16_Recitation12.pdf · –Run spark jobs against streaming data • MLlib –Machine learning library •

PageRank in Spark (Scala)
(Note: This is a simpler version of PageRank than P4.2)

val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    // Pattern-match the joined record: (url, (outgoing links, current rank))
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }

  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x, y) => x + y)
                  .mapValues(sum => a/N + (1-a)*sum)
}
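
For experimenting on a tiny graph without a cluster, the same iteration can be sketched in plain Python. The way dangling nodes are handled here (spreading their rank evenly over all nodes) is one common convention and is an assumption; the P4.2 writeup defines the rule you must actually implement. The function name, the default `a=0.15` and the sample graph are also illustrative.

```python
def pagerank(edges, iterations=10, a=0.15):
    """Simplified PageRank over (source, destination) pairs; returns {node: rank}."""
    out_links = {}
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        out_links.setdefault(src, set()).add(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        contribs = {node: 0.0 for node in nodes}
        dangling_mass = 0.0
        for node in nodes:
            targets = out_links.get(node)
            if targets:
                share = ranks[node] / len(targets)
                for dst in targets:
                    contribs[dst] += share
            else:
                # Dangling node: no outgoing links, so its rank would otherwise vanish.
                dangling_mass += ranks[node]
        spread = dangling_mass / n   # assumption: share dangling rank equally
        ranks = {
            node: a / n + (1 - a) * (contribs[node] + spread)
            for node in nodes
        }
    return ranks


# "D" has no outgoing links, so it is a dangling node.
sample_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
sample_ranks = pagerank(sample_edges)
```

With this convention the ranks keep summing to 1 after every iteration, which is a convenient sanity check while debugging.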



Graph Processing using GraphX

Task 3: Recommendation

• From all the people your followees follow (i.e., your 2nd-degree followees), recommend the one with the highest influence value.
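
The recommendation rule can be sketched in plain Python on a toy graph; the project itself requires GraphX and Scala, so this only illustrates the logic. Several details are assumptions not stated on the slide: edges are (follower, followee) pairs, a user is never recommended to themselves, users already followed are excluded, and ties are broken by the larger user id.

```python
def recommend(edges, influence):
    """Return {user: recommended user} based on 2nd-degree followees."""
    follows = {}
    for follower, followee in edges:
        follows.setdefault(follower, set()).add(followee)

    recommendations = {}
    for user, followees in follows.items():
        candidates = set()
        for followee in followees:
            candidates |= follows.get(followee, set())
        candidates.discard(user)   # assumption: do not recommend the user themselves
        candidates -= followees    # assumption: skip people already followed
        if candidates:
            recommendations[user] = max(
                candidates,
                key=lambda c: (influence.get(c, 0.0), c),  # assumption: tie-break on id
            )
    return recommendations


sample_edges = [(1, 2), (2, 3), (2, 4), (3, 1)]
sample_influence = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.5}
sample_recommendations = recommend(sample_edges, sample_influence)
```

Here user 1 follows user 2, who follows users 3 and 4, so user 1 is offered user 4, the more influential of the two.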



Hints for Launching a Spark Cluster

• Use the Spark-EC2 scripts
• There are command line options to specify instance types and spot pricing
• Spark is an in-memory system
– test with a single instance first
• Develop and test your scripts on a portion of the dataset before launching a cluster



Spark Shell

• Like the Python shell

• Run commands interactively

• On the master, execute (from /root)
– ./spark/bin/spark-shell
– ./spark/bin/pyspark



P4.2 Grading - 1

● Submit your work in the submitter instance
● Don’t forget to submit your code
● For Task 1
○ Put the number of nodes and edges in the answer file
○ Put your Spark program that counts the number of followers for each user in the given folder.
○ Run the submitter to submit
● For Task 2
○ Put your Spark program that calculates the PageRank score for each user in the given folder.
○ Run the submitter to submit
○ Bonus for execution time < 1800 seconds



P4.2 Grading - 2

● Submit your work in the submitter instance
● Don’t forget to submit your code
● For Task 3
○ Put your Spark program that recommends a user for every user in the given folder.

○ Run the submitter to submit



Upcoming Deadlines

● Team Project : Phase 2

○ Code and report due: 11/15/2016 11:59 PM Pittsburgh

● Quiz 11

○ Due: 11/18/2016 11:59 PM Pittsburgh

● Project 4.2 : Iterative Programming with Spark

○ Due: 11/20/2016 11:59 PM Pittsburgh

● Team Project : Phase 3

○ Live-test due: 12/04/2016 3:59 PM Pittsburgh

○ Code and report due: 12/06/2016 11:59 PM Pittsburgh


Team Project Phase 3 Deadlines

● Team Project Phase 1 & 2 (Live Test 1 and Code + Report Submissions): Monday 10/10/2016 00:00:01 ET to Tuesday 11/15/2016 23:59:59 ET

● Team Project Phase 3: due Sunday 12/04/2016 15:59:59 ET

● Team Project Phase 3 Live Test: ends Sunday 12/04/2016 23:59:59 ET

● Team Project Phase 3 Code & Report Due: Tuesday 12/06/2016 23:59:59 ET



Team Project Time Table


| Phase (and query due) | Start | Deadline | Code and Report Due |
| Phase 1 (Q1, Q2) | Monday 10/10/2016 00:00:01 ET | Sunday 10/30/2016 23:59:59 ET | Tuesday 11/01/2016 23:59:59 ET |
| Phase 2 (Q1, Q2, Q3) | Monday 10/31/2016 00:00:01 ET | Sunday 11/13/2016 15:59:59 ET | |
| Phase 2 Live Test (HBase/MySQL) (Q1, Q2, Q3) | Sunday 11/13/2016 18:00:01 ET | Sunday 11/13/2016 23:59:59 ET | Tuesday 11/15/2016 23:59:59 ET |
| Phase 3 (Q1, Q2, Q3, Q4) | Monday 11/14/2016 00:00:01 ET | Sunday 12/04/2016 15:59:59 ET | |
| Phase 3 Live Test (Q1, Q2, Q3, Q4) | Sunday 12/04/2016 18:00:01 ET | Sunday 12/04/2016 23:59:59 ET | Tuesday 12/06/2016 23:59:59 ET |


HBase scoreboard


MySQL scoreboard


Phase 3

● One last query (Q4)
○ Serving write requests
○ Front end caching will not work during the live test
○ Three types of requests: read, write and delete

● Live Test!
○ Warmup, Q1, Q2, Q3, Q4, Mixed Q1-Q4
■ Each for 30 min
○ Choose HBase or MySQL or a hybrid
■ Submit one DNS



Query 4: Tweet Server

There are five different parameters in the request URL for a request to /q4.

● tweetid (tweet ID)

● op (operation type)

● seq (sequence number)

● field (one field in the request)

● payload (the payload for the new field)

Execute the requests for a given tweetid in order of their seq (sequence number)
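
One way to enforce this ordering is to hold back requests that arrive early until the missing sequence numbers show up. The sketch below is plain Python and single-threaded. It assumes that sequence numbers for each tweetid start at 1 and increase by 1, which matches the example requests in this recitation but is not stated as a rule. `SequencedExecutor` and `apply_fn` are hypothetical names.

```python
class SequencedExecutor:
    """Run requests for each tweetid strictly in seq order, buffering early arrivals."""

    def __init__(self, apply_fn):
        self._apply = apply_fn   # callback that actually performs a request
        self._next_seq = {}      # tweetid -> next sequence number allowed to run
        self._pending = {}       # tweetid -> {seq: request} waiting for their turn

    def submit(self, tweetid, seq, request):
        waiting = self._pending.setdefault(tweetid, {})
        waiting[seq] = request
        expected = self._next_seq.get(tweetid, 1)   # assumption: seq starts at 1
        # Drain every request that is now unblocked.
        while expected in waiting:
            self._apply(tweetid, waiting.pop(expected))
            expected += 1
        self._next_seq[tweetid] = expected


executed = []
executor = SequencedExecutor(lambda tweetid, request: executed.append((tweetid, request)))

# Requests arrive out of order; nothing runs until seq 1 shows up.
executor.submit(15213, 2, "read")
executor.submit(15213, 3, "delete")
executor.submit(15213, 1, "write")
```

A real front end handles requests concurrently and possibly across several servers, so it would also need locking or a single owner per tweetid; that is outside the scope of this sketch.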



Query 4: Tweet Server

| field | type | example |
| tweetid | long int | 15213 |
| userid | long int | 156190000001 |
| username | string | CloudComputing |
| timestamp | string | Mon Feb 15 19:19:57 2016 |
| text | string | Welcome to P4!#CC15619#P3 |
| hashtag | tabulation + string | CC15619\tP3 |



● Write Request /q4?tweetid=15213&op=write&seq=1&field=hashtag&payload=awesome



Query 4: Tweet Server



● Read Request /q4?tweetid=15213&op=read&seq=2&field=hashtag&payload=



Query 4: Tweet Server



● Delete Request /q4?tweetid=15213&op=delete&seq=3&field=hashtag&payload=



Query 4: Tweet Server



● Read Request /q4?tweetid=15213&op=read&seq=4&field=hashtag&payload=
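
The example URLs above can be split into their five parameters with the Python standard library. This is a minimal sketch: `parse_q4` is a hypothetical helper name, and a real front end would normally rely on its web framework's own query parsing and add validation.

```python
from urllib.parse import parse_qs, urlsplit


def parse_q4(url):
    """Extract the five /q4 parameters from a request URL."""
    # keep_blank_values=True preserves the empty payload on read and delete requests.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {
        "tweetid": int(query["tweetid"][0]),
        "op": query["op"][0],
        "seq": int(query["seq"][0]),
        "field": query["field"][0],
        "payload": query["payload"][0],
    }


write_request = parse_q4("/q4?tweetid=15213&op=write&seq=1&field=hashtag&payload=awesome")
read_request = parse_q4("/q4?tweetid=15213&op=read&seq=2&field=hashtag&payload=")
```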


Query 4: Tweet Server



● Part of the queries
○ Write or Delete, but always before a Read

● Rest of the queries
○ Read only

Query 4: Tweet Server



● Don’t blindly optimize every component; identify the bottlenecks using fine-grained profiling.

● Use caches wisely: caching in HBase and MySQL is important, but storing everything in the front-end cache will lead to failure during the live test.

● Review what we have learned in previous project modules
○ Scale out
○ Load balancing
○ Replication and sharding
○ Strong consistency (correctness is very important in Q4)

● Look at the feedback of your Phase 1 & 2 report!

Team Project General Hints



● MySQL DBs behind an ELB may require a forwarding


● Consider forwarding the requests but pay attention to


● Consider batch writes.

● Think about effective distributed caching techniques.

● Don’t block your front-end server.
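
The batch-write hint can be illustrated with a small buffer that forwards rows to the database in groups. This is a plain Python sketch: `BatchWriter`, `flush_fn` and the batch size are hypothetical, and the database call is replaced by a list so the example runs on its own.

```python
class BatchWriter:
    """Collect rows and hand them to flush_fn in groups instead of one at a time."""

    def __init__(self, flush_fn, batch_size=100):
        self._flush_fn = flush_fn      # stand-in for a multi-row database write
        self._batch_size = batch_size
        self._buffer = []

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, even if the batch is not full.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush_fn(batch)


flushed_batches = []
writer = BatchWriter(flushed_batches.append, batch_size=2)
for row in ["row1", "row2", "row3"]:
    writer.add(row)
writer.flush()   # push the final partial batch
```

Batching trades latency for throughput: a buffered write is not yet visible to a later read, so in Q4, where a read must observe earlier writes, the buffer would have to be consulted or flushed before serving that read.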

Team Project, Q4 Hints



Phase 3 Live Test, 12/04

| Time | Value | Target | Weight |
| 3:59 pm ET | Deadline to submit your DNS | | |
| 6:00 pm - 6:30 pm | Warm-up (Q1 only) | - | 0% |
| 6:30 pm - 7:00 pm | Q1 | 27000 | 5% |
| 7:00 pm - 7:30 pm | Q2 | 4000 | 15% |
| 7:30 pm - 8:00 pm | Q3 | 7000 | 15% |
| 8:00 pm - 8:30 pm | Q4 | 3000 | 15% |
| 8:30 pm - 9:00 pm | Mixed Reads (Q1, Q2, Q3, Q4) | TBD | 5+5+5+5 = 20% |

● Phase 3 report is worth 30% of the Phase 3 grade.

● Phase 3 grade is worth 12% of the course grade!
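
As a quick arithmetic check, the live-test weights and the report weight listed above add up to the whole Phase 3 grade. The helper below is a hypothetical illustration of how a Phase 3 percentage translates into course-grade points under the 12% figure.

```python
# Weights (percent of the Phase 3 grade) taken from the live-test schedule.
live_test_weights = {"Q1": 5, "Q2": 15, "Q3": 15, "Q4": 15, "Mixed": 5 + 5 + 5 + 5}
report_weight = 30

live_test_total = sum(live_test_weights.values())   # 70
phase3_total = live_test_total + report_weight      # 100

COURSE_SHARE = 12   # Phase 3 is worth 12% of the course grade


def course_points(phase3_percent):
    """Convert a share of the Phase 3 grade into course-grade percentage points."""
    return phase3_percent * COURSE_SHARE / 100


q4_course_points = course_points(live_test_weights["Q4"])
```

Under these numbers, Q4 alone accounts for roughly 1.8 percentage points of the course grade.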



Team Project, Phase 3 Deadlines

● Phase 3 Development
○ Submission by 15:59 ET (Pittsburgh) Sunday 12/04
■ Live Test from 6 PM to 10 PM ET
○ Fix Q1 - Q3 if they did not go well
○ New query Q4
○ Phase 3 counts for 60% of the Team Project grade

● Phase 3 Report
○ Submission 23:59:59 ET (Pittsburgh) Tuesday 12/06
○ Explain in detail the strategies you used
○ Difficulties you encountered even if you didn’t get a good score


