SCALABLE RESOURCE MANAGEMENT IN CLOUD COMPUTING
BY
IMAN SADOOGHI
DEPARTMENT OF COMPUTER SCIENCE
Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science in the Graduate College of the Illinois Institute of Technology
Approved _________________________ Adviser
Chicago, Illinois September 2016
ACKNOWLEDGEMENT
This dissertation could not have been completed without the consistent
encouragement and support of my advisor, Dr. Ioan Raicu, and my family. Many
more people helped me succeed throughout my PhD studies, particularly my
dissertation committee: Dr. Zhiling Lan, Dr. Boris Glavic, and Dr. Erdal Oruklu.
In addition, I would like to thank all my colleagues and collaborators for the
time and effort they devoted to countless meetings, brainstorming sessions, and experiments.
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENT ....................................................................................... iii
LIST OF TABLES ................................................................................................... vi
LIST OF FIGURES ................................................................................................. vii
ABSTRACT ............................................................................................................. ix
CHAPTER
1. INTRODUCTION .............................................................................. 1
2. UNDERSTANDING THE PERFORMANCE AND POTENTIAL OF
CLOUD COMPUTING FOR SCIENTIFIC APPLICATIONS ......... 13
2.1 Background and Motivation ....................................................... 13
2.2 Performance Evaluation .............................................................. 16
2.3 Cost Analysis .............................................................................. 46
2.4 Summary ..................................................................................... 51
3. ACHIEVING EFFICIENT DISTRIBUTED SCHEDULING WITH
MESSAGE QUEUES IN THE CLOUD FOR MANY-TASK COMPUTING
AND HIGH-PERFORMANCE COMPUTING ...................................... 54
3.1 Background and Motivation ....................................................... 54
3.2 Design and Implementation of CloudKon .................................. 57
3.3 Performance Evaluation .............................................................. 67
3.4 Summary ..................................................................................... 81
4. FABRIQ: LEVERAGING DISTRIBUTED HASH TABLES TOWARDS
DISTRIBUTED PUBLISH-SUBSCRIBE MESSAGE QUEUES ..... 84
4.1 Background and Motivation ....................................................... 84
4.2 Fabriq Architecture and Design Principles ................................. 88
4.3 Network Communication Cost ................................................... 102
4.4 Performance Evaluation .............................................................. 106
4.5 Summary ..................................................................................... 116
5. ALBATROSS: AN EFFICIENT CLOUD-ENABLED TASK
SCHEDULING AND EXECUTION FRAMEWORK USING
DISTRIBUTED MESSAGE QUEUES ................................................... 119
5.1 Background and Motivation ....................................................... 120
5.2 System Overview ........................................................................ 128
5.3 Map-Reduce on Albatross .......................................................... 136
5.4 Performance Evaluation .............................................................. 137
5.5 Summary ..................................................................................... 155
6. RELATED WORK .................................................................................. 157
6.1 The Cloud Performance for Scientific Computing ..................... 157
6.2 Distributed Scheduling Frameworks .......................................... 160
6.3 Distributed Message Queues ...................................................... 163
6.4 Distributed Job Scheduling and Execution Frameworks ............ 165
7. CONCLUSION AND FUTURE WORK ................................................ 169
7.1 Conclusions ................................................................................. 169
7.2 Future Work ................................................................................ 173
BIBLIOGRAPHY .................................................................................................... 176
LIST OF TABLES
Table Page
1. Performance summary of EC2 Instances ........................................................ 50
2. Cost-efficiency summary of EC2 Instances .................................................... 51
3. Performance and Cost-efficiency summary of AWS Services ....................... 51
4. Comparison of Fabriq, SQS and Kafka ........................................................... 108
LIST OF FIGURES
Figure Page
1. CacheBench read benchmark results, one benchmark process per instance .. 22
2. CacheBench write benchmark results, one benchmark process per instance . 24
3. iPerf: Network bandwidth in a single client and server connection ................ 25
4. CDF and hop distance of connection latency between instances .................... 28
5. HPL: compute performance of single instances vs. ideal performance .......... 29
6. Local POSIX read benchmark results on all instances ................................... 30
7. Maximum write/read throughput on different instances ................................. 31
8. RAID 0 setup benchmark for different transfer sizes – write ......................... 32
9. S3 performance, maximum read and write throughput ................................... 33
10. Comparing the read throughput of S3 and PVFS on different scales ............. 34
11. PVFS read on different transfer sizes over instance storage ........................... 35
12. Scalability of PVFS2 and NFS in read/write throughput using memory ........ 36
13. CDF plot for insert and lookup latency on 96 8xxl instances ......................... 37
14. Throughput comparison of DynamoDB with ZHT on different scales .......... 38
15. The LMPS application tasks time distributions .............................................. 39
16. A contour plot snapshot of the power prices in $/MWh ................................. 40
17. The runtime of LMPS on m1.large instances in different scales .................... 41
18. Raw performance comparison overview of EC2 vs. FermiCloud .................. 43
19. Efficiency of EC2 and FermiCloud running HPL application ........................ 45
20. Memory capacity and memory bandwidth cost .............................................. 47
21. CPU performance cost of instances ................................................................ 48
22. Cost of virtual cluster of m1.medium and cc1.4xlarge ................................... 48
23. Cost comparison of DynamoDB with ZHT ................................................... 49
24. CloudKon architecture overview .................................................................... 58
25. CloudKon-HPC architecture overview ........................................................... 63
26. Communication cost ....................................................................................... 66
27. Throughput of CloudKon, Sparrow and MATRIX (MTC tasks) ................... 70
28. Throughput of CloudKon (HPC tasks) ........................................................... 72
29. Latency of CloudKon sleep 0 ms tasks ........................................................... 73
30. Cumulative distribution of the latency on the task execution step ................. 74
31. Cumulative distribution of the latency on the task submit step ..................... 75
32. Cumulative distribution of the latency on the result delivery step ................. 76
33. Efficiency with homogeneous workloads with different task lengths ............ 78
34. Efficiency of the systems running heterogeneous workloads ......................... 80
35. The overhead of task execution consistency on CloudKon ............................ 81
36. Fabriq servers and clients with many user queues .......................................... 91
37. Structure of a Fabriq server ............................................................................. 93
38. Push operation ................................................................................................. 95
39. A remote pop operation with a single hop cost ............................................... 98
40. Cumulative distribution of 1 client and 64 servers ........................................ 105
41. Load balance of Fabriq vs. Kafka on 64 instances ......................................... 110
42. Average latency of push and pop operations .................................................. 111
43. Cumulative distribution of the push latency .................................