

    SCALABLE RESOURCE MANAGEMENT IN CLOUD COMPUTING

    BY

    IMAN SADOOGHI

    DEPARTMENT OF COMPUTER SCIENCE

    Submitted in partial fulfillment of the requirements for the degree of

    Doctor of Philosophy in Computer Science in the Graduate College of the Illinois Institute of Technology

    Approved _________________________ Adviser

    Chicago, Illinois September 2016

    iii

    ACKNOWLEDGEMENT

    This dissertation could not have been completed without the consistent
    encouragement and support of my advisor, Dr. Ioan Raicu, and of my family. Many
    more people helped me succeed throughout my PhD studies, particularly my
    dissertation committee: Dr. Zhiling Lan, Dr. Boris Glavic, and Dr. Erdal Oruklu.

    In addition, I would like to thank all my colleagues and collaborators for the
    time and effort they devoted to countless meetings, brainstorming sessions, and experiments.

    iv

    TABLE OF CONTENTS

    Page

    ACKNOWLEDGEMENT ....................................................................................... iii

    LIST OF TABLES ................................................................................................... vi

    LIST OF FIGURES ................................................................................................. vii

    ABSTRACT ............................................................................................................. ix

    CHAPTER

    1. INTRODUCTION .............................................................................. 1

    2. UNDERSTANDING THE PERFORMANCE AND POTENTIAL OF

    CLOUD COMPUTING FOR SCIENTIFIC APPLICATIONS ......... 13

    2.1 Background and Motivation ....................................................... 13
    2.2 Performance Evaluation .............................................................. 16
    2.3 Cost Analysis .............................................................................. 46
    2.4 Summary ..................................................................................... 51

    3. ACHIEVING EFFICIENT DISTRIBUTED SCHEDULING WITH

    MESSAGE QUEUES IN THE CLOUD FOR MANY-TASK COMPUTING

    AND HIGH-PERFORMANCE COMPUTING ...................................... 54

    3.1 Background and Motivation ....................................................... 54
    3.2 Design and Implementation of CloudKon .................................. 57
    3.3 Performance Evaluation .............................................................. 67
    3.4 Summary ..................................................................................... 81

    4. FABRIQ: LEVERAGING DISTRIBUTED HASH TABLES TOWARDS

    DISTRIBUTED PUBLISH-SUBSCRIBE MESSAGE QUEUES ..... 84

    4.1 Background and Motivation ....................................................... 84
    4.2 Fabriq Architecture and Design Principles ................................. 88
    4.3 Network Communication Cost ................................................... 102
    4.4 Performance Evaluation .............................................................. 106
    4.5 Summary ..................................................................................... 116

    v

    5. ALBATROSS: AN EFFICIENT CLOUD-ENABLED TASK

    SCHEDULING AND EXECUTION FRAMEWORK USING

    DISTRIBUTED MESSAGE QUEUES ................................................... 119

    5.1 Background and Motivation ....................................................... 120
    5.2 System Overview ........................................................................ 128
    5.3 Map-Reduce on Albatross .......................................................... 136
    5.4 Performance Evaluation .............................................................. 137
    5.5 Summary ..................................................................................... 155

    6. RELATED WORK .................................................................................. 157

    6.1 The Cloud Performance for Scientific Computing ..................... 157
    6.2 Distributed Scheduling Frameworks .......................................... 160
    6.3 Distributed Message Queues ...................................................... 163
    6.4 Distributed Job Scheduling and Execution Frameworks ............ 165

    7. CONCLUSION AND FUTURE WORK ................................................ 169

    7.1 Conclusions ................................................................................. 169
    7.2 Future Work ................................................................................ 173

    BIBLIOGRAPHY .................................................................................................... 176

    vi

    LIST OF TABLES

    Table Page

    1 Performance summary of EC2 Instances ........................................................ 50
    2 Cost-efficiency summary of EC2 Instances .................................................... 51
    3 Performance and Cost-efficiency summary of AWS Services ....................... 51
    4 Comparison of Fabriq, SQS and Kafka ........................................................... 108

    vii

    LIST OF FIGURES

    Figure Page

    1. CacheBench read benchmark results, one benchmark process per instance .... 22
    2. CacheBench write benchmark results, one benchmark process per instance .. 24
    3. iPerf: Network bandwidth in a single client and server connection ................ 25
    4. CDF and Hop distance of connection latency between instances ................... 28
    5. HPL: compute performance of single instances vs. ideal performance .......... 29
    6. Local POSIX read benchmark results on all instances ................................... 30
    7. Maximum write/read throughput on different instances ................................. 31
    8. RAID 0 Setup benchmark for different transfer sizes – write ......................... 32
    9. S3 performance, maximum read and write throughput ................................... 33
    10. Comparing the read throughput of S3 and PVFS on different scales ............. 34
    11. PVFS read on different transfer sizes over instance storage ........................... 35
    12. Scalability of PVFS2 and NFS in read/write throughput using memory ........ 36
    13. CDF plot for insert and look up latency on 96 8xxl instances ........................ 37
    14. Throughput comparison of DynamoDB with ZHT on different scales .......... 38
    15. The LMPS application tasks time distributions .............................................. 39
    16. A contour plot snapshot of the power prices in $/MWh ................................. 40
    17. The runtime of LMPS on m1.large instances in different scales .................... 41
    18. Raw performance comparison overview of EC2 vs. FermiCloud .................. 43
    19. Efficiency of EC2 and FermiCloud running HPL application ........................ 45
    20. Memory capacity and memory bandwidth cost .............................................. 47
    21. CPU performance cost of instances ................................................................ 48

    viii

    22. Cost of virtual cluster of m1.medium and cc1.4xlarge ................................... 48
    23. Cost Comparison of DynamoDB with ZHT ................................................... 49
    24. CloudKon architecture overview .................................................................... 58
    25. CloudKon-HPC architecture overview ........................................................... 63
    26. Communication Cost ....................................................................................... 66
    27. Throughput of CloudKon, Sparrow and MATRIX (MTC tasks) ................... 70
    28. Throughput of CloudKon (HPC tasks) ........................................................... 72
    29. Latency of CloudKon sleep 0 ms tasks ........................................................... 73
    30. Cumulative Distribution of the latency on the task execution step ................. 74
    31. Cumulative Distribution of the latency on the task submit step ..................... 75
    32. Cumulative Distribution of the latency on the result delivery step ................. 76
    33. Efficiency with homogenous workloads with different task lengths .............. 78
    34. Efficiency of the systems running heterogeneous workloads ......................... 80
    35. The overhead of task execution consistency on CloudKon ............................ 81
    36. Fabriq servers and clients with many user queues .......................................... 91
    37. Structure of a Fabriq server ............................................................................. 93
    38. Push operation ................................................................................................. 95
    39. A remote pop operation with a single hop cost ............................................... 98
    40. Cumulative Distribution of 1 client and 64 servers ........................................ 105
    41. Load Balance of Fabriq vs. Kafka on 64 instances ......................................... 110
    42. Average latency of push and pop operations .................................................. 111
    43. Cumulative distribution of the push latency .................................