

COSC 6397

Big Data Analytics

Grids, Clouds and Volunteer Systems

Edgar Gabriel

Spring 2014

Grid Computing

• Definition 1: Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities (1998)

• Definition 2: A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial Quality of Service (2002)

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt


An Example: The Globus Toolkit

- Initially developed at Argonne National Lab/University of Chicago and ISI/University of Southern California

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt

How It Started

While helping to build/integrate a diverse range of distributed applications, the same problems kept showing up over and over again.

– Too hard to keep track of authentication data (ID/password) across institutions

– Too hard to monitor system and application status across institutions

– Too many ways to submit jobs

– Too many ways to store & access files and data

– Too many ways to keep track of data

– Too easy to leave “dangling” resources lying around (robustness)

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt


Forget Homogeneity!

• Trying to force homogeneity on users is futile. Everyone has their own preferences, sometimes even dogma.

• The Internet provides the model…

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt

Building a Grid (in Practice)

• Building a Grid system or application is currently an exercise in software integration.

– Define user requirements

– Derive system requirements or features

– Survey existing components

– Identify useful components

– Develop components to fit into the gaps

– Integrate the system

– Deploy and test the system

– Maintain the system during its operation

• This should be done iteratively, with many loops and eddies in the flow.

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt


How it Really Happens

[Diagram: components of a real Grid deployment — web browser, web portal, compute servers, database services, data catalog, data viewer tool, simulation tool, chat tool, telepresence monitor, cameras, certificate authority, credential repository, and registration service. Users work with client applications; application services organize VOs and enable access to other services; collective services aggregate and/or virtualize resources; resources implement standard access & management interfaces.]

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt

What Is the Globus Toolkit?

• The Globus Toolkit is a collection of solutions to problems that frequently come up when trying to build collaborative distributed applications.

• Not turnkey solutions, but building blocks and tools for application developers and system integrators.

– Some components (e.g., file transfer) go farther than others (e.g., remote job submission) toward end-user relevance.

• To date, the Toolkit has focused on simplifying heterogeneity for application developers.

• The goal has been to capitalize on and encourage use of existing standards (IETF, W3C, OASIS, GGF).

– The Toolkit also includes reference implementations of new/proposed standards in these organizations.

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt


How To Use the Globus Toolkit

• By itself, the Toolkit has surprisingly limited end user value.

– There’s very little user interface material there.

– You can’t just give it to end users (scientists, engineers, marketing specialists) and tell them to do something useful!

• The Globus Toolkit is useful to application developers and system integrators.

– You’ll need to have a specific application or system in mind.

– You’ll need to have the right expertise.

– You’ll need to set up prerequisite hardware/software.

– You’ll need to have a plan.

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt

Globus Toolkit Components

[Diagram: toolkit components grouped by area, divided into Web Services (WS) components and non-WS components across GT2, GT3 and GT4.]

• Security: WS Authentication & Authorization; Pre-WS Authentication & Authorization; Community Authorization Service; Delegation Service; Credential Management

• Data Management: GridFTP; Reliable File Transfer; Replica Location Service; OGSA-DAI [Tech Preview]

• Execution Management: Grid Resource Allocation Mgmt (WS GRAM and Pre-WS GRAM); Community Scheduler Framework [contribution]

• Information Services: Monitoring & Discovery System (MDS4 and MDS2)

• Common Runtime: Java WS Core; C WS Core; Python WS Core [contribution]; C Common Libraries; XIO

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt


From Grids to Cloud Computing

• Logical steps:

– Make the grids public

– Provide much simpler interfaces (and more limited control)

– Charge for usage of resources

• Instead of relying on implicit incentives from science collaborations

• Ideally, a “pay-as-you-go” rate

• In reality:

– Different history

• Cloud computing as utility computing (1966 paper)

• However, the promise of cloud computing finds a great user base in science grids due to:

– Intense computations

– Huge storage needs

• Much of the Grid research community is now working on clouds

– How much of that is merely rebranding is worth understanding

Slide based on a talk by Anda Iamnitchi

http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt

Why volunteer computing?

● 2006: 1 billion PCs, 55% privately owned

● If 100M people participate:

– 100 PetaFLOPs, 1 Exabyte (10^18) storage

● Consumer products drive technology

– GPUs (NVIDIA, Sony Cell)

[Chart: the installed base of computers split among academic machines, business machines, and home PCs (“your computers”).]
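A rough sanity check of the 100M-participant figures, assuming (as an illustration, not from the slides) roughly 1 GFLOPS of CPU and 10 GB of spare disk per participating PC:

10^8 participants × 10^9 FLOP/s ≈ 10^17 FLOP/s = 100 PetaFLOPS

10^8 participants × 10 GB ≈ 10^18 bytes = 1 Exabyte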

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt


Volunteer computing history

[Timeline, 1995–2006, showing when volunteer-computing projects appeared: GIMPS, distributed.net, SETI@home, folding@home, commercial projects, climateprediction.net, BOINC, Einstein@home, Rosetta@home, Predictor@home, LHC@home, BURP, PrimeGrid, …]

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt

Scientific computing paradigms

[Chart: scientific computing paradigms placed along two axes, “Control” and “Bang/buck”: supercomputers offer the most control but the least bang for the buck, cluster and grid computing fall in between, and volunteer computing offers the least control but the most bang for the buck.]

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt


BOINC

[Diagram: BOINC sits between projects (SETI, physics, climate, biomedical) and volunteers (Joe, Alice, Jens); each volunteer can attach to several projects.]

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt

Participation in >1 project

• Better short-term resource utilization

– communicate/compute in parallel

– match applications to resources

• Better long-term resource utilization

– project A works while project B thinks

[Diagram: a project’s computing needs over time alternate between “think” and “work” phases.]

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt


Server performance

How many clients can a project support?

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt

Server limits

• Single server (2X Xeon, 100 Mbps disk)

– 8.8 million tasks/day

– 4.4 PetaFLOPS (if 12 hrs on 1 GFLOPS CPU)

– CPU is bottleneck (2.5% disk utilization)

– 8.2 Mbps network (if 10K request/reply)

• Multiple servers (1 MySQL, 2 for others)

– 23.6 million tasks/day

– MySQL CPU is bottleneck

– 21.9 Mbps network
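A rough back-of-the-envelope check of the single-server numbers, assuming (per the figures above) that each task keeps a 1 GFLOPS CPU busy for 12 hours and that each request/reply pair is about 10 KB:

8.8×10^6 tasks/day × (12 × 3600 s × 10^9 FLOP/s) ≈ 3.8×10^20 FLOP/day ≈ 4.4×10^15 FLOP/s = 4.4 PetaFLOPS

8.8×10^6 tasks/day × 10 KB ≈ 88 GB/day ≈ 8.2 Mbps of sustained network traffic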

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt


Credit system goals

• Retain participants

– fair between users, across projects

– understandable

– cheat-resistant

• Maximize utility to projects

– hardware upgrades

– assignment of projects to computers

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt

Credit system

• Computation credit

– benchmark-based

– application benchmarks

– application operation counting

– cheat-resistance: redundancy (each task is run on several hosts and the claimed credits are cross-checked before credit is granted)

• Other resources

– network, disk storage, RAM

• Other behaviors

– recruitment

– other participation

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt


Goals of BOINC

• > 100 projects, some churn

• Handle big data better

– BitTorrent integration

– Use GPUs and other resources

– DAGs

• Participation

– 10-100 million

– multiple projects per participant

Slide based on a lecture by David Anderson:

http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt

What is Condor?

• Full-featured batch queue system.

• Condor is a specialized workload management system for compute-intensive jobs.

• Condor provides a job queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management.

• Users submit their serial or parallel jobs to Condor

• Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

• Can be used to manage a cluster of dedicated compute nodes

• In addition, unique mechanisms enable Condor to effectively harvest wasted CPU power from otherwise idle desktop workstations.

• Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user

• Condor can be used to seamlessly combine all of an organization's computational power into one resource.

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt


History of Condor

• Hosted at University of Wisconsin, USA.

• Condor project started in 1988.

• Directed by Professor M.Livny.

• Preliminary version of the Condor Resource Management system implemented in 1986.

• Originally focused on the problem of Load Balancing in a distributed system.

• Shifted its attention to Distributively Owned computing environments where owners have full control over the resources they own.

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt

Status

• ~17 years of development.

• Condor team consists of ~30 people.

• Available on many platforms.

• Basic installation and usage very easy.

• Contracted + free support.

• Used in research environments and by industry.

• Sponsored by various major IT companies and organizations (IBM, Intel, Microsoft, NASA, …).

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt


Architecture

• Coordinated by a Central Manager Node.

• No central DBMS.

• Condor provides set of daemons defining the roles of each node in the pool.

• Daemons:

– condor_master: Basic coordination on each node.

– condor_collector: Collects system information. Only on Central Manager.

– condor_negotiator: Assigns jobs to machines. Only on Central Manager.

– condor_startd: Executes jobs.

– condor_schedd: Handles job submission.

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt

Condor Pool - Example

[Diagram: an example Condor pool. The Central Manager runs the master, collector, negotiator, schedd, and startd daemons; a Submit-Only node runs master and schedd; Execute-Only nodes run master and startd; Regular Nodes run master, schedd, and startd. Arrows mark ClassAd communication pathways and spawned processes.]
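A minimal configuration sketch of how these node roles are typically expressed; the host name and exact daemon lists below are illustrative assumptions, not values from the slides:

# Central Manager (hypothetical host name)
CONDOR_HOST = cm.example.org
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

# Submit-Only node
DAEMON_LIST = MASTER, SCHEDD

# Execute-Only node
DAEMON_LIST = MASTER, STARTD

# Regular node (submits and executes)
DAEMON_LIST = MASTER, SCHEDD, STARTD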

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt


Personal Condor vs. Condor Pool

• Condor Pool:

– Collection of several nodes coordinated by one Central Manager.

• Personal Condor:

– Condor on one workstation, no root access required, no system administrator intervention needed.

– Benefits of ‘pool’ with only one node (same as for a pool):

• Schedule large batches of jobs and have these processed in background.

• Keep an eye on jobs and get progress updates.

• Implement own scheduling policies on the execution order of jobs.

• Keep a log of the job activities.

• Add fault tolerance to the job execution.

• Implement policies for when jobs can run on a workstation.

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt

Dedicated Nodes vs. Non-Dedicated Nodes

• Dedicated Node:

– Condor has all CPUs at its disposal.

• Non-Dedicated Node:

– Can't always run Condor jobs.

– If the user is using the keyboard/mouse or the CPU is busy with other processes, Condor jobs are preempted.

– The policies for when Condor jobs may be started and when they are preempted are defined in the Condor configuration (see the sketch below).
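A minimal sketch of such a start/suspend policy in the configuration, using the standard KeyboardIdle and LoadAvg ClassAd attributes; the thresholds are illustrative assumptions, not values from the slides:

# start jobs only after 15 minutes without keyboard/mouse activity and with low load
START   = KeyboardIdle > 15 * 60 && LoadAvg < 0.3
# suspend a running job as soon as the owner returns or the machine becomes busy
SUSPEND = KeyboardIdle < 60 || LoadAvg > 0.5
# vacate the job if it stays suspended for more than 10 minutes
PREEMPT = (Activity == "Suspended") && (CurrentTime - EnteredCurrentActivity) > 10 * 60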

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt


Shared File System vs. File Distribution

• Use Shared Filesystem if available

– Administration and handling easier.

– Normally the case for a Dedicated Cluster.

• If no shared filesystem?

– Condor can transfer files.

– Can automatically send back changed files.

– Atomic transfer of multiple files.

– Data can be encrypted during transfer.

– Usually the case for pools with non-dedicated nodes or in a Grid environment.
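A minimal sketch of the submit-file commands that enable Condor’s file transfer when no shared filesystem is available; the input file names are hypothetical:

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat, params.cfg
# output files created or changed by the job are transferred back automatically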

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt

Condor Flocking

[Diagram: a Personal Condor on a user’s submission node can “flock” jobs to a dedicated pool and to a common user desktop pool.]

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt


Condor Configuration

• Simple format (based on ClassAd).

• Possible to use environment variables.

• Global configuration plus a local configuration specific to each node.

• Large set of configurable parameters.

• Example:

CONDOR_HOST = dfo09.hq.eso.org

RELEASE_DIR = /home/condor/INSTROOT/

LOCAL_DIR = $(TILDE)

LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local

REQUIRE_LOCAL_CONFIG_FILE = TRUE

CONDOR_ADMIN = [email protected]

MAIL = /bin/mail

UID_DOMAIN = hq.eso.org

FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt

Command Line Tools

• Many command line tools are provided – some of them are:

– condor_config_val: Get/set value of configuration parameters.

– condor_history: Query the history of completed jobs.

– condor_off: Stop Condor daemons.

– condor_q: Check job queue.

– condor_reconfig: Force sourcing of configuration.

– condor_rm: Remove jobs from the queue.

– condor_status: Status of Condor pool.

– condor_submit: Submit a job or a cluster of jobs.

– condor_submit_dag: Submit a set of jobs with dependencies.

– …
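A minimal usage sketch of a typical workflow with these tools; the submit-file name and the cluster id (42) are hypothetical:

condor_submit my_job.sub   # submit the cluster of jobs described in my_job.sub
condor_q                   # check the jobs while they are idle or running
condor_status              # see the state of the machines in the pool
condor_rm 42               # remove cluster 42 from the queue if necessary
condor_history             # inspect jobs after they have left the queue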

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt


Submitting Clusters of Jobs

# Example condor_submit input file that defines

# a cluster of 600 jobs with different directories

Universe = vanilla

Executable = my_job

Log = my_job.log

Arguments = -arg1 -arg2

Input = my_job.stdin

Output = my_job.stdout

Error = my_job.stderr

InitialDir = run_$(Process)

Queue 600
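Note: $(Process) expands to each job’s index within the cluster (0 through 599 here), so every job shares the same executable and arguments but runs with its own initial directory run_0 … run_599.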

Slide based on a talk by J.Knudstrup

http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt