
Overview and Evaluation of Conceptual Strategies for Accessing CPU-dependent Execution Resources in Grid Infrastructures

J Walsh, J Dukes, B Coghlan, G Pierantoni
School of Computer Science and Statistics
The University of Dublin, Trinity College

Background

• Original problem (Grid/GPGPU integration)
– Publication, Discovery, Job Submission & LRMS
– GPGPU only used for highly parallel compute tasks
– CPU needed for data I/O to GPGPU
• All GPGPU jobs need CPU, but not vice-versa
• Is this problem similar for other H/W or S/W?
– If so, can the problem be abstracted?
– Generalised method to support many H/W types?
• Do not want to create new GLUE Schema definition for each type
• Must accommodate the differences (i.e. must be extensible)

CPU-Dependent Execution Resource

• CPU-Dependent Execution Resource (CDER)
• Physically associated with a host (node-bound)
• CPU required to facilitate resource access
• Job execution split between CPU and CDER
• Job access to the CDER must be exclusive (or appear to the job to be exclusive)
• Finite number of batch system job-slots (usually one)
• e.g. GPGPU, FPGA, hardware media encoder

CDER GLUE Integration

• Modify GLUE2 and UI/WMS/CE?
– Slow adoption of GLUE changes (GLUE2 > 5 yrs)
– Not practical for CDERs that have yet to be envisaged
– Need a flexible, dynamic approach
• CDER support layer using existing GLUE2 schema and middleware?
– Not proposing changes to GLUE2
– Use wrappers/plugins for existing middleware
– Adapt quickly to describe new CDERs

Conceptual Strategies

• A Priori
• Named-Queue
• Tagged-Environment
• Attribute-Extension
• Class-Extension

Criteria for Evaluation

• Discoverability
• Semantic Resource Detail (intra)
– Level of Detail
– Structured Information
• Semantic Structure (inter)
– Associations between CDER and other Entities
• Dynamic Information
• Time Efficiency (how efficiently you can (i) query and (ii) update the CDER information)
• Space Efficiency (what is the size of the additional published CDER information)
• Discovery / Matchmaking / Submission support

A Priori

• “Non”-strategy
• Used by some Sites/VOs for GPGPU handling
• Requires knowledge of how to access the resource (e.g. Queue and Software)
• Discoverability: None
• Semantic Resource Detail: None
• Semantic Structure: None
• Dynamic Info: None

Named-Queue

• Discoverability: Queue Name suffix
– https://ce1.example.com/cream-pbs-nvidia_gpgpu
– Can match against Queue suffix in JDL
• Semantic Resource Detail: Minimal
• Semantic Structure: Minimal
– Limited to queue name suffix detail
• Dynamic Info: Limited
– Do #CPUs = #CDERs?
– Potential job requirements may never be satisfied
– Batch system limitations
– Under-utilisation of CPU resources
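
As a rough illustration of Named-Queue discovery, the sketch below filters a list of CE endpoint identifiers by queue-name suffix. The helper name `match_queue_suffix` and the second and third endpoints are hypothetical; only the first endpoint appears on the slide.

```python
def match_queue_suffix(ce_ids, suffix):
    """Return the CE identifiers whose queue name ends with the given suffix."""
    # The queue name is taken to be the last path component of the identifier.
    return [ce for ce in ce_ids if ce.rsplit("/", 1)[-1].endswith(suffix)]


# Illustrative endpoint list; only the first entry comes from the slide.
ce_ids = [
    "https://ce1.example.com/cream-pbs-nvidia_gpgpu",
    "https://ce2.example.com/cream-pbs-short",
    "https://ce3.example.com/cream-pbs-long",
]

gpgpu_ces = match_queue_suffix(ce_ids, "nvidia_gpgpu")
print(gpgpu_ces)
```

The suffix is the only resource detail available here, so capacity or utilisation cannot be matched this way.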

Tagged-Environment

• Discoverability: Published Tag
– e.g. GLUE 1.3 SoftwareEnvironment
• Semantic Resource Detail: coarse
– Naming convention (e.g. MPI_FEATURE_X)
• Semantic Structure: minimal
– must know relationship between published tags
• Dynamic Information: limited/difficult
– Difficult to encode CDER capacity and utilisation
– Very limited use with current M/W
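
A minimal sketch of tag-based matching, assuming each site publishes a flat list of opaque tags. The site names and the `GPGPU_*` tags are hypothetical, chosen only to follow the MPI_FEATURE_X naming style.

```python
def site_matches(published_tags, required_tags):
    """A site matches only if it publishes every required tag verbatim."""
    return set(required_tags) <= set(published_tags)


# Hypothetical tags published per site.
sites = {
    "site-a": ["MPI_FEATURE_X", "GPGPU_CUDA", "GPGPU_NVIDIA"],
    "site-b": ["MPI_FEATURE_X"],
}

required = ["GPGPU_CUDA", "GPGPU_NVIDIA"]
matched = sorted(name for name, tags in sites.items()
                 if site_matches(tags, required))
print(matched)
```

Nothing in the tags themselves links `GPGPU_CUDA` to `GPGPU_NVIDIA`; that relationship has to be known in advance, and a numeric value such as free capacity would need its own tag for every possible value.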

Attribute-Extension (I)

• GLUE2 entities can define multiple OtherInfo string attributes containing arbitrary string values
– Use to publish CDER specific K/V pairs
– Extended attributes internal to entity representing CDER
• Discoverability: yes
– via LDAP query, not WMS
• Semantic Resource Detail: fine
– Can encode arbitrary attributes
– Manufacturer, Model, Capacity, Utilisation, Memory, …
• Semantic Structure: yes

Attribute-Extension (II)

• Dynamic Information: yes
• Time Efficiency: medium
– to query key/values stored in OtherInfo, must retrieve and parse ALL OtherInfo strings
• Space Efficiency: good, compact structure
• 2-phase discovery and submission required

Example (App Environment)

objectClass: GLUE2ApplicationEnvironment
GLUE2ApplicationEnvironmentMaxJobs: 32
GLUE2ApplicationEnvironmentAppName: CUDA
GLUE2ApplicationEnvironmentFreeJobs: 30
GLUE2EntityOtherInfo: GPUCUDAComputeCapability=2.1
GLUE2EntityOtherInfo: GPUMainMemorySize=1024
GLUE2EntityOtherInfo: GPUCoresPerMP=48
GLUE2EntityOtherInfo: GPUCores=192
GLUE2EntityOtherInfo: GPUClockSpeed=1660
GLUE2EntityOtherInfo: GPUECCSupport=false
GLUE2EntityOtherInfo: GPUVendor=Nvidia
GLUE2EntityOtherInfo: GPUPerNode=2
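
The client-side parsing step that Attribute-Extension requires could look roughly like the sketch below. It assumes the entry has already been retrieved as a list of LDIF-style `attribute: value` lines; the function name `parse_other_info` is hypothetical and the entry is an abbreviated copy of the example above.

```python
def parse_other_info(ldif_lines):
    """Collect key=value pairs from every GLUE2EntityOtherInfo attribute."""
    pairs = {}
    for line in ldif_lines:
        attr, _, value = line.partition(": ")
        # Every OtherInfo string has to be inspected; other attributes are skipped.
        if attr != "GLUE2EntityOtherInfo" or "=" not in value:
            continue
        key, _, val = value.partition("=")
        pairs[key.strip()] = val.strip()
    return pairs


entry = [
    "objectClass: GLUE2ApplicationEnvironment",
    "GLUE2ApplicationEnvironmentAppName: CUDA",
    "GLUE2EntityOtherInfo: GPUMainMemorySize=1024",
    "GLUE2EntityOtherInfo: GPUVendor=Nvidia",
    "GLUE2EntityOtherInfo: GPUPerNode=2",
]

info = parse_other_info(entry)
# Values arrive as strings, so numeric comparisons need an explicit conversion.
suitable = (info.get("GPUVendor") == "Nvidia"
            and int(info.get("GPUMainMemorySize", 0)) >= 512)
print(info, suitable)
```

Because the pairs are packed into plain strings, the comparison happens only after the whole entry has been fetched and parsed, which is the source of the "medium" time-efficiency rating.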

Class-Extension (I)

• GLUE2 entities can be associated with multiple Extension class instances
– Each Extension object contains a single key/value pair
– Use to publish CDER specific K/V pairs
• Discoverability: yes
– via LDAP query, not WMS
• Semantic Resource Detail: fine
– Entity can reference multiple Extension objects
• Semantic Structure: fine
– Inherent key/value pairs rather than strings in Attribute-Extension

Class-Extension (II)

• Dynamic Information: yes
• Time Efficiency: high
– LDAP query using desired key name
– No need to extract key/value pairs from string
• Space Efficiency: low
– Each key/value pair requires a complete Extension object
– Less efficient than Attribute-Extension
– Greater overhead in resolving all K/V pairs
• More complex to realise (e.g. in LDAP) than Attribute-Extension
• 2-phase discovery and submission required

Example (Extension Class)

dn: GLUE2ExtensionLocalID=GPU_NVIDIA_P_1,GLUE2ShareID=gpgpu_gputestvo_wn136.grid.cs.tcd.ie_ComputingElement,GLUE2ServiceID=wn136.grid.cs.tcd.ie_ComputingElement,GLUE2GroupID=resource,o=glue
GLUE2ExtensionLocalID: GPU_NVIDIA_P_1
GLUE2ExtensionKey: GPUPerNode
objectClass: GLUE2Extension
GLUE2ExtensionValue: 2
GLUE2ExtensionEntityForeignKey: gpgpu_gputestvo_wn136.grid.cs.tcd.ie_ComputingElement
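
With Class-Extension, a key can be targeted directly in the LDAP filter. The sketch below only builds the filter string from the attribute names shown in the example; the helper name `extension_filter` is hypothetical, and special characters in keys or values are not escaped.

```python
def extension_filter(key, value=None):
    """Build an LDAP filter selecting GLUE2Extension objects by key (and value)."""
    clauses = ["(objectClass=GLUE2Extension)", f"(GLUE2ExtensionKey={key})"]
    if value is not None:
        # Equality match only; the value is compared as published text.
        clauses.append(f"(GLUE2ExtensionValue={value})")
    return "(&" + "".join(clauses) + ")"


key_only = extension_filter("GPUPerNode")
key_and_value = extension_filter("GPUPerNode", 2)
print(key_only)
print(key_and_value)
```

Each matching Extension object still has to be followed back to its parent entity through `GLUE2ExtensionEntityForeignKey`, and range comparisons on values would probably still be done on the client side.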

Strategy Summary

Two-phase GPGPU Submission

Example job requirements:
Requirements = GPUVendor=="Nvidia" && (GPUMainMemorySize >= 512);
GPUPerNode=2;

Phase 1 (LDAP Query Command, Global Resource Information Service)
(1) Convert GPGPU Requirements to LDAP query
(2) Query Global Information Service
(3) Return LDAP matches
(4) Generate List of Matched Resource Centres

Phase 2 (Orchestrate Grid Job)
(5) Generate Job Description (restricted to matched RCs) and Submit to Grid in normal way
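
The two phases can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' tooling: the information service is replaced by an in-memory dictionary, the CE names and attribute values are hypothetical, and restricting the job with `other.GlueCEUniqueID` is an assumed way of expressing "restricted to matched RCs" in JDL.

```python
import operator

OPS = {"==": operator.eq, ">=": operator.ge}


def matches(attrs, requirements):
    """Phase 1 pre-filter: check published CDER attributes against requirements."""
    for key, op, wanted in requirements:
        if key not in attrs:
            return False
        have = attrs[key]
        if isinstance(wanted, int):
            have = int(have)  # published values are strings
        if not OPS[op](have, wanted):
            return False
    return True


def make_jdl(executable, ce_ids):
    """Phase 2: build a job description restricted to the matched CEs."""
    if not ce_ids:
        raise ValueError("no resource centre satisfies the GPGPU requirements")
    clause = " || ".join(f'other.GlueCEUniqueID == "{ce}"' for ce in ce_ids)
    return f'Executable = "{executable}";\nRequirements = {clause};'


# Stand-in for the information service replies (hypothetical names and values).
published = {
    "ce1.example.com:8443/cream-pbs-gpgpu": {"GPUVendor": "Nvidia", "GPUMainMemorySize": "1024"},
    "ce2.example.com:8443/cream-pbs-gpgpu": {"GPUVendor": "Nvidia", "GPUMainMemorySize": "256"},
}
requirements = [("GPUVendor", "==", "Nvidia"), ("GPUMainMemorySize", ">=", 512)]

matched = [ce for ce, attrs in published.items() if matches(attrs, requirements)]
jdl = make_jdl("gpu_job.sh", matched)
print(jdl)
```

Keeping the pre-filter outside the job description means the existing submission path is used unchanged; only the list of candidate CEs is narrowed beforehand.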

Conclusion

• Five Conceptual Methods considered
• Only two methods promising
• Two-phase process required
– 1st phase is a GLUE 2.0 GPGPU pre-filter on GPGPU requirements
– 2nd phase restricts jobs to set of matching Resource Centres
• Attribute vs Extension
– Attribute more space efficient
– Extension easier to find individual Key/Values (time)
– Mixed Attribute & Extension?
• Method applicable to many other new resources (Limited Software Licenses, …)

Acknowledgements

This research, a part of the Telecommunications Graduate Initiative, is funded by the Irish Higher Education Authority under the Programme for Research in Third-Level Institutions (PRTLI) and co-funded under the European Regional Development Fund.