Extension of DIRAC to enable distributed computing using Windows resources

3rd EGEE User Forum, 11-14 February 2008, Clermont-Ferrand

J. Coles, Y. Y. Li, K. Harrison, A. Tsaregorodtsev, M. A. Parker, V. Lyutsarev


Page 1

Extension of DIRAC to enable distributed computing using Windows resources

3rd EGEE User Forum, 11-14 February 2008, Clermont-Ferrand

J. Coles, Y. Y. Li, K. Harrison, A. Tsaregorodtsev, M. A. Parker, V. Lyutsarev

Page 2

13th Feb 2008, University of Cambridge

Overview

- Why port to Windows and who is involved?
- DIRAC overview
- Porting process
  - Client (job creation/submission)
  - Agents (job processing)
  - Resources
- Successes/usage
- Deployment
- Summary

Page 3

Motivation

Aim:
- Enable Windows computing resources in the LHCb workload and data management system DIRAC
- Allow what can be done under Linux to be possible under Windows

Motivation:
- Increase the number of CPU resources available to LHCb for production and analysis
- Offer a service to Windows users
- Allow transparent job submission and execution on Linux and Windows

Who's involved:
- Cambridge, Cavendish Laboratory: Ying Ying Li, Karl Harrison, Andy Parker
- Marseille, CPPM: Andrei Tsaregorodtsev (DIRAC architect)
- Microsoft Research: Vassily Lyutsarev

Page 4

DIRAC Overview

- DIRAC: Distributed Infrastructure with Remote Agent Control
- LHCb's distributed production and analysis workload and data management system
- Written in Python
- Four sections:
  - Client: user interface
  - Services: the DIRAC Workload Management System, based on the main Linux server
  - Agents
  - Resources: CPU resources and data storage

Page 5

DISET security module

- DIRAC Security Transport (DISET): the underlying security module of DIRAC
- Provides grid authentication and encryption (using X509 certificates and grid proxies) between the DIRAC components
- Uses OpenSSL with pyOpenSSL (DIRAC's modified version) wrapped around it
  - Standard: implements Secure Sockets Layer and Transport Layer Security, and contains the cryptographic algorithms
  - Additional: grid proxy support
- Pre-built OpenSSL and pyOpenSSL libraries are shipped with DIRAC
  - Windows libraries are provided alongside Linux libraries, allowing the appropriate libraries to be loaded at run time
- Proxy generation under Windows
  - Multi-platform command: dirac-proxy-init
  - Validity of the generated proxy is checked under both Windows and Linux
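The run-time choice between the bundled Windows and Linux libraries can be sketched as below. This is a minimal illustration, not DIRAC's actual layout: the directory names and the helper are hypothetical.

```python
import sys

# Hypothetical layout: pre-built OpenSSL/pyOpenSSL builds for each platform
# are shipped side by side, and the matching set is picked at run time.
LIB_DIRS = {
    "win": "lib/win32",    # Windows DLL builds
    "linux": "lib/linux",  # Linux .so builds
}

def select_lib_dir(platform=None):
    """Return the bundled library directory matching the running platform."""
    platform = platform or sys.platform
    for prefix, libdir in LIB_DIRS.items():
        if platform.startswith(prefix):
            return libdir
    raise RuntimeError("no bundled libraries for platform: " + platform)
```

Keying on a `sys.platform` prefix keeps one code path for both operating systems, which is the effect the slide describes.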

Page 6

Client - job submissions

- Submissions are made with a valid grid proxy
- Three ways:
  - JDL (Job Description Language)
  - DIRAC API
  - Ganga
    - Built on DIRAC API commands
    - Currently being ported to Windows
- Successful job submission returns a job ID, provided by the Job Monitoring Service

JDL (myjob.jdl), submitted under Windows with: dirac-job-submit.py myjob.jdl

SoftwarePackages = { "DaVinci.v12r15" };
InputSandbox = { "DaVinci.opts" };
InputData = { "LFN:/lhcb/production/DC04/v2/00980000/DST/Presel_00980000_00001212.dst" };
JobName = "DaVinci_1";
Owner = "yingying";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = { "std.out", "std.err", "DaVinci_v12r15.log", "DVhbook.root" };
JobType = "user";

DIRAC API (myjob.py), run under Windows with: python myjob.py, or entered directly in Python

import DIRAC
from DIRAC.Client.Dirac import *

dirac = Dirac()
job = Job()

job.setApplication('DaVinci', 'v12r15')
job.setInputSandbox(['DaVinci.opts'])
job.setInputData(['LFN:/lhcb/production/DC04/v2/00980000/DST/Presel_00980000_00001212.dst'])
job.setOutputSandbox(['DaVinci_v12r15.log', 'DVhbook.root'])

dirac.submit(job)

Page 7

DIRAC Agent under Windows

- Python installation script downloads and installs the DIRAC software, and sets up the DIRAC Agent
- Agents are initiated on free resources
- Agent job retrieval:
  1. Run the DIRAC Agent to see if there are any suitable jobs on the server
  2. Agent retrieves any matched jobs
  3. Agent reports job status to the Job Monitoring Service
  4. Agent downloads and installs the applications required to run the job
  5. Agent retrieves any required data (see next slide)
  6. Agent creates a Job Wrapper to run the job (the wrapper is platform aware)
  7. Output is uploaded to storage if requested

[Diagram: Windows sites and Linux sites]
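The retrieval steps above amount to a poll-and-run loop. A minimal sketch, with the server interaction, wrapper execution and status reporting abstracted into caller-supplied functions (all names here are hypothetical, not the DIRAC Agent's real interface):

```python
import time

def agent_loop(request_job, run_job, report, poll_interval=60, max_cycles=None):
    """Poll the server for suitable jobs; run each match in a wrapper,
    reporting status to the (abstracted) Job Monitoring Service."""
    completed = []
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        cycles += 1
        job = request_job()            # ask the server, retrieve a match
        if job is None:
            time.sleep(poll_interval)  # free resource: wait before next check
            continue
        report(job, "Running")         # status update to monitoring service
        run_job(job)                   # install applications, fetch data,
                                       # execute inside a platform-aware wrapper
        report(job, "Done")            # output upload would follow if requested
        completed.append(job)
    return completed
```

The same loop runs unchanged on Windows and Linux; only `run_job` needs to be platform aware, which mirrors the wrapper's role on the slide.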

Page 8

Data access

- Data access to LHCb's distributed data storage system requires:
  - Access to the LFC (LCG File Catalogue), which maps LFNs (Logical File Names) to PFNs (Physical File Names)
  - Access to the Storage Element
- On Windows a catalogue client is provided via the DIRAC portal service
  - Uses DIRAC's security module DISET and a valid user grid proxy
  - Authenticates to the proxy server, which contacts the file catalogue on the user's behalf with its own credentials
- Uses the .NetGridFTP client 1.5.0 provided by the University of Virginia
  - Based on GridFTP v1; from tests it appears compatible with the GridFTP server used by LHCb (edg uses GridFTP client 1.2.5-1 and Globus GT2)
  - The client contains the functions needed for file transfers: get, put, mkdir
  - Also a batch tool that mimics the command flags of globus-url-copy
  - Requirement: .NET v2.0
  - .NetGridFTP binaries are shipped with DIRAC
- Allows full data registration and transfer to any Storage Element supporting GridFTP
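The two-step access path above (catalogue lookup through the proxy service, then a GridFTP transfer) can be sketched as follows. Both helpers passed in are stand-ins, assumed here for illustration: one for the portal's catalogue call, one for the .NetGridFTP `get`.

```python
def fetch_by_lfn(lfn, catalogue_lookup, gridftp_get):
    """Resolve a Logical File Name to its physical replicas via the proxy
    service, then transfer the first replica with a GridFTP get."""
    pfns = catalogue_lookup(lfn)   # proxy server queries the LFC on our behalf
    if not pfns:
        raise LookupError("no replica registered for " + lfn)
    return gridftp_get(pfns[0])    # get/put/mkdir are the transfer primitives
```

Keeping the catalogue and transfer clients behind plain callables is what lets the same logic run over DISET on Windows or directly on Linux.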

Page 9

DIRAC CE backends

- DIRAC provides a variety of Compute Element backends under Linux: Inprocess (standalone machine), LCG, Condor, etc.
- Windows: Inprocess
  - Agent loops at preset intervals, assessing the status of the resource
- Microsoft Windows Compute Cluster
  - Additional Windows-specific CE backend
  - Requires one shared installation of DIRAC and the applications on the head node of the cluster
  - Agents are initiated from the head node and communicate with the Compute Cluster Services
  - Job outputs are uploaded to the sandboxes directly from the worker nodes
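The split above implies a simple validation rule at agent start-up: Inprocess runs anywhere, while the Compute Cluster backend only makes sense on Windows. A sketch of that rule (names hypothetical; the real DIRAC configuration machinery differs):

```python
import sys

# Backends relevant to the Windows port: Inprocess runs anywhere, while the
# Compute Cluster backend needs Windows and the shared head-node installation.
KNOWN = {"Inprocess", "ComputeCluster"}
WINDOWS_ONLY = {"ComputeCluster"}

def check_backend(name, platform=None):
    """Validate that the configured CE backend can run on this platform."""
    platform = platform or sys.platform
    if name not in KNOWN:
        raise ValueError("unknown CE backend: " + name)
    if name in WINDOWS_ONLY and not platform.startswith("win"):
        raise ValueError(name + " backend requires Windows")
    return name
```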

Page 10

LHCb applications

Five main LHCb applications (C++: Gauss, Boole, Brunel, DaVinci; Python: Bender):

- Gauss: event generation and detector simulation
- Boole: digitisation
- Brunel: reconstruction
- DaVinci: analysis
- Bender: analysis (Python)

[Diagram: a production job runs Gauss (Sim) -> Boole (RAWmc) -> Brunel (DST); an analysis job runs DaVinci or Bender on DST to produce statistics. RAW is the data flow from the detector; RAWmc is its Monte Carlo equivalent.]

- Sim: simulation data format
- RAWmc: RAW Monte Carlo, equivalent to the RAW data format from the detector
- DST: Data Summary Tape

Page 11

Gauss

- Most LHCb applications are compiled for both Linux and Windows
  - For historical reasons, we use Microsoft Visual Studio .NET 2003
- Gauss was the only application not previously compiled under Windows
- Gauss relies on three major pieces of software not developed by LHCb:
  - Pythia6: simulation of particle production (legacy Fortran code)
  - EvtGen: simulation of particle decays (C++)
  - Geant4: simulation of the detector (C++)
- Gauss needs each of the above to run under Windows
  - Work strongly supported by the LHCb and LCG software teams
  - All third-party software now successfully built under Windows
- Most build errors resulted from the Windows compiler being less tolerant of "risky coding" than gcc
  - Insists on arguments passed to functions being of the correct type
  - More strict about memory management
  - Good for forcing code improvements!
- Able to fully build Gauss under Windows, with both the Generator and Simulation parts
  - We can produce full Gauss jobs of BBbar events, with distributions comparable to those produced under Linux
  - Gauss v30r4 installed and tested on the Cambridge cluster
- Latest release, Gauss v30r5:
  - First fully Windows-compatible release
  - Contains both pre-built Geant4 and Generator Windows binaries

Page 12

Cross-platform job submissions

- Job creation and submission are the same under Linux and Windows (the same DIRAC API commands and the same steps)
- Two main types of LHCb grid job at present:
- MC production jobs: CPU intensive, no input required; potentially ideal for 'CPU scavenging'
  - Recent efforts (Y. Y. Li, K. Harrison) allowed Gauss to compile under Windows (see previous slide)
  - A full MC production chain is still to be demonstrated on Windows
- Analysis jobs: require input (data, private algorithms, etc.)
  - DaVinci, Brunel, Boole
    - Note: a C++ compiler is required for customised user algorithms
    - Jobs submitted with libraries are bound to the same platform for processing
    - Platform requirements can be added during job submission
  - Bender (Python)
    - Note: no compiler, linker or private library required
    - Allows cross-platform analysis jobs to be performed
- Results are retrieved to the local computer:
  - > dirac_job_get_output.py 1234 : results in the output sandbox
  - > dirac-rm-get(LFN) : uses GridFTP to retrieve output data from a grid SE
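The binding rule above (compiled private libraries pin a job to the submitting platform, while pure-Python Bender jobs stay unconstrained) can be sketched as a single decision function. This is an illustrative helper, not the DIRAC API:

```python
def platform_requirement(application, private_libraries, submit_platform):
    """Return the platform a job must run on, or None if unconstrained."""
    if application == "Bender":
        return None               # pure Python: no compiler or linker needed
    if private_libraries:
        return submit_platform    # compiled user algorithms bind the job
    return None                   # stock applications can run on either side
```

In this picture, submitting a DaVinci job with a user library from Windows would add a Windows platform requirement, whereas the same submission without private libraries leaves the scheduler free to match any site.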

Page 13

DIRAC Windows usage

- DIRAC is supported on two Windows platforms:
  - Windows XP
  - Windows Server 2003
- Use of DIRAC to run LHCb physics analysis under Windows
  - Comparison between DC04 and DC06 data on the B±→D0(Ksπ+π-)K± channel
  - 917,000 DC04 events processed under Windows per selection run: ~48 hours total CPU time on 4 nodes
  - A further ~200 jobs (totalling ~4.7 million events) submitted from Windows to DIRAC, processed on LCG, and retrieved on Windows
  - Further selection background studies are currently being carried out with the system
- Processing speed comparisons between Linux and Windows are difficult, as the Windows binaries are currently built in debug mode by default

Page 14

DIRAC deployment (platform; hardware; CPUs available; disk size; CE backend)

Bristol
- Windows XP Professional; Intel Pentium 4 @ 2.00GHz, 504MB RAM; 4 CPUs; 37.2GB on C: drive; Inprocess

Cambridge
- Windows XP Professional; Dell Optiplex GX745, Intel Core 2 6400 @ 2.13GHz, 2.99GB RAM; 2 CPUs; mapped drives can be linked to Cambridge HEP group storage disks; Inprocess
- Windows Server 2003 x64 + Compute Cluster Pack 2006; AMD Athlon 64 X2 Dual Core 4400+ @ 2.21GHz, 2.00GB RAM; 4 nodes available, 8 CPUs in total; Compute Cluster

Laptop
- Windows XP Tablet; Intel Pentium M @ 2.00GHz, 512MB RAM; 2 CPUs; Inprocess

Oxford
- Windows Server 2003 x64 + Compute Cluster Pack 2006; Intel Xeon @ 2.66GHz, 31.9GB RAM; 22 nodes available, 100 CPUs in total; 208GB on mapped disk; Compute Cluster
- Windows Server 2003; Intel Xeon @ 2.80GHz, 2.00GB RAM; 2 CPUs; 136GB on local C: drive; Inprocess

Birmingham
- Windows Server 2003 + Compute Cluster Pack 2006; 16 machines, 4 cores each; Compute Cluster

Page 15

Windows wrapping

- The bulk of the DIRAC Python code was already platform independent
  - However, not all Python modules are platform independent
- Three types of code modification/addition:
  - Platform-specific libraries and binaries (e.g. OpenSSL, pyOpenSSL, .NetGridFTP)
  - Additional Windows-specific code (e.g. the Windows Compute Cluster CE backend, .bat files to match Linux shell scripts)
  - Minor Python code modifications (e.g. changing process forks to threads)
- DIRAC installation: ~60MB; per LHCb application: ~7GB
- Windows port modifications, by file size of the DIRAC code used:
  - Unmodified: 60%
  - Modified for cross-platform compatibility: 34%
  - Windows-specific: 6%
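One of the "minor Python modifications" above, changing process forks to threads, is forced by the absence of os.fork() on Windows. A minimal sketch of the substitution (the helper name is ours, not DIRAC's):

```python
import threading

def run_in_background(task, *args):
    """Run task(*args) without blocking the caller. The Linux code could use
    os.fork(); since os.fork() does not exist on Windows, the ported code
    runs the work in a thread instead."""
    worker = threading.Thread(target=task, args=args)
    worker.start()
    return worker   # caller may join() where the fork parent would wait()
```

The trade-off is that threads share the interpreter state that a forked child would have copied, so this substitution only works where the background task does not mutate global state.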

Page 16

Summary

- Working DIRAC v2r11, able to integrate both Windows standalone and cluster CPUs into the existing Linux system
- Porting: replacement of Linux-specific Python code, and provision of Windows equivalents where platform independence was not possible (e.g. pre-compiled libraries, secure file transfers)
- Windows platforms tested:
  - Windows XP
  - Windows Server 2003
- Cross-platform job submission and retrieval
  - Little change to syntax for the user
- Full analysis job cycle on Windows, from algorithm development to analysis of results (Bender -> running (Linux) -> getting results)
  - Continued use for further physics studies
- All applications for MC production jobs tested
- Deployment extended to three sites so far, totalling 100+ Windows CPUs
  - Two Windows Compute Cluster sites
- Requirements:
  - Python 2.4
  - PyWin32 (Windows-specific Python module)
  - Grid certificate
- Future plans:
  - Test the full production chain
  - Deploy on further systems/sites, e.g. Birmingham
  - Larger-scale tests
  - Continued usage for physics studies
  - Provide a useful tool when LHC data arrives

Page 17

Backup slides

Page 18

Cross-platform compatibility

Component | Language | Binaries available
Ganga | Python | -
DIRAC | Python | Linux/Windows compatible

LHCb applications:
Gauss | C++ | SLC3, SLC4, Win32
Boole | C++ | SLC3, SLC4, Win32
Brunel | C++ | SLC3, SLC4, Win32
DaVinci | C++ | SLC3, SLC4, Win32
Bender | Python | Linux/Windows compatible

Page 19

[Architecture diagram: users submit jobs to the DIRAC head node. The WMS on the head node comprises the Job Management Service, Sandbox Service, Job Matcher, Job Monitoring Service and LFC Service, with DISET securing communication. On the resource, an Agent with a watchdog runs the job (e.g. DaVinci) inside a wrapper, drawing on the Software Repository, the DIRAC Proxy Server and the local SE.]