Union of Light Ion Centres in Europe

Shared services and tools based on the Ganga job definition and management framework.
EGI User Forum 2011 – Vilnius, Lithuania.

U. Egede (Imperial College London), M. Kenyon (CERN), J. Moscicki (CERN), I.A. Dzhunov (University of Sofia), L. Kokoszkiewicz (CERN), M. Cinquilli (CERN), L. Sargsyan (CERN/Yerevan Physics Institute), E. Karavakis (CERN), J. Andreeva (CERN)

The ULICE project is co-funded by the European Commission under FP7 Grant Agreement Number 228436.



Overview

• The EGI Introductory Package.
  – Tools that were initially developed by and for specific user communities.
• Tool adoption case studies.
  – Grid job submission & management.
  – Error-reporting infrastructure for users.


The EGI Introductory Package

• “Simple, complete solution for running and monitoring computing tasks on the grid.”
  – Audience: small to medium-sized user communities.
• Comprises the following components.
  – Ganga: user interface for job submission and management.
  – DIANE: automatic control and scheduling of Ganga jobs.
  – Mini-dashboard: monitoring of Ganga and/or DIANE tasks & error reporting.


Ganga overview

• Ganga project was started by ATLAS & LHCb in 2001.
  – Entirely written in Python from the outset.
  – Extensible framework; simple and well documented procedure.
• Underwent a major redesign in 2005.
  – Core codebase is stable, though new features are in active development.
  – Assign ownership of sub-packages.
• Release procedure.
  – Rotating release manager.
  – Rigorous testing prior to each release.
• Registered in the EGI application database.


Ganga overview

• Easy-to-use frontend for the configuration, execution and management of computational tasks.
  – Interactive Python prompt, script submission, GUI.
• Submission to a range of platforms in a consistent manner (localhost, PBS, LSF, SGE, Condor, gLite, ARC, Globus).
• Key concept: “The Job”.
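The idea of one job object that can be pointed at different platforms can be illustrated with a small, self-contained sketch. This is not Ganga's API: the `Job`, `LocalBackend` and `RecordingBackend` classes are hypothetical stand-ins invented for illustration, and only the local backend actually runs anything.

```python
import subprocess
import sys
from dataclasses import dataclass, field


class LocalBackend:
    """Hypothetical backend: runs the command on this machine."""

    def run(self, command):
        result = subprocess.run(command, capture_output=True, text=True)
        status = "completed" if result.returncode == 0 else "failed"
        return status, result.stdout


class RecordingBackend:
    """Hypothetical stand-in for a batch or grid backend.

    It only records what would have been submitted.
    """

    def __init__(self):
        self.submitted = []

    def run(self, command):
        self.submitted.append(list(command))
        return "submitted", ""


@dataclass
class Job:
    """Toy job: what to run (command) is kept apart from where (backend)."""
    command: list
    backend: object = field(default_factory=LocalBackend)
    status: str = "new"
    output: str = ""

    def submit(self):
        self.status, self.output = self.backend.run(self.command)
        return self.status


command = [sys.executable, "-c", "print('hello from a job')"]

# Same job definition, two different places to run it.
local_job = Job(command)
local_job.submit()

grid_like = RecordingBackend()
remote_job = Job(command, backend=grid_like)
remote_job.submit()
```

The point of the design is that switching platform changes one attribute of the job and nothing else about how it is defined or submitted.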


DIANE – Distributed Analysis Environment

• DIANE development started within CERN IT in 2000.
  – Registered in the EGI application database.
• Job execution control framework.
  – Automatic load balancing, scheduling and failure recovery.
• Utilises a pilot-job mechanism (also known as “late binding”).
  – Pilots controlled by Ganga.
  – Workload fed directly to pilots.
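Late binding can be sketched in a few lines: tasks wait in a central queue and are assigned to a pilot only at the moment that pilot asks for work, so a failing pilot can be excluded and its task handed to another. The code below is a sequential toy model under that assumption, not DIANE code; `run_with_pilots` and its parameters are hypothetical names.

```python
from collections import deque


def run_with_pilots(tasks, pilots, max_attempts=3):
    """Toy late-binding scheduler.

    tasks:  list of task identifiers.
    pilots: dict mapping a pilot name to a function(task) -> result;
            a pilot function raises an exception when it fails.
    """
    queue = deque((task, 1) for task in tasks)
    healthy = list(pilots)            # pilots still allowed to pull work
    results, excluded = {}, []
    turn = 0
    while queue and healthy:
        pilot = healthy[turn % len(healthy)]    # next pilot asks for work
        task, attempt = queue.popleft()         # task is bound only now
        try:
            results[task] = (pilot, pilots[pilot](task))
            turn += 1
        except Exception:
            healthy.remove(pilot)               # exclude the failing pilot
            excluded.append(pilot)
            if attempt < max_attempts:
                queue.append((task, attempt + 1))   # requeue for recovery
    return results, excluded


def good(task):
    return task * 2


def broken(task):
    raise RuntimeError("misconfigured node")


results, excluded = run_with_pilots(
    [1, 2, 3, 4], {"a": good, "bad": broken, "b": good}
)
```

In this run the pilot named `bad` fails on its first task, is excluded, and the task it dropped is later completed by a healthy pilot.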


Mini-dashboard: task monitoring

• A service to monitor the state of submitted Ganga and/or DIANE tasks.


Mini-dashboard: task monitoring

• Based around the hBrowse framework.
  – HTML/JavaScript client that consumes JSON data.
  – Highly configurable.
  – Plugin architecture (dynamic tables, user selection, field filters).
  – Support for Google charts.
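As a rough illustration of the kind of JSON data such a browser client might consume, the sketch below filters task records by user and counts them by status. The record layout and field names are assumptions made for this example; the slides do not describe the actual hBrowse data format.

```python
import json
from collections import Counter

# Hypothetical task records; the field names are illustrative only.
TASKS = [
    {"id": 1, "user": "alice", "backend": "LCG", "status": "completed"},
    {"id": 2, "user": "alice", "backend": "Local", "status": "failed"},
    {"id": 3, "user": "bob", "backend": "LCG", "status": "running"},
]


def task_summary(tasks, user=None):
    """Filter tasks by user and count them by status."""
    selected = [t for t in tasks if user is None or t["user"] == user]
    by_status = Counter(t["status"] for t in selected)
    return {
        "total": len(selected),
        "by_status": dict(by_status),
        "tasks": selected,
    }


# The JSON text a browser-side table or chart could consume.
payload = json.dumps(task_summary(TASKS, user="alice"), indent=2, sort_keys=True)
print(payload)
```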


Mini-dashboard: error reporting

• Support teams need the full picture to help users.
• Grid jobs are complex beasts.
  – Range of software versions.
  – Configuration settings.
  – Environment variables.
• Error reporting tool.
  – Repository for diagnostic data.
  – Traditional “expert-user” model.
  – Or, community support model.


Mini-dashboard: error-reporting tool

• Ganga function to POST a job error report to a central repository.
• Report available to community support team.
• Command history, detailed logs, environment...
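A minimal sketch of what assembling and POSTing such a report could look like, using only the Python standard library. The endpoint URL, the report fields and the function names are assumptions made for illustration; they are not the actual Ganga function or repository interface.

```python
import json
import os
import platform
import sys
import urllib.request

# Hypothetical endpoint; the slides do not give the repository's address.
REPORT_URL = "https://example.org/error-reports"


def build_report(job_id, command_history, log_text, env_keys=("PATH", "HOME")):
    """Collect the diagnostic data a support team would want to see."""
    return {
        "job_id": job_id,
        "command_history": list(command_history),
        "log": log_text,
        "environment": {key: os.environ.get(key, "") for key in env_keys},
        "python_version": platform.python_version(),
        "platform": sys.platform,
    }


def make_request(report, url=REPORT_URL):
    """Wrap the report in an HTTP POST request without sending it."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


report = build_report(42, ["first command", "second command"], "example log text")
request = make_request(report)
# urllib.request.urlopen(request) would send the report; it is left out
# here because REPORT_URL is only a placeholder.
```

Collecting only a named subset of environment variables, as `env_keys` does here, is one way to avoid uploading unrelated or sensitive settings.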


Case study 1 Geant4 code validation.


Community case study: Geant4

• Geant4 is a toolkit for simulating the trajectories of particles through matter.
• Finds application within HEP, nuclear and accelerator physics, medical and space science.
• Vast, object-oriented suite, with over 600,000 lines of complex source code.
  – Developers are distributed around the world.


Geant4: Code validation

• Statistical regression tests compare simulation results between the previous and pending releases.
  – Wide range of (physics) input parameters are used.
  – 1000 independent tasks generated, each one simulating 5000 events.
  – Total CPU time required per release = a few CPU years.
• Intensive validation period prior to release (every 6 months).
  – Quick succession of candidate releases.
  – Tests partially or wholly repeated multiple times.
• The Ganga/DIANE framework has been used to test releases since June 2007.
• Validation metrics: performance of code (time/event) and stability (application crashes).


Geant4: Code validation

• Last validation Dec. 2010; Geant4 v9.4 (report)
  – Ganga/DIANE used to run code validation on 18 grid sites.
  – 80 million events produced in 2-3 days (cf. 1 week for previous releases).
  – Exposed rare Geant4 software crashes (1 crash per 10,000 generated events).
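The quoted figures can be put side by side with some simple arithmetic. The sketch below assumes that the 80 million events are made up of passes of 1000 tasks with 5000 events each, as described for the regression tests; the slides do not state this breakdown explicitly, so the number of passes is only an inference.

```python
# Figures quoted on the validation slides.
tasks = 1000
events_per_task = 5000
total_events = 80_000_000
sites = 18
days_low, days_high = 2, 3

# One full pass over all tasks.
events_per_pass = tasks * events_per_task        # 5,000,000 events

# Assumption: the campaign total is a whole number of such passes.
equivalent_passes = total_events / events_per_pass

# Overall throughput for the quoted 2-3 day range.
rate_fast = total_events / days_low              # events per day if 2 days
rate_slow = total_events / days_high             # events per day if 3 days

# Average per-site throughput in the faster case.
per_site_fast = rate_fast / sites

print(events_per_pass, equivalent_passes)
print(round(rate_slow), round(rate_fast), round(per_site_fast))
```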


Geant4: Value added by EGI IP

• Geant4 validation team observations.
  – New DIANE/Ganga infrastructure [...] allows for robust and reliable logging and monitoring of Ganga jobs.
  – DIANE’s error detection allows the automatic exclusion of misconfigured nodes (“the main problem in previous validation periods”).
  – Grid experience substantially improved since last campaign; increase of 10x in number of jobs executed for the same allocated resources.
• “GRID usage has improved substantially... mainly due to the improved software and to the increased ability to monitor and debug jobs.”


Case study 2 OpenGate Virtual Laboratory Environment


Community case study: OpenGate

• GATE software; nuclear medicine simulations for imaging.
  – Based on Geant4 toolkit.
  – Simulate particle tracks through matter.
    • Initial set of properties (type, location, direction).
    • Large number of particles (simulations).
• Typical simulations take hours to days.
• Tasks well suited to being parallelized.
  – Split simulation into sub-simulations (static partitioning).
  – But all MC sub-tasks must complete to get the final result.


Community case study: OpenGate

• OpenGate collaboration developed a dynamic splitting method for GATE.
  – Camarasu-Pop et al. J Grid Computing (2010) 8:241–259

[Diagram callouts]
  – Monitors no. of simulated events; generates new tasks if required; passes them to DIANE.
  – Task processing engine; pilot jobs on grid nodes; STDOUT & ERR returned periodically.


Community case study: OpenGate

• N tasks are dispatched by DIANE to the Grid.
  – Each task will continue to run until the desired number of particles (summed across all tasks) is reached.
• On average, a grid node will simulate a fraction (1/N) of the total number of particles.
• DIANE performs a periodic check of total number of particles simulated.
• Run is terminated when all particles have been simulated.
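The termination rule can be sketched as a toy simulation: every task keeps producing particles at its own pace, and a periodic check stops the run once the summed count reaches the target. This is an illustrative model only, not the OpenGate or DIANE implementation; the per-task speeds, the check interval and the function name are all hypothetical.

```python
import random


def dynamic_run(total_particles, n_tasks, check_every=5, seed=0):
    """Toy model of dynamic splitting.

    Every task keeps simulating until a periodic check finds that the
    summed particle count has reached the target.
    """
    rng = random.Random(seed)
    # Hypothetical speeds: particles each task simulates per time step.
    speeds = [rng.randint(1, 10) for _ in range(n_tasks)]
    done = [0] * n_tasks
    step = 0
    while True:
        step += 1
        for i, speed in enumerate(speeds):
            done[i] += speed
        # Periodic check of the total number of particles simulated.
        if step % check_every == 0 and sum(done) >= total_particles:
            return step, done


steps, per_task = dynamic_run(total_particles=10_000, n_tasks=10)
print(steps, per_task)
```

In this model faster tasks simply contribute more particles, so no single slow task holds up the result, and the run overshoots the target by at most the work done between two checks.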


OpenGate: Value added by EGI IP

• Makespan reduced; for simulating 20×10^6 events:
  – 8.5 hours on dual-core PC (2008).
  – ~24 hours on the grid with 100 classical jobs. 78% success rate.
  – 1.75 hours on the grid with 100 DIANE worker agents. 100% of simulation complete.
• In period July 2009 – Aug 2010:
  – 360 DIANE RunMaster instances were activated in the backbone, handling 58,000 worker agent jobs.
• Generic solution; any application managed by MOTEUR can now interface with DIANE.
• Solution has entered production environment for non-clinical radiation therapy researchers at Creatis.
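The makespan figures translate into the following ratios. This is plain arithmetic on the numbers quoted above; the grid figure with classical jobs is quoted only as approximately 24 hours, so the corresponding ratio is approximate too.

```python
# Makespans quoted for the same simulation.
pc_hours = 8.5              # dual-core PC
classic_grid_hours = 24.0   # 100 classical grid jobs (approximate)
diane_hours = 1.75          # 100 DIANE worker agents

speedup_vs_pc = pc_hours / diane_hours
speedup_vs_classic = classic_grid_hours / diane_hours

# Average number of worker agent jobs handled per RunMaster instance.
jobs_per_master = 58_000 / 360

print(round(speedup_vs_pc, 2))       # 4.86
print(round(speedup_vs_classic, 1))  # 13.7
print(round(jobs_per_master))        # 161
```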


Case study 3 CMS Error Reporting Tool.


CMS error reporting tool

• Users perform analysis via CMS Remote Analysis Builder (CRAB).
• Jobs running on heterogeneous services.
  – CMS services/middleware/batch systems.
  – Differing use cases.
• Dedicated analysis operation debugging team.
  – Proactively find and debug problems.
  – Handle support requests from CMS user community mailing list.
  – 3,500 users from 40 countries across many time zones.


CMS error reporting tool

• In order to optimise the support procedure, CMS adopted the error-reporting tool.
• A CRAB plugin was developed that uploads debugging information to a central repository.


CMS error-reporting: early feedback

• Positive; the system works well, and is considered very useful.
• Streamlines a user-support request.
  – Reduces email traffic to support lists.
• Has allowed CMS to centralise and formalise their user-support mechanism.


Summary

• EGI Introductory Package has been adopted and adapted by a wide range of user communities.
• Based around mature, stable and well documented tools.
• Startup overhead is surprisingly low (see demo later today).
• DIANE allows researchers to use resources more efficiently than direct job submission alone.
• Mini-dashboard provides customisable tools for monitoring jobs and optimising user-support.


Further information

• EGI introductory package
  – https://twiki.cern.ch/twiki/bin/view/ArdaGrid/EGIIntroductoryPackage
• Ganga and DIANE are registered in the EGI application database, and are part of the EGEE Respect tool suite.
• Demonstration: Lambda room 4pm today.
