
Page 1: Prague Site Report

Prague Site Report

Jiří Chudoba, Institute of Physics, Prague

HEPiX meeting, Prague, 23.4.2012

Page 2: Prague Site Report


Local Organization

• Institute of Physics:
  o 2 locations in Prague, 1 in Olomouc
  o 786 employees (281 researchers + 78 doctoral students)
• Department of Networking and Computing Techniques (SAVT)
  o networking up to the offices, mail and web servers, central services
• Computing Centre (CC)
  o large-scale calculations
  o part of SAVT (except the leader, Jiří Chudoba)
• Division of Elementary Particle Physics
  o Section: Department of Detector Development and Data Processing
    • head: Miloš Lokajíček
    • started large-scale calculations, later transferred to the CC
    • the biggest hardware contributor (LHC computing)
    • participates in the CC operation

Page 3: Prague Site Report


Server room I

• Server room I (Na Slovance)
  o 62 m², ~20 racks
  o 350 kVA motor generator, 200 + 2x100 kVA UPS, 108 kW air cooling, 176 kW water cooling
  o continuous changes
  o hosts computing servers and central services

Page 4: Prague Site Report


Other server rooms

• New server room for SAVT
  o located next to server room I
  o independent UPS (24 kW now, max 64 kW n+1), motor generator (96 kW), cooling 25 kW (n+1)
  o dedicated to central services
  o 16 m², now 4 racks (room for 6)
  o very high reliability required
  o first servers moved in last week
• Server room Cukrovarnická
  o another building in Prague
  o 14 m², 3 racks (max 5), 20 kW central UPS, 2x8 kW cooling
  o backup servers and services
• Server room UTIA
  o 3 racks, 7 kW cooling, 3 + 5x1.5 kW UPS
  o dedicated to the Department of Condensed Matter Theory

Page 6: Prague Site Report


Clusters in CC - Dorje

• Dorje: Altix ICE 8200, 1.5 racks
  o 512 cores on 64 diskless worker nodes, InfiniBand, 2 disk arrays (6 + 14 TB)
  o local users only: solid state physics, condensed matter theory
  o 1 admin for administration and user support
  o relatively small number of jobs; MPI jobs with up to 256 processes
  o Torque + Maui, SLES10 SP2, SGI Tempo, MKL, OpenMPI, ifort
• users mostly run Wien2k, vasp, fireball, apls (a sketch of a typical job submission follows)
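To make the batch environment concrete, here is a minimal sketch of submitting a 256-process MPI job to Torque/Maui, matching Dorje's 8-core worker nodes. Only the batch system and node geometry come from the slide; the job name, walltime and the ./vasp binary path are illustrative assumptions, not the site's actual configuration.

```python
# Hypothetical sketch: build and submit a 256-process MPI job to
# Torque/Maui. Job name, walltime and binary are made-up examples.
import subprocess
import tempfile

PBS_SCRIPT = """#!/bin/bash
#PBS -N mpi-example
#PBS -l nodes=32:ppn=8
#PBS -l walltime=24:00:00
cd "$PBS_O_WORKDIR"
# OpenMPI built with Torque (TM) support discovers the allocated
# nodes itself; -np 256 = 32 nodes x 8 cores, stated for clarity.
mpirun -np 256 ./vasp
"""

def submit(script_text: str) -> str:
    """Write the job script to a file, hand it to qsub, return the job id."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script_text)
        path = f.name
    result = subprocess.run(["qsub", path], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("submitted job", submit(PBS_SCRIPT))
```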

Page 7: Prague Site Report


Cluster LUNA

• 2 SunFire X4600 servers
  o 8 CPUs, 32 cores, 256 GB RAM
• 4 SunFire V20z and V40z servers
• operated by CESNET MetaCentrum, the distributed computing activity of NGI_CZ
• MetaCentrum:
  o 9 locations
  o 3500 cores
  o 300 TB

Page 8: Prague Site Report


Cluster Thsun, small group servers

• Thsun
  o a "private" cluster
    • small number of users
    • power users with root privileges
  o 12 servers of varying hardware
• servers for groups
  o managed by the groups in collaboration with the CC

Page 9: Prague Site Report


Cluster Golias

• upgraded every year: several subclusters, each of identical hardware
• 3812 cores, 30700 HS06
• almost 2 PB of disk space
• the newest subcluster, rubus (March 2012):
  o 23 SGI Rackable C1001-G13 nodes
  o 2x Opteron 6274 (16 cores each), 64 GB RAM, 2x 300 GB SAS per node
  o 374 W at full load
  o 232 HS06 per node, 5343 HS06 in total (a quick consistency check follows)
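A quick consistency check of the rubus numbers above, using nothing beyond the slide itself:

```python
# Consistency check: 23 nodes and 5343 HS06 total imply ~232 HS06/node.
nodes, hs06_total = 23, 5343
print(round(hs06_total / nodes, 1))   # 232.3, quoted as "232 HS06 per node"
```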

Page 10: Prague Site Report


Golias shares (HS06 capacity and percentage of the total):

               2011 HS06    %     2012 HS06    %
   Alice+Star       7551   30          7564   25
   Atlas            7087   28         11861   39
   D0               9165   37          9969   32
   Solid             914    4           629    2
   Calice             30    0            13    0
   Auger             205    1           668    2
   total           24951  100         30704  100

[two pie charts, planned vs real usage (walltime), over d0, alice, atlas, auger, solid: 47/26/22/4/1 % and 37/30/28/1/4 %]
[pie chart, subcluster contributions to the total performance: Golias-p, Golias-c, Iberis, Ibis, Ib, Salix, Saltix, Dorje, Rubus]
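As a cross-check of the table above, a minimal sketch that recomputes the 2012 share column; only the HS06 numbers come from the slide, the code itself is illustrative:

```python
# Recompute the 2012 share column of the table above.
hs06_2012 = {
    "Alice+Star": 7564, "Atlas": 11861, "D0": 9969,
    "Solid": 629, "Calice": 13, "Auger": 668,
}
total = sum(hs06_2012.values())                 # 30704 HS06
for experiment, hs06 in hs06_2012.items():
    print(f"{experiment:10s} {hs06:6d} {100 * hs06 / total:5.1f} %")
```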

Page 11: Prague Site Report


WLCG Tier2

• cluster Golias @ FZU + xrootd servers @ Rez
• 2012 pledges:
  o ATLAS: 10000 HS06, 1030 TiB pledged; 11861 HS06, 1300 TB available
  o ALICE: 5000 HS06, 420 TiB pledged; 7564 HS06, 540 TB available
• delivery of almost 600 TB delayed due to the floods
• 66% efficiency is assumed for WLCG accounting
  o sometimes under 100% of pledges
• low cputime/walltime ratio for ALICE (see the sketch below)
  o not only at our site
  o tests with limits on the number of concurrent jobs (last week):
    • "no limit" (about 900 jobs): 45%
    • limit of 600 jobs: 54%
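The 45% and 54% figures are the plain cputime-over-walltime ratio; a minimal sketch of that computation, with made-up job records rather than site data:

```python
# CPU efficiency as aggregate cputime / aggregate walltime, in percent.
# The two job records below are invented for illustration only.
jobs = [
    {"cputime_s": 3_000, "walltime_s": 7_000},
    {"cputime_s": 9_500, "walltime_s": 20_000},
]

def cpu_efficiency(job_records) -> float:
    cpu = sum(j["cputime_s"] for j in job_records)
    wall = sum(j["walltime_s"] for j in job_records)
    return 100.0 * cpu / wall

print(f"{cpu_efficiency(jobs):.0f} %")   # ~46 % for this sample
```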

Page 12: Prague Site Report


Utilization

• very high average utilization
  o several different projects, different tools for production
  o D0: production submitted locally by 1 user
  o ATLAS: PanDA, Ganga, local users; DPM
  o ALICE: VO box; xrootd

[utilization plots for D0, ALICE and ATLAS]

Page 13: Prague Site Report


Networking

• CESNET upgraded our main Cisco router
  o 6506 -> 6509
  o supervisor SUP720 -> SUP2T
  o new 8x 10G X2 card
  o planned upgrade of power supplies: 2x3 kW -> 2x6 kW
• (2 cards with 48x 1 Gbps, 1 card with 4x 10 Gbps, FW service module)

Page 17: Prague Site Report


External connection

• exclusive: 1 Gbps (to FZK) + 10 Gbps (CESNET)
• shared: 10 Gbps (PASNET - GEANT)
• not enough for the ATLAS T2D limit (5 MB/s to/from T1s); see the arithmetic below
• perfSONAR installed

[throughput plots: FZK -> FZU, FZU -> FZK, PASNET link]
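For scale, illustrative arithmetic only: with the T2D threshold of 5 MB/s to/from each T1, serving on the order of ten T1s (an assumed count, not from the slide) already needs ~50 MB/s each way, a large fraction of the dedicated 1 Gbps link:

```python
# Illustrative arithmetic for the T2D requirement quoted above.
# The number of T1 sites (10) is an assumption used only for scale.
t2d_mb_per_s = 5                      # required per T1, each direction
n_t1 = 10                             # assumed number of T1s
link_mb_per_s = 1e9 / 8 / 1e6         # dedicated 1 Gbps link = 125 MB/s
print(f"needed: ~{t2d_mb_per_s * n_t1} MB/s per direction, "
      f"link capacity: {link_mb_per_s:.0f} MB/s")
```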

Page 18: Prague Site Report


Miscellaneous items

• Torque server performance
  o W jobs, sometimes long response times
  o divide Golias into 2 clusters with 2 Torque instances?
  o memory limits for the ATLAS and ALICE queues
• CVMFS
  o used by ATLAS, works well
  o some older nodes have too small disks -> excluded for ATLAS (see the sketch below)
• Management
  o Cfengine v2 used for production
  o Puppet used for the IPv6 testbed
• 2 new 64-core nodes
  o SGI Rackable H2106-G7, 128 GB RAM, 4x Opteron 6274 2.2 GHz, 446 HS06
  o frequent crashes when loaded with jobs
• another 2 servers with Intel SB expected
  o small subclusters with different hardware

Page 19: Prague Site Report


Water cooling

• active vs passive cooling doors
  o 1 new rack with cooling doors
  o 2 new cooling doors on APC racks

Page 20: Prague Site Report


Water cooling

• good sealing is crucial

[temperature plots: disk servers on/off (divider added); disk servers, worker nodes, rubus01]

Page 21: Prague Site Report

[email protected] 21

Distributed Tier2, Tier3s

• networking infrastructure (provided by CESNET) connects all Prague institutions involved:
  o Academy of Sciences of the Czech Republic
    • Institute of Physics (FZU, Tier-2)
    • Nuclear Physics Institute
  o Charles University in Prague
    • Faculty of Mathematics and Physics
  o Czech Technical University in Prague
    • Faculty of Nuclear Sciences and Physical Engineering
    • Institute of Experimental and Applied Physics
• currently only NPI hosts resources visible in the Grid
  o many reasons why the others do not: manpower, suitable rooms, lack of IPv4 addresses
• Data Storage group at CESNET
  o deployment for LHC projects under discussion

Page 22: Prague Site Report


• Thanks to my colleagues for their help with the preparation of these slides:
  o Marek Eliáš
  o Lukáš Fiala
  o Jiří Horký
  o Tomáš Hrubý
  o Tomáš Kouba
  o Jan Kundrát
  o Miloš Lokajíček
  o Petr Roupec
  o Jana Uhlířová
  o Ota Velínský
