Status Report on Tier-1 in Korea
Gungwon Kang, Sang-Un Ahn and Hangjin Jang
(KISTI GSDC)
April 28, 2014 at 15th CERN-Korea Committee, Geneva
Korea Institute of Science and Technology Information
Global Science experiment Data hub Center
OUTLINE
Computing Resources
Operations
Network
Conclusion
28 April 2014, 15th CERN-Korea Committee
KISTI GSDC Tier-1 Team
Role                              Name
Representative                    Haeng-Jin Jang
System Management                 Hee-Jun Yoon
System Administration             Jeong-Heon Kim
Storage (Disk & Tape)             Hee-Jun Yoon, Sang-Oh Park
Network                           Hyoung-Woo Park; KISTI support (Dr. Bu-Seung Cho)
Site Operation & Administration   Il-Yeon Yeo, Sang-Un Ahn
KIAF Operation & User Support     Sang-Un Ahn
~ 9 people
Computing Resource Status
• 2013 Pledges (CPU): 25,000 HepSpec06
– Current HepSpec06: 28,055
– 2,524 job slots available (4 reserved slots for pilot jobs) with H/T enabled
• 2013 Pledges (Tape Storage): 1,500 TB
– Current tape capacity: 1,000 TB
– Pledges will be met this year
• 2013 Pledges (Disk Storage): 1,000 TB
– Current disk capacity: 966 TB (1,000 TB allocated, but usable space is slightly below)
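The pledge figures above can be compared against current capacity with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers on this slide; the dictionary layout and the function name `fulfilment` are illustrative assumptions, not part of any GSDC tooling.

```python
# Pledged vs. current capacity, taken from the slide above.
# The data layout and names are illustrative assumptions.
RESOURCES = {
    "CPU (HepSpec06)": {"pledged": 25_000, "current": 28_055},
    "Tape (TB)": {"pledged": 1_500, "current": 1_000},
    "Disk (TB)": {"pledged": 1_000, "current": 966},
}


def fulfilment(pledged: float, current: float) -> float:
    """Return current capacity as a percentage of the pledge."""
    return 100.0 * current / pledged


if __name__ == "__main__":
    for name, r in RESOURCES.items():
        pct = fulfilment(r["pledged"], r["current"])
        status = "met" if r["current"] >= r["pledged"] else "below pledge"
        print(f"{name:16s} {pct:6.1f}%  ({status})")
```

With the slide's numbers, CPU comes out at about 112% of the pledge, tape at about 67% and disk at 96.6%.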
OPERATIONS
[Figure: Total wall clock hours for ALICE jobs in the last 6 months; KISTI 3.9% (including Tier-2)]
[Figure: Jobs, Oct 2013 to Apr 2014, with levels of ~800, ~1800 and ~2500; annotated events: T1 worker nodes migration to 10GbE equipped ones; ALICE Central Service maintenance; EMI-3 migration and delivery of full pledges]
• Current capacity: 2,524 job slots, 28.1 kHS06
– 84 nodes, 32 (logical) cores per node, 11 HS06/core
• Maintenance issues
– Worker nodes migration to 10GbE equipped ones
– Middleware: EMI-3 migration (end of support for EMI-2 by 30 April)
– Delivered full pledges for 2013
3.58% (2013)
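The capacity figures quoted above (nodes, logical cores, HS06 per core, job slots) can be cross-checked with simple arithmetic. This is only a sketch based on the rounded numbers on the slides; the variable names are illustrative, and the slides do not explain why the nominal product differs from the reported totals.

```python
# Figures as quoted on the slides; names are illustrative.
NODES = 84
LOGICAL_CORES_PER_NODE = 32
HS06_PER_CORE = 11        # rounded per-core figure from the slide
JOB_SLOTS = 2_524         # reported job slots
REPORTED_HS06 = 28_055    # reported total HepSpec06

# Nominal values derived from the per-node figures.
logical_cores = NODES * LOGICAL_CORES_PER_NODE
nominal_hs06 = logical_cores * HS06_PER_CORE

# Effective HS06 per job slot implied by the reported totals.
hs06_per_slot = REPORTED_HS06 / JOB_SLOTS

print(f"logical cores:  {logical_cores}")
print(f"nominal HS06:   {nominal_hs06}")
print(f"HS06 per slot:  {hs06_per_slot:.2f}")
```

The reported totals imply roughly 11.1 HS06 per job slot, consistent with the rounded "11 HS06/core" on the slide.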
Site Reliability
KISTI Analysis Facility - KIAF
• Parallel Analysis Facility based on PROOF
• In operation since 2011, ALICE use only
• 1 master, 8 worker nodes, 12 cores and 22 TB disk per node
• Similar size and utilization as CAF - CERN Analysis Facility
Plans for On-call Service
• Alarm system
– Nagios + e-mail notifications
– Implementing SMS plugin + Night Owl shift by private company
– Tape system: hardware/software malfunction reported to IBM and third-party company
– 24/7 support, intervention to be carried out within one day
– Ongoing evaluation of monitoring frameworks, e.g. Icinga, Zabbix, etc.
• On-call scheme
– One-week shift cycle with 5-6 personnel
– Expecting 1 or 2 calls in a cycle: alarms from batch scheduler and services, WN servicing
– From daily monitoring report: detailed action list on services and hardware incidents
• Night owl shift
– Private company contract: on-site support
– If necessary: SMS and e-mail notification to off-site on-duty experts
– Supercomputing division at KISTI has been running a similar system for years
We are planning to set up an on-call service. It will likely consist of three functions: an alarm system, an on-call scheme and a night owl shift.
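The one-week shift cycle in the on-call scheme could be computed as a simple round-robin over the roster. The sketch below is an assumption about how such a rotation might be implemented, not the scheme actually deployed at GSDC; the roster entries, the start date and the function name `on_call` are all hypothetical.

```python
from datetime import date

# Hypothetical roster of 5 on-duty experts (the slide mentions 5-6 personnel).
ROSTER = ["expert-1", "expert-2", "expert-3", "expert-4", "expert-5"]

# Assumed start of the first shift; 5 May 2014 is a Monday.
ROTATION_START = date(2014, 5, 5)


def on_call(day: date, roster=ROSTER, start=ROTATION_START) -> str:
    """Return who is on call on `day`, rotating once per week."""
    weeks_elapsed = (day - start).days // 7
    return roster[weeks_elapsed % len(roster)]


if __name__ == "__main__":
    for d in (date(2014, 5, 5), date(2014, 5, 12), date(2014, 6, 9)):
        print(d.isoformat(), on_call(d))
```

Each person holds the shift for seven consecutive days, and with five people the cycle repeats every five weeks.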
NETWORK
Internal Network
• Internal network for Tier-1 is isolated from the computing centre service network
• Done in Oct 2013: internal network re-structuring (3-week shutdown)
– Preparation for upgrade of bandwidth of external network up to 10 Gbps
– Main switch upgrade: bandwidth up to 2.5 Tbps
– HA configuration of private network
– Remove bottlenecks to storage
• Full 20 Gbps configuration (Incoming/Outgoing)
– Replaced all switches by 10 Gbps; done on part of service racks
– 1 Gbps switches in place for servers with 1 Gbps cards
• Worker nodes to be upgraded with 10 Gb cards
• Tape service nodes are being connected to the 10 Gbps switches
External Network
• Current bandwidth to CERN: 2 Gbps
• Dedicated link via Daejeon-Chicago-Amsterdam-Geneva
• Roadmap for 10 Gbps upgrade presented to WLCG MB and accepted
• Working on upgrading bandwidth up to 10 Gbps
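To illustrate what the move from 2 Gbps to 10 Gbps means in practice, the sketch below estimates ideal transfer times over the link to CERN. The 100 TB dataset size is a hypothetical example, and the calculation ignores protocol overhead and link contention, so real transfers would take longer.

```python
def transfer_time_hours(size_tb: float, link_gbps: float) -> float:
    """Ideal transfer time in hours for `size_tb` decimal terabytes
    over a link of `link_gbps` gigabits per second."""
    bits = size_tb * 1e12 * 8            # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)   # bits / (bits per second)
    return seconds / 3600


if __name__ == "__main__":
    dataset_tb = 100  # hypothetical dataset size
    for gbps in (2, 10):
        hours = transfer_time_hours(dataset_tb, gbps)
        print(f"{dataset_tb} TB at {gbps} Gbps: {hours:.1f} h")
```

Under these idealized assumptions, the same dataset moves five times faster on the upgraded link: about 111 hours at 2 Gbps versus about 22 hours at 10 Gbps.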
LHC OPN
• KISTI T1 network (134.75.125.0/24) included into LHC OPN
• BGP peering between Kreonet router @ KISTI and LCG network @ CERN
• perfSONAR has been deployed for measuring bandwidth and latency; a firewall policy issue persists concerning the ports below 1024, e.g. 80 (http), 443 (https), 843 (bwctl)
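The prefix and ports named above can be checked programmatically, for example when reviewing firewall rules. This is a minimal sketch using Python's standard `ipaddress` module; the helper names and the sample host addresses are hypothetical, and only the prefix and port numbers come from the slide.

```python
import ipaddress

# KISTI T1 prefix announced to LHC OPN (from the slide).
KISTI_T1_NET = ipaddress.ip_network("134.75.125.0/24")

# Ports named on the slide as affected by the firewall policy.
PERFSONAR_PORTS = {80: "http", 443: "https", 843: "bwctl"}


def in_t1_network(addr: str) -> bool:
    """True if `addr` falls inside the KISTI T1 prefix."""
    return ipaddress.ip_address(addr) in KISTI_T1_NET


def is_below_1024(port: int) -> bool:
    """True if `port` is in the range the firewall policy concerns."""
    return port < 1024


if __name__ == "__main__":
    # Hypothetical sample addresses, for illustration only.
    for host in ("134.75.125.10", "134.75.126.1"):
        print(host, "in T1 network:", in_t1_network(host))
    for port, name in PERFSONAR_PORTS.items():
        print(f"{port} ({name}) below 1024:", is_below_1024(port))
```

A /24 prefix covers 256 addresses, and all three listed ports fall below 1024.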
Conclusion
• KISTI T1 was approved as a full T1 at the meeting of the WLCG Overview Board in Nov. 2013
– The progress in ramping up the capability as a T1 was appreciated by the ALICE community, and a roadmap to a 10G network was accepted
• In Jan. 2014, KISTI T1 joined LHC OPN
• Over the last 6 months, KISTI T1 has been “shape-shifting” in terms of network
– Core switches replaced (bandwidth: 0.9 Tbps → 2.5 Tbps)
– Rack switches replaced (bandwidth: 1 Gbps → 10 Gbps)
– Servers migrated to 10GbE equipped ones