
MCCS DETAILED DESIGN DOCUMENT

Document number ................. SKA-TEL-LFAA-0600051
Context ......................... DRE
Revision ........................ 02
Author .......................... A. Magro, A. DeMarco
Date ............................ 2019-02-12
Document Classification ......... FOR PROJECT USE ONLY
Status .......................... Released

              Name                Designation                  Affiliation  Signature  Date
Authored by:  A. Magro            Subject Matter Expert        AADC
Owned by:     M. Waterson         AA Domain Specialist         SKAO
Approved by:  P. Gibbs            Engineering Project Manager  SKAO
Released by:  J. G. Bij de Vaate  Consortium Lead              AADC

DOCUMENT HISTORY

Revision  Date of Issue  Engineering Change Number  Comments
A         2018-06-04     -                          Draft Template version released within consortium
B         2018-10-24                                First-round revisions
01        2018-10-31                                Formal release
02        2019-02-12                                Implemented CDR panel OARs: LFAA Element CDR_OAR_MCCS DDD, OARs 2, 5, 7, 8

DOCUMENT SOFTWARE

Package         Version       Filename
Wordprocessor   MS Word 2016  document.docx
Block diagrams
Other

ORGANISATION DETAILS

Name                Aperture Array Design and Construction Consortium
Registered Address  ASTRON, Oude Hoogeveensedijk 4, 7991 PD Dwingeloo, The Netherlands
Tel.                +31 (0)521 595100
Fax.                +31 (0)521 595101
Website             www.skatelescope.org/lfaa/
Copyright           Document owner: Aperture Array Design and Construction Consortium

This document is written for internal use in the SKA project.


TABLE OF CONTENTS

1 INTRODUCTION .......... 9
  1.1 Purpose of the document .......... 9
  1.2 Scope of the document .......... 9
  1.3 Intended Audience .......... 9
  1.4 Document Overview .......... 9
  1.5 Document Tree .......... 9
2 REFERENCES .......... 11
  2.1 Applicable documents .......... 11
  2.2 Reference documents .......... 11
3 MCCS OVERVIEW .......... 14
  3.1 Telescope Overview .......... 14
  3.2 MCCS Overview .......... 15
  3.3 Functional Elements .......... 16
  3.4 Non-Functional Elements .......... 17
4 HARDWARE COMPONENTS .......... 18
  4.1 Compute Cluster .......... 18
    4.1.1 Minimum network bandwidth requirements per station .......... 18
    4.1.2 Minimum compute requirements per station .......... 20
    4.1.3 Minimum memory requirements per station .......... 22
    4.1.4 Number of compute servers required .......... 22
  4.2 MCCS Network .......... 24
    4.2.1 Network Diagram .......... 25
    4.2.2 Network Configuration .......... 26
    4.2.3 Security .......... 26
  4.3 MCCS Assembly .......... 27
5 SOFTWARE .......... 29
  5.1 System Software .......... 30
  5.2 Hardware Provisioning .......... 31
    5.2.1 Adding a compute server to MAAS .......... 32
    5.2.2 MCCS node management .......... 32
  5.3 Software Orchestration .......... 32
  5.4 Storage Management .......... 35
  5.5 Metric Monitoring .......... 36
6 DESIGN DECISIONS .......... 37
  6.1 Why use GPUs and not perform everything on CPUs? .......... 37
  6.2 Why have a redundant server per cabinet? .......... 37
  6.3 Why have a separate master and shadow node? .......... 37
  6.4 Why partition the station across servers rather than schedule based on available resources? .......... 38
  6.5 Why use distributed rather than central storage? .......... 38
  6.6 Why have a separate 1G network? .......... 38
  6.7 Why is the fast transient buffer located in MCCS? .......... 39


  6.8 Why use containers for running software? .......... 39
7 PERFORMANCE .......... 40
  7.1 Power Consumption .......... 40
  7.2 Thermal Performance .......... 40
  7.3 Monitor and Control Network Bandwidth .......... 40
8 RELIABILITY, AVAILABILITY AND MAINTAINABILITY .......... 41
  8.1 Reliability, Availability, Maintainability Allocation .......... 41
    8.1.1 Design for Reliability .......... 41
    8.1.2 MCCS RAMS Product Breakdown Structure .......... 41
    8.1.3 Reliability Prediction .......... 42
    8.1.4 Operationally Capable .......... 42
  8.2 Availability and Maintainability .......... 42
    8.2.1 Availability .......... 42
    8.2.2 Maintenance Effort .......... 43
  8.3 Reliability, Maintainability and Availability Requirements Compliance .......... 44
9 SAFETY .......... 45
  9.1 Hazard Analysis .......... 45
  9.2 Personal Safety .......... 45
    9.2.1 Rack Tip-over .......... 45
    9.2.2 Over-Temperature and Fire .......... 46
    9.2.3 Weight .......... 46
    9.2.4 Sharp Edges .......... 46
    9.2.5 Laser Light .......... 46
    9.2.6 Fire Protection .......... 46
    9.2.7 Personal Protective Equipment .......... 46
  9.3 Electrical Safety .......... 46
    9.3.1 Rack Electrical Design .......... 47
    9.3.2 LRU Design .......... 47
    9.3.3 Earthing and Electrical Bonding Systems .......... 47
  9.4 Environmental Safety .......... 47
    9.4.1 Hazardous Materials .......... 47
    9.4.2 Gases .......... 48
    9.4.3 Liquids .......... 48
  9.5 Asset Protection .......... 48
  9.6 Certification .......... 49
  9.7 Safety Requirements Compliance .......... 49
10 ENVIRONMENTAL .......... 55
  10.1 Transportation of Equipment .......... 55
  10.2 Storage of Equipment .......... 55
  10.3 Operation .......... 55
    10.3.1 Mechanical Environment .......... 55
    10.3.2 Nominal Environmental Conditions .......... 55
    10.3.3 EMC Environment .......... 56
    10.3.4 Susceptibility to Emissions .......... 56
      10.3.4.1 Susceptibility to Radiated Emissions .......... 56


      10.3.4.2 Susceptibility to Conducted Emissions .......... 56
    10.3.5 Emissions .......... 56
      10.3.5.1 Radiated Emissions .......... 56
      10.3.5.2 Conducted Emissions .......... 57
  10.4 Environmental Requirement Compliance .......... 57
11 ASSUMPTIONS .......... 62
12 DEVELOPMENT SUPPORT .......... 64
  12.1 Development Facility .......... 64
  12.2 Integration Verification and Test Plan .......... 64
13 SYSTEM COMMISSIONING .......... 66
  13.1 Hardware Commissioning .......... 66
    13.1.1 Production Readiness Review .......... 67
  13.2 Code Commissioning .......... 67
  13.3 Verification and Acceptance .......... 68
14 INTEGRATED LOGISTIC SUPPORT .......... 69
  14.1 Support and Maintenance Concept .......... 69
    14.1.1 Support Concept .......... 69
      14.1.1.1 On-site Support .......... 70
      14.1.1.2 Remote and Off-site Support .......... 70
    14.1.2 Corrective Maintenance .......... 70
    14.1.3 Predictive Maintenance .......... 71
    14.1.4 Preventative & Scheduled Maintenance .......... 71
    14.1.5 COTS Refresh Cycle .......... 71
  14.2 Spares .......... 71
    14.2.1 COTS Equipment Repair .......... 72
    14.2.2 Bespoke Equipment Repair .......... 72
    14.2.3 Consumables .......... 72
  14.3 Support Organization .......... 72
    14.3.1 Technical data .......... 72
    14.3.2 Obsolescence Management .......... 72
  14.4 Support Facilities and Equipment .......... 73
    14.4.1 Spares Storage .......... 73
    14.4.2 Intermediate Level Support Facility .......... 73
    14.4.3 Support Test equipment .......... 73
    14.4.4 Computer Resources .......... 73
    14.4.5 Accessibility and Security .......... 73
    14.4.6 Software Maintenance and Installation .......... 73
  14.5 Manpower and Personnel .......... 74
    14.5.1 Operator Role .......... 74
    14.5.2 On-site Maintenance Support Role .......... 75
    14.5.3 Off-site Maintenance Support Role .......... 75
    14.5.4 Maintainer Training and Training Support .......... 75
15 APPENDIX A: LIST OF TBD/TBCS .......... 76


LIST OF FIGURES

Figure 1-1  SKA1 LFAA Element Documentation Tree .......... 10
Figure 3-1  SKA1 Telescope Overview .......... 14
Figure 3-2  SKA1_Low Functional Diagram .......... 15
Figure 3-3  LFAA Sub-Elements .......... 15
Figure 4-1  Network links between MCCS and external entities .......... 24
Figure 4-2  MCCS rack-level network diagram .......... 25
Figure 4-3  MCCS rack assembly .......... 28
Figure 5-1  Cluster management overview .......... 29
Figure 5-2  MAAS deployment with a Regional Controller and two Cluster Controllers .......... 31
Figure 5-3  Software containment .......... 33
Figure 5-4  GlusterFS distributed replicated volume .......... 36
Figure 8-1  MCCS Availability Model .......... 43
Figure 14-1 Support and Maintenance Concept .......... 69

LIST OF TABLES

Table 4-1  Data rate per station .......... 19
Table 4-2  Core Utilization Estimates .......... 21
Table 4-3  Compute requirements for one station .......... 22
Table 4-4  Memory requirements for one station .......... 22
Table 4-5  Minimum resource requirements for increasing number of stations .......... 23
Table 4-6  MCCS compute server configuration .......... 23
Table 4-7  Master server configuration .......... 23
Table 7-1  Rack power budget .......... 40
Table 8-1  MCCS Equipment MTBF (h) and MTTR (min) .......... 42
Table 8-2  Reliability, Maintainability and Availability Requirements Compliance .......... 44
Table 9-1  Safety Requirements Compliance .......... 49
Table 10-1 MCCS EMI Emitters .......... 57
Table 10-2 Environmental Requirements Compliance .......... 61
Table 11-1 Assumptions used in the design of MCCS .......... 62
Table 13-1 Array assembly events and MCCS functionality .......... 66
Table 13-2 Hardware requirements for supporting and processing equipment for each AA .......... 66
Table 14-1 Spares list derived from reliability prediction models .......... 71
Table 15-1 Table of TBDs .......... 76
Table 15-2 Table of TBCs .......... 77


LIST OF ABBREVIATIONS

AADC .......... Aperture Array Design and Construction Consortium
AAVS .......... Aperture Array Verification System
ADC .......... Analogue to Digital Converter
AD-n .......... nth document in the list of Applicable Documents
AIV .......... Assembly Integration and Verification
API .......... Application Programming Interface
APIU .......... Antenna Power Interface Unit
ASIC .......... Application Specific Integrated Circuit
BIOS .......... Basic Input/Output System
CAD .......... Computer Aided Design
CCB .......... Configuration Control Board
CDR .......... Critical Design Review
CI .......... Configuration Item
COTS .......... Commercial Off The Shelf
CPF .......... Central Processing Facility
CM .......... Configuration Manager
CMB .......... Cabinet Management Board
CPU .......... Central Processing Unit
CSP .......... Central Signal Processor
CW .......... Continuous Wave
DAQ .......... Data Acquisition
DDD .......... Detailed Design Document
DHCP .......... Dynamic Host Configuration Protocol
DMS .......... Document/Data Management System
DNS .......... Domain Name Service
ECP .......... Engineering Change Proposal
EMI .......... Electromagnetic Interference
FN .......... Field Node
FoV .......... Field of View
FPGA .......... Field Programmable Gate Array
GPU .......... Graphics Processing Unit
HW .......... Hardware
ICD .......... Interface Control Document
INFRA AUS .......... Infrastructure Australia
IP .......... Internet Protocol
IPMI .......... Intelligent Platform Management Interface
ISO .......... International Organisation for Standardisation
LFAA .......... Low Frequency Aperture Array
LFAA-DN .......... Low Frequency Aperture Array – Data Network
LNA .......... Low Noise Amplifier
LMC .......... Local Monitoring and Control
LRU .......... Line Replaceable Unit
LOFAR .......... Low Frequency Array
MAAS .......... Metal-as-a-Service
MBSE .......... Model Based Systems Engineering
MCCS .......... Monitor, Control and Calibration Subsystem
MOM .......... Minutes of Meeting
MPO .......... Multi-Purpose Optic (connector)
MRI .......... Master Record Index
MRO .......... Murchison Radio-astronomy Observatory
MTBF .......... Mean Time Between Failures
MTTR .......... Mean Time To Repair


MTU .......... Maximum Transmission Unit
MWA .......... Murchison Widefield Array
NRE .......... Non-Recurring Engineering
NTP .......... Network Time Protocol
OS .......... Operating System
OSPF .......... Open Shortest Path First
PA .......... Product Assurance
PBS .......... Product Breakdown Structure
PDF .......... Portable Document Format
PDR .......... Preliminary Design Review
PC .......... Project Controller
PPS .......... Pulse Per Second
PO .......... Project Officer
PXE .......... Preboot Execution Environment
QA .......... Quality Assurance
RBS .......... Re-Baselining Submission
RD-n .......... nth document in the list of Reference Documents
RF .......... Radio Frequency
RFI .......... Radio Frequency Interference
RFoF .......... Radio Frequency signal over Fibre
RMS .......... Root Mean Square
RPF .......... Remote Processing Facility
RSTP .......... Rapid Spanning Tree Protocol
SAD .......... Software Architecture Document
SaDT .......... Signal and Data Transport
SATA .......... Serial Advanced Technology Attachment
SDP .......... Science Data Processor
SEMP .......... System Engineering Management Plan
SFDR .......... Spurious Free Dynamic Range
SPS .......... Signal Processing Subsystem
SRMB .......... Sub-Rack Management Board
SKA .......... Square Kilometre Array
SKA-LOW .......... SKA low frequency part of the full telescope
SKAO .......... SKA Office
S/N .......... Signal to Noise
SOW .......... Statement of Work
SSD .......... Solid State Drive
SW .......... Software
TANGO .......... TAco Next Generation Objects
TCP-IP .......... Transmission Control Protocol – Internet Protocol
TBC .......... To Be Confirmed
TBD .......... To Be Determined
TBS .......... To Be Supplied
TDP .......... Total Dissipated Power
TFTP .......... Trivial File Transfer Protocol
TM .......... Telescope Management
TPM .......... Tile Processor Module
TRB .......... Test Review Board
UCP .......... UniBoard Control Protocol
UDP .......... User Datagram Protocol
UPS .......... Uninterruptible Power Supply
VLAN .......... Virtual Local Area Network
WBS .......... Work Breakdown Structure
WDM .......... Wavelength Division Multiplexing
WP .......... Work Package


1 Introduction

1.1 Purpose of the document

The purpose of this document is to describe the detailed design of the MCCS compute hardware cluster for the Low Frequency Aperture Array (LFAA) of SKA Phase 1, with one detailed Reference Design implementation used to determine the envelope of its cost, power, equipment space, reliability, availability and maintainability.

1.2 Scope of the document

This document describes how the LFAA Monitor, Control and Calibration Sub-System reference design can meet the requirements within the SKA LFAA Monitor, Control and Calibration Sub-System Requirement Specification.

The level of detail in this document is sufficient to:

1. Define interfaces with other SKA Elements and LFAA Sub-elements.
2. Establish a reasonable reference baseline design at reasonably low perceived risk.
3. Estimate time, effort and cost to deliver the functionality specified in the LFAA Monitor, Control and Calibration Sub-System Requirements Specification [AD6].

In other words, the LFAA Sub-Element reference design is defined in sufficient detail to reduce the risk of effort/time/cost overruns in the Construction Phase.

The current release (100% version) will support the Critical Design Review for the LFAA Element. The level of detail is sufficient to have high confidence in the reference design being compliant and able to be constructed with low risk. This Detailed Design Document (DDD), with references to supporting information and data, will provide a design artefact to support the Construction Phase activities.

1.3 Intended Audience

This document is expected to be used by the LFAA Element Consortium Engineering and Management Team and the SKAO System Engineering Team and SKAO LFAA Project Manager. This document is expected to be read by the external CDR review panel.

1.4 Document Overview

This document follows a template agreed between the SKAO and the LFAA Consortium. It covers the key contents called out in the LFAA SOW [AD2].

Detailed information is provided in the appendices or is contained in reference documents.


1.5 Document Tree

The overall document tree for the LFAA Element is shown in Figure 1-1. Level 1 (L1) is the SKA System (telescope) level, L2 is the LFAA Element level and L3 is the LFAA sub-element level (where MCCS resides).

[Figure: LFAA CIDL documentation tree (Rev 1.a, June 06, 2018), showing the L1 (SKA system), L2 (LFAA Element) and L3 (sub-element) requirements, internal and external ICDs, planning documents (PMP, SEMP, Risk Register, Con Ops, Construction Plan, LFAA AIVP), sub-element design, prototyping, costing and test specification documents, the LFAA ADD, LFAA Test Spec, LFAA RAMS/Logistics/Safety/EMI/EMC documents, and the delivery status of each (SKAO document, LFAA document at PDR with CDR updates, delivered for CDR, or not delivered).]

Figure 1-1 SKA1 LFAA Element Documentation Tree


2 References

2.1 Applicable documents

The following documents are applicable to the extent stated herein. In the event of conflict between the contents of the applicable documents and this document, the applicable documents shall take precedence.

[AD1] SKA-1 System Baseline Design, SKA-TEL-SKO-0000002, Issue 01
[AD2] SKA1 LFAA Element Statement of Work
[AD3] SKA LFAA EMI/EMC Control Plan, SKA-TEL-SKO-0000202, Issue 3
[AD4] Roll-out Plan for SKA1 Low, SKA-TEL-AIV-4410001, Issue 05
[AD5] SKA RAM Allocation, SKA-TEL-SKO-0000102, Issue 02
[AD6] SKA1 LFAA SPS Sub-Element Requirements Specification, SKA-TEL-LFAA-0400014
[AD7] SKA1 LFAA Requirements Specification, SKA-TEL-LFAA-0200026
[AD8] SKA1 LFAA to INFRA AUS ICD, 100-000000-003, Issue 03
[AD9] SKA1 LFAA Architecture Design Document, SKA1-TEL-LFAA-02000028, Rev 01

2.2 Reference documents

The following documents are referenced in this document. In the event of conflict between the contents of the referenced documents and this document, this document shall take precedence.

[RD1] Restriction of Hazardous Substances Directive (RoHS 2) http://www.conformance.co.uk/adirectives/doku.php?id=rohs Directive 2011/65/EU

[RD2] Waste Electrical and Electronic Equipment Directive (WEEE) http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32012L0019 Directive 2012/19/EU

[RD3] IEC EN BS AS/NZS 60950-1 Information Technology Equipment - Safety Part 1 General Requirements

[RD4] IEC 60721-3-1 Classification of environmental conditions - Part 3: Classification of groups of environmental parameters and their severities - Section 1: Storage 1997

[RD5] IEC 60721-3-2 Classification of environmental conditions - Part 3: Classification of groups of environmental parameters and their severities - Section 2: Transportation 1997

[RD6] ETSI EN 300 019-1-1 Environmental Engineering (EE); Environmental conditions and environmental tests for telecommunications equipment; Part 1-1: Classification of environmental conditions; Storage 2014 http://www.etsi.org/deliver/etsi_en/300001_300099/3000190101/02.02.01_60/en_3000190101v020201p.pdf

[RD7] ETSI EN 300 019-1-3 Environmental Engineering (EE); Environmental conditions and environmental tests for telecommunications equipment; Part 1-3: Classification of environmental conditions; Stationary use at weather-protected locations http://www.etsi.org/deliver/etsi_en/300001_300099/3000190103/02.03.02_60/en_3000190103v020302p.pdf

[RD8] IEC EN 61000-3-2 Electromagnetic compatibility (EMC) - Part 3-2 - Limits - Limits for harmonic current emissions (equipment input current ≤ 16 A per phase) 2006+ A2 2009

[RD9] IEC EN 61000-3-3 Electromagnetic compatibility (EMC) - Part 3-3: Limits - Limitation of voltage changes, voltage fluctuations and flicker in public low-voltage supply systems, for equipment with rated current ≤ 16 A per phase and not subject to conditional connection, 2017

[RD10] IEC EN 61000-4-2 Electromagnetic compatibility (EMC)- Part 4-2: Testing and measurement techniques - Electrostatic discharge immunity test 2013

[RD11] IEC EN 61000-4-3 Electromagnetic compatibility (EMC)- Part 4-3: Testing and measurement techniques - Radiated, radio-frequency, electromagnetic field immunity test 2010

[RD12] IEC EN 61000-4-4 Electromagnetic compatibility (EMC) - Part 4-4: Testing and measurement techniques - Electrical fast transient/burst immunity test 2012

[RD13] IEC EN 61000-4-5 Electromagnetic compatibility (EMC) - Part 4-5: Testing and measurement techniques - Surge immunity test 2014

[RD14] IEC EN 61000-4-6 Electromagnetic compatibility (EMC) - Part 4-6: Testing and measurement techniques - Immunity to conducted disturbances, induced by radio-frequency fields 2013

[RD15] IEC EN 61000-4-11 Electromagnetic compatibility (EMC) - Part 4-11: Testing and measurement techniques - Voltage dips, short interruptions and voltage variations immunity tests 2004

[RD16] IEC 61000-6-1 Electromagnetic compatibility (EMC) - Part 6-1: Generic standards - Immunity standard for residential, commercial and light-industrial environments

[RD17] IEC 61000-6-2 Electromagnetic compatibility (EMC) - Part 6-2: Generic standards - Immunity standard for industrial environments 2016

[RD18] IEC 61000-6-3 Electromagnetic compatibility (EMC) - Part 6-3: Generic standards - Emission standard for residential, commercial and light-industrial environments 2011

[RD19] IEC 61000-6-4 Electromagnetic compatibility (EMC) - Part 6-4: Generic standards - Emission standard for industrial environments 2011

[RD20] EN 55022 Information technology equipment - Radio disturbance characteristics - Limits and methods of measurement 2010

[RD21] CISPR 22 Information technology equipment - Radio disturbance characteristics - Limits and methods of measurement R2014

[RD22] CISPR 24 Information technology equipment - Immunity characteristics - Limits and methods of measurement 2010

[RD23] CISPR 32 Electromagnetic compatibility of multimedia equipment - Emission requirements 2015

[RD24] CISPR 35 Electromagnetic compatibility of multimedia equipment - Immunity requirements
[RD25] AS/NZS 4836:2011 Australian/New Zealand Standard: Safe working on or near low-voltage electrical installations and equipment, 2011
[RD26] AS/NZS 3000-2007 Wiring rules
[RD27] AIV Roll-out Plan, SKA-44100001
[RD28] LFAA Logistics Report, SKA-TEL-LFAA-02000044
[RD29] LFAA Hazard Analysis, SKA-TEL-LFAA-00000024
[RD30] MCCS Architecture Overview, SKA-TEL-LFAA-0600050
[RD31] MCCS Software Architecture Document, SKA-TEL-LFAA-0600052
[RD32] AAVS1 Software Demonstrator Design Report, SKA-TEL-LFAA-0600054
[RD33] Australian Health and Safety Act
[RD34] LFAA Hazard analysis
[RD35] SKA1 Low Transient Buffer – Analysis, Requirements, Budgets and Implementation, SKA-TEL-SKO-0000984, Issue 01
[RD36] MCCS Assembly, Verification and Test Plan, SKA-TEL-LFAA-0600053, Issue 02


[RD37] TM-LFAA ICD, 100-000000-028, Issue 2
[RD38] SPS Detailed Design Document, SKA-TEL-LFAA-0500035, Issue 02
[RD39] http://ganglia.sourceforge.net/
[RD40] https://www.ubuntu.com/
[RD41] https://maas.io/
[RD42] https://docs.gluster.org/en/latest/
[RD43] https://www.docker.com/
[RD44] https://singularity.lbl.gov/
[RD45] https://jujucharms.com/


3 MCCS Overview

3.1 Telescope Overview

Figure 3-1 shows the major SKA1 Observatory entities: SKA1-Low in Australia, SKA1-Mid in South Africa and the SKA Global Headquarters in the UK. The thick flow-lines show the unidirectional transport of large amounts of digitised data from the antennas to the Central Processing Facilities (CPF) on the sites, and from the CPFs to the Science Data Processor (SDP) and Archive facilities. The thin blue dash-dot lines show the bidirectional transport of system monitor and control data.

The SKA1-Low telescope array includes 512 stations, each consisting of 256 dual-polarisation log-periodic antennas. The stations are distributed over a distance of 65 km, with the greatest density of stations in the central core. The Central Processing facility is located on site and the SDP and archive are located in Perth. Additionally, each station can be divided into a number of smaller sub-stations at reduced bandwidth.

A more detailed schematic of the SKA1-Low telescope, extracted from the SKA1 System Baseline V3 Description (in preparation), is shown in Figure 3-2. This figure shows the major SKA1-Low signal flow components, as well as the areas of consortia responsibility (red boxes) and the key technologies needed to implement the components. The green dashed line shows the bi-directional flow of monitor, control and operational data, and the orange dot-dashed line shows the distribution of synchronisation and timing signals.

Figure 3-1 SKA1 Telescope Overview

A schematic of the SKA1_Low Telescope, extracted from the Baseline Design [AD1], is shown below, including the LFAA Element (product [101-000000]).


SKA1-Low operates in imaging mode and non-imaging mode concurrently, with between 1 and 16 sub-arrays in operation at any one time. Each sub-array is programmable as a separate conceptual telescope in terms of antenna pointing, band selection and the setting of configurable imaging and non-imaging parameters. The only resources shared between sub-arrays are observation time, communications links and processing resources.

[Figure: SKA1_Low functional diagram showing the Low-Frequency Aperture Array stations (core and outer antenna station arrays, LNAs and amplification/filtering, RF-over-fibre transport links, digitisation, channelisation, beamforming and transient capture), the Central Processing Facility (channeliser/correlator/beamformer, pulsar search and pulsar timing, observatory clock system, advanced time keeping and distribution, VLBI terminal equipment and data routing), the Science Data Processing Facility (science data processing front-end, science data archive and distribution, supercomputer hardware and software, advanced data storage) and the Telescope Manager (operations, control and monitoring systems), together with the long-haul and fibre-optic digital data links and the synchronisation and timing distribution between them.]

Figure 3-2 SKA1_Low Functional Diagram


3.2 MCCS Overview

Figure 3-3 LFAA Sub-Elements

The MCCS performs the local monitoring, control and calibration functions for the stations and supporting products. It receives commands from TM and reports the LFAA status back to it. It comprises a compute cluster (hardware resources composed of off-the-shelf high-performance servers), local power and cooling distribution, a local network, and the job management software needed to support the LFAA monitor and control functions. The MCCS is connected to both the SPS and the LFAA-DN. It also calculates the beamforming and calibration coefficients. The MCCS controls the TPMs, the M&C and data networks, and the supporting hardware in the cabinets. It is also responsible for implementing the fast transient buffer and transmitting it, when instructed, to SDP via a dedicated 100Gb link. Refer to [RD31] for additional information on the overall software architecture of MCCS.

3.3 Functional Elements

The key function of MCCS is to perform control and monitoring of the entire stack of MCCS hardware and software, including the data acquisition and LFAA telescope calibration processes. The purpose is to be able to create, configure and deploy observation setups across the telescope, and to monitor and control the progress of these observations.

The primary functions of LFAA.MCCS are to:

- Apply and maintain control over the signals being received; in particular, to perform RFI flagging as instructed (L3-126), calibrate over a fixed number of frequency channels (L3-118), permit control and monitoring of a desired bandwidth (L3-123) and make available correction coefficients to normalise amplitude response (L3-103)
- Synchronise the system to work with NTP timing (L3-12, L3-171), control time stamping in each station to a particular PPS transition (L3-101), calculate and apply pointing coefficients (L3-147), and calculate and apply calibration coefficients in real time (L3-146, L3-15, L3-16, L3-105)
- Store the information required to recreate pointing/calibration coefficients (L3-14), achieve a very accurate flux density scale (L3-18), have direction-dependent models for station beams of sufficient accuracy (L3-89, L3-104), compute and transmit polarisation compensation coefficients (L3-106), correlate channelised data for calibration (L3-157), flatten bandpasses (L3-159) and maintain a good signal-to-noise ratio (L3-123)
- Apply a beam pointing model of particular accuracy and angle range, according to the supplied beam pointing coordinates (L3-32, L3-99, L3-281, L3-148); MCCS will monitor and control up to 8 beams per station within a sub-array (L3-98, L3-282)
- Process and control a transient buffer of a particular size and latency, transferring it to SDP at a particular speed (L3-9, L3-10, L3-11, L3-120, L3-121, L3-279, L3-292)
- Configure, control, manage and monitor the entire observation cycle of the LFAA: setting up stations, transitioning across the different observation states, sub-arraying of the telescope, running particular software/firmware combinations for particular observations, and reporting on and about the observation (L3-2, L3-19, L3-20, L3-21, L3-22, L3-23, L3-30, L3-154, L3-142, L3-290, L3-291)
- Perform general monitoring and control as applied to LFAA.MCCS: hardware usage, power/temperature control, state monitoring, network status, data acquisition jobs and alarms (L3-283, L3-158, L3-160, L3-220, L3-162, L3-156, L3-33, L3-239), station failure flagging, failure reporting or equipment shutdown (L3-26, L3-277, L3-248), low-power control (L3-280, L3-36, L3-37), remote power up/down (L3-170), and fail-safe operation (L3-180, L3-181, L3-182)
- Be capable of performing MCCS off-line diagnostics (L3-211, L3-212, L3-213, L3-214, L3-215), as well as on-line diagnostic tests (L3-216, L3-217, L3-218, L3-219)
- Possess appropriate configuration management via a local database, keeping track of software versioning and updates (L3-91, L3-166, L3-167)
- Provide the required internal interfaces to field nodes and SPS (L3-129, L3-274), and external interfaces to engineering control, web-based access, SADT, INFRA, TM and SDP (L3-169, L3-168, L3-34, L3-35, L3-39, L3-128)

3.4 Non-Functional Elements

To implement the functionalities outlined above, MCCS must also provide:

- An appropriate deployment location (L3-241) with the required environmental and safety certifications (L3-242, L3-243, L3-244) and noise emission limits (L3-247)
- Racks built according to predefined specifications related to stability (L3-245, L3-246), size and weight (L3-249, L3-250), and number (L3-251)
- Appropriate power supply, power draw, limits and survivability (L3-252, L3-253, L3-254, L3-255, L3-256, L3-284, L3-257, L3-258, L3-259, L3-260, L3-285, L3-286, L3-287, L3-288, L3-289)
- Controlled electromagnetic emissions (L3-261, L3-262, L3-263)
- Immunity to radiated, radio-frequency or electromagnetic fields, and to surges or dips (L3-264, L3-265, L3-266, L3-267, L3-268, L3-269, L3-270)
- A controlled thermal environment and cooling operation (L3-272)
- Appropriate software and firmware standards (L3-172)
- Elimination and reduction of occupational health and safety hazards (L3-176, L3-177)
- Fail-safe design and operation (L3-178, L3-179, L3-276), mechanical safety (L3-183, L3-184, L3-185, L3-186), electrical safety (L3-187, L3-188), and equipment safety (L3-189, L3-190, L3-191)
- Appropriate marking on all equipment for identification, safety, weight warnings, hazard warnings, electrostatic warnings, off-the-shelf marking, etc., with robust labels (L3-193, L3-194, L3-195, L3-196, L3-192, L3-204)
- Compliance with environmental protection regulations and procedures (L3-197, L3-198, L3-199, L3-200)
- Control over electromagnetic emissions and susceptibility (L3-201, L3-202, L3-203, L3-205)
- Sufficient and appropriate levels and design for duration of availability and reliability (L3-206, L3-207), and maintainability (L3-208, L3-209, L3-210, L3-221, L3-222, L3-223, L3-224, L3-225, L3-226, L3-227, L3-228)
- A controlled environment for all equipment (L3-220, L3-230, L3-231, L3-232)
- Identification and methods of identification for all equipment (L3-233, L3-235, L3-236, L3-237, L3-238)


4 Hardware Components

MCCS is essentially a compute cluster, requiring enough compute processing power, network bandwidth and memory to run the MCCS software. Compute processing power is dominated by the correlation and calibration processes, network bandwidth by the transmission of calibration spigots from SPS to MCCS, and memory by the fast transient buffer. The compute servers are distributed across 4 MCCS cabinets which, apart from the compute servers themselves, contain the network switches required to transport SPS LMC data from SPS to MCCS, interconnect the compute servers, and transmit the fast transient buffer to SDP. The following sections analyse the compute, network and cabinet requirements, and describe the software necessary for these components to function properly.

4.1 Compute Cluster

This section defines the specification for a single compute server, as well as the minimum number of servers required to process all the stations in LFAA. The following hardware components are assumed:

- NVIDIA P100 with NVLink, or equivalent
- 100Gb interface card
- 128 GB DDR4 2666 RAM chip
- Intel Xeon Gold 6148 Processor (20-core 2.4 GHz), or equivalent (current Intel CPUs can go up to 28 cores)

The current state-of-the-art compute server can have the following maximum specifications (these are specified in more detail in Table 4-6):

- 4x NVIDIA P100 with NVLink
- 1x 100 Gb interface card
- 12x 128 GB DDR4 chips, totalling 1.5 TB of RAM
- 2x Intel Xeon Gold processors, totalling 80 cores with hyperthreading

Note that apart from the compute servers, an additional server is required to host the high-level monitoring and control software, as well as to act as the master server for the compute cluster. For redundancy, a further server is included which acts as a "shadow master": in the case where the master server is offline, suffers an unrecoverable error or is being updated, the shadow master can temporarily take over.

The next three sections analyse the network, compute and memory requirements for processing a single station. These are used as the basis for calculating the number of stations which can be processed on a single compute server, and therefore the minimum size of the MCCS cluster; a rough sizing sketch is given below.
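As a rough illustration of this sizing logic, the following Python sketch (not part of the design itself) estimates the network-limited number of stations per server and the corresponding minimum cluster size. The 80% usable-link fraction, and the use of the single 100Gb NIC as the binding constraint, are assumptions for illustration only; the design also applies the compute and memory limits derived in Sections 4.1.2 and 4.1.3.

    # Illustrative sizing sketch (not from the DDD): network-limited estimate of
    # how many stations one compute server can host, and the cluster size implied.
    STATIONS = 512                 # SKA1-Low stations handled by LFAA
    INGEST_PER_STATION_GBPS = 9.0  # Table 4-1: total inbound rate per station
    NIC_GBPS = 100.0               # one 100Gb interface card per server
    USABLE_FRACTION = 0.8          # assumed headroom for bursts and M&C traffic

    stations_per_server = int(NIC_GBPS * USABLE_FRACTION // INGEST_PER_STATION_GBPS)
    servers_needed = -(-STATIONS // stations_per_server)  # ceiling division

    print(f"Stations per server (network-limited): {stations_per_server}")
    print(f"Minimum compute servers (network-limited): {servers_needed}")

Under these assumptions the NIC alone would cap a server at 8 stations, implying at least 64 compute servers; the sections below refine this with per-station compute and memory budgets.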

4.1.1 Minimum network bandwidth requirements per station

Table 4-1 lists the data rates of several data items which flow either from SPS to MCCS, or from MCCS to other elements. A single station is considered for these calculations, since it is only a matter of scaling to multiple stations to get the total data rate per server, per cabinet and for the whole MCCS cluster. The derivation of each entry is shown below the table; the packets-per-second figure depends on the packet size, which is known for certain data items but unknown for others.

Table 4-1. Data rate per station

Data item                 Data rate   Pkts/sec   In/Out
Calibration spigot        7.6 Gb/s    116k       In
Transient data            1.4 Gb/s    85k        In
Fast transient buffer     1 Gb/s      NA         Out
Calibration coefficients  3 Mb/s      336        Out
Pointing delays           65.5 kb/s   16         Out
Integrated data           4 Mb/s      66         In
Antenna transient data    TBD1        TBD2       In
M&C traffic               116 kb/s    1600       In/Out
Total                     ~9 Gb/s     ~202k      In
                          ~1 Gb/s     ~1682      Out

Calibration data to MCCSEvery TPM sends channelised voltage data (calibration spigot) to the MCCS. The data rate below provides the burst data rate for a single channel. Each packet contains an 8 KB payload, with 72 bytes overhead (Ethernet, IP, UDP and SPEAD headers), or 0.009%

This data rate per station is given by:

256 (Antennas) * 2 (Polarisations) * 0.781 (Channel Bandwidth, MHz) * 2 (Nyquist) * 8 (Sample bits) * 32/27 (Oversampling)

= 256 * 2 * 0.781 * 2 * 8 * (32/27) * 10^6 Hz ≈ 7.6 Gb/s

This amounts to ~116k packets per second (overhead is negligible in this case).

Fast transient data rate to MCCS
Each station sends reduced-bandwidth station beams to MCCS for the fast transient buffer. The packet payload depends on the number of bits per sample; here 2 bits are assumed, with 150 MHz worth of frequency channels being buffered. Each packet contains a 2 KB payload with 72 bytes of overhead (Ethernet, IP, UDP and SPEAD headers), or roughly 3.5% overhead.

192 (Channels) * 2 (Polarisations) * 0.781 (Channel Bandwidth, MHz) * 2 (Nyquist) * 2 (Sample bits) * 32/27 (Oversampling factor)

= 192 * 2 * 0.781 * 2 * 2 * (32/27) * 10^6 Hz ≈ 1.4 Gb/s

This amounts to ~85k packets per second (overhead is negligible in this case).
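The two burst rates above can be reproduced with the short calculation below, a minimal sketch using only the parameter values quoted in this section (antenna and channel counts, channel bandwidth, sample widths and the 32/27 oversampling factor).

```python
# Minimal sketch reproducing the per-station burst rates derived above.
# All parameter values are taken from this section.

OVERSAMPLING = 32 / 27       # LFAA oversampling factor
CHANNEL_BW_HZ = 0.781e6      # channel bandwidth (0.781 MHz)

def spigot_rate_bps(antennas=256, pols=2, sample_bits=8):
    """Calibration spigot burst rate for one channel of one station."""
    return antennas * pols * CHANNEL_BW_HZ * 2 * sample_bits * OVERSAMPLING

def transient_rate_bps(channels=192, pols=2, sample_bits=2):
    """Fast transient (reduced station beam) rate for 150 MHz of channels."""
    return channels * pols * CHANNEL_BW_HZ * 2 * sample_bits * OVERSAMPLING

spigot = spigot_rate_bps()        # ~7.6e9 b/s
transient = transient_rate_bps()  # ~1.4e9 b/s

# Packet rates follow from the 8 KB and 2 KB payloads respectively,
# matching the ~116k and ~85k packets per second quoted above.
print(f"spigot:    {spigot / 1e9:.1f} Gb/s, {spigot / (8192 * 8):,.0f} pkt/s")
print(f"transient: {transient / 1e9:.1f} Gb/s, {transient / (2048 * 8):,.0f} pkt/s")
```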

Fast transient data rate from MCCS (to SDP)
When triggered by TM, the fast transient buffer from each station is transmitted to SDP. This LFAA-SDP link is a single 100G link, so its bandwidth must be shared and balanced across all stations. Assuming all stations are transmitting on the link, the bandwidth per station is 195 Mb/s. This rate doubles when the number of transmitting stations is halved, but cannot exceed the rate of the raw station beam, which is 11.4 Gb/s. An average data rate of 1 Gb/s (20% of stations taking part in the transient buffer) is assumed.
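As a cross-check of the per-station share of the LFAA-SDP link, the snippet below assumes 512 stations (64 servers times 8 stations each, consistent with Section 4.1.4); this station count is an assumption, as it is not restated in this section.

```python
# Per-station share of the single 100 Gb/s LFAA-SDP link.
# Assumes 512 stations (64 servers x 8 stations, per Section 4.1.4).
LINK_BPS = 100e9
STATIONS = 64 * 8

print(f"{LINK_BPS / STATIONS / 1e6:.0f} Mb/s per station")          # ~195 Mb/s
# With only 20% of stations dumping their buffers at any one time:
print(f"{LINK_BPS / (0.2 * STATIONS) / 1e9:.2f} Gb/s per station")  # ~1 Gb/s
```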


Integrated beam and channel to MCCS
Integrated channel data is required for computing frequency channel scaling factors used to flatten the bandpasses, while integrated beam data is required for diagnostic purposes. The integration time for both data modes is user-defined. A 1 s integration is assumed here, resulting in the data rates below:

• Integrated channel data: 256 (Antennas) * 2 (Polarisations) * 512 (Channels) * 16 (Sample size) ≈ 4 Mb/s
• Integrated beam: 2 (Polarisations) * 384 (Channels) * 16 (Sample size) ≈ 12 kb/s

Note that the integration time assumed here is probably much higher than what would be used during a scan; moreover, this data will be transmitted in burst mode by the TPMs, such that the instantaneous bandwidth requirement will be higher. These values result in ~66 packets per second.
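The two integrated-data figures follow directly from the quantities above; a minimal check, assuming the 1 s integration stated in this section:

```python
# Integrated data rates for a 1 s integration (values from this section).
integrated_channel = 256 * 2 * 512 * 16  # antennas x pols x channels x bits
integrated_beam = 2 * 384 * 16           # pols x channels x bits

print(f"integrated channel: {integrated_channel / 1e6:.1f} Mb/s")  # ~4 Mb/s
print(f"integrated beam:    {integrated_beam / 1e3:.1f} kb/s")     # ~12 kb/s
```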

Antenna transient data to MCCS
TBD3

Calibration and pointing data rate to MCCS
The calibration process results in a 2x2 complex coefficient matrix per antenna, polarisation and channel. Each complex coefficient is represented by a single byte. The calibration process will be performed for one frequency channel at a time, such that only the coefficients of a single channel need to be transmitted every second, resulting in the data rate below:

256 (Antennas) * 4 (XX, XY, YX, YY) * 8 (bits) = 8 kb/s

If the firmware does not allow applying calibration coefficients one frequency channel at a time, and the coefficients for all frequency channels need to be transmitted (384 frequency channels), then the resulting worst-case bandwidth required is 8 * 384 ≈ 3 Mb/s.

The overhead for UCP packets is 72 bytes (Ethernet, IP, UDP and UCP headers), roughly 6% of a packet containing a UCP payload of 1200 bytes, so overhead is small. This results in ≈336 packets when rounding up to the number of TPMs in a station.

MCCS computes delays and delay rates for each beam (up to 8) and each antenna. Using a linear approximation for the delay, with tracking up to 4 times the sidereal rate, the delay/delay rate must be specified every 15 s; however, here it is assumed that the delays will be transmitted in burst mode to the TPMs. The data rate per station is given by:

256 (antennas) * 8 (beams) * 32 (bits/coefficient) ≈ 65.5 kb/s

This translates to just one packet per TPM.
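The calibration-coefficient and pointing-delay rates above can be checked in the same way; a minimal sketch using only values stated in this section:

```python
# Calibration and pointing data rates (values from this section).
cal_one_channel = 256 * 4 * 8             # antennas x (XX,XY,YX,YY) x bits = 8 kb/s
cal_all_channels = cal_one_channel * 384  # worst case, all channels ≈ 3.1 Mb/s
pointing = 256 * 8 * 32                   # antennas x beams x bits ≈ 65.5 kb/s

print(cal_one_channel, cal_all_channels, pointing)  # 8192, 3145728, 65536 b/s
```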

Monitoring and Control Traffic
The monitoring and control traffic between all SPS hardware components contributing to a station (TPMs, network switches, rack management boards, cabinet hardware, etc.) depends on the number of monitoring points per hardware component. The following is assumed:

• The TPM has the largest number of monitoring points
• 100 monitoring points, each with a duty cycle of 1 second
• Each monitoring point needs a separate UCP packet

Request and reply packets are 68 bytes and 72 bytes respectively. For 100 monitoring points and 16 TPMs, the resulting data rate is approximately 115 kb/s in each direction. It can therefore be assumed that LMC traffic is negligible and does not contribute much to the required bandwidth. Note, however, that the number of packets required here is 16 * 100 = 1600 in each direction.


4.1.2 Minimum compute requirements per station

There are four primary compute-intensive processes which must be instantiated and run per station: the data acquisition, correlation, calibration and transient buffer processes. Additional processes, such as pointing and computing scaling factors for bandpass flattening, are not regarded as compute intensive since they will run at a lower cadence and only require a CPU core for less than a second each time. LMC functionality also requires compute resources per device instance, which can amount to a significant fraction when many devices are required. A breakdown of the resource requirements is provided hereunder. The percentage core utilisation is computed by extrapolating benchmark tests (for DAQ and correlation) or through a best-guess estimate (for the rest).

Data acquisition process
The data acquisition process is responsible for receiving the calibration spigots and integrated data from TPMs. The total data rate which must be processed is ~7.6 Gb/s, largely dominated by the calibration spigots. Based on the prototype implementation and performance benchmarks (see [RD32]), whose implementation can be improved to limit the amount of compute resources required to filter packets at the kernel level, the following compute resources are required:

Table 4-2 Core Utilization Estimates

Thread | Description | % Core Utilization
Packet receiver | Receives and filters packets, placing UDP payloads in ring buffers | 60%
Calibration spigots consumer | Processes the calibration spigots and places data in a buffering system for the correlator to process | 100%
Integrated channel consumer | Processes integrated channel data packets and writes them to storage | 10%
Integrated beam consumer | Processes integrated beam data packets and writes them to storage | 10%

Correlator process
The correlator process waits for the calibration spigot consumer in the data acquisition process to fill a buffer, then copies this buffer to the allocated GPU's memory, runs the correlation kernel, copies the result back to system memory and writes the correlation matrix to storage. Based on the prototype implementation and performance benchmarks (see [RD32]), a single CPU thread is required for copying data into and out of GPU memory, while several GPU kernels are required to perform the correlation. The GPU kernels can perform the computation in about 25% of real time, whilst the CPU thread makes minimal use of a CPU core. See [RD32] for the specifications of the test hardware.
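The benchmarked prototype correlator is described in [RD32]; purely as an illustration of the mathematics of the correlation step, the sketch below forms the visibility matrix for one buffered channel using numpy. The array shape and dtype are illustrative assumptions, not the prototype's interface, and in MCCS this product runs as a GPU kernel.

```python
import numpy as np

def correlate_channel(voltages: np.ndarray) -> np.ndarray:
    """Correlate one channel buffer.

    voltages: complex array of shape (inputs, samples), where inputs is
              antennas x polarisations (512 for one station).
    Returns the time-averaged (inputs x inputs) visibility matrix.
    """
    n_samples = voltages.shape[1]
    return voltages @ voltages.conj().T / n_samples

# Illustrative call: 512 signal inputs, 1024 time samples of one channel.
rng = np.random.default_rng(0)
x = rng.standard_normal((512, 1024)) + 1j * rng.standard_normal((512, 1024))
vis = correlate_channel(x.astype(np.complex64))  # shape (512, 512)
```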

Transient buffer process
The transient buffer process is a special instance of the DAQ, since its main role is to receive the transient packets and place them in a circular buffer. Apart from a packet receiver, as specified above (with lower resource requirements since fewer packets are processed), an additional thread is required for packet manipulation. A worst-case scenario is assumed, where 100% core utilization is assigned to this thread.

Calibration process
The calibration process is logically split into two functions: the generation of the model sky visibilities, and the calibration itself. Based on the prototype implementation and performance benchmarks (see [RD32]), and assuming the same tool is used for model visibility generation, 100% CPU and 10% GPU can be assigned to it. The requirements for the calibration algorithm itself are excluded, such that an upper limit to the amount of CPU and GPU compute can be defined once the requirements of all the other processes are factored in.

LMC functionality
As discussed in [RD31], TANGO devices which are associated with a station will be hosted on the server which processes that station. For each station, the following devices are required: one Station device, grouping 16 Tile devices, each grouping 16 Antenna devices; a Pointing device for each beam (maximum 8); and one instance each of the Calibration, DAQ and Transient Buffer devices. Rounded up, this amounts to about 300 TANGO devices. If each device performs an operation every 100 ms, requiring 1 ms of processing time, then the total core requirement approximates to 300% core utilization. However, this assumes that the core is performing work during this time, which is an incorrect assumption for most cases, where a request packet is transmitted to the device and the TANGO device sleeps until a reply is received. In these cases the core can be utilised for other purposes. So, assuming that 50% of the time the thread is asleep waiting for a reply (which is an overestimate), the total core requirement becomes 150%.
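The 150% figure can be reproduced with the back-of-envelope calculation below; the device breakdown is taken from this section, with the total rounded up to 300 as in the text.

```python
# Back-of-envelope TANGO device load for one station (figures from this section).
devices = 1 + 16 + 16 * 16 + 8 + 3  # Station + Tiles + Antennas + Pointing + others = 284
devices = 300                        # rounded up, as in the text

busy_fraction = 1e-3 / 100e-3        # 1 ms of work every 100 ms
cores = devices * busy_fraction      # 3.0 cores, i.e. 300% utilisation
cores_awake = cores * 0.5            # 50% of the time spent awaiting replies
print(f"{cores_awake * 100:.0f}% core utilisation")  # 150%
```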

The compute requirements above are summarised in Table 4-3.

Table 4-3. Compute requirements for one station

Function | % CPU Core Utilization | % GPU Utilization
Data Acquisition | 200% | 0%
Correlator | 20% | 25%
Transient Buffer | 150% | 0%
Calibration | TBD4 | 10%
LMC functionality | 150% | 0%
Total | ~520% | 35%

4.1.3 Minimum memory requirements per station

Table 4-4 summarises the amount of memory and the memory bandwidth required for one station. The memory bandwidth is generally directly related to the incoming data bandwidth. The memory requirements for the DAQ are based on the prototype implementation and performance benchmarks (see [RD32]), whilst those for the transient buffer are based on the data rate and the number of seconds to buffer (1.5 Gb/s * 900 s).

Table 4-4. Memory requirements for one station

Thread | Description | Memory Utilization | Memory Bandwidth
DAQ | All data acquisition, including correlation and integrated data | 10 GB | 16 Gb/s
Transient buffer | Ingesting the transient buffer and placing data into a circular buffer | 160 GB | 3 Gb/s
Calibration, model sky generation | Model sky visibility generation for input to the calibration algorithm | TBD5 | TBD6
Calibration | Running the calibration algorithm | TBD7 | TBD8

4.1.4 Number of compute servers required

Based on the requirement analysis presented in the previous sections we can define the size of a single compute node, which depends on the number of stations which need to be processed on a single node. This is shown in Table 4-5. Based on this, and additional factors which will be described shortly, the minimum size of the MCCS cluster can be defined.

Table 4-5. Minimum resource requirements for increasing number of stations

# Stations | CPU Compute | GPU Compute | Network Bandwidth | Memory | Memory Bandwidth
1 | 550% | 35% | 9 Gb/s | 170 GB | 10 Gb/s
2 | 1100% | 70% | 18 Gb/s | 340 GB | 20 Gb/s
4 | 2200% | 140% | 36 Gb/s | 680 GB | 40 Gb/s
6 | 3300% | 210% | 54 Gb/s | 1020 GB | 60 Gb/s
8 | 4400% | 280% | 72 Gb/s | 1360 GB | 80 Gb/s

A redundancy factor must be added to the resource requirements in Table 4-5 to cater for scheduling and context-switch overhead and for running system software, including the operating system. With this factor, and using the specifications defined at the beginning of Section 4.1, the minimal server configuration, assuming each server is responsible for 8 stations, is listed in Table 4-6.

Table 4-6. MCCS compute server configuration

Item | Quantity | Minimum Specification
Chassis | 1 | 1U, min 2x SATA, dual 1 Gb Ethernet, 2 kW redundant power supply, NVLink support
CPU | 2 | 20 cores, 2 GHz minimum
GPU | 4 | NVIDIA P100 with NVLink, or equivalent
RAM | 12 | 128 GB 2666 MHz DDR4
1 Gb interfaces | 1 | On chassis
100 Gb interfaces | 1 | Mellanox 100 Gb ConnectX-5 with 1 QSFP, or equivalent
SSDs | 2 | 1 TB 2.5" SATA 6.0 Gb/s

With the configuration above, the following resources will be left available for calibration (per station):

• 3.5 cores equivalent per calibration instance (350%)
• 10% equivalent of GPU compute

Therefore, the total number of servers required for processing is 64, resulting in 16 servers per rack. Two additional servers are required, the master and shadow-master servers. The configuration for these servers is defined in Table 4-7.
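The cluster sizing can be summarised numerically as below; the 512-station total is an assumption (64 servers times 8 stations each), consistent with the figures in this section and Section 4.3.

```python
# MCCS cluster sizing (figures from this section and Section 4.3).
STATIONS_PER_SERVER = 8
RACKS = 4

processing_servers = 512 // STATIONS_PER_SERVER  # 64 processing servers
per_rack = processing_servers // RACKS           # 16 per rack
spares = RACKS                                   # one spare per rack (Section 4.3)
masters = 2                                      # master + shadow master
total_servers = processing_servers + spares + masters  # 70 servers in all
```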


Table 4-7. Master server configuration

Item | Quantity | Minimum Specification
Chassis | 1 | 1U, min 2x SATA, dual 1 Gb Ethernet, 1 kW redundant power supply
CPU | 2 | 6 cores, 1.8 GHz minimum
RAM | 12 | 16 GB 2666 MHz DDR4, or equivalent
1 Gb interfaces | 2 | On chassis
SSDs | 2 | 1 TB 2.5" SATA 6.0 Gb/s

With these numbers the network infrastructure of MCCS can now be defined.

4.2 MCCS Network

Consistent with the overall LFAA data network design shown in [AD9] (Figures 10-5 and 10-6), the LFAA network utilises multilayer Ethernet switching and connectivity to facilitate data flows primarily within and between the SPS and MCCS sub-Elements. This is primarily based on commercially available equipment such as 19" rackmount switches. Although the LFAA network is multilayer, the SPS network operates at Layer 2 only, as data flows are in parallel for different stations with little or no need for cross-connections or redundant communication paths. SPS data connectivity will be grouped into private class C subnets of 8 stations (within 4 cabinets). This corresponds to on the order of 150 IP addresses within the 254 available for the subnet. The management nodes within APIUs of the corresponding Field Nodes will be in the same subnet and represent another 32 addresses. The partitioning into subnets ensures that broadcast storms are contained. Since the SPS network is Layer 2, there should be no active loops in its topology. To ensure this, Rapid Spanning Tree should be enabled across the network switches.

The external network links connected to MCCS are shown in Figure 4-5. Communication with SPS goes through a single 100 Gb link between each SPS cabinet in the RPF and groups of two SPS cabinets in the CPF, totalling 110 100 Gb links. Communication with TM goes through a 1 Gb link, of which there are two for redundancy. The transient buffer is transmitted to SDP via a 100 Gb link provided by SaDT.

Figure 4-5. Network links between MCCS and external entities


These connections need to be distributed across the four racks which host MCCS. As stated in Section 4.1.4, all compute-intensive operations will be performed on 64 compute servers, each dealing with eight stations. Core SPS cabinets have one 100 Gbps link per two cabinets to MCCS, RPFs within 25 km have one 100 Gbps link to MCCS, and RPFs farther than 25 km use DWDM through a muxponder, multiplexed to 100 Gbps to MCCS. Additionally, each 100G network switch should be capable of reaching all other network switches within MCCS in the least number of hops, for the following reasons:

• The fast transient buffer needs to be transmitted to SDP from potentially all stations, therefore all compute servers will take part in this
• In the case where a server or network switch goes offline (switches to a FAULTY state), it should be possible to re-route data with minimal added latency
• All servers will host several TANGO devices, and these should be accessible with minimal latency (up to TM), especially when alarms are triggered

Apart from the 100G network, a separate 1G network local to MCCS is suggested:
• Hardware components in the MCCS racks need to be monitored and controlled. Network switches, power supplies and other monitorable cabinet hardware generally expose this through a dedicated 1G maintenance port
• If a 100G link, port or switch goes offline, resulting in a server not being reachable over the 100G network, the server would still be reachable through the 1G network

Figure 4-6. MCCS rack-level network diagram


4.2.1 Network Diagram

Based on the discussion above, Figure 4-6 presents the network diagram for a single MCCS rack. Compute servers are grouped into 4 groups of 4, each connected to a separate 32-port 100Gb network switch. Each 100Gb network switch ingests 16 SPS links. A single 32-port 1Gb network switch is required to interconnect all hardware devices within an MCCS rack, with enough free ports for creating a full 1G mesh with the rest of the racks. Links to TM and SDP are also shown; however, these are not present in all racks. The TM links are connected to the 1G switch in the central two racks, whilst the SDP links can be connected to any of the racks. The head/shadow nodes are also located in the central two racks, each requiring two 1Gb links for redundancy. Note that in the diagram, a link without a multiplicity label denotes a single link.

The 1G switches are fully interconnected with each other across racks, such that only one hop is required for one component to reach any other component within MCCS. The top 100 Gb switch in a rack is connected to the bottom switch in the same rack as well as to all other top switches in the other MCCS racks. The bottom switches are in turn connected to all other bottom switches.

4.2.2 Network Configuration

MCCS is composed of two networks: the 100Gb network and the 1Gb network. The 1G network is internal to MCCS, whilst the 100Gb network is connected to SPS. The 1Gb network can be configured to be on a VLAN requiring fewer than 256 addresses, such that a private class C subnet should suffice. The MCCS servers connected to SPS can be assigned an IP on the same VLAN as that of the associated station.

IP addresses will be assigned through DHCP; however, it is preferable that compute servers and SPS components always get the same IP address, one which will always be associated with a hostname or MAC address (such that even if servers are changed, for example due to malfunction, the new server will still obtain the same IP address and hostname). Three schemes can be employed:

a. A static map between MAC address and IP address is kept in the DHCP server. This requires that a MAC address is manually (or through an automated process) added to the DHCP server.

b. IP addresses are assigned depending on which port on the switch a host is connected to. This requires that a DHCP server is hosted on every switch, and that they are configured in such a way that no conflicts arise

c. Use third party network management software to automatically enable switch ports and assign IPs

If MAC addresses will not change often (apart from during deployment and replacements), options a and c are feasible and require a less complex configuration. The DHCP server can either be hosted on a switch or on the head node. Hosting the DHCP server on one of the switches creates a central point of failure, since if it goes offline then another switch must be configured. On the other hand, if the master node goes offline then the shadow master will automatically take over (cluster management software provides this flexibility; see Section 5). It is therefore suggested that the DHCP server be hosted on the master node, providing a different pool of IP addresses (subnets) for each required network. All switches must be configured to forward DHCP requests (ports 67 and 68) to the master and shadow master nodes (the latter is included to avoid re-configuration when the master node goes offline).
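As an illustration of option (a), the sketch below generates ISC-dhcpd-style static host reservations from a MAC-to-IP table. All hostnames, MAC and IP values are hypothetical placeholders, and the actual reservation mechanism depends on the DHCP server chosen.

```python
# Hypothetical static MAC -> IP reservations for option (a), emitted as
# ISC dhcpd 'host' stanzas. All names and addresses are placeholders.
RESERVATIONS = {
    "mccs-compute-01": ("ec:0d:9a:00:00:01", "10.0.1.11"),
    "mccs-compute-02": ("ec:0d:9a:00:00:02", "10.0.1.12"),
}

def host_stanza(name: str, mac: str, ip: str) -> str:
    """Render one static reservation in dhcpd.conf syntax."""
    return (f"host {name} {{\n"
            f"    hardware ethernet {mac};\n"
            f"    fixed-address {ip};\n"
            f"}}\n")

for name, (mac, ip) in RESERVATIONS.items():
    print(host_stanza(name, mac, ip))
```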


Additional network configuration which must be performed includes (the server-side settings are sketched below):
• Setting the MTU on all 100Gb switch ports and 100Gb server interfaces to 9000 (enabling jumbo frames)
• Maximising the size of the RX and TX circular buffers on 100Gb server interfaces
• Enabling RSTP on SPS switches
• Enabling OSPFv2 on all MCCS switches for routing between subnets
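The per-server settings could be applied at provisioning time with a short script along these lines; the interface name is a placeholder and the achievable ring sizes are NIC-dependent (ethtool -g reports the hardware maxima).

```python
# Sketch of per-server 100 Gb interface tuning, run at provisioning time.
# "enp1s0" is a placeholder interface name; ring sizes depend on the NIC.
import subprocess

IFACE = "enp1s0"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["ip", "link", "set", "dev", IFACE, "mtu", "9000"])    # jumbo frames
run(["ethtool", "-G", IFACE, "rx", "8192", "tx", "8192"])  # enlarge RX/TX rings
```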

4.2.3 Security

Although the principal security of the Low Telescope is not the responsibility of LFAA, the LFAA network configuration will include reasonable measures to ensure the integrity of MCCS and LFAA whilst not restricting the functionality of the Telescope for the user. The aim is to provide protection from unauthorised users enumerating and gaining control of the network, which might compromise the subsystem. This is particularly important for the network components, as it is often difficult to detect whether they have been compromised. A list of security measures is provided in the SPS DDD [RD38], and these are also applicable to the MCCS network.

4.3 MCCS Assembly

Following from the previous sections, each cabinet requires 17 compute servers (including a spare), two 100 Gb switches and one 1 Gb switch, with a master/shadow node in two of the racks. The cabinet design is presented in Figure 4-7. An MCCS cabinet contains:

• 16 compute servers and one spare compute server
• 2 100 Gb switches
• 1 1Gb switch
• For two of the racks, an additional server acting as a master/shadow node
• For the racks containing the master/shadow node, a UPS

The head/shadow node, and the 1Gb switch connecting it to TM, are connected to the UPS, such that if a power failure arises MCCS can inform TM and perform emergency shutdown operations, ensuring that the system will be capable of going back online when power is restored. Since the head/shadow servers are low-power servers (compared to the compute servers), a standard rack UPS should be able to provide enough uptime for the head/shadow node to perform these operations.

The estimated power budget for the rack configuration is provided in Table 7-8 (Rack power budget). Note that this does not show the maximum TDP per component, but the estimated average power consumption for each.


Figure 4-7. MCCS rack assembly


5 Software

This section provides detailed examples of how cluster configuration, storage configuration and monitoring can be performed in MCCS. This is done by using specific software technologies which, at the time of writing, presented a good match to the MCCS requirements. However, the decision of which technologies will be used is postponed to after CDR. This provides enough time to discuss possible SKA-wide harmonisation of technologies, such that the maintenance, update and expertise costs can be minimised across the project.

Automating the management of a compute cluster is a complex task, requiring several software tools and configurations. The primary operations which need to be performed on the MCCS cluster include:

• Remote power management of compute servers
• Provisioning compute nodes, that is, loading an operating system such that applications and services can run on the servers
• Monitoring server metrics, including load, temperature, disk space, and so on
• Managing networking across MCCS (and SPS)
• Creating and managing distributed storage across the entire cluster
• Deploying applications and services on specific compute nodes. Each node will have multiple instances of the same applications running, and applications need to communicate with each other. Additionally, during updates, different application versions might be running at the same time. This operation therefore also needs to cater for:
  o Isolation of running applications (containment)
  o Managing communication and interaction between running applications (orchestration)
  o Automated handling of failed or crashed deploys
• Mirroring the operations performed by the master node so that in case of failure the shadow node takes over

Figure 5-8 presents a high-level visual description of the above operations. In the figure all slave nodes are enclosed in a single rectangle; however, slave nodes can be partitioned across multiple rectangles, forming separate logical clusters, which is useful for mirroring their partitioning into racks, for example.

Figure 5-8. Cluster management overview


5.1 System Software

System software is computer software designed to provide a platform for other software. This section provides a non-exhaustive list of the system software required for operating MCCS (and, in some cases, SPS). Note that application-specific system software, that is, software required to run user applications, such as compilers and system libraries, is excluded from this list.

Operating system
The operating system is an essential part of a server since it manages the computer hardware and software resources and provides common services to computer applications. Appropriate drivers (such as for the 100 Gb network cards and GPUs) are also required. This document assumes the use of Ubuntu [RD40] Long Term Support (LTS); the specific version will depend on development and deployment timescales and will change throughout the lifetime of the project.

Dynamic Host Configuration Protocol (DHCP) and Domain Name Service (DNS)
Network configuration is the responsibility of MCCS. A DHCP server is required to provide IP addresses to all the networked hardware components in MCCS and SPS. The configuration of this server is discussed in Section 4.2.2. Additionally, MCCS might need access to the outside world, through appropriate proxies, such that a DNS server is required to convert domain names to appropriate addresses which can then be reached through routers outside of MCCS. If this is required, appropriate NAT functionality and firewalls must be in place.

Network Time Protocol (NTP)
Timing is a crucial part of LFAA since observational data needs to be time-tagged to a high accuracy. Accurate time synchronisation across the telescope is also useful for logging and reporting. It is envisaged that TM (or SaDT) will provide a network time service which is distributed across all the elements in the SKA. Locally, MCCS will host an NTP server which internal components can use to update their system time.

Provisioning software
Setting up each compute node manually is time-consuming, error-prone and difficult to maintain. A centralised provisioning system automates this process by having an OS image which is sent to the compute nodes, which boot over the network. This guarantees that all nodes have the same operating system and system configuration (although this can be changed to reflect different hardware or cluster configurations). Metal-As-A-Service [RD41] (MAAS) is used as an example of provisioning software in this document, described in [RD30].

Distributed storage management software
Storage could be centralised on the master node or a dedicated storage server; however, this would result in a single point of failure for storage. In a centralised scheme, the network must be capable of handling all I/O requests from all the compute servers, such that these servers would likely need to be connected to the high-speed 100 Gb network. Alternatively, distributed storage can be used, where compute servers include one or more disk drives which form part of the distributed storage space. Standard distributed storage managers can then manage this storage space, which can be configured for redundancy and replication. This distributes the network, disk and CPU load required for handling I/O requests. GlusterFS [RD42] is used as an example of a distributed storage manager in this document, described briefly in Section 5.4.

Monitoring system software
Provisioning software provides a limited number of metrics on the cluster; however, if more detailed metrics are required, such as CPU temperature and power consumption of GPUs, then a monitoring system is required. Ganglia [RD39] is used as an example of a metric monitoring system in this document, discussed briefly in Section 5.5.

5.2 Hardware Provisioning

Metal-As-A-Service is hardware provisioning software from Canonical intended to quickly commission and deploy physical servers to run a wide array of software services or workloads via Juju charms. Servers can be dynamically associated or connected to scale up services, and can also be disconnected to scale down as demand requires. MAAS treats physical servers as compute commodities that can be quickly manipulated, similar to how a cloud environment creates and removes virtual resources to adjust to computing demands. A MAAS deployment consists of one Region Controller, one or more Cluster Controllers, and as many physical servers as required. Figure 5-9 (borrowed from http://maas.ubuntu.com) depicts a MAAS deployment with a Region Controller, two Cluster Controllers and 6 servers (nodes) per cluster. A Region Controller consists of a web user interface, an API, the metadata server and an optional DNS server. A Cluster Controller oversees provisioning and consists of a TFTP server and an optional DHCP server. It is also responsible for powering servers on and off.

Figure 5-9. MAAS deployment with a Regional Controller and two Cluster Controllers


The MAAS Region Controller must be installed on the master node and mirrored on the shadow node. Since MCCS must provide DHCP service to MCCS and SPS, the MAAS installation would use the DHCP and DNS servers running directly on the master node. Compute nodes must:

- Be equipped with an out-of-band management controller (such as an iDRAC or a BMC), as they will be powered on and off via IPMI commands, so IPMI over LAN must be enabled
- Be set to boot from the network

5.2.1 Adding a compute server to MAAS

The enlistment process starts when a server is manually powered on (by pressing the power button) and registers itself with MAAS. When the server boots, it obtains an IP address and then PXE boots from the Cluster Controller. The server then loads an ephemeral image that runs and performs an initial discovery process. The discovery process obtains basic information such as network interfaces, MAC addresses and the machine’s architecture. Once this information is gathered, a request to register the machine is made to the MAAS Region Controller. When done, the server will be listed in MAAS with a ‘Declared’ state.

Once a server is in the ‘Declared’ state, it must be accepted into MAAS for the commissioning processes to begin and for it to be ready for deployment. The commissioning process is where MAAS collects hardware information such as the number of CPU cores, memory, disk size, etc. which can be later used as constraints. Once a node is commissioned, it will PXE boot from the MAAS Cluster Controller and will be instructed to run the same ephemeral image. This time, however, the commissioning process will be instructed to gather more information about the server, which will be sent back to the MAAS region controller. Once this process has finished, the server information will be updated, and its state will change to ‘Ready’, which means it’s ready for deployment. When a server has been deployed, its state will change to ‘Allocated to’. This state means that the server is in use by the user who requested its deployment.

5.2.2 MCCS node management

With MAAS, the complexity of adding, removing and maintaining nodes is greatly reduced:

• Adding a node is simply a question of attaching the node to a rack, connecting it up and switching it on. Some BIOS settings need to be adjusted, such as enabling boot from LAN. Once switched on, MAAS will take over and the node will eventually be added automatically to the list of nodes, ready for use. Note that additional configuration is required in the LMC software configuration to properly map the node to the rest of the software infrastructure (such as associating it with stations).
• A node can be removed (for replacement or maintenance) by simply disabling it in the MAAS configuration. This will mark the node as unavailable.
• BIOS updates can also be performed. Once a node is added, it can be logged into (through SSH) and a provided BIOS update utility can be executed.

5.3 Software Orchestration

Software applications and services need to run on compute nodes (and master nodes). Most of these require unique configurations as well as system and third-party libraries to be installed. On an operational system, especially one meant to run for a long time (such as SKA), software is routinely updated, packages can change, library and software versions will change, and different versions of the same library might be required for different applications. Management of such a system is highly complex and prone to misconfiguration, dependency problems and conflicts. Additionally, the software environments during development, testing and on production servers will differ.

Containers are a solution to the problem of how to get software to run reliably when moved from one computing environment to another. They consist of an entire runtime environment: an application with all its dependencies, libraries, and configuration files needed to run, bundled into one package. By containerizing the application platform and its dependencies, differences in OS distributions and underlying infrastructures are abstracted away.

Containers are different from virtual machines. With virtualisation technology, the package that can be passed around is a virtual machine, and it includes an entire operating system as well as the application. A physical server running three virtual machines would have a hypervisor and three separate operating systems running on top of it. By contrast, a server running three containerised applications runs a single operating system, and each container shares the operating system kernel with the other containers. Shared parts of the operating system are read-only, while each container has its own filesystem for writing. This means that containers are much more lightweight and use far fewer resources than virtual machines. Containment is shown in Figure 5-10.

Figure 5-10. Software containment

Docker [RD43] is, at the time of writing, the industry standard for containers. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. Container images become containers at runtime; in the case of Docker, images become containers when they run on Docker Engine. Containers isolate software from its environment and ensure that it works uniformly despite differences, for instance, between development and staging. Docker containers that run on Docker Engine are:

• Standard: Docker created the industry standard for containers, so they can be portable anywhere


• Lightweight: Containers share the machine's OS kernel and therefore do not require an OS per application, driving higher server efficiencies and reducing server and licensing costs
• Secure: Applications are safer in containers, and Docker provides the strongest default isolation capabilities in the industry

Different to traditional cloud-based services and applications, MCCS software requires direct access to some hardware components, mainly GPUs (for correlation and calibration) and network interfaces (for data acquisition). Docker supports NVIDIA GPUs through an extension developed by NVIDIA; however, direct network card access might be more problematic. This issue has not been investigated yet, and it is still uncertain whether Docker can provide this functionality or whether a more HPC-tailored containment technology, such as Singularity [RD44], must be used.

Individual containers can be defined for specific software applications; however, at any point in time several containers will be running on MCCS, depending on the currently configured scans and required services. When a compute node is switched on (with no scans running), the following software applications and services need to be deployed:

• GlusterFS for distributed storage
• An NTP client, to get the time from the NTP server
• Ganglia for metric monitoring
• All TANGO devices associated with the stations configured to run on the node. These include Antenna, APIU, Tile, Station, CMB, SRMB, Switch and other devices, totalling about 2400 devices

When a scan is configured, additional applications need to be launched: the compute processes for calibration, correlation, data acquisition, the fast transient buffer and bandpass flattening. Managing all these containers, on the fly in the case of scan-related containers, becomes very complicated. This can be automated by using an orchestration tool.
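As a sketch of how a scan-related container could be launched with GPU access, the snippet below uses the Docker SDK for Python; the image name, container name and station number are hypothetical, and (as noted above) the question of direct GPU/NIC access from containers is still open.

```python
# Sketch: launching a scan-time correlator container with GPU access via the
# Docker SDK for Python. Image and station identifiers are hypothetical.
import docker

client = docker.from_env()
container = client.containers.run(
    "mccs/correlator:latest",           # hypothetical image
    name="correlator-station-042",
    detach=True,
    network_mode="host",                # direct access to the data interfaces
    device_requests=[                   # request all NVIDIA GPUs
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]]),
    ],
    environment={"STATION_ID": "42"},
)
print(container.status)
```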

Orchestration is the automated arrangement, coordination and management of computer systems, middleware and software applications. It defines policies and service levels through automated workflows, provisioning and change management, creating an application-aligned infrastructure that can be scaled up or down based on the needs of the application. It also provides centralised management of the resource pool. For this document, Juju [RD45] will be used as an example for orchestrating containers across MCCS.

Juju is an open source modelling tool for operating software in virtualised or bare-metal environments. Juju allows for deploying, configuring, managing, maintaining and scaling applications quickly and efficiently on public clouds, physical servers, OpenStack, and containers. Juju's application modelling provides tools to express the intent of how to deploy such applications and to subsequently scale and manage them. At the lowest level, traditional configuration management tools like Chef, Puppet and Ansible, or even general scripting languages such as Python or bash, automate the configuration of machines to a particular specification. With Juju, a model of the relationships between the applications that make up the solution is created, along with a mapping of the parts of that model to machines. Juju then applies the necessary configuration management scripts to each machine in the model.

Application-specific knowledge such as dependencies, scale-out practices, operational events like backups and updates, and integration options with other pieces of software are encapsulated in Juju's 'charms'. This knowledge can then be shared between team members, reused everywhere from laptops to virtual machines and cloud, and shared with other organisations.

A bundle is a collection of charms and their relationships, designed to provide an entire working deployment in one collection. To describe a solution composed of many applications and services, a set of charms and their relationships (what is connected to what) needs to be defined, and this setup is encapsulated as a bundle. This allows teams to share not only the core primitive for each application, but also higher-level models of several applications, and allows for replication of complex application models.

The deployment of containers on compute nodes can be split into two types of bundles:

• A bundle representing all the TANGO devices for a single station
• A bundle representing all the compute processes for a single station

These two bundles would contain the following containers (per station), resulting in ~70 containers per compute node:

• 256 antenna devices in one container
• Station device
• Sub-station device
• Transient buffer device
• Transient buffer process
• DAQ and correlator in one container (they will probably share an implementation)
• Calibration-related activities (they will probably share an implementation)
• Others for additional processes

5.4 Storage Management

GlusterFS is an open source, distributed file system capable of scaling to several exabytes and handling thousands of clients. GlusterFS clusters together storage building blocks over a networked interconnect, aggregating disk and memory resources and managing data in a single global namespace. It is based on a stackable user-space design and can deliver exceptional performance for diverse workloads. GlusterFS is designed to meet the following enterprise-level requirements:

• Can be deployed on commodity hardware servers
• No dedicated metadata server is required (so no single point of failure)
• Any number of servers can access storage that can be scaled up to several exabytes
• Can be aggregated on top of existing file systems (file recovery can be performed without GlusterFS)
• Not tightly coupled with the OS kernel, so system updates have no effect on it

A GlusterFS volume is a collection of bricks, and most GlusterFS operations happen on the volume. GlusterFS supports different types of volumes based on user requirements: some are good for scaling storage size, some for improving performance and some for both. For MCCS, where high availability and reliability are required and data files are not large (or rather, instantaneous I/O is not high), a distributed replicated volume is suggested. In this volume configuration, files are distributed across replicated sets of bricks, as shown in Figure 5-11. The number of bricks must be a multiple of the replica count, and the order in which the bricks are specified matters, since adjacent bricks become replicas of each other. This type of volume is used when high availability of data (through redundancy) and scalable storage are required.

Figure 5-11. GlusterFS distributed replicated volume

As a software component of the LFAA LMC system, GlusterFS will manage one logical cluster-wide partition to store raw data files and logs. Each node will have a dedicated partition which will make up part of this logical volume (volume creation is sketched after the list below). GlusterFS will automatically:

• Handle failures in case of drive or node failure
• Use the fastest available cluster interconnect (in this case, the 100GbE network) to transfer data between nodes
• Recover lost data (depending on the type of volume deployed)
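Creating the distributed replicated LMC volume could look like the sketch below, which drives the GlusterFS CLI from Python; the host names, brick paths and replica count are hypothetical design parameters.

```python
# Sketch: creating and starting a distributed replicated GlusterFS volume for
# LMC data. Host names and brick paths are hypothetical placeholders.
import subprocess

BRICKS = [f"mccs-compute-{i:02d}:/data/glusterfs/brick1" for i in range(1, 5)]

# With replica 2, adjacent bricks in this list become replicas of each other.
subprocess.run(
    ["gluster", "volume", "create", "mccs-lmc", "replica", "2", *BRICKS],
    check=True,
)
subprocess.run(["gluster", "volume", "start", "mccs-lmc"], check=True)
```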

5.5 Metric Monitoring

Several metrics need to be monitored on each compute node, including CPU load, RAM usage, GPU load, multiple temperature sensors, power consumption and others. Several compute-node monitoring packages exist which could be used in MCCS, and the decision of which one to use does not affect the architecture and design of the MCCS software system.


6 Design Decisions

This section documents some of the details and considerations behind the major design decisions made for MCCS. Only the major decisions are listed here. As much as possible, decisions have also been included in the text of the main document where a topic is discussed. A limited amount of discussion is provided here for each decision, focussing only on the main reasons.

6.1 Why use GPUs and not perform everything on CPUs?

Calibration is the main driver for sizing the MCCS compute cluster. Calibration requires that antennas within a station are correlated, generating correlation matrices and visibilities, which are then calibrated against model visibilities generated using a local sky model and an interferometer simulator. For correlation alone, about 800 GFLOPS worth of computation per station is required, which equates to about 6 CPU cores, requiring about 3,000 cores for all stations. A single GPU can correlate at least three stations. Therefore, using GPUs reduces the size of the cluster (number of servers, racks and supporting equipment) as well as the power required to run it.
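The sizing argument can be made concrete as below; the 512-station count is an assumption (64 servers times 8 stations, per Section 4.1.4), while the per-station figures are from this section.

```python
# Sizing estimate behind the CPU-vs-GPU decision (figures from this section).
STATIONS = 512                 # assumed full LFAA station count
GFLOPS_PER_STATION = 800       # correlation cost per station
CPU_CORES_PER_STATION = 6      # equivalent CPU cores for that load

cpu_cores_total = STATIONS * CPU_CORES_PER_STATION  # ~3000 cores on CPU alone
gpus_total = STATIONS / 3      # one GPU correlates at least 3 stations: ~171 GPUs
```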

6.2 Why have a redundant server per cabinet?

The MCCS cluster must be available TBD13 % of the time. The following lists cases where a compute server might become, or be placed, offline and a redundant server takes its place:

It becomes faulty It is taken down for hardware maintenance or upgrades A system or other update is being performed which make it unusable for a short period of

timeIf the transition is abrupt, that is without warning (such as loss of power), then all TANGO devices, processes and services will have to be re-launched on the redundant server. If there is a warning then TANGO devices and services can be moved to the redundant server, however running processes would have to be re-launched. There will be a temporary disruption of running scans in both cases. Once the offline server is brought back online it will become the redundant server, since this avoid re-configuration and loss of availability.

6.3 Why have a separate master and shadow node?

The purpose of the master node is to run all central services and TANGO devices. These include:

• The central services for cluster and storage management
• The software and hardware configuration databases
• The high-level TANGO devices, including LFAA Master, Alarm Handler, Element Logger, TelState, Local Sky Model, and others
• The Subarray devices
• The TANGO database

These services and devices are not associated with specific stations and do not need to run on compute servers. This ensures that the load across compute servers is uniform, and no processing spikes or temporary glitches arise due to unforeseen behaviour from these devices. Likewise, if some processes on a compute server result in temporary spikes or glitches, they would not affect the core LMC functionality. Additionally, only a single master node and 1G switch need to be powered on to have basic LMC functionality and communication with TM.

Having a single master node which runs the core LMC functionality and communication with TM would result in a central point of failure, which is undesirable. TANGO and cluster management software generally allow for shadowing, such that an additional server can mirror the operations being performed by the master node and take over in the circumstance where the master node goes offline. In MCCS this is a dedicated server which has the same hardware and software configuration as the master node. It is located in a separate rack which has a redundant link to TM, so if the rack hosting the master node goes offline for any reason, the shadow node can take over and use the redundant TM link.

6.4 Why partition stations across servers rather than schedule based on available resources?

Each compute server is responsible for at most eight stations. Unless a server is faulty or offline, the mapping between stations and servers is fixed, with this mapping residing in the configuration database. This greatly simplifies scan configuration, since a full resource check is not required to determine on which server a station will be processed. One disadvantage is that if the LFAA is not fully utilised the power consumption will not be optimal, since some servers might have only a subset of their allocated stations being processed (or none), and stations could otherwise be grouped to run on a reduced number of servers. However, it is envisaged that the LFAA will run at close to full capacity most of the time, making this a minor disadvantage. When a server becomes faulty the redundant server will take over, and the stations configured to run on the faulty server will run on the redundant one instead. In the case where two servers go offline, a redundant server from a different cabinet can be used.

6.5 Why use distributed rather than central storage?

MCCS needs a limited amount of storage for holding LMC files, database backends and the system itself. Storage could be centralised on the master node or a dedicated storage server; however, this would result in a single point of failure for storage. Mirroring of a storage server (or master node storage) is also possible. In a centralised scheme, the network must be capable of handling all I/O requests from all the compute servers, such that these servers would likely need to be connected to the high-speed 100 Gb network. The cheaper and more reliable alternative is to distribute storage, where compute servers include one or more disk drives which form part of the distributed storage space. Standard distributed storage managers can then manage this storage space, which can be configured for redundancy and replication. This distributes the network, disk and CPU load required for handling I/O requests.

6.6 Why have a separate 1G network?

Apart from the high-speed 100G network which interconnects the compute servers with SPS (and amongst themselves), an additional 1G network is included for internal M&C traffic between compute servers and the master/shadow nodes, as well as communication with TM. This ensures that if the data network becomes unstable (for example due to misconfiguration) the LMC system can still control all MCCS components, although it would not be able to control SPS, since M&C traffic to SPS goes through the high-speed network as well.

6.7 Why is the fast transient buffer located in MCCS?

A design and cost analysis was performed to determine the best location for the transient buffer [RD35], with the options being storing the buffer on TPMs, MCCS servers or dedicated storage servers. The analysis showed that storing the fast transient buffer on the MCCS servers results in the smallest increase in cost (since only additional RAM is required, with some increased compute). Additionally, transmitting the transient buffer to SDP from the compute servers resolves a number of issues, such as load balancing across the different entities transmitting the buffers and bandwidth control of the LFAA-SDP link. The size of the MCCS cluster did not need to change.

6.8 Why use containers for running software?

Containers allow for easy configuration and deployment of software systems across multiple nodes, such that different software versions and configurations can simply be defined in a configuration file and deployed on the target nodes. The nodes themselves would run the operating system and supporting services but would have no specific services and libraries installed which are required by the MCCS software. This also avoids library version conflicts, configuration conflicts and other issues which can arise when deploying software directly on the running system (it provides isolation from the underlying system). The limitation imposed on the selected container technology is that it should be able to provide direct access to some of the hardware, such as GPUs and network cards, otherwise there would be a severe performance penalty.


7 Performance

MCCS is composed of COTS equipment distributed across four racks. An estimate can be made for the power consumption of the system, although the actual power consumption will depend on the procured equipment and the system load. Thermal performance can only be computed once the COTS hardware is procured and tested, therefore only the list of devices contributing to the thermal environment of a rack is provided here.

7.1 Power Consumption

The estimated power budget for the rack configuration is provided in Table 7-8. Note that this does not show the maximum TDP per component, but the estimated average power consumption. The power budget for MCCS is defined in requirement LFAA_MCCS_REQ-255; however, at this point the power allocated to MCCS is still TBD14, such that compliance cannot be determined.

Table 7-8 Rack power budget

Item | # | Average Power | Max TDP | Total Average | Total TDP
Compute Server | 17 | 1.2 kW | 2 kW | 20.4 kW | 34 kW
Head/shadow node * | 1 | 0.4 kW | 1 kW | 0.4 kW | 1 kW
100 Gb switch | 4 | 0.2 kW | 0.5 kW | 0.8 kW | 2 kW
1 Gb switch | 1 | 0.15 kW | 0.2 kW | 0.15 kW | 0.2 kW
UPS * | 1 | TBD9 | TBD10 | TBD11 | TBD12
Total | | | | 21.75 kW | 37.2 kW
* Head/shadow node and UPS are only present in the centre two racks

Section 7 of [RD30] describes the power-on and power-off sequences as well as the transition to low-power mode.

7.2 Thermal Performance

The thermal performance of an MCCS rack is based on the thermal behaviour of the following components:

• Compute servers, which will generate a large percentage of the heat
• 100G switches, which will generate the second largest percentage of heat
• The UPS (in the central two racks), whose heat output depends on whether load is being demanded
• The 1G switch, which will produce a negligible amount of heat

The rack will be a closed system using a COTS air-water heat-exchange unit. The CPF will provide the required cooling water service to each rack from the floor, dimensioned to meet the total rack absorbed power ([AD8] LFAA to INFRA AUS ICD).

7.3 Monitor and Control Network Bandwidth

The network link for monitoring and control is a 1 Gb/s link. Most of the monitoring and control data has to do with reading/writing attribute data, passing command data, and maintaining a record of pointing and calibration coefficients. The bulk of the bandwidth will be taken up by these coefficients, but it is estimated that the data rate required for all of the monitoring and control data can be serviced by the dedicated network link.


A breakdown of the data rate estimates, the sources of the generated data, and the adequacy of the 1 Gb/s link is available in [RD37] Appendix C.

8 Reliability, Availability and Maintainability

MCCS is allocated an Inherent Availability (Ai) of ≥ 99.99% and a Mean Time To Repair (MTTR) of 1 hour in the SKA RAM Allocation document [AD5].

This section provides the rationale and analysis to show that MCCS meets the allocated Ai and MTTR.

8.1 Reliability, Availability and Maintainability Allocation

8.1.1 Design for Reliability

MCCS's reliability and robustness are increased using measures such as:

1. Redundant design.
2. Selection of high-reliability components (where cost-effective).
3. Design to have the fewest possible components for required functionality.
4. Design to have the fewest possible variations of equipment types and configurations.
5. Design of built-in advanced diagnostic tests that have complete hardware and software coverage (i.e. to sense all component failures).
6. Use of components that have plug-and-play modularity (and software which provides support for this) for rapid part replacement.

MCCS reliability is further enhanced by:

7. Ensuring a dust-free/clean facility cooling air supply (INSA to CSP ICD [AD8]).
8. Ensuring that facility cooling is enough to keep equipment within temperature limits.
9. Proactively replacing parts with failure rates that increase over time (parts that mechanically move), such as server cooling fans and AC-DC power supplies (which have cooling fans and failure-prone AC-input electrolytic capacitors).
10. Ensuring that facility power doesn't exceed equipment EMC limits (ESD, voltage dropouts, transients, and surges as well as harmonic distortion; see the INAU to LFAA ICD [AD8]).

Note that MCCS operates in a benign temperature-controlled ground environment specified in the INAU ICD [AD8].

8.1.2 MCCS RAMS Product Breakdown Structure

Table 8-9 below provides the relevant parts of the MCCS Sub-element Product Breakdown Structure (PBS) as they apply to RAMS. The “Level” column indicates the hierarchy of each item in the PBS. The Sub-element hierarchy consists of assemblies, sub-assemblies, and finally components (not indicated in this PBS) at the lowest level. LRUs are items that must be removed and replaced at the organizational level (O-Level or 1st level) to restore the Sub-element in case of a failure.


8.1.3 Reliability Prediction

The reliability (MTBF) and repair time (MTTR) estimates for each assembly and sub-assembly are as indicated in Table 8-9. The MTBF estimates of COTS equipment are from device datasheets (actual or similar devices) and from experience using similar products.

MTTR allocations for COTS items are engineering estimates to be confirmed during the construction phase as part of maintainability analysis.

Table 8-9 MCCS Equipment MTBF (h) and MTTR (min)

PBS Level   Name                              Product Structure #   MTBF Estimate (h)   MTTR Estimate (min)   Repairable?
1           MCCS Sub-Element (CI)             1                     -                   -                     Y
2           Cabinet Chassis Assembly          4                     -                   -                     Y
3           100G Ethernet Switch (32 ports)   8                     160000              55                    N
3           Cabinet 1G Ethernet Switch        4                     160000              55                    N
3           Cabinet PDU                       4                     1000000             30                    N
3           Compute Server                    68                    16000               60                    Y

8.1.4 Operationally Capable

Operationally Capable (OC), for the MCCS, is defined as the state in which enough processing resources are available to process ≥ 95% of the telescope's receptor data to obtain 95% of data products. MCCS is designed to have an Ai ≥ 99.99% while being Operationally Capable.

8.2 Availability and Maintainability

8.2.1 Availability

The MCCS inherent availability model is provided in this section. As a result of the N+1 redundancy provided by the hot spares in the cluster and the additional links in the network configuration, the resulting availability is extremely high, under the assumption that any failures are addressed in a timely manner.
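A minimal sketch of the kind of arithmetic behind such a model is shown below, using the MTBF/MTTR figures from Table 8-9. The 17-required-out-of-18-installed server configuration is an illustrative assumption, not the actual MCCS redundancy scheme.

# Sketch of inherent-availability arithmetic using Table 8-9 figures.
# The 17-out-of-18 server configuration below is an illustrative assumption.
from math import comb

def ai(mtbf_h, mttr_h):
    """Inherent availability of a single unit: Ai = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)

a_server = ai(16000, 1.0)   # compute server: 16000 h MTBF, 60 min MTTR

# N+1 redundancy: system available if at least 17 of 18 servers are up.
n, k = 18, 17
a_redundant = sum(
    comb(n, i) * a_server**i * (1 - a_server)**(n - i) for i in range(k, n + 1)
)
print(f"Single server Ai    = {a_server:.6f}")       # ~0.999938
print(f"17-of-18 servers Ai = {a_redundant:.10f}")   # very close to 1
# Both values are consistent with the Ai >= 99.99% allocation.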


Figure 8-12 MCCS Availability Model

8.2.2 Maintenance Effort

The operation of the MCCS cluster will include automated monitoring of basic server conditions, including temperature and load, as well as internal behaviour (e.g. S.M.A.R.T. hard-disk health and fan speeds), which permits predictive identification of potentially faulty hardware and programmed preventative maintenance activities (e.g. cleaning). Implementation of these standard approaches should result in a very low level of non-programmed maintenance. Based on the datasheet values for a typical server of this class, and normal experience with replacing such a server in a cluster, reconnecting it to the network, and restoring the software environment as supported by the MCCS design, the MTTR for any individual server is estimated to be 60 minutes (excluding travel and other support time, and assuming that the system has properly reconfigured itself to remove the failed unit from the active set). The expected failure rates for the servers themselves should be very low, assuming appropriate preventative maintenance and industry-standard technology refresh cycles of 4 years. This is supported by manufacturer data indicating that a 3-year warranty term is typically offered, and by experience with similar systems in representative environments (the ASTRON LOFAR central processor, the MWA correlator, and the AAVS1 server). Further analysis to confirm these values requires determination of the calibration algorithms and specification of the servers themselves.
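The sketch below shows one way the annual corrective-maintenance effort could be estimated from the Table 8-9 figures. It counts hands-on repair time only (no preventative maintenance, travel or support time), so it gives a lower bound rather than the 51 hr/year figure quoted in Section 8.3.

# Sketch of an annual corrective-maintenance estimate from Table 8-9 figures
# (hands-on repair time only; preventative maintenance, travel and support
# time are excluded, so this is a lower bound).
HOURS_PER_YEAR = 8760

fleet = [
    # (name, quantity, MTBF in hours, MTTR in minutes) -- from Table 8-9
    ("100G Ethernet switch", 8,  160000, 55),
    ("1G Ethernet switch",   4,  160000, 55),
    ("Cabinet PDU",          4, 1000000, 30),
    ("Compute server",      68,   16000, 60),
]

total_h = 0.0
for name, qty, mtbf, mttr_min in fleet:
    failures_per_year = qty * HOURS_PER_YEAR / mtbf
    hours = failures_per_year * mttr_min / 60
    total_h += hours
    print(f"{name:22s} ~{failures_per_year:5.1f} failures/yr, {hours:5.1f} h/yr")
print(f"Total hands-on repair effort ~{total_h:.0f} h/yr")   # ~38 h/yr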


8.3 Reliability, Maintainability and Availability Requirements Compliance

MCCS_REQ-206 (MCCS Availability)
Requirement: The MCCS shall have an Inherent Availability of more than 99.99%. (Sub-element Ai allocated as TBD15 % in the SKA RAM Allocation document (SKA-TEL-SKO-0000102).)
Compliance: Section 8.2 provides the MCCS Ai model and shows the calculated Ai to be 99.99%, given that the MCCS needs to remain Operationally Capable (i.e. > 95% of signal processing capacity).

MCCS_REQ-208 (MCCS Maintenance Hours)
Requirement: The MCCS shall require less than 0.25 FTEs for corrective and preventative maintenance (O-Level and I-Level).
Compliance: Section 8.2.2 calculates direct replacement effort at 51 hr/year.

MCCS_REQ-212 (MCCS Off-line Fault Detection Performance)
Requirement: When commanded and with the MCCS Administrative Mode set to OFFLINE, the MCCS built-in diagnostic self-test capability shall detect and report TBD16 % of all critical sub-element failures.
Compliance: A detailed analysis of MCCS fault detection performance will be provided in the LFAA Logistics Report [RD28] during the construction phase.

MCCS_REQ-218 (MCCS On-line Fault Isolation Performance)
Requirement: When commanded and with the MCCS Administrative Mode set to OFFLINE, MCCS shall detect and report TBD17 % of sub-element LRU-to-LRU and LRU-to-external-interface communication path faults.
Compliance: An analysis of MCCS fault isolation performance will be provided in the LFAA Logistics Report [RD28] during the construction phase.

MCCS_REQ-209 (MCCS Mean Time to Repair)
Requirement: The MCCS shall have an MTTR of less than 1 hour.
Compliance: Typical experience with similar systems indicates this is reasonable.

Table 8-10 Reliability, Maintainability and Availability Requirements Compliance


9 Safety

The MCCS design considers the following categories of safety:

1. Personal safety – avoiding risk of physical injuries, including those associated with sight, hearing, and lifting. The applicable regulation is the Australian Occupational Health and Safety Act [RD33].

2. Electrical safety - avoiding risk of electrocution. The applicable regulation is the European Low Voltage Directive [RD25] and more specifically IEC 60950-1 - Information Technology Equipment - Safety: General Requirements [RD3].

3. Asset protection safety – self-protection, external entity failure, fire protection. The applicable regulation is also IEC 60950-1 - Information Technology Equipment - Safety: General Requirements [RD3].

4. Environmental safety – disposal of electronic waste. The applicable regulations that are acceptable in Australia are the Restriction of Hazardous Substances Directive (RoHS 2) [RD1] and the Waste Electrical and Electronic Equipment Directive (WEEE) [RD2].

The following sections address each of these categories and demonstrate how MCCS is designed with these safety issues in mind and to be compliant with the requirements. This overall approach is consistent with the SKA Project Safety Management Plan.

9.1 Hazard Analysis

The MCCS hazard analysis [RD34] identifies the main hazards to personnel and equipment and documents how these are mitigated. The CSP Element Safety and Hazard Analysis Report [RD29] also covers selected safety topics in greater detail.

9.2 Personal Safety

During MCCS installation, verification, operation, and maintenance, Personal Protective Equipment (PPE) is used (such as earplugs and gloves), and safety procedures followed, to reduce possible risk to personnel.

9.2.1 Rack Tip-over

The following two scenarios show how MCCS equipment racks can cause human injury or death, as well as damage to the equipment they contain:

1. Seismic forces cause equipment racks to become unstable and tip over.
2. Downward force is exerted on equipment pulled out and extended from the rack to get access for maintenance, causing the rack to become unstable and tip over.

The following methods are used to stabilise the racks to mitigate these risks:

- Ensure racks have a centre of gravity (CofG) of less than 50% in the height direction by placing heavier equipment in the lower rack spaces, and perform loading analysis to verify.
- Add front and back stabilising feet (if required) to increase rack front-to-back stability.


- Place warning labels on the front of racks to indicate that only one equipment item (e.g. LRU) may be extended from the rack at any given time. This warns integrators and maintainers so as to mitigate the rack-tipping hazard.

9.2.2 Over-Temperature and Fire

To prevent equipment damage and a possible fire hazard in the case where LFAA.MCCS equipment is not provided with adequate cooling, equipment temperature will increase up to a pre-defined limit, at which point the hardware will autonomously power off, even if monitor and control systems have failed.

9.2.3 Weight

Any MCCS equipment item weighing more than 23 kg presents an increased risk of human physical injury due to lifting and moving. Any equipment exceeding 23 kg is fitted with clearly visible permanent "warning lifting hazard" signage.

Any MCCS equipment items that have a mass of more than 23 kg, but less than 40 kg, will be designed with carrying handles that allow for a two-person lift. Such equipment will have signage to indicate that two people are required to lift it. Currently there is no such equipment in MCCS.

The only MCCS equipment items having a mass of more than 40 kg are the cabinets, which can be fitted with an integral lifting arrangement, such as eye-bolts, to allow lifting equipment to lift and move them.

9.2.4 Sharp Edges

MCCS is designed and manufactured such that it has no sharp edges on access openings and corners.

9.2.5 Laser Light

MCCS has many optical fibre links that transport infrared laser light. A risk of permanent eye damage exists if humans stare into an open end of such a live optical fibre connection. Wherever possible, equipment uses automated shutters. Where required, equipment is fitted with applicable optical warning signage.

9.2.6 Fire Protection

MCCS rack equipment must comply with the IEC 60950-1 or equivalent safety standard. “Fire-rated” materials retard the ignition and spreading of fire and are used throughout.

9.2.7 Personal Protective Equipment

During MCCS installation, verification, operation, and maintenance, Personal Protective Equipment (PPE) is used (such as earplugs and gloves), and safety procedures followed, to reduce possible risk to personnel.


9.3 Electrical Safety

This section mainly describes MCCS electrical safety requirements, design, and operation. The MCCS is designed, installed, and certified in accordance with the Australian or equivalent regulations, codes, and standards.

9.3.1 Rack Electrical Design

MCCS racks use COTS power distribution units (PDUs) certified to national and international electrical safety standards. The PDUs are connected to CPF mains 3-phase AC power via a plug and socket approved by national and international electrical regulatory authorities.

MCCS racks and LRU equipment are procured and installed to comply with AS/NZS 3000:2007 (Wiring of Premises) [RD26] or equivalent.

Shock-hazard warning labels and signs are provided where humans are exposed to shock hazards on and inside the racks, although the MCCS design is such that no such hazards exist outside of standard AC plug and socket use. Any access to shock-hazard areas is secured inside enclosures or behind access panels. Any spare openings on chassis or faceplates are fitted with covers or plugs.

9.3.2 LRU Design

COTS LRUs are procured and designed to meet the ITE safety standard, IEC 60950-1.

9.3.3 Earthing and Electrical Bonding Systems

Earthing and Electrical Bonding systems are installed on MCCS racks and LRUs to:

1. Provide a discharge path for ESD-induced voltages.
2. Provide protection against electrical shock by triggering earth-leakage protection and over-current protection devices (e.g. fuses or circuit breakers) to isolate faulty equipment.
3. Prevent voltage displacement, which could lead to electric shocks.
4. Prevent electrical arcing and sparking that may cause a fire hazard.

MCCS earthing/grounding systems are in accordance with the Australian codes and standards.

9.4 Environmental Safety

This section deals with toxic substances - materials, liquids, and gases.

9.4.1 Hazardous Materials

MCCS consists of COTS information technology type electrical and electronic equipment. Many MCCS LRUs contain lead (e.g. RoHS-compliant electronics components contain small amounts of lead and polychlorinated biphenyl).

The MCCS limits the use of hazardous materials and adheres to the Restriction of Hazardous Substances (RoHS) directive. Equipment containing hazardous materials indicated in the directive is labelled as hazardous waste and is marked with the lead/silver contamination symbol and per the Waste Electrical and Electronic Equipment (WEEE) Directive 2012/19/EU category “IT and telecommunications equipment” (or equivalent).


Unrepairable or throwaway equipment containing hazardous materials is marked with applicable symbols/markings. These items must be disposed of by placing them in containers dedicated for this purpose to be subsequently transported to waste recycling centres (i.e., collection facilities) or approved authorised treatment facilities in compliance with Australian environmental protection and waste disposal legislation as well as regional local authority requirements.

9.4.2 Gases

LFAA.MCCS is designed not to use or emit any hazardous gases, and materials are selected on the basis that they have reduced potential of emitting poisonous fumes/gases during fires.

9.4.3 Liquids

MCCS is designed not to use or emit any hazardous liquids.

9.5 Asset Protection

This section describes how MCCS protects the integrity of its equipment. Equipment safety concerns relate to overheating, loss of communication with TM, and power consumption surges.

Compute Server LRUs are outfitted with integral temperature sensing; when temperature trip points are reached, alarms are generated, prompting MCCS to take action. This could be either commanding the server to power down (in the case of only one or a few servers overheating), or commanding the MCCS system to enter the Low Power state in a controlled/staged manner (in the case of large portions of, or the entire, Sub-element overheating), to keep power-draw changes within the required rates.

MCCS will continually make available temperature information from MCCS LRUs as well as SPS LRUs and flag alarms when approaching or in an over-temperature situation. It is the responsibility of external entities to poll the flags and respond by commanding the MCCS and LFAA.SPS to shut down some or all the system. If an over-temperature situation arises and no action is taken to respond to the alarm flagging, each MCCS and SPS LRU has integral “deadman” thermal overload protection that automatically shuts down the LRU when a hard-defined temperature limit (TBD18) is reached, independently and not relying on CPUs or intelligent control threads. Power to the over-temperature LRU will remain off until commanded otherwise by TM.
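The sketch below illustrates the software side of this behaviour only; it is not the MCCS implementation, and the threshold, polling cadence and helper functions are invented placeholders. The hardware "deadman" trip deliberately does not appear in the loop, since it acts autonomously in the LRU.

# Illustrative sketch (not the MCCS implementation) of the monitoring side of
# over-temperature handling: publish LRU temperatures, flag alarms near the
# trip point, and leave the hard shutdown to the independent deadman hardware.
import time

ALARM_C = 75.0   # hypothetical software alarm threshold; the hardware deadman
                 # trip point (TBD18 in the text) is handled in the LRU itself,
                 # independently of this loop.

def read_lru_temperatures():
    """Placeholder for reading the temperatures of all monitored LRUs."""
    return {"server-01": 52.3, "server-02": 78.1}   # dummy values

def publish_alarm(lru, temp_c):
    """Placeholder for raising a TANGO attribute alarm for external entities."""
    print(f"ALARM: {lru} at {temp_c:.1f} C (>= {ALARM_C} C)")

while True:
    for lru, temp_c in read_lru_temperatures().items():
        if temp_c >= ALARM_C:
            publish_alarm(lru, temp_c)   # external entities poll and respond
    time.sleep(10)                       # assumed polling cadence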

Deadman thermal tripping may vary in time across the MCCS and LFAA.SPS systems since each LRU operates at a slightly different temperature and deadman devices trip at slightly different temperatures.

If the LFAA.SPS and MCCS management are still operational in this situation, they flag warnings of a possible hazard since there is no way to absolutely guarantee that power draw changes are within required rates. However, this situation should rarely, if ever, be encountered, and the disparities noted above should statistically guarantee power draw change limits.

Other COTS LRUs (such as switches) rely on built-in manufacturer’s thermal overload protection for self-protection/fail-safe operation. Since none of these items are high-power units, no asset hazard


is envisioned, and these are shut down in an orderly manner using rack PDUs as described at the beginning of this section.

Power surges (i.e. voltage spikes) are suppressed by the LFAA.SPS and MCCS PDUs and LRU power supplies with integrated transient suppression devices.

LFAA.SPS rack tip-over condition is alleviated by ensuring that racks have a low centre of gravity (< 50%).

9.6 Certification

The Occupational Health and Safety (OHS), Electromagnetic Compatibility (EMC), and Electrical Safety standards in Australia and Europe are not identical. To ensure that MCCS equipment certification is internationally acceptable, all equipment is certified by certification bodies that are members of the "Scheme of the IECEE for Mutual Recognition of Test Certificates for Electrical Equipment", shortened to the "CB Scheme".

"CB Scheme" members can EMC- and safety-qualify equipment and issue CB certificates that are accepted by national certification bodies globally, to speed up and simplify the approval and acceptance of equipment between countries.

9.7 Safety Requirements Compliance

Table 9-11. Safety Requirements Compliance

MCCS_REQ-242 (MCCS Safety Legislation and Regulations)
Requirement: The MCCS equipment and operations shall comply with all applicable Australian Occupational Health and Safety legislation and regulations in accordance with the latest approved version of the LFAA.SPS to INFRA ICD.
Compliance: MCCS equipment is IEC 60950-1 (Safety) compliant.
Electrical (see Section 10.3):
- MCCS design has no exposed hazardous voltages.
- MCCS racks have one PDU that provides equipment overcurrent protection.
- MCCS equipment has basic voltage surge protection.
- MCCS has basic fault-current interrupt capability.
- MCCS equipment is connected to protective/safety earth.
- MCCS equipment is labelled to indicate electric shock hazards.
- MCCS equipment is designed, constructed, and installed to Australian regulations and standards (or equivalent).
- MCCS equipment chassis and racks are connected to safety earth.
- MCCS racks can manage the fault currents present at the power connection point.
Fire:
- MCCS rack design uses non-flammable materials, components that won't ignite a fire, and has no combustible materials.
Mechanical (see Section 10.2):
- MCCS racks in the same row are tied together to provide stability in the sideways direction.
- MCCS racks have a Centre of Gravity (CofG) < 45% of the rack height.
- MCCS weight warnings are provided.
- MCCS lifting features are provided.
- MCCS rack design is sufficiently stable and has no sharp edges and corners.
Sound (see Section 10):
- MCCS rack audible noise levels are limited to < 85 dBA at a distance of 1 m.

MCCS_REQ-234 (MCCS Safety Certification)
Requirement: MCCS rack and LRU equipment shall be safety certified to AS/NZS 60950.11:2015 or equivalent in accordance with the latest approved version of the LFAA to INFRA ICD.
Compliance: The 60950.11 standard is a country-specific standard applicable to ITE equipment. The equipment will be compliant with both the IEC version and the AS/NZS version. The AS/NZS version is almost identical to the IEC version, with some deviations such as local electrical voltage, electrical connection types, ground/earth connection, etc.

MCCS_REQ-244 (MCCS Environmental Certification)
Requirement: MCCS racks and LRU equipment shall be RoHS 2 (2011/65/EU) certified in accordance with the latest approved version of the LFAA to INFRA ICD.
Compliance: MCCS equipment is RoHS 2 directive compliant. The equipment materials list (BOM) is audited to confirm that it conforms to the directive.

MCCS_REQ-198 (MCCS Recycling and Decomposition)
Requirement: MCCS equipment shall be compliant with environmental, ecological, safety, political and social legislation with regard to recycling and decomposition.
Compliance: MCCS equipment is RoHS 2 directive compliant. MCCS equipment is Waste Electrical and Electronic Equipment (WEEE) Directive 2012/19/EU category "IT and telecommunications equipment" (or equivalent) compliant.

MCCS_REQ-199 (MCCS Disposable Equipment Labelling)
Requirement: MCCS disposable equipment shall be labelled as such.
Compliance: MCCS equipment is Waste Electrical and Electronic Equipment (WEEE) Directive 2012/19/EU category "IT and telecommunications equipment" (or equivalent) compliant. Electronic equipment is labelled to indicate how it needs to be disposed of or recycled.

MCCS_REQ-200 (MCCS Restriction of Hazardous Substances)
Requirement: MCCS shall be compliant with the RoHS Directive 2011/65/EU (RoHS 2) or equivalent.
Compliance: MCCS equipment is RoHS 2 directive compliant. See Section 10.4.1.

MCCS_REQ-245 (MCCS Rack Side-to-Side Stability)
Requirement: MCCS racks shall be attached to adjacent racks in the same row to provide lateral stability in accordance with the latest approved version of the LFAA to INFRA ICD.
Compliance: MCCS racks are attached to each other in the lateral direction to support individual racks. This prevents the racks from tipping over in case of seismic events, and when heavy LRUs are extended from the rack for maintenance.

MCCS_REQ-247 (MCCS Rack Acoustic Noise Level)
Requirement: MCCS acoustic noise emission sound pressure level shall be in accordance with the LFAA to INFRA ICD.
Compliance: MCCS equipment is designed with fans that ensure a safe noise level is maintained at any rack location. See Section 10.3.2.

MCCS_REQ-248 (SPS Equipment Shutdown on Failure)
Requirement: The MCCS shall autonomously detect and shut down affected equipment in accordance with the latest approved version of the LFAA to INFRA ICD, when any of the following conditions occur:
- an overcurrent condition occurs in a rack;
- an over-temperature condition occurs in an LRU.
Compliance: MCCS PDU equipment provides overcurrent protection. MCCS equipment LRUs individually have thermal shutdown protection.

MCCS_REQ-176 (MCCS Occupational Health Regulations)
Requirement: MCCS shall comply with all local Occupational Health and Safety legislation and regulations applicable to the Australian observatory site and its operations.
Compliance: See MCCS_REQ-242.

MCCS_REQ-177 (SPS Hazard Elimination)
Requirement: MCCS shall be safe to the operators, maintainers and its own equipment, without hazard categorisation exceeding the acceptable levels defined in the SKA Project Safety Management Plan [SKA-TEL-SKO-0000740].
Compliance: See MCCS_REQ-242.

MCCS_REQ-178 (MCCS Fail Safe Operation)
Requirement: MCCS equipment shall operate in a locally fail-safe manner and not rely on external safety devices or measures to operate safely.
Compliance: MCCS PDU equipment provides overcurrent protection. MCCS equipment individually has thermal shutdown protection.

MCCS_REQ-47 (MCCS Fail Safe Provisions)
Requirement: MCCS shall not exhibit safety hazards in Categories I or II (ISO 45001) following an unplanned loss of main electrical power or main control functions.
Compliance: MCCS equipment is IEC 60950-1 (Safety) compliant. See L3-1672. When MCCS rack power is restored, rack equipment powers back on and transitions to a safe state.

MCCS_REQ-179 (MCCS Non-Propagation of Failures)
Requirement: MCCS equipment hardware failures and software errors shall be safe from creating hazardous conditions in interfacing elements and sub-elements.
Compliance: MCCS does not control any external equipment. External interfaces are defined to allow interfacing systems to operate autonomously and have self-protection.

MCCS_REQ-178 (SPS Fail-Safe Operation)
Requirement: MCCS equipment that would otherwise present a safety hazard when subjected to an unplanned loss of main electrical power or main control function shall enter a designated fail-safe state.
Compliance: MCCS equipment is IEC 60950-1 (Safety) compliant. When MCCS rack power is restored, rack equipment powers back on and transitions to a safe state.

MCCS_REQ-187 (MCCS Electrical Installation)
Requirement: The electrical wiring of all MCCS hardware shall comply with the low-voltage installation standard (wiring of premises) AS/NZS 3000:2007.
Compliance: MCCS equipment and racks are designed to comply, and to be installed in compliance, with Australian standards.

MCCS_REQ-188 (MCCS Electrical Circuit Interlock)
Requirement: MCCS shall provide electrical circuit interlocks to prevent personnel coming into contact with hazards that cannot otherwise be eliminated from the design.
Compliance: MCCS equipment does not present any hazards that require electrical circuit interlocks.

MCCS_REQ-190 (MCCS Safety of Equipment)
Requirement: MCCS racks and equipment shall comply with IEC 60950-1 Information Technology Equipment - Safety Part 1: General Requirements, or an equivalent standard.
Compliance: MCCS equipment is designed to comply with IEC 60950-1 or an equivalent standard. COTS equipment carrying this certification is selected; bespoke LRUs are designed to comply and could be certified.

MCCS_REQ-191 (MCCS Power Interruption Survivability)
Requirement: MCCS racks shall survive a main power interruption at any time, without requiring a repair activity to restore them once power is restored.
Compliance: MCCS is designed to comply.

MCCS_REQ-192 (MCCS Electrostatic Warning Marking)
Requirement: MCCS LRU equipment with electrostatic-sensitive components shall be fitted with ESD warning labels.
Compliance: MCCS is designed to not expose equipment internals to electrostatic shocks.

MCCS_REQ-193 (MCCS Equipment Identification and Safety Markings)
Requirement: MCCS equipment shall, in accordance with ISO 61310-2 or equivalent, bear all markings necessary:
- for its unambiguous identification;
- for its safe use;
and supplementary information given, as appropriate:
- permanently on the machinery;
- in accompanying documents such as instruction handbooks;
- on the packaging.
Compliance: MCCS equipment is procured (COTS) in accordance with ISO 61310-2 or equivalent and bears all markings necessary.

MCCS_REQ-195 (MCCS Hazard Warning Marking)
Requirement: MCCS equipment that presents a potential hazard shall display a label in accordance with ISO 7010 or equivalent.
Compliance: MCCS equipment that presents a potential hazard (e.g. shock, burn, sharp edge) displays a label in accordance with ISO 7010 or equivalent.

MCCS_REQ-196 (SPS Label Robustness)
Requirement: MCCS labels shall remain affixed for at least 50 years or the lifetime of the equipment, whichever is smaller, and shall be unlikely to come off during maintenance or as a result of the environment.
Compliance: MCCS equipment uses durable labels with a warranty to stay affixed for at least 50 years, or the lifetime of the equipment.


10 Environmental

The LFAA MCCS is located on-site in the Central Processing Facility (CPF) shielded room. This section describes the environmental conditions the MCCS equipment is exposed to during transportation, normal operations, and when in storage in the spares storage facility.

10.1 Transportation of Equipment

Commercial land, sea, and air transportation methods are employed to transport MCCS equipment to the CPF during the integration, installation, and operational phases of SKA1_Low. These methods of transportation expose the equipment to various environmental conditions. The equipment must be sufficiently robust, and be packaged, to endure these transportation conditions with a high degree of confidence.

The equipment is designed to endure the transportation conditions and is qualified to the environmental limits specified in ETSI EN 300 019-1-2 ("Classification of environmental conditions; Transportation - Class 2.2: Careful transportation") [RD6] or equivalent standards (e.g. IEC 60721-3-2 [RD5]) (L3-1808). The standard specifies the environmental condition limits that equipment is exposed to during transportation.

MCCS equipment is separately packaged at the LRU level, installation cable set, and bare rack and sub-rack level for transportation.

10.2 Storage of Equipment

MCCS equipment spares are stored at the spares store facility and need to endure the environmental conditions specified in ETSI / EN 300 019-1-1 (Part 1-1: Classification of environmental conditions; Storage, Class 1.1: Weather protected, partly temperature-controlled storage locations) [RD6] or equivalent standard (e.g., IEC 60721-3-1, [RD4]) (L3-1806). The standards specify the worst-case environmental conditions equipment is exposed to during storage.

10.3 Operation

MCCS equipment is installed and used operationally in the CPF environments. MCCS COTS equipment is qualified to ETSI / EN 300 019-1-3 (Part 1-3: Classification of environmental conditions; Stationary use at weather protected locations) [RD7] or equivalent standard (e.g., IEC 60721-3-3, [RD4]) (L3-1810).

10.3.1 Mechanical Environment

MCCS rack equipment may be exposed to the following mechanical vibration in the event of a seismic event: a maximum peak ground acceleration of 1 m/s² [AD8].

10.3.2 Nominal Environmental Conditions

MCCS rack equipment is exposed to the following environmental conditions:

1. Temperature range of cold air in front of racks: TBD19 °C to TBD20 °C.
2. Humidity range: 5.5 °C dew point (DP) to 60% relative humidity (RH) and 15 °C DP (TBC1).
3. Maximum dew point: ≤ 17 °C (TBC2).
4. Maximum temperature rate of change: ≤ 5 °C per hour (TBC3).
5. Maximum humidity rate of change: ≤ 5% RH per hour, no condensation (TBC4).

10.3.3 EMC Environment

MCCS equipment electromagnetically affects its own and other equipment via emissions and can be affected by its own and other equipment in the CPF (susceptibility). The EMC Control Plan (EMCCP) [AD3] documents the LFAA.SPS EMC campaign.

10.3.4 Susceptibility to Emissions

MCCS equipment is electromagnetically affected by its own and other Sub-element equipment located in the CPF that are connected to the same electrical network.

10.3.4.1 Susceptibility to Radiated Emissions

MCCS LRU equipment may be susceptible to radiated emissions from other equipment co-located in the CPF, and is qualified to the IEC 61000-6-1/2 [RD16]/[RD17] or equivalent standard (specifically CISPR 24/35 [RD22]/[RD24] Class A) for immunity to radiated emissions, as per the EMCCP [AD3].

10.3.4.2 Susceptibility to Conducted Emissions

MCCS equipment that connects to the CPF electrical networks is qualified to the following immunity standards or equivalent (specifically CISPR 24 [RD22] or EN 55024), with performance criteria such that operator intervention is not required to continue, as per the EMCCP [AD3]:

1. IEC/EN 61000-4-2 [RD10]: Immunity to ESD.
2. IEC/EN 61000-4-3 [RD11]: Immunity to electromagnetic fields.
3. IEC/EN 61000-4-4 [RD12]: Immunity to electrical fast transients/bursts.
4. IEC/EN 61000-4-5 [RD13]: Immunity to voltage surges.
5. IEC/EN 61000-4-6 [RD14]: Immunity to conducted disturbances, induced by RF fields.
6. IEC/EN 61000-4-11 [RD15]: Immunity to voltage dips, short interruptions and voltage variations.

10.3.5 Emissions

MCCS equipment will electromagnetically affect MCCS and other sub-system equipment in the vicinity, located in the CPF that is connected to the same electrical network.

10.3.5.1 Radiated Emissions

MCCS LRUs are certified to the IEC 61000-6-3/4 [RD18]/[RD19] standard or equivalent (specifically CISPR 22/32 [RD21]/[RD23] or EN 55022 [RD20]) for radiated emissions.

In the worst case, emissions from MCCS equipment sum constructively and increase the maximum emission levels above that of a single emitter by a factor of N, calculated as follows:

Radiated emissions level increase = 10 × log10(N) dB

The following table lists the equipment type, emission level, and number of the MCCS radiated emission sources:


Table 10-12. MCCS EMI Emitters

Emitter Description    Certification   Count
MCCS Cabinet           Class A         4
MCCS Data Switches     Class A         12
MCCS Compute Node      Class A         68
MCCS Head Node         Class A         2
UPS                    Class A         2
Cabinet PDU            Class A         4
Total Emitters                         92

As a result, the MCCS equipment maximum radiated emission level is roughly TBD21 dB higher than the level for a single Class A device.

A more realistic scenario is that emissions add randomly. In that case the radiated emissions level goes as the square root of the number of emitters, giving ~TBD22 dB for TBD23 emitters.
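Purely as a worked example of the two summation modes (the document leaves the actual levels as TBDs), applying the stated formula to the 92 Class A emitters of Table 10-12 gives the following:

# Worked example of the summation formula above, using the 92 Class A
# emitters from Table 10-12 (the document leaves the resulting levels as TBDs).
from math import log10, sqrt

n = 92
worst_case_db = 10 * log10(n)       # coherent (constructive) summation
random_db = 10 * log10(sqrt(n))     # incoherent (random-phase) summation

print(f"Worst case: +{worst_case_db:.1f} dB")   # ~ +19.6 dB
print(f"Random sum: +{random_db:.1f} dB")       # ~ +9.8 dB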

10.3.5.2 Conducted Emissions

MCCS equipment that connects with the CPF electrical network is certified to the following emission standards (with performance criteria to not require operator intervention to continue):

1. IEC/EN 61000-3-2 [RD8]: Emissions of harmonic currents.
2. IEC/EN 61000-3-3 [RD9]: Emissions of voltage changes, voltage fluctuations, and flicker.

In the worst case, harmonic current emissions from MCCS equipment constructively sum and increase the maximum emission levels above that of a single emitter by a factor of N for N emitters. Applying the same formula as above leads to the conclusion that in the worst case conducted emissions levels are TBD24 dB above a single emitter. However, as above, the more realistic case is that emissions are ~TBD25 dB above that of a single emitter.

10.4 Environmental Requirement Compliance

The following table provides the references required to show requirements compliance.

MCCS_REQ-128 (MCCS to INFRA Interface)
Requirement: MCCS shall be compliant with the interface definitions listed in the LFAA to INFRA AUS Interface Control Document [100-000000-003].
Compliance: MCCS is designed to comply with these requirements as per Section 12.

MCCS_REQ-1241 (SPS Deployment Location)
Requirement: MCCS shall be located in the Central Processing Facility in accordance with the latest approved version of the LFAA to INFRA ICD [AD8].
Compliance: MCCS is designed to match what the CPF provides. This includes aspects such as rack space, power and cooling. See the start of Section 12.

MCCS_REQ-252 (MCCS Power Quality)
Requirement: MCCS shall operate in compliance with the SKA1 Power Quality Standard Specification in accordance with the LFAA to INFRA Interface Control Document [AD8].
Compliance: MCCS equipment meets industry power quality standards.

MCCS_REQ-254 (SPS Power Factor)
Requirement: MCCS rack equipment shall have a power factor in accordance with the LFAA to INFRA ICD.
Compliance: MCCS COTS meet EMC emissions standard IEC 61000-3-2 [RD8] "Harmonic Current Emissions" Class D or equivalent.

MCCS_REQ-284 (SPS Harmonic Current Emissions)
Requirement: MCCS equipment connected to the CPF electrical network shall be compliant with the IEEE 519-1992 part 11.5 standard or equivalent for Harmonic Current Limits under standard test conditions and setups in accordance with the LFAA to INFRA ICD.
Compliance: MCCS COTS meet IEC 61000-3-2 [RD8] Class A or equivalent standard for Harmonic Current Emissions.

MCCS_REQ-262 (SPS Voltage Fluctuations and Flicker Emissions)
Requirement: MCCS equipment connected to the CPF electrical network shall be compliant with the IEC 61000-3-3 standard or equivalent for Voltage Fluctuations and Flicker Emissions under standard test conditions and setups in accordance with the LFAA to INFRA ICD.
Compliance: MCCS COTS meet IEC 61000-3-3 [RD9] Class A or equivalent standard for Voltage Fluctuations and Flicker Emissions.

MCCS_REQ-264 (SPS ESD Immunity)
Requirement: MCCS equipment susceptible to ESD shall be compliant with the IEC 61000-4-2 standard or an equivalent standard for ESD immunity under standard test conditions and setups in accordance with the LFAA to INFRA ICD.
Compliance: MCCS COTS meet the IEC 61000-4-2 [RD10] Level 1 (4 kV contact, 8 kV air) Performance Criteria (A, B or C) or equivalent standard for ESD immunity as defined in this section.

MCCS_REQ-265 (MCCS Radiated, Radio Frequency, Electromagnetic Field Immunity)
Requirement: MCCS equipment connected to the CPF electrical network shall be compliant with the IEC 61000-4-3 standard or equivalent for radiated, radio-frequency, electromagnetic field immunity under standard test conditions and setups in accordance with the LFAA to INFRA ICD.
Compliance: MCCS COTS power supplies meet the IEC 61000-4-3 [RD11] standard Test Level 2 (3 V/m) Performance Criteria (A or B) or equivalent for radiated, radio-frequency, electromagnetic field immunity as defined in this section.

MCCS_REQ-266 (MCCS Electrical Fast Transient/Burst Immunity)
Requirement: MCCS equipment connected to the CPF electrical network shall be compliant with the IEC 61000-4-4 standard or equivalent for electrical fast transient/burst immunity under standard test conditions and setups in accordance with the LFAA to INFRA ICD.
Compliance: MCCS COTS power supplies meet the IEC 61000-4-4 [RD12] standard Level 3 (1 kV AC power line) with given Performance Criteria (A or B) or equivalent for Electrical Fast Transient/Burst immunity as defined in this section.

MCCS_REQ-267 (MCCS Surge Immunity)
Requirement: MCCS equipment connected to the CPF electrical network shall be compliant with the IEC 61000-4-5 standard or equivalent for surge immunity under standard test conditions and setups in accordance with the latest approved version of the LFAA to INFRA ICD.
Compliance: MCCS COTS power supplies meet IEC 61000-4-5 [RD13] Class 3 (1 kV L-L, 2 kV L-E on AC power, 1.5 kV L-E 10/700 µs on signal lines) with given Performance Criteria (A or B) or equivalent standard for Surge Immunity as defined in this section.

MCCS_REQ-268 (MCCS Immunity to Conducted Disturbances Induced by Radio Frequency Fields)
Requirement: MCCS equipment connected to the CPF electrical network shall be compliant with the IEC 61000-4-6 standard or equivalent for immunity to conducted disturbances, induced by radio-frequency fields, under standard test conditions and setups in accordance with the latest approved version of the LFAA to INFRA ICD.
Compliance: MCCS COTS power supplies meet IEC 61000-4-6 [RD14] Level 2 (3 V rms on power and signal lines) with given Performance Criteria (A or B) or equivalent standard for immunity to conducted disturbances, induced by radio-frequency fields, as defined in this section.

MCCS_REQ-269 (SPS Power Frequency Magnetic Field Immunity)
Requirement: MCCS equipment connected to the CPF electrical network shall be compliant with the IEC 61000-4-6 standard or equivalent for power frequency magnetic field immunity under standard test conditions and setups in accordance with the LFAA to INFRA ICD.
Compliance: MCCS COTS power supplies meet IEC 61000-4-11 [RD15] (for equipment requiring < 16 A, else EN 61000-4-34 for equipment requiring > 16 A) Class 2 (0% input voltage for 10 ms, 70% for 500 ms) with given Performance Criteria (A or B) or equivalent as defined in this section.

MCCS_REQ-270 (MCCS Voltage Dips, Short Interruptions and Voltage Variations Immunity)
Requirement: MCCS LRU equipment connected to the CPF electrical network shall be compliant with the IEC 61000-4-11 standard or equivalent for voltage dips, short interruptions and voltage variations immunity under standard test conditions and setups in accordance with the latest approved version of the LFAA to INFRA ICD.
Compliance: MCCS COTS power supplies meet IEC 61000-4-11 [RD15] (for equipment requiring < 16 A, else EN 61000-4-34 for equipment requiring > 16 A) Class 2 (0% input voltage for 10 ms, 70% for 500 ms) with given Performance Criteria (A or B) or equivalent as defined in this section.

MCCS_REQ-201 (MCCS EMI and EMC)
Requirement: MCCS equipment shall be compliant with the emission and immunity standards in the SKA EMI/EMC Standards and Procedures [SKA-TEL-SKO-0000202] document.
Compliance: In accordance with the EMCCP [AD3], MCCS COTS equipment is CISPR 22/32 [RD21]/[RD23] Class A compliant.

MCCS_REQ-202 (MCCS Essential Conducted and Radiated Electromagnetic Emissions)
Requirement: MCCS LRU equipment that radiates electromagnetic emissions in the SKA1 frequency range, and that isn't within a shielded enclosure of a higher-level LRU, shall at a minimum be compliant with the IEC 61000-6-4 standard or equivalent for conducted and radiated emissions under standard test conditions and setups in the SKA Low telescope frequency range, as specified in the SKA EMI/EMC Standards and Procedures [SKA-TEL-SKO-0000202].
Compliance: MCCS COTS equipment is at least CISPR 22/32 [RD23] Class A compliant. Wherever possible, COTS equipment is selected that is CISPR 22/32 Class B.

MCCS_REQ-203 (MCCS Electromagnetic Susceptibility)
Requirement: MCCS equipment shall be compliant with the IEC 61000-6-2 standard or equivalent for susceptibility to conducted and radiated emissions under standard test conditions and setups. Note: only LRUs with electrical or electronic components that are susceptible to EMI, and that aren't within a shielded enclosure of a higher-level LRU, need to be checked for compliance.
Compliance: MCCS COTS equipment is at least CISPR 22/32 [RD21]/[RD23] Class A compliant.

MCCS_REQ-204 (MCCS Essential Off-the-Shelf Equipment EMC Marking)
Requirement: SPS "off-the-shelf" LRU equipment shall as a minimum be CISPR 22 and 24 Class A or equivalent compliant over the SKA telescope frequency range.
Compliance: MCCS COTS equipment is at least CISPR 22/32 [RD21]/[RD23] and 24/35 [RD22]/[RD24] Class A compliant. Selected COTS equipment is Class A compliant.

MCCS_REQ-205 (MCCS Electricity Network Electromagnetic Compatibility)
Requirement: MCCS equipment shall follow the code of practice for the application of Electromagnetic Compatibility (EMC) standards and guidelines in electricity utility networks.
Compliance: See the detailed EMC requirements above for compliance.

MCCS_REQ-229 (MCCS Storage of Equipment)
Requirement: MCCS equipment, while in its storage packaging, shall withstand, and shall operate to specification as defined herein after exposure to, the storage environmental conditions as defined in "Class 1.1: Weather protected, partly temperature-controlled storage locations" of the ETSI EN 300 019-1-1 standard and defined in BS EN IEC 60721-3-1.
Compliance: MCCS COTS equipment is selected to be compliant with this storage environment or equivalent.

MCCS_REQ-230 (MCCS Transportation Environment)
Requirement: All components and spares of the MCCS, in their transport packaging, shall be safe from damage while, and shall perform to specification as defined herein after, being transported under conditions as defined in "Class 2.2: Careful transportation" of the ETSI EN 300 019-1-2 standard [RD6] and defined in BS EN IEC 60721-3-1.
Compliance: MCCS COTS equipment is selected to be compliant with this transportation environment or equivalent.

MCCS_REQ-231 (MCCS COTS Operating Environment)
Requirement: The MCCS "off-the-shelf" equipment shall perform to specification as defined herein during operation under the environmental conditions as defined in the ETSI EN 300 019-1-3 standard for "Class 3.6: Telecommunication control room locations" or equivalent.
Compliance: MCCS COTS equipment is selected to be compliant with this operating environment.

Table 10-13 Environmental Requirements Compliance


11 Assumptions

This section provides a list of assumptions used in the design of MCCS. These are in addition to assumptions made with regard to requirements, which are given in the LFAA Level 2 Requirements Specification [AD7] and in the MCCS L3 Requirements Specification. The list of assumptions below also does not cover items in the MCCS Software Architecture Document [ref].

Table 11-14. Assumptions used in the design of MCCS

Design Constraints

1.1 The performance benchmarks used to calculate the size of the MCCS compute cluster are based on non-exhaustive benchmark tests which were performed in isolation during prototype development.
    Assumption: It is assumed that processing performed on each server will scale linearly, and that scheduling issues will be minor.

1.2 In the proposed design, processes from at least two stations share the GPU. These GPUs will have correlation, model visibility generation and potentially calibration functions running.
    Assumption: It is assumed that full GPU resource utilisation will be possible via the built-in GPU scheduler, and that this will not cause scheduling delays.

1.3 The compute requirements for the calibration algorithm have not been analysed, primarily since the calibration algorithm to use is still undecided.
    Assumption: It is assumed that the remaining free resources available on each compute node will be enough to run the calibration routine for 8 stations.

1.4 Each MCCS server has one 100Gb port through which all data and LMC traffic for eight stations is processed. At least 16 processes (8 for correlation and 8 for transient buffering) will be using this interface. Since the data acquisition software captures raw packets, each process will have to filter all UDP traffic on this port and select the required packets.
    Assumption: Packet filtering will be performed in the kernel with appropriate high-performance filters. This has already been prototyped, but requires benchmark analysis to see whether the assumed CPU resources listed for data acquisition suffice.

1.5 Each transient buffer (eight in total on a compute server) requires a significant amount of RAM, with data being written constantly to the ring buffers. The suggested RAM configuration provides more than enough bandwidth for this (as well as for the DAQ processes).
    Assumption: Transient buffering has not been prototyped due to its late addition to the L1 requirements. It is assumed that the suggested memory configuration and the implementation of the transient buffer itself will be able to handle the compute, memory and I/O requirements for 8 stations on a single server.

1.6 A shadow node is included in the system as a failover for when the master node goes offline. All operations performed by the master node are carried over to the shadow node in this circumstance.
    Assumption: It is assumed that the cluster management software used to manage the compute cluster is capable of performing this operation. The TANGO framework supports shadowing, but this feature has not been prototyped.

1.7 Server upgrades will be performed for all compute servers in MCCS, such that the compute cluster will have homogeneous compute resources.
    Assumption: Assume that all online components of the compute cluster are homogeneous, such that no special scheduling considerations need to be taken into account.

1.8 The current rack assembly has plenty of spare space in case additional compute servers are required (as well as empty switch ports).
    Assumption: It is assumed that rack power and thermal performance won't be compromised if compute servers are added.

Environmental Conditions

2.1 Any personnel interacting with any MCCS equipment, either in the CPF room or any other environment, are adequately trained and equipped to handle the equipment.
    Assumption: It is assumed that there are no requirements to design for untrained personnel handling/servicing the equipment.

Maintenance Constraints

3.1 There will be cases where it is not possible to detect processing defects while the system is online and actively observing, without the cost and complexity of full N+1 redundancy in processing.
    Assumption: Assume a maintenance schedule and procedure to take at least part of the system offline to perform such tests, and that such tests will occur on a regular basis.

3.2 LRUs are repaired (and can be repaired) rather than thrown away. The number of servers that can't be repaired over the instrument lifetime is greater than the planned number of repairs.
    Assumption: Assume that the operations maintenance budget and plan will be finalised with appropriate spares levels and lifetime buys for the expected end-of-life parts.

3.3 An LRU repair facility is security certified to avoid any possible introduction of malicious software (such as viruses).
    Assumption: Assume that the operations maintenance plan includes suitable protection software.

3.4 Data supplied by MCCS monitor points is expected to be archived externally to LFAA.
    Assumption: Assume that the engineering maintenance staff shall be provided with access to this historical data to allow monitoring of all parts.

Access Constraints

4.1 Any authentication/protection for network access to MCCS equipment, from any location other than perhaps inside the CPF, is provided by the LFAA LMC infrastructure or TM.
    Assumption: Assume that the operations plan includes suitable access control to internal Ethernet networks.

Interface Constraints

5.1 TM/LMC limits instrument configurations that are not sensible or are beyond the operating capacity of available resources within the entire Low instrument.
    Assumption: Assume that internal MCCS checking of all configurations will be required to avoid instrument failure.

5.2 The INAU CPF building is a critical component in keeping the Low site RFI quiet. MCCS equipment will be designed and measured to comply with appropriate EMI standards.
    Assumption: It is assumed that INAU will provide adequate shielding of the CPF, such that no additional RFI shielding is required for the MCCS equipment if it meets the specified EMI standards.

TM TBDs

Cadence of MCCS monitoring points.
    Assumption: M&C for components below sub-assembly level will be detailed during pre-construction and construction.


12 Development Support

The MCCS design requires the development of bespoke software, as described in [RD31]. To support this development, several key facilities or environments are required:

- LFAA ITF
- Emulators for MCCS verification
- Software development environment
- Code regression testing system
- Failure reporting database analysis
- EMI certification facility

These environments will allow MCCS to progress through development as well as the required acceptance testing. Development will be supported by appropriate management and system engineering activities, such that it can occur unimpeded. Most of the development support facilities will likely become part of SKA operations following the completion of the construction phase. Each of the support items is detailed further in the following subsections.

12.1 Development Facility

The components that comprise the MCCS Sub-Element will be integrated and tested in the same facility as SPS, which would be staffed by an on-site test engineer who installs and maintains the equipment. The AIV roll-out plan [AD4] expects a “stable and well-tested” Sub-element to be available at this facility for informal testing before the initial qualification event. Testing at this site continues throughout the engineering construction period until AIV handover to SKA1 operations.

This site would require one rack containing at least one server, with the required infrastructure including:

- 3-phase power with earth-leakage breakers
- Optical networking interfaces
- Appropriate building sensors and alarms
- Air conditioning to maintain a similar temperature/humidity as at the CPF

In addition to the infrastructure facilities, there are several other items which shall be available to support development:

- Ethernet traffic generator and analyser with 100 Gbps interfaces
- Fibre-optic ribbon and individual-fibre cleaning and inspection station
- Fibre-optic power meter
- Computer, keyboard, mouse and monitors external to the racks, to show system configuration and health and to execute local testing

12.2 Integration Verification and Test Plan

The integration verification and test plan for MCCS [RD36] is split into:1. A testing process for functional elements – consisting of a cycle of unit testing, integration

testing, system testing and acceptance testing2. A testing process for non-functional elements – consisting of performance testing, security

and vulnerability testing, usability testing, portability testing3. Regression testing for both function and non-functional elements


4. Notes on repository-keeping and self-containment
5. A collection of testing environments to be utilized during system testing and verification, in particular: development environment, build environment, integration and performance environment, and production environment
6. A number of simulators required for cross-element testing
7. Specific test plans for hardware components and hardware-software overlap: testing CPUs, GPUs, and the need for real hardware for software-hardware overlap testing
8. Test cycle methodology and continuous integration, with a focus on the Agile testing cycle and which tests go into which part of the cycle, the need for continuous integration and how it should be utilized in MCCS development and testing, as well as the metadata to collect during continuous integration testing
9. Qualification and acceptance procedures
10. Instructions on staged releases and test scheduling to match which test types are required for particular stages and milestones, and their cadence
11. A checklist of functional and non-functional requirements, together with the array assembly the requirements should be applied to, and a description of the verification rationale for each requirement.
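To make item 3 concrete, the sketch below shows the shape of a pytest-style regression test that a continuous integration system (item 8) could run on every release. The MCCS control API is not defined at this level in [RD31]/[RD36], so the client class and its methods here are hypothetical placeholders, not the actual MCCS interface.

```python
# Hypothetical regression-test sketch for the MCCS test plan (pytest style).
# FakeMccsClient and its methods are illustrative placeholders only.
import pytest


class FakeMccsClient:
    """Stand-in for an MCCS control client, used here for illustration."""

    def __init__(self):
        self.subarrays = {}

    def allocate_subarray(self, subarray_id, stations):
        if not 1 <= subarray_id <= 16:          # AA2+ supports 16 sub-arrays
            raise ValueError("subarray id out of range")
        self.subarrays[subarray_id] = list(stations)

    def release_subarray(self, subarray_id):
        self.subarrays.pop(subarray_id)


@pytest.fixture
def mccs():
    return FakeMccsClient()


def test_subarray_allocation_roundtrip(mccs):
    """Functional regression: allocate then release a sub-array."""
    mccs.allocate_subarray(1, stations=["station-001", "station-002"])
    assert mccs.subarrays[1] == ["station-001", "station-002"]
    mccs.release_subarray(1)
    assert 1 not in mccs.subarrays


def test_subarray_id_is_validated(mccs):
    """Negative test: invalid sub-array ids must be rejected."""
    with pytest.raises(ValueError):
        mccs.allocate_subarray(17, stations=[])
```

Tests of this form accumulate across releases into the regression suite, so each continuous integration run re-exercises previously verified behaviour.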


13 System Commissioning

The MCCS design and implementation phase will support the AIV Element roll-out plan [RD27]. Core to the AIV plan are the telescope-level commissioning phases shown in Table 13-15 (only features relevant to LFAA are presented), with five key milestones: one ITF Qualification Event (ITF-QE) and four Array Assemblies. The hardware required for commissioning is all COTS and will be required in four locations:

- Development Facility: local to the sub-element, where development and debugging can occur rapidly in a team environment. Element and sub-element emulators provide sufficient test and capture capabilities to sell off at Level 3.
- LFAA Element ITF: where LFAA sub-elements are integrated to test internal interfaces and functionality. Only element-level emulators are required to test and sell off the system at Level 2.
- AIV System ITF: where elements are verified (sold off) against requirements. Elements are integrated together to perform final verification at Level 1.
- Site CPF: integration of the hardware in its final location and tested with the

Table 13-15. Array assembly events and MCCS functionality

                ITF-QE   AA1      AA2      AA3      AA4
Stations        24       24       64       256      512
Sub-arraying    -        1        16       16       16
Bandwidth       75 MHz   75 MHz   150 MHz  300 MHz  300 MHz
Station Beams   1        8        8        8        8

13.1 Hardware Commissioning

The estimated minimum quantity of hardware required for each phase is shown in Table 13-16. It is assumed that the MCCS construction team will always have hardware local to their location so that development can occur independently of the array assembly events. The primary element in an MCCS rack is the compute server, while the rest of the hardware can be regarded as support hardware. The numbers presented in this table are therefore driven primarily by the number of compute servers required to process the stations at each array assembly.

Table 13-16. Hardware requirements for supporting and processing equipment for each AA

          Racks   Compute Servers   1 Gb Switches   100 Gb Switches   Head Nodes   PDUs
ITF-QE    1       4                 1               1                 1            1
AA1       1       4                 1               1                 1            1
AA2       1       8                 1               2                 1            1
AA3       2       34                2               4                 2            2
AA4       4       68                4               8                 2            2
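The station-driven sizing can be made explicit. The ratio inferred from Table 13-16 (64 stations on 8 servers at AA2) is roughly eight stations per compute server; this figure is an inference from the table, not a stated design parameter, and the AA3/AA4 counts (34, 68) sit slightly above the resulting floor, suggesting built-in spare capacity. A minimal sketch under that assumption:

```python
import math

# Sketch only: estimate the *minimum* number of compute servers from the
# station count. STATIONS_PER_SERVER is inferred from Table 13-16 and is
# an assumption here, not a stated design parameter.
STATIONS_PER_SERVER = 8  # assumed

def min_compute_servers(stations: int) -> int:
    return math.ceil(stations / STATIONS_PER_SERVER)

for name, stations in [("ITF-QE", 24), ("AA1", 24), ("AA2", 64),
                       ("AA3", 256), ("AA4", 512)]:
    print(f"{name}: {stations} stations -> >= {min_compute_servers(stations)} servers")
```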

Critical to the commissioning of the production hardware are the emulators used to verify the functionality of the system against system requirements. The emulators contain only partial functionality and are designed to test a portion of the system. Refer to the test specification document for details on emulator functionality [RD36]. Since all that is required


for each array assembly is the addition of new hardware (leaving the existing hardware in place), the following procedure is all that is needed for AIV:

- New hardware is installed by LFAA without disruption to existing hardware (this is not always possible, as connections between racks and external equipment may cause temporary disruptions).
- The software configuration is updated, and the LFAA system is verified and sold off. Since the new hardware is identical (i.e., identical build versions or equivalent) to what is already installed, the verification effort should be minimal, consisting mostly of configuration checks on the newly installed software along with some operational spot-checks. The system is then accepted by AIV.
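A minimal sketch of the kind of configuration spot-check described above, assuming a hypothetical inventory of installed software versions (none of these package names or hosts come from the MCCS design):

```python
# Hypothetical post-installation spot-check: confirm every newly installed
# server reports the same software configuration as the reference release.
# The inventory dictionaries are illustrative; the real MCCS configuration
# store is not described at this level in the design.
REFERENCE_RELEASE = {"mccs-core": "2.4.1", "mccs-ui": "2.4.0", "os-image": "lts-18.04"}

installed = {
    "server-05": {"mccs-core": "2.4.1", "mccs-ui": "2.4.0", "os-image": "lts-18.04"},
    "server-06": {"mccs-core": "2.4.1", "mccs-ui": "2.3.9", "os-image": "lts-18.04"},
}

def check_configuration(inventory, reference):
    """Return a list of (host, package, found, expected) mismatches."""
    mismatches = []
    for host, packages in inventory.items():
        for package, expected in reference.items():
            found = packages.get(package)
            if found != expected:
                mismatches.append((host, package, found, expected))
    return mismatches

for host, package, found, expected in check_configuration(installed, REFERENCE_RELEASE):
    print(f"{host}: {package} is {found}, expected {expected}")
```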

13.1.1 Production Readiness Review

A key part of the roll-out plan is a series of Production Readiness Reviews (PRRs). A PRR is a necessary stage before procurement of hardware can commence. PRRs will be held for both supporting and processing equipment to ensure that the hardware being purchased meets the requirements; each piece of equipment at Level 5 in the PBS has a corresponding PRR. The PRRs will be distributed over a period of time rather than all occurring at once. Once a PRR is approved, volume procurement will commence. It is important for software developers to have a consistent and stable hardware platform.

MCCS will require an Automated Test Environment (ATE) to verify the performance of each LRU. The ATE will have supporting software to exercise the LRU capabilities in a controlled environment and will test the following items:

- M&C functionality
- Voltage regulation and power consumption
- Cooling systems
- Optical connectivity (speed and optical power)
- Memory, CPU and GPU performance
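A sketch of how an ATE run over those items might be structured is given below. Every probe function and pass/fail limit is a hypothetical placeholder; real probes would come from PDU metering, server sensors, an optical power meter, and a benchmark suite.

```python
# Hypothetical ATE sequence for a compute-server LRU. Each lambda stands in
# for real instrumentation; the limits are placeholders, not design values.
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    value: float
    low: float
    high: float

    @property
    def passed(self) -> bool:
        return self.low <= self.value <= self.high

def run_ate(probes) -> bool:
    """Run every probe, print a pass/fail line, and return overall result."""
    results = [CheckResult(name, probe(), low, high)
               for name, (probe, low, high) in probes.items()]
    for r in results:
        print(f"{r.name:28s} {r.value:12.2f}  [{r.low}..{r.high}]  "
              f"{'PASS' if r.passed else 'FAIL'}")
    return all(r.passed for r in results)

# Placeholder probes returning canned values for illustration.
probes = {
    "psu_12v_rail (V)":        (lambda: 12.05, 11.4, 12.6),
    "power_consumption (W)":   (lambda: 640.0, 0.0, 800.0),
    "fan_speed (rpm)":         (lambda: 7200.0, 3000.0, 12000.0),
    "optical_rx_power (dBm)":  (lambda: -4.2, -10.0, 0.0),
    "memory_bandwidth (GB/s)": (lambda: 152.0, 120.0, 1e9),
}

lru_ok = run_ate(probes)
print("LRU", "ACCEPTED" if lru_ok else "REJECTED")
```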

13.2 Code Commissioning

There will be many code releases internal to MCCS; however, since most of this code relates to monitoring and control of COTS hardware (all firmware-related software support is part of SPS), commissioning of this code can be performed in the ITF. The order in which code is commissioned is dictated by the AIV roll-out plan. Within each release there will be several minor releases.

System health and hazard prevention is expected to be established early in the construction phase. User Interfaces (UIs) at multiple levels of the system hierarchy are developed from the beginning, with a use model that encompasses engineering development and test through to system commissioning and operational support/maintenance. For each code release there will be a corresponding user interface with appropriate functionality enabled to support the system. At the LFAA ITF this includes integration with LMC, which is developed using the Scaled Agile Framework (SAFe). This will see quarterly releases which are not synchronised to array releases but instead evolve over time. Once at the AIV ITF, the same will occur with the integration of TM.


13.3 Verification and Acceptance

The LFAA element is responsible for the verification of L3 requirements. This requires the element to use emulators to simulate input data and evaluate output data. For many requirements, the sell-off at L3 is identical to the sell-off at L2 for LFAA; this includes non-functional requirements. The documentation of the L3 testing, described in [RD36], will be supplied to AIV to demonstrate compliance. Requirements needing specific verification at L2 will have this testing done at the LFAA ITF; again, these tests will be documented and supplied to AIV. L1 verification is conducted at the AIV ITF. The hardware will undergo Factory Acceptance Tests (FAT) at the LFAA Element ITF in the presence of AIV. AIV signs off the FAT, and LFAA then transports the hardware to the AIV ITF or the CPF site for installation and handover to AIV.


14 Integrated Logistic Support

This section describes the MCCS Integrated Logistic Support (ILS) approach. Many aspects addressed here are described in more detail in the LFAA Logistic Engineering Report [RD28].

14.1 Support and Maintenance Concept

Figure 14-13 below shows the MCCS maintenance concept:

[Diagram: O-level maintenance at the CPF with an on-site spares store and on-site maintainer(s); an I-level maintenance facility (maintainer(s) at Geraldton) with its own spares store, local and remote S/W development and support (remote support only); and a D-level maintenance facility at the OEM/contractor with its maintainer(s). Arrows trace faulty LRUs flowing outward for repair and repaired LRUs returning, alongside personnel, facility, spares, and S/W support relationships.]

Figure 14-13. Support and Maintenance Concept

14.1.1 Support Concept

The MCCS in-service support concept can be summarized as follows:

1. MCCS LRU spares, spare repairs, and support are provided by the construction contractor for one year after product hand-over.
2. The on-site maintainer removes and replaces LRUs to restore the MCCS.
3. Suspect or faulty LRUs are sent to I-level for repair or replacement.
4. Spare LRUs are procured and used at the I-level for simple repairs.
5. Simple LRU repairs are done at the I-level, while complex or expensive repairs are handled by contractors under warranty or maintenance contracts.
6. Repaired/returned LRUs are returned via the I-level, where they are configured and tested in a realistic environment, using the required test equipment, before being returned to the spares store.
7. COTS equipment is "refreshed" at periods chosen to maintain support contracts for the equipment (e.g., ITE equipment typically has contracted support for only up to 5 years).


14.1.1.1 On-site Support

On-site support includes the following:

- All maintenance actions must be done in coordination with telescope operators and operations.
- Technical personnel with the ability to remove/replace LRUs are on call to restore the MCCS to full capability.
- Faulty equipment that operates in a redundant manner is removed and replaced at the next convenient opportunity.
- Faulty equipment without redundancy is replaced immediately to limit the impact on operations.

On-site MCCS consoles are used to display MCCS GUIs from on-site locations to allow the maintainer, with proper authentication, to “take over” the MCCS. These consoles allow maintainers to execute MCCS and MCCS LRU-level BIST as required. The following consoles are envisioned:

o A CPF-based console that is permanently installed in the MCCS Processing Rack (or a laptop plugged into a network port and placed on a retractable shelf) to view status and optionally control MCCS and its LRUs for investigation, diagnosis, and BIST execution.

o A “maintenance console” in the on-site operations room where on-site maintainers have immediate access to view status and optionally control MCCS without having to enter the screened room.

o A laptop-based console for remote maintainers to access and respond to maintenance events and queries.

14.1.1.2 Remote and Off-site Support

Remote off-site support personnel can use the SKAO high-speed communication network to remotely connect to the MCCS. Remote users are thereby capable of performing the following operations:

- Powering down/up the entire MCCS system.
- Power-cycling LRUs that are showing anomalous behaviour to correct the problem.
- Installing software and firmware upgrades.
- "Logging in" to the system via console GUIs to troubleshoot problems.
- Viewing snapshots of output data products, captured prior to output.
- Viewing other data products and statistics from monitor points.

Additionally, off-site support personnel at the MCCS I-Level support depot provide services to 1) repair LRUs that have been removed and replaced at the O-Level, 2) configure LRUs that have been replaced or sent away for repairs, and 3) verify LRUs before they are sent back to the on-site or near-to-site spares storage facility.
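A minimal sketch of the remote power-cycle operation listed above, assuming a managed PDU with per-outlet switching. The PduClient class is a placeholder for whatever vendor API or SNMP interface the deployed PDUs actually expose; the hostname is invented.

```python
# Hypothetical remote power-cycle of a misbehaving LRU through a managed
# PDU. PduClient is a stand-in; a real implementation would issue an SNMP
# set or a vendor REST call with appropriate authentication.
import time

class PduClient:
    def __init__(self, host: str):
        self.host = host

    def set_outlet(self, outlet: int, on: bool) -> None:
        # Placeholder: print instead of talking to real hardware.
        print(f"{self.host}: outlet {outlet} -> {'ON' if on else 'OFF'}")

def power_cycle(pdu: PduClient, outlet: int, off_seconds: int = 10) -> None:
    """Turn an outlet off, wait for residual charge to drain, turn it back on."""
    pdu.set_outlet(outlet, on=False)
    time.sleep(off_seconds)
    pdu.set_outlet(outlet, on=True)

power_cycle(PduClient("pdu-rack1.cpf.example"), outlet=7)
```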

14.1.2 Corrective Maintenance

Corrective maintenance tasks performed on-site by the MCCS maintainer include:

1. The MCCS maintainer coordinates with the operations personnel to detect, isolate, and repair random, unpredictable failures.
2. The MCCS maintainer removes and replaces LRUs and/or performs the necessary corrective actions to restore the Sub-element to operation.


14.1.3 Predictive Maintenance

Predictive maintenance tasks performed on-site include:

1. Periodically reviewing collected diagnostic information to detect equipment and components nearing end-of-life conditions (e.g., laser transceiver devices, disk drives, and cooling fans typically have end-of-life indicators).
2. Performing activities to restore these devices only as needed to ensure the equipment's continued operation; for example, transceivers indicating degraded performance are replaced.
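Item 1 amounts to a threshold scan over collected monitoring points. The sketch below illustrates one possible shape of such a scan; the metric names, thresholds, and data layout are invented for illustration, as the real values would come from the FRACAS/monitoring database.

```python
# Hypothetical end-of-life scan over collected monitoring points.
EOL_THRESHOLDS = {
    "transceiver_tx_power_dbm": ("min", -6.0),   # flag if below
    "fan_speed_rpm":            ("min", 4000.0),
    "disk_reallocated_sectors": ("max", 50.0),   # flag if above
}

monitor_points = [
    {"lru": "server-12", "metric": "transceiver_tx_power_dbm", "value": -6.8},
    {"lru": "server-12", "metric": "fan_speed_rpm", "value": 7100.0},
    {"lru": "server-03", "metric": "disk_reallocated_sectors", "value": 112.0},
]

def near_end_of_life(points, thresholds):
    """Yield (lru, metric, value) tuples that breach an EOL threshold."""
    for p in points:
        rule = thresholds.get(p["metric"])
        if rule is None:
            continue
        kind, limit = rule
        if (kind == "min" and p["value"] < limit) or \
           (kind == "max" and p["value"] > limit):
            yield p["lru"], p["metric"], p["value"]

for lru, metric, value in near_end_of_life(monitor_points, EOL_THRESHOLDS):
    print(f"schedule replacement: {lru} ({metric} = {value})")
```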

14.1.4 Preventative & Scheduled Maintenance

Preventative maintenance tasks performed on-site by the MCCS maintainer include:

1. Perform data backups for recovery purposes.
2. Manage storage space on data storage devices.
3. Manage operating system patching/updates as required.
4. Power-cycle equipment to recover from, and prevent the build-up of, conditions (e.g. operating system memory leaks) that can cause unpredicted failures.
5. On a scheduled basis, remove and replace consumables that have a limited operational lifetime (e.g., batteries, filters, and cooling fans).

The failure rate of moving parts such as bearings (contained in fans and disks) increases over time (Weibull distribution). The reliability of the Sub-element may be improved if these items are refreshed well before their predicted failure dates.
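The Weibull point can be made concrete: for shape parameter beta > 1 the hazard rate h(t) = (beta/eta) * (t/eta)^(beta-1) rises with age, so replacing wear-out items early trades a small parts cost against a growing failure rate. A short sketch with purely illustrative parameters (not measured fan or disk data):

```python
# Weibull hazard-rate illustration for wear-out items such as fans.
# beta > 1 gives an increasing failure rate; eta is the characteristic
# life. Both values below are illustrative assumptions.
beta = 2.5       # shape (wear-out regime when > 1)
eta = 60000.0    # characteristic life in hours (illustrative)

def hazard(t: float) -> float:
    """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)**(beta-1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

for hours in (10000, 30000, 50000):
    print(f"h({hours} h) = {hazard(hours):.2e} failures/h")
# The rising values show why refreshing such items well before their
# predicted failure dates improves sub-element reliability.
```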

14.1.5 COTS Refresh Cycle

COTS products (hardware and software) are procured with warranty and maintenance support service agreements. Since COTS service agreements typically run up to 5 years, at least two COTS refresh cycles are anticipated during construction and 10 years of operations. The COTS hardware refresh activity is expected to require some application software re-hosting, integration, and verification effort.

14.2 Spares

The MCCS spares list for year 1 after acceptance is provided below. The number of spares is calculated using a 95% level of confidence in the reliability prediction.

Table 14-17. Spares list derived from reliability prediction models

PBS Level  Name                              #    MTBF Estimate  MTTR Estimate  Repairable?  Spares for 1 year
2          Cabinet Chassis                   4    -              -              Y            0
3          100G Ethernet Switch (32 ports)   8    160000         55             N            1
3          Cabinet 1G Ethernet Switch        4    160000         55             N            0
3          Cabinet PDU                       4    1000000        30             N            0
3          Compute Server                    68   35000          90             Y            0
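A standard way to turn MTBF figures into a spares count at a 95% confidence level is Poisson provisioning: stock the smallest s such that P(X <= s) >= 0.95, where X is the number of fleet failures expected over the support interval. The sketch below applies that textbook model to the switch line of Table 14-17. Note that with these inputs the model yields 2 rather than the table's 1, so the document's own calculation evidently folds in assumptions (e.g. repair turnaround or in-year contractor replenishment) that are not spelled out here.

```python
import math

def spares_required(units: int, mtbf_hours: float,
                    interval_hours: float, confidence: float = 0.95) -> int:
    """Smallest stock s with P(Poisson(lam) <= s) >= confidence, where
    lam is the expected fleet failures over the support interval."""
    lam = units * interval_hours / mtbf_hours
    cumulative = math.exp(-lam)   # P(X = 0)
    term = cumulative
    s = 0
    while cumulative < confidence:
        s += 1
        term *= lam / s           # P(X = s) from P(X = s - 1)
        cumulative += term
    return s

# 100G Ethernet switch line of Table 14-17: 8 units, MTBF 160000 h,
# one year of operation (8760 h).
print(spares_required(units=8, mtbf_hours=160000, interval_hours=8760))
```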



14.2.1 COTS Equipment Repair

COTS equipment is procured with supporting maintenance contracts. Repaired or replaced LRUs are re-configured and tested under realistic conditions to confirm the repair.

14.2.2 Bespoke Equipment Repair

MCCS does not contain any bespoke equipment.

14.2.3 Consumables

The UPS battery and the compute node motherboard batteries in the Control Processing Rack are considered consumable items that need to be replaced every 5 years (in the case of the UPS, if a UPS-unit refresh is not otherwise performed).

14.3 Support Organization

The SKA organization provides the following MCCS-related support resources and services:

1. Maintains technical data under configuration control.
2. Manages the deployed configuration.
3. Provides an O-level maintainer for MCCS on-site support.
4. Provides a spares store (and staff) close to the installed MCCS.
5. Provides the I-level depot and maintainer for MCCS equipment repairs, configuration, verification, and software/firmware development/upgrade support.
6. Provides high-speed communications infrastructure for remote/off-site support.
7. Provides a Failure Reporting, Analysis and Corrective Actions System (FRACAS).
8. Provides supply management support for spares/LRU repair support.

14.3.1 Technical data

Technical data are provided to develop maintenance and support manuals, operator/user manuals, and other documents (e.g., specifications; corrective, preventive, and predictive maintenance instructions; hazardous material documentation; component lists) used to support and operate MCCS.

14.3.2 Obsolescence Management

MCCS spares and technical support must be provided for more than 10 years after delivery. Pro-active and strategic obsolescence management of MCCS equipment is performed, with plans put in place to identify and mitigate the risk of parts, spares, equipment, standards, processes, skills (people), software, etc. becoming unavailable or obsolete. The approach is in accordance with IEC/EN/BS 62402 (Obsolescence management – Application guide) or an equivalent standard. Pro-active and strategic obsolescence management tasks include:

1. Performing long-term availability checks on strategic MCCS equipment and components.
2. Performing analysis of the MCCS BOM at defined intervals:
   a. Availability forecasts.
   b. Identification of at-risk COTS or component items.
3. Contracting COTS and critical component suppliers to provide end-of-life (EOL) notices.
4. Performing last-time and life-time buys of critical equipment and component spares.


5. Procuring support and maintenance contracts for COTS equipment and performing technology refresh activities at the end of support life.

Specifically, MCCS contains complex and expensive critical components such as the GPUs. Component vendors are requested to provide a technology roadmap to assess whether critical components will remain available for the lifetime (10 years) of MCCS.
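The BOM analysis in task 2 can be sketched as a scan of part records against announced EOL dates. All records and dates below are invented; a real scan would query supplier EOL notices held in the obsolescence database.

```python
# Hypothetical BOM availability scan (obsolescence task 2).
from datetime import date
from typing import Optional

bom = [
    {"part": "100G switch",     "eol": date(2027, 6, 30)},
    {"part": "GPU accelerator", "eol": date(2024, 12, 31)},
    {"part": "server chassis",  "eol": None},  # no EOL notice received
]

def at_risk(bom_items, horizon_years: int = 3,
            today: Optional[date] = None):
    """Flag parts whose announced EOL falls inside the planning horizon."""
    today = today or date.today()
    horizon = today.replace(year=today.year + horizon_years)
    return [i["part"] for i in bom_items
            if i["eol"] is not None and i["eol"] <= horizon]

print("last-time-buy candidates:", at_risk(bom, today=date(2023, 1, 1)))
```

Parts flagged by such a scan feed tasks 3 and 4: chasing supplier EOL notices and planning last-time or life-time buys.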

14.4 Support Facilities and Equipment

MCCS support requires an I-level support facility and a spares store.

14.4.1 Spares Storage

MCCS requires a storage facility for MCCS spares and consumables. The near-site spares storage space is specified as a weather-protected but not temperature-controlled location, with no direct sunlight or water ingress; moisture or frost may, however, condense on items stored within it. MCCS equipment requiring a special local environment and/or handling is sealed in ESD and moisture barrier bags (containing a desiccant) and packaged in padded boxes for storage.

14.4.2 Intermediate Level Support Facility

An intermediate level (I-level) support facility is used to repair, configure, test, and upgrade MCCS equipment as required. The System ITF could be re-used as the MCCS I-level facility.

14.4.3 Support Test Equipment

The support facility at the I-level requires a test environment to check that:

1. Suspect equipment removed at the O-level indeed requires repair.
2. Repaired equipment is configured correctly.
3. Software defects have been fixed and have a low probability of introducing side effects when installed in the on-line system.

14.4.4 Computer Resources

Computer resources are required at the I-level for MCCS software support tasks. The SKAO or an external contractor manages and maintains the necessary computer resources, development tools, and required licenses.

14.4.5 Accessibility and Security

The SKAO manages personnel access to the telescope's physical and electronic infrastructure. Individuals such as the MCCS maintainer and remote support personnel are authorized physical and electronic access at the specific levels required to perform their tasks.

Document No.:Revision:Date:

Error: Reference source not foundError: Reference source not foundError: Reference source not found

Error: Reference source notfound

Author: A. Magro et al.Page 76 of 80

14.4.6 Software Maintenance and Installation

Software maintenance is managed as follows:

1. After successfully completing the acceptance test phase (sell-off), MCCS software is checked for conformance against the "Fundamental SKA Software and Hardware Description Language Standards” [SKA-TEL-SKO-0000661] and placed under formal Configuration Management (CM).

2. All system/software issues are raised and tracked. Subsequent software bugs, bug fixes, and OS upgrades are formally managed:

a. Changes are approved by an SKAO Change Control Board (CCB).
b. Software releases are checked for conformance to the "Fundamental SKA Software and Hardware Description Language Standards" [SKA-TEL-SKO-0000661].
c. The software release is regression tested off-line as much as possible.
d. The software release is regression tested on-line; when deployed, a key set of on-sky regression tests are run and the results analysed – all the way through to image formation and quality analysis – before being used for normal science observations.
e. On-sky regression test failure is planned for, with previous-version recovery/backup in place if needed.

3. MCCS COTS equipment has both hardware and software updates during the telescope operational life. The updates are tested off-line prior to deployment and regression tested on-line after deployment (as required).

4. Bespoke SPS LRU software should require infrequent updates. LRUs are updated to the latest software on a reboot.

5. The software development environment (e.g. design entry tool(s), libraries, compilers, debuggers, and possibly the development platform’s hardware and operating system) is kept intact and maintained to the extent required to allow for maintenance in a reliable fashion.

It is not clear at this point what contractor or organization provides software maintenance.
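Steps 2(d) and 2(e) together form a deploy-test-rollback pattern. A minimal sketch of that pattern is shown below; every function is a placeholder for the real release tooling, and the version strings are invented.

```python
# Hypothetical deploy/verify/rollback wrapper matching the on-line
# regression step described above. deploy(), run_regression_suite() and
# rollback() stand in for the real release and test tooling.
def deploy(version: str) -> None:
    print(f"deploying {version}")

def run_regression_suite(version: str) -> bool:
    print(f"running on-sky regression tests against {version}")
    return False  # canned failure, to exercise the rollback path

def rollback(version: str) -> None:
    print(f"restoring previous release {version}")

def release(new_version: str, previous_version: str) -> bool:
    """Deploy, regression-test, and roll back on failure."""
    deploy(new_version)
    if run_regression_suite(new_version):
        print(f"{new_version} accepted for science observations")
        return True
    rollback(previous_version)
    return False

release("mccs-2.5.0", previous_version="mccs-2.4.1")
```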

14.5 Manpower and Personnel

One part-time O-level maintainer and one part-time I-level maintainer are required to support MCCS as outlined in the previous sections.

14.5.1 Operator Role

Apart from maintenance activities, MCCS is remotely monitored by a human operator, although it is 'operated' automatically under the control of TM. The human operator's role is envisioned as follows:

1. The (telescope) operator has a portion of a GUI display dedicated to a rolled-up view of MCCS status.

2. Faults are visually highlighted on the MCCS rolled-up status display, and audible alarms are presented if required.

3. The operator drills down in successive detail to attempt to pinpoint the issue to the LRU level and may attempt to correct the fault by power-cycling/resetting one or more affected LRUs.


4. If applicable, the operator can bring the standby FSP-UNIT on-line, thereby bypassing an FSP fault.

5. The operator has a 24/7 hot-line to maintenance personnel to report issues that cannot be corrected in such a manner.

6. On-call maintenance personnel with equipment, spares, and authorization assess the fault and do repairs as needed.

14.5.2 On-site Maintenance Support Role

The on-site maintainer performs preventative, predictive, and corrective maintenance tasks, such as removing and replacing faulty LRU equipment, to keep MCCS operational. Educational and training requirements for on-site (O-level) maintenance support personnel are:

1. Must have a qualified electronic technician education level.
2. Must have sufficient experience in computer system maintenance and repairs.
3. Must complete a training course to operate, support, and maintain MCCS.

This maintainer's primary objective is to detect and isolate faulty LRUs (corrective maintenance) and to remove and replace these to restore MCCS to full functionality. The maintainer's secondary objective is to determine what maintenance needs to be scheduled (predictive and preventative maintenance) and to coordinate and perform the required tasks when scheduled.

14.5.3 Off-site Maintenance Support Role

The off-site maintainer performs second-line equipment maintenance (I-level) functions, including LRU repair via SRU swapping and associated testing. Educational and training requirements for the MCCS I-level maintainer are:

1. Must have a qualified electronic technician education level with software specialization.
2. Must have sufficient experience in computer system maintenance, software updates, and repairs.
3. Must complete a training course to support and maintain MCCS equipment, including SRU swapping.

The off-site maintainer performs relatively simple diagnostics to pinpoint the failed items in LRUs. The maintainer configures and tests all repaired or upgraded equipment in a representative environment to verify that it is correctly configured and fully operational before it is returned to the spares store.

14.5.4 Maintainer Training and Training Support

The MCCS contractor provides manuals, training materials, and training courses to the O-level and I-level maintenance and support staff to support MCCS operations. MCCS operation training is also provided to the telescope operator, at the level of understanding what the GUI-display indicators mean, how to interpret them, and what actions to take to attempt to clear faults in the operational system. Annual refresher training sessions are provided to the operators/maintainers as required.


15 Appendix A: List of TBDs and TBCs

The following entries identify information that is currently unknown but can be resolved pending completion of work to be done during the bridging phase of the project. As this plan is developed, specific tasks and dependencies will be linked in the resolution plan entries.

Table 15-18. Table of TBDs

ID     Section  Description                                                                Resolution Plan
TBD1   4.1      Antenna transient data rate
TBD2   4.1      Antenna transient data packets per second
TBD3   4.1      Antenna transient data rate
TBD4   4.1      Calibration %CPU core utilization
TBD5   4.1      Sky model generation memory utilization
TBD6   4.1      Sky model generation memory bandwidth
TBD7   4.1      Calibration memory utilization
TBD8   4.1      Calibration memory bandwidth
TBD9   4.3      UPS average power consumption
TBD10  4.3      UPS Max TDP
TBD11  4.3      UPS Total average power consumption
TBD12  4.3      UPS Total TDP
TBD13  6        Cluster time availability
TBD14  7.1      Power allocated to MCCS
TBD15  8.3      MCCS Ai allocation
TBD16  8.3      Detection and reporting percentage of all critical failures
TBD17  8.3      Detection and reporting percentage of LRU-to-LRU and LRU-to-external-interface communication path faults
TBD18  9.5      Temperature limit for automatic shutdown
TBD19  10.3     Temperature range for cold air in front of racks
TBD20  10.3     Temperature range for cold air in front of racks
TBD21  10.3     MCCS maximum radiated emission level
TBD22  10.3.1   Radiated emission level
TBD23  10.3     Number of emitters
TBD24  10.3     Worst-case conducted emissions level
TBD25  10.3     Emissions above a single emitter


Table 15-19. Table of TBCs

ID    Section  Description                          Resolution Plan
TBC1  10.3.2   Humidity Range
TBC2  10.3.2   Maximum Dew Point
TBC3  10.3.2   Maximum Temperature Rate of Change
TBC4  10.3.2   Maximum Humidity Rate of Change
