Network Performance for ATLAS Real-Time Remote Computing Farm Study


Alberta, CERN, Cracow, Manchester, NBI

MOTIVATION

Several experiments, including ATLAS at the Large Hadron Collider (LHC) and DØ at Fermilab, have expressed interest in using remote computing farms for processing and analysing, in real time, the information from particle collision events. Different architectures have been suggested, ranging from pseudo-real-time file transfer with subsequent remote processing to the real-time requesting of individual events described here.

To test the feasibility of using remote farms for real-time processing, a collaboration was set up between members of the ATLAS Trigger/DAQ community, with support from several national research and education network operators (DARENET, Canarie, Netera, PSNC, UKERNA and DANTE), to demonstrate a proof of concept and measure end-to-end network performance. The testbed was centred at CERN and used three different types of wide-area high-speed network infrastructure to link the remote sites:
• an end-to-end lightpath (SONET circuit) to the University of Alberta in Canada
• standard Internet connectivity to the University of Manchester in the UK and the Niels Bohr Institute in Denmark
• a Virtual Private Network (VPN) composed of an MPLS tunnel over the GÉANT network and an Ethernet VPN over the PIONIER network, to IFJ PAN Krakow in Poland.

Remote Computing Concepts

[Diagram: ATLAS Detectors – Level 1 Trigger; ROBs; Level 2 Trigger (L2PUs); Event Builders (SFIs); SFOs and mass storage in the Experimental Area; Data Collection Network and Back End Network switch at CERN B513; local event processing farms (PFs); remote event processing farms (PFs) at Copenhagen, Edmonton, Krakow and Manchester, reached over GÉANT and lightpaths.]

CERN-Manchester TCP Activity

TCP/IP behaviour of the ATLAS Request-Response Application Protocol observed with Web100

64 Byte Request in Green; 1 Mbyte Response in Blue. TCP in Slow Start takes 19 round trips, or ~380 ms.

TCP Congestion window in Red. This is reset by TCP on each Request due to the lack of data sent by the application over the network. TCP obeys RFC 2581 & RFC 2861.
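Web100 reads these per-connection variables from an instrumented kernel. As a rough modern stand-in (not part of the original study), a sketch like the following can sample the sender's congestion window and RTT around each request, assuming Linux and the struct tcp_info field layout from include/uapi/linux/tcp.h:

    import socket, struct

    def cwnd_and_rtt(sock):
        # Linux struct tcp_info: 8 one-byte fields, then 32-bit counters.
        # Among the 32-bit fields, tcpi_rtt is index 15 and tcpi_snd_cwnd
        # is index 18 (offsets per include/uapi/linux/tcp.h).
        raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
        u32 = struct.unpack_from("24I", raw, 8)
        return u32[18], u32[15]  # (cwnd in segments, smoothed RTT in us)

Sampling this before and after each 1 Mbyte response would show the CurCwnd collapse described above.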

[Plots, Web100 time series: DataBytesOut and DataBytesIn (deltas) vs. time (ms); DataBytesOut and DataBytesIn (deltas) with CurCwnd (value) vs. time (ms); PktsOut and PktsIn (deltas) with CurCwnd (value) vs. time (ms).]

Observation of the Status of Standard TCP with Web100

Observation of TCP with no Congestion window reduction

TCP Congestion window in Red grows nicely. Request-response takes 2 RTT after 1.5 s. Rate ~10 events/s with 50 ms processing time.

The achievable transfer throughput grows to 800 Mbit/s. Data is transferred when the application requires the data.
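The "no Congestion window reduction" behaviour above was obtained by modifying the sender's TCP stack. On a stock modern Linux sender a comparable effect (not the study's actual modification) comes from disabling the RFC 2861 cwnd decay after application-limited idle periods:

    # Disable TCP slow start after idle, system-wide (requires root).
    # This stops the sender collapsing cwnd between request-response bursts.
    with open("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w") as f:
        f.write("0")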

[Plot: TCP achievable throughput (Mbit/s) with Cwnd vs. time (ms); annotations mark responses taking 3 Round Trips, then 2 Round Trips.]

The ATLAS Application Protocol

[Sequence diagram between the Event Filter (EFD) and the SFI/SFO, time running downward: Request event → Send event data → Process event → Request Buffer → Send OK → Send processed event; an inset histogram shows the Request-Response time.]

Event Request: the EFD requests an event from the SFI; the SFI replies with the event data.

Processing of the event occurs

Return of Computation: the EF asks the SFO for buffer space; the SFO sends OK; the EF transfers the results of the computation.
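As a concrete illustration of this event-request cycle, here is a toy client loop; the host names, ports, and the fixed 64-byte/1-Mbyte framing are illustrative assumptions, not the actual EFD/SFI wire protocol:

    import socket, time

    EVENT_SIZE = 1_048_576  # 1 Mbyte event, as in the measurements here

    def recv_exact(sock, n):
        """Read exactly n bytes from a stream socket."""
        buf = bytearray()
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed")
            buf += chunk
        return bytes(buf)

    sfi = socket.create_connection(("sfi.example.org", 9000))  # event source
    sfo = socket.create_connection(("sfo.example.org", 9001))  # event output

    while True:
        sfi.sendall(b"REQ".ljust(64))        # 64-byte event request
        event = recv_exact(sfi, EVENT_SIZE)  # 1-Mbyte event data
        time.sleep(0.050)                    # stand-in for 50 ms of processing
        sfo.sendall(b"BUF".ljust(64))        # ask the SFO for buffer space
        recv_exact(sfo, 2)                   # wait for the "OK"
        sfo.sendall(event)                   # return the processed event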

CERN-Alberta TCP Activity

64 Byte Request in Green; 1 Mbyte Response in Blue. TCP in Slow Start takes 12 round trips, or ~1.67 s.

Observation of TCP with no Congestion window reduction, with Web100

TCP Congestion window in Red grows gradually after slow start. Request-response takes 2 RTT after ~2.5 s. Rate ~2.2 events/s with 50 ms processing time.

The achievable transfer throughput grows from 250 to 800 Mbit/s.
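A back-of-envelope check of the numbers above (illustrative arithmetic, not from the original poster): if slow start needed 12 round trips and took ~1.67 s, the path round-trip time is roughly

    # 12 slow-start round trips in ~1.67 s implies a CERN-Alberta RTT of
    print(1.67 / 12)  # ~0.14 s, i.e. ~140 ms per round trip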

[Plots, Web100 time series: DataBytesOut and DataBytesIn (deltas) vs. time; PktsOut and PktsIn (deltas) with CurCwnd (value) vs. time (ms); TCP achievable throughput (Mbit/s) with Cwnd vs. time (ms); annotation marks 2 Round Trips.]

Principal partners

Web100 parameters on the server located at CERN (data source)

Green – small requests; Blue – big responses. TCP ACK packets are also counted (in each direction). One response = 1 MB ≈ 380 packets.

64 Byte Request, 1 Mbyte Response

CERN-Krakow TCP Activity

Steady-state request-response latency ~140 ms. Rate ~7.2 events/s. The first event takes 600 ms due to TCP slow start.
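The quoted rate is consistent with the quoted latency (a worked check, not from the original poster):

    # Steady-state event rate ≈ 1 / request-response latency
    print(1 / 0.140)  # ≈ 7.1 events/s, matching the ~7.2 events/s quoted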