TRANSCRIPT
The Science DMZ: A Network Design Pattern for Data-Intensive Science Jason Zurawski – [email protected] Science Engagement Engineer, ESnet Lawrence Berkeley National Laboratory
KINBER Webinar March 4th 2015
Overview
• ESnet Overview
• Science DMZ Motivation and Introduction
• Science DMZ Architecture
• Network Monitoring
• Data Transfer Nodes & Applications
• Science DMZ Security
• User Engagement
• Wrap Up
2 – ESnet Science Engagement ([email protected]) - 3/3/15 © 2015, Energy Sciences Network
ESnet at a Glance
• High-speed national network, optimized for DOE science missions:
  – connecting 40 labs, plants and facilities with >100 networks (national and international)
  – $32.6M in FY14, 42 FTE
  – older than the commercial Internet, growing twice as fast
• $62M ARRA grant in 2009/2010 for 100G upgrade:
  – transition to a new era of optical networking
  – world's first 100G network at continental scale
• Culture of urgency:
  – 4 awards in past 3 years
  – R&D100 Award in FY13
  – "5 out of 5" for customer satisfaction in last review
  – Dedicated staff to support the mission of science
SC Supports Research at More than 300 Institutions Across the U.S.
The Office of Science supports:
• 27,000 Ph.D.s, graduate students, undergraduates, engineers, and technicians
• 26,000 users of open-access facilities
• 300 leading academic institutions
• 17 DOE laboratories
Network as Infrastructure Instrument
ESnet Vision: Scientific progress will be completely unconstrained by the physical location of instruments, people, computational resources, or data.
• Science and researchers are everywhere – size of school/endowment does not matter – there is a researcher at your facility right now attempting to use the network for a research activity
• Networks are an essential part of data-intensive science
  – Connect data sources to data analysis
  – Connect collaborators to each other
  – Enable machine-consumable interfaces to data and analysis resources (e.g. portals), automation, scale
• Performance is critical
  – Exponential data growth
  – Constant human factors (timelines for analysis, remote users)
  – Data movement and analysis must keep up
• Effective use of wide area (long-haul) networks by scientists has historically been difficult (the "Wizard Gap")
Motivation
Traditional "Big Science"
Big Science Now Comes in Small Packages …
…and is happening on your campus. Guaranteed.
Understanding Data Trends
[Chart: Data Scale (10 GB to 100 PB) vs. Collaboration Scale]
• Small collaboration scale, e.g. light and neutron sources
• Medium collaboration scale, e.g. HPC codes
• Large collaboration scale, e.g. LHC
A few large collaborations have internal software and networking organizations
Data Mobility in a Given Time Interval (Theoretical)
These tables are available at: http://fasterdata.es.net/fasterdata-home/requirements-and-expectations/
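The tables boil down to simple arithmetic: required rate = data volume × 8 / time. A minimal sketch (assuming decimal units, i.e. 1 PB = 10^15 bytes):

```python
def required_gbps(data_bytes, hours):
    """Average network rate (Gbit/s) needed to move data_bytes in `hours`."""
    return data_bytes * 8 / (hours * 3600) / 1e9

# Moving 1 PB in a day needs a sustained ~93 Gbit/s;
# 10 TB overnight (8 hours) needs ~2.8 Gbit/s
print(round(required_gbps(1e15, 24), 1))
print(round(required_gbps(10e12, 8), 1))
```

Note these are sustained average rates; any loss, congestion, or tool inefficiency pushes the required peak rate higher.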
The Central Role of the Network
• The very structure of modern science assumes science networks exist: high performance, feature rich, global scope
• What is "The Network" anyway?
  – "The Network" is the set of devices and applications involved in the use of a remote resource
    • This is not about supercomputer interconnects
    • This is about data flow from experiment to analysis, between facilities, etc.
  – User interfaces for "The Network" – portal, data transfer tool, workflow engine
  – Therefore, servers and applications must also be considered
• What is important? Ordered list:
  1. Correctness
  2. Consistency
  3. Performance
TCP – Ubiquitous and Fragile
• Networks provide connectivity between hosts – how do hosts see the network?
  – From an application's perspective, the interface to "the other end" is a socket
  – Communication is between applications – mostly over TCP
• TCP – the fragile workhorse
  – TCP is (for very good reasons) timid – packet loss is interpreted as congestion
  – Packet loss in conjunction with latency is a performance killer
    • We can address the first; science hasn't fixed the second (yet)
  – Like it or not, TCP is used for the vast majority of data transfer applications (more than 95% of ESnet traffic is TCP)
A small amount of packet loss makes a huge difference in TCP performance
[Chart: throughput vs. distance – Local (LAN), Metro Area, Regional, Continental, International – comparing Measured (TCP Reno), Measured (HTCP), Theoretical (TCP Reno), and Measured (no loss)]
With loss, high performance beyond metro distances is essentially impossible
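The "Theoretical (TCP Reno)" curve on this kind of chart follows the well-known Mathis et al. throughput bound, rate ≤ MSS / (RTT · √loss). A quick sketch of why loss plus latency is a performance killer (the 0.0046% loss rate and the RTT values here are illustrative assumptions, not measurements from the slide):

```python
import math

def mathis_gbps(mss_bytes, rtt_s, loss_rate):
    """Mathis et al. upper bound on single-stream TCP Reno throughput."""
    return mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate)) / 1e9  # Gbit/s

# Same fixed loss rate (0.0046%) and 1500-byte MSS, with RTTs roughly
# matching the slide's distance categories (illustrative values)
for label, rtt in [("LAN, 1 ms", 0.001),
                   ("Metro, 10 ms", 0.010),
                   ("Continental, 88 ms", 0.088)]:
    print(f"{label:>20}: {mathis_gbps(1500, rtt, 0.000046):6.3f} Gbit/s")
```

The bound falls linearly with RTT: the same tiny loss rate that still permits gigabit speeds on a LAN caps a continental path at tens of megabits per second.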
Let's Talk Performance …
"In any large system, there is always something broken." – Jon Postel
• Modern networks are occasionally designed to be one-size-fits-most
  – e.g. if you have ever heard the phrase "converged network", the design is meant to facilitate CIA (Confidentiality, Integrity, Availability)
  – This is not bad for protecting the HVAC system from hackers.
• Causes of friction/packet loss:
  – Small buffers on the network gear and hosts
  – Incorrect application choice
  – Packet disruption caused by overzealous security
  – Congestion from herds of mice
• It all starts with knowing your users, and knowing your network
Putting A Solution Together
• Effective support for TCP-based data transfer
  – Design for correct, consistent, high-performance operation
  – Design for ease of troubleshooting
• Easy adoption (for all stakeholders) is critical
  – Large laboratories and universities have extensive IT deployments
  – Small universities/facilities have overworked/understaffed IT departments
  – Drastic change is prohibitively difficult
• Cybersecurity – defensible without compromising performance
• Borrow ideas from traditional network security
  – Traditional DMZ
    • Separate enclave at network perimeter ("Demilitarized Zone")
    • Specific location for external-facing services
    • Clean separation from internal network
  – Do the same thing for science – Science DMZ
The Science DMZ Superfecta
• Data Transfer Node (dedicated systems for data transfer)
  – High performance
  – Configured for data transfer
  – Proper tools
• perfSONAR (performance testing & measurement)
  – Enables fault isolation
  – Verify correct operation
  – Widely deployed in ESnet and other networks, as well as sites and facilities
• Science DMZ (network architecture)
  – Dedicated location for DTN
  – Proper security
  – Easy to deploy – no need to redesign the whole network
• Engagement (with network users)
  – Partnerships
  – Education & Consulting
  – Resources & Knowledgebase
Abstract Deployment
• Simplest approach: add-on to existing network infrastructure
  – All that is required is a port on the border router
  – Small footprint, pre-production commitment
• Easy to experiment with components and technologies
  – DTN prototyping
  – perfSONAR testing
• Limited scope makes security policy exceptions easy
  – Only allow traffic from partners (use ACLs)
  – Add-on to production infrastructure – lower risk
  – Identify applications that are running (e.g. the DTN is not a general-purpose machine – it does data transfer, and data transfer only)
• Start with a single user/use case. If it works for them in a pilot, you can expand
Local And Wide Area Data Flows
[Diagram: a Border Router connects to the WAN over a clean, high-bandwidth WAN path; the Science DMZ switch/router hangs off it, with per-service security policy control points; a high-performance Data Transfer Node with high-speed storage and a perfSONAR host sit in the Science DMZ; the site/campus LAN sits behind a separate Enterprise Border Router/Firewall, with site/campus access to Science DMZ resources; perfSONAR hosts test both the high-latency WAN path and the low-latency LAN path; links are 10GE]
Non-R1 Campus
• This paradigm is not just for the big guys – there is a lot of value for smaller institutions with a smaller number of users
• Can be constructed with existing hardware, or small additions
  – Does not need to be 100G, or even 10G. Capacity doesn't matter – we want to eliminate friction and packet loss
  – The best way to do this is to isolate the important traffic from the enterprise
• Can be scoped to either the expected data volume of the science, or the availability of external-facing resources (e.g. if the pipe to KINBER/3ROX/MAGPI is small, you don't want a single user monopolizing it)
• Factors:
  – Are you comfortable with Layer 2 networking?
  – How rich is your cable/fiber plant?
  – Can you create a dedicated facility for science?
Non-R1 Campus: Fiber-Rich Environment
Non-‐R1 Campus
22 – ESnet Science Engagement ([email protected]) - 3/3/15 © 2015, Energy Sciences Network
Layer 2 Switching
Non-R1 Campus: Single Facility
Non-R1 Campus
• Every campus will be different
  – If you are not fiber rich, other choices may be needed.
  – If the researchers don't want to move to a dedicated facility, your options are also limited
  – Have discussions – lay out what is possible and what is not
• ROI statements:
  – Eliminate congestion where you can – the network path for the science user does not traverse the core -> better performance for her, and everyone else
  – Improve the process of science – the next time they go for an NSF/DOE/NIST grant, they can say (with confidence) that the network does what they need it to do
  – Encourage others that are suffering in silence to seek you out. Once you have a success story, there will be others asking about it.
Performance Monitoring
• Everything may function perfectly when it is deployed
• Eventually something is going to break
  – Networks and systems are complex
  – Bugs, mistakes, …
  – Sometimes things just break – this is why we buy support contracts
• Must be able to find and fix problems when they occur (even if they have been that way for a long time)
• Must be able to find problems in other networks (your network may be fine, but someone else's problem can impact your users)
• TCP was intentionally designed to hide all transmission errors from the user:
  – "As long as the TCPs continue to function properly and the internet system does not become completely partitioned, no transmission errors will affect the users." (From RFC 793, 1981)
Soft Network Failures – Hidden Problems
• Hard failures are well-understood
  – Link down, system crash, software crash
  – Traditional network/system monitoring tools are designed to quickly find hard failures
• Soft failures result in degraded capability
  – Connectivity exists
  – Performance impacted
  – Typically something in the path is functioning, but not well
• Soft failures are hard to detect with traditional methods
  – No obvious single event
  – Sometimes no indication at all of any errors
• Independent testing is the only way to reliably find soft failures
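As an illustration of what "independent testing" means – actively pushing known traffic and measuring what arrives, rather than waiting for user complaints – here is a toy single-stream memory-to-memory throughput probe. It runs over loopback only; real deployments use perfSONAR's measurement tools rather than a hand-rolled probe like this:

```python
import socket
import threading
import time

def _sink(server_sock):
    """Accept one connection and discard everything the sender pushes."""
    conn, _ = server_sock.accept()
    while conn.recv(65536):
        pass
    conn.close()

def memory_to_memory_gbps(total_mb=256):
    """Rough single-stream TCP throughput between two local sockets."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    threading.Thread(target=_sink, args=(srv,), daemon=True).start()

    snd = socket.create_connection(srv.getsockname())
    chunk = b"\x00" * 65536
    start = time.perf_counter()
    for _ in range(total_mb * 1024 * 1024 // len(chunk)):
        snd.sendall(chunk)
    elapsed = time.perf_counter() - start
    snd.close()
    srv.close()
    return total_mb * 8 / elapsed / 1000  # Mbit/s -> Gbit/s

print(f"loopback: {memory_to_memory_gbps():.2f} Gbit/s")
```

Run regularly between fixed endpoints and archived, even a crude measurement like this exposes the gradual throughput decay that characterizes soft failures.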
Sample Soft Failures
[Throughput charts (Gb/s) over roughly one month, annotated: "rebooted router with full route table"; "gradual failure of optical line card" showing normal performance, degrading performance, and repair]
Testing Infrastructure – perfSONAR
• perfSONAR is:
  – A widely-deployed test and measurement infrastructure
    • ESnet, Internet2, US regional networks, international networks
    • Laboratories, supercomputer centers, universities
  – A suite of test and measurement tools
  – A collaboration that builds and maintains the toolkit
• By installing perfSONAR, a site can leverage over 1300 test servers deployed around the world
• perfSONAR is ideal for finding soft failures
  – Alert to existence of problems
  – Fault isolation
  – Verification of correct operation
• Open source, widely supported by a number of stakeholder organizations
Lookup Service Directory Search: http://stats.es.net/ServicesDirectory/
perfSONAR Dashboard: http://ps-dashboard.es.net
Dedicated Systems – Data Transfer Node
• The DTN is dedicated to data transfer
• Set up specifically for high-performance data movement
  – System internals (BIOS, firmware, interrupts, etc.)
  – Network stack
  – Storage (global filesystem, Fibre Channel, local RAID, etc.)
  – High performance tools
  – No extraneous software
• Limitation of scope and function is powerful
  – No conflicts with configuration for other tasks
  – Small application set makes cybersecurity easier
Data Transfer Tool Comparison
• In addition to the network, using the right data transfer tool is critical
• Data transfer test from Berkeley, CA to Argonne, IL (near Chicago). RTT = 53 ms, network capacity = 10 Gbps.

Tool – Throughput:
• scp: 140 Mbps
• HPN-patched scp: 1.2 Gbps
• ftp: 1.4 Gbps
• GridFTP, 4 streams: 5.4 Gbps
• GridFTP, 8 streams: 6.6 Gbps

Note that to get more than 1 Gbps (125 MB/s) disk-to-disk requires properly engineered storage (RAID, parallel filesystem, etc.)
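Translating the measured rates into wall-clock time makes the tool gap concrete. A small sketch (decimal TB; sustained rates taken from the measurements above):

```python
# Measured single-tool throughput from the Berkeley -> Argonne test
rates_gbps = {
    "scp": 0.14,
    "HPN-patched scp": 1.2,
    "GridFTP, 4 streams": 5.4,
    "GridFTP, 8 streams": 6.6,
}

def hours_per_tb(gbps):
    """Wall-clock hours to move 1 decimal TB (8000 Gbit) at a sustained rate."""
    return 8000 / gbps / 3600

for tool, rate in rates_gbps.items():
    print(f"{tool:>20}: {hours_per_tb(rate):5.2f} h per TB")
```

At these rates, a terabyte takes the better part of a workday over stock scp but well under half an hour with parallel GridFTP streams.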
Science DMZ Security
• Goal – disentangle security policy and enforcement for science flows from security for business systems
• Rationale
  – Science data traffic is simple from a security perspective
  – Narrow application set on the Science DMZ
    • Data transfer, data streaming packages
    • No printers, document readers, web browsers, building control systems, financial databases, staff desktops, etc.
  – Security controls that are typically implemented to protect business resources often cause performance problems
• Separation allows each to be optimized
Performance Is A Core Requirement
• Core information security principles
  – Confidentiality, Integrity, Availability (CIA)
  – Often, CIA and risk mitigation result in poor performance
• In data-intensive science, performance is an additional core mission requirement: CIA → PICA
  – CIA principles are important, but if performance is compromised the science mission fails
  – It is not about "how much" security you have, but how the security is implemented
  – Need a way to appropriately secure systems without performance compromises
• Collaboration within the organization
  – All parties (users, operators, security, administration) need to sign off on this idea – revolutionary vs. evolutionary change.
  – Make sure everyone understands the ROI potential.
Security Without Firewalls
• Data-intensive science traffic interacts poorly with firewalls
• Does this mean we ignore security? NO!
  – We must protect our systems
  – We just need to find a way to do security that does not prevent us from getting the science done
• Key point – security policies and mechanisms that protect the Science DMZ should be implemented so that they do not compromise performance
• Traffic permitted by policy should not experience a performance impact as a result of the application of policy
Firewall Performance Example
• Observed performance, via perfSONAR, through a firewall: almost 20 times slower
• Observed performance, via perfSONAR, bypassing the firewall: huge improvement
Challenges to Network Adoption
• Causes of performance issues are complicated for users.
• Lack of communication and collaboration between the CIO's office and researchers on campus.
• Lack of IT expertise within a science collaboration or experimental facility.
• Users' performance expectations are low ("The network is too slow", "I tried it and it didn't work").
• Cultural change is hard ("we've always shipped disks!").
• Scientists want to do science, not IT support.
The Capability Gap
Bridging the Gap
• Implementing technology is 'easy' in the grand scheme of assisting with science
• Adoption of technology is different
  – Does your cosmologist care what SDN is?
  – Does your cosmologist want to get data from Chile each night so that they can start the next day without having to struggle with the tyranny of ineffective data movement strategies that involve airplanes and white/brown trucks?
The Golden Spike
• We don't want scientists to have to build their own networks
• Engineers don't have to understand what a tokamak accomplishes
• Meeting in the middle is the process of science engagement:
  – Engineering staff learning enough about the process of science to be helpful in how to adopt technology
  – Science staff having an open mind to better use what is out there
Establishing Requirements
http://www.es.net/about/science-requirements/network-requirements-reviews/
The purpose of these reviews is to accurately characterize the near-term, medium-term, and long-term network requirements of the science conducted by each program office. The reviews attempt to bring about a network-centric understanding of the science process used by the researchers and scientists, in order to derive network requirements. We have found this to be an effective method for determining network requirements for ESnet's customer base.
Why Build A Science DMZ Though?
• What we know about scientific network use:
  – Machine size decreasing, accuracy increasing
  – HPC resources more widely available – and potentially distributed from where the scientists are
  – WAN networking speeds now at 100G, MAN approaching, LAN as well
• Value Proposition:
  – If scientists can't use the network to the fullest potential due to local policy constraints or bottlenecks, they will find a way to get their work done outside of what is available.
Wrapup
• Without a Science DMZ, this stuff is all hard
  – "No one will use it." Maybe today – what about tomorrow?
  – "We don't have these demands currently." Next-gen technology is always a day away
• The Science DMZ design pattern provides a flexible model for supporting high-performance data transfers and workflows
• Key elements:
  – Accommodation of TCP
    • Sufficient bandwidth to avoid congestion
    • Loss-free IP service
  – Location – near the site perimeter if possible
  – Test and measurement
  – Dedicated systems
  – Appropriate security
• Support for advanced capabilities (e.g. SDN) is much easier with a Science DMZ
The Science DMZ in 1 Slide
Consists of four key components, all required:
• "Friction free" network path
  – Highly capable network devices (wire-speed, deep queues)
  – Virtual circuit connectivity option
  – Security policy and enforcement specific to science workflows
  – Located at or near site perimeter if possible
• Dedicated, high-performance Data Transfer Nodes (DTNs)
  – Hardware, operating system, libraries all optimized for transfer
  – Includes optimized data transfer tools such as Globus Online and GridFTP
• Performance measurement/test node
  – perfSONAR
• Engagement with end users
Details at http://fasterdata.es.net/science-dmz/
Links
– ESnet fasterdata knowledge base
  • http://fasterdata.es.net/
– Science DMZ paper
  • http://www.es.net/assets/pubs_presos/sc13sciDMZ-final.pdf
– Science DMZ email list
  • Send mail to [email protected] with the subject "subscribe esnet-sciencedmz"
– Fasterdata Events (Workshop, Webinar, etc. announcements)
  • Send mail to [email protected] with the subject "subscribe esnet-fasterdata-events"
– perfSONAR
  • http://fasterdata.es.net/performance-testing/perfsonar/
  • http://www.perfsonar.net
[email protected] Ask us anything: – Preparing for CC-‐DNI – Deploying perfSONAR – Debugging a problem – A8ending a training event – Designing a network
Thanks!
Jason Zurawski – [email protected] Science Engagement Engineer, ESnet Lawrence Berkeley National Laboratory
KINBER Webinar March 4th 2015