Performance and Reliability Issues – Network, Storage & Services

Shawn McKee / University of Michigan, OSG All-hands Meeting, March 8th 2010, FNAL


Page 1: Performance and Reliability Issues – Network, Storage & Services

Performance and Reliability Issues – Network, Storage & Services

Shawn McKee / University of Michigan, OSG All-hands Meeting, March 8th 2010, FNAL

Page 2: Performance and Reliability Issues – Network, Storage & Services

Outline

I want to present a mix of topics related to performance and reliability for our sites. This is not composed of "the answers", but rather a set of what I consider important topics and examples, followed by discussion.

I will cover Network, Storage and Services:
◦ Configuration
◦ Tuning
◦ Monitoring
◦ Management


Page 3: Performance and Reliability Issues – Network, Storage & Services

General Goals for our Sites

Goal: Build a Robust Infrastructure
◦ Consider physical and logical topologies
◦ Provide alternate paths when feasible
◦ Tune, test, monitor and manage

Meta-Goal: Protect Services while Maintaining Performance
◦ Services should be configured in such a way that they "fail gracefully" rather than crashing. There are potentially many ways to do this.
◦ Tune, test, monitor and manage (as always)


Page 4: Performance and Reliability Issues – Network, Storage & Services

Common Problems

Power issues
Site (mis)configurations
Service failures
◦ Load related, bugs, configuration, updates
Hardware failures
◦ Disks, memory, CPU, etc.
Cooling failures
Network failures

Robust solutions are needed to minimize the impact of these problems.

Page 5: Performance and Reliability Issues – Network, Storage & Services

Site Infrastructures

There are a number of areas to examine where we can add robustness (usually at the cost of money or complexity!):
◦ Networking
  Physical and logical connectivity
◦ Storage
  Physical and logical connectivity; filesystems, OS, software, services
◦ Servers and Services
  Grid and VO software and middleware


Page 6: Performance and Reliability Issues – Network, Storage & Services

Example Site-to-Site Diagram


Page 7: Performance and Reliability Issues – Network, Storage & Services

Power Issues

Power issues are frequently the cause of service loss in our infrastructure.

Redundant power supplies connected to independent circuits can minimize loss due to circuit or supply failure (verify that one circuit can support the required load!).

UPS systems can bridge brown-outs or short-duration losses and protect equipment from power fluctuations.

Generators can provide longer-term bridging.


Page 8: Performance and Reliability Issues – Network, Storage & Services

Robust Network Connectivity

Redundant network connectivity can help provide robust networking.
◦ WAN resiliency is part of almost all WAN providers' infrastructure.
◦ Sites need to determine how best to provide both LAN and connector-level resiliency.

Basically, allow multiple paths for network traffic to flow in case of switch/router failure, cabling mishaps, NIC failure, etc.


Page 9: Performance and Reliability Issues – Network, Storage & Services

Virtual Circuits in LHC (WAN)

ESnet and Internet2 have helped the LHC sites in the US set up end-to-end circuits.

USATLAS has persistent circuits from BNL to 4 of the 5 Tier-2s.
◦ The circuits are guaranteed 1 Gbps but may overflow to utilize the available bandwidth.

This simplifies traffic management and is transparent to the sites.

There are future possibilities for dynamic circuit management. Failover is back to default routing.


Page 10: Performance and Reliability Issues – Network, Storage & Services

LAN Options to Consider

Utilize equipment of reasonable quality. Managed switches are typically more robust, are configurable, and support monitoring.

Within your LAN, have redundant switches with paths managed by spanning tree to increase uptime.

Anticipate likely failure modes. At the host level you can utilize multiple NICs (bonding).

Page 11: Performance and Reliability Issues – Network, Storage & Services

Example: Network Bonding

You can configure multiple network interfaces on a host to cooperate as a single virtual interface via "bonding".

Linux allows multiple "modes" for the bonding configuration (see next page).

There are trade-offs between resiliency and performance, as well as trade-offs related to hardware capabilities and topology.


Page 12: Performance and Reliability Issues – Network, Storage & Services

NIC Bonding Modes

Mode 0 – balance-rr (round-robin): the only mode allowing a single flow to balance over more than one NIC, BUT it reorders packets. Requires 'etherchannel' or 'trunking' on the switch.

Mode 1 – active-backup: allows connecting to different switches at different speeds. No throughput benefit, but redundant (a configuration sketch follows this list).

Mode 2 – balance-xor: selects the NIC per destination based upon an XOR of MAC addresses. Needs 'etherchannel' or 'trunking'.

Mode 3 – broadcast: transmits on all slaves. Needs distinct networks.

Mode 4 – 802.3ad: active-active; specific flows select a NIC based upon the chosen hash algorithm. Needs switch support for 802.3ad (LACP).

Mode 5 – balance-tlb: adaptive transmit load balancing. Output is balanced based upon current slave loads. No special switch support required; the NIC must support 'ethtool'.

Mode 6 – balance-alb: adaptive load balancing. Similar to mode 5 but also allows receive balancing via ARP manipulation.
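To make mode 1 concrete, here is a minimal sketch (not from the talk; interface names and addresses are illustrative) of an active-backup bond on a RHEL/CentOS 5-era host:

  # /etc/modprobe.conf -- load the bonding driver for bond0
  alias bond0 bonding
  options bond0 mode=1 miimon=100    # active-backup, check link every 100 ms

  # /etc/sysconfig/network-scripts/ifcfg-bond0 -- the virtual interface
  DEVICE=bond0
  IPADDR=192.168.1.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 -- slave NIC (repeat for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

The resulting bond state can be checked in /proc/net/bonding/bond0.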


Page 13: Performance and Reliability Issues – Network, Storage & Services

Network Tuning (1/2)

Typical "default" OS tunings for networking are not optimal for WAN data transmission.

Depending upon the OS, you can find particular tuning advice at http://fasterdata.es.net/TCP-tuning/background.html

Buffers are the primary tuning target: buffer size = bandwidth * RTT (the bandwidth-delay product, BDP)
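For example (illustrative numbers, not from the slide): a 1 Gbps path with a 70 ms round-trip time needs roughly 10^9 bits/s * 0.070 s = 7 x 10^7 bits, i.e. about 8.75 MB of buffer, to keep the pipe full.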

Good news: most OSes support autotuning now, so there is no need to set default buffer sizes.


Page 14: Performance and Reliability Issues – Network, Storage & Services

Network Tuning (2/2)

To get maximal throughput it is critical to use optimal TCP buffer sizes.
◦ If the buffers are too small, the TCP congestion window will never fully open up.
◦ If the receiver buffers are too large, TCP flow control breaks; the sender can overrun the receiver, which will cause the TCP window to shut down. This is likely to happen if the sending host is faster than the receiving host.


Page 15: Performance and Reliability Issues – Network, Storage & Services

Linux TCP Tuning (1/2)

Like all operating systems, the default maximum Linux TCP buffer sizes are way too small.

# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# set max to at least 4MB, higher if you use very high BDP paths
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

You should also verify that the following are all set to the default value of 1:
sysctl net.ipv4.tcp_window_scaling
sysctl net.ipv4.tcp_timestamps
sysctl net.ipv4.tcp_sack

Of course, TEST after changes. SACK may need to be off for large BDP paths (> 16 MB) or timeouts may result.
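As a usage note (not on the slide), these values normally go in /etc/sysctl.conf and are applied with:

  sysctl -p                                   # re-read /etc/sysctl.conf
  sysctl -w net.core.rmem_max=16777216        # or set a single value on the fly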

Page 16: Performance and Reliability Issues – Network, Storage & Services

Linux TCP Tuning (2/2)

Tuning can be more complex for 10GE.

You can explore different congestion control algorithms: BIC, CUBIC, HTCP, etc.

A large MTU (jumbo frames) can improve throughput.

There are a couple of additional sysctl settings for 2.6 kernels:
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
# recommended to increase this for 1000BT or higher
net.core.netdev_max_backlog = 2500   # for 10 GigE, use 30000
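As an illustration (not on the slide), the congestion control algorithm can be inspected and changed via sysctl; which algorithms are available depends on the kernel modules installed:

  sysctl net.ipv4.tcp_available_congestion_control   # list available algorithms
  sysctl -w net.ipv4.tcp_congestion_control=htcp     # switch the default, e.g. to H-TCP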


Page 17: Performance and Reliability Issues – Network, Storage & Services

Storage Connectivity

Increase robustness for storage by providing resiliency at various levels:
◦ Network: bonding (e.g. 802.3ad)
◦ RAID/SCSI redundant cabling, multipathing (hardware specific; see the note after this list)
◦ iSCSI (with redundant connections)
◦ Single-host resiliency: redundant power, mirrored memory, RAID OS disks, multipath controllers
◦ Clustered/failover storage servers
◦ Multiple copies, multiple write locations
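One Linux-side note (an assumption, not from the slide): if the multipathing is handled by device-mapper multipath, the configuration lives in /etc/multipath.conf, the multipathd service must be running, and the path topology can be inspected with:

  # list multipath devices and the health of each path
  multipath -ll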


Page 18: Performance and Reliability Issues – Network, Storage & Services

Example: Redundant Cabling Using Dell MD1000s

New firmware for Dell RAID controllers supports redundant cabling of MD1000s.

Each MD1000 can have two EMMs, each capable of accessing all disks.

A PERC 6/E has two SAS channels. You can now cable each channel to an EMM on a shelf; the connection shows up as one logical link (similar to a "bond" in networking).

Shelves can be daisy-chained up to 3 MD1000s.


Page 19: Performance and Reliability Issues – Network, Storage & Services

Redundant Path With Static Load Balancing Support

From the Dell PERC 6/E documentation (link below): "The PERC 6/E adapter can detect and use redundant paths to drives contained in enclosures. This provides the ability to connect two SAS cables between a controller and an enclosure for path redundancy. The controller is able to tolerate the failure of a cable or Enclosure Management Module (EMM) by utilizing the remaining path.

When redundant paths exist, the controller automatically balances I/O load through both paths to each disk drive. This load balancing feature increases throughput to each drive and is automatically turned on when redundant paths are detected. To set up your hardware to support redundant paths, see Setting up Redundant Path Support on the PERC 6/E Adapter.

NOTE: This support for redundant paths refers to path-redundancy only and not to controller-redundancy."


http://support.dell.com/support/edocs/storage/RAID/PERC6/en/UG/HTML/chapterd.htm#wp1068896

Page 20: Performance and Reliability Issues – Network, Storage & Services

Storage Tuning

Have good hardware underneath the storage system!

Pick an underlying filesystem that performs well. XFS is a common choice; it supports a large number of directory entries and online defragmentation.

The following settings require the target to be mounted:

Set "readahead" to improve read speed (4096-16384):
  blockdev --setra 10240 $dev
Set up queuing requests (allows optimizing):
  echo 512 > /sys/block/${sd}/queue/nr_requests
Pick an I/O scheduler suitable for your task:
  echo deadline > /sys/block/${sd}/queue/scheduler

There are often hardware specific tunings possible. Remember to test for your expected workload to see if changes help.
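A minimal sketch (not from the talk; device names are illustrative) of applying the settings above to a set of data disks, e.g. from rc.local:

  #!/bin/bash
  # tune readahead, queue depth and I/O scheduler for data disks sdb..sdd
  for sd in sdb sdc sdd; do
      blockdev --setra 10240 /dev/$sd                    # readahead (512-byte sectors)
      echo 512      > /sys/block/$sd/queue/nr_requests   # deeper request queue
      echo deadline > /sys/block/$sd/queue/scheduler     # deadline I/O scheduler
  done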


Page 21: Performance and Reliability Issues – Network, Storage & Services

Robust Grid Services?

Just a topic I wanted to mention: I would like to be able to configure virtual grid services (using multiple hosts, heartbeat, LVS, etc.) to create a robust infrastructure.

Primary targets:
◦ Gatekeepers, job schedulers, GUMS servers, LFC, software servers, dCache admin servers
◦ A possible solution for NFS servers via heartbeat, LVS… others?


Page 22: Performance and Reliability Issues – Network, Storage & Services

Virtualization of Service Nodes

Our current grid infrastructure for ATLAS requires a number of services.

Virtualization technologies can be used to provide some of these services.

Depending upon the virtualization system, this can help with:
◦ Backing up critical services
◦ Increasing availability
◦ Easing management


Page 23: Performance and Reliability Issues – Network, Storage & Services

Example: VMware

At AGLT2 we have VMware Enterprise running:
◦ LFC, 3 Squid servers, OSG gatekeeper, ROCKS headnodes (dev/prod), 2 of 3 Kerberos/AFS/NIS nodes, central syslog-ng host, muon splitter, 2 of 5 AFS file servers

"HA" can ensure services keep running even if a server fails. Backup is easy as well.

We can "live-migrate" VMs between 3 servers or migrate VM storage to an alternate back-end storage server.


Page 24: Performance and Reliability Issues – Network, Storage & Services

Example: AGLT2 VMware


Not shown are the 10GE connections (1 per server).

Page 25: Performance and Reliability Issues – Network, Storage & Services

Example: Details for UMVM02


Page 26: Performance and Reliability Issues – Network, Storage & Services

Backups

"You do have backups, right?..." Scary question, huh?!

Backups provide a form of resiliency against various hardware failures and unintentional acts of stupidity.

They could be anything from a full tape backup system to various cron scripts saving needed config info (a trivial sketch of the latter follows).

Not always easy to get right… test!
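As an illustration only (not from the talk; paths and schedule are made up), a cron-based config backup might be as simple as:

  # /etc/cron.d/config-backup -- nightly tarball of /etc kept on a backup volume
  # min hour dom mon dow user  command
  30 2 * * * root tar czf /backup/etc-$(date +\%F).tgz /etc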

Page 27: Performance and Reliability Issues – Network, Storage & Services

System Tuning

Lots of topics could be put here, but I will just mention a few items.

You can install 'ktune' (yum install ktune). It provides some tunings for large-memory systems running disk- and network-intensive applications (see the note below).

See the related storage/network tunings.

Memory is a likely bottleneck in many cases… have lots!
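Note (not on the slide): on RHEL/CentOS 5, ktune ships as an init service, so the profile is typically enabled with something like:

  yum install ktune       # install the tuning profile package
  chkconfig ktune on      # apply the ktune settings at boot
  service ktune start     # apply them now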


Page 28: Performance and Reliability Issues – Network, Storage & Services

Cluster Monitoring…

This is a huge topic. In general, you can't find problems if you don't know about them, and you can't effectively manage systems if you can't monitor them.

I will list a few monitoring programs that I have found useful.

There are many options in this area that I won't cover; Nagios is a prime example that is being used very successfully.


Page 29: Performance and Reliability Issues – Network, Storage & Services

Ganglia

Ganglia is a cluster monitoring program available from http://ganglia.sourceforge.net/ and also distributed as part of ROCKS.

It allows a quick view of CPU and memory use cluster-wide.

You can drill down into host-specific details.

It can easily be extended to monitor additional data or to aggregate sites.
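As an example of such an extension (not from the talk; the metric name and value are made up), custom values can be injected with the gmetric tool, e.g. from a cron job:

  # publish a custom value into Ganglia (name/units are illustrative)
  gmetric --name dcache_active_movers --value 42 --type uint16 --units movers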


Page 30: Performance and Reliability Issues – Network, Storage & Services

Example Ganglia Interface


Page 31: Performance and Reliability Issues – Network, Storage & Services

Cacti Monitoring

Cacti (see http://www.cacti.net/) is a network graphing package that uses SNMP and RRDtool to record data.

It can be extended with plugins (threshold, monitoring, MAC lookup).


Page 32: Performance and Reliability Issues – Network, Storage & Services

Example Cacti Graphs


Graphs shown: inbound and outbound AGLT2 10GE bytes/sec, aggregate 'ntpd' offset (ms), space-token stats (put/get), Postgres DB stats, and NFS client statistics.

Page 33: Performance and Reliability Issues – Network, Storage & Services

Custom Monitoring

Philippe Laurens (MSU) has developed a summary page for AGLT2 which quickly shows cluster status:


Page 34: Performance and Reliability Issues – Network, Storage & Services

Automated Monitoring/Recovery

Some types of problems can be easily "fixed" if we can just identify them.

The 'monit' software ('yum install monit') can provide an easy way to test various system/software components and attempt to remediate problems.

Configure a file per item to watch/test. It is very configurable and can fix problems at 3 AM! Some examples follow:


Page 35: Performance and Reliability Issues – Network, Storage & Services

Monit Example for MySQL

This describes the relevant MySQL info for this host:

# mysqld monitoring
check process mysqld with pidfile /var/lib/mysql/dq2.aglt2.org.pid
  group database
  start program = "/etc/init.d/mysql start"
  stop program = "/etc/init.d/mysql stop"
  if failed host 127.0.0.1 port 3306 protocol mysql 3 cycles then restart
  if failed host 127.0.0.1 port 3306 protocol mysql 3 cycles then alert
  if failed unixsocket /var/lib/mysql/mysql.sock protocol mysql 4 cycles then alert
  if 5 restarts within 10 cycles then timeout

Restarting and alerting are triggered based upon tests.

Resides in /etc/monit.d as mysqld.conf
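Beyond per-process checks, monit can also watch overall load and filesystem usage. A further illustrative snippet (not from the talk; host name, paths and thresholds are made up):

  # /etc/monit.d/system.conf -- illustrative host and filesystem checks
  check system gridnode.example.org
    if loadavg (5min) > 8 then alert
    if memory usage > 90% then alert

  check filesystem datafs with path /data
    if space usage > 90% then alert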


Page 36: Performance and Reliability Issues – Network, Storage & Services

Other Monitoring/Management

Lots of sites utilize "simple" scripts run via "cron" (or equivalent) that:
◦ Perform regular maintenance
◦ Check for "known" problems
◦ Back up data or configurations
◦ Extract monitoring data
◦ Remediate commonly occurring failures

These can be very helpful for increasing reliability and performance.


Page 37: Performance and Reliability Issues – Network, Storage & Services

Security Considerations

Security is a whole separate topic… not appropriate to cover here.

The general issue is that unless security is also addressed, your otherwise high-performing, robust infrastructure may have large downtimes while you try to contain and repair system compromises!

Good security practices are part of building robust infrastructures.


Page 38: Performance and Reliability Issues – Network, Storage & Services

Configuration Management

Not directly related to performance or reliability but very important

Common tools:
◦ Code management and versioning (Subversion, CVS)
◦ Provisioning and configuration management (ROCKS, Kickstart, Puppet, Cfengine)

All are important for figuring out what was changed and what is currently configured.


Page 39: Performance and Reliability Issues – Network, Storage & Services

Regular Storage "Maintenance"

Start with the bits on disk: run 'smartd' to look for impending failures.

Use "patrol reads" or background consistency checks to find bad sectors.

Run filesystem checks when things are "suspicious" (xfs_repair, fsck…).

Run higher-level consistency checks (like Charles' ccc.py script) to ensure the various views of your storage are consistent.


Page 40: Performance and Reliability Issues – Network, Storage & Services

High Level Storage Consistency


Being run at MWT2 and AGLT2

Allows finding consistency problems and “dark” data

Page 41: Performance and Reliability Issues – Network, Storage & Services

dCache Monitoring/Management

AGLT2 has monitoring/management that we do specifically for dCache (as an example).

Other storage solutions may have similar types of monitoring.

We have developed some custom pages in addition to the standard dCache services web interface. These track usage and consistency.

We also have a series of scripts running in 'cron' doing routine maintenance/checks.


Page 42: Performance and Reliability Issues – Network, Storage & Services

dCache Allocation and Use


Page 43: Performance and Reliability Issues – Network, Storage & Services

dCache Consistency Page


Page 44: Performance and Reliability Issues – Network, Storage & Services

WAN Network Monitoring

Within the Throughput group we have been working on network monitoring as complementary to throughput testing.

Two measurement/monitoring areas:
◦ perfSONAR at Tier-1/Tier-2 sites
  "Network"-specific testing
◦ Automated transfer testing
  "End-to-end" testing using standard ATLAS tools
◦ May add a "transaction test" next (TBD)

Page 45: Performance and Reliability Issues – Network, Storage & Services

Network Monitoring: perfSONAR


As you are by now well aware, there is a broad-scale effort to standardize network monitoring under the perfSONAR framework.

Since the network is so fundamental to our work, we targeted the implementation of a perfSONAR instance at all our primary facilities. We have ~20 sites running.

It has already proven very useful in USATLAS!

Page 46: Performance and Reliability Issues – Network, Storage & Services

perfSONAR Examples: USATLAS


Page 47: Performance and Reliability Issues – Network, Storage & Services

perfSONAR in USATLAS

The typical Tier-1/Tier-2 installation provides two systems (using the same KOI hardware at each site): a latency node and a bandwidth node.

Automated recurring tests are configured for both latency and bandwidth between all Tier-1/Tier-2 sites ("mesh" testing).

We are acquiring a baseline and history of network performance between sites.

On-demand testing is also available.


Page 48: Performance and Reliability Issues – Network, Storage & Services

Production System Testing

While perfSONAR is becoming the tool of choice for monitoring the network behavior between sites, we also need to track the "end-to-end" behavior of our complex, distributed systems.

We are utilizing regularly scheduled automated testing, sending specific data between sites to verify proper operation.

This is critical for problem isolation; comparing network and application results can pinpoint problem locations.


Page 49: Performance and Reliability Issues – Network, Storage & Services

Automated Data Transfer Tests

As part of USATLAS Throughput work, Hiro has developed an automated data transfer system which utilizes the standard ATLAS DDM system.

This allows us to monitor the throughput of the system on a regular basis.

It transfers a set of files once per day from the Tier-1 to each Tier-2 for two different destinations.

Recently it was extended to allow arbitrary source/destination pairs (including Tier-3s).

http://www.usatlas.bnl.gov/dq2/throughput


Page 50: Performance and Reliability Issues – Network, Storage & Services

Web Interface to Throughput Test


Page 51: Performance and Reliability Issues – Network, Storage & Services

Throughput Test Graph #1


Page 52: Performance and Reliability Issues – Network, Storage & Services

Throughput Test Graph #2


Page 53: Performance and Reliability Issues – Network, Storage & Services

Throughput Test Graph #3


Page 54: Performance and Reliability Issues – Network, Storage & Services

Future Throughput Work

With the recent release of an updated perfSONAR, we are in a position to acquire useful baseline network performance between our sites.

A number of potential network issues requiring some debugging are starting to appear.

As we acquire data, both from perfSONAR and throughput testing, we need to start developing higher-level diagnostics and alerting systems (How best to integrate with “Operations”?)


Page 55: Performance and Reliability Issues – Network, Storage & Services

General Considerations (1/2)

Lots of things can impact both reliability and performance

At the hardware level:
◦ Check for driver updates
◦ Examine firmware/BIOS versions (newer isn't always better, BTW)

Check software versions… are there fixes for your problems?

Test changes – do they do what you thought? What else did they break?


Page 56: Performance and Reliability Issues – Network, Storage & Services

General Considerations (2/2)

Sometimes the additional complexity to add “resiliency” actually decreases availability compared to doing nothing!

Having test equipment to experiment with is critical for trying new options

Often you need to trade off cost vs. performance vs. reliability (pick 2).

Documentation, issue tracking and version control systems are your friends!


Page 57: Performance and Reliability Issues – Network, Storage & Services

Summary

There are many components and complex interactions possible in our sites.

We need to understand our options (frequently site specific) to help create robust, high-performing infrastructures

Monitoring is central to delivering both reliability and performance

Reminder: test changes to make sure they actually do what you want (and not something you don’t want!)


Page 58: Performance and Reliability Issues – Network, Storage & Services

Questions?


Page 59: Performance and Reliability Issues – Network, Storage & Services

Backup Slides
