Slide 1

9/29/15

End-to-End Performance Tuning and Best Practices

Moderator: Charlie McMahon, Tulane University

Jan Cheetham, University of Wisconsin-Madison

Chris Rapier, Pittsburgh Supercomputing Center

Paul Gessler, University of Idaho

Maureen Dougherty, USC

Wednesday, September 29, 2015

Slide 2

Professor & Director, Northwest Knowledge Network

University of Idaho

Paul Gessler

Slide 3

Enabling 10 Gbps connections to the Idaho Regional Optical Network

• UI Moscow campus network core

• Northwest Knowledge Network and DMZ

• DOE’s Idaho National Lab

• Implemented perfSONAR monitoring over Idaho

• Institute for Biological and Evolutionary Studies

Slide 6

Research and Instructional Technologies Consultant

University of Wisconsin-Madison

Jan Cheetham

Slide 7

University of Wisconsin Campus Network

[Network diagram. Labels: HEP, Biotech, IceCube, SSEC, Engineering, LOCI, WID, WEI, CHTC, Campus Network Distribution, Science DMZ, Internet2 Innovation Network, 100G, perfSONAR]

Slide 8

Diagnosing Network Issues

perfSONAR helps uncover problems with:

• TCP window size issues to San Diego

• Optical fiber cut affecting latency-sensitive link between SSEC and NOAA

• Line card failure resulting in dropped packets on research partner’s (WID) LAN

• Transfers from internal data stores to distributed computing resources (HTCondor pools)
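The TCP window size issue in the first bullet follows from a simple bound: a single flow cannot send more than one window per round trip. The sketch below illustrates that bound only. The window size and round-trip time are hypothetical values chosen for illustration, not measurements from the Wisconsin to San Diego path.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-flow TCP throughput: one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


# Hypothetical values, for illustration only.
WINDOW_BYTES = 64 * 1024  # 64 KiB window, i.e. no window scaling
RTT_SECONDS = 0.050       # assumed 50 ms round trip

limit_bps = max_tcp_throughput_bps(WINDOW_BYTES, RTT_SECONDS)
print(f"Window-limited throughput: {limit_bps / 1e6:.1f} Mbit/s")
```

With these assumed numbers the flow tops out near 10 Mbit/s regardless of link speed, which is why a throughput test over a long path can expose an undersized window.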

Slide 9

Dealing with Firewalls

Can’t use firewall

• Security baseline for research computing

Must be behind a firewall

• Upgrade firewall to high speed backplane to allow 10G throughput to campus in preparation for campus network upgrade

• Plan to use SDN to shunt some traffic (identified uses within our security policy)

Slide 10

Challenges

• 100 GE line card failure (pursuing buffer overflow)

• Separating spiky research traffic from the rest of campus network traffic

• Distributed campus—getting the word out to enable everyone to take advantage

• Internal network environment limitations for researchers

• Storage bottleneck

Slide 11

Senior Research Programmer

Pittsburgh Supercomputing Center

Chris Rapier

Slide 12

XSight & Web10G

Goal: Use the metrics provided by Web10G to enhance workflow by early identification of pathological flows.

• A distributed set of Web10G enabled listeners on Data Transfer Nodes across multiple domains.

• Gather data on all flows of interest and collate at centralized DB.

• Analyze data to find marginal and failing flows

• Provide NOC with actionable data in near real time
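As a rough illustration of what "marginal" and "failing" could mean, the sketch below classifies a flow sample by retransmission ratio and throughput. It is not the XSight analysis engine. The `FlowSample` fields and every threshold are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class FlowSample:
    """Hypothetical per-flow snapshot; field names are illustrative only."""
    flow_id: str
    segs_out: int           # segments sent so far
    segs_retrans: int       # segments retransmitted so far
    throughput_mbps: float  # observed throughput


def classify(sample: FlowSample,
             marginal_retrans: float = 0.001,
             failing_retrans: float = 0.01,
             min_mbps: float = 100.0) -> str:
    """Return 'idle', 'ok', 'marginal', or 'failing' using assumed thresholds."""
    if sample.segs_out == 0:
        return "idle"
    ratio = sample.segs_retrans / sample.segs_out
    if ratio >= failing_retrans:
        return "failing"
    if ratio >= marginal_retrans or sample.throughput_mbps < min_mbps:
        return "marginal"
    return "ok"


samples = [
    FlowSample("a", segs_out=100_000, segs_retrans=10, throughput_mbps=900.0),
    FlowSample("b", segs_out=100_000, segs_retrans=500, throughput_mbps=400.0),
    FlowSample("c", segs_out=100_000, segs_retrans=5_000, throughput_mbps=20.0),
]
flagged = [(s.flow_id, classify(s)) for s in samples if classify(s) != "ok"]
print(flagged)
```

Only the flagged flows would need to reach the NOC, which keeps the output actionable.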

Slide 13

Implementation

• Listener: C application periodically polls all TCP flows. Applies rule set to

• Database: InfluxDB. Time series DB.

• Analysis engine: Currently applies heuristic approach. Development of models in progress.

• UI: Web based logical map. Allows engineers to drill down to failing flows and display collected metrics.
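To make the listener-to-database step concrete, here is a minimal sketch of serializing one flow sample into InfluxDB line protocol text (measurement, tags, fields, nanosecond timestamp). The measurement name, tag names, and field names are hypothetical and are not the XSight schema. Only numeric and boolean fields are handled.

```python
TAG_ESCAPES = ",= "          # escaped in tag keys, tag values, and field keys
MEASUREMENT_ESCAPES = ", "   # escaped in measurement names


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` wherever it occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement: str, tags: dict, fields: dict,
                     timestamp_ns: int) -> str:
    """Render one point as a line protocol string (numeric/bool fields only)."""
    if not fields:
        raise ValueError("at least one field is required")
    tag_part = "".join(
        "," + _escape(key, TAG_ESCAPES) + "=" + _escape(str(value), TAG_ESCAPES)
        for key, value in sorted(tags.items())
    )
    rendered = []
    for key, value in fields.items():
        if isinstance(value, bool):       # check bool before int
            text = "true" if value else "false"
        elif isinstance(value, int):
            text = f"{value}i"            # integer fields carry an 'i' suffix
        else:
            text = repr(float(value))
        rendered.append(_escape(key, TAG_ESCAPES) + "=" + text)
    name = _escape(measurement, MEASUREMENT_ESCAPES)
    return f"{name}{tag_part} {','.join(rendered)} {timestamp_ns}"


line = to_line_protocol(
    "tcp_flow",                                   # hypothetical measurement
    {"src_host": "dtn01", "dst_host": "dtn02"},   # hypothetical tags
    {"retrans": 12, "rtt_ms": 72.5},              # hypothetical fields
    1443484800000000000,
)
print(line)
```

Storing host names as tags rather than fields is a design choice: tags are indexed, so the UI could filter flows by endpoint cheaply.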

Slide 14

Results

• Analysis engine and UI still in development

• Looking for partners for listener deployment (includes NOCs)

• 6 months left under EAGER grant. Will be seeking to renew grant.

Slide 15

Director, Center for High-Performance Computing

USC

Maureen Dougherty

Slide 16

Trojan Express Network II

Goal: Develop Next Generation research network in parallel to production network to address increasing research data transfer demands

• Leverage existing 100G Science DMZ

• Instead of expensive routers, use cheaper high-end network switches

• Use OpenFlow running on a server to control the switch

• perfSONAR systems for metrics and monitoring

Slide 17

Trojan Express Network Buildout

Slide 18

Collaborative Bandwidth Tests

• 72.5 ms round trip between USC and Clemson

• 100 Gbps shared link

• 12-machine OrangeFS cluster at USC

– Directly connected to Brocade switch at 10 Gbps each

• 12 clients at Clemson

• USC ran nuttcp sessions between pairs of USC and Clemson hosts

• Clemson ran file copies to the USC OrangeFS cluster

Slide 19

Linux Network Configuration

Bandwidth Delay Product

72.5 ms x 10 Gbits/second = 725,000,000 bits = 90,625,000 bytes (about 90 MBytes)
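The bandwidth-delay product can be reproduced with a few lines of integer arithmetic. This is a minimal sketch; the function name and the choice of microseconds as the RTT unit are illustrative assumptions, made so the result stays exact.

```python
def bdp_bytes(rtt_microseconds: int, bits_per_second: int) -> int:
    """Bandwidth-delay product in bytes, using integer arithmetic."""
    bits_in_flight = rtt_microseconds * bits_per_second // 1_000_000
    return bits_in_flight // 8


RTT_US = 72_500               # 72.5 ms round trip, USC to Clemson
LINK_BPS = 10_000_000_000     # 10 Gbit/s per host

bdp = bdp_bytes(RTT_US, LINK_BPS)
print(bdp)  # 90625000
```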

• net.core.rmem_max = 96468992

• net.core.wmem_max = 96468992

• net.ipv4.tcp_rmem = 4096 87380 96468992

• net.ipv4.tcp_wmem = 4096 65536 96468992

• net.ipv4.tcp_congestion_control = yeah

• jumbo frames enabled (mtu 9000)
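The buffer limits above are 96,468,992 bytes (92 MiB), slightly more than the 90,625,000-byte bandwidth-delay product. The sketch below checks that relationship for the four buffer settings from the slide. The helper names are hypothetical, and the rule applied (maximum buffer at least as large as the BDP) is a common rule of thumb, not something stated in the slides.

```python
SYSCTL_TEXT = """\
net.core.rmem_max = 96468992
net.core.wmem_max = 96468992
net.ipv4.tcp_rmem = 4096 87380 96468992
net.ipv4.tcp_wmem = 4096 65536 96468992
"""

BDP_BYTES = 90_625_000  # 72.5 ms x 10 Gbit/s, from the slide


def parse_sysctl(text: str) -> dict:
    """Parse 'key = v1 [v2 v3]' lines into {key: [int, ...]}."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = [int(token) for token in value.split()]
    return settings


def undersized(settings: dict, bdp_bytes: int) -> list:
    """Return the keys whose maximum (last) value is below the BDP."""
    return sorted(key for key, values in settings.items()
                  if values[-1] < bdp_bytes)


settings = parse_sysctl(SYSCTL_TEXT)
print(undersized(settings, BDP_BYTES))       # [] for the 72.5 ms path
print(settings["net.core.rmem_max"][0] / 2**20)  # 92.0 (MiB)
```

Running the same check with a larger BDP, for example a longer path at the same speed, would list every setting that needs to grow.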

Slide 20

Nuttcp Bandwidth Test

Peak Transfer of 72Gb/s with 9 nodes
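A quick back-of-the-envelope reading of the peak figure, as a sketch: it assumes the 9 nodes shared the load evenly and that each sat on a 10 Gbps port, as described for the USC OrangeFS machines. The even split is an assumption; the slides give only the aggregate.

```python
PEAK_GBPS = 72          # aggregate peak from the slide
NODES = 9               # nodes participating at the peak
NODE_LINK_GBPS = 10     # per-node port speed (assumed for all 9 nodes)
SHARED_LINK_GBPS = 100  # shared USC-Clemson link

per_node_gbps = PEAK_GBPS / NODES                  # average per node
node_utilization = per_node_gbps / NODE_LINK_GBPS  # fraction of each 10G port
shared_utilization = PEAK_GBPS / SHARED_LINK_GBPS  # fraction of the 100G link

print(per_node_gbps, node_utilization, shared_utilization)  # 8.0 0.8 0.72
```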

Slide 21

Contact Information

Charlie McMahon, Tulane University, [email protected]

Jan Cheetham, University of Wisconsin-Madison, [email protected]

Chris Rapier, Pittsburgh Supercomputing Center, [email protected]

Paul Gessler, University of Idaho, [email protected]