yearly network report 2020 - information technology services


Page 1: Yearly Network Report 2020 - Information Technology Services

2020

CREATED JANUARY 22, 2021

Information Technology Services Communication Technologies: Networking

Yearly Network Report

YEARLY NETWORK REPORT 2020

News and Information:

All,

I hope you will share in my belief that our group has managed to have an extremely successful year despite all of the adversity. In this yearly report, we have compiled summary data across several metrics. We have also included a comprehensive list of all network outages/incidents in the past year, regardless of the impact. Also, please take a moment to examine the Service-Now metrics. We serviced almost 3,000 requests in the last year alone!

I would also like to extend a personal thank you to the Network Deployment group. This group has been deemed essential, and they have been on-site throughout the entire COVID period. The rest of the networking group has leaned heavily on them over the past 10 months, and they have carried on many of our missions despite operating with less-than-ideal staffing and conditions. They have not had the opportunity to do much work from home and have routinely exposed themselves to health risks that many of us can avoid in a work from home environment. The members of this group are Reid Bradsher, Kevin Clayton, Seare Habte, Len Needham, Dale Oxendine, Danny Stubbs, David Valleroy, and Mary Wezyk. They have been supported by Eric Maynor and Chad Wade from the Wireless group.

Finally, I would like to let you know that David Valleroy, Manager of Network Deployment, retired in the last quarter. David Valleroy has led the Network Deployment group for over 15 years, and we want to wish him the best in retirement. Len Needham is acting as interim manager of the Network Deployment group.

Sincerely,
Ryan Turner
Head of Networking

Major Successes in 2020

• The new data center re-architecture is officially complete. Almost all ITS services and many departments have been migrated to the new pod infrastructure. We have been able to work with the vendor and move past several code bugs that caused data center disruptions.

• A new campus VPN service was installed to support increased load resulting from the campus transition to work from home.

• A campus proxy service was released to provide systems a way to update/patch/communicate with external resources.

• We have begun to successfully migrate from the Verizon managed voice solution to an AT&T solution.

• We have life cycled dozens of buildings with new wired and wireless gear.

• Our DevOps team has successfully created a new Router Proxy, a tool that we utilize to configure and view numerous things within the network. This tool is heavily used by ITS Security, as well. Numerous people are working on this development, some outside of our division.

• Our network has been extended to UNC Health, and UNC Health’s network has been extended into ours via VRFs. To most people, this means that eduroam is available in the hospital, and SkyNet is available in the School of Medicine. This project, which was initiated before the COVID disruption, has allowed us to quickly spin up UNC Health into the Friday Center to aid in COVID vaccinations.

• We provided significant support in providing connectivity for the campus COVID testing in Genome Sciences, as well as UNC Health’s vaccination site in the Friday Center.

• We have successfully implemented a new remote office technology that will allow those sites to maintain campus services while utilizing inexpensive ISP connections versus regional/metro ethernet P2P connections. We did have some issues with this technology, but recently deployed a firmware fix that appears to have resolved most of the issues.

• We have started implementing our new School of Medicine network design with the creation of the Marsico Tier 1. Most buildings that were connected to Taylor have been migrated to the new tier 1. We are utilizing a switch capable of 100Gbps to empower research transfers.

• We migrated to a new wireless operating system. We do not normally comment on OS upgrades, but this upgrade took over 2 years of work.

Life Cycle 2020

Despite decreased operational capacity in the last year, we still managed to life cycle close to our target levels of equipment. Each year we endeavor to replace 300 – 350 switches and approximately 1,000 wireless access points. This was possible mainly due to large inventory purchases that were made before COVID that carried us through the year as well as the extraordinary work the Network Deployment group continued to do as essential employees.

At present, our inventory is very close to break/fix levels. Unless things change, I expect less progress in 2021 due to current limitations in purchasing allowances.

Major Investments (Net Increase YOY), 2020

• Aruba 225 APs (396)
• Aruba 315 APs (375)
• Aruba 515 APs (158)
• Extreme Summit Switches (242)
• Cisco Equipment (13)

Major Divestments (Net Decrease YOY), 2020

• Aruba 135 APs (532)
• Extreme G Switches (144)
• Extreme N Switches (24)
• Extreme S Switches (15)

Campus Traffic 2020

Traffic levels have dropped significantly on campus during the pandemic. Because these graphs are over a long period, spikes and troughs get rounded out (you cannot see our absolute peak on this scale), but you can see what happened to traffic when the campus changed how it operated in March of 2020 as well as the brief return of students in the Fall of 2020.

Green is traffic coming into campus. Blue is traffic going out of campus.

VPN Use 2020

Conversely, VPN traffic increased significantly during the pandemic. Our campus VPN resources were not being heavily used before the pandemic, but you can see the shift, almost overnight, to working from home. Ignore the apparent data loss for statistics between June and October of 2019.

Service Now Tasks / Incidents

We expected to receive a low volume of requests this year. Since we recently switched to Service Now, we do not have useful statistics to compare to 2019. However, we still handled nearly 3,000 requests.

YEAR OF 2020

Group Name                  Service Request    Incident    Count
IP Services                 612                41          653
Deployment                  713                379         1,092
Systems                     64                 6           70
WAN                         48                 6           54
Wireless                    32                 72          104
Operations / Engineering    600                265         865
Count                       2,069              769         2,838

September through December 2020

Group Name                  Service Request    Incident    Count
IP Services                 146                14          160
Deployment                  113                48          161
Systems                     -                  1           1
WAN                         7                  -           7
Wireless                    9                  14          23
Operations / Engineering    140                42          182
Count                       415                119         534

I think one of the standout reductions we have been able to make through automation/extensibility is with our Infoblox platform. Before Infoblox, all DNS change requests would come to IP Services. With many departments taking advantage of being able to change DNS records themselves we have seen those requests reduced. In fact, in 2020, we recorded around 10,000 events within Infoblox by groups outside of Comm Tech. An event can be many things, including someone logging into the platform or changing a record. But you can see the platform is being used by many and paying dividends:

DEPARTMENT                        EVENTS
Arts and Sci Information Svcs     361
Carolina Institute for DD         23
Carolina Population Center        66
ES ITS                            46
FS-Bldg Svcs-LSAC-Hardware        29
GEC Building Operations           9
Information Technology-SOM        17
Institute of Marine Sciences      192
ITS - Comm Technologies           18187
ITS - Information Security        156
ITS - IT Infrastructure           3958
ITS - Teaching and Learning       31
ITS - User Supp and Engagement    33
LCCC - UCRF                       890
Lineberger Compr Cancer Center    7
Renaissance Computing Inst        146
Research Computing                2572
SCE - IT                          79
School of Journalism and Media    5
School of Law                     109
SOD Information Systems           19
SOP-Educational Technology        488
SOP-Information Technology        227
SPH - Information Technology      145
Student Affairs Info Tech         2
SW-School of Social Work          13
Grand Total                       27810
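The "around 10,000" figure can be checked against the table: subtracting the ITS - Comm Technologies row from the grand total leaves the events recorded by every other group. A minimal Python sketch of that arithmetic follows. It uses only the two figures from the table, and the variable names are illustrative.

```python
# Figures taken from the Infoblox events table in this report.
grand_total = 27810        # "Grand Total" row
comm_tech_events = 18187   # "ITS - Comm Technologies" row

# Events recorded by groups outside of Comm Tech.
outside_comm_tech = grand_total - comm_tech_events
share_outside = outside_comm_tech / grand_total

print(f"Events outside Comm Tech: {outside_comm_tech:,}")
print(f"Share of all events: {share_outside:.1%}")
```

The difference is 9,623 events, which is the basis for the rounded figure quoted in the paragraph above.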

Key Campus Metrics for September 2020 – December 2020

WIRED

Number of switches on campus: 2,935
Number of ports: 175,010
Peak download rate: 15 Gbps (November 30)
Peak upload rate: 15 Gbps (November 19)
Traffic sent to the Internet: Unavailable this quarter due to data corruption
Traffic received from the Internet: Unavailable this quarter due to data corruption

*See the last section for model notes.

Switch Distribution - Entire Campus

• 7100 Series (680)
• Arista (29)
• Cisco (39)
• D Series (22)
• G Series (825)
• K Series (23)
• N Series (40)
• S Series (97)
• SLX (12)
• Summit Series (1133)

WIRELESS

Number of APs on campus: 10,024
Peak concurrent connections: 11,300 (Nov 10th)
Devices onboarded to eduroam: 6,512
Top Onboarded OS: iOS at 49%

*See the last section for model notes.

AP Distribution - Entire Campus

• AP-225 (3926)
• AP-315 (3124)
• AP-135 (650)
• AP-325 (803)
• AP-303H (374)
• AP-205H (367)
• AP-224 (324)
• AP-335 (99)
• AP-134 (43)
• AP-275 (46)
• AP-277 (44)
• AP-515 (158)
• AP-377 (5)

Major Initiatives Review:

ITS UCS and storage 4x100 Pods

We had major successes in ironing out bugs in data center switches to allow this critical work to move forward. We expect UCS and storage to start migrating over to the 100Gbps pods in the next quarter or two. We are currently testing the setup. This will significantly increase the data capacity for many ITS services.

Campus Distribution Upgrade

We will begin testing a new SLX switch (9740) to become our replacement switch for the old Enterasys/Extreme S-series switches which are nearing the end of service. We expected this work to be done during the previous summer, but COVID delayed those plans. It may continue to delay plans if funding does not allow the replacement of over 20 tier 1 switches, but we are aiming to have half of them replaced by the end of 2021. The new switch line will support 40/100Gbps to buildings where necessary/appropriate and will give us many years of capacity and support.

School of Medicine Distribution Switching Upgrade

With the Marsico Tier 1 nearly done, we will work to migrate MacNider over to the new switching hardware in the next quarter. The School of Medicine will have the capability of delivering 40Gbps to the building and 10Gbps to desktops where appropriate.

Data Center automation through Router-Proxy

The Networking Dev-Ops group has started to work on data center orchestration capabilities. This will allow us to automate/delegate changes to data center objects which should result in less human error and fewer outages. This is a large task and will take a few quarters to complete.

5520 Edge Switching Update

We have begun to order a new edge switch called a 5520. Some of the benefits of this switch include high-speed uplinks (25/40/50Gbps), support for many multirate ports (2.5/5Gbps), and 802.3bt power (90+ watts of power per port). We will be validating these new switches in the School of Medicine over the next quarter and expect this to be our standard switch moving forward.

Life Cycle Update:

The following locations have received substantial new switching AND wireless gear during the past quarter:

Carolina Crossing 108 East Franklin Street Giles Horney

The following locations have received substantial new wireless gear during the past quarter:

6th Fl Carolina Square Carr Mill Mall Beard EHS McGavran Greenberg New East RENCI 1st and 5th floors Smith Center

The following locations are being targeted for switch and wireless upgrades in the coming quarter (subject to change):

Undetermined until funding is fully understood

Network Incident Report 2020

Critical Issues:

A critical issue is generally defined as an issue that had a significant impact on the campus community during business hours operations. This report does not seek to assign blame but to record the networking events that affected the campus community, regardless of that event being inside or outside of our control.

3/31/2020 Spectrum users experiencing extreme slowness connecting to campus.

We received numerous reports from users on the CTC list that connectivity from home to campus was very slow. We were able to correlate the reports to users who had Spectrum for home internet. Upon investigation with our upstream ISP (MCNC), we determined our ISP had made an unannounced change the previous day with regard to how they route Spectrum traffic in the hopes of resolving another issue. This had unanticipated consequences, as connectivity for Spectrum users to campus resources worsened when the change was made. Many schools connecting through MCNC also reported complaints of slowness from remote Spectrum users. MCNC worked with Spectrum to bring the peering back online, increasing the pipe substantially from multiple 1G connections to multiple 10G connections. MCNC had the connections upgraded late on April 1st, and a formal announcement was sent on April 2nd.

8/19/2020 Data Center Loop

From around 3:20 in the afternoon until just before 7 PM, we battled a bug in one network switching platform that cascaded and exposed a second bug in a different platform, resulting in significant degradation of ITS services in the data center. This incident is being written up retrospectively, as what we initially reported differs from what has been discovered after many months of troubleshooting with the vendor. The main issue started with a simple reboot of a switch member in a 100G pod we were testing with the UCS group. We did not know this at the time, but there was a resource limit during the reboot that caused a massive loop (in excess of 100M PPS). This loop caused problems with one of the four SLX spine switches that are north of the 100G pod. While we don’t know exactly why the flooding caused one line card on one of the four SLX switches to start misbehaving, the vendor has now found what appears to be an issue with their implementation of MVRP on the platform. They have just been able to replicate this in the lab and expect a code fix for us before March of 2021.

10/7/2020 Routing issues in School of Medicine

We had a normal change (CHG0033214) that was executed on October 7, between 6 and 7 AM to move routing from the legacy medical school routers (macman) to the new medical school routers (med-core).

Part of this change was also to move the VRF that provides some routing to UNC Health from macman to med-core. The med-core routers are a pair of Nexus 7706 routers (med-core Manning and med-core Phillips). Changes must be symmetrical, but in this case, we failed to remove a configuration item from the Manning side, which resulted in black-holing of traffic going through it. As a result, some people experienced no connectivity loss during the change while others could connect to wireless but not get any services. In short, traffic is load balanced between the two routers, and those who were load balanced to the Manning router were adversely affected. We were first made aware of the issue when an incident came to us around 8:16 AM (INC0178164). Subsequent notifications and discussions with staff and SOM-IT over the next 30 minutes revealed the scale of the issue. The root cause was not initially apparent. Notifications were made to the CTC at 8:51 AM, 10:15 AM, and at 10:22 AM when we resolved the issue.
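A symmetry check of this kind can be sketched in a few lines of Python. This is an illustration only, not the process actually used for this change: it assumes the relevant configuration of each router has already been exported as a list of text lines, and the function name and sample lines are hypothetical.

```python
def asymmetric_lines(config_a, config_b):
    """Return the config lines that appear on only one router of a pair."""
    set_a = {line.strip() for line in config_a if line.strip()}
    set_b = {line.strip() for line in config_b if line.strip()}
    return sorted(set_a - set_b), sorted(set_b - set_a)

# Hypothetical, simplified configuration lines for illustration only.
manning = ["vrf example-vrf", "leftover-item from old design", "interface uplink-1"]
phillips = ["vrf example-vrf", "interface uplink-1"]

only_manning, only_phillips = asymmetric_lines(manning, phillips)
print("Only on Manning:", only_manning)
print("Only on Phillips:", only_phillips)
```

Real router pairs legitimately differ in places such as addresses and hostnames, so a practical version would need to filter those lines out before comparing.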

Major Issues:

A major issue is generally defined as an issue that impacted a very limited area of campus for more than a brief period.

1/7/2020 Fluffy upgrade had an unexpected reboot

During a planned change window (5 AM-7 AM) on Tuesday, the active chassis rebooted while the other chassis was being upgraded. We later discovered that the active chassis performed an FPGA upgrade which was not expected (or noted to us in advance by Cisco). This caused the unit to reboot, disrupting campus traffic through the core between 5:29 AM and 5:41 AM. Cisco did not anticipate this problem and had mocked up this upgrade in the lab. If this had occurred during normal business hours and not in our change window, this would have been a critical issue. Because it was a scheduled change with risk, this is being classified as a major issue. As a side effect, this took down the campus 911 center for 30 minutes.

1/29/2020 SecureW2 platform unavailable

Due to an issue with Amazon Web Services, the SecureW2 platform was unavailable for onboarding. This was an issue that affected all SecureW2 customers. Because this happened well after the start of the semester, this was not categorized as a critical outage, but as a major outage due to the length of disruption.

2/13/2020 MVRP feature caused intermittent communication issues in the School of Medicine

We enabled a feature on the School of Medicine 7706 core early in the morning. Before enabling this feature, Cisco had mocked up our setup in the lab and verified it should work. We did extensive testing in a non-production environment. Despite all the testing, when we enabled this feature, it corrupted the programming for the ASICs, causing some traffic to get dropped. It did not appear to be widespread and mainly affected a building control VLAN. The issue was resolved by shutting down interfaces within port channels and bringing them back online. We will not re-enable this feature in the future.

2/7/2020 Fault in Resnet / Wifi Manning Nexus chassis

Around 2:40 in the morning, we were alerted to some unusual traps in our monitoring platform regarding the loss of communication between the VDCs of the Manning and Phillips ResNet/Wireless routers. The initial look showed that the VDCs were in good order. However, 30 minutes later, we started seeing communication failures to many devices and access points in ResNet. A team was assembled, and after rebooting the Manning chassis, communication was restored. After the reboot, we noticed latency to certain access points. We determined that a previously detected bug from Aruba had reappeared, and one of the controllers which had lost communication during the initial outage was sending out a lot of multicast packets (3000 PPS). We disabled the link to the affected controller, and ping response times returned to normal. We are following up with the vendor. All issues were resolved around 5 AM. As this was during Saturday morning of Spring Break, this was not a critical issue.

2/28/2020 Latency in Internet traffic due to drops on border IPS

Border IPS units dropped many packets resulting in disruption on and off-campus. The issue started just before 4 PM and concluded around 5:50 PM.

6/25/2020 Google Fiber users unable to connect to campus VPN

For an undetermined reason, Google Fiber users were unable to communicate with the campus VPN for at least 1 hour in the morning. No changes were made by us or our provider, yet connectivity was restored.

10/3/2020 CRC errors on one link to MCNC

We received reports of issues communicating on/off-campus. It was determined that we were getting CRC errors on one of our four links to our ISP. We disabled the link generating errors and restored the service. We met with MCNC TAC support onsite. The link connectivity was restored after cleaning fibers and replacing a fiber jumper cable in the path.

12/15/2020 Issues connecting to eduroam

Post upgrade, we received reports of individuals unable to connect to eduroam. The main campus cluster experienced cluster health issues that prevented clients from being able to send traffic to their “anchor controller”. Part of the problem was that 2 controllers in the cluster were acting as VRRP master, which is not supposed to happen, especially since 1 of the controllers is configured to be the master via a higher priority and should also preempt. Once we fixed the split VRRP master issue, the network seemed to stabilize. We received a couple more reports of issues on eduroam, but they were not related to this issue. We’ve written scripts that can now quickly check the health of our clusters and the VRRP status of the individual controllers in each cluster. They are available from the development version of Wi-Py Tools at this time. We run these checks daily.
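The split-master condition described above is straightforward to detect once the VRRP state of each controller is known. The following Python sketch shows the idea only. It is not the Wi-Py Tools code: the function name, the data layout, and the controller names and priorities are all hypothetical, and it assumes the state has already been collected from the controllers by some other means.

```python
def check_vrrp(controllers):
    """Return a list of problems found in one VRRP group.

    `controllers` maps a controller name to a dict with a 'state'
    ('master' or 'backup') and a numeric 'priority'.
    """
    problems = []
    masters = [name for name, c in controllers.items() if c["state"] == "master"]
    if len(masters) > 1:
        problems.append(f"split master: {sorted(masters)}")
    elif not masters:
        problems.append("no master")
    # With preemption, the highest-priority controller should be the master.
    expected = max(controllers, key=lambda name: controllers[name]["priority"])
    if masters and expected not in masters:
        problems.append(f"{expected} has highest priority but is not master")
    return problems

# Hypothetical state resembling the incident: two masters at the same time.
cluster = {
    "controller-a": {"state": "master", "priority": 120},
    "controller-b": {"state": "master", "priority": 100},
    "controller-c": {"state": "backup", "priority": 100},
}
print(check_vrrp(cluster))
```

An empty list means the group looks healthy: exactly one master, and it is the controller with the highest priority.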

12/21/2020 Problems with AT&T ASE Service shut down connectivity to many off-campus locations

Just before 2 PM, we lost connectivity to many locations serviced by AT&T. These sites included FPG Child Development Institute, 1700 MLK, 116 Merritt Mill, Bank of America, Coats, Med Air, 2108 Umstead Road, Trailer 50. AT&T would restore service around 8 PM.

Minor Issues:

A minor issue is generally defined as an issue that had limited impact, and is generally an inconvenience, but does not stop the network from functioning.

2/9/2020 Brief traffic disruption on main campus

Around 2:26 AM, and resolving 5 minutes later, traffic to a small number of tier 1s was disrupted due to a flow of unknown source. We believe this may be an unknown unicast flood bug we have been investigating with Cisco, as we were unable to identify the source.

3/24/2020 Chapel Hill North / 1700 MLK Down

Communication was lost around 6:00 PM. A contractor had dug a little too deeply and cut into the fiber. As this circuit is managed by AT&T, they were responsible for repairing the damage. Connectivity was restored around 4 AM on 3/25.

9/17/2020 Bad module in Phillips Tier 1

In the evening, a module went bad that was connecting the Phillips Tier 1 to Sitterson-Books. It was replaced the same evening and connectivity was restored.

11/2/2020 Physician Office Building Down

During routine work to re-route campus fiber, the fiber to PoB was inadvertently cut in the morning. Connectivity was restored by 10 AM.

Explanation of major switching model types:

Cisco Nexus 7706 – These chassis-based routers act as the core of our network and feature high density 40Gbps/100Gbps capabilities. These switches provide almost all core routing for campus and were part of our major core redesign in 2018. They also serve as the layer 3 core for ITS data centers, providing an aggregate of 400Gbps from ITS Franklin and ITS Manning.

Arista – We use a variety of Arista switch models to provide high-density high-speed connections to ITS Research Computing as well as other research entities across campus.

Extreme Summit Series – The most current generation of fixed-format switches in our network, divided into many sub-categories (not shown). As we eliminate the older generation switches, we will break down this category into more granular models. These switches have a minimum of 1Gbps to the desktop, 10Gbps to the server, and either 10Gbps, 40Gbps, or 100Gbps uplinks. They come in copper or fiber-based form factors. Most will support 802.3at power over ethernet. Newer models will support 802.3bt power over ethernet.

Extreme SLX series – These high density 40Gbps/100Gbps switches are currently used as spine layer switches in the new data center design. They can support a high number of 100Gbps ports, will support 400Gbps in the future, and feature a deep packet buffer system that can eliminate packet drops from a congested network. They come in chassis and fixed format, and we will be considering this line for replacement of our current distribution tier 1 switches (S series).

Extreme 7100 series – Currently supported previous generation of fixed-format switches that feature 1Gbps to the desktop, 10Gbps to the server, and 10Gbps or 40Gbps uplinks. This switch is no longer available for purchase and runs a network operating system that will eventually be deprecated by the manufacturer. Most will support 802.3at power over ethernet.

Extreme G series – Past generation of modular switches that feature 1Gbps to the desktop. They are no longer available for purchase and have no software support. As modular switches, they can support up to 4 cards of 24 ports each. These switches will support only 1 card of 802.3af power, and its 10Gbps capability is channelized, substantially limiting its ability to carry high data rates. These switches will not be adequate for the power needs of the current generation access points and replacing them is a priority as we life cycle.

Extreme K / S series – Previous generation of chassis-based and fixed-format switches that act as the workhorses for many of our key distribution points across campus. They currently carry software support. They support 10Gbps connectivity at density, but are lacking in 40Gbps capabilities. Many of these switches will be replaced in the next 2 years as we attempt to move away from chassis-based switching in as many places as possible, toward fixed-format high density 40Gbps/100Gbps switches.

Extreme N series – These are switches that were introduced over 10 years ago and still exist in limited parts of campus. They are 1Gbps switches. These are a priority for replacement during the next year. Few exist on the campus network.

Explanation of major wireless model types:

AP-1XX – Aruba access points that feature 802.11n capabilities. Aruba has set the end of support date for AP-1XX access points as some time in 2021.

AP-2XX – Aruba access points that feature wave 1 of 802.11ac capabilities. Aruba has set the end of support date for these access points as some time in 2023.

AP-3XX – Aruba access points that feature wave 2 of 802.11ac capabilities. Aruba has not set the end of support date for these access points, and we consider these still current generation.

AP-5XX – Aruba access points that feature wave 1 of 802.11ax capabilities (pre standards ratification). We will be installing these access points as standard beginning in 2021.