tcom 509: udp, tcp/ip - internet protocols * obtained permission to use raj jain’s technical...

TCOM 509:UDP, TCP/IP - Internet

Protocols

* Obtained permission to use Raj Jain’s technical material

3

IP Routing

An example routing table

Destination Next-Hop Flags Network Interface127.0.0.1 127.0.0.1 H lo0Default 150.100.15.54 G emd0150.100.15.0 150.100.15.11 emd0

Search Order of routing table1: Complete match of destination IP2: Match the network addr (including the subnet ID)3: Use default router4: If all previous steps fail to find a suitable entry, send an

ICMP “host unreachable error”

H = 1 Destination is a complete host address G=1 Next-Hop to a router0 Destination is a network address G=0 Next-Hop to a directly connected destination

4

Routing is the process of discovering, selecting and following paths from the transmitting host to the receiving host in a network. There are two categories of routing algorithms:

Source Routing: The transmitting host inserts a list of routers that describe a path through the network.

Hop-by-Hop Routing: The transmitting host knows how to get to the first router. The router then employs its Routing Table to select the next best hop (router), which selects the next best router, etc.

ROUTING SUMMARY

Routing

SourceRouting

Hop-by-Hop Routing

Strict Loose Static Dynamic Default

DistanceVector

LinkState

PathVector

1. RIP2. IGRP

1. OSPF2. LS-LS

1. EGP2. BGP

Distance Vector: The router sends a list of networks, how far they are and the next hop direction. Link State: The router has a complete topology map of the network. Path Vector: The router sends a complete path to get to a destination.

5

IP Routing Strategies

Static Routing Pre-determined routes setup by

administrator Dynamic Routing

Interior Routing ProtocolsRIP, OSPF

Exterior Routing ProtocolsBGP

6

How Does IP Routing Work?

Basic procedure:1. search for a matching host address

(/32)2. search for a matching network address

(/x, where 0<x<32)3. search for a default entry (0/0)4. If all previous steps fail to find a

suitable entry, send an ICMP “host unreachable error”

IP packets are routed via a “best-match” or “longest-match” principle

7

Processing an IP packet

8

IP Source Routing

IP routing has no concept of the source determining the route

What if the source wanted to specify the packet’s path?

The source route option was added to the IP protocol in order to assist in route debugging. Nowadays, it seems to be mainly used by large ISPs, to make sure that their peers aren't inappropriately dumping traffic onto their backbone links. A packet is given a list of desired hops that should be taken on the way to the final destination.

9

IP Source Routing via IP Options

CODE = 131

10

Source Routing example

11

Types of Routes

Static All packets forwarded to predetermined

destinations defined by an administrator Dynamic

Packets are forwarded to dynamically calculated routes determined by a routing protocol

13

Static Routing

Benefits Good for small networks Can help create a secure network Efficiently uses router resources

Drawbacks Does not handle network failures well Does not scale well

14

Static Routing Example

NetworkNetwork192.168.5192.168.5

NetworkNetwork172.16172.16

NetworkNetwork1010

Network 192.168.6Network 192.168.6

DestinationDestinationNext HopNext Hop

10.0.0.010.0.0.0DirectDirect

172.16172.16

Router BRouter B

192.168.5192.168.5Router CRouter C



1010Router ARouter A

172.16172.16

Router BRouter B

192.168.5192.168.5DirectDirect

192.168.6192.168.6Router DRouter D



172.16172.16

DirectDirect





DefaultDefaultRouter CRouter C

Router ARouter A

RouterRouter BB Router CRouter C

Router DRouter D

15

Static Routing with Link Failure

NetworkNetwork192.168.5192.168.5

NetworkNetwork172.16172.16

NetworkNetwork1010

Network 192.168.6Network 192.168.6


1010DirectDirect

172.16172.16

Router BRouter B



DestinationDestination Next Next HopHop

1010UnreachableUnreachable

172.16172.16

Router BRouter B

192.168.5192.168.5 DirectDirect

192.168.6192.168.6Router DRouter D



172.16172.16

DirectDirect





DefaultDefaultRouter CRouter C

Router ARouter A

Router BRouter B Router CRouter C

Router DRouter D

16

Dynamic Routing

Routing tablesRouting tables NeighborsNeighbors

Interface statusInterface status All routersAll routers

Distance-VectorDistance-Vector

Link-StateLink-State

CommunicateCommunicatewhat?what?

BetweenBetweenwhom?whom?

17

Dynamic IP Routing Protocols

RIP (Distance Vector) OSPF (Link State) IS-IS (Link State) BGP (Path Vector)

18

Distance Vector vs. Link State

19

Concept of Administrative Distance

Connected 0

Static to Interface orStatic to Next Hop

1

E-IGRP (Cisco only) 90

OSPF 110

IS-IS 115

RIP v1 and v2 120

- Handles preferences when multiple sources of routing info exists in router

- Protocol with lowest admin weight wins

Administrative Distances of Routing Protocols: Measures trustworthiness of the source of route

Only one IGP route is installed in the routing table

20

Distance-Vector and Link State Protocol

RIP v1 Distance Vector

Hop Count Bellman-Ford

RIP v2 Distance Vector

Hop Count Bellman-Ford

OSPF Link State Bandwidth-based cost

Shortest Path First

IGRP Distance Vector

Composite Bellman-Ford

Protocol Category Metric Algorithm

21

What is RIP (Routing Info Protocol)?

RIP is a Interior Gateway Protocol (IGP) Used within an Autonomous System (AS)

A collection of routers under the same administrative authority

Two versions RIP v1 (RFC 1058) RIP v2 (RFC 2453)

22

Distance Vector Routing Protocol - RIP v1 - Characteristics

Directly connected subnets are known Routing updates are broadcasted to

neighbors Listen to routing updates Metrics are used Routing info consists of subnet and

metric Periodic updates (30 sec) A route is learned via a neighbor Failed route has a metric of infinite

23

RIP Uses UDP

RIP is a UDP-based protocol. Each router that uses RIP has a routing process that sends and receives datagrams on UDP port number 520, the RIP-1/RIP-2 port. All communications intended for another routers's RIP process are sent to the RIP port. All routing update messages are sent from the RIP port.

26

RIP Characteristics

Distance-vector routing protocol Updates contain routes (vectors) and the cost (distance)

to reach them and consist of the following steps: Each node calculates the distances between itself and all other

nodes within the AS and stores this information as a table. Each node sends its table to all neighboring nodes. When a node receives distance tables from its neighbors, it

calculates the shortest routes to all other nodes and updates its own table to reflect any changes

Does not scale well for large networks as every router has to add a RIP route for every newly added network

Hop count is used as the metric for path selection, based on Bellman-Ford distance-vector routing algorithm

Maximum allowable hop count is 15 Routing updates are broadcast every 30

seconds

27

RIP Message Types

Two message types Request message

Ask neighbors to send routes Response message

Carries route updates

Advertises 25 routes per update Router decides how to handle routes

in update Add, modify, or delete

28

RIP Routing Metrics

Counts the number of hops between source and destination Number of hops is the number of router hops Hop count equals the RIP metric RIP cannot determine measured delay,

reliability, load, or link bandwidth With multiple paths to the same

prefix, one with fewest hops is selected May not be optimum path

29

RIP in Action (1):

162.11.5.0

162.11.9.0 162.11.8.0

162.11.7.0162.11.10.0

Router A

Router BRouter C

s0

s0

s0

E0E0

s1

Tr0

1111

162.11.8.0 s0 1162.11.7.0 E0 1

Router A & C is down

Routing table B

30

RIP in Action (2):

162.11.5.0 1

162.11.9.0 1

162.11.5.0

162.11.9.0 162.11.8.0

162.11.7.0162.11.10.0

Router A

Router BRouter C

s0

s0

s0

E0E0

s1

Tr0

1111

162.11.5.0 s0 2

162.11.9.0 s0 2162.11.8.0 s0 1162.11.7.0 E0 1

Router C is downRouter A is switched on

31

RIP in Action (3):

162.11.5.0 s0 2

162.11.9.0 s0 2162.11.8.0 s0 1162.11.7.0 E0 1

162.11.10.0 s0 3

162.11.10.0 1

162.11.5.0 1

162.11.9.0 1

162.11.10.0 2

162.11.5.0

162.11.9.0 162.11.8.0

162.11.7.0162.11.10.0

Router A

Router BRouter C

s0

s0

s0

E0E0

s1

Tr0

1111

2222

Router C is switched on

32

RIP Timers

RIP uses numerous timers to regulate its performance. These include a routing-update timer, a route-timeout timer, and a route-flush timer.

Routing-update timer - clocks the interval between periodic routing updates. Generally, it is set to 30 seconds, with a small random amount of time added whenever the timer is reset. This is done to help prevent congestion, which could result from all routers simultaneously attempting to update their neighbors.

Route-timeout timer - Each routing table entry has a route-timeout timer associated with it. When the route-timeout timer expires, the route is marked invalid but is retained in the table until the route-flush timer expires. Default value is 120 secs

Route-flush timer - If 180 seconds elapse from the last time the timeout was initialized, the route is considered to have expired, and the deletion process described below begins for that route. Default value is 180 secs.

33

Bellman Ford’s Distance Vector Algorithm – Example

http://www.laynetworks.com/Simulation%20of%20Bellman%20Algorithm.htm

34

Disadvantages with the Bellman Ford’s Algorithm

Does not scale well Changes in network topology are not

reflected quickly since updates are spread node-by-node.

Counting to infinity (if link or node failures render a node unreachable from some set of other nodes, those nodes may spend forever gradually increasing their estimates of the distance to it, and in the meantime there may be routing loops)

http://en.wikipedia.org/wiki/Network_topology

http://en.wikipedia.org/wiki/Infinity

35

Count To Infinity Problem – URL Link

36

Improving Convergence Split Horizon

For interface X, don’t advertise routes out X that you learned via X prevents forwarding loops only for 2 adjacent router case joke analogy: if you tell me a joke and you get it, I don’t need to tell it back to you

Hold Down Timers refuse to accept any information for a period of time (60 secs) after a route is declared

unreachable can increase convergence time

Triggered Updates when a change occurs, send update immediately (don’t wait for next update interval) Change can be defined as an observed increase in hop count over time (1.6 –2.0 increase in

originally store hop count) Attempt to speed up convergence

Hold Down Timers and Triggered Updates Can be used together to be more effective

Split Horizon with Poison Reverse a.k.a. “Infinite Split Horizon” For interface X, DO advertise routes out X that you learned via X, but with a metric of INFINITY advantage: eliminates two-router loops disadvantage: increases the size of routing updates

None of these mechanisms can completely avoid routing loops and counting to infinity doesn’t go away

37

RIPv2 (rfc 2453) – Solves some of the RIPv1 ShortcomingsRIPv2 is classless and uses UDP port 520 as does RIPv1 (classful). It is still distance vector and still uses hop count as the metric with a max hop count of 15.

The ability to multicast saves other devices on the network from wasting time opening broadcast packets.

39

RIPv2 – Subnet Mask

Classless routing protocols carry the subnet mask. This allows all 0 and 1 subnets to be used, eliminating confusion between 172.16.255.255 and 172.16.255.255. Here, one is the 'all subnets' broadcast and one is broadcast on the all 1s subnet - but which is which?

If the subnet mask is sent then 172.16.255.255 /16 and 172.16.255.255 /24 can be differentiated.

40

RIPv2 – Route Tag

Each RIPv2 entry includes a Route Tag field, where additional information about a route can be stored. It provides a method for distinguishing between internal routes (learned by RIP) and external routes (learned from other protocols).

41

RIPv2 – Next Hop In RIPv2, each RIP entry includes a space where an explicit IP address can be entered as the next hop

router for datagrams intended for the network in that entry Specifying a value of 0.0.0.0 in this field indicates that routing should be via the originator of

the RIP advertisement.

The purpose of the Next Hop field is to eliminate packets being routed through extra hops in the system. It is particularly useful when RIP is not being run on all of the routers on a network. A simple example is given in Appendix A. Note that Next Hop is an "advisory" field. That is, if the provided information is ignored, a possibly sub-optimal, but absolutely valid, route may be taken. If the received Next Hop is not directly reachable, it should be treated as 0.0.0.0.

----- ----- ----- ----- ----- ----- |IR1| |IR2| |IR3| |XR1| |XR2| |XR3| --+-- --+-- --+-- --+-- --+-- --+-- | | | | | | --+-------+-------+----------------+-----+------+-- <-------------RIP-2-----------------> Assume that IR1, IR2, and IR3 are all "internal" routers which are under one administration (e.g. a campus) which has elected to use RIP-2 as its IGP. XR1, XR2, and XR3, on the other hand, are under separate administration (e.g. a regional network, of which the campus is a member) and are using some other routing protocol (e.g. OSPF). XR1, XR2, and XR3 exchange routing information among themselves such that they know that the best routes to networks N1 and N2 are via XR1, to N3, N4, and N5 are via XR2, and to N6 and N7 are via XR3. By setting the Next Hop field correctly (to XR2 for N3/N4/N5, to XR3 for N6/N7), only XR1 need exchange RIP-2 routes with IR1/IR2/IR3 for routing to occur without additional hops through XR1. Without the Next Hop (for example, if RIP-1 were used) it would be necessary for XR2 and XR3 to also participate in the RIP-2 protocol to eliminate extra hops.

42

RIPv2 - Authentication

8 bits 8 bits 8 bits 8 bits

Command Version Unused - set to all zeros

0XFFF Authentication Type

Password (bytes 0-3)




RIPv2 authenticates the source of the packets. The source of the update uses the first field of the message that would normally carry IP address, SM, Next Hop, Metric and hijacks these for authentication. This leaves room for only 24 updates per packet instead of 25 with RIPv1.

A password is indicated if the AFI field is set to 0XFFF. The authentication type for simple authentication is set to 0X002. The password is left justified and unused bits are set to zero.

MD5 authentication may be enabled to overcome plain-text authentication. Use the Authentication Type field to identify the method used. MD5 computes a 128-bit hash value from plain text plus password. This hash is transmitted along with the message and the hash is recalculated at the far end and the received and calculated hash values are checked against each other. If they match, the message is authenticated.

43

RIP v2 Packet Format

00 88 1616 2424 3131

CommandCommand VersionVersion Reserved (Must be zero)Reserved (Must be zero)

Route TagRoute TagAddress Family IdentifierAddress Family Identifier

IP AddressIP Address

Next HopNext Hop

……

MetricMetric

Route TagRoute TagAddress Family IdentifierAddress Family Identifier

IP AddressIP Address

Subnet MaskSubnet Mask

Next HopNext Hop

MetricMetric

Subnet MaskSubnet Mask

44

RIP Limitations Maximum network diameter = 15 Lack of alternative routes. RIPv2 keeps only one route

to a destination in routing tables. It has to wait for updates after a failure to assess whether a new (if any) route exists

Regular updates include entire routing table approximately every 30 seconds

Poison reverse increases the size of the routing updates Count to infinity slows route loop prevention Metrics only involve hop count Broadcasts between neighbors (RIPv1 only) Classful routing means no prefix length carried in route

updates (RIPv1 only) – and no VLSM No authentication mechanism (RIPv1 only) Slow convergence

46

OSPF Concept : Having the Same Copy of Network Topology at Every Node R1 LSA

R2 LSA

R3 LSA

R4 LSA R6 LSA

R5 LSA

xyz LSA

abc LSA

pdq LSA

47

SPF Algorithm

The Shortest Path First (SPF) routing algorithm is the basis for OSPF operations. When an SPF router is powered up, it initializes its routing-protocol data structures and then waits for indications from lower-layer protocols that its interfaces are functional.

After a router is assured that its interfaces are functioning, it uses the OSPF Hello protocol to acquire neighbors, which are routers with interfaces to a common network. The router sends hello packets to its neighbors and receives their hello packets. In addition to helping acquire neighbors, hello packets also act as keepalives to let routers know that other routers are still functional.

On multi-access networks (networks supporting more than two routers), the Hello protocol elects a designated router and a backup designated router. Among other things, the designated router is responsible for generating LSAs for the entire multi-access network. Designated routers allow a reduction in network traffic and in the size of the topological database.

When the link-state databases of two neighboring routers are synchronized, the routers are said to be adjacent. On multiaccess networks, the designated router determines which routers should become adjacent. Topological databases are synchronized between pairs of adjacent routers. Adjacencies control the distribution of routing-protocol packets, which are sent and received only on adjacencies.

Each router periodically sends an LSA to provide information on a router's adjacencies or to inform others when a router's state changes. By comparing established adjacencies to link states, failed routers can be detected quickly, and the network's topology can be altered appropriately. From the topological database generated from LSAs, each router calculates a shortest-path tree, with itself as root. The shortest-path tree, in turn, yields a routing table.

48

How OSPF Protocol Works

Stage 1: Discovering Neighbors => Hello Message

Stage 2: Electing the Designated Router => Hello Message

Stage 3: Establishing Adjacencies => DB Description Msgs

Stage 4: Propagating Link State Information => Flooding using LS Request/Update Msgs)

Stage 5: Calculating the Routing Table(s) => (Diksjtra’s Algorithm)

49

First Requirements for the new IGP to Replace RIP

Had to be more efficient than RIP consume fewer network resources: link bandwidth

and CPU cycles Faster convergence than RIP

communicate changes quickly link, interface, and router failures

More descriptive metric than RIP hop-count limitations ability to include other factors (bandwidth, delay,

reliability, etc.) Figure 3.2

OSPF ended up using cost 16 bits, no limit on total path cost eliminated network diameter limitations

50

New IGP Requirements (2)

Support for load balancing over multiple equal-cost links to a destination more efficient use of network resources implementation was not mandated by the

protocolmultiple strategies exist: flow-based, round-robin,

hash function, packet-by-packet in theory, multiple vendor strategies can be

combined in a single network and it can still work, but this needs to be examined closely in some cases

Support for a routing hierarchy split the AS up into mini-AS’s, in a sense a scalability mechanism

51

New IGP Requirements (3)

Separate internal and external routes RIPv1 had no way to distinguish You generally trust info from your AS over routes

from outside your AS Support for more flexible subnetting

essentially CIDR addressing and notation, no notion of classful routing

Security ability to control what routers participate in OSPF

routings based on a password ToS-based routing

allow specification of different metrics for each of the original ToS categories

in reality, never really used; chicken-and-egg problem

53

What is OSPF?

An IGP using Link-State technique to update routing tables

Based on the shortest path first (SPF) algorithm, also known as the Dijkstra algorithm

Created to fill the need for a high functionality, standards-based IGP for the TCP/IP protocol family

Main RFCs: 1587 – OSPF NSSA Option 2328 – OSPF Version 2 (current implementation)

54

What is a Link-State Protocol ?

Link = router interface State = description of interface and its

relationship to neighboring routers OSPF routers send link-state

advertisements (LSAs) to all other routers within the same hierarchical area

Routers store information in a link-state, or topological, database

Each OSPF router uses the SPF algorithm to calculate the shortest path to each node

55

Three (3) Types of OSPF LS Messages1. LSA (Link State Advertisement): LSAs are included in the

database description packets (DDPs or DBDs). LSA entries include link-state type, the address of the advertising router, the cost of the link, and the sequence number.

2. LSR ( Link State Request): When a slave router receives an DDP (Database Description Packet), it sends and LSAck packet. Then it compares the received information with the information it has. If the DDP has more recent information, the slave router sends a link-state request (LSR) to the master router.

3. LSU ( Link State Update): LSU packet is sent in response to LSR (Link-State Request) packet sent from a slave router to a master router. LSU contains complete information about the requested entry.

56

What is SPF?

Places each router at the root of a tree and calculates the shortest path to each destination based on the cumulative cost to reach that destination

Each router has its own view of the topology even though all the routers build a shortest path tree using the same link-state database

57

SPF Cost

Cost, or metric, of an interface indicates the overhead required to send packets across that interface Cost = 10**8/bandwidth (bps) Higher bandwidth = lower cost 10M Ethernet line cost = 10**8/10**7 = 10 T1 line cost = 10**8/1544000 = 64

To handle hi-speed links, use a value greater than 10**8 in the cost calculation This is the Reference Bandwidth

59

Shortest Path Tree

Router A’s SPF tree A is the Root; use the least-cost path to each IP prefix If a link goes down, the SPF tree is recalculated

Each router calculates its own SPF tree

Router ARouter A

Router BRouter B

Router DRouter D

222.211.10.0222.211.10.0

1010

55

77192.213.11.0192.213.11.0

128.213.0.0128.213.0.0

Router DRouter D

55

00

1010

1010

60

Dijkstra’s Link State Algorithm

Principle: Dijkstra's algorithm works on the principle that the shortest possible path from the source has to come from one of the shortest path

already discovered.

Layman’s Terms: Using the street map, you're marking over the streets (tracing the street with a marker) in a certain order, until you have a route marked in from the starting point to the destination. The order is conceptually simple: from all the street intersections of the already marked routes, find the closest unmarked intersection - closest to the starting point (the "greedy" part). It's the whole marked route to the intersection, plus the street to the new, unmarked intersection. Mark that street to that intersection, draw an arrow with the direction, then repeat. Never mark to any intersection twice. When you get to the destination, follow the arrows backwards. There will be only one path back against the arrows, the shortest one.

Demo: http://www.eng.tau.ac.il/~shtilman/C-Programming/Year2/dijkstra.html http://www.oopweb.com/Algorithms/Documents/PLDS210/Volume/dij-

op.html

http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/dijkstra.html/











http://www.eng.tau.ac.il/~shtilman/C-Programming/Year2/dijkstra.html

61

OSPF Breaks an AS into Areas

ABRABR

ABRABR

Area 0Area 0

Area Area 234234

Area 10Area 10

AS 100AS 100

63

Area Sizing Guidelines

Rules of thumb for non-backbone area No more than 100 routers No more than 50 neighbors per router

Decrease when media unstable Consider static/default and demand

techniques Decrease when large numbers of

externals injected Consider if the incoming externals can be

summarized or filtered

64

When Might Single-Area OSPF make sense?

Fewer than 50 routers with alternate paths

Needs: multivendor compatibily fast convergence VLSM complex defaults and externals

No clear candidates for core OSPF power greatest with hierarchy Multiple domains may be better than 1 area

65

Design Guidelines – Network Topology (Cont’d) - OSPF Network Size Recommendation

66

Design Guidelines – Network Topology (Cont’d) - How Many Areas Should Be Connected per ABR?

67

OSPF : Location of different routers

68

Different Types of OSPF routers

Internal router: An internal router has all the interfaces in the same area. All internal routers have same link state databases.

Backbone router: Backbone routers sit on the perimeter of Area 0, with at least one interface connected to backbone (Area 0).

Area Border Router (ABR): ABRs are routers that have interfaces attached to multiple areas. It may be noted that these routers maintain separate link-state databases for each area that they are connected. They are capable of routing traffic destined for or arriving from other areas.

Autonomous System Boundary Router (ASBR): These are the routers that have at least one interface to the external network (another autonomous system). This autonomous network can be non-OSPF. ASBRs are capable of route redistribution, a term used to imply that the concerned router can import routing information from non-OSPF networks and distribute the same in OSPF network for which it is responsible and visa versa.

69

OSPF Terminology

Area ID: A 32 bit number identifying an area. Acquired from IANA.

Router ID: A 32 bit number identifying a router. Normally the lowest numbered IP address belonging to a router.

Router Priority: An 8 bit number that indicates this router’s willingness to be a designated/backup designated router.

A router priority of Zero indicates that this router is ineligible to be a designated router.

LINK State Advertisement: Exchanged by adjacent routers to allow area topology databases to be maintained and inter-area and intra-AS routes to be advertised.

The are five types of link state advertisements.

72

OSPF NETWORKS

OSPF supports three kinds of connections and networks.

Point-to-Point between exactly two routers. Multi-access networks with broadcasting(e.g., Ethernet, T-R, etc) Multi-access networks without broadcasting (e.g., packet switching WANs)

Point-to-Point Network Multi-access w/ Broadcasting

X.25 NETWORK

Multi-access w/o Broadcasting

73

PROTOCOL ENCAPSULATION and OSPF PROTOCOL NUMBER

FCSPREAMBLE DESTINATION ADDR00 00 1B 12 23 34

SOURCE ADDR 00 00 1B 09 08 07

FIELDTYPE

ETHERNET

Protocol Type89

Source IP Address: 128.66.12.2

Destination IP Address: 224.0.0.5

IP Header

IPHEADER OSPF

OSPF

DATA LINKLAYER

NETWORKLAYER

NOTE OSPF uses direct IP encapsulation. Protocol 89 is used for OSPF. OSPF is sent as multicast on pt-to-pt and broadcast networks.

224.0.0.5

75

OSPF MESSAGE TYPES

HELLO (Type 1) is used to:identify neighbors, to elect a designated Router for multi-access network, to find out about an existing Designated Router and as "I'm alive" signal.

DATABASE DESCRIPTION (Type 2) is used to exchange information during initialization so that a router can find out what data is missing from its topology database. Each LSA is preceded by a common LS Advertisement Header.

LSA Specific Type 1 Router Link Advertisement LSA Specific Type 2 Network Link Advertisement LSA Specific Type 3 Summary Link Advertisement to other Areas. LSA Specific Type 4 Summary Link Advertisement to ASBR. LSA Specific Type 5 AS External Link Advertisement

LINK STATE REQUEST (Type 3) is used to ask for data that a router has discovered is missing from its topology database or to replace data that is out of date.

Database descriptions are exchanged first thenLink State request are submitted to resolve missing or suspicious data.

LINK STATE UPDATE (Type 4) is used to reply to a Link State Request and also to dynamically report changes in network topology. LINK STATE ACKNOWLEDGEMENT (Type 5) is used to confirm receipt of a Link State Update. The sender will retransmit until an update is ACKed.

76

Link State Advertisement Types

Router Link Advertisement (LSA Type 1) Generated by all OSPF routers and describe the state of the router's

interface (links) within the area. They are flooded throughout a single area only.

Network Link Advertisement (LSA Type 2) Generated by the Designated Router (DR) on a multi-access network and

lists the routers connected to the network. They are flooded throughout a single area only.

Summary Link Advertisements Generated by Area Border Routers (ABR) and flooded throughout a single

area only. There are two types: A summary advertisement (LSA Type 3) describing routes to destinations in other areas within the same AS. A summary advertisement (LSA Type 4) describing routes to AS Boundary Routers. For routers to get information out of the AS.

AS External Link Advertisement (LSA Type 5) Generated by the AS Boundary Routers(ASBR) to describe routes to

destinations external to the OSPF network. They are flooded to all areas in the OSPF network.

77

ASBR

External LinksType 5

Originated by an ASBR.Describe destinations externalto the autonomous system or adefault route to the outside AS.

DR

Network LinksType 2

Originated for multi-access segments with more than one attached router. Describe all routers attached to the specific segment. Originated by a Designated Router (discussed later on).

Router LinksType 1

Describe the state and cost of the router’s links (interfaces) to the area (Intra-area).

Summary LinksTypes 3 and 4

Originated by ABRs only.Describe networks in the AS but outside of area (Inter-area).Also describe the location of the ASBR.

ABR

LSA Types Used in Flooding

78

LSA Specific - Description LSA Specific Type 1 (Router Description)

These are the router-LSAs. They describe the collected states of the router's interfaces. For more information, consult Section 12.4.1.

LSA Specific Type 2 (Network Description) These are the network-LSAs. They describe the set of routers attached to

the network. LSA Specific Type 3 or 4

These are the summary-LSAs. They describe inter-area routes, and enable the condensation of routing information at area borders. Originated by area border routers,

The Type 3 summary-LSAs describe routes to networks (Network Description)

Type 4 summary-LSAs describe routes to AS boundary routers. (Router Description)

LSA Specific Type 5 (Network Description) These are the AS-external-LSAs. Originated by AS boundary routers, they

describe routes to destinations external to the Autonomous System. A default route for the Autonomous System can also be described by an AS-external-LSA.

79

BASIC OSPF OPERATIONAL SEQUENCE

S1: Routers discover their OSPF neighbors. When the OSPF routers first start they establish and maintain a relationship with their neighbors using the Hello protocol.

S2: Routers elect a Designated Router (DR) and a Backup Designated Router (BDR) for a network (LAN) with multiple routers using the Hello protocol.S3: The routers form adjacencies.

For routers on Multi-access networks all routers become adjacent to the DR and the BDR.

80

BASIC OSPF OPERATIONAL SEQUENCE Contd

S4: Adjacent routers then exchange Database Description packets which may be part or all of the routers Link State Database.

The adjacent routers then synchronize their Link State Databases by requesting missing or outdated information on the advertised links. This is done through a Link State Request packet . The response is a Link State Update packet. A Link State Acknowledgement packet is used to confirm the correct receipt of a Link State Update packet.

S5: The routers then calculate the routing table by running the Shortest Path First(SPF) algorithm using the Link State Database as input.

The routers periodically engage in advertising its Link States based upon a refresh timer expiration or a link state change.They then recalculate their routing table.

81


S1: Discovering Neighbors => Hello Messages

S2: Electing the Designated Router => Hello Messages

S3: Establishing Adjacencies => DB Description Msgs

S4: Propagating Link State Information => Flooding using LS Request/Update Msgs)

S5: Calculating the Routing Table(s) => (Diksjtra’s Algorithm)

82

Discovering Neighbors – Hello Protocol

83

Hello Exchange Process – Pt-to-Pt Link

84

Hello Exchange Process – Ethernet Link

85

OSPF HELLO MESSAGE

NETWORK MASK: This field contains the subnet mask of the network over which the message was sent (the mask associated with the interface). If this field does not match the receiving router's mask for that network, the

receiving router rejects the Hello message and does not accept the transmitting router as a neighbor.

In the absence of subnetting it is set to the default subnet mask. HELLO INTERVAL: This field tells how often in seconds this router transmits its Hello

messages. A Broadcast is normally 10 seconds. A non-Broadcast is normally every 30 seconds. HelloInterval and RouterDeadInterval fields in sent OSPF packet must match

with the settings configured in the receiving interface.

Authentication (octets 0-3)

VersionMessage

Type Message Length

Router Identification

Area Identification

Checksum Authentication Type


0 8 16 24 31

Neighbor One IP Address

Network Mask

Designated Router

Backup Designated Router

Dead Interval Timer

Hello Interval Router Priority

Common Message Header

Hello Message TypeIdentifies neighborsElects the Designated Router(DR)Find out about an existing DRAn Alive Signal

Options

...

E T

86


S1: Discovering Neighbors => Hello Message

S2: Electing the Designated Router => Hello Message




87

DR and BDR Election - Example

88







90

Database Sync Process In a link-state routing algorithm, it is very important for all routers' link-state databases to stay

synchronized in order to have a compatible routing tables. OSPF simplifies this by requiring only adjacent routers to remain synchronized. The synchronization process begins as soon as the routers attempt to bring up the adjacency. Each router describes its database by sending a sequence of Database Description packets to its neighbor. Each Database Description Packet describes a set of LSAs belonging to the router's database. This sending and receiving of Database Description packets is called the "Database Exchange Process". During this process, the two routers form a master/slave relationship. Each Database Description Packet has a sequence number. Database Description Packets (DDPs) sent by the master (polls) are acknowledged by the slave through echoing of the sequence number. The DB exchange initially only sends the LSA headers and not the LSA info to achieve better BW and processing efficiencies.

The master is the only one allowed to retransmit Database Description Packets. It does so only at fixed intervals, the length of which is the configured per-interface constant RxmtInterval. Each Database Description contains an indication that there are more packets to follow --- the M-bit. The Database Exchange Process is over when a router has received and sent Database Description Packets with the M-bit off.

During and after the Database Exchange Process, each router has a list of those LSAs for which the neighbor has more up-to-date instances. These LSAs are requested in Link State Request Packets. Link State Request packets that are not satisfied are retransmitted at fixed intervals of time RxmtInterval. When the Database Description Process has completed and all Link State Requests have been satisfied, the databases are deemed synchronized and the routers are marked fully adjacent. At this time the adjacency is fully functional and is advertised in the two routers' router-LSAs.

Criteria for determining adjacency between 2 routers: Have the same number of LSAs in their LSDBs Sum of their LSA’s LS Checksum fields are equal

91

OSPF DB DESCRIPTION MESSAGE

NETWORK MASK: This field contains the subnet mask of the network over which the message was sent (the mask associated with the interface). If this field does not match the receiving router's mask for that network, the

receiving router rejects the Hello message and does not accept the transmitting router as a neighbor.

In the absence of subnetting it is set to the default subnet mask. HELLO INTERVAL: This field tells how often in seconds this router transmits its Hello

messages. A Broadcast is normally 10 seconds. A non-Broadcast is normally every 30 seconds. HelloInterval and RouterDeadInterval fields in sent OSPF packet must match

with the settings configured in the receiving interface.


VersionMessage

Type Message Length


Area Identification



0 8 16 24 31

Common Message Header

DB Descrioption Message TypeEstablishing adjacency

LS Age

Link State Identification

Advertising Router

0 8 16 24 31

Options LS Type

Link State Sequence Number

LS Checksum Length

92

OSPF DB DESCRIPTION MESSAGE WITH LINK ADVERTISEMENT HEADER

LS Age: A 16 bit number indicating the time in seconds since the origin of the advertisements. This time increases as the link state advertisement resides in the router database and/or with each hop count. When it reaches a maximum value, normally one hour, it is discarded unless needed for synchronization.

Options: See the Hello Packet. LS Type: This field specifies which of five different link state advertisements is contained in this header. Link State ID: A unique ID for the advertisement which is dependent upon the message type.

LSA Message Types 1/4 uses the Router ID. LSA Message Type 2 uses the IP address of the Designated Router. LSA Message Types 3/5 uses an IP network number.

LS Types are:Type 1: Router LinksType 2: Network LinksType 3/4: Summary LinksType 5: External Links

Headers contain enough information to identify the LS records needed during synchronization.The receiver marks the LS Records to be requested.


VersionMessage

Type Message Length


Area Identification



LS Age


Advertising Router

0 8 16 24 31

Options LS Type


LS Checksum Length

93

OSPF DB DESCRIPTION MESSAGE WITH LINK ADVERTISEMENT HEADER (CONT’D)

Advertising Router: The Router ID of the router that originated the link state advertisement.LSA Message Type 1 is identical to the Link State ID.LSA Message Type 2 uses the Router ID of the Network's Designated Router.LSA Message Types 3/4 use Router ID of the Area Border Router.LSA Message Type 5 uses the Router ID of the AS Boundary Router.

LS Sequence Number: This field is used to sequence the advertisements and to detect duplicate or old packets. LS Checksum: The Checksum of the complete Link State Advertisement excluding the LS Age field. Length: The length is the size of the advertisements in bytes including the Advertisement Header.

LS Types are:Type 1: Router LinksType 2: Network LinksType 3/4: Summary LinksType 5: External Links

Headers contain enough information to identify the LS records needed during synchronization.The receiver marks the LS Records to be requested.


VersionMessage

Type Message Length


Area Identification



LS Age


Advertising Router

0 8 16 24 31

Options LS Type


LS Checksum Length

94

LSA Flooding - Operations

95

Example of Router LSAs

96

Example of Network LSAs

97

Example of Summary LSAs

98

External Route LSAs - Example

99







100

Propagating LS Info: When a link state changes

101

Propagating LS Info: When a link state changes

103







104

Pros and Cons of OSPF

Advantages of OSPF: 1.Changes in an OSPF network are propagated quickly. 2.OSPF is heirarchical, using area 0 as the top as the heirarchy. 3.OSPF is a Link State Algorithm. 4.OSPF supports Variable Length Subnet Masks (VLSM). 5.OSPF uses multicasting within areas. 6.After initialization, OSPF only sends updates on routing table sections which have changed, it does not send the entire routing table. 7.Using areas, OSPF networks can be logically segmented to decrease the size of routing tables. Table size can be further reduced by using route summarization. 8.OSPF is an open standard, not related to any particular vendor. 9.Can load-balance up to 6 equal-cost routes with 4 as the default

Disadvantages of OSPF: 1.OSPF is very processor intensive. 2.OSPF maintains multiple copies of routing information, increasing the amount of memory needed. 3.Using areas, OSPF can be logically segmented (this can be a good thing and a bad thing). 4.OSPF is not as easy to learn as some other protocols. 5.In the case where an entire network is running OSPF, and one link within it is "bouncing" every few seconds, OSPF updates would dominate the network by informing every other router every time the link changed state

106

Networks and Routers under a single administrative authority

Each AS is assigned a numberAS numbers range form 1 to

65,535

BGP - Autonomous System

107

Different AS Types

http://ipmon.sprint.com/pubs_trs/tutorials/Taft_BGP.pdf (slide 30)

http://ipmon.sprint.com/pubs_trs/tutorials/Taft_BGP.pdf

108

BGP is

An Exterior Gateway Protocol (EGP), used to propagate tens or hundreds of thousands of routes between networks (ASs).

The only protocol used to do this on the Internet today.

109

What is BGP?

BGP is an inter-domain routing protocol that communicates prefix reachability

BGP is a path vector protocol Similar to distance vector

BGP views the Internet as a collection of autonomous systems

Stability is very important to the Internet and BGP

BGP supports CIDR BGP routers exchange routing information

between peers Defined in RFC 1771

110

How Does BGP Work? BGP uses TCP as its transport protocol (port 179). Two BGP routers form a

TCP connection between one another (peer routers) and exchange messages to open and confirm the connection parameters.

BGP routers exchange network reachability information. This information is mainly an indication of the full paths (BGP AS numbers) that a route should take in order to reach the destination network. This information helps in constructing a graph of ASs that are loop-free and where routing policies can be applied in order to enforce some restrictions on the routing behavior.

Any two routers that have formed a TCP connection in order to exchange BGP routing information are called peers, or neighbors. BGP peers initially exchange their full BGP routing tables. After this exchange, incremental updates are sent as the routing table changes. BGP keeps a version number of the BGP table, which should be the same for all of its BGP peers. The version number changes whenever BGP updates the table due to routing information changes. Keepalive packets are sent to ensure that the connection is alive between the BGP peers and notification packets are sent in response to errors or special conditions.

111

BGP Fundamentals BGP peers exchange routes and send updates not

faster than every 90 seconds by default.

Routes consist of destination prefixes with an AS path and BGP-specific attributes

Each BGP update contains one path advertisement and attributes

Many destinations can share the same path

BGP compares the AS path and attributes to choose the best path

Unfeasible routes can be advertised Unreachable routes are withdrawn

112

BGP Connections

BGP updates are incremental No regular refreshes Except at session establishment, when

volume of routing can be high BGP runs over TCP connections

TCP port 179 TCP Services

Fragmentation, Acknowledgements, Checksums, Sequencing, and Flow Control

No automatic neighbor discovery

113

BGP Peering

BGP sessions are established between peers BGP Speakers

Two types of peering sessions E-BGP (external) peers with different ASs I-BGP (internal) peers within the same AS

Still requires interior gateway protocols (IGPs) IGP connects BGP speakers within the AS IGP advertises internal routes

114

iBGP

AS 3847When BGP speakers in the same AS form a BGP connection for the purpose of exchanging routinginformation, they are said to be running IBGP or internal BGP.

IBGP speakers are usually fully-meshed.

B

A

c

115

eBGP (1)

AS 3561

AS 3847

When BGP speakers in different ASs form a BGP connection for the purpose of exchanging routinginformation, they are said to be running EBGP or external BGP.

EBGP peers are usually directly connected.

B

A

116

eBGP (2)

AS 2033

AS 4200

AS 7007

AS 2041

117

AS 7007XP

AS 1239

AS 6079

AS 701

AS 4006

iBGP and eBGP Diagram

118

eBGP Rules

By default, only talks to directly-connected router.

Sends the one best BGP route for each destination.

Sends all of the important “attributes”; omits the “local preference” attribute.

Adds (prepends) the speaker’s ASN to the “as-path” attribute.

Usually rewrites the “next-hop” attribute.

119

iBGP Rules

Can talk to routers many hops away by default.

Can only send routes it “injects”, or routes heard DIRECTLY from an external peer.

Thus, requires a FULL mesh. Sends all attributes. Leaves the as-path attribute alone. Doesn’t touch the “next hop”

attribute.

120

Logical view of 16 routers, fully meshed

121

iBGP Restriction (1)iBGP Restriction (1)

Assume AS1239 sends route 10.0.0.0/8 to AS2828. Router A will send that route to Routers B and C.

AS 2828AS 1239A

B

C

122

iBGP Restriction (2)iBGP Restriction (2)

When Router B receives 10.0.0.0/8, it will not propagate that route to Router C because it was learned from an iBGP neighbor. Router C will behave similarly.

AS 2828AS 1239A

B

C

123

BGP Route Advertisement

Only advertise the active BGP routes to peers (by default) BGP Next-hop must be reachable

Never forward I-BGP routes to I-BGP peers Prevents loops

Withdraw routes if active BGP routes become unreachable

124

CIDR and Aggregate Addresses

Router ARouter A

Router CRouter C

Router BRouter B

3.3.3.13.3.3.1

3.3.3.23.3.3.2

AS 1AS 1

2.2.2.22.2.2.2

AS 3AS 3

2.2.2.12.2.2.1

AS 2AS 2

192.168.0/22192.168.0/22

192.168.0.0/2192.168.0.0/24 4 192.168.1.0/2192.168.1.0/24 4 192.168.2.0/2192.168.2.0/24 4 192.168.3.0/2192.168.3.0/244

(1) AS 2 has (1) AS 2 has the detailed the detailed

routesroutes

(2) BGP with (2) BGP with Routing Policy Routing Policy can advertise a can advertise a

prefix that prefix that aggregates the aggregates the detailed routesdetailed routes

(3) AS 1 learns (3) AS 1 learns only the only the

aggregate and aggregate and not the detailsnot the details

192.168.0/22192.168.0/22

125

IBGP, EBGP Example

IBGP

EBGP EBGP

AS 1 AS 3

AS 2

126

Advertising Networks

Using the Network commandRedistributing static routesRedistributing Dynamic routes

127


Using Network Command

Router A router bgp 1 neighbor 1.1.1.2 remote-as 2 network 11.0.0.0 network 12.0.0.0

Router B router bgp 2 neighbor 1.1.1.1 remote-as 1 network 92.0.0.0 network 93.0.0.0

AS1A

AS2

B

EBGP

11.0.0.012.0.0.0

92.0.0.093.0.0.0

128


By redistributing Static Routes

Router A router bgp 1 neighbor 1.1.1.2 remote-as 2 redistribute static

ip route 11.0.0.0 255.0.0.0 null 0ip route 12.0.0.0 255.0.0.0 null 0

AS1A

AS2

B

EBGP

11.0.0.012.0.0.0

92.0.0.093.0.0.0

129


Router A router bgp 1 neighbor 1.1.1.2 remote-as 2 redistribute ospf 1

router ospf 1 network 11.0.0.0 0.255.255.255 area 0 AS2

AS1A

B

EBGP

11.0.0.012.0.0.0

92.0.0.093.0.0.0

By Redistributing Dynamic Routes

130

BGP Attributes

AS-pathNext-hopLocal preferenceMEDOriginCommunities

131

BGP Attributes

Path traversed one or more members of a set

{1880, 1881, 1882} (as-set)

A list of AS’s that a route has traversed

1880 1883 (sequence)

Shortest AS path preferred

AS-Path

1880193.0.34/24 1882

193.0.35/24

1881193.0.33/24

193.0.33/24 1880 1881193.0.34/24 1880193.0.35/24 1880 1882193.0.32/22 1880 1983

1883193.0.32/24

132

1755

690

200

1880

1883

209

BGP Attributes

Multi-Exit Discriminator (MED)

Preference sent to all routers in remote ASWhere do I want to receive the traffic ?

133

Indication to external peers of the preferred path into an AS.

Affects routes with same AS path.Advertised to external neighborsUsually based on IGP metric* Lowest MED preferred

Multi-Exit Discriminator (MEDMulti-Exit Discriminator (MED)

134

MED Attribute (2)MED Attribute (2)

The MED (multi-exit discriminator) is a commonly used attribute. It comes after the AS_PATH in evaluation, and thus isn’t quite as much of a “hammer” as local-pref.

Commonly, MED is used to tack a distance on BGP routes as they move within your network.

NSPs advertise MEDs to each other to let it be known which POP the route is “closest” to.

135

BGP Attributes

Preference sent to all routers in local ASWhere do I want traffic to leave?

Needs to go to 690A

1755 1880

666

690

Local Preference

135© 1998, Cisco Systems, Inc.

102NW’98

136

Local to AS Used to influence BGP path selection Default 100* Highest local-pref

preferred

AS 6201

208.1.1.0/24

A B

208.1.1.0/24 100Preferred by all AS3847 routers

208.1.1.0/24 80

AS 3847

GF E

C D

Local Preference AttributeLocal Preference Attribute

137

Local-Pref Attribute (2)Local-Pref Attribute (2)

An often-used attribute, local-pref (normally 100) overrides AS_PATH, and is transitive throughout your network. It is never advertised to an eBGP peer.

For example, you can express the policy “prefer private interconnects” by making the local_pref be 150 and leaving all other peers at 100.

Best used as an intermediate-level knob.

138

The BGP Path Decision Algorithm BGP determines the best path to each destination for a

BGP speaker by comparing path attributes according to the following selection sequence:

1. Select a path with a reachable next hop. 2. Select the path with the highest weight. 3. If path weights are the same, select the path with the highest

local preference value. 4. Prefer locally originated routes (network routes,

redistributed routes, or aggregated routes) over received routes.

5. Select the route with the shortest AS-path length. 6. If all paths have the same AS-path length, select the path

based on origin: IGP is preferred over EGP; EGP is preferred over Incomplete.

7. If the origins are the same, select the path with lowest MED value.

8. If the paths have the same MED values, select the path learned via EBGP over one learned via IBGP.

9. Select the route with the lowest IGP cost to the next hop. 10.Select the route received from the peer with the lowest BGP

router ID.

139

Common Internet Routing Phenomemon

E-BGP Route Flapping/Oscillation Remedy: Route Flap Dampening

140

BGP - Route Flapping

Routing instability Routes disappear, appear again, then

disappear Withdrawal, announcement, withdrawal,

announcement Visible to the Internet

Waste resources Some causes of route flapping

Flaky inter-AS links Flaky or insufficient hardware Link congestion IGP instability Operator error

141

BGP – Route Flap Dampening If you are running BGP version 4, the BGP process assigns a penalty

of 1000 to the route each time it flaps. When the penalty value exceeds the first of two limits (Re-use limit, Suppress limit), the route is moved into the 'historical' list of routes, dampened, and then is no longer accepted from other peers or announced to any peers. After the first limit has been exceeded, the timer which tracks the period for which the route is to be dampened is doubled for each flap.

The suppression half-life is 15 minutes. The maximum suppress limit is four times the half-life; thus, one hour is the default. The suppression penalty decays at half the half life (7.5 minutes). So:

1. First flap, penalty 1000 assigned, route placed in 'historical' category and becomes less preferred.

2. Second flap, route has met the suppression limit of 2000 (a Cisco default). The route is dampened and no longer advertised to neighbors or accepted from neighbors.

3. If route does not flap any further the penalty is decayed. The decay process begins 7.5 minutes after the route stabilized and decays exponentially every 5 seconds thereafter.

4. Once the suppression penalty decays below 750 (the default value for the reuse threshold), the route is removed from dampened state and reused. The router parses the historical routes list every 10 seconds for reusable routes.

142

Route Flap Dampening - Operation

143

Useful Tool To Understand BGP Peering Relationships

www.netlantis.org

UDP, TCP/IP - Internet Protocols

UDP: User Datagram Protocol

146

UDP Header

147

What is UDP?

Relatively simple compared to TCP UDP provides connectionless service

for data delivery between two hosts No logical connection is established by

UDP, so no connection-oriented services are supplied

The application may need some of those services (e.g., no corrupt packets), so the application is responsible to provide them

Applications that use UDP: tftp, DNS (for some functions), SNMP, RIP, VoIP

148

Why Use UDP over TCP?

No connection establishment. As we shall discuss in Section 3.5, TCP uses a three-way handshake before it starts to transfer data. UDP just blasts away without any formal preliminaries. Thus UDP does not introduce any delay to establish a connection. This is probably the principle reason why DNS runs over UDP rather than TCP -- DNS would be much slower if it ran over TCP. HTTP uses TCP rather than UDP, since reliability is critical for Web pages with text. But, as we briefly discussed in Section 2.2, the TCP connection establishment delay in HTTP is an important contributor to the "world wide wait".

No connection state. TCP maintains connection state in the end systems. This connection state includes receive and send buffers, congestion control parameters, and sequence and acknowledgment number parameters. We will see in Section 3.5 that this state information is needed to implement TCP's reliable data transfer service and to provide congestion control. UDP, on the other hand, does not maintain connection state and does not track any of these parameters. For this reason, a server devoted to a particular application can typically support many more active clients when the application runs over UDP rather than TCP.

Small segment header overhead. The TCP segment has 20 bytes of header overhead in every segment, whereas UDP only has 8 bytes of overhead.

Unregulated send rate. TCP has a congestion control mechanism that throttles the sender when one or more links between sender and receiver becomes excessively congested. This throttling can have a severe impact on real-time applications, which can tolerate some packet loss but require a minimum send rate. On the other hand, the speed at which UDP sends data is only constrained by the rate at which the application generates data, the capabilities of the source (CPU, clock rate, etc.) and the access bandwidth to the Internet. We should keep in mind, however, that the receiving host does not necessarily receive all the data - when the network is congested, a significant fraction of the UDP-transmitted data could be lost due to router buffer overflow. Thus, the receive rate is limited by network congestion even if the sending rate is not constrained.

149

UDP Encapsulation

150

UDP Header

151

Computing the UDP Checksum

152

IP Fragmentation

153

Other UDP Uses

Path MTU Discovery Using Traceroute Using UDP

Max UDP datagram size ICMP Source Quench

TCP: Transmission Control Protocol

155

TCP Header

156

What is TCP?

Relatively complex compared to UDP TCP provides connection-oriented service for data delivery

between two hosts Client and server establish a logical TCP connection before

exchanging data TCP segments flow over the network in IP packets (which are

connectionless) so that the logical TCP connection can be maintained over a changing physical path

Connections are full-duplex Timers are used to maintain connections

TCP relies on IP to provide hop-by-hop routing and error detection

Applications that use TCP: telnet, ftp, http, many others

157

TCP Logical Connections/Ports

TCP and UDP introduce the concept of ports

Common ports and the services that run on them:

FTP 21 and 20 telnet 23 SMTP 25 http 80 POP3 110

Multiple ports/logical connections can be supported

158

TCP Header Fields and Other Info Each connection uniquely identified by combination of src IP, dest IP, src port, and dest port

socket = IP + port (e.g., 10.1.1.1.23) Sequence numbers

The sequence number of the first data octet in this segment (except when SYN is present). If SYN is present the sequence number is the initial sequence number (ISN) and the first data octet is ISN+1.

Acknowledgements If the ACK control bit is set this field contains the value of the next sequence number the sender of

the segment is expecting to receive. Once a connection is established this is always sent. Header Length: Length of header in bytes Flag bits: Provides connection-oriented service

The SYN and Fin flags are used when establishing and terminating a TCP connection, respectively.

The ACK flag is set any time the Acknowledgement field is valid, implying that the receiver should pay attention to it.

The URG flag signifies that this segment contains urgent data. When this flag is set, the UrgPtr field indicates where the non-urgent data contained in this segment begins.

The PUSH flag signifies that the sender invoked the push operation, which indicates to the receiving side of TCP that it should notify the receiving process of this fact.

Finally, the RESET flag signifies that the receiver has become confused and so wants to abort the connection.

Window size The number of data octets beginning with the one indicated in the acknowledgment field which the

sender of this segment is willing to accept.

159

TCP Options

160

TCP Encapsulation

161

TCP Functionalities Connection-Orient Identifies traffic flow by some identifier rather than by explicitly listing source and destination addresses

Stream Data Transfer From the application's viewpoint, TCP transfers a contiguous stream of bytes. TCP does this by grouping the bytes in TCP segments, which are passed to IP for transmission to the destination. TCP itself decides how to segment the data and it may forward the data at its own convenience. Reliability TCP assigns a sequence number to each byte transmitted, and expects a positive acknowledgment (ACK) from the receiving TCP. If the ACK is not received within a timeout interval, the data is retransmitted. The receiving TCP uses the sequence numbers to rearrange the segments when they arrive out of order, and to eliminate duplicate segments.

Flow Control The receiving TCP, when sending an ACK back to the sender, also indicates to the sender the number of bytes it can receive beyond the last received TCP segment, without causing overrun and overflow in its internal buffers. This is sent in the ACK in the form of the highest sequence number it can receive without problems.

Logical Connections The reliability and flow control mechanisms described above require that TCP initializes and maintains certain status information for each data stream. The combination of this status, including sockets, sequence numbers and window sizes, is called a logical connection. Each connection is uniquely identified by the pair of sockets used by the sending and receiving processes.

Multiplexing To allow for many processes within a single host to use TCP communication facilities simultaneously, the TCP provides a set of addresses or ports within each host. Concatenated with the network and host addresses from the internet communication layer, this forms a socket. A pair of sockets uniquely identifies each connection. Full Duplex TCP provides for concurrent data streams in both directions.

162

Important Factors That Affect TCP (Application) Performance

Link BW (window size), network delay (RTT) and MTU size (Bit Error Rate) How a receiver/sender implements Acknowlegment scheme

Stop-N-Wait, Go-back-N, Selective Ack Speed at which received data is processed and ACKed at destination

(sender/receiver buffer size) Maximum TCP Buffer (Memory) space for use by any TCP connection Socket Buffer Sizes for individual TCP connection

Ability to actively manage speed mismatch between sender and receiver by regulating how much can/should be sent without getting into a congestion state Window size based on Bandwidth Delay Product

Ability to actively detect and prevent dynamic congestion and re-act to it Connection timeout, slow start, back-off strategies

Ability for sender to maximize/improve BW/network throughput TCP Large Window Scaling

TCP Connection-Oriented Protocol:

TCP Connection Establishmentand Termination

164

Passive and Active Ports

TCP enables two methods to establish a connection: active and passive. An active connection establishment happens when TCP issues a request for the connection, based on an instruction from an upper-level protocol that provides the socket number. A passive approach takes place when the upper-level protocol instructs TCP to wait for the arrival of connection requests from a remote system (usually from an active open instruction). When TCP receives the request, it assigns a port number. This enables a connection to proceed rapidly, without waiting for the active process.

165

Connection Establishment

166

Connection Termination

167

MSS: Maximum Segment Size

170

TCP Half-Close

171

TCP State Diagram

172

States: Establishment and Termination

173

TCP Reset

174

Simultaneous Open

For example: An application at host A uses 7777 as the local port and connects to port 8888 on host B. At the same time, an application at host B uses 8888 as the local port and connects to port 7777 on host A. This is "Simultaneous Open".

Here is another example: The Telnet client at host A connects to the Telnet server at host B. At the same time, the Telnet client at host B connects to the Telnet server at host A. Be careful. This time, it's not "Simultaneous Open" because the two Telnet servers on both sides do "passive open" instead of "active open". There are actually two TCP connections, instead of one in "Simultaneous Open".

175

Simultaneous Close

176

TCP Provides a Byte-Stream ServiceTCP is a byte-oriented protocol, which means the sender writes bytes into a TCP connection and the receiver reads bytes out of the TCP connection. Although ``byte-stream'' describes the service TCP offers to application processes, TCP does not, itself, transmit individual bytes over the Internet. Instead, TCP on the source host buffers enough bytes from the sending process to fill a reasonably sized packet, and then sends this packet to its peer on the destination host. TCP on the destination host then empties the contents of the packet into a receive buffer, and the receiving process reads from this buffer at its leisure. This situation is illustrated in figure below, which for simplicity, shows data flowing in only one direction. In general, remember, a single TCP connection supports byte-streams flowing in both directions.

http://www.cs.arizona.edu/llp/book/node63.html

177

TCP Provides a Byte-Stream Service

It is a streaming protocol No “record markers” inserted into data stream

Writes on one end and reads on the other are independent of each other Ex: data could be written in a sequence of 10

bytes, then 20 bytes, then 50 bytes That data could be read as 4 x 20 byte

No interpretation of the application data Very similar to Unix kernel’s treatment of files in

a filesystem

178

TCP Provides Reliability

TCP segments are sized for the application

Segments must be acknowledged TCP checksum on header and data Out-of-sequence IP packets can be re-

ordered Receiving TCP must discard duplicate

packets Flow control is employed to manage

finite buffer space

179

TCP Flow Control Algorithms

Sliding Window Regular Sliding Window The BW Delay Product (BWDP) Window Scaling option The Silly Window syndrome Tinygram Congestion Prevention: The Nagle algorithm

TCP Timeout RTT-Timeout Calculation

Acknowlegement schemes Stop-N-Wait, Go-back-N, Selective Ack schemes Ambiguous Acknowledgements: The Karn’s Algorithm

Congestion Avoidance Slow Start

180

End-to-end flow control

Problem Sender can send more traffic that

receiver can handle. (Too fast) Solution

variable sliding window protocol each acknowledgement, which

specifies how many octets have been received, contains a window advertisement that specifies how many additional octets receiver are prepared to accept.

181

Variable Window Size

… …

Transmitter Receiver

Window Advertisement

Transmitter Window Size Value of Window Advertisement

Free space in buffer to fill

increase bigger increase

decrease smaller decrease

Stop transmissions 0 full

182

Sliding window protocol in TCP

TCP allows the window size to vary over time.

Window size changes at the time it slides forward.

Advantage: it provides flow control as well as reliable transfer.

183

TCP Sliding Window Algorithm

Flow control with the use of Window concept via specifying an acceptable range of

sequence numbers

184

Sliding Windows

Available Window

Advertized Window

185

ReceiverReceiverSenderSender

Sliding Window: Details

… …

Sent & Acked Sent Not Acked

OK to Send Not Usable

… …

Max acceptable

Receiver window

Max ACK received Next seqnum

Received & Acked Acceptable Packet

Not Usable

Sender window

Next expected

186

acknowledged sent to be sent outside window

Source PortSource Port Dest. PortDest. Port

Sequence NumberSequence Number

AcknowledgmentAcknowledgment

HL/FlagsHL/Flags WindowWindow

D. ChecksumD. Checksum Urgent PointerUrgent Pointer

Options..Options..

Source PortSource Port Dest. PortDest. Port

Sequence NumberSequence Number

AcknowledgmentAcknowledgment

HL/FlagsHL/Flags WindowWindow

D. ChecksumD. Checksum Urgent PointerUrgent Pointer

Options..Options..

Packet Sent Packet Received

App write

Window Flow Control: Header

187

Sliding Windows Example - A Dynamic Parameter

188

Window Size – A Dynamic Parameter

189

How To Determine Optimal Window Size (Keeping the Pipe Full): The Bandwidth-Delay Product (BDP)

190

The TCP Window Scaling option

The original TCP specification included a window size no larger than 64 KB. This limitation was introduced by the 16 bit header that specified window size. To achieve the recommended 1 MB window size, TCP extensions must be enabled to add another 14 bits to the window size,making the total bist equal to 30 for window size

The TCP window scaling option works by including a scale factor in a SYN packet imbedded in the TCP OPTIONS field. This scale factor informs the receiver that the sender is willing to do window scaling and offers a scale factor for the communication. The scale factor is used to shift the window field before the data segment is sent.

It's important to note that the window size used in the actual 3-way handshake is NOT the window size that is scaled. This means that the first data packet sent after the 3-way handshake is the actual window size. If there is a scaling factor, the initial window size of 65,535 bytes is always used. The window size is then multiplied by the scaling factor identified in the 3-way handshake. The table below represents the scaling factor boundaries for various window sizes.

191

The TCP Window Scaling Option

Scale factor

Scale Value

Initial Window

Window Scaled

0 1 65535 or less

65535 or less

1 2 65535 131,070

2 4 65535 262,140

3 8 65535 524,280

4 16 65535 1,048,560

5 65535 …

6 65535 …

… 65535 …

14 16384 65535 1,073,725,440

192

TCP Options – Window Scale Factor

193

The TCP Receiver Silly Window Syndrome Problems Associated With "Shrinking" The TCP Window - when the receiver is much

slower than the sender (ex below where receiver can only process 1 out of 3 received packets)

This diagram shows one example of how the phenomenon known as TCP silly window syndrome can arise. The client is trying to send data as fast as possible to the server, which is very busy and cannot clear its buffers promptly. Each time the client sends data the server reduces its receive window. The size of the messages the client sends shrinks until it is only sending very small, inefficient segments.

194

Receiver SWS Avoidance

Let's start with SWS avoidance by the receiver. As we saw in the initial example above, the receiver contributed to SWS by reducing the size of its receive window to smaller and smaller values due its being busy. This caused the right edge of the sender's send window to move by ever-smaller increments, leading to smaller and smaller segments. To avoid SWS, we simply make the rule that the receiver may not update its advertised receive window in such a way that this leaves too little usable window space on the part of the sender. In other words, we restrict the receiver from moving the right edge of the window by too small an amount. The usual minimum that the edge may be moved is either the value of the MSS parameter, or one-half the buffer size, whichever is less.

Let's see how we might use this in the example above. When the server receives the initial 360-byte segment from the client and can only process 120 bytes, it does not reduce the window size to 120. It reduces it all the way to 0, closing the window. It sends this back to the client, which will then stop and not send a small segment. Once the server has removed 60 more bytes from the buffer, it will now have 180 bytes free, half the size of the buffer. It now opens the window up to 180 bytes in size and sends the new window size to the client.

It will continue to only advertise either 0 bytes, or 180 or more, not smaller values in between. This seems to slow down the operation of TCP, but it really doesn't. Because the server is overloaded, the limiting factor in overall performance of the connection is the rate at which the server can clear the buffer. We are just exchanging many small segments for a few larger ones.

195

Tinygram Congestion Prevention: The Nagle algorithm (Sender SWS) Nagle’s algorithm, named after John Nagle, is a means of improving the

efficiency of TCP/IP networks by reducing the number of packets that need to be sent over the network.

Telnet example test over a long-haul link with a 5-second round trip time. User sends 25 bytes. Without any mechanism to prevent small-packet (tinygram) congestion 25 new packets would be sent in 5 seconds in accordance with the delayed ACK algorithm, meaning delay up to 200 ms. Amount data to be sent is 41x25 bytes (20 bytes for IP header, 20 bytes for TCP header). Overhead here is 4000%.

With Nagle algorithm however, the first character from the user would be sent immediately. The next 24 characters, arriving from the user at 200ms intervals. When an ACK arrived for the first packet at the end of 5 seconds, a single packet with the 24 queued characters would be sent, i.e. 41x2 +25 bytes would be sent total. Overhead is only 320% with no penalty in response time.

The Nagle Algorithm is useful on a slow WAN when it is desired to reduce the tinygram congestion.

Sometimes the Nagle algorithm needs to be turned off. For example, in X Window System server small messages (mouse movements) must be delivered without delay to provide real-time feedback for interactive user.

http://hnguyen2k.wordpress.com/wiki/TCP/IP

196

Key TCP Concepts

Modern TCP implementations incorporate a set of SWS avoidance algorithms. When receiving, devices are programmed not to advertise very small windows, waiting instead until there is enough room in the buffer for one of a reasonable size. Transmitters use Nagle’s algorithm to ensure that small segments are not generated when there are unacknowledged bytes outstanding.

197

TCP Timers The Retransmission Timer

The retransmission timer manages retransmission timeouts (RTOs), which occur when a preset interval between the sending of a datagram and the returning acknowledgment is exceeded. The value of the timeout tends to vary, depending on the network type, to compensate for speed differences. If the timer expires, the datagram is retransmitted with an adjusted RTO, which is usually increased exponentially to a maximum preset limit. If the maximum limit is exceeded, connection failure is assumed, and error messages are passed back to the upper-layer application.

Values for the timeout are determined by measuring the average time that data takes to be transmitted to another machine and the acknowledgment received back, which is called the round-trip time, or RTT. From experiments, these RTTs are averaged by a formula that develops an expected value, called the smoothed round-trip time, or SRTT. This value is then increased to account for unforeseen delays.

The Delayed ACK Timer TCP uses delayed acknowledgments to reduce the number of packets that are sent

on the media. Instead of sending an acknowledgment for each TCP segment received, TCP takes a common approach to implementing delayed acknowledgments. As data is received by TCP on a particular connection, it sends an acknowledgment back only if one of the following conditions is true:

No acknowledgment was sent for the previous segment received. A segment is received, but no other segment arrives within 200 milliseconds for that

connection.

198

TCP Timers (Continued) The Quiet Timer

After a TCP connection is closed, it is possible for datagrams that are still making their way through the network to attempt to access the closed port. The quiet timer is intended to prevent the just-closed port from reopening again quickly and receiving these last datagrams.

The quiet timer is usually set to twice the maximum segment lifetime (the same value as the Time to Live field in an IP header), ensuring that all segments still heading for the port have been discarded. Typically, this can result in a port being unavailable for up to 30 seconds, prompting error messages when other applications attempt to access the port during this interval.

The Keep-Alive Timer and the Idle Timer Both the keep-alive timer and the idle timer were added to the TCP

specifications after their original definition. The keep-alive timer sends an empty packet at regular intervals to ensure that the connection to the other machine is still active. If no response has been received after sending the message by the time the idle timer has expired, the connection is assumed to be broken.

The keep-alive timer value is usually set by an application, with values ranging from 5 to 45 seconds. The idle timer is usually set to 360 seconds.

199

RTT and TCP Retransmission Timeouts (RTO) The RTO is typically calculated based on the RTT

The original TCP specification had TCP update a smoothed RTT estimator (called R) using the low-pass filter R <- aR + (1-a)M, M is the measured RTT where a is a smoothing factor with a recommended value of 0.9.

This smoothed RTT is updated every time a new measurement is made. Ninety percent of each new estimate is from the previous estimate and 10% is from the new measurement.

Given this smoothed estimator, which changes as the RTT changes, RFC 793 recommended the retransmission timeout value (RTO) be set to RTO = Rb where b is a delay variance factor with a recommended value of 2.

200

TCP Performance Enhancement - Congestion Avoidance Using Slow Start

Cwnd = congestion window

Some implementationrecommends cwnd = 4 as the initial value to improve throughput

201

The TCP Delayed ACK timer

Delayed ACK timer = 200 msec by default

202

TCP Reliability – Use of Acknowledgement

TCP is responsible for data recovery by providing a sequence number with

each packet that it sends TCP requires ACK (acknowledgement)

to ensure correct data is received Packet can be retransmitted if error

detected Three ACK schemes available which

will be discussed at length later

203

TCP Acknowledgement Schemes

Stop-and-Wait : individual ACK is required for each segment received

Go-Back-N : cumulative ACK required for consecutive segments received

Selective-Ack: selectively ACK only segments that are in error as part of the consecutive segments received

Stop-And-Wait ACK Scheme:TCP Interactive Data Flow

205

Interactive Data: character/echo

206

Delayed Acknowledgements

Delayed ACK timer = 200 msec by default

207

Regular Acknowledgements

208

Stop-And-Wait ACK Scheme – Loss Packet Scenario

209

Stop-And-Wait ACK Scheme – Loss ACK Scenario

210

Performance of Stop-And-Wait ACK Scheme

It works, but performance stinks Example: 1 Gbps link, 15 ms e-e prop. delay,

1KB packet:

Ttransmit

= 8kb/pkt10**9 b/sec

= 8 microsec

U sender: utilization – fraction of time sender busy sending 1KB pkt every 30 msec -> 33kB/sec thruput over 1 Gbps link network protocol limits use of physical resources!

U sender

= .008

30.008 = 0.00027

microseconds

L / R

RTT + L / R =

L (packet length in bits)R (transmission rate, bps)

=

211

Ambiguous ACK – Loss ACK Scenario

212

Ambiguous Acknowledgement - The Karn Algorithm Rule 1: Ignore measured RTT for retransmitted packets.

This removes ambiguity from RTT measurements.

Rule 2: RTO should be doubled after retransmission.This is called "Exponential Back-off".

A problem occurs when a packet is retransmitted. Say a packet is transmitted, a timeout occurs, the RTO is backed off, the packet is retransmitted with the longer RTO, and an acknowledgment is received. Is the ACK for the first transmission or the second? This is called the retransmission ambiguity problem.

[Karn and Partridge 1987] specify that when a timeout and retransmission occur, we cannot update the RTT estimators when the acknowledgment for the retransmitted data finally arrives. This is because we don't know to which transmission the ACK corresponds. (Perhaps the first transmission was delayed and not thrown away, or perhaps the ACK of the first transmission was delayed.)

Also, since the data was retransmitted, and the exponential backoff has been applied to the RTO, we reuse this backed off RTO for the next transmission. Don't calculate a new RTO until an acknowledgment is received for a segment that was not retransmitted.

213

TCP Fast Retransmission

Fast Retransmission: When TCP detects segment loss using retransmission timer, then ssthresh is set to half of CWND. i.e. ( ssthresh = CWND / 2 ) and CWND is set to 1-full size segment. Now instead of waiting for retransmission timer to get off , TCP detects packet loss by looking for packet re-ordering and retransmit the lost packet. This scheme is called as " Fast Retransmit". In this algorithm , TCP receiver sends an immediate duplicate ACK on out-of-order segment arrival. The other end TCP deduce from small number ( normally 3 ) of consecutive duplicate ACKs that the segment has been lost and deduces the starting sequence number of missing segment. The missing segment is retransmitted.

Go-Back-N ACK Scheme:TCP Bulk Data Flow

215

Motivation: Pipelining Increases Utilization

first packet bit transmitted, t = 0

sender receiver

RTT

last bit transmitted, t = L / R

first packet bit arriveslast packet bit arrives, send ACK

ACK arrives, send next packet, t = RTT + L / R

last bit of 2nd packet arrives, send ACKlast bit of 3rd packet arrives, send ACK

U sender

= .024

30.008 = 0.0008

microseconds

3 * L / R

RTT + L / R =

Increase utilizationby a factor of 3

216

Go-Back-N (GBN) - Sender

Sender: k-bit seq # in pkt header “sliding window” of up to N, consecutive unack’ed pkts

allowed

ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK” may receive duplicate ACKs (see receiver)

The local node must also keep a buffer of all PDUs which have been sent, but have not yet been acknowledged

Timer for each in-flight pkt Timeout(n): retransmit pkt n and all higher seq # pkts in window

217

GBN: Receiver

ACK-only: always send ACK for correctly-received pkt with highest in-order seq # may generate duplicate ACKs need only remember expectedseqnum

out-of-order pkt: discard (don’t buffer) -> no receiver buffering! Re-ACK pkt with highest in-order seq #

218

GBN In Action

219

GBN – How To Handle Packet Loss

Example of Go-Back-N. The sender in this example transmits four PDUs (1-4) and the first one (1) of these is not successfully received. The receiver notes that it was expecting a PDU numbered 1 and actually receives a PDU numbered 2. It therefore deduces that (1) was lost. It requests retransmission of the missing PDU by sending a Go-Back-N request (in this case N=1), and discards all received PDUs with a number greater than 1.The sender receives the Go-Back-N request and retransmits the missing PDU (1), followed by all subsequently sent PDUs (2-4) which the receiver the correctly receives and acknowledges.

220

GBN: ACK of Multiple Segments

221

GBN: ACK of Multiple Segments – Slight Differences

222

GBN: Fast Sender, Slow Receiver

223

Limitations With Stop-And-Wait and Go-Back-N Ack Schemes

Poor performance when multiple packets are lost from a window of data

Cumulative ACKs provide limited information

Aggressively re-transmitting packets is inefficient

Selective-ACK solves these problems