Restartability Management in the Cisco Core Router CRS/NG
Stefan Schaeckeler (Cisco Systems, Inc.), Ashwin Narasimha Murthy (Google, Inc.)


Post on 23-Feb-2016




Page 2

Table of Contents

• System Overview
• CRS/NG Restartability Overview − Problem Definition and High Level Solution
• Concrete Example − Statistics Resource Manager Library
• Conclusion

Page 3

System Overview

Core Router: extremely complex System
• SW: 16 MLOC
• HW: several chassis, LCs (1 CPU, 5 NPUs, chips galore), RPs (1 CPU, chips galore), fabric cards, blade cards, …

Forms a distributed System

99.9...9% Uptime

Page 4

System Overview

System Manager: restarts a crashed Process
• HW bug
• SW bug

Process must maintain State (after Crash)

CRS/NG Approach
• Key data structures in shared memory
• Well-written algorithms guarantee consistency

CRS 1, CRS 3, CRS/NG (final name?)

Page 5

CRS/NG Restartability Overview

CRS/NG runs Cisco IOS/XR

Cisco IOS/XR Abstraction Layer on Linux
• Sophisticated IPC
• Sophisticated shared memory API
  – Special malloc for shared memory
  – Static configuration file
    – Mapping identifiers to fixed virtual addresses
    – STATS_RESTART 0x50000000
  – (Re)attaching to shared memory via identifier
  – Previously allocated objects always available
  – …
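The fixed-address idea above can be sketched with plain POSIX calls. This is only an illustration, not the IOS/XR shared memory API: the function names, the backing file path, and the 4 KiB size are assumptions, and only the address 0x50000000 is taken from the slide. Because the region always lands at the same virtual address, pointers stored inside it stay valid after the process is restarted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-ins for one entry of the static configuration file. */
#define STATS_RESTART_ADDR ((void *)0x50000000UL)
#define STATS_RESTART_SIZE 4096UL

/* Attach the backing file at its fixed virtual address.
 * Returns the mapping, or NULL on failure. */
static void *attach_fixed(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)STATS_RESTART_SIZE) != 0) {
        close(fd);
        return NULL;
    }
    /* MAP_FIXED forces the address and silently replaces any mapping that is
     * already there, so the range must be reserved by convention. */
    void *p = mmap(STATS_RESTART_ADDR, STATS_RESTART_SIZE,
                   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    assert(p == MAP_FAILED || p == STATS_RESTART_ADDR);
    return p == MAP_FAILED ? NULL : p;
}

/* Simulate "run, crash, restart": write a value, drop the mapping, reattach.
 * Returns 1 if the value is found again at the same address. */
static int survives_restart(const char *path, uint32_t value)
{
    uint32_t *a = attach_fixed(path);
    if (a == NULL)
        return 0;
    a[0] = value;
    munmap(a, STATS_RESTART_SIZE);    /* the process "dies" here */

    uint32_t *b = attach_fixed(path); /* the restarted process reattaches */
    int ok = (b == (uint32_t *)STATS_RESTART_ADDR) && b[0] == value;
    if (b != NULL)
        munmap(b, STATS_RESTART_SIZE);
    return ok;
}
```

The sketch assumes a Linux-like address space in which 0x50000000 is unused; a real system would reserve such ranges centrally, which is presumably the job of the static configuration file.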

Page 6

CRS/NG Restartability Overview

Process requiring Restartability
• Key data structures in shared memory
• Careful algorithm design to avoid
  – Temporary inconsistencies: account1 := account1+X; account2 := account2-X;
  – Pointer operations (disconnection of linked lists)
  – Crashes during IPCs
  – Crashes before a return; (caller records success)
• Optional recovery phase
• Compromises are possible
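The account example and the optional recovery phase can be made concrete with a small intent record. This is a sketch under assumptions, not the CRS/NG implementation: the struct layout, the function names, and the `crash_after_step` parameter (which only simulates a crash) are all hypothetical. The struct is assumed to live in restart-safe shared memory, and the stores are assumed to reach memory in program order; real code would need volatile accesses or barriers to guarantee that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout; assumed to live in restart-safe shared memory. */
typedef struct {
    int64_t account1;
    int64_t account2;
    /* Intent record, written before the accounts are touched. */
    int     pending;     /* 0 = idle, 1 = transfer in flight */
    int64_t old1, old2;  /* balances before the transfer */
    int64_t amount;
} bank_st;

/* Apply both updates from the recorded old balances. Because it never reads
 * the current balances, running it twice gives the same result. */
static void apply(bank_st *b)
{
    b->account1 = b->old1 + b->amount;
    b->account2 = b->old2 - b->amount;
    b->pending  = 0;     /* commit point: the transfer is complete */
}

/* crash_after_step simulates a crash for illustration:
 * 0 = no crash, 1 = after the intent is written, 2 = after account1 only. */
static void transfer(bank_st *b, int64_t x, int crash_after_step)
{
    b->old1 = b->account1;
    b->old2 = b->account2;
    b->amount = x;
    b->pending = 1;      /* written last: the intent is now valid */
    if (crash_after_step == 1)
        return;

    b->account1 = b->old1 + x;
    if (crash_after_step == 2)
        return;          /* the temporarily inconsistent window */

    apply(b);
}

/* Optional recovery phase, run once after (re)attaching. */
static void recover(bank_st *b)
{
    assert(b->pending == 0 || b->pending == 1);
    if (b->pending)
        apply(b);        /* redo the transfer from the saved balances */
}
```

For example, starting from balances 100 and 50, `transfer(&b, 30, 2)` leaves 130 and 50 behind; a later `recover(&b)` completes the transfer to 130 and 20.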

Page 7

Concrete Example: Statistics Resource Manager Library

[Figure: HW, extremely simplified view on CRS/NG]

Page 8

Concrete Example: Statistics Resource Manager Library

[Figure: SW, somewhat simplified view on CRS/NG Statistics Manager]

Page 9

Concrete Example: Statistics Resource Manager Library

Client Application / Library crashes and is restarted

Client Application: State is gone
• Stats pointers are lost
• Other state is lost

Stats Lib
• State is gone
• Stats pointers are lost

Solution for Stats Lib
• Keep freelists in shared memory
• Smart algorithm for keeping state consistent

Page 10

Concrete Example: Statistics Resource Manager Library

Step 1: Keeping State in Shared Memory

    #include <string.h>  /* strcmp, strcpy */

    stats_cl_ctx_st *mstats_cl_bind (char *name) {
        void *shmem;
        stats_cl_ctx_st *con;
        int i;

        /* open shmem at a predetermined address (posix mmap: MAP_FIXED flag) */
        shmem = shmwin_attach(SSE_STATS_RESTART_ADDRESS);
        con = (stats_cl_ctx_st *)((char *)shmem + name_to_offset(name));

        if (strcmp(con->name, name) != 0) {
            /* first bind: init "empty" context */
            for (i = 0; i <= max; i++)
                con->freelist[i] = NULL;
            con->mutex = 0;
            /* the name is written last, so a crash during init
               is handled as a first bind on the next attempt */
            strcpy(con->name, name);
        } else {
            /* restart: do nothing, just return con */
        }
        return con;
    }

Page 11

Concrete Example: Statistics Resource Manager Library

Step 2a: Smart Algorithm − A pragmatic Approach (chosen for CRS/NG)

Few Concepts:

(Re-)moving nodes from freelist
• Worst case: a page is lost (bad?)

Requesting fresh page from server
• Worst case: page is lost (bad?)

Updating bitmap: mark some pointers as allocated − client does not pick up
• Worst case: some pointers are lost (bad?)
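One way to read the first concept is as an ordering rule: unlink the node from the freelist before handing it out, so that a crash in between can leak the node but can never leave it both on the list and in a client's hands. The following is a minimal sketch of that ordering, not the CRS/NG code; the types, the `in_use` mark, and the `crash_before_return` parameter (which only simulates a crash) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
    int          in_use;  /* hypothetical per-node allocation mark */
} node_st;

typedef struct {
    node_st *head;        /* assumed to live in shared memory */
} freelist_st;

/* Pop with the writes ordered so that a crash can lose the node but can
 * never cause it to be handed out twice. */
static node_st *freelist_pop(freelist_st *fl, int crash_before_return)
{
    node_st *n = fl->head;
    if (n == NULL)
        return NULL;
    assert(!n->in_use);   /* nodes on the freelist are never allocated */
    fl->head = n->next;   /* 1. unlink first: the list no longer reaches n */
    if (crash_before_return)
        return NULL;      /* caller never sees n: it is leaked, not duplicated */
    n->next = NULL;       /* 2. only then hand it out */
    n->in_use = 1;
    return n;
}
```

After a simulated crash the restarted process simply continues with the remaining list; the lost node is the "worst case" the slide accepts.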

Page 12

Concrete Example: Statistics Resource Manager Library

Discussion of worst Case Scenarios

A page (or a few Pointers within) is lost
• = 256 out of 8 million stats pointers in NPU memory − no big deal
• = 80 bytes out of several GB of CPU memory for node structure − no big deal

Client frees a Pointer from a lost Page
• Error Code is returned
• Client is irritated but has to ignore it

We never give out the same Pointer twice
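To put the two "no big deal" figures in proportion, a quick calculation helps. The 256 pointers, 8 million pointers, and 80 bytes come from the slide; the denominator of 10^9 bytes (1 GB) is an assumed lower bound for "several GB", so the second ratio is an upper bound on the loss.

```latex
\frac{256}{8\,000\,000} = 3.2 \times 10^{-5} = 0.0032\,\%
\qquad
\frac{80\ \text{bytes}}{10^{9}\ \text{bytes}} = 8 \times 10^{-8}
```

At that rate, roughly 31,000 lost pages (8,000,000 / 256 = 31,250) would be needed to exhaust the stats pointers.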

Page 13

Concrete Example: Statistics Resource Manager Library

Step 2b: Smart Algorithm − A perfect Approach

Complicated Algorithm / Very difficult Implementation
• Further pointers in shared memory
• Need to figure out where the crash happened and continue from there

Requirement: interacting Libraries and Processes must be "perfect" as well
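A rough sketch of what the "further pointers" and the "continue from there" step could look like for a freelist pop. Everything here is hypothetical and greatly simplified: an extra `inflight` pointer in shared memory records which node is being handed out, and recovery uses it to undo a half-finished pop. It also shows why the requirement on the other side arises: recovery can only decide correctly if the client reports reliably whether it received the node (the `client_has_node` argument).

```c
#include <assert.h>
#include <stddef.h>

typedef struct pnode { struct pnode *next; } pnode_st;

typedef struct {
    pnode_st *head;      /* freelist head, assumed in shared memory */
    pnode_st *inflight;  /* the "further pointer": node being handed out */
} pfreelist_st;

/* crash_at simulates a crash: 0 = none, 1 = after recording the intent,
 * 2 = after unlinking. On success the node stays "in flight" until the
 * client acknowledges it. */
static pnode_st *perfect_pop(pfreelist_st *fl, int crash_at)
{
    assert(fl->inflight == NULL);  /* previous pop must be acknowledged */
    pnode_st *n = fl->head;
    if (n == NULL)
        return NULL;
    fl->inflight = n;              /* record the intent */
    if (crash_at == 1)
        return NULL;
    fl->head = n->next;            /* unlink */
    if (crash_at == 2)
        return NULL;
    return n;
}

/* Called by the client once it has stored the pointer safely. */
static void perfect_ack(pfreelist_st *fl)
{
    fl->inflight = NULL;
}

/* Recovery after a restart: find out where the pop stopped and repair. */
static void perfect_recover(pfreelist_st *fl, int client_has_node)
{
    pnode_st *n = fl->inflight;
    if (n == NULL)
        return;                    /* no pop was in progress */
    if (fl->head != n && !client_has_node) {
        n->next = fl->head;        /* unlinked but never delivered: */
        fl->head = n;              /* put the node back, nothing leaks */
    }
    fl->inflight = NULL;
}
```

Compared with the pragmatic ordering, this version needs an extra store on every pop plus an acknowledgement from the client, which hints at the implementation effort and possible run-time cost.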

Page 14

Conclusion

Pragmatic Approach of CRS/NG
+ Easy to implement
+/− Crashes: worst Case: small Mem. Leak
+ No Run-time Performance Hit

Perfect Approach
− Very difficult to implement, error prone
+ Crashes: no Memory Leak
− Perhaps Run-time Performance Hit

Page 15

Thank You
