SLA Monitoring System (SLAM) Gustavo Lozano | ICANN DNS Symposium | 13 May 2017
SLA Monitoring System (SLAM) - Agenda
1. Contractual Provisions
2. SLAM
3. MoSAPI
4. Statistics
Contractual Provisions
Why is ICANN monitoring gTLDs?
• Specification 10 of the new gTLD Registry Agreement specifies the Service Level Requirements for Registry Operators.
• ICANN developed a monitoring system called SLAM (Service Level Agreement Monitoring) as a tool to measure compliance with these requirements.
What are the Service Level Requirements?
| Service | Parameter | SLR (monthly basis) |
|---|---|---|
| DNS | DNS service availability | 0 min downtime = 100% availability |
| DNS | DNS name server availability | ≤ 432 min of downtime (≈ 99%) |
| DNS | TCP DNS resolution RTT | ≤ 1500 ms, for at least 95% of queries |
| DNS | UDP DNS resolution RTT | ≤ 500 ms, for at least 95% of queries |
| DNS | DNS update time | ≤ 60 min, for at least 95% of probes |
| RDDS | RDDS availability | ≤ 864 min of downtime (≈ 98%) |
| RDDS | RDDS query RTT | ≤ 2000 ms, for at least 95% of queries |
| RDDS | RDDS update time | ≤ 60 min, for at least 95% of probes |
| EPP | EPP service availability | ≤ 864 min of downtime (≈ 98%) |
| EPP | EPP session-command RTT | ≤ 4000 ms, for at least 95% of commands |
| EPP | EPP query-command RTT | ≤ 2000 ms, for at least 95% of commands |
| EPP | EPP transform-command RTT | ≤ 4000 ms, for at least 95% of commands |
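The downtime allowances and the approximate percentages are two views of the same number. A minimal sketch of the arithmetic, assuming a 30-day month (the slide does not state the month length it uses; `availability_percent` is an illustrative name):

```python
# Assumption: a 30-day month (43,200 minutes), which reproduces the rounded
# percentages shown next to the downtime allowances.
MINUTES_PER_MONTH = 30 * 24 * 60


def availability_percent(downtime_minutes: int,
                         period_minutes: int = MINUTES_PER_MONTH) -> float:
    """Return availability as a percentage of the monitoring period."""
    return 100 * (period_minutes - downtime_minutes) / period_minutes


for label, downtime in [("DNS name server", 432), ("RDDS", 864), ("EPP", 864)]:
    print(f"{label}: {downtime} min down -> "
          f"{availability_percent(downtime):.1f}% available")
```

Under that assumption, 432 minutes is exactly 1% of the month and 864 minutes is exactly 2%.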
What are the Emergency Thresholds?
• ICANN can designate an interim EBERO (Emergency Backend Registry Operator) to take over the operation of a gTLD when an emergency threshold is reached.
• SLAM system alerts and Compliance notices are sent to Registry Operators when certain percentages of the specified Emergency Thresholds are met.
What are the Emergency Thresholds?
| Critical Function | Emergency Threshold |
|---|---|
| DNS Service (all servers) | 4-hour total downtime / week |
| DNSSEC proper resolution | 4-hour total downtime / week |
| EPP | 24-hour total downtime / week |
| RDDS (WHOIS / Web-based WHOIS) | 24-hour total downtime / week |
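Since alerts are tied to percentages of these thresholds, it can help to express accumulated weekly downtime as a share of the threshold. A rough sketch; the dictionary keys and the function name are illustrative, and the actual percentages that trigger alerts are not given in the slides:

```python
# Weekly emergency thresholds from the table above, converted to minutes.
EMERGENCY_THRESHOLD_MINUTES = {
    "DNS": 4 * 60,
    "DNSSEC": 4 * 60,
    "EPP": 24 * 60,
    "RDDS": 24 * 60,
}


def threshold_percent(service: str, downtime_minutes: float) -> float:
    """Percentage of the weekly emergency threshold consumed so far."""
    return 100 * downtime_minutes / EMERGENCY_THRESHOLD_MINUTES[service]


print(threshold_percent("DNS", 60))    # one hour of DNS downtime this week
print(threshold_percent("RDDS", 360))  # six hours of RDDS downtime this week
```

One hour of DNS downtime and six hours of RDDS downtime consume the same share of their respective thresholds, which shows how much tighter the DNS limit is.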
SLAM
What is the SLAM?
• Zabbix as the primary monitoring platform. Custom plugins and code to support ICANN monitoring were developed by Zabbix.
• Probe node network
– Consists of 40 probe nodes covering all ICANN regions.
• A Network Operations Center operating 24/7
• ICANN staff is on call 24/7
Design principles of the system
• Avoid false positives
• Avoid false positives
• Avoid false positives
• Reach the affected Registry Operator as soon as possible
• Provide general guidance regarding the potential issue
How does it work?
[Diagram: Data Processor, three Probe Nodes, and the registry (Ry)]
DNS test
• One non-recursive DNS query sent every minute from all probe nodes
– Query is sent to every (IP address, NS) pair
– Query is for the FQDN zz--icann-monitoring.<TLD>
• If DNSSEC is offered, NSEC/NSEC3 and the signatures are verified.
• The chain of trust is validated against the root zone KSK.
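To make "non-recursive" concrete: the query is an ordinary DNS message with the RD (recursion desired) bit cleared. The sketch below builds such a message with the standard library only. It is not the SLAM implementation; the query type (A), the EDNS0 payload size, and the function name are assumptions for illustration.

```python
import struct


def build_query(fqdn: str, query_id: int, qtype: int = 1,
                dnssec_ok: bool = True) -> bytes:
    """Build a DNS query message with the RD (recursion desired) bit cleared."""
    flags = 0x0000  # QR=0 (query), OPCODE=0, RD=0: non-recursive
    arcount = 1 if dnssec_ok else 0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, arcount)
    # QNAME: each label prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in fqdn.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    if not dnssec_ok:
        return header + question
    # EDNS0 OPT pseudo-record (RFC 6891): root name, TYPE 41, 4096-byte UDP
    # payload size in the CLASS field, DO bit (0x8000) in the TTL field.
    opt = b"\x00" + struct.pack("!HHIH", 41, 4096, 0x00008000, 0)
    return header + question + opt


# "example" stands in for the TLD under test.
message = build_query("zz--icann-monitoring.example", query_id=0x1234)
print(message.hex())
```

Setting the DO bit asks the server to include DNSSEC records (signatures, NSEC/NSEC3) in the answer, which the verification steps above would need.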
DNS test
• Examples of failure criteria
– No reply
– Invalid reply (e.g., RCODE/SERVFAIL)
– Malformed or invalid responses
– Broken chain of trust
– NSEC and NSEC3 errors
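The first three criteria can be checked from the DNS header alone. A simplified sketch follows; it does not cover the DNSSEC checks (chain of trust, NSEC/NSEC3), and treating NXDOMAIN as an acceptable answer is an assumption made here, not something the slides state. The function name and return strings are illustrative.

```python
import struct
from typing import Optional

RCODE_NOERROR, RCODE_SERVFAIL, RCODE_NXDOMAIN = 0, 2, 3


def classify_reply(query_id: int, reply: Optional[bytes]) -> str:
    """Return "ok" or a short failure reason for one DNS reply."""
    if reply is None:
        return "no reply"
    if len(reply) < 12:  # shorter than a DNS header
        return "malformed response"
    reply_id, flags = struct.unpack("!HH", reply[:4])
    if reply_id != query_id or not flags & 0x8000:  # wrong ID or QR bit not set
        return "malformed response"
    rcode = flags & 0x000F
    # Assumption: NOERROR and NXDOMAIN both count as valid answers.
    if rcode not in (RCODE_NOERROR, RCODE_NXDOMAIN):
        return f"invalid reply (RCODE {rcode})"
    return "ok"


servfail = struct.pack("!HHHHHH", 7, 0x8000 | RCODE_SERVFAIL, 1, 0, 0, 0)
print(classify_reply(7, servfail))
print(classify_reply(7, None))
```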
RDDS test
• One Whois (port 43) transaction initiated every 5 minutes from all probe nodes.
• One HTTP (web-Whois) connection test every 5 minutes. The system will follow HTTP redirects.
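A Whois (port 43) transaction is a single TCP exchange: the client sends one line and reads until the server closes the connection. A minimal sketch of that exchange; the function names, timeout, and the query string used by the monitoring system are assumptions, and the web-Whois test is not covered here.

```python
import socket


def build_whois_request(query: str) -> bytes:
    """WHOIS (RFC 3912) request: the query text followed by CRLF."""
    return query.encode("ascii") + b"\r\n"


def whois_transaction(server: str, query: str, timeout: float = 10.0) -> bytes:
    """Connect to port 43, send one query, and read until the server closes."""
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(build_whois_request(query))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

A call such as `whois_transaction("whois.nic.example", "nic.example")` would perform one transaction, with `example` standing in for the TLD under test.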
RDDS test
• Examples of failure criteria
– DNS/DNSSEC failures when resolving whois.nic.<TLD>
– Malformed or invalid Whois (port 43) responses
– HTTP 500 error code in case of web-Whois
Recursive DNS servers
• Recursive DNS servers are used during the tests (e.g., resolving whois.nic.<TLD>)
• DNSSEC is enabled in the recursive DNS servers
• If DNSSEC is failing when resolving whois.nic.<TLD>, the RDDS is considered to be failing
• The maximum TTL allowed in the cache and negative cache is 15 minutes
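The TTL cap can be read as a simple clamp applied to every cached answer, positive or negative. A small sketch with illustrative names; how the recursive servers are actually configured is not described in the slides:

```python
MAX_CACHE_TTL_SECONDS = 15 * 60  # the 15-minute cap from the slide


def cache_ttl(record_ttl_seconds: int) -> int:
    """TTL used by the cache: the record's own TTL, capped at 15 minutes."""
    return min(record_ttl_seconds, MAX_CACHE_TTL_SECONDS)


print(cache_ttl(86400))  # a one-day TTL is held for at most 15 minutes
print(cache_ttl(300))    # shorter TTLs are left unchanged
```

With such a cap, a stale or negative answer cannot influence the tests for longer than 15 minutes.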
What happens when a failure is detected?
The SLA system continuously monitors all gTLDs. An issue is passed to the alerting machine when the following conditions are met.

DNS issues
• Three consecutive failing cycles
• 51% or more of the probe nodes detected the issue
• At least 20 probe nodes are online

RDDS issues
• Two consecutive failing cycles
• 51% or more of the probe nodes detected the issue
• At least 10 probe nodes are online
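The alerting conditions combine three checks: enough probes online, a majority of probes seeing the failure, and the failure persisting across consecutive cycles. The sketch below models that logic, taking the DNS rule as three cycles with at least 20 probes online and the RDDS rule as two cycles with at least 10. The class and function names are illustrative, and computing the 51% share against the probes currently online is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlertRule:
    consecutive_failing_cycles: int
    min_probes_online: int
    failing_share: float = 0.51  # 51% or more of the probe nodes


@dataclass(frozen=True)
class Cycle:
    probes_online: int
    probes_failing: int


DNS_RULE = AlertRule(consecutive_failing_cycles=3, min_probes_online=20)
RDDS_RULE = AlertRule(consecutive_failing_cycles=2, min_probes_online=10)


def cycle_failed(cycle: Cycle, rule: AlertRule) -> bool:
    """A cycle counts as failing only with enough probes online and a majority failing."""
    if cycle.probes_online < rule.min_probes_online:
        return False  # too few probes online to trust the result
    return cycle.probes_failing / cycle.probes_online >= rule.failing_share


def should_alert(history: List[Cycle], rule: AlertRule) -> bool:
    """True when the most recent N cycles all failed, N taken from the rule."""
    recent = history[-rule.consecutive_failing_cycles:]
    return len(recent) == rule.consecutive_failing_cycles and all(
        cycle_failed(c, rule) for c in recent
    )


history = [Cycle(40, 30), Cycle(40, 31), Cycle(40, 29)]
print(should_alert(history, DNS_RULE))
```

Requiring every condition at once fits the design principle of avoiding false positives: a single bad cycle, a minority of probes, or a thin probe network never raises an alert on its own.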
What happens when a failure is detected? – cont.
Alerting machine
• Calls the Ry's Emergency Contacts
• Contacts ICANN Contractual Compliance
• Contacts ICANN IT, if the SLAM is failing
ICANN's NOC contacts the Ry's Emergency Contacts to verify reception of the alert
ICANN Technical Services staff contacts the Ry to provide help
Monitoring the quality of IPv4 and IPv6
• Every probe node monitors the quality of its IPv4 and IPv6 connectivity.
• If the quality of its IPv4 and IPv6 connectivity is determined to be insufficient, the probe node goes offline automatically.
• In order to monitor the quality of IPv4 and IPv6 connectivity, the node:
– Sends a DNS query to every root server every minute
– If 5 or more responses are received per IP protocol within 250 ms, the quality of connectivity is considered to be sufficient
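The self-check reduces to counting fast root-server responses per IP protocol. A sketch under stated assumptions: the input format (one round-trip time per root server, `None` for no response) and the function names are invented for illustration, and requiring both protocols to pass for the node to stay online is one reading of the slide.

```python
from typing import Dict, List, Optional

MIN_RESPONSES = 5    # "5 or more responses"
MAX_RTT_MS = 250.0   # "within 250 ms"


def protocol_ok(rtts_ms: List[Optional[float]]) -> bool:
    """rtts_ms holds one entry per root server; None means no response."""
    fast = [rtt for rtt in rtts_ms if rtt is not None and rtt <= MAX_RTT_MS]
    return len(fast) >= MIN_RESPONSES


def probe_stays_online(results: Dict[str, List[Optional[float]]]) -> bool:
    """Assumption: the probe stays online only if both IPv4 and IPv6 pass."""
    return protocol_ok(results["ipv4"]) and protocol_ok(results["ipv6"])


results = {
    "ipv4": [18.0, 22.5, 31.0, 40.2, 55.9, None, 300.0],
    "ipv6": [20.1, None, None, 260.0, 35.0, 41.0, 48.3],
}
print(probe_stays_online(results))
```

Taking a probe offline when its own connectivity is poor keeps local network problems from being counted as registry failures.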
MoSAPI
MoSAPI
• The monitoring system API is in pilot mode at the moment.
• The API allows the Registry to access the information collected by the SLAM.
• The production version is going to support defining a maintenance window programmatically. At the moment, this is a manual process.
Statistics
Statistics – Interesting data points
• 11 out of 37 RSPs have had at least one TLD that reached the EBERO threshold in at least one service
• 27 (DNS or RDDS) service failures reached the EBERO threshold (we haven't declared one EBERO event yet)
• 1.7% (21 out of 1,211) of the new gTLDs have reached the EBERO threshold in at least one service (DNS or RDDS)
• 32 out of 37 RSPs have had at least one DNS service failure since 25-Sep-2014
Note: data as of 1-Jan-2017.
Statistics – Potential EBERO events
[Chart: Failures that reached the EBERO threshold, by month, 2014 to 2016]
Statistics – Potential EBERO events
[Chart: Failures that reached the EBERO threshold per RSP]
Statistics – DNS failures
[Chart: DNS failures, by month, Q4 2014 to Q1 2017]
Statistics – Unique-RSP DNS failures
[Chart: Unique-RSP DNS failures, by month, Q4 2014 to Q1 2017]
Reach us at: Email: [email protected] Website: icann.org
Thank You and Questions
Engage with ICANN
linkedin.com/company/icann
twitter.com/icann
facebook.com/icannorg
weibo.com/ICANNorg
youtube.com/user/icannnews
slideshare.net/icannpresentations
flickr.com/photos/icann
soundcloud.com/icann