Troubleshooting Kafka’s socket server: from incident to resolution
TRANSCRIPT
Joel Koshy, LinkedIn
The incident: occurred a few days after upgrading to pick up quotas and SSL.

(Timeline figure, x25 → x38: multi-port support (KAFKA-1809, KAFKA-1928), SSL (KAFKA-1690), and various quota patches, with the dates April 5, June 3, August 18, and October 13 marked along the timeline.)
The incident: a broker (which happened to be the controller) failed in our queuing Kafka cluster.
The incident
● Alerts fire; NOC engages SYSOPS/Kafka-SRE
● Kafka-SRE restarts the broker
● Broker failure does not generally cause prolonged application impact
  ○ but in this incident…
The incident: multiple applications begin to report “issues”: socket timeouts to the Kafka cluster. Posts search was one such impacted application.
The incident: two brokers report high request and response queue sizes.
The incident: two brokers report high request queue sizes and request latencies.
The incident
● Other observations
  ○ High CPU load on those brokers
  ○ Throughput degrades to roughly half the normal level
  ○ Tons of broken-pipe exceptions in server logs
  ○ Application owners report socket timeouts in their logs
Remediation: shifted site traffic to another data center.

“Kafka outage ⇒ member impact. Multi-colo is critical!”
Remediation
● Controller moves did not help
● Firewalled the affected brokers:

```shell
# accept traffic from the other brokers, drop everything else
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
```

● The above helped, but the cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
  ○ … but x25 never exhibited this issue
“Good dashboards/alerts. Skilled operators. Clear communication. Audit/log all operations. CRM (crew resource management) principles apply to operations.”
Remediation: Friday night ⇒ roll back to x25 and debug later.

… but the SREs had to babysit the rollback: a rolling downgrade in which, for each x38 broker in turn, they moved leaders off the broker, firewalled it, downgraded it to x25, and then moved leaders back.
… oh, and BTW: be careful when saving a lot of public-access/server logs:
● Can cause long GC pauses, e.g.:

```
[Times: user=0.39 sys=0.01, real=8.01 secs]
```

  (user and sys times are low but real time is high: the JVM was stalled on disk I/O, not on actual GC work)
● Use ionice or rsync --bwlimit
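As a concrete sketch of the ionice / rsync --bwlimit suggestion: copy the logs off-box in the idle I/O scheduling class with a bandwidth cap, so the copy only gets disk time the broker is not using. The paths, host, and limit below are illustrative, not from the talk.

```shell
# -c3 selects the "idle" I/O class: the copy yields disk time to everything else.
# --bwlimit is in KB/s, so 10000 caps the transfer at roughly 10 MB/s.
ionice -c3 rsync --bwlimit=10000 /var/log/kafka/server.log backup-host:/var/backups/kafka/
```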
The investigation
Attempts at reproducing the issue
● Test cluster
  ○ Tried killing the controller
  ○ Multiple rolling bounces
  ○ Could not reproduce
● Upgraded the queuing cluster to x38 again
  ○ Could not reproduce
● So nothing…
Understanding queue backups…
Life-cycle of a Kafka request

(Diagram, built up over several slides: the network layer consists of an acceptor and several processors, each processor with its own response queue, all feeding a single shared request queue; the API layer consists of a pool of API handlers plus the purgatory and the quota manager; client connections terminate at the processors.)

● The acceptor hands new client connections to the processors.
● A processor reads a request off a connection, and then turns off read interest on that connection (for ordering).
● The request awaits handling in the request queue: queue-time.
● An API handler handles the request: + local-time.
● Long-poll requests park in the purgatory until they can be satisfied: + remote-time.
● The response is held by the quota manager if a quota is violated: + quota-time.
● The response awaits its processor in the response queue: + response-queue-time.
● The processor writes the response back: + response-send-time, and then turns read interest back on for that connection.

Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad: the charts contrast one request type whose total time is low with another whose total time is high but “normal” (dominated by purgatory wait).
Investigating high request times
● First look for high local time
  ○ then high response send time
    ■ then high remote (purgatory) time → generally a non-issue (but caveats described later)
● High request-queue/response-queue times are effects, not causes
High local times during the incident (e.g., fetch)
How are fetch requests handled?
● Get the physical offsets to be read from the local log during the response
● If the fetch is from a follower (i.e., a replica fetch):
  ○ If the follower was out of the ISR and has just caught up, expand the ISR (a ZooKeeper write)
  ○ Maybe satisfy eligible delayed produce requests (with acks = -1)
● Else (i.e., a consumer fetch):
  ○ Record/update the byte-rate of this client
  ○ Throttle the request on quota violation
Could these cause high local times?
● Get the physical offsets to be read from the local log during the response → should be fast
● If the fetch is from a follower (i.e., a replica fetch):
  ○ Expand the ISR if the follower just caught up (a ZooKeeper write) → maybe
  ○ Satisfy eligible delayed produce requests (with acks = -1) → not using acks = -1
● Else (i.e., a consumer fetch):
  ○ Record/update the byte-rate of this client → should be fast
  ○ Throttle the request on quota violation → delayed outside the API thread
ISR churn? … unlikely
● Low ZooKeeper write latencies
● The churn in this incident was an effect of some other root cause
● Long request queues can cause churn:
  ○ follower fetches time out
    ■ ⇒ followers fall out of the ISR (the ISR shrink happens asynchronously, in a separate thread)
  ○ the outstanding fetch then catches up and the ISR expands again
High local times during the incident (e.g., fetch): besides, fetch-consumer (not just fetch-follower) shows high local time, which rules out follower-only explanations.
Revisiting the same checklist: that leaves the consumer-fetch path (record/update the byte-rate of this client; throttle on quota violation). Test this…
Quota metrics: maintains byte-rate metrics on a per-client-id basis. And then this in the request log:

```
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:
Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2;
ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes
from connection <clientIP>:<brokerPort>-<localAddr>;
totalTime:6589,requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,
securityProtocol:PLAINTEXT,principal:ANONYMOUS
```

??! A fetch with MaxWait 0 ms took 6.6 seconds, all of it in the request queue.
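When staring at many of these request-log lines, it helps to pull the latency components out mechanically. A throwaway shell sketch (the line below is abridged to just the timing fields from the entry above; everything else is illustrative):

```shell
# Extract latency components from a broker request-log line.
line='totalTime:6589,requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0'

get_component() {
  # $1 = component name, e.g. requestQueueTime
  printf '%s\n' "$line" | tr ',' '\n' | sed -n "s/^$1://p"
}

total=$(get_component totalTime)
queue=$(get_component requestQueueTime)
echo "total=${total}ms queue=${queue}ms"
```

Here the entire 6589 ms is queue time, which is exactly what made the line above so suspicious.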
Quota metrics: a quick benchmark

```scala
// Time recordAndMaybeThrottle as the number of distinct client-ids grows:
for (clientId <- 0 until N) {
  timer.time {
    quotaMetrics.recordAndMaybeThrottle(sensorId, 0, DefaultCallBack)
  }
}
```
Quota metrics: a quick benchmark (the charts show the recorded time climbing as the number of client-id metrics grows).

Fixed in KAFKA-2664.
Meanwhile, in our queuing cluster: client-id counts were climbing.
A rolling bounce of the cluster forced the issue to recur on brokers that had high client-id metric counts
○ Used jmxterm to check per-client-id metric counts before the experiment
○ Hooked up a profiler to verify during the incident
  ■ Generally avoid profiling/heapdumps in production due to interference
○ Did not see this in the earlier rolling bounce because there were only a few client-id metrics at the time
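For reference, a jmxterm session for eyeballing those metric counts looks roughly like this (host, port, and the exact quota MBean names vary by deployment and Kafka version, so treat this as a sketch):

```
$ java -jar jmxterm-uber.jar -l broker-host:9999
$> domain kafka.server
$> beans
# count the per-client-id byte-rate/throttle MBeans in the listing
```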
Troubleshooting: macro vs micro

MACRO (generally more effective)
● Observe week-over-week trends
● Formulate theories
● Test the theory (micro-experiments)
● Deploy the fix and validate

MICRO (live debugging: sometimes warranted, but invasive and tedious)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.
How to fix high local times
● Optimize the request’s handling, e.g.:
  ○ cached topic metadata instead of ZooKeeper reads (see KAFKA-901 and KAFKA-1356)
● Make it asynchronous
  ○ e.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if the response depends on some condition), but be aware of the caveats:
  ○ higher memory pressure if the request purgatory size grows
  ○ expired requests are handled in the purgatory expiration thread (which is good)
  ○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if one request satisfies several delayed requests, local time can increase for the satisfying request
As for rogue clients (recall the FetchRequest above, with MaxWait 0 ms and MinBytes 0 bytes):

“Get apps to use wrapper libraries that implement good client behavior, shield them from API changes, and so on…”
Not done yet! This lesson needs repeating…
After deploying the metrics fix to some clusters… the charts show persistent under-replicated partitions (URP) starting at the deployment, and applications also begin to report higher-than-usual consumer lag.
Root cause: zero-copy broke for plaintext.

(Chart: the upgraded cluster shows the regression; it disappears on rollback, and again with the fix, KAFKA-2517.)
The lesson...
Monitor these closely!
● Request queue size
● Response queue sizes
● Request latencies:
  ○ Total time
  ○ Local time
  ○ Response send time
  ○ Remote time
● Request handler pool idle ratio
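All of the above are exposed over the broker’s JMX interface. The bean names below are the usual ones from the broker’s metrics registry, shown here for Fetch as an example; tags and exact names vary somewhat across Kafka versions, so treat them as a starting point:

```
kafka.network:type=RequestChannel,name=RequestQueueSize
kafka.network:type=RequestChannel,name=ResponseQueueSize
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Fetch
kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Fetch
kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Fetch
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Fetch
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Fetch
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
```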
Continuous validation on trunk
Any other high latency requests?
Image courtesy of ©Nevit Dilmen
Local times by request type: ConsumerMetadata, ControlledShutdown, Fetch, LeaderAndIsr, TopicMetadata, OffsetCommit, OffsetFetch, Offsets (by time), Produce, StopReplica (for delete = true), UpdateMetadata.

Of these, ControlledShutdown, LeaderAndIsr, StopReplica, and UpdateMetadata are (typically 1:N) broker-to-broker requests.
Broker-to-broker request latencies: less critical
● While a request is in flight, read interest on its connection is off, so a slow broker-to-broker request ties up at most one API handler.
● But if the requesting broker times out and retries, each retry occupies another API handler.
● So: configure socket timeouts >> MAX(latency)
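In broker terms, that means keeping the inter-broker socket timeouts comfortably above the worst-case request latency. For example (these are real Kafka broker properties, but the values here are illustrative and defaults vary by version):

```
# controller-to-broker requests (LeaderAndIsr, StopReplica, UpdateMetadata)
controller.socket.timeout.ms=30000
# follower replication fetches; should also exceed replica.fetch.wait.max.ms
replica.socket.timeout.ms=30000
```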
Broker-to-broker request latency improvements
● Broker to controller
  ○ ControlledShutdown (KAFKA-1342)
● Controller to broker
  ○ StopReplica[delete = true] should be asynchronous (KAFKA-1911)
  ○ LeaderAndIsr: a batch request; maybe worth optimizing or putting in a purgatory? Haven’t looked closely yet…
The end