- My pods are getting created but some pods are failing or getting disconnected with an error similar to the following in the console output or master logs:
java.net.SocketTimeoutException: sent ping but didn't receive pong within XXXXms (after XX successful ping/pongs)
    at okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546)
    at okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
- CloudBees Core from 188.8.131.52 to 184.108.40.206
- CloudBees Core on modern cloud platforms - Managed Master from 220.127.116.11 to 18.104.22.168
- CloudBees Core on modern cloud platforms - Operations Center from 22.214.171.124 to 126.96.36.199
- CloudBees Core on traditional platforms - Client Master from 188.8.131.52 to 184.108.40.206
- CloudBees Core on traditional platforms - Operations Center from 220.127.116.11 to 18.104.22.168
- CloudBees Jenkins Platform - Client Master from 22.214.171.124 to 126.96.36.199
- CloudBees Jenkins Platform - Operations Center from 188.8.131.52 to 184.108.40.206
- CloudBees Jenkins Distribution from 220.127.116.11 to 18.104.22.168
- Kubernetes Plugin from 1.14.8 to 1.19.3
- okhttp 3.10.0 is more aggressive on ping interval.
- JENKINS-50429 (issue introduced): Kubernetes plugin uses fabric8/kubernetes-client > 4.1.2 that uses
- JENKINS-58301 (issue reported)
- fabric8io/kubernetes-client #1767 fixed in
The error java.net.SocketTimeoutException: sent ping but didn't receive pong within XXXXms (after XX successful ping/pongs) is caused by a ping failure in the HTTP client that maintains the connection to Kubernetes through the okhttp library. The Kubernetes plugin relies on the fabric8/kubernetes-client, which in turn relies on the okhttp library.
Since its early days, the fabric8/kubernetes-client has set the ping interval to 1ms by default. This was “stable” up until version 4.1.2, when the client started to use version 3.10.0 of the okhttp library. Starting with that version, the HTTP client connection is closed on any ping failure, and the recommendation is to use a ping interval of 30s, or greater if necessary. Because fabric8/kubernetes-client uses a value of 1ms, the connection is rather unstable, depending on the load on the network, the client, and the server. The Jenkins Kubernetes plugin has been impacted since version 1.14.8.
In Jenkins, the symptoms are agent connection failures or agent disconnections due to a socket timeout: java.net.SocketTimeoutException: sent ping but didn't receive pong within XXXXms. The exception may appear in the agent logs or even in the build console output.
If the value reported in the exception (XXXXms) is 1000ms, or anything lower than the recommended 30000ms, then this is most likely the issue.
If the value is greater than or equal to 30000ms, then the cause could well be an underlying network problem or a performance problem on one end of the connection (Jenkins master or agent unresponsive).
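The check described above can be sketched in Java as follows. This is only an illustration, not part of any CloudBees or Jenkins tooling: the class name `PingTimeoutCheck`, its method names, and the returned messages are hypothetical, and the sketch assumes the log line contains the exception message in the form shown in this article.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: classifies a log line by the timeout value it reports.
class PingTimeoutCheck {
    // Ping interval recommended by the okhttp documentation, per this article.
    static final long RECOMMENDED_PING_INTERVAL_MS = 30_000L;

    // Matches the "within XXXXms" part of the SocketTimeoutException message.
    private static final Pattern PONG_TIMEOUT =
            Pattern.compile("didn't receive pong within (\\d+)ms");

    // Returns the reported interval in milliseconds, or empty if the line does not match.
    static OptionalLong extractIntervalMs(String logLine) {
        Matcher m = PONG_TIMEOUT.matcher(logLine);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    // Applies the rule of thumb from the article to a single log line.
    static String classify(String logLine) {
        OptionalLong interval = extractIntervalMs(logLine);
        if (!interval.isPresent()) {
            return "no ping timeout found";
        }
        return interval.getAsLong() < RECOMMENDED_PING_INTERVAL_MS
                ? "ping interval below 30000ms: most likely this issue"
                : "ping interval is 30000ms or more: check network or master/agent responsiveness";
    }

    public static void main(String[] args) {
        String line = "java.net.SocketTimeoutException: sent ping but didn't receive pong "
                + "within 1000ms (after 12 successful ping/pongs)";
        System.out.println(classify(line));
    }
}
```

The 30000ms threshold is taken from the recommendation quoted in this article; adjust it if your environment uses a different target interval.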
The solution is to upgrade the Kubernetes plugin to version 1.19.3, which has been available under the CloudBees Assurance Program since version 22.214.171.124 of CloudBees Core.
It is possible to increase the ping interval by adding -Dkubernetes.websocket.ping.interval=<milliseconds> to the startup arguments of the impacted instance(s). The recommended interval from the okhttp documentation is 30 seconds (the argument unit is milliseconds, so 30 seconds corresponds to a value of 30000).
See how to add Java arguments to Jenkins to learn how to add this argument to your environment.
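As a rough way to confirm that a JVM actually received the argument, the system property can be read back with the standard `Long.getLong` call. The sketch below is an assumption-laden illustration: the class name `PingIntervalProperty` and its messages are hypothetical, it only inspects the JVM it runs in, and it says nothing about how the Kubernetes plugin itself consumes the property.

```java
// Hypothetical check: reports whether the ping interval property is visible to this JVM.
class PingIntervalProperty {
    // Property name as given in this article.
    static final String PROPERTY = "kubernetes.websocket.ping.interval";
    // 30 seconds expressed in milliseconds, the interval recommended above.
    static final long RECOMMENDED_MS = 30_000L;

    static String describe() {
        // Long.getLong returns null when the property is unset or is not a valid number.
        Long configured = Long.getLong(PROPERTY);
        if (configured == null) {
            return PROPERTY + " is not set (or is not a number)";
        }
        if (configured < RECOMMENDED_MS) {
            return PROPERTY + "=" + configured + " is below the recommended " + RECOMMENDED_MS;
        }
        return PROPERTY + "=" + configured + " meets the recommended " + RECOMMENDED_MS;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run standalone, for example with `java -Dkubernetes.websocket.ping.interval=30000 PingIntervalProperty.java` on Java 11 or later, this only demonstrates the lookup; to verify a real instance, the same property would have to be read inside the JVM of the impacted Jenkins instance.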