Considerations for Kubernetes Client Connections when using the Kubernetes Plugin

Issue

  • The master thread dump shows many OkHttp ConnectionPool and / or OkHttp WebSocket threads

  • Build execution fails with a WebSocket exception such as

    Interrupted while waiting for websocket connection, you should increase the Max connections to Kubernetes API
    

    or

    Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at <currentTimeout> seconds
    
  • Kubernetes client requests are enqueued

Environment

Related Issue(s)

Explanation

The Kubernetes plugin manages references of Kubernetes Clients to talk to the Kubernetes API Server. In general, there is a reference of one active Kubernetes Client per Kubernetes Cloud.

Each Kubernetes Client can handle requests concurrently. To prevent Jenkins from overloading the Kubernetes API Server, there is a configurable limit on the number of concurrent connections that a Kubernetes Client / Cloud can make to the Kubernetes API Server. It is labelled Max connections to Kubernetes API in the advanced configuration of a Kubernetes Cloud, and the default value is 32. This setting defines both the maxRequestsPerHost and the maxRequests of the client dispatcher.

The Kubernetes plugin needs to send requests to the Kubernetes API Server for different kinds of operations, mainly to manage agent pods but also to execute steps inside non-jnlp containers. Every time a durable task step or a step that uses a Launcher is executed in a container that is not the jnlp container, calls are made to the Kubernetes API. The Kubernetes plugin relies on the exec API, over a WebSocket connection, to execute those steps. For durable task steps, such as sh / bat / powershell steps, the connection is intended to be open only while the step is being launched and is then quickly closed (even if the step runs for hours). Other steps that use a Launcher, such as some publishers and checkouts with some SCMs (but not Git or Subversion), hold this WebSocket connection open for the duration of the step.

When the limit is reached, requests are still submitted to the dispatcher but are enqueued until a dispatcher thread is available to handle them. Depending on the operation, a timeout applies while waiting for the connection to be handled, after which the request fails. For example, org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout is the timeout that applies when waiting for the WebSocket connection to be established in order to launch a durable task inside a container block. When a pipeline fails due to this timeout, it may mean that there are currently too many concurrent requests and that this particular request stayed in the client dispatcher queue for the duration of that timeout.
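The queue-then-timeout behavior described above can be pictured with a toy model. This is only a sketch in plain Groovy, not the plugin's or the client's implementation: a Semaphore stands in for the dispatcher limit, and the names maxConnections, timeoutMillis and slots are made up for the illustration.

```groovy
import java.util.concurrent.Semaphore
import java.util.concurrent.TimeUnit

// Toy model only: a semaphore stands in for the client dispatcher's request limit.
int maxConnections = 2      // the real default is 32; kept small for the demo
long timeoutMillis = 200    // stands in for websocketConnectionTimeout (30 s by default)
def slots = new Semaphore(maxConnections)

// Two long-running Launcher steps occupy every available slot.
assert slots.tryAcquire()
assert slots.tryAcquire()

// A third request has to wait. Nothing frees a slot in time, so it times out.
boolean thirdStarted = slots.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)
println "third request started: ${thirdStarted}"

// Once one of the running requests completes, a waiting request can proceed.
slots.release()
boolean fourthStarted = slots.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)
println "fourth request started: ${fourthStarted}"
```

In this model the third request fails only because both slots stay busy for the whole timeout, which mirrors why long-held connections make the timeout more likely to be hit.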

To sum up on this:

  • Each Kubernetes Cloud has a reference to an active Kubernetes Client
  • Each Kubernetes Cloud has its own limit, Max connections to Kubernetes API, on the number of concurrent requests that it can make to the Kubernetes API Server
  • By default, the maximum number of concurrent requests per Kubernetes Cloud is 32
  • Agent pod maintenance and Pipeline steps execution are the most common operations that require Kubernetes API Server connections
  • Durable Task steps (sh / bat / powershell steps) open a WebSocket connection to launch the step and close it quickly.
  • Other steps which use a Launcher hold the connection for the duration of the step.
  • Kubernetes plugin timeouts such as org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout are likely to expire when the limit has been reached for too long

Therefore, depending on a few things - such as the activity on the master, Kubernetes API Server responsiveness, the way pipelines are designed and step execution times - this limit may be reached rather easily. A workaround is to increase the limit, at the risk of overloading the Kubernetes API Server. Other practices can be followed to avoid reaching this limit.

Notable Fixes / Improvements

Several critical improvements have been made to the way the plugin consumes Kubernetes API calls:

  • Kubernetes Plugin 1.16.6 / Durable Task 1.30: Improve durable tasks behavior so that a durable task step execution does not hold a connection for the entire execution of the step. See JENKINS-58290 for more details.
  • Kubernetes Plugin 1.27.1: Remove a redundant call to the Kubernetes API Server that checks that all containers are READY every time a Launcher is executed in container. See #826 for more details.
  • Kubernetes Plugin 1.27.3: Fix the Max connections to Kubernetes API that was limited to 64. See JENKINS-58463 for more details.
  • Kubernetes Plugin 1.28.1: Fix an issue where expired (closed) clients were sometimes used. See #889 for more details.

We recommend staying up to date.

Solution

We recommend running at least version 1.28.1 of the Kubernetes Plugin.

There are different solutions that can help mitigate the problem:

  • Increase the Max connections to Kubernetes API
  • Increase the ContainerExecDecorator#websocketConnectionTimeout
  • Use a single container approach
  • Concatenate small and successive durable task steps

Increase the Max connections to Kubernetes API

This is the most straightforward solution / workaround:

  • Consider increasing the Max connections to Kubernetes API to allow more requests to be made concurrently. In CloudBees CI on Modern Platform, this must be done in the “kubernetes shared cloud” in Operations Center.

Note: Although this sounds like the solution, do not raise this setting to a very large number at once, as this could overload the Kubernetes API Server as well as the Master: more concurrent requests mean that more resources are needed. Increase the value gradually to find a good spot. Above all, it is important to understand what is causing the limit to be reached by monitoring the number of running / queued requests; see the Monitor Usage section below.
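Before changing the value, it can help to check what each cloud is currently configured with. The following Script Console sketch assumes that KubernetesCloud exposes the Max connections to Kubernetes API field through a getMaxRequestsPerHost() getter; that accessor name is an assumption to verify against the installed plugin version, and in CloudBees CI the shared cloud may need to be inspected from Operations Center instead.

```groovy
import jenkins.model.Jenkins

// Assumption: the "Max connections to Kubernetes API" field is readable as
// cloud.maxRequestsPerHost (getMaxRequestsPerHost()); verify on your version.
Jenkins.instance.clouds
  .findAll { it instanceof org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud }
  .each { cloud ->
    println "${cloud.name}: max connections = ${cloud.maxRequestsPerHost}"
  }

return
```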

Increase the ContainerExecDecorator#websocketConnectionTimeout

This helps mitigate the problem in cases where builds fail with "Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at <currentTimeout> seconds", but it does not address the concurrent request usage:

  • Increase the value of the websocket connection timeout within the container block by adding the system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=<timeInSeconds> on startup. The default value is 30 (seconds). This requires a restart of the Master. See How to add Java arguments to Jenkins for more details.

This timeout allows a task to remain enqueued for a certain amount of time before it fails.
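As a quick check after the restart, the property can be read back from the Script Console. This is a minimal sketch: it only reports the raw JVM system property (passed at startup in the usual Java form, for example -D<propertyName>=60, where 60 is an arbitrary illustration), not the value the plugin has actually loaded.

```groovy
// Property name taken from the error message quoted in this article.
def name = 'org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout'

// System.getProperty returns null when the property was not passed to the JVM.
def value = System.getProperty(name)

println(value ? "${name} = ${value} seconds"
              : "${name} is not set; the default of 30 seconds applies")
```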

Use a Single Container approach

Running steps in the jnlp container tremendously reduces the number of calls made to the API Server, because those executions use the remoting channel and do not rely on the Kubernetes API to launch tasks. There is no general recommendation for a single-container or a multi-container approach, but in this particular case a single-container approach can help considerably.

  • Consider using a single container approach: building a jnlp image that contains the build tools required to be able to run steps in the jnlp container

Note: when creating custom jnlp images, we recommend using jenkins/inbound-agent or cloudbees/cloudbees-core-agent as a base.
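A single-container pipeline could look like the sketch below. The image name is hypothetical and stands for a custom jnlp image built on one of the bases above with the build tools added; the mvn commands are placeholders for whatever those tools are. POD_LABEL is assumed to be available, which holds for recent plugin versions.

```groovy
// Hypothetical image: a jnlp agent image extended with the build tools.
podTemplate(containers: [
  containerTemplate(name: 'jnlp', image: 'registry.example.com/build/jnlp-with-tools:1.0')
]) {
  node(POD_LABEL) {
    // No container(...) block: steps run in the jnlp container over the
    // remoting channel, so launching them does not use the exec API.
    sh 'mvn -B -version'
    sh 'mvn -B verify'
  }
}
```

The trade-off is a larger agent image to maintain, in exchange for far fewer Kubernetes API calls per build.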

Concatenate small and successive Durable Tasks steps

Every time a durable task step such as a sh / bat / powershell step is executed in a container that is not the jnlp container, calls are made to the Kubernetes API.

  • It is recommended to concatenate small and successive durable task steps into larger ones to avoid unnecessary calls, increase stability and improve scalability.

For example, each of the following sh steps requires several successive API calls to be made:

  container('notjnlp') {
      sh "echo 'this'"
      sh "echo 'that'"
      sh "grep 'this' that | jq ."
  }

By contrast, the concatenated version launches a single sh step, so those API calls are made only once:

  [...]
  container('notjnlp') {
      sh """
        echo 'this'
        echo 'that'
        grep 'this' that | jq .
      """
  }

Monitor Usage

1. If using CloudBees CI, the Kube Agents Management plugin exposes the JMX metrics kubernetes-client.connections.running and kubernetes-client.connections.queued, which track the running and queued tasks of the Kubernetes clients. They can help monitor instances, evaluate requirements for concurrent requests and narrow down the root cause of related issues.

2. To have a better idea of what requests are currently queued and running, the following Groovy script may be executed under Manage Jenkins > Script Console:

```groovy
def allRunningCount = 0
def allQueuedCount = 0

/**
 * Method that dumps information of a specific k8s client.
 */
def dumpClientConsumer = { client ->
  def k8sClient = client.client
  def httpClient = k8sClient.httpClient
  def dispatcher = httpClient.dispatcher()

  allRunningCount += dispatcher.runningCallsCount()
  allQueuedCount += dispatcher.queuedCallsCount()

  println "(${k8sClient})"
  println "* STATE "
  println "  * validity " + client.validity
  def runningCalls = dispatcher.runningCalls()
  println "* RUNNING " + runningCalls.size()
  runningCalls.each { call ->
    println "  * " + call.request()
  }
  def queuedCalls = dispatcher.queuedCalls()
  println "* QUEUED " + queuedCalls.size()
  queuedCalls.each { call ->
    println "  * " + call.request()
  }
  println "* SETTINGS "
  println "  * Connect Timeout (ms): " + httpClient.connectTimeoutMillis()
  println "  * Read Timeout (ms): " + httpClient.readTimeoutMillis()
  println "  * Write Timeout (ms): " + httpClient.writeTimeoutMillis()
  println "  * Ping Interval (ms): " + httpClient.pingIntervalMillis()
  println "  * Retry on failure " + httpClient.retryOnConnectionFailure()
  println "  * Max Concurrent Requests: " + dispatcher.getMaxRequests()
  println "  * Max Concurrent Requests per Host: " + dispatcher.getMaxRequestsPerHost()
  def connectionPool = httpClient.connectionPool()
  println "* CONNECTION POOL "
  println "  * Active Connection " + connectionPool.connectionCount()
  println "  * Idle Connection " + connectionPool.idleConnectionCount()
  println ""
}

println "Active K8s Clients\n----------"
org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.clients.asMap().values().forEach(dumpClientConsumer)

println ""
println "K8s Clients Summary\n----------"
println "* ${org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.clients.asMap().size()} active clients"
println "* ${org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.runningCallsCount} running calls (from plugin)"
println "* ${org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider.queuedCallsCount} queued calls (from plugin)"
println "* ${allRunningCount} running calls"
println "* ${allQueuedCount} queued calls"

return
```
