Dedicated JNLP agents (formerly slaves) get disconnected

Precondition: The agent (formerly slave) was working properly before the issue

Symptoms

  • The build has failed because the connection got broken
  • The build is stalled in the queue waiting for the agent (formerly slave)
  • The agent is disconnected and cannot connect again
  • A “Channel is broken” warning in the logs
  • Any of the exceptions listed below

Diagnosis/Treatment

Provide the following before starting to diagnose the issue:

  • Support bundle, ideally generated while the agent is connected.
  • Support bundle generated during the problematic period, if possible/applicable.
  • Agent configuration file (config.xml), which can be retrieved from disk at $JENKINS_HOME/nodes/[agent_name]/config.xml and/or from a browser at http(s)://JENKINS_SERVER/computer/[agent_name]/config.xml
  • Agent logs
  • Build console output log, if it exists.
  • Jenkins logs

Before investigating anything else, first try to launch the agent Java process manually. If that fails:

  • Run curl -I -v http(s)://JENKINS_SERVER/computer/[agent_name]/slave-agent.jnlp and provide us with the output.
  • Download the CLI from http(s)://JENKINS_SERVER/jnlpJars/jenkins-cli.jar and execute java -jar jenkins-cli.jar -s http(s)://JENKINS_SERVER/ help
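The checks above can be sketched as a few shell helpers. This is a minimal sketch, not an official procedure: the base URL, agent name and secret are placeholders you must supply, and it assumes slave.jar is in the current directory and that the agent is launched with the -jnlpUrl and -secret options of slave.jar.

```shell
#!/bin/sh
# Sketch only: base URL, agent name and secret are placeholders.

# Build the JNLP descriptor URL for an agent.
jnlp_url() {
    # $1 = Jenkins base URL (one trailing slash is stripped), $2 = agent name
    printf '%s/computer/%s/slave-agent.jnlp\n' "${1%/}" "$2"
}

# Ask for the headers of the JNLP endpoint, verbosely.
check_jnlp_endpoint() {
    # $1 = Jenkins base URL, $2 = agent name
    curl -I -v "$(jnlp_url "$1" "$2")"
}

# Launch the agent process by hand (assumes slave.jar is in the current directory).
launch_agent() {
    # $1 = Jenkins base URL, $2 = agent name, $3 = agent secret
    java -jar slave.jar -jnlpUrl "$(jnlp_url "$1" "$2")" -secret "$3"
}

# Example (hypothetical values):
#   check_jnlp_endpoint https://jenkins.example.com my-agent
#   launch_agent https://jenkins.example.com my-agent "$AGENT_SECRET"
```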

Known issues

Unable to load class once the loading was interrupted

JENKINS-36991 (Unable to load class once the loading was interrupted) is resolved; the fix was released in remoting 2.61.

Jenkins log / Build console output log

java.lang.NoClassDefFoundError: Could not initialize class jenkins.model.Jenkins
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at Script1.class$(Script1.groovy)
    at Script1.$get$$class$jenkins$model$Jenkins(Script1.groovy)
    at Script1.run(Script1.groovy:1)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:580)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:618)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
    at hudson.remoting.UserRequest.perform(UserRequest.java:121)
    at hudson.remoting.UserRequest.perform(UserRequest.java:49)
    at hudson.remoting.Request$2.run(Request.java:326)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Agent log

Slave.jar version: 2.52
This is a Unix slave
Evacuated stdout
Slave successfully connected and online
Jul 27, 2016 8:36:57 AM jenkins.model.Jenkins <clinit>
SEVERE: Failed to load Jenkins.class
hudson.remoting.RemotingSystemException: java.lang.InterruptedException
    at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:266)
    at com.sun.proxy.$Proxy5.fetch3(Unknown Source)
    at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:171)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at com.thoughtworks.xstream.XStream.buildMapper(XStream.java:590)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:568)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:496)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:465)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:411)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:350)
    at hudson.util.XStream2.<init>(XStream2.java:88)
    at jenkins.model.Jenkins.<clinit>(Jenkins.java:4217)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at Script1.class$(Script1.groovy)
    at Script1.$get$$class$jenkins$model$Jenkins(Script1.groovy)
    at Script1.run(Script1.groovy:1)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:580)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:618)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
    at hudson.remoting.UserRequest.perform(UserRequest.java:121)
    at hudson.remoting.UserRequest.perform(UserRequest.java:49)
    at hudson.remoting.Request$2.run(Request.java:326)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at hudson.remoting.Request.call(Request.java:147)
    at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:253)
    ... 30 more

Intermittent Invalid Object ID in remoting module

JENKINS-23271 Intermittent Invalid Object ID in remoting module

The fix is released in Jenkins core versions higher than 2.32.

Happens frequently on Java 8 due to its object management logic.
Causes issues in task execution (build failures, agent disconnects).
Temporary workaround: use Java 7 on both master and agents (this does not completely solve the issue).

Jenkins log / Build console output log

FATAL: Invalid object ID 18649 iota=18470
java.lang.IllegalStateException: Invalid object ID 18469 iota=18470
    at hudson.remoting.ExportTable.diagnoseInvalidId(ExportTable.java:277)

Ping Thread

The PingThread checks that the agent is able to execute a command from the master (a NOOP request).

The ping command may fail to execute because of:

  • An overloaded queue in which all agent workers are busy. On big boxes you can increase the number of remoting TaskPool workers.
  • An overloaded network.

In some cases disabling the PingThread can help.

If this is the stack trace you see all the time, disable the PingThread. The side effect is that the agent is expected to hang if the communication between master and agents really is failing, but this is useful: you can then use jstack to take a thread dump on both sides, the master and the agent itself.
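Taking the jstack thread dumps mentioned above could look like the following. This is a rough sketch: the PIDs are placeholders you have to look up yourself, the file naming scheme is an arbitrary choice made for this example, and jstack must be run as the same user that owns the JVM.

```shell
#!/bin/sh
# Sketch only: PIDs are placeholders; the file naming scheme is an assumption.

# Name a dump file after the side ("master" or "agent"), the PID and a label.
dump_file() {
    # $1 = side, $2 = PID, $3 = label (for example a timestamp)
    printf 'threaddump-%s-%s-%s.txt\n' "$1" "$2" "$3"
}

# Write a thread dump, including lock information, of the given JVM to a file.
take_thread_dump() {
    # $1 = side, $2 = PID of the JVM
    jstack -l "$2" > "$(dump_file "$1" "$2" "$(date +%Y%m%dT%H%M%S)")"
}

# Example (hypothetical PIDs), run once on each machine:
#   take_thread_dump master 4321
#   take_thread_dump agent 1234
```

Taking several dumps a few seconds apart on each side usually makes it easier to see which threads are stuck.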

Jenkins log / Build console output log

Caused by: java.io.IOException
    at hudson.remoting.Channel.close(Channel.java:1163)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
    at hudson.remoting.PingThread.ping(PingThread.java:126)
    at hudson.remoting.PingThread.run(PingThread.java:85)
Caused by: java.util.concurrent.TimeoutException: Ping started at 1474633728617 hasn't completed by 1474633968617
    ... 2 more
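The two timestamps in the TimeoutException message are epoch milliseconds, so their difference shows how long the ping was given to complete. A minimal sketch of that arithmetic, using only the values from the log above:

```shell
#!/bin/sh
# Timestamps copied from the TimeoutException message (epoch milliseconds).
ping_started_ms=1474633728617
ping_deadline_ms=1474633968617

# Time between the start of the ping and its deadline, in seconds.
ping_window_s=$(( (ping_deadline_ms - ping_started_ms) / 1000 ))
echo "Ping window: ${ping_window_s}s"
```

Here the window is 240 seconds (4 minutes): the agent did not answer the NOOP request for that long before the channel was closed.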

JNLP Cloud Agents are disconnected on start process

It affects Jenkins core 2.28 and higher.

The fix relaxes the requirements of the JNLP connection receiver, which was rejecting connections from agents not using JNLPComputerLauncher (e.g. from Slave Setup, vSphere Cloud and other plugins). Now the connection is accepted from launchers implementing other proxying and filtering Launcher implementations. Particular plugins may require setting the -Djenkins.slaves.DefaultJnlpSlaveReceiver.disableStrictVerification=true system property in the master JVM to allow agents to connect. See JENKINS-39232 (regression in 2.28).

JNLP commons

Outdated slave.jar/remoting.jar

  • The slave.jar version should match the one downloaded from http(s)://JENKINS_SERVER/jnlpJars/slave.jar
  • For JNLP agents, Jenkins does not automatically update remoting.jar and does not monitor its version
  • As remoting is not tested against previous versions of Jenkins core, the connection MAY be unstable and subject to known bugs
  • Partial solution for monitoring: https://wiki.jenkins-ci.org/display/JENKINS/VersionColumn+Plugin
  • Another solution: share slave.jar via a shared directory and configure JNLP agents to pick it up from there
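One way to check the first point is to download the jar the master currently serves and compare it byte for byte with the copy the agent uses. The following is a sketch under assumptions: the base URL and both file paths are placeholders, and a byte comparison is used here instead of reading version numbers.

```shell
#!/bin/sh
# Sketch only: the base URL and file paths are placeholders.

# Download the slave.jar currently served by the master.
fetch_master_jar() {
    # $1 = Jenkins base URL (one trailing slash is stripped), $2 = destination file
    curl -fsSL -o "$2" "${1%/}/jnlpJars/slave.jar"
}

# Succeeds (exit status 0) only when both jar files are byte-for-byte identical.
same_jar() {
    # $1 = jar used by the agent, $2 = jar downloaded from the master
    cmp -s "$1" "$2"
}

# Example (hypothetical paths):
#   fetch_master_jar https://jenkins.example.com /tmp/master-slave.jar || exit 1
#   if ! same_jar /opt/jenkins/slave.jar /tmp/master-slave.jar; then
#       echo "slave.jar differs from the one served by the master"
#   fi
```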

HA / LB / Reverse proxy bypass

  • It is highly recommended to add the -Dhudson.TcpSlaveAgentListener.hostName=master_ip and -Dhudson.TcpSlaveAgentListener.port=jnlp_port Java properties on the master.
  • In that case the connection goes directly to the instance without passing through the HAProxy/load balancer/reverse proxy.
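As an illustration, the two properties can be assembled and passed to the master JVM as shown below. The IP address and port are placeholders, and starting the master with java -jar jenkins.war is an assumption; adapt it to however your master is actually launched (service file, container, and so on).

```shell
#!/bin/sh
# Sketch only: the address and port passed in are placeholders.

# Print the two listener properties for a given master address and JNLP port.
listener_opts() {
    # $1 = address agents should connect to, $2 = JNLP port
    printf '%s=%s %s=%s\n' \
        "-Dhudson.TcpSlaveAgentListener.hostName" "$1" \
        "-Dhudson.TcpSlaveAgentListener.port" "$2"
}

# Example (hypothetical values); the unquoted substitution is split into two arguments:
#   java $(listener_opts 10.0.0.5 50000) -jar jenkins.war
```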

JNLP Windows agent

Runaway slave process

  • In particular cases jenkins-slave.exe gets forcibly terminated (user action, fatal remoting failure, Windows service hard stop)
  • The java.exe process running the slave may be left behind (leaked)
  • This causes multiple “slave is already connected” messages in the Jenkins log

Other considerations

Remoting issue

TCP retransmission timeout at the OS level: consider increasing it

Linux

Using Keep Alive

sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30
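# tcp_tw_recycle was removed in Linux 4.12 and is known to break clients behind NAT; skip it on such systems
sysctl -w net.ipv4.tcp_tw_recycle=1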
sysctl -w net.ipv4.tcp_tw_reuse=1
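With these keepalive settings, an idle connection to a dead peer is detected after the idle time plus one interval per unanswered probe. A small sketch of that calculation, using the three keepalive values set above (the variable names are only for this example):

```shell
#!/bin/sh
# Values taken from the sysctl settings above (all in seconds, except the probe count).
keepalive_time=120
keepalive_intvl=30
keepalive_probes=8

# Idle time before the first probe, plus one interval per unanswered probe.
dead_peer_detect_s=$(( keepalive_time + keepalive_intvl * keepalive_probes ))
echo "Dead peer detected after about ${dead_peer_detect_s}s"
```

That is about 360 seconds (6 minutes) of silence before the kernel drops the connection.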

Windows

Things that you may want to know about TCP Keepalives
Avoiding TCP/IP Port Exhaustion

KeepAliveInterval = 30000
KeepAliveTime = 120000
TcpMaxDataRetransmissions = 8
TcpTimedWaitDelay = 30

Mac

Using TCP keepalive to Detect Network Errors

net.inet.tcp.keepidle=120000
net.inet.tcp.keepintvl=30000
net.inet.tcp.keepcnt=8

Note: remoting 2.62.1 includes an improvement to keepalive on the client (agent) side

When all else fails

  • Try adding this Java property on the master: -Djenkins.slaves.NioChannelSelector.disabled=true
  • Classic blocking I/O remains available when NIO is disabled; NIO adds complexity but improves performance
  • Try adding this Java property on the master: -Djenkins.slaves.JnlpSlaveAgentProtocol3.enabled=false