Dedicated JNLP agents Troubleshooting guide

Symptoms

  • I am not able to connect a JNLP agent to a Jenkins Instance
  • The build has failed because the connection got broken
  • The build is stalled in the queue waiting for the agent
  • The agent is disconnected and cannot connect again
  • A "Channel is broken" warning appears in the logs
  • Any of the exceptions listed below

Diagnosis/Treatment

Some data needs to be provided before starting to diagnose the issue:

1. Requirements

1.A Ensure that master and agent run at least the same Java major version

A good practice is to run exactly the same Java version on both the Jenkins master and the agent. When this is not possible, both sides must at least run the same baseline (the same major version). See Supported JDK for CloudBees Core.

Run java -version on both the Jenkins master box and the agent to check which Java version each one is running.
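As a rough sketch of the comparison described above, the helpers below reduce a Java version string to its major version so that the two sides can be compared. The function names (java_major, same_java_major, local_java_version) are hypothetical, and the parsing assumes the usual "1.8.0_292" (legacy) and "11.0.2" (current) version formats.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of Jenkins.

# Print the major version for a Java version string.
# "1.8.0_292" -> 8 (legacy 1.x scheme), "11.0.2" -> 11.
java_major() {
  local v="$1"
  v="${v#1.}"        # drop the legacy "1." prefix used up to Java 8
  v="${v%%[._-]*}"   # keep only the first remaining component
  printf '%s\n' "$v"
}

# Return 0 when both version strings share the same major version.
same_java_major() {
  [ "$(java_major "$1")" = "$(java_major "$2")" ]
}

# Print the version string reported by the local java binary.
local_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}
```

One possible use is to run local_java_version on the master and on the agent, then pass both results to same_java_major, for example: same_java_major "1.8.0_292" "1.8.0_121" && echo "same baseline".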

1.B Ensure that the version of agent.jar matches the one served by the Jenkins master

The main problem with using JNLP as the agent launcher is that agent.jar is not automatically upgraded on the agent when you upgrade Jenkins, whereas the SSH launcher does this out of the box. On Windows, this can be solved by using JNLP with winsw and adding the Remoting executable as a download entry: <download from="${JENKINS_URL}/jnlpJars/agent.jar" to="%BASE%\agent.jar"/>.

Check that agent.jar is the same on both sides, for example by comparing the output of md5sum agent.jar. agent.jar can be downloaded from the Jenkins master at the URL below:

http://<JENKINS_URL>/jnlpJars/agent.jar
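The checksum comparison can be scripted along the following lines. This is only a sketch: same_md5 is a hypothetical helper, it assumes md5sum is available (as on most Linux boxes), and the paths in the example comment are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; it is not part of Jenkins.
# Return 0 when both files have the same MD5 checksum, 1 when they
# differ, and 2 when either file cannot be read.
same_md5() {
  local a b
  [ -r "$1" ] && [ -r "$2" ] || return 2
  a="$(md5sum < "$1" | cut -d' ' -f1)"
  b="$(md5sum < "$2" | cut -d' ' -f1)"
  [ "$a" = "$b" ]
}

# Example (placeholder paths):
#   curl -fsSL -o /tmp/master-agent.jar "${JENKINS_URL}/jnlpJars/agent.jar"
#   same_md5 /tmp/master-agent.jar /path/to/agent.jar || echo "agent.jar differs"
```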

Please refer to Remoting Best Practices – Agent Daemonization

Partial solutions:

1.C Connectivity checks

Use jenkins-cli to check the connection

On the agent box, download the CLI and run a help command in your preferred mode. For example, using http mode:

java -jar jenkins-cli.jar [-s $JENKINS_URL] -auth <user>:<token> help

Check that the agent is able to see the Jenkins HTTP headers

# curl -IvL <JENKINS_URL>
curl -IvL https://jenkins:8443

On Windows, curl can be made available using, for example, the curl Download Wizard or Cygwin.

Check that the JNLP port is accessible to the agent

# telnet <JENKINS_HOST> <JNLP_PORT>
telnet jenkins.host.example.com 50234
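The two checks above can be combined in a small script, sketched below. It rests on two assumptions to verify against your own instance: that the master advertises its agent port in an X-Jenkins-JNLP-Port response header on the tcpSlaveAgentListener endpoint, and that bash (for /dev/tcp) and the timeout command are available on the agent box. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of Jenkins.

# Read HTTP response headers on stdin and print the value of the
# X-Jenkins-JNLP-Port header (assumed header name), if present.
jnlp_port_from_headers() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-jenkins-jnlp-port" { print $2; exit }'
}

# Return 0 if a TCP connection to host:port opens within 5 seconds.
# Relies on bash's /dev/tcp pseudo-device and on the timeout command.
tcp_port_open() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example (placeholder host names):
#   port="$(curl -sI "https://jenkins:8443/tcpSlaveAgentListener/" | jnlp_port_from_headers)"
#   tcp_port_open jenkins.host.example.com "$port" && echo "JNLP port reachable"
```

If no port is printed, the header may be missing or named differently on your version, in which case the port configured in the Jenkins security settings can be passed to tcp_port_open directly.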

2. Use a different launch mechanism

For Jenkins >= 2.204.1 LTS, switch to a different launch mechanism: connect directly to the TCP port.

3. Known issues

3.A. Unable to load class once the loading was interrupted

JENKINS-36991 (Unable to load class once the loading was interrupted) is resolved, and the fix was released in Remoting 2.61.

Jenkins log / Build console output log

java.lang.NoClassDefFoundError: Could not initialize class jenkins.model.Jenkins
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at Script1.class$(Script1.groovy)
    at Script1.$get$$class$jenkins$model$Jenkins(Script1.groovy)
    at Script1.run(Script1.groovy:1)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:580)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:618)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
    at hudson.remoting.UserRequest.perform(UserRequest.java:121)
    at hudson.remoting.UserRequest.perform(UserRequest.java:49)
    at hudson.remoting.Request$2.run(Request.java:326)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Agent log

Slave.jar version: 2.52
This is a Unix slave 
Evacuated stdout
Slave successfully connected and online
Jul 27, 2016 8:36:57 AM jenkins.model.Jenkins <clinit>
SEVERE: Failed to load Jenkins.class
hudson.remoting.RemotingSystemException: java.lang.InterruptedException
    at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:266)
    at com.sun.proxy.$Proxy5.fetch3(Unknown Source)
    at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:171)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at com.thoughtworks.xstream.XStream.buildMapper(XStream.java:590)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:568)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:496)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:465)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:411)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:350)
    at hudson.util.XStream2.<init>(XStream2.java:88)
    at jenkins.model.Jenkins.<clinit>(Jenkins.java:4217)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at Script1.class$(Script1.groovy)
    at Script1.$get$$class$jenkins$model$Jenkins(Script1.groovy)
    at Script1.run(Script1.groovy:1)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:580)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:618)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
    at hudson.remoting.UserRequest.perform(UserRequest.java:121)
    at hudson.remoting.UserRequest.perform(UserRequest.java:49)
    at hudson.remoting.Request$2.run(Request.java:326)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at hudson.remoting.Request.call(Request.java:147)
    at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:253)
    ... 30 more

3.B. Intermittent Invalid Object ID in remoting module

JENKINS-23271 Intermittent Invalid Object ID in remoting module

The fix is released in Jenkins core versions higher than 2.32.

The issue happens frequently on Java 8 due to its object management logic.
It causes problems in task execution (build failures, agent disconnects).

Jenkins log / Build console output log

FATAL: Invalid object ID 18649 iota=18470
java.lang.IllegalStateException: Invalid object ID 18469 iota=18470
    at hudson.remoting.ExportTable.diagnoseInvalidId(ExportTable.java:277)

3.C. Ping Thread

See the Ping Thread documentation.

The PingThread checks that the agent is able to execute a command from the master (a NOOP request).

The ping command may fail to execute for the following reasons:

  • Overloaded queue, all agent workers are busy → On big boxes you can increase the number of remoting TaskPool workers
  • Network overloaded

In some cases, disabling the PingThread can help.

If the stack trace below is what you see all the time, disable the PingThread. The side effect is that the agent is expected to hang if communication between master and agent fails. The upside is that you can then take a thread dump on both sides, master and agent.

Jenkins log / Build console output log

Caused by: java.io.IOException
    at hudson.remoting.Channel.close(Channel.java:1163)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
    at hudson.remoting.PingThread.ping(PingThread.java:126)
    at hudson.remoting.PingThread.run(PingThread.java:85)
Caused by: java.util.concurrent.TimeoutException: Ping started at 1474633728617 hasn't completed by 1474633968617
    ... 2 more
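The two numbers in the TimeoutException message are epoch timestamps in milliseconds, so their difference shows how long the ping waited before the channel was closed. The sketch below extracts them from such a message. The function name ping_elapsed_seconds is hypothetical and the parsing assumes the exact message wording shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; it is not part of Jenkins.
# Print the number of seconds between the two epoch-millisecond
# timestamps of a "Ping started at X hasn't completed by Y" message.
ping_elapsed_seconds() {
  local msg="$1" start end
  start="$(printf '%s\n' "$msg" | sed -n 's/.*Ping started at \([0-9][0-9]*\).*/\1/p')"
  end="$(printf '%s\n' "$msg" | sed -n 's/.*completed by \([0-9][0-9]*\).*/\1/p')"
  [ -n "$start" ] && [ -n "$end" ] || return 1
  echo $(( (end - start) / 1000 ))
}
```

For the trace above, 1474633968617 - 1474633728617 = 240000 ms, so the ping had been pending for 240 seconds when the channel was closed.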

3.D. JNLP Cloud Agents are disconnected on start process

It affects Jenkins core from version 2.28 onwards.

The requirements of the JNLP connection receiver were relaxed: it was rejecting connections from agents not using JNLPComputerLauncher (e.g. from the Agent Setup, vSphere Cloud and other plugins). Now the connection is also accepted from launchers that implement proxying and filtering around other Launcher implementations. Particular plugins may require setting the -Djenkins.slaves.DefaultJnlpSlaveReceiver.disableStrictVerification=true system property in the master JVM to allow agents to connect. See JENKINS-39232, a regression in 2.28.

4. HA / LB / Reverse proxy bypass

5. Clear the Java Web Start Cache

If you see an error like the one below when starting the JNLP file, run javaws -clearcache to clear the Java Web Start cache.

java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(Unknown Source)
	at java.net.SocketInputStream.read(Unknown Source)
	at sun.security.ssl.InputRecord.readFully(Unknown Source)
	at sun.security.ssl.InputRecord.read(Unknown Source)
	at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
	at sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source)
	at sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source)
	at sun.net.www.protocol.https.HttpsClient.afterConnect(Unknown Source)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection.access$200(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection$9.run(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection$9.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.security.AccessController.doPrivilegedWithCombiner(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
	at com.sun.deploy.net.HttpUtils.followRedirects(Unknown Source)
	at com.sun.deploy.net.BasicHttpRequest.doRequest(Unknown Source)
	at com.sun.deploy.net.BasicHttpRequest.doGetRequestEX(Unknown Source)
	at com.sun.deploy.cache.ResourceProviderImpl.checkUpdateAvailable(Unknown Source)
	at com.sun.deploy.cache.ResourceProviderImpl.isUpdateAvailable(Unknown Source)
	at com.sun.deploy.cache.ResourceProviderImpl.getResource(Unknown Source)
	at com.sun.deploy.cache.ResourceProviderImpl.getResource(Unknown Source)
	at com.sun.javaws.LaunchDownload$DownloadTask.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

6. JNLP Windows agent

Runaway agent process

  • In particular cases jenkins-agent.exe gets forcibly terminated (user action, fatal remoting failure, Windows service hard stop)
  • The java.exe process running the agent may be leaked
  • This causes multiple “slave is already connected” messages in the Jenkins log

7. TCP retransmission timeout in the OS (consider increasing it)

7.A Linux

Using Keep Alive

sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30
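To see what the keepalive values above mean in practice, the approximate time needed to detect a dead peer on an idle connection can be computed as the idle time plus the probe interval multiplied by the number of probes. The helper below is a hypothetical sketch of that arithmetic and is not an official formula from the Jenkins documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration.
# Approximate seconds until an idle, dead peer is detected:
#   idle time + probe interval * number of probes
keepalive_detection_seconds() {
  local idle="$1" intvl="$2" probes="$3"
  echo $(( idle + intvl * probes ))
}

# Example using the values currently set on a Linux box:
#   keepalive_detection_seconds \
#     "$(sysctl -n net.ipv4.tcp_keepalive_time)" \
#     "$(sysctl -n net.ipv4.tcp_keepalive_intvl)" \
#     "$(sysctl -n net.ipv4.tcp_keepalive_probes)"
```

With the settings above (120, 30 and 8), this gives 120 + 30 * 8 = 360 seconds.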

7.B Windows

Things that you may want to know about TCP Keepalives
Avoiding TCP/IP Port Exhaustion

KeepAliveInterval = 30000
KeepAliveTime = 120000
TcpMaxDataRetransmissions = 8
TcpTimedWaitDelay = 30

7.C Mac

Using TCP keepalive to Detect Network Errors

net.inet.tcp.keepidle=120000
net.inet.tcp.keepintvl=30000
net.inet.tcp.keepcnt=8

Note: Remoting 2.62.1 includes an improvement to keepalive handling on the client (agent) side.

8. When all else fails

  • Try adding this Java property on the master: -Djenkins.slaves.NioChannelSelector.disabled=true
  • Still I/O available and it complicates and improve the performance
  • Try adding this Java property on the master: -Djenkins.slaves.JnlpSlaveAgentProtocol3.enabled=false
