dedicated SSH agents formerly slaves get disconnected

Precondition: The agent (formerly slave) was working properly before the issue

Symptoms

  • The build has failed because the connection got broken
  • The build is stalled in the queue waiting for the agent (formerly slave)
  • The agent is disconnected and cannot connect again
  • Channel is broken warning at logs
  • Any of the exceptions listed below

Diagnosis/Treatment

There are some stuff required before starting to diagnose the issue that needs to be provided:

  • Support bundle, ideally generated when the agent is connected.
  • Agent configuration file (config.xml), could be recovered from disk at $JENKINS_HOME/nodes/[agent_name]/config.xml and/or from a browser at http(s)://JENKINS_SERVER/computer/[agent_name]/config.xml
  • Agent logs (In order to retain the logs, you will need to add that to agent configuration)
  • ssh-slave-plugin / cloudbees-ssh-slaves
    • If you are using bash, add Suffix Start Slave Command: 2> >(tee -a $( date '+test.%H-%M-%S.txt' ) 1>&2) Notice that there is a space before the 2.
    • If you are using ksh
    • Prefix Start Slave Command: bash -c "
    • Suffix Start Slave Command: 2> >(tee -a slave-stderr.log 1>&2)" (you need to add a space at the beginning)
  • cloudbees-ssh-slaves
    • WARNING: If you enable “Log environment on initial connect”, the agent will face a lack of performance on the startup time. The ssh-slave-plugin always log those environment variables during the startup process.
  • When the issue happens again, attach the logfile on the agent (formerly slave) and a new support bundle to check what happened at the same time on master and agent (formerly slave)
  • Build console output log if exists.
  • Jenkins logs

Before trying any other issue, firstly a manual ssh connection at Jenkins server shell to agent has to be performed, in order to discard basic connectivity problems. You could use this command that will verify the connection periodically.

ssh user@agent_ip "bash -c 'while true ; do export sleeptime=$(( 60*60 )) ; echo "Hey, time is '$(date)'. Now sleeping for \$sleeptime seconds" ; sleep \$sleeptime ; done'"

The most important diagnosing step

The first step is to change the agent launch method to “Launch the slave as a command on the master” and just use a plain old ssh client. You will need to configure keys at host and a launch command like ssh -o StrictHostKeyChecking=yes jenkinsuser@slavehost java -jar slave.jar. Please follow the inline help and Run SSH Agent using a command on Master

Changing the launch method in this case has many benefits.

  • Biggest benefit: Well known, google-able error messages. If you give a network admin a ChannelClosedException he’ll tell you to fix Jenkins. If you tell him that the error message is Write failed: Broken pipe he’ll understand the problem.
  • Removes Jenkins from the equation completely.
  • The SSH client is also more mature than the Jenkins SSH client and is probably more reliable in the face of poorly performing servers or connections.

This change is only a diagnostic step. Once the issue is understood, you can go back to regular ssh agents.

Known issues

Unable to load class once the loading was interrupted

JENKINS-36991 Unable to load class once the loading was interrupted is resolved and released in remoting 2.61.

Jenkins log / Build console output log

java.lang.NoClassDefFoundError: Could not initialize class jenkins.model.Jenkins
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at Script1.class$(Script1.groovy)
    at Script1.$get$$class$jenkins$model$Jenkins(Script1.groovy)
    at Script1.run(Script1.groovy:1)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:580)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:618)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
    at hudson.remoting.UserRequest.perform(UserRequest.java:121)
    at hudson.remoting.UserRequest.perform(UserRequest.java:49)
    at hudson.remoting.Request$2.run(Request.java:326)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Agent log

Slave.jar version: 2.52
This is a Unix slave
Evacuated stdout
Slave successfully connected and online
Jul 27, 2016 8:36:57 AM jenkins.model.Jenkins <clinit>
SEVERE: Failed to load Jenkins.class
hudson.remoting.RemotingSystemException: java.lang.InterruptedException
    at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:266)
    at com.sun.proxy.$Proxy5.fetch3(Unknown Source)
    at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:171)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at com.thoughtworks.xstream.XStream.buildMapper(XStream.java:590)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:568)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:496)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:465)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:411)
    at com.thoughtworks.xstream.XStream.<init>(XStream.java:350)
    at hudson.util.XStream2.<init>(XStream2.java:88)
    at jenkins.model.Jenkins.<clinit>(Jenkins.java:4217)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at Script1.class$(Script1.groovy)
    at Script1.$get$$class$jenkins$model$Jenkins(Script1.groovy)
    at Script1.run(Script1.groovy:1)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:580)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:618)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
    at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
    at hudson.remoting.UserRequest.perform(UserRequest.java:121)
    at hudson.remoting.UserRequest.perform(UserRequest.java:49)
    at hudson.remoting.Request$2.run(Request.java:326)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at hudson.remoting.Request.call(Request.java:147)
    at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:253)
    ... 30 more

Trilead ssh MAC and key exchange algorithms severely outdated

JENKINS-33021

https://github.com/jenkinsci/trilead-ssh2/blob/master/src/com/trilead/ssh2/crypto/digest/MAC.java#L19

  • ssh libraries are outdated - recent versions might have problem
  • ssh setting allows old encryption algorithm

In order to know which algorithms are supported, you could execute man sshd_config

MACs    Specifies the available MAC (message authentication code) algorithms.  The MAC algorithm is used in protocol version 2 for data integrity protection.  Multiple algorithms must be
             comma-separated.  The default is:
                   hmac-md5,hmac-sha1,umac-64@openssh.com,
                   hmac-ripemd160,hmac-sha1-96,hmac-md5-96,
                   hmac-sha2-256,hmac-sha256-96,hmac-sha2-512,
                   hmac-sha2-512-96

Then, edit the file /etc/ssh/sshd_config and comment the lines related to MACs.

Jenkins log / Build console output log

[06/22/15 14:49:05] [SSH] Opening SSH connection to 10.11.12.13:22.
Key exchange was not finished, connection is closed.
ERROR: Unexpected error in launching a slave. This is probably a bug in Jenkins.
java.lang.IllegalStateException: Connection is not established!
    at com.trilead.ssh2.Connection.getRemainingAuthMethods(Connection.java:1030)
    at com.cloudbees.jenkins.plugins.sshcredentials.impl.TrileadSSHPublicKeyAuthenticator.getRemainingAuthMethods(TrileadSSHPublicKeyAuthenticator.java:88)
    at com.cloudbees.jenkins.plugins.sshcredentials.impl.TrileadSSHPublicKeyAuthenticator.canAuthenticate(TrileadSSHPublicKeyAuthenticator.java:80)
    at com.cloudbees.jenkins.plugins.sshcredentials.SSHAuthenticator.newInstance(SSHAuthenticator.java:207)
    at com.cloudbees.jenkins.plugins.sshcredentials.SSHAuthenticator.newInstance(SSHAuthenticator.java:169)
    at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1173)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:701)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:696)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[06/22/15 14:49:06] Launch failed - cleaning up connection
[06/22/15 14:49:06] [SSH] Connection closed.

Agent log

fatal: no matching mac found:
client: hmac-sha1-96,hmac-sha1,hmac-md5-96,hmac-md5
server: hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512,hmac-sha2-256,umac-128@openssh.com [preauth]

Intermittent Invalid Object ID in remoting module

JENKINS-23271 Intermittent Invalid Object ID in remoting module

It’s fixed and released on Jenkins core higher than 2.32

Happens frequently on Java 8 due its object management logic.
Causes issues in task execution (build failures, agent disconnects)
Temporary solution - use Java 7 (does not completely solve the issue) on both Master and agents

Jenkins log / Build console output log

FATAL: Invalid object ID 18649 iuota=18470
java.lang.IllegalStateException: Invalid object ID 18469 iota=18470
at hudson.remoting.ExportTable.diagnoseInvalidId(ExportTable.java:277)

Ping Thread

Ping Thread

PingThread checks that agent is ABLE to execute a command from master (NOOP request)

Ping command may fail to execute:
* Overloaded queue, all agent workers are busy → On big boxes you can increase the number of remoting TaskPool workers
* Network overloaded

In some cases disabling can help

So, if this is the stacktrace you are seeing all the time, you should then disable the PingThread. The side effect is just that the agent is suppose to hung in case the communication is really failing between master and agents, but this is good as you will then use jstack to take a threadDump on both sides master and agent it self.

Jenkins log / Build console output log

Caused by: java.io.IOException
    at hudson.remoting.Channel.close(Channel.java:1163)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
    at hudson.remoting.PingThread.ping(PingThread.java:126)
    at hudson.remoting.PingThread.run(PingThread.java:85)
Caused by: java.util.concurrent.TimeoutException: Ping started at 1474633728617 hasn't completed by 1474633968617
    ... 2 more

Other considerations

Remoting issue

TCP retransmission timeout OSS - perhaps increase

Linux

Using Keep Alive

sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1

Windows

Things that you may want to know about TCP Keepalives
Avoiding TCP/IP Port Exhaustion

KeepAliveInterval = 30
KeepAliveTime = 120
TcpMaxDataRetransmissions = 8
TcpTimedWaitDelay=30

Note: remoting 2.62.1 has an improvement wrt to keepalive from the client (agent) side

When all fails

  • Try to use the other agent implementation (move from ssh-slave-plugin to cloudbees-ssh-slaves or vice versa)
  • Try to add this JAVA property on master -Djenkins.slaves.NioChannelSelector.disabled=true
  • Try to add this JAVA property on master -Djenkins.slaves.JnlpSlaveAgentProtocol3.enabled=false
Have more questions? Submit a request

0 Comments

Please sign in to leave a comment.