Why are the messaging-related operations failing?

Issue

At the cluster level, all messaging-related operations stop working. Potential symptoms include:

  • The remote trigger operations that use the Operations Center to trigger jobs on a different controller do not work anymore.
  • The Move / Copy operations that use Operations Center to move or copy jobs among controllers do not work anymore.

The logs also show an entry similar to the following:

2019-XX-XX XX:XX:XX.XXX+0000 [id=XX]    INFO    hudson.model.AsyncPeriodicWork#doRun: com.cloudbees.opscenter.server.messaging.Transport thread is still running. Execution aborted.
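To check whether a saved log contains this symptom, the lines can be filtered for the message text. The following is a minimal sketch, not a CloudBees tool: the class name `TransportLogScan` and the sample lines are made up for illustration, and only the marker text is taken from the log entry above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TransportLogScan {
    // Message text taken from the log entry shown in this article.
    static final String MARKER =
        "com.cloudbees.opscenter.server.messaging.Transport thread is still running";

    /** Returns only the log lines that report the stuck Transport task. */
    static List<String> stuckEntries(List<String> logLines) {
        return logLines.stream()
            .filter(line -> line.contains(MARKER))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample input; the second line is an invented, unrelated entry.
        List<String> sample = Arrays.asList(
            "2019-XX-XX XX:XX:XX.XXX+0000 [id=XX]    INFO    hudson.model.AsyncPeriodicWork#doRun: "
                + MARKER + ". Execution aborted.",
            "2019-XX-XX XX:XX:XX.XXX+0000 [id=XX]    INFO    some.other.Logger: unrelated entry");
        System.out.println(stuckEntries(sample).size()); // prints 1
    }
}
```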

Environment

Workaround

Operations Center uses a messaging system to manage operations that involve more than one controller. This messaging system relies on a “mailbox checkout” task that checks the mailbox of every controller connected to the Operations Center, one at a time. The task does not currently implement a timeout, so it can get stuck, for a variety of reasons, waiting for a controller to answer with the contents of its mailbox.
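The effect of a missing timeout on a one-at-a-time check can be illustrated with a small simulation. This is only a sketch of the general pattern, not the Operations Center implementation: the class `MailboxCheckSketch`, the method `checkMailboxes`, and the controller names are hypothetical. Each simulated mailbox is polled in order, and a per-controller timeout lets the loop move past a controller that never answers. Without that timeout (a plain `reply.get()`), the loop would wait on the second controller and never reach the third.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class MailboxCheckSketch {

    /** Polls each simulated mailbox in order and returns the names that answered in time. */
    static List<String> checkMailboxes(Map<String, Callable<String>> mailboxes, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        List<String> answered = new ArrayList<>();
        try {
            for (Map.Entry<String, Callable<String>> entry : mailboxes.entrySet()) {
                Future<String> reply = pool.submit(entry.getValue());
                try {
                    // Bounded wait: a plain reply.get() here would block forever
                    // on a controller that never answers.
                    reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    answered.add(entry.getKey());
                } catch (TimeoutException e) {
                    reply.cancel(true); // interrupt the stuck call and continue
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return answered;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical controllers; the second one simulates a stuck mailbox.
        Map<String, Callable<String>> mailboxes = new LinkedHashMap<>();
        mailboxes.put("controller-a", () -> "2 messages");
        mailboxes.put("controller-b", () -> { Thread.sleep(60_000); return "never"; });
        mailboxes.put("controller-c", () -> "0 messages");
        System.out.println(checkMailboxes(mailboxes, 200)); // prints [controller-a, controller-c]
    }
}
```

Disconnecting the stuck controller, as described in this article, has a comparable effect on the real task: the check stops waiting on that controller and proceeds to the next one.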

If you are experiencing this issue, one easy way to get things back to normal is to locate the controller that is holding the task, using the additional logger described below, and temporarily disconnect that controller. The task then continues to the next controller, and the processes that rely on it return to normal.

However, in an environment with a large number of controllers connected to the Operations Center, finding the one that is causing the problem can be challenging.

One way to get additional information on what is happening at the transport level is to create a custom logger in your Operations Center for the class below, with the log level shown:

  • com.cloudbees.opscenter.server.messaging.Transport (Level: FINE). This gives additional insight into which controller is blocking the task.
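Why the level matters can be shown with plain `java.util.logging`, the logging framework Jenkins uses. The sketch below is a standalone illustration and does not replace creating the custom logger in Operations Center: the class `FineLoggerSketch`, the capturing handler, and both log messages are made up. It shows that a logger left at INFO drops FINE records, while the same logger set to FINE passes them on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class FineLoggerSketch {
    // Logger name taken from this article.
    static final String TRANSPORT = "com.cloudbees.opscenter.server.messaging.Transport";

    /** Collects the records that reach it; stands in for a log recorder. */
    static class CapturingHandler extends Handler {
        final List<String> messages = new ArrayList<>();
        @Override public void publish(LogRecord record) {
            if (isLoggable(record)) {
                messages.add(record.getLevel() + ": " + record.getMessage());
            }
        }
        @Override public void flush() {}
        @Override public void close() {}
    }

    /** Logs one FINE and one INFO record with the logger set to the given level. */
    static List<String> capture(Level loggerLevel) {
        Logger logger = Logger.getLogger(TRANSPORT);
        CapturingHandler handler = new CapturingHandler();
        handler.setLevel(Level.ALL);
        logger.addHandler(handler);
        logger.setUseParentHandlers(false); // keep the demo output off the console handler
        logger.setLevel(loggerLevel);
        try {
            logger.fine("checking mailbox of a controller (made-up message)");
            logger.info("periodic task finished (made-up message)");
        } finally {
            logger.removeHandler(handler);
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println(capture(Level.INFO).size()); // prints 1: the FINE record is dropped
        System.out.println(capture(Level.FINE).size()); // prints 2: the FINE record is kept
    }
}
```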

Another potential workaround, if you cannot locate the controller that is holding the task, is to restart the Operations Center, which resets the messaging system.

Tested product/plugin versions
