Why are the messaging-related operations failing?

Issue

At the cluster level, all messaging-related operations stop working. Possible symptoms include:

  • The remote trigger operations that use the Operations Center to trigger jobs on a different master no longer work.
  • The Move / Copy operations that use the Operations Center to move or copy jobs among masters no longer work.

The logs also show a message similar to the following:

2019-XX-XX XX:XX:XX.XXX+0000 [id=XX]    INFO    hudson.model.AsyncPeriodicWork#doRun: com.cloudbees.opscenter.server.messaging.Transport thread is still running. Execution aborted.
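
To check whether an Operations Center log contains this message, the file can be scanned with a short script. The following is a minimal sketch in Python, not an official tool: the function names `find_stuck_transport_lines` and `scan_log_file` are hypothetical, the log path is whatever your installation uses, and the match relies only on the message text shown above.

```python
from pathlib import Path

# Text taken from the log message shown above.
MARKER = (
    "com.cloudbees.opscenter.server.messaging.Transport "
    "thread is still running. Execution aborted."
)


def find_stuck_transport_lines(lines):
    """Return the lines that contain the 'still running' Transport message."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


def scan_log_file(path):
    """Scan a log file (path supplied by the caller) and return matching lines."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return find_stuck_transport_lines(handle)
```

If the returned list is not empty, the symptom described in this article is present in that log.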

Environment

Workaround

Operations Center relies on a messaging system to manage operations that involve more than one master. This messaging system depends on a “mailbox checkout” task that checks the mailbox of every master connected to the Operations Center, one at a time. The task does not currently implement a timeout, so it can get stuck, for various reasons, while waiting for a master to respond with the contents of its mailbox.

If you are experiencing this issue, one easy way to restore normal operation is to identify which master is holding up the task, using the additional logger described below, and temporarily disconnect that master. The task then continues to the next master, and the processes that rely on it resume.
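
The effect of a sequential check without a timeout, and of disconnecting the blocking master, can be illustrated with a toy model. The Python sketch below is not the Operations Center implementation. The function `check_mailboxes` and the master names are hypothetical, and an unresponsive master is modeled by stopping the loop instead of actually waiting forever.

```python
def check_mailboxes(masters, unresponsive):
    """Toy model: visit each master's mailbox in order.

    Returns (checked, blocked_on). A real task without a timeout would wait
    indefinitely at `blocked_on`; this model simply stops and reports it.
    """
    checked = []
    for master in masters:
        if master in unresponsive:
            return checked, master
        checked.append(master)
    return checked, None


masters = ["master-1", "master-2", "master-3"]  # hypothetical names

# master-2 never answers, so master-3 is never reached.
print(check_mailboxes(masters, unresponsive={"master-2"}))

# Disconnecting master-2 removes it from the round, so the task reaches master-3.
connected = [m for m in masters if m != "master-2"]
print(check_mailboxes(connected, unresponsive={"master-2"}))
```

In the first call the round stops at master-2 and every master after it is starved. In the second call the round completes, which is why disconnecting the blocking master lets the dependent operations resume.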

But in an environment with a large number of masters connected to the Operations Center, finding the one that is causing the problem might be challenging.

To get additional information about what is happening at the transport level, create a custom logger in your Operations Center with the following class and log level:

  • com.cloudbees.opscenter.server.messaging.Transport, Level: FINE

This provides additional insight into which master is blocking the task.

Another workaround, if you cannot identify which master is holding up the task, is to restart the Operations Center, which resets the messaging system.

Tested product/plugin versions
