Jenkins intermittently fails to restart on RHEL 7 and CentOS 7

Issue

Under certain circumstances, Jenkins may hang during a restart. When this happens, all of the following are true:

  • The Jenkins java process is running in a waiting state.
  • Jenkins is effectively down.
  • Nothing is logged.

Sometimes, after numerous restarts, the Jenkins service may start up again normally.

The root cause for this issue is that the Jenkins service hangs immediately before it forks the child process that starts Jetty and Jenkins. Although the Java process is running, nothing is logged, because Jenkins has not yet started and is not yet listening on any port.
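
The symptoms described above (a Java process that exists but is not listening on any port) can be checked from a shell. The following is a minimal sketch rather than an official diagnostic: the function name classify_jenkins_state is hypothetical, and the process pattern and commands shown in the usage comments are assumptions that may need adapting to your product.

```shell
#!/bin/sh
# classify_jenkins_state (hypothetical helper)
#   $1 = number of running Jenkins Java processes
#   $2 = number of TCP sockets those processes are listening on
# Prints one of: stopped, possibly-hung, running.
classify_jenkins_state() {
    procs=$1
    listeners=$2
    if [ "$procs" -eq 0 ]; then
        echo "stopped"
    elif [ "$listeners" -eq 0 ]; then
        # A process exists but nothing is listening: matches the hang described above.
        echo "possibly-hung"
    else
        echo "running"
    fi
}

# Example usage (run as root so that ss can show process names).
# Assumption: the Java command line contains the string "jenkins".
#   procs=$(pgrep -f jenkins | wc -l)
#   listeners=$(ss -ltnp | grep -c java)
#   classify_jenkins_state "$procs" "$listeners"
```

A result of possibly-hung only indicates that the symptoms match; a Jenkins instance that is still starting up can briefly look the same.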

Environment

NOTE: This issue affects a very small number of CloudBees customers. You only need to take action if you are directly affected by this issue: if you are not experiencing this issue, no action is necessary.

Customers most likely to be affected by this issue are those who are:

  • Using the init.d scripts with the --daemon argument.
  • Running a 3.10.0 kernel.
  • Using RPM-based distributions based on RHEL or CentOS 7.
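
To check whether a server matches the criteria listed above, something like the following sketch may help. The function names is_affected_kernel and uses_daemon_flag are hypothetical, and the init script path in the usage comments is only an example; substitute the file that matches your product.

```shell
#!/bin/sh
# is_affected_kernel: succeeds if the kernel release string in $1 starts with 3.10.0.
is_affected_kernel() {
    case "$1" in
        3.10.0*) return 0 ;;
        *) return 1 ;;
    esac
}

# uses_daemon_flag: succeeds if the init script at path $1 contains --daemon.
# The "--" stops grep from treating the pattern as an option.
uses_daemon_flag() {
    grep -q -- '--daemon' "$1"
}

# Example usage:
#   cat /etc/redhat-release    # shows the RHEL or CentOS release
#   if is_affected_kernel "$(uname -r)" && uses_daemon_flag /etc/init.d/jenkins; then
#       echo "This server matches the affected configuration"
#   fi
```

Matching these criteria does not mean the instance will hang; it only means that the configuration is one on which the issue has been observed.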

The issue affects the following platforms:

Workaround

To work around this issue:

  1. Log into the server that is running your product instance.
  2. Stop and back up your instance, as detailed in the Backup and Restore guide.
  3. On the server running the instance, look in the /etc/init.d directory for a file that matches your distribution:
    • jenkins for CloudBees Jenkins Platform Client Master instances
    • jenkins-oc for CloudBees Jenkins Platform Operations Center instances
    • cloudbees-jenkins-distribution for CloudBees Jenkins Distribution instances
    • cloudbees-core-cm for CloudBees Core on Traditional Platforms Client Master instances
    • cloudbees-core-oc for CloudBees Core on Traditional Platforms Operations Center instances
  4. Make a note of this filename. For example, for CloudBees Jenkins Distribution, the file is /etc/init.d/cloudbees-jenkins-distribution.
  5. In the file matching your distribution, search for the line reading -Dcb.distributable.commit_sha=[hashToUse]. Make a note of the hashToUse value.
  6. In a temporary directory, create a new file and populate it with the content in the file https://cloudbees-jenkins-scripts.s3.amazonaws.com/e206a5-linux/systemInit.sh.
  7. In the file you just created, find all occurrences of the string @@ARTIFACTNAME@@, and replace them with your installed product identifier, where product identifier is one of:
    • jenkins for CloudBees Jenkins Platform Client Master instances
    • jenkins-oc for CloudBees Jenkins Platform Operations Center instances
    • cloudbees-jenkins-distribution for CloudBees Jenkins Distribution instances
    • cloudbees-core-cm for CloudBees Core on Traditional Platforms Client Master instances
    • cloudbees-core-oc for CloudBees Core on Traditional Platforms Operations Center instances
  8. Locate and replace all occurrences of the string @@COMMIT_SHA@@ with the value of hashToUse you previously looked up.
  9. Back up the /etc/init.d file matching your distribution, and move it to a safe location.
  10. If you have modified the /etc/init.d file (for example, by adding or removing Java parameters), make sure the same modifications are applied to the new script file.
  11. Make the script file you created earlier executable (if needed) and execute it.
  12. This action should generate a .service file with the name of your distribution. For example, for CloudBees Jenkins Distribution, the resulting file should be cloudbees-jenkins-distribution.service in the local directory.
  13. Copy the newly-generated file into the /etc/systemd/system folder.
  14. Restart the instance using the systemctl command:

    systemctl daemon-reload && systemctl start <name of service>
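
Steps 5 through 8 of the procedure above can be partly scripted. The following is a sketch under stated assumptions, not a supported tool: the function names extract_commit_sha and fill_template are hypothetical, the hash value is assumed to end at the first whitespace or double quote, and the backup location in the usage comments is only an example. It does not carry over any custom modifications you made to the /etc/init.d file (step 10); those still need to be applied by hand.

```shell
#!/bin/sh
# extract_commit_sha: print the value that follows -Dcb.distributable.commit_sha=
# in the init script at path $1.
# Assumption: the value ends at the first whitespace or double quote.
extract_commit_sha() {
    sed -n 's/.*-Dcb\.distributable\.commit_sha=\([^[:space:]"]*\).*/\1/p' "$1" | head -n 1
}

# fill_template: replace both placeholders in template $1 and write the result to $4.
#   $2 = product identifier, $3 = commit hash
# Both values are assumed to contain no characters special to sed (such as / or &).
fill_template() {
    sed -e "s/@@ARTIFACTNAME@@/$2/g" -e "s/@@COMMIT_SHA@@/$3/g" "$1" > "$4"
}

# Example usage (as root, after stopping and backing up the instance):
#   product=cloudbees-jenkins-distribution
#   sha=$(extract_commit_sha "/etc/init.d/$product")
#   curl -fsSL -o systemInit.sh.tmpl \
#     https://cloudbees-jenkins-scripts.s3.amazonaws.com/e206a5-linux/systemInit.sh
#   fill_template systemInit.sh.tmpl "$product" "$sha" systemInit.sh
#   mv "/etc/init.d/$product" "/root/$product.init.bak"   # example backup location
#   chmod +x systemInit.sh && ./systemInit.sh
#   cp "$product.service" /etc/systemd/system/
#   systemctl daemon-reload && systemctl start "$product"
```

Check the value printed by extract_commit_sha against the init script before using it, because the exact quoting in your file may differ from what this sketch assumes.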
    

Troubleshooting

If you want to roll back the changes:

  1. Remove the .service file you previously added to the /etc/systemd/system folder.

  2. Restore the /etc/init.d script that you saved to an alternate location.

  3. Reload the systemd configuration, so that it no longer refers to the removed .service file, using the command:

    systemctl daemon-reload
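
The rollback steps above can be sketched as a small shell function. This is an illustration only: the function name rollback_unit and the UNIT_DIR and INITD_DIR variables are hypothetical, introduced so that the function can be tried against scratch directories before it is pointed at the real system paths.

```shell
#!/bin/sh
# rollback_unit: remove the native unit file and restore the saved init script.
#   $1 = service name (for example cloudbees-jenkins-distribution)
#   $2 = path to the init script you backed up earlier
# UNIT_DIR and INITD_DIR (hypothetical overrides) default to the real locations.
rollback_unit() {
    name=$1
    backup=$2
    unit_dir=${UNIT_DIR:-/etc/systemd/system}
    initd_dir=${INITD_DIR:-/etc/init.d}
    # Remove the unit file first, then put the original script back,
    # preserving its mode and timestamps.
    rm -f "$unit_dir/$name.service" &&
        cp -p "$backup" "$initd_dir/$name"
}

# Example usage (as root; the backup path is an assumption):
#   rollback_unit cloudbees-jenkins-distribution /root/cloudbees-jenkins-distribution.init.bak \
#     && systemctl daemon-reload
```

After the reload, the instance is managed by the restored /etc/init.d script again and still needs to be started.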
    

Background

The RPM packaging for RHEL and CentOS systems calls the OS daemon command from the /etc/init.d scripts used to manage the instance.

Jenkins-based products used the Akuma library (invoked with the --daemon flag) to make Jenkins daemonize itself in a way that the OS daemon command could handle properly.

On recent versions of RHEL and CentOS operating systems, the Akuma library no longer worked as expected upon restarts, and instead could cause an infinite loop while attempting to daemonize the instance.

One possible workaround was to simply remove the --daemon flag, but this was an incomplete solution: it fixed the error but caused the underlying /etc/init.d script to become nonfunctional.

The better solution (and the solution detailed in this article) was to use a native systemd service file to manage the daemon.
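
To see whether an instance is currently managed by a native systemd service file or by a unit that systemd generated from the /etc/init.d script, you can inspect the unit's FragmentPath property. The sketch below is an assumption-laden illustration: the function name unit_origin is hypothetical, and the directory patterns reflect where systemd usually places generated and native units, which may differ on your system.

```shell
#!/bin/sh
# unit_origin: classify a unit by the FragmentPath value passed in $1.
# Assumption: units generated from init scripts live under /run/systemd/generator*,
# while native units live under /etc/systemd/system or the vendor unit directories.
unit_origin() {
    case "$1" in
        /run/systemd/generator*/*) echo "generated-from-init-script" ;;
        /etc/systemd/system/*|/usr/lib/systemd/system/*|/lib/systemd/system/*)
            echo "native-unit" ;;
        "") echo "not-loaded" ;;
        *) echo "unknown" ;;
    esac
}

# Example usage. cut is used instead of the --value option, which older
# systemd versions do not provide.
#   path=$(systemctl show -p FragmentPath cloudbees-jenkins-distribution.service | cut -d= -f2-)
#   unit_origin "$path"
```

After the workaround has been applied, the expected result is native-unit.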

2 Comments

  • Lee Meador

    Is there a plan to add this change to the distribution rpms and, if so, what version will have it?

  • Arnaud Héritier

    Hi Lee,

      Yes, we are working on such a plan. It is not so easy in terms of migration path, tests, and so on, so we don't yet have an ETA to share, but our engineering team is working on it. As soon as we have more updates, and especially an official fix, we will update this article to recommend upgrading.

    Best regards
