How to set up a CJP HA environment with HAProxy

Introduction

CloudBees Jenkins Platforms come with the capability to run in a high-availability setup, where two or more JVMs form a so-called “HA singleton” cluster, to ensure business continuity for Jenkins. This improves the availability of the service against unexpected problems in the JVM, the hardware that it runs on, etc. When a Jenkins JVM becomes unavailable (for example, when it stops responding, or when it dies), other nodes in the cluster automatically take over the role of the master, thereby restoring service with minimum interruption.
It is also important for users to understand what this feature does not do in its current form. Namely, it is not a symmetric cluster, where participating nodes will share workloads together. At any given point only one of the nodes is performing the master role (hence “HA singleton”). Because of this, when a fail-over takes place, users will experience a brief downtime, comparable to someone rebooting a Jenkins master in a non-HA setup. Builds that were in progress would be lost, too.
This guide walks through a high-level overview, details the implementation and configuration steps, and finishes with troubleshooting tips.

Note: This guide is tailored towards CloudBees Operations Center, but can be applied to CloudBees Jenkins Client Masters. Where applicable, a note will describe differences between the two.

High-level overview

The diagram, below, outlines the principal design details with routing for each communication protocol.

Alpha and Beta

To create the “HA singleton”, it is necessary to have two copies of Operations Center: Alpha and Beta (active and standby, respectively). In HA mode, the two Jenkins instances cooperatively elect the “primary” JVM, and depending on the outcome of this election, members of the cluster start or stop the CJE master in the same JVM. From the viewpoint of the code inside CloudBees Jenkins Operations Center, this is as if it is started or stopped programmatically.
CloudBees Jenkins Operations Center relies on JGroups for the underlying group membership service.

By default, CloudBees Jenkins Operations Center uses TCP to communicate between members, with IP addresses and ports registered in a directory $JENKINS_HOME/jgroups (which all members must be able to write to). This can be changed by creating $JENKINS_HOME/jgroups.xml, which describes the JGroups protocol stack configuration in XML. See the JBoss Clustering documentation for the format of this file, as well as typical configuration tips and troubleshooting.

Fail-over

A fail-over is effectively (1) shutting down the current Jenkins JVM, followed by (2) starting it up in another location. Sometimes step 1 doesn’t happen, for example when the current master crashes. Because these masters work with the same $JENKINS_HOME, this fail-over process has the following characteristics:

  • Jenkins global settings, configuration of jobs/users, fingerprints, record of completed builds (including archived artifacts, test reports, etc.), will all survive a fail-over.
  • User sessions are lost. If your Jenkins installation requires users to log in, they’ll be asked to log in again.

During the startup phase of the fail-over, Jenkins will not be able to serve inbound requests or builds. Therefore, a fail-over typically takes a few minutes, not a few seconds.

Note: For HA Client Masters, builds that were in progress will normally not survive a fail-over, although their records will survive. The builds based on Jenkins Pipeline Jobs will continue to execute. No attempt will be made to re-execute interrupted builds, though the Restart Aborted Builds plugin will list the aborted builds.

Load Balancer

In general, all Jenkins traffic should be routed through the load balancer with the exception of the JNLP transport. The JNLP transport is used by Client Masters to connect to Operations Center and by Build Agents configured to use the “Java Web Start” launcher.

The fail-over of the JNLP transport is handled by the CloudBees Jenkins Operations Center. This JNLP fail-over also applies to Jenkins JNLP Agents and CloudBees Jenkins Client Masters.

There are many software and hardware load balancer solutions. CloudBees recommends the open source solution HAProxy, which doubles as a reverse proxy. For a truly highly available setup, HAProxy itself needs to be made highly available. Support for HAProxy is available from HAProxy Technologies Inc.

There are several ways to make HAProxy highly available. HAProxy Technologies Inc offers a commercial product, and there are also open source solutions documented on the Internet.

HTTPS and SSL

In addition to acting as a load balancer and reverse proxy, HAProxy acts as an SSL termination point for CloudBees Jenkins Operations Center. This allows internal traffic to remain in the default HTTP configuration while providing a secured endpoint for users.

If you do not have access to your environments’ SSL key files, please reach out to your operations teams.

If the SSL certificate used by haproxy is not trusted by default by the CloudBees Jenkins Operations Center JVM, it must be added to the keystores of both CloudBees Jenkins Operations Centers and all client masters.

If the SSL certificate is used to secure the connection to elasticsearch, it must be added to both CloudBees Jenkins Operations Center JVM keystores.

Note: When making client masters highly available, if the SSL certificate used by haproxy is not trusted by default by the CloudBees Jenkins Operations Center JVM, it is recommended that it be added to the keystores of both CloudBees Jenkins Operations Centers.

For further information, please refer to How to install a new SSL certificate

Shared Storage

Each member node of a CloudBees Jenkins Operations Center HA cluster needs to see a single coherent shared file system that can be read and written simultaneously. That is to say, for nodes Alpha and Beta in the cluster, if node Alpha creates a file in $JENKINS_HOME, node Beta needs to be able to see it within a reasonable amount of time. A “reasonable amount of time” here means the time window during which you are willing to lose data in case of a failure. This is commonly accomplished with an NFS shared file system.

To set up the NFS Server please see: NFS Setup Procedure Best Practices

Note: For truly highly-available Jenkins, NFS storage itself needs to be made highly-available. There are many resources on the web describing how to do this. See the Linux HA-NFS Wiki as a starting point.

If you are using Amazon’s Elastic File System (EFS) service for your NFS server, please ensure that the file system’s Performance Mode is set to “Max I/O Performance Mode”.

CloudBees Jenkins HA monitor tool

The CloudBees Jenkins Enterprise HA monitor tool is a small background application that executes setup/teardown scripts when the primary Jenkins JVM moves to or from the system it runs on, including when that JVM becomes unresponsive. Such scripts can only be reliably triggered from outside Jenkins, and they normally require root privileges to run. The tool can be downloaded from the jenkins-ha-monitor section of the download site.

Its three defining options are as follows:

  • The -home option that specifies the location of $JENKINS_HOME. The monitor tool picks up network configuration and other important parameters from here, so it needs to know this location.
  • The -host-promotion option that specifies the location of the promotion script, which gets executed when the primary Jenkins JVM moves in from another system into this system as a result of election. In native packages, this file is placed at /etc/jenkins-ha-monitor/promotion.sh.
  • The -host-demotion option that specifies the demotion script, which is the opposite of the promotion script and gets executed when the primary Jenkins JVM moves from this system into another system. In native packages, this file is placed at /etc/jenkins-ha-monitor/demotion.sh.
    Promotion and demotion scripts need to be idempotent, in the sense that the monitor tool may run the promotion script on an already promoted node, and the demotion script on an already demoted node. This can happen, for example, when a power outage hits a stand-by node: when the node comes back up, the monitor tool runs the demotion script on it again, since it cannot be certain about the state of the node before the power outage.
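
The idempotence requirement can be illustrated with a minimal sketch. It assumes a hypothetical marker file standing in for whatever resource a real promotion script would acquire (a floating IP address, a mount, a service); the function names and the marker path are illustrative and are not part of the monitor tool.

```shell
#!/bin/sh
# Hypothetical marker file standing in for the resource this node holds
# while it is primary (for example a floating IP address).
MARKER="${MARKER:-/tmp/jenkins-ha-primary.marker}"

# Promotion: acquire the resource only if this node does not hold it yet.
promote() {
    if [ -e "$MARKER" ]; then
        echo "already promoted"
    else
        : > "$MARKER"
        echo "promoted"
    fi
}

# Demotion: release the resource only if this node currently holds it.
demote() {
    if [ -e "$MARKER" ]; then
        rm -f "$MARKER"
        echo "demoted"
    else
        echo "already demoted"
    fi
}
```

Running either function twice in a row leaves the node in the same state as running it once, which is the property the monitor tool relies on.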

Run the tool with the -help option to see the complete list of available options. The configuration file that is read when the service starts is located at /etc/sysconfig/jenkins-ha-monitor.

HA Setup Procedure

The HA setup in CloudBees Jenkins Operations Centers provides the means for multiple JVMs to coordinate and ensure that the Jenkins master is running somewhere, but it does so by relying on the availability of the storage that houses $JENKINS_HOME and the HTTP reverse proxy mechanism that hides a fail-over from users who are accessing Jenkins. Aside from NFS as a storage and the reverse proxy mechanism, CloudBees Jenkins Operations Center can run on a wide range of environments. In the following sections, we’ll describe the parameters required from them and discuss more examples of the deployment mode.

The setup procedure will follow this basic format:
* Before you begin
* Configure Shared Storage
* Configure CloudBees Jenkins Operations Center and Client Masters
* Install CloudBees Jenkins Operations Center HA monitor tool
* Configure HAProxy

After these steps are followed, you should follow the Testing and Troubleshooting section.

Before you begin

In this tutorial, we’ll describe the simplest HAProxy CloudBees Jenkins setup and will focus only on configuration necessary on the impacted CloudBees Jenkins Operations Center components. Before we begin, you will need to have the following:

  • A Shared Storage server that hosts $JENKINS_HOME. We assume this is available already, and we will not discuss how one would set this up. (We’ll call this machine sierra.)
  • Two Linux systems that will form a Jenkins HA cluster, each running one JVM with the CloudBees Jenkins Operations Center application. In this tutorial, those two machines need to be on the same local network.
  • The trusted SSL certificates. If the SSL certificates are not trusted by default by the JVMs, please refer to the SSL section of this document for further instruction.
  • A third Linux system which will be used to run HAProxy as the load balancer and reverse proxy.

Configure Shared Storage on Alpha and Beta

All the member nodes of a CloudBees Jenkins Enterprise HA cluster need to see a single coherent file system that can be read and written to simultaneously. Creating a shared storage mount is outside the scope of this guide. Please see the Shared Storage section, above.

Mount the Shared Storage to Alpha and Beta:

$ mount -t nfs -o rw,hard,intr sierra:/jenkins /var/lib/jenkins

/var/lib/jenkins is chosen to match what the CloudBees Jenkins Enterprise packages use as $JENKINS_HOME. If you change it, update /etc/default/jenkins (on Debian) or /etc/sysconfig/jenkins (on RedHat and SUSE) to have $JENKINS_HOME point to the correct directory.

Before continuing, test that the mount point is accessible and permissions are correct on both machines by executing touch /<mount_point>/test.txt on Alpha and then editing the file on Beta.
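
As a complement to the manual test above, the following sketch checks that a directory is writable and that written content reads back correctly on the node where it is run. The function name and the probe file name are hypothetical. It does not replace the cross-node check (create on Alpha, edit on Beta), since it only exercises one node at a time.

```shell
#!/bin/sh
# Write a probe file into the given directory and read it back.
# Returns 0 when the directory is writable and the content round-trips.
check_shared_dir() {
    dir="$1"
    probe="$dir/.ha-write-test.$$"      # hypothetical probe file name
    token="probe-$(date +%s)"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    if ! printf '%s\n' "$token" > "$probe"; then
        echo "cannot write to $dir" >&2
        return 1
    fi
    read_back="$(cat "$probe")"
    rm -f "$probe"
    [ "$read_back" = "$token" ]
}
```

For example, after sourcing the function, running check_shared_dir /var/lib/jenkins && echo ok on both Alpha and Beta would confirm that each node can write to the mount.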

Install and Configure CloudBees Jenkins Operations Center and Client Masters

Choose the appropriate debian/redhat/openSUSE package format depending on the type of your distribution. Install CJOC on Alpha and Beta.

Upon installation, both instances of Jenkins will start running. Stop them by issuing /etc/init.d/jenkins stop while we work on the HA setup configuration. If you don’t know how to set up a Java argument on Jenkins, you can follow this KB article.

When CloudBees Jenkins Operations Center is being used in an HA configuration, both Alpha and Beta instances must modify the JENKINS_HOME Jenkins argument to point to the shared NFS mount.

It is an important performance optimization that the .war file is not extracted to the $JENKINS_HOME/war directory in the shared filesystem. In a rolling upgrade scenario, an upgrade of the secondary instance followed by a (standby) boot can corrupt this directory. Some configurations may do this by default, but .war extraction can easily be redirected to a local cache (ideally SSD for better Jenkins core I/O) on the container/VM’s local filesystem with the JENKINS_ARGS properties --webroot=$LOCAL_FILESYSTEM/war --pluginroot=$LOCAL_FILESYSTEM/plugins. For example, on Debian installations, where $NAME refers to the name of the Jenkins instance: --webroot=/var/cache/$NAME/war --pluginroot=/var/cache/$NAME/plugins

$JENKINS_HOME is read intensively during the start-up. If bandwidth to your shared storage is limited, you’ll see the most impact in startup performance. Large latency causes a similar issue, but this can be mitigated somewhat by using a higher value in the bootup concurrency by a system property -Djenkins.InitReactorRunner.concurrency=8.

Because we are using a proxy to route requests to the active instance, it will be necessary to ensure that the remoting channel can be established between client masters and the CloudBees Jenkins Operations Center server. This means that in an HA configuration, each instance must set the system property JENKINS_OPTS=-Dhudson.TcpSlaveAgentListener.hostName= equal to a hostname or IP address of that instance that can be resolved and connected to by all client masters and build agents which use the JNLP connection protocol. This requires more exposure of CloudBees Jenkins Operations Center servers, as all instances must be addressable by all client masters.

Because we are configuring HTTPS traffic, it is necessary to update the property JAVA_ARGS and add "-Djavax.net.ssl.trustStore=path/to/jenkins-truststore.jks -Djavax.net.ssl.trustStorePassword=changeit". To debug SSL issues, add "-Djavax.net.debug=all". For further information, see Oracle Debugging SSL/TLS Connections and How to install a new SSL certificate.
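
Taken together, the settings from the preceding paragraphs might be sketched as the following fragment for Alpha. It assumes Debian-style packaging, where /etc/default/jenkins is a shell fragment that defines NAME and passes JAVA_ARGS to the JVM; the file name and variable names can differ for other packages. The hostname alpha.example.com and the helper variable HA_NODE_HOSTNAME are hypothetical. The text above names JENKINS_OPTS for the hostName property; the sketch groups all -D system properties in JAVA_ARGS, which is an assumption about how the package launches the JVM.

```shell
# /etc/default/jenkins on Alpha (Debian-style packaging assumed).
# NAME is assumed to be defined earlier in the packaged file.

# Shared NFS mount, identical on Alpha and Beta.
JENKINS_HOME=/var/lib/jenkins

# Hypothetical hostname that client masters and agents can resolve for THIS node.
HA_NODE_HOSTNAME=alpha.example.com

# JVM system properties: startup concurrency, JNLP hostname, truststore.
JAVA_ARGS="-Djenkins.InitReactorRunner.concurrency=8"
JAVA_ARGS="$JAVA_ARGS -Dhudson.TcpSlaveAgentListener.hostName=$HA_NODE_HOSTNAME"
JAVA_ARGS="$JAVA_ARGS -Djavax.net.ssl.trustStore=path/to/jenkins-truststore.jks"
JAVA_ARGS="$JAVA_ARGS -Djavax.net.ssl.trustStorePassword=changeit"

# Keep the exploded .war and plugins on local disk, not on the shared mount.
JENKINS_ARGS="--webroot=/var/cache/$NAME/war --pluginroot=/var/cache/$NAME/plugins"
```

On Beta the fragment would be the same except for the hostname value, since each node must advertise its own address.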

Tip: When using the RedHat package (RPM), it is highly recommended to set the option JENKINS_INSTALL_SKIP_CHOWN to true in /etc/sysconfig/jenkins after the installation.

This option prevents future upgrades from applying a chown to your JENKINS_HOME folder, which may take a lot of time (especially in an HA setup, where JENKINS_HOME is on a remote file system using NFS and would be updated by the upgrade of each of the two nodes).

Install CloudBees Jenkins Operations Center HA monitor tool

Next, we set up a monitoring service to ensure the HA Fail-over completes. To do this, log on to Alpha and install the jenkins-ha-monitor package. This monitoring program watches Jenkins as root, and when the role transition occurs, it’ll execute the promotion script or the demotion script.

The monitor tool is packaged into a single jar file that can be executed as java -jar jenkins-ha-monitor.jar. For easier installation, it is also packaged as the jenkins-ha-monitor RPM/DEB packages, available from the jenkins-ha-monitor section of the download site.

Install it like any other RPM file, for example:

sudo rpm -i jenkins-ha-monitor-<version>.noarch.rpm

To uninstall it, run:

sudo rpm -e jenkins-ha-monitor

The configuration file that is read when the service starts is located at /etc/sysconfig/jenkins-ha-monitor. By default, the CloudBees Jenkins Enterprise HA monitor tool logs to /var/log/jenkins/jenkins-ha-monitor.log.

Configure HAProxy

Let’s expand on this setup further by introducing an external load balancer and reverse proxy that receives traffic from users, then directs it to the active primary JVM. HAProxy can be installed on most Linux systems via native packages, such as apt-get install haproxy or yum install haproxy; however, CloudBees recommends that HAProxy version 1.6 (or newer) be installed. For CloudBees Jenkins Operations Center HA, the configuration file (normally /etc/haproxy/haproxy.cfg) should look like the following:


global
        log 127.0.0.1 local0
        log 127.0.0.1 local1 notice
        maxconn 4096
        user haproxy
        group haproxy

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private

        # Default ciphers to use on SSL-enabled listening sockets.
        # For more information, see ciphers(1SSL). This list is from:
        # https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
        ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
        ssl-default-bind-options no-sslv3
        tune.ssl.default-dh-param 2048

defaults
        log global
        option http-server-close
        option log-health-checks
        option dontlognull
        timeout http-request 10s
        timeout queue 1m
        timeout connect 5000
        timeout client 50000
        timeout server 50000
        timeout http-keep-alive 10s
        timeout check 500
        default-server inter 5s downinter 500 rise 1 fall 1

# redirect HTTP to HTTPS
listen http-in
        bind *:80
        mode http
        redirect scheme https code 301 if !{ ssl_fc }

listen https-in
        # change the location of the pem file
        bind *:443 ssl crt /etc/ssl/certs/your.pem
        mode http
        option httplog
        option httpchk HEAD /ha/health-check
        option forwardfor
        option http-server-close
        # alpha and beta should be replaced with hostname (or ip) and port
        # 8888 is the default for CJOC, 8080 is the default for Client Masters
        server alpha alpha:8888 check
        server beta beta:8888 check
        reqadd X-Forwarded-Proto:\ https

listen ssh
        bind 0.0.0.0:2022
        mode tcp
        option tcplog
        option httpchk HEAD /ha/health-check
        # alpha and beta should be replaced with hostname (or ip) and port
        # 8888 is the default for CJOC, 8080 is the default for Client Masters
        server alpha alpha:2022 check port 8888
        server beta beta:2022 check port 8888

The global section contains stock settings, and the defaults section has been configured with typical timeout settings. You must configure the listen blocks to your particular environment. To determine the port configuration of an existing installation, please reference the How to add Java arguments to Jenkins guide.

The “Default SSL material locations” settings and the bind *:443 ssl crt /etc/ssl/certs/your.pem line define the paths to the necessary SSL key files.

Together, the https-in and http-in sections determine the bulk of the configuration necessary for CloudBees Jenkins Operations Center routing. These listen blocks tell HAProxy to forward traffic to two servers, alpha and beta, and periodically check their health by sending a HEAD request to /ha/health-check. Unlike active nodes, standby nodes do not respond positively to this health check, and that’s how HAProxy determines traffic routing.
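
To see what HAProxy sees, the health check can be reproduced by hand. The sketch below is illustrative only: the function names are hypothetical, the hostnames and port 8888 follow the example configuration, and it assumes curl is installed. It relies on HAProxy's documented httpchk behaviour of treating 2xx and 3xx responses as a passing check; anything else, including no response at all, counts as a failure.

```shell
#!/bin/sh
# Map an HTTP status code to the state HAProxy would assign to the server.
# 2xx and 3xx count as a passing check; everything else as a failing one.
health_state() {
    case "$1" in
        2??|3??) echo "up" ;;
        *)       echo "down" ;;
    esac
}

# Probe one node with the same kind of HEAD request HAProxy sends.
# curl reports the status code as 000 when no HTTP response was received.
probe_node() {
    node="$1"
    port="${2:-8888}"
    code="$(curl -s -o /dev/null -I -m 5 -w '%{http_code}' \
        "http://$node:$port/ha/health-check" || true)"
    echo "$node $(health_state "$code")"
}
```

Calling probe_node alpha and probe_node beta from the HAProxy host should report exactly one node as up when the cluster is healthy.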

The ssh services section is configured to forward tcp requests. For these services, haproxy uses the same health check on the application port. This ensures that all services fail over together when the health check fails.

Testing and Troubleshooting your HA Installation

To test for cluster membership roles, place an executable script in $JENKINS_HOME/sanity-check.sh that gets run before a node assumes the primary role, as well as when there’s a change in the cluster membership. The use case is for you to make sure that the node should really proceed to act as the primary. If the script exits with 0, the node will boot up as the primary node, and if it exits with non-zero, the node will not act as the primary node.

When two nodes that form a cluster lose contact, each node will assume that the other has died, and will take the responsibility as the primary node. This is called a “split brain” problem. This is problematic as you end up having two independently acting Jenkins masters. A similar problem can happen if one node in the cluster is severely stressed under load. The sanity check script provides users an opportunity to apply some heuristics to reduce the likelihood of this problem.

For example, you can check the availability of $JENKINS_HOME, whether you can ping the router, or whether the system load is reasonably low. If you are using the CloudBees Jenkins Enterprise HA monitor tool to control resources, the sanity check script might run before the HA monitor tool has completed running its promotion script. If some of the sanity checks require the promotion to be complete, please retry the check a few times.
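
A minimal sketch of such a script follows. It implements only the first heuristic (availability of $JENKINS_HOME) and retries a few times to allow a promotion script to finish; the function name, the retry count and the delay are assumptions chosen for illustration.

```shell
#!/bin/sh
# Return 0 if the given directory exists and is writable, retrying a few
# times so that a promotion script still in progress has a chance to finish.
sanity_check() {
    home="$1"
    attempts="${2:-3}"      # number of tries (assumed default)
    delay="${3:-5}"         # seconds between tries (assumed default)
    i=1
    while [ "$i" -le "$attempts" ]; do
        if [ -d "$home" ] && [ -w "$home" ]; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# A real $JENKINS_HOME/sanity-check.sh would end with something like:
#   sanity_check "$JENKINS_HOME" || exit 1
```

A non-zero exit keeps the node from acting as primary, so additional checks (router reachability, system load) should be added conservatively to avoid blocking a legitimate fail-over.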

If you are having issues with HAProxy connectivity, modify the defaults section of the HAProxy configuration to troubleshoot with these additional options:

defaults
        log     global
        # The following log settings are useful for debugging
        # Tune these for production use
        option  logasap
        option  http-server-close
        option  redispatch
        option  abortonclose
        option  log-health-checks
        mode    http
        option  dontlognull
        retries 3
        maxconn         2000
        timeout         http-request    10s
        timeout         queue           1m
        timeout         connect         5000
        timeout         client          50000
        timeout         server          50000
        timeout         http-keep-alive 10s
        timeout         check           500
        default-server  inter 5s downinter 500 rise 1 fall 1

For additional troubleshooting tips please see: https://go.cloudbees.com/docs/cloudbees-documentation/cje-user-guide/chapter-ha.html#chapter-ha_ha-sect-troubleshooting
