High Availability Installations Troubleshooting

Symptoms

  • HA is not working as expected.

Diagnosis/Treatment

This article gives simple steps for troubleshooting High Availability installations. If these troubleshooting steps do not resolve your issue, follow the steps outlined in the “What should I attach in a CloudBees support ticket?” section to gather the most relevant information for the Support Team to resolve your issue as quickly as possible.

Troubleshooting

Versions

Ensure that:

  • All instances must run the same Jenkins version
  • All instances must run using the same JDK version

CJOC and CM nodes do not form a cluster

See the JGroups troubleshooting guide for typical problems. When nodes don’t form a cluster, it is normally either because the protocol needs additional configuration, or there’s a problem in the network configuration of the operating system or the network equipment (e.g., nodes cannot “see” one another via TCP).

Ensure that all the instances are using the same Unix user:group

Ensure that the userID used to run the Jenkins process is the same on the 3 of the servers: NFS, Jenkins primary + Jenkins failover.

Run on the three instances the following commands and ensure that the ID used is the same for the user. On this example jenkins:jenkins is used, which is the user:group created by default if you installed Jenkins via Unix package.

id -u jenkins

In case the ID is not the same on the three instances, the following Unix commands can be used to create a jenkins:jenkins - which is needed in the NFS instance and to modify the ID of the user and group.

sudo useradd jenkins
sudo usermod -u [jenkins UID] jenkins
sudo groupadd jenkins
sudo groupmod -g [jenkins GID] jenkins

Ensure that the owner of the $JENKINS_HOME is the user which run the Jenkins process

The user which owns the Jenkins process should be the owner of the $JENKINS_HOME. Run ls -la $JENKINS_HOME to check this. If you need to change the owner, the following Unix command can be used.

sudo chown -R jenkins:jenkins $JENKINS_HOME

Customize JGroups in case your instances are running behind a firewall

By default, the CloudBees HA plugin uses a random port to communication, so if the instances are running behind a firewall you must customize the JGroups by placing the following snipped in $JENKINS_HOME/jgroups.xml

The following elements needs to be customized:

  • bind_port
  • port_range
  • diagnostics_port

The example below means that you need to open the following ports on the firewall: 56736, 56737, 56738, 56739, 56740 and 35483.

bind_port="56736"
port_range="5"
diagnostics_port="35483"
<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.1.xsd">
    <TCP loopback="false"
         recv_buf_size="${tcp.recv_buf_size:128K}"
         send_buf_size="${tcp.send_buf_size:128K}"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         use_send_queues="true"
         sock_conn_timeout="300"

         bind_port="56736"
         port_range="5"
         diagnostics_port="35483"

         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>

    <CENTRAL_LOCK />

    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING timeout="3000"
             location="${HA_JGROUPS_DIR}"
             num_initial_members="3"/>
    <MERGE2 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                   discard_delivered_msgs="true"/>
    <UNICAST />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>

Bind JGroups with the right network interface

In case that any of the two instances which tries to form a cluster has more than one network interface you must tell JGroups which one should be used to listen for packets. The following two Java arguments are used for this purpose:

-Djgroups.bind_addr=<IP_ADDRESS>
-Djava.net.preferIPv4Stack=true

where <IP_ADDRESS> is the IP address of the network interface in the instance which should be able to reach out the other node in the cluster.

NOTE: In case of any issue with the CloudBees HA plugin these arguments should be set-up to ensure JGroups is bound to the right interface.

If the promotion works on the troubleshooter and not in the instances

If the promotion works on the [troubleshooter](REQUIRED DATA) and not in the instances the following experiment is recommended:

  1. Ensure that the latest version of the cloudbees-ha plugin is installed on both instances.
  2. Stop the service of the Jenkins instances
  3. Add the following Java argument -Dcom.cloudbees.jenkins.ha.level=ALL in both instances so all the JGroups and HA logs are exposed.
  4. Start one instance
  5. Wait 10 seconds or so
  6. Start the second instance
  7. Check the full logs of Jenkins instances that you can usually find under /var/log/jenkins

After the experiment you should remove the argument -Dcom.cloudbees.jenkins.ha.level=ALL you just added on both instances.

Have more questions? Submit a request

0 Comments

Please sign in to leave a comment.