How to troubleshoot High Availability installations?

Issue

  • How to troubleshoot HA issues and generate the necessary configuration details for CloudBees Support

Environment

  • CloudBees Jenkins Enterprise (CJE)
  • CloudBees Jenkins Operations Center (CJOC)

Resolution

This article gives simple steps for troubleshooting High Availability installations. If these troubleshooting steps do not resolve your issue, follow the steps outlined in the “What should I attach in a CloudBees support ticket?” section to gather the most relevant information for the Support Team to resolve your issue as quickly as possible.

Troubleshooting

Versions

Ensure that:

  • All instances run the same Jenkins version
  • All instances run the same JDK version

CJOC and CM nodes do not form a cluster

See the JGroups troubleshooting guide for typical problems. When nodes don’t form a cluster, it is normally either because the protocol needs additional configuration, or there’s a problem in the network configuration of the operating system or the network equipment (e.g., nodes cannot “see” one another via TCP).

Ensure that all the instances are using the same Unix user:group

Ensure that the user ID used to run the Jenkins process is the same on all three servers: the NFS server, the Jenkins primary, and the Jenkins failover.

Run the following command on each of the three servers and confirm that it returns the same ID everywhere. This example uses jenkins:jenkins, the user:group created by default when Jenkins is installed from a Unix package.

id -u jenkins
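
Because the group ID matters as well as the user ID, a small helper can print both so that the values from the three servers are easy to compare. The following is a minimal sketch in POSIX shell; the function names user_ids and ids_match are hypothetical and not part of any CloudBees tooling.

```shell
#!/bin/sh
# Print "uid:gid" for a user. Run this on each server and compare the output.
# The default user name "jenkins" matches the example in this article.
user_ids() {
    user="${1:-jenkins}"
    printf '%s:%s\n' "$(id -u "$user")" "$(id -g "$user")"
}

# Succeed only if every "uid:gid" value passed as an argument is identical.
ids_match() {
    first="$1"
    for value in "$@"; do
        [ "$value" = "$first" ] || return 1
    done
    return 0
}
```

For example, run user_ids jenkins on each server, then pass the three results to ids_match on any machine; a non-zero exit status means at least one server differs.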

If the IDs differ between the three servers, use the following Unix commands to create the jenkins:jenkins user and group (needed on the NFS server) and to set their IDs.

sudo useradd jenkins
sudo usermod -u [jenkins UID] jenkins
sudo groupadd jenkins
sudo groupmod -g [jenkins GID] jenkins

Ensure that the owner of $JENKINS_HOME is the user that runs the Jenkins process

The user that runs the Jenkins process should own $JENKINS_HOME. Run ls -la $JENKINS_HOME to check this. If you need to change the owner, use the following Unix command.

sudo chown -R jenkins:jenkins $JENKINS_HOME
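
Since ls -la only shows the top level of the directory, it can help to search the whole tree for entries owned by a different user. This is a minimal sketch using the standard find utility; the function name not_owned_by is hypothetical, and the limit of 20 lines is an arbitrary choice to keep the output short.

```shell
#!/bin/sh
# List up to 20 entries under a directory that are NOT owned by the given user.
# The user may be a name (for example "jenkins") or a numeric user ID.
# Empty output means everything under the directory belongs to that user.
not_owned_by() {
    dir="$1"
    user="$2"
    find "$dir" ! -user "$user" -print | head -n 20
}
```

For example, not_owned_by "$JENKINS_HOME" jenkins should print nothing once the ownership is correct.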

Customize JGroups if your instances are running behind a firewall

By default, the CloudBees HA plugin uses a random port for communication. If the instances are running behind a firewall, you must pin the ports by placing a customized JGroups configuration, like the one shown below, in $JENKINS_HOME/jgroups.xml.

The following attributes need to be customized:

  • bind_port
  • port_range
  • diagnostics_port

The example below means that you need to open the following ports on the firewall: 56736, 56737, 56738, 56739, 56740 and 35483.

bind_port="56736"
port_range="5"
diagnostics_port="35483"

A complete jgroups.xml with these values in place:

<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.1.xsd">
    <TCP loopback="false"
         recv_buf_size="${tcp.recv_buf_size:128K}"
         send_buf_size="${tcp.send_buf_size:128K}"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         use_send_queues="true"
         sock_conn_timeout="300"

         bind_port="56736"
         port_range="5"
         diagnostics_port="35483"

         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>

    <CENTRAL_LOCK />

    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING timeout="3000"
             location="${HA_JGROUPS_DIR}"
             num_initial_members="3"/>
    <MERGE2 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                   discard_delivered_msgs="true"/>
    <UNICAST />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>
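
To confirm which port values a deployed file actually contains before requesting firewall changes, the three attributes can be extracted with grep. This is a minimal sketch; the function name jgroups_port_settings is hypothetical, and the pattern is a plain text match, so it would also report attributes that appear inside XML comments.

```shell
#!/bin/sh
# Print the firewall-relevant attributes found in a jgroups.xml file,
# one per line, in the order they appear in the file.
jgroups_port_settings() {
    grep -Eo '(bind_port|port_range|diagnostics_port)="[0-9]+"' "$1"
}
```

For example, jgroups_port_settings "$JENKINS_HOME/jgroups.xml" run against the file above prints the bind_port, port_range and diagnostics_port settings.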

Bind JGroups to the right network interface

If either of the two instances that try to form a cluster has more than one network interface, you must tell JGroups which one to use to listen for packets. The following two Java arguments are used for this purpose:

-Djgroups.bind_addr=<IP_ADDRESS>
-Djava.net.preferIPv4Stack=true

where <IP_ADDRESS> is the IP address of the network interface on the instance that can reach the other node in the cluster.

NOTE: If you have any issue with the CloudBees HA plugin, set these arguments to ensure that JGroups is bound to the right interface.
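
On Linux hosts with the iproute2 tools installed, the candidate addresses can be listed before choosing one. The sketch below assumes the one-line output format of ip -4 -o addr show; the function names list_ipv4_candidates and jgroups_bind_args are hypothetical helpers, not part of the HA plugin.

```shell
#!/bin/sh
# Read `ip -4 -o addr show` output on stdin and print "<interface> <address>"
# for every IPv4 address, with the prefix length removed.
list_ipv4_candidates() {
    awk '$3 == "inet" { split($4, cidr, "/"); print $2, cidr[1] }'
}

# Print the two Java arguments from this article for the chosen address.
jgroups_bind_args() {
    printf '%s %s\n' "-Djgroups.bind_addr=$1" "-Djava.net.preferIPv4Stack=true"
}
```

For example, ip -4 -o addr show | list_ipv4_candidates lists the interfaces, and jgroups_bind_args followed by the chosen address prints the arguments to add to the Jenkins Java options.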

Generating HA Configuration for CloudBees Support

Run the troubleshooter application to get more details

To simplify troubleshooting network issues, we have published the troubleshooter program. This program runs the same lower-level stack as Jenkins HA, and thus exercises the network in exactly the same way. When you type text on stdin and press Enter, you should see the text echoed on all nodes of the cluster (including the node on which you typed it).

A good first step to diagnose the network problem is to run two instances of the troubleshooter program on the same host and see if they can communicate with each other. Then do the same on all the hosts. In this way, you can further isolate the problem.

Run the following command on both instances to determine whether the primary and backup nodes are selected correctly (replace $JENKINS_HOME with the corresponding value):

java -DJENKINS_HOME=$JENKINS_HOME -DHA_JGROUPS_DIR=$JENKINS_HOME/jgroups/ -Djgroups.bind_addr=<IP_ADDRESS>  -Djava.net.preferIPv4Stack=true -jar troubleshooter-<VERSION>-jar-with-dependencies.jar

If the promotion process doesn’t work correctly (for example, both nodes run as the primary node), run the troubleshooter application in logging mode to expose the problem.

java -DJENKINS_HOME=$JENKINS_HOME -DHA_JGROUPS_DIR=$JENKINS_HOME/jgroups/ -Djgroups.bind_addr=<IP_ADDRESS>  -Dlogging.org.jgroups=ALL -Dlogging.com.cloudbees.jenkins.ha=ALL -Djava.net.preferIPv4Stack=true -Dha-troubleshooter.filelogging -jar troubleshooter-<VERSION>-jar-with-dependencies.jar

Note about file logging

The -Dha-troubleshooter.filelogging option enables file logging with log rotation.
By default, the logs rotate across 10 files of 10 MB each, about 100 MB in total.
This lets the tool run in the background while you wait for the issue to reoccur.

If you need to cover a longer period of time, you can also pass -Dha-troubleshooter.filelogging.count=NN to raise the number of files from its default of 10.
For example, to cover a whole weekend, you could use -Dha-troubleshooter.filelogging.count=100 to rotate across 100 files of 10 MB each, consuming at most 1 GB of disk space.
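
The disk space needed for a given file count is simply the count multiplied by the size of each file. The sketch below makes that arithmetic explicit; the function name max_log_usage_mb is hypothetical, and the default of 10 MB per file is an assumption taken from the sample startup output in this article.

```shell
#!/bin/sh
# Upper bound on disk usage in MB: number of rotated files times the size
# of each file. The second argument defaults to 10 (MB per file).
max_log_usage_mb() {
    count="$1"
    size_mb="${2:-10}"
    echo $(( count * size_mb ))
}
```

For example, max_log_usage_mb 100 reports the upper bound for the weekend scenario above, so you can check the free space on the target disk before starting the tool.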

When the tool starts, it displays the values of these settings so that you can confirm they were taken into account. For example:

Logs File Rotation enabled: # of files: 10, max size per file: 10000000, pattern: ha-troubleshooting.abcd.%u.log

The tool puts four random hexadecimal digits in the file name to avoid clashing with existing files, for example when the tool runs on several nodes of an HA cluster.

If the promotion works in the troubleshooter but not in the instances

In this case, the following experiment is recommended:

  1. Ensure that the latest version of the cloudbees-ha plugin is installed on both instances.
  2. Stop the Jenkins service on both instances.
  3. Add the Java argument -Dcom.cloudbees.jenkins.ha.level=ALL on both instances so that all the JGroups and HA logs are exposed.
  4. Start one instance.
  5. Wait 10 seconds or so.
  6. Start the second instance.
  7. Check the full logs of the Jenkins instances, which you can usually find under /var/log/jenkins.

After the experiment you should remove the argument -Dcom.cloudbees.jenkins.ha.level=ALL you just added on both instances.

What should I attach in a CloudBees support ticket?

Gather the following information and attach it to the support ticket:

  1. Is there any firewall (or any other interposed network device) in the middle of both nodes?
  2. Do you have several network interfaces on those instances?
  3. Support bundles from both CJOC and CM
  4. The content inside of $JENKINS_HOME/jgroups/
  5. The .txt files produced in the above step
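
To attach the content of $JENKINS_HOME/jgroups/ as a single file, the directory can be packed into an archive with the standard tar utility. This is a minimal sketch; the function name bundle_jgroups_dir and the output file name are hypothetical.

```shell
#!/bin/sh
# Pack the jgroups directory found under a Jenkins home into a .tar.gz archive.
# $1: path to the Jenkins home, $2: path of the archive to create.
bundle_jgroups_dir() {
    jenkins_home="$1"
    out="$2"
    tar -czf "$out" -C "$jenkins_home" jgroups
}
```

For example, bundle_jgroups_dir "$JENKINS_HOME" /tmp/jgroups-node1.tar.gz creates an archive that can be attached to the ticket; run it on each node with a different output name.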