Upgrade guide for instances running High Availability previous to 2.249.2.3

Issue Summary

Starting in 2.249.2.3 the CloudBees High Availability plugin has been upgraded to 4.24, which incorporates a JGroups dependency update to 4.0.2.Final+. This update was performed to solve a previously known memory leak issue.

This change requires CloudBees CI administrators to recreate the JGroups configuration if it was previously customized, because some protocols and attributes have changed between JGroups 3.x and JGroups 4.x:

  • MERGE2 -> MERGE3
  • UNICAST -> UNICAST3
  • Various TCP_NIO2 options were removed, for example oob_thread_pool.*, timer.*, thread_pool.queue_* and thread_pool.rejection_policy
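
As a quick pre-upgrade check, the changes listed above can be searched for in an existing file. The following is a minimal sketch, not an official tool: the function name find_jgroups3_entries is hypothetical, and the pattern list is an assumption derived from the stack trace in this article and from comparing Appendix A with Appendix B, so it may not cover every removed option.

```shell
#!/bin/sh
# find_jgroups3_entries is a hypothetical helper name.
# It prints (with line numbers) the lines of a jgroups.xml that use JGroups 3.x-only
# entries, and returns 0 if any were found, non-zero otherwise.
find_jgroups3_entries() {
  # Assumed pattern list: MERGE2, UNICAST (but not UNICAST3), and the TCP_NIO2 and
  # CHMOD_FILE_PING attributes that no longer appear in the 4.x reference file.
  grep -nE '<MERGE2|<UNICAST[ />]|oob_thread_pool\.|timer\.|timer_type|max_bundle_timeout|thread_pool\.(queue_|rejection_policy)|remove_all_files_on_view_change' "$1"
}
```

For example, find_jgroups3_entries "$JENKINS_HOME/jgroups.xml" prints nothing and returns a non-zero status when none of these entries are present.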

If the recommended migration updates are not completed, then the instance will fail to start with a stack trace, as illustrated in the following examples:

Typical stack trace when CloudBees CI fails to start because of a non-updated $JENKINS_HOME/jgroups.xml file

2020-10-14 10:30:14.066+0000 [id=1] SEVERE  c.c.jenkins.ha.HASwitcher#reportFallback: CloudBees CI Operations Center appears to have failed to boot. If this is a problem in the HA feature, you can disable HA by specifying JENKINS_HA=false as environment variable
java.lang.IllegalArgumentException: JGRP000001: configuration error: the following properties in TCP_NIO2 are not recognized: {oob_thread_pool.enabled=true, timer.keep_alive_time=3000, thread_pool.queue_enabled=false, thread_pool.queue_max_size=100, oob_thread_pool.queue_max_size=100, oob_thread_pool.keep_alive_time=5000, oob_thread_pool.min_threads=1, oob_thread_pool.queue_enabled=false, oob_thread_pool.max_threads=8, oob_thread_pool.rejection_policy=discard, thread_pool.rejection_policy=discard, timer.queue_max_size=500, timer.min_threads=4, max_bundle_timeout=30, timer.max_threads=10, timer_type=new}
        at org.jgroups.stack.Configurator.createLayer(Configurator.java:278)
        at org.jgroups.stack.Configurator.createProtocols(Configurator.java:215)
        at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:82)
        at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:49)
        at org.jgroups.stack.ProtocolStack.setup(ProtocolStack.java:475)
        at org.jgroups.JChannel.init(JChannel.java:965)
        at org.jgroups.JChannel.<init>(JChannel.java:148)
        at org.jgroups.JChannel.<init>(JChannel.java:106)
        at com.cloudbees.jenkins.ha.AbstractJenkinsSingleton.createChannel(AbstractJenkinsSingleton.java:143)
        at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:86)
Caused: java.lang.Error: Failed to form a cluster
	at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:179)

Typical stack trace when CloudBees CI fails to start because JGroups was customized through the GUI and the instance is running 2.249.2.3

2020-10-12 21:25:06.280+0000 [id=1] SEVERE winstone.Logger#logInternal: Container startup failed
java.lang.IllegalArgumentException: JGRP000001: configuration error: the following properties in com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING are not recognized: {remove_old_files_on_view_change=true}
	at org.jgroups.stack.Configurator.createLayer(Configurator.java:278)
	at org.jgroups.stack.Configurator.createProtocols(Configurator.java:215)
	at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:82)
	at org.jgroups.stack.Configurator.setupProtocolStack(Configurator.java:49)
	at org.jgroups.stack.ProtocolStack.setup(ProtocolStack.java:475)
	at org.jgroups.JChannel.init(JChannel.java:965)
	at org.jgroups.JChannel.<init>(JChannel.java:148)
	at org.jgroups.JChannel.<init>(JChannel.java:122)
	at com.cloudbees.jenkins.ha.AbstractJenkinsSingleton.createChannel(AbstractJenkinsSingleton.java:176)
	at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:86)
Caused: java.lang.Error: Failed to form a cluster
	at com.cloudbees.jenkins.ha.singleton.HASingleton.start(HASingleton.java:179)

Environment

Affected instances are those which are using the CloudBees High Availability plugin and are customizing JGroups by placing a jgroups.xml file inside the $JENKINS_HOME directory.
Note: If High Availability has been configured only via the GUI under Manage Jenkins -> Configure System -> High Availability Configuration, then the instance will not be affected and you can upgrade to version 2.249.2.4 or higher without changes.

Resolution

Restoring service quickly while you work on migrating your jgroups.xml

To restore service quickly while you work on migrating your jgroups.xml file, add the Java argument -Dcom.cloudbees.jenkins.ha=false to exactly one master and restart it. This argument disables High Availability, so the instance should start without issues. While this workaround is in place, only the master started with the argument is available and running.
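
How the Java argument is passed depends on how the master is started, which this article does not specify. The following is a minimal sketch, assuming the start script reads its JVM options from a JAVA_OPTS variable; both that variable and the helper name with_ha_disabled are assumptions, and only the flag itself comes from this article.

```shell
#!/bin/sh
# with_ha_disabled is a hypothetical helper: it appends the flag from this article
# to a string of JVM options unless the flag is already present.
with_ha_disabled() {
  flag='-Dcom.cloudbees.jenkins.ha=false'
  case " $1 " in
    *" $flag "*) printf '%s\n' "$1" ;;        # already set, leave unchanged
    "  ")        printf '%s\n' "$flag" ;;     # no options yet
    *)           printf '%s\n' "$1 $flag" ;;  # append to the existing options
  esac
}

# Assumption: the master is started from a script that reads JAVA_OPTS.
JAVA_OPTS=$(with_ha_disabled "${JAVA_OPTS:-}")
export JAVA_OPTS
```

Apply this on one master only, and remove the flag again once the jgroups.xml migration is complete.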

Upgrading from versions older than 2.249.2.3 to 2.249.2.3 or higher

Instances with a customized JGroups file that are upgrading from a version older than 2.249.2.3 must update the current JGroups customization (by placing an updated jgroups.xml file inside the $JENKINS_HOME directory, as explained below).

JGroups customization performed through $JENKINS_HOME/jgroups.xml

To migrate the JGroups configuration, you must determine what customization has previously been applied by comparing the jgroups.xml file with the reference file included in Appendix B: Example JGroups customization previous to 2.249.2.3.
On Unix-like systems, the diff tool is an easy way to see which customizations have been made.
Copy the contents of the JGroups file from Appendix B into a file named jgroups-3-base.xml, then run the following command to show which lines have changed:

diff --color --ignore-all-space --unified=500 jgroups-3-base.xml jgroups.xml

Removed lines start with a - symbol and added lines start with a + symbol. A changed line appears as a removed line followed by an added line.
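
When the file is long, it can help to reduce the diff output to only the changed lines. The following is a minimal sketch; the function name list_customizations is hypothetical, and the --color flag is omitted because the output is filtered rather than displayed directly.

```shell
#!/bin/sh
# list_customizations is a hypothetical helper name.
# $1 = reference file (for example jgroups-3-base.xml), $2 = customized jgroups.xml
list_customizations() {
  # -w and -U 500 are the short forms of --ignore-all-space and --unified=500.
  # Keep only lines starting with + or -, then drop the ---/+++ file headers.
  # The pipeline returns non-zero when nothing is left, hence the "|| true".
  diff -w -U 500 "$1" "$2" | grep -E '^[+-]' | grep -vE '^(---|\+\+\+) ' || true
}
```

For example, list_customizations jgroups-3-base.xml jgroups.xml prints nothing when the two files differ only in whitespace.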

Depending on the scope of changes identified above, different paths are recommended. If only ports have been configured, we recommend the jgroups.xml file be removed from $JENKINS_HOME, a single instance started, and HA ports configured inside the GUI; otherwise, the old configuration entries will need to be mapped to a new configuration format.

Only ports have been configured if the only entries that differ are config/TCP_NIO2/bind_port, config/TCP_NIO2/port_range and config/TCP_NIO2/diagnostics_port.
In this case, the output of the diff command above looks like the following:

--- jgroups-3-base.xml  2020-10-14 14:40:24.918727200 +0100
+++ jgroups.xml 2020-10-14 11:09:19.640017900 +0100
@@ -1,68 +1,68 @@
 <!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
 <config xmlns="urn:org:jgroups"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.1.xsd">
     <TCP_NIO2
          recv_buf_size="${tcp.recv_buf_size:128K}"
          send_buf_size="${tcp.send_buf_size:128K}"
          max_bundle_size="64K"
          max_bundle_timeout="30"
          sock_conn_timeout="1000"

-         bind_port="${HA_BIND_PORT}"
-         port_range="${HA_PORT_RANGE}"
-         diagnostics_port="${HA_DIAGNOSTIC_PORT}"
+         bind_port="56736"
+         port_range="5"
+         diagnostics_port="35483"

          timer_type="new"
          timer.min_threads="4"
          timer.max_threads="10"
          timer.keep_alive_time="3000"
          timer.queue_max_size="500"

          thread_pool.enabled="true"
          thread_pool.min_threads="1"
          thread_pool.max_threads="10"
          thread_pool.keep_alive_time="5000"
          thread_pool.queue_enabled="false"
          thread_pool.queue_max_size="100"
          thread_pool.rejection_policy="discard"

          oob_thread_pool.enabled="true"
          oob_thread_pool.min_threads="1"
          oob_thread_pool.max_threads="8"
          oob_thread_pool.keep_alive_time="5000"
          oob_thread_pool.queue_enabled="false"
          oob_thread_pool.queue_max_size="100"
          oob_thread_pool.rejection_policy="discard"/>

     <CENTRAL_LOCK />

     <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING
              location="${HA_JGROUPS_DIR}"
              remove_old_coords_on_view_change="true"
              remove_all_files_on_view_change="true"/>
     <MERGE2 max_interval="30000"
             min_interval="10000"/>
     <FD_SOCK/>
     <FD timeout="3000" max_tries="3" />
     <VERIFY_SUSPECT timeout="1500"  />
     <BARRIER />
     <pbcast.NAKACK2 use_mcast_xmit="false"
                    discard_delivered_msgs="true"/>
     <UNICAST />
     <!--
       When a new node joins a cluster, initial message broadcast doesn't necessarily seem
       to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
       this dropped transmission and cause a retransmission.
     -->
     <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                    max_bytes="4M"/>
     <pbcast.GMS print_local_addr="true" join_timeout="3000"
                 view_bundling="true"
                 max_join_attempts="5"/>
     <MFC max_credits="2M"
          min_threshold="0.4"/>
     <FRAG2 frag_size="60K"  />
     <pbcast.STATE_TRANSFER />
     <!-- pbcast.FLUSH  /-->
 </config>
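
The "only ports have been configured" check described above can also be scripted. The following is a minimal sketch, assuming each attribute sits on its own line as in the reference file; the function name only_ports_changed is hypothetical.

```shell
#!/bin/sh
# only_ports_changed is a hypothetical helper name.
# $1 = reference file (Appendix B contents), $2 = customized jgroups.xml
# Returns 0 if every changed line is one of the three port attributes,
# and 1 otherwise (including when the files do not differ at all).
only_ports_changed() {
  # Changed lines from the unified diff, without the ---/+++ file headers.
  changed=$(diff -w -U 500 "$1" "$2" | grep -E '^[+-]' | grep -vE '^(---|\+\+\+) ' || true)
  [ -n "$changed" ] || return 1
  # Is there any changed line that is NOT one of the port attributes?
  if printf '%s\n' "$changed" |
     grep -qvE '^[+-][[:space:]]*(bind_port|port_range|diagnostics_port)='; then
    return 1
  fi
  return 0
}
```

If only_ports_changed jgroups-3-base.xml jgroups.xml succeeds, the GUI migration path applies; otherwise the fuller migration to the newer configuration format is needed.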

Migrating from a customized JGroups configuration file to in-product GUI configuration

Note: This migration requires 2.249.2.4 or higher.

Starting with CloudBees High Availability version 4.8, the ports can be configured via the UI. Open Manage Jenkins > Configure System and locate the High Availability Configuration section. Enable the customization, specify the ports, and restart both nodes in the HA singleton.

Add port customization from a customized jgroups.xml file to the GUI as follows:

  • Bind Port in the GUI corresponds to bind_port in the jgroups.xml
  • Port Range in the GUI corresponds to port_range in the jgroups.xml
  • Diagnostics Port in the GUI corresponds to diagnostics_port in the jgroups.xml

The following example shows what would be set if your jgroups.xml had the following entries:

  • bind_port="1111"
  • port_range="3"
  • diagnostics_port="2222"

example UI configuration
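
To read the three values out of an existing file before entering them in the GUI, something like the following can be used. This is a minimal sketch, assuming each attribute appears once and on a single line as in the reference files; the function names are hypothetical.

```shell
#!/bin/sh
# print_gui_field and print_ha_ports are hypothetical helper names.
# print_gui_field FILE ATTRIBUTE LABEL prints "LABEL: value" for one attribute.
print_gui_field() {
  value=$(sed -n 's/.*'"$2"'="\([^"]*\)".*/\1/p' "$1" | head -n 1)
  printf '%s: %s\n' "$3" "$value"
}

# Print the jgroups.xml port attributes under their GUI field names.
print_ha_ports() {
  print_gui_field "$1" bind_port 'Bind Port'
  print_gui_field "$1" port_range 'Port Range'
  print_gui_field "$1" diagnostics_port 'Diagnostics Port'
}
```

With the example entries above, print_ha_ports jgroups.xml prints Bind Port: 1111, Port Range: 3 and Diagnostics Port: 2222, one per line.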

Migrating from a customized JGroups configuration to the newer format JGroups configuration

If more values have changed than just the ports, then a fuller migration will be required.

In the majority of cases, the only additional values that have been changed are the timeouts, as in the following example:

<FD timeout="20000" max_tries="3" />
<VERIFY_SUSPECT timeout="5000"  />

To migrate the values, first create a backup of your old jgroups.xml configuration (mv jgroups.xml jgroups.xml.pre-migration), then place a copy of the new jgroups.xml configuration from Appendix A: Example JGroups customization for version 2.249.2.3 or higher in $JENKINS_HOME and edit it as follows:
1. Open the jgroups.xml file in your text editor.
2. Replace ${HA_BIND_PORT} with the value from your old configuration file.
3. Replace ${HA_PORT_RANGE} with the value from your old configuration file.
4. Replace ${HA_DIAGNOSTIC_PORT} with the value from your old configuration file.
5. Replace the <FD timeout="3000" max_tries="3" /> entry with the corresponding line from your old configuration file.
6. Replace the <VERIFY_SUSPECT timeout="1500" /> entry with the corresponding line from your old configuration file.
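
The numbered steps above can be sketched as a script. This is an illustration under stated assumptions rather than a supported tool: the function names are hypothetical, each attribute and the FD and VERIFY_SUSPECT elements are assumed to sit on a single line as in the reference files, and the result should be reviewed by hand before restarting.

```shell
#!/bin/sh
# old_attr, old_line and migrate_jgroups are hypothetical helper names.

# Value of attribute $2 in file $1 (first occurrence).
old_attr() {
  sed -n 's/.*'"$2"'="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# First line of file $1 that opens element $2 (so FD does not match FD_SOCK).
old_line() {
  grep -E "<$2[[:space:]]" "$1" | head -n 1
}

# $1 = backed-up old file, $2 = copy of the Appendix A file, $3 = output file
migrate_jgroups() {
  old=$1; template=$2; out=$3
  bind=$(old_attr "$old" bind_port)
  range=$(old_attr "$old" port_range)
  diag=$(old_attr "$old" diagnostics_port)
  fd=$(old_line "$old" FD)
  vs=$(old_line "$old" VERIFY_SUSPECT)
  # Steps 2-4: substitute the port placeholders.
  # Steps 5-6: swap in the old FD and VERIFY_SUSPECT lines when they exist.
  sed -e 's|[$]{HA_BIND_PORT}|'"$bind"'|' \
      -e 's|[$]{HA_PORT_RANGE}|'"$range"'|' \
      -e 's|[$]{HA_DIAGNOSTIC_PORT}|'"$diag"'|' "$template" |
  awk -v fd="$fd" -v vs="$vs" '
      /<FD[ \t]/ && fd != ""             { print fd; next }
      /<VERIFY_SUSPECT[ \t]/ && vs != "" { print vs; next }
      { print }' > "$out"
}
```

A possible invocation, run from $JENKINS_HOME after the backup step and with the Appendix A contents saved under the hypothetical name jgroups-4-base.xml, is migrate_jgroups jgroups.xml.pre-migration jgroups-4-base.xml jgroups.xml.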

If you have more customizations to the file than those outlined above and are not familiar with JGroups, please open a ticket with support and include the contents of the file in the ticket.

Appendix A: Example JGroups customization for version 2.249.2.3 or higher

<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.0.xsd">
    <TCP_NIO2
         recv_buf_size="${tcp.recv_buf_size:128K}"
         send_buf_size="${tcp.send_buf_size:128K}"
         max_bundle_size="64K"
         sock_conn_timeout="1000"

         bind_port="${HA_BIND_PORT}"
         port_range="${HA_PORT_RANGE}"
         diagnostics_port="${HA_DIAGNOSTIC_PORT}"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"/>

    <CENTRAL_LOCK />

    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING
             location="${HA_JGROUPS_DIR}"
             remove_old_coords_on_view_change="true"/>
    <MERGE3 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                    discard_delivered_msgs="true"/>
    <UNICAST3 />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"
                max_join_attempts="5"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>

Appendix B: Example JGroups customization previous to 2.249.2.3

<!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml -->
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.1.xsd">
    <TCP_NIO2
         recv_buf_size="${tcp.recv_buf_size:128K}"
         send_buf_size="${tcp.send_buf_size:128K}"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         sock_conn_timeout="1000"

         bind_port="${HA_BIND_PORT}"
         port_range="${HA_PORT_RANGE}"
         diagnostics_port="${HA_DIAGNOSTIC_PORT}"

         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>

    <CENTRAL_LOCK />

    <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING
             location="${HA_JGROUPS_DIR}"
             remove_old_coords_on_view_change="true"
             remove_all_files_on_view_change="true"/>
    <MERGE2 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                   discard_delivered_msgs="true"/>
    <UNICAST />
    <!--
      When a new node joins a cluster, initial message broadcast doesn't necessarily seem
      to arrive. Using a shorter cycles in the STABLE protocol makes the cluster recognize
      this dropped transmission and cause a retransmission.
    -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"
                max_join_attempts="5"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>
