Strategy for Rolling OS Upgrades

Issue

  • We would like to apply OS patches to the workers / controllers
  • We would like to apply OS patches to the workers / controllers periodically

Environment

Resolution

Updating the OS images used in CJE requires restarting the controllers and workers, which can cause significant downtime.

You can reduce this downtime by launching new worker nodes and destroying the old ones once the new ones are up, then restarting the controllers one at a time.

With this approach, the downtime is limited to the time it takes to re-provision CJOC and the Managed Masters on the new workers.

Note: The worker rollout can be paced as needed. It can be done one instance at a time (add one worker, then remove an old one) or all at once (add N workers, then remove the N old ones).
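
The two pacing options in the note above can be sketched as a small planning function. This is only an illustration, not part of CJE: the function name, the `batch_size` parameter, and the worker names are hypothetical, and it assumes each old worker is replaced by exactly one new worker.

```python
from typing import List, Tuple

# (action, worker name); action is "add" or "remove". Hypothetical type for this sketch.
Step = Tuple[str, str]


def rollout_plan(old_workers: List[str], new_workers: List[str], batch_size: int) -> List[Step]:
    """Return the ordered add/remove steps for replacing old workers with new ones.

    Each batch adds up to `batch_size` new workers before removing the same
    number of old ones, so capacity never drops below the starting level.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if len(old_workers) != len(new_workers):
        raise ValueError("this sketch assumes a one-for-one replacement")

    plan: List[Step] = []
    for start in range(0, len(old_workers), batch_size):
        end = start + batch_size
        plan += [("add", w) for w in new_workers[start:end]]
        plan += [("remove", w) for w in old_workers[start:end]]
    return plan
```

With `batch_size=1` the plan alternates one add and one remove; with `batch_size` equal to the number of workers it adds all new workers first and then removes all old ones.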

AWS

Pre-requisites:

  • Prepare new AMIs
  • Ensure enough IPs are available in VPC(s)

Constraints:

  • Ensure 2 controllers are available at all times
  • Ensure CJOC is / can be provisioned when executing worker-add and worker-remove operations

Process:

  1. Use cje upgrade --config-only --force to update the CJE config to use the new AMI (see How to change controllers and workers AMI)
  2. Use the cje prepare worker-add operation to add the new worker(s), using the new AMI
  3. Provision a “Test” master to check that the provisioning works on the new worker(s) - CPU / Memory resources can be tweaked to ensure only the new worker(s) can accept the offer
  4. Use cje prepare worker-remove to remove the old worker(s)
  5. Use cje prepare controller-restart to restart each of the controllers, one at a time. By ensuring that only one controller is down at a time, there shouldn’t be any downtime.

Note: Step 3) is optional but recommended to ensure that the patched worker(s) behave as expected.
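
The controller restart in the last step, combined with the constraint that 2 controllers stay available, could be guarded roughly as follows. This is a minimal sketch under assumptions: `restart` and `is_healthy` are hypothetical callbacks that you would supply (for example, `restart` might wrap the cje prepare controller-restart operation for one controller), and the controller names are placeholders.

```python
from typing import Callable, List


def rolling_restart(
    controllers: List[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    min_available: int = 2,
) -> None:
    """Restart controllers one at a time, keeping at least `min_available` up."""
    for target in controllers:
        # Count the controllers that will stay up while `target` is down.
        others_up = [c for c in controllers if c != target and is_healthy(c)]
        if len(others_up) < min_available:
            raise RuntimeError(
                f"refusing to restart {target}: only {len(others_up)} other controller(s) healthy"
            )
        restart(target)  # hypothetical hook; how the restart is triggered is up to you
        if not is_healthy(target):
            raise RuntimeError(f"{target} did not come back healthy; stopping the rollout")
```

Stopping at the first controller that fails to come back keeps a second controller from being taken down while the first is still unavailable.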

Anywhere

Pre-requisites:

  • Prepare new hosts for the new workers / controllers

Constraints:

  • [IMPORTANT] Ensure new controllers have the same IPs / DNS hostnames as the old ones (the Mesos / Zookeeper cluster is based on the IPs / DNS hostnames provided; changing them would require re-creating the cluster)
  • Ensure 2 controllers are available at all times
  • Ensure CJOC is / can be provisioned when executing worker-add and worker-remove operations
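
The first constraint above can be checked before any controller is replaced. The following is a hedged sketch, not a CJE feature: the function name and the hostname-to-IP dictionaries are hypothetical, and you would fill them in from your own inventory.

```python
from typing import Dict, List


def address_mismatches(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Compare controller addresses before and after replacement.

    Both arguments map a controller's DNS hostname to its IP address.
    Returns a list of human-readable problems; an empty list means the
    replacement hosts reuse exactly the same hostnames and IPs.
    """
    problems = []
    for host in sorted(set(old) | set(new)):
        if host not in new:
            problems.append(f"{host}: missing from the new controllers")
        elif host not in old:
            problems.append(f"{host}: not an existing controller hostname")
        elif old[host] != new[host]:
            problems.append(f"{host}: IP changed from {old[host]} to {new[host]}")
    return problems
```

An empty result suggests the new hosts can take over without re-creating the Mesos / Zookeeper cluster; any entry in the list should be resolved first.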

Process:

  1. Use the cje prepare worker-add operation to add the new worker(s)
  2. Provision a “Test” master to check that the provisioning works on the new worker(s) - CPU / Memory resources can be tweaked to ensure only the new worker(s) can accept the offer
  3. Use cje prepare worker-remove to remove the old worker(s)
  4. Shut down the old worker(s)
  5. Replace each controller one at a time. By ensuring that only one controller is down at a time, there shouldn’t be any downtime. For each controller:
    1. Shut down the old controller
    2. Update the Load Balancer / DNS (removing the old controller / adding the new controller)
    3. Use cje prepare controller-restart to initialize the new controller

Note: Step 2) is optional but recommended to ensure that the patched worker(s) behave as expected.
