Strategy for Rolling OS Upgrades (Live)

Issue

  • We would like to apply patches to worker / controller
  • We would like to apply patches to worker / controller periodically

Environment

Resolution

Updating the OS images used in CJE requires restarting its components, namely the controllers and workers, which can cause significant downtime. We recommend using the approach explained in the article Strategy for Rolling OS Upgrades.

If you opt for live upgrades instead, CJE is not aware of the changes applied, which can be a problem, especially in AWS environments. At the very least, run health checks similar to those CJE performs once the upgrade is complete.

This article is meant to provide recommendations/guidance on the process and actions to carry out.

AWS

(Note: in the case of AWS, changes applied live to workers/controllers vanish after a worker-restart or a controller-restart, because these operations “reconstruct” the instances from an AMI ID. Live upgrades are therefore not recommended.)

Constraints:

  • Ensure 2 controllers are available at all times
  • (If a restart is required) Ensure that the IPs are preserved by using Reboot in EC2. If you are using Elastic IPs, you can instead Stop and Start the EC2 instance. See Differences Between Reboot, Stop, and Terminate for more information.

Recommendations:

  • Ensure you have a strategy to back up and restore instances if the upgrade goes wrong

Process:

  1. Perform the live upgrade on the controllers one at a time: carry out the following steps on one controller and, once it is validated as working, carry on with the others. Otherwise, roll back the changes.
    1. Upgrade a controller
    2. (If a restart is required) Reboot the instance
    3. Check that the required services are running on the controller (it may take a few seconds for the services to start):
      • marathon
      • mesos-master
      • zookeeper
    4. Check that the subsystems are running:
      • docker
      • ntp
      • rsyslog
      • topbeat
    5. Check that the Mesos / Marathon UI is reachable
  2. Perform the live upgrade on the worker(s) one at a time: carry out the following steps on one worker and, once it is validated as working, carry on with the others. Otherwise, roll back the changes.
    1. Upgrade one worker
    2. (If a restart is required) Reboot the instance
    3. Check that the required services are running on the worker (it may take a few seconds for the services to start):
      • mesos-slave
    4. Check that the subsystems are running:
      • docker
      • ntp
      • rsyslog
      • topbeat
  3. (If applicable) Perform the live upgrade on the Bastion
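
The service checks in steps 3 and 4 can be scripted. The following is a minimal sketch, not part of CJE: it assumes the hosts use systemd, that the unit names match the service names listed above, and that the script runs locally on the node being checked. The function names (`failed_services`, `wait_until_healthy`) are hypothetical. Because services may take a few seconds to start, the check retries before reporting a failure.

```python
import subprocess
import time
from typing import Callable, Iterable, List

# Names copied from the lists above; that they match the systemd unit names
# on your hosts is an assumption.
CONTROLLER_SERVICES = ["marathon", "mesos-master", "zookeeper"]
WORKER_SERVICES = ["mesos-slave"]
SUBSYSTEMS = ["docker", "ntp", "rsyslog", "topbeat"]


def systemctl_is_active(service: str) -> bool:
    """Return True if `systemctl is-active` reports the unit as active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", service],
        check=False,
    )
    return result.returncode == 0


def failed_services(
    services: Iterable[str],
    is_active: Callable[[str], bool] = systemctl_is_active,
) -> List[str]:
    """Return the services that are not currently active."""
    return [name for name in services if not is_active(name)]


def wait_until_healthy(
    services: Iterable[str],
    is_active: Callable[[str], bool] = systemctl_is_active,
    attempts: int = 10,
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Check the services up to `attempts` times, pausing between checks.

    Returns the services that are still inactive after the last attempt
    (an empty list means everything is running).
    """
    failing = failed_services(services, is_active)
    for _ in range(attempts - 1):
        if not failing:
            break
        sleep(delay_seconds)
        # Only re-check the services that were still down.
        failing = failed_services(failing, is_active)
    return failing
```

On a controller, `wait_until_healthy(CONTROLLER_SERVICES + SUBSYSTEMS)` returns an empty list once every listed service is active; on a worker, pass `WORKER_SERVICES + SUBSYSTEMS` instead. A non-empty result is the signal to stop and roll back.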

Anywhere

Constraints:

  • Ensure 2 controllers are available at all times
  • (If a restart is required) Ensure that the IPs / DNS hostnames are preserved

Recommendations:

  • Ensure you have a strategy to back up and restore instances if the upgrade goes wrong

Process:

  1. Perform the live upgrade on the controllers one at a time: carry out the following steps on one controller and, once it is validated as working, carry on with the others. Otherwise, roll back the changes.
    1. Upgrade a controller
    2. (If a restart is required) Restart the instance
    3. Check that the required services are running on the controller (it may take a few seconds for the services to start):
      • marathon
      • mesos-master
      • zookeeper
    4. Check that the subsystems are running:
      • docker
      • ntp
      • rsyslog
      • topbeat
    5. Check that the Mesos / Marathon UI is reachable
  2. Perform the live upgrade on the worker(s) one at a time: carry out the following steps on one worker and, once it is validated as working, carry on with the others. Otherwise, roll back the changes.
    1. Upgrade one worker
    2. (If a restart is required) Restart the instance
    3. Check that the required services are running on the worker (it may take a few seconds for the services to start):
      • mesos-slave
    4. Check that the subsystems are running:
      • docker
      • ntp
      • rsyslog
      • topbeat
  3. (If applicable) Perform the live upgrade on the Bastion
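
The UI reachability check in step 1.5 can also be automated with a plain HTTP probe. The sketch below is illustrative only: the function name `is_reachable` is hypothetical, and the URLs you pass in depend on how the Mesos and Marathon UIs are exposed in your installation, which this article does not specify.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True if an HTTP GET on `url` succeeds with a 2xx or 3xx status.

    `opener` defaults to urllib's urlopen and is a parameter only so the
    function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

For example, `is_reachable("http://<controller-address>/<ui-path>")` with the address and path that apply to your cluster. A `False` result after the services have had time to start is a reason to stop and investigate before upgrading the next controller.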

Note: If the upgrade of the first controller / worker goes wrong, stop right there and check what is wrong. Do not upgrade the other controllers / workers.
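
The one-at-a-time rule, including stopping at the first failure, can be expressed as a small driver loop. This is a sketch under stated assumptions: `rolling_upgrade` and its `upgrade`, `validate`, and `rollback` callbacks are hypothetical placeholders for whatever patching, health-check, and restore procedures you use; nothing here is provided by CJE.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_upgrade(
    nodes: Iterable[str],
    upgrade: Callable[[str], None],
    validate: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Upgrade nodes strictly one at a time.

    Each node is upgraded and then validated. On the first validation
    failure the node is rolled back and the loop stops, leaving the
    remaining nodes untouched.

    Returns (nodes upgraded successfully, node that failed or None).
    """
    upgraded: List[str] = []
    for node in nodes:
        upgrade(node)
        if not validate(node):
            rollback(node)
            return upgraded, node
        upgraded.append(node)
    return upgraded, None
```

Run it once for the controllers and, only if no failure is reported, once more for the workers. Because only one node is ever mid-upgrade, a cluster with three controllers keeps two available throughout.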
