How can we restore a CJE environment when it is in a bad state?

Issue

  • How can we restore a CJE environment when it is in a bad state?
  • Our CJE Cluster is no longer accessible after some infrastructure changes.
  • We changed the IP of every worker/controller, and our CJE cluster is not accessible.
  • We stopped/restarted all our VMs/Hosts/EC2 instances, and our CJE cluster is not accessible.
  • We stopped/restarted all our controllers, and our CJE cluster is not accessible.
  • We changed our DNS/LB, and our CJE cluster is not accessible.

Environment

Resolution

Before starting to make changes, you have to be sure about the root cause of the issue; for this we need to find the configuration error. Run the cje status command to check whether the cluster is accessible from the bastion host (the machine that hosts the CJE project). The output should look something like this:

Project template: aws
Cluster state: initialized
CloudBees Jenkins Operation Center
 URL: http://cluster.example.com/cjoc/
 cjoc: OK
Server Status
 controller-1: OK
 controller-2: ERROR
 controller-3: OK
 worker-1: OK
 worker-2: OK
 worker-3: OK
 worker-4: OK
 worker-5: OK

Also check the URLs of the exposed services by running cje run display-outputs:

Controllers: c1.example.com,c2.example.com,c3.example.com
Workers    : w1.example.com,w2.example.com,w3.example.com,w4.example.com,w5.example.com

CJOC    : http://cluster.example.com/cjoc/
Mesos   : http://mesos.example.com
Marathon: http://marathon.example.com

After that, we may see a failure in some of the cluster components. In order to verify that kind of error, we should try to connect to the affected server with dna connect SERVER, e.g.

dna connect controller-1
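
Once connected, a quick health check on the host often points to the root cause. The commands below are a minimal sketch, assuming the node is a standard Linux host running its CJE services as Docker containers; none of them are CJE-specific.

docker ps                  # are the expected containers for this server running?
df -h                      # is any filesystem (e.g. /var/lib/docker) full?
free -m                    # is the host out of memory?
uptime                     # high load average or a very recent reboot?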

Check DNS resolution

DNS resolution should work on all the VMs/Hosts; you can check it with nslookup. For example, if you use . as the domain separator, these DNS names should resolve to the IP of your Load Balancer:

nslookup cluster.example.com
nslookup mesos.cluster.example.com
nslookup marathon.cluster.example.com
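
To check all three names in one go, a small loop can be used; this is just a convenience sketch assuming your cluster domain is cluster.example.com:

for name in cluster.example.com mesos.cluster.example.com marathon.cluster.example.com; do
  echo "== $name =="
  nslookup "$name" || echo "FAILED: $name does not resolve"
done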

Check Load Balancer (LB)

In front of your CJE Cluster there is a LB that routes all the traffic from outside the CJE Cluster to each container inside it. If the status command says that everything is OK but you cannot access the CJE-OC, Mesos, or Marathon URL from outside, you should check that the LB can reach your controllers and is able to route traffic to them. The steps to follow may vary depending on your LB solution; a simple way to check whether the traffic is being routed is by running a curl command against the Mesos URL, e.g.

curl -IvL mesos.cluster.example.com

This will be the result, where 192.168.1.98 is the IP of the LB. The response code is 401 because we did not use -u mesos_web_ui_username:mesos_web_ui_password, which are stored in .dna/secrets:

*   Trying 192.168.1.98...
* TCP_NODELAY set
* Connected to mesos.cluster.example.com (192.168.1.98) port 80 (#0)
> HEAD / HTTP/1.1
> Host: mesos.cluster.example.com
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 401 Unauthorized
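
If the LB does not answer, you can verify that a controller serves the same endpoint directly by forcing curl to resolve the name to a controller IP instead of the LB. The IP below is hypothetical; take the real controller addresses from cje run display-outputs, and adjust the port if your controllers do not listen on 80.

# 10.0.1.11 is a placeholder controller IP; replace it with a real one
curl -IvL --resolve mesos.cluster.example.com:80:10.0.1.11 http://mesos.cluster.example.com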

Worker/controller not accessible

If we cannot connect to the server because it is not accessible or it does not respond, we can go ahead and restart the worker or controller with the proper operation: cje prepare worker-restart or cje prepare controller-restart.
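
Like the other cje operations in this article, the restart follows the prepare/apply pattern. A minimal sketch for a worker (the controller case is analogous; the name of the generated .config file may differ slightly depending on your CJE version):

cje prepare worker-restart
# edit the generated worker-restart.config and set the worker to restart, e.g. worker-1
cje apply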

On CJE Anywhere, check that the VMs/Hosts are up and running; then you can run dna init worker-X.

CJOC not accessible

If we cannot connect to the CJE-OC Docker container (dna connect cjoc times out) or the CJE-OC web UI is not accessible (the URL returns 503, as in the example below), try restarting CJE-OC with dna init cjoc:

dna connect cjoc
timeout
curl -IvL http://cluster.example.com/cjoc/

*   Trying 192.168.1.98...
* TCP_NODELAY set
* Connected to cluster.example.com (192.168.1.98) port 80 (#0)
> HEAD /cjoc/ HTTP/1.1
> Host: cluster.example.com
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 503 Service Unavailable
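
After running dna init cjoc, the container can take a while to come back, so it is worth polling the URL until the 503 clears. A simple sketch in plain shell, nothing CJE-specific:

# Poll CJOC every 10 seconds, give up after ~5 minutes
for i in $(seq 1 30); do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://cluster.example.com/cjoc/)
  echo "attempt $i: HTTP $code"
  [ "$code" != "503" ] && [ "$code" != "000" ] && break
  sleep 10
done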

Worker not available anymore

In some situations, a worker is no longer part of the CJE Cluster (e.g. a terminated EC2 instance). In those cases we have to remove the worker from the cluster to avoid other issues; to do that we use the cje prepare worker-remove operation. If this operation fails, we have to be sure that the VM/Host no longer exists in our infrastructure (cloud) and manually mark the worker as deleted with touch .dna/servers/worker-X/.deleted. After that we can purge the deleted workers to eliminate the unnecessary configuration.
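
The removal is another prepare/apply operation. The sketch below assumes worker-3 is the instance that no longer exists and that the generated config file follows the usual NAME.config pattern:

cje prepare worker-remove
# edit the generated worker-remove.config and set the worker to remove, e.g. worker-3
cje apply

# If the operation fails because the VM/Host is already gone,
# mark the worker as deleted by hand and then purge the deleted workers:
touch .dna/servers/worker-3/.deleted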

Restore a CJE Cluster

Backup your CJE project

Before starting any kind of restore process you have to make a backup of the folder that contains the CJE project.

tar -cjf cje_backup.tar.bz2 PROJECT_FOLDER
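
It is worth listing the archive afterwards to confirm the backup is usable before you touch anything:

tar -tjf cje_backup.tar.bz2 | head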

Re-initialize controllers, workers, and services

If the issue affects the whole cluster, affects every controller, or the workers'/controllers' configuration needs to be recreated because of an infrastructure change, we can try to reinitialize the cluster with tiger init. This command will init every single controller, worker, and service. If we made changes in the infrastructure (IP addresses or any other similar change), we will need to check .dna/project.config to be sure that the configuration is correct.
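
Before running tiger init after an IP change, a quick grep helps spot addresses that still point at the old infrastructure (the exact keys in project.config depend on your provider and CJE version):

# list every hard-coded IPv4 address referenced by the project configuration
grep -nE '([0-9]{1,3}\.){3}[0-9]{1,3}' .dna/project.config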

Cluster Recover AWS multi AZ

Before starting the Cluster Recover procedure, you have to move all your workers to the same region because it is not possible to restore a cluster spanning multiple regions. To do that, you have to modify your workers' configuration in the .dna/project.config file.
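
The key names in project.config vary by provider and CJE version, so the grep below is only a hint for locating the worker placement settings you need to change:

grep -niE 'worker|zone|subnet|region' .dna/project.config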

Cluster Recover

If the re-initialization fails, we can try to recover the cluster with the cje prepare cluster-recover operation. Follow these steps (a consolidated command sketch follows the list):

  • In some cases we need to follow the steps in the Destroy a Cluster preserving old data section before continuing.
  • Rename the .dna folder - mv .dna .dna_to_restore
  • Execute cje prepare cluster-recover
  • Edit the cluster-recover.config file:
    • set the dna_path to PATH_TO_THE_PROJECT/.dna_to_restore
    • set the recovery_mode to
      • repair if the cluster hasn’t been destroyed OR if this is a destroyed AWS cluster with preserved old data (S3 and EBS)
      • full otherwise

## Recovery mode
# By default (recovery_mode = full), the recovery process creates a new cluster based on the specified configuration
# (see dna_path parameter below)
# To attempt a cluster repair (re-creating/updating missing components), set the recovery_mode = repair.
# Note that depending on the state of the cluster, the repair mode might not be successful.
recovery_mode = full

## Cluster configuration directory path to recover
# [required] path relative to the PROJECT directory
dna_path=PATH_TO_THE_PROJECT/.dna_to_restore
  • Execute cje apply
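
Put together, the recovery boils down to the following commands (a sketch; run them from the CJE project folder):

mv .dna .dna_to_restore
cje prepare cluster-recover
# edit cluster-recover.config: set dna_path and recovery_mode as described above
cje apply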

If everything goes OK the cluster will be up and running.

Recreate the cluster with the same configuration and data

If the cluster recover fails, we can try to recreate the cluster from scratch while preserving the CJE-OC data, CJE-MM data, ES data, and all the other configuration made before. To do that we destroy the cluster preserving the data and create it again. Follow these steps (a consolidated command sketch follows the list):

  • Follow the steps in the Destroy a Cluster preserving old data section
  • The cluster should be destroyed and a new folder named like .dna-destroyed-20171124T212834Z will be created
  • Prepare a cluster-init operation - cje prepare cluster-init
  • Copy the configuration from the initial cluster-init operation in operations/20170508T115907Z-cluster-init/config to cluster-init.config, or edit the file with the values in .dna-destroyed-20171124T212834Z/project.config.
  • Execute cje apply
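
Put together, the recreation looks like this (a sketch; the timestamps in the paths are the examples used in this article):

# 1. Destroy the cluster, preserving the data (see the section below)
cje prepare cluster-destroy
# edit cluster-destroy.config and ONLY uncomment the cluster_name line
cje apply

# 2. Recreate the cluster with the previous configuration
cje prepare cluster-init
# fill cluster-init.config from operations/20170508T115907Z-cluster-init/config
# or from .dna-destroyed-20171124T212834Z/project.config
cje apply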

After that, the cluster will be created from scratch reusing the data; then you can log in to CJE-OC and start all the CJE-MM again.

Destroy a Cluster preserving old data

In some cases it is necessary to destroy the cluster infrastructure while preserving the old data (CJE-OC, CJE-MM, ES, …). A consolidated command sketch follows the configuration example below.

  • Destroy the cluster - cje prepare cluster-destroy
  • Edit the cluster-destroy.config file and ONLY uncomment the line with the name of the cluster, e.g. cluster_name = support-cluster

## Cluster name
# Uncomment the next line after checking this is really the cluster you want to destroy
# cluster_name = support-cluster

## Destroy storage bucket
# Uncomment to destroy
# destroy_storage_bucket = yes

## Destroy EBS resources (This option destroys long-term storage of the cluster. It CANNOT be recovered)
# This is done as a best effort. Some resources may not be deletable (pending snapshots, in-use volumes) and their ids
# will be reported.
# Uncomment to destroy
# destroy_ebs_resources = yes
  • Execute cje apply
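
In short, the destroy-but-preserve sequence is (a sketch; leave the storage bucket and EBS lines commented so the data survives):

cje prepare cluster-destroy
# edit cluster-destroy.config and uncomment ONLY the cluster_name line
cje apply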