- How can we restore a CJE environment when it is in a bad state?
- Our CJE Cluster is no longer accessible after some infrastructure changes.
- We changed the IP of every worker/controller, and our CJE cluster is not accessible
- We stopped/restarted all our VMs/Hosts/EC2 instances, and our CJE cluster is not accessible
- We stopped/restarted all our controllers, and our CJE cluster is not accessible
- We changed our DNS/LB, and our CJE cluster is not accessible
- CloudBees Jenkins Enterprise - AWS/OpenStack/Anywhere
Before making any changes, you have to be sure about the root cause of the issue; for this we need to find the configuration error. Run the `cje status` command to check whether the cluster is accessible from the bastion host (the machine that hosts the CJE project). The output should look like this:

```
Project template: aws
Cluster state: initialized
CloudBees Jenkins Operation Center URL: http://cluster.example.com/cjoc/
cjoc: OK

Server        Status
controller-1: OK
controller-2: ERROR
controller-3: OK
worker-1:     OK
worker-2:     OK
worker-3:     OK
worker-4:     OK
worker-5:     OK
```
Also check the URLs of the exposed services by running `cje run display-outputs`:

```
Controllers: c1.example.com,c2.example.com,c3.example.com
Workers    : w1.example.com,w2.example.com,w3.example.com,w4.example.com,w5.example.com
CJOC       : http://cluster.example.com/cjoc/
Mesos      : http://mesos.example.com
Marathon   : http://marathon.example.com
```
After that we may see a failure in some of the cluster components. In order to verify that kind of error, we should try to connect to them with `dna connect SERVER`, e.g.:

```
dna connect controller-1
```
Check DNS resolution
DNS resolution should work on all the VMs/Hosts; you can check it with `nslookup`. For example, if you use `cluster.example.com` as your cluster domain, these DNS names should resolve to the IP of your Load Balancer:

```
nslookup cluster.example.com
nslookup mesos.cluster.example.com
nslookup marathon.cluster.example.com
```
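The checks above can be scripted. The sketch below is our own helper, not a CJE command: the LB IP and hostnames are the example values from this article, and the `nslookup` output parsing is best-effort (its format varies slightly between platforms).

```shell
#!/bin/sh
# Sketch: verify every cluster DNS name resolves to the Load Balancer IP.
# LB_IP and the hostnames are placeholders; substitute your real values.
LB_IP="192.168.1.98"

# succeeds only when a non-empty resolved IP matches the LB IP
matches_lb() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

for host in cluster.example.com mesos.cluster.example.com marathon.cluster.example.com; do
  # take the first answer line ("Address: x.x.x.x") after the server block
  resolved=$(nslookup "$host" 2>/dev/null | awk '/^Address: /{print $2; exit}')
  if matches_lb "$resolved" "$LB_IP"; then
    echo "$host -> $resolved (OK)"
  else
    echo "$host -> ${resolved:-no answer} (MISMATCH, expected $LB_IP)"
  fi
done
```

Any MISMATCH line points at a DNS record that was not updated after the infrastructure change.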
Check Load Balancer (LB)
In front of your CJE Cluster there is a LB that routes all the traffic from outside the CJE Cluster to each container inside it. If the status command says that everything is OK but you cannot access the CJE-OC, Mesos, or Marathon URL from outside, you should check that the LB can reach your controllers and is able to route traffic to them. The steps to follow vary depending on your LB solution; a simple way to check whether traffic is being routed is to run a `curl` command against the Mesos URL, e.g.:

```
curl -IvL mesos.cluster.example.com
```
This will be the result, where `192.168.1.98` is the IP of the LB. The response code is `401` because we did not pass the `-u mesos_web_ui_username:mesos_web_ui_password` credentials:

```
* Trying 192.168.1.98...
* TCP_NODELAY set
* Connected to mesos.cluster.example.com (192.168.1.98) port 80 (#0)
> HEAD / HTTP/1.1
> Host: mesos.cluster.example.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
```
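As a rough automation of this check, the sketch below (our own helper, using the example URLs from `cje run display-outputs`) curls each exposed service and treats `401` as healthy, since it only means authentication is required:

```shell
#!/bin/sh
# Sketch: probe each service exposed through the LB and report its HTTP code.
# A code of 000 means the request never reached a server at all.

# 2xx/3xx responses, or 401 (auth required), count as "routed correctly"
is_routed() {
  case "$1" in 2??|3??|401) return 0 ;; *) return 1 ;; esac
}

for url in http://cluster.example.com/cjoc/ http://mesos.cluster.example.com http://marathon.cluster.example.com; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  if is_routed "$code"; then
    echo "$url -> $code (LB routing OK)"
  else
    echo "$url -> $code (check LB / controllers)"
  fi
done
```

A `503` here with a green `cje status` usually means the LB reaches the controllers but the backend container is not answering.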
Worker/controller not accessible
If we cannot connect to the server because it is not accessible or it does not respond, we can go ahead and restart the worker or controller with the proper operation: `cje prepare worker-restart` or `cje prepare controller-restart`.
On CJE Anywhere, check that the VMs/Hosts are up and running, and run `dna stop worker-X` and `dna init worker-X`.
CJOC not accessible
If we cannot connect to the CJE-OC Docker container (`dna connect cjoc` times out) and the CJE-OC web UI is not accessible, try to restart CJE-OC with `dna stop cjoc` and `dna init cjoc`. In this state, curling the CJOC URL returns a `503`:

```
curl -IvL http://cluster.example.com/cjoc/
* Trying 192.168.1.98...
* TCP_NODELAY set
* Connected to cluster.example.com (192.168.1.98) port 80 (#0)
> HEAD /cjoc/ HTTP/1.1
> Host: cluster.example.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
```
Worker not available anymore
In some situations a worker is no longer in the CJE Cluster (e.g. a terminated EC2 instance). In those cases we have to remove the worker from the cluster to avoid other issues; to do that we use the `cje prepare worker-remove` operation. If this operation fails, we have to make sure that the VM/Host no longer exists in our infrastructure (cloud) and manually mark the worker as deleted with `touch .dna/servers/worker-X/.deleted`. After that we can purge the deleted workers to eliminate the unnecessary configuration.
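The manual fallback can be wrapped in a couple of helpers. These are our own sketch functions, not CJE commands; they only rely on the `.dna/servers/` project layout described above and must be run from the CJE project folder:

```shell
#!/bin/sh
# Sketch: mark a vanished worker as deleted and list every server
# currently flagged for purging.

mark_deleted() {
  # $1 = server name, e.g. worker-3
  # Only do this after confirming the VM/Host is really gone.
  touch ".dna/servers/$1/.deleted"
}

list_deleted() {
  # print the name of every server carrying a .deleted marker
  find .dna/servers -name .deleted 2>/dev/null | awk -F/ '{print $(NF-1)}'
}
```

For example, `mark_deleted worker-3` followed by `list_deleted` lets you confirm the marker is in place before purging.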
Restore a CJE Cluster
Backup your CJE project
Before starting any kind of restore process you have to make a backup of the folder that contains the CJE project.
tar -cjf cje_backup.tar.bz2 PROJECT_FOLDER
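A small wrapper (our own sketch, not a CJE command) can create the archive and verify it is readable before you touch the cluster:

```shell
#!/bin/sh
# Sketch: archive the CJE project folder and sanity-check the archive.
# -cjf uses bzip2 compression, matching the .tar.bz2 extension.

backup_project() {
  # $1 = project folder, $2 = archive name
  tar -cjf "$2" "$1" && tar -tjf "$2" >/dev/null && echo "backup OK: $2"
}
```

Usage: `backup_project PROJECT_FOLDER cje_backup.tar.bz2`. If the function prints nothing, the backup failed and you should not proceed with the restore.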
Re-initialize controllers, workers, and services
If the issue affects the whole cluster, if it affects every controller, or if the workers/controllers configuration needs to be recreated because of an infrastructure change, we can try to reinitialize the cluster with `tiger init`. This command will `init` every single controller, worker, and service. If we made changes in the infrastructure, such as IP addresses, we will need to check `.dna/project.config` to be sure that the configuration is correct.
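To spot stale addresses quickly, a best-effort grep like the one below (our own sketch) lists every IPv4 address recorded in the config so you can compare them against the new infrastructure:

```shell
#!/bin/sh
# Sketch: print line:address for every IPv4 address in a config file,
# so IPs left over from the old infrastructure stand out.

list_ips() {
  grep -onE '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1"
}
```

Usage: `list_ips .dna/project.config`, then cross-check each address against your current VMs/Hosts.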
Cluster Recover (AWS multi-AZ)
Before starting the Cluster Recover procedure, you have to move all your workers to the same region, because it is not possible to restore a cluster spread across multiple regions; to do that, you have to modify your workers' configuration.
If the re-initialization fails, we can try to recover the cluster with the `cje prepare cluster-recover` operation. Follow these steps:

- In some cases we need to follow the steps in the Destroy a Cluster preserving old data section before continuing.
- Rename the `.dna` folder: `mv .dna .dna_to_restore`
- Execute the `cje prepare cluster-recover` operation
- Edit the `cluster-recover.config` file to set `recovery_mode = full` and the `dna_path`:

```
[pse]

## Recovery mode
# By default (recovery_mode = full), the recovery process creates a new cluster based on the specified configuration
# (see dna_path parameter below)
# To attempt a cluster repair (re-creating/updating missing components), set the recovery_mode = repair.
# Note that depending on the state of the cluster, the repair mode might not be successful.
recovery_mode = full

## Cluster configuration directory path to recover
# [required] path relative to the PROJECT directory
dna_path=PATH_TO_THE_PROJECT/.dna_to_restore
```
If everything goes OK the cluster will be up and running.
Recreate the cluster with the same configuration and data
If the cluster recover fails, we can try to recreate the cluster from scratch while preserving the CJE-OC data, CJE-MM data, ES data, and all the other configuration made before. To do that we will destroy the cluster preserving the data and create it again. Follow these steps:

- Follow the steps in the Destroy a Cluster preserving old data section
- The cluster should be destroyed, and a new folder named like `.dna-destroyed-20171124T212834Z` will be created
- Run `cje prepare cluster-init`
- Copy the configuration from the initial `operations/20170508T115907Z-cluster-init/config` to the `cluster-init.config` file, or edit the file with the previous values

After that, the cluster will be created from scratch reusing the data; then you can log in to CJE-OC and start all CJE-MM again.
Destroy a Cluster preserving old data
In some cases it is necessary to destroy the cluster infrastructure while preserving the old data (CJE-OC, CJE-MM, ES, …).
- Prepare the destroy operation: `cje prepare cluster-destroy`
- Edit the `cluster-destroy.config` file and ONLY uncomment the line with the name of the cluster: `cluster_name = support-cluster`

```
[pse]

## Cluster name
# Uncomment the next line after checking this is really the cluster you want to destroy
# cluster_name = support-cluster

## Destroy storage bucket
# Uncomment to destroy
# destroy_storage_bucket = yes

## Destroy EBS resources (This option destroys long-term storage of the cluster. It CANNOT be recovered)
# This is done as a best effort. Some resources may not be deletable (pending snapshots, in-use volumes) and their ids
# will be reported.
# Uncomment to destroy
# destroy_ebs_resources = yes
```