- How can we restore a CJE environment when it is in a bad state?
- Our CJE Cluster is no longer accessible after some infrastructure changes
- We changed the IP of every worker/controller, and our CJE cluster is not accessible
- We stopped/restarted all our VMs/Hosts/EC2 instances, and our CJE cluster is not accessible
- We stopped/restarted all our controllers, and our CJE cluster is not accessible
- We changed our DNS/LB, and our CJE cluster is not accessible
- CloudBees Jenkins Enterprise - AWS/OpenStack/Anywhere
Before making further configuration changes to the unhealthy CJE cluster, you have to be sure about the root cause of the issue. To do so, start by locating the configuration error.

Run the `cje status` command to check whether the cluster is accessible from the bastion host (the machine that hosts the CJE project). The output should look like this:

```
Project template: aws
Cluster state: initialized
CloudBees Jenkins Operation Center URL: http://cluster.example.com/cjoc/
cjoc: OK

Server Status
controller-1: OK
controller-2: ERROR
controller-3: OK
worker-1: OK
worker-2: OK
worker-3: OK
worker-4: OK
worker-5: OK
```
Take a moment to review the output. All servers should report an OK status. In this example, controller-2 is the obvious source of the problem.
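On a large cluster it can help to pull the failing servers out of a saved `cje status` output with a short script. The sketch below is illustrative: `failing_servers` is a hypothetical helper, and it assumes the `name: STATUS` line format shown above.

```shell
# failing_servers: print every server whose status is not OK.
# Assumes `cje status` lines of the form "<name>: <STATUS>",
# as in the example output above.
failing_servers() {
  awk -F': *' '/^(cjoc|controller|worker)/ && $2 != "OK" { print $1 }' "$1"
}

# Usage (hypothetical):
#   cje status > status.txt
#   failing_servers status.txt
```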
If any cluster component reports a failure, verify the error by connecting to that server from the bastion host with `dna connect <SERVER-NAME>`, e.g.

```
dna connect controller-2
```
Also check the URLs of the exposed services by running `cje run display-outputs`, and validate that the listed URLs are accessible from your browser:

```
Controllers: c1.example.com,c2.example.com,c3.example.com
Workers    : w1.example.com,w2.example.com,w3.example.com,w4.example.com,w5.example.com
CJOC       : http://cluster.example.com/cjoc/
Mesos      : http://mesos.example.com
Marathon   : http://marathon.example.com
```
DNS resolution should work on all the VMs/Hosts. You can check it with `nslookup`; e.g. if you use `.` as the domain separator, these DNS names should resolve to the IP of your Load Balancer:

```
nslookup cluster.example.com
nslookup mesos.cluster.example.com
nslookup marathon.cluster.example.com
```
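The three lookups can also be compared in one pass. This is a sketch: `resolve` and `same_lb_ip` are hypothetical helpers, and `getent` is assumed to be available on the bastion host (swap it for `nslookup` or `dig` output parsing if it is not).

```shell
# resolve: print the first IP a DNS name resolves to (thin wrapper so the
# lookup tool can be swapped out if getent is unavailable).
resolve() {
  getent hosts "$1" | awk '{ print $1; exit }'
}

# same_lb_ip: succeed only if every given name resolves to the same IP,
# which should be the IP of the Load Balancer.
same_lb_ip() {
  first=""
  for name in "$@"; do
    ip=$(resolve "$name")
    if [ -z "$first" ]; then first=$ip; fi
    if [ "$ip" != "$first" ]; then
      echo "MISMATCH: $name -> $ip (expected $first)"
      return 1
    fi
  done
  echo "all names resolve to $first"
}

# Usage (hypothetical):
#   same_lb_ip cluster.example.com mesos.cluster.example.com marathon.cluster.example.com
```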
In front of your CJE cluster there is a LB that routes all traffic from outside the cluster to each container inside it. If the status command reports that everything is OK but you cannot access the CJE-OC, Mesos, or Marathon URL from outside, check that the LB can reach your controllers and is able to route traffic to them. The exact steps vary depending on your LB solution; a simple way to check whether traffic is being routed is to run a `curl` command against a questionable URL, e.g. the Mesos service:

```
curl -IvL mesos.cluster.example.com
```
The result will look like the following, where `192.168.1.98` is the IP of the LB. The response code is `401` because we did not pass `-u mesos_web_ui_username:mesos_web_ui_password`:

```
* Trying 192.168.1.98...
* TCP_NODELAY set
* Connected to mesos.cluster.example.com (192.168.1.98) port 80 (#0)
> HEAD / HTTP/1.1
> Host: mesos.cluster.example.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
```
If we cannot connect to the server because it is not accessible or it does not respond, we can go ahead and restart the worker or controller with the proper operation: `cje prepare worker-restart` or `cje prepare controller-restart`.
On CJE Anywhere, check that the VMs/Hosts are up and running; then you can run `dna init worker-X`.
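The restart operations above follow CJE's usual prepare/apply pattern. The sketch below is illustrative only: `restart_server` and the `CJE` variable are not part of the CJE CLI, and it assumes your CJE version applies a prepared operation with `cje apply` after you edit the generated config.

```shell
# restart_server: sketch of the prepare/apply flow for restarting a server.
# CJE defaults to the real `cje` binary; point it at `echo` for a dry run.
CJE=${CJE:-cje}

restart_server() {
  case $1 in
    worker-*)     op=worker-restart ;;
    controller-*) op=controller-restart ;;
    *) echo "unknown server type: $1" >&2; return 1 ;;
  esac
  $CJE prepare "$op"   # generates the $op.config file in the project directory
  # ... edit "$op.config" here to select the server ($1) before applying ...
  $CJE apply           # assumption: runs the prepared operation
}

# Dry run (hypothetical):
#   CJE=echo restart_server worker-2
```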
If we cannot connect to the CJE-OC Docker container, or the CJE-OC web UI is not accessible (e.g. `dna connect cjoc` times out), try to restart CJE-OC with `dna init cjoc` and then use a `curl` command to check the CJOC response:

```
curl -IvL http://cluster.example.com/cjoc/

* Trying 192.168.1.98...
* TCP_NODELAY set
* Connected to cluster.example.com (192.168.1.98) port 80 (#0)
> HEAD /cjoc/ HTTP/1.1
> Host: cluster.example.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
```
In some situations, a worker is no longer part of the CJE cluster (e.g. a terminated EC2 instance). In those cases we have to remove the worker from the cluster to avoid other issues, using `cje prepare worker-remove`. If this operation fails, make sure that the VM/Host no longer exists in your infrastructure (cloud) and manually mark the worker as deleted with

```
touch .dna/servers/worker-X/.deleted
```

After that we can purge the deleted workers to eliminate the unnecessary configuration.
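The manual cleanup can be wrapped in a small guard so a worker is only marked deleted when its configuration directory actually exists. `mark_worker_deleted` is a hypothetical helper using the path layout shown above:

```shell
# mark_worker_deleted: mark a worker as deleted in the CJE project when
# `cje prepare worker-remove` fails and the VM/host is already gone.
mark_worker_deleted() {
  project_dir=$1
  worker=$2
  server_dir="$project_dir/.dna/servers/$worker"
  if [ ! -d "$server_dir" ]; then
    echo "no such server directory: $server_dir" >&2
    return 1
  fi
  touch "$server_dir/.deleted"
  echo "marked $worker as deleted"
}

# Usage (hypothetical):
#   mark_worker_deleted /path/to/project worker-3
```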
Before starting any kind of restore process, make a backup of the folder that contains the CJE project:

```
tar -cjf cje_backup.tar.bz2 PATH_TO_THE_PROJECT
```
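It is worth verifying that the archive is actually readable before any restore work starts. A minimal sketch, where `backup_project` is a hypothetical helper (bzip2 compression, matching the `.tar.bz2` extension):

```shell
# backup_project: archive the CJE project folder and verify the archive
# can be listed before proceeding with any restore operation.
backup_project() {
  project_dir=$1
  archive=$2
  tar -cjf "$archive" "$project_dir" \
    && tar -tjf "$archive" > /dev/null \
    && echo "backup OK: $archive"
}

# Usage (hypothetical):
#   backup_project PATH_TO_THE_PROJECT cje_backup.tar.bz2
```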
If the issue affects the whole cluster, or every controller, or the workers/controllers configuration needs to be recreated because of an infrastructure change, we can try to reinitialize the cluster with `tiger init`. This command will `init` every single controller, worker, and service. If we made changes to the infrastructure, such as new IP addresses, we will need to check `.dna/project.config` to be sure that the configuration is correct.
Before starting the Cluster Recover procedure, you have to move all your workers to the same region, because it is not possible to restore a cluster spread across multiple regions. To complete this step, modify your workers' configuration accordingly.
If the re-initialized CJE cluster remains unhealthy, we can try to recover the cluster with the `cje prepare cluster-recover` operation. Follow these steps:
- In some cases you need to follow the steps in the "Destroy a Cluster preserving old data" section before continuing.
- Rename the `.dna` folder:

```
mv .dna .dna_to_restore
```

- Execute the `cje prepare cluster-recover` operation.
- Edit the `cluster-recover.config` file:
- set `recovery_mode = full` (the default) to create a new cluster based on the configuration referenced by `dna_path`
- set `recovery_mode = repair` if the cluster hasn't been destroyed OR if this is a destroyed AWS cluster with preserved old data (S3 and EBS)
- set `dna_path` to the renamed configuration directory

```
[pse]
## Recovery mode
# By default (recovery_mode = full), the recovery process creates a new cluster based on the specified configuration
# (see dna_path parameter below)
# To attempt a cluster repair (re-creating/updating missing components), set the recovery_mode = repair.
# Note that depending on the state of the cluster, the repair mode might not be successful.
recovery_mode = full

## Cluster configuration directory path to recover
# [required] path relative to the PROJECT directory
dna_path=PATH_TO_THE_PROJECT/.dna_to_restore
```
If everything goes OK, the cluster will be up and running.
If the cluster recovery fails, we can try to recreate the cluster from scratch while preserving the CJE-OC data, CJE-MM data, ES data, and all the other configuration made before. To do that, we destroy the cluster preserving the data and then create it again. Follow these steps:
- Follow the steps in the "Destroy a Cluster preserving old data" section
- The cluster will be destroyed, and a new folder named like `.dna-destroyed-20171124T212834Z` will be created
- Prepare a new cluster initialization with `cje prepare cluster-init`
- Copy the configuration from the initial `operations/20170508T115907Z-cluster-init/config` to the new `cluster-init.config`, or edit the file with the original values
After that, the cluster will be created from scratch reusing the data; you can then log in to CJE-OC and start all the CJE-MM again.
In some cases it is necessary to destroy the cluster infrastructure while preserving old data (CJE-OC, CJE-MM, ES, …):
- Destroy the cluster with `cje prepare cluster-destroy`
- Edit the `cluster-destroy.config` file and ONLY uncomment the line with the name of the cluster: `cluster_name = support-cluster`

```
[pse]
## Cluster name
# Uncomment the next line after checking this is really the cluster you want to destroy
# cluster_name = support-cluster

## Destroy storage bucket
# Uncomment to destroy
# destroy_storage_bucket = yes

## Destroy EBS resources (This option destroys long-term storage of the cluster. It CANNOT be recovered)
# This is done as a best effort. Some resources may not be deletable (pending snapshots, in-use volumes) and their ids
# will be reported.
# Uncomment to destroy
# destroy_ebs_resources = yes
```