Build worker is stuck and unable to provision new agents

Issue

One of the build workers is stuck and seems to have performance issues, and every time Palace tries to provision new agents on it, the provisioning fails.

When reviewing the Docker containers running on the worker (sudo docker ps -a), you would expect most of the containers to have been created by Mesos (named like mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx).

However, it appears that some containers are not being managed by Mesos:

$ sudo docker ps -a
CONTAINER ID        IMAGE                               COMMAND                  CREATED             STATUS                           PORTS               NAMES
a910d8477f96        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   28 minutes ago      Exited (137) 27 minutes ago                          mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.aed981c6-62e0-4db3-94c8-c2d3ea220d72
65af2f5cba92        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   32 minutes ago      Exited (137) 26 minutes ago                          mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.b743aa54-a2e8-4339-8f10-6d7e46c77339
76114a6a5d2c        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   50 minutes ago      Exited (137) 49 minutes ago                          mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.c75aebc9-308d-490b-9f12-b37ee8cd90b4
f83f11265120        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   53 minutes ago      Exited (137) 47 minutes ago                          mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.7090eada-fb90-4b3a-a7fb-78f56d35565b
bb22cf792328        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   About an hour ago   Exited (137) About an hour ago                       mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.e1ff903d-e335-469a-8011-9cac9620c59a
f62c089180c4        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   About an hour ago   Exited (137) About an hour ago                       mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.323d02d6-9c3a-4ce2-ad77-b3551d4a54fc
b6dd779b2ef5        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   About an hour ago   Exited (137) About an hour ago                       mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.97acbad2-2303-47df-a0a2-05832c1e0885
c7dd2c1b714c        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   About an hour ago   Exited (137) About an hour ago                       mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.543454f6-97c7-4004-b238-0acfe428a4db
86a5ba7ac9c6        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   About an hour ago   Exited (137) About an hour ago                       mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.c9450704-33c1-4272-8982-48bb7623a829
1dd4e5ef0994        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   2 hours ago         Exited (137) 2 hours ago                             mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.aefa9c68-45f8-477d-a374-8dae4129a841
3f7985251de0        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   2 hours ago         Exited (137) About an hour ago                       mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.078659f5-0a5b-4758-8f93-a784f6a5fdeb
660772aed51a        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   4 hours ago         Exited (137) 4 hours ago                             mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.8ffc0079-7975-479a-aebc-92c5d287619a
e94b99d151fd        cloudbees/java-with-docker-client   "/bin/sh -c 'java ..."   6 hours ago         Exited (137) 5 hours ago                             mesos-85460300-7b56-4e8d-949e-a72d6814ebdd-S6.23d8b6f9-ec99-4f20-aff5-fc54926b9a71
a7e92e0f6058        67cb15258c17                        "/bin/sh -c '#(nop..."   24 hours ago        Created                                              brave_mccarthy
5ed5b62c7b0f        435fc2058f58                        "/bin/sh -c '#(nop..."   24 hours ago        Created                                              distracted_fermi
409a0d39b902        789d48722b82                        "/bin/sh -c '#(nop..."   24 hours ago        Created                                              condescending_leakey
dd7413a8d1d8        5722f20670a9                        "/bin/sh -c '#(nop..."   24 hours ago        Created                                              optimistic_heyrovsky
9f26c9b067a2        79b99ffe38a3                        "/bin/sh -c '#(nop..."   24 hours ago        Created                                              nostalgic_bassi
f92cf5447cd0        ccb47b1244b0                        "/bin/sh -c '#(nop..."   24 hours ago        Created                                              laughing_mirzakhani
cc9571b14478        b77c37369d92                        "/bin/sh -c 'echo ..."   5 days ago          Exited (0) 5 days ago                                eager_brattain
83a6d5a6ff5a        c83e1bad0b55                        "/bin/sh -c '#(nop..."   5 days ago          Created                                              eloquent_jang

Note: in this particular case the problematic containers (brave_mccarthy, distracted_fermi, …) were created by the Docker Pipeline plugin, which allows running containers from a Pipeline.

In that situation, Mesos is not aware that those containers are consuming the worker's resources, so it still believes there are enough resources to run new agents when in fact there are not.
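A quick way to spot such containers is to list all container names and filter out the Mesos-managed ones. The command below is only a minimal sketch; it assumes the naming convention shown above, where Mesos-managed containers start with the mesos- prefix:

$ sudo docker ps -a --format '{{.Names}}' | grep -v '^mesos-'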

Environment

Resolution

There are three different things that can be done to prevent or alleviate this situation:

1) Change the algorithm used by Palace to provision agents

By default, Palace uses the BinPack algorithm, as described in our documentation (see Palace):

Palace uses the BinPack algorithm by default to provision single-use agents. In clusters with more than one worker of type build, you may notice unpredictable job run times. This happens because many agents may be packed into one worker and share the worker CPU, so that worker instance is heavily overloaded, while others have little or no load.

An alternative is the LeastLoaded algorithm, which may work better when there is enough capacity that the build worker nodes are not fully allocated.

In order to set the new algorithm, you have to:

  • Add the CONSTRAINTS property to $CJE_PROJECT/.dna/servers/palace/marathon.json by editing the file and replacing the following:
[...]
    "MESOS_CREDENTIALS": "{{ mesos.credentials }}"
[...]

with this (make sure that you don’t miss the comma , at the end of the MESOS_CREDENTIALS line):

[...]
    "MESOS_CREDENTIALS": "{{ mesos.credentials }}",
    "CONSTRAINTS": "8"
[...]
  • Apply the changes to the configuration by running:
cje upgrade-project --apply-template
  • Run dna stop palace and dna start palace
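
To confirm the new environment variable was picked up after the restart, you can inspect the environment of the running Palace container. The commands below are only a sketch: they assume the word "palace" appears in the Palace container's name or image (which may differ in your environment), and <container-id> is a placeholder for the actual container ID:

$ sudo docker ps | grep palace
$ sudo docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' <container-id> | grep CONSTRAINTS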

Note: You will need to repeat this process every time you upgrade CJE to a new release.

2) Use DinD base images to run the jobs which require creating new Docker containers

Following this approach, the problem is avoided in two different ways:

  • Containers created outside of Mesos are destroyed once the build ends. Because those Docker containers run inside the DinD container, everything inside it is removed together with it.
  • Non-Mesos containers are prevented from running on the build worker's Docker daemon. Since DinD containers don't need to mount docker.sock, it can be removed from the Docker Agent Template configuration, so there is no way for a build to connect to it.

Pipeline jobs will therefore still be able to run those containers, but inside the DinD container rather than on the build worker.
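
The difference can be illustrated with two simplified docker run invocations. These are only sketches of the concept; the image name some-agent-image and the options shown are illustrative and do not reflect the actual CJE Docker Agent Template configuration:

# Agent that shares the worker's Docker daemon: containers started by a build
# land on the worker itself and are invisible to Mesos
$ docker run -v /var/run/docker.sock:/var/run/docker.sock some-agent-image

# DinD-style agent: a nested Docker daemon runs inside the (privileged) agent
# container, so anything the build starts is removed together with the agent
$ docker run --privileged docker:dind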

Please have a look at Set up a Docker in Docker Agent Template to understand how DinD works and how to configure it in CJE.

3) Periodic clean-up process

Configure a cron job on each build worker that monitors Docker activity and removes containers that were created 2 or 3 days ago and are no longer running.
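
As a minimal sketch, a daily cron entry could prune stopped containers older than 48 hours. The file path /etc/cron.d/docker-cleanup, the schedule, and the docker binary path are assumptions for illustration; adjust the retention period to your needs and note that docker container prune removes all stopped containers matching the filter:

# /etc/cron.d/docker-cleanup (illustrative path and schedule; docker binary path may differ)
# Every day at 03:00, remove stopped containers created more than 48 hours ago
0 3 * * * root /usr/bin/docker container prune --force --filter "until=48h"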
