The cluster is up and running, but the Marathon UI shows the castle service as continuously deploying and marked as unhealthy. This happens for every castle process in the environment.
- CloudBees Jenkins Enterprise Anywhere
This behavior usually points to a communication problem between the Marathon service running on the controllers and the castle processes running on the master workers. To confirm this hypothesis, look in the controller syslog for messages like the following:
marathon[XXX]: [2019-x-xx:xx] INFO Received health result for app [/jce/castle] version [2019-x-xx:xx]: [Unhealthy(jce_castle.xxx-x-xx-xx-x,2019-x-xx:xx,AskTimeoutException: Ask timed out on [Actor[akka://marathon/user/IO-HTTP#xx]] after [20000 ms],2019-x-xx:xx)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-34)
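To make this check repeatable, the syslog can be scanned for such messages. The following is a minimal Python sketch, not an official CloudBees tool: the function name `find_castle_health_timeouts` is hypothetical, the syslog path is supplied by the caller because its location depends on the controller's distribution, and the pattern is only derived from the sample message above.

```python
import re
import sys

# Matches Marathon health results that mark a castle app as Unhealthy
# because the health check request timed out (AskTimeoutException).
# The pattern is derived from the sample message; adjust it if your logs differ.
PATTERN = re.compile(
    r"Received health result for app \[(?P<app>/[^\]]*castle[^\]]*)\].*"
    r"Unhealthy\(.*AskTimeoutException"
)


def find_castle_health_timeouts(lines):
    """Return the log lines that report a timed-out castle health check."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    # Usage (path is an assumption): python check_castle_log.py /var/log/syslog
    with open(sys.argv[1], errors="replace") as log:
        for hit in find_castle_health_timeouts(log):
            print(hit)
```

If the script prints matching lines, the health checks are timing out, which is consistent with the communication problem described above.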
Because the health check fails, Marathon kills the container hosting the castle service and schedules a new one, which is why the service appears to be continuously deploying.
To resolve this issue, make sure that the ports required by the health check are open. In particular, the controllers must be able to reach the workers on port 31080, which is the port on which the castle service is accessible.
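One way to verify reachability is to attempt a TCP connection from a controller to each worker on port 31080. The sketch below is illustrative only: `can_reach` is a hypothetical helper, the worker host names are passed on the command line, and a successful TCP connection only shows that the port is open, not that the castle health endpoint itself responds correctly.

```python
import socket
import sys

CASTLE_PORT = 31080  # port named in this article for the castle service


def can_reach(host, port=CASTLE_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Usage (host names are placeholders): python check_port.py worker-1 worker-2
    for worker in sys.argv[1:]:
        status = "reachable" if can_reach(worker) else "NOT reachable"
        print(f"{worker}:{CASTLE_PORT} {status}")
```

Run it from a controller node, since the health check originates there. Any worker reported as not reachable likely has port 31080 blocked by a firewall or security group rule.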
- CloudBees Jenkins Enterprise 1.11.X Anywhere