Castle restarts periodically for no apparent reason

Issue

The cluster is functional and operational, but the Marathon UI shows the castle service as continuously deploying and marked as unhealthy. This happens for every castle process in the environment.

Environment

Resolution

This behavior is usually caused by a communication problem between the Marathon service running on the controllers and the castle processes running on the master workers. To confirm this hypothesis, look in the controller syslogs for messages like the one shown below:

marathon[XXX]: [2019-x-xx:xx] INFO Received health result for app [/jce/castle] version [2019-x-xx:xx]: [Unhealthy(jce_castle.xxx-x-xx-xx-x,2019-x-xx:xx,AskTimeoutException: Ask timed out on [Actor[akka://marathon/user/IO-HTTP#xx]] after [20000 ms],2019-x-xx:xx)] (mesosphere.marathon.health.HealthCheckActor:marathon-akka.actor.default-dispatcher-34)

Because the health check times out, Marathon kills the container hosting the castle service and schedules a new one.
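The syslog check described above can be scripted. The following is a minimal sketch, assuming the log lines follow the format of the sample message; the function name find_health_timeouts and the regular expression are illustrative and not part of Marathon.

```python
import re

# Assumption: health-result lines look like the sample message in this article.
# The pattern captures the app id and requires an Unhealthy result that
# mentions AskTimeoutException.
UNHEALTHY_TIMEOUT = re.compile(
    r"Received health result for app \[(?P<app>[^\]]+)\].*"
    r"Unhealthy\(.*AskTimeoutException"
)


def find_health_timeouts(lines, app="/jce/castle"):
    """Return the log lines that report a timed-out health check for `app`."""
    hits = []
    for line in lines:
        match = UNHEALTHY_TIMEOUT.search(line)
        if match and match.group("app") == app:
            hits.append(line.rstrip("\n"))
    return hits
```

It could be run on a controller by passing it the lines of the syslog file, for example find_health_timeouts(open(path)) where path is wherever syslog is written on that host. A non-empty result supports the hypothesis that the health checks are timing out.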

To solve this issue, make sure that the ports required by the health check are open, so that the controllers can reach the workers on port 31080, which is the port where the castle service is accessible.
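To verify connectivity from a controller, a plain TCP connection attempt to port 31080 on each worker is usually enough. The sketch below is only an illustration: the helper name can_reach and the 5-second timeout are assumptions, and a successful TCP connection shows that the port is open, not that the health endpoint itself responds correctly.

```python
import socket


def can_reach(host, port=31080, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run from a controller, can_reach("worker-1.example.internal") would return False while a firewall blocks the port; the hostname here is hypothetical and should be replaced with the address of each master worker.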

Tested product/plugin versions

References
