Changes made to both components:
1. Logs are always flushed on process exit
2. Informer cache sync can no longer hang process start up forever
Changes made to concierge:
1. Add pre-shutdown hook that waits for controllers to exit cleanly
2. Informer caches are synced in post-start hook
Changes made to supervisor:
1. Add shutdown code that waits for controllers to exit cleanly
2. Add shutdown code that waits for active connections to become idle
Waiting for controllers to exit cleanly is critical as this allows
the leader election logic to release the lock on exit. This reduces
the time needed for the next leader to be elected.
Signed-off-by: Monis Khan <mok@vmware.com>
The kubelet will send the SIGTERM signal when it wants a process to
exit. After a grace period, it will send the SIGKILL signal to
force the process to terminate. The concierge has always handled
both SIGINT and SIGTERM as indicators for it to gracefully exit
(i.e. stop watches, controllers, etc). This change updates the
supervisor to do the same (previously it only handled SIGINT). This
is required to allow the leader election lock release logic to run.
Otherwise it can take a few minutes for new pods to acquire the
lease since they believe it is already held.
Signed-off-by: Monis Khan <mok@vmware.com>
For clusters where the control plane nodes aren't running a CNI, the
kube-cert-agent pods deployed by concierge cannot be scheduled as they
don't know to use `hostNetwork: true`. This change allows embedding the
host network setting in the Concierge configuration. (by copying it from
the kube-controller-manager pod spec when generating the kube-cert-agent
Deployment)
Also fixed a stray double comma in one of the nearby tests.
Instead of blindly waiting long enough for a disruptive change to
have been observed by the old leader and followers, we instead rely
on the approximation that checkOnlyLeaderCanWrite provides - i.e.
only a single actor believes they are the leader. This does not
account for clients that were in the followers list before and after
the disruptive change, but it serves as a reasonable approximation.
Signed-off-by: Monis Khan <mok@vmware.com>
Those images that are pulled from Dockerhub will cause pull failures
on some test clusters due to Dockerhub rate limiting.
Because we already have some images that we use for testing, and
because those images are already pre-loaded onto our CI clusters
to make the tests faster, use one of those images and always specify
PullIfNotPresent to avoid pulling the image again during the integration
test.