Before this fix, the deadlock would prevent the leader pod from giving
up its lease, which would make it take several minutes for new pods to
be allowed to elect a new leader. During that time, no Pinniped
controllers could write to the Kube API, so important resources were not
being updated during that window. It would also make pod shutdown take
about 1 minute.
After this fix, the leader gives up its lease immediately, and pod
shutdown takes about 1 second. This improves restart/upgrade time and
also fixes the problem where there was no leader for several minutes
after a restart/upgrade.
The deadlock was between the post-start hook and the pre-shutdown hook.
The pre-shutdown hook blocked until a certain background goroutine in
the post-start hook finished, but that goroutine could not finish until
the pre-shutdown hook finished. Thus, they were both blocked, waiting
for each other infinitely. Eventually the process would be externally
killed.
This deadlock was most likely introduced by some change in Kube's
generic api server package related to how the many complex channels used
during server shutdown interact with each other, and was not noticed
when we upgraded to the version which introduced the change.
To make the subject of the downstream ID token more unique when
there are multiple IDPs. It is possible to define two IDPs in a
FederationDomain using the same identity provider CR, in which
case the only thing that would make the subject claim different
is adding the IDP display name into the values of the subject claim.
- Remove that validation from the controller since the CRD already
validates it during creates and updates.
- Also finish the supervisor_federationdomain_status_test.go by adding
more tests for both controller validations and CRD validations
- Also fix small bug in controller where it used Sprintf wrong
- Rename WaitForTestFederationDomainStatus test helper to
WaitForFederationDomainStatusPhase
Also changes the transformation pipeline code to sort and uniq
the transformed group names at the end of the pipeline. This makes
the results more predicable without changing the semantics.
- Avoid a possible race condition where the status says "Ready" but
the endpoints take another moment to become available, potentially
casing a fast client to get a 404 after observing that the status
is "Ready" and then immediately trying to use the endpoints.
Co-authored-by: Benjamin A. Petersen <ben@benjaminapetersen.me>
- adds the truthy condition
- TODOs for falsy conditions
- addiional notes for other conditions
- tests updated to pass with the new condition
Co-authored-by: Ryan Richard <richardry@vmware.com>