After noticing that the upstream OIDC discovery calls can hang
indefinitely, I had tried to impose a one-minute timeout on them
by giving them a timeout context. However, I hadn't noticed that the
context also gets passed into the JWKS fetching object, which gets
added to our cache and used later. As a result, the timeout context
ended up in the cache and expired while sitting there, causing later
JWKS fetches to fail.
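Roughly, the broken pattern looked like this (a simplified sketch assuming the coreos/go-oidc client and a hypothetical `jwksURL`/cache; not the actual Pinniped code):

```go
package upstreamoidc // hypothetical package name for this sketch

import (
	"context"
	"time"

	"github.com/coreos/go-oidc/v3/oidc"
)

// newCachedKeySet sketches the buggy pattern: the one-minute context is meant
// to bound the discovery call, but it is also captured by the remote key set
// that later gets put in the cache.
func newCachedKeySet(issuerURL, jwksURL string) (*oidc.Provider, oidc.KeySet, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)

	provider, err := oidc.NewProvider(ctx, issuerURL) // the discovery call we wanted to bound
	if err != nil {
		cancel()
		return nil, nil, err
	}

	// BUG: the key set holds on to ctx, whose deadline keeps ticking while the
	// key set sits in the cache. After a minute, every later JWKS fetch made
	// through this cached object fails with "context deadline exceeded".
	// (cancel is never called on the success path, which is part of the leak.)
	keySet := oidc.NewRemoteKeySet(ctx, jwksURL)
	return provider, keySet, nil
}
```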
This commit is trying again to impose a reasonable timeout on these
discovery and JWKS calls, but this time by using http.Client's Timeout
field, which is documented to be a timeout for *each* request/response
cycle, so hopefully this is a more appropriate way to impose a timeout
for this use case. The http.Client instance ends up in the cache on
the JWKS fetcher object, so the timeout should apply to each JWKS
request as well.
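Under the same assumptions, the fix sketched with http.Client's Timeout looks roughly like this (again, not the actual Pinniped code):

```go
package upstreamoidc

import (
	"context"
	"net/http"
	"time"

	"github.com/coreos/go-oidc/v3/oidc"
)

// newCachedKeySetWithClientTimeout moves the timeout from a context onto the
// http.Client, so it applies independently to each request/response cycle,
// including JWKS fetches made much later through the cached key set.
func newCachedKeySetWithClientTimeout(issuerURL, jwksURL string) (*oidc.Provider, oidc.KeySet, error) {
	httpClient := &http.Client{Timeout: time.Minute}

	// oidc.ClientContext attaches the client to the context so the library
	// uses it for discovery and for every later JWKS request.
	ctx := oidc.ClientContext(context.Background(), httpClient)

	provider, err := oidc.NewProvider(ctx, issuerURL)
	if err != nil {
		return nil, nil, err
	}

	// The cached key set still keeps ctx (and therefore the http.Client), but
	// there is no deadline left to expire while it sits in the cache.
	keySet := oidc.NewRemoteKeySet(ctx, jwksURL)
	return provider, keySet, nil
}
```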
Requests that can hang forever are effectively a server-side resource
leak, which could theoretically be exploited in a denial-of-service
attempt, so it would be nice to avoid them.
- Add new optional ytt params for the Supervisor deployment.
- When the Supervisor is making calls to an upstream OIDC provider,
use these variables if they were provided.
- These settings are integration tested in the main CI pipeline by
setting them on certain deployments and then letting the existing
integration tests (e.g. TestE2EFullIntegration) provide the coverage,
so there are no explicit changes to the integration tests themselves
in this commit.
This is a partial cherry-pick of 5240f5e84a. The token expirations are unchanged, but the garbage collection lifetime is now matched to them so that garbage collection does not break the refresh flow.
This is a backport to fix https://github.com/vmware-tanzu/pinniped/issues/601 on the v0.4.x release line.
Signed-off-by: Matt Moyer <moyerm@vmware.com>
Sometimes the container runtime detects the exiting PID 1 very quickly and shuts down the entire container while the `killall` process is still running.
When this happens, we see it as exit code 137 (SIGKILL).
This never failed for me in Kind locally, but fails pretty often in CI (probably due to timing differences).
Signed-off-by: Matt Moyer <moyerm@vmware.com>
This causes a CI failure because I modified the generated directory manually. These versions don't really matter because they will be overridden by the parent go.mod file anyway.
Signed-off-by: Matt Moyer <moyerm@vmware.com>
This change adjusts the kube-cert-agent "deleter" controller to also delete pods that are unusable because they are no longer "Running".
This should make the Concierge recover from scenarios where clusters are suspended and resumed, as well as other edge cases where the `sleep` process in the agent pod exits for some reason.
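For illustration, the extra deletion criterion amounts to something like this (a hypothetical helper using client-go types, not the actual controller code):

```go
package kubecertagent // hypothetical package name for this sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// agentPodIsUnusable is a hypothetical helper capturing the new rule: an agent
// pod is only useful while it is Running, so a pod in any other phase (e.g.
// Failed after a cluster suspend/resume) should be deleted so the controller
// can recreate it.
func agentPodIsUnusable(pod *corev1.Pod) bool {
	return pod.Status.Phase != corev1.PodRunning
}
```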
Signed-off-by: Matt Moyer <moyerm@vmware.com>
This required some small adjustments to the production code to make it more feasible to test.
The new test takes an existing agent pod and terminates the `sleep` process, causing the pod to go into an `Error` status.
The agent controllers _should_ respond to this by deleting and recreating that failed pod, but the current code just gets stuck.
This is meant to replicate the situation when a cluster is suspended and resumed, which also causes the agent pod to be in this terminal error state.
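The shape of the new test's wait loop is roughly like this (hypothetical helper and label selector; the real test differs):

```go
package integration

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForAgentPodToBeRecreated is a hypothetical sketch: after the sleep
// process in the original agent pod is killed, poll until a replacement agent
// pod (with a different name) reaches the Running phase.
func waitForAgentPodToBeRecreated(t *testing.T, client kubernetes.Interface, namespace, originalName string) {
	require.Eventually(t, func() bool {
		pods, err := client.CoreV1().Pods(namespace).List(context.Background(), metav1.ListOptions{
			LabelSelector: "kube-cert-agent.pinniped.dev=true", // hypothetical label selector
		})
		if err != nil {
			return false
		}
		for _, pod := range pods.Items {
			if pod.Name != originalName && pod.Status.Phase == corev1.PodRunning {
				return true // a replacement agent pod is up and running
			}
		}
		return false
	}, 2*time.Minute, time.Second)
}
```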
Signed-off-by: Matt Moyer <moyerm@vmware.com>
The group claims read from the session cache file are loaded as `[]interface{}` (a slice of empty interfaces), so the previous `groups, _ := idTokenClaims[oidc.DownstreamGroupsClaim].([]string)` type assertion always left `groups` nil.
The solution I tried here was to convert the expected value to also be `[]interface{}` so that `require.Equal(t, ...)` does the right thing.
This bug only showed up in our acceptance environment against Okta, since we don't have any other integration test coverage with IDPs that pass a groups claim.
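A minimal reproduction of the comparison issue and the fix (sketch only; the "groups" key stands in for oidc.DownstreamGroupsClaim):

```go
package example

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// TestGroupsClaimComparison sketches the issue: JSON-decoded claims come back
// as generic types, so a []string type assertion silently fails.
func TestGroupsClaimComparison(t *testing.T) {
	idTokenClaims := map[string]interface{}{
		"groups": []interface{}{"group1", "group2"}, // as loaded from the session cache file
	}

	// The old assertion always left groups nil.
	groups, ok := idTokenClaims["groups"].([]string)
	require.False(t, ok)
	require.Nil(t, groups)

	// The fix: make the expected value a []interface{} too, so require.Equal
	// compares like with like.
	require.Equal(t, []interface{}{"group1", "group2"}, idTokenClaims["groups"])
}
```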
Signed-off-by: Matt Moyer <moyerm@vmware.com>
We were seeing a race in this test code since require.NoError() and
require.Eventually() would write to the same testing.T state on separate
goroutines. Hopefully this helper function covers the cases where we want
to require.NoError() inside a require.Eventually() without causing a race.
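One way to shape such a helper (a sketch with a hypothetical name, not necessarily the exact helper added here) is to poll on the calling goroutine, for example with wait.PollImmediate, so only that goroutine ever touches testing.T:

```go
package testlib

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"k8s.io/apimachinery/pkg/util/wait"
)

// requireEventuallyWithoutError polls f on the calling goroutine and only
// calls require once, at the end, so nothing writes to testing.T from another
// goroutine. f returns (done, err) in the style of wait.ConditionFunc.
func requireEventuallyWithoutError(t *testing.T, f func() (bool, error), waitFor, tick time.Duration) {
	t.Helper()
	require.NoError(t, wait.PollImmediate(tick, waitFor, f))
}
```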
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
Co-authored-by: Margo Crawford <margaretc@vmware.com>
Co-authored-by: Monis Khan <i@monis.app>
This change updates our clients to always set an owner ref (see the sketch below) when:
1. The operation is a create
2. The object does not already have an owner ref set
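For illustration, the rule reduces to something like this (a hypothetical helper using apimachinery's metav1 types, not the actual client middleware):

```go
package kubeclient // hypothetical package name for this sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// maybeSetOwnerRef is a hypothetical helper showing the rule applied on
// create: attach the owner reference only when the object does not already
// have one, leaving any existing owner refs untouched.
func maybeSetOwnerRef(obj metav1.Object, ref metav1.OwnerReference) {
	if len(obj.GetOwnerReferences()) == 0 {
		obj.SetOwnerReferences([]metav1.OwnerReference{ref})
	}
}
```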
Signed-off-by: Monis Khan <mok@vmware.com>
See comment. This is at least a first step to make our GKE acceptance
environment greener. Previously, this test assumed that the Pinniped-under-test
had been deployed in (roughly) the last 10 minutes, which is not an assumption
that we make anywhere else in the integration test suite.
Signed-off-by: Andrew Keesler <akeesler@vmware.com>