After noticing that the upstream OIDC discovery calls can hang
indefinitely, I had tried to impose a one-minute timeout on them
by giving them a timeout context. However, I hadn't noticed that the
context also gets passed into the JWKS fetching object, which gets
added to our cache and used later. As a result, the timeout context
expired while sitting in the cache, causing later JWKS fetches
to fail.
This commit tries again to impose a reasonable timeout on these
discovery and JWKS calls, this time by using http.Client's Timeout
field, which is documented to apply to *each* request/response
cycle, so hopefully this is a more appropriate way to impose a timeout
for this use case. The http.Client instance ends up in the cache on
the JWKS fetcher object, so the timeout should apply to each JWKS
request as well.
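As a rough sketch of the approach (assuming the go-oidc library and an
illustrative issuer URL, not the exact Pinniped code), the timeout lives
on the http.Client, which is then reused for discovery and for the
remote JWKS:

    package main

    import (
        "context"
        "log"
        "net/http"
        "time"

        "github.com/coreos/go-oidc/v3/oidc"
    )

    func main() {
        // The timeout applies to each request/response cycle made with
        // this client, rather than expiring once like a context deadline.
        httpClient := &http.Client{Timeout: time.Minute}

        // go-oidc reads the client out of the context and uses it for the
        // discovery request and for the remote JWKS it fetches later.
        ctx := oidc.ClientContext(context.Background(), httpClient)

        provider, err := oidc.NewProvider(ctx, "https://example.com/issuer")
        if err != nil {
            log.Fatalf("discovery failed: %v", err)
        }
        _ = provider
    }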
Requests that can hang forever are effectively a server-side resource
leak, which could theoretically be taken advantage of in a denial of
service attempt, so it would be nice to avoid having them.
- Add new optional ytt params for the Supervisor deployment.
- When the Supervisor is making calls to an upstream OIDC provider,
use these variables if they were provided.
- These settings are integration tested in the main CI pipeline by
setting them on deployments in certain cases and then letting the
existing integration tests (e.g. TestE2EFullIntegration) provide the
coverage, so there are no explicit changes to the integration tests
themselves in this commit.
This is a partial cherry-pick of 5240f5e84a. The token expirations are unchanged, but the garbage collection lifetime is now matched to avoid garbage collection breaking the refresh flow.
This is a backport to fix https://github.com/vmware-tanzu/pinniped/issues/601 on the v0.4.x release line.
Signed-off-by: Matt Moyer <moyerm@vmware.com>
This change adjusts the kube-cert-agent "deleter" controller to also delete pods that are unusable because they are no longer "Running".
This should make the Concierge recover from scenarios where clusters are suspended and resumed, as well as other edge cases where the `sleep` process in the agent pod exits for some reason.
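Roughly, the broadened check looks like the following (a hypothetical
helper name, not the exact controller code):

    package example

    import corev1 "k8s.io/api/core/v1"

    // An agent pod that has left the Running phase (e.g. Succeeded or
    // Failed after its sleep process exits) can never become healthy
    // again, so the deleter controller removes it and lets it be
    // recreated.
    func isPodUnusable(pod *corev1.Pod) bool {
        return pod.Status.Phase != corev1.PodRunning
    }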
Signed-off-by: Matt Moyer <moyerm@vmware.com>
This required some small adjustments to the production code to make it more feasible to test.
The new test takes an existing agent pod and terminates the `sleep` process, causing the pod to go into an `Error` status.
The agent controllers _should_ respond to this by deleting and recreating that failed pod, but the current code just gets stuck.
This is meant to replicate the situation when a cluster is suspended and resumed, which also causes the agent pod to be in this terminal error state.
Signed-off-by: Matt Moyer <moyerm@vmware.com>
This change updates our clients to always set an owner ref when:
1. The operation is a create
2. The object does not already have an owner ref set (see the sketch below)
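A rough sketch of that rule (names here are illustrative, not the
actual client wrapper):

    package example

    import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    // maybeSetOwnerRef is called only on the create path: it attaches
    // the shared owner reference, but only when the object being
    // created does not already carry an owner reference of its own.
    func maybeSetOwnerRef(obj metav1.Object, ref metav1.OwnerReference) {
        if len(obj.GetOwnerReferences()) == 0 {
            obj.SetOwnerReferences([]metav1.OwnerReference{ref})
        }
    }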
Signed-off-by: Monis Khan <mok@vmware.com>
- JWKSWriterController
- JWKSObserverController
- FederationDomainSecretsController for HMAC keys
- FederationDomainSecretsController for state signature key
- FederationDomainSecretsController for state encryption key
Signed-off-by: Ryan Richard <richardry@vmware.com>
- Only sync on add/update of secrets in the same namespace which
have the "storage.pinniped.dev/garbage-collect-after" annotation, and
also during a full resync of the informer whenever secrets in the
same namespace with that annotation exist (see the sketch after this list).
- Ignore deleted secrets to avoid having this controller trigger itself
unnecessarily when it deletes a secret. This controller is never
interested in deleted secrets, since its only job is to delete
existing secrets.
- No change to the self-imposed rate limit logic. That still applies
because secrets with this annotation will be created and updated
regularly while the system is running (not just during rare system
configuration steps).
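A sketch of the filter described above (the function shape and names
are hypothetical; the annotation key is the real one):

    package example

    import corev1 "k8s.io/api/core/v1"

    const gcAnnotation = "storage.pinniped.dev/garbage-collect-after"

    // wantsEvent reacts to adds and updates of annotated secrets and
    // ignores deletes entirely, since this controller's only job is to
    // delete existing secrets.
    func wantsEvent(secret *corev1.Secret, deleted bool) bool {
        if deleted {
            return false
        }
        _, ok := secret.Annotations[gcAnnotation]
        return ok
    }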
We stared at this very carefully and we don't think there are any structural changes. Maybe something small happened to get the RNG off by one?
Signed-off-by: Matt Moyer <moyerm@vmware.com>
This implementation is janky because I wanted to make the smallest change
possible to try to get the code back to stable so we can release.
Also deep copy an object so we aren't mutating the cache.
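The deep copy part is roughly this pattern (illustrative object type
and mutation):

    package example

    import corev1 "k8s.io/api/core/v1"

    // Objects returned by informer listers are shared cache entries, so
    // mutate a copy and send the copy to the API server instead.
    func prepareUpdate(cached *corev1.Secret) *corev1.Secret {
        updated := cached.DeepCopy()
        updated.Annotations = map[string]string{"example": "true"}
        return updated
    }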
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
This is a bit more clear. We're making this change now because it is not backwards-compatible, and we can still make it since none of this RFC8693 token exchange stuff has been released yet.
There is also a small typo fix in some flag usages (s/RF8693/RFC8693/)
Signed-off-by: Matt Moyer <moyerm@vmware.com>
Fosite overrides the `Cache-Control` header we set, which is basically fine even though it's not exactly what we want.
Signed-off-by: Matt Moyer <moyerm@vmware.com>
From RFC2616 (https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2):
> It MUST be possible to combine the multiple header fields into one "field-name: field-value" pair,
> without changing the semantics of the message, by appending each subsequent field-value to the first,
> each separated by a comma.
This was correct before, but this simplifies things a bit and shaves a few bytes off the response.
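For illustration only (the real header name and values may differ),
the two forms below are equivalent per the RFC, but the combined form
is a few bytes smaller on the wire:

    package example

    import "net/http"

    // Two separate values for the same header...
    func setSeparately(w http.ResponseWriter) {
        w.Header().Add("Cache-Control", "no-store")
        w.Header().Add("Cache-Control", "no-cache")
    }

    // ...versus one combined, comma-separated value.
    func setCombined(w http.ResponseWriter) {
        w.Header().Set("Cache-Control", "no-store,no-cache")
    }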
Signed-off-by: Matt Moyer <moyerm@vmware.com>
The bug itself has to do with when headers are streamed to the client. Once a wrapped handler has sent any bytes to the `http.ResponseWriter`, the value of the map returned from `w.Header()` no longer matters for the response. The fix is fairly trivial, which is to add those response headers before invoking the wrapped handler.
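A minimal sketch of the corrected wrapper shape (names are
illustrative):

    package example

    import "net/http"

    // withHeaders sets the desired headers before invoking the wrapped
    // handler, because once the wrapped handler writes any response
    // bytes, later changes to w.Header() are ignored.
    func withHeaders(wrapped http.Handler, headers map[string]string) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            for k, v := range headers {
                w.Header().Set(k, v)
            }
            wrapped.ServeHTTP(w, r)
        })
    }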
The existing unit test didn't catch this due to limitations in `httptest.NewRecorder()`. It is now replaced with a new test that runs a full HTTP test server, which catches the previous bug.
Signed-off-by: Matt Moyer <moyerm@vmware.com>
Because the library that we are using, which returns that error,
formats the timestamp in local time, which is LMT when running
on a laptop but UTC when running in CI.
Signed-off-by: Ryan Richard <richardry@vmware.com>