After noticing that the upstream OIDC discovery calls can hang
indefinitely, I had tried to impose a one minute timeout on them
by giving them a timeout context. However, I hadn't noticed that the
context also gets passed into the JWKS fetching object, which gets
added to our cache and used later. Therefore the timeout context
was added to the cache and timed out while sitting in the cache,
causing later JWKS fetchers to fail.
This commit is trying again to impose a reasonable timeout on these
discovery and JWKS calls, but this time by using http.Client's Timeout
field, which is documented to be a timeout for *each* request/response
cycle, so hopefully this is a more appropriate way to impose a timeout
for this use case. The http.Client instance ends up in the cache on
the JWKS fetcher object, so the timeout should apply to each JWKS
request as well.
Requests that can hang forever are effectively a server-side resource
leak, which could theoretically be taken advantage of in a denial of
service attempt, so it would be nice to avoid having them.
- Add new optional ytt params for the Supervisor deployment.
- When the Supervisor is making calls to an upstream OIDC provider,
use these variables if they were provided.
- These settings are integration tested in the main CI pipeline by
sometimes setting them on deployments in certain cases, and then
letting the existing integration tests (e.g. TestE2EFullIntegration)
provide the coverage, so there are no explicit changes to the
integration tests themselves in this commit.
- Automatically try to fall back to using StartTLS when using TLS
doesn't work. Only complain when both don't work.
- Remember (in-memory) which one worked and keeping using that one
in the future (unless the pod restarts).
- This enhances our LDAP client code to make it possible to optionally
dial an LDAP server without TLS and then use StartTLS to upgrade
the connection to TLS.
- The controller for LDAPIdentityProviders is not using this option
yet. That will come in a future commit.
Reflect the upstream group membership into the Supervisor's
downstream tokens, so they can be added to the user's
identity on the workload clusters.
LDAP group search is configurable on the
LDAPIdentityProvider resource.
This change makes it easier to understand misconfigurations caused
by issuers with extraneous trailing slashes.
Signed-off-by: Mo Khan <mok@vmware.com>
Also force the LDAP server pod to restart whenever the LDIF file
changes, so whenever you redeploy the tools deployment with a new test
user password the server will be updated.
Unfortunately, Secrets do not seem to have a Generation field, so we
use the ResourceVersion field instead. This means that any change to
the Secret will cause us to retry the connection to the LDAP server,
even if the username and password fields in the Secret were not
changed. Seems like an okay trade-off for this early draft of the
controller compared to a more complex implementation.
This early version of the controller is not intended to act as an
ongoing health check for your upstream LDAP server. It will connect
to the LDAP server to essentially "lint" your configuration once.
It will do it again only when you change your configuration. To account
for transient errors, it will keep trying to connect to the server
until it succeeds once.
This commit does not include looking for changes in the associated bind
user username/password Secret.
- The ldap_upstream_watcher.go controller validates the bind secret and
uses the Conditions to report errors. Shares some condition reporting
logic with its sibling controller oidc_upstream_watcher.go, to the
extent which is convenient without generics in golang.
- When the upstream IDP is an LDAP IDP and the user's LDAP username and
password are received as new custom headers, then authenticate the
user and, if authentication was successful, return a redirect with
an authcode. Handle errors according to the OAuth/OIDC specs.
- Still does not support having multiple upstream IDPs defined at the
same time, which was an existing limitation of this endpoint.
- Does not yet include the actual LDAP authentication, which is
hidden behind an interface from the point of view of auth_handler.go
- Move the oidctestutil package to the testutil directory.
- Add an interface for Fosite storage to avoid a cyclical test
dependency.
- Add GetURL() to the UpstreamLDAPIdentityProviderI interface.
- Extract test helpers to be shared between callback_handler_test.go
and auth_handler_test.go because the authcode and fosite storage
assertions should be identical.
- Backfill Content-Type assertions in callback_handler_test.go.
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
All controller unit tests were accidentally using a timeout context
for the informers, instead of a cancel context which stays alive until
each test is completely finished. There is no reason to risk
unpredictable behavior of a timeout being reached during an individual
test, even though with the previous 3 second timeout it could only be
reached on a machine which is running orders of magnitude slower than
usual, since each test usually runs in about 100-300 ms. Unfortunately,
sometimes our CI workers might get that slow.
This sparked a review of other usages of timeout contexts in other
tests, and all of them were increased to a minimum value of 1 minute,
under the rule of thumb that our tests will be more reliable on slow
machines if they "pass fast and fail slow".
These are prone to breaking when stdr is upgraded because they rely on the exact ordering of keys in the log message. If we have more problems we can rewrite the assertions to be more robust, but for this time I'm just fixing them to match the new output.
Signed-off-by: Matt Moyer <moyerm@vmware.com>
Yes, this is a huge commit.
The middleware allows you to customize the API groups of all of the
*.pinniped.dev API groups.
Some notes about other small things in this commit:
- We removed the internal/client package in favor of pkg/conciergeclient. The
two packages do basically the same thing. I don't think we use the former
anymore.
- We re-enabled cluster-scoped owner assertions in the integration tests.
This code was added in internal/ownerref. See a0546942 for when this
assertion was removed.
- Note: the middlware code is in charge of restoring the GV of a request object,
so we should never need to write mutations that do that.
- We updated the supervisor secret generation to no longer manually set an owner
reference to the deployment since the middleware code now does this. I think we
still need some way to make an initial event for the secret generator
controller, which involves knowing the namespace and the name of the generated
secret, so I still wired the deployment through. We could use a namespace/name
tuple here, but I was lazy.
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
Co-authored-by: Ryan Richard <richardry@vmware.com>
This change updates our clients to always set an owner ref when:
1. The operation is a create
2. The object does not already have an owner ref set
Signed-off-by: Monis Khan <mok@vmware.com>