ContainerImage.Pinniped/proposals/1141_audit-logging/README.md

19 KiB
Raw Blame History

title authors status sponsor approval_date
Audit Logging
@cfryanr
accepted

Disclaimer: Proposals are point-in-time designs and decisions. Once approved and implemented, they become historical documents. If you are reading an old proposal, please be aware that the features described herein might have continued to evolve since.

Audit Logging

Problem Statement

Audit logging is a requirement from most compliance standards (e.g. FedRAMP, PCI-DSS). The Pinniped Supervisor and Concierge components should provide audit logs to help users meet these compliance requirements.

The Kubernetes API server already supports rich audit logging features which are implemented by vendors of Kubernetes distributions. The Pinniped audit logs are meant to augment, not replace, the Kubernetes audit logs.

How Pinniped Works Today (as of version v0.16.0)

The Pinniped Supervisor and Concierge components are Kubernetes Deployments. Today, each Pod has a single container, which is the Supervisor or Concierge app. Kubernetes captures the stdout and stderr of the app into the Pod logs.

Today, the Pinniped Supervisor and Concierge log many interesting events to their Pod logs. These logs are meant primarily to help an admin user debug problems with their Pinniped configuration or with their cluster. The Supervisor and Concierge each offer an install-time configuration option to turn up the verbosity of these Pod logs.

However, these logs are not meant to be audit logs. They generally focus on logging problems, not on logging successes. They try to avoid logging anything that might be confidential or PII (personally identifiable information). Since email addresses might be considered PII, these logs generally avoid including usernames at the default log level, since usernames could be email addresses in some configurations. Logging the identity of actors (usernames) are a key aspect of audit logs.

Terminology / Concepts

None.

Proposal

The goal of an audit log is to log events that could be helpful in a forensic investigation of past usage, including the actor (the username) and the actions that were taken on the system.

Goals and Non-goals

Goals

  • Auditing events relating to upstream identity provider (IDP) authentication, refresh, and sessions.
  • Auditing events relating to minting and validating cluster credentials.
  • Enabling auditors to easily stitch together authentication events into an audit trail.
  • Provide consistent data across auditable events.
  • Provide the ability to enable and disable auditing.
  • Provide the ability to route audit logs to a separate destination from the rest of Pinnipeds logs.

Non-goals

  • Enabling Kubernetes API request auditing in the impersonation proxy. If needed, this will be handled in a separate feature.
  • Providing the ability to filter or choose which audit events to capture.
  • Auditing the management of CRs (e.g. OIDCIdentityProvider). These events are captured by the API server audit logs.

Specification / How it Solves the Use Cases

This proposal recommends following the recommendation of the Kubernetes docs to create a separate Pod container log. This new container log will contain the audit logs (and only the audit logs).

API Changes

Configuration Options

There will be very few user-facing configuration options for audit logging in the first version of the feature. If later found to be needed, more configuration could be added in future versions.

This proposal recommends adding a single on/off install-time configuration option for disabling audit logs. By default, audit logs will be disabled. Usernames may be considered PII, so disabled by default avoids potentially logging PII.

Like other install-time configuration options, this option would appear in the values.yaml file of the Supervisor and Concierge deployment directories. The selected value would be rendered into the "static" ConfigMap, and read by the Supervisor or Concierge app's Golang code at Pod startup time.

Event Data

Deciding every specific audit event is an implementation detail beyond the scope of this proposal.

Generally, the following data should be included with every audit event, whenever possible:

  • What type of event occurred (e.g. login)
  • Outcomes of event (succeed or fail)
  • When the event occurred
  • Where the event occurred (Kubernetes Pod logs automatically include the ID of the Pod, which should be sufficient)
  • Source of the event (e.g. requester IP address)
  • The identity of individuals or subjects associated with the event (who initiated, who participated. etc.)
  • Details involving any objects accessed

The Supervisor's audit logs would include events such as:

  • Upstream logins for all IdP types (started, succeeded, failed)
  • Upstream refresh for all IdP types (succeeded, failed)
  • Upstream group refresh for all IdP types (succeeded, failed)
  • Downstream login (started, succeeded, failed)
  • Downstream token exchange (succeeded, failed)
  • Session expired
  • The equivalent of access log events for all Supervisor endpoints, since there is no other component providing access logs. This would include logging things like calls to the Supervisor's OIDC well-known discovery endpoint. These logs could help an investigator determine more about the usage pattern of a suspicious client.
  • The identity (username, group memberships) of newly authenticated users
  • Newly authenticated user is associated with “admin-like” RBAC. Any user that is allowed to perform verbs=* groups=* resources=* according to a subject access review API call shall be considered "admin-like". This would only indicate that the user has "admin-like" permissions on the Supervisor cluster itself, not on other workload clusters, since the Supervisor is not aware of the RBAC settings on the workload clusters.

The Concierge's audit logs would include events such as:

  • Token credential request (succeeded, failed, maps to admin RBAC). While already captured by the API server audit logs, those should likely be set to metadata. Duplicating the event allows for more controlled capture & management of data.
    • Similar to the Supervisor, the TCR endpoint could log when an authenticated user is associated with “admin-like” RBAC. Any user that is allowed to perform verbs=* groups=* resources=* according to a subject access review API call shall be considered "admin-like".
  • WhoAmI Request. While already captured by the API server audit logs, duplicating the event allows for more controlled capture & management of data.

Other events may be useful to auditors and may be included in the audit logs, such as:

  • Application startup with version information
  • Graceful application shutdown
Audit Logs as Separate Log Files

The Concierge and Supervisor apps could each send audit logs to separate files on disk in JSON format. The performance impact of logging to a file should be acceptable thanks to file buffering, but this assumption should be tested. Note that this approach would not guarantee that the log statement is flushed to the file before the action is performed, because then we would lose the benefit of buffering. It would be "best effort" to the file, e.g. the process crashing might lose a few lines of logs. A normal pod shutdown should be able to flush the file without any loss.

A new streaming sidecar container will be added to both the Concierge and Supervisor apps Deployments' Pods. These containers will tail those audit logs to stdout, thus effectively moving those log lines from files on the Pod to Kubernetes container logs. Those sidecar container images can be minimal with just enough in the image to support the unix tail command (or similar Go binary, such as hpcloud/tail, although that particular example library may not be maintained anymore).

Kubernetes will take care of concerns such as log rotation for the container logs. For the files on the Pod's disk output by the Supervisor and Concierge apps, we should research whether Pinniped should have code to avoid allowing those files from growing too large. Old lines can be discarded since the sidecar container should have already streamed them.

Container logs in JSON format are easy for node-level logging agents, e.g. fluentbit, to ingest/annotate/parse/filter and send to numerous sink destinations. These containers could still run when audit logs are disabled by the admin, but would produce no log lines in that case.

Parsing, Filtering, and Sending Audit Logs to an External Destination

Many users will use the popular fluentbit project to filter and extract Pod logs from their cluster. This project implements a node-level log agent which understands the Kubernetes directory and file layout for Pod logs. It also has a feature to further enrich the logs by automatically adding more information about the source Pod to each event (line) in the log. It supports many configurable options for parsing, filtering, and sending logs to many destinations.

By putting the Supervisor and Concierge audit logs into their own Pod logs, Pinniped will be compatible with any existing node-level agent software which can extract logs from a Kubernetes cluster. This allows the Pinniped code to focus on generating the logs as JSON, without worrying about providing any configuration options for filtering or sending to various destinations.

Audit Log JSON Format

The format of Kubernetes audit logs is not a perfect fit for Pinniped. The Kubernetes audit logs are strongly oriented towards API requests for Kubernetes resources, with many of the fields representing the details of a request and response. The format of the Pinniped audit logs will draw inspiration from the Kubernetes audit events without trying to directly copy them.

Each line of audit log will represent an event. Each line will be a complete JSON object, i.e. {"key1":"value1","key2":"value2"}.

Some, but not all, events will be the result of a user making an API request to an endpoint. One API request from a user may cause more than one event to be logged. If possible, unique ID will be determined for each incoming request, and will be included in all events caused by that request.

Where possible, the top-level keys of the JSON object will use standardized names. Other top-level keys specific to that action type may be added. All keys should be included in documentation for the audit log feature.

Every event should include these keys:

  • timestamp: the timestamp of the event
  • event: the event type, which is a brief description of what happened, with no string interpolation, so it will always be the same for a given event type (e.g. upstream refresh succeeded)
  • v: a number specifying the format version of the event type, starting with 1, to give us flexibility to make breaking changes to the format of an event type in future releases (e.g. change the name of the JSON keys, or change the data type of the value of an existing key)

Depending on the event type, an event might include other keys, such as:

  • message: a freeform warning or error message meant to be read by a human (e.g. the error message that was returned by an upstream IDP during a failed login attempt)
  • requestID: a unique ID for the request, if the event is related to an API request
  • requestURI: the path of the endpoint, if the event is related to an API request
  • verb: the REST method called on the endpoint, if the event is related to an API request
  • sourceIPs: the client's IPs, if the event is related to an API request
  • userAgent: the user agent string reported by the client, if the event is related to an API request
  • user: a nested structure which can include the username, groups, and uid of the user performing the action, if there is one

The names of many of these keys are purposefully similar to the names of the keys used by Kubernetes audit events to make them feel familiar. Also, where it makes sense, the key names should be similar to those used in the Pinniped Pod logs.

The details of these additional keys will be worked out as the details of the specific events are being worked out, during implementation of this proposal.

Audit Log Timestamps

The date format used in the audit logs should be something which can be easily parsed by fluentbit, to make it easy for users to configure fluentbit. We could easily document this to provide instructions on how to configure a custom fluentbit parser for Pinniped audit logs. We should probably avoid fluentbit's default json parser's date format, which assumes dates will be in an ugly format and also lacks sub-second precision (e.g. 08/Apr/2022:19:24:01 +0000).

fluentbit uses strptime with an extension for fractional seconds to parse timestamps.

It would be desirable for a timestamp to:

  1. Be human-readable (e.g. not seconds since an epoch)
  2. Be easily parsable by log parsers, especially fluentbit
  3. Be expressed in UTC time
  4. Use at least millisecond precision
  5. Use the consistent JSON key name timestamp

Golang's standard library's interpretation of RFC 3339 with nanosecond precision defines a timestamp format which meets the above goals. An example timestamp in this format, printed by fmt.Println(time.Now().UTC().Format(time.RFC3339Nano)), is 2022-05-09T21:32:59.811913Z, which represents UTC time on May 9, 2022, at 21:32:59 pm, 811913 nanoseconds into the next second. Note that trailing zeros on the nanoseconds are dropped, so the length of the nanoseconds field is variable in the output.

Given this timestamp format, the following fluentbit configuration could be used to parse Pinniped's audit logs.

    [PARSER]
      Name   json
      Format json
      Time_Key timestamp
      Time_Format %Y-%m-%dT%H:%M:%S.%LZ

Upgrades

Since audit logs will be output to a new location, there are not any backward compatibility concerns for them in the first release.

Adding a second container to the Pods in generally not noticeable by a user, but may have some impact on existing installations in some rare cases, so it should be explained in the release notes. For example, a GKE Ingress will, by default, read the Pod's container definition to try to guess the health check endpoint for the backend Service of the Ingress. When there is only one container, it will try to guess, but where there is more than one container it will give up on guessing and instead expect the user to configure the health checks. So upgrading could break the health checks of a GKE Ingress, if no health checks were configured.

Tests

Audit logging will be a user-facing feature, and the format of the logs should be considered a documented and versioned API. Unnecessary changes to the format should be avoided after the first release. Therefore, all audit log events should be covered by unit tests.

This implies that it may be desirable for the implementation to involve passing around a pointer to some interface to all code which needs to add events to the audit log. Such an implementation would make the audit logs more testable. A production code implementation of the interface should take care of common concerns, such as adding the timestamp, deciding required key names, and formatting the output as JSON. A test implementation of the interface could handle those common concerns differently to make testing easier.

New Dependencies

  • We might want to consider using a library like zap to aid in implementation, but that is already an indirect dependency of Pinniped.
  • The new streaming sidecar container will need a container image. Using the existing pinniped-server container image seems desirable. It is a distroless image, which is good for security. And it is the only image that we currently ship in Pinniped releases. One option to make this happen would be to implement the tail command in Go, but any binary that can work in a distroless image should be okay. We should avoid adding linux standard libraries to the container image, so the binary should be statically linked with no external dependencies. The binary should support the same OS and architecture that our existing Go binary supports.

Performance Considerations

By using buffered output to write to the audit log files, there should not be any meaningful performance impact. This assumption should be tested.

Observability Considerations

Auditing will improve operator observability, as described in the other sections of this document.

Security Considerations

The audit logs will be Pod container logs, so the contents of the logs will be protected by Kubernetes like any Pod container logs.

Usability Considerations

By using Pod container logs, the user will have many options to manage these logs.

Documentation Considerations

The supported audit event types, and they JSON keys output for each event type, should be documented. Users should be able to build their own parsers for these events based on the documentation.

If the production code implementation of the audit interface used Golang constants for all allowed JSON key names and event type names, and otherwise enforced certain standards, then it may be possible to auto-generate (or nearly auto-generate) the documentation for the audit event types.

Other Approaches Considered

None yet.

Open Questions

None.

Answered Questions

  • Should we output events that can function similar to access logs for the Supervisor endpoints? Yes (paragraphs above updated).
  • Should we try to somehow detect that a user is "root-like"? Yes (paragraphs above updated).

Implementation Plan

The maintainers will implement these features. It might fit into one PR.

Implementation PRs

This section is a placeholder to list the PRs that implement this proposal. This section should be left empty until after the proposal is approved. After implementation, the proposal can be updated to list related implementation PRs.