End-to-end user identity across a research platform with Istio, OPA, and token exchange

An auditor asked me a question I couldn't answer: which user authorized this genomics pipeline to process patient data? I could tell them Service A called Service B with a valid API key. I could not tell them who was behind it. The user's identity died at the first hop, and everything downstream ran under a credential that meant "some service, on behalf of nobody in particular." That gap is what I spent the first half of this year closing.

The platform runs scientific workflows for researchers. A single user action, submitting a pipeline or creating a dataset or running an analysis, can touch half a dozen services, and those services don't all live in the same place. Some are on Kubernetes. Some run on bare-metal HPC nodes. Some are third-party services we call over the network. The question underneath the auditor's question was: how do the user's identity and authorization travel through that entire chain, so every service in the path knows who initiated the work and what they're allowed to do?

What was actually broken

The old model was fine right up to the point it wasn't. The user authenticates against Keycloak and gets a JWT. The frontend attaches that JWT to API calls. The first service validates it. Good so far.

Then the first service needed to call the second. And to do that, it used a static API key stored in a Kubernetes Secret. Service B saw a valid key and allowed the request, with no idea which user triggered it. I want to be blunt about how bad that is, because I lived with it longer than I should have.

Three concrete reasons it had to go.

Compliance audit. When a researcher runs a pipeline over patient data, the compliance team needs to know exactly which user authorized that processing, at every step. "Service A called Service B with an API key" does not satisfy an auditor. "User researcher-42 initiated this, delegated through the workflow service, then through the task runner" does.

Authorization granularity. With API keys, every service-to-service call carries identical permissions. If the workflow service can call the data service, it can do so for any user, for any dataset. There's no way to say "this call is on behalf of a user who only has access to Project X." The API key is a skeleton key, and a skeleton key is exactly what you don't want on a genomics platform that has to pass audits.

Long-lived tasks across mixed infrastructure. A pipeline might run for hours. Submission happens through Kubernetes services, but the compute might execute on HPC nodes outside the cluster. Those nodes call back to stage files, report progress, write outputs, and those callbacks need the original user's identity. A static API key can't carry who the work is for. A short-lived user JWT can't survive long enough to finish the job. Neither end of the spectrum works.

Istio RequestAuthentication

The first change moved JWT validation out of application code and into the mesh. Every service had its own validation middleware, some Python, some Go, some Node, each with its own quirks around clock skew, JWKS caching, and error responses. That's at least three subtly different security implementations to audit instead of one. Moving validation to Istio made it uniform.

apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: keycloak-jwt
  namespace: platform
spec:
  jwtRules:
    - issuer: "https://auth.example.com/realms/platform"
      jwksUri: "https://auth.example.com/realms/platform/protocol/openid-connect/certs"
      forwardOriginalToken: true
      outputPayloadToHeader: x-jwt-payload

forwardOriginalToken: true passes the original JWT through to the application. Services that exchange the token for downstream calls (more on that below) need the raw token, not just the claims. outputPayloadToHeader writes the validated claims into a request header as base64url-encoded JSON, so a service that only needs to read user identity can do it without a JWT library at all. Base64url-decode the header, parse the JSON, done.
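As a rough illustration of that last step, here is a minimal Python sketch. It assumes the header value is base64url-encoded JSON that may arrive without `=` padding; the function name and the sample claims are hypothetical, not taken from the platform's code.

```python
import base64
import json


def decode_jwt_payload_header(header_value: str) -> dict:
    """Decode the claims that the mesh placed in the x-jwt-payload header.

    Assumes base64url-encoded JSON that may lack '=' padding, so the
    padding is restored before decoding.
    """
    padded = header_value + "=" * (-len(header_value) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Hypothetical claims, encoded the way the header is assumed to arrive.
sample_claims = {"sub": "user-123", "realm_access": {"roles": ["researcher"]}}
sample_header = (
    base64.urlsafe_b64encode(json.dumps(sample_claims).encode())
    .rstrip(b"=")
    .decode()
)

claims = decode_jwt_payload_header(sample_header)
print(claims["sub"])
```

Note that this only reads claims. It is safe only because the sidecar has already verified the signature; a service reachable without going through the sidecar should not trust this header.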

The Envoy sidecar handles signature verification against the JWKS endpoint, plus expiry and issuer checks. Audience is only restricted when the rule lists audiences, and the rule above lists none, so audience-based decisions are left to the authorization layer. An invalid token gets a 401 before it ever reaches the application container. The part I like most: a new service dropped into the namespace is authenticated by default. Nobody has to remember to wire up auth middleware, which is precisely the step people forget.

The AuthorizationPolicy decides which paths require a valid principal:

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: require-auth
  namespace: platform
spec:
  action: DENY
  rules:
    - from:
        - source:
            notRequestPrincipals: ["*"]
      to:
        - operation:
            notPaths: ["/health", "/ready", "/public/*"]

Health checks and readiness probes are excluded, because Kubernetes itself has to reach those without a token, and so is anything under /public/. Everything else requires a validated JWT.
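The DENY rule is easy to misread because of its double negatives, so here is a small Python model of the decision it encodes. This is only a sketch of the logic in the policy above, not how Istio evaluates it internally; the function names are hypothetical, and the matcher handles only the exact and trailing-`*` prefix patterns that appear in this policy.

```python
EXEMPT_PATHS = ["/health", "/ready", "/public/*"]


def path_matches(pattern: str, path: str) -> bool:
    """Exact match, or prefix match when the pattern ends with '*'."""
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return path == pattern


def is_denied(path: str, has_request_principal: bool) -> bool:
    """Deny when there is no validated principal and the path is not exempt."""
    exempt = any(path_matches(pattern, path) for pattern in EXEMPT_PATHS)
    return (not has_request_principal) and (not exempt)


for path, authenticated in [
    ("/api/datasets", False),
    ("/api/datasets", True),
    ("/health", False),
    ("/public/docs", False),
]:
    print(path, authenticated, "DENY" if is_denied(path, authenticated) else "pass")
```

A request that passes this rule is not yet authorized for anything; it has only cleared the "must carry a validated JWT" bar.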

OPA for per-service authorization

Validation tells you who the caller is. Authorization tells you what they can do. For that layer I used Open Policy Agent, deployed as a gRPC external authorization service that Envoy consults on every request.

The integration uses Envoy's ext_authz filter, configured through Istio's EnvoyFilter:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: opa-ext-authz
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
              subFilter:
                name: envoy.filters.http.router
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.ext_authz
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
            grpc_service:
              envoy_grpc:
                cluster_name: opa
            transport_api_version: V3
            with_request_body:
              max_request_bytes: 8192
              allow_partial_message: true

I ran OPA as a sidecar in each service pod rather than one shared instance, and I'd make that call again. Policies are service-specific. The dataset service has different rules than the workflow service, and a shared OPA becomes a single blast radius plus a config file nobody wants to own. Each service's OPA loads its policy bundle from a ConfigMap mounted at /policies.

A typical policy reads the JWT claims (from the x-jwt-payload header Istio set), the HTTP method, the request path, and for delegated calls, the act claim chain:

package authz

# The ext_authz check request as the OPA-Envoy plugin exposes it
import input.attributes.request.http as http_request

default allow = false

# Direct user access: project admins can delete datasets
allow {
    input.parsed_path[0] == "api"
    input.parsed_path[1] == "datasets"
    http_request.method == "DELETE"
    token.realm_access.roles[_] == "project-admin"
}

# Delegated access: workflow service can read datasets on behalf of a researcher
allow {
    input.parsed_path[0] == "api"
    input.parsed_path[1] == "datasets"
    http_request.method == "GET"
    token.act.sub == "workflow-service"
    token.realm_access.roles[_] == "researcher"
}

# Claims that Istio validated and forwarded in the x-jwt-payload header
token := payload {
    encoded := http_request.headers["x-jwt-payload"]
    payload := json.unmarshal(base64url.decode(encoded))
}

The second rule is the interesting one. It allows a GET on datasets if the immediate caller is the workflow service (act.sub) and the original user has the researcher role. The dataset service knows nothing about the workflow service's internals. It just asks: who is calling me, and who are they acting for?
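To make the behavior of the two rules concrete, here is a plain-Python model of them that can be table-tested. It is only an illustration of the policy's logic, assuming the claims have already been decoded into a dict; the function names are hypothetical and this is not a substitute for testing the Rego itself.

```python
def has_role(claims: dict, role: str) -> bool:
    """True if the realm roles in the claims include the given role."""
    return role in claims.get("realm_access", {}).get("roles", [])


def allow(method: str, path: str, claims: dict) -> bool:
    """Mirror of the two rules: default deny, then two allow cases."""
    parts = [segment for segment in path.split("/") if segment]
    if parts[:2] != ["api", "datasets"]:
        return False

    # Rule 1: project admins can delete datasets.
    if method == "DELETE" and has_role(claims, "project-admin"):
        return True

    # Rule 2: the workflow service can read datasets for a researcher.
    actor = claims.get("act", {}).get("sub")
    if method == "GET" and actor == "workflow-service" and has_role(claims, "researcher"):
        return True

    return False


delegated = {
    "sub": "user-123",
    "realm_access": {"roles": ["researcher"]},
    "act": {"sub": "workflow-service"},
}
print(allow("GET", "/api/datasets/42", delegated))
print(allow("DELETE", "/api/datasets/42", delegated))
```

One thing the model makes visible: with only these two rules, a researcher calling the dataset service directly (no act claim) is denied, because nothing matches and the default is deny.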

Policies are version-controlled alongside the Helm charts. Deploying a new one is a Helm upgrade that updates the ConfigMap, and OPA picks it up within its polling interval. Auth rules living in git next to the code they protect is the whole reason this stayed maintainable.

Service-to-service token exchange

This is the piece that actually solved the auditor's problem. OAuth 2.0 Token Exchange (RFC 8693) lets a service swap a user's JWT for a new token that still says "this is user-123's request" but adds "and it's being carried by service-A."

When Service A needs to call Service B on behalf of a user, it authenticates to Keycloak as its own client (credentials omitted here) and posts:

POST /realms/platform/protocol/openid-connect/token
Content-Type: application/x-www-form-urlencoded

grant_type=urn:ietf:params:oauth:grant-type:token-exchange
&subject_token={user_jwt}
&subject_token_type=urn:ietf:params:oauth:token-type:access_token
&requested_token_type=urn:ietf:params:oauth:token-type:access_token
&audience={service_b_client_id}

Keycloak returns a new JWT. The sub is still the original user. The azp (authorized party) is now Service B's client ID. And there's an act (actor) claim recording who performed the exchange:

{
  "sub": "user-123",
  "azp": "service-b",
  "realm_access": { "roles": ["researcher"] },
  "act": {
    "sub": "service-a"
  }
}

Service B's OPA policy can now check both facts: that the calling service is allowed to act as a delegate (act.sub), and that the original user has the right role (realm_access.roles). The identity rides along with the request instead of evaporating at the door.
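For readers who want to see the request assembled in code, here is a minimal Python sketch that builds the form body shown above using only the standard library. The function name and the placeholder token are hypothetical, and client authentication for the calling service is deliberately left out; how that is supplied depends on how the client is configured in Keycloak.

```python
from urllib.parse import parse_qs, urlencode

TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_exchange_body(user_jwt: str, target_client_id: str) -> str:
    """Form-encode the RFC 8693 parameters used in the request above."""
    return urlencode(
        {
            "grant_type": TOKEN_EXCHANGE_GRANT,
            "subject_token": user_jwt,
            "subject_token_type": ACCESS_TOKEN_TYPE,
            "requested_token_type": ACCESS_TOKEN_TYPE,
            "audience": target_client_id,
        }
    )


# Placeholder token value; a real call would pass the user's actual JWT.
body = build_exchange_body("header.payload.signature", "service-b")
print(body)
```

The body would be sent as application/x-www-form-urlencoded to the realm's token endpoint. urlencode takes care of percent-encoding the colons in the URN values.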

Chained exchanges for multi-hop delegation

In practice the chains are deeper than A to B. A user submits a workflow, the platform service calls the workflow runner, the runner calls the task executor, the executor calls the data service to stage input files. Four hops. The clean single-hop diagram in my head did not survive contact with the real call graph.

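I wrote a custom Keycloak SPI in Java to support chained exchanges. RFC 8693 (section 4.1) describes expressing a chain of delegation by nesting one act claim inside another, but it leaves it to the authorization server to decide when and how to build that chain when a token that already carries an act claim gets exchanged again. Our SPI nests them: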

{
  "sub": "user-123",
  "azp": "data-service",
  "act": {
    "sub": "task-executor",
    "act": {
      "sub": "workflow-runner",
      "act": {
        "sub": "platform-api"
      }
    }
  }
}

The full delegation path is preserved. The data service reading this token knows the whole story: user-123 initiated the work, it flowed through platform-api, then workflow-runner, then task-executor, and now it's here. OPA policies can enforce constraints on the chain itself. "The data service only accepts requests delegated through the workflow runner." "No more than 4 hops."
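The nesting and the chain checks are simple enough to sketch in a few lines of Python. This is an illustration of the idea, not the SPI's Java code or the actual Rego; the function names are hypothetical, and "hops" is assumed here to mean the number of actors in the act chain.

```python
from typing import Optional


def nest_actor(existing_act: Optional[dict], new_actor_sub: str) -> dict:
    """Wrap the existing act claim (if any) inside one for the new actor."""
    act = {"sub": new_actor_sub}
    if existing_act is not None:
        act["act"] = existing_act
    return act


def delegation_path(claims: dict) -> list:
    """Return the actors in the order the request travelled, oldest first."""
    path = []
    act = claims.get("act")
    while isinstance(act, dict):
        path.append(act["sub"])
        act = act.get("act")
    path.reverse()  # the outermost act is the most recent actor
    return path


def chain_allowed(claims: dict, required_actor: str, max_hops: int) -> bool:
    """Require a given service somewhere in the chain and cap its length."""
    path = delegation_path(claims)
    return required_actor in path and len(path) <= max_hops


# Rebuild the act claim from the example token, one exchange at a time.
act = None
for service in ["platform-api", "workflow-runner", "task-executor"]:
    act = nest_actor(act, service)

token = {"sub": "user-123", "azp": "data-service", "act": act}
print(delegation_path(token))
print(chain_allowed(token, required_actor="workflow-runner", max_hops=4))
```

Each exchange pushes the previous chain one level down, which is why the most recent actor sits at the top of the claim.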

Getting there meant fighting Keycloak's default validation, and this is the part I got wrong the first time. Keycloak's built-in token exchange provider checks that the client making the exchange request matches the azp in the subject token. That holds for single-hop exchanges, where service-a exchanges the user's token and the user's token has azp set to the frontend or service-a. It breaks for chained exchanges, because the subject token's azp is the previous service in the chain, not the current one. I spent a while assuming my SPI was buggy before I realized the platform was rejecting me on purpose. The SPI relaxes that check for a whitelist of service clients permitted to perform chained exchanges.

Long-lived tasks and the expiry problem

User JWTs expire. Ours have a 15-minute lifetime. A pipeline can run for 8 hours. If the task executor needs to call the data service at hour 6 to write output files, the original token has been dead for hours. This is the constraint that quietly kills naive designs.

The exchanged tokens our SPI issues have a configurable lifetime, independent of the original. When the workflow runner exchanges the user token at submission, it gets one with a lifetime matching the workflow's estimated duration plus a safety margin. That token is handed to the task executor and used for the whole run.
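The arithmetic is trivial but worth seeing next to the 15-minute user token. In this Python sketch the one-hour safety margin, the timestamps, and the function name are all assumptions for illustration; the source does not state the real margin.

```python
from datetime import datetime, timedelta, timezone

USER_TOKEN_LIFETIME = timedelta(minutes=15)


def exchanged_token_expiry(
    issued_at: datetime,
    estimated_duration: timedelta,
    safety_margin: timedelta = timedelta(hours=1),  # hypothetical default
) -> datetime:
    """Expiry for a token meant to outlive the whole workflow run."""
    return issued_at + estimated_duration + safety_margin


submitted = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)  # example time
user_token_expiry = submitted + USER_TOKEN_LIFETIME
task_token_expiry = exchanged_token_expiry(submitted, timedelta(hours=8))

# A callback at hour 6 is long past the user token but inside the exchanged one.
callback_time = submitted + timedelta(hours=6)
print(callback_time > user_token_expiry)
print(callback_time < task_token_expiry)
```

The trade-off is the usual one: a longer-lived token is a longer-lived credential if it leaks, so the margin should be as small as the duration estimates allow.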

It's the same mechanism that carries identity across infrastructure boundaries. An HPC job on a bare-metal node outside the Kubernetes cluster can hold an exchanged token and call back to platform services over HTTPS. The node doesn't need to be in the mesh. It doesn't need a sidecar. It just needs a valid JWT with the right claims. The receiving service validates it at the mesh layer (or via plain JWT validation if it's outside the mesh), checks the act chain and user roles through OPA, and processes the request. That's the property I was after: the token is the passport, and it works the same on Kubernetes and on bare metal.

The proxy sidecar

I didn't want every service implementing the token exchange call by hand. So I built a small proxy, a Go binary deployed as a sidecar, that intercepts outbound service-to-service calls and does the exchange transparently:

Reads the user JWT from the inbound request's Authorization header (kept intact by Istio's forwardOriginalToken)
Calls Keycloak's token exchange endpoint with the user JWT and the target service's client ID
Attaches the exchanged token to the outbound request
Caches exchanged tokens by (user_sub, target_service, act_chain_hash) tuple with TTL matching the token's exp claim

The cache is an in-memory LRU bounded by entry count (default 1000) and TTL. Hits skip the Keycloak round-trip. In production about 85% of exchange requests hit the cache, because the same users tend to trigger the same call chains inside a single token's lifetime. I expected that number to be lower and was happy to be wrong; it's the difference between Keycloak being on the hot path and being an occasional dependency.
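A cache with those properties can be sketched briefly. The real proxy is a Go binary; the Python below is only a model of the described behavior (LRU bounded by entry count, entries expiring at the token's exp, keyed by user, target service, and a hash of the act chain). The class name, the key helper, and the choice of SHA-256 are assumptions.

```python
import hashlib
import time
from collections import OrderedDict


def cache_key(user_sub: str, target_service: str, act_chain: list) -> tuple:
    """Key by user, target service, and a hash of the delegation chain."""
    chain_hash = hashlib.sha256("/".join(act_chain).encode()).hexdigest()
    return (user_sub, target_service, chain_hash)


class ExchangedTokenCache:
    """LRU cache bounded by entry count; entries expire at the token's exp."""

    def __init__(self, max_entries: int = 1000, clock=time.time):
        self._entries = OrderedDict()
        self._max_entries = max_entries
        self._clock = clock

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        token, exp = item
        if exp <= self._clock():
            del self._entries[key]  # expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return token

    def put(self, key, token: str, exp: float) -> None:
        self._entries[key] = (token, exp)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used


cache = ExchangedTokenCache()
key = cache_key("user-123", "data-service", ["platform-api", "workflow-runner"])
cache.put(key, "exchanged.jwt.value", exp=time.time() + 300)
print(cache.get(key) is not None)
```

On a miss the proxy would call Keycloak and then put the result with the exp read from the returned token. Including the chain hash in the key keeps a token minted for one delegation path from being reused on another.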

Application code has no idea any of this is happening. It makes a normal HTTP call. The proxy does the rest. That's what let us roll token exchange out service by service without rewriting application code: the proxy showed up as a new sidecar in the pod spec, and outbound calls quietly started carrying proper delegated tokens.

What it solved

The change was less about swapping one security technology for another and more about making the user's identity a first-class thing that survives across service boundaries, infrastructure boundaries, and time.

Every service-to-service call now carries the originating user's identity and the full delegation chain. Authorization can be user-aware at every hop. Compliance can trace any action back to the user who started it, through every service that touched it. Long-running tasks on remote infrastructure can call back with proper delegation. And adding a new service no longer means distributing API keys or writing custom auth middleware.

The migration took about three months. We ran the old API-key path and the new token exchange in parallel for two weeks, with OPA policies accepting either a valid exchanged JWT or a valid API key, before pulling the API key path out. No incidents during the overlap.

Which brings me back to that auditor. Ask me now which user authorized a given pipeline to touch patient data, and the answer is a claim chain I can read straight off the token: user-123, through platform-api, workflow-runner, task-executor. The question that had no answer now answers itself.