End-to-end user identity across a research platform with Istio, OPA, and token exchange
2026-05-13
I spent the first half of this year redesigning how authentication and authorization work across a research platform. The platform runs scientific workflows for researchers. A single user action (submitting a pipeline, creating a dataset, running an analysis) can touch half a dozen services, and those services don't all live in the same place. Some are on Kubernetes. Some run on bare-metal HPC nodes. Some are third-party services we call over the network. The problem we needed to solve was: how do the user's identity and authorization travel through that entire chain, so that every service in the path knows who initiated the work and what they're allowed to do?
What was actually broken
The old auth model worked like this: the user authenticates against Keycloak and gets a JWT. The platform's frontend attaches that JWT to API calls. The first service validates the JWT. So far, fine.
The problem was what happened after that first service. When Service A needed to call Service B on behalf of the user, it used a static API key stored in a Kubernetes Secret. Service B saw a valid API key and allowed the request, but it had no idea which user triggered it. The user's identity was lost at the first hop.
This matters for three concrete reasons:
Compliance audit. When a researcher runs a genomics pipeline that processes patient data, the compliance team needs to know exactly which user authorized that processing, at every step. "Service A called Service B with an API key" doesn't satisfy an auditor. "User researcher-42 initiated this, delegated through the workflow service, then through the task runner" does.
Authorization granularity. With API keys, every service-to-service call has the same permissions. If the workflow service can call the data service, it can do so for any user, for any dataset. There's no way to say "this call is on behalf of a user who only has access to Project X's datasets." The API key is a skeleton key.
Long-lived tasks across heterogeneous infrastructure. A bioinformatics pipeline might run for hours. The initial submission happens through Kubernetes services, but the actual compute might execute on HPC nodes that aren't part of the Kubernetes cluster. Those nodes need to call back to platform services (to stage files, report progress, write outputs), and those callbacks need to carry the original user's identity. A static API key can't do this because it doesn't encode who the work is for. A short-lived user JWT can't do this because it expires before the pipeline finishes.
Istio RequestAuthentication
The first change was moving JWT validation out of application code and into the mesh. Each service had its own JWT validation middleware (some in Python, some in Go, some in Node), each with slightly different behavior around clock skew tolerance, JWKS caching, and error responses. Moving validation to Istio made it uniform.
apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: keycloak-jwt
  namespace: platform
spec:
  jwtRules:
    - issuer: "https://auth.example.com/realms/platform"
      jwksUri: "https://auth.example.com/realms/platform/protocol/openid-connect/certs"
      forwardOriginalToken: true
      outputPayloadToHeader: x-jwt-payload
forwardOriginalToken: true passes the original JWT through to the application. Services that need to exchange the token for downstream calls (more on this below) need the raw token, not just the claims. outputPayloadToHeader copies the validated JWT payload, still base64url-encoded, into a request header, so services that only need to read user identity can do so without any JWT library at all: just base64url-decode the header (restoring any stripped padding) and parse the JSON.
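The header decoding described above can be sketched in Python as follows. This is a minimal illustration, not the platform's code: the function name is hypothetical, and it assumes the header value is base64url text without "=" padding, as JWT segments normally are.

```python
import base64
import json


def claims_from_payload_header(header_value: str) -> dict:
    """Decode the base64url JWT payload that the mesh placed in a header."""
    # JWT segments usually omit '=' padding; restore it so the decoder accepts the value.
    padded = header_value + "=" * (-len(header_value) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Build a sample header value the same way a JWT payload segment is encoded.
sample_claims = {"sub": "user-123", "realm_access": {"roles": ["researcher"]}}
sample_header = (
    base64.urlsafe_b64encode(json.dumps(sample_claims).encode()).rstrip(b"=").decode()
)

claims = claims_from_payload_header(sample_header)
print(claims["sub"])  # user-123
```

No signature check happens here, which is only safe because the sidecar has already validated the token and the header cannot be set by an outside caller.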
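The Envoy sidecar handles signature verification against the JWKS endpoint, expiry checks, and issuer validation. Audience validation only applies when a rule lists audiences, which the rule above does not. If a request carries an invalid token, it gets a 401 before it reaches the application container. A request with no token at all is not rejected by RequestAuthentication on its own; requiring a token is the job of the AuthorizationPolicy. Together, the two mean a new service added to the namespace is authenticated by default. You don't need to remember to add auth middleware.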
The AuthorizationPolicy controls which paths require a valid principal:
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: require-auth
  namespace: platform
spec:
  action: DENY
  rules:
    - from:
        - source:
            notRequestPrincipals: ["*"]
      to:
        - operation:
            notPaths: ["/health", "/ready", "/public/*"]
Health checks and readiness probes are excluded because Kubernetes itself needs to reach those without a token. Paths under /public/ are exempt as well. Everything else requires a validated JWT.
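The double negation in that policy (deny when there is no request principal and the path is not on the exempt list) is easy to misread, so here is a small Python mirror of the decision. It is only a sketch of the rule's logic under the path list shown above, not how Istio evaluates policies internally; the function name is made up for illustration.

```python
def is_denied(has_request_principal: bool, path: str) -> bool:
    """Mirror of the DENY rule: no validated JWT AND path not on the exempt list."""
    exempt_exact = {"/health", "/ready"}
    exempt_prefix = "/public/"  # "/public/*" is a prefix match
    path_is_exempt = path in exempt_exact or path.startswith(exempt_prefix)
    return (not has_request_principal) and (not path_is_exempt)


print(is_denied(False, "/api/datasets"))  # True: no token on a protected path
print(is_denied(False, "/health"))        # False: probe path is exempt
print(is_denied(True, "/api/datasets"))   # False: validated JWT present
```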
OPA for per-service authorization
JWT validation tells you who the caller is. Authorization tells you what they're allowed to do. For this layer I used Open Policy Agent, deployed as a gRPC external authorization service that Envoy consults on every request.
The integration uses Envoy's ext_authz filter, configured through Istio's EnvoyFilter:
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: opa-ext-authz
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
              subFilter:
                name: envoy.filters.http.router
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.ext_authz
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
            grpc_service:
              envoy_grpc:
                cluster_name: opa
            transport_api_version: V3
            with_request_body:
              max_request_bytes: 8192
              allow_partial_message: true
OPA runs as a sidecar in each service pod rather than as a shared instance. This is important because policies are service-specific. The dataset service has different authorization rules than the workflow service. Each service's OPA instance loads its policy bundle from a ConfigMap mounted at /policies.
A typical policy evaluates the JWT claims (from the x-jwt-payload header set by Istio), the HTTP method, the request path, and for delegated calls, the act claim chain:
package authz

default allow = false

# Request attributes as supplied by the OPA-Envoy plugin
http_request := input.attributes.request.http

# Direct user access: project admins can delete datasets
allow {
    input.parsed_path[0] == "api"
    input.parsed_path[1] == "datasets"
    http_request.method == "DELETE"
    token.realm_access.roles[_] == "project-admin"
}

# Delegated access: workflow service can read datasets on behalf of a researcher
allow {
    input.parsed_path[0] == "api"
    input.parsed_path[1] == "datasets"
    http_request.method == "GET"
    token.act.sub == "workflow-service"
    token.realm_access.roles[_] == "researcher"
}

# Claims decoded from the header set by Istio's outputPayloadToHeader
token := payload {
    encoded := http_request.headers["x-jwt-payload"]
    payload := json.unmarshal(base64url.decode(encoded))
}
That second rule is the interesting one. It says: allow a GET on datasets if the immediate caller is the workflow service (act.sub) AND the original user has the researcher role. The dataset service doesn't need to know about the workflow service's internals. It just checks "who is calling me, and who are they acting for?"
The policies are version-controlled alongside the Helm charts. Deploying a new policy is a Helm upgrade that updates the ConfigMap. OPA picks up the change within its polling interval.
Service-to-service token exchange
This is the piece that actually solved the problem. OAuth 2.0 Token Exchange (RFC 8693) lets a service exchange a user's JWT for a new token that says "this is still user-123's request, but now it's being carried by service-A."
When Service A needs to call Service B on behalf of a user:
POST /realms/platform/protocol/openid-connect/token
Content-Type: application/x-www-form-urlencoded

grant_type=urn:ietf:params:oauth:grant-type:token-exchange
&subject_token={user_jwt}
&subject_token_type=urn:ietf:params:oauth:token-type:access_token
&requested_token_type=urn:ietf:params:oauth:token-type:access_token
&audience={service_b_client_id}
Keycloak returns a new JWT. The sub claim is still the original user. The azp (authorized party) is now Service B's client ID. And there's an act (actor) claim that records who performed the exchange:
{
  "sub": "user-123",
  "azp": "service-b",
  "realm_access": { "roles": ["researcher"] },
  "act": {
    "sub": "service-a"
  }
}
Service B's OPA policy can verify both things: that the calling service is authorized to act as a delegate (act.sub), and that the original user has the right role (realm_access.roles). The user's identity travels with the request.
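For a service that makes the exchange call itself, the request could be assembled roughly like this in Python. Treat it as a sketch: the token URL is built from the issuer and path shown earlier, while the function name and the client_id/client_secret form fields are assumptions, since the request above does not show how the calling service authenticates to Keycloak.

```python
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_URL = "https://auth.example.com/realms/platform/protocol/openid-connect/token"


def build_exchange_request(user_jwt: str, audience: str,
                           client_id: str, client_secret: str) -> Request:
    """Build (but do not send) an RFC 8693 token exchange request."""
    form = {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": user_jwt,
        "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "audience": audience,
        # Assumption: the caller authenticates with client credentials in the form body.
        "client_id": client_id,
        "client_secret": client_secret,
    }
    return Request(
        TOKEN_URL,
        data=urlencode(form).encode(),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_exchange_request("user-jwt-here", "service-b", "service-a", "example-secret")
print(req.get_method(), req.full_url)
```

Sending it is a matter of passing the request to urllib.request.urlopen and reading access_token from the JSON response.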
Chained exchanges for multi-hop delegation
In practice, call chains are deeper than A -> B. A user submits a workflow, the platform service calls the workflow runner, the runner calls the task executor, the executor calls the data service to stage input files. That's four hops.
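I wrote a custom Keycloak SPI in Java to support chained exchanges. RFC 8693 (Section 4.1) describes expressing a delegation chain by nesting one act claim inside another, with the outermost act naming the current actor, but we needed our own provider to make Keycloak produce that structure when a token that already has an act claim gets exchanged again. Our SPI nests them: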
{
  "sub": "user-123",
  "azp": "data-service",
  "act": {
    "sub": "task-executor",
    "act": {
      "sub": "workflow-runner",
      "act": {
        "sub": "platform-api"
      }
    }
  }
}
The full delegation path is preserved. The data service receiving this token knows: user-123 initiated the work, it flowed through platform-api, then workflow-runner, then task-executor, and now it's here. OPA policies can enforce constraints on the chain itself: "the data service only accepts requests that were delegated through the workflow runner" or "no more than 4 hops."
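The nesting and the chain checks can be illustrated with a few lines of Python. This is a sketch of the idea only: the real nesting happens in the Java SPI and the real checks live in Rego, and the function names here are hypothetical. The hop limit in this sketch counts actors in the act chain, which is one possible reading of "hops".

```python
def add_actor(claims: dict, actor: str) -> dict:
    """Return new claims with `actor` as the outermost act, wrapping any prior chain."""
    new_act = {"sub": actor}
    if "act" in claims:
        new_act["act"] = claims["act"]
    return {**claims, "act": new_act}


def actor_chain(claims: dict) -> list:
    """Flatten nested act claims into a list, most recent actor first."""
    chain = []
    act = claims.get("act")
    while isinstance(act, dict):
        chain.append(act.get("sub"))
        act = act.get("act")
    return chain


def chain_allowed(claims: dict, required_actor: str, max_hops: int) -> bool:
    """Require a given actor somewhere in the chain and cap the chain length."""
    chain = actor_chain(claims)
    return required_actor in chain and len(chain) <= max_hops


claims = {"sub": "user-123"}
for service in ("platform-api", "workflow-runner", "task-executor"):
    claims = add_actor(claims, service)

print(actor_chain(claims))  # ['task-executor', 'workflow-runner', 'platform-api']
print(chain_allowed(claims, "workflow-runner", 4))  # True
```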
The SPI had to handle a subtle issue with Keycloak's default validation. Keycloak's built-in token exchange provider validates that the client making the exchange request matches the azp in the subject token. That works for single-hop exchanges (service-a exchanges the user's token, and the user's token has azp set to the frontend client or service-a). It breaks for chained exchanges because the subject token's azp is the previous service in the chain, not the current one. The SPI relaxes this check for a whitelist of service clients that are permitted to perform chained exchanges.
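In outline, the relaxed check behaves like the Python function below. The actual provider is Java and hooks into Keycloak's exchange flow; this sketch only shows the decision, and the function and parameter names are invented for illustration.

```python
def exchange_permitted(requesting_client: str, subject_azp: str,
                       chained_clients: set) -> bool:
    """Allow the default case (client matches azp) or a listed chained-exchange client."""
    if requesting_client == subject_azp:
        return True  # the check Keycloak applies by default
    # Relaxation: listed service clients may exchange tokens issued to another party.
    return requesting_client in chained_clients


permitted = {"workflow-runner", "task-executor"}
print(exchange_permitted("service-a", "service-a", permitted))           # True
print(exchange_permitted("task-executor", "workflow-runner", permitted))  # True
print(exchange_permitted("unknown-client", "workflow-runner", permitted)) # False
```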
Long-lived tasks and the expiry problem
User JWTs expire (ours have a 15-minute lifetime). A bioinformatics pipeline can run for 8 hours. If the task executor needs to call back to the data service at hour 6 to write output files, the original token is long dead.
The exchanged tokens issued by our SPI have a configurable lifetime that's independent of the original token. When the workflow runner exchanges the user token at submission time, it gets a token with a lifetime matching the workflow's estimated duration (with a safety margin). That token is passed to the task executor, which can use it for the duration of the run.
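As a rough sketch of how such a lifetime might be derived, the Python below pads the estimated duration and caps the result. The 25% margin and the 24-hour cap are placeholder assumptions chosen for the example; the post does not state the values our SPI uses.

```python
import math


def exchanged_token_lifetime(estimated_seconds: int,
                             margin: float = 0.25,          # assumed safety margin
                             max_seconds: int = 24 * 3600   # assumed upper bound
                             ) -> int:
    """Estimated workflow duration plus a safety margin, capped at a maximum."""
    padded = math.ceil(estimated_seconds * (1 + margin))
    return min(padded, max_seconds)


# An 8-hour pipeline gets a 10-hour token under these assumed defaults.
print(exchanged_token_lifetime(8 * 3600))  # 36000
```

A cap of some kind is worth having so that a wildly wrong duration estimate cannot mint a token that lives for days.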
This is the same mechanism that lets the system work across infrastructure boundaries. An HPC job running on a bare-metal node outside the Kubernetes cluster can carry an exchanged token and use it to call back to platform services over HTTPS. The HPC node doesn't need to be part of the mesh. It doesn't need a sidecar. It just needs a valid JWT with the right claims. The receiving service validates the token at the mesh layer (or via standard JWT validation if it's not in the mesh), checks the act chain and user roles via OPA, and processes the request.
The proxy sidecar
To avoid every service implementing the token exchange call, I built a lightweight proxy (a Go binary, deployed as a sidecar) that intercepts outbound service-to-service calls and handles the exchange transparently:
- Reads the raw user JWT from the inbound request headers (preserved by Istio's forwardOriginalToken)
- Calls Keycloak's token exchange endpoint with the user JWT and the target service's client ID
- Attaches the exchanged token to the outbound request
- Caches exchanged tokens by (user_sub, target_service, act_chain_hash) tuple with TTL matching the token's exp claim
The cache is an in-memory LRU bounded by entry count (default 1000) and TTL. Cache hits skip the Keycloak round-trip. In production, about 85% of exchange requests hit the cache because the same users tend to trigger the same call chains within a token's lifetime.
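The cache's behavior can be sketched in Python (the proxy itself is Go, so this is an illustration of the data structure, not the production code). The class and method names are hypothetical, and the injectable clock is there only to make expiry testable.

```python
import time
from collections import OrderedDict


class ExchangeCache:
    """LRU cache bounded by entry count, with per-entry expiry taken from the token's exp."""

    def __init__(self, max_entries: int = 1000, clock=time.time):
        self._entries = OrderedDict()  # key -> (token, exp as epoch seconds)
        self._max = max_entries
        self._clock = clock

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        token, exp = item
        if exp <= self._clock():
            del self._entries[key]  # expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return token

    def put(self, key, token, exp):
        self._entries[key] = (token, exp)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used


cache = ExchangeCache(max_entries=1000)
key = ("user-123", "data-service", "chain-hash-abc")
cache.put(key, "exchanged-token", exp=time.time() + 300)
print(cache.get(key))  # exchanged-token
```

Keying on the act chain hash as well as the user and target keeps a token minted for one delegation path from being reused on a different one.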
Application code doesn't know the exchange is happening. It makes a normal HTTP call. The proxy handles the rest. This meant we could roll out token exchange service by service without rewriting application code. The proxy just appeared as a new sidecar in the pod spec, and the application's outbound calls started carrying proper delegated tokens.
What it solved
The change was less about replacing one security technology with another and more about making the user's identity a first-class concept that survives across service boundaries, infrastructure boundaries, and time.
Every service-to-service call now carries the originating user's identity and the full delegation chain. Authorization policies can be user-aware at every hop. Compliance audit can trace any action back to the user who initiated it, through every service that participated. Long-running tasks on remote infrastructure can call back to platform services with proper delegation. And adding a new service to the platform doesn't require distributing API keys or writing custom auth middleware.
The migration took about three months. We ran the old API-key auth and the new token exchange in parallel for two weeks (OPA policies accepted either a valid exchanged JWT or a valid API key) before removing the API key path. No incidents during the overlap.