Building a GPU SaaS Platform - Worker Lifecycle
In the previous chapter, we changed the worker sidecar and the main worker container so that the sidecar consumes dispatch data and communicates with the worker over UDS. That keeps NATS credentials inside the trusted platform boundary while letting users focus as much as possible on their own application logic.
But the most important serverless concern, worker lifecycle management, was still missing. We need at least a few basic policies, such as the minimum number of available instances, maximum idle time, and scale-up or scale-down rules.
So in this chapter, we will finish the last missing piece of the serverless runtime path.
Chapter Goal
By the end of Part 16, the project has four new properties:
- the activator consumes worker metrics and maintains an in-memory worker state table
- serverless.minAvailableCount now drives prewarm worker creation
- serverless.idleTimeoutSeconds now drives idle worker scale-down
- activator-created workers are clearly separated from user-created template units
This chapter is about putting lifecycle ownership in the correct process and making the control loop executable. But is this already a production-ready architecture? Which critical modules are still missing? Keep those questions in mind as we walk through the chapter.
The New Lifecycle Shape
The runtime-side shape now looks like this:
The activator now runs two durable consumers plus one periodic reconcile loop:
- ingress consumer: turns queued invocations into worker dispatch messages
- metrics consumer: records worker registration, heartbeat, start, finish, and failure events
- lifecycle reconcile: creates missing warm workers and deletes idle managed workers
That means the activator is no longer only a dispatch bridge. It is now the owner of the serverless worker pool.
Prewarm: Turning minAvailableCount Into Workers
The control plane records the serverless policy on the GPUUnit spec:
serverless:
  enabled: true
  requestID: sd-webui
  minAvailableCount: 2
  idleTimeoutSeconds: 300
  framework:
    socketPath: /tmp/serverless-framework/framework.sock
    invokePath: /invoke
    healthPath: /healthz
The activator periodically lists serverless-enabled GPUUnit objects and groups them by serverless.requestID.
For each request group, it compares:
- desired warm capacity: minAvailableCount
- current ready workers
- current progressing workers
If ready plus progressing is below the target, the activator creates more workers by cloning the non-managed template unit for that request ID.
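The comparison above reduces to one subtraction with a floor at zero. The following is a minimal sketch, not the project's actual code: PoolCounts and WorkersToCreate are hypothetical names introduced here for illustration.

```go
package main

import "fmt"

// PoolCounts is a hypothetical per-requestID summary built by the reconcile loop.
type PoolCounts struct {
	MinAvailable int // serverless.minAvailableCount from the GPUUnit spec
	Ready        int // workers that can take dispatch right now
	Progressing  int // workers that were created but are not ready yet
}

// WorkersToCreate returns how many workers are missing from the warm pool.
// Progressing workers count toward the target, so a slow start does not
// trigger a second round of creation on the next reconcile tick.
func WorkersToCreate(c PoolCounts) int {
	missing := c.MinAvailable - (c.Ready + c.Progressing)
	if missing < 0 {
		return 0
	}
	return missing
}

func main() {
	fmt.Println(WorkersToCreate(PoolCounts{MinAvailable: 2, Ready: 0, Progressing: 0})) // 2
	fmt.Println(WorkersToCreate(PoolCounts{MinAvailable: 2, Ready: 1, Progressing: 1})) // 0
	fmt.Println(WorkersToCreate(PoolCounts{MinAvailable: 2, Ready: 3, Progressing: 0})) // 0
}
```

Clamping at zero keeps prewarm strictly additive: a pool that is above target is left for the idle scale-down path to shrink.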
The worker creation still goes through the existing CreateGPUUnit service path.
That is important because stock handoff, storage validation, SSH settings, access settings, and serverless sidecar injection all remain in the same runtime path we already built.
The activator is not creating Pods directly.
It is asking the runtime service to create another GPUUnit.
Worker Names Need To Be Distinguishable
A subtle lifecycle problem appears as soon as scale-down exists:
how does the activator know which workers it is allowed to delete?
We do not want it to delete the user-created template unit by accident.
So activator-created workers now use a distinct managed-name shape:
unit-<requestID>-worker-<suffix>
For example:
unit-sd-webui-worker-12345678
That gives the lifecycle loop a simple rule:
- user-created template units are preserved
- activator-created managed workers may be deleted when they are idle
This is deliberately conservative.
If a unit was not created by the activator naming path, the lifecycle manager leaves it alone.
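The naming rule can be sketched as a pair of helpers: one that builds a managed name and one that recognizes it. The function names are hypothetical, and the suffix is passed in because this chapter does not describe how the real suffix is generated.

```go
package main

import (
	"fmt"
	"strings"
)

// managedPrefix is the part of the name shared by all managed workers
// of one request ID: unit-<requestID>-worker-
func managedPrefix(requestID string) string {
	return "unit-" + requestID + "-worker-"
}

// ManagedWorkerName builds unit-<requestID>-worker-<suffix>.
func ManagedWorkerName(requestID, suffix string) string {
	return managedPrefix(requestID) + suffix
}

// IsManagedWorker reports whether name follows the managed shape for this
// request ID. A bare prefix with no suffix is rejected.
func IsManagedWorker(name, requestID string) bool {
	prefix := managedPrefix(requestID)
	return strings.HasPrefix(name, prefix) && len(name) > len(prefix)
}

func main() {
	name := ManagedWorkerName("sd-webui", "12345678")
	fmt.Println(name)                                             // unit-sd-webui-worker-12345678
	fmt.Println(IsManagedWorker(name, "sd-webui"))                // true
	fmt.Println(IsManagedWorker("sd-webui-template", "sd-webui")) // false
}
```

The check defaults to "not managed": any unit that fails the pattern is treated as user-owned and is never a deletion candidate.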
Idle Scale-Down: Metrics Become Lifecycle Input
The worker sidecar already emits metrics events:
- registered
- heartbeat
- invocation_started
- invocation_finished
- invocation_failed
Part 16 adds an activator metrics consumer for:
runtime.serverless.metrics.<requestID>
The activator folds those events into a worker state table:
worker namespace/name
-> requestID
-> last seen time
-> last activity time
-> inflight count
-> last event type
There is one important detail:
heartbeats update lastSeen, not lastActivity.
If heartbeat refreshed activity, an idle worker would never become idle because the sidecar keeps sending heartbeats forever. So activity is only advanced by registration or invocation events.
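A minimal sketch of that fold, assuming one state record per worker. The type and field names are hypothetical, and two behaviors are assumptions rather than facts from the project: every event refreshes lastSeen, and the inflight count is derived by counting started events against finished or failed events.

```go
package main

import (
	"fmt"
	"time"
)

// MetricEvent is a hypothetical, simplified view of a sidecar metrics event.
type MetricEvent struct {
	Type string // registered, heartbeat, invocation_started, ...
	At   time.Time
}

// WorkerState mirrors the state table fields described above.
type WorkerState struct {
	RequestID    string
	LastSeen     time.Time
	LastActivity time.Time
	Inflight     int
	LastEvent    string
}

// Apply folds one event into the state. Every event proves liveness, but
// only registration and invocation events advance LastActivity.
func (s *WorkerState) Apply(ev MetricEvent) {
	s.LastSeen = ev.At
	s.LastEvent = ev.Type
	switch ev.Type {
	case "registered":
		s.LastActivity = ev.At
	case "invocation_started":
		s.Inflight++
		s.LastActivity = ev.At
	case "invocation_finished", "invocation_failed":
		if s.Inflight > 0 {
			s.Inflight--
		}
		s.LastActivity = ev.At
	}
	// heartbeat and unknown types fall through: liveness only.
}

func main() {
	start := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	s := &WorkerState{RequestID: "sd-webui"}
	s.Apply(MetricEvent{Type: "registered", At: start})
	s.Apply(MetricEvent{Type: "heartbeat", At: start.Add(30 * time.Second)})
	fmt.Println(s.LastSeen.Sub(s.LastActivity)) // 30s
}
```

The gap between LastSeen and LastActivity is what makes idleness measurable: the worker is known to be alive, yet has done no work for 30 seconds.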
When idleTimeoutSeconds is configured, the lifecycle loop can delete a worker only when all of these are true:
- the worker is ready
- the worker name matches the activator-managed name pattern
- the worker has no inflight invocation according to metrics
- the worker has been idle longer than idleTimeoutSeconds
- deleting it would not take the ready pool below minAvailableCount
That gives us the first real scale-down path without letting the activator delete user-owned template units.
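The five conditions can be written as a single predicate. This is a sketch under assumptions: Candidate and CanScaleDown are hypothetical names, a non-positive timeout is treated as "not configured", and readyCount is whatever the caller counts as the current ready pool for the request ID.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Candidate is a hypothetical join of unit status and metrics state.
type Candidate struct {
	Name         string
	Ready        bool
	Inflight     int
	LastActivity time.Time
}

// CanScaleDown applies the scale-down conditions in order and refuses
// deletion as soon as one of them fails.
func CanScaleDown(c Candidate, requestID string, idleTimeout time.Duration,
	readyCount, minAvailable int, now time.Time) bool {
	if idleTimeout <= 0 { // assumption: no timeout configured, never delete
		return false
	}
	if !c.Ready {
		return false
	}
	if !strings.HasPrefix(c.Name, "unit-"+requestID+"-worker-") {
		return false // not created by the activator naming path
	}
	if c.Inflight > 0 {
		return false
	}
	if now.Sub(c.LastActivity) <= idleTimeout {
		return false
	}
	// Deleting one ready worker must not drop the pool below the minimum.
	return readyCount-1 >= minAvailable
}

func main() {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	idle := Candidate{
		Name:         "unit-sd-webui-worker-12345678",
		Ready:        true,
		LastActivity: now.Add(-2 * time.Minute),
	}
	fmt.Println(CanScaleDown(idle, "sd-webui", time.Minute, 3, 2, now)) // true
	fmt.Println(CanScaleDown(idle, "sd-webui", time.Minute, 2, 2, now)) // false
}
```

The second call returns false even though the worker is idle, because removing it would leave only one ready worker against a minimum of two.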
Where Results Go In This Boundary
Part 13 introduced async invocation enqueueing. Part 15 made the worker publish results.
Part 16 deliberately does not make the activator a result API.
The worker sidecar already publishes completed results to:
runtime.serverless.result.<requestID>
That subject is still part of the runtime event stream, but the activator does not consume it.
For sync requests, the activator only preserves the dispatch contract. If the invocation has a replySubject, the activator passes it through to the worker dispatch message. The worker sidecar sends the result to that reply subject after local execution completes.
For async requests, result lookup belongs to the control plane:
- user calls the control-plane serverless URL
- control plane validates auth, tenant, quota, and request shape
- control plane publishes or forwards the invocation into the runtime queue path
- activator dispatches to workers
- worker sidecar publishes the durable result event
- a control-plane result consumer stores metadata and object pointers
- user reads the async result through a control-plane signed URL
The storage piece is not in Part 16. We will add it in Part 17 with ScyllaDB for invocation metadata and object storage for large payloads or results.
What Changed In The Code
The shared serverless queue contract now has one new consumer interface:
WorkerMetricConsumer
The NATS implementation now consumes:
runtime.serverless.metrics.*
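In NATS subjects, * matches exactly one token, so the last token of a matching subject is the request ID. A minimal sketch of recovering it, using a hypothetical helper name and assuming request IDs never contain a dot:

```go
package main

import (
	"fmt"
	"strings"
)

// RequestIDFromMetricsSubject extracts <requestID> from
// <prefix>.metrics.<requestID>. It rejects other subjects and IDs that
// would span more than one subject token.
func RequestIDFromMetricsSubject(prefix, subject string) (string, bool) {
	head := prefix + ".metrics."
	if !strings.HasPrefix(subject, head) {
		return "", false
	}
	id := strings.TrimPrefix(subject, head)
	if id == "" || strings.Contains(id, ".") {
		return "", false
	}
	return id, true
}

func main() {
	id, ok := RequestIDFromMetricsSubject("runtime.serverless", "runtime.serverless.metrics.sd-webui")
	fmt.Println(id, ok) // sd-webui true

	_, ok = RequestIDFromMetricsSubject("runtime.serverless", "runtime.serverless.result.sd-webui")
	fmt.Println(ok) // false
}
```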
The activator package now includes:
- a LifecycleManager that observes metrics and reconciles worker pools
- config for metrics consumer name and lifecycle interval
The activator config now looks like this locally:
consumerName: "runtime-activator"
metricsConsumerName: "runtime-activator-metrics"
workerReadyWait: "2m"
workerPollInterval: "2s"
lifecycleInterval: "15s"
serverless:
  url: "nats://127.0.0.1:4222"
  subjectPrefix: "runtime.serverless"
  streamName: "RUNTIME_SERVERLESS"
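The three timing values use Go duration syntax. As a sketch only, assuming the activator parses them with time.ParseDuration and rejects non-positive values (the chapter does not show how the real config loader works), validation could look like this. Timings and ParseTimings are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// Timings holds the parsed timing fields from the activator config.
type Timings struct {
	WorkerReadyWait    time.Duration
	WorkerPollInterval time.Duration
	LifecycleInterval  time.Duration
}

// parsePositive parses one duration string and rejects zero or negative
// values, which would make a ticker or wait loop misbehave.
func parsePositive(field, value string) (time.Duration, error) {
	d, err := time.ParseDuration(value)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", field, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("%s: must be positive, got %q", field, value)
	}
	return d, nil
}

// ParseTimings validates all three fields and returns the first error.
func ParseTimings(readyWait, pollInterval, lifecycleInterval string) (Timings, error) {
	var t Timings
	var err error
	if t.WorkerReadyWait, err = parsePositive("workerReadyWait", readyWait); err != nil {
		return Timings{}, err
	}
	if t.WorkerPollInterval, err = parsePositive("workerPollInterval", pollInterval); err != nil {
		return Timings{}, err
	}
	if t.LifecycleInterval, err = parsePositive("lifecycleInterval", lifecycleInterval); err != nil {
		return Timings{}, err
	}
	return t, nil
}

func main() {
	t, err := ParseTimings("2m", "2s", "15s")
	fmt.Println(t.WorkerReadyWait, t.WorkerPollInterval, t.LifecycleInterval, err) // 2m0s 2s 15s <nil>
}
```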
The runtime controller still owns Pod construction. The worker sidecar still owns NATS inside the Pod. The framework still only talks over UDS.
Part 16 only changes who owns worker lifecycle decisions.
Verification
Start NATS:
nats-server -js
Start the manager:
GOTOOLCHAIN=go1.26.0 go run ./cmd/main.go --config config/local/runtime-manager.yaml
Start the activator:
GOTOOLCHAIN=go1.26.0 go run ./cmd/activator --config config/local/activator.yaml
Start the example framework:
SERVERLESS_FRAMEWORK_SOCKET_PATH=/tmp/serverless-framework/framework.sock \
GOTOOLCHAIN=go1.26.0 go run ./cmd/framework-echo
Start the worker sidecar:
SERVERLESS_NATS_URL=nats://127.0.0.1:4222 \
SERVERLESS_SUBJECT_PREFIX=runtime.serverless \
SERVERLESS_STREAM_NAME=RUNTIME_SERVERLESS \
SERVERLESS_WORKER_NAME=sd-webui-template \
SERVERLESS_WORKER_NAMESPACE=runtime-instance \
SERVERLESS_REQUEST_ID=sd-webui \
SERVERLESS_FRAMEWORK_SOCKET_PATH=/tmp/serverless-framework/framework.sock \
SERVERLESS_FRAMEWORK_INVOKE_PATH=/invoke \
SERVERLESS_FRAMEWORK_HEALTH_PATH=/healthz \
GOTOOLCHAIN=go1.26.0 go run ./cmd/serverless-sidecar
Then enqueue an async invocation:
curl -s -X POST http://127.0.0.1:8080/api/v1/serverless/invocations \
-H 'Content-Type: application/json' \
-d '{
"serverlessRequestID":"sd-webui",
"mode":"async",
"attributes":{
"path":"/generate",
"method":"POST"
},
"payload":{
"prompt":"draw a lifecycle diagram"
}
}' | jq
At this chapter boundary, the result event is published to NATS but not yet persisted into a product result store. That persistence path is the ScyllaDB-backed Part 17.
For lifecycle behavior, create or update a serverless GPUUnit with:
serverless:
  requestID: sd-webui
  minAvailableCount: 2
  idleTimeoutSeconds: 60
Then watch the runtime namespace:
kubectl get gpuunits -n runtime-instance -w
The activator should create managed workers until the warm pool target is reached. After managed workers go idle longer than the timeout, it should delete only the activator-created workers while preserving the template unit.
Summary
Part 16 turns the activator from a request dispatcher into the owner of serverless worker lifecycle.
We now have:
- prewarm behavior from minAvailableCount
- idle scale-down from worker-side metrics and idleTimeoutSeconds
- conservative deletion that only applies to activator-managed workers
- a cleaner split between controller-owned Pod shape and activator-owned request pressure
The core serverless path now has the right control loops, even though the state is still local to one activator process.
Next Chapter Preview
Part 17 will add the result storage layer: ScyllaDB for invocation metadata and status, object storage for large payloads and result bodies, local Docker configuration, and a control-plane result consumer that writes worker result events into storage.
Repository
Code for this chapter: