Building a GPU SaaS Platform - Queue-First Ingress
In Part 12, we dealt with cold start at the image level. That gave us a better startup path for large GPU images, but it still did not answer a more basic serverless control-plane question:
where should requests go before any worker is ready?
The answer cannot be “directly to a Pod.”
If requests go straight to a worker address, then we lose the properties that make serverless workable in the first place:
- requests can arrive before workers exist
- requests can survive worker churn
- the platform can decide when to create, reuse, or retire workers
So Part 13 introduces the first queue-first runtime contract.
This chapter does two things:
- it records the serverless policy that the control plane has already decided on
- it makes invocation ingress durable by sending requests to NATS JetStream before any worker executes them
Chapter Goal
By the end of Part 13, the project has four new properties:
- GPUUnit.spec.serverless stores the runtime-side serverless contract, including the control-plane-generated requestID
- the manager can optionally connect to NATS JetStream through local YAML config
- the HTTP API exposes a queue-first invocation ingress endpoint at /api/v1/serverless/invocations
- synchronous and asynchronous invocation modes now share one durable message contract, even though the dedicated activator has not been added yet
This chapter intentionally does not yet implement:
- the dedicated activator service
- worker registration
- sidecar-based request execution
- scale-to-zero or prewarm lifecycle control
Those are the next chapter’s job.
The Full Request Path Readers Should Keep In Mind
Part 13 only implements one internal boundary, so it is easy to lose sight of the whole request path. Before we go deeper, it helps to state the intended platform flow explicitly.
In the full design, each serverless application exposes its own public request URL from the control plane. A user does not call a worker Pod directly, and does not call the runtime manager directly.
The end-to-end path is supposed to look like this:
- a user sends a request to the public serverless URL
- the control plane authenticates the caller and validates tenant, quota, and request shape
- the control plane forwards the validated request into the platform’s internal request path
- the internal runtime-facing layer forwards the invocation to the runtime manager or operator API
- the runtime manager persists the invocation into NATS JetStream
- a later activator service consumes that durable request, decides whether to reuse or create a worker, and publishes a worker-targeted dispatch message
- the worker sidecar consumes that dispatch message, calls the local worker framework over an internal protocol, and keeps NATS credentials away from the user workload
- the worker sidecar publishes the result, sync reply, and metrics paths back into NATS
This chapter implements the fifth step in that list: the runtime manager persists the invocation into NATS JetStream.
That is why the public request URL is not part of this repository, and why the endpoint added here:
POST /api/v1/serverless/invocations
should be understood as an internal development-facing ingress for the queue boundary, not as the final public serving interface.
If we keep that picture in mind, the chapter becomes much easier to read. We are not trying to finish serverless routing here. We are only making sure that, once a validated request reaches the runtime side, it is durably accepted before any worker decision happens.
Why The Request ID Belongs To The Control Plane
Earlier in the series, some configuration naturally lived inside the runtime because it was really runtime-local. Serverless request identity is different.
When a user configures a serverless application, the control plane decides the logical workload identity and scaling policy. That identity must exist before any particular worker exists, because requests may arrive while the worker pool is still empty.
So this chapter treats the requestID as a control-plane artifact.
The runtime does not generate it. The runtime only records it.
That is why the GPUUnit schema now carries a serverless block like this:
serverless:
  enabled: true
  requestID: sd-webui
  minAvailableCount: 1
  idleTimeoutSeconds: 300
  minRequestCount: 0
This is not yet a worker-orchestration feature. It is a contract boundary.
The runtime can now say:
- this unit belongs to serverless request pool sd-webui
- this is the prewarm floor
- this is the idle timeout policy
- this is the threshold policy the future activator will honor
That is enough for this chapter.
Why NATS JetStream Fits This Boundary
For this tutorial, NATS JetStream is a good fit for three reasons.
First, the publish path is very small. The runtime manager only needs to publish invocation envelopes durably and receive an acknowledgement.
Second, JetStream already gives us the durability and de-duplication hooks we need. This chapter uses the invocation ID as the JetStream message ID, so duplicate publishes can be detected by the stream.
Third, the subject model maps cleanly onto later serverless control-plane concepts.
This chapter introduces three subject families:
runtime.serverless.invoke.<requestID>
runtime.serverless.result.<requestID>
runtime.serverless.metrics.<requestID>
Only the invocation publish path is used today, but the result and metrics subjects are already part of the shared contract so that the activator, worker sidecar, and local framework boundary can reuse the same identity in the next chapter.
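The subject layout above can be sketched as a small builder. This is only an illustration: the names Subjects, BuildSubjects, and validateToken are hypothetical and are not taken from pkg/serverless. The validation step is an assumption based on how NATS subjects work: tokens are separated by dots, and * and > are wildcards, so a request ID has to stay a single plain token.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Subjects groups the three per-pool subject families named in this chapter.
// The type and field names are hypothetical.
type Subjects struct {
	Invoke  string
	Result  string
	Metrics string
}

// validateToken rejects request IDs that cannot serve as one NATS subject
// token: empty values, dots (token separators), wildcards, and whitespace.
func validateToken(id string) error {
	if id == "" {
		return errors.New("request ID is empty")
	}
	if strings.ContainsAny(id, ".*> \t\r\n") {
		return fmt.Errorf("request ID %q contains a separator, wildcard, or whitespace", id)
	}
	return nil
}

// BuildSubjects derives the invoke, result, and metrics subjects for one
// serverless request ID under the given prefix.
func BuildSubjects(prefix, requestID string) (Subjects, error) {
	if err := validateToken(requestID); err != nil {
		return Subjects{}, err
	}
	return Subjects{
		Invoke:  prefix + ".invoke." + requestID,
		Result:  prefix + ".result." + requestID,
		Metrics: prefix + ".metrics." + requestID,
	}, nil
}

func main() {
	s, err := BuildSubjects("runtime.serverless", "sd-webui")
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Invoke)
	fmt.Println(s.Result)
	fmt.Println(s.Metrics)
}
```

Deriving all three subjects from one function keeps the publisher, the future activator, and the sidecar from drifting apart on naming.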
The Shared Invocation Contract
The new ingress endpoint uses one invocation envelope for both synchronous and asynchronous calls.
A simplified request looks like this:
{
  "serverlessRequestID": "sd-webui",
  "mode": "sync",
  "payload": {
    "prompt": "draw a robot"
  }
}
The important design point is that sync and async are now part of the message contract, not part of the transport shortcut.
That means both modes follow the same first step:
- validate and normalize the invocation request
- assign or preserve an invocationID
- publish the invocation envelope to JetStream
- return the queue acknowledgement
After that first step, the two modes will diverge in later chapters:
- async will return the durable acknowledgement immediately, and the caller will track progress by invocationID
- sync will still enter the same durable queue first, but the future activator will wait on the corresponding result path and only return once a worker finishes or the request times out
In this chapter, sync does not yet block waiting for the model result, because the dedicated activator and worker path do not exist yet.
Instead, sync and async are both durably accepted into the same queue path, and the mode becomes a downstream execution hint for the activator we
will add next.
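To make the validate, normalize, and assign-or-preserve steps concrete, here is a minimal sketch of the envelope handling. It is not the repository's code: the Invocation type, the Normalize function, the invocationID JSON field name, and the random hex ID format are all assumptions made for illustration. The sketch also rejects a missing mode instead of defaulting it, which may differ from the real behavior.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Mode is the invocation mode carried inside the message contract.
type Mode string

const (
	ModeSync  Mode = "sync"
	ModeAsync Mode = "async"
)

// Invocation mirrors the simplified request body shown above.
// The invocationID field name is an assumption.
type Invocation struct {
	ServerlessRequestID string          `json:"serverlessRequestID"`
	InvocationID        string          `json:"invocationID,omitempty"`
	Mode                Mode            `json:"mode"`
	Payload             json.RawMessage `json:"payload"`
}

// newInvocationID returns 16 random bytes as hex (an assumed ID format).
func newInvocationID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// Normalize validates the request, trims identifiers, and assigns an
// invocation ID only when the caller did not supply one.
func Normalize(in Invocation) (Invocation, error) {
	in.ServerlessRequestID = strings.TrimSpace(in.ServerlessRequestID)
	if in.ServerlessRequestID == "" {
		return in, errors.New("serverlessRequestID is required")
	}
	switch in.Mode {
	case ModeSync, ModeAsync:
	default:
		return in, fmt.Errorf("unsupported mode %q", in.Mode)
	}
	in.InvocationID = strings.TrimSpace(in.InvocationID)
	if in.InvocationID == "" {
		id, err := newInvocationID()
		if err != nil {
			return in, err
		}
		in.InvocationID = id
	}
	return in, nil
}

func main() {
	body := []byte(`{"serverlessRequestID":"sd-webui","mode":"sync","payload":{"prompt":"draw a robot"}}`)
	var in Invocation
	if err := json.Unmarshal(body, &in); err != nil {
		panic(err)
	}
	out, err := Normalize(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Mode, out.InvocationID)
}
```

Preserving a caller-supplied ID matters for retries: a client that resends the same invocationID gives the stream a chance to recognize the publish as a duplicate.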
What Changed In The Code
The runtime-side serverless contract now starts in the API type:
GPUUnit.spec.serverless
That spec is normalized at the contract layer during create and update requests, so the runtime stores a clean requestID and defaulted lifecycle
values instead of leaving that work to the service layer later.
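As a rough illustration of what normalizing at the contract layer could look like, the sketch below mirrors the fields of the serverless block shown earlier. Everything beyond those field names is assumed: the NormalizeSpec function, the lowercasing of the request ID, and the 300-second default for an unset idle timeout are hypothetical choices, not the project's documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ServerlessSpec mirrors the fields of the serverless block in GPUUnit.spec.
type ServerlessSpec struct {
	Enabled            bool   `json:"enabled"`
	RequestID          string `json:"requestID"`
	MinAvailableCount  int32  `json:"minAvailableCount"`
	IdleTimeoutSeconds int32  `json:"idleTimeoutSeconds"`
	MinRequestCount    int32  `json:"minRequestCount"`
}

// defaultIdleTimeoutSeconds is an assumed default, used when the value is 0.
const defaultIdleTimeoutSeconds = 300

// NormalizeSpec cleans the request ID and fills lifecycle defaults.
// It never invents a request ID: that value belongs to the control plane.
func NormalizeSpec(s ServerlessSpec) (ServerlessSpec, error) {
	if !s.Enabled {
		return s, nil // nothing to normalize for non-serverless units
	}
	s.RequestID = strings.ToLower(strings.TrimSpace(s.RequestID))
	if s.RequestID == "" {
		return s, errors.New("serverless.requestID is required when serverless is enabled")
	}
	if s.MinAvailableCount < 0 || s.MinRequestCount < 0 || s.IdleTimeoutSeconds < 0 {
		return s, errors.New("serverless counts and timeouts must not be negative")
	}
	if s.IdleTimeoutSeconds == 0 {
		s.IdleTimeoutSeconds = defaultIdleTimeoutSeconds
	}
	return s, nil
}

func main() {
	spec, err := NormalizeSpec(ServerlessSpec{
		Enabled:           true,
		RequestID:         "  SD-WebUI ",
		MinAvailableCount: 1,
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", spec)
}
```

Doing this once at create and update time means every later reader of the object can rely on a clean requestID without repeating the checks.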
The queue contract lives in a dedicated shared package:
pkg/serverless
That package defines:
- invocation modes
- subject builders
- request ID normalization
- the durable invocation envelope
- the NATS JetStream publisher
The manager YAML now has an optional serverless: section:
serverless:
  url: "nats://127.0.0.1:4222"
  subjectPrefix: "runtime.serverless"
  streamName: "RUNTIME_SERVERLESS"
  streamReplicas: 1
  streamMaxAge: "72h"
  connectTimeout: "5s"
  duplicatesWindow: "24h"
If url is empty, queue ingress stays disabled.
If it is set, the manager connects to JetStream and ensures the stream exists before starting the HTTP API.
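The enable-or-disable rule and the duration fields can be sketched with the standard library alone. The QueueConfig and ResolvedQueueConfig types and the Resolve function are hypothetical names, YAML decoding is left out, and the check that the duplicates window does not exceed the stream's max age is an assumed sanity check rather than something this chapter specifies.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// QueueConfig mirrors the serverless: section of the manager YAML.
// Durations stay strings here, exactly as they appear in the file.
type QueueConfig struct {
	URL              string
	SubjectPrefix    string
	StreamName       string
	StreamReplicas   int
	StreamMaxAge     string
	ConnectTimeout   string
	DuplicatesWindow string
}

// ResolvedQueueConfig holds the parsed values the manager would act on.
type ResolvedQueueConfig struct {
	Enabled          bool
	StreamMaxAge     time.Duration
	ConnectTimeout   time.Duration
	DuplicatesWindow time.Duration
}

// Resolve returns a disabled config when url is empty, and otherwise
// parses and sanity-checks the duration fields.
func Resolve(c QueueConfig) (ResolvedQueueConfig, error) {
	if c.URL == "" {
		return ResolvedQueueConfig{}, nil // queue ingress stays disabled
	}
	maxAge, err := time.ParseDuration(c.StreamMaxAge)
	if err != nil {
		return ResolvedQueueConfig{}, fmt.Errorf("streamMaxAge: %w", err)
	}
	timeout, err := time.ParseDuration(c.ConnectTimeout)
	if err != nil {
		return ResolvedQueueConfig{}, fmt.Errorf("connectTimeout: %w", err)
	}
	window, err := time.ParseDuration(c.DuplicatesWindow)
	if err != nil {
		return ResolvedQueueConfig{}, fmt.Errorf("duplicatesWindow: %w", err)
	}
	// Assumed check: de-duplicating for longer than messages are retained
	// would not be meaningful.
	if window > maxAge {
		return ResolvedQueueConfig{}, errors.New("duplicatesWindow must not exceed streamMaxAge")
	}
	return ResolvedQueueConfig{
		Enabled:          true,
		StreamMaxAge:     maxAge,
		ConnectTimeout:   timeout,
		DuplicatesWindow: window,
	}, nil
}

func main() {
	cfg, err := Resolve(QueueConfig{
		URL:              "nats://127.0.0.1:4222",
		StreamMaxAge:     "72h",
		ConnectTimeout:   "5s",
		DuplicatesWindow: "24h",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.Enabled, cfg.StreamMaxAge, cfg.DuplicatesWindow)
}
```

Note that Go's time.ParseDuration accepts hours as its largest unit, so values such as "72h" parse while a day suffix like "3d" would be rejected.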
The new HTTP endpoint is:
POST /api/v1/serverless/invocations
It does not try to execute work itself. It only performs the durable enqueue step and returns a durable acknowledgement, including:
- the invocationID
- the stream sequence
- whether JetStream treated the publish as a duplicate
- the invocation, result, and metrics subjects associated with that serverless request ID
That response is the bridge between the internal request path and the later activator.
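The enqueue-and-acknowledge behavior can be sketched as a handler that depends only on a small publisher interface. This is a hedged illustration, not the repository's handler: the Ack field names, the Publisher interface, the 202 status code, and the in-memory memoryPublisher are all assumptions. The stand-in publisher only imitates de-duplication by message ID; in a real setup that role would be played by a JetStream publish that uses the invocation ID as the message ID. For brevity, this sketch requires the caller to supply invocationID instead of assigning one.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// Ack is a hypothetical response shape; the field names are assumptions.
type Ack struct {
	InvocationID   string `json:"invocationID"`
	Sequence       uint64 `json:"sequence"`
	Duplicate      bool   `json:"duplicate"`
	InvokeSubject  string `json:"invokeSubject"`
	ResultSubject  string `json:"resultSubject"`
	MetricsSubject string `json:"metricsSubject"`
}

// Publisher abstracts the durable enqueue step so the handler stays testable.
type Publisher interface {
	Publish(subject, msgID string, data []byte) (seq uint64, duplicate bool, err error)
}

// memoryPublisher is an in-memory stand-in, not JetStream. It imitates
// de-duplication by message ID and reports the first sequence for repeats.
type memoryPublisher struct {
	mu   sync.Mutex
	next uint64
	seen map[string]uint64
}

func newMemoryPublisher() *memoryPublisher {
	return &memoryPublisher{seen: make(map[string]uint64)}
}

func (p *memoryPublisher) Publish(subject, msgID string, data []byte) (uint64, bool, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if seq, ok := p.seen[msgID]; ok {
		return seq, true, nil
	}
	p.next++
	p.seen[msgID] = p.next
	return p.next, false, nil
}

type invocation struct {
	ServerlessRequestID string          `json:"serverlessRequestID"`
	InvocationID        string          `json:"invocationID"`
	Mode                string          `json:"mode"`
	Payload             json.RawMessage `json:"payload"`
}

// InvocationHandler enqueues one invocation and replies with the acknowledgement.
func InvocationHandler(prefix string, pub Publisher) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var in invocation
		if err := json.NewDecoder(r.Body).Decode(&in); err != nil ||
			in.ServerlessRequestID == "" || in.InvocationID == "" {
			http.Error(w, "invalid invocation", http.StatusBadRequest)
			return
		}
		data, err := json.Marshal(in)
		if err != nil {
			http.Error(w, "encode failed", http.StatusInternalServerError)
			return
		}
		invoke := prefix + ".invoke." + in.ServerlessRequestID
		seq, dup, err := pub.Publish(invoke, in.InvocationID, data)
		if err != nil {
			http.Error(w, "enqueue failed", http.StatusServiceUnavailable)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted) // assumed status code
		json.NewEncoder(w).Encode(Ack{
			InvocationID:   in.InvocationID,
			Sequence:       seq,
			Duplicate:      dup,
			InvokeSubject:  invoke,
			ResultSubject:  prefix + ".result." + in.ServerlessRequestID,
			MetricsSubject: prefix + ".metrics." + in.ServerlessRequestID,
		})
	}
}

// postOnce drives the handler in-process and decodes the acknowledgement.
func postOnce(h http.Handler, body string) (int, Ack) {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/serverless/invocations", strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	var ack Ack
	_ = json.NewDecoder(rec.Body).Decode(&ack)
	return rec.Code, ack
}

func main() {
	h := InvocationHandler("runtime.serverless", newMemoryPublisher())
	body := `{"serverlessRequestID":"sd-webui","invocationID":"inv-1","mode":"async","payload":{"prompt":"draw a robot"}}`
	_, first := postOnce(h, body)
	_, second := postOnce(h, body)
	fmt.Println(first.Sequence, first.Duplicate, second.Sequence, second.Duplicate)
}
```

Keeping the handler behind an interface also makes the "queue ingress disabled" case easy to express: the manager can simply skip registering the route when no publisher was configured.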
Verification
There are four good checks after implementing this chapter.
1. Verify the new GPUUnit contract
Inspect the sample object:
cat config/samples/runtime_v1alpha1_gpuunit.yaml
Check that the serverless: block is present and records the control-plane identity and lifecycle hints.
2. Start a local JetStream-enabled NATS
nats-server -js
Then set serverless.url in config/local/runtime-manager.yaml.
3. Start the runtime manager
GOTOOLCHAIN=go1.26.0 go run ./cmd/main.go --config config/local/runtime-manager.yaml
The manager should create or update the configured stream before serving the API.
4. Publish one invocation
curl -X POST http://127.0.0.1:8080/api/v1/serverless/invocations \
  -H 'Content-Type: application/json' \
  -d '{
    "serverlessRequestID": "sd-webui",
    "mode": "async",
    "payload": {
      "prompt": "draw a robot"
    }
  }'
The response should contain:
- an invocationID
- a JetStream sequence
- the invocation subject
- the result subject
- the metrics subject
That means the request has entered the durable queue path before any worker execution exists.
Summary
Part 13 is about ordering the serverless control plane correctly.
Before we add the activator, we need two things:
- a stable runtime-side record of serverless identity and lifecycle hints
- a durable queue ingress path that accepts work before workers are ready
This chapter adds both.
GPUUnit now records the serverless contract, and the runtime manager now publishes invocation envelopes into NATS JetStream before any execution
happens. In the full platform, that enqueue happens only after the control plane has already validated the user’s request and forwarded it into the
internal runtime path. That gives the next chapter a clean starting point: the activator can focus on worker selection and lifecycle management
instead of inventing request identity and queue semantics from scratch.
Next Chapter Preview
Part 14 will add the dedicated activator service and the worker-dispatch boundary. That chapter will consume the invocation subjects, decide when to
create or reuse GPUUnit workers, publish worker-targeted dispatch messages, and define how the worker sidecar and local framework split
responsibilities.
Repository
Code for this chapter: