Building a GPU SaaS Platform - Useful Operator Contracts
Part 5 gave us a real control loop:
- HTTP request
- `StockPool` custom resource
- controller reconcile
- Deployment
That was the minimum line where the project stopped being a mock.
Part 6 is about making that line survivable.
In a real system, the next problems are not glamorous:
- clients retry requests
- duplicate writes appear
- bad specs get accepted and then fail somewhere deeper in the stack
- the controller owns too little of the workload lifecycle
- the pod template is still too fake to support later runtime work
That is exactly what this chapter fixes. No one wants to spend every day patching the same fragile system. If you have ever been on call and spent the whole night half-awake, afraid of missing an alert, you already understand why these "boring" problems matter so much. Then the next day, short on sleep, you introduce even more bugs. Much of the time, engineers are not out "building the future"; they are repairing cracks in systems that should have been made safer earlier. That is why validation, monitoring, degradation, and circuit-breaking deserve to be taken seriously from the beginning.
Chapter Goal
By the end of this chapter, the runtime has five new properties:
- create requests are idempotent at the operation level
- the API contract is stricter about what a job means
- the controller reports failures and lifecycle state more explicitly
- `StockPool.spec` contains the first real runtime template instead of a hardwired `sleep 3600`
- the Echo API publishes Swagger documentation so the contract is visible without reading handler code
This is not yet a full GPU runtime. It is the point where the Operator starts behaving like software that can survive retries, support debugging, and grow into real workloads.
The Real Problem We Were Hiding
Before this iteration, a write request could easily lie to the caller without meaning to.
Example:
- the API accepts a create request
- the async job reports success because the CR was created
- the controller later fails to build the workload because `memory` is invalid
- the caller only sees "job succeeded" unless they inspect controller logs or cluster events
That is a bad contract.
A production system does not need perfect abstractions on day one, but it does need honest ones.
So Part 6 tightens the contract in two directions at the same time:
- the write path becomes safer under retries
- the reconcile path becomes more observable when desired state is invalid or incomplete
All changes in this chapter are tightly related. If you only add idempotency but keep bad controller feedback, you still have a hard-to-debug system. If you only improve controller status but keep a loose write contract, retries still create garbage. If you only add a runtime template without a service, you still have no stable network boundary for the pod. So when you finish this chapter, stop and ask yourself: is this really enough? What problems are still unsolved? If we leave them alone now, will they become much more expensive later? Keeping that instinct alive is part of what makes software engineering interesting.
Why operationID Matters
Distributed systems retry. That is normal.
Browsers retry. Gateways retry. SDKs retry. Humans retry.
If a POST request can create two StockPool objects because the caller did not receive the first response, the problem is not “the caller should be smarter”. The problem is that the API contract is under-specified.
So the create request now requires an operationID:
```go
type CreateStockPoolRequest struct {
	OperationID string                            `json:"operationID"`
	Name        string                            `json:"name,omitempty"`
	Namespace   string                            `json:"namespace,omitempty"`
	SpecName    string                            `json:"specName"`
	Image       string                            `json:"image,omitempty"`
	Memory      string                            `json:"memory,omitempty"`
	GPU         int32                             `json:"gpu,omitempty"`
	Replicas    int32                             `json:"replicas"`
	Template    runtimev1alpha1.StockPoolTemplate `json:"template,omitempty"`
}
```
The service now does three things with that identifier:
- it normalizes and validates the request before queuing work
- it computes a request hash from the normalized payload
- it stores both the `operationID` and the request hash as annotations on the `StockPool`
That gives us useful behavior:
- same `operationID` + same payload: return the same operation
- same `operationID` + different payload: reject with `409 Conflict`
- generated object names become deterministic when the caller does not provide one
That last point matters more than it sounds. Random names are convenient in demos. They are terrible for idempotency.
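To make the hashing and naming ideas concrete, here is a stdlib-only sketch. The helper names and the trimmed-down payload struct are invented for illustration; the real service hashes the full normalized request:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// normalizedPayload is a trimmed-down stand-in for the real request type.
type normalizedPayload struct {
	SpecName string `json:"specName"`
	Image    string `json:"image"`
	Memory   string `json:"memory"`
	Replicas int32  `json:"replicas"`
}

// requestHash hashes the normalized payload so two retries with the same
// body produce the same fingerprint, and a changed body is detectable.
func requestHash(p normalizedPayload) string {
	raw, _ := json.Marshal(p) // struct field order makes this deterministic
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])
}

// deterministicName derives an object name from the operationID instead of
// a random suffix, so a retried create resolves to the same StockPool.
func deterministicName(operationID string) string {
	sum := sha256.Sum256([]byte(operationID))
	return "pool-" + hex.EncodeToString(sum[:])[:10]
}

func main() {
	p := normalizedPayload{SpecName: "g1.1", Image: "python:3.12-slim", Memory: "16Gi", Replicas: 1}
	fmt.Println(requestHash(p) == requestHash(p)) // retries agree
	fmt.Println(deterministicName("stock-g1-demo-001"))
}
```

With this shape, the duplicate-detection rule falls out naturally: look up the stored annotation by `operationID`, compare hashes, and either return the existing operation or reject with a conflict.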
Why Swagger Belongs Here
Once we switched the HTTP layer to Echo, the control-plane API stopped being "just a few handlers".
At that point, the contract deserves to be visible:
- request body shape
- response shape
- status codes
- path parameters
- query parameters
That is why this chapter adds Swagger now instead of much later.
This is not about chasing tooling for its own sake. It is about reducing ambiguity while the API surface is still small enough to keep honest.
The practical result is simple:
- the server now exposes `/swagger/index.html`
- the docs are generated from handler annotations
- `make swagger` and `make ci` keep the checked-in spec reproducible
For a teaching project, this matters even more than usual. Readers should be able to inspect the API contract directly, not reverse-engineer it from curl examples and handler code.
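For concreteness, the handler annotations that drive the generated spec look roughly like this, assuming the project uses swaggo/swag (the generator most commonly paired with Echo). The handler and response type names below are illustrative, not taken from the repository:

```go
// CreateStockPool godoc
//
//	@Summary		Create a stock pool
//	@Description	Accepts an idempotent create operation keyed by operationID.
//	@Tags			stockpools
//	@Accept			json
//	@Produce		json
//	@Param			request	body		CreateStockPoolRequest	true	"create request"
//	@Success		202		{object}	JobResponse				"newly accepted operation"
//	@Success		200		{object}	JobResponse				"duplicate of an accepted operation"
//	@Failure		400		{object}	ErrorResponse
//	@Failure		409		{object}	ErrorResponse			"operationID reused with a different payload"
//	@Router			/api/v1/operator/stockpools [post]
func (h *Handler) CreateStockPool(c echo.Context) error {
	// ...decode, validate, enqueue...
	return nil
}
```

Running `swag init` regenerates the checked-in spec from these comments, which is what makes a `make swagger` target reproducible in CI.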
Tightening What A Job Means
The job now represents:
- “did the control plane accept this operation?”
- "did it persist the desired state as a `StockPool` resource?"
It does not mean:
- “the runtime is ready for user traffic”
That distinction is important.
The operation contract belongs to the write path. Runtime readiness belongs to reconciliation status.
If you collapse those two ideas into one field too early, the API becomes confusing very quickly.
So the split is now:
- `GET /api/v1/operator/jobs/:operationID`: operation acceptance and persistence
- `StockPool.status`: reconcile progress, readiness, and failure details
That is a much cleaner boundary.
The Controller Owns More Of The Lifecycle Now
In Part 5, the controller only created a Deployment.
That was enough to prove the architecture, but it was still thin:
- no stable service endpoint for a runtime pod
- no explicit failed phase
- no readable condition message for invalid desired state
Part 6 extends controller ownership in two practical ways.
1. Reconcile the Service, not just the Deployment
If the runtime template exposes ports, the controller now creates a matching ClusterIP Service.
That is the first step toward a real runtime boundary:
- pods can restart and be replaced
- the service name stays stable
- later chapters can attach access policy, probes, ingress, or sidecar communication to a stable endpoint
This is why the service belongs in the controller instead of being created ad hoc somewhere in the API layer.
The controller owns workload lifecycle. A service is part of that lifecycle.
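The port mapping at the heart of that Service reconcile can be sketched as follows. In the real controller this would be built with `corev1` types and applied via controller-runtime; the structs below are simplified local stand-ins, and the function name is invented:

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types the real controller uses.
type StockPoolPortSpec struct {
	Name     string
	Port     int32
	Protocol string
}

type ServicePort struct {
	Name       string
	Port       int32
	TargetPort int32
	Protocol   string
}

// desiredServicePorts maps template ports onto ClusterIP service ports.
// Returning nil when there are no ports lets the caller skip creating a
// Service at all, matching "only create a Service if the template exposes ports".
func desiredServicePorts(ports []StockPoolPortSpec) []ServicePort {
	if len(ports) == 0 {
		return nil
	}
	out := make([]ServicePort, 0, len(ports))
	for _, p := range ports {
		proto := p.Protocol
		if proto == "" {
			proto = "TCP" // default, as Kubernetes does
		}
		out = append(out, ServicePort{
			Name:       p.Name,
			Port:       p.Port,
			TargetPort: p.Port, // service port targets the container port directly
			Protocol:   proto,
		})
	}
	return out
}

func main() {
	ports := desiredServicePorts([]StockPoolPortSpec{{Name: "http", Port: 8080}})
	fmt.Println(ports[0].Name, ports[0].Protocol)
}
```

The important design point is that the mapping is a pure function of desired state: the same template always yields the same Service spec, which keeps the reconcile idempotent.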
2. Report explicit failure and readiness conditions
The StockPool status now includes:
- `phase`
- `serviceName`
- `conditions`
- `observedGeneration`
- `lastSyncTime`
And the controller sets a Ready condition with reasons such as:
- `DeploymentProgressing`
- `DeploymentReady`
- `ScaledToZero`
- `InvalidSpec`
- `PodStartupFailed`
- `PodStatusSyncFailed`
- `ServiceSyncFailed`
- `DeploymentSyncFailed`
That means invalid desired state is no longer just a log line.
If `memory: "not-a-quantity"` is sent, the controller marks the resource as `Failed` with an `InvalidSpec` reason instead of endlessly returning a parse error and hoping someone notices.
That is a very production-shaped change. Operators should explain failure in resource status whenever they can.
This chapter also goes one step further: when the deployment exists but a runtime pod is failing to start, the controller inspects the owned pods and copies the most useful failure message into the Ready condition.
So instead of only seeing:
```yaml
phase: Failed
```
you can now get something much closer to the real problem, for example:
- image pull failures
- crash loop messages
- the last terminated container message when startup logic fails
That is the difference between “status exists” and “status helps you debug production”.
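How might that message extraction look? A simplified, dependency-free sketch — the field layout here flattens what `corev1.ContainerStatus` expresses through nested state structs, and the function name is invented:

```go
package main

import "fmt"

// Minimal stand-ins for the container status fields the controller reads.
type ContainerState struct {
	WaitingReason     string
	WaitingMessage    string
	TerminatedReason  string
	TerminatedMessage string
}

type ContainerStatus struct {
	State     ContainerState // current state
	LastState ContainerState // previous terminated state, if any
}

// failureMessage picks the most useful human-readable message out of a
// container status: a waiting reason (ImagePullBackOff, CrashLoopBackOff)
// first, falling back to the last terminated message when the container
// keeps dying without a waiting message of its own.
func failureMessage(cs ContainerStatus) string {
	if r := cs.State.WaitingReason; r != "" {
		msg := cs.State.WaitingMessage
		if msg == "" {
			msg = cs.LastState.TerminatedMessage // crash loops carry detail here
		}
		if msg != "" {
			return r + ": " + msg
		}
		return r
	}
	if cs.State.TerminatedReason != "" {
		return cs.State.TerminatedReason + ": " + cs.State.TerminatedMessage
	}
	return ""
}

func main() {
	fmt.Println(failureMessage(ContainerStatus{
		State: ContainerState{
			WaitingReason:  "ImagePullBackOff",
			WaitingMessage: `Back-off pulling image "python:3.12-slim"`,
		},
	}))
}
```

The controller then copies whatever this returns into the Ready condition message, so `kubectl get stockpool -o yaml` tells the same story the pod events would.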
Preparing For Real Runtime Pods
The old pod template was intentionally primitive:
```go
Command: []string{"sh", "-c", "sleep 3600"}
```
That was fine in Part 5 because the goal was proving the control loop.
It is not fine anymore.
If the controller hardcodes the workload shape, every future runtime feature becomes awkward:
- ports
- startup command
- env injection
- probes
- storage mounts
- sidecars
So this iteration introduces the first real runtime-facing template:
```go
type StockPoolTemplate struct {
	Command []string            `json:"command,omitempty"`
	Args    []string            `json:"args,omitempty"`
	Envs    []StockPoolEnvVar   `json:"envs,omitempty"`
	Ports   []StockPoolPortSpec `json:"ports,omitempty"`
}
```
That does not mean we are done. It means we now have the right place to put runtime concerns.
The controller still falls back to the placeholder sleep command when no template command or args are provided. That is intentional. We are not pretending to have a finished runtime image contract yet.
What changed is the direction of the architecture:
- workload shape is now part of desired state
- the controller translates that shape into container config and service ports
- later chapters can extend the template instead of rewriting the control flow again
Flow After Part 6
```
+-----------------------------+
|        Echo HTTP API        |
|  POST /operator/stockpools  |
+-------------+---------------+
              |
              v
+-------------+---------------+
|        service layer        |
|   validate request          |
|   require operationID       |
|   detect duplicate payload  |
|   create StockPool CR       |
+-------------+---------------+
              |
              v
+-------------+---------------+
|         StockPool           |
|   spec.template             |
|   operation annotations     |
+-------------+---------------+
              |
              v
+-------------+---------------+
|         controller          |
|   reconcile Deployment      |
|   reconcile Service         |
|   update phase + conditions |
+-------------+---------------+
              |
              v
+-------------+---------------+
|    runtime worker pods      |
|   stable ClusterIP service  |
+-----------------------------+
```
Code Walkthrough
Request validation moved before the queue
This is the right time to reject:
- missing `operationID`
- invalid `memory`
- negative replica or GPU values
- invalid or duplicate template env/port definitions
If the request is clearly wrong, the system should say so immediately. There is no value in accepting garbage, writing a CR, and forcing the controller to discover the mistake later.
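A sketch of such up-front validation, with invented names. Note the regex is a deliberately simplified stand-in: real code should parse `memory` with `resource.ParseQuantity` from `k8s.io/apimachinery` rather than pattern-matching:

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// quantityRe is a crude check for Kubernetes resource quantities such as
// "512Mi" or "16Gi". Use resource.ParseQuantity in production code.
var quantityRe = regexp.MustCompile(`^[0-9]+(\.[0-9]+)?(Ki|Mi|Gi|Ti|k|M|G|T|m)?$`)

type createRequest struct {
	OperationID string
	Memory      string
	Replicas    int32
	GPU         int32
}

// validate rejects clearly wrong requests before anything is queued or
// written to the cluster.
func validate(r createRequest) error {
	if r.OperationID == "" {
		return errors.New("operationID is required")
	}
	if r.Memory != "" && !quantityRe.MatchString(r.Memory) {
		return fmt.Errorf("memory %q is not a valid quantity", r.Memory)
	}
	if r.Replicas < 0 || r.GPU < 0 {
		return errors.New("replicas and gpu must be non-negative")
	}
	return nil
}

func main() {
	fmt.Println(validate(createRequest{OperationID: "op-1", Memory: "16Gi", Replicas: 1}))
	fmt.Println(validate(createRequest{OperationID: "op-2", Memory: "not-a-quantity"}))
}
```

Because this runs before the queue, a bad request never produces a CR, and the controller never has to explain a mistake the API could have caught.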
Idempotency is anchored in Kubernetes state
The request hash is stored on the StockPool annotations, not only in an in-memory map.
That matters because the in-memory job store is only a convenience for this single-process stage. The custom resource is the durable system record.
This is one of the subtle lessons in production work:
- process memory is operationally useful
- Kubernetes objects are the contract boundary
Status is now for humans, not just code
The controller uses phase and conditions together.
That is deliberate:
- `phase` is good for quick scanning
- `conditions` are good for precise diagnosis
This is why many mature Kubernetes APIs use both a compact summary and condition detail.
Run And Verify
Run the code:
```sh
cd ./gpu-operator-runtime
make ci
make run
```
Before you test a request with `"gpu": 1`, make sure the cluster already exposes `nvidia.com/gpu`.
For most readers, that means installing the NVIDIA GPU Operator first. A manually prepared cluster also works, but only if the NVIDIA drivers, runtime integration, and device plugin are already in place.
If the cluster does not expose `nvidia.com/gpu`, keep the tutorial request at `"gpu": 0` while you work on the API and controller flow.
Open the API docs:
```sh
open http://127.0.0.1:8080/swagger/index.html
```
Create a stock pool with a real runtime template:
```sh
curl -s -X POST http://127.0.0.1:8080/api/v1/operator/stockpools \
  -H 'Content-Type: application/json' \
  -d '{
    "operationID": "stock-g1-demo-001",
    "name": "pool-g1-demo",
    "namespace": "default",
    "specName": "g1.1",
    "image": "python:3.12-slim",
    "memory": "16Gi",
    "gpu": 1,
    "replicas": 1,
    "template": {
      "command": ["python"],
      "args": ["-m", "http.server", "8080"],
      "envs": [
        {"name": "MODEL_ID", "value": "demo-model"}
      ],
      "ports": [
        {"name": "http", "port": 8080, "protocol": "TCP"}
      ]
    }
  }' | jq
```
Send the same request again:
```sh
curl -s -X POST http://127.0.0.1:8080/api/v1/operator/stockpools \
  -H 'Content-Type: application/json' \
  -d @same-request.json | jq
```
Expected result:
- the first request returns `202 Accepted`
- the second request returns `200 OK`
- both refer to the same `operationID`
Try reusing the same operationID with a different payload:
```sh
curl -s -X POST http://127.0.0.1:8080/api/v1/operator/stockpools \
  -H 'Content-Type: application/json' \
  -d '{
    "operationID": "stock-g1-demo-001",
    "name": "pool-g1-demo",
    "namespace": "default",
    "specName": "g2.1",
    "replicas": 1
  }' | jq
```
Expected result:
- `409 Conflict`
Inspect the cluster objects:
```sh
kubectl get stockpool pool-g1-demo -n default -o yaml
kubectl get deployment pool-pool-g1-demo -n default
kubectl get service pool-pool-g1-demo -n default
```
Useful fields to inspect:
- `.metadata.annotations["runtime.lokiwager.io/operation-id"]`
- `.status.phase`
- `.status.serviceName`
- `.status.conditions`
If a pod is crashing, inspect the failure message directly from the condition:
```sh
kubectl get stockpool pool-g1-demo -n default -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
```
If you want to force a failure path, send an invalid memory quantity:
```sh
curl -s -X POST http://127.0.0.1:8080/api/v1/operator/stockpools \
  -H 'Content-Type: application/json' \
  -d '{
    "operationID": "stock-invalid-001",
    "name": "pool-invalid",
    "namespace": "default",
    "specName": "g1.1",
    "memory": "not-a-quantity",
    "replicas": 1
  }' | jq
```
That request should now fail fast at the API boundary instead of sneaking into reconciliation.
If a bad CR is created manually, the controller should mark it `Failed` with an `InvalidSpec` reason.
Summary
Part 6 is where the Operator stops being merely correct in architecture and starts becoming trustworthy in behavior.
We now have:
- operation-level idempotency
- deterministic write semantics under retry
- a clearer split between write acceptance and runtime readiness
- controller-owned `Service` reconciliation
- explicit failure status and readiness conditions
- pod startup failure messages surfaced into status
- a first runtime template for command, args, envs, and ports
- Swagger documentation published from the Echo server
That is the right foundation for the next stage.
We can now start talking about storage and real runtime mounting without pretending the pod contract still lives in controller code.
Next Chapter Preview
Part 7 is where we move from stock capacity to real user-facing GPU instances.
Right now, we only create stock. We do not yet create the actual GPU instance a user can access. That is the next step. Once we reach that point, the system starts to feel much closer to a real runtime product.
In the next chapter, we will implement:
- the flow that turns a stock unit into a real GPU instance
- how users reach that GPU instance
- GPU instance status reporting
- the GPU instance deletion flow
- the GPU instance update flow
Repository
Code for this chapter: