Building a GPU SaaS Platform - Operator Baseline
Part 4 gave us a service-shaped project.
Part 5 is where it starts acting like a Kubernetes system instead of a well-organized mock.
The high-level change is simple:
- the HTTP server still accepts control-plane requests
- but it no longer tries to act as the source of truth
- instead, it creates a `StockPool` custom resource
- the controller reconciles that resource into a `Deployment`
That is the first real control loop in the project.
What We Are Doing In This Chapter
This chapter does six concrete things:
- refactor the project onto a standard `kubebuilder` layout
- switch the HTTP layer from raw `net/http` to `echo`
- define the `StockPool` CRD
- implement `StockPoolReconciler` so a CR becomes a `Deployment`
- generate RBAC and CRD manifests instead of hand-maintaining them
- add unit tests for the API flow and reconcile flow
That gives us a believable baseline without pretending we already finished the whole runtime.
A Few Ground Rules Before We Start
There are a few design choices in this chapter that are intentional, even if they are not final.
First, the HTTP server and the Kubernetes operator live in the same binary for now. That is a temporary trade-off, not a philosophical commitment. Long term, splitting them usually makes maintenance, failover, and ownership boundaries cleaner. But for this stage of the project, a single process keeps the lifecycle simple and makes the control flow easier to teach:
request -> custom resource -> reconcile -> workload
Second, some of the earlier “stock” ideas still show up in the broader series because this is an iterative project, not a fake greenfield rewrite every week. Stock-style reservation can simplify certain scheduling conversations, but it is not the final answer. Later in the series we will talk about better approaches and why they matter.
Third, this chapter uses echo instead of raw net/http. That is not because Go lacks framework choices. It definitely does not. You could reasonably pick Gin, Fiber, or something else. I picked echo for boring, practical reasons:
- it is easy to read and easy to wire
- it has solid documentation and a mature community
- its HTTP behavior is configurable enough for real services
- it stays lightweight for a control-plane service that should not become the main throughput bottleneck anyway
If the control plane ever becomes a hot path, you usually have a traffic-shaping problem before you have an HTTP framework problem.
What Is An Operator?
An operator is just application-specific control logic built on top of the Kubernetes reconciliation model.
- users declare desired state
- Kubernetes stores that desired state
- a controller watches for changes
- the controller keeps nudging the cluster toward the declared state
That last bit matters. The controller is not just handling a one-shot request. It is continuously correcting drift.
For a GPU SaaS platform, that is exactly the model we want. Users ask for capacity. The system records the request. Controllers make the workloads exist and keep them healthy.
What Is A CRD?
A CRD, or CustomResourceDefinition, is how you teach Kubernetes a new API type.
Without a CRD, StockPool is just a Go struct and some wishful thinking.
With a CRD:
- Kubernetes knows the resource exists
- the API server can store it
- clients can query it
- controllers can watch it
That is why this chapter is a real milestone. We are moving from “service logic that happens to know about Kubernetes” to “Kubernetes-native desired state with a dedicated API contract.”
Why We Switched To Kubebuilder
The previous hand-wired operator code was fine as a sketch. It was not fine as the foundation of a teaching project that is supposed to model production habits.
Once CRDs, controllers, RBAC, generated manifests, and manager wiring enter the picture, hand-rolling everything quickly becomes a maintenance tax.
Could we have picked operator-sdk instead? Sure. kubebuilder is not the only valid option. I picked it partly out of preference, and partly because the documentation is deep enough that when something goes sideways, you have a decent chance of finding the answer without sacrificing a weekend to archaeology.
So this iteration makes a clear move:
- use the standard `kubebuilder` project layout
- generate CRD and RBAC artifacts
- keep one binary and one control-plane entrypoint for now
That gives readers a structure they are likely to see again in real controller repositories.
Architecture In This Iteration
+--------------------------------------------------------------+
| cmd/main.go |
| one process: HTTP server + controller manager + background |
| jobs |
+-----------------------------+--------------------------------+
|
v
+------------+-------------+
| Echo HTTP API |
| POST /operator/stockpools|
+------------+-------------+
|
v
+------------+-------------+
| service layer |
| create async job |
| create StockPool CR |
+------------+-------------+
|
v
+------------+-------------+
| StockPool CR |
| runtime.lokiwager.io |
+------------+-------------+
|
v
+------------+-------------+
| StockPoolReconciler |
| ensure Deployment |
| update status |
+------------+-------------+
|
v
+------------+-------------+
| Deployment |
| placeholder runtime pods |
+--------------------------+
Notice what changed from Part 4:
- the API is no longer the source of truth
- the custom resource is the source of truth
- reconcile owns the drift-correction path
That mental model is more important than any individual code snippet in this chapter.
Step 1: Replace The Hand-Wired Layout With Kubebuilder
The first major change is structural.
We move from a homegrown operator layout to the standard shape most Kubernetes engineers expect:
PROJECT
api/v1alpha1
internal/controller
config/crd
config/rbac
config/default
Why do this now?
Because teaching real engineering practice means teaching the boring defaults too, not just the fun parts.
kubebuilder buys us a few things immediately:
- predictable file layout
- generated deepcopy methods
- CRD generation from Go markers
- RBAC generation from controller markers
- easier onboarding for anyone who has seen a controller repo before
This is not glamorous, but it is the kind of decision that saves your future self from becoming unpaid support for your own clever shortcuts.
Step 2: Define A Small But Honest API Type
The StockPool API lives in api/v1alpha1/stockpool_types.go.
Core fields now look like this:
type StockPoolSpec struct {
SpecName string `json:"specName"`
Image string `json:"image,omitempty"`
Memory string `json:"memory,omitempty"`
GPU int32 `json:"gpu,omitempty"`
Replicas int32 `json:"replicas"`
}
type StockPoolStatus struct {
Available int32 `json:"available,omitempty"`
Allocated int32 `json:"allocated,omitempty"`
Phase string `json:"phase,omitempty"`
ObservedGeneration int64 `json:"observedGeneration,omitempty"`
LastSyncTime metav1.Time `json:"lastSyncTime,omitempty"`
}
Why this shape?
SpecName stays because users still need a concrete runtime flavor such as g1.1.
Replicas is still the smallest useful desired-state signal.
Image, Memory, and GPU are where the API starts to feel less toy-like. Once those fields exist in the spec, readers can see a real path from control-plane input to pod template output.
Status gives users immediate feedback without forcing them to reverse-engineer controller logs every time something is still converging.
We also add kubebuilder markers so the CRD can be generated from the type:
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Spec",type=string,JSONPath=`.spec.specName`
// +kubebuilder:printcolumn:name="Desired",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Available",type=integer,JSONPath=`.status.available`
That means the CRD definition comes from the Go contract instead of a hand-maintained YAML file quietly drifting off the map.
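For reference, a resource built against this contract would look roughly like the following. This is an illustrative sketch assembled from the spec fields defined above and the values used in the curl example later in the chapter, not a file from the repo:

```yaml
apiVersion: runtime.lokiwager.io/v1alpha1
kind: StockPool
metadata:
  name: pool-g1
  namespace: default
spec:
  specName: g1.1
  image: nginx:1.27
  memory: 16Gi
  gpu: 1
  replicas: 2
```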
Step 3: Keep One Entry Point
The unified entrypoint is cmd/main.go.
This file now does three jobs:
- build the controller manager
- register the reconciler
- attach non-leader background runnables such as the HTTP server and the job worker
Manager setup:
mgr, err := ctrl.NewManager(restConfig, ctrl.Options{
Scheme: scheme,
Metrics: metricsserver.Options{
BindAddress: metricsAddr,
SecureServing: secureMetrics,
TLSOpts: tlsOpts,
},
HealthProbeBindAddress: probeAddr,
LeaderElection: enableLeaderElection,
LeaderElectionID: "9d4c4758.lokiwager.io",
})
Then we attach the API server to the manager lifecycle:
if err := mgr.Add(nonLeaderRunnable{run: func(ctx context.Context) error {
return startHTTPServer(ctx, httpServer)
}}); err != nil {
os.Exit(1)
}
That is cleaner than building a second bootstrap world outside the manager and then trying to keep shutdown behavior consistent by brute force.
One more practical change landed in this iteration: the deployment manifest now declares the API port and exposes it through a dedicated Service:
- config/manager/manager.yaml declares --http-addr=:8080 and the container port
- config/default/api_service.yaml exposes the HTTP API inside the cluster
That is the kind of detail teams forget surprisingly often when the binary grows from “just a controller” into “controller plus API.”
Step 4: Switch The API Layer To Echo
The HTTP layer in pkg/api/server.go now uses echo instead of raw net/http.
Current endpoints:
GET /api/v1/health
GET /api/v1/operator/stockpools
POST /api/v1/operator/stockpools
GET /api/v1/operator/jobs/{jobID}
The service layer in pkg/service/service.go owns the actual flow.
Request DTO:
type CreateStockPoolRequest struct {
Name string `json:"name,omitempty"`
Namespace string `json:"namespace,omitempty"`
SpecName string `json:"specName"`
Image string `json:"image,omitempty"`
Memory string `json:"memory,omitempty"`
GPU int32 `json:"gpu,omitempty"`
Replicas int32 `json:"replicas"`
}
Async create path:
func (s *Service) CreateStockPoolAsync(ctx context.Context, req CreateStockPoolRequest) (domain.OperatorJob, error) {
...
s.jobQueue <- createStockPoolJob{jobID: jobID, req: req}
return job, nil
}
Worker:
func (s *Service) StartOperatorJobWorker(ctx context.Context) {
for {
select {
case <-ctx.Done():
return
case job := <-s.jobQueue:
s.setJobRunning(job.jobID)
if err := s.createStockPoolObject(ctx, job.req); err != nil {
s.setJobFailed(job.jobID, err)
continue
}
s.setJobSucceeded(job.jobID, job.req)
}
}
}
This is the key boundary in the current design:
HTTP request -> async job -> CR creation -> reconcile
We are no longer storing pretend runtime state in memory and calling that progress. The API hands desired state to Kubernetes. That is the right shape for the control plane we are trying to build.
Step 5: Reconcile To A Deployment
The reconciler lives in internal/controller/stockpool_controller.go.
This is the first chapter where reconcile performs a real side effect:
- load `StockPool`
- ensure a `Deployment` exists
- update `Deployment.spec.replicas`
- map `image`, `memory`, and `gpu` into the pod template
- compute and write `StockPool.status`
The creation path looks like this:
newDep, err := desiredDeployment(pool, desired)
if err != nil {
    return ctrl.Result{}, err
}
if err := controllerutil.SetControllerReference(&pool, newDep, r.Scheme); err != nil {
    return ctrl.Result{}, err
}
if err := r.Create(ctx, newDep); err != nil {
    return ctrl.Result{}, err
}
And the status path still reflects observed state, not wishful thinking:
next := runtimev1alpha1.StockPoolStatus{
Available: dep.Status.AvailableReplicas,
Allocated: maxInt32(desired-dep.Status.AvailableReplicas, 0),
ObservedGeneration: pool.Generation,
LastSyncTime: metav1.NewTime(time.Now().UTC()),
}
The new resource mapping logic is especially worth noticing. memory is parsed into Kubernetes resource quantities, and gpu is wired to nvidia.com/gpu requests and limits. That means the reader can now see a clean line from API payload to CR spec to pod resources.
The deployment still uses a placeholder container image by default if one is not provided. That is fine. The point here is control flow, not pretending we have already built the final GPU runtime.
Step 6: Let RBAC And CRD Manifests Be Generated
The repo now uses generated output under config/.
That includes:
- config/crd/bases/runtime.lokiwager.io_stockpools.yaml
- config/rbac/role.yaml
- config/samples/runtime_v1alpha1_stockpool.yaml
And the Makefile includes:
make manifests generate
This is one of those habits that pays off quietly. When manifests are derived from types and markers, the diff usually tells a coherent story. When they are maintained by hand, the diff often tells you someone forgot something on a random Friday and hoped nobody would notice.
Step 7: Keep Tests Small And Direct
We keep tests practical in this chapter: a controller test, service tests, and an API test.
The controller test now checks more than “did status change?” It also verifies that the reconciled deployment carries the expected image, memory limit, and GPU limit.
The service test verifies that the async job worker eventually creates the StockPool CR with the requested runtime fields.
That is enough coverage for this iteration because the main risk lives in glue code and state transitions.
We still have not introduced envtest here, and that is deliberate. This chapter already carries a major conceptual jump: kubebuilder, real reconciliation, and real workload generation. Throwing every testing strategy into the same chapter would make it louder, not better.
How To Run This Version
In the code repo:
make manifests generate
kubectl apply -f config/crd/bases/runtime.lokiwager.io_stockpools.yaml
make run
Create a pool:
curl -s -X POST http://127.0.0.1:8080/api/v1/operator/stockpools \
-H 'Content-Type: application/json' \
-d '{"name":"pool-g1","namespace":"default","specName":"g1.1","image":"nginx:1.27","memory":"16Gi","gpu":1,"replicas":2}' | jq
Then verify:
kubectl get stockpools.runtime.lokiwager.io pool-g1 -o yaml
kubectl get deployment -n default
If you deploy this through the generated manifests instead of make run, the in-cluster API is exposed on port 8080 through the generated API Service.
Local validation for this iteration:
make ci
That now covers:
- CRD/RBAC generation
- formatting
- `go vet`
- race-enabled tests
- build
Common Mistakes In This Step
The API works locally, but nothing happens in cluster
Check whether the process can actually talk to the cluster. This version relies on standard controller-runtime kubeconfig handling, so a bad context or missing config breaks the chain before reconcile even gets a chance to be blamed for crimes it did not commit.
The StockPool exists, but no Deployment appears
Check:
- the reconciler is registered with the manager
- the CRD group/version matches the Go type
- RBAC allows `deployments` create and update
The deployment is created, but the pod spec looks wrong
Check the CR values:
- `image`
- `memory`
- `gpu`
Remember that memory values must parse as Kubernetes resource quantities, and GPU is intentionally mapped to `nvidia.com/gpu`.
Status never moves past progressing
Remember that status is based on observed deployment state, not on what we hope the cluster will do eventually. If the deployment is not becoming available, the operator should not fake confidence.
Summary
Part 5 is the first chapter where the project starts to feel structurally honest.
We now have:
- a standard `kubebuilder` scaffold
- an `echo`-based control-plane API
- a single-process control plane
- an HTTP API that creates CRs instead of fake runtime state
- a reconciler that creates a real Kubernetes workload
- generated CRD and RBAC artifacts
- runtime fields flowing from API all the way into pod resources
That is a much stronger base for the rest of the book.
Next Chapter Preview
Part 6 should move past "minimal but real" and into "useful."
That likely means:
- making the controller own more lifecycle detail
- improving idempotency and error handling
- tightening the API/job contract
- preparing the path toward actual runtime pods instead of placeholders
Repository
Code for this chapter: