Building a GPU SaaS Platform - Runtime Bootstrap in Go
Target readers:
- you already know Go syntax and basic project structure
- you are not yet confident in production-oriented engineering decisions
This chapter is not just about “making code run”. It is about learning how to make decisions that support long-term delivery.
Chapter Goal
By the end of this chapter, you should have a runnable single-cluster runtime baseline that includes:
- process startup and graceful shutdown
- baseline API (health / stocks / vms)
- minimal lifecycle loop (create stock -> allocate VM -> release stock)
- periodic status reporting
- optional Kubernetes connectivity
- initial quality baseline (unit tests + CI/CD)
More importantly, you should understand why we implement it this way.
Engineering Goal of This Iteration
For a real production system, the first iteration should optimize for:
- clear boundaries, not feature completeness
- fast validation, not premature complexity
- low refactor cost in the next iteration
That is why this chapter intentionally does not implement a full Operator yet.
What We Deliberately Do Not Build Yet
Not included in this chapter:
- CRD + reconcile loop
- PVC/Ceph workflows
- multi-cluster state aggregation
- serverless runtime workflow
Reason: these features are important, but introducing them before we stabilize service boundaries makes troubleshooting much harder.
Technology Selection in This Iteration
Constraints:
- use standard library for HTTP (net/http)
- use standard library for logging (log/slog)
- avoid extra frameworks in the first implementation
- include only required Kubernetes dependency (client-go)
require k8s.io/client-go v0.30.10
Why this choice:
- fewer abstractions at start means easier debugging
- lower cognitive load for readers new to engineering practice
- we keep room to evolve later (Echo/Gin, zap, metrics stack) based on measured needs
Architecture (Iteration 1)
+-------------------+        +-------------------+
|    cmd/runtime    | -----> |    pkg/runtime    |
| (flags & signals) |        | (wire everything) |
+-------------------+        +---------+---------+
                                       |
                     +-----------------+-----------------+
                     |                                   |
           +---------v---------+               +---------v---------+
           |      pkg/api      |               |     pkg/jobs      |
           |     net/http      |               |  status reporter  |
           +---------+---------+               +---------+---------+
                     |                                   |
                     +-----------------+-----------------+
                                       |
                             +---------v---------+
                             |    pkg/service    |
                             |     use-cases     |
                             +----+---------+----+
                                  |         |
                        +---------v-+   +---v----------------+
                        | pkg/store |   | pkg/kube           |
                        | in-memory |   | client-go adapter  |
                        +-----------+   +--------------------+
Boundary rules:
- api: transport only (decode/encode/status code)
- service: business orchestration only
- store: state operations only
- runtime: wiring only
These rules prevent “everything in handler” code, which is the most common early-stage anti-pattern.
Step-by-Step Implementation
Step 0: Add testing and CI/CD from day one
Purpose: Set minimum quality gates at project initialization, not after incidents. This chapter only gives a lightweight setup. You should treat this part as mandatory homework.
Code:
make ci
ci: fmt-check vet test-race build
# .github/workflows/ci.yml
name: CI
on:
  pull_request:
  push:
    branches: [main]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
      - run: make fmt-check
      - run: make vet
      - run: make test-race
      - run: make build

# .github/workflows/release-image.yml
name: Release Image
on:
  push:
    tags: ["v*"]
Why:
- tests prevent regressions while architecture is still changing fast
- CI gives consistent verification on every PR/push
- release workflow makes delivery repeatable and auditable
Reader requirement:
- work through this part yourself and run it locally
- use the repository workflows and test files as the reference implementation
- do not skip this step, even if business features look more interesting
Pitfall: If testing and CI/CD are postponed, technical debt grows quickly and every refactor becomes risky.
Follow-up: In a later standalone article, we will cover production testing strategy and CI/CD engineering in depth (unit/integration/e2e, test pyramids, flaky test control, pipeline design, release safety).
Files:
- Makefile
- .github/workflows/ci.yml
- .github/workflows/release-image.yml
- pkg/config/config_test.go
- pkg/store/memory_test.go
- pkg/service/service_test.go
- pkg/api/server_test.go
Step 1: Keep main minimal and predictable
Purpose:
Define a deterministic startup and shutdown path. main is orchestration only.
Code:
func main() {
	cfg, err := loadConfig()
	if err != nil {
		fmt.Fprintf(os.Stderr, "config error: %v\n", err)
		os.Exit(1)
	}

	// Cancel ctx on Ctrl+C or SIGTERM so Run can shut down gracefully.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	runtime, err := app.New(cfg)
	if err != nil {
		fmt.Fprintf(os.Stderr, "startup error: %v\n", err)
		os.Exit(1)
	}

	if err := runtime.Run(ctx); err != nil {
		fmt.Fprintf(os.Stderr, "runtime error: %v\n", err)
		os.Exit(1)
	}
}
Why:
- startup failures are explicit and visible in one place
- shutdown behavior is deterministic
- business logic remains testable outside main
Pitfall:
Putting business logic in main makes testing hard and refactors expensive.
File: cmd/runtime/main.go
Step 2: Use a dedicated runtime wiring layer
Purpose:
Create one composition root (pkg/runtime) to wire dependencies and keep layering stable.
Code:
func New(cfg config.Config) (*Runtime, error) {
	logger := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo}))

	kubeClient, err := kube.BuildClient(cfg.KubeMode, cfg.Kubeconfig)
	if err != nil {
		return nil, err
	}

	memStore := store.NewMemoryStore()
	svc := service.New(memStore, kubeClient, logger)
	handler := api.NewServer(svc, logger)

	httpServer := &http.Server{
		Addr:              cfg.HTTPAddr,
		Handler:           handler,
		ReadHeaderTimeout: 5 * time.Second,
	}
	return &Runtime{...}, nil
}
Why:
- all dependencies are visible in one location
- easier to replace components in tests
- clean migration path to operator manager later
Pitfall: If handlers/services instantiate dependencies directly, ownership becomes unclear and startup behavior fragments.
File: pkg/runtime/runtime.go
Step 3: Keep API handlers thin
Purpose: Keep HTTP layer responsible only for transport, not business rules.
Code:
func (s *Server) routes() {
	s.mux.HandleFunc("/api/v1/health", s.handleHealth)
	s.mux.HandleFunc("/api/v1/stocks", s.handleStocks)
	s.mux.HandleFunc("/api/v1/vms", s.handleVMs)
	s.mux.HandleFunc("/api/v1/vms/", s.handleVMByID)
}

type envelope struct {
	Data  any       `json:"data,omitempty"`
	Error *apiError `json:"error,omitempty"`
}

func (s *Server) handleStocks(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPost:
		var req service.CreateStockRequest
		if err := decodeBody(r.Body, &req); err != nil {
			writeError(w, http.StatusBadRequest, "invalid_request", err.Error())
			return
		}
		stocks, err := s.service.CreateStocks(r.Context(), req)
		if err != nil {
			writeError(w, http.StatusBadRequest, "create_stocks_failed", err.Error())
			return
		}
		writeData(w, http.StatusCreated, stocks)
	}
}
Why:
- transport concerns remain isolated
- service methods can be reused by jobs/tests later
- API protocol changes do not force lifecycle refactors
Pitfall: If validation, status code mapping, and business logic are mixed in handlers, every API change becomes high risk.
File: pkg/api/server.go
Step 4: Keep lifecycle orchestration in service
Purpose:
Implement the first runtime lifecycle loop and explain why Stock is a first-class model.
Stock represents pre-provisioned GPU capacity. It is intentionally separated from tenant VM identity.
This avoids coupling capacity accounting with workload lifecycle.
Code:
func (s *Service) CreateVM(ctx context.Context, req CreateVMRequest) (domain.VM, error) {
	var (
		stock domain.Stock
		err   error
	)
	if strings.TrimSpace(req.StockID) != "" {
		stock, err = s.store.ReserveStockByID(strings.TrimSpace(req.StockID))
	} else {
		stock, err = s.store.ReserveStock(strings.TrimSpace(req.SpecName))
	}
	if err != nil {
		return domain.VM{}, err
	}

	vm := domain.VM{...}
	if err := s.store.CreateVM(vm); err != nil {
		_ = s.store.ReleaseStock(stock.ID)
		return domain.VM{}, err
	}
	return vm, nil
}
Why:
- clear separation: capacity (Stock) vs workload (VM)
- explicit rollback path when VM creation fails
- same flow maps naturally to future reconcile logic
Pitfall: If you create VM first and reserve stock later, failure handling becomes inconsistent and can leak capacity.
File: pkg/service/service.go
Step 5: Extract config early
Purpose: Make runtime behavior configurable from day one, instead of adding flags ad hoc later.
Code:
// KubeMode controls whether the runtime connects to Kubernetes at startup.
type KubeMode string

type Config struct {
	HTTPAddr       string
	ReportInterval time.Duration
	KubeMode       KubeMode
	Kubeconfig     string
}

const (
	KubeModeAuto     KubeMode = "auto"
	KubeModeOff      KubeMode = "off"
	KubeModeRequired KubeMode = "required"
)
Why:
- one binary can run in local dev, CI, or cluster
- behavior changes through config instead of code edits
- operational behavior is explicit and documented
Pitfall: Without a config model, feature flags and env parsing spread across packages quickly.
File: pkg/config/config.go
Step 6: Add Kubernetes adapter for connectivity signal
Purpose: Add Kubernetes awareness before operator adoption.
Code:
func BuildClient(mode config.KubeMode, kubeconfig string) (kubernetes.Interface, error) {
	if mode == config.KubeModeOff {
		return nil, nil
	}
	restConfig, err := buildRestConfig(kubeconfig)
	...
	return kubernetes.NewForConfig(restConfig)
}
Why:
- /health can expose real cluster connectivity
- startup mode supports local and in-cluster runtime
- future migration to controller-runtime is incremental, not disruptive
Pitfall: If cluster integration starts only when introducing reconcile, migration risk and debugging complexity both spike.
File: pkg/kube/client.go
Step 7: Add a periodic reporter job
Purpose: Introduce a minimal observability loop with periodic runtime state reporting.
Code:
func (r *StatusReporter) Start(ctx context.Context) {
	ticker := time.NewTicker(r.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			health, err := r.service.Health(ctx)
			...
			r.logger.Info("runtime status", ...)
		}
	}
}
Why:
- request logs show calls, not steady-state runtime health
- periodic reporting surfaces drift and silent failures
- provides extension point for future metrics/events pipeline
Pitfall: No background reporting means incidents can remain invisible until user traffic fails.
File: pkg/jobs/status_reporter.go
How to Validate This Iteration
Validation checklist:
- compile and dependency sanity
make tidy
go test ./...
- run runtime
make run
- health check
curl -s http://127.0.0.1:8080/api/v1/health | jq
- stock lifecycle
curl -s -X POST http://127.0.0.1:8080/api/v1/stocks \
-H 'Content-Type: application/json' \
-d '{"number":2,"specName":"g1.1","cpu":"4","memory":"16Gi","gpuType":"RTX4090","gpuNum":1}' | jq
- vm lifecycle
curl -s -X POST http://127.0.0.1:8080/api/v1/vms \
-H 'Content-Type: application/json' \
-d '{"tenantID":"tenant-a","tenantName":"team-a","specName":"g1.1"}' | jq
Definition of Done for this chapter:
- service is runnable locally
- API contract is stable and explicit
- stock/vm lifecycle loop works end-to-end
- status reporter emits periodic runtime state
Troubleshooting Guide (Early Iteration)
Runtime exits on startup
Check:
- invalid flag values (--kube-mode)
- port already in use (:8080)
- required kube mode without valid kubeconfig
/health shows kubernetesConnected=false
Check:
- run with --kube-mode=auto or --kube-mode=required
- verify ~/.kube/config exists and context is correct
- if running in cluster, verify service account permissions
VM creation fails with no available stock
This is expected behavior if stock pool is empty.
Create stock first or use a valid specName.
API returns 400
Common causes:
- malformed JSON
- missing required fields (specName, number)
- unsupported request shape due to strict decode
Iteration Summary
This chapter intentionally prioritizes engineering foundations over feature volume.
We now have:
- clear layering
- deterministic startup/shutdown path
- explicit lifecycle flow with rollback
- basic operational visibility
This is a good production-minded baseline for introducing more complexity safely.
Next Chapter Preview
Part 5 will introduce the minimal Operator skeleton:
- controller-runtime manager
- first CRD model for runtime resources
- first reconcile loop and status update flow
At that point, we will migrate from in-memory state to Kubernetes-native desired/actual state management.
Repository
Code for this tutorial runtime: