squeezing an ml service into fourteen gigabytes
the ml team said the preprocessing service couldn’t run in CI. too big. too many dependencies. private repos. cuda base images. the works.
so naturally i spent a night proving them wrong. here’s how that went.
the premise
we have a platform. four services. a go API, a data generation service, a preprocessing service, and an e2e test suite. they all need to be built, loaded into a local kubernetes cluster, and tested together. the e2e tests deploy the whole stack into kind (kubernetes-in-docker) and run a jest suite against it.
three of these services are small. normal docker images. a few hundred megabytes. the preprocessing service is not small. it pulls in private ML repositories, torch dependencies, opencv, cuda tooling. the image lands somewhere around 3.5 gigabytes compressed.
the CI runner we started on was ubuntu-latest. github’s hosted runners. they ship with about 30GB of pre-installed tooling — .NET, android SDK, GHC, CodeQL — leaving roughly 14GB of usable space.
fourteen gigabytes. for four docker images, a kubernetes cluster, and all the uncompressed tarballs kind load writes to /tmp.
…
this is fine.
act one: no space left on device
the first attempt was the obvious one. build all four images, then load them all into kind.
skaffold run --force
skaffold builds all the images, then calls kind load docker-image for each one. kind load shells out to docker save, which writes an uncompressed tarball to /tmp. this is the part nobody tells you about. your 3.5GB compressed image becomes roughly 7GB uncompressed on disk. sitting in /tmp. while the other images are also sitting in docker’s image store. while kind’s containerd is importing them.
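if you want to see the inflation for yourself, it's two pipelines. a minimal sketch, assuming a local docker daemon; the image name is just the one from this post, and it skips cleanly if docker isn't around:

```shell
#!/usr/bin/env bash
# sketch: measure the save-tarball inflation yourself. the image name is
# illustrative; any large local image works. skips if docker is unavailable.
set -u
img="svc-preprocessing:latest"
if command -v docker >/dev/null 2>&1 && docker image inspect "$img" >/dev/null 2>&1; then
  raw=$(docker save "$img" | wc -c)        # bytes `kind load` would write to /tmp
  gz=$(docker save "$img" | gzip | wc -c)  # roughly what a registry transfer costs
  echo "uncompressed: $raw bytes, gzipped: $gz bytes"
else
  echo "skipped: docker or image unavailable"
fi
done_marker=1
```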
the error was exactly what you’d expect:
ERROR: command "docker save -o /tmp/images-tar1916451864/images.tar
svc-preprocessing:84b96fc..." failed with error: exit status 1
Command Output: write /tmp/images-tar1916451864/.docker_temp_417581142:
no space left on device
it’s always the preprocessing service. always the last one to load. because by the time it gets there, the disk is already full of the previous images’ uncompressed remains.
act two: the cleanup
the first fix was blunt. delete the pre-installed tooling.
- name: Free disk space
  run: |
    sudo rm -rf /usr/share/dotnet /usr/local/lib/android \
      /opt/ghc /opt/hostedtoolcache/CodeQL
    sudo docker image prune --all --force
this is, for the record, standard practice for docker-heavy CI jobs on github-hosted runners. the runner-images repo has multiple issues and discussions about it. you are renting a VM that comes pre-loaded with every SDK microsoft has ever shipped, and your first step is deleting most of it. it’s like buying a house and immediately gutting the kitchen because you need the floorspace for a server rack.
this got us to about 28GB free. enough for the images. but not enough for the dumb way — building all four, then loading all four. because during each load, the uncompressed docker save tarball in /tmp and docker's image store both occupy disk at the same time.
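a cheap diagnostic for this class of failure is to print the disk numbers inside the job itself, so the next person doesn't have to guess. a sketch of a step (name illustrative):

```yaml
# illustrative workflow step: show where the disk actually went
- name: Show disk usage (diagnostic)
  if: always()
  run: |
    df -h /
    sudo du -sh /tmp 2>/dev/null || true
    docker system df
```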
so we got creative.
act three: the build-load-delete dance
the fix was a sequential pipeline. build one image. load it into kind. delete it from docker. prune the buildx cache. then build the next one.
- name: Build and load svc-api-gateway
  run: |
    just ci-build api
    kind load docker-image svc-api-gateway:latest --name ci-cluster
    docker rmi svc-api-gateway:latest
    docker buildx prune -f

- name: Build and load svc-data-generation
  run: |
    just ci-build datagen
    kind load docker-image svc-data-generation:latest --name ci-cluster
    docker rmi svc-data-generation:latest
    docker buildx prune -f

- name: Build and load svc-preprocessing
  run: |
    just ci-build preproc
    kind load docker-image svc-preprocessing:latest --name ci-cluster
    docker rmi svc-preprocessing:latest
    docker buildx prune -f

- name: Build and load svc-e2e-test-suite
  run: |
    just ci-build e2e
    kind load docker-image svc-e2e-test-suite:latest --name ci-cluster
    docker rmi svc-e2e-test-suite:latest
this way, at most one uncompressed tarball and one docker image exist at any time. the moment kind has imported the image into containerd, we throw away the docker copy and the buildx layer cache. the next build starts clean.
the docker buildx prune -f is the subtle part. docker rmi removes images from docker’s image store. but if you’re using buildx (which we were, for GHA caching), buildx maintains its own layer cache in a separate builder instance. docker rmi does nothing to it. so the buildx cache accumulates across builds, silently eating disk, until the preprocessing image comes along and there’s nothing left. ask me how i know.
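if you'd rather watch the cache grow than take my word for it, buildx can report its own disk usage. a sketch of a diagnostic step (name illustrative) that inspects the cache before pruning it:

```yaml
# illustrative workflow step: see what buildx is holding onto
- name: Inspect buildx cache (diagnostic)
  run: |
    docker buildx du
    docker buildx prune -f
```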
this worked. but it was… a lot. four sequential build-load-delete cycles, each with its own buildx prune. the workflow file was getting longer than most of the services it tested. we were optimising around a constraint that shouldn’t exist.
act four: the constraint that shouldn’t exist
there’s a rob pike quote i think about a lot. it goes something like: “a little copying is better than a little dependency.” the corollary for CI is: a little money is better than a lot of yaml.
we set up self-hosted runners. runs-on: [self-hosted, linux, x64]. persistent machines with actual disk space and a docker layer cache that survives between runs. suddenly:
- no disk cleanup. the machines have enough space.
- no buildx cache gymnastics. plain docker build with the host's layer cache.
- no build-load-delete dance. build all four, load all four, done.
- builds that took 15 minutes on cache miss now take 2 minutes on cache hit because the layer cache persists.
the entire e2e workflow went from 80 lines of disk management hacks to about 30 lines of actual work. the ml team’s “too big for CI” service builds in under 3 minutes. it was never too big. it just needed a bigger room.
act five: where docker save actually comes from
a quick aside on the thing that actually caused all of this, because i think it’s worth understanding.
kind load docker-image needs to get your image from docker’s image store into kind’s containerd runtime. docker and containerd are separate systems — docker uses containerd under the hood, but kind runs its own containerd inside the kind node container. they don’t share storage.
so kind load does this:
- calls docker save to serialize the image into a tar archive (the OCI image layout format)
- writes that tar to a temp directory (defaults to /tmp)
- copies the tar into the kind node container via docker cp
- calls ctr images import inside the node to load it into containerd
the tar is uncompressed. registries ship images with gzip-compressed layers, but docker save writes plain uncompressed layer tarballs. a 3.5GB compressed image can easily become 7GB on disk. it's a serialization format, not a storage format — the assumption is that you're moving data between systems, not that you're doing it on a machine with 14GB of free space.
you can set TMPDIR to point somewhere with more space, but on a github-hosted runner, there is nowhere with more space. it’s turtles all the way down.
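for completeness: redirecting the tarball is just an environment variable, assuming you actually have a bigger mount somewhere. a sketch with an illustrative path:

```yaml
# illustrative workflow step: point kind load's scratch space at a bigger disk
- name: Load image via alternate temp dir
  run: |
    sudo mkdir -p /mnt/scratch && sudo chmod 1777 /mnt/scratch
    TMPDIR=/mnt/scratch kind load docker-image svc-preprocessing:latest --name ci-cluster
```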
this is why the build-load-delete dance worked. by deleting the docker image immediately after kind imports it, you free the compressed copy in docker’s store, and the uncompressed tarball gets cleaned up when kind load finishes. net disk usage: one image at a time, briefly doubled during the save.
act six: the private dependency problem
with the self-hosted runner and the disk problem solved, a new problem appeared. the preprocessing service depends on private github repositories. ML libraries that live in separate repos. on a developer’s machine, this works because docker build --ssh default forwards the local ssh-agent into the build. the Dockerfile does:
RUN mkdir -p /root/.ssh && ssh-keyscan github.com >> /root/.ssh/known_hosts
RUN --mount=type=ssh uv sync --frozen $UV_SYNC_FLAGS
and it just works. the ssh agent provides credentials, git clones the private repos, uv installs them.
on a CI runner, there is no ssh agent.
ERROR: failed to build: failed to convert agent config {default []}:
invalid empty ssh agent socket: make sure SSH_AUTH_SOCK is set
…
fair enough.
act seven: the netrc era
the standard CI approach for private github dependencies is .netrc. you generate a short-lived github app token, write it to $HOME/.netrc, and git picks it up automatically for HTTPS auth.
machine github.com login x-access-token password ghs_xxxxxxxxxxxx
the github action generates the token, writes the file, and the docker build mounts it as a secret:
- name: Write .netrc for private dependencies
  run: |
    echo "machine github.com login x-access-token password $TOKEN" > "$HOME/.netrc"
    chmod 600 "$HOME/.netrc"
RUN --mount=type=secret,id=netrc,target=/root/.netrc \
git config --global url."https://github.com/".insteadOf "ssh://[email protected]/" && \
awk '/machine github.com/{getline; split($0,a," "); ...}' /root/.netrc \
> /root/.git-credentials && \
git config --global credential.helper "store --file /root/.git-credentials" && \
uv sync --frozen $UV_SYNC_FLAGS
except… $HOME isn’t set on the self-hosted runner.
fatal: $HOME not set
ok, fix that.
NETRC_DIR="${HOME:-$(eval echo ~)}"
then $HOME isn’t set in the docker build command either.
sh: 1: HOME: parameter not set
fix that too. push. wait. fail. fix. push. wait. fail.
five consecutive CI failures, each revealing a new assumption about the runner environment. it was like peeling an onion, except each layer was a different shell variable that didn’t exist.
but then i stopped and looked at what we’d built. we were writing a plaintext credential to the host filesystem of a shared runner. chmod 600, sure. short-lived token, sure. but the file exists on disk. another process could read it. a leaked core dump could contain it. a misconfigured log could echo it. the file outlives the build step.
netrc is a convention from the early days of BSD ftp. it was designed for FTP auto-login. we were using a 1980s FTP convenience feature to authenticate ML dependency downloads inside a docker build on a kubernetes CI runner.
no.
act eight: docker build secrets
the actual requirement is simple: pass a token into a docker build, use it during one RUN layer, and have it vanish. never touch the host filesystem. never bake it into an image layer. never write it anywhere.
docker buildkit has exactly this feature. --secret passes data into a build that’s mountable during RUN but excluded from the final image. the data can come from a file, an environment variable, or stdin.
stdin.
the github action generates a short-lived app token and exposes it as a step output. the workflow passes it as an environment variable to the just recipe. the just recipe pipes it into the docker build via stdin:
ci-build:
    echo "$GITHUB_TOKEN" | docker build \
        --file Dockerfile \
        --tag svc-preprocessing:latest \
        --build-arg UV_SYNC_FLAGS="--extra dev" \
        --secret id=github_token,src=/dev/stdin \
        ..
inside the dockerfile, the token is mounted at /run/secrets/github_token for exactly one RUN instruction. a git credential helper reads it on demand:
RUN --mount=type=secret,id=github_token \
git config --global url."https://github.com/".insteadOf "ssh://[email protected]/" && \
git config --global credential.helper \
'!f() { echo "username=x-access-token"; \
echo "password=$(cat /run/secrets/github_token)"; }; f' && \
uv sync --frozen $UV_SYNC_FLAGS && \
git config --global --unset credential.helper
the credential helper is a shell function. when git needs auth, it calls the function, which reads the token from the secret mount. after uv sync finishes, we unset the credential helper. the secret mount only exists during this RUN layer — it’s a buildkit feature, not a docker layer, so it’s never committed to the image.
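the helper mechanism is plain git, so you can poke at it outside docker. a minimal self-contained sketch, with a throwaway file and a made-up token standing in for the buildkit secret mount:

```shell
#!/usr/bin/env bash
# sketch: exercise git's credential-helper protocol with a throwaway token
# file standing in for /run/secrets/github_token. token value is made up.
set -eu
export HOME="$(mktemp -d)"                       # isolated git config
printf 'tok_example_123' > "$HOME/fake_secret"   # stand-in for the secret mount
git config --global credential.helper \
  '!f() { echo "username=x-access-token"; echo "password=$(cat "$HOME"/fake_secret)"; }; f'
# ask git to resolve credentials the way a clone would:
out="$(printf 'protocol=https\nhost=github.com\n\n' | git credential fill)"
echo "$out"
git config --global --unset credential.helper
```

the output includes the username and password lines the helper supplied, which is exactly what git hands to its HTTPS transport during a real clone.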
the full chain:
- actions/create-github-app-token@v1 generates a short-lived token (1 hour expiry) and calls core.setSecret() to mask it in all github actions log output.
- the token flows as a step output into an environment variable. github actions masks it in logs.
- echo "$GITHUB_TOKEN" | docker build --secret id=github_token,src=/dev/stdin — the echo writes to a pipe, not to the job log. the token enters the build as a buildkit secret.
- inside the build, --mount=type=secret,id=github_token makes it available at /run/secrets/github_token for exactly one RUN instruction. it never enters a layer. it never appears in docker history. it's not in the image.
- after the RUN completes, the mount is gone. the credential helper is unset. the next layer has no trace of the token.
nothing on disk. nothing in the image. nothing in the logs. a short-lived token that exists in memory for the duration of one uv sync and then disappears.
the security chain, diagrammed
GitHub App (scoped to 2 repos, 1hr token)
│
▼
actions/create-github-app-token@v1
├─ generates token
└─ core.setSecret() → masked in all CI logs
│
▼
step output → env var (GITHUB_TOKEN)
├─ exists in runner process memory only
└─ masked by GitHub Actions in any log output
│
▼
echo "$GITHUB_TOKEN" | docker build --secret id=github_token,src=/dev/stdin
├─ pipe, not stdout — token goes directly to buildkit
└─ no file written to host filesystem
│
▼
RUN --mount=type=secret,id=github_token
├─ available at /run/secrets/github_token during this RUN only
├─ git credential helper reads it on demand
├─ never committed to image layer
└─ mount removed after RUN completes
│
▼
git config --global --unset credential.helper
└─ no trace in subsequent layers
the only thing that’s slightly less ideal: the token exists as a process environment variable on the runner during the step. theoretically readable from /proc/<pid>/environ by the same OS user. but it’s a short-lived token on infrastructure you control, and the alternative is inlining the docker build in the CI workflow instead of delegating to a justfile. the tradeoff is worth it. keep the delegation clean.
and this gets at something bigger. the entire CI pipeline is containerised. every build runs in docker. every test runs in docker. code generation, dependency resolution, compilation — all containerised. the CI runner needs exactly two things installed: docker and just. nothing else. no Go toolchain. no Python runtime. no node. no SDK. nothing that could be compromised, outdated, or misconfigured.
we live in interesting times. AI coding assistants will cheerfully hand your SSH keys to a subprocess if you ask nicely enough. supply chain attacks ship in popular npm packages. the average CI runner accumulates tooling like a kitchen junk drawer — and every binary, every cached credential, every pip install you ran six months ago is something you’re trusting to not phone home.
this containerised runner has nothing to steal. if it gets compromised, the blast radius is: a docker daemon with no cached credentials, a just binary, and whatever short-lived token github generated for that specific workflow run. secrets flow through pipes, not files. builds happen in containers, not on the host. the host is boring. boring is safe.
act nine: the pull policy
with the secrets sorted, the e2e job finally built all four images, loaded them into kind, and deployed via skaffold. the pods came up. and then:
deployment/svc-api-gateway: container wait-for-postgres is waiting
to start: svc-api-gateway:latest can't be pulled
can’t be pulled. from a registry that doesn’t exist. because the images aren’t in a registry — they’re loaded directly into kind’s containerd via kind load docker-image. the entire point of this exercise.
the problem is kubernetes’s default imagePullPolicy. when your image tag is :latest, kubelet defaults to imagePullPolicy: Always — it always tries to pull from a remote registry, even if the image is already present locally. the kubernetes docs are clear about this, but it’s one of those things you never think about until you’re staring at a pod that won’t start.
during local development with skaffold run, this isn’t an issue. skaffold rewrites image tags to include a content-based digest (svc-api-gateway:84b96fc557d2...), and kubernetes defaults non-:latest tags to imagePullPolicy: IfNotPresent. so locally, everything works. but in CI, where we’re passing pre-built :latest images via skaffold deploy --images, skaffold doesn’t rewrite anything. it applies the manifests as-is. kubernetes sees :latest, defaults to Always, and tries to pull from thin air.
the fix was one line per container:
imagePullPolicy: IfNotPresent
across three deployment files, seven containers (the API service has three init containers plus the main container). the e2e test suite job already had imagePullPolicy: Never — because of course it did, it was the one manifest that had been tested in kind before.
IfNotPresent rather than Never, because Never would break local dev where skaffold sometimes does need to pull. IfNotPresent works everywhere: if the image is loaded (kind), use it. if it needs pulling (hypothetical registry), pull it. the right default that kubernetes chose not to give you for :latest.
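in context, the fix is a fragment like this. names are from this post; the real manifests obviously have more fields:

```yaml
# illustrative fragment of one deployment's pod spec
spec:
  containers:
    - name: svc-api-gateway
      image: svc-api-gateway:latest   # :latest alone would default to Always
      imagePullPolicy: IfNotPresent   # use the kind-loaded image; pull only if absent
```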
act ten: context cancelled
you’d think we were done. images build. secrets work. pull policy fixed. skaffold deploys. the pods come up. the tests run.
except sometimes they don’t.
error: context cancelled
two words. no stack trace. no container name. no explanation. just a Go context.Context getting cancelled somewhere in the kind/kubectl/skaffold stack, and the entire e2e job silently falling over.
i stared at this for longer than i’d like to admit. re-read the skaffold logs. re-read the kind logs. checked timeouts. checked network policies. nothing. everything looked fine, right up until it wasn’t.
then i remembered something. my macbook’s Docker Desktop is configured with 16GB of memory. i did that months ago, because you configure your systems without bottlenecks by default. so locally, the kind node had plenty of headroom. the preprocessing container could load pytorch, initialise model weights, import half of pypi, and the kind node wouldn’t even flinch. the CI runner had no such luxury.
consider the architecture that betrayed us: the CI runner is a Linux VM with default docker. inside that docker, we spin up a kind cluster — which is itself a Docker container. that kind container has whatever memory docker allocates to it by default, which is “whatever’s left” after the runner’s other processes. inside that container, we have a Kubernetes node running containerd, which tries to run our microservices.
so we have nested containers: VM → Docker → kind (Docker container) → Kubernetes → your service. each layer has memory constraints. on my macbook, every layer was generous. on the CI runner, every layer was stingy.
default docker, default kind, default kubelet, default everything. and the default amount of memory for “whatever’s left” after the kind control plane, four other pods, and containerd itself had taken their share was… not enough for a service that loads a neural network into RAM on startup.
it worked on my machine because my machine is configured by someone who’s been burned before. the CI runner is configured by nobody.
when a container inside kind’s containerd exceeds the kind node’s available memory, the kubelet OOM-kills it. the problem is: the kind node itself is running out of memory because the kind container (which lives on the CI runner) never had much to begin with. when the kubelet OOM-kills a container during pod startup, the pod goes into CrashLoopBackOff. when skaffold is waiting for a deployment to become ready and the pod keeps crash-looping, eventually the Go context times out. and the error you get is context cancelled.
the memory pressure happens at the kind node level, not visible from outside. you don’t see “kind ran out of memory.” you see the pod fail to start, see the context timeout, and blame skaffold or networking or anything else.
not “out of memory.” not “OOM killed.” not “the container that loads a 3.5GB ML model needs more than the default memory allocation.” just… context cancelled.
CLASSIC
resources:
  requests:
    memory: "6Gi"
  limits:
    memory: "6Gi"
six gigabytes. one YAML block. four lines. requests equals limits, so the pod gets Guaranteed QoS — the kubelet will reserve the memory and won’t evict it under pressure. the preprocessing service loads its model weights, starts its HTTP server, passes its readiness probe, and the tests run.
the reason this took so long to diagnose is that kind doesn’t surface OOM kills in any obvious way. running inside nested containers — VM → Docker → kind container — means the OOM killer’s messages are buried in kernel logs you can’t see from the CI runner’s stdout. kubectl describe pod would have shown the OOM event, but by the time you’re looking at it, skaffold has already torn down the deployment and printed context cancelled. the evidence is gone. you’re debugging a ghost.
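one mitigation for the vanishing evidence is to dump pod state before anything gets torn down. a hedged sketch of a failure-only workflow step, assuming kubectl is pointed at the kind cluster (step name illustrative):

```yaml
# illustrative workflow step: capture OOM evidence before skaffold cleans up
- name: Dump pod evidence on failure
  if: failure()
  run: |
    kubectl get pods -A -o wide || true
    kubectl get events -A --sort-by=.lastTimestamp | tail -n 50 || true
    kubectl describe pods -A | grep -i -B2 -A6 "oomkilled" || true
```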
this is, i think, the most insidious category of infrastructure bug. it’s not a wrong answer. it’s a missing answer. the system fails, and then cleans up the evidence of why it failed, and hands you a generic error that could mean anything. context cancelled is Go’s way of saying “something went wrong somewhere, eventually.” it’s the SIGKILL of error messages — technically correct, practically useless.
the fix was four lines of YAML that should have been there from the start. but nobody thinks about resource limits for local dev manifests, because locally your docker runtime happily hands out 16GB and everything works. it’s only when you put the same manifests in a constrained environment — like a kind node inside a CI runner — that the defaults betray you.
defaults. again.
they said it couldn’t run in CI.
there’s some truth to that. at the end of the day, the full e2e pipeline takes 17 minutes. 3 minutes building the preprocessing container. 13 minutes watching kind load serialize, copy, and reimport an image that’s already on the same machine. the build is fast. redistributing it to a container runtime six inches away is four times slower. someone should probably fix kind load inflating every layer just to hand it to a runtime that’s going to compress it again. that genuinely needs to be dealt with. what’s more, someone in ML needs to ship a slim build so we can move faster. pull the teeth.
but. it runs in CI. 200 lines of yaml. the preprocessing service, 3.5GB compressed, private ML dependencies, cuda base images, all build in under 3 minutes on the self-hosted runner. the full e2e suite deploys all four services into a kind cluster, runs jest against the real stack, and finishes in under 17 minutes. no disk hacks. no credentials on disk. no buildx cache gymnastics. no mental gymnastics, either.
disk space — worked around it (build-load-delete-prune), then removed it (self-hosted runners). authentication — worked around it (netrc), rejected it (credentials on disk are not ok), found the right answer (buildkit secrets via stdin). memory — didn’t even know it was a constraint until Go said context cancelled and we had to work backwards from a ghost. four lines of YAML. “it can’t be done” usually means “i tried the obvious thing and it didn’t work.” not the same thing. the obvious thing has a constraint. find it. work around it or remove it.
i untangled workaround after workaround of half-baked solutions (cough, ai slop) and every workaround was more complex than the fix. the build-load-delete dance was more complex than getting a bigger machine. netrc was more complex than piping a token through stdin. debugging context cancelled was more complex than setting resource limits. if your solution is getting more complicated, you’re probably solving the wrong problem.
do the boring thing. ship the work.