Issues / #124
Auto-deploy hardening: digest reconcile, registry OIDC, canary, fleet CI sweep
open
improvement
Project: nimsforest2
11 May 2026 12:32
Description
Tracks the auto-deploy hardening shipped 2026-05-11 across mycelium, nimsforestregistry, land, nimsforest2, landconfigregistry, and 45 service repos.
## Relation to existing issues
- **Supersedes the mechanism in #101** ("Pin image tags in landconfigregistry and auto-replant on registry release"). #101 proposed pinning explicit tags and having CI PATCH landconfigregistry. What shipped instead is a digest-aware reconcile, so the `:latest` indirection actually works: each reconcile tick issues a HEAD against the registry and compares the result with `docker image inspect` `RepoDigests`. Same end-user outcome, fewer moving pieces. **Recommend closing #101** once end-to-end validation lands.
- **Builds on #109** (done, drift-proof config reconcile) by piggybacking the digest check on the same reconcile loop.
- **Adjacent to #123** (done, bringup-chain reliability) — converged at 2026-05-11 20:00 UTC.
## Layers shipped
**Layer 1 — Push auth (mycelium + nimsforestregistry)**
- mycelium v1.2.0: `internal/deploytokens/`, `internal/registrysigner/` (RSA-2048 Docker Registry v2 JWT, libtrust kid), `/api/registry/token`, `/api/registry/pubkey`.
- nimsforestregistry b46fb02: `--require-auth` flag; `/v2/*` writes require a mycelium-signed JWT; pulls stay anonymous.
- mycelium v1.3.0: `internal/githuboidc/` (verifies GitHub Actions OIDC against well-known JWKS), `/api/deploy-repos` (binding-only), unified auth at `/api/registry/token` accepts OIDC (Bearer or Basic) or opaque-token. Zero long-lived secrets in any repo.
- `mycelium/scripts/bootstrap-deploy-repos.sh` one-shot operator script.
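The CI side of the OIDC exchange can be sketched roughly as below. This is a minimal sketch, not mycelium's documented contract: it assumes `/api/registry/token` follows the usual Docker Registry v2 token query shape (`service`, `scope`) and answers with JSON carrying a `token` field. The audience value, the `service` value, and all function names are hypothetical. `ACTIONS_ID_TOKEN_REQUEST_URL` and `ACTIONS_ID_TOKEN_REQUEST_TOKEN` are set by GitHub Actions when the job has `permissions: id-token: write`.

```bash
# Sketch: exchange a GitHub Actions OIDC token for a registry push token.
# Assumptions (not confirmed by this issue): the audience and service values,
# the query parameters, and the "token" field in mycelium's JSON response.
set -eu

MYCELIUM_URL="${MYCELIUM_URL:-https://mycelium.nimsforest.com}"

# Build a Docker Registry v2 scope string for one repository.
registry_scope() {
  printf 'repository:%s:pull,push' "$1"
}

# Extract a top-level string field from a small JSON document on stdin.
# Good enough for a flat token response; use jq for anything more complex.
json_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# Ask GitHub for an OIDC token for audience $1.
github_oidc_token() {
  curl -fsS -H "Authorization: bearer ${ACTIONS_ID_TOKEN_REQUEST_TOKEN}" \
    "${ACTIONS_ID_TOKEN_REQUEST_URL}&audience=$1" | json_field value
}

# Present OIDC token $2 as a Bearer credential and print the registry JWT
# scoped to repository $1.
registry_token() {
  curl -fsS -G -H "Authorization: Bearer $2" \
    --data-urlencode "service=registry.nimsforest.com" \
    --data-urlencode "scope=$(registry_scope "$1")" \
    "${MYCELIUM_URL}/api/registry/token" | json_field token
}
```

Inside a workflow step this would be called as `registry_token iamnim "$(github_oidc_token mycelium)"`, where `mycelium` is a placeholder audience. Nothing is stored in the repo: both credentials are minted per job and expire shortly afterwards.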
**Layer 2 — Pull safety net (land)**
- land v0.83.0: `pkg/land/imagecheck.go` HEAD-then-inspect digest comparison; persistent unpinned containers replant when registry digest moves. Reconcile is mutex-serialized.
- land v0.84.0: startup health probe; emits `land.status.<svc>.{healthy,unhealthy}` on Wind. No auto-rollback by design.
- land v0.85.0: `DeployRequest.CanaryHosts` field; `WatchDeploy` filters by hostname.
- **land v0.86.0**: fix from #126 — `RunningDigest` uses `docker image inspect {{index .RepoDigests 0}}` (manifest digest, matches the registry HEAD) instead of `docker inspect {{.Image}}` (image config SHA, never matches).
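The HEAD-then-inspect comparison can be approximated from a shell for debugging. This is a sketch under assumptions, not the code in `pkg/land/imagecheck.go`: the function names, the `Accept` list, and the `REGISTRY` default are illustrative. It relies on the registry returning a `Docker-Content-Digest` header on a manifest HEAD, which is standard Registry v2 behaviour.

```bash
# Sketch: decide whether a running image has fallen behind its registry tag.
# Assumptions: function names, the Accept list, and the REGISTRY default are
# illustrative and not taken from pkg/land/imagecheck.go.
set -eu

REGISTRY="${REGISTRY:-registry.nimsforest.com}"

# Read HTTP response headers on stdin and print the Docker-Content-Digest
# value. Header names are case-insensitive and lines end in CRLF.
digest_from_headers() {
  awk 'tolower($1) == "docker-content-digest:" { sub(/\r$/, "", $2); print $2; exit }'
}

# Reduce a RepoDigests entry ("repo@sha256:...") to the bare digest.
digest_from_repodigest() {
  printf '%s\n' "${1##*@}"
}

# HEAD the manifest for repository $1 at tag $2 and print its digest.
registry_digest() {
  curl -fsSI \
    -H 'Accept: application/vnd.docker.distribution.manifest.v2+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json' \
    "https://${REGISTRY}/v2/$1/manifests/$2" | digest_from_headers
}

# Print the manifest digest recorded for the locally pulled image $1.
running_digest() {
  digest_from_repodigest "$(docker image inspect --format '{{index .RepoDigests 0}}' "$1")"
}

# Succeeds (exit 0) only when both digests are known and they differ.
needs_replant() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}
```

A manual check would look like `needs_replant "$(registry_digest iamnim latest)" "$(running_digest registry.nimsforest.com/iamnim:latest)" && echo replant`. Treating an empty digest as "no replant" is a choice made for this sketch so that an unreachable registry cannot trigger restarts.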
**Layer 3 — Canary orchestrator (nimsforest2)**
- nimsforest2 v0.82.0: `internal/orchestrator/canary.go`. Env-gated on `CANARY_HOSTS`. Subscribes to `new.release` + `land.status.*.>`; stages canary, awaits health, then drops fleet event or `release.blocked.<svc>`.
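The gate the orchestrator applies can be illustrated with a small decision sketch. It is not the logic in `internal/orchestrator/canary.go`. It assumes `CANARY_HOSTS` is a comma-separated list and that status subjects have exactly the `land.status.<svc>.{healthy,unhealthy}` shape. The `promote` and `wait` verdicts are placeholders, since the fleet event's subject is not named in this issue.

```bash
# Sketch of the canary gate. Assumptions: CANARY_HOSTS is comma-separated,
# and "promote" stands in for whatever fleet-wide event the orchestrator
# actually emits (its subject is not named in this issue).
set -eu

# Succeeds when host $1 appears in the comma-separated list $2.
is_canary_host() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Map a status subject $2 for service $1 to the orchestrator's next action.
canary_verdict() {
  case "$2" in
    "land.status.$1.healthy")   echo "promote" ;;
    "land.status.$1.unhealthy") echo "release.blocked.$1" ;;
    *)                          echo "wait" ;;
  esac
}
```

For example, `canary_verdict iamnim land.status.iamnim.unhealthy` prints `release.blocked.iamnim`, while a status for any other service leaves the release waiting.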
**Layer 4 — CI fleet sweep — DONE (45 repos)**
All nimsforest-org repos that push to `registry.nimsforest.com` now use the GitHub OIDC pattern. Verified by audit. Single operator command registers the bindings for the entire fleet:
```bash
MYCELIUM_API_TOKEN=... /opt/mycelium/scripts/bootstrap-deploy-repos.sh
```
**Layer 5 — mycelium public-TLS fix (2026-05-11 20:00 UTC)**
landconfigregistry v0.42.0 adds `domains: [mycelium.nimsforest.com]` to the mycelium block in `sharedRole()`. Previously the autocert path was a no-op for that hostname, so every external client (including CI's docker auth dance) got `tls internal_error`. Once v0.42.0 is reseeded, Land's TLS terminator provisions a Let's Encrypt cert and `https://mycelium.nimsforest.com/api/registry/token` becomes reachable. This unblocks the operator step of flipping nimsforestregistry's `--require-auth` flag.
## Open / blocked
1. **End-to-end validation against a real release.** Now unblocked from the mycelium-TLS side. Steps:
a. Reseed landconfigregistry on the hub after v0.42.0 deploys (`docker exec landconfigregistry landconfigregistry seed`).
b. Force the hub Land to reconcile so the new `domains:` entry takes effect and Caddy/autocert provisions the cert; verify with `curl -fsS https://mycelium.nimsforest.com/health`.
c. Operator runs `MYCELIUM_API_TOKEN=... /opt/mycelium/scripts/bootstrap-deploy-repos.sh` to register the binding for all 45 nimsforest-org repos.
d. Flip `--require-auth` in nimsforestregistry's systemd unit.
e. Tag any service repo (e.g. `iamnim`) to fire CI's new OIDC step.
2. **Close #101** once step (1) succeeds.
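After step (d), a quick smoke check can confirm the intended posture: anonymous pulls still work and anonymous writes are rejected. The sketch below assumes nimsforestregistry serves the standard Registry v2 routes `/v2/<name>/tags/list` and `/v2/<name>/blobs/uploads/`. The helper names are illustrative and any repository name passed in is only an example.

```bash
# Sketch: post-flip smoke check for the registry's auth posture.
# Assumptions: helper names are illustrative; the endpoints are standard
# Docker Registry v2 routes that nimsforestregistry is assumed to serve.
set -eu

REGISTRY="${REGISTRY:-registry.nimsforest.com}"

# Print only the HTTP status code for a request (all args go to curl).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$@"
}

# Given the status of an anonymous read ($1) and an anonymous write ($2),
# print a one-word verdict.
auth_posture() {
  if [ "$1" != "200" ]; then
    echo "pull-broken"
  elif [ "$2" = "401" ]; then
    echo "ok"
  else
    echo "writes-unprotected"
  fi
}

# Probe repository $1 without credentials and print the verdict.
check_registry() {
  pull="$(http_status "https://${REGISTRY}/v2/$1/tags/list")"
  push="$(http_status -X POST "https://${REGISTRY}/v2/$1/blobs/uploads/")"
  auth_posture "$pull" "$push"
}
```

Running `check_registry iamnim` should print `ok` once `--require-auth` is active. Note that on a registry that does not yet enforce auth, the POST opens an upload session, so the probe is best run only after the flip.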
## Reference
- Plan file: `~/.claude/plans/synchronous-cooking-mountain.md`
- mycelium runbook: `docs/runbooks/RUNBOOK.md` sections `Recommended: GitHub Actions OIDC` / `Per-repo deploy tokens` / `Registry token service` / `Rotating the registry signing key`
- mycelium testbook: tests 20, 21, 22
- mycelium operator script: `mycelium/scripts/bootstrap-deploy-repos.sh`
- land runbook: `docs/runbooks/organization-land.md` sections `Image-digest reconcile`, `Startup health checks`, `Canary-cohort deploys`
- land testbook: `docs/testbooks/auto-deploy.md` steps 6–8
- canary orchestrator: `nimsforest2/docs/{runbooks,testbooks}/canary-orchestrator.md`
- nimsforestregistry: `docs/runbooks/RUNBOOK.md`, `docs/testbooks/TESTBOOK.md`
## Post-ship bug found and fixed (2026-05-11)
Land v0.83.0 was deployed at 17:10 UTC and immediately caused a restart loop: `RunningDigest` compared the image config SHA (`docker inspect {{.Image}}`) against the registry manifest digest (from `HEAD /v2/.../manifests/<tag>`) — two different objects that never match. All 29 org-land containers restarted every 5 minutes until fixed.
Fixed in **land v0.86.0** (commit 81c3af8, released 18:17 UTC). `RunningDigest` now uses `docker image inspect {{index .RepoDigests 0}}` to get the manifest digest from the pulled image, which correctly matches the registry HEAD response.
Downstream: issue #125 (user-reported nims unresponsive) was caused by this loop.
Tracked: issue #126.