feat(init): --cluster-endpoint flag and talm.key recovery hint#206
Conversation
Talm.key absence is a real operational scenario (laptop swap, lost backup, fresh checkout from a repo whose .gitignore properly excludes the key) and the raw "open .../talm.key: no such file" that LoadKey emitted surfaces as if it's an internal bug rather than a "you need to restore your key" condition. Distinguish the not-exist case from other I/O errors at the LoadKey boundary and attach a hint naming both recovery paths the operator can take: restore the backed-up key, or re-run \`talm init\` to regenerate (with the warning that regeneration writes new secrets and the old secrets.encrypted.yaml becomes undecryptable without the original key). Pinned by TestContract_Age_LoadKey_MissingFileSurfacesRecoveryHint. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
…or values.yaml
Two distinct concepts share the word "endpoint" in talm and
the CLI flag --endpoints only addressed one of them:
* talosconfig.contexts.<name>.endpoints — list of host[:port]
for the talosctl-client round-robin. Populated by --endpoints.
* values.yaml::endpoint — single URL (https://host:port) that
the chart renders into cluster.controlPlane.endpoint of every
node's MachineConfig. Used by kubelet / kube-proxy to dial
the kube-apiserver.
Operators reasonably expected --endpoints to populate both;
discovering the split required reading the README. Add a
resolution chain that bridges the two without breaking
multi-endpoint scenarios where picking one node is wrong:
1. --cluster-endpoint URL — explicit operator override,
validated for scheme + host + port.
2. --endpoints with exactly one entry — auto-derive
https://<that>:6443 (unambiguous single-target case).
3. Empty — late hint at end of init nudges the operator at
values.yaml::endpoint with examples for VIP / LB cases.
Multi-endpoint --endpoints never auto-derives: in multi-cp
topologies the correct value is a VIP / LB / DNS round-robin,
not one arbitrarily-picked node IP whose loss would couple
cluster availability to that node.
Validation of --cluster-endpoint happens twice: once in
PreRunE so a malformed flag short-circuits before any files
land on disk, and again in resolveClusterEndpoint as defense in
depth for direct callers.
Pinned by:
* TestResolveClusterEndpoint — precedence + multi-endpoint
safety + malformed rejection across 7 sub-cases.
* TestApplyEndpointOverride — values.yaml rewrite preserves
surrounding content, no-op on presets without the field.
* TestPrintClusterEndpointHintIfEmpty — hint fires on
explicit empty / single-quoted empty / bare-colon empty;
stays silent on populated / missing / no-field cases.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThe PR enhances ChangesInit Error Recovery and Cluster Endpoint Configuration
Sequence DiagramsequenceDiagram
participant Operator
participant Init as talm init
participant Resolve as resolveClusterEndpoint
participant Apply as applyEndpointOverride
participant Hint as printClusterEndpointHintIfEmpty
Operator->>Init: init --cluster-endpoint https://api.example.com:6443
Init->>Init: PreRunE validate URL format early
Init->>Resolve: resolve endpoint (flag/derive/empty)
Resolve-->>Init: resolved endpoint or error
Init->>Apply: rewrite values.yaml endpoint: line
Apply-->>Init: updated bytes
Init->>Hint: check values.yaml for empty endpoint
Hint-->>Operator: (silent if populated, hint if empty)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the --cluster-endpoint flag to the init command, allowing users to explicitly set the Kubernetes control-plane URL in values.yaml. It includes logic to auto-derive this URL from a single entry in --endpoints while providing a hint for multi-endpoint configurations. Additionally, the changes improve error reporting by adding actionable recovery hints when talm.key is missing. A review comment correctly identified a bug in the auto-derivation logic where endpoints already containing a port would result in malformed URLs.
| if len(endpoints) == 1 { | ||
| return "https://" + net.JoinHostPort(endpoints[0], "6443"), nil | ||
| } |
There was a problem hiding this comment.
The auto-derivation logic for the cluster endpoint assumes that endpoints[0] is a bare host. However, in Talos, endpoints are frequently specified as host:port (e.g., 1.2.3.4:50000). If a port is present, net.JoinHostPort will append :6443 to the existing host:port string, resulting in an invalid URL like https://1.2.3.4:50000:6443. We should extract the host part before joining it with the Kubernetes API server port.
if len(endpoints) == 1 {
host := endpoints[0]
if h, _, err := net.SplitHostPort(host); err == nil {
host = h
}
return "https://" + net.JoinHostPort(host, "6443"), nil
}
Closes the next init UX batch (#205 + #177). Both touch operator-facing pain points on init flow and are independent at the file level.
Commits
fix(init): hint at talm.key recovery path when --decrypt fails(#177)talm.keyabsence (lost backup, fresh checkout, laptop swap) used to surface a rawopen .../talm.key: no such filethat reads like an internal bug. Distinguish the not-exist case from other I/O errors at theLoadKeyboundary and attach a hint naming both recovery paths: restore the backed-up key, or re-runtalm initto regenerate (with the warning that regeneration writes new secrets and the oldsecrets.encrypted.yamlbecomes undecryptable). Pinned byTestContract_Age_LoadKey_MissingFileSurfacesRecoveryHint.feat(init): --cluster-endpoint flag and single-endpoint auto-derive for values.yaml(#205)Two distinct "endpoint" concepts in talm shared the same flag space and only one was addressed:
talosconfig.contexts.<name>.endpoints— talosctl-client round-robin list (populated by--endpoints)values.yaml::endpoint— Kubernetes control-plane URL the chart renders into every node's MachineConfigAdds a resolution chain that bridges them without breaking multi-CP scenarios where picking one node is wrong:
--cluster-endpoint URL— explicit operator override, validated for scheme + host + port.--endpointswith exactly one entry — auto-derivehttps://<that>:6443.values.yaml::endpointwith VIP / LB examples.Multi-endpoint
--endpointsnever auto-derives (would silently couple cluster availability to one arbitrary node). Validation runs inPreRunEso a malformed--cluster-endpointshort-circuits before any files land on disk. Pinned byTestResolveClusterEndpoint,TestApplyEndpointOverride,TestPrintClusterEndpointHintIfEmpty.Test plan
docs/manual-test-plan.mdextended:init --decryptwithouttalm.keysurfaces recovery hint--cluster-endpointpopulatesvalues.yaml::endpoint--endpointsauto-derives--cluster-endpointrejected before any files land on diskREADME.mdGetting Started section documents the two-endpoint distinction + the new flag.Validation
go build ./...cleango test ./... -count=1all packages greengolangci-lint run ./...0 issuesSummary by CodeRabbit
Release Notes
New Features
--cluster-endpointflag to specify the Kubernetes control-plane endpoint during initialization.Bug Fixes
Documentation