Skip to content

feat(charts,modeline,upgrade): operator extension points and dev17 install fixes#210

Closed
lexfrei wants to merge 3 commits into
mainfrom
feat/dev17-feedback
Closed

feat(charts,modeline,upgrade): operator extension points and dev17 install fixes#210
lexfrei wants to merge 3 commits into
mainfrom
feat/dev17-feedback

Conversation

@lexfrei
Copy link
Copy Markdown
Contributor

@lexfrei lexfrei commented May 15, 2026

Summary

Three independent improvements surfaced while installing Cozystack on a fresh 3-node Talos cluster (dev17, OCI VM.Standard.E5.Flex, Talos v1.12.6 → v1.13.0). Each commit stands alone and can be reverted independently.

fix(modeline): split modeline parts on top-level commas only

The # talm: nodes=[...], endpoints=[...] parser used strings.SplitSeq(content, ", ") which cut the value at the first comma INSIDE a multi-element JSON array. Hand-authored shared side-patches with nodes=["a", "b"] (space after comma) hit error parsing JSON array for key nodes, value ["a". The talm-generated form happened to dodge it because json.Marshal of a string slice emits no space after the comma.

Replace the literal split with a depth-aware tokenizer that splits on , only at JSON-array depth 0, tracking string state and backslash escapes. Both the canonical no-space form and the human form parse cleanly. Comma inside string literals never splits. Empty arrays, trailing commas, and unbalanced brackets keep their existing semantics (test-pinned).

feat(charts): expose Talos operator extension points in cozystack + generic presets

Operators wanting to extend the rendered Talos config (machine.kernel.modules, machine.sysctls, machine.kubelet.extraConfig, machine.files) had to fork the chart or hand-edit the generated nodes/<n>.yaml against the autogen banner. Adds four extra* values keys to both presets.

On cozystack (defaults present): list-shaped keys append; map-shaped keys are collision-checked at template time and fail with a precise hint if an operator key names a built-in. Built-ins are never overridable — yaml.v3 rejects duplicate map keys on decode, so a silent emit-both would produce a config that cannot round-trip.

On generic (no defaults): each block emits only when the matching extra* key is non-empty.

fix(upgrade): point-patch node body install.image after successful upgrade

talm upgrade resolves the target installer image from values.yaml, but the per-node nodes/<n>.yaml::machine.install.image was never resynced afterwards. A follow-up talm apply rendered the chart against values.yaml (got the new image), then merged the stale body via MergeFileAsPatch and silently pinned the cluster back to the pre-upgrade image — the next A/B boot would roll back without warning.

After successful RPC + post-upgrade verify, point-patch machine.install.image in every -f node body to the applied image. Uses a yaml.v3 streaming decoder so the v1.12+ multi-doc shape (machine + cluster + RegistryMirrorConfig + LinkConfig + Layer2VIPConfig + VLANConfig) is preserved intact — yaml.Unmarshal reads only the first document and would silently erase the rest. Encoder pinned at 2-space indent so the diff stays a single line. Comments and modeline survive untouched.

Verify-failure (auto-rollback) intentionally leaves the body alone: the node ran the old image, so the body must reflect that. Operator opt-out via --skip-post-upgrade-verify (and --insecure / --stage paths that drop Phase 2C entirely) patches unconditionally on RPC success.

Verification

  • go test ./...: all packages green.
  • golangci-lint run ./...: 0 issues.
  • Manual-test-plan extended in each commit (B7/B8 for extension points, C6f for multi-IP modeline, E4 for post-upgrade body sync).

Docs sync

The new extra* keys and the post-upgrade body-sync behaviour are operator-visible. A parallel update at cozystack/website will follow under a separate PR.

lexfrei added 3 commits May 15, 2026 22:36
ParseModeline previously split the modeline body on the literal `, `
(comma+space) string, which is the separator GenerateModeline emits
between key-pairs. JSON arrays serialised by encoding/json carry no
space after each inner comma ("["a","b"]"), so the canonical
talm-generated form parsed back cleanly. But a hand-authored modeline
with whitespace inside the array ("nodes=["a", "b"]" — the
natural shape) was cut at the first inner comma. The leading half
("nodes=["a"") then failed JSON parsing with
"unexpected end of JSON input", with a hint that misled the operator
into thinking their JSON was malformed.

Replace the literal split with a depth-aware tokenizer that tracks
`[`/`]` nesting and JSON string state (including backslash
escapes) and only emits a key-pair boundary when it sees a `,` at
depth 0 outside a string literal. The canonical no-space form still
parses; the human-written form now parses; commas inside string
literals never split the value.

Update the contract test that pinned the old strict separator to
document the relaxation, and add cases for multi-element arrays with
various whitespace shapes plus comma-in-string-literal coverage.
Extend the manual-test-plan with section C6f covering the parser
behaviour on both anchor and side-patch slots.

Signed-off-by: Aleksei Sviridkin <f@lex.la>
…eneric presets

Operators wanting to extend the Talos machine config rendered by the
cozystack or generic presets had to fork the chart or hand-edit the
generated per-node YAML against the autogen banner asking them to
keep template edits in values. Add four "extra*" values keys that
plug into the load-bearing sections of machine.* — kernel modules,
sysctls, kubelet.extraConfig, machine.files — without overriding the
preset's hardcoded defaults.

Cozystack preset (append / merge to existing built-ins):

  extraKernelModules     []  append to openvswitch / drbd / zfs / spl
                             / vfio_pci / vfio_iommu_type1
  extraKubeletExtraArgs  {}  merge into cpuManagerPolicy + maxPods;
                             last-key-wins on conflict (escape hatch)
  extraSysctls           {}  merge into gc_thresh* + nr_hugepages
  extraMachineFiles      []  append to CRI customization + lvm.conf

Generic preset (emit only when set):

  extraKernelModules     []  emits machine.kernel.modules block
  extraKubeletExtraArgs  {}  emits kubelet.extraConfig block
  extraSysctls           {}  emits machine.sysctls block
  extraMachineFiles      []  emits machine.files block

Contract tests under pkg/engine pin append vs merge semantics for
every key on cozystack, on / off state on generic, and the default
generic render still emits no machine.kernel / sysctls / files /
extraConfig blocks (the existing NoCozystackOpinionsOnGeneric guard
stays green). Manual-test-plan B7 / B8 cover both presets with
forward-looking yq queries against the rendered output.

Signed-off-by: Aleksei Sviridkin <f@lex.la>
…grade

talm upgrade resolves the target installer image from values.yaml
(canonical "raise the cluster's Talos version" workflow), but the
node body's machine.install.image was never resynced afterwards. A
follow-up talm apply re-rendered the chart against values.yaml and
got the new image, then merged the stale body via MergeFileAsPatch
and silently pinned the cluster back to the pre-upgrade image — the
next A/B boot would roll back without warning.

Point-patch machine.install.image in every -f node body after the
upgrade RPC + post-upgrade verify return success. Uses a yaml.v3
node round-trip so the modeline, autogen banner, sibling keys, and
per-key comments survive untouched; files without an install.image
key (side-patches, orphans, partials) are silently skipped so the
handler can blindly fan out over the full -f list.

Verify failure intentionally leaves the body alone: the node is on
the pre-upgrade image after auto-rollback, and the body must reflect
what the node actually runs — not what talosctl was asked to install.
Operator opt-out via --skip-post-upgrade-verify (or the --insecure /
--stage paths that drop Phase 2C entirely) patches unconditionally
on RPC success, trading the verify gate's safety for body-sync.

Contract tests pin scalar swap, idempotency (no rewrite when already
on target), silent-skip across three no-key shapes, structural-error
surface for unexpected machine.install.image kinds, and file-list
fan-out semantics. Manual-test-plan E4 covers the happy path,
failure-path skip, and operator-opt-out path against a real cluster.

Signed-off-by: Aleksei Sviridkin <f@lex.la>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8cad208a-a9ce-4b78-976b-e73b436d146e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/dev17-feedback

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces operator-supplied extension points for kernel modules, sysctls, kubelet arguments, and machine files across the cozystack and generic presets. It also implements a post-upgrade synchronization feature that updates the machine.install.image in node configuration files to prevent accidental reverts during future applies. Furthermore, the modeline parser has been enhanced to correctly handle commas within JSON arrays. Review feedback recommends refactoring the Helm template collision checks to use lists for improved maintainability and adopting the idiomatic errors.Is(err, io.EOF) check in the test suite.

Comment on lines +49 to +53
{{- range $k, $_ := .Values.extraKubeletExtraArgs }}
{{- if or (eq $k "cpuManagerPolicy") (eq $k "maxPods") }}
{{- fail (printf "values.yaml: extraKubeletExtraArgs.%s collides with the cozystack preset's built-in kubelet.extraConfig — keys never override (yaml.v3 rejects duplicate map keys on decode). Remove the entry from extraKubeletExtraArgs, or fork the chart preset if you need a different default." $k) }}
{{- end }}
{{- end }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better readability and maintainability, you could define a list of the built-in keys and use the has function to check for collisions. This avoids a long or chain and makes it easier to manage the list of keys.

{{- $builtInKubeletKeys := list "cpuManagerPolicy" "maxPods" -}}
{{- range $k, $_ := .Values.extraKubeletExtraArgs -}}
  {{- if has $k $builtInKubeletKeys -}}
    {{- fail (printf "values.yaml: extraKubeletExtraArgs.%s collides with the cozystack preset's built-in kubelet.extraConfig — keys never override (yaml.v3 rejects duplicate map keys on decode). Remove the entry from extraKubeletExtraArgs, or fork the chart preset if you need a different default." $k) -}}
  {{- end -}}
{{- end -}}

Comment on lines +62 to +66
{{- range $k, $_ := .Values.extraSysctls }}
{{- if or (eq $k "vm.nr_hugepages") (eq $k "net.ipv4.neigh.default.gc_thresh1") (eq $k "net.ipv4.neigh.default.gc_thresh2") (eq $k "net.ipv4.neigh.default.gc_thresh3") }}
{{- fail (printf "values.yaml: extraSysctls.%s collides with the cozystack preset's built-in machine.sysctls — keys never override (yaml.v3 rejects duplicate map keys on decode). Remove the entry from extraSysctls, or fork the chart preset if you need a different default." $k) }}
{{- end }}
{{- end }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the extraKubeletExtraArgs check, using a list for the built-in sysctl keys and checking with has would improve readability and make this easier to maintain.

{{- $builtInSysctls := list "vm.nr_hugepages" "net.ipv4.neigh.default.gc_thresh1" "net.ipv4.neigh.default.gc_thresh2" "net.ipv4.neigh.default.gc_thresh3" -}}
{{- range $k, $_ := .Values.extraSysctls -}}
  {{- if has $k $builtInSysctls -}}
    {{- fail (printf "values.yaml: extraSysctls.%s collides with the cozystack preset's built-in machine.sysctls — keys never override (yaml.v3 rejects duplicate map keys on decode). Remove the entry from extraSysctls, or fork the chart preset if you need a different default." $k) -}}
  {{- end -}}
{{- end -}}

Comment on lines +491 to +493
if err.Error() == "EOF" {
break
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Checking for end-of-file by comparing the error string err.Error() == "EOF" is brittle. The idiomatic and more robust way to check for io.EOF is to use errors.Is(err, io.EOF). You may need to import the io and errors packages.

Suggested change
if err.Error() == "EOF" {
break
}
if errors.Is(err, io.EOF) {
break
}

@lexfrei
Copy link
Copy Markdown
Contributor Author

lexfrei commented May 15, 2026

Closing in favour of a follow-up PR with a cleaner branch name and description.

@lexfrei lexfrei closed this May 15, 2026
@lexfrei lexfrei deleted the feat/dev17-feedback branch May 15, 2026 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant