
Conversation

@VannTen (Contributor) commented Oct 7, 2025

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
This is extracted from #10714, with only the parts related to Node Allocatable.
This clarifies, for users, the API change which happened in #10643 (see #12438):
Distinguish between:

  1. reserving resources for kube / system daemons -> this impacts the calculation of Allocatable
  2. enforcing those limits -> this translates the reservations into cgroup limits on the node

The current variables (kube_reserved / system_reserved) were used, before #10643, to enable 1 and activate 2.
#10643 decoupled 1 and 2, but those variables and the documentation are still quite confusing. This PR instead aligns the variables with upstream Kubernetes terminology (enforce_node_allocatable_*); see the sketch below.
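
A hypothetical inventory sketch of the new scheme (the enforce_* names follow the release note below; the per-resource reservation variables and all values shown are illustrative assumptions, not defaults from this PR):

```yaml
# group_vars/k8s_cluster/k8s-cluster.yml (hypothetical sketch)
# 1. Reservations: always applied, deducted from Allocatable
kube_memory_reserved: 256Mi
kube_cpu_reserved: 100m
system_memory_reserved: 512Mi
system_cpu_reserved: 500m
# 2. Enforcement: opt-in cgroup limits, one switch per reservation
enforce_allocatable_kube_reserved: true
enforce_allocatable_system_reserved: false
```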

Which issue(s) this PR fixes:

Fixes #12438

Special notes for your reviewer:
@rptaylor would appreciate your thoughts on this
@tico88612 I wonder if we should revert the change in #10643 which set resources unconditionally, and reapply them on master only. Thoughts?

Does this PR introduce a user-facing change?:

action-required
`kube_reserved` is renamed to `enforce_allocatable_kube_reserved`
`system_reserved` is renamed to `enforce_allocatable_system_reserved`

Setting the {kube,system}ReservedCgroup does not make the kubelet
enforce the limits; adding the corresponding entry in
enforceNodeAllocatable does, so:
- unconditionally add the *ReservedCgroup settings to the kubelet config
- replace the *_reserved variables with enforce_node_allocatable_*, which
  map more directly to upstream Kubernetes concepts.

Fix up the tests using those variables.
Remove the cgroups schema as it's not really actionable: the linked
Kubernetes documentation and design doc already cover that.
@k8s-ci-robot added the release-note, kind/cleanup, and cncf-cla: yes labels Oct 7, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VannTen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested a review from cyclinder October 7, 2025 07:33
@k8s-ci-robot added the approved label Oct 7, 2025
@k8s-ci-robot requested a review from tico88612 October 7, 2025 07:33
@k8s-ci-robot added the size/L label Oct 7, 2025
pods will be evicted when resource usage excess Allocatable.
Here is an example:
You can optionnaly enforce the reservations for kube-reserved and

typo

Here is an example:
You can optionnaly enforce the reservations for kube-reserved and
system-reserved, but proceed with caution (see [the kubernetes
guidelines](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#general-guidelines)).
@rptaylor (Contributor) Oct 7, 2025

"Enforcing" a reservation doesn't seem entirely clear to me.
If I recall correctly, the point is that the reservation should always be in effect as a matter of best practice, and it happens via scheduling considerations.
You can optionally also run the components in cgroups to ensure they don't exceed the amount of reserved resources.
In short I think it is clearer to say "enforce limits" rather than "enforce reservations", because the cgroup limits are a separate matter from the reservations which should always apply.
Basically similar to the idea of a k8s request vs limit.
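
To make that analogy concrete, a minimal pod-spec snippet of the request/limit split (values are illustrative):

```yaml
# A pod's request reserves capacity at scheduling time, while its
# limit is enforced at runtime via cgroups -- mirroring *Reserved
# (Allocatable calculation) vs the enforceNodeAllocatable entries.
resources:
  requests:
    cpu: 100m      # scheduling-time reservation
  limits:
    cpu: 200m      # runtime cgroup enforcement
```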

@VannTen (Contributor, Author) Oct 9, 2025

Yes, you're right, the wording is a bit poor; I'll fix that.

Yes, the various *Reserved settings by themselves only have an effect because the kubelet uses them to calculate Allocatable (which is used by kube-scheduler).
To make them limiting (with cgroups) you need the corresponding *ReservedCgroup settings + to include them in enforceNodeAllocatable (which, btw, is a bit poorly worded IMO 😆 )

Maybe "You can optionally make the kubelet enforce the reservations as cgroups limits" ? 🤔

@rptaylor (Contributor) Oct 9, 2025

Sure sounds good, maybe something like:
"In addition, you can optionally make the kubelet enforce the reservations as cgroup limits, but beware that this can cause system instability if the limits are not set appropriately, because the system daemons will be terminated if their resource usage exceeds the reserved amount. (link to docs)"

@rptaylor (Contributor) commented Oct 7, 2025

Background: I believe there was a misunderstanding in the implementation of #9209 that led to some confusion. I tried to fix this in #11367, adding some clarifying comments but without breaking changes. This is also discussed in #9692.

I took a quick look at this PR but didn't check all the details and logic. But @VannTen in my opinion the important thing is to understand the concepts and have clear terminology; then I am confident you will get the details right :) Hopefully the above background info helps clarify if needed.

@VannTen (Contributor, Author) commented Oct 9, 2025

Yeah, I read those comments some months ago, when doing #10714 (unfortunately the remaining bits of that need a newer version of community.general before we can get rid of the weird cgroups variables).

## Node Allocatable

You can use `kubelet_enforce_node_allocatable` to set node allocatable enforcement.
Node Allocatable is calculated by subtracting from the node capacity:

Suggestion:

"Node Allocatable" is the amount of resources on a node that can be used by pods. It is calculated by subtracting the following quantities from the physically available resources of a node:

Note that to enforce kube-reserved or system-reserved, `kube_reserved_cgroups` or `system_reserved_cgroups` needs to be specified respectively.
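
A worked sketch of that calculation (all numbers are illustrative):

```yaml
# node capacity:     cpu: 4000m   memory: 16Gi (16384Mi)
# kube-reserved:     cpu: 200m    memory: 512Mi
# system-reserved:   cpu: 200m    memory: 512Mi
# hard eviction:                  memory: 100Mi
# Allocatable = capacity - kube-reserved - system-reserved - eviction
#             -> cpu: 3600m   memory: 15260Mi
```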
By default, the kubelet will enforce Node Allocatable for pods, which means
pods will be evicted when resource usage excess Allocatable.

"when resource usage exceeds the Allocatable amount"

@tico88612 (Member)

@VannTen What does "set resources unconditionally" mean? I wasn't involved in this PR (might need some time to read through it), as I had joined only 1-2 months before that. 😆

@VannTen (Contributor, Author) commented Oct 10, 2025

Basically before #10643, kube_reserved and system_reserved had two effects:

  1. Enable the per-resource *_kube/system_reserved reservation variables -> these are deducted from Allocatable, i.e., the resources available to schedule pods (without the main switches kube/system_reserved, those variables had no effect).
  2. Enable the enforcement of the reservations as cgroup limits (by adding them to the kubelet setting enforceNodeAllocatable and specifying the cgroups on which the limits should be applied).

#10643 removed 1., instead reserving the resources unconditionally based on the kubespray variables.
Thus, for a cluster using system_reserved: false, for instance, while keeping the default reservation values, the amount of resources in Allocatable would decrease (since the reservations are now applied when they weren't before); see the sketch below.
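
A sketch of that behavior change (per-resource variable name and amount are illustrative assumptions):

```yaml
# Before #10643:
system_reserved: false          # main switch off ->
system_memory_reserved: 512Mi   #   ignored, Allocatable unchanged

# After #10643 (no main switch):
system_memory_reserved: 512Mi   # always deducted ->
                                #   Allocatable memory shrinks by 512Mi
```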



Development

Successfully merging this pull request may close these issues.

system_reserved changed behavior in 2.27
