Change semantics of kubelet resources reservation #12603
base: master
Conversation
Setting the {kube,system}ReservedCgroup does not make the kubelet
enforce the limits, adding the corresponding entry in
enforceNodeAllocatable does, so:
- unconditionally add the *ReservedCgroup settings to the kubelet config
- replace the *_reserved variables with enforce_node_allocatable_*, which
  map more directly to upstream Kubernetes concepts.
Fix up the tests using those variables.
Remove the cgroups schema from the docs, as it's not really actionable; the Kubernetes documentation and design doc linked there already cover that material.
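For reference, a minimal sketch of the kubelet configuration fields involved (the reservation values and cgroup paths below are illustrative, not kubespray defaults):

```yaml
# Illustrative KubeletConfiguration fragment; values and cgroup paths are examples only.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: 200m
  memory: 512Mi
systemReserved:
  cpu: 200m
  memory: 512Mi
# Pointing at cgroups does not, by itself, enforce anything:
kubeReservedCgroup: /kube.slice
systemReservedCgroup: /system.slice
# Limits are only enforced for the entries listed here:
enforceNodeAllocatable:
  - pods
  - kube-reserved
  - system-reserved
```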
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VannTen

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
> pods will be evicted when resource usage excess Allocatable.
> Here is an example:
> You can optionnaly enforce the reservations for kube-reserved and
typo
> Here is an example:
> You can optionnaly enforce the reservations for kube-reserved and
> system-reserved, but proceed with caution (see [the kubernetes
> guidelines](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#general-guidelines)).
"Enforcing" a reservation doesn't seem entirely clear to me.
If I recall correctly, the point is that the reservation should always be in effect as a matter of best practice, and it happens via scheduling considerations.
You can optionally also run the components in cgroups to ensure they don't exceed the amount of reserved resources.
In short I think it is clearer to say "enforce limits" rather than "enforce reservations", because the cgroup limits are a separate matter from the reservations which should always apply.
Basically similar to the idea of a k8s request vs limit.
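As a generic illustration of that analogy (not taken from this PR): a container's request only influences scheduling, while its limit is what actually gets enforced through cgroups at runtime.

```yaml
# Generic example pod, only to illustrate request vs. limit.
apiVersion: v1
kind: Pod
metadata:
  name: request-vs-limit-demo   # hypothetical name
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
      resources:
        requests:        # used by the scheduler for placement (like a reservation)
          cpu: 100m
          memory: 128Mi
        limits:          # enforced at runtime via cgroups (like enforceNodeAllocatable)
          cpu: 500m
          memory: 256Mi
```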
Yes, you're right, the wording is a bit poor; I'll fix that.
Yes, the various *Reserved settings by themselves only have an effect because the kubelet uses them to calculate Allocatable (which is used by kube-scheduler).
To make them limiting (with cgroups) you need the corresponding *ReservedCgroup settings plus the matching entries in enforceNodeAllocatable (which, btw, is a bit poorly worded IMO 😆)
Maybe "You can optionally make the kubelet enforce the reservations as cgroups limits" ? 🤔
Sure sounds good, maybe something like:
"In addition, you can optionally make the kubelet enforce the reservations as cgroup limits, but beware that this can cause system instability if the limits are not set appropriately, because the system daemons will be terminated if their resource usage exceeds the reserved amount. (link to docs)"
Background: I believe there was a misunderstanding in the implementation of #9209 that led to some confusion. I tried to fix this in #11367, adding some clarifying comments but without breaking changes. This is also discussed in #9692. I took a quick look at this PR but didn't check all the details and logic. But @VannTen, in my opinion the important thing is to understand the concepts and have clear terminology; then I am confident you will get the details right :) Hopefully the above background info helps clarify if needed.
Yeah, I read those comments some months ago, when doing #10714 (unfortunately the remaining bits of that need a newer version of …)
> ## Node Allocatable
>
> You can use `kubelet_enforce_node_allocatable` to set node allocatable enforcement.
> Node Allocatable is calculated by subtracting from the node capacity:
Suggestion:
"Node Allocatable" is the amount of resources on a node that can be used by pods. It is calculated by subtracting the following quantities from the physically available resources of a node:
> Note that to enforce kube-reserved or system-reserved, `kube_reserved_cgroups` or `system_reserved_cgroups` needs to be specified respectively.
> By default, the kubelet will enforce Node Allocatable for pods, which means
> pods will be evicted when resource usage excess Allocatable.
"when resource usage exceeds the Allocatable amount"
@VannTen What does … mean?
Basically, before #10643, the `kube_reserved` / `system_reserved` booleans controlled both 1. whether resources were reserved (accounted out of Node Allocatable) and 2. whether those reservations were enforced as cgroup limits.
#10643 removed 1., instead reserving the resources unconditionally based on kubespray variables.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This is extracted from #10714, with only the parts related to Node Allocatable.
This clarifies, for users, the API change which happened in #10643 (see #12438).
Distinguish between:
1. reserving resources for the system and kube daemons, which is only reflected in the Node Allocatable computation, and
2. enforcing those reservations as cgroup limits on the daemons themselves.
The current variables (kube_reserved / system_reserved) were used, before #10643, to enable 1 and activate 2.
#10643 decoupled 1 and 2, but those variables and the documentation are still quite confusing. This PR instead aligns the variables with upstream Kubernetes terminology (`enforce_node_allocatable_*`).
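A rough sketch of what the resulting inventory interface could look like (the exact variable names and defaults here are assumptions based on the `enforce_node_allocatable_*` pattern, not necessarily what this PR ships):

```yaml
# Hypothetical group_vars sketch -- names and defaults are assumptions.
# Enforcing the reservations as cgroup limits becomes an explicit, per-entry opt-in:
enforce_node_allocatable_pods: true
enforce_node_allocatable_kube_reserved: false
enforce_node_allocatable_system_reserved: false
```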
Which issue(s) this PR fixes:
Fixes #12438
Special notes for your reviewer:
@rptaylor would appreciate your thoughts on this
@tico88612 I wonder if we should revert the change in #10643 which set resources unconditionally, and re-apply it on master only. Thoughts?
Does this PR introduce a user-facing change?: