[KEP-5381]: blog for mutable pv nodeAffinity alpha #53006
@huww98 thank you for opening this feature blog PR.
Hi @huww98 👋 v1.35 Communications team here, @yuanwang04 as author of #52895, I'd like you to be a writing buddy for @huww98 on this PR. Please:
Hi @huww98 👋 -- this is Graziano (@graz-dev) from the v1.35 Communications Team! Just a friendly reminder that we are approaching the feature blog "ready for review" deadline: Friday 21st November. We ask you to have the blog PR in non-draft state, and all write-up to be complete, so that we can start the blog review from SIG Docs Blog team. If you have any questions or need help, please don't hesitate to reach out to me or any of the Communications Team members. We are here to help you!
Sorry @huww98, the correct deadline for "Feature Blog Ready for Review" is Monday 24 November. Sorry, my bad :(
content/en/blog/_posts/2025-XX-XX-mutable-pv-affinity-alpha/index.md
| As another example, providers sometimes offer new generations of disks. | ||
| New disks cannot always be attached to older nodes in the cluster. | ||
| While this accessibility can also be expressed through PV node affinity and ensures the Pods can be scheduled to the right nodes, | ||
| this can also prevent online disk upgrade. |
How can this prevent online disk upgrade?
In fact, you can upgrade, but the scheduler will not pick up that upgrade automatically. Updated this to:
This accessibility can also be expressed through PV node affinity and ensures the Pods can be scheduled to the right nodes.
But when the disk is upgraded, new Pods using this disk can still be scheduled to older nodes.
To prevent this, you may want to change the PV node affinity from:
|
|
||
| As another example, providers sometimes offer new generations of disks. | ||
| New disks cannot always be attached to older nodes in the cluster. | ||
| While this accessibility can also be expressed through PV node affinity and ensures the Pods can be scheduled to the right nodes, |
So this involves detach and re-attach which will disrupt the workload.
Not necessary if the disk is already attached to a node that supports both gen1 and gen2.
| Typically only administrators can edit PVs, please make sure you have the right RBAC permissions. | ||
|
|
||
| Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume. | ||
| You must also update the underlying volume in the storage provider, and keep the node affinity in sync. |
I think before asking people to try it out and edit PV nodeAffinity, you should explain what a storage vendor needs to do to support this feature.
Otherwise, some admin may try it manually and cause problems.
The storage provider needs to offer online updates that affect the accessibility of the volume.
If admins want to use those online update capabilities, they should use this feature.
I expanded the "Try it out" section and hope this makes it clearer.
| One mitigation under discussion is to have the kubelet fail Pod startup if the PV’s node affinity is violated. | ||
| This has not landed yet. | ||
| So if you are trying this out now, please watch subsequent Pods that use the updated PV, | ||
| and make sure they are scheduled onto nodes that can access the volume. |
So these are for storage vendors who are interested in having their drivers support this feature. I think all of these should be under a heading that clarifies who the intended audience is.
This is intended for admins who are willing to try this feature, to inform them of the race condition. If someone updates a PV and then starts new Pods in a script, it may not work as intended.
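For admins who want to script the "watch subsequent Pods" check, here is a minimal Python sketch of how a PV's required node affinity could be compared against a node's labels. It assumes the PV's `spec.nodeAffinity` and the node's labels have already been loaded as plain dicts. The label key `example.com/disk-gen2` and the function names are hypothetical, and the sketch deliberately skips `matchFields` and the `Gt`/`Lt` operators.

```python
from typing import Dict, Optional


def _expr_matches(labels: Dict[str, str], expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A node that lacks the label also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator {op!r} is not handled in this sketch")


def node_can_access(labels: Dict[str, str], node_affinity: Optional[dict]) -> bool:
    """Return True if the node labels satisfy the PV's required node affinity.

    Terms in nodeSelectorTerms are ORed; expressions inside one term are ANDed.
    A PV without required node affinity is treated as accessible everywhere.
    """
    if not node_affinity or node_affinity.get("required") is None:
        return True
    terms = node_affinity["required"].get("nodeSelectorTerms", [])
    return any(
        bool(term.get("matchExpressions"))
        and all(_expr_matches(labels, e) for e in term["matchExpressions"])
        for term in terms
    )


# Hypothetical tightened affinity: only nodes that support gen2 disks.
tightened = {
    "required": {
        "nodeSelectorTerms": [
            {
                "matchExpressions": [
                    {
                        "key": "example.com/disk-gen2",  # hypothetical label key
                        "operator": "In",
                        "values": ["available"],
                    }
                ]
            }
        ]
    }
}

old_node = {"example.com/disk-gen1": "available"}
new_node = {
    "example.com/disk-gen1": "available",
    "example.com/disk-gen2": "available",
}

print(node_can_access(old_node, tightened))  # False
print(node_can_access(new_node, tightened))  # True
```

A script that tightens node affinity could run a check like this against the node each new Pod lands on, instead of assuming the scheduler has already seen the updated PV.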
|
|
||
| ## Future Integration with CSI (Container Storage Interface) | ||
|
|
||
| Currently, it is up to the cluster administrator to modify both PV's node affinity and the underlying volume in the storage provider. |
What does a cluster admin need to do before making node affinity changes so that he/she won't run into problems?
I think this should be explained in the "Try it out" section.
|
|
||
| As noted earlier, this is only a first step. | ||
|
|
||
| If you are a Kubernetes user, |
What kind of user has access to PV node affinity? It should be cluster admin, not any user.
Yes, currently. But after integration with CSI, unprivileged users should be able to trigger an update with VAC. So I'd like to hear from all users.
yuanwang04
left a comment
Thanks for introducing this useful feature and the blog; overall LGTM, left some clarification comments.
| It is fine to allow more nodes to access the volume by relaxing node affinity. | ||
| But there is a race condition when you try to tighten node affinity: | ||
| We don't know how scheduler will see our modified PV in its cache, | ||
| so there is a small window where the scheduler may place a Pod on an old node that can no longer access the volume. |
What would happen in that case? Would the PV fail to bind to the node or Pod?
| This has not landed yet. | ||
| So if you are trying this out now, please watch subsequent Pods that use the updated PV, | ||
| and make sure they are scheduled onto nodes that can access the volume. | ||
| If you update PV then immediately start new Pods in a script, it may not work as intended. |
Is there an estimated time window to wait before Pod can be scheduled correctly?
| dates back to Kubernetes v1.10. | ||
| It is widely used to express that volumes may not be equally accessible by all nodes in the cluster. | ||
| This field was previously immutable, | ||
| we are now making it mutable in Kubernetes v1.35 (alpha), Opening a door to more flexible online volume management. |
nit: avoid we
| we are now making it mutable in Kubernetes v1.35 (alpha), Opening a door to more flexible online volume management. | |
| and it is now mutable in Kubernetes v1.35 (alpha). This change opens a door to more flexible online volume management. |
| - available | ||
| ``` | ||
|
|
||
| So, we are making it mutable now, a first step towards a more flexible online volume management. |
nit: avoid we
| So, we are making it mutable now, a first step towards a more flexible online volume management. | |
| So, it is mutable now, a first step towards a more flexible online volume management. |
| There are only a few things out of Pod that can affects the scheduling decision. PV node affinity is one of them. | ||
| It is fine to allow more nodes to access the volume by relaxing node affinity. | ||
| But there is a race condition when you try to tighten node affinity: | ||
| We don't know how scheduler will see our modified PV in its cache, |
nit: avoid we
| We don't know how scheduler will see our modified PV in its cache, | |
| It is unclear how the Scheduler will see the modified PV in its cache, |
|
|
||
| Currently, it is up to the cluster administrator to modify both PV's node affinity and the underlying volume in the storage provider. | ||
| But manual operations are error-prone and time-consuming. | ||
| We would like to eventually integrate this with VolumeAttributesClass, |
nit: avoid we
| We would like to eventually integrate this with VolumeAttributesClass, | |
| It is preferred to eventually integrate this with VolumeAttributesClass, |
| But manual operations are error-prone and time-consuming. | ||
| We would like to eventually integrate this with VolumeAttributesClass, | ||
| so that an unprivileged user can modify their PersistentVolumeClaim (PVC) to trigger storage-side updates, | ||
| and PV node affinity is updated automatically when approprate, without the need for cluster admin's intervention. |
nit: typo
| and PV node affinity is updated automatically when approprate, without the need for cluster admin's intervention. | |
| and PV node affinity is updated automatically when appropriate, without the need for cluster admin's intervention. |
Serenity611
left a comment
Reviewed for general proofreading, made a few suggestions for readability and alignment with the style guide :) Thanks!
| draft: true | ||
| slug: kubernetes-v1-35-mutable-pv-nodeaffinity | ||
| author: > | ||
| Weiwen Hu (Alibaba Cloud) |
| Weiwen Hu (Alibaba Cloud) | |
| Weiwen Hu (Alibaba Cloud), |
| This field was previously immutable, | ||
| we are now making it mutable in Kubernetes v1.35 (alpha), Opening a door to more flexible online volume management. | ||
|
|
||
| ## Why Making Node Affinity Mutable? |
| ## Why Making Node Affinity Mutable? | |
| ## Why make node affinity mutable? |
| Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume. | ||
| So before using this feature, | ||
| You must update the underlying volume in the storage provider first, | ||
| and understand which nodes can access the volume after the update. | ||
| Then you can enable this feature and keep the PV node affinity in sync. |
| Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume. | |
| So before using this feature, | |
| You must update the underlying volume in the storage provider first, | |
| and understand which nodes can access the volume after the update. | |
| Then you can enable this feature and keep the PV node affinity in sync. | |
| Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume. | |
| Before using this feature, | |
| you must first update the underlying volume in the storage provider | |
| and understand which nodes can access the volume after the update. | |
| You can then enable this feature and keep the PV node affinity in sync. |
|
|
||
| Currently, this feature is in alpha state. | ||
| It is disabled by default, and may subject to change. | ||
| To try it out, enable `MutablePVNodeAffinity` feature gate on APIServer, then you can edit PV spec.nodeAffinity field. |
| To try it out, enable `MutablePVNodeAffinity` feature gate on APIServer, then you can edit PV spec.nodeAffinity field. | |
| To try it out, enable the `MutablePVNodeAffinity` feature gate on APIServer, then you can edit the PV `spec.nodeAffinity` field. |
| To try it out, enable `MutablePVNodeAffinity` feature gate on APIServer, then you can edit PV spec.nodeAffinity field. | ||
| Typically only administrators can edit PVs, please make sure you have the right RBAC permissions. | ||
|
|
||
| ### Race Condition between Updating and Scheduling |
| ### Race Condition between Updating and Scheduling | |
| ### Race condition between updating and scheduling |
|
|
||
| ### Race Condition between Updating and Scheduling | ||
|
|
||
| There are only a few things out of Pod that can affects the scheduling decision. PV node affinity is one of them. |
| There are only a few things out of Pod that can affects the scheduling decision. PV node affinity is one of them. | |
| There are only a few things out of Pod that can affect the scheduling decision, and PV node affinity is one of them. |
| It is fine to allow more nodes to access the volume by relaxing node affinity. | ||
| But there is a race condition when you try to tighten node affinity: |
| It is fine to allow more nodes to access the volume by relaxing node affinity. | |
| But there is a race condition when you try to tighten node affinity: | |
| It is fine to allow more nodes to access the volume by relaxing node affinity, | |
| but there is a race condition when you try to tighten node affinity: |
| We don't know how scheduler will see our modified PV in its cache, | ||
| so there is a small window where the scheduler may place a Pod on an old node that can no longer access the volume. | ||
|
|
||
| One mitigation under discussion is to have the kubelet fail Pod startup if the PV’s node affinity is violated. |
| One mitigation under discussion is to have the kubelet fail Pod startup if the PV’s node affinity is violated. | |
| One mitigation under discussion is to have the `kubelet` fail Pod startup if the PV’s node affinity is violated. |
| and make sure they are scheduled onto nodes that can access the volume. | ||
| If you update PV then immediately start new Pods in a script, it may not work as intended. | ||
|
|
||
| ## Future Integration with CSI (Container Storage Interface) |
| ## Future Integration with CSI (Container Storage Interface) | |
| ## Future integration with CSI (Container Storage Interface) |
|
/assign |
Description
blog for mutable pv nodeAffinity alpha
Issue
Closes: #