
Conversation


@huww98 huww98 commented Oct 30, 2025

Description

Blog post for the mutable PV nodeAffinity alpha feature.

Issue

Closes: #

@k8s-ci-robot k8s-ci-robot added this to the 1.35 milestone Oct 30, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 30, 2025

netlify bot commented Oct 30, 2025

👷 Deploy Preview for kubernetes-io-vnext-staging processing.

Name Link
🔨 Latest commit 1ba9c20
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-io-vnext-staging/deploys/6902de3e86fce80008306fdf

@k8s-ci-robot k8s-ci-robot requested a review from graz-dev October 30, 2025 03:40
@k8s-ci-robot k8s-ci-robot added the area/blog Issues or PRs related to the Kubernetes Blog subproject label Oct 30, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 30, 2025

netlify bot commented Oct 30, 2025

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit 965f2d4
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-io-main-staging/deploys/6925330d82f623000812023b
😎 Deploy Preview https://deploy-preview-53006--kubernetes-io-main-staging.netlify.app

@graz-dev
Contributor

@huww98 thank you for opening this feature blog PR.
Feature blog PRs should be opened against the main branch, could you fix it please?

Thank you.

@huww98 huww98 changed the base branch from dev-1.35 to main October 30, 2025 08:06
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 30, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nate-double-u for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 30, 2025
@graz-dev
Contributor

Hi @huww98 👋 v1.35 Communications team here,

@yuanwang04 as author of #52895, I'd like you to be a writing buddy for @huww98 on this PR.

Please:

  • Review this PR, paying attention to the guidelines and review hints
  • Update your own PR based on any best practices you identify that should be applied
  • Remember to be compassionate with your fellow article author

@graz-dev
Contributor

Hi @huww98 👋 -- this is Graziano (@graz-dev) from the v1.35 Communications Team!

Just a friendly reminder that we are approaching the feature blog "ready for review" deadline: Friday 21st November. We ask you to have the blog PR in a non-draft state and the write-up complete, so that the SIG Docs Blog team can start the blog review.

If you have any questions or need help, please don't hesitate to reach out to me or any of the Communications Team members. We are here to help you!

@graz-dev
Contributor

Sorry @huww98 the correct deadline for "Feature Blog Ready for Review" is Monday 24 November.
So you still have some days to finish the content and change the status of the PR.

Sorry, my bad :(

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 24, 2025
@huww98 huww98 marked this pull request as ready for review November 24, 2025 13:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 24, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 24, 2025
As another example, providers sometimes offer new generations of disks.
New disks cannot always be attached to older nodes in the cluster.
While this accessibility can also be expressed through PV node affinity and ensures the Pods can be scheduled to the right nodes,
this can also prevent online disk upgrade.
Contributor

How can this prevent online disk upgrade?

Author

In fact, you can upgrade, but the scheduler will not pick up that upgrade automatically. I'll update this to:

This accessibility can also be expressed through PV node affinity and ensures the Pods can be scheduled to the right nodes.
But when the disk is upgraded, new Pods using this disk can still be scheduled to older nodes.
To prevent this, you may want to change the PV node affinity from:
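(The exact affinity isn't spelled out in this thread. Purely for illustration, assuming a hypothetical node label `example.com/disk-generation` that records which disk generations a node can attach, the change could go from:)

```yaml
# Hypothetical "before": the gen1 disk is attachable by both older and newer nodes
nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions:
      - key: example.com/disk-generation
        operator: In
        values: ["gen1", "gen2"]
```

to something like:

```yaml
# Hypothetical "after": the disk was upgraded to gen2, so only newer nodes qualify
nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions:
      - key: example.com/disk-generation
        operator: In
        values: ["gen2"]
```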


As another example, providers sometimes offer new generations of disks.
New disks cannot always be attached to older nodes in the cluster.
While this accessibility can also be expressed through PV node affinity and ensures the Pods can be scheduled to the right nodes,
Contributor

So this involves detach and re-attach which will disrupt the workload.

Author

It's not necessary if the disk is already attached to a node that supports both gen1 and gen2.

Typically only administrators can edit PVs, please make sure you have the right RBAC permissions.

Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume.
You must also update the underlying volume in the storage provider, and keep the node affinity in sync.
Contributor

I think before asking people to try it out and edit PV nodeAffinity, you should explain what a storage vendor needs to do to support this feature.
Otherwise, some admin may try it manually and cause problems.

Author

The storage provider needs to offer online updates that affect the accessibility of the volume.
If admins want to utilize those online update capabilities, they should use this feature.

I expanded the "Try it out" section and hope this makes it clearer.

One mitigation under discussion is to have the kubelet fail Pod startup if the PV’s node affinity is violated.
This has not landed yet.
So if you are trying this out now, please watch subsequent Pods that use the updated PV,
and make sure they are scheduled onto nodes that can access the volume.
Contributor

So these are for storage vendors who are interested in having their drivers support this feature. I think all of these should be under a heading that clarifies who the intended audience is.

Author

This is intended for admins who are willing to try this feature, to inform them of the race condition. If someone tries to update the PV and then start new Pods in a script, it may not work as intended.
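For illustration, one way to do that check by hand (the Pod, PV, and node names here are placeholders):

```shell
# See which node the new Pod landed on
kubectl get pod my-app-0 -o wide

# Compare against the PV's updated node affinity and that node's labels
kubectl get pv my-pv -o jsonpath='{.spec.nodeAffinity}'
kubectl get node <node-name> --show-labels
```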


## Future Integration with CSI (Container Storage Interface)

Currently, it is up to the cluster administrator to modify both PV's node affinity and the underlying volume in the storage provider.
Contributor

What does a cluster admin need to do before making node affinity changes so that he/she won't run into problems?

Author

I think this should be explained in the "Try it out" section.


As noted earlier, this is only a first step.

If you are a Kubernetes user,
Contributor

What kind of user has access to PV node affinity? It should be cluster admin, not any user.

Author

Yes, currently. But after integration with CSI, unprivileged users should be able to trigger an update with VAC. So I'd like to hear from all users.
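For illustration, a sketch of how that could eventually look from the user's side; the VolumeAttributesClass name and the automatic node affinity update are assumptions about the future integration, not something that works today:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data                          # hypothetical PVC
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  # Switching to a hypothetical "gen2-disk" class would ask the CSI driver to
  # upgrade the volume; the driver would then keep PV node affinity in sync.
  volumeAttributesClassName: gen2-disk
```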

Contributor

@yuanwang04 yuanwang04 left a comment

Thanks for introducing this useful feature and the blog; overall LGTM, left some clarification comments.

It is fine to allow more nodes to access the volume by relaxing node affinity.
But there is a race condition when you try to tighten node affinity:
We don't know how scheduler will see our modified PV in its cache,
so there is a small window where the scheduler may place a Pod on an old node that can no longer access the volume.
Contributor

What would happen in that case? Would the PV fail to bind to the node / Pod?

This has not landed yet.
So if you are trying this out now, please watch subsequent Pods that use the updated PV,
and make sure they are scheduled onto nodes that can access the volume.
If you update PV then immediately start new Pods in a script, it may not work as intended.
Contributor

Is there an estimated time window to wait before Pod can be scheduled correctly?

dates back to Kubernetes v1.10.
It is widely used to express that volumes may not be equally accessible by all nodes in the cluster.
This field was previously immutable,
we are now making it mutable in Kubernetes v1.35 (alpha), Opening a door to more flexible online volume management.
Contributor

nit: avoid we

Suggested change
we are now making it mutable in Kubernetes v1.35 (alpha), Opening a door to more flexible online volume management.
and it is now mutable in Kubernetes v1.35 (alpha). This change opens a door to more flexible online volume management.

- available
```

So, we are making it mutable now, a first step towards a more flexible online volume management.
Contributor

nit: avoid we

Suggested change
So, we are making it mutable now, a first step towards a more flexible online volume management.
So, it is mutable now, a first step towards a more flexible online volume management.

There are only a few things out of Pod that can affects the scheduling decision. PV node affinity is one of them.
It is fine to allow more nodes to access the volume by relaxing node affinity.
But there is a race condition when you try to tighten node affinity:
We don't know how scheduler will see our modified PV in its cache,
Contributor

nit: avoid we

Suggested change
We don't know how scheduler will see our modified PV in its cache,
It is unclear how the Scheduler will see the modified PV in its cache,


Currently, it is up to the cluster administrator to modify both PV's node affinity and the underlying volume in the storage provider.
But manual operations are error-prone and time-consuming.
We would like to eventually integrate this with VolumeAttributesClass,
Contributor

nit: avoid we

Suggested change
We would like to eventually integrate this with VolumeAttributesClass,
It is preferred to eventually integrate this with VolumeAttributesClass,

But manual operations are error-prone and time-consuming.
We would like to eventually integrate this with VolumeAttributesClass,
so that an unprivileged user can modify their PersistentVolumeClaim (PVC) to trigger storage-side updates,
and PV node affinity is updated automatically when approprate, without the need for cluster admin's intervention.
Contributor

nit: typo

Suggested change
and PV node affinity is updated automatically when approprate, without the need for cluster admin's intervention.
and PV node affinity is updated automatically when appropriate, without the need for cluster admin's intervention.

Contributor

@Serenity611 Serenity611 left a comment

Reviewed for general proofreading, made a few suggestions for readability and alignment with the style guide :) Thanks!

draft: true
slug: kubernetes-v1-35-mutable-pv-nodeaffinity
author: >
Weiwen Hu (Alibaba Cloud)
Contributor

Suggested change
Weiwen Hu (Alibaba Cloud)
Weiwen Hu (Alibaba Cloud),

This field was previously immutable,
we are now making it mutable in Kubernetes v1.35 (alpha), Opening a door to more flexible online volume management.

## Why Making Node Affinity Mutable?
Contributor

Suggested change
## Why Making Node Affinity Mutable?
## Why make node affinity mutable?

Comment on lines +95 to +99
Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume.
So before using this feature,
You must update the underlying volume in the storage provider first,
and understand which nodes can access the volume after the update.
Then you can enable this feature and keep the PV node affinity in sync.
Contributor

Suggested change
Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume.
So before using this feature,
You must update the underlying volume in the storage provider first,
and understand which nodes can access the volume after the update.
Then you can enable this feature and keep the PV node affinity in sync.
Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume.
Before using this feature,
you must first update the underlying volume in the storage provider
and understand which nodes can access the volume after the update.
You can then enable this feature and keep the PV node affinity in sync.


Currently, this feature is in alpha state.
It is disabled by default, and may subject to change.
To try it out, enable `MutablePVNodeAffinity` feature gate on APIServer, then you can edit PV spec.nodeAffinity field.
Contributor

Suggested change
To try it out, enable `MutablePVNodeAffinity` feature gate on APIServer, then you can edit PV spec.nodeAffinity field.
To try it out, enable the `MutablePVNodeAffinity` feature gate on APIServer, then you can edit the PV `spec.nodeAffinity` field.
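For illustration, a minimal sketch of trying this out; the PV name is a placeholder, and how you pass API server flags depends on how your cluster is deployed:

```shell
# Enable the alpha feature gate on the API server, e.g.:
#   kube-apiserver --feature-gates=MutablePVNodeAffinity=true

# Check that you are allowed to update PVs
kubectl auth can-i update persistentvolumes

# Then edit spec.nodeAffinity on the PV
kubectl edit pv pv-example
```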

To try it out, enable `MutablePVNodeAffinity` feature gate on APIServer, then you can edit PV spec.nodeAffinity field.
Typically only administrators can edit PVs, please make sure you have the right RBAC permissions.

### Race Condition between Updating and Scheduling
Contributor

Suggested change
### Race Condition between Updating and Scheduling
### Race condition between updating and scheduling


### Race Condition between Updating and Scheduling

There are only a few things out of Pod that can affects the scheduling decision. PV node affinity is one of them.
Contributor

Suggested change
There are only a few things out of Pod that can affects the scheduling decision. PV node affinity is one of them.
There are only a few things out of Pod that can affect the scheduling decision, and PV node affinity is one of them.

Comment on lines +109 to +110
It is fine to allow more nodes to access the volume by relaxing node affinity.
But there is a race condition when you try to tighten node affinity:
Contributor

Suggested change
It is fine to allow more nodes to access the volume by relaxing node affinity.
But there is a race condition when you try to tighten node affinity:
It is fine to allow more nodes to access the volume by relaxing node affinity,
but there is a race condition when you try to tighten node affinity:

We don't know how scheduler will see our modified PV in its cache,
so there is a small window where the scheduler may place a Pod on an old node that can no longer access the volume.

One mitigation under discussion is to have the kubelet fail Pod startup if the PV’s node affinity is violated.
Contributor

Suggested change
One mitigation under discussion is to have the kubelet fail Pod startup if the PV’s node affinity is violated.
One mitigation under discussion is to have the `kubelet` fail Pod startup if the PV’s node affinity is violated.

and make sure they are scheduled onto nodes that can access the volume.
If you update PV then immediately start new Pods in a script, it may not work as intended.

## Future Integration with CSI (Container Storage Interface)
Contributor

Suggested change
## Future Integration with CSI (Container Storage Interface)
## Future integration with CSI (Container Storage Interface)

@gnufied
Member

gnufied commented Dec 3, 2025

/assign


Labels

  • area/blog — Issues or PRs related to the Kubernetes Blog subproject
  • cncf-cla: yes — Indicates the PR's author has signed the CNCF CLA.
  • language/en — Issues or PRs related to English language
  • size/L — Denotes a PR that changes 100-499 lines, ignoring generated files.
