Extend CacheRuntime phase 2.3: remount when dataset mount changed or cache runtime master restarted#5834
xliuqq wants to merge 3 commits into
[APPROVALNOTIFIER] This PR is NOT APPROVED
Code Review
This pull request implements dynamic UFS mount updates for the generic cache runtime by executing a MountUFS script and parsing its JSON output to synchronize dataset status. The changes include API updates to CacheRuntimeStatus, new documentation for integration requirements, and the core logic in the engine to handle UFS changes and remounting. Review feedback identifies a high-severity issue where missing updates to dataset.Status.Mounts could cause infinite reconciliation loops. Other suggestions focus on making JSON parsing more robust, improving logging conciseness by avoiding full object serialization, and fixing typos in API documentation.
Pull request overview
This PR extends the generic CacheRuntime “MountUFS” integration so the cache runtime can remount dynamically when Dataset.spec.mounts changes or when the master pod restarts, using a MountUFS script whose stdout is parsed as JSON to sync mount state.
Changes:
- Add CacheEngine UFS change detection + remount-on-master-restart logic, driven from `CacheEngine.Sync()`.
- Introduce/propagate `status.mountTime` and related status/CRD updates to track the last successful mount time.
- Update Curvine e2e MountUFS script + docs to require MountUFS stdout to be JSON matching `CacheRuntimeMountUfsOutput`, and add Sync unit tests.
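
The MountUFS stdout contract described in this overview could be handled roughly as in the Go sketch below. It is an illustration only, not the PR's implementation: `mountUFSOutput` and `parseMountUFSOutput` are hypothetical names, only the `mounted` key is taken from the updated Curvine script, and treating its elements as path strings is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// mountUFSOutput is a hypothetical stand-in for CacheRuntimeMountUfsOutput.
// Only the "mounted" key comes from the PR; the []string element type is assumed.
type mountUFSOutput struct {
	Mounted []string `json:"mounted"`
}

// parseMountUFSOutput decodes the script's stdout and returns the reported
// mount paths as a set. Any non-JSON noise on stdout is treated as an error,
// which is why the script has to keep its diagnostics on stderr.
func parseMountUFSOutput(stdout string) (map[string]bool, error) {
	trimmed := strings.TrimSpace(stdout)
	if trimmed == "" {
		return nil, fmt.Errorf("MountUFS produced no stdout; expected a JSON object")
	}
	var out mountUFSOutput
	if err := json.Unmarshal([]byte(trimmed), &out); err != nil {
		return nil, fmt.Errorf("MountUFS stdout is not valid JSON: %w", err)
	}
	mounted := make(map[string]bool, len(out.Mounted))
	for _, p := range out.Mounted {
		mounted[p] = true
	}
	return mounted, nil
}
```

Calling `parseMountUFSOutput` on `{"mounted":["/a","/b"]}` yields a two-entry set, while stdout that mixes log lines with the JSON object is rejected, which matches the requirement that the script stay quiet on stdout.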
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| test/gha-e2e/curvine/mount.yaml | Adjust MountUFS script to be quiet on stdout and emit {"mounted":[...]} JSON for Fluid parsing. |
| pkg/ddc/cache/engine/ufs.go | Implement UFS delta detection, remount checks on master restart, MountUFS stdout JSON parsing, and dataset mount sync. |
| pkg/ddc/cache/engine/sync.go | Invoke UFS-change handling as part of CacheEngine Sync flow. |
| pkg/ddc/cache/engine/sync_test.go | Add Ginkgo unit tests covering Sync/configmap generation scenarios (currently limited coverage of new UFS path). |
| pkg/ddc/cache/engine/setup.go | Update Setup to call new PrepareUFS(runtimeClass) signature. |
| pkg/ddc/cache/engine/runtime.go | Add updateMountTime() helper to persist last mount timestamp in CacheRuntime status. |
| pkg/ddc/cache/engine/master.go | Change master pod/container resolution to derive container name from CacheRuntimeClass template. |
| pkg/ddc/cache/engine/fileutils.go | Return Mount command stdout so MountUFS output can be parsed by controller logic. |
| pkg/ddc/cache/engine/dataset.go | Update runtime MountTime when binding dataset. |
| docs/zh/dev/generic_cache_runtime_integration.md | Document MountUFS JSON stdout contract and usage scenarios (Step 2.7). |
| docs/en/dev/generic_cache_runtime_integration.md | English version of MountUFS JSON stdout contract and guidance (Step 2.7). |
| config/crd/bases/data.fluid.io_cacheruntimes.yaml | CRD schema updates for status.mountTime + status.mounts shape. |
| charts/fluid/fluid/crds/data.fluid.io_cacheruntimes.yaml | Helm CRD schema mirror of the same status changes. |
| api/v1alpha1/zz_generated.deepcopy.go | Deepcopy updates for new/changed status fields and MountUFS output type. |
| api/v1alpha1/status.go | Update RuntimeStatus comments around mounts (minor doc tweak). |
| api/v1alpha1/common.go | Add CacheRuntimeMountUfsOutput type definition. |
| api/v1alpha1/cacheruntimeclass_types.go | Update MountUFS comment to indicate JSON output requirement. |
| api/v1alpha1/cacheruntime_status.go | Replace older mount status struct with Mounts + MountTime fields on CacheRuntime status. |
    [ -z "$mountPoint" ] && { echo "mountPoint is not set or empty" >&2; exit 1; }
    [ -z "$path" ] && { echo "path is not set or empty" >&2; exit 1; }
    err := engine.Sync(ctx)
    // Sync may fail at UpdateOnUFSChange due to missing master pod, but ConfigMap should be created

    // Verify ConfigMap was created
    configMap := &corev1.ConfigMap{}
    err = fakeClient.Get(context.Background(), types.NamespacedName{
Codecov Report

Coverage diff for #5834 against master:

| | master | #5834 | +/- |
|---|---|---|---|
| Coverage | 61.65% | 62.55% | +0.90% |
| Files | 480 | 480 | |
| Lines | 32613 | 32773 | +160 |
| Hits | 20108 | 20502 | +394 |
| Misses | 10897 | 10635 | -262 |
| Partials | 1608 | 1636 | +28 |
/gemini review
Code Review
This pull request introduces support for dynamic UFS mount updates and status synchronization within the generic cache runtime. Key changes include refactoring CacheRuntimeStatus to simplify mount tracking, implementing logic to detect dataset mount changes or master pod restarts, and requiring mount scripts to output status in a specific JSON format (CacheRuntimeMountUfsOutput). Comprehensive unit tests and documentation updates were also provided. Review feedback highlights opportunities to optimize performance by passing the CacheRuntime object through the call stack to avoid redundant API server requests in the UpdateOnUFSChange, shouldUpdateUFS, and checkIfRemountRequired methods.
Pull request overview
Copilot reviewed 17 out of 19 changed files in this pull request and generated 9 comments.
Files not reviewed (2)
- api/v1alpha1/openapi_generated.go: Language not supported
- api/v1alpha1/zz_generated.deepcopy.go: Language not supported
    // handle ufs change - support dynamic mount updates
    err = e.updateOnUFSChange(runtime)
    if err != nil {
        e.Log.Error(err, "Failed to update UFS")
        return err
    }
Review of PR #5834 — CacheRuntime phase 2.3: dynamic remount on dataset mount change or master restart. Design direction is sound and the core chain (shouldUpdateUFS → PrepareUFS → SyncDatasetMounts) follows existing Fluid patterns. Two blocking issues need fixing before merge:
Non-blocking: redundant runtimeClass fetch, checkIfRemountRequired error handling, config map timing, comment fixes. Test coverage for the new UFS path is critically low (4.67% patch coverage); strongly recommend adding targeted unit tests before merge.
    }

    // 2. set update status to updating
    err = utils.UpdateMountStatus(e.Client, e.name, e.namespace, datav1alpha1.UpdatingDatasetPhase)
When updateOnUFSChange sets the dataset phase to Updating, the UpdateMountStatus Updating path does NOT update dataset.Status.Mounts (only phase + condition). If the subsequent MountUFS execution or SyncDatasetMounts fails, the reconciler will re-enter Sync → updateOnUFSChange → shouldUpdateUFS. Since AnalyzePathsDelta compares spec.Mounts vs status.Mounts and the status.Mounts remain stale, shouldUpdateUFS will always return true, causing an infinite reconcile loop with repeated MountUFS invocations.
Consider adding a failure-path phase reset (e.g., revert to Bound on error) or a backoff/retry counter to break the cycle when the dataset gets stuck in Updating.
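
One way to read this suggestion is sketched below in Go. It combines both ideas, a phase reset on failure and a retry counter, over a plain in-memory struct. Everything here is hypothetical (`datasetState`, `remount`, `errRetryBudgetExhausted`, the `maxFailures` parameter); the real controller would persist these fields through the Kubernetes API rather than mutate a struct.

```go
package main

import (
	"errors"
	"fmt"
)

type phase string

const (
	phaseBound    phase = "Bound"
	phaseUpdating phase = "Updating"
)

// errRetryBudgetExhausted is a hypothetical sentinel returned once the
// retry counter reaches its limit, so callers can stop invoking MountUFS.
var errRetryBudgetExhausted = errors.New("remount retry budget exhausted")

// datasetState is a hypothetical in-memory stand-in for the dataset status.
type datasetState struct {
	Phase        phase
	StatusMounts []string
	Failures     int
}

// remount moves the dataset to Updating, runs the mount function, and always
// leaves the dataset in Bound afterwards. Status mounts are only synced on
// success; failures are counted so the loop can be broken after maxFailures.
func remount(ds *datasetState, specMounts []string, maxFailures int, mount func([]string) error) error {
	if ds.Failures >= maxFailures {
		return errRetryBudgetExhausted
	}
	ds.Phase = phaseUpdating
	if err := mount(specMounts); err != nil {
		ds.Failures++
		ds.Phase = phaseBound // failure path: do not leave the dataset stuck in Updating
		return fmt.Errorf("remount attempt %d failed: %w", ds.Failures, err)
	}
	ds.StatusMounts = append([]string(nil), specMounts...)
	ds.Failures = 0
	ds.Phase = phaseBound
	return nil
}
```

With `maxFailures` set to 2, two failing calls each revert the phase to Bound, and a third call returns `errRetryBudgetExhausted` without invoking the mount function again.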
        }

        return restart
    }
updateOnUFSChange fetches runtimeClass via e.getRuntimeClass even though it could be passed in from the caller. Passing runtimeClass as a parameter would eliminate the redundant API call and reduce potential race conditions between multiple fetches of the same object.
Reviewed the latest state.
Blocker: The PR currently has merge conflicts that need resolution before merge.
Non-blocking observations:
Overall the code quality and test coverage have improved significantly since the initial submission. Once conflicts are resolved, this looks close to ready.
    datasetMountPath := utils.UFSPathBuilder{}.GenUFSPathInUnifiedNamespace(mount)
    if !mountedPaths[datasetMountPath] {
        e.Log.Info("Waiting for mount point to be mounted", "Mount point", datasetMountPath)
        return fmt.Errorf("mount point %s is not yet mounted", datasetMountPath)
When the mount script runs successfully but the reported mounted paths don't match expectations, the error "mount point %s is not yet mounted" will cause the reconciler to requeue and retry. This is fine for transient issues — but if the mismatch is persistent (e.g., the script has a bug and always returns wrong paths), the user has no clear signal other than the dataset sitting in Updating phase.
Consider emitting a Kubernetes Event when syncDatasetMounts fails repeatedly, so users can quickly identify whether the issue is their mount script vs a transient problem. Not blocking — the retry-via-reconcile behavior is correct.
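
A rough Go sketch of that idea follows, assuming a failure threshold and a minimal event interface. `eventSink`, `memorySink`, `mountChecker`, and the `MountNotReported` reason are all hypothetical; a real controller would presumably route the warning through its existing event recorder instead.

```go
package main

import "fmt"

// eventSink is a hypothetical, minimal stand-in for an event recorder.
type eventSink interface {
	Warn(reason, message string)
}

// memorySink collects warnings in memory so the sketch can be exercised
// without a cluster.
type memorySink struct {
	messages []string
}

func (s *memorySink) Warn(reason, message string) {
	s.messages = append(s.messages, reason+": "+message)
}

// mountChecker keeps the retry-via-reconcile behaviour but emits one warning
// when the same mismatch has been seen threshold times in a row.
type mountChecker struct {
	failures  int
	threshold int
	events    eventSink
}

// check returns an error for the first expected path the script did not
// report as mounted, and resets the failure streak once all paths are present.
func (c *mountChecker) check(expected []string, mounted map[string]bool) error {
	for _, p := range expected {
		if mounted[p] {
			continue
		}
		c.failures++
		if c.failures == c.threshold {
			c.events.Warn("MountNotReported",
				fmt.Sprintf("mount point %s still missing after %d checks; verify the MountUFS script output", p, c.failures))
		}
		return fmt.Errorf("mount point %s is not yet mounted", p)
	}
	c.failures = 0
	return nil
}
```

Emitting only when the streak equals the threshold keeps a persistently broken script from flooding the event stream while still giving users one clear signal.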
Added an event for mount command execution failure.
    func (e *CacheEngine) checkIfRemountRequired(runtimeClass *datav1alpha1.CacheRuntimeClass, runtime *datav1alpha1.CacheRuntime) bool {
        masterPodName, masterContainerName, err := e.getMasterPodInfo(runtimeClass)
        if err != nil {
            e.Log.Error(err, "get runtime pod container name failed", "method", "checkIfRemountRequired", "runtimeClass name", e.name)
If getMasterPodInfo or GetPodByName fails due to a transient network error (not NotFound), the function logs and returns false, meaning no remount is triggered. This is a conservative/safe approach, but it could delay detecting a legitimate master restart if the API is briefly unavailable.
Minor suggestion: consider distinguishing NotFound (pod genuinely gone — skip) from other errors (could requeue and check again later). Not blocking since the next reconcile will retry anyway.
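
The distinction could look roughly like the Go sketch below. It uses standard-library sentinel errors so it runs on its own; in a real controller the NotFound test would more likely be `apierrors.IsNotFound` from `k8s.io/apimachinery/pkg/api/errors`. The names `remountNeeded`, `errPodNotFound`, and `errAPIUnavailable` are hypothetical, and comparing the master start time with the last mount time as Unix seconds is an assumption about how a restart is detected.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for Kubernetes API errors.
var (
	errPodNotFound    = errors.New("pod not found")
	errAPIUnavailable = errors.New("api server unavailable")
)

// remountNeeded reports whether a remount should be triggered.
//   - NotFound: the master pod is genuinely gone, so skip without an error.
//   - Any other lookup error: return it so the caller can requeue.
//   - Otherwise: remount if the master started after the last mount
//     (assumed rule; both values are Unix seconds).
func remountNeeded(masterStart func() (int64, error), lastMount int64) (bool, error) {
	start, err := masterStart()
	switch {
	case errors.Is(err, errPodNotFound):
		return false, nil
	case err != nil:
		return false, fmt.Errorf("look up master pod: %w", err)
	}
	return start > lastMount, nil
}
```

Returning the transient error instead of swallowing it lets the reconciler requeue explicitly, at the cost of changing the function's signature from a plain `bool`.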
Signed-off-by: xliuqq <xlzq1992@gmail.com>
update openapi and crd
Signed-off-by: xliuqq <xlzq1992@gmail.com>
Add script updates and documentation
Signed-off-by: xliuqq <xlzq1992@gmail.com>
fix test
Signed-off-by: xliuqq <xlzq1992@gmail.com>
add test fix tests and update openapi
Signed-off-by: xliuqq <xlzq1992@gmail.com>
add more test and fix suggestions
Signed-off-by: xliuqq <xlzq1992@gmail.com>
Signed-off-by: xliuqq <xlzq1992@gmail.com>
Ⅰ. Describe what this PR does
remount when dataset mount changed or cache runtime master restarted
Ⅱ. Does this pull request fix one issue?
part of #5412
Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
unit test for `Sync` method.
Ⅳ. Describe how to verify it
Ⅴ. Special notes for reviews