DAOS-18674 control: Route check engine upcalls/downcalls by kjacque · Pull Request #17747 · daos-stack/daos

kjacque · 2026-03-20T20:10:43Z

This patch moves responsibility for correctly routing check engine reports and repair requests to the control plane, and away from the check leader. This should be transparent to the end user, but lays groundwork for better managing checker activities from the control plane.

Allow check engines to send CheckReports directly to their
local control plane via dRPC.
Forward the CheckReports to the management service over
gRPC for storage.
Add control service endpoint for CheckEngineRepair. This
endpoint sends a repair dRPC to the requested rank.
Update SystemCheckRepair to call CheckEngineRepair on the
control plane of the rank where the finding originated.

Features: recovery

github-actions · 2026-03-20T20:11:02Z

Ticket title is 'Enhance CHK upcall to handle MS leader switch'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-18674

daosbuild3 · 2026-03-23T00:30:47Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/1/execution/node/1385/log

daosbuild3 · 2026-03-23T01:54:41Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/1/execution/node/1354/log

daosbuild3 · 2026-03-28T00:32:00Z

Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17747/2/testReport/

daosbuild3 · 2026-03-28T00:38:00Z

Test stage Unit Test completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17747/2/testReport/

daosbuild3 · 2026-03-30T15:05:41Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/2/execution/node/1432/log

daosbuild3 · 2026-03-30T15:51:27Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/2/execution/node/1391/log

mjmac

Looks really good. Just a minor note about cleaning up dead code.

src/control/server/mgmt_drpc_checker.go

src/control/lib/control/check_test.go

daosbuild3 · 2026-03-31T08:46:46Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/4/execution/node/1410/log

daosbuild3 · 2026-03-31T23:20:21Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/5/execution/node/429/log

- Allow check engines to send CheckReports directly to their local control plane via dRPC. - Forward the CheckReports to the management service over gRPC for storage. Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

- Add new Protobuf definitions for per-engine repair. - Add control service endpoint for CheckEngineRepair. This endpoint sends a repair dRPC to the requested rank. - Update SystemCheckRepair to call CheckEngineRepair on the control plane of the rank where the finding originated. Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

- Remove unused struct member. - Fix unit test name. Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

kjacque · 2026-04-02T15:34:51Z

Had to rebase on parent patch and resolve conflicts. I left the commits basically the same, the conflict was a change in the copyright format.

daosbuild3 · 2026-04-04T05:22:38Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/6/execution/node/1353/log

Nasf-Fan · 2026-04-06T15:42:47Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/6/execution/node/1353/log

test_two_pools_corrupted failed as following:

2026-04-04 00:53:25,446 test L0471 INFO | ==> Step 10: Wait for checker to detect inconsistent container label for "dmg check start pool_1 pool_1." [elapsed since last step: 1.42s] 2026-04-04 00:53:25,446 command_utils_ba L0203 DEBUG| Updated param pool => None 2026-04-04 00:53:25,446 general_utils L0151 INFO | Command environment vars: {} 2026-04-04 00:53:25,447 process L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check query' 2026-04-04 00:53:25,681 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:25.681163 main.go:228: debug output enabled ... 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "84d301a4-ae0d-4d31-a237-bd97e7191255": { 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "uuid": "84d301a4-ae0d-4d31-a237-bd97e7191255", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "label": "", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "status": "CPS_CHECKING", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "phase": "CSP_POOL_MBS", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "rank_count": 2, 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "start_time": "2026-04-04T00:53:24.000+00:00", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "remaining": 9, 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "elapsed": 0 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] }, ... 2026-04-04 00:53:25,723 process L0416 DEBUG| [stdout] "reports": [ 2026-04-04 00:53:25,723 process L0416 DEBUG| [stdout] { 2026-04-04 00:53:25,723 process L0416 DEBUG| [stdout] "seq": 1331178504193, 2026-04-04 00:53:25,723 process L0416 DEBUG| [stdout] "class": 12, 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "action": 7, 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "rank": 1, 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "pool_uuid": "79fd2019-64b0-4436-9dc7-50c110886e8c", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "pool_label": "TestPool_3", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "cont_uuid": "5403a7ff-11a0-4069-9ca7-39d2be77d415", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "cont_label": "TestContainer_1", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "timestamp": "Sat Apr 4 00:51:17 2026", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "msg": "Check engine detects inconsistent container label: new-label (CS) vs TestContainer_1 (property).", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "act_msgs": [ 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "Update the CS label to use the container property value for 5403a7ff-11a0-4069-9ca7-39d2be77d415" 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] ] 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] } ... 2026-04-04 00:53:26,782 process L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check repair 1331178504193 2' 2026-04-04 00:53:27,016 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.016338 main.go:228: debug output enabled 2026-04-04 00:53:27,016 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.016840 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml 2026-04-04 00:53:27,022 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.022216 check.go:587: DAOS system check query request: *mgmt.CheckQueryReq (sys:"daos_server-2.9.100" seqs:1331178504193) 2026-04-04 00:53:27,022 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.022682 rpc.go:278: request hosts: [hdr-132:10001] 2026-04-04 00:53:27,049 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.049216 response.go:179: hdr-132:10001: *mgmt.CheckQueryResp (reports:{seq:1331178504193 class:CIC_CONT_BAD_LABEL action:CIA_TRUST_TARGET rank:1 pool_uuid:"79fd2019-64b0-4436-9dc7-50c110886e8c" pool_label:"TestPool_3" cont_uuid:"5403a7ff-11a0-4069-9ca7-39d2be77d415" cont_label:"TestContainer_1" timestamp:"Sat Apr 4 00:51:17 2026" msg:"Check engine detects inconsistent container label: new-label (CS) vs TestContainer_1 (property)." act_msgs:"Update the CS label to use the container property value for 5403a7ff-11a0-4069-9ca7-39d2be77d415"}) 2026-04-04 00:53:27,050 process L0416 DEBUG| [stdout] { 2026-04-04 00:53:27,050 process L0416 DEBUG| [stderr] ERROR: dmg: finding 0x135f06c0001 is already resolved: TRUST_TARGET: Update the CS label to use the container property value for 5403a7ff-11a0-4069-9ca7-39d2be77d415 2026-04-04 00:53:27,050 process L0416 DEBUG| [stdout] "response": null, 2026-04-04 00:53:27,050 process L0416 DEBUG| [stdout] "error": "finding 0x135f06c0001 is already resolved: TRUST_TARGET: Update the CS label to use the container property value for 5403a7ff-11a0-4069-9ca7-39d2be77d415", 2026-04-04 00:53:27,051 process L0416 DEBUG| [stdout] "status": -1025 2026-04-04 00:53:27,051 process L0416 DEBUG| [stdout] } 2026-04-04 00:53:27,052 process L0685 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check repair 1331178504193 2' finished with 1 after 0.2686338424682617s 2026-04-04 00:53:27,079 general_utils L0176 INFO | Error occurred running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check repair 1331178504193 2': Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check repair 1331178504193 2' failed. ...

Please note that the CHK report 1331178504193 belong to former test case. But it was returned by current CHK query. That misguided the test logic.

@kjacque , should we remove such staled CHK report (with seq 1331178504193)? since related pool has already been destroyed.

kjacque · 2026-04-06T16:37:45Z

Please note that the CHK report 1331178504193 belong to former test case. But it was returned by current CHK query. That misguided the test logic.

should we remove such staled CHK report (with seq 1331178504193)? since related pool has already been destroyed.

Agreed, reports for any destroyed pools should be removed. Let me look deeper into this today. I did a quick Jira search and only saw a couple open issues, so not sure if this is introduced by this PR.

kjacque · 2026-04-06T17:38:05Z

@Nasf-Fan This is an interesting corner case. I think it didn't get cleared because the test only starts the checker on specific pools, so the control plane left alone old findings.

Right now we don't do anything to the finding DB if a pool (or container) is destroyed. Those operations only happen outside checker mode. I wonder if we generally should clear old findings when checker mode is disabled. Is there a reason we'd want to keep old findings?

Edit: Found another ticket that seems like a similar phenomenon, looks like the solution at that point was to update the test. https://daosio.atlassian.net/browse/DAOS-18481

Nasf-Fan · 2026-04-07T02:49:14Z

@Nasf-Fan This is an interesting corner case. I think it didn't get cleared because the test only starts the checker on specific pools, so the control plane left alone old findings.

Right now we don't do anything to the finding DB if a pool (or container) is destroyed. Those operations only happen outside checker mode. I wonder if we generally should clear old findings when checker mode is disabled. Is there a reason we'd want to keep old findings?

Edit: Found another ticket that seems like a similar phenomenon, looks like the solution at that point was to update the test. https://daosio.atlassian.net/browse/DAOS-18481

I do not think we should operate CHK findings DB when destroy the pool/container under regular mode, instead, all related trouble should and can be resolved under check mode. For example, we can introduce new parameter for dmg check query command, such as -f|--full. If it is specified, then all the history findings (belong to the required pools if specified) will be shown; otherwise (by default), only the findings belong to current check instance (or the latest check instance if current instance has completed) will be shown. On the other hand, we can add new CHK dmg command to allow the user to discard specified (stale) findings from DB if he/she want. But of course, these can be done in another dedicated patch.

How do you think? @kjacque

kjacque · 2026-04-07T23:57:27Z

I do not think we should operate CHK findings DB when destroy the pool/container under regular mode, instead, all related trouble should and can be resolved under check mode. For example, we can introduce new parameter for dmg check query command, such as -f|--full. If it is specified, then all the history findings (belong to the required pools if specified) will be shown; otherwise (by default), only the findings belong to current check instance (or the latest check instance if current instance has completed) will be shown. On the other hand, we can add new CHK dmg command to allow the user to discard specified (stale) findings from DB if he/she want. But of course, these can be done in another dedicated patch.

How do you think? @kjacque

I agree we should probably not be touching the findings in normal mode. If we want to track the findings according to check instance, we will probably need some kind of sequence number to label the check instance. Not sure if it would need to be tracked at the engine level--maybe the control plane could track this on its own.

Earlier today in triage we saw a ticket for the same issue seen on master, so we can continue the discussion there. Since it seems this patch is otherwise OK, and @mjmac approved, shall we merge it into your branch? @Nasf-Fan

Nasf-Fan · 2026-04-08T07:20:45Z

I do not think we should operate CHK findings DB when destroy the pool/container under regular mode, instead, all related trouble should and can be resolved under check mode. For example, we can introduce new parameter for dmg check query command, such as -f|--full. If it is specified, then all the history findings (belong to the required pools if specified) will be shown; otherwise (by default), only the findings belong to current check instance (or the latest check instance if current instance has completed) will be shown. On the other hand, we can add new CHK dmg command to allow the user to discard specified (stale) findings from DB if he/she want. But of course, these can be done in another dedicated patch.
How do you think? @kjacque

I agree we should probably not be touching the findings in normal mode. If we want to track the findings according to check instance, we will probably need some kind of sequence number to label the check instance. Not sure if it would need to be tracked at the engine level--maybe the control plane could track this on its own.

Earlier today in triage we saw a ticket for the same issue seen on master, so we can continue the discussion there. Since it seems this patch is otherwise OK, and @mjmac approved, shall we merge it into your branch? @Nasf-Fan

Yes, we can handle dmg check query related issues in DAOS-18773.

I am not familiar with control plane implementation. @tanabarr , would you please to help another review for the huge patch before merging the patch? Thanks!

tanabarr

src/engine/drpc_ras.c formatting-only changes should be reverted
findRankIndex could be consolidated with other similar code

tanabarr · 2026-04-03T09:25:25Z

src/control/lib/control/check.go

+}
+
+// SystemCheckEngineReportResp contains the response from the SystemCheckEngineReport RPC.
+type SystemCheckEngineReportResp struct {


what's the benefit of aliasing CheckReportResp rather than using directly?

Consistency in the naming mostly, but it also gives us some flexibility with implementation in the future without altering the API types. With a lot of the checker stuff we just wrap the protobuf structures, rather than copying the structures (which is a more typical approach elsewhere in the control lib).

src/control/server/ctl_check.go

src/control/server/mgmt_check.go

tanabarr · 2026-04-08T19:07:51Z

src/proto/ctl/check.proto

+
+option  go_package = "github.com/daos-stack/daos/src/control/common/proto/ctl";
+
+import "mgmt/check.proto";


should the base check types be declared in shared if they need to be used in both ctl and mgmt packages?

I thought this initially too, but it ended up being a bit painful and ugly at the engine level, figuring out which files to export where. CheckActReq is needed in src/mgmt, while other types in the shared namespace wind up in src/engine... Seemed less confusing to leave the cross-module import. If you feel strongly I can change it.

Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

kjacque self-assigned this Mar 20, 2026

kjacque requested a review from Nasf-Fan March 20, 2026 20:10

kjacque force-pushed the kjacque/DAOS-18674_1/control branch from 0d26b8d to 8228c7b Compare March 27, 2026 23:47

kjacque changed the title ~~DAOS-18674 control: Forward engine check reports to MS~~ DAOS-18674 control: Route check engine upcalls/downcalls Mar 27, 2026

kjacque marked this pull request as ready for review March 28, 2026 00:21

kjacque requested review from a team as code owners March 28, 2026 00:21

kjacque requested review from mjmac and tanabarr March 28, 2026 00:21

mjmac approved these changes Mar 30, 2026

View reviewed changes

src/control/server/mgmt_drpc_checker.go Show resolved Hide resolved

kjacque commented Mar 30, 2026

View reviewed changes

src/control/lib/control/check_test.go Outdated Show resolved Hide resolved

kjacque requested a review from mjmac March 31, 2026 00:01

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18674_1 branch 4 times, most recently from 8f7d9a1 to 832d488 Compare April 2, 2026 14:28

kjacque added 3 commits April 2, 2026 15:28

Fix gRPC auth unit test

bafc0f9

Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

Address review comments

df41152

- Remove unused struct member. - Fix unit test name. Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

kjacque force-pushed the kjacque/DAOS-18674_1/control branch from e42a1d8 to df41152 Compare April 2, 2026 15:33

mjmac approved these changes Apr 7, 2026

View reviewed changes

tanabarr approved these changes Apr 8, 2026

View reviewed changes

kjacque added 2 commits April 11, 2026 00:41

Address review comments

d1eb67a

Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

Revert irrelevant formatting changes

38ad534

Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

kjacque requested review from mjmac and tanabarr April 11, 2026 00:55


		option go_package = "github.com/daos-stack/daos/src/control/common/proto/ctl";

		import "mgmt/check.proto";

Conversation

kjacque commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 20, 2026

Uh oh!

daosbuild3 commented Mar 23, 2026

Uh oh!

daosbuild3 commented Mar 23, 2026

Uh oh!

daosbuild3 commented Mar 28, 2026

Uh oh!

daosbuild3 commented Mar 28, 2026

Uh oh!

daosbuild3 commented Mar 30, 2026

Uh oh!

daosbuild3 commented Mar 30, 2026

Uh oh!

mjmac left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

daosbuild3 commented Mar 31, 2026

Uh oh!

daosbuild3 commented Mar 31, 2026

Uh oh!

kjacque commented Apr 2, 2026

Uh oh!

daosbuild3 commented Apr 4, 2026

Uh oh!

Nasf-Fan commented Apr 6, 2026

Uh oh!

kjacque commented Apr 6, 2026

Uh oh!

kjacque commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Nasf-Fan commented Apr 7, 2026

Uh oh!

kjacque commented Apr 7, 2026

Uh oh!

Nasf-Fan commented Apr 8, 2026

Uh oh!

tanabarr left a comment

Choose a reason for hiding this comment

Uh oh!

tanabarr Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

kjacque Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tanabarr Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

kjacque Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants

kjacque commented Mar 20, 2026 •

edited

Loading

kjacque commented Apr 6, 2026 •

edited

Loading