Skip to content

DAOS-18674 control: Route check engine upcalls/downcalls#17747

Open
kjacque wants to merge 6 commits intoNasf-Fan/DAOS-18674_1from
kjacque/DAOS-18674_1/control
Open

DAOS-18674 control: Route check engine upcalls/downcalls#17747
kjacque wants to merge 6 commits intoNasf-Fan/DAOS-18674_1from
kjacque/DAOS-18674_1/control

Conversation

@kjacque
Copy link
Copy Markdown
Contributor

@kjacque kjacque commented Mar 20, 2026

This patch moves responsibility for correctly routing check engine reports and repair requests to the control plane, and away from the check leader. This should be transparent to the end user, but lays groundwork for better managing checker activities from the control plane.

  • Allow check engines to send CheckReports directly to their
    local control plane via dRPC.
  • Forward the CheckReports to the management service over
    gRPC for storage.
  • Add control service endpoint for CheckEngineRepair. This
    endpoint sends a repair dRPC to the requested rank.
  • Update SystemCheckRepair to call CheckEngineRepair on the
    control plane of the rank where the finding originated.

Features: recovery

@kjacque kjacque self-assigned this Mar 20, 2026
@kjacque kjacque requested a review from Nasf-Fan March 20, 2026 20:10
@github-actions
Copy link
Copy Markdown

Ticket title is 'Enhance CHK upcall to handle MS leader switch'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-18674

@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/1/execution/node/1385/log

@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/1/execution/node/1354/log

@kjacque kjacque force-pushed the kjacque/DAOS-18674_1/control branch from 0d26b8d to 8228c7b Compare March 27, 2026 23:47
@kjacque kjacque changed the title DAOS-18674 control: Forward engine check reports to MS DAOS-18674 control: Route check engine upcalls/downcalls Mar 27, 2026
@kjacque kjacque marked this pull request as ready for review March 28, 2026 00:21
@kjacque kjacque requested review from a team as code owners March 28, 2026 00:21
@kjacque kjacque requested review from mjmac and tanabarr March 28, 2026 00:21
@daosbuild3
Copy link
Copy Markdown
Collaborator

@daosbuild3
Copy link
Copy Markdown
Collaborator

@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/2/execution/node/1432/log

@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/2/execution/node/1391/log

Copy link
Copy Markdown
Contributor

@mjmac mjmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. Just a minor note about cleaning up dead code.

@kjacque kjacque requested a review from mjmac March 31, 2026 00:01
@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/4/execution/node/1410/log

@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/5/execution/node/429/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18674_1 branch 4 times, most recently from 8f7d9a1 to 832d488 Compare April 2, 2026 14:28
kjacque added 3 commits April 2, 2026 15:28
- Allow check engines to send CheckReports directly to their
  local control plane via dRPC.
- Forward the CheckReports to the management service over
  gRPC for storage.

Features: recovery

Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
- Add new Protobuf definitions for per-engine repair.
- Add control service endpoint for CheckEngineRepair. This
  endpoint sends a repair dRPC to the requested rank.
- Update SystemCheckRepair to call CheckEngineRepair on the
  control plane of the rank where the finding originated.

Features: recovery

Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
Features: recovery

Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
- Remove unused struct member.
- Fix unit test name.

Features: recovery

Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
@kjacque kjacque force-pushed the kjacque/DAOS-18674_1/control branch from e42a1d8 to df41152 Compare April 2, 2026 15:33
@kjacque
Copy link
Copy Markdown
Contributor Author

kjacque commented Apr 2, 2026

Had to rebase on parent patch and resolve conflicts. I left the commits basically the same, the conflict was a change in the copyright format.

@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/6/execution/node/1353/log

@Nasf-Fan
Copy link
Copy Markdown
Contributor

Nasf-Fan commented Apr 6, 2026

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/6/execution/node/1353/log

test_two_pools_corrupted failed as following:

2026-04-04 00:53:25,446 test L0471 INFO | ==> Step 10: Wait for checker to detect inconsistent container label for "dmg check start pool_1 pool_1." [elapsed since last step: 1.42s] 2026-04-04 00:53:25,446 command_utils_ba L0203 DEBUG| Updated param pool => None 2026-04-04 00:53:25,446 general_utils L0151 INFO | Command environment vars: {} 2026-04-04 00:53:25,447 process L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check query' 2026-04-04 00:53:25,681 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:25.681163 main.go:228: debug output enabled ... 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "84d301a4-ae0d-4d31-a237-bd97e7191255": { 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "uuid": "84d301a4-ae0d-4d31-a237-bd97e7191255", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "label": "", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "status": "CPS_CHECKING", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "phase": "CSP_POOL_MBS", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "rank_count": 2, 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "start_time": "2026-04-04T00:53:24.000+00:00", 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "remaining": 9, 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] "elapsed": 0 2026-04-04 00:53:25,722 process L0416 DEBUG| [stdout] }, ... 2026-04-04 00:53:25,723 process L0416 DEBUG| [stdout] "reports": [ 2026-04-04 00:53:25,723 process L0416 DEBUG| [stdout] { 2026-04-04 00:53:25,723 process L0416 DEBUG| [stdout] "seq": 1331178504193, 2026-04-04 00:53:25,723 process L0416 DEBUG| [stdout] "class": 12, 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "action": 7, 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "rank": 1, 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "pool_uuid": "79fd2019-64b0-4436-9dc7-50c110886e8c", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "pool_label": "TestPool_3", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "cont_uuid": "5403a7ff-11a0-4069-9ca7-39d2be77d415", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "cont_label": "TestContainer_1", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "timestamp": "Sat Apr 4 00:51:17 2026", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "msg": "Check engine detects inconsistent container label: new-label (CS) vs TestContainer_1 (property).", 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "act_msgs": [ 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] "Update the CS label to use the container property value for 5403a7ff-11a0-4069-9ca7-39d2be77d415" 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] ] 2026-04-04 00:53:25,724 process L0416 DEBUG| [stdout] } ... 2026-04-04 00:53:26,782 process L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check repair 1331178504193 2' 2026-04-04 00:53:27,016 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.016338 main.go:228: debug output enabled 2026-04-04 00:53:27,016 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.016840 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml 2026-04-04 00:53:27,022 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.022216 check.go:587: DAOS system check query request: *mgmt.CheckQueryReq (sys:"daos_server-2.9.100" seqs:1331178504193) 2026-04-04 00:53:27,022 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.022682 rpc.go:278: request hosts: [hdr-132:10001] 2026-04-04 00:53:27,049 process L0416 DEBUG| [stderr] DEBUG 2026/04/04 00:53:27.049216 response.go:179: hdr-132:10001: *mgmt.CheckQueryResp (reports:{seq:1331178504193 class:CIC_CONT_BAD_LABEL action:CIA_TRUST_TARGET rank:1 pool_uuid:"79fd2019-64b0-4436-9dc7-50c110886e8c" pool_label:"TestPool_3" cont_uuid:"5403a7ff-11a0-4069-9ca7-39d2be77d415" cont_label:"TestContainer_1" timestamp:"Sat Apr 4 00:51:17 2026" msg:"Check engine detects inconsistent container label: new-label (CS) vs TestContainer_1 (property)." act_msgs:"Update the CS label to use the container property value for 5403a7ff-11a0-4069-9ca7-39d2be77d415"}) 2026-04-04 00:53:27,050 process L0416 DEBUG| [stdout] { 2026-04-04 00:53:27,050 process L0416 DEBUG| [stderr] ERROR: dmg: finding 0x135f06c0001 is already resolved: TRUST_TARGET: Update the CS label to use the container property value for 5403a7ff-11a0-4069-9ca7-39d2be77d415 2026-04-04 00:53:27,050 process L0416 DEBUG| [stdout] "response": null, 2026-04-04 00:53:27,050 process L0416 DEBUG| [stdout] "error": "finding 0x135f06c0001 is already resolved: TRUST_TARGET: Update the CS label to use the container property value for 5403a7ff-11a0-4069-9ca7-39d2be77d415", 2026-04-04 00:53:27,051 process L0416 DEBUG| [stdout] "status": -1025 2026-04-04 00:53:27,051 process L0416 DEBUG| [stdout] } 2026-04-04 00:53:27,052 process L0685 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check repair 1331178504193 2' finished with 1 after 0.2686338424682617s 2026-04-04 00:53:27,079 general_utils L0176 INFO | Error occurred running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check repair 1331178504193 2': Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j check repair 1331178504193 2' failed. ...

Please note that the CHK report 1331178504193 belong to former test case. But it was returned by current CHK query. That misguided the test logic.

@kjacque , should we remove such staled CHK report (with seq 1331178504193)? since related pool has already been destroyed.

@kjacque
Copy link
Copy Markdown
Contributor Author

kjacque commented Apr 6, 2026

Please note that the CHK report 1331178504193 belong to former test case. But it was returned by current CHK query. That misguided the test logic.

should we remove such staled CHK report (with seq 1331178504193)? since related pool has already been destroyed.

Agreed, reports for any destroyed pools should be removed. Let me look deeper into this today. I did a quick Jira search and only saw a couple open issues, so not sure if this is introduced by this PR.

@kjacque
Copy link
Copy Markdown
Contributor Author

kjacque commented Apr 6, 2026

@Nasf-Fan This is an interesting corner case. I think it didn't get cleared because the test only starts the checker on specific pools, so the control plane left alone old findings.

Right now we don't do anything to the finding DB if a pool (or container) is destroyed. Those operations only happen outside checker mode. I wonder if we generally should clear old findings when checker mode is disabled. Is there a reason we'd want to keep old findings?

Edit: Found another ticket that seems like a similar phenomenon, looks like the solution at that point was to update the test. https://daosio.atlassian.net/browse/DAOS-18481

@Nasf-Fan
Copy link
Copy Markdown
Contributor

Nasf-Fan commented Apr 7, 2026

@Nasf-Fan This is an interesting corner case. I think it didn't get cleared because the test only starts the checker on specific pools, so the control plane left alone old findings.

Right now we don't do anything to the finding DB if a pool (or container) is destroyed. Those operations only happen outside checker mode. I wonder if we generally should clear old findings when checker mode is disabled. Is there a reason we'd want to keep old findings?

Edit: Found another ticket that seems like a similar phenomenon, looks like the solution at that point was to update the test. https://daosio.atlassian.net/browse/DAOS-18481

I do not think we should operate CHK findings DB when destroy the pool/container under regular mode, instead, all related trouble should and can be resolved under check mode. For example, we can introduce new parameter for dmg check query command, such as -f|--full. If it is specified, then all the history findings (belong to the required pools if specified) will be shown; otherwise (by default), only the findings belong to current check instance (or the latest check instance if current instance has completed) will be shown. On the other hand, we can add new CHK dmg command to allow the user to discard specified (stale) findings from DB if he/she want. But of course, these can be done in another dedicated patch.

How do you think? @kjacque

@kjacque
Copy link
Copy Markdown
Contributor Author

kjacque commented Apr 7, 2026

I do not think we should operate CHK findings DB when destroy the pool/container under regular mode, instead, all related trouble should and can be resolved under check mode. For example, we can introduce new parameter for dmg check query command, such as -f|--full. If it is specified, then all the history findings (belong to the required pools if specified) will be shown; otherwise (by default), only the findings belong to current check instance (or the latest check instance if current instance has completed) will be shown. On the other hand, we can add new CHK dmg command to allow the user to discard specified (stale) findings from DB if he/she want. But of course, these can be done in another dedicated patch.

How do you think? @kjacque

I agree we should probably not be touching the findings in normal mode. If we want to track the findings according to check instance, we will probably need some kind of sequence number to label the check instance. Not sure if it would need to be tracked at the engine level--maybe the control plane could track this on its own.

Earlier today in triage we saw a ticket for the same issue seen on master, so we can continue the discussion there. Since it seems this patch is otherwise OK, and @mjmac approved, shall we merge it into your branch? @Nasf-Fan

@Nasf-Fan
Copy link
Copy Markdown
Contributor

Nasf-Fan commented Apr 8, 2026

I do not think we should operate CHK findings DB when destroy the pool/container under regular mode, instead, all related trouble should and can be resolved under check mode. For example, we can introduce new parameter for dmg check query command, such as -f|--full. If it is specified, then all the history findings (belong to the required pools if specified) will be shown; otherwise (by default), only the findings belong to current check instance (or the latest check instance if current instance has completed) will be shown. On the other hand, we can add new CHK dmg command to allow the user to discard specified (stale) findings from DB if he/she want. But of course, these can be done in another dedicated patch.
How do you think? @kjacque

I agree we should probably not be touching the findings in normal mode. If we want to track the findings according to check instance, we will probably need some kind of sequence number to label the check instance. Not sure if it would need to be tracked at the engine level--maybe the control plane could track this on its own.

Earlier today in triage we saw a ticket for the same issue seen on master, so we can continue the discussion there. Since it seems this patch is otherwise OK, and @mjmac approved, shall we merge it into your branch? @Nasf-Fan

Yes, we can handle dmg check query related issues in DAOS-18773.

I am not familiar with control plane implementation. @tanabarr , would you please to help another review for the huge patch before merging the patch? Thanks!

Copy link
Copy Markdown
Contributor

@tanabarr tanabarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/engine/drpc_ras.c formatting-only changes should be reverted
findRankIndex could be consolidated with other similar code

}

// SystemCheckEngineReportResp contains the response from the SystemCheckEngineReport RPC.
type SystemCheckEngineReportResp struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the benefit of aliasing CheckReportResp rather than using directly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency in the naming mostly, but it also gives us some flexibility with implementation in the future without altering the API types. With a lot of the checker stuff we just wrap the protobuf structures, rather than copying the structures (which is a more typical approach elsewhere in the control lib).


option go_package = "github.com/daos-stack/daos/src/control/common/proto/ctl";

import "mgmt/check.proto";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the base check types be declared in shared if they need to be used in both ctl and mgmt packages?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this initially too, but it ended up being a bit painful and ugly at the engine level, figuring out which files to export where. CheckActReq is needed in src/mgmt, while other types in the shared namespace wind up in src/engine... Seemed less confusing to leave the cross-module import. If you feel strongly I can change it.

kjacque added 2 commits April 11, 2026 00:41
Features: recovery

Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
Features: recovery

Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
@kjacque kjacque requested review from mjmac and tanabarr April 11, 2026 00:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants