DAOS-18674 control: Route check engine upcalls/downcalls#17747
DAOS-18674 control: Route check engine upcalls/downcalls#17747kjacque wants to merge 6 commits intoNasf-Fan/DAOS-18674_1from
Conversation
|
Ticket title is 'Enhance CHK upcall to handle MS leader switch' |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/1/execution/node/1385/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/1/execution/node/1354/log |
0d26b8d to
8228c7b
Compare
|
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17747/2/testReport/ |
|
Test stage Unit Test completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17747/2/testReport/ |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/2/execution/node/1432/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/2/execution/node/1391/log |
mjmac
left a comment
There was a problem hiding this comment.
Looks really good. Just a minor note about cleaning up dead code.
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/4/execution/node/1410/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/5/execution/node/429/log |
8f7d9a1 to
832d488
Compare
- Allow check engines to send CheckReports directly to their local control plane via dRPC. - Forward the CheckReports to the management service over gRPC for storage. Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
- Add new Protobuf definitions for per-engine repair. - Add control service endpoint for CheckEngineRepair. This endpoint sends a repair dRPC to the requested rank. - Update SystemCheckRepair to call CheckEngineRepair on the control plane of the rank where the finding originated. Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
- Remove unused struct member. - Fix unit test name. Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
e42a1d8 to
df41152
Compare
|
Had to rebase on parent patch and resolve conflicts. I left the commits basically the same, the conflict was a change in the copyright format. |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17747/6/execution/node/1353/log |
Please note that the CHK report @kjacque , should we remove such staled CHK report (with seq |
Agreed, reports for any destroyed pools should be removed. Let me look deeper into this today. I did a quick Jira search and only saw a couple open issues, so not sure if this is introduced by this PR. |
|
@Nasf-Fan This is an interesting corner case. I think it didn't get cleared because the test only starts the checker on specific pools, so the control plane left alone old findings. Right now we don't do anything to the finding DB if a pool (or container) is destroyed. Those operations only happen outside checker mode. I wonder if we generally should clear old findings when checker mode is disabled. Is there a reason we'd want to keep old findings? Edit: Found another ticket that seems like a similar phenomenon, looks like the solution at that point was to update the test. https://daosio.atlassian.net/browse/DAOS-18481 |
I do not think we should operate CHK findings DB when destroy the pool/container under regular mode, instead, all related trouble should and can be resolved under check mode. For example, we can introduce new parameter for How do you think? @kjacque |
I agree we should probably not be touching the findings in normal mode. If we want to track the findings according to check instance, we will probably need some kind of sequence number to label the check instance. Not sure if it would need to be tracked at the engine level--maybe the control plane could track this on its own. Earlier today in triage we saw a ticket for the same issue seen on master, so we can continue the discussion there. Since it seems this patch is otherwise OK, and @mjmac approved, shall we merge it into your branch? @Nasf-Fan |
Yes, we can handle I am not familiar with control plane implementation. @tanabarr , would you please to help another review for the huge patch before merging the patch? Thanks! |
tanabarr
left a comment
There was a problem hiding this comment.
src/engine/drpc_ras.c formatting-only changes should be reverted
findRankIndex could be consolidated with other similar code
| } | ||
|
|
||
| // SystemCheckEngineReportResp contains the response from the SystemCheckEngineReport RPC. | ||
| type SystemCheckEngineReportResp struct { |
There was a problem hiding this comment.
what's the benefit of aliasing CheckReportResp rather than using directly?
There was a problem hiding this comment.
Consistency in the naming mostly, but it also gives us some flexibility with implementation in the future without altering the API types. With a lot of the checker stuff we just wrap the protobuf structures, rather than copying the structures (which is a more typical approach elsewhere in the control lib).
|
|
||
| option go_package = "github.com/daos-stack/daos/src/control/common/proto/ctl"; | ||
|
|
||
| import "mgmt/check.proto"; |
There was a problem hiding this comment.
should the base check types be declared in shared if they need to be used in both ctl and mgmt packages?
There was a problem hiding this comment.
I thought this initially too, but it ended up being a bit painful and ugly at the engine level, figuring out which files to export where. CheckActReq is needed in src/mgmt, while other types in the shared namespace wind up in src/engine... Seemed less confusing to leave the cross-module import. If you feel strongly I can change it.
Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
Features: recovery Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
This patch moves responsibility for correctly routing check engine reports and repair requests to the control plane, and away from the check leader. This should be transparent to the end user, but lays groundwork for better managing checker activities from the control plane.
local control plane via dRPC.
gRPC for storage.
endpoint sends a repair dRPC to the requested rank.
control plane of the rank where the finding originated.
Features: recovery