CNF-18941: e2e: PP: cover ExecCPUAffinity support in tests #1432
openshift-merge-bot[bot] merged 3 commits into openshift:main
Conversation
Depends on #1426
Force-pushed 6ef4f1a to beeea3d
Force-pushed 565820a to 0af067c
Regarding /test e2e-gcp-pao-workloadhints:
The GCP cluster profile uses the ipi-gcp flow, which by default uses 6 vCPUs for compute machines (see `step-registry/ipi/conf/gcp/ipi-conf-ref.yaml`). The performance profile suite configures a profile with `reserved: "0"` and `isolated: "1-3"` (see openshift/cluster-node-tuning-operator#909), unless environment vars are specified. In general it is good practice to include all of a node's CPUs in the PP cpu section, but the reason we need this now is that some new tests require most of the CPUs to be distributed via the PP (see openshift/cluster-node-tuning-operator#1432 (comment)). In this commit we update only the affected job on which the test would run; later we will need to add this setting to all other jobs that consume the ipi-gcp cluster configuration. Note: this is subject to change should the CPU specifications on GCP get modified. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
/retest
Temporarily removed the failing test because the node topology misaligned with the PP cpu section.
/hold
Force-pushed 0af067c to 41afeca
/test e2e-aws-ovn
Force-pushed acdb51a to a8b7158
/retest
/unhold
SargunNarula left a comment
Thanks for the tests; IMO some tests are redundant and can be removed.
Force-pushed e9e8dd1 to a6f7bb8
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)
157-163: ⚠️ Potential issue | 🟠 Major
CPU allocatable assertion is still incorrect when shared CPUs are configured.
Line 162 explicitly notes the gap, but the assertion still enforces `capacity - allocatable == reserved`, which can fail for valid shared-CPU profiles and make this test flaky across environments. Please gate or adjust this expectation before release.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around lines 157 - 163, The assertion currently enforces capacityCPU - allocatableCPU == int64(len(listReservedCPU)) which fails when shared CPUs exist; modify the check in the cpu_management.go test (around workerRTNode, listReservedCPU, differenceCPUGot, differenceCPUExpected) to account for shared-CPU profiles by either (A) gating the strict equality: if a shared-CPU configuration is detected (e.g., inspect the node's topology/CPU manager state or a helper flag indicating shared CPUs) then assert differenceCPUGot <= differenceCPUExpected (or skip the equality check), or (B) replace the Expect(...).To(Equal(...)) with Expect(differenceCPUGot).To(BeNumerically("<=", differenceCPUExpected)) and add a clear By() note explaining the relaxed check for shared CPUs so the test is not flaky across environments.
♻️ Duplicate comments (6)
test/e2e/performanceprofile/functests/utils/pods/pods.go (2)
123-129: ⚠️ Potential issue | 🟠 Major
`WaitForDeletion` swallows non-NotFound errors. At Line 125, only `IsNotFound` is handled; other `cli.Get` failures are retried as `false, nil` until timeout, which hides the real failure mode.
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Proposed fix

```diff
 func WaitForDeletion(ctx context.Context, cli client.Client, pod *corev1.Pod, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		if err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod); errors.IsNotFound(err) {
+		err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod)
+		if errors.IsNotFound(err) {
 			return true, nil
 		}
-		return false, nil
+		if err != nil {
+			return false, err
+		}
+		return false, nil
 	})
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 123 - 129, The WaitForDeletion function currently swallows all non-NotFound errors from cli.Get; update the polling closure (used by wait.PollUntilContextTimeout) to propagate real errors by returning (false, err) when cli.Get returns an error that is not errors.IsNotFound(err), so that these failures break the poll and surface the underlying error instead of silently retrying; keep the existing behavior of returning (true, nil) when errors.IsNotFound(err) to signal successful deletion.
99-103: ⚠️ Potential issue | 🟠 Major
Nil check is currently after dereference in `DeleteAndSync`.
At Line 100, `pod.Namespace` and `pod.Name` are accessed before the nil guard on Line 101, so teardown can panic when cleanup is invoked with `nil`.
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Proposed fix

```diff
 func DeleteAndSync(ctx context.Context, cli client.Client, pod *corev1.Pod) error {
-	klog.InfoS("delete pod and waiting for deletion", "namespace", pod.Namespace, "name", pod.Name)
 	if pod == nil {
 		klog.InfoS("pod is nil, skipping deletion")
 		return nil
 	}
 	if cli == nil {
 		return fmt.Errorf("client is nil")
 	}
+	klog.InfoS("delete pod and waiting for deletion", "namespace", pod.Namespace, "name", pod.Name)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 99 - 103, The function DeleteAndSync dereferences pod (accessing pod.Namespace and pod.Name) before checking for nil which can panic; move the nil guard to the top of DeleteAndSync so you return early if pod == nil, then perform the klog.InfoS call using pod.Namespace and pod.Name only after the nil check; update any related logging or downstream logic in DeleteAndSync to assume pod is non-nil once past the guard.
test/e2e/performanceprofile/functests/1_performance/irqbalance.go (1)
340-346: ⚠️ Potential issue | 🟠 Major
Register cleanup before the failing assertion.
Line 341 can fail before the defer on Line 343 is registered. If pod creation partially succeeds and then errors, cleanup is skipped and the pod can leak.
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Proposed fix

```diff
 testpod, err := createPodWithHouskeeping(numOfContainersInPod, cpusPerContainer, profile, context.TODO(), targetNode)
-Expect(err).ToNot(HaveOccurred(), "Failed to create pod with housekeeping annotation")
 defer func() {
-	testlog.Infof("deleting pod %q", testpod.Name)
-	Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+	if testpod != nil {
+		testlog.Infof("deleting pod %q", testpod.Name)
+		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+	}
 }()
+Expect(err).ToNot(HaveOccurred(), "Failed to create pod with housekeeping annotation")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go` around lines 340 - 346, createPodWithHouskeeping may return a non-nil testpod and an error, so register cleanup immediately after the call to avoid leaking pods; move the defer that calls pods.DeleteAndSync/testclient.DataPlaneClient so it runs before the Expect(err).ToNot(HaveOccurred()) assertion, but guard it by checking testpod != nil (or capture a local copy of testpod inside the defer) so you only attempt deletion when a pod was actually returned; reference createPodWithHouskeeping, testpod, pods.DeleteAndSync and testclient.DataPlaneClient when making the change.
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go (2)
1273-1276: ⚠️ Potential issue | 🟠 Major
Fix deferred cleanup closure capturing loop index.
`DeferCleanup` closes over `i`; by cleanup time, `i` may be out of range or point to a different node/pod. Capture stable locals before registering cleanup.
Proposed fix

```diff
 for i := 0; i < len(workerRTNodes); i++ {
+	nodeName := workerRTNodes[i].Name
 	By("Determining the default container runtime used in the node")
 	tunedPod, err := tuned.GetPod(context.TODO(), &workerRTNodes[i])
 	...
 	testpod := testpodTemplate.DeepCopy()
-	testpod.Spec.NodeName = workerRTNodes[i].Name
-	testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: workerRTNodes[i].Name}
-	By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", workerRTNodes[i].Name))
+	testpod.Spec.NodeName = nodeName
+	testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: nodeName}
+	By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", nodeName))
 	Expect(testclient.DataPlaneClient.Create(context.TODO(), testpod)).ToNot(HaveOccurred())
+	podToDelete := testpod
 	DeferCleanup(func() {
-		By(fmt.Sprintf("deleting the test pod from node %s", workerRTNodes[i].Name))
-		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+		By(fmt.Sprintf("deleting the test pod from node %s", nodeName))
+		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podToDelete)).To(Succeed())
 	})
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go` around lines 1273 - 1276, DeferCleanup's closure captures the loop index i so by the time the deferred function runs it may reference the wrong node/pod; fix by capturing stable locals before registering the cleanup (e.g., create local copies like node := workerRTNodes[i] and podToDelete := testpod) and have the closure call pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podToDelete) and reference node.Name for the log; this ensures the deferred closure uses the intended node and pod rather than a mutated loop variable.
1391-1433: ⚠️ Potential issue | 🟠 Major
Validate against runnable exclusive CPUs, not raw cpuset.
This assertion can be unsatisfiable when the exclusive cpuset includes offline CPUs and only one runnable exclusive CPU remains. Intersect with online CPUs, skip when fewer than 2 remain, and validate samples against that runnable set.
Proposed fix

```diff
-exclusiveCPUs := cpusList.Difference(reservedCPUs).Difference(sharedCPUs)
-Expect(exclusiveCPUs.Size()).ToNot(BeZero())
-firstExclusiveCPU := exclusiveCPUs.List()[0]
-testlog.Infof("first exclusive CPU: %d, all exclusive CPUs: %s", firstExclusiveCPU, exclusiveCPUs.String())
+exclusiveCPUs := cpusList.Difference(reservedCPUs).Difference(sharedCPUs)
+Expect(exclusiveCPUs.Size()).ToNot(BeZero())
+onlineCPUs, err := nodes.GetOnlineCPUsSet(ctx, &workerRTNodes[0])
+Expect(err).ToNot(HaveOccurred())
+runnableExclusiveCPUs := exclusiveCPUs.Intersection(onlineCPUs)
+if runnableExclusiveCPUs.Size() < 2 {
+	Skip(fmt.Sprintf("need at least 2 runnable exclusive CPUs, got %s (exclusive=%s online=%s)",
+		runnableExclusiveCPUs.String(), exclusiveCPUs.String(), onlineCPUs.String()))
+}
+firstExclusiveCPU := runnableExclusiveCPUs.List()[0]
+testlog.Infof("first runnable exclusive CPU: %d, runnable exclusive CPUs: %s", firstExclusiveCPU, runnableExclusiveCPUs.String())
 @@
-if !exclusiveCPUs.Contains(cpu) {
+if !runnableExclusiveCPUs.Contains(cpu) {
 	return fmt.Errorf("entry %d: exec process was pinned to a reserved or shared CPU %d", idx, cpu)
 }
 @@
 Expect(execProcessCPUs).ToNot(HaveEach(firstExclusiveCPU),
 	"exec process was always pinned to the first exclusive CPU %d across all %d retries", firstExclusiveCPU, retries)
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go` around lines 1391 - 1433, The test currently validates samples against exclusiveCPUs (derived from cpusList.Difference(reservedCPUs).Difference(sharedCPUs)) which can include offline CPUs; change the logic to compute runnableExclusiveCPUs by intersecting exclusiveCPUs with the set of online CPUs (e.g., get online set then runnableExclusive := exclusiveCPUs.Intersection(onlineCPUs)), skip the remainder of the check when runnableExclusiveCPUs.Size() < 2, and use runnableExclusiveCPUs (not exclusiveCPUs) for membership checks in the exec goroutine and the final Expect on execProcessCPUs vs firstExclusiveCPU; update references to exclusiveCPUs, firstExclusiveCPU, execProcessCPUs and the final Expect to use runnableExclusiveCPUs so the assertion is satisfiable.
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)
1015-1016: ⚠️ Potential issue | 🟠 Major
Pod lifecycle client mismatch can leak test pods.
Cleanup now uses `testclient.DataPlaneClient`, but pod creation in this same test still uses `testclient.Client` (Line 1008 and Line 1041). In split control-plane/data-plane environments, this can leave pods orphaned or fail cleanup.
Suggested fix

```diff
-	err := testclient.Client.Create(ctx, guaranteedPod)
+	err := testclient.DataPlaneClient.Create(ctx, guaranteedPod)
 @@
-	err = testclient.Client.Create(ctx, bestEffortPod)
+	err = testclient.DataPlaneClient.Create(ctx, bestEffortPod)
```

Also applies to: 1046-1047
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around lines 1015 - 1016, The cleanup is calling pods.DeleteAndSync(ctx, testclient.DataPlaneClient, guaranteedPod) while pod creation in this test uses testclient.Client, which can orphan pods in split control-plane/data-plane setups; update the test so pod creation and deletion use the same client: either change the DeleteAndSync calls (and other cleanup calls around guaranteedPod) to use testclient.Client, or change the pod creation calls to use testclient.DataPlaneClient—ensure all references to pods.DeleteAndSync, guaranteedPod creation, and any subsequent cleanup (the other occurrence at the same block around lines noted) consistently use the same client object.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1316-1318: The code currently only warns when len(samples) <
samplesPerRun (using testlog.Warningf), which lets undersampled runs pass;
change that to fail the test immediately by replacing the warning with a fatal
failure — e.g. call testlog.Fatalf (or the test runner's equivalent) with the
same message and parameters ("entry %d: got %d/%d samples (process may have
exited early)", idx, len(samples), samplesPerRun) so any undersampled run causes
the test to stop and report failure.
---
Outside diff comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 157-163: The assertion currently enforces capacityCPU -
allocatableCPU == int64(len(listReservedCPU)) which fails when shared CPUs
exist; modify the check in the cpu_management.go test (around workerRTNode,
listReservedCPU, differenceCPUGot, differenceCPUExpected) to account for
shared-CPU profiles by either (A) gating the strict equality: if a shared-CPU
configuration is detected (e.g., inspect the node's topology/CPU manager state
or a helper flag indicating shared CPUs) then assert differenceCPUGot <=
differenceCPUExpected (or skip the equality check), or (B) replace the
Expect(...).To(Equal(...)) with Expect(differenceCPUGot).To(BeNumerically("<=",
differenceCPUExpected)) and add a clear By() note explaining the relaxed check
for shared CPUs so the test is not flaky across environments.
---
Duplicate comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1015-1016: The cleanup is calling pods.DeleteAndSync(ctx,
testclient.DataPlaneClient, guaranteedPod) while pod creation in this test uses
testclient.Client, which can orphan pods in split control-plane/data-plane
setups; update the test so pod creation and deletion use the same client: either
change the DeleteAndSync calls (and other cleanup calls around guaranteedPod) to
use testclient.Client, or change the pod creation calls to use
testclient.DataPlaneClient—ensure all references to pods.DeleteAndSync,
guaranteedPod creation, and any subsequent cleanup (the other occurrence at the
same block around lines noted) consistently use the same client object.
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 340-346: createPodWithHouskeeping may return a non-nil testpod and
an error, so register cleanup immediately after the call to avoid leaking pods;
move the defer that calls pods.DeleteAndSync/testclient.DataPlaneClient so it
runs before the Expect(err).ToNot(HaveOccurred()) assertion, but guard it by
checking testpod != nil (or capture a local copy of testpod inside the defer) so
you only attempt deletion when a pod was actually returned; reference
createPodWithHouskeeping, testpod, pods.DeleteAndSync and
testclient.DataPlaneClient when making the change.
In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 1273-1276: DeferCleanup's closure captures the loop index i so by
the time the deferred function runs it may reference the wrong node/pod; fix by
capturing stable locals before registering the cleanup (e.g., create local
copies like node := workerRTNodes[i] and podToDelete := testpod) and have the
closure call pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient,
podToDelete) and reference node.Name for the log; this ensures the deferred
closure uses the intended node and pod rather than a mutated loop variable.
- Around line 1391-1433: The test currently validates samples against
exclusiveCPUs (derived from
cpusList.Difference(reservedCPUs).Difference(sharedCPUs)) which can include
offline CPUs; change the logic to compute runnableExclusiveCPUs by intersecting
exclusiveCPUs with the set of online CPUs (e.g., get online set then
runnableExclusive := exclusiveCPUs.Intersection(onlineCPUs)), skip the remainder
of the check when runnableExclusiveCPUs.Size() < 2, and use
runnableExclusiveCPUs (not exclusiveCPUs) for membership checks in the exec
goroutine and the final Expect on execProcessCPUs vs firstExclusiveCPU; update
references to exclusiveCPUs, firstExclusiveCPU, execProcessCPUs and the final
Expect to use runnableExclusiveCPUs so the assertion is satisfiable.
In `@test/e2e/performanceprofile/functests/utils/pods/pods.go`:
- Around line 123-129: The WaitForDeletion function currently swallows all
non-NotFound errors from cli.Get; update the polling closure (used by
wait.PollUntilContextTimeout) to propagate real errors by returning (false, err)
when cli.Get returns an error that is not errors.IsNotFound(err), so that these
failures break the poll and surface the underlying error instead of silently
retrying; keep the existing behavior of returning (true, nil) when
errors.IsNotFound(err) to signal successful deletion.
- Around line 99-103: The function DeleteAndSync dereferences pod (accessing
pod.Namespace and pod.Name) before checking for nil which can panic; move the
nil guard to the top of DeleteAndSync so you return early if pod == nil, then
perform the klog.InfoS call using pod.Namespace and pod.Name only after the nil
check; update any related logging or downstream logic in DeleteAndSync to assume
pod is non-nil once past the guard.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 135c6ac5-fd02-4fe8-b6a0-bede26a89279
📒 Files selected for processing (16)
- go.mod
- test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go
- test/e2e/performanceprofile/functests/13_llc/llc.go
- test/e2e/performanceprofile/functests/1_performance/cpu_management.go
- test/e2e/performanceprofile/functests/1_performance/hugepages.go
- test/e2e/performanceprofile/functests/1_performance/irqbalance.go
- test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
- test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
- test/e2e/performanceprofile/functests/4_latency/latency.go
- test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
- test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
- test/e2e/performanceprofile/functests/utils/pods/pods.go
- test/e2e/performanceprofile/functests/utils/resources/resources.go
- test/e2e/performanceprofile/functests/utils/resources/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
- test/e2e/performanceprofile/functests/utils/resources/resources_test.go
- go.mod
Force-pushed a6f7bb8 to b76ea67
Actionable comments posted: 2
♻️ Duplicate comments (4)
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go (2)
1273-1275: ⚠️ Potential issue | 🟠 Major
Capture loop values before `DeferCleanup` to avoid a wrong cleanup target or panic.
The cleanup closure uses `workerRTNodes[i].Name` from the loop variable. By cleanup time, `i` may be out of range, and cleanup can panic or reference the wrong node/pod.
Proposed fix

```diff
 for i := 0; i < len(workerRTNodes); i++ {
+	nodeName := workerRTNodes[i].Name
 	By("Determining the default container runtime used in the node")
 	tunedPod, err := tuned.GetPod(context.TODO(), &workerRTNodes[i])
 	...
 	testpod := testpodTemplate.DeepCopy()
-	testpod.Spec.NodeName = workerRTNodes[i].Name
-	testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: workerRTNodes[i].Name}
-	By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", workerRTNodes[i].Name))
+	testpod.Spec.NodeName = nodeName
+	testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: nodeName}
+	By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", nodeName))
 	Expect(testclient.DataPlaneClient.Create(context.TODO(), testpod)).ToNot(HaveOccurred())
+	podToDelete := testpod
 	DeferCleanup(func() {
-		By(fmt.Sprintf("deleting the test pod from node %s", workerRTNodes[i].Name))
-		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+		By(fmt.Sprintf("deleting the test pod from node %s", nodeName))
+		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podToDelete)).To(Succeed())
 	})
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go` around lines 1273 - 1275, The DeferCleanup closure captures the loop variable i which may change by cleanup time; capture the needed values (e.g., nodeName := workerRTNodes[i].Name and podRef := testpod) immediately before calling DeferCleanup and use those local copies inside the closure so DeferCleanup calls pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podRef) and logs nodeName instead of referencing workerRTNodes[i].Name or i.
1371-1376: ⚠️ Potential issue | 🟠 Major
Derive expected affinity from runnable exclusive CPUs (online intersection).
The test chooses `firstExclusiveCPU` from raw cpuset data. In CI, cpusets can include offline CPUs; this makes the assertion flaky or unsatisfiable. Intersect with node online CPUs first, then assert against that runnable set.
Proposed fix

```diff
 cpusetCfg := &controller.CpuSet{}
 Expect(getter.Container(ctx, testPod, testPod.Spec.Containers[0].Name, cpusetCfg)).To(Succeed(), "Failed to get cpuset config for test pod")
 cpusList, err := cpuset.Parse(cpusetCfg.Cpus)
 Expect(err).ToNot(HaveOccurred())
 Expect(cpusList.Size()).ToNot(BeZero())
 @@
 exclusiveCPUs := cpusList.Difference(reservedCPUs).Difference(sharedCPUs)
-Expect(exclusiveCPUs.Size()).ToNot(BeZero())
-firstExclusiveCPU := exclusiveCPUs.List()[0]
-testlog.Infof("first exclusive CPU: %d, all exclusive CPUs: %s", firstExclusiveCPU, exclusiveCPUs.String())
+onlineCPUs, err := nodes.GetOnlineCPUsSet(ctx, &workerRTNodes[0])
+Expect(err).ToNot(HaveOccurred())
+runnableExclusiveCPUs := exclusiveCPUs.Intersection(onlineCPUs)
+if runnableExclusiveCPUs.Size() < 2 {
+	Skip(fmt.Sprintf("need at least two online exclusive CPUs, got %s (exclusive=%s, online=%s)",
+		runnableExclusiveCPUs.String(), exclusiveCPUs.String(), onlineCPUs.String()))
+}
+firstExclusiveCPU := runnableExclusiveCPUs.List()[0]
+testlog.Infof("first exclusive CPU: %d, runnable exclusive CPUs: %s", firstExclusiveCPU, runnableExclusiveCPUs.String())
 @@
-if !exclusiveCPUs.Contains(cpu) {
+if !runnableExclusiveCPUs.Contains(cpu) {
 	return fmt.Errorf("entry %d: exec process was pinned to a reserved or shared CPU %d", idx, cpu)
 }
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1391-1394, 1432-1433
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go` around lines 1371 - 1376, The test currently derives expected affinity from raw cpuset data (cpusetCfg -> cpusList) which may include offline CPUs; change the logic that picks firstExclusiveCPU and makes assertions to first compute the runnable/exclusive set by intersecting cpusList with the node's online CPUs (use the node online CPU set before any further filtering), then derive firstExclusiveCPU and assert against that runnable intersection; update all occurrences using cpusetCfg/cpusList/firstExclusiveCPU (including the blocks around cpusetCfg, cpusList parsing and the assertions at 1391-1394 and 1432-1433) to use the online-intersected cpuset instead.
test/e2e/performanceprofile/functests/utils/pods/pods.go (1)
123-129: ⚠️ Potential issue | 🟠 Major
Propagate non-`NotFound` errors in `WaitForDeletion`.
`cli.Get` failures (RBAC/API/network) are currently swallowed and retried until timeout. This hides the root cause and makes teardown flaky.
Proposed fix

```diff
 func WaitForDeletion(ctx context.Context, cli client.Client, pod *corev1.Pod, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		if err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod); errors.IsNotFound(err) {
+		err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod)
+		if err == nil {
+			return false, nil
+		}
+		if errors.IsNotFound(err) {
 			return true, nil
 		}
-		return false, nil
+		return false, err
 	})
 }
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 123 - 129, The WaitForDeletion function swallows all cli.Get errors and only treats NotFound as success; propagate non-NotFound errors immediately so RBAC/API/network failures aren't retried and hidden. In WaitForDeletion (which uses client.Client cli and wait.PollUntilContextTimeout), check the error returned by cli.Get: if errors.IsNotFound(err) return true,nil; if err != nil return false, err to surface the failure; only return false,nil when there is no error and the object still exists. This ensures callers see real failures instead of timeouts.
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)
1316-1318: ⚠️ Potential issue | 🟠 Major
Fail on undersampled runs to prevent false positives.
If sampling returns fewer than `samplesPerRun` samples, the test only warns and still proceeds. This can pass affinity checks with incomplete evidence.
Suggested change

```diff
-	if len(samples) < samplesPerRun {
-		testlog.Warningf("entry %d: got %d/%d samples (process may have exited early)", idx, len(samples), samplesPerRun)
-	}
+	if len(samples) < samplesPerRun {
+		return fmt.Errorf("entry %d: got %d/%d samples (process may have exited early)", idx, len(samples), samplesPerRun)
+	}
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around lines 1316 - 1318, The test currently only warns when a run is undersampled (len(samples) < samplesPerRun) which can produce false positives; update the check around samples, samplesPerRun (and idx) to fail the test instead of continuing—use the test harness failure path (e.g., testlog.Fatalf or the appropriate test failure function used elsewhere in this file) to abort the test with a clear message including idx, len(samples) and samplesPerRun so undersampled runs do not proceed to affinity checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1291-1323: The loop launches all retry goroutines concurrently
which can create too many simultaneous pods.ExecCommandOnPod streams; limit
concurrency by adding a semaphore or worker pool when spawning goroutines for
the errgroup (use errgroup.WithContext as before) — for example create a
buffered channel (or semaphore) sized to a safe maxConcurrent (e.g.,
min(retries, 4)), acquire before calling g.Go and release inside the goroutine
when done; keep using idx := i, write results into execProcessCPUSamples[idx],
and ensure g.Wait() still checks for errors.
In `@test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go`:
- Around line 569-575: The selection of "first" CPUs uses raw cpuset order and
can pick offline CPUs; change the logic in the block that calls
nodes.GetCoreSiblings and nodes.GetTwoSiblingsFromCPUSet (and similar blocks
that set updatedShared/updatedIsolated) to intersect the candidate cpuset with
the current online CPU set before choosing CPUs—i.e., compute onlineCandidates
:= candidateCpuset.Intersection(onlineCPUs) and then pick the first two from
onlineCandidates (or fall back appropriately) instead of using
updatedIsolated.List() directly; apply the same intersection-based selection
wherever updatedShared/updatedIsolated are constructed (the other occurrences
noted in the comment).
---
Duplicate comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1316-1318: The test currently only warns when a run is
undersampled (len(samples) < samplesPerRun) which can produce false positives;
update the check around samples, samplesPerRun (and idx) to fail the test
instead of continuing—use the test harness failure path (e.g., testlog.Fatalf or
the appropriate test failure function used elsewhere in this file) to abort the
test with a clear message including idx, len(samples) and samplesPerRun so
undersampled runs do not proceed to affinity checks.
In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 1273-1275: The DeferCleanup closure captures the loop variable i
which may change by cleanup time; capture the needed values (e.g., nodeName :=
workerRTNodes[i].Name and podRef := testpod) immediately before calling
DeferCleanup and use those local copies inside the closure so DeferCleanup calls
pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podRef) and logs
nodeName instead of referencing workerRTNodes[i].Name or i.
- Around line 1371-1376: The test currently derives expected affinity from raw
cpuset data (cpusetCfg -> cpusList) which may include offline CPUs; change the
logic that picks firstExclusiveCPU and makes assertions to first compute the
runnable/exclusive set by intersecting cpusList with the node's online CPUs (use
the node online CPU set before any further filtering), then derive
firstExclusiveCPU and assert against that runnable intersection; update all
occurrences using cpusetCfg/cpusList/firstExclusiveCPU (including the blocks
around cpusetCfg, cpusList parsing and the assertions at 1391-1394 and
1432-1433) to use the online-intersected cpuset instead.
In `@test/e2e/performanceprofile/functests/utils/pods/pods.go`:
- Around line 123-129: The WaitForDeletion function swallows all cli.Get errors
and only treats NotFound as success; propagate non-NotFound errors immediately
so RBAC/API/network failures aren't retried and hidden. In WaitForDeletion
(which uses client.Client cli and wait.PollUntilContextTimeout), check the error
returned by cli.Get: if errors.IsNotFound(err) return true,nil; if err != nil
return false, err to surface the failure; only return false,nil when there is no
error and the object still exists. This ensures callers see real failures
instead of timeouts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9fe94267-8a2b-4f0d-905b-3817ed39bb59
📒 Files selected for processing (16)
- go.mod
- test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go
- test/e2e/performanceprofile/functests/13_llc/llc.go
- test/e2e/performanceprofile/functests/1_performance/cpu_management.go
- test/e2e/performanceprofile/functests/1_performance/hugepages.go
- test/e2e/performanceprofile/functests/1_performance/irqbalance.go
- test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
- test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
- test/e2e/performanceprofile/functests/4_latency/latency.go
- test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
- test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
- test/e2e/performanceprofile/functests/utils/pods/pods.go
- test/e2e/performanceprofile/functests/utils/resources/resources.go
- test/e2e/performanceprofile/functests/utils/resources/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (10)
- test/e2e/performanceprofile/functests/4_latency/latency.go
- test/e2e/performanceprofile/functests/utils/resources/resources_test.go
- test/e2e/performanceprofile/functests/13_llc/llc.go
- go.mod
- test/e2e/performanceprofile/functests/1_performance/hugepages.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
- test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
- test/e2e/performanceprofile/functests/1_performance/irqbalance.go
- test/e2e/performanceprofile/functests/utils/resources/resources.go
b76ea67 to 3604b53
|
/retest |
3604b53 to c5f7e7f
Introduce pods.DeleteAndSync helper (with client.Client parameter) and replace all local deleteTestPod functions and inline Get+Delete+WaitForDeletion patterns across functest suites with calls to this shared utility. Assisted-by: Cursor v2.3.37 AIA Attribution: AIA Primarily human, Stylistic edits, Human-initiated, Reviewed Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add e2e tests to verify exec-cpu-affinity behavior across cpu_management, updating_profile, and mixedcpus suites. Tests verify that exec processes are pinned to the correct exclusive, shared, or isolated CPU sets based on pod QoS class and container resource requests. Assisted-by: Cursor v2.3.37 AIA Attribution: AIA Primarily human, Stylistic edits, Human-initiated, Reviewed v1.0 Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add unit tests for functions in the resources helper package. Assisted-by: Cursor v1.2.2 AI-Attribution: AIA Entirely AI, Human-initiated, Reviewed, Cursor v1.2.2 v1.0 Signed-off-by: Shereen Haj <shajmakh@redhat.com>
c5f7e7f to 04916c8
|
/retest |
|
@shajmakh: This pull request references CNF-18941 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified later @SargunNarula |
|
@shajmakh: This PR has been marked to be verified later by @SargunNarula. |
|
/override ci/prow/okd-scos-images per offline agreement, can't be caused by this PR as it only affects tests |
|
@ffromani: Overrode contexts on behalf of ffromani: ci/prow/okd-scos-images
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/approve |
|
@shajmakh: all tests passed! Full PR test history. Your PR dashboard. I understand the commands that are listed here. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani, shajmakh, Tal-or. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment. |
Add basic e2e tests that check the default behavior of performance-profile with ExecCPUAffinity enabled by default: first.