CNF-18941: e2e: PP: cover ExecCPUAffinity support in tests #1432
openshift-merge-bot[bot] merged 3 commits into openshift:main
Conversation
Depends on #1426
Force-pushed 6ef4f1a to beeea3d
Force-pushed 565820a to 0af067c
Regarding /test e2e-gcp-pao-workloadhints:
The GCP cluster profile uses the ipi-gcp flow, which by default uses 6 vCPUs for compute machines (see `step-registry/ipi/conf/gcp/ipi-conf-ref.yaml`). The performance profile suite configures a profile with `reserved: "0"` and `isolated: "1-3"` (see openshift/cluster-node-tuning-operator#909), unless environment vars are specified. In general it is good practice to include all of a node's CPUs in the PP cpu section, but the reason we need this now is that some new tests require most of the CPUs to be distributed via the PP (see openshift/cluster-node-tuning-operator#1432 (comment)). In this commit we update only the affected job on which the test would run; later we will need to add this setting to all other jobs that consume the ipi-gcp cluster configuration. Note: this is subject to change should the CPU specifications on GCP get modified. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
/retest
Temporarily removed the failing test because the node topology misaligned with the PP cpu section.
/hold
Force-pushed 0af067c to 41afeca
/test e2e-aws-ovn
Force-pushed acdb51a to a8b7158
/retest
/unhold
SargunNarula left a comment
Thanks for the tests; IMO some tests are redundant and can be removed.
Force-pushed e9e8dd1 to a6f7bb8
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)
157-163: ⚠️ Potential issue | 🟠 Major
CPU allocatable assertion is still incorrect when shared CPUs are configured.
Line 162 explicitly notes the gap, but the assertion still enforces `capacity - allocatable == reserved`, which can fail for valid shared-CPU profiles and make this test flaky across environments. Please gate or adjust this expectation before release.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around lines 157 - 163, The assertion currently enforces capacityCPU - allocatableCPU == int64(len(listReservedCPU)) which fails when shared CPUs exist; modify the check in the cpu_management.go test (around workerRTNode, listReservedCPU, differenceCPUGot, differenceCPUExpected) to account for shared-CPU profiles by either (A) gating the strict equality: if a shared-CPU configuration is detected (e.g., inspect the node's topology/CPU manager state or a helper flag indicating shared CPUs) then assert differenceCPUGot <= differenceCPUExpected (or skip the equality check), or (B) replace the Expect(...).To(Equal(...)) with Expect(differenceCPUGot).To(BeNumerically("<=", differenceCPUExpected)) and add a clear By() note explaining the relaxed check for shared CPUs so the test is not flaky across environments.
♻️ Duplicate comments (6)
test/e2e/performanceprofile/functests/utils/pods/pods.go (2)
123-129: ⚠️ Potential issue | 🟠 Major
`WaitForDeletion` swallows non-NotFound errors. At Line 125, only `IsNotFound` is handled; other `cli.Get` failures are retried as `false, nil` until timeout, which hides the real failure mode.
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Proposed fix

```diff
 func WaitForDeletion(ctx context.Context, cli client.Client, pod *corev1.Pod, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		if err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod); errors.IsNotFound(err) {
+		err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod)
+		if errors.IsNotFound(err) {
 			return true, nil
 		}
-		return false, nil
+		if err != nil {
+			return false, err
+		}
+		return false, nil
 	})
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 123 - 129, The WaitForDeletion function currently swallows all non-NotFound errors from cli.Get; update the polling closure (used by wait.PollUntilContextTimeout) to propagate real errors by returning (false, err) when cli.Get returns an error that is not errors.IsNotFound(err), so that these failures break the poll and surface the underlying error instead of silently retrying; keep the existing behavior of returning (true, nil) when errors.IsNotFound(err) to signal successful deletion.
99-103: ⚠️ Potential issue | 🟠 Major
Nil check is currently after dereference in `DeleteAndSync`.
At Line 100, `pod.Namespace` and `pod.Name` are accessed before the nil guard on Line 101, so teardown can panic when cleanup is invoked with `nil`.
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Proposed fix

```diff
 func DeleteAndSync(ctx context.Context, cli client.Client, pod *corev1.Pod) error {
-	klog.InfoS("delete pod and waiting for deletion", "namespace", pod.Namespace, "name", pod.Name)
 	if pod == nil {
 		klog.InfoS("pod is nil, skipping deletion")
 		return nil
 	}
 	if cli == nil {
 		return fmt.Errorf("client is nil")
 	}
+	klog.InfoS("delete pod and waiting for deletion", "namespace", pod.Namespace, "name", pod.Name)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 99 - 103, The function DeleteAndSync dereferences pod (accessing pod.Namespace and pod.Name) before checking for nil which can panic; move the nil guard to the top of DeleteAndSync so you return early if pod == nil, then perform the klog.InfoS call using pod.Namespace and pod.Name only after the nil check; update any related logging or downstream logic in DeleteAndSync to assume pod is non-nil once past the guard.
test/e2e/performanceprofile/functests/1_performance/irqbalance.go (1)
340-346: ⚠️ Potential issue | 🟠 Major
Register cleanup before the failing assertion.
Line 341 can fail before the defer on Line 343 is registered. If pod creation partially succeeds and then errors, cleanup is skipped and the pod can leak.
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Proposed fix

```diff
 testpod, err := createPodWithHouskeeping(numOfContainersInPod, cpusPerContainer, profile, context.TODO(), targetNode)
-Expect(err).ToNot(HaveOccurred(), "Failed to create pod with housekeeping annotation")
 defer func() {
-	testlog.Infof("deleting pod %q", testpod.Name)
-	Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+	if testpod != nil {
+		testlog.Infof("deleting pod %q", testpod.Name)
+		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+	}
 }()
+Expect(err).ToNot(HaveOccurred(), "Failed to create pod with housekeeping annotation")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go` around lines 340 - 346, createPodWithHouskeeping may return a non-nil testpod and an error, so register cleanup immediately after the call to avoid leaking pods; move the defer that calls pods.DeleteAndSync/testclient.DataPlaneClient so it runs before the Expect(err).ToNot(HaveOccurred()) assertion, but guard it by checking testpod != nil (or capture a local copy of testpod inside the defer) so you only attempt deletion when a pod was actually returned; reference createPodWithHouskeeping, testpod, pods.DeleteAndSync and testclient.DataPlaneClient when making the change.
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go (2)
1273-1276: ⚠️ Potential issue | 🟠 Major
Fix deferred cleanup closure capturing loop index.
`DeferCleanup` closes over `i`; by cleanup time, `i` may be out of range or point to a different node/pod. Capture stable locals before registering cleanup.
Proposed fix

```diff
 for i := 0; i < len(workerRTNodes); i++ {
+	nodeName := workerRTNodes[i].Name
 	By("Determining the default container runtime used in the node")
 	tunedPod, err := tuned.GetPod(context.TODO(), &workerRTNodes[i])
 	...
 	testpod := testpodTemplate.DeepCopy()
-	testpod.Spec.NodeName = workerRTNodes[i].Name
-	testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: workerRTNodes[i].Name}
-	By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", workerRTNodes[i].Name))
+	testpod.Spec.NodeName = nodeName
+	testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: nodeName}
+	By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", nodeName))
 	Expect(testclient.DataPlaneClient.Create(context.TODO(), testpod)).ToNot(HaveOccurred())
+	podToDelete := testpod
 	DeferCleanup(func() {
-		By(fmt.Sprintf("deleting the test pod from node %s", workerRTNodes[i].Name))
-		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+		By(fmt.Sprintf("deleting the test pod from node %s", nodeName))
+		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podToDelete)).To(Succeed())
 	})
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go` around lines 1273 - 1276, DeferCleanup's closure captures the loop index i so by the time the deferred function runs it may reference the wrong node/pod; fix by capturing stable locals before registering the cleanup (e.g., create local copies like node := workerRTNodes[i] and podToDelete := testpod) and have the closure call pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podToDelete) and reference node.Name for the log; this ensures the deferred closure uses the intended node and pod rather than a mutated loop variable.
1391-1433: ⚠️ Potential issue | 🟠 Major
Validate against runnable exclusive CPUs, not raw cpuset.
This assertion can be unsatisfiable when the exclusive cpuset includes offline CPUs and only one runnable exclusive CPU remains. Intersect with online CPUs, skip when fewer than 2 remain, and validate samples against that runnable set.
Proposed fix

```diff
-exclusiveCPUs := cpusList.Difference(reservedCPUs).Difference(sharedCPUs)
-Expect(exclusiveCPUs.Size()).ToNot(BeZero())
-firstExclusiveCPU := exclusiveCPUs.List()[0]
-testlog.Infof("first exclusive CPU: %d, all exclusive CPUs: %s", firstExclusiveCPU, exclusiveCPUs.String())
+exclusiveCPUs := cpusList.Difference(reservedCPUs).Difference(sharedCPUs)
+Expect(exclusiveCPUs.Size()).ToNot(BeZero())
+onlineCPUs, err := nodes.GetOnlineCPUsSet(ctx, &workerRTNodes[0])
+Expect(err).ToNot(HaveOccurred())
+runnableExclusiveCPUs := exclusiveCPUs.Intersection(onlineCPUs)
+if runnableExclusiveCPUs.Size() < 2 {
+	Skip(fmt.Sprintf("need at least 2 runnable exclusive CPUs, got %s (exclusive=%s online=%s)",
+		runnableExclusiveCPUs.String(), exclusiveCPUs.String(), onlineCPUs.String()))
+}
+firstExclusiveCPU := runnableExclusiveCPUs.List()[0]
+testlog.Infof("first runnable exclusive CPU: %d, runnable exclusive CPUs: %s", firstExclusiveCPU, runnableExclusiveCPUs.String())
 @@
-if !exclusiveCPUs.Contains(cpu) {
+if !runnableExclusiveCPUs.Contains(cpu) {
 	return fmt.Errorf("entry %d: exec process was pinned to a reserved or shared CPU %d", idx, cpu)
 }
 @@
 Expect(execProcessCPUs).ToNot(HaveEach(firstExclusiveCPU),
 	"exec process was always pinned to the first exclusive CPU %d across all %d retries", firstExclusiveCPU, retries)
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go` around lines 1391 - 1433, The test currently validates samples against exclusiveCPUs (derived from cpusList.Difference(reservedCPUs).Difference(sharedCPUs)) which can include offline CPUs; change the logic to compute runnableExclusiveCPUs by intersecting exclusiveCPUs with the set of online CPUs (e.g., get online set then runnableExclusive := exclusiveCPUs.Intersection(onlineCPUs)), skip the remainder of the check when runnableExclusiveCPUs.Size() < 2, and use runnableExclusiveCPUs (not exclusiveCPUs) for membership checks in the exec goroutine and the final Expect on execProcessCPUs vs firstExclusiveCPU; update references to exclusiveCPUs, firstExclusiveCPU, execProcessCPUs and the final Expect to use runnableExclusiveCPUs so the assertion is satisfiable.
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)
1015-1016: ⚠️ Potential issue | 🟠 Major
Pod lifecycle client mismatch can leak test pods.
Cleanup now uses `testclient.DataPlaneClient`, but pod creation in this same test still uses `testclient.Client` (Line 1008 and Line 1041). In split control-plane/data-plane environments, this can leave pods orphaned or fail cleanup.
Suggested fix

```diff
-	err := testclient.Client.Create(ctx, guaranteedPod)
+	err := testclient.DataPlaneClient.Create(ctx, guaranteedPod)
 @@
-	err = testclient.Client.Create(ctx, bestEffortPod)
+	err = testclient.DataPlaneClient.Create(ctx, bestEffortPod)
```

Also applies to: 1046-1047
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around lines 1015 - 1016, The cleanup is calling pods.DeleteAndSync(ctx, testclient.DataPlaneClient, guaranteedPod) while pod creation in this test uses testclient.Client, which can orphan pods in split control-plane/data-plane setups; update the test so pod creation and deletion use the same client: either change the DeleteAndSync calls (and other cleanup calls around guaranteedPod) to use testclient.Client, or change the pod creation calls to use testclient.DataPlaneClient—ensure all references to pods.DeleteAndSync, guaranteedPod creation, and any subsequent cleanup (the other occurrence at the same block around lines noted) consistently use the same client object.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1316-1318: The code currently only warns when len(samples) <
samplesPerRun (using testlog.Warningf), which lets undersampled runs pass;
change that to fail the test immediately by replacing the warning with a fatal
failure — e.g. call testlog.Fatalf (or the test runner's equivalent) with the
same message and parameters ("entry %d: got %d/%d samples (process may have
exited early)", idx, len(samples), samplesPerRun) so any undersampled run causes
the test to stop and report failure.
---
Outside diff comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 157-163: The assertion currently enforces capacityCPU -
allocatableCPU == int64(len(listReservedCPU)) which fails when shared CPUs
exist; modify the check in the cpu_management.go test (around workerRTNode,
listReservedCPU, differenceCPUGot, differenceCPUExpected) to account for
shared-CPU profiles by either (A) gating the strict equality: if a shared-CPU
configuration is detected (e.g., inspect the node's topology/CPU manager state
or a helper flag indicating shared CPUs) then assert differenceCPUGot <=
differenceCPUExpected (or skip the equality check), or (B) replace the
Expect(...).To(Equal(...)) with Expect(differenceCPUGot).To(BeNumerically("<=",
differenceCPUExpected)) and add a clear By() note explaining the relaxed check
for shared CPUs so the test is not flaky across environments.
---
Duplicate comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1015-1016: The cleanup is calling pods.DeleteAndSync(ctx,
testclient.DataPlaneClient, guaranteedPod) while pod creation in this test uses
testclient.Client, which can orphan pods in split control-plane/data-plane
setups; update the test so pod creation and deletion use the same client: either
change the DeleteAndSync calls (and other cleanup calls around guaranteedPod) to
use testclient.Client, or change the pod creation calls to use
testclient.DataPlaneClient—ensure all references to pods.DeleteAndSync,
guaranteedPod creation, and any subsequent cleanup (the other occurrence at the
same block around lines noted) consistently use the same client object.
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 340-346: createPodWithHouskeeping may return a non-nil testpod and
an error, so register cleanup immediately after the call to avoid leaking pods;
move the defer that calls pods.DeleteAndSync/testclient.DataPlaneClient so it
runs before the Expect(err).ToNot(HaveOccurred()) assertion, but guard it by
checking testpod != nil (or capture a local copy of testpod inside the defer) so
you only attempt deletion when a pod was actually returned; reference
createPodWithHouskeeping, testpod, pods.DeleteAndSync and
testclient.DataPlaneClient when making the change.
In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 1273-1276: DeferCleanup's closure captures the loop index i so by
the time the deferred function runs it may reference the wrong node/pod; fix by
capturing stable locals before registering the cleanup (e.g., create local
copies like node := workerRTNodes[i] and podToDelete := testpod) and have the
closure call pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient,
podToDelete) and reference node.Name for the log; this ensures the deferred
closure uses the intended node and pod rather than a mutated loop variable.
- Around line 1391-1433: The test currently validates samples against
exclusiveCPUs (derived from
cpusList.Difference(reservedCPUs).Difference(sharedCPUs)) which can include
offline CPUs; change the logic to compute runnableExclusiveCPUs by intersecting
exclusiveCPUs with the set of online CPUs (e.g., get online set then
runnableExclusive := exclusiveCPUs.Intersection(onlineCPUs)), skip the remainder
of the check when runnableExclusiveCPUs.Size() < 2, and use
runnableExclusiveCPUs (not exclusiveCPUs) for membership checks in the exec
goroutine and the final Expect on execProcessCPUs vs firstExclusiveCPU; update
references to exclusiveCPUs, firstExclusiveCPU, execProcessCPUs and the final
Expect to use runnableExclusiveCPUs so the assertion is satisfiable.
In `@test/e2e/performanceprofile/functests/utils/pods/pods.go`:
- Around line 123-129: The WaitForDeletion function currently swallows all
non-NotFound errors from cli.Get; update the polling closure (used by
wait.PollUntilContextTimeout) to propagate real errors by returning (false, err)
when cli.Get returns an error that is not errors.IsNotFound(err), so that these
failures break the poll and surface the underlying error instead of silently
retrying; keep the existing behavior of returning (true, nil) when
errors.IsNotFound(err) to signal successful deletion.
- Around line 99-103: The function DeleteAndSync dereferences pod (accessing
pod.Namespace and pod.Name) before checking for nil which can panic; move the
nil guard to the top of DeleteAndSync so you return early if pod == nil, then
perform the klog.InfoS call using pod.Namespace and pod.Name only after the nil
check; update any related logging or downstream logic in DeleteAndSync to assume
pod is non-nil once past the guard.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 135c6ac5-fd02-4fe8-b6a0-bede26a89279
📒 Files selected for processing (16)
- go.mod
- test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go
- test/e2e/performanceprofile/functests/13_llc/llc.go
- test/e2e/performanceprofile/functests/1_performance/cpu_management.go
- test/e2e/performanceprofile/functests/1_performance/hugepages.go
- test/e2e/performanceprofile/functests/1_performance/irqbalance.go
- test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
- test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
- test/e2e/performanceprofile/functests/4_latency/latency.go
- test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
- test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
- test/e2e/performanceprofile/functests/utils/pods/pods.go
- test/e2e/performanceprofile/functests/utils/resources/resources.go
- test/e2e/performanceprofile/functests/utils/resources/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
- test/e2e/performanceprofile/functests/utils/resources/resources_test.go
- go.mod
Force-pushed a6f7bb8 to b76ea67
Actionable comments posted: 2
♻️ Duplicate comments (4)
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go (2)
1273-1275: ⚠️ Potential issue | 🟠 Major
Capture loop values before `DeferCleanup` to avoid a wrong cleanup target or panic.
The cleanup closure uses `workerRTNodes[i].Name` from the loop variable. By cleanup time, `i` may be out of range, and cleanup can panic or reference the wrong node/pod.
Proposed fix

```diff
 for i := 0; i < len(workerRTNodes); i++ {
+	nodeName := workerRTNodes[i].Name
 	By("Determining the default container runtime used in the node")
 	tunedPod, err := tuned.GetPod(context.TODO(), &workerRTNodes[i])
 	...
 	testpod := testpodTemplate.DeepCopy()
-	testpod.Spec.NodeName = workerRTNodes[i].Name
-	testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: workerRTNodes[i].Name}
-	By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", workerRTNodes[i].Name))
+	testpod.Spec.NodeName = nodeName
+	testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: nodeName}
+	By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", nodeName))
 	Expect(testclient.DataPlaneClient.Create(context.TODO(), testpod)).ToNot(HaveOccurred())
+	podToDelete := testpod
 	DeferCleanup(func() {
-		By(fmt.Sprintf("deleting the test pod from node %s", workerRTNodes[i].Name))
-		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+		By(fmt.Sprintf("deleting the test pod from node %s", nodeName))
+		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podToDelete)).To(Succeed())
 	})
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go` around lines 1273 - 1275, The DeferCleanup closure captures the loop variable i which may change by cleanup time; capture the needed values (e.g., nodeName := workerRTNodes[i].Name and podRef := testpod) immediately before calling DeferCleanup and use those local copies inside the closure so DeferCleanup calls pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podRef) and logs nodeName instead of referencing workerRTNodes[i].Name or i.
1371-1376: ⚠️ Potential issue | 🟠 Major
Derive expected affinity from runnable exclusive CPUs (online intersection).
The test chooses `firstExclusiveCPU` from raw cpuset data. In CI, cpusets can include offline CPUs; this makes the assertion flaky or unsatisfiable. Intersect with node online CPUs first, then assert against that runnable set.
Proposed fix

```diff
 cpusetCfg := &controller.CpuSet{}
 Expect(getter.Container(ctx, testPod, testPod.Spec.Containers[0].Name, cpusetCfg)).To(Succeed(), "Failed to get cpuset config for test pod")
 cpusList, err := cpuset.Parse(cpusetCfg.Cpus)
 Expect(err).ToNot(HaveOccurred())
 Expect(cpusList.Size()).ToNot(BeZero())
 @@
 exclusiveCPUs := cpusList.Difference(reservedCPUs).Difference(sharedCPUs)
-Expect(exclusiveCPUs.Size()).ToNot(BeZero())
-firstExclusiveCPU := exclusiveCPUs.List()[0]
-testlog.Infof("first exclusive CPU: %d, all exclusive CPUs: %s", firstExclusiveCPU, exclusiveCPUs.String())
+onlineCPUs, err := nodes.GetOnlineCPUsSet(ctx, &workerRTNodes[0])
+Expect(err).ToNot(HaveOccurred())
+runnableExclusiveCPUs := exclusiveCPUs.Intersection(onlineCPUs)
+if runnableExclusiveCPUs.Size() < 2 {
+	Skip(fmt.Sprintf("need at least two online exclusive CPUs, got %s (exclusive=%s, online=%s)",
+		runnableExclusiveCPUs.String(), exclusiveCPUs.String(), onlineCPUs.String()))
+}
+firstExclusiveCPU := runnableExclusiveCPUs.List()[0]
+testlog.Infof("first exclusive CPU: %d, runnable exclusive CPUs: %s", firstExclusiveCPU, runnableExclusiveCPUs.String())
 @@
-if !exclusiveCPUs.Contains(cpu) {
+if !runnableExclusiveCPUs.Contains(cpu) {
 	return fmt.Errorf("entry %d: exec process was pinned to a reserved or shared CPU %d", idx, cpu)
 }
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1391-1394, 1432-1433
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go` around lines 1371 - 1376, The test currently derives expected affinity from raw cpuset data (cpusetCfg -> cpusList) which may include offline CPUs; change the logic that picks firstExclusiveCPU and makes assertions to first compute the runnable/exclusive set by intersecting cpusList with the node's online CPUs (use the node online CPU set before any further filtering), then derive firstExclusiveCPU and assert against that runnable intersection; update all occurrences using cpusetCfg/cpusList/firstExclusiveCPU (including the blocks around cpusetCfg, cpusList parsing and the assertions at 1391-1394 and 1432-1433) to use the online-intersected cpuset instead.
test/e2e/performanceprofile/functests/utils/pods/pods.go (1)
123-129: ⚠️ Potential issue | 🟠 Major
Propagate non-`NotFound` errors in `WaitForDeletion`.
`cli.Get` failures (RBAC/API/network) are currently swallowed and retried until timeout. This hides the root cause and makes teardown flaky.
Proposed fix

```diff
 func WaitForDeletion(ctx context.Context, cli client.Client, pod *corev1.Pod, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		if err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod); errors.IsNotFound(err) {
+		err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod)
+		if err == nil {
+			return false, nil
+		}
+		if errors.IsNotFound(err) {
 			return true, nil
 		}
-		return false, nil
+		return false, err
 	})
 }
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 123 - 129, The WaitForDeletion function swallows all cli.Get errors and only treats NotFound as success; propagate non-NotFound errors immediately so RBAC/API/network failures aren't retried and hidden. In WaitForDeletion (which uses client.Client cli and wait.PollUntilContextTimeout), check the error returned by cli.Get: if errors.IsNotFound(err) return true,nil; if err != nil return false, err to surface the failure; only return false,nil when there is no error and the object still exists. This ensures callers see real failures instead of timeouts.
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)
1316-1318: ⚠️ Potential issue | 🟠 Major
Fail on undersampled runs to prevent false positives.
If sampling returns fewer than `samplesPerRun` samples, the test only warns and still proceeds. This can pass affinity checks with incomplete evidence.
Suggested change

```diff
-	if len(samples) < samplesPerRun {
-		testlog.Warningf("entry %d: got %d/%d samples (process may have exited early)", idx, len(samples), samplesPerRun)
-	}
+	if len(samples) < samplesPerRun {
+		return fmt.Errorf("entry %d: got %d/%d samples (process may have exited early)", idx, len(samples), samplesPerRun)
+	}
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around lines 1316 - 1318, The test currently only warns when a run is undersampled (len(samples) < samplesPerRun) which can produce false positives; update the check around samples, samplesPerRun (and idx) to fail the test instead of continuing—use the test harness failure path (e.g., testlog.Fatalf or the appropriate test failure function used elsewhere in this file) to abort the test with a clear message including idx, len(samples) and samplesPerRun so undersampled runs do not proceed to affinity checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1291-1323: The loop launches all retry goroutines concurrently
which can create too many simultaneous pods.ExecCommandOnPod streams; limit
concurrency by adding a semaphore or worker pool when spawning goroutines for
the errgroup (use errgroup.WithContext as before) — for example create a
buffered channel (or semaphore) sized to a safe maxConcurrent (e.g.,
min(retries, 4)), acquire before calling g.Go and release inside the goroutine
when done; keep using idx := i, write results into execProcessCPUSamples[idx],
and ensure g.Wait() still checks for errors.
In `@test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go`:
- Around line 569-575: The selection of "first" CPUs uses raw cpuset order and
can pick offline CPUs; change the logic in the block that calls
nodes.GetCoreSiblings and nodes.GetTwoSiblingsFromCPUSet (and similar blocks
that set updatedShared/updatedIsolated) to intersect the candidate cpuset with
the current online CPU set before choosing CPUs—i.e., compute onlineCandidates
:= candidateCpuset.Intersection(onlineCPUs) and then pick the first two from
onlineCandidates (or fall back appropriately) instead of using
updatedIsolated.List() directly; apply the same intersection-based selection
wherever updatedShared/updatedIsolated are constructed (the other occurrences
noted in the comment).
---
Duplicate comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1316-1318: The test currently only warns when a run is
undersampled (len(samples) < samplesPerRun) which can produce false positives;
update the check around samples, samplesPerRun (and idx) to fail the test
instead of continuing—use the test harness failure path (e.g., testlog.Fatalf or
the appropriate test failure function used elsewhere in this file) to abort the
test with a clear message including idx, len(samples) and samplesPerRun so
undersampled runs do not proceed to affinity checks.
In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 1273-1275: The DeferCleanup closure captures the loop variable i
which may change by cleanup time; capture the needed values (e.g., nodeName :=
workerRTNodes[i].Name and podRef := testpod) immediately before calling
DeferCleanup and use those local copies inside the closure so DeferCleanup calls
pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podRef) and logs
nodeName instead of referencing workerRTNodes[i].Name or i.
- Around line 1371-1376: The test currently derives expected affinity from raw
cpuset data (cpusetCfg -> cpusList) which may include offline CPUs; change the
logic that picks firstExclusiveCPU and makes assertions to first compute the
runnable/exclusive set by intersecting cpusList with the node's online CPUs (use
the node online CPU set before any further filtering), then derive
firstExclusiveCPU and assert against that runnable intersection; update all
occurrences using cpusetCfg/cpusList/firstExclusiveCPU (including the blocks
around cpusetCfg, cpusList parsing and the assertions at 1391-1394 and
1432-1433) to use the online-intersected cpuset instead.
In `@test/e2e/performanceprofile/functests/utils/pods/pods.go`:
- Around line 123-129: The WaitForDeletion function swallows all cli.Get errors
and only treats NotFound as success; propagate non-NotFound errors immediately
so RBAC/API/network failures aren't retried and hidden. In WaitForDeletion
(which uses client.Client cli and wait.PollUntilContextTimeout), check the error
returned by cli.Get: if errors.IsNotFound(err) return true,nil; if err != nil
return false, err to surface the failure; only return false,nil when there is no
error and the object still exists. This ensures callers see real failures
instead of timeouts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9fe94267-8a2b-4f0d-905b-3817ed39bb59
📒 Files selected for processing (16)
- go.mod
- test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go
- test/e2e/performanceprofile/functests/13_llc/llc.go
- test/e2e/performanceprofile/functests/1_performance/cpu_management.go
- test/e2e/performanceprofile/functests/1_performance/hugepages.go
- test/e2e/performanceprofile/functests/1_performance/irqbalance.go
- test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
- test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
- test/e2e/performanceprofile/functests/4_latency/latency.go
- test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
- test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
- test/e2e/performanceprofile/functests/utils/pods/pods.go
- test/e2e/performanceprofile/functests/utils/resources/resources.go
- test/e2e/performanceprofile/functests/utils/resources/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (10)
- test/e2e/performanceprofile/functests/4_latency/latency.go
- test/e2e/performanceprofile/functests/utils/resources/resources_test.go
- test/e2e/performanceprofile/functests/13_llc/llc.go
- go.mod
- test/e2e/performanceprofile/functests/1_performance/hugepages.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes.go
- test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
- test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
- test/e2e/performanceprofile/functests/1_performance/irqbalance.go
- test/e2e/performanceprofile/functests/utils/resources/resources.go
b76ea67 to 3604b53
|
/retest |
3604b53 to c5f7e7f
Introduce pods.DeleteAndSync helper (with client.Client parameter) and replace all local deleteTestPod functions and inline Get+Delete+WaitForDeletion patterns across functest suites with calls to this shared utility. Assisted-by: Cursor v2.3.37 AIA Attribution: AIA Primarily human, Stylistic edits, Human-initiated, Reviewed Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add e2e tests to verify exec-cpu-affinity behavior across cpu_management, updating_profile, and mixedcpus suites. Tests verify that exec processes are pinned to the correct exclusive, shared, or isolated CPU sets based on pod QoS class and container resource requests. Assisted-by: Cursor v2.3.37 AIA Attribution: AIA Primarily human, Stylistic edits, Human-initiated, Reviewed v1.0 Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add unit tests for functions in the resources helper package. Assisted-by: Cursor v1.2.2 AI-Attribution: AIA Entirely AI, Human-initiated, Reviewed, Cursor v1.2.2 v1.0 Signed-off-by: Shereen Haj <shajmakh@redhat.com>
c5f7e7f to 04916c8
|
/retest |
|
@shajmakh: This pull request references CNF-18941 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified later @SargunNarula |
|
@shajmakh: This PR has been marked to be verified later by @SargunNarula. |
|
/override ci/prow/okd-scos-images per offline agreement, can't be caused by this PR as it only affects tests |
|
@ffromani: Overrode contexts on behalf of ffromani: ci/prow/okd-scos-images
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/approve |
|
@shajmakh: all tests passed! Full PR test history. Your PR dashboard. I understand the commands that are listed here. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani, shajmakh, Tal-or. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment. |
Add basic e2e tests that check the default behavior of performance-profile with ExecCPUAffinity enabled by default: first.