
Enrich failure handling #1065

Open: Rei1010 wants to merge 852 commits into Project-HAMi:master from Rei1010:enrichFailureHandling

Rei1010 (Contributor) commented May 21, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:
Improves failure handling for the e2e tests: when a test fails, the status, events, and logs of the affected pods are logged.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Error logs:

  [FAILED] in [It] - /Users/rui/Documents/Repos/Rei1010/HAMi/test/e2e/pod/test_pod.go:156 @ 05/21/25 11:09:07.958
  STEP: Check pod detailed after each test @ 05/21/25 11:09:07.958
I0521 11:09:07.977833   30711 pod.go:179] Pod default/gpu-pod5729 is in Pending status
I0521 11:09:07.977850   30711 pod.go:181] Show events for default/gpu-pod5729:
I0521 11:09:07.986884   30711 pod.go:190] Reason: FailedScheduling, Message : 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.. 
I0521 11:09:07.993636   30711 pod.go:200] Show logs for default/gpu-pod5729:
I0521 11:09:07.993648   30711 pod.go:201] 

Does this PR introduce a user-facing change?:

archlitchi and others added 30 commits August 6, 2024 16:20
Signed-off-by: limengxuan <391013634@qq.com>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.5.0 to 6.6.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6.5.0...v6.6.0)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.6.0 to 6.6.1.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6.6.0...v6.6.1)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: william-wang <wang.platform@gmail.com>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
* fix: fix duplicate resource keys in configmap

* fix: Update incorrect component names in monitorservice
Bumps [github.com/opencontainers/runc](https://github.com/opencontainers/runc) from 1.1.2 to 1.1.12.
- [Release notes](https://github.com/opencontainers/runc/releases)
- [Changelog](https://github.com/opencontainers/runc/blob/main/CHANGELOG.md)
- [Commits](opencontainers/runc@v1.1.2...v1.1.12)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/runc
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 2 to 3.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@v2...v3)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [actions/setup-go](https://github.com/actions/setup-go) from 4 to 5.
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](actions/setup-go@v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-go
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
Bumps [github.com/opencontainers/runc](https://github.com/opencontainers/runc) from 1.1.12 to 1.1.14.
- [Release notes](https://github.com/opencontainers/runc/releases)
- [Changelog](https://github.com/opencontainers/runc/blob/main/CHANGELOG.md)
- [Commits](opencontainers/runc@v1.1.12...v1.1.14)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/runc
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.6.1 to 6.7.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6.6.1...v6.7.0)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: wawa0210 <xiao.zhang@daocloud.io>
chinaran and others added 21 commits April 19, 2025 14:44
Signed-off-by: yxxhero <aiopsclub@163.com>
Signed-off-by: yxxhero <aiopsclub@163.com>
Signed-off-by: yxxhero <aiopsclub@163.com>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.15.0 to 6.16.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6.15.0...v6.16.0)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-version: 6.16.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Project-HAMi#1023)

Signed-off-by: wangmin <wangmin@riseunion.io>

Co-authored-by: wangmin <wangmin@riseunion.io>
…i#1021)

* feat: Support for using RuntimeClass with nvidia devices

Signed-off-by: 王然 <ranwang@alauda.io>

* docs: runtimeClassName

Signed-off-by: 王然 <ranwang@alauda.io>

* feat: reset hasResource logic

Signed-off-by: 王然 <ranwang@alauda.io>

---------

Signed-off-by: 王然 <ranwang@alauda.io>
…1020)

Signed-off-by: wangmin <wangmin@riseunion.io>

Co-authored-by: wangmin <wangmin@riseunion.io>
…t after ConfigMap modification (Project-HAMi#1022)

Signed-off-by: 王然 <ranwang@alauda.io>
 (Project-HAMi#1012)

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
add new ai accelerator GCU S60 made by https://www.enflame-tech.com

Signed-off-by: winston-zhang-orz <73474183+winston-zhang-orz@users.noreply.github.com>
* update cambricon devices

Signed-off-by: limengxuan <391013634@qq.com>

* update

Signed-off-by: limengxuan <391013634@qq.com>

* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

---------

Signed-off-by: limengxuan <391013634@qq.com>
Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
…roject-HAMi#1031)

Fix scheduler metrics can not be accessed when using master branch of HAMi

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
…roject-HAMi#938)

* Separate options from client to make the responsibility more clear.
Remove the magic number in the main function and define it as a constant.

Signed-off-by: yangshiqi <yangshiqi@riseunion.io>

* fix merge bugs and add testcase.
remove some comments to try e2e

Signed-off-by: yangshiqi <yangshiqi@riseunion.io>

* debug for e2e

Signed-off-by: yangshiqi <yangshiqi@riseunion.io>

* fix e2e error

Signed-off-by: yangshiqi <yangshiqi@riseunion.io>

---------

Signed-off-by: yangshiqi <yangshiqi@riseunion.io>
Co-authored-by: yangshiqi <yangshiqi@riseunion.io>
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.16.0 to 6.17.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6.16.0...v6.17.0)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-version: 6.17.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: wawa0210 <xiao.zhang@dynamia.ai>
Signed-off-by: wen.rui <wen.rui@daocloud.io>
Rei1010 force-pushed the enrichFailureHandling branch from 38c2ac4 to 8aa69f2 on May 21, 2025 07:28
codecov Bot commented May 21, 2025

Codecov Report

❌ Patch coverage is 82.48175% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/device/ascend/device.go | 82.48% | 21 Missing and 3 partials ⚠️ |

| Flag | Coverage Δ |
|---|---|
| unittests | 61.07% <82.48%> (?) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/device/cambricon/device.go | 79.71% <ø> (ø) |
| pkg/device/devices.go | 73.65% <ø> (ø) |
| pkg/device/enflame/device.go | 78.08% <ø> (ø) |
| pkg/device/hygon/device.go | 93.37% <ø> (ø) |
| pkg/device/iluvatar/device.go | 86.23% <ø> (ø) |
| pkg/device/metax/config.go | 100.00% <ø> (ø) |
| pkg/device/metax/device.go | 81.13% <ø> (ø) |
| pkg/device/metax/protocol.go | 62.74% <ø> (ø) |
| pkg/device/metax/sdevice.go | 21.51% <ø> (ø) |
| pkg/device/mthreads/device.go | 90.90% <ø> (ø) |

... and 23 more

Copilot AI left a comment

Pull Request Overview

This PR enhances test failure diagnostics by adding utilities to fetch and log pod details across namespaces, adjusts the pod-running wait interval for GPU workloads, and integrates detailed pod checks after any test failure.

  • Increased the polling interval in WaitForPodRunning from 5s to 30s.
  • Introduced GetNamespaceList, GetPodLogs, and CheckPodDetails in test/utils/pod.go.
  • Updated AfterEach in test/e2e/pod/test_pod.go to call CheckPodDetails on failures and removed a debug fmt.Printf.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| test/utils/pod.go | Added new failure-handling helpers, updated polling interval, and imported I/O packages. |
| test/e2e/pod/test_pod.go | Call CheckPodDetails on test failures and remove leftover fmt.Printf debug statement. |

Comments suppressed due to low confidence (1)

test/utils/pod.go:96

  • Use the passed-in context ctx instead of context.TODO() to allow cancellation and deadlines to propagate correctly.
pod, err := clientSet.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})

Comment thread test/utils/pod.go
events, err := GetPodEvents(clientSet, ns, pod.Name)
if err != nil {
	klog.Errorf("Failed to get events for %s/%s: %v", ns, pod.Name, err)
	return

Copilot AI May 21, 2025

Returning here stops logging details for other pods. Consider using continue to proceed to the next pod and log all failures.

Suggested change:
- return
+ continue
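
The difference between the two keywords is easy to see in a reduced form of the loop. This is a sketch under assumptions: `fetchEvents` is a hypothetical stand-in for `GetPodEvents` that fails for one pod, and the two report functions are illustrative, not the PR's `CheckPodDetails`.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchEvents is a hypothetical stand-in for GetPodEvents; it fails for pod-b.
func fetchEvents(pod string) (string, error) {
	if pod == "pod-b" {
		return "", errors.New("events unavailable")
	}
	return "events for " + pod, nil
}

// reportWithReturn stops at the first failure, so later pods are never reported.
func reportWithReturn(pods []string) []string {
	var reported []string
	for _, p := range pods {
		if _, err := fetchEvents(p); err != nil {
			return reported
		}
		reported = append(reported, p)
	}
	return reported
}

// reportWithContinue skips only the failing pod and keeps going.
func reportWithContinue(pods []string) []string {
	var reported []string
	for _, p := range pods {
		if _, err := fetchEvents(p); err != nil {
			continue
		}
		reported = append(reported, p)
	}
	return reported
}

func main() {
	pods := []string{"pod-a", "pod-b", "pod-c"}
	fmt.Println("return:  ", reportWithReturn(pods))   // [pod-a]
	fmt.Println("continue:", reportWithContinue(pods)) // [pod-a pod-c]
}
```

The same reasoning applies to any per-pod step inside the loop, including the log retrieval that follows the event lookup.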

Comment thread test/utils/pod.go
logs, err := GetPodLogs(clientSet, ns, pod.Name)
if err != nil {
	klog.Errorf("Failed to get logs for %s/%s: %v", ns, pod.Name, err)
	return

Copilot AI May 21, 2025

As with events, use continue instead of return so that other pods are still checked and logged.

Suggested change:
- return
+ continue

Comment thread test/utils/pod.go
}

klog.Infof("Show logs for %s/%s:", ns, pod.Name)
klog.Infof(logs)

Copilot AI May 21, 2025

[nitpick] Passing raw logs to Infof can misinterpret formatting verbs—use klog.Info(logs) or klog.Infof("%s", logs) instead.

Suggested change:
- klog.Infof(logs)
+ klog.Infof("%s", logs)
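
A short sketch of what goes wrong when log text is used as the format string. It assumes the logger formats its arguments the way `fmt.Sprintf` does; `infof` is a hypothetical stand-in defined here so the example needs only the standard library, and the log line is invented for illustration.

```go
package main

import "fmt"

// infof is a hypothetical stand-in for a printf-style logger.
// It returns the formatted line instead of writing it, so the result can be inspected.
var infof = func(format string, args ...any) string {
	return fmt.Sprintf(format, args...)
}

func main() {
	// Container logs are arbitrary text and may contain '%' characters.
	logs := "GPU memory usage at 50%"

	// Log text used as the format string: the trailing '%' is parsed as an incomplete verb.
	fmt.Println(infof(logs)) // GPU memory usage at 50%!(NOVERB)

	// Log text passed as an argument: it is printed verbatim.
	fmt.Println(infof("%s", logs)) // GPU memory usage at 50%
}
```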

Comment thread test/utils/pod.go
return false, nil
})
}

Copilot AI May 21, 2025

[nitpick] Add a doc comment to describe the purpose and behavior of this public function for better maintainability.

Suggested change
// GetNamespaceList retrieves a list of all namespaces in the Kubernetes cluster.
// It takes a Kubernetes clientset as input and returns a slice of namespace names
// or an error if the operation fails.

@mesutoezdil (Contributor)

Any updates on your PR?
