Skip to content

feat: rewrite weka_runtime.py in Go (phase 1 - structure, deployment, basic modes)#2492

Draft
kristina-solovyova wants to merge 1 commit into
mainfrom
04-28-feat_rewrite_weka_runtime.py_in_go_phase_1_-_structure_deployment_basic_modes_
Draft

feat: rewrite weka_runtime.py in Go (phase 1 - structure, deployment, basic modes)#2492
kristina-solovyova wants to merge 1 commit into
mainfrom
04-28-feat_rewrite_weka_runtime.py_in_go_phase_1_-_structure_deployment_basic_modes_

Conversation

@kristina-solovyova
Copy link
Copy Markdown
Collaborator

No description provided.

@github-actions
Copy link
Copy Markdown

Dependency Review

The following issues were found:
  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 2 package(s) with unknown licenses.
See the Details below.

License Issues

go.mod

PackageVersionLicenseIssue Type
github.com/pmezard/go-difflib1.0.1-0.20181226105442-5d4384ee4fb2NullUnknown License
github.com/stretchr/testify1.11.1NullUnknown License

OpenSSF Scorecard

PackageVersionScoreDetails
gomod/github.com/pmezard/go-difflib 1.0.1-0.20181226105442-5d4384ee4fb2 UnknownUnknown
gomod/github.com/stretchr/testify 1.11.1 UnknownUnknown

Scanned Files

  • go.mod

Copy link
Copy Markdown
Collaborator Author

kristina-solovyova commented Apr 28, 2026


How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has required the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Claude finished @kristina-solovyova's task in 2m 59s —— View job


Code Review: Go Runtime Rewrite — Phase 1

  • Gather context and understand the PR
  • Review core runtime infrastructure (config, daemon, cmdutil)
  • Review drive-related code (blockdev, wekadrive)
  • Review adhoc operations
  • Review modes and service layer
  • Review OS info and pod results
  • Review integration points (controller changes, pod.go, env.go)
  • Review Helm/chart changes and dagger flows
  • Compile findings and post review

Good overall structure for Phase 1. The code is well-organized and the Python-parity comments are helpful. A few issues ranging from critical to minor below.


Critical

1. JSON w/ single-quotes breaks shell script in writeConfigToContainerinternal/runtime/weka/local.go:149-155

script := fmt.Sprintf(`...
echo '%s' > %s
...`, string(b), staging, name, staging, dest)

json.Marshal does not escape single quotes. If any config value contains ' (e.g., a trace path, an error message), the echo '...' will break the shell script silently or corrupt the file. For example {"key": "it's ok"}echo '{"key": "it's ok"}' is invalid shell.

Fix: write via os.WriteFile or use printf '%s' with proper escaping, or use base64 -d. Fix this →

2. NETWORK_SELECTORS / MANAGEMENT_IPS_SELECTORS format mismatchinternal/runtime/config/config.go:111-112 + internal/controllers/resources/pod.go:128-135

pod.go encodes selectors via json.Marshal → the env var is a JSON array like ["foo","bar"]. But config.go parses it with parseStringSlice (splits on ,) → the Go runtime receives ["[\"foo\"", "\"bar\"]"] — completely broken.

These fields need a JSON-unmarshal parser, not a comma-splitter. This won't cause issues in Phase 1 if network modes are not yet invoked, but it's a latent bug for Phase 2. Fix this →


Medium

3. Silent INSTRUCTIONS parse errorinternal/runtime/config/config.go:152-157

if raw := os.Getenv("INSTRUCTIONS"); raw != "" {
    var inst v1alpha1.Instructions
    if err := json.Unmarshal([]byte(raw), &inst); err == nil {
        cfg.Instructions = &inst
    }
    // if err != nil, cfg.Instructions stays nil — silently
}

When JSON is invalid, cfg.Instructions is nil, and runAdhoc() returns the opaque error "adhoc-op: no instructions provided" instead of the real cause. This will make production debugging painful. Fix this →

4. Unimplemented mode stubs silently succeedinternal/runtime/modes/client.go (and compute, drive, s3, nfs, envoy, etc.)

All these are registered via init() and return nil. If a pod in "client" mode accidentally runs the Go binary (e.g., UsePythonFallback is false in a misconfigured deploy), it will silently exit 0 — doing nothing, causing the Weka container to never start. Consider returning fmt.Errorf("mode %q: not yet implemented in Go runtime", name) for Phase 1 stubs, or making UsePythonFallback: true the default until Phase 2 ships. Fix this →

5. Context from CreateLogSpan discarded in daemoninternal/runtime/daemon/daemon.go:46

_, logger := instrumentation.CreateLogSpan(ctx, "daemon.supervise", ...)

All other callers use ctx, logger := .... Here the enriched ctx (with span ID) is thrown away, so any child spans or traces within supervise() won't be linked to the parent. Fix this →


Minor

6. Go naming convention violationinternal/runtime/blockdev/blockdev.go:120

maj_min := strings.TrimSpace(string(devIndex))

Should be majMin. Fix this →

7. Magic string "proxy guid" should be a constantinternal/runtime/wekadrive/sign.go:163

if clusterGUID != proxySignedGUID && clusterGUID != "proxy guid" && !dev.WekaInfo.IsProxy {

proxySignedGUID has a named constant, but the second sentinel "proxy guid" is a bare string. Fix this →

8. Misleading log message in hugepagesinternal/runtime/cos/hugepages.go:25-27

if !cfg.COSAllowHugepageConfig {
    logger.Info("Skipping hugepages configuration (WEKA_COS_ALLOW_HUGEPAGE_CONFIG not set)")
    // still check and error if hugepages are missing — fall through to build sedCmds
}

The log says "Skipping" but the code does NOT skip — it continues and may return fmt.Errorf(...) at line 81. This is misleading for operators reading logs. Fix this →

9. Idiomatic Go: else after returninternal/runtime/wekadrive/sign.go:200-203 and :235-246

if _, err := cmdutil.Output(ctx, ...); err == nil {
    return paths, nil
} else {
    logger.Warn(...)
}

The else branch after a return is redundant. Prefer:

if _, err := cmdutil.Output(ctx, ...); err != nil {
    logger.Warn(...)
}
return paths, nil

Observations / Not Issues

  • config.go:199-214 loadFeatureFlags: Taking only entries[0] is fine since there's exactly one version directory under /opt/weka/dist/release. The comment mirroring Python is helpful.
  • adhoc/sign_drives.go:92: time.Sleep(3s) mirrors Python. Consider a brief comment on why the sleep is needed (presumably to let the kernel settle after signing).
  • pod.go:218-222 Python fallback via UsePythonFallback config flag: Clean design for the transition period. The unconditional mount of weka_runtime.py even when Go binary is used is harmless.
  • daemon.go backoff logic: Correctly resets on successful start, caps at 30s. Clean.
  • cmdutil.go stderr capture: Both stdout capture and stderr append-to-error is a good pattern.
  • Tests are present for config, cmdutil, daemon, osinfo, and pod_results — good coverage for Phase 1.

script := fmt.Sprintf(`set -e
mkdir -p /opt/weka/k8s-scripts
echo '%s' > %s
weka local run --container %s mv %s %s`, string(b), staging, name, staging, dest)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants