@viliakov viliakov commented Jan 13, 2026

Summary

  • Implement a restore lock mechanism that prevents restore operations from running concurrently
  • Add mutual exclusion support for datastores that cannot be restored at the same time (stackgraph and settings both modify HBase data)
  • Integrate lock acquisition/release into the scale down/up workflow with automatic cleanup on failure
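The mutual exclusion rule in the summary can be sketched as a small conflict check. This is a minimal illustration, not the code from this PR: the `Datastore` type, the `exclusionGroups` variable, the `Conflicts` function, and the `Other` placeholder datastore are all hypothetical names; only the stackgraph/settings pairing comes from the description above.

```go
package main

import "fmt"

// Datastore identifies a restorable datastore (hypothetical type for this sketch).
type Datastore string

const (
	Stackgraph Datastore = "stackgraph"
	Settings   Datastore = "settings"
	// Other is a placeholder for any datastore outside the HBase group.
	Other Datastore = "other-datastore"
)

// exclusionGroups lists datastores that must never be restored at the same
// time. Stackgraph and settings share a group because both modify HBase data.
var exclusionGroups = [][]Datastore{
	{Stackgraph, Settings},
}

// contains reports whether d is a member of group.
func contains(group []Datastore, d Datastore) bool {
	for _, member := range group {
		if member == d {
			return true
		}
	}
	return false
}

// Conflicts reports whether a restore of requested must be refused while a
// restore of active is running: either they are the same datastore, or they
// belong to the same exclusion group.
func Conflicts(requested, active Datastore) bool {
	if requested == active {
		return true
	}
	for _, group := range exclusionGroups {
		if contains(group, requested) && contains(group, active) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(Conflicts(Settings, Stackgraph))   // true: same exclusion group
	fmt.Println(Conflicts(Stackgraph, Stackgraph)) // true: same datastore
	fmt.Println(Conflicts(Other, Stackgraph))      // false: unrelated datastores
}
```

Keeping the groups in a single table means a new pairing only needs one more entry rather than another branch in the check.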

Output

Mutually exclusive stackgraph and settings restores

❯ go run main.go settings restore --namespace stac-23374-ha --latest --yes
Finding latest backup...
Setting up port-forward to suse-observability-minio:9000 in namespace stac-23374-ha...
✅ Port-forward established successfully
Listing Settings backups in bucket 'sts-configuration-backup'...

Ensuring backup scripts ConfigMap exists...
✅ Backup scripts ConfigMap ready
Ensuring Minio keys secret exists...
✅ Minio keys secret ready

Creating job to list Settings backups stored on PVC...
✅ List job created: settings-list-20260113t083423

✅ List job completed successfully
Cleaning up resources...
✅ Job deleted: settings-list-20260113t083423
Using latest backup: sts-backup-20260113-0400.sty

❌ Error: cannot start settings restore: stackgraph restore is in progress (started at 2026-01-13T07:33:58Z on Deployment/suse-observability-api). Note: settings and stackgraph restores are mutually exclusive

To manually remove a stuck restore lock, run:
  kubectl annotate deployment,statefulset -l stackstate.com/connects-to-stackgraph=true stackstate.com/restore-in-progress- stackstate.com/restore-started-at- -n stac-23374-ha
exit status 1

Running two Stackgraph restores

❯ go run main.go stackgraph restore --namespace stac-23374-ha --latest --yes
Finding latest backup...
Setting up port-forward to suse-observability-minio:9000 in namespace stac-23374-ha...
✅ Port-forward established successfully
Using latest backup: sts-backup-20260113-0300.graph

❌ Error: cannot start stackgraph restore: another stackgraph restore is already in progress (started at 2026-01-13T07:33:58Z on Deployment/suse-observability-api)

To manually remove a stuck restore lock, run:
  kubectl annotate deployment,statefulset -l stackstate.com/connects-to-stackgraph=true stackstate.com/restore-in-progress- stackstate.com/restore-started-at- -n stac-23374-ha
exit status 1

Copilot AI left a comment

Pull request overview

This PR implements a restore lock mechanism to prevent parallel restore operations that could corrupt data. The solution uses Kubernetes annotations on Deployments and StatefulSets to coordinate restore operations and implements mutual exclusion between datastores that share underlying data (stackgraph and settings both use HBase).
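A rough sketch of how an annotation-based lock could be detected follows. The two annotation keys are taken from the cleanup command in the output above; everything else is an assumption made for illustration: the `Workload` and `Lock` types, the `FindLock` function, and in particular the idea that the `restore-in-progress` annotation value holds the datastore name. The real client in `internal/clients/k8s/client.go` talks to the Kubernetes API, which this sketch replaces with plain structs.

```go
package main

import "fmt"

// Annotation keys as they appear in the manual cleanup command.
const (
	annInProgress = "stackstate.com/restore-in-progress"
	annStartedAt  = "stackstate.com/restore-started-at"
)

// Workload is a simplified, hypothetical view of a Deployment or StatefulSet.
type Workload struct {
	Kind        string
	Name        string
	Annotations map[string]string
}

// Lock describes a restore lock found on a workload (hypothetical type).
type Lock struct {
	Datastore string // assumed to be stored as the annotation value
	StartedAt string
	Holder    string // "Kind/Name" of the annotated workload
}

// FindLock returns the first restore lock found on the given workloads.
// The boolean is false when no workload carries the in-progress annotation.
func FindLock(workloads []Workload) (Lock, bool) {
	for _, w := range workloads {
		datastore, ok := w.Annotations[annInProgress]
		if !ok {
			continue
		}
		return Lock{
			Datastore: datastore,
			StartedAt: w.Annotations[annStartedAt],
			Holder:    w.Kind + "/" + w.Name,
		}, true
	}
	return Lock{}, false
}

func main() {
	workloads := []Workload{
		{Kind: "StatefulSet", Name: "example-unlocked"},
		{
			Kind: "Deployment",
			Name: "suse-observability-api",
			Annotations: map[string]string{
				annInProgress: "stackgraph",
				annStartedAt:  "2026-01-13T07:33:58Z",
			},
		},
	}
	if lock, found := FindLock(workloads); found {
		fmt.Printf("%s restore is in progress (started at %s on %s)\n",
			lock.Datastore, lock.StartedAt, lock.Holder)
	}
}
```

Storing the lock on the workloads themselves keeps the state inside the cluster, so a second CLI invocation from another machine sees the same lock, and a stuck lock can be cleared with `kubectl annotate` as shown in the error output.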

Changes:

  • Added restore lock package with conflict detection and mutual exclusion support
  • Integrated lock acquisition/release into scale down/up workflow with automatic cleanup on failure
  • Updated all restore commands to use the new locking mechanism

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| internal/orchestration/restorelock/datastore.go | Defines datastore identifiers and mutual exclusion groups |
| internal/orchestration/restorelock/lock.go | Core lock management logic with conflict checking |
| internal/orchestration/restorelock/lock_test.go | Comprehensive tests for lock functionality |
| internal/orchestration/restorelock/datastore_test.go | Tests for datastore grouping and mutual exclusion |
| internal/orchestration/scale/scale.go | New ScaleDownWithLock and ScaleUpAndReleaseLock wrapper functions |
| internal/clients/k8s/client.go | K8s client methods for managing restore lock annotations |
| internal/foundation/config/config.go | Datastore constants and method to retrieve all selectors |
| cmd/*/restore.go | Updated restore commands to use ScaleDownWithLock |
| cmd/*/check_and_finalize.go | Updated finalization to use ScaleUpAndReleaseLock |
| internal/foundation/logger/logger.go | Added emoji prefixes for better visual distinction |
| internal/foundation/logger/logger_test.go | Updated tests for emoji changes |
| cmd/cmdutils/common.go | Simplified error handling using logger |
| README.md | Added restore lock protection documentation |
| ARCHITECTURE.md | Documented restore lock pattern in architecture guide |

@viliakov viliakov merged commit 8da04a3 into main Jan 14, 2026
5 checks passed
@viliakov viliakov deleted the STAC-24078 branch January 14, 2026 10:16