45 changes: 38 additions & 7 deletions ARCHITECTURE.md
@@ -39,7 +39,8 @@ stackstate-backup-cli/
│ ├── orchestration/ # Layer 2: Workflows
│ │ ├── portforward/ # Port-forwarding orchestration
│ │ ├── scale/ # Deployment/StatefulSet scaling workflows
│ │ └── restore/ # Restore job orchestration
│ │ ├── restore/ # Restore job orchestration
│ │ └── restorelock/ # Restore lock mechanism (prevents parallel restores)
│ │
│ ├── app/ # Layer 3: Dependency Container
│ │ └── app.go # Application context and dependency injection
@@ -126,6 +127,7 @@ appCtx.Formatter
- `portforward/`: Manages Kubernetes port-forwarding lifecycle
- `scale/`: Deployment and StatefulSet scaling workflows with detailed logging
- `restore/`: Restore job orchestration (confirmation, job lifecycle, finalization, resource management)
- `restorelock/`: Prevents parallel restore operations using Kubernetes annotations

**Dependency Rules**:
- ✅ Can import: `internal/foundation/*`, `internal/clients/*`
@@ -317,16 +319,45 @@ Log: log,

### 7. Structured Logging

All operations use structured logging with consistent levels:
All operations use structured logging with consistent levels and emoji prefixes for visual clarity:

```go
log.Infof("Starting operation...")
log.Debugf("Detail: %v", detail)
log.Warningf("Non-fatal issue: %v", warning)
log.Errorf("Operation failed: %v", err)
log.Successf("Operation completed successfully")
log.Infof("Starting operation...") // No prefix
log.Debugf("Detail: %v", detail) // 🛠️ DEBUG:
log.Warningf("Non-fatal issue: %v", warning) // ⚠️ Warning:
log.Errorf("Operation failed: %v", err) // ❌ Error:
log.Successf("Operation completed") // ✅
```

### 8. Restore Lock Pattern

The `restorelock` package prevents parallel restore operations that could corrupt data:

```go
// Scale down with automatic lock acquisition
scaledApps, err := scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
    K8sClient:     k8sClient,
    Namespace:     namespace,
    LabelSelector: selector,
    Datastore:     config.DatastoreStackgraph,
    AllSelectors:  config.GetAllScaleDownSelectors(),
    Log:           log,
})

// Scale up and release lock
defer scale.ScaleUpAndReleaseLock(k8sClient, namespace, selector, log)
```

**How it works**:
1. Before scaling down, checks for existing restore locks on Deployments/StatefulSets
2. Detects conflicts for the same datastore or mutually exclusive datastores (e.g., Stackgraph and Settings)
3. Sets annotations (`stackstate.com/restore-in-progress`, `stackstate.com/restore-started-at`) on resources
4. Releases locks when scaling up or on failure

**Mutual Exclusion Groups**:
- Stackgraph and Settings restores are mutually exclusive (both modify HBase data)
- Other datastores (Elasticsearch, ClickHouse, VictoriaMetrics) are independent
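
A minimal sketch of how the annotation-based acquisition step could look. The `conflictsWith` and `acquire` helpers are hypothetical names used for illustration, not the package's actual API; only the annotation keys and the Stackgraph/Settings exclusion are taken from the description above, and the real package also covers StatefulSets and the selectors of the other datastores.

```go
// Sketch only: illustrative helpers, not the restorelock package's real API.
package restorelock

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

const (
    annInProgress = "stackstate.com/restore-in-progress"
    annStartedAt  = "stackstate.com/restore-started-at"
)

// conflictsWith reports whether two datastores must not restore in parallel:
// a datastore always conflicts with itself, and Stackgraph/Settings conflict
// with each other because both modify HBase data. Names are illustrative.
func conflictsWith(a, b string) bool {
    if a == b {
        return true
    }
    hbase := map[string]bool{"stackgraph": true, "settings": true}
    return hbase[a] && hbase[b]
}

// acquire rejects the restore if any matching Deployment already carries a
// conflicting lock annotation, then marks the matching Deployments as locked.
func acquire(ctx context.Context, c kubernetes.Interface, ns, selector, datastore string) error {
    deps, err := c.AppsV1().Deployments(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
    if err != nil {
        return fmt.Errorf("listing deployments: %w", err)
    }
    for _, d := range deps.Items {
        if owner, ok := d.Annotations[annInProgress]; ok && conflictsWith(owner, datastore) {
            return fmt.Errorf("restore of %q already in progress since %s", owner, d.Annotations[annStartedAt])
        }
    }
    for i := range deps.Items {
        d := &deps.Items[i]
        if d.Annotations == nil {
            d.Annotations = map[string]string{}
        }
        d.Annotations[annInProgress] = datastore
        d.Annotations[annStartedAt] = time.Now().UTC().Format(time.RFC3339)
        if _, err := c.AppsV1().Deployments(ns).Update(ctx, d, metav1.UpdateOptions{}); err != nil {
            return fmt.Errorf("annotating %s: %w", d.Name, err)
        }
    }
    return nil
}
```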

## Testing Strategy

### Unit Tests
29 changes: 24 additions & 5 deletions README.md
@@ -391,11 +391,12 @@ See [internal/foundation/config/testdata/validConfigMapConfig.yaml](internal/fou
│ ├── orchestration/ # Layer 2: Workflows
│ │ ├── portforward/ # Port-forwarding lifecycle
│ │ ├── scale/ # Deployment/StatefulSet scaling
│ │ └── restore/ # Restore job orchestration
│ │ ├── confirmation.go # User confirmation prompts
│ │ ├── finalize.go # Job status check and cleanup
│ │ ├── job.go # Job lifecycle management
│ │ └── resources.go # Restore resource management
│ │ ├── restore/ # Restore job orchestration
│ │ │ ├── confirmation.go # User confirmation prompts
│ │ │ ├── finalize.go # Job status check and cleanup
│ │ │ ├── job.go # Job lifecycle management
│ │ │ └── resources.go # Restore resource management
│ │ └── restorelock/ # Parallel restore prevention
│ ├── app/ # Layer 3: Dependency container
│ │ └── app.go # Application context and DI
│ └── scripts/ # Embedded bash scripts
@@ -409,6 +410,24 @@ See [internal/foundation/config/testdata/validConfigMapConfig.yaml](internal/fou
- **Dependency Injection**: Centralized dependency creation via `internal/app/` eliminates boilerplate from commands
- **Testability**: All layers use interfaces for external dependencies, enabling comprehensive unit testing
- **Clean Commands**: Commands are thin (50-100 lines) and focused on business logic
- **Restore Lock Protection**: Prevents parallel restore operations that could corrupt data

### Restore Lock Protection

The CLI prevents parallel restore operations that could corrupt data by using Kubernetes annotations on Deployments and StatefulSets. When a restore starts:

1. The CLI checks for existing restore locks before proceeding
2. If another restore is in progress for the same datastore, the operation is blocked
3. Mutually exclusive datastores are also protected (e.g., Stackgraph and Settings cannot be restored simultaneously because they share HBase data)
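
To check whether a lock is currently held, for example before removing it manually as shown below, the annotations can be inspected with kubectl (illustrative query; substitute your selector and namespace):

```bash
# Print each matching workload together with its annotations; look for
# stackstate.com/restore-in-progress and stackstate.com/restore-started-at
kubectl get deployment,statefulset -l <label-selector> -n <namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}'
```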

If a restore operation is interrupted or fails, the lock annotations may remain. To manually remove a stuck lock:

```bash
kubectl annotate deployment,statefulset -l <label-selector> \
  stackstate.com/restore-in-progress- \
  stackstate.com/restore-started-at- \
  -n <namespace>
```

See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed information about the layered architecture and design patterns.

6 changes: 3 additions & 3 deletions cmd/clickhouse/check_and_finalize.go
@@ -127,21 +127,21 @@ func waitAndFinalize(appCtx *app.Context, chClient clickhouse.Interface, operati
return finalizeRestore(appCtx)
}

// finalizeRestore finalizes the restore by executing SQL and scaling up
// finalizeRestore finalizes the restore by executing SQL, scaling up, and releasing lock
func finalizeRestore(appCtx *app.Context) error {
if err := executePostRestoreSQL(appCtx); err != nil {
appCtx.Logger.Warningf("Post-restore SQL failed: %v", err)
}

appCtx.Logger.Println()
scaleSelector := appCtx.Config.Clickhouse.Restore.ScaleDownLabelSelector
if err := scale.ScaleUpFromAnnotations(
if err := scale.ScaleUpAndReleaseLock(
appCtx.K8sClient,
appCtx.Namespace,
scaleSelector,
appCtx.Logger,
); err != nil {
return fmt.Errorf("failed to scale up: %w", err)
return fmt.Errorf("failed to scale up and release lock: %w", err)
}

appCtx.Logger.Println()
11 changes: 9 additions & 2 deletions cmd/clickhouse/restore.go
@@ -69,10 +69,17 @@ func runRestore(appCtx *app.Context) error {
}
}

// Scale down deployments/statefulsets before restore
// Scale down deployments/statefulsets before restore (with lock protection)
appCtx.Logger.Println()
scaleDownLabelSelector := appCtx.Config.Clickhouse.Restore.ScaleDownLabelSelector
_, err := scale.ScaleDown(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger)
_, err := scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
K8sClient: appCtx.K8sClient,
Namespace: appCtx.Namespace,
LabelSelector: scaleDownLabelSelector,
Datastore: config.DatastoreClickhouse,
AllSelectors: appCtx.Config.GetAllScaleDownSelectors(),
Log: appCtx.Logger,
})
if err != nil {
return err
}
18 changes: 6 additions & 12 deletions cmd/cmdutils/common.go
@@ -1,9 +1,7 @@
package cmdutils

import (
"errors"
"fmt"
"io"
"os"

"github.com/stackvista/stackstate-backup-cli/internal/app"
@@ -18,19 +16,15 @@ const (
func Run(globalFlags *config.CLIGlobalFlags, runFunc func(ctx *app.Context) error, minioRequired bool) {
appCtx, err := app.NewContext(globalFlags)
if err != nil {
exitWithError(err, os.Stderr)
_, _ = fmt.Fprintf(os.Stderr, "❌ Error: %v\n", err)
os.Exit(1)
}
if minioRequired && !appCtx.Config.Minio.Enabled {
exitWithError(errors.New("commands that interact with Minio require SUSE Observability to be deployed with .Values.global.backup.enabled=true"), os.Stderr)
appCtx.Logger.Errorf("commands that interact with Minio require SUSE Observability to be deployed with .Values.global.backup.enabled=true")
os.Exit(1)
}
if err := runFunc(appCtx); err != nil {
exitWithError(err, os.Stderr)
appCtx.Logger.Errorf(err.Error())
os.Exit(1)
}
}

// ExitWithError prints an error message to the writer and exits with status code 1.
// This is a helper function to avoid repeating error handling code in commands.
func exitWithError(err error, w io.Writer) {
_, _ = fmt.Fprintf(w, "error: %v\n", err)
os.Exit(1)
}
10 changes: 6 additions & 4 deletions cmd/elasticsearch/check_and_finalize.go
@@ -113,20 +113,22 @@ func waitAndFinalize(appCtx *app.Context, repository, snapshotName string) error
return finalizeRestore(appCtx)
}

// finalizeRestore performs post-restore finalization (scale up deployments)
// finalizeRestore performs post-restore finalization (scale up deployments and release lock)
func finalizeRestore(appCtx *app.Context) error {
appCtx.Logger.Println()
labelSelector := appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector
scaleUpFn := func() error {
return scale.ScaleUpFromAnnotations(appCtx.K8sClient, appCtx.Namespace, appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector, appCtx.Logger)
return scale.ScaleUpAndReleaseLock(appCtx.K8sClient, appCtx.Namespace, labelSelector, appCtx.Logger)
}

return restore.FinalizeRestore(scaleUpFn, appCtx.Logger)
}

// attemptScaleUp tries to scale up deployments (used when restore is not found/already complete)
// attemptScaleUp tries to scale up deployments and release lock (used when restore is not found/already complete)
func attemptScaleUp(appCtx *app.Context) error {
labelSelector := appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector
scaleUpFn := func() error {
return scale.ScaleUpFromAnnotations(appCtx.K8sClient, appCtx.Namespace, appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector, appCtx.Logger)
return scale.ScaleUpAndReleaseLock(appCtx.K8sClient, appCtx.Namespace, labelSelector, appCtx.Logger)
}

if err := scaleUpFn(); err != nil {
12 changes: 10 additions & 2 deletions cmd/elasticsearch/restore.go
@@ -91,9 +91,17 @@ func runRestore(appCtx *app.Context) error {
}
}

// Scale down deployments before restore
// Scale down deployments before restore (with lock protection)
appCtx.Logger.Println()
_, err = scale.ScaleDown(appCtx.K8sClient, appCtx.Namespace, appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector, appCtx.Logger)
scaleDownLabelSelector := appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector
_, err = scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
K8sClient: appCtx.K8sClient,
Namespace: appCtx.Namespace,
LabelSelector: scaleDownLabelSelector,
Datastore: config.DatastoreElasticsearch,
AllSelectors: appCtx.Config.GetAllScaleDownSelectors(),
Log: appCtx.Logger,
})
if err != nil {
return err
}
2 changes: 1 addition & 1 deletion cmd/settings/check_and_finalize.go
@@ -48,7 +48,7 @@ func runCheckAndFinalize(appCtx *app.Context) error {
Namespace: appCtx.Namespace,
JobName: checkJobName,
ServiceName: "settings",
ScaleUpFn: scale.ScaleUpFromAnnotations,
ScaleUpFn: scale.ScaleUpAndReleaseLock,
ScaleDownFn: scale.ScaleDown,
ScaleSelector: appCtx.Config.Settings.Restore.ScaleDownLabelSelector,
CleanupPVC: false,
15 changes: 11 additions & 4 deletions cmd/settings/restore.go
@@ -78,19 +78,26 @@ func runRestore(appCtx *app.Context) error {
}
}

// Scale down deployments before restore
// Scale down deployments before restore (with lock protection)
appCtx.Logger.Println()
scaleDownLabelSelector := appCtx.Config.Settings.Restore.ScaleDownLabelSelector
scaledDeployments, err := scale.ScaleDown(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger)
scaledDeployments, err := scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
K8sClient: appCtx.K8sClient,
Namespace: appCtx.Namespace,
LabelSelector: scaleDownLabelSelector,
Datastore: config.DatastoreSettings,
AllSelectors: appCtx.Config.GetAllScaleDownSelectors(),
Log: appCtx.Logger,
})
if err != nil {
return err
}

// Ensure deployments are scaled back up on exit (even if restore fails)
// Ensure deployments are scaled back up and lock released on exit (even if restore fails)
defer func() {
if len(scaledDeployments) > 0 && !background {
appCtx.Logger.Println()
if err := scale.ScaleUpFromAnnotations(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger); err != nil {
if err := scale.ScaleUpAndReleaseLock(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger); err != nil {
appCtx.Logger.Warningf("Failed to scale up deployments: %v", err)
}
}
2 changes: 1 addition & 1 deletion cmd/stackgraph/check_and_finalize.go
@@ -48,7 +48,7 @@ func runCheckAndFinalize(appCtx *app.Context) error {
Namespace: appCtx.Namespace,
JobName: checkJobName,
ServiceName: "stackgraph",
ScaleUpFn: scale.ScaleUpFromAnnotations,
ScaleUpFn: scale.ScaleUpAndReleaseLock,
ScaleDownFn: scale.ScaleDown,
ScaleSelector: appCtx.Config.Stackgraph.Restore.ScaleDownLabelSelector,
CleanupPVC: true,
15 changes: 11 additions & 4 deletions cmd/stackgraph/restore.go
@@ -84,19 +84,26 @@ func runRestore(appCtx *app.Context) error {
}
}

// Scale down deployments before restore
// Scale down deployments before restore (with lock protection)
appCtx.Logger.Println()
scaleDownLabelSelector := appCtx.Config.Stackgraph.Restore.ScaleDownLabelSelector
scaledDeployments, err := scale.ScaleDown(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger)
scaledDeployments, err := scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
K8sClient: appCtx.K8sClient,
Namespace: appCtx.Namespace,
LabelSelector: scaleDownLabelSelector,
Datastore: config.DatastoreStackgraph,
AllSelectors: appCtx.Config.GetAllScaleDownSelectors(),
Log: appCtx.Logger,
})
if err != nil {
return err
}

// Ensure deployments are scaled back up on exit (even if restore fails)
// Ensure deployments are scaled back up and lock released on exit (even if restore fails)
defer func() {
if len(scaledDeployments) > 0 && !background {
appCtx.Logger.Println()
if err := scale.ScaleUpFromAnnotations(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger); err != nil {
if err := scale.ScaleUpAndReleaseLock(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger); err != nil {
appCtx.Logger.Warningf("Failed to scale up deployments: %v", err)
}
}
2 changes: 1 addition & 1 deletion cmd/victoriametrics/check_and_finalize.go
@@ -48,7 +48,7 @@ func runCheckAndFinalize(appCtx *app.Context) error {
Namespace: appCtx.Namespace,
JobName: checkJobName,
ServiceName: "victoria-metrics",
ScaleUpFn: scale.ScaleUpFromAnnotations,
ScaleUpFn: scale.ScaleUpAndReleaseLock,
ScaleDownFn: scale.ScaleDown,
ScaleSelector: appCtx.Config.VictoriaMetrics.Restore.ScaleDownLabelSelector,
CleanupPVC: false,
15 changes: 11 additions & 4 deletions cmd/victoriametrics/restore.go
@@ -82,19 +82,26 @@ func runRestore(appCtx *app.Context) error {
}
}

// Scale down workload before restore
// Scale down workload before restore (with lock protection)
appCtx.Logger.Println()
scaleDownLabelSelector := appCtx.Config.VictoriaMetrics.Restore.ScaleDownLabelSelector
scaledStatefulSets, err := scale.ScaleDown(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger)
scaledStatefulSets, err := scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
K8sClient: appCtx.K8sClient,
Namespace: appCtx.Namespace,
LabelSelector: scaleDownLabelSelector,
Datastore: config.DatastoreVictoriaMetrics,
AllSelectors: appCtx.Config.GetAllScaleDownSelectors(),
Log: appCtx.Logger,
})
if err != nil {
return err
}

// Ensure workload are scaled back up on exit (even if restore fails)
// Ensure workloads are scaled back up and lock released on exit (even if restore fails)
defer func() {
if len(scaledStatefulSets) > 0 && !background {
appCtx.Logger.Println()
if err := scale.ScaleUpFromAnnotations(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger); err != nil {
if err := scale.ScaleUpAndReleaseLock(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger); err != nil {
appCtx.Logger.Warningf("Failed to scale up workload: %v", err)
}
}