NM-341-v1: Add Support for Multi-tenancy#4069
Conversation
… from database/ pkg;
… unique constraints for network, user invite and pending users;
…rics and tags tables;
…/ pkg functions with methods from the schema/ pkg;
…ables on migration;
… tags tables on migration;
…or set default scope;
|
Review Complete Files Reviewed: 102 By Severity:
Massive KV-to-GORM migration introducing 3 critical panics (nil type assertion in scope middleware, bare context.TODO() without DB, missing GORM model registration), 4 high-severity issues (tenant data isolation gaps, non-idempotent migration, upsert column overwrites), and several medium-correctness regressions. Request changes before merge. Files Reviewed (102 files) |
There was a problem hiding this comment.
Risk: 🔴 Critical (88/100) — 3 critical findings, 2 high, 3 medium, 1 low · 3410 LOC across 102 files
Critical Issues (3)
scope/scope.go:68—ScopeMiddlewarepanics on first request whenAllowMultipleTenants=falsebecause it calls.Load().(string)onatomic.Valueinitialized with rawConfig{}(nil type assertion).logic/settings.go:128—GetUserSettingsandUpsertUserSettingscalldatabase.FetchRecordwith barecontext.TODO()not carrying the DB handle; the store defaults to in-memory and panics.schema/models.go:27—TenantSettingsRecordis missing fromListModels(), so GORM never creates theserver_settingstable on fresh installs.
High Severity (4)
schema/acl_record.go:90— GORMSave-based upsert does not selectTenantID/NetworkIDcolumns, causingUPDATEto overwrite them with zero values.scope/scope.go:55— Tenant scope middleware validates theX-Tenant-IDheader but never enforces data-isolation (all schema queries run without tenant filter).migrate/migrate_v1_7_0.go:65—createDefaultsnot idempotent; re-run fails with unique constraint violation onslug='default'.controllers/network.go:306—createNetworkdoes not propagate tenant ID from scope context to the newschema.Networkrecord.
Medium & Low (3)
logic/dns.go:247—GetCustomDNSbehavior change: empty results formerly returned error, now silently succeed.logic/settings.go:126—GetUserSettingssilently returns defaults on all errors, masking real database failures.logic/settings.go:60— Widespreadcontext.TODO()/context.Background()usage loses request context propagation.
| if defaultTenantID.Load().(string) == "" { | ||
| t := &schema.Tenant{} | ||
| if err := t.GetDefault(r.Context()); err != nil { | ||
| logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultTenantFailed, logic.Internal)) | ||
| return | ||
| } | ||
|
|
||
| defaultTenantID.Store(t.ID) | ||
| } | ||
|
|
||
| id = defaultTenantID.Load().(string) |
There was a problem hiding this comment.
🔴 Panic on first request: nil type assertion on atomic.Value in scope middleware (bug)
scope/scope.go Middleware() performs defaultTenantID.Load().(string) (lines 68, 78) and defaultOrgID.Load().(string) (lines 95, 105) without nil-guarding the Load() result. Since defaultTenantID and defaultOrgID are zero-valued atomic.Value with no initial Store() call, Load() returns nil on the first invocation. The bare type assertion nil.(string) will panic, crashing the server on the very first request to any route protected by scope middleware (which is every route in this PR when AllowMultipleTenants is false, the CE default).
💡 Suggestion: Guard all Load().(string) calls with a nil check. Use a helper: func loadDefaultTenant() string { if v := defaultTenantID.Load(); v != nil { return v.(string) }; return "" }. Or initialize with defaultTenantID.Store("") at package init.
| if defaultTenantID.Load().(string) == "" { | |
| t := &schema.Tenant{} | |
| if err := t.GetDefault(r.Context()); err != nil { | |
| logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultTenantFailed, logic.Internal)) | |
| return | |
| } | |
| defaultTenantID.Store(t.ID) | |
| } | |
| id = defaultTenantID.Load().(string) | |
| v := defaultTenantID.Load() | |
| if v == nil || v.(string) == "" { | |
| t := &schema.Tenant{} | |
| if err := t.GetDefault(r.Context()); err != nil { | |
| logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultTenantFailed, logic.Internal)) | |
| return | |
| } | |
| defaultTenantID.Store(t.ID) | |
| } | |
| id, _ = defaultTenantID.Load().(string) |
📋 Prompt for AI Agents
In scope/scope.go around lines 68-78 (TenantScope) and lines 95-105 (OrgScope), replace each bare Load().(string) with a nil-safe pattern. Create a helper function: func getCachedOrDefault(model interface{}, atomicVal *atomic.Value, ctx context.Context, getter func(context.Context) error, store func(string)) (string, error) that loads, checks nil, resolves from DB if needed, stores, and returns. Apply to both defaultTenantID and defaultOrgID.
| if err := u.Get(context.TODO()); err != nil { | ||
| return defaultUserSettings | ||
| } | ||
| var userSettings models.UserSettings | ||
| err = json.Unmarshal([]byte(data), &userSettings) | ||
| if err != nil { | ||
| return defaultUserSettings | ||
| return models.UserSettings{ | ||
| Theme: u.Theme, | ||
| TextSize: u.TextSize, | ||
| ReducedMotion: u.ReducedMotion, | ||
| } | ||
|
|
||
| return userSettings | ||
| } | ||
|
|
||
| func UpsertUserSettings(userID string, userSettings models.UserSettings) error { | ||
| func UpsertUserSettings(username string, userSettings models.UserSettings) error { | ||
| if userSettings.TextSize == "" { | ||
| userSettings.TextSize = "16" | ||
| } | ||
|
|
||
| if userSettings.Theme == "" { | ||
| userSettings.Theme = models.Dark | ||
| } | ||
|
|
||
| data, err := json.Marshal(userSettings) | ||
| if err != nil { | ||
| return err | ||
| u := schema.User{ | ||
| Username: username, | ||
| Theme: userSettings.Theme, | ||
| TextSize: userSettings.TextSize, | ||
| ReducedMotion: userSettings.ReducedMotion, | ||
| } | ||
| return database.Insert(userID, string(data), database.SERVER_SETTINGS) | ||
| } | ||
|
|
||
| func DeleteUserSettings(userID string) error { | ||
| return database.DeleteRecord(database.SERVER_SETTINGS, userID) | ||
| return u.UpdateUserSettings(context.TODO()) |
There was a problem hiding this comment.
🔴 GetUserSettings and UpsertUserSettings will PANIC due to bare context.TODO() without DB (bug)
logic/settings.go:128: u.Get(context.TODO()) and logic/settings.go:151: u.UpdateUserSettings(context.TODO()). Both schema.User.Get (schema/users.go:82) and schema.User.UpdateUserSettings (schema/users.go:173) call db.FromContext(ctx) which panics when no *gorm.DB is stored in the context (db/db.go:78-80). The context.TODO() value has no DB connection — only contexts created via db.WithContext() contain the connection. Every other function in logic/settings.go correctly uses db.WithContext(context.TODO()). These are called from HTTP handlers at GET/PUT /api/users/{username}/settings, so any user accessing settings will trigger a panic.
💡 Suggestion: Wrap the context with db.WithContext: use u.Get(db.WithContext(context.TODO())) on line 128 and u.UpdateUserSettings(db.WithContext(context.TODO())) on line 151.
📋 Prompt for AI Agents
In logic/settings.go, change line 128 from u.Get(context.TODO()) to u.Get(db.WithContext(context.TODO())) and change line 151 from u.UpdateUserSettings(context.TODO()) to u.UpdateUserSettings(db.WithContext(context.TODO())). This matches the pattern used by all other functions in the same file (lines 60, 66, 109, 115).
| &Organization{}, | ||
| &Tenant{}, | ||
| &OrganizationSettings{}, | ||
| &AclRecord{}, | ||
| &CacheRecord{}, | ||
| &DNSRecord{}, | ||
| &ExtClientRecord{}, | ||
| &MetricsRecord{}, | ||
| &SsoStateRecord{}, | ||
| &TagRecord{}, | ||
| } |
There was a problem hiding this comment.
🔴 TenantSettingsRecord missing from ListModels() — server_settings table not created on fresh installs (bug)
schema/tenant_settings.go:53 defines TenantSettingsRecord with TableName() returning 'server_settings'. This model is used by logic/settings.go (getServerSettingsFromDB at line 65, UpsertServerSettings at line 114) and migrate/migrate.go (migrateSettings at line 500). However, TenantSettingsRecord is absent from ListModels() in schema/models.go (lines 27-37). On fresh installations, GORM's AutoMigrate (called via db.InitializeDB in main.go line 120) will NOT create the server_settings table. The old KV initialization that previously created this table has been removed. All server settings operations then silently fail — JWT validity falls to 0, metric intervals default to 0, DNS/STUN config is lost.
💡 Suggestion: Add &TenantSettingsRecord{} to the slice returned by ListModels() in schema/models.go.
📋 Prompt for AI Agents
In schema/models.go, inside ListModels(), add &TenantSettingsRecord{} to the returned slice alongside the other record types (AclRecord, CacheRecord, DNSRecord, etc.). This ensures GORM's AutoMigrate creates the server_settings table on fresh installs.
| func (r *AclRecord) Upsert(ctx context.Context) error { | ||
| return db.FromContext(ctx).Save(r).Error |
There was a problem hiding this comment.
🟠 GORM Save-based Upsert overwrites TenantID and NetworkID columns after migration (bug)
The schema.*Record types (AclRecord, DNSRecord, ExtClientRecord, MetricsRecord, TagRecord, CacheRecord, SsoStateRecord) implement Upsert() via db.Save(r) which generates a full-row UPDATE. All callers in logic/ and pro/logic/ construct records with Key and Value only, leaving TenantID and NetworkID at their zero values (""). The migration in migrate_v1_7_0.go setTenantID() (line 388) and setNetworkID() (line 416) populates these columns during startup, but every subsequent create/update via Upsert resets tenant_id and network_id back to empty strings, undoing migration work and causing data corruption. Affected callers include: UpdateAcl, UpsertAcl, InsertAcl, CreateDNS, SaveExtClient, SetState, UpdateMetrics, UpsertTag, InsertTag, netcache.Set.
💡 Suggestion: Replace GORM Save() with conditional logic: if the primary key exists, use Updates() omitting tenant_id and network_id; otherwise Create. Or populate TenantID/NetworkID from scope context in callers before Upsert.
📋 Prompt for AI Agents
In each schema/*_record.go Upsert method, change from db.Save(r) to a pattern that preserves tenant_id and network_id: check if the record exists (by primary key), and if so, use Updates() with a map omitting tenant_id and network_id. Alternatively, update all callers to populate TenantID from scope.ID(ctx) and NetworkID from model data before calling Upsert.
| func Middleware(level Scope, next http.Handler) http.HandlerFunc { | ||
| return func(w http.ResponseWriter, r *http.Request) { | ||
| var id string | ||
|
|
||
| switch level { | ||
| case TenantScope: | ||
| id = r.Header.Get(HeaderTenantID) | ||
| if id == "" { | ||
| if logic.GetFeatureFlags().AllowMultipleTenants { | ||
| logic.ReturnErrorResponse(w, r, logic.FormatError(errMissingTenantID, logic.BadReq)) | ||
| return | ||
| } | ||
|
|
||
| if defaultTenantID.Load().(string) == "" { | ||
| t := &schema.Tenant{} | ||
| if err := t.GetDefault(r.Context()); err != nil { | ||
| logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultTenantFailed, logic.Internal)) | ||
| return | ||
| } | ||
|
|
||
| defaultTenantID.Store(t.ID) | ||
| } | ||
|
|
||
| id = defaultTenantID.Load().(string) | ||
| } else { | ||
| t := &schema.Tenant{ID: id} | ||
| if err := t.Get(db.WithContext(r.Context())); err != nil { | ||
| logic.ReturnErrorResponse(w, r, logic.FormatError(errTenantNotFound, logic.BadReq)) | ||
| return | ||
| } | ||
| } | ||
|
|
||
| case OrgScope: | ||
| id = r.Header.Get(HeaderOrgID) | ||
| if id == "" { | ||
| if logic.GetFeatureFlags().AllowMultipleTenants { | ||
| logic.ReturnErrorResponse(w, r, logic.FormatError(errMissingOrgID, logic.BadReq)) | ||
| return | ||
| } | ||
|
|
||
| if defaultOrgID.Load().(string) == "" { | ||
| o := &schema.Organization{} | ||
| if err := o.GetDefault(r.Context()); err != nil { | ||
| logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultOrgFailed, logic.Internal)) | ||
| return | ||
| } | ||
|
|
||
| defaultOrgID.Store(o.ID) | ||
| } | ||
|
|
||
| id = defaultOrgID.Load().(string) | ||
| } else { | ||
| o := &schema.Organization{ID: id} | ||
| if err := o.Get(db.WithContext(r.Context())); err != nil { | ||
| logic.ReturnErrorResponse(w, r, logic.FormatError(errOrgNotFound, logic.BadReq)) | ||
| return | ||
| } | ||
| } | ||
|
|
||
| case GlobalScope: | ||
| // no header required | ||
| } | ||
|
|
||
| next.ServeHTTP(w, r.WithContext(WithContext(r.Context(), level, id))) | ||
| } | ||
| } | ||
|
|
||
| // WithContext stores the scope level and id in ctx and returns the new context. | ||
| // If level is non-global and id is empty, ctx is returned unchanged. | ||
| func WithContext(ctx context.Context, level Scope, id string) context.Context { | ||
| if level != GlobalScope && id == "" { | ||
| return ctx | ||
| } | ||
| ctx = context.WithValue(ctx, ctxLevel, level) | ||
| ctx = context.WithValue(ctx, ctxID, id) | ||
| return ctx | ||
| } |
There was a problem hiding this comment.
🟠 Tenant scope middleware validates tenant header but never enforces tenant data isolation (security)
The scope.Middleware validates the X-Tenant-ID header and stores the tenant ID in the request context via WithContext (scope.go:118). However, no downstream handler, schema query, or DB operation reads the tenant ID from context. All schema CRUD methods operate without tenant-scoped WHERE clauses despite tables having TenantID columns. Specific instances: Network.Get/Update/Delete at schema/networks.go:70,93,116 query by name alone (cross-tenant when two tenants have same network name); PendingUser Exists/Get/Delete at schema/pending_users.go operate globally; UserGroup.GetByName at schema/user_groups.go:49 has no tenant filter; SsoStateRecord.List() returns all tenants' records; IsNetworkNameUnique at logic/networks.go:359 checks across all tenants. In multi-tenant deployments, an authenticated user can access or mutate other tenants' data.
💡 Suggestion: All schema query methods must filter by tenant_id from request context. All handler functions must inject tenant_id from scope.ID(r.Context()) into models before creation. Authentication-only endpoints (authenticateHost, authenticate) must remain accessible without tenant scope.
📋 Prompt for AI Agents
- Add a helper in scope/scope.go:
func TenantID(ctx context.Context) string { return ID(ctx) }. 2. Modify all schema CRUD methods (networks, nodes, hosts, pending_users, user_groups, sso_state_record, etc.) to filter by tenant ID. Example for networks.go ListAll: addWhere(&Network{TenantID: scope.ID(ctx)}). 3. In all handler functions that create or update records, set TenantID from scope.ID(r.Context()) on the model before Create/Update. 4. Ensure auth-only endpoints remain accessible without tenant scope.
| func createDefaults(ctx context.Context) error { | ||
| defaultOrg := &schema.Organization{} | ||
| err := defaultOrg.CreateDefault(ctx) | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
||
| defaultTenant := &schema.Tenant{ | ||
| OrganizationID: defaultOrg.ID, | ||
| } | ||
| return defaultTenant.CreateDefault(ctx) |
There was a problem hiding this comment.
🟡 Migration createDefaults not idempotent — fails on re-run with unique constraint violation (bug)
migrate/migrate_v1_7_0.go createDefaults() creates a default Organization and Tenant by calling CreateDefault(ctx) on each. Both generate new UUIDs and set Slug="default" without checking for existing records. Organization.Slug and Tenant.Slug have uniqueIndex constraints. If the migration is re-run (partial failure recovery or HA pod), the second attempt fails with a unique constraint violation on the 'default' slug.
💡 Suggestion: Add existence check before creation: call GetDefault(ctx) first and return nil if already exists. Or handle the unique constraint error gracefully.
📋 Prompt for AI Agents
In migrate/migrate_v1_7_0.go, modify createDefaults() (lines 65-75) to first check if a default Organization and Tenant already exist. Use o.GetDefault(ctx) / t.GetDefault(ctx) and if no error, skip creation.
| @@ -267,32 +261,19 @@ func GetCustomDNS(network string) ([]models.DNSEntry, error) { | |||
| dns = append(dns, entry) | |||
| } | |||
| } | |||
|
|
|||
| return dns, err | |||
| return dns, nil | |||
| } | |||
There was a problem hiding this comment.
🟡 GetCustomDNS behavior change: empty results no longer return error (bug)
logic/dns.go GetCustomDNS (lines 247-265) was refactored from database.FetchRecords (which returned error on empty results) to schema.DNSRecord{}.List() (which returns empty slice with nil error). Previously, callers could detect 'no records' via err != nil. Now, they receive nil error with empty slice. While tests were updated, external or internal callers relying on error-based detection of empty results will silently get an empty response instead of an error.
💡 Suggestion: Document the new behavior or add explicit empty-result handling that matches the old semantics if callers depend on it.
📋 Prompt for AI Agents
In logic/dns.go around line 247-265, review all callers of GetCustomDNS. If backward compatibility is needed, add: if len(dns) == 0 { return dns, gorm.ErrRecordNotFound } at the end of the function.
| func GetUserSettings(username string) models.UserSettings { | ||
| u := schema.User{Username: username} | ||
| if err := u.Get(context.TODO()); err != nil { | ||
| return defaultUserSettings | ||
| } | ||
| var userSettings models.UserSettings | ||
| err = json.Unmarshal([]byte(data), &userSettings) | ||
| if err != nil { | ||
| return defaultUserSettings | ||
| return models.UserSettings{ | ||
| Theme: u.Theme, | ||
| TextSize: u.TextSize, | ||
| ReducedMotion: u.ReducedMotion, | ||
| } | ||
|
|
||
| return userSettings | ||
| } |
There was a problem hiding this comment.
🟡 GetUserSettings silently returns defaults on all errors, masking database failures (bug)
logic/settings.go:126-136: GetUserSettings calls u.Get(context.TODO()) and returns defaultUserSettings on ANY error — including database connection failures, serialization errors, transient issues — not just 'user not found'. The old code (removed) would propagate database.FetchRecord errors to callers. The controller at controllers/user.go:934 calls GetUserSettings without error checking, so a database outage silently gives users default theme/settings with no indication of failure.
💡 Suggestion: Distinguish between gorm.ErrRecordNotFound (return defaults) and other errors (log and/or propagate). Use a proper context parameter.
📋 Prompt for AI Agents
In logic/settings.go, change GetUserSettings to return (models.UserSettings, error). Check if err is gorm.ErrRecordNotFound — return defaults, nil. For other errors, return zero-value and error. Update callees in controllers/user.go.
| data, err := database.FetchRecord(database.SERVER_SETTINGS, ServerSettingsDBKey) | ||
| // TODO: replace with tenant ID from context once multi-tenancy is fully wired | ||
| defaultTenant := &schema.Tenant{} | ||
| err := defaultTenant.GetDefault(db.WithContext(context.TODO())) |
There was a problem hiding this comment.
🟢 Widespread context.TODO() / context.Background() usage loses request context propagation (bug)
Multiple refactored functions use context.TODO() or context.Background() instead of accepting a context.Context parameter. Key locations: logic/settings.go (getServerSettingsFromDB, UpsertServerSettings), logic/dns.go (GetCustomDNS, DeleteNetworkDNS, etc.), pro/logic/metrics.go (GetMetrics, UpdateMetrics, etc.), pro/logic/tags.go (GetTag, ListNetworkTags, etc.). While the old KV-based layer also didn't propagate context, this PR introduces the GORM-based pattern broadly without improving context propagation. Request-scoped cancellation, tracing, and timeout propagation are lost.
💡 Suggestion: Add context.Context as the first parameter to functions performing database operations and thread the caller's context through instead of context.TODO()/context.Background().
📋 Prompt for AI Agents
Refactor key functions (GetServerSettings, GetUserSettings, GetMetrics, GetTag) to accept context.Context and use it for all database operations. Thread request context from HTTP handlers through to the data layer.
Describe your changes
Provide Issue ticket number if applicable/not in title
Provide testing steps
Checklist before requesting a review