Skip to content

NM-341-v1: Add Support for Multi-tenancy#4069

Open
VishalDalwadi wants to merge 22 commits into
NM-341from
NM-341-v1
Open

NM-341-v1: Add Support for Multi-tenancy#4069
VishalDalwadi wants to merge 22 commits into
NM-341from
NM-341-v1

Conversation

@VishalDalwadi

Copy link
Copy Markdown
Collaborator

Describe your changes

Provide Issue ticket number if applicable/not in title

Provide testing steps

Checklist before requesting a review

  • My changes affect only 10 files or less.
  • I have performed a self-review of my code and tested it.
  • If it is a new feature, I have added thorough tests, my code is <= 1450 lines.
  • If it is a bugfix, my code is <= 200 lines.
  • My functions are <= 80 lines.
  • I have had my code reviewed by a peer.
  • My unit tests pass locally.
  • Netmaker is awesome.

@VishalDalwadi VishalDalwadi changed the base branch from develop to NM-341 June 24, 2026 05:03
… unique constraints for network, user invite and pending users;
…/ pkg functions with methods from the schema/ pkg;
@VishalDalwadi VishalDalwadi marked this pull request as ready for review June 29, 2026 09:50
@tenki-reviewer

tenki-reviewer Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Complete

Files Reviewed: 102
Findings: 9

By Severity:

  • 🔴 Critical: 3
  • 🟠 High: 2
  • 🟡 Medium: 3
  • 🟢 Low: 1

Massive KV-to-GORM migration introducing 3 critical panics (nil type assertion in scope middleware, bare context.TODO() without DB, missing GORM model registration), 4 high-severity issues (tenant data isolation gaps, non-idempotent migration, upsert column overwrites), and several medium-correctness regressions. Request changes before merge.

Files Reviewed (102 files)
auth/host_session.go
controllers/acls.go
controllers/dns.go
controllers/dns_test.go
controllers/egress.go
controllers/enrollmentkeys.go
controllers/ext_client.go
controllers/gateway.go
controllers/hosts.go
controllers/inet_gws.go
controllers/network.go
controllers/network_test.go
controllers/node.go
controllers/server.go
controllers/user.go
database/database.go
database/postgres.go
database/rqlite.go
database/sqlite.go
database/utils.go
logic/acls.go
logic/auth.go
logic/dns.go
logic/extpeers.go
logic/gateway.go
logic/host_test.go
logic/hostactions/hostactions.go
logic/networks.go
logic/peers.go
logic/pro/netcache/netcache.go
logic/settings.go
logic/telemetry.go
logic/usage.go
main.go
migrate/migrate.go
migrate/migrate_schema.go
migrate/migrate_test.go
migrate/migrate_v1_5_1.go
migrate/migrate_v1_6_0.go
migrate/migrate_v1_7_0.go
migrate/utils.go
models/acl.go
models/dnsEntry.go
models/extclient.go
models/metrics.go
models/node.go
models/settings.go
models/ssocache.go
models/structs.go
models/tags.go
mq/handlers.go
mq/publishers.go
pro/controllers/auto_relay.go
pro/controllers/events.go
pro/controllers/flows.go
pro/controllers/integrations.go
pro/controllers/jit.go
pro/controllers/metrics.go
pro/controllers/networks.go
pro/controllers/posture_check.go
pro/controllers/rac.go
pro/controllers/tags.go
pro/controllers/users.go
pro/logic/acls.go
pro/logic/metrics.go
pro/logic/nodes.go
pro/logic/tags.go
schema/acl_record.go
schema/cache_record.go
schema/dns.go
schema/dns_record.go
schema/egress.go
schema/enrollment_keys.go
schema/event.go
schema/extclient_record.go
schema/hosts.go
schema/integrations.go
schema/internal.go
schema/jit_grant.go
schema/jit_request.go
schema/metrics_record.go
schema/models.go
schema/networks.go
schema/nodes.go
schema/org_settings.go
schema/organizations.go
schema/pending_hosts.go
schema/pending_users.go
schema/posture_check.go
schema/posture_check_violations.go
schema/sso_state_record.go
schema/tag_record.go
schema/tenant_settings.go
schema/tenants.go
schema/user_access_token.go
schema/user_groups.go
schema/user_invites.go
schema/users.go
schema/utils.go
scope/scope.go
test/utils/db.go
test/utils/tag.go

@tenki-reviewer tenki-reviewer Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risk: 🔴 Critical (88/100) — 3 critical findings, 2 high, 3 medium, 1 low · 3410 LOC across 102 files


Critical Issues (3)

  • scope/scope.go:68ScopeMiddleware panics on first request when AllowMultipleTenants=false because it calls .Load().(string) on atomic.Value initialized with raw Config{} (nil type assertion).
  • logic/settings.go:128GetUserSettings and UpsertUserSettings call database.FetchRecord with bare context.TODO() not carrying the DB handle; the store defaults to in-memory and panics.
  • schema/models.go:27TenantSettingsRecord is missing from ListModels(), so GORM never creates the server_settings table on fresh installs.

High Severity (4)

  • schema/acl_record.go:90 — GORM Save-based upsert does not select TenantID/NetworkID columns, causing UPDATE to overwrite them with zero values.
  • scope/scope.go:55 — Tenant scope middleware validates the X-Tenant-ID header but never enforces data-isolation (all schema queries run without tenant filter).
  • migrate/migrate_v1_7_0.go:65createDefaults not idempotent; re-run fails with unique constraint violation on slug='default'.
  • controllers/network.go:306createNetwork does not propagate tenant ID from scope context to the new schema.Network record.

Medium & Low (3)

  • logic/dns.go:247GetCustomDNS behavior change: empty results formerly returned error, now silently succeed.
  • logic/settings.go:126GetUserSettings silently returns defaults on all errors, masking real database failures.
  • logic/settings.go:60 — Widespread context.TODO()/context.Background() usage loses request context propagation.

Comment thread scope/scope.go
Comment on lines +68 to +78
if defaultTenantID.Load().(string) == "" {
t := &schema.Tenant{}
if err := t.GetDefault(r.Context()); err != nil {
logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultTenantFailed, logic.Internal))
return
}

defaultTenantID.Store(t.ID)
}

id = defaultTenantID.Load().(string)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Panic on first request: nil type assertion on atomic.Value in scope middleware (bug)

scope/scope.go Middleware() performs defaultTenantID.Load().(string) (lines 68, 78) and defaultOrgID.Load().(string) (lines 95, 105) without nil-guarding the Load() result. Since defaultTenantID and defaultOrgID are zero-valued atomic.Value with no initial Store() call, Load() returns nil on the first invocation. The bare type assertion nil.(string) will panic, crashing the server on the very first request to any route protected by scope middleware (which is every route in this PR when AllowMultipleTenants is false, the CE default).

💡 Suggestion: Guard all Load().(string) calls with a nil check. Use a helper: func loadDefaultTenant() string { if v := defaultTenantID.Load(); v != nil { return v.(string) }; return "" }. Or initialize with defaultTenantID.Store("") at package init.

Suggested change
if defaultTenantID.Load().(string) == "" {
t := &schema.Tenant{}
if err := t.GetDefault(r.Context()); err != nil {
logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultTenantFailed, logic.Internal))
return
}
defaultTenantID.Store(t.ID)
}
id = defaultTenantID.Load().(string)
v := defaultTenantID.Load()
if v == nil || v.(string) == "" {
t := &schema.Tenant{}
if err := t.GetDefault(r.Context()); err != nil {
logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultTenantFailed, logic.Internal))
return
}
defaultTenantID.Store(t.ID)
}
id, _ = defaultTenantID.Load().(string)
📋 Prompt for AI Agents

In scope/scope.go around lines 68-78 (TenantScope) and lines 95-105 (OrgScope), replace each bare Load().(string) with a nil-safe pattern. Create a helper function: func getCachedOrDefault(model interface{}, atomicVal *atomic.Value, ctx context.Context, getter func(context.Context) error, store func(string)) (string, error) that loads, checks nil, resolves from DB if needed, stores, and returns. Apply to both defaultTenantID and defaultOrgID.

Comment thread logic/settings.go
Comment on lines +128 to +151
if err := u.Get(context.TODO()); err != nil {
return defaultUserSettings
}
var userSettings models.UserSettings
err = json.Unmarshal([]byte(data), &userSettings)
if err != nil {
return defaultUserSettings
return models.UserSettings{
Theme: u.Theme,
TextSize: u.TextSize,
ReducedMotion: u.ReducedMotion,
}

return userSettings
}

func UpsertUserSettings(userID string, userSettings models.UserSettings) error {
func UpsertUserSettings(username string, userSettings models.UserSettings) error {
if userSettings.TextSize == "" {
userSettings.TextSize = "16"
}

if userSettings.Theme == "" {
userSettings.Theme = models.Dark
}

data, err := json.Marshal(userSettings)
if err != nil {
return err
u := schema.User{
Username: username,
Theme: userSettings.Theme,
TextSize: userSettings.TextSize,
ReducedMotion: userSettings.ReducedMotion,
}
return database.Insert(userID, string(data), database.SERVER_SETTINGS)
}

func DeleteUserSettings(userID string) error {
return database.DeleteRecord(database.SERVER_SETTINGS, userID)
return u.UpdateUserSettings(context.TODO())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 GetUserSettings and UpsertUserSettings will PANIC due to bare context.TODO() without DB (bug)

logic/settings.go:128: u.Get(context.TODO()) and logic/settings.go:151: u.UpdateUserSettings(context.TODO()). Both schema.User.Get (schema/users.go:82) and schema.User.UpdateUserSettings (schema/users.go:173) call db.FromContext(ctx) which panics when no *gorm.DB is stored in the context (db/db.go:78-80). The context.TODO() value has no DB connection — only contexts created via db.WithContext() contain the connection. Every other function in logic/settings.go correctly uses db.WithContext(context.TODO()). These are called from HTTP handlers at GET/PUT /api/users/{username}/settings, so any user accessing settings will trigger a panic.

💡 Suggestion: Wrap the context with db.WithContext: use u.Get(db.WithContext(context.TODO())) on line 128 and u.UpdateUserSettings(db.WithContext(context.TODO())) on line 151.

📋 Prompt for AI Agents

In logic/settings.go, change line 128 from u.Get(context.TODO()) to u.Get(db.WithContext(context.TODO())) and change line 151 from u.UpdateUserSettings(context.TODO()) to u.UpdateUserSettings(db.WithContext(context.TODO())). This matches the pattern used by all other functions in the same file (lines 60, 66, 109, 115).

Comment thread schema/models.go
Comment on lines +27 to 37
&Organization{},
&Tenant{},
&OrganizationSettings{},
&AclRecord{},
&CacheRecord{},
&DNSRecord{},
&ExtClientRecord{},
&MetricsRecord{},
&SsoStateRecord{},
&TagRecord{},
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 TenantSettingsRecord missing from ListModels() — server_settings table not created on fresh installs (bug)

schema/tenant_settings.go:53 defines TenantSettingsRecord with TableName() returning 'server_settings'. This model is used by logic/settings.go (getServerSettingsFromDB at line 65, UpsertServerSettings at line 114) and migrate/migrate.go (migrateSettings at line 500). However, TenantSettingsRecord is absent from ListModels() in schema/models.go (lines 27-37). On fresh installations, GORM's AutoMigrate (called via db.InitializeDB in main.go line 120) will NOT create the server_settings table. The old KV initialization that previously created this table has been removed. All server settings operations then silently fail — JWT validity falls to 0, metric intervals default to 0, DNS/STUN config is lost.

💡 Suggestion: Add &TenantSettingsRecord{} to the slice returned by ListModels() in schema/models.go.

📋 Prompt for AI Agents

In schema/models.go, inside ListModels(), add &TenantSettingsRecord{} to the returned slice alongside the other record types (AclRecord, CacheRecord, DNSRecord, etc.). This ensures GORM's AutoMigrate creates the server_settings table on fresh installs.

Comment thread schema/acl_record.go
Comment on lines +90 to +91
func (r *AclRecord) Upsert(ctx context.Context) error {
return db.FromContext(ctx).Save(r).Error

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 GORM Save-based Upsert overwrites TenantID and NetworkID columns after migration (bug)

The schema.*Record types (AclRecord, DNSRecord, ExtClientRecord, MetricsRecord, TagRecord, CacheRecord, SsoStateRecord) implement Upsert() via db.Save(r) which generates a full-row UPDATE. All callers in logic/ and pro/logic/ construct records with Key and Value only, leaving TenantID and NetworkID at their zero values (""). The migration in migrate_v1_7_0.go setTenantID() (line 388) and setNetworkID() (line 416) populates these columns during startup, but every subsequent create/update via Upsert resets tenant_id and network_id back to empty strings, undoing migration work and causing data corruption. Affected callers include: UpdateAcl, UpsertAcl, InsertAcl, CreateDNS, SaveExtClient, SetState, UpdateMetrics, UpsertTag, InsertTag, netcache.Set.

💡 Suggestion: Replace GORM Save() with conditional logic: if the primary key exists, use Updates() omitting tenant_id and network_id; otherwise Create. Or populate TenantID/NetworkID from scope context in callers before Upsert.

📋 Prompt for AI Agents

In each schema/*_record.go Upsert method, change from db.Save(r) to a pattern that preserves tenant_id and network_id: check if the record exists (by primary key), and if so, use Updates() with a map omitting tenant_id and network_id. Alternatively, update all callers to populate TenantID from scope.ID(ctx) and NetworkID from model data before calling Upsert.

Comment thread scope/scope.go
Comment on lines +55 to +131
func Middleware(level Scope, next http.Handler) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
var id string

switch level {
case TenantScope:
id = r.Header.Get(HeaderTenantID)
if id == "" {
if logic.GetFeatureFlags().AllowMultipleTenants {
logic.ReturnErrorResponse(w, r, logic.FormatError(errMissingTenantID, logic.BadReq))
return
}

if defaultTenantID.Load().(string) == "" {
t := &schema.Tenant{}
if err := t.GetDefault(r.Context()); err != nil {
logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultTenantFailed, logic.Internal))
return
}

defaultTenantID.Store(t.ID)
}

id = defaultTenantID.Load().(string)
} else {
t := &schema.Tenant{ID: id}
if err := t.Get(db.WithContext(r.Context())); err != nil {
logic.ReturnErrorResponse(w, r, logic.FormatError(errTenantNotFound, logic.BadReq))
return
}
}

case OrgScope:
id = r.Header.Get(HeaderOrgID)
if id == "" {
if logic.GetFeatureFlags().AllowMultipleTenants {
logic.ReturnErrorResponse(w, r, logic.FormatError(errMissingOrgID, logic.BadReq))
return
}

if defaultOrgID.Load().(string) == "" {
o := &schema.Organization{}
if err := o.GetDefault(r.Context()); err != nil {
logic.ReturnErrorResponse(w, r, logic.FormatError(errDefaultOrgFailed, logic.Internal))
return
}

defaultOrgID.Store(o.ID)
}

id = defaultOrgID.Load().(string)
} else {
o := &schema.Organization{ID: id}
if err := o.Get(db.WithContext(r.Context())); err != nil {
logic.ReturnErrorResponse(w, r, logic.FormatError(errOrgNotFound, logic.BadReq))
return
}
}

case GlobalScope:
// no header required
}

next.ServeHTTP(w, r.WithContext(WithContext(r.Context(), level, id)))
}
}

// WithContext stores the scope level and id in ctx and returns the new context.
// If level is non-global and id is empty, ctx is returned unchanged.
func WithContext(ctx context.Context, level Scope, id string) context.Context {
if level != GlobalScope && id == "" {
return ctx
}
ctx = context.WithValue(ctx, ctxLevel, level)
ctx = context.WithValue(ctx, ctxID, id)
return ctx
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Tenant scope middleware validates tenant header but never enforces tenant data isolation (security)

The scope.Middleware validates the X-Tenant-ID header and stores the tenant ID in the request context via WithContext (scope.go:118). However, no downstream handler, schema query, or DB operation reads the tenant ID from context. All schema CRUD methods operate without tenant-scoped WHERE clauses despite tables having TenantID columns. Specific instances: Network.Get/Update/Delete at schema/networks.go:70,93,116 query by name alone (cross-tenant when two tenants have same network name); PendingUser Exists/Get/Delete at schema/pending_users.go operate globally; UserGroup.GetByName at schema/user_groups.go:49 has no tenant filter; SsoStateRecord.List() returns all tenants' records; IsNetworkNameUnique at logic/networks.go:359 checks across all tenants. In multi-tenant deployments, an authenticated user can access or mutate other tenants' data.

💡 Suggestion: All schema query methods must filter by tenant_id from request context. All handler functions must inject tenant_id from scope.ID(r.Context()) into models before creation. Authentication-only endpoints (authenticateHost, authenticate) must remain accessible without tenant scope.

📋 Prompt for AI Agents
  1. Add a helper in scope/scope.go: func TenantID(ctx context.Context) string { return ID(ctx) }. 2. Modify all schema CRUD methods (networks, nodes, hosts, pending_users, user_groups, sso_state_record, etc.) to filter by tenant ID. Example for networks.go ListAll: add Where(&Network{TenantID: scope.ID(ctx)}). 3. In all handler functions that create or update records, set TenantID from scope.ID(r.Context()) on the model before Create/Update. 4. Ensure auth-only endpoints remain accessible without tenant scope.

Comment thread migrate/migrate_v1_7_0.go
Comment on lines +65 to +75
func createDefaults(ctx context.Context) error {
defaultOrg := &schema.Organization{}
err := defaultOrg.CreateDefault(ctx)
if err != nil {
return err
}

defaultTenant := &schema.Tenant{
OrganizationID: defaultOrg.ID,
}
return defaultTenant.CreateDefault(ctx)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Migration createDefaults not idempotent — fails on re-run with unique constraint violation (bug)

migrate/migrate_v1_7_0.go createDefaults() creates a default Organization and Tenant by calling CreateDefault(ctx) on each. Both generate new UUIDs and set Slug="default" without checking for existing records. Organization.Slug and Tenant.Slug have uniqueIndex constraints. If the migration is re-run (partial failure recovery or HA pod), the second attempt fails with a unique constraint violation on the 'default' slug.

💡 Suggestion: Add existence check before creation: call GetDefault(ctx) first and return nil if already exists. Or handle the unique constraint error gracefully.

📋 Prompt for AI Agents

In migrate/migrate_v1_7_0.go, modify createDefaults() (lines 65-75) to first check if a default Organization and Tenant already exist. Use o.GetDefault(ctx) / t.GetDefault(ctx) and if no error, skip creation.

Comment thread logic/dns.go
Comment on lines 247 to 265
@@ -267,32 +261,19 @@ func GetCustomDNS(network string) ([]models.DNSEntry, error) {
dns = append(dns, entry)
}
}

return dns, err
return dns, nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 GetCustomDNS behavior change: empty results no longer return error (bug)

logic/dns.go GetCustomDNS (lines 247-265) was refactored from database.FetchRecords (which returned error on empty results) to schema.DNSRecord{}.List() (which returns empty slice with nil error). Previously, callers could detect 'no records' via err != nil. Now, they receive nil error with empty slice. While tests were updated, external or internal callers relying on error-based detection of empty results will silently get an empty response instead of an error.

💡 Suggestion: Document the new behavior or add explicit empty-result handling that matches the old semantics if callers depend on it.

📋 Prompt for AI Agents

In logic/dns.go around line 247-265, review all callers of GetCustomDNS. If backward compatibility is needed, add: if len(dns) == 0 { return dns, gorm.ErrRecordNotFound } at the end of the function.

Comment thread logic/settings.go
Comment on lines +126 to 136
func GetUserSettings(username string) models.UserSettings {
u := schema.User{Username: username}
if err := u.Get(context.TODO()); err != nil {
return defaultUserSettings
}
var userSettings models.UserSettings
err = json.Unmarshal([]byte(data), &userSettings)
if err != nil {
return defaultUserSettings
return models.UserSettings{
Theme: u.Theme,
TextSize: u.TextSize,
ReducedMotion: u.ReducedMotion,
}

return userSettings
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 GetUserSettings silently returns defaults on all errors, masking database failures (bug)

logic/settings.go:126-136: GetUserSettings calls u.Get(context.TODO()) and returns defaultUserSettings on ANY error — including database connection failures, serialization errors, transient issues — not just 'user not found'. The old code (removed) would propagate database.FetchRecord errors to callers. The controller at controllers/user.go:934 calls GetUserSettings without error checking, so a database outage silently gives users default theme/settings with no indication of failure.

💡 Suggestion: Distinguish between gorm.ErrRecordNotFound (return defaults) and other errors (log and/or propagate). Use a proper context parameter.

📋 Prompt for AI Agents

In logic/settings.go, change GetUserSettings to return (models.UserSettings, error). Check if err is gorm.ErrRecordNotFound — return defaults, nil. For other errors, return zero-value and error. Update callees in controllers/user.go.

Comment thread logic/settings.go
data, err := database.FetchRecord(database.SERVER_SETTINGS, ServerSettingsDBKey)
// TODO: replace with tenant ID from context once multi-tenancy is fully wired
defaultTenant := &schema.Tenant{}
err := defaultTenant.GetDefault(db.WithContext(context.TODO()))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Widespread context.TODO() / context.Background() usage loses request context propagation (bug)

Multiple refactored functions use context.TODO() or context.Background() instead of accepting a context.Context parameter. Key locations: logic/settings.go (getServerSettingsFromDB, UpsertServerSettings), logic/dns.go (GetCustomDNS, DeleteNetworkDNS, etc.), pro/logic/metrics.go (GetMetrics, UpdateMetrics, etc.), pro/logic/tags.go (GetTag, ListNetworkTags, etc.). While the old KV-based layer also didn't propagate context, this PR introduces the GORM-based pattern broadly without improving context propagation. Request-scoped cancellation, tracing, and timeout propagation are lost.

💡 Suggestion: Add context.Context as the first parameter to functions performing database operations and thread the caller's context through instead of context.TODO()/context.Background().

📋 Prompt for AI Agents

Refactor key functions (GetServerSettings, GetUserSettings, GetMetrics, GetTag) to accept context.Context and use it for all database operations. Thread request context from HTTP handlers through to the data layer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant