feat(optimizer): [3/N] Analyzer by mkuchenbecker · Pull Request #533 · linkedin/openhouse

mkuchenbecker · 2026-04-07T16:38:43Z

Optimizer Stack

PR	Content
#527	Data Model
#530	Database Repos
#531	REST service
#533 (this)	Analyzer app
#534	Scheduler app
#599	Spark BatchedOFD app
#tbd	Infra, docker-compose, smoke test

Summary

PR 3 of N in the optimizer stack.

Introduces apps/optimizer-analyzer, a Spring Boot CommandLineRunner that evaluates every table in table_stats against pluggable OperationAnalyzer strategies. The first strategy, OrphanFilesDeletionAnalyzer, schedules OFD operations with 24h success / 1h failure retry cadence, a 6h SCHEDULED timeout, and a 5-strike circuit breaker.

Key design choices:

Bulk-loads operations and history into maps (one query per type), then iterates the stats list — O(types) queries, not O(tables).
Uses the existing generic find() repository methods with null params.
Pure unit tests with Mockito — no Spring context needed.

Changes

Core: AnalyzerRunner — loads table_stats, pre-loads operations and history into maps, evaluates each table against all analyzers, circuit breaker logic.

Strategy interface: OperationAnalyzer — isEnabled(table), shouldSchedule(table, currentOp, latestHistory), getCircuitBreakerThreshold().

Cadence policy: CadencePolicy — encapsulates time-based retry logic shared across operation types.

OFD analyzer: OrphanFilesDeletionAnalyzer — enabled via maintenance.optimizer.ofd.enabled table property.

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

25 unit tests:

AnalyzerRunnerTest (7 tests) — eligible table insertion, cadence skip, disabled table, shouldSchedule=false, null UUID, circuit breaker trip, below-threshold pass
OrphanFilesDeletionAnalyzerTest (18 tests) — isEnabled variants, shouldSchedule for no-op/PENDING/SCHEDULING/SCHEDULED with history combinations

./gradlew :apps:optimizer-analyzer:test
# BUILD SUCCESSFUL — 25 tests pass

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

Introduces the optimizer service module with: - MySQL/H2 schema for table_operations, table_stats, table_stats_history, and table_operations_history - JPA entities with JSON column support (vladmihalcea hibernate-types) - All model/DTO/enum types: OperationType, OperationStatus, TableStats, CompleteOperationRequest, JobResult, OperationMetrics, etc. - JPA AttributeConverters for JobResult and OperationMetrics JSON columns - MapStruct mapper (OptimizerMapper) for entity→DTO conversion - Spring Boot application shell and build wiring (settings.gradle, build.gradle dockerPrereqs) No repositories, controllers, or service layer yet — those follow in subsequent PRs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove OperationMetrics class and converter; stats are read directly from table_stats instead of duplicating into operations - Remove orphanFilesDeleted/orphanBytesDeleted from history entity, DTO, and schema; operation-specific data belongs in the result JSON - Add addedSizeBytes to CommitDelta for tracking write volume - Fix OperationType javadoc to describe current state, not roadmap - Fix TableOperationsHistoryRow javadoc: written on operation complete, not by Spark app directly - Add field comments to all DTOs and request objects Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Spring Data JPA repositories for all four optimizer tables with filtered query support. Includes tests exercising save/find, filtered queries, upsert semantics, and append-only history. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Address PR review comments: rename findFiltered → find across all repos, remove redundant findByTableUuid/findByTableUuidSince from history repos, add explicit assertion to context test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Service interface and implementation for all optimizer CRUD operations including complete-operation lifecycle, stats upsert with history double-write, and filtered queries. Three REST controllers expose the endpoints. The apps/optimizer shared module provides lightweight entity/repo copies for the analyzer and scheduler apps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Align OptimizerDataServiceImpl with renamed repository methods from optimizer-1 review feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Shared JPA entities and repositories for optimizer apps (analyzer, scheduler). All repos expose a single find method with optional filters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Resolve repo conflicts by taking optimizer-1's clean find-only versions. Scheduler-specific methods and streamAll removed per review feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

These fields never belonged in the data model — remove them at the source rather than adding then deleting in a later PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Propagate CompleteOperationRequest orphan field removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

H2 integration tests for OptimizerDataServiceImpl covering completeOperation (write history, not-found) and upsertTableStats (create, update, history append). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Strengthen upsertTableStats test to verify history rows contain the raw delta stats from each call, not just the row count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ling Introduces apps/optimizer-analyzer, a Spring Boot CommandLineRunner that evaluates every table in table_stats against pluggable OperationAnalyzer strategies. The first strategy, OrphanFilesDeletionAnalyzer, schedules OFD operations with 24h success / 1h failure retry cadence, a 6h SCHEDULED timeout, and a 5-strike circuit breaker. Key design choices: - Bulk-loads operations and history into maps (one query per type), then iterates the stats list — O(types) queries, not O(tables). - Uses the existing generic find() repository methods with null params. - Pure unit tests with Mockito — no Spring context needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@transactional

- AnalyzerRunner.analyzeDatabase: @transactional; try/catch+continue per table save; per-table log dropped to DEBUG, aggregate INFO per database; comment on dual defaultLimit (Mockito tests vs @value); replace TODO(scale-test) inline comment with link to BDP-102738. - CadencePolicy: extract intervalFor(status) with exhaustive switch + throwing default so new HistoryStatusDto values surface explicitly. - CadenceBasedOrphanFilesDeletionAnalyzer: drop redundant @Autowired on the single constructor. - CadenceBasedOrphanFilesDeletionAnalyzerTest: rename SUCCESS_INTERVAL/ FAILURE_INTERVAL to TEST_*; type tableWithProperty as boolean; drop redundant tableWithoutProperty (covered by _whenTablePropertiesEmpty). - TableOperationsHistoryDto: new after() method (replaces ad-hoc AnalyzerRunner::moreRecentHistory comparator); trusts service-set completedAt invariant. - application.properties: ofd.success-retry-hours 24 -> 16 so a re-eval is guaranteed within any rolling 24h window regardless of when the prior run landed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

abhisheknath2011 · 2026-05-23T00:16:44Z

Regarding the directory structure, can we make follow two options:

Root project can be apps/optimizer/analyzer
Root project can be existing one apps/spark and source dir can reflect optimizer.analyzer in the package name.

Address PR #533 review (abhisheknath2011): > Root project can be apps/optimizer/analyzer Move: - apps/optimizer-analyzer/ → apps/optimizer/analyzer/ - package com.linkedin.openhouse.analyzer → com.linkedin.openhouse.optimizer.analyzer - Gradle module :apps:optimizer-analyzer → :apps:optimizer:analyzer Same shape will land on opt-4 (apps/optimizer-scheduler → apps/optimizer/ scheduler) in a follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…duler Brings the analyzer rename from opt-3 (apps/optimizer-analyzer → apps/optimizer/analyzer, package com.linkedin.openhouse.analyzer → com.linkedin.openhouse.optimizer.analyzer). Applies the same shape to the scheduler per PR #533 review: - apps/optimizer-scheduler/ → apps/optimizer/scheduler/ - package com.linkedin.openhouse.scheduler → com.linkedin.openhouse.optimizer.scheduler - Gradle module :apps:optimizer-scheduler → :apps:optimizer:scheduler settings.gradle conflict resolved to include both nested modules.

…nalyzer Per PR #533 review (abhisheknath2011, 2026-05-26): > As analyzer is going to a service with APIs exposed, can we move the > code to the services module instead of keeping under apps module? Move: - apps/optimizer/analyzer/ → services/optimizer/analyzer/ - Gradle module :apps:optimizer:analyzer → :services:optimizer:analyzer - Package unchanged (com.linkedin.openhouse.optimizer.analyzer) Same shape will land on opt-4 (apps/optimizer/scheduler → services/ optimizer/scheduler). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

abhisheknath2011 · 2026-05-26T22:36:51Z

Regarding the directory structure, can we make follow two options:

Root project can be apps/optimizer/analyzer

Root project can be existing one apps/spark and source dir can reflect optimizer.analyzer in the package name.

Somehow I don't see previous comment to move Analyzer from apps to services module as Analyzer is a service. So added the comment here.

mkuchenbecker · 2026-05-26T22:39:45Z

PTAL @abhisheknath2011

Comment is here: https://github.com/linkedin/openhouse/pull/533/changes#r3262568853

…per-database Per PR #533 review (abhisheknath2011): > Is this related to max default number of tables to be processed by on > execution cycle of Analyzer run? Can we update the doc here and reflect > the same on the property name as well? Yes — it bounds the per-database working set across the three pre-load reads. Rename for clarity: - Property: optimizer.repo.default-limit → analyzer.max-tables-per-database - Field: defaultLimit → maxTablesPerDatabase - Drop the Mockito-flavored implementation note; replace with a javadoc describing semantics (per-database cap; tables beyond the bound defer to the next cycle). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Per PR #533 review (abhisheknath2011): > Going by the paging default limit of 10k, the DBs with more that 10k > tables would bot be processed. Can we ensure that all the tables within > a given DB are process by adjusting the pagination list until we reach > the end? Previous code fetched page 0 only — tables past index 10000 were silently dropped, not "deferred to the next cycle" as the stale doc claimed. Now analyzeDatabase iterates pages until the page comes back smaller than the page size (terminates on partial or empty page). - Wrap the read + per-table loop in a while-pageNumber loop. - Re-load currentOps and latestHistory per page (per-page bound on the in-memory maps; affected tables whose ops fall in a different page get treated as "no current op" → may produce a duplicate PENDING that the scheduler's cancelDuplicates handles). - Rename property: analyzer.max-tables-per-database → analyzer.tables-page-size (it caps a page, not the cycle) - Rename field: maxTablesPerDatabase → tablesPageSize - New test: analyze_iteratesAllPages_processesEveryTableAcrossPageBoundary exercises 3 tables across 2 pages with page size 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…se/optimizer/analyzer/AnalyzerRunner.java Co-authored-by: Abhishek Nath <anath1@linkedin.com>

…ps/optimizer/analyzerapp (#602) ## Status **(WIP)** — inspection branch on top of `mkuchenb/optimizer-3` (PR #533). Splits the analyzer into library + deployable. ## Summary Library/deployable split: - `services/optimizer/analyzer/` — library (analysis logic; `AnalyzerRunner`, `OperationAnalyzer`, `CadencePolicy`, `CadenceBasedOrphanFilesDeletionAnalyzer`) - `apps/optimizer/analyzerapp/` — deployable Spring Boot wrapper (just `AnalyzerApplication` + `application.properties`) The analysis code stays in `services/`; only the `@SpringBootApplication` entry point moves to `apps/`. ## Why `analyzerapp` and not `analyzer` as the leaf name Two Gradle leaf projects both named `analyzer` — one at `:services:optimizer:analyzer`, one at `:apps:optimizer:analyzer` — produced a self-referential `compileJava → compileJava` cycle. Disambiguating the apps leaf to `analyzerapp` avoids it. The exact Gradle mechanism wasn't pinned down (could be the leaf collision, could be the also-colliding implicit parent at `:apps:optimizer` vs explicit `:services:optimizer`), but the rename is the smallest change that makes it build. ## Build changes `services/optimizer/analyzer/build.gradle`: - `api project(':services:optimizer')` so `OperationTypeDto`, repos, etc. are visible on consumers' compile classpath. - `bootJar { enabled = false }` — no `@SpringBootApplication` here anymore. - `jar.archiveClassifier = ''` so the library publishes `analyzer.jar`. `apps/optimizer/analyzerapp/build.gradle` (new): - `openhouse.springboot-ext-conventions` + Spring Boot 2.7.8 plugins. - `implementation project(':services:optimizer:analyzer')` + Spring Boot starter + JPA + MySQL driver. `settings.gradle`: `include ':apps:optimizer:analyzerapp'`. ## Testing - `./gradlew :services:optimizer:analyzer:test` — 20 tests green. - `./gradlew :apps:optimizer:analyzerapp:bootJar` — produces `build/analyzerapp/libs/analyzerapp.jar`. ## What does NOT change - Package: every class keeps `com.linkedin.openhouse.optimizer.analyzer`. `AnalyzerApplication`'s `@SpringBootApplication` component-scan picks up the library classes because they sit in the same package, just in a different module on the classpath. - The scheduler (opt-4) is unchanged in this PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: mkuchenbecker <mkuchenbecker@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

mkuchenbecker and others added 14 commits April 3, 2026 11:00

fix: update service impl to use consolidated find methods

ef3260f

Align OptimizerDataServiceImpl with renamed repository methods from optimizer-1 review feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(optimizer): add apps/optimizer shared module with find-only repos

ac1da01

Shared JPA entities and repositories for optimizer apps (analyzer, scheduler). All repos expose a single find method with optional filters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge mkuchenb/optimizer-1 into optimizer-2

be353ca

Resolve repo conflicts by taking optimizer-1's clean find-only versions. Scheduler-specific methods and streamAll removed per review feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: remove orphan fields from CompleteOperationRequest

02a5ab3

These fields never belonged in the data model — remove them at the source rather than adding then deleting in a later PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge mkuchenb/optimizer-0 into optimizer-1

5c78c8f

Propagate CompleteOperationRequest orphan field removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge mkuchenb/optimizer-1 into optimizer-2

2ddc445

Propagate CompleteOperationRequest orphan field removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(optimizer): add service-layer integration tests

01466c7

H2 integration tests for OptimizerDataServiceImpl covering completeOperation (write history, not-found) and upsertTableStats (create, update, history append). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: assert stats history delta values in upsert test

ff07fde

Strengthen upsertTableStats test to verify history rows contain the raw delta stats from each call, not just the row count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>