feat(actions): Pluggable DQ Actions & Alerting#1289
Draft
mwojtyczka wants to merge 76 commits into
Draft
Conversation
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
…ment extensibility Co-authored-by: Isaac
…tion) Co-authored-by: Isaac
Add six new exception classes to errors.py (TerminalActionError, PipelineFailedError, InvalidConditionError, InvalidActionError, AlertDeliveryError, UnsafeWebhookUrlError), and new config dataclasses DQSecret, TableActionsStorageConfig, LakebaseActionsStorageConfig, ActionEventsConfig to config.py; add actions_location field to RunConfig. Co-authored-by: Isaac
- LakebaseActionsStorageConfig.instance_name changed from str|None=None to a required str field; required fields (location, instance_name) now precede defaulted ones; dead-code guard removed from _split_location. - TableActionsStorageConfig and ActionEventsConfig __post_init__ now validate mode is 'append' or 'overwrite', matching LakebaseActionsStorageConfig. - All inline imports in test_action_config.py hoisted to module level to satisfy pylint C0415. Co-authored-by: Isaac
Introduces databricks.labs.dqx.actions package with ConditionEvaluator that gates DQ actions on metric expressions using a safe AST walker — no eval/exec. Supports arithmetic, comparison, boolean, and literal nodes; raises InvalidConditionError for any other node type or unknown metric. Co-authored-by: Isaac
…p operator errors - Add _validate_tree() that uses ast.walk to visit every AST node and reject disallowed types before any evaluation begins; called unconditionally at the top of both validate() and evaluate() so short-circuit evaluation cannot bypass the allowlist. - Add _ALLOWED_NODE_TYPES frozenset (single definition, reused by the pre-pass); includes ast.Load and abstract base types produced by ast.walk on valid conditions. - Wrap operator application in _eval_binop, _eval_compare, and _eval_unaryop with try/except (ZeroDivisionError, TypeError, OverflowError) and re-raise as InvalidConditionError. - Remove redundant type(node.op) not in _BOOL_OPS guard in _eval_boolop (unreachable — only And/Or are BoolOp ops). - Add 6 new tests: full-tree pre-pass coverage and operator-error wrapping; total 58 tests, all passing. Co-authored-by: Isaac
Add AlertMessage frozen dataclass and StandardMessageBuilder to actions/message.py; builder takes primitives to avoid a circular import with ActionContext (Task 5). 26 unit tests cover the TDD RED/GREEN cycle. Co-authored-by: Isaac
…ey collision - Change test_is_frozen to assert dataclasses.FrozenInstanceError specifically instead of the overly broad Exception. - Prefix per-metric entries in the fields dict with "metric." (e.g. "metric.error_row_count") so a metric named after a reserved key (condition, run_id, run_time, table) cannot silently overwrite or be overwritten by the metadata entries. observed_metrics remains un-prefixed. - Update existing field-assertion tests to use the new "metric.<name>" keys. - Add TestStandardMessageBuilderReservedKeyCollision test that verifies both fields["metric.condition"] and fields["condition"] coexist correctly. Co-authored-by: Isaac
Adds SecretResolver to the actions package, resolving plain strings as-is and DQSecret references via ws.dbutils.secrets.get at delivery time. API failures are wrapped in InvalidParameterError without leaking the resolved secret value. Co-authored-by: Isaac
Introduces the foundational building blocks for the DQX actions & alerting subsystem: ActionStatus enum, ActionContext / ActionResult / ActionServices frozen dataclasses, the Action abstract base class, and DQAction (condition + action binding with eager validation). WebhookClient and SparkSession are guarded behind TYPE_CHECKING to keep the module importable without delivery.py or PySpark present. Co-authored-by: Isaac
Move sys import to top-level, replace abstract-instantiation test with inspect.isabstract check, remove unused error re-exports from base.py, and delete the test_action_base.py per-file pylint override block. Co-authored-by: Isaac
Implements WebhookAuth, validate_webhook_url (SSRF guard), and WebhookClient (urllib-only, no-redirect opener, exponential-backoff retry, no secrets in errors). Co-authored-by: Isaac
…able 4xx Match stdlib redirect_request signature and type the opener param so mypy needs no overrides; catch OSError once (HTTPError subclass) with a single last_exc assignment to satisfy pylint and remove duplicated backoff; fail fast on non-retryable 4xx (not 429); avoid type-ignores in tests. Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Adds CallbackDQAlertDestination, an in-process destination that invokes a user-supplied Python callable on delivery. Not serializable (Task 11 skips it); validate() enforces non-empty name and a callable callback. Co-authored-by: Isaac
Co-authored-by: Isaac
Implements DQAlert (with DQAlertFrequency/NotifyOn enums) for concurrent multi-destination alerting with per-destination error isolation, and FailPipeline which raises PipelineFailedError to terminate the DQX run. Co-authored-by: Isaac
… type-ignores - DQAlert.validate() now rejects duplicate destination names (they would silently clobber entries in ActionResult.destination_errors). - Drop spurious '# type: ignore[import-untyped]' on the typed WebhookClient import and replace a None-defaulted list field with field(default_factory). - Broaden the CWE-117 sanitization test to cover tab/ANSI/null control chars. Co-authored-by: Isaac
Co-authored-by: Isaac
…ept in event-store test Add explicit Callable return type (AGENTS all-annotations rule), remove the redundant try/except around DROP TABLE IF EXISTS in the integration cleanup fixture, and add an exact-1h HOURLY boundary test. Co-authored-by: Isaac
Implements ActionSerializer (with registry-based OCP design for actions/destinations, DQSecret tagged-dict round-trip, CallbackDQAlertDestination skip with warning), TableActionsStorageHandler and LakebaseActionsStorageHandler, ActionsStorageHandlerFactory, and DQActionManager for save/load of DQAction definitions to UC Delta or Lakebase. Adds 23 unit tests (all passing) and integration tests for the UC Delta path. Co-authored-by: Isaac
… fully OCP Validate run_config_name via re.fullmatch before Delta replaceWhere interpolation (mirrors checks_storage.py guard; raises UnsafeSqlQueryError for unsafe chars). Extracted into pure helper build_replace_where_predicate for unit-testability. Added _ACTION_SERIALIZERS and _DESTINATION_SERIALIZERS registries to serializer.py so both serialize and deserialize sides are registry-driven (no isinstance chains). Adding a new action/destination type now requires only one registry entry per side. Co-authored-by: Isaac
delivery.py exists and is fully typed since Task 6, so the TYPE_CHECKING import no longer needs the suppression. Co-authored-by: Isaac
Replace bare operator.* references (typed as object) in _UNARY_OPS, _BIN_OPS, _CMP_OPS, and _EVALUATORS with thin typed wrapper functions (lhs/rhs/val params, cast to float for numeric ops) and precise Callable/dict value types. Node-evaluator functions now share a uniform (ast.AST, metrics) -> object signature via internal cast. Zero inline type: ignore remain; pylint 10/10, mypy and ruff clean, 58 tests pass. Co-authored-by: Isaac
Co-authored-by: Isaac
…re-abort Remove dead _healthy_result helper, update module docstring, and add two new tests: one proving deferred[0] (first terminal error) is raised rather than the last, and one proving alert actions execute and record before a subsequent terminal action aborts the pipeline. Co-authored-by: Isaac
…nnotation-unchecked Untyped test bodies were skipped by mypy (annotation-unchecked notes). Add the return annotations so the bodies are type-checked; _make_event now returns a QueryProgressEvent via typing.cast so the now-checked onQueryProgress calls type-check. Co-authored-by: Isaac
1 task
1 task
Convert the actions models from dataclasses to Pydantic v2 BaseModels so construction-time validation and (de)serialization are driven by Pydantic instead of hand-built validate() methods. - Action becomes a Pydantic ABC; the no-op validate() is removed. - DQAction moves to actions/dq_action.py with action typed as the AnyAction discriminated union (DQAlert | FailPipeline), resolving the base<->alert import cycle. Re-exported from actions/__init__.py. - DQAlert / FailPipeline gain a literal `type` discriminator; DQAlert keeps destination uniqueness/non-empty checks and a field_serializer that excludes CallbackDQAlertDestination from persisted output. - Destinations become Pydantic models with a literal `type`; webhook_url / username / password use the SecretOrStr field type for DQSecret round-trip. - AnyDestination discriminated union added (destinations/union.py). Co-authored-by: Isaac
…/validate Replace the four type registries and per-type build/serialize helpers with a thin facade: to_dict delegates to DQAction.model_dump(mode="json") (omitting a None condition), and from_dict wraps DQAction.model_validate, surfacing any pydantic.ValidationError as InvalidActionError. An unknown action or destination type now fails the discriminated-union match and raises InvalidActionError, the same external contract as before. serializer.py shrinks from 499 to 89 lines. Consumers (definition_storage, manager, evaluator, state, engine) are updated to import DQAction from actions/dq_action.py; their external behaviour is unchanged. Co-authored-by: Isaac
Update the action unit tests for the Pydantic migration: validation now happens at construction (or via model_validate) rather than through a removed validate() method, and DQAction.action / DQAlert.destinations are discriminated unions. - Destination / alert validation tests assert the DQX error is raised at construction instead of calling validate(); type-discriminator tests read the literal off an instance. - Evaluator tests inject mocks/fakes via post-construction assignment (the seam the evaluator exercises) and use lightweight duck-typed fakes. - State / base / serializer tests use real union-member actions and destinations (FailPipeline, DQAlert, CallbackDQAlertDestination); the unknown-action-type case now asserts rejection at DQAction construction. - Serializer round-trip coverage (DQSecret tagged form, enum values, optional condition, callback-skipped-on-serialize) is preserved. - Update DQAction imports in integration tests to actions/dq_action.py. Co-authored-by: Isaac
…alse positive Per AGENTS.md, docstrings use *italics* for object names, not backticks. Convert all remaining double/single backticks (and Sphinx :class:/:func: roles) to italics across the actions package; reword the '**' operator spans and operator lists that italics cannot wrap, and the backoff-formula docstring. Reword secret_field's tagged-dict examples to prose so the literal '"secret": "scope/key"' placeholder no longer trips GitGuardian's secret scanner (it was a documentation example of the wire-format reference, never a real credential). Co-authored-by: Isaac
…DQActionManager file I/O Add serialize_actions and deserialize_actions module-level helpers to serializer.py as convenience wrappers around ActionSerializer for whole-list operations. Export both from the actions package public API (__all__). Add load_actions_from_local_file and save_actions_in_local_file static methods to DQActionManager, supporting .yml, .yaml, and .json files with appropriate error types (InvalidParameterError for bad paths/extensions, InvalidConfigError for parse/write failures). Co-authored-by: Isaac
Change the DQEngine.actions parameter type to accept list[DQAction] | list[dict[str, object]] | None so callers can supply raw metadata dicts in place of (or mixed with) DQAction instances. Dict entries are deserialized via ActionSerializer.from_dict at construction time, so validation errors (unknown type, missing field) surface immediately rather than at evaluation time. Co-authored-by: Isaac
Add tests/unit/test_action_metadata.py covering: - Round-trip serialize/deserialize for mixed DQAlert+FailPipeline lists including DQSecret preservation - Error cases: non-dict element and unknown action type - File round-trips for .yml, .yaml, and .json formats - Invalid extension and missing file error cases - DQEngine normalization: action dict acceptance, mixed lists, and invalid dict rejection at construction time Extend docs/dqx/docs/guide/actions_and_alerts.mdx with a new "Defining actions with metadata (YAML)" section showing the YAML wire format, file loading via DQActionManager, and passing raw dicts directly to DQEngine. Co-authored-by: Isaac
…d unknown-type test Co-authored-by: Isaac
…nces Present the metadata action example in both the programmatic class API and the equivalent declarative YAML using a Tabs block, mirroring the DQX checks docs. Fix two auto-generated API-reference pages that broke 'make docs-build': the StandardMessageBuilder and ConditionEvaluator docstrings used a '**python' pseudo-fence, so the raw braces in their code examples reached the MDX parser and failed acorn. Convert both to proper fenced code blocks. Co-authored-by: Isaac
Add an end-to-end test that a FailPipeline defined as a metadata dict fires against real observed metrics via DQEngine, plus a local-file save/load round-trip for both YAML and JSON. Co-authored-by: Isaac
Default stays 'dqx' (the dedicated CI workspace catalog). Setting DQX_TEST_CATALOG lets the suite run against a workspace that exposes a different catalog (e.g. 'main' on a shared demo workspace). Co-authored-by: Isaac
Reframe the events-table section to lead with durable event history rather than alert-state suppression only. Clarify that every action evaluation is recorded (including not-fired/suppressed ones), document the AlertEvent table columns, and add a SQL example for reviewing alert history. Note the Lakebase backend option. Suppression persistence is now described as a secondary benefit of the same table. Co-authored-by: Isaac
|
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 34424785 | Triggered | Generic Password | 558ad6f | src/databricks/labs/dqx/actions/destinations/webhook.py | View secret |
| 34424785 | Triggered | Generic Password | 34cdf9f | src/databricks/labs/dqx/actions/destinations/webhook.py | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future consider
- following these best practices for managing and storing secrets including API keys and other credentials
- install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Move the paired Python/YAML example into the primary 'Defining a DQAction' section using Tabs, so both forms sit together instead of requiring a scroll to a separate metadata section. Aligns with the existing DQX checks docs. The metadata section now covers only the wire format, file loading, and passing dicts to DQEngine, cross-linking back to the paired example. Co-authored-by: Isaac
…fields Declare the optional username/password fields with pydantic Field(default=None) instead of a bare 'password = None' assignment. The literal 'password = <value>' shape tripped GitGuardian's generic-password detector even though the value is None; using Field() breaks that pattern while keeping the field name, type, and serialization identical (verified round-trip). No functional change. Co-authored-by: Isaac
Add an 'Actions table structure' subsection documenting the stored-actions table (action_json, run_config_name, created_at) and how rows are serialized, scoped by run config, and skipped on deserialize failure. Add column types to the action-events (AlertEvent) history table so its structure is fully specified alongside the definition table. Co-authored-by: Isaac
Move the apply/auto-fire section up so it immediately follows 'Defining a DQAction', giving the reader a complete define-then-apply flow up front instead of having to scroll past all the reference material to find how to run an action. Add forward links from the apply section to the destination, frequency, history, storage, and metadata reference sections that now follow. Co-authored-by: Isaac
…ernal task remarks - Consolidate action persistence docs into one section distinguishing the action-definitions and action-events tables; state UC/Lakebase backends once. - Merge 'Defining a DQAction' and 'Using actions with DQEngine' into a single 'Defining and applying actions' section; fix anchor links. - Expand README Key capabilities to match the docs Capabilities list and drop the actions guide link. - Remove internal task-number remarks from actions source and tests. - Revert pyproject.toml blank-line change and tests/constants.py TEST_CATALOG override back to origin/main. Co-authored-by: Isaac
…bs throughout - Restructure the actions guide so the auto-fire define-and-run example comes first as the primary, complete example. - Show both Python (classes) and YAML (metadata) forms via tabs for every action-definition example (auto-fire, manual eval, conditions, destinations, FailPipeline, frequency, secrets). - Remove the disjoint define-only example and the standalone metadata section; fold file load/save mechanics into the persistence section. Co-authored-by: Isaac
…mentation and Contribution Co-authored-by: Isaac
Co-authored-by: Isaac
- Add LogDQAlertDestination, a serializable (type: log) alert destination that writes alerts to the driver logger with no external I/O. Unlike the callback destination it round-trips through metadata, making it ideal for local dev, demos, and e2e tests. Register it in the destination union and public exports. - Add unit tests (delivery, level validation/normalization, CWE-117 sanitization, metadata round-trip, union membership) and shared action_context/action_services fixtures in tests/unit/conftest.py. - Add demos/dqx_demo_alerting.py demonstrating alerting with the log destination and optional Slack, and hook it into the e2e demo runner. - Document the log destination and the demo in the docs. Co-authored-by: Isaac
Wire RunConfig.actions_location (action definitions) and a new action_events_location (event history + durable suppression) into the parallel multi-run-config runner. Each run config with actions is applied through a dedicated engine carrying its own actions, a fresh observer, and an optional event store, keeping the shared engine thread-safe. - config.py: actions_location now means action definitions (comment fixed); add action_events_location for event history / cross-run suppression. - engine.py: _engine_for_run_config / _build_scoped_engine build the scoped engine; _run_config_actions_storage_config / _run_config_action_events_config resolve table/Lakebase/file backends. File paths never route to a table backend; a non-table events location raises a clear InvalidConfigError. - workflow_context.py: resolve relative actions_location to a /Workspace FUSE path (actions load via open(), like custom check functions). - docs: document both keys, the runner usage, and per-run-config action behaviour; add the keys to the installation run-config reference. - tests: unit tests for the resolvers and scoped-engine logic; integration test for end-to-end load + fire + event persistence. Co-authored-by: Isaac
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an extensible actions subsystem to DQX. An action runs when data checked by DQX violates an (optional) condition evaluated against the summary metrics produced by
DQMetricsObserver. The first two concrete actions are:DQAlert— sends a notification to one or more destinations: Slack, Microsoft Teams, generic HTTPS webhook, Log (driver logger, no external system) or an in-process callback.FailPipeline— raisesPipelineFailedErrorto stop the current pipeline run.Actions can be defined programmatically (DQX classes) or declaratively as metadata (YAML/JSON dicts), and are passed to
DQEngine(ws, observer=..., actions=[...])— they fire automatically on the save-to-table methods (batch and streaming), or explicitly viaengine.evaluate_actions(...). Action definitions can be stored/loaded from UC or Lakebase tables (or local YAML/JSON files) viaDQActionManager, and action events can be persisted to a UC/Lakebase events table so frequency/status-change suppression survives engine restarts. In the installed DQX Workflows and the parallel multi-run-config runner, eachRunConfigcan point at its ownactions_location(definitions) andaction_events_location(history) so actions are auto-loaded and applied per run config.What's included
databricks.labs.dqx.actions—DQAction,ActionABC,DQAlert,FailPipeline;AlertDestinationhierarchy (Slack / Teams / webhook / log / callback); a safe AST condition evaluator; standard message builder; webhook delivery with retry/backoff and an SSRF guard;SecretResolver(DQSecret);ActionStateStore+ UC/Lakebase event stores;ActionEvaluatororchestrator;ActionSerializer+ UC/Lakebase definition storage;DQActionManager. Destinations and actions are Pydantic discriminated unions, so adding a new type is a small, isolated change (OCP).actions=onDQEngine/DQEngineCore(acceptsDQActioninstances or raw metadata dicts), batch + streaming firing,evaluate_actions(...), and optionalaction_events_configfor persistent state/history.DQActionManager.load_actions_from_local_file/save_actions_in_local_fileround-trip them to disk.DQEngine.apply_checks_and_save_in_tables(and the installed quality-checker workflow) auto-load eachRunConfig.actions_locationand fire those actions for that run. Each run config is applied through a dedicated engine (its own actions, a fresh observer, and an optional event store), keeping the parallel runner thread-safe.action_events_locationpersists event history and durable alert suppression across runs.DQSecret,TableActionsStorageConfig,LakebaseActionsStorageConfig,ActionEventsConfig,RunConfig.actions_location(definitions) andRunConfig.action_events_location(event history).demos/dqx_demo_alerting.pydemonstrating alerting via the log destination (with optional Slack), wired into the e2e demo runner.Known follow-ups (out of scope here)
FailPipelinestill aborts immediately).action_events_location, action names must be distinct (or use a separate events table per run config) — documented, could be scoped by run config in a follow-up.Linked issues
Resolves #204 #610
Tests
Documentation and Demos
This pull request and its description were written by Isaac.