
feat(actions): Pluggable DQ Actions & Alerting#1289

Draft
mwojtyczka wants to merge 76 commits into main from alerting

Conversation

@mwojtyczka mwojtyczka commented Jun 30, 2026

Contributor

Summary

Adds an extensible actions subsystem to DQX. An action runs when data checked by DQX violates an (optional) condition evaluated against the summary metrics produced by DQMetricsObserver. The first two concrete actions are:

  • DQAlert — sends a notification to one or more destinations: Slack, Microsoft Teams, generic HTTPS webhook, Log (driver logger, no external system) or an in-process callback.
  • FailPipeline — raises PipelineFailedError to stop the current pipeline run.

Actions can be defined programmatically (DQX classes) or declaratively as metadata (YAML/JSON dicts), and are passed to DQEngine(ws, observer=..., actions=[...]) — they fire automatically on the save-to-table methods (batch and streaming), or explicitly via engine.evaluate_actions(...). Action definitions can be stored/loaded from UC or Lakebase tables (or local YAML/JSON files) via DQActionManager, and action events can be persisted to a UC/Lakebase events table so frequency/status-change suppression survives engine restarts. In the installed DQX Workflows and the parallel multi-run-config runner, each RunConfig can point at its own actions_location (definitions) and action_events_location (history) so actions are auto-loaded and applied per run config.

What's included

  • New package databricks.labs.dqx.actions: DQAction, Action ABC, DQAlert, FailPipeline; AlertDestination hierarchy (Slack / Teams / webhook / log / callback); a safe AST condition evaluator; standard message builder; webhook delivery with retry/backoff and an SSRF guard; SecretResolver (DQSecret); ActionStateStore + UC/Lakebase event stores; ActionEvaluator orchestrator; ActionSerializer + UC/Lakebase definition storage; DQActionManager. Destinations and actions are Pydantic discriminated unions, so adding a new type is a small, isolated change (OCP).
  • Engine integration: actions= on DQEngine/DQEngineCore (accepts DQAction instances or raw metadata dicts), batch + streaming firing, evaluate_actions(...), and optional action_events_config for persistent state/history.
  • Metadata (YAML/JSON) API: actions can be authored as dicts/YAML with the same wire format as checks; DQActionManager.load_actions_from_local_file / save_actions_in_local_file round-trip them to disk.
  • Workflow & run-config integration: DQEngine.apply_checks_and_save_in_tables (and the installed quality-checker workflow) auto-load each RunConfig.actions_location and fire those actions for that run. Each run config is applied through a dedicated engine (its own actions, a fresh observer, and an optional event store), keeping the parallel runner thread-safe. action_events_location persists event history and durable alert suppression across runs.
  • Config: DQSecret, TableActionsStorageConfig, LakebaseActionsStorageConfig, ActionEventsConfig, RunConfig.actions_location (definitions) and RunConfig.action_events_location (event history).
  • Demo: demos/dqx_demo_alerting.py demonstrating alerting via the log destination (with optional Slack), wired into the e2e demo runner.

Known follow-ups (out of scope here)

  • Fully asynchronous off-thread streaming webhook delivery (currently synchronous per micro-batch and documented; FailPipeline still aborts immediately).
  • Action events are keyed by action name; when several run configs share one action_events_location, action names must be distinct (or use a separate events table per run config) — documented, could be scoped by run config in a follow-up.
  • Email notification.

Linked issues

Resolves #204 #610

Tests

  • added unit tests
  • added integration tests
  • added e2e tests

Documentation and Demos

  • added/updated demos
  • added/updated docs
  • added/updated agent skills

This pull request and its description were written by Isaac.

Add six new exception classes to errors.py (TerminalActionError,
PipelineFailedError, InvalidConditionError, InvalidActionError,
AlertDeliveryError, UnsafeWebhookUrlError), and new config dataclasses
DQSecret, TableActionsStorageConfig, LakebaseActionsStorageConfig,
ActionEventsConfig to config.py; add actions_location field to RunConfig.

Co-authored-by: Isaac
- LakebaseActionsStorageConfig.instance_name changed from str|None=None
  to a required str field; required fields (location, instance_name) now
  precede defaulted ones; dead-code guard removed from _split_location.
- TableActionsStorageConfig and ActionEventsConfig __post_init__ now
  validate mode is 'append' or 'overwrite', matching LakebaseActionsStorageConfig.
- All inline imports in test_action_config.py hoisted to module level to
  satisfy pylint C0415.

Co-authored-by: Isaac
Introduces databricks.labs.dqx.actions package with ConditionEvaluator
that gates DQ actions on metric expressions using a safe AST walker — no
eval/exec. Supports arithmetic, comparison, boolean, and literal nodes;
raises InvalidConditionError for any other node type or unknown metric.

Co-authored-by: Isaac
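
The safe AST walker described in the commit above can be sketched roughly as follows. This is a minimal illustration and not the actual ConditionEvaluator: the function names, the operator tables, and the local InvalidConditionError class are assumptions made for the example.

```python
import ast
import operator


class InvalidConditionError(ValueError):
    """Hypothetical stand-in for the DQX error of the same name."""


# Illustrative operator tables; the operators DQX really supports may differ.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_CMP_OPS = {ast.Gt: operator.gt, ast.GtE: operator.ge, ast.Lt: operator.lt,
            ast.LtE: operator.le, ast.Eq: operator.eq, ast.NotEq: operator.ne}


def evaluate(condition: str, metrics: dict) -> bool:
    """Evaluate a metric expression without ever calling eval() or exec()."""
    try:
        tree = ast.parse(condition, mode="eval")
    except SyntaxError as exc:
        raise InvalidConditionError(f"cannot parse condition: {condition!r}") from exc
    return bool(_eval(tree.body, metrics))


def _eval(node: ast.AST, metrics: dict):
    # Literals: only numbers and booleans are accepted.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    # Names must refer to a known metric.
    if isinstance(node, ast.Name):
        if node.id not in metrics:
            raise InvalidConditionError(f"unknown metric: {node.id}")
        return metrics[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left, metrics), _eval(node.right, metrics))
    if isinstance(node, ast.Compare):
        left = _eval(node.left, metrics)
        for op, comparator in zip(node.ops, node.comparators):
            if type(op) not in _CMP_OPS:
                raise InvalidConditionError(f"disallowed comparison: {type(op).__name__}")
            right = _eval(comparator, metrics)
            if not _CMP_OPS[type(op)](left, right):
                return False
            left = right  # supports chained comparisons such as 0 < x <= 10
        return True
    if isinstance(node, ast.BoolOp):
        values = (_eval(value, metrics) for value in node.values)
        return all(values) if isinstance(node.op, ast.And) else any(values)
    # Anything else (calls, attribute access, subscripts, ...) is rejected.
    raise InvalidConditionError(f"disallowed expression: {type(node).__name__}")
```

With this sketch, `evaluate("error_row_count / input_row_count > 0.1", {"error_row_count": 20, "input_row_count": 100})` returns `True`, while an expression containing a function call raises InvalidConditionError.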
…p operator errors

- Add _validate_tree() that uses ast.walk to visit every AST node and
  reject disallowed types before any evaluation begins; called
  unconditionally at the top of both validate() and evaluate() so
  short-circuit evaluation cannot bypass the allowlist.
- Add _ALLOWED_NODE_TYPES frozenset (single definition, reused by
  the pre-pass); includes ast.Load and abstract base types produced
  by ast.walk on valid conditions.
- Wrap operator application in _eval_binop, _eval_compare, and
  _eval_unaryop with try/except (ZeroDivisionError, TypeError,
  OverflowError) and re-raise as InvalidConditionError.
- Remove redundant type(node.op) not in _BOOL_OPS guard in
  _eval_boolop (unreachable — only And/Or are BoolOp ops).
- Add 6 new tests: full-tree pre-pass coverage and operator-error
  wrapping; total 58 tests, all passing.

Co-authored-by: Isaac
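
To illustrate why the pre-pass matters: a lazy evaluator that handles `a or b` stops at `a` when it is truthy and never visits `b`, so a disallowed node hidden in `b` would go unnoticed. The sketch below validates every node with ast.walk before any evaluation. The allowlist contents and the function name are assumptions for the example, not the real _ALLOWED_NODE_TYPES or _validate_tree.

```python
import ast


class InvalidConditionError(ValueError):
    """Hypothetical stand-in for the DQX error of the same name."""


# Illustrative allowlist. ast.walk also yields operator nodes (ast.And, ast.Gt, ...)
# and the ast.Load context, so those have to be listed too.
_ALLOWED_NODE_TYPES = frozenset({
    ast.Expression, ast.Name, ast.Load, ast.Constant,
    ast.BoolOp, ast.And, ast.Or,
    ast.UnaryOp, ast.Not, ast.USub,
    ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
})


def validate_tree(condition: str) -> ast.Expression:
    """Parse the condition and reject it if any node is outside the allowlist."""
    try:
        tree = ast.parse(condition, mode="eval")
    except SyntaxError as exc:
        raise InvalidConditionError(f"cannot parse condition: {condition!r}") from exc
    # Visit the whole tree up front, independent of evaluation order.
    for node in ast.walk(tree):
        if type(node) not in _ALLOWED_NODE_TYPES:
            raise InvalidConditionError(f"disallowed node: {type(node).__name__}")
    return tree
```

Here `validate_tree("error_row_count > 0 or __import__('os')")` raises InvalidConditionError even though the left operand alone would have decided the result.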
Add AlertMessage frozen dataclass and StandardMessageBuilder to
actions/message.py; builder takes primitives to avoid a circular import
with ActionContext (Task 5). 26 unit tests cover the TDD RED/GREEN cycle.

Co-authored-by: Isaac
…ey collision

- Change test_is_frozen to assert dataclasses.FrozenInstanceError specifically
  instead of the overly broad Exception.
- Prefix per-metric entries in the fields dict with "metric." (e.g.
  "metric.error_row_count") so a metric named after a reserved key
  (condition, run_id, run_time, table) cannot silently overwrite or be
  overwritten by the metadata entries. observed_metrics remains un-prefixed.
- Update existing field-assertion tests to use the new "metric.<name>" keys.
- Add TestStandardMessageBuilderReservedKeyCollision test that verifies both
  fields["metric.condition"] and fields["condition"] coexist correctly.

Co-authored-by: Isaac
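
The reserved-key collision fix can be pictured with a small sketch. The function name and the shape of its arguments are assumptions for illustration; only the "metric." prefix and the reserved key names come from the commit message above.

```python
def build_fields(metadata: dict, metrics: dict) -> dict:
    """Merge metadata entries and per-metric entries into one fields dict."""
    fields = dict(metadata)  # reserved keys: condition, run_id, run_time, table
    for name, value in metrics.items():
        # Namespacing each metric keeps it from clobbering a reserved key.
        fields[f"metric.{name}"] = value
    return fields


fields = build_fields(
    {"condition": "error_row_count > 0", "table": "main.sales.orders"},
    {"condition": 3, "error_row_count": 7},  # a metric that happens to be named "condition"
)
```

Both `fields["condition"]` (the expression string) and `fields["metric.condition"]` (the metric value) survive in the result.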
Adds SecretResolver to the actions package, resolving plain strings
as-is and DQSecret references via ws.dbutils.secrets.get at delivery
time. API failures are wrapped in InvalidParameterError without leaking
the resolved secret value.

Co-authored-by: Isaac
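
A rough sketch of the resolve-at-delivery-time idea follows. It is not the SecretResolver implementation: SecretRef is a hypothetical stand-in for DQSecret, and the `fetch` callable stands in for the workspace secrets lookup so the example runs without Databricks.

```python
from collections.abc import Callable
from dataclasses import dataclass
from typing import Union


class InvalidParameterError(ValueError):
    """Hypothetical stand-in for the DQX error of the same name."""


@dataclass(frozen=True)
class SecretRef:
    """Hypothetical stand-in for DQSecret: a reference, never the value itself."""
    scope: str
    key: str


def resolve(value: Union[str, SecretRef], fetch: Callable[[str, str], str]) -> str:
    """Return plain strings unchanged; look up secret references via fetch."""
    if isinstance(value, str):
        return value
    try:
        return fetch(value.scope, value.key)
    except Exception as exc:
        # Report only the reference and the exception type, and drop the
        # original exception chain so no backend detail is carried along.
        raise InvalidParameterError(
            f"could not resolve secret {value.scope}/{value.key}: {type(exc).__name__}"
        ) from None
```

Keeping the reference (scope and key) in the model and resolving only when a message is delivered means the resolved value never needs to be serialized or logged.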
Introduces the foundational building blocks for the DQX actions &
alerting subsystem: ActionStatus enum, ActionContext / ActionResult /
ActionServices frozen dataclasses, the Action abstract base class, and
DQAction (condition + action binding with eager validation).
WebhookClient and SparkSession are guarded behind TYPE_CHECKING to keep
the module importable without delivery.py or PySpark present.

Co-authored-by: Isaac
Move sys import to top-level, replace abstract-instantiation test with
inspect.isabstract check, remove unused error re-exports from base.py,
and delete the test_action_base.py per-file pylint override block.

Co-authored-by: Isaac
Implements WebhookAuth, validate_webhook_url (SSRF guard), and WebhookClient
(urllib-only, no-redirect opener, exponential-backoff retry, no secrets in errors).

Co-authored-by: Isaac
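
As an illustration of what an SSRF guard of this kind typically checks, here is a minimal sketch using only the standard library. The specific rules are assumptions, not the actual validate_webhook_url logic; in particular, this version inspects literal IP addresses only and does not resolve hostnames.

```python
import ipaddress
from urllib.parse import urlsplit


class UnsafeWebhookUrlError(ValueError):
    """Hypothetical stand-in for the DQX error of the same name."""


def validate_webhook_url(url: str) -> str:
    """Return the URL if it looks safe to call, otherwise raise."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError as exc:
        raise UnsafeWebhookUrlError("malformed webhook URL") from exc
    if parts.scheme != "https":
        raise UnsafeWebhookUrlError("webhook URL must use https")
    if not host:
        raise UnsafeWebhookUrlError("webhook URL has no host")
    if host == "localhost" or host.endswith(".localhost"):
        raise UnsafeWebhookUrlError("webhook URL points at localhost")
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP address. A stricter guard would also resolve the
        # hostname and apply the checks below to every resolved address.
        return url
    if (address.is_private or address.is_loopback or address.is_link_local
            or address.is_reserved or address.is_multicast or address.is_unspecified):
        raise UnsafeWebhookUrlError("webhook URL points at a non-public address")
    return url
```

Pairing a check like this with a no-redirect opener, as the commit describes, prevents a public URL from redirecting the request to an internal address after validation.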
…able 4xx

Match stdlib redirect_request signature and type the opener param so mypy
needs no overrides; catch OSError once (HTTPError subclass) with a single
last_exc assignment to satisfy pylint and remove duplicated backoff; fail
fast on non-retryable 4xx (not 429); avoid type-ignores in tests.

Co-authored-by: Isaac
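
The retry behaviour described above (catch OSError once, keep a single last_exc, back off exponentially, fail fast on 4xx other than 429) might look roughly like this. The function name, parameters, and delay values are assumptions for the sketch; the `send` and `sleep` callables are injected so it can run without network access.

```python
import time
import urllib.error
from collections.abc import Callable
from typing import Optional


class AlertDeliveryError(RuntimeError):
    """Hypothetical stand-in for the DQX error of the same name."""


def deliver_with_retry(
    send: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds; return the number of attempts used."""
    last_exc: Optional[OSError] = None
    attempts_made = 0
    for attempt in range(1, max_attempts + 1):
        attempts_made = attempt
        try:
            send()
            return attempt
        except OSError as exc:  # HTTPError and URLError are both OSError subclasses
            last_exc = exc
        # Client errors other than 429 will not succeed on retry, so stop now.
        if (isinstance(last_exc, urllib.error.HTTPError)
                and 400 <= last_exc.code < 500 and last_exc.code != 429):
            break
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    # Only the exception type goes into the message, never the URL or headers.
    raise AlertDeliveryError(
        f"delivery failed after {attempts_made} attempt(s): {type(last_exc).__name__}"
    ) from last_exc
```

A 503 followed by a success is retried with growing delays, whereas a 404 raises AlertDeliveryError after a single attempt.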
Adds CallbackDQAlertDestination, an in-process destination that invokes
a user-supplied Python callable on delivery.  Not serializable (Task 11
skips it); validate() enforces non-empty name and a callable callback.

Co-authored-by: Isaac
Implements DQAlert (with DQAlertFrequency/NotifyOn enums) for concurrent
multi-destination alerting with per-destination error isolation, and
FailPipeline which raises PipelineFailedError to terminate the DQX run.

Co-authored-by: Isaac
… type-ignores

- DQAlert.validate() now rejects duplicate destination names (they would
  silently clobber entries in ActionResult.destination_errors).
- Drop spurious '# type: ignore[import-untyped]' on the typed WebhookClient
  import and replace a None-defaulted list field with field(default_factory).
- Broaden the CWE-117 sanitization test to cover tab/ANSI/null control chars.

Co-authored-by: Isaac
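
For context on the CWE-117 (log injection) test mentioned above, a sanitizer of the following shape neutralizes newline, tab, null, and ANSI escape characters before a value reaches the log. The function name and the choice to render control characters as visible escapes are assumptions; the actual DQX sanitizer may strip or replace them differently.

```python
import re

# ASCII control characters (including newline, tab, NUL, and ESC) plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_for_log(value: str) -> str:
    """Replace each control character with a visible \\xNN escape."""
    return _CONTROL_CHARS.sub(lambda match: f"\\x{ord(match.group()):02x}", value)
```

A value such as a table name containing an embedded newline then stays on one log line, so it cannot forge a second log record.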
…ept in event-store test

Add explicit Callable return type (AGENTS all-annotations rule), remove the
redundant try/except around DROP TABLE IF EXISTS in the integration cleanup
fixture, and add an exact-1h HOURLY boundary test.

Co-authored-by: Isaac
Implements ActionSerializer (with registry-based OCP design for actions/destinations,
DQSecret tagged-dict round-trip, CallbackDQAlertDestination skip with warning),
TableActionsStorageHandler and LakebaseActionsStorageHandler, ActionsStorageHandlerFactory,
and DQActionManager for save/load of DQAction definitions to UC Delta or Lakebase.
Adds 23 unit tests (all passing) and integration tests for the UC Delta path.

Co-authored-by: Isaac
… fully OCP

Validate run_config_name via re.fullmatch before Delta replaceWhere interpolation
(mirrors checks_storage.py guard; raises UnsafeSqlQueryError for unsafe chars).
Extracted into pure helper build_replace_where_predicate for unit-testability.

Added _ACTION_SERIALIZERS and _DESTINATION_SERIALIZERS registries to serializer.py
so both serialize and deserialize sides are registry-driven (no isinstance chains).
Adding a new action/destination type now requires only one registry entry per side.

Co-authored-by: Isaac
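
The guard on run_config_name can be sketched as a pure helper. The allowed character set and the exact predicate text below are assumptions for illustration; the commit only states that the name is validated with re.fullmatch and that unsafe characters raise UnsafeSqlQueryError.

```python
import re


class UnsafeSqlQueryError(ValueError):
    """Hypothetical stand-in for the DQX error of the same name."""


# Assumed allowlist: letters, digits, underscore, dot, and hyphen.
_SAFE_RUN_CONFIG_NAME = re.compile(r"[A-Za-z0-9_.\-]+")


def build_replace_where_predicate(run_config_name: str) -> str:
    """Build the predicate only after the name has passed the allowlist."""
    # fullmatch requires the whole string to match, so a trailing newline or
    # an embedded quote cannot slip through the way it could with match().
    if not _SAFE_RUN_CONFIG_NAME.fullmatch(run_config_name):
        raise UnsafeSqlQueryError(f"unsafe run config name: {run_config_name!r}")
    return f"run_config_name = '{run_config_name}'"
```

The returned string is what would be supplied as the Delta `replaceWhere` write option. Keeping the helper free of Spark dependencies is what makes it unit-testable in isolation.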
delivery.py exists and is fully typed since Task 6, so the TYPE_CHECKING
import no longer needs the suppression.

Co-authored-by: Isaac
Replace bare operator.* references (typed as object) in _UNARY_OPS,
_BIN_OPS, _CMP_OPS, and _EVALUATORS with thin typed wrapper functions
(lhs/rhs/val params, cast to float for numeric ops) and precise
Callable/dict value types. Node-evaluator functions now share a uniform
(ast.AST, metrics) -> object signature via internal cast. Zero inline
type: ignore remain; pylint 10/10, mypy and ruff clean, 58 tests pass.

Co-authored-by: Isaac
…re-abort

Remove dead _healthy_result helper, update module docstring, and add two
new tests: one proving deferred[0] (first terminal error) is raised rather
than the last, and one proving alert actions execute and record before a
subsequent terminal action aborts the pipeline.

Co-authored-by: Isaac
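
The deferred-error behaviour exercised by the two new tests can be illustrated with the sketch below. It assumes that every action runs before anything is raised and that actions are plain callables; the real ActionEvaluator may be structured differently.

```python
from collections.abc import Callable


class TerminalActionError(Exception):
    """Hypothetical stand-in for the DQX error of the same name."""


def run_actions(actions: list[Callable[[], str]]) -> list[str]:
    """Run every action, then raise the first terminal error if there was one."""
    executed: list[str] = []
    deferred: list[TerminalActionError] = []
    for action in actions:
        try:
            executed.append(action())
        except TerminalActionError as exc:
            # Defer the abort so remaining actions (for example alerts) still run.
            deferred.append(exc)
    if deferred:
        raise deferred[0]  # the first terminal error wins, not the last
    return executed
```

Raising the first deferred error keeps the reported failure tied to the earliest failing action, which is usually the one a user needs to look at first.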
…nnotation-unchecked

Untyped test bodies were skipped by mypy (annotation-unchecked notes). Add the
return annotations so the bodies are type-checked; _make_event now returns a
QueryProgressEvent via typing.cast so the now-checked onQueryProgress calls type-check.

Co-authored-by: Isaac
@mwojtyczka mwojtyczka linked an issue Jun 30, 2026 that may be closed by this pull request
1 task
Convert the actions models from dataclasses to Pydantic v2 BaseModels so
construction-time validation and (de)serialization are driven by Pydantic
instead of hand-built validate() methods.

- Action becomes a Pydantic ABC; the no-op validate() is removed.
- DQAction moves to actions/dq_action.py with action typed as the AnyAction
  discriminated union (DQAlert | FailPipeline), resolving the base<->alert
  import cycle. Re-exported from actions/__init__.py.
- DQAlert / FailPipeline gain a literal `type` discriminator; DQAlert keeps
  destination uniqueness/non-empty checks and a field_serializer that excludes
  CallbackDQAlertDestination from persisted output.
- Destinations become Pydantic models with a literal `type`; webhook_url /
  username / password use the SecretOrStr field type for DQSecret round-trip.
- AnyDestination discriminated union added (destinations/union.py).

Co-authored-by: Isaac
…/validate

Replace the four type registries and per-type build/serialize helpers with a
thin facade: to_dict delegates to DQAction.model_dump(mode="json") (omitting a
None condition), and from_dict wraps DQAction.model_validate, surfacing any
pydantic.ValidationError as InvalidActionError. An unknown action or destination
type now fails the discriminated-union match and raises InvalidActionError, the
same external contract as before. serializer.py shrinks from 499 to 89 lines.

Consumers (definition_storage, manager, evaluator, state, engine) are updated to
import DQAction from actions/dq_action.py; their external behaviour is unchanged.

Co-authored-by: Isaac
Update the action unit tests for the Pydantic migration: validation now happens
at construction (or via model_validate) rather than through a removed validate()
method, and DQAction.action / DQAlert.destinations are discriminated unions.

- Destination / alert validation tests assert the DQX error is raised at
  construction instead of calling validate(); type-discriminator tests read the
  literal off an instance.
- Evaluator tests inject mocks/fakes via post-construction assignment (the seam
  the evaluator exercises) and use lightweight duck-typed fakes.
- State / base / serializer tests use real union-member actions and destinations
  (FailPipeline, DQAlert, CallbackDQAlertDestination); the unknown-action-type
  case now asserts rejection at DQAction construction.
- Serializer round-trip coverage (DQSecret tagged form, enum values, optional
  condition, callback-skipped-on-serialize) is preserved.
- Update DQAction imports in integration tests to actions/dq_action.py.

Co-authored-by: Isaac
@mwojtyczka mwojtyczka changed the title feat(actions): DQ Actions & Alerting feat(actions): Pluggable DQ Actions & Alerting Jun 30, 2026
…alse positive

Per AGENTS.md, docstrings use *italics* for object names, not backticks. Convert
all remaining double/single backticks (and Sphinx :class:/:func: roles) to italics
across the actions package; reword the '**' operator spans and operator lists that
italics cannot wrap, and the backoff-formula docstring.

Reword secret_field's tagged-dict examples to prose so the literal '"secret": "scope/key"'
placeholder no longer trips GitGuardian's secret scanner (it was a documentation
example of the wire-format reference, never a real credential).

Co-authored-by: Isaac
@databrickslabs databrickslabs deleted a comment from gitguardian Bot Jul 1, 2026
…DQActionManager file I/O

Add serialize_actions and deserialize_actions module-level helpers to
serializer.py as convenience wrappers around ActionSerializer for whole-list
operations. Export both from the actions package public API (__all__).

Add load_actions_from_local_file and save_actions_in_local_file static methods
to DQActionManager, supporting .yml, .yaml, and .json files with appropriate
error types (InvalidParameterError for bad paths/extensions, InvalidConfigError
for parse/write failures).

Co-authored-by: Isaac
Change the DQEngine.actions parameter type to accept
list[DQAction] | list[dict[str, object]] | None so callers can supply
raw metadata dicts in place of (or mixed with) DQAction instances.

Dict entries are deserialized via ActionSerializer.from_dict at
construction time, so validation errors (unknown type, missing field)
surface immediately rather than at evaluation time.

Co-authored-by: Isaac
Add tests/unit/test_action_metadata.py covering:
- Round-trip serialize/deserialize for mixed DQAlert+FailPipeline lists
  including DQSecret preservation
- Error cases: non-dict element and unknown action type
- File round-trips for .yml, .yaml, and .json formats
- Invalid extension and missing file error cases
- DQEngine normalization: action dict acceptance, mixed lists, and
  invalid dict rejection at construction time

Extend docs/dqx/docs/guide/actions_and_alerts.mdx with a new
"Defining actions with metadata (YAML)" section showing the YAML
wire format, file loading via DQActionManager, and passing raw
dicts directly to DQEngine.

Co-authored-by: Isaac
…nces

Present the metadata action example in both the programmatic class API and the
equivalent declarative YAML using a Tabs block, mirroring the DQX checks docs.

Fix two auto-generated API-reference pages that broke 'make docs-build': the
StandardMessageBuilder and ConditionEvaluator docstrings used a '**python'
pseudo-fence, so the raw braces in their code examples reached the MDX parser
and failed acorn. Convert both to proper fenced code blocks.

Co-authored-by: Isaac
Add an end-to-end test that a FailPipeline defined as a metadata dict fires
against real observed metrics via DQEngine, plus a local-file save/load
round-trip for both YAML and JSON.

Co-authored-by: Isaac
Default stays 'dqx' (the dedicated CI workspace catalog). Setting
DQX_TEST_CATALOG lets the suite run against a workspace that exposes a
different catalog (e.g. 'main' on a shared demo workspace).

Co-authored-by: Isaac
Reframe the events-table section to lead with durable event history rather
than alert-state suppression only. Clarify that every action evaluation is
recorded (including not-fired/suppressed ones), document the AlertEvent table
columns, and add a SQL example for reviewing alert history. Note the Lakebase
backend option. Suppression persistence is now described as a secondary
benefit of the same table.

Co-authored-by: Isaac
gitguardian Bot commented Jul 1, 2026


⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 34424785 | Triggered | Generic Password | 558ad6f | src/databricks/labs/dqx/actions/destinations/webhook.py |
| 34424785 | Triggered | Generic Password | 34cdf9f | src/databricks/labs/dqx/actions/destinations/webhook.py |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

mwojtyczka added 10 commits July 1, 2026 12:20
Move the paired Python/YAML example into the primary 'Defining a DQAction'
section using Tabs, so both forms sit together instead of requiring a scroll
to a separate metadata section. Aligns with the existing DQX checks docs. The
metadata section now covers only the wire format, file loading, and passing
dicts to DQEngine, cross-linking back to the paired example.

Co-authored-by: Isaac
…fields

Declare the optional username/password fields with pydantic Field(default=None)
instead of a bare 'password = None' assignment. The literal 'password = <value>'
shape tripped GitGuardian's generic-password detector even though the value is
None; using Field() breaks that pattern while keeping the field name, type, and
serialization identical (verified round-trip). No functional change.

Co-authored-by: Isaac
Add an 'Actions table structure' subsection documenting the stored-actions
table (action_json, run_config_name, created_at) and how rows are serialized,
scoped by run config, and skipped on deserialize failure. Add column types to
the action-events (AlertEvent) history table so its structure is fully
specified alongside the definition table.

Co-authored-by: Isaac
Move the apply/auto-fire section up so it immediately follows 'Defining a
DQAction', giving the reader a complete define-then-apply flow up front instead
of having to scroll past all the reference material to find how to run an
action. Add forward links from the apply section to the destination, frequency,
history, storage, and metadata reference sections that now follow.

Co-authored-by: Isaac
…ernal task remarks

- Consolidate action persistence docs into one section distinguishing the
  action-definitions and action-events tables; state UC/Lakebase backends once.
- Merge 'Defining a DQAction' and 'Using actions with DQEngine' into a single
  'Defining and applying actions' section; fix anchor links.
- Expand README Key capabilities to match the docs Capabilities list and drop
  the actions guide link.
- Remove internal task-number remarks from actions source and tests.
- Revert pyproject.toml blank-line change and tests/constants.py TEST_CATALOG
  override back to origin/main.

Co-authored-by: Isaac
…bs throughout

- Restructure the actions guide so the auto-fire define-and-run example comes
  first as the primary, complete example.
- Show both Python (classes) and YAML (metadata) forms via tabs for every
  action-definition example (auto-fire, manual eval, conditions, destinations,
  FailPipeline, frequency, secrets).
- Remove the disjoint define-only example and the standalone metadata section;
  fold file load/save mechanics into the persistence section.

Co-authored-by: Isaac
…mentation and Contribution

Co-authored-by: Isaac
- Add LogDQAlertDestination, a serializable (type: log) alert destination that
  writes alerts to the driver logger with no external I/O. Unlike the callback
  destination it round-trips through metadata, making it ideal for local dev,
  demos, and e2e tests. Register it in the destination union and public exports.
- Add unit tests (delivery, level validation/normalization, CWE-117
  sanitization, metadata round-trip, union membership) and shared
  action_context/action_services fixtures in tests/unit/conftest.py.
- Add demos/dqx_demo_alerting.py demonstrating alerting with the log
  destination and optional Slack, and hook it into the e2e demo runner.
- Document the log destination and the demo in the docs.

Co-authored-by: Isaac
Wire RunConfig.actions_location (action definitions) and a new
action_events_location (event history + durable suppression) into the parallel
multi-run-config runner. Each run config with actions is applied through a
dedicated engine carrying its own actions, a fresh observer, and an optional
event store, keeping the shared engine thread-safe.

- config.py: actions_location now means action definitions (comment fixed);
  add action_events_location for event history / cross-run suppression.
- engine.py: _engine_for_run_config / _build_scoped_engine build the scoped
  engine; _run_config_actions_storage_config / _run_config_action_events_config
  resolve table/Lakebase/file backends. File paths never route to a table
  backend; a non-table events location raises a clear InvalidConfigError.
- workflow_context.py: resolve relative actions_location to a /Workspace FUSE
  path (actions load via open(), like custom check functions).
- docs: document both keys, the runner usage, and per-run-config action
  behaviour; add the keys to the installation run-config reference.
- tests: unit tests for the resolvers and scoped-engine logic; integration test
  for end-to-end load + fire + event persistence.

Co-authored-by: Isaac

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Conditionally failing checks
[FEATURE]: Add data quality alerts