Skip to content

[Bug] Inclusion of runId in DatasetVersion hash causes version explosion and query timeouts in high-frequency jobs #3082

@bsmin10010-cloud

Description

@bsmin10010-cloud

Hello, Marquez team.

I am operating a Marquez instance integrated with Airflow, running high-frequency batch jobs (e.g., every minute).
I encountered a critical performance issue where the dataset_versions and column_lineage tables grew abnormally large , causing the dataset lineage query to take over 10 seconds or timeout.

Upon investigation, I found that Marquez creates a new dataset version for every single run, even when the schema and dataset facets remain exactly the same.

I dug into the source code and identified that runId is included in the hashing logic for generating the DatasetVersion UUID.

In Utils.java (specifically inside the newDatasetVersionFor method):

  private static Version newDatasetVersionFor(DatasetVersionData data) {
    final byte[] bytes =
        VERSION_JOINER
            .join(
                data.getNamespace(),
                data.getSourceName(),
                data.getDatasetName(),
                data.getPhysicalName(),
                data.getSchemaLocation(),
                data.getFields().stream().map(Utils::joinField).collect(joining(VERSION_DELIM)),
                data.getLifecycleState(),
                data.getRunId())
            .getBytes(UTF_8);
    return Version.of(UUID.nameUUIDFromBytes(bytes));
  }

Since schedulers like Airflow generate a unique runId for every execution, this logic forces Marquez to generate a new DatasetVersion UUID every minute, regardless of whether the actual data/schema has changed.

Affected point

API : get-dataset
Query : ColumnLineageDao.getLineageRowsForDatasets

Proposed Solution:

The DatasetVersion identity should depend on the state of the dataset (Schema, Namespace, Name), not the provenance (Run ID).

I suggest removing data.getRunId() from the version hashing logic. When I locally patched the code by removing that line, Marquez correctly reused the existing version UUID when the schema didn't change, and the performance issue was resolved.

Could you please review this behavior? Ideally, runId should only be associated with the version as a foreign key, not as a seed for the version hash itself.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions