[Bug] Inclusion of runId in DatasetVersion hash causes version explosion and query timeouts in high-frequency jobs

Hello, Marquez team.

I am operating a Marquez instance integrated with Airflow, running high-frequency batch jobs (e.g., every minute).
 I encountered a critical performance issue where the `dataset_versions` and `column_lineage` tables grew abnormally large , causing the dataset lineage query to take over 10 seconds or timeout.

Upon investigation, I found that Marquez creates a new dataset version for every single run, even when the schema and dataset facets remain exactly the same.

I dug into the source code and identified that runId is included in the hashing logic for generating the DatasetVersion UUID.

In `Utils.java` (specifically inside the `newDatasetVersionFor` method):

```
  private static Version newDatasetVersionFor(DatasetVersionData data) {
    final byte[] bytes =
        VERSION_JOINER
            .join(
                data.getNamespace(),
                data.getSourceName(),
                data.getDatasetName(),
                data.getPhysicalName(),
                data.getSchemaLocation(),
                data.getFields().stream().map(Utils::joinField).collect(joining(VERSION_DELIM)),
                data.getLifecycleState(),
                data.getRunId())
            .getBytes(UTF_8);
    return Version.of(UUID.nameUUIDFromBytes(bytes));
  }
```

Since schedulers like Airflow generate a unique runId for every execution, this logic forces Marquez to generate a new DatasetVersion UUID every minute, regardless of whether the actual data/schema has changed.

### Affected point
API  : [get-dataset](https://marquezproject.ai/docs/api/get-dataset/)
Query : ColumnLineageDao.getLineageRowsForDatasets


### Proposed Solution:

 The DatasetVersion identity should depend on the state of the dataset (Schema, Namespace, Name), not the provenance (Run ID).

I suggest removing data.getRunId() from the version hashing logic. When I locally patched the code by removing that line, Marquez correctly reused the existing version UUID when the schema didn't change, and the performance issue was resolved.

Could you please review this behavior? Ideally, runId should only be associated with the version as a foreign key, not as a seed for the version hash itself.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Inclusion of runId in DatasetVersion hash causes version explosion and query timeouts in high-frequency jobs #3082

Affected point

Proposed Solution:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Bug] Inclusion of runId in DatasetVersion hash causes version explosion and query timeouts in high-frequency jobs #3082

Description

Affected point

Proposed Solution:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions