Hello, Marquez team.
I am operating a Marquez instance integrated with Airflow, running high-frequency batch jobs (e.g., every minute).
I encountered a critical performance issue where the dataset_versions and column_lineage tables grew abnormally large , causing the dataset lineage query to take over 10 seconds or timeout.
Upon investigation, I found that Marquez creates a new dataset version for every single run, even when the schema and dataset facets remain exactly the same.
I dug into the source code and identified that runId is included in the hashing logic for generating the DatasetVersion UUID.
In Utils.java (specifically inside the newDatasetVersionFor method):
private static Version newDatasetVersionFor(DatasetVersionData data) {
final byte[] bytes =
VERSION_JOINER
.join(
data.getNamespace(),
data.getSourceName(),
data.getDatasetName(),
data.getPhysicalName(),
data.getSchemaLocation(),
data.getFields().stream().map(Utils::joinField).collect(joining(VERSION_DELIM)),
data.getLifecycleState(),
data.getRunId())
.getBytes(UTF_8);
return Version.of(UUID.nameUUIDFromBytes(bytes));
}
Since schedulers like Airflow generate a unique runId for every execution, this logic forces Marquez to generate a new DatasetVersion UUID every minute, regardless of whether the actual data/schema has changed.
Affected point
API : get-dataset
Query : ColumnLineageDao.getLineageRowsForDatasets
Proposed Solution:
The DatasetVersion identity should depend on the state of the dataset (Schema, Namespace, Name), not the provenance (Run ID).
I suggest removing data.getRunId() from the version hashing logic. When I locally patched the code by removing that line, Marquez correctly reused the existing version UUID when the schema didn't change, and the performance issue was resolved.
Could you please review this behavior? Ideally, runId should only be associated with the version as a foreign key, not as a seed for the version hash itself.
Hello, Marquez team.
I am operating a Marquez instance integrated with Airflow, running high-frequency batch jobs (e.g., every minute).
I encountered a critical performance issue where the
dataset_versionsandcolumn_lineagetables grew abnormally large , causing the dataset lineage query to take over 10 seconds or timeout.Upon investigation, I found that Marquez creates a new dataset version for every single run, even when the schema and dataset facets remain exactly the same.
I dug into the source code and identified that runId is included in the hashing logic for generating the DatasetVersion UUID.
In
Utils.java(specifically inside thenewDatasetVersionFormethod):Since schedulers like Airflow generate a unique runId for every execution, this logic forces Marquez to generate a new DatasetVersion UUID every minute, regardless of whether the actual data/schema has changed.
Affected point
API : get-dataset
Query : ColumnLineageDao.getLineageRowsForDatasets
Proposed Solution:
The DatasetVersion identity should depend on the state of the dataset (Schema, Namespace, Name), not the provenance (Run ID).
I suggest removing data.getRunId() from the version hashing logic. When I locally patched the code by removing that line, Marquez correctly reused the existing version UUID when the schema didn't change, and the performance issue was resolved.
Could you please review this behavior? Ideally, runId should only be associated with the version as a foreign key, not as a seed for the version hash itself.