What would you like to be improved?
LancePartitionStatisticStorage defines the table_id Arrow column as a 64-bit integer, but createFragmentMetadata() retrieves it as UInt8Vector and writes a Long table ID into that vector.
Relevant locations: LancePartitionStatisticStorage.java, line 114 (schema definition) and line 388 (createFragmentMetadata()).
This makes the Lance-backed partition statistics update path inconsistent with its own schema and can fail at runtime when statistics are written.
How should we improve?
Use the Arrow vector type that matches the declared schema for table_id, such as BigIntVector, instead of UInt8Vector.
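A minimal sketch of the suggested change, assuming the table_id field is declared as a signed 64-bit integer (ArrowType.Int(64, true)). The class and method names below are hypothetical and are not taken from LancePartitionStatisticStorage; only the Arrow calls are real. With such a schema, VectorSchemaRoot is expected to create a BigIntVector for the column, so that is the type to cast to:

```java
import java.util.Collections;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

// Hypothetical illustration class; not part of the Gravitino code base.
public class TableIdVectorSketch {

  // Writes a table ID into a signed 64-bit "table_id" column and reads it back.
  public static long writeAndReadTableId(long tableId) {
    // Assumption: the real schema declares table_id as a signed 64-bit integer.
    Schema schema =
        new Schema(
            Collections.singletonList(
                Field.notNullable("table_id", new ArrowType.Int(64, true))));

    try (BufferAllocator allocator = new RootAllocator();
        VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      // Cast to the vector class that matches the declared field type.
      BigIntVector tableIdVector = (BigIntVector) root.getVector("table_id");
      tableIdVector.allocateNew(1);
      tableIdVector.setSafe(0, tableId);
      root.setRowCount(1);
      return tableIdVector.get(0);
    }
  }

  public static void main(String[] args) {
    System.out.println(writeAndReadTableId(256L));
  }
}
```

If the column were instead meant to be unsigned, the alternative would be to change the schema declaration rather than the cast; either way the declared field type and the vector class should agree.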
Here's a unit test to help:
@Test
public void testUpdateStatisticsWithLargeTableId() throws Exception {
  PartitionStatisticStorageFactory factory = new LancePartitionStatisticStorageFactory();
  String metalakeName = "metalake";
  MetadataObject metadataObject =
      MetadataObjects.of(
          Lists.newArrayList("catalog", "schema", "table"), MetadataObject.Type.TABLE);

  EntityStore entityStore = mock(EntityStore.class);
  TableEntity tableEntity = mock(TableEntity.class);
  when(entityStore.get(any(), any(), any())).thenReturn(tableEntity);
  when(tableEntity.id()).thenReturn(256L);
  FieldUtils.writeField(GravitinoEnv.getInstance(), "entityStore", entityStore, true);

  String location = Files.createTempDirectory("lance_stats_large_table_id").toString();
  Map<String, String> properties = Maps.newHashMap();
  properties.put("location", location);

  LancePartitionStatisticStorage storage =
      (LancePartitionStatisticStorage) factory.create(properties);

  try {
    Map<String, StatisticValue<?>> statistics = Maps.newHashMap();
    statistics.put("statistic0", StatisticValues.stringValue("value0"));

    storage.updateStatistics(
        metalakeName,
        Lists.newArrayList(
            MetadataObjectStatisticsUpdate.of(
                metadataObject,
                Lists.newArrayList(
                    PartitionStatisticsModification.update("partition0", statistics)))));

    List<PersistedPartitionStatistics> listedStats =
        storage.listStatistics(
            metalakeName,
            metadataObject,
            PartitionRange.between(
                "partition0",
                PartitionRange.BoundType.CLOSED,
                "partition0",
                PartitionRange.BoundType.CLOSED));

    Assertions.assertEquals(1, listedStats.size());
    Assertions.assertEquals("partition0", listedStats.get(0).partitionName());
    Assertions.assertEquals(1, listedStats.get(0).statistics().size());
    Assertions.assertEquals("statistic0", listedStats.get(0).statistics().get(0).name());
    Assertions.assertEquals("value0", listedStats.get(0).statistics().get(0).value().value());
  } finally {
    FileUtils.deleteDirectory(new File(location + "/" + tableEntity.id() + ".lance"));
    storage.close();
  }
}