[Improvement] Fix Lance partition statistics writes failing due to table_id vector type mismatch #10603

@justinmclean

Description

What would you like to be improved?

LancePartitionStatisticStorage defines the table_id Arrow column as a 64-bit integer, but createFragmentMetadata() retrieves it as UInt8Vector and writes a Long table ID into that vector.

Relevant locations: LancePartitionStatisticStorage.java, lines 114 and 388.

This makes the Lance-backed partition statistics update path inconsistent with its own schema and can fail at runtime when statistics are written.

How should we improve?

Use the Arrow vector type that matches the declared schema for table_id, such as BigIntVector, instead of UInt8Vector.
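As a rough illustration of the proposed fix, here is a minimal standalone sketch. It is not the code from LancePartitionStatisticStorage: the class name TableIdVectorSketch, the helper writeAndReadTableId, and the single-column schema are hypothetical, and it assumes the Arrow Java libraries (arrow-vector plus a memory implementation) are on the classpath. It declares table_id as a signed 64-bit integer, as the issue describes, and accesses the column through BigIntVector.

```java
import java.util.Collections;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class TableIdVectorSketch {

  // Hypothetical one-column schema: table_id as a signed 64-bit integer.
  static final Schema SCHEMA =
      new Schema(
          Collections.singletonList(
              new Field(
                  "table_id",
                  FieldType.notNullable(new ArrowType.Int(64, true)),
                  Collections.emptyList())));

  // Writes one table ID into the column and reads it back.
  static long writeAndReadTableId(long tableId) {
    try (BufferAllocator allocator = new RootAllocator();
        VectorSchemaRoot root = VectorSchemaRoot.create(SCHEMA, allocator)) {
      // A signed Int(64) field is backed by BigIntVector, so this cast
      // matches the declared schema. Casting to UInt8Vector here would not.
      BigIntVector tableIdVector = (BigIntVector) root.getVector("table_id");
      tableIdVector.setSafe(0, tableId);
      root.setRowCount(1);
      return tableIdVector.get(0);
    }
  }

  public static void main(String[] args) {
    System.out.println(writeAndReadTableId(256L));
  }
}
```

The point of the sketch is only that the vector class used on the write path should be the one Arrow creates for the declared field type. The actual change in createFragmentMetadata() may look different.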

Here's a unit test that exercises the update path:

  @Test
  public void testUpdateStatisticsWithLargeTableId() throws Exception {
    PartitionStatisticStorageFactory factory = new LancePartitionStatisticStorageFactory();
    String metalakeName = "metalake";
    MetadataObject metadataObject =
        MetadataObjects.of(
            Lists.newArrayList("catalog", "schema", "table"), MetadataObject.Type.TABLE);

    EntityStore entityStore = mock(EntityStore.class);
    TableEntity tableEntity = mock(TableEntity.class);
    when(entityStore.get(any(), any(), any())).thenReturn(tableEntity);
    when(tableEntity.id()).thenReturn(256L);
    FieldUtils.writeField(GravitinoEnv.getInstance(), "entityStore", entityStore, true);

    String location = Files.createTempDirectory("lance_stats_large_table_id").toString();
    Map<String, String> properties = Maps.newHashMap();
    properties.put("location", location);

    LancePartitionStatisticStorage storage =
        (LancePartitionStatisticStorage) factory.create(properties);
    try {
      Map<String, StatisticValue<?>> statistics = Maps.newHashMap();
      statistics.put("statistic0", StatisticValues.stringValue("value0"));

      storage.updateStatistics(
          metalakeName,
          Lists.newArrayList(
              MetadataObjectStatisticsUpdate.of(
                  metadataObject,
                  Lists.newArrayList(
                      PartitionStatisticsModification.update("partition0", statistics)))));

      List<PersistedPartitionStatistics> listedStats =
          storage.listStatistics(
              metalakeName,
              metadataObject,
              PartitionRange.between(
                  "partition0",
                  PartitionRange.BoundType.CLOSED,
                  "partition0",
                  PartitionRange.BoundType.CLOSED));

      Assertions.assertEquals(1, listedStats.size());
      Assertions.assertEquals("partition0", listedStats.get(0).partitionName());
      Assertions.assertEquals(1, listedStats.get(0).statistics().size());
      Assertions.assertEquals("statistic0", listedStats.get(0).statistics().get(0).name());
      Assertions.assertEquals("value0", listedStats.get(0).statistics().get(0).value().value());
    } finally {
      // Close the storage before deleting its on-disk data.
      storage.close();
      FileUtils.deleteDirectory(new File(location + "/" + tableEntity.id() + ".lance"));
    }
  }
