[Bug] Core dump when adding nullable INT64 column without expression to non-empty collection #415

@feihongxu0824

Description

Calling add_column with a nullable INT64 field on a collection that has multiple segments (i.e., multiple insert + flush cycles), without providing an expression parameter, causes a SIGABRT (core dump) due to an Apache Arrow assertion failure.

Environment

  • zvec version: 0.4.0 (installed from PyPI)
  • Python version: 3.14.2
  • OS: macOS (arm64, Darwin 24.1.0)

Steps to Reproduce

import zvec
import numpy as np
import tempfile
import os

# Create a fresh collection
tmp_dir = tempfile.mkdtemp()
coll_path = os.path.join(tmp_dir, 'test_coll')

schema = zvec.CollectionSchema(
    name='test_coll',
    vectors=[
        zvec.VectorSchema(
            name='embedding',
            data_type=zvec.DataType.VECTOR_FP32,
            dimension=4,
            index_param=zvec.HnswIndexParam(metric_type=zvec.MetricType.IP),
        )
    ],
)

coll = zvec.create_and_open(coll_path, schema=schema)

# Insert in multiple batches with flush to create multiple segments
for batch in range(3):
    docs = [
        zvec.Doc(
            id=str(batch * 10 + i),
            vectors={'embedding': np.random.rand(4).astype(np.float32).tolist()}
        )
        for i in range(4)
    ]
    coll.insert(docs)
    coll.flush()

print('doc_count:', coll.stats.doc_count)  # 12 documents across 3 segments

# Reopen the collection
del coll
coll = zvec.open(coll_path)

# Add a nullable INT64 column without expression — this crashes
field = zvec.FieldSchema(name='score', data_type=zvec.DataType.INT64, nullable=True)
coll.add_column(field_schema=field)

Expected Behavior

Either:

  1. The column should be added successfully with null values for all existing rows across all segments, or
  2. A Python-level error should be raised indicating that an expression is required (similar to the ValueError raised for non-nullable columns)

Actual Behavior

The process crashes with SIGABRT (exit code 134) after an Arrow check failure:

/Users/runner/work/zvec/zvec/thirdparty/arrow/apache-arrow-21.0.0/cpp/src/arrow/chunked_array.cc:170:
Check failed: (offset) <= (length_) Slice offset greater than array length
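
Because the failure is an abort rather than a Python exception, a try/except around add_column cannot catch it. A minimal sketch of one way to observe the crash without killing the calling process is to run the reproduction in a child interpreter and decode the return code. Exit code 134 follows the shell convention of 128 plus the signal number (SIGABRT is 6), while Python's subprocess module reports the same event as a negative signal number. The helper names run_isolated and describe_exit are hypothetical, and the decoding assumes a POSIX system.

```python
import signal
import subprocess
import sys


def run_isolated(source: str) -> int:
    """Run Python source in a child interpreter and return its return code."""
    result = subprocess.run([sys.executable, "-c", source], capture_output=True)
    return result.returncode


def describe_exit(returncode: int) -> str:
    """Describe a return code, decoding both signal conventions (POSIX only)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess convention: -N means the child was killed by signal N.
        signum = -returncode
    elif returncode > 128:
        # Shell convention: 128 + N. Ambiguous with a plain exit status > 128.
        signum = returncode - 128
    else:
        return f"exited with status {returncode}"
    try:
        return f"killed by {signal.Signals(signum).name}"
    except ValueError:
        # Not a known signal number, so treat it as an ordinary exit status.
        return f"exited with status {returncode}"
```

Passing the reproduction script above as the source argument to run_isolated should then yield a return code that describe_exit reports as a SIGABRT kill on an affected build, which makes the crash usable as a regression check.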

Key Observations

  • Does NOT crash with a single segment (one insert + flush)
  • Crashes when the collection has multiple segments (multiple insert + flush cycles)
  • Does NOT crash when providing an expression parameter (e.g., expression='0')
  • When nullable=False and no expression is provided, the Python layer correctly raises ValueError, preventing the crash
  • The root cause appears to be in the C++ layer: when adding a nullable column to a multi-segment collection without an expression, the code attempts to slice an Arrow ChunkedArray with an invalid offset (likely because null values are not properly materialized for each segment)
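
The hypothesis in the last bullet can be illustrated with a toy model in plain Python. This is not zvec's or Arrow's code, and every name in it is hypothetical. It assumes each segment slices the new column by its global row offset: if nulls are materialized for all rows, every slice succeeds; if they are materialized for only one segment's worth of rows, a later segment's offset exceeds the column length and the same precondition as the failed Arrow check is violated.

```python
from itertools import accumulate


def segment_offsets(segment_sizes):
    """Return the starting global row offset of each segment."""
    return [0, *accumulate(segment_sizes)][:-1]


def checked_slice(column, offset, length):
    """Slice a list, enforcing offset <= len(column) like the Arrow check."""
    if offset > len(column):
        raise IndexError("Slice offset greater than array length")
    return column[offset:offset + length]


def split_column(column, segment_sizes):
    """Cut one column into per-segment pieces using global row offsets."""
    return [
        checked_slice(column, offset, size)
        for offset, size in zip(segment_offsets(segment_sizes), segment_sizes)
    ]


# Three segments of 4 rows, as in the reproduction.
sizes = [4, 4, 4]

# Nulls materialized for all 12 rows: every segment gets its 4 nulls.
full = split_column([None] * sum(sizes), sizes)

# Nulls materialized for only 4 rows: a later segment's offset is out of range.
try:
    split_column([None] * 4, sizes)
    failed = False
except IndexError:
    failed = True
```

In this model a single segment never fails, because its only offset is 0, which is consistent with the first observation above. Whether the real C++ code follows this pattern is unconfirmed.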

Workaround

Provide an explicit expression parameter:

field = zvec.FieldSchema(name='score', data_type=zvec.DataType.INT64, nullable=True)
coll.add_column(field_schema=field, expression='0')  # works fine
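
Until the underlying bug is fixed, callers may prefer a Python exception to a core dump. The sketch below wraps the call in a hypothetical helper, add_column_guarded, that refuses to call add_column without an expression. It relies only on the add_column(field_schema=..., expression=...) call shown in this issue and makes no other assumptions about the zvec API.

```python
def add_column_guarded(coll, field, expression=None):
    """Call coll.add_column, but refuse the argument combination that aborts.

    Hypothetical helper: raises ValueError instead of letting the process
    crash when no expression is supplied.
    """
    if expression is None:
        raise ValueError(
            "add_column without an expression can abort the process on "
            "multi-segment collections; pass an explicit expression"
        )
    coll.add_column(field_schema=field, expression=expression)


# Usage with the objects from the workaround above:
#     add_column_guarded(coll, field, expression='0')
```

The guard is deliberately stricter than necessary, since single-segment collections do not crash. Note also that expression='0' presumably fills existing rows with 0 rather than null, so the workaround does not give the semantics described under Expected Behavior.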

Metadata

Assignees

Labels

bug: Something isn't working


Projects

Status

Backlog

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
