Description
Calling add_column with a nullable INT64 field on a collection that has multiple segments (i.e., multiple insert + flush cycles), without providing an expression parameter, causes a SIGABRT (core dump) due to an Apache Arrow assertion failure.
Environment
- zvec version: 0.4.0 (installed from PyPI)
- Python version: 3.14.2
- OS: macOS (arm64, Darwin 24.1.0)
Steps to Reproduce
import zvec
import numpy as np
import tempfile
import os

# Create a fresh collection
tmp_dir = tempfile.mkdtemp()
coll_path = os.path.join(tmp_dir, 'test_coll')
schema = zvec.CollectionSchema(
    name='test_coll',
    vectors=[
        zvec.VectorSchema(
            name='embedding',
            data_type=zvec.DataType.VECTOR_FP32,
            dimension=4,
            index_param=zvec.HnswIndexParam(metric_type=zvec.MetricType.IP),
        )
    ],
)
coll = zvec.create_and_open(coll_path, schema=schema)

# Insert in multiple batches, flushing after each one, to create multiple segments
for batch in range(3):
    docs = [
        zvec.Doc(
            id=str(batch * 10 + i),
            vectors={'embedding': np.random.rand(4).astype(np.float32).tolist()}
        )
        for i in range(4)
    ]
    coll.insert(docs)
    coll.flush()

print('doc_count:', coll.stats.doc_count)  # 12 documents across 3 segments

# Reopen the collection
del coll
coll = zvec.open(coll_path)

# Add a nullable INT64 column without an expression; this call crashes
field = zvec.FieldSchema(name='score', data_type=zvec.DataType.INT64, nullable=True)
coll.add_column(field_schema=field)
Expected Behavior
Either:
- The column should be added successfully with null values for all existing rows across all segments, or
- A Python-level error should be raised indicating that an expression is required (similar to the ValueError raised for non-nullable columns)
Actual Behavior
Process crashes with SIGABRT (exit code 134):
/Users/runner/work/zvec/zvec/thirdparty/arrow/apache-arrow-21.0.0/cpp/src/arrow/chunked_array.cc:170:
Check failed: (offset) <= (length_) Slice offset greater than array length
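Because the abort kills the whole interpreter, it can help to run the reproduction in a child process so a test harness survives the crash. Below is a minimal sketch using only the Python standard library. The helper names run_isolated and crashed_with_sigabrt are hypothetical and not part of zvec. It assumes a POSIX system, where subprocess reports death by signal as a negative return code (a shell shows the same crash as 128 + 6 = 134).

```python
import signal
import subprocess
import sys


def run_isolated(script: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run `script` in a fresh Python interpreter and capture its output.

    Hypothetical helper: a crash in the child cannot take down the caller.
    """
    return subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def crashed_with_sigabrt(proc: subprocess.CompletedProcess) -> bool:
    """Return True if the child was killed by SIGABRT (POSIX only).

    subprocess encodes "terminated by signal N" as returncode == -N.
    """
    return proc.returncode == -signal.SIGABRT
```

Passing the source of the reproduction script above to run_isolated and checking crashed_with_sigabrt on the result would turn the crash into an ordinary boolean; the Arrow "Check failed" message should then be available in proc.stderr.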
Key Observations
- Does NOT crash with a single segment (one insert + flush)
- Crashes when the collection has multiple segments (multiple insert + flush cycles)
- Does NOT crash when providing an expression parameter (e.g., expression='0')
- When nullable=False and no expression is provided, the Python layer correctly raises ValueError, preventing the crash
- The root cause appears to be in the C++ layer: when adding a nullable column to a multi-segment collection without an expression, the code attempts to slice an Arrow ChunkedArray with an invalid offset (likely because null values are not properly materialized for each segment)
Workaround
Provide an explicit expression parameter:
field = zvec.FieldSchema(name='score', data_type=zvec.DataType.INT64, nullable=True)
coll.add_column(field_schema=field, expression='0') # works fine
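Until the underlying bug is fixed, the workaround could be wrapped in a small guard so that call sites cannot forget the expression. This is only a sketch: add_column_with_fallback is a hypothetical helper, not part of zvec, and it assumes that FieldSchema exposes the nullable flag as an attribute. Note that the fallback fills existing rows with 0 rather than null, which may not be what every caller wants.

```python
def add_column_with_fallback(coll, field_schema, expression=None, fallback_expression="0"):
    """Call coll.add_column, supplying an expression for nullable fields.

    Hypothetical helper, not part of zvec. It relies only on the
    add_column(field_schema=..., expression=...) call shown in this report.
    """
    # Assumption: the schema object exposes `nullable` as an attribute.
    nullable = getattr(field_schema, "nullable", False)

    if expression is None and nullable:
        # Avoid the crashing code path by always passing an expression.
        expression = fallback_expression

    if expression is None:
        # Non-nullable field without an expression: let zvec raise its own ValueError.
        return coll.add_column(field_schema=field_schema)

    return coll.add_column(field_schema=field_schema, expression=expression)
```

With the objects from the reproduction, add_column_with_fallback(coll, field) would issue the same call as the workaround above.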