feat(clp_s::filter): Add library for representing bloom filters; Add xxHash v0.8.3 as a dependency by ShangDanLuXian · Pull Request #2010 · y-scope/clp

ShangDanLuXian · 2026-02-20T14:14:42Z

Description

This is the first PR for the issue #2090
This PR is intentionally narrowed to a standalone clp_s Bloom filter library + xxHash integration + unit tests.

Main changes:

Add BloomFilter implementation under components/core/src/clp_s/filter/:
- BloomFilter.cpp/.hpp
- ErrorCode.cpp/.hpp
- XxHash.cpp/.hpp
Add dependency plumbing for xxHash (v0.8.3) in taskfiles/deps/main.yaml.
Add/adjust CMake wiring for clp_s::filter:
- components/core/src/clp_s/filter/CMakeLists.txt
- components/core/src/clp_s/CMakeLists.txt
- components/core/CMakeLists.txt
- components/core/cmake/Options/options.cmake
Add unit tests under components/core/src/clp_s/filter/tests/:
- test-clp_s-bloom_filter.cpp
- test-clp_s-xxhash.cpp
Pin xxHash usage/version expectations in tests via seeded sanity vectors (XXH3_64bits_withSeed, v0.8.3).

Why xxHash:

Performance: Bloom filtering is hash-heavy; xxHash is fast and suitable for this workload.
Portability: upstream implementation is widely used across architectures/toolchains.
Stability: dependency is pinned in our deps flow, and test vectors are included to detect drift in hash behavior.

Out of scope in this PR:

integration work for other filter-library components (beyond BloomFilter/xxHash/unit tests)
archive writer integration
package/runtime plumbing
query/runtime behavior changes

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Added unit tests:
- components/core/src/clp_s/filter/tests/test-clp_s-bloom_filter.cpp
- components/core/src/clp_s/filter/tests/test-clp_s-xxhash.cpp
Verified build/dependency wiring updates for clp_s::filter and xxHash integration.

coderabbitai · 2026-02-20T14:15:00Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

A new Bloom filter–based filtering system is introduced under clp_s::filter, including a BloomFilter implementation with xxHash integration for hashing, along with supporting infrastructure for configuration, file I/O, and reader interfaces. Build system integration and dependency management for xxHash 0.8.3 are added via CMake.

Changes

Cohort / File(s)	Summary
Bloom Filter Core `components/core/src/clp_s/filter/BloomFilter.hpp`, `BloomFilter.cpp`	Implements a Bloom filter with optimal parameter computation, element addition, membership checking, bit manipulation, and file serialization/deserialization using xxHash for hashing.
Filter Configuration & Error Handling `components/core/src/clp_s/filter/FilterConfig.hpp`, `FilterConfig.cpp`, `ErrorCode.hpp`, `ErrorCode.cpp`	Defines FilterType enum and FilterConfig struct; implements parsing and string conversion utilities; provides error code enumeration with custom error messages for filter operations.
Filter I/O Infrastructure `components/core/src/clp_s/filter/FilterFile.hpp`, `FilterFile.cpp`, `FilterReader.hpp`, `FilterReader.cpp`	Establishes file format with magic header; implements write and read operations for filter payloads; provides FilterReader class for membership queries with normalized input handling.
Filter Builder `components/core/src/clp_s/filter/FilterBuilder.hpp`, `FilterBuilder.cpp`	Constructs BloomFilter instances based on FilterConfig; normalizes input to lowercase; serializes filter to file with optional behaviour based on enabled state.
Hash Integration `components/core/src/clp_s/filter/XxHash.hpp`, `XxHash.cpp`	Wraps xxHash v0.8.3 functionality; exposes seeded 64-bit hash function for Bloom filter hash computation.
Build System Setup `components/core/src/clp_s/filter/CMakeLists.txt`, `components/core/src/clp_s/CMakeLists.txt`, `components/core/CMakeLists.txt`	Adds filter subdirectory and clp_s_filter library target; configures compiler features, include paths, and dependencies (clp_s::io, clp::string_utils, xxHash::xxhash); integrates optional xxHash dependency validation.
Build Options & Dependencies `components/core/cmake/Options/options.cmake`, `taskfiles/deps/main.yaml`	Introduces CLP_BUILD_CLP_S_FILTER option; declares xxHash as required dependency; defines xxHash v0.8.3 installation task with SHA validation.
Unit Tests `components/core/src/clp_s/filter/tests/test-clp_s-bloom_filter.cpp`, `test-clp_s-xxhash.cpp`	Validates Bloom filter parameter constraints, false-negative/positive rates against statistical bounds, and xxHash seeded output against test vectors.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant FilterBuilder
    participant BloomFilter
    participant FileWriter
    
    Client->>FilterBuilder: FilterBuilder(FilterConfig, num_elements)
    FilterBuilder->>BloomFilter: create(num_elements, fpr)
    BloomFilter->>BloomFilter: compute_optimal_parameters(num_elements, fpr)
    BloomFilter-->>FilterBuilder: BloomFilter instance
    Client->>FilterBuilder: add(value)
    FilterBuilder->>BloomFilter: add(lowercase(value))
    Client->>FilterBuilder: write(path)
    FilterBuilder->>FileWriter: new FileWriter(path)
    FilterBuilder->>BloomFilter: write_to_file(writer)
    BloomFilter->>FileWriter: serialize data
    FileWriter-->>FilterBuilder: success

sequenceDiagram
    participant Client
    participant FilterReader
    participant FileReader
    participant BloomFilter
    
    Client->>FilterReader: read_from_file(reader)
    FilterReader->>FileReader: read magic header
    FileReader-->>FilterReader: magic valid
    FilterReader->>FileReader: read filter type
    FileReader-->>FilterReader: FilterType
    alt FilterType is Bloom
        FilterReader->>BloomFilter: try_read_from_file(reader)
        BloomFilter->>FileReader: read parameters & data
        FileReader-->>BloomFilter: bloom data
        BloomFilter-->>FilterReader: BloomFilter instance
    end
    FilterReader-->>Client: true
    Client->>FilterReader: possibly_contains(value)
    FilterReader->>BloomFilter: possibly_contains(lowercase(value))
    BloomFilter-->>FilterReader: bool result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 12.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main changes: adding a Bloom filter library to clp_s and integrating xxHash v0.8.3 as a dependency, which are the primary objectives of the changeset.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can generate walkthrough in a markdown collapsible section to save space.

Enable the reviews.collapse_walkthrough setting to generate walkthrough in a markdown collapsible section.

coderabbitai

Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/CMakeLists.txt`:
- Around line 722-725: Reorder the test entries in
components/core/CMakeLists.txt so they are alphabetical: move the line with
tests/test-clp_s-bloom_filter.cpp to appear before
tests/test-clp_s-delta-encode-log-order.cpp (ensure the sequence around
tests/test-clp_s-end_to_end.cpp and tests/test-clp_s-range_index.cpp still
remains alphabetically sorted).

In `@components/core/src/clp_s/CMakeLists.txt`:
- Around line 259-264: The CMake target clp_s_filter currently links
clp::string_utils as PUBLIC even though its symbols are only used in
implementation files (FilterBuilder.cpp and FilterReader.cpp) and not in public
headers (FilterBuilder.hpp, FilterReader.hpp); change the linkage for
clp::string_utils from PUBLIC to PRIVATE in the target_link_libraries call for
clp_s_filter so the dependency is not propagated transitively (do this as part
of a follow-up CMake cleanup PR if you prefer).

In `@components/core/src/clp_s/filter/BloomFilter.cpp`:
- Line 49: Replace the negation usage with the project's preferred
equality-to-false pattern: change occurrences of
"!std::isfinite(ideal_bit_array_size)", "!m_bit_array.empty()", and
"!bit_array.empty()" in BloomFilter.cpp to "false ==
std::isfinite(ideal_bit_array_size)", "false == m_bit_array.empty()", and "false
== bit_array.empty()" respectively; update the conditional expressions in the
functions/methods that reference ideal_bit_array_size, m_bit_array, and
bit_array so they use the exact "false == <expression>" form while preserving
parentheses and surrounding logic.
- Around line 56-62: The calculation of num_hash_functions currently truncates
the optimal k by casting to uint32_t; change this to round to the nearest
integer (e.g., use std::lround or std::round followed by cast) when computing
num_hash_functions from (bit_array_size / expected_num_elements) * ln2, and keep
the existing std::clamp into capped_num_hash_functions with kMinNumHashFunctions
and kMaxNumHashFunctions; also ensure <cmath> is included so std::lround/
std::round is available.
- Around line 132-144: Add a sanity upper-bound check on bit_array_bytes before
allocating the vector to avoid OOM from malformed input: in the same block that
computes expected_bit_array_bytes and checks bit_array_bytes, define a
reasonable maximum (e.g., constexpr size_t kMaxBitArrayBytes = 256ULL * 1024 *
1024) and return false if bit_array_bytes > kMaxBitArrayBytes; leave the
existing std::numeric_limits<size_t>::max() check in place and only perform the
new bound check before constructing std::vector<uint8_t> bit_array.

In `@components/core/src/clp_s/filter/BloomFilter.hpp`:
- Around line 1-2: Update the include guard in BloomFilter.hpp to match the
project's convention by replacing the current macro CLP_S_BLOOM_FILTER_HPP with
CLP_S_FILTER_BLOOM_FILTER_HPP at the top of the file and update the matching
closing `#endif` comment (and any other occurrences) to use
CLP_S_FILTER_BLOOM_FILTER_HPP so it follows the CLP_S_FILTER_* prefix used by
other headers in filter/.

In `@components/core/src/clp_s/filter/FilterBuilder.cpp`:
- Around line 24-33: FilterBuilder::write currently opens a FileWriter and calls
write_filter_file(m_config.type, m_filter) but if write_filter_file throws the
FileWriter::close() is never called; update FilterBuilder::write to wrap the
write_filter_file call in a try/catch (catch all), call writer.close() in the
catch, rethrow the exception, and keep the normal writer.close() after the try
so the file is always closed even on errors; reference the FileWriter instance
(writer), the write_filter_file(...) call, and the existing open/close sequence
when making the change.

In `@components/core/src/clp_s/filter/FilterFile.cpp`:
- Around line 30-43: The code uses two different default-construction syntaxes
for BloomFilter (BloomFilter{} at the assignment to out_filter when
FilterType::None and BloomFilter() later); pick one style and make both
assignments consistent—either change the first assignment to out_filter =
BloomFilter(); or change the second to out_filter = BloomFilter{}—so both
occurrences (the assignments to out_filter in the branch handling
FilterType::None and the branch before calling
out_filter.read_from_file(reader)) use the same form.

In `@components/core/src/clp_s/filter/FilterFile.hpp`:
- Line 18: The current write_filter_file(FileWriter& writer, FilterType type,
BloomFilter const& filter) API forces callers to provide a BloomFilter even when
type == FilterType::None; add an ergonomic overload such as
write_filter_file(FileWriter& writer, FilterType type) that handles the None
case (or change the signature to accept std::optional<BloomFilter const&>) so
callers can omit a dummy BloomFilter; update implementations of
write_filter_file and any callers to use the new overload for FilterType::None
and keep the existing three-argument overload for the cases that actually need a
BloomFilter.

In `@components/core/tests/test-clp_s-bloom_filter.cpp`:
- Around line 46-51: Replace the negation-style check in the loop that counts
false_negatives by using the project's preferred equality comparison: change the
condition that currently reads using !filter.possibly_contains(value) to use
false == filter.possibly_contains(value). Update the loop that iterates over
inserted and increments false_negatives to use this new condition so the
behavior is unchanged but conforms to the guideline; references:
false_negatives, inserted, and filter.possibly_contains.

In `@taskfiles/deps/main.yaml`:
- Around line 577-595: Update the xxhash dependency to v0.8.3 and verify the
archive SHA256: change the URL value referenced by the URL variable to the
v0.8.3 tarball and replace FILE_SHA256 with the verified SHA256 for that exact
archive; ensure the variables LIB_NAME, XXHASH_OUTPUT_DIR and the generated
cmake file path (G_DEPS_CPP_CMAKE_SETTINGS_DIR/{{.LIB_NAME}}.cmake) remain
correct; after updating, download the new URL locally to compute and confirm the
SHA256 before committing to avoid GitHub archive instability.

coderabbitai · 2026-02-20T14:23:00Z

+
+    auto const num_hash_functions = static_cast<uint32_t>(
+            static_cast<double>(bit_array_size) / static_cast<double>(expected_num_elements) * ln2
+    );
+
+    uint32_t const capped_num_hash_functions
+            = std::clamp(num_hash_functions, kMinNumHashFunctions, kMaxNumHashFunctions);


🧹 Nitpick | 🔵 Trivial

Truncation in num_hash_functions computation may yield a sub-optimal hash count.

Line 57–59 truncates the optimal k via static_cast<uint32_t>(...). The standard formula k = (m/n) * ln(2) is ideally rounded to the nearest integer rather than floored. With std::clamp the result is safe (minimum 1), but using std::round or std::lround before the cast would give a more accurate hash count for borderline values.

Proposed fix

- auto const num_hash_functions = static_cast<uint32_t>( - static_cast<double>(bit_array_size) / static_cast<double>(expected_num_elements) * ln2 - ); + auto const num_hash_functions = static_cast<uint32_t>(std::llround( + static_cast<double>(bit_array_size) / static_cast<double>(expected_num_elements) * ln2 + ));

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@components/core/src/clp_s/filter/BloomFilter.cpp` around lines 56 - 62, The calculation of num_hash_functions currently truncates the optimal k by casting to uint32_t; change this to round to the nearest integer (e.g., use std::lround or std::round followed by cast) when computing num_hash_functions from (bit_array_size / expected_num_elements) * ln2, and keep the existing std::clamp into capped_num_hash_functions with kMinNumHashFunctions and kMaxNumHashFunctions; also ensure <cmath> is included so std::lround/ std::round is available.

coderabbitai · 2026-02-20T14:23:01Z

+    uint64_t bit_array_bytes = 0;
+    if (clp::ErrorCode_Success != reader.try_read_numeric_value(bit_array_bytes)) {
+        return false;
+    }
+    if (bit_array_bytes > std::numeric_limits<size_t>::max()) {
+        return false;
+    }
+    size_t const expected_bit_array_bytes = (bit_array_size + 7) / 8;
+    if (bit_array_bytes != expected_bit_array_bytes) {
+        return false;
+    }
+
+    std::vector<uint8_t> bit_array(static_cast<size_t>(bit_array_bytes));


🧹 Nitpick | 🔵 Trivial

Consider adding a sanity upper-bound on bit_array_bytes before allocation.

read_from_file validates that bit_array_bytes == (bit_array_size + 7) / 8 but imposes no ceiling on the absolute size. A crafted file could set bit_array_size close to SIZE_MAX, causing a multi-gigabyte std::vector allocation (Line 144) that may OOM. Adding a reasonable maximum (e.g., 256 MiB) would harden the deserialization path against malformed input.

Not critical for a library-only PR, but worth adding before this is wired into archive reading.

Example guard

+ // Reject unreasonably large filters (256 MiB cap). + constexpr size_t kMaxBitArrayBytes{256ULL * 1024 * 1024}; + if (bit_array_bytes > kMaxBitArrayBytes) { + return false; + } + size_t const expected_bit_array_bytes = (bit_array_size + 7) / 8;

🧰 Tools

🪛 Cppcheck (2.19.0)

[style] 142-142: The function 'read_numeric_value' is never used.

(unusedFunction)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@components/core/src/clp_s/filter/BloomFilter.cpp` around lines 132 - 144, Add a sanity upper-bound check on bit_array_bytes before allocating the vector to avoid OOM from malformed input: in the same block that computes expected_bit_array_bytes and checks bit_array_bytes, define a reasonable maximum (e.g., constexpr size_t kMaxBitArrayBytes = 256ULL * 1024 * 1024) and return false if bit_array_bytes > kMaxBitArrayBytes; leave the existing std::numeric_limits<size_t>::max() check in place and only perform the new bound check before constructing std::vector<uint8_t> bit_array.

coderabbitai · 2026-02-20T14:23:01Z

+bool FilterBuilder::write(std::string const& filter_path) const {
+    if (false == m_enabled) {
+        return false;
+    }
+    FileWriter writer;
+    writer.open(filter_path, FileWriter::OpenMode::CreateForWriting);
+    write_filter_file(writer, m_config.type, m_filter);
+    writer.close();
+    return true;
+}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find FileWriter class definition find . -type f $ -name "*.cpp" -o -name "*.hpp" -o -name "*.h" $ | xargs grep -l "class FileWriter"

Repository: y-scope/clp

Length of output: 179

🏁 Script executed:

# Search for FileWriter header and implementation files fd -e hpp -e h FileWriter | head -20

Repository: y-scope/clp

Length of output: 173

🏁 Script executed:

# Use ast-grep to find FileWriter class definition ast-grep --pattern 'class FileWriter { $$$ }'

Repository: y-scope/clp

Length of output: 37

🏁 Script executed:

cat -n components/core/src/clp_s/FileWriter.hpp

Repository: y-scope/clp

Length of output: 4880

🏁 Script executed:

# Also check for FileWriter implementation in cpp files fd -path "*/clp_s/*" -name "FileWriter.cpp"

Repository: y-scope/clp

Length of output: 223

🏁 Script executed:

find . -path "*/clp_s/*" -name "FileWriter.cpp" -o -path "*/clp_s/*" -name "FileWriter.cpp"

Repository: y-scope/clp

Length of output: 97

🏁 Script executed:

# Search more broadly for FileWriter.cpp find . -name "FileWriter.cpp"

Repository: y-scope/clp

Length of output: 179

🏁 Script executed:

cat -n components/core/src/clp_s/FileWriter.cpp

Repository: y-scope/clp

Length of output: 5687

🏁 Script executed:

# Verify FilterBuilder.cpp location and context cat -n components/core/src/clp_s/filter/FilterBuilder.cpp | head -40

Repository: y-scope/clp

Length of output: 1259

🏁 Script executed:

# Check if there are other usages of FileWriter in the codebase with similar patterns rg -n "FileWriter.*writer" --type cpp | grep -A 3 -B 3 "write_filter_file\|writer.open"

Repository: y-scope/clp

Length of output: 940

🏁 Script executed:

# Check other usages of FileWriter to see if similar pattern exists elsewhere rg -B 2 -A 5 "FileWriter\s+\w+;" --type cpp | grep -A 10 "\.open"

Repository: y-scope/clp

Length of output: 9334

🏁 Script executed:

# Look at FilterFile.cpp to understand what write_filter_file does cat -n components/core/src/clp_s/filter/FilterFile.cpp

Repository: y-scope/clp

Length of output: 1774

Ensure file is closed even if an exception occurs during filtering.

If write_filter_file() throws (e.g., due to disk full or write failure), writer.close() on line 31 is never called. The FileWriter destructor does not implement cleanup—it only logs an error. Wrap the write operation in a try-catch block:

Suggested fix

bool FilterBuilder::write(std::string const& filter_path) const { if (false == m_enabled) { return false; } FileWriter writer; writer.open(filter_path, FileWriter::OpenMode::CreateForWriting); try { write_filter_file(writer, m_config.type, m_filter); } catch (...) { writer.close(); throw; } writer.close(); return true; }

🧰 Tools

🪛 Cppcheck (2.19.0)

[style] 24-24: The function 'is_decimal_digit' is never used.

(unusedFunction)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@components/core/src/clp_s/filter/FilterBuilder.cpp` around lines 24 - 33, FilterBuilder::write currently opens a FileWriter and calls write_filter_file(m_config.type, m_filter) but if write_filter_file throws the FileWriter::close() is never called; update FilterBuilder::write to wrap the write_filter_file call in a try/catch (catch all), call writer.close() in the catch, rethrow the exception, and keep the normal writer.close() after the try so the file is always closed even on errors; reference the FileWriter instance (writer), the write_filter_file(...) call, and the existing open/close sequence when making the change.

I am intentionally writing like this because I checked that in JsonConstructor.cpp and SchemaMap.cpp there is also no protection against writer.close(), so make it consistent. Also that the loss of data in the writer is acceptable in the context of filter.

I am actually considerring add try catch in the caller(integration) but not here

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check JsonConstructor.cpp and SchemaMap.cpp for FileWriter usage patterns echo "=== Checking JsonConstructor.cpp ===" rg -n -C 5 "FileWriter.*writer" components/core/src/clp_s/JsonConstructor.cpp || echo "No matches in JsonConstructor.cpp" echo "" echo "=== Checking SchemaMap.cpp ===" rg -n -C 5 "FileWriter.*writer" components/core/src/clp_s/SchemaMap.cpp || echo "No matches in SchemaMap.cpp" echo "" echo "=== Looking for writer.open and writer.close patterns in both files ===" rg -n -A 10 "writer\.open" components/core/src/clp_s/JsonConstructor.cpp components/core/src/clp_s/SchemaMap.cpp

Length of output: 5237

@ShangDanLuXian, thank you for the explanation! I've confirmed that JsonConstructor.cpp and SchemaMap.cpp indeed follow the same pattern—FileWriter operations without try-catch protection around the write calls before close(). Your approach is consistent with the existing codebase conventions, and I understand your reasoning that data loss in the filter context is acceptable, with exception handling deferred to the caller/integration layer. I've noted this for future reviews.

✏️ Learnings added

Learnt from: ShangDanLuXian Repo: y-scope/clp PR: 2010 File: components/core/src/clp_s/filter/FilterBuilder.cpp:24-33 Timestamp: 2026-02-20T15:19:43.528Z Learning: In components/core/src/clp_s, FileWriter usage follows a pattern where writer.open(), write operations, and writer.close() are called without try-catch protection around the write operations. This is intentional—data loss is considered acceptable in these contexts (e.g., filters), and exception handling is deferred to the caller/integration layer. Examples: JsonConstructor.cpp, SchemaMap.cpp, and FilterBuilder.cpp.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: haiqi96 Repo: y-scope/clp PR: 523 File: components/core/src/clp/BufferedFileReader.cpp:96-106 Timestamp: 2024-10-24T14:45:26.265Z Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Learnt from: anlowee Repo: y-scope/clp PR: 1176 File: components/core/src/clp_s/ColumnWriter.cpp:49-127 Timestamp: 2025-08-13T15:25:40.805Z Learning: In components/core/src/clp_s/ColumnWriter.cpp, the FormattedFloatColumnWriter::add_value method intentionally allows std::stod to throw exceptions (std::invalid_argument or std::out_of_range) without catching them, as this is designed to terminate decompression when encountering invalid float strings for data integrity purposes.

Learnt from: AVMatthews Repo: y-scope/clp PR: 543 File: components/core/src/clp_s/JsonParser.cpp:769-779 Timestamp: 2024-10-07T21:16:41.660Z Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Learnt from: gibber9809 Repo: y-scope/clp PR: 624 File: components/core/src/clp/BoundedReader.hpp:25-27 Timestamp: 2024-12-10T16:56:33.545Z Learning: In the CLP project, within constructors like `BoundedReader` in `components/core/src/clp/BoundedReader.hpp`, it's acceptable to call methods like `get_pos()` that may throw exceptions without handling potential errors via `try_get_pos()`, since the constructor itself handles errors by throwing.

Learnt from: Bill-hbrhbr Repo: y-scope/clp PR: 1122 File: components/core/src/clp/clp/CMakeLists.txt:175-195 Timestamp: 2025-07-23T09:54:45.185Z Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

coderabbitai · 2026-02-20T14:23:01Z

+namespace clp_s::filter {
+constexpr char kFilterFileMagic[4] = {'C', 'L', 'P', 'F'};
+
+void write_filter_file(FileWriter& writer, FilterType type, BloomFilter const& filter);


🧹 Nitpick | 🔵 Trivial

Consider the API ergonomics of write_filter_file for FilterType::None.

The signature requires a BloomFilter const& even when type is FilterType::None (where the filter is unused). This forces callers to construct a dummy BloomFilter for the no-filter case. A simple overload that takes only FileWriter& and FilterType for the None path (or using std::optional<BloomFilter const&>) would make the intent clearer.

Not a blocker for the initial library PR, but worth considering before archive-writer integration.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@components/core/src/clp_s/filter/FilterFile.hpp` at line 18, The current write_filter_file(FileWriter& writer, FilterType type, BloomFilter const& filter) API forces callers to provide a BloomFilter even when type == FilterType::None; add an ergonomic overload such as write_filter_file(FileWriter& writer, FilterType type) that handles the None case (or change the signature to accept std::optional<BloomFilter const&>) so callers can omit a dummy BloomFilter; update implementations of write_filter_file and any callers to use the new overload for FilterType::None and keep the existing three-argument overload for the cases that actually need a BloomFilter.

ShangDanLuXian · 2026-02-20T14:25:45Z

+constexpr size_t kDefaultBitArraySize{64};
+constexpr uint32_t kDefaultNumHashFunctions{1};
+constexpr uint32_t kMinNumHashFunctions{1};
+constexpr uint32_t kMaxNumHashFunctions{20};


Correspond to 1-e6 FPR, which will be set as the min FPR in the integration

coderabbitai

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@components/core/CMakeLists.txt`:
- Around line 722-724: No change required because the test entries in
components/core/CMakeLists.txt are now correctly alphabetized; ensure the file
list keeps tests/test-clp_s-bloom_filter.cpp before
tests/test-clp_s-delta-encode-log-order.cpp and that
tests/test-clp_s-end_to_end.cpp remains in the same sorted sequence, preserving
alphabetical order whenever you add or reorder test entries.

In `@components/core/src/clp_s/filter/BloomFilter.cpp`:
- Around line 57-59: The computation of num_hash_functions currently truncates
the real-valued optimum (k = (m/n) * ln2) which can yield a sub-optimal count
for borderline values; change the computation in BloomFilter.cpp so you round to
nearest before casting (e.g., wrap the real-valued expression with std::llround
and then static_cast<uint32_t>) while keeping the existing std::clamp bounds;
update the include if needed to pull in <cmath> and ensure the expression uses
bit_array_size, expected_num_elements and ln2 as before.
- Around line 132-154: The deserialiser creates std::vector<uint8_t> bit_array
from bit_array_bytes without an upper bound, allowing a crafted bit_array_size
to force huge allocations; before constructing bit_array in BloomFilter.cpp
(after reader.try_read_numeric_value and the check against
std::numeric_limits<size_t>::max and expected_bit_array_bytes), validate that
bit_array_bytes <= a sane constant (e.g. 256 * 1024 * 1024) and return false if
it exceeds that cap; keep references to bit_array_bytes, bit_array_size,
reader.try_read_numeric_value and reader.try_read_exact_length to locate the
changes and ensure the guard happens before allocating the vector to prevent
std::bad_alloc/OOM.

In `@components/core/src/clp_s/filter/FilterFile.cpp`:
- Around line 17-46: The inconsistency in default-constructing out_filter in
read_filter_file has been resolved—both assignments now use BloomFilter{}; keep
the existing checks around reading magic, validating type (FilterType::None and
FilterType::Bloom) and the call to out_filter.read_from_file(reader) as-is; no
code change required in read_filter_file, ensure symbols referenced are
read_filter_file, out_filter, BloomFilter, FilterType, and read_from_file remain
unchanged.

gibber9809

Apologies for the slow initial review -- leaving mostly higher level comments so that you have something to work with. Could you go through and apply the comments about include style and constant naming throughout the PR?

XxHash seems like it should be a good choice for hash library given that hashes are supposed to be stable across platforms and into the future post 0.8.0 + performance characteristics. It seems like there's a newer version (0.8.3) though, so could we update to that?

Besides that, I'll need to do another pass on this PR where I take a closer look at design + tests + the way we're pulling in this xxhash library.

gibber9809 · 2026-02-27T01:09:28Z

+constexpr size_t kDefaultBitArraySize{64};
+constexpr uint32_t kDefaultNumHashFunctions{1};
+constexpr uint32_t kMinNumHashFunctions{1};
+constexpr uint32_t kMaxNumHashFunctions{20};
+constexpr uint64_t kPrimaryHashSeed{0};
+constexpr uint64_t kSecondaryHashSeed{0x9e37'79b9'7f4a'7c15ULL};


Suggested change

constexpr size_t kDefaultBitArraySize{64};

constexpr uint32_t kDefaultNumHashFunctions{1};

constexpr uint32_t kMinNumHashFunctions{1};

constexpr uint32_t kMaxNumHashFunctions{20};

constexpr uint64_t kPrimaryHashSeed{0};

constexpr uint64_t kSecondaryHashSeed{0x9e37'79b9'7f4a'7c15ULL};

constexpr size_t cDefaultBitArraySize{64};

constexpr uint32_t cDefaultNumHashFunctions{1};

constexpr uint32_t cMinNumHashFunctions{1};

constexpr uint32_t cMaxNumHashFunctions{20};

constexpr uint64_t cPrimaryHashSeed{0};

constexpr uint64_t cSecondaryHashSeed{0x9e37'79b9'7f4a'7c15ULL};

Nit: style for constants in the repo is to use the prefix c -- sorry, this is one of the details that is only in the private part of our style guide.

gibber9809 · 2026-02-27T01:13:00Z

+        : m_bit_array_size(kDefaultBitArraySize),
+          m_num_hash_functions(kDefaultNumHashFunctions),
+          m_bit_array((kDefaultBitArraySize + 7) / 8, 0) {}


Suggested change

: m_bit_array_size(kDefaultBitArraySize),

m_num_hash_functions(kDefaultNumHashFunctions),

m_bit_array((kDefaultBitArraySize + 7) / 8, 0) {}

: m_bit_array_size{cDefaultBitArraySize},

m_num_hash_functions{cDefaultNumHashFunctions},

m_bit_array{(cDefaultBitArraySize + 7) / 8, 0} {}

Nit: preferred style is to use brace initializers wherever possible -- sorry again, this also is only documented in the private part of our style guide.

gibber9809 · 2026-02-27T01:18:30Z

+    m_bit_array_size = bit_array_size;
+    m_num_hash_functions = num_hash_functions;
+
+    size_t const num_bytes = (m_bit_array_size + 7) / 8;


This code seems to use (x + 7)/8 to get the number of bytes that contain a given number of bits in several places. Might be worth adding a helper called min_bytes_containing_bits in the anonymonous namespace in this file or something.

Might also be worth defining a constant cNumBitsInByte to replace all of the instances of 8 in this file for readability.

gibber9809 · 2026-02-27T01:28:08Z

@@ -0,0 +1,10 @@
+#include "XxHash.hpp"
+
+#define XXH_INLINE_ALL


What does this do?

gibber9809 · 2026-02-27T01:30:14Z

+                PRIVATE
+                clp::string_utils
+        )
+endif()


Could we move all of this stuff for defining the library into a new filter/CMakeLists.txt, then just use add_subdirectory from this CMakeLists file to add that subdirectory?

should we not add the package into this cmakelist in this PR - it should be part of the integration right?

It's fine to add the cmake library in this PR since we aren't linking it into clp-s. It's also got a flag to disable if we need, so no real risk of adverse side effects.

gibber9809 · 2026-02-27T01:31:20Z

+#include "../../clp/ErrorCode.hpp"
+#include "../../clp/ReaderInterface.hpp"


Suggested change

#include "../../clp/ErrorCode.hpp"

#include "../../clp/ReaderInterface.hpp"

#include <clp/ErrorCode.hpp>

#include <clp/ReaderInterface.hpp>

Nit: <> is our preferred include style for pulling in other parts of the clp project.

gibber9809 · 2026-02-27T01:31:45Z

+#include <utility>
+#include <vector>
+
+#include "../FileWriter.hpp"


Suggested change

#include "../FileWriter.hpp"

#include <clp_s/FileWriter.hpp>

Nit

gibber9809

Did a pass on the build system side of this PR. Let me know if any of my comments are unclear.

gibber9809 · 2026-03-02T20:36:56Z

+  xxhash:
+    internal: true
+    vars:
+      LIB_NAME: "xxhash"
+      XXHASH_OUTPUT_DIR: "{{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-src"
+    run: "once"
+    deps:
+      - task: "utils:init"
+    cmds:
+      - task: "yscope-dev-utils:remote:download-and-extract-tar"
+        vars:
+          CHECKSUM_FILE: "{{.G_DEPS_CPP_CHECKSUMS_DIR}}/{{.LIB_NAME}}.md5"
+          FILE_SHA256: "baee0c6afd4f03165de7a4e67988d16f0f2b257b51d0e3cb91909302a26a79c4"
+          OUTPUT_DIR: "{{.XXHASH_OUTPUT_DIR}}"
+          URL: "https://github.com/Cyan4973/xxHash/archive/refs/tags/v0.8.2.tar.gz"
+      - >-
+        echo "set(
+        CLP_XXHASH_INCLUDE_DIRECTORY \"{{.XXHASH_OUTPUT_DIR}}\"
+        )" > "{{.G_DEPS_CPP_CMAKE_SETTINGS_DIR}}/{{.LIB_NAME}}.cmake"


Suggested change

xxhash:

internal: true

vars:

LIB_NAME: "xxhash"

XXHASH_OUTPUT_DIR: "{{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-src"

run: "once"

deps:

- task: "utils:init"

cmds:

- task: "yscope-dev-utils:remote:download-and-extract-tar"

vars:

CHECKSUM_FILE: "{{.G_DEPS_CPP_CHECKSUMS_DIR}}/{{.LIB_NAME}}.md5"

FILE_SHA256: "baee0c6afd4f03165de7a4e67988d16f0f2b257b51d0e3cb91909302a26a79c4"

OUTPUT_DIR: "{{.XXHASH_OUTPUT_DIR}}"

URL: "https://github.com/Cyan4973/xxHash/archive/refs/tags/v0.8.2.tar.gz"

- >-

echo "set(

CLP_XXHASH_INCLUDE_DIRECTORY \"{{.XXHASH_OUTPUT_DIR}}\"

)" > "{{.G_DEPS_CPP_CMAKE_SETTINGS_DIR}}/{{.LIB_NAME}}.cmake"

xxhash:

internal: true

vars:

LIB_NAME: "xxHash"

XXHASH_OUTPUT_DIR: "{{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-src"

run: "once"

cmds:

- task: "utils:install-remote-cmake-lib"

vars:

CMAKE_GEN_ARGS:

- "-DXXHASH_BUILD_XXHSUM=OFF"

- "-DBUILD_SHARED_LIBS=ON"

- '-DCMAKE_INSTALL_PREFIX={{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-install'

LIB_NAME: "{{.LIB_NAME}}"

TARBALL_SHA256: "aae608dfe8213dfd05d909a57718ef82f30722c392344583d3f39050c7f29a80"

TARBALL_URL: "https://github.com/Cyan4973/xxHash/archive/refs/tags/v0.8.3.tar.gz"

CMAKE_SOURCE_DIR: "cmake_unofficial"

Changing this to pull in v0.8.3 as well as use the cmake install flow.

Now instead of the CLP_XXHASH_INCLUDE_DIRECTORY stuff you're doing, you can directly use xxHash::xxhash as a link target in target_link_libraries for the relevant targets.

You will also need to add a block like:

if(CLP_NEED_XXHASH) find_package(xxHash 0.8.3 CONFIG REQUIRED) if(xxHash_FOUND) message(STATUS "Found xxHash ${xxHash_VERSION}") endif() endif()

To components/core/CMakeLists.txt
and you'll need to add functions/flags that set CLP_NEED_XXHASH in components/core/cmake/Options/options.cmake -- sorry that this isn't documented, but you'll pretty much want to copy what we do for other targets. That means you'll have to add an option flag like

option( CLP_BUILD_CLP_S_FILTER "Build clp_s::filter." ON )

Alphabetically sorted in the right position in the long list of options at the start of the file + add a function
validate_clp_s_filter_dependencies which asks for CLP_BUILD_CLP_STRING_UTILS+CLP_BUILD_CLP_S_IO and set_clp_s_filter_dependencies which sets CLP_NEED_XXHASH + call these two functions in validate_and_setup_all_clp_dependency_flags like we do for other dependencies + add CLP_NEED_XXHASH to the list in convert_clp_dependency_properties_to_variables.

Hopefully you can figure that out by copying what we do for other dependencies, but if not let me know and I can help -- we really should have documented how to set up these flags/refactored them by now, but haven't yet, so it's more confusing than it needs to be.

gibber9809 · 2026-03-02T20:38:10Z

+        filter/XxHash.hpp
+)
+
+if(CLP_BUILD_CLP_S_IO)


Suggested change

if(CLP_BUILD_CLP_S_IO)

if(CLP_BUILD_CLP_S_FILTER)

See my other comment about how to set up this flag properly.

gibber9809 · 2026-03-02T20:38:46Z

+                PRIVATE
+                ${CLP_XXHASH_INCLUDE_DIRECTORY}


Per other comment, should be able to remove this once you change how the dependency is pulled in.

gibber9809 · 2026-03-02T20:39:45Z

+        target_link_libraries(
+                clp_s_filter
+                PUBLIC
+                clp_s::io


Per other comment, will need to link against xxHash::xxhash.

gibber9809 · 2026-03-02T20:42:12Z

    target_include_directories(unitTest
            PRIVATE
            ${CLP_SQLITE3_INCLUDE_DIRECTORY}
+            ${CLP_XXHASH_INCLUDE_DIRECTORY}


Suggested change

${CLP_XXHASH_INCLUDE_DIRECTORY}

Per other comment, can get rid of this when we update how we pull in the xxhash dependency.

gibber9809 · 2026-03-02T20:42:53Z

+    src/clp_s/filter/FilterReader.cpp
+    src/clp_s/filter/FilterReader.hpp
+    src/clp_s/filter/XxHash.cpp
+    src/clp_s/filter/XxHash.hpp


Instead of pulling in every source file from clp_s/filter could we just link unitTest against clp_s::filter?

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/src/clp_s/filter/BloomFilter.cpp`:
- Around line 22-24: The current min_bytes_containing_bits function can overflow
when computing (num_bits + (cNumBitsInByte - 1)); change the calculation to
avoid addition overflow by using division and modulo: compute bytes = num_bits /
cNumBitsInByte and add one only if num_bits % cNumBitsInByte != 0 (handle
num_bits == 0 returning 0). Update min_bytes_containing_bits to use that
division/modulo approach (referencing min_bytes_containing_bits and
cNumBitsInByte) so deserialisation validation cannot be bypassed by SIZE_MAX
inputs.

In `@taskfiles/deps/main.yaml`:
- Around line 570-573: The CMake args currently force shared-only builds by
setting CMAKE_GEN_ARGS with "-DBUILD_SHARED_LIBS=ON"; modify the xxHash build
args in CMAKE_GEN_ARGS to also set "-DBUILD_STATIC_LIBS=ON" and ensure the
project’s components/core/CMakeLists.txt uses a static-vs-shared selection when
calling find_package(xxhash ...) similar to how Boost and LibArchive are handled
(i.e., add logic to respect CLP_USE_STATIC_LIBS and pass the STATIC or SHARED
hint to find_package or set xxHash_USE_STATIC_LIBS accordingly) so the build
supports the CLP static-link mode.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99ea475 and c6ae25f.

📒 Files selected for processing (12)

components/core/CMakeLists.txt
components/core/cmake/Options/options.cmake
components/core/src/clp_s/CMakeLists.txt
components/core/src/clp_s/filter/BloomFilter.cpp
components/core/src/clp_s/filter/BloomFilter.hpp
components/core/src/clp_s/filter/CMakeLists.txt
components/core/src/clp_s/filter/FilterBuilder.cpp
components/core/src/clp_s/filter/FilterFile.cpp
components/core/src/clp_s/filter/FilterFile.hpp
components/core/src/clp_s/filter/FilterReader.cpp
components/core/src/clp_s/filter/XxHash.cpp
taskfiles/deps/main.yaml

coderabbitai · 2026-03-03T17:41:17Z

+          CMAKE_GEN_ARGS:
+            - "-DXXHASH_BUILD_XXHSUM=OFF"
+            - "-DBUILD_SHARED_LIBS=ON"
+            - '-DCMAKE_INSTALL_PREFIX={{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-install'


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify static-link mode defaults and xxHash build/link wiring. rg -n "CLP_USE_STATIC_LIBS|find_package\\(xxHash|xxHash::xxhash" \ components/core/CMakeLists.txt \ components/core/src/clp_s/filter/CMakeLists.txt echo "---" rg -n "task:\\s*\"xxhash\"|BUILD_SHARED_LIBS=ON|LIB_NAME:\\s*\"xxHash\"" \ taskfiles/deps/main.yaml

Repository: y-scope/clp

Length of output: 1089

🏁 Script executed:

#!/bin/bash # Search for custom FindxxHash module fd "FindxxHash" components/ echo "---" # Get the full xxHash task definition from taskfiles/deps/main.yaml sed -n '87,600p' taskfiles/deps/main.yaml | grep -A 30 'task: "xxhash"' echo "---" # Check filter CMakeLists.txt around line 31 sed -n '25,40p' components/core/src/clp_s/filter/CMakeLists.txt

Repository: y-scope/clp

Length of output: 1280

🏁 Script executed:

#!/bin/bash # Find the xxhash task definition more precisely rg -A 50 "^\s+xxhash:" taskfiles/deps/main.yaml | head -80

Repository: y-scope/clp

Length of output: 1796

🏁 Script executed:

#!/bin/bash # Search for custom FindxxHash module find components -name "*xxHash*" -o -name "*xxhash*" echo "---" # Check if there's xxHash handling in the main CMakeLists sed -n '330,350p' components/core/CMakeLists.txt

Repository: y-scope/clp

Length of output: 642

🏁 Script executed:

#!/bin/bash # Find CLP_NEED_XXHASH definition rg -n "CLP_NEED_XXHASH" components/core/CMakeLists.txt echo "---" # Check how other libraries handle static vs shared (e.g., BZip2) rg -B5 -A5 "CLP_USE_STATIC_LIBS" components/core/CMakeLists.txt | head -60 echo "---" # Check xxHash official build options rg -n "xxHash_USE_STATIC_LIBS|BUILD_STATIC_LIBS" taskfiles/deps/main.yaml

Repository: y-scope/clp

Length of output: 1873

🏁 Script executed:

#!/bin/bash # Search for CLP_NEED_XXHASH definition/assignment rg -n "set.*CLP_NEED_XXHASH" components/core/CMakeLists.txt echo "---" # Check if xxHash find_package has any handling for static libs sed -n '337,345p' components/core/CMakeLists.txt echo "---" # Check other libraries (BZip2, LibLZMA, etc.) to see the pattern for static/shared handling rg -B2 "CLP_USE_STATIC_LIBS" components/core/CMakeLists.txt | grep -A2 "BZip2\|LibLZMA\|LZ4\|ZLIB"

Repository: y-scope/clp

Length of output: 305

🏁 Script executed:

#!/bin/bash # Search for CLP_NEED_XXHASH in all files rg -r "CLP_NEED_XXHASH" components/ echo "---" # Check for any xxHash configuration in other CMakeLists files rg -n "xxHash\|XXHASH" components/core/src/clp_s/filter/CMakeLists.txt echo "---" # Check the pattern used for other libraries with static handling sed -n '155,175p' components/core/CMakeLists.txt

Repository: y-scope/clp

Length of output: 16101

🏁 Script executed:

#!/bin/bash # Find the actual definition of CLP_NEED_XXHASH rg -n "CLP_NEED_XXHASH\s*(ON|OFF|TRUE|FALSE)" components/core/CMakeLists.txt echo "---" # Look for option definitions rg -n "option.*CLP_NEED" components/core/CMakeLists.txt echo "---" # Check the filter target linking more closely sed -n '20,40p' components/core/src/clp_s/filter/CMakeLists.txt

Repository: y-scope/clp

Length of output: 509

🏁 Script executed:

#!/bin/bash # Search for how CLP_NEED_XXHASH gets set - look for conditionals rg -B10 "CLP_NEED_XXHASH" components/core/CMakeLists.txt | head -40 echo "---" # Check if there's any xxxHash static/shared variant handling elsewhere rg -n "xxHash.*USE_STATIC\|xxHash_USE_STATIC" components/

Repository: y-scope/clp

Length of output: 362

🏁 Script executed:

#!/bin/bash # Search for CLP_NEED_XXHASH definition/set command more broadly rg -n "CLP_NEED_XXHASH\s*=" components/core/CMakeLists.txt echo "---" # Check if it's set via option() rg -n "option.*XXHASH" components/core/CMakeLists.txt echo "---" # Look at the very beginning of the CMakeLists to see option definitions head -150 components/core/CMakeLists.txt | grep -n "option\|CLP_NEED"

Repository: y-scope/clp

Length of output: 310

🏁 Script executed:

#!/bin/bash # Check the options.cmake file where CLP_NEED variables are defined cat components/core/cmake/Options/options.cmake

Repository: y-scope/clp

Length of output: 13124

🏁 Script executed:

#!/bin/bash # Check xxHash's official CMakeLists.txt for build options # Since xxHash is built from a tarball, check the cmake_unofficial directory mentioned in the task rg -n "BUILD_STATIC_LIBS|xxHash_BUILD_XXHSUM|option.*BUILD" \ --type cmake 2>/dev/null | head -20 echo "---" # Verify the pattern used for other libraries with static support rg -B1 -A1 "Boost_USE_STATIC_LIBS ON\|LibArchive_USE_STATIC_LIBS ON" components/core/CMakeLists.txt

Repository: y-scope/clp

Length of output: 341

Add static library variant to xxHash build to support CLP static-link mode.

Line 572 forces -DBUILD_SHARED_LIBS=ON exclusively. Since CLP defaults to CLP_USE_STATIC_LIBS=ON, xxHash must also support static linking. Add -DBUILD_STATIC_LIBS=ON and implement static variant selection in the find_package call, following the pattern used by Boost and LibArchive elsewhere in components/core/CMakeLists.txt.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@taskfiles/deps/main.yaml` around lines 570 - 573, The CMake args currently force shared-only builds by setting CMAKE_GEN_ARGS with "-DBUILD_SHARED_LIBS=ON"; modify the xxHash build args in CMAKE_GEN_ARGS to also set "-DBUILD_STATIC_LIBS=ON" and ensure the project’s components/core/CMakeLists.txt uses a static-vs-shared selection when calling find_package(xxhash ...) similar to how Boost and LibArchive are handled (i.e., add logic to respect CLP_USE_STATIC_LIBS and pass the STATIC or SHARED hint to find_package or set xxHash_USE_STATIC_LIBS accordingly) so the build supports the CLP static-link mode.

gibber9809

Going to go through more later today, but leaving another set of comments on tests + XxHash.{cpp,hpp}.

The updates to cmake scripts look good!

Besides those comments could you:

Move your tests subdirectory to filter/tests -- our convention changed a few months ago to putting tests in a tests subdirectory next to the thing they're testing so that it's easier to find tests related to a given library or feature.
Run clang-tidy on all of your c++ code? I'm finding a lot of details that our clang-tidy config would complain about. If the task flow for running clang-tidy isn't bringing up the issues, you may have to manually run clang-tidy like clang-tidy -p your/build/path/compile_commands.json components/core/src/clp_s/filter/BloomFilter.cpp etc.

gibber9809 · 2026-03-05T16:36:58Z

+
+#include <catch2/catch_test_macros.hpp>
+
+#include "../src/clp_s/filter/BloomFilter.hpp"


Suggested change

#include "../src/clp_s/filter/BloomFilter.hpp"

#include <clp_s/filter/BloomFilter.hpp>

Since this test is linking against the clp_s::filter library.

gibber9809 · 2026-03-05T16:37:23Z

+constexpr size_t kInsertions = 10'000;
+constexpr size_t kQueries = 10'000;
+constexpr double kFalsePositiveRate = 0.001;


Suggested change

constexpr size_t kInsertions = 10'000;

constexpr size_t kQueries = 10'000;

constexpr double kFalsePositiveRate = 0.001;

constexpr size_t cInsertions{10'000};

constexpr size_t cQueries{10'000};

constexpr double cFalsePositiveRate{0.001};

Nit: style.

gibber9809 · 2026-03-05T16:38:55Z

+TEST_CASE("BloomFilter empty filter has no matches", "[clp_s][filter]") {
+    clp_s::filter::BloomFilter filter{0, kFalsePositiveRate};
+
+    for (size_t i = 0; i < kQueries; ++i) {
+        auto const value = make_odd(i);
+        REQUIRE_FALSE(filter.possibly_contains(std::to_string(value)));
+    }
+}


Do you think this test is required? It doesn't seem to actually test anything besides that our bit vector is initialized to zero, which the other test would catch anyway.

gibber9809 · 2026-03-05T16:41:29Z

+    return value * 2;
+}
+}  // namespace
+


Do you think it's worth adding a test that runs xxhash on a handful of inputs and checks that the hash value is what we expect? Just to help avoid a situation where someone somehow links this library against a pre 0.8.0 version of xxhash and gets different hash results.

I was thinking before that make some pre loaded input and output for testing but that will lead to a large number of loc - not a large file but just loc. Or do you think a very small set (like 10 sparse cases) for senity check is enough?

Yeah, 5-10 cases as a sanity check is enough I think.

gibber9809 · 2026-03-05T16:45:16Z

+#include <string>
+#include <vector>
+
+#include <catch2/catch_test_macros.hpp>


Suggested change

#include <catch2/catch_test_macros.hpp>

#include <catch2/catch_message.hpp>

#include <catch2/catch_test_macros.hpp>

Since you're also using INFO, etc

gibber9809 · 2026-03-05T16:48:20Z

+        filter.add(inserted.back());
+    }
+
+    size_t false_negatives = 0;


Suggested change

size_t false_negatives = 0;

size_t false_negatives{};

gibber9809 · 2026-03-05T16:50:07Z

+uint64_t make_odd(uint64_t value) {
+    return value * 2 + 1;
+}
+
+uint64_t make_even(uint64_t value) {
+    return value * 2;
+}


Suggested change

uint64_t make_odd(uint64_t value) {

return value * 2 + 1;

}

uint64_t make_even(uint64_t value) {

return value * 2;

}

[[nodiscard]] auto make_odd(uint64_t value) -> uint64_t {

return value * 2 + 1;

}

[[nodiscard]] auto make_even(uint64_t value) -> uint64_t {

return value * 2;

}

Nit: our style guide requires trailing return syntax.

gibber9809 · 2026-03-05T16:53:16Z

+#include <string_view>
+
+namespace clp_s::filter::xxhash {
+[[nodiscard]] uint64_t hash64(std::string_view value, uint64_t seed);


Suggested change

[[nodiscard]] uint64_t hash64(std::string_view value, uint64_t seed);

/**

* @param value

* @param seed

* @return The 64-bit seeded XXH3 hash of `value`.

*/

[[nodiscard]] auto hash64(std::string_view value, uint64_t seed) -> uint64_t ;

Nit: trailing return syntax.
Nit: missing docstring.

Also see some other missing docstrings, will add all.

gibber9809 · 2026-03-05T16:55:33Z

@@ -0,0 +1,9 @@
+#include "XxHash.hpp"
+
+#include <xxhash.h>


Suggested change

#include <xxhash.h>

#include <cstdint>

#include <string_view>

#include <xxhash.h>

Nit: IWYU -- need to include headers for all symbols used in a file.

gibber9809 · 2026-03-05T16:58:35Z

+#include <xxhash.h>
+
+namespace clp_s::filter::xxhash {
+uint64_t hash64(std::string_view value, uint64_t seed) {


Suggested change

uint64_t hash64(std::string_view value, uint64_t seed) {

auto hash64(std::string_view value, uint64_t seed) -> uint64_t {

Nit: trailing return syntax

ShangDanLuXian · 2026-03-05T18:56:03Z

Going to go through more later today, but leaving another set of comments on tests + XxHash.{cpp,hpp}.

The updates to cmake scripts look good!

Besides those comments could you:

Move your tests subdirectory to filter/tests -- our convention changed a few months ago to putting tests in a tests subdirectory next to the thing they're testing so that it's easier to find tests related to a given library or feature.

Run clang-tidy on all of your c++ code? I'm finding a lot of details that our clang-tidy config would complain about. If the task flow for running clang-tidy isn't bringing up the issues, you may have to manually run clang-tidy like clang-tidy -p your/build/path/compile_commands.json components/core/src/clp_s/filter/BloomFilter.cpp etc.

Thanks for the review. Sorry that forgot to look into the test files and change the styling as well. Also forgot to run clang-tidy again after last commit, will fix it in the next commit.

gibber9809

Added some more comments about BloomFilter.{cpp,hpp} focused mostly on the design of the interface.

Ignoring style stuff in this section of the code for now since it's being worked on.

gibber9809 · 2026-03-05T20:03:53Z

+#include <clp_s/FileWriter.hpp>
+
+namespace clp {
+class ReaderInterface;
+}  // namespace clp


For the write_to_file interface I think it might make sense to rely on clp::WriterInterface & to make it more generic -- that way if we decide to e.g. compress this bloom filter using zstd or something we don't need to change any code here.

Also, for these includes it's fine to just include <clp/ReaderInterface.hpp> and <clp/WriterInterface.hpp> here -- no need for this forward declaration of ReaderInterface to avoid pulling in that header.

gibber9809 · 2026-03-05T20:43:49Z

+    if (false == std::isfinite(ideal_bit_array_size)
+        || ideal_bit_array_size > static_cast<double>(std::numeric_limits<size_t>::max()))
+    {
+        return {cDefaultBitArraySize, cDefaultNumHashFunctions};


It seems like this and the other default return above are hiding errors for a few kinds of invalid inputs.

I think we're better off just forwarding the error in this kind of situation instead of silently falling back to a default. The easiest way to do that might be to use result types for returns here like we've started using in other parts of the codebase, e.g. ystdlib::error_handling::Result<std::pair<size_t, uint32_t>>.

You can take a look at clp_s/timestamp_parser/TimestampParser.{cpp,hpp} to see a somewhat recent example of how we use result types in the codebase.

Currently I treat expected_num_elements == 0 as valid input.
In the variable-dictionary path this should be impossible, but for the Bloom filter library itself it is still a valid state (empty set), so the implementation returns a default empty filter instead of an error.

Please let me know if you’d prefer this to be treated as an error instead.

gibber9809 · 2026-03-05T20:49:08Z

+
+    size_t m_bit_array_size{0};
+    uint32_t m_num_hash_functions{0};
+    std::vector<uint8_t> m_bit_array;


We have a class ystdlib::containers::Array<T> to represent fixed sized arrays with size configured at construction time that might be a better fit than an std::vector here.

gibber9809 · 2026-03-05T21:07:41Z

+    [[nodiscard]] bool possibly_contains(std::string_view value) const;
+
+    void write_to_file(FileWriter& writer) const;
+    [[nodiscard]] bool read_from_file(clp::ReaderInterface& reader);


I think this might make more sense as a static function that returns ystdlib::error_handling::Result<BloomFilter> -- kind of matching the create() pattern used in other parts of the codebase. Mainly my reasoning is that it avoids having to default initialize a BloomFilter first before attempting to initialize it from some ReaderInterface, avoiding some unnecessary allocations + cleans up the error handling since it makes it so that if you're able to create the bloom filter it is guaranteed to be valid.

Does that make sense, or do you think the current approach is better?

I feel this is better, also replied in the default constructor comment that the default constructor is not needed with this approach.

gibber9809 · 2026-03-05T21:08:11Z

+public:
+    BloomFilter(size_t expected_num_elements, double false_positive_rate);
+
+    BloomFilter();


What's the use case for the default constructor?

The default constructor was only needed in the old flow where we default-constructed a BloomFilter as a placeholder and then populated.
With the new Result-based API (create / try_read_from_file) and std::optional at call sites, we no longer need placeholder construction, I have removed the default constructor.

gibber9809 · 2026-03-05T21:10:54Z

+    return value * 2;
+}
+}  // namespace
+


Yeah, 5-10 cases as a sanity check is enough I think.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/CMakeLists.txt`:
- Around line 337-338: The CMake find_package call must require exactly xxHash
0.8.3 to avoid incompatible hash outputs: update the conditional that checks
CLP_NEED_XXHASH so the find_package invocation uses the EXACT keyword (i.e.,
change the find_package(xxHash 0.8.3 CONFIG REQUIRED) line to require the exact
version) — this ensures the hard-coded vectors in test-clp_s-xxhash.cpp (which
rely on XXH3_64bits_withSeed from v0.8.3) are validated against the exact
expected library version.

In `@components/core/src/clp_s/filter/ErrorCode.cpp`:
- Around line 19-29: The switch over ErrorCodeEnum in ErrorCode.cpp currently
contains a default case which masks newly added enum values; remove the default
branch inside the switch so the compiler will warn about unhandled ErrorCodeEnum
cases, and move the fallback return (e.g. returning "Unknown error code enum" or
calling an unreachable/assert helper) to after the switch on error_enum so
missing enum cases fail loudly at compile time; update any function that
performs this mapping (the switch on error_enum in ErrorCode.cpp) accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 47ec4743-ce66-49ac-9c6d-0097d0cb1f66

📥 Commits

Reviewing files that changed from the base of the PR and between c6ae25f and 7cfbddc.

📒 Files selected for processing (17)

components/core/CMakeLists.txt
components/core/src/clp_s/filter/BloomFilter.cpp
components/core/src/clp_s/filter/BloomFilter.hpp
components/core/src/clp_s/filter/CMakeLists.txt
components/core/src/clp_s/filter/ErrorCode.cpp
components/core/src/clp_s/filter/ErrorCode.hpp
components/core/src/clp_s/filter/FilterBuilder.cpp
components/core/src/clp_s/filter/FilterBuilder.hpp
components/core/src/clp_s/filter/FilterConfig.hpp
components/core/src/clp_s/filter/FilterFile.cpp
components/core/src/clp_s/filter/FilterFile.hpp
components/core/src/clp_s/filter/FilterReader.cpp
components/core/src/clp_s/filter/FilterReader.hpp
components/core/src/clp_s/filter/XxHash.cpp
components/core/src/clp_s/filter/XxHash.hpp
components/core/src/clp_s/filter/tests/test-clp_s-bloom_filter.cpp
components/core/src/clp_s/filter/tests/test-clp_s-xxhash.cpp

gibber9809

Could you re-run clang-tidy on BloomFilter.cpp and BloomFilter.hpp? I still see some clang-tidy errors when I pull your branch and run it on those files.

For the error about reinterpret_cast I think you can use clp::size_checked_pointer_cast instead (from clp/type_utils.hpp).

I'm still taking a look at whether we need to modify how we pull in xxhash in the taskfile deps flow -- I'll get back to you on that once I've figured it out.

gibber9809 · 2026-03-10T19:56:43Z

+        target_link_libraries(
+                clp_s_filter
+                PUBLIC
+                clp_s::io


Suggested change

clp_s::io

clp_s::clp_dependencies

Since this is now using clp::ReaderInterface and clp::WriterInterface I think we don't need to depend on clp_s::io anymore, but still need the clp-related dependencies.

gibber9809 · 2026-03-10T19:59:22Z

 endfunction()

+function(validate_clp_s_filter_dependencies)
+    validate_clp_dependencies_for_target(CLP_BUILD_CLP_S_FILTER)


Suggested change

validate_clp_dependencies_for_target(CLP_BUILD_CLP_S_FILTER)

validate_clp_dependencies_for_target(CLP_BUILD_CLP_S_FILTER

CLP_BUILD_CLP_S_CLP_DEPENDENCIES

)

Same as my other comment about depending on clp_s_clp_dependencies + I missed that this wasn't checking for deps previously.

ShangDanLuXian · 2026-03-11T07:45:50Z

Could you re-run clang-tidy on BloomFilter.cpp and BloomFilter.hpp? I still see some clang-tidy errors when I pull your branch and run it on those files.

For the error about reinterpret_cast I think you can use clp::size_checked_pointer_cast instead (from clp/type_utils.hpp).

I'm still taking a look at whether we need to modify how we pull in xxhash in the taskfile deps flow -- I'll get back to you on that once I've figured it out.

Sorry that I misunderstood how clang-tidy works previously, now should all be fixed :)

gibber9809

Getting pretty close on this PR, just some small revisions.

gibber9809 · 2026-03-11T20:30:22Z

 endif()

+if(CLP_NEED_XXHASH)
+    find_package(xxHash 0.8.3 EXACT CONFIG REQUIRED)


Suggested change

find_package(xxHash 0.8.3 EXACT CONFIG REQUIRED)

find_package(xxHash 0.8.3 CONFIG REQUIRED)

Removing EXACT per sidebar.

gibber9809 · 2026-03-11T20:32:48Z

+        : m_bit_array_size(bit_array_size),
+          m_num_hash_functions(num_hash_functions),
+          m_bit_array(std::move(bit_array)) {}


Suggested change

: m_bit_array_size(bit_array_size),

m_num_hash_functions(num_hash_functions),

m_bit_array(std::move(bit_array)) {}

: m_bit_array_size{bit_array_size},

m_num_hash_functions{num_hash_functions},

m_bit_array{std::move(bit_array)} {}

Nit: Use {} for initialization

gibber9809 · 2026-03-11T20:37:31Z

+    auto const optimal_parameters
+            = compute_optimal_parameters(expected_num_elements, false_positive_rate);
+    if (optimal_parameters.has_error()) {
+        return optimal_parameters.error();
+    }


Suggested change

auto const optimal_parameters

= compute_optimal_parameters(expected_num_elements, false_positive_rate);

if (optimal_parameters.has_error()) {

return optimal_parameters.error();

}

auto const optimal_parameters{YSTDLIB_ERROR_HANDLING_TRYX(compute_optimal_parameters(expected_num_elements, false_positive_rate)};

You can use YSTDLIB_ERROR_HANDLING_TRYX for results containing a value and YSTDLIB_ERROR_HANDLING_TRYV for void results as nice shorthands for forwarding an error if it exists, and unwrapping the result otherwise.

With this diff you can then use optimal_parameters directly, since it represents the unwrapped value, instead of optimal_parameters.value().

gibber9809 · 2026-03-11T20:39:32Z

+    uint32_t const capped_num_hash_functions
+            = std::clamp(num_hash_functions, cMinNumHashFunctions, cMaxNumHashFunctions);


Suggested change

uint32_t const capped_num_hash_functions

= std::clamp(num_hash_functions, cMinNumHashFunctions, cMaxNumHashFunctions);

uint32_t const capped_num_hash_functions{

std::clamp(num_hash_functions, cMinNumHashFunctions, cMaxNumHashFunctions)};

Nit: we require {} initialization with the make exception being auto const xyz =

gibber9809 · 2026-03-11T20:47:03Z

+}
+
+auto BloomFilter::write_to_file(clp::WriterInterface& writer) const -> void {
+    writer.write_numeric_value<uint32_t>(m_num_hash_functions);


Sorry I didn't think about this earlier, but do you think it makes sense to include an extra byte in this format indicating what hash algorithm was used for this bloom filter (e.g., by defining an enum for hash functions, with 0x00 representing xxhash)? That way if we ever change the hash algorithm in the future it's easy to keep things backwards compatible.

This function writes payload of a bloom filter. There will be other metadata bytes written in the FilterFile (which does not belong to this PR), including the type of filter.
My current implementation (out of scope of this PR again) treats bloom filters using different hash algorithms as different "filter type", unless we have a bunch of hash algorithms to use then it will make a lot of combinations. I feel it would be not very much difference comparing to your suggestion.

But yeah definitely I need to integrate a mapper for hash algorithms in bloomfilter.cpp , do you recommand add in this PR or next

Thanks for the suggestion — after thinking it through again, I agree to add a hash algorithm ID in the BloomFilter payload - it makes mapping easier than my previous approach.

I’ve already implemented a small HashAlgorithm mapper (ID parsing + hash64 dispatch), plus the corresponding BloomFilter.cpp/hpp updates. In total it’s ~80 LOC, but most of that is headers/guards/includes/docstrings; the core logic is quite small and self-contained.

I haven’t pushed it yet to keep scope tight, but I can include it in this PR if you’re okay with that.

Please let me know and I’ll push it. :)

gibber9809 · 2026-03-11T20:48:52Z

+            return ErrorCode{ErrorCodeEnum::ReadFailure};
+        }
+    }
+    return BloomFilter(bit_array_size, num_hash_functions, std::move(bit_array));


Suggested change

return BloomFilter(bit_array_size, num_hash_functions, std::move(bit_array));

return BloomFilter{bit_array_size, num_hash_functions, std::move(bit_array)};

gibber9809 · 2026-03-11T20:54:39Z

+    [[nodiscard]] auto possibly_contains(std::string_view value) const -> bool;
+
+    /**
+     * Writes Bloom filter payload fields to a writer.


Suggested change

* Writes Bloom filter payload fields to a writer.

* Writes a serialized representation of a bloom filter to a writer.

I think this wording is slightly clearer.

gibber9809 · 2026-03-11T20:55:41Z

+     * Reads Bloom filter payload fields from a reader.
+     * @param reader
+     * @return A result containing a parsed BloomFilter on success, or an error code indicating


Suggested change

* Reads Bloom filter payload fields from a reader.

* @param reader

* @return A result containing a parsed BloomFilter on success, or an error code indicating

* Reads a serialized representation of a bloom filter from a reader.

* @param reader

* @return A result containing an initialized BloomFilter on success, or an error code indicating

I think this wording is slightly clearer.

gibber9809 · 2026-03-11T21:00:15Z

+        vars:
+          CMAKE_GEN_ARGS:
+            - "-DXXHASH_BUILD_XXHSUM=OFF"
+            - "-DBUILD_SHARED_LIBS=ON"


Suggested change

- "-DBUILD_SHARED_LIBS=ON"

- "-DBUILD_SHARED_LIBS=OFF"

After further thought I think it's just simpler to only build + link the static version of xxhash. Main justification being that if we wanted to build a static + shared version I think we'd need to build them as separate install targets in separate directories + write a Findxyz.cmake script like we do for LZMA, which adds some maintenance burden and makes it harder to support things like using an xxhash library installed at the system level.

gibber9809

LGTM! Thanks for going through all the revisions.

Before we merge, could we:

Change the PR title to:

feat(clp_s::filter): Add library for representing bloom filters; Add xxHash v0.8.3 as a dependency.

Update the PR description to reference that this is one of the PRs for #2090 -- I can see that you manually reference this PR in that issue, but referencing the issue from the PR makes github automatically create a link to the PR from the issue, which makes it easy to track related PRs.

LinZhihao-723

A bit more style-wise catches and tips/suggestions.
Otherwise, we should be good to go.

LinZhihao-723 · 2026-03-13T21:45:04Z

        CLP_UTFCPP_INCLUDE_DIRECTORY \"{{.UTFCPP_OUTPUT_DIR}}\"
        )" > "{{.G_DEPS_CPP_CMAKE_SETTINGS_DIR}}/{{.LIB_NAME}}.cmake"

+  xxhash:


A few things missing for adding a new dep:

We need to update the clang-format config to include xxhash as an external library header. Check here.

We need to update the table in our dev-doc here.

LinZhihao-723 · 2026-03-13T21:56:35Z

+    if (HashAlgorithm::Xxh364 == hash_algorithm) {
+        return xxhash::hash64(value, seed);
+    }
+


Suggested change

if (HashAlgorithm::Xxh364 == hash_algorithm) {

return xxhash::hash64(value, seed);

}

switch (hash_algorithm) {

case HashAlgorithm::Xxh364:

return xxhash::hash64(value, seed);

default:

break;

}

Since we're using enum, I think it's cleaner if we use a switch statement.
In terms of the default case, I think we should have sth equivalent to unreachable in Rust. But this is out of the scope for this PR, let's just keep it as is.

Thanks — I also applied the same switch style to try_parse_hash_algorithm for consistency. However, the case static_cast<uint8_t>(HashAlgorithm::Xxh364) looks a little awkward to me. Please let me know if you’d prefer a different style there.

LinZhihao-723 · 2026-03-13T23:09:30Z

Just curious: what's the real size of the bitmap in practice? Is it going to be mostly the default (64 * 8)?
If the size can become big, shall we try/benchmark this compressed bitmap? It also supports serde so writing to or reading from a file shouldn't be a problem.

I ran a simple benchmark comparing a RoaringBitmap-backed Bloom filter against the current conventional bitmap implementation. For this workload, RoaringBitmap used slightly more space due to its container/metadata overhead, and it was also slower for both build and query.

The main reason is that Bloom filters are configured near their optimal parameters, which gives a bit density around 50%. Combined with the fact that the set bits are distributed by xxhash, the bitmap is relatively dense and random-looking, with very little local structure or long runs. That makes it a poor fit for RoaringBitmap-style compression.

I think this matches the case described in the RoaringBitmap README where a conventional bitset is preferred:

You have dense set of random values that never form runs of continuous values. For example, consider the set 0,2,4,...,10000. If this is typical of your application, you might be better served with a conventional bitset (e.g., Java's BitSet class).

Make sense. Thanks for your investigation!

…ions

ShangDanLuXian · 2026-03-16T07:20:25Z

A bit more style-wise catches and tips/suggestions. Otherwise, we should be good to go.

Thanks for your review!
I split the updates into 2 commits for easier review. You can look at the first commit for the style/doc changes only, and then the second commit is just the BloomFilter.hpp/.cpp reordering.

Edit:
Also made one small follow-up wording fix for consistency: changed hash algorithm id to hash algorithm ID in the BloomFilter::try_read_from_file docstring to match the wording in HashAlgorithm.hpp

LinZhihao-723

Just found one more nit pick, otherwise we're good

LinZhihao-723 · 2026-03-16T14:51:33Z

+    switch (hash_algorithm) {
+        case static_cast<uint8_t>(HashAlgorithm::Xxh364):
+            return HashAlgorithm::Xxh364;
+        default:
+            return std::nullopt;
+    }


As a reply to this comment, I don't think we need to use switch here (as u suggested, it just looks weird, lol). Essentially this is because we're working on the underlying integer of the enum value instead of the enum value itself. I think it doesn't make too much sense if we're not working directly on the enum type when using switch.
I'm fine if u'd like to rollback or just leave it as is.

Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>

ShangDanLuXian requested review from a team and gibber9809 as code owners February 20, 2026 14:14

coderabbitai Bot reviewed Feb 20, 2026

View reviewed changes

ShangDanLuXian commented Feb 20, 2026

View reviewed changes

coderabbitai Bot reviewed Feb 23, 2026

View reviewed changes

gibber9809 requested changes Feb 27, 2026

View reviewed changes

gibber9809 requested changes Mar 2, 2026

View reviewed changes

coderabbitai Bot reviewed Mar 3, 2026

View reviewed changes

gibber9809 requested changes Mar 5, 2026

View reviewed changes

junhaoliao added this to the March 2026 milestone Mar 7, 2026

coderabbitai Bot reviewed Mar 9, 2026

View reviewed changes

Comment thread components/core/CMakeLists.txt

Comment thread components/core/src/clp_s/filter/ErrorCode.cpp Outdated

ShangDanLuXian added 2 commits March 9, 2026 15:08

add xxhash, bloom filter, and bloom filter tests

5c1bf94

resolve PR comments

5d587a3

ShangDanLuXian force-pushed the filter_library branch from 92b6657 to 5d587a3 Compare March 9, 2026 15:23

ShangDanLuXian changed the title ~~feat(clp_s): add bloom filter library~~ feat(clp_s/filter): add BloomFilter, xxHash wrapper, and unit tests Mar 9, 2026

ShangDanLuXian added 2 commits March 9, 2026 12:03

fix cmakelists

6be8c56

fix main.yaml

d4605bd

gibber9809 requested changes Mar 10, 2026

View reviewed changes

fix(clp_s): resolve filter clang-tidy and dependency review comments

e9704e8

gibber9809 requested changes Mar 11, 2026

View reviewed changes

resolve styling, yaml lint,and static xxhash link

84ff555

ShangDanLuXian mentioned this pull request Mar 12, 2026

Add Bloom-filter pruning support for variable dictionaries #2090

Open

add hash algoritm mapper and restrict comment lines to 100 chars

88a0210

ShangDanLuXian force-pushed the filter_library branch from 94484f1 to 88a0210 Compare March 12, 2026 16:20

gibber9809 self-requested a review March 13, 2026 16:01

Merge branch 'main' into filter_library

77b6ee6

gibber9809 previously approved these changes Mar 13, 2026

View reviewed changes

ShangDanLuXian changed the title ~~feat(clp_s/filter): add BloomFilter, xxHash wrapper, and unit tests~~ feat(clp_s::filter): Add library for representing bloom filters; Add xxHash v0.8.3 as a dependency Mar 13, 2026

LinZhihao-723 requested changes Mar 13, 2026

View reviewed changes

ShangDanLuXian added 2 commits March 16, 2026 07:09

fix(clp_s::filter): address style and doc review comments

1132867

refactor(clp_s::filter): reorder BloomFilter declarations and definit…

69931c5

…ions

ShangDanLuXian dismissed gibber9809’s stale review via 69931c5 March 16, 2026 07:17

fix(clp_s::filter): capitalize algorithm ID in BloomFilter docstring

786c00a

LinZhihao-723 requested changes Mar 16, 2026

View reviewed changes

ShangDanLuXian and others added 2 commits March 16, 2026 11:10

Apply suggestion from @LinZhihao-723

ed4c8d5

Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>

fix(clp_s::filter): revert try_parse_hash_algorithm switch cleanup

d0d4517

LinZhihao-723 approved these changes Mar 16, 2026

View reviewed changes

gibber9809 merged commit 8aa4214 into y-scope:main Mar 16, 2026
27 checks passed

		#include "../../clp/ErrorCode.hpp"
		#include "../../clp/ReaderInterface.hpp"


		#include <catch2/catch_test_macros.hpp>

		#include "../src/clp_s/filter/BloomFilter.hpp"

	#include "../src/clp_s/filter/BloomFilter.hpp"
	#include <clp_s/filter/BloomFilter.hpp>

	#include <catch2/catch_test_macros.hpp>
	#include <catch2/catch_message.hpp>
	#include <catch2/catch_test_macros.hpp>

-[[nodiscard]] uint64_t hash64(std::string_view value, uint64_t seed);
+/**
+ * @param value
+ * @param seed
+ * @return The 64-bit seeded XXH3 hash of `value`.
+ */
+[[nodiscard]] auto hash64(std::string_view value, uint64_t seed) -> uint64_t ;

	uint64_t hash64(std::string_view value, uint64_t seed) {
	auto hash64(std::string_view value, uint64_t seed) -> uint64_t {

	find_package(xxHash 0.8.3 EXACT CONFIG REQUIRED)
	find_package(xxHash 0.8.3 CONFIG REQUIRED)

		uint32_t const capped_num_hash_functions
		= std::clamp(num_hash_functions, cMinNumHashFunctions, cMaxNumHashFunctions);

	return BloomFilter(bit_array_size, num_hash_functions, std::move(bit_array));
	return BloomFilter{bit_array_size, num_hash_functions, std::move(bit_array)};

	* Writes Bloom filter payload fields to a writer.
	* Writes a serialized representation of a bloom filter to a writer.

Conversation

ShangDanLuXian commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Uh oh!

coderabbitai Bot commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ShangDanLuXian Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

gibber9809 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gibber9809 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ShangDanLuXian commented Feb 20, 2026 •

edited

Loading

coderabbitai Bot commented Feb 20, 2026 •

edited

Loading

coderabbitai Bot Feb 20, 2026 •

edited

Loading

ShangDanLuXian Feb 20, 2026 •

edited

Loading