feat(clp_s::filter): Add library for representing bloom filters; Add xxHash v0.8.3 as a dependency#2010
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughA new Bloom filter–based filtering system is introduced under Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant FilterBuilder
participant BloomFilter
participant FileWriter
Client->>FilterBuilder: FilterBuilder(FilterConfig, num_elements)
FilterBuilder->>BloomFilter: create(num_elements, fpr)
BloomFilter->>BloomFilter: compute_optimal_parameters(num_elements, fpr)
BloomFilter-->>FilterBuilder: BloomFilter instance
Client->>FilterBuilder: add(value)
FilterBuilder->>BloomFilter: add(lowercase(value))
Client->>FilterBuilder: write(path)
FilterBuilder->>FileWriter: new FileWriter(path)
FilterBuilder->>BloomFilter: write_to_file(writer)
BloomFilter->>FileWriter: serialize data
FileWriter-->>FilterBuilder: success
sequenceDiagram
participant Client
participant FilterReader
participant FileReader
participant BloomFilter
Client->>FilterReader: read_from_file(reader)
FilterReader->>FileReader: read magic header
FileReader-->>FilterReader: magic valid
FilterReader->>FileReader: read filter type
FileReader-->>FilterReader: FilterType
alt FilterType is Bloom
FilterReader->>BloomFilter: try_read_from_file(reader)
BloomFilter->>FileReader: read parameters & data
FileReader-->>BloomFilter: bloom data
BloomFilter-->>FilterReader: BloomFilter instance
end
FilterReader-->>Client: true
Client->>FilterReader: possibly_contains(value)
FilterReader->>BloomFilter: possibly_contains(lowercase(value))
BloomFilter-->>FilterReader: bool result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can generate walkthrough in a markdown collapsible section to save space.Enable the |
There was a problem hiding this comment.
Actionable comments posted: 11
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/core/CMakeLists.txt`:
- Around line 722-725: Reorder the test entries in
components/core/CMakeLists.txt so they are alphabetical: move the line with
tests/test-clp_s-bloom_filter.cpp to appear before
tests/test-clp_s-delta-encode-log-order.cpp (ensure the sequence around
tests/test-clp_s-end_to_end.cpp and tests/test-clp_s-range_index.cpp still
remains alphabetically sorted).
In `@components/core/src/clp_s/CMakeLists.txt`:
- Around line 259-264: The CMake target clp_s_filter currently links
clp::string_utils as PUBLIC even though its symbols are only used in
implementation files (FilterBuilder.cpp and FilterReader.cpp) and not in public
headers (FilterBuilder.hpp, FilterReader.hpp); change the linkage for
clp::string_utils from PUBLIC to PRIVATE in the target_link_libraries call for
clp_s_filter so the dependency is not propagated transitively (do this as part
of a follow-up CMake cleanup PR if you prefer).
In `@components/core/src/clp_s/filter/BloomFilter.cpp`:
- Line 49: Replace the negation usage with the project's preferred
equality-to-false pattern: change occurrences of
"!std::isfinite(ideal_bit_array_size)", "!m_bit_array.empty()", and
"!bit_array.empty()" in BloomFilter.cpp to "false ==
std::isfinite(ideal_bit_array_size)", "false == m_bit_array.empty()", and "false
== bit_array.empty()" respectively; update the conditional expressions in the
functions/methods that reference ideal_bit_array_size, m_bit_array, and
bit_array so they use the exact "false == <expression>" form while preserving
parentheses and surrounding logic.
- Around line 56-62: The calculation of num_hash_functions currently truncates
the optimal k by casting to uint32_t; change this to round to the nearest
integer (e.g., use std::lround or std::round followed by cast) when computing
num_hash_functions from (bit_array_size / expected_num_elements) * ln2, and keep
the existing std::clamp into capped_num_hash_functions with kMinNumHashFunctions
and kMaxNumHashFunctions; also ensure <cmath> is included so std::lround/
std::round is available.
- Around line 132-144: Add a sanity upper-bound check on bit_array_bytes before
allocating the vector to avoid OOM from malformed input: in the same block that
computes expected_bit_array_bytes and checks bit_array_bytes, define a
reasonable maximum (e.g., constexpr size_t kMaxBitArrayBytes = 256ULL * 1024 *
1024) and return false if bit_array_bytes > kMaxBitArrayBytes; leave the
existing std::numeric_limits<size_t>::max() check in place and only perform the
new bound check before constructing std::vector<uint8_t> bit_array.
In `@components/core/src/clp_s/filter/BloomFilter.hpp`:
- Around line 1-2: Update the include guard in BloomFilter.hpp to match the
project's convention by replacing the current macro CLP_S_BLOOM_FILTER_HPP with
CLP_S_FILTER_BLOOM_FILTER_HPP at the top of the file and update the matching
closing `#endif` comment (and any other occurrences) to use
CLP_S_FILTER_BLOOM_FILTER_HPP so it follows the CLP_S_FILTER_* prefix used by
other headers in filter/.
In `@components/core/src/clp_s/filter/FilterBuilder.cpp`:
- Around line 24-33: FilterBuilder::write currently opens a FileWriter and calls
write_filter_file(m_config.type, m_filter) but if write_filter_file throws the
FileWriter::close() is never called; update FilterBuilder::write to wrap the
write_filter_file call in a try/catch (catch all), call writer.close() in the
catch, rethrow the exception, and keep the normal writer.close() after the try
so the file is always closed even on errors; reference the FileWriter instance
(writer), the write_filter_file(...) call, and the existing open/close sequence
when making the change.
In `@components/core/src/clp_s/filter/FilterFile.cpp`:
- Around line 30-43: The code uses two different default-construction syntaxes
for BloomFilter (BloomFilter{} at the assignment to out_filter when
FilterType::None and BloomFilter() later); pick one style and make both
assignments consistent—either change the first assignment to out_filter =
BloomFilter(); or change the second to out_filter = BloomFilter{}—so both
occurrences (the assignments to out_filter in the branch handling
FilterType::None and the branch before calling
out_filter.read_from_file(reader)) use the same form.
In `@components/core/src/clp_s/filter/FilterFile.hpp`:
- Line 18: The current write_filter_file(FileWriter& writer, FilterType type,
BloomFilter const& filter) API forces callers to provide a BloomFilter even when
type == FilterType::None; add an ergonomic overload such as
write_filter_file(FileWriter& writer, FilterType type) that handles the None
case (or change the signature to accept std::optional<BloomFilter const&>) so
callers can omit a dummy BloomFilter; update implementations of
write_filter_file and any callers to use the new overload for FilterType::None
and keep the existing three-argument overload for the cases that actually need a
BloomFilter.
In `@components/core/tests/test-clp_s-bloom_filter.cpp`:
- Around line 46-51: Replace the negation-style check in the loop that counts
false_negatives by using the project's preferred equality comparison: change the
condition that currently reads using !filter.possibly_contains(value) to use
false == filter.possibly_contains(value). Update the loop that iterates over
inserted and increments false_negatives to use this new condition so the
behavior is unchanged but conforms to the guideline; references:
false_negatives, inserted, and filter.possibly_contains.
In `@taskfiles/deps/main.yaml`:
- Around line 577-595: Update the xxhash dependency to v0.8.3 and verify the
archive SHA256: change the URL value referenced by the URL variable to the
v0.8.3 tarball and replace FILE_SHA256 with the verified SHA256 for that exact
archive; ensure the variables LIB_NAME, XXHASH_OUTPUT_DIR and the generated
cmake file path (G_DEPS_CPP_CMAKE_SETTINGS_DIR/{{.LIB_NAME}}.cmake) remain
correct; after updating, download the new URL locally to compute and confirm the
SHA256 before committing to avoid GitHub archive instability.
|
|
||
| auto const num_hash_functions = static_cast<uint32_t>( | ||
| static_cast<double>(bit_array_size) / static_cast<double>(expected_num_elements) * ln2 | ||
| ); | ||
|
|
||
| uint32_t const capped_num_hash_functions | ||
| = std::clamp(num_hash_functions, kMinNumHashFunctions, kMaxNumHashFunctions); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Truncation in num_hash_functions computation may yield a sub-optimal hash count.
Line 57–59 truncates the optimal k via static_cast<uint32_t>(...). The standard formula k = (m/n) * ln(2) is ideally rounded to the nearest integer rather than floored. With std::clamp the result is safe (minimum 1), but using std::round or std::lround before the cast would give a more accurate hash count for borderline values.
Proposed fix
- auto const num_hash_functions = static_cast<uint32_t>(
- static_cast<double>(bit_array_size) / static_cast<double>(expected_num_elements) * ln2
- );
+ auto const num_hash_functions = static_cast<uint32_t>(std::llround(
+ static_cast<double>(bit_array_size) / static_cast<double>(expected_num_elements) * ln2
+ ));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/core/src/clp_s/filter/BloomFilter.cpp` around lines 56 - 62, The
calculation of num_hash_functions currently truncates the optimal k by casting
to uint32_t; change this to round to the nearest integer (e.g., use std::lround
or std::round followed by cast) when computing num_hash_functions from
(bit_array_size / expected_num_elements) * ln2, and keep the existing std::clamp
into capped_num_hash_functions with kMinNumHashFunctions and
kMaxNumHashFunctions; also ensure <cmath> is included so std::lround/ std::round
is available.
| uint64_t bit_array_bytes = 0; | ||
| if (clp::ErrorCode_Success != reader.try_read_numeric_value(bit_array_bytes)) { | ||
| return false; | ||
| } | ||
| if (bit_array_bytes > std::numeric_limits<size_t>::max()) { | ||
| return false; | ||
| } | ||
| size_t const expected_bit_array_bytes = (bit_array_size + 7) / 8; | ||
| if (bit_array_bytes != expected_bit_array_bytes) { | ||
| return false; | ||
| } | ||
|
|
||
| std::vector<uint8_t> bit_array(static_cast<size_t>(bit_array_bytes)); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider adding a sanity upper-bound on bit_array_bytes before allocation.
read_from_file validates that bit_array_bytes == (bit_array_size + 7) / 8 but imposes no ceiling on the absolute size. A crafted file could set bit_array_size close to SIZE_MAX, causing a multi-gigabyte std::vector allocation (Line 144) that may OOM. Adding a reasonable maximum (e.g., 256 MiB) would harden the deserialization path against malformed input.
Not critical for a library-only PR, but worth adding before this is wired into archive reading.
Example guard
+ // Reject unreasonably large filters (256 MiB cap).
+ constexpr size_t kMaxBitArrayBytes{256ULL * 1024 * 1024};
+ if (bit_array_bytes > kMaxBitArrayBytes) {
+ return false;
+ }
+
size_t const expected_bit_array_bytes = (bit_array_size + 7) / 8;🧰 Tools
🪛 Cppcheck (2.19.0)
[style] 142-142: The function 'read_numeric_value' is never used.
(unusedFunction)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/core/src/clp_s/filter/BloomFilter.cpp` around lines 132 - 144, Add
a sanity upper-bound check on bit_array_bytes before allocating the vector to
avoid OOM from malformed input: in the same block that computes
expected_bit_array_bytes and checks bit_array_bytes, define a reasonable maximum
(e.g., constexpr size_t kMaxBitArrayBytes = 256ULL * 1024 * 1024) and return
false if bit_array_bytes > kMaxBitArrayBytes; leave the existing
std::numeric_limits<size_t>::max() check in place and only perform the new bound
check before constructing std::vector<uint8_t> bit_array.
| bool FilterBuilder::write(std::string const& filter_path) const { | ||
| if (false == m_enabled) { | ||
| return false; | ||
| } | ||
| FileWriter writer; | ||
| writer.open(filter_path, FileWriter::OpenMode::CreateForWriting); | ||
| write_filter_file(writer, m_config.type, m_filter); | ||
| writer.close(); | ||
| return true; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Find FileWriter class definition
find . -type f \( -name "*.cpp" -o -name "*.hpp" -o -name "*.h" \) | xargs grep -l "class FileWriter"Repository: y-scope/clp
Length of output: 179
🏁 Script executed:
# Search for FileWriter header and implementation files
fd -e hpp -e h FileWriter | head -20Repository: y-scope/clp
Length of output: 173
🏁 Script executed:
# Use ast-grep to find FileWriter class definition
ast-grep --pattern 'class FileWriter {
$$$
}'Repository: y-scope/clp
Length of output: 37
🏁 Script executed:
cat -n components/core/src/clp_s/FileWriter.hppRepository: y-scope/clp
Length of output: 4880
🏁 Script executed:
# Also check for FileWriter implementation in cpp files
fd -path "*/clp_s/*" -name "FileWriter.cpp"Repository: y-scope/clp
Length of output: 223
🏁 Script executed:
find . -path "*/clp_s/*" -name "FileWriter.cpp" -o -path "*/clp_s/*" -name "FileWriter.cpp"Repository: y-scope/clp
Length of output: 97
🏁 Script executed:
# Search more broadly for FileWriter.cpp
find . -name "FileWriter.cpp"Repository: y-scope/clp
Length of output: 179
🏁 Script executed:
cat -n components/core/src/clp_s/FileWriter.cppRepository: y-scope/clp
Length of output: 5687
🏁 Script executed:
# Verify FilterBuilder.cpp location and context
cat -n components/core/src/clp_s/filter/FilterBuilder.cpp | head -40Repository: y-scope/clp
Length of output: 1259
🏁 Script executed:
# Check if there are other usages of FileWriter in the codebase with similar patterns
rg -n "FileWriter.*writer" --type cpp | grep -A 3 -B 3 "write_filter_file\|writer.open"Repository: y-scope/clp
Length of output: 940
🏁 Script executed:
# Check other usages of FileWriter to see if similar pattern exists elsewhere
rg -B 2 -A 5 "FileWriter\s+\w+;" --type cpp | grep -A 10 "\.open"Repository: y-scope/clp
Length of output: 9334
🏁 Script executed:
# Look at FilterFile.cpp to understand what write_filter_file does
cat -n components/core/src/clp_s/filter/FilterFile.cppRepository: y-scope/clp
Length of output: 1774
Ensure file is closed even if an exception occurs during filtering.
If write_filter_file() throws (e.g., due to disk full or write failure), writer.close() on line 31 is never called. The FileWriter destructor does not implement cleanup—it only logs an error. Wrap the write operation in a try-catch block:
Suggested fix
bool FilterBuilder::write(std::string const& filter_path) const {
if (false == m_enabled) {
return false;
}
FileWriter writer;
writer.open(filter_path, FileWriter::OpenMode::CreateForWriting);
try {
write_filter_file(writer, m_config.type, m_filter);
} catch (...) {
writer.close();
throw;
}
writer.close();
return true;
}🧰 Tools
🪛 Cppcheck (2.19.0)
[style] 24-24: The function 'is_decimal_digit' is never used.
(unusedFunction)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/core/src/clp_s/filter/FilterBuilder.cpp` around lines 24 - 33,
FilterBuilder::write currently opens a FileWriter and calls
write_filter_file(m_config.type, m_filter) but if write_filter_file throws the
FileWriter::close() is never called; update FilterBuilder::write to wrap the
write_filter_file call in a try/catch (catch all), call writer.close() in the
catch, rethrow the exception, and keep the normal writer.close() after the try
so the file is always closed even on errors; reference the FileWriter instance
(writer), the write_filter_file(...) call, and the existing open/close sequence
when making the change.
There was a problem hiding this comment.
I am intentionally writing like this because I checked that in JsonConstructor.cpp and SchemaMap.cpp there is also no protection against writer.close(), so make it consistent. Also that the loss of data in the writer is acceptable in the context of filter.
I am actually considerring add try catch in the caller(integration) but not here
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check JsonConstructor.cpp and SchemaMap.cpp for FileWriter usage patterns
echo "=== Checking JsonConstructor.cpp ==="
rg -n -C 5 "FileWriter.*writer" components/core/src/clp_s/JsonConstructor.cpp || echo "No matches in JsonConstructor.cpp"
echo ""
echo "=== Checking SchemaMap.cpp ==="
rg -n -C 5 "FileWriter.*writer" components/core/src/clp_s/SchemaMap.cpp || echo "No matches in SchemaMap.cpp"
echo ""
echo "=== Looking for writer.open and writer.close patterns in both files ==="
rg -n -A 10 "writer\.open" components/core/src/clp_s/JsonConstructor.cpp components/core/src/clp_s/SchemaMap.cpp
Length of output: 5237
@ShangDanLuXian, thank you for the explanation! I've confirmed that JsonConstructor.cpp and SchemaMap.cpp indeed follow the same pattern—FileWriter operations without try-catch protection around the write calls before close(). Your approach is consistent with the existing codebase conventions, and I understand your reasoning that data loss in the filter context is acceptable, with exception handling deferred to the caller/integration layer. I've noted this for future reviews.
✏️ Learnings added
Learnt from: ShangDanLuXian
Repo: y-scope/clp PR: 2010
File: components/core/src/clp_s/filter/FilterBuilder.cpp:24-33
Timestamp: 2026-02-20T15:19:43.528Z
Learning: In components/core/src/clp_s, FileWriter usage follows a pattern where writer.open(), write operations, and writer.close() are called without try-catch protection around the write operations. This is intentional—data loss is considered acceptable in these contexts (e.g., filters), and exception handling is deferred to the caller/integration layer. Examples: JsonConstructor.cpp, SchemaMap.cpp, and FilterBuilder.cpp.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Learnt from: anlowee
Repo: y-scope/clp PR: 1176
File: components/core/src/clp_s/ColumnWriter.cpp:49-127
Timestamp: 2025-08-13T15:25:40.805Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, the FormattedFloatColumnWriter::add_value method intentionally allows std::stod to throw exceptions (std::invalid_argument or std::out_of_range) without catching them, as this is designed to terminate decompression when encountering invalid float strings for data integrity purposes.
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Learnt from: gibber9809
Repo: y-scope/clp PR: 624
File: components/core/src/clp/BoundedReader.hpp:25-27
Timestamp: 2024-12-10T16:56:33.545Z
Learning: In the CLP project, within constructors like `BoundedReader` in `components/core/src/clp/BoundedReader.hpp`, it's acceptable to call methods like `get_pos()` that may throw exceptions without handling potential errors via `try_get_pos()`, since the constructor itself handles errors by throwing.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
| namespace clp_s::filter { | ||
| constexpr char kFilterFileMagic[4] = {'C', 'L', 'P', 'F'}; | ||
|
|
||
| void write_filter_file(FileWriter& writer, FilterType type, BloomFilter const& filter); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider the API ergonomics of write_filter_file for FilterType::None.
The signature requires a BloomFilter const& even when type is FilterType::None (where the filter is unused). This forces callers to construct a dummy BloomFilter for the no-filter case. A simple overload that takes only FileWriter& and FilterType for the None path (or using std::optional<BloomFilter const&>) would make the intent clearer.
Not a blocker for the initial library PR, but worth considering before archive-writer integration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/core/src/clp_s/filter/FilterFile.hpp` at line 18, The current
write_filter_file(FileWriter& writer, FilterType type, BloomFilter const&
filter) API forces callers to provide a BloomFilter even when type ==
FilterType::None; add an ergonomic overload such as
write_filter_file(FileWriter& writer, FilterType type) that handles the None
case (or change the signature to accept std::optional<BloomFilter const&>) so
callers can omit a dummy BloomFilter; update implementations of
write_filter_file and any callers to use the new overload for FilterType::None
and keep the existing three-argument overload for the cases that actually need a
BloomFilter.
| constexpr size_t kDefaultBitArraySize{64}; | ||
| constexpr uint32_t kDefaultNumHashFunctions{1}; | ||
| constexpr uint32_t kMinNumHashFunctions{1}; | ||
| constexpr uint32_t kMaxNumHashFunctions{20}; |
There was a problem hiding this comment.
Correspond to 1-e6 FPR, which will be set as the min FPR in the integration
There was a problem hiding this comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@components/core/CMakeLists.txt`:
- Around line 722-724: No change required because the test entries in
components/core/CMakeLists.txt are now correctly alphabetized; ensure the file
list keeps tests/test-clp_s-bloom_filter.cpp before
tests/test-clp_s-delta-encode-log-order.cpp and that
tests/test-clp_s-end_to_end.cpp remains in the same sorted sequence, preserving
alphabetical order whenever you add or reorder test entries.
In `@components/core/src/clp_s/filter/BloomFilter.cpp`:
- Around line 57-59: The computation of num_hash_functions currently truncates
the real-valued optimum (k = (m/n) * ln2) which can yield a sub-optimal count
for borderline values; change the computation in BloomFilter.cpp so you round to
nearest before casting (e.g., wrap the real-valued expression with std::llround
and then static_cast<uint32_t>) while keeping the existing std::clamp bounds;
update the include if needed to pull in <cmath> and ensure the expression uses
bit_array_size, expected_num_elements and ln2 as before.
- Around line 132-154: The deserialiser creates std::vector<uint8_t> bit_array
from bit_array_bytes without an upper bound, allowing a crafted bit_array_size
to force huge allocations; before constructing bit_array in BloomFilter.cpp
(after reader.try_read_numeric_value and the check against
std::numeric_limits<size_t>::max and expected_bit_array_bytes), validate that
bit_array_bytes <= a sane constant (e.g. 256 * 1024 * 1024) and return false if
it exceeds that cap; keep references to bit_array_bytes, bit_array_size,
reader.try_read_numeric_value and reader.try_read_exact_length to locate the
changes and ensure the guard happens before allocating the vector to prevent
std::bad_alloc/OOM.
In `@components/core/src/clp_s/filter/FilterFile.cpp`:
- Around line 17-46: The inconsistency in default-constructing out_filter in
read_filter_file has been resolved—both assignments now use BloomFilter{}; keep
the existing checks around reading magic, validating type (FilterType::None and
FilterType::Bloom) and the call to out_filter.read_from_file(reader) as-is; no
code change required in read_filter_file, ensure symbols referenced are
read_filter_file, out_filter, BloomFilter, FilterType, and read_from_file remain
unchanged.
gibber9809
left a comment
There was a problem hiding this comment.
Apologies for the slow initial review -- leaving mostly higher level comments so that you have something to work with. Could you go through and apply the comments about include style and constant naming throughout the PR?
XxHash seems like it should be a good choice for hash library given that hashes are supposed to be stable across platforms and into the future post 0.8.0 + performance characteristics. It seems like there's a newer version (0.8.3) though, so could we update to that?
Besides that, I'll need to do another pass on this PR where I take a closer look at design + tests + the way we're pulling in this xxhash library.
| constexpr size_t kDefaultBitArraySize{64}; | ||
| constexpr uint32_t kDefaultNumHashFunctions{1}; | ||
| constexpr uint32_t kMinNumHashFunctions{1}; | ||
| constexpr uint32_t kMaxNumHashFunctions{20}; | ||
| constexpr uint64_t kPrimaryHashSeed{0}; | ||
| constexpr uint64_t kSecondaryHashSeed{0x9e37'79b9'7f4a'7c15ULL}; |
There was a problem hiding this comment.
| constexpr size_t kDefaultBitArraySize{64}; | |
| constexpr uint32_t kDefaultNumHashFunctions{1}; | |
| constexpr uint32_t kMinNumHashFunctions{1}; | |
| constexpr uint32_t kMaxNumHashFunctions{20}; | |
| constexpr uint64_t kPrimaryHashSeed{0}; | |
| constexpr uint64_t kSecondaryHashSeed{0x9e37'79b9'7f4a'7c15ULL}; | |
| constexpr size_t cDefaultBitArraySize{64}; | |
| constexpr uint32_t cDefaultNumHashFunctions{1}; | |
| constexpr uint32_t cMinNumHashFunctions{1}; | |
| constexpr uint32_t cMaxNumHashFunctions{20}; | |
| constexpr uint64_t cPrimaryHashSeed{0}; | |
| constexpr uint64_t cSecondaryHashSeed{0x9e37'79b9'7f4a'7c15ULL}; |
Nit: style for constants in the repo is to use the prefix c -- sorry, this is one of the details that is only in the private part of our style guide.
| : m_bit_array_size(kDefaultBitArraySize), | ||
| m_num_hash_functions(kDefaultNumHashFunctions), | ||
| m_bit_array((kDefaultBitArraySize + 7) / 8, 0) {} |
There was a problem hiding this comment.
| : m_bit_array_size(kDefaultBitArraySize), | |
| m_num_hash_functions(kDefaultNumHashFunctions), | |
| m_bit_array((kDefaultBitArraySize + 7) / 8, 0) {} | |
| : m_bit_array_size{cDefaultBitArraySize}, | |
| m_num_hash_functions{cDefaultNumHashFunctions}, | |
| m_bit_array{(cDefaultBitArraySize + 7) / 8, 0} {} |
Nit: preferred style is to use brace initializers wherever possible -- sorry again, this also is only documented in the private part of our style guide.
| m_bit_array_size = bit_array_size; | ||
| m_num_hash_functions = num_hash_functions; | ||
|
|
||
| size_t const num_bytes = (m_bit_array_size + 7) / 8; |
There was a problem hiding this comment.
This code seems to use (x + 7)/8 to get the number of bytes that contain a given number of bits in several places. Might be worth adding a helper called min_bytes_containing_bits in the anonymonous namespace in this file or something.
Might also be worth defining a constant cNumBitsInByte to replace all of the instances of 8 in this file for readability.
| @@ -0,0 +1,10 @@ | |||
| #include "XxHash.hpp" | |||
|
|
|||
| #define XXH_INLINE_ALL | |||
| PRIVATE | ||
| clp::string_utils | ||
| ) | ||
| endif() |
There was a problem hiding this comment.
Could we move all of this stuff for defining the library into a new filter/CMakeLists.txt, then just use add_subdirectory from this CMakeLists file to add that subdirectory?
There was a problem hiding this comment.
should we not add the package into this cmakelist in this PR - it should be part of the integration right?
There was a problem hiding this comment.
It's fine to add the cmake library in this PR since we aren't linking it into clp-s. It's also got a flag to disable if we need, so no real risk of adverse side effects.
| #include "../../clp/ErrorCode.hpp" | ||
| #include "../../clp/ReaderInterface.hpp" |
There was a problem hiding this comment.
| #include "../../clp/ErrorCode.hpp" | |
| #include "../../clp/ReaderInterface.hpp" | |
| #include <clp/ErrorCode.hpp> | |
| #include <clp/ReaderInterface.hpp> |
Nit: <> is our preferred include style for pulling in other parts of the clp project.
| #include <utility> | ||
| #include <vector> | ||
|
|
||
| #include "../FileWriter.hpp" |
There was a problem hiding this comment.
| #include "../FileWriter.hpp" | |
| #include <clp_s/FileWriter.hpp> |
Nit
gibber9809
left a comment
There was a problem hiding this comment.
Did a pass on the build system side of this PR. Let me know if any of my comments are unclear.
| xxhash: | ||
| internal: true | ||
| vars: | ||
| LIB_NAME: "xxhash" | ||
| XXHASH_OUTPUT_DIR: "{{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-src" | ||
| run: "once" | ||
| deps: | ||
| - task: "utils:init" | ||
| cmds: | ||
| - task: "yscope-dev-utils:remote:download-and-extract-tar" | ||
| vars: | ||
| CHECKSUM_FILE: "{{.G_DEPS_CPP_CHECKSUMS_DIR}}/{{.LIB_NAME}}.md5" | ||
| FILE_SHA256: "baee0c6afd4f03165de7a4e67988d16f0f2b257b51d0e3cb91909302a26a79c4" | ||
| OUTPUT_DIR: "{{.XXHASH_OUTPUT_DIR}}" | ||
| URL: "https://github.com/Cyan4973/xxHash/archive/refs/tags/v0.8.2.tar.gz" | ||
| - >- | ||
| echo "set( | ||
| CLP_XXHASH_INCLUDE_DIRECTORY \"{{.XXHASH_OUTPUT_DIR}}\" | ||
| )" > "{{.G_DEPS_CPP_CMAKE_SETTINGS_DIR}}/{{.LIB_NAME}}.cmake" |
There was a problem hiding this comment.
| xxhash: | |
| internal: true | |
| vars: | |
| LIB_NAME: "xxhash" | |
| XXHASH_OUTPUT_DIR: "{{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-src" | |
| run: "once" | |
| deps: | |
| - task: "utils:init" | |
| cmds: | |
| - task: "yscope-dev-utils:remote:download-and-extract-tar" | |
| vars: | |
| CHECKSUM_FILE: "{{.G_DEPS_CPP_CHECKSUMS_DIR}}/{{.LIB_NAME}}.md5" | |
| FILE_SHA256: "baee0c6afd4f03165de7a4e67988d16f0f2b257b51d0e3cb91909302a26a79c4" | |
| OUTPUT_DIR: "{{.XXHASH_OUTPUT_DIR}}" | |
| URL: "https://github.com/Cyan4973/xxHash/archive/refs/tags/v0.8.2.tar.gz" | |
| - >- | |
| echo "set( | |
| CLP_XXHASH_INCLUDE_DIRECTORY \"{{.XXHASH_OUTPUT_DIR}}\" | |
| )" > "{{.G_DEPS_CPP_CMAKE_SETTINGS_DIR}}/{{.LIB_NAME}}.cmake" | |
| xxhash: | |
| internal: true | |
| vars: | |
| LIB_NAME: "xxHash" | |
| XXHASH_OUTPUT_DIR: "{{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-src" | |
| run: "once" | |
| cmds: | |
| - task: "utils:install-remote-cmake-lib" | |
| vars: | |
| CMAKE_GEN_ARGS: | |
| - "-DXXHASH_BUILD_XXHSUM=OFF" | |
| - "-DBUILD_SHARED_LIBS=ON" | |
| - '-DCMAKE_INSTALL_PREFIX={{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-install' | |
| LIB_NAME: "{{.LIB_NAME}}" | |
| TARBALL_SHA256: "aae608dfe8213dfd05d909a57718ef82f30722c392344583d3f39050c7f29a80" | |
| TARBALL_URL: "https://github.com/Cyan4973/xxHash/archive/refs/tags/v0.8.3.tar.gz" | |
| CMAKE_SOURCE_DIR: "cmake_unofficial" |
Changing this to pull in v0.8.3 as well as use the cmake install flow.
Now instead of the CLP_XXHASH_INCLUDE_DIRECTORY stuff you're doing, you can directly use xxHash::xxhash as a link target in target_link_libraries for the relevant targets.
You will also need to add a block like:
if(CLP_NEED_XXHASH)
find_package(xxHash 0.8.3 CONFIG REQUIRED)
if(xxHash_FOUND)
message(STATUS "Found xxHash ${xxHash_VERSION}")
endif()
endif()
To components/core/CMakeLists.txt
and you'll need to add functions/flags that set CLP_NEED_XXHASH in components/core/cmake/Options/options.cmake -- sorry that this isn't documented, but you'll pretty much want to copy what we do for other targets. That means you'll have to add an option flag like
option(
CLP_BUILD_CLP_S_FILTER
"Build clp_s::filter."
ON
)
Alphabetically sorted in the right position in the long list of options at the start of the file + add a function
validate_clp_s_filter_dependencies which asks for CLP_BUILD_CLP_STRING_UTILS+CLP_BUILD_CLP_S_IO and set_clp_s_filter_dependencies which sets CLP_NEED_XXHASH + call these two functions in validate_and_setup_all_clp_dependency_flags like we do for other dependencies + add CLP_NEED_XXHASH to the list in convert_clp_dependency_properties_to_variables.
Hopefully you can figure that out by copying what we do for other dependencies, but if not let me know and I can help -- we really should have documented how to set up these flags/refactored them by now, but haven't yet, so it's more confusing than it needs to be.
| filter/XxHash.hpp | ||
| ) | ||
|
|
||
| if(CLP_BUILD_CLP_S_IO) |
There was a problem hiding this comment.
| if(CLP_BUILD_CLP_S_IO) | |
| if(CLP_BUILD_CLP_S_FILTER) |
See my other comment about how to set up this flag properly.
| PRIVATE | ||
| ${CLP_XXHASH_INCLUDE_DIRECTORY} |
There was a problem hiding this comment.
Per other comment, should be able to remove this once you change how the dependency is pulled in.
| target_link_libraries( | ||
| clp_s_filter | ||
| PUBLIC | ||
| clp_s::io |
There was a problem hiding this comment.
Per other comment, will need to link against xxHash::xxhash.
| target_include_directories(unitTest | ||
| PRIVATE | ||
| ${CLP_SQLITE3_INCLUDE_DIRECTORY} | ||
| ${CLP_XXHASH_INCLUDE_DIRECTORY} |
There was a problem hiding this comment.
| ${CLP_XXHASH_INCLUDE_DIRECTORY} |
Per other comment, can get rid of this when we update how we pull in the xxhash dependency.
| src/clp_s/filter/FilterReader.cpp | ||
| src/clp_s/filter/FilterReader.hpp | ||
| src/clp_s/filter/XxHash.cpp | ||
| src/clp_s/filter/XxHash.hpp |
There was a problem hiding this comment.
Instead of pulling in every source file from clp_s/filter could we just link unitTest against clp_s::filter?
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/core/src/clp_s/filter/BloomFilter.cpp`:
- Around line 22-24: The current min_bytes_containing_bits function can overflow
when computing (num_bits + (cNumBitsInByte - 1)); change the calculation to
avoid addition overflow by using division and modulo: compute bytes = num_bits /
cNumBitsInByte and add one only if num_bits % cNumBitsInByte != 0 (handle
num_bits == 0 returning 0). Update min_bytes_containing_bits to use that
division/modulo approach (referencing min_bytes_containing_bits and
cNumBitsInByte) so deserialisation validation cannot be bypassed by SIZE_MAX
inputs.
In `@taskfiles/deps/main.yaml`:
- Around line 570-573: The CMake args currently force shared-only builds by
setting CMAKE_GEN_ARGS with "-DBUILD_SHARED_LIBS=ON"; modify the xxHash build
args in CMAKE_GEN_ARGS to also set "-DBUILD_STATIC_LIBS=ON" and ensure the
project’s components/core/CMakeLists.txt uses a static-vs-shared selection when
calling find_package(xxhash ...) similar to how Boost and LibArchive are handled
(i.e., add logic to respect CLP_USE_STATIC_LIBS and pass the STATIC or SHARED
hint to find_package or set xxHash_USE_STATIC_LIBS accordingly) so the build
supports the CLP static-link mode.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (12)
components/core/CMakeLists.txtcomponents/core/cmake/Options/options.cmakecomponents/core/src/clp_s/CMakeLists.txtcomponents/core/src/clp_s/filter/BloomFilter.cppcomponents/core/src/clp_s/filter/BloomFilter.hppcomponents/core/src/clp_s/filter/CMakeLists.txtcomponents/core/src/clp_s/filter/FilterBuilder.cppcomponents/core/src/clp_s/filter/FilterFile.cppcomponents/core/src/clp_s/filter/FilterFile.hppcomponents/core/src/clp_s/filter/FilterReader.cppcomponents/core/src/clp_s/filter/XxHash.cpptaskfiles/deps/main.yaml
| CMAKE_GEN_ARGS: | ||
| - "-DXXHASH_BUILD_XXHSUM=OFF" | ||
| - "-DBUILD_SHARED_LIBS=ON" | ||
| - '-DCMAKE_INSTALL_PREFIX={{.G_DEPS_CPP_DIR}}/{{.LIB_NAME}}-install' |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify static-link mode defaults and xxHash build/link wiring.
rg -n "CLP_USE_STATIC_LIBS|find_package\\(xxHash|xxHash::xxhash" \
components/core/CMakeLists.txt \
components/core/src/clp_s/filter/CMakeLists.txt
echo "---"
rg -n "task:\\s*\"xxhash\"|BUILD_SHARED_LIBS=ON|LIB_NAME:\\s*\"xxHash\"" \
taskfiles/deps/main.yamlRepository: y-scope/clp
Length of output: 1089
🏁 Script executed:
#!/bin/bash
# Search for custom FindxxHash module
fd "FindxxHash" components/
echo "---"
# Get the full xxHash task definition from taskfiles/deps/main.yaml
sed -n '87,600p' taskfiles/deps/main.yaml | grep -A 30 'task: "xxhash"'
echo "---"
# Check filter CMakeLists.txt around line 31
sed -n '25,40p' components/core/src/clp_s/filter/CMakeLists.txtRepository: y-scope/clp
Length of output: 1280
🏁 Script executed:
#!/bin/bash
# Find the xxhash task definition more precisely
rg -A 50 "^\s+xxhash:" taskfiles/deps/main.yaml | head -80Repository: y-scope/clp
Length of output: 1796
🏁 Script executed:
#!/bin/bash
# Search for custom FindxxHash module
find components -name "*xxHash*" -o -name "*xxhash*"
echo "---"
# Check if there's xxHash handling in the main CMakeLists
sed -n '330,350p' components/core/CMakeLists.txtRepository: y-scope/clp
Length of output: 642
🏁 Script executed:
#!/bin/bash
# Find CLP_NEED_XXHASH definition
rg -n "CLP_NEED_XXHASH" components/core/CMakeLists.txt
echo "---"
# Check how other libraries handle static vs shared (e.g., BZip2)
rg -B5 -A5 "CLP_USE_STATIC_LIBS" components/core/CMakeLists.txt | head -60
echo "---"
# Check xxHash official build options
rg -n "xxHash_USE_STATIC_LIBS|BUILD_STATIC_LIBS" taskfiles/deps/main.yamlRepository: y-scope/clp
Length of output: 1873
🏁 Script executed:
#!/bin/bash
# Search for CLP_NEED_XXHASH definition/assignment
rg -n "set.*CLP_NEED_XXHASH" components/core/CMakeLists.txt
echo "---"
# Check if xxHash find_package has any handling for static libs
sed -n '337,345p' components/core/CMakeLists.txt
echo "---"
# Check other libraries (BZip2, LibLZMA, etc.) to see the pattern for static/shared handling
rg -B2 "CLP_USE_STATIC_LIBS" components/core/CMakeLists.txt | grep -A2 "BZip2\|LibLZMA\|LZ4\|ZLIB"Repository: y-scope/clp
Length of output: 305
🏁 Script executed:
#!/bin/bash
# Search for CLP_NEED_XXHASH in all files
rg -r "CLP_NEED_XXHASH" components/
echo "---"
# Check for any xxHash configuration in other CMakeLists files
rg -n "xxHash\|XXHASH" components/core/src/clp_s/filter/CMakeLists.txt
echo "---"
# Check the pattern used for other libraries with static handling
sed -n '155,175p' components/core/CMakeLists.txtRepository: y-scope/clp
Length of output: 16101
🏁 Script executed:
#!/bin/bash
# Find the actual definition of CLP_NEED_XXHASH
rg -n "CLP_NEED_XXHASH\s*(ON|OFF|TRUE|FALSE)" components/core/CMakeLists.txt
echo "---"
# Look for option definitions
rg -n "option.*CLP_NEED" components/core/CMakeLists.txt
echo "---"
# Check the filter target linking more closely
sed -n '20,40p' components/core/src/clp_s/filter/CMakeLists.txtRepository: y-scope/clp
Length of output: 509
🏁 Script executed:
#!/bin/bash
# Search for how CLP_NEED_XXHASH gets set - look for conditionals
rg -B10 "CLP_NEED_XXHASH" components/core/CMakeLists.txt | head -40
echo "---"
# Check if there's any xxxHash static/shared variant handling elsewhere
rg -n "xxHash.*USE_STATIC\|xxHash_USE_STATIC" components/Repository: y-scope/clp
Length of output: 362
🏁 Script executed:
#!/bin/bash
# Search for CLP_NEED_XXHASH definition/set command more broadly
rg -n "CLP_NEED_XXHASH\s*=" components/core/CMakeLists.txt
echo "---"
# Check if it's set via option()
rg -n "option.*XXHASH" components/core/CMakeLists.txt
echo "---"
# Look at the very beginning of the CMakeLists to see option definitions
head -150 components/core/CMakeLists.txt | grep -n "option\|CLP_NEED"Repository: y-scope/clp
Length of output: 310
🏁 Script executed:
#!/bin/bash
# Check the options.cmake file where CLP_NEED variables are defined
cat components/core/cmake/Options/options.cmakeRepository: y-scope/clp
Length of output: 13124
🏁 Script executed:
#!/bin/bash
# Check xxHash's official CMakeLists.txt for build options
# Since xxHash is built from a tarball, check the cmake_unofficial directory mentioned in the task
rg -n "BUILD_STATIC_LIBS|xxHash_BUILD_XXHSUM|option.*BUILD" \
--type cmake 2>/dev/null | head -20
echo "---"
# Verify the pattern used for other libraries with static support
rg -B1 -A1 "Boost_USE_STATIC_LIBS ON\|LibArchive_USE_STATIC_LIBS ON" components/core/CMakeLists.txtRepository: y-scope/clp
Length of output: 341
Add static library variant to xxHash build to support CLP static-link mode.
Line 572 forces -DBUILD_SHARED_LIBS=ON exclusively. Since CLP defaults to CLP_USE_STATIC_LIBS=ON, xxHash must also support static linking. Add -DBUILD_STATIC_LIBS=ON and implement static variant selection in the find_package call, following the pattern used by Boost and LibArchive elsewhere in components/core/CMakeLists.txt.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@taskfiles/deps/main.yaml` around lines 570 - 573, The CMake args currently
force shared-only builds by setting CMAKE_GEN_ARGS with
"-DBUILD_SHARED_LIBS=ON"; modify the xxHash build args in CMAKE_GEN_ARGS to also
set "-DBUILD_STATIC_LIBS=ON" and ensure the project’s
components/core/CMakeLists.txt uses a static-vs-shared selection when calling
find_package(xxhash ...) similar to how Boost and LibArchive are handled (i.e.,
add logic to respect CLP_USE_STATIC_LIBS and pass the STATIC or SHARED hint to
find_package or set xxHash_USE_STATIC_LIBS accordingly) so the build supports
the CLP static-link mode.
gibber9809
left a comment
There was a problem hiding this comment.
Going to go through more later today, but leaving another set of comments on tests + XxHash.{cpp,hpp}.
The updates to cmake scripts look good!
Besides those comments could you:
- Move your
testssubdirectory tofilter/tests-- our convention changed a few months ago to putting tests in atestssubdirectory next to the thing they're testing so that it's easier to find tests related to a given library or feature. - Run clang-tidy on all of your c++ code? I'm finding a lot of details that our clang-tidy config would complain about. If the
taskflow for runningclang-tidyisn't bringing up the issues, you may have to manually run clang-tidy likeclang-tidy -p your/build/path/compile_commands.json components/core/src/clp_s/filter/BloomFilter.cppetc.
|
|
||
| #include <catch2/catch_test_macros.hpp> | ||
|
|
||
| #include "../src/clp_s/filter/BloomFilter.hpp" |
There was a problem hiding this comment.
| #include "../src/clp_s/filter/BloomFilter.hpp" | |
| #include <clp_s/filter/BloomFilter.hpp> |
Since this test is linking against the clp_s::filter library.
| constexpr size_t kInsertions = 10'000; | ||
| constexpr size_t kQueries = 10'000; | ||
| constexpr double kFalsePositiveRate = 0.001; |
There was a problem hiding this comment.
| constexpr size_t kInsertions = 10'000; | |
| constexpr size_t kQueries = 10'000; | |
| constexpr double kFalsePositiveRate = 0.001; | |
| constexpr size_t cInsertions{10'000}; | |
| constexpr size_t cQueries{10'000}; | |
| constexpr double cFalsePositiveRate{0.001}; |
Nit: style.
| TEST_CASE("BloomFilter empty filter has no matches", "[clp_s][filter]") { | ||
| clp_s::filter::BloomFilter filter{0, kFalsePositiveRate}; | ||
|
|
||
| for (size_t i = 0; i < kQueries; ++i) { | ||
| auto const value = make_odd(i); | ||
| REQUIRE_FALSE(filter.possibly_contains(std::to_string(value))); | ||
| } | ||
| } |
There was a problem hiding this comment.
Do you think this test is required? It doesn't seem to actually test anything besides that our bit vector is initialized to zero, which the other test would catch anyway.
| return value * 2; | ||
| } | ||
| } // namespace | ||
|
|
There was a problem hiding this comment.
Do you think it's worth adding a test that runs xxhash on a handful of inputs and checks that the hash value is what we expect? Just to help avoid a situation where someone somehow links this library against a pre 0.8.0 version of xxhash and gets different hash results.
There was a problem hiding this comment.
I was thinking before that make some pre loaded input and output for testing but that will lead to a large number of loc - not a large file but just loc. Or do you think a very small set (like 10 sparse cases) for senity check is enough?
There was a problem hiding this comment.
Yeah, 5-10 cases as a sanity check is enough I think.
| #include <string> | ||
| #include <vector> | ||
|
|
||
| #include <catch2/catch_test_macros.hpp> |
There was a problem hiding this comment.
| #include <catch2/catch_test_macros.hpp> | |
| #include <catch2/catch_message.hpp> | |
| #include <catch2/catch_test_macros.hpp> |
Since you're also using INFO, etc
| filter.add(inserted.back()); | ||
| } | ||
|
|
||
| size_t false_negatives = 0; |
There was a problem hiding this comment.
| size_t false_negatives = 0; | |
| size_t false_negatives{}; |
| uint64_t make_odd(uint64_t value) { | ||
| return value * 2 + 1; | ||
| } | ||
|
|
||
| uint64_t make_even(uint64_t value) { | ||
| return value * 2; | ||
| } |
There was a problem hiding this comment.
| uint64_t make_odd(uint64_t value) { | |
| return value * 2 + 1; | |
| } | |
| uint64_t make_even(uint64_t value) { | |
| return value * 2; | |
| } | |
| [[nodiscard]] auto make_odd(uint64_t value) -> uint64_t { | |
| return value * 2 + 1; | |
| } | |
| [[nodiscard]] auto make_even(uint64_t value) -> uint64_t { | |
| return value * 2; | |
| } |
Nit: our style guide requires trailing return syntax.
| #include <string_view> | ||
|
|
||
| namespace clp_s::filter::xxhash { | ||
| [[nodiscard]] uint64_t hash64(std::string_view value, uint64_t seed); |
There was a problem hiding this comment.
| [[nodiscard]] uint64_t hash64(std::string_view value, uint64_t seed); | |
| /** | |
| * @param value | |
| * @param seed | |
| * @return The 64-bit seeded XXH3 hash of `value`. | |
| */ | |
| [[nodiscard]] auto hash64(std::string_view value, uint64_t seed) -> uint64_t ; |
Nit: trailing return syntax.
Nit: missing docstring.
There was a problem hiding this comment.
Also see some other missing docstrings, will add all.
| @@ -0,0 +1,9 @@ | |||
| #include "XxHash.hpp" | |||
|
|
|||
| #include <xxhash.h> | |||
There was a problem hiding this comment.
| #include <xxhash.h> | |
| #include <cstdint> | |
| #include <string_view> | |
| #include <xxhash.h> |
Nit: IWYU -- need to include headers for all symbols used in a file.
| #include <xxhash.h> | ||
|
|
||
| namespace clp_s::filter::xxhash { | ||
| uint64_t hash64(std::string_view value, uint64_t seed) { |
There was a problem hiding this comment.
| uint64_t hash64(std::string_view value, uint64_t seed) { | |
| auto hash64(std::string_view value, uint64_t seed) -> uint64_t { |
Nit: trailing return syntax
Thanks for the review. Sorry that forgot to look into the test files and change the styling as well. Also forgot to run clang-tidy again after last commit, will fix it in the next commit. |
gibber9809
left a comment
There was a problem hiding this comment.
Added some more comments about BloomFilter.{cpp,hpp} focused mostly on the design of the interface.
Ignoring style stuff in this section of the code for now since it's being worked on.
| #include <clp_s/FileWriter.hpp> | ||
|
|
||
| namespace clp { | ||
| class ReaderInterface; | ||
| } // namespace clp |
There was a problem hiding this comment.
For the write_to_file interface I think it might make sense to rely on clp::WriterInterface & to make it more generic -- that way if we decide to e.g. compress this bloom filter using zstd or something we don't need to change any code here.
Also, for these includes it's fine to just include <clp/ReaderInterface.hpp> and <clp/WriterInterface.hpp> here -- no need for this forward declaration of ReaderInterface to avoid pulling in that header.
| if (false == std::isfinite(ideal_bit_array_size) | ||
| || ideal_bit_array_size > static_cast<double>(std::numeric_limits<size_t>::max())) | ||
| { | ||
| return {cDefaultBitArraySize, cDefaultNumHashFunctions}; |
There was a problem hiding this comment.
It seems like this and the other default return above are hiding errors for a few kinds of invalid inputs.
I think we're better off just forwarding the error in this kind of situation instead of silently falling back to a default. The easiest way to do that might be to use result types for returns here like we've started using in other parts of the codebase, e.g. ystdlib::error_handling::Result<std::pair<size_t, uint32_t>>.
You can take a look at clp_s/timestamp_parser/TimestampParser.{cpp,hpp} to see a somewhat recent example of how we use result types in the codebase.
There was a problem hiding this comment.
Currently I treat expected_num_elements == 0 as valid input.
In the variable-dictionary path this should be impossible, but for the Bloom filter library itself it is still a valid state (empty set), so the implementation returns a default empty filter instead of an error.
Please let me know if you’d prefer this to be treated as an error instead.
|
|
||
| size_t m_bit_array_size{0}; | ||
| uint32_t m_num_hash_functions{0}; | ||
| std::vector<uint8_t> m_bit_array; |
There was a problem hiding this comment.
We have a class ystdlib::containers::Array<T> to represent fixed sized arrays with size configured at construction time that might be a better fit than an std::vector here.
| [[nodiscard]] bool possibly_contains(std::string_view value) const; | ||
|
|
||
| void write_to_file(FileWriter& writer) const; | ||
| [[nodiscard]] bool read_from_file(clp::ReaderInterface& reader); |
There was a problem hiding this comment.
I think this might make more sense as a static function that returns ystdlib::error_handling::Result<BloomFilter> -- kind of matching the create() pattern used in other parts of the codebase. Mainly my reasoning is that it avoids having to default initialize a BloomFilter first before attempting to initialize it from some ReaderInterface, avoiding some unnecessary allocations + cleans up the error handling since it makes it so that if you're able to create the bloom filter it is guaranteed to be valid.
Does that make sense, or do you think the current approach is better?
There was a problem hiding this comment.
I feel this is better, also replied in the default constructor comment that the default constructor is not needed with this approach.
| public: | ||
| BloomFilter(size_t expected_num_elements, double false_positive_rate); | ||
|
|
||
| BloomFilter(); |
There was a problem hiding this comment.
What's the use case for the default constructor?
There was a problem hiding this comment.
The default constructor was only needed in the old flow where we default-constructed a BloomFilter as a placeholder and then populated.
With the new Result-based API (create / try_read_from_file) and std::optional at call sites, we no longer need placeholder construction, I have removed the default constructor.
| return value * 2; | ||
| } | ||
| } // namespace | ||
|
|
There was a problem hiding this comment.
Yeah, 5-10 cases as a sanity check is enough I think.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/core/CMakeLists.txt`:
- Around line 337-338: The CMake find_package call must require exactly xxHash
0.8.3 to avoid incompatible hash outputs: update the conditional that checks
CLP_NEED_XXHASH so the find_package invocation uses the EXACT keyword (i.e.,
change the find_package(xxHash 0.8.3 CONFIG REQUIRED) line to require the exact
version) — this ensures the hard-coded vectors in test-clp_s-xxhash.cpp (which
rely on XXH3_64bits_withSeed from v0.8.3) are validated against the exact
expected library version.
In `@components/core/src/clp_s/filter/ErrorCode.cpp`:
- Around line 19-29: The switch over ErrorCodeEnum in ErrorCode.cpp currently
contains a default case which masks newly added enum values; remove the default
branch inside the switch so the compiler will warn about unhandled ErrorCodeEnum
cases, and move the fallback return (e.g. returning "Unknown error code enum" or
calling an unreachable/assert helper) to after the switch on error_enum so
missing enum cases fail loudly at compile time; update any function that
performs this mapping (the switch on error_enum in ErrorCode.cpp) accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 47ec4743-ce66-49ac-9c6d-0097d0cb1f66
📒 Files selected for processing (17)
components/core/CMakeLists.txtcomponents/core/src/clp_s/filter/BloomFilter.cppcomponents/core/src/clp_s/filter/BloomFilter.hppcomponents/core/src/clp_s/filter/CMakeLists.txtcomponents/core/src/clp_s/filter/ErrorCode.cppcomponents/core/src/clp_s/filter/ErrorCode.hppcomponents/core/src/clp_s/filter/FilterBuilder.cppcomponents/core/src/clp_s/filter/FilterBuilder.hppcomponents/core/src/clp_s/filter/FilterConfig.hppcomponents/core/src/clp_s/filter/FilterFile.cppcomponents/core/src/clp_s/filter/FilterFile.hppcomponents/core/src/clp_s/filter/FilterReader.cppcomponents/core/src/clp_s/filter/FilterReader.hppcomponents/core/src/clp_s/filter/XxHash.cppcomponents/core/src/clp_s/filter/XxHash.hppcomponents/core/src/clp_s/filter/tests/test-clp_s-bloom_filter.cppcomponents/core/src/clp_s/filter/tests/test-clp_s-xxhash.cpp
92b6657 to
5d587a3
Compare
gibber9809
left a comment
There was a problem hiding this comment.
Could you re-run clang-tidy on BloomFilter.cpp and BloomFilter.hpp? I still see some clang-tidy errors when I pull your branch and run it on those files.
For the error about reinterpret_cast I think you can use clp::size_checked_pointer_cast instead (from clp/type_utils.hpp).
I'm still taking a look at whether we need to modify how we pull in xxhash in the taskfile deps flow -- I'll get back to you on that once I've figured it out.
| target_link_libraries( | ||
| clp_s_filter | ||
| PUBLIC | ||
| clp_s::io |
There was a problem hiding this comment.
| clp_s::io | |
| clp_s::clp_dependencies |
Since this is now using clp::ReaderInterface and clp::WriterInterface I think we don't need to depend on clp_s::io anymore, but still need the clp-related dependencies.
| endfunction() | ||
|
|
||
| function(validate_clp_s_filter_dependencies) | ||
| validate_clp_dependencies_for_target(CLP_BUILD_CLP_S_FILTER) |
There was a problem hiding this comment.
| validate_clp_dependencies_for_target(CLP_BUILD_CLP_S_FILTER) | |
| validate_clp_dependencies_for_target(CLP_BUILD_CLP_S_FILTER | |
| CLP_BUILD_CLP_S_CLP_DEPENDENCIES | |
| ) |
Same as my other comment about depending on clp_s_clp_dependencies + I missed that this wasn't checking for deps previously.
Sorry that I misunderstood how clang-tidy works previously, now should all be fixed :) |
gibber9809
left a comment
There was a problem hiding this comment.
Getting pretty close on this PR, just some small revisions.
| endif() | ||
|
|
||
| if(CLP_NEED_XXHASH) | ||
| find_package(xxHash 0.8.3 EXACT CONFIG REQUIRED) |
There was a problem hiding this comment.
| find_package(xxHash 0.8.3 EXACT CONFIG REQUIRED) | |
| find_package(xxHash 0.8.3 CONFIG REQUIRED) |
Removing EXACT per sidebar.
| : m_bit_array_size(bit_array_size), | ||
| m_num_hash_functions(num_hash_functions), | ||
| m_bit_array(std::move(bit_array)) {} |
There was a problem hiding this comment.
| : m_bit_array_size(bit_array_size), | |
| m_num_hash_functions(num_hash_functions), | |
| m_bit_array(std::move(bit_array)) {} | |
| : m_bit_array_size{bit_array_size}, | |
| m_num_hash_functions{num_hash_functions}, | |
| m_bit_array{std::move(bit_array)} {} |
Nit: Use {} for initialization
| auto const optimal_parameters | ||
| = compute_optimal_parameters(expected_num_elements, false_positive_rate); | ||
| if (optimal_parameters.has_error()) { | ||
| return optimal_parameters.error(); | ||
| } |
There was a problem hiding this comment.
| auto const optimal_parameters | |
| = compute_optimal_parameters(expected_num_elements, false_positive_rate); | |
| if (optimal_parameters.has_error()) { | |
| return optimal_parameters.error(); | |
| } | |
| auto const optimal_parameters{YSTDLIB_ERROR_HANDLING_TRYX(compute_optimal_parameters(expected_num_elements, false_positive_rate)}; |
You can use YSTDLIB_ERROR_HANDLING_TRYX for results containing a value and YSTDLIB_ERROR_HANDLING_TRYV for void results as nice shorthands for forwarding an error if it exists, and unwrapping the result otherwise.
With this diff you can then use optimal_parameters directly, since it represents the unwrapped value, instead of optimal_parameters.value().
| uint32_t const capped_num_hash_functions | ||
| = std::clamp(num_hash_functions, cMinNumHashFunctions, cMaxNumHashFunctions); |
There was a problem hiding this comment.
| uint32_t const capped_num_hash_functions | |
| = std::clamp(num_hash_functions, cMinNumHashFunctions, cMaxNumHashFunctions); | |
| uint32_t const capped_num_hash_functions{ | |
| std::clamp(num_hash_functions, cMinNumHashFunctions, cMaxNumHashFunctions)}; |
Nit: we require {} initialization with the make exception being auto const xyz =
| } | ||
|
|
||
| auto BloomFilter::write_to_file(clp::WriterInterface& writer) const -> void { | ||
| writer.write_numeric_value<uint32_t>(m_num_hash_functions); |
There was a problem hiding this comment.
Sorry I didn't think about this earlier, but do you think it makes sense to include an extra byte in this format indicating what hash algorithm was used for this bloom filter (e.g., by defining an enum for hash functions, with 0x00 representing xxhash)? That way if we ever change the hash algorithm in the future it's easy to keep things backwards compatible.
There was a problem hiding this comment.
This function writes payload of a bloom filter. There will be other metadata bytes written in the FilterFile (which does not belong to this PR), including the type of filter.
My current implementation (out of scope of this PR again) treats bloom filters using different hash algorithms as different "filter type", unless we have a bunch of hash algorithms to use then it will make a lot of combinations. I feel it would be not very much difference comparing to your suggestion.
But yeah definitely I need to integrate a mapper for hash algorithms in bloomfilter.cpp , do you recommand add in this PR or next
There was a problem hiding this comment.
Thanks for the suggestion — after thinking it through again, I agree to add a hash algorithm ID in the BloomFilter payload - it makes mapping easier than my previous approach.
I’ve already implemented a small HashAlgorithm mapper (ID parsing + hash64 dispatch), plus the corresponding BloomFilter.cpp/hpp updates. In total it’s ~80 LOC, but most of that is headers/guards/includes/docstrings; the core logic is quite small and self-contained.
I haven’t pushed it yet to keep scope tight, but I can include it in this PR if you’re okay with that.
Please let me know and I’ll push it. :)
| return ErrorCode{ErrorCodeEnum::ReadFailure}; | ||
| } | ||
| } | ||
| return BloomFilter(bit_array_size, num_hash_functions, std::move(bit_array)); |
There was a problem hiding this comment.
| return BloomFilter(bit_array_size, num_hash_functions, std::move(bit_array)); | |
| return BloomFilter{bit_array_size, num_hash_functions, std::move(bit_array)}; |
| [[nodiscard]] auto possibly_contains(std::string_view value) const -> bool; | ||
|
|
||
| /** | ||
| * Writes Bloom filter payload fields to a writer. |
There was a problem hiding this comment.
| * Writes Bloom filter payload fields to a writer. | |
| * Writes a serialized representation of a bloom filter to a writer. |
I think this wording is slightly clearer.
| * Reads Bloom filter payload fields from a reader. | ||
| * @param reader | ||
| * @return A result containing a parsed BloomFilter on success, or an error code indicating |
There was a problem hiding this comment.
| * Reads Bloom filter payload fields from a reader. | |
| * @param reader | |
| * @return A result containing a parsed BloomFilter on success, or an error code indicating | |
| * Reads a serialized representation of a bloom filter from a reader. | |
| * @param reader | |
| * @return A result containing an initialized BloomFilter on success, or an error code indicating |
I think this wording is slightly clearer.
| vars: | ||
| CMAKE_GEN_ARGS: | ||
| - "-DXXHASH_BUILD_XXHSUM=OFF" | ||
| - "-DBUILD_SHARED_LIBS=ON" |
There was a problem hiding this comment.
| - "-DBUILD_SHARED_LIBS=ON" | |
| - "-DBUILD_SHARED_LIBS=OFF" |
After further thought I think it's just simpler to only build + link the static version of xxhash. Main justification being that if we wanted to build a static + shared version I think we'd need to build them as separate install targets in separate directories + write a Findxyz.cmake script like we do for LZMA, which adds some maintenance burden and makes it harder to support things like using an xxhash library installed at the system level.
94484f1 to
88a0210
Compare
gibber9809
left a comment
There was a problem hiding this comment.
LGTM! Thanks for going through all the revisions.
Before we merge, could we:
- Change the PR title to:
feat(clp_s::filter): Add library for representing bloom filters; Add xxHash v0.8.3 as a dependency.
- Update the PR description to reference that this is one of the PRs for #2090 -- I can see that you manually reference this PR in that issue, but referencing the issue from the PR makes github automatically create a link to the PR from the issue, which makes it easy to track related PRs.
LinZhihao-723
left a comment
There was a problem hiding this comment.
A bit more style-wise catches and tips/suggestions.
Otherwise, we should be good to go.
| CLP_UTFCPP_INCLUDE_DIRECTORY \"{{.UTFCPP_OUTPUT_DIR}}\" | ||
| )" > "{{.G_DEPS_CPP_CMAKE_SETTINGS_DIR}}/{{.LIB_NAME}}.cmake" | ||
|
|
||
| xxhash: |
| if (HashAlgorithm::Xxh364 == hash_algorithm) { | ||
| return xxhash::hash64(value, seed); | ||
| } | ||
|
|
There was a problem hiding this comment.
| if (HashAlgorithm::Xxh364 == hash_algorithm) { | |
| return xxhash::hash64(value, seed); | |
| } | |
| switch (hash_algorithm) { | |
| case HashAlgorithm::Xxh364: | |
| return xxhash::hash64(value, seed); | |
| default: | |
| break; | |
| } |
Since we're using enum, I think it's cleaner if we use a switch statement.
In terms of the default case, I think we should have sth equivalent to unreachable in Rust. But this is out of the scope for this PR, let's just keep it as is.
There was a problem hiding this comment.
Thanks — I also applied the same switch style to try_parse_hash_algorithm for consistency. However, the case static_cast<uint8_t>(HashAlgorithm::Xxh364) looks a little awkward to me. Please let me know if you’d prefer a different style there.
There was a problem hiding this comment.
Just curious: what's the real size of the bitmap in practice? Is it going to be mostly the default (64 * 8)?
If the size can become big, shall we try/benchmark this compressed bitmap? It also supports serde so writing to or reading from a file shouldn't be a problem.
There was a problem hiding this comment.
I ran a simple benchmark comparing a RoaringBitmap-backed Bloom filter against the current conventional bitmap implementation. For this workload, RoaringBitmap used slightly more space due to its container/metadata overhead, and it was also slower for both build and query.
The main reason is that Bloom filters are configured near their optimal parameters, which gives a bit density around 50%. Combined with the fact that the set bits are distributed by xxhash, the bitmap is relatively dense and random-looking, with very little local structure or long runs. That makes it a poor fit for RoaringBitmap-style compression.
I think this matches the case described in the RoaringBitmap README where a conventional bitset is preferred:
You have dense set of random values that never form runs of continuous values. For example, consider the set 0,2,4,...,10000. If this is typical of your application, you might be better served with a conventional bitset (e.g., Java's BitSet class).
There was a problem hiding this comment.
Make sense. Thanks for your investigation!
Thanks for your review! Edit: |
LinZhihao-723
left a comment
There was a problem hiding this comment.
Just found one more nit pick, otherwise we're good
| switch (hash_algorithm) { | ||
| case static_cast<uint8_t>(HashAlgorithm::Xxh364): | ||
| return HashAlgorithm::Xxh364; | ||
| default: | ||
| return std::nullopt; | ||
| } |
There was a problem hiding this comment.
As a reply to this comment, I don't think we need to use switch here (as u suggested, it just looks weird, lol). Essentially this is because we're working on the underlying integer of the enum value instead of the enum value itself. I think it doesn't make too much sense if we're not working directly on the enum type when using switch.
I'm fine if u'd like to rollback or just leave it as is.
Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
Description
This is the first PR for the issue #2090
This PR is intentionally narrowed to a standalone
clp_sBloom filter library + xxHash integration + unit tests.Main changes:
BloomFilterimplementation undercomponents/core/src/clp_s/filter/:BloomFilter.cpp/.hppErrorCode.cpp/.hppXxHash.cpp/.hppxxHash(v0.8.3) intaskfiles/deps/main.yaml.clp_s::filter:components/core/src/clp_s/filter/CMakeLists.txtcomponents/core/src/clp_s/CMakeLists.txtcomponents/core/CMakeLists.txtcomponents/core/cmake/Options/options.cmakecomponents/core/src/clp_s/filter/tests/:test-clp_s-bloom_filter.cpptest-clp_s-xxhash.cppXXH3_64bits_withSeed,v0.8.3).Why
xxHash:xxHashis fast and suitable for this workload.Out of scope in this PR:
Checklist
breaking change.
Validation performed
components/core/src/clp_s/filter/tests/test-clp_s-bloom_filter.cppcomponents/core/src/clp_s/filter/tests/test-clp_s-xxhash.cppclp_s::filterandxxHashintegration.