-
Notifications
You must be signed in to change notification settings - Fork 111
Description
Describe the bug
This issue may be related to the previously closed issue #268 . The following code is tested against the latest main branch.
Typically, we use find_all() to get the positions where substrings appear in the main string. Therefore, I would use sz::find_all() as follows:
namespace sz = ashvardanian::stringzilla;
void test_1()
{
sz::string_view m ("hello world, hello cpp");
sz::string_view n ("hello");
auto r = sz::find_all(m, n);
for (auto i : r)
{
auto start = std::distance(m.cbegin(), i.cbegin());
std::cout << "[" << start << ", " << start + i.size() << ")\n";
}
}The output will be:
[0, 5)
[13, 18)
However, the same operation applied to sz::string will produce chaotic results:
void test_2()
{
// Now, both m and n are sz::string
sz::string m ("hello world, hello cpp");
sz::string n ("hello");
auto r = sz::find_all(m, n);
for (auto i : r)
{
auto start = std::distance(m.cbegin(), i.cbegin());
std::cout << "[" << start << ", " << start + i.size() << ")\n";
}
}Possible output:
[-64, 18446744073709551557)
[-51, 18446744073709551570)
The issue arises because the return type of find_all (range_matches<string_type_>) is bound to the input parameter type. When string_type_ is sz_string, it does not conform to the documentation of range_matches, which states:
... similar to a pair of boost::algorithm::find_iterator
If we forcibly use the range_matches<string_view> to receive the result of find_all, everything works correctly:
void test_3()
{
sz::string m ("hello world, hello cpp");
sz::string n ("hello");
auto r = sz::find_all<sz::string_view>(m, n);
for (auto i : r)
{
auto start = std::distance(m.cbegin(), i.cbegin());
std::cout << "[" << start << ", " << start + i.size() << ")\n";
}
}The above code will produce the correct result.
Additionally, using the standard library's std::string_view with sz::find_all() naturally yields correct results, so we skip that.
What about using std::string? I found that using the C++ standard library's std::string to call sz::find_all() ** fails ** to compile:
// Note: switched to C++ std::string
std::string m ("hello world, hello cpp");
std::string n ("hello");
auto r = sz::find_all(m,n); // Compilation failsI believe this compilation failure is user-friendly. As long as users understand the lifecycle considerations related to stringzilla, they can quickly use std::string to write:
void test_4()
{
std::string m ("hello world, hello cpp");
std::string n ("hello");
// auto r = sz::find_all(m, n);
auto r = sz::find_all(sz::string_view(m), sz::string_view(n));
for (auto i : r)
{
auto start = std::distance(m.cbegin().base(), i.cbegin());
std::cout << "[" << start << ", " << start + i.size() << ")\n";
}
}This also produces the correct result (note that in std::distance(), we need to call .base()).
Based on the different compilation results when passing std::string or std::string_view to sz::find_all(), my suggestions are:
Option 1: Explicitly prevent find_all() from being used with sz::string objects, forcing users to use sz::string_view (or std::string_view). This would make it easier for users to notice the lifecycle of string_view.
Option 2: Fix range_matches<T1, T2> to always be range_matches<sz::string_view, sz::matcher_find<sz::string_view, ...>>.
Personally, I lean toward Option 1.
Steps to reproduce
Expected behavior
StringZilla version
main
Operating System
Windows 11
Hardware architecture
x86
Which interface are you using?
C++ bindings
Contact Details
Are you open to being tagged as a contributor?
- I am open to being mentioned in the project
.githistory as a contributor
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct