Skip to content

Bug: result of find_all() cannot obtain the position of the substring within the parent string #278

@zhuangyan-stone

Description

@zhuangyan-stone

Describe the bug

This issue may be related to the previously closed issue #268 . The following code is tested against the latest main branch.

Typically, we use find_all() to get the positions where substrings appear in the main string. Therefore, I would use sz::find_all() as follows:

namespace sz = ashvardanian::stringzilla;

void test_1()
{
    sz::string_view m ("hello world, hello cpp");
    sz::string_view n ("hello");

    auto r = sz::find_all(m, n);

    for (auto i : r)
    {
        auto start = std::distance(m.cbegin(), i.cbegin());
        std::cout << "[" << start << ", " << start + i.size() << ")\n";
    }
}

The output will be:

[0, 5)
[13, 18)

However, the same operation applied to sz::string will produce chaotic results:

void test_2()
{
    // Now, both m and n are sz::string 
    sz::string m ("hello world, hello cpp"); 
    sz::string n ("hello");

    auto r = sz::find_all(m, n);

    for (auto i : r)
    {
        auto start = std::distance(m.cbegin(), i.cbegin());
        std::cout << "[" << start << ", " << start + i.size() << ")\n";
    }
}

Possible output:

[-64, 18446744073709551557)
[-51, 18446744073709551570)

The issue arises because the return type of find_all (range_matches<string_type_>) is bound to the input parameter type. When string_type_ is sz_string, it does not conform to the documentation of range_matches, which states:

... similar to a pair of boost::algorithm::find_iterator

If we forcibly use the range_matches<string_view> to receive the result of find_all, everything works correctly:

void test_3()
{
    sz::string m ("hello world, hello cpp");
    sz::string n ("hello");

    auto r = sz::find_all<sz::string_view>(m, n);

    for (auto i : r)
    {
        auto start = std::distance(m.cbegin(), i.cbegin());
        std::cout << "[" << start << ", " << start + i.size() << ")\n";
    }
}

The above code will produce the correct result.

Additionally, using the standard library's std::string_view with sz::find_all() naturally yields correct results, so we skip that.

What about using std::string? I found that using the C++ standard library's std::string to call sz::find_all() ** fails ** to compile:

// Note: switched to C++ std::string
std::string m ("hello world, hello cpp"); 
std::string n ("hello");

auto r = sz::find_all(m,n); // Compilation fails

I believe this compilation failure is user-friendly. As long as users understand the lifecycle considerations related to stringzilla, they can quickly use std::string to write:

void test_4()
{
    std::string m ("hello world, hello cpp");
    std::string n ("hello");

    // auto r = sz::find_all(m, n);
    auto r = sz::find_all(sz::string_view(m), sz::string_view(n));

    for (auto i : r)
    {
        auto start = std::distance(m.cbegin().base(), i.cbegin());
        std::cout << "[" << start << ", " << start + i.size() << ")\n";
    }
}

This also produces the correct result (note that in std::distance(), we need to call .base()).

Based on the different compilation results when passing std::string or std::string_view to sz::find_all(), my suggestions are:

Option 1: Explicitly prevent find_all() from being used with sz::string objects, forcing users to use sz::string_view (or std::string_view). This would make it easier for users to notice the lifecycle of string_view.

Option 2: Fix range_matches<T1, T2> to always be range_matches<sz::string_view, sz::matcher_find<sz::string_view, ...>>.

Personally, I lean toward Option 1.

Steps to reproduce

Expected behavior

StringZilla version

main

Operating System

Windows 11

Hardware architecture

x86

Which interface are you using?

C++ bindings

Contact Details

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions