Skip to content

[Rule] ClosestSubstring to ILP #1035

@zhangjieqingithub

Description

@zhangjieqingithub

Source

ClosestSubstring

Source model issue: #1033

Target

ILP/i32

Motivation

This rule gives ClosestSubstring immediate solver connectivity through the existing integer-linear-programming hub.

The formulation chooses both the center string and one window from each input string, then minimizes a radius variable. It uses window-choice indicators to activate exactly the Hamming-distance constraint for the selected window of each string.

Reference

Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150

Reduction Algorithm

Let the source instance have:

  • alphabet size q,
  • input strings s_1, ..., s_n,
  • substring length ell,
  • W_i = |s_i| - ell + 1 possible windows in string s_i.

Construct an ILP/i32 instance as follows.

  1. For every center position r in {0,...,ell-1} and alphabet symbol a in {0,...,q-1}, create an integer variable x_{r,a}.
    It means the center has symbol a at position r.

  2. For every input string s_i and window start p in {0,...,W_i-1}, create an integer variable y_{i,p}.
    It means window p is selected from string s_i.

  3. Create one nonnegative integer radius variable R.

  4. For each center position r, add the assignment constraint:
    sum_a x_{r,a} = 1.

  5. For each input string s_i, add the window-choice constraint:
    sum_p y_{i,p} = 1.

  6. For every input string s_i and every window start p, add the conditional radius constraint:
    R + sum_{r=0}^{ell-1} x_{r, s_i[p+r]} - ell * y_{i,p} >= 0.

    If y_{i,p} = 1, this becomes:
    R + number_of_matches(c, s_i[p..p+ell)) >= ell,
    equivalently R >= d_H(c, s_i[p..p+ell)).

    If y_{i,p} = 0, the constraint is automatically satisfied because R >= 0 and the match count is nonnegative.

  7. Minimize R.

  8. Recover the source solution by reading:

    • the center symbol at each position r from the unique a with x_{r,a} = 1,
    • the selected window in each string s_i from the unique p with y_{i,p} = 1.

Correctness Argument

The center assignment constraints encode exactly one center string c in Sigma^ell. The window-choice constraints encode exactly one length-ell substring from each input string.

For a selected window p in string s_i, the expression sum_r x_{r, s_i[p+r]} counts the number of positions where the center agrees with that window. Thus ell - sum_r x_{r, s_i[p+r]} is exactly the Hamming distance between the center and the selected window.

The conditional radius constraint is active exactly for the selected window of each input string, and it forces R to be at least that selected Hamming distance. For unselected windows, the constraint is redundant.

Therefore minimizing R minimizes the maximum Hamming distance between the center and one chosen window from each input string. This is exactly the ClosestSubstring objective.

Size Overhead

Let:

  • q = alphabet_size,
  • ell = substring_length,
  • n = num_strings,
  • W = total_num_windows = sum_i (|s_i| - ell + 1).
Target metric Formula
num_vars q * ell + W + 1
num_constraints ell + n + W

Validation Method

Use closed-loop validation on small instances:

  • enumerate all center strings and all window choices directly,
  • solve the ILP instance,
  • extract the center string and selected windows,
  • verify that the extracted solution has the same optimum radius.

Include examples where:

  • the optimum radius is nonzero,
  • the best window is not the first window in at least one string,
  • no common exact substring exists,
  • multiple window choices tie for the same center.

Example

Use the source instance from #1033:

  • alphabet {0,1},
  • substring length ell = 3,
  • strings 00011, 10100, 11001.

The center 010 is encoded by:

  • x_{0,0} = 1,
  • x_{1,1} = 1,
  • x_{2,0} = 1.

Choose windows:

  • from 00011, choose start 0, window 000,
  • from 10100, choose start 1, window 010,
  • from 11001, choose start 0, window 110.

For string 00011 and selected start 0, the active radius constraint is:
R + x_{0,0} + x_{1,0} + x_{2,0} - 3*y_{1,0} >= 0.

Substituting the chosen center and y_{1,0}=1 gives:
R + 1 + 0 + 1 - 3 >= 0,
so R >= 1.

For string 10100 and selected start 1, the window is exactly 010, so the active constraint allows R >= 0.

For string 11001 and selected start 0, the active constraint forces R >= 1.

Thus R = 1 is feasible. Since the three sets of length-3 windows have empty intersection, radius 0 is impossible. Therefore the ILP optimum is 1, matching the ClosestSubstring optimum.

BibTeX

@article{LiMaWang2002ClosestStringSubstring,
  author  = {Ming Li and Bin Ma and Lusheng Wang},
  title   = {On the Closest String and Substring Problems},
  journal = {Journal of the ACM},
  volume  = {49},
  number  = {2},
  pages   = {157--171},
  year    = {2002},
  doi     = {10.1145/506147.506150},
  url     = {https://doi.org/10.1145/506147.506150}
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    ruleA new reduction rule to be added.

    Type

    No type

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions