Source
ClosestSubstring
Source model issue: #1033
Target
ILP/i32
Motivation
This rule gives ClosestSubstring immediate solver connectivity through the existing integer-linear-programming hub.
The formulation chooses both the center string and one window from each input string, then minimizes a radius variable. It uses window-choice indicators to activate exactly the Hamming-distance constraint for the selected window of each string.
Reference
Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150
Reduction Algorithm
Let the source instance have:
- alphabet size
q,
- input strings
s_1, ..., s_n,
- substring length
ell,
W_i = |s_i| - ell + 1 possible windows in string s_i.
Construct an ILP/i32 instance as follows.
-
For every center position r in {0,...,ell-1} and alphabet symbol a in {0,...,q-1}, create an integer variable x_{r,a}.
It means the center has symbol a at position r.
-
For every input string s_i and window start p in {0,...,W_i-1}, create an integer variable y_{i,p}.
It means window p is selected from string s_i.
-
Create one nonnegative integer radius variable R.
-
For each center position r, add the assignment constraint:
sum_a x_{r,a} = 1.
-
For each input string s_i, add the window-choice constraint:
sum_p y_{i,p} = 1.
-
For every input string s_i and every window start p, add the conditional radius constraint:
R + sum_{r=0}^{ell-1} x_{r, s_i[p+r]} - ell * y_{i,p} >= 0.
If y_{i,p} = 1, this becomes:
R + number_of_matches(c, s_i[p..p+ell)) >= ell,
equivalently R >= d_H(c, s_i[p..p+ell)).
If y_{i,p} = 0, the constraint is automatically satisfied because R >= 0 and the match count is nonnegative.
-
Minimize R.
-
Recover the source solution by reading:
- the center symbol at each position
r from the unique a with x_{r,a} = 1,
- the selected window in each string
s_i from the unique p with y_{i,p} = 1.
Correctness Argument
The center assignment constraints encode exactly one center string c in Sigma^ell. The window-choice constraints encode exactly one length-ell substring from each input string.
For a selected window p in string s_i, the expression sum_r x_{r, s_i[p+r]} counts the number of positions where the center agrees with that window. Thus ell - sum_r x_{r, s_i[p+r]} is exactly the Hamming distance between the center and the selected window.
The conditional radius constraint is active exactly for the selected window of each input string, and it forces R to be at least that selected Hamming distance. For unselected windows, the constraint is redundant.
Therefore minimizing R minimizes the maximum Hamming distance between the center and one chosen window from each input string. This is exactly the ClosestSubstring objective.
Size Overhead
Let:
q = alphabet_size,
ell = substring_length,
n = num_strings,
W = total_num_windows = sum_i (|s_i| - ell + 1).
| Target metric |
Formula |
num_vars |
q * ell + W + 1 |
num_constraints |
ell + n + W |
Validation Method
Use closed-loop validation on small instances:
- enumerate all center strings and all window choices directly,
- solve the ILP instance,
- extract the center string and selected windows,
- verify that the extracted solution has the same optimum radius.
Include examples where:
- the optimum radius is nonzero,
- the best window is not the first window in at least one string,
- no common exact substring exists,
- multiple window choices tie for the same center.
Example
Use the source instance from #1033:
- alphabet
{0,1},
- substring length
ell = 3,
- strings
00011, 10100, 11001.
The center 010 is encoded by:
x_{0,0} = 1,
x_{1,1} = 1,
x_{2,0} = 1.
Choose windows:
- from
00011, choose start 0, window 000,
- from
10100, choose start 1, window 010,
- from
11001, choose start 0, window 110.
For string 00011 and selected start 0, the active radius constraint is:
R + x_{0,0} + x_{1,0} + x_{2,0} - 3*y_{1,0} >= 0.
Substituting the chosen center and y_{1,0}=1 gives:
R + 1 + 0 + 1 - 3 >= 0,
so R >= 1.
For string 10100 and selected start 1, the window is exactly 010, so the active constraint allows R >= 0.
For string 11001 and selected start 0, the active constraint forces R >= 1.
Thus R = 1 is feasible. Since the three sets of length-3 windows have empty intersection, radius 0 is impossible. Therefore the ILP optimum is 1, matching the ClosestSubstring optimum.
BibTeX
@article{LiMaWang2002ClosestStringSubstring,
author = {Ming Li and Bin Ma and Lusheng Wang},
title = {On the Closest String and Substring Problems},
journal = {Journal of the ACM},
volume = {49},
number = {2},
pages = {157--171},
year = {2002},
doi = {10.1145/506147.506150},
url = {https://doi.org/10.1145/506147.506150}
}
Source
ClosestSubstring
Source model issue: #1033
Target
ILP/i32
Motivation
This rule gives
ClosestSubstringimmediate solver connectivity through the existing integer-linear-programming hub.The formulation chooses both the center string and one window from each input string, then minimizes a radius variable. It uses window-choice indicators to activate exactly the Hamming-distance constraint for the selected window of each string.
Reference
Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150
Reduction Algorithm
Let the source instance have:
q,s_1, ..., s_n,ell,W_i = |s_i| - ell + 1possible windows in strings_i.Construct an
ILP/i32instance as follows.For every center position
r in {0,...,ell-1}and alphabet symbola in {0,...,q-1}, create an integer variablex_{r,a}.It means the center has symbol
aat positionr.For every input string
s_iand window startp in {0,...,W_i-1}, create an integer variabley_{i,p}.It means window
pis selected from strings_i.Create one nonnegative integer radius variable
R.For each center position
r, add the assignment constraint:sum_a x_{r,a} = 1.For each input string
s_i, add the window-choice constraint:sum_p y_{i,p} = 1.For every input string
s_iand every window startp, add the conditional radius constraint:R + sum_{r=0}^{ell-1} x_{r, s_i[p+r]} - ell * y_{i,p} >= 0.If
y_{i,p} = 1, this becomes:R + number_of_matches(c, s_i[p..p+ell)) >= ell,equivalently
R >= d_H(c, s_i[p..p+ell)).If
y_{i,p} = 0, the constraint is automatically satisfied becauseR >= 0and the match count is nonnegative.Minimize
R.Recover the source solution by reading:
rfrom the uniqueawithx_{r,a} = 1,s_ifrom the uniquepwithy_{i,p} = 1.Correctness Argument
The center assignment constraints encode exactly one center string
c in Sigma^ell. The window-choice constraints encode exactly one length-ellsubstring from each input string.For a selected window
pin strings_i, the expressionsum_r x_{r, s_i[p+r]}counts the number of positions where the center agrees with that window. Thusell - sum_r x_{r, s_i[p+r]}is exactly the Hamming distance between the center and the selected window.The conditional radius constraint is active exactly for the selected window of each input string, and it forces
Rto be at least that selected Hamming distance. For unselected windows, the constraint is redundant.Therefore minimizing
Rminimizes the maximum Hamming distance between the center and one chosen window from each input string. This is exactly the ClosestSubstring objective.Size Overhead
Let:
q = alphabet_size,ell = substring_length,n = num_strings,W = total_num_windows = sum_i (|s_i| - ell + 1).num_varsq * ell + W + 1num_constraintsell + n + WValidation Method
Use closed-loop validation on small instances:
Include examples where:
Example
Use the source instance from #1033:
{0,1},ell = 3,00011,10100,11001.The center
010is encoded by:x_{0,0} = 1,x_{1,1} = 1,x_{2,0} = 1.Choose windows:
00011, choose start0, window000,10100, choose start1, window010,11001, choose start0, window110.For string
00011and selected start0, the active radius constraint is:R + x_{0,0} + x_{1,0} + x_{2,0} - 3*y_{1,0} >= 0.Substituting the chosen center and
y_{1,0}=1gives:R + 1 + 0 + 1 - 3 >= 0,so
R >= 1.For string
10100and selected start1, the window is exactly010, so the active constraint allowsR >= 0.For string
11001and selected start0, the active constraint forcesR >= 1.Thus
R = 1is feasible. Since the three sets of length-3 windows have empty intersection, radius0is impossible. Therefore the ILP optimum is1, matching the ClosestSubstring optimum.BibTeX