Motivation
ClosestString is a standard string-combinatorics model that is not currently in the catalog.
It is useful because it captures the consensus-string problem under Hamming distance: given several equal-length strings, find one center string that minimizes the worst Hamming distance to the inputs. This problem appears in computational biology, coding theory, and clustering over discrete sequences.
Associated rule:
Definition
Name: ClosestString
Reference: Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150
Let Sigma = {0, ..., q-1} be a finite alphabet. The input is a list of strings
s_1, ..., s_n in Sigma^m
all having the same length m.
A solution is a center string
c in Sigma^m.
The objective is to minimize the maximum Hamming distance from the center to the input strings:
minimize max_i d_H(c, s_i).
So this is a minimization problem.
Variables
Let m be the common string length and q = alphabet_size.
- Count:
m variables
- Per-variable domain:
{0, ..., q-1}
- Meaning: variable
c_j is the symbol chosen for the center string at position j
A configuration is always syntactically feasible if every symbol lies in the alphabet. Its objective value is the maximum Hamming distance to the input strings.
Schema (data type)
Type name: ClosestString
| Field |
Mathematical type |
Description |
alphabet_size |
positive integer |
Size q of the finite alphabet |
strings |
list of equal-length strings over {0,...,q-1} |
Input strings s_1,...,s_n |
Suggested size fields:
alphabet_size
num_strings
string_length
total_length = num_strings * string_length
Input validation should require:
alphabet_size > 0 unless all strings and the center length are zero,
- all input strings have equal length,
- every symbol is
< alphabet_size.
Complexity
The direct exact baseline enumerates all possible center strings and evaluates the maximum Hamming distance.
This gives:
O(alphabet_size^string_length * num_strings * string_length).
The registry complexity expression can use:
alphabet_size ^ string_length.
Extra Remark
This model is distinct from ClosestSubstring. In ClosestString, every input string has the same length as the center, so there is no window-selection decision.
Reduction Rule Crossref
How to solve
Example Instance
Use binary alphabet Sigma = {0,1} and input strings of length 3:
| String |
Value |
s_1 |
000 |
s_2 |
011 |
s_3 |
101 |
s_4 |
110 |
Expected Outcome
An optimal center string is:
c = 000.
Its distances are:
d_H(000, 000) = 0
d_H(000, 011) = 2
d_H(000, 101) = 2
d_H(000, 110) = 2
So the objective value is 2.
No center has radius 1: the four input strings are pairwise arranged so that every binary length-3 string is at distance at least 2 from at least one input string. Therefore the optimum radius is exactly 2.
BibTeX
@article{LiMaWang2002ClosestStringSubstring,
author = {Ming Li and Bin Ma and Lusheng Wang},
title = {On the Closest String and Substring Problems},
journal = {Journal of the ACM},
volume = {49},
number = {2},
pages = {157--171},
year = {2002},
doi = {10.1145/506147.506150},
url = {https://doi.org/10.1145/506147.506150}
}
Motivation
ClosestStringis a standard string-combinatorics model that is not currently in the catalog.It is useful because it captures the consensus-string problem under Hamming distance: given several equal-length strings, find one center string that minimizes the worst Hamming distance to the inputs. This problem appears in computational biology, coding theory, and clustering over discrete sequences.
Associated rule:
Definition
Name:
ClosestStringReference: Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150
Let
Sigma = {0, ..., q-1}be a finite alphabet. The input is a list of stringss_1, ..., s_n in Sigma^mall having the same length
m.A solution is a center string
c in Sigma^m.The objective is to minimize the maximum Hamming distance from the center to the input strings:
minimize max_i d_H(c, s_i).So this is a minimization problem.
Variables
Let
mbe the common string length andq = alphabet_size.mvariables{0, ..., q-1}c_jis the symbol chosen for the center string at positionjA configuration is always syntactically feasible if every symbol lies in the alphabet. Its objective value is the maximum Hamming distance to the input strings.
Schema (data type)
Type name:
ClosestStringalphabet_sizeqof the finite alphabetstrings{0,...,q-1}s_1,...,s_nSuggested size fields:
alphabet_sizenum_stringsstring_lengthtotal_length = num_strings * string_lengthInput validation should require:
alphabet_size > 0unless all strings and the center length are zero,< alphabet_size.Complexity
The direct exact baseline enumerates all possible center strings and evaluates the maximum Hamming distance.
This gives:
O(alphabet_size^string_length * num_strings * string_length).The registry complexity expression can use:
alphabet_size ^ string_length.Extra Remark
This model is distinct from
ClosestSubstring. InClosestString, every input string has the same length as the center, so there is no window-selection decision.Reduction Rule Crossref
How to solve
ILP/i32via [Rule] ClosestString to ILP #1034.Example Instance
Use binary alphabet
Sigma = {0,1}and input strings of length3:s_1000s_2011s_3101s_4110Expected Outcome
An optimal center string is:
c = 000.Its distances are:
d_H(000, 000) = 0d_H(000, 011) = 2d_H(000, 101) = 2d_H(000, 110) = 2So the objective value is
2.No center has radius
1: the four input strings are pairwise arranged so that every binary length-3 string is at distance at least2from at least one input string. Therefore the optimum radius is exactly2.BibTeX