Skip to content

[Model] ClosestString #1032

@zhangjieqingithub

Description

@zhangjieqingithub

Motivation

ClosestString is a standard string-combinatorics model that is not currently in the catalog.

It is useful because it captures the consensus-string problem under Hamming distance: given several equal-length strings, find one center string that minimizes the worst Hamming distance to the inputs. This problem appears in computational biology, coding theory, and clustering over discrete sequences.

Associated rule:

Definition

Name: ClosestString
Reference: Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150

Let Sigma = {0, ..., q-1} be a finite alphabet. The input is a list of strings
s_1, ..., s_n in Sigma^m
all having the same length m.

A solution is a center string
c in Sigma^m.

The objective is to minimize the maximum Hamming distance from the center to the input strings:

minimize max_i d_H(c, s_i).

So this is a minimization problem.

Variables

Let m be the common string length and q = alphabet_size.

  • Count: m variables
  • Per-variable domain: {0, ..., q-1}
  • Meaning: variable c_j is the symbol chosen for the center string at position j

A configuration is always syntactically feasible if every symbol lies in the alphabet. Its objective value is the maximum Hamming distance to the input strings.

Schema (data type)

Type name: ClosestString

Field Mathematical type Description
alphabet_size positive integer Size q of the finite alphabet
strings list of equal-length strings over {0,...,q-1} Input strings s_1,...,s_n

Suggested size fields:

  • alphabet_size
  • num_strings
  • string_length
  • total_length = num_strings * string_length

Input validation should require:

  • alphabet_size > 0 unless all strings and the center length are zero,
  • all input strings have equal length,
  • every symbol is < alphabet_size.

Complexity

The direct exact baseline enumerates all possible center strings and evaluates the maximum Hamming distance.

This gives:

O(alphabet_size^string_length * num_strings * string_length).

The registry complexity expression can use:

alphabet_size ^ string_length.

Extra Remark

This model is distinct from ClosestSubstring. In ClosestString, every input string has the same length as the center, so there is no window-selection decision.

Reduction Rule Crossref

How to solve

Example Instance

Use binary alphabet Sigma = {0,1} and input strings of length 3:

String Value
s_1 000
s_2 011
s_3 101
s_4 110

Expected Outcome

An optimal center string is:

c = 000.

Its distances are:

  • d_H(000, 000) = 0
  • d_H(000, 011) = 2
  • d_H(000, 101) = 2
  • d_H(000, 110) = 2

So the objective value is 2.

No center has radius 1: the four input strings are pairwise arranged so that every binary length-3 string is at distance at least 2 from at least one input string. Therefore the optimum radius is exactly 2.

BibTeX

@article{LiMaWang2002ClosestStringSubstring,
  author  = {Ming Li and Bin Ma and Lusheng Wang},
  title   = {On the Closest String and Substring Problems},
  journal = {Journal of the ACM},
  volume  = {49},
  number  = {2},
  pages   = {157--171},
  year    = {2002},
  doi     = {10.1145/506147.506150},
  url     = {https://doi.org/10.1145/506147.506150}
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    modelA model problem to be implemented.

    Type

    No type

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions