Hi, I'm trying to use BPE for biological sequences, and for that I want to initialize it from all possible 3-letter combinations. Namely:
let tokens: Vec<Vec<u8>> = kwargs
    .strings
    .iter()
    .map(move |s| s.as_bytes().to_vec())
    .collect();
let enc = BytePairEncoding::from_dictionary(tokens, None);
and here I'm getting
panicked at /home/marinegor/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/bpe-0.2.1/src/byte_pair_encoding.rs:505:70:
where strings = [f'{l1}{l2}{l3}' for l1 in 'AUGC' for l2 in 'AUGC' for l3 in 'AUGC'] is passed from Python.
This is confusing, since I can successfully initialize the encoder from (and tokenize strings with) simpler lists of tokens, e.g. ['AB','BA'] or all 2-letter combinations of AUGC.
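
To rule out the Python side, the same 64-token dictionary can be built directly in Rust. Below is a minimal std-only sketch: `all_kmers` is a hypothetical helper name of mine (not part of the bpe crate), and the `from_dictionary` call is left as a comment so the snippet compiles without the crate.

```rust
/// Hypothetical helper (not part of the `bpe` crate): builds every
/// length-`k` combination of the bytes in `alphabet`, in the same order
/// as the nested Python comprehension (first letter varies slowest).
fn all_kmers(alphabet: &[u8], k: usize) -> Vec<Vec<u8>> {
    // Start from a single empty token and extend it one byte at a time.
    let mut tokens: Vec<Vec<u8>> = vec![Vec::new()];
    for _ in 0..k {
        tokens = tokens
            .iter()
            .flat_map(|prefix| {
                alphabet.iter().map(move |&b| {
                    let mut t = prefix.clone();
                    t.push(b);
                    t
                })
            })
            .collect();
    }
    tokens
}

fn main() {
    let tokens = all_kmers(b"AUGC", 3);
    println!(
        "{} tokens, first = {}",
        tokens.len(),
        String::from_utf8_lossy(&tokens[0])
    );
    // With the bpe crate in scope, this is the call that panics for me:
    // let enc = BytePairEncoding::from_dictionary(tokens, None);
}
```

Calling `all_kmers(b"AUGC", 2)` gives the 2-letter dictionary that initializes without problems, so the two cases can be compared side by side.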