This repository contains my submission for the GSoC 2025 FASEROH evaluation test. The primary goal of this project is to explore the capability of sequence-to-sequence deep learning models in learning symbolic mathematical operations, specifically focusing on predicting the Taylor series expansion (up to the fourth order) of given mathematical functions.
The project involved three main tasks:
- Dataset Generation & Preprocessing: Creating a dataset of mathematical functions and their corresponding Taylor expansions using SymPy, followed by tokenization.
- LSTM Model Training: Implementing and training an LSTM-based sequence-to-sequence model with attention to learn the mapping from a function's token sequence to its Taylor expansion's token sequence.
- Transformer Model Training (Optional): Implementing and training a Transformer-based sequence-to-sequence model, exploring novel tokenization and attention mechanisms, to perform the same task.
This README provides a detailed overview of the implementation for each task, highlighting the methodologies, architectural choices, and key findings. The images of model architectures are also uploaded alongside in this Github repository. The model pths can be accessed from https://drive.google.com/drive/folders/1wCKMzRB7pEp55P2wuqA9YcPqkkHVtQCF?usp=drive_link.
The submission consists of three main Jupyter Notebooks:
Data_Generation.ipynb: Contains the code for generating the dataset of functions and their Taylor expansions using SymPy and multiprocessing.LSTM_Training.ipynb: Details the implementation, training, and evaluation of the LSTM-based sequence-to-sequence model.Transformer_Training.ipynb: Details the implementation, training, and evaluation of the Transformer-based sequence-to-sequence model, including novel tokenization and attention mechanisms.
Each notebook with outputs are also uploaded.
To programmatically generate a diverse dataset of mathematical functions represented as token sequences and their corresponding Taylor series expansions (up to the 4th order) around x=0.
-
Vocabulary Definition: A vocabulary of mathematical operators (binary:
add,sub,mul,div,pow; unary:sq,sqrt,cb,cbrt,exp,ln,sin,cos, etc.) and the variablexwas defined. Each token was assigned a probability weight to influence its frequency during generation and an arity (0, 1, or 2). -
Recursive Expression Generation (
generate_expression):- A recursive function generates expressions in prefix notation (e.g.,
['add', 'x', 'sin', 'x']forx + sin(x)). - The generation process respects operator arity and maximum tree depth (
max_depth=3used for this submission to manage complexity). - Constraint: To enhance diversity and avoid trivial cancellations or overly complex nests, if a function from a specific family (e.g., trigonometric, hyperbolic, exp/log, powers/roots) was chosen, subsequent recursive calls for its children were made using a vocabulary excluding other members of that same family within that branch.
- A recursive function generates expressions in prefix notation (e.g.,
-
SymPy Conversion (
sequence_to_sympy): The generated token sequence (prefix notation) is parsed recursively to construct the corresponding SymPy symbolic expression. -
Taylor Expansion Calculation (
taylor_expansion):- For each SymPy function, the Taylor expansion up to the 4th order around
x=0was computed. - SymPy's
diffandsubs(orlimitifsubsfails due to singularities at x=0) were used to calculate terms iteratively.evalf(precision)was used for numerical evaluation. - A fallback to SymPy's
series().removeO()was implemented for robustness, though the primary method involves manual term calculation.
- For each SymPy function, the Taylor expansion up to the 4th order around
-
Validation & Filtering: Generated samples were validated to ensure suitability:
- Real Coefficients (
not_real): Taylor expansions with complex coefficients were discarded. - Non-Constant (
is_constant): Constant functions/expansions were discarded. - Non-NaN: Samples resulting in
sp.nan(due to irrecoverable errors or undefined limits) were discarded.
- Real Coefficients (
-
Efficient Parallel Generation (
generate_dataset,generate_sample_with_timeout):- Python's
multiprocessing(withset_start_method('fork')) andconcurrent.futures.ProcessPoolExecutorwere employed to parallelize the generation process across multiple CPU cores (n_workers=16). This significantly sped up dataset creation. - A timeout mechanism (
generate_sample_with_timeout) was implemented for each sample generation to prevent SymPy from getting stuck on overly complex or problematic expressions, ensuring the overall process didn't stall. - A progress bar (
tqdm) tracked the number of valid samples generated versus total attempts, providing insight into the filtering rate.
- Python's
-
Final Processing: The generated data (function token list, SymPy function, SymPy Taylor expansion) was stored in a Pandas DataFrame. SymPy's
simplifywas applied to the input functions to create a canonical representation, and duplicates based on this simplified form were removed. The final DataFrame was saved usingpickle.
Generating symbolic expressions and computing Taylor series, especially with validation checks, is computationally intensive. Even with multiprocessing on 16 cores, generating the dataset of 3000 samples (after filtering) took approximately 3 hours on my available hardware. This limitation restricted the feasible dataset size for this evaluation. A larger dataset would likely benefit model training further.
To train a sequence-to-sequence LSTM model with attention to predict the tokenized Taylor expansion from a tokenized input function.
- Encoder Vocabulary: Derived from the unique non-numeric tokens present in the simplified input functions generated in Task 1, plus special tokens (
PAD,<UNK>) and common numeric components (+,-,0-9,E-1toE1). - Decoder Vocabulary: A fixed vocabulary designed to represent polynomial coefficients up to O(x^4). It includes:
- Special tokens:
PAD,SOS,EOS. - Coefficient degrees:
x0,x1,x2,x3,x4. - P10 Number Representation: Signs (
+,-), Exponents (E-5toE5), Digits (0-9).
- Special tokens:
- P10 Number Encoding/Decoding: A custom scheme (
encode_number,decode_number) represents floating-point coefficients. It encodes a number into a fixed-length sequence of tokens (sign, exponentE{exp}, leading digit,precisionmantissa digits). This provides a consistent token representation for numerical values.precision=4was used. - Input Encoding (
encode_enc): Converts a simplified SymPy function into a sequence of encoder token IDs usingsympy_tokenizerandencode_numberfor any numerical constants. - Output Encoding (
encode_dec): Extracts coefficients of the Taylor polynomial (up to x^4) and converts them into a sequence of decoder token IDs, includingSOS,EOS, degree markers (x{i}), and P10 encoded coefficients. - Output Decoding (
decode_dec,seq_to_coeffs): Converts a sequence of predicted decoder token IDs back into a list of polynomial coefficients.
- A custom PyTorch
Dataset(TaylorDataset) was created to serve tokenized input/output pairs. - A custom
collate_fnhandles batching, padding input and output sequences to the maximum length within each batch using thePADtoken ID (0). It also prepares input lengths required forpack_padded_sequence.
A standard Encoder-Decoder architecture with Bahdanau-style attention was implemented:
-
Encoder (
EncoderLSTM):- An
nn.Embeddinglayer maps input token IDs toENC_EMBED_DIM-dimensional vectors. - Dropout is applied to embeddings.
- An
nn.LSTM(multi-layer,N_LAYERS=2) processes the padded sequences.pack_padded_sequenceis used before the LSTM andpad_packed_sequenceafter to handle variable-length inputs efficiently. - Outputs the LSTM's output sequence (for attention) and the final hidden and cell states (to initialize the decoder).
- An
-
Attention (
Attention):- A feed-forward network calculates attention scores (energy) based on the previous decoder hidden state and all encoder outputs.
- A
tanhactivation is used. - A final linear layer (
v) reduces the energy to a single score per encoder timestep. softmaxconverts scores to attention weights.- Input padding mask is applied to prevent attention over
<PAD>tokens.
-
Decoder (
DecoderLSTM):- An
nn.Embeddinglayer maps target token IDs toDEC_EMBED_DIM-dimensional vectors. - Dropout is applied.
- At each timestep
t:- Takes the previous predicted token's embedding (
input), the previous hidden/cell states, and all encoder outputs. - Calculates attention weights using the
Attentionmodule based on the current hidden state and encoder outputs. - Computes a weighted context vector by multiplying attention weights with encoder outputs.
- Concatenates the input embedding and the context vector.
- Feeds this concatenated vector into the
nn.LSTM(multi-layer,N_LAYERS=2), using the previous hidden/cell states. - The LSTM outputs the new hidden/cell states and an output vector.
- A final linear layer (
fc_out) projects the concatenation of the LSTM output, the context vector, and the input embedding to the size of the decoder vocabulary, producing logits.
- Takes the previous predicted token's embedding (
- An
-
Seq2Seq Wrapper (
LSTMSeq2Seq):- Combines the Encoder and Decoder.
- Implements the forward pass, handling the flow from encoder to decoder.
- Includes Teacher Forcing during training: With probability
teacher_forcing_ratio, the true target token is fed as the next input; otherwise, the decoder's own prediction (argmax of logits) is used. Teacher forcing is disabled during validation/evaluation. - Creates the source padding mask.
- Loss Function:
nn.CrossEntropyLosswithignore_indexset to thePADtoken ID. Weighted Cross Entropy was used (via theweightargument), assigning higher weights to less frequent but important tokens (like degree markersx{i}and exponentsE{i}) to encourage the model to predict them correctly. Weights were defined in theTokenizer. - Optimizer:
AdamW(Adam with decoupled weight decay,LEARNING_RATE=1e-4,WEIGHT_DECAY=0.01). - Scheduler:
CosineAnnealingLR(T_max=50) to cyclically adjust the learning rate. - Gradient Clipping: Applied (
clip=1.0) to prevent exploding gradients. - Training Loop (
train_one_epoch): Standard PyTorch training loop iterating through batches, calculating loss, backpropagating, and updating weights. - Validation Loop (
validate): Evaluates the model on the validation set after each epoch without teacher forcing. - Early Stopping: Training stops if the validation loss doesn't improve for a
patiencenumber of epochs (10 used here). The best model state dict based on validation loss is saved.
- Inference (
evaluate_model):- The trained model (best checkpoint loaded) generates predictions autoregressively on the test set.
- For each input function, the encoder runs once.
- The decoder starts with the
SOStoken. At each step, it predicts the next token based on its previous output. - Finite State Prediction (
finite_state_prediction): To ensure syntactically valid output sequences (e.g., a sign must follow a degree marker, digits must follow an exponent), a simple Finite State Machine (FSM) logic was added during inference. Based on the expected token type (degree, sign, exponent, number digit), the prediction is constrained to only valid tokens for that state by selecting theargmaxonly from the logits corresponding to valid next tokens. - Decoding stops when
EOSis predicted or a maximum length is reached.
- Metrics:
- Coefficient RMSE (
coeff_rmse): Calculates the Root Mean Squared Error between the predicted and target coefficient vectors. - Polynomial RMSE (
polynomial_rmse): Samplesn=100points in the rangex_range=(-1, 1), evaluates the predicted and target polynomials at these points, and calculates the RMSE between the function values. This measures the functional similarity.
- Coefficient RMSE (
The trained LSTM model achieved:
- Polynomial RMSE: ~323.21
- Coefficient RMSE: ~315.13
The relatively high RMSE suggests the LSTM model struggled to achieve high precision, potentially due to the complexity of the symbolic mapping, the limitations of the dataset size, or the inherent challenges LSTMs face with very long-range dependencies compared to Transformers. The Finite State Prediction helps generate valid sequences but doesn't guarantee numerical accuracy.
To implement and train a Transformer model for the same Taylor expansion prediction task, leveraging its strengths in handling sequential data and exploring novel approaches to tokenization and positional information encoding.
This is a key innovation in this implementation, designed to explicitly provide the Transformer with structural information about the input function tree and the numerical structure of the output polynomial.
-
Encoder Tokenization (
sympy_tokenizermodified):- Still uses
sympy_tokenizerto get prefix notation tokens. - Crucially, it also generates an "absolute path" for each token in the expression tree. This path represents the traversal from the root. For example, in
add(sin(x), mul(x, x))which tokenizes to['add', 'sin', 'x', 'mul', 'x', 'x']:'add':[1](root)'sin':[1, 0, 1](first child of root)'x' (in sin):[1, 0, 1, 0](first child of 'sin')'mul':[1, 0, 1](second child of root)'x' (first in mul):[1, 0, 1, 1, 0](first child of 'mul')'x' (second in mul):[1, 0, 1, 0, 1](second child of 'mul')
- Numerical constants are still encoded using P10 (
encode_number), but their corresponding paths are adapted (details in code). The paths are padded to a fixedpos_dim=6.
- Still uses
-
Encoder Positional Embeddings (
encode_enc):- Absolute (
enc_a): The generated tree paths (padded vectors) are used directly as absolute positional encodings. This explicitly tells the model where each token resides within the function's structure. - Relative (
enc_r): For every pair of tokens(i, j)in the input sequence, a relative positional encoding is calculated by comparing their absolute path vectors element-wise using a simple difference function (rel_ij: returns -1 ifi=0, j>0, 1 ifi>0, j=0, 0 otherwise, element-wise). This results in a(seq_len, seq_len, pos_dim)tensor, capturing the relationship between the positions of any two tokens in the tree structure.
- Absolute (
-
Decoder Tokenization (
encode_dec):- Similar to the LSTM decoder (P10 encoding,
SOS,EOS,x{i}tokens).
- Similar to the LSTM decoder (P10 encoding,
-
Decoder Positional Embeddings (
encode_dec,return_dec_embeddings):- Absolute (
dec_a): A simpler absolute encoding is used. Tokens belonging to the same coefficient (x{i}, sign, exponent, digits) share the same absolute position index (based on the coefficient's term number). This groups related tokens. Represented as a(seq_len, 1)tensor. - Relative (
dec_r): Calculated based on the fine-grained position within a number. Tokens representing digits closer to the decimal point have different relative encodings compared to those further away. This captures the numerical significance within the P10 encoding. Represented as a(seq_len, seq_len, 1)tensor.
- Absolute (
-
Pros of this Approach:
- Structure Awareness: Encoder embeddings directly encode the tree structure, potentially allowing the Transformer to learn hierarchical relationships better than standard sequential embeddings.
- Disentangled Information: Relative embeddings capture pairwise relationships, complementing the absolute position information.
- Tailored Embeddings: Different, task-appropriate positional encoding strategies are used for the structured input (function tree) and the more sequentially structured output (polynomial coefficients).
- Hypothesized Benefit: This richer positional information, especially for the encoder, is hypothesized to allow the Transformer to more effectively understand the symbolic structure and perform the transformation to a Taylor series.
-
Disentangled Attention (
DisentangledAttention):- Core Idea: This is another novel component. Instead of adding positional encodings to word embeddings before attention, this custom attention mechanism incorporates positional information directly into the attention score calculation, keeping content and position somewhat separate (disentangled).
- Implementation: Inspired by papers like DeBERTa and implementations like x-transformers, this module modifies the standard scaled dot-product attention formula:
Attention(Q, K, V) = Softmax( (Qc*Kc^T + Qc*Kr^T + Kc*Qr^T + Pa*Pa^T) / sqrt(d) ) * V(Conceptual formula; actual implementation uses efficient einsum/matmul and separate projections for content and position).Qc, Kc: Content queries and keys (from token embeddings).Qr, Kr: Relative position queries and keys (derived from relative positional encodingsenc_r/dec_r).Pa: Absolute position embeddings (enc_a/dec_a).
- It takes separate inputs for content (
query,key,value) and positional information (a_query,a_key,r_query,r_key). - It uses separate linear projections for content and positional inputs before computing the attention scores.
- It supports causal masking for the decoder's self-attention.
-
Encoder Block (
Encoder):- Consists of a
DisentangledAttentionlayer (for self-attention, usingenc_aandenc_r) followed by Layer Normalization. - A standard Feed-Forward Network (FFN) with ReLU activation, followed by Layer Normalization.
- Residual connections are used around both sub-layers.
- Consists of a
-
Decoder Block (
Decoder):- Consists of a masked
DisentangledAttentionlayer (for self-attention, usingdec_aanddec_r) followed by Layer Normalization. - A second
DisentangledAttentionlayer (for cross-attention) attends to the encoder's output (enc_x), using the decoder's state (dec_x) as query and incorporating both encoder and decoder positional information (enc_a,dec_a,enc_r,dec_r), followed by Layer Normalization. - A standard FFN, followed by Layer Normalization.
- Residual connections are used around all three sub-layers.
- Consists of a masked
-
Full Transformer (
Transformer):- Input and output embedding layers (
enc_embed,dec_embed). - Stacks
NEncoder blocks (n_encoders=12). - Stacks
NDecoder blocks (n_decoders=12). - A final linear layer projects the decoder output to the decoder vocabulary size.
- Input and output embedding layers (
- Uses the same
TaylorDatasetbut with the enhanced tokenizer providing positional encodings. - The
collate_fnis more complex, needing to pad:enc_seqs,out_seqs: 1D padding (standard).enc_as,dec_as: 2D padding (sequence length, position dimension).enc_rs,dec_rs: 3D padding (batch, seq_len, seq_len, position dimension) using custom helper functions (pad_3d_tensors,pad_2d_tensors).
- Similar setup to the LSTM:
- Loss: Weighted
nn.CrossEntropyLoss(same weights, ignoring PAD). - Optimizer:
AdamW(lr=5e-5,weight_decay=0.01). - Scheduler:
CosineAnnealingWarmRestarts(T_0=10,T_mult=2). - Training/Validation Loops: Adapted for the Transformer architecture and the complex batch structure (multiple tensors for sequences and positions). Validation uses teacher forcing for simplicity and speed, primarily serving as a check against overfitting rather than a full generative evaluation during training.
- Early Stopping: Based on validation loss (
patience=10).
- Loss: Weighted
- Inference (
evaluate_model):- Autoregressive generation similar to the LSTM evaluation.
- At each step, the required absolute (
dec_a) and relative (dec_r) positional embeddings for the current prediction sequence length are generated on-the-fly usingtokenizer.return_dec_embeddings. - The Finite State Prediction (
finite_state_prediction) logic is also used here during decoding to ensure syntactically valid coefficient sequences.
- Metrics: Polynomial RMSE and Coefficient RMSE.
The trained Transformer model achieved:
- Polynomial RMSE: ~15.67
- Coefficient RMSE: ~18.26
These results show a significant improvement over the LSTM model, suggesting that the Transformer architecture, potentially aided by the novel structural tokenization and disentangled attention mechanism, is better suited for this symbolic task. The RMSE values are considerably lower, indicating better functional and coefficient-wise accuracy.
- Compute Constraints: The primary limitation was the lack of extensive GPU resources and time.
- This restricted the dataset size (3000 samples). A much larger dataset (e.g., 100k or 1M samples) would be ideal.
- It limited the extent of hyperparameter tuning (embedding dimensions, number of heads, layers, learning rates, dropout). The chosen parameters (
embed_dim=768,n_heads=4,n_layers=12) are relatively large and might be suboptimal or prone to overfitting on the small dataset. - Training duration was limited. The models, especially the Transformer, could likely benefit from significantly more training epochs.
- Positional Embedding/Attention Validation: While the novel positional encoding and disentangled attention schemes implemented in the Transformer show promise (indicated by the improved RMSE), a more rigorous ablation study is needed to definitively isolate their contribution. Training baseline Transformers (e.g., with standard sinusoidal embeddings or learned embeddings only) on the same dataset would provide a clearer comparison. This was not feasible given the time constraints.
- Tokenization Granularity: The P10 number representation, while functional, might not be the optimal way to handle coefficients. Exploring alternative numerical representations could be beneficial.
- Evaluation Metrics: While RMSE provides a good measure, exploring other symbolic-specific metrics (e.g., exact match percentage after simplification, structural similarity metrics) could offer further insights.
Future work could involve:
- Generating a much larger dataset.
- Performing extensive hyperparameter optimization.
- Conducting ablation studies on the Transformer's positional encoding and attention mechanisms.
- Training for longer durations.
- Exploring different numerical tokenization strategies and possibly encoding much better representations of the input data.
This project successfully demonstrated the feasibility of using sequence-to-sequence models for symbolic mathematics tasks like Taylor expansion prediction. Since this employs a good tokenization system for a sequence of numbers, this perfectly aligns with the main task of FASEROH, which is to map a sequence of numbers (histogram data) to a function, I really believe this rough idea right now can make way to a much better architecture, perfectly suited to make a transformer capture the information of the sequence of numbers in the best way possible. In this task:
- A robust data generation pipeline using SymPy and multiprocessing was established.
- An LSTM with attention model was implemented and trained, providing a baseline performance.
- A Transformer model incorporating novel structural positional encodings and a disentangled attention mechanism was developed and trained, achieving significantly better results than the LSTM on the generated dataset.
The promising results of the Transformer, despite compute limitations, highlight the potential of tailored architectures and input representations for symbolic reasoning tasks. I am enthusiastic about the potential of this research direction and eager to contribute further to the FASEROH project if selected for GSoC 2025.