[FEATURE] Potential optimization for the selective_scan_seq #250

By replacing the explicit tensor operations (`unsqueeze` plus broadcasting) with `torch.einsum()` in the Zero-Order-Hold transformation, both performance and readability can be improved.


Replacing the original Zero-Order-Hold transformation in line 518 of `mamba_arch.py`
```python
deltaA = torch.exp(delta.unsqueeze(-1) * A)
deltaB = delta.unsqueeze(-1) * B.unsqueeze(2)
BX = deltaB * (x.unsqueeze(-1))
```

with:
```python
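# Shapes: delta (b, l, d), A (d, n), B (b, l, n), x (b, l, d)
# torch.exp is kept so that deltaA matches the original definition.
deltaA = torch.exp(torch.einsum('bld,dn->bldn', delta, A))
BX = torch.einsum('bld,bld,bln->bldn', delta, x, B)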
```

can reduce execution time by up to ~40% while requiring the same number of FLOPs (see attached plot).

![Image](https://github.com/user-attachments/assets/335954aa-0045-4e0a-b7d7-c8e6cda378df)

Moreover, vectorization of the loop does not further improve execution time.
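
For anyone who wants to verify the equivalence before benchmarking, here is a minimal sketch that compares the two formulations on random inputs. The helper names `zoh_broadcast` and `zoh_einsum`, the tensor sizes, and the tolerances are my own assumptions for illustration and are not taken from `mamba_arch.py`; the shapes `(b, l, d)` for `delta` and `x`, `(d, n)` for `A`, and `(b, l, n)` for `B` are inferred from the einsum subscripts above.

```python
import torch


def zoh_broadcast(delta, A, B, x):
    """Original formulation: explicit unsqueeze plus broadcasting."""
    deltaA = torch.exp(delta.unsqueeze(-1) * A)      # (b, l, d, n)
    deltaB = delta.unsqueeze(-1) * B.unsqueeze(2)    # (b, l, d, n)
    BX = deltaB * x.unsqueeze(-1)                    # (b, l, d, n)
    return deltaA, BX


def zoh_einsum(delta, A, B, x):
    """Proposed formulation: the same products expressed with einsum."""
    deltaA = torch.exp(torch.einsum('bld,dn->bldn', delta, A))
    BX = torch.einsum('bld,bld,bln->bldn', delta, x, B)
    return deltaA, BX


# Hypothetical small sizes, chosen only to keep the check fast.
torch.manual_seed(0)
b, l, d, n = 2, 16, 8, 4
delta = torch.rand(b, l, d)
A = -torch.rand(d, n)
B = torch.randn(b, l, n)
x = torch.randn(b, l, d)

dA_ref, BX_ref = zoh_broadcast(delta, A, B, x)
dA_new, BX_new = zoh_einsum(delta, A, B, x)

# Both variants should agree up to floating-point rounding.
outputs_match = (
    torch.allclose(dA_ref, dA_new, rtol=1e-5, atol=1e-6)
    and torch.allclose(BX_ref, BX_new, rtol=1e-5, atol=1e-6)
)
print("outputs match:", outputs_match)
```

The check only confirms numerical agreement; the speedup itself depends on device, dtype, and tensor sizes, so it should be measured separately on the target setup.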
