This is part 1 of my new multi-part series 🐍 Towards Mamba State Space Models for Images, Videos and Time Series.
Is Mamba all you need? Certainly, people have long thought exactly that of the Transformer architecture introduced by A. Vaswani et al. in Attention Is All You Need back in 2017. And without any doubt, the Transformer has revolutionized the field of deep learning over and over again. Its general-purpose architecture can easily be adapted to various data modalities such as text, images, videos, and time series, and it seems that the more compute resources and data you throw at the Transformer, the more performant it becomes.
However, the Transformer’s attention mechanism has a major drawback: its complexity is O(N²), meaning it scales quadratically with the sequence length. The longer the input sequence, the more compute and memory you need, which often makes working with very long sequences infeasible.
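To see where the quadratic term comes from, here is a minimal sketch of scaled dot-product attention in PyTorch (a single head, no masking or learned projections, so it is only meant to expose the N×N score matrix, not to reproduce the full Transformer layer):

```python
import torch

def naive_attention(q, k, v):
    # q, k, v: tensors of shape (seq_len, d_model)
    # The score matrix is (seq_len, seq_len) -- this is the O(N^2) term
    # in both compute and memory.
    scores = q @ k.T / (k.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

for n in (1_000, 2_000, 4_000):
    q = k = v = torch.randn(n, 64)
    out = naive_attention(q, k, v)
    # Doubling the sequence length quadruples the number of score entries:
    print(f"seq_len={n}: score matrix holds {n * n:,} entries")
```

Doubling the sequence length quadruples the size of the score matrix, which is exactly why very long sequences quickly exhaust compute and memory budgets.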
- What is this Series About?
- Why Do We Need a New Model?
- Structured State Space Models