Up until now, most generative music models have produced mono sound. That means MusicGen did not place any sounds or instruments on the left or right side, resulting in a less lively and exciting mix. The reason stereo has mostly been overlooked so far is that generating it is not a trivial task.
As musicians, when we produce stereo signals, we have access to the individual instrument tracks in our mix and we can place them wherever we want. MusicGen does not generate all instruments separately but instead produces one combined audio signal. Without access to these instrument sources, creating stereo sound is hard. Unfortunately, splitting an audio signal into its individual sources is a tough problem (I’ve published a blog post about that) and the tech is still not 100% ready.
Therefore, Meta decided to incorporate stereo generation directly into the MusicGen model. Using a new dataset of stereo music, they trained MusicGen to produce stereo outputs. The researchers claim that generating stereo comes at no additional computing cost compared to mono.
Although I feel that the stereo procedure is not described very clearly in the paper, my understanding is that it works like this (Figure 3): MusicGen has learned to generate two compressed audio signals (a left and a right channel) instead of one mono signal. These compressed signals are then decoded separately before being combined into the final stereo output. The reason this does not take twice as long is that MusicGen can now produce the two compressed signals in roughly the same time it previously needed for one.
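To make this concrete, here is a minimal sketch of how I picture the decoding step. The function name and tensor shapes are my own stand-ins (the real decoder is EnCodec, which ships with Meta's audiocraft library), so treat this as an illustration rather than Meta's actual implementation:

```python
import torch

def decode_stereo(left_codes, right_codes, encodec_decoder):
    """Turn two compressed token streams into one stereo waveform.

    left_codes / right_codes: compressed (EnCodec-style) token tensors
    for each channel, shape (batch, n_codebooks, n_frames).
    encodec_decoder: stand-in for EnCodec's decoder, mapping tokens
    back to a mono waveform of shape (batch, 1, n_samples).
    """
    left_wav = encodec_decoder(left_codes)    # decode left channel
    right_wav = encodec_decoder(right_codes)  # decode right channel
    # Stack the two mono signals into a 2-channel stereo waveform.
    return torch.cat([left_wav, right_wav], dim=1)  # (batch, 2, n_samples)
```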
Being able to produce convincing stereo sound really sets MusicGen apart from other state-of-the-art models like MusicLM or Stable Audio. From my perspective, this “little” addition makes a huge difference in the liveliness of the generated music. Listen for yourselves (might be hard to hear on smartphone speakers):
Mono
Stereo
MusicGen was impressive from the day it was released. Since then, however, Meta's FAIR team has been continually improving the model, enabling higher-quality results that sound more authentic. When it comes to text-to-music models that generate audio signals (not MIDI, etc.), MusicGen is ahead of its competitors from my perspective (as of November 2023).
Further, since MusicGen and all its related products (EnCodec, AudioGen) are open-source, they constitute an incredible source of inspiration and a go-to framework for aspiring AI audio engineers. If we look at the improvements MusicGen has made in only 6 months, I can only imagine that 2024 will be an exciting year.
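If you want to try this yourself, the open-source audiocraft library gets you a stereo clip in a few lines of Python. Note that the stereo checkpoint name below is my assumption of how the released models are labeled, so double-check the model cards:

```python
# Minimal generation sketch using Meta's open-source audiocraft library.
# 'facebook/musicgen-stereo-small' is my assumption for the stereo
# checkpoint name; check the audiocraft / Hugging Face model cards.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-stereo-small')
model.set_generation_params(duration=10)  # generate 10 seconds of audio

wavs = model.generate(['upbeat funk with slap bass and warm horns'])
for i, wav in enumerate(wavs):
    # Saves a loudness-normalized .wav; stereo checkpoints return 2-channel audio.
    audio_write(f'stereo_demo_{i}', wav.cpu(), model.sample_rate, strategy='loudness')
```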
Another important point is that with their transparent approach, Meta is also doing foundational work for developers who want to integrate this technology into software for musicians. Generating samples, brainstorming musical ideas, or changing the genre of your existing work — these are some of the exciting applications we are already starting to see. With a sufficient level of transparency, we can make sure we are building a future where AI makes creating music more exciting instead of being only a threat to human musicianship.