MusicGen is an auto-regressive transformer-based model, meaning generates audio codes (tokens) in a causal fashion.
At each decoding step, the model generates a new set of audio codes, conditional on the text input and all previous audio codes. From the
frame rate of the EnCodec model used to decode the generated codes to audio waveform.