SAME-L
Stability AI's 852M-parameter audio autoencoder for high-quality music and sound effect encoding and reconstruction.
Please note: For commercial use, please refer to https://stability.ai/license
Model Description
Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically Aligned Music autoEncoder), a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard) while maintaining excellent reconstruction quality and strong downstream generative performance. We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses. The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
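To make the 4096x temporal compression ratio concrete, the sketch below works out how many latent frames a clip occupies and the resulting latent frame rate. The 44.1 kHz sample rate is an assumption for illustration only (read the real value from the model config), and rounding up to a whole frame assumes the input is padded to a multiple of the ratio. The helper names are hypothetical and not part of any library.

```python
COMPRESSION_RATIO = 4096  # temporal compression ratio quoted in the description


def latent_frames(num_samples: int, ratio: int = COMPRESSION_RATIO) -> int:
    """Latent frames needed to cover num_samples, rounding up (assumes padding)."""
    return -(-num_samples // ratio)  # integer ceiling division


def latent_rate_hz(sample_rate: int, ratio: int = COMPRESSION_RATIO) -> float:
    """Latent frames produced per second of audio."""
    return sample_rate / ratio


sample_rate = 44_100  # ASSUMPTION: illustrative value, not taken from the model card
print(latent_frames(30 * sample_rate))        # 30 s clip -> 323 frames
print(round(latent_rate_hz(sample_rate), 2))  # -> 10.77 frames per second
```

Under these assumptions, a 30-second clip maps to a sequence of only a few hundred latent frames, which is where much of the downstream compute saving would come from.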
Usage
This model can be used with:
- the stable-audio-3 inference and fine-tuning library
- the stable-audio-tools research library
Using with stable-audio-3
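import torchaudio
from stable_audio_3 import AutoencoderModel

# Load the pretrained autoencoder
ae = AutoencoderModel.from_pretrained("same-l")

# Load audio and its sample rate
waveform, sr = torchaudio.load("audio.wav")

# Encode to latents, then decode back to audio
latents = ae.encode(waveform, sr)
audio_out = ae.decode(latents)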
Using with stable-audio-tools
import torch
import torchaudio
from stable_audio_tools import get_pretrained_model

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only when running on a GPU
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/SAME-L")
sample_rate = model_config["sample_rate"]
model = model.to(device)
if model_half:
    model = model.to(torch.float16)

# Load audio as [channels, samples]; replace the path with your own file
audio, sr = torchaudio.load("/path/to/audiofile")
# Resample if the file's rate differs from the model's expected rate
if sr != sample_rate:
    audio = torchaudio.functional.resample(audio, sr, sample_rate)
# Duplicate a mono channel to get stereo input
if audio.shape[0] == 1:
    audio = audio.repeat(2, 1)
audio = audio.unsqueeze(0).to(device)  # [1, channels, samples]
if model_half:
    audio = audio.half()

with torch.no_grad():
    latents = model.encode_audio(audio)
    reconstructed = model.decode_audio(latents)

# Drop the batch dimension and convert to 16-bit PCM, [channels, samples]
reconstructed = reconstructed.squeeze(0)
reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
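The snippet above ends with an int16 tensor of shape [channels, samples] but does not write it to disk. The usual route would be `torchaudio.save`. As a dependency-free alternative, here is a minimal sketch that uses only the Python standard library. The helper `write_wav_int16` is hypothetical (not part of stable-audio-tools) and expects plain nested lists, one list of int16 values per channel.

```python
import struct
import wave


def write_wav_int16(path, channels, sample_rate):
    """Write per-channel lists of int16 samples to a 16-bit PCM WAV file."""
    num_frames = len(channels[0])
    if any(len(ch) != num_frames for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    # WAV stores samples interleaved frame by frame: L0, R0, L1, R1, ...
    interleaved = [sample for frame in zip(*channels) for sample in frame]
    data = struct.pack(f"<{len(interleaved)}h", *interleaved)  # little-endian int16
    with wave.open(path, "wb") as wf:
        wf.setnchannels(len(channels))
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(data)
```

With the variables from the snippet above, a call might look like `write_wav_int16("reconstructed.wav", reconstructed.tolist(), sample_rate)`.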
Model Details
- Model type: SAME is a continuous autoencoder model based on a transformer architecture.
- Language(s): English
- License: Stability AI Community License.
- Research Paper: https://arxiv.org/abs/2605.18613
Training dataset
Datasets Used
Our dataset consists of ~19,500 hours of licensed production audio from AudioSparx, split roughly 66/25/9% between music, sound effects, and instrument stems, respectively.
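As a quick worked example, the percentages can be turned into approximate hours per category. This sketch assumes the 66/25/9% mix is measured by duration, which the text does not state explicitly. Because the total is itself approximate, the results are rough figures only.

```python
TOTAL_HOURS = 19_500  # approximate total quoted above
MIX_PERCENT = {"music": 66, "sound effects": 25, "instrument stems": 9}


def hours_per_category(total_hours, mix_percent):
    """Split a total duration by percentage share (ASSUMES shares are by duration)."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total_hours * pct / 100 for name, pct in mix_percent.items()}


for name, hours in hours_per_category(TOTAL_HOURS, MIX_PERCENT).items():
    print(f"{name}: ~{hours:,.0f} h")
# music: ~12,870 h
# sound effects: ~4,875 h
# instrument stems: ~1,755 h
```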