SAME-L

By Stability AI

Stability AI's 852M-parameter audio autoencoder for high-quality music and sound effect encoding and reconstruction.

Model Description

Please note: For commercial use, please refer to https://stability.ai/license

Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically Aligned Music autoEncoder), a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard) while maintaining excellent reconstruction quality and strong downstream generative performance. We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses. The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
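
To make the 4096x figure concrete, the short sketch below converts a clip length in samples into a number of latent frames. The 44.1 kHz sample rate and the round-up (padding) behaviour are assumptions made for illustration and are not stated above; the 2048x comparison follows from the "roughly twice the current standard" remark. The helper names are hypothetical.

```python
import math


def latent_frames(num_samples: int, compression_ratio: int = 4096) -> int:
    """Latent frames for a clip, assuming the input is padded up to a
    whole multiple of the temporal compression ratio (an assumption)."""
    return math.ceil(num_samples / compression_ratio)


def latent_rate_hz(sample_rate: int, compression_ratio: int = 4096) -> float:
    """Latent frames produced per second of audio."""
    return sample_rate / compression_ratio


sample_rate = 44_100            # assumed sample rate, not taken from the model card
one_minute = 60 * sample_rate   # 2,646,000 samples per channel

print(latent_frames(one_minute))          # 646 frames at 4096x
print(latent_frames(one_minute, 2048))    # 1292 frames at 2048x
print(round(latent_rate_hz(sample_rate), 2))  # about 10.77 latent frames per second
```

Under these assumptions, a downstream generative model sees about half as many time steps per minute of audio as it would at 2048x, which is where much of the computational saving comes from.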

Usage

This model can be used with:

  1. the stable-audio-3 inference and fine-tuning library
  2. the stable-audio-tools research library

Using with stable-audio-3

import torchaudio
from stable_audio_3 import AutoencoderModel

ae = AutoencoderModel.from_pretrained("same-l")
waveform, sr = torchaudio.load("audio.wav")
latents = ae.encode(waveform, sr)
audio_out = ae.decode(latents)

Using with stable-audio-tools

import torch
import torchaudio
from stable_audio_tools import get_pretrained_model

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only on GPU; the flag is always defined so CPU runs do not raise a NameError
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/SAME-L")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
if model_half:
    model = model.to(torch.float16)

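audio, sr = torchaudio.load("/path/to/audiofile")  # [channels, samples]
# Resample if the file's rate differs from the rate the model expects
if sr != sample_rate:
    audio = torchaudio.functional.resample(audio, sr, sample_rate)
# Duplicate a mono channel so the model receives stereo input
if audio.shape[0] == 1:
    audio = audio.repeat(2, 1)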

audio = audio.unsqueeze(0).to(device)  # add batch dimension: [1, channels, samples]
if model_half:
    audio = audio.half()
with torch.no_grad():
    latents = model.encode_audio(audio)
    reconstructed = model.decode_audio(latents)
# Drop the batch dimension, then convert to 16-bit PCM
reconstructed = reconstructed.squeeze(0)
reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
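
The final line of the snippet above clamps the output before scaling it to 16-bit integers. As a plain-Python illustration of why the clamp matters, the sketch below applies the same clamp, scale, and truncate steps to a few sample values; `float_to_pcm16` is a hypothetical helper written for this example, not part of stable-audio-tools.

```python
def float_to_pcm16(samples):
    """Clamp float samples to [-1, 1] and scale them to signed 16-bit integers."""
    pcm = []
    for x in samples:
        clamped = max(-1.0, min(1.0, x))   # same role as .clamp(-1, 1)
        pcm.append(int(clamped * 32767))   # int() truncates toward zero
    return pcm


print(float_to_pcm16([0.0, 0.5, 1.0, -1.0, 1.5]))
# → [0, 16383, 32767, -32767, 32767]
```

Without the clamp, the out-of-range value 1.5 would scale to 49150, which does not fit in a signed 16-bit integer and would be corrupted on conversion.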

Model Details

Training dataset

Datasets Used

The training data consists of roughly 19,500 hours of licensed production audio from AudioSparx: 66% music, 25% sound effects, and 9% instrument stems.

Author

Stability AI (organization: stabilityai)

Details

Downloads: 6.3K
Likes: 20
Access: Open Source
Parameters: 852M
License: other
Library: stable-audio-3
Created: May 17, 2026
Updated: Jun 24, 2026
Languages: en