Audio Stem Separation in Python: Demucs, Spleeter & API Compared (2026 Guide)
Four stems from any audio file — working code for local GPU inference, CPU-only, and a one-request cloud API
You've probably heard the term "stem separation" thrown around in music production, ML research, and audio tooling circles. But what exactly is it, and how do you do it programmatically in Python?
This guide is the definitive answer. We'll cover the two main open-source libraries (Demucs and Spleeter), a cloud REST API for environments without a GPU, and practical code examples for the most common use cases: karaoke generation, drum isolation for practice, and building ML training datasets.
What is audio stem separation?
Short answer: Stem separation (also called music source separation) uses a trained neural network to decompose a mixed audio file into isolated tracks — typically vocals, drums, bass, and "other" (guitars, keys, etc.). The model never hears the isolated parts during inference; it learned the signal statistics of each instrument class from thousands of labeled recordings during training.
For a deeper dive into the signal processing and machine learning behind it, see how AI stem separation works under the hood.
The practical result: given song.mp3, you get back vocals.wav, drums.wav, bass.wav, and other.wav — each as a clean mono or stereo WAV.
Common developer use cases:
| Use case | Stems needed | Why |
| --- | --- | --- |
| Karaoke track generator | no_vocals | Remove lead vocals, keep everything else |
| Vocal isolation / acapella | vocals | Remixes, pitch analysis, training data |
| Drum practice tool | drums | Isolated drum part to study or play along with |
| ML dataset generation | All 4 | Train instrument classifiers, pitch detectors |
| DJ stem mixing | drums, bass | Live stem mixing in Rekordbox / Serato |
| Podcast cleanup | vocals | Isolate speech, remove background music |
Demucs vs Spleeter — which Python library should you use?
Short answer: Use Demucs (the htdemucs model) for anything you're building today. Spleeter is legacy: it hasn't had a major release since 2022 and pins an older TensorFlow version. Demucs scores higher on every stem in the SDR comparison below and was developed at Meta AI Research.
The only reason to reach for Spleeter in 2026 is if you're maintaining an older pipeline that already uses it, or if you need its 5-stem model (vocals/drums/bass/piano/other).
| | Demucs htdemucs | Spleeter 4stems |
| --- | --- | --- |
| SDR vocals | ~9.4 dB | ~6.5 dB |
| SDR drums | ~8.9 dB | ~6.0 dB |
| SDR bass | ~8.2 dB | ~5.5 dB |
| GPU required | No (CPU ok, 5–10× slower) | No |
| Dependency | PyTorch | TensorFlow (pinned older version) |
| Active development | Yes (Meta AI) | No (last release 2022) |
| Stem modes | 2-stem, 4-stem, 6-stem | 2-stem, 4-stem, 5-stem |
| License | MIT | MIT |
For a full quality comparison with listening tests, see the Spleeter vs Demucs comparison.
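The SDR rows in the table are signal-to-distortion ratios in decibels: the higher the value, the closer the estimated stem is to the ground-truth stem. As a rough sketch of the idea, assuming the simple energy-ratio definition (benchmark toolkits such as museval use a more elaborate, windowed variant, so this function will not reproduce the table's numbers exactly) and a hypothetical helper name `sdr_db`:

```python
import numpy as np


def sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-10) -> float:
    """Simple SDR in dB: 10 * log10(||ref||^2 / ||ref - est||^2).

    `eps` avoids division by zero when the estimate is perfect or the
    reference is silent. Both arrays must have the same shape.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if reference.shape != estimate.shape:
        raise ValueError("reference and estimate must have the same shape")
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return float(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))


# Synthetic check: a 440 Hz tone plus an interfering tone at 1/10 the amplitude.
t = np.linspace(0.0, 1.0, 44100, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
leaky = clean + 0.1 * np.sin(2 * np.pi * 1000 * t)
print(f"{sdr_db(clean, leaky):.1f} dB")  # about 20.0 dB (10x amplitude ratio)
```

On this scale, a gap of roughly 3 dB between two models corresponds to about half the residual error energy.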
Method 1 — Demucs (recommended)
Install
pip install demucs
Demucs downloads model weights on first run (~80 MB for htdemucs). PyTorch uses your GPU automatically if CUDA is available; otherwise it runs on CPU.
4-stem separation (vocals / drums / bass / other)
import subprocess
from pathlib import Path


def separate_stems(
    input_path: str | Path,
    output_dir: str | Path = "separated",
    model: str = "htdemucs",
) -> dict[str, Path]:
    """Separate audio into 4 stems using Demucs.

    Returns a dict mapping stem name → output file path.
    """
    input_path = Path(input_path).resolve()
    output_dir = Path(output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    subprocess.run(
        ["demucs", "-n", model, "--out", str(output_dir), str(input_path)],
        check=True,
    )

    # Demucs writes to <output_dir>/<model>/<track name>/<stem>.wav
    stem_dir = output_dir / model / input_path.stem
    return {
        stem: stem_dir / f"{stem}.wav"
        for stem in ("vocals", "drums", "bass", "other")
    }


if __name__ == "__main__":
    stems = separate_stems("my_song.mp3")
    for name, path in stems.items():
        size_kb = path.stat().st_size // 1024
        print(f" {name}: {path} ({size_kb} KB)")
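Note that `separate_stems` derives the output paths from Demucs's directory layout without checking that the files were actually written. A small guard can catch a mismatch early; this is a sketch with a hypothetical helper name, `missing_stems`:

```python
from pathlib import Path


def missing_stems(stems: dict[str, Path]) -> list[str]:
    """Return the names of stems whose file is absent or zero bytes long."""
    return [
        name
        for name, path in stems.items()
        if not path.is_file() or path.stat().st_size == 0
    ]
```

Calling `missing_stems(separate_stems("my_song.mp3"))` and raising when the list is non-empty turns a silent layout mismatch into an explicit error.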
2-stem separation (vocals / no_vocals)
When you only need vocals vs. instrumental (e.g. a karaoke track), pass --two-stems=vocals. Demucs still runs the full model, then sums the non-vocal stems into a single no_vocals.wav, so you get a ready-mixed instrumental and two output files instead of four:
def separate_two_stems(
    input_path: str | Path,
    output_dir: str | Path = "separated",
    model: str = "htdemucs",
) -> tuple[Path, Path]:
    """Returns (vocals_path, no_vocals_path)."""
    input_path = Path(input_path).resolve()
    output_dir = Path(output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    subprocess.run(
        [
            "demucs", "-n", model,
            "--two-stems=vocals",
            "--out", str(output_dir),
            str(input_path),
        ],
        check=True,
    )

    stem_dir = output_dir / model / input_path.stem
    return stem_dir / "vocals.wav", stem_dir / "no_vocals.wav"
Model selection guide
| Track type | Recommended model | Why |
| --- | --- | --- |
| Pop/rock post-1990 | htdemucs_ft | Fine-tuned, +0.5 dB SDR on vocals |
| Jazz, classical | htdemucs (base) | Better generalisation on complex polyphony |
| Electronic / heavily processed | mdx_extra | Trained on more electronic material |
| Speed priority | htdemucs | Faster than _ft variant |
Switch model by changing the -n flag: demucs -n htdemucs_ft ...
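If the model choice is made in code rather than by hand, the table above can be encoded as a lookup. This is only a sketch of the guide's recommendations: the track-type keys and the `pick_model` name are made up for illustration.

```python
# Track-type keys are hypothetical labels; the model names come from the table above.
MODEL_BY_TRACK_TYPE = {
    "pop_rock": "htdemucs_ft",
    "jazz_classical": "htdemucs",
    "electronic": "mdx_extra",
}


def pick_model(track_type: str, speed_priority: bool = False) -> str:
    """Map a track type to a Demucs model name, defaulting to htdemucs."""
    if speed_priority:
        return "htdemucs"
    return MODEL_BY_TRACK_TYPE.get(track_type, "htdemucs")
```

The result can be passed as the `model` argument of `separate_stems`.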
Batch processing
def batch_separate(input_dir: str, output_dir: str = "separated") -> None:
    input_dir = Path(input_dir)
    audio_files = list(input_dir.glob("*.mp3")) + list(input_dir.glob("*.wav"))
    print(f"Processing {len(audio_files)} files…")

    for audio_file in sorted(audio_files):
        print(f" → {audio_file.name}")
        try:
            stems = separate_stems(audio_file, output_dir)
            print(f" ✓ {len(stems)} stems saved")
        except subprocess.CalledProcessError as e:
            print(f" ✗ failed: {e}")
Method 2 — Spleeter (legacy, still useful for 5-stem)
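Spleeter's separation quality is lower than Demucs's for most use cases, but its 5-stem model (vocals / drums / bass / piano / other) is still useful if you specifically need isolated piano. The closest Demucs option is the experimental 6-stem htdemucs_6s model, whose piano stem the Demucs README itself describes as not working well.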
Install
pip install spleeter
⚠️ Spleeter pins specific, older TensorFlow versions and can conflict with modern Python environments. Use a dedicated virtual environment or Docker image. It is not actively maintained.
4-stem separation
from pathlib import Path

from spleeter.separator import Separator


def spleeter_separate(
    input_path: str | Path,
    output_dir: str | Path = "separated_spleeter",
    stems: int = 4,
) -> Path:
    """Separate audio using Spleeter. Returns the output directory for this track."""
    input_path = Path(input_path).resolve()
    output_dir = Path(output_dir).resolve()

    separator = Separator(f"spleeter:{stems}stems")
    separator.separate_to_file(str(input_path), str(output_dir))

    return output_dir / input_path.stem


if __name__ == "__main__":
    result_dir = spleeter_separate("my_song.mp3", stems=4)
    for stem_file in sorted(result_dir.glob("*.wav")):
        print(f" {stem_file.name}")
When to still use Spleeter
- You need isolated piano stems (use stems=5)
- You're maintaining an existing ML pipeline built on Spleeter and the migration cost isn't justified
- You're benchmarking against older research that used Spleeter as the baseline
For everything else: use Demucs.
Method 3 — StemSplit REST API (no GPU, production-ready)
If you're deploying to a serverless function, a CPU-only cloud server, or building a web app where you can't run local inference, the StemSplit stem splitter API gives you cloud GPU separation through a simple REST call. No model weights, no CUDA dependencies, no cold-start latency.
Install
pip install requests
Full implementation (upload → poll → download all stems)
import time
from pathlib import Path

import requests

STEMSPLIT_API_BASE = "https://stemsplit.io/api"


def separate_via_api(
    input_path: str | Path,
    api_key: str,
    output_dir: str | Path = "separated_api",
    stems: list[str] | None = None,
    poll_interval: int = 5,
    timeout: int = 300,
) -> dict[str, Path]:
    """Separate audio into stems using the StemSplit REST API.

    Args:
        input_path: Local audio file (mp3, wav, flac, m4a; max 10 min).
        api_key: StemSplit API key.
        output_dir: Directory to save downloaded stems.
        stems: Which stems to request. Defaults to all 4 (vocals/drums/bass/other).
        poll_interval: Seconds between status polls.
        timeout: Max seconds to wait.

    Returns:
        Dict mapping stem name → downloaded file path.
    """
    input_path = Path(input_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    stems = stems or ["vocals", "drums", "bass", "other"]
    headers = {"Authorization": f"Bearer {api_key}"}

    # 1. Upload and start the job
    with input_path.open("rb") as f:
        resp = requests.post(
            f"{STEMSPLIT_API_BASE}/separate",
            headers=headers,
            files={"file": (input_path.name, f, "audio/mpeg")},
            data={"stems": ",".join(stems)},
            timeout=60,
        )
    resp.raise_for_status()
    job_id = resp.json()["jobId"]
    print(f"Job started: {job_id}")

    # 2. Poll until complete
    elapsed = 0
    while elapsed < timeout:
        time.sleep(poll_interval)
        elapsed += poll_interval
        poll = requests.get(
            f"{STEMSPLIT_API_BASE}/jobs/{job_id}", headers=headers, timeout=30
        )
        poll.raise_for_status()  # surface HTTP errors instead of a KeyError below
        status = poll.json()
        print(f" {status['status']} ({elapsed}s)")
        if status["status"] == "completed":
            stem_urls = status["stems"]
            break
        if status["status"] == "failed":
            raise RuntimeError(f"Job failed: {status.get('error')}")
    else:
        # The loop ended without a break: the job never reached "completed"
        raise TimeoutError(f"Job {job_id} timed out after {timeout}s")

    # 3. Download each stem
    results: dict[str, Path] = {}
    for stem_name, url in stem_urls.items():
        out_path = output_dir / f"{input_path.stem}_{stem_name}.wav"
        dl = requests.get(url, timeout=120, stream=True)
        dl.raise_for_status()
        with out_path.open("wb") as f:
            for chunk in dl.iter_content(chunk_size=8192):
                f.write(chunk)
        results[stem_name] = out_path
        print(f" ✓ {stem_name}: {out_path}")

    return results


if __name__ == "__main__":
    import os

    stems = separate_via_api(
        "my_song.mp3",
        api_key=os.environ["STEMSPLIT_API_KEY"],
        output_dir="output",
    )
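The fixed `poll_interval` above is simple, but for longer jobs it can be gentler on the API to poll with increasing delays. Below is a generic sketch of that pattern, not something the StemSplit API prescribes. The `"completed"` and `"failed"` status values are taken from the example above, the helper name `poll_with_backoff` is hypothetical, and `fetch_status` is whatever callable returns the parsed job JSON:

```python
import time
from typing import Callable


def poll_with_backoff(
    fetch_status: Callable[[], dict],
    *,
    initial_delay: float = 2.0,
    factor: float = 1.5,
    max_delay: float = 30.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job completes, waiting longer each time.

    Returns the final status dict. Raises RuntimeError if the job reports
    failure and TimeoutError if the next wait would pass the deadline.
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(f"Job failed: {status.get('error')}")
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"Job did not complete within {timeout}s")
        sleep(delay)
        delay = min(delay * factor, max_delay)  # grow the wait, up to a cap
```

Inside `separate_via_api`, `fetch_status` could be a small closure around the `requests.get(...)` call for the job URL. The `sleep` parameter exists so the loop can be tested without real waiting.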
Practical use cases
1 — Karaoke track (remove lead vocals)
_, instrumental = separate_two_stems("song.mp3")
print(f"Karaoke track: {instrumental}")
# → separated/htdemucs/song/no_vocals.wav
Or request only the non-vocal stems via the API and mix them into an instrumental yourself:
import os
stems = separate_via_api(
    "song.mp3",
    api_key=os.environ["STEMSPLIT_API_KEY"],
    stems=["other", "drums", "bass"],
)
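Requesting drums, bass and other separately yields three files, so they still have to be summed to get a single instrumental. A minimal sketch, assuming the stems are already loaded as float arrays in the range [-1, 1] with the same sample rate and channel layout. File I/O is left out, and `mix_stems` is a hypothetical helper:

```python
import numpy as np


def mix_stems(stems: list[np.ndarray], peak: float = 1.0) -> np.ndarray:
    """Sum stems sample by sample, scaling down only if the result would clip.

    Arrays may be mono (n,) or multi-channel (n, channels); they are trimmed
    to the shortest length along the first axis before summing.
    """
    if not stems:
        raise ValueError("need at least one stem")
    length = min(len(s) for s in stems)
    mix = np.sum([np.asarray(s, dtype=np.float32)[:length] for s in stems], axis=0)
    max_abs = float(np.max(np.abs(mix))) if mix.size else 0.0
    if max_abs > peak:
        mix = mix * (peak / max_abs)  # uniform gain keeps the balance between stems
    return mix
```

Because the gain is applied to the whole mix, the relative balance between stems is preserved even when scaling kicks in.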
For the full karaoke generation workflow including synced lyrics, see how to extract acapella from any song in Python.
2 — Isolated drum track for practice
stems = separate_stems("song.mp3")
drum_track = stems["drums"]
print(f"Drum track: {drum_track}")
# Play this alongside the original to practice along
3 — ML training dataset generation
import json
from pathlib import Path


def build_stem_dataset(
    input_dir: str,
    output_dir: str = "dataset",
    model: str = "htdemucs",
) -> None:
    """Separate a folder of songs into stems for ML training."""
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    # Create the dataset root up front so the manifest can be written
    # even when the input folder contains no MP3 files.
    output_dir.mkdir(parents=True, exist_ok=True)
    manifest = []

    for audio_file in sorted(input_dir.glob("*.mp3")):
        stems = separate_stems(audio_file, output_dir / "stems", model)
        manifest.append({
            "source": str(audio_file),
            "stems": {k: str(v) for k, v in stems.items()},
        })
        print(f" ✓ {audio_file.name}")

    manifest_path = output_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))
    print(f"\nManifest saved: {manifest_path} ({len(manifest)} tracks)")
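For training, the manifest usually needs a train/validation split that stays stable as more songs are added. One common approach is to hash each source path into a bucket, so that a track's assignment never depends on which other tracks are present. This is a sketch with hypothetical helper names and an assumed 10% validation share:

```python
import hashlib


def assign_split(source: str, val_fraction: float = 0.1) -> str:
    """Deterministically assign a track to "train" or "val" from its source path."""
    digest = hashlib.sha256(source.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


def split_manifest(manifest: list[dict], val_fraction: float = 0.1) -> dict[str, list[dict]]:
    """Group manifest entries (as written by build_stem_dataset) by split."""
    splits: dict[str, list[dict]] = {"train": [], "val": []}
    for entry in manifest:
        splits[assign_split(entry["source"], val_fraction)].append(entry)
    return splits
```

Splitting per source track, rather than per stem file, keeps all four stems of a song on the same side of the split.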
Conclusion
For new Python projects in 2026, the hierarchy is clear:
- Local Demucs (htdemucs_ft): best quality, free, GPU optional
- StemSplit API: production cloud option, no GPU or model management
- Spleeter: only if you need 5-stem (piano) or are maintaining legacy code
The choice between local and API depends on your infrastructure: if you're running a batch job on a machine you control, go local. If you're building a web service or running on cloud functions, the REST API is far simpler to operate.
Related articles
- How to Extract Acapella from Any Song in Python — vocals-only extraction in depth, including yt-dlp pipeline
- How to Isolate Vocals from Any Song: 5 Methods Compared — non-Python methods including browser tools