How to Extract Acapella from Any Song in Python: Demucs, API & YouTube Pipeline (2026)
Three working methods — local GPU inference, a REST API call, and a yt-dlp pipeline — with copy-paste code for each
You're building a remix tool, a karaoke generator, a vocal training dataset, or just trying to get a clean acapella for a mash-up — and you need a programmatic way to strip the instrumental from any audio file.
The good news: in 2026 this is largely a solved problem. AI source separation models can now produce usable acapella tracks from pop recordings in a single command. The bad news is that there are several ways to do it, and the right one depends on your hardware, your use case, and whether you can wait for a local GPU job.
This guide covers all three approaches with working Python code:
- Local Demucs: free, runs on CPU or GPU, highest quality
- StemSplit REST API: cloud-based, no GPU needed, one requests call
- YouTube → acapella pipeline: yt-dlp + either of the above
⚠️ Copyright notice: Extracting vocals from commercially released music is fine for personal, research, or transformative use in most jurisdictions. Distributing the resulting acapella or using it commercially requires a licence from the rights holders. Always verify the licence status of source material before publishing.
What is the best AI acapella extractor?
Short answer: For local extraction, htdemucs (Meta AI) is the current state of the art — it achieves an SDR of ~8.9 dB on vocals against the MUSDB18-HQ benchmark. For cloud extraction without a GPU, the StemSplit API returns a clean acapella stem in under 90 seconds for a three-minute track. See this comparison of the best acapella extractors for a broader tool-by-tool breakdown.
| Method | SDR (vocals) | GPU required | Cost | Best for |
|---|---|---|---|---|
| htdemucs (local) | ~8.9 dB | No (CPU works, but slowly) | Free | Batch jobs, max quality |
| htdemucs_ft (local, fine-tuned) | ~9.4 dB | Recommended | Free | Highest quality single tracks |
| StemSplit API | ~8.5 dB | No | Free tier + credits | Apps, automation, no-GPU servers |
| Older Demucs v2/v3 | ~7.5 dB | No | Free | Legacy compatibility |
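For readers unfamiliar with the SDR column: it is a signal-to-distortion ratio in decibels, that is, the energy of the reference vocal divided by the energy of the separation error. The sketch below shows only this basic definition on plain Python lists. Benchmark figures such as the ones in the table are normally computed with dedicated evaluation tooling across many windows and tracks, so `sdr_db` (a hypothetical helper name) illustrates the formula and does not reproduce the table.

```python
import math
from typing import Sequence


def sdr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Basic signal-to-distortion ratio: 10 * log10(|s|^2 / |s - s_hat|^2)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0:
        raise ValueError("reference is silent; SDR is undefined")
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_energy / error_energy)
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the error energy relative to the vocal, which is why differences of a few tenths of a dB between models are subtle in practice.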
Method 1 — Local Demucs (best quality, free)
Install
pip install demucs
Demucs will download model weights (~80 MB for htdemucs) on first run. It uses your GPU automatically if PyTorch detects CUDA; otherwise it falls back to CPU (expect 3–5× real-time on a modern CPU core).
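If you are unsure which path a given machine will take, a quick check along these lines can help. It is a minimal sketch, assuming that the GPU Demucs ends up using is the one PyTorch reports through `torch.cuda.is_available()`; the helper name `likely_demucs_device` is made up for this example.

```python
import importlib.util


def likely_demucs_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a CUDA GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Demucs would most likely run on: {likely_demucs_device()}")
```

Running this before a large batch job gives you a chance to fix a CUDA installation problem first, so you do not discover the CPU fallback halfway through.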
One-liner command
demucs -n htdemucs --two-stems=vocals audio.mp3
The --two-stems=vocals flag produces only two outputs:
- vocals.wav: the acapella
- no_vocals.wav: the instrumental
Output lands in separated/htdemucs/<track_name>/.
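The output layout described above can be computed ahead of time, which is handy when a later step needs to know where the stems will appear. The sketch below only mirrors the directory and file names stated in this section; `expected_stem_paths` is a hypothetical helper, not part of Demucs.

```python
from pathlib import Path


def expected_stem_paths(
    input_path: str,
    output_dir: str = "separated",
    model: str = "htdemucs",
) -> dict[str, Path]:
    """Predict where a --two-stems=vocals run writes its two files."""
    # Layout from the text: <output_dir>/<model>/<track_name>/
    track_dir = Path(output_dir) / model / Path(input_path).stem
    return {
        "vocals": track_dir / "vocals.wav",
        "instrumental": track_dir / "no_vocals.wav",
    }


if __name__ == "__main__":
    for name, path in expected_stem_paths("audio.mp3").items():
        print(f"{name}: {path}")
```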
Python wrapper (subprocess)
import subprocess
from pathlib import Path


def extract_acapella(
    input_path: str | Path,
    output_dir: str | Path = "separated",
    model: str = "htdemucs",
) -> Path:
    """Extract acapella vocals from an audio file using Demucs.

    Returns the path to the vocals.wav output file.
    """
    input_path = Path(input_path).resolve()
    output_dir = Path(output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    subprocess.run(
        [
            "demucs",
            "-n", model,
            "--two-stems=vocals",
            "--out", str(output_dir),
            str(input_path),
        ],
        check=True,
    )

    # Demucs nests output: <output_dir>/<model>/<stem_name>/vocals.wav
    stem_name = input_path.stem
    return output_dir / model / stem_name / "vocals.wav"


if __name__ == "__main__":
    acapella = extract_acapella("my_song.mp3")
    print(f"Acapella saved to: {acapella}")
Batch processing
import subprocess
from pathlib import Path

# Reuses extract_acapella() from the wrapper above.


def batch_extract(input_dir: str, output_dir: str = "separated") -> list[Path]:
    input_dir = Path(input_dir)
    results = []
    for audio_file in sorted(input_dir.glob("*.mp3")) + sorted(input_dir.glob("*.wav")):
        print(f"Processing: {audio_file.name}")
        try:
            result = extract_acapella(audio_file, output_dir)
            results.append(result)
            print(f"  ✓ {result}")
        except subprocess.CalledProcessError as e:
            # Keep going so one bad file does not abort the whole batch
            print(f"  ✗ failed: {e}")
    return results
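Long batches are often interrupted, and re-running the loop above would separate every file again. One possible refinement, sketched here under the assumption that outputs follow the `<output_dir>/<model>/<track_name>/vocals.wav` layout described earlier, is to list only the files that do not have a vocals stem yet. `pending_tracks` is a hypothetical helper name.

```python
from pathlib import Path

AUDIO_SUFFIXES = {".mp3", ".wav"}  # the same two formats the batch loop looks for


def pending_tracks(input_dir, output_dir="separated", model="htdemucs") -> list[Path]:
    """List audio files in input_dir that have no vocals.wav under output_dir yet."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pending = []
    for audio_file in sorted(input_dir.iterdir()):
        if not audio_file.is_file() or audio_file.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        vocals = output_dir / model / audio_file.stem / "vocals.wav"
        if not vocals.exists():
            pending.append(audio_file)
    return pending
```

Feeding the result of `pending_tracks(...)` into the extraction loop makes the batch resumable. Note that an existing `vocals.wav` is taken at face value here; a partially written file from a crashed run would still be skipped.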
Model selection tip
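Use htdemucs_ft (fine-tuned) for the highest quality on music from 1990 onwards. It improves vocal SDR by ~0.5 dB over the base htdemucs model, but the Demucs documentation notes that separation takes roughly four times as long, so reserve it for tracks where the extra quality matters: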
demucs -n htdemucs_ft --two-stems=vocals audio.mp3
For a full local setup walkthrough (including CUDA configuration), see StemSplit's Demucs local setup guide.
Method 2 — StemSplit REST API (no GPU required)
If you're running on a server without a GPU, building a web app, or processing audio in a serverless function, the StemSplit acapella extractor API is the fastest path. You upload an audio file, poll for completion, and download the acapella stem — no model weights, no CUDA dependencies.
Install
pip install requests
Full implementation
import mimetypes
import time
import requests
from pathlib import Path

STEMSPLIT_API_BASE = "https://stemsplit.io/api"


def extract_acapella_api(
    input_path: str | Path,
    api_key: str,
    output_path: str | Path = "acapella.wav",
    poll_interval: int = 5,
    timeout: int = 300,
) -> Path:
    """Extract acapella using the StemSplit API.

    Args:
        input_path: Local path to the audio file (mp3, wav, flac, m4a).
        api_key: Your StemSplit API key.
        output_path: Where to save the acapella WAV.
        poll_interval: Seconds between status checks.
        timeout: Max seconds to wait before raising TimeoutError.

    Returns:
        Path to the downloaded acapella file.
    """
    input_path = Path(input_path)
    output_path = Path(output_path)
    headers = {"Authorization": f"Bearer {api_key}"}
    # Guess the MIME type from the extension instead of assuming MP3
    mime_type = mimetypes.guess_type(input_path.name)[0] or "application/octet-stream"

    # 1. Upload + start job
    with input_path.open("rb") as f:
        resp = requests.post(
            f"{STEMSPLIT_API_BASE}/separate",
            headers=headers,
            files={"file": (input_path.name, f, mime_type)},
            data={"stems": "vocals"},  # vocals-only separation
            timeout=60,
        )
    resp.raise_for_status()
    job = resp.json()
    job_id = job["jobId"]
    print(f"Job started: {job_id}")

    # 2. Poll until complete
    elapsed = 0
    while elapsed < timeout:
        time.sleep(poll_interval)
        elapsed += poll_interval
        status_resp = requests.get(
            f"{STEMSPLIT_API_BASE}/jobs/{job_id}",
            headers=headers,
            timeout=30,
        )
        status_resp.raise_for_status()
        status = status_resp.json()
        print(f"  Status: {status['status']} ({elapsed}s elapsed)")
        if status["status"] == "completed":
            vocals_url = status["stems"]["vocals"]
            break
        if status["status"] == "failed":
            raise RuntimeError(f"Job failed: {status.get('error', 'unknown error')}")
    else:
        # The while loop ran out of time without hitting "break"
        raise TimeoutError(f"Job {job_id} did not complete within {timeout}s")

    # 3. Download acapella
    dl = requests.get(vocals_url, timeout=120, stream=True)
    dl.raise_for_status()
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("wb") as f:
        for chunk in dl.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"✓ Acapella saved: {output_path}")
    return output_path


if __name__ == "__main__":
    import os

    acapella = extract_acapella_api(
        "my_song.mp3",
        api_key=os.environ["STEMSPLIT_API_KEY"],
        output_path="output/acapella.wav",
    )
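The polling step in the implementation above sleeps for a fixed interval and counts elapsed time by adding up the sleeps. A possible variation, sketched below, separates the waiting logic from the HTTP call so it can be tested without a network, uses a monotonic clock for the deadline, and grows the delay between checks. The "completed" and "failed" status values come from the code above; `wait_for_job`, its parameters, and the 1.5× growth factor are assumptions made for this sketch and are not part of the StemSplit API.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    timeout: float = 300.0,
    initial_delay: float = 2.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() with growing delays until the job completes or fails."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"Job failed: {status.get('error', 'unknown error')}")
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"Job did not complete within {timeout}s")
        sleep(delay)
        delay = min(delay * 1.5, max_delay)  # back off, but never beyond max_delay
```

In the function above, `fetch_status` would be a small closure that performs the `requests.get(...)` call and returns `status_resp.json()`. Short jobs are then noticed quickly, while long jobs generate fewer status requests.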
The API handles FLAC, M4A, MP3, and WAV inputs up to 10 minutes. For shorter tracks (< 4 minutes) completion typically takes 20–60 seconds.
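Given those limits, it can be worth rejecting unsuitable files locally before spending an upload on them. The check below is a minimal sketch based only on the formats and the 10-minute cap stated above; `check_upload` is a hypothetical helper, and the caller is assumed to know the track duration already (for example from its own metadata).

```python
from pathlib import Path
from typing import Optional

SUPPORTED_SUFFIXES = {".flac", ".m4a", ".mp3", ".wav"}  # formats listed above
MAX_DURATION_SECONDS = 10 * 60  # the 10-minute limit stated above


def check_upload(input_path: str, duration_seconds: Optional[float] = None) -> None:
    """Raise ValueError if the file is clearly outside the stated API limits."""
    suffix = Path(input_path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported format {suffix!r}; expected one of {sorted(SUPPORTED_SUFFIXES)}")
    if duration_seconds is not None and duration_seconds > MAX_DURATION_SECONDS:
        raise ValueError(f"Track is {duration_seconds:.0f}s long; the limit is {MAX_DURATION_SECONDS}s")
```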
How to extract acapella from a YouTube video?
Short answer: Use yt-dlp to download the best-quality audio, then pipe it through either Demucs or the StemSplit API. This two-step pipeline handles any public YouTube video in under two minutes.
⚠️ Important: YouTube's Terms of Service (Section 5) prohibit downloading videos except where the platform provides a download button. Only process videos you own, videos licensed under Creative Commons, or content you have explicit permission to download.
Install yt-dlp
pip install yt-dlp
Pipeline code
import shutil
import subprocess
import tempfile
from pathlib import Path

# Reuses extract_acapella() and extract_acapella_api() from the sections above.


def youtube_to_acapella(
    youtube_url: str,
    output_path: str | Path = "acapella.wav",
    method: str = "local",  # "local" or "api"
    api_key: str | None = None,
) -> Path:
    """Download audio from YouTube and extract acapella vocals.

    Args:
        youtube_url: YouTube video URL.
        output_path: Destination for the acapella WAV.
        method: "local" uses Demucs; "api" uses StemSplit REST API.
        api_key: Required when method="api".
    """
    output_path = Path(output_path)
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        audio_file = tmp_dir / "source.%(ext)s"

        # Download best audio quality as WAV
        subprocess.run(
            [
                "yt-dlp",
                "--extract-audio",
                "--audio-format", "wav",
                "--audio-quality", "0",
                "--output", str(audio_file),
                youtube_url,
            ],
            check=True,
        )

        # Find the downloaded file
        source = next(tmp_dir.glob("source.*"))
        print(f"Downloaded: {source.name} ({source.stat().st_size // 1024} KB)")

        if method == "local":
            vocals = extract_acapella(source, output_dir=tmp_dir)
            # Copy the stem out before the temporary directory is deleted
            output_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(vocals, output_path)
            return output_path
        elif method == "api":
            if not api_key:
                raise ValueError("api_key is required for method='api'")
            return extract_acapella_api(source, api_key=api_key, output_path=output_path)
        else:
            raise ValueError(f"Unknown method: {method!r}. Use 'local' or 'api'.")


# Example: extract acapella from a YouTube video you own
result = youtube_to_acapella(
    "https://www.youtube.com/watch?v=YOUR_VIDEO_ID",
    output_path="output/acapella.wav",
    method="local",
)
print(f"Done: {result}")
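Since the pipeline hands the URL straight to an external command, it can be useful to reject obviously malformed input before starting a download. The helper below is a sketch that recognises only two common URL shapes (`youtube.com/watch?v=...` and `youtu.be/...`); the host list and the name `youtube_video_id` are assumptions made for this example, and yt-dlp itself accepts many more forms.

```python
from urllib.parse import parse_qs, urlparse

WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "music.youtube.com"}


def youtube_video_id(url: str) -> str:
    """Return the video ID from a watch or youtu.be URL, or raise ValueError."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host in WATCH_HOSTS and parsed.path == "/watch":
        # Standard links carry the ID in the "v" query parameter
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""
    if not video_id:
        raise ValueError(f"Not a recognised YouTube watch URL: {url!r}")
    return video_id
```

Calling this at the top of `youtube_to_acapella` would fail fast on typos, and the returned ID also makes a convenient output file name.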
For background music removal from video content (a related but different use case), see how to remove background music from YouTube videos.
Acapella quality tips
Even with a top-tier model, there are a few things that significantly affect output quality:
Model choice
| If your track is… | Use |
|---|---|
| Pop/rock post-1990 | htdemucs_ft |
| Jazz, classical, or complex polyphony | htdemucs (base) or mdx_extra |
| Old/lo-fi recordings | demucs (v2); newer models sometimes overfit to modern production |
LUFS normalisation before separation
Pre-normalising your source to −14 LUFS before feeding it to Demucs often improves separation consistency, particularly for very loud or very quiet recordings:
pip install pyloudnorm soundfile
import soundfile as sf
import pyloudnorm as pyln
import numpy as np


def normalise_to_lufs(input_wav: str, output_wav: str, target_lufs: float = -14.0) -> None:
    data, rate = sf.read(input_wav)
    meter = pyln.Meter(rate)
    loudness = meter.integrated_loudness(data)
    gain_db = target_lufs - loudness
    normalised = data * (10 ** (gain_db / 20))
    normalised = np.clip(normalised, -1.0, 1.0)
    sf.write(output_wav, normalised, rate)
    print(f"Normalised: {loudness:.1f} → {target_lufs} LUFS")
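The gain step in the function above is simple arithmetic: the gain in dB is the target loudness minus the measured loudness, and the linear factor is 10^(gain_dB / 20). Because that function hard-clips anything that ends up beyond ±1.0, one possible refinement is to cap the gain so the loudest sample stays at or below full scale. The sketch below shows that calculation on its own; `safe_gain` and its `ceiling` parameter are hypothetical names introduced for this example.

```python
def safe_gain(
    measured_lufs: float,
    peak: float,
    target_lufs: float = -14.0,
    ceiling: float = 1.0,
) -> float:
    """Linear gain toward target_lufs, capped so that peak * gain <= ceiling."""
    gain = 10 ** ((target_lufs - measured_lufs) / 20)
    if peak > 0:
        gain = min(gain, ceiling / peak)  # never push the loudest sample past the ceiling
    return gain


if __name__ == "__main__":
    # A track measured at -20 LUFS needs +6 dB, i.e. a factor of about 2.0
    print(round(safe_gain(-20.0, peak=0.25), 3))
```

In the function above, `peak` would be `np.max(np.abs(data))`. The trade-off is that a quiet track with high peaks lands somewhat below the target loudness but is not distorted by clipping.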
Reducing reverb artefacts
Demucs sometimes bleeds reverb tails from the instrumental into the vocals stem. Running the acapella through a light de-reverb step (iZotope RX, ReaFIR, or the open-source asteroid toolkit) cleans this up before use in a mix.
No-code option: StemSplit Acapella Extractor
If you just need a quick acapella without writing code, StemSplit's online acapella extractor handles the whole pipeline in the browser — upload your track, get back vocals and instrumental as separate WAV downloads. No install, no API key for the free tier.
For an in-depth comparison of online and offline tools, see the best acapella extractors guide.
Related articles
- How to Isolate Vocals from Any Song: 5 Methods Compared — broader overview including non-Python tools
- Remove Background Music from YouTube Videos — when you need the vocal track from a video, not an audio file