How to Extract Acapella from Any Song in Python: Demucs, API & YouTube Pipeline (2026)

Three working methods — local GPU inference, a REST API call, and a yt-dlp pipeline — with copy-paste code for each

Published
7 min read

You're building a remix tool, a karaoke generator, a vocal training dataset, or just trying to get a clean acapella for a mash-up — and you need a programmatic way to strip the instrumental from any audio file.

The good news: in 2026 this is largely a solved problem. AI source separation models have reached the point where you can get usable acapella tracks from pop recordings in a single command. The bad news is that there are three main ways to do it, and the right one depends on your hardware, your use case, and whether you can wait for a local GPU job.

This guide covers all three approaches with working Python code:

  1. Local Demucs — free, runs on CPU or GPU, highest quality
  2. StemSplit REST API — cloud-based, no GPU needed, one requests call
  3. YouTube → acapella pipeline — yt-dlp + either of the above

⚠️ Copyright notice: Extracting vocals from commercially released music is fine for personal, research, or transformative use in most jurisdictions. Distributing the resulting acapella or using it commercially requires a sync licence from the rights holder. Always verify the licence status of source material before publishing.


What is the best AI acapella extractor?

Short answer: For local extraction, htdemucs (Meta AI) is the current state of the art — it achieves an SDR of ~8.9 dB on vocals against the MUSDB18-HQ benchmark. For cloud extraction without a GPU, the StemSplit API returns a clean acapella stem in under 90 seconds for a three-minute track. See this comparison of the best acapella extractors for a broader tool-by-tool breakdown.

| Method | SDR (vocals) | GPU required | Cost | Best for |
|---|---|---|---|---|
| htdemucs (local) | ~8.9 dB | No (CPU ok, slow) | Free | Batch jobs, max quality |
| htdemucs_ft (local, fine-tuned) | ~9.4 dB | Recommended | Free | Highest quality single tracks |
| StemSplit API | ~8.5 dB | No | Free tier + credits | Apps, automation, no-GPU servers |
| Older Demucs v2/v3 | ~7.5 dB | No | Free | Legacy compatibility |

Method 1 — Local Demucs (best quality, free)

Install

pip install demucs

Demucs will download model weights (~80 MB for htdemucs) on first run. It uses your GPU automatically if PyTorch detects CUDA; otherwise it falls back to CPU, where processing takes roughly 3–5× the track's duration on a modern CPU core.
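
To see which device a job is likely to land on before starting a long run, a quick check along these lines may help. It is a minimal sketch: `pick_device` is a hypothetical helper name, and it assumes PyTorch's CUDA detection is what decides the device, as described above. If PyTorch cannot be imported it simply reports `cpu`.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch reports a usable CUDA GPU, else "cpu"."""
    try:
        import torch  # normally installed alongside Demucs
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Demucs will most likely run on: {pick_device()}")
```

The Demucs command line also accepts `-d cpu` or `-d cuda` if you want to force a device rather than rely on detection.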

One-liner command

demucs -n htdemucs --two-stems=vocals audio.mp3

The --two-stems=vocals flag produces only two outputs:

  • vocals.wav — the acapella
  • no_vocals.wav — the instrumental

Output lands in separated/htdemucs/<track_name>/.
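
If a script needs both stems, the output layout described above can be turned into paths up front. This is a small sketch, assuming the default `separated` output folder and the `--two-stems=vocals` file names listed above; `expected_stems` is a hypothetical helper, not part of Demucs.

```python
from pathlib import Path


def expected_stems(track: str, model: str = "htdemucs",
                   out_dir: str = "separated") -> dict:
    """Predict where Demucs writes the two stems for a given input file."""
    folder = Path(out_dir) / model / Path(track).stem
    return {
        "vocals": folder / "vocals.wav",        # the acapella
        "no_vocals": folder / "no_vocals.wav",  # the instrumental
    }


if __name__ == "__main__":
    for name, path in expected_stems("audio.mp3").items():
        print(f"{name}: {path} (exists: {path.exists()})")
```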

Python wrapper (subprocess)

import subprocess
from pathlib import Path


def extract_acapella(
    input_path: str | Path,
    output_dir: str | Path = "separated",
    model: str = "htdemucs",
) -> Path:
    """Extract acapella vocals from an audio file using Demucs.

    Returns the path to the vocals.wav output file.
    """
    input_path = Path(input_path).resolve()
    output_dir = Path(output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    subprocess.run(
        [
            "demucs",
            "-n", model,
            "--two-stems=vocals",
            "--out", str(output_dir),
            str(input_path),
        ],
        check=True,
    )

    # Demucs nests output: <output_dir>/<model>/<stem_name>/vocals.wav
    stem_name = input_path.stem
    return output_dir / model / stem_name / "vocals.wav"


if __name__ == "__main__":
    acapella = extract_acapella("my_song.mp3")
    print(f"Acapella saved to: {acapella}")
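
When the `demucs` executable is missing, `subprocess.run` raises a fairly terse `FileNotFoundError`. One option is to check for the tool first and fail with a clearer message. The sketch below uses only the standard library; `require_tool` is a hypothetical helper name.

```python
import shutil


def require_tool(name: str) -> str:
    """Return the full path of an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name!r} not found on PATH; is it installed?")
    return path
```

Calling `require_tool("demucs")` at the top of `extract_acapella` would surface a missing install before any audio is touched.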

Batch processing

import subprocess
from pathlib import Path

# Reuses extract_acapella() from the wrapper above


def batch_extract(input_dir: str, output_dir: str = "separated") -> list[Path]:
    input_dir = Path(input_dir)
    results = []
    for audio_file in sorted(input_dir.glob("*.mp3")) + sorted(input_dir.glob("*.wav")):
        print(f"Processing: {audio_file.name}")
        try:
            result = extract_acapella(audio_file, output_dir)
            results.append(result)
            print(f"  ✓ {result}")
        except subprocess.CalledProcessError as e:
            print(f"  ✗ failed: {e}")
    return results
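
The loop above starts one Demucs process per file, so the model is loaded again for every track. The Demucs command line also accepts several track paths in a single invocation. The sketch below builds such a command; `build_batch_command` is a hypothetical helper, and whether one invocation is actually faster on your machine is an assumption worth measuring.

```python
from pathlib import Path


def build_batch_command(files: list, out_dir: str = "separated",
                        model: str = "htdemucs") -> list:
    """Build one `demucs` command line covering every file in `files`."""
    if not files:
        raise ValueError("no input files given")
    return [
        "demucs",
        "-n", model,
        "--two-stems=vocals",
        "--out", str(out_dir),
        *[str(f) for f in files],  # all tracks go at the end of the command
    ]


if __name__ == "__main__":
    tracks = sorted(Path(".").glob("*.mp3"))
    if tracks:
        # Pass the list to subprocess.run(cmd, check=True) to actually run it
        print(" ".join(build_batch_command(tracks)))
```

The per-file loop remains the safer default when you need to know exactly which track failed.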

Model selection tip

Use htdemucs_ft (fine-tuned) for the highest quality on music from 1990 onwards — it improves vocal SDR by ~0.5 dB over the base htdemucs model at the cost of a slightly longer run time:

demucs -n htdemucs_ft --two-stems=vocals audio.mp3

For a full local setup walkthrough (including CUDA configuration), see StemSplit's Demucs local setup guide.


Method 2 — StemSplit REST API (no GPU required)

If you're running on a server without a GPU, building a web app, or processing audio in a serverless function, the StemSplit acapella extractor API is the fastest path. You upload an audio file, poll for completion, and download the acapella stem — no model weights, no CUDA dependencies.

Install

pip install requests

Full implementation

import mimetypes
import time
import requests
from pathlib import Path


STEMSPLIT_API_BASE = "https://stemsplit.io/api"


def extract_acapella_api(
    input_path: str | Path,
    api_key: str,
    output_path: str | Path = "acapella.wav",
    poll_interval: int = 5,
    timeout: int = 300,
) -> Path:
    """Extract acapella using the StemSplit API.

    Args:
        input_path: Local path to the audio file (mp3, wav, flac, m4a).
        api_key: Your StemSplit API key.
        output_path: Where to save the acapella WAV.
        poll_interval: Seconds between status checks.
        timeout: Max seconds to wait before raising TimeoutError.

    Returns:
        Path to the downloaded acapella file.
    """
    input_path = Path(input_path)
    output_path = Path(output_path)
    headers = {"Authorization": f"Bearer {api_key}"}

    # Label the upload with a MIME type matching the file extension instead of
    # always claiming MP3 (generic binary fallback if the extension is unknown)
    mime_type = mimetypes.guess_type(input_path.name)[0] or "application/octet-stream"

    # 1. Upload + start job
    with input_path.open("rb") as f:
        resp = requests.post(
            f"{STEMSPLIT_API_BASE}/separate",
            headers=headers,
            files={"file": (input_path.name, f, mime_type)},
            data={"stems": "vocals"},  # vocals-only separation
            timeout=60,
        )
    resp.raise_for_status()
    job = resp.json()
    job_id = job["jobId"]
    print(f"Job started: {job_id}")

    # 2. Poll until complete
    elapsed = 0
    while elapsed < timeout:
        time.sleep(poll_interval)
        elapsed += poll_interval

        status_resp = requests.get(
            f"{STEMSPLIT_API_BASE}/jobs/{job_id}",
            headers=headers,
            timeout=30,
        )
        status_resp.raise_for_status()
        status = status_resp.json()

        print(f"  Status: {status['status']} ({elapsed}s elapsed)")
        if status["status"] == "completed":
            vocals_url = status["stems"]["vocals"]
            break
        if status["status"] == "failed":
            raise RuntimeError(f"Job failed: {status.get('error', 'unknown error')}")
    else:
        raise TimeoutError(f"Job {job_id} did not complete within {timeout}s")

    # 3. Download acapella
    dl = requests.get(vocals_url, timeout=120, stream=True)
    dl.raise_for_status()
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("wb") as f:
        for chunk in dl.iter_content(chunk_size=8192):
            f.write(chunk)

    print(f"✓ Acapella saved: {output_path}")
    return output_path


if __name__ == "__main__":
    import os
    acapella = extract_acapella_api(
        "my_song.mp3",
        api_key=os.environ["STEMSPLIT_API_KEY"],
        output_path="output/acapella.wav",
    )

The API handles FLAC, M4A, MP3, and WAV inputs up to 10 minutes. For shorter tracks (< 4 minutes) completion typically takes 20–60 seconds.
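
Checking a file's length locally avoids uploading a track that would exceed the limit. The sketch below handles PCM WAV files only, using the standard-library `wave` module; other formats need a different reader. `MAX_SECONDS` takes the 10-minute figure quoted above as an assumption, and both function names are hypothetical.

```python
import wave

MAX_SECONDS = 10 * 60  # upload limit quoted in this article; treat as an assumption


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def within_upload_limit(path: str, limit: float = MAX_SECONDS) -> bool:
    """True if the WAV file is no longer than `limit` seconds."""
    return wav_duration_seconds(path) <= limit
```

A caller could run `within_upload_limit(...)` before `extract_acapella_api(...)` and fall back to local Demucs for longer files.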


How to extract acapella from a YouTube video?

Short answer: Use yt-dlp to download the best-quality audio, then pipe it through either Demucs or the StemSplit API. This two-step pipeline handles any public YouTube video in under two minutes.

⚠️ Important: YouTube's Terms of Service (Section 5) prohibit downloading videos except where the platform provides a download button. Only process videos you own, videos licensed under Creative Commons, or content you have explicit permission to download.

Install yt-dlp

pip install yt-dlp

Pipeline code

import subprocess
import tempfile
from pathlib import Path


def youtube_to_acapella(
    youtube_url: str,
    output_path: str | Path = "acapella.wav",
    method: str = "local",  # "local" or "api"
    api_key: str | None = None,
) -> Path:
    """Download audio from YouTube and extract acapella vocals.

    Args:
        youtube_url: YouTube video URL.
        output_path: Destination for the acapella WAV.
        method: "local" uses Demucs; "api" uses StemSplit REST API.
        api_key: Required when method="api".
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        audio_file = tmp_dir / "source.%(ext)s"

        # Download best audio quality as WAV
        subprocess.run(
            [
                "yt-dlp",
                "--extract-audio",
                "--audio-format", "wav",
                "--audio-quality", "0",
                "--output", str(audio_file),
                youtube_url,
            ],
            check=True,
        )

        # Find the downloaded file
        source = next(tmp_dir.glob("source.*"))
        print(f"Downloaded: {source.name} ({source.stat().st_size // 1024} KB)")

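        if method == "local":
            # Demucs writes inside the temporary directory, which is deleted
            # when this block exits, so copy the stem to output_path first
            vocals = extract_acapella(source, output_dir=tmp_dir)
            output_path = Path(output_path)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            output_path.write_bytes(vocals.read_bytes())
            return output_path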
        elif method == "api":
            if not api_key:
                raise ValueError("api_key is required for method='api'")
            return extract_acapella_api(source, api_key=api_key, output_path=output_path)
        else:
            raise ValueError(f"Unknown method: {method!r}. Use 'local' or 'api'.")


# Example — extract acapella from a YouTube video you own
result = youtube_to_acapella(
    "https://www.youtube.com/watch?v=YOUR_VIDEO_ID",
    output_path="output/acapella.wav",
    method="local",
)
print(f"Done: {result}")
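
The download step can also be done through yt-dlp's Python interface instead of a subprocess. The sketch below mirrors the command-line flags used above (best audio, converted to WAV); `build_ydl_opts` and `download_audio` are hypothetical helper names. As with `--extract-audio` on the command line, the conversion step needs FFmpeg installed.

```python
import os


def build_ydl_opts(out_dir: str) -> dict:
    """Options for yt_dlp.YoutubeDL: best audio stream, converted to WAV."""
    return {
        "format": "bestaudio/best",
        "outtmpl": os.path.join(out_dir, "source.%(ext)s"),
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
        "quiet": True,
    }


def download_audio(url: str, out_dir: str) -> str:
    """Download `url` as WAV into `out_dir` and return the expected file path."""
    import yt_dlp  # imported lazily so the options helper works without it

    with yt_dlp.YoutubeDL(build_ydl_opts(out_dir)) as ydl:
        ydl.download([url])
    return os.path.join(out_dir, "source.wav")
```

Staying in-process makes it easier to catch download errors as Python exceptions rather than parsing a non-zero exit code.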

For background music removal from video content (a related but different use case), see how to remove background music from YouTube videos.


Acapella quality tips

Even with a top-tier model, there are a few things that significantly affect output quality:

Model choice

| If your track is… | Use |
|---|---|
| Pop/rock post-1990 | htdemucs_ft |
| Jazz, classical, or complex polyphony | htdemucs (base) or mdx_extra |
| Old/lo-fi recordings | demucs (v2), since newer models sometimes overfit to modern production |

LUFS normalisation before separation

Pre-normalising your source to −14 LUFS before feeding it to Demucs often improves separation consistency, particularly for very loud or very quiet recordings:

pip install pyloudnorm soundfile

import soundfile as sf
import pyloudnorm as pyln
import numpy as np


def normalise_to_lufs(input_wav: str, output_wav: str, target_lufs: float = -14.0) -> None:
    data, rate = sf.read(input_wav)
    meter = pyln.Meter(rate)
    loudness = meter.integrated_loudness(data)
    gain_db = target_lufs - loudness
    normalised = data * (10 ** (gain_db / 20))
    normalised = np.clip(normalised, -1.0, 1.0)
    sf.write(output_wav, normalised, rate)
    print(f"Normalised: {loudness:.1f} → {target_lufs} LUFS")
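
The function above applies the full gain and then hard-clips anything beyond ±1.0, which can distort a quiet track that needs a large boost. One alternative is to cap the gain so the loudest sample never exceeds full scale. The sketch below shows the arithmetic in plain Python; `lufs_gain` and `peak_safe_gain` are hypothetical helpers, and `peak` is assumed to be the largest absolute sample value of the source.

```python
def lufs_gain(loudness: float, target: float = -14.0) -> float:
    """Linear gain factor that moves `loudness` (LUFS) to `target` (LUFS)."""
    return 10 ** ((target - loudness) / 20)


def peak_safe_gain(loudness: float, peak: float, target: float = -14.0) -> float:
    """Like lufs_gain, but capped so that `peak * gain` never exceeds 1.0."""
    gain = lufs_gain(loudness, target)
    if peak <= 0:
        return gain  # silent input: nothing can clip
    return min(gain, 1.0 / peak)


if __name__ == "__main__":
    # A track at -20 LUFS needs +6 dB to reach -14 LUFS, about a 2x linear gain.
    # With a peak of 0.8 the capped gain is 1.25, so the result stays below target.
    print(lufs_gain(-20.0), peak_safe_gain(-20.0, peak=0.8))
```

When the cap applies, the output ends up quieter than the target, which is usually preferable to clipped vocals going into the separator.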

Reducing reverb artefacts

Demucs sometimes bleeds reverb tails from the instrumental into the vocals stem. Running the acapella through a light de-reverb step (iZotope RX, ReaFIR, or the open-source asteroid toolkit) cleans this up before use in a mix.


No-code option: StemSplit Acapella Extractor

If you just need a quick acapella without writing code, StemSplit's online acapella extractor handles the whole pipeline in the browser — upload your track, get back vocals and instrumental as separate WAV downloads. No install, no API key for the free tier.

For an in-depth comparison of online and offline tools, see the best acapella extractors guide.