What Are Audio Stems?
In music production, a stem is a group of related tracks bounced to a single audio file, such as all the drums or all the vocals. When a producer finishes a song, they typically have dozens of individual tracks: kick drum, snare, bass guitar, lead vocal, backing vocals, guitars, synths, and so on. These are mixed down to a stereo file for distribution.
Stem splitting (also called audio source separation) is the reverse process: taking that final stereo mix and attempting to reconstruct the individual components. The four standard stems are:
Vocals
Lead and backing vocal content, typically center-panned in the stereo field.
Drums
Kick, snare, hi-hats, and cymbals, characterized by sharp transients that spread energy across a wide frequency range.
Bass
Bass guitar, sub bass, and kick drum fundamentals — concentrated below 250 Hz.
Other
Everything else: guitars, synths, pads, strings, and harmonic instruments.
Why Stem Splitting Is a Hard Problem
Separating a mixed audio track into stems is fundamentally an underdetermined problem. You have one stereo signal (2 channels) and you are trying to recover 4 or more independent sources. There is no unique mathematical solution — any algorithm must make assumptions about the nature of each source.
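A toy illustration of why the problem is underdetermined: two different sets of sources can sum to exactly the same mix, so the mix alone cannot tell them apart. The source names and sample values below are made up for illustration.

```python
# Two hypothetical decompositions of the same mono mix.
# All sample values are invented for illustration.
vocals_a = [0.5, -0.2, 0.1, 0.0]
guitar_a = [0.1, 0.3, -0.4, 0.2]

# Move some signal from one source to the other: the sum does not change.
shift = [0.2, -0.1, 0.05, 0.3]
vocals_b = [v + s for v, s in zip(vocals_a, shift)]
guitar_b = [g - s for g, s in zip(guitar_a, shift)]

mix_a = [v + g for v, g in zip(vocals_a, guitar_a)]
mix_b = [v + g for v, g in zip(vocals_b, guitar_b)]

# True: both decompositions explain the mix equally well.
same_mix = all(abs(a - b) < 1e-12 for a, b in zip(mix_a, mix_b))
```

Any separation algorithm therefore has to prefer one decomposition over another using prior assumptions, such as where a source sits in the stereo field or how quickly its energy changes.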
The challenge is that different instruments occupy overlapping frequency ranges. A vocal note at 440 Hz (A4) shares frequency space with a guitar chord, a synth pad, and the harmonics of a bass note. Simply filtering by frequency is not enough.
[Figure: frequency overlap between instruments]
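The overlap can be checked with simple arithmetic. The sketch below lists the harmonics of a bass note (A1, 55 Hz) and of a sung A4 (440 Hz) and finds the frequencies they share. It assumes ideal harmonics at exact integer multiples of the fundamental; the helper name is hypothetical.

```python
def harmonics(fundamental_hz, count):
    """Return the first `count` harmonic frequencies of a note."""
    return [fundamental_hz * n for n in range(1, count + 1)]

bass_a1 = harmonics(55.0, 10)    # bass note A1 and its overtones
vocal_a4 = harmonics(440.0, 3)   # sung A4 and its overtones

# Frequencies that appear in both notes
shared = sorted(set(bass_a1) & set(vocal_a4))
```

The eighth harmonic of the bass note lands exactly on the vocal fundamental, so a filter that passes 440 Hz passes part of both instruments.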
The STFT Approach: Analyzing Audio Frame by Frame
The most common signal-processing approach to stem splitting uses the Short-Time Fourier Transform (STFT). Instead of analyzing the entire audio signal at once, STFT divides it into short overlapping frames and applies a Fourier transform to each frame.
This produces a spectrogram — a 2D representation of the audio where the X axis is time, the Y axis is frequency, and the brightness (or color) represents the energy at each time-frequency point. Each cell in this grid is called a time-frequency bin.
Step 1: Frame the signal
Divide the audio into overlapping frames of 2048 samples (≈46 ms at 44.1 kHz). Apply a Hann window to each frame to reduce the spectral leakage caused by abrupt frame edges.
Step 2: Apply FFT
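Apply the Fast Fourier Transform (FFT) to each windowed frame. This converts the time-domain signal to a complex spectrum with 1025 unique frequency bins (N/2 + 1 for N = 2048, from 0 Hz up to the Nyquist frequency). The remaining bins are mirror-image complex conjugates and carry no extra information.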
Step 3: Compute magnitude & phase
For each bin, compute the magnitude and phase. The magnitude tells us how strong each frequency is (its square is the energy); the phase tells us where that frequency component sits in its cycle at the start of the frame, which preserves timing when the signal is rebuilt.
Step 4: Apply masks
Compute a soft mask for each stem based on the bin's characteristics (frequency, mid/side ratio, spectral flux). Multiply the spectrum by the mask to isolate each stem.
Step 5: Inverse FFT + overlap-add
Apply the inverse FFT to each masked spectrum to get back a time-domain signal. Overlap-add the frames to reconstruct the full-length audio for each stem.
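The five steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation. It makes several assumptions the article does not state: a hop size of a quarter frame, a second Hann window on synthesis with normalization by the summed squared window, and a naive DFT in place of a real FFT (usable only with small frames). The function names are hypothetical, and the mask covers all N bins of the full spectrum.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft(frame, inverse=False):
    """Naive O(N^2) DFT; a real implementation would use an FFT."""
    n = len(frame)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        out.append(acc / n if inverse else acc)
    return out

def stft_mask_istft(signal, mask, frame_size=2048, hop=512):
    """Window, transform, mask, inverse-transform, and overlap-add."""
    window = hann(frame_size)
    out = [0.0] * len(signal)
    norm = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_size + 1, hop):
        # Step 1: cut one frame and apply the Hann window
        frame = [signal[start + i] * window[i] for i in range(frame_size)]
        # Step 2: transform to the frequency domain
        spectrum = dft(frame)
        # Steps 3-4: abs(c) is the magnitude and cmath.phase(c) the phase;
        # multiplying by a real mask scales magnitude and keeps phase
        masked = [c * m for c, m in zip(spectrum, mask)]
        # Step 5: back to the time domain, then overlap-add
        rebuilt = dft(masked, inverse=True)
        for i in range(frame_size):
            out[start + i] += rebuilt[i].real * window[i]
            norm[start + i] += window[i] ** 2
    # Undo the combined effect of the analysis and synthesis windows
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]

# Usage with a tiny frame so the naive DFT stays fast
n = 16
tone = [math.sin(2 * math.pi * 2 * t / n) for t in range(64)]
identity = stft_mask_istft(tone, [1.0] * n, frame_size=n, hop=4)
```

With an all-ones mask the interior of the signal comes back unchanged, which is a useful sanity check before plugging in real stem masks. The first and last few samples are covered by fewer frames and are not reconstructed reliably in this sketch.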
Mid/Side Decomposition: The Key to Vocal Isolation
The most important cue for isolating vocals is their position in the stereo field. In professionally mixed music, the lead vocal is almost always panned to the center — meaning it appears equally in both the left and right channels.
Mid/side (M/S) decomposition exploits this. For each frequency bin, we compute:
Mid (center) = (Left + Right) / 2
Side (stereo) = (Left − Right) / 2
Content that is identical in both channels (like a centered vocal) has high mid energy and low side energy. Content that differs between channels (like a guitar panned left or a synth with stereo width) has high side energy.
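A short sketch of the decomposition, applied here to raw sample values (the same arithmetic applies to complex frequency bins, since the transform is linear). The function name and the sample values are made up for illustration.

```python
def mid_side(left, right):
    """Split a stereo pair into mid (center) and side (stereo) signals."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

# A source that is identical in both channels: all mid, no side
centered = [0.5, -0.25, 0.75]
mid_c, side_c = mid_side(centered, centered)

# A source that exists only in the left channel: equal mid and side
hard_left = [0.5, -0.25, 0.75]
mid_l, side_l = mid_side(hard_left, [0.0, 0.0, 0.0])
```

The decomposition is lossless: Left = Mid + Side and Right = Mid − Side, so the original channels can always be rebuilt.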
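A Wiener-style soft mask for vocals is computed from the mid and side spectra as:
vocal_mask = |mid|² / (|mid|² + |side|² + ε)
Here ε is a small constant that prevents division by zero in silent bins. The mask takes a value between 0 and 1 for each bin: bins dominated by center-panned content get a mask close to 1, and bins with strong stereo content get a mask close to 0. Because kick, snare, and bass are usually centered as well, this mask alone cannot tell them apart from vocals; the frequency and transient cues described below handle that. The mask is then applied to the original spectrum to isolate the vocal content.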
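A minimal sketch of the mask for a single time-frequency bin, taking the squared magnitude of the complex mid and side values. The function name, the value of ε, and the example bins are assumptions for illustration.

```python
def vocal_mask(mid_bin, side_bin, eps=1e-10):
    """Wiener-style soft mask from one complex mid bin and one side bin."""
    mid_energy = abs(mid_bin) ** 2
    side_energy = abs(side_bin) ** 2
    return mid_energy / (mid_energy + side_energy + eps)

center_bin = vocal_mask(1.0 + 0.5j, 0.0j)        # almost all mid energy
wide_bin = vocal_mask(0.1 + 0.0j, 0.8 - 0.3j)    # mostly side energy
equal_bin = vocal_mask(0.5 + 0.0j, 0.0 + 0.5j)   # equal mid and side energy
silent_bin = vocal_mask(0.0j, 0.0j)              # eps avoids dividing 0 by 0
```

A bin with equal mid and side energy gets a mask near 0.5, which is what makes this a soft mask: ambiguous bins are shared between stems instead of being assigned outright.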
Spectral Flux: How Drums Are Detected
Drums are characterized by their transient nature — sudden, sharp increases in energy across a wide frequency range. A kick drum hit, snare crack, or hi-hat creates a rapid onset that is very different from the sustained energy of a vocal or guitar.
Spectral flux measures this: for each frequency bin, it computes the positive difference in magnitude between the current frame and the previous frame:
flux(k) = max(0, |X_t(k)| − |X_(t-1)(k)|)
High spectral flux indicates a sudden energy increase, that is, a transient. The ratio of a bin's flux to its energy, flux_ratio, decides how much of that bin is assigned to the drum stem. The drum mask ramps linearly from 0 at a flux_ratio of 0.15 to 1 at 0.50:
drum_mask = clamp((flux_ratio − 0.15) / 0.35, 0, 1)
This approach works well for detecting percussive onsets but can also capture other transient sounds (plucked strings, piano attacks). The result is a drum stem that contains all transient-heavy content, not just drums in the strict sense.
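The flux and mask formulas can be sketched as follows. The article does not define flux_ratio exactly, so this sketch assumes it is the flux divided by the bin's current magnitude; the function names and the example magnitudes are also made up for illustration.

```python
def spectral_flux(current_mags, previous_mags):
    """Positive magnitude change per bin between two consecutive frames."""
    return [max(0.0, c - p) for c, p in zip(current_mags, previous_mags)]

def drum_mask(current_mags, previous_mags, eps=1e-10):
    """Soft drum mask per bin, using the thresholds from the formula above."""
    flux = spectral_flux(current_mags, previous_mags)
    mask = []
    for f, mag in zip(flux, current_mags):
        # Assumption: flux is normalized by the bin's current magnitude
        flux_ratio = f / (mag + eps)
        # clamp((flux_ratio - 0.15) / 0.35, 0, 1)
        mask.append(min(1.0, max(0.0, (flux_ratio - 0.15) / 0.35)))
    return mask

previous = [1.0, 1.0, 0.1, 0.5]   # magnitudes in frame t-1
current = [1.0, 1.2, 1.0, 0.2]    # magnitudes in frame t
mask = drum_mask(current, previous)
```

In this example the steady bin and the decaying bin get a mask of 0, the slowly rising bin gets a small value, and the bin that jumps from 0.1 to 1.0 gets a mask of 1.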
Bass Isolation: Frequency Band Filtering
Bass isolation is the most straightforward of the four stems. The fundamentals of bass instruments (bass guitar, sub bass, kick drum) sit at low frequencies, typically below 250 Hz. Most vocal and melodic energy lies above this threshold, although low male vocal notes and the lowest guitar and piano notes do reach into this range.
The bass mask is simply:
bass_mask(k) = 1 if bin_frequency(k) < 250 Hz, else 0
This hard frequency cutoff is then combined with the other masks to ensure bass content is not double-counted in the vocal or drum stems. The result is a low-frequency stem containing the sub-bass and bass fundamentals; overtones of bass instruments above 250 Hz end up in the other stems.
Note that the kick drum has both a low-frequency fundamental (captured in the bass stem) and a high-frequency transient attack (captured in the drum stem). This is a known limitation of frequency-domain stem splitting — the kick drum is split across two stems.
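To apply the cutoff, each bin index has to be converted to a frequency. The sketch below assumes the 2048-sample frame and 44.1 kHz sample rate used earlier; the function names are hypothetical, and only the lowest 32 bins are generated to keep the example short.

```python
def bin_frequency(k, sample_rate=44100, frame_size=2048):
    """Center frequency in Hz of FFT bin k."""
    return k * sample_rate / frame_size

def bass_mask(num_bins, cutoff_hz=250.0, sample_rate=44100, frame_size=2048):
    """Hard mask: 1.0 for bins below the cutoff, 0.0 otherwise."""
    return [1.0 if bin_frequency(k, sample_rate, frame_size) < cutoff_hz else 0.0
            for k in range(num_bins)]

mask = bass_mask(32)               # mask for the lowest 32 bins
bass_bins = int(sum(mask))         # number of bins kept in the bass stem
bin_width_hz = bin_frequency(1)    # spacing between adjacent bins
```

At these settings each bin is about 21.5 Hz wide, so the entire bass stem is built from only the 12 lowest bins (0 through 11). That coarse resolution is one reason low-frequency separation with short frames is imprecise.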
Limitations of Signal-Processing Stem Splitting
STFT-based stem splitting with spectral masking is a powerful technique, but it has inherent limitations compared to AI-based approaches like Demucs or Spleeter:
Bleed between stems
Because instruments share frequency space, some content will appear in multiple stems. A guitar chord in the vocal frequency range will partially appear in the vocal stem.
Mono vocal assumption
The mid/side approach assumes vocals are center-panned. Vocals with heavy stereo reverb, chorus effects, or deliberate panning will not be fully isolated.
Kick drum split
The kick drum's sub-bass fundamental goes to the bass stem; its transient attack goes to the drum stem. This is a fundamental limitation of frequency-domain separation.
Complex mixes
Heavily layered productions with many instruments in the same frequency range will produce less clean separation than sparse arrangements.
For professional-grade stem separation, AI-based tools like Demucs (Meta AI) or Spleeter (Deezer) use deep neural networks trained on large collections of multitrack recordings to achieve much cleaner separation. However, these models need far more compute and memory, so they are usually run on a desktop or server rather than in a browser.
Practical Use Cases for Stem Splitting
DJ Mashups & Remixes
Extract the vocal stem from one track and mix it over the instrumental of another. Create acapella mashups, vocal swaps, and creative blends that would be impossible with full mixes.
Music Production
Study how professional tracks are arranged and mixed. Use stems as reference material, sample individual elements, or create remixes with the original stems.
Karaoke & Practice
Remove vocals to create karaoke backing tracks. Isolate the vocal stem to practice harmonies, transcribe lyrics, or study vocal technique.
Stem Mastering
Process each stem independently for more precise mastering. Apply different EQ, compression, and limiting to vocals, drums, bass, and instruments separately.
Video & Content
Use the instrumental stem as background music for videos. Removing the vocal leaves a cleaner audio bed for voiceovers, though the instrumental remains subject to the original track's copyright.
Music Education
Isolate individual instruments to study their parts. Slow down stems for transcription. Analyze the frequency content of each element in a professional mix.
Frequently Asked Questions
What is stem splitting in audio?
Stem splitting (audio source separation) is the process of separating a mixed stereo audio track into its individual components — typically vocals, drums, bass, and other instruments. Each separated component is called a "stem" and can be used independently for remixing, DJing, or music production.
How does STFT-based stem splitting work?
STFT stem splitting analyzes audio frame by frame in the frequency domain. Each frequency bin is classified using mid/side energy ratios, frequency band constraints, and spectral flux. Soft masks are applied per bin to isolate each stem, then the masked spectra are converted back to audio via inverse FFT.
Can I split stems online for free?
Yes. WavinTools offers a free online stem splitter that runs entirely in your browser: no upload to a server, no account, no limits. Select any MP3 or WAV file and download 4 separate stems: vocals, drums, bass, and other instruments.
What is the difference between stem splitting and vocal removal?
Vocal removal only removes the vocal track, leaving a single instrumental output. Stem splitting separates the audio into 4 individual stems (vocals, drums, bass, other), giving you full control over each element independently.
What are stems used for in music production?
Stems are used for remixing (replacing or layering individual elements), DJ mashups (combining vocals from one track with the instrumental of another), stem mastering (processing each element separately), karaoke track creation, and music education (studying arrangement and mixing techniques).
What audio formats does the stem splitter support?
The WavinTools stem splitter accepts MP3 and WAV input files. All 4 output stems are exported as high-quality WAV files (16-bit, original sample rate), ready for use in any DAW or DJ software.