Mixing for Podcast and Spoken Word

By Sarah Okonkwo · March 1, 2026


Mixing music is often about vibe. Mixing spoken word is about clarity, consistency, and trust—if the listener has to strain, they’ll bail. The tricky part is that podcasts and voice content get played everywhere: earbuds on a train, a smart speaker in a kitchen, a car at 70 mph, or a phone at low volume.

Good spoken-word mixes aren’t “fancy.” They’re controlled. Your job is to keep the voice intelligible, even when the talent turns their head, laughs, whispers, or hits a sudden emphasis. Here are the moves I rely on in real sessions to make dialogue sit solid and translate everywhere.

Start with cleanup: strip silence and tame room tone (don’t nuke it)
Before EQ or compression, remove distractions: long gaps, chair squeaks, mic bumps, and excessive breath storms. Use a strip-silence or clip-gain workflow, but leave a little natural room tone so the edit doesn’t sound like it’s cutting to “digital black.” In a two-host edit, matching room tone across both mics (or adding a consistent tone bed) can stop the mix from sounding like it’s jumping between rooms.
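To make the "tighten, don't nuke" idea concrete, here is a minimal sketch in plain Python. It assumes the audio is a list of float samples between -1.0 and 1.0; the function names, the 0.01 threshold, and the 1.5-second minimum gap are illustrative assumptions, not settings from any particular DAW.

```python
def find_long_gaps(samples, sample_rate, threshold=0.01, min_gap_s=1.5):
    """Return (start, end) sample ranges that stay below threshold for at least min_gap_s."""
    min_len = int(min_gap_s * sample_rate)
    gaps = []
    start = None
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if start is None:
                start = i  # a quiet run begins here
        else:
            if start is not None and i - start >= min_len:
                gaps.append((start, i))
            start = None
    # Handle a quiet run that lasts until the end of the file.
    if start is not None and len(samples) - start >= min_len:
        gaps.append((start, len(samples)))
    return gaps


def tighten_gaps(samples, gaps, keep_len):
    """Shorten each gap to keep_len samples of its own room tone instead of deleting it.

    Assumes keep_len is no longer than the shortest gap.
    """
    out = []
    pos = 0
    for start, end in gaps:
        out.extend(samples[pos:start + keep_len])  # speech plus a slice of room tone
        pos = end
    out.extend(samples[pos:])
    return out
```

In a real edit you would crossfade at each join rather than hard-cut, but the shape of the logic is the same: long pauses get shorter, and the room tone never drops to nothing.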
Use clip gain first, compressor second
If the host leans in for a whisper and then blasts a laugh, don’t ask one compressor to solve it all. Ride clip gain (or region gain) to get the performance into a sane range, then compress for tone and consistency. Real-world example: when a guest is on a cheap USB mic and keeps drifting off-axis, clip-gaining the quiet phrases back into line can save you from pumping, breath exaggeration, and harsh artifacts.
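The arithmetic behind a clip-gain pass is simple enough to sketch. This assumes each phrase is a list of float samples, and it uses a hypothetical target of -20 dBFS RMS with a 12 dB clamp; both numbers are placeholders to adjust by ear.

```python
import math


def rms_dbfs(samples):
    """RMS level of float samples (-1.0 to 1.0) in dBFS; -inf for silence."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    if mean_sq == 0:
        return float("-inf")
    return 10 * math.log10(mean_sq)


def clip_gain_db(samples, target_dbfs=-20.0, max_change_db=12.0):
    """Gain in dB that moves a phrase toward the target, clamped to +/- max_change_db."""
    level = rms_dbfs(samples)
    if level == float("-inf"):
        return 0.0  # leave pure silence alone
    gain = target_dbfs - level
    return max(-max_change_db, min(max_change_db, gain))


def apply_gain(samples, gain_db):
    """Scale samples by a gain given in dB."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]
```

The clamp is the important design choice here: a phrase that needs more than about 12 dB of lift is probably mostly noise, and boosting it fully would hand the compressor a new problem.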
High-pass with intent: remove rumble without thinning the voice
A high-pass filter is almost always worth using, but set it by ear and context, not habit. For most adult voices, try starting around 70–100 Hz (higher for thin, noisy recordings; lower for rich voices you want to keep warm). If you’re editing a remote interview with HVAC rumble, a steeper slope can help, but watch that you don’t strip the chest out of the voice and end up with a “phone call” tone.
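For anyone curious what a high-pass is doing under the hood, here is a rough sketch of a second-order (12 dB/octave) filter built from the widely used "Audio EQ Cookbook" biquad formulas. The 80 Hz cutoff, 48 kHz sample rate, and function names are assumptions for illustration; your plugin's filter will differ in slope and detail.

```python
import math


def highpass_coeffs(cutoff_hz, sample_rate, q=0.7071):
    """Normalized biquad high-pass coefficients (b0, b1, b2, a1, a2)."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0 = (1.0 + cos_w0) / 2.0 / a0
    b1 = -(1.0 + cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0
    return b0, b1, b2, a1, a2


def highpass(samples, cutoff_hz=80.0, sample_rate=48000):
    """Run samples through the biquad high-pass (direct form I)."""
    b0, b1, b2, a1, a2 = highpass_coeffs(cutoff_hz, sample_rate)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out
```

With these settings a 20 Hz rumble comes out heavily attenuated while a 1 kHz tone passes essentially untouched, which is the behavior you are listening for when you sweep the cutoff.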
Find the “intelligibility band” and make it stable (2–5 kHz, usually)
The money zone for understanding consonants often lives in the 2–5 kHz range, but harshness can sit nearby. Use a gentle wide EQ boost if the voice is dull, or a dynamic EQ cut if it’s spiky on certain words. In a studio narration session on a bright condenser (like a Rode NT1 or AT4050), I’ll often use dynamic EQ to catch sharp consonants without dulling the whole read.
De-ess like a human: split the problem (and don’t chase only “S”)
One de-esser set aggressively can turn a voice into mush, especially on cheap earbuds. Try two lighter stages: one targeting 6–8 kHz for “S/T,” and another (or dynamic EQ) around 3–5 kHz if “ch/ts” are biting. If you’re stuck with a harsh Zoom track, a DIY trick is manual clip gain on the worst sibilant syllables—slow, but it can beat artifacts from over-de-essing.
Compress in stages: leveler + character, instead of one heavy unit
Spoken word likes controlled dynamics, but it also needs to sound natural. Use a smooth leveling compressor first (2:1 to 3:1, medium attack, medium release) doing a few dB of gain reduction, then a second compressor or limiter to catch the peaks that get through. In a live-recorded panel with surprise laughter and applause bleed, a second stage with a faster release can keep things from exploding while the first stage keeps the voice anchored.
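Here is a small sketch of why two gentle stages behave differently from one heavy one. It models only the static threshold-and-ratio curve (no attack, release, or knee), and the thresholds and ratios are hypothetical values picked to keep the numbers easy to follow.

```python
def compressor_gain_db(level_db, threshold_db, ratio):
    """Static gain reduction in dB (zero or negative) for a hard-knee compressor."""
    if level_db <= threshold_db:
        return 0.0
    over = level_db - threshold_db
    # Above threshold, every `ratio` dB of input yields 1 dB of output.
    return -(over - over / ratio)


def two_stage_output_db(level_db):
    """Output level after a gentle leveler followed by a firmer peak stage."""
    # Stage 1: leveler at 2:1 with a low threshold (assumed -24 dB).
    after_leveler = level_db + compressor_gain_db(level_db, threshold_db=-24.0, ratio=2.0)
    # Stage 2: peak catcher at 4:1 with a higher threshold (assumed -16 dB).
    return after_leveler + compressor_gain_db(after_leveler, threshold_db=-16.0, ratio=4.0)
```

With these assumed settings, a normal phrase at -14 dB comes out at -19 dB and never touches the second stage, while a -4 dB laugh comes out at -15.5 dB. The second stage only works on the outliers, which is the point of staging.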
Use a peak limiter to stop “random overs,” not to win loudness wars
Set a brickwall limiter near the end of the chain to catch unpredictable spikes—big laughs, bumped mic stands, sudden emphasis. Keep gain reduction modest most of the time; if it’s slamming constantly, go back to clip gain and compression. For delivery, many podcast platforms normalize anyway, so the goal is clean, consistent loudness rather than squeezing every last dB.
Control breaths and mouth noise with editing + targeted tools
Big gasps and wet clicks pull attention away from the content. Don’t delete every breath—just turn the loud ones down 3–8 dB with clip gain so they feel natural. If you have RX-style tools, use de-click or mouth de-noise lightly; in a pinch, a narrow EQ dip around 1–3 kHz on a single click region can reduce the “tick” without damaging the word.
Make music beds obey the voice: duck with sidechain, not guesswork
If there’s intro music, transitions, or an ambient bed, sidechain duck it from the voice so it automatically gets out of the way. Set a fast-ish attack so the voice stays upfront, and a release that feels musical (too fast sounds like pumping, too slow makes the bed disappear). Real scenario: branded shows with constant low music—sidechain ducking keeps the host intelligible without you riding automation for every sentence.
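To show how attack and release shape a duck, here is a minimal sketch that turns a voice level envelope into a gain curve for the music bed. It assumes the voice envelope is already rectified (a list of non-negative levels, one per sample), and the threshold, -12 dB depth, and timing values are illustrative assumptions rather than recommended settings.

```python
import math


def smoothing_coeff(time_ms, sample_rate):
    """One-pole smoothing coefficient for a given time constant."""
    return math.exp(-1.0 / (time_ms * 0.001 * sample_rate))


def duck_gains(voice_envelope, sample_rate, threshold=0.05,
               depth_db=-12.0, attack_ms=10.0, release_ms=400.0):
    """Per-sample linear gain to apply to the music bed."""
    atk = smoothing_coeff(attack_ms, sample_rate)
    rel = smoothing_coeff(release_ms, sample_rate)
    gain_db = 0.0
    gains = []
    for level in voice_envelope:
        # Duck to depth_db while the voice is present, return to 0 dB otherwise.
        target = depth_db if level > threshold else 0.0
        # Moving down uses the attack time; recovering uses the release time.
        coeff = atk if target < gain_db else rel
        gain_db = coeff * gain_db + (1.0 - coeff) * target
        gains.append(10 ** (gain_db / 20))
    return gains
```

Multiply each music sample by the matching gain to apply the duck. Shortening `release_ms` makes the bed jump back up between words (the pumping sound), while lengthening it keeps the bed down through short pauses.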
Match tone between mics (or between remote guests) using EQ and ambience
Two-host shows often suffer from “one voice is in a booth, the other is in a kitchen.” Use EQ to pull them toward a shared tonal center: tame honk (often 200–500 Hz), reduce boxiness (around 300–800 Hz depending on the room), and align brightness. If one track is super dry and the other is roomy, adding a tiny, short room reverb (or a subtle ambience bed) to the dry track can make the cuts feel less jarring—keep it nearly imperceptible.
Check translation on small speakers and in mono before you print
A podcast mix that sounds great on studio monitors can fall apart on a phone. Do a quick pass on earbuds, a phone speaker, and a small mono Bluetooth speaker; listen for buried consonants, harshness fatigue, and whether music overwhelms speech. In post houses, we’ll often hit a mono check mid-session—if the voice loses clarity in mono, it’s usually an EQ masking issue or phase weirdness from stereo processing.
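If you want a number to go with the mono check, a phase correlation reading is the usual one. This is a bare-bones sketch assuming left and right channels as equal-length lists of floats; the function names are mine, not from any metering plugin.

```python
import math


def phase_correlation(left, right):
    """Correlation between channels: +1 is mono-safe, -1 cancels completely in mono."""
    num = sum(l * r for l, r in zip(left, right))
    den = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return num / den if den else 0.0


def mono_sum(left, right):
    """Fold stereo to mono by averaging the two channels."""
    return [(l + r) / 2 for l, r in zip(left, right)]
```

A voice track that reads near +1 will survive the fold-down. Readings that hover near zero or go negative suggest a widener or stereo effect is working against you, and the mono sum will sound thinner or quieter than the stereo mix.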

Quick Reference Summary

Clean edits first; keep room tone consistent so cuts don’t “snap.”
Clip gain to even out performance; compression is for finishing, not rescue.
HPF around 70–100 Hz (by ear) to kill rumble without thinning.
Stabilize intelligibility (often 2–5 kHz) with gentle EQ or dynamic EQ.
De-ess in lighter stages; consider manual gain on the worst syllables.
Two-stage compression + a modest limiter beats one heavy chain.
Duck music beds with sidechain to protect words automatically.
Match hosts/guests with EQ and subtle ambience; then check on small speakers.

Conclusion

Spoken-word mixing is mostly small moves done consistently: clean up the mess, control dynamics before they hit your processors, and protect intelligibility at all costs. Try two or three of these tips on your next episode—especially clip gain before compression and sidechain ducking—and you’ll hear the mix tighten up fast. Once the voice feels effortless to listen to, you’re basically done.
