Spatial Audio: Beyond Stereo, Into the Space Around the Listener

By Priya Nair

By Nina Patel -- Film Sound Designer, 3 feature films + 5 streaming series · 13 min read

[Image: Sound engineer working with spatial audio monitoring array]

Stereo has been the dominant audio format for nearly 70 years. Two channels, left and right, creating the illusion of width through amplitude and timing differences. It's a remarkable illusion -- a well-mixed stereo recording can place sounds anywhere along a horizontal arc in front of the listener -- but it has a fundamental limitation: it only works when you're sitting in one specific position, the sweet spot between the two speakers. Move your head six inches to the left, and the entire spatial image shifts.
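To make the amplitude half of that illusion concrete, here is a minimal sketch of a constant-power pan law, one common way a mixer distributes a mono source between two speakers. The function name and the -1.0 to 1.0 pan range are my own conventions for illustration, and the sketch ignores timing differences entirely.

```python
import math


def constant_power_pan(pan: float) -> tuple[float, float]:
    """Return (left_gain, right_gain) for a pan position.

    pan = -1.0 is hard left, 0.0 is center, 1.0 is hard right.
    The gains are chosen so left^2 + right^2 stays at 1.0, which keeps
    perceived loudness roughly constant as the source moves.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be between -1.0 and 1.0")
    # Map the pan position onto a quarter circle: 0 .. pi/2 radians.
    angle = (pan + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)


left, right = constant_power_pan(0.0)
print(f"center: L={left:.3f} R={right:.3f}")
```

At center both gains come out at about 0.707 (roughly -3 dB), which is why a centered source does not jump in level compared with a hard-panned one.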

Spatial audio changes that paradigm entirely. Instead of encoding a fixed speaker configuration, spatial audio describes sounds in three-dimensional space and lets the playback system -- whether it's a 7.1.4 speaker array, a soundbar with upward-firing drivers, or a pair of AirPods Pro -- figure out how to recreate that spatial image for the listener. The shift from channel-based to object-based audio is the most significant change in how we record, mix, and consume audio since the introduction of stereo itself.

The Three Spatial Audio Formats

When people say "spatial audio," they're usually referring to one of three distinct technologies: Dolby Atmos, Sony 360 Reality Audio, or MPEG-H. Each uses a different approach to spatial audio encoding and rendering, but they share the same fundamental concept: describe sounds in 3D space, not in speaker channels.

Dolby Atmos is the most widely adopted format. It uses a bed-channel plus object architecture, where the bed provides ambient content on discrete channels and objects are positioned in 3D space with metadata. Atmos supports up to 128 simultaneous audio elements (bed channels and objects combined) and can render to speaker configurations from 5.1.2 to 24.1.10, as well as binaural headphone output using head-related transfer functions.

Sony 360 Reality Audio is also object-based and is built on the MPEG-H 3D Audio standard. Each sound is stored as an object with a position on a sphere around the listener, and the renderer maps those positions to speakers or headphones. A sound field can alternatively be described mathematically with spherical harmonics (the scene-based approach used by ambisonics), which is more efficient for diffuse sound fields (like audience ambience in a live recording) but less precise for point sources (like a specific instrument in a mix). The format is primarily used for music streaming through Tidal and Deezer.

MPEG-H is the broadcast-oriented spatial audio format. It includes interactive features that let the listener adjust dialogue level, choose camera-angle-specific audio, or enable audio descriptions -- all encoded within the same bitstream. MPEG-H is used for terrestrial and satellite broadcast in South Korea and parts of Europe, and it's the format specified for next-generation broadcast audio by the DVB consortium.
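The dialogue-level feature is easier to picture with a little arithmetic. The sketch below is not the MPEG-H decoder API. It only illustrates the idea that dialogue travels as a separate element, so the playback side can apply a listener-chosen gain before summing. The function names and sample values are assumptions for illustration.

```python
def db_to_gain(db: float) -> float:
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def mix_with_dialogue_offset(dialogue, background, dialogue_offset_db):
    """Sum dialogue and background samples after scaling the dialogue.

    dialogue and background are equal-length lists of float samples.
    dialogue_offset_db is the listener's preference, e.g. +6.0 to make
    speech louder relative to music and effects.
    """
    if len(dialogue) != len(background):
        raise ValueError("dialogue and background must be the same length")
    gain = db_to_gain(dialogue_offset_db)
    return [gain * d + b for d, b in zip(dialogue, background)]


# A listener asks for dialogue 6 dB louder than the default mix.
boosted = mix_with_dialogue_offset([0.10, 0.20], [0.05, 0.05], 6.0)
print(boosted)
```

With a channel-based mix this adjustment is not possible after the fact, because the dialogue has already been summed into the same channels as everything else.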

Object-Based vs Channel-Based Audio

The distinction between object-based and channel-based audio is the foundation of spatial audio. In channel-based audio -- stereo, 5.1, 7.1 -- each audio channel corresponds to a specific speaker. The mix engineer decides what goes to each speaker, and the listener hears exactly that distribution. In object-based audio, the mix engineer describes where each sound should be located in 3D space, and the playback system's renderer decides which speakers produce that sound.
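A rough sketch of that idea: the object carries only a position, and a renderer turns the position into per-speaker gains for whatever layout is present. The renderer here uses a simple inverse-distance weighting that I chose for brevity. It is not Dolby's rendering algorithm, and the speaker coordinates are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioObject:
    name: str
    x: float  # left (-) to right (+)
    y: float  # back (-) to front (+)
    z: float  # floor (0) upward


# Hypothetical speaker positions in the same coordinate system.
LAYOUT_STEREO = {"L": (-1.0, 1.0, 0.0), "R": (1.0, 1.0, 0.0)}
LAYOUT_WITH_HEIGHT = {
    "L": (-1.0, 1.0, 0.0), "R": (1.0, 1.0, 0.0),
    "Ls": (-1.0, -1.0, 0.0), "Rs": (1.0, -1.0, 0.0),
    "TopL": (-1.0, 0.0, 1.0), "TopR": (1.0, 0.0, 1.0),
}


def render_gains(obj: AudioObject, layout: dict) -> dict:
    """Return a gain per speaker, louder for speakers nearer the object.

    Gains are normalized so the sum of their squares is 1.0, keeping
    total power constant regardless of how many speakers the layout has.
    """
    weights = {}
    for speaker, position in layout.items():
        distance = math.dist((obj.x, obj.y, obj.z), position)
        weights[speaker] = 1.0 / (distance + 1e-6)  # avoid division by zero
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {speaker: w / norm for speaker, w in weights.items()}


violin = AudioObject("violin", x=-0.5, y=1.0, z=0.2)
for layout in (LAYOUT_STEREO, LAYOUT_WITH_HEIGHT):
    print({k: round(v, 3) for k, v in render_gains(violin, layout).items()})
```

The same `violin` object is rendered twice without being remixed. Only the layout passed to the renderer changes.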

This difference matters most for content that needs to work across multiple playback configurations. A film mixed in Dolby Atmos has one master file that serves Dolby Cinema (with 64+ speakers), a home theater with 7.1.4, a soundbar with virtualized surround, and a pair of stereo earbuds. The renderer handles the translation. A film mixed in 5.1 requires separate downmixes for stereo and separate upmixes for Atmos-capable systems, each of which introduces potential quality loss.
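For comparison, this is roughly what a channel-based stereo downmix looks like. The sketch uses the widely used -3 dB coefficients for center and surrounds, the convention usually associated with ITU-R BS.775, and drops the LFE channel, which is a common choice rather than a universal rule. The dictionary keys are my own naming.

```python
import math

MINUS_3_DB = 1.0 / math.sqrt(2.0)  # about 0.7071


def downmix_51_to_stereo(frame: dict) -> tuple[float, float]:
    """Fold one 5.1 sample frame down to (left, right).

    frame maps channel names "L", "R", "C", "LFE", "Ls", "Rs" to samples.
    The LFE channel is intentionally discarded in this sketch.
    """
    left = frame["L"] + MINUS_3_DB * frame["C"] + MINUS_3_DB * frame["Ls"]
    right = frame["R"] + MINUS_3_DB * frame["C"] + MINUS_3_DB * frame["Rs"]
    return left, right


frame = {"L": 0.2, "R": 0.1, "C": 0.5, "LFE": 0.3, "Ls": 0.0, "Rs": 0.0}
print(downmix_51_to_stereo(frame))
```

The coefficients are fixed, so every sound in a given channel is folded down the same way. That is the limitation an object renderer avoids, because it can decide per sound where it should go.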

Binaural Rendering and Head-Related Transfer Functions

For headphone listeners, spatial audio is rendered binaurally using head-related transfer functions (HRTFs). An HRTF describes how sound from a specific direction in space is filtered by the listener's head, torso, and outer ear before reaching the eardrum. By applying the appropriate HRTF to a sound source, you can create the perception that the sound is coming from any direction in 3D space -- even through a pair of stereo headphones.
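In practice, applying an HRTF usually means convolving the source with a pair of head-related impulse responses (HRIRs), one per ear. The sketch below shows that operation in plain Python. The two impulse responses are toy placeholders I made up to mimic a source on the listener's left (right ear later and quieter). Real HRIRs are measured and are hundreds of samples long.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of float samples."""
    length = len(signal) + len(impulse_response) - 1
    out = [0.0] * max(length, 0)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Return (left_ear, right_ear) signals for a mono source."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy placeholder HRIRs, NOT measured data: the right ear receives the
# sound two samples later and at half the amplitude.
HRIR_LEFT = [1.0, 0.0, 0.0]
HRIR_RIGHT = [0.0, 0.0, 0.5]

left_ear, right_ear = render_binaural([1.0, 0.5], HRIR_LEFT, HRIR_RIGHT)
print(left_ear)   # [1.0, 0.5, 0.0, 0.0]
print(right_ear)  # [0.0, 0.0, 0.5, 0.25]
```

A production renderer would use FFT-based convolution and switch or interpolate HRIRs as the source or the listener's head moves, but the core operation is the same.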

The challenge with HRTFs is that they're individual. Your head shape, ear shape, and torso dimensions are different from mine, so your HRTFs are different from mine. Dolby's binaural renderer uses a generalized HRTF based on an average of hundreds of individual measurements. For about 70-80% of listeners, this produces convincing spatial imaging. For the remaining 20-30%, the spatial image may feel imprecise or localized inside the head rather than in the space around them.

Recording for Spatial Audio Production

Capturing source material for spatial audio mixing requires different microphone techniques than stereo recording. The goal is to capture a sound field that can be decoded into any spatial audio format, not just a fixed speaker configuration. The two primary approaches are ambisonic recording and multi-microphone array recording.

Ambisonic recording uses a coincident microphone array -- typically a tetrahedral arrangement of four capsules -- to capture a full-sphere sound field. The resulting signal is encoded in B-format (W, X, Y, Z channels for first-order ambisonics, plus additional channels for higher orders). A full-sphere signal of order N needs (N + 1)^2 channels, so first order uses 4 channels and third order uses 16. First-order ambisonics captures the sound field with relatively low spatial resolution -- about 30 degrees of angular resolution. Third-order ambisonics (16 channels) captures the field with about 10 degrees of resolution, which is sufficient for most production applications.
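To show what the W, X, Y, and Z channels actually contain, here is a minimal sketch that encodes a mono sample arriving from a given direction into first-order B-format. It assumes the traditional (FuMa-style) convention, where W is scaled by 1/sqrt(2) and azimuth is measured counterclockwise from straight ahead. Other conventions such as AmbiX order and scale the channels differently. The helper for channel count uses the standard (N + 1)^2 rule for full-sphere ambisonics.

```python
import math


def ambisonic_channel_count(order: int) -> int:
    """Number of channels in a full-sphere ambisonic signal of this order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def encode_first_order(sample: float, azimuth_deg: float, elevation_deg: float):
    """Encode one mono sample into (W, X, Y, Z), FuMa-style weighting.

    azimuth_deg: 0 is front, positive angles turn toward the left.
    elevation_deg: 0 is ear level, 90 is directly overhead.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)              # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)  # front-back
    y = sample * math.sin(az) * math.cos(el)  # left-right
    z = sample * math.sin(el)                 # up-down
    return w, x, y, z


print(ambisonic_channel_count(1), ambisonic_channel_count(3))
print(encode_first_order(1.0, azimuth_deg=90.0, elevation_deg=0.0))
```

A source hard left ends up almost entirely in Y (plus W), which is how a decoder can later recover its direction for any speaker layout.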

The Sennheiser AMBEO VR mic and the Zoom H3-VR are popular first-order ambisonic microphones for field recording. For studio recording, I use the Soundfield by Sennheiser SPS200, which is a second-order ambisonic microphone (9 channels) that captures significantly more spatial detail. The SPS200 outputs via Dante, which means the 9 channels arrive at the recording computer over a single Ethernet cable.

Multi-Microphone Techniques for Spatial Source Capture

When ambisonic recording isn't practical -- either because of equipment availability or because the source material needs to be individually processed -- a multi-microphone array is the alternative. The Decca Tree configuration (three omnidirectional microphones in a specific triangular arrangement) can be expanded for spatial audio by adding height microphones and surround microphones to create a full-sphere capture.

On a recent project, we recorded a string quartet using an 11-microphone array: a Decca Tree at the front (Left, Center, Right at 2 meters height), four outriggers at the sides (2 meters out, 1.5 meters height), two height microphones above the ensemble (3.5 meters), and two rear microphones behind the performers (4 meters back). Each microphone was recorded on a separate channel, giving us the flexibility to position individual microphones in the spatial audio mix and create a natural, enveloping image of the ensemble.
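One practical step when placing microphone channels in the mix is converting each microphone's physical position into the azimuth and elevation a panner expects. The sketch below does that conversion. The coordinate convention (x to the right, y forward, z up, measured in meters from the chosen listening position) and the two example positions are assumptions for illustration, not the measured layout from the session.

```python
import math


def to_spherical(x: float, y: float, z: float):
    """Convert a position to (azimuth_deg, elevation_deg, distance_m).

    Azimuth 0 is straight ahead and positive angles are to the right.
    Elevation 0 is level with the listening position, 90 is overhead.
    """
    distance = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(x, y))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation, distance


# Hypothetical positions relative to the listening position, in meters.
example_mics = {
    "outrigger_right": (2.0, 1.0, 0.0),
    "height_left": (-1.0, 1.0, 2.0),
}

for name, position in example_mics.items():
    az, el, dist = to_spherical(*position)
    print(f"{name}: azimuth {az:.1f}, elevation {el:.1f}, distance {dist:.2f} m")
```

Starting each channel at its measured angle usually gives a plausible image straight away, and the positions can then be adjusted by ear.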

"Spatial audio isn't about surrounding the listener with sound. It's about giving the listener the sensation of being inside the acoustic environment where the sound was created. The difference is subtle but profound -- one is a technical effect, the other is an emotional experience." -- Leslie Ann Jones, Director of Music Recording at Skywalker Sound, interviewed by Sound on Sound, 2022

Spatial Audio in Post-Production Workflows

Integrating spatial audio into an existing post-production workflow requires changes to the monitoring setup, the mixing software, and the delivery pipeline. The monitoring change is the most significant -- you need a spatial audio monitoring system (minimum 5.1.2, ideally 7.1.4) that is calibrated to the appropriate standard. The mixing software change involves adding spatial audio plugins to your DAW session. The delivery change involves generating the appropriate spatial audio master file.

In Pro Tools, spatial audio mixing uses the Dolby Atmos Renderer plugin, which receives object tracks from the Pro Tools session and renders them to the monitoring configuration in real time. The object tracks are standard mono Pro Tools tracks with the Dolby Atmos Panner inserted on each one. The Panner provides X, Y, and Z position controls that can be automated using Pro Tools' standard automation system.
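Conceptually, position automation is a list of timed breakpoints with interpolation between them. The sketch below is not the Pro Tools or Dolby API. It is a minimal linear interpolation over hypothetical (time, (x, y, z)) keyframes to show what an automated pan move amounts to as data.

```python
from bisect import bisect_right


def position_at(keyframes, t: float):
    """Return the interpolated (x, y, z) position at time t in seconds.

    keyframes is a non-empty list of (time, (x, y, z)) tuples sorted by
    strictly increasing time. Before the first keyframe and after the
    last one, the position holds at the nearest keyframe value.
    """
    if not keyframes:
        raise ValueError("at least one keyframe is required")
    times = [time for time, _ in keyframes]
    if t <= times[0]:
        return keyframes[0][1]
    if t >= times[-1]:
        return keyframes[-1][1]
    i = bisect_right(times, t)  # index of the first keyframe after t
    (t0, p0), (t1, p1) = keyframes[i - 1], keyframes[i]
    fraction = (t - t0) / (t1 - t0)
    return tuple(a + fraction * (b - a) for a, b in zip(p0, p1))


# A hypothetical two-second move from hard left at ear level
# to hard right and overhead.
flyover = [(0.0, (-1.0, 0.0, 0.0)), (2.0, (1.0, 0.0, 1.0))]
print(position_at(flyover, 1.0))  # (0.0, 0.0, 0.5)
```

The renderer reads the current position on every processing block, so a smooth automation curve becomes a smooth movement through the speaker array.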

Stem Delivery for Spatial Audio

Delivering spatial audio content to streaming platforms requires stem organization that differs from conventional 5.1 delivery. The Dolby Atmos M&E (music and effects) stem, for instance, contains all non-dialogue elements positioned in 3D space, organized as objects rather than channels. This allows the localization team to replace dialogue in different languages while preserving the spatial positioning of all other elements.

The stem specification for Netflix spatial audio originals requires: Dialogue stem (mono objects for each character), M&E stem (full spatial mix of music and effects), and a printmaster (the complete final mix). Each stem is delivered as a separate Dolby Atmos Master file, and the platform's encoding pipeline combines them into the final streaming bitstream.
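A small pre-delivery check can catch a missing stem before upload. The sketch below encodes the three stems listed in the paragraph above. The key names and file names are hypothetical, and this is not an official platform tool or specification.

```python
# Stem names taken from the delivery list above; the keys are my own.
REQUIRED_STEMS = ("dialogue", "m_and_e", "printmaster")


def missing_stems(delivered: dict[str, str]) -> list[str]:
    """Return the required stems that have no file assigned.

    delivered maps a stem key to the path of its master file. A key that
    is absent or mapped to an empty string counts as missing.
    """
    return [stem for stem in REQUIRED_STEMS if not delivered.get(stem)]


# Hypothetical file names for illustration only.
package = {
    "dialogue": "ep101_dialogue.atmos",
    "printmaster": "ep101_printmaster.atmos",
}
print(missing_stems(package))  # ['m_and_e']
```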

Comparing Spatial Audio Playback Systems

The listener's experience of spatial audio depends entirely on their playback system. A 7.1.4 home theater produces a fundamentally different spatial image than a pair of AirPods Pro, even though both are rendering the same Dolby Atmos bitstream. Understanding these differences is essential for mixing content that translates well across all playback scenarios.

Table 1: Spatial Audio Playback System Comparison

| Playback System | Spatial Resolution | Height Perception | Sweet Spot | Market Penetration (2024) |
| --- | --- | --- | --- | --- |
| 7.1.4 Speaker Array | High (discrete positions) | Excellent | Limited (1-3 seats) | ~2% of households |
| Soundbar (e.g., Sonos Arc) | Medium (virtualized) | Moderate | Moderate (3-5 seats) | ~12% of households |
| Headphones (AirPods Pro) | High (binaural HRTF) | Good (HRTF-dependent) | Individual (per listener) | ~35% of smartphone users |
| TV Built-in Speakers | Low (downmixed) | None | Wide (entire room) | ~50% of households |

The Future of Spatial Audio Content Creation

Spatial audio is transitioning from a premium format to a mainstream expectation. Apple Music has offered Spatial Audio with Dolby Atmos since 2021, which has pushed labels to commission Atmos mixes for new releases. Netflix has mandated Atmos delivery for all original content since 2022. Dolby reports that over 500 million devices support Atmos playback, spanning smartphones, tablets, laptops, televisions, soundbars, home theater systems, and automotive audio systems.

The next frontier is real-time spatial audio for live events. Concert venues equipped with L-Acoustics L-ISA or d&b Soundscape systems are delivering spatial audio to audiences in real time, with sound objects positioned to match the on-stage positions of the performers. Broadcast television is beginning to experiment with MPEG-H for sports coverage, where the spatial audio places the listener inside the stadium rather than in front of a stereo broadcast.

For sound designers and mix engineers, spatial audio is no longer a specialty skill -- it's a core competency. The professionals who learn spatial audio now will be positioned for the content creation landscape of the next decade, just as the engineers who learned digital recording in the 1990s were positioned for the content explosion that followed.

References: Leslie Ann Jones, "Spatial Audio Recording Techniques," Sound on Sound, March 2022 | Dolby Laboratories, "Spatial Audio Consumer Adoption Report" (2024) | ITU-R BS.2051-2, "Advanced Sound Programme Production for Immersive Audio" (2021) | Francis Rumsey, "Spatial Audio: Theory and Practice," 2nd Edition (2020)