
Creating Explosion Foley for AR
1) Introduction: the technical problem AR exposes
Explosions are one of the most abused sound categories in media: they’re often designed for impact on stereo speakers, not for plausibility in a listener’s immediate environment. Augmented Reality (AR) forces the issue. When a “virtual” explosion is anchored to a real room, your audio must survive scrutiny from a brain that is continuously cross-checking audio cues against the user’s actual acoustics, head movement, and visual tracking. A conventional cinematic boom may still “sound cool,” but it can fail in AR because:
- Localization is unforgiving: head-tracked binaural rendering reveals phasey close-mics, poor interaural time difference (ITD) behavior, and inappropriate early reflections.
- Scale cues must match visuals: spectral tilt, onset speed, and low-frequency behavior imply charge size and distance. If these don’t match the AR object’s size and placement, it feels wrong.
- The room is not yours: you cannot rely on a pre-baked reverb. The user’s real space is providing its own reflections, and you must decide how much synthetic room to add (if any) without double-counting.
- Mobile constraints bite: AR often runs on a phone or standalone headset. CPU budgets, latency targets, and loudspeaker limitations (tiny drivers) constrain what you can reproduce.
This article treats “explosion foley for AR” as an engineering task: building assets and playback logic that preserve believable shock, scale, direction, and environmental integration under head tracking and device constraints.
2) Background: physics and engineering principles behind “explosion sound”
2.1 What the ear interprets as an explosion
An explosion in air is fundamentally a rapid release of energy that creates a high-pressure front (shock wave in the near field of the blast) followed by turbulent flow, debris interactions, and often a sustained burn/roar. For audio purposes, listeners parse explosions into layers:
- Impulse / crack: fast rise time, broadband energy, cues for proximity and violence.
- Body / boom: low-frequency energy and decay time; strongest distance and scale cue.
- Tail: reflections, environmental response, scattering, and air absorption shaping the late field.
- Debris / grit: granular transients (shrapnel impacts, dirt, wood splinters) that sell materiality.
2.2 Distance law, air absorption, and the “boom that vanishes” on small speakers
In free field, sound pressure from a point source falls off inversely with distance, so level drops by about 6 dB per doubling of distance, assuming no boundary gain. AR experiences are rarely free-field: a living room adds early reflections and low-frequency modal buildup, while outdoor scenes may have strong ground reflections. Your foley must accommodate a range of playback environments.
Air absorption increases with frequency and distance. At 20°C and 50% RH, attenuation above 10 kHz can become noticeable over tens of meters; at 100 m, high-frequency content is heavily reduced compared to low-mid energy. This is critical for AR because visual distance is explicit: if the explosion is rendered as 50 m away but retains lots of 8–12 kHz “crackle,” it contradicts the physics cues.
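As a rough sketch of how these two distance cues combine, the snippet below adds the inverse-distance level drop to a per-band air-absorption loss. The absorption coefficients are illustrative placeholders in the rough order of magnitude expected near 20°C and 50% RH, not values copied from a standard; a real pipeline would take them from a measured or standardized table. The function names and the 1 m reference distance are likewise assumptions made for the example.

```python
import math

# Assumed, illustrative absorption coefficients in dB per metre.
# Placeholders only: look up real values for your temperature and humidity.
ASSUMED_ABSORPTION_DB_PER_M = {1000: 0.005, 4000: 0.03, 8000: 0.10}

def spreading_loss_db(distance_m, ref_distance_m=1.0):
    """Point-source inverse-distance loss relative to the reference distance."""
    return 20.0 * math.log10(max(distance_m, ref_distance_m) / ref_distance_m)

def band_loss_db(distance_m, freq_hz, ref_distance_m=1.0):
    """Spreading loss plus air absorption accumulated beyond the reference."""
    extra_path_m = max(distance_m - ref_distance_m, 0.0)
    absorption_db = ASSUMED_ABSORPTION_DB_PER_M[freq_hz] * extra_path_m
    return spreading_loss_db(distance_m, ref_distance_m) + absorption_db

for d in (2.0, 50.0):
    losses = {f: round(band_loss_db(d, f), 1) for f in ASSUMED_ABSORPTION_DB_PER_M}
    print(f"{d:5.1f} m -> loss in dB per band: {losses}")
```

With these placeholder numbers the bands stay within a fraction of a dB of each other at 2 m but diverge by several dB at 50 m, which is the spectral tilt the visual distance has to be matched against.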
On mobile/headset speakers, energy below ~150–250 Hz may be reproduced weakly or only via psychoacoustic bass enhancement. Because the 50–120 Hz “feel” translates poorly, explosion design for AR must include perceptual low-end cues: harmonics, controlled distortion, or resonant layers that read on small drivers while staying believable on headphones.
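To make the harmonic idea concrete, here is a minimal sketch that pushes a 60 Hz tone through a symmetric tanh waveshaper and measures what appears at 180 Hz. The waveshaper, the drive value, and the helper names are assumptions chosen for illustration, not a recommended production saturator. Note that a symmetric shaper such as tanh adds odd harmonics only (180 Hz, 300 Hz, and so on); even harmonics such as 120 Hz would need an asymmetric curve.

```python
import math

SAMPLE_RATE = 48000

def tone(freq_hz, n_samples, amp=0.8):
    """Plain sine test tone."""
    return [amp * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]

def saturate(samples, drive=3.0):
    """Symmetric tanh waveshaper (assumed drive); adds odd harmonics only."""
    norm = math.tanh(drive)
    return [math.tanh(drive * s) / norm for s in samples]

def magnitude_at(samples, freq_hz):
    """Single-bin DFT magnitude, scaled so a unit-amplitude sine reads about 1.0."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE)
             for n, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
             for n, s in enumerate(samples))
    return 2.0 * math.hypot(re, im) / len(samples)

clean = tone(60.0, 4800)      # 0.1 s, a whole number of 60 Hz cycles
shaped = saturate(clean)

for label, sig in (("clean", clean), ("shaped", shaped)):
    print(label, "60 Hz:", round(magnitude_at(sig, 60.0), 3),
          "180 Hz:", round(magnitude_at(sig, 180.0), 3))
```

The clean tone has essentially nothing at 180 Hz, while the shaped tone carries a clear third harmonic that a small driver can reproduce even when the 60 Hz fundamental is lost.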
2.3 Time structure: rise time, precedence, and head tracking
Humans localize transients using ITD/ILD and onset timing. A true blast has a steep onset; if your foley uses slow fades or smeared transients, localization becomes ambiguous. Under head tracking, ambiguity becomes obvious as the sound “swims” instead of pinning to an AR anchor.
Early reflections are interpreted under the precedence effect: the first arriving wavefront dominates localization, while later arrivals contribute spaciousness. In AR, the real room already adds reflections to anything played over loudspeakers (or leaking from headphones), so strong synthetic early reflections can shift the apparent source position or blur it.
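For a sense of the time scales involved, the sketch below evaluates the Woodworth spherical-head approximation for ITD, ITD = (a / c) * (theta + sin theta). This is a textbook simplification swapped in for illustration, not what any particular HRTF renderer computes, and the head radius and speed of sound are assumed values.

```python
import math

SPEED_OF_SOUND_M_S = 343.0        # assumed: air near 20 degrees C
ASSUMED_HEAD_RADIUS_M = 0.0875    # assumed average; real heads vary

def woodworth_itd_s(azimuth_deg):
    """ITD for a distant source at 0-90 degrees azimuth (0 = straight ahead)."""
    theta = math.radians(min(max(abs(azimuth_deg), 0.0), 90.0))
    return (ASSUMED_HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))

for az in (0, 30, 90):
    itd = woodworth_itd_s(az)
    print(f"{az:>2} deg: {itd * 1e6:6.0f} us = {itd * 48000:5.1f} samples at 48 kHz")
```

Under these assumptions the largest ITD is well under a millisecond, which suggests why an onset smeared over many milliseconds gives the auditory system little to lock onto.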
2.4 Standards and metrics relevant to AR asset prep
While AR platforms vary, professional pipelines benefit from consistent measurement:
- Sample rate / bit depth: 48 kHz, 24-bit is a common baseline; 96 kHz can preserve transient design headroom but costs CPU/storage.
- True peak management: keep assets below −1 dBTP to survive sample-rate conversion and spatial processing; use true-peak meters (ITU-R BS.1770 methodology is widely adopted for true-peak measurement).
- Loudness normalization: platform policies differ; internal mixing often targets a consistent integrated loudness. For effects libraries, a practical approach is to deliver calibrated assets (e.g., −23 to −18 LUFS integrated for long tails, higher for short transient-only assets) and then mix in-engine.
- Latency budgets: head-tracked spatial audio becomes uncomfortable if end-to-end latency is high. Many AR systems aim to keep motion-to-sound update latency below roughly 10–20 ms, so long look-ahead limiters and linear-phase processing can be problematic.
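The latency point can be checked with simple arithmetic. The sketch below sums three contributions on the audio side only: one output buffer, a limiter look-ahead, and the group delay of a linear-phase FIR, which is (taps − 1) / 2 samples. The buffer size, look-ahead, and tap count are hypothetical example values, and tracking, rendering, and transport latency are deliberately left out.

```python
def buffer_latency_ms(frames, sample_rate_hz):
    """Time represented by one audio buffer."""
    return 1000.0 * frames / sample_rate_hz

def chain_latency_ms(buffer_frames, sample_rate_hz, lookahead_ms=0.0, fir_taps=0):
    """Output buffer + limiter look-ahead + linear-phase FIR group delay."""
    fir_delay_ms = 0.0
    if fir_taps > 0:
        fir_delay_ms = 1000.0 * ((fir_taps - 1) / 2.0) / sample_rate_hz
    return buffer_latency_ms(buffer_frames, sample_rate_hz) + lookahead_ms + fir_delay_ms

# Hypothetical chain: 256-frame buffer, 5 ms look-ahead, 1023-tap linear-phase EQ.
lean = chain_latency_ms(256, 48000)
heavy = chain_latency_ms(256, 48000, lookahead_ms=5.0, fir_taps=1023)
print(f"buffer only: {lean:.2f} ms, with look-ahead and FIR: {heavy:.2f} ms")
```

In this example the buffer alone costs about 5.3 ms, while adding the look-ahead and the FIR pushes the audio chain to roughly 21 ms before any tracking latency is counted.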
3) Detailed technical analysis: building explosion foley that survives AR
3.1 Source layering architecture (recommended)
A robust AR explosion asset is not a single WAV; it’s a bundle of coherent elements with separate control in-engine:
- Core transient (0–80 ms): tight broadband hit, minimal pre-delay, clean polarity alignment.
- LF body (50–400 ms): low-mid “thump” with controlled decay; tuned to avoid masking speech/UI.
- Mid “pressure” (200 ms–1.5 s): dense energy 150 Hz–2 kHz, often where perceived power lives on small speakers.
- Tail (0.8–6 s): environment-specific late field; ideally parameterized rather than baked.
- Debris layer (0.1–2 s): granular impacts, filtered noise, dirt/metal/wood specifics.
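One way to picture the bundle is as a small data structure that the engine can query per layer. The sketch below encodes the time windows listed above; the class name, field names, and the `active_layers` helper are hypothetical and not tied to any engine or middleware API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    name: str
    start_s: float   # when the layer begins, relative to detonation
    end_s: float     # when the layer has effectively decayed

# Time windows taken from the list above; names are illustrative.
EXPLOSION_LAYERS = (
    Layer("core_transient", 0.0, 0.08),
    Layer("lf_body", 0.05, 0.4),
    Layer("mid_pressure", 0.2, 1.5),
    Layer("tail", 0.8, 6.0),
    Layer("debris", 0.1, 2.0),
)

def active_layers(t_s, layers=EXPLOSION_LAYERS):
    """Names of the layers sounding at time t_s (seconds after detonation)."""
    return [layer.name for layer in layers if layer.start_s <= t_s < layer.end_s]

for t in (0.06, 1.0, 5.0):
    print(f"t = {t:4.2f} s -> {active_layers(t)}")
```

Keeping the layers separate like this is what lets the engine later scale, filter, or drop the tail without touching the transient.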
3.2 Data points: timing and spectral targets that read as “explosion”
Real blasts vary enormously, but for designed foley that reads well in AR, the following heuristics are effective:
- Transient rise time: keep the initial peak’s rise within ~1–5 ms for close explosions; for distant events, soften to ~5–20 ms and reduce HF content.
- HF extent: close “crack” often benefits from energy up to 8–12 kHz; distant blasts roll off aggressively above ~4–6 kHz.
- LF emphasis band: perceived “weight” often centers around 60–120 Hz on headphones/sub systems; for translation, add supporting harmonics in the 120–240 Hz region.
- Tail length: outdoors, ~0.5–2 s (unless in canyons/urban corridors); indoors, tails can extend 2–6 s depending on room size and damping. In AR, use shorter synthetic tails to avoid conflicting with the user’s real room acoustics.
- Dynamic range: explosions are intrinsically high crest-factor. Preserve crest factor where possible; avoid flattening with heavy limiting that creates “constant loud” artifacts under spatialization.
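Crest factor is easy to measure, which makes the last heuristic testable. The sketch below computes it as 20·log10(peak / RMS) and shows how hard clipping, used here as a crude stand-in for heavy limiting, lowers it. The decaying 100 Hz burst is a toy signal invented for the example, not a model of a real blast.

```python
import math

def crest_factor_db(samples):
    """Peak-to-RMS ratio in dB."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("silent buffer has no defined crest factor")
    return 20.0 * math.log10(peak / rms)

def hard_clip(samples, ceiling):
    """Crude stand-in for aggressive limiting."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]

# Toy burst: exponentially decaying 100 Hz sine, 0.5 s at 48 kHz (illustrative).
burst = [math.exp(-n / 4800.0) * math.sin(2 * math.pi * 100.0 * n / 48000.0)
         for n in range(24000)]

print("original crest factor (dB):", round(crest_factor_db(burst), 1))
print("clipped crest factor (dB): ", round(crest_factor_db(hard_clip(burst, 0.3)), 1))
```

Tracking this number before and after each processing stage is a simple way to notice when dynamics processing has started to flatten the event.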
3.3 Recording and foley construction methods
AR explosions are typically designed rather than recorded literally, for safety and control. Professional foley/sound design techniques that produce convincing components:
- Transient: balloon pops, starter pistols, firecracker snaps (where legal), slammed heavy objects, close-mic’d transient-rich sources. Capture at 96 kHz if possible to retain transient detail for later processing.
- Body: large-diaphragm kick drum hits, dumpster impacts, distant thunder recordings, slowed-down impacts. Time-stretch with high-quality algorithms to avoid metallic artifacts.
- Pressure noise: filtered noise bursts, gas whooshes, combustion “chuffs,” layered with short convolution impulses.
- Debris: gravel drops, glass tinkles (careful—can read as “brittle” and too small), wood splinters, metal clanks; use micro-delays to spread impacts spatially.
Microphone strategy: even for foley, record multiple perspectives. A practical setup includes a close mic for transient detail (e.g., dynamic or small diaphragm condenser), and a mid/far mic for natural air and time smear. Maintain phase awareness: align or deliberately offset layers, but avoid accidental cancellations around 80–200 Hz where “body” lives.
3.4 Processing chain: preserving punch under spatial rendering
Spatializers (HRTF binaural, Ambisonics decoders, object-based renderers) can change peak levels and perceived brightness. A conservative processing approach:
- High-pass filtering: remove unusable subsonics (<20–30 Hz) to protect headroom. For mobile playback, even a 40 Hz high-pass can be appropriate depending on content.
- Transient shaping: use sparingly; over-enhancement can cause brittle localization artifacts when head rotates.
- Multiband dynamics: prefer gentle band control over brickwall limiting. Example: lightly compress 120–400 Hz (1.5–2:1, 10–30 ms attack, 80–200 ms release) to stabilize “thump” without flattening the crack.
- Saturation for translation: add subtle harmonic enrichment centered around 100–250 Hz to imply weight on small speakers. Keep it low; in binaural, distortion can feel “inside the head” if overdone.
- True-peak limiting: last stage, minimal gain reduction (1–3 dB GR), ceiling at −1 dBTP.
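Two of the numbers in this chain reduce to short formulas. The sketch below shows the static gain curve of a hard-knee compressor at a given ratio and the conversion of a dB ceiling to linear amplitude. The threshold is a hypothetical value (the text above does not specify one), and attack and release behavior is ignored, so this describes only the steady-state curve.

```python
def compressor_gain_db(level_db, threshold_db, ratio):
    """Static gain change (dB) of a hard-knee compressor; 0 below threshold."""
    over_db = level_db - threshold_db
    if over_db <= 0.0:
        return 0.0
    return -over_db * (1.0 - 1.0 / ratio)

def db_to_linear(db):
    """Convert a level in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

# Hypothetical band level 6 dB over an assumed -12 dB threshold.
for ratio in (1.5, 2.0):
    print(f"ratio {ratio}:1 -> gain change {compressor_gain_db(-6.0, -12.0, ratio):.1f} dB")
print(f"-1 dB ceiling as linear amplitude: {db_to_linear(-1.0):.4f}")
```

At 1.5–2:1 a band that is 6 dB over threshold is pulled down by only 2–3 dB, which is consistent with the "gentle band control" intent. The ceiling conversion applies to sample values; a true-peak measurement additionally requires oversampling.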
3.5 AR-specific: environmental integration without double-rooming
In AR, a common failure mode is double room: you add a big tail, then the user’s room adds another, resulting in a smeared, detached explosion. Strategies:
- Dry-first design: keep core transient and body mostly dry. Use engine reverb that can be adjusted or disabled based on platform capabilities and detection of user environment.
- Late-only reverb: if you must bake reverb, bias toward late energy and reduce early reflections, preserving localization.
- Distance-dependent tail: scale tail level and HF damping with distance. Far explosions can have more environment tail relative to transient; near explosions should be more direct.
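A distance-dependent tail can be driven by a small mapping function. The sketch below interpolates on a logarithmic distance scale between a near and a far setting and returns a tail level relative to the direct sound plus a low-pass cutoff for the tail. The 12 kHz and 4 kHz end points echo the spectral heuristics given earlier; the −18 dB and −6 dB tail levels, the 1 m and 50 m limits, and the function name are assumptions for illustration that would need tuning by ear.

```python
import math

def tail_params(distance_m, near_m=1.0, far_m=50.0):
    """Return (tail_level_db_re_direct, tail_lowpass_hz) for a source distance."""
    d = min(max(distance_m, near_m), far_m)
    t = math.log(d / near_m) / math.log(far_m / near_m)   # 0 at near, 1 at far
    tail_db = -18.0 + 12.0 * t                            # assumed: -18 dB near, -6 dB far
    lowpass_hz = 12000.0 * (4000.0 / 12000.0) ** t        # 12 kHz near, 4 kHz far
    return tail_db, lowpass_hz

for d in (1.0, 7.0, 50.0):
    level_db, cutoff_hz = tail_params(d)
    print(f"{d:5.1f} m -> tail {level_db:6.1f} dB re direct, low-pass {cutoff_hz:7.0f} Hz")
```

The logarithmic interpolation is a design choice rather than a physical law: it spends most of the parameter change over the first few meters, where distance judgments tend to be most sensitive.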
3.6 Visual description: a practical layer timeline diagram
Imagine a horizontal timeline from 0 to 6 seconds:
- 0–0.08 s: a tall, narrow spike (transient crack), full-bandwidth but controlled above 10 kHz.
- 0.05–0.4 s: a wide low-frequency hill (60–200 Hz) peaking slightly after the crack.
- 0.2–1.5 s: a dense midband “cloud” (150 Hz–2 kHz) that decays smoothly.
- 0.8–6 s: a fading tail with decreasing brightness (air absorption curve), minimal early-reflection bumps.
- 0.1–2 s: scattered small spikes (debris impacts), some panned/spatialized slightly differently for width.
4) Real-world implications: mixing, playback systems, and user safety
4.1 Headphones vs speakers: the AR split
AR users may listen on open speakers (phone/tablet), near-ear transducers, or sealed headphones. Each changes your explosion design priorities:
- Phone/tablet speakers: limited below ~150–250 Hz; emphasize 150–400 Hz punch and harmonic weight. Keep peaks under control because small speakers distort audibly.
- Open-ear AR headsets: bass is often weak and environmental noise is high; transients and midrange clarity matter. Overly subtle tails disappear.
- Sealed headphones: full bandwidth possible; too much sub/ultra-low can be fatiguing. Binaural artifacts are more apparent; keep phase-coherent layers.
4.2 Safety and comfort
Explosions can cause discomfort if dynamic peaks are aggressive, especially in headsets. A pragmatic engineering stance is to cap effect peaks in the mix bus (not by crushing assets) and to maintain consistent loudness relative to UI and dialogue. If your platform supports it, provide a “reduced intensity” mode that reduces transient level and LF body by a few dB without destroying the event’s readability.
4.3 Asset memory and CPU budgets
Long, high-sample-rate stereo files consume bandwidth and memory. AR benefits from a hybrid approach: short PCM assets for transient/body, and procedurally generated or parameterized tails (reverb, filtered noise) computed in-engine. When streaming is required, ensure your codec choice doesn’t pre-echo transients; explosions are particularly revealing of transform coding artifacts.
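The memory argument is straightforward to quantify for uncompressed PCM. The sketch below compares a monolithic print with a short transient/body asset; the durations and formats are example values chosen to match ranges mentioned in this article, and the function name is an assumption.

```python
def pcm_bytes(duration_s, sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM payload size in bytes (no container overhead)."""
    frames = int(round(duration_s * sample_rate_hz))
    return frames * channels * (bit_depth // 8)

monolithic = pcm_bytes(6.0, 96000, 24, 2)   # full 6 s stereo print at 96 kHz
short_core = pcm_bytes(0.4, 48000, 24, 1)   # 0.4 s mono transient/body at 48 kHz

print(f"monolithic: {monolithic / 1e6:.2f} MB")
print(f"short core: {short_core / 1e3:.1f} kB")
print(f"ratio:      {monolithic / short_core:.0f}x")
```

In this example the short asset is about 60 times smaller, before counting any variations per explosion type, which is where the savings from a procedural tail tend to multiply.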
5) Case studies: professional patterns that work
5.1 “Tabletop detonation” in a living room AR demo
Scenario: a small virtual charge detonates on a coffee table, with users standing 1–2 m away. The room is the user’s actual living room; you cannot predict RT60.
- Design choices: dry transient with very tight onset; modest LF body focused around 80–160 Hz; debris layer of wood splinters and small objects; minimal baked reverb.
- Distance model: at 1 m, keep HF present but not sizzling; at 2 m, reduce 6–12 kHz by ~2–4 dB and increase tail ratio slightly.
- Practical result: localization stays anchored to the table under head turns; the user’s room supplies believable reflections naturally.
5.2 “Street-level blast” in an outdoor AR navigation experience
Scenario: a virtual explosion occurs 30–50 m down a street. Visuals show a flash and dust plume.
- Design choices: softened transient (rise ~10–20 ms), significant HF roll-off above ~4–6 kHz, strong 120–250 Hz content for translation on mobile speakers, and a medium tail (0.8–2 s) suggesting outdoor reflections.
- Engineering note: for street canyons, add a subtle, delayed slap (e.g., 80–180 ms) at low level to imply building reflections without creating a “reverb box.”
- Practical result: users perceive correct distance; the explosion doesn’t sound “too close” or “inside the phone.”
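The slap delay can be sanity-checked against geometry. The sketch below uses a single image-source reflection off one wall parallel to the line between source and listener, with both at the same distance from that wall; this simplified layout, the 343 m/s speed of sound, and the function names are assumptions made for the example.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: air near 20 degrees C

def slap_delay_ms(direct_m, wall_offset_m):
    """Delay of a single wall reflection relative to the direct sound.

    Source and listener are both wall_offset_m from the wall and
    direct_m apart along it (image-source geometry).
    """
    reflected_m = math.hypot(direct_m, 2.0 * wall_offset_m)
    return 1000.0 * (reflected_m - direct_m) / SPEED_OF_SOUND_M_S

def extra_path_m(delay_ms):
    """Extra path length implied by a given reflection delay."""
    return SPEED_OF_SOUND_M_S * delay_ms / 1000.0

print(f"30 m direct, wall 20 m away: {slap_delay_ms(30.0, 20.0):.1f} ms")
for delay in (80.0, 180.0):
    print(f"{delay:.0f} ms slap implies about {extra_path_m(delay):.0f} m of extra path")
```

Delays of 80–180 ms correspond to tens of meters of extra path, so a slap in that range reads as a reflection from a distant facade or the far end of the street rather than from the nearest wall.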
5.3 “Cinematic” explosion adapted for head-tracked binaural
Scenario: a pre-existing library explosion designed for film is reused in AR. Typical issues: heavy bus limiting, wide stereo low end, baked reverb, and phasey layers.
- Remediation: split into transient/body/tail stems; collapse LF below ~120 Hz to mono (or a coherent center object) to prevent spatial wobble; remove or reduce baked early reflections; restore crest factor by undoing limiting where possible (or rebuild transient from cleaner sources).
- Practical result: the same “brand” of explosion translates, but now anchors correctly and doesn’t smear under head movement.
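The low-frequency collapse step can be sketched with mid/side processing: low-pass the side signal and subtract the result, so only side content above the cutoff survives. The example below uses a one-pole filter, which gives a gentle 6 dB/octave slope; that filter choice and the function name are assumptions for illustration, and a production crossover would likely be steeper.

```python
import math

def mono_below(left, right, cutoff_hz=120.0, sample_rate_hz=48000):
    """Remove stereo (side) content below cutoff_hz; mid is left untouched."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    side_lp = 0.0
    out_l, out_r = [], []
    for l, r in zip(left, right):
        mid = 0.5 * (l + r)
        side = 0.5 * (l - r)
        side_lp = (1.0 - a) * side + a * side_lp   # one-pole low-pass of side
        side_hp = side - side_lp                   # side content above cutoff
        out_l.append(mid + side_hp)
        out_r.append(mid - side_hp)
    return out_l, out_r

# Out-of-phase DC is the extreme case of "wide" low end: it should vanish.
wide_l, wide_r = mono_below([1.0] * 4800, [-1.0] * 4800)
print("residual after settling:", abs(wide_l[-1]), abs(wide_r[-1]))
```

Content that is identical in both channels passes through unchanged, so the mono-compatible part of the body layer is preserved while the wobble-prone stereo low end is removed.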
6) Common misconceptions (and what actually works)
- Misconception: “More sub = more power.”
Correction: in AR playback, sub-80 Hz often doesn’t reproduce. Power is frequently perceived from 100–250 Hz punch plus transient clarity. Add harmonics and mid-pressure rather than only sub.
- Misconception: “Stereo width makes it bigger.”
Correction: wide stereo LF can destabilize localization in binaural renderers. Keep the initial event spatially coherent; widen debris and tail instead.
- Misconception: “Bake reverb so it sounds finished.”
Correction: baked early reflections fight the user’s real environment. In AR, prioritize dry direct sound and use controllable late-field reverb if needed.
- Misconception: “Explosions should always be bright and sharp.”
Correction: distance and air absorption demand HF roll-off. If the explosion looks far but sounds bright, users perceive a scale error immediately.
- Misconception: “Limiters solve loudness consistency.”
Correction: heavy limiting reduces crest factor and can create fatigue, especially in headsets. Use mix-level management and gentle multiband control.
7) Future trends: where AR explosion audio is heading
- Scene-aware acoustics: improved real-time estimation of room size/materials using device sensors could allow reverb matching and reflection synthesis that complements the user’s space rather than guessing.
- Perceptual bass synthesis tuned for AR: smarter harmonic generation that adapts to device playback limits and ambient noise, keeping explosions impactful without overdriving small transducers.
- Object-based mixing with metadata: delivering explosion assets as parameterized objects (transient/body/debris/tail) with metadata for distance filters, directivity, and intensity scaling, instead of monolithic WAVs.
- Better HRTF personalization: individualized HRTFs reduce localization errors and front/back confusion, making precise explosion placement more convincing—while also making weak assets easier to notice, raising the quality bar.
- Dynamic range governance: platform-level loudness and peak management for AR could converge toward more standardized behavior, similar to broadcast loudness practices, but adapted for interactive head-tracked audio.
8) Key takeaways for practicing engineers
- Design explosions as controllable layers (transient, body, pressure, tail, debris), not single prints.
- Protect localization: keep the first-arrival transient dry, phase-coherent, and fast. Avoid wide/phasey LF.
- Match visual distance with physics cues: inverse-distance level, HF damping, softened rise time, and tail ratio must align with what the user sees.
- Assume unknown rooms: minimize baked early reflections to avoid double-room artifacts; prefer late-only or engine-controlled reverb.
- Mix for device translation: create perceived weight in 120–250 Hz and controlled harmonics; don’t rely on sub-bass that won’t play.
- Maintain crest factor: explosions need transient contrast; use gentle dynamics and true-peak ceilings (e.g., −1 dBTP) rather than aggressive limiting.
- Test under head tracking at multiple distances and on multiple playback paths (phone speaker, open-ear, sealed headphones). Problems that hide in stereo become obvious in AR.
AR doesn’t require that explosions be “real.” It requires that they be consistent with the user’s sensory context. When the transient is localizable, the spectrum implies the correct scale, and the environmental response doesn’t fight the room the user is standing in, explosions stop sounding like a sound effect and start sounding like an event.









