Immersive Audio Is NOT Going Away (And Here’s Why…)

It’s easy to feel skeptical about immersive audio… I mean, surround sound (even for film) never really caught on with most consumers because very few people wanted to buy and configure an expensive speaker system.

But the technology I’ll demonstrate may lead to you and me and everyone else listening to almost everything in an immersive format in just a few years, even if we don’t have an expensive speaker system.

Thanks to Sennheiser for sponsoring this video and supporting audio education.

You’re going to want to wear headphones for this one… I’ll be using the Sennheiser HD 490 PRO headphones, but the demos in this video can be enjoyed with any headphones. In fact, that’s the true power of the technology I want to show you.

First, let’s start with a quick demo… 

Put on your headphones and listen closely – you might even close your eyes.

Notice that the individual sounds seem to come from every direction.

Using just headphones, you can experience width, height, and depth.

But how is this possible with only headphones, with no speakers behind you or above you? There are a few techniques at play here. Some new, and some old.

In the very first recordings, depth was the only immersive variable that could be achieved. When a performer stood closer to the recording device, they sounded closer when someone listened to the recording.

This effect can be somewhat simulated by adjusting the level of each sound source in a mix. Make the lead vocal louder than the background vocals, and the background vocals will appear to be further away. This makes sense, because sound gets quieter the further it travels from a source. Using this principle adds some depth to a mix, but modern technology takes this to a new level – more on that in a moment.
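To put a rough number on that level-for-depth trick, here’s a small sketch using the inverse distance law (the roughly 6 dB drop per doubling of distance you get from a point source in free space; real rooms add reflections that soften this, so treat it as an idealization):

```python
import math

def distance_gain_db(distance_m: float, reference_m: float = 1.0) -> float:
    """Level drop (in dB) for a point source at `distance_m`,
    relative to its level at `reference_m`, using the inverse
    distance law (about -6 dB per doubling of distance)."""
    return -20.0 * math.log10(distance_m / reference_m)

# A background vocal "placed" 4 m away vs. a lead vocal at 1 m:
print(round(distance_gain_db(4.0), 1))  # -12.0 dB relative to 1 m
```

So turning a background vocal down by roughly 12 dB mimics moving it from one meter away to four meters away, which is one reason simple fader moves read as depth.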

When we moved from monophonic sound systems (with only one speaker) to stereophonic systems (with a left speaker and right speaker), music and film became a lot more immersive, because the addition of a second speaker opened the door to width and a stereo image. 

When you send the same signal to two speakers in a stereo configuration, you’ll hear a phantom image of the sound that seems to come from between the speakers. By adjusting the relative balance between the left and right speaker, you can pan that sound side-to-side. 
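That left/right balance adjustment is usually done with a pan law. Here’s a minimal sketch of the common constant-power (sin/cos) law; note that real DAWs vary (some use -3, -4.5, or -6 dB center laws), so this is an illustration rather than any particular DAW’s implementation:

```python
import math

def constant_power_pan(pan: float) -> tuple[float, float]:
    """Return (left_gain, right_gain) for a pan position in [-1, 1]:
    -1 = hard left, 0 = center, +1 = hard right.
    Uses the constant-power (sin/cos) pan law, so that
    left^2 + right^2 == 1 at every pan position."""
    angle = (pan + 1.0) * math.pi / 4.0  # map [-1, 1] onto [0, pi/2]
    return math.cos(angle), math.sin(angle)

left, right = constant_power_pan(0.0)
# At center, both gains are ~0.707 (-3 dB), keeping perceived loudness
# roughly constant as the phantom image moves between the speakers.
```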

The next logical step was to start adding more speakers… So engineers experimented with quadraphonic audio and 5.1 Surround, where speakers are placed around the listener at ear level. However, phantom imaging only really works from side to side, because our ears are on the sides of our head. 

That means you can’t create a phantom image between the front right and the rear right speaker. If you want to hear a sound from the side, you need a side speaker. And if you want to hear a sound from above, you need a height speaker. Hence, we have started to see formats like 7.1.4 or higher with speakers to the sides, above, and sometimes below the listener…

But compared to mono, the primary problem with all of these other formats – including stereo – is that they all require not only more speakers, but more setup.

When you put a mono speaker in front of a listener and point it at them, the setup is pretty simple and repeatable, even for casual listeners. Replace that single mono speaker with a simple stereo or LCR soundbar, and the setup process is still not too demanding. But when you start adding in other speakers, things become more complicated…

For one, the fidelity and immersiveness of a speaker system are heavily dependent on the position and configuration of the speakers. If you want to feel surrounded by your surround sound system, the speakers not only need to surround you, but they should also adhere to placement standards. Otherwise, a sound that is intended to be behind the listener could sound like it’s coming from the end table instead. True immersiveness depends heavily on the setup when using physical speakers. 

This is why it’s such a big deal that we can now put on a pair of headphones and hear things from all around us. Obviously, the immersiveness of the experience will vary between a cheap pair of earbuds, an entry-level pair of studio headphones, and a professional pair of headphones like the HD 490 PROs. The timbre, distortion, and immersiveness will be significantly improved with higher-quality headphones. 

But every one of these options provides a way to experience immersive audio, and that’s the important part. It’s a huge shift in the cost of entry for immersive audio.

To me, this suggests the world will be more likely to adopt immersive audio because many people already have headphones. Mixing engineers and audio enthusiasts will still build amazing immersive speaker setups, but some version of that immersive experience will be available to anyone in headphones. 

Let me demonstrate these concepts with a free binaural panner plugin called dearVR MICRO.

A basic pan knob (like the one built into a DAW) is based on the concept from earlier – louder on the left side suggests the sound source is to your left. Louder on the right side suggests it’s to your right. 

This principle produces a good stereo image with speakers, but on headphones, the sound seems to come from inside your head rather than from in front of you. 

That’s partly because headphones lack the natural crossfeed you get with speakers (where each ear hears both the left and right channels), and partly because, when you listen through speakers, your skull and outer ears filter the sound in ways that tell your brain where it’s coming from – effects you don’t hear on headphones. The sum of these effects is called a Head-Related Transfer Function (or HRTF). 

Microphones such as the Neumann KU 100 aim to capture these HRTF effects during the recording process by using a dummy head with microphone capsules in its ears. 

This takes advantage of the three primary localization cues: 

Interaural level differences (ILD), where a sound is louder in one ear than the other (as we’ve already discussed); interaural time differences (ITD), where a sound reaches one ear slightly before it reaches the other; and HRTFs, which include the spectral differences caused by the sound’s acoustic interaction with the head and ears. Think of an HRTF as an EQ filter that changes depending on where a sound originates. 
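The timing cue is small but measurable. Here’s a sketch of Woodworth’s classic spherical-head approximation for ITD; the 8.75 cm head radius is a commonly cited average, not a measured value, and real heads deviate from a sphere:

```python
import math

def itd_seconds(azimuth_deg: float, head_radius_m: float = 0.0875,
                speed_of_sound: float = 343.0) -> float:
    """Approximate interaural time difference for a distant source at
    `azimuth_deg` (0 = straight ahead, 90 = fully to one side), using
    Woodworth's spherical-head model: ITD = (r / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))

# A source fully to one side arrives about 0.66 ms earlier at the near ear:
print(round(itd_seconds(90.0) * 1000, 2))  # 0.66 (milliseconds)
```

Differences well under a millisecond are enough for the brain to localize a sound source, which is why even whole-sample delays at typical sample rates matter.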

Binaural microphones are useful not only for making immersive recordings, but also for understanding how HRTFs affect sound arriving from different angles and at different frequencies. 

If you took one of these dummy head microphones, placed it in front of a speaker, and rotated it, you could measure the HRTF at a variety of points along the horizontal plane. To capture the horizontal and vertical dimensions, you could place the binaural microphone in a dome or a sphere of speakers and measure. 

Theoretically, that data could be loaded into a plugin like dearVR MICRO, which performs an HRTF simulation, allowing you to pan sounds that were recorded with a normal microphone (such as this shaker) horizontally and vertically.

We can simulate the shaker in front, side-to-side, above, and behind using this plugin which has a built-in HRTF model. 
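You can get a feel for part of what a binaural panner does with a toy model that applies only the timing and level cues (and deliberately skips the spectral HRTF filtering that plugins like this actually rely on). Everything here – the 0.18 m ear spacing, the 0.6 far-ear gain, the sample rate – is an illustrative assumption, not how dearVR MICRO works internally:

```python
import math

def naive_binaural_pan(mono, azimuth_deg, sr=48000):
    """Very rough binaural-style pan for a list of mono samples:
    applies an interaural time difference (as a whole-sample delay)
    and a crude level difference to the far ear. Real HRTF processing
    also applies direction-dependent EQ, which this sketch omits.
    Ear spacing (0.18 m) and far-ear gain (0.6) are assumptions."""
    itd = (0.18 / 343.0) * math.sin(math.radians(abs(azimuth_deg)))
    delay = int(round(itd * sr))  # delay for the far (shadowed) ear
    near = [s * 1.0 for s in mono]
    far = ([0.0] * delay + [s * 0.6 for s in mono])[:len(mono)]
    if azimuth_deg >= 0:          # source on the right: right ear is near
        return list(zip(far, near))   # (left, right) sample pairs
    return list(zip(near, far))
```

Feeding a click through this at +90 degrees makes it arrive in the right channel about 25 samples (at 48 kHz) before the left. That’s enough for a clear left/right shift, but without the spectral cues it still won’t externalize convincingly – which is exactly why the HRTF model in the plugin matters.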

Simulated reflections are also an important element of immersiveness, because they give our brain additional information about the size and shape of the space we are listening in.

Let’s open up a different plugin called dearVR PRO 2, which will help to visualize the controls that are used in the world of immersive audio production. 

Generally speaking, the parameters of a panner plugin can be exposed for automation so that an engineer can program a combination of all of these parameters to match sound to picture or create FX in a music mix. 

But there are a few key limitations to binaural technology as it is today…

The first problem is that your head and ears are shaped differently from my head and ears. You’ve heard the dearVR HRTF model, which is a generic HRTF model that attempts to translate well to a wide range of head and ear shapes. 

This works pretty well for me, but generic HRTF models like this don’t work very well for some people. The ideal solution would be to measure your own HRTF and use that data (which is customized to you and the way you hear the world).

Every day, there is more research into making generic HRTFs better, and into new ways of measuring custom HRTFs at scale. I think these will become increasingly realistic as the technology advances. 

The second problem is delivery to the listener. Don’t get the wrong idea here – the delivery of a binaural mix is no more difficult than the delivery of a stereo mix. I was able to upload a simple 2-channel stereo file to YouTube and the binaural effect works quite well for you (and people across the world) listening in headphones. 

However, you wouldn’t fully experience the immersiveness if you watched this video with speakers, because the mix was specifically made for binaural listening in headphones. In fact, those HRTF simulations might sound kind of strange on speakers… 

I can change the ‘Output Format’ from ‘Binaural’ to ‘2CH Stereo’ and now it sounds better on speakers, but we don’t get the binaural effect in headphones unless we switch back to ‘Binaural’. 

The ultimate goal would be to get the right mix to each listener, which depends on the playback system the listener is using. If you have stereo speakers, you need the stereo mix. Someone with a multi-channel speaker system needs a mix suited for their system, while someone listening in headphones needs the binaural mix. 

Fortunately, there are several different frameworks that address this problem. 

If you’re interested, I’d recommend looking into ambisonics, as well as object-based mixing. Each one provides a way for recording and mixing engineers to create recordings that can be decoded or rendered to binaural for headphones, or a variety of other playback configurations – all with the same set of audio files and metadata. 

Another way binaural could be improved is with head tracking, where the sound field adapts to the direction of the listener’s head. As humans, we learn a lot about the sounds around us by making small movements with our heads. 

For example, a sound in front of us and to our left produces the same level and timing differences as a sound behind us and to our left. This is sometimes referred to as the “cone of confusion”. A simple rotation of our head quickly tells us whether that sound is in front of or behind us. And while this is possible with an immersive speaker system, it isn’t possible in headphones without head tracking. 
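The front/back ambiguity falls straight out of even the simplest ITD model, where the timing difference depends only on the sine of the azimuth. This is a toy model with an assumed 0.18 m ear spacing, just to show the symmetry:

```python
import math

def simple_itd_ms(azimuth_deg, ear_spacing_m=0.18, speed_of_sound=343.0):
    """Toy ITD model: inter-ear path difference ~ spacing * sin(azimuth).
    Azimuth is measured from straight ahead, so 30 degrees is front-left
    and 150 degrees is rear-left (mirrored across the ear axis)."""
    return (ear_spacing_m / speed_of_sound) * math.sin(math.radians(azimuth_deg)) * 1000.0

front = simple_itd_ms(30.0)   # front-left source
rear = simple_itd_ms(150.0)   # rear-left source, mirrored front-to-back
# Both come out to ~0.26 ms: the timing cue alone can't separate them,
# which is the cone of confusion that a small head turn resolves.
```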

There are already a few different systems in place for head tracking, but the limitation right now is a lack of hardware and standards that support the widespread use of head tracking technology. 

Once this technology becomes more widely available, we can take it a step further by implementing six degrees of freedom (6DoF). This would allow listeners not only to rotate within a sound field, but also to move around in it with binaural simulation. I’m excited for what the future of binaural holds for film, video games, and maybe even music. 

But what do you think? Where does it go from here? Do we eventually have more immersive music and film? More immersive video calls with family and friends across the world? Let me know what you think in a comment below. 

Disclaimer: This page contains affiliate links, which means that if you click them, I will receive a small commission at no cost to you.