A quick guide to immersive audio technologies
In an age where video and text dominate our digital interactions, audio remains a powerful yet underutilized tool for communication. Spatial audio – especially when paired with enhanced voice calling or new immersive technologies such as Augmented Reality (AR) and Extended Reality (XR) – is transforming how we connect, making conversations feel more natural, engaging, and lifelike. Whether you're curious about how immersive audio works or want to understand the technologies shaping the future of voice communication, this quick guide breaks down key terms and innovations.
What is spatial audio communication?
The term Spatial Audio Communication refers to the use of 3D sound – as well as other technologies such as video – in exchanges between people, whether in real time or not.
Examples include instant voice messages and voice-based stories, which have become popular ways to express ourselves.
While the visual and text-based communication space is increasingly saturated, voice and sound remain relatively untapped and present huge growth potential. That’s why we’re actively researching and promoting new audio solutions to enable more natural and intuitive interactions.
What is the IVAS codec?
The IVAS (Immersive Voice and Audio Services) codec is the backbone of every future spatial voice communication innovation. It bridges the gap between caller and listener, transforming ordinary voice and video calls into immersive, engaging, and context-rich experiences.
Born out of collaboration among 13 companies – with Nokia as a key contributor – IVAS was standardized by the global telecommunications standards body 3GPP as part of its Release 18 specifications, approved in June 2024.
With strong industry-wide support, we’re working together to bring immersive spatial audio to mobile networks and services worldwide.
What is the MASA format?
MASA (Metadata-Assisted Spatial Audio) is a new format specifically designed for compact devices like mobile phones. It’s one of the formats that the IVAS codec supports.
What makes MASA unique is that it stores spatial audio in only two audio channels while preserving crucial metadata for spatial positioning and head tracking – making it both powerful and widely compatible.
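To make the "two channels plus metadata" idea concrete, here is a minimal sketch of what such a frame could look like in code. It is not the actual MASA layout defined in the 3GPP specifications: the class names, the field names, and the 960-sample frame size (20 ms at 48 kHz) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpatialTile:
    """Hypothetical spatial parameters for one frequency band of one frame."""
    azimuth_deg: float    # direction of arrival, left/right
    elevation_deg: float  # direction of arrival, up/down
    direct_ratio: float   # 0.0 = fully diffuse ambience, 1.0 = one clear direction


@dataclass
class MasaLikeFrame:
    """Illustrative container: two transport channels plus per-band metadata."""
    left: List[float]         # transport channel 1 (PCM samples)
    right: List[float]        # transport channel 2 (PCM samples)
    tiles: List[SpatialTile]  # one entry per frequency band

    def validate(self) -> None:
        """Raise ValueError if the frame is internally inconsistent."""
        if len(self.left) != len(self.right):
            raise ValueError("transport channels must have equal length")
        for tile in self.tiles:
            if not 0.0 <= tile.direct_ratio <= 1.0:
                raise ValueError("direct_ratio must be within [0, 1]")


# One silent frame with a single band pointing 30 degrees off-centre.
frame = MasaLikeFrame(
    left=[0.0] * 960,   # assumed frame size: 20 ms at 48 kHz
    right=[0.0] * 960,
    tiles=[SpatialTile(azimuth_deg=30.0, elevation_deg=0.0, direct_ratio=0.8)],
)
frame.validate()
```

The point of the sketch is the split itself: the audio stays ordinary two-channel audio, while the direction information travels alongside it as a small amount of metadata that a renderer can use for spatial positioning and head tracking.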
Nokia’s Immersive Voice encodes spatial audio into MASA format using the phone’s built-in microphones.
What is MASA object-based audio?
OMASA (Object-based audio with Metadata-Assisted Spatial Audio) gives you full control over the balance between voice and ambient sound during calls.
It enables real-time transport of mono audio objects along with spatial audio. This allows, for example, mixing the volumes of all the audio objects independently, resulting in a more interactive and user-friendly listening experience.
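The independent volume control described above can be sketched as a simple mixer. This is an illustration of the idea, not the OMASA rendering path: the function name `mix_objects`, the object names, and the gain values are hypothetical, and the ambient scene is reduced to a single mono channel to keep the example short.

```python
from typing import Dict, List


def mix_objects(
    ambience: List[float],
    objects: Dict[str, List[float]],
    gains: Dict[str, float],
    ambience_gain: float = 1.0,
) -> List[float]:
    """Sum an ambience bed and mono objects, each scaled by its own gain."""
    out = [sample * ambience_gain for sample in ambience]
    for name, samples in objects.items():
        if len(samples) != len(out):
            raise ValueError(f"object {name!r} has a different length")
        gain = gains.get(name, 1.0)  # objects without an explicit gain stay at unity
        for i, sample in enumerate(samples):
            out[i] += gain * sample
    return out


ambience = [0.10, 0.10, 0.10, 0.10]
objects = {
    "talker_a": [0.50, 0.50, 0.50, 0.50],
    "talker_b": [0.20, 0.20, 0.20, 0.20],
}

# Keep talker A at full level, mute talker B, and halve the ambience.
mixed = mix_objects(
    ambience,
    objects,
    gains={"talker_a": 1.0, "talker_b": 0.0},
    ambience_gain=0.5,
)
```

Because each object arrives as its own stream, the listener's device can apply these gains after decoding, so one participant can be turned down without affecting the others or the background.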
What is Nokia Immersive Voice?
Nokia Immersive Voice is our complete, end-to-end spatial audio solution for experiencing real-time voice communication with the IVAS codec on mobile devices, XR glasses, and other types of multi-microphone devices.
To capture spatial audio, the solution requires access to the device’s integrated microphones. Listening to spatial audio is possible with any device – either with headphones or through the device’s integrated loudspeakers.
Here’s how it works:
- The spatial analysis algorithm of Nokia Immersive Voice reads the signals from the device’s microphones and reduces unwanted noise while processing the audio into MASA format.
- The MASA data is encoded with the IVAS codec and transmitted to the receiving end, where it’s rendered for playback with advanced Acoustic Echo Cancellation (AEC), designed specifically for spatial audio.
- The listener hears audio that reflects the real spatial scene – creating a lifelike, real-time conversation experience.
The solution includes dynamic controls for adjusting ambience, choosing audio orientation, and enabling head-tracking – giving users even more control over their audio environment.
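As a rough illustration of what head-tracking means for rendering, the sketch below keeps a sound source fixed in the room while the listener's head turns. It is not Nokia's renderer: the function names are hypothetical, the angle convention (0 degrees is straight ahead, positive values are to the listener's left) is an assumption, and simple constant-power stereo panning stands in for real binaural rendering.

```python
import math
from typing import Tuple


def wrap_degrees(angle: float) -> float:
    """Wrap an angle to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_azimuth(source_az: float, head_yaw: float) -> float:
    """Direction of the source relative to where the listener is facing."""
    return wrap_degrees(source_az - head_yaw)


def pan_gains(azimuth: float) -> Tuple[float, float]:
    """Constant-power (left, right) gains.

    Simplification: directions behind the listener are clamped to the
    nearest side, so only the frontal +/-90 degrees are distinguished.
    """
    az = max(-90.0, min(90.0, azimuth))
    theta = (90.0 - az) / 180.0 * (math.pi / 2.0)  # +90 (left) -> 0, -90 (right) -> pi/2
    return math.cos(theta), math.sin(theta)


# A talker 30 degrees to the left; the listener then turns to face them.
facing_front = pan_gains(relative_azimuth(30.0, head_yaw=0.0))
facing_talker = pan_gains(relative_azimuth(30.0, head_yaw=30.0))
```

With the head facing forward, the left gain is larger than the right, so the talker is heard to the left. Once the listener turns 30 degrees toward the talker, both gains are equal and the voice sits straight ahead, which is the behaviour head-tracking is meant to provide.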