Upgrading H.26x video coding features for the AI era


When every pixel can be faked, it’s getting harder to trust what we see on screen. Videos generated and manipulated using artificial intelligence (AI) are raising new questions about authenticity and creative control. 

That’s why standards matter. Nokia is leading the way with the Versatile Supplemental Enhancement Information (VSEI) standard, designed to help everyone—from content creators to viewers—verify, protect, and enhance video in the AI era. The VSEI standard complements video coding standards like H.264 (Advanced Video Coding, AVC), H.265 (High Efficiency Video Coding, HEVC), and H.266 (Versatile Video Coding, VVC). Version 4 of the VSEI standard was recently completed, addressing emerging use cases in AI, machine vision, and the preservation of creative intent. In this blog, we summarize the new features and enhancements provided by the new VSEI version.

Versatile Supplemental Enhancement Information (VSEI)

The VSEI standard plays a crucial role in enhancing video coding standards such as H.264, H.265, and H.266, collectively referred to as H.26x in this blog. While the core decoding algorithms of the H.26x standards have remained unchanged for years, the continuously developed VSEI standard ensures that H.26x codec implementations can keep evolving to better address particular use cases.

VSEI version 4 is a major update, providing a range of new features as well as enhancements to features specified in earlier versions of the standard. The standardization of VSEI version 4 was a two-year collaborative effort by more than 10 companies, with Nokia among the most active contributors.

The VSEI standard specifies supplemental enhancement information (SEI) messages that can be included in coded video bitstreams. This extra information helps devices understand and process videos better. The metadata contained in SEI messages is synchronized with the coded video and can help improve picture quality or give details about the video itself. Thanks to the VSEI standard, decoders in different devices and applications can read and use this information in the same way, making video experiences more reliable and consistent.

Shepherding AI manipulations of video content

AI-based video generation and manipulation have become increasingly accessible and sophisticated, which makes detecting AI-generated or manipulated content more difficult. Therefore, verifying the authenticity of video content has become essential. 

VSEI version 4 lets video creators use digital signing of coded video to prove that their content is authentic and hasn’t been altered since its creation. For instance, a news agency can use digital signing to add a special mark to its videos, which allows viewers to verify that the video really comes from the agency and hasn’t been changed.
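The verification workflow can be illustrated with a minimal sketch. Note the assumptions: this is not the VSEI signature syntax, and HMAC-SHA256 stands in here for the asymmetric signature scheme a real deployment would use so that any viewer, not only the key holder, can verify the content.

```python
import hashlib
import hmac

def sign_bitstream(bitstream: bytes, key: bytes) -> bytes:
    """Compute a signature over the coded video bytes.

    Illustration only: HMAC-SHA256 stands in for the asymmetric
    digital signature a real system would carry in the bitstream.
    """
    return hmac.new(key, bitstream, hashlib.sha256).digest()

def verify_bitstream(bitstream: bytes, signature: bytes, key: bytes) -> bool:
    """Return True only if the bytes match the signature exactly."""
    expected = hmac.new(key, bitstream, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

key = b"news-agency-signing-key"          # hypothetical key
video = b"\x00\x00\x00\x01coded video"    # stand-in for a coded bitstream
sig = sign_bitstream(video, key)

assert verify_bitstream(video, sig, key)             # authentic copy
assert not verify_bitstream(video + b"x", sig, key)  # altered copy fails
```

Any change to the coded bytes after signing, however small, causes verification to fail, which is what lets viewers detect tampering.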

New regulations require showing clear labels, called AI markings, if content was created or modified by AI. This is particularly important, for example, when generative AI is used to alter the appearance of a public figure, like a politician during an election campaign. VSEI version 4 makes it possible to add these AI marking labels to videos, so viewers know when AI was involved. 

Additionally, VSEI version 4 lets content owners set AI usage restrictions, which are rules about how their videos can be used by AI. For example, they can choose to prevent their videos from being used to train AI models, helping to protect their privacy and uphold content owner rights.

Generative AI for video enhancement and compression

The previous version of the VSEI standard introduced support for neural-network post-filtering (NNPF), arguably marking the first time AI was integrated in a video standard. Since then, Nokia has explored various aspects of NNPF technology, most recently for concealing common problems in videos, called artifacts, such as contouring (uneven color areas) and blockiness (visible squares), resulting from video coding at limited bitrates. NNPF also allows content creators to control the post-processing of their videos, ensuring that their creative intent remains uncompromised.

Now, VSEI version 4 makes NNPF even smarter by adding generative AI features. For example, text prompts can be added to videos to guide generative filtering. In addition to conventional filtering purposes, such as making videos look sharper, generative NNPF can be used to extend pictures spatially or create future pictures. 

Generative face video coding is another new feature. It lets videos of human faces be coded at bitrates as low as a few kilobits per second. This technology works by coding one main or base picture and some additional details, and then AI creates the rest of the video using those inputs. The VSEI standard includes signaling to tell decoders which neural network models and face parameters to use so the video plays correctly.

Creator-driven post-processing

VSEI version 4 allows video creators to specify the preferred order of post-processing operations, including color transformation, adding film grain, and rotating pictures for display. They can also set up different processing chains for different display resolutions. What's more, the film grain support, which has existed in the H.26x codecs for decades, has now been enhanced to enable the signaling of different film grain models depending on the display resolution. This means that videos look their best whether they’re shown on a phone or on a big screen. 
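The idea of a creator-specified processing chain can be sketched as follows. The operation names and functions here are hypothetical stand-ins: in practice the VSEI message signals only the preferred order, and the receiver applies whichever operations in the chain it supports.

```python
def rotate_90(pic):
    """Rotate a 2D sample array 90 degrees clockwise."""
    return [list(row) for row in zip(*pic[::-1])]

def add_grain(pic):
    """Stand-in for film-grain synthesis: perturb each sample."""
    return [[s + 1 for s in row] for row in pic]

def apply_chain(pic, chain, ops):
    """Apply post-processing operations in the creator-specified order."""
    for name in chain:
        pic = ops[name](pic)
    return pic

OPS = {"rotate": rotate_90, "film_grain": add_grain}
picture = [[0, 1], [2, 3]]
# The creator prefers grain synthesis before rotation for this display.
out = apply_chain(picture, ["film_grain", "rotate"], OPS)
```

Because the operations do not generally commute (rotating then adding grain differs from the reverse for real filters), signaling the order matters for preserving the intended look.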

With these additions to VSEI version 4, content creators now have better control over how their videos look on receiving devices, helping to preserve their creative intent.

Video for computer vision

Video is increasingly consumed by machine analysis tasks rather than watched by humans. It has been reported that machine-to-machine video constitutes tens of zettabytes annually. Thus, optimizing video compression without compromising machine task accuracy is becoming more important. VSEI version 4 adds several new features for enhanced machine-to-machine video.

As videos optimized for computer vision may not provide an optimal viewing experience for humans, safeguards against displaying machine-targeted video have been incorporated within the signaling that describes encoder operation, post-processing chains, and neural-network post-filtering. More broadly, the types of encoder optimizations can be detailed in the encoder optimization information (EOI) SEI message, which allows receiving systems to make suitable adjustments to post-processing and analysis tasks.

Many machine tasks, such as person identification, work best when the important parts, called regions of interest (ROIs), are shown in the highest possible picture quality, while the background doesn’t matter as much. Video encoding systems can use ROI detection and optimization of the pre-processing or encoding to make sure these important areas look their best, even if it means lowering the quality and bitrate of the remaining regions. For example, an encoder may use a finer quantization step size for ROIs, which can be described in an EOI SEI message. Alternatively, an encoder may pack foreground regions at a higher spatial resolution and background regions at a lower resolution into source pictures used in encoding. This can be described in a packed regions information (PRI) SEI message so the receiving system knows how to restore the original positions of the regions.
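A toy sketch of the region-packing idea is shown below, with a picture represented as a 2D list of samples. The function and field names are hypothetical, loosely modeled on the packed-regions concept described above rather than on the actual PRI SEI message syntax.

```python
def downsample(region, factor=2):
    """Naive nearest-neighbour downscale: keep every factor-th sample."""
    return [row[::factor] for row in region[::factor]]

def pack_roi(frame, roi):
    """Keep the ROI at full resolution and the background at half
    resolution, recording metadata a receiver would need to restore
    each region to its original position and scale."""
    x, y, w, h = roi
    roi_samples = [row[x:x + w] for row in frame[y:y + h]]
    background = downsample(frame)  # whole frame, half resolution
    metadata = {                    # hypothetical field names
        "roi_position": (x, y),
        "roi_size": (w, h),
        "background_scale": 2,
    }
    return roi_samples, background, metadata

frame = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
roi_samples, background, meta = pack_roi(frame, (0, 0, 2, 2))
```

The payoff is that most of the bitrate goes to the region the machine task actually depends on, while the metadata keeps the packing reversible at the receiver.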

When a video is split into different objects using semantic or instance segmentation, each object can be shown with its own solid color in an object mask picture. VSEI version 4 makes describing object masks possible, so these masks can be included in the same coded video clip as the original source video. This feature makes H.26x a great output format for segmented video.
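Rendering an object mask picture from a per-pixel label map can be sketched in a few lines. The palette and label values here are made up for illustration; the point is simply that each object ID maps to one solid color.

```python
# Hypothetical palette: one solid (R, G, B) color per object ID,
# with ID 0 conventionally treated as background.
PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0)}

def labels_to_mask_picture(labels):
    """Render a per-pixel object-ID map as an object mask picture in
    which every pixel of an object gets that object's solid color."""
    return [[PALETTE[obj_id] for obj_id in row] for row in labels]

labels = [[0, 1], [2, 0]]  # toy 2x2 segmentation output
mask = labels_to_mask_picture(labels)
```

Because the mask picture is an ordinary picture, it can be coded with the same H.26x codec and carried in the same clip as the source video, with the SEI metadata describing how to interpret it.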

Picture metadata extensions

Sometimes video is recorded at a different speed than it is shown. For example, a video may be captured at a high picture rate, such as 240 Hz, and played back in slow motion, or vice versa. The source picture timing information SEI message carries metadata about the capture timing of the pictures, helping keep track of when each picture was taken.
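The capture-versus-playback relationship can be worked through with a small sketch. The function below is illustrative arithmetic, not the SEI message syntax: capturing at 240 Hz and presenting at 30 Hz yields 8x slow motion, and carrying the capture timing lets a receiver recover when each picture was actually taken.

```python
from fractions import Fraction

def timestamps(num_pictures, capture_hz, playback_hz):
    """Return (capture_time, presentation_time) pairs in seconds
    for consecutively captured pictures."""
    return [
        (Fraction(i, capture_hz), Fraction(i, playback_hz))
        for i in range(num_pictures)
    ]

for capture_t, playback_t in timestamps(3, 240, 30):
    print(f"captured at {float(capture_t):.4f}s, shown at {float(playback_t):.4f}s")
```

The second picture, captured about 4.2 ms after the first, is shown roughly 33 ms after it, which is exactly the 8x stretch a slow-motion viewer sees.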

Some image sensors can capture wavelengths beyond visible light. The modality information SEI message indicates whether the images in the video show visible light, infrared, or ultraviolet, and can even include details about the exact wavelength.

Just like digital photos can store extra details (metadata), VSEI version 4 lets videos include image format metadata, so important information about how and when the video was made can travel with the file.

Nokia advances emerging use cases for coded video

As AI continues to reshape how we create and experience video, the need for trust and authenticity has never been greater. VSEI version 4 sets new benchmarks for transparency, creative control and intelligent machine vision. These latest enhancements empower content creators, device makers and viewers to verify, protect and enhance video, ensuring that innovation and trust go hand in hand in the digital world.

The new and enhanced SEI messages specified in VSEI version 4 make the H.26x coding standards even more capable of addressing the most important emerging use cases for coded video, including AI and machine vision, while safeguarding creative intent.

At Nokia, we are proud to have led the way in the development of the VSEI standard. Our team has contributed key technologies to the standard and held pivotal editorial roles in shaping its direction. Today, we continue to pioneer secure, intelligent and inspiring video experiences for the digital world.

Miska Hannuksela

About Miska Hannuksela

Miska Hannuksela (M.Sc., Dr.Tech.) is the Head of Video Research at Nokia Technologies and a Nokia Bell Labs Fellow. He is an internationally acclaimed expert in video and image compression and end-to-end multimedia systems.

Jill Boyce

About Jill Boyce

Jill Boyce joined Nokia in February 2024. She is an IEEE Fellow, recognized in 2019 for contributions to video coding. She is a longtime contributor to video coding standardization, including MPEG-2, H.264/AVC, HEVC, VVC, and VSEI. She is the lead editor of the first and fourth editions of the Versatile Supplemental Enhancement Information (VSEI) standard. When she isn't traveling to JVET/MPEG meetings, she enjoys the Pacific Northwest outdoors and playing tennis.

Connect with Jill on LinkedIn 
