Using AI to compress audio files for quick and easy sharing
Compression is an important part of the internet today, because it enables people to easily share high-quality photos, listen to audio messages, stream their favorite shows, and so much more. Even with today’s state-of-the-art techniques, these rich multimedia experiences require a fast internet connection and plenty of storage space. For current and future applications, such as the metaverse, to deliver high-quality, uninterrupted experiences for everyone, compression techniques will need to overcome these limitations.
Today, we are detailing progress that our Fundamental AI Research (FAIR) team has made in the area of AI-powered hypercompression of audio. Imagine listening to a friend’s audio message in an area with low connectivity and not having it stall or glitch. Our research shows how we can use AI to help us achieve this. We built a three-part system and trained it end to end to compress audio data to the size we target. This data can then be decoded using a neural network. We achieve roughly 10x more compression than MP3 at 64 kbps, without a loss of quality. While such techniques have been explored before for speech, we are the first to make them work for stereo audio sampled at 48 kHz (comparable to CD quality, which uses 44.1 kHz), the kind of audio used for music distribution. We are sharing additional details in a research paper, along with code and samples, as part of our commitment to open science.
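To put these figures in perspective, here is a back-of-the-envelope sketch in Python. The 64 kbps MP3 bitrate, the approximate 10x factor, and the 48 kHz stereo format come from the text above. The 16-bit sample depth for uncompressed audio and the one-minute clip length are assumptions made only for illustration, and the helper name `size_kilobytes` is hypothetical.

```python
def size_kilobytes(bitrate_kbps: float, seconds: float) -> float:
    """Size in kilobytes (1 kB = 1000 bytes) of a stream at a constant bitrate."""
    bits = bitrate_kbps * 1000 * seconds
    return bits / 8 / 1000


# Figures taken from the text.
MP3_KBPS = 64.0
COMPRESSION_FACTOR = 10.0  # "roughly 10x", so treat the result as approximate
target_kbps = MP3_KBPS / COMPRESSION_FACTOR

# Uncompressed PCM at 48 kHz stereo. The 16-bit depth is an assumption.
SAMPLE_RATE_HZ = 48_000
CHANNELS = 2
BITS_PER_SAMPLE = 16
pcm_kbps = SAMPLE_RATE_HZ * CHANNELS * BITS_PER_SAMPLE / 1000

CLIP_SECONDS = 60  # illustrative clip length

for label, kbps in [
    ("uncompressed", pcm_kbps),
    ("MP3", MP3_KBPS),
    ("10x target", target_kbps),
]:
    size = size_kilobytes(kbps, CLIP_SECONDS)
    print(f"{label:>12}: {kbps:8.1f} kbps, {size:9.1f} kB per minute")
```

Under these assumptions, the target works out to about 6.4 kbps, or roughly 48 kB for a minute of audio, compared with 480 kB for the same minute as a 64 kbps MP3.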
The new approach can compress and decompress audio in real time while achieving state-of-the-art size reductions. More work needs to be done, but eventually it could lead to improvements such as supporting faster, better-quality calls under poor network conditions and delivering rich metaverse experiences without requiring major bandwidth improvements.
While our techniques do not yet cover video, this is the start of an ongoing initiative aimed at advances that could improve experiences such as videoconferencing, streaming movies, and playing games with friends in VR.
Codecs, which act as encoders and decoders for streams of data, help power most of the audio compression people currently use online. Some examples of commonly used codecs include MP3, Opus, and EVS. Classic codecs like these decompose the signal into different frequency bands and encode each as efficiently as possible. Most classic codecs leverage knowledge of human hearing (psychoacoustics) but rely on a fixed set of handcrafted ways to efficiently code and decode the file. We are probably close to the limit of what handcrafting can give us, which is why it’s important to explore new techniques.
In order to push the boundaries of what’s possible, we need AI to help. We created Encodec, a neural network that is trained end to end to reconstruct the input signal. It consists of three parts:
The encoder takes the uncompressed data and transforms it into a higher-dimensional, lower-frame-rate representation.
The quantizer compresses this representation to the size we target. We train the quantizer to give us the size (or set of sizes) that we want while retaining the most important information to rebuild the original signal. This compressed representation is what we store on disk or send through the network. It is the equivalent of the .mp3 file on your computer.
The decoder is the final step. It turns the compressed signal back into a waveform that is as similar as possible to the original.
The key to lossy compression is to identify changes that will not be perceivable by humans, as perfect reconstruction is impossible at low bit rates. To do so, we use discriminators to improve the perceptual quality of the generated samples. This creates a cat-and-mouse game where the discriminator’s job is to differentiate between real samples and reconstructed samples. The compression model attempts to generate samples to fool the discriminators by pushing the reconstructed samples to be more perceptually similar to the original samples.
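As a rough illustration of how the encoder, quantizer, and decoder stages fit together, here is a toy sketch in plain Python. It is not the Encodec model: nothing here is learned, the "encoder" simply groups samples into frames, the quantizer is a nearest-neighbor lookup in a small hand-written codebook, and the discriminators are left out entirely. All function names, the codebook, and the hop size and codebook size passed to `bitrate_kbps` are hypothetical values chosen for illustration.

```python
import math


def encode(samples, hop):
    """Group samples into frames: fewer time steps, each a hop-dimensional vector."""
    n_frames = len(samples) // hop
    return [samples[i * hop:(i + 1) * hop] for i in range(n_frames)]


def quantize(frames, codebook):
    """Replace each frame with the index of its nearest codebook vector."""
    def squared_error(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: squared_error(frame, codebook[k]))
        for frame in frames
    ]


def decode(indices, codebook):
    """Look up each index and concatenate the vectors back into a waveform."""
    waveform = []
    for k in indices:
        waveform.extend(codebook[k])
    return waveform


def bitrate_kbps(sample_rate_hz, hop, codebook_size):
    """Bits per second needed to send one codebook index per frame, in kbps."""
    frames_per_second = sample_rate_hz / hop
    return frames_per_second * math.log2(codebook_size) / 1000


# A tiny hand-written codebook of 4-sample patterns (illustrative only).
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0, -1.0],
    [1.0, -1.0, 1.0, -1.0],
]

signal = [0.9, 1.1, 1.0, 0.8, -0.1, 0.0, 0.1, 0.0, -1.0, -0.9, -1.2, -1.0]
codes = quantize(encode(signal, hop=4), CODEBOOK)  # what would be stored or sent
reconstruction = decode(codes, CODEBOOK)

print("codes:", codes)
print("reconstruction:", reconstruction)
print("kbps at 48 kHz, hop 320, 1024 entries:", bitrate_kbps(48_000, 320, 1024))
```

Only the list of indices needs to be stored or transmitted, which is why the bitrate depends on the frame rate and the codebook size rather than on the original sample depth. In a trained system, both the transformation into frames and the codebook contents would be learned rather than written by hand.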
We achieve state-of-the-art results in low bit rate speech audio compression (1.5 kbps to 12 kbps), as evaluated by human annotators who compared the output of several compression methods, including Google's latest codec Lyra-v2, with the uncompressed audio and ranked them accordingly. Across all bandwidth and quality levels, our model encodes and decodes audio in real time on a single CPU core.
We see many areas where we can continue to build and improve on this research in the future. We believe we can attain even smaller file sizes, as we haven’t yet reached the limits of quantization techniques. On the applied research side, there is more work that can be done on the trade-off between computing power and the size of compressed audio. Dedicated chips, such as those already in phones and laptops, could be improved in the future to help compress and decompress files while consuming less power.
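One way to make the real-time claim above concrete is to compare the duration of a clip with the time it takes to process it. The sketch below assumes a "real-time factor" defined as audio duration divided by processing time, so a value above 1 means faster than real time. The function names are hypothetical, and `dummy_codec` is only a stand-in that rounds samples, not a neural codec, so the number it prints says nothing about Encodec itself.

```python
import time


def real_time_factor(process, samples, sample_rate_hz):
    """Audio duration divided by wall-clock processing time (assumed definition)."""
    audio_seconds = len(samples) / sample_rate_hz
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    if elapsed == 0:
        return float("inf")  # timer resolution too coarse to measure
    return audio_seconds / elapsed


def dummy_codec(samples):
    """Stand-in for an encode/decode round trip: coarse rounding of each sample."""
    return [round(s * 127) / 127 for s in samples]


SAMPLE_RATE_HZ = 48_000
one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silence

rtf = real_time_factor(dummy_codec, one_second, SAMPLE_RATE_HZ)
print(f"real-time factor: {rtf:.1f}")
```

Swapping `dummy_codec` for an actual encode and decode call, and running on a single core, would give a measurement comparable in spirit to the claim in the text.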
We invest in and share fundamental AI research like this so the broader community can learn from and build on these advancements. This research could lead to richer, faster online experiences for people around the world, regardless of the speed of their internet connection.
There is additional research that will need to be done to help us get there. We want to continue exploring how we can compress audio to even smaller file sizes without significantly degrading the quality. We also plan to explore spatial audio compression, which will require a technique that can compress several audio channels while keeping accurate spatial information. These learnings could be useful for future metaverse experiences. We are also exploring techniques for using AI to compress audio and video, and we hope to share more about that work in the future.
Read our paper to learn more about AI-powered hypercompression for audio, and then download the code.