Spoiler alert: Do not read unless you have watched “Top Gun: Maverick.”
The long-in-the-works sequel to the 1986 Tom Cruise blockbuster has been shattering expectations since it arrived in theaters on May 27, soaring to a $160 million domestic opening. The high-flying, practically shot jet plane action sequences and Cruise’s inimitable star power have brought crowds to their feet.
But Val Kilmer’s return as Tom “Iceman” Kazansky was an undeniable highlight for many. And it’s all thanks to artificial intelligence.
Going into “Top Gun: Maverick,” Kilmer’s appearance was a big question mark. The actor lost his ability to speak after going through throat cancer treatment in 2014. Instead of leaving Iceman and Kilmer out of the long-anticipated sequel, the writers wove Kilmer’s story into the character. In “Top Gun: Maverick,” Iceman also has cancer and, for the majority of his brief time in the film, communicates with Maverick by typing.
However, Kilmer does have one emotional line of dialogue, which required a unique partnership between Kilmer and Sonantic, a voice synthesis company. Sonantic, with which Kilmer previously partnered in 2021 for another project, fed hours of Kilmer’s archival recordings through an A.I. to generate a voice model that is a vocal clone of the actor.
The use of this technology hit close to home for Rupal Patel, a Northeastern professor in communication sciences and disorders. Patel has worked extensively with vocal synthesis technology, and almost as soon as the film hit theaters, her phone started going off. Her work in the Communication Analysis and Design Laboratory and with her spin-out company VocaliD uses the technology to recreate voices for those who have lost them or who never had them in the first place.
Patel said that Sonantic’s work with Kilmer is only possible because of rapid advances in vocal synthesis technology over the last decade. What used to be a time-consuming, expensive process that required hours of recorded or archived audio to create or restore a voice is more streamlined and advanced than ever before.
“In the last 10 years, there’s been all these advances in machine learning that allow us to now basically take less audio––maybe about an hour of audio or sometimes even less than that––and feed it to a neural network,” Patel said. “That neural network then learns how to speak like that person.”
According to Patel, Sonantic most likely employed a text-to-speech approach for “Top Gun: Maverick.” With text-to-speech, previously recorded audio or audio from a vocal donor is used as the foundation of the vocal model. In this case, Kilmer provided hours of archival footage that essentially helped train the neural model to clone his voice. However, the recordings did not provide enough audio to produce an accurate model, Sonantic wrote in a blog post about its original collaboration with Kilmer.
Ultimately, Sonantic said it “generated more than 40 different voice models and selected the best, highest-quality, most expressive one.” From there, creatives took the voice model, fed it dialogue and fine-tuned the performance manually.
Kilmer’s appearance in “Top Gun: Maverick” has brought audiences to tears, but Patel said audiences shouldn’t expect Oscar-worthy work from an A.I.-assisted voice performance just yet.
“It’s going to be a while, but what we’re seeing today versus even five years ago is a night-and-day difference because machine learning has supercharged this field in a way that we could never imagine,” Patel said.
The use of voice synthesis in “Top Gun: Maverick” and even “The Mandalorian” is only the tip of the iceberg when it comes to the technology. Synthesized voices are everywhere, even if most people don’t realize it, Patel said. It goes beyond virtual assistants like Apple’s Siri and Amazon’s Alexa; synthesized voices are now used in telemarketing, advertisements, audio books, and even radio.
However, as with any new technology, Patel urged caution when considering its uses, and its misuses. In 2020, cyber-fraudsters cloned a company director’s voice in order to pull off a $35 million bank heist, but the potential ethical pitfalls go beyond criminal activity. There are also issues of how to guarantee fair use and royalties for voice actors.
“It’s going to be really important that companies that are creating voice clones actually have a seat at the table to understand how to prevent this from being misused and abused technology,” Patel said. “Consent is really important. Understanding the use cases is really important.”
Patel’s VocaliD and Modulate, another Boston-based company, created the Aithos Coalition to help ensure that synthetic media technologies are used ethically and do not create “big booby traps down the road,” Patel said.
Taking a measured, multi-pronged approach to the future of this technology will pay dividends, she said, especially as voice synthesis continues to advance.
“If the past few years dictate where we’re going with this, the line between human and virtual is going to get blurrier and blurrier,” Patel said.