Last time, we saw a comedy skit written by AI on Netflix, but it lacked something: emotion. Hard as it is to recreate emotion, it looks like this startup has cracked the code when it comes to voice and AI.
Sonantic, an AI voice startup, says it has made a minor breakthrough in its development of audio deepfakes: a synthetic voice that can express subtleties like teasing and flirtation. The company says the key to its advance is the incorporation of non-speech sounds into its audio. This simply means training its AI models to recreate the small gestures that make us sound human. Intakes of breath, tiny scoffs, and half-hidden chuckles give real speech its stamp of biological authenticity.
AI at its finest: The Photoshop for Voice
Sonantic CEO Zeena Qureshi describes the company’s software as “Photoshop for voice.” Its interface lets users type out the speech they want to synthesize, specify the mood of the delivery, and select from a cast of AI voices, most of which are copied from real human actors.
This is by no means a unique offering (rivals like Descript sell similar packages). However, Sonantic says its level of customization is more in-depth than that of its rivals.
“We chose love as a general theme,” Sonantic co-founder and CTO John Flynn tells The Verge. “But our research goal was to see if we could model subtle emotions. Bigger emotions are a little easier to capture.”
Take a listen and see if you, well, fall in love:
Emotional choices for delivery include anger, fear, sadness, happiness, and joy. With this week’s update, the company adds flirtatious, coy, teasing, and boasting.
A “director mode” allows for even more tweaking. According to the company, you can adjust the pitch of a voice, dial the intensity of delivery up or down, and even add those little non-speech vocalizations like laughs and breaths.
In the video above, you can hear the company’s attempt at a flirtatious AI. Whether it captures the nuances of human speech is a subjective question. On a first listen, I thought the voice was nearly indistinguishable from that of a real person. What do you think?