+++
title = "WellSaid Labs develops a tool that quickly creates synthetic voices natural enough to be mistaken for human"
date = "2021-07-28T13:45:50+08:00"
type = "blog"
banner = "img/banners/banner-3.jpg"
+++
## WellSaid Labs develops a tool that quickly creates synthetic voices natural enough to be mistaken for human
## WellSaid Labs raises $10M to boost its synthetic voice business, fueled by AI
WellSaid Labs will have a lot more to say in the years ahead, thanks to $10 million in new investment that’ll be used to beef up the Seattle startup’s efforts to put a widening chorus of AI-generated synthetic voices to work.
The Series A funding round — led by Fuse, an early-stage venture capital firm that counts Seattle Seahawks star linebacker Bobby Wagner among its partners — follows up on $2 million in seed funding that WellSaid raised in 2019 when it was spun out from Seattle’s Allen Institute for Artificial Intelligence.
One of the investors in that earlier seed round, Voyager Capital, contributed to the newly announced Series A funding. So did Qualcomm Ventures and Good Friends.
WellSaid CEO Matt Hocking said the new funding will go toward growing the text-to-speech startup, which has a dozen employees.
“We need to double down on the research that we’re putting to work, and the research that we’re doing here to continually improve our technology,” Hocking told GeekWire. “On top of that, there’s obviously hires to build out our product offering and serve more customers in more diverse and interesting ways. And then as well as that, we’re definitely focused on our sales team and building that up.”
WellSaid Labs makes a wide assortment of natural-sounding synthetic voices available via its audio production platform, for use in applications ranging from in-house training materials to quick-hit social media videos.
“We’re not trying to create better voices than humans,” Hocking said. “That’s not what we’re here for. A lot of content goes unvoiced, simply because of the quick turnaround that needs to happen, or it needs to be updated constantly, or it’s just an internal piece of content that doesn’t have a budget associated with it.”
Those are situations in which WellSaid comes in handy. “It opens up opportunities to allow voice to be added to these productions where they wouldn’t usually have that alternative,” Hocking said.
He declined to name customers, but for what it’s worth, WellSaid’s website lists endorsements from Nokia, the University of California at San Francisco, Blue Sky eLearn and a Canadian food retailer called Sobeys.
WellSaid offers more than a dozen text-to-speech avatars based on human voice patterns, ranging from the revved-up patter of a car salesman to no-nonsense recitations that sound as if they’re coming from a woman researcher. The company claims its software has achieved “human parity” for naturalness in short audio clips.
But wait … there’s more: Customers can create their own “AI Voice Avatars” to spec, capturing the speaking style of a branded voice. Theoretically, WellSaid could bring Jeff Bezos into the studio and create a synthetic voice that makes it sound as if the former Amazon CEO is reading out a welcome message to new employees. (Realistically, if that need ever arose, Amazon would probably have its own voice synthesis team take on the job.)
As time goes on, WellSaid aims to add to its repertoire and increase the fidelity of its synthetic voices. In the future, the company’s voices just might play speaking roles in video games, read scripts on computer-generated news programs, or engage in complex real-time interactions with consumers.
All this raises deeper questions about WellSaid’s technology and its business model. First of all, what’s to stop somebody from synthesizing, say, President Joe Biden’s voice for malign purposes?
“We obviously have a responsibility to ensure that our technology is being used in the right way for the right purposes,” Hocking said. “We create domain-specific voices based on a real voice. We would never go and just build a voice without someone’s consent.”
And when it comes to the business model, how can WellSaid hope to compete with companies like Google, Amazon and Microsoft, all of which have their own voice synthesis platforms?
“We’re in competition with them because they do TTS [text-to-speech],” Hocking acknowledged. “But we’ve re-architected and reinvented what TTS is.”
Hocking argued that WellSaid is well-placed to pursue new applications for text-to-speech technology. “We’ve been exposed to some of these other interesting use cases,” he explained. “The stuff that used to be only possible on a movie set five years ago is now possible in a different perspective today.”
And from Hocking’s perspective, Seattle is the right place for pushing further out into the speech synthesis frontier.
“The majority of our team is from Seattle,” he pointed out. “We all met here, and our preference is obviously to have people living in the area — not only because we feel as though there’s great talent here, but as well as that, it’s just a great place to build a business.”
## Synthetic Speech Startup WellSaid Labs Raises $10M
Seattle-based synthetic speech technology startup WellSaid Labs has closed a $10 million Series A funding round led by FUSE. WellSaid offers brands and enterprise clients a text-to-speech platform performed by their choice of artificially generated voices and styles, a service in increasing demand as the synthetic voices improve.
### WellSaid Says
WellSaid’s collection of voice avatars can read out scripts, performing monologues or multi-voice dialogues in whatever style, gender, and mood are appropriate. The AI can be taught to pronounce unusual or branded terms correctly, and the audio can be fine-tuned to add or eliminate pauses or even switch out voices. If none of WellSaid’s voices seems quite right, the startup works with the client to design and program a new one, which requires a recording of a few hours of the voice the client wants. Since the company began in 2018 as a research project at the Allen Institute for Artificial Intelligence, WellSaid has continually refined how human-like its artificial voices sound. WellSaid said the new funding round was oversubscribed, with participation from Voyager, Qualcomm Ventures, and Good Friends. The money will expand the team of about a dozen and fund more research and development for the technology underlying the platform. Most immediately, the startup plans to broaden the variety of texts it can service and shorten the time it takes to create a voice.
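To make the pronunciation and pause editing described above concrete, here is a minimal sketch of the kind of text preprocessing a TTS front end might apply. The respelling table and the `[pause:…]` tag syntax are invented for this illustration; they are not WellSaid Labs' actual markup.

```python
# Illustrative only: one way a TTS front end might apply per-term
# pronunciation fixes and pause edits before synthesis. The respelling
# table and pause-tag syntax here are hypothetical.

import re

# Branded or unusual terms mapped to phonetic respellings (invented example)
PRONUNCIATIONS = {"WellSaid": "wel-SED"}

def preprocess(script: str, pronunciations=PRONUNCIATIONS) -> str:
    # Substitute branded terms with respellings the voice model reads correctly
    for term, respelling in pronunciations.items():
        script = script.replace(term, respelling)
    # Expand a simple pause tag like [pause:0.5] into an SSML-style break
    return re.sub(r"\[pause:(\d+(?:\.\d+)?)\]",
                  r'<break time="\1s"/>', script)

print(preprocess("Welcome to WellSaid.[pause:0.5] Let's begin."))
# → Welcome to wel-SED.<break time="0.5s"/> Let's begin.
```

A real pipeline would feed the expanded script to the voice model; the point of the sketch is that pronunciation and pacing edits can live in the text layer rather than in the audio itself.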
“We’ve added AI Voice to the toolkit of thousands of content creators and their teams,” WellSaid Labs CEO Matt Hocking said. “Our human-parity AI voice can be produced faster than real-time, and updated on-demand. Opening up new and exciting opportunities to ‘add voice’ where never before perceived possible. AI voice easily ensures every production can be created and updated efficiently at scale.”
### Synthetic Value
As synthetic voices improve, companies offering variations on the tech have mushroomed to serve advertising, movies, video games, and other verticals. Startups like Lovo, Resemble AI, and Supertone are raising funding rounds and partnering with celebrities to market synthetic voices. Other examples of advances in the field include Replica Studios’ desktop app for speeding up the integration of synthetic speech into films and video games, and the way digital voice interface creator ReadSpeaker is augmenting SoundHound’s Houndify voice AI platform to sound more lifelike. The advances in synthetic speech technology mean even free, limited tools can produce impressive results, like a fan-made trailer for Skyrim voiced solely by AI.
## AI voice actors sound more human than ever—and they’re ready to hire
The company blog post drips with the enthusiasm of a ’90s US infomercial. WellSaid Labs describes what clients can expect from its “eight new digital voice actors!” Tobin is “energetic and insightful.” Paige is “poised and expressive.” Ava is “polished, self-assured, and professional.”
Each one is based on a real voice actor, whose likeness (with consent) has been preserved using AI. Companies can now license these voices to say whatever they need. They simply feed some text into the voice engine, and out will spool a crisp audio clip of a natural-sounding performance.
WellSaid Labs, a Seattle-based startup that spun out of the research nonprofit Allen Institute for Artificial Intelligence, is the latest firm offering AI voices to clients. For now, it specializes in voices for corporate e-learning videos. Other startups make voices for digital assistants, call center operators, and even video-game characters.
Not too long ago, such deepfake voices had something of a lousy reputation for their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to replicate many of the subtleties of human speech. These voices pause and breathe in all the right places. They can change their style or emotion. You can spot the trick if they speak for too long, but in short audio clips, some have become indistinguishable from humans.
AI voices are also cheap, scalable, and easy to work with. Unlike a recording of a human voice actor, synthetic voices can also update their script in real time, opening up new opportunities to personalize advertising.
But the rise of hyperrealistic fake voices isn’t consequence-free. Human voice actors, in particular, have been left to wonder what this means for their livelihoods.
### How to fake a voice
Synthetic voices have been around for a while. But the old ones, including the voices of the original Siri and Alexa, simply glued together words and sounds to achieve a clunky, robotic effect. Getting them to sound any more natural was a laborious manual task.
Deep learning changed that. Voice developers no longer needed to dictate the exact pacing, pronunciation, or intonation of the generated speech. Instead, they could feed a few hours of audio into an algorithm and have the algorithm learn those patterns on its own.
Over the years, researchers have used this basic idea to build voice engines that are more and more sophisticated. The one WellSaid Labs constructed, for example, uses two primary deep-learning models. The first predicts, from a passage of text, the broad strokes of what a speaker will sound like—including accent, pitch, and timbre. The second fills in the details, including breaths and the way the voice resonates in its environment.
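The two-model design described above can be sketched as a pipeline: a first stage maps text to coarse acoustic features, and a second stage (a vocoder) turns those features into waveform samples. The class names, frame structure, and hop size below are stand-ins for illustration, not WellSaid Labs' actual implementation; a real system would use trained neural networks in place of these dummies.

```python
# Illustrative sketch of a two-stage neural TTS pipeline in the shape the
# article describes. All names and numbers are hypothetical; real models
# would infer the acoustic features from training data.

from dataclasses import dataclass
from typing import List

@dataclass
class AcousticFrame:
    pitch_hz: float      # fundamental frequency predicted for this frame
    timbre: List[float]  # coarse spectral envelope (stands in for a mel frame)

class SpectrogramPredictor:
    """Stage 1: text -> broad acoustic strokes (accent, pitch, timbre)."""
    def predict(self, text: str) -> List[AcousticFrame]:
        # Fabricate one frame per character just to show the data flow.
        return [AcousticFrame(pitch_hz=120.0 + i, timbre=[0.0] * 8)
                for i, _ in enumerate(text)]

class NeuralVocoder:
    """Stage 2: acoustic features -> waveform details (breaths, resonance)."""
    frames_to_samples = 256  # hop size: samples generated per acoustic frame

    def synthesize(self, frames: List[AcousticFrame]) -> List[float]:
        return [0.0] * (len(frames) * self.frames_to_samples)

def text_to_speech(text: str) -> List[float]:
    frames = SpectrogramPredictor().predict(text)
    return NeuralVocoder().synthesize(frames)

audio = text_to_speech("Hello")
print(len(audio))  # number of frames times samples per frame
```

Splitting the problem this way lets each stage specialize: the predictor learns prosody from text, while the vocoder learns to render realistic audio from any sufficiently detailed acoustic description.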
Making a convincing synthetic voice takes more than just pressing a button, however. Part of what makes a human voice so human is its inconsistency, expressiveness, and ability to deliver the same lines in completely different styles, depending on the context.
Capturing these nuances involves finding the right voice actors to supply the appropriate training data and fine-tune the deep-learning models. WellSaid says the process requires at least an hour or two of audio and a few weeks of labor to develop a realistic-sounding synthetic replica.
AI voices have grown particularly popular among brands looking to maintain a consistent sound in millions of interactions with customers. With the ubiquity of smart speakers today, and the rise of automated customer service agents as well as digital assistants embedded in cars and smart devices, brands may need to produce upwards of a hundred hours of audio a month. But they also no longer want to use the generic voices offered by traditional text-to-speech technology—a trend that accelerated during the pandemic as more and more customers skipped in-store interactions to engage with companies virtually.
“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s,” says Rupal Patel, a professor at Northeastern University and the founder and CEO of VocaliD, which promises to build custom voices that match a company’s brand identity. “These brands have thought about their colors. They’ve thought about their fonts. Now they’ve got to start thinking about the way their voice sounds as well.”
Whereas companies used to have to hire different voice actors for different markets—the Northeast versus Southern US, or France versus Mexico—some voice AI firms can manipulate the accent or switch the language of a single voice in different ways. This opens up the possibility of adapting ads on streaming platforms depending on who is listening, changing not just the characteristics of the voice but also the words being spoken. A beer ad could tell a listener to stop by a different pub depending on whether it’s playing in New York or Toronto, for example. Resemble.ai, which designs voices for ads and smart assistants, says it’s already working with clients to launch such personalized audio ads on Spotify and Pandora.
The gaming and entertainment industries are also seeing the benefits. Sonantic, a firm that specializes in emotive voices that can laugh and cry or whisper and shout, works with video-game makers and animation studios to supply the voice-overs for their characters. Many of its clients use the synthesized voices only in pre-production and switch to real voice actors for the final production. But Sonantic says a few have started using them throughout the process, perhaps for characters with fewer lines. Resemble.ai and others have also worked with film and TV shows to patch up actors’ performances when words get garbled or mispronounced.
## WellSaid Labs Raises $10M in Series A Round
WellSaid Labs intends to use the funding to further AI and product development, scale go-to-market operations, and expand the workforce.
FREMONT, CA: WellSaid Labs, the artificial intelligence text-to-speech technology company, recently announced a $10 million Series A round led by FUSE, with participation from previous investor Voyager, along with Qualcomm Ventures LLC and Good Friends. On the strength of historic year-over-year revenue growth and strong client demand, the Series A was oversubscribed with VC interest.
“We have added AI Voice to the toolkit of thousands of content creators and their teams,” said Matt Hocking, CEO, WellSaid Labs. “Our human-parity AI voice can be produced faster than real-time, and updated on-demand. Opening up new and exciting opportunities to ‘add voice’ where never before perceived possible. AI voice easily ensures every production can be created and updated efficiently at scale.”
WellSaid Labs empowers content creators and product teams to create engaging voice content for a wide range of use cases in streaming services, radio, programmatic advertising, digital marketing, and corporate training content, with a mission to offer businesses and brands the highest-quality text-to-speech (TTS) service imaginable.
“Plain and simple, WellSaid is the future of content creation for voice,” said Cameron Borumand, General Partner at FUSE. “This is why thousands of customers love using the product daily with off-the-charts bottom-up adoption. Matt and Michael have assembled a world-class team and we couldn’t be more thrilled to be a part of the WellSaid journey.”
WellSaid Labs has redesigned TTS to address the most difficult content development challenges businesses face and to provide an easier approach for content creators—big or small—to create all of their desired material in one consistent voice that represents their brand. The Voice Avatar library from WellSaid Labs gives anyone access to a variety of read styles and tones that can be used in performances. Brands may also develop their own AI Voice Avatars to their specifications, capturing the likeness, style, and distinctiveness of the voice required to deliver their messages in the most effective way possible.
“Content creators or product experience designers were previously faced with difficult tradeoffs between quality and scalability when using TTS tools or human voiceover. WellSaid’s incredible voices, which are accessible through a studio application or a scalable API, remove the need to choose between natural, lifelike speech and infinitely scalable, easily editable voice content. WellSaid provides both and delivers it however your team would like to consume it,” said James Newell, Partner at Voyager Capital. “Creative teams have found it to be extremely useful when they need to produce multiple pieces of high-quality content in a consistent voice in hours instead of weeks.”
In the field of AI, creating natural-sounding speech from text is a “grand challenge” that has been a research aim for decades. Over the last three years, WellSaid Labs has consistently researched and made remarkable advancements in the quality, speed, and reliability of neural text-to-speech systems. In June 2020, WellSaid Labs’ TTS became the first to reach human parity for naturalness across numerous voices on short audio snippets.
“Recent developments in TTS technology using generative AI have enabled synthetic voices to sound very human-like, finding exciting new applications for voice including e-learning, advertising and news readers,” said Carlos Kokron, Vice President at Qualcomm Technologies Inc. and Managing Director at Qualcomm Ventures Americas. “WellSaid Labs provides an industry-leading product that generates highly accurate human-like voices. We look forward to working with WellSaid Labs to help fuel the creator economy with human-parity AI voices across mobile and IoT.”
“WellSaid’s team has applied deep technical expertise to build a platform that enables easy creation and editing of incredibly lifelike audio,” said Dave Gilboa of Good Friends and co-CEO of Warby Parker. “We see meaningful growth potential in the use of high-quality audio in giving brands the ability to communicate with customers and creators the ability to engage with audiences.”
WellSaid Studio removes the complications that standard text-to-speech technologies present to creatives, making voiceover production, updating, and publishing cost-effective and simple. WellSaid Labs’ core AI engine is accessible to product developers via real-time APIs, allowing them to power digital experiences with a dependable and scalable speech infrastructure.
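As a rough illustration of how a product team might wrap such a real-time TTS API, here is a minimal client sketch. The endpoint path, payload fields, and avatar name are hypothetical, not WellSaid Labs' documented API; the transport is injected so the sketch stays testable without a network call.

```python
# Hedged sketch of a real-time TTS API client. Route, payload shape, and
# voice names are invented for illustration only.

import json
from typing import Callable, Dict

class TTSClient:
    def __init__(self, api_key: str, transport: Callable[[str, Dict], bytes]):
        # transport stands in for an HTTP POST; injected for testability
        self.api_key = api_key
        self.transport = transport

    def render(self, text: str, voice_avatar: str) -> bytes:
        payload = {"text": text, "voice": voice_avatar, "key": self.api_key}
        return self.transport("/v1/tts/stream", payload)  # hypothetical route

def fake_transport(path: str, payload: Dict) -> bytes:
    # A real transport would return audio bytes; we echo the payload instead.
    assert path.startswith("/v1/")
    return json.dumps(payload).encode()

client = TTSClient("demo-key", fake_transport)
audio = client.render("Welcome aboard!", voice_avatar="Ava")
print(len(audio) > 0)
```

The design point is the one the article makes: because synthesis happens behind an API call, a product can regenerate audio on demand whenever the underlying script changes, rather than re-booking a recording session.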