Achieving empathy at scale in an AI-driven world

Post by
Teo Borschberg and Nicolas Perony

OTO.ai founders Teo Borschberg and Nicolas Perony believe that, just as the web changed the world in the ’90s, the voice-first industry will open up exponential opportunity for humanity.


As humans, we’re moving deeper into a voice-first context, a shift only accelerated by COVID-19. OTO.ai predicts that in 2021 voice will go mainstream, superseding clumsy computer interfaces and eclipsing voice-to-text solutions, as humans and machines finally begin to understand each other better.


We know this because OTO’s vision is simply to make human-computer interfaces feel more, well, human. We’ve been hard at work maturing this market for three and a half years. Now our vision is being realised.


On a single day in April 2020, Microsoft reported that 200 million participants using Microsoft Teams generated over 4.1 billion meeting minutes. By 2023, a quarter of all employee interactions with applications will be speech-driven. In 2024, the world will be home to eight billion digital voice assistants — outnumbering humans.


The speech recognition industry is skyrocketing as data-driven organisations seek to extract meaning by transcribing speech to text. Transcription of audio into written sentences, albeit an impressive technological achievement and one that took decades to perfect, only ever captures what is said, not how it is said. Spoken language contains a wealth of information hidden in nuances of intonation: emotions, engagement, satisfaction, biomarkers, and cultural traits. OTO has made major advances in extracting and creating value out of these often cryptic cues, using proprietary algorithms in our next-generation acoustic engine.


OTO’s DeepTone™ is an AI framework that helps industry understand key human behaviours and sounds from a speaker’s tone, building a rich acoustic map of voice data in real time, at the edge.


With current speech-to-text solutions, the rich spectrum of acoustic information contained in the human voice, data that reveals so much about human behaviour, is lost. But advances in artificial intelligence now enable real-time voice analysis at scale, at the edge, so organisations can better understand their most valuable constituents – customers, users, and employees. 


Anticipating the explosive growth of voice, in 2017 we founded OTO.ai by spinning out the best voice technology available at that time from SRI International’s STAR Lab, known as the birthplace of some of Silicon Valley’s best breakthroughs. We did this knowing that this move was but a stepping stone towards the accomplishment of our dream — the creation of an AI capable of interacting with care and empathy with people, and facilitating care and empathy between people, at scale. 


We validated our technology commercially in the call centre industry, where we helped companies ensure a high level of customer service. One of our customers is a US call centre raising donations for nonprofits. Used to coach close to 1,000 of its agents, OTO’s AI-powered real-time tone coaching tool led to 30% higher agent engagement, better customer service, optimised quality control, and a 15% increase in donation amounts.


Earlier this year, OTO was scaling rapidly by creating value in the call centre industry when COVID-19 hit and the world changed. This temporary setback allowed us to contemplate the state of the global voice market, and we realised that there was a much bigger wave to ride.


As the pandemic laid bare society’s fault lines, the OTO team decided that to better serve humanity we had to build the best voice intelligence engine in the world, and make it available so anyone could join our quest of enabling voice intelligence, everywhere. We called this DeepTone™.

In a world where voice is everywhere, DeepTone™ is already helping humans make a difference in the fields of healthcare, customer experience, and robotics.

DeepTone™ is a state-of-the-art deep learning framework that we built to capture so-called “latent speaker states”, indicating the subjective state a person is in (including how they feel!) based on subtle nuances in their voice. The development of DeepTone™ draws on years of experience and experimentation in acoustic technology and behavioural modelling by the OTO team, and builds on what we learned from the previous SRI-developed system, SenSay, which we leveraged until 2019 to raise the bar in speech emotion modelling, a framework we call Acoustic Language Processing.


Now a production-grade AI layer that uses rich, lifelike datasets to provide the most powerful and versatile speech emotion recognition system in the world, DeepTone™ is trained on a unique and proprietary dataset containing 100 times more utterances from 100 times more speakers than common emotion datasets. We are proud to say we took a bleeding-edge system and improved it in every possible way. DeepTone™ is now:

  • Better at modelling emotions, as we’ve improved the recognition accuracy by 40%;
  • More versatile, as we use it to derive biomarkers characteristic of, for example, gender and age, or leverage its acoustic representation to model custom behaviours (including online rage or Alzheimer’s disease);
  • More generic: we trained it on over 10,000 individual voices comprising a multitude of languages, accents, and ethnic backgrounds. This is a crucial step towards accounting for the diversity of human vocal expression, a traditional bottleneck in AI development;
  • More granular: DeepTone™ outputs predictions 10 to 20 times per second, allowing a much finer description of tone, down to within-word rather than within-sentence descriptions;
  • A whopping 100 times faster (and more power-efficient) than our previous framework and competing tools: this not only leads to higher operating margins on data centre deployments, but also makes it possible to run voice AI at the edge, on mobile or even IoT devices, while protecting data privacy;
  • Plug & play and language-independent, saving customers from the lengthy, costly calibration and fine-tuning they would have to do with traditional speech-to-text systems.
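
To make the granularity point above concrete, here is a minimal arithmetic sketch in Python. It is not DeepTone™ code and does not use any OTO API. The function names and the word and sentence durations are assumptions chosen purely for illustration. The only figures taken from the list above are the 10 to 20 predictions per second.

```python
# Illustrative arithmetic only: not DeepTone(TM) code, no OTO API involved.
# The durations below are hypothetical values picked for the example.

ASSUMED_WORD_SECONDS = 0.5      # assumption: a spoken word lasting half a second
ASSUMED_SENTENCE_SECONDS = 4.0  # assumption: a spoken sentence lasting four seconds


def hop_duration_ms(predictions_per_second: float) -> float:
    """Time between two consecutive predictions, in milliseconds."""
    if predictions_per_second <= 0:
        raise ValueError("predictions_per_second must be positive")
    return 1000.0 / predictions_per_second


def predictions_within(span_seconds: float, predictions_per_second: float) -> int:
    """Number of whole prediction intervals that fit in a span of speech."""
    if span_seconds < 0:
        raise ValueError("span_seconds must not be negative")
    return int(span_seconds * predictions_per_second)


if __name__ == "__main__":
    for rate in (10, 20):  # the range quoted in the list above
        print(
            f"{rate} predictions/s: one every {hop_duration_ms(rate):.0f} ms, "
            f"{predictions_within(ASSUMED_WORD_SECONDS, rate)} per assumed word, "
            f"{predictions_within(ASSUMED_SENTENCE_SECONDS, rate)} per assumed sentence"
        )
```

Under these assumed durations, a single word already spans several predictions, which is what a within-word description of tone requires. A system that emits one label per sentence cannot show how tone shifts inside that sentence.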

Armed with this AI innovation, our vision of creating a better voice intelligence system for everyone, and finally enabling speech-to-meaning, is coming true.


Business and government alike are storing ever-increasing volumes of raw voice data that could be analysed to help them better understand the humans they serve, whether citizens, customers or employees. Acoustic understanding can help save lives in the healthcare sector, and can even use empathy to help rebuild the trust lost between people and brands.


The voice-first opportunity presents pathways to financial recovery, particularly for organisations that rely on human connection for revenue. Pioneers in this nascent industry will enjoy first-mover advantage, growth through innovation and economic reward.


Today, OTO’s voice intelligence assists humans in a growing number of use cases: in healthcare, OTO technology delivers improved diagnosis; in robotics, augmented interaction; and in retail, richer real-time conversations between customers and voice bots. DeepTone™ is enabling and accelerating fundamental academic research on human and animal communication, and we’re particularly proud that our acoustic engine is also being used to promote healthier relationships and to temper human hatred in online communities.


OTO’s dream is a voice-first world where robots understand humans more than ever before, and help people solve intractable problems for the greater good. (Image source: @possessedphotography on Unsplash.com.)


The advent of 5G shows how the continued improvement of telecommunications technology provides cheaper, more widely available and more equally distributed information. As IoT and data on the edge become ubiquitous, mobile devices will become smarter and more capable of integrating into our lives. In this context, ensuring the capability of machines to better understand the subtleties of voice is more important than ever. In the future we won’t clumsily dispatch crude texts to machines in the hope they’ll understand this dumbed-down communication: we will talk to machines with intention and meaning, like we talk to our friends and the people we love.


Voices are amazingly powerful. Humans learn to speak long before writing or typing; even an infant is aware of the power of its cry. Our voices make us who we are. At OTO, we truly believe that voice technology will ultimately make all our lives better. If, like us, you wish to ensure that the world is bettered through truer connections between humans and machines, let’s talk.