We first train a self-supervised learning (SSL) model using over 1 million hours of unlabeled audio across 48 languages. This uses an efficient transformer variant that learns rich acoustic representations of speech (internally we name these models after bears, so we thought it was only fitting to call our release 'Ursa'). We then use paired audio-transcript data in a second stage to train an acoustic model that learns to map self-supervised representations to phoneme probabilities. The predicted phonemes are then mapped into a transcript by using a large language model to identify the most likely sequence of words.

Our diarization models exploit the same general self-supervised representations to enhance our transcripts with speaker information. We also apply inverse text normalization (ITN) models to render numerical entities in our transcriptions in a consistent and professional written form. Consistent ITN formatting is imperative when building applications that rely on dates, times, currencies, and contact information. Listen to the “Text Formatting” sample with the audio player above, which showcases our output, or read about it in our blog.

We have made significant improvements at every stage of this pipeline, compounding to produce the accuracy gains shown in Table 1.

Building on the findings from DeepMind’s Chinchilla paper, summer intern Andy Lo from the University of Cambridge established scaling laws for our SSL models and showed that these transformer-based audio models exhibit similar scaling properties to large language models. With Ursa, we achieved our breakthrough performance by scaling our SSL model by an order of magnitude to 2 billion parameters and our language model by a factor of 30, both made possible by using GPUs for inference. GPUs have a highly parallel architecture enabling high-throughput inference, which means more streams of audio can be processed in parallel.
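Chinchilla-style scaling laws fit a parametric loss curve in model size and data size, then use it to predict how much a larger model should help. The constants below are the published Chinchilla fits for text language models, not Speechmatics' (unpublished) audio fits, so this is purely an illustration of the methodology:

```python
# Chinchilla-style parametric scaling law: predicted loss as a function of
# model parameters N and training tokens D:
#     L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the published Chinchilla fits for text LMs (Hoffmann et al.,
# 2022) -- illustrative only; audio models would have their own fitted values.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for a model with N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Scaling the model by an order of magnitude (with data scaled alongside it)
# lowers the predicted loss toward the irreducible term E.
small = predicted_loss(200e6, 1e11)   # hypothetical 200M-parameter model
large = predicted_loss(2e9, 1e12)     # hypothetical 2B-parameter model
print(f"200M params: {small:.3f}")
print(f"2B params:   {large:.3f}")
assert large < small
```

Fitting `E`, `A`, `B`, `alpha`, and `beta` to a sweep of smaller training runs is what lets a team predict, before committing the compute, whether a 10x larger model is worth training.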
What sets Ursa apart from other speech-to-text offerings is its exceptional accuracy. Moving to GPUs for inference and scaling up our models has allowed Ursa’s enhanced model to surpass human-level transcription accuracy on the Kincaid46 dataset † and remove an additional 1 in 5 errors on average compared to Microsoft, the nearest large cloud vendor. Both Ursa’s standard and enhanced English models outperform all other vendors, delivering significant 35% and 22% relative improvements respectively compared to our previous release (shown in Table 1).

Ursa-quality transcription is also available for real-time recognition, leveraging the same underlying models. Additionally, we are proud to release new translation capabilities alongside our ground-breaking speech recognition. For the first time, we’re making GPU-accelerated transcription possible on-prem, with Ursa providing unrivaled accuracy and low total cost of ownership (TCO) to enterprises. Together, these technologies break down language barriers and make a big leap towards our goal of understanding every voice.
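For a concrete reading of the figures above: relative improvements are measured against the baseline's word error rate (WER), so "removing 1 in 5 errors" corresponds to a 20% relative WER reduction. A minimal sketch, using hypothetical WER values rather than any vendor's published numbers:

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline system's errors removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical WERs, purely for illustration: going from 10% to 8% WER
# removes 1 error in 5, i.e. a 20% relative reduction, even though the
# absolute WER gap is only 2 percentage points.
baseline, improved = 10.0, 8.0
print(f"relative reduction: {relative_wer_reduction(baseline, improved):.0%}")
```

This is why relative reductions are the standard way to compare recognizers: the same 2-point absolute gain means much more against a strong baseline than a weak one.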