November 25, 2020

What makes machine learning powerful is its ability to absorb and analyze large amounts of data and extract actionable insights from it. Data that already exists as written text is easily assimilated. However, not all information exists in that form: among the various repositories a machine learning program can draw from, spoken words on video, audio recordings, and live events could prove quite important. Thus, transcribing voices accurately and reliably becomes an important goal for AI.

Even outside the AI context, voice transcription has long been an important function for various industries. Traditionally, it was performed by human transcribers and, in many organizations, still is. Grand View Research projects that the global voice recognition market overall will hit $127.58 billion by 2024.

Now the statistic above would certainly give pause to a lot of people. This is the era of new-age technologies, and industries are increasingly looking to leverage IoT, AI/ML, and more to bring about increased automation in various processes. Automation promises that work gets done faster, more efficiently, and with zero errors. So how is it that a field already seeing traction thanks to AI/ML still projects this much continued human involvement?

Well, the answer lies in the “error-free” part of the proposition. While basic voice recognition has progressed by leaps and bounds, and voice assistants such as Siri and Alexa are increasingly making our lives simpler, their algorithms are far from the stage where they can be trusted with complex transcriptions. If you’re looking to understand the various factors that make voice transcription a significantly difficult computing problem, this essay by Markus Forsberg is an excellent read. What we need to understand here is that even the most sophisticated voice assistants draw from sets of pre-programmed commands, while accurate transcription involves listening carefully to every word and working out how each one fits into the overall narrative. This is why we would never trust Siri or Alexa with news stories, the outcomes of expensive and important trials, or the lives of patients.

The situation is only further compounded when multiple voices, accents, regional terms, and so on come into play. People do not moderate their speech to make things easy for an AI, so it is the algorithms that need to become more perceptive and do the heavy lifting. But can machine learning get to that level?

The answer, for the most part, is yes! Some of the leading companies, and the top minds within them, have been making rapid progress in advancing machine learning to the point where dependence on humans can be largely eliminated. These organizations include Microsoft, Google, Amazon, Baidu, Apple, Cisco Systems, Nuance, and more. Let us take a look at how far they have gotten with their specialized offerings.

A few years ago, Microsoft presented us with a historic moment when they declared they had achieved accuracy with machine transcription similar to that of humans. In a paper published in 2016, they unveiled a system whose transcriptions consistently achieved a 5.9% word error rate, which is on par with professional human transcribers.
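For context, the 5.9% figure refers to word error rate (WER), the standard accuracy metric in speech recognition: the word-level edit distance between the system's output and a reference transcript, divided by the reference length. As a minimal sketch (not Microsoft's implementation), it can be computed like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance (substitutions, insertions,
    deletions) divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of six gives a WER of about 0.167.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A 5.9% WER therefore means roughly one word in seventeen is wrong, which is why even "human parity" systems still leave editing work on long documents.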

Since then, the company has enabled users to search and type by voice in Outlook, Microsoft Office, and other products. They have also integrated the same technology into the Xbox and their voice assistant, Cortana.

As for Google, they were slightly ahead of Microsoft when they added voice typing to Google Docs back in 2015. On Android, we are now used to searching and typing by voice, with the software learning rapidly to reduce errors. However, such solutions still need near-ideal conditions to deliver the desired results and are in further need of polish.

Amazon’s specialized Transcribe service can transcribe English or Spanish and works particularly well with telephone audio. They will enable support for multiple speakers as well, and the service is priced below most human transcription services.

A major name in this field is Nuance Communications which, unlike Amazon, Microsoft, and Google, is focused on this field of work. Their offerings are part of their “Dragon” line and there are specific solutions for areas such as legal, academic, medical, and so on. Cloud-based, they work seamlessly across devices and integrate with repositories they can learn further from, such as electronic health records in hospitals.

We are most familiar with Apple’s Siri, which has also come a long way since being introduced in 2011. Alongside improving Siri, Apple has a team working on tools built into iOS that enable users to dictate text and have voicemails read out to them.

Another notable entry in this list is Cisco Systems which, through its purchase of Tropo, a startup that provided an API for voice and SMS applications to small businesses, can help its customers with text-to-speech and transcription services in 32 languages.

Apart from these big names, there are a few startups and mid-size companies which have achieved success in cracking the effective transcription puzzle with their algorithms. Here are a few mentions:

Trint: Based in London and founded in 2014, the company is focused on professional transcription services at competitive rates.

Simon Says: Based in the USA, Simon Says is focused on the media industry and automatically transcribes audio and video files in 64 languages. The company also claims to support almost 100 different dialects and multiple speakers.

AISense: The company has developed an app called Otter which can record and transcribe all meetings so that no information is missed. The tool listens keenly and learns from the voices to improve accuracy over time.

Sonix: Targeted at radio broadcasters and podcasters, the company offers fast transcription and novel features such as deleting parts of an audio clip by removing the corresponding text from the transcript, and highlighting, in different colors, the areas where the algorithm isn’t entirely confident of the result, for quick editing.
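Low-confidence highlighting of the kind Sonix offers generally builds on per-word confidence scores that speech recognition engines return alongside the transcript. As a hedged sketch (the field names and threshold here are illustrative assumptions, not Sonix's actual format), flagging uncertain words for an editor might look like:

```python
from typing import Dict, List


def flag_low_confidence(words: List[Dict], threshold: float = 0.85) -> List[str]:
    """Return transcript words, wrapping any word whose confidence
    falls below the threshold in [?...?] markers for review."""
    flagged = []
    for w in words:
        text = w["word"]
        # Mark uncertain words, e.g. for color highlighting in a UI.
        flagged.append(f"[?{text}?]" if w["confidence"] < threshold else text)
    return flagged


# Hypothetical engine output: word/confidence pairs.
recognized = [
    {"word": "the", "confidence": 0.99},
    {"word": "patient", "confidence": 0.97},
    {"word": "denies", "confidence": 0.62},  # low confidence
    {"word": "dizziness", "confidence": 0.91},
]
print(" ".join(flag_low_confidence(recognized)))
# the patient [?denies?] dizziness
```

The design point is that the machine does the bulk transcription while the human reviews only the marked spans, which is far faster than proofreading the whole document.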

Conclusion:

We began by looking at the various problems software faces when it comes to pulling off accurate transcriptions. However, looking at the advancements being made by companies large and small, it is safe to say machine learning and AI have progressed to the point where we can look forward to them taking over most complex transcription tasks. The time and effort this will save are substantial and will speed the completion of important work. Though innovations and refinements are still in progress, looking at the range of offerings around and how they are getting better by the day, it is fair to say that we are in the home stretch.