Speech to Text: the toughest challenge for mankind


The biggest frustration of being partly hearing impaired is the continuous tension of conversation. It is a daily struggle, full of failed efforts to reconstruct words and sentences out of fragments of sound. A fraction of the affected people can lip-read well (useless in a telephone conversation), and those with severe deafness have to resort to sign language, which very few hearing people know (or could be bothered to learn).

We are in the age of warp-speed technology, and this article is an attempt to figure out how today’s technology can be leveraged to provide the single most useful communication tool between the hearing world and the hearing impaired, irrespective of medium or gadget: subtitles (closed captions, or CC).

Imagine a handheld device with a display screen that can convert any speech to text. Any speaker, any kind of accent, with the text appearing as soon as a complete sentence is spoken. This gadget alone could change the whole world. It could even be only 80% accurate, and the human brain could still fill in the missing parts in a real-time conversation.

In advanced countries like the US, TV programs carry subtitles, which are a godsend for the hearing impaired. For the 25 years I lived in my country, I never understood a movie or a TV show. On my first day in the US, when I turned on the TV at a friend’s place, I felt like I was reborn. Movies on TV became an instant addiction for the next few years.

Closed captioning has helped hearing impaired people enormously, but it still does nothing for the daily struggle. A major part of that struggle is undoubtedly the use of the telephone. The existing TTY technology is old and not scalable. A recent innovation in this arena comes from CapTel, which still needs a dedicated device and plan. They even launched a web-based service, Web CapTel, in Australia, which a blogger found really great. This service is powered by human transcribers, so it still does not solve the fundamental problem of machines converting speech to text.

In this age of smartphones, it would be really awesome to have a smart speech-to-text app on a handheld phone like the iPhone. Imagine being able to talk either face-to-face or on the phone while reading real-time subtitles. It would be the best app and invention for mankind, I am pretty damn sure.

  1. #1 by Francesco Gallarotti on October 23, 2009 - 5:46 am

    Of course I agree with you… and now imagine if that tool/app could also translate the caption in any language essentially removing language barriers. You could go to Japan and walk up to a stranger and start a conversation just by reading the real time translation that your device will show…
    Unfortunately untrained speech recognition and real time translation are two of the most difficult problems to be solved with a computer, aren’t they?
    IMHO, if anyone ever reaches this goal it will probably be Google, given that they are already heading in that direction with the transcription of voice mail into emails in their Google Voice service and with Google Translate (which has been getting better and better in the past few months)…

  2. #2 by Thara on May 10, 2010 - 3:56 am

    Hi Joy,

    I came across your blog when I was browsing the Internet for speech-to-text software. I am looking for this exact piece of marvel for someone who is partly hearing impaired. It is so frustrating for him to talk on the phone. Yes, I came across Sprint’s Captel, but as you have mentioned, it relies on someone who transcribes the conversation. Is there any machine that will transcribe on a screen in real time? I also saw Dragon Dictation/iCommunicator, but they all seem complicated. Apparently you have to “train” the machine to recognize a person’s speech, so there is no way one can talk to a total stranger. Is there something close to any of these devices? I live in India, and I don’t see anything that will solve this issue. And people here are so phone dependent. I also came across a Swedish company that converts speech into animation, but not text. It’s called synface. Would love to hear if there’s anything out there that can solve this problem.
