However much technology advances, the primary mode of human expression and communication remains the spoken word. So technology will just have to be more accommodating.
Research into computer voice recognition has been going on for as long as there have been computers. IBM, for instance, began working on voice technologies in the 1950s in pursuit of Defense Department objectives like trying to read Russian messages during the Cold War. Since then, voice recognition research has turned toward more commercial applications.
"What we have been doing is steadily building a broad portfolio of voice capabilities," says Nigel Beck, director of marketing for IBM Voice Systems. "Then, as those capabilities mature, we try to put them into the right type of product."
Until recently, however, few voice technologies have delivered practical value. Far from understanding the substance--let alone the intent--of your speech, they have been mainly good for the crypto-mystical quality of faulty transcriptions like "He hath delivered to whom of his head confides."
But advances in processing power and memory have put increasingly powerful algorithms within reach of the average PC.
"The basic statistical models that drive speech recognition have gotten much more sophisticated," says David Wald, senior technology advisor at Dragon Systems, a long-time developer of speech recognition software. "What has come along with that is the processing power on personal computers, and the memory that goes with it, have both increased to the point where we actually can use more sophisticated modeling technology."
The biggest remaining barriers to leveraging these tools, then, have more to do with cynicism based on previous experience than with the current state of technology.
Unlike releases even a year or two old, current products offer speech recognition accuracy as high as 95 percent to 97 percent, while requiring less training and offering greater voice control over the PC. These tools offer both power and productivity for a range of sales and service functions, from personal productivity tools to hands-free control of mobile applications, to call center automation and analysis, to voice-enabled electronic commerce.
Tools at Hand
Speech-to-text software from companies like Dragon Systems, IBM and Lernout & Hauspie (L&H) makes it easy to dictate memos and e-mail for transcription on the computer with remarkable accuracy. Combined with specially designed digital recording devices, such software in the hands of sales professionals can, on one hand, take advantage of travel time and free the field rep from paperwork in favor of more direct interaction with the customer. On the other hand, it can also be used to more spontaneously capture impressions and information about customers after sales calls.
All three companies have new versions of their dictation products that increase accuracy and reduce the time it takes to "enroll" a speaker by training the software to understand how he or she speaks.
IBM's dictation product line is ViaVoice, with the latest version titled "Millennium." ViaVoice Pro Elite, released in April, even frees the user from having to wear a headset microphone by including Andrea Electronics' noise-canceling Desktop Array far-field microphone. In a quiet environment, the microphone and software let computer users create documents, navigate the Web and control the desktop all by voice.
Belgian voice giant L&H has released Voice Xpress version 5.0. One enhancement is a "disfluency filter" to automatically take out the "ums" and "ahs" so embarrassingly common in speech.
To bolster accuracy through better enrollment, Voice Xpress can read word processing documents--and now e-mail--on the user's hard disk as part of the training process. "It is the number one thing they can do beyond enrollment to increase accuracy. It lets the software read your documents," says William Destefanis, senior director of product management at L&H. "It understands your writing style and adapts the language model accordingly."
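The adaptation Destefanis describes can be sketched in a few lines. This is a toy illustration, not L&H's actual method: it simply blends word frequencies observed in a user's own documents into a base vocabulary model, so terms the user actually writes become more probable.

```python
from collections import Counter

def adapt_language_model(base_freqs, user_docs, weight=0.3):
    """Blend word frequencies from a user's own documents into a base
    language model -- a rough sketch of the kind of adaptation
    dictation products perform beyond voice enrollment."""
    user_counts = Counter()
    for doc in user_docs:
        user_counts.update(doc.lower().split())
    total = sum(user_counts.values()) or 1
    adapted = dict(base_freqs)
    for word, count in user_counts.items():
        user_p = count / total
        base_p = base_freqs.get(word, 0.0)
        # Interpolate: keep the base model's weight but boost words the
        # user actually writes (names, jargon, product terms).
        adapted[word] = (1 - weight) * base_p + weight * user_p
    return adapted
```

Real dictation engines adapt far richer statistical models (word sequences, not just single words), but the principle is the same: the software's expectations shift toward the user's own writing style.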
ViaVoice and Voice Xpress will both operate with Olympus' lightweight digital recorders, transcribing recordings after downloading to the desktop or laptop through serial or USB cables.
Dragon Systems, meanwhile, is up to version 4.0 of NaturallySpeaking. The company has merged its dictation technology even more closely with a digital recorder from Voice It Worldwide in Dragon's NaturallySpeaking Mobile Organizer.
"A lot of people were using the product to collect 'to do' items for themselves--appointments that they had to schedule, short notes that they had to send, calendar items and things like that," Wald explains. "The idea is to actually take the underlying speech recognition technology and apply a natural language processing layer on top of that to interpret the items that you give as PIM functions."
For example, dictate into the recorder, "Schedule a meeting with Bob Smith for 3 p.m. tomorrow about marketing strategies." Once downloaded to the computer, the software will transcribe the sound file, check your address book to figure out who Bob is, reference your calendar to figure out what date tomorrow is, and create an appointment with the correct subject line in popular contact management programs like Microsoft Outlook, Lotus Notes, GoldMine, ACT! or the Palm desktop.
Say, "Send an e-mail to Tom White regarding meeting. Period. Tom. Comma. Thanks very much for taking the time to meet with my team..." Mobile Organizer will create a message addressed to White with the subject "meeting" and the rest in the body of the e-mail. As soon as you approve the interpretation, the message is on its way.
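The natural language layer behind commands like these can be imagined as pattern matching over the transcribed text. The sketch below is purely illustrative -- the patterns and field names are assumptions, not Dragon's implementation -- but it shows how a transcription becomes a structured PIM action.

```python
import re

def parse_command(utterance):
    """Map a transcribed utterance onto a PIM action (toy example)."""
    # "Schedule a meeting with <who> for <when> about <subject>"
    m = re.match(
        r"schedule a meeting with (?P<who>.+?) for (?P<when>.+?)"
        r" about (?P<subject>.+)",
        utterance, re.IGNORECASE)
    if m:
        return {"action": "appointment", **m.groupdict()}
    # "Send an e-mail to <who> regarding <subject>"
    m = re.match(
        r"send an e-?mail to (?P<who>.+?) regarding (?P<subject>.+)",
        utterance, re.IGNORECASE)
    if m:
        return {"action": "email", **m.groupdict()}
    # Anything unrecognized becomes a plain note.
    return {"action": "note", "text": utterance}
```

From a structure like `{"action": "appointment", "who": "Bob Smith", ...}`, the software can then consult the address book and calendar to resolve "Bob" and "tomorrow" into concrete entries.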
Speech technology vendors are also building recognition capabilities into mobile phone applications and call-in information systems. By embedding both speech-to-text and text-to-speech capabilities in ubiquitous computing devices, vendors can eliminate keyboards and even screens. Your car can now read your e-mail to you while you drive. Whether you need to dictate a field report, manage investments, get directions or buy tickets, you or your customers can do it from a mobile phone because of the processing power driving the software on a server you have dialed into. Since every voice is unique, such systems can have built-in authentication based on voice prints, too.
IBM Natural Language Understanding technology supports unstructured, conversational dialogue rather than requiring the user to speak specific commands, so users can speak on the telephone to a computer in much the same way they would speak to a person. Meanwhile, the company's ViaVoice text-to-speech engine synthesizes speech with the latest in linguistic technologies for highly intelligible artificial speech output. IBM's Enterprise Voice Solutions include tools to add voice processing to voice messaging and fax so people can access services and information by telephone or through the Web around the clock, or direct incoming calls to as many as 250,000 different names.
Putting these together, for instance, IBM created an automated system for British Airways. Wherever they are in the world, flight attendants can call into the system and be informed over the phone of their schedules and assignments, while logging sick days and other requests.
Voice of the Enterprise
Meanwhile, sophisticated audio mining tools are beginning to appear that can capture valuable information from a broad range of call center activities.
Audio mining software works somewhat differently from transcription software. Where speech-to-text transcription software is speaker dependent, audio mining software is speaker independent. Accuracy with speaker-independent recognition is only about 70 percent, versus accuracy in the high 90s for dictation products trained to a particular voice. But it will transcribe virtually any voice recorded with sufficient quality. In some cases, these applications will even attempt to insert punctuation (which must be spoken aloud with transcription software). But audio mining products are not intended to transcribe speech. Rather, they are intended to identify and index key concepts.
"The point is to provide something searchable--not to provide a transcript," Wald says.
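The searchable index Wald describes can be pictured as a simple inverted index: the recognizer emits words with time offsets, and each word points back to the moments it was (probably) spoken. This sketch is an assumption about the general technique, not any vendor's product.

```python
from collections import defaultdict

def build_index(timed_words):
    """Build a word -> [time offsets] index from recognizer output.
    timed_words: list of (seconds_offset, recognized_word) pairs."""
    index = defaultdict(list)
    for offset, word in timed_words:
        index[word.lower()].append(offset)
    return index

def search(index, term):
    """Return the offsets in the recording where a term was heard."""
    return index.get(term.lower(), [])
```

Even at 70 percent accuracy, an index like this is useful: a searcher only needs the key term to be recognized once somewhere in the recording, not a flawless transcript.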
L&H's Rough 'n' Ready is a collection of audio-mining technologies wrapped into a demonstration application. Run Rough 'n' Ready on recordings of television news broadcasts, for instance, and it will separate the different speakers in the broadcasts based on voice patterns, do a rough transcription of what they are saying, identify key concepts in the broadcast and create an index hyperlinked to the correct locations on the original recording.
L&H announced plans last March to acquire both Dragon Systems and Dictaphone. Meanwhile, Microsoft made a $45 million equity investment in L&H. A new version of the speech API (SAPI 5.0) will contain various L&H components, helping software developers to take speech into account as they design their applications.
Dictaphone's telephony call center recording systems can be used to monitor all of the phone lines inside a business and record conversations that take place on them. Financial institutions and public safety agencies do this for their own protection--and may be legally required to record 100 percent of the calls that come into their call center. Businesses focused on call centers are also recording customer interactions to improve the quality of service inside those call centers. Dictaphone solutions can also track hold times and number of transfers.
"Look at all the information that they have at their fingertips," says Tom Morse, senior director of engineering at L&H. "By bringing audio mining into the picture we can start analyzing calls in new ways."
Audio mining can chart the frequency of important key words, automatically monitor how closely agents are following their scripts and perform stress analysis on the callers. Working with one customer, such tools were able to spot a counter-intuitive trend: Calls were being handled four times faster when they weren't being passed from agent to agent in search of the expert in a particular problem.
"Their agents were not being trained well enough on when to hand off a phone call," he explains. "They were able to make a change to their call center and then, because they are recording 100 percent of their calls coming in, they were able to see that trend change in their graphs over the next few weeks."
Dragon has been developing audio mining applications on its own, according to audio mining Product Manager Robin Gaynor.
She points out that server-based enterprise voice technologies often have more computing power to back them up, so they can perform more sophisticated tasks. "The speaker is speaking for the benefit of another human and the machine just happens to be there," she explains.
For example, Gaynor took a sample of support calls that come in for Dragon software and used their "phone mining" tools. "Just by searching through our calls I found that there were a lot of calls about 'sound cards.' I noticed that because it provides you with certain key words that have been spoken frequently that ordinarily don't occur in natural conversation," she explains. "I found the term 'sound blaster.' I searched for sound blaster and found that we had a lot of calls that came in about a version of Sound Blaster sound cards. I figured this out in about 10 minutes."
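The heuristic Gaynor describes--surfacing words that are spoken frequently in calls but rarely in ordinary conversation--can be sketched as a comparison against a baseline vocabulary. The scoring and frequencies below are illustrative assumptions, not the actual phone-mining algorithm.

```python
from collections import Counter

def surprising_terms(call_text, baseline_freqs, top_n=5):
    """Rank words by how much more often they occur in call
    transcripts than a baseline of everyday speech would predict."""
    counts = Counter(call_text.lower().split())
    total = sum(counts.values())
    scored = []
    for word, count in counts.items():
        # Expected occurrences if this were ordinary conversation;
        # unknown words get a tiny floor so jargon scores highly.
        expected = baseline_freqs.get(word, 1e-6) * total
        scored.append((count / expected, word))
    return [w for _, w in sorted(scored, reverse=True)[:top_n]]
```

Run over a week of support-call transcripts, a list like this would push terms such as "sound blaster" to the top in minutes, exactly the kind of ten-minute discovery Gaynor recounts.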
IBM, L&H and Dragon are not the only players leveraging speech in sales, service and CRM. Nuance Communications, a spin-off of technology innovator SRI International, is focusing directly on enterprise applications. Nuance's platform has been licensed by more than 150 companies to create voice-enabled customer service and e-commerce applications that run over the telephone. Some examples include American Airlines' voice-driven flight information system, United Parcel Service's service to track packages and schedule pickups by voice, and a Charles Schwab system that gives clients stock quotes via a toll-free number.
Meanwhile, about a hundred companies, including IBM, AT&T, Lucent and Motorola, formed the VoiceXML Forum to coordinate creation of a standard speech mark-up language for making Internet information and services accessible by voice and phone.
Looking to the future, speech technologies can take a more interactive role in call center functions. "One of the biggest problems in call centers is the turnover rate for agents. They only last a few months," Morse says. "Speech technology can be used on the front end in training the agent, with role-playing and immediate feedback on whether they are getting lost in the script. During the call you also can have passive monitoring of the call, listening to the way the conversation is going and possibly provide extra help to the agent as they are dealing with a particular subject."