Blog: Dialing for Dialog

04/20/2008

Integration for Delegation

In our recent press release we announced that applications that are richly connected to enterprise backend systems can yield up to three times the automation rate of non-connected dialog systems. The reason for this dramatic increase in automation performance is simple, and it is called “delegation”. Rather than having callers perform certain operations or provide certain pieces of information, the dialog system delegates those tasks to other systems and information repositories. We like to say: “The best question is a question not asked,” to stress that if there is a way to collect a piece of information other than asking the caller, it should be used. Delegating the collection of information or the performing of some actions to external enterprise backends, rather than to the caller, leads to a better interaction experience and higher automation.

Equipment identification in technical support calls is the perfect example. During a technical support call for internet service it may be important to know the caller’s modem type. To provide that information, the caller often has to put down the phone, crawl to some unreachable place, like the bottom shelf of the entertainment cabinet, locate the modem (which is not a simple thing for everyone, with so many boxes and cables around), find the brand name, go back to the telephone, and speak it. In the meantime noise may have triggered the speech recognizer, and there is always the chance of misrecognition. The whole thing may take a few minutes, which adds to caller frustration and increases the chance of a much dreaded hang-up or “operator!!!” Instead, a dip into the customer account database to locate the caller’s records and the type of modem takes a few seconds, can be done in parallel with other tasks (for instance collecting the reason for the call in natural language), and leads to a much more pleasant, reassuring, and successful interaction.
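To make the "in parallel" point concrete, here is a minimal Python sketch of a backend dip running while the caller is still describing the problem. Everything in it is made up for illustration: `lookup_modem_type`, `collect_call_reason`, and the toy account table are hypothetical stand-ins, not SpeechCycle APIs.

```python
import asyncio


# Hypothetical stand-in for a customer account database lookup.
async def lookup_modem_type(caller_id: str) -> str:
    await asyncio.sleep(0.01)  # simulated backend latency
    accounts = {"555-0100": "Acme CM-200"}  # toy data, not a real schema
    return accounts.get(caller_id, "unknown")


# Hypothetical stand-in for the speech side of the call.
async def collect_call_reason() -> str:
    await asyncio.sleep(0.01)  # simulated time the caller spends talking
    return "my internet is down"


async def start_call(caller_id: str) -> dict:
    # Run the backend dip and the caller interaction concurrently, so the
    # caller is never asked to go and read the label on the modem.
    modem, reason = await asyncio.gather(
        lookup_modem_type(caller_id),
        collect_call_reason(),
    )
    return {"modem": modem, "reason": reason}


if __name__ == "__main__":
    print(asyncio.run(start_call("555-0100")))
```

The design choice worth noticing is that the question about the modem never appears in the dialog at all; the answer simply arrives from the account record while the caller is busy with something only the caller can provide.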

Let’s face it! Subscribers paid for a service, and asking them to do all this work when something is wrong is not the best possible customer care. Delegating the work to machines, rather than to callers, is the way to go. We are moving towards a time when an automated customer care call will go like this:

System: Thank you for calling Acme customer care. How can I help you?

Caller: I just got my bill and there is a one hundred and fifty dollar charge I don’t understand.

System: I understand your bill is too high. Is that right?

Caller: Yes

System: I am sorry about that. Let me see what the problem is, and I will call you back in a few minutes.

We are not there yet…but we are moving in the right direction.

Posted by Roberto on Apr 20, 2008 12:00:53 PM

04/07/2008

What Happened in Vegas

Last week, from March 30 to April 4, Las Vegas hosted ICASSP, the International Conference on Acoustics, Speech, and Signal Processing. This is one of the largest conferences of the IEEE Signal Processing Society, and it is held every year in a different location around the world (next year ICASSP 2009 will be held in Taipei, Taiwan). There were more than 2000 participants and more than 1300 academic papers presented at the conference, 300 of which were dedicated to speech and language technologies.

Topics related to speech recognition, spoken language understanding, and dialog technologies are always very hot and represent the majority of the presentations in the speech and language area. That is a clear indication that the interest of academic and industrial research in the human-machine interaction using speech recognition has not diminished, but has been growing steadily during the past years.

So, what is hot in speech and language technology research this year? Certainly voice search and its applications attracted a lot of attention, with a special session entirely dedicated to the topic. However, besides basic research on speech recognition and language modeling, there were also several interesting presentations on the most recent advances in dialog learning, emotion detection, speech translation, audio mining, spoken information retrieval, and spoken language understanding.

The Show & Tell session was one of the most interesting events; it attracted hundreds of people for a whole afternoon. SpeechCycle presented a demo of our “third generation” customer care systems, including the advanced tools that take part in what we call “the cycle”, such as authoring, reporting, learning and optimization of dialog, speech accuracy tuning, annotation, SLM training, and online behavior modification, based on our latest QuickTouch product. Our demo gave the many people who stopped by our booth a clear sense of the advancements that a company like SpeechCycle has brought to the commercial world. We at SpeechCycle have a good tradition of technological innovation, and we cherish our strong links with the worldwide research community in the speech and language areas. Contrary to the popular saying, this time we hope that what happened in Vegas won’t stay in Vegas.

-------

SpeechCycle’s papers presented at scientific conventions and workshops:

· Evanini, K., Suendermann, D., Pieraccini, R., Call Classification for Automated Troubleshooting on Large Corpora, ASRU 2007, Kyoto, Japan, December 9-13, 2007

· Albalate, A., Dimitrov, D., Pieraccini, R., Unsupervised Categorisation Approaches for Technical Support Automated Agents, Interspeech 2007, Antwerp, Belgium, August 27-31, 2007.

· Acomb, K., Bloom, J., Dayanidhi, K., Hunter, P., Krogh, P., Levin, E., Pieraccini, R., Technical Support Dialog Systems, Issues, Problems, and Solutions, HLT 2007 Workshop on “Bridging the Gap, Academic and Industrial Research in Dialog Technology,” Rochester, NY, April 26, 2007.

· Levin, E., Pieraccini, R., Value-Based Optimal Decision for Dialog Systems, Proc. of IEEE/ACL 2006 Workshop on Spoken Language Technologies (SLT 06), Aruba, Dec. 10-13, 2006.

Posted by Roberto on Apr 7, 2008 6:10:25 PM

04/01/2008

50,000,000 Calls

We recently announced that SpeechCycle has processed 50 million calls. Beyond the successful execution of a process that delivers high levels of automation to the largest US cable companies, this achievement can be attributed to several levels of technological advancement and innovation brought by SpeechCycle. Here are some:

- Our analytic and reporting tools that enable the analysis of millions and millions of calls and help provide higher and higher levels of automation.

- The capability delivered by our integration software to seamlessly exchange information with enterprise web services and deliver what we characterize as an “immersive caller experience”.

- The flexibility of our SaaS platform with respect to different customers, backends, applications, call-centers, and its scalability towards high volumes of calls.

- A special and deep understanding of “data driven” speech technology for continuously improving grammars, statistical language models, and the Voice User Interface.

- A data bank of millions of utterances, transcribed and semantically annotated, that represent one of the richest linguistic inventories of caller expressions and responses.

All of this allowed SpeechCycle to create its “rich phone applications” which are beginning to move beyond the area of technical support and expand into new industries as well.

Posted by Roberto on Apr 1, 2008 9:46:44 PM

03/21/2008

We are all connected

Spoken dialog systems can help provide better customer care when they are connected and can interact not only with the caller, but also with other systems with which they exchange knowledge and perform actions.

Daniel Dennett, a professor of Philosophy at Tufts University, one of the most eminent contemporary philosophers of the mind, and one of my heroes, often talks about the “brain in the vat” metaphor. What is it? Imagine (just imagine, please don’t do it at home) that someone’s brain is removed from the body and immersed in a vat of liquid that keeps it alive. Also imagine that the brain’s terminal neurons are attached to powerful computers that provide the exact stimuli needed to produce, in the brain, a perception of the world just the same as a physical body would. The brain in the vat is a powerful thought experiment commonly used by philosophers of the mind to discuss reality, mind, and consciousness. Here I would like to use it for a more mundane pursuit, and I humbly apologize for that to all the philosophers of the mind.

Think for an instant of a brain in a vat; to make the experiment less grim, do not think of someone’s brain, but of a brain artificially grown in a lab by a group of bio-computing engineers. A brain with no memory of a past and no dreams of a future; a brain without any connections to the real world, except for some wires that get into the terminations of the auditory nerves, and some other wires that connect the articulatory nerves of the mouth to some special nerve-to-speech apparatus (NTS: not invented yet). No touch, no smell, no taste, no vision. Not much fun in the vat … huh? The only thing this brain knows is how to recognize a bunch of spoken words, and which words to speak in response.

What can that poor “thing” do? Not much, except react with words to what is spoken to it, as programmed by the bio-computing engineers. What if things in the world around it change? Can it perceive the change? Certainly not. What if someone asks for help and the brain, after having tried to help with all the possible instructions for which it is programmed, has to send that person to a more expert “real” human assistant? Can it send a note that summarizes what was done so that the human assistant can try something else? Probably not, because the brain in the vat cannot send “notes” on a different channel than the one its auditory and speaking nerves are connected to. It cannot take measurements, do stuff, check things, move objects and verify that the objects have been moved. All it can do is ask others to do all these things, and as we know, asking others sometimes does not work. All of this ineffectiveness in dealing with the real world is because the poor thing is “not connected”.

Non-connected brains in the vat are pretty much what we build today when we create “non-connected” spoken dialog systems. There is very little perceived, and actual, intelligence in a non-connected static dialog system; all it can do is recognize speech and talk following a precise and pre-established call-flow. But if you start connecting it to the rest of the world, the system can start “perceiving” the status of what’s happening and can act accordingly, in a more “intelligent” way. Speech is no longer a repository of knowledge; it is a means, a channel among many others, used to communicate with humans through their strange protocol called natural language. Besides speech and natural language, there is a lot of other communication going on through myriads of Web services that “connect” the call-flow with the rest of the world. We know that because at SpeechCycle we build connected spoken dialog systems, where the “spoken” part is only one part of the equation. And we believe that connectedness is the future of intelligent systems.

Imagine you call an automated agent because your internet is down. A connected system can get to your account, check where you live, and then check the network to see if there are any outages in that area, or maybe realize you haven’t paid for three months and then … well … “you have to talk to someone in the billing department who can help you to resuscitate your account from the limbo of delinquency” … and by the way … “I can also connect you right away”, while behind the scenes it sends the human billing agent a note on what they should do as soon as they pick up the phone, so we won’t waste any time.

Getting account and network information and sending notes to human agents (typically called screen-pops), in other words managing information and knowledge, are not the only things that connected dialog systems at SpeechCycle can do. They can actually “do” stuff. They can get into your modem and cable box at home and reset them; they can run a series of diagnostics and determine the exact cause of the problem you are calling about; they can determine the level of connectivity to your home by sending a ping signal; and they can do many other things. And sometimes the SpeechCycle systems can do several of those things in parallel (something that humans are not very good at) while at the same time talking to you on the phone.

So, what’s involved in building connected spoken dialog systems? I would say that the most important thing is abstraction. Abstracting the functionality of the various connections from the intricate details of each custom implementation is the key to success. What matters is the creation of abstract objects that reflect the elements common to all applications in a given vertical (for instance accounts, network status, service, modem, cable box, premium channels, etc.) and that can be used by the spoken dialog VUI, aka the call-flow, regardless of the specific implementation. VUI designers and developers can focus on the interaction without having to fiddle with the backends, knowing that when they invoke a “reset modem” command, the call-flow will do the right thing: it will fetch the customer account and the type of modem, get its IP address, send a reset signal, wait for the response, and return to the main dialog thread when the operation is complete. All of this because we are all connected.
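As a rough illustration of that kind of abstraction, here is a small Python sketch. The `Backend` interface, the `FakeCableBackend` class, and the `reset_modem` function are all hypothetical names invented for this example; they are not the actual SpeechCycle objects, and a real backend would call the operator’s web services rather than return canned data.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Modem:
    model: str
    ip_address: str


class Backend(ABC):
    """What the call-flow needs from any customer backend (hypothetical interface)."""

    @abstractmethod
    def fetch_modem(self, account_id: str) -> Modem:
        ...

    @abstractmethod
    def send_reset(self, modem: Modem) -> bool:
        ...


class FakeCableBackend(Backend):
    """Toy implementation that returns canned data and records resets."""

    def __init__(self) -> None:
        self.reset_log = []

    def fetch_modem(self, account_id: str) -> Modem:
        return Modem(model="Acme CM-200", ip_address="10.0.0.7")

    def send_reset(self, modem: Modem) -> bool:
        self.reset_log.append(modem.ip_address)
        return True


def reset_modem(backend: Backend, account_id: str) -> bool:
    """The single command the call-flow invokes, whatever the backend is."""
    modem = backend.fetch_modem(account_id)
    return backend.send_reset(modem)
```

A VUI designer would only ever see `reset_modem`; swapping one operator’s backend for another means writing a new `Backend` subclass, with no change to the call-flow.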

Posted by Roberto on Mar 21, 2008 4:59:45 PM

02/26/2008

AI

What is Artificial Intelligence? Is it really that magic wand, that “ghost in the machine” that can “intelligently” solve all the problems that common “dumb” engineering can’t? In more than 45 years of history, AI’s attempts to build intelligent machines haven’t always met expectations, causing the so-called “AI winter”. In fact traditional knowledge-based AI always suffered from severe scalability and flexibility limitations that impaired its effective application to complex real-life solutions, and it often escaped benchmark comparison with other, often more effective, techniques. By contrast, data-driven technology and modern software engineering, though less glamorous than AI, have brought solid results, especially in the field of spoken interaction with machines. Although some serious attempts to bring traditional AI notions back to life from their last winter are under way in some academic research establishments, only a fair benchmark comparison against mainstream technology, with measurable results, can prove their superior performance.

I recently happened to hear the term AI, as in Artificial Intelligence, much more often than during the past twenty years. I found that a little bit odd; I thought that no one with a historical perspective on science and technology would be using the term as nonchalantly as we did in the nineteen-seventies and eighties. I tend to associate the term AI with other terms of the past, like “electronic brain”, “thinking computer”, “cybernetics”, and “the information superhighway”. Today you don’t hear “Look for my page on the information superhighway” … it is so 1990s … None of my friends who were somehow connected with AI in the good old days call themselves AI experts anymore. Rather, they talk about disciplines like “cognitive sciences” or “machine learning”, use techniques like “Support Vector Machines” or “Markov Random Fields”, or unglamorously say “I use statistics” or “I do computer science” during party chats. What happened to AI? Where has it gone?

Back in the days when computers were mostly doing arithmetic calculations (it was the summer of 1956) there was a workshop at Dartmouth College where almost all of the computer pioneers in the known world, only a bunch of them at the time, met for two months (good old times … when one could actually go away for two months) to discuss advanced and unconventional computer programs that were able to prove theorems, play chess, and recognize all kinds of patterns. The workshop came to be known as The Dartmouth Summer Research Project on Artificial Intelligence; it was the first time that the term was ever used and it stuck. Since then Artificial Intelligence, or simply AI, has been used to denote a way for a machine to approach the solution of problems “similar” to how we believe intelligent creatures, like most of us humans, do. And I stress the term “believe”, because we don’t know for sure how we, or better our brains, solve such problems.

After we humans have reached the solution of a problem, let’s say proven a theorem or solved a murder mystery, we are often, but not always, able to consciously reverse engineer the solution process that we (believe we have) applied step by step, like a detective at the end of the movie; think of Monsieur Poirot or Columbo. Sometimes we think we solve problems by consciously applying rules and using reasoning and inference towards the solution of the problem. But other times we don’t (Malcolm Gladwell’s Blink supports that). And this is especially true for things for which AI never worked really well, or never proved to outperform other non-AIsh approaches.

Take chess, for instance: one of the reference problems of classical AI. IBM’s Deep Blue, the first computer to win a match against a reigning human world champion (Garry Kasparov, in 1997), did not use the elegant inference techniques that nostalgic AI aficionados would refer to as AI. Rather, Deep Blue used the “brute force” of a computational power able to evaluate 200 million positions per second, together with a database of 700,000 grandmaster games. Probably if you tried to rationalize after the fact, in the traditional AI style, why Deep Blue won, you would not find a “step-by-step” conscious process of inference. Most likely Deep Blue won because it was faster, had more memorized games readily available than its human opponent, and could go deeper in analyzing all the possible effects of a move. And I am sure that if you had asked Garry Kasparov, right after the game, why he lost, he would probably have tried to rationalize and given you a step-by-step explanation, but certainly not a set of rules that one could apply in an AI-like system to build a better chess player than him.

Language and speech interpretation is traditionally another field where AI, and I mean the traditional rule-based, knowledge-based AI, never proved to achieve better results than techniques with less glamorous names like Hidden Markov Models, N-gram statistics, or Finite State Transducers. Since the first speech recognition machine was built at Bell Laboratories in 1952, hundreds of researchers across the globe have tried to use traditional knowledge-based AI techniques to interpret the content of the voice signal. All the others tried less elegant statistical, data-driven approaches, and built systems that actually worked. Knowledge-based speech recognition never succeeded, and although a few serious scientists are trying today to revive it and marry it with the statistical approach (a long-term research topic), I have heard no one talking about it at any of the most prestigious international academic conferences in the field.

So … is AI gone? I don’t think so. AI, as the attempt to create inside a machine a deeper understanding of problems towards their solution, is not gone. On the contrary, there are many serious scientists relentlessly working on what in the 1980s would have been called an AI-sh type of solution. What is gone, at least we hope, is the popular belief that AI is a sort of magic wand which, by virtue of the intelligence hidden in its guts, can solve problems as humans do (do they?) and will provide better, faster, and cheaper solution development and maintenance. In the popular belief, corroborated by science fiction and superficial third-page stories, the term AI used to represent the panacea for all automation problems, mainly because of the word intelligence in the name. That created expectations that could never be met and caused what is known today as the AI winter. We hope we have finally grown out of this popular belief.

The problems with traditional, classical AI are many. Classical AI proved to be non-scalable, since knowledge had to be put into the system by hand (see the main criticisms of the ambitious Cyc project at http://en.wikipedia.org/wiki/Cyc). Statistical machine learning, instead, gains knowledge automatically, from data. Classical AI also traditionally escaped any type of meaningful benchmark comparison against other techniques. Science can be called so when it is driven by data and measurements. In the absence of that, what is left is anecdotal evidence: Yes it works! … but does it work better? How much better? Is it cheaper? How much cheaper? Measurements and common benchmarks are at the basis of today’s machine automation.

On top of the lack of scalability and the absence of measurable performance on common benchmarks (and I am still thinking of knowledge-based, rule-based, reasoning, inference-based, introspective, old AI), one of the main drawbacks that hindered, and still hinders, the penetration of AI philosophy into areas like speech applications is the fact that it traditionally trades procedural expressivity for built-in behavior. Let me be more explicit. One of the claims of the AI knowledge-based approach is that “you just express the knowledge, and the inference engine uses it in an intelligent way”, so building complex applications may seem less costly, at least on the surface. One may claim the same for using AI in spoken dialog applications. No coding, no call-flows: just write down the knowledge, and the rest will be done by the engine. That’s it! Unfortunately the behavior of intelligent inference engines, like old Prolog’s inference engine, is often not easy to grasp except for those who designed it. Training developers to use AI-like engines can be quite difficult, since today’s software developers are universally fluent in procedural or object-oriented programming and not in inference-based programming. This is especially so considering that, in situations where knowledge is vast and not always consistent (rules can contradict each other, they may need to be invoked in some temporal order, they may be incomplete, etc.), the behavior of inference engines may not be predictable. That goes against the VUI Completeness principle, which requires that all possible outcomes of a Voice User Interface be predictable before a spoken dialog application goes into production. And what about last-minute changes requested by the customer? For instance changing the order in which questions are asked, or changing a procedure according to the company’s best practices? With knowledge-based AI-sh systems a simple change like that can easily become a non-reusable fix or a development nightmare, because one has to bypass the built-in engine behavior with some ad-hoc procedure.

As a consequence of the above considerations, while serious and illustrious researchers have tried for decades to apply inference techniques to spoken dialog systems, and with considerable academic success (for instance Plan-based dialog at the University of Rochester, or Agenda-based dialog at CMU), the industry still holds on to procedural techniques such as the call-flow representation. Call-flow abstractions, because they are procedural, are easily and naturally grasped by VUI designers and developers, who can build sophisticated applications to solve customer problems. And after all, as anyone who has built real spoken dialog applications knows very well, dialog design and development is only one of the elements that determine the success of an application. Integration, platform robustness, speech accuracy, and a myriad of other little, and not so little, things need to be in place for a system to work, to be cost effective, and to provide a quality customer experience.

So what’s the future of complex spoken dialog applications? How will they evolve? Will they proceed following the path of traditional AI, with an intelligent engine in the background able to reason on a database of well structured knowledge? Well, the evidence is against that. Complex applications, so far, have not evolved towards the AI-sh inference way of solving problems. I doubt that the sophisticated Web sites that front complex applications and show some level of intelligence have AI-sh inference engines behind them. Yes, knowledge needs to be separated from its usage, but that’s a fundamental rule that every good procedural programmer learns early enough. Do you want to call it Artificial Intelligence? Or maybe we can call it model-view-controller (MVC) style? But without an intelligent engine … how do you handle complexity and cost of development? Software, and call-flows are software, found its own way to handle complexity with modularity, encapsulation, inheritance, polymorphism, and other programmer’s tricks. And that’s not AI.

Classical AI, of the inference-reasoning-knowledge-based variety, might come back at some point from its winter hibernation (we do hope so), and it may confront other approaches using comparison benchmarks in a scientific manner, and it may even win. But until then, we have to settle for the “unglamorous” technologies.

Posted by Roberto on Feb 26, 2008 8:47:21 AM

01/27/2008

The complexity ceiling

The tools we use determine the complexity we can handle. But tools, in software, are not just the traditional things-that-help-you-do-other-things, but also the abstractions you use. In that sense, the call-flow abstraction is a tool that allows you to build dialog systems up to a certain complexity. Dialog Modules and other abstractions imported from traditional software engineering help push the “complexity ceiling” of the call-flow higher and higher and enable building sophisticated 3rd generation spoken dialog applications.

“One man’s ceiling is another man’s floor” goes Paul Simon’s song. I would also say “One tool’s ceiling is another tool’s floor”. All the complexity that we can handle depends on the tools we have. We can build a cabin using logs, a hammer, nails, a hacksaw and a ladder, but as soon as we start adding rooms and floors these tools become quite ineffective, and we should start considering using metal joints and fasteners, power drills, and a crane. That’s what I call the complexity ceiling. A tool, or a set of tools, determines that complexity ceiling: the type of complexity you can handle, and the level of complexity above which you cannot go. Trying to go above that ceiling would be extremely hard without shifting to a new set of more sophisticated tools.

Software can be very complex. What’s a tool in the software industry? Tools are not restricted only to that category of software that we all explicitly call “tools”, like integrated development environments (IDEs), editors, or debuggers. Programming languages, models, and abstractions are tools as well; they are actually at the basis of the other more “tangible” tools, like the editors and the debuggers.

Let’s talk about spoken dialog systems; after all, that’s what this blog is about. The main abstraction (tool) used for commercial dialog systems today is the call-flow. How did we come up with the idea of the “call-flow”? Well … you can imagine that the first time someone with a penchant for programming started building a spoken dialog system (some of us old-timers were there…) he or she probably wrote the whole interaction in C (or maybe in C++). So we can imagine what a spoken dialog system looked like in those ancient times of the mid 1990s: several pages of nested “if-elseif-else” statements, and a few inevitable “goto’s.” After having done that a few times, since programmers are smart and lazy people, those pioneers realized that … actually … that “if-elseif-else” thingy gets in the way, unless you can comfortably read 25 nested conditional statements with goto’s here and there (I know a few who actually can). They also realized that every time they built a new system they were doing the same things over and over: perform an action (for instance play a prompt), evaluate a condition (for instance the return value from a speech recognizer) and, depending on that, select and execute one of a number of possible actions. Blink! It is a graph! Nodes are the actions, and arcs are the conditions! That’s much better than 25 nested “if-elseif-else” statements! And guess what? I can teach that to a VUI designer in no time! And by the way … they are already using it in those horrible touch-tone IVRs…
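For the programmers in the audience, the "it is a graph" insight can be sketched in a few lines of Python. This is only a toy under my own assumptions: the node names, prompts, and the `run_call_flow` function are invented for illustration, and the recognizer is replaced by a scripted list of results.

```python
# Each node is an action (here, playing a prompt); each outgoing arc is
# labelled with the condition (the recognizer's result) that selects the
# next node. A node with no arcs ends the call.
CALL_FLOW = {
    "ask_problem": {
        "prompt": "Is your problem with internet or TV?",
        "arcs": {"internet": "reset_modem", "tv": "reset_box"},
    },
    "reset_modem": {"prompt": "I am resetting your modem.", "arcs": {}},
    "reset_box": {"prompt": "I am resetting your cable box.", "arcs": {}},
    "transfer": {"prompt": "Let me connect you to an agent.", "arcs": {}},
}


def run_call_flow(flow, start, recognized_inputs):
    """Walk the graph and return the prompts played, in order."""
    inputs = iter(recognized_inputs)
    node = start
    played = []
    while True:
        played.append(flow[node]["prompt"])
        arcs = flow[node]["arcs"]
        if not arcs:
            return played
        result = next(inputs, None)  # None stands for "nothing recognized"
        # Any result without a matching arc falls through to a human agent.
        node = arcs.get(result, "transfer")
```

Calling `run_call_flow(CALL_FLOW, "ask_problem", ["internet"])` plays the opening question and then the modem reset prompt. Notice that there is not a single nested conditional in the data: the 25 levels of “if-elseif-else” have become entries in a table.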

That’s how the call-flow abstraction was imported into the spoken dialog world from the touch-tone IVR world. But the abstraction-tools did not stop there. A few years later someone else realized that even using the call-flow abstraction-tool they were doing the same things over and over again. Any time they had to collect a piece of information from a speech recognizer (at that time speech recognition systems were still making mistakes, unlike today … oh well …) they always had to re-prompt in case of low confidence or timeout, or confirm (the “I think you said…” way of talking) when the speech recognizer wasn’t so sure about the result. So some smart and lazy call-flow programmer thought of creating yet another abstraction: the Dialog Module. The Dialog Module (or DM) abstraction flourished in the late 1990s, and since then call-flows have been built with DMs. No longer did VUI designers and call-flow programmers have to specify the logic of every single collection; they could simply configure DMs with a number of prompts and a bunch of parameters (like the number of retries, whether they wanted to have confirmation, etc.). All of a sudden, using DMs, call-flows became less complex and more manageable, since they did not have to deal with all those minutiae of timeouts, confirmations, etc. Every time you needed to collect a single piece of information from a caller, rather than re-creating the whole logic, you simply had to put a DM there and configure it. DMs pushed the “complexity ceiling” higher by allowing developers to build more complex applications with the same effort as simpler applications without DMs. That was smart!
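Here is what a Dialog Module might look like if you sketched it in Python today. It is a simplified illustration, not any vendor’s actual DM: the `DialogModule` class, its parameter names, the threshold values, and the recognizer signature (a callable that takes a prompt and returns a text and a confidence, or `None` on timeout) are all assumptions of mine.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumed recognizer signature: prompt in, (text, confidence) out,
# with (None, 0.0) standing for a timeout.
Recognizer = Callable[[str], Tuple[Optional[str], float]]


@dataclass
class DialogModule:
    prompt: str
    max_retries: int = 2
    reject_below: float = 0.3   # below this confidence, re-prompt
    confirm_below: float = 0.7  # between the two thresholds, confirm

    def collect(self, recognize: Recognizer) -> Optional[str]:
        for _ in range(self.max_retries + 1):
            text, confidence = recognize(self.prompt)
            if text is None or confidence < self.reject_below:
                continue  # timeout or low confidence: ask again
            if confidence >= self.confirm_below:
                return text  # confident enough to accept as is
            answer, _ = recognize(f"I think you said {text}. Is that right?")
            if answer == "yes":
                return text
        return None  # give up; the call-flow decides what happens next
```

With this in place, collecting a modem brand is one line of configuration, `DialogModule(prompt="What is the brand of your modem?", max_retries=1)`, and all the re-prompt and confirmation minutiae live in one place instead of being redrawn in every call-flow.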

Abstractions like DMs are not new in software, au contraire! The whole history of software engineering is a succession of more and more sophisticated abstractions that enabled building more and more complex software. Call-flows and spoken dialog systems are today following the same path. Indeed we could not possibly build applications like troubleshooting and technical support, described by hundreds of pages of call-flows and thousands of DMs, without importing powerful abstraction-tools from the software world. At SpeechCycle we do use abstraction tools like inheritance, modularity, recursion and other powerful concoctions invented by software engineers, and we do have “tangible” tools that support them and allow a software-averse community, like that of VUI designers, to use the abstractions effectively and build the most sophisticated call-flows for the most complex spoken dialog applications today.

But that’s not all. Creating a call-flow, with logic, prompts and grammars, is just the beginning. Testing complex applications with thousands of DMs requires tools; managing hundreds of grammars and thousands of prompts requires tools; building data-driven statistical grammars (SLMs) with hundreds of semantic categories derived from hundreds of thousands of utterance samples requires tools; integrating call-flows with customer backends like CRM, databases, and diagnostic systems requires tools; analyzing deployed systems requires tools; reporting system performance requires tools. And even the right tools may fail to effectively deliver sophisticated solutions if there is not a sophisticated process (yet another tool) that orchestrates the whole design-development-delivery cycle.

Spoken dialog systems are complex beasts; spoken dialog systems for technical support—what we call 3rd generation, or Speech 3.0—are even more complex beasts. It’s not “just speech technology”, but there is much, much more complexity lurking behind. Taming this complexity requires a high degree of innovation, discipline, and experience. It is not just speech. It is speech, and all the rest!

Posted by Roberto on Jan 27, 2008 6:12:31 PM

12/09/2007

Do we need natural language?

In many cases the recognition of a few keywords is enough to build useful automated systems. However, there are applications that could not be automated without natural language understanding. This is especially true when the number of choices is large or there is a potential mismatch between the mental model of the caller and the system.

So, do we need natural language? If speech recognition is a tool--like a keyboard--and if we can build useful applications based on the recognition of a few words, why do we need sophisticated natural language understanding? Why don't we code all the possible meanings at each point of the interaction into a bunch of keywords and design prompts that clearly instruct the caller to speak one of them?

The reason why we need natural language is that it is not always possible to get away with keywords. Let me give some examples. If I ask you to tell me the toppings you want on your pizza, you can very well express what you want with a set of very predictable words such as mushrooms, ham, or pepperoni. Everyone, or almost everyone, knows what pizza toppings are. The "model" for pizza is so well and widely understood that we can build an effective directed dialog system that takes the caller through a set of very well understood choices: How many pies? Thin or thick crust? Which toppings? Do you want any beverages? And the same is true for other applications such as flight reservation, banking, and stock trading. A menu-based system, with well-defined choices at each point, can get you what you want in an effective way (by the way... ATMs are directed dialog machines).

But now think of a system that helps you troubleshoot your computer, and imagine a directed dialog that asks you to select the problem--or the symptom of the problem--you are experiencing. It could start by giving you a list of possible symptoms to choose from, but the list, most likely, would be so large that it would not be possible to speak it on the phone. The system could attempt to break the list into high-level categories, like hardware, software, and networking: please tell me if you are experiencing a hardware, software, or networking problem. Except for the computer savvy, very few people would know which category to select. I don't know what type of problem I have... that's why I called you!!!

In this situation, and many others, we cannot get away with a bunch of keywords. We cannot leave the burden of selecting what to choose to the caller because the caller does not know what to choose. In many situations, like for instance troubleshooting but also call routing in general, the caller may not share the system's mental model of the world, or may not have a mental model at all (how many people have a mental model of internet provisioning? And among those that do, how many have the correct one?). The solution consists in letting callers describe what they want in their own words, and letting the machine perform the mapping between what they say and a bunch of predefined categories. This is called Statistical Spoken Language Understanding, or SSLU, but people may refer to it in many other ways, such as SLM (Statistical Language Model), How May I Help You (HMIHY) technology, call steering, call routing, etc. But the concept is the same: perform an automatic mapping between all possible natural language expressions and a finite set of categories.
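As a toy illustration of that mapping, here is a minimal sketch that uses a bag-of-words naive Bayes classifier. This is just one simple statistical technique picked for the example, not a description of how any deployed SSLU system works; the category names and the handful of training utterances are invented, whereas a real system would be trained on a very large number of transcribed caller utterances.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Build word counts per category from (utterance, category) pairs."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    vocabulary = set()
    for utterance, category in samples:
        words = utterance.lower().split()
        word_counts[category].update(words)
        category_counts[category] += 1
        vocabulary.update(words)
    return word_counts, category_counts, vocabulary


def classify(model, utterance):
    """Return the most likely category (naive Bayes, add-one smoothing)."""
    word_counts, category_counts, vocabulary = model
    total = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, count in category_counts.items():
        score = math.log(count / total)  # prior of the category
        denom = sum(word_counts[category].values()) + len(vocabulary)
        for word in utterance.lower().split():
            if word in vocabulary:       # skip words never seen in training
                score += math.log((word_counts[category][word] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Invented examples and category names, for illustration only.
SAMPLES = [
    ("my internet is not working", "no_connection"),
    ("i cannot get online", "no_connection"),
    ("the connection is very slow", "slow_connection"),
    ("web pages take forever to load", "slow_connection"),
    ("i want to pay my bill", "billing"),
    ("there is a wrong charge on my bill", "billing"),
]

model = train(SAMPLES)
category = classify(model, "my bill has a wrong charge")
```

The caller never has to know that a "billing" category exists; the sentence is mapped to it from the words it shares with the training samples.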

Having said that, the design choice of using natural language in a speech recognition application is not an easy one. One has to consider a lot of factors and balance the delicate trade-offs between coverage and accuracy that are imposed by imperfect speech recognition technology. And in many situations the choice is not so obvious. But this is the subject of a future blog post.

Posted by Roberto on Dec 9, 2007 10:46:21 AM

11/25/2007

HAL's dreams versus useful tools

Speech recognition research has always aimed at building machines that can talk and understand speech as humans do. However, the realization of this dream is years away, and speech recognition technology is still severely limited. Yet current speech technology has enabled the automation of services effectively used by millions of people. Like other machines, ATMs for instance, speech recognition should not be regarded as a replacement for human beings, but rather as a tool that allows controlling a computer with voice. And like all tools, it requires a little learning for its users to be able to reap greater automation benefits.

HAL 9000, the 'almost' perfect computer of "2001: A Space Odyssey," with Douglas Rain's soothing voice and capable of personal feelings and autonomous decisions, permeated most of my youthful SciFi dreams and shaped my career as an adult. For more than 20 years I pursued the goal of natural language communication with machines, fascinated by the power of statistics, huge amounts of data, and automatic learning. Indeed, learning to talk as humans do has been the holy grail of human-machine spoken communication research for more than half a century.

The dream, HAL’s dream, has not faded, but we now understand that there is a marked difference between the dream of building a machine that talks and understands as humans do, and the vision of creating a useful tool. Let's use ATMs, the ubiquitous cash machines, as a reference. ATMs are not mechanical replications of bank tellers, and they never wanted to be. But we do not dismiss them just because they are not that "human."

We actually like ATMs. Why? Because they are fast, always available around every corner, they speak our language no matter which part of the world we are in, and never make mistakes--I have never ever received less or (alas) more cash than the amount that I have withdrawn from my account. And if there was a mistake, it was always "my" mistake, because either I punched the wrong key, or I did not understand what the machine asked me, or because I forgot I had already taken the cash and put it in my wallet just a second before, though I remained puzzled looking at the empty cash slot.

Yes, I do like talking to humans, but unless I have to perform an unusual transaction, I always choose an ATM over a human bank teller. ATMs do not fulfill HAL's dream; they are tools, not duplicates of humans. But we know how to use them and what to expect from them. And they make our life easier.

HAL's dream of building a human-like speaking machine has not faded; academic research is still pursuing it. It is an ambitious goal pursued by many brilliant scientists. But until we reach it--and we are not there yet--we do have to understand that a tool is a tool is a tool. And voice recognition technology, today, is a tool. Period!

In the mid 1990s AT&T automated their operator service by using a speech recognizer that could understand five, and only five, words: calling-card, collect, third-party, person-to-person, operator. Only five words... that’s far from HAL's dream, but so useful to AT&T customers, who rarely complained just because they wanted more "natural language," or because they wanted to be able to say "I am traveling in France, I do not have any money, I forgot my credit card at home, can you make a collect call to 555 111 1212?" rather than just "collect!" And so useful to AT&T, by allowing them to save hundreds of millions of dollars with only those 5 words!

So, what's speech technology today? It is a tool that allows the control of a remote computer using your voice. Why voice? Because in some situations that's the only way, or the most convenient method to control a machine or input data. Do we need natural language? We do sometimes, when 5 words are not enough or we cannot summarize all the possibilities with a small set of keywords.

Voice recognition technology is a tool, but it is our responsibility as users to learn how to use it. We do not go to ATMs and push keys without reason, without understanding what we are doing and without having read the instructions on the display. Probably we don't remember it, but there was a time when we learned how to use ATMs, just as there was a time when we learned how to use answering machines, the Web, and mp3 players. Now that we have learned how to use those tools, we are happy with them. ATMs can work only if users know how to use them. The same is true today for voice self-service technology.

Posted by Roberto on Nov 25, 2007 2:07:29 PM
Voice Recognition