From Artificial Intelligence to Smart Environment: on the Problem of Speech Recognition
V.I. Galunov (1), N.G.Kouznetsov (2), A.N. Soloviev (1)
(1) St. Petersburg State University, 11, Universitetskaya emb., St. Petersburg, Russia, 199034; AudiTech Ltd, St. Petersburg, Grazhdansky 22, 195220
(2) NLK Software Consulting, 2-511 Oakvale Drive, Waterloo, Canada
Recent developments in the speech recognition problem domain have been driven primarily by the needs of the market. Speech recognition engines have been based mainly on the mathematics-centered techniques of dynamic time warping (DTW) and hidden Markov models (HMM). The architectures of existing speech recognition engines have little in common with the architecture of human audition. Speech recognition technology has not fully utilized the knowledge amassed in the psychology of speech perception, and, conversely, the needs of speech recognition technology have not been adequately addressed by the psychology of speech perception.
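To make the contrast with human audition concrete, it may help to see how little "audition" there is in the mathematics of a classic engine. Below is a minimal sketch of the DTW distance between two feature sequences; the function name and the Euclidean local cost are our own illustrative choices, not a description of any particular product.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two feature sequences.

    x, y: 2-D arrays (frames x features), e.g. sequences of spectral vectors.
    Returns the cost of the cheapest monotonic frame alignment.
    """
    n, m = len(x), len(y)
    # D[i, j] = best cost of aligning the first i frames of x with the first j of y
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j],      # skip a frame of x
                                 D[i, j - 1],      # skip a frame of y
                                 D[i - 1, j - 1])  # match both frames
    return float(D[n, m])
```

An isolated-word recognizer of the classic kind compares an incoming utterance against stored templates and picks the word with the smallest such distance; nothing in the procedure resembles streams, schemas, or attention.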
One dimension of the sophisticated problem domain of Artificial Intelligence in general, and speech recognition in particular, is the current phenomenon known as the Smart Environment. We use this relatively new, undoubtedly nebulous, but well-established term somewhat facetiously here. The average consumer finds it alluring to buy and use smart phones, smart pens, smart cards, and other available smart gadgets that are thought to increase our productivity in the office and make our home environments more comfortable to live in. This phenomenon is important for scientists and technologists because it provides market feedback on current achievements in the field. On the one hand, it reminds us that Artificial Intelligence has recently fallen into disrepute in business circles. On the other hand, the achievements of Artificial Intelligence are salient enough that not only experts but laypeople as well can talk about the benefits of the technology and want to use it in their everyday lives.
Speech recognition has matured drastically in the 50 or so years since the first experiments on recognition of simple utterances were performed at various laboratories. Among them were Bell Laboratories, RCA Labs, University College in England, MIT Lincoln Labs, the Institute for Far Distance Communications (Leningrad, USSR), and the Institute for Information Transmission Problems of the Russian Academy of Sciences (Moscow, Russia).
The art of speech recognition is awaiting a thorough historical analysis; little information has been gathered and published so far on the subject. As always, it is not simple to establish who was the first to take an interest in the problem of automatic speech recognition, or whose research became a pivotal point in speech research. Professor L.L. Myasnikov of the University of Leningrad is often cited as the first Russian scientist to have performed research on speech recognition, at the end of the 1930s.
Speech recognition technology has not advanced to the point portrayed in the movie 2001: A Space Odyssey (1968), directed by Stanley Kubrick. The computer HAL 9000, brought to life in 1992 according to the film, was capable of maintaining quite sophisticated dialogues; its speech production capabilities, as well as its speech perception ones, are far superior to those of currently available systems. But the technology has now matured to the point where the high-tech market has gained internationally recognized leaders, among them IBM, Nuance, and SpeechWorks. Recently, Microsoft joined this group by bringing its speech server to the market; the server is supposed to substantially increase the productivity of call centers that provide 24x7 service.
Agencies specializing in strategic analysis (Gartner Inc., the Kelsey Group, Cahners In-Stat Group, Giga Information Group, and IDC) have offered various prognoses regarding the development of speech recognition and speech synthesis technologies. Analysts from Gartner Inc. have observed that "There are several signs that the speech recognition industry has been maturing. Many implementations provide proof that solutions that use speech recognition can deliver business value". Forecasts for worldwide voice service revenues by 2005 look very promising, ranging from $1.6 billion (Cahners In-Stat Group) to $3.5 billion (IDC) to $16.3 billion (the Kelsey Group). Giga Information Group projects that the speech recognition software market will grow from $100 million in 2000 to $2.5 billion by 2005, and Gartner predicts that the worldwide market for TTS software alone will grow to $6 billion by 2005.
A while ago, Artificial Intelligence as a scientific discipline disappointed tycoons and business magnates, including venture capitalists: it had promised a lot but delivered little of what had been promised. In spite of that, Bill Gates, the well-informed Chief Software Architect of Microsoft Corporation, has continued to speculate on the future of speech recognition and speech synthesis applications. He essentially sends the business world two messages regarding Artificial Intelligence. One is that the computer's processing power will be consumed mainly by speech recognition, speech synthesis, and handwriting recognition engines serving various applications. The other is that "AI is helping us create more natural user interfaces . . . we need future software to listen, see, reason and understand the user's context, intentions and goals". The latter is especially important, as many business people have recently referred to AI as something inappropriate to mention when talking about real business.
The laity has developed the impression that the art of speech recognition has reached its acme and that not much is left to be resolved. If an aspiring entrepreneur approaches a venture capitalist with an exciting idea for a speech-recognition-based computer application, (s)he will likely hear in response that "many other companies have already done this".
However, experts in speech technologies and entrepreneurs seeking new applications in the field are far from being satisfied with this attitude.
There are two paramount factors to mention: the end-user interface and the state of the art of speech technologies. A system works well when the technology matches the requirements of the end-user interface.
So far, speech recognition engines have been used mainly for the development of dictation systems.
Consumers want access to information anywhere, anytime, and in the most convenient way. Businesses have developed a strong desire to hire better-educated employees, and any business encourages its employees to become better informed so as to help the company's customers find optimal business solutions. Speech recognition, speaker recognition, and speech synthesis are the key technologies for creating a smart working and home environment: one that drastically increases productivity and provides friendly, easy-to-use, computer-driven services.
Speech Technologies: applications and perspectives
Speech technologies in general have a lot to offer consumers, today as well as in the future. We would like to present descriptions of a few systems, with the purpose of showing that the list of potential "cool" voice-enabled applications is almost endless. These applications are seemingly simple from an academic perspective; it is obvious, however, that none of these business ideas is simple from a business perspective. We have not performed any market research to gauge the potential market response to these innovations; it has not been our priority.
RecordedBookAssistant: Many books are recorded on CDs, and many listeners of those books are immigrants learning the language of the country where they live, for whom the recordings contain a lot of unfamiliar words. The system is expected to be able to respond to a simple set of commands spoken by the listener in his or her native language, such as "stop", "find the word ...", "give the translation of it", "continue...", and "replay the utterance again/several times". The key technology: speech recognition.
Telephone Number Detector: This system is expected to be able to analyze a message left on your phone and detect any telephone numbers and email addresses in it. This information can be copied to the phone book and address book associated with your phone, and retrieved later when you want to dial the number or send an email. The key technology: word spotting of telephone numbers and email information.
Minutes Taker: This system is expected to be able to take minutes during a corporate teleconference meeting where, possibly, several groups of participants are located in different offices of the company. The system can keep track of what has been said, who said it, and in what order. When the meeting is over, the system can email or print out the minutes for the participants. The key technologies: speech recognition and, possibly, voice recognition.
LiveLanguageDictionary: The system is expected to work as a search engine connected to the Internet and to selected TV and radio stations. Upon a user's request, the system searches for utterances containing occurrences of a specified word within a specified timeframe, say, a month. The information collected is stored in a database, and after the request has been performed the system's client can retrieve all of it. The key technology: word spotting.
Customizable Radio (request-based radio broadcasting): The system is expected to work as a search engine connected to the Internet and to TV and radio stations. It is capable of detecting the requested information and delivering it to the client, who can receive it in his or her car via a cell phone, or through a radio set connected to the cell phone. Say the request is "business news in the oil industry": the search engine connects to the websites of various newspapers, applies a speech synthesizer, and the information is delivered to the client's car through a radio set connected to the cell phone. The key technology: speech synthesis.
Navigation System: The system is supposed to be able to understand the customer's request and deliver simple instructions on how to get to the destination. The key technologies: meaning-text generation, speech synthesis, and speech recognition.
Much of this functionality has already been developed; see, for example, http://www.mapquest.com. What we have in mind is a speech interface to such a system.
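The word spotting that several of these systems rely on can be illustrated, at its crudest, as pattern search over a recognizer's text output. The transcript, the patterns, and the output formats below are purely our own illustrative assumptions:

```python
import re

# A hypothetical transcript, as a speech-to-text engine might emit it.
transcript = ("hi this is anna please call me back at 555 0142 "
              "or write to anna at example dot com thanks")

# Toy pattern for a run of 7-11 digits separated by spaces or hyphens.
phone_pattern = re.compile(r"(?:\d[\s-]?){7,11}")
# Toy pattern for a spoken e-mail address: "<name> at <domain> dot <tld>".
email_pattern = re.compile(r"(\w+) at (\w+) dot (\w+)")

phones = [m.group().strip() for m in phone_pattern.finditer(transcript)]
emails = ["{}@{}.{}".format(*m.groups()) for m in email_pattern.finditer(transcript)]
```

A deployed Telephone Number Detector would, of course, have to spot keywords in the audio itself rather than in a perfect transcript; the sketch only shows the extraction step that follows.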
Can these products be delivered now? And if not, then why?
Human Communication: Speech Perception, Speech Recognition and Speech Understanding
In 1990 Albert Bregman, professor of psychology at McGill University (Montreal, Canada), published a fundamental, thought-provoking book titled "Auditory Scene Analysis: The Perceptual Organization of Sound". The narrative, driven by incisive questions, provides an excellent theoretical analysis of the data collected in the field of auditory perception. Moreover, it provides a conceptual foundation for thinking and talking about various phenomena related to the perception of a human speaker's voice and speech.
Bregman presents and analyzes the key processes incorporated in the perception system. Among them are auditory streams; the integration and segregation processes that form auditory streams; the unit formation mechanism; auditory attention; and schemas, the knowledge-based counterparts of the primitive processes of stream formation.
Let us describe two observations made by one of the authors of this paper. Observation one: a Chinese speaker and a Russian speaker may speak English equally well, at least in the sense that a Canadian would judge their speaking and listening comprehension skills to be equally well developed; yet the Russian and the Chinese experience a lot of difficulty communicating with each other, especially before becoming accustomed to each other's speaking habits. Observation two: the more educated the listener is, the easier it is for her to understand a speaker for whom the language she speaks is not native.
Plausible explanations of both of these phenomena can be given in terms of schemas: the better one knows the language, the better developed the schemas are and the more schemas are involved in speech analysis, and thus the easier it is to understand others.
The conceptual tool provided by the book is interesting from several perspectives. One is highlighted by the question: can the material of the book be used by a software developer as a functional specifications document for designing and developing a computer-based artificial auditory perception system? This question turns out to be quite pertinent if we remember that the speech recognition / understanding problem domain is referred to as a branch of computer science. The second is highlighted by the question: can this tool be expanded to become suitable for talking and thinking about the human listener's skill of understanding speech? And the third is highlighted by the question: can an ontology of the speech recognition problem domain be developed that satisfies the expectations and tastes of the multi-disciplinary community of speech researchers?
From the perspective of a software developer preparing to create a computer-based auditory perception system, Prof. Bregman's book delivers a kind of functional specifications document describing the key aspects of human auditory perception. Such a developer will need, in accordance with modern software development methodologies, to create a comprehensive glossary containing definitions of the concepts used in the problem domain, and to design an architecture for the software product that somehow reflects the architecture of human audition. The glossary is crucial because the knowledge required to create such a system is scattered across many scientific disciplines: psychology, psychoacoustics, acoustics, physiology, linguistics, speech therapy, the engineering sciences, physics, mathematics, and computer science. Speech scientists contributing to the field are objectively interested in working out such a vocabulary in order to communicate their ideas, results, observations, hypotheses, theories, and concerns effectively. It is easy to find a specialist to solve a well-defined task within a narrow discipline; it is much more difficult to find a generalist who can see the whole "forest" of speech recognition for the "trees" of the sciences listed above.
Here is an example showing the importance of developing an ontology of speech recognition. There are three fundamental terms widely used by speech researchers: speech perception, speech recognition, and speech understanding. Sometimes speech perception and speech recognition are used interchangeably; sometimes they are used as if they had different meanings. The same applies to the other pairs, recognition / understanding and perception / understanding.
When we talk about understanding human speech, we realize that there are different levels of understanding a message. We could say, "I can repeat each sentence, but I do not understand all this." We could add, "I understand the motives of his message, but I do not understand the central point." The Bible delivers many examples in which it speaks of love without referring to the word itself, and we perfectly understand the main subject of the message. We can understand what a speaker wanted to express even if the grammar of the sentence is completely distorted and some words are incorrect: a Chinese speaker may say "How can I say it?", and you will hear "How can I see?" And, finally, can we ever make sure that we understand exactly what the speaker wanted to express? Another example is the pair speech recognition / voice recognition. The definitions of these and other terms of the speech recognition problem domain given by Merriam-Webster Online or, say, by the complete and unabridged Collins English Dictionary are simply not satisfactory from the speech scientist's perspective.
Not much is known about the architecture of human audition, and not many scientists work on its architectural issues. Can we somehow describe the phenomenon of understanding? Are the mechanisms of understanding visual scenes different from their auditory counterparts? How do the mechanisms of understanding a written text differ from the mechanisms of speech understanding? Can we talk about streams of understanding? Can the understanding of an auditory scene be accurately described by the mechanisms of primitive analysis and the learned schemas discussed by Prof. Bregman? A human listener seems to keep several things in focus simultaneously: the speaker's language, the speaker's voice, the speaker's emotions, the meaning of the message, and so on. The questions here are: How are attention processes organized to perform these tasks? Can we obtain experimental data showing how attention switches between the tasks, and what drives it to switch to a different task? On what basis is the decision made to switch to a certain task?
From the literature one can find that there are at least four subsystems comprising the architecture of the human auditory system, namely the acoustics, pragmatics, statistics, and linguistics subsystems. It is not clear, however, how these subsystems communicate with each other, what interfaces they exhibit to each other to exchange information, or which subsystem plays the role of the controller. The question of whether other subsystems should be included is still open: auditory attention, for instance, is not covered by any of the subsystems listed above. Can we devise experiments aimed at clarifying the interconnections between the subsystems? Audition also goes through certain stages in its development, and an unanswered question arises here: how should the architecture of an artificial perception system incorporate those developmental stages? Anyone who lives in a culturally diverse society experiences many perceptual challenges when listening to an immigrant. Talking to an immigrant, one realizes that many sounds of their speech are far from the norm, influenced by the speaker's native language; some sounds are omitted, and some sounds, even syllables, are added. The prosodic characteristics of immigrants' utterances are distorted; yet, from the communications perspective, they can present themselves quite eloquently. How does the human mind detect and process regularities? How does it collect statistics? If one subsystem experiences a lack of input, how do the other subsystems cooperate to perform the task?
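One way to begin probing such questions computationally is to prototype the four subsystems as components that jointly re-score candidate interpretations under an explicit controller. Everything below (the class names, the weights, the toy scoring rule) is our own hypothetical sketch, not a claim about how audition actually works:

```python
class Subsystem:
    """A stand-in for one of the four subsystems named in the literature."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def score(self, hypothesis):
        # Placeholder criterion: a real subsystem would apply acoustic
        # models, collected statistics, a grammar, or dialogue context.
        return self.weight if "speech" in hypothesis else 0.0


class Controller:
    """Combines subsystem opinions; whether such a controller even exists
    in human audition is precisely the open question raised in the text."""

    def __init__(self, subsystems):
        self.subsystems = subsystems

    def best_hypothesis(self, hypotheses):
        # Pick the candidate interpretation with the highest combined score.
        return max(hypotheses,
                   key=lambda h: sum(s.score(h) for s in self.subsystems))


subsystems = [Subsystem("acoustics", 1.0), Subsystem("statistics", 0.5),
              Subsystem("linguistics", 0.8), Subsystem("pragmatics", 0.3)]
best = Controller(subsystems).best_hypothesis(
    ["recognize speech", "wreck a nice beach"])
```

Even a toy like this forces the architectural questions into the open: who owns the control loop, what interface does each subsystem expose, and what happens when one subsystem receives degraded input?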
Some Problems of Speech Recognition
The capabilities of speech recognition systems have been overestimated, and the formidability of the speech recognition problem domain has been underestimated. Much work awaits researchers and technologists.
1. Sophisticated experiments are needed to gain an understanding of the architecture of human audition, along with a software environment in which to conduct those experiments and process the results.
2. All the technical tools needed to develop databases storing every utterance a child makes from her first days are available. These tools can be used to create an environment for research aimed at clarifying many aspects of the development of speech in children. It would help answer the question: "How do speech skills develop depending on what the child hears?"
3. How does a human listener's perception work with auditory regularities? How does it collect statistics on them, and what kind of statistics? Can this be somehow mimicked by a software application?
4. How do language recognition, voice recognition, and speech recognition processes cooperate to create an adequate description of the listener's ambience?
5. Language recognition. A human listener can relatively easily recognize the language a speaker is using.
What happens in the human perception system when the language is being analyzed? A human listener can recognize the language spoken even without knowing it, with no knowledge of its syntax or vocabulary. Can this somehow be mimicked by a computer system? Many telephone companies, if not all, would be interested in such a system if it existed.
6. Voice (speaker) recognition by a human listener. Little relevant research has been done. One of the most important things to describe is the "architecture" of the part of the perception system that is in charge of it. In our opinion, promising areas of research here are voice recognition under cocktail-party conditions and voice recognition in children. Here is an interesting observation: if you listen to two speakers simultaneously, one speaking your native language and the other a foreign one, you will hear the native speech much better even if the other speaker speaks louder.
7. The "general architecture" of the speech recognition system in the human listener. Some researchers use the expressions "speech perception" and "speech recognition" as synonyms; some see significant distinctions between them. When we talk about humans, do we talk about perception, and when we talk about computer systems, about recognition? Probably one of the most interesting areas of research is the cocktail party effect together with knowledge-based stream formation mechanisms; this research would help us understand how schemas can be modeled and how they interact. Foreign speakers, as well as talking parrots, can speak a language in such a way that native speakers will understand them, even though every sound of their speech is either omitted or pronounced improperly, or additional sounds are inserted. The native speakers of the language understand them even though the phonetic information is completely distorted.
8. How does a speaker's voice differ when the speaker switches to a different language? Can a computer system detect that sentences spoken in different languages belong to the same speaker?
9. Everyone who has ever spent time analyzing oscillograms and spectrograms of the speech signal knows about its variability. A list of the most important constancies and their acoustic correlates remains to be compiled.
10. One exciting idea is the development of a computer-based system modeling the language acquisition skills of a child.
Psychologists, beginning with Aristotle, have contributed immensely to the psychology of perception. Nevertheless, a plethora of intractable problems remains unsolved, awaiting the attention of new generations of students of speech perception.
References
 Bregman A.S. Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, 1990.
 Cherry E.C. Some experiments on recognition of speech with one and with two ears. Journal of Acoustical Society of America, 25, 975-979, 1953
 Galunov, V.I. Zagoruyko, N.G., Lobanov, B.M. On the ontology of the Speech Recognition problem domain. Specom'2004, September 20-23, 2004, St Petersburg, Russia
 Galunov, V.I., Soloviev A.N., Uvarov V.K. Models of Speech Perception, Speech Production and Problem of Automatic Speech Recognition Specom'2004, September 20-23, 2004, St Petersburg, Russia
 Handel, S. (1989). Listening: An Introduction to the Perception of Auditory Events. Cambridge MA: MIT.
 Harnad, S. (1990). Categorical Perception. Cambridge University Press
 Hazan V., Markham D. Do adults and children find the same voices intelligible? ISCA Workshop on Temporal Integration in the Perception of Speech, P3-19
 Huang X., Acero A., Hon H.-W. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Pearson Education, 2001. ISBN: 0130226165
 Jelinek F. Statistical Methods for Speech Recognition (Language, Speech, and Communication), Bradford Books, 1998 ISBN: 0262100665
 Jusczyk, P. W. (1997). The Discovery of Spoken Language. Cambridge, MA: The MIT Press.
 Klevans R.L., Rodman R.D. Voice Recognition. Artech House Publishers, 1997. ISBN: 0890069271
 Kosarev Y., Ronzhin A.,Karpov A., Lee I. Approaches to Creation of Situational Databases for Integral Speech Understanding Models. Specom'2003, Moscow, Russia, October 27-29, 2003
 Liberman, Alvin M. (1996). Speech: A Special Code. Cambridge, MA: The MIT Press.
 Lienard J.-S. Speech and Voice Perception: Beyond Pattern Recognition. In Speech Processing, Recognition and Artificial Neural Networks, pages 85-112. Springer Verlag, 1999.
 Markham D., Hazan V.. Speaker intelligibility of adults and children. Proceedings of International Conference for Spoken Language Processing, Denver, 16-20 September 2002, 1685-1688
 Minsky, M. The Society of Mind. New York: Simon and Schuster, 1986.
 O'Shaughnessy D. Speech Communications: Human and Machine. Wiley-IEEE Press, 1999. ISBN: 0780334493
 Rabiner L., Juang B.-H. Fundamentals of Speech Recognition. Pearson Education POD, 1993. ISBN: 0130151572
 Rosenthal D.F., Okuno H.G. Computational Auditory Scene Analysis. Lea, 1998. ISBN: 0805822836
 Ryalls, J. 1996: A basic introduction to speech perception. San Diego: Singular.
 Waibel A., Lee K.-F. Readings in Speech Recognition. Morgan Kaufmann, 1990. ISBN: 155860144
 Belin P, Zatorre, R.J., Lafaille, P., Ahad, P. and Pike, B. (2000) Voice-selective areas in human auditory cortex. Nature, 403, 309-312.