Speech technologies and speech science
At present our societies invest a huge amount of money, know-how and research effort into solving the problems of speech recognition and synthesis by machine. This work is stimulated by practical requirements, and research is concentrated on optimal solutions in the domain of speech technology systems.
From the analytic point of view it makes sense to divide "current speech technologies" into three parts. The first is speech science: all knowledge about the speech signal, speech formation and speech perception. We may also include here knowledge about models of the speech signal and methods of speech signal processing. The next part is speech technologies themselves, i.e. the achieved level of hardware and software solutions for speech signal processing that can be used for solving practical (applied) problems. The last part is the applied systems mentioned above, i.e. systems for real usage. These three points are bound into the following chain: speech science -> speech technologies -> applied systems.
It makes sense to begin from the last point. It can be divided into three subclasses:
1. Military systems and systems for other special purposes. Usually the cost of such systems does not matter, and their functions are precisely defined and restricted.
2. Commercial systems. The cost of such a system should be justified by the direct or indirect profit that can be obtained by using the system, or by other benefits.
3."Demonstration" systems. It is special genre. They mimic real applied systems, but they don't imply getting any profit. Their only goal is to manifest level of company achievements. They could be sold, but not for practical usage but for "test", for determining what could be made on their base.
It should be mentioned that speech systems are not independent. They are always embedded into a certain meta-system, which defines the conditions of using speech.
For this reason it is difficult to calculate the efficiency of applied systems. It cannot be done via an error percentage, since:
1. The cost of an error may differ.
2. The notion of "error" may disappear altogether.
Let us discuss the link "speech technologies - applied systems". In our view, a new period in the use of speech systems is starting now. The paradigm has changed. Instead of the passive use of speech systems in isolation (recognition apart, synthesis apart, identification apart), the problem of interactive speech communication between man and machine has come to the front edge. The previous period of using speech systems can be called unsuccessful. The active growth that was forecast 10-15 years ago obviously did not happen. What is the cause? The main cause is the absence of clearly formulated conditions for applied problems. It is possible to solve any technological problem, e.g. recognition of several thousand words, but the field of application for the problem is not clear. The next reason, perhaps less visible, is the attempt to look for solutions "under the streetlight", i.e. solving problems for the PC. But the PC was adapted for visual/manual interaction, and therefore attempts to force speech into this system were destined to fail from the start.
Of course, there were attempts to solve some practical problems of a local type, where requirements are well defined, e.g. voice dialing. The difference between this problem and the classical one is clear: there is no large vocabulary, but there are hindrances and there is the "naive user". Hence it is necessary to solve problems different from those for the PC.
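The kind of small-vocabulary matching needed for such local problems is often illustrated with dynamic time warping (DTW) against stored templates. The following sketch is illustrative only: features are one-dimensional per-frame values rather than real spectral vectors, and the sequences are invented.

```python
# Minimal dynamic time warping (DTW) distance between two feature
# sequences -- a sketch of template matching for small-vocabulary tasks
# such as voice dialing. Sequences are lists of per-frame features.

def dtw_distance(a, b):
    """Return the cumulative DTW alignment cost between sequences a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost to align a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# A stored template and two "utterances": the time-stretched version of the
# same word matches better than a different word, despite tempo variation.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
same_word = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
other_word = [3.0, 1.0, 3.0, 1.0, 3.0]
print(dtw_distance(template, same_word) < dtw_distance(template, other_word))
```

The tolerance to tempo variation is exactly what the "naive user" requires: the same word spoken more slowly still matches its template.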
We suppose that speech technologies are developing in two main directions:
1. Interactive telecommunication services.
A possible third problem is speech translation.
More or less clear problems of a local type are:
1. Identification and verification of the speaker (in particular for telecommunication services).
2. Control of the psycho-physiological state (stress, alcohol, etc.).
3. Compression (standards such as MELP-2400, probably 500-600 bit/s later).
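The gain offered by coders such as MELP-2400 can be worked out against uncompressed telephone speech. The sketch below assumes an 8 kHz, 16-bit linear PCM baseline, which is an illustrative choice, not stated in the text.

```python
# Compression ratio of MELP-2400 relative to uncompressed telephone speech.
# The 8 kHz, 16-bit linear PCM baseline is an illustrative assumption.

pcm_bps = 8000 * 16          # 128,000 bit/s uncompressed
melp_bps = 2400              # MELP-2400 standard rate
low_bps = 600                # very-low-rate coder mentioned above

print(pcm_bps / melp_bps)    # roughly 53x compression
print(pcm_bps / low_bps)     # roughly 213x compression
```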
Let us now discuss the interaction of speech science and speech technologies. It is obvious that, for natural reasons, speech technologies are more inertial than speech science. In a situation where interactive cooperation of machine and human becomes the main problem, the increasing gap between the models commonly accepted in speech technologies and the models of human speech behaviour (which are the subject of speech science) becomes dangerous.
Systems of speech recognition and synthesis are, in general, based on mathematical theories of signal processing and not on knowledge of the nature of human speech processes. Most contemporary systems are far from biological, neural and psychophysical systems. Their designers are not inclined to seek resemblance between the functioning of their systems and that of human beings.
Speech technology reached its major breakthrough in speech recognition via the application of dynamic programming and hidden Markov models, and in speech synthesis via the use of large basic units such as diphones and allophones. Further success has been stimulated by the increasing computing power of the hardware being used.
First, it should be pointed out that the method based on Markov models, which is used most actively and successfully in automatic speech recognition, is statistical in nature and obviously does not reflect the mechanisms of human speech behaviour. One could argue that it nevertheless solves the problem. But it is probable that at a certain level it will become infeasible, however large the models grow. One place where it does not work can be pointed out already.
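The statistical machinery referred to here can be illustrated with a minimal Viterbi decoder over a toy hidden Markov model. All states and probabilities below are illustrative, not taken from any real recogniser.

```python
import math

# Minimal Viterbi decoding for a toy two-state HMM ("silence" vs "speech"
# frames, observations quantised to "low"/"high" energy). All numbers
# are invented for illustration.

states = ["sil", "speech"]
start = {"sil": 0.8, "speech": 0.2}
trans = {"sil": {"sil": 0.7, "speech": 0.3},
         "speech": {"sil": 0.2, "speech": 0.8}}
emit = {"sil": {"low": 0.9, "high": 0.1},
        "speech": {"low": 0.2, "high": 0.8}}

def viterbi(obs):
    """Return the most likely state sequence for the observations."""
    # Log probabilities avoid underflow on long observation sequences.
    v = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        nv, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            nv[s] = v[prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        v, back = nv, back + [ptr]
    # Trace back from the best final state.
    best = max(states, key=lambda s: v[s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["low", "low", "high", "high", "low"]))
```

Note that the decoder maximises model likelihood and nothing else: this is exactly the purely statistical character, detached from human speech mechanisms, that the text criticises.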
From a theoretical point of view, on the other hand, it is more interesting to penetrate into the functioning of human processes. In this respect, closeness to reality is crucial for the evaluation of different models of speech recognition and synthesis, and of speech understanding and production. Imitating models are valid insofar as they explain the nature of human behaviour.
Mathematical models of speech information processing have already shown their limitations: they do not allow solving problems that are complicated but natural for a human, such as fluent speech recognition with large vocabularies and without adjustment to a speaker.
On the other hand, our knowledge of speech perception, the discrimination of tokens and the programming of speech production, which could help in the optimisation of technical speech systems, is rather limited. Moreover, it may be partly erroneous, and we hardly know where the errors are!
Scientific research aimed at the study of speech behaviour often fails to formulate any model representations and is thus unable to make use of the opportunities of computer-aided verification of its concepts, which in turn complicates its interaction with the designers of speech technologies.
For this reason, over the last ten years theoretical studies of speech and speech technology have taken different directions. It is unfortunate that the latest successes of speech technology have added too little to our knowledge of the processes of human speech communication.
We believe that knowledge about the human being might be helpful for the improvement of automatic understanding of human speech and the enhancement of synthesis-by-rule systems, taking into account the following arguments:
- At present, basic knowledge and the results of research on human speaking behaviour are considered an important resource for the progress of automatic speech recognition, understanding of spoken language and synthesis by rule;
- The users who order a speech technology system usually formulate their requirements only in terms of speech communicative behaviour;
- It is necessary to take into account the fact that the result of a project in speech technology is in any case oriented towards the involvement of a human user; thus the knowledge of his communicative behaviour should be incorporated into the automatic device.
There have been multiple attempts to use knowledge about the properties of the peripheral auditory system inside ASR. These attempts were based on the assumption that using the auditory system's method of speech representation should substantially improve ASR performance.
Unfortunately, these attempts were unsuccessful. The result was that, instead of working «incredibly better», some systems worked significantly worse. This failure probably originates not only in the difference between the properties of the auditory models used and the real auditory system. There could be the following causes of the failure:
- the benefits of the peripheral auditory description can only appear within the framework of a comprehensive model that also includes central processing units; ASR systems whose principles of analysis differ from these are inadequate for the problem;
- the peripheral analyser is not the best tool for speech analysis, because it was formed for solving other problems before speech appeared, and the benefits of auditory reception originate mainly in the central processing units, which also compensate for certain imperfections of the peripheral analysis.
Actually, both these hypotheses need to be tested using models of central processing, but so far there is neither such a model nor a clear understanding of the character of the transformations inside the central units. We should mention that experimental data on speech perception allow the conclusion that the mechanism of speech perception differs from that for other sounds. This mechanism has a higher priority than the others, i.e. it usually switches itself on before them.
Since actual perception takes place surrounded by multiple hindrances, robustness should be one of the most important properties of a speech perception system, as it should be for any perceptive system. Many mechanisms supporting robustness were formed in the course of evolution. It could be said that the direction of development of the auditory system was first of all determined by the problem of robustness. Most of these mechanisms had been formed for detecting and locating sound before speech communication arose, but they are successfully used for speech perception as well.
A certain role in speech extraction is played by the binaural effect (the interaction of the left and right channels of the auditory system), which decreases the detection threshold (by up to 15 dB) and increases speech intelligibility (by up to 6 dB).
Short-term adaptation is also a usual feature of auditory system elements. It can be observed as a decrease of the reaction during the first 50-100 ms of the stimulus action. This favours emphasising the front of the signal and suppressing the reaction inside a break between signals.
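A minimal computational sketch of this behaviour, assuming a first-order leaky integrator with a 50 ms time constant (both the time constant and the sampling rate are illustrative choices, not physiological measurements), could look as follows:

```python
import math

# Toy model of short-term auditory adaptation: the channel's output is
# the input minus a running (leaky) average with a ~50 ms time constant,
# so signal onsets are emphasised and the steady-state response decays.

fs = 1000                      # 1 kHz "envelope" sampling rate (assumption)
tau = 0.050                    # 50 ms adaptation time constant (assumption)
alpha = math.exp(-1.0 / (fs * tau))

def adapt(signal):
    """Return the adapted response: input minus its leaky average."""
    avg, out = 0.0, []
    for x in signal:
        out.append(x - avg)                  # onset strong, plateau suppressed
        avg = alpha * avg + (1 - alpha) * x  # leaky integrator tracks input
    return out

# A 200 ms step of constant amplitude: the response is maximal at the
# onset and decays towards zero within a few time constants.
step = [1.0] * 200
resp = adapt(step)
print(resp[0], resp[-1])
```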
In our view, the multichannel structure of the sound analyser, based on division into channels according to frequency bands, is of principal value for the robustness of perception. The spatial organisation of neurons corresponding to the distribution of resonant frequencies of the basilar membrane characterises all levels of the auditory system. It is not only a way of coding information about the frequency of a signal; first of all it serves as a base for extracting spectrally local features which are reflected in certain frequency bands. Since there is a large number of channels containing elements with different properties (thresholds and types of reaction, time constants, characteristic frequencies, dynamic and frequency ranges of reaction, etc.), this provides a detailed representation of the signal inside the auditory system.
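This multichannel, band-wise decomposition is what technical front ends approximate with a bank of bandpass filters. The sketch below applies triangular filter weights to a toy magnitude spectrum; the number of channels and band edges are illustrative, not physiological values.

```python
# Sketch of a multichannel analyser: a bank of triangular bandpass
# weights over a magnitude spectrum, each channel responding only to
# its own frequency band.

def make_filterbank(n_bins, n_channels):
    """Triangular filters with linearly spaced centres over n_bins."""
    centers = [int((i + 1) * n_bins / (n_channels + 1)) for i in range(n_channels)]
    edges = [0] + centers + [n_bins - 1]
    bank = []
    for ch in range(n_channels):
        lo, c, hi = edges[ch], edges[ch + 1], edges[ch + 2]
        weights = []
        for b in range(n_bins):
            if lo <= b <= c:
                weights.append((b - lo) / max(c - lo, 1))   # rising slope
            elif c < b <= hi:
                weights.append((hi - b) / max(hi - c, 1))   # falling slope
            else:
                weights.append(0.0)                          # outside the band
        bank.append(weights)
    return bank

def channel_outputs(spectrum, bank):
    """Each channel's output: weighted sum of spectrum magnitudes."""
    return [sum(w * m for w, m in zip(f, spectrum)) for f in bank]

# Toy magnitude spectrum with a single spectral peak: only the channel
# whose band contains the peak responds strongly.
spectrum = [0.0] * 64
spectrum[16] = 1.0
bank = make_filterbank(64, 7)
outs = channel_outputs(spectrum, bank)
print([round(o, 2) for o in outs])
```

Because each feature is spectrally local, masking noise in one band corrupts only the channels covering that band, leaving the others intact.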
The existence of elements with different properties in each channel facilitates the extraction of different properties of the stimulus. Thus, the existence of rapidly and slowly adapting elements makes it possible to extract stable and non-stable segments of the signal. All this allows recognising a signal while certain features are masked (and some mechanism of processing has failed), using other features which are tolerant to the masking. For example, a change of the fundamental frequency can be detected by estimating the first harmonic or other, stronger harmonics, or via a change of the mean spectral pitch.
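The redundancy just described can be illustrated with fundamental-frequency estimation by autocorrelation, which recovers the period from the higher harmonics even when the first harmonic itself is masked out. The signal parameters below are illustrative.

```python
import math

# Fundamental-frequency estimation by autocorrelation: the period is
# recoverable from the higher harmonics even when the first harmonic
# is absent ("missing fundamental"). Parameters are illustrative.

fs = 8000          # sampling rate, Hz (assumption)
f0 = 200.0         # true fundamental, Hz (assumption)

# Harmonic signal *without* its fundamental: only the 2nd-4th harmonics.
n = 800
x = [sum(math.sin(2 * math.pi * k * f0 * t / fs) for k in (2, 3, 4))
     for t in range(n)]

def estimate_f0(signal, fs, fmin=80.0, fmax=400.0):
    """Pick the lag with maximum autocorrelation in the search range."""
    lo, hi = int(fs / fmax), int(fs / fmin)

    def ac(lag):
        return sum(signal[t] * signal[t + lag]
                   for t in range(len(signal) - lag))

    best = max(range(lo, hi + 1), key=ac)
    return fs / best

print(estimate_f0(x, fs))   # close to 200 Hz despite the missing fundamental
```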
Analysis of contemporary data makes it possible to suppose that speech processing inside the left hemisphere of the brain is mainly serial, i.e. recognition of meaning is preceded by extracting temporal components of the signal corresponding to linguistic units such as phonemes and syllables, determining their characteristics and identifying the components. The right hemisphere uses a mainly integral way of processing, in which the incoming signal is compared as a whole with stored acoustic templates of words.
The role of the right hemisphere is much more important for recognising a speech signal among multiple hindrances. This can probably be explained by:
- first of all, the increasing role of probabilistic forecasting;
- next, the fact that the features on which the whole-word recognition method is based (prosody, rhythm) are the most robust.
Moreover, noise which complicates signal processing increases the load on the involved elements of the analyser, and hence they tire more quickly. It is natural that a second parallel channel (hemisphere) which takes over a part of the job increases the robustness of the system.