AUTOMATIC SPEECH RECOGNITION SYSTEMS
WITH DIFFERENT MODELS OF DIALOG PRODUCTION
Universitetskaya nab. 9/11
Grazdanskiy pr. 22
Tel/fax: 7 (812) 535-95-86
The paper describes two speech recognition systems: one based on the recognition of keywords, and one that implements mainly the phatic, emotional and, to some extent, appellative functions of verbal communication and is intended to maintain the act of communication.
Most artificial systems of verbal communication rely on the transmission of content, so only the informational function is implemented. It is well known, however, that everyday conversation also relies on other, very important functions, in which mechanisms of code transmission (checking and correction) or of qualitative information transmission dominate. It is therefore useful to consider building systems that implement not only the informational function of a communicative act (message transmission as such) but also some of the other functions.
This paper describes two such speech recognition systems: the first is based on the recognition of keywords, and the second implements mainly the phatic, emotional and, to some extent, appellative functions of verbal communication and is intended to maintain the act of communication.
In the keyword-based approach to speech recognition, the main idea is to determine the sense of an utterance by means of so-called keywords from a relatively small dictionary, without trying to restore its linguistic content exhaustively.
Systems that implement this approach have the following characteristics:
* a finite alphabet of speech messages exists,
* recognition is reduced to the selection of a canonical sentence, and the branching coefficient does not exceed 200,
* recognition is based on a restricted set of keywords.
As a demonstration model, a spoken phrasebook was chosen for the situation of communication between a hotel receptionist and a client.
It works with three languages: Russian, English and German. About 90 canonical sentences exist for each of the two roles. The dictionaries include about 350 words in Russian and about 200 words in English and in German.
Speech recognition is based on the idea of keywords.
The speech recognition subsystem uses a wideband channel with a sample rate of 16 kHz and 16-bit quantization.
The subsystem of recognition consists of the following modules (stages of processing):
* segmentation into words
* vector quantization
* comparison with patterns – decision rule
The word boundary search works with utterances spoken with short pauses (about 200 ms) between words. The main difficulties in this task are caused by the variability of utterances, the diversity of the dictionary, the presence of voiceless stops (pauses within words), and the influence of non-stationary noise.
The procedure consists of the following stages: 3-channel filtering, then finding the beginning and the end of each word using levels determined relative to the background noise (which is estimated from the segments of the signal with the lowest amplitude).
This procedure yields satisfactory results under the following conditions:
* sample rate is 16 kHz,
* pauses at the beginning of an utterance and between words are at least 200 ms long,
* the duration of a word exceeds 40 ms,
* the duration of voiceless stops within a word is less than 190 ms,
* the signal-to-noise ratio is at least 15 dB.
The input signal is passed through a filter with a band of 80 – 8000 Hz. Under the listed conditions, word boundaries are detected with an average accuracy of 50-80 samples (3-5 ms).
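The level-based boundary search can be illustrated with a minimal sketch. It is not the authors' implementation: the 3-channel filtering is omitted, and the function names, the 10 ms frame, the "quietest 10% of frames" noise estimate and the threshold factor are all assumptions made for illustration. Only the 200 ms pause and 40 ms minimum word duration come from the conditions listed above.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_words(samples, rate=16000, frame_ms=10, min_pause_ms=200,
               min_word_ms=40, threshold_factor=4.0):
    """Return (start_sample, end_sample) pairs, end exclusive, for detected words.

    Assumption: the noise floor is the mean energy of the quietest 10% of
    frames, and a frame is speech if its energy exceeds threshold_factor
    times that floor. Silent gaps shorter than min_pause_ms (e.g. voiceless
    stops) stay inside the word.
    """
    frame_len = rate * frame_ms // 1000
    energies = frame_energies(samples, frame_len)
    if not energies:
        return []
    quietest = sorted(energies)[:max(1, len(energies) // 10)]
    noise = sum(quietest) / len(quietest)
    threshold = threshold_factor * max(noise, 1e-12)
    active = [e > threshold for e in energies]

    min_pause = min_pause_ms // frame_ms
    min_word = min_word_ms // frame_ms
    words, start, silence = [], None, 0
    for i, is_speech in enumerate(active):
        if is_speech:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_pause:
                end = i - silence + 1  # index of the first silent frame
                if end - start >= min_word:
                    words.append((start * frame_len, end * frame_len))
                start, silence = None, 0
    if start is not None:  # utterance ended without a full closing pause
        end = len(active) - silence
        if end - start >= min_word:
            words.append((start * frame_len, end * frame_len))
    return words
```

With frame-level decisions the boundaries are only accurate to one frame (160 samples at 16 kHz); reaching the 50-80 sample accuracy quoted above would need a finer refinement step that this sketch does not attempt.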
The primary parameter vectors are computed on segments 10 ms long. In particular, the following parameters are taken:
* the envelope, normalized to the F0 energy or to the maximum energy in the spectrum,
* the first derivative of the envelope,
* 16 frequency bands.
Vector quantisation (VQ) makes the results of speaker-independent speech recognition about twice as good. VQ uses the standard k-means algorithm with one modification: the computation is performed in several iterations, and in each iteration the coefficients of the metric are recalculated so that the resulting clusters move closer to the principal components. As a result, 120 clusters are produced.
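The overall shape of such an iterated k-means can be sketched as follows. The text does not give the rule by which the metric coefficients are recalculated, so the sketch substitutes a stand-in: per-dimension weights set to the inverse within-cluster variance. That rule, the weighted squared Euclidean distance, and all names below are assumptions and should not be read as the method actually used.

```python
def weighted_sq_dist(a, b, w):
    """Weighted squared Euclidean distance (assumed metric form)."""
    return sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b))


def kmeans_reweighted(vectors, init_centroids, rounds=3, inner=10):
    """k-means with the metric weights recomputed after each outer round.

    Returns (centroids, labels, weights).
    """
    dim = len(vectors[0])
    k = len(init_centroids)
    centroids = [list(c) for c in init_centroids]
    weights = [1.0] * dim
    labels = [0] * len(vectors)
    for _ in range(rounds):
        for _ in range(inner):
            # Assignment step under the current metric.
            labels = [
                min(range(k),
                    key=lambda c: weighted_sq_dist(v, centroids[c], weights))
                for v in vectors
            ]
            # Update step: each centroid becomes the mean of its members.
            for c in range(k):
                members = [v for v, lab in zip(vectors, labels) if lab == c]
                if members:
                    centroids[c] = [sum(col) / len(members)
                                    for col in zip(*members)]
        # Stand-in re-weighting: inverse within-cluster variance per
        # dimension, rescaled so that the weights sum to dim.
        var = [
            sum((v[d] - centroids[lab][d]) ** 2
                for v, lab in zip(vectors, labels)) / len(vectors)
            for d in range(dim)
        ]
        raw = [1.0 / (s + 1e-9) for s in var]
        weights = [dim * r / sum(raw) for r in raw]
    return centroids, labels, weights
```

In the system described, `init_centroids` would hold 120 entries and `vectors` the 10 ms parameter vectors; the returned labels form the secondary description used in word recognition.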
Word recognition estimates the probability that a word spoken by the user corresponds to a certain word from the given word list and selects the best candidate. The word being recognized is represented in terms of a secondary description, i.e. after vector quantisation, as a sequence of cluster labels. The patterns from the training database are represented in the same manner. Because the number of utterances of each word in the training database was relatively small (15-40), statistically demanding procedures such as Markov chains were inapplicable.
The word being recognized is compared to each pattern from the training set by means of the dynamic time warping algorithm; the distance between clusters is given by a city-block metric computed beforehand at the vector quantisation stage.
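A minimal sketch of this comparison: a table of city-block distances between cluster centroids is computed once, and dynamic time warping is then run over label sequences with table lookups as the local cost. The function names and the length normalisation in `nearest_pattern` are illustrative assumptions, not details taken from the described system.

```python
def city_block(a, b):
    """City-block (L1) distance between two centroid vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_distance_table(centroids):
    """Distances between all pairs of clusters, computed once after VQ."""
    return [[city_block(a, b) for b in centroids] for a in centroids]


def dtw(seq_a, seq_b, table):
    """Classic DTW cost between two sequences of cluster labels."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = table[seq_a[i - 1]][seq_b[j - 1]]
            cost[i][j] = local + min(cost[i - 1][j],      # stretch seq_b
                                     cost[i][j - 1],      # stretch seq_a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def nearest_pattern(word, patterns, table):
    """Name of the closest pattern; patterns is a list of (name, labels).

    Dividing by the summed lengths is an assumed normalisation that keeps
    long patterns from being penalised merely for their length.
    """
    return min(
        patterns,
        key=lambda p: dtw(word, p[1], table) / (len(word) + len(p[1])),
    )[0]
```

For example, with three clusters whose centroids are `(0, 0)`, `(1, 0)` and `(5, 5)`, the label sequence `[0, 0, 1, 2]` aligns with the pattern `[0, 1, 2]` at zero cost, since the repeated label is absorbed by the warping.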
Patterns that were heavily confused with other words, as well as patterns whose length differed much from the average length of the word, were removed from the training set. In addition, a “cluster probability” was computed for each word, i.e. the probability of each state being found in the given word. This probability underlies one of the proximity measures used in the speech recognition module.
The result of recognition is a matrix of probabilities in which the rows correspond to words from the keyword dictionary and the columns correspond to spoken wordforms. This matrix is passed to the semantic module.
The semantic module implements the model of restrictions.
A semantic hypothesis is a word sequence that corresponds to some canonical sentence. A special unit, the pseudo-word, denotes any word missing from the dictionary of words being recognized. The set of all hypotheses is a set of restrictions. The main part of the semantic module's processing is the selection of the required restrictions from a static table of restrictions that was formed beforehand by means of a grammar. Each row in this table corresponds to a hypothesis (restriction), i.e. an allowable word sequence.
The grammar determines the logical relations between words and the restrictions on the linear order of words in an utterance. For each canonical sentence a record is made in a fixed format. It is an infix-order record of the tree of logical operations on the set of word occurrences in an utterance, and it also includes restrictions on the relative order of words in the utterance and on the length of the utterance.
Standard logical operations are used: AND, OR, NOT.
Most keywords are used in the description as separate words. Numerals are the exception: they are divided into 4 classes (1-9, 11-19, tens, hundreds), and the labels of these classes are used in the description.
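The logical part of such a description can be sketched as a small expression tree evaluated against the set of words found in an utterance. The record format itself is not given in the text, so the nested-tuple representation, the class label names and the example rule below are hypothetical; restrictions on word order and utterance length are left out.

```python
def numeral_class(value):
    """Map a numeral to one of the four class labels (label names assumed)."""
    if 1 <= value <= 9:
        return "UNITS"
    if 11 <= value <= 19:
        return "TEENS"
    if 10 <= value <= 90 and value % 10 == 0:
        return "TENS"
    if 100 <= value <= 900 and value % 100 == 0:
        return "HUNDREDS"
    raise ValueError(f"not a single-word numeral: {value}")


def matches(rule, words):
    """Evaluate a rule tree against the set of words in an utterance.

    A rule is either a keyword (or class label) string, or a tuple
    (operator, operand, ...) with operator AND, OR or NOT.
    """
    if isinstance(rule, str):
        return rule in words
    op, *operands = rule
    if op == "AND":
        return all(matches(r, words) for r in operands)
    if op == "OR":
        return any(matches(r, words) for r in operands)
    if op == "NOT":
        return not matches(operands[0], words)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical canonical sentence: a room request for some number of nights.
ROOM_REQUEST = ("AND", "room",
                ("OR", "single", "double"),
                ("OR", "UNITS", "TEENS"),
                ("NOT", "cancel"))
```

Replacing each spoken numeral by its class label before matching keeps one rule valid for, say, "three nights" and "fifteen nights" alike.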
At the recognition stage (reducing an utterance to a canonical sentence) the following operations are performed:
For each canonical sentence under consideration, the semantic hypotheses that match the utterance by length are chosen. For each chosen hypothesis a measure is computed: the product of the word probabilities taken from the word probability matrix supplied by the word recognition module. In other words, the probability space consists of all possible word sequences, with the words treated as independent of each other, and the restrictions determine the part of this space that is meaningful.
Then all word sequences (hypotheses) contained in the restriction matrix are considered. If the probability of a given word string exceeds a certain level, the hypothesis is stored for further consideration.
Then all chosen hypotheses are sorted in descending order of their probability measures, and the less significant ones are dismissed (using the entropy of the resulting distribution).
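These steps can be sketched as follows. The product of per-position probabilities and the level test follow the description above; the entropy-based cut is not specified in the text, so the sketch keeps roughly exp(H) hypotheses (the perplexity of the normalised scores), which is an assumption. The matrix layout (a dict from word to a list of per-position probabilities) and all names are also illustrative.

```python
from math import exp, log, prod


def hypothesis_score(hypothesis, prob):
    """Product of per-position word probabilities (words independent)."""
    return prod(prob[word][pos] for pos, word in enumerate(hypothesis))


def select_hypotheses(hypotheses, prob, n_positions, floor):
    """Score, threshold, sort and prune hypotheses.

    Returns a list of (score, hypothesis), best first.
    """
    scored = [(hypothesis_score(h, prob), h)
              for h in hypotheses if len(h) == n_positions]
    kept = sorted((sh for sh in scored if sh[0] > floor and sh[0] > 0),
                  key=lambda sh: sh[0], reverse=True)
    if not kept:
        return []
    # Assumed pruning rule: keep about exp(entropy) hypotheses, i.e. the
    # effective number of candidates in the normalised distribution.
    total = sum(score for score, _ in kept)
    entropy = -sum((s / total) * log(s / total) for s, _ in kept)
    effective = max(1, round(exp(entropy)))
    return kept[:effective]
```

A sharply peaked distribution therefore keeps a single hypothesis, while several hypotheses with comparable scores all survive to the next pass.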
As a result, some wordforms in certain positions of the utterance become deactivated. Zeros are put in their places in the word probability matrix, and the sum of probabilities for each position is brought back to one by adding the residue to the probability of “garbage”.
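A minimal sketch of this update, assuming the word probability matrix is stored as a dict from word to a list of per-position probabilities whose columns sum to one, and that a wordform is deactivated at a position when no surviving hypothesis uses it there. The function name and the `<garbage>` key are hypothetical.

```python
def deactivate_and_renormalise(prob, survivors, garbage="<garbage>"):
    """Zero out unused wordforms and move the lost mass to the garbage row.

    prob maps each word to its probabilities per utterance position;
    survivors is a list of word tuples, one word per position.
    Returns a new matrix; the input is left unchanged.
    """
    n_positions = len(next(iter(prob.values())))
    updated = {word: list(row) for word, row in prob.items()}
    updated.setdefault(garbage, [0.0] * n_positions)
    for pos in range(n_positions):
        active = {hyp[pos] for hyp in survivors}
        for word in updated:
            if word != garbage and word not in active:
                updated[word][pos] = 0.0
        remaining = sum(row[pos] for word, row in updated.items()
                        if word != garbage)
        # Garbage absorbs the residue so the column sums to one again.
        updated[garbage][pos] = 1.0 - remaining
    return updated
```

The renewed matrix is then fed back into the hypothesis selection, which is what makes the procedure iterative.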
The operations, beginning with the selection of semantic hypotheses, are repeated several times with the updated word probability matrix.
The probability of a canonical sentence is determined as the maximum of the probabilities of its non-contradictory hypotheses.
There exists a quite different statement of the speech recognition problem, for instance the organization of a dialog with an interactive speaking toy. This is the task of implementing non-informational functions of verbal communication (phatic, appellative, interdictive, etc.). Here the bottom-up approach, which goes from the recognition of minimal speech segments (phonemes), through the search for the best word sequence and syntactic analysis, to the determination of sense, becomes inefficient, because the purpose of a spoken message is different: it is the maintenance of contact, inducement to an action, interdiction of an action, etc. That is, the problem is reduced to recognizing the function of a message rather than its linguistic content. Units such as the verbal scenario of communication, the turn and the dialog come to the foreground.
The proposed model of an artificial communication system supports mainly the phatic, emotional and partly interdictive functions of verbal communication and is intended to maintain a communication act. It is implemented as a toy, a speaking parrot. It recognizes a small number of utterances, “understands” them to some degree, and either reacts with a sentence from a restricted set or keeps silent, depending on the situation.
The speech recognition subsystem is analogous to the one described above. The only difference is that the unit of recognition is the utterance as a whole (not a word); the levels for pauses between utterances are changed correspondingly. The system recognizes about 60 utterances.
The main innovation is in the dialog organization module. The categorical dictionary of the dialog consists of semantic primes based on a set of predetermined semantic situations. The utterances of the user and of the parrot are assigned subsets of semantic primes.
During dialog construction, the history is taken into account to a certain depth. When a continuation of the dialog is considered, measures of old and new information are calculated for each candidate utterance with respect to the information contained in the history. The utterance that continues the dialog is selected by an integral estimate that depends, as a function of time, on the measures of old and new information. The form of this dependence defines the global behavior of the system. In general the system changes its strategy: in some periods it tends to maintain a certain dialog topic (i.e. known information), at other moments it changes the topic (i.e. activates new information).
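One way to picture this selection is the sketch below. The actual measures and the form of the time dependence are not given in the text, so everything here is assumed for illustration: old and new information are taken as the shares of a candidate's semantic primes that do or do not occur in the recent history, and the balance between them oscillates with the turn number along a cosine.

```python
import math


def choose_reply(candidates, history, turn, depth=3, period=8):
    """Pick the reply text with the best old/new information estimate.

    candidates: list of (text, set_of_semantic_primes), sets non-empty.
    history: list of prime sets for previous turns, oldest first.
    """
    # Primes seen within the last `depth` turns count as known information.
    known = set().union(*history[-depth:])
    # Assumed time dependence: 1.0 means "stay on topic", 0.0 means
    # "change topic"; it swings between the two over `period` turns.
    keep_topic = 0.5 * (1.0 + math.cos(2.0 * math.pi * turn / period))

    def estimate(candidate):
        primes = candidate[1]
        old = len(primes & known) / len(primes)
        new = len(primes - known) / len(primes)
        return keep_topic * old + (1.0 - keep_topic) * new

    return max(candidates, key=estimate)[0]
```

With a history containing the prime `food`, the sketch keeps offering food-related replies near turn 0 and prefers an unrelated topic half a period later, which mirrors the alternation between maintaining and changing the topic described above.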
Thus this system can implement the strategy of a “guide” within the framework of a dialog: drawing the dialog to a certain semantic situation, imposing its opinion or expressing emotions (e.g. laughing or crying).
On the other hand, modeling such a dialog is justified when the system does not “understand” the interlocutor. In this case a neutral dialog scenario is used: staying within the frame of the given situation, the system implements the phatic function to maintain the conversation and to check the communication channel (e.g. with phrases such as ‘well’, ‘what comes next?’, etc.).
The speech recognition systems considered above, with their corresponding models of speech communication within the “human-machine” dialog framework, have different applications. The system based on keyword recognition can be used in situations where sending and receiving information comes first (the informational function dominates). Systems of the second type can be used in dialog models in which communication as such is important and recognition is reduced to finding a common semantic situation or a common dialog scenario.
R E F E R E N C E S
1. Galunov V.I., Galunov G.V. One approach to speech recognition. // Proc. of International Workshop “Dialog’2000. Computational Linguistics and its Applications”. Ed. A.S. Narinyani. Protvino, 2000. V. 2, P. 80-85.
2. Razumikhin D.V. Development of a dyadic speech understanding system. // Proc. of International Workshop “Dialog’2001. Computational Linguistics and its Applications”. Ed. A.S. Narinyani. Aksakovo, 2001. V. 2, P. 323-329.
3. Soloviev A.N., Victorova K.O., Razumikhin D.V. About using non-informational functions in models of speech communication. // Proc. of International Workshop “Specom’2002. Speech and Computer”. SPb, 2002. P. 27-29.