SPEECH DATABASE FOR THE RUSSIAN LANGUAGE
Galounov V.I. (1), van den Heuvel H. (2), Kochanina J.L. (1), Ostroukhov A.V. (1),
Tropf H. (3), Vorontsova A.V. (1).
This paper describes a European project for collecting telephone speech databases for teleservices. The Russian Database was collected within this project. It comprises 500 speakers from Moscow and 500 speakers from St. Petersburg.
For many teleservices speech recognition has become a key component for a fully automated service. Currently many automated teleservices rely on isolated word recognition. However, future commercial systems will become more and more user-friendly: continuous speech recognition will become prevalent. Teleservices will evolve from a rigid machine-driven dialogue to a more natural user-driven dialogue.
Satisfactory performance, however, is only achievable if realistic training and testing data are available. The training data should comprise between several hundred and a few thousand speakers per gender depending on the number of speakers of the language in question. It should cover different dialects and accents and it should be representative of the telephone channel conditions likely to be encountered.
Creation of spoken language resources for voice-driven teleservices helps to improve information and communications systems and services. In a multilingual environment such as Europe, it is essential that users have access to "common European services" in their own "native" language and even dialect. Due to the standardised design and validation of the speech databases, speech recognisers can easily be adapted for a large variety of languages and applications.
The economic impact of the project is twofold. Firstly, the production of multilingual voice servers by European industries and the creation of voice-driven services by European service providers create employment in a high-tech field and potential opportunities for export.
Secondly, information and communication systems and services can be used by all interested European users.
In the near future European companies active in the area of speech driven applications will have most success in the area of telecommunication because there exists a strong European basis of telecom products. Teleservices which will be partly or fully automated using modern speech technology comprise a market of several billion ECU/year in Europe.
With current speech technology, large language-specific speech databases are needed to develop and optimise competitive speech recognition systems.
The Polyphone standard for telephony speech databases (fixed network recording) has been developed. The technical properties of Polyphone data sets are: 25-40 utterances per talker, both read and spontaneous; 5000 talkers; telephone speech material collected digitally directly from the telephone network (A-law, mu-law).
In Europe, the projects SpeechDat(M) and SpeechDat(II) represent the major industrial and academic participants. They are creating European telephone speech databases on a large scale:
- coverage of applications (application-oriented words, phonetically rich sentences)
- coverage of speaking styles (commands, carefully pronounced and spontaneous speech)
- coverage of environmental influences (mobile and fixed telephone network)
- suitability for developing and training robust speech recognisers and for teleservices.
RUSSIAN SPEECH DATABASE
In Eastern Europe, only two telephone speech databases are known to follow the SpeechDat standards exactly: Slovene and Russian, each containing recordings of 1000 speakers.
The Speech Database for the Russian language was created by AudiTech-RD in February 1998 within the project SpeechDat(II).
The Speech Database for the Russian language collected by AudiTech-RD comprises the recordings of 1000 speakers (1000 sessions). The recordings were made in Moscow and St. Petersburg: 500 in each city. Untrained speakers of different social classes were recruited from various large organisations: plants, research institutes, higher schools, etc.
Sex is known to influence speech quality (above all pitch and intensity), so it was decided to recruit equal numbers of speakers of both sexes. Speakers were drawn from several age groups between 8 and 60 (the share of speakers between 8 and 16 was less than 2%). People with strong speech pathologies were not included. The regional background of speakers can have a large effect on their speech. Dialect was operationalised by the geographic region in which a speaker grew up, spent the high-school period and has lived for a long time, together with the presence of regional phonetic features (by expert decision).
The Database represents the following regional speech dialects:
* St. Petersburg and Moscow (40 per cent each);
* the dialects of Middle Russia, North and South Russia, Urals and Siberia (1 to 8 per cent).
Recording was carried out through a European ISDN line. The signal format is 8-bit, 8 kHz, A-law.
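To work with such recordings, the 8-bit A-law samples are usually expanded to linear PCM first. The following is a minimal sketch of the standard G.711 A-law expansion scaled to the 16-bit range, assuming the speech file is a raw, headerless stream of A-law bytes; the function names are illustrative and are not part of the database tools.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate stated for the database


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a signed linear sample (16-bit range)."""
    code ^= 0x55                      # undo the even-bit inversion used by A-law
    exponent = (code & 0x70) >> 4     # 3-bit segment number
    mantissa = code & 0x0F            # 4-bit position inside the segment
    if exponent == 0:
        magnitude = (mantissa << 4) + 8
    else:
        magnitude = ((mantissa << 4) + 0x108) << (exponent - 1)
    # After the XOR, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte stream into a list of linear samples."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """One byte per sample at 8 kHz, so duration follows from the length."""
    return len(data) / SAMPLE_RATE_HZ
```

At one byte per sample and 8000 samples per second, one second of speech occupies 8000 bytes.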
The initial dictionary of the Database contains lists of the main commonly used application words and commands from the computer lexicon, digits and digit sequences, names of big cities and companies, time phrases, dates, money amounts, telephone numbers, credit card numbers, name-surname combinations, and phonetically rich words and sentences. The dictionary is phonetically representative, i.e. it covers the phoneme composition of the Russian language in full. The dictionary was the basis for creating the prompt lists for the speakers, which include prepared reading material as well as a number of questions that call for spontaneous answers. The speech material contains spontaneous speech, reading, commands, and word spelling. Spontaneous speech includes the answers to prompts like “What time is it now?”, “Where did you spend your childhood?”, “Spell your surname”, etc. (The speakers were allowed to withhold precise personal information and to give invented names and surnames.)
Processing of the speech material was carried out by experts in speech acoustics. It involved repeated listening to all the wave files and annotating them according to the specification defined for the SpeechDat(II) project participants. The annotation consists of inserting the following information in the label file:
* speech orthographic transcription;
* special marks pointing out the noises, mispronunciations, recording truncations which may occur;
* recording quality assessment (NOISE, OTHER, GARBAGE and OK);
* speaker information (age, sex, regional accent);
* type of acoustic environment.
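The annotation fields listed above can be pictured as one record per speech file. The sketch below is only an illustration of that idea: the class and field names are hypothetical, the coding of sex as "M"/"F" is an assumption, and the sketch does not reproduce the actual SpeechDat label file format.

```python
from dataclasses import dataclass
from enum import Enum


class Quality(Enum):
    """The four quality assessments named in the annotation list."""
    OK = "OK"
    NOISE = "NOISE"
    OTHER = "OTHER"
    GARBAGE = "GARBAGE"


@dataclass
class Annotation:
    """Hypothetical in-memory view of one annotated speech file."""
    transcription: str   # orthographic transcription, including special marks
    quality: Quality     # recording quality assessment
    age: int             # speaker age
    sex: str             # assumed coding: "M" or "F"
    accent: str          # regional accent
    environment: str     # type of acoustic environment


def clean_recordings(annotations: list[Annotation]) -> list[Annotation]:
    """Keep only the recordings assessed as OK."""
    return [a for a in annotations if a.quality is Quality.OK]
```

A filter of this kind is one way to select material for further processing; the selection criteria actually used for the database are those described in the text.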
The structure of the speech database is as follows.
A full recording (session) for each speaker consists of 48 speech files and the corresponding label files with annotation.
The lexicon (file LEXICON) was composed of all the words that were pronounced clearly and without mistakes by the speakers, with an indication of the frequency of each word and its broad phonemic transcription.
The lexicon comprises about 10 000 units. Phoneme transcription of the lexicon was carried out according to the Russian SAMPA (Speech Assessment Methods Phonetic Alphabet) symbol table developed with the participation of “AudiTech”. SAMPA is a machine-readable phonetic alphabet. It was developed in 1987-89 by an international group of phoneticians and consists of a mapping of symbols of the International Phonetic Alphabet onto ASCII codes in the range 33..127. The LEXICON is an alphabetically ordered list of distinct lexical items with the most frequent pronunciation and some variants.
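The construction of such a lexicon can be sketched as follows. This is a minimal illustration, not the procedure used by the authors: the function names are hypothetical, the transcriptions are assumed to be whitespace-separated words, and the sample pronunciations in the usage note are illustrative only.

```python
from collections import Counter


def build_lexicon(transcriptions, pronunciations):
    """Return alphabetically ordered (word, frequency, pronunciation) entries.

    transcriptions: iterable of orthographic transcription strings
    pronunciations: dict mapping a word to its SAMPA transcription
    """
    counts = Counter(word for line in transcriptions for word in line.split())
    return [(word, counts[word], pronunciations.get(word, ""))
            for word in sorted(counts)]


def is_valid_sampa(transcription):
    """Check that every symbol uses only ASCII codes in the range 33..127."""
    return all(33 <= ord(ch) <= 127
               for symbol in transcription.split()
               for ch in symbol)
```

For example, `build_lexicon(["da net", "da"], {"da": "d a", "net": "n' e t"})` yields one entry per distinct word together with its count, and `is_valid_sampa` rejects any transcription that contains a character outside the stated ASCII range.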
In addition, the speech database contains a recording conditions file, a speaker information file, a file on the acoustic quality of the speech recordings, and a corpus contents file. The file DESIGN contains a full description of the database, annotation information, speaker information and the lexicon.
The validation of the database is carried out by SPEX, the Speech Processing Expertise Centre. Validation proceeds in two steps:
1. Prevalidation of a small database of 10 speakers, to detect serious design errors before the actual recording starts.
2. Validation of the complete database.
The final validation report is included on the final CDs as part of the database.
“AudiTech” has developed a number of programs to assist in SpeechDat creation. The most important one is SPDAT, a transcription system for editing annotation files. In SPDAT a transcriber selects a recording, verifies whether the speaker pronounced the prompted text correctly and modifies the transcription if necessary.
The Russian Database is stored on 5 CDs (two blocks each, together with all documentation files).
Within the SpeechDat(E) project, telephone speech databases for Eastern European languages are to be created for the fixed telephone network over a period of 24 months: one for Russian (2500 speakers), one for Czech (1000 speakers), one for Slovak (1000 speakers), one for Polish (1000 speakers) and one for Hungarian. Most regions of these countries are covered, so the corresponding dialects are also taken into account.
Adherence to the quality standards for the databases to be created is ensured by two validation steps carried out by the validation centre SPEX.