THE SPEECHDAT(E) PROJECT: CREATING SPEECH DATABASES FOR EASTERN EUROPEAN LANGUAGES
Henk van den Heuvel, Valery Galounov, Herbert S. Tropf
The implementation of speech processing technology, i.e. speech recognition and speaker verification, requires language-specific spoken language resources, i.e. speech databases, lexica and related tools. To be competitive with American companies, European companies have to create an effective infrastructure for dealing successfully with their multilingual environment. The EU projects SpeechDat(M), SpeechDat(II), SpeechDat(Car) and ELRA are part of such an infrastructure to create, validate and distribute spoken language resources. These projects focused on Western European languages. In response to the fast-growing trade between Eastern and Western Europe, this infrastructure has to be extended to Eastern European languages. In this spirit, the proposed project SpeechDat(E) focuses on the creation of spoken language resources for Eastern European languages, namely Russian, Czech, and Slovak.
The SpeechDat(E) project will be carried out within the COPERNICUS framework. The project will run for 2 years, starting in 1998. The SpeechDat(E) consortium consists of 3 industrial contractors and 3 academic contractors. Siemens AG (Germany) acts as Project Coordinator, whereas AudiTech (Russia) will act as Scientific Coordinator. The project focuses on the following databases:
- Russian (as spoken in Russia; 4 recordings of different speakers; responsible: AudiTech)
- Czech (as spoken in the Czech Republic; 2 recordings of different speakers; responsible: TU Brno)
- Slovak (as spoken in the Slovak Republic; 2 recordings of different speakers; responsible: Slovak Academy of Sciences)
Because SpeechDat(E) builds on the standards, best practices and guidelines established within SpeechDat(M) and SpeechDat(II), the speech databases for the Eastern European languages can be produced cost-effectively with respect to speaker recruitment, platform setup, annotation tools, and validation.
Content and creation
All databases are recorded on telephone servers with ISDN connections. The signal format is 8-bit, 8 kHz A-law, the European ISDN standard. For the annotation, the SAM file format has been chosen for two reasons: it separates signal data from annotation data, and it is extensible. The annotations are encoded in ISO-8859, and a common SAM file format has been defined for each of the three types of databases. The file system hierarchy is based on purely formal criteria, i.e. it is not content-related, so that all SpeechDat databases can be addressed consistently in one large file system. File names follow the 8.3 character pattern of ISO 9660 for platform independence. There is a large core content common to all SpeechDat databases. It consists of approximately 4 items that cover application words and phrases such as digit strings, as well as phonetically rich words and sentences. The utterances will be annotated orthographically. The annotation is enriched with a set of markers for noises and for deviations such as mispronunciations and recording truncations. Speaker recruitment is left to the individual partners.
The SpeechDat project features a thorough validation protocol. Compliance of the databases with the specifications is evaluated by an independent validation centre, SPEX, which is an associated contractor of the project. Validation proceeds in three steps:
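As an illustration of the signal format described above, the following is a minimal sketch of how 8-bit A-law samples could be expanded to linear PCM values, following the usual G.711 expansion rule. It is not part of the SpeechDat tooling; the names `alaw_to_linear`, `decode_alaw` and `duration_seconds` are hypothetical, and the sketch assumes a headerless signal file with one A-law byte per sample at 8 kHz.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz sampling, one 8-bit A-law code word per sample


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code word to a signed linear sample (G.711 rule)."""
    a = code ^ 0x55                 # undo the inversion of alternate bits
    mantissa = (a & 0x0F) << 4      # 4-bit mantissa
    segment = (a & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if a & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list:
    """Decode a headerless A-law byte string into a list of linear samples."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """Duration of a raw recording, assuming one byte per sample at 8 kHz."""
    return len(data) / SAMPLE_RATE_HZ


if __name__ == "__main__":
    raw = bytes([0xD5, 0x55, 0xAA, 0x2A])   # made-up sample bytes, not real data
    print(decode_alaw(raw))                 # [8, -8, 32256, -32256]
    print(duration_seconds(bytes(16000)))   # 2.0
```

Because each sample occupies exactly one byte, the duration of a recording follows directly from the file size, which makes simple consistency checks on signal files straightforward.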
- Prevalidation of a small database of speakers. The objective of this stage is to detect serious errors before the actual recordings start.
- Validation of complete databases. The database is checked against the SpeechDat specifications and a validation report is generated.
- Revalidation of complete databases. If the validation report shows that improvements to a database are necessary or desirable, (part of) the database can be submitted for a second validation, and a new report is written.
The final validation report is put onto the final CDs as part of the database.
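The validation centre's actual tools are not described here. Purely as an illustration of the kind of formal check that a validation against the specifications could include, the sketch below tests whether file names follow the 8.3 pattern. The function name `invalid_file_names`, the example file names, and the assumed character set (uppercase letters, digits and underscore, with a non-empty name and extension) are assumptions made for this example.

```python
import re

# Assumed 8.3 pattern: 1-8 name characters, a dot, 1-3 extension characters,
# restricted to uppercase letters, digits and underscore.
NAME_8_3 = re.compile(r"[A-Z0-9_]{1,8}\.[A-Z0-9_]{1,3}")


def invalid_file_names(names):
    """Return the file names that do not match the assumed 8.3 pattern."""
    return [name for name in names if NAME_8_3.fullmatch(name) is None]


if __name__ == "__main__":
    # Hypothetical file names, not taken from any SpeechDat database.
    candidates = ["A1000001.RUA", "TOOLONGNAME.TXT", "readme.txt", "B2.CZ"]
    print(invalid_file_names(candidates))  # ['TOOLONGNAME.TXT', 'readme.txt']
```

A check of this kind only covers the formal naming rule; content-related checks such as annotation quality would need separate procedures.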
The SpeechDat(E) project was recently approved as a Joint Research Project of the INCO-COPERNICUS Work Programme. Meanwhile, the Russian partner has collected a speaker database as an Invited (non-funded) Guest Partner of SpeechDat(II). Care will be taken that there is no speaker overlap with the SpeechDat(E) recordings. This Russian database comprises speakers from Moscow and from St. Petersburg. The database was completed according to the specifications of the SpeechDat(II) project. The speech material (answers to items) consists of spontaneous answers, read digit sequences and read text material (words and phrases). The total vocabulary is about , units. Recording was carried out over ISDN lines. The phonemic transcription for the lexicon was made according to the Russian SAMPA table, which was developed according to the requirements of the St. Petersburg phonetic school.
The basic strategies for carrying out such a project for collecting large speech databases are currently being adopted by the SpeechDat(Car) project and by the SALA project (SpeechDat Across Latin America), which collects Spanish and Portuguese databases covering Latin American countries.