Multilingual Speech Recognition Systems and Their Effective Training
Abstract
This dissertation deals with the creation of automatic speech recognition (ASR) systems and with the effective adaptation of an existing system to a new language. Today's ASR systems have a modular structure in which individual modules can be considered language-dependent or language-independent. The main goal of the thesis is the research and development of methods that automate the development of language-dependent modules and make it as efficient as possible, using freely available data from the internet, machine learning methods, and similarities between languages. It is accompanied by a documented application and testing of the methods on the major Slavic languages. The work is associated with research projects on the development of broadcast monitoring systems for Slavic languages. The first part describes basic concepts and the state of the art, with a focus on the individual modules and stages of ASR system development. It is followed by a description of Slavic languages with respect to ASR. The main part of the work is divided into two parts: the first deals with linguistic-lexical aspects of the development, and the second with acoustic-phonetic aspects. The linguistic-lexical part covers the development of a text corpus, a pronunciation lexicon, and a language model. It describes principles and procedures for the effective gathering and processing of text data obtained from the internet. The text data need to be cleaned of unwanted elements and normalized, and language filtering should be applied. For a language that uses a non-Latin alphabet, it is appropriate to convert the alphabet; a Cyrillic-to-Latin conversion was designed for this purpose. Words are then chosen from the corpus to create the lexicon, and a statistical language model is computed. The acoustic-phonetic part deals with the development of a phonetic inventory, the creation of pronunciations for the words in the lexicon, and the development of an acoustic model (AM).
First, principles for selecting phonemes for a new language and approaches to creating pronunciations are described. Next, approaches to gathering acoustic data from the internet and processing them to create an AM are described. Three AM training schemes are presented. The first, supervised approach uses recordings with phonetic annotations, from which the AM is trained. The second, lightly supervised approach uses recordings together with accompanying text that may contain parts of the speech in the recordings. The recordings are transcribed by an existing speech recognizer, and the output is searched for matches with the accompanying text. Matching parts are cut out and added to the training set. All recordings are processed iteratively, so that more training data are gathered. When a system is being developed for a new language, acoustic data from another language can be used in a multilingual system to gather data for the target language. The third, unsupervised approach uses several different ASR systems to create phonetic annotations for recordings without any related text. The recordings are transcribed by all the systems, and if their outputs match, the output is used as the phonetic annotation. To test all the created systems, standardized test sets were created from real data. The final versions of the systems were evaluated on these test sets to assess their usability in broadcast monitoring tasks. Most of the systems achieved a word error rate (WER) below 20%. Finally, the proposed methods were applied to another three European languages. The development was performed mostly automatically, using only freely available data from the internet. These systems achieved a WER below 22% after a few months of development.
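The matching step of the lightly supervised scheme can be illustrated with a minimal sketch. The abstract does not say how matches between recognizer output and accompanying text are found, so everything here is an assumption made for illustration: the alignment uses `difflib.SequenceMatcher` from the Python standard library, the function name `matching_segments`, the `(word, start, end)` input format, and the `min_len` threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def matching_segments(asr_words, text_words, min_len=3):
    """Find runs of words shared by recognizer output and accompanying text.

    asr_words:  list of (word, start_sec, end_sec) tuples (hypothetical format).
    text_words: list of words from the accompanying text.
    min_len:    shortest run of consecutive matching words to keep
                (an assumed threshold, not taken from the thesis).

    Returns a list of (start_sec, end_sec, words) tuples that could be cut
    from the recording and added to a training set.
    """
    hyp = [word.lower() for word, _, _ in asr_words]
    ref = [word.lower() for word in text_words]

    # autojunk=False disables the popularity heuristic so that frequent
    # words are not silently ignored in long transcripts.
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)

    segments = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel; it is filtered out here.
        if block.size >= min_len:
            first = asr_words[block.a]
            last = asr_words[block.a + block.size - 1]
            words = ref[block.b:block.b + block.size]
            segments.append((first[1], last[2], words))
    return segments


if __name__ == "__main__":
    # Toy data: recognizer output with time stamps and a loosely related text.
    asr = [("dobry", 0.0, 0.4), ("den", 0.4, 0.7), ("vitejte", 0.7, 1.2),
           ("u", 1.2, 1.3), ("zprav", 1.3, 1.8), ("xx", 1.8, 2.0)]
    text = "uvod vitejte u zprav dnes".split()
    for start, end, words in matching_segments(asr, text):
        print(f"{start:.1f}-{end:.1f}s: {' '.join(words)}")
```

In an iterative setup of the kind the abstract describes, the segments returned here would be added to the training set, the acoustic model retrained, and the remaining recordings transcribed again. Requiring a minimum run length is one simple way to reduce the risk of accepting accidental single-word matches.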