Performance of Czech Speech Recognition with Language Models Created from Public Resources

dc.contributor.author: Procházka, Václav
dc.contributor.author: Pollak, Petr
dc.contributor.author: Žďánský, Jindřich
dc.contributor.author: Nouza, Jan
dc.date.accessioned: 2016-05-24
dc.date.available: 2016-05-24
dc.date.issued: 2011
dc.description.abstract: In this paper, we investigate the usability of publicly available n-gram corpora for creating language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM built from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from the statistical point of view (mainly via their perplexity) and from the performance point of view when employed in large vocabulary continuous speech recognition (LVCSR) systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization, cannot compete with those built from smaller but more consistent corpora. The experiments on large test data also illustrate the impact of Czech, as a highly inflective language, on perplexity, out-of-vocabulary (OOV) rate, and recognition accuracy. (en)
dc.description.sponsorship: MSM [6840770014]; [GA CR 102/08/0707]; [TA CR TA01011204]
dc.format: text
dc.identifier.issn: 1210-2512
dc.identifier.scopus: 2-s2.0-84857821526
dc.identifier.uri: https://dspace.tul.cz/handle/15240/16410
dc.identifier.uri: https://www.radioeng.cz/fulltexts/2011/11_04_1002_1008.pdf
dc.language.iso: en
dc.publisher: Spolecnost Pro Radioelektronicke Inzenyrstvi
dc.publisher: Technická Univerzita v Liberci (cs)
dc.publisher: Technical University of Liberec, Czech Republic (en)
dc.relation.ispartof: Radioengineering (en)
dc.source: j-scopus
dc.source: j-wok
dc.subject: speech recognition (en)
dc.subject: LVCSR (en)
dc.subject: n-gram language models (en)
dc.subject: public language resources (en)
dc.title: Performance of Czech Speech Recognition with Language Models Created from Public Resources (en)
dc.type: Article
local.access: open
local.citation.epage: 1008
local.citation.spage: 1002
local.department: Institute of Information Technology and Electronics
local.event.title: Fiber Society Spring 2014 Technical Conference: Fibers for Progress
local.faculty: Faculty of Mechatronics, Informatics and Interdisciplinary Studies
local.fulltext: yes
local.identifier.stag: RIV/46747885:24220/11:#0001963
local.identifier.wok: 298636800039
local.relation.issue: 4
local.relation.volume: 20
Files
Original bundle
Name: 2-s2.0-84857821526-o.pdf
Size: 130.36 KB
Format: Adobe Portable Document Format
Description: Article