Performance of Czech Speech Recognition with Language Models Created from Public Resources

Procházka, Václav

dc.contributor.author	Procházka, Václav
dc.contributor.author	Pollak, Petr
dc.contributor.author	Žďánský, Jindřich
dc.contributor.author	Nouza, Jan
dc.date.accessioned	2016-05-24
dc.date.available	2016-05-24
dc.date.issued	2011-01-01
dc.description.abstract	In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM built from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from the statistical point of view (mainly via their perplexity rates) and from the performance point of view when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those built from smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as a highly inflective language on the perplexity, OOV, and recognition accuracy rates.	en
dc.description.sponsorship	MSM [6840770014]; [GA CR 102/08/0707]; [TA CR TA01011204]
dc.format	text
dc.identifier.issn	1210-2512
dc.identifier.scopus	2-s2.0-84857821526
dc.identifier.uri	https://dspace.tul.cz/handle/15240/16410
dc.identifier.uri	https://www.radioeng.cz/fulltexts/2011/11_04_1002_1008.pdf
dc.language.iso	en
dc.publisher	Spolecnost Pro Radioelektronicke Inzenyrstvi
dc.publisher	Technická Univerzita v Liberci	cs
dc.publisher	Technical university of Liberec, Czech Republic	en
dc.relation.ispartof	Radioengineering	en
dc.source	j-scopus
dc.source	j-wok
dc.subject	speech recognition	en
dc.subject	LVCSR	en
dc.subject	n-gram language models	en
dc.subject	public language resources	en
dc.title	Performance of Czech Speech Recognition with Language Models Created from Public Resources	en
dc.type	Article
local.access	open
local.citation.epage	1008
local.citation.spage	1002
local.department	Institute of Information Technology and Electronics
local.event.title	Fiber Society Spring 2014 Technical Conference: Fibers for Progress
local.faculty	Faculty of Mechatronics, Informatics and Interdisciplinary Studies
local.fulltext	yes
local.identifier.stag	RIV/46747885:24220/11:#0001963
local.identifier.wok	298636800039
local.relation.issue	4
local.relation.volume	20

Files

Original bundle

Name: 2-s2.0-84857821526-o.pdf
Size: 130.36 KB
Format: Adobe Portable Document Format
Description: Article
