Identifikace jazyka textového dokumentu

Valta, Jan

Title Alternative: Language identification of text documents

Files

mgr_23104.pdf (1.75 MB)

opo_23104.pdf (27.31 KB)

ved_23104.pdf (27.31 KB)

obh_23104.pdf (27.31 KB)

Date

2012

Authors

Valta, Jan

Publisher

Technická Univerzita v Liberci

Abstract

This diploma thesis deals with language identification of text documents using statistical n-gram models. The theoretical part describes the statistical n-gram model, how it is built, and how it is evaluated. It also describes basic smoothing techniques and types of n-gram models. The practical part compares language identification results for various n-gram models that differ in smoothing technique, order, and model type. It also examines the influence of diacritics on language identification.
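
For readers unfamiliar with the approach named in the abstract, the following is a minimal sketch of language identification with character n-gram models. It is not taken from the thesis: the choice of character bigrams, add-one (Laplace) smoothing, the two tiny training strings (built from the titles in this record), and all names (`CharNgramModel`, `identify`, `strip_diacritics`) are assumptions made for illustration only.

```python
import math
import unicodedata
from collections import Counter


def strip_diacritics(text):
    """Remove combining marks, e.g. 'textového' becomes 'textoveho'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngrams(text, n):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram model with add-one (Laplace) smoothing (assumed choice)."""

    def __init__(self, n, alphabet_size):
        self.n = n
        self.alphabet_size = alphabet_size  # rough estimate of the vocabulary size
        self.ngram_counts = Counter()
        self.context_counts = Counter()

    def train(self, text):
        for gram in ngrams(text, self.n):
            self.ngram_counts[gram] += 1
            self.context_counts[gram[:-1]] += 1

    def log_prob(self, text):
        """Sum of smoothed log P(last char | preceding n-1 chars) over the text."""
        total = 0.0
        for gram in ngrams(text, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total


def identify(text, models):
    """Return the language whose model gives the text the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(text))


# Toy training data (illustrative only; real models need far larger corpora).
samples = {
    "cs": "identifikace jazyka textového dokumentu",
    "en": "language identification of text documents",
}
alphabet_size = len(set("".join(samples.values()).lower()))

models = {}
for lang, sample in samples.items():
    model = CharNgramModel(n=2, alphabet_size=alphabet_size)
    model.train(sample)
    models[lang] = model

print(identify("jazyka", models))
print(identify("language", models))
```

To probe the effect of diacritics, one option under the same assumptions is to pass both the training samples and the query through `strip_diacritics` before training and scoring, then compare the results with the unstripped run.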

Description

department: ITE; attachments: 1 DVD; extent: 49 pages

Subject(s)

language identification, n-gram model, smoothing technique

Item identifier

https://dspace.tul.cz/handle/15240/12006

Collections

Fakulta mechatroniky, informatiky a mezioborových studií
