Building parallel corpora from the Web

Pomikálek, Jan

Mgr. Jan Pomikálek, Ph.D.

Advanced ('rigorózní') thesis

Abstract:

Parallel corpora are a valuable resource for many fields in computational linguistics, e.g. machine translation, cross-language information retrieval (CLIR), and lexicography. Unfortunately, the sources of parallel texts are very limited. On the other hand, there is the World Wide Web with billions of Web pages, some of which are mutual translations. Though its potential for retrieving bilingual texts awaits …

Keywords

corpus text corpora web-derived corpora parallel corpora

Language used: English

Date on which the thesis was submitted / produced: 17. 6. 2008

Identifier: https://is.muni.cz/th/j3ahd/

Thesis defence

Date of defence: 23. 6. 2008

Citation record

ISO 690-compliant citation record:

POMIKÁLEK, Jan. Building parallel corpora from the Web. Online. Brno: Masaryk University, Faculty of Informatics. 2008. Available from: https://theses.cz/id/1zmwl9/.

Full text of thesis

Contents of on-line thesis archive

Published in Theses: to the world

Other ways of accessing the text

Institution archiving the thesis and making it accessible: Masaryk University, Faculty of Informatics

Reference to the local database directory of the institution

Masaryk University

Faculty of Informatics

Advanced ('rigorózní řízení') programme / field:
Informatics / Informatics

Theses on a related topic

Corpora from reddit.com texts
Jan Brichta
The use of "Once upon a time" in a corpus of fairy tales and in the British National Corpus
Mária Kopecká
Learner Translation Corpus: CELTraC (Czech-English Learner Translation Corpus)
Kristýna Štěpánková
Český Brown Corpus
David Krňávek
Il nuovo corpus di italiano L2 della Università Masaryk di Brno: raccolta e organizzazione dei dati.
Petra Kaňoková
Traducción de las formas del gerundio del español al checo: Análisis a través del corpus paralelo InterCorp
Ilona Mužátková
Funções comunicativas e textuais dos dois pontos. Análise do uso na escrita jornalística brasileira baseada no corpus Linguateca
Andrea Podskalská
Adaptation sémantique et orthographique des verbes empruntés à l’anglais : le rôle du corpus linguistique
Klára Halodová
