Bc. Rastislav Papčo

Master's thesis

Topic Classification for Web Corpora: Method Comparison and Crosslingual Transfer

Abstract:
English text corpora are essential for computational linguistics. The internet is a large and cheap source of such data, but web data usually lack structure and metadata. The aim of this thesis was to clean web corpora of bad texts while also annotating the data with topics. The topics are recognized in two ways: topic classification and topic modeling. Topic classification is performed by a supervised fastText model, while …

Language used: English
Date on which the thesis was submitted / produced: 17. 5. 2022

Thesis defence

  • Date of defence: 23. 6. 2022
  • Supervisor: RNDr. Vít Suchomel, Ph.D.
  • Reader: Mgr. Michal Štefánik

Contents of on-line thesis archive
Published in Theses:
  • to the world
Other ways of accessing the text
Institution archiving the thesis and making it accessible: Masarykova univerzita, Fakulta informatiky

Masaryk University

Faculty of Informatics

Master programme / field:
Artificial intelligence and data processing / Machine learning and artificial intelligence
