Bc. Samuel Špalek
Bachelor's thesis
Evaluation of text tokenizers
Abstract:
This thesis presents a detailed analysis of the differences between Unitok and Utok, two tokenizers for natural language processing. Tokenization is a crucial step in NLP text processing: it breaks text down into the smallest meaningful elements, called tokens, which allows machines to analyze and understand each element in the context of the others. Some tokenizers use the simple technique of splitting text on whitespace. More complex and language-dependent cases …
Language used: English
Date on which the thesis was submitted / produced: 15. 12. 2022
Identifier:
https://is.muni.cz/th/xkj6g/
Thesis defence
- Date of defence: 30. 1. 2023
- Supervisor: doc. Mgr. Pavel Rychlý, Ph.D.
- Reader: RNDr. Vít Suchomel, Ph.D.
Full text of thesis
Published in Theses: to the world (publicly accessible)
Other ways of accessing the text
Institution archiving the thesis and making it accessible: Masarykova univerzita, Fakulta informatiky
Masaryk University
Faculty of Informatics
Bachelor programme / field: Informatics / Informatics