Bc. Samuel Špalek

Bachelor's thesis

Evaluation of text tokenizers

Abstract:
This thesis analyzes in detail the differences between two tokenizers, Unitok and Utok. Tokenization is an essential step in working with natural language. A tokenizer splits text into the smallest possible meaningful elements, called tokens. This makes it possible to analyze each element in the context of the others. Some tokenizers use the simple technique of splitting text on spaces. More complex and language-dependent cases …
Abstract:
This thesis presents a detailed analysis of the differences between Unitok and Utok, two tokenizers for natural language processing. Tokenization is a crucial step in NLP text processing: it breaks text down into minimal meaningful elements called tokens. This allows machines to analyze and understand each element in the context of the others. While some tokenizers …
 
 
Language used: English
Date on which the thesis was submitted / produced: 15. 12. 2022

Thesis defence

  • Date of defence: 30. 1. 2023
  • Supervisor: doc. Mgr. Pavel Rychlý, Ph.D.
  • Reader: RNDr. Vít Suchomel, Ph.D.

Contents of on-line thesis archive
Published in Theses:
  • to the world (public access)
Other ways of accessing the text
Institution archiving the thesis and making it accessible: Masarykova univerzita, Fakulta informatiky

Masaryk University

Faculty of Informatics

Bachelor programme / field:
Informatics / Informatics
