Document Type: Original Article
Authors
1 NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
2 Faculty of Computer Science and Engineering, Shahid Beheshti University G.C., Tehran, Iran
Abstract
Tokenization is a critical stage in text preprocessing and presents numerous challenges in languages such as Persian, where word boundaries are not marked deterministically. These challenges include identifying multi-function morphemes, separating punctuation marks, handling omitted spaces between tokens, and handling extra spaces inside words.
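To make the listed challenges concrete, the following minimal Python sketch (not taken from the paper; the example sentences and the naive whitespace tokenizer are purely illustrative) shows how splitting on whitespace alone mishandles an extra space inside a verb and fails to separate final punctuation, while the ZWNJ (U+200C) half-space correctly keeps a prefixed verb as a single token.

    # Minimal illustration of two Persian tokenization challenges.
    # ZWNJ (U+200C) is the "half-space" used inside words such as the
    # verb "می‌رود" (he/she goes); it is not Unicode whitespace.
    ZWNJ = "\u200c"

    def naive_tokenize(text: str) -> list[str]:
        """Split on whitespace only -- no Persian-specific handling."""
        return text.split()

    # Extra space inside a word: the prefix is separated by a full space,
    # so a single verb is wrongly split into two tokens.
    print(naive_tokenize("او می رود"))        # 3 tokens; the verb is split in two
    print(naive_tokenize(f"او می{ZWNJ}رود"))  # 2 tokens; the ZWNJ keeps the verb intact

    # Unseparated punctuation: the final period stays glued to the last token.
    print(naive_tokenize("او رفت."))          # 2 tokens; the period remains attached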
Typically, the evaluation of tokenizers focuses on overall performance, and test data does not necessarily cover all challenging linguistic phenomena. As a result, the strengths and weaknesses of tokenizers in addressing specific challenges are not assessed independently. This paper examines the challenges posed by the Persian script in detecting word boundaries and evaluates the performance of seven tokenizers in handling these issues. A test set of 4091 tokens across 483 sentences was prepared, of which 1010 were marked as challenging tokens, and the tokenizers were evaluated on this dataset.
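One way to carry out such challenge-specific evaluation is to tag each gold token with the phenomenon it exercises and score each category separately. The sketch below is a hypothetical illustration of that idea, not the paper's actual evaluation code; the category labels are invented, and it assumes the tokenizer output is already aligned one-to-one with the gold tokens, which a real evaluation would have to establish first.

    # Hypothetical per-challenge scoring: each gold token carries an optional
    # challenge category, and accuracy is reported per category rather than
    # as a single overall figure.
    from collections import defaultdict

    def per_challenge_accuracy(gold, predicted):
        """gold: list of (token, category) pairs aligned with predicted tokens."""
        correct, total = defaultdict(int), defaultdict(int)
        for (gold_tok, category), pred_tok in zip(gold, predicted):
            key = category or "non-challenging"
            total[key] += 1
            if pred_tok == gold_tok:
                correct[key] += 1
        return {k: correct[k] / total[k] for k in total}

    gold = [("او", None), ("می\u200cرود", "half-space"), (".", "punctuation")]
    pred = ["او", "می\u200cرود", "."]
    print(per_challenge_accuracy(gold, pred))  # accuracy of 1.0 in every category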
The results indicate varying performance among tokenizers when dealing with Persian orthography. Some tokenizers performed better at separating compound words, while others excelled at identifying and preserving the zero-width non-joiner (ZWNJ, the half-space). A detailed comparison reveals that no tokenizer fully addresses all challenges, highlighting the need for improved algorithms and more sophisticated solutions for Persian word boundary detection.
By introducing a comprehensive benchmark and identifying the strengths and weaknesses of available tokenizers, this study paves the way for the development of better Persian language processing tools.