Research on Dictionary-Based Word Segmentation Algorithms Using Trie Structure

Boxing Zhang; Xin Jing; Qinlong Kang

doi:10.2478/ijanmc-2025-0005

Abstract

This paper investigates dictionary-based word segmentation algorithms, a core component of Natural Language Processing (NLP). Chinese word segmentation is challenging because written Chinese has no explicit word delimiters. The paper examines the advantages and limitations of dictionary-based segmentation, focusing on how data structures such as the Trie and the Double-Array Trie (DAT) improve segmentation efficiency. An analysis of the Trie and DAT structures leads to an optimization that achieves constant-time state transitions. Several segmentation algorithms are evaluated and compared: full segmentation, forward maximum matching, backward maximum matching, and bidirectional maximum matching. The inherent limitations of dictionary-based segmentation, particularly its dependence on dictionary coverage and its weak disambiguation capability, are also discussed.
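To illustrate the matching strategies named above, the following is a minimal Python sketch, not the implementation evaluated in the paper. It uses a plain dictionary-backed trie rather than a DAT. The names `Trie`, `longest_prefix`, `forward_max_match`, `backward_max_match`, and `bidirectional_max_match`, the four-word toy dictionary, and the tie-breaking rule in the bidirectional step (fewer tokens, then fewer single-character tokens, then the backward result) are all assumptions made for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if the path from the root spells a word


class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_prefix(self, text, start):
        """End index of the longest dictionary word beginning at `start`.

        Falls back to a single character when no word matches.
        """
        node = self.root
        end = start + 1
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1
        return end


def forward_max_match(text, trie):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        j = trie.longest_prefix(text, i)
        tokens.append(text[i:j])
        i = j
    return tokens


def backward_max_match(text, reversed_trie):
    """Scan right to left by matching reversed text against reversed words."""
    tokens = forward_max_match(text[::-1], reversed_trie)
    return [t[::-1] for t in reversed(tokens)]


def bidirectional_max_match(text, trie, reversed_trie):
    """Run both directions and keep one result (assumed heuristic)."""
    fwd = forward_max_match(text, trie)
    bwd = backward_max_match(text, reversed_trie)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    singles_fwd = sum(len(t) == 1 for t in fwd)
    singles_bwd = sum(len(t) == 1 for t in bwd)
    return fwd if singles_fwd < singles_bwd else bwd


# Toy dictionary, assumed for illustration only.
words = ["研究", "研究生", "生命", "起源"]
trie = Trie(words)
reversed_trie = Trie(w[::-1] for w in words)

text = "研究生命起源"
print(forward_max_match(text, trie))            # ['研究生', '命', '起源']
print(backward_max_match(text, reversed_trie))  # ['研究', '生命', '起源']
print(bidirectional_max_match(text, trie, reversed_trie))
```

On this input the forward scan greedily takes 研究生 and strands 命 as a single character, while the backward scan recovers 研究 / 生命 / 起源. The disagreement is a small instance of the disambiguation weakness the abstract mentions, since neither direction has any information beyond the dictionary itself.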
