Crosslinguistic Semantic Textual Similarity of Buddhist Chinese and Classical Tibetan

Rafal Felbur; Marieke Meelen; Paul Vierthaler

doi:10.5334/johd.86

Abstract

In this paper we present the first procedure for identifying highly similar sequences of text in Chinese and Tibetan translations of Buddhist sūtra literature. We propose this procedure initially as an aid to scholars engaged in the philological study of Buddhist documents. We create a cross-lingual embedding space and take the cosine similarity of averaged sequence vectors to produce unsupervised cross-linguistic parallel alignments at the word, sentence, and even paragraph level. Initial results show that our method lays a solid foundation for the future development of a fully fledged Information Retrieval tool for these (and potentially other) low-resource historical languages.
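
The similarity step named in the abstract (cosine similarity of average sequence vectors) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the three-dimensional vectors, and the token lists are hypothetical toy values, and the sketch assumes that the Chinese and Tibetan word vectors have already been mapped into one shared embedding space by some earlier step.

```python
import math

def average_vector(tokens, embeddings):
    """Return the mean vector of all tokens found in `embeddings`, or None if none are found."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_candidates(query_tokens, candidates, query_emb, cand_emb):
    """Score every candidate sequence against the query and sort by descending similarity."""
    q = average_vector(query_tokens, query_emb)
    scored = []
    for idx, cand_tokens in enumerate(candidates):
        c = average_vector(cand_tokens, cand_emb)
        score = cosine_similarity(q, c) if q is not None and c is not None else 0.0
        scored.append((score, idx))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical toy embeddings, ASSUMED to live in one shared cross-lingual space.
zh_emb = {
    "如來": [0.9, 0.1, 0.0],
    "說": [0.1, 0.9, 0.0],
    "法": [0.0, 0.2, 0.9],
}
bo_emb = {
    "de bzhin gshegs pa": [0.8, 0.2, 0.1],
    "gsungs": [0.2, 0.8, 0.1],
    "chos": [0.1, 0.1, 0.9],
    "rgyal po": [0.6, 0.1, -0.7],
    "grong khyer": [-0.5, 0.3, -0.6],
}

# One pre-segmented Chinese query and two pre-segmented Tibetan candidates.
query = ["如來", "說", "法"]
candidates = [
    ["de bzhin gshegs pa", "chos", "gsungs"],
    ["rgyal po", "grong khyer"],
]

ranked = rank_candidates(query, candidates, zh_emb, bo_emb)
for score, idx in ranked:
    print(f"{score:.3f}", " ".join(candidates[idx]))
```

Because averaging discards word order, the same scoring applies unchanged to single words, sentences, or paragraphs. Only the length of the token lists changes. Tokens missing from the vocabulary are skipped here, which is one possible design choice among several.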

