
Figure 1
Pipeline for the overall procedure of cross-lingual Buddhist Chinese and Classical Tibetan alignment.
Table 1
Summary of cosine similarity scores for Tibetan-Chinese glossary pairs in the new embedding spaces, by Chinese tokenisation method. For each method, the table gives the score of the highest-scoring pair, the score of the lowest-scoring pair, and descriptive statistics over all pairs. Higher scores with a lower standard deviation indicate a more accurate embedding space.
| CHINESE EMBEDDING TYPE | MOST SIMILAR | LEAST SIMILAR | MEDIAN | MEAN | STD |
|---|---|---|---|---|---|
| Character | 0.9 | –0.2 | 0.66 | 0.64 | 0.12 |
| Hybrid 1 | 0.9 | 0.19 | 0.66 | 0.65 | 0.11 |
| Hybrid 2 | 0.91 | 0.22 | 0.66 | 0.64 | 0.11 |
| Word | 0.92 | 0.3 | 0.67 | 0.67 | 0.11 |
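
As an illustration of how the columns of Table 1 could be derived, the following is a minimal sketch rather than the authors' code. It assumes each glossary pair is available as two equal-length vectors in the shared space; the function names `cosine_similarity` and `summarise_pair_scores` are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which variant was used.

```python
import math
import statistics


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def summarise_pair_scores(pairs):
    """Summarise cosine scores over (tibetan_vector, chinese_vector) pairs.

    The keys mirror the columns of Table 1. The population standard
    deviation is an assumption made for this sketch.
    """
    scores = [cosine_similarity(u, v) for u, v in pairs]
    return {
        "most_similar": max(scores),
        "least_similar": min(scores),
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "std": statistics.pstdev(scores),
    }
```

Calling `summarise_pair_scores` once per Chinese tokenisation method would yield one row of such a table.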

Figure 2
A sample of embeddings selected from the cross-lingual Tibetan-Chinese space. This includes a selection of animal, numerical, seasonal, and directional words.

Figure 3
A zoomed-in detail of some of the animal words from the cross-lingual embedding space shown in Figure 2, including English translations.

Figure 4
Sample output for Alignment T2.A1.
Table 2
Results for all three texts with each of the four embedding methods for the Chinese input.
| TEXT – CHI. EMBEDDING TYPE | % RANK1 | % RANK5 | % RANK10 | % RANK15 | AV. RANK | # ZERO |
|---|---|---|---|---|---|---|
| Text 1 – Character | 30.95 | 69.05 | 78.57 | 92.86 | 4.33 | 2 |
| Text 1 – Hybrid 1 | 35.71 | 69.05 | 88.1 | 92.86 | 3.56 | 0 |
| Text 1 – Hybrid 2 | 40.48 | 73.81 | 90.48 | 95.24 | 3.4 | 0 |
| Text 1 – Word | 38.1 | 61.9 | 76.19 | 85.71 | 3.92 | 2 |
| Text 2 – Character | 76.19 | 100 | 100 | 100 | 1.24 | 0 |
| Text 2 – Hybrid 1 | 52.38 | 100 | 100 | 100 | 2 | 0 |
| Text 2 – Hybrid 2 | 61.9 | 100 | 100 | 100 | 1.57 | 0 |
| Text 2 – Word | 42.86 | 95.24 | 100 | 100 | 2.48 | 0 |
| Text 3 – Character | 35.29 | 47.06 | 52.94 | 70.59 | 4.58 | 1 |
| Text 3 – Hybrid 1 | 35.29 | 64.71 | 82.35 | 88.23 | 3.53 | 0 |
| Text 3 – Hybrid 2 | 35.29 | 58.82 | 82.35 | 82.35 | 3.36 | 0 |
| Text 3 – Word | 11.76 | 52.94 | 70.59 | 70.59 | 3.92 | 2 |
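
The columns of Table 2 can be read as cumulative rank-based retrieval metrics. The sketch below shows one plausible way to compute them and is not taken from the source. It assumes that each source segment has a single integer rank for its correct target, that a rank of 0 means the correct target was not retrieved at all (one reading of the "# ZERO" column), and that the average rank is taken over non-zero ranks only. The function name `rank_metrics` is hypothetical.

```python
def rank_metrics(ranks, cutoffs=(1, 5, 10, 15)):
    """Compute cumulative rank metrics from a list of integer ranks.

    Assumptions of this sketch: rank 0 marks a segment whose correct
    target was not retrieved, percentages are taken over all segments,
    and the average rank ignores the zero-rank segments.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    total = len(ranks)
    found = [r for r in ranks if r > 0]
    metrics = {
        # Share of all segments whose correct target is ranked at or above k.
        f"pct_rank{k}": 100.0 * sum(1 for r in found if r <= k) / total
        for k in cutoffs
    }
    metrics["avg_rank"] = sum(found) / len(found) if found else None
    metrics["n_zero"] = total - len(found)
    return metrics
```

For example, `rank_metrics([1, 3, 0, 12])` counts one segment at rank 1, two within the top 5 and top 10, three within the top 15, and one segment with no retrieved match.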

Figure 5
Top-ranked results for each Chinese embedding method by text.
