
Comparison of Language Models for English-Latvian Semantic Search

Open Access | Feb 2025

References

  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  2. A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jul. 2020, pp. 8440–8451. https://doi.org/10.18653/v1/2020.acl-main.747
  3. A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, et al., “PaLM: Scaling language modeling with Pathways,” Journal of Machine Learning Research, vol. 24, pp. 1–113, 2023. https://www.jmlr.org/papers/volume24/22-1144/22-1144.pdf
  4. N. Muennighoff, “SGPT: GPT sentence embeddings for semantic search,” arXiv preprint 2202.08904, Aug. 2022. https://doi.org/10.48550/arXiv.2202.08904
  5. Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G.H. Abrego, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, and R. Kurzweil, “Multilingual universal sentence encoder for semantic retrieval,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Jul. 2020, pp. 87–94. https://doi.org/10.18653/v1/2020.acl-demos.12
  6. B. Bijin, “A local and intelligent web information retrieval system,” M.S. thesis, University of Alberta, Alberta, AB, Canada, 2021. https://doi.org/10.7939/r3-eb5m-q238
  7. R. Litschko, I. Vulić, S.P. Ponzetto, and G. Glavaš, “Evaluating multilingual text encoders for unsupervised cross-lingual retrieval,” in Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Lecture Notes in Computer Science, vol. 12656, Springer Verlag, Berlin, Heidelberg, Mar. 2021, pp. 342–358. https://doi.org/10.1007/978-3-030-72113-8_23
  8. R. Litschko, I. Vulić, S.P. Ponzetto, and G. Glavaš, “On cross-lingual retrieval with multilingual text encoders,” Information Retrieval Journal, vol. 25, pp. 149–183, Mar. 2022. https://doi.org/10.1007/s10791-022-09406-x
  9. M. Artetxe and H. Schwenk, “Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 597–610, Sep. 2019. https://doi.org/10.1162/tacl_a_00288
  10. K. Heffernan, O. Çelebi, and H. Schwenk, “Bitext mining using distilled sentence representations for low-resource languages,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, Dec. 2022, pp. 2101–2112. https://doi.org/10.18653/v1/2022.findings-emnlp.154
  11. Meta Research, “Language-agnostic sentence representations,” 2018. [Online]. Available: https://github.com/facebookresearch/LASER. Accessed on: Jul. 18, 2024.
  12. P.-A. Duquenne, H. Schwenk, and B. Sagot, “SONAR: Sentence-level multimodal and language-agnostic representations,” arXiv preprint 2308.11466, Aug. 2023. https://doi.org/10.48550/arXiv.2308.11466
  13. AI at Meta, “SONAR”, 2023. [Online]. Available: https://huggingface.co/facebook/SONAR. Accessed on: Jul. 18, 2024.
  14. F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang, “Language-agnostic BERT sentence embedding,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, May 2022, pp. 878–891. https://doi.org/10.18653/v1/2022.acl-long.62
  15. Sentence Transformers, “LaBSE”, 2019. [Online]. Available: https://huggingface.co/sentence-transformers/LaBSE. Accessed on: Jul. 24, 2024.
  16. N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “MTEB: Massive text embedding benchmark,” in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, May 2023, pp. 2014–2037. https://doi.org/10.18653/v1/2023.eacl-main.148
  17. N. Reimers and I. Gurevych, “Making monolingual sentence embeddings multilingual using knowledge distillation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Nov. 2020, pp. 3982–3992. https://doi.org/10.18653/v1/2020.emnlp-main.365
  18. L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei, “Multilingual E5 text embeddings: A technical report,” arXiv preprint 2402.05672, 2024. https://doi.org/10.48550/arXiv.2402.05672
  19. Sentence Transformers, “paraphrase-multilingual-MiniLM-L12-v2,” 2019. [Online]. Available: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. Accessed on: Jul. 24, 2024.
  20. Sentence Transformers, “paraphrase-multilingual-mpnet-base-v2,” 2019. [Online]. Available: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2. Accessed on: Jul. 24, 2024.
  21. Sentence Transformers, “distiluse-base-multilingual-cased,” 2019. [Online]. Available: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased. Accessed on: Jul. 24, 2024.
  22. Sentence Transformers, “distiluse-base-multilingual-cased-v2,” 2019. [Online]. Available: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2. Accessed on: Jul. 24, 2024.
  23. L. Wang, “Multilingual-E5-small,” 2024. [Online]. Available: https://huggingface.co/intfloat/multilingual-e5-small. Accessed on: Jul. 25, 2024.
  24. L. Wang, “Multilingual-E5-base,” 2024. [Online]. Available: https://huggingface.co/intfloat/multilingual-e5-base. Accessed on: Jul. 25, 2024.
  25. L. Wang, “Multilingual-E5-large,” 2024. [Online]. Available: https://huggingface.co/intfloat/multilingual-e5-large. Accessed on: Jul. 23, 2024.
  26. Wikimedia, “CirrusSearch dumps,” 2024. [Online]. Available: https://dumps.wikimedia.org/other/cirrussearch/. Accessed on: May 18, 2024.
  27. M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou, “The Faiss library,” arXiv preprint 2401.08281, 2024. https://doi.org/10.48550/arXiv.2401.08281
  28. J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, June 2019. https://doi.org/10.1109/TBDATA.2019.2921572
  29. Y.A. Malkov and D.A. Yashunin, “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, Apr. 2020. https://doi.org/10.1109/TPAMI.2018.2889473
  30. J. Briggs, “Faiss: The missing manual,” in Pinecone, 2023. [Online]. Available: https://www.pinecone.io/learn/series/faiss/. Accessed on: Aug. 18, 2024.
DOI: https://doi.org/10.2478/acss-2025-0004 | Journal eISSN: 2255-8691 | Journal ISSN: 2255-8683
Language: English
Page range: 34 - 39
Submitted on: Sep 30, 2024
Accepted on: Jan 2, 2025
Published on: Feb 7, 2025
Published by: Riga Technical University
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Artem Kucheravy, Gints Jēkabsons, published by Riga Technical University
This work is licensed under the Creative Commons Attribution 4.0 License.