
Beyond Accuracy: Cross-Linguistic Equity and Socio-Technical Dimensions of Large Language Models

Open Access | Feb 2026

Abstract

Artificial intelligence (AI) and AI-based systems are rapidly gaining popularity across all areas of daily life. Among these systems, large language models (LLMs), which model language probabilistically to understand and generate text, stand at the forefront. Because language is the primary focus of LLMs, their ability to generate reliable results is of significant technical and social importance. As linguistic diversity increases, however, the ability of LLMs to produce stable and consistent results declines. This decline is closely related to model size, the scope of the training data, and the prompting technique used during response generation. To this end, a study was conducted to measure the success of LLMs across different languages. Four LLMs were examined: three open-source (DeepSeek-Coder-6.7B-Instruct, Qwen2.5-Coder-7B-Instruct, Llama-3.1-8B-Instruct) and one closed-source (GPT-5). These models were evaluated using the HumanEval-XL dataset across seven natural languages with different data sources and usage prevalence. In addition, the effects of the prompting technique and of the human development index (HDI) values of the countries where the languages are spoken were analysed. The results show that as LLMs grow, performance differences between languages decrease. Whether a model is open-source or closed-source was also observed to have a significant impact on performance. Among the open-source LLMs, DeepSeek-Coder-6.7B-Instruct's accuracy ranges from 37 % to 60 %, while Qwen2.5-Coder-7B-Instruct and Llama-3.1-8B-Instruct perform more consistently, in the 95–99 % range. GPT-5, the closed-source LLM, demonstrates balanced accuracy across all languages. These findings have notable implications for ethics, the quantity of available linguistic data, and equality of access to technology.
The results also clearly demonstrate the relationship between multilingual accuracy, language prevalence, and prompting techniques. In this way, the study offers a clearer and more comprehensive understanding of linguistic justice and the generalisation of LLMs in the field of AI.

DOI: https://doi.org/10.2478/acss-2026-0001 | Journal eISSN: 2255-8691 | Journal ISSN: 2255-8683
Language: English
Page range: 1 - 16
Submitted on: Nov 20, 2025 | Accepted on: Feb 2, 2026 | Published on: Feb 18, 2026
In partnership with: Paradigm Publishing Services
Publication frequency: Volume open

© 2026 Fidan Kaya Gülağız, published by Riga Technical University
This work is licensed under the Creative Commons Attribution 4.0 License.

Volume 31 (2026): Issue 1 (January 2026)