Building English – Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora

Dilshad Kaur; Satwinder Singh

doi:10.2478/acss-2023-0024

Applied Computer Systems

Volume 28 (2023): Issue 2 (December 2023)

By: Dilshad Kaur and Satwinder Singh

Open Access

Jan 2024

Abstract

Comparable corpora are abundantly available, which makes them a practical source for extracting parallel data, especially for language pairs where parallel data are scarce. This study focuses on building parallel data for the Punjabi-English language pair. The raw data were collected from the web content of “Mann Ki Baat”, a collection of the texts of speeches by the Prime Minister of India, Mr. Narendra Modi, broadcast on the last Sunday of every month. The data were cleaned and pre-processed using a natural language toolkit. A BERT-based alignment model was built to align the two text files at the sentence level. Noun forms were then extracted with the NLTK library in Python. The resulting noun-aligned dataset for the English-Punjabi language pair is available in the Mendeley Data repository.
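
The two core steps named in the abstract, sentence-level alignment and noun filtering, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the BERT model, the similarity measure, or any threshold, so the sketch assumes cosine similarity over sentence embeddings with a hypothetical cut-off of 0.8, and it uses small hand-made vectors in place of real BERT output. The function names `align_sentences` and `keep_nouns` are likewise hypothetical, and the noun filter assumes tokens already carry Penn Treebank tags of the kind produced by NLTK's default English tagger.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(src_vecs, tgt_vecs, threshold=0.8):
    """Pair each source sentence with its most similar target sentence.

    Returns (source_index, target_index, similarity) triples, keeping only
    pairs whose similarity reaches `threshold` (an assumed value).
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


def keep_nouns(tagged_tokens):
    """Keep tokens whose Penn Treebank tag starts with 'NN' (NN, NNS, NNP, NNPS)."""
    return [token for token, tag in tagged_tokens if tag.startswith("NN")]


# Hand-made stand-ins for sentence embeddings (assumption: a real pipeline
# would obtain these from a multilingual BERT-style encoder).
english_vecs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
punjabi_vecs = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.1], [-1.0, 0.0, 0.0]]

aligned = align_sentences(english_vecs, punjabi_vecs)
for src_idx, tgt_idx, score in aligned:
    print(f"English sentence {src_idx} <-> Punjabi sentence {tgt_idx} ({score:.3f})")

# Pre-tagged English tokens (assumption: tags come from a POS tagger).
tagged = [("Prime", "NNP"), ("Minister", "NNP"), ("spoke", "VBD"),
          ("about", "IN"), ("farmers", "NNS")]
print(keep_nouns(tagged))
```

In this toy run the third English sentence has no Punjabi counterpart above the cut-off and is dropped, which mirrors why a threshold is useful when the source is comparable rather than strictly parallel text.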

DOI: https://doi.org/10.2478/acss-2023-0024 | Journal eISSN: 2255-8691 | Journal ISSN: 2255-8683

Language: English

Page range: 245 - 251

Published on: Jan 29, 2024

Published by: Riga Technical University

In partnership with: Paradigm Publishing Services

Publication frequency: Volume open

Related subjects:

Artificial intelligence,

Information technology,

Project management,

Software development

© 2024 Dilshad Kaur, Satwinder Singh, published by Riga Technical University
This work is licensed under the Creative Commons Attribution 4.0 License.