
Medical nearest-word embedding technique implemented using an unsupervised machine learning approach for Bengali language

Open Access | Jun 2024

Abstract

The rapid growth of natural language processing (NLP) applications, such as text summarization, speech recognition, information extraction, and machine translation, has created strong interest in translating natural language (NL) queries into structured query language (SQL) for extracting information from structured data. However, owing to limited linguistic resources, converting NL queries to SQL in Bengali is challenging. This article proposes an unsupervised machine learning model that finds semantically close Bengali words and uses them to generate SQL from Bengali NL queries. The main objective of the proposed system is to support the creation of patient-oriented explanations and educational resources by simplifying intricate medical terminology. The major findings of the proposed system are as follows: applying machine translation in the medical domain facilitates the dissemination of healthcare information to a diverse international audience and improves the performance of entity recognition tasks, such as identifying medical conditions, drugs, or procedures in clinical notes or electronic health records. The system allows a naive user to extract health-related information from a structured healthcare database without any knowledge of SQL: it accepts a query in Bengali and generates a response to that query in Bengali. Query tokenization and stop-word removal are carried out in the preprocessing stage, and unsupervised machine learning techniques are applied to the input query sentence. Tokenized words are converted into vectors using the skip-gram model, with noise-contrastive estimation (NCE) applied to discriminate between actual context words and irrelevant (noise) words. Stochastic gradient descent (SGD) optimizes the model on small, randomly chosen subsets of the dataset, and cosine similarity is used to measure how close two word vectors are. The semantically closest words identified by this unsupervised method are then used to generate the SQL.
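The following is a minimal illustrative sketch, not the authors' implementation, of the embedding and nearest-word step described in the abstract. It trains skip-gram vectors on a tiny, invented set of tokenized Bengali queries and retrieves the nearest words by cosine similarity; gensim's negative-sampling objective stands in here for the NCE loss used in the paper, and the corpus, stop-word handling, and query token are placeholders.

```python
# Sketch only: skip-gram embeddings over tokenized, stop-word-filtered Bengali
# queries, with nearest words ranked by cosine similarity. gensim's negative
# sampling is used as a stand-in for the NCE objective; the corpus is invented.
from gensim.models import Word2Vec

# Hypothetical preprocessed corpus: each query already tokenized, with Bengali
# stop words removed (English glosses in comments).
corpus = [
    ["রোগী", "জ্বর", "ওষুধ"],        # "patient", "fever", "medicine"
    ["রোগী", "রক্তচাপ", "রিপোর্ট"],   # "patient", "blood pressure", "report"
    ["ডাক্তার", "ওষুধ", "পরামর্শ"],   # "doctor", "medicine", "advice"
]

# Skip-gram (sg=1) with negative sampling; vectors updated by SGD-style steps.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimension
    window=2,          # context window around each target word
    min_count=1,       # keep every token in this tiny example
    sg=1,              # skip-gram architecture
    negative=5,        # negative samples per positive pair
    epochs=50,
)

# Nearest words to a query token, ranked by cosine similarity of their vectors.
print(model.wv.most_similar("ওষুধ", topn=3))
```

In the proposed pipeline, the words returned this way would then be matched against database attribute and value names to assemble the SQL query.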

Language: English
Submitted on: May 13, 2023
Published on: Jun 12, 2024
Published by: Professor Subhas Chandra Mukhopadhyay
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Kailash Pati Mandal, Prasenjit Mukherjee, Devraj Vishnu, Baisakhi Chakraborty, Tanupriya Choudhury, Pradeep Kumar Arya, published by Professor Subhas Chandra Mukhopadhyay
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.