Textual outlier detection with an unsupervised method using text similarity and density peak

Mahnaz Taleb Sereshki; Morteza Mohammadi Zanjireh; Mahdi Bahaghighat

doi:10.2478/ausi-2023-0008

Abstract

Text mining is an active area of research, given the abundance of text on the Internet and in social media. Outliers, however, pose a challenge for textual data processing, so the ability to identify such irrelevant input is crucial for developing high-performance models. In this paper, we propose a novel unsupervised method for identifying outliers in text data. To detect outliers, we focus on the degree of similarity between any two documents and on the density of related documents, which might support integrated clustering during processing. To compare the effectiveness of the proposed approach with that of alternative classification techniques, we performed a number of experiments on a real dataset. The experimental results show that the proposed model achieves an accuracy greater than 98% and outperforms the other existing algorithms.
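The general idea of combining pairwise document similarity with a density estimate can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: the TF-IDF weighting, the cosine measure, the function names, and the parameters `sim_threshold` and `min_density` are all assumptions chosen for illustration. Each document's local density is taken to be the number of other documents whose similarity to it reaches a threshold, and documents with too few such neighbours are flagged as outlier candidates.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) from whitespace tokens."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF (an assumed weighting, not taken from the paper).
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def local_densities(docs, sim_threshold=0.2):
    """For each document, count the other documents at least sim_threshold similar."""
    vectors = tfidf_vectors(docs)
    return [
        sum(1 for j, v in enumerate(vectors)
            if j != i and cosine(u, v) >= sim_threshold)
        for i, u in enumerate(vectors)
    ]


def flag_outliers(docs, sim_threshold=0.2, min_density=1):
    """Return indices of documents whose local density is below min_density."""
    densities = local_densities(docs, sim_threshold)
    return [i for i, density in enumerate(densities) if density < min_density]
```

Calling `flag_outliers` on a small list of strings returns the indices of documents that have no sufficiently similar neighbour. In practice both thresholds would need tuning to the corpus, and a full density-peak procedure would use more than a simple neighbour count.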

