
A novel approach to capture the similarity in summarized text using embedded model

Open Access | Apr 2022

Figures & Tables

Figure 1: Block diagram for proposed approach.

Figure 2: Workflow of proposed approach of near duplicate detection.

Figure 3: Similarity scores for the various text extraction methods.

Figure 4: Impact of text representation on similarity calculation.

Figure 5: Graphical representation of similarity scores using various similarity measure techniques.

Figure 6: Similarity score distribution using various similarity search techniques on original and summarized text.

Figure 7: Heat map (GloVe) using both approaches.

Figure 8: Comparison of similarity score of original vs. summarized text.

Input texts

Input | Original text
Text 1 | “Everyday large volume of data is gathered from different sources and are stored since they contain valuable piece of information. The storage of data must be done in efficient manner since it leads in difficulty during retrieval. Text data are available in the form of large documents. Understanding large text documents and extracting meaningful information out of it is time-consuming tasks. To overcome these challenges, text documents are summarized in with an objective to get related information from a large document or a collection of documents. Text mining can be used for this purpose. Summarized text will have reduced size as compare to original one. In this review, we have tried to evaluate and compare different techniques of Text summarization.”
Text 2 | “In the view of a significant increase in the burden of information over and over the limit by the amount of information available on the internet, there is a huge increase in the amount of information overloading and redundancy contained in each document. Extracting important information in a summarized format would help a number of users. It is therefore necessary to have proper and properly prepared summaries. Subsequently, many research papers are proposed continuously to develop new approaches to automatically summarize the text. ‘Automatic Text Summarization’ is a process to create a shorter version of the original text (one or more documents) which conveys information present in the documents. In general, the summary of the text can be categorized into two types: Extractive-based and Abstractive-based. Abstractive-based methods are very complicated as they need to address a huge-scale natural language. Therefore, research communities are focusing on extractive summaries, attempting to achieve more consistent, non-recurring and meaningful summaries. This review provides an elaborative survey of extractive text summarization techniques. Specifically, it focuses on unsupervised techniques, providing recent efforts and advances on them and list their strengths and weaknesses points in a comparative tabular manner. In addition, this review highlights efforts made in the evaluation techniques of the summaries and finally deduces some possible future trends.”

Categorization of Text similarity measurement techniques

Text similarity measure | Category | Considers semantics? | Approach used | Characteristics
String based | Character based | No | Hamming distance, Levenshtein distance, Damerau-Levenshtein, Needleman-Wunsch, Longest Common Subsequence, Smith-Waterman, Jaro, Jaro-Winkler and N-gram | Used to find typographical mistakes, but less efficient for text analytics and computationally less effective for large text documents; used in approximate string matching
String based | Token/term based | No | Jaccard similarity, Dice’s coefficient, cosine similarity, Manhattan distance and Euclidean distance | Useful in recognizing term rearrangement
Statistics based | Corpus/knowledge based | Yes | TF-IDF, Latent Semantic Indexing (LSI), Word2Vec, GloVe, Bidirectional Encoder Representations from Transformers (BERT), Latent Semantic Analysis (LSA), LDA | Uses only text representation and does not consider the distance between texts
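
To make these categories concrete, the short Python sketch below (an illustration added here, not code from the paper) computes one measure from each family on a toy sentence pair: a character-based Levenshtein distance, a token-based Jaccard score, and a statistics-based TF-IDF cosine score via scikit-learn.

```python
# Toy comparison of the three families of similarity measures.
# Illustrative sketch only; scikit-learn is assumed to be installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def levenshtein(a: str, b: str) -> int:
    """Character based: minimum number of edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Token based: overlap of the two word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


def tfidf_cosine(a: str, b: str) -> float:
    """Statistics based: cosine similarity of TF-IDF vectors."""
    tfidf = TfidfVectorizer().fit_transform([a, b])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]


t1 = "Text documents are summarized to extract meaningful information."
t2 = "Summarization extracts meaningful information from text documents."
print(levenshtein(t1, t2), jaccard(t1, t2), tfidf_cosine(t1, t2))
```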


Algorithm 2: Generation of summary for the input documents present in document_set using Extractive approach
1. function Generate_Summary(document_set)
Input: pair of text documents
returns generated summary
2. forall text document in document_set do
3. Pre-processing: Block level breaking of text into key phrases or sentences, Tokenization (sentences), Lemmatization, stemming, stop word removal, POS tagging, Named Entity Recognition
4. Identification of interrelated sentences: Similarity measuring functions are used to find related sentences to be included in the summary
5. Weighting and ranking of selected sentences: Numeric values are assigned to find important features. Higher ranked sentences are selected for summary
6. output_set := {Text 1_summary, Text 2_summary};
7. return output_set // pair of summarized text
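
Algorithm 2 leaves the summarizer abstract; elsewhere the paper reports generating the summaries with the LSA method. The sketch below is a minimal, hedged illustration of LSA-style sentence ranking using scikit-learn's TruncatedSVD; it replaces the fuller preprocessing listed in step 3 (lemmatization, POS tagging, NER) with naive sentence splitting, so it is an approximation rather than the authors' implementation.

```python
# Minimal LSA-style extractive summarizer (illustrative sketch, not the
# paper's implementation). Assumes scikit-learn is installed.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


def generate_summary(text: str, n_sentences: int = 3) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) <= n_sentences:
        return text
    # Sentence-term matrix weighted by TF-IDF.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Latent semantic analysis: project sentences onto latent topics.
    n_topics = max(1, min(2, tfidf.shape[1] - 1))
    svd = TruncatedSVD(n_components=n_topics, random_state=0)
    topic_weights = svd.fit_transform(tfidf)
    # Rank sentences by their overall strength across latent topics.
    scores = np.linalg.norm(topic_weights, axis=1)
    top = sorted(np.argsort(scores)[-n_sentences:])  # keep original order
    return ". ".join(sentences[i] for i in top) + "."


document_set = {"Text 1": "...", "Text 2": "..."}  # placeholders for the input pair
output_set = {name: generate_summary(doc) for name, doc in document_set.items()}
```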

Analysis of impact of embedding models on Text similarity measurement

No. of text similarity algorithms | Approach used | Average similarity score (in %) between Text 1 and Text 2 | Average similarity score (in %) between Text 1 summary and Text 2 summary | Difference (in %)
8 | Without text embedding models | 52.68 | 44.685 | 7.995
6 | With text embedding models | 71.19 | 71.71 | 0.52

j_ijssis-2022-0002_TU3

Algorithm 3: Text representation using embedding model to generate vectors
1. function Generate_vector(output_set)
Input: Pair of summarized text documents
returns vector representation for input document pairs
2. forall summarized text document in output_set do
3. vector_set = embedding_model(output_set);
4. vector_set={VText1, VText2};
5. return vector_set // pair of vectors
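
A hedged illustration of Algorithm 3 follows: each summarized document is mapped to a single vector by averaging pre-trained GloVe word vectors loaded through gensim's downloader. The model name "glove-wiki-gigaword-100" and the simple averaging scheme are assumptions made for this sketch, not necessarily the configuration used in the paper.

```python
# Illustrative text representation for Algorithm 3 (not the paper's exact setup).
# Assumes gensim is installed; "glove-wiki-gigaword-100" is one of the models
# shipped with gensim's downloader and is an arbitrary choice here.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use


def generate_vector(text: str) -> np.ndarray:
    """Average the GloVe vectors of in-vocabulary tokens (simple document encoder)."""
    tokens = [t for t in text.lower().split() if t in glove]
    if not tokens:
        return np.zeros(glove.vector_size)
    return np.mean([glove[t] for t in tokens], axis=0)


# e.g. one vector per summarized document produced by Algorithm 2
v_text1 = generate_vector("Text documents are summarized to extract related information.")
```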

Topic modeling on original text

Topic modelling (using LDA method) on | Topics with weights
Text 1 | Topic #1: [(‘different’, 1.06), (‘since’, 1.03), (‘data’, 0.97), (‘try’, 0.88), (‘evaluate’, 0.88), (‘technique’, 0.88), (‘review’, 0.88), (‘summarization’, 0.88)]
Text 1 | Topic #2: [(‘text’, 1.42), (‘document’, 1.39), (‘large’, 1.16), (‘form’, 1.01), (‘available’, 1.01), (‘summarize’, 0.91), (‘information’, 0.9), (‘meaningful’, 0.85)]
Text 2 | Topic #1: [(‘information’, 1.24), (‘summary’, 1.1), (‘summarize’, 1.05), (‘research’, 1.0), (‘amount’, 0.9), (‘increase’, 0.9), (‘help’, 0.84), (‘would’, 0.84)]
Text 2 | Topic #2: [(‘text’, 1.36), (‘based’, 1.34), (‘provide’, 1.08), (‘extractive’, 1.07), (‘summarization’, 1.07), (‘abstractive’, 1.06), (‘technique’, 1.02), (‘summary’, 1.01)]
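
Topics of the kind shown above can be reproduced with a small LDA pipeline. The sketch below uses scikit-learn's LatentDirichletAllocation with two topics per text to mirror the table's layout; the preprocessing and hyperparameters are illustrative assumptions rather than the paper's exact setup.

```python
# Illustrative two-topic LDA per document (not the paper's exact configuration).
# Assumes scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def top_topic_words(text: str, n_topics: int = 2, n_words: int = 8):
    # LDA works on word counts; the sentences of one document act as mini-documents.
    sentences = [s for s in text.split(".") if s.strip()]
    counts = CountVectorizer(stop_words="english")
    doc_term = counts.fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(doc_term)
    vocab = counts.get_feature_names_out()
    topics = []
    for weights in lda.components_:
        top = weights.argsort()[::-1][:n_words]
        topics.append([(vocab[i], round(float(weights[i]), 2)) for i in top])
    return topics


print(top_topic_words("Text documents are summarized. Summaries reduce size. "
                      "Text mining extracts meaningful information. "
                      "Large documents are hard to read."))
```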


Algorithm 1: Near duplicate detection using summarized text
1. document_set := {Text 1, Text 2}, threshold := ø // Initialize
2. function Near_Duplicate_Detection(document_set)
Input: Pair of text documents
returns labeled documents as near duplicate or non-duplicate
3. output_set = Generate_Summary(document_set); // Phase 1: Generation of summary
4. vector_set = Generate_vector(output_set); // Phase 2: Text representation
5. similarity_score = calculate_similarity_score(vector_set); // Similarity score calculation
6. if similarity_score > ø then // comparison with threshold
7.  label ‘Near Duplicate’
8. else
9.  label ‘Non Duplicate’
10. end function
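
Read end to end, Algorithm 1 reduces near duplicate detection to a single thresholded similarity score. The sketch below strings the phases together using the illustrative helpers sketched with Algorithms 2 and 3 and the calculate_similarity_score function sketched after Algorithm 4 below; the threshold of 0.8 is an arbitrary placeholder, not a value recommended by the paper.

```python
# Illustrative end-to-end flow of Algorithm 1, reusing the hedged helper sketches:
# generate_summary (Algorithm 2), generate_vector (Algorithm 3) and
# calculate_similarity_score (Algorithm 4). The 0.8 threshold is a placeholder.
def near_duplicate_detection(text_1: str, text_2: str, threshold: float = 0.8) -> str:
    summary_1, summary_2 = generate_summary(text_1), generate_summary(text_2)    # Phase 1
    vector_1, vector_2 = generate_vector(summary_1), generate_vector(summary_2)  # Phase 2
    score = calculate_similarity_score(vector_1, vector_2)                       # Phase 3
    return "Near Duplicate" if score > threshold else "Non Duplicate"
```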


Algorithm 4: Similarity score calculation for summarized text vectors
1. function calculate_similarity_score(vector_set)
Input: pair of vectors
returns similarity scores of the summarized text documents
2. similarity_score = similarity_function(vector_set)
3. return similarity_score
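
One concrete choice of similarity_function is plain cosine similarity between the two summary vectors, as sketched below; the paper evaluates several other measures as well, so this is an illustrative default rather than the method itself.

```python
# Cosine similarity as one possible similarity_function for Algorithm 4.
import numpy as np


def calculate_similarity_score(vector_1: np.ndarray, vector_2: np.ndarray) -> float:
    """Cosine similarity of two dense document vectors; 1.0 means identical direction."""
    denom = np.linalg.norm(vector_1) * np.linalg.norm(vector_2)
    return float(np.dot(vector_1, vector_2) / denom) if denom else 0.0
```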

Text representation techniques

Text representation method | Concept used | Characteristics | Merits | Demerits
Vector Space Model | Word count/BOW model | Uses the concept of linear algebra to compute similarity | Simple to compute, based on the frequency of words | Ignores the importance of rare words
Document vectors | TF-IDF vectors | Also computes the count of documents in which a particular word is present, along with its significance | Does not give importance to the most frequent words in the document, which do not contribute much to similarity computation | Does not consider the semantic aspect
Embedding model | Word embedding | High-dimensional representations of words | Handles words having similar meaning, i.e., synonyms; does not require any feature engineering | Cannot be applied directly in the computation of text similarity
Topic modeling | Latent Dirichlet Allocation (LDA) | Documents are represented by inherent latent topics, where each topic can be drawn as a probability distribution over words | Probabilistic model for defining the feature matrix of a document based on semantics | Requires prior knowledge of the number of topics and does not capture correlation

Result analysis

Similarity function | Similarity score (in %) on original text | Similarity score (in %) on summarized text
Without embedding model: Jaro Winkler [JW] | 68.0 | 72.80
Without embedding model: Cosine similarity with k shingles [CS_kshingles] | 89.0 | 81.30
With embedding model: Soft cosine similarity using FastText [FT_SoftCS] | 81.76 | 92.40
With embedding model: Cosine similarity with GloVe [GloVe_CS] | 97.89 | 95.60

Conventional near duplicate detection techniques

Category | Approach | Characteristics | Merits
Keyword based | BOW (Bag of Words) | Compares words and word frequencies with respect to other documents | Used for large documents; uses Term Frequency-Inverse Document Frequency (TF-IDF) to create fingerprints; reduces storage space
Fingerprint based | Shingling | Compares short phrases, adding context to the words | Fingerprints are created from tokenized documents using overlapping substrings and consecutive words; statistical concepts are used to find near duplicates
Fingerprint based | SimHash | Generates a fixed-length hash for each document, which is stored for duplicate detection | Obtains an ‘f’-bit fingerprint for each document; also used for dimension reduction
Hash based | MinHash | Phrases are hashed into numbers for comparison to identify duplication, and content hashes are stored | Stores a small amount of information per document for effective comparison
Hash based | Locality Sensitive Hashing (LSH) | Probabilistic approach to detect similar documents; the hash function generates similar hashes for similar shingles | The search space contains only documents that tend to be similar, which maximizes the probability of collision for similar content
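
As a concrete illustration of the fingerprint and hash based families, the sketch below builds word k-shingles and a small MinHash signature from scratch; the choices of k = 3 and 64 hash functions are arbitrary, and salting MD5 with a seed merely simulates a family of independent hash functions.

```python
# Illustrative k-shingling + MinHash signature (not the paper's code).
# k = 3 word shingles and 64 hash functions are arbitrary choices.
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set: set, num_hashes: int = 64) -> list:
    """Keep the minimum hash value of the shingle set under each salted hash."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of agreeing MinHash values approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```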

Similarity scores using traditional similarity metrics on original texts, topics, extracted keywords and summaries

Text similarity measure | Similarity score (in %) between Text 1 and Text 2 | Similarity score (in %) between topics of Text 1 and Text 2 | Similarity score (in %) between extracted keywords of Text 1 and Text 2 | Similarity score (in %) between Text 1 summary and Text 2 summary
Euclidean distance [ED] | 23.70 | 22.40 | 20.03 | 15.36
Normalized Levenshtein [NL] | 27.80 | 26.43 | 33.69 | 29.08
Hamming Distance [HD] | 40.0 | 7.14 | 10.8 | 27.0
Term Frequency-Inverse Document Frequency [TF-IDF] | 53.71 | 55.90 | 38.86 | 41.11
Jaccard Distance [JD] | 56.23 | 38.24 | 2.75 | 48.97
Cosine Similarity [CS] | 63.0 | 29.46 | 30.15 | 41.86
Jaro Winkler [JW] | 68.0 | 76.8 | 70.0 | 72.80
Cosine similarity with k shingles [CS_kshingles] | 89.0 | 62.5 | 61.92 | 81.30

Topic modeling on summary of original text

Topic modelling (using LDA method) applied on | Topics with weights
Text 1 summary | Topic #1: [(‘document’, 0.091), (‘data’, 0.065), (‘information’, 0.065), (‘piece’, 0.039), (‘contain’, 0.039), (‘summarize’, 0.039), (‘manner’, 0.039), (‘do’, 0.039), (‘must’, 0.039), (‘large’, 0.039)]
Text 1 summary | Topic #2: [(‘document’, 0.044), (‘information’, 0.044), (‘data’, 0.044), (‘source’, 0.044), (‘different’, 0.043), (‘valuable’, 0.043), (‘lead’, 0.043), (‘challenge’, 0.043), (‘collection’, 0.043), (‘relate’, 0.043)]
Text 2 summary | Topic #1: [(‘information’, 0.056), (‘increase’, 0.040), (‘effort’, 0.040), (‘amount’, 0.040), (‘technique’, 0.040), (‘specifically’, 0.024), (‘unsupervised’, 0.024), (‘future’, 0.024), (‘overload’, 0.024), (‘comparative’, 0.024)]
Text 2 summary | Topic #2: [(‘information’, 0.027), (‘technique’, 0.027), (‘amount’, 0.027), (‘effort’, 0.026), (‘increase’, 0.026), (‘possible’, 0.026), (‘redundancy’, 0.026), (‘make’, 0.026), (‘summary’, 0.026), (‘strength’, 0.026)]

Key phrase extraction on Text 1 and Text 2 using weighted TF-IDF method

Key phrase extraction method applied on | Key phrases with weights
Text 1 | [(‘form’, 0.577), (‘large documents’, 0.577), (‘text data’, 0.577), (‘large text documents’, 0.577), (‘meaningful information’, 0.577), (‘time-consuming tasks’, 0.577), (‘different techniques’, 0.577), (‘review’, 0.577), (‘text summarization’, 0.577), (‘different sources’, 0.476)]
Text 2 | [(‘prepared summaries’, 1.0), (‘abstractive-based methods’, 0.707), (‘huge-scale natural language’, 0.707), (‘documents’, 0.667), (‘summary’, 0.632), (‘types’, 0.632), (‘elaborative survey’, 0.577), (‘extractive text summarization techniques’, 0.577), (‘review’, 0.577), (‘many research papers’, 0.534)]
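
Weighted key phrases of this kind can be approximated with a TF-IDF vectorizer over n-grams; the sketch below (scikit-learn, unigrams to trigrams, top 10 phrases) is an approximation of a weighted TF-IDF extractor, not the exact method behind the table.

```python
# Illustrative TF-IDF key phrase extraction over unigrams to trigrams.
# An approximation of a weighted TF-IDF method, not the paper's exact code.
from sklearn.feature_extraction.text import TfidfVectorizer


def key_phrases(text: str, top_n: int = 10):
    vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    weights = vec.fit_transform([text]).toarray()[0]
    vocab = vec.get_feature_names_out()
    ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
    return [(phrase, round(weight, 3)) for phrase, weight in ranked[:top_n]]
```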

Recent research studies on text similarity and representation

Concept/algorithm/method used | Author(s) | Usage
Text similarity (SimHash, MinHash), text clustering | Pamulaparty et al. (2014, 2015, 2017); Hassanian-esfahania and Karga (2018) | Near duplicate detection on the basis of keywords generated from text; Fuzzy C-means clustering with a discriminant function; random forest method for classification of near duplicates
Text similarity | Yung-Shen et al. (2013); Gali et al. (2016) | Near duplicate detection on the basis of 21 similarity metrics computed between a pair of documents or two titles
Signature based text similarity measurement | Mohammadi and Khasteh (2020); Hajishirzi et al. (2010) | Reference texts are generated using genetic algorithms to obtain signatures for text documents as sequences of 3-grams for detection of duplicate and near duplicate documents; for generating signatures, cosine text similarity is used on the CiteseerX, Enron and Gold Set of Near-duplicate News Articles datasets
Text similarity | Do and LongVan (2015) | Near duplicate detection by applying signatures generated, based on ontology, on extracted key phrases
Text representation methods | Al-Subaihin et al. (2019); Mishra (2019) | TF-IDF combined with LSI for topic modeling; spam classification
Text mining, clustering, natural language processing and text similarity | Alqahtani et al. (2021) | Text matching methods
Semantic similarity | Chandrasekaran and Mago (2021) | Any NLP task which involves semantic textual similarity
Semantic similarity | Roul and Sahoo (2020) | Near duplicate detection of web pages on the DUC dataset
Deep learning based semantic similarity | Mansoor et al. (2020) | Sentence similarity using LSTM and CNN pre-trained with word2vec on the Quora dataset
Text representation using the ELMo model | Peters et al. (2018) | Question answering, textual entailment, semantic role labelling, named entity extraction, sentiment analysis
Text representation using the FastText model | Shashavali et al. (2019) | Goal oriented conversational agents (chatbots)
Text similarity based on distance | Stefanovič et al. (2019) | Plagiarism detection
Semantic similarity for short text based on corpus, knowledge and deep learning models | Han et al. (2021) | Text classification and text clustering, sentiment analysis, information retrieval, social networks, plagiarism detection
Text classification based on a text embedding method | Li and Gong (2021) | Deep learning text classification on the Sohu news dataset
Text similarity based on text distance and text representation | Wang and Dong (2020) | Information retrieval, machine translation, question answering, document matching
Text representation using the BERT model | Wang et al. (2019) | Extractive-abstractive text summarization with the BERT embedding model and reinforcement learning on the CNN/Daily Mail and DUC2002 datasets
Word embedding model, text classification, word tagging | Ajees et al. (2021); Alqrainy and Alawairdhi (2021) | SVM classification to classify animate nouns in Malayalam text; a comprehensive tag set for the Arabic language
Lexical taxonomy | Nazar et al. (2021) | Elimination of incorrect hypernym links; taxonomy with new relations in Spanish, English and French

Similarity scores using text embedding models on original and summarized documents

Embedding model | Similarity score (in %) between Text 1 and Text 2 | Similarity score (in %) between Text 1 summary and Text 2 summary
Word2Vec | 5.28 | 14.26
Universal Sentence Encoder [USE] | 81.36 | 69.39
FastText with soft cosine similarity [FT_SoftCS] | 81.76 | 92.40
ELMo with cosine similarity [ELMo_CS] | 88.59 | 76.32
GloVe with cosine similarity [GloVe_CS] | 97.89 | 95.60
BERT with cosine similarity [BERT_CS] | 72.28 | 82.29

Different embedding models for text representation (Khattak et al., 2019; Mishra et al., 2020)

Embedding model | Characteristics | Merits | Demerits | Variants
One hot encoding | Maps each word from the vocabulary to a unique index in vector space | Learns dense representations of words | Dependent on corpus knowledge | -
Word2Vec | Maps each word to a point in vector space, e.g., Continuous Bag of Words (CBOW), Skip-gram | Used in neural networks for predicting focus words as a prediction-based model | Dimension is between 50 and 500; context window is between 5 and 10 | Doc2Vec, paragraph2vec, e.g., Distributed Memory Model of Paragraph Vectors (PV-DM), Paragraph Vector Continuous Bag of Words (PV-CBOW)
GloVe | A term co-occurrence matrix based on vocabulary size is used | Minimized reconstruction error; captures longer dependencies due to a larger context window; count based model | Order of dependencies is not preserved; performance depends on data type | GloVe with skip gram window
FastText | Sub-words are also considered | Extends the functionality of Word2Vec skip-gram to handle out-of-vocabulary (OOV) words | Longer time to train | Probabilistic FastText
Embedding from Language Models (ELMo) | Captures context at both word and character level; the same word can be used in different contexts | Performs sentence level embedding using bidirectional Recurrent Neural Networks (RNN); can be used in transfer learning | Unable to use left-to-right and right-to-left context at the same time | -
Bidirectional Encoder Representations from Transformers (BERT) | Considers bidirectional representations in unsupervised mode | Can be pre-trained using one extra output layer | Random tokens are replaced by a special token (‘Mask’) to consider both left-to-right and right-to-left information at the same time | Robustly Optimized BERT Pre-training Approach (RoBERTa), A Lite version of BERT (ALBERT), Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA), Generalized Autoregressive Pre-training for Language Understanding (XLNet), Distilled version of BERT (DistilBERT), BERT for Summarization (BERTSUM)
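
For contextual models such as BERT, a document vector is commonly obtained by mean pooling the final hidden states. The sketch below uses the Hugging Face transformers library with the generic "bert-base-uncased" checkpoint; both the checkpoint and the pooling strategy are assumptions made for illustration, not necessarily the variant used in the paper.

```python
# Illustrative BERT document embedding via mean pooling (not the paper's setup).
# Assumes the transformers and torch packages are installed;
# "bert-base-uncased" is a generic checkpoint chosen for the example.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")


def bert_vector(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # mean-pooled document vector
```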

Text summarization on original text

Text summarization (using LSA method) on | Generated summary
Text 1 | “Everyday large volume of data is gathered from different sources and are stored since they contain valuable piece of information. The storage of data must be done in efficient manner since it leads in difficulty during retrieval. To overcome these challenges, text documents are summarized in with an objective to get related information from a large document or a collection of documents.”
Text 2 | “In the view of a significant increase in the burden of information over and over the limit by the amount of information available on the internet, there is a huge increase in the amount of information overloading and redundancy contained in each document. Specifically, it focuses on unsupervised techniques, providing recent efforts and advances on them and list their strengths and weaknesses points in a comparative tabular manner. In addition, this review highlights efforts made in the evaluation techniques of the summaries and finally deduces some possible future trends.”

Popular Text similarity metrics (Pamulaparty et al., 2014, 2015; Gali et al., 2016; Yung-Shen et al., 2013)

Similarity measurement method | Highlights
Euclidean distance | Considers the distance between texts in vector form; uses the frequency of tokens to generate feature vectors
Cosine | Considers the angle between two vectors; fails to capture variations of the representation for unstructured/semi-structured text
Manhattan | Considers the distance between two real-valued vectors
Hamming | Considers the count of positions in which two bits differ; binary strings must be of the same length
Jaccard distance | Computes the length of two strings and then finds common characters to indicate their presence in near locations; transposition in reverse order is performed to find matching characters between two strings
Jaro Winkler | Extends the Jaro distance metric with a prefix weight (p = 0.1), giving higher weight to strings sharing a common prefix; the score lies in the range 0 to 1 (Xiao et al., 2008; Khattak et al., 2019)
Cosine similarity with k shingles/k-grams | Shingling a document takes consecutive words and groups them as a single object; in general, the set of all 1-shingles represents the ‘bag of words’ model
TF-IDF | Based on term frequency (TF), the count of occurrences of a token in a document; inverse document frequency (IDF) finds the relevance of unique or rare words; cosine similarity with TF-IDF vectors is used to find similarity scores
Normalized Levenshtein | Based on the minimum number of edit operations
Soft-TFIDF | Combines TF-IDF and Jaro Winkler; Jaro Winkler first finds pairs of tokens common to both strings, and TF-IDF is then applied to the pairs whose Jaro Winkler score exceeds a suitable threshold
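
Among these metrics, cosine similarity with k shingles is the strongest traditional baseline in the result tables. The short sketch below treats each document as a binary bag of word shingles (k = 3 chosen arbitrarily) and computes the cosine of the two shingle sets.

```python
# Illustrative cosine similarity over word k-shingles (k = 3 is arbitrary).
import math


def shingle_cosine(a: str, b: str, k: int = 3) -> float:
    def shingles(text: str) -> set:
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    # Binary bag-of-shingles cosine: |A intersect B| / sqrt(|A| * |B|)
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))
```
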
Language: English
Page range: 1 - 20
Submitted on: Oct 25, 2021
Published on: Apr 16, 2022
Published by: Professor Subhas Chandra Mukhopadhyay
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2022 Asha Rani Mishra, V.K. Panchal, published by Professor Subhas Chandra Mukhopadhyay
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

Volume 15 (2022): Issue 1 (January 2022)