# Paraphrasing (computational linguistics)

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection.[1] Paraphrasing is also useful in the evaluation of machine translation,[2] as well as in semantic parsing[3] and in the generation of new samples to expand existing corpora.[4]

## Paraphrase generation

### Multiple sequence alignment

Barzilay and Lee[4] proposed a method to generate paraphrases through the use of monolingual parallel corpora, namely news articles covering the same event on the same day. Training consists of using multiple-sequence alignment to generate sentence-level paraphrases from an unannotated corpus. This is done by

• finding recurring patterns in each individual corpus, e.g. "X (injured/wounded) Y people, Z seriously", where X, Y, and Z are variables
• finding pairings between such patterns that represent paraphrases, e.g. "X (injured/wounded) Y people, Z seriously" and "Y were (wounded/hurt) by X, among them Z were in serious condition"

This is achieved by first clustering similar sentences together using n-gram overlap. Recurring patterns are found within clusters using multiple-sequence alignment. The positions of argument words are then determined by finding areas of high variability within each cluster, that is, positions between words shared by more than 50% of a cluster's sentences. Pairings between patterns are then found by comparing similar variable words between different corpora. Finally, new paraphrases can be generated by choosing a matching cluster for a source sentence, then substituting the source sentence's arguments into any number of patterns in the cluster.
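The initial clustering step can be sketched with a simple greedy pass over sentences. The similarity threshold and the choice of bigrams below are illustrative assumptions, not values from the paper:

```python
def ngrams(tokens, n):
    """Set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(s1, s2, n=2):
    """Jaccard overlap of n-grams between two tokenized sentences."""
    a, b = ngrams(s1, n), ngrams(s2, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(sentences, threshold=0.3, n=2):
    """Greedy single-link clustering of sentences by n-gram overlap."""
    clusters = []
    for sent in sentences:
        tokens = sent.split()
        for c in clusters:
            # Join the first cluster containing a sufficiently similar sentence.
            if any(overlap(tokens, other, n) >= threshold for other in c):
                c.append(tokens)
                break
        else:
            clusters.append([tokens])
    return clusters
```

Sentences describing the same event ("a bomb injured 5 people" and "a bomb injured 12 people") share enough bigrams to land in one cluster, within which multiple-sequence alignment would then extract the recurring pattern.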

### Phrase-based Machine Translation

Paraphrases can also be generated through the use of phrase-based translation, as proposed by Bannard and Callison-Burch.[5] The chief concept is to align phrases in a pivot language to produce potential paraphrases in the original language. For example, the phrase "under control" in an English sentence is aligned with the phrase "unter kontrolle" in its German counterpart. The phrase "unter kontrolle" is then found in another German sentence whose aligned English phrase is "in check", a paraphrase of "under control".

The probability distribution can be modeled as ${\displaystyle \Pr(e_{2}|e_{1})}$, the probability that phrase ${\displaystyle e_{2}}$ is a paraphrase of ${\displaystyle e_{1}}$, which is equivalent to ${\displaystyle \Pr(e_{2}|f)\Pr(f|e_{1})}$ summed over all ${\displaystyle f}$, the potential phrase translations in the pivot language. Additionally, the sentence ${\displaystyle S}$ in which ${\displaystyle e_{1}}$ appears is added to give context to the paraphrase. Thus the optimal paraphrase ${\displaystyle {\hat {e_{2}}}}$ can be modeled as:

${\displaystyle {\hat {e_{2}}}={\text{arg}}\max _{e_{2}\neq e_{1}}\Pr(e_{2}|e_{1},S)={\text{arg}}\max _{e_{2}\neq e_{1}}\sum _{f}\Pr(e_{2}|f,S)\Pr(f|e_{1},S)}$

${\displaystyle \Pr(e_{2}|f)}$ and ${\displaystyle \Pr(f|e_{1})}$ can be approximated simply by taking their relative frequencies. Conditioning on ${\displaystyle S}$ is modeled by calculating the probability of forming ${\displaystyle S}$ when ${\displaystyle e_{1}}$ is substituted with ${\displaystyle e_{2}}$.
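The pivot computation, without the sentence context ${\displaystyle S}$, can be sketched from phrase-alignment counts. The toy English–German phrase pairs below are illustrative assumptions, not data from the paper:

```python
from collections import Counter

# Toy phrase-alignment counts: (english_phrase, pivot_phrase) pairs, as might
# be extracted from a word-aligned bilingual corpus (illustrative data only).
alignments = [
    ("under control", "unter kontrolle"),
    ("under control", "unter kontrolle"),
    ("in check", "unter kontrolle"),
    ("in check", "im griff"),
    ("under control", "im griff"),
]

pair_counts = Counter(alignments)
e_counts = Counter(e for e, _ in alignments)   # marginal counts of English phrases
f_counts = Counter(f for _, f in alignments)   # marginal counts of pivot phrases

def p_f_given_e(f, e):
    """Relative frequency estimate of Pr(f | e)."""
    return pair_counts[(e, f)] / e_counts[e]

def p_e_given_f(e, f):
    """Relative frequency estimate of Pr(e | f)."""
    return pair_counts[(e, f)] / f_counts[f]

def paraphrase_prob(e2, e1):
    """Pr(e2 | e1): sum over pivot phrases f of Pr(e2 | f) * Pr(f | e1)."""
    pivots = {f for e, f in alignments if e == e1}
    return sum(p_e_given_f(e2, f) * p_f_given_e(f, e1) for f in pivots)

# The arg max over candidate paraphrases e2 != e1.
best = max((e for e in e_counts if e != "under control"),
           key=lambda e: paraphrase_prob(e, "under control"))
```

On these counts, "in check" emerges as the highest-probability paraphrase of "under control", reached through both pivot phrases.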

### Long short-term memory

There has been success in using long short-term memory (LSTM) models to generate paraphrases.[6] In short, the model consists of an encoder and decoder component, both implemented using variations of a stacked residual LSTM. First, the encoding LSTM takes a one-hot encoding of all the words in a sentence as input and produces a final hidden vector, which can be viewed as a representation of the input sentence. The decoding LSTM then takes the hidden vector as input and generates a new sentence, terminating in an end-of-sentence token. The encoder and decoder are trained to take a phrase and reproduce the one-hot distribution of a corresponding paraphrase by minimizing perplexity using simple stochastic gradient descent. New paraphrases are generated by inputting a new phrase to the encoder and passing the output to the decoder.
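The encoder half can be sketched as a minimal single-layer LSTM in NumPy, showing how one-hot word vectors are folded into a final hidden vector. The weights here are random stand-ins for trained parameters, and the stacked residual layers and the decoder of the cited model are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<eos>", "the", "cat", "sat"]
V, H = len(vocab), 8                  # vocabulary size, hidden size (toy values)

# Randomly initialized LSTM parameters; a real model would learn these.
Wx = rng.normal(0, 0.1, (4 * H, V))   # input-to-gates weights
Wh = rng.normal(0, 0.1, (4 * H, H))   # hidden-to-gates weights
b = np.zeros(4 * H)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(tokens):
    """Run one LSTM pass over one-hot word vectors; return final hidden state."""
    h, c = np.zeros(H), np.zeros(H)
    for tok in tokens:
        x = np.zeros(V)
        x[vocab.index(tok)] = 1.0          # one-hot encoding of the word
        gates = Wx @ x + Wh @ h + b
        i, f, o, g = np.split(gates, 4)    # input, forget, output gates, candidate
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g                  # cell state update
        h = o * np.tanh(c)                 # hidden state update
    return h                               # the sentence representation

sentence_vec = encode(["the", "cat", "sat"])
```

A decoder LSTM would run the same recurrence in reverse roles, conditioned on `sentence_vec`, emitting one word per step until the end-of-sentence token.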

## Paraphrase recognition

### Recursive Autoencoders

Paraphrase recognition has been attempted by Socher et al.[1] through the use of recursive autoencoders. The main concept is to produce a vector representation of a sentence, along with its components, by recursively applying an autoencoder. Since paraphrases should have similar vector representations, the resulting vectors are processed and then fed as input into a neural network for classification.

Given a sentence ${\displaystyle W}$ with ${\displaystyle m}$ words, the autoencoder is designed to take two ${\displaystyle n}$-dimensional word embeddings as input and produce an ${\displaystyle n}$-dimensional vector as output. The same autoencoder is applied to every pair of adjacent words in ${\displaystyle W}$ to produce ${\displaystyle \lfloor m/2\rfloor }$ vectors. The autoencoder is then applied recursively with the new vectors as inputs until a single vector is produced. Given an odd number of inputs, the first vector is forwarded as-is to the next level of recursion. The autoencoder is trained to reproduce every vector in the full recursion tree, including the initial word embeddings.

Given two sentences ${\displaystyle W_{1}}$ and ${\displaystyle W_{2}}$ of length 4 and 3 respectively, the autoencoders would produce 7 and 5 vector representations, including the initial word embeddings. The Euclidean distance is then taken between every combination of vectors in ${\displaystyle W_{1}}$ and ${\displaystyle W_{2}}$ to produce a similarity matrix ${\displaystyle S\in \mathbb {R} ^{7\times 5}}$. ${\displaystyle S}$ is then subject to a dynamic min-pooling layer to produce a fixed-size ${\displaystyle n_{p}\times n_{p}}$ matrix. Since similarity matrices are not uniform in size across sentence pairs, ${\displaystyle S}$ is split into ${\displaystyle n_{p}}$ roughly even sections along each dimension. The output is then normalized to have mean 0 and standard deviation 1 and is fed into a fully connected layer with a softmax output. The dynamic pooling to softmax model is trained using pairs of known paraphrases.
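The recursion and pooling steps can be sketched as follows. The autoencoder weights are random stand-ins (a trained model would minimize reconstruction error over the tree), and the pooled size ${\displaystyle n_{p}=2}$ is a toy value:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                     # embedding dimension (toy value)

# Randomly initialized encoder weights; stand-ins for trained parameters.
W = rng.normal(0, 0.1, (n, 2 * n))
b = np.zeros(n)

def encode_pair(v1, v2):
    """Compress two n-dim vectors into one n-dim vector."""
    return np.tanh(W @ np.concatenate([v1, v2]) + b)

def recursion_tree(embeddings):
    """All vectors in the recursion tree, including the word embeddings."""
    vectors = list(embeddings)
    level = list(embeddings)
    while len(level) > 1:
        nxt, start = [], 0
        if len(level) % 2 == 1:           # odd count: forward the first vector
            nxt.append(level[0])
            start = 1
        for i in range(start, len(level), 2):
            v = encode_pair(level[i], level[i + 1])
            nxt.append(v)
            vectors.append(v)             # only newly produced vectors are added
        level = nxt
    return vectors

def similarity_matrix(vecs1, vecs2):
    """Pairwise Euclidean distances between the two vector sets."""
    return np.array([[np.linalg.norm(a - c) for c in vecs2] for a in vecs1])

def dynamic_min_pool(S, np_size=2):
    """Min-pool S into a fixed np_size x np_size matrix over rough sections."""
    rows = np.array_split(np.arange(S.shape[0]), np_size)
    cols = np.array_split(np.arange(S.shape[1]), np_size)
    return np.array([[S[np.ix_(r, c)].min() for c in cols] for r in rows])

w1 = [rng.normal(size=n) for _ in range(4)]   # 4-word sentence embeddings
w2 = [rng.normal(size=n) for _ in range(3)]   # 3-word sentence embeddings
S = similarity_matrix(recursion_tree(w1), recursion_tree(w2))
pooled = dynamic_min_pool(S)
```

As described above, the 4-word and 3-word sentences yield 7 and 5 vectors, a 7×5 similarity matrix, and a fixed 2×2 pooled matrix regardless of sentence lengths.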

### Skip-thought vectors

Skip-thought vectors are an attempt to create a vector representation of the semantic meaning of a sentence, in a fashion similar to the skip-gram model.[7] Skip-thought vectors are produced through the use of a skip-thought model, which consists of three key components: an encoder and two decoders. Given a corpus of documents, the skip-thought model is trained to take a sentence as input and encode it into a skip-thought vector. The skip-thought vector is used as input for both decoders, one of which attempts to reproduce the previous sentence and the other the following sentence in its entirety. The encoder and decoders can be implemented through the use of a recurrent neural network (RNN) or an LSTM.

Since paraphrases carry the same semantic meaning as one another, they should have similar skip-thought vectors. Thus a simple logistic regression can be trained to good performance using the absolute difference and component-wise product of two skip-thought vectors as input.
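The feature construction for that classifier can be sketched directly. The 8-dimensional vectors below are toy stand-ins for real encoder output, which is much higher-dimensional:

```python
import numpy as np

def paraphrase_features(u, v):
    """Features for a sentence pair: absolute difference and
    component-wise product of the two skip-thought vectors."""
    return np.concatenate([np.abs(u - v), u * v])

rng = np.random.default_rng(2)
u = rng.normal(size=8)                  # "skip-thought" vector of sentence 1
v = u + rng.normal(0.0, 0.01, size=8)   # a near-paraphrase: nearly identical

features = paraphrase_features(u, v)    # input to a logistic regression
```

For a true paraphrase pair the absolute-difference half of the feature vector stays near zero, which is exactly the signal the logistic regression learns to exploit.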

## Evaluation

There are multiple methods that can be used to evaluate paraphrases. Since paraphrase recognition can be posed as a classification problem, most standard evaluation metrics such as accuracy, F1 score, or an ROC curve do relatively well. However, F1 scores are difficult to calculate because of the trouble of producing a complete list of paraphrases for a given phrase, and because good paraphrases are dependent upon context. A metric designed to counter these problems is ParaMetric.[8] ParaMetric aims to calculate the precision and recall of an automatic paraphrase system by comparing the automatic alignment of paraphrases to a manual alignment of similar phrases. Since ParaMetric simply rates the quality of phrase alignment, it can be used to rate paraphrase generation systems as well, assuming they use phrase alignment as part of their generation process. A noted drawback of ParaMetric is the large, exhaustive set of manual alignments that must be created before a rating can be produced.

The evaluation of paraphrase generation has difficulties similar to those of the evaluation of machine translation. Often the quality of a paraphrase is dependent upon its context, whether it is being used as a summary, and how it is generated, among other factors. Additionally, a good paraphrase is usually lexically dissimilar from its source phrase. The simplest method of evaluating paraphrase generation is through the use of human judges. Unfortunately, evaluation through human judges tends to be time-consuming. Automated approaches to evaluation prove challenging, as the problem is essentially as difficult as paraphrase recognition. While originally used to evaluate machine translations, bilingual evaluation understudy (BLEU) has been used successfully to evaluate paraphrase generation models as well. However, paraphrases often have several lexically different but equally valid solutions, which hurts BLEU and similar evaluation metrics.[9]

Metrics specifically designed to evaluate paraphrase generation include paraphrase in n-gram change (PINC)[9] and paraphrase evaluation metric (PEM),[10] along with the aforementioned ParaMetric. PINC is designed to be used in conjunction with BLEU and help cover its inadequacies. Since BLEU has difficulty measuring lexical dissimilarity, PINC measures the lack of n-gram overlap between a source sentence and a candidate paraphrase. It is essentially the Jaccard distance between the two sentences, excluding n-grams that appear in the source sentence, in order to maintain some semantic equivalence. PEM, on the other hand, attempts to evaluate the "adequacy, fluency, and lexical dissimilarity" of paraphrases by returning a single-value heuristic calculated using n-gram overlap in a pivot language. However, a large drawback of PEM is that it must be trained using a large, in-domain parallel corpus as well as human judges.[9] In other words, it is tantamount to training a paraphrase recognition system in order to evaluate a paraphrase generation system.
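PINC can be sketched as the average, over n-gram orders up to 4, of the fraction of candidate n-grams that do not appear in the source sentence:

```python
def ngrams(tokens, n):
    """Set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """PINC: average fraction of candidate n-grams absent from the source.
    Higher scores reward lexical dissimilarity from the source sentence."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:               # candidate shorter than n words
            continue
        hits = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - hits / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

A verbatim copy of the source scores 0, a candidate sharing no words scores 1, and partial rewordings fall in between; in practice PINC is paired with BLEU so that dissimilar but inadequate candidates are still penalized.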

## References

1. ^ a b Socher, Richard; Huang, Eric; Pennington, Jeffrey; Ng, Andrew; Manning, Christopher (2011), Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
2. ^ Callison-Burch, Chris (October 25–27, 2008). "Syntactic Constraints on Paraphrases Extracted from Parallel Corpora". EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii. pp. 196–205.
3. ^ Berant, Jonathan, and Percy Liang. "Semantic parsing via paraphrasing." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2014.
4. ^ a b Barzilay, Regina; Lee, Lillian (May–June 2003). "Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment". Proceedings of HLT-NAACL 2003.
5. ^ Bannard, Colin; Callison-Burch, Chris (2005). "Paraphrasing Bilingual Parallel Corpora". Proceedings of the 43rd Annual Meeting of the ACL. Ann Arbor, Michigan. pp. 597–604.
6. ^ Prakash, Aaditya; Hasan, Sadid A.; Lee, Kathy; Datla, Vivek; Qadir, Ashequl; Liu, Joey; Farri, Oladimeji (2016), Neural Paraphrase Generation with Stacked Residual LSTM Networks, arXiv:1610.03098, Bibcode:2016arXiv161003098P
7. ^ Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Ruslan; Zemel, Richard; Torralba, Antonio; Urtasun, Raquel; Fidler, Sanja (2015), Skip-Thought Vectors, arXiv:1506.06726, Bibcode:2015arXiv150606726K
8. ^ Callison-Burch, Chris; Cohn, Trevor; Lapata, Mirella (2008). "ParaMetric: An Automatic Evaluation Metric for Paraphrasing" (PDF). Proceedings of the 22nd International Conference on Computational Linguistics. Manchester. pp. 97–104.
9. ^ a b c Chen, David; Dolan, William (2011). "Collecting Highly Parallel Data for Paraphrase Evaluation". Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon. pp. 190–200.
10. ^ Liu, Chang; Dahlmeier, Daniel; Ng, Hwee Tou (2010). "PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts". Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. MIT, Massachusetts. pp. 923–932.