Deep Dive into Text Summarization using Deep Learning and Ridge Regression
We develop models and extract relevant features for automatic text summarization, and investigate the performance of different models on the DUC 2001 dataset. Two models were developed: a ridge regressor and a multi-layer perceptron. Their hyperparameters were varied and their performance was noted. We split the summarization task into two main steps, the first being sentence ranking and the second being sentence selection. In the first step, given a document, we sort its sentences by importance; in the second step, to obtain non-redundant sentences, we weed out sentences that have high similarity with the previously selected sentences.
The process of text summarization is to condense the information as much as possible without losing the gist of the document. In this project, we develop an extractive summarizer, which extracts the most important and salient sentences in a document. There are two main steps in a summarization task, namely sentence ranking and sentence selection. The first step assigns an importance score to every sentence in the document, and the second step avoids redundancy in the summary by weeding out sentences that convey the same meaning as the earlier selected sentences.
Sentence ranking - We use the predicted ROUGE-2 scores of the models and sort the sentences in descending order. The ones with high predicted ROUGE-2 scores are considered important.
Sentence selection - We use a greedy approach (Li and Li 2014) to stitch together multiple sentences for the summary. In each step of selection, the sentence with maximal salience is added into the summary, unless its similarity with a sentence already in the summary exceeds a threshold. Here we use tf-idf cosine similarity and, following (Cao et al. 2015), set the threshold T_sim = 0.6.
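The ranking and selection steps above can be sketched as follows, a minimal illustration using scikit-learn for tf-idf cosine similarity (the function name, score inputs, and summary-length cap are hypothetical; the paper's own implementation is not shown):

```python
# Greedy sentence selection: take sentences in descending order of predicted
# ROUGE-2 score, rejecting any candidate too similar to an already-chosen one.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def greedy_select(sentences, scores, t_sim=0.6, max_sentences=3):
    """Pick high-scoring sentences, skipping near-duplicates."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Sentence ranking: sort indices by predicted ROUGE-2 score, descending.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) >= max_sentences:
            break
        # Reject a candidate whose tf-idf cosine similarity with any sentence
        # already in the summary exceeds the threshold T_sim = 0.6.
        if all(cosine_similarity(tfidf[i], tfidf[j])[0, 0] <= t_sim
               for j in chosen):
            chosen.append(i)
    return [sentences[i] for i in sorted(chosen)]
```

For example, of two near-identical high-scoring sentences, only the higher-ranked one survives, while dissimilar lower-ranked sentences are still admitted.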
The process of summarization was converted to a regression task wherein the X matrix had 9 features for every sentence and the Y value was the ROUGE-2 score between the sentence and the reference summary in the DUC dataset. Different models, such as a deep MLP and ridge regression, were trained and cross-validated on this X matrix and Y. Their hyperparameters were varied and accuracies were plotted. Due to the limited size of the dataset and the hand-crafted features, we found that the simple ridge regressor beat all the deep models. Since ridge was the best model, sentences were ranked and selected using the ridge regressor.
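This regression framing can be sketched with scikit-learn; synthetic data stands in for the DUC features here, and the layer sizes and regularization strength are illustrative assumptions, not the paper's settings:

```python
# Regression setup: X holds 9 hand-crafted features per sentence, y the
# ROUGE-2 score of that sentence against the reference summary.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 9))                                  # 200 sentences x 9 features
y = X @ rng.random(9) + 0.05 * rng.standard_normal(200)   # proxy ROUGE-2 target

ridge = Ridge(alpha=1.0)
mlp = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)

# 10-fold cross-validated R^2 for each model, mirroring the paper's protocol.
ridge_score = cross_val_score(ridge, X, y, cv=10).mean()
mlp_score = cross_val_score(mlp, X, y, cv=10).mean()
```

On small, low-dimensional hand-crafted feature sets like this, a linear model with mild shrinkage is hard to beat, consistent with the result reported above.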
The Document Understanding Conference (DUC) provides standard datasets for experimenting with and evaluating summarization models. Hence, we collected the DUC 2001 dataset to build the models. This dataset has 310 documents with complete texts and human-written summaries.
A total of 9 features were extracted for every sentence across all documents. The 9 features are listed below:
Position - The position of the sentence. Suppose there are M sentences in the document; then for the i-th sentence the position is computed as 1 - (i-1)/(M-1).
Averaged TF - The mean term frequency of all words in the sentence, divided by the sentence length.
Averaged IDF - The mean inverse document frequency of all words in the sentence, divided by the sentence length.
After extracting the above 9 features, the train matrix was constructed with shape N x M, where N is the total number of sentences (the sum of X_ij over all clusters i and documents j), c is the number of clusters, d_i is the number of documents in cluster i, X_ij is the number of sentences in the j-th document of cluster i, and M = 9 is the number of features for every sentence.
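The matrix construction can be sketched as below; `extract_features` is a hypothetical stand-in for the 9 feature extractors listed above, and the cluster/document nesting follows the description of N:

```python
# Build the N x M train matrix by stacking the 9-dim feature vector of every
# sentence across all clusters and documents (N = total sentence count, M = 9).
import numpy as np

def position_feature(i, M):
    """Position of the i-th (1-based) of M sentences: 1 - (i-1)/(M-1)."""
    return 1.0 if M == 1 else 1 - (i - 1) / (M - 1)

def build_matrix(clusters, extract_features):
    rows = []
    for cluster in clusters:                 # c clusters
        for doc in cluster:                  # d_i docs in cluster i
            for sent in doc:                 # X_ij sentences in doc j
                rows.append(extract_features(sent, doc))  # 9 features
    return np.array(rows)                    # shape (N, 9)
```

Note that the position feature gives the first sentence a score of 1 and the last a score of 0, encoding the lead bias exploited by the baseline below.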
It has often been argued that the first sentence of a document captures its most important information. Hence, a dummy model which blindly predicts the first sentence as the summary was built. The mean ROUGE-2 score between the first sentence and the actual summary across all documents was computed and its performance was noted.
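The lead-sentence baseline can be sketched as follows; the bigram-recall scorer below is a simplified stand-in for ROUGE-2 (real evaluations use the official ROUGE toolkit), and the function names are illustrative:

```python
# Lead baseline: predict the first sentence as the summary and score it with
# a minimal bigram-recall approximation of ROUGE-2.
from collections import Counter

def rouge2_recall(candidate, reference):
    def bigrams(text):
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))
    cand, ref = bigrams(candidate), bigrams(reference)
    if not ref:
        return 0.0
    # Clipped bigram overlap divided by the reference bigram count.
    overlap = sum(min(c, ref[b]) for b, c in cand.items())
    return overlap / sum(ref.values())

def lead_baseline(document_sentences, reference_summary):
    """Score the first sentence against the reference summary."""
    return rouge2_recall(document_sentences[0], reference_summary)
```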
Ridge regression (Tibshirani 2013) is like least squares but shrinks the estimated coefficients towards zero. Given a response vector y ∈ R^n and a predictor matrix X ∈ R^{n×p}, the ridge regression coefficients are defined as:

β̂_ridge = argmin_β ||y − Xβ||² + λ||β||²
Here λ ≥ 0 is a tuning parameter, which controls the strength of the penalty term. Note that: when λ = 0, we get the linear regression estimate; when λ → ∞, we get β̂_ridge = 0; for λ in between, we are balancing two ideas: fitting a linear model of y on X, and shrinking the coefficients.
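This objective has the well-known closed form β̂_ridge = (XᵀX + λI)⁻¹Xᵀy, which a few lines of NumPy can verify (a didactic sketch, not the paper's training code):

```python
# Closed-form ridge estimate: solve (X^T X + lambda * I) beta = X^T y.
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

With λ = 0 this recovers the least-squares solution exactly; as λ grows, the coefficients shrink towards zero, matching the limiting cases noted above.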
During the validation phase, we used 10-fold cross-validation to identify the best parameters for the regressor. Upon cross-validating over polynomial feature orders 1, 2 and 3, we found that the validation error is minimum when the polynomial order is 2. Hence, polynomial order 2 was chosen and the X matrix was raised to this order during the testing phase.
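The validation protocol can be sketched with scikit-learn; the synthetic data below (with a built-in degree-2 interaction) stands in for the DUC features, and the ridge penalty value is an illustrative assumption:

```python
# 10-fold CV over polynomial feature orders 1-3; pick the degree with the
# lowest mean validation MSE.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.random((150, 4))
y = X[:, 0] * X[:, 1] + 0.01 * rng.standard_normal(150)  # degree-2 target

errors = {}
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-3))
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    errors[degree] = mse
best = min(errors, key=errors.get)
```

Because the target contains a genuine second-order interaction, degree 2 cuts the validation error sharply relative to degree 1, the same shape of result reported for the DUC features.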
A multilayer perceptron (Wikipedia) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function. An MLP utilizes a supervised learning technique called backpropagation for training the network.
To put it in simple words, training an MLP has two main passes, namely the forward pass and the backward pass. In the forward pass, we compute the outputs of the activation functions, and in the backward pass, we find the errors of the activation functions and finally make the weight updates.
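The two passes can be sketched numerically for a single hidden layer with sigmoid activations and squared-error loss (the network size, learning rate, and function names are illustrative, not the paper's configuration):

```python
# One training step of a tiny MLP: forward pass computes activations,
# backward pass propagates errors and updates the weights in place.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, W1, W2, lr=0.5):
    # Forward pass: hidden activations, then network output.
    h = sigmoid(W1 @ x)
    out = sigmoid(W2 @ h)
    # Backward pass: output error, hidden error, then gradient-descent updates.
    d_out = (out - y) * out * (1 - out)
    d_h = (W2.T @ d_out) * h * (1 - h)
    W2 -= lr * np.outer(d_out, h)
    W1 -= lr * np.outer(d_h, x)
    return out
```

Repeating this step drives the output towards the target, which is exactly the error-driven weight-update behaviour described above.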
The weight updates are generally done as follows (Swingler) - An MLP will have 1 input layer, 1
…(Full text truncated)…