Article citation study: Context enhanced citation sentiment detection

Citation sentimet analysis is one of the little studied tasks for scientometric analysis. For citation analysis, we developed eight datasets comprising citation sentences, which are manually annotated by us into three sentiment polarities viz. positive, negative, and neutral. Among eight datasets, three were developed by considering the whole context of citations. Furthermore, we proposed an ensembled feature engineering method comprising word embeddings obtained for texts, parts-of-speech tags, and dependency relationships together. Ensembled features were considered as input to deep learning based approaches for citation sentiment classification, which is in turn compared with Bag-of-Words approach. Experimental results demonstrate that deep learning is useful for higher number of samples, whereas support vector machine is the winner for smaller number of samples. Moreover, context-based samples are proved to be more effective than context-less samples for citation sentiment analysis.

💡 Research Summary

The paper addresses the relatively under‑explored task of citation sentiment analysis, which aims to classify the sentiment expressed in a citation sentence as positive, negative, or neutral. To this end, the authors first construct eight manually annotated datasets comprising citation sentences extracted from scientific articles. Three of these datasets include the surrounding textual context (typically one or two sentences before and after the citation), while the remaining five consist solely of the isolated citation sentence. Each citation is labeled by domain experts, and inter‑annotator agreement is measured using Cohen’s κ to ensure reliability.

The core methodological contribution is an “ensembled feature engineering” pipeline that integrates three complementary sources of linguistic information: (1) word embeddings derived from pre‑trained models such as Word2Vec or FastText, providing dense semantic representations; (2) part‑of‑speech (POS) tags, encoded either as one‑hot vectors or low‑dimensional embeddings, to capture syntactic roles; and (3) dependency‑parse relationships, which are transformed into node embeddings that reflect grammatical dependencies (e.g., subject‑verb, object‑verb). These three vectors are concatenated (or combined via a weighted average) to form a rich, high‑dimensional feature vector for each token, which is then aggregated at the sentence level.

Two families of classifiers are evaluated on the same feature sets. The first is a traditional bag‑of‑words (BOW) approach using TF‑IDF weights fed into a linear Support Vector Machine (SVM). Class‑imbalance is mitigated by adjusting class weights. The second family consists of deep‑learning models: a hybrid architecture that stacks a Convolutional Neural Network (CNN) for local n‑gram pattern extraction on top of a bidirectional Long Short‑Term Memory (BiLSTM) network for capturing long‑range sequential dependencies. Between layers, batch normalization and dropout (p = 0.5) are applied to reduce over‑fitting. Training uses the Adam optimizer (learning rate = 1e‑4) and categorical cross‑entropy loss, with early stopping based on validation loss.

Experiments are conducted using 5‑fold cross‑validation across all eight datasets. Primary evaluation metrics include accuracy, precision, recall, and F1‑score. The results reveal several important trends:

Data‑size effect – On datasets with more than roughly 1,000 citations, the deep‑learning models outperform the SVM by an average of 4.3 percentage points in F1‑score, especially excelling at detecting negative citations where long‑range dependencies (e.g., contrastive discourse markers) are crucial. Conversely, on smaller datasets (200–500 citations), the deep models suffer from over‑fitting, and the SVM consistently yields higher and more stable performance.
Contextual advantage – Incorporating the surrounding sentences yields a substantial boost: context‑aware samples achieve on average 7.6 percentage points higher accuracy than context‑less samples. The improvement is most pronounced for negative citations, where cue words such as “however”, “but”, or “although” frequently appear in the neighboring text and signal a critical stance.
Contribution of dependency features – Adding dependency‑parse embeddings particularly enhances recall for the negative class (by 5–8 percentage points). This suggests that syntactic relations help the model differentiate between neutral descriptive citations and those that explicitly criticize prior work.
Comparison of feature sets – While raw BOW features suffice for small‑scale experiments, the ensembled embeddings (semantic + syntactic + relational) provide a richer representation that deep models can exploit when sufficient training data are available.

The authors acknowledge several limitations. All datasets are drawn from English‑language articles, so the generalizability to other languages or multilingual corpora remains untested. The reliance on pre‑trained word embeddings may limit the capture of domain‑specific terminology that appears infrequently in generic corpora. Moreover, the current pipeline treats each citation (and its optional context) as an isolated instance, ignoring the broader citation network and the structural role of the citation within the full paper (e.g., section headings).

Future work is outlined along three directions: (i) leveraging transformer‑based models such as multilingual BERT or SciBERT, which can jointly encode token semantics and context without the need for separate POS or dependency features; (ii) modeling the entire paper as a graph where nodes represent sentences or sections and edges encode citation links, enabling the use of Graph Neural Networks (GNNs) to capture both local sentiment and global scholarly influence; and (iii) scaling up the annotated corpus through weak supervision or crowdsourcing, thereby facilitating more robust training of deep architectures.

In summary, the study demonstrates that (a) a carefully constructed, context‑aware citation dataset is essential for reliable sentiment detection, (b) an ensemble of semantic, syntactic, and relational features significantly benefits deep‑learning classifiers when ample data are available, and (c) traditional linear models remain competitive for limited‑size corpora. These findings provide a solid foundation for integrating citation sentiment analysis into scientometric tools, research evaluation dashboards, and literature‑review automation systems.

💡 Research Summary

📜 Original Paper Content