Skim-Aware Contrastive Learning for Efficient Document Representation
Original Paper Info
- Title: Skim-Aware Contrastive Learning for Efficient Document Representation
- ArXiv ID: 2512.24373
- Date: 2025-12-30
- Authors: Waheed Ahmed Abro, Zied Bouraoui
Abstract
Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.
Summary & Analysis
1. **Development of a New Document Encoder**
   - Simple Explanation: This paper proposes a new method to handle long legal and medical documents through the CPE, which identifies key segments of each document to generate effective embeddings.
   - Metaphor: The encoder acts like someone who skims through a book to pick out the important parts.
2. **Utilization of Self-Supervised Contrastive Learning**
   - Simple Explanation: CPE uses self-supervised contrastive learning to identify significant segments within documents, helping the model understand what each document is about.
   - Metaphor: This approach mimics how we read new books and focus on remembering important scenes.
3. **Validation Through Extensive Experiments**
   - Simple Explanation: The paper validates the effectiveness of CPE through various experiments showing better results compared to existing methods.
   - Metaphor: It's akin to testing a new drug to see if it outperforms an old one.
Full Paper Content (ArXiv Source)
Since the introduction of Language Models (LMs), the focus in NLP has been on fine-tuning large pre-trained language models, especially for solving sentence- and paragraph-level tasks. However, accurately learning document embeddings continues to be an important challenge for several applications, such as document classification, ranking, retrieval-augmented generation (RAG) systems that demand efficient document representation encoders, and legal and medical applications like judgment prediction, legal information retrieval, and biomedical document classification.
Learning high-quality document representations is a challenging task due to the difficulty of developing efficient encoders with reasonable complexity. Most document encoders use sentence encoders based on self-attention architectures such as BERT. However, very long inputs are not feasible, as self-attention scales quadratically with the input length. To process long inputs efficiently, architectures such as Linformer, BigBird, Longformer, and Hierarchical Transformers have been developed. Unlike the quadratic scaling of traditional attention mechanisms, these architectures utilize sparse or hierarchical attention mechanisms that scale linearly. As such, they can process $`4096`$ input tokens, which is enough to embed most types of documents, including legal and medical documents, among others.
While methods based on sparse attention networks offer a solution to the complexity problem, the length of the document remains an obstacle to building faithful representations for downstream applications in domains such as law and medicine. First, fine-tuning these models for downstream tasks is computationally intensive. Second, capturing the meaning of the whole document remains too complex; in particular, it is unclear how, or to what extent, inner dependencies between text fragments are considered. This is because longer documents contain more information than shorter documents, making it difficult to capture all the relevant information within a fixed-size representation. Additionally, documents usually cover several distinct parts, which makes the encoding process complex and may lead to collapsed representations. This is particularly true for legal and medical documents, as they contain specialized terminology and text segments that describe a series of interrelated facts.
When domain experts, such as legal or medical professionals, read through documents, they skillfully skim the text, homing in on a few text fragments that, when pieced together, provide an understanding of the content. Inspired by this intuitive process, our work focuses on developing document encoders capable of generating high-quality embeddings for long documents. These encoders mimic the expert ability to distill relevant text chunks, enabling them to excel in downstream tasks right out of the box, without the need for fine-tuning. We propose a novel framework for self-supervised contrastive learning that focuses on long legal and medical documents. Our approach features a self-supervised Chunk Prediction Encoder (CPE) designed to tackle the challenge of learning document representations without supervision. By leveraging both intra-document and inter-document chunk relationships, the CPE enhances how documents are represented through their important fragments. The model operates by randomly selecting a text fragment from a document and using an encoder to predict whether this fragment is strongly related to other parts of the same document. To simulate the skimming process, we frame this task as a Natural Language Inference (NLI) problem. In our method, "entailment" and "contradiction" are not literal NLI labels. Rather, we use an NLI-style binary objective as a practical proxy that teaches the model whether a chunk is semantically compatible with its document context (positive) or not (negative). This enables the encoder to judge whether a local fragment is compatible with the document-level context, thereby capturing long-range, cross-fragment relevance. This method not only uncovers connections between different documents, but also emphasizes the relevance of various sections, thus enriching the overall learning process for document representations. The main contributions are as follows:
- We introduce a self-supervised Chunk Prediction Encoder (CPE) that integrates random span sampling into a hierarchical transformer and a Longformer: by sampling text spans and training the model to predict whether a span belongs to the same document, CPE captures both intra- and inter-document fragment relationships, preserves global context through chunk aggregation, and models the complex hierarchical structure of long texts.
- We apply a contrastive loss that pulls together representations of related fragments and pushes apart unrelated ones, reinforcing meaningful connections across different parts of the same document.
- We conduct extensive experiments to demonstrate the effectiveness of our framework. Specifically, (i) we compare the quality of our document embeddings against strong baselines, (ii) we benchmark CPE-based models in an end-to-end fine-tuning setup, training all parameters jointly and comparing downstream classification performance against established methods, and (iii) we perform an ablation study to assess the impact of different chunk sizes, visualize the resulting embedding space, and evaluate performance on shorter documents. We also compare against LLMs such as LLaMA to assess their capabilities for handling long legal texts.
Related Work
This section provides an overview of the modelling of long documents and self-supervised document embedding.
Modeling of long documents.
Long documents are typically handled using sparse-attention models such as Longformer and BigBird. These models use local and global attention mechanisms to overcome the $`O(n^2)`$ complexity of standard full attention. Alternatively, one can use a hierarchical attention mechanism, where the document is processed in a hierarchical manner. For example, a hierarchical BERT model has been applied to model long legal documents: the model first processes the words in each sentence using the BERT-base model to produce sentence embeddings, and a self-attention mechanism is then applied to the sentence-level embeddings to produce document embeddings. The authors demonstrated that their hierarchical BERT model outperforms both the vanilla BERT architecture and Bi-GRU models. Similarly, other work explored various methods of splitting long documents and compared them with sparse attention methods on long document classification tasks, finding that better results were achieved by splitting the document into smaller chunks of 128 tokens. The Hi-Transformer model applies both sentence-level and document-level transformers followed by hierarchical pooling. Meanwhile, a variant of hierarchical attention transformers based on a segment encoder and a cross-segment encoder has demonstrated results comparable to the Longformer model. In this work, we consider processing long documents with a hierarchical attention mechanism and a sparse-attention Longformer encoder, and we further improve the document embeddings using self-supervised contrastive learning.
Unsupervised Document Representation.
Unsupervised document representation learning is a highly active research field. Early deep learning models were introduced to create distributed word representations, such as Word2Vec and GloVe. The Doc2Vec model was then proposed, which utilizes word representations to generate document embeddings. In the same vein, the Skip-Thoughts model extended the word2vec approach from the word level to the sentence level. Transformer-based models were also proposed to produce vector representations of sentences. Recently, there have been advances in self-supervised contrastive learning methods. In this direction, CONPONO proposes using sentence-level objectives with a masked language model to capture discourse coherence between sentences; the sentence-level encoder predicts text that is k sentences away. On the other hand, SimCSE uses a dropout version of the same short sentence as a positive pair. SimCSE-style learning has also been applied to long documents with an additional Bregman divergence objective. Similarly, the SDDE model was proposed to generate document representations based on inter-document relationships using an RNN encoder. We follow a similar strategy of exploiting inter-document relationships, but we employ transformer-based pre-trained language models with a multiple negatives ranking contrastive loss.
Legal and medical document representation
Processing legal and medical documents is an active research topic. Hierarchical BERT models have been proposed to process legal documents, and a hierarchical transformer architecture has been proposed for the legal judgment prediction task: the input document is split into several chunks of 512 tokens, each chunk embedding is produced by a pre-trained XLNet model, and a Bi-GRU encoder is then applied to the chunk embeddings to produce the final document embedding. Other work trains the BERT model on the CaseHOLD (Case Holdings On Legal Decisions) dataset, employs GPT-2 models to predict how each justice votes on Supreme Court opinions, processes legal documents using a Multi-Perspective Bi-Feedback Network to classify law articles, or represents the document using a graph neural network, where a distillation-based attention network extracts discriminative features to distinguish confusing articles.
For medical document processing, contextualized document representations have been proposed to answer questions from long medical documents, using a hierarchical dual encoder based on hierarchical LSTMs to encode medical entities. A convolutional neural network based label-wise attention network was proposed to produce label-wise document representations by attending to the most relevant document information for each medical code, and it has since been shown that a pre-trained BERT model outperforms CNN-based label-wise attention networks. In the same direction, an encoder-decoder architecture was proposed that outperforms encoder-only models on multi-label text classification for the legal and medical domains. A comprehensive survey focuses on the integration of pre-trained language models within the biomedical domain, highlighting the substantial benefits of employing LMs in various NLP tasks. In contrast to previous works, we propose to learn document representations using self-supervised contrastive learning on top of pre-trained LMs.
Chunk Prediction Encoders
To process long documents in a time-efficient manner, we use a state-of-the-art hierarchical transformer method and the sparse-attention Longformer model. To enhance the quality of these representations, we propose a self-supervised CPE that takes into account the relationships between different text chunks and determines the relevant ones. Finally, we use the document embeddings to perform single-label and multi-label classification for legal and medical tasks.
CPE for Hierarchical Representation.
We first briefly introduce the hierarchical representation model based on a pre-trained model $`\mathcal{M}`$. Let $`D`$ be an input document, and let $`c_1, c_2, \ldots, c_n`$ denote the corresponding text chunks of $`D`$, where $`n`$ is the maximum number of chunks; documents with fewer than $`n`$ chunks are zero-padded. Each chunk is a sequence of $`t`$ tokens $`C = (w_{1}, w_{2}, \ldots, w_{t})`$, where $`t`$ is at most $`512`$. Furthermore, the special classification [CLS] token is added at the start of each chunk. Our aim is to learn the vector representation of each chunk using a shared small language model encoder as follows:
\begin{equation}
\label{eq_sent_rep}
f = \mathcal{M}(w_{\text{[CLS]}}, w_{1}, \ldots, w_{t})
\end{equation}
where $`f`$ represents the output of a model (which can be BERT, RoBERTa, LegalBERT, ClinicalBioBERT, or any other). Following the common strategy, we consider the [CLS] token as the representation of the whole chunk. To obtain the final document representation from the different chunk features, we consider the following pooling strategies: Mean-Pooling, obtained by taking the mean of the chunk representations $`d = \frac{1}{n} \sum_{i=1}^n f_i`$, and Max-Pooling over the chunk representations. Each chunk encodes local features of the document, and the whole document is represented by pooling these local features.
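To make this chunk-and-pool pipeline concrete, the following is a minimal sketch in PyTorch, assuming the Hugging Face `transformers` library; the helper name `encode_document` and the choice of `bert-base-uncased` are illustrative, not the authors' released code.

```python
# A minimal sketch of hierarchical chunk encoding with a shared encoder,
# assuming Hugging Face `transformers` and PyTorch. Names are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # could equally be RoBERTa, LegalBERT, ClinicalBioBERT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def encode_document(text, chunk_len=128, max_chunks=32, pooling="mean"):
    """Split a document into fixed-size chunks, encode each chunk with the
    shared encoder, keep the [CLS] vector per chunk, then pool over chunks."""
    tokens = tokenizer.encode(text, add_special_tokens=False)
    chunks = [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]
    chunks = chunks[:max_chunks] or [[tokenizer.pad_token_id]]

    cls_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            ids = torch.tensor([[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
            out = encoder(input_ids=ids)
            cls_vectors.append(out.last_hidden_state[:, 0, :])  # f_i = [CLS] of chunk i
    f = torch.cat(cls_vectors, dim=0)         # shape: (num_chunks, hidden)

    if pooling == "mean":
        return f.mean(dim=0)                  # d = (1/n) * sum_i f_i
    return f.max(dim=0).values                # Max-Pooling alternative
```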
Learning process
We propose a chunk prediction encoder to train a hierarchical transformer model using self-supervised contrastive learning that leverages intra- and inter-document relationships. For each document, we randomly remove one chunk and then ask the NLI classifier to predict whether this chunk is derived from the other chunks of the document. By doing so, we force the model to learn the dependencies between chunks and their relevance in representing a document. Consider a mini-batch of $`N`$ documents, denoted as $`D=\left\{\left( d_i\right )\right\}_1 ^{N}`$. For each document $`d_i`$, we randomly select a text chunk $`c^+`$ and remove it from the document, leaving $`\tilde{d_i}`$, to form a positive pair $`\left ( \tilde{d_i}, c^+ \right )`$. We then select a negative chunk $`c^-`$ from the remaining $`N-1`$ documents of the batch to form a negative pair $`(\tilde{d_i}, c^-)`$; note that $`c^-`$ does not belong to document $`\tilde{d_i}`$. One possible concern is that a text chunk could be similar enough to fit both its own document and a negative document. However, this is not an issue in practice: because the training objective uses multiple negatives, the model is driven mainly by the many clearly dissimilar documents rather than the occasional similar one. The chunk-predictive contrastive learning process can be viewed as an unsupervised natural language inference task, where a positive chunk sample represents an entailment of a document and negative samples from other documents represent a contradiction of the document. We use the multiple negatives ranking loss to train the model:
\begin{equation}
\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n} \log \frac{\exp\big(\mathrm{sim}(f(\tilde{d_i}), f(c_i^+))\big)}{\sum_{j=1}^{k} \exp\big(\mathrm{sim}(f(\tilde{d_i}), f(c_j^-))\big)},
\label{eq_loss}
\end{equation}
where $`f`$ denotes the feature vector generated by the document encoder, $`\mathrm{sim}`$ denotes cosine similarity, $`c_i^+`$ is the positive chunk taken from document $`d_i`$, and $`c_j^-`$ are the $`k`$ negative chunks taken from documents other than $`d_i`$. The multiple negatives ranking loss compares the positive pair representation with the negative pairs within the mini-batch. The general architecture is illustrated in Figure 1. Our architecture includes a shared encoder that generates an embedding of the complete document except for chunk $`C4`$. Additionally, the shared encoder shown at the bottom produces a vector representation of $`C4`$. The classifier is responsible for learning whether the embedding of chunk $`C4`$ aligns with or contradicts the document embedding.
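The following is a small PyTorch sketch of this objective with in-batch negatives, written in the cross-entropy form commonly used for the multiple negatives ranking loss; the scale factor and the function name are our assumptions.

```python
# Sketch of the multiple negatives ranking objective with in-batch negatives.
# The scale factor (20.0) and names are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(doc_emb, chunk_emb, scale=20.0):
    """doc_emb holds f(d~_i) (document minus its sampled chunk) and chunk_emb
    holds f(c_i^+); both are (batch, hidden). The positive chunks of the other
    documents in the batch act as negatives for document i."""
    doc_emb = F.normalize(doc_emb, dim=-1)
    chunk_emb = F.normalize(chunk_emb, dim=-1)
    sim = doc_emb @ chunk_emb.t() * scale              # scaled cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)                # softmax over in-batch chunks

# Toy usage with random features standing in for real encoder outputs:
loss = multiple_negatives_ranking_loss(torch.randn(4, 768), torch.randn(4, 768))
```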
CPE for Longformer
During the training of the CPE on the Longformer model, the document is split into two segments: the reference text and the positive text chunk. Our approach aims to ensure that the reference text segment and the positively sampled text chunk, which is extracted from the same document, are in agreement, and consequently to maximize the objective function. Suppose we have a mini-batch of $`N`$ documents, denoted as $`D=\left\{\left( d_i\right )\right\}_1 ^{N}`$. For each document $`d_i`$, we randomly select a text chunk $`c_i^+`$ and use the remaining text as the reference text $`\tilde{d_i}`$ to form a positive pair $`\left (\tilde{d_i}, c_i^+ \right )`$. We then select a negative chunk $`c_j^-`$ from the remaining $`N-1`$ documents of the batch to serve as the negative pair $`(\tilde{d_i}, c_j^-)`$. The reference text and the positive text chunk are passed through the same Longformer encoder. The proposed method utilizes the [CLS] token representations to produce the reference text embedding ($`z_{\tilde{d_i}}`$) and the text segment embedding ($`z_{c_i}^+`$). The linear classifier is then trained to maximize the agreement between the segment and the reference text, both sampled from the same document, and to minimize the agreement between the reference text and a segment taken from another document. The multiple negatives ranking loss used to optimize the model is given as follows:
\begin{equation}
\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n} \log \frac{\exp\big(\mathrm{sim}(z_{\tilde{d_i}}, z_{c_i}^+)\big)}{\sum_{j=1}^{k} \exp\big(\mathrm{sim}(z_{\tilde{d_i}}, z_{c_j}^-)\big)},
\label{eq_loss_longformer}
\end{equation}
where $`z_{\tilde{d_i}}`$, $`z_{c_i}^+`$, and $`z_{c_j}^-`$ denote the feature vectors generated by the Longformer document encoder. The general architecture is depicted in Figure 2. A shared Longformer encoder creates a document embedding for most of the text, excluding a small portion that has been removed. Additionally, a shared encoder at the bottom generates a vector representation of the removed chunk. The classifier learns to differentiate whether the embedding of the small chunk aligns with or contradicts the document embedding.
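A possible implementation of this reference-text/positive-chunk setup is sketched below, assuming the Hugging Face checkpoint `allenai/longformer-base-4096`; the splitting helper and the explicit global attention on the first token are our choices, not details confirmed by the paper.

```python
# Illustrative preprocessing and encoding for the Longformer CPE variant.
import random
import torch
from transformers import LongformerModel, LongformerTokenizerFast

tok = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
lf = LongformerModel.from_pretrained("allenai/longformer-base-4096")

def split_reference_and_chunk(text, chunk_len=128):
    """Remove a random span of ~chunk_len words; return (reference, chunk)."""
    words = text.split()
    start = random.randint(0, max(len(words) - chunk_len, 0))
    chunk = " ".join(words[start:start + chunk_len])
    reference = " ".join(words[:start] + words[start + chunk_len:])
    return reference, chunk

def cls_embedding(text, max_len=4096):
    """Encode text with the shared Longformer and return the <s>/[CLS] vector."""
    enc = tok(text, truncation=True, max_length=max_len, return_tensors="pt")
    global_mask = torch.zeros_like(enc["input_ids"])
    global_mask[:, 0] = 1                    # global attention on the first token
    with torch.no_grad():
        out = lf(**enc, global_attention_mask=global_mask)
    return out.last_hidden_state[:, 0, :]

# z_doc = cls_embedding(reference); z_pos = cls_embedding(chunk); the same
# multiple negatives ranking loss as above is then applied to these vectors.
```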
Document Classification
After producing our document representations, we train an MLP classifier, which consists of three hidden layers and an output layer. The hidden layers use the tanh activation function. In the output layer, we use the sigmoid for the multi-label classifier and the softmax for the multi-class classifier.
\begin{equation}
\hat{y} = g\big(\tanh(W d + b)\big)
\label{eq_final}
\end{equation}
where $`g`$ denotes the output activation function (sigmoid or softmax), $`d`$ is the document representation, $`W`$ represents the weight matrices, and $`b`$ is the bias vector. The model minimizes the cross-entropy loss. Note that we introduce this classifier to assess the quality of our document embeddings and to evaluate to what extent the representations can be used to perform classification tasks.
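A minimal sketch of such a classification head in PyTorch follows; the paper specifies three tanh hidden layers and a sigmoid/softmax output, while the hidden width of 512 is our assumption.

```python
# Sketch of the MLP classification head described above (widths assumed).
import torch.nn as nn

def make_classifier(input_dim=768, num_labels=10, hidden_dim=512, multi_label=True):
    head = nn.Sequential(
        nn.Linear(input_dim, hidden_dim), nn.Tanh(),
        nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        nn.Linear(hidden_dim, num_labels),      # logits
    )
    # Sigmoid (multi-label) or softmax (multi-class) is applied inside the loss.
    criterion = nn.BCEWithLogitsLoss() if multi_label else nn.CrossEntropyLoss()
    return head, criterion
```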
Experimental Setting
In this section, we define the experimental setup for evaluating the proposed contrastive learning framework.
Datasets
We evaluate our encoders on three legal datasets, ECHR, SCOTUS, and EURLEX, and two medical datasets, MIMIC and BIOASQ. Table 2 in the Appendix provides statistics and the average document length of each dataset. Note that we also use BIOASQ to test the capability of our model in handling short documents.
Setting and evaluation metrics
We consider small language models such as bert-base, roberta-base, legal-bert-base, and ClinicalBioBERT, as we seek simple models with fewer parameters. While our hierarchical transformer is capable of processing documents of any length, in this study we restrict our analysis to the initial 4096 tokens, driven primarily by computational considerations. The number of text chunks is set to 32 and the chunk length is set to 128 for the ECHR, SCOTUS, and MIMIC datasets. For the EURLEX dataset, whose documents are shorter than those of the other datasets, we utilize the initial 2048 tokens, i.e., 16 segments of 128 tokens each. The Longformer model can handle sequences of up to 4096 input tokens; we truncate or pad longer or shorter documents accordingly. The length of the positive text chunk is set to 128 tokens for the Longformer model. As evaluation metrics, we report Micro-averaged F1 and Macro-averaged F1 for the multi-label ECHR and multi-class SCOTUS classification datasets.
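For reference, these chunking settings can be summarized as a configuration sketch; the naming is ours and the values follow the text above, not released code.

```python
# Chunking configuration described in the setting above (our naming).
CHUNKING = {
    "ECHR":   {"max_tokens": 4096, "num_chunks": 32, "chunk_len": 128},
    "SCOTUS": {"max_tokens": 4096, "num_chunks": 32, "chunk_len": 128},
    "MIMIC":  {"max_tokens": 4096, "num_chunks": 32, "chunk_len": 128},
    "EURLEX": {"max_tokens": 2048, "num_chunks": 16, "chunk_len": 128},
}
LONGFORMER_MAX_LEN = 4096     # longer documents are truncated, shorter ones padded
POSITIVE_CHUNK_LEN = 128      # positive chunk length for the Longformer CPE
```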
| PTM | Model | ECHR macro-F1 | ECHR μ-F1 | SCOTUS macro-F1 | SCOTUS μ-F1 | EURLEX macro-F1 | EURLEX μ-F1 | MIMIC macro-F1 | MIMIC μ-F1 |
|---|---|---|---|---|---|---|---|---|---|
| BERT | Emb + MLP | 36.32 | 55.56 | 40.6 | 61.79 | 25.86 | 51.44 | 49.06 | 62.99 |
| | Emb$`_{SimCSE}`$ + MLP | 48.69 | 61.37 | 45.50 | 61.14 | 33.43 | 57.07 | 54.59 | 67.13 |
| | Emb$`_{ESimCSE}`$ + MLP | 50.83 | 64.41 | 52.12 | 64.07 | 34.16 | 54.56 | 57.05 | 68.08 |
| | Emb$`_{CPE}`$ + MLP | 54.77 | 66.98 | 54.56 | 66.78 | 40.96 | 63.68 | 57.12 | 68.20 |
| RoBERTa | Emb + MLP | 35.27 | 55.64 | 32.77 | 57.92 | 21.88 | 35.05 | 49.63 | 64.27 |
| | Emb$`_{SimCSE}`$ + MLP | 49.04 | 59.4 | 48.28 | 62.71 | 35.13 | 53.64 | 55.72 | 66.99 |
| | Emb$`_{ESimCSE}`$ + MLP | 45.27 | 56.61 | 57.73 | 67.85 | 33.76 | 54.14 | 57.47 | 67.32 |
| | Emb$`_{CPE}`$ + MLP | 56.02 | 65.58 | 57.85 | 68.07 | 41.94 | 63.44 | 61.01 | 69.42 |
| LegalBERT | Emb + MLP | 52.63 | 65.31 | 44.93 | 65.42 | 17.78 | 39.47 | – | – |
| | Emb$`_{SimCSE}`$ + MLP | 57.52 | 68.81 | 57.55 | 68.92 | 40.87 | 61.58 | – | – |
| | Emb$`_{ESimCSE}`$ + MLP | 56.47 | 68.29 | 56.08 | 67.78 | 40.40 | 62.59 | – | – |
| | Emb$`_{CPE}`$ + MLP | 57.74 | 69.39 | 59.45 | 71.85 | 42.16 | 64.17 | – | – |
| ClinicalBioBERT | Emb + MLP | – | – | – | – | – | – | 55.84 | 68.9 |
| | Emb$`_{SimCSE}`$ + MLP | – | – | – | – | – | – | 62.74 | 71.06 |
| | Emb$`_{ESimCSE}`$ + MLP | – | – | – | – | – | – | 60.39 | 70.94 |
| | Emb$`_{CPE}`$ + MLP | – | – | – | – | – | – | 63.89 | 71.72 |
Legal and medical topic classification results on fixed document representations produced by hierarchical transformer encoders (Table [table-classHT]). "–" marks PTM/dataset combinations that were not evaluated (ClinicalBioBERT is not considered for the legal datasets).
Document Embedding Baseline Models
We compare classifier model performance in the following settings.
Pre-trained models Embedding + MLP Classification: In this setting, we obtain the document embeddings using the existing pre-trained parameters of the hierarchical pre-trained language models (BERT, RoBERTa, LegalBERT, ClinicalBioBERT) and the Longformer model. On top of these document embeddings, we apply an MLP classification layer to predict the legal and medical labels. During training, only the parameters of the MLP layers are updated, while the parameters of the pre-trained models are kept fixed.
SimCSE Embedding + MLP Classification: SimCSE proposed a contrastive learning framework that employs dropout for data augmentation. We train the SimCSE framework on long documents using hierarchical pre-trained language models and the Longformer model. After producing document embeddings with the SimCSE framework, we apply an MLP layer to predict the legal and medical labels.
ESimCSE Embedding + MLP Classification: The Enhanced SimCSE contrastive learning framework applies a word-repetition operation to construct positive pairs. We train ESimCSE on long documents by employing hierarchical pre-trained language models and the Longformer model, and we train an MLP classifier on top of the fixed ESimCSE document embeddings.
End-to-end Finetuning Baseline Models
We compared our approach against the following state-of-the-art fine-tuning methods:
Hi-LegalBERT: The hierarchical Legal-BERT encodes each text chunk via a pretrained Legal-BERT [CLS] embedding. These chunk embeddings are then fed into a two-layer Transformer to capture inter-chunk context, and the context-aware chunk vectors are max-pooled into a single document embedding for classification.
LSG Attention: The LSG architecture employs Local, Sparse, and Global attention. LSG is based on block-local attention for short-range context, structured sparse attention for mid-range dependencies, and global attention to improve information flow. LSG produces competitive results for classifying long documents.
LegalLongformer: LegalLongformer adapts the Longformer architecture by initializing all parameters from LegalBERT and then fine-tuning the entire model on downstream long document classification tasks.
HAT (Hierarchical Attention Transformer): HAT employs a two-stage encoder: a segment-wise transformer to capture local context, followed by a cross-segment transformer to model inter-segment interactions. This hierarchical approach enables efficient and accurate classification of long documents.
We propose a generalized CPE that can be applied to any hierarchical attentional document encoder, such as ToBERT or RoBERT, or to sparse attention models such as BigBird. By leveraging more advanced methods, CPE can achieve better performance. We report the performance of CPE with an advanced hierarchical attentional document encoder in Table 3.
Evaluation and Results
We conduct a comprehensive evaluation of the proposed CPE framework by benchmarking against a hierarchical transformer encoder, a sparse-attention Longformer model, and state-of-the-art hierarchical transformer variants. Experiments are carried out on standard long legal and medical datasets under frozen-embedding and full end-to-end fine-tuning settings.
| PTM | Model | ECHR macro-F1 | ECHR μ-F1 | SCOTUS macro-F1 | SCOTUS μ-F1 | EURLEX macro-F1 | EURLEX μ-F1 | MIMIC macro-F1 | MIMIC μ-F1 |
|---|---|---|---|---|---|---|---|---|---|
| Longformer | Emb + MLP | 35.89 | 50.35 | 27.35 | 50.28 | 25.83 | 54.45 | 44.55 | 62.09 |
| | Emb$`_{SimCSE}`$ + MLP | 35.15 | 46.81 | 36.82 | 52.64 | 32.01 | 53.61 | 45.36 | 60.77 |
| | Emb$`_{ESimCSE}`$ + MLP | 32.92 | 46.29 | 35.33 | 53.35 | 35.38 | 55.85 | 46.06 | 61.44 |
| | Emb$`_{CPE}`$ + MLP | 48.94 | 59.71 | 47.46 | 62.57 | 43.24 | 64.57 | 53.06 | 64.93 |
Classification results on fixed document representations produced by the Longformer encoder (Table [table-classLongformer]).
Evaluation of Hierarchical Representation
Legal and medical topic classification results on fixed document representations produced by the hierarchical transformer are presented in Table [table-classHT]. From the table, we observe that MLP classifiers trained directly on frozen PTM document embeddings (Emb + MLP) yield the weakest performance on all datasets. It is evident that the self-supervised contrastive learning methods SimCSE, ESimCSE, and CPE improve the classification performance of PTM document embeddings across the datasets. The SimCSE embedding improves the performance of the BERT embedding by 12% in terms of macro-F1 on ECHR, while the improvement of ESimCSE over the BERT embedding is 14%. However, the CPE embedding achieves the best performance on all datasets using the BERT PTM. Specifically, CPE improves macro-F1 by approximately 6% and 4% over the BERT embeddings of SimCSE and ESimCSE, respectively, on the ECHR dataset. Similar improvements can be observed on SCOTUS, EURLEX, and MIMIC.
The MLP classifiers based on the BERT-base model perform poorly on both legal and biomedical datasets. A potential reason is that legal and biomedical terminology is not well represented in the generic corpora used to pre-train BERT. We therefore also perform experiments with the LegalBERT PTM, a BERT variant pre-trained on legal corpora. Similarly, we utilize ClinicalBioBERT, which is pre-trained on medical documents. The MLP classifiers leveraging LegalBERT and ClinicalBioBERT document embeddings demonstrate enhanced performance compared to the classifier based on BERT document embeddings. In-domain knowledge appears to be particularly crucial for the SCOTUS dataset, resulting in a 5% improvement in macro-F1 of the LegalBERT Embedding$`_{CPE}`$ MLP over the BERT Embedding$`_{CPE}`$ MLP. In comparison, the improvements on the ECHR and EURLEX datasets over the BERT Embedding$`_{CPE}`$ MLP models are 3% and 2%, respectively. This is because the SCOTUS dataset is more domain-specific in its language than the other datasets. Similarly, on the MIMIC dataset, ClinicalBioBERT Embedding$`_{CPE}`$ shows a notable 3% enhancement in performance compared to BERT Embedding$`_{CPE}`$.
The MLP classifier utilizing RoBERTa embeddings shows better performance than the BERT-base model and achieves comparable results to the classifiers utilizing LegalBERT and ClinicalBioBERT embeddings. This improved performance can be attributed to RoBERTa's pre-training on larger generic corpora and its more extensive vocabulary. The proposed RoBERTa Embedding$`_{CPE}`$ MLP outperforms the SimCSE and ESimCSE MLP models on ECHR, SCOTUS, EURLEX, and MIMIC. The proposed CPE outperforms the document Emb, Emb$`_{SimCSE}`$, and Emb$`_{ESimCSE}`$ models across datasets for the various PTM encoders. CPE training produces better performance using LegalBERT and ClinicalBioBERT, which suggests that CPE performance improves with domain-adapted PTMs. Furthermore, the results indicate that dropout augmentation or simple word repetition for constructing positive pairs may not yield significant improvements for document- or paragraph-level embeddings.
Evaluation on Longformer
Table [table-classLongformer] shows the MLP classifier results on document representations produced by the contrastively trained Longformer model. The Longformer + MLP classifier performs poorly on all datasets because the model fails to capture global context and paragraph-level interaction. It is evident that the self-supervised strategies (SimCSE, ESimCSE, and CPE) improve the results of the Longformer models. The proposed CPE method has a clear advantage over the SimCSE and ESimCSE methods, producing 12% and 14% better macro-F1 on the ECHR dataset and 10% and 12% improvements in macro-F1 on the SCOTUS dataset. Similarly, on the MIMIC dataset, the performance gain from CPE is 8% and 7% in macro-F1 over SimCSE and ESimCSE, respectively. Based on the results, it is clear that hierarchical transformer models perform better than Longformer-based models on the ECHR, SCOTUS, and MIMIC datasets. The only dataset where Longformer models demonstrate superior performance is EURLEX. This can likely be attributed to EURLEX's shorter documents, which allow the Longformer model to capture all relevant information more efficiently than the hierarchical transformer model. Overall, our self-contrastive CPE methods, based on both the Longformer and the hierarchical transformer, yield better results than the SimCSE and ESimCSE models, demonstrating the effectiveness of our method. Stacking MLP classifiers on top of document embeddings proves to be an effective approach for encoding long documents.
| Model | ECHR macro-F1 | ECHR μ-F1 | SCOTUS macro-F1 | SCOTUS μ-F1 |
|---|---|---|---|---|
| Hi-LegalBERT | 64.0 | 70.0 | 66.5 | 76.4 |
| LSG | 60.3 | 71.0 | 63.7 | 73.3 |
| LegalLongformer | 63.6 | 71.7 | 66.9 | 76.6 |
| HAT | - | 79.8 | 59.1 | 70.5 |
| Ours | 66.1 | 72.6 | 67.3 | 77.5 |
Evaluation on End-to-end Finetuning Setting
To further solidify our findings, we evaluated our proposed CPE encoder in an end-to-end fine-tuning setting. We applied the LegalBERT CPE encoder to encode the different chunks. Rather than using simple average pooling, we employed a two-layer transformer encoder to aggregate the information from the different chunks. All parameters are trained in an end-to-end manner. The results for the classification of long documents are presented in Table 1. On the ECHR dataset, our model achieves a macro-F1 score of 66.1, an improvement of 2.5 points over LegalLongformer and 2.1 points over Hi-LegalBERT, and a micro-F1 score that is 0.9 points higher than LegalLongformer and 2.6 points higher than Hi-LegalBERT. Compared with LSG, we see even larger margins: +5.8 pp macro and +1.6 pp micro on ECHR, and +3.6 pp macro and +4.2 pp micro on SCOTUS. HAT delivers a strong micro-F1 (79.8) on ECHR but underperforms on SCOTUS macro-F1 (59.1), indicating less balanced class handling. Our consistent gains in both macro- and micro-F1 confirm that the CPE encoder generates high-quality chunk representations, simplifying the downstream classification of both frequent and rare legal classes during fine-tuning.
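A sketch of this chunk-aggregation module in PyTorch is shown below; the number of attention heads is an assumption, while max-pooling over the context-aware chunk vectors mirrors the pooling choice reported in the configuration.

```python
# Sketch of the end-to-end variant: chunk embeddings from the CPE encoder are
# aggregated by a two-layer transformer encoder and classified jointly.
import torch
import torch.nn as nn

class ChunkAggregator(nn.Module):
    def __init__(self, hidden_dim=768, num_labels=10, nhead=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=nhead,
                                           batch_first=True)
        self.inter_chunk = nn.TransformerEncoder(layer, num_layers=2)  # two layers
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, chunk_embs):             # (batch, num_chunks, hidden)
        ctx = self.inter_chunk(chunk_embs)     # context-aware chunk vectors
        doc = ctx.max(dim=1).values            # pool chunks into a document vector
        return self.classifier(doc)            # logits for the label set

# Toy usage: logits = ChunkAggregator()(torch.randn(2, 32, 768))
```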
Supplementary Evaluation and Analysis
Additional evaluations in the appendix highlight the robustness of the proposed CPE framework. As shown in Table 3, end-to-end fine-tuning with a hierarchical transformer improves performance over baselines like Hi-LegalBERT and LegalLongformer, confirming CPE's effectiveness. An ablation study on chunk size (Table 6) shows 128-token chunks offer the best performance, balancing context and relevance. On the short-document BIOSQ dataset (Table 7), CPE outperforms contrastive baselines, demonstrating its adaptability beyond long texts. Embedding visualizations and similarity metrics (Figure 3) further confirm that CPE produces semantically coherent and discriminative document representations.
Conclusion
We proposed a novel self-supervised contrastive learning framework, Chunk Prediction Encoders (CPE), for generating long document representations in the legal and biomedical domains. CPE captures intra-document and inter-document relationships by aligning text chunks within a document and distinguishing them from other documents. Our experiments on legal and biomedical classification tasks demonstrated significant improvements in macro-F1 scores, outperforming the baselines. Additionally, leveraging domain-specific pre-trained models like LegalBERT and ClinicalBioBERT within CPE further boosted performance, emphasizing the value of domain adaptation in document representation. Our findings highlight the potential of CPE for enhancing document understanding in the legal and medical domains.
Limitations
While CPE demonstrates strong performance in long document representation, certain aspects warrant further exploration. Its effectiveness across diverse domains beyond legal and medical texts remains to be fully assessed. Additionally, while the model effectively captures intra-document relationships, further refinements could enhance its adaptability to varying document structures. Future work could explore extensions to multilingual datasets and cross-domain applications.
Acknowledgments
This work was supported by ANR-22-CE23-0002 ERIANA and ANR Chaire IA Responsable.
Experimental Setting
Datasets
The different datasets we use are the following. Table 2 reports statistics.
ECHR: The European Court of Human Rights (ECHR) dataset comprises around 11K cases of alleged human rights violations. Each case contains a list of facts or events from the case description, and the task is to predict, among 10 labels, which human rights articles have been violated.
SCOTUS: The Supreme Court of the United States (SCOTUS) dataset contains complex cases that are not well resolved by the lower courts. SCOTUS is a single-label classification task, which predicts the issue area of court opinions from a choice of 14 available issues, such as criminal procedure and civil rights, among others. The training set consists of 5K cases, and the validation and test sets contain 1.4K cases each.
EURLEX: The European Union (EU) legislation dataset contains 65K cases from the European Union portal. The task is to predict EuroVoc labels such as economics, trade, and healthcare. The training set consists of 55K cases, and the validation and test sets contain 5K cases each. The average document length is approximately 1400 tokens. EUR-LEX is a multi-label classification dataset containing 100 labels (concepts).
MIMIC-III: The Medical Information Mart for Intensive Care (MIMIC-III) dataset comprises 40K discharge summaries from US hospitals, with each summary mapped to one or more ICD-9 (International Classification of Diseases, Ninth Revision) taxonomy labels. We utilized labels from the first level of the ICD-9 hierarchy.
BIOASQ: The BIOASQ dataset comprises biomedical articles sourced from PubMed. Each article is annotated with concepts from the Medical Subject Headings (MeSH) taxonomy. We used the first level of the MeSH taxonomy. The dataset is divided into train and test splits.
Configuration
We use the AdamW optimizer with a learning rate of $`2e-5`$ and weight decay of $`0.001`$. The model is trained for 3 epochs in the self-supervised SimCSE, ESimCSE, and CPE settings, while the MLP classifier is trained for 20 epochs. The classifier uses a batch size of 16, while the self-contrastive learning module uses a batch size of 4. We apply max pooling to the chunk representations to aggregate information across chunks. All models are trained on an NVIDIA Quadro RTX 8000 48 GB GPU.
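This configuration can be sketched as the following runnable skeleton; the encoder is a stand-in linear layer and the batches are synthetic, so only the optimizer settings and schedule reflect the paper.

```python
# Skeleton of the training configuration above (AdamW, lr 2e-5, weight decay
# 0.001, 3 contrastive epochs, batch size 4). Placeholder encoder and data.
import torch
import torch.nn.functional as F
from torch.optim import AdamW

encoder = torch.nn.Linear(768, 768)                  # placeholder for the CPE encoder
optimizer = AdamW(encoder.parameters(), lr=2e-5, weight_decay=0.001)

toy_batches = [(torch.randn(4, 768), torch.randn(4, 768)) for _ in range(8)]

for epoch in range(3):                               # contrastive pre-training epochs
    for doc_feats, chunk_feats in toy_batches:       # batch size 4
        z_doc = F.normalize(encoder(doc_feats), dim=-1)
        z_chk = F.normalize(encoder(chunk_feats), dim=-1)
        logits = z_doc @ z_chk.t() * 20.0            # scaled cosine similarities
        loss = F.cross_entropy(logits, torch.arange(4))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```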
| Dataset | Train | Test | Avg. token length |
|---|---|---|---|
| ECHR | 9000 | 1000 | 2050 |
| SCOTUS | 5000 | 1400 | 8000 |
| MIMIC | 30000 | 10000 | 3200 |
| EUR-LEX | 55000 | 5000 | 1400 |
| Bioasq | 80000 | 20000 | 300 |
Statistics of the long- and short-document datasets: average token length and number of train and test samples used for self-contrastive pre-training.
Evaluation of Advanced Hierarchical Representation
In Section 3.0.0.2, we introduced a generalized CPE that can be used with any long document encoder. We hypothesise that the performance of CPE improves when using a more advanced document encoder. To test this, we conducted experiments using Transformer over BERT (ToBERT) and Recurrence over BERT (RoBERT). For the classification tasks, we kept the BERT parameters fixed (frozen), and only the Transformer or LSTM encoder with the MLP layer was learned during training. The results in Table 3 show that ToBERT indeed improves the performance of the generalized HBERT by 6% in macro-F1 and 1.5% in $`\mu`$-F1. On the SCOTUS dataset, ToBERT achieves a performance gain of approximately 4% in both macro- and $`\mu`$-F1.
| Model | ECHR macro-F1 | ECHR μ-F1 | SCOTUS macro-F1 | SCOTUS μ-F1 |
|---|---|---|---|---|
| HBERT+MLP | 54.77 | 66.98 | 54.56 | 66.78 |
| RoBERT+MLP | 54.99 | 67.05 | 58.34 | 69.64 |
| ToBERT+MLP | 60.28 | 68.46 | 58.79 | 70.07 |
Ablation Study
We conducted an ablation study to evaluate the impact of different chunk sizes, visualize the quality of the embedding space, and examine the performance of the CPE framework on short documents.
Impact of chunk length: Table [tab:chunkHT] and Figure 3 summarize the CPE classification performance, measured by macro-F1, using a hierarchical Transformer encoder with chunk sizes of 64, 128, 256, and 512 across four datasets: ECHR, SCOTUS, EURLEX, and MIMIC. Figure 3 illustrates how performance varies with chunk size. On the ECHR dataset, a small chunk size of 64 produces a 57.64 macro-F1 score, which remains stable at chunk size 128 (57.74 macro-F1); performance then gradually decreases for the larger chunk sizes of 256 and 512. For the SCOTUS dataset, there is a notable improvement when moving from a chunk size of 64 to 128, after which performance slightly drops for larger sizes. The MIMIC scores show only a modest decline from chunk size 64 to 512, demonstrating relative robustness as the chunk size increases. Conversely, the EURLEX dataset performs best at the smaller chunk sizes of 64 and 128 but shows a sharp decline at chunk sizes of 256 and 512. This suggests that EURLEX is highly sensitive to the chunk-size parameter, likely because its shorter average text length means that larger chunks incorporate too much irrelevant detail.
| Chunk Size | ECHR macro-F1 | ECHR μ-F1 | SCOTUS macro-F1 | SCOTUS μ-F1 | EURLEX macro-F1 | EURLEX μ-F1 | MIMIC macro-F1 | MIMIC μ-F1 |
|---|---|---|---|---|---|---|---|---|
| 64 | 57.64 | 69.1 | 58.77 | 71.14 | 42.10 | 63.81 | 63.09 | 70.74 |
| 128 | 57.74 | 69.39 | 62.69 | 73.29 | 42.16 | 64.17 | 63.35 | 70.79 |
| 256 | 56.93 | 67.69 | 59.18 | 71.79 | 38.99 | 59.18 | 61.54 | 70.27 |
| 512 | 55.60 | 67.29 | 59.45 | 71.85 | 29.29 | 43.63 | 60.63 | 68.90 |
| PTM | Model | macro-F1 | μ-F1 |
|---|---|---|---|
| BERT | Embedding + MLP | 68.64 | 83.30 |
| | Embedding$`_{SimCSE}`$ + MLP | 68.05 | 82.70 |
| | Embedding$`_{ESimCSE}`$ + MLP | 68.04 | 82.66 |
| | Embedding$`_{CPE}`$ + MLP | 70.51 | 84.08 |
| ClinicalBioBERT | Embedding + MLP | 68.05 | 83.60 |
| | Embedding$`_{SimCSE}`$ + MLP | 69.13 | 83.31 |
| | Embedding$`_{ESimCSE}`$ + MLP | 68.77 | 83.17 |
| | Embedding$`_{CPE}`$ + MLP | 71.28 | 84.43 |
Performance on short document corpus: To evaluate our CPE framework on short documents, we perform experiments on the BIOASQ dataset. We followed the method outlined in Section 3.2, but instead of using the Longformer encoder, we utilized ClinicalBioBERT and BERT features, setting the length of the positive chunk to 64 tokens. Table [table-bio] reports the classification results. The top rows show the performance of models using BERT embeddings and the bottom rows display the performance of models using ClinicalBioBERT embeddings. The ClinicalBioBERT Embedding$`_{CPE}`$ + MLP model produces the highest scores, achieving 71.28 macro-F1 and 84.43 micro-F1. This indicates that self-supervised CPE learning produces high-quality embeddings. Conversely, the state-of-the-art ClinicalBioBERT Emb$`_{SimCSE}`$ + MLP and ClinicalBioBERT Emb$`_{ESimCSE}`$ + MLP models do not enhance the performance of the baseline Embedding + MLP model. This suggests that using only dropout augmentation or basic word repetition to form positive pairs yields little benefit for document- or paragraph-level representations, even though these techniques perform very well for sentence embeddings. The results demonstrate that the proposed CPE method improves the embeddings derived from ClinicalBioBERT by around 4% macro-F1.
Embedding Quality: Figure 4 shows the t-SNE projections of the CPE embeddings compared to the SimCSE baseline using the LegalBERT encoder on SCOTUS. As we can see, CPE demonstrates a higher quality of legal act encoding, as evidenced by more compact clusters. To quantify the comparison of the visualized embeddings, we applied the DBSCAN clustering algorithm to the t-SNE projections and evaluated clustering quality using the completeness and homogeneity measures. As shown in Figure 4, the completeness and homogeneity scores for CPE are 0.31 and 0.38, respectively, compared to 0.23 and 0.32 for SimCSE. This indicates a clear improvement in topic separation using the projected embeddings from CPE.
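This analysis can be reproduced in outline with scikit-learn; the DBSCAN parameters below are assumptions, since the paper does not report them.

```python
# Outline of the embedding-quality analysis: t-SNE projection, DBSCAN
# clustering of the 2-D points, and completeness/homogeneity vs. gold labels.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import completeness_score, homogeneity_score

def embedding_quality(doc_embeddings, labels, eps=3.0, min_samples=10):
    points = TSNE(n_components=2, random_state=0).fit_transform(doc_embeddings)
    clusters = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return completeness_score(labels, clusters), homogeneity_score(labels, clusters)

# Example with random data, only to show the call signature:
emb = np.random.randn(200, 768)
lab = np.random.randint(0, 14, size=200)      # e.g., 14 SCOTUS issue areas
print(embedding_quality(emb, lab))
```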

Training Time
Table 4 presents the training time required by each model on the ECHR and SCOTUS datasets. Self-supervised contrastive learning using the CPE encoder requires approximately 3 h on ECHR and 1.5 h on SCOTUS. In contrast, SimCSE and ESimCSE training takes approximately 4.5 h on ECHR and 2.5 h on SCOTUS.
SimCSE and ESimCSE require significantly longer training times because they construct document-level positive and negative pairs. Our CPE training is more efficient because each positive and negative pair involves a single sampled chunk paired with an aggregated document context rather than encoding the entire document twice, reducing both computation and memory overhead.
Furthermore, for the evaluation of the document embeddings on downstream tasks with an MLP head (1.78M trainable parameters), the training is light, taking less than 10 minutes per epoch on our hardware.
| Model | ECHR (T) | SCOTUS (T) |
|---|---|---|
| CPE | 3 h | 1.5 h |
| SimCSE | 4.5 h | 2.5 h |
| ESimCSE | 4.5 h | 2.5 h |
Training time (in hours) for contrastive pre-training on the ECHR and SCOTUS datasets.
Prompting LLaMA on long legal documents
Table 5 demonstrates that zero-shot prompting with LLaMA underperforms compared to the embedding models (HBERT+MLP), primarily due to the extensive length of its prompts. Although the literature suggests that few-shot demonstrations can enhance model performance, the limited context window presents significant challenges when handling long documents, each averaging 4,096 tokens. Incorporating even a one-shot example into the prompt consumes nearly all available space, leaving insufficient room for the actual query. Furthermore, the LLaMA model tends to over-predict a limited number of classes while rarely predicting others, leading to imbalanced classification outcomes, as indicated by a low macro-F1 score.
| Model | ECHR macro-F1 | ECHR μ-F1 | SCOTUS macro-F1 | SCOTUS μ-F1 |
|---|---|---|---|---|
| HBERT+MLP | 54.77 | 66.98 | 54.56 | 66.78 |
| Llama3-8B-Instruct | 15.39 | 22.54 | 0.369 | 24.91 |
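For illustration, a zero-shot prompt for this comparison might be constructed as follows; the wording, the label excerpt, and the character-based truncation are our assumptions, as the paper does not release its exact prompt.

```python
# Illustrative zero-shot prompt construction for the LLaMA comparison above.
LABELS = ["Criminal Procedure", "Civil Rights"]   # excerpt of the 14 SCOTUS issue areas

def build_prompt(document, labels, max_doc_chars=12000):
    doc = document[:max_doc_chars]   # crude truncation to respect the context window
    return (
        "You are a legal assistant. Read the following court opinion and answer "
        f"with exactly one issue area from this list: {', '.join(labels)}.\n\n"
        f"Opinion:\n{doc}\n\nIssue area:"
    )

print(build_prompt("The petitioner argues that ...", LABELS))
```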
Figures




