The rapid advancement of large language models (LLMs) has enabled significant strides in various fields. This paper introduces a novel approach to evaluating the effectiveness of LLM embeddings through their inherent geometric properties. We investigate the structural properties of these embeddings using three complementary metrics: δ-hyperbolicity, ultrametricity, and Neighbor Joining. δ-hyperbolicity, a measure derived from geometric group theory, quantifies how much a metric space deviates from a tree-like structure. In contrast, ultrametricity characterizes strictly hierarchical structures whose distances obey a strong triangle inequality. Neighbor Joining also quantifies how tree-like the distance relationships are, but does so specifically with respect to the tree reconstructed by the Neighbor Joining algorithm. By analyzing the embeddings generated by LLMs using these metrics, we uncover the extent to which the embedding space reflects an underlying hierarchical or tree-like organization. Our findings reveal that LLM embeddings exhibit varying degrees of hyperbolicity and ultrametricity, which correlate with their performance on the underlying machine learning tasks.
The integration of large language models (LLMs) in healthcare has shown promising advancements in personalized care, from recommending lifestyle changes to suggesting specific treatments based on individual health data [1]. These models leverage vast amounts of text data to generate embeddings that represent complex health-related information. However, understanding the geometric properties of these embeddings is crucial for improving their reliability [2]. Hyperbolicity, which reflects negative curvature, helps generalize smooth geometric concepts to abstract spaces like graphs through δ-hyperbolicity [3]. Research has shown that many real-world networks are inherently hyperbolic, and recognizing this structure allows for the development of more efficient algorithms. For example, δ-hyperbolicity has been successfully applied in tasks such as estimating the diameter of a graph [4], building compact routing and labeling schemes, and optimizing traffic flow and routing. These improvements are possible precisely because the underlying geometry of the network's embedding has a hyperbolic nature.
One way to assess the structural properties of embeddings is through δ-hyperbolicity, a concept introduced by Gromov [5]. δ-hyperbolicity quantifies the extent to which a metric space deviates from an ideal tree-like structure [6]: lower δ values indicate structures closer to a tree-like geometry, which suggests hierarchical relationships among data points, whereas a higher δ indicates a more complex, non-hierarchical structure.
In the domain of large-scale graph analysis, the authors in [7] used δ-hyperbolicity to study the structural behavior of graphs. A previous study considered heuristics for the hyperbolicity and treewidth of autonomous systems and internet router networks, suggesting that treewidth is large for these networks [8]. There is a common assumption that large social and information networks have a tree-like or hierarchical structure, but this is rarely tested. A significant empirical study tested this assumption using Gromov's δ-hyperbolicity [9]. Their work demonstrated that while traditional metrics may not reveal strong tree-likeness, refined hyperbolic analysis uncovers meaningful hierarchical patterns, validating the relevance of δ-hyperbolicity for structural analysis.
δ-hyperbolicity quantifies approximate tree-likeness in complex biological networks, simplifying tasks such as distance estimation, network classification, and core-periphery identification [10]. Another metric is ultrametricity [11], which captures a strict hierarchical structure, ideal for deterministic, nested data like phylogenetic trees. The concept of ultrametricity is fundamental in methods like UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and WPGMA (Weighted Pair Group Method with Arithmetic Mean), used to construct ultrametric trees in hierarchical clustering [12]. Complementing these geometric measures, Q-matrix statistics from Neighbor Joining provide a quantitative, algorithmic assessment of tree-likeness, where low normalized Q values indicate strong adherence to an additive tree structure [13], [14].
In this paper, we adapt δ-hyperbolicity and ultrametricity to the domain of genomic sequencing data analysis by evaluating the embeddings generated by LLMs such as Tasks Assessing Protein Embeddings (TAPE) [15], ESM2 [16], SeqVec [17], and ProtT5 [18]. We compute δ-hyperbolicity for the embeddings generated by these LLMs to assess their effectiveness. By analyzing the hyperbolicity of these embeddings, we aim to provide a deeper understanding of the geometric properties underlying LLM-generated representations. The key contributions of this study are as follows:
- We generate embeddings for three datasets using various large language models (LLMs) and evaluate them using δ-hyperbolicity and ultrametricity to assess the geometric properties of the embedding spaces, providing a comparative analysis of embeddings generated by several LLMs.
- We perform classification on the embeddings to further evaluate their quality and provide empirical validation for the two geometric metrics.
- We perform clustering on the embeddings as well to evaluate their quality and provide empirical validation.
Large language models, such as BERT [19] and GPT-3 [20], have revolutionized natural language processing by generating high-quality embeddings that capture semantic relationships in text. Recent studies have investigated the properties of these embeddings to enhance the performance of downstream tasks [21]. LLMs have been used to map 3D protein structure for efficient protein function prediction [22]. δ-hyperbolicity, as introduced by Gromov [5], provides a framework for analyzing the geometric structure of metric spaces. It has been applied in various domains, including graph theory and computational geometry [23]. The concept has been used to study the properties of different types of spaces and their implications for computational problems. Authors in [24] show that HYPHC closely approximates the optimal tree and outperforms other clustering methods. Authors in [11] treat hierarchical clustering as an optimization problem, where the goal is to find the best ultrametric that fits the data. They simplify the problem by embedding complex constraints as a min-max term in the cost function for easier gradient descent optimization.
Recent research has explored the geometric properties of embeddings in high-dimensional spaces, including the application of hyperbolicity measures [6]. These studies have shown that understanding the geometric structure of embeddings can provide insights into their effectiveness for various tasks, including clustering and classification. However, measuring hyperbolicity for biological sequences and using that information for sequence analysis has not previously been explored in the literature. Such analysis has applications in phylogenetics and evolutionary trees [25]. Biological data often follow tree-like evolutionary relationships (e.g., phylogenetic trees) [10], and δ-hyperbolicity helps assess the accuracy of tree-based models.
It also has applications in viral evolution and epidemiology. Protein-protein interaction networks exhibit tree-like structures [26], and using hyperbolic embeddings guided by δ-hyperbolicity improves drug-target interaction predictions. Mapping proteins to a hyperbolic space [27] allows efficient clustering of functionally similar proteins, thus providing applications in the functional annotation of proteins [22].
In this section, we present our approach for evaluating the geometric structure, in particular the δ-hyperbolicity, of embeddings generated by large language models (LLMs). Our approach involves the following steps:
- Obtaining embeddings from an LLM.
- Computing pairwise Euclidean or Poincaré distance matrices.
- Finding the δ-hyperbolicity of the resulting metric space.
- Finding the ultrametricity of the resulting metric space.
- Computing Neighbor Joining statistics by summarizing the distribution of Q-matrix values derived from the pairwise distance matrix.
- Analyzing the classification and clustering results to understand the geometric properties of the embeddings.
In our study, we employ four large language models (SeqVec, ESM-2, TAPE, and ProtT5) to generate embeddings from diverse biological datasets. Our goal is to evaluate how well these models capture meaningful patterns and structural relationships within biological sequences. Specifically, we analyze the geometric properties of the resulting embeddings, such as their alignment with tree-like structures, using metrics like δ-hyperbolicity, ultrametricity, and Neighbor Joining. This allows us to assess the suitability of each embedding space for downstream tasks such as clustering and classification, with a focus on identifying which models produce more interpretable and hierarchically organized representations. a) SeqVec [17]: SeqVec is a deep learning-based method that represents protein sequences as continuous vector embeddings without relying on evolutionary information. Inspired by the ELMo language model from NLP [28], SeqVec treats amino acids as words and learns contextual relationships from large unlabeled protein databases like UniRef50 [29]. Once trained, it generates rich, informative embeddings for new sequences in a fraction of a second, bypassing the time-consuming need for multiple sequence alignments. SeqVec demonstrates that single-sequence models can match or even outperform some alignment-based methods, offering a fast and scalable approach for protein function and structure prediction. b) ESM2 [16]: ESM-2 is a state-of-the-art protein language model developed by [16] that enables high-resolution protein structure prediction directly from single amino acid sequences, without the need for multiple sequence alignments (MSAs). Built on a transformer architecture and scaled up to 15 billion parameters, ESM-2 captures detailed atomic-level structural information as an emergent property of the learned sequence representations [30]. This model powers ESMFold, a fast and accurate structure prediction tool that approaches the performance of traditional MSA-based methods like AlphaFold while being significantly faster. ESM-2 represents a major leap forward in single-sequence-based structure prediction and has broad applications in structural biology, metagenomics, and protein function discovery. c) TAPE [15]: TAPE is a semi-supervised protein representation learning method that works by training a protein large language model and then generating numerical embeddings. The TAPE framework is built around a transfer learning approach for protein sequences, using deep neural network architectures such as Transformers, LSTMs, and ResNets. These models are first trained on large, unlabeled protein sequence datasets, such as UniRef50 [29], using self-supervised learning objectives, notably masked language modeling, similar to BERT in NLP. The goal is to learn rich, general-purpose embeddings of protein sequences that can be fine-tuned or evaluated on downstream biological tasks, such as secondary structure prediction, contact map prediction, and protein stability. The TAPE benchmark standardizes these tasks, enabling consistent evaluation of how well learned representations capture structural and functional properties of proteins. We obtain the embeddings as an output, which reflects the contextualized representation of sequences after processing through the model.
d) ProtT5 [31]: ProtT5 is a powerful protein language model (pLM) based on the T5 (Text-to-Text Transfer Transformer) architecture from NLP, adapted to learn the language of proteins from large-scale amino acid sequence data. Unlike the original T5 model, which has both an encoder and a decoder [32], ProtT5 uses a BERT-style masked language modeling objective, focusing on reconstructing individual masked amino acids rather than spans. In ProtT5, only the encoder is used during inference because it performs better and is more efficient [31]. ProtT5 embeddings capture key biophysical features of proteins, enabling accurate prediction of secondary structure, subcellular localization, and membrane association, all without relying on evolutionary information or multiple sequence alignments. Notably, ProtT5 achieved state-of-the-art accuracy on these tasks, suggesting it effectively learns protein grammar from raw sequence data alone.
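To make the embedding-extraction step concrete, the sketch below shows one way to obtain a fixed-size per-sequence embedding from ESM-2 via the HuggingFace Transformers library. The checkpoint name and the mean-pooling strategy are illustrative assumptions; the paper does not specify its exact extraction pipeline.

```python
# Minimal sketch of per-sequence embedding extraction with ESM-2 via
# HuggingFace Transformers. The checkpoint and mean pooling are
# illustrative assumptions, not the paper's exact pipeline.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

def embed_sequence(seq: str) -> torch.Tensor:
    """Return a fixed-size embedding by mean-pooling the last hidden states."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Drop the special BOS/EOS tokens before pooling over residue positions.
    hidden = out.last_hidden_state[0, 1:-1]
    return hidden.mean(dim=0)

emb = embed_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(emb.shape)  # torch.Size([1280]) for the 650M ESM-2 checkpoint
```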
We compute pairwise distance matrices for all embeddings using both Euclidean and Poincaré metrics to obtain N × N distance matrices, where N is the total number of protein sequences. Let X be the set of embeddings.
For the Euclidean metric, the distance between two embeddings $x_i$ and $x_j$ is given by:
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2},$$
where d is the embedding dimensionality.
For the Poincaré metric [2], the hyperbolic distance between two sequence embeddings $x_i$ and $x_j$ within the unit ball is computed as:
$$d_{\mathbb{B}}(x_i, x_j) = \operatorname{arccosh}\!\left(1 + \frac{2\,\lVert x_i - x_j \rVert^2}{\left(1 - \lVert x_i \rVert^2\right)\left(1 - \lVert x_j \rVert^2\right)}\right).$$
Both metrics capture complementary geometric properties: the Euclidean metric measures standard linear distances, while the Poincaré metric captures hierarchical, tree-like relationships in hyperbolic space. A sketch of both computations is given below.
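The sketch assumes X is an (N, d) NumPy array of embeddings; rescaling the points into the open unit ball before applying the Poincaré formula is a modeling assumption, since the formula is only defined inside the ball.

```python
# Sketch of the two pairwise distance matrices described above.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def euclidean_matrix(X):
    # Standard N x N Euclidean distance matrix.
    return squareform(pdist(X, metric="euclidean"))

def poincare_matrix(X, eps=1e-5):
    # Rescale so every point lies strictly inside the unit ball (assumption).
    max_norm = np.linalg.norm(X, axis=1).max()
    Xb = X / (max_norm + eps)
    sq_norms = np.sum(Xb**2, axis=1)                        # ||x_i||^2
    sq_dists = squareform(pdist(Xb, metric="sqeuclidean"))  # ||x_i - x_j||^2
    denom = np.outer(1.0 - sq_norms, 1.0 - sq_norms)
    arg = 1.0 + 2.0 * sq_dists / np.maximum(denom, eps)
    return np.arccosh(np.maximum(arg, 1.0))  # arccosh(1) = 0 on the diagonal
```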
We evaluate the tree-likeness using three metrics: δ-hyperbolicity, ultrametricity, and Neighbor Joining (NJ).
δ-hyperbolicity (Gromov hyperbolicity) is a more flexible, approximate notion of tree-likeness. It applies to metric spaces, and especially to graphs, where triangles are allowed to be "δ-slim," meaning each side lies within a δ-radius of the other two sides. This permits local deviations or shortcuts while still maintaining an overall tree-like shape, so it can be applied to more complex, noisy, or interconnected systems.
Ultrametricity is a strong geometric condition where all triangles in the space are either isosceles with equal long sides or equilateral. It enforces a strict hierarchy, making it ideal for modeling perfect tree structures like phylogenetic trees and hierarchical clustering dendrograms. In essence, ultrametricity describes an exact hierarchical tree. It can be used for deterministic, nested data (like biological taxonomies).
Finally, we use Q-matrix statistics derived from the Neighbor Joining (NJ) algorithm to quantify tree-likeness. Pairs of points with low (more negative) Q values indicate strong adherence to the NJ joining criterion. NJ thus provides a quantitative, algorithmic perspective: by analyzing the Q-matrix values, we can compute an NJ score that reflects how well the distance matrix approximates an additive tree.
To evaluate the δ-hyperbolicity, we need to compute the Gromov products for four points a, b, c, and w. The Gromov product $(a, b)_w$ is defined as (also mentioned in [6]):
$$(a, b)_w = \tfrac{1}{2}\left(d(a, w) + d(b, w) - d(a, b)\right).$$
The δ-hyperbolicity is then calculated by checking whether the following inequality holds for all quadruples a, b, c, and w [6]:
$$(a, b)_w \ge \min\left((a, c)_w,\, (b, c)_w\right) - \delta.$$
We estimate the δ-hyperbolicity by computing the minimum value of δ that satisfies the above condition for all quadruples in the space.
The proposed algorithm to evaluate δ-hyperbolicity is outlined in Algorithm 1 and is implemented to assess the geometric structure of LLM-derived protein sequence embeddings. The algorithm takes as input a set of embeddings X computed from different LLMs, a distance function d, and a user-defined number of samples to evaluate. The goal is to calculate the δ-hyperbolicity value δ for the metric space defined by the embeddings. Initially, the embeddings are flattened and padded to a uniform length, then optionally reduced in dimensionality via PCA to retain 99% of the variance, ensuring efficient computation. The algorithm then computes the pairwise distance matrix D using the distance function d on the embeddings, where each entry in D represents the distance between a pair of embeddings. The core of the algorithm involves sampling a specified number of random quadruples (a, b, c, w) from the embedding space, rather than iterating exhaustively over all combinations, which helps it scale to large datasets. For each quadruple, it computes the Gromov products and the slack in the four-point condition; this value represents the extent to which the quadruple deviates from a tree-like (hyperbolic) relation. After all quadruples have been evaluated, the algorithm returns δ_max, δ_avg, and δ_std as the final δ-hyperbolicity values, which quantify the hyperbolicity of the metric space defined by the embeddings. These values provide insights into the geometric structure of the embeddings, which can be critical for understanding their effectiveness in underlying ML tasks.
Using parallel processing, these δ values are computed for all sampled quadruples, after which the algorithm returns the maximum, average, and standard deviation of δ across all samples. These values collectively quantify the degree of hyperbolicity in the embedding space. A lower average δ indicates a space that is more tree-like, suggesting that the embedding structure may support hierarchical relationships.
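A minimal sketch of this sampled estimate is shown below, assuming the distance matrix D has already been computed (the function names and sample count are illustrative; the PCA, padding, and parallelization steps of Algorithm 1 are omitted):

```python
# Sketch of the sampled four-point delta-hyperbolicity estimate.
import numpy as np

def gromov_product(D, x, y, w):
    # (x, y)_w = 1/2 (d(x,w) + d(y,w) - d(x,y))
    return 0.5 * (D[x, w] + D[y, w] - D[x, y])

def delta_hyperbolicity(D, num_samples=100_000, rng=None):
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    deltas = np.empty(num_samples)
    for t in range(num_samples):
        a, b, c, w = rng.choice(n, size=4, replace=False)
        ab = gromov_product(D, a, b, w)
        ac = gromov_product(D, a, c, w)
        bc = gromov_product(D, b, c, w)
        # Slack in the four-point condition for this quadruple.
        deltas[t] = min(ac, bc) - ab
    deltas = np.maximum(deltas, 0.0)
    return deltas.max(), deltas.mean(), deltas.std()
```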
A metric space is said to be ultrametric if the distance function d satisfies the ultrametric inequality, also called a stricter version of the triangle inequality [11]:
$$d(x, z) \le \max\big(d(x, y),\, d(y, z)\big) \quad \text{for all points } x, y, z.$$
This is a stronger condition than the standard triangle inequality. It enforces a strict hierarchy, making it ideal for modeling perfect tree structures like phylogenetic trees and hierarchical clustering dendrograms. This inequality implies that, in any triangle formed by three points, the two longest sides are always equal in length. As a result, all triangles in an ultrametric space are either isosceles with a short base or equilateral, which leads to a very constrained and highly symmetric structure.
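As a short worked derivation of the isosceles property, using only the ultrametric inequality above:

```latex
% Assume, without loss of generality, that d(a,b) < d(b,c).
d(a,c) \le \max\{d(a,b),\, d(b,c)\} = d(b,c)
% Applying the ultrametric inequality again:
d(b,c) \le \max\{d(a,b),\, d(a,c)\}
% Since d(a,b) < d(b,c), the maximum on the right must be d(a,c),
% hence d(b,c) \le d(a,c). Combined with the first line:
d(a,c) = d(b,c) \quad \text{(the two longest sides are equal)}
```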
Ultrametric spaces arise naturally in contexts where elements are organized hierarchically. In these spaces, distances do not reflect direct spatial relationships but instead represent levels of similarity or common ancestry. For example, in phylogenetics, the ultrametric property reflects the idea that all species in a clade evolved from a common ancestor at the same rate, resulting in a tree with equidistant leaves. Similarly, in hierarchical clustering, ultrametric distances can be derived from dendrograms, where the distance between any two items corresponds to the height of their lowest common ancestor in the tree.
Because of this hierarchical nature, ultrametricity is particularly useful in applications where data has an inherent nested or taxonomic structure. It also plays an important role in number theory, p-adic analysis, and certain types of non-Euclidean geometry. The strictness of the ultrametric inequality makes ultrametric spaces more rigid than general metric spaces, which in turn makes them excellent candidates for modeling domains where a perfect hierarchical structure is assumed or observed.
The algorithm titled “Check Ultrametricity of a Distance Matrix” is designed to determine whether a given distance matrix D ∈ R n×n satisfies the ultrametric inequality, a stricter form of the triangle inequality that characterizes hierarchical or tree-like metric spaces. Specifically, the ultrametric condition requires that for any triplet of points (i, j, k), the largest pairwise distance among them must not exceed the maximum of the other two distances.
Formally, for each triple the conditions
$$d_{ij} \le \max(d_{ik}, d_{jk}), \quad d_{ik} \le \max(d_{ij}, d_{jk}), \quad \text{and} \quad d_{jk} \le \max(d_{ij}, d_{ik})$$
must all hold. The algorithm iterates over all unique triples (i, j, k) with i < j < k:

1: Compute violation terms:
   v1 = d_ij - max(d_ik, d_jk), v2 = d_ik - max(d_ij, d_jk), v3 = d_jk - max(d_ij, d_ik)
2: Compute violation ν = max(v1, v2, v3, 0)
3: if ν > ϵ then
4:   Append ν to violations
5:   Increment num_violations ← num_violations + 1
6: end if
7: Increment total_triples ← total_triples + 1
8: end for
// Compute statistics over violations:
9:  max_violation = max(violations) if violations ≠ ∅ else 0
10: avg_violation = mean(violations) if violations ≠ ∅ else 0
11: std_violation = std(violations) if violations ≠ ∅ else 0
12: return max_violation, avg_violation, std_violation
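A Python rendering of this check is sketched below (the function name and tolerance are illustrative; the loop is O(n³) and mirrors the pseudocode rather than being optimized):

```python
# Ultrametricity check: any amount by which the largest distance in a
# triple exceeds the maximum of the other two is recorded as a violation.
import itertools
import numpy as np

def ultrametricity_violations(D, eps=1e-9):
    n = D.shape[0]
    violations = []
    for i, j, k in itertools.combinations(range(n), 3):
        dij, dik, djk = D[i, j], D[i, k], D[j, k]
        v1 = dij - max(dik, djk)
        v2 = dik - max(dij, djk)
        v3 = djk - max(dij, dik)
        v = max(v1, v2, v3, 0.0)  # at most one term can be positive
        if v > eps:
            violations.append(v)
    if not violations:
        return 0.0, 0.0, 0.0
    v = np.asarray(violations)
    return v.max(), v.mean(), v.std()
```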
We use a Neighbor Joining (NJ)-based analysis of the NJ Q-matrix to measure the degree to which the pairwise distance structure of the embeddings adheres to a tree-like geometry [33].
Given an n × n distance matrix D, the NJ algorithm defines the Q-matrix as
$$Q(i, j) = (n - 2)\, D(i, j) - \sum_{k=1}^{n} D(i, k) - \sum_{k=1}^{n} D(j, k), \quad i \ne j,$$
where pairs with the most negative Q(i, j) values are preferentially joined during tree construction. Rather than explicitly reconstructing a phylogenetic tree, we summarize the distribution of Q-matrix values as a global measure of tree-likeness. Specifically, we compute all off-diagonal entries of the Q-matrix, take their absolute values to obtain a scale-independent measure of deviation from an ideal additive tree, and report the maximum, mean, and standard deviation of these normalized Q values. For degenerate inputs with fewer than three points, where the NJ criterion is undefined, the procedure simply returns nj_max = nj_avg = nj_std = 0.
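A minimal sketch of this summary statistic, assuming D is an (n, n) NumPy distance matrix (the function name is illustrative):

```python
# Q-matrix summary statistics; returns zeros for n < 3, mirroring the
# early exit in the pseudocode above where the NJ criterion is undefined.
import numpy as np

def nj_q_statistics(D):
    n = D.shape[0]
    if n < 3:
        return 0.0, 0.0, 0.0
    row_sums = D.sum(axis=1)
    # Q(i,j) = (n-2) D(i,j) - sum_k D(i,k) - sum_k D(j,k)
    Q = (n - 2) * D - row_sums[:, None] - row_sums[None, :]
    off_diag = Q[~np.eye(n, dtype=bool)]
    q = np.abs(off_diag)  # absolute Q values summarized as the NJ score
    return q.max(), q.mean(), q.std()
```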
We use three real-world protein sequence datasets. A summary of the datasets used for experimentation is given below: 1) PDB186: We use a benchmark dataset from the literature (called PDB186), which is used for DNA-protein binding prediction, hence a binary classification task [34]. More formally, the dataset contains DNA-binding protein sequences and non-binding protein sequences. The research task is to predict whether the DNA and protein sequences will bind or not. The minimum, maximum, and average sequence lengths in the data are 64, 1323, and 264.693, respectively. The total number of sequences is 186, of which 93 bind and 93 do not.
2) Breast Cancer: Our membranolytic anticancer peptides (ACPs) dataset [35] contains information about the peptides (protein sequences) and their anticancer activity (target labels) on breast cancer cell lines. The target labels are categorized into "very active", "moderately active", "experimental inactive", and "virtual inactive" groups. This dataset contains 949 peptide sequences distributed among the four categories, as shown in Table I. 3) Lung Cancer: The dataset on membranolytic anticancer peptides (ACPs) [35] provides details regarding peptide sequences and their corresponding anticancer effectiveness against breast and lung cancer cell lines, with target labels classified into the same four groups. In total, the dataset comprises 949 and 901 peptide sequences for breast and lung cancer, respectively. Table II shows the distribution.
To evaluate the results, we analyze the δ-hyperbolicity, ultrametricity, and Neighbor Joining scores for the LLM embeddings, examining the utility of these metrics on the different datasets to understand their structural properties.
For the classification task, we use classifiers including SVM, Naive Bayes, MLP, KNN, Random Forest, Logistic Regression, and Decision Tree. Evaluation metrics include accuracy, precision, recall, weighted F1, macro F1, ROC-AUC, and training runtime. We split our data into random training and test sets with a 70-30% split for both tasks and repeat experiments 5 times; a sketch of this protocol is given below. The preprocessed data and code will be available (in the published version).
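The following scikit-learn sketch illustrates the evaluation protocol; the hyperparameters are library defaults, which is an assumption since the paper does not list them:

```python
# Sketch of the classification protocol: 70-30 split, repeated 5 times,
# over the classifiers named above (default hyperparameters assumed).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

classifiers = {
    "SVM": SVC(probability=True),
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(max_iter=500),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
}

def evaluate_classification(X, y, runs=5):
    """70-30 split repeated `runs` times; returns per-run metric dicts."""
    scores = {name: [] for name in classifiers}
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y)
        for name, clf in classifiers.items():
            clf.fit(X_tr, y_tr)
            pred = clf.predict(X_te)
            run = {
                "accuracy": accuracy_score(y_te, pred),
                "weighted_f1": f1_score(y_te, pred, average="weighted"),
                "macro_f1": f1_score(y_te, pred, average="macro"),
            }
            if len(np.unique(y)) == 2:  # ROC-AUC for the binary PDB186 task
                run["roc_auc"] = roc_auc_score(
                    y_te, clf.predict_proba(X_te)[:, 1])
            scores[name].append(run)
    return scores
```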
For the clustering task, we use the k-means [36], k-modes [37], and Agglomerative [38] clustering algorithms. The evaluation metrics used were the Silhouette Coefficient [39], Calinski-Harabasz Score [40], and Davies-Bouldin Score [41].
- Algorithms: a) k-means [36]: The classical k-means clustering method [36] clusters objects based on the Euclidean mean. K-means clustering partitions data into K clusters by repeatedly assigning points to the nearest centroid and updating centroids as the mean of their assigned points. This continues until the assignments stabilize, resulting in compact, well-separated clusters [42]. b) k-modes [37]: We also cluster with k-modes [37], which is a variant of k-means using modes instead of means. Here, pairs of objects are subject to a dissimilarity measure (e.g., Hamming distance) rather than the Euclidean mean. c) Agglomerative [38]: All agglomerative hierarchical clustering methods start from a proximity matrix of either similarities or distances between elements [43]. Agglomerative clustering is a bottom-up algorithm that starts with each data point as its own cluster. It repeatedly merges the closest pairs of clusters based on a chosen distance metric until all points are grouped into a single cluster or a set number of clusters [44].
- Evaluation Metrics: a) Silhouette Coefficient [39]: Given a feature vector, the silhouette coefficient computes how similar the feature vector is to its own cluster (cohesion) compared to other clusters (separation) [45]. Its score ranges between [-1, 1], where 1 means the best possible clustering and -1 means the worst possible clustering. b) Calinski-Harabasz Score [40]: The Calinski-Harabasz score evaluates the validity of a clustering based on the within-cluster and between-cluster dispersion of each object with respect to each cluster (based on the sum of squared distances) [45]. There is no defined range for this metric; a higher score denotes better-defined clusters. c) Davies-Bouldin Score [41]: Given a feature vector, the Davies-Bouldin score computes the ratio of within-cluster to between-cluster distances [45]. There is no defined range for this metric; a smaller score denotes that groups are well separated and the clustering results are better. A sketch of this evaluation is given after this list.
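The sketch below computes the three clustering quality scores with scikit-learn, assuming embeddings X and a chosen number of clusters k (k-modes requires categorical data and a separate package, so it is omitted here):

```python
# Sketch of the clustering evaluation over k-means and Agglomerative.
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

def evaluate_clustering(X, k):
    results = {}
    algorithms = {
        "k-means": KMeans(n_clusters=k, n_init=10),
        "Agglomerative": AgglomerativeClustering(n_clusters=k),
    }
    for name, algo in algorithms.items():
        labels = algo.fit_predict(X)
        results[name] = {
            "silhouette": silhouette_score(X, labels),                # in [-1, 1], higher is better
            "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        }
    return results
```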
A popular visualization technique, t-SNE [46], is employed to visualize the feature vectors generated by the SeqVec [17] method. The t-SNE plots are shown in Figure 1 for the PDB186, Figure 2 for the Lung Cancer, and Figure 3 for the Breast Cancer datasets. The main idea behind reporting the t-SNE plots is to observe whether there is a clear class separation between the labels in the datasets with the different embeddings generated by the respective LLMs. We can observe that the clusters overlap for SeqVec on the Lung Cancer and Breast Cancer datasets, showing that no clear decision boundary exists in the data.
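A minimal sketch of this visualization, assuming embeddings X and integer class labels y (the perplexity value is an illustrative choice):

```python
# Sketch of the t-SNE plot used to inspect class separation.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X2 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=8, cmap="tab10")
plt.title("t-SNE of SeqVec embeddings")
plt.show()
```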
In the first step, we simulate data to understand the hyperbolicity values for known structures. We generate the following synthetic datasets.

a) Sphere Space Data Computation: To simulate data in the sphere space, we generate n random points on the surface of a unit sphere in a high-dimensional space. Each point is sampled from a standard normal distribution in R^d and subsequently normalized to lie on the unit sphere by dividing each point by its Euclidean norm. The distance matrix for these points is computed using the Euclidean distance metric. The δ-hyperbolicity values for this space are then calculated by evaluating the shortest path distances between all possible pairs of points, followed by applying the δ-hyperbolicity calculation method described in Section III.

b) Dense Graph Data Computation: A dense graph is generated using the Erdős-Rényi model, where n nodes are connected with a high probability, p = 0.8. The adjacency matrix of the graph is obtained, representing the presence of edges between nodes. To compute the distance matrix, the Floyd-Warshall algorithm is applied to find the shortest paths between all pairs of nodes.

c) Poincaré Space Data Computation: The radial distance r is uniformly sampled from the interval [0, 1), while the angular coordinate θ is uniformly sampled from the interval [0, 2π). The Cartesian coordinates of each point are then calculated using the polar-to-Cartesian transformation: x = r cos(θ) and y = r sin(θ). The geodesic distance between any two points within the Poincaré ball is computed using the hyperbolic distance formula, which considers both the Euclidean distance between the points and their distances from the origin.

d) Hyperbolicity Computation: For the sphere space, the dense graph, and the Poincaré space, the hyperbolicity is computed using the distance matrices derived from their respective constructions. The computation involves evaluating the four-point condition for all possible quadruples of points or nodes. Specifically, for any quadruple of points, we calculate the δ value, which represents the deviation from an ideal tree metric. The average and standard deviation of these δ values are reported in Table III.

The variation in hyperbolicity values across the different metric spaces, as shown in Table III, can be attributed to the underlying structural properties of the data. The sphere space, with a δ value of 0.32 ± 0.19, exhibits slightly higher hyperbolicity, likely due to the curved nature of the space, which deviates from the tree-like structure that minimizes hyperbolicity. The dense graph, with a δ value of 0.31 ± 0.25, shows similar hyperbolicity levels, reflecting the graph's high connectivity and the presence of many alternative paths between nodes, which reduces the tree-likeness. In contrast, the Poincaré space, with the smallest δ value of 0.2867 ± 0.3328, reflects its inherent hyperbolic geometry, which naturally supports a tree-like structure with minimal deviation (this hypothesis aligns well with the literature on Poincaré ball analysis [2]), leading to lower hyperbolicity. The differences in hyperbolicity across these spaces highlight how the geometric and topological characteristics of the data influence the extent to which the space approximates a tree metric.
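A compact sketch of how these three synthetic spaces can be constructed is shown below (the sizes n = 200 and d = 50 and the random seeds are illustrative; the edge probability p = 0.8 follows the text):

```python
# Sketch of the three synthetic metric spaces used as hyperbolicity baselines.
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
n, d = 200, 50  # illustrative sizes

# (a) Unit sphere: normal samples projected onto the unit sphere.
S = rng.standard_normal((n, d))
S /= np.linalg.norm(S, axis=1, keepdims=True)
D_sphere = squareform(pdist(S))

# (b) Dense Erdos-Renyi graph: shortest-path (Floyd-Warshall) distances.
G = nx.erdos_renyi_graph(n, p=0.8, seed=0)
D_graph = nx.floyd_warshall_numpy(G)

# (c) Poincare disk: polar sampling, then hyperbolic geodesic distances.
r = rng.uniform(0.0, 1.0, n)
theta = rng.uniform(0.0, 2.0 * np.pi, n)
P = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
sq = np.sum(P**2, axis=1)
arg = 1.0 + 2.0 * squareform(pdist(P, "sqeuclidean")) / np.outer(1 - sq, 1 - sq)
D_poincare = np.arccosh(np.maximum(arg, 1.0))
```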
The variation in hyperbolicity values for the PDB186 dataset, as shown in Table IV, can be attributed to the underlying structural properties of the data. The ProtT5 embedding space, with the smallest δ value (0.0418) and ultrametricity value (0.1301), reflects a tree-like structure with minimal deviation. In contrast, SeqVec, with a δ value of 3.2018, exhibits higher hyperbolicity; similarly, its ultrametricity value of 16.6730 denotes that the embedding deviates from a tree-like structure. The differences in hyperbolicity across these embeddings highlight how the geometric and topological characteristics of the PDB186 data influence the extent to which the space approximates a tree metric.
The Neighbor Joining (NJ) scores differ substantially between the Euclidean and Poincaré distance matrices (e.g., NJ averages of 318.63 versus 567.03 for ProtT5 on PDB186), indicating that the two geometries deviate from an additive tree structure to different degrees. The Poincaré ball naturally preserves hierarchical relationships through its hyperbolic geometry, which better approximates a tree-like organization. This explains why Poincaré-based metrics show stronger tree-likeness even though both distance matrices are derived from the same underlying embedding structure.
The variation in hyperbolicity values for the Lung Cancer dataset, as shown in Table V, can be attributed to the underlying structural properties of the data. The ProtT5 embedding space, with the smallest average δ value (0.1013) and ultrametricity value (0.2408), reflects a tree-like structure with minimal deviation. In contrast, SeqVec, with a δ value of 0.5925, exhibits higher hyperbolicity; similarly, its ultrametricity value of 1.2458 denotes that the embedding deviates from a tree-like structure. The differences in hyperbolicity across these embeddings highlight how the geometric and topological characteristics of the data influence the extent to which the space approximates a tree metric.
The variation in hyperbolicity values for the Breast Cancer dataset, as shown in Table VI, can be attributed to the underlying structural properties. The ProtT5 embedding space, with the smallest average δ value (0.1049) and ultrametricity value (0.2466), reflects a tree-like structure with minimal deviation. In contrast, SeqVec, with a δ value of 0.5882, exhibits higher hyperbolicity; similarly, its ultrametricity value of 1.3272 denotes that the embedding deviates from a tree-like structure.
This larger hyperbolicity suggests that the underlying structure of the SeqVec embeddings deviates more significantly from a tree-like structure, which could be due to the complex nature of DNA-protein binding interactions and the way SeqVec computes its embeddings, whereas the ProtT5 embeddings exhibit a tree-like structure, as is evident from their hyperbolicity and ultrametricity values. The higher δ value reflects the intricate and possibly non-uniform relationships within the dataset, where the distances between certain points do not conform to the idealized tree metric as closely as in the synthetic data. Additionally, the substantial standard deviation highlights the variability in hyperbolicity across different parts of the dataset, suggesting that some regions may be more tree-like while others exhibit more complex, non-metric characteristics. This complexity is expected in biological data, where the underlying interactions and dependencies are often non-linear and can vary significantly across different sequences.
We report the average performance metrics (accuracy, precision, recall, F1 scores, and ROC-AUC) for the different machine learning models applied to the classification task on each dataset, averaged over 5 experimental runs. We evaluate ROC-AUC to assess decision-boundary confidence for binary classification.
Table VII shows classification results for the PDB186 dataset. We note that ProtT5 achieves the highest ROC-AUC of 0.7968 using Logistic Regression, followed by ESM2 at 0.7453. This superior performance corresponds directly to their geometric properties: ProtT5's δ-hyperbolicity and ultrametricity values are substantially lower than those of SeqVec (3.2018 and 16.6730), which achieves only 0.6173 ROC-AUC. The elevated hyperbolicity and ultrametricity of SeqVec indicate significant deviation from tree-like structures, resulting in embeddings that fail to organize DNA-protein binding relationships in a hierarchically meaningful way. Across the repeated experimental runs, the underlying hierarchical organization captured by each embedding method remains stable and predictive of classification performance.
The consistent alignment between geometric properties (δ-hyperbolicity, ultrametricity, NJ scores) and classification performance across three diverse biological datasets provides strong empirical evidence that tree-likeness in embedding spaces is beneficial for capturing the hierarchical nature of protein sequence relationships. Models that generate more tree-like embeddings create representations where biologically related sequences are organized in a manner that facilitates accurate classification by downstream machine learning algorithms. An exception can be seen with the PDB186 dataset: here, ProtT5 does not achieve the best clustering performance; instead, the SeqVec embeddings that perform best according to clustering metrics exhibit the worst hyperbolicity and ultrametricity values. This divergence highlights an important insight: a strong tree-like structure in the embedding space is beneficial for complex or hierarchical datasets, but may not be necessary for simpler binary classification tasks.
The observed contradiction arises because clustering metrics evaluate local separation, while hyperbolicity and ultrametricity reflect global geometric properties, particularly whether the space resembles a tree or hierarchy. In simple binary tasks like PDB186, strong cluster separation does not require a tree-like embedding space. Clustering metrics like the Silhouette Coefficient measure how well separated the clusters are locally, and we can have well-separated clusters without the embedding space being tree-like. Moreover, PDB186 is a binary classification problem with relatively clear separation between the two classes. Such problems may benefit more from flatter, linearly separable embeddings than from hierarchical structures. In contrast, the Lung and Breast Cancer datasets may involve multiclass or more nuanced biological variation, where tree-like relationships help capture underlying semantics or phenotype similarities. Our results on these datasets highlight the potential of incorporating δ-hyperbolicity, ultrametricity, and Neighbor Joining (NJ) as criteria for selecting and fine-tuning LLMs. Additionally, our analysis of synthetic data across different geometric spaces provides further insights into the behavior of hyperbolicity in various contexts, reinforcing the importance of understanding the geometric properties of embeddings. Future work could extend this research by exploring other geometric measures, applying this approach to more diverse datasets, and integrating these findings into real-world health applications to validate the practical benefits of hyperbolic embeddings.
To verify this, the algorithm iterates through all unique combinations of triplets in the distance matrix. For each triplet, it retrieves the corresponding distances d_ij, d_ik, and d_jk, then identifies the largest of the three. It compares this maximum value to the maximum of the other two (smaller) distances, recording any excess beyond the tolerance ϵ as a violation.
Moreover, these embeddings also exhibit the lowest ultrametricity and hyperbolicity values, indicating that the learned representations are hierarchically organized and tree-like in structure. Although tree-like embeddings are not universally better, they are very useful for tasks involving hierarchy.