Modern HTTPS mechanisms such as Encrypted Client Hello (ECH) and encrypted DNS improve privacy but remain vulnerable to website fingerprinting (WF) attacks, where adversaries infer visited sites from encrypted traffic patterns. Existing WF methods rely on supervised learning with site-specific labeled traces, which limits scalability and fails to handle previously unseen websites. We address these limitations by reformulating WF as a zero-shot cross-modal retrieval problem and introducing STAR. STAR learns a joint embedding space for encrypted traffic traces and crawl-time logic profiles using a dual-encoder architecture. Trained on 150K automatically collected traffic-logic pairs with contrastive and consistency objectives and structure-aware augmentation, STAR retrieves the most semantically aligned profile for a trace without requiring target-side traffic during training. Experiments on 1,600 unseen websites show that STAR achieves 87.9% top-1 accuracy and 0.963 AUC in open-world detection, outperforming supervised and few-shot baselines. Adding an Adapter with only four labeled traces per site further boosts top-5 accuracy to 98.8%. Our analysis reveals intrinsic semantic-traffic alignment in modern web protocols, identifying semantic leakage as the dominant privacy risk in encrypted HTTPS traffic. We release STAR's datasets and code to support reproducibility and future research.
As modern HTTPS evolves, traditional protocol-visible identifiers such as Server Name Indication (SNI) and DNS queries are increasingly concealed by mechanisms like Encrypted Client Hello (ECH) [1] and encrypted DNS [2]. This shift limits the effectiveness of conventional web inference techniques that rely on such metadata [3]. However, even when both payloads and headers are fully encrypted, traffic traces still reveal structural patterns, such as packet sizes, timing, and burst behaviors, that reflect the underlying resource structure of websites [4]. Website fingerprinting (WF) approaches [5]-[10] exploit these residual features to infer the site being visited, without requiring access to any plaintext identifiers. In this context, WF has emerged as one of the few remaining passive techniques for web-level inference under full encryption.

1 https://github.com/2654400439/STAR-Website-Fingerprinting

Existing WF approaches, however, face fundamental limitations that hinder their scalability and practicality for real-world deployment. Specifically: (i) Traffic drift. Website content evolves dynamically over time [11], necessitating frequent recollection of labeled traffic data and retraining of models; (ii) Limited recognition capability. Current supervised learning-based approaches can only identify previously known websites, lacking the ability to generalize to newly emerging sites. These challenges significantly restrict the applicability of WF in operational settings.
To address these limitations, we introduce a novel approach that jointly exploits traffic modality features and logical modality features to enable scalable and generalizable WF against previously unseen websites. Logical modality features (e.g., URI lengths, response sizes, and protocol versions) can be automatically extracted through large-scale web crawling, capturing resource-level attributes that describe a website’s semantic structure. By mapping both traffic modality features and logical modality features into a shared embedding space, we construct a large-scale website fingerprint database grounded in logical representations. Consequently, the task of identifying a website from unseen traffic can be reformulated as a cross-modal retrieval problem, wherein traffic modality features are matched to the most semantically relevant logical modality features stored in the fingerprint database.
We instantiate this formulation through STAR (Semantic-Traffic Alignment and Retrieval), a dual-encoder architecture that jointly embeds logic and traffic modalities into a unified latent space. STAR is trained on over 150K automatically collected logic-traffic pairs using a contrastive learning objective, with additional auxiliary losses to improve intra-class consistency and discriminability. To further enhance robustness against website evolution, we introduce a structure-aware data augmentation mechanism that perturbs both modalities in a semantically consistent manner. During inference, STAR retrieves the most semantically aligned logic profile for an encrypted traffic sample, using cosine similarity in the shared embedding space. This design enables zero-shot classification of encrypted traces with no prior access to traffic from target websites.
Beyond the system design, we also conduct a systematic investigation into why semantic-traffic alignment is possible. We identify three core alignment anchors, on the request side, response side, and transport protocol, each capturing a consistent mapping between traffic features and high-level website structures (§III-B). These anchors stem from the inherent design of modern web protocols (e.g., header compression, layered transport) and serve as empirical foundations for learning cross-modal associations, further supported by modality-level analyses of discriminability, stability, and cross-modal correlation (§V-C). Together, these findings not only validate the design rationale behind STAR, but also provide foundational evidence that cross-modal modeling is both feasible and effective for fingerprinting encrypted web traffic. In summary, our contributions are as follows:
• We formalize zero-shot website fingerprinting under HTTPS as a cross-modal retrieval task, removing the need for per-site traffic collection and supporting generalization to unseen websites.
• We present STAR, the first dual-modality system that aligns crawl-time semantic logic with encrypted traffic traces through contrastive learning and structure-aware augmentation.
• We release STAR's datasets and code to facilitate future research on semantic inference under encrypted protocols [12].
These results highlight the feasibility of zero-shot traffic-based identification and demonstrate that semantic leakage, rather than header visibility, now constitutes the principal privacy risk in the encrypted web.
Website fingerprinting (WF) infers a user's visited website by analyzing features of encrypted traffic, such as packet lengths, directions, and timing patterns. Introduced formally by Hintz [5], early WF methods used handcrafted features and classical classifiers [6], [13]. The rise of deep learning significantly boosted performance: models like Deep Fingerprinting (DF) [7] achieved high closed-world accuracy via CNNs, and later work explored GNNs [14], Transformers [15], and diffusion models [16].
Recent work has revisited WF under mainstream HTTPS, revealing that even minimal protocol interactions can leak identifying patterns. For example, Cebere et al. [17] analyzed leakage across TLS stages; Gao et al. [14] constructed resource graphs to model site structure; Cheng et al. [8] showed that HTTP version features form unique site-level sequences; Shen et al. [18] demonstrated the use of prior fingerprints to filter obfuscation flows. Other works revealed structural leakage in HTTP/3 and DoH traffic [19], [20].
To reduce the reliance on large labeled datasets, recent work has explored data-efficient strategies under low-data regimes. Few-shot approaches use contrastive learning to pretrain DF-style encoders [21], [22] or apply kNN over application-layer features [8]; generative methods augment training data with synthetic traces [23]. Some studies extend to zero-shot scenarios across network conditions (e.g., VPN changes) [15], [24], but still assume access to traffic from target websites.
This limitation motivates our proposed cross-modal approach, eliminating the necessity for target-site traffic collection entirely.
We consider a standard passive adversary in the website fingerprinting (WF) setting [7], [8], [18]. The attacker resides on the network path between the user and the web server, such as an ISP or router, and is able to observe encrypted traffic but cannot modify, delay, inject, or decrypt any packets. The attacker's goal is to determine whether the user is visiting a monitored website and, if so, identify which one.
Unlike traditional WF approaches that formulate this task as a site-specific classification problem, we adopt a new threat model where the attacker performs cross-modal retrieval between encrypted traffic traces and semantic website representations. This allows the attacker to recognize previously unseen websites based on semantic-traffic alignment, without requiring labeled traffic samples for each monitored site.
Unlike typical cross-modal learning tasks [25] that align well-structured modalities such as natural language and images, our setting involves the alignment between encrypted traffic traces and abstracted semantic representations of websites. This non-traditional modality pairing introduces new challenges, particularly in the construction of effective input representations.
To guide the design of both traffic and logic modalities, we summarize three key principles that effective cross-modal representations should satisfy, inspired by prior work in fingerprinting and multi-modal retrieval:
• P1: Discriminability (intra-modality)
Each modality should encode features that allow websites to be distinguished from one another, enabling the model to separate classes in both semantic and traffic spaces.
• P2: Stability (intra-modality)
The modality representations should remain relatively consistent across repeated visits to the same site under similar conditions. High intra-class consistency is critical for generalization.
• P3: Alignability (cross-modality)
There should exist identifiable structural relationships between the modalities, which we refer to as alignment anchors. These anchors serve as a learnable bridge that enables the model to connect encrypted traffic behavior with semantic site characteristics.
These principles form the foundation for our modality design choices and motivate the structural observations presented in the following sections.
We define a cross-modal alignment anchor as a feature or structure that exhibits semantic correspondence and measurable correlation between the logic and traffic modalities. In our setting, we identify three such alignment anchors, on the request side, response side, and transport protocol, each rooted in the design of modern web communication ecosystems (see Fig. 1 A-C).
1) Request Anchor: Our key observation is that the length of HTTPS request packets is linearly related to the Huffman-encoded length of the resource URI. This arises from protocol-level optimizations in HTTP/2 and HTTP/3, which employ static/dynamic header compression [26]. Most headers (e.g., User-Agent, Cookie) are replaced with compact indices, leaving the URI as the dominant uncompressed field, further compressed by a public Huffman table. Thus, request packet length can be approximated as:

p_i ≈ Huff(URI_i) + c · H + b,

with constants c and b capturing the per-header index overhead and fixed framing overhead, respectively,
where p_i is the packet length and H is the number of compressed headers. This alignment is visualized in Fig. 1 A and supported in Fig. 1 D.

2) Response Anchor: Response packets convey web content, and their cumulative size naturally reflects the sum of individual resource sizes [27]. We observe that:

Σ_{i ∈ responses} |p_i| ≈ Σ_j size(r_j),
enabling logic-to-traffic comparison of response behavior (Fig. 1B,E).
3) Protocol Anchor: HTTP/3 operates over QUIC/UDP, creating observable transport-layer patterns [8]. We compare the UDP traffic ratio with the server-side HTTP/3 usage ratio, forming a protocol anchor:

#UDP packets / #total packets ≈ #HTTP/3 resources / #total resources.
This structural similarity enables indirect inference of protocol usage (Fig. 1C,F).
To quantify these alignments, we perform statistical hypothesis testing on paired samples. For each anchor, we extract matched feature sequences from both modalities, apply normalization when necessary, and measure alignment using Pearson correlation or Wasserstein distance. Significance is evaluated via permutation testing. Aggregated results in Table I confirm statistically significant alignment across all three anchors, motivating the modality representations used in our framework ( §IV).
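As a concrete illustration of this quantification step, the following sketch runs a permutation test on the Pearson correlation between paired anchor features. The data are synthetic and the function names are ours; the paper's actual analysis operates on matched logic/traffic feature sequences and also uses Wasserstein distance.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def permutation_pvalue(xs, ys, n_perm=2000, seed=0):
    """p-value for H0 'no association', by shuffling the pairing."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    ys_shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ys_shuffled)
        if abs(pearson(xs, ys_shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing

# Strongly aligned toy anchor features -> small p-value.
logic_feats = [10, 20, 30, 40, 50, 60, 70, 80]
traffic_feats = [12, 19, 33, 41, 48, 62, 71, 79]
p = permutation_pvalue(logic_feats, traffic_feats)
print(f"p-value = {p:.4f}")
```

A small p-value here plays the same role as the significance column of Table I: the observed cross-modal correlation is very unlikely under a random pairing.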
This section presents the STAR framework for semantic-traffic alignment in website fingerprinting. As shown in Fig. 2, STAR maps paired inputs from two heterogeneous modalities-website logic and encrypted traffic-into a shared embedding space for unified retrieval and classification.
We first define modality-specific input representations ( §IV-B), then design a dual-encoder architecture to embed them ( §IV-C). The encoders are jointly trained with contrastive and auxiliary losses to promote semantic alignment ( §IV-D).
To enhance generalization, we introduce structure-aware data augmentation ( §IV-E). The learned encoders support flexible downstream usage, including zero-shot retrieval and few-shot adaptation (see Fig. 3).
Our STAR framework is built on a cross-modal dual-encoder design, as illustrated in Fig. 2 and Fig. 3. It takes as input two modalities for each website access: (i) a logic modality, which encodes semantic web resource structures, such as resource URI lengths, sizes, and protocol behaviors, and (ii) a traffic modality, which captures encrypted packet-level features during access. These paired inputs are processed by separate encoders and projected into a shared embedding space. To enforce semantic alignment, we apply an InfoNCE-based contrastive loss, while auxiliary classification and consistency losses promote inter-class separability and intra-class coherence.
During training (Fig. 2), STAR is optimized on large-scale cross-modal sample pairs collected via automated crawling and traffic capture, further expanded through structure-aware data augmentation. The learning objectives jointly update both encoders to align paired embeddings, classify traffic samples, and cluster instances from the same class.
At inference time (Fig. 3), the trained encoders support multiple downstream scenarios. For zero-shot classification, a test trace is encoded by the traffic encoder and compared, via cosine similarity in the shared embedding space, against a gallery of logic-side prototypes pre-computed from crawl-time profiles. The top-matched class is returned if the similarity exceeds a decision threshold; otherwise, the input is rejected as unmonitored. For few-shot adaptation, STAR integrates with plug-and-play strategies such as linear probing [25] or Tip-Adapter-style fusion [28], both operating over frozen encoders². This retrieval-based formulation enables scalable, flexible deployment in open-world scenarios without requiring retraining or per-target traffic collection.
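The zero-shot retrieval step can be sketched as follows. This is a toy illustration: the gallery entries, site names, embedding dimensionality, and the threshold value are all hypothetical, not values from the paper.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(traffic_emb, gallery, threshold=0.5):
    """Return the best-matching site, or None (unmonitored) when the
    top similarity falls below the open-world decision threshold."""
    best_site, best_sim = None, -1.0
    for site, logic_emb in gallery.items():
        sim = cosine(traffic_emb, logic_emb)
        if sim > best_sim:
            best_site, best_sim = site, sim
    return (best_site, best_sim) if best_sim >= threshold else (None, best_sim)

# Toy gallery of logic-side prototype embeddings (hypothetical sites).
gallery = {
    "example.com": [0.9, 0.1, 0.0],
    "news.test":   [0.0, 1.0, 0.0],
}
site, sim = zero_shot_classify([0.85, 0.15, 0.05], gallery)
print(site)  # example.com
```

A trace dissimilar to every prototype, e.g. `[0.0, 0.0, 1.0]`, is rejected as unmonitored, which is exactly the open-world behavior described above.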
To enable reliable cross-modal alignment, we design compact yet expressive representations for encrypted traffic and site logic. Each is structured as a fixed-length sequence with features selected to reflect the core alignment anchors identified in §III-B, while maintaining generalizability and model efficiency.
1) Traffic Modality: We represent encrypted traffic traces as a sequence of packet-level features, defining a feature matrix T ∈ R^{5000×3}, where each row f_i^(T) = (dir(p_i), v_i, s_i) comprises:
• dir(p_i) is the directional packet length, where client-to-server packets are positive and server-to-client packets are negative. This single value reflects both request and response behaviors, preserving alignment signals from both ends.
• v_i ∈ {1, 2, 3} is the inferred HTTP version. Inspired by [8], we heuristically assign this per-packet label based on transport-layer characteristics: UDP packets are labeled HTTP/3; TCP packets are marked HTTP/2 if two consecutive packets begin with TLS content-type 0x17, otherwise HTTP/1.1.
• s_i ∈ Z+ is the flow index, indicating the bidirectional connection to which the packet belongs, enabling coarse-grained structural grouping within traces.

2) Logic Modality: The logic modality encodes a website's resource-level structure as a semantic matrix L ∈ R^{80×8}, where each row f_j^(L) corresponds to a web resource as observed during page load. These resource vectors capture the website's high-level semantics and are extracted from browser developer logs [30] via automated scripts.
We group the eight features into three semantic categories:
• Identifier length indicators: Huffman-encoded and raw URI lengths provide a compact representation of the resource path size and align with the request-side packet lengths in the traffic modality.

These design choices ensure that the learned representations remain compact, semantically meaningful, and structurally aligned across modalities.
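For concreteness, the packet-level traffic representation above can be sketched as follows. The packet fields (`size`, `outgoing`, `is_udp`, `tls_app_data`, `flow`) are hypothetical parser outputs, and the per-packet HTTP/2 check is a simplified reading of the two-consecutive-0x17-records heuristic.

```python
def build_traffic_matrix(packets, max_len=5000):
    """Build the T in R^{max_len x 3} representation from parsed packets.

    Each row is (directional length, inferred HTTP version, flow index),
    zero-padded or truncated to max_len rows.
    """
    rows = []
    prev_app_data = False  # did the previous TCP packet carry TLS 0x17?
    for p in packets[:max_len]:
        direction = p["size"] if p["outgoing"] else -p["size"]
        if p["is_udp"]:
            version = 3  # QUIC/UDP -> HTTP/3
        elif p["tls_app_data"] and prev_app_data:
            version = 2  # consecutive TLS application-data records -> HTTP/2 (heuristic)
        else:
            version = 1  # fallback: HTTP/1.1
        prev_app_data = p["tls_app_data"] and not p["is_udp"]
        rows.append([direction, version, p["flow"]])
    rows += [[0, 0, 0]] * (max_len - len(rows))  # zero-pad to fixed length
    return rows

packets = [
    {"size": 120,  "outgoing": True,  "is_udp": False, "tls_app_data": True,  "flow": 0},
    {"size": 1400, "outgoing": False, "is_udp": False, "tls_app_data": True,  "flow": 0},
    {"size": 90,   "outgoing": True,  "is_udp": True,  "tls_app_data": False, "flow": 1},
]
T = build_traffic_matrix(packets)
print(T[:3])  # [[120, 1, 0], [-1400, 2, 0], [90, 3, 1]]
```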
To bridge the modality gap between encrypted traffic traces and website logic structures, we adopt a dual-encoder architecture to project each modality into a shared embedding space. This architecture is inspired by the CLIP [25] paradigm, where modality-specific encoders are used to preserve intra-modality semantics while enabling cross-modal alignment via contrastive training.
a) Traffic Encoder: For the traffic modality, we build upon the DFNet [7] backbone, a deep convolutional network widely adopted in website fingerprinting literature due to its strong discriminative capacity under encrypted traffic. Given a packet-level input matrix T ∈ R 5000×3 , we replace the original 1D convolutional layers with three-channel convolutions to accommodate the 3-dimensional packet features. We remove the classification head of DFNet and preserve the penultimate hidden representation as the traffic embedding. A subsequent projection head f T maps the encoder output to a normalized embedding:
b) Logic Encoder: For the logic modality, we employ a Transformer encoder [31] to effectively process structured sequences of web resources. Given a resource-level input matrix L ∈ R 80×8 , the encoder utilizes multi-head self-attention to capture feature-wise and resource-wise dependencies, allowing the model to learn hierarchical importance among resources. The output representations are aggregated via masked average pooling, followed by a projection head f L that yields the normalized logic embedding:
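The masked average pooling step can be illustrated with a minimal sketch. This is pure Python over lists; an actual implementation would operate on batched tensors, but the aggregation logic is the same: padded resource slots are excluded from the mean.

```python
def masked_average_pool(token_vectors, mask):
    """Average encoder outputs over valid (non-padded) resource slots.

    token_vectors: list of d-dimensional lists, one per resource slot.
    mask: list of 0/1 flags, 1 where the slot holds a real resource.
    """
    d = len(token_vectors[0])
    total = [0.0] * d
    count = 0
    for vec, m in zip(token_vectors, mask):
        if m:
            count += 1
            for k in range(d):
                total[k] += vec[k]
    return [t / max(count, 1) for t in total]

# Two real resources followed by one padding slot: padding is ignored.
outs = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
pooled = masked_average_pool(outs, [1, 1, 0])
print(pooled)  # [2.0, 3.0]
```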
Each embedding z_i^L or z_i^T resides in a shared latent space R^d (we set d = 256), which serves as the basis for cross-modal contrastive alignment.
To align the logic and traffic modalities, we adopt a multi-objective training strategy centered on the InfoNCE loss [32] and supplemented by auxiliary supervision. The goal is to ensure that matched logic-traffic pairs are closer in the embedding space than mismatched pairs. 1) InfoNCE Loss for Cross-Modal Alignment: We leverage the standard contrastive loss over a batch of N paired samples. For each traffic embedding z_i^T and its corresponding logic embedding z_i^L, the InfoNCE objective encourages the inner product ⟨z_i^T, z_i^L⟩ to be higher than that of any non-matching pair. The loss is given by:

L_nce = −(1/N) Σ_{i=1}^{N} log [ exp(⟨z_i^T, z_i^L⟩/τ) / Σ_{j=1}^{N} exp(⟨z_i^T, z_j^L⟩/τ) ]
where τ is a temperature hyperparameter. Unlike conventional supervised contrastive learning, all negatives in the denominator are guaranteed to be true negatives due to the use of large-scale unlabeled pairs across diverse websites (each pair from a distinct site, see §V-A), preventing semantic ambiguity.
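A minimal reference implementation of this traffic-to-logic InfoNCE objective, on toy 2-D embeddings, behaves as expected: matched pairs yield a lower loss than mismatched ones. This is a sketch in pure Python; a real system would compute the same quantity over batched, L2-normalized GPU tensors.

```python
from math import exp, log, sqrt

def normalize(v):
    """L2-normalize a vector."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(traffic_embs, logic_embs, tau=0.07):
    """Traffic-to-logic InfoNCE over N matched pairs:
    -(1/N) * sum_i log( exp(<z_i^T, z_i^L>/tau) / sum_j exp(<z_i^T, z_j^L>/tau) )."""
    zt = [normalize(v) for v in traffic_embs]
    zl = [normalize(v) for v in logic_embs]
    n = len(zt)
    loss = 0.0
    for i in range(n):
        sims = [sum(a * b for a, b in zip(zt[i], zl[j])) / tau for j in range(n)]
        denom = sum(exp(s) for s in sims)
        loss += -log(exp(sims[i]) / denom)  # positive pair sits on the diagonal
    return loss / n

# Matched pairs point in similar directions -> low loss.
aligned = info_nce([[1, 0], [0, 1]], [[0.9, 0.1], [0.1, 0.9]])
# Swapping the logic side mismatches every pair -> high loss.
swapped = info_nce([[1, 0], [0, 1]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)  # True
```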
2) Supervised Contrastive Loss for Discrimination: To further enhance class-discriminative capacity in the traffic modality, we incorporate a supervised contrastive loss using labeled fingerprinting datasets. Following SupCon [33], the loss encourages embeddings from the same class to be closer, while keeping different classes apart:

L_sup = Σ_i (−1/|P(i)|) Σ_{p ∈ P(i)} log [ exp(⟨z_i, z_p⟩/τ) / Σ_{a ∈ A(i)} exp(⟨z_i, z_a⟩/τ) ]

where P(i) denotes the set of positives (same class), and A(i) the set of all anchors except i.
3) Consistency Loss for Stability: Given the inherent instability of encrypted traffic, even within the same class, we introduce an intra-class consistency loss to promote local smoothness among traffic embeddings. Specifically, we minimize the pairwise distance among all traffic embeddings with the same class label:

L_con = (1/|C|) Σ_{(i,j) ∈ C} ‖z_i^T − z_j^T‖₂²

where C is the set of intra-class traffic pairs.

4) Final Objective: The full training objective combines all three components with weighting coefficients λ₁ and λ₂:

L = L_nce + λ₁ L_sup + λ₂ L_con
This hybrid objective enables us to exploit both large-scale weakly-aligned web pairs and reliable supervised samples to improve alignment quality and generalization.
To enhance generalization under site evolution, we introduce a structure-aware augmentation method that perturbs both modalities in a consistent manner. This approach generates realistic logic-traffic sub-pairs while preserving the structural alignment necessary for contrastive training. Unlike traditional view-level augmentations or modality-specific transformations [34], our method exploits a shared structural anchor, server IP addresses, that appears in both modalities and governs subsets of web resources and traffic packets.
The augmentation operates by selectively dropping all resources in the logic modality that are associated with a sampled set of server IPs. The corresponding traffic packets linked to the same IPs are then removed from the traffic modality, producing a semantically valid and internally consistent subpair. To avoid excessive content removal, IPs are sampled with inverse probability proportional to their resource count, and deletions continue until a stochastic threshold is met. This threshold is drawn from a Gaussian prior to introduce controlled variation. The full procedure is outlined in Algorithm 1.
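The IP-anchored dropping strategy described above can be sketched as follows. This is a simplified reading of the procedure, not Algorithm 1 itself: the Gaussian parameters `mu` and `sigma` are assumed values, and the stopping rule is approximated as a dropped-resource budget.

```python
import random

def augment_pair(resources, packets, rng=None, mu=0.2, sigma=0.05):
    """Drop all resources and packets tied to sampled server IPs.

    resources: list of (resource_id, server_ip) from the logic modality.
    packets:   list of (packet_id, server_ip) from the traffic modality.
    IPs are sampled with probability inversely proportional to their
    resource count; deletion stops once the dropped-resource fraction
    exceeds a Gaussian-sampled threshold.
    """
    rng = rng or random.Random(0)
    threshold = max(0.05, rng.gauss(mu, sigma))  # stochastic drop budget
    counts = {}
    for _, ip in resources:
        counts[ip] = counts.get(ip, 0) + 1
    deleted, dropped = set(), 0
    candidates = list(counts)
    while candidates and dropped / len(resources) < threshold:
        weights = [1.0 / counts[ip] for ip in candidates]  # favor small IPs
        ip = rng.choices(candidates, weights=weights)[0]
        deleted.add(ip)
        dropped += counts[ip]
        candidates.remove(ip)
    new_res = [r for r in resources if r[1] not in deleted]
    new_pkts = [p for p in packets if p[1] not in deleted]
    return new_res, new_pkts

resources = [(i, "1.1.1.1") for i in range(8)] + [(8, "2.2.2.2"), (9, "3.3.3.3")]
packets = [(100 + i, ip) for i, (_, ip) in enumerate(resources)]
new_res, new_pkts = augment_pair(resources, packets)
```

Because both modalities are filtered by the same deleted-IP set, the resulting sub-pair stays internally consistent, which is the property the contrastive objective relies on.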
The resulting augmented pairs preserve partial yet coherent cross-modal alignment and are seamlessly integrated into the training set as additional samples for contrastive learning.
In this section, we conduct comprehensive experiments to evaluate the effectiveness of STAR across both closed-world and open-world settings. We first describe the datasets used in our experiments (§V-A) and introduce competitive baselines (§V-B). We then perform a series of experiments, including modality design analysis (§V-C), classification under closed-world (§V-D) and open-world (§V-E) settings, as well as in-depth ablation and interpretability analysis (§V-F).
We utilize two types of datasets in our experiments: (1) a large-scale cross-modal dataset constructed by ourselves, and (2) an existing labeled fingerprinting dataset used for evaluation and auxiliary supervision.
(1) Cross-Modal Dataset (STAR-200K). We collect a large-scale dataset of website-level cross-modal samples via automated crawling and synchronized traffic capture.

(2) Labeled Fingerprinting Dataset (H&W-1600). We use the public dataset from [8], which provides 40 traffic samples for each of 2,240 HTTPS websites across three groups: popular, random, and censorship. We select the popular subset (1,600 websites) for closed-world evaluation. The remaining samples are used as labeled data for supervised training modules (§IV-D). To prevent data leakage, we ensure all evaluation websites are disjoint from the STAR-200K pretraining and labeled training sets.
To demonstrate the effectiveness of STAR, the first zero-shot website fingerprinting method without access to target traffic, we compare against representative state-of-the-art baselines from three categories.
Standard WF methods include CUMUL [13], which uses cumulative packet lengths with an SVM classifier; DF+ [7], a CNN-based model extended to directional packet lengths for HTTPS settings; RF [37], which utilizes fixed-time aggregation matrices for deep classification; and CountMamba [38], which models coarse-grained count matrices using a state space model for robust, early-stage classification.
Few-shot methods include TF [21] and NetCLR [22], both of which pretrain DF-based encoders using contrastive learning (NetCLR adds self-supervised tasks), and H&W [8], which matches application-layer features via KNN.
Fine-grained methods include FineWP [27], using statistical features with random forests, and Oscar [39], which applies multi-label metric learning for precise web page classification. All baselines are implemented using official or WFlib [40] code with default settings.
To validate the effectiveness of our cross-modal formulation, we begin with a systematic evaluation of different modality representation choices.
• Experimental Setup. We adopt the proposed STAR training paradigm and explore its behavior under various traffic modality representations. The logic modality is fixed to our proposed 8-dimensional web resource-level representation, while the traffic modality varies across prior designs in the website fingerprinting literature. For example, we include Trace sequences [7], flow-level statistical summaries (H123) [8], the Traffic Aggregation Matrix (TAM) [37], and the Windowed Traffic Counting Matrix (WTCM) [38]. For each modality combination, we perform full model training with our multi-loss objective and structure-aware augmentation, then evaluate zero-shot classification performance on the H&W-1600 dataset.
To complement accuracy metrics, we additionally assess three modality design criteria introduced in §III-A:
• Inter-class discriminability (P1) is quantified via Adjusted Mutual Information (AMI), which measures how well the traffic embeddings can be clustered into groups that match the true class labels. • Intra-class stability (P2) is estimated by the Fisher Discriminant Ratio (FDR), comparing between-class and within-class variances of traffic embeddings. • Cross-modal alignability (P3) is captured using Distance Correlation (dCor) [41] between normalized embeddings z T i and z L i . dCor equals zero if and only if two random variables are statistically independent.
All statistics are computed on traffic embeddings normalized to zero mean and unit variance. AMI and dCor are scale-invariant, while FDR is already a scale-free ratio.
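Distance correlation can be computed directly from double-centered pairwise distance matrices. The sketch below implements the V-statistic form for scalar sequences; for affine dependence it returns 1, and for non-affine pairings it returns a value strictly below 1.

```python
from math import sqrt

def _centered_dists(vals):
    """Double-centered pairwise distance matrix of a scalar sequence."""
    n = len(vals)
    d = [[abs(vals[i] - vals[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]          # row means (= column means, d is symmetric)
    grand = sum(row) / n                   # grand mean
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def dcor(xs, ys):
    """Sample distance correlation between two paired scalar sequences."""
    n = len(xs)
    A = _centered_dists(xs)
    B = _centered_dists(ys)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / (n * n)
    dvarx = sum(a * a for r in A for a in r) / (n * n)
    dvary = sum(b * b for r in B for b in r) / (n * n)
    return sqrt(dcov2 / sqrt(dvarx * dvary))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dcor(x, [2 * v + 1 for v in x]))  # ~1.0 (affine dependence)
```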
• Results. As shown in Table II, Trace-based representations achieve the highest dCor, benefiting from explicit request-response anchoring that naturally aligns with logic semantics. H123 obtains the best AMI and FDR scores, thanks to its protocol-aware descriptors and flow-level aggregation that enhance class discriminability and intra-class consistency.
However, these statistical strengths do not directly translate into strong task performance-neither H123 nor Trace reaches competitive zero-shot accuracy. TAM and WTCM, despite their popularity in prior single-modal tasks, show poor alignment and low performance in our cross-modal setup. This is likely due to their use of fixed-size sliding windows, which obscure packet-level semantic anchors and disrupt alignment with logic-side representations.
In contrast, our proposed traffic encoding preserves both protocol semantics and alignment structures, achieving balanced modality properties and significantly superior classification accuracy. These results validate our cross-modal formulation and modality design as crucial to enabling effective zero-shot retrieval.
To evaluate STAR under a standard closed-world setting, we conduct experiments where the client accesses a fixed set of monitored websites. Results are summarized in Table III.
• Experimental Setup. We train STAR on a mixture of three datasets: the STAR-200K cross-modal dataset, structure-aware augmented pairs, and the labeled training portion of H&W, combined at a 10:3:3 ratio. To assess zero-shot performance, i.e., recognizing websites whose traffic was unseen during training, we construct disjoint training and evaluation website sets. The model is optimized with the objective in Eq. 9 for 200 epochs on 5 NVIDIA A100 GPUs, requiring about 4 hours. During inference, we follow a CLIP-style retrieval: each traffic sample is embedded using the traffic encoder and projection head, and cosine similarity is computed against 1,600 logic-side anchors, each representing an embedding of a logic modality sample from the test set. Classification is determined by nearest-neighbor retrieval, and we report both top-1 and top-5 accuracy.
Since no existing website fingerprinting approach supports zero-shot classification, we build a baseline using k-means clustering (K = 1,600) with optimal label assignment obtained via the Hungarian algorithm. For few-shot evaluation, we follow the standard n-shot setting on H&W-1600, using n labeled samples per class. Competing methods (e.g., TF, NetCLR, H&W) are trained on the n-shot subset, while STAR uses lightweight adaptation via a linear probe and Tip-Adapter.
• Results. STAR delivers strong zero-shot performance, achieving 87.87% top-1 and 96.94% top-5 accuracy over 1,600 website classes, despite not seeing any traffic samples from the evaluation set. This confirms the effectiveness and generalization ability of the learned cross-modal alignment. With few-shot adaptation, STAR's performance improves further, reaching 95.06% top-1 accuracy with a linear probe at 16-shot and up to 99.09% with Tip-Adapter. Compared with existing few-shot methods, STAR provides both a higher upper-bound accuracy and better zero-shot generalization. For example, H&W and TF remain competitive under few-shot settings but still plateau below STAR's adapted results. Notably, as shown in Fig. 4(a), STAR's zero-shot accuracy already matches the average 8-shot performance of other methods, which typically require over 100 hours of traffic collection on a single machine [21], highlighting its advantage in low-data and real-time deployment scenarios.

• Experimental Setup. Following established practice [8], we adopt a binary classification evaluation strategy between monitored and unmonitored samples. At test time, we construct a balanced evaluation set, ensuring a 1:1 ratio of monitored and unmonitored samples. We report precision and recall at varying decision thresholds, along with the overall AUC and the best F1 score for each method.
• Results. Experimental results are summarized in Fig. 4(b). STAR achieves an overall AUC of 0.963, and its best F1 score is 90.65, indicating both high precision and strong recall.
We attribute this performance gain to the cross-modal alignment learning paradigm employed by STAR. Unlike traditional classification-based approaches, which focus on optimizing decision boundaries over a fixed set of monitored classes, our model is trained on large-scale cross-modal sample pairs, allowing it to learn a more generalizable alignment space between website-level semantic features and encrypted traffic patterns. This alignment is not bound to specific class labels, but rather captures discriminative structure across the broader web domain. As a result, even in open-world settings where unseen websites appear, STAR can reliably identify whether a test sample aligns well with any monitored site, without requiring explicit negative class supervision during training. These findings highlight the unique advantage of retrieval-based, modality-aligned approaches in realistic, open-set fingerprinting scenarios.
To better understand the design, performance gain, and behavior of STAR, we conduct an in-depth analysis covering its key components, learned representations, and the effect of training scale.
• Ablation Study. Table IV reports the contribution of each component, covering the cross-modal augmentation and the individual optimization targets (§IV-D).

• Representation Analysis. We visualize learned embeddings using t-SNE in Fig. 5(a-b). Compared with the TF baseline, STAR produces tighter intra-class clusters and stronger alignment between traffic and logic embeddings. Cosine similarity distributions further confirm higher intra-class similarity and improved separability.
• Attribution and Scale Analysis. Gradient×Input attribution [42] (Fig. 5(c)) reveals that STAR exploits localized discriminative cues in both modalities. In the logic modality, influential features are concentrated in early resource slots, often corresponding to primary page elements, while in the traffic modality, early packet groups contribute disproportionately to alignment. Fig. 5(d) further shows that zero-shot accuracy improves rapidly with training scale and saturates beyond approximately 100K samples, suggesting diminishing returns once sufficient cross-modal diversity is learned.
Overall, this analysis demonstrates that STAR’s superior performance arises not from any single component, but from the joint effect of robust cross-modal pretraining, effective alignment optimization, and scale-driven generalization.
This work redefines website fingerprinting as a cross-modal retrieval problem and presents STAR as a first realization of this paradigm. Our evaluation is scoped to standard HTTPS browsing sessions and validates the approach under typical conditions. More complex settings, such as multi-tab access, cross-network variability, and alternative encryption tunnels like VPN or Tor, remain beyond the scope of this initial study. We also focus on Chrome-based traffic traces, given Chrome's prevalence in practice; generalization to other browsers such as Firefox and Safari remains to be assessed. These scenarios introduce additional factors that may affect alignment robustness and are left for future exploration.
STAR demonstrates that semantic-traffic alignment enables scalable, zero-shot fingerprinting without target-side traffic, revealing structural leakage as a persistent privacy risk even under full encryption. Beyond facilitating low-overhead deployment, our formulation offers a lens to analyze semantic leakage and guide defense design. Potential countermeasures include perturbing resource structures or obfuscating alignment anchors via traffic shaping, though their effectiveness and associated bandwidth overhead remain open challenges. Future work may extend STAR to multi-page tracking, dynamic contexts, or hybrid inference tasks involving both structure and behavior.
We reframed website fingerprinting under HTTPS as a zero-shot cross-modal retrieval problem and introduced STAR, a dual-encoder system that aligns semantic resource logic with encrypted traffic. Trained on large-scale logic-traffic pairs with structure-aware augmentation, STAR achieves strong zero-shot classification without target-side traffic collection. It surpasses state-of-the-art baselines across closed- and open-world settings, highlighting semantic-traffic alignment as a new axis of vulnerability. We release our dataset and implementation to support future research in both attack and defense directions.
2 Detailed implementation-level descriptions of inference procedures are provided in an online technical appendix: https://github.com/2654400439/STAR-Website-Fingerprinting/blob/main/docs/TechnicalAppendix/STAR Technical Appendix.pdf
CMA: Cross-Modal Augmentation. OT: Optimization Targets (see §IV-D).
3 0x17 is the TLS content_type value indicating application data [29].