LSMTCR: A Scalable Multi-Architecture Model for Epitope-Specific T Cell Receptor de novo Design
Designing full-length, epitope-specific TCR αβ remains challenging due to vast sequence space, data biases and incomplete modeling of immunogenetic constraints. We present LSMTCR, a scalable multi-architecture framework that separates specificity from constraint learning to enable de novo, epitope-conditioned generation of paired, full-length TCRs. A diffusion-enhanced BERT encoder learns time-conditioned epitope representations; conditional GPT decoders, pretrained on CDR3β and transferred to CDR3α, generate chain-specific CDR3s under cross-modal conditioning with temperature-controlled diversity; and a gene-aware Transformer assembles complete α/β sequences by predicting V/J usage to ensure immunogenetic fidelity. Across GLIPH, TEP, MIRA, McPAS and our curated dataset, LSMTCR achieves higher predicted binding than baselines on most datasets, more faithfully recovers positional and length grammars, and delivers superior, temperature-tunable diversity. For α-chain generation, transfer learning improves predicted binding, length realism and diversity over representative methods. Full-length assembly from known or de novo CDR3s preserves k-mer spectra, yields low edit distances to references, and, in paired α/β co-modelling with epitope, attains higher pTM/ipTM than single-chain settings. LSMTCR outputs diverse, gene-contextualized, full-length TCR designs from epitope input alone, enabling high-throughput screening and iterative optimization.
💡 Research Summary
The paper introduces LSMTCR, a scalable multi‑architecture framework that enables de novo design of full‑length, epitope‑specific α/β T‑cell receptors (TCRs). The authors separate the learning of antigen specificity from immunogenetic constraints, constructing a three‑stage pipeline. First, a diffusion‑enhanced BERT encoder creates time‑conditioned embeddings of the input epitope, capturing its variability through a diffusion process rather than a static representation. Second, conditional GPT decoders—pre‑trained on massive CDR3β datasets and transferred to CDR3α—generate chain‑specific CDR3 sequences. These decoders are cross‑modal conditioned on the epitope embedding and employ temperature‑controlled sampling to finely tune diversity while preserving realistic length and positional grammars. Third, a gene‑aware Transformer predicts V and J gene usage and assembles the complete α and β chains, enforcing V/J recombination rules, frame preservation, and other immunogenetic constraints.
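The temperature-controlled sampling mentioned above is a standard decoding mechanism; the sketch below illustrates the idea in isolation (the function name and toy logits are illustrative, not from the paper's code). Lower temperatures concentrate probability on high-scoring residues, yielding more conservative CDR3s; higher temperatures flatten the distribution and increase diversity.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample one token index from decoder logits with temperature scaling.

    Dividing logits by the temperature before the softmax sharpens
    (T < 1) or flattens (T > 1) the output distribution.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

# Toy example: logits over a 20-letter amino-acid alphabet.
logits = np.linspace(0.0, 2.0, 20)
token = sample_with_temperature(logits, temperature=0.7)
```

As the temperature approaches zero the sampler reduces to greedy argmax decoding, which is how such generators trade diversity against predicted binding.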
The authors evaluate LSMTCR on several public benchmarks (GLIPH, TEP, MIRA, McPAS) and a curated internal dataset. Across most datasets, LSMTCR outperforms baseline models (e.g., DeepTCR, TCRGAN, AlphaFold‑TCR) in predicted binding scores, recovers the native CDR3 length distribution, and respects positional motifs. Diversity analysis shows that temperature scaling yields higher entropy and lower bias than competing generators, and the generated sequences maintain k‑mer spectra and low edit distances to reference TCRs. Transfer learning from β‑chain to α‑chain further improves binding prediction, length realism, and diversity for α‑chain generation. When assembling full‑length receptors from known or de novo CDR3s, the model preserves sequence statistics and achieves higher structural confidence (pTM/ipTM) in paired α/β‑epitope co‑modeling than in single‑chain settings.
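The k‑mer spectrum and edit-distance checks mentioned above are generic sequence statistics; a minimal sketch of both (helper names and toy CDR3 strings are illustrative, not taken from the paper) might look like:

```python
from collections import Counter

def kmer_spectrum(seqs, k=3):
    """Count all overlapping k-mers across a set of amino-acid sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Toy comparison between generated and reference CDR3β sequences.
generated = ["CASSLGETQYF", "CASSPGQGYEQYF"]
reference = ["CASSLGETQYF", "CASSPGTGYEQYF"]
overlap = kmer_spectrum(generated).keys() & kmer_spectrum(reference).keys()
dist = edit_distance(generated[1], reference[1])
```

Comparing the two spectra (e.g., by Jensen–Shannon divergence over normalized counts) and averaging per-sequence edit distances gives the kind of faithfulness metrics the evaluation reports.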
Overall, LSMTCR delivers a fully automated, epitope‑conditioned pipeline that produces diverse, gene‑contextualized, full‑length TCR designs from epitope input alone. This capability enables high‑throughput in silico screening, iterative optimization, and rapid prototyping of personalized immunotherapies, while the modular architecture facilitates future extensions such as experimental validation, structure‑guided refinement, and integration with downstream functional assays.