ES-Merging: Biological MLLM Merging via Embedding Space Signals


Authors: Wonbin Lee, Dongki Kim, Sung Ju Hwang

Wonbin Lee*1  Dongki Kim*1  Sung Ju Hwang1,2
1 KAIST  2 DeepAuto.ai
{smilelwb01, cleverki, sungju.hwang}@kaist.ac.kr
* denotes equal contribution

Abstract

Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are specialized to a single modality, limiting their ability to solve inherently cross-modal scientific problems. While model merging is an efficient method to combine the different modalities into a unified MLLM, existing methods rely on input-agnostic parameter space heuristics that fail to faithfully capture modality specialization. To overcome this limitation, we propose a representation-aware merging framework that estimates merging coefficients from embedding space signals. We first design a probe input that consists of different modality tokens and forward it through each specialized MLLM to obtain layer-wise embedding responses that reflect modality-specific representation changes. We then estimate complementary merging coefficients at two granularities from the embedding space: layer-wise coefficients from coarse-grained signals and element-wise coefficients from fine-grained signals, which are jointly combined for robust coefficient estimation. Experiments on interactive effect prediction benchmarks show that our method outperforms existing merging methods and even surpasses task-specific fine-tuned models, establishing that embedding space signals provide a principled and effective foundation for cross-modal MLLM merging.

1 Introduction

Multimodal Large Language Models (MLLMs) have been emerging as crucial foundational models for scientific discovery, extending their perception to diverse biological modalities across molecules (Park et al., 2024; Kim et al., 2025), proteins (Abdine et al., 2024; Fei et al., 2025), and cells (Fang et al., 2025b; Rizvi et al., 2025). Despite their impressive progress within each modality, many scientific problems of interest are cross-modal, requiring an understanding of interactive effects such as protein-ligand interactions or drug effectiveness across cell types. However, existing biological MLLMs are specialized to a single modality, resulting in limited intersectional knowledge and unreliable cross-modal reasoning.

Building a unified model by jointly training on different modalities is a straightforward approach to upskill cross-modal understanding. However, it is impractical and time-consuming since constructing curated cross-modal instruction datasets in the scientific domain typically requires intensive labor and highly specific expertise to elaborate the underlying principles of complex interactions. As an alternative, model merging has gained attention by efficiently combining parameters of multiple specialized models. To retain their task-specific knowledge, existing methods (Yadav et al., 2023; Huang et al., 2024; Du et al., 2024) exploit parameter space signals, such as magnitudes, signs, and directions, heuristically assigning merging coefficients. However, parameter space heuristics are input-agnostic and therefore provide only weak, indirect proxies for modality-specific adaptation. Such input-blindness makes it difficult to isolate meaningful cross-modal interactions, failing to accurately combine these adaptations and severely degrading cross-modal merging.

Our main observation is that the input-aware embedding space contains the modality-specific information. As shown in Fig. 1, when molecular tokens are processed by different MLLMs, the hidden representations form clearly different embedding distributions. In particular, the molecule-specific LLM induces a more distinct distribution due to its modality-specific understanding.
Further, we measure the embedding distribution distance between the base LLM and each specialized MLLM under specialized and non-specialized token inputs.

Figure 1: Molecule token embedding visualization of the last transformer block for the base LLM and each specialized LLM (base, cell, protein, and molecule LLMs).

Figure 2: Layer-wise embedding distribution distance under specialized and non-specialized tokens using sliced Wasserstein distances (SWD) (Bonneel et al., 2015).

Fig. 2 shows that specialized inputs consistently produce larger embedding distribution distances than non-specialized inputs, suggesting that the embedding responses faithfully reflect modality-specific adaptation when the input matches the model's specialization. This observation motivates our central design principle: rather than relying solely on heuristic parameter space signals, we estimate merging coefficients based on the layer-wise embedding space signals.

Motivated by this observation, we propose embedding-signal-based MLLM merging (ES-Merging), a novel framework that moves the model merging paradigm from parameter space signals to embedding space signals. Our intuition is that the modality specialization of an MLLM can be measured by the differences in the embedding space between the base LLM and the MLLM, as shown in Fig. 1. To this end, we first design a probe input containing multimodal tokens of different modalities. By forwarding this input through each MLLM and the base LLM, we obtain layer-wise embeddings that reflect modality-specific representation changes across layers.

Based on these embeddings, we propose to compute merging coefficients at two complementary granularities: layer-wise and element-wise.
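The distribution distance used in Fig. 2 is the sliced Wasserstein distance, which compares two sets of embeddings through random one-dimensional projections. A minimal numpy sketch (the function name, shapes, and projection count are illustrative, not the paper's exact configuration):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=128, seed=0):
    """Approximate sliced Wasserstein-2 distance between point clouds
    X, Y of shape (n, d) via random 1-D projections (Bonneel et al., 2015)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    # Sorting each 1-D projection gives the optimal 1-D transport plan.
    px = np.sort(X @ theta.T, axis=0)
    py = np.sort(Y @ theta.T, axis=0)
    return np.sqrt(np.mean((px - py) ** 2))

X = np.random.default_rng(1).normal(size=(64, 16))
assert sliced_wasserstein(X, X) < 1e-8       # identical clouds: zero distance
assert sliced_wasserstein(X, X + 2.0) > 1.0  # shifted cloud: clearly separated
```

In practice the two point clouds would be the layer-wise token embeddings of the base LLM and a specialized MLLM.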
In the layer-wise manner, we compute a layer-level importance score by identifying layers where the embedding distributional shift grows the most, capturing coarse-grained specialization. In the element-wise manner, we estimate fine-grained importance by identifying parameters that most influence the representation shift. We then combine the coarse layer importance and fine element importance to produce final merging coefficients, which are used to fuse specialized models into a single unified MLLM. This design enables more robust and calibrated coefficient estimation by combining layer-level specialization with parameter-level sensitivity.

We validate the effectiveness of ES-Merging on interactive effect prediction tasks over diverse biological modalities by merging three different specialized MLLMs into a single unified model. Experimental results show that ES-Merging outperforms not only other model merging methods but also the task-specific fine-tuned models, empirically confirming that our representation-aware merging framework is crucial for obtaining the salient modality-specific signals. Further analyses show the effectiveness of our components: combining layer- and element-wise merging coefficients performs best compared to using only one, highlighting the necessity of integrating complementary specialization signals at different granularities.

2 Related Work

MLLMs for Scientific Discovery  Large language models (LLMs) (Touvron et al., 2023; Grattafiori et al., 2024; OpenAI, 2024a,b; Comanici et al., 2025) are increasingly being extended to scientific discovery through multimodal large language models (MLLMs) that incorporate diverse biological modalities, including molecules (Wang et al., 2025b), proteins (Xiao et al., 2025), and cells (Dip et al., 2025). In the protein domain, prior works have modeled amino-acid sequences (Xu et al., 2023; Pei et al., 2023) or jointly modeled sequences with structures (Guo et al., 2023; Abdine et al., 2024; Wang et al., 2024a; Xiao et al., 2024; Wang et al., 2025a; Fei et al., 2025). Single-cell LLMs learn scRNA-seq representations (Fang et al., 2025a; Kharouiche et al., 2025; Fang et al., 2025b) or further incorporate histology information (Li et al., 2025; Chen et al., 2025). On the other hand, molecular LLMs have been built on 1D string representations such as SMILES (Weininger, 1988) and SELFIES (Krenn et al., 2020) (Chithrananda et al., 2020; Edwards et al., 2022), 2D molecular graphs (Liu et al., 2023; Fang et al., 2023; Cao et al., 2023; Yu et al., 2024; Park et al., 2024), 3D structures (Li et al., 2024; Guo et al., 2024), and joint 2D-3D representations (Kim et al., 2025). Despite this progress, biological MLLMs are limited to a single modality, hindering their ability to solve cross-modal scientific problems.

Model Merging  Model merging aims to fuse knowledge from multiple specialist models with minimal additional data or training by combining them directly in parameter space. Existing methods mainly leverage parameter space signals to guide merging. Magnitude-based methods include Task Arithmetic (Ilharco et al., 2022), Consensus Merging (Wang et al., 2024b), and PCB-Merging (Du et al., 2024); sign-based methods include TIES-Merging (Yadav et al., 2023) and EMR-Merging (Huang et al., 2024); and LS-Merge (Soro et al., 2026) performs merging in a learned latent space over parameters. Beyond such static merging, test-time adaptation methods dynamically adjust coefficients using unlabeled test data, as in AdaMerging (Yang et al., 2023) and Twin-Merging (Lu et al., 2024). However, these methods still rely on parameter space signals, which are not well suited to capturing semantic discrepancies across heterogeneous modalities.
We instead propose an embedding space merging framework for MLLMs that derives merging coefficients from modality-aware representation signals.

3 Preliminary

We begin by formally describing MLLMs, then formulate model merging with LoRA.

Multimodal Large Language Model  MLLMs operate beyond the textual space by taking modality token embeddings together with text tokens. Concretely, let $\mathcal{M} = \{m_1, \ldots, m_K\}$ denote a set of modalities. For each modality $m_i$, a modality-specific encoder $f_{m_i}$ projects the raw modality input $x_{m_i}$ to a sequence of modality tokens: $H_{m_i} = f_{m_i}(x_{m_i})$. Then, an MLLM generates textual output $y$ by taking the concatenated textual and modality tokens: $y = g(H^0)$, where $H^0 = [H_{\text{text}}; H_{m_i}; \ldots]$. In this view, the core interface of an MLLM is the token-based embeddings: raw modality inputs are first mapped into sequences of vectors that are compatible with the LLM's embedding dimension, and then processed together by a single transformer.

LoRA Merging  In the MLLM setting, modality specialization is often implemented as parameter-efficient tuning with LoRA (Hu et al., 2022). Therefore, our merging method is based on merging LoRA parameters via weighted summation. Specifically, let $\theta_{m_i}$ denote the LoRA parameters of an MLLM specialized to modality $m_i$. We index the parameters by layer and weight element: $\theta^{l,n}_{m_i}$ denotes the $n$-th weight in the $l$-th layer of the modality-specific model for $m_i$. Then, LoRA merging is formalized as follows:

$$\theta^{l,n}_{\text{uni}} \leftarrow \sum_{m_i \in \mathcal{M}} \lambda^{l,n}_{m_i}\, \theta^{l,n}_{m_i}. \quad (1)$$

Under this formulation, the main challenge is to estimate appropriate merging coefficients $\lambda^{l,n}_{m_i}$ that determine how strongly each modality-specific LoRA parameter contributes to modality understanding. Our method addresses this coefficient estimation problem by deriving $\lambda^{l,n}_{m_i}$ from embedding space signals rather than parameter space statistics.
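Eq. (1) is an element-wise weighted sum over the experts' LoRA parameters. A minimal sketch of this weighted summation, assuming parameters stored as per-layer arrays (all names and shapes are hypothetical):

```python
import numpy as np

def merge_lora(expert_params, coeffs):
    """expert_params: dict modality -> {layer_name: ndarray};
    coeffs: dict modality -> {layer_name: ndarray of the same shape},
    playing the role of the element-wise coefficients lambda^{l,n}_{m_i}."""
    merged = {}
    for modality, layers in expert_params.items():
        for name, w in layers.items():
            # Accumulate each expert's element-wise weighted contribution.
            merged[name] = merged.get(name, 0.0) + coeffs[modality][name] * w
    return merged

# With uniform coefficients of 0.5 per element, merging two experts
# reduces to a simple average of their parameters.
w1 = {"layer0": np.ones((2, 2))}
w2 = {"layer0": 3 * np.ones((2, 2))}
uniform = {"layer0": 0.5 * np.ones((2, 2))}
out = merge_lora({"mol": w1, "prot": w2}, {"mol": uniform, "prot": uniform})
assert np.allclose(out["layer0"], 2.0)
```

The rest of the method is devoted to choosing these coefficients from embedding signals rather than fixing them uniformly.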
4 ES-Merging

We present Embedding-Signal-based MLLM Merging (ES-Merging), a representation-aware framework for merging modality-specialized MLLMs based on embedding space signals. Specifically, ES-Merging constructs probe inputs to elicit representational differences across models, and converts these embedding space discrepancies into layer-wise and element-wise merging coefficients.

4.1 Probe Input

As discussed in Section 3, we view the projected modality tokens as the core interface of the LLM backbones. In other words, once a raw modality input is mapped into the shared embedding space, the modality-specific knowledge of an MLLM is reflected in how the backbone interprets and transforms these token embeddings across layers. From this perspective, understanding modality specialization reduces to analyzing the layer-wise transformation of modality tokens in representation space.

Figure 3: Overview of ES-Merging. (a) Layer-wise global merging coefficients are computed from coarse-grained embedding signals: the layer-wise differences of distribution distances (SWD) between mean-pooled representations of the base LLM and a specialized MLLM. (b) Element-wise local merging coefficients are assigned based on fine-grained embedding signals by computing gradients from the embedding-wise distances.

Motivated by this view, we design probe inputs that explicitly expose token embeddings of different modalities, so that we can compare how the base LLM and each modality-specialized MLLM process the same modality embeddings.

Specifically, for each modality $m_i$, we collect a set of raw inputs $\{x^{(k)}_{m_i}\}_{k=1}^{K}$ from corresponding datasets and transform them via the modality-specific encoder $f_{m_i}$ into the modality token embeddings $\{H^{(k)}_{m_i}\}_{k=1}^{K}$, where $H^{(k)}_{m_i} = f_{m_i}(x^{(k)}_{m_i})$. Using these embeddings, we then build the probe input by concatenating a short textual prefix with all modality token blocks:

$$H^{0,(k)}_{\text{probe}} = [H_{\text{text},m_1}; H^{(k)}_{m_1}; H_{\text{text},m_2}; H^{(k)}_{m_2}; \ldots],$$

where $H_{\text{text},m_i}$ denotes the text-prefix embeddings associated with modality $m_i$ (e.g., the modality identifier or name). In the biological setting, the modality set $\mathcal{M}$ typically consists of {molecule, protein, cell}, and the resulting probe input is illustrated in Fig. 4.

By forwarding each probe input through the base LLM and each MLLM specialized to modality $m_j$, we obtain layer-wise embeddings by extracting the modality-$m_i$ tokens, denoted as $H^{l,(k)}_{m_i \to \text{base}}$ and $H^{l,(k)}_{m_i \to \theta_{m_j}}$ in $\mathbb{R}^{T_{m_i} \times d}$, where $l$ denotes the layer index and $T_{m_i}$ denotes the number of modality-$m_i$ tokens. These layer-wise representations serve as the foundation for our subsequent merging coefficient estimation. We regard the embedding discrepancy between the base LLM and each modality-specialized MLLM as an embedding space signal that reflects the degree of modality specialization. We then leverage this signal at two complementary levels: globally across layers, to estimate which layers contribute more to specialization, and locally within each layer, to identify which parameter elements are more strongly associated with the specialized transformation.

Figure 4: Prompt template of the probe input ("Molecule:", "Protein:", and "Cell:" prefixes followed by the corresponding modality tokens).

4.2 Layer-wise Global Merging Coefficient

To capture global modality specialization, we compute a layer-wise global merging coefficient from coarse-grained embedding signals (Fig. 3 (a)).

Coarse-grained Embedding Signal  We first summarize the modality-specific embeddings into a coarse-grained representation by averaging token-level embeddings, obtaining $\hat{h}^{l,(k)}_{m_i \to \text{base}} \in \mathbb{R}^d$ for the base model and $\hat{h}^{l,(k)}_{m_i \to \theta_{m_j}} \in \mathbb{R}^d$ for a specialized model $\theta_{m_j}$. By collecting these coarse-grained embeddings over the $K$ probe inputs, we obtain two layer-wise embedding sets, $H^l_{m_i \to \text{base}} \in \mathbb{R}^{K \times d}$ and $H^l_{m_i \to \theta_{m_j}} \in \mathbb{R}^{K \times d}$, which represent how the base and specialized models process modality-$m_i$ inputs at layer $l$. We then quantify their representational gap using the embedding distribution distance with sliced Wasserstein distance (SWD) (Bonneel et al., 2015):

$$\mathrm{SWD}^l_{m_i \to \theta_{m_j}} = \mathrm{SWD}\big(H^l_{m_i \to \text{base}},\; H^l_{m_i \to \theta_{m_j}}\big).$$

A larger value indicates that $\theta_{m_j}$ induces a stronger layer-wise shift from the base model when processing modality-$m_i$ tokens.

Layer-wise Importance Estimation  While $\mathrm{SWD}^l_{m_i \to \theta_{m_j}}$ measures the cumulative differences up to layer $l$, we are particularly interested in how much new modality-specific transformation is introduced at each layer. We therefore compute the layer-wise change:

$$d^l_{m_i \to \theta_{m_j}} = \mathrm{SWD}^l_{m_i \to \theta_{m_j}} - \mathrm{SWD}^{l-1}_{m_i \to \theta_{m_j}}.$$

If $d^l_{m_i \to \theta_{m_j}}$ is large, the corresponding layer of model $\theta_{m_j}$ contributes more strongly to modality-$m_i$-specific processing. Since the magnitude of these changes can vary across models, we normalize them over layers using Z-score normalization, obtaining $\hat{d}^l_{m_i \to \theta_{m_j}}$. We aggregate the normalized changes over all input modalities to obtain the layer-wise importance score of model $\theta_{m_j}$:

$$s^l_{\theta_{m_j}} = \sum_{m_i \in \mathcal{M}} \hat{d}^l_{m_i \to \theta_{m_j}}.$$
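The layer-wise pipeline above — per-layer SWD increments, Z-score normalization, and the subsequent softmax across models (Eq. 2) — can be sketched as follows; the SWD values and expert scores are illustrative placeholders, not measured quantities:

```python
import numpy as np

# Illustrative SWD values for one modality: swd[l] is the distribution
# distance between base and specialized embeddings at layer l.
swd = np.array([0.10, 0.15, 0.40, 0.45, 0.80])

# Per-layer increase d^l = SWD^l - SWD^{l-1}; the first layer is
# compared against a baseline of zero.
d = np.diff(swd, prepend=0.0)

# Z-score normalize over layers so scales are comparable across models;
# summing over input modalities (only one here) gives the score s^l.
s = (d - d.mean()) / d.std()

# Softmax across competing specialized models with temperature tau;
# the three expert score vectors below are toy stand-ins.
tau = 1.0
s_models = np.stack([s, 0.5 * s, -s])
alpha = np.exp(s_models / tau)
alpha /= alpha.sum(axis=0, keepdims=True)

assert np.argmax(s) == 4                    # largest new shift: last layer
assert np.allclose(alpha.sum(axis=0), 1.0)  # coefficients sum to 1 per layer
```

Each row of `alpha` would correspond to one specialized MLLM's layer-wise coefficient vector.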
Layer-wise Global Coefficient  Finally, we convert the layer-wise importance scores into merging coefficients by applying a softmax across models:

$$\alpha^l_{m_j} = \frac{\exp(s^l_{\theta_{m_j}}/\tau)}{\sum_{m \in \mathcal{M}} \exp(s^l_{\theta_m}/\tau)}. \quad (2)$$

As a result, a specialized MLLM receives a larger coefficient at layers where it contributes more strongly to coarse-grained representational changes over different modality inputs.

4.3 Element-wise Local Merging Coefficient

While the layer-wise global coefficient captures modality importance at the transformer-layer level, it assigns a uniform merging weight to all parameters within the same layer. To address this limitation, we further introduce an element-wise local merging coefficient derived from fine-grained embedding signals (Fig. 3 (b)).

Fine-grained Embedding Signals  To measure the fine-grained signals from embeddings, we first measure the distance of each embedding between the base model and a specialized model $\theta_{m_j}$, instead of using the coarse-grained representations:

$$r^{l,(k)}_{m_i \to \theta_{m_j}} = \big\| H^{l,(k)}_{m_i \to \text{base}} - H^{l,(k)}_{m_i \to \theta_{m_j}} \big\|_F,$$

where $\|\cdot\|_F$ denotes the Frobenius norm, the matrix analogue of the Euclidean $L_2$ norm for vectors. This distance measures how differently the specialized model $\theta_{m_j}$ processes modality-$m_i$ inputs relative to the base model at layer $l$ in a fine-grained manner.

Element-wise Importance Estimation  We then estimate the importance of each parameter element by measuring how sensitive this distance is to that element. Specifically, for the $n$-th parameter element in layer $l$, we accumulate the absolute gradient magnitude over all modalities and probe inputs:

$$s^{l,n}_{\theta_{m_j}} = \sum_{m_i \in \mathcal{M}} \sum_{k=1}^{K} \left| \frac{\partial\, r^{l,(k)}_{m_i \to \theta_{m_j}}}{\partial\, \theta^{l,n}_{m_j}} \right|.$$

A larger score indicates that the parameter element is more sensitive to the modality-specific differences between the base and specialized models.
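A sketch of this sensitivity accumulation, using a toy linear layer so the gradient of the Frobenius distance can be written analytically rather than via backpropagation (all shapes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))         # stands in for one LoRA weight matrix
H_base = rng.normal(size=(4, 8))    # base-model embeddings of probe tokens

score = np.zeros_like(W)
for _ in range(3):                  # loop over K probe inputs
    H = rng.normal(size=(4, 8))     # probe token embeddings
    R = H @ W - H_base              # residual of specialized vs. base response
    r = np.linalg.norm(R)           # Frobenius distance r^{l,(k)}
    grad = H.T @ R / r              # analytic gradient of r w.r.t. W
    score += np.abs(grad)           # accumulate |gradient| magnitudes

# Z-score normalize within the layer; a softmax across models (Eq. 3)
# would then turn these scores into element-wise coefficients.
score_hat = (score - score.mean()) / score.std()
assert score_hat.shape == (8, 8)
```

In the actual method the gradients flow through the full transformer, so an autograd framework would replace the analytic gradient line.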
Because the raw scores can vary across layers and parameters, we normalize them within each layer using Z-score normalization, yielding $\hat{s}^{l,n}_{\theta_{m_j}}$.

Element-wise Local Coefficient  We convert the normalized scores into element-wise merging coefficients by applying the softmax function:

$$\beta^{l,n}_{m_j} = \frac{\exp\big(\hat{s}^{l,n}_{\theta_{m_j}}/\tau\big)}{\sum_{m \in \mathcal{M}} \exp\big(\hat{s}^{l,n}_{\theta_m}/\tau\big)}. \quad (3)$$

The resulting element-wise merging coefficients selectively weight each parameter element according to its fine-grained sensitivity.

4.4 Integrating Layer- and Element-wise Merging Coefficients

The coarse-grained layer-wise coefficient $\alpha^l_{m_i}$ and the fine-grained element-wise coefficient $\beta^{l,n}_{m_i}$ each capture modality-specific importance at a different level of granularity. We combine them into a single coefficient by multiplying the layer-wise and element-wise coefficients and renormalizing across modalities:

$$\lambda^{l,n}_{m_i} = \frac{\alpha^l_{m_i} \cdot \beta^{l,n}_{m_i}}{\sum_{m \in \mathcal{M}} \alpha^l_m \cdot \beta^{l,n}_m}. \quad (4)$$

The final parameters of the merged model are computed by Eq. 1. By integrating coefficient types derived from different granularities of embedding signals, ES-Merging enables calibrated merging coefficients that preserve complementary modality expertise more faithfully, enabling robust cross-modal knowledge composition.

5 Experimental Results

5.1 Experimental Setup

Implementation Details  For merging, we leverage state-of-the-art MLLMs specialized to each modality: Mol-LLaMA (Kim et al., 2025) for the molecule modality, Prot2Text-V2 (Fei et al., 2025) for the protein modality, and Cell-o1 (Fang et al., 2025b) for the single-cell modality, whose base LLM is LLaMA-3.1-8B-Instruct (Grattafiori et al., 2024). The merging methods are applied across all modality-specific models, resulting in a unified MLLM. Since each downstream task requires distinct domain knowledge, we provide task-specific few-shot in-context examples to induce the appropriate task understanding. For a fair comparison, we use the same instruction template and in-context examples for all compared methods.

Table 1: Performance comparison of ES-Merging with the base LLM with and without task-specific fine-tuning, specialized MLLMs, and merging methods on the instance-varying interaction prediction tasks. We report accuracy and macro-F1 (Acc / F1) across each subset. Bold and underline indicate the best and second best performances, respectively.

| Method | BindingDB | BioSNAP | Human | Mol-Prot Avg | DrugComb | GDSC2 | Mol-Cell Avg |
|---|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct | 51.9 / 51.7 | 61.5 / 60.2 | 59.0 / 58.8 | 57.5 / 56.9 | 73.7 / 77.1 | 84.8 / 85.7 | 79.3 / 81.4 |
| Mol-LLaMA (Kim et al., 2025) | 55.8 / 54.7 | 66.5 / 64.0 | 61.5 / 61.4 | 61.2 / 60.0 | 64.1 / 80.4 | 55.2 / 82.3 | 59.7 / 81.4 |
| Prot2Text-V2 (Fei et al., 2025) | 59.2 / 59.2 | 55.3 / 54.1 | 47.2 / 47.2 | 53.9 / 53.5 | 65.6 / 78.4 | 59.7 / 67.0 | 62.6 / 72.7 |
| Cell-o1 (Fang et al., 2025b) | 53.5 / 53.1 | 59.3 / 59.8 | 49.1 / 48.5 | 54.0 / 53.8 | 76.3 / 77.2 | 85.9 / 86.0 | 81.1 / 81.6 |
| Avg. Merging | 65.3 / 64.9 | 66.4 / 66.5 | 60.9 / 60.9 | 64.2 / 64.1 | 72.5 / 74.6 | 85.4 / 87.9 | 78.9 / 81.2 |
| TIES-Merging (Yadav et al., 2023) | 60.8 / 60.8 | 62.7 / 61.6 | 58.6 / 58.6 | 60.7 / 60.3 | 74.7 / 77.0 | 85.9 / 87.1 | 80.3 / 82.1 |
| EMR-Merging (Huang et al., 2024) | 64.7 / 64.2 | 66.3 / 66.9 | 60.4 / 60.4 | 63.8 / 63.8 | 45.2 / 71.5 | 93.4 / 93.3 | 69.3 / 82.4 |
| AdaMerging (Yang et al., 2023) | 52.3 / 51.6 | 64.3 / 62.0 | 60.0 / 60.2 | 58.9 / 57.9 | 48.0 / 36.9 | 46.5 / 38.8 | 47.2 / 37.8 |
| PCB-Merging (Du et al., 2024) | 55.1 / 55.1 | 60.3 / 58.8 | 58.6 / 58.4 | 58.0 / 57.4 | 77.3 / 78.7 | 86.0 / 86.1 | 81.7 / 82.4 |
| Consensus Merging (Wang et al., 2024b) | 59.2 / 59.3 | 62.1 / 61.0 | 59.2 / 59.1 | 60.2 / 59.8 | 76.0 / 78.4 | 84.3 / 85.8 | 80.2 / 82.1 |
| LS-Merge (Soro et al., 2026) | 52.3 / 52.1 | 61.8 / 60.6 | 58.8 / 58.5 | 57.7 / 57.1 | 79.1 / 79.0 | 83.5 / 83.3 | 81.3 / 81.1 |
| Avg. Merging + FT | 60.5 / 59.7 | 55.8 / 55.9 | 57.2 / 57.3 | 57.8 / 57.6 | 81.1 / 80.8 | 94.0 / 93.9 | 87.5 / 87.4 |
| ES-Merging (Ours) | 66.0 / 65.3 | 69.1 / 68.4 | 62.0 / 61.9 | 65.7 / 65.2 | 80.7 / 80.2 | 94.1 / 94.0 | 87.4 / 87.1 |
For more implementation details, please refer to Appendix A.3 and A.4.

Baselines  We compare ES-Merging with the base LLM, MLLMs specialized to a single modality, and merging methods. As modality-specialized baselines, we consider Mol-LLaMA, Prot2Text-V2, and Cell-o1, each specialized for the molecule, protein, and single-cell modalities, respectively. We also compare ES-Merging with representative merging methods, including Average Merging, TIES-Merging (Yadav et al., 2023), EMR-Merging (Huang et al., 2024), layer-wise AdaMerging (Yang et al., 2023), PCB-Merging (Du et al., 2024), Consensus Merging (Wang et al., 2024b), LS-Merge (Soro et al., 2026), and a task-specific fine-tuned model after average merging, denoted as Avg. Merging + FT. For the base LLM and the specialized MLLMs, unsupported modalities are represented as textual inputs. Baseline details are provided in Appendix A.2.

5.2 Instance-varying Interaction Prediction

We first evaluate on instance-varying cross-modal interaction tasks, where the target counterpart changes across instances and the model should generalize across diverse cross-modal pairs.

Datasets  We consider two instance-varying cross-modal interaction settings: molecule-protein interaction and molecule-cell interaction. For molecule-protein interaction, the task is to predict whether a given molecule interacts with a given protein, including BindingDB, BioSNAP, and Human (Koh et al., 2023). For molecule-cell interaction, the task is to predict the effect of a molecule on a given cell, including DrugComb (Zagidullin et al., 2019) and GDSC2 (Chawla et al., 2022). In both settings, the molecule, protein, and cell counterparts change for each instance, requiring the model to generalize across diverse cross-modal combinations rather than relying on a fixed target identity. We provide a detailed explanation of the datasets in Appendix A.1.
Results  As shown in Table 1, ES-Merging consistently outperforms the merging baselines, suggesting that ES-Merging more effectively integrates complementary cross-modal knowledge from modality-specialized MLLMs, thus leading to stronger generalization when the interaction counterpart varies across instances. Notably, ES-Merging outperforms the task-specific fine-tuned model (Avg. Merging + FT) on the molecule-protein interaction tasks and shows comparable performance on the molecule-cell interaction tasks, indicating that ES-Merging can enhance cross-modal reasoning without further downstream fine-tuning. This superior performance of ES-Merging comes from preserving the reasoning capabilities of specialized MLLMs, as shown in Tables 5 and 6 in the Appendix. In contrast, task-specific fine-tuning tends to diminish these reasoning capabilities and can even degrade performance, particularly on the molecule-protein interaction tasks. On the other hand, existing merging baselines such as EMR-Merging often exhibit substantial instability across datasets, showing degraded performance on DrugComb. This implies that simple parameter space heuristics are insufficient for reliably combining heterogeneous multimodal experts, supporting that ES-Merging provides a more robust integration of modality-specific knowledge.

Table 2: Performance comparison of ES-Merging on the target-fixed functionality prediction tasks. We report accuracy and macro-F1 (Acc / F1) across each subset. Bold indicates the best and underline indicates the second best. "(S)" marks the CYP substrate subtasks.

| Method | CYP1A2 | CYP2C19 | CYP2C9 | CYP2D6 | CYP3A4 | Inhib. Avg | CYP2C9 (S) | CYP2D6 (S) | CYP3A4 (S) | Subs. Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct | 55.3 / 58.0 | 50.6 / 55.4 | 47.4 / 53.7 | 56.4 / 51.7 | 43.1 / 52.8 | 50.6 / 54.3 | 47.8 / 50.7 | 30.1 / 33.1 | 48.5 / 50.6 | 42.1 / 44.8 |
| Mol-LLaMA | 68.5 / 67.1 | 65.7 / 64.3 | 63.3 / 63.3 | 64.7 / 59.2 | 64.3 / 64.0 | 65.3 / 63.6 | 62.7 / 49.7 | 61.7 / 57.4 | 55.2 / 54.4 | 59.9 / 53.8 |
| Prot2Text-V2 | 59.4 / 55.8 | 56.8 / 52.5 | 67.4 / 60.1 | 76.3 / 55.9 | 59.6 / 50.3 | 63.9 / 54.9 | 54.5 / 48.6 | 53.4 / 52.0 | 56.0 / 54.2 | 54.6 / 51.6 |
| Cell-o1 | 53.5 / 58.3 | 44.9 / 51.9 | 43.1 / 52.8 | 46.9 / 46.3 | 43.6 / 51.6 | 46.4 / 52.2 | 36.6 / 41.7 | 36.1 / 39.9 | 48.5 / 55.0 | 40.4 / 45.6 |
| Avg. Merging | 61.2 / 60.5 | 54.1 / 53.5 | 52.7 / 53.8 | 53.8 / 49.6 | 54.6 / 55.1 | 55.2 / 54.5 | 40.3 / 39.6 | 42.4 / 41.3 | 49.3 / 45.8 | 44.0 / 42.2 |
| TIES-Merging | 71.8 / 71.8 | 67.8 / 67.9 | 66.5 / 65.0 | 76.1 / 64.0 | 65.3 / 66.3 | 69.5 / 67.0 | 56.0 / 46.8 | 55.6 / 53.2 | 54.5 / 53.7 | 55.4 / 51.2 |
| EMR-Merging | 70.4 / 70.5 | 64.8 / 65.4 | 66.8 / 66.3 | 74.2 / 61.8 | 66.1 / 66.0 | 68.5 / 66.0 | 60.5 / 50.6 | 55.6 / 52.9 | 51.5 / 49.4 | 55.9 / 51.0 |
| AdaMerging | 52.5 / 55.3 | 53.1 / 54.2 | 44.8 / 46.0 | 45.6 / 45.7 | 50.4 / 54.3 | 49.3 / 51.1 | 33.6 / 33.6 | 43.6 / 45.7 | 49.3 / 47.4 | 42.2 / 42.2 |
| PCB-Merging | 69.2 / 69.4 | 63.6 / 64.6 | 65.0 / 63.8 | 72.4 / 61.0 | 63.8 / 65.2 | 66.8 / 64.8 | 55.2 / 44.9 | 57.1 / 53.9 | 56.7 / 55.7 | 56.4 / 51.5 |
| Consensus Merging | 68.5 / 68.8 | 63.6 / 64.1 | 65.7 / 64.0 | 73.5 / 60.8 | 63.8 / 64.7 | 67.0 / 64.5 | 54.5 / 46.5 | 55.6 / 52.9 | 56.7 / 57.7 | 55.6 / 52.4 |
| LS-Merge | 59.1 / 58.0 | 56.4 / 56.0 | 56.2 / 54.5 | 61.0 / 51.3 | 55.5 / 53.5 | 57.6 / 54.7 | 51.3 / 45.7 | 44.1 / 41.9 | 48.7 / 46.3 | 48.0 / 44.6 |
| Avg. Merging + FT | 68.3 / 67.7 | 67.0 / 66.5 | 62.2 / 62.1 | 67.6 / 60.9 | 67.6 / 67.6 | 66.5 / 65.0 | 65.7 / 50.6 | 59.4 / 53.0 | 55.2 / 54.6 | 60.1 / 52.7 |
| ES-Merging (Ours) | 77.4 / 77.4 | 70.6 / 70.5 | 72.5 / 70.4 | 80.7 / 69.5 | 71.3 / 70.8 | 74.5 / 71.7 | 64.2 / 53.6 | 60.9 / 57.2 | 60.5 / 59.6 | 61.9 / 56.8 |

5.3 Target-fixed Functionality Prediction

We further evaluate on target-fixed cross-modal functionality prediction tasks, where each subtask is associated with a fixed target and requires target-specific biological functionality knowledge beyond the simple interaction matching of Section 5.2.

Datasets  To this end, we evaluate on CYP enzyme prediction.
Unlike the instance-varying setting above, each subtask is defined with respect to a fixed enzyme target, and the tasks predict the biological functionality of the given molecule with respect to that fixed target: inhibitory effects or substrate specificity. Specifically, we consider five CYP inhibition subtasks for the CYP1A2, CYP2C19, CYP2C9, CYP2D6, and CYP3A4 enzymes (Weiser et al., 2023), and three CYP substrate subtasks for the CYP2C9, CYP2D6, and CYP3A4 enzymes (Holmer et al., 2021), from the TDC dataset (Huang et al., 2021). These benchmarks therefore evaluate whether the merged model can capture not only detailed molecular structural variation under a shared target, but also biologically meaningful interaction types.

Results  As shown in Table 2, ES-Merging achieves the best average performance, suggesting that it more effectively preserves and integrates the expert knowledge required for target-specific functionality prediction. In contrast, on the CYP substrate tasks, most merging baselines underperform Mol-LLaMA, likely because these tasks rely heavily on molecular structural understanding under a fixed-target setting. Notably, ES-Merging achieves performance that is comparable to or better than Mol-LLaMA, suggesting that embedding-signal-based merging can not only integrate modality-specific expertise more effectively but also enhance cross-modal understanding.

5.4 Ablation Study

We further conduct an ablation study to examine the effectiveness of each coefficient type.
As shown in Table 3, using only one of the coefficient types still outperforms the other merging baselines, indicating that the embedding space signals provide salient and robust information for cross-modal merging compared to merging methods based on parameter space signals. On the other hand, combining the two coefficient types shows the best performance, enabling the capture of different granularities of the MLLMs' specialization.

Figure 5: Layer-wise merging coefficient visualization of each specialized MLLM (molecule, protein, and cell rows over layers 0-30), derived by Eq. 2. The l-th column corresponds to the merging coefficient of the l-th layer, $\alpha^l_{m_j}$.

Table 3: Ablation studies on merging coefficient types. We report average accuracy on each task group.

| Coefficient Type | Mol-Prot | Mol-Cell | CYP Inhib. | CYP Subs. |
|---|---|---|---|---|
| Layer-wise | 63.6 | 85.2 | 73.9 | 57.1 |
| Element-wise | 64.9 | 86.7 | 72.7 | 60.5 |
| Layer × Element | 65.7 | 87.4 | 74.5 | 61.9 |

Table 4: Total floating point operations (FLOPs) as a measure of the computational cost of determining the merging coefficients and LoRA module parameters.

| Method | Total FLOPs ↓ |
|---|---|
| Avg. Merging + FT | 891,117 |
| AdaMerging | 493,694 |
| ES-Merging (Ours) | 149,807 |

5.5 Merging Coefficient Analysis

We visualize the layer-wise merging coefficients in Fig. 5 and the element-wise merging coefficients in Fig. 9 of the Appendix.
The coefficients are distinctly distributed across the different MLLMs, showing that modality-specific knowledge is not incorporated uniformly throughout the model. Additionally, we observe that even when the layer-wise coefficient is high (e.g., the third layer of the molecule LLM), the element-wise merging coefficients within that layer vary substantially, as shown in Fig. 6. This result indicates that modality specialization emerges at multiple levels of granularity: only a few parameter elements within important layers are primarily salient for modality specialization. Therefore, combining layer-wise global coefficients with element-wise local coefficients leads to complementary merging at different granularities, achieving an accurate and robust merging of different modalities.

Figure 6: Computed element-wise merging coefficient visualization of each module (Q/K/V/O projections, LoRA A and LoRA B) in the third layer of the molecule LLM, derived by Eq. 3.

5.6 Computational Cost Comparison

We compare the computational cost with the baselines that require gradient computation and parameter updates. As shown in Table 4, the computational cost of ES-Merging is 3.4× and 6.1× lower than AdaMerging and Task-specific Fine-tuning, respectively, since ES-Merging requires gradient computation only once, while the other baselines iteratively compute gradients and update parameters. Despite its lower computational cost, ES-Merging outperforms Task-specific Finetuning and AdaMerging, indicating both the efficiency and the effectiveness of ES-Merging.

6 Conclusion

We present ES-Merging, a novel MLLM merging framework that shifts model merging from relying on parameter space signals to leveraging embedding space signals.
Our key insight is that input-aware representations encode rich modality-specific specialization, providing a more faithful basis for integrating modality-specialized MLLMs. Based on this observation, we introduce two complementary types of merging coefficients at different granularities: layer-wise global coefficients for capturing coarse-grained specialization and element-wise local coefficients for capturing fine-grained importance. Experiments on interactive effect prediction tasks demonstrate that ES-Merging consistently improves multimodal merging performance across diverse biological tasks, highlighting embedding space signals as a principled foundation for MLLM merging.

References

Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, and Michalis Vazirgiannis. 2024. Prot2Text: Multimodal protein's function generation with GNNs and transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10757–10765.

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. 2015. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45.

He Cao, Zijing Liu, Xingyu Lu, Yuan Yao, and Yu Li. 2023. InstructMol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint.

Saurabh Chawla, Anja Rockstroh, Melanie Lehman, and 1 others. 2022. Gene expression based inference of cancer drug sensitivity. Nature Communications, 13(1):5680.

Chi-Jane Chen, Yuhang Chen, Sukwon Yun, Natalie Stanley, and Tianlong Chen. 2025. Spatial coordinates as a cell language: A multi-sentence framework for imaging mass cytometry analysis. In Findings of the Association for Computational Linguistics: ACL 2025, pages 13241–13252.

Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. 2020. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. arXiv preprint.
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

Sajib Acharjee Dip, Adrika Zafor, Bikash Kumar Paul, Uddip Acharjee Shuvo, Muhit Islam Emon, Xuan Wang, and Liqing Zhang. 2025. LLM4Cell: A survey of large language and agentic models for single-cell biology. arXiv preprint arXiv:2510.07793.

Guodong Du, Junlin Lee, Jing Li, Runhua Jiang, Yifei Guo, Shuyang Yu, Hanting Liu, Sim K Goh, Ho-Kin Tang, Daojing He, and 1 others. 2024. Parameter competition balancing for model merging. Advances in Neural Information Processing Systems, 37:84746–84776.

Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. 2022. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 375–413.

Yin Fang, Xinle Deng, Kangwei Liu, Ningyu Zhang, Jingyang Qian, Penghui Yang, Xiaohui Fan, and Huajun Chen. 2025a. A multi-modal AI copilot for single-cell analysis with instruction following. arXiv preprint arXiv:2501.08187.

Yin Fang, Qiao Jin, Guangzhi Xiong, Bowen Jin, Xianrui Zhong, Siru Ouyang, Aidong Zhang, Jiawei Han, and Zhiyong Lu. 2025b. Cell-o1: Training LLMs to solve single-cell reasoning puzzles with reinforcement learning. arXiv preprint.

Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. 2023. Mol-Instructions: A large-scale biomolecular instruction dataset for large language models. arXiv preprint.

Xiao Fei, Michail Chatzianastasis, Sarah Almeida Carneiro, Hadi Abdine, Lawrence P Petalidis, and Michalis Vazirgiannis. 2025.
Prot2Text-V2: Protein function prediction with multimodal contrastive alignment. Advances in Neural Information Processing Systems.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint.

Han Guo, Mingjia Huo, Ruiyi Zhang, and Pengtao Xie. 2023. ProteinChat: Towards achieving ChatGPT-like functionalities on protein 3D structures. Authorea Preprints.

Shuhan Guo, Yatao Bian, Ruibing Wang, Nan Yin, Zhen Wang, and Quanming Yao. 2024. UniMoT: Unified molecule-text language model with discrete token representation. arXiv preprint.

Malte Holmer, Christina de Bruyn Kops, Conrad Stork, and Johannes Kirchmair. 2021. CYPstrate: A set of machine learning models for the accurate classification of cytochrome P450 enzyme substrates and non-substrates. Molecules, 26(15):4678.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, and 1 others. 2022. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3.

Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. 2024. EMR-Merging: Tuning-free high-performance model merging. Advances in Neural Information Processing Systems, 37:122741–122769.

Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. 2021. Therapeutics Data Commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548.

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2022. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.
Oussama Kharouiche, Aris Markogiannakis, Xiao Fei, Michail Chatzianastasis, and Michalis Vazirgiannis. 2025. Cell2Text: Multimodal LLM for generating single-cell descriptions from RNA-seq data. arXiv preprint arXiv:2509.24840.

Dongki Kim, Wonbin Lee, and Sung Ju Hwang. 2025. Mol-LLaMA: Towards general understanding of molecules in large molecular language model. Advances in Neural Information Processing Systems.

Huan Yee Koh, Anh T.N. Nguyen, Shirui Pan, Lauren T. May, and Geoffrey I. Webb. 2023. PSICHIC: Physicochemical graph neural network for learning protein-ligand interaction fingerprints from sequence data. bioRxiv.

Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. 2020. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024.

Longyi Li, Liyan Dong, Hao Zhang, Dong Xu, and Yongli Li. 2025. spaLLM: Enhancing spatial domain analysis in multi-omics data through large language model integration. Briefings in Bioinformatics, 26(4):bbaf304.

Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, and Qi Tian. 2024. Towards 3D molecule-text interpretation in language models. arXiv preprint arXiv:2401.13923.

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, and 1 others. 2022. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022:500902.

Zhiyuan Liu, Sihang Li, Yanchen Luo, Hao Fei, Yixin Cao, Kenji Kawaguchi, Xiang Wang, and Tat-Seng Chua. 2023. MolCA: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, and Yu Cheng. 2024. Twin-Merging: Dynamic integration of modular expertise in model merging. Advances in Neural Information Processing Systems, 37:78905–78935.

OpenAI. 2024a. GPT-4 technical report. arXiv:2303.08774.

OpenAI. 2024b. GPT-4o system card. arXiv:2410.21276.

Jinyoung Park, Minseong Bae, Dohwan Ko, and Hyunwoo J Kim. 2024. LLaMo: Large language model-based molecular graph assistant. Advances in Neural Information Processing Systems, 37:131972–132000.

Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, and Rui Yan. 2023. BioT5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1102–1123.

Syed Asad Rizvi, Daniel Levine, Aakash Patel, Shiyang Zhang, Eric Wang, Curtis Jamison Perry, Nicole Mayerli Constante, Sizhuang He, David Zhang, Cerise Tang, and 1 others. 2025. Scaling large language models for next-generation single-cell analysis. bioRxiv, pages 2025–04.

David Rogers and Mathew Hahn. 2010. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754.

Bedionita Soro, Aoxuan Silvia Zhang, Bruno Andreis, Jaehyeong Jo, Song Chong, and Sung Ju Hwang. 2026. LS-Merge: Merging language models in latent space. In The Fourteenth International Conference on Learning Representations.

Hugo Touvron and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.

Chao Wang, Hehe Fan, Ruijie Quan, and Yi Yang. 2024a. ProtChatGPT: Towards understanding proteins with large language models. arXiv preprint arXiv:2402.09649.

Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, and Pascal Frossard. 2024b. Localizing task information for improved model merging and compression.
arXiv preprint arXiv:2405.07813.

Zhicong Wang, Zicheng Ma, Ziqiang Cao, Changlong Zhou, Jun Zhang, and Yi Qin Gao. 2025a. Prot2Chat: Protein large language model with early fusion of text, sequence, and structure. Bioinformatics, 41(8):btaf396.

Ziqing Wang, Kexin Zhang, Zihan Zhao, Yibo Wen, Abhishek Pandey, Han Liu, and Kaize Ding. 2025b. A survey of large language models for text-guided molecular discovery: From molecule generation to optimization. arXiv preprint.

David Weininger. 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36.

Benjamin Weiser, Jérôme Genzling, Mihai Burai-Patrascu, Ophélie Rostaing, and Nicolas Moitessier. 2023. Machine learning-augmented docking. 1. CYP inhibition prediction. Digital Discovery, 2:1841–1849.

Peter Willett, John M Barnard, and Geoffrey M Downs. 1998. Chemical similarity searching. Journal of Chemical Information and Computer Sciences, 38(6):983–996.

Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, and Wei Wang. 2024. ProteinGPT: Multimodal LLM for protein property prediction and structure understanding. arXiv preprint arXiv:2408.11363.

Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, and 1 others. 2025. Protein large language models: A comprehensive survey. arXiv preprint arXiv:2502.17504.

Minghao Xu, Xinyu Yuan, Santiago Miret, and Jian Tang. 2023. ProtST: Multi-modality learning of protein sequences and biomedical texts. In International Conference on Machine Learning, pages 38749–38767. PMLR.

Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. 2023. TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36:7093–7115.
Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. 2023. AdaMerging: Adaptive model merging for multi-task learning. arXiv preprint.

Botao Yu, Frazier N. Baker, Ziqi Chen, Xia Ning, and Huan Sun. 2024. LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv preprint arXiv:2402.09391.

Bulat Zagidullin, Jehad Aldahdooh, Shuyu Zheng, Wenyu Wang, Yinyin Wang, Joseph Saad, Alina Malyutina, Mohieddin Jafari, Ziaurrehman Tanoli, Alberto Pessia, and Jing Tang. 2019. DrugComb: An integrative cancer drug combination data portal. Nucleic Acids Research, 47(W1):W43–W51.

Organization

The Appendix is organized as follows: In Section A, we provide detailed experimental settings, including datasets, baselines, implementation, and prompt setting details. In Section B, we qualitatively analyze generated responses. In Section C, we provide additional analysis of our merging method. In Section D, we discuss the limitations of our work.

A Experiment Settings

A.1 Dataset Details

• BindingDB is a drug-target interaction dataset derived from experimentally measured small molecule-protein binding data, typically filtered to human proteins, and is used as a binary classification benchmark with 11,054 samples for predicting whether a drug-target pair interacts.

• BioSNAP is a drug-target interaction dataset derived from known associations between US-marketed drugs and their human protein targets, and is used as a binary classification benchmark with 6,058 samples for predicting whether a drug-target pair interacts.

• Human is a drug-target interaction dataset consisting of drug-human protein pairs with highly credible negative samples, and is used as a binary classification benchmark with 1,375 samples for predicting whether a drug-target pair interacts.
• DrugComb is a drug combination-cell line interaction dataset derived from standardized and harmonized drug combination screening studies across various cancer cell lines, and is used as a binary classification benchmark with 3,631 samples for predicting whether a combination of two drugs produces a synergistic or antagonistic anti-cancer effect in a given cancer cell line.

• GDSC2 is a drug-cell line interaction dataset derived from the Genomics of Drug Sensitivity in Cancer project, which screens over 1,000 genetically characterized human cancer cell lines with a wide range of anti-cancer therapeutics using a newer cell screening platform introduced in 2015, and is used as a binary classification benchmark with 843 samples for predicting whether a given cancer cell line is sensitive or resistant to a specific drug.

• CYP1A2, CYP2C19, CYP2C9, CYP2D6, and CYP3A4 Inhibition are binary classification benchmarks for predicting whether a given drug inhibits a specific cytochrome P450 enzyme isoform involved in drug metabolism, consisting of 2,516, 2,533, 2,418, 2,626, and 2,466 samples, respectively.

• CYP2C9, CYP2D6, and CYP3A4 Substrate are three binary classification benchmarks for predicting whether a given drug is a substrate of a specific cytochrome P450 enzyme isoform involved in drug metabolism, with 134, 133, and 134 samples, respectively.

A.2 Baseline Details

We compare our method against three modality-specialized base models, one task-specific fine-tuned model, six existing model merging baselines, and three ablation variants of our proposed method.

• Mol-LLaMA (Kim et al., 2025) is a large molecular language model trained via multi-modal instruction tuning that integrates complementary 2D and 3D molecular encoders through a blending module, with a Q-Former projector and a LLaMA backbone fine-tuned with LoRA, providing general molecular understanding with explainability and reasoning capabilities across diverse molecular tasks.

• Prot2Text-V2 (Fei et al., 2025) is a multimodal sequence-to-text model that combines an ESM2-3B (Lin et al., 2022) protein sequence encoder with a LLaMA-3.1-8B-Instruct decoder via a nonlinear modality projector, using hybrid sequence-level contrastive alignment learning and instruction-based LoRA fine-tuning to generate rich functional descriptions of proteins directly from amino acid sequences.

• Cell-o1 (Fang et al., 2025b) is a cell line-specialized model fine-tuned on transcriptomic omics data, capable of handling cell line-related drug response tasks but limited to single-modality inference without understanding molecules or proteins.

• Average Merging directly averages the parameter elements of all specialist MLLMs without any additional computation.

• Average Merging + Task-Specific Finetune constructs a base model by averaging the LoRA weights of all three specialist models, and subsequently fine-tunes the LoRA parameters independently on each evaluation task by sampling 2,000 examples from the training dataset for 10 epochs with a batch size of 16.

• TIES-Merging (Yadav et al., 2023) resolves interference among task vectors through three steps (trim, elect sign, and disjoint merge), which prune low-magnitude parameters, resolve sign conflicts by selecting the dominant direction, and merge only the parameter-aligned subset.

• AdaMerging (Yang et al., 2023) learns merging coefficients for task vectors in a test-time adaptation manner, using the minimization of model output entropy on unlabeled test samples as a surrogate objective without relying on original training data. In our experiments, we adopt the layer-wise AdaMerging variant, which independently learns a merging coefficient for each layer of every task vector and performs better than task-wise AdaMerging.

• EMR-Merging (Huang et al., 2024) is a tuning-free method that first elects a unified task vector by selecting the maximum absolute value of each parameter along the dominant sign direction, and then generates lightweight task-specific masks and rescalers to align the direction and magnitude of the unified model with each original specialist model at inference time.

• PCB-Merging (Du et al., 2024) is a training-free method that constructs a parameter competition balancing matrix through intra-balancing, which measures parameter significance within individual tasks, and inter-balancing, which assesses parameter similarity across tasks, then drops low-scoring parameters and rescales the remaining ones to form the final merged model.

• Consensus-Merging (Wang et al., 2024b) identifies and removes two classes of detrimental parameters, selfish weights that are critical exclusively to a single task and catastrophic weights that are irrelevant to all tasks, and retains only the consensus parameters that contribute positively to multi-task fusion.

• LS-Merge (Soro et al., 2026) encodes model weights into a latent space via a transformer-based VAE, performs merging through linear interpolation in that space, and decodes back to parameters.

• Layer-wise ES-Merging (Ours) applies only the layer-wise global coefficient α^l_{m_j}, derived from SWD-based embedding distribution shifts between the base and specialized models, as the merging coefficient.
• Element-wise ES-Merging (Ours) applies only the element-wise local coefficient β^{l,n}_{m_j}, derived from gradient-based parameter sensitivity scores with respect to fine-grained embedding distances, as the merging coefficient.

• ES-Merging (Ours) integrates both the global layer coefficient α^l_{m_i} and the local element coefficient β^{l,n}_{m_i} into a unified merging coefficient λ^{l,n}_{m_i} by multiplication and renormalization of elements across modalities.

A.3 Implementation Details

LoRA Configuration All specialized models used for merging share a unified LoRA (Hu et al., 2022) configuration with rank r = 8 and scaling factor α = 32, applied to all self-attention projection matrices (W_Q, W_K, W_V, W_O) and MLP projection layers (gate, up, and down projections) of every transformer block.

Details of ES-Merging To construct the probe inputs, we randomly sample 110 samples for each modality from a collection of test sets, yielding 330 samples in total. For the layer-wise merging coefficient computation, we leverage SWD with slice projection dimension 1,024 and distance order p = 2.0, then normalize via softmax with temperature τ = 0.5. For the element-wise merging coefficient computation, the softmax temperature τ is likewise set to 0.5.

Existing Merging Methods For all baseline methods, we follow the experimental settings reported in their respective original papers and official implementations. For LS-Merge (Soro et al., 2026), we train the VAE with sequence length 16,384, batch size 512, learning rate 3 × 10^-4 with 500-step warmup and cosine decay, and KL weight 10^-4 with adaptive adjustment toward a target KL of 50, for 1,000 epochs. At merge time, model parameters are split into chunks of size 16,384, encoded into the latent space via the VAE encoder, merged via a uniform-weighted Euclidean mean of the posterior means µ, and decoded back to parameter space.
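The SWD-plus-softmax recipe above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (Monte-Carlo random slices, equal-size point clouds, and a softmax taken over modalities for each layer); the function names are hypothetical and this is not the authors' implementation:

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=1024, p=2.0, seed=0):
    """Monte-Carlo sliced Wasserstein distance between two equally sized
    point clouds x, y of shape (n, d): project onto random unit directions,
    sort each 1-D projection, and average the p-th power differences."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(x.shape[1], n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)  # unit slice directions
    px = np.sort(x @ theta, axis=0)  # sorted 1-D projections per slice
    py = np.sort(y @ theta, axis=0)
    return float((np.abs(px - py) ** p).mean() ** (1.0 / p))

def layerwise_coefficients(swd_per_modality, tau=0.5):
    """Turn one layer's per-modality embedding shifts (SWDs between base and
    specialized model) into merging coefficients via a temperature softmax."""
    s = np.asarray(swd_per_modality, dtype=float) / tau
    e = np.exp(s - s.max())  # subtract max for numerical stability
    return e / e.sum()
```

Under this sketch, a modality whose probe embeddings shift more at a given layer receives a larger coefficient there, matching the intuition that strongly shifted layers carry modality-specific specialization.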
A.4 Prompt Setting for Evaluation

To impart task-specific knowledge, we provide 5-shot examples from the training set with the prompt templates in Table 9 and Table 10. The retrieval strategy is designed per task as follows:

Figure 7: Label distribution across datasets. Each panel shows positive and negative sample counts for the Molecule-Protein Interaction, CYP Inhibition, CYP Substrate, and Molecule-Cell Interaction tasks.

• Molecule-Protein Interaction: Training samples whose target protein sequence exactly matches the query are collected first. If more than five exact matches exist, the top-5 are selected by Tanimoto similarity (Willett et al., 1998) of Morgan fingerprints (Rogers and Hahn, 2010) between their molecules and the query molecule. If fewer than five are found, the remaining slots are filled with samples from the proteins with the highest cosine similarity of ESM2 protein embeddings to the query protein. Among candidates from the same similar protein, the one with the highest Tanimoto similarity to the query molecule is preferred.

• Drug-Cell Interaction: Training samples whose cell line shares an exact match on the top-50 expressed gene set are collected first. If fewer than five are found, additional samples are drawn from cell lines with the highest Jaccard similarity on gene sets. When multiple candidates exist from the same cell line, the one with the highest Tanimoto similarity to the query molecule is selected.
For DrugComb, molecule similarity is computed as the average Tanimoto similarity across both drugs.

• CYP Inhibition/Substrate: Since all samples share the same CYP enzyme target, protein-level filtering is unnecessary. The top-5 examples are selected solely by Tanimoto similarity between the training molecules and the query molecule.

A.5 Evaluation Metric

As shown in Figure 7, while some datasets exhibit a relatively balanced distribution between positive and negative samples, others such as BindingDB and CYP2D6 Inhibition display a pronounced class imbalance. In particular, the CYP Substrate task contains only approximately 130 samples per dataset, posing a critical challenge of absolute data scarcity. In such imbalanced, low-resource settings, relying solely on accuracy as an evaluation metric can be misleading, as a model that predominantly predicts the majority class may still achieve inflated scores without genuinely capturing minority-class patterns. We therefore use macro-F1, which addresses this limitation by computing the F1 score for each class independently and averaging the scores with equal weight regardless of class frequency, thereby providing a fair assessment of predictive performance across all classes.

B Qualitative Analysis

In this section, we qualitatively compare the generated responses of each task-specific fine-tuned model with those of ES-Merging, as shown in Table 5 and Table 6.

B.1 Molecule-Protein Interaction Prediction

Table 5 presents a generated response on the molecule-protein interaction prediction task from the Human dataset, where the molecule is thymine and the target protein is thymidine phosphorylase. Although both ES-Merging and the task-specific fine-tuned model correctly predict the label, the two responses differ substantially in quality.
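The macro-F1 described in A.5 can be written in a few lines of pure Python (a minimal sketch for intuition, not the paper's actual evaluation script):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class independently, then average
    the per-class scores with equal weight regardless of class frequency."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

On a 9:1 imbalanced set, a majority-class predictor reaches 0.9 accuracy but only about 0.47 macro-F1, which is exactly the inflation this metric guards against.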
ES-Merging correctly identifies thymine as a pyrimidine base found in DNA and a heterocyclic aromatic compound, then progressively reasons through its biological context, such as base pairing with adenine and its role in DNA synthesis, to conclude that thymine is likely to interact with the protein as a substrate for enzymes involved in thymidine metabolism. This can be interpreted as evidence that the chemical structure-level knowledge acquired from the molecule expert model and the biological function-level knowledge acquired from the protein expert model are organically integrated within a single model through embedding space signal-based merging. In contrast, the task-specific fine-tuned model outputs only the label "Interacts" without any biological rationale. Because it was trained on instruction data containing labels only, its predictions are less interpretable, whereas ES-Merging integrates modality-expert knowledge to produce interpretable reasoning even for unseen cross-modal tasks.

Table 5: Case study on Molecule-Protein Interaction prediction.

Molecule Name: Thymine
SMILES: CC1=CNC(=O)NC1=O
Protein: Thymidine phosphorylase
Protein Sequence: MAALMTPGTGAPPAPGDFSGEGSQGLPDPSPEPKQLPELIRMKRDGGRLSEADIRGFVAAVVNGSAQGAQIGAMLMAIRLRGMDLEETSVLTQALAQSGQQLEWPEAWRQQLVDKHSTGGVGDKVSLVLAPALAACGCKVPMISGRGLGHTGGTLDKLESIPGFNVIQSPEQMQVLLDQAGCCIVGQSEQLVPADGILYAARDVTATVDSLPLITASILSKKLVEGLSALVVDVKFGGAAVFPNQEQARELAKTLVGVGASLGLRVAAALTAMDKPLGRCVGHALEVEEALLCMDGAGPPDLRDLVTTLGGALLWLSGHAGTQAQGAARVAAALDDGSALGRFERMLAAQGVDPGLARALCSGSPAERRQLLPRAREQEELLAPADGTVELVRALPLALVLHELGAGRSRAGEPLRLGVGAELLVDVGQRLRRGTPWLRVHRDGPALSGPQSRALQEALVLSDRAPFAAPSPFAELVLPPQQ
User: Determine whether the given molecule interacts with the protein by following the example.
Ground Truth: Interacts

ES-Merging (Ours): The given molecule is thymine, a pyrimidine base found in DNA and RNA. It is a heterocyclic aromatic organic compound. The protein is not specified, but based on the examples, it is likely a DNA or RNA polymerase enzyme. Thymine is a nucleobase that pairs with adenine in DNA. It is also a substrate for thymidylate synthase, an enzyme that catalyzes the conversion of uracil to thymine in DNA synthesis. Given the information, it is likely that the molecule interacts with the protein, as thymine is a substrate for thymidylate synthase, an enzyme that catalyzes a critical step in DNA synthesis. Final answer: Interacts.

Models with Avg. + Task-Specific Finetuning: Final answer: Interacts

B.2 Molecule-Cell Interaction Prediction

Table 6 presents a qualitative analysis of the Drug-Cell Interaction prediction task from the GDSC2 dataset, where the drug is Gefitinib and the cell line is OVCA420.

ES-Merging first identifies the given molecule as a quinazoline derivative and correctly recognizes it as a potent inhibitor of the epidermal growth factor receptor (EGFR) tyrosine kinase. It also accurately identifies the given cell line as a cancer cell, then focuses on genes such as RPS6, EIF1, and GNB2L1 in the ranked gene-expression list to interpret the cell as actively proliferating and potentially dependent on EGFR signaling. On this basis, ES-Merging correctly recognizes that the drug's mechanism of action is effective against the given cell. This reasoning chain, spanning molecular substructure, drug class, target pathway, cell type identification, and cell-level interpretation, demonstrates that the structural knowledge from the molecule expert model and the transcriptomic knowledge from the cell expert model are functionally integrated through ES-Merging, while the model with task-specific finetuning again outputs only the label "Sensitive" without any biological rationale.
Drug Name: Gefitinib
SMILES: COC1=C(C=C2C(=C1)N=CN=C2NC3=CC(=C(C=C3)F)Cl)OCCCN4CCOCC4
Cell Line: OVCA420
Top 50 Genes: RPL35, FTL, MT-ND4, GAPDH, AKR1B10, UBC, RPS6, RPL24, EIF1, GNB2L1, SPP1, RPS11, AKR1C1, RPL19, GNG5, AKR1B10P1, ACTG1, MT-CO3, RPL5, FAU, RPS15A, TXN, RPL8, MIF, LDHB, KRT19, EIF4H, RPL11, ALDH1A1, MGST1, RPS12, CLIC1, SNRPD2, RPL30, H2AFX, NQO1, PRDX1, RPS5, RPLP1, H2AFZ, TUBB4B, FAUP1, NACA, RPL14, HIST1H2BK, MYL12B, SHFM1P1, HINT1P1, RPL27, HIST1H4C

User: Determine whether the given cell is sensitive or resistant to the given molecule based on the ranked gene-expression list.
Ground Truth: Sensitive

ES-Merging (Ours): The given molecule is 4-(3-fluoro-4-methoxyphenyl)-7-methyl-6-oxo-1,4-dihydroquinazoline-3-carboximidamide, which is a quinazoline derivative. This type of molecule is known to be a potent inhibitor of the epidermal growth factor receptor (EGFR) tyrosine kinase. The cell line provided is a cancer cell, and the presence of genes such as RPS6, EIF1, and GNB2L1 in the list suggests that the cell is actively proliferating and may be dependent on EGFR signaling for growth and survival. Given the presence of these genes and the fact that the molecule is an EGFR inhibitor, it is likely that the cell is sensitive to the molecule. Final answer: Sensitive

Models with Avg. + Task-Specific Finetuning: Final answer: Sensitive

Table 6: Case study on Molecule-Cell Interaction prediction.

C Further Analysis

In this section, we provide a detailed analysis of the merging coefficients used in ES-Merging, extended results of the ablation study in the main paper, and an analysis of embedding visualizations.

C.1 Merging Coefficients

C.1.1 Layer-wise Merging Coefficients

Size   Human   GDSC2   CYP2C9 Inh.   CYP2C9 Sub.
32     60.7    90.6    68.2          53.7
256    62.0    93.1    68.9          48.5
1024   62.0    94.1    72.5          64.2

Table 7: Ablation studies on the projection size of SWD for layer-wise ES-Merging across Molecule-Protein and Molecule-Cell Interaction tasks.

Projection Dimension Size of SWD. When computing SWD for the layer-wise merging coefficients, we vary the projection dimension size on representative datasets for each task and derive merging coefficients accordingly. As shown in Table 7, performance consistently improves as the projection size increases, with 1024 dimensions achieving the best results. This is because a larger number of projections enables a more precise approximation of the distributional differences in high-dimensional space, leading to a more accurate computation of the layer-wise merging coefficients. Based on this finding, we set the SWD projection size to 1024 in our main experiments.

C.1.2 Element-wise Merging Coefficients

The coefficient patterns differ across the q/k/v/o_proj modules even within the same layer. As shown in Figure 9, the q/k/v projection modules and the o projection module emphasize different elements even at Layer 0. Within the same module, the coefficient distributions between LoRA A and LoRA B exhibit distinct patterns. In particular, at Layer 0 of the q projection module, LoRA A shows relatively balanced coefficients across all modalities, whereas LoRA B contains regions where molecule and protein are more prominent. Moreover, the coefficient distributions are far from uniform, indicating that layer-wise global coefficients alone cannot capture the fine-grained specialization differences and supporting the necessity of element-wise local coefficients.

                        Molecule-Protein Interaction     Molecule-Cell Interaction   CYP Inhibition                                 CYP Substrate
Coefficient Type        BindingDB  BioSNAP  Human  Avg.  DrugComb  GDSC2  Avg.       CYP1A2  CYP2C19  CYP2C9  CYP2D6  CYP3A4  Avg.  CYP2C9  CYP2D6  CYP3A4  Avg.
Layer-wise              65.5       68.2     57.0   63.6  80.1      90.2   85.2       76.3    71.4     72.3    79.2    70.2    73.9  61.2    54.9    55.2    57.1
Element-wise            66.5       66.6     61.4   64.9  79.3      94.1   86.7       76.0    69.1     70.7    77.6    69.8    72.7  65.7    57.6    58.2    60.5
Layer × Element Mixed   66.0       69.1     62.0   65.7  80.7      94.1   87.4       77.4    70.6     72.5    80.7    71.3    74.5  64.2    60.9    60.5    61.9

Table 8: Full results of the ablation studies on merging coefficients. We report accuracy across all datasets and their average per task group. Bold indicates the best and underline indicates the second best.

Figure 8: Embedding visualization of the last transformer block for each specialized LLM and the merged model with our method for (a) molecule, (b) protein, and (c) cell tokens.

C.1.3 Final Coefficients of ES-Merging

The layer-wise global coefficient α^l_{m_i} and the element-wise local coefficient β^{l,n}_{m_i} focus on different regions. The layer-wise coefficient captures coarse-grained distributional shifts in the embedding space, while the element-wise coefficient reflects fine-grained importance at the individual parameter level. As formulated in Eq. 4, multiplying these two coefficients and normalizing amplifies regions where both coefficients assign high importance, while suppressing regions where only one side is high and the other is low. This enables merging that simultaneously incorporates global layer-level specialization signals and local element-level specialization signals.
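The pipeline sketched in Sections C.1.1 through C.1.3 can be illustrated in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the layer-wise coefficient is derived from a Monte-Carlo sliced Wasserstein distance over random unit projections, and that the final merge applies the multiplied-and-normalized coefficients to task vectors (expert weights minus base weights); the function names and the exact normalization are hypothetical.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=1024, seed=0):
    """Monte-Carlo sliced Wasserstein-1 distance between two embedding
    clouds x, y of shape (n_tokens, d), obtained from the same probe
    input.  With equal sample sizes, the 1-D W1 distance along each
    projection is the mean absolute difference of sorted projections."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(x.shape[1], n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)  # unit directions
    px, py = np.sort(x @ theta, axis=0), np.sort(y @ theta, axis=0)
    return float(np.abs(px - py).mean())

def merge_layer(base_w, expert_ws, alphas, betas, eps=1e-12):
    """Combine a scalar layer-wise coefficient (alpha) with an
    element-wise coefficient map (beta, same shape as the weight) for
    each expert: multiply, renormalize across experts (Eq. 4-style),
    then apply to the experts' task vectors (expert minus base)."""
    gamma = np.stack([a * b for a, b in zip(alphas, betas)])  # (M, *w)
    gamma = gamma / (gamma.sum(axis=0, keepdims=True) + eps)  # normalize over experts
    delta = sum(g * (w - base_w) for g, w in zip(gamma, expert_ws))
    return base_w + delta
```

Note that with equal alphas and uniform betas, merge_layer reduces to plain task-vector averaging, consistent with the view that ES-Merging generalizes uniform averaging by reweighting per layer and per element.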
As shown in Figure 10, compared to the element-wise-only coefficient distributions, the combined results demonstrate that the overall scale of the coefficients is adjusted according to the global importance of each layer, while the fine-grained element-level patterns are preserved.

C.2 Details of Ablation Study

Table 8 reports the detailed per-dataset results corresponding to the per-task-group averages in Table 3. Layer-wise ES-Merging and element-wise ES-Merging each dominate on different datasets. Despite these dataset-level variations, Layer × Element ES-Merging consistently achieves the best or comparable performance across all individual datasets. These results confirm that the two coefficients capture specialization at different granularities depending on task and dataset characteristics, and that their combination compensates for the weaknesses of either side, yielding consistent and robust merging performance at the individual-dataset level as well.

C.3 Embedding Visualization

Figure 8 visualizes the embedding distributions of each specialist model and of the model merged by ES-Merging at the last transformer block for each modality token. In each modality token space, the specialist models form distinct distributions, indicating that each model has learned modality-specific representations. The model merged by ES-Merging is positioned between the specialist distributions without being biased toward any particular specialist, while remaining relatively close to the distribution of the specialist corresponding to each modality token. This suggests that ES-Merging integrates the modality-specific knowledge of each specialist model in a balanced manner, while preserving the specialization for each modality.

D Limitation

In this work, we present ES-Merging, a novel MLLM merging framework that derives layer-wise and element-wise merging coefficients by leveraging embedding space signals.
Although we have demonstrated its effectiveness on the integration of MLLMs specialized in biochemical domains such as molecules, proteins, and single cells, we have not explored its applicability to more general multimodal domains such as video, image, and audio, due to the lack of cross-modal benchmarks in these domains. Since the core principle of our approach, leveraging embedding space signals, is inherently modality-agnostic, we believe that extending ES-Merging to such general multimodal scenarios is a promising direction and leave this exploration as future work. Additionally, further investigation is needed to verify whether ES-Merging can maximally preserve performance not only on cross-modal fusion tasks but also on the individual single-modality tasks of each specialist model.

Table 9: In-context learning prompt templates for the Molecule-Protein Interaction and CYP prediction tasks. Angle-bracketed tokens are replaced with the corresponding encoder embeddings, both for the target pair and for each of the k few-shot examples.
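The template structure described in the Table 9 caption can be illustrated with a small helper. The placeholder token names below are hypothetical (the actual angle-bracketed tokens were lost during extraction); the sketch only shows how a k-shot prompt interleaves per-example modality placeholders with the target pair before the placeholders are swapped for encoder embeddings.

```python
def build_icl_prompt(shot_answers, question):
    """Assemble a k-shot in-context prompt.  The text side carries only
    angle-bracketed placeholder tokens, which are later replaced by
    encoder embeddings.  Token names here are illustrative, not the
    paper's actual tokens."""
    lines = []
    for i, ans in enumerate(shot_answers, 1):
        lines.append(f"Molecule: <molecule_{i}> Protein: <protein_{i}>")
        lines.append(f"Answer: {ans}")
    lines.append("Molecule: <molecule> Protein: <protein>")
    lines.append(question)
    return "\n".join(lines)
```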
