ES-Merging: Biological MLLM Merging via Embedding Space Signals


Authors: Wonbin Lee, Dongki Kim, Sung Ju Hwang

Wonbin Lee*1  Dongki Kim*1  Sung Ju Hwang1,2
1 KAIST  2 DeepAuto.ai
{smilelwb01, cleverki, sungju.hwang}@kaist.ac.kr
* denotes equal contribution

Abstract

Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are specialized to a single modality, limiting their ability to solve inherently cross-modal scientific problems. While model merging is an efficient method to combine the different modalities into a unified MLLM, existing methods rely on input-agnostic parameter space heuristics that fail to faithfully capture modality specialization. To overcome this limitation, we propose a representation-aware merging framework that estimates merging coefficients from embedding space signals. We first design a probe input that consists of different modality tokens and forward it through each specialized MLLM to obtain layer-wise embedding responses that reflect modality-specific representation changes. We then estimate complementary merging coefficients at two granularities from the embedding space: layer-wise coefficients from coarse-grained signals and element-wise coefficients from fine-grained signals, which are jointly combined for robust coefficient estimation. Experiments on interactive effect prediction benchmarks show that our method outperforms existing merging methods and even surpasses task-specific fine-tuned models, establishing that embedding space signals provide a principled and effective foundation for cross-modal MLLM merging.

1 Introduction

Multimodal Large Language Models (MLLMs) have been emerging as crucial foundational models for scientific discovery, extending their perception to diverse biological modalities across molecules (Park et al., 2024; Kim et al., 2025), proteins (Abdine et al., 2024; Fei et al., 2025), and cells (Fang et al., 2025b; Rizvi et al., 2025). Despite their impressive progress within each modality, many scientific problems of interest are cross-modal, requiring an understanding of interactive effects such as protein-ligand interactions or drug effectiveness across cell types. However, existing biological MLLMs are specialized to a single modality, resulting in limited intersectional knowledge and unreliable cross-modal reasoning.

Building a unified model by jointly training on different modalities is a straightforward approach to upskill cross-modal understanding. However, it is impractical and time-consuming since constructing curated cross-modal instruction datasets in the scientific domain typically requires intensive labor and highly specific expertise to elaborate the underlying principles of complex interactions. As an alternative, model merging has gained attention by efficiently combining parameters of multiple specialized models. To retain their task-specific knowledge, existing methods (Yadav et al., 2023; Huang et al., 2024; Du et al., 2024) exploit parameter space signals, such as magnitudes, signs, and directions, heuristically assigning merging coefficients. However, parameter space heuristics are input-agnostic and therefore provide only weak, indirect proxies for modality-specific adaptation. Such input-blindness makes it difficult to isolate meaningful cross-modal interactions, failing to accurately combine these adaptations and severely degrading cross-modal merging.

Our main observation is that the input-aware embedding space contains the modality-specific information. As shown in Fig. 1, when molecular tokens are processed by different MLLMs, the hidden representations form clearly different embedding distributions. In particular, the molecule-specific LLM induces a more distinct distribution due to its modality-specific understanding.
Further, we measure the embedding distribution distance between the base LLM and each specialized MLLM under specialized and non-specialized token inputs.

Figure 1: Molecule token embedding visualization of the last transformer block for the base LLM and each specialized LLM (base, cell, protein, and molecule LLMs).

Figure 2: Layer-wise embedding distribution distance under specialized and non-specialized tokens using sliced Wasserstein distances (SWD) (Bonneel et al., 2015).

Fig. 2 shows that specialized inputs consistently produce larger embedding distribution distances than non-specialized inputs, suggesting that the embedding responses faithfully reflect modality-specific adaptation when the input matches the model's specialization. This observation motivates our central design principle: rather than relying solely on heuristic parameter space signals, we estimate merging coefficients based on the layer-wise embedding space signals.

Motivated by this observation, we propose embedding-signal-based MLLM merging (ES-Merging), a novel framework that moves the model merging paradigm from parameter space signals to embedding space signals. Our intuition is that the modality specialization of an MLLM can be measured by the differences in the embedding space between the base LLM and the MLLM, as shown in Fig. 1. To this end, we first design a probe input containing multimodal tokens of different modalities. By forwarding this input through each MLLM and the base LLM, we obtain layer-wise embeddings that reflect modality-specific representation changes across layers.

Based on these embeddings, we propose to compute merging coefficients at two complementary granularities: layer-wise and element-wise.
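The distribution distance used in Fig. 2 is the sliced Wasserstein distance, which compares two sets of embeddings through random one-dimensional projections. A minimal numpy sketch (the function name, shapes, and projection count are illustrative, not the paper's exact configuration):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=128, seed=0):
    """Approximate sliced Wasserstein-2 distance between point clouds
    X, Y of shape (n, d) via random 1-D projections (Bonneel et al., 2015)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    # Sorting each 1-D projection gives the optimal 1-D transport plan.
    px = np.sort(X @ theta.T, axis=0)
    py = np.sort(Y @ theta.T, axis=0)
    return np.sqrt(np.mean((px - py) ** 2))

X = np.random.default_rng(1).normal(size=(64, 16))
assert sliced_wasserstein(X, X) < 1e-8       # identical clouds: zero distance
assert sliced_wasserstein(X, X + 2.0) > 1.0  # shifted cloud: clearly separated
```

In practice the two point clouds would be the layer-wise token embeddings of the base LLM and a specialized MLLM.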
In the layer-wise manner, we compute a layer-level importance score by identifying layers where the embedding distributional shift grows the most, capturing coarse-grained specialization. In the element-wise manner, we estimate fine-grained importance by identifying parameters that most influence the representation shift. We then combine the coarse layer importance and fine element importance to produce final merging coefficients, which are used to fuse specialized models into a single unified MLLM. This design enables more robust and calibrated coefficient estimation by combining layer-level specialization with parameter-level sensitivity.

We validate the effectiveness of ES-Merging on interactive effect prediction tasks over diverse biological modalities by merging three different specialized MLLMs into a single unified model. Experimental results show that ES-Merging outperforms not only other model merging methods but also the task-specific fine-tuned models, empirically confirming that our representation-aware merging framework is crucial for obtaining the salient modality-specific signals. Further analyses show the effectiveness of our components: combining layer- and element-wise merging coefficients performs best compared to using only one, highlighting the necessity of integrating complementary specialization signals at different granularities.

2 Related Work

MLLMs for Scientific Discovery  Large language models (LLMs) (Touvron et al., 2023; Grattafiori et al., 2024; OpenAI, 2024a,b; Comanici et al., 2025) are increasingly being extended to scientific discovery through multimodal large language models (MLLMs) that incorporate diverse biological modalities, including molecules (Wang et al., 2025b), proteins (Xiao et al., 2025), and cells (Dip et al., 2025). In the protein domain, prior works have modeled amino-acid sequences (Xu et al., 2023; Pei et al., 2023) or jointly modeled sequences with structures (Guo et al., 2023; Abdine et al., 2024; Wang et al., 2024a; Xiao et al., 2024; Wang et al., 2025a; Fei et al., 2025). Single-cell LLMs learn scRNA-seq representations (Fang et al., 2025a; Kharouiche et al., 2025; Fang et al., 2025b) or further incorporate histology information (Li et al., 2025; Chen et al., 2025). On the other hand, molecular LLMs have been built on 1D string representations such as SMILES (Weininger, 1988) and SELFIES (Krenn et al., 2020) (Chithrananda et al., 2020; Edwards et al., 2022), 2D molecular graphs (Liu et al., 2023; Fang et al., 2023; Cao et al., 2023; Yu et al., 2024; Park et al., 2024), 3D structures (Li et al., 2024; Guo et al., 2024), and joint 2D-3D representations (Kim et al., 2025). Despite this progress, biological MLLMs are limited to a single modality, hindering their ability to solve cross-modal scientific problems.

Model Merging  Model merging aims to fuse knowledge from multiple specialist models with minimal additional data or training by combining them directly in parameter space. Existing methods mainly leverage parameter space signals to guide merging. Magnitude-based methods include Task Arithmetic (Ilharco et al., 2022), Consensus Merging (Wang et al., 2024b), and PCB-Merging (Du et al., 2024); sign-based methods include TIES-Merging (Yadav et al., 2023) and EMR-Merging (Huang et al., 2024); and LS-Merge (Soro et al., 2026) performs merging in a learned latent space over parameters. Beyond such static merging, test-time adaptation methods dynamically adjust coefficients using unlabeled test data, as in AdaMerging (Yang et al., 2023) and Twin-Merging (Lu et al., 2024). However, these methods still rely on parameter space signals, which are not well suited to capturing semantic discrepancies across heterogeneous modalities.
We instead propose an embedding space merging framework for MLLMs that derives merging coefficients from modality-aware representation signals.

3 Preliminary

We begin by formally describing MLLMs, then formulate model merging with LoRA.

Multimodal Large Language Model  MLLMs operate beyond the textual space by taking modality token embeddings together with text tokens. Concretely, let $\mathcal{M} = \{m_1, \ldots, m_K\}$ denote a set of modalities. For each modality $m_i$, a modality-specific encoder $f_{m_i}$ projects the raw modality input $x_{m_i}$ to a sequence of modality tokens: $H_{m_i} = f_{m_i}(x_{m_i})$. Then, an MLLM generates textual output $y$ by taking the concatenated textual and modality tokens: $y = g(H^0)$, where $H^0 = [H_{\text{text}}; H_{m_i}; \ldots]$. In this view, the core interface of an MLLM is the token-based embeddings: raw modality inputs are first mapped into sequences of vectors that are compatible with the LLM's embedding dimension, and then processed together by a single transformer.

LoRA Merging  In the MLLM setting, modality specialization is often implemented as parameter-efficient tuning with LoRA (Hu et al., 2022). Therefore, our merging method is based on merging LoRA parameters via weighted summation. Specifically, let $\theta_{m_i}$ denote the LoRA parameters of an MLLM specialized to modality $m_i$. We index the parameters by layer and weight element: $\theta^{l,n}_{m_i}$ denotes the $n$-th weight in the $l$-th layer of the modality-specific model for $m_i$. Then, LoRA merging is formalized as follows:

$$\theta^{l,n}_{\text{uni}} \leftarrow \sum_{m_i \in \mathcal{M}} \lambda^{l,n}_{m_i}\, \theta^{l,n}_{m_i}. \quad (1)$$

Under this formulation, the main challenge is to estimate appropriate merging coefficients $\lambda^{l,n}_{m_i}$ that determine how strongly each modality-specific LoRA parameter contributes to modality understanding. Our method addresses this coefficient estimation problem by deriving $\lambda^{l,n}_{m_i}$ from embedding space signals rather than parameter space statistics.
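Eq. (1) is an element-wise weighted sum over the experts' LoRA parameters. A minimal sketch of this weighted summation, assuming parameters stored as per-layer arrays (all names and shapes are hypothetical):

```python
import numpy as np

def merge_lora(expert_params, coeffs):
    """expert_params: dict modality -> {layer_name: ndarray};
    coeffs: dict modality -> {layer_name: ndarray of the same shape},
    playing the role of the element-wise coefficients lambda^{l,n}_{m_i}."""
    merged = {}
    for modality, layers in expert_params.items():
        for name, w in layers.items():
            # Accumulate each expert's element-wise weighted contribution.
            merged[name] = merged.get(name, 0.0) + coeffs[modality][name] * w
    return merged

# With uniform coefficients of 0.5 per element, merging two experts
# reduces to a simple average of their parameters.
w1 = {"layer0": np.ones((2, 2))}
w2 = {"layer0": 3 * np.ones((2, 2))}
uniform = {"layer0": 0.5 * np.ones((2, 2))}
out = merge_lora({"mol": w1, "prot": w2}, {"mol": uniform, "prot": uniform})
assert np.allclose(out["layer0"], 2.0)
```

The rest of the method is devoted to choosing these coefficients from embedding signals rather than fixing them uniformly.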
4 ES-Merging

We present Embedding-Signal-based MLLM Merging (ES-Merging), a representation-aware framework for merging modality-specialized MLLMs based on embedding space signals. Specifically, ES-Merging constructs probe inputs to elicit representational differences across models, and converts these embedding space discrepancies into layer-wise and element-wise merging coefficients.

4.1 Probe Input

As discussed in Section 3, we view the projected modality tokens as the core interface of the LLM backbones. In other words, once a raw modality input is mapped into the shared embedding space, the modality-specific knowledge of an MLLM is reflected in how the backbone interprets and transforms these token embeddings across layers. From this perspective, understanding modality specialization reduces to analyzing the layer-wise transformation of modality tokens in representation space.

Figure 3: Overview of ES-Merging. (a) Layer-wise global merging coefficients are computed from coarse-grained embedding signals: the layer-wise differences of distribution distances (SWD) between mean-pooled representations of the base LLM and a specialized MLLM. (b) Element-wise local merging coefficients are assigned based on fine-grained embedding signals by computing gradients from the embedding-wise distances.

Motivated by this view, we design probe inputs that explicitly expose token embeddings of different modalities, so that we can compare how the base LLM and each modality-specialized MLLM process the same modality embeddings.

Specifically, for each modality $m_i$, we collect a set of raw inputs $\{x^{(k)}_{m_i}\}_{k=1}^{K}$ from corresponding datasets and transform them via the modality-specific encoder $f_{m_i}$ into the modality token embeddings $\{H^{(k)}_{m_i}\}_{k=1}^{K}$, where $H^{(k)}_{m_i} = f_{m_i}(x^{(k)}_{m_i})$. Using these embeddings, we then build the probe input by concatenating a short textual prefix with all modality token blocks:

$$H^{0,(k)}_{\text{probe}} = [H_{\text{text},m_1}; H^{(k)}_{m_1}; H_{\text{text},m_2}; H^{(k)}_{m_2}; \ldots],$$

where $H_{\text{text},m_i}$ denotes the text-prefix embeddings associated with modality $m_i$ (e.g., the modality identifier or name). In the biological setting, the modality set $\mathcal{M}$ typically consists of {molecule, protein, cell}, and the resulting probe input is illustrated in Fig. 4.

By forwarding each probe input through the base LLM and each MLLM specialized to modality $m_j$, we obtain layer-wise embeddings by extracting the modality-$m_i$ tokens, denoted as $H^{l,(k)}_{m_i \to \text{base}}$ and $H^{l,(k)}_{m_i \to \theta_{m_j}}$ in $\mathbb{R}^{T_{m_i} \times d}$, where $l$ denotes the layer index and $T_{m_i}$ denotes the number of modality-$m_i$ tokens. These layer-wise representations serve as the foundation for our subsequent merging coefficient estimation. We regard the embedding discrepancy between the base LLM and each modality-specialized MLLM as an embedding space signal that reflects the degree of modality specialization. We then leverage this signal at two complementary levels: globally across layers, to estimate which layers contribute more to specialization, and locally within each layer, to identify which parameter elements are more strongly associated with the specialized transformation.

Figure 4: Prompt template of the probe input ("Molecule:", "Protein:", and "Cell:" prefixes followed by the corresponding modality tokens).

4.2 Layer-wise Global Merging Coefficient

To capture global modality specialization, we compute a layer-wise global merging coefficient from coarse-grained embedding signals (Fig. 3 (a)).

Coarse-grained Embedding Signal  We first summarize the modality-specific embeddings into a coarse-grained representation by averaging token-level embeddings, obtaining $\hat{h}^{l,(k)}_{m_i \to \text{base}} \in \mathbb{R}^d$ for the base model and $\hat{h}^{l,(k)}_{m_i \to \theta_{m_j}} \in \mathbb{R}^d$ for a specialized model $\theta_{m_j}$. By collecting these coarse-grained embeddings over the $K$ probe inputs, we obtain two layer-wise embedding sets, $H^l_{m_i \to \text{base}} \in \mathbb{R}^{K \times d}$ and $H^l_{m_i \to \theta_{m_j}} \in \mathbb{R}^{K \times d}$, which represent how the base and specialized models process modality-$m_i$ inputs at layer $l$. We then quantify their representational gap using the embedding distribution distance with sliced Wasserstein distance (SWD) (Bonneel et al., 2015):

$$\mathrm{SWD}^l_{m_i \to \theta_{m_j}} = \mathrm{SWD}\big(H^l_{m_i \to \text{base}},\; H^l_{m_i \to \theta_{m_j}}\big).$$

A larger value indicates that $\theta_{m_j}$ induces a stronger layer-wise shift from the base model when processing modality-$m_i$ tokens.

Layer-wise Importance Estimation  While $\mathrm{SWD}^l_{m_i \to \theta_{m_j}}$ measures the cumulative differences up to layer $l$, we are particularly interested in how much new modality-specific transformation is introduced at each layer. We therefore compute the layer-wise change:

$$d^l_{m_i \to \theta_{m_j}} = \mathrm{SWD}^l_{m_i \to \theta_{m_j}} - \mathrm{SWD}^{l-1}_{m_i \to \theta_{m_j}}.$$

If $d^l_{m_i \to \theta_{m_j}}$ is large, the corresponding layer of model $\theta_{m_j}$ contributes more strongly to modality-$m_i$-specific processing. Since the magnitude of these changes can vary across models, we normalize them over layers using Z-score normalization, obtaining $\hat{d}^l_{m_i \to \theta_{m_j}}$. We aggregate the normalized changes over all input modalities to obtain the layer-wise importance score of model $\theta_{m_j}$:

$$s^l_{\theta_{m_j}} = \sum_{m_i \in \mathcal{M}} \hat{d}^l_{m_i \to \theta_{m_j}}.$$
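The layer-wise pipeline above — per-layer SWD increments, Z-score normalization, and the subsequent softmax across models (Eq. 2) — can be sketched as follows; the SWD values and expert scores are illustrative placeholders, not measured quantities:

```python
import numpy as np

# Illustrative SWD values for one modality: swd[l] is the distribution
# distance between base and specialized embeddings at layer l.
swd = np.array([0.10, 0.15, 0.40, 0.45, 0.80])

# Per-layer increase d^l = SWD^l - SWD^{l-1}; the first layer is
# compared against a baseline of zero.
d = np.diff(swd, prepend=0.0)

# Z-score normalize over layers so scales are comparable across models;
# summing over input modalities (only one here) gives the score s^l.
s = (d - d.mean()) / d.std()

# Softmax across competing specialized models with temperature tau;
# the three expert score vectors below are toy stand-ins.
tau = 1.0
s_models = np.stack([s, 0.5 * s, -s])
alpha = np.exp(s_models / tau)
alpha /= alpha.sum(axis=0, keepdims=True)

assert np.argmax(s) == 4                    # largest new shift: last layer
assert np.allclose(alpha.sum(axis=0), 1.0)  # coefficients sum to 1 per layer
```

Each row of `alpha` would correspond to one specialized MLLM's layer-wise coefficient vector.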
Layer-wise Global Coefficient  Finally, we convert the layer-wise importance scores into merging coefficients by applying a softmax across models:

$$\alpha^l_{m_j} = \frac{\exp(s^l_{\theta_{m_j}}/\tau)}{\sum_{m \in \mathcal{M}} \exp(s^l_{\theta_m}/\tau)}. \quad (2)$$

As a result, a specialized MLLM receives a larger coefficient at layers where it contributes more strongly to coarse-grained representational changes over different modality inputs.

4.3 Element-wise Local Merging Coefficient

While the layer-wise global coefficient captures modality importance at the transformer-layer level, it assigns a uniform merging weight to all parameters within the same layer. To address this limitation, we further introduce an element-wise local merging coefficient derived from fine-grained embedding signals (Fig. 3 (b)).

Fine-grained Embedding Signals  To measure the fine-grained signals from embeddings, we first measure the distance of each embedding between the base model and a specialized model $\theta_{m_j}$, instead of using the coarse-grained representations:

$$r^{l,(k)}_{m_i \to \theta_{m_j}} = \big\| H^{l,(k)}_{m_i \to \text{base}} - H^{l,(k)}_{m_i \to \theta_{m_j}} \big\|_F,$$

where $\|\cdot\|_F$ denotes the Frobenius norm, the matrix analogue of the Euclidean $L_2$ norm for vectors. This distance measures how differently the specialized model $\theta_{m_j}$ processes modality-$m_i$ inputs relative to the base model at layer $l$ in a fine-grained manner.

Element-wise Importance Estimation  We then estimate the importance of each parameter element by measuring how sensitive this distance is to that element. Specifically, for the $n$-th parameter element in layer $l$, we accumulate the absolute gradient magnitude over all modalities and probe inputs:

$$s^{l,n}_{\theta_{m_j}} = \sum_{m_i \in \mathcal{M}} \sum_{k=1}^{K} \left| \frac{\partial\, r^{l,(k)}_{m_i \to \theta_{m_j}}}{\partial\, \theta^{l,n}_{m_j}} \right|.$$

A larger score indicates that the parameter element is more sensitive to the modality-specific differences between the base and specialized models.
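A sketch of this sensitivity accumulation, using a toy linear layer so the gradient of the Frobenius distance can be written analytically rather than via backpropagation (all shapes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))         # stands in for one LoRA weight matrix
H_base = rng.normal(size=(4, 8))    # base-model embeddings of probe tokens

score = np.zeros_like(W)
for _ in range(3):                  # loop over K probe inputs
    H = rng.normal(size=(4, 8))     # probe token embeddings
    R = H @ W - H_base              # residual of specialized vs. base response
    r = np.linalg.norm(R)           # Frobenius distance r^{l,(k)}
    grad = H.T @ R / r              # analytic gradient of r w.r.t. W
    score += np.abs(grad)           # accumulate |gradient| magnitudes

# Z-score normalize within the layer; a softmax across models (Eq. 3)
# would then turn these scores into element-wise coefficients.
score_hat = (score - score.mean()) / score.std()
assert score_hat.shape == (8, 8)
```

In the actual method the gradients flow through the full transformer, so an autograd framework would replace the analytic gradient line.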
Because the raw scores can vary across layers and parameters, we normalize them within each layer using Z-score normalization, yielding $\hat{s}^{l,n}_{\theta_{m_j}}$.

Element-wise Local Coefficient  We convert the normalized scores into element-wise merging coefficients by applying the softmax function:

$$\beta^{l,n}_{m_j} = \frac{\exp\big(\hat{s}^{l,n}_{\theta_{m_j}}/\tau\big)}{\sum_{m \in \mathcal{M}} \exp\big(\hat{s}^{l,n}_{\theta_m}/\tau\big)}. \quad (3)$$

The resulting element-wise merging coefficients selectively weight each parameter element according to its fine-grained sensitivity.

4.4 Integrating Layer- and Element-wise Merging Coefficients

The coarse-grained layer-wise coefficient $\alpha^l_{m_i}$ and the fine-grained element-wise coefficient $\beta^{l,n}_{m_i}$ each capture modality-specific importance at a different level of granularity. We combine them into a single coefficient by multiplying the layer-wise and element-wise coefficients and renormalizing across modalities:

$$\lambda^{l,n}_{m_i} = \frac{\alpha^l_{m_i} \cdot \beta^{l,n}_{m_i}}{\sum_{m \in \mathcal{M}} \alpha^l_m \cdot \beta^{l,n}_m}. \quad (4)$$

The final parameters of the merged model are computed by Eq. 1. By integrating coefficient types derived from different granularities of embedding signals, ES-Merging enables calibrated merging coefficients that preserve complementary modality expertise more faithfully, enabling robust cross-modal knowledge composition.

5 Experimental Results

5.1 Experimental Setup

Implementation Details  For merging, we leverage state-of-the-art MLLMs specialized to each modality: Mol-LLaMA (Kim et al., 2025) for the molecule modality, Prot2Text-V2 (Fei et al., 2025) for the protein modality, and Cell-o1 (Fang et al., 2025b) for the single-cell modality, whose base LLM is LLaMA-3.1-8B-Instruct (Grattafiori et al., 2024). The merging methods are applied across all modality-specific models, resulting in a unified MLLM. Since each downstream task requires distinct domain knowledge, we provide task-specific few-shot in-context examples to induce the appropriate task understanding. For a fair comparison, we use the same instruction template and in-context examples for all compared methods.

Table 1: Performance comparison of ES-Merging with the base LLM with and without task-specific fine-tuning, specialized MLLMs, and merging methods on the instance-varying interaction prediction tasks. We report accuracy and macro-F1 (Acc / F1) across each subset. Bold and underline indicate the best and second best performances, respectively.

| Method | BindingDB | BioSNAP | Human | Mol-Prot Avg | DrugComb | GDSC2 | Mol-Cell Avg |
|---|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct | 51.9 / 51.7 | 61.5 / 60.2 | 59.0 / 58.8 | 57.5 / 56.9 | 73.7 / 77.1 | 84.8 / 85.7 | 79.3 / 81.4 |
| Mol-LLaMA (Kim et al., 2025) | 55.8 / 54.7 | 66.5 / 64.0 | 61.5 / 61.4 | 61.2 / 60.0 | 64.1 / 80.4 | 55.2 / 82.3 | 59.7 / 81.4 |
| Prot2Text-V2 (Fei et al., 2025) | 59.2 / 59.2 | 55.3 / 54.1 | 47.2 / 47.2 | 53.9 / 53.5 | 65.6 / 78.4 | 59.7 / 67.0 | 62.6 / 72.7 |
| Cell-o1 (Fang et al., 2025b) | 53.5 / 53.1 | 59.3 / 59.8 | 49.1 / 48.5 | 54.0 / 53.8 | 76.3 / 77.2 | 85.9 / 86.0 | 81.1 / 81.6 |
| Avg. Merging | 65.3 / 64.9 | 66.4 / 66.5 | 60.9 / 60.9 | 64.2 / 64.1 | 72.5 / 74.6 | 85.4 / 87.9 | 78.9 / 81.2 |
| TIES-Merging (Yadav et al., 2023) | 60.8 / 60.8 | 62.7 / 61.6 | 58.6 / 58.6 | 60.7 / 60.3 | 74.7 / 77.0 | 85.9 / 87.1 | 80.3 / 82.1 |
| EMR-Merging (Huang et al., 2024) | 64.7 / 64.2 | 66.3 / 66.9 | 60.4 / 60.4 | 63.8 / 63.8 | 45.2 / 71.5 | 93.4 / 93.3 | 69.3 / 82.4 |
| AdaMerging (Yang et al., 2023) | 52.3 / 51.6 | 64.3 / 62.0 | 60.0 / 60.2 | 58.9 / 57.9 | 48.0 / 36.9 | 46.5 / 38.8 | 47.2 / 37.8 |
| PCB-Merging (Du et al., 2024) | 55.1 / 55.1 | 60.3 / 58.8 | 58.6 / 58.4 | 58.0 / 57.4 | 77.3 / 78.7 | 86.0 / 86.1 | 81.7 / 82.4 |
| Consensus Merging (Wang et al., 2024b) | 59.2 / 59.3 | 62.1 / 61.0 | 59.2 / 59.1 | 60.2 / 59.8 | 76.0 / 78.4 | 84.3 / 85.8 | 80.2 / 82.1 |
| LS-Merge (Soro et al., 2026) | 52.3 / 52.1 | 61.8 / 60.6 | 58.8 / 58.5 | 57.7 / 57.1 | 79.1 / 79.0 | 83.5 / 83.3 | 81.3 / 81.1 |
| Avg. Merging + FT | 60.5 / 59.7 | 55.8 / 55.9 | 57.2 / 57.3 | 57.8 / 57.6 | 81.1 / 80.8 | 94.0 / 93.9 | 87.5 / 87.4 |
| ES-Merging (Ours) | 66.0 / 65.3 | 69.1 / 68.4 | 62.0 / 61.9 | 65.7 / 65.2 | 80.7 / 80.2 | 94.1 / 94.0 | 87.4 / 87.1 |
For more implementation details, please refer to Appendix A.3 and A.4.

Baselines  We compare ES-Merging with the base LLM, MLLMs specialized to a single modality, and merging methods. As modality-specialized baselines, we consider Mol-LLaMA, Prot2Text-V2, and Cell-o1, each specialized for the molecule, protein, and single-cell modalities, respectively. We also compare ES-Merging with representative merging methods, including Average Merging, TIES-Merging (Yadav et al., 2023), EMR-Merging (Huang et al., 2024), layer-wise AdaMerging (Yang et al., 2023), PCB-Merging (Du et al., 2024), Consensus Merging (Wang et al., 2024b), LS-Merge (Soro et al., 2026), and a task-specific fine-tuned model after average merging, denoted as Avg. Merging + FT. For the base LLM and the specialized MLLMs, unsupported modalities are represented as textual inputs. Baseline details are provided in Appendix A.2.

5.2 Instance-varying Interaction Prediction

We first evaluate on instance-varying cross-modal interaction tasks, where the target counterpart changes across instances and the model should generalize across diverse cross-modal pairs.

Datasets  We consider two instance-varying cross-modal interaction settings: molecule-protein interaction and molecule-cell interaction. For molecule-protein interaction, the task is to predict whether a given molecule interacts with a given protein, including BindingDB, BioSNAP, and Human (Koh et al., 2023). For molecule-cell interaction, the task is to predict the effect of a molecule on a given cell, including DrugComb (Zagidullin et al., 2019) and GDSC2 (Chawla et al., 2022). In both settings, the molecule, protein, and cell counterparts change for each instance, requiring the model to generalize across diverse cross-modal combinations rather than relying on a fixed target identity. We provide a detailed explanation of the datasets in Appendix A.1.
Results  As shown in Table 1, ES-Merging consistently outperforms the merging baselines, suggesting that ES-Merging more effectively integrates complementary cross-modal knowledge from modality-specialized MLLMs, thus leading to stronger generalization when the interaction counterpart varies across instances. Notably, ES-Merging outperforms the task-specific fine-tuned model (Avg. Merging + FT) on the molecule-protein interaction tasks and shows comparable performance on the molecule-cell interaction tasks, indicating that ES-Merging can enhance cross-modal reasoning without further downstream fine-tuning. This superior performance of ES-Merging comes from preserving the reasoning capabilities of specialized MLLMs, as shown in Tables 5 and 6 in the Appendix. In contrast, task-specific fine-tuning tends to diminish these reasoning capabilities and can even degrade performance, particularly on the molecule-protein interaction tasks. On the other hand, existing merging baselines such as EMR-Merging often exhibit substantial instability across datasets, showing degraded performance on DrugComb. This implies that simple parameter space heuristics are insufficient for reliably combining heterogeneous multimodal experts, supporting that ES-Merging provides a more robust integration of modality-specific knowledge.

Table 2: Performance comparison of ES-Merging on the target-fixed functionality prediction tasks. We report accuracy and macro-F1 (Acc / F1) across each subset. Bold indicates the best and underline indicates the second best. "(S)" marks the CYP substrate subtasks.

| Method | CYP1A2 | CYP2C19 | CYP2C9 | CYP2D6 | CYP3A4 | Inhib. Avg | CYP2C9 (S) | CYP2D6 (S) | CYP3A4 (S) | Subs. Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct | 55.3 / 58.0 | 50.6 / 55.4 | 47.4 / 53.7 | 56.4 / 51.7 | 43.1 / 52.8 | 50.6 / 54.3 | 47.8 / 50.7 | 30.1 / 33.1 | 48.5 / 50.6 | 42.1 / 44.8 |
| Mol-LLaMA | 68.5 / 67.1 | 65.7 / 64.3 | 63.3 / 63.3 | 64.7 / 59.2 | 64.3 / 64.0 | 65.3 / 63.6 | 62.7 / 49.7 | 61.7 / 57.4 | 55.2 / 54.4 | 59.9 / 53.8 |
| Prot2Text-V2 | 59.4 / 55.8 | 56.8 / 52.5 | 67.4 / 60.1 | 76.3 / 55.9 | 59.6 / 50.3 | 63.9 / 54.9 | 54.5 / 48.6 | 53.4 / 52.0 | 56.0 / 54.2 | 54.6 / 51.6 |
| Cell-o1 | 53.5 / 58.3 | 44.9 / 51.9 | 43.1 / 52.8 | 46.9 / 46.3 | 43.6 / 51.6 | 46.4 / 52.2 | 36.6 / 41.7 | 36.1 / 39.9 | 48.5 / 55.0 | 40.4 / 45.6 |
| Avg. Merging | 61.2 / 60.5 | 54.1 / 53.5 | 52.7 / 53.8 | 53.8 / 49.6 | 54.6 / 55.1 | 55.2 / 54.5 | 40.3 / 39.6 | 42.4 / 41.3 | 49.3 / 45.8 | 44.0 / 42.2 |
| TIES-Merging | 71.8 / 71.8 | 67.8 / 67.9 | 66.5 / 65.0 | 76.1 / 64.0 | 65.3 / 66.3 | 69.5 / 67.0 | 56.0 / 46.8 | 55.6 / 53.2 | 54.5 / 53.7 | 55.4 / 51.2 |
| EMR-Merging | 70.4 / 70.5 | 64.8 / 65.4 | 66.8 / 66.3 | 74.2 / 61.8 | 66.1 / 66.0 | 68.5 / 66.0 | 60.5 / 50.6 | 55.6 / 52.9 | 51.5 / 49.4 | 55.9 / 51.0 |
| AdaMerging | 52.5 / 55.3 | 53.1 / 54.2 | 44.8 / 46.0 | 45.6 / 45.7 | 50.4 / 54.3 | 49.3 / 51.1 | 33.6 / 33.6 | 43.6 / 45.7 | 49.3 / 47.4 | 42.2 / 42.2 |
| PCB-Merging | 69.2 / 69.4 | 63.6 / 64.6 | 65.0 / 63.8 | 72.4 / 61.0 | 63.8 / 65.2 | 66.8 / 64.8 | 55.2 / 44.9 | 57.1 / 53.9 | 56.7 / 55.7 | 56.4 / 51.5 |
| Consensus Merging | 68.5 / 68.8 | 63.6 / 64.1 | 65.7 / 64.0 | 73.5 / 60.8 | 63.8 / 64.7 | 67.0 / 64.5 | 54.5 / 46.5 | 55.6 / 52.9 | 56.7 / 57.7 | 55.6 / 52.4 |
| LS-Merge | 59.1 / 58.0 | 56.4 / 56.0 | 56.2 / 54.5 | 61.0 / 51.3 | 55.5 / 53.5 | 57.6 / 54.7 | 51.3 / 45.7 | 44.1 / 41.9 | 48.7 / 46.3 | 48.0 / 44.6 |
| Avg. Merging + FT | 68.3 / 67.7 | 67.0 / 66.5 | 62.2 / 62.1 | 67.6 / 60.9 | 67.6 / 67.6 | 66.5 / 65.0 | 65.7 / 50.6 | 59.4 / 53.0 | 55.2 / 54.6 | 60.1 / 52.7 |
| ES-Merging (Ours) | 77.4 / 77.4 | 70.6 / 70.5 | 72.5 / 70.4 | 80.7 / 69.5 | 71.3 / 70.8 | 74.5 / 71.7 | 64.2 / 53.6 | 60.9 / 57.2 | 60.5 / 59.6 | 61.9 / 56.8 |

5.3 Target-fixed Functionality Prediction

We further evaluate on target-fixed cross-modal functionality prediction tasks, where each subtask is associated with a fixed target and requires target-specific biological functionality knowledge beyond the simple interaction matching of Section 5.2.

Datasets  To this end, we evaluate on CYP enzyme prediction.
Unlike the instance-varying setting above, each subtask is defined with respect to a fixed enzyme target, and the tasks predict the biological functionality of the given molecule with respect to that fixed target: inhibitory effects or substrate specificity. Specifically, we consider five CYP inhibition subtasks for the CYP1A2, CYP2C19, CYP2C9, CYP2D6, and CYP3A4 enzymes (Weiser et al., 2023), and three CYP substrate subtasks for the CYP2C9, CYP2D6, and CYP3A4 enzymes (Holmer et al., 2021), from the TDC dataset (Huang et al., 2021). These benchmarks therefore evaluate whether the merged model can capture not only detailed molecular structural variation under a shared target, but also biologically meaningful interaction types.

Results  As shown in Table 2, ES-Merging achieves the best average performance, suggesting that it more effectively preserves and integrates the expert knowledge required for target-specific functionality prediction. In contrast, on the CYP substrate tasks, most merging baselines underperform Mol-LLaMA, likely because these tasks rely heavily on molecular structural understanding under a fixed-target setting. Notably, ES-Merging achieves performance that is comparable to or better than Mol-LLaMA, suggesting that embedding-signal-based merging can not only integrate modality-specific expertise more effectively but also enhance cross-modal understanding.

5.4 Ablation Study

We further conduct an ablation study to examine the effectiveness of each coefficient type.
As shown in Table 3, using only one of the coefficient types still outperforms the other merging baselines, indicating that the embedding space signals provide salient and robust information for cross-modal merging compared to merging methods based on parameter space signals. On the other hand, combining the two coefficient types shows the best performance, enabling the capture of different granularities of the MLLMs' specialization.

Figure 5: Layer-wise merging coefficient visualization of each specialized MLLM (molecule, protein, and cell rows over layers 0-30), derived by Eq. 2. The l-th column corresponds to the merging coefficient of the l-th layer, $\alpha^l_{m_j}$.

Table 3: Ablation studies on merging coefficient types. We report average accuracy on each task group.

| Coefficient Type | Mol-Prot | Mol-Cell | CYP Inhib. | CYP Subs. |
|---|---|---|---|---|
| Layer-wise | 63.6 | 85.2 | 73.9 | 57.1 |
| Element-wise | 64.9 | 86.7 | 72.7 | 60.5 |
| Layer × Element | 65.7 | 87.4 | 74.5 | 61.9 |

Table 4: Total floating point operations (FLOPs) as a measure of the computational cost of determining the merging coefficients and LoRA module parameters.

| Method | Total FLOPs ↓ |
|---|---|
| Avg. Merging + FT | 891,117 |
| AdaMerging | 493,694 |
| ES-Merging (Ours) | 149,807 |

5.5 Merging Coefficient Analysis

We visualize the layer-wise merging coefficients in Fig. 5 and the element-wise merging coefficients in Fig. 9 of the Appendix.
The coefficients are distinctly distributed across the different MLLMs, showing that modality-specific knowledge is not incorporated uniformly throughout the model. Additionally, we observe that even when the layer-wise coefficient is high (e.g., the third layer of the molecule LLM), the element-wise merging coefficients within that layer vary substantially, as shown in Fig. 6. This result indicates that modality specialization emerges at multiple levels of granularity: only a few parameter elements within important layers are primarily salient for modality specialization. Therefore, combining layer-wise global coefficients with element-wise local coefficients leads to complementary merging at different granularities, achieving an accurate and robust merging of different modalities.

Figure 6: Computed element-wise merging coefficient visualization of each module (Q/K/V/O projections, LoRA A and LoRA B) in the third layer of the molecule LLM, derived by Eq. 3.

5.6 Computational Cost Comparison

We compare the computational cost with the baselines that require gradient computation and parameter updates. As shown in Table 4, the computational cost of ES-Merging is 3.4× and 6.1× lower than AdaMerging and Task-specific Fine-tuning, respectively, since ES-Merging requires gradient computation only once, while the other baselines iteratively compute gradients and update parameters. Despite its lower computational cost, ES-Merging outperforms Task-specific Finetuning and AdaMerging, indicating both the efficiency and the effectiveness of ES-Merging.

6 Conclusion

We present ES-Merging, a novel MLLM merging framework that shifts model merging from relying on parameter space signals to leveraging embedding space signals.
Our key insight is that input-aware representations encode rich modality-specific specialization, providing a more faithful basis for integrating modality-specialized MLLMs. Based on this observation, we introduce two complementary types of merging coefficients at different granularities: layer-wise global coefficients for capturing coarse-grained specialization and element-wise local coefficients for capturing fine-grained importance. Experiments on interactive effect prediction tasks demonstrate that ES-Merging consistently improves multimodal merging performance across diverse biological tasks, highlighting embedding space signals as a principled foundation for MLLM merging.

References

Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, and Michalis Vazirgiannis. 2024. Prot2Text: Multimodal protein's function generation with GNNs and transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10757–10765.

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. 2015. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45.

He Cao, Zijing Liu, Xingyu Lu, Yuan Yao, and Yu Li. 2023. InstructMol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint.

Saurabh Chawla, Anja Rockstroh, Melanie Lehman, and 1 others. 2022. Gene expression based inference of cancer drug sensitivity. Nature Communications, 13(1):5680.

Chi-Jane Chen, Yuhang Chen, Sukwon Yun, Natalie Stanley, and Tianlong Chen. 2025. Spatial coordinates as a cell language: A multi-sentence framework for imaging mass cytometry analysis. In Findings of the Association for Computational Linguistics: ACL 2025, pages 13241–13252.

Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. 2020. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. arXiv preprint.
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

Sajib Acharjee Dip, Adrika Zafor, Bikash Kumar Paul, Uddip Acharjee Shuvo, Muhit Islam Emon, Xuan Wang, and Liqing Zhang. 2025. LLM4Cell: A survey of large language and agentic models for single-cell biology. arXiv preprint arXiv:2510.07793.

Guodong Du, Junlin Lee, Jing Li, Runhua Jiang, Yifei Guo, Shuyang Yu, Hanting Liu, Sim K Goh, Ho-Kin Tang, Daojing He, and 1 others. 2024. Parameter competition balancing for model merging. Advances in Neural Information Processing Systems, 37:84746–84776.

Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. 2022. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 375–413.

Yin Fang, Xinle Deng, Kangwei Liu, Ningyu Zhang, Jingyang Qian, Penghui Yang, Xiaohui Fan, and Huajun Chen. 2025a. A multi-modal AI copilot for single-cell analysis with instruction following. arXiv preprint arXiv:2501.08187.

Yin Fang, Qiao Jin, Guangzhi Xiong, Bowen Jin, Xianrui Zhong, Siru Ouyang, Aidong Zhang, Jiawei Han, and Zhiyong Lu. 2025b. Cell-o1: Training LLMs to solve single-cell reasoning puzzles with reinforcement learning. arXiv preprint.

Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. 2023. Mol-Instructions: A large-scale biomolecular instruction dataset for large language models. arXiv preprint.

Xiao Fei, Michail Chatzianastasis, Sarah Almeida Carneiro, Hadi Abdine, Lawrence P Petalidis, and Michalis Vazirgiannis. 2025.
Prot2Text-V2: Protein function prediction with multimodal contrastive alignment. Advances in Neural Information Processing Systems.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint.

Han Guo, Mingjia Huo, Ruiyi Zhang, and Pengtao Xie. 2023. ProteinChat: Towards achieving ChatGPT-like functionalities on protein 3D structures. Authorea Preprints.

Shuhan Guo, Yatao Bian, Ruibing Wang, Nan Yin, Zhen Wang, and Quanming Yao. 2024. UniMoT: Unified molecule-text language model with discrete token representation. arXiv preprint.

Malte Holmer, Christina de Bruyn Kops, Conrad Stork, and Johannes Kirchmair. 2021. CYPstrate: A set of machine learning models for the accurate classification of cytochrome P450 enzyme substrates and non-substrates. Molecules, 26(15):4678.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, and 1 others. 2022. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3.

Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. 2024. EMR-Merging: Tuning-free high-performance model merging. Advances in Neural Information Processing Systems, 37:122741–122769.

Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. 2021. Therapeutics Data Commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548.

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2022. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.
Oussama Kharouiche, Aris Markogiannakis, Xiao Fei, Michail Chatzianastasis, and Michalis Vazirgiannis. 2025. Cell2Text: Multimodal LLM for generating single-cell descriptions from RNA-seq data. arXiv preprint arXiv:2509.24840.

Dongki Kim, Wonbin Lee, and Sung Ju Hwang. 2025. Mol-LLaMA: Towards general understanding of molecules in large molecular language model. Advances in Neural Information Processing Systems.

Huan Yee Koh, Anh T.N. Nguyen, Shirui Pan, Lauren T. May, and Geoffrey I. Webb. 2023. PSICHIC: Physicochemical graph neural network for learning protein-ligand interaction fingerprints from sequence data. bioRxiv.

Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. 2020. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024.

Longyi Li, Liyan Dong, Hao Zhang, Dong Xu, and Yongli Li. 2025. spaLLM: Enhancing spatial domain analysis in multi-omics data through large language model integration. Briefings in Bioinformatics, 26(4):bbaf304.

Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, and Qi Tian. 2024. Towards 3D molecule-text interpretation in language models. arXiv preprint arXiv:2401.13923.

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, and 1 others. 2022. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022:500902.

Zhiyuan Liu, Sihang Li, Yanchen Luo, Hao Fei, Yixin Cao, Kenji Kawaguchi, Xiang Wang, and Tat-Seng Chua. 2023. MolCA: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, and Yu Cheng. 2024. Twin-Merging: Dynamic integration of modular expertise in model merging. Advances in Neural Information Processing Systems, 37:78905–78935.

OpenAI. 2024a. GPT-4 technical report. arXiv:2303.08774.

OpenAI. 2024b. GPT-4o system card. arXiv:2410.21276.

Jinyoung Park, Minseong Bae, Dohwan Ko, and Hyunwoo J Kim. 2024. LLaMo: Large language model-based molecular graph assistant. Advances in Neural Information Processing Systems, 37:131972–132000.

Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, and Rui Yan. 2023. BioT5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1102–1123.

Syed Asad Rizvi, Daniel Levine, Aakash Patel, Shiyang Zhang, Eric Wang, Curtis Jamison Perry, Nicole Mayerli Constante, Sizhuang He, David Zhang, Cerise Tang, and 1 others. 2025. Scaling large language models for next-generation single-cell analysis. bioRxiv, pages 2025–04.

David Rogers and Mathew Hahn. 2010. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754.

Bedionita Soro, Aoxuan Silvia Zhang, Bruno Andreis, Jaehyeong Jo, Song Chong, and Sung Ju Hwang. 2026. LS-Merge: Merging language models in latent space. In The Fourteenth International Conference on Learning Representations.

Hugo Touvron and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.

Chao Wang, Hehe Fan, Ruijie Quan, and Yi Yang. 2024a. ProtChatGPT: Towards understanding proteins with large language models. arXiv preprint arXiv:2402.09649.

Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, and Pascal Frossard. 2024b. Localizing task information for improved model merging and compression.
arXiv preprint arXiv:2405.07813.

Zhicong Wang, Zicheng Ma, Ziqiang Cao, Changlong Zhou, Jun Zhang, and Yi Qin Gao. 2025a. Prot2Chat: Protein large language model with early fusion of text, sequence, and structure. Bioinformatics, 41(8):btaf396.

Ziqing Wang, Kexin Zhang, Zihan Zhao, Yibo Wen, Abhishek Pandey, Han Liu, and Kaize Ding. 2025b. A survey of large language models for text-guided molecular discovery: From molecule generation to optimization. arXiv preprint.

David Weininger. 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36.

Benjamin Weiser, Jérôme Genzling, Mihai Burai-Patrascu, Ophélie Rostaing, and Nicolas Moitessier. 2023. Machine learning-augmented docking. 1. CYP inhibition prediction. Digital Discovery, 2:1841–1849.

Peter Willett, John M Barnard, and Geoffrey M Downs. 1998. Chemical similarity searching. Journal of Chemical Information and Computer Sciences, 38(6):983–996.

Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, and Wei Wang. 2024. ProteinGPT: Multimodal LLM for protein property prediction and structure understanding. arXiv preprint arXiv:2408.11363.

Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, and 1 others. 2025. Protein large language models: A comprehensive survey. arXiv preprint arXiv:2502.17504.

Minghao Xu, Xinyu Yuan, Santiago Miret, and Jian Tang. 2023. ProtST: Multi-modality learning of protein sequences and biomedical texts. In International Conference on Machine Learning, pages 38749–38767. PMLR.

Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. 2023. TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36:7093–7115.
Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. 2023. AdaMerging: Adaptive model merging for multi-task learning. arXiv preprint.

Botao Yu, Frazier N. Baker, Ziqi Chen, Xia Ning, and Huan Sun. 2024. LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv preprint arXiv:2402.09391.

Bulat Zagidullin, Jehad Aldahdooh, Shuyu Zheng, Wenyu Wang, Yinyin Wang, Joseph Saad, Alina Malyutina, Mohieddin Jafari, Ziaurrehman Tanoli, Alberto Pessia, and Jing Tang. 2019. DrugComb: An integrative cancer drug combination data portal. Nucleic Acids Research, 47(W1):W43–W51.

Organization

The Appendix is organized as follows: In Section A, we provide detailed experimental settings, including datasets, baselines, implementation, and prompt setting details. In Section B, we qualitatively analyze generated responses. In Section C, we provide additional analysis of our merging method. In Section D, we discuss the limitations of our work.

A Experiment Settings

A.1 Dataset Details

• BindingDB is a drug-target interaction dataset derived from experimentally measured small molecule-protein binding data, typically filtered to human proteins, and is used as a binary classification benchmark with 11,054 samples for predicting whether a drug-target pair interacts.

• BioSNAP is a drug-target interaction dataset derived from known associations between US-marketed drugs and their human protein targets, and is used as a binary classification benchmark with 6,058 samples for predicting whether a drug-target pair interacts.

• Human is a drug-target interaction dataset consisting of drug-human protein pairs with highly credible negative samples, and is used as a binary classification benchmark with 1,375 samples for predicting whether a drug-target pair interacts.
• DrugComb is a drug combination-cell line interaction dataset derived from standardized and harmonized drug combination screening studies across various cancer cell lines, and is used as a binary classification benchmark with 3,631 samples for predicting whether a combination of two drugs produces a synergistic or antagonistic anti-cancer effect in a given cancer cell line.

• GDSC2 is a drug-cell line interaction dataset derived from the Genomics of Drug Sensitivity in Cancer project, which screens over 1,000 genetically characterized human cancer cell lines with a wide range of anti-cancer therapeutics using a newer cell screening platform introduced in 2015, and is used as a binary classification benchmark with 843 samples for predicting whether a given cancer cell line is sensitive or resistant to a specific drug.

• CYP1A2, CYP2C19, CYP2C9, CYP2D6, and CYP3A4 Inhibition are binary classification benchmarks for predicting whether a given drug inhibits a specific cytochrome P450 enzyme isoform involved in drug metabolism, consisting of 2,516, 2,533, 2,418, 2,626, and 2,466 samples, respectively.

• CYP2C9, CYP2D6, and CYP3A4 Substrate are three binary classification benchmarks for predicting whether a given drug is a substrate of a specific cytochrome P450 enzyme isoform involved in drug metabolism, with 134, 133, and 134 samples, respectively.

A.2 Baseline Details

We compare our method against three modality-specialized base models, one task-specific fine-tuned model, six existing model merging baselines, and three ablation variants of our proposed method.

• Mol-LLaMA (Kim et al., 2025) is a large molecular language model trained via multi-modal instruction tuning that integrates complementary 2D and 3D molecular encoders through a blending module, with a Q-Former projector and a LLaMA backbone fine-tuned with LoRA, providing general molecular understanding with explainability and reasoning capabilities across diverse molecular tasks.

• Prot2Text-V2 (Fei et al., 2025) is a multimodal sequence-to-text model that combines an ESM2-3B (Lin et al., 2022) protein sequence encoder with a LLaMA-3.1-8B-Instruct decoder via a nonlinear modality projector, using hybrid sequence-level contrastive alignment learning and instruction-based LoRA fine-tuning to generate rich functional descriptions of proteins directly from amino acid sequences.

• Cell-o1 (Fang et al., 2025b) is a cell line-specialized model fine-tuned on transcriptomic omics data, capable of handling cell line-related drug response tasks but limited to single-modality inference without understanding molecules or proteins.

• Average Merging directly averages the parameter elements of all specialist MLLMs without any additional computation.

• Average Merging + Task-Specific Finetune constructs a base model by averaging the LoRA weights of all three specialist models, and subsequently fine-tunes the LoRA parameters independently on each evaluation task by sampling 2,000 examples from the training dataset for 10 epochs with a batch size of 16.

• TIES-Merging (Yadav et al., 2023) resolves interference among task vectors through three steps (trim, elect sign, and disjoint merge), which prune low-magnitude parameters, resolve sign conflicts by selecting the dominant direction, and merge only the parameter-aligned subset.

• AdaMerging (Yang et al., 2023) learns merging coefficients for task vectors in a test-time adaptation manner, using the minimization of model output entropy on unlabeled test samples as a surrogate objective without relying on original training data. In our experiments, we adopt the layer-wise AdaMerging variant, which independently learns a merging coefficient for each layer of every task vector and performs better than task-wise AdaMerging.

• EMR-Merging (Huang et al., 2024) is a tuning-free method that first elects a unified task vector by selecting the maximum absolute value of each parameter along the dominant sign direction, and then generates lightweight task-specific masks and rescalers to align the direction and magnitude of the unified model with each original specialist model at inference time.

• PCB-Merging (Du et al., 2024) is a training-free method that constructs a parameter competition balancing matrix through intra-balancing, which measures parameter significance within individual tasks, and inter-balancing, which assesses parameter similarity across tasks, then drops low-scoring parameters and rescales the remaining ones to form the final merged model.

• Consensus-Merging (Wang et al., 2024b) identifies and removes two classes of detrimental parameters, selfish weights that are critical exclusively to a single task and catastrophic weights that are irrelevant to all tasks, and retains only the consensus parameters that contribute positively to multi-task fusion.

• LS-Merge (Soro et al., 2026) encodes model weights into a latent space via a transformer-based VAE, performs merging through linear interpolation in that space, and decodes back to parameters.

• Layer-wise ES-Merging (Ours) applies only the layer-wise global coefficient α^l_{m_j}, derived from SWD-based embedding distribution shifts between the base and specialized models, as the merging coefficient.
• Element-wise ES-Merging (Ours) applies only the element-wise local coefficient β^{l,n}_{m_j}, derived from gradient-based parameter sensitivity scores with respect to fine-grained embedding distances, as the merging coefficient.

• ES-Merging (Ours) integrates both the global layer coefficient α^l_{m_i} and the local element coefficient β^{l,n}_{m_i} into a unified merging coefficient λ^{l,n}_{m_i} by multiplication and renormalization of elements across modalities.

A.3 Implementation Details

LoRA Configuration All specialized models used for merging share a unified LoRA (Hu et al., 2022) configuration with rank r = 8 and scaling factor α = 32, applied to all self-attention projection matrices (W_Q, W_K, W_V, W_O) and MLP projection layers (gate, up, and down projections) of every transformer block.

Details of ES-Merging To construct the probe inputs, we randomly sample 110 samples for each modality from a collection of test sets, yielding 330 samples in total. For the layer-wise merging coefficient computation, we leverage SWD with slice projection dimension 1,024 and distance order p = 2.0, then normalize via softmax with temperature τ = 0.5. For the element-wise merging coefficient computation, the softmax temperature τ is likewise set to 0.5.

Existing Merging Methods For all baseline methods, we follow the experimental settings reported in their respective original papers and official implementations. For LS-Merge (Soro et al., 2026), we train the VAE with sequence length 16,384, batch size 512, learning rate 3 × 10^-4 with 500-step warmup and cosine decay, and KL weight 10^-4 with adaptive adjustment toward a target KL of 50, for 1,000 epochs. At merge time, model parameters are split into chunks of size 16,384, encoded into the latent space via the VAE encoder, merged via a uniform-weighted Euclidean mean of the posterior means µ, and decoded back to parameter space.
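The SWD-plus-softmax recipe above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (Monte-Carlo random slices, equal-size point clouds, and a softmax taken over modalities for each layer); the function names are hypothetical and this is not the authors' implementation:

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=1024, p=2.0, seed=0):
    """Monte-Carlo sliced Wasserstein distance between two equally sized
    point clouds x, y of shape (n, d): project onto random unit directions,
    sort each 1-D projection, and average the p-th power differences."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(x.shape[1], n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)  # unit slice directions
    px = np.sort(x @ theta, axis=0)  # sorted 1-D projections per slice
    py = np.sort(y @ theta, axis=0)
    return float((np.abs(px - py) ** p).mean() ** (1.0 / p))

def layerwise_coefficients(swd_per_modality, tau=0.5):
    """Turn one layer's per-modality embedding shifts (SWDs between base and
    specialized model) into merging coefficients via a temperature softmax."""
    s = np.asarray(swd_per_modality, dtype=float) / tau
    e = np.exp(s - s.max())  # subtract max for numerical stability
    return e / e.sum()
```

Under this sketch, a modality whose probe embeddings shift more at a given layer receives a larger coefficient there, matching the intuition that strongly shifted layers carry modality-specific specialization.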
A.4 Prompt Setting for Evaluation

To impart task-specific knowledge, we provide 5-shot examples from the training set with the prompt templates in Table 9 and Table 10. The retrieval strategy is designed per task as follows:

Figure 7: Label distribution across datasets. Each panel shows positive and negative sample counts for the Molecule-Protein Interaction, CYP Inhibition, CYP Substrate, and Molecule-Cell Interaction tasks.

• Molecule-Protein Interaction: Training samples whose target protein sequence exactly matches the query are collected first. If more than five exact matches exist, the top-5 are selected by Tanimoto similarity (Willett et al., 1998) of Morgan fingerprints (Rogers and Hahn, 2010) between their molecules and the query molecule. If fewer than five are found, the remaining slots are filled with samples from the proteins with the highest cosine similarity of ESM2 protein embeddings to the query protein. Among candidates from the same similar protein, the one with the highest Tanimoto similarity to the query molecule is preferred.

• Drug-Cell Interaction: Training samples whose cell line shares an exact match on the top-50 expressed gene set are collected first. If fewer than five are found, additional samples are drawn from cell lines with the highest Jaccard similarity on gene sets. When multiple candidates exist from the same cell line, the one with the highest Tanimoto similarity to the query molecule is selected.
For DrugComb, molecule similarity is computed as the average Tanimoto similarity across both drugs.

• CYP Inhibition/Substrate: Since all samples share the same CYP enzyme target, protein-level filtering is unnecessary. The top-5 examples are selected solely by Tanimoto similarity between the training molecules and the query molecule.

A.5 Evaluation Metric

As shown in Figure 7, while some datasets exhibit a relatively balanced distribution between positive and negative samples, others such as BindingDB and CYP2D6 Inhibition display a pronounced class imbalance. In particular, the CYP Substrate task contains only approximately 130 samples per dataset, posing a critical challenge of absolute data scarcity. In such imbalanced, low-resource settings, relying solely on accuracy as an evaluation metric can be misleading, as a model that predominantly predicts the majority class may still achieve inflated scores without genuinely capturing minority-class patterns. We therefore use macro-F1, which addresses this limitation by computing the F1 score for each class independently and averaging the scores with equal weight regardless of class frequency, thereby providing a fair assessment of predictive performance across all classes.

B Qualitative Analysis

In this section, we qualitatively compare the generated responses of each task-specific fine-tuned model with those of ES-Merging, as shown in Table 5 and Table 6.

B.1 Molecule-Protein Interaction Prediction

Table 5 presents a generated response on the molecule-protein interaction prediction task from the Human dataset, where the molecule is thymine and the target protein is thymidine phosphorylase. Although both ES-Merging and the task-specific fine-tuned model correctly predict the label, the two responses differ substantially in quality.
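The macro-F1 described in A.5 can be written in a few lines of pure Python (a minimal sketch for intuition, not the paper's actual evaluation script):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class independently, then average
    the per-class scores with equal weight regardless of class frequency."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

On a 9:1 imbalanced set, a majority-class predictor reaches 0.9 accuracy but only about 0.47 macro-F1, which is exactly the inflation this metric guards against.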
ES-Merging correctly identifies thymine as a pyrimidine base found in DNA and a heterocyclic aromatic compound, then progressively reasons through its biological context, such as base pairing with adenine and its role in DNA synthesis, to conclude that thymine is likely to interact with the protein as a substrate for enzymes involved in thymidine metabolism. This can be interpreted as evidence that the chemical structure-level knowledge acquired from the molecule expert model and the biological function-level knowledge acquired from the protein expert model are organically integrated within a single model through embedding space signal-based merging. In contrast, the task-specific fine-tuned model outputs only the label "Interacts" without any biological rationale. Because it was trained on instruction data containing labels only, its predictions are less interpretable, whereas ES-Merging integrates modality-expert knowledge to produce interpretable reasoning even for unseen cross-modal tasks.

Table 5: Case study on Molecule-Protein Interaction prediction.

Molecule Name: Thymine
SMILES: CC1=CNC(=O)NC1=O
Protein: Thymidine phosphorylase
Protein Sequence: MAALMTPGTGAPPAPGDFSGEGSQGLPDPSPEPKQLPELIRMKRDGGRLSEADIRGFVAAVVNGSAQGAQIGAMLMAIRLRGMDLEETSVLTQALAQSGQQLEWPEAWRQQLVDKHSTGGVGDKVSLVLAPALAACGCKVPMISGRGLGHTGGTLDKLESIPGFNVIQSPEQMQVLLDQAGCCIVGQSEQLVPADGILYAARDVTATVDSLPLITASILSKKLVEGLSALVVDVKFGGAAVFPNQEQARELAKTLVGVGASLGLRVAAALTAMDKPLGRCVGHALEVEEALLCMDGAGPPDLRDLVTTLGGALLWLSGHAGTQAQGAARVAAALDDGSALGRFERMLAAQGVDPGLARALCSGSPAERRQLLPRAREQEELLAPADGTVELVRALPLALVLHELGAGRSRAGEPLRLGVGAELLVDVGQRLRRGTPWLRVHRDGPALSGPQSRALQEALVLSDRAPFAAPSPFAELVLPPQQ
User: Determine whether the given molecule interacts with the protein by following the example.
Ground Truth: Interacts

ES-Merging (Ours): The given molecule is thymine, a pyrimidine base found in DNA and RNA. It is a heterocyclic aromatic organic compound. The protein is not specified, but based on the examples, it is likely a DNA or RNA polymerase enzyme. Thymine is a nucleobase that pairs with adenine in DNA. It is also a substrate for thymidylate synthase, an enzyme that catalyzes the conversion of uracil to thymine in DNA synthesis. Given the information, it is likely that the molecule interacts with the protein, as thymine is a substrate for thymidylate synthase, an enzyme that catalyzes a critical step in DNA synthesis. Final answer: Interacts.

Models with Avg. + Task-Specific Finetuning: Final answer: Interacts

B.2 Molecule-Cell Interaction Prediction

Table 6 presents a qualitative analysis of the Drug-Cell Interaction prediction task from the GDSC2 dataset, where the drug is Gefitinib and the cell line is OVCA420.

ES-Merging first identifies the given molecule as a quinazoline derivative and correctly recognizes it as a potent inhibitor of the epidermal growth factor receptor (EGFR) tyrosine kinase. It also accurately identifies the given cell line as a cancer cell, then focuses on genes such as RPS6, EIF1, and GNB2L1 in the ranked gene-expression list to interpret the cell as actively proliferating and potentially dependent on EGFR signaling. On this basis, ES-Merging correctly recognizes that the drug's mechanism of action is effective against the given cell. This reasoning chain, spanning molecular substructure, drug class, target pathway, cell type identification, and cell-level interpretation, demonstrates that the structural knowledge from the molecule expert model and the transcriptomic knowledge from the cell expert model are functionally integrated through ES-Merging, while the model with task-specific finetuning again outputs only the label "Sensitive" without any biological rationale.
Drug Name: Gefitinib
SMILES: COC1=C(C=C2C(=C1)N=CN=C2NC3=CC(=C(C=C3)F)Cl)OCCCN4CCOCC4
Cell Line: OVCA420
Top 50 Genes: RPL35, FTL, MT-ND4, GAPDH, AKR1B10, UBC, RPS6, RPL24, EIF1, GNB2L1, SPP1, RPS11, AKR1C1, RPL19, GNG5, AKR1B10P1, ACTG1, MT-CO3, RPL5, FAU, RPS15A, TXN, RPL8, MIF, LDHB, KRT19, EIF4H, RPL11, ALDH1A1, MGST1, RPS12, CLIC1, SNRPD2, RPL30, H2AFX, NQO1, PRDX1, RPS5, RPLP1, H2AFZ, TUBB4B, FAUP1, NACA, RPL14, HIST1H2BK, MYL12B, SHFM1P1, HINT1P1, RPL27, HIST1H4C

User: Determine whether the given cell is sensitive or resistant to the given molecule based on the ranked gene-expression list.
Ground Truth: Sensitive

ES-Merging (Ours): The given molecule is 4-(3-fluoro-4-methoxyphenyl)-7-methyl-6-oxo-1,4-dihydroquinazoline-3-carboximidamide, which is a quinazoline derivative. This type of molecule is known to be a potent inhibitor of the epidermal growth factor receptor (EGFR) tyrosine kinase. The cell line provided is a cancer cell, and the presence of genes such as RPS6, EIF1, and GNB2L1 in the list suggests that the cell is actively proliferating and may be dependent on EGFR signaling for growth and survival. Given the presence of these genes and the fact that the molecule is an EGFR inhibitor, it is likely that the cell is sensitive to the molecule. Final answer: Sensitive

Models with Avg. + Task-Specific Finetuning: Final answer: Sensitive

Table 6: Case study on Molecule-Cell Interaction prediction.

C Further Analysis

In this section, we provide a detailed analysis of the merging coefficients used in ES-Merging, extended results of the ablation study in the main paper, and an analysis of embedding visualizations.

C.1 Merging Coefficients

C.1.1 Layer-wise Merging Coefficients

Size   Human   GDSC2   CYP2C9 Inh.   CYP2C9 Sub.
32     60.7    90.6    68.2          53.7
256    62.0    93.1    68.9          48.5
1024   62.0    94.1    72.5          64.2

Table 7: Ablation studies on the projection size of SWD for layer-wise ES-Merging across Molecule-Protein and Molecule-Cell Interaction tasks.

Projection Dimension Size of SWD. When computing SWD for the layer-wise merging coefficients, we vary the projection dimension size on representative datasets for each task and derive merging coefficients accordingly. As shown in Table 7, performance consistently improves as the projection size increases, with 1024 dimensions achieving the best results. This is because a larger number of projections enables a more precise approximation of the distributional differences in high-dimensional space, leading to a more accurate computation of the layer-wise merging coefficients. Based on this finding, we set the SWD projection size to 1024 in our main experiments.

C.1.2 Element-wise Merging Coefficients

The coefficient patterns differ across the q/k/v/o_proj modules even within the same layer. As shown in Figure 9, the q/k/v projection modules and the o projection module emphasize different elements even at Layer 0. Within the same module, the coefficient distributions between LoRA A and LoRA B exhibit distinct patterns. In particular, at Layer 0 of the q projection module, LoRA A shows relatively balanced coefficients across all modalities, whereas LoRA B contains regions where molecule and protein are more prominent. Moreover, the coefficient distributions are far from uniform, indicating that layer-wise global coefficients alone cannot capture the fine-grained specialization differences and supporting the necessity of element-wise local coefficients.

                        Molecule-Protein Interaction     Molecule-Cell Interaction   CYP Inhibition                                 CYP Substrate
Coefficient Type        BindingDB  BioSNAP  Human  Avg.  DrugComb  GDSC2  Avg.       CYP1A2  CYP2C19  CYP2C9  CYP2D6  CYP3A4  Avg.  CYP2C9  CYP2D6  CYP3A4  Avg.
Layer-wise              65.5       68.2     57.0   63.6  80.1      90.2   85.2       76.3    71.4     72.3    79.2    70.2    73.9  61.2    54.9    55.2    57.1
Element-wise            66.5       66.6     61.4   64.9  79.3      94.1   86.7       76.0    69.1     70.7    77.6    69.8    72.7  65.7    57.6    58.2    60.5
Layer × Element Mixed   66.0       69.1     62.0   65.7  80.7      94.1   87.4       77.4    70.6     72.5    80.7    71.3    74.5  64.2    60.9    60.5    61.9

Table 8: Full results of the ablation studies on merging coefficients. We report accuracy across all datasets and their average per task group. Bold indicates the best and underline indicates the second best.

Figure 8: Embedding visualization of the last transformer block for each specialized LLM and the merged model with our method for (a) molecule, (b) protein, and (c) cell tokens.

C.1.3 Final Coefficients of ES-Merging

The layer-wise global coefficient α^l_{m_i} and the element-wise local coefficient β^{l,n}_{m_i} focus on different regions. The layer-wise coefficient captures coarse-grained distributional shifts in the embedding space, while the element-wise coefficient reflects fine-grained importance at the individual parameter level. As formulated in Eq. 4, multiplying these two coefficients and normalizing amplifies regions where both coefficients assign high importance, while suppressing regions where only one side is high and the other is low. This enables merging that simultaneously incorporates global layer-level specialization signals and local element-level specialization signals.
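The pipeline sketched in Sections C.1.1 through C.1.3 can be illustrated in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the layer-wise coefficient is derived from a Monte-Carlo sliced Wasserstein distance over random unit projections, and that the final merge applies the multiplied-and-normalized coefficients to task vectors (expert weights minus base weights); the function names and the exact normalization are hypothetical.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=1024, seed=0):
    """Monte-Carlo sliced Wasserstein-1 distance between two embedding
    clouds x, y of shape (n_tokens, d), obtained from the same probe
    input.  With equal sample sizes, the 1-D W1 distance along each
    projection is the mean absolute difference of sorted projections."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(x.shape[1], n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)  # unit directions
    px, py = np.sort(x @ theta, axis=0), np.sort(y @ theta, axis=0)
    return float(np.abs(px - py).mean())

def merge_layer(base_w, expert_ws, alphas, betas, eps=1e-12):
    """Combine a scalar layer-wise coefficient (alpha) with an
    element-wise coefficient map (beta, same shape as the weight) for
    each expert: multiply, renormalize across experts (Eq. 4-style),
    then apply to the experts' task vectors (expert minus base)."""
    gamma = np.stack([a * b for a, b in zip(alphas, betas)])  # (M, *w)
    gamma = gamma / (gamma.sum(axis=0, keepdims=True) + eps)  # normalize over experts
    delta = sum(g * (w - base_w) for g, w in zip(gamma, expert_ws))
    return base_w + delta
```

Note that with equal alphas and uniform betas, merge_layer reduces to plain task-vector averaging, consistent with the view that ES-Merging generalizes uniform averaging by reweighting per layer and per element.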
As shown in Figure 10, compared to the element-wise-only coefficient distributions, the combined results demonstrate that the overall scale of the coefficients is adjusted according to the global importance of each layer, while the fine-grained element-level patterns are preserved.

C.2 Details of Ablation Study

Table 8 reports the detailed per-dataset results corresponding to the per-task-group averages in Table 3. Layer-wise ES-Merging and element-wise ES-Merging each dominate on different datasets. Despite these dataset-level variations, Layer × Element ES-Merging consistently achieves the best or comparable performance across all individual datasets. These results confirm that the two coefficients capture specialization at different granularities depending on task and dataset characteristics, and that their combination compensates for the weaknesses of either side, yielding consistent and robust merging performance at the individual-dataset level as well.

C.3 Embedding Visualization

Figure 8 visualizes the embedding distributions of each specialist model and of the model merged by ES-Merging at the last transformer block for each modality token. In each modality token space, the specialist models form distinct distributions, indicating that each model has learned modality-specific representations. The model merged by ES-Merging is positioned between the specialist distributions without being biased toward any particular specialist, while remaining relatively close to the distribution of the specialist corresponding to each modality token. This suggests that ES-Merging integrates the modality-specific knowledge of each specialist model in a balanced manner, while preserving the specialization for each modality.

D Limitation

In this work, we present ES-Merging, a novel MLLM merging framework that derives layer-wise and element-wise merging coefficients by leveraging embedding space signals.
Although we have demonstrated its effectiveness on the integration of MLLMs specialized in biochemical domains such as molecules, proteins, and single cells, we have not explored its applicability to more general multimodal domains such as video, image, and audio, due to the lack of cross-modal benchmarks in these domains. Since the core principle of our approach, leveraging embedding space signals, is inherently modality-agnostic, we believe that extending ES-Merging to such general multimodal scenarios is a promising direction and leave this exploration as future work. Additionally, further investigation is needed to verify whether ES-Merging can maximally preserve performance not only on cross-modal fusion tasks but also on the individual single-modality tasks of each specialist model.

Table 9: In-context learning prompt templates for the Molecule-Protein Interaction and CYP prediction tasks. Angle-bracketed tokens are replaced with the corresponding encoder embeddings, both for the target pair and for each of the k few-shot examples.
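The template structure described in the Table 9 caption can be illustrated with a small helper. The placeholder token names below are hypothetical (the actual angle-bracketed tokens were lost during extraction); the sketch only shows how a k-shot prompt interleaves per-example modality placeholders with the target pair before the placeholders are swapped for encoder embeddings.

```python
def build_icl_prompt(shot_answers, question):
    """Assemble a k-shot in-context prompt.  The text side carries only
    angle-bracketed placeholder tokens, which are later replaced by
    encoder embeddings.  Token names here are illustrative, not the
    paper's actual tokens."""
    lines = []
    for i, ans in enumerate(shot_answers, 1):
        lines.append(f"Molecule: <molecule_{i}> Protein: <protein_{i}>")
        lines.append(f"Answer: {ans}")
    lines.append("Molecule: <molecule> Protein: <protein>")
    lines.append(question)
    return "\n".join(lines)
```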
