Differentiable Thermodynamic Phase-Equilibria for Machine Learning



Karim K. Ben Hicham(1), Moreno Ascani(1), Jan G. Rittig(1), and Alexander Mitsos*(1,2,3)

(1) RWTH Aachen University, Process Systems Engineering (AVT.SVT), Aachen, Germany
(2) Forschungszentrum Jülich GmbH, Institute of Climate and Energy Systems ICE-1: Energy Systems Engineering, Jülich, Germany
(3) JARA-ENERGY, Aachen, Germany

*Email: amitsos@alum.mit.edu

Abstract

Accurate prediction of phase equilibria remains a central challenge in chemical engineering. Physics-consistent machine learning methods that incorporate thermodynamic structure into neural networks have recently shown strong performance for activity-coefficient modeling. However, extending such approaches to equilibrium data arising from an extremum principle, such as liquid-liquid equilibria, remains difficult. Here we present DISCOMAX, a differentiable algorithm for phase-equilibrium calculation that guarantees thermodynamic consistency at both training and inference, subject only to a user-specified discretization. The method is rooted in statistical thermodynamics and works via a discrete enumeration with subsequent masked-softmax aggregation of feasible states, together with a straight-through gradient estimator to enable physics-consistent end-to-end learning of neural excess Gibbs energy (g^E) models. We evaluate the approach on binary liquid-liquid equilibrium data and demonstrate that it outperforms existing surrogate-based methods, while offering a general framework for learning from different kinds of equilibrium data.

Introduction

Prediction of accurate phase equilibria of multicomponent mixtures is a ubiquitous problem with far-reaching implications in engineering and scientific fields.
Applications of phase-equilibrium thermodynamics encompass, e.g., the design of unit operations such as distillation, extraction, crystallization, or reactive separations [1–3], the development of formulations for pharmaceuticals [4, 5], the study of geological processes [6], and even biology and the life sciences [7–9]. The prediction of phase equilibria has traditionally relied on g^E models such as Wilson [10], NRTL [11], and UNIQUAC [12], which are computationally efficient and can model a wide variety of systems, but require extensive parameter fitting to experimental data. Group-contribution methods such as UNIFAC [13] and quantum-chemistry-based approaches such as COSMO-RS [14, 15] extend predictive capabilities beyond fitted systems, yet remain limited by applicability, accuracy, and/or computational cost. More recently, machine-learning models have achieved high predictive accuracy for activity coefficients and mixture nonidealities [16–18], but often at the expense of thermodynamic consistency. Hybrid and physics-informed ML approaches have therefore been proposed to embed thermodynamic structure directly into learning architectures [19–24]; see also our perspective article [25]. After first presenting a soft loss to promote thermodynamic consistency without relying on rigid g^E models [26], we introduced a thermodynamically consistent GNN architecture that predicts composition-dependent activity coefficients by introducing a differentiable latent variable for g^E(x), thus enforcing Gibbs-Duhem consistency [24]. This was further extended by [27] to also ensure that g^E vanishes for pure components. Enforcing thermodynamic consistency in such a way has by now been used and discussed in multiple follow-up works [23, 28].
When modeling activity coefficients or vapor-liquid equilibria, this g^E-based architecture is fully sufficient, as the required thermodynamic properties depend only on the mole fraction x*, the derivative dΔg^E/dx evaluated at x*, and Δg^E(x*) itself. These derivatives can be obtained directly through automatic differentiation, enabling end-to-end training without any additional numerical solvers. However, the situation changes fundamentally when the quantity of interest is itself defined implicitly as the solution of an equation, or even as the minimizer of a constrained optimization problem, especially in the presence of multiple solutions. In such cases, the learning task naturally assumes the structure of a bilevel optimization problem [29], formally given by

min_θ L(y*, ŷ*(θ))   s.t.   ŷ*(θ) ∈ arg min_{y ∈ Y} G(y; θ),

where the upper-level objective depends on the solution of a lower-level problem. Here, L(y*, ŷ*(θ)) denotes the prediction error on the measurements y*. The model parameters θ affect the thermodynamic model and, as such, the optimal solution of the lower-level problem. The function G in the lower-level problem typically corresponds to determining the equilibrium state y of a system by minimizing an appropriate thermodynamic potential, either directly or via specialized reformulations. The calculation of phase equilibria is a classic problem in computational thermodynamics, and a wide range of methods has been developed for both phase-equilibrium calculation and model parameter estimation (see, e.g., [30] for a comprehensive overview). Rigorous approaches, in particular, typically rely on global optimization to ensure that the thermodynamic extremum condition (e.g., the global minimum of the Gibbs energy at given T, p, and overall composition z̄) is satisfied.
These methods are implemented either as large single-level formulations or as specialized formulations [31–34]. Discretization-based methods, such as the Convex Envelope Method (CEM) [35, 36], provide alternative deterministic strategies and operate by constructing a discretized convex envelope of the Gibbs energy landscape over the entire composition space. Although both continuous optimization formulations and discretization-based convexification methods are, in principle, capable of identifying the thermodynamically correct equilibrium state, the resulting solution mappings are generally not readily differentiable, making gradient computation challenging. Consequently, integrating such solvers directly into gradient-based learning frameworks is difficult, and training thermodynamically consistent models on phase-equilibrium data, such as equilibrium compositions of multiphase systems, requires solving a lower-level optimization problem for each data point at every training epoch and differentiating through its solution. For learning model parameters in an end-to-end fashion from pure-component saturation pressures and liquid densities, an approach is presented by [37]. In their approach, the authors maintain differentiability by using the model (PC-SAFT, implemented in the FeOs calculation framework) only to obtain fully converged equilibrium states, and then performing a single Newton update, implemented in PyTorch, on top of the converged solution, which reconstructs the implicit equilibrium dependence in a form compatible with automatic differentiation. In the field of materials science, [38] gave a perspective on learning the parameters of a thermodynamic potential from phase equilibria and thermochemical properties, however only for fitting a single system. Similar to [21], an internal iteration step is required to solve for the isopotential condition.
Recently, [39] introduced a machine-learning (ML) g^E model that can be trained not only on vapor-liquid equilibrium (VLE) and activity-coefficient (AC) data, but also on liquid-liquid equilibrium (LLE) data. Their approach follows a sequential strategy: (i) LLE phase-equilibrium data are generated using a semi-empirical thermodynamic model for binary mixtures over a range of temperatures; (ii) a neural-network surrogate is trained on this pseudo-data to predict equilibrium phase compositions; (iii) the trained surrogate solver is frozen and employed during training of the machine-learned g^E model to provide differentiable equilibrium predictions. While this surrogate-based approach by [39] greatly simplifies the problem of learning LLE behavior, it also introduces two critical limitations in our view. First, the surrogate solver is thermodynamically inconsistent: a learned direct mapping from discretized g^E or Gibbs energy of mixing (Δg_mix) values to phase compositions does not guarantee satisfaction of any of the following fundamental requirements: (i) mass balance, (ii) global optimality of Δg_mix, and (iii) stability conditions, even in the limit of arbitrarily fine discretization. Second, the surrogate training data are, in our opinion, still biased towards the test set because they are generated from the mod. UNIFAC model, whose parameterization was fitted using many of the molecules that appear in the test set. As a result, prior information from the test domain is implicitly leaked into the surrogate solver, even if the surrogate solver is not directly trained on the test set itself. In this work, we propose a differentiable multi-phase equilibrium algorithm which can be integrated into ML model training in an end-to-end fashion.
It ensures thermodynamic consistency by treating the constrained minimization of Δg_mix as a partially masked classification problem, where the Δg_mix of a given state is interpreted as the logarithm of an unnormalized probability (logit), and the corresponding normalized state probabilities are obtained via a softmax operator. The predictions of the solver are inherently thermodynamically consistent, i.e., they simultaneously guarantee satisfaction of (i) mass balance, (ii) global optimality of Δg_mix, and (iii) stability conditions, subject only to a user-specified discretization. Furthermore, we highlight its natural connection to statistical thermodynamics. In contrast to previous surrogate-based approaches [39, 40], we do not need any synthetic data. Rather, our approach enables direct learning of the physical relationships, without introducing additional neural-network approximation error. This makes the method significantly simpler and, at the same time, more general than previous work. Nonetheless, the work by [39] indicates that using a trained surrogate neural network in place of a solver works well for binary systems in practice, and we will use it as the main baseline in this work. We avoid the pitfalls of local optimization methods for this highly nonlinear problem by enumerating a discrete grid over the full feasible space. While this approach scales exponentially with the number of components, it remains practical since most available data involve binary mixtures. The calculations are implemented with batched tensor operations in PyTorch and can be run fully on the GPU. We demonstrate our method on a newly generated dataset, which contains around 8,000 unique binary LLE systems, and benchmark our method against the recently proposed surrogate solver by [39].
Beyond the significant qualitative and practical improvements of our algorithm, we show a reduction of approximately 11% in cross-validation MAE compared to the surrogate approach.

Relation to previous work

Relation to Machine Learning

In modern deep learning, it is increasingly common to encounter learning problems in which parts of the computational pipeline are inherently non-differentiable [41]. Typical examples include operations such as selecting a maximizer [42], sorting [43], sampling elements from a discrete set [44, 45], or locating active pixels in an image for a given task [46]. To address the broad class of discrete optimization problems that arise in these contexts, a substantial body of work in differentiable programming has focused on replacing non-differentiable operations with smooth, continuous relaxations that permit gradient-based optimization. An introduction to the field is given in [47]. Almost all proposed strategies are probabilistic in nature and rely on a neural-network score model to predict parameters λ ∈ Λ of a distribution function p_λ, employing this function to reconstruct a continuous output ŷ by inference of its expectation or by sampling [47]. Usually, the employed distribution function p_λ makes use of a temperature parameter τ that controls the sharpness of the distribution: larger values of τ yield smoother distributions, while in the limit τ → 0 the distribution collapses to the discrete (and non-differentiable) formulation. Another conceptually related approach is energy-based models (EBMs) [48, 49]. These models learn a scalar energy function E(X, Y) that assigns low energy to compatible input-output configurations (e.g., an image and its correct class label) and high energy to incompatible ones [50]. Prediction is then performed by minimizing this energy, often using approximate optimization or sampling-based inference methods.
Relation to Statistical Thermodynamics

Phase-equilibrium thermodynamics is fundamentally probabilistic [51]: any macroscopic observable O at equilibrium corresponds to an ensemble average over a large number of accessible microscopic states i. Under conditions of constant temperature T, this average is given by the Boltzmann distribution:

⟨O⟩ = Σ_i O(E_i) p(E_i) = [Σ_i O(E_i) e^(−E_i/(k_B T))] / [Σ_j e^(−E_j/(k_B T))],

where E_i denotes the energy of state i, k_B is the Boltzmann constant, and T is the temperature. Here, p(E_i) is the probability that the system occupies the energy state E_i, given by

p(E_i) = e^(−E_i/(k_B T)) / Σ_j e^(−E_j/(k_B T)).

In the thermodynamic limit, the resulting probability distribution p(E_i) ∝ exp(−E_i/(k_B T)) becomes sharply peaked around the equilibrium state, which justifies the classical treatment of thermodynamic equilibrium as the solution of a deterministic minimization problem. This probabilistic structure provides a natural theoretical foundation for introducing smooth relaxations of equilibrium calculations. The softmax function, which rescales the n elements x̂_1, ..., x̂_n of an array into the range (0, 1) and is defined as [41, 52]

softmax(−x̂_i) = e^(−x̂_i) / Σ_j e^(−x̂_j),

is mathematically equivalent to a Boltzmann-weighted distribution over discrete energy states and therefore admits a direct physical interpretation when applied to discretized thermodynamic potentials. Introducing a temperature-like parameter τ (which, in the Boltzmann distribution, corresponds to the k_B T term) controls the sharpness of this distribution and recovers the classical equilibrium solution in the limit τ → 0, while yielding a regular and differentiable mapping for finite τ.
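This equivalence is straightforward to verify numerically. The following minimal sketch (with hypothetical toy energies of our choosing) shows that the softmax of −E/τ reproduces Boltzmann weights with k_B·T = τ, and collapses onto the minimum-energy state as τ → 0:

```python
import torch

def boltzmann_weights(energies, tau):
    """Softmax over -E/tau is exactly a Boltzmann distribution with k_B*T = tau."""
    return torch.softmax(-torch.as_tensor(energies) / tau, dim=0)

energies = [1.0, 0.2, 0.5]                     # hypothetical discrete state energies
p_warm = boltzmann_weights(energies, tau=1.0)  # smooth distribution
p_cold = boltzmann_weights(energies, tau=1e-3) # nearly all mass on the minimum
# As tau -> 0, the distribution collapses onto the minimum-energy state (index 1).
```

Lower-energy states always receive larger weight; the temperature only controls how strongly the minimum dominates.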
In our view, the statistical-mechanics interpretation of the phase-equilibrium problem is complementary to the direct application of relaxing discrete variables in order to obtain meaningful gradients. In S1 in the SI, we show that the proposed equilibrium solver can be derived from thermodynamics by considering composition fluctuations over a finite set of admissible states at fixed temperature T, pressure p, and overall composition z. Therein, we derive and discuss two equivalent formulations of our approach. Furthermore, we provide specific case studies using the algorithm to compute pure-component vapor pressure as well as a ternary LLLE of an ionic-liquid system in S2 in the SI.

Methods

The following section presents our end-to-end framework for thermodynamically consistent phase-equilibrium prediction, detailing both the neural Δg_mix model and the proposed differentiable equilibrium solver. The overall approach is illustrated in Figure 1.

Neural Parameterization of the Gibbs Energy of Mixing

Our model predicts LLE compositions by learning a parametric representation of the Gibbs energy of mixing Δg_mix for binary liquid systems. Each molecule pair (mol_1 and mol_2) is encoded using a graph neural network (GNN), which maps the corresponding molecular graphs to continuous embeddings, h_1 = GNN(mol_1; θ), h_2 = GNN(mol_2; θ). While any encoder, such as simple fingerprints or transformers, would also work, we use the standard Chemprop D-MPNN, which constructs a directed line graph where bonds are nodes and atoms are edges [53]. The embeddings h_1, h_2 are concatenated to form a mixture embedding M and passed to a neural network g̃^E(·; θ), which models the excess Gibbs energy of mixing at composition x before scaling with x(1 − x). As we focus exclusively on isothermal binary mixtures at fixed pressure, the only relevant state variable is the composition x ∈ [0, 1].
To ensure thermodynamic consistency at the pure-component boundaries, we multiply the network output by the product of the liquid-phase compositions, such that

Δg_mix(x; θ) = x(1 − x) g̃^E(x; M, θ) + Δg_id(x),

where g̃^E is the unrestricted network prediction. The term x(1 − x) guarantees Δg_mix(0) = Δg_mix(1) = 0, as required for binary mixtures [27, 38, 54]. In the definition of Δg_mix(x; θ), the dependence on the mixture embedding M is omitted for notational clarity.

Figure 1: Overview of the proposed end-to-end thermodynamically consistent learning framework. Molecular graphs of the two mixture components are encoded using a graph neural network to form a joint mixture embedding M. The neural model parameterizes the Gibbs energy of mixing Δg_mix(x; M), which is evaluated on a discretized composition grid subject to the feed composition z. Feasible phase-split candidates (x′, x′′) are identified via mass-balance constraints and combined to compute the total Gibbs energy of the mixture. In the forward pass, the equilibrium compositions are obtained via a hard minimization of Δg_mix. During training, gradients are propagated through a Boltzmann-weighted softmax relaxation using a straight-through estimator, enabling differentiable, thermodynamically consistent learning.
Equilibrium Prediction via Discretized Gibbs Energy Minimization

The condition of thermodynamic equilibrium (number and composition of the coexisting phases) of a liquid system at given temperature T and pressure p is dictated by the global minimum of Δg_mix subject to mass balance for the given overall composition z. For a binary system, the two equilibrium compositions (x′*, x′′*) satisfy

(x′*, x′′*) ∈ arg min_{x′, x′′} [ Δg_mix(x′; θ) φ′ + Δg_mix(x′′; θ) φ′′ ],   (1)

where x′ and x′′ denote the compositions of phases one and two, respectively, and the molar phase fractions φ′, φ′′ are given by the lever rule:

φ′ = (x′′ − z) / (x′′ − x′),   φ′′ = (z − x′) / (x′′ − x′).

In the continuous setting, the equilibrium problem in Eq. (1) constitutes a nonconvex constrained optimization problem, since Δg_mix(x; θ) may exhibit multiple local minima corresponding to metastable phase states. Computing the equilibrium state therefore requires identifying the global minimizer, which is possible (see, e.g., [34]), but computationally expensive and difficult to integrate into training. While local optimization methods could in principle be employed, this would not be without problems: differentiating through a local solver requires either implicit differentiation or unrolled optimization, both of which introduce additional implementation complexity, computational overhead, and additional hyperparameters, i.e., starting values and number of iterations. More importantly, local solvers may converge only to stationary points rather than the global optimum, and their convergence behavior can change abruptly under small perturbations of θ. This yields incorrect equilibrium predictions and thus wrong and strongly varying gradients, which might destabilize training.
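The lever rule for the phase fractions amounts to two lines of code; a minimal sketch (the function name is ours), with a worked example:

```python
def phase_fractions(z, x1, x2):
    """Lever rule for a binary phase split with x1 < z < x2."""
    phi1 = (x2 - z) / (x2 - x1)   # molar fraction of the phase at x1
    phi2 = (z - x1) / (x2 - x1)   # molar fraction of the phase at x2
    return phi1, phi2

phi1, phi2 = phase_fractions(z=0.4, x1=0.1, x2=0.9)
# phi1 ~ 0.625, phi2 ~ 0.375; the fractions sum to 1 and recover the feed:
# phi1 * x1 + phi2 * x2 == z
```

The two identities in the closing comment are exactly the mass-balance constraints that define a feasible phase split.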
Algorithm 1 DISCOMAX: Differentiable binary equilibrium solver with discrete enumeration (forward pass) and masked softmax aggregation over feasible states (backward pass).

Require: feed composition z ∈ (0, 1), Gibbs energy model Δg_mix(x; θ), base discretization grid X = {x^(k)}_{k=1}^{N}, temperature parameter τ > 0.
Ensure: ST prediction of the equilibrium phase compositions (x′*, x′′*).

1: Define an augmented grid including the feed composition: X′ = {x^(i)}_{i=1}^{N′} = X ∪ {z}.
2: Compute the vector of Gibbs energies on the grid: Δg_mix ∈ R^{N′}, Δg_mix,i = Δg_mix(x^(i); θ).
3: For each pair of grid indices (i, j) with 1 ≤ i ≤ j ≤ N′, compute
   Δg_mix,(i,j) = +∞ if z < x^(i) or z > x^(j), and otherwise
   Δg_mix,(i,j) = Δg_mix,(i) · (z − x^(j))/(x^(i) − x^(j)) + Δg_mix,(j) · (z − x^(i))/(x^(j) − x^(i)).
4: Determine the hard minimum indices for inference (not differentiable): (i*, j*) ∈ arg min_{1 ≤ i ≤ j ≤ N′} Δg_mix,(i,j).
5: Set x′* = x^(i*), x′′* = x^(j*).
6: Compute softmax probabilities: p_(i,j) = exp(−Δg_mix,(i,j)/τ) / Σ_{u=1}^{N′} Σ_{v=u}^{N′} exp(−Δg_mix,(u,v)/τ).
7: Compute soft estimates for training (differentiable):
   x′*_soft = Σ_{i=1}^{N′} Σ_{j=i}^{N′} p_(i,j) x^(i),   x′′*_soft = Σ_{i=1}^{N′} Σ_{j=i}^{N′} p_(i,j) x^(j).
8: Straight-through gradient estimation:
   x′* ← x′* + (x′*_soft − stopgrad(x′*_soft)),   x′′* ← x′′* + (x′′*_soft − stopgrad(x′′*_soft)).
9: return (x′*, x′′*)

As an alternative, we present the forward pass of DISCOMAX, a discrete optimization procedure which works by enumerating the set of mass-balance-feasible candidate equilibrium states and selecting the globally optimal state (arg min) from that set. Differentiating the elements of the set in Eq. (1) w.r.t. the model parameters θ does not directly provide meaningful gradients, since the argmin/argmax over a discrete set is a piecewise-constant function [43].
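The masked pairwise enumeration of Algorithm 1 (steps 2–5) can be sketched with batched tensor operations. This is an illustrative sketch under our own naming, not the exact implementation; in particular, the ϵ tie-breaking in favor of self-pairs is omitted for brevity:

```python
import torch

def pair_energies(x_grid, g_grid, z):
    """Total Gibbs energy for every candidate pair (i, j); infeasible pairs get +inf."""
    x_i, x_j = x_grid[:, None], x_grid[None, :]
    denom = x_j - x_i
    safe = torch.where(denom == 0, torch.ones_like(denom), denom)  # avoid 0/0 on self-pairs
    phi_i = (x_j - z) / safe              # lever-rule fraction of the phase at x_i
    phi_j = (z - x_i) / safe              # lever-rule fraction of the phase at x_j
    g_pair = phi_i * g_grid[:, None] + phi_j * g_grid[None, :]
    # Self-pairs (x_i == x_j): homogeneous mixture with energy g(x_i).
    g_pair = torch.where(denom == 0, g_grid[:, None].expand_as(g_pair), g_pair)
    # Mass-balance mask: a pair is feasible only if x_i <= z <= x_j.
    feasible = (x_i <= z) & (z <= x_j)
    return torch.where(feasible, g_pair, torch.full_like(g_pair, float("inf")))

# Double-well toy energy with minima at x = 0.25 and x = 0.75, feed z = 0.5:
x = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])   # grid already contains z here
g = torch.tensor([0.0, -1.0, 0.0, -1.0, 0.0])
E = pair_energies(x, g, z=0.5)
i, j = divmod(int(torch.argmin(E)), len(x))     # hard minimizer: (x[i], x[j]) = (0.25, 0.75)
```

On this toy landscape the feasible pair (0.25, 0.75) attains the lowest total Gibbs energy (−1), so the hard arg min correctly predicts a phase split rather than the homogeneous state (z, z) at energy 0.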
We therefore approximate the solution using a relaxation of the discretized Gibbs energy surface. Let X = {x^(k)}_{k=1}^{N} be a grid on [0, 1], where N is the number of discretization points, and let X′ be the augmented set of grid points containing the elements of X and the feed composition z, and thus N′ = N + 1 entries. We compute Δg_mix on this grid, construct a subclass Γ^(2)(z) containing every feasible pair (i, j) with x^(i) ≤ z ≤ x^(j) (i.e., satisfying the mass balance with φ^(i), φ^(j) > 0), and evaluate the total Gibbs energy of mixing Δg_mix,(i,j) of each of its members (i, j) according to

Δg_mix,(i,j) = Δg_mix,(i) φ^(i) + Δg_mix,(j) φ^(j).

Note that this includes self-combinations, so that (z, z) (as the only member of the subclass Γ^(1)(z)) can be a valid solution in the case of no phase separation. All self-combinations are constructed to have an ϵ lower energy in order to break ties in favor of homogeneous mixtures instead of heterogeneous mixtures with an infinitesimal second phase at identical Δg_mix. The hard (non-differentiable) equilibrium estimate is obtained as

(i*, j*) ∈ arg min_{i ≤ j} Δg_mix,(i,j),   x′* = x^(i*), x′′* = x^(j*).

Differentiable Relaxation via Straight-Through Softmax

The equilibrium compositions are defined by a discrete minimization over a vector of total Gibbs energy of mixing values. While the hard arg min satisfies the thermodynamic extremum principle exactly, it is non-differentiable and therefore cannot be used directly in gradient-based training. We address this by adopting straight-through (ST) gradient estimation [44, 45, 55], which decouples the forward prediction from the backward gradient computation.

• Forward pass (fp) (thermodynamically exact, up to discretization).
We compute the hard minimizer

(i*, j*) ∈ arg min_{1 ≤ i ≤ j ≤ N′} Δg_mix,(i,j),   x′* = x^(i*), x′′* = x^(j*),

and return (x′*, x′′*). This guarantees that the reported equilibrium state satisfies the extremum principle and the mass-balance feasibility constraints encoded in the masking of infeasible pairs.

• Backward pass (bp) (smooth surrogate gradients). At the same time, we form a differentiable softmax distribution over all candidate pairs,

p_(i,j) = exp(−Δg_mix,(i,j)/τ) / Σ_{u=1}^{N′} Σ_{v=u}^{N′} exp(−Δg_mix,(u,v)/τ),

and compute soft equilibrium estimates

x′*_soft = Σ_{i=1}^{N′} Σ_{j=i}^{N′} p_(i,j) x^(i),   x′′*_soft = Σ_{i=1}^{N′} Σ_{j=i}^{N′} p_(i,j) x^(j).

Gradients are then propagated through these soft estimates using a straight-through update,

x′* ← x′* + (x′*_soft − stopgrad(x′*_soft)),   x′′* ← x′′* + (x′′*_soft − stopgrad(x′′*_soft)),

such that the forward value remains the hard minimizer, but the backward derivative follows the differentiable soft surrogate. Concretely, we replace the gradients of (x′, x′′) with those of (x′_soft, x′′_soft) by invoking tensor.detach(). The temperature τ controls the sharpness of the soft distribution: as τ → 0, p_(i,j) concentrates on the hard minimizer, recovering the discrete optimum, whereas larger τ yields smoother gradients that can improve optimization early in training [44, 45].

Algorithm 1 defines the full DISCOMAX solver: discrete enumeration and optimization in the forward pass and masked softmax aggregation over feasible states in the backward pass. We thus obtain a thermodynamically consistent binary equilibrium solver formulation that can be integrated with ease into any neural-network architecture and thus facilitates full end-to-end training, without needing supplementary data or training steps, in contrast to previous work.
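The straight-through combination of hard forward value and soft backward gradient is a one-line detach trick. The sketch below is a self-contained toy (the pair-energy matrix is a stand-in of our own construction, parameterized by a hypothetical scalar theta, with its minimum at x_i = 0.25, x_j = 0.75):

```python
import torch

tau = 0.1
x_grid = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
theta = torch.tensor(1.0, requires_grad=True)

# Toy pair-energy matrix depending on a model parameter theta; it stands in for
# the masked Delta-g matrix of the enumeration step.
g_pair = theta * (x_grid[:, None] - 0.25) ** 2 + (x_grid[None, :] - 0.75) ** 2

xi = x_grid[:, None].expand_as(g_pair).reshape(-1)   # x_i of every candidate pair
g = g_pair.reshape(-1)

p = torch.softmax(-g / tau, dim=0)        # Boltzmann weights with k_B*T ~ tau
x1_soft = (p * xi).sum()                  # differentiable soft estimate of x'
x1_hard = xi[torch.argmin(g)]             # exact, non-differentiable minimizer

# Straight-through: the forward value is the hard minimizer, while the backward
# derivative is taken from the soft surrogate (the detached term cancels in value).
x1_st = x1_hard + (x1_soft - x1_soft.detach())
x1_st.backward()                          # gradients reach theta via the soft path
```

After `backward()`, `x1_st` equals the hard minimizer 0.25 while `theta.grad` is nonzero, which is exactly the decoupling the ST estimator is meant to provide.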
Furthermore, it recovers the thermodynamically exact, globally optimal solution in the limit of infinitely fine discretization.

Baseline: Surrogate Solver

As discussed in the Introduction, using a neural network as a surrogate for a solver [39] has several structural limitations, but it remains an important, and currently the only directly applicable, baseline for this work from the literature. Because neither the implementation nor the training code is publicly available, we re-implemented a similar surrogate solver and benchmark both with a GNN architecture. We do not resolve the data-leakage issues mentioned in the Introduction. Moreover, we train the surrogate solver on all available UNIFAC molecules in our dataset, which should yield the most optimistic performance of the baseline method. More details regarding the baseline implementation are given in the SI and in our GitLab repository: https://git.rwth-aachen.de/avt-svt/public/differentiable-thermo-eq-ml.

Loss Function Definitions

The performance of both the surrogate solver and the DISCOMAX solver depends on the choice of additional loss functions. We propose a novel Hessian loss and compare it with the "Gibbs loss" presented by [39]. The total loss, including direct and auxiliary losses, can be written as

L_total = L_1 + L_2 + λ_G L_G + λ_H L_H,

where the first two loss terms are standard mean-squared-error losses that penalize incorrect predictions directly:

L_1 = ‖x̂′ − x′*‖²_2,   L_2 = ‖x̂′′ − x′′*‖²_2.

Auxiliary Hessian Loss (H)

The Hessian loss term promotes thermodynamically consistent stability behavior at the measured data points, specifically, convexity at the target phase compositions (x′*, x′′*) and concavity at the feed composition z. Convexity at x′* and x′′* is a necessary, but not sufficient, condition for a consistent and correctly predicted equilibrium.
Concavity at z is neither necessary nor sufficient, but we assume that it might improve training, since at random initialization in the first epoch, Δg_mix of each system is convex everywhere, as the ideal term dominates. We discuss the need for this auxiliary loss in the results. The desired curvature conditions are

H_Δg_mix(x′*), H_Δg_mix(x′′*) ≻ 0 (convex, stable phases),   H_Δg_mix(z) ≺ 0 (concave, unstable mixture).

The loss is defined as

L_H = L_convex(x′*) + L_convex(x′′*) + L_concave(z),

where each curvature loss penalizes deviations from the desired sign of the Hessian diagonals:

L_convex(x) = (1/N) Σ_i { 0 if (H_Δg_mix(x))_ii ≥ ϵ;  ϵ − (H_Δg_mix(x))_ii if (H_Δg_mix(x))_ii < ϵ },

L_concave(x) = (1/N) Σ_i { 0 if (H_Δg_mix(x))_ii ≤ −ϵ;  (H_Δg_mix(x))_ii + ϵ if (H_Δg_mix(x))_ii > −ϵ },

with a small margin ϵ > 0 defining a constraint back-off in order to avoid oscillations during training.

Auxiliary Gibbs Loss (G)

The Gibbs loss term introduced by [39] promotes the correct qualitative curvature of the molar Gibbs energy of mixing Δg_mix(T, x) over the composition domain for data points known to exhibit an LLE. In particular, it penalizes cases where the model fails to predict a concave region in Δg_mix, which is a weak but necessary condition for the presence of the correct miscibility gap. In our case studies, we only consider isothermal LLE, but the extension is straightforward. For each LLE sample k, the curvature metric

S_k(x_d) = ∂²Δg_mix(x_d)/∂x²

is evaluated on a discretized composition grid {x_d}_{d=1}^{101}. The minimum curvature value over the grid determines whether the model predicts a phase split. The Gibbs loss is defined as

L_Gibbs = max(0, min_d S_k(x_d)),

where a negative minimum curvature indicates a concave region and thus a correctly detected miscibility gap.
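For a binary system at fixed T, the "Hessian" reduces to the scalar second derivative of Δg_mix with respect to x, which nested automatic differentiation provides directly. The following sketch (function names ours; margin-based penalties written with `clamp`, equivalent to the piecewise definitions above) illustrates the curvature penalties on a hypothetical double-well function:

```python
import torch

def second_derivative(g_fn, x):
    """d^2 g / dx^2 of a scalar function via nested automatic differentiation."""
    x = x.clone().requires_grad_(True)
    dg, = torch.autograd.grad(g_fn(x), x, create_graph=True)
    d2g, = torch.autograd.grad(dg, x, create_graph=True)
    return d2g

def convex_loss(h, eps=1e-3):
    """Penalty when the curvature h falls below the margin eps (want h >= eps)."""
    return torch.clamp(eps - h, min=0.0)

def concave_loss(h, eps=1e-3):
    """Penalty when the curvature h exceeds -eps (want h <= -eps)."""
    return torch.clamp(h + eps, min=0.0)

# Example: g(x) = x^4 - x^2 is concave at x = 0 and convex near x = 0.7.
g = lambda x: x ** 4 - x ** 2
h0 = second_derivative(g, torch.tensor(0.0))   # g''(0)   = -2
h7 = second_derivative(g, torch.tensor(0.7))   # g''(0.7) = 12*0.49 - 2 = 3.88
```

Here `concave_loss(h0)` and `convex_loss(h7)` are zero, while the mismatched combinations are penalized, which is exactly the sign behavior the Hessian loss encodes at z and at the phase compositions.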
Computational Experiments

We evaluate our proposed differentiable equilibrium solver and compare it with the surrogate solver in two complementary computational test cases. The first test case isolates the optimization behavior of the solvers on a single binary mixture, allowing us to assess whether each method can recover the phase equilibrium for a fixed system, with no learned dependence on the molecular structure. The second test case extends the comparison to a chemically diverse dataset of binary mixtures, with the aim of evaluating generalization performance when molecular-structure information is included.

Dataset

We create a comprehensive binary LLE dataset based on the Dortmund-Database-trained HANNA 2 model [39]. We do this by forming all possible binary mixtures of 526 unique solvents, resulting in 138,601 candidate systems. This is the subset of solvents first published by [56], which were also found in COSMObase2021. From this pool, we extract only those mixtures that exhibit an LLE at 298.15 K. This yields 8,597 unique binary systems with two liquid phases, which we then use as the training set for learning a neural g^E model from equilibrium data only. Dataset statistics for the first fold are visualized in Figure 2. Each data point consists of the features mol_1, mol_2, z and the labels x′*, x′′*. For training, the feed z is chosen to be in the center of the miscibility gap, i.e., z = 0.5 (x′* + x′′*). Initially, we ran all experiments with the equivalent COSMO-RS-generated dataset, but we opted to use HANNA 2 (MIT License) in order to be able to publish the full dataset and enable full reproducibility of our results. For the same reason, experimental equilibrium data from the NIST TDE or the Dortmund Database were not used in this work, since they are not publicly available in their entirety and thus cannot be used for reproducible benchmarking.
Since the model used for generating the HANNA 2 dataset was trained using the surrogate solver, there may be some bias in favor of the surrogate solver, the main baseline of this work, versus our differentiable solver. We report key results for the COSMO-RS dataset in the SI, where our method outperforms the surrogate solver by an even larger margin. We publish the full dataset as well as the code used to compute it with the HANNA 2 model.

Fitting single systems

In the first test case, we consider a single binary mixture and train a system-specific neural g^E model to reproduce its LLE behavior. The purpose of this setup is to probe the optimization properties of the equilibrium solvers in isolation, without the confounding effects of molecular feature extraction. In particular, this test examines whether a solver can drive the model towards exact overfitting of a known LLE system and whether the optimization converges to thermodynamically consistent solutions. Since performance depends only on the target equilibrium compositions, we group the available systems into 50 bins based on the size of their miscibility gap between zero and one. One system is selected from each bin, ensuring that the evaluation covers the full range of miscibility behaviors and avoids bias toward any particular miscibility gap width.

Figure 2: Scatter plot of equilibrium composition pairs (left). Histogram of the miscibility gap width (middle). Histogram of the feed composition z (right). Colored by train, validation, and test split for the first fold.
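The per-bin selection procedure can be sketched as follows, with randomly generated gap widths standing in for the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical miscibility-gap widths standing in for the 8,597 LLE systems.
gap_widths = rng.uniform(0.01, 0.99, size=8597)

# 50 equal-width bins spanning [0, 1]; select the first system in each bin.
edges = np.linspace(0.0, 1.0, 51)
bin_idx = np.digitize(gap_widths, edges) - 1
selected = [int(np.flatnonzero(bin_idx == b)[0])
            for b in range(50) if np.any(bin_idx == b)]
```

With one system per occupied bin, the resulting benchmark set spans the full range of gap widths rather than concentrating on the most common ones.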
The neural g^E model used in this test case is not conditioned on molecular structure and takes only the scalar composition x as input. It consists of three hidden layers with 64 units each and ELU activations. The model is trained for 200 epochs using the AdamW optimizer in combination with a OneCycle learning rate schedule with cosine annealing and a maximum learning rate of 0.1. The softmax temperature τ is initialized at 0.1 and multiplied by a factor of 0.98 each epoch. Early stopping is intentionally omitted, as the goal of this test is to assess the ability of each solver to achieve exact fitting. The weights in the loss function were chosen as λ_G = 0.01 and λ_H = 0.05. Both weights and the learning rate were chosen based on a small grid search; details are provided in the SI.

Generalizing to new systems

The second test case evaluates solver performance in a practical learning scenario, where the model must generalize across chemically distinct binary mixtures. This setup assesses whether the solvers enable stable end-to-end training of a molecular-structure-informed g^E model and whether the learned representations transfer to previously unseen systems. For this case study, we train a neural g^E model with ten-fold cross-validation, which is conditioned on both molecular graphs and the mixture composition. The dataset is split into training, validation, and test subsets (80/10/10) comprising different binary mixtures, ensuring that all test systems are unseen during training, i.e., a system-wise split. Models are trained using the AdamW optimizer in combination with a OneCycle learning rate schedule with cosine annealing and a maximum learning rate of 0.001, with early stopping enabled and a patience of 50 epochs. Training with the surrogate solver diverged multiple times at higher learning rates. For the same reason, λ_G was reduced to 0.001, while λ_H was kept at 0.05.
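A minimal NumPy sketch of the composition-only g^E network and the resulting Δg_mix assembly is given below. It mirrors the stated architecture (three hidden layers of 64 ELU units) and the τ schedule, while the initialization scheme and the reduced-units ideal mixing term are our assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_mlp(sizes):
    """He-style random init; stands in for the 3x64 ELU network."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def elu(a):
    # ELU activation: identity for a > 0, exp(a) - 1 otherwise.
    return np.where(a > 0, a, np.expm1(a))

def g_excess(params, x):
    """Forward pass of the composition-only neural g^E model, x of shape (k, 1)."""
    h = x
    for W, b in params[:-1]:
        h = elu(h @ W + b)
    W, b = params[-1]
    return h @ W + b

def delta_g_mix(params, x):
    """Reduced Gibbs energy of mixing: ideal term plus neural excess term.
    At random initialization the ideal (convex) term dominates."""
    ideal = x * np.log(x) + (1.0 - x) * np.log(1.0 - x)
    return ideal + g_excess(params, x)

params = init_mlp([1, 64, 64, 64, 1])
x = np.linspace(0.01, 0.99, 99).reshape(-1, 1)
dg = delta_g_mix(params, x)

# Softmax-temperature schedule: tau_0 = 0.1, decayed by a factor 0.98 per epoch.
tau = [0.1 * 0.98 ** epoch for epoch in range(200)]
```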
The softmax temperature τ is initialized at 0.1 and multiplied by a factor of 0.98 each epoch. However, a fixed τ of 0.01 performed similarly, as shown for single-fold experiments in the SI.

Performance metrics

In both test cases, performance is evaluated using the system-level mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination R². Let $x'^{*}_i, x''^{*}_i$ denote the ground-truth values and $\hat{x}'_i, \hat{x}''_i$ the corresponding predictions for sample $i$, with $N$ samples per output. The MAE is computed as the sum of the individual output-wise errors,

$$
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left( \left| \hat{x}'_i - x'^{*}_i \right| + \left| \hat{x}''_i - x''^{*}_i \right| \right).
$$

Similarly, the system-level RMSE is given by

$$
\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \hat{x}'_i - x'^{*}_i \right)^2 + \left( \hat{x}''_i - x''^{*}_i \right)^2 \right] }.
$$

The coefficient of determination R² is calculated as usual, by concatenating the predictions of both phase compositions.

Results

In this section, we present the results of our evaluation on binary LLE phase-composition prediction. We begin in the Fitting single systems subsection by benchmarking the surrogate solver against our differentiable DISCOMAX solver on the task of fitting a single system, highlighting the advantages of an exactly thermodynamically consistent formulation. The Generalizing to new systems subsection then reports results of the generalization to unseen systems and compares the performance of both solver approaches for a full GNN-based g^E model trained solely on LLE composition data, demonstrating how our proposed approach enables end-to-end learning of equilibrium behavior.

Fitting single systems

Figure 3: Violin plot of the MAE of fitting 50 systems with different gap widths one by one (batch size = 1). Each violin represents a different loss configuration.
H stands for the auxiliary Hessian loss, and G stands for the auxiliary Gibbs loss.

Table 1 and Figure 3 present the mean absolute error (MAE) obtained when fitting 50 individual LLE systems spanning a broad range of miscibility-gap widths. The MAE reflects the summed absolute deviation between the predicted and true compositions of both coexisting phases. Each violin in Figure 3 represents a kernel density estimate over all system MAEs for a given solver configuration, including both our thermodynamically consistent DISCOMAX solver (shown in blue) and the surrogate neural-network solver (shown in red), each trained either with the Hessian loss, the Gibbs loss, or both, as described in the Loss Function Definitions subsection.

Figure 4: Six systems with increasing miscibility gap (Tetralin–Methanol, Indole–Cyanogen chloride, Iso-oxazole–N-hexadecane, Naphthalene–Resorcinol, Naphthalene–Diethylene glycol, Thiophene–Glycol) and comparison of the DISCOMAX H model (red) and the Surrogate G model (blue). The ground truth data is shown in green, with the respective equilibrium composition targets shown as dotted vertical lines.

Across all configurations, the proposed differentiable DISCOMAX solver consistently outperforms the surrogate baseline by a large margin. The best-performing configurations are DISCOMAX H with an MAE of 0.015 ± 0.017 and DISCOMAX H+G with 0.019 ± 0.024, compared to 0.101 ± 0.067 for the strongest surrogate variant (Surrogate H+G), corresponding to an approximately 6.7× MAE reduction. Thus, even when both auxiliary loss terms are applied, the surrogate solver remains substantially less accurate.
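The system-level metrics defined in the Performance metrics subsection can be sketched as follows (a NumPy stand-in for the evaluation code):

```python
import numpy as np

def phase_metrics(xp_true, xpp_true, xp_pred, xpp_pred):
    """System-level MAE, RMSE, and R^2 over both coexisting phases.

    MAE and RMSE sum the per-sample errors of both phase compositions;
    R^2 is computed on the concatenated predictions of both phases."""
    xp_t, xpp_t = np.asarray(xp_true, float), np.asarray(xpp_true, float)
    xp_p, xpp_p = np.asarray(xp_pred, float), np.asarray(xpp_pred, float)
    mae = np.mean(np.abs(xp_p - xp_t) + np.abs(xpp_p - xpp_t))
    rmse = np.sqrt(np.mean((xp_p - xp_t) ** 2 + (xpp_p - xpp_t) ** 2))
    y = np.concatenate([xp_t, xpp_t])
    y_hat = np.concatenate([xp_p, xpp_p])
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return float(mae), float(rmse), float(r2)
```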
The gap is even more pronounced without the Hessian loss: the surrogate model trained with the Gibbs loss alone (Surrogate G, as proposed by [39]) yields an MAE of 0.258 ± 0.186 (approximately 17× higher than DISCOMAX H), while the plain surrogate reaches an MAE of 0.419 ± 0.302 (approximately 28× higher). By contrast, all DISCOMAX variants deliver good predictions, and incorporating the Hessian loss yields the greatest improvements, as reflected by consistently lower MAE/RMSE and near-unity R² values, as also shown in Table 1. Even the pure DISCOMAX method without any auxiliary loss performs about 2.8× better than the best surrogate method (Surrogate H+G) and about 7.2× better than the previously proposed Surrogate G method, in terms of MAE. Especially considering the discretization width of 0.01, these results can be considered satisfactory for the DISCOMAX solver.

Figure 4 provides a qualitative comparison of the fitted phase-composition profiles for six representative systems with different target miscibility gap widths for each model. The surrogate solver consistently displays substantially larger deviations from the specified target equilibrium compositions and frequently produces thermodynamically inconsistent equilibrium composition predictions, with violations of the extremum principle, i.e., the predicted configuration is unstable and not even metastable. Even the satisfaction of simple physical laws, like the conservation of mass, cannot be guaranteed for the surrogate solver. An example of this failure is shown in the SI.

Notably, for several systems, the surrogate-based training appears to have converged, with the surrogate predictions landing exactly on target, yet at entirely incorrect solutions, indicating that the learned model provides stable gradients but not ones that point toward the equilibrium state.
Similar limitations were also highlighted in [29] for local methods optimizing iso-activity when fitting NRTL parameters. By comparison, the DISCOMAX solver with the Hessian loss produces qualitatively accurate (close to the target) and, as guaranteed, thermodynamically consistent predictions, also for the worst-performing instances. Furthermore, we note a systematic difficulty in capturing very narrow miscibility gaps with high precision, which is a common problem also in traditional parametric g^E models [57].

Taken together, these experiments demonstrate that the proposed DISCOMAX equilibrium solver provides a substantially more reliable and physically grounded approach for fitting individual LLE systems than the surrogate neural-network solver. The quantitative assessment in Figure 3 shows consistently lower errors across a wide variety of gap widths, while the qualitative comparison in Figure 4 highlights that surrogate predictions often violate basic thermodynamic principles and may converge to entirely incorrect equilibria. In contrast, the DISCOMAX formulation preserves thermodynamic consistency in all cases, with deviations arising primarily in systems with extremely narrow miscibility gaps. In addition, the DISCOMAX variant without any auxiliary loss already performs well on its own, and even better than when paired with the Gibbs loss. These results establish the DISCOMAX solver as a robust foundation for end-to-end learning of thermodynamically consistent g^E models, motivating its use in the full multi-system training described in the following section.

Table 1: Metrics of fitting 50 systems with different gap widths one by one (batch size = 1). H stands for the auxiliary Hessian loss, and G stands for the auxiliary Gibbs loss.
Model           MAE (mean ± std)   RMSE    R²
DISCOMAX H      0.015 ± 0.017      0.018   0.998
DISCOMAX H+G    0.019 ± 0.024      0.024   0.997
DISCOMAX G      0.028 ± 0.058      0.045   0.991
DISCOMAX        0.036 ± 0.070      0.055   0.986
Surrogate H+G   0.101 ± 0.067      0.090   0.962
Surrogate H     0.104 ± 0.074      0.099   0.954
Surrogate G     0.258 ± 0.186      0.265   0.675
Surrogate       0.419 ± 0.302      0.382   0.324

Generalizing to new systems

We evaluate the generalization performance of the proposed thermodynamically consistent DISCOMAX equilibrium GNN using 10-fold cross-validation on the full dataset of binary LLE systems. The method is compared against the surrogate-based approach under identical data splits. Figure 5 shows parity plots of predicted versus reference phase compositions on the test set for the different model variants. Quantitative results aggregated over all folds are reported in Table 2.

Table 2: Full cross-validation results for the test, validation, and train sets aggregated over ten folds for the best-performing models (Surrogate batch size 128, DISCOMAX batch size 256, and DISCOMAX H batch size 64). Metrics are shown as mean ± standard deviation.

Model        Split        MAE               RMSE              R²
Surrogate G  Test         0.076 ± 0.002     0.095 ± 0.004     0.967 ± 0.002
             Validation   0.078 ± 0.003     0.097 ± 0.004     0.966 ± 0.003
             Train        0.046 ± 0.001     0.051 ± 0.001     0.989 ± <0.001
DISCOMAX     Test         0.074 ± 0.003     0.097 ± 0.005     0.965 ± 0.004
             Validation   0.075 ± 0.003     0.098 ± 0.004     0.965 ± 0.003
             Train        0.035 ± 0.001     0.040 ± 0.001     0.993 ± <0.001
DISCOMAX H   Test         0.068 ± 0.003     0.090 ± 0.005     0.970 ± 0.003
             Validation   0.068 ± 0.003     0.090 ± 0.005     0.970 ± 0.004
             Train        0.022 ± <0.001    0.023 ± 0.001     0.998 ± <0.001

Among all configurations, the DISCOMAX solver augmented with the Hessian-based stability loss (DISCOMAX H) achieves the best test performance, with a mean absolute error of 0.068 ± 0.003, a root mean squared error of 0.090 ± 0.005, and an R² of 0.970 ± 0.003. The surrogate solver reaches a test MAE of 0.076 ± 0.
002, while the plain DISCOMAX solver remains slightly more accurate with a test MAE of 0.074 ± 0.003. The cross-validation MAE is further visualized in the SI.

The differences between the approaches become more pronounced on the training set, shown in Figure 6. Both DISCOMAX-based models achieve substantially lower training errors than the surrogate solver, with MAEs of 0.022 ± <0.001 (DISCOMAX H) and 0.035 ± 0.001 (DISCOMAX), compared to 0.046 ± 0.001 for the Surrogate G approach. This corresponds to an error reduction of roughly 52% for DISCOMAX H and about 24% for DISCOMAX, indicating that the proposed differentiable equilibrium solver is able to fit the training data much more accurately. The surrogate solver, in contrast, is limited by the approximation error inherent in learning a direct mapping from discretized Δg_mix curves to phase compositions.

To further analyze the generalization behavior, Figure 7 reports the test-set MAE binned by the width of the miscibility gap, defined as (x'' − x'), for the DISCOMAX H model and the Surrogate G baseline. The DISCOMAX H model consistently outperforms the surrogate solver across all gap sizes, with the largest improvements observed for narrow phase splits, where surrogate-based predictions are particularly prone to error. This highlights the benefit of enforcing thermodynamic consistency directly through the equilibrium solver, especially in regimes where the Δg_mix landscape is shallow and small deviations can lead to large errors in the predicted phase compositions. As expected, modeling the phase behavior becomes increasingly difficult as systems approach the region of narrow gap widths, and the DISCOMAX H model demonstrates a clear advantage in these challenging regimes. For gap widths ≤ 0.
40, the proposed model significantly reduces both the median error and the error variance compared to the Surrogate G baseline. As the gap width increases, the errors of both models converge towards the discretization precision of 0.01. Supplementary results are given in the Supporting Information of this article.

Overall, the findings highlight both the strengths and shortcomings of surrogate-based equilibrium prediction, and they demonstrate that the proposed differentiable DISCOMAX solver provides a more robust and physically grounded foundation for learning equilibrium behavior directly from data.

Figure 5: Test-set parity plots for the different model variants for a single fold: a) DISCOMAX H (ours): R² 0.969, RMSE 0.094, MAE 0.069; b) Surrogate G: R² 0.964, RMSE 0.100, MAE 0.078; c) DISCOMAX (ours): R² 0.968, RMSE 0.095, MAE 0.073; d) Surrogate: R² 0.102, RMSE 0.501, MAE 0.646. Full cross-validation results can be found in Table 2.

Training-set parity statistics for a single fold (Figure 6): a) DISCOMAX H (ours): R² 0.998, RMSE 0.024, MAE 0.022; b) Surrogate G: R² 0.992, RMSE 0.048, MAE 0.044; c) DISCOMAX (ours): R² 0.994, RMSE 0.041, MAE 0.034;
d) Surrogate: R² 0.103, RMSE 0.495, MAE 0.639.

Figure 6: Training-set parity plots for the different model variants for a single fold. Full cross-validation results can be found in Table 2.

Figure 7: Cross-validation test-set MAE of the phase-composition predictions, binned by the real gap width (x'' − x') into [0.00, 0.20), [0.20, 0.40), [0.40, 0.60), [0.60, 0.80), and [0.80, 1.00], for the DISCOMAX H and Surrogate G models.

Discussion

The results of this study yield several noteworthy observations. First of all, we were able to show that a machine learning model can be trained on binary LLE data in a fully thermodynamically consistent manner, up to a chosen discretization. To the best of our knowledge, this has not been achieved before. On top of that, we achieve slightly better predictive performance than surrogate-based approaches, without even requiring any auxiliary loss. Not requiring an auxiliary loss was especially surprising to us, since the surrogate solver essentially stops working completely without it.

While the surrogate solver performs unexpectedly poorly when fitted to individual systems, these deficiencies do not fully propagate to the full-dataset training scenario. A likely explanation is that fitting each system in isolation exposes the surrogate solver to its own inaccurate equilibrium estimates, causing the optimization to stabilize at incorrect solutions. In contrast, stochastic gradient descent over mini-batches, as used in the full training, reduces this effect by repeatedly varying the training signal, thereby preventing systematic convergence to these incorrect local minima.
Furthermore, the approximation errors of the surrogate solver might also act as a regularizer, which could explain the discrepancy between the train- and test-set performance of the DISCOMAX and surrogate solvers.

The qualitative analysis of the single-system experiments presented above reveals that training with the surrogate solver often converges smoothly, but to entirely wrong solutions. This indicates that although the surrogate model provides stable gradients that facilitate optimization, and thus enable differentiation through the optimization problem, these gradients are biased and do not reliably point towards the true target equilibrium state. Correcting such behavior would require supplying additional curated training data for every failure mode, which becomes increasingly impractical and non-scalable in higher-dimensional mixture spaces.

A key advantage of our proposed differentiable binary equilibrium solver, next to being thermodynamically consistent by construction, is that it is genuinely end-to-end trainable. Unlike surrogate-based methods, it is not biased by the synthetically generated data used for training the surrogate solver (e.g., UNIFAC). This is because the differentiable equilibrium calculation itself contains no trainable parameters. This makes the method fully general and widely applicable to all equilibrium properties, such as pure-component vapor–liquid equilibria and three-phase LLLE (shown in S2 in the SI), but also to solid–liquid equilibria [58] and purely solid equilibrium structures. In such settings, surrogate approaches would require application-specific training data and solvers, whereas the proposed method naturally transfers without modification.

Limitations & Future Steps

The present work focuses exclusively on learning the g^E function from the equilibrium phase compositions of binary liquid-liquid systems.
As such, the training data contain only positive samples, i.e., mixtures that exhibit two-phase coexistence. An important next step would be to also incorporate data on miscible systems into training and force Δg_mix to be convex for such samples. Integrating both behaviors within a single framework would enable the model not only to predict phase-split compositions but also to determine whether a phase split occurs in the first place. At present, no machine-learning-based method jointly addresses both tasks in a thermodynamically principled manner (aside from traditional rigorous solvers). Although we did not investigate this explicitly, it is plausible that the surrogate approach of [39] exhibits similar limitations: the model receives no supervision signal for systems in which no phase split should occur. This suggests an interesting direction for future research, namely the use of miscibility or single-phase behavior as a complementary, weakly supervised learning signal.

An unexpected potential advantage of the surrogate solver baseline may be that it implicitly provides a form of regularization through the noise introduced by the surrogate solver itself. We speculate this because the gap between training and test error is much smaller for this method compared to ours, although further investigation would be required to confirm this effect. In our experiments, we did not apply dedicated regularization techniques beyond a small amount of weight decay in AdamW. This choice allowed for a cleaner comparison between our model and the baseline without introducing any additional hyperparameters. It is possible that the test performance of our method could be improved by incorporating stronger regularization, for example, by adding dropout layers. Another potential strategy to mitigate overfitting would be the use of pretrained embeddings.
Exploring these directions remains an interesting avenue for future work. While our case study has focused on binary LLE systems, the proposed framework is, in principle, applicable to pure-component VLE, multicomponent LLE, or SLE, for example. Using it as a general-purpose equilibrium solver for mixtures with more than a few components would be challenging, since the computational cost of the current, largely brute-force, discretization-based algorithm scales poorly with the number of components. In particular, the number of groups that can be formed from a given set of states grows combinatorially, with an order of magnitude determined by the binomial coefficient. This does not pose a practical obstacle in our intended use case of training neural g^E models, since experimentally available phase-equilibrium datasets consist mostly of binary systems. In this setting, the approach remains extremely efficient and fully applicable. During deployment, practitioners can still simply rely on their method of choice for performing flash calculations.

Conclusion

We address the challenge of integrating rigorous thermodynamic constraints into machine-learning models whose target quantities are defined implicitly as solutions of an extremum principle. Focusing on binary LLE, we introduce a differentiable equilibrium solver that reformulates the flash calculation as a discrete minimization problem over the Δg_mix surface. Thermodynamic consistency is enforced exactly in the forward pass by evaluating the hard minimizer, while differentiability is maintained in the backward pass via a straight-through gradient estimator that propagates gradients via a Boltzmann-weighted aggregation over discretized compositions. Building on this solver, we train GNN models for the excess Gibbs energy using only equilibrium phase-composition data, without requiring direct supervision on g^E or activity coefficients.
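The forward/backward split described above can be illustrated with a minimal NumPy sketch for a binary isothermal flash. The pairwise enumeration with a lever-rule objective and the Boltzmann weighting reflect our reading of the method rather than the authors' implementation, and the regular-solution Δg_mix with χ = 3 is purely illustrative; in training, the hard minimizer would be used in the forward pass while the soft solution supplies gradients via the straight-through estimator:

```python
import numpy as np

def lle_flash_discrete(dg, x_grid, z, tau=0.01):
    """Discrete two-phase flash for a binary mixture.

    Enumerates composition pairs (x_i, x_j) bracketing the feed z, scores
    each pair by its lever-rule Gibbs energy, and returns the hard argmin
    (thermodynamically consistent forward pass) together with a
    Boltzmann-weighted soft solution (backward-pass surrogate)."""
    lo = np.flatnonzero(x_grid < z)
    hi = np.flatnonzero(x_grid > z)
    xi, xj = np.meshgrid(x_grid[lo], x_grid[hi], indexing="ij")
    gi, gj = np.meshgrid(dg[lo], dg[hi], indexing="ij")
    beta = (z - xi) / (xj - xi)          # molar fraction of the x_j phase
    G = (1.0 - beta) * gi + beta * gj    # total Gibbs energy of the split
    i, j = np.unravel_index(np.argmin(G), G.shape)
    hard = (xi[i, j], xj[i, j])
    w = np.exp(-(G - G.min()) / tau)     # Boltzmann weights over all pairs
    w /= w.sum()
    soft = (float(np.sum(w * xi)), float(np.sum(w * xj)))
    return hard, soft

# Illustrative double-well Delta g_mix (regular-solution model, chi = 3):
x = np.arange(0.01, 1.0, 0.01)
dg = x * np.log(x) + (1.0 - x) * np.log(1.0 - x) + 3.0 * x * (1.0 - x)
hard, soft = lle_flash_discrete(dg, x, z=0.5)
print(hard)  # approximately (0.07, 0.93), the coexisting phase compositions
```

With a 0.01 grid, the hard minimizer recovers the common-tangent construction to within the discretization width, which matches the accuracy floor reported in the results.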
Both the single-system fitting and the generalization experiments demonstrate that the proposed approach reliably predicts phase splits across a wide range of miscibility-gap widths, with systematic deviations primarily confined to systems exhibiting a narrow two-phase region.

We benchmark our framework against a neural network surrogate solver trained on UNIFAC-generated data. Although the surrogate solver provides smooth gradients, it frequently converges to thermodynamically inconsistent and incorrect equilibrium states when trying to fit individual systems. In contrast, the proposed DISCOMAX solver preserves thermodynamic consistency at every step of training and inference and achieves superior quantitative performance in terms of mean absolute error on phase compositions and gap widths in full end-to-end GNN training.

Our results demonstrate that rigorous equilibrium thermodynamics can be embedded into modern machine-learning pipelines when the target properties arise from fairly small lower-level optimization problems. The proposed framework establishes a principled foundation for learning thermodynamically grounded models directly from phase-equilibrium data and can, in principle, be extended to other types of phase behavior, including vapor-liquid and solid-liquid equilibria, as well as to more complex chemical systems.

Code availability

The code of this work is available under the open-source Eclipse Public License at: https://git.rwth-aachen.de/avt-svt/public/differentiable-thermo-eq-ml.

Data availability

The full dataset consisting of HANNA 2 predictions for all 138,601 possible binary systems and 8,597 LLE mixtures is available in the same repository.

Author contributions

Karim K. Ben Hicham: Conceptualization, Methodology, Formal analysis, Investigation, Writing - original draft; development and implementation of the original algorithm.
Moreno Ascani: Conceptualization, Methodology, Writing - original draft; theoretical analysis and thermodynamic formulation of the algorithm. Jan G. Rittig: Conceptualization, Writing - review & editing. Alexander Mitsos: Conceptualization, Supervision, Funding acquisition, Writing - review & editing.

Supporting information

The Supporting Information provides a detailed thermodynamic derivation of the proposed equilibrium solver, together with illustrative case studies covering binary LLE, ternary LLLE, and vapor-pressure calculations. In addition, it presents implementation details of the surrogate solver and supplementary results regarding the single-system fitting and the full-dataset training experiments.

Acknowledgments

We acknowledge support of the Werner Siemens Foundation in the frame of the WSS Research Center "catalaix".

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

(1) Redepenning, C.; Recker, S.; Marquardt, W. Pinch-Based Shortcut Method for the Conceptual Design of Isothermal Extraction Columns. AIChE Journal 2017, 63, 1236–1245.
(2) Cisternas, L. A.; Rudd, D. F. Process Designs for Fractional Crystallization from Solution. Industrial & Engineering Chemistry Research 1993, 32, 1993–2005.
(3) Skiborowski, M.; Górak, A. Process Intensification: By Reactive and Membrane-Assisted Separations; Walter de Gruyter GmbH & Co KG: 2022.
(4) Barros, E. P.; Ries, B.; Champion, C.; Rieder, S. R.; Riniker, S. Accounting for Solvation Correlation Effects on the Thermodynamics of Water Networks in Protein Cavities. Journal of Chemical Information and Modeling 2023, 63, 1794–1805.
(5) Sadowski, G.; Kontogeorgis, G. M.; Artusio, F.; Gerogiorgis, D. I.; de Angelis, G.; ten Kate, A.; de Hemptinne, J.-C.
The Thermodynamic Pharma Challenge: A Cross-Cutting Perspective. European Journal of Pharmaceutical Sciences 2025, 107178.
(6) Paterson, M. S. Nonhydrostatic Thermodynamics and Its Geologic Applications. Reviews of Geophysics 1973, 11, 355–389.
(7) Galano-Frutos, J. J.; Nerín-Fonz, F.; Sancho, J. Calculation of Protein Folding Thermodynamics Using Molecular Dynamics Simulations. Journal of Chemical Information and Modeling 2023, 63, 7791–7806.
(8) Sing, C. E.; Perry, S. L. Recent Progress in the Science of Complex Coacervation. Soft Matter 2020, 16, 2885–2914.
(9) Greinert, T.; Vogel, K.; Mühlenweg, J.-K.; Sadowski, G.; Maskow, T.; Held, C. Standard Gibbs Energy of Metabolic Reactions: VI. Glyceraldehyde 3-Phosphate Dehydrogenase Reaction. Fluid Phase Equilibria 2020, 517, 112597.
(10) Wilson, G. M. Vapor-Liquid Equilibrium. XI. A New Expression for the Excess Free Energy of Mixing. Journal of the American Chemical Society 1964, 86, 127–130.
(11) Renon, H.; Prausnitz, J. M. Local Compositions in Thermodynamic Excess Functions for Liquid Mixtures. AIChE Journal 1968, 14, 135–144.
(12) Abrams, D. S.; Prausnitz, J. M. Statistical Thermodynamics of Liquid Mixtures: A New Expression for the Excess Gibbs Energy of Partly or Completely Miscible Systems. AIChE Journal 1975, 21, 116–128.
(13) Fredenslund, A.; Jones, R. L.; Prausnitz, J. M. Group-Contribution Estimation of Activity Coefficients in Nonideal Liquid Mixtures. AIChE Journal 1975, 21, 1086–1099.
(14) Klamt, A.; Eckert, F. COSMO-RS: A Novel and Efficient Method for the a Priori Prediction of Thermophysical Data of Liquids. Fluid Phase Equilibria 2000, 172, 43–72.
(15) Klamt, A. The COSMO and COSMO-RS Solvation Models. WIREs Computational Molecular Science 2011, 1, 699–709.
(16) Winter, B.; Winter, C.; Schilling, J.; Bardow, A.
A Smile Is All You Need: Predicting Limiting Activity Coefficients from SMILES with Natural Language Processing. Digital Discovery 2022, 1, 859–869.
(17) Medina, E. I. S.; Linke, S.; Stoll, M.; Sundmacher, K. Graph Neural Networks for the Prediction of Infinite Dilution Activity Coefficients. Digital Discovery 2022, 1, 216–225.
(18) Chen, G.; Song, Z.; Qi, Z.; Sundmacher, K. Neural Recommender System for the Activity Coefficient Prediction and UNIFAC Model Extension of Ionic Liquid-Solute Systems. AIChE Journal 2021, 67, e17171.
(19) Di Caprio, U.; Degrève, J.; Hellinckx, P.; Waldherr, S.; Leblebici, M. E. HybridGamma: A Thermodynamically Consistent Framework for Hybrid Modelling of Activity Coefficients. Chemical Engineering Journal 2023, 475, 146104.
(20) Abranches, D. O.; Maginn, E. J.; Colón, Y. J. Activity Coefficient Acquisition with Thermodynamics-Informed Active Learning for Phase Diagram Construction. AIChE Journal 2023, 69, e18141.
(21) Winter, B.; Rehner, P.; Esper, T.; Schilling, J.; Bardow, A. Understanding the Language of Molecules: Predicting Pure Component Parameters for the PC-SAFT Equation of State from SMILES, arXiv, 2023.
(22) Felton, K. C.; Raßpe-Lange, L.; Rittig, J. G.; Leonhard, K.; Mitsos, A.; Meyer-Kirschner, J.; Knösche, C.; Lapkin, A. A. ML-SAFT: A Machine Learning Framework for PCP-SAFT Parameter Prediction. Chemical Engineering Journal 2024, 492, 151999.
(23) Alam, M. Z.; Alqallaf, A.; McGill, C. EquiNet: Modeling Vapor-Liquid Equilibrium Using Simultaneous Neural-Network Predictions for Activity Coefficient and Vapor Pressure. ChemRxiv preprint https://doi.org/10.26434/chemrxiv-2025-wmxtm-v4 2026.
(24) Rittig, J. G.; Mitsos, A. Thermodynamics-Consistent Graph Neural Networks. Chemical Science 2024, 15, 18504–18512.
(25) Rittig, J. G.; Dahmen, M.; Grohe, M.; Schwaller, P.; Mitsos, A. Molecular Machine Learning in Chemical Process Design, 2025.
(26) Rittig, J. G.; F elton, K. C.; Lapkin, A. A.; Mitsos, A. Gibbs–Duhem-informed Neural Netw orks for Binary Activit y Co efficien t Prediction. Digital Disc overy 2023 , 2 , 1752–1767. (27) Sp ec ht, T.; Nagda, M.; F ellenz, S.; Mandt, S.; Hasse, H.; Jirasek, F. HANNA: Hard-constraint Neural Net work for Consistent A ctivity Co efficien t Prediction. Chemic al Scienc e 2024 , 15 , 19777–19786. (28) Y ang, Y.; Lin, S.-T. Ph ysics-Embedded Mac hine Learning Model for Phase Equilibrium Prediction in Multicomp onen t Systems. Journal of Chemic al Information and Mo deling 2025 , 65 , 10180–10193. (29) Mitsos, A.; Bollas, G. M.; Barton, P . I. Bilevel Optimization F ormulation for Parameter Estimation in Liquid–Liquid Phase Equilibrium Problems. Chemic al Engine ering Scienc e 2009 , 64 , 548–559. (30) Zhang, H.; Bonilla-P etriciolet, A.; Rangaiah, G. P . A Review on Global Optimization Methods for Phase Equilibrium Mo deling and Calculations. The Op en Thermo dynamics Journal 2011 , 5 , 71–92. (31) L. E. Bak er; A. C. Pierce; K. D. Luks Gibbs Energy Analysis of Phase Equilibria. So c. Petr ol. Engrs. J. 1982 , 22 , 731–742. (32) M. L. Mic helsen The Isothermal Flash Problem 1. Stability. Fluid Phase Equilibria 1982 , 9 , 1–19. (33) M. L. Michelsen The Isothermal Flash Problem 2. Phase-split Calculation. Fluid Phase Equilibria 1982 , 9 , 21–40. (34) Mitsos, A.; Barton, P . I. A Dual Extrem um Principle in Thermo dynamics. AIChE Journal 2007 , 53 , 2131–2147. (35) Göttl, Q.; Rosen, N.; Burger, J. Conv ex En velope Metho d for T, p Flash Calculations for Mixtures with an Arbitrary Num b er of Components and Arbitrary Aggregate States, arXiv, 2025. (36) Göttl, Q.; Pirnay , J.; Grimm, D. G.; Burger, J. Conv ex Env elope Metho d for Determining Liquid Multi-Phase Equilibria in Systems with Arbitrary Number of Comp onen ts. Computers & Chemic al Engine ering 2023 , 177 , 108321. (37) Winter, B.; Rehner, P .; Esp er, T.; Sc hilling, J.; Bardow, A. 
Understanding the Language of Molecules: Predicting Pure Comp onent Parameters for the PC-SAFT Equation of State from SMILES. Digital Disc overy 2025 , 4 , 1142–1157. (38) Guan, P .-W. Differentiable Thermo dynamic Mo deling. Scripta Materialia 2022 , 207 , 114217. (39) Hoffmann, M.; Sp ec ht, T.; Göttl, Q.; Burger, J.; Mandt, S.; Hasse, H.; Jirasek, F. A Mac hine-Learned Expression for the Excess Gibbs Energy, arXiv, 2025. (40) Dhamank ar, S.; Jiang, S.; W ebb, M. A. Accelerating Multicomp onen t Phase-Co existence Calculations with Ph ysics-Informed Neural Netw orks. Mole cular Systems Design & Engine ering 2025 , 10 , 89–101. (41) Nesterov, Y. Smooth Minimization of Non-Smo oth F unctions. Mathematic al pr o gr amming 2005 , 103 , 127–152. (42) Martins, A.; Astudillo, R. In International Confer enc e on Machine L e arning , PMLR: 2016, pp 1614– 1623. (43) Blondel, M.; T eb oul, O.; Berthet, Q.; Djolonga, J. In International Confer enc e on Machine L e arning , PMLR: 2020, pp 950–959. (44) Jang, E.; Gu, S.; Poole, B. Categorical Reparameterization with Gumbel-Softmax, arXiv, 2017. (45) Maddison, C. J.; Mnih, A.; T eh, Y. W. The Concrete Distribution: A Contin uous Relaxation of Discrete Random V ariables, arXiv, 2017. 21 (46) Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. Numerical Coordinate Regression with Con volutional Neural Netw orks, arXiv, 2018. (47) Blondel, M.; Roulet, V. The Elemen ts of Differentiable Programming, 2024. (48) Ac kley , D. H.; Hinton, G. E.; Sejno wski, T. J. A Learning Algorithm for Boltzmann Machines. Co gnitive scienc e 1985 , 9 , 147–169. (49) LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F., et al. A T utorial on Energy-Based Learning. Pr e dicting structur e d data 2006 , 1 . (50) Du, Y.; Mordatc h, I. Implicit Generation and Mo deling with Energy Based Mo dels. A dvanc es in neur al information pr o c essing systems 2019 , 32 . (51) Hansen, J.-P .; McDonald, I. 
R., The ory of Simple Liquids: With Applic ations to Soft Matter ; Academic press: 2013. (52) Agazzi, A.; Carlei, V.; Romito, M.; Sa viozzi, S. Global Optimization via Softmin Energy Minimization, 2025. (53) Heid, E.; Greenman, K. P .; Ch ung, Y.; Li, S.-C.; Graff, D. E.; V ermeire, F. H.; W u, H.; Green, W. H.; McGill, C. J. Chemprop: A Machine Learning P ack age for Chemical Prop ert y Prediction. Journal of Chemic al Information and Mo deling 2023 , 64 , 9–17. (54) Margules, M. S. On the Comp osition of Saturated V ap ors of Mixtures. A kademie der Wissenschaften in Wien, Mathematisch-Naturwissenschaftliche Klasse Abteilung II 1895 , 104 , 1234–1239. (55) Bengio, Y.; Léonard, N.; Courville, A. Estimating or Propagating Gradients through Sto c hastic Neu- rons for Conditional Computation, 2013. (56) Qin, S.; Jiang, S.; Li, J.; Balaprak ash, P .; Lehn, R. C. V.; M. Zav ala, V. Capturing Molecular In- teractions in Graph Neural Netw orks: A Case Study in Multi-Comp onen t Phase Equilibrium. Digital Disc overy 2023 , 2 , 138–151. (57) De P ablo, J. J.; Prausnitz, J. M. Thermodynamics of Liquid-Liquid Equilibria Including the Critical Region. AIChE journal 1988 , 34 , 1595–1606. (58) Al Ibrahim, E.; Morgan, N.; Müller, S.; Motati, S.; Green, W. H. Accurately Predicting Solubility Curv es via a Thermo dynamic Cycle, Machine Learning, and Solven t Ensembles. Journal of the A mer- ic an Chemic al So ciety 2025 . (59) Jaynes, E. T. Information Theory and Statistical Mec hanics. Physic al r eview 1957 , 106 , 620. (60) Jaynes, E. T. Information Theory and Statistical Mec hanics. I I. Physic al r eview 1957 , 108 , 171. (61) Shannon, C. E. A Mathematical Theory of Communication. The Bel l system te chnic al journal 1948 , 27 , 379–423. (62) Rehner, P .; Bauer, G.; Gross, J. F eos: An Op en-Source F ramework for Equations of State and Classical Densit y F unctional Theory. Industrial & Engine ering Chemistry R ese ar ch 2023 , 62 , 5347–5357. 
(63) Rodríguez-Palmeiro, I.; Rodríguez, O.; Soto, A.; Held, C. Measurement and PC-SAFT Modelling of Three-Phase Behaviour. Physical Chemistry Chemical Physics 2015, 17, 1800–1810.
(64) Michelsen, M. L. State Function Based Flash Specifications. Fluid Phase Equilibria 1999, 158, 617–626.
(65) Gross, J.; Sadowski, G. Perturbed-Chain SAFT: An Equation of State Based on a Perturbation Theory for Chain Molecules. Industrial & Engineering Chemistry Research 2001, 40, 1244–1260.

S1 Thermodynamic Derivation

In this section, we propose two complementary procedures for the same LLE $T,p$-flash calculation, to which we refer as mass balance explicit (Formulation 1) and mass balance implicit (Formulation 2). Both formulations are equivalent methods for performing the LLE $T,p$-flash calculation. They are derived in detail by assuming random concentration fluctuations within a system at given temperature $T$, pressure $p$, and concentration $\bar{z}$, and by appealing to statistical-thermodynamics arguments. First, a model system with random concentration fluctuations is defined and visualized, and the assumptions relevant to the subsequent mathematical derivation are discussed. The resulting working equations, which in statistical-thermodynamics terms can be interpreted as the probability distribution over the possible configurations of the system, follow from the solution of a constrained convex optimization problem. This optimization problem is first formulated based on the definition and visualization of the model system and the underlying assumptions. The working equations are then derived from the stationarity conditions of the Lagrangian of this optimization problem and discussed.

Mass balance explicit (Formulation 1). We consider a closed system at thermal and mechanical equilibrium with its environment.
The system is therefore at fixed temperature $T$, pressure $p$, and concentration $\bar{z} = (z_1, z_2, \ldots, z_n)^T$, and has an overall mole number $n_{tot}$. Let us assume random fluctuations in the concentration within the system. This assumption implies that at any time $t_k$ the system can be divided into $M$ equimolar subsystems, with each subsystem $i = 1, \ldots, M$ having, at each time $k$, a local concentration $\bar{x}^{(i,k)}$, a molar Gibbs energy $g^{(i,k)}$, and a mole number $\delta n = n_{tot}/M$. We now assume that each subsystem is allowed to populate a discrete set of $N$ concentrations $\bar{x}^{(1)}, \bar{x}^{(2)}, \ldots, \bar{x}^{(N)}$ with respective molar Gibbs energies $g^{(1)}, g^{(2)}, \ldots, g^{(N)}$. The model system is depicted schematically in Figure 8.

Figure 8: Visualization of the model system considered in this section. The whole system is divided into an (arbitrarily) large number $M$ of subsystems, each at the same $T$ and $p$, with concentration and Gibbs energy values populated among a pre-defined, discrete set.

The probability $p_i$ of finding one subsystem in one of the states $i = 1, \ldots, N$ is given by the number of subsystems $M(i,k)$ in state $i$ at each time $k$, divided by the total number of subsystems $M$ and averaged over time (for a total number of time steps $N_t$):

p_i = \frac{1}{N_t} \sum_{k=1}^{N_t} \frac{M(i,k)}{M}

For an arbitrarily large number of subsystems $M$, or a large number of time steps $N_t$, the probability $p_i$ tends toward the value that maximizes the Shannon entropy $S$ [59, 60], given by Eq. (2) [61], subject to the specific constraints of the considered system:

S = -\sum_{i=1}^{N} p_i \ln(p_i)    (2)

The sought probability distribution must satisfy the normalization condition, the mass balance, and a specific mean value of the Gibbs energy.
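To make the extremum principle concrete, the maximization of Eq. (2) under a normalization and mean-Gibbs-energy constraint can be checked numerically. The sketch below uses illustrative toy $g$ values, omits the mass-balance constraint of Eq. (5), and assumes scipy is available; it verifies that the constrained optimizer reproduces an exponential-family (Boltzmann-like) distribution.

```python
import numpy as np
from scipy.optimize import minimize

# Toy discrete states: molar Gibbs energies and a prescribed mean value
# (illustrative numbers only; the mass-balance constraint is omitted here).
g = np.array([0.0, 0.5, 1.0, 2.0])
g_mean = 0.6

def neg_entropy(p):
    # -S(p) = sum_i p_i ln p_i, the objective of the constrained problem
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},   # normalization
    {"type": "eq", "fun": lambda p: p @ g - g_mean},  # mean Gibbs energy
]
res = minimize(neg_entropy, np.full(g.size, 1.0 / g.size),
               bounds=[(1e-12, 1.0)] * g.size,
               constraints=constraints, method="SLSQP", tol=1e-12)
p = res.x

# The maximizer must be of the form p_i ∝ exp(-beta * g_i), so ln p_i is
# affine in g_i; verify with a linear fit of ln p versus g.
coeffs = np.polyfit(g, np.log(p), 1)
residual = np.max(np.abs(np.log(p) - np.polyval(coeffs, g)))
```

Any number of states and any feasible mean value can be used; the exponential form emerges from the entropy objective alone, which is the essence of the derivation that follows.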
Equation (2), subject to the aforementioned conditions, can be translated into the constrained convex optimization problem

\min_{\bar{p} \in \mathbb{R}^N} \; -S(\bar{p}) = \sum_{i=1}^{N} p_i \ln(p_i)
\quad \text{s.t.} \quad \sum_{i=1}^{N} p_i = 1, \qquad \sum_{i=1}^{N} p_i x_j^{(i)} = z_j, \quad j = 1, \ldots, n-1, \qquad \sum_{i=1}^{N} p_i g^{(i)} = \bar{g}

with the Lagrangian $\mathcal{L} : X \to \mathbb{R}$ given by

\mathcal{L}(\bar{p}, \lambda, \bar{\alpha}, \beta) = \sum_{i=1}^{N} p_i \ln(p_i) + \lambda \left( \sum_{i=1}^{N} p_i - 1 \right) + \sum_{j=1}^{n-1} \alpha_j \left( \sum_{i=1}^{N} p_i x_j^{(i)} - z_j \right) + \beta \left( \sum_{i=1}^{N} p_i g^{(i)} - \bar{g} \right)

where $x := (\bar{p}, \lambda, \bar{\alpha}, \beta) \in X$ and $X \subset \mathbb{R}^{N+n+1}$ is the domain of the Lagrangian. $X$ is a convex set, since it is the Cartesian product of convex subsets, $X = \mathbb{R}^N_+ \times \mathbb{R} \times \mathbb{R}^{n-1} \times \mathbb{R}$. The first term of the Lagrangian (the objective function) is a sum of convex functions (the terms $p_i \ln(p_i)$), while the remaining terms (each equality constraint multiplied by its Lagrange multiplier) are affine. The Lagrangian is therefore convex, and its (global) optimum is its stationary point ($\nabla \mathcal{L} = 0$), obtained as the solution of the system of equations (3)-(6):

\frac{\partial \mathcal{L}}{\partial p_i} = 0 = \ln(p_i) + 1 + \lambda + \sum_{j=1}^{n-1} \alpha_j x_j^{(i)} + \beta g^{(i)}    (3)

\frac{\partial \mathcal{L}}{\partial \lambda} = 0 = \sum_{i=1}^{N} p_i - 1    (4)

\frac{\partial \mathcal{L}}{\partial \alpha_j} = 0 = \sum_{i=1}^{N} p_i x_j^{(i)} - z_j, \quad j = 1, \ldots, n-1    (5)

\frac{\partial \mathcal{L}}{\partial \beta} = 0 = \sum_{i=1}^{N} p_i g^{(i)} - \bar{g}    (6)

The Lagrange multiplier $\lambda$, associated with the normalization condition (Eq. (4)), is obtained by inserting Eq. (3) into Eq. (4):

e^{\lambda + 1} = \sum_{i=1}^{N} e^{-\sum_{j=1}^{n-1} \alpha_j x_j^{(i)}} \, e^{-\beta g^{(i)}}    (7)

The Lagrange multiplier $\beta$ is a positive scaling factor which, in statistical thermodynamics, equals the inverse thermal energy, $\beta = 1/(k_B T)$, and has the effect of increasing or decreasing the spread of the distribution.
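For a binary mixture, solving the stationarity system reduces to a one-dimensional root find for the single mass-balance multiplier $\alpha$. A minimal numerical sketch, assuming a toy Margules-type $\Delta g_{mix}$ curve with a miscibility gap as a stand-in for a $g^E$ model, with states weighted by $e^{-g/\tau}$ and scipy available:

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical one-parameter Margules-type Delta g_mix with a miscibility
# gap (A = 3 > 2); stands in for a real g^E model.
A, tau = 3.0, 5e-3
x = np.linspace(1e-6, 1 - 1e-6, 500)               # discrete state compositions
g = x * np.log(x) + (1 - x) * np.log(1 - x) + A * x * (1 - x)

def probs(alpha):
    # p_i ∝ exp(alpha * x_i - g_i / tau); log-sum-exp shift for stability
    logits = alpha * x - g / tau
    w = np.exp(logits - logits.max())
    return w / w.sum()

def mass_balance(alpha, z):
    # F(alpha) = sum_i x_i p_i(alpha) - z, the binary mass-balance condition
    return probs(alpha) @ x - z

z = 0.5                                            # feed inside the miscibility gap
alpha = brentq(mass_balance, -500.0, 500.0, args=(z,))
p = probs(alpha)
```

F is monotonically increasing in alpha (its derivative is the variance of $x$ under $p$), so a simple bracketing solver suffices; for this feed the resulting distribution is bimodal, peaked near the two binodal compositions.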
For the scope of this work, the factor $\beta$ corresponds to the inverse of the temperature parameter $\tau$, $\beta = \tau^{-1}$, which can be used to sharpen the peak of the resulting distribution. We are therefore left with a probability distribution $p_i$ which is a function of the parameters $\beta$ and $\alpha_j$ and is given by:

p_i = \frac{e^{-\sum_{j=1}^{n-1} \alpha_j x_j^{(i)}} \, e^{-\beta g^{(i)}}}{\sum_{k=1}^{N} e^{-\sum_{j=1}^{n-1} \alpha_j x_j^{(k)}} \, e^{-\beta g^{(k)}}}, \quad i = 1, \ldots, N    (S1)

While $\beta$ can be freely adjusted, the set of $n-1$ parameters $\alpha_j$ must be determined iteratively such that the distribution $p_i$ satisfies the mass balance given by Eq. (5) for the given feed composition $\bar{z}$. This is accomplished by solving the system of equations

F_m(\bar{\alpha}) = \sum_{i=1}^{N} x_m^{(i)} \frac{e^{-\sum_{j=1}^{n-1} \alpha_j x_j^{(i)}} \, e^{-\beta g^{(i)}}}{\sum_{k=1}^{N} e^{-\sum_{j=1}^{n-1} \alpha_j x_j^{(k)}} \, e^{-\beta g^{(k)}}} - z_m = 0, \quad m = 1, \ldots, n-1

which for a binary system reduces to:

F(\alpha) = \sum_{i=1}^{N} x_1^{(i)} \frac{e^{-\alpha x_1^{(i)}} \, e^{-\beta g^{(i)}}}{\sum_{k=1}^{N} e^{-\alpha x_1^{(k)}} \, e^{-\beta g^{(k)}}} - z_1 = 0

Mass balance implicit (Formulation 2). To avoid the iterative step of determining the unknown parameters $\bar{\alpha}$ from the mass-balance condition, an alternative formulation of the problem depicted in Figure 8 can be defined by grouping subsystems together such that each group has the same concentration $\bar{z}$ as the entire system. This procedure of decomposing the system into groups of composition $\bar{z}$ is depicted schematically in Figure 9, where each group is labeled with a Roman numeral.

Figure 9: Visualization of the concept behind the second formulation of our method. The whole set of $M$ subsystems (left picture) can be constructed as a sum of groups consisting of fewer subsystems (right picture), with each group having the same overall composition $\bar{z}$ as the whole system.

In principle, any system composed of $M$ subsystems, each of which may occupy one of $N$ available states at fixed temperature $T$ and pressure $p$, with composition-energy pairs $\bar{x}^{(i)}, g^{(i)} = g(\bar{x}^{(i)})$ for $i = 1, \ldots, N$, can be constructed from a combination of any number of groups which belong to a class $\Gamma(\bar{z})$, according to:

\Gamma(\bar{z}) = \Gamma^{(1)}(\bar{z}) + \Gamma^{(2)}(\bar{z}) + \ldots + \Gamma^{(n)}(\bar{z}),

i.e., the class $\Gamma(\bar{z})$ consists of $n$ subclasses $\Gamma^{(k)}(\bar{z})$ indexed by $k = 1, \ldots, n$. Each subclass $\Gamma^{(k)}(\bar{z})$ contains all possible combinations of $k$ states that yield a system with overall composition $\bar{z}$, subject to the condition that no such $k$-tuple includes a sub-combination already represented in one of the lower-order subclasses $\Gamma^{(1)}(\bar{z}) + \ldots + \Gamma^{(k-1)}(\bar{z})$. More explicitly, a group of $k$ states belongs to $\Gamma^{(k)}(\bar{z})$ if there exists a set of positive real numbers $\bar{\varphi}^{(k)}(\bar{z})$, depending on the system composition $\bar{z}$, satisfying the following two equations:

\sum_{i=1}^{k} \varphi^{(i)} = 1    (8)

\sum_{i=1}^{k} \varphi^{(i)} \bar{x}^{(i)} = \bar{z}    (9)

such that no subset of these $k$ states already qualifies as a member of a lower-order subclass under the same criterion. As an example, consider a ternary system. In this case, a triplet of states $(\bar{x}^{(1)}, g^{(1)}), (\bar{x}^{(2)}, g^{(2)}), (\bar{x}^{(3)}, g^{(3)})$ is excluded from $\Gamma^{(3)}(\bar{z})$ if the three compositions lie on a straight line through the feed composition $\bar{z}$, because at least two of the pairs (1-2), (1-3), or (2-3) already belong to the subclass $\Gamma^{(2)}(\bar{z})$. Similarly, a binary system contains only the subclass $\Gamma^{(1)}(\bar{z})$ (non-empty only if a state $k$ coincides with the feed composition $\bar{z}$) and the subclass $\Gamma^{(2)}(\bar{z})$.
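This membership criterion (Eqs. (8)-(9)) can be checked numerically with a rank test and a linear least-squares solve, as detailed next. A numpy sketch with an illustrative helper (the function name and interface are not from the paper's code):

```python
import numpy as np

# Sketch of the subclass-membership test: rank check for linear independence,
# then phi = (X^T X)^(-1) X^T z via least squares.
def group_membership(X, z, tol=1e-10):
    """X: (n components x k states) matrix with state compositions as columns."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    if np.linalg.matrix_rank(X) < k:
        # linearly dependent states: the group reduces to a lower-order subclass
        return False, None
    phi, *_ = np.linalg.lstsq(X, z, rcond=None)
    ok = (np.all(phi > tol) and np.all(phi < 1.0 - tol)   # each phi in ]0, 1[
          and abs(phi.sum() - 1.0) < 1e-8                  # Eq. (8)
          and np.allclose(X @ phi, z))                     # Eq. (9)
    return ok, phi

# Binary example: states x1 = 0.1 and x1 = 0.9 bracket the feed z1 = 0.5,
# so they form a valid group in Gamma^(2)(z) with phi = (0.5, 0.5).
X = np.array([[0.1, 0.9],
              [0.9, 0.1]])
ok, phi = group_membership(X, np.array([0.5, 0.5]))
```

Because every composition column sums to one, Eq. (8) follows automatically whenever Eq. (9) holds for the full composition vectors; checking both is nonetheless a cheap safeguard.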
To verify whether a set of $k$ states $\bar{x}^{(1)}, \ldots, \bar{x}^{(k)}$ with $k \le n$ belongs to the subclass $\Gamma^{(k)}(\bar{z})$, one first verifies that the state compositions are linearly independent, i.e., that the matrix $\bar{\bar{X}}^{(k)}$ containing each state composition $\bar{x}^{(j)}$ as a column,

\bar{\bar{X}}^{(k)} = \begin{pmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(k)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(k)} \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(k)} \end{pmatrix}

has $\mathrm{rank}(\bar{\bar{X}}^{(k)}) = k$. If this is the case, then no subset of the $k$ states already belongs to a lower-order subclass. As a second step, the set $\bar{\varphi}(\bar{z})$ is determined as the solution of the linear least-squares problem:

\bar{\varphi} = \left( \bar{\bar{X}}^T \bar{\bar{X}} \right)^{-1} \bar{\bar{X}}^T \bar{z}

If the array $\bar{\varphi}$ satisfies Eqs. (8)-(9) and each element of $\bar{\varphi}$ lies in the open interval $]0, 1[$, then the set of $k$ states belongs to the subclass $\Gamma^{(k)}(\bar{z})$. Starting from $N$ states, this procedure can be used to populate the subclasses by defining all possible groups in each subclass and then verifying whether each group qualifies as a subclass member. For a subclass of order $k$, the number of possible groups is given by all the ways of choosing $k$ states from the $N$ total states, irrespective of their permutation, according to the binomial coefficient

\binom{N}{k} = \frac{N!}{k! \, (N-k)!}    (10)

which therefore represents an upper limit on the number of groups in subclass $k$. Each group $j$ in each subclass $\Gamma^{(k)}(\bar{z})$ has a value of the Gibbs energy $g^{(j,k)}$ given as a linear combination of the Gibbs energies $g^{(i)}$ of each of its states via the vector $\bar{\varphi}^{(j,k)}$, according to:

g^{(j,k)} = \sum_{i=1}^{k} \varphi_i^{(j,k)} g^{(i)}

The probability $p_S$ of finding group $S \in \Gamma(\bar{z})$ in the system is now given as the solution of the following optimization problem:

\min_{\bar{p}} \; -S(\bar{p}) = \sum_{s \in \Gamma(\bar{z})} p_s \ln(p_s)
\quad \text{s.t.} \quad \sum_{s \in \Gamma(\bar{z})} p_s = 1, \qquad \sum_{s \in \Gamma(\bar{z})} p_s g^{(s)} = \bar{g}

The solution of this optimization problem is the Boltzmann distribution over the groups in the class $\Gamma(\bar{z})$:

p_m = \frac{e^{-\beta g^{(m)}}}{\sum_{s \in \Gamma(\bar{z})} e^{-\beta g^{(s)}}}    (11)

with $s \in \Gamma(\bar{z})$ referring to any group in the class $\Gamma(\bar{z})$.

S2 Single System Case Studies

S2.1 Case Study 1: Binary LLE

The potential of the method is demonstrated on the simple case study of predicting the LLE of the binary system methanol-cyclohexane at T = 298.15 K and p = 5 bar. The thermodynamic behavior of this system is described using the PC-SAFT EoS with pure-component and binary interaction parameters provided by [62]. Figure 10 shows the $\Delta g_{mix}$ of the system as a function of the cyclohexane concentration $x_{Cyclohexane}$, as well as the discrete state probability distribution $p_i$ over the cyclohexane concentration. For the calculation of $\bar{p} = (p_1, \ldots, p_N)$, an equidistant grid of N = 500 points was defined. Each probability $p_i$ can be calculated either with Formulation 1 (directly via Eq. (S1)) or with Formulation 2 (indirectly from the probability of each group of states $p_m$ given by Eq. (11)), with only marginal differences between the two formulations.

Figure 10: Predicted discrete probability distributions of the states defined by the discretization (bottom diagrams) for a homogeneous (right diagrams) and heterogeneous feed (left diagrams), and comparison with the $\Delta g_{mix}$ curve and LLE calculated by the PC-SAFT EoS (upper diagrams).

Figure 10 shows that the predicted discrete probability distribution is very sharply peaked around the (single or multiple) concentrations at equilibrium.
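For this binary case, Formulation 2 can be sketched in a few lines: enumerate all state pairs that bracket the feed, obtain each group's phase fractions from the lever rule, and aggregate with the Boltzmann weights of Eq. (11). The $\Delta g_{mix}$ curve below is a hypothetical Margules-type stand-in for PC-SAFT:

```python
import numpy as np

# Binary Formulation 2 (sketch): groups in Gamma^(2)(z) are state pairs (i, j)
# with x_i < z < x_j; their phase fractions follow from the lever rule.
A, tau, z = 3.0, 1e-3, 0.5
x = np.linspace(1e-6, 1 - 1e-6, 200)
g = x * np.log(x) + (1 - x) * np.log(1 - x) + A * x * (1 - x)   # toy Delta g_mix

i, j = np.triu_indices(len(x), k=1)               # all pairs with i < j
keep = (x[i] < z) & (x[j] > z)                    # feasible groups bracket the feed
i, j = i[keep], j[keep]
phi = (x[j] - z) / (x[j] - x[i])                  # lever rule: phi*x_i + (1-phi)*x_j = z
g_group = phi * g[i] + (1.0 - phi) * g[j]         # group Gibbs energy

w = np.exp(-(g_group - g_group.min()) / tau)      # Boltzmann weights, Eq. (11)
p = w / w.sum()
x_prime, x_dprime = p @ x[i], p @ x[j]            # expected phase compositions
```

With a sharp temperature parameter the weights concentrate on the double-tangent pair, so $(x', x'')$ approach the binodal compositions; the mass balance $\varphi x' + (1-\varphi) x'' = z$ holds for every group by construction.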
For concentrations outside the miscibility gap, the probability distribution is unimodal around the feed concentration (right diagram), while inside the miscibility gap the predicted state distribution becomes bimodal (left diagram), with each peak located at one of the phase compositions and the area below each peak proportional to the molar amount of the corresponding phase. With Formulation 1, the composition of each phase can be determined by clustering the distribution and calculating separate mean concentrations. Formulation 2, on the other hand, allows direct evaluation of both phase concentrations by applying Eq. (11), since each phase composition is already clustered in the definition of the binary groups $\Gamma^{(2)}(\bar{z})$. Figure 11 shows the effect of different temperature parameters $\tau$ on the accuracy of the model in reconstructing the $\Delta g_{mix}$ curve and on the accuracy of the calculated LLE. The calculations are performed using 100 discretization points and the three temperature parameters $\tau$ = 0.0005, 0.005, and 0.05, which correspond, respectively, to the blue, cyan, and pale blue curves. Each curve is calculated, for a given feed concentration $z$, as the mean of the state values $\Delta g_{mix,k}$ according to Eq. (S2.1), varying the feed concentration between $z = 0$ (pure methanol) and $z = 1$ (pure cyclohexane).

Figure 11: Effect of increasing the temperature parameter $\tau$ on the reconstructed $\Delta g_{mix}$ curve and on the predicted LLE (bottom diagram), compared with the $\Delta g_{mix}$ and LLE predicted by the PC-SAFT EoS (upper diagram). Descriptions of the curves are reported in the main text.
\Delta g_{mix} = \sum_{k=1}^{N} p_k \, \Delta g_{mix,k}    (S2.1)

One interesting particularity of the method, exemplified by Figure 11, is that the re-calculated expectation of the $\Delta g_{mix}$ curve is always convex in the limit $\tau \to 0$, regardless of the mathematical form of the original curve. Within a miscibility gap, the calculated $\Delta g_{mix}$ scales almost linearly between both phases. At the same time, one observes an increase in accuracy at lower $\tau$ values, because the contribution of states away from the mean value decreases. The accuracy of the predicted LLE compositions remains high even when the absolute value of the calculated $\Delta g_{mix}$ is low, and it increases with decreasing $\tau$.

S2.2 Case Study 2: Three-Phase LLLE

In this example, we show how our method can be used to predict the phase equilibrium in a ternary system that undergoes a three-phase liquid-liquid-liquid equilibrium (LLLE). The ternary system under consideration is water-dodecane-[C12mim][Ntf2], the latter component being an imidazolium-based ionic liquid. The phase equilibrium of this system was measured by Rodríguez-Palmeiro et al. [63] at T = 298.15 K and p = 1 bar. For certain overall compositions, the mixture separates into three liquid phases: (I) an aqueous phase composed almost entirely of water, (II) an organic phase consisting predominantly of dodecane, and (III) an ionic-liquid-rich phase that also contains water and dodecane in comparable amounts. The same authors also provided the pure-component and binary interaction parameters of PC-SAFT required to calculate the phase behavior. Using Formulation 1 and a 100 x 100 discretization grid (corresponding to 5044 points), an estimate of the equilibrium composition of each phase was calculated. For a feed composition of $\bar{z}$ = (0.11, 0.68, 0.21), a tri-modal probability distribution $p_i$ is predicted.
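The clustering readout used to turn such a multimodal state distribution into phase compositions can be sketched for the binary (bimodal) case: split the grid at the probability minimum between the modes and take probability-weighted means per cluster. The toy curve below again stands in for PC-SAFT, and the symmetric feed $z = 0.5$ lets the mass-balance multiplier of Formulation 1 vanish.

```python
import numpy as np

# Bimodal state distribution for a toy demixing system at a symmetric feed.
x = np.linspace(1e-6, 1 - 1e-6, 500)
g = x * np.log(x) + (1 - x) * np.log(1 - x) + 3.0 * x * (1 - x)
p = np.exp(-(g - g.min()) / 5e-3)
p /= p.sum()

# Clustering readout: cut at the probability valley between the two modes,
# then compute a probability-weighted mean composition per cluster.
split = int(np.argmin(p[100:400])) + 100
clusters = [slice(0, split), slice(split, len(x))]
means = [(p[c] @ x[c]) / p[c].sum() for c in clusters]   # per-phase compositions
fracs = [p[c].sum() for c in clusters]                   # per-phase molar amounts
```

The same idea generalizes to the ternary case by clustering the distribution over the composition simplex instead of over an interval.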
The composition of each phase was calculated by clustering each mode of the distribution and calculating separate average compositions. Figure 12 shows the probability distribution, depicted by color intensity over the whole composition space, as well as the estimated compositions of each of the three phases (I)-(III) (represented by gray points) and the feed composition used (red point).

Figure 12: Ternary diagram showing the three-phase LLLE of the system water-dodecane-[C12mim][Ntf2], calculated using our approach (Formulation 1), as well as the resulting state distribution as a color map over the diagram.

This example shows the versatility of the method in estimating different topologies of phase equilibrium: it correctly recognizes the number of stable phases without the necessity of defining case-specific formulations or performing separate stability checks. On the other hand, the number of grid points increases sharply with the number of components, which puts the choice of equidistant grids into question for more than two components.

S2.3 Case Study 3: Vapor-Pressure Calculation

The principles of our method remain valid when working in ensembles other than $npT$. Consider, for example, the problem of determining the vapor pressure of a pure fluid at a given temperature $T$. Working in the canonical ($nVT$) ensemble, the $TV$-flash is equivalent to finding the equilibrium condition (number of phases, usually one or two, and their densities) which, at given temperature $T$ and molar volume $v$, minimizes the Helmholtz energy $A$ of the system [64]. For a two-phase solution, the vapor pressure then equals the negative slope of the double tangent that touches both phases in the $A$-$v$ plot, and is given by Eq. (S2.3) [37]:

p = -\frac{A(V^L) - A(V^V)}{V^L - V^V}    (S2.3)

The same method can be derived by considering fluctuations in the molar volume $v$ and the local Helmholtz energy $A$. Discrete $(v_k, A_k)$ points can be defined, and the probability distribution $p_i$ is obtained by maximizing Eq. (2). Figure 13 shows the calculation of the vapor pressure of pure propane at T = 298.15 K using our method, with PC-SAFT describing the phase behavior of propane and parameters taken from [65]. 100 discretization points were defined, from which a set of binary states was constructed. Using a temperature parameter $\tau$ = 30, the volume and Helmholtz energy of both phases can be estimated.

Figure 13: Helmholtz energy curve of subcritical propane, calculated with PC-SAFT, and the resulting phase coexistence and vapor pressure determined using our approach (gray points connected by a gray line). The vapor pressure is given by the negative slope of the gray line.

The resulting pressure, calculated using Eq. (S2.3), is p = 9.577 bar. The rigorous calculation of the vapor pressure of propane, using the same set of PC-SAFT parameters, gives p = 9.543 bar at the same temperature, a deviation of less than 0.4%. This simple calculation showcases the potential of the method for accurate estimation of VLE while avoiding iterative numerical procedures.

S3 Dataset Generation

To compute equilibrium composition labels for our list of binary systems, we compute discretized $\Delta g_{mix}$ curves for each binary system and determine the corresponding phase splits using the DISCOMAX forward-pass procedure. For every system, a composition grid of 401 equally spaced points in the interval $[0 + \epsilon, 1 - \epsilon]$ with $\epsilon = 10^{-8}$ is constructed. At each composition $x$, $\Delta g_{mix}$ is evaluated, with $g^E$ predicted using the HANNA 2 model (with $H^{ex}$ data) at T = 298.15 K.
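A condensed sketch of this labeling step for a single system, with a hypothetical Margules curve standing in for the HANNA-predicted $g^E$ and the curvature-based feed placement simplified to a fixed feed $z = 0.5$; the grid size, $\epsilon$, and the $10^{-3}$ single-phase threshold follow the text, while the explicit inclusion of the single-state (feed) group is an assumption of this sketch:

```python
import numpy as np

eps = 1e-8
x = np.linspace(eps, 1.0 - eps, 401)          # composition grid from the text

def labels(A, z=0.5):
    """Tie-line labels via the exact discrete argmin over candidate groups."""
    gmix = x * np.log(x) + (1 - x) * np.log(1 - x) + A * x * (1 - x)
    i, j = np.triu_indices(len(x), k=1)
    keep = (x[i] < z) & (x[j] > z)            # two-state groups bracketing the feed
    i, j = i[keep], j[keep]
    phi = (x[j] - z) / (x[j] - x[i])          # lever rule
    chord = phi * gmix[i] + (1 - phi) * gmix[j]
    best = int(np.argmin(chord))
    k0 = int(np.argmin(np.abs(x - z)))        # single-state group: the feed itself
    if gmix[k0] <= chord[best]:
        return None                            # single phase, no split recorded
    x1, x2 = x[i[best]], x[j[best]]
    return None if abs(x1 - x2) < 1e-3 else (x1, x2)

tie_line = labels(3.0)    # demixing toy system: a tie line is recorded
no_split = labels(1.0)    # fully miscible toy system: no phase split
```

Repeating this loop over all systems (with the real $g^E$ model and the curvature-based feed) yields the label set described above.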
To estimate a representative feed composition, we compute the second derivative of the discretized $\Delta g_{mix}$ curve with respect to composition ($\partial^2 \Delta g_{mix} / \partial x^2$) using finite differences and identify regions of negative curvature [29]. The equilibrium phase compositions $(x', x'')$ are then obtained by minimizing the discretized $\Delta g_{mix}$ using the DISCOMAX algorithm described in the main text. If $|x' - x''| < 10^{-3}$, the system is considered single-phase and no phase split is recorded. The method is simple and very fast, but strictly speaking not rigorous, since it does not use deterministic global optimization. After visually inspecting many systems, we did not find any case where the method failed to identify the correct phase split. The full code for this procedure is also available in the repository. The procedure is repeated for all ~140,000 binary systems in our dataset, yielding the final set of equilibrium composition labels.

S4 Surrogate Solver Implementation

Because the implementation details and training code of the surrogate solver presented by Hoffmann et al. [39] are not publicly available, we re-implemented this baseline independently. Their approach relies on a surrogate model trained on mod. UNIFAC-generated $\Delta g_{mix}$ data. Even when the mod. UNIFAC data are split into distinct training and test partitions for training the surrogate, data leakage remains unavoidable: the mod. UNIFAC parameterization itself was calibrated using a wide range of molecules, many of which also appear in our test set. Consequently, the surrogate solver possesses prior information about the test systems before any training takes place, giving it an intrinsic advantage. To further reduce the need to retrain the solver for every fold, we deliberately place the surrogate solver at an even stronger advantage by training it on all mod. UNIFAC-accessible systems rather than attempting to construct an artificial split. This also avoids the substantial practical difficulty of defining a truly leakage-free split when using predictions of a data-driven method for pretraining.

The dataset for training the surrogate solver is generated as follows. We evaluate all ~140,000 binary mixtures in our full COSMO-RS dataset and identify those for which mod. UNIFAC predictions are feasible, i.e., systems composed exclusively of functional groups supported by the mod. UNIFAC model. For each such mixture, we compute discretized $\Delta g_{mix}$ curves over 401 equally spaced compositions in the interval $[0 + \epsilon, 1 - \epsilon]$, and for temperatures ranging from 280 K to 380 K. Using the DISCOMAX discretized $\Delta g_{mix}$ minimization procedure described in the main text, relying solely on the thermodynamically exact discrete argmin, we obtain the corresponding phase splits for each system. We first create a mask where $\Delta g_{mix}$ is negatively curved, and then place the feed composition in the center of that region, assuming that only one major two-phase region exists. We used the same method for generating the training dataset. This is an inherent limitation of the surrogate solver, which does not condition on a feed composition; incorporating the feed as an additional input would increase the dimensionality of the problem and require substantially more training data. From the initial set of UNIFAC-compatible mixtures, we are able to generate $\Delta g_{mix}$ curves for 82,215 systems, of which 5,224 exhibit a liquid-liquid split. This yields the complete surrogate-solver training dataset. The surrogate model itself is a simple order-invariant multilayer perceptron (MLP).
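The order-invariant evaluation described next reduces to a simple symmetrization wrapper around any curve model. A sketch with a dummy stand-in model (a random linear layer plus sigmoid, illustrative only, not the actual MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=101)                 # dummy "model": random linear layer + sigmoid

def f(curve):
    """Stand-in for the MLP: maps a discretized curve to a composition in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-W @ curve))

def f_sym(curve):
    # average the prediction on the original curve (output x) and on the
    # reversed curve (output 1 - x)
    return 0.5 * (f(curve) + (1.0 - f(curve[::-1])))

curve = rng.normal(size=101)             # arbitrary discretized Delta g_mix curve
# swapping the two components reverses the curve; the symmetrized model then
# transforms exactly as x -> 1 - x
```

The wrapper guarantees by construction that relabeling the two components, which reverses the input curve, maps the predicted composition to its complement, regardless of what the underlying model does.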
Order inv ariance with resp ect to the 101 discretized comp ositions is enforced by ev aluating the mo del on the original order and output x , and reversed order and output 1 − x , with the final prediction obtained by av eraging these t wo outputs. The netw ork architecture closely follows [ 39 ], consisting of three hidden lay ers with ReLU activ ations and a final sigmoid output lay er. T raining is performed using the Adam optimizer together with a OneCycle learning rate schedule, with a maxim um learning rate of 0 . 01 . The mo del is trained for up to 200 ep ochs with early stopping. Using this setup, w e obtain a mean absolute error of approximately MAE = 0 . 008 , RMSE = 0 . 024 , and R 2 = 0 . 996 on the v alidation dataset. This is comparable to the rep orted 31 MAE = 0 . 004 by [ 39 ], but their surrogate dataset is not publicly a v ailable, so a direct comparison is not p ossible. S5 Supplemen tary Results Figure 14 provides a qualitativ e comparison of the fitted phase-comp osition profiles for the b est-p erforming configuration of each solver, for the most difficult systems for eac h mo del, resp ectiv ely . F or the H2O - N-butyl- 2-p yrrolidinone system, shown on the b ottom left, we ev en observe that the surrogate prediction violates the conserv ation of mass, since the feed line is not b etw een the tw o predicted phase comp ositions, whic h is unph ysical. The influence of the batch size on mo del p erformance on the hidden test set is illustrated 0 . 0 0 . 5 1 . 0 ∆ g m i x ( x ) Thiazole - Octadecanoicacid 0 . 0 0 . 5 1 . 0 Thiazole - Eicosane 0 . 0 0 . 5 1 . 0 Diphenylether - CO2 0 . 0 0 . 5 1 . 0 x ∆ g m i x ( x ) Thiazole - Eicosane 0 . 0 0 . 5 1 . 0 x Iso-oxazole - 1,2-benzenedicarboxylicaciddiisononylester 0 . 0 0 . 5 1 . 0 x Glycol - 2,5-dichlorophenol DISCOMAX H+G Surrogate G Actual Tie Line Pred. 
Figure 14: The three worst-predicted systems for the surrogate model (top) and for the DISCOMAX model (bottom), each overlaid with the predictions of the other solver on the same system.

The most important comparison of the work is between the proposed DISCOMAX and DISCOMAX H variants and the baseline Surrogate G model. The results demonstrate that both DISCOMAX configurations consistently yield a lower test-set MAE than the surrogate approach across all tested batch sizes. A degradation in performance is observed for all models at larger batch sizes. Both proposed methods are more robust than the surrogate baseline with respect to changes in batch size.

Table 3: Single-system fitting hyperparameter grid search. Tested parameter sets: solver ∈ {DISCOMAX, Surrogate}, lr ∈ {0.001, 0.005, 0.01}, λ_H ∈ {0.01, 0.05, 0.1}, λ_G ∈ {0.01, 0.05, 0.1}.

Solver      lr     λ_H    λ_G    MAE
DISCOMAX    0.001  0.01   -      0.0106 ± 0.0130
DISCOMAX    0.005  0.1    -      0.0137 ± 0.0162
DISCOMAX    0.005  0.05   -      0.0142 ± 0.0194
DISCOMAX    0.005  0.01   -      0.0159 ± 0.0261
DISCOMAX    0.01   0.01   -      0.0168 ± 0.0329
DISCOMAX    0.01   0.05   -      0.0175 ± 0.0243
DISCOMAX    0.001  0.05   -      0.0185 ± 0.0182
DISCOMAX    0.01   0.1    -      0.0228 ± 0.0281
DISCOMAX    0.001  0.1    -      0.0256 ± 0.0252
Surrogate   0.001  -      0.01   0.2046 ± 0.1584
Surrogate   0.001  -      0.05   0.2484 ± 0.1565
Surrogate   0.001  -      0.1    0.2862 ± 0.1676
Surrogate   0.005  -      0.01   0.2923 ± 0.1717
Surrogate   0.01   -      0.01   0.2934 ± 0.2010
Surrogate   0.005  -      0.05   0.3327 ± 0.2039
Surrogate   0.01   -      0.05   0.3520 ± 0.2209
Surrogate   0.005  -      0.1    0.3593 ± 0.1977
Surrogate   0.01   -      0.1    0.3973 ± 0.2506

[Figure 15: box plots of MAE for Surrogate G, DISCOMAX, and DISCOMAX H.]
Figure 15: Cross-validation results of the MAE on the respective test set of ten folds for the three most important model variations: DISCOMAX H, DISCOMAX, and Surrogate G (pure Surrogate not shown).

[Figure 16: test-set MAE vs. batch size (16–512) for DISCOMAX (Ours), DISCOMAX H (Ours), and Surrogate G (Hoffmann et al. 2025).]

Figure 16: Test-set MAE of phase-composition predictions as a function of batch size for three solver/loss configurations for a single fold. The Surrogate model is not shown since it performs poorly. The proposed DISCOMAX and DISCOMAX H variants consistently outperform the Surrogate G model across all batch sizes.

To investigate potential chemical sources of error, Figure 18 groups the prediction MAE by the presence of common functional groups in the mixtures. Table 5 lists the frequencies of these groups in both training and test sets. The model performs well across most chemical classes, but exhibits degraded accuracy for mixtures involving carboxylic acids and, to a lesser extent, nitro-containing compounds. These trends likely reflect intrinsic difficulties of modeling highly nonideal systems rather than artifacts of the solver itself.

[Figure 17: test-set MAE vs. batch size (16–512) for DISCOMAX, DISCOMAX G, DISCOMAX H, DISCOMAX H+G, Surrogate, Surrogate G (Hoffmann et al. 2025), Surrogate H, and Surrogate H+G.]

Figure 17: Test-set MAE of phase-composition predictions as a function of batch size for different solver and loss configurations for a single fold. The DISCOMAX+H and DISCOMAX+G variants consistently outperform surrogate-based models across all batch sizes.
Smaller batch sizes yield the best overall performance, whereas large batch sizes substantially degrade accuracy, highlighting a strong interaction between mini-batch statistics and equilibrium-model training dynamics.

Figure 17 depicts the test-set MAE of several solver-loss configurations across a range of batch sizes. Consistent with the single-system results, all DISCOMAX variants (pure, H, G, H+G) outperform the state-of-the-art Surrogate G model for all batch sizes. DISCOMAX G yields slightly inferior performance, indicating that the Hessian loss already provides a strong learning signal and that the additional Gibbs loss offers limited benefit in this setting. All models suffer from higher batch sizes for the chosen learning rate of 0.001. Training was not stable at higher learning rates for some of the surrogate variants. The DISCOMAX H variant appears most robust to varying batch sizes.

Table 4: Full cross-validation results for the test, validation, and train set aggregated over ten folds for the best-performing models on the COSMO-RS dataset (Surrogate batch size 64, DISCOMAX batch size 128, and DISCOMAX H batch size 128). Metrics are shown as mean ± standard deviation.

Model         Test MAE        Test RMSE       Test R²
Surrogate G   0.054 ± 0.003   0.054 ± 0.004   0.982 ± 0.003
DISCOMAX H    0.046 ± 0.003   0.047 ± 0.003   0.986 ± 0.002
DISCOMAX      0.045 ± 0.003   0.047 ± 0.003   0.986 ± 0.002

[Figure 18: heatmap of cross-validation test-set MAE over pairs of functional groups (Alcohol (Hydroxy), Aldehyde, Alkene, Alkyne, Amide, Amine, Aromatic ring (benzene), Carboxylic acid, Carboxylic ester, Ether, Halide, Ketone, Nitrile, Nitro, Thioether (Sulfide), Thiol); color scale MAE 0.02–0.10.]

Figure 18: Cross-validation test-set MAE of phase-composition predictions grouped by the presence of common functional groups within the binary mixtures.
Error patterns reveal strong chemical dependencies. Functional-group frequencies in the train and test sets are reported in Table 5.

Table 5: Common functional groups and counts of how often they are present in the dataset.

Functional Group          Count
Alcohol (Hydroxy)         4161
Halide                    3486
Ether                     2777
Ketone                    2569
Alkene                    2357
Amine                     2217
Aromatic ring (benzene)   2094
Carboxylic ester          1438
Nitrile                   1386
Amide                      899
Aldehyde                   782
Carboxylic acid            481
Nitro                      301
Alkyne                     248
Thioether (Sulfide)        241
Thiol                      117

[Figure 19: heatmap of test-set MAE over τ ∈ {0.0001, 0.001, 0.01, 0.1, 0.5, 1} and τ anneal factor ∈ {0.95, 0.98, 0.99, 1}; values range from about 0.067 to 0.131.]

Figure 19: Influence of the annealing schedule for τ on the test-set MAE, for a single fold, respectively.

[Figure 20: log-scale runtime (ms) vs. batch size (8–8192) for surrogate (cuda), softmax (cpu), surrogate (cpu), cem (cpu), and softmax (cuda).]

Figure 20: Runtimes of the respective algorithms to complete the full predictions for LLE based on a precomputed ∆g_mix array of 101 entries. CEM is only shown for an approximate comparison and is separately computed for repeatedly solving a binary system from the corresponding work. The major differences in the CEM method make a fair comparison difficult.
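The temperature τ and its anneal factor studied in Figure 19 control how sharply a softmax over the discretized ∆g_mix values approximates the exact discrete argmin. A minimal sketch of this soft-selection idea follows; it is not the exact DISCOMAX masked-softmax aggregation, and `soft_argmin` is an illustrative name.

```python
import numpy as np

def soft_argmin(dg_mix, tau):
    """Softmax over -dg_mix / tau: as tau -> 0 the weights collapse
    onto the discrete argmin used for the exact phase-split labels."""
    z = -dg_mix / tau
    z = z - z.max()            # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Geometric annealing of tau during training, mirroring the anneal
# factors on the vertical axis of Figure 19.
tau, anneal_factor = 0.1, 0.99
for _ in range(50):
    tau *= anneal_factor       # tau shrinks toward the hard-argmin limit
```

At small τ the weight vector is nearly one-hot at the grid point of minimum ∆g_mix, while at large τ it is nearly uniform; annealing trades early gradient smoothness for late selection sharpness.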
