Gradual Domain Adaptation via Normalizing Flows
Shogo Sagawa¹ and Hideitsu Hino²,³

¹ Department of Statistical Science, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies (SOKENDAI), Shonan Village, Hayama, Kanagawa 240-0193, Japan
² Department of Statistical Modeling, The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan
³ Center for Advanced Intelligence Project (AIP), RIKEN, 1-4-4 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan

Keywords: Gradual domain adaptation, Normalizing flow

Abstract

Standard domain adaptation methods do not work well when a large gap exists between the source and target domains. Gradual domain adaptation is one of the approaches used to address the problem. It involves leveraging the intermediate domain, which gradually shifts from the source domain to the target domain. In previous work, it is assumed that the number of intermediate domains is large and the distance between adjacent domains is small; hence, the gradual domain adaptation algorithm, involving self-training with unlabeled datasets, is applicable. In practice, however, gradual self-training will fail because the number of intermediate domains is limited and the distance between adjacent domains is large. We propose the use of normalizing flows to deal with this problem while maintaining the framework of unsupervised domain adaptation. The proposed method learns a transformation from the distribution of the target domain to the Gaussian mixture distribution via the source domain. We evaluate our proposed method by experiments using real-world datasets and confirm that it mitigates the above-explained problem and improves the classification performance.

1 Introduction

In a standard problem of learning predictive models, it is assumed that the probability distributions of the test data and the training data are the same. The prediction performance generally deteriorates when this assumption does not hold. The simplest solution is to discard the training data and collect new samples from the distribution of the test data. However, this solution is inefficient and sometimes impossible, and there is a strong demand for utilizing valuable labeled data in the source domain.

Domain adaptation (Ben-David et al., 2007) is one of the transfer learning frameworks in which the probability distributions of the prediction target and the training data are different. In domain adaptation, the source domain is a distribution with many labeled samples, and the target domain is a distribution with only a few or no labeled samples. The case with no labels from the target domain is called unsupervised domain adaptation and has been the subject of much research, including theoretical analysis and real-world application (Ben-David et al., 2007; Cortes et al., 2010; Mansour et al., 2009; Redko et al., 2019; Zhao et al., 2019).

In domain adaptation, the predictive performance on the target data deteriorates when the discrepancy between the source and target domains is large. Kumar et al. (2020) proposed gradual domain adaptation (GDA), in which it is assumed that the shift from the source domain to the target domain occurs gradually and that unlabeled datasets from intermediate domains are available. The key assumption of GDA (Kumar et al., 2020) is that there are many indexed intermediate domains.
The intermediate domains are arranged to connect the source domain to the target domain, and their order is known, or an index is given, starting from the domain closest to the source and ending at the domain closest to the target. These intermediate domains connect the source and target domains densely, so that gradual self-training is possible without the need for labeled data. In practice, however, the number of intermediate domains is limited. Therefore, the gaps between adjacent domains are large, and gradual self-training does not work well.

In this paper, we propose a method that mitigates the problem of large shifts between adjacent domains. Our key idea is to use generative models that learn continuous shifts between domains. Figure 1 shows a schematic of the proposed method. We focus on the normalizing flow (NF) (Papamakarios et al., 2021) as a generative model to describe gradual shifts. We assume that the shifts in the transformation process correspond to gradual shifts between the source and target domains. Normalizing flows can achieve a more natural and direct transformation from the target domain to the source domain compared with other generative models, such as generative adversarial networks (Pan et al., 2019). Inspired by the previous work (Izmailov et al., 2020), which utilizes an NF for semi-supervised learning, we propose a method to utilize an NF for gradual domain adaptation. Our trained NF predicts the class label of a sample from the target domain by transforming the sample to a sample from the Gaussian mixture distribution via the source domain. The transformation between the distribution of the source domain and the Gaussian mixture distribution is learned by leveraging labeled data from the source domain. Note that our proposed method does not use gradual self-training.

The rest of the paper is organized as follows. We review related works in Section 2. Then we explain in detail the gradual domain adaptation algorithm, an important previous work, in Section 3. We introduce our proposed method in Section 4. In Section 5, we present experimental results. The last section is devoted to the conclusion of our study.

Figure 1: Overview of the proposed method. Owing to the limited number of available intermediate domains, the applicability of gradual self-training is limited. Gradual domain adaptation is possible without gradual self-training by utilizing continuous normalizing flow.

2 Related works

We tackle the gradual domain adaptation problem by using the normalizing flow. These topics have been actively researched in recent years, making it challenging to provide a comprehensive review. Here, we introduce a few closely related studies.

2.1 Gradual domain adaptation

In conventional domain adaptation, a model learns the direct transformation between (samples from) the source and target domains. Several methods have been proposed to transfer the source domain to the target domain sequentially (Gadermayr et al., 2018; Gong et al., 2019; Hsu et al., 2020; Choi et al., 2020; Cui et al., 2020; Dai et al., 2021).
Such sequential domain adaptation is realized by using data generated by mixing data from the source and target domains. Kumar et al. (2020) proposed gradual domain adaptation (GDA) and showed that it is possible to adapt across a large domain gap by self-training with unlabeled datasets. It is assumed that the intermediate domains gradually shift from the source domain to the target domain and that the sequence of intermediate domains is given. Chen and Chao (2021) developed a method for the case in which the intermediate domains are available but their indices are unknown. Zhang et al. (2021) and Abnar et al. (2021) proposed to apply the idea of GDA to conventional domain adaptation. Since these methods assume that intermediate domains are unavailable, they use pseudo-intermediate domains. Zhou et al. (2022a) proposed a gradual semi-supervised domain adaptation method that utilizes self-training and label queries. They also provided a new dataset suitable for GDA. Kumar et al. (2020) conducted a theoretical analysis and provided a generalization error bound for gradual self-training. Wang et al. (2022) conducted a theoretical analysis under more general assumptions and derived an improved generalization error bound. Dong et al. (2022) also conducted a theoretical analysis under the condition that all the labels of the intermediate domains are given. He et al. (2023) assumed a scenario where, similar to ours, the available intermediate domains are limited. They propose generating a pseudo-intermediate domain using optimal transport and use self-training to propagate the label information of the source domain.

There are several problem settings similar to that of GDA. Liu et al. (2020) proposed evolving domain adaptation, demonstrating the feasibility of adapting to a target domain that evolves over time through meta-learning. In contrast to GDA, which has only one target domain, evolving domain adaptation assumes that the sequence of target domains is given and aims at achieving accurate prediction over all the target domains. Wang et al. (2020) considered the problem in which there are multiple source domains with indices, which corresponds to the GDA problem where all labels of the intermediate domains can be accessed. Huang et al. (2022) proposed the application of the idea of GDA to reinforcement learning. They propose a method, following a concept similar to curriculum learning (Bengio et al., 2009), that starts with simple tasks and gradually introduces more challenging problems for learning. Multi-source domain adaptation (Zhao et al., 2020) corresponds to GDA under the condition that all labels of the intermediate domains are given while the indices of the intermediate domains are not given. Ye et al. (2022) proposed a method for temporal domain generalization in online recommendation models. Zhou et al. (2022b) proposed an online learning method that utilizes self-training and label queries. Ye et al. (2022) and Zhou et al. (2022b) assumed that the labels of the intermediate domains are given, while Sagawa and Hino (2023) applied multifidelity active learning, assuming access to the labels of the intermediate and target domains at certain costs.

2.2 Normalizing flows

Normalizing flows (NFs) are reversible generative models that use invertible neural networks to transform samples from a known distribution, such as the Gaussian distribution.
NFs are trained by maximum likelihood estimation, where the probability density of the transformed random variable is subject to the change-of-variables formula. The architecture of the invertible neural networks is constrained (e.g., coupling-based architecture) so that the Jacobian matrix can be computed efficiently. NFs with constrained architectures are called discrete NFs (DNFs), and examples include RealNVP (Dinh et al., 2017), Glow (Kingma and Dhariwal, 2018), and Flow++ (Ho et al., 2019). Several theoretical analyses of the expressive power of DNFs have also been reported. Kong and Chaudhuri (2020) studied basic flow models such as planar flows (Rezende and Mohamed, 2015) and proved bounds on the expressive power of basic flow models. Teshima et al. (2020) conducted a more general theoretical analysis of coupling-based flow models. Chen et al. (2018) proposed the continuous normalizing flow (CNF), which mitigates the constraints on the architecture of the invertible neural networks. CNFs describe the transformation between samples from the Gaussian distribution and observed samples from a complicated distribution using ordinary differential equations. Grathwohl et al. (2019) proposed a variant of the CNF called FFJORD, which exhibited improved performance over DNFs. FFJORD was followed by studies that improve the computational efficiency (Huang and Yeh, 2021; Onken et al., 2021) and studies on representations on manifolds (Mathieu and Nickel, 2020; Rozen et al., 2021; Ben-Hamu et al., 2022). Brehmer and Cranmer (2020) suggested that NFs are unsuitable for data that do not populate the entire ambient space. Several normalizing flow models aiming to learn a low-dimensional manifold on which data are distributed and to estimate the density on that manifold have been proposed (Brehmer and Cranmer, 2020; Caterini et al., 2021; Horvat and Pfister, 2021; Kalatzis et al., 2021; Ross and Cresswell, 2021).

Normalizing flows have been applied to several specific tasks, for example, data generation [images (Lu and Huang, 2020), 3D point clouds (Pumarola et al., 2020), chemical graphs (Kuznetsov and Polykovskiy, 2021)], anomaly detection (Kirichenko et al., 2020), and semi-supervised learning (Izmailov et al., 2020). NFs are also used to complement other generative models (Mahajan et al., 2019; Yang et al., 2019; Abdal et al., 2021; Huang et al., 2021) such as generative adversarial networks and variational autoencoders (VAEs) (Zhai et al., 2018).

As an approach to domain adaptation, it is natural to learn domain-invariant representations between the source and target domains, and several methods that utilize NFs have been developed for that purpose. Grover et al. (2020) and Das et al. (2021) proposed domain adaptation methods that combine adversarial training and NFs. These methods separately train two NFs for the source and target domains and execute domain alignment in a common latent space using adversarial discriminators. Askari et al. (2023) proposed a domain adaptation method using NFs and VAEs. The encoder converts samples from the source and target domains into latent variables. The latent space of the source domain is forced to be Gaussian. NFs transform latent variables of the target domain into latent variables of the source domain and predict the class label of the target data.
The above-explained methods do not assume a situation where there is a large discrepancy between the source and target domains, as assumed in GDA.

3 Formulation of gradual domain adaptation

In this section, we introduce the concept and formulation of the gradual domain adaptation proposed by Kumar et al. (2020), which utilizes gradual self-training, and we confirm that a gradual self-training-based method is unsuitable when the discrepancies between adjacent domains are large.

Consider a multiclass classification problem. Let $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{1, 2, \dots, C\}$ be the input and label spaces, respectively. The source dataset has labels, $S = \{(x^{(1)}_i, y^{(1)}_i)\}_{i=1}^{n_1}$, whereas the intermediate datasets and the target dataset do not have labels, $U^{(j)} = \{x^{(j)}_i\}_{i=1}^{n_j}$. The subscript $i$, $1 \leq i \leq n_j$, indicates the $i$-th observed datum, and the superscript $(j)$, $1 \leq j \leq K$, indicates the $j$-th domain. We note that $\{U^{(j)}\}_{j=2}^{K-1}$ are the intermediate datasets and $U^{(K)}$ is the target dataset. The source domain corresponds to $j = 1$, and the target domain corresponds to $j = K$. When $j$ is small, the domain is considered to be similar to the source domain. In contrast, when $j$ is large, the domain is considered to be similar to the target domain. Let $p_j$ be the probability density function of the $j$-th domain. The Wasserstein metrics (Villani, 2009) are used to measure the distance between domains. Kumar et al. (2020) defined the distance between adjacent domains as the per-class $\infty$-Wasserstein distance. Wang et al. (2022) defined the distance between adjacent domains as the $p$-Wasserstein distance, a more general metric. Following Wang et al. (2022), we define the average $p$-Wasserstein distance between consecutive domains as $\rho = \frac{1}{K-1}\sum_{j=2}^{K} W_p\bigl(p_{j-1}(x, y), p_j(x, y)\bigr)$, where $W_p(\cdot, \cdot)$ denotes the $p$-Wasserstein distance.

Kumar et al. (2020) proposed a GDA algorithm that consists of two steps. In the first step, the predictive model for the source domain is trained with the source dataset. Then, by sequential application of self-training, labels of the adjacent domains are predicted. Let $\mathcal{H} = \{h \mid h: \mathcal{X} \to \mathcal{Y}\}$ and $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ be a hypothesis space and a loss function, respectively. We consider training the model $h^{(1)}$ by minimizing the loss on the source dataset,
$$h^{(1)} = \operatorname*{argmin}_{h \in \mathcal{H}} \frac{1}{n_1}\sum_{i=1}^{n_1} \ell\bigl(h(x^{(1)}_i), y^{(1)}_i\bigr).$$
For the joint distribution $p_j(x, y)$, the expected loss with the classifier $h^{(j)}$ is defined as $\epsilon^{(j)}(h^{(j)}) = \mathbb{E}_{x, y \sim p_j(x, y)}[\ell(h^{(j)}(x), y)]$. The classifier of the current domain, $h^{(j)}$, is used to make predictions on the unlabeled dataset $U^{(j+1)} = \{x^{(j+1)}_i\}_{i=1}^{n_{j+1}}$ in the next domain. Let $\mathrm{ST}(h^{(j)}, U^{(j+1)})$ be a function that returns the self-trained model for $x^{(j+1)} \in U^{(j+1)}$ by inputting the current model $h^{(j)}$ and an unlabeled dataset $U^{(j+1)}$:
$$\mathrm{ST}(h^{(j)}, U^{(j+1)}) = \operatorname*{argmin}_{h \in \mathcal{H}} \frac{1}{n_{j+1}}\sum_{i=1}^{n_{j+1}} \ell\bigl(h(x^{(j+1)}_i), h^{(j)}(x^{(j+1)}_i)\bigr).$$
Self-training is only applied between adjacent domains. The output of GDA is the classifier for the target domain, $h^{(K)}$. The classifier $h^{(K)}$ is obtained by applying sequential self-training to the model of the source domain $h^{(1)}$ along the sequence of unlabeled datasets $U^{(2)}, \dots, U^{(K)}$, denoted as
$$h^{(2)} = \mathrm{ST}(h^{(1)}, U^{(2)}), \;\dots,\; h^{(K)} = \mathrm{ST}(h^{(K-1)}, U^{(K)}).$$
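As a concrete reference point for the two-step procedure above, the following minimal sketch implements gradual self-training with an off-the-shelf classifier. The choice of scikit-learn's LogisticRegression and the variable names are illustrative assumptions; any hypothesis class with fit/predict would do.

```python
# Minimal sketch of gradual self-training (Kumar et al., 2020), assuming
# labeled source data (X_src, y_src) and unlabeled datasets
# U_list = [U^(2), ..., U^(K)] ordered from the source toward the target.
from sklearn.linear_model import LogisticRegression

def gradual_self_train(X_src, y_src, U_list):
    # Step 1: fit h^(1) by minimizing the loss on the labeled source dataset.
    h = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    # Step 2: ST(h^(j), U^(j+1)) -- pseudo-label the next domain with the
    # current classifier and refit; self-training is applied only between
    # adjacent domains.
    for U_next in U_list:
        pseudo_labels = h.predict(U_next)
        h = LogisticRegression(max_iter=1000).fit(U_next, pseudo_labels)
    return h  # h^(K), the classifier for the target domain
```

When the gaps between adjacent domains are large, the pseudo-labels produced at each step become unreliable and the errors accumulate along the chain, which is the failure mode quantified by the bounds discussed next.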
Kumar et al. (2020) provided the first generalization error bound for gradual self-training, of the form $e^{O(K-1)}\bigl(\epsilon^{(1)}(h^{(1)}) + O(\sqrt{\log(K-1)/n})\bigr)$, where the sample size in each domain is assumed to be the same, $n_2 = \dots = n_K = n$, without loss of generality. Wang et al. (2022) conducted a theoretical analysis under more general assumptions and derived an improved generalization bound:
$$\epsilon^{(1)}(h^{(1)}) + \tilde{O}\!\left(\rho \cdot (K-1) + \frac{K-1}{\sqrt{n}} + \frac{1}{\sqrt{n(K-1)}}\right). \tag{1}$$
When we consider the problem in which the number of accessible intermediate domains is limited, it is natural that the distance between adjacent domains, $\rho$, becomes large. When this occurs, Eq. (1) suggests that the bound on the expected loss of the target classifier becomes loose. To tackle the problem of a large distance between adjacent domains, we propose a method utilizing generative models.

4 Proposed method

We consider a GDA problem with large discrepancies between adjacent domains. As discussed in Section 3, the applicability of gradual self-training-based methods (Kumar et al., 2020; Wang et al., 2022; He et al., 2023) is limited in this situation. Our proposed GDA method mitigates the problem without gradual self-training. Our key idea is to utilize NFs to learn gradual shifts between domains, as detailed in Section 4.1. Conventional NFs only learn the transformation between an unknown distribution and the standard Gaussian distribution, whereas GDA requires transformation between unknown distributions. Section 4.2 introduces a nonparametric likelihood estimator and shows how the likelihood of the transformation between samples from adjacent domains is evaluated. We consider a multiclass classification problem; hence, our flow-based model learns the transformation between the distribution of the source domain and a Gaussian mixture distribution, as detailed in Section 4.3. We discuss the theoretical aspects and the scalability of the proposed method in Sections 4.4 and 4.5, respectively.

4.1 Learning gradual shifts with normalizing flows

An NF uses an invertible function $f: \mathbb{R}^d \to \mathbb{R}^d$ to transform a sample $x \in \mathbb{R}^d$ from the complicated distribution $p(x)$ to a sample $z \in \mathbb{R}^d$ from the standard Gaussian $p_0(z)$. The log density of $x = f(z)$ satisfies
$$\log p(x) = \log p_0(f^{-1}(x)) + \log\left|\det \nabla f^{-1}(x)\right|,$$
where $\nabla f^{-1}(x)$ is the Jacobian of $f^{-1}$.

Our aim is to learn the continuous change $x^{(K)} \mapsto \dots \mapsto x^{(1)} \mapsto z$ by using NFs. We consider a continuous transformation, $x^{(K)} = f^{(K)}(x^{(K-1)}), \dots, x^{(1)} = f^{(1)}(z)$, by using multiple NFs $f^{(K)}, \dots, f^{(1)}$. Our preliminary experiments indicate that continuous normalizing flows (CNFs) are better suited for capturing continuous transitions between domains than discrete normalizing flows. Further details of these experiments are given in Section 5.3.

To learn gradual shifts between domains with CNFs, we regard the index of each domain $j$ as a continuous variable. We introduce a time index $t \in \mathbb{R}_+$ to represent continuous changes of domains, and the index of each domain $j$ is considered as a particular time point. The probability density function $p_j$ is a special case of $p_t$ when $t = j$. The CNF $g: \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d$ outputs a transformed variable dependent on time $t$, and we consider $f^{(t)}(\cdot)$ as $g(\cdot, t)$, following the standard notation of CNFs. We set $t = 0$ and $t = j$ for $z$ and $x^{(j)}$, respectively.
Note that $z = g(z, 0)$ and $x^{(j)} = g(x^{(j)}, j)$. Let $v$ be a neural network parameterized by $\omega$ that represents the change in $g$ along $t$. Following Chen et al. (2018) and Grathwohl et al. (2019), we express the ordinary differential equation (ODE) with respect to $g$ using the neural network $v$ as $\partial g / \partial t = v(g(\cdot, t), t; \omega)$. The parameter $\omega$ of the neural network $v$ implicitly defines the CNF $g$. When we specifically refer to the parameter of $g$, we denote it as $g_\omega$.

An NF requires an explicit computation of the Jacobian, and in a CNF, it is calculated by integrating the time derivative of the log-likelihood. The time derivative of the log-likelihood is expressed as $\partial \log p(g) / \partial t = -\mathrm{Tr}(\partial v / \partial g)$ (Chen et al., 2018, Theorem 1). The outputs of the CNF are acquired by solving an initial value problem. Since our goal is to learn the gradual shifts between domains using CNFs, we consider solving the initial value problem sequentially. To formulate the problem, we introduce the function $\tau$ that relates the time index $t$ to the input variable as follows:
$$\tau(t) = \begin{cases} x^{(t)}, & \text{if } t \geq 1, \\ z, & \text{otherwise.} \end{cases}$$
We assign $t_0 = j - 1$ and $t_1 = j$, and the output of the CNF for the input $x^{(j)}$ is obtained by solving the following initial value problem:
$$\begin{bmatrix} \tau(t_0) \\ \Delta_1 \end{bmatrix} = \int_{t_1}^{t_0} \begin{bmatrix} v\bigl(g(x^{(j)}, t), t; \omega\bigr) \\ -\mathrm{Tr}\left(\dfrac{\partial v}{\partial g}\right) \end{bmatrix} dt, \qquad \begin{bmatrix} g(x^{(j)}, t_1) \\ \Delta_0 \end{bmatrix} = \begin{bmatrix} x^{(j)} \\ 0 \end{bmatrix}, \tag{2}$$
where $\Delta_1 = \log p_{t_1}(\tau(t_1)) - \log p_{t_0}(\tau(t_0))$ and $\Delta_0 = \log p_{t_1}(\tau(t_1)) - \log p_{t_1}(g(x^{(j)}, t_1))$. To generate the shift between multiple domains, when $j > 1$, the initial value problem is solved sequentially with decreasing values of $t_1$ and $t_0$ until $t_1 = 1$ and $t_0 = 0$. For instance, when $j = 2$, we solve the initial value problem given by Eq. (2) and retain the solutions. In the next iteration, we decrease the values of $t_1$ and $t_0$ and utilize the retained values as initial values. A CNF is formulated as a problem of maximizing the following log-likelihood with respect to the parameter $\omega$:
$$\log p_j\bigl(g_\omega(x^{(j)}, j)\bigr) = \sum_{t=1}^{j} \log p_{t-1}\bigl(g_\omega(x^{(j)}, t-1)\bigr) - \int_0^j \mathrm{Tr}\left(\frac{\partial v}{\partial g_\omega}\right) dt. \tag{3}$$

4.2 Nonparametric estimation of the log-likelihood

The transformation of a sample from the $j$-th domain to a sample from the adjacent domain using a CNF requires the computation of the log-likelihood $\log p_{t-1}(g_\omega(x^{(t)}, t-1))$, where $t = j$. We use a $k$-nearest-neighbor ($k$NN) estimator for the log-likelihood (Kozachenko and Leonenko, 1987; Goria et al., 2005). We compute the Euclidean distance between all samples in $\{x^{(t-1)}_i\}_{i=1}^{n_{t-1}}$ and $g_\omega(x^{(t)}, t-1)$, with $g_\omega(x^{(t)}, t-1)$ kept fixed. Let $\delta^k_{t-1}(g_\omega(x^{(t)}, t-1))$ be the Euclidean distance between $g_\omega(x^{(t)}, t-1)$ and its $k$-th nearest neighbor in $\{x^{(t-1)}_i\}_{i=1}^{n_{t-1}}$. The log-likelihood of the sample $g_\omega(x^{(t)}, t-1)$ is estimated as
$$\log p_{t-1}\bigl(g_\omega(x^{(t)}, t-1)\bigr) \propto -d \log \delta^k_{t-1}\bigl(g_\omega(x^{(t)}, t-1)\bigr). \tag{4}$$
When training our flow-based model, we estimate the log-likelihood using Eq. (4) for all samples in the $t$-th domain $\{x^{(t)}_i\}_{i=1}^{n_t}$ and maximize its sample average $-\frac{d}{n_t}\sum_{i=1}^{n_t} \log \delta^k_{t-1}(g_\omega(x^{(t)}_i, t-1))$ with respect to $\omega$. For simplicity, we consider the case $n_t = n_{t-1} = n$. The cost of computing the log-likelihood by the $k$NN estimator is $O(n^2)$.
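The following is a minimal sketch (not the authors' implementation) of how Eqs. (2)–(4) combine into one training term for a pair of adjacent domains. The adaptive ODE solver is replaced by a fixed-step Euler scheme, the exact Jacobian trace is computed coordinate-wise (only practical for small $d$), and the velocity network `v_net`, its input convention (concatenated `[g, t]`), and `n_steps` are illustrative assumptions.

```python
import torch

def adjacent_domain_loss(v_net, x_j, x_prev, t1, t0, k=5, n_steps=20):
    """Negative estimated log-likelihood of samples x_j (domain t1 = j) after
    flowing them toward domain t0 = j - 1 and scoring them with Eq. (4)."""
    n, d = x_j.shape
    g = x_j.clone().requires_grad_(True)
    trace_integral = torch.zeros(n)            # approximates int_{t0}^{t1} Tr(dv/dg) dt
    dt = (t1 - t0) / n_steps
    for step in range(n_steps):
        t = torch.full((n, 1), t1 - step * dt)   # integrate backward from t1 to t0
        dgdt = v_net(torch.cat([g, t], dim=1))
        trace = torch.zeros(n)
        for i in range(d):                       # exact trace of the Jacobian dv/dg
            trace = trace + torch.autograd.grad(
                dgdt[:, i].sum(), g, create_graph=True)[0][:, i]
        trace_integral = trace_integral + trace * dt
        g = g - dgdt * dt                        # Euler step toward domain t0
    # kNN estimate of log p_{t0} at the transformed points, Eq. (4), up to a constant
    delta_k = torch.cdist(g, x_prev).topk(k, largest=False).values[:, -1]
    logp_prev = -d * torch.log(delta_k + 1e-12)
    # log p_{t1}(x) ~= log p_{t0}(g(x, t0)) - int_{t0}^{t1} Tr(dv/dg) dt; minimize its negative
    return -(logp_prev - trace_integral).mean()
```

Summing such terms over $t = j, j-1, \dots, 2$ and adding the labeled Gaussian-mixture term introduced in Section 4.3 reproduces the structure of Eqs. (3) and (7); an adaptive ODE solver, such as the one in the implementation used in Section 5, would replace the Euler loop.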
During the training of the CNF $g_\omega$, the computation of the log-likelihood by the $k$NN estimator is required each time the CNF $g_\omega$ is updated. We use the nearest neighbor descent algorithm (Dong et al., 2011) to reduce the computational cost of the $k$NN estimator. The algorithm can be used to construct $k$NN graphs efficiently, and its computational cost is empirically evaluated to be $O(n^{1.14})$.

Another way to estimate the log-likelihood $\log p_{t-1}(g(x^{(t)}, t-1))$ is to approximate $p_{t-1}$ using a surrogate function. Our preliminary experiments show that the $k$NN estimators are suitable for learning continuous changes between domains; the details of the experiments are described in Section 5.4.

4.3 Gaussian mixture model

An NF transforms observed samples into samples from a known probability distribution. Izmailov et al. (2020) proposed a semi-supervised learning method with DNFs using a Gaussian mixture distribution as the known probability distribution. In this subsection, we explain the Gaussian mixture model (GMM) suitable for our proposed method and the log-likelihood of the flow-based model with respect to the GMM.

The distribution $p_0$, conditioned on the label $s$, is modeled by a Gaussian with mean $\mu_s$ and covariance matrix $\Sigma_s$, $p_0(z \mid y = s) = \mathcal{N}(z \mid \mu_s, \Sigma_s)$. Following Izmailov et al. (2020), we assume that the classes $\{1, 2, \dots, C\}$ are balanced, i.e., $\forall s \in \{1, 2, \dots, C\}$, $p(y = s) = 1/C$, and the Gaussian mixture distribution is $p_0(z) = \frac{1}{C}\sum_{s=1}^{C} \mathcal{N}(z \mid \mu_s, \Sigma_s)$. The Gaussians for different labels should be distinguishable from each other, and it is desirable that an appropriate mean $\mu_s$ and covariance matrix $\Sigma_s$ are assigned to each Gaussian distribution. We set an identity matrix as the covariance matrix for all classes, $\Sigma_s = I$. We propose to assign the mean vector $\mu_s = [\mu_{s1}, \dots, \mu_{sd}]^\top$ using the polar coordinate system. Each component of the mean vector is given by
$$\mu_{si} = \begin{cases} r \cos\theta_s (\sin\theta_s)^{i-1}, & i = 1, \dots, d-1, \\ r (\sin\theta_s)^{d-1}, & i = d, \end{cases} \tag{5}$$
where $r$ is the distance from the origin in the polar coordinate system and the angle is $\theta_s = 2\pi(s-1)/C$, $\forall s \in \{1, 2, \dots, C\}$. Note that $r$ is a hyperparameter.

Since the source domain has labeled data, by using Eq. (3), we can obtain the class-conditional log-likelihood of a labeled sample as
$$\log p_1\bigl(g_\omega(x^{(1)}, 1) \mid y = s\bigr) = \log \mathcal{N}\bigl(g_\omega(x^{(1)}, 0) \mid \mu_s, \Sigma_s\bigr) - \int_0^1 \mathrm{Tr}\left(\frac{\partial v}{\partial g_\omega}\right) dt. \tag{6}$$
The intermediate domains and the target domain have no labeled data. The log-likelihood of an unlabeled sample is given by
$$\log p_j\bigl(g_\omega(x^{(j)}, j)\bigr) = \log\left\{\frac{1}{C}\sum_{s=1}^{C} \mathcal{N}\bigl(g_\omega(x^{(j)}, 0) \mid \mu_s, \Sigma_s\bigr)\right\} - \sum_{t=2}^{j} d \log \delta^k_{t-1}\bigl(g_\omega(x^{(j)}, t-1)\bigr) - \int_0^j \mathrm{Tr}\left(\frac{\partial v}{\partial g_\omega}\right) dt. \tag{7}$$
Namely, to learn the gradual shifts between domains, we maximize the log-likelihood of our flow-based model $g_\omega$ on all the data from the initially given domains. The algorithm minimizes the following objective function with respect to the flow-based model $g_\omega$:
$$\mathcal{L}\bigl(\omega; S, \{U^{(j)}\}_{j=2}^{K}\bigr) = -\frac{1}{n_1}\sum_{i=1}^{n_1} \log p_1\bigl(g_\omega(x^{(1)}_i, 1) \mid y_i\bigr) - \sum_{j=2}^{K} \frac{1}{n_j}\sum_{i=1}^{n_j} \log p_j\bigl(g_\omega(x^{(j)}_i, j)\bigr).$$
We show a pseudocode of our proposed method in Algorithm 1.
Our method has two hyperparameters, $k$ and $r$: $k$ affects the computation of the log-likelihood by the $k$NN estimator, whereas $r$ controls the distance between the Gaussians corresponding to each class. We discuss how to tune these hyperparameters in Section 5.5.

We consider making predictions for a new sample by using our flow-based model. The predictive probability that the class of a given test sample $x$ is $s$ is
$$p(y = s \mid x) = \frac{p(x \mid y = s)\, p(y = s)}{\sum_{s'=1}^{C} p(x \mid y = s')\, p(y = s')} = \frac{\mathcal{N}\bigl(g_\omega(x, 0) \mid \mu_s, \Sigma_s\bigr)}{\sum_{s'=1}^{C} \mathcal{N}\bigl(g_\omega(x, 0) \mid \mu_{s'}, \Sigma_{s'}\bigr)}. \tag{8}$$
Therefore, the class label of a new sample $x$ is predicted by
$$\hat{y} = \operatorname*{argmax}_{s \in \{1, \dots, C\}} p(y = s \mid x).$$

Algorithm 1: Gradual Domain Adaptation with CNF
Input: labeled dataset $S$ and unlabeled datasets $U^{(2)}, \dots, U^{(K)}$
Output: trained CNF $g_\omega$
1:  $j \leftarrow K$  ▷ start training from the target domain
2:  while $j > 0$ do
3:      $t_0 \leftarrow j - 1$, $t_1 \leftarrow j$
4:      initial values are set to $x^{(j)}$ and $0$
5:      while $t_0 > 0$ do  ▷ the initial value problem is solved sequentially
6:          solve the initial value problem of Eq. (2) to obtain $g_\omega(x^{(j)}, t_0)$ and $-\mathrm{Tr}(\partial v / \partial g)$
7:          retain the solutions  ▷ these values will be used as initial values in the next iteration
8:          $t_0 \leftarrow t_0 - 1$, $t_1 \leftarrow t_1 - 1$
9:      end while
10:     if $j = 1$ then  ▷ update the CNF $g_\omega$
11:         maximize the log-likelihood of labeled data, Eq. (6), with respect to $\omega$
12:     else
13:         maximize the log-likelihood of unlabeled data, Eq. (7), with respect to $\omega$
14:     end if
15:     $j \leftarrow j - 1$  ▷ training on the adjacent domain
16: end while
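As an illustration of the prediction rule, the following sketch builds the class means of Eq. (5) and evaluates the posterior of Eq. (8); with identity covariances, the posterior reduces to a softmax over negative squared distances to the means. The callable `flow_to_base`, standing for the trained CNF $g_\omega(\cdot, 0)$, is an assumed interface rather than part of the authors' code.

```python
# Sketch of Eqs. (5) and (8): polar-coordinate class means and the resulting
# class posterior. Sigma_s = I for every class, as in the paper.
import math
import torch

def class_means(C, d, r):
    """Mean vectors mu_s assigned with the polar coordinate system, Eq. (5)."""
    mus = torch.zeros(C, d)
    for s in range(1, C + 1):
        theta = 2 * math.pi * (s - 1) / C
        for i in range(1, d):                    # components i = 1, ..., d-1
            mus[s - 1, i - 1] = r * math.cos(theta) * math.sin(theta) ** (i - 1)
        mus[s - 1, d - 1] = r * math.sin(theta) ** (d - 1)
    return mus

def predict(flow_to_base, x, mus):
    """argmax_s p(y = s | x) of Eq. (8); labels are returned in {1, ..., C}."""
    z = flow_to_base(x)                          # g_omega(x, 0)
    sq_dist = torch.cdist(z, mus) ** 2           # ||z - mu_s||^2 for every class
    posterior = torch.softmax(-0.5 * sq_dist, dim=1)   # proportional to N(z | mu_s, I)
    return posterior.argmax(dim=1) + 1
```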
4.4 Theoretical aspects of the flow-based model

Our proposed method maximizes the log-likelihood of labeled data with respect to $\omega$ given by Eq. (6). We utilize NFs to model the distribution of inputs $p(x \mid y)$, and via Eq. (8), our proposed method also implicitly models the distribution of outputs $p(y \mid x)$. We denote the expected loss on the source domain as follows:
$$\mathcal{L}_1(g_\omega) = \mathbb{E}_{x, y \sim p_1(x, y)}[-\log p(y \mid x)] = \mathbb{E}_{x, y \sim p_1(x, y)}\left[-\log \frac{\mathcal{N}\bigl(g_\omega(x, 0) \mid \mu_y, \Sigma_y\bigr)}{\sum_{s=1}^{C} \mathcal{N}\bigl(g_\omega(x, 0) \mid \mu_s, \Sigma_s\bigr)}\right].$$
Similarly, the expected loss on the $j$-th domain is denoted as $\mathcal{L}_j(g_\omega) = \mathbb{E}_{x, y \sim p_j(x, y)}[-\log p(y \mid x)]$. For notational simplicity, we omit the argument of the expected loss and denote it as $\mathcal{L}_1$ and $\mathcal{L}_j$. Note that $p(y \mid x)$ is a probability mass function, and we consider the following natural assumption.

Assumption 1 (Nguyen et al. (2022)). For some $M \in \mathbb{R}_{\geq 0}$, the loss satisfies $0 \leq -\log p(y \mid x) \leq M$, $\forall x \in \mathcal{X}$, $\forall y \in \mathcal{Y}$.

Nguyen et al. (2022) derived an upper bound for the loss $\mathcal{L}_2$ on the basis of the source loss $\mathcal{L}_1$ and the Kullback–Leibler (KL) divergence between $p_2$ and $p_1$. Note that in standard domain adaptation, there is no intermediate domain and $K = 2$.

Proposition 1 (Nguyen et al. (2022)). If Assumption 1 holds, we have
$$\mathcal{L}_2 \leq \mathcal{L}_1 + \frac{M}{\sqrt{2}}\sqrt{\mathrm{KL}[p_2(x) \mid p_1(x)] + D_1}, \tag{9}$$
where $\mathrm{KL}[\cdot \mid \cdot]$ denotes the KL divergence and $D_t = \mathbb{E}_{p_{t+1}(x)}\bigl[\mathrm{KL}[p_{t+1}(y \mid x) \mid p_t(y \mid x)]\bigr]$.

All proofs are provided in Appendix B. In Eq. (9), we note that it is impossible to calculate the conditional misalignment term $\mathbb{E}_{p_2(x)}[\mathrm{KL}[p_2(y \mid x) \mid p_1(y \mid x)]]$ since the labels from the second domain are not given. We make the following covariate shift assumption (this assumption can be slightly relaxed by setting $\mathrm{KL}[p_{t+1}(y \mid x) \mid p_t(y \mid x)] \leq \varepsilon_t$ with small constants $\varepsilon_t \geq 0$, $\forall t \in \{1, \dots, K-1\}$).

Assumption 2 (Shimodaira (2000)). For any $t \in \{1, 2, \dots, K\}$, $p_t(y \mid x) = p_{t+1}(y \mid x)$, $\forall x \in \mathcal{X}$, $\forall y \in \mathcal{Y}$.

We can reduce the marginal misalignment term $\mathrm{KL}[p_2(x) \mid p_1(x)]$ by transforming $p_2$ to $p_1$ with NFs. A CNF transforms a sample $x \sim p_{t+1}(x)$ to a sample from the probability distribution $p_t(x)$ of the adjacent domain, and the likelihood of the model satisfies
$$\log p_{t+1}\bigl(g_\omega(x, t+1)\bigr) = \log p_t\bigl(g_\omega(x, t)\bigr) - \int_t^{t+1} \mathrm{Tr}\left(\frac{\partial v}{\partial g_\omega}\right) dt.$$
The expectation of the negative log-likelihood function to be minimized is given by
$$\mathbb{E}_{p_{t+1}(x)}\bigl[-\log p_{t+1}\bigl(g_\omega(x, t+1)\bigr)\bigr]. \tag{10}$$
Onken et al. (2021) derived the following proposition.

Proposition 2 (Onken et al. (2021)). The minimization of Eq. (10) is equivalent to the minimization of the KL divergence between $p_t(x)$ and $p_{t+1}(x)$ transformed by $g_\omega$.

Let $p^*_{t+1}(x)$ be the flowed distribution obtained by transforming $p_{t+1}$ using $g$ from time $t+1$ to $t$. Proposition 2 allows us to rewrite Eq. (9) as follows:
$$\mathcal{L}_2 \leq \mathcal{L}_1 + \frac{M}{\sqrt{2}}\sqrt{\mathrm{KL}[p^*_2(x) \mid p_1(x)] + D_1}. \tag{11}$$
We extend Eq. (11) and introduce the following corollary, which gives an upper bound of the target loss $\mathcal{L}_K$.

Corollary 1. If Assumptions 1 and 2 hold, we have
$$\mathcal{L}_K \leq \mathcal{L}_1 + \frac{M}{\sqrt{2}}\sum_{t=2}^{K}\sqrt{\mathrm{KL}[p^*_t(x) \mid p_{t-1}(x)]}.$$
In gradual domain adaptation, it is assumed that there is a large discrepancy between the source and target domains; hence, it is highly likely that $p_K(x)/p_1(x) \to \infty$ for some $x$ and the KL divergence between the marginal distributions of the source and the target is not well-defined. Corollary 1 suggests that our proposed method avoids this risk by utilizing the intermediate domains and that the target loss $\mathcal{L}_K$ is bounded. In Section 5.6, we show that the target loss $\mathcal{L}_K$ is large when our flow-based model $g_\omega$ is trained only with the source and target datasets.

Compared with the upper bound for self-training-based GDA, Eq. (1), which depends only on the loss in the source domain and the number of intermediate domains, our bound is adaptive in that it incorporates the trained CNF in the KL-divergence terms. The self-training procedure contains hyperparameters such as the number of epochs and learning rates, and how to determine appropriate values for them on the basis of Eq. (1) is non-trivial. In contrast, Corollary 1 is free from self-training.

4.5 Scalability

Grathwohl et al. (2019) discussed the details of the scalability of a CNF. They assumed that the cost of evaluating a CNF $g$ is $O(dH)$, where $d$ is the dimension of the input and $H$ is the size of the largest hidden unit in $v$. They derived the cost of computing the likelihood as $O(dHN)$, where $N$ is the number of evaluations of $g$ in the ODE solver. In general, the training cost of the CNF $g$ is high because the number of evaluations $N$ of $g$ in the ODE solver is large. When $K$ domains are given, we have to solve the initial value problem $\frac{1}{2}K(K+1)$ times in our proposed method, resulting in a computational cost of $O(dHNK^2)$, which is expensive when the number of intermediate domains is large. When we can access many intermediate domains and the distances between the intermediate domains are small, the conventional self-training-based GDA algorithm (Kumar et al., 2020) will be suitable.

5 Experiments

We use the implementation of the CNF provided by Huang and Yeh (2021) (https://github.com/hanhsienhuang/CNF-TPR).
PyNNDescent (https://github.com/lmcinnes/pynndescent/tree/master) provides a Python implementation of nearest neighbor descent (Dong et al., 2011). PyTorch (Paszke et al., 2019) is used to implement all procedures in our proposed method except for the CNF and nearest neighbor descent. The details of the experiments, such as the composition of the neural networks, are presented in Appendix A. We use WILDS (Koh et al., 2021) and MoleculeNet (Wu et al., 2018) to load pre-processed datasets. All experiments are conducted on our server with Intel Xeon Gold 6354 processors and an NVIDIA A100 GPU. Source code to reproduce the experimental results is available at https://github.com/ISMHinoLab/gda_via_cnf .

Table 1: Summary of datasets. Rotating MNIST, Portraits, and RxRx1 are image datasets. Two Moon and Block are toy datasets. Block has two intermediate domains.

    Name                                      # Dimensions   # Classes   # Samples (Source / Intermediate / Target)
    Two Moon                                  2              2           1,788 / 1,833 / 1,818
    Block                                     2              5           1,810 / 1,840 and 1,845 / 1,840
    Rotating MNIST (Kumar et al., 2020)       28 x 28        10          2,000 / 2,000 / 2,000
    Portraits (Ginosar et al., 2015)          32 x 32        2           2,000 / 2,000 / 2,000
    SHIFT15M (Kimura et al., 2021)            4,096          7           5,000 / 5,000 / 5,000
    RxRx1 (Taylor et al., 2019)               3 x 32 x 32    4           9,856 / 9,856 / 6,856
    Tox21 NHOHCount (Thomas et al., 2018)     108            2           3,284 / 1,898 / 807
    Tox21 RingCount (Thomas et al., 2018)     108            2           1,781 / 2,308 / 1,131
    Tox21 NumHDonors (Thomas et al., 2018)    108            2           3,246 / 2,285 / 835

5.1 Datasets

We use benchmark datasets with modifications for GDA. Since we are considering the situation in which the distances between adjacent domains are large, we prepare only one or two intermediate domains. We summarize the information on the datasets used in our experiments in Table 1. The details of the datasets are shown in Appendix A.

5.2 Experimental settings

Brehmer and Cranmer (2020) mentioned that NFs are unsuitable for data that do not populate the entire ambient space. Therefore, we apply UMAP (McInnes et al., 2018) to each dataset as preprocessing. UMAP can use partial labels for semi-supervised learning; hence, we utilize the labels of the source dataset as the partial labels. To determine the appropriate embedding dimension, we train a CNF $g_\omega$ on the dimension-reduced source dataset by maximizing Eq. (6) with respect to $\omega$. After the training, we evaluate the accuracy on the dimension-reduced source dataset. We select the embedding dimension for which the mean accuracy of three-fold cross-validation on the dimension-reduced source domain is the best. All parameters in UMAP are set to their default values.
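The preprocessing just described might look as follows; this is a sketch under the assumption that the semi-supervised mode of the umap-learn package (unlabeled points marked with -1) is used, and the array names are illustrative, not the authors' script.

```python
# Sketch of the UMAP preprocessing: embed all domains jointly, providing only
# the source labels and marking every other sample as unlabeled with -1.
import numpy as np
import umap

def embed_all_domains(X_src, y_src, unlabeled_domains, n_components):
    X_all = np.vstack([X_src] + list(unlabeled_domains))
    n_unlabeled = sum(len(U) for U in unlabeled_domains)
    y_partial = np.concatenate([y_src, -np.ones(n_unlabeled, dtype=int)])
    reducer = umap.UMAP(n_components=n_components)   # other parameters left at defaults
    Z_all = reducer.fit_transform(X_all, y=y_partial)
    # split the embedding back into per-domain blocks, in the original order
    sizes = [len(X_src)] + [len(U) for U in unlabeled_domains]
    return np.split(Z_all, np.cumsum(sizes)[:-1])
```

The embedding dimension `n_components` would then be chosen by the cross-validation procedure on the embedded source dataset described above.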
5.3 Learning gradual shifts with discrete normalizing flows

We aim to learn the continuous change $x^{(K)} \mapsto \dots \mapsto x^{(1)} \mapsto z$ by using NFs. In principle, we can map the distribution of the target domain to the Gaussian mixture distribution via the intermediate and source domains by utilizing multiple discrete NFs $f^{(K)}, \dots, f^{(1)}$. However, it is empirically shown that DNFs are unsuitable for learning continuous change. We use RealNVP (Dinh et al., 2017) as a DNF. One DNF block consists of four fully connected layers with 64 nodes in each layer. To improve the expressive power of the flow-based model, we stack three DNF blocks. The output from each DNF block should vary continuously. We train the DNFs with the source and intermediate datasets on the Two Moon dataset.

Figure 2 shows the transformation of the intermediate data to the source data. We see that the CNF continuously transforms the intermediate data into the source data. On the other hand, the DNFs fail to transform the intermediate data into the source data, and the path of the transformation is not continuous. From this preliminary experiment, we adopt the continuous normalizing flow to realize the proposed gradual domain adaptation method. As a byproduct, the proposed method can generate synthetic data from any intermediate domain even when no sample from that domain is observed.

Figure 2: Comparison between the discrete and the continuous normalizing flows. Whereas continuous NFs are suitable for learning continuous change, discrete NFs are unsuitable for learning continuous change.

5.4 Estimation of the log-likelihood by fitting a Gaussian mixture distribution

Our proposed method learns the transformation of a sample from the $j$-th domain to a sample from the adjacent domain, and it requires the estimation of the log-likelihood $\log p_{t-1}(g_\omega(x^{(t)}, t-1))$, where $t = j$. We proposed a computation method for the log-likelihood by using $k$NN estimators in Section 4.2. Here, as another way of estimating the log-likelihood, we consider the approximation of $p_{t-1}$ by a Gaussian mixture distribution. Let $Q$ and $w^{(t-1)}_q$ be the number of mixture components and a mixture weight, respectively. The subscript $q$ indicates the $q$-th component of the Gaussian mixture distribution, and the superscript indicates the domain. We approximate the adjacent domain with the Gaussian mixture distribution
$$p_{t-1}(x^{(t-1)}) = \sum_{q=1}^{Q} w^{(t-1)}_q \mathcal{N}\bigl(x^{(t-1)} \mid \mu^{(t-1)}_q, \Sigma^{(t-1)}_q\bigr), \qquad \sum_{q=1}^{Q} w^{(t-1)}_q = 1,$$
where $\mu^{(t-1)}_q$ and $\Sigma^{(t-1)}_q$ are the mean vector and the covariance matrix, respectively. We fit a Gaussian mixture distribution for each domain; therefore, we distinguish the mixture weight, the mean vector, and the covariance matrix with superscripts. We assign a sufficiently large value to $Q$ since our aim is the estimation of the log-likelihood $\log p_{t-1}(g(x^{(t)}, t-1))$.

We compare the log-likelihood estimation by fitted Gaussian mixture distributions with that by $k$NN estimators on the Two Moon dataset and conclude that the $k$NN estimators are suitable for our proposed method. Since our toy dataset consists of two simple moon shapes, $Q = 30$ should be enough for modeling its distribution with high precision. Figure 3 shows the comparison of the transformation by the CNF trained with $k$NN estimators and that trained with the Gaussian mixture distributions for evaluating the likelihood. Whereas the CNF trained with $k$NN estimators transforms the target data to the source data as expected, the CNF trained with fitted Gaussian mixture distributions fails to do so.

Figure 3: Comparison of the methods of estimating $\log p_{t-1}(g(x^{(t)}, t-1))$. The CNF trained with $k$NN estimators transforms the target data to the source data as expected.
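For completeness, the surrogate estimator compared above can be sketched in a few lines; the choice of scikit-learn's GaussianMixture and the array names are assumptions made for illustration.

```python
# Sketch of the alternative estimator: approximate p_{t-1} by a Q-component
# Gaussian mixture fitted to the (t-1)-th dataset and evaluate the log-density
# of the transformed samples g_omega(x^(t), t-1) under it.
from sklearn.mixture import GaussianMixture

def surrogate_log_likelihood(X_prev, g_transformed, Q=30):
    gmm = GaussianMixture(n_components=Q).fit(X_prev)
    return gmm.score_samples(g_transformed)   # per-sample log-density estimates
```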
5.5 Hyperparameters

The proposed method has two hyperparameters, $k$ and $r$, introduced in Sections 4.2 and 4.3, respectively. The hyperparameter $k$ affects the computation of the log-likelihood by the $k$NN estimator, and the hyperparameter $r$ controls the distance between the Gaussian distributions corresponding to each class. In this section, from a practical viewpoint, we discuss how to tune these hyperparameters.

First, we discuss the tuning method for $k$. We estimate the log-likelihood by using a $k$NN estimator when learning the transformation between consecutive domains. In general, the parameter $k$ controls the trade-off between bias and variance. The optimal $k$ also depends on whether the given dataset has local fine structures or not. We determine an appropriate $k$ by fitting a $k$NN classifier on the source dataset. We train the $k$NN classifier with only the source dataset and determine $k$ from the result of three-fold cross-validation. We vary the hyperparameter $k$ and train our flow-based model $g_\omega$. After the training, we evaluate the accuracy on the target dataset. The evaluation of our flow-based model was repeated three times using different initial weights of the neural networks. In Figure 4, the red dashed line denotes the $k$ determined using the result of the $k$NN classifier fitting. The appropriate $k$ can be determined roughly by the fitting of the $k$NN classifier on the source dataset.

Figure 4: Experimental results of the training of our flow-based model with various hyperparameter $k$ values (Rotating MNIST, Portraits, SHIFT15M, RxRx1, and the three Tox21 tasks). The appropriate $k$ can be determined roughly by the fitting of the $k$NN classifier on the source dataset.

Next, we discuss a tuning method for $r$. To conduct gradual domain adaptation with an NF, our flow-based model learns the transformation between the distribution of the source domain and a Gaussian mixture distribution. The Gaussian distributions for different labels should be distinguishable from each other. We assign the mean of each Gaussian distribution with the polar coordinate system. Therefore, we should set an appropriate $r$, the distance from the origin in the polar coordinate system, on the basis of the number of classes. The hyperparameter $r$ can be roughly determined by considering the number of classes and the number of dimensions of the data. The distribution $p_0$, conditioned on the label $s$, follows the Gaussian distribution $\mathcal{N}(\mu_s, \Sigma_s)$. Let $m(r)$ be the midpoint vector of the mean vectors $\mu_s$ and $\mu'_s$ of two adjacent Gaussian distributions. We assign a mean vector with the polar coordinate system as shown in Eq. (5), and it depends on the hyperparameter $r$. We propose a method to determine $r$ by calculating $\max\bigl(\mathcal{N}(m(r) \mid \mu_s, \Sigma_s), \mathcal{N}(m(r) \mid \mu'_s, \Sigma'_s)\bigr)$, which for simplicity we denote as $U(r)$. We should select an $r$ large enough that $U(r) \simeq 0$, since the Gaussian distributions for different labels should be separable. We vary the hyperparameter $r$ and calculate $U(r)$.
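The quantity $U(r)$ is cheap to evaluate; the sketch below constructs the means of Eq. (5), takes the midpoint of each pair of angularly adjacent means, and returns the largest of the two densities there (with identity covariances the two are equal by symmetry). Looping over all adjacent pairs is an illustrative choice.

```python
# Sketch of the separability check U(r): select the smallest r for which U(r)
# falls below a small threshold such as 1e-3.
import math
import numpy as np

def polar_means(C, d, r):
    """Class means assigned as in Eq. (5)."""
    mus = np.zeros((C, d))
    for s in range(C):
        theta = 2 * math.pi * s / C
        mus[s, :d - 1] = r * math.cos(theta) * np.sin(theta) ** np.arange(d - 1)
        mus[s, d - 1] = r * math.sin(theta) ** (d - 1)
    return mus

def gaussian_density(x, mu):
    """Density of N(mu, I) evaluated at x."""
    d = len(mu)
    return math.exp(-0.5 * float(np.sum((x - mu) ** 2))) / (2 * math.pi) ** (d / 2)

def U(r, C, d):
    mus = polar_means(C, d, r)
    worst = 0.0
    for s in range(C):
        mu_a, mu_b = mus[s], mus[(s + 1) % C]        # two adjacent class means
        mid = (mu_a + mu_b) / 2.0
        worst = max(worst, gaussian_density(mid, mu_a), gaussian_density(mid, mu_b))
    return worst
```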
Figure 5 shows the result of the calculation. In Figure 5(a), where the number of dimensions is fixed at 2, it can be seen that a larger $r$ is required when the number of classes is large. In Figure 5(b), the number of classes is fixed at 10, and we see that $U(r)$ becomes sufficiently small even for a small $r$ when the number of dimensions is large.

Figure 5: Determination method of the hyperparameter $r$: (a) the number of dimensions is fixed; (b) the number of classes is fixed. We can roughly determine the hyperparameter $r$ by considering the number of classes and the number of dimensions of the data.

We vary the hyperparameter $r$ and train the proposed flow-based model $g_\omega$ with only the source dataset, i.e., we maximize the log-likelihood given by Eq. (6) with respect to $\omega$. The performance of our flow-based model is evaluated using three-fold cross-validation on the source dataset. Figure 6 shows the mean accuracy of the three-fold cross-validation. The red dashed line in Figure 6 represents the smallest $r$ that induces $U(r) < 0.001$. We see that the accuracy on the source dataset does not change significantly for sufficiently large $r$, which means that the Gaussian distributions for different labels are distinguishable from each other. Therefore, we should determine the hyperparameter $r$ by calculating $U(r)$.

Figure 6: Experimental results of the training of our flow-based model with various hyperparameter $r$ values on the source dataset only (Rotating MNIST, Portraits, SHIFT15M, RxRx1, and the three Tox21 tasks). When the hyperparameter $r$ is sufficiently large, the accuracy on the source dataset does not change significantly.

5.6 Necessity of intermediate domains

Our key idea is to use a CNF to learn gradual shifts between domains. Here, we show the necessity of the intermediate domains for the training of the CNF. Figure 7 shows the results of the CNF trained with and without the intermediate datasets on the Block dataset. In Figure 7, we show the transformation from the target data to the source data by the trained CNF. Visually, the CNF trained with the intermediate datasets transforms the target data to the source data as expected. The accuracy on the target dataset is 0.999 when the CNF is trained with the intermediate datasets. In contrast, the accuracy on the target dataset is 0.181 when the CNF is trained without intermediate datasets. From these results, we conclude that it is important to train a CNF with datasets from the source, intermediate, and target domains to capture the gradual shift.

Figure 7: Necessity of intermediate domains. The CNF trained with the intermediate datasets transforms the target data to the source data as expected. From the perspective of predictive performance on the target data, training the CNF with the intermediate datasets is preferable.

Wang et al. (2022) mentioned that the sequence of intermediate domains should be uniformly placed along the Wasserstein geodesic between the source and target domains. When only one intermediate domain is given, intuitively, a preferable intermediate domain lies near the midpoint of the Wasserstein geodesic between the source and target domains.
In practice, we cannot select arbitrary intermediate domains for domain adaptation; the sequence of intermediate domains is simply given. In the Rotating MNIST dataset, the source dataset consists of samples from the MNIST dataset without any rotation, and the target dataset is prepared by rotating images of the source dataset by the angle $\pi/3$. We consider a preferable intermediate domain to be a dataset prepared by rotating images from the source dataset by the angle $\pi/6$. We vary the rotation angle when preparing the intermediate domain and train our flow-based model. Note that the number of intermediate domains used for training is always one. After the training, we evaluate the accuracy on the target dataset. The evaluation of our flow-based model was repeated five times using different initial weights of the neural networks. In Figure 8(a), we see that rotations with small and large angles, i.e., $\pi/21$ and $\pi/3.5$, show a significant variance in accuracy, and the mean accuracy for these angles is lower than that for $\pi/6$. We obtain experimental results that support our intuition, indicating that the preferable rotation angle for the intermediate domain is $\pi/6$. Figure 8(b) shows the results of a comparison between the worst-performing model from Figure 8(a) and the model trained without any intermediate domain. The model trained without intermediate domains performs the worst. The quality of the intermediate domain does impact the results of gradual domain adaptation, but the impact is relatively small compared with the result obtained without the intermediate domain.

Figure 8: Experimental results of the training of our flow-based model with various intermediate domains: (a) various intermediate domains; (b) necessity of intermediate domains.

5.7 Generating artificial intermediate domains

While our primary focus lies in GDA, our proposed method can generate synthetic data from intermediate domains, even in the absence of observed samples from those domains. We aim to utilize NFs to transform the distribution of the target domain to the distribution of the source domain. Optimal transport (OT) can also realize a natural transformation from the target domain to the source domain. He et al. (2023) proposed a method called Generative Gradual Domain Adaptation with Optimal Transport (GOAT). GOAT interpolates the initially given domains with OT and applies gradual self-training. It is important to obtain appropriate pseudo-intermediate domains using OT since GOAT is a self-training-based GDA algorithm. Figure 9 shows the comparison between pseudo-intermediate domains generated by the proposed method and those generated by OT. On the Two Moon dataset, the CNF is suitable for generating pseudo-intermediate domains, whereas OT is not.

In principle, the proposed method is applicable to data of any dimension, such as image data, but the current technology of CNFs is the computational bottleneck. It is difficult to handle high-dimensional data directly due to the high computational cost of off-the-shelf CNF implementations.

Figure 9: Comparison of the methods of generating pseudo-intermediate domains on the Two Moon dataset. The CNF generates reasonable pseudo-intermediate domains, whereas OT fails to do so.
Moreover, as mentioned by Brehmer and Cranmer (2020), NFs are unsuitable for data that do not populate the entire ambient space. Therefore, as preprocessing, it is reasonable to reduce the dimensionality of high-dimensional data. A combination of the proposed method and a VAE is applicable to image data. We can utilize the trained CNF and VAE for generating artificial intermediate images, such as morphing. We show a demonstration of morphing on Rotating MNIST in Figure 10. The details of the experiment are shown in Appendix A.

Figure 10: Morphing of Rotating MNIST: (a) originals; (b) samples from the CNF.

5.8 Comparison with baseline methods

To verify the effectiveness of the proposed method, we compare it with baseline methods. Following the approach described in Section 5.5, we assign different values to the hyperparameter $k$ of the proposed method for each dataset. The hyperparameter $r$ is set to $r = 3$ and $r = 10$ for binary and multiclass classification, respectively. The appropriateness of these settings is discussed in Section 5.5. The primary baseline methods are self-training-based GDA methods, as introduced in Section 3. Recall that these methods update the hypothesis $h: \mathcal{X} \to \mathcal{Y}$ trained on the source dataset by applying sequential self-training. The key idea of GDA is that the hypothesis $h$ should be updated gradually. GIFT (Abnar et al., 2021) and AuxSelfTrain (Zhang et al., 2021) are methods that apply the idea of GDA to conventional domain adaptation. Since conventional (non-gradual) domain adaptation is beyond the scope of this study, we limit our comparisons to those methods inspired by GDA. Although GIFT and AuxSelfTrain do not utilize intermediate domains, we also show experimental results using intermediate domains for updating the hypothesis sequentially (Sequential GIFT and Sequential AuxSelfTrain). EAML (Liu et al., 2020) is a method that uses meta-learning to adapt to a target domain that evolves over time. We compare the proposed method with EAML because the problem setting assumed by EAML is very similar to that assumed in GDA. The details of the experiments, such as the composition of the neural networks, are described in Appendix A. We provide a brief description of the baseline methods as follows.

• SourceOnly: Train the classifier with the source dataset only.
• GradualSelfTrain (Kumar et al., 2020): Apply gradual self-training with the initially given domains.
• GOAT (He et al., 2023): Interpolate the initially given domains with optimal transport and apply gradual self-training.
• GIFT (Abnar et al., 2021): Update the source model by gradual self-training with pseudo-intermediate domains generated from the source and target domains.
• Sequential GIFT (Abnar et al., 2021): Apply the GIFT algorithm with the initially given domains.
• AuxSelfTrain (Zhang et al., 2021): Update the source model by gradual self-training with pseudo-intermediate domains generated from the source and target domains.
• Sequential AuxSelfTrain (Zhang et al., 2021): Apply the AuxSelfTrain algorithm with the initially given domains.
• EAML (Liu et al., 2020): Apply the meta-learning algorithm to the initially given domains.

Each evaluation was repeated 10 times using different initial weights of the neural networks. We show the experimental results in Figure 11.
Our proposed method has comparable or superior accuracy to the baseline methods on all datasets. We consider a GDA problem with large discrepancies between adjacent domains. We estimate the distance between the source and target domains of each dataset by predicting the target dataset with SourceOnly. The prediction performance of SourceOnly on Rotating MNIST is low, which suggests that there is a large gap between the source and target domains. On the other hand, the gap is not as large for the other datasets. The proposed method is effective on datasets with a large gap between the source and target domains. As mentioned in Section 5.7, the prediction performance of a method that uses pseudo-intermediate domains and gradual self-training will deteriorate when suitable pseudo-intermediate domains for self-training are not obtained. AuxSelfTrain is the only baseline method that incorporates unsupervised learning during self-training. AuxSelfTrain seems to be suitable for the Portraits dataset, but it does not appear to be suitable for the SHIFT15M dataset. EAML learns meta-representations from the sequence of unlabeled datasets. In our problem setting, the number of given intermediate domains is limited, which may be insufficient for learning meta-representations. The proposed method demonstrates stable and comparable performance on all datasets. Moreover, as shown in Section 5.7, our proposed method can generate artificial intermediate samples.

Figure 11: Comparison of accuracy on five real-world datasets (Rotating MNIST, Portraits, SHIFT15M, RxRx1, and the three Tox21 tasks) for the proposed method and the baselines (SourceOnly, GradualSelfTrain, GOAT, GIFT, Sequential GIFT, AuxSelfTrain, Sequential AuxSelfTrain, and EAML).

6 Discussion and conclusion

Gradual domain adaptation is one of the promising approaches to addressing the problem of a large domain gap by leveraging the intermediate domain. Kumar et al. (2020) assumed in their previous work that the intermediate domain gradually shifts from the source domain to the target domain and that the distance between adjacent domains is small. In this study, we consider the problem of a large distance between adjacent domains. The proposed method mitigates the problem by utilizing normalizing flows, with a theoretical guarantee on the prediction error in the target domain. We evaluate the effectiveness of our proposed method on five real-world datasets. The proposed method mitigates the limit of the applicability of GDA.

In this work, we assume that there is no noisy intermediate domain. A noisy intermediate domain deteriorates the predictive performance of the proposed method. It remains future work to develop a method to select several appropriate intermediate domains from given noisy intermediate domains.

Acknowledgments

Part of this work is supported by JSPS KAKENHI Grant No. JP22H03653, JST CREST Grant Nos. JPMJCR1761 and JPMJCR2015, JST-Mirai Program Grant No. JPMJMI19G1, and NEDO Grant No. JPNP18002.

References

Abdal, R., Zhu, P., Mitra, N. J., and Wonka, P. (2021). StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (ToG), 40(3):1–21.

Abnar, S., Berg, R. v. d., Ghiasi, G., Dehghani, M., Kalchbrenner, N., and Sedghi, H. (2021). Gradual domain adaptation in the wild: When intermediate distributions are absent. arXiv preprint arXiv:2106.06080.
Gradual domain adaptation in the wild: When intermediate distributions are absent. arXiv preprint arXiv:2106.06080.

Askari, H., Latif, Y., and Sun, H. (2023). MapFlow: Latent transition via normalizing flow for unsupervised domain adaptation. Machine Learning, pages 1–22.

Ben-David, S., Blitzer, J., Crammer, K., Pereira, F., et al. (2007). Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137.

Ben-Hamu, H., Cohen, S., Bose, J., Amos, B., Nickel, M., Grover, A., Chen, R. T., and Lipman, Y. (2022). Matching normalizing flows and probability paths on manifolds. In International Conference on Machine Learning, pages 1749–1763. PMLR.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 41–48, New York, NY, USA. Association for Computing Machinery.

Brehmer, J. and Cranmer, K. (2020). Flows for simultaneous manifold learning and density estimation. Advances in Neural Information Processing Systems, 33:442–453.

Caterini, A. L., Loaiza-Ganem, G., Pleiss, G., and Cunningham, J. P. (2021). Rectangular flows for manifold learning. Advances in Neural Information Processing Systems, 34:30228–30241.

Chen, H.-Y. and Chao, W.-L. (2021). Gradual domain adaptation without indexed intermediate domains. Advances in Neural Information Processing Systems, 34.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31.

Choi, J., Choi, Y., Kim, J., Chang, J., Kwon, I., Gwon, Y., and Min, S. (2020). Visual domain adaptation by consensus-based transfer to intermediate domain. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10655–10662.

Cortes, C., Mansour, Y., and Mohri, M. (2010). Learning bounds for importance weighting. In NIPS, volume 10, pages 442–450. Citeseer.

Cui, S., Wang, S., Zhuo, J., Su, C., Huang, Q., and Tian, Q. (2020). Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12455–12464.

Dai, Y., Liu, J., Sun, Y., Tong, Z., Zhang, C., and Duan, L.-Y. (2021). IDM: An intermediate domain module for domain adaptive person re-ID. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11864–11874.

Das, H. P., Tran, R., Singh, J., Lin, Y.-W., and Spanos, C. J. (2021). CDCGen: Cross-domain conditional generation via normalizing flows and adversarial training. arXiv preprint arXiv:2108.11368.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using Real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.

Dong, J., Zhou, S., Wang, B., and Zhao, H. (2022). Algorithms and theory for supervised gradual domain adaptation. Transactions on Machine Learning Research.

Dong, W., Moses, C., and Li, K. (2011). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, pages 577–586.

Drwal, M. N., Siramshetty, V. B., Banerjee, P., Goede, A., Preissner, R., and Dunkel, M.
(2015). Molecular similarity-based predictions of the Tox21 screening outcome. Frontiers in Environmental Science, 3:54.

Gadermayr, M., Eschweiler, D., Klinkhammer, B. M., Boor, P., and Merhof, D. (2018). Gradual domain adaptation for segmenting whole slide images showing pathological variability. In International Conference on Image and Signal Processing, pages 461–469. Springer.

Ginosar, S., Rakelly, K., Sachs, S., Yin, B., and Efros, A. A. (2015). A century of portraits: A visual historical record of American high school yearbooks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 1–7.

Gong, R., Li, W., Chen, Y., and Gool, L. V. (2019). DLOW: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2477–2486.

Goria, M. N., Leonenko, N. N., Mergel, V. V., and Inverardi, P. L. N. (2005). A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17(3):277–297.

Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. (2019). FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations.

Grover, A., Chute, C., Shu, R., Cao, Z., and Ermon, S. (2020). AlignFlow: Cycle consistent learning from multiple domains via normalizing flows. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4028–4035.

He, Y., Wang, H., Li, B., and Zhao, H. (2023). Gradual domain adaptation: Theory and algorithms. arXiv preprint arXiv:2310.13852.

Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. (2019). Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pages 2722–2730. PMLR.

Horvat, C. and Pfister, J.-P. (2021). Denoising normalizing flow. Advances in Neural Information Processing Systems, 34:9099–9111.

Hsu, H.-K., Yao, C.-H., Tsai, Y.-H., Hung, W.-C., Tseng, H.-Y., Singh, M., and Yang, M.-H. (2020). Progressive domain adaptation for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 749–757.

Huang, H.-H. and Yeh, M.-Y. (2021). Accelerating continuous normalizing flow with trajectory polynomial regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7832–7839.

Huang, P., Xu, M., Zhu, J., Shi, L., Fang, F., and Zhao, D. (2022). Curriculum reinforcement learning using optimal transport via gradual domain adaptation. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors, Advances in Neural Information Processing Systems. https://openreview.net/forum?id=_cFdPHRLuJ.

Huang, Z., Chen, S., Zhang, J., and Shan, H. (2021). AgeFlow: Conditional age progression and regression with normalizing flows. In Zhou, Z.-H., editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 743–750. International Joint Conferences on Artificial Intelligence Organization. Main Track.

Izmailov, P., Kirichenko, P., Finzi, M., and Wilson, A. G. (2020). Semi-supervised learning with normalizing flows. In International Conference on Machine Learning, pages 4615–4630. PMLR.
Kalatzis, D., Ye, J. Z., Pouplin, A., Wohlert, J., and Hauberg, S. (2021). Density estimation on smooth manifolds with normalizing flows. arXiv preprint arXiv:2106.03500.

Kimura, M., Nakamura, T., and Saito, Y. (2021). SHIFT15M: Multiobjective large-scale fashion dataset with distributional shifts. arXiv preprint arXiv:2108.12992.

Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 31.

Kirichenko, P., Izmailov, P., and Wilson, A. G. (2020). Why normalizing flows fail to detect out-of-distribution data. Advances in Neural Information Processing Systems, 33:20578–20589.

Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., et al. (2021). WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637–5664. PMLR.

Kong, Z. and Chaudhuri, K. (2020). The expressive power of a class of normalizing flow models. In International Conference on Artificial Intelligence and Statistics, pages 3599–3609. PMLR.

Kozachenko, L. and Leonenko, N. (1987). Sample estimate of the entropy of a random vector. Problems of Information Transmission, 23(2):95–101.

Kumar, A., Ma, T., and Liang, P. (2020). Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pages 5468–5479. PMLR.

Kuznetsov, M. and Polykovskiy, D. (2021). MolGrow: A graph normalizing flow for hierarchical molecular generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8226–8234.

Liu, H., Long, M., Wang, J., and Wang, Y. (2020). Learning to adapt to evolving domains. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 22338–22348. Curran Associates, Inc.

Lu, Y. and Huang, B. (2020). Structured output learning with conditional generative flows. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5005–5012.

Mahajan, S., Gurevych, I., and Roth, S. (2019). Latent normalizing flows for many-to-many cross-domain mappings. In International Conference on Learning Representations.

Mansour, Y., Mohri, M., and Rostamizadeh, A. (2009). Domain adaptation: Learning bounds and algorithms. In 22nd Conference on Learning Theory, COLT 2009.

Mathieu, E. and Nickel, M. (2020). Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems, 33:2503–2515.

McInnes, L., Healy, J., Saul, N., and Grossberger, L. (2018). UMAP: Uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861.

Nguyen, A. T., Tran, T., Gal, Y., Torr, P., and Baydin, A. G. (2022). KL guided domain adaptation. In International Conference on Learning Representations.

Onken, D., Wu Fung, S., Li, X., and Ruthotto, L. (2021). OT-Flow: Fast and accurate continuous normalizing flows via optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35.

Pan, Z., Yu, W., Yi, X., Khan, A., Yuan, F., and Zheng, Y. (2019). Recent progress on generative adversarial networks (GANs): A survey. IEEE Access, 7:36322–36333.

Papamakarios, G., Nalisnick, E., Rezende, D.
J., Mohamed, S., and Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Pumarola, A., Popov, S., Moreno-Noguer, F., and Ferrari, V. (2020). C-Flow: Conditional generative flow models for images and 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7949–7958.

Redko, I., Morvant, E., Habrard, A., Sebban, M., and Bennani, Y. (2019). Advances in Domain Adaptation Theory. Elsevier.

Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR.

Ross, B. and Cresswell, J. (2021). Tractable density estimation on learned manifolds with conformal embedding flows. Advances in Neural Information Processing Systems, 34:26635–26648.

Rozen, N., Grover, A., Nickel, M., and Lipman, Y. (2021). Moser flow: Divergence-based generative modeling on manifolds. Advances in Neural Information Processing Systems, 34:17669–17680.

Sagawa, S. and Hino, H. (2023). Cost-effective framework for gradual domain adaptation with multifidelity. Neural Networks, 164:731–741.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244.

Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

Taylor, J., Earnshaw, B., Mabey, B., Victors, M., and Yosinski, J. (2019). RxRx1: An image set for cellular morphological variation across many experimental batches. In International Conference on Learning Representations (ICLR).

Teshima, T., Ishikawa, I., Tojo, K., Oono, K., Ikeda, M., and Sugiyama, M. (2020). Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems, 33:3362–3373.

Thomas, R. S., Paules, R. S., Simeonov, A., Fitzpatrick, S. C., Crofton, K. M., Casey, W. M., and Mendrick, D. L. (2018). The US federal Tox21 program: A strategic and operational plan for continued leadership. Altex, 35(2):163.

Villani, C. (2009). Optimal Transport: Old and New, volume 338. Springer.

Wang, H., He, H., and Katabi, D. (2020). Continuously indexed domain adaptation. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9898–9907. PMLR.

Wang, H., Li, B., and Zhao, H. (2022). Understanding gradual domain adaptation: Improved analysis, optimal path and beyond.
In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S., editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 22784–22801. PMLR.

Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. (2018). MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2):513–530.

Yang, G., Huang, X., Hao, Z., Liu, M.-Y., Belongie, S., and Hariharan, B. (2019). PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4541–4550.

Ye, M., Jiang, R., Wang, H., Choudhary, D., Du, X., Bhushanam, B., Mokhtari, A., Kejariwal, A., and Liu, Q. (2022). Future gradient descent for adapting the temporal shifting data distribution in online recommendation systems. In Cussens, J. and Zhang, K., editors, Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, volume 180 of Proceedings of Machine Learning Research, pages 2256–2266. PMLR.

Zhai, J., Zhang, S., Chen, J., and He, Q. (2018). Autoencoder and its various variants. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 415–419.

Zhang, Y., Deng, B., Jia, K., and Zhang, L. (2021). Gradual domain adaptation via self-training of auxiliary models. arXiv preprint arXiv:2106.09890.

Zhao, H., Des Combes, R. T., Zhang, K., and Gordon, G. (2019). On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532. PMLR.

Zhao, S., Li, B., Xu, P., and Keutzer, K. (2020). Multi-source domain adaptation in the deep learning era: A systematic survey. arXiv preprint arXiv:2002.12169.

Zhou, S., Wang, L., Zhang, S., Wang, Z., and Zhu, W. (2022a). Active gradual domain adaptation: Dataset and approach. IEEE Transactions on Multimedia, 24:1210–1220.

Zhou, S., Zhao, H., Zhang, S., Wang, L., Chang, H., Wang, Z., and Zhu, W. (2022b). Online continual adaptation with active self-training. In International Conference on Artificial Intelligence and Statistics, pages 8852–8883. PMLR.

Appendix A Experimental details

Appendix A.1 Networks

We propose a gradual domain adaptation method that utilizes normalizing flows. The baseline methods include self-training-based methods and a method that uses meta-learning. The composition of the neural network for each method is as follows; rough sketches of these networks are given at the end of this subsection.

Our proposed method: One CNF block consists of two fully connected layers with 64 nodes in each layer. Our flow-based model g consists of one CNF block.

Self-training-based methods: Recall that these methods update the hypothesis h : X → Y trained on the source dataset by applying sequential self-training. We follow the hypothesis used by He et al. (2023), since they consider the same problem setting as ours. The hypothesis h consists of an encoder and a classifier; the encoder has two convolutional layers, and the classifier has two convolutional layers and two fully connected layers. Since we apply UMAP to all datasets as preprocessing, we replace the convolutional layers of the model with fully connected layers. We use this hypothesis in all baseline methods except for the proposed method and EAML.

EAML: EAML requires a feature extractor and a meta-adapter. In Liu et al. (2020), the feature extractor consists of two convolutional layers, and the meta-adapter comprises two fully connected layers. We replace the convolutional layers of the feature extractor with fully connected layers, since we apply UMAP to all datasets as preprocessing. The other parameters required during training are set as specified in Liu et al. (2020).
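For concreteness, here are rough PyTorch sketches of the two network types described above. Only the 64-node fully connected CNF layers and the replacement of convolutions by fully connected layers are taken from the text; the activations, the time-conditioning scheme, and the hidden width of the hypothesis are our assumptions.

```python
import torch
import torch.nn as nn

class CNFVelocity(nn.Module):
    """Velocity field v(z, t) of one CNF block: two fully connected
    layers with 64 nodes. Concatenating the scalar time to the state
    and the tanh activation are our assumptions."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),
            nn.Tanh(),
            nn.Linear(hidden, dim),  # outputs dz/dt
        )

    def forward(self, t, z):
        t_col = z.new_full((z.shape[0], 1), float(t))  # broadcast time to batch
        return self.net(torch.cat([z, t_col], dim=1))

def make_hypothesis(in_dim, n_classes, hidden=64):
    """Fully connected stand-in for the encoder + classifier hypothesis
    (convolutions replaced by linear layers after UMAP preprocessing);
    the hidden width is illustrative, not taken from the paper."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),  # "encoder" part
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),  # "classifier" part
        nn.Linear(hidden, n_classes),
    )
```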
Appendix A.2 Datasets

We use benchmark datasets with modifications for gradual domain adaptation. Since we consider the situation in which the distances between adjacent domains are large, we prepare only one or two intermediate domains. The details of the datasets are as follows.

Two Moon: A toy dataset. We use the two-moon dataset as the source domain. The intermediate and target domains are prepared by rotating the source dataset by π/4 and π/2, respectively.

Block: A toy dataset. The data are two-dimensional, and the number of classes is five. We prepare the intermediate and target domains by adding horizontal movement to each class. Note that only the Block dataset has two intermediate domains.

Rotating MNIST (Kumar et al., 2020): We add rotations to the MNIST data. The rotation angle is 0 for the source domain, π/6 for the intermediate domain, and π/3 for the target domain. We normalize the image intensity to the range between 0 and 1 by dividing by 255.

Portraits (Ginosar et al., 2015): The Portraits dataset includes photographs of U.S. high school students from 1905 to 2013, and the task is gender classification. We sort the dataset in ascending order by year and split it: the source dataset includes data from the 1900s to the 1930s, the intermediate dataset includes data from the 1940s and 1950s, and the target dataset includes data from the 1960s. We resize the original images to 32 x 32 and normalize the image intensity to the range between 0 and 1 by dividing by 255.

Tox21 (Thomas et al., 2018): The Tox21 dataset contains the results of measuring the toxicity of compounds. The dataset contains 12 types of toxicity evaluations with a number of missing values. We merge these evaluations into a single evaluation and consider a compound toxic when it is determined to be harmful in any of the 12 evaluations. Since Tox21 has no domain indicator such as year, we introduce an indicator for splitting the entire dataset into domains. From a chemical viewpoint, it is reasonable to divide the dataset into domains by the number of occurrences of a given substituent in each compound. We select the following three chemically representative substituents and use the number of substituents as the domain indicator.

• NHOHCount: Number of NHOH groups in the compound.
• RingCount: Number of ring structures in the compound.
• NumHDonors: Number of positively polarized hydrogen bonds in the compound.

Compounds with zero, one, and two substituents are assigned to the source domain, the intermediate domain, and the target domain, respectively. As features, we use the 108-dimensional molecular descriptors used in previous work (Drwal et al., 2015).
SHIFT15M (Kimura et al., 2021): SHIFT15M consists of 15 million fashion images collected from real fashion e-commerce sites. We estimate seven categories of clothes from image features. SHIFT15M does not provide images; instead, it provides 4,096-dimensional VGG16 (Simonyan and Zisserman, 2015) features. The dataset contains fashion images from 2010 to 2020, and the passage of years causes a domain shift. Because the number of samples from 2010 is significantly smaller than that from other years, we merge the samples from 2010 with those from 2011. We consider the datasets from 2011, 2015, and 2020 as the source domain, the intermediate domain, and the target domain, respectively. Owing to the large number of samples, we randomly select 5,000 samples from each domain.

RxRx1 (Taylor et al., 2019): RxRx1 consists of three-channel cell images obtained with a fluorescence microscope. We resize the original images to 32 x 32 and normalize the image intensity to the range between 0 and 1 by dividing by 255. Domain shifts occur between executions of each batch owing to slight changes in temperature, humidity, and reagent concentration. We estimate the cell type used in the experiment from image features. We consider batch numbers one, two, and three as the source domain, the intermediate domain, and the target domain, respectively.

Appendix A.3 Image generation

We proposed a method that uses continuous normalizing flows (CNFs) to learn gradual shifts between domains. Our main purpose is gradual domain adaptation, but the trained CNFs can generate pseudo-intermediate domains, such as morphing, as a byproduct of the use of normalizing flows. It is difficult to handle high-dimensional data directly owing to the high computational cost of off-the-shelf CNF implementations. Therefore, we demonstrated image generation by combining the proposed method with a variational autoencoder (VAE) in Figure 10 of the main body of the paper. Here, we describe the details of the experiment.

We assign 60,000 samples drawn from the MNIST dataset without any rotation to the source domain, and the target domain is prepared by rotating the images of the source dataset by the angle π/3. Following Kumar et al. (2020), the intermediate domains are prepared by adding gradual rotations, with angles ranging from π/84 to π/3.11; the number of intermediate domains is 27 in total. To obtain latent variables that shift gradually from the source domain to the target domain, we use all the intermediate domains for training the VAE. After training the VAE, we extract the latent variables whose rotation angles correspond to 0, π/6, and π/3. Following Kumar et al. (2020), we randomly select 2,000 samples from each domain. We train our flow-based model on the latent variables and generate pseudo-intermediate domains using the trained CNF. Figure 10 of the main body of the paper shows images decoded from the pseudo-intermediate domains.
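As a concrete illustration of the domain construction above, the following sketch generates the gradually rotating MNIST domains (source at angle 0, 27 equally spaced intermediate domains from π/84 up to π/3.11, target at π/3). The equal angular spacing and the use of scipy.ndimage.rotate with bilinear interpolation are our assumptions; the paper does not specify the rotation routine.

```python
import numpy as np
from scipy.ndimage import rotate

def make_rotating_domains(images, n_intermediate=27, max_angle=np.pi / 3):
    """Build gradually rotating domains from MNIST-like images.

    `images` has shape (N, 28, 28) with pixel values already scaled
    to [0, 1] by dividing by 255. With the defaults, the angles are
    0, pi/84, 2*pi/84, ..., 27*pi/84 (= pi/3.11), pi/3.
    """
    angles = np.linspace(0.0, max_angle, n_intermediate + 2)  # source..target
    domains = []
    for a in angles:
        rot = rotate(images, np.degrees(a), axes=(1, 2),
                     reshape=False, order=1)   # bilinear, keep 28x28 frame
        domains.append(np.clip(rot, 0.0, 1.0))
    return angles, domains
```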
Appendix B Proofs

Here, we show the details of the proofs. Note that Propositions 1 and 2 were originally derived by Nguyen et al. (2022) and Onken et al. (2021), respectively. For the sake of completeness, we provide proofs using notation consistent with that used in this paper.

Appendix B.1 Proof of Proposition 1 (Nguyen et al., 2022)

Proof. Recall that the expected losses on the source and $j$-th domains are defined as $\mathcal{L}_1 = \mathbb{E}_{\mathbf{x},y \sim p_1(\mathbf{x},y)}[-\log p(y \mid \mathbf{x})]$ and $\mathcal{L}_j = \mathbb{E}_{\mathbf{x},y \sim p_j(\mathbf{x},y)}[-\log p(y \mid \mathbf{x})]$, respectively. In standard domain adaptation, there is no intermediate domain and $K = 2$. We have
$$
\mathcal{L}_2 = \mathbb{E}_{p_2(\mathbf{x},y)}[-\log p(y \mid \mathbf{x})]
= \int -\log p(y \mid \mathbf{x})\, p_2(\mathbf{x},y)\, d\mathbf{x}\, dy
= \mathcal{L}_1 + \int -\log p(y \mid \mathbf{x})\,[p_2(\mathbf{x},y) - p_1(\mathbf{x},y)]\, d\mathbf{x}\, dy .
$$
We define the sets $A$ and $B$ as
$$
A = \{(\mathbf{x},y) \mid p_2(\mathbf{x},y) - p_1(\mathbf{x},y) \ge 0\}, \qquad
B = \{(\mathbf{x},y) \mid p_2(\mathbf{x},y) - p_1(\mathbf{x},y) < 0\}.
$$
If Assumption 1 holds, we have
$$
\begin{aligned}
\int -\log p(y \mid \mathbf{x})\,[p_2 - p_1]\, d\mathbf{x}\, dy
&= \int_A -\log p(y \mid \mathbf{x})\,[p_2 - p_1]\, d\mathbf{x}\, dy
 + \int_B -\log p(y \mid \mathbf{x})\,[p_2 - p_1]\, d\mathbf{x}\, dy \\
&\le \int_A -\log p(y \mid \mathbf{x})\,[p_2 - p_1]\, d\mathbf{x}\, dy
 = \int_A -\log p(y \mid \mathbf{x})\,|p_2 - p_1|\, d\mathbf{x}\, dy \\
&\le M \int_A |p_2 - p_1|\, d\mathbf{x}\, dy \qquad (\because -\log p(y \mid \mathbf{x}) \le M),
\end{aligned}
$$
where $|\cdot|$ denotes the absolute value and we abbreviate $p_i = p_i(\mathbf{x},y)$. Note that $\int_A |p_2 - p_1|\, d\mathbf{x}\, dy$ is the total variation distance between the two distributions. From the identity $\int [p_2(\mathbf{x},y) - p_1(\mathbf{x},y)]\, d\mathbf{x}\, dy = 0$, we have
$$
\int_A [p_2 - p_1]\, d\mathbf{x}\, dy = \int_B [p_1 - p_2]\, d\mathbf{x}\, dy
\;\Longleftrightarrow\;
\int_A |p_2 - p_1|\, d\mathbf{x}\, dy = \frac{1}{2} \int |p_2 - p_1|\, d\mathbf{x}\, dy .
$$
Therefore,
$$
\mathcal{L}_2 \le \mathcal{L}_1 + M \int_A |p_2 - p_1|\, d\mathbf{x}\, dy
= \mathcal{L}_1 + \frac{M}{2} \int |p_2 - p_1|\, d\mathbf{x}\, dy .
$$
Using Pinsker's inequality,
$$
\left( \int |p_2(\mathbf{x},y) - p_1(\mathbf{x},y)|\, d\mathbf{x}\, dy \right)^2
\le 2 \int p_2(\mathbf{x},y) \log \frac{p_2(\mathbf{x},y)}{p_1(\mathbf{x},y)}\, d\mathbf{x}\, dy ,
$$
we obtain
$$
\mathcal{L}_2 \le \mathcal{L}_1 + \frac{M}{\sqrt{2}} \sqrt{\mathrm{KL}[\, p_2(\mathbf{x},y) \mid p_1(\mathbf{x},y) \,]} .
$$
We decompose the KL divergence between $p_2(\mathbf{x},y)$ and $p_1(\mathbf{x},y)$ into marginal and conditional misalignment terms:
$$
\begin{aligned}
\mathrm{KL}[p_2(\mathbf{x},y) \mid p_1(\mathbf{x},y)]
&= \mathbb{E}_{p_2(\mathbf{x},y)}[\log p_2(\mathbf{x}) + \log p_2(y \mid \mathbf{x}) - \log p_1(\mathbf{x}) - \log p_1(y \mid \mathbf{x})] \\
&= \mathrm{KL}[p_2(\mathbf{x}) \mid p_1(\mathbf{x})]
 + \mathbb{E}_{p_2(\mathbf{x})}\big[\mathrm{KL}[p_2(y \mid \mathbf{x}) \mid p_1(y \mid \mathbf{x})]\big].
\end{aligned}
$$
Therefore, we have
$$
\mathcal{L}_2 \le \mathcal{L}_1 + \frac{M}{\sqrt{2}}
\sqrt{\mathrm{KL}[p_2(\mathbf{x}) \mid p_1(\mathbf{x})]
 + \mathbb{E}_{p_2(\mathbf{x})}\big[\mathrm{KL}[p_2(y \mid \mathbf{x}) \mid p_1(y \mid \mathbf{x})]\big]} ,
$$
which completes the proof. ∎

Appendix B.2 Proof of Proposition 2 (Onken et al., 2021)

Proof. Let $p_{t+1}$ be the initial density of the samples $\mathbf{x} \in \mathbb{R}^d$ and $g : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d$ be the trajectories that transform samples from $p_{t+1}$ to $p_t$. The change in density when $p_{t+1}$ is transformed from time $t+1$ to $t$ is given by the change-of-variables formula
$$
p_{t+1}(\mathbf{x}) = p^*_{t+1}(g(\mathbf{x},t))\, |\det \nabla g(\mathbf{x},t)| , \tag{12}
$$
where $p^*_{t+1}$ and $\nabla g(\mathbf{x},t)$ are the transformed density and the Jacobian of $g$, respectively. Normalizing flows aim to learn a function $g$ that transforms $p_{t+1}$ into $p_t$. Measuring the discrepancy between the transformed and objective distributions indicates whether the trained function $g$ is appropriate. The discrepancy between the two distributions is measured using the KL divergence
$$
\mathrm{KL}[p^*_{t+1}(\mathbf{x}) \mid p_t(\mathbf{x})]
= \int_{\mathbb{R}^d} \log \frac{p^*_{t+1}(\mathbf{x})}{p_t(\mathbf{x})}\, p^*_{t+1}(\mathbf{x})\, d\mathbf{x} . \tag{13}
$$
We transform a sample $\mathbf{x}$ using $g$. By the change-of-variables formula, we rewrite Eq. (13) as
$$
\begin{aligned}
\mathrm{KL}[p^*_{t+1}(\mathbf{x}) \mid p_t(\mathbf{x})]
&= \int_{\mathbb{R}^d} \log \frac{p^*_{t+1}(g(\mathbf{x},t))\, |\det \nabla g(\mathbf{x},t)|}{p_t(g(\mathbf{x},t))\, |\det \nabla g(\mathbf{x},t)|}\;
 p^*_{t+1}(g(\mathbf{x},t))\, |\det \nabla g(\mathbf{x},t)|\, d\mathbf{x} \\
&= \int_{\mathbb{R}^d} \log \frac{p^*_{t+1}(g(\mathbf{x},t))}{p_t(g(\mathbf{x},t))}\;
 p^*_{t+1}(g(\mathbf{x},t))\, |\det \nabla g(\mathbf{x},t)|\, d\mathbf{x} \\
&= \int_{\mathbb{R}^d} \log \frac{p^*_{t+1}(g(\mathbf{x},t))}{p_t(g(\mathbf{x},t))}\; p_{t+1}(\mathbf{x})\, d\mathbf{x}
 \qquad (\because \text{Eq. (12)}),
\end{aligned}
$$
where the Jacobian factors inside the logarithm cancel. Using Eq. (12) again, $p^*_{t+1}(g(\mathbf{x},t)) = p_{t+1}(\mathbf{x}) / |\det \nabla g(\mathbf{x},t)|$, we have
$$
\mathrm{KL}[p^*_{t+1}(\mathbf{x}) \mid p_t(\mathbf{x})]
= \int_{\mathbb{R}^d} \log \frac{p_{t+1}(\mathbf{x})}{p_t(g(\mathbf{x},t))\, |\det \nabla g(\mathbf{x},t)|}\; p_{t+1}(\mathbf{x})\, d\mathbf{x}
= \mathbb{E}_{p_{t+1}(\mathbf{x})}\big[\log p_{t+1}(\mathbf{x}) - \{\log p_t(g(\mathbf{x},t)) + \log |\det \nabla g(\mathbf{x},t)|\}\big]. \tag{14}
$$
Chen et al. (2018) proposed neural ordinary differential equations, in which a neural network $v$ parametrized by $\omega$ represents the time derivative of the function $g$:
$$
\frac{\partial g}{\partial t} = v(g(\cdot,t), t; \omega).
$$
Moreover, they showed that the instantaneous change of the density can be computed as
$$
\frac{\partial \log p(g)}{\partial t} = -\mathrm{Tr}\left(\frac{\partial v}{\partial g}\right).
$$
Namely, the term $\log |\det \nabla g(\mathbf{x},t)|$ in Eq. (14) equals $-\int_t^{t+1} \mathrm{Tr}(\partial v / \partial g)\, dt$. Therefore, we rewrite Eq. (14) as
$$
\mathrm{KL}[p^*_{t+1}(\mathbf{x}) \mid p_t(\mathbf{x})]
= \mathbb{E}_{p_{t+1}(\mathbf{x})}\left[\log p_{t+1}(\mathbf{x})
 - \left(\log p_t(g(\mathbf{x},t)) - \int_t^{t+1} \mathrm{Tr}\left(\frac{\partial v}{\partial g}\right) dt\right)\right].
$$
Recall that the log-likelihood of a continuous normalizing flow is given by
$$
\log p_{t+1}(g(\mathbf{x}, t+1)) = \log p_t(g(\mathbf{x},t)) - \int_t^{t+1} \mathrm{Tr}\left(\frac{\partial v}{\partial g}\right) dt .
$$
Therefore, we have
$$
\mathrm{KL}[p^*_{t+1}(\mathbf{x}) \mid p_t(\mathbf{x})]
= \mathbb{E}_{p_{t+1}(\mathbf{x})}\big[\log p_{t+1}(\mathbf{x}) - \log p_{t+1}(g(\mathbf{x}, t+1))\big]. \tag{15}
$$
The first term of Eq. (15) does not depend on the CNF $g$, so we can ignore it during training. Therefore, minimizing Eq. (10) is equivalent to minimizing the KL divergence between $p_t$ and $p_{t+1}$ transformed by the CNF $g$. ∎
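As a quick numerical sanity check of Eq. (14), the snippet below compares the Monte Carlo estimate of $\mathbb{E}_{p_{t+1}}[\log p_{t+1}(\mathbf{x}) - \log p_t(g(\mathbf{x},t)) - \log|\det \nabla g(\mathbf{x},t)|]$ with the closed-form KL divergence for a one-dimensional affine flow between Gaussians; all concrete numbers are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# p_{t+1} = N(mu, s^2); the affine "flow" g(x) = a*x + b pushes it to
# p*_{t+1} = N(a*mu + b, (a*s)^2); the objective density is p_t = N(0, 1).
mu, s, a, b = 1.5, 0.8, 0.6, -0.9
x = rng.normal(mu, s, size=1_000_000)

# Monte Carlo estimate of Eq. (14); for this flow, log|det grad g| = log|a|.
mc = np.mean(norm.logpdf(x, mu, s) - norm.logpdf(a * x + b) - np.log(abs(a)))

# Closed-form KL[N(m, v) | N(0, 1)] = (v + m^2 - 1 - log v) / 2.
m, v = a * mu + b, (a * s) ** 2
closed = 0.5 * (v + m ** 2 - 1.0 - np.log(v))

print(f"Monte Carlo: {mc:.4f}  closed form: {closed:.4f}")  # should agree closely
```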
Appendix B.3 Proof of Corollary 1

Proof. In gradual domain adaptation, since the ordered sequence of the intermediate domains is given, we extend Eq. (11) up to the target loss $\mathcal{L}_K$:
$$
\begin{aligned}
\mathcal{L}_2 &\le \mathcal{L}_1 + \frac{M}{\sqrt{2}}
 \sqrt{\mathrm{KL}[p^*_2(\mathbf{x}) \mid p_1(\mathbf{x})]
 + \mathbb{E}_{p_2(\mathbf{x})}\big[\mathrm{KL}[p_2(y \mid \mathbf{x}) \mid p_1(y \mid \mathbf{x})]\big]} \\
\mathcal{L}_3 &\le \mathcal{L}_2 + \frac{M}{\sqrt{2}}
 \sqrt{\mathrm{KL}[p^*_3(\mathbf{x}) \mid p_2(\mathbf{x})]
 + \mathbb{E}_{p_3(\mathbf{x})}\big[\mathrm{KL}[p_3(y \mid \mathbf{x}) \mid p_2(y \mid \mathbf{x})]\big]} \\
&\;\;\vdots \\
\mathcal{L}_K &\le \mathcal{L}_{K-1} + \frac{M}{\sqrt{2}}
 \sqrt{\mathrm{KL}[p^*_K(\mathbf{x}) \mid p_{K-1}(\mathbf{x})]
 + \mathbb{E}_{p_K(\mathbf{x})}\big[\mathrm{KL}[p_K(y \mid \mathbf{x}) \mid p_{K-1}(y \mid \mathbf{x})]\big]} .
\end{aligned}
$$
Summing both sides of the above inequalities, we have
$$
\mathcal{L}_K \le \mathcal{L}_1 + \frac{M}{\sqrt{2}} \sum_{t=2}^{K}
 \sqrt{\mathrm{KL}[p^*_t(\mathbf{x}) \mid p_{t-1}(\mathbf{x})]
 + \mathbb{E}_{p_t(\mathbf{x})}\big[\mathrm{KL}[p_t(y \mid \mathbf{x}) \mid p_{t-1}(y \mid \mathbf{x})]\big]} .
$$
If Assumption 2 holds, we have
$$
\mathcal{L}_K \le \mathcal{L}_1 + \frac{M}{\sqrt{2}} \sum_{t=2}^{K}
 \sqrt{\mathrm{KL}[p^*_t(\mathbf{x}) \mid p_{t-1}(\mathbf{x})]} . \qquad \blacksquare
$$
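Finally, a toy computation shows how the Corollary 1 bound accumulates. All values below (the loss bound $M$, the source loss $\mathcal{L}_1$, and the per-step residuals $\mathrm{KL}[p^*_t(\mathbf{x}) \mid p_{t-1}(\mathbf{x})]$ left by each trained flow) are hypothetical; the point is that the square root makes the bound far more sensitive to a few poorly fitted steps than to many well-fitted ones.

```python
import numpy as np

M = 5.0    # hypothetical upper bound on -log p(y|x) (Assumption 1)
L1 = 0.30  # hypothetical expected loss on the source domain
# Hypothetical residual KL[p*_t | p_{t-1}] left by each trained flow.
kl_residuals = np.array([0.01, 0.02, 0.015])

bound = L1 + M / np.sqrt(2) * np.sum(np.sqrt(kl_residuals))
print(f"L_K <= {bound:.3f}")  # about 1.59 for these numbers
```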