Privacy-Preserving Data Fusion for Traffic State Estimation: A Vertical Federated Learning Approach

Qiqing Wang^a, Kaidi Yang^a,*

^a Department of Civil and Environmental Engineering, National University of Singapore, 1 Engineering Drive 2, Singapore, 117576, Singapore

ARTICLE INFO

Keywords: Data Fusion; Federated Learning; Data Privacy; Traffic State Estimation; Traffic Flow Theory

ABSTRACT

This paper proposes a privacy-preserving data fusion method for traffic state estimation (TSE). Unlike existing works that assume all data sources to be accessible by a single trusted party, we explicitly address data privacy concerns that arise in the collaboration and data sharing between multiple data owners, such as municipal authorities (MAs) and mobility providers (MPs). To this end, we propose a novel vertical federated learning (FL) approach, FedTSE, that enables multiple data owners to collaboratively train and apply a TSE model without having to exchange their private data. To enhance the applicability of the proposed FedTSE in common TSE scenarios with limited availability of ground-truth data, we further propose a privacy-preserving physics-informed FL approach, i.e., FedTSE-PI, that integrates traffic models into FL. Real-world data validation shows that the proposed methods can protect privacy while yielding similar accuracy to the oracle method without privacy considerations.

1. Introduction

Big data applications have been receiving a significant amount of research attention in various domains of intelligent transportation systems (ITS), including traffic state estimation/prediction (Zheng and Liu, 2017; Herrera and Bayen, 2010; Xu, Wei, Peng, Xuan and Guo, 2020; Zhao, Song, Zhang, Liu, Wang, Lin, Deng and Li, 2019a; Zheng, Fan, Wang and Qi, 2020), traffic control (Guo, Li and Ban, 2019; Liu, Zhao, Hoogendoorn and Wang, 2022a; Amini, Gerostathopoulos and Prehofer, 2017; Ning, Zhang, Wang, Obaidat, Guo, Hu, Hu, Guo, Sadoun and Kwok, 2020), and travel route planning (Huang, Xu and Weng, 2020). Traffic big data can be generally divided into two categories. The first type is Eulerian observations measured with fixed roadside sensors (e.g., loop detectors, traffic cameras, radars, etc.) installed on a small set of road links. This type of data is typically owned by municipal authorities (MAs). The second type is Lagrangian observations obtained from vehicle trajectory data (e.g., real-time vehicle locations and speeds), typically possessed by mobility providers (MPs) with large fleets, such as ride-hailing companies and public transport operators. Although these data can be useful for ITS applications, they provide only an incomplete picture of transportation systems due to the partial spatial coverage of the road network (in terms of Eulerian observations) and partial penetration rate over the vehicle population (in terms of Lagrangian observations).

Traffic state estimation (TSE) is an important research topic that leverages these partially observed traffic data to infer key traffic state variables (e.g., flow, density, speed) on road segments. Researchers have proposed a number of TSE methods based on traffic flow theory and machine learning (Caceres, Romero, Benitez and del Castillo, 2012; Ke, Li, Tang, Pan and Wang, 2018; Xu et al.
, 2020; Fedorov, Nikolskaia, Ivanov, Shepelev and Minbaleev, 2019; Cai, Zheng and Yu, 2019; Rostami-Shahrbabaki, Safavi, Papageorgiou, Setoodeh and Papamichail, 2020; Saeedmanesh, Kouvelas and Geroliminis, 2021; van Erp, Knoop and Hoogendoorn, 2018; Wang, Fan and Work, 2016; Yang and Menendez, 2018; Zhao, Zheng, Wong, Wang, Meng and Liu, 2019b; Zhan, Li and Ukkusuri, 2020; Li, Boonaert, Doniec and Lozenguez, 2021a; Wang, Zhao, Yu, Hu, Zheng, Hua, Zhang, Hu and Guo, 2022). However, most of these methods focus on individual data sources, making them sensitive to the spatial network coverage and penetration rate. To better estimate traffic states, a large body of work attempts to improve the performance of TSE algorithms by combining multiple data sources, with a particular focus on the fusion between loop detector data provided by MAs and vehicle trajectories provided by MPs (Wang et al., 2016; van Erp et al., 2018; Wu, Thai, Yadlowsky, Pozdnoukhov and Bayen, 2015; Wang, Zhang, Li, Philip and Huang, 2018; Shahrbabaki, Safavi, Papageorgiou and Papamichail, 2018; Ambühl and Menendez, 2016; Makridis and Kouvelas, 2023; Saeedmanesh et al., 2021). One underlying assumption behind these works is the collaboration established between multiple data owners (i.e., MA and MPs in our context).

* Corresponding author. qiqing.wang@u.nus.edu (Q. Wang); kaidi.yang@nus.edu.sg (K. Yang). ORCID(s): 0009-0003-0053-5910 (Q. Wang); 0000-0001-5120-2866 (K. Yang).
Wang and Yang: Preprint submitted to Elsevier.
Although the benefits of such collaboration have been demonstrated by game-theoretical analysis (Liu and Chow, 2022) and practical projects (e.g., the collaboration between DiDi and local authorities in several Chinese cities as presented in Han, Meng, Zheng and Liu, 2019; Zheng, Sun, Huang, Shen, Yu, Zhu, Liu and Liu, 2018), existing studies generally focus on operational efficiency and overlook privacy concerns. In fact, MPs may be reluctant to contribute their data to another party since data sharing may leak sensitive information about their customers' mobility patterns and system operations. It has been demonstrated that human mobility patterns are highly unique and can be used to infer travelers' identities, personal profiles, and social relationships with high accuracy even when anonymization is applied (De Montjoye, Hidalgo, Verleysen and Blondel, 2013). Customers' information is protected by increasingly strict privacy protection regulations, e.g., the General Data Protection Regulation (GDPR) of the European Union, the violation of which can lead to heavy fines. Moreover, MPs' data, even when aggregated and anonymized, can reveal sensitive operational characteristics such as service coverage, fleet composition, and key algorithm parameters (He and Chow, 2019), and the leakage of such information can compromise their competitive advantages. On the other hand, MA may also be concerned that sharing traffic detector data can enable adversaries to infer signal timing algorithms or the locations of secret government facilities (Strava, 2018), which can lead to security risks. Therefore, a key to the successful implementation of data fusion algorithms in transportation is to protect the privacy of data owners, including both MA and MPs.
To this end, a small portion of research leverages advanced privacy-preserving mechanisms to process traffic data (Kim, Edemacu and Jang, 2022; Jin, Hua, Francia, Chao, Orowska and Zhou, 2022; Gati, Yang, Feng, Nie, Ren and Tarus, 2021; Ying, Cao, Liu, Ma, Ma and Deng, 2022; Liu, James, Kang, Niyato and Zhang, 2020; Parameswarath, Gope and Sikdar, 2022; Gao, Li and Wang, 2022), including differential privacy (DP), secure multi-party computation (MPC), and federated learning (FL). In particular, FL is specially designed to train data-driven models for distributed data owners, each with private data. With FL, each data owner can privately train local models with its data on cell phones or local servers and send only a portion of parameters and transformed data (instead of its raw data) to a host server for aggregation into a global model. The advantage of FL is that it does not require the exchange of sensitive data and can be readily integrated with state-of-the-art machine learning and optimization-based models. Thanks to these advantages, FL has been receiving increasing attention in various application domains, e.g., smart grid (Su, Wang, Luan, Zhang, Li, Chen and Cao, 2021), finance (Antunes, André da Costa, Küderle, Yari and Eskofier, 2022), and healthcare (Imteaj and Amini, 2022).
A significant number of works have been devoted to investigating the tradeoff among model performance (Nilsson, Smith, Ulm, Gustavsson and Jirstrand, 2018; Lai, Dai, Singapuram, Liu, Zhu, Madhyastha and Chowdhury, 2022), privacy protection (Wei, Li, Ding, Ma, Yang, Farokhi, Jin, Quek and Poor, 2020; Li, Wen, Wu, Hu, Wang, Li, Liu and He, 2021b; Ma, Li, Ding, Yang, Shu, Quek and Poor, 2020), communication efficiency (Chen, Shlezinger, Poor, Eldar and Cui, 2021; Hamer, Mohri and Suresh, 2020; Rothchild, Panda, Ullah, Ivkin, Stoica, Braverman, Gonzalez and Arora, 2020), computation requirements (Hanzely, Hanzely, Horváth and Richtárik, 2020; Xia, Ye, Tao, Wu and Li, 2021; Diao, Ding and Tarokh, 2020), and verifiability against malicious actors (Chen, Tian, Liao and Yu, 2020; Nguyen and Thai, 2023). Despite the success of FL, its application in transportation is sparse. To the best of our knowledge, only a few works have implemented FL in the transportation context, e.g., human mobility pattern prediction using trajectory data from cell phones (Feng, Rong, Sun, Guo and Li, 2020), and parking space estimation and traffic flow prediction from spatially distributed sensor data (Huang, Li, Yu, Wu, Xie and Xie, 2021; Liu et al., 2020; Xia, Jin and Chen, 2022).

However, existing FL frameworks in transportation suffer from two limitations. First, existing frameworks are generally based on horizontal FL, which assumes the datasets of all parties to be homogeneous in that they share the same features and structures. However, since traffic data can be extremely diverse, it is often required to perform data fusion from multiple heterogeneous datasets with different features (e.g., sensor measurements and vehicle trajectories).
Moreover, ground-truth labels are typically possessed by MA instead of MPs since such data is typically acquired via expensive aerial imaging, which requires special permissions from the authorities. Hence, vertical FL, in which each data owner possesses different features and labels, can be more suitable for TSE. To the best of our knowledge, there is no work on applying vertical FL-based data fusion methods for TSE. Second, existing works assume that the ground-truth labels are abundant, which is unfortunately not true for TSE, as collecting ground-truth traffic states requires expensive efforts such as drone surveillance (Barmpounakis and Geroliminis, 2020). Hence, existing ground-truth datasets (e.g., vehicle trajectory datasets) consist of data of only a few days, which may not be sufficient for training high-quality FL models.

In this paper, we devise an FL-based privacy-preserving data fusion approach for MA and MPs to collaboratively develop TSE models. The contribution is two-fold. First, we leverage the promising framework of FL that enables multiple parties to collaboratively train a model without exchanging private data. Unlike existing works in transportation that focus on horizontal FL over many homogeneous edge devices (e.g., detectors, vehicles, travelers, etc.), we formulate the traffic state estimation problem as a vertical FL problem and build our algorithm on the recently developed FedBCD framework (Liu, Zhang, Kang, Li, Chen, Hong and Yang, 2022c), which reduces communication overhead through local gradient updates and is easy to integrate with neural networks.
Second, we propose a physics-informed FL approach that integrates traffic models with FL to improve data efficiency, which ensures the applicability of the proposed FedTSE in common TSE scenarios with limited ground-truth availability. Physics-informed deep learning integrates physical models into learning-based approaches to improve the data efficiency in the training process and/or to preserve desired physical properties of the trained models, and has recently attracted increasing attention in the transportation research community (Han, Wang, Li, Roncoli, Gao and Liu, 2022; Mo, Shi and Di, 2021; Di, Shi, Mo and Fu, 2023; Lu, Li, Wu and Zhou, 2023). However, the integration of traffic models into FL differs from classical physics-informed deep learning in that the data privacy of both features and (partial) labels needs to be protected in the training process. We address this challenge by combining FL with secure functional encryption.

The rest of the paper is organized as follows. Section 2 introduces the general framework for the TSE problem. Section 3 introduces the FedTSE approach assuming the availability of ground-truth labels. Section 4 presents the case studies and results for FedTSE. Section 5 devises a privacy-preserving physics-informed FedTSE approach to integrate FedTSE with traffic models to improve data efficiency in training. Section 6 presents the case studies and results for FedTSE-PI. Section 7 concludes the paper.

2. Problem Statement

In this section, we present the privacy-preserving TSE problem. Consider a typical city with an urban transportation network described by a directed graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where each vertex $n \in \mathcal{N}$ represents a road link, and each edge $e = (m, n) \in \mathcal{E}$ represents the connectivity from link $m \in \mathcal{N}$ to link $n \in \mathcal{N}$. Let us denote the considered time horizon as a set of discrete intervals $\mathcal{T} = \{1, 2, \cdots, T\}$ of a given size $\Delta t$.
A local MA (indexed by $k = 0$) and $K$ MPs (indexed by $k = 1, \cdots, K$) are interested in collaborating to develop a learning-based model to fuse their data for real-time TSE, i.e., to produce estimates $\hat{\boldsymbol{y}} = \{\hat{q}_{nt}, \hat{k}_{nt}\}_{n \in \mathcal{N}, t \in \mathcal{T}}$ for flow ($\hat{q}_{nt}$) and density ($\hat{k}_{nt}$) on each road link $n \in \mathcal{N}$.

The local MA has continuous access to traffic count measurements from loop detectors installed on a small portion of roads $\mathcal{N}_{\mathrm{d}} \subset \mathcal{N}$. Let us denote $c_{nt}$ as the traffic count at time step $t \in \mathcal{T}$ measured by the loop detector installed on road link $n \in \mathcal{N}_{\mathrm{d}}$. MA may also have conducted expensive aerial imaging to collect ground-truth flow and density information across the entire transportation network over the time horizon $\mathcal{T}$. Let us denote the ground-truth flow and density of road link $n$ at time step $t$ as $q_{nt}$ and $k_{nt}$, respectively.

Each MP $k$ manages a fleet of vehicles denoted as $\mathcal{R}_k$, which provides real-time vehicle trajectory information across the network. Notice that the fleets owned by MPs collectively constitute a small portion of the entire vehicle population on the road. At each time step, the vehicle trajectories of each MP are characterized by a set of features, such as the origin-destination (OD) demand $\boldsymbol{\mu}^k_t = \{\mu^k_{odt}\}_{o,d \in \mathcal{N}}$, the travel time $\boldsymbol{\tau}^k_t = \{\tau^{rk}_{nt}\}_{n \in \mathcal{N}, r \in \mathcal{R}_k}$ and travel distance $\boldsymbol{\xi}^k_t = \{\xi^{rk}_{nt}\}_{n \in \mathcal{N}, r \in \mathcal{R}_k}$ of each vehicle belonging to MP $k$ on each road link, the turning ratios $\boldsymbol{\zeta}^k_t = \{\zeta^k_{et}\}_{e \in \mathcal{E}}$ observed by each MP between any adjacent pair of links, etc. All these features can provide valuable information about traffic states, e.g., the total travel time and total travel distance can serve as proxies of density and flow, respectively. We further note that MPs can be heterogeneous in that the features used by individual MPs do not have to be the same.
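The proxy relationship mentioned above can be made concrete with Edie's generalized definitions, under which the density and flow contributed by a set of vehicles on a link are the total time spent and the total distance travelled, each divided by the space-time area (link length times interval length). The sketch below is only an illustration under that assumption: the function name, link length, interval, and sample values are hypothetical and are not taken from the paper. Because an MP observes only its own fleet, the result is the fleet's partial contribution rather than the full traffic state.

```python
def edie_fleet_state(travel_times_s, travel_distances_m, link_length_m, interval_s):
    """Fleet-only density (veh/m) and flow (veh/s) on one link in one interval.

    Assumes Edie's generalized definitions:
      density = total time spent / (link length * interval)
      flow    = total distance travelled / (link length * interval)
    """
    area = link_length_m * interval_s
    density = sum(travel_times_s) / area
    flow = sum(travel_distances_m) / area
    return density, flow


# Hypothetical example: three fleet vehicles on a 120 m link during a 60 s interval.
tau = [20.0, 30.0, 10.0]    # travel time of each vehicle on the link (s)
xi = [120.0, 120.0, 60.0]   # distance travelled by each vehicle on the link (m)
density, flow = edie_fleet_state(tau, xi, link_length_m=120.0, interval_s=60.0)
```

Scaling these fleet-only quantities to the full vehicle population would require the penetration rate, which is generally unknown; this is one reason a learned fusion model is attractive.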
Following the previous description, we can summarize the dataset of data owner $k$ (MA or MP) as $\mathcal{D}_k = \{\boldsymbol{d}^k_t\}_{t \in \mathcal{T}}$, which consists of $T$ data samples, and each sample $\boldsymbol{d}^k_t$ corresponds to a time step $t$. Specifically, $\boldsymbol{d}^k_t$ can be written as

$$
\boldsymbol{d}^k_t =
\begin{cases}
(\boldsymbol{x}^k_t, \boldsymbol{y}_t), & \text{if } k = 0,\ t \in \mathcal{T} \\
\boldsymbol{x}^k_t, & \text{if } k = 1, \cdots, K,\ t \in \mathcal{T}
\end{cases}
\tag{1}
$$

where $\boldsymbol{y}_t = \{q_{nt}, k_{nt}\}_{n \in \mathcal{N}}$ represents the ground-truth label at time step $t \in \mathcal{T}$, and $\boldsymbol{x}^k_t$ summarizes the features corresponding to time step $t$ of data owner $k$, represented as

$$
\boldsymbol{x}^k_t =
\begin{cases}
\{c_{nt}\}_{n \in \mathcal{N}_{\mathrm{d}}}, & \text{if } k = 0 \\
\{\boldsymbol{\mu}^k_t, \boldsymbol{\tau}^k_t, \boldsymbol{\xi}^k_t, \boldsymbol{\zeta}^k_t, \ldots\}, & \text{if } k = 1, \cdots, K
\end{cases}
\tag{2}
$$

The data of MPs can be beneficial for MA to understand the traffic conditions in the transportation network, especially on the links where MA does not have detectors. However, MPs can be reluctant to share their data due to the fear of privacy leakage. Even if MPs only share aggregated data with MA, such data can still be used to infer sensitive information about MPs' operations, such as their service coverage, algorithm specifics, etc., which could potentially hinder their competitive advantages. Hence, our main goal is to facilitate collaboration while preserving the data privacy of both MA and MPs.

We would like to highlight the key difference between our problem setting and that of existing FL-based privacy-preserving works in traffic systems such as Liu et al. (2020). Existing works leverage horizontal FL in edge computing scenarios where data is horizontally distributed over edge devices, e.g., sensors for specific traffic regions or road segments, and the data at different devices typically contains identical features due to homogeneous sensors.
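To make the contrast with the vertically partitioned datasets of Equations 1 and 2 concrete, the sketch below splits one small table of samples in both ways. It is only an illustration: the column names and values are hypothetical. In the horizontal split each party holds different samples with the same columns; in the vertical split each owner holds different columns for the same time steps, and only the MA holds labels.

```python
# Toy table: one row per time step t, with hypothetical column names.
rows = [
    {"t": 1, "detector_count": 12, "fleet_travel_time": 40.0, "flow_label": 0.20},
    {"t": 2, "detector_count": 15, "fleet_travel_time": 55.0, "flow_label": 0.25},
    {"t": 3, "detector_count": 9, "fleet_travel_time": 35.0, "flow_label": 0.15},
    {"t": 4, "detector_count": 11, "fleet_travel_time": 42.0, "flow_label": 0.18},
]


def horizontal_split(rows, n_parties):
    """Horizontal partition: parties hold different samples, same columns."""
    return [rows[i::n_parties] for i in range(n_parties)]


def vertical_split(rows, columns_by_owner):
    """Vertical partition: owners hold the same time steps, different columns."""
    return {
        owner: [{"t": r["t"], **{c: r[c] for c in cols}} for r in rows]
        for owner, cols in columns_by_owner.items()
    }


horizontal = horizontal_split(rows, n_parties=2)
vertical = vertical_split(rows, {
    "MA": ["detector_count", "flow_label"],  # k = 0: features and labels
    "MP1": ["fleet_travel_time"],            # k = 1: features only
})
```

In the vertical case the owners must agree on the sample index (here the time step `t`) so that their records can be aligned during training.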
These works aim to enable the edge nodes to jointly train a learning-based model (typically identical for each edge device) without these devices having to share data. Although privacy can be protected with these methods, their main benefits lie in the reduction of communication costs. In our problem, however, data is vertically distributed among multiple data owners with separate measurements over the urban network, which can be seen as different features such as loop detector data and trajectory information, as shown in Figure 1. In this type of problem, privacy can be an important consideration because data can contain sensitive trade secrets of each data owner. To the best of our knowledge, FL for such problems with vertical data segmentation has rarely been explored in transportation literature. Therefore, we devise a vertical FL-based approach to perform TSE with privacy-preserving fusion of MA and MPs' data.

Before proceeding to introduce our vertical FL-based framework, let us summarize the assumptions we make about MA and MPs. First, our problem setting implicitly imposes an assumption that MPs are willing to collaborate with MA by contributing data (in a privacy-preserving manner) and computation resources for MA to perform TSE, as presented in Assumption 1.

Assumption 1 (Willingness of MPs to collaborate). MPs are willing to contribute their data and computation resources for the training and deployment of the proposed FedTSE algorithms.

Remark 1 (Willingness of MPs to collaborate). We make the following remarks for Assumption 1. First, contributing to MA's traffic state estimation and control can help alleviate traffic congestion, which in turn reduces the travel time of MPs' fleets and hence benefits MPs' operations.
More broadly, the willingness of data owners to share data can be modeled as a cooperative game (Jia, Dao, Wang, Hubis, Hynes, Gürel, Li, Zhang, Song and Spanos, 2019; Donahue and Kleinberg, 2021) or coopetitive game (Liu and Chow, 2022), and incentives have been developed to encourage the participation of data owners, especially in FL settings (Jiang, Burkhalter, Fu, Ding, Du, Hithnawi, Li and Zhang, 2022; Zeng, Zeng, Wang, Li and Chu, 2021). Second, in terms of computation resources, it is natural for MPs, such as ridesharing companies and mapping companies, to be equipped with sufficient computational resources since their own operations require solving complex optimization problems. Hence, they have the capability to perform internal computation, as required by FL. Third, it is worth noting that the collaboration between MPs and MA not only resides in scientific research but also has been implemented in practice. For example, DiDi, a ridesharing company in China, has contributed its trajectory data to the local authorities in several cities to improve signal timing and ramp metering algorithms (Han et al., 2019; Zheng et al., 2018), which sheds light on the practical significance of the proposed framework.

Then, we make the following assumption on the data-sharing behavior of MA and MPs, as described in Assumption 2.

Assumption 2 (Honest-but-curious MA and MPs). We assume that both types of data owners, i.e., MA and MPs, are honest but curious, meaning that they will strictly follow the communication and computation protocols with no intention to modify their data to achieve better benefits, but may attempt to infer the data of other data owners.

Assumption 2 is a commonly made assumption in privacy-related literature (Tsao, Yang, Gopalakrishnan and Pavone, 2022a; Tan and Yang, 2024). We make the following remarks on the implications of this assumption in our context.
The first remark is about curious MA and MPs, as well as their privacy concerns.

Remark 2 (Curious MA/MPs and sensitive information). We consider both MA and MPs to be curious, as MPs are naturally interested in learning the operations of other MPs due to competition, and adversaries may gain unauthorized access to the data transmitted to MA. Specifically, we consider the following information of MA and MPs to be sensitive. First, MPs are interested in protecting the information about their customers and business operations, including passenger origin-destination pairs (ODs) and vehicle trajectories that reveal passenger information, as well as the aggregated trajectories on the traversed links that can be used to infer the service area, demand patterns, fleet size, and algorithm parameters (He and Chow, 2020), both of which are essential for MPs to maintain their competitive advantage. Second, MA is mainly concerned that the leakage of their data can enable adversaries to infer traffic operational algorithms (e.g., signal timing algorithms) and locations of sensitive government facilities, which might be leveraged by adversaries to perform attacks.

Further note that in Assumption 2, we consider honest MPs to simplify the discussion. In practice, MPs may be incentivized to send forged data if they can achieve higher benefits, especially with privacy-preserving mechanisms that can make it hard to verify MPs' data. In such cases, we can relax this assumption by incorporating a verification mechanism based on zero-knowledge proof (Fiege, Fiat and Shamir, 1987) to prevent MPs' strategic behavior, as described in Remark 3.

Remark 3 (Verification of MP data via zero-knowledge proof).
Zero-knowledge proof enables a prover (i.e., MP in our context) to prove to a verifier (i.e., MA in our context) that it is performing the correct calculations with the true data without explicitly sharing the data. Our previous work (Tsao, Yang, Zoepf and Pavone, 2022b) proposed a zero-knowledge proof-based data-sharing protocol between MA and MPs, such that MA can outsource their calculations to MPs and be able to verify the calculation results without requiring MPs to share their true data. Such calculations can include finding the value of a function, checking if some conditions are met, solving a convex optimization problem, etc. Naturally, since MPs in the FL framework mainly perform simple calculations such as gradient steps and forward propagation (see Section 3), FL and zero-knowledge proof can be combined to improve the verifiability of the information shared by MPs to prevent strategic behavior (Chen et al., 2020; Nguyen and Thai, 2023; Ghodsi, Javaheripi, Sheybani, Zhang, Huang and Koushanfar, 2023; Li, Tao, Zhang, Liu and Xu, 2021c), which is, however, beyond the scope of the paper.

With the problem settings and assumptions, we now proceed to introduce our proposed FL-based privacy-preserving data fusion framework for TSE.

3. FedTSE: Vertical Federated Traffic State Estimation

In this section, we present a vertical FL-based data fusion approach, hereafter named FedTSE, to enable MA to improve TSE accuracy with MPs' data while protecting the data privacy of all data owners. The framework of FedTSE is illustrated in Figure 2, where MA serves as the host (in blue) and each MP serves as a guest (in red and yellow). In this framework, each data owner $k$ (MP or MA) maintains a private sub-model $\boldsymbol{\phi}_k(\cdot)$ only accessible by the data owner itself, which can be represented as a neural network parameterized by parameters $\boldsymbol{\theta}_k$.
These models are used and trained in a collaborative manner using private datasets $\mathcal{D}_k$ that will always stay with their owners. As shown in Figure 2, at each time step, MP $k$ processes its data $\boldsymbol{x}^k$ according to a sub-model $\boldsymbol{z}^k = \boldsymbol{\phi}_k(\boldsymbol{x}^k; \boldsymbol{\theta}_k)$ with $\boldsymbol{\theta}_k$ as private parameters to be determined. The sub-model output $\boldsymbol{z}^k$ will be sent to MA. MA, after collecting the sub-model outputs $\boldsymbol{z}^k$ from all MPs, will perform TSE according to its sub-model $\hat{\boldsymbol{y}} = \boldsymbol{\phi}_0(\boldsymbol{x}^0, \boldsymbol{z}^1, \cdots, \boldsymbol{z}^K; \boldsymbol{\theta}_0)$ with private parameters $\boldsymbol{\theta}_0$ to be determined.

In order to obtain the parameters $\Theta = \{\boldsymbol{\theta}_0, \cdots, \boldsymbol{\theta}_K\}$ for all data owners, the FL framework aims to solve a collaborative training problem written as:

$$
\min_{\Theta} L(\Theta; \mathcal{D}) \triangleq \frac{1}{T} \sum_{t=1}^{T} f\left(\boldsymbol{\theta}_0, \ldots, \boldsymbol{\theta}_K; \boldsymbol{d}^0_t, \cdots, \boldsymbol{d}^K_t\right) + \lambda \sum_{k=0}^{K} \gamma\left(\boldsymbol{\theta}_k\right)
\tag{3}
$$

where the loss function $f(\cdot)$ and regularizer $\gamma(\cdot)$ are combined with a weight parameter $\lambda$. The loss function can be represented in the form of $f\left(\boldsymbol{\phi}_0(\boldsymbol{x}^0_t, \boldsymbol{z}^1_t, \cdots, \boldsymbol{z}^K_t; \boldsymbol{\theta}_0), \boldsymbol{y}_t\right)$ to penalize the deviation between estimated states $\boldsymbol{\phi}_0(\boldsymbol{x}^0_t, \boldsymbol{z}^1_t, \cdots, \boldsymbol{z}^K_t; \boldsymbol{\theta}_0)$ and true states $\boldsymbol{y}_t$, where $\boldsymbol{z}^k_t = \boldsymbol{\phi}_k(\boldsymbol{x}^k_t; \boldsymbol{\theta}_k)$ is the output from the private model of MP $k$ at time step $t$. Specifically, we leverage the following loss function for TSE to penalize the mean squared errors of flow and density estimation:

$$
f(\Theta; \mathcal{D}) = f_k(\Theta; \mathcal{D}) + f_q(\Theta; \mathcal{D}) = \frac{1}{N} \sum_{v \in \mathcal{N}} \left\| k_{vt} - \hat{k}_{vt} \right\|^2 + \alpha \frac{1}{N} \sum_{v \in \mathcal{N}} \left\| q_{vt} - \hat{q}_{vt} \right\|^2
\tag{4}
$$

where $\alpha$ represents the weighting factor between the two terms.

Figure 1: Vertically partitioned traffic data in urban transportation network.

Equation 3 can be jointly solved by MA and MPs using variants of the Federated Stochastic Gradient Descent (FedSGD) approach (McMahan, Moore, Ramage, Hampson and y Arcas, 2017; Liu et al.
, 2022c) without explicitly sharing their raw data. We employ a Federated Stochastic Block Coordinate Descent (FedBCD) approach (Liu et al., 2022c) that enables a sufficient number of local updates to account for the expensive communication overhead in FL training.

Algorithm 1 describes the training procedure of FedTSE, which requires real-time communication between MA and MPs. Within each communication round, MA samples a mini-batch of time steps $\mathcal{B} \subset \mathcal{T}$ and shares it with MPs to synchronize the data used for training. Each MP then computes its local model output $\boldsymbol{z}^k_t = \boldsymbol{\phi}_k(\boldsymbol{x}^k_t; \boldsymbol{\theta}_k)$ (i.e., intermediate results) and sends it to MA. After receiving the sub-model outputs $\boldsymbol{z}^k_t$ from all MPs, MA computes two gradients: (1) the gradient of $L$ with respect to the model parameters of MA ($\boldsymbol{\theta}_0$), and (2) the gradients of $L$ with respect to the output $\boldsymbol{z}^k_t$ of each MP, i.e., $\frac{\partial L}{\partial \boldsymbol{z}^k_t}$, which will be sent back to MP $k$. MP $k$ uses the received gradient $\frac{\partial L}{\partial \boldsymbol{z}^k_t}$ to calculate the gradient of the loss function $L$ with respect to its sub-model parameters $\boldsymbol{\theta}_k$ based on the chain rule of partial derivatives shown in Equation 5:

$$
\frac{\partial L}{\partial \boldsymbol{\theta}_k} = \sum_{t \in \mathcal{B}} \frac{\partial L}{\partial \boldsymbol{z}^k_t} \frac{\partial \boldsymbol{z}^k_t}{\partial \boldsymbol{\theta}_k},
\tag{5}
$$

which will be used by MP $k$ to update its sub-model parameters $\boldsymbol{\theta}_k$. Notice that since the sub-models are private, MA does not have any information about $\frac{\partial \boldsymbol{z}^k_t}{\partial \boldsymbol{\theta}_k}$ and hence $\frac{\partial L}{\partial \boldsymbol{\theta}_k}$.

After data owners communicate and derive the relevant gradients, they will use these gradients to update their sub-model parameters according to a gradient-descent rule $\boldsymbol{\theta}_k^{i+1} = \boldsymbol{\theta}_k^{i} - \eta_k \frac{\partial L}{\partial \boldsymbol{\theta}_k}$, with $i$ indexing the iteration and $\eta_k$ indicating the learning rate of data owner $k$.

It is worth noting that the sharing of relevant gradients and sub-model outputs incurs expensive communication overhead between MA and MPs (Liu, Kang, Zou, Pu, He, Ye, Ouyang, Zhang and Yang, 2022b; Wei, Li, Ma, Ding, Wei, Wu, Chen and Ranbaduge, 2022).
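The exchange of sub-model outputs and gradients described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes linear sub-models in place of neural networks, a scalar label, a plain mean-squared-error loss with no regularizer, and synthetic data; the class names `Guest` and `Host` and all numerical values are hypothetical. The guest applies the chain rule of Equation 5 using only the gradient it receives, and the host never sees guest features or parameters.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


class Guest:
    """An MP: private features x_t and parameters theta, with z_t = theta . x_t."""

    def __init__(self, features, dim, lr):
        self._x = features          # private features, never shared
        self._theta = [0.1] * dim   # private parameters, never shared
        self._lr = lr

    def forward(self, batch):
        # Only the sub-model outputs z_t leave the guest.
        return {t: dot(self._theta, self._x[t]) for t in batch}

    def update(self, grad_z):
        # Chain rule (Equation 5): dL/dtheta = sum_t dL/dz_t * dz_t/dtheta,
        # where dz_t/dtheta = x_t for this linear sub-model.
        grad = [sum(g * self._x[t][j] for t, g in grad_z.items())
                for j in range(len(self._theta))]
        self._theta = [th - self._lr * g for th, g in zip(self._theta, grad)]


class Host:
    """The MA: private features x0_t, labels y_t, and parameters (w, v)."""

    def __init__(self, features, labels, dim, n_guests, lr):
        self._x, self._y, self._lr = features, labels, lr
        self._w = [0.1] * dim        # weights on the host's own features
        self._v = [0.1] * n_guests   # weights on each guest output

    def step(self, batch, outputs):
        n = len(batch)
        err = {t: dot(self._w, self._x[t])
                  + sum(v * z[t] for v, z in zip(self._v, outputs))
                  - self._y[t]
               for t in batch}
        loss = sum(e * e for e in err.values()) / n
        # Gradients w.r.t. each guest output, computed before the host update.
        grads_z = [{t: 2.0 * err[t] * v / n for t in batch} for v in self._v]
        grad_w = [sum(2.0 * err[t] * self._x[t][j] for t in batch) / n
                  for j in range(len(self._w))]
        grad_v = [sum(2.0 * err[t] * z[t] for t in batch) / n for z in outputs]
        self._w = [w - self._lr * g for w, g in zip(self._w, grad_w)]
        self._v = [v - self._lr * g for v, g in zip(self._v, grad_v)]
        return loss, grads_z


# Synthetic, hypothetical data: one host feature, two guests with 2 and 1 features.
rng = random.Random(0)
T = 200
x0 = [[rng.random()] for _ in range(T)]
x1 = [[rng.random(), rng.random()] for _ in range(T)]
x2 = [[rng.random()] for _ in range(T)]
y = [x0[t][0] + 0.5 * x1[t][0] - 0.3 * x1[t][1] + 0.8 * x2[t][0] for t in range(T)]

host = Host(x0, y, dim=1, n_guests=2, lr=0.1)
guests = [Guest(x1, dim=2, lr=0.1), Guest(x2, dim=1, lr=0.1)]

losses = []
for _ in range(300):
    batch = rng.sample(range(T), 32)              # host samples and shares the batch
    outputs = [g.forward(batch) for g in guests]  # guests send z_t to the host
    loss, grads_z = host.step(batch, outputs)     # host updates, returns dL/dz_t
    for g, gz in zip(guests, grads_z):
        g.update(gz)                              # guests update privately
    losses.append(loss)
```

Every iteration of this sketch involves a full exchange of outputs and gradients between the host and the guests.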
We adopt a straightforward method to reduce the amount of data exchanged between MA and MPs by allowing them to locally update the sub-model for $Q > 1$ rounds without exchanging information, using the previously received gradient information. Such local updates have been widely used in existing literature, and it has been proved that the communication overhead can be significantly reduced without significantly impacting FL performance if $Q$ is chosen appropriately (Liu et al., 2022c).

Figure 2: Schematic illustration of FedTSE, where MA (in blue) and multiple MPs (in orange and yellow) jointly train a neural network for TSE while keeping their data private. Only the values and gradients of private sub-models are exchanged between parties.

We make the following remarks to better discuss the properties of FedTSE. As the main motivation for adopting FL is privacy, we make Remark 4 to discuss the privacy guarantee provided by FedTSE for each data owner.

Remark 4 (Privacy of data owners in FedTSE). Similar to Liu et al. (2022c), let us say the data privacy of data owner $k$ (MP or MA) is preserved if one cannot uniquely determine its true data $\boldsymbol{x}^k$ from its exchanged messages, i.e., the sub-model output $\boldsymbol{z}^k$ shared by MP $k$ with MA and/or the corresponding gradients $\frac{\partial L}{\partial \boldsymbol{z}^k}$ sent from MA to MPs. With such a privacy notion, the developed FedTSE algorithm can protect the data privacy of all data owners. The reason is two-fold. First, the sub-models owned by individual MPs and the MA are private, meaning that the data of both types of data owners are transformed via an unknown transformation. Liu et al.
(2022c) have proved that as long as the feature dimension of data owner $k$ is no less than 2, the exchanged message from data owner $k$ corresponds to infinite possibilities of true states $\boldsymbol{x}^k$. This is rather intuitive since we can tweak the true states $\boldsymbol{x}^k$ and the sub-model $\boldsymbol{\phi}_k(\cdot)$ in a joint manner without changing the sub-model output $\boldsymbol{z}^k$ for MP and the exchanged gradients $\frac{\partial L}{\partial \boldsymbol{z}^k}$ for MA. Second, we reduce the dimension of the sub-model output to be significantly lower than that of the features. This further helps to reduce the information shared between MA and MPs, and thus strengthens privacy protection. Further note that such a privacy guarantee can be enhanced by integrating with other privacy-preserving mechanisms, such as differential privacy (Dwork, 2006; Tsao et al., 2022a) and secure multiparty computation (Shamir, 1979; Tan and Yang, 2024), which will serve as future work.

Remark 5 shows that FedTSE does not impose significant computational burdens on MPs.

Remark 5 (Computation requirements of FedTSE). FedTSE does not impose a heavy computation burden on MPs.
In particular, each MP's private model, i.e., the local neural network, is relatively lightweight, with an input dimension of $H|\mathcal{E}|$ and an output dimension of no greater than $|\mathcal{E}|$, where $H$ represents the size of the feature space for each edge (with a typical value of 20 in our case studies), and $|\mathcal{E}|$ represents the number of edges. Hence, the training and the deployment of such neural networks would be relatively affordable for MPs. In fact, for our case study, the neural networks can be trained on a PC with an NVIDIA GeForce RTX 2060 within 30 minutes and on a server with an NVIDIA A100 within 10 minutes.

Algorithm 1: Algorithm for FedTSE
Input: Private datasets $\mathcal{D}_k$ and learning rate $\eta_k$ for data owner $k = 0, 1, \dots, K$, regularizer weight parameter $\lambda$, and the number of local updates $Q$
Output: $\Theta = (\boldsymbol{\theta}_0, \boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_K)$
for each iteration $i = 1, 2, \dots$ do
    if $i \bmod Q = 0$ then
        MA randomly samples a mini-batch of time steps $\mathcal{B} \subset \mathcal{T}$ and synchronizes it with MPs
        for $k = 1, \dots, K$ in parallel do
            MP $k$ computes local output $\boldsymbol{z}^k_t = \boldsymbol{\phi}_k(\boldsymbol{x}^k_t; \boldsymbol{\theta}_k)$ with its private sub-model for $t \in \mathcal{B}$
            MP $k$ sends $\boldsymbol{z}^k_t$ to MA
        end
        MA computes gradients $\partial L / \partial \boldsymbol{\theta}_0$ and updates sub-model parameters according to $\boldsymbol{\theta}^{i+1}_0 = \boldsymbol{\theta}^{i}_0 - \eta_0 \, \partial L / \partial \boldsymbol{\theta}_0$
        MA computes and sends gradients $\partial L / \partial \boldsymbol{z}^k_t$ to each MP $k$
    end
    for $k = 1, \dots, K$ in parallel do
        MP $k$ computes $\partial L / \partial \boldsymbol{\theta}_k$ with the most recent $\partial L / \partial \boldsymbol{z}^k_t$ according to Equation 5
        MP $k$ updates $\boldsymbol{\theta}^{i+1}_k = \boldsymbol{\theta}^{i}_k - \eta_k \, \partial L / \partial \boldsymbol{\theta}_k$
    end
    if convergence criterion met then
        break
    end
end

4. Case study on FedTSE
4.1. Settings of the case study
4.1.1. Datasets
In this paper, we evaluate the performance of FedTSE through a case study of a signalized urban corridor in Athens, Greece.
The case study encompasses both a real-world dataset, pNEUMA (Barmpounakis and Geroliminis, 2020), and a simulated dataset generated from a well-calibrated SUMO simulation (Lopez, Behrisch, Bieker-Walz, Erdmann, Flötteröd, Hilbrich, Lücken, Rummel, Wagner and Wießner, 2018) of this site. Simulated data is included because the real-world data is limited and insufficient for a comprehensive analysis. Consequently, the real-world data is used to evaluate the performance of FedTSE, while the simulated data is used to conduct a comprehensive sensitivity analysis.

Real-world dataset: The real-world dataset pNEUMA is adopted for the evaluation of FedTSE, where a fleet of ten drones recorded the trajectories of vehicles operating in the central district of Athens, Greece, via drone imaging. The trajectories covered the morning peak on four weekdays (i.e., 24/10/2018, 29/10/2018, 30/10/2018 and 01/11/2018), from 8:30 to 10:30 a.m. Due to the battery capacity limitations of the drones, the trajectories were recorded at 30-minute intervals by the swarm in sequential sessions with blind gaps (i.e., the effective recording interval is less than 30 minutes). A main road corridor named Panepistimiou Street, a primary road that runs one way for non-transit vehicles, together with its side links, is selected as the study site. This corridor consists of 7 intersections with fixed-time signal timing strategies and 17 links, each 90-160 meters in length. The geometric layout of the corridor and the locations of the detectors are shown in Figure 3, whereby we generally assume that only two links are equipped with MA's loop detectors.
We associate vehicle trajectories in the pNEUMA dataset with links using an open-source map-matching software package called LeuvenMapMatching (Meert and Verbeke, 2018). The goal of TSE is to estimate the traffic states of 9 links with relatively high traffic flow.

Figure 3: pNEUMA urban road corridor

Simulation dataset: Due to the limited amount of real-world data, we built a SUMO simulation to provide a more comprehensive analysis. We use the real-world trajectory data of the four weekdays collected from pNEUMA to calibrate our simulation, including the traffic signals, demand, and car-following models. Virtual traffic detectors were created to collect the traffic flow data. Our traffic simulation was conducted over 810 minutes, with the initial 10 minutes serving as a warm-up phase to ensure the stability and accuracy of the simulation. The subsequent 800 minutes of data were then used to evaluate our estimations. We also assume that only two links in this main road corridor are equipped with MA's loop detectors, as shown in Figure 3.

4.1.2. Benchmarks for FedTSE
To evaluate the performance of FedTSE, we compare the following benchmarks. Recall that here MA is assumed to have access to ground-truth traffic states, and hence data fusion for FedTSE refers to the expansion of features with MPs' data. Here, we consider the scenario with one MP.
• TSE-n (no data fusion): This benchmark corresponds to the scenario where MP is not willing to share any data with MA due to privacy concerns, and MA estimates traffic states using only its own data from a limited number of fixed observers (i.e., two observations as shown in Figure 3) without fusing with MPs' features.
Specifically, MA's feature at each time step $t$ is selected as the past $h$-step loop detector data of all links with detection, i.e., $\{c_{nt'}\}$ for every detector-equipped link $n$ and $t' \in \{t-h+1, \dots, t\}$.
• TSE-p (partial data fusion without privacy protection): This benchmark corresponds to the scenario where MP directly shares the real-time speed data calculated from its operating fleets without considering privacy protection, and MA integrates both its own loop detector data and MP's speed data to train a TSE model. Specifically, MA's feature is the past $h$-step loop detector data, as in TSE-n, and MP's feature is the past $h$-step speed information, $\{\mu_{nt'}\}$ for each link $n$ and $t' \in \{t-h+1, \dots, t\}$, where the speed of link $n$ at time step $t$ is $\mu_{nt} = \left(\sum_{r} \xi^{r}_{nt}\right) / \left(\sum_{r} \tau^{r}_{nt}\right)$, i.e., the ratio between the total travel distance and the total travel time of MP's vehicles $r$. We envision that this scheme of collaboration can be accepted by some MPs, since the speed data contains much less sensitive information than raw trajectory data. Nevertheless, note that even though the speed data is highly aggregated, it may still reveal MPs' service areas. This benchmark is used to demonstrate the necessity of performing more extensive data fusion with privacy considerations.
• Oracle (data fusion without privacy protection): This benchmark corresponds to the scenario where MP fully trusts MA and is willing to share all its trajectory data with MA without privacy protection. Therefore, MA trains FedTSE with the loop detectors' data and MP's data, including the past $h$-step total travel time and total travel distance.
Specifically, MA's features are identical to those of TSE-n and TSE-p, and MP's features can be represented by $\left\{\sum_{r} \xi^{r}_{nt'}, \sum_{r} \tau^{r}_{nt'}\right\}$ for each link $n$ and $t' \in \{t-h+1, \dots, t\}$. This benchmark serves as the oracle that quantifies the performance upper bound we expect to achieve via data fusion.
• Oracle-cell (extensive data fusion without privacy protection): This benchmark is similar to Oracle; the only difference is that MP further divides each link into multiple cells (6 in our experiment) and each time step into multiple sub-steps (2 in our experiment), and uses the total travel time and total travel distance within each cell and sub-step as features. Such a finer representation retains more information from the trajectory data, which has better potential for improving TSE but raises greater privacy concerns if shared directly. This benchmark is used to evaluate whether higher-resolution data fusion could enhance TSE performance.
• FedTSE (Q=1): The proposed privacy-preserving vertical federated learning framework for data fusion (FedTSE) with one local update. The features of both MA and MPs are identical to those of Oracle.
• FedTSE-cell (Q=1): FedTSE whereby MP provides data with the cell representation. The features of both MA and MPs are identical to those of Oracle-cell. This benchmark is used to highlight the benefits of vertical FL. Since privacy is protected, vertical FL can encourage MPs to use higher-resolution features, which could in turn improve the TSE performance.
• FedTSE (Q=2): FedTSE with two local updates. The features of both MA and MPs are identical to those of Oracle.
• FedTSE (Q=3): FedTSE with three local updates. The features of both MA and MPs are identical to those of Oracle.

4.1.3. Experiment settings
We next present our experiment settings, including the experiment scenario and model architecture.
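As a concrete illustration of the MP-side feature definitions in Section 4.1.2, the following minimal Python sketch builds the TSE-p features (link speeds) and the Oracle features (total travel distance and total travel time) over the past $h$ steps. The container `totals`, the function names, and the choice to return 0.0 for links without any observed travel time are assumptions made for illustration only; they are not taken from the original implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical container (assumption): totals[n][t] = (xi, tau), i.e., the total
# travel distance and total travel time of MP's fleet on link n at time step t.
Totals = Dict[int, List[Tuple[float, float]]]


def link_speed(distance: float, time: float) -> float:
    """Ratio of total travel distance to total travel time; 0.0 if unobserved."""
    return distance / time if time > 0 else 0.0


def tse_p_features(totals: Totals, t: int, h: int) -> List[float]:
    """TSE-p style features: link speeds for t' in {t-h+1, ..., t}."""
    feats = []
    for n in sorted(totals):
        for tp in range(t - h + 1, t + 1):
            distance, time = totals[n][tp]
            feats.append(link_speed(distance, time))
    return feats


def oracle_features(totals: Totals, t: int, h: int) -> List[float]:
    """Oracle style features: (distance, time) pairs for t' in {t-h+1, ..., t}."""
    feats = []
    for n in sorted(totals):
        for tp in range(t - h + 1, t + 1):
            distance, time = totals[n][tp]
            feats.extend([distance, time])
    return feats
```

With 17 links and $h = 9$, `oracle_features` returns $9 \times 2 \times 17 = 306$ values per time step, while `tse_p_features` returns half as many because each (distance, time) pair is collapsed into a single speed.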
Experiment scenario: As mentioned above, we consider the scenario in which one MA and one MP collaborate to estimate the traffic states (i.e., density and flow) of our study site for each time interval of length $\Delta t = 10$ seconds. MA is assumed to have access to the ground-truth data (i.e., the real density and flow on each link). MA further provides loop detector data on two links, as illustrated in Figure 3. MP provides the features corresponding to each of the aforementioned benchmarks, associated with its operating fleets at each time step $t$. For both MA and MP, we use the past $h = 9$ steps of all the features to estimate the traffic states at the current time step.

Model architecture: MA initiates two models: a global model and a sub-model. The global model (maintained by MA) is a simple Multilayer Perceptron (MLP), and the sub-model of MA is based on the Spatio-Temporal Graph Convolutional Network (STGCN) proposed by Yu, Yin and Zhu (2017), which is a widely adopted learning-based model for traffic flow estimation and prediction. The private sub-model employed by MP also uses STGCN to extract complex spatio-temporal information. In FedTSE, the input dimension of both the MA and MP sub-models is $9 \times 2 \times 17 = 306$. Features that are not observed are filled with zeros. The local output (i.e., sub-model output) dimension of both MA and MP is 9. The final output dimension of MA's global model is 18. This dimension reduction from an input of 306 to a local output of 9 is aligned with the privacy guarantee of FedTSE discussed in Remark 4. It is important to note that the architectures of the global model and the sub-models of MA and MP can be changed as needed. Furthermore, MA and MP are designed not to have knowledge of each other's model architecture and model parameters.
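To make the division of roles between the private sub-models and MA's model concrete, together with the $Q$ local updates of Algorithm 1, the following toy Python sketch may help. It is a sketch under strong simplifying assumptions and not the implementation used in the paper: linear sub-models replace STGCN, MA's global model is assumed to be the plain sum of the two sub-model outputs, the loss is a mean squared error without regularization, there is a single MP, and the iteration counter starts at 0 so that the first iteration already communicates. All function and variable names are hypothetical.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def train_vfl(x_ma, x_mp, labels, q=2, iters=600, lr=0.1, batch=8, seed=0):
    """Toy two-party vertical FL with local updates (linear sub-models).

    MA holds x_ma, the labels and its sub-model w_ma; MP holds x_mp and w_mp.
    Only z_mp (MP -> MA) and dL/dz_mp (MA -> MP) cross the party boundary.
    Assumption: MA's global model is simply y_hat = z_ma + z_mp.
    """
    rng = random.Random(seed)
    w_ma = [0.1] * len(x_ma[0])
    w_mp = [0.1] * len(x_mp[0])
    idx, grad_z, rounds = [], [], 0
    for i in range(iters):
        if i % q == 0:  # communication round, once every q iterations
            rounds += 1
            idx = rng.sample(range(len(labels)), batch)  # MA samples and shares time steps
            z_mp = [dot(w_mp, x_mp[t]) for t in idx]     # MP -> MA: sub-model outputs
            z_ma = [dot(w_ma, x_ma[t]) for t in idx]
            err = [zm + zp - labels[t] for zm, zp, t in zip(z_ma, z_mp, idx)]
            for j in range(len(w_ma)):                   # MA updates its own sub-model
                w_ma[j] -= lr * sum(e * x_ma[t][j] for e, t in zip(err, idx)) / batch
            grad_z = [e / batch for e in err]            # MA -> MP: dL/dz_mp
        # MP updates locally with the most recent (possibly stale) dL/dz_mp,
        # applying the chain rule dL/dw_mp = sum_t dL/dz_t * dz_t/dw_mp.
        for j in range(len(w_mp)):
            w_mp[j] -= lr * sum(g * x_mp[t][j] for g, t in zip(grad_z, idx))
    return w_ma, w_mp, rounds


def mse(w_ma, w_mp, x_ma, x_mp, labels):
    """Mean squared error of the joint prediction over all samples."""
    errs = [dot(w_ma, a) + dot(w_mp, b) - y for a, b, y in zip(x_ma, x_mp, labels)]
    return sum(e * e for e in errs) / len(errs)
```

In this sketch MP never sees the labels or `x_ma`, and MA never sees `x_mp` or `w_mp`. The returned `rounds` counter shows how a larger `q` reduces the number of message exchanges for the same number of iterations.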
To ensure uniformity across the different benchmarks in our case study, all benchmarks are configured with the same architecture for sub-models and global models as in FedTSE. For benchmarks that do not use the FL structure, we assume the same architecture, with an STGCN followed by an MLP. This assumption maintains consistency in the model architecture for a fair comparison.

Training settings: The learning rate for the STGCN sub-models and the MLP global model is set to 3e-4. For all models, we use the first 80% of the data for training and the last 20% for testing. 12.5% of the training samples are selected for validation. The batch size is 128 during training. All experiments are conducted using PyTorch on a server with an NVIDIA A100. We adopt root mean square error (RMSE) and mean absolute error (MAE) as the performance criteria to evaluate the benchmarks.

Table 1: Model performance comparison on real-world dataset. Columns give the MP penetration rate (%) for density (veh/km/lane) and flow (veh/min/lane).

| Benchmark | Metric | Density 20 | Density 40 | Density 60 | Density 80 | Flow 20 | Flow 40 | Flow 60 | Flow 80 |
|---|---|---|---|---|---|---|---|---|---|
| TSE-n | RMSE | 10.34 | 10.34 | 10.34 | 10.34 | 1.91 | 1.91 | 1.91 | 1.91 |
| TSE-n | MAE | 7.19 | 7.19 | 7.19 | 7.19 | 1.50 | 1.50 | 1.50 | 1.50 |
| TSE-p | RMSE | 10.76 | 9.64 | 7.82 | 7.17 | 1.24 | 1.15 | 1.09 | 1.12 |
| TSE-p | MAE | 7.01 | 6.35 | 5.28 | 4.67 | 0.97 | 0.91 | 0.85 | 0.88 |
| Oracle | RMSE | 8.82 | 7.51 | 6.80 | 6.79 | 1.19 | 1.11 | 1.05 | 0.99 |
| Oracle | MAE | 5.85 | 4.83 | 4.26 | 4.05 | 0.95 | 0.89 | 0.85 | 0.80 |
| FedTSE (Q=1) | RMSE | 9.56 | 8.02 | 7.23 | 6.02 | 1.13 | 1.01 | 1.02 | 1.02 |
| FedTSE (Q=1) | MAE | 6.21 | 5.30 | 4.53 | 3.96 | 0.89 | 0.80 | 0.81 | 0.80 |

Table 2: Model performance comparison on simulation dataset.
Columns give the MP penetration rate (%) for density (veh/km/lane) and flow (veh/min/lane).

| Benchmark | Metric | Density 20 | Density 40 | Density 60 | Density 80 | Flow 20 | Flow 40 | Flow 60 | Flow 80 |
|---|---|---|---|---|---|---|---|---|---|
| TSE-n | RMSE | 14.19 | 14.19 | 14.19 | 14.19 | 0.96 | 0.96 | 0.96 | 0.96 |
| TSE-n | MAE | 6.47 | 6.47 | 6.47 | 6.47 | 0.75 | 0.75 | 0.75 | 0.75 |
| TSE-p | RMSE | 7.91 | 7.04 | 6.34 | 7.15 | 0.98 | 0.95 | 0.92 | 0.89 |
| TSE-p | MAE | 4.34 | 3.92 | 3.63 | 3.73 | 0.77 | 0.75 | 0.72 | 0.70 |
| Oracle | RMSE | 6.75 | 5.22 | 4.46 | 3.99 | 0.95 | 0.88 | 0.83 | 0.79 |
| Oracle | MAE | 3.84 | 3.33 | 2.85 | 2.57 | 0.75 | 0.69 | 0.65 | 0.61 |
| FedTSE (Q=1) | RMSE | 6.81 | 5.70 | 5.04 | 4.27 | 0.92 | 0.90 | 0.86 | 0.87 |
| FedTSE (Q=1) | MAE | 3.94 | 3.57 | 3.19 | 2.90 | 0.72 | 0.71 | 0.67 | 0.68 |
| Oracle-cell | RMSE | 5.65 | 3.84 | 3.20 | 2.65 | 0.95 | 0.85 | 0.78 | 0.69 |
| Oracle-cell | MAE | 3.47 | 2.65 | 2.17 | 1.82 | 0.75 | 0.67 | 0.61 | 0.53 |
| FedTSE-cell (Q=1) | RMSE | 5.76 | 4.34 | 4.02 | 3.58 | 0.91 | 0.90 | 0.90 | 0.82 |
| FedTSE-cell (Q=1) | MAE | 3.58 | 3.02 | 2.81 | 2.48 | 0.71 | 0.71 | 0.70 | 0.64 |

4.2. Performance of FedTSE
Performance on real-world data: Table 1 and Figure 4 compare the TSE performance of the different benchmarks on real-world data. The benchmarks with the best estimation results are underlined, while those ranked second are highlighted in bold. As TSE-n does not involve data fusion, its error is the same for different MP penetration rates, and its performance is generally the worst among the benchmarks. We can see the trade-off between privacy and estimation accuracy by comparing FedTSE and Oracle, which in general have similar performance. This suggests that FedTSE can protect privacy with a marginal impact on the estimation performance. It is worth noting that FedTSE outperforms TSE-n and TSE-p, whereby MP shares no features or only partial features due to privacy concerns. This suggests that despite the trade-off between privacy and utility, FedTSE can potentially improve TSE performance by encouraging MP to share more features, which sheds light on the value of privacy protection in data fusion.

Performance on simulation data: Table 2 summarizes the model performance on SUMO simulation data to perform more comprehensive evaluations.
Specifically, two more benchmarks, Oracle-cell and FedTSE-cell, are added to evaluate the value of data fusion, since the amount of real-world data is not sufficient for training these benchmarks. We have two new observations from the simulation data. First, FedTSE-cell yields an accuracy similar to that of Oracle-cell, which shows that the proposed vertical FL framework can achieve a satisfactory trade-off between privacy and estimation performance. Moreover, Figure 5 illustrates the test RMSE of density and flow during the training process, whereby the model performance is evaluated at each epoch. From Figure 5, we can see that FedTSE-cell not only performs similarly to Oracle-cell at the end of training, but also converges without a significant slowdown, which suggests that FedTSE-cell protects privacy at a marginal cost in model performance and training speed. Second, Oracle-cell and FedTSE-cell yield a much lower error than the benchmarks without cell representations. This demonstrates that higher-resolution features provided by MP can significantly improve data fusion. This is important because MP can be more willing to contribute higher-resolution data if privacy is protected, which shows the value of privacy protection in data fusion.

Figure 4: Model performance comparison with different MP penetration rates. (a) Density test RMSE on real-world data; (b) Density test RMSE on simulation data. Both panels plot the density test RMSE against the MP penetration rate (20% to 80%) for Oracle, FedTSE (Q=1), TSE-p, and TSE-n.
Figure 5: Test RMSE curve for FedTSE. (a) Density test RMSE (MP 20% penetration rate); (b) Density test RMSE (MP 40% penetration rate). Both panels plot the density test RMSE against the training epoch for TSE-n, Oracle-cell, and FedTSE-cell (Q=1).

4.2.1. Sensitivity analysis on the number of local updates $Q$
In FedTSE, the parameter $Q$, representing the number of local updates, plays an important role in balancing the communication costs and the estimation performance. Its value can significantly influence the speed of convergence and the overall effectiveness of the model. Table 3 and Figure 6 show that with a larger $Q$, the model converges faster and reaches a given loss threshold earlier. The test RMSE curve for flow is less stable when $Q = 3$, although it reaches the threshold earlier. FedTSE (Q=2) is more stable and outperforms FedTSE (Q=1). Therefore, increasing the number of local updates appropriately can reduce the total number of communication rounds required to achieve better performance with FedTSE. This rapid convergence and improved model performance are critical, given that fewer communication rounds not only enhance the computational efficiency at the upper layer of the model but also reduce potential data leakage.

4.2.2. Sensitivity analysis on sample size
This sensitivity analysis leverages the well-calibrated SUMO simulation to inspect the impact of the sample size on the performance of FedTSE. The results are presented in Table 4. It is worth noting that the simulation data yields performance similar to that of the real-world data with the same sample size, which shows that the SUMO simulation is well calibrated. Moreover, we can see that the RMSE of the estimated density and flow decrease by 3.05 veh/km/lane and 0.28 veh/min/lane, respectively, when the sample size increases from 600 to 3600.
Hence, FedTSE is sensitive to the training sample size.

Table 3: Number of communication rounds to reach a target RMSE. (One communication round corresponds to one mini-batch gradient descent step of MA's global model with the MPs' intermediate outputs.)

| Benchmark | Communication rounds to reach density test RMSE 7.7 | Communication rounds to reach flow test RMSE 1.75 |
|---|---|---|
| FedTSE Q=1 | 812 | 826 |
| FedTSE Q=2 | 420 | 420 |
| FedTSE Q=3 | 308 | 322 |

Figure 6: Test RMSE curve for FedTSE with different Q. (a) Density test RMSE curve (the red line equals 7.7); (b) Flow test RMSE curve (the red line equals 1.75). Both panels plot the test RMSE against the number of communication rounds for FedTSE Q=1, Q=2, and Q=3.

Table 4: Sensitivity analysis on sample size.

| Dataset | Sample size | Density RMSE (veh/km/lane) | Density MAE (veh/km/lane) | Flow RMSE (veh/min/lane) | Flow MAE (veh/min/lane) |
|---|---|---|---|---|---|
| Real-world data | 600 | 9.56 | 6.21 | 1.13 | 0.89 |
| Simulation data | 600 | 9.86 | 5.78 | 1.2 | 0.94 |
| Simulation data | 1200 | 9.95 | 5.6 | 1.15 | 0.91 |
| Simulation data | 1800 | 8.46 | 4.92 | 1.05 | 0.82 |
| Simulation data | 2400 | 7.96 | 4.7 | 1.04 | 0.81 |
| Simulation data | 3000 | 7.3 | 4.3 | 0.96 | 0.75 |
| Simulation data | 3600 | 6.81 | 3.94 | 0.92 | 0.72 |

It is important to note that although there are many pilot studies that leverage drones to monitor traffic (Barmpounakis and Geroliminis, 2020), ground-truth labels are still challenging to obtain, and the real-world data suffers from a limited sample size. Data collection via drone surveillance is expensive due to the limited battery capacity and detection range, as well as limited public acceptance of the privacy risks of monitoring urban residents. Therefore, we cannot easily assume sufficient ground-truth data for training FedTSE. This motivates us to extend FedTSE to scenarios where real-world datasets are insufficient or even unavailable.
To this end, we propose FedTSE-PI, as presented in the following sections.

5. FedTSE-PI: Physics-informed Vertical Federated Traffic State Estimation
In this section, we extend the FedTSE proposed in Section 3 to scenarios where ground-truth labels are not available. This is a realistic consideration, as the collection of ground-truth labels can be expensive. To address this issue, we introduce a Physics-Informed Vertical Federated Learning approach for TSE, hereafter named FedTSE-PI, that integrates models from traffic flow theory, e.g., the Cell Transmission Model, with FedTSE to enhance data efficiency while preserving traffic data privacy. Such an approach combines the data efficiency of traffic flow models and the privacy-preserving capabilities of FedTSE. Specifically, instead of requiring ground-truth labels, this approach adopts a loss function that combines traffic model information with only critical partial observations, such as traffic flow on road links equipped with loop detectors (provided by MA) and link-level speed information of MPs' fleets (provided by MPs). To prevent MPs' sensitive information from being inferred from the sharing of these partial observations, we further propose a privacy-preserving mechanism for MA to calculate the gradients of the loss function without requiring MPs to explicitly share their observations.

This section is organized as follows. Section 5.1 presents the general framework of FedTSE-PI, and Section 5.2 presents the privacy-preserving training algorithm.

5.1. General framework of FedTSE-PI
To enhance the data efficiency of training FedTSE, we leverage the promising framework of physics-informed neural networks (PINNs) (Shi, Mo, Huang, Di and Du, 2021; Di et al.
, 2023) to integrate traffic flow theory (e.g., kinematic wave theory) into FedTSE. The advantage of traffic flow models is that they have a clear physical meaning and can provide relatively accurate estimates of traffic states with only a small number of parameters to calibrate (e.g., the parameters of the fundamental diagrams). Extensive research has proposed TSE approaches based on traffic flow models using statistical filters or optimization-based methods (Seo, Bayen, Kusakabe and Asakura, 2017; Makridis and Kouvelas, 2023; Nie, Qin, Wang and Sun, 2023; Lu et al., 2023), which can leverage critical partial observations, such as traffic flow on road links equipped with loop detectors (provided by MA) and speed information of MP's fleet (provided by MP). However, these traffic flow-based approaches assume that MA has access to all these partial observations without considering the privacy concerns of MPs. Existing privacy-preserving physics-based TSE methods are generally based on open-loop data perturbation (Le Ny, Touati and Pappas, 2014; He and Chow, 2020), which can degrade data quality and hence TSE performance. In contrast, FedTSE can effectively protect the data privacy of both MA and MPs while ensuring that each party uses the true data. We aim to combine traffic models with FedTSE to exploit the benefits of both approaches. Figure 7 shows the general framework, where the traffic models are used to design a loss function that allows MA and MPs to train the parameters of the sub-models $\boldsymbol{\phi}_k$ using only critical partial observations instead of ground-truth labels.

Figure 7: Schematic illustration of FedTSE-PI

We assume that MA possesses a traffic model that can be mathematically described by a state-space model relying on the following recursive equation.
$$\boldsymbol{y}_{t+1} = \boldsymbol{g}(\boldsymbol{y}_t; \boldsymbol{\rho}) + \boldsymbol{\omega}_t \tag{6}$$

where $\boldsymbol{y}_t$ represents the state, including flow and density$^1$, $\boldsymbol{\rho}$ represents model parameters that can include the parameters of the fundamental diagram, signal timings, and estimated turning ratios, and $\boldsymbol{\omega}_t$ represents the model noise. Without loss of generality, we assume that these parameters and the model noise distribution are calibrated a priori, which can be realistic for cities with a well-calibrated simulator of their transportation systems. If the parameters and the model noise distribution are unknown, we can also estimate them simultaneously with the training of the FL models.

The critical partial observations of both MA and MPs can be seen as measurements of the model, represented by $\boldsymbol{u}^k_t$:

$$\boldsymbol{u}^k_t = \boldsymbol{h}_k(\boldsymbol{y}_t) + \boldsymbol{\nu}^k_t \tag{7}$$

where $\boldsymbol{h}_k(\cdot)$ represents the measurement function, and $\boldsymbol{\nu}^k_t$ represents the measurement noise of data owner $k$. For MA, the partial observations can be the flow measured by loop detectors, and hence $\boldsymbol{h}_k(\cdot)$ can be a linear function with a coefficient matrix consisting only of 1 and 0, whereby an element is 1 if the corresponding state is a flow and the corresponding link has loop detectors installed. For MPs, these partial observations can be the link-level speed information provided by their vehicle trajectories, for which $\boldsymbol{h}_k(\cdot)$ can be defined as the ratio between flow and density. We assume that the form of the measurement function $\boldsymbol{h}_k(\cdot)$ and the distribution of the noise $\boldsymbol{\nu}^k_t$ are publicly known by both MPs and MA, since it is natural to specify this information as a key requirement before collaboration.
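The two measurement functions described above can be sketched in a few lines of Python. The state layout used here, $\boldsymbol{y} = [k_1, \dots, k_N, q_1, \dots, q_N]$ with one density and one flow per link, as well as all function names and the guard for empty links, are assumptions made for illustration; the paper's CTM state instead stores transfer flows per pair of adjacent cells.

```python
from typing import List, Sequence


def ma_measurement_matrix(num_links: int, detector_links: Sequence[int]) -> List[List[int]]:
    """0/1 coefficient matrix selecting the flow entries of detector-equipped links.

    Assumed state layout: y = [k_1, ..., k_N, q_1, ..., q_N].
    """
    rows = []
    for n in detector_links:
        row = [0] * (2 * num_links)
        row[num_links + n] = 1  # pick the flow of link n
        rows.append(row)
    return rows


def h_ma(y: Sequence[float], matrix: List[List[int]]) -> List[float]:
    """Linear measurement function of MA: flows on detector-equipped links."""
    return [sum(c * v for c, v in zip(row, y)) for row in matrix]


def h_mp(y: Sequence[float], num_links: int, eps: float = 1e-9) -> List[float]:
    """Measurement function of an MP: link speed as the ratio of flow to density.

    Links with (near-)zero density return 0.0; this guard is an assumption.
    """
    densities, flows = y[:num_links], y[num_links:]
    return [q / k if k > eps else 0.0 for k, q in zip(densities, flows)]
```

For example, with three links, densities of 20, 40 and 0 veh/km and flows of 600, 800 and 0 veh/h, `h_mp` returns speeds of 30, 20 and 0 km/h, while `h_ma` with detectors on the first and third links returns the flows 600 and 0.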
Therefore, given the measurements $\{\boldsymbol{u}^k_t\}_{t \in \mathcal{T}, k \in \{0, \dots, K\}}$, MA is interested in finding the states $\boldsymbol{y} = \{\boldsymbol{y}_t\}_{t \in \mathcal{T}}$ that maximize the likelihood

$$\max_{\boldsymbol{y}} P\left(\{\boldsymbol{y}_t, \boldsymbol{u}_t\}_{t \in \mathcal{T}}\right) \;\Leftrightarrow\; \max_{\boldsymbol{y}} \prod_{t \in \mathcal{T}} P(\boldsymbol{y}_{t+1} \mid \boldsymbol{y}_t)\, P(\boldsymbol{u}_t \mid \boldsymbol{y}_t) \tag{8}$$

With the assumption that both the modeling and measurement noises follow Gaussian distributions, i.e., $\boldsymbol{\omega}_t \overset{\text{iid}}{\sim} N(\boldsymbol{0}, \Sigma_\omega)$ and $\boldsymbol{\nu}^k_t \overset{\text{iid}}{\sim} N(\boldsymbol{0}, \Sigma_\nu)$, maximizing the likelihood is equivalent to minimizing the following loss function

$$L_{\text{PI}}(\boldsymbol{y}) = \frac{1}{2} \sum_{t \in \mathcal{T}} \left[ \left(\boldsymbol{y}_{t+1} - \boldsymbol{g}(\boldsymbol{y}_t; \boldsymbol{\rho})\right)^T \Sigma_\omega^{-1} \left(\boldsymbol{y}_{t+1} - \boldsymbol{g}(\boldsymbol{y}_t; \boldsymbol{\rho})\right) + \sum_{k=0}^{K} \left(\boldsymbol{u}^k_t - \boldsymbol{h}_k(\boldsymbol{y}_t)\right)^T \Sigma_\nu^{-1} \left(\boldsymbol{u}^k_t - \boldsymbol{h}_k(\boldsymbol{y}_t)\right) \right] \tag{9}$$

Notice that in the FedTSE framework, $\boldsymbol{y} = \boldsymbol{y}(\boldsymbol{\theta})$ is computed by a combination of the sub-models of all data owners, parameterized by $\boldsymbol{\theta} = \{\boldsymbol{\theta}_k\}_{k=0}^{K}$. Hence, $L_{\text{PI}}(\boldsymbol{y}(\boldsymbol{\theta}))$ can be used as the loss function for training these sub-models.

We specifically employ the Cell Transmission Model (CTM), a numerical method that discretizes time and space to solve the LWR equations (Daganzo, 1994), to characterize the dynamics of traffic flow. Note that we use CTM as an example due to its popularity and simplicity, and our methodological framework can accommodate any model with a recursive structure, such as the store-and-forward model (Aboudolas, Papageorgiou and Kosmatopoulos, 2009), METANET (Papageorgiou, Blosseville and Hadj-Salem, 1990), and Macroscopic Fundamental Diagram (MFD)-based dynamic models for network-level traffic state estimation (Geroliminis and Daganzo, 2008). The urban transportation network is discretized into a set of cells with length $\Delta x$ such that $\Delta x > v_f \Delta t$, where $v_f$ indicates the free-flow speed of the link. For brevity of presentation, we assume each cell corresponds to a road link $n \in \mathcal{N}$, as we can always divide a link into multiple cells. Then the flow conservation can be represented by Equation 10.
$$k_{n,t+1} = k_{n,t} + \frac{\Delta t}{\Delta x} \left( \sum_{m:(m,n) \in \mathcal{E}} q_{mn,t} - \sum_{m:(n,m) \in \mathcal{E}} q_{nm,t} \right) \tag{10}$$

where $q_{mn,t}$ represents the transfer flow from cell $m$ to its adjacent cell $n$. Given the turning ratio $p_{mn}$ along the edge $(m, n) \in \mathcal{E}$, the sending function describing the maximum flow that can leave cell $m$ for cell $n$ can be written as

$$S_{mn,t} = p_{mn} \min\{q^c_m, v^f_m k_{m,t}\} \tag{11}$$

where $q^c_m$ is the capacity of cell $m$, and $v^f_m$ is its free-flow speed. The sending function is calculated as the minimum of the capacity and the flow that desires to leave the cell. The receiving function defines the maximum flow that can be accommodated by cell $n$, which can be written as

$$R_{n,t} = \min\{w_n (k^{\text{jam}}_n - k_{n,t}), q^c_n\} \tag{12}$$

where $k^{\text{jam}}_n$ represents the jam density and $w_n$ the backward wave speed of cell $n$. The flow that can enter cell $n$ can be calculated as the minimum of the total sending flow and the receiving flow

$$Q_{n,t} = \min\left\{ \sum_{m:(m,n) \in \mathcal{E}} S_{mn,t},\; R_{n,t} \right\} \tag{13}$$

Then, assuming the traffic within each cell is homogeneous, the transfer flow from cell $m$ to cell $n$ can be derived as Equation 14:

$$q_{mn,t} = Q_{n,t} \frac{S_{mn,t}}{\sum_{m':(m',n) \in \mathcal{E}} S_{m'n,t}} \tag{14}$$

We can summarize CTM as a recursive model in the form of Equation 6, where the state is $\boldsymbol{y}_t = (\{k_{nt}\}_{n \in \mathcal{N}}, \{q_{mnt}\}_{(m,n) \in \mathcal{E}})$, and the model parameters are the parameters of the fundamental diagrams of all links, i.e., $\boldsymbol{\rho} = \{v^f_n, k^{\text{jam}}_n, q^c_n\}_{n \in \mathcal{N}}$. We incorporate CTM into the loss function in Equation 9 to facilitate the training of FedTSE.

$^1$ Notice that we use the symbol $y$ instead of the commonly used $x$ to be consistent with our previous definitions of traffic states in Section 3.

5.2.
Privacy-preserving training of FedTSE-PI
The training process of FedTSE-PI is summarized in Algorithm 2, which follows the training of FedTSE. MA calculates two types of gradients: (1) the gradient of $L_{\text{PI}}$ with respect to the model parameters of MA ($\boldsymbol{\theta}_0$), i.e., $\partial L_{\text{PI}} / \partial \boldsymbol{\theta}_0$, which is used to update its parameters $\boldsymbol{\theta}_0$, and (2) the gradients of $L_{\text{PI}}$ with respect to the outputs $\boldsymbol{z}^k_t$, i.e., $\partial L_{\text{PI}} / \partial \boldsymbol{z}^k_t$, which are sent back to MP $k$ so that it can update its parameters. However, one key difference between the training of FedTSE-PI and that of FedTSE lies in the loss function: the loss function of FedTSE relies on the ground-truth labels collected by MA, whereas the loss function of FedTSE-PI requires critical partial observations made by both MA and MPs. In other words, the key input to the loss function of FedTSE-PI is distributed among MA and MPs, and MPs may be reluctant to explicitly share such information due to privacy considerations.

To address this privacy issue, we propose a privacy-preserving gradient calculation mechanism that allows MA to calculate the gradients without MPs having to explicitly share their data. This mechanism relies on secure functional encryption (Lewko, Okamoto, Sahai, Takashima and Waters, 2010), which encrypts private data into an encrypted message such that any user with the encrypted message can learn a predetermined function of the private data but nothing else. We specifically leverage inner product encryption (IPE), which allows the computation of the inner product of two vectors $\boldsymbol{a}_1$ and $\boldsymbol{a}_2$ without sharing any information other than their inner product $\boldsymbol{a}_1^T \boldsymbol{a}_2$. Inner product encryption has been extensively investigated in the cryptography literature (Lewko et al., 2010; Okamoto and Takashima, 2015; Couteau and Zarezadeh, 2022), and it ensures the fast and secure computation of the inner product of two vectors that can be possessed by two different parties.
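The functionality that IPE exposes, as opposed to its cryptography, can be illustrated with a plaintext stand-in. The Python sketch below performs no encryption at all: `ipe_inner_product` is a hypothetical placeholder for whichever two-party IPE scheme is adopted, and it only marks where the secure primitive would be called. What the sketch does show is that a row vector multiplied by a matrix can be computed purely as a series of inner products, one per matrix column, which is the kind of structure that makes an inner-product primitive usable for matrix-valued quantities such as gradients.

```python
from typing import List, Sequence


def ipe_inner_product(a1: Sequence[float], a2: Sequence[float]) -> float:
    """Plaintext stand-in (assumption) for an IPE call: returns only <a1, a2>.

    In a real deployment a1 and a2 would be held by different parties and
    this call would be replaced by an actual inner product encryption scheme.
    """
    if len(a1) != len(a2):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a1, a2))


def vector_matrix_via_inner_products(u: Sequence[float],
                                     matrix: List[List[float]]) -> List[float]:
    """Compute u^T M column by column, using one inner product per column of M."""
    n_cols = len(matrix[0])
    return [ipe_inner_product(u, [row[j] for row in matrix]) for j in range(n_cols)]
```

In this picture, one party would hold `u` and the other would hold the columns of `matrix`, and only the resulting inner products would be revealed.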
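Our mechanism can employ any existing two-party inner product encryption approach. Instead of introducing the details of these approaches, we focus on how to formulate the gradient calculation problem as a series of secure inner product calculation problems. Let us consider a minibatch $\mathcal{B} \subset \mathcal{T}$. We take the calculation of $\partial L_{\mathrm{PI}} / \partial \boldsymbol{\theta}_0$ as an example; the calculation of $\partial L_{\mathrm{PI}} / \partial \boldsymbol{z}_t^k$ is similar. By the chain rule of partial derivatives, we obtain

$$
\begin{aligned}
\frac{\partial L_{\mathrm{PI}}}{\partial \boldsymbol{\theta}_0}
&= \sum_{t \in \mathcal{B}} \frac{\partial L_{\mathrm{PI}}}{\partial \boldsymbol{y}_t} \frac{\partial \boldsymbol{y}_t}{\partial \boldsymbol{\theta}_0} \\
&= \frac{1}{2} \sum_{t \in \mathcal{B}} \frac{\partial}{\partial \boldsymbol{y}_t} \left[ \left( \boldsymbol{y}_{t+1} - \boldsymbol{g}(\boldsymbol{y}_t; \boldsymbol{\rho}) \right)^T \Sigma_{\omega}^{-1} \left( \boldsymbol{y}_{t+1} - \boldsymbol{g}(\boldsymbol{y}_t; \boldsymbol{\rho}) \right) \right] \frac{\partial \boldsymbol{y}_t}{\partial \boldsymbol{\theta}_0}
 + \frac{1}{2} \sum_{t \in \mathcal{B}} \sum_{k=0}^{K} \frac{\partial}{\partial \boldsymbol{y}_t} \left[ \left( \boldsymbol{u}_t^k - \boldsymbol{h}^k(\boldsymbol{y}_t) \right)^T \Sigma_{\nu}^{-1} \left( \boldsymbol{u}_t^k - \boldsymbol{h}^k(\boldsymbol{y}_t) \right) \right] \frac{\partial \boldsymbol{y}_t}{\partial \boldsymbol{\theta}_0} \\
&= \frac{1}{2} \sum_{t \in \mathcal{B}} \frac{\partial}{\partial \boldsymbol{y}_t} \left[ \left( \boldsymbol{y}_{t+1} - \boldsymbol{g}(\boldsymbol{y}_t; \boldsymbol{\rho}) \right)^T \Sigma_{\omega}^{-1} \left( \boldsymbol{y}_{t+1} - \boldsymbol{g}(\boldsymbol{y}_t; \boldsymbol{\rho}) \right) \right] \frac{\partial \boldsymbol{y}_t}{\partial \boldsymbol{\theta}_0}
 - \sum_{t \in \mathcal{B}} \sum_{k=0}^{K} \left( \boldsymbol{u}_t^k - \boldsymbol{h}^k(\boldsymbol{y}_t) \right)^T \Sigma_{\nu}^{-1} \frac{\partial \boldsymbol{h}^k}{\partial \boldsymbol{y}_t} \frac{\partial \boldsymbol{y}_t}{\partial \boldsymbol{\theta}_0} \\
&= \frac{1}{2} \sum_{t \in \mathcal{B}} \frac{\partial}{\partial \boldsymbol{y}_t} \left[ \left( \boldsymbol{y}_{t+1} - \boldsymbol{g}(\boldsymbol{y}_t; \boldsymbol{\rho}) \right)^T \Sigma_{\omega}^{-1} \left( \boldsymbol{y}_{t+1} - \boldsymbol{g}(\boldsymbol{y}_t; \boldsymbol{\rho}) \right) \right] \frac{\partial \boldsymbol{y}_t}{\partial \boldsymbol{\theta}_0}
 - \sum_{t \in \mathcal{B}} \sum_{k=0}^{K} \left( \boldsymbol{u}_t^k \right)^T \Sigma_{\nu}^{-1} \frac{\partial \boldsymbol{h}^k}{\partial \boldsymbol{y}_t} \frac{\partial \boldsymbol{y}_t}{\partial \boldsymbol{\theta}_0}
 + \sum_{t \in \mathcal{B}} \sum_{k=0}^{K} \left( \boldsymbol{h}^k(\boldsymbol{y}_t) \right)^T \Sigma_{\nu}^{-1} \frac{\partial \boldsymbol{h}^k}{\partial \boldsymbol{y}_t} \frac{\partial \boldsymbol{y}_t}{\partial \boldsymbol{\theta}_0}
\end{aligned}
\tag{15}
$$

where the first and third terms are known to MA, but the second term includes the private data of each MP $k$, i.e., $\boldsymbol{u}_t^k$. Nevertheless, we notice that the second term can be represented as a combination of a sequence of inner products between $\boldsymbol{u}_t^k$ and each column vector of the matrix $\Sigma_{\nu}^{-1} \frac{\partial \boldsymbol{h}^k}{\partial \boldsymbol{y}} \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{\theta}_0}$. Hence, we can use inner product encryption to compute the second term securely, without either party having to disclose its information. The overall procedure is summarized in Algorithm 2.

Algorithm 2: Algorithm for FedTSE-PI
Input: Private datasets $\mathcal{D}_k$ and learning rate $\eta_k$ for data owner $k = 0, 1, \ldots, K$, regularizer weight parameter $\lambda$, and the number of local updates $Q$
Output: $\Theta = (\boldsymbol{\theta}_0, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)$
for each iteration $i = 1, 2, \ldots$ do
    if $i \bmod Q = 0$ then
        MA randomly samples a mini-batch of time steps $\mathcal{B} \subset \mathcal{T}$ and synchronizes it with MPs
        for $k = 1, \ldots, K$ in parallel do
            MP $k$ computes local output $\boldsymbol{z}_t^k = \boldsymbol{\phi}_k(\boldsymbol{x}_t^k; \boldsymbol{\theta}_k)$ with its private sub-model for $t \in \mathcal{B}$
            MP $k$ sends $\boldsymbol{z}_t^k$ to MA
        end
        for $k = 1, \ldots, K$ in parallel do
            MA and MP $k$ jointly compute $\Delta_{\boldsymbol{\theta}}^k = \sum_{t \in \mathcal{B}} (\boldsymbol{u}_t^k)^T \Sigma_{\nu}^{-1} \frac{\partial \boldsymbol{h}^k}{\partial \boldsymbol{y}} \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{\theta}_0}$ and $\Delta_{\boldsymbol{z}}^k = \sum_{t \in \mathcal{B}} (\boldsymbol{u}_t^k)^T \Sigma_{\nu}^{-1} \frac{\partial \boldsymbol{h}^k}{\partial \boldsymbol{y}} \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{z}_t^k}$ using secure functional encryption-based inner product computation
        end
        MA uses $\{\Delta_{\boldsymbol{z}}^k\}_{k=0}^{K}$ and $\{\Delta_{\boldsymbol{\theta}}^k\}_{k=0}^{K}$ to calculate gradients $\frac{\partial L}{\partial \boldsymbol{\theta}_0}$ and $\frac{\partial L}{\partial \boldsymbol{z}_t^k}$ according to Equation 15
        MA updates sub-model parameters $\boldsymbol{\theta}_0^{i+1} = \boldsymbol{\theta}_0^{i} - \eta_0 \frac{\partial L}{\partial \boldsymbol{\theta}_0}$ and sends $\frac{\partial L}{\partial \boldsymbol{z}_t^k}$ to each MP
    end
    for $k = 1, \ldots, K$ in parallel do
        MP $k$ computes $\frac{\partial L}{\partial \boldsymbol{\theta}_k}$ with the most recent $\frac{\partial L}{\partial \boldsymbol{z}_t^k}$ according to Equation 5
        MP $k$ updates $\boldsymbol{\theta}_k^{i+1} = \boldsymbol{\theta}_k^{i} - \eta_k \frac{\partial L}{\partial \boldsymbol{\theta}_k}$
    end
    if convergence criterion met then
        break
    end
end

By integrating with inner product encryption, the gradients can be calculated without MPs having to share their data. Nevertheless, it is still possible for adversaries to infer the true values of $\boldsymbol{u}_t^k$ by solving a linear equation if they know the inner products between $\boldsymbol{u}_t^k$ and a sequence of vectors. This can be prevented with several tricks. First, the mini-batch size can be chosen sufficiently large that the dimension of $\{\boldsymbol{u}_t^k\}_{t \in \mathcal{B}}$ is higher than the rank of the matrix $\Sigma_{\nu}^{-1} \frac{\partial \boldsymbol{h}^k}{\partial \boldsymbol{y}} \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{\theta}_0}$, which ensures that the values of $\boldsymbol{u}_t^k$ cannot be inferred from the inner products. This is practical in reality, since partial observations are abundant (in contrast to the limited availability of ground-truth data) thanks to the continuous measurements of MA's detectors and the operations of MPs' fleets.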
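The inner-product structure of the second term in Equation 15, and the reason a larger mini-batch hides individual measurements, can be illustrated with a small NumPy sketch. All sizes, the random matrices, and the function `secure_inner_product` are assumptions made for illustration only: the function is a hypothetical stand-in that computes the inner product in the clear, whereas a real inner product encryption scheme would reveal only the resulting scalar to MA.

```python
import numpy as np

rng = np.random.default_rng(0)
n_meas, n_state, n_param = 4, 6, 5  # illustrative sizes, not taken from the paper

# MP side: private measurement vector u_t^k for a single time step.
u = rng.normal(size=n_meas)

# MA side: the matrix M = Sigma_nu^{-1} (dh/dy) (dy/dtheta_0), filled with random values here.
sigma_inv = np.diag(rng.uniform(0.5, 2.0, size=n_meas))
dh_dy = rng.normal(size=(n_meas, n_state))
dy_dtheta = rng.normal(size=(n_state, n_param))
M = sigma_inv @ dh_dy @ dy_dtheta


def secure_inner_product(private_vec, public_vec):
    """Hypothetical stand-in for an inner product encryption call.

    A real scheme would reveal only this scalar; here it is computed in the clear.
    """
    return float(private_vec @ public_vec)


# One inner product per column of M reproduces the row vector u^T M.
second_term = np.array([secure_inner_product(u, M[:, j]) for j in range(n_param)])
assert np.allclose(second_term, u @ M)

# Leakage with a single time step: there are at least as many inner products
# (n_param) as unknowns (n_meas), so solving M^T u = second_term recovers u.
u_hat, *_ = np.linalg.lstsq(M.T, second_term, rcond=None)

# Mini-batch case: only the sum over time steps is revealed.
n_steps = 3
U = rng.normal(size=(n_steps, n_meas))            # private u_t^k for each t
Ms = rng.normal(size=(n_steps, n_meas, n_param))  # one matrix per time step
A = Ms.reshape(n_steps * n_meas, n_param).T       # maps stacked u to sum_t M_t^T u_t
batch_sum = A @ U.reshape(-1)

# A has 12 columns but rank at most 5, so it has a non-trivial null space:
# a different stacked vector yields exactly the same revealed values.
_, _, vt = np.linalg.svd(A)
u_alt = U.reshape(-1) + vt[-1]
```

In this sketch `u_hat` matches `u` for the single time step, while in the mini-batch case `u_alt` differs from the true stacked measurements yet produces the same `batch_sum`, so the revealed inner products no longer determine the private data uniquely.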
Second, MA can collaborate with multiple MPs such that each MP reports only critical partial observations on a subset of links traversed by its fleet, which helps protect each MP's data privacy since not all of its data is used.

6. Case Study on FedTSE-PI

6.1. Benchmarks for FedTSE-PI

To evaluate the performance of FedTSE-PI, we compare the following benchmarks, where MA is assumed to have no access to ground-truth traffic states, and hence data fusion refers to the fusion of both features and measurements.

• QEST (physics-based density estimation with MP's data): This benchmark is a state-of-the-art physics-based TSE approach proposed by Yang and Menendez (2018), which estimates the queue profile from trajectory data provided by MP by solving a convex optimization problem formulated using kinematic wave theory. MP further uses the queue profile to produce its density measurements. Note that this benchmark does not use MA's data. It is worth noting that the embedded convex optimization is efficient to solve, costing no more than 2 s for each signal phase on a PC with an AMD Ryzen 5 3600XT 6-Core Processor. Hence, embedding QEST in the training of learning-based TSE methods (e.g., FedTSE-PI) does not add significant computational cost.
• QEST-f (physics-based density estimation with data fusion): This benchmark is an extension of QEST that includes MA's loop detector and signal timing data, in which MA sends these data to MP, and MP produces density measurements by solving a convex optimization problem. Note that this benchmark could undermine the privacy of MA. This benchmark is used to evaluate the advantage of learning-based approaches in comparison to physics-based approaches.
• TSE-PI-p (physics-informed partial data fusion without privacy protection): This benchmark is the physics-informed version of TSE-p, where MA trains physics-informed neural networks with MP's speed information.
The features of MA and MP are identical to those for TSE-p. The loss function follows Eq. (9) with the measurements of MP being the speed information. Note that, similar to TSE-p, privacy is not explicitly protected for this benchmark, and MP only wants to share speed information due to privacy concerns.
• Oracle-PI (physics-informed data fusion without privacy protection): This model is the physics-informed version of Oracle, whereby MP fully trusts MA, and MA trains physics-informed neural networks with its own data and MP's data without privacy protection. The features of MA and MP are identical to those for Oracle. The loss function follows Eq. (9) with the measurements of MP being the speed and the density estimated using QEST.
• UKF (Unscented Kalman Filter): This benchmark is another state-of-the-art physics-based approach proposed by Makridis and Kouvelas (2023) that fuses trajectory data with loop detector data without considering privacy concerns.
• FedTSE-PI: The proposed physics-informed privacy-preserving data fusion vertical FL framework. The features of MA and MP are identical to those for FedTSE. The loss function follows Eq. (9) with the measurements of MP being the speed information and the density estimated using QEST. Note that here, privacy is protected in two respects. First, FL ensures that data remains distributed among its owners. Second, encryption-based inner product computation ensures that MP's speed and density measurements are incorporated into the loss function with privacy protection.
• FedTSE-PI-v: This benchmark is identical to FedTSE-PI, except that the loss function follows Eq. (9) with the measurements of MP being only the speed information.
• FedTSE-PI-k: This benchmark is identical to FedTSE-PI, except that the loss function follows Eq. (9) with the measurements of MP being only the density estimated using QEST.
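The difference between FedTSE-PI, FedTSE-PI-v, and FedTSE-PI-k lies only in which MP measurements enter the measurement term of the loss. The sketch below illustrates this with a weighted squared residual of the same form as the measurement term in Equation 15. It is an illustrative assumption, not the paper's implementation: the measurement function is taken to be the identity on the estimated speed and density, the noise covariance is assumed diagonal, and all names (`VARIANTS`, `measurement_loss`) and numbers are hypothetical.

```python
import numpy as np

# Which MP measurements each variant feeds into the loss (as described in the list above).
VARIANTS = {
    "FedTSE-PI": ("speed", "density"),
    "FedTSE-PI-v": ("speed",),
    "FedTSE-PI-k": ("density",),
}


def measurement_loss(estimates, measurements, noise_var, variant):
    """0.5 * sum of squared residuals, each scaled by an assumed noise variance."""
    total = 0.0
    for name in VARIANTS[variant]:
        residual = np.asarray(measurements[name], dtype=float) - np.asarray(
            estimates[name], dtype=float
        )
        total += 0.5 * float(np.sum(residual ** 2 / noise_var[name]))
    return total


# Toy numbers for two time steps on one link (made up for illustration).
estimates = {"speed": [10.0, 12.0], "density": [20.0, 30.0]}
measurements = {"speed": [11.0, 12.0], "density": [22.0, 29.0]}
noise_var = {"speed": 1.0, "density": 4.0}

loss_full = measurement_loss(estimates, measurements, noise_var, "FedTSE-PI")
loss_v = measurement_loss(estimates, measurements, noise_var, "FedTSE-PI-v")
loss_k = measurement_loss(estimates, measurements, noise_var, "FedTSE-PI-k")
```

With these toy numbers the full variant's loss is the sum of the speed-only and density-only losses, which makes explicit that the -v and -k variants each drop one source of supervision.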
Table 5: Model performance comparison of FedTSE-PI. (F represents whether there exists data fusion. P represents whether the data fusion is privacy-preserving. V and K represent whether MP shares speed or density measurements, respectively. Pr is the MP penetration rate.)

| F | P | V | K | Method | Metric | Density (veh/km/lane), Pr = 20% | 40% | 60% | 80% | Flow (veh/min/lane), Pr = 20% | 40% | 60% | 80% |
|---|---|---|---|--------|--------|------|------|------|------|------|------|------|------|
|   |   | ✓ | ✓ | QEST | RMSE | 8.20 | 7.05 | 6.64 | 6.43 | 5.61 | 5.22 | 5.06 | 4.90 |
|   |   |   |   |      | MAE  | 6.20 | 5.44 | 5.15 | 5.04 | 3.52 | 3.24 | 3.14 | 3.04 |
| ✓ |   | ✓ | ✓ | QEST-f | RMSE | 6.89 | 5.77 | 5.38 | 5.23 | 4.39 | 4.20 | 4.12 | 4.08 |
|   |   |   |   |        | MAE  | 4.94 | 4.34 | 4.11 | 4.06 | 2.47 | 2.37 | 2.34 | 2.32 |
| ✓ |   | ✓ |   | TSE-PI-p | RMSE | 11.91 | 11.30 | 11.01 | 11.01 | 5.15 | 4.86 | 4.72 | 4.64 |
|   |   |   |   |          | MAE  | 8.11 | 7.46 | 7.13 | 6.96 | 3.53 | 3.24 | 3.09 | 2.86 |
| ✓ |   | ✓ | ✓ | Oracle-PI | RMSE | 6.32 | 5.58 | 4.93 | 4.85 | 3.43 | 3.09 | 2.77 | 2.59 |
|   |   |   |   |           | MAE  | 4.64 | 4.13 | 3.71 | 3.63 | 2.54 | 2.24 | 1.94 | 1.79 |
| ✓ |   | ✓ | ✓ | UKF | RMSE | 7.95 | 6.88 | 6.57 | 6.29 | 4.62 | 4.38 | 4.21 | 4.14 |
|   |   |   |   |     | MAE  | 6.14 | 5.41 | 5.20 | 5.02 | 2.56 | 2.47 | 2.41 | 2.36 |
| ✓ | ✓ | ✓ |   | FedTSE-PI-v | RMSE | 10.35 | 10.24 | 10.58 | 10.71 | 4.33 | 4.17 | 3.89 | 4.13 |
|   |   |   |   |             | MAE  | 6.80 | 6.54 | 6.45 | 6.56 | 2.85 | 2.74 | 2.31 | 2.40 |
| ✓ | ✓ |   | ✓ | FedTSE-PI-k | RMSE | 7.25 | 6.25 | 6.21 | 5.90 | 3.62 | 3.66 | 3.48 | 3.59 |
|   |   |   |   |             | MAE  | 5.40 | 4.70 | 4.67 | 4.52 | 2.60 | 2.63 | 2.54 | 2.58 |
| ✓ | ✓ | ✓ | ✓ | FedTSE-PI | RMSE | 6.37 | 5.71 | 5.07 | 5.05 | 3.31 | 3.04 | 2.82 | 2.83 |
|   |   |   |   |           | MAE  | 4.81 | 4.23 | 3.81 | 3.72 | 2.39 | 2.18 | 1.98 | 2.02 |

Note that in practice, although ground-truth labels are limited, the measurements of MA and MPs (e.g., loop detector data and trajectory data) are abundant, as these data owners continuously generate operational data. Therefore, it is reasonable to assume that both MA and MPs have a long history of loop detector data and trajectory data. To replicate this setting realistically, we evaluate these benchmarks on simulation data, since the available real-world data is limited. The experiment settings are the same as in Section 4.1.

6.2. Performance of FedTSE-PI

Table 5 and Figure 8 compare the performance of the benchmarks using simulation data. The benchmarks with the best estimation results are underlined, while those ranked second are highlighted in bold.

[Figure 8: Model performance comparison with different MP penetration rates. Panels: (a) Density RMSE; (b) Flow RMSE. x-axis: MP penetration rate (20% to 80%); y-axis: test RMSE; curves: TSE-PI-p, UKF, QEST-f, FedTSE-PI, Oracle-PI.]

Tradeoff between privacy and estimation performance. We evaluate the tradeoff between privacy and estimation performance of FedTSE-PI by comparing it to (i) state-of-the-art physics-based TSE algorithms with data fusion, including QEST-f and UKF, and (ii) Oracle-PI without considering privacy. We can see that even though a privacy-preserving mechanism is involved, FedTSE-PI still outperforms these physics-based methods and yields a similar estimation accuracy to Oracle-PI (i.e., with at most 4% difference). This shows that FedTSE-PI can protect privacy at a marginal cost in estimation performance.

Table 6: Sensitivity analysis on the number of loops

| Method | Metric | Density (veh/km/lane), MP penetration rate = 20% | 40% | 60% | 80% | Flow (veh/min/lane), MP penetration rate = 20% | 40% | 60% | 80% |
|--------|--------|------|------|------|------|------|------|------|------|
| FedTSE-PI (three loops) | RMSE | 6.20 | 5.49 | 4.99 | 4.52 | 3.36 | 2.85 | 2.65 | 2.48 |
|                         | MAE  | 4.44 | 4.12 | 3.68 | 3.30 | 2.43 | 2.00 | 1.86 | 1.74 |
| FedTSE-PI (two loops)   | RMSE | 6.37 | 5.71 | 5.07 | 5.05 | 3.31 | 3.04 | 2.82 | 2.83 |
|                         | MAE  | 4.81 | 4.23 | 3.81 | 3.72 | 2.39 | 2.18 | 1.98 | 2.02 |
| FedTSE-PI (one loop)    | RMSE | 7.08 | 6.06 | 5.63 | 5.26 | 4.07 | 3.63 | 3.11 | 2.86 |
|                         | MAE  | 5.37 | 4.54 | 4.25 | 3.96 | 3.02 | 2.63 | 2.26 | 2.02 |
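For readers who want to reproduce this kind of comparison, the sketch below gives the standard definitions of RMSE and MAE and computes the relative gap between FedTSE-PI and Oracle-PI. The helper names are hypothetical, and the gap is computed only on the density RMSE rows of Table 5 as an illustration of the arithmetic.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between ground truth and estimates."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error between ground truth and estimates."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(err)))


def relative_gap(method, reference):
    """Element-wise relative increase of `method` over `reference`."""
    method = np.asarray(method, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return (method - reference) / reference


# Density RMSE rows of Table 5 at 20/40/60/80% MP penetration rate.
fedtse_pi_density_rmse = [6.37, 5.71, 5.07, 5.05]
oracle_pi_density_rmse = [6.32, 5.58, 4.93, 4.85]

density_gap = relative_gap(fedtse_pi_density_rmse, oracle_pi_density_rmse)
print(np.round(100 * density_gap, 1))  # gap in percent for each penetration rate
```

On these density RMSE values the gap is about 0.8%, 2.3%, 2.8%, and 4.1% across the four penetration rates.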
The advantage of FedTSE-PI is further highlighted in Figure 8, where the blue line (i.e., FedTSE-PI) outperforms all other existing methods for TSE and is close to the red line (Oracle-PI). This suggests that the integration of the physics model and FL could better capture the traffic dynamics, showing the benefits of incorporating learning-based components. Overall, despite having no access to ground-truth labels, FedTSE-PI can achieve excellent performance on the TSE problem.

Moreover, let us take a closer look at the performance comparison of FedTSE-PI and Oracle-PI. Figure 9 shows the estimated density and flow on links with and without MA observations in scenarios with a 20% penetration rate. We can see that the proposed FedTSE-PI (blue line) yields similar performance to Oracle-PI (red line), i.e., the vertical federated design in FedTSE-PI does not weaken model performance. For links equipped with loops, the density is well estimated (see Figure 9c). This is because the flow can be perfectly estimated on these links thanks to the loop detectors. For links without loops, the estimated density captures the main trend with some errors at the peaks, because the information available for these peaks is insufficient at low penetration rates.

Overall, the proposed FedTSE-PI yields desirable performance similar to Oracle-PI and significantly outperforms other state-of-the-art methods, which shows that FedTSE-PI achieves an efficient tradeoff between privacy and estimation performance.

Value of privacy protection in data fusion. As mentioned above, with privacy protection, MPs may be more incentivized to participate actively in the data fusion and use higher-resolution data, which can in turn improve estimation accuracy. Here, we demonstrate this by comparing FedTSE-PI with FedTSE-PI-k and FedTSE-PI-v.
We can see that FedTSE-PI significantly outperforms FedTSE-PI-k (at least 14% improvement) and FedTSE-PI-v (at least 60% improvement), which demonstrates that the performance of FedTSE-PI can be improved by incorporating more measurements. A similar conclusion can be drawn on the benefits of using higher-resolution features by comparing FedTSE-PI with TSE-PI-p. This highlights the benefits of incorporating privacy-preserving mechanisms into data fusion, since this encourages MPs to share more measurements and features and thus improves TSE performance.

6.2.1. Sensitivity analysis on the number of loops

Table 6 shows that more observations from MA could enhance the performance of FedTSE-PI. With a 20% MP penetration rate, the RMSE of density and flow decreases by more than 12% and 17%, respectively, from one loop to three loops. Even with only one loop, the density RMSE of 7.08 veh/km/lane is lower than that of FedTSE-PI-k and UKF shown in Table 5. This means that MP's data is crucial for MA when estimating traffic states, and that MA can still collaborate with MP to perform TSE with limited observations. Hence, it is necessary for MA to adopt a privacy-preserving framework to encourage MP's data sharing.

7. Conclusion

In this paper, we propose two novel vertical FL-based methods, FedTSE and FedTSE-PI, for privacy-preserving learning-based fusion of MA's and MP's data for traffic state estimation. This is among the few pioneering works that address the cross-silo privacy concerns among ITS parties interested in collaborating and sharing heterogeneous datasets with different features. FedTSE enables multiple data owners to collaboratively train and apply a TSE model without having to explicitly exchange their private data.
FedTSE-PI extends FedTSE to enhance the applicability in scenarios where ground-truth labels are not available, which is common for TSE due to the high cost of obtaining ground-truth traffic states. Case studies demonstrate that FedTSE and FedTSE-PI can preserve the privacy of data owners with minimum impact on the estimation performance, significantly outperforming the baselines where MPs only share partial or no data due to privacy concerns. Moreover, with privacy protected, MPs can be more incentivized to participate actively in data fusion and use higher-resolution data, which in turn enhances estimation accuracy.

[Figure 9: Estimated density and flow comparison for FedTSE-PI (20% MP penetration rate). Panels: (a) Density estimation results of one link with no loop detector (veh/km/lane); (b) Flow estimation results of one link with no loop detector (veh/min/lane); (c) Density estimation results of one link with loop detector (veh/km/lane). x-axis: time steps; curves: ground truth, Oracle-PI, FedTSE-PI.]

This work opens several promising future directions. First, the privacy guarantees of this work can be further enhanced by integrating with privacy-preserving data perturbation mechanisms such as differential privacy (Dwork, 2006) and statistical privacy filters (Nekouei, Sandberg, Skoglund and Johansson, 2022). Second, we would like to relax the assumption that MPs are honest with no intention to forge data for their own benefit. This can be achieved by incorporating a cryptographic protocol based on Zero-Knowledge Proof (Fiege et al., 1987) similar to Tsao et al.
(2022b), which can detect the validity of the data used by MPs without MPs having to reveal their data. Third, we are interested in extending FedTSE-PI to consider the uncertainty quantification of both data and models, e.g., via Bayesian neural networks where the neural network weights are probability distributions instead of deterministic values. Fourth, we would like to extend the vertical federated learning approach to perform traffic control by integrating it with reinforcement learning.

Acknowledgement

The authors would like to thank Chaopeng Tan, Longhao Yan, and Jingyuan Zhou for fruitful discussions. This research was supported by the Singapore Ministry of Education (MOE) under its Academic Research Fund Tier 1 (A-8001183-00-00). This article solely reflects the opinions and conclusions of its authors and not Singapore MOE or any other entity.

References

Aboudolas, K., Papageorgiou, M., Kosmatopoulos, E., 2009. Store-and-forward based methods for the signal control problem in large-scale congested urban road networks. Transportation Research Part C: Emerging Technologies 17, 163–174.
Ambühl, L., Menendez, M., 2016. Data fusion algorithm for macroscopic fundamental diagram estimation. Transportation Research Part C: Emerging Technologies 71, 184–197.
Amini, S., Gerostathopoulos, I., Prehofer, C., 2017. Big data analytics architecture for real-time traffic control, in: 2017 5th IEEE international conference on models and technologies for intelligent transportation systems (MT-ITS), IEEE. pp. 710–715.
Antunes, R.S., André da Costa, C., Küderle, A., Yari, I.A., Eskofier, B., 2022. Federated learning for healthcare: Systematic review and architecture proposal. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 1–23.
Barmpounakis, E., Geroliminis, N., 2020. On the new era of urban traffic monitoring with massive drone data: The pneuma large-scale field experiment. Transportation Research Part C: Emerging Technologies 111, 50–71.
Caceres, N., Romero, L.M., Benitez, F.G., del Castillo, J.M., 2012. Traffic flow estimation models using cellular phone data. IEEE Transactions on Intelligent Transportation Systems 13, 1430–1441.
Cai, Z., Zheng, X., Yu, J., 2019. A differential-private framework for urban traffic flows estimation via taxi companies. IEEE Transactions on Industrial Informatics 15, 6492–6499.
Chen, M., Shlezinger, N., Poor, H.V., Eldar, Y.C., Cui, S., 2021. Communication-efficient federated learning. Proceedings of the National Academy of Sciences 118, e2024789118.
Chen, Z., Tian, P., Liao, W., Yu, W., 2020. Zero knowledge clustering based adversarial mitigation in heterogeneous federated learning. IEEE Transactions on Network Science and Engineering 8, 1070–1083.
Couteau, G., Zarezadeh, M., 2022. Non-interactive secure computation of inner-product from lpn and lwe, in: International Conference on the Theory and Application of Cryptology and Information Security, Springer. pp. 474–503.
Daganzo, C.F., 1994. The cell transmission model: A dynamic representation of highway traffic consistent with the hydrodynamic theory. Transportation Research Part B: Methodological 28, 269–287.
De Montjoye, Y.A., Hidalgo, C.A., Verleysen, M., Blondel, V.D., 2013. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports.
Di, X., Shi, R., Mo, Z., Fu, Y., 2023. Physics-informed deep learning for traffic state estimation: A survey and the outlook. Algorithms 16, 305.
Diao, E., Ding, J., Tarokh, V., 2020. Heterofl: Computation and communication efficient federated learning for heterogeneous clients, in: International Conference on Learning Representations.
Donahue, K., Kleinberg, J., 2021.
Model-sharing games: Analyzing federated learning under voluntary participation, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5303–5311.
Dwork, C., 2006. Differential privacy, in: International colloquium on automata, languages, and programming, Springer. pp. 1–12.
van Erp, P.B., Knoop, V.L., Hoogendoorn, S.P., 2018. Macroscopic traffic state estimation using relative flows from stationary and moving observers. Transportation Research Part B: Methodological 114, 281–299.
Fedorov, A., Nikolskaia, K., Ivanov, S., Shepelev, V., Minbaleev, A., 2019. Traffic flow estimation with data from a video surveillance camera. Journal of Big Data 6, 1–15.
Feng, J., Rong, C., Sun, F., Guo, D., Li, Y., 2020. Pmf: A privacy-preserving human mobility prediction framework via federated learning. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4, 1–21.
Fiege, U., Fiat, A., Shamir, A., 1987. Zero knowledge proofs of identity, in: Proceedings of the nineteenth annual ACM symposium on Theory of computing, pp. 210–217.
Gao, H., Li, Z., Wang, Y., 2022. Privacy-preserving collaborative estimation for networked vehicles with application to collaborative road profile estimation. IEEE Transactions on Intelligent Transportation Systems.
Gati, N.J., Yang, L.T., Feng, J., Nie, X., Ren, Z., Tarus, S.K., 2021. Differentially private data fusion and deep learning framework for cyber–physical–social systems: state-of-the-art and perspectives. Information Fusion.
Geroliminis, N., Daganzo, C.F., 2008. Existence of urban-scale macroscopic fundamental diagrams: Some experimental findings. Transportation Research Part B: Methodological 42, 759–770.
Ghodsi, Z., Javaheripi, M., Sheybani, N., Zhang, X., Huang, K., Koushanfar, F., 2023. zprobe: Zero peek robustness checks for federated learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4860–4870.
Guo, Q., Li, L., Ban, X.J., 2019. Urban traffic signal control with connected and automated vehicles: A survey. Transportation Research Part C: Emerging Technologies 101, 313–334.
Hamer, J., Mohri, M., Suresh, A.T., 2020. Fedboost: A communication-efficient algorithm for federated learning, in: International Conference on Machine Learning, PMLR. pp. 3973–3983.
Han, Y., Meng, Y., Zheng, J., Liu, H., 2019. An Urban Freeway Ramp Metering Control System based on Trajectory Data. Technical Report.
Han, Y., Wang, M., Li, L., Roncoli, C., Gao, J., Liu, P., 2022. A physics-informed reinforcement learning-based strategy for local and coordinated ramp metering. Transportation Research Part C: Emerging Technologies 137, 103584.
Hanzely, F., Hanzely, S., Horváth, S., Richtárik, P., 2020. Lower bounds and optimal algorithms for personalized federated learning. Advances in Neural Information Processing Systems 33, 2304–2315.
He, B.Y., Chow, J.Y., 2019. Optimal privacy control for transport network data sharing. Transportation Research Procedia 38, 792–811.
He, B.Y., Chow, J.Y., 2020. Optimal privacy control for transport network data sharing. Transportation Research Part C: Emerging Technologies 113.
Herrera, J.C., Bayen, A.M., 2010. Incorporation of lagrangian measurements in freeway traffic state estimation. Transportation Research Part B: Methodological 44, 460–481.
Huang, F., Xu, J., Weng, J., 2020. Multi-task travel route planning with a flexible deep learning framework. IEEE Transactions on Intelligent Transportation Systems 22, 3907–3918.
Huang, X., Li, P., Yu, R., Wu, Y., Xie, K., Xie, S., 2021. Fedparking: A federated learning based parking space estimation with parked vehicle assisted edge computing. IEEE Transactions on Vehicular Technology 70, 9355–9368.
Imteaj, A., Amini, M.H., 2022. Leveraging asynchronous federated learning to predict customers financial distress. Intelligent Systems with Applications 14, 200064.
Jia, R., Dao, D., Wang, B., Hubis, F.A., Hynes, N., Gürel, N.M., Li, B., Zhang, C., Song, D., Spanos, C.J., 2019. Towards efficient data valuation based on the shapley value, in: The 22nd International Conference on Artificial Intelligence and Statistics, PMLR. pp. 1167–1176.
Jiang, J., Burkhalter, L., Fu, F., Ding, B., Du, B., Hithnawi, A., Li, B., Zhang, C., 2022. Vf-ps: How to select important participants in vertical federated learning, efficiently and securely? Advances in Neural Information Processing Systems 35, 2088–2101.
Jin, F., Hua, W., Francia, M., Chao, P., Orowska, M., Zhou, X., 2022. A survey and experimental study on privacy-preserving trajectory data publishing. IEEE Transactions on Knowledge and Data Engineering.
Ke, R., Li, Z., Tang, J., Pan, Z., Wang, Y., 2018. Real-time traffic flow parameter estimation from uav video based on ensemble classifier and optical flow. IEEE Transactions on Intelligent Transportation Systems 20, 54–64.
Kim, J.W., Edemacu, K., Jang, B., 2022. Privacy-preserving mechanisms for location privacy in mobile crowdsensing: A survey. Journal of Network and Computer Applications.
Lai, F., Dai, Y., Singapuram, S., Liu, J., Zhu, X., Madhyastha, H., Chowdhury, M., 2022. Fedscale: Benchmarking model and system performance of federated learning at scale, in: International Conference on Machine Learning, PMLR. pp. 11814–11827.
Le Ny, J., Touati, A., Pappas, G.J., 2014. Real-time privacy-preserving model-based estimation of traffic flows, in: 2014 ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS), IEEE. pp. 92–102.
Lewko, A., Okamoto, T., Sahai, A., Takashima, K., Waters, B., 2010.
Fully secure functional encryption: Attribute-based encryption and (hierarchical) inner product encryption, in: Advances in Cryptology–EUROCRYPT 2010: 29th Annual International Conference on the Theory and Applications of Cryptographic Techniques, French Riviera, May 30–June 3, 2010. Proceedings 29, Springer. pp. 62–91.
Li, J., Boonaert, J., Doniec, A., Lozenguez, G., 2021a. Multi-models machine learning methods for traffic flow estimation from floating car data. Transportation Research Part C: Emerging Technologies 132, 103389.
Li, Q., Wen, Z., Wu, Z., Hu, S., Wang, N., Li, Y., Liu, X., He, B., 2021b. A survey on federated learning systems: Vision, hype and reality for data privacy and protection. IEEE Transactions on Knowledge and Data Engineering.
Li, Y., Tao, X., Zhang, X., Liu, J., Xu, J., 2021c. Privacy-preserved federated learning for autonomous driving. IEEE Transactions on Intelligent Transportation Systems 23, 8423–8434.
Liu, M., Zhao, J., Hoogendoorn, S., Wang, M., 2022a. An optimal control approach of integrating traffic signals and cooperative vehicle trajectories at intersections. Transportmetrica B: Transport Dynamics 10, 971–987.
Liu, Q., Chow, J.Y., 2022. Efficient and stable data-sharing in a public transit oligopoly as a coopetitive game. Transportation Research Part B: Methodological 163, 64–87.
Liu, Y., James, J., Kang, J., Niyato, D., Zhang, S., 2020. Privacy-preserving traffic flow prediction: A federated learning approach. IEEE Internet of Things Journal.
Liu, Y., Kang, Y., Zou, T., Pu, Y., He, Y., Ye, X., Ouyang, Y., Zhang, Y.Q., Yang, Q., 2022b. Vertical federated learning. arXiv preprint arXiv:2211.12814.
Liu, Y., Zhang, X., Kang, Y., Li, L., Chen, T., Hong, M., Yang, Q., 2022c. Fedbcd: A communication-efficient collaborative learning framework for distributed features. IEEE Transactions on Signal Processing.
Lopez, P.A., Behrisch, M., Bieker-Walz, L., Erdmann, J., Flötteröd, Y.P., Hilbrich, R., Lücken, L., Rummel, J., Wagner, P., Wießner, E., 2018. Microscopic traffic simulation using sumo, in: 2018 21st international conference on intelligent transportation systems (ITSC), IEEE. pp. 2575–2582.
Lu, J., Li, C., Wu, X.B., Zhou, X.S., 2023. Physics-informed neural networks for integrated traffic state and queue profile estimation: A differentiable programming approach on layered computational graphs. Transportation Research Part C: Emerging Technologies 153, 104224.
Ma, C., Li, J., Ding, M., Yang, H.H., Shu, F., Quek, T.Q., Poor, H.V., 2020. On safeguarding privacy and security in the framework of federated learning. IEEE Network 34, 242–248.
Makridis, M.A., Kouvelas, A., 2023. An adaptive framework for real-time freeway traffic estimation in the presence of cavs. Transportation Research Part C: Emerging Technologies 149, 104066.
McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A., 2017. Communication-efficient learning of deep networks from decentralized data, in: Artificial intelligence and statistics, PMLR. pp. 1273–1282.
Meert, W., Verbeke, M., 2018. Hmm with non-emitting states for map matching, in: European Conference on Data Analysis (ECDA), Date: 2018/07/04-2018/07/06, Location: Paderborn, Germany.
Mo, Z., Shi, R., Di, X., 2021. A physics-informed deep learning paradigm for car-following models. Transportation Research Part C: Emerging Technologies 130, 103240.
Nekouei, E., Sandberg, H., Skoglund, M., Johansson, K.H., 2022. A model randomization approach to statistical parameter privacy. IEEE Transactions on Automatic Control 68, 839–850.
Nguyen, T., Thai, M.T., 2023. Preserving privacy and security in federated learning.
IEEE/ACM Transactions on Networking.
Nie, T., Qin, G., Wang, Y., Sun, J., 2023. Correlating sparse sensing for large-scale traffic speed estimation: A laplacian-enhanced low-rank tensor kriging approach. Transportation Research Part C: Emerging Technologies 152, 104190.
Nilsson, A., Smith, S., Ulm, G., Gustavsson, E., Jirstrand, M., 2018. A performance evaluation of federated learning algorithms, in: Proceedings of the second workshop on distributed infrastructures for deep learning, pp. 1–8.
Ning, Z., Zhang, K., Wang, X., Obaidat, M.S., Guo, L., Hu, X., Hu, B., Guo, Y., Sadoun, B., Kwok, R.Y., 2020. Joint computing and caching in 5g-envisioned internet of vehicles: A deep reinforcement learning-based traffic control system. IEEE Transactions on Intelligent Transportation Systems 22, 5201–5212.
Okamoto, T., Takashima, K., 2015. Achieving short ciphertexts or short secret-keys for adaptively secure general inner-product encryption. Designs, Codes and Cryptography 77, 725–771.
Papageorgiou, M., Blosseville, J.M., Hadj-Salem, H., 1990. Modelling and real-time control of traffic flow on the southern part of boulevard peripherique in paris: Part i: Modelling. Transportation Research Part A: General 24, 345–359.
Parameswarath, R.P., Gope, P., Sikdar, B., 2022. User-empowered privacy-preserving authentication protocol for electric vehicle charging based on decentralized identity and verifiable credential. ACM Transactions on Management Information Systems (TMIS).
Rostami-Shahrbabaki, M., Safavi, A.A., Papageorgiou, M., Setoodeh, P., Papamichail, I., 2020. State estimation in urban traffic networks: A two-layer approach. Transportation Research Part C: Emerging Technologies 115, 102616.
Rothchild, D., Panda, A., Ullah, E., Ivkin, N., Stoica, I., Braverman, V., Gonzalez, J., Arora, R., 2020. Fetchsgd: Communication-efficient federated learning with sketching, in: International Conference on Machine Learning, PMLR. pp. 8253–8265.
Saeedmanesh, M., Kouvelas, A., Geroliminis, N., 2021. An extended kalman filter approach for real-time state estimation in multi-region mfd urban networks. Transportation Research Part C: Emerging Technologies 132, 103384.
Seo, T., Bayen, A.M., Kusakabe, T., Asakura, Y., 2017. Traffic state estimation on highway: A comprehensive survey. Annual Reviews in Control 43, 128–151.
Shahrbabaki, M.R., Safavi, A.A., Papageorgiou, M., Papamichail, I., 2018. A data fusion approach for real-time traffic state estimation in urban signalized links. Transportation Research Part C: Emerging Technologies 92, 525–548.
Shamir, A., 1979. How to share a secret. Commun. ACM 22, 612–613.
Shi, R., Mo, Z., Huang, K., Di, X., Du, Q., 2021. A physics-informed deep learning paradigm for traffic state and fundamental diagram estimation. IEEE Transactions on Intelligent Transportation Systems 23, 11688–11698.
Strava, 2018. The strava heat map and the end of secrets. Available online at https://www.wired.com/story/strava-heat-map-military-bases-fitness-trackers-privacy/.
Su, Z., Wang, Y., Luan, T.H., Zhang, N., Li, F., Chen, T., Cao, H., 2021. Secure and efficient federated learning for smart grid with edge-cloud collaboration. IEEE Transactions on Industrial Informatics 18, 1333–1344.
Tan, C., Yang, K., 2024. Privacy-preserving adaptive traffic signal control in a connected vehicle environment. Transportation Research Part C: Emerging Technologies 158, 104453.
Tsao, M., Yang, K., Gopalakrishnan, K., Pavone, M., 2022a. Private location sharing for decentralized routing services. arXiv preprint Extended version. Available at .
Tsao, M., Yang, K., Zoepf, S., Pavone, M., 2022b. Trust but verify: Cryptographic data privacy for mobility management. IEEE Transactions on Control of Network Systems 9, 50–61.
Wang, R., Fan, S., Work, D.B., 2016. Efficient multiple model particle filtering for joint traffic state estimation and incident detection.
Transport ation Researc h Part C: Emerging T echnologies 71, 521–537. W ang, S., Zhang, X., Li, F., Philip, S.Y ., Huang, Z., 2018. Eﬃcient traﬃc estimation with multi-sourced dat a by parallel coupled hidden marko v model. IEEE Transactions on Intelligent Transportation Systems 20, 3010–3023. W ang, Y ., Zhao, M., Y u, X., Hu, Y ., Zheng, P ., Hua, W ., Zhang, L., Hu, S., Guo, J., 2022. Real-time joint traﬃc state and model parameter estimation on freewa ys with ﬁxed sensors and connected vehicles: State-of-the-ar t ov erview , methods, and case studies. Transportation Research Part C: Emerging T echnologies 134, 103444. W ei, K., Li, J., Ding, M., Ma, C., Y ang, H.H., Farokhi, F ., Jin, S., Quek, T .Q., Poor , H.V ., 2020. Federated lear ning with diﬀerential pr ivacy: Algorithms and performance analy sis. IEEE Transactions on Information Forensics and Secur ity 15, 3454–3469. W ei, K., Li, J., Ma, C., Ding, M., W ei, S., Wu, F., Chen, G., Ranbaduge, T., 2022. V er tical f ederated learning: Challenges, methodologies and experiments. arXiv preprint arXiv:2202.04309 . Wu, C., Thai, J., Y adlow sky , S., Pozdnoukho v , A., Bayen, A., 2015. Cellpath: Fusion of cellular and traﬃc sensor dat a for route ﬂow estimation via conv e x optimization. Transport ation Research Procedia 7, 212–232. Xia, M., Jin, D., Chen, J., 2022. Shor t-term traﬃc ﬂow prediction based on graph conv olutional networ ks and federated lear ning. IEEE Transactions on Intelligent Transportation Systems . Xia, Q., Y e, W ., Tao, Z., Wu, J., Li, Q., 2021. A sur ve y of federated lear ning f or edge computing: Research problems and solutions. High-Conﬁdence Computing 1, 100008. W ang and Y ang: Preprint submitted to Elsevier P age 24 of 25 Privacy-Preserving Data Fusion fo r T raﬃc State Estimation: A Vertical F ederated Learning App roach Xu, D., W ei, C., Peng, P ., Xuan, Q., Guo, H., 2020. Ge-gan: A novel deep lear ning framewor k for road traﬃc state estimation. 
Transportation Researc h Part C: Emerging T echnologies 117, 102635. Y ang, K., Menendez, M., 2018. Queue estimation in a connected v ehicle environment: A conv ex approach. IEEE Transactions on Intelligent Transportation Systems 20, 2480–2496. Ying, Z., Cao, S., Liu, X., Ma, Z., Ma, J., Deng, R.H., 2022. Privacysignal: Privacy-preserving traﬃc signal control for intelligent transport ation system. IEEE Transactions on Intelligent Transport ation Systems . Y u, B., Yin, H., Zhu, Z., 2017. Spatio-temporal g raph conv olutional networks: A deep learning framewor k for traﬃc forecasting. arXiv preprint arXiv:1709.04875 . Zeng, R., Zeng, C., W ang, X., Li, B., Chu, X., 2021. A comprehensive survey of incentive mechanism for federated learning. arXiv preprint arXiv:2106.15406 . Zhan, X., Li, R., Ukkusur i, S.V ., 2020. Link -based traﬃc state estimation and prediction for ar terial networks using license-plate recognition data. Transportation Research Par t C: Emerging Tec hnologies 117, 102660. Zhao, L., Song, Y ., Zhang, C., Liu, Y ., W ang, P ., Lin, T., Deng, M., Li, H., 2019a. T-gcn: A temporal graph conv olutional networ k for traﬃc prediction. IEEE Transactions on Intelligent Transportation Systems 21, 3848–3858. Zhao, Y ., Zheng, J., W ong, W ., W ang, X., Meng, Y ., Liu, H.X., 2019b. V ar ious methods for queue length and traﬃc volume estimation using probe vehicle trajectories. T ranspor tation Research Part C: Emerging Tec hnologies 107, 70–91. Zheng, C., Fan, X., W ang, C., Qi, J., 2020. Gman: A graph multi-attention network for traﬃc prediction, in: Proceedings of t he AAAI conf erence on ar tiﬁcial intelligence, pp. 1234–1241. Zheng, J., Liu, H.X., 2017. Es timating traﬃc volumes f or signalized intersections using connected vehicle dat a. Transportation Research Part C: Emerging T echnologies 79, 347–362. Zheng, J., Sun, W ., Huang, S., Shen, S., Y u, C., Zhu, J., Liu, B., Liu, H.X., 2018. 
Traﬃc signal optimization using crowdsourced vehicle trajectory data. T echnical Report. W ang and Y ang: Preprint submitted to Elsevier P age 25 of 25
