DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction
Incomplete node features are ubiquitous in real-world scenarios, e.g., the attributes of web users may be partly private, which causes the performance of Graph Neural Networks (GNNs) to decline significantly. Feature propagation (FP) is a well-known method that performs well for imputing missing node features on graphs, but it still has three issues: 1) it struggles with graphs that are not fully connected, 2) its imputed features suffer from over-smoothing, and 3) FP is tailored for transductive tasks, overlooking the feature distribution shift that arises in inductive tasks. To address these challenges, we introduce DDFI, a Diverse and Distribution-aware Missing Feature Imputation method that combines feature propagation with a graph-based Masked AutoEncoder (MAE) in a nontrivial manner. It first designs a simple yet effective algorithm, namely Co-Label Linking (CLL), that randomly connects nodes in the training set with the same label to enhance performance on graphs with numerous connected components. Then we develop a novel two-step representation generation process for the inference stage. Specifically, instead of directly using FP-imputed features as input during inference, DDFI further reconstructs the features through the whole MAE to reduce feature distribution shift in inductive tasks and enhance the diversity of node features. Meanwhile, since existing feature imputation methods for graphs are evaluated only by simulating missing-feature scenarios through manually masked features, we collect a new dataset, Sailing, from voyage records; it contains naturally missing features and thus enables a more faithful evaluation. Extensive experiments conducted on six public datasets and Sailing show that DDFI outperforms state-of-the-art methods under both transductive and inductive settings.
💡 Research Summary
The paper “DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction” proposes a novel framework to address the critical challenge of imputing missing node features in graphs, a common issue that severely degrades the performance of Graph Neural Networks (GNNs) in real-world applications.
The authors identify three fundamental limitations of the well-known Feature Propagation (FP) method: 1) performance degradation on graphs that are not fully connected (having multiple components), 2) over-smoothing of imputed features, reducing node diversity, and 3) feature distribution shift in inductive settings, where the graph structure seen during training differs from the full graph used during inference.
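For context, FP can be sketched as iterative neighbour averaging in which known features are held fixed after every step. The snippet below is a minimal illustration under that simplification, not the authors' implementation (real FP diffuses features with a normalized adjacency matrix rather than a plain mean); the graph, features, and mask are made up for the example.

```python
# Minimal sketch of feature propagation (FP) on a toy graph.
# Missing features are filled by repeatedly averaging neighbour
# values; known features are reset to their observed values
# after each iteration.

def feature_propagation(adj, x, known, n_iters=50):
    """adj: list of neighbour lists; x: per-node feature values
    (scalars for simplicity); known: per-node booleans."""
    x = list(x)
    for _ in range(n_iters):
        new_x = []
        for i, nbrs in enumerate(adj):
            if known[i] or not nbrs:
                new_x.append(x[i])  # keep observed features fixed
            else:
                new_x.append(sum(x[j] for j in nbrs) / len(nbrs))
        x = new_x
    return x

# Path graph 0-1-2; node 1's feature is missing (initialized to 0).
adj = [[1], [0, 2], [1]]
print(feature_propagation(adj, [1.0, 0.0, 3.0], [True, False, True]))
# → [1.0, 2.0, 3.0]
```

Note how the sketch also exposes the first limitation above: a node in a connected component containing no observed features would never move away from its initial fill value, since no information can propagate into that component.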
To overcome these challenges, DDFI innovatively combines FP with a graph-based Masked AutoEncoder (MAE). The framework consists of three core components.

First, a pre-processing step called Co-Label Linking (CLL) randomly connects nodes sharing the same label within the training set. This simple yet effective technique increases graph connectivity without harming homophily, thereby improving the foundation for FP.

Second, during the training stage, DDFI employs a graph MAE. The model is trained to reconstruct “complete” features (obtained by applying FP on the CLL-enhanced graph) from corrupted inputs. These inputs are created by applying FP on a randomly edge-dropped version of the graph and then masking features based on a Gaussian distribution strategy. This process teaches the model to generate diverse node representations.

Third, and most crucially, DDFI introduces a novel two-step inference process. Instead of directly using FP-imputed features for downstream tasks, DDFI first reconstructs these features by passing them through the entire trained MAE (encoder-decoder). The output is then fed through the encoder again to produce the final node representations. This two-step process aligns the feature distribution of inference data with the distribution learned by the MAE during training, effectively mitigating the feature distribution shift problem in inductive tasks and preserving feature diversity.
Furthermore, the authors highlight a gap in evaluation methodology: existing methods are typically evaluated on datasets whose features are artificially masked. They therefore collect and introduce a new real-world dataset named “Sailing”, derived from maritime voyage records, in which 80.4% of features are naturally missing. This provides a more realistic benchmark.
Extensive experiments on six public benchmark datasets and the new Sailing dataset demonstrate that DDFI significantly outperforms state-of-the-art methods in both transductive and inductive settings for tasks like node classification and link prediction. The work makes significant contributions by enhancing the robustness of feature imputation against graph disconnection, actively combating over-smoothing, providing a principled solution to distribution shift in inductive learning, and advancing evaluation practices with a naturally incomplete dataset.