Predicting galaxy bias using machine learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Understanding how galaxies trace the underlying matter density field is essential for characterizing the influence of the large-scale structure on galaxy formation, and is therefore a key ingredient in observational cosmology. This connection, commonly described through the galaxy bias, $b$, can be studied effectively using machine learning (ML) techniques, which offer strong predictive capabilities and can capture non-linear relationships. We aim to incorporate the linear bias parameter assigned to individual galaxies into an ML framework, quantify its dependence on various halo and environmental properties, and evaluate whether different algorithms can accurately predict this parameter and reproduce the scatter in several bias relations. We use data from the IllustrisTNG300 simulation, including the distance to different cosmic-web structures computed with DisPerSE. These data are complemented with an object-by-object estimator of the large-scale linear bias ($b_i$), providing the individual contribution of each galaxy to the bias of the entire population. Our ML framework uses three models to predict $b_i$: a Random Forest Regressor, a Neural Network and a probabilistic method (Normalizing Flows). We recover the full hierarchy of galaxy bias dependencies, showing that the most informative features are the overdensities, particularly $\delta_8$, followed by the distances to cosmic-web structures and selected internal halo properties, most notably $z_{1/2}$. We also demonstrate that Normalizing Flows clearly outperform deterministic methods in predicting galaxy bias, including its joint distributions with galaxy properties, owing to their ability to capture the intrinsic variance associated with the stochastic nature of the matter-halo-galaxy connection. Our ML framework provides a foundation for future efforts to measure individual bias with upcoming spectroscopic surveys.


💡 Research Summary

This paper integrates galaxy bias, a fundamental quantity linking the distribution of galaxies to the underlying matter density field, into a modern machine‑learning (ML) framework. Using the IllustrisTNG300‑1 hydrodynamical simulation, the authors first compute an object‑by‑object estimator of the large‑scale linear bias, $b_i$, following the method of Paranjape et al. (2018). This estimator assigns a bias value to each individual galaxy, thereby avoiding the loss of signal‑to‑noise that occurs when bias is measured only for subsamples.
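For orientation, the object‑by‑object estimator is usually quoted in a form like the one below. This is a recollection of the expression in Paranjape et al. (2018), not something stated in this summary, so the notation and the exact range of wavenumbers should be treated as assumptions and checked against that paper.

$$
b_i = \frac{\sum_{k} N_k \,\left\langle e^{i\mathbf{k}\cdot\mathbf{x}_i}\, \delta^{*}(\mathbf{k}) \right\rangle_k}{\sum_{k} N_k \, P(k)}
$$

Here $\mathbf{x}_i$ is the position of galaxy $i$, $\delta(\mathbf{k})$ is the Fourier‑space matter overdensity, $P(k)$ is the matter power spectrum, $N_k$ is the number of modes in each $k$ bin, $\langle\cdot\rangle_k$ denotes an average over modes in that bin, and the sums run over low‑$k$ (large‑scale) bins only. Averaging $b_i$ over a galaxy sample then gives an estimate of that sample's linear bias.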

The authors assemble a rich feature set for each central galaxy, including internal halo properties (virial mass, formation redshift $z_{1/2}$, concentration, spin), galaxy properties (stellar mass, intrinsic $g-i$ colour), local overdensities measured on 3, 5 and 8 $h^{-1}$ Mpc scales ($\delta_3$, $\delta_5$, $\delta_8$), and distances to cosmic‑web critical points (void minima, nodes, wall‑type saddles, filament‑type saddles, and filament segments) derived with DisPerSE. After applying mass cuts ($\log M_{\rm vir}>10.5$, $\log M_\star>8.75$) the final sample contains 174,607 central galaxies.
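To make the overdensity features concrete, here is a minimal sketch of a counts‑in‑spheres overdensity, $\delta_R = N_{\rm in}/N_{\rm expected} - 1$, in a periodic box. It is an illustration under stated assumptions: the summary does not say how $\delta_3$, $\delta_5$ and $\delta_8$ were computed or which tracer was used, and the function names, the brute‑force neighbour search and the toy random tracers are all hypothetical.

```python
import math
import random


def periodic_distance(a, b, box_size):
    """Euclidean distance under the minimum-image convention in a periodic cube."""
    total = 0.0
    for ai, bi in zip(a, b):
        d = abs(ai - bi) % box_size
        d = min(d, box_size - d)  # wrap around the box if that is shorter
        total += d * d
    return math.sqrt(total)


def sphere_overdensity(centre, tracers, radius, box_size):
    """delta_R = N_in_sphere / N_expected - 1 for a sphere of the given radius."""
    n_inside = sum(
        1 for t in tracers if periodic_distance(centre, t, box_size) <= radius
    )
    mean_density = len(tracers) / box_size**3
    expected = mean_density * (4.0 / 3.0) * math.pi * radius**3
    return n_inside / expected - 1.0


if __name__ == "__main__":
    # Toy data: uniformly random tracers in a 100-unit box (assumed, not TNG300).
    rng = random.Random(42)
    box = 100.0
    tracers = [[rng.uniform(0.0, box) for _ in range(3)] for _ in range(2000)]
    centre = [50.0, 50.0, 50.0]
    for radius in (3.0, 5.0, 8.0):
        print(radius, sphere_overdensity(centre, tracers, radius, box))
```

A production version would use a spatial tree or a gridded density field rather than the quadratic‑cost loop shown here.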

Three ML models are trained to predict $b_i$ from these features: (i) a Random Forest Regressor (RF), (ii) a feed‑forward Neural Network (NN) with three hidden layers, and (iii) a Normalizing Flow (NF) based on the RealNVP architecture. The first two are deterministic regressors that output a single point estimate, while the NF is a probabilistic model that learns the full conditional density $p(b_i\mid\mathbf{X})$. Training uses an 80/10/10 split for training, validation, and testing, with hyper‑parameter optimisation performed via cross‑validation.
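The difference between a point estimate and a conditional density can be illustrated with the simplest possible flow: a single affine transform of a standard‑normal variable whose shift and log‑scale depend on the features. The sketch below is not RealNVP and not the authors' model; the linear conditioner, the weight vectors `w_mu` and `w_s`, and the function names are hypothetical stand‑ins chosen to show the change‑of‑variables formula $\log p(b\mid\mathbf{X}) = \log \mathcal{N}(z) + \log|\mathrm{d}z/\mathrm{d}b|$.

```python
import math
import random


def conditional_affine_params(x, w_mu, w_s):
    """Hypothetical linear conditioner: features x -> (shift, log-scale)."""
    mu = sum(wi * xi for wi, xi in zip(w_mu, x))
    log_sigma = sum(wi * xi for wi, xi in zip(w_s, x))
    return mu, log_sigma


def log_prob(b, x, w_mu, w_s):
    """log p(b | x) via change of variables with a standard-normal base."""
    mu, log_sigma = conditional_affine_params(x, w_mu, w_s)
    z = (b - mu) * math.exp(-log_sigma)  # inverse transform b -> z
    log_base = -0.5 * (z * z + math.log(2.0 * math.pi))
    return log_base - log_sigma  # log|dz/db| = -log_sigma


def sample(x, w_mu, w_s, n, rng):
    """Draw n values of b from p(b | x) by pushing base samples forward."""
    mu, log_sigma = conditional_affine_params(x, w_mu, w_s)
    scale = math.exp(log_sigma)
    return [mu + scale * rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    features = [1.0, 2.0]  # toy feature vector, not real galaxy data
    draws = sample(features, [0.5, 0.25], [0.1, 0.0], 5, rng)
    print(draws)
    print([log_prob(b, features, [0.5, 0.25], [0.1, 0.0]) for b in draws])
```

A deterministic regressor would return only the shift; the flow additionally assigns a width to each galaxy and can be sampled, which is what allows scatter in the bias relations to be reproduced. Real flows stack several such invertible layers with neural‑network conditioners.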

Performance is evaluated using the coefficient of determination ($R^2$), mean absolute error (MAE), and, for the NF, the Kullback‑Leibler (KL) divergence between the predicted and true bias distributions. The RF achieves $R^2\approx0.66$ and MAE≈0.12, the NN improves to $R^2\approx0.71$ and MAE≈0.10, while the NF reaches $R^2\approx0.78$, MAE≈0.08, and a low KL divergence of 0.15. Crucially, the NF reproduces not only the mean bias‑mass and bias‑colour relations but also the full joint distributions, capturing asymmetries and heavy tails that deterministic models miss.
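The three metrics named above can be written down in a few lines. The sketch below uses their standard definitions; how the paper bins the distributions and which direction of the KL divergence it reports are not stated in this summary, so the histogram‑based estimate of $D_{\rm KL}(P_{\rm true}\,\|\,P_{\rm pred})$ and the `eps` floor for empty bins are assumptions.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def histogram(values, edges):
    """Counts per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


def kl_divergence(true_values, pred_values, edges, eps=1e-10):
    """Binned estimate of D_KL(P_true || P_pred), in nats."""
    p_counts = histogram(true_values, edges)
    q_counts = histogram(pred_values, edges)
    p_total, q_total = sum(p_counts), sum(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = pc / p_total
        q = max(qc / q_total, eps)  # floor avoids log of zero in empty bins
        if p > 0.0:
            kl += p * math.log(p / q)
    return kl
```

Note that $R^2$ and MAE only compare point predictions, whereas the KL divergence compares the shapes of the two distributions, which is why it is the natural metric for a probabilistic model.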

Feature‑importance analysis (via RF impurity measures and SHAP values) reveals that the large‑scale overdensity $\delta_8$ is by far the most informative predictor, confirming theoretical expectations that bias is primarily driven by the environment on scales of $\sim 8\,h^{-1}$ Mpc. The next most important features are the other overdensities ($\delta_5$, $\delta_3$), followed by distances to cosmic‑web structures (especially distance to nodes and filaments). Among internal halo properties, the formation redshift $z_{1/2}$ stands out, while virial mass and concentration contribute less than anticipated.
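The idea of ranking features by how much a model relies on them can be illustrated with permutation importance. This is a different, simpler technique than the RF impurity measures and SHAP values mentioned above, swapped in here because it needs no external library; the toy model and its coefficients are invented purely for the illustration.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when one feature column is shuffled.

    predict: callable mapping one feature row (a list) to a prediction.
    X: list of feature rows (lists); y: list of targets.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


if __name__ == "__main__":
    # Toy model (assumed): strong dependence on column 0, weak on column 1,
    # none on column 2. The columns are placeholders, not real galaxy features.
    rng = random.Random(1)
    X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
    model = lambda row: 2.0 * row[0] + 0.1 * row[1]
    y = [model(row) for row in X]
    print(permutation_importance(model, X, y))
```

A feature the model ignores receives an importance of zero, while shuffling a dominant feature degrades the predictions sharply. Correlated features, such as overdensities measured on nearby scales, can share importance between them, so rankings of this kind should be read with that caveat in mind.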

The authors argue that the probabilistic nature of the NF makes it uniquely suited to capture the intrinsic stochasticity of the halo‑galaxy‑matter connection. By providing a full conditional probability distribution, the NF can naturally propagate uncertainties into downstream cosmological analyses, such as measurements of the growth rate $f\sigma_8$ or the matter density parameter $\Omega_m$.

In the discussion, the paper highlights three major contributions: (1) a novel pipeline for estimating individual galaxy bias directly from simulation data, (2) a quantitative ranking of environmental versus internal halo properties in determining bias, and (3) the demonstration that normalizing flows outperform traditional deterministic ML models in this astrophysical context. The authors suggest that the same framework can be applied to upcoming spectroscopic surveys (DESI, Euclid, Rubin LSST) to infer bias on a galaxy‑by‑galaxy basis, thereby improving the precision of large‑scale‑structure cosmology. They also propose extending the NF to conditional generative models that could synthesize mock galaxy catalogs with prescribed bias properties, facilitating forward‑modeling approaches.

Overall, the study bridges a gap between theoretical bias modelling and data‑driven prediction, showing that modern probabilistic deep learning can both elucidate physical drivers of galaxy clustering and deliver the high‑fidelity predictions required for next‑generation cosmological surveys.

