Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion
Many machine learning systems have access to multiple sources of evidence for the same prediction target, yet these sources often differ in reliability and informativeness across inputs. In bioacoustic classification, species identity may be inferred both from the acoustic signal and from spatiotemporal context such as location and season; while Bayesian inference motivates multiplicative evidence combination, in practice we typically only have access to discriminative predictors rather than calibrated generative models. We introduce \textbf{F}usion under \textbf{IN}dependent \textbf{C}onditional \textbf{H}ypotheses (\textbf{FINCH}), an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH learns a per-sample gating function that estimates the reliability of contextual information from uncertainty and informativeness statistics. The resulting fusion family \emph{contains} the audio-only classifier as a special case and explicitly bounds the influence of contextual evidence, yielding a risk-contained hypothesis class with an interpretable audio-only fallback. Across benchmarks, FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs even when contextual information is weak in isolation. We achieve state-of-the-art performance on CBI and competitive or improved performance on several subsets of BirdSet using a lightweight, interpretable, evidence-based approach. Code is available: \texttt{\href{https://anonymous.4open.science/r/birdnoise-85CD/README.md}{anonymous-repository}}
💡 Research Summary
The paper introduces FINCH (Fusion under INdependent Conditional Hypotheses), an adaptive log‑linear evidence‑fusion framework designed for tasks where multiple, conditionally independent sources of information are available at inference time. In the context of bioacoustic species identification, the authors combine a pre‑trained audio classifier pθ(y|x) with a structured spatiotemporal prior pψ(y|s) derived from large‑scale observational data (e.g., eBird). Classical Bayesian fusion would require full generative models p(x|y) and p(s|y), which are rarely available. Instead, FINCH works directly with discriminative posteriors, adopting a product‑of‑experts‑like log‑linear form:
log p̃ω(y|x,s) = log pθ(y|x) + ω(x,s)·log pψ(y|s).
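The fusion rule above can be sketched directly in numpy. This is a minimal illustration, not the authors' implementation: `finch_fuse` is a hypothetical helper name, and it renormalizes the fused log-scores over classes so the result is a proper posterior. Note that ω = 0 recovers the audio-only classifier exactly, which is the "contains the audio-only classifier as a special case" property stated in the abstract.

```python
import numpy as np

def finch_fuse(log_p_audio, log_p_context, omega):
    """Log-linear (product-of-experts-style) evidence fusion.

    log_p_audio   : log p_theta(y|x), shape (num_classes,)
    log_p_context : log p_psi(y|s),   shape (num_classes,)
    omega         : per-sample non-negative weight on contextual evidence
    Returns the renormalized fused posterior over classes.
    """
    logits = log_p_audio + omega * log_p_context
    logits = logits - logits.max()   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# omega = 0 falls back to the audio-only posterior; omega > 0 lets
# strong spatiotemporal evidence shift probability mass between classes.
p_audio = np.array([0.7, 0.2, 0.1])
p_ctx   = np.array([0.1, 0.6, 0.3])
fused0  = finch_fuse(np.log(p_audio), np.log(p_ctx), 0.0)  # == p_audio
fused1  = finch_fuse(np.log(p_audio), np.log(p_ctx), 1.0)
```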
The key novelty is the per‑sample weight ω(x,s) ≥ 0, learned by a small gating network. This network receives a concatenated feature vector u(x,s) that encodes (1) audio uncertainty (max probability, entropy, top‑2 margin), (2) spatiotemporal uncertainty (same statistics on pψ), and (3) raw metadata (day‑of‑year, hour‑of‑day encoded with sine/cosine, latitude, longitude). A two‑layer MLP gϕ processes u, its output passes through a sigmoid and is scaled by a learnable upper bound ωmax (constrained to
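The gating mechanism described above can be sketched as follows. This is a hedged illustration under stated assumptions: the helper names (`uncertainty_features`, `build_features`, `gating_omega`), the hidden width, the ReLU activation, and the example value of ωmax are all hypothetical; the source specifies only the three uncertainty statistics per posterior, the sine/cosine metadata encoding, the two-layer MLP, and the sigmoid scaled by a learnable bound.

```python
import numpy as np

def uncertainty_features(p):
    """The three per-posterior statistics named in the summary:
    max probability, entropy, and top-2 margin."""
    s = np.sort(p)[::-1]                          # probabilities, descending
    entropy = -np.sum(p * np.log(p + 1e-12))
    return np.array([s[0], entropy, s[0] - s[1]])

def build_features(p_audio, p_ctx, day_of_year, hour, lat, lon):
    """Concatenated feature vector u(x, s): audio uncertainty,
    spatiotemporal uncertainty, and sine/cosine-encoded metadata."""
    t = 2.0 * np.pi * day_of_year / 365.0
    h = 2.0 * np.pi * hour / 24.0
    meta = np.array([np.sin(t), np.cos(t), np.sin(h), np.cos(h), lat, lon])
    return np.concatenate([uncertainty_features(p_audio),
                           uncertainty_features(p_ctx), meta])

def gating_omega(u, W1, b1, W2, b2, omega_max):
    """Two-layer MLP g_phi; sigmoid output scaled by the bound omega_max,
    so 0 <= omega <= omega_max by construction."""
    h = np.maximum(W1 @ u + b1, 0.0)              # hidden layer (ReLU assumed)
    return float(omega_max / (1.0 + np.exp(-(W2 @ h + b2))))

# Example with random (untrained) weights; hidden width 8 is arbitrary.
rng = np.random.default_rng(0)
u = build_features(np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.6, 0.3]),
                   day_of_year=120, hour=6.5, lat=42.0, lon=-71.0)
W1, b1 = rng.standard_normal((8, u.size)), np.zeros(8)
W2, b2 = rng.standard_normal(8), 0.0
omega = gating_omega(u, W1, b1, W2, b2, omega_max=2.0)
```

The sigmoid-times-bound construction is what makes the hypothesis class "risk-contained": however the gate misbehaves, the contextual log-evidence can never be weighted by more than ωmax.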