Trapped by simplicity: When Transformers fail to learn from noisy features
Noise is ubiquitous in data used to train large language models, but it is not well understood whether these models are able to correctly generalize to inputs generated without noise. Here, we study noise-robust learning: are transformers trained on data with noisy features able to find a target function that correctly predicts labels for noiseless features? We show that transformers succeed at noise-robust learning for a selection of $k$-sparse parity and majority functions, in contrast to LSTMs, which fail at this task even for modest feature noise. However, we find that transformers typically fail at noise-robust learning of random $k$-juntas, especially when the boolean sensitivity of the optimal solution is smaller than that of the target function. We argue that this failure arises from two factors: transformers’ bias toward simpler functions, combined with the observation that, for random boolean functions, the optimal function for noise-robust learning typically has lower sensitivity than the target function. We test this hypothesis by exploiting transformers’ simplicity bias to trap them in an incorrect solution, but show that transformers can escape this trap by training with an additional loss term penalizing high-sensitivity solutions. Overall, we find that transformers are particularly ineffective for learning boolean functions in the presence of feature noise.
💡 Research Summary
The paper investigates whether transformer models can learn the underlying Boolean function when training data are corrupted by feature noise, a scenario common in large‑language‑model training. The authors formalize “noise‑robust learning” as the task of predicting the true label Y = f(X) from a noisy input Z = X⊕E, where each bit of E is flipped independently with probability p. Success requires that after training on (Z,Y) pairs, the model’s hypothesis g yields low error on clean inputs X.
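The setup above can be sketched in a few lines. This is a minimal illustration of generating (Z, Y) training pairs alongside the clean X used for evaluation; the function names and the 3-sparse parity target are illustrative, not taken from the paper:

```python
import numpy as np

def make_noisy_dataset(f, n_bits, n_samples, p, rng):
    """Sample clean inputs X, labels Y = f(X), and noisy inputs Z = X XOR E,
    where each bit of the noise mask E is 1 independently with probability p."""
    X = rng.integers(0, 2, size=(n_samples, n_bits))
    Y = np.array([f(x) for x in X])
    E = (rng.random((n_samples, n_bits)) < p).astype(int)
    Z = X ^ E  # the model trains on (Z, Y) but is judged on clean X
    return X, Z, Y

# Illustrative target: a 3-sparse parity over the first three bits.
target = lambda x: int(x[:3].sum() % 2)
rng = np.random.default_rng(0)
X, Z, Y = make_noisy_dataset(target, n_bits=10, n_samples=1000, p=0.1, rng=rng)
```

Note that the labels are computed from the clean X, so the noise corrupts only the features, never the labels — this is what distinguishes feature noise from the label-noise setting.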
Two model families are compared: Self‑Attention Networks (SANs, i.e., transformers) and LSTMs. Prior work has shown that transformers exhibit a “simplicity bias” toward low‑sensitivity Boolean functions, which makes them robust to label noise but may hinder learning when the optimal predictor under noisy features (f*ₙ) is simpler than the true target f. The authors hypothesize that this bias, combined with the fact that for random Boolean functions the optimal noisy‑feature predictor typically has lower sensitivity than the target, explains why transformers often fail on noise‑robust learning.
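"Sensitivity" here refers to the standard average sensitivity of a boolean function: the expected number of coordinates whose flip changes the output, over a uniformly random input. A brute-force sketch (illustrative, not the authors' code) makes the notion concrete:

```python
import itertools
import numpy as np

def average_sensitivity(f, n_bits):
    """Average sensitivity: expected number of coordinates whose flip
    changes f's output, over a uniformly random boolean input."""
    total = 0
    for bits in itertools.product([0, 1], repeat=n_bits):
        x = np.array(bits)
        for i in range(n_bits):
            x_flip = x.copy()
            x_flip[i] ^= 1                 # flip coordinate i
            total += int(f(x) != f(x_flip))
    return total / 2 ** n_bits

parity = lambda x: int(x.sum() % 2)             # every flip changes the output
majority = lambda x: int(x.sum() > len(x) / 2)  # only near-threshold flips matter
```

On 5 bits, parity attains the maximal average sensitivity of 5, while majority's is only 1.875 — one concrete sense in which majority is the "simpler" function a low-sensitivity bias prefers.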
Experiments cover three classes of functions: (i) k‑sparse parity, (ii) k‑sparse majority, and (iii) random k‑juntas. For each function, datasets of varying sizes and noise levels are generated, and 300 random initializations with hyper‑parameter sweeps are performed. Results show:
- Transformers reliably learn sparse parity and majority functions even at moderate noise rates, outperforming LSTMs, which degrade quickly.
- For random k‑juntas, transformers achieve near‑optimal validation accuracy on noisy data but suffer large errors on clean test data, indicating they have converged to a low‑sensitivity “trap” function rather than the true target.
- LSTMs, lacking a strong simplicity bias, also fail on noisy inputs but for different reasons, generally showing poorer performance across all tasks.
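The three function classes studied above can be sketched as follows. All helper names and the choice of support indices are illustrative; a k-junta is any function depending on only k of the n input coordinates, of which sparse parity and majority are special cases:

```python
import numpy as np

def sparse_parity(support):
    """k-sparse parity: XOR of the bits indexed by `support`."""
    return lambda x: int(x[support].sum() % 2)

def sparse_majority(support):
    """k-sparse majority: 1 iff more than half of the support bits are 1."""
    return lambda x: int(x[support].sum() > len(support) / 2)

def random_junta(support, rng):
    """Random k-junta: a uniformly random truth table over the support bits."""
    table = rng.integers(0, 2, size=2 ** len(support))
    def f(x):
        idx = int("".join(str(b) for b in x[support]), 2)  # support bits -> row index
        return int(table[idx])
    return f

support = [1, 4, 7]  # k = 3 relevant coordinates (illustrative choice)
f_par = sparse_parity(support)
f_maj = sparse_majority(support)
f_jun = random_junta(support, np.random.default_rng(0))
```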
To test the trap hypothesis, the authors add an explicit regularization term penalizing high‑sensitivity solutions. This penalty enables transformers to escape the low‑sensitivity basin and recover the correct high‑sensitivity function, confirming that the simplicity bias is the primary obstacle.
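The summary does not give the exact form of the authors' regularizer. One natural proxy, sketched below under that caveat, estimates a model's sensitivity empirically — the fraction of single-bit input flips that change its hard prediction — and adds a signed multiple of that estimate to the loss. All names here are hypothetical, and a real training loop would need a differentiable (soft) version of this estimate:

```python
import numpy as np

def estimated_sensitivity(model, X):
    """Fraction of single-bit flips that change the model's hard prediction,
    averaged over the batch X -- a crude empirical sensitivity estimate."""
    preds = model(X)
    n, d = X.shape
    changed = 0
    for i in range(d):
        X_flip = X.copy()
        X_flip[:, i] ^= 1  # flip coordinate i across the whole batch
        changed += int(np.sum(model(X_flip) != preds))
    return changed / (n * d)

def regularized_loss(base_loss, model, X, lam):
    """Add a signed sensitivity term to the loss; the sign of `lam` controls
    whether high- or low-sensitivity solutions are penalized."""
    return base_loss + lam * estimated_sensitivity(model, X)

# Toy "model": parity of the first two input bits.
toy_model = lambda X: X[:, :2].sum(axis=1) % 2
X = np.array([[0, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0]])
```

For this toy model, flipping either of the first two bits always changes the prediction and flipping the third never does, so the estimate is 2/3.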
The paper concludes that while transformers can be noise‑robust for certain structured Boolean tasks, their inherent preference for simple functions makes them ineffective for learning complex, high‑sensitivity Boolean relationships from noisy features. Mitigating this bias—through sensitivity‑based regularization or alternative architectures—may be necessary for LLMs to succeed on algorithmic or mathematical tasks where training data contain substantial feature noise.