Improving Deep Learning Library Testing with Machine Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Deep Learning (DL) libraries like TensorFlow and PyTorch simplify machine learning (ML) model development but are prone to bugs due to their complex design. Bug-finding techniques exist, but without precise API specifications they produce many false alarms, and existing methods for mining API specifications lack accuracy. We explore using ML classifiers to determine input validity. We hypothesize that tensor shapes are a precise abstraction: they encode concrete inputs while capturing the relationships among the data. Shape abstraction drastically reduces the problem's dimensionality, which is important for making ML training tractable. Labeled data are obtained by observing runtime outcomes on a sample of inputs, and classifiers are trained on these labeled inputs to capture API constraints. Our evaluation, conducted over 183 APIs from TensorFlow and PyTorch, shows that the classifiers generalize well to unseen data with over 91% accuracy. Integrating these classifiers into the pipeline of ACETest, a state-of-the-art bug-finding technique, improves its pass rate from ~29% to ~61%. Our findings suggest that ML-enhanced input classification is an important aid to scaling DL library testing.


💡 Research Summary

Deep learning libraries such as TensorFlow and PyTorch provide a rich set of tensor‑centric APIs, but their internal complexity makes them prone to bugs. Existing fuzzing‑based bug‑finding tools (e.g., ACETest) suffer from low “pass rates” because they generate many inputs that violate hidden pre‑conditions, leading to a typical validity ratio of only about 29 %. This paper proposes a lightweight yet powerful solution: use machine‑learning classifiers trained on a compact abstraction of concrete inputs—namely the shapes (dimensions and sizes) of tensors—to predict whether a generated input will be accepted by the library.
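The shape abstraction described above can be sketched as a small encoding step: each concrete tensor argument is replaced by its shape, and all argument shapes are flattened into one fixed-length feature vector that a classifier can consume. The function name and the padding scheme below are illustrative assumptions, not the paper's actual implementation; only the rank bound of 6 comes from the text.

```python
MAX_RANK = 6  # the paper bounds tensor rank at 6

def encode_shapes(arg_shapes, max_rank=MAX_RANK):
    """Flatten a tuple of tensor shapes into one feature vector.

    Each shape is padded with -1 up to max_rank, so every input tuple
    for a given API maps to a vector of the same fixed length.
    """
    features = []
    for shape in arg_shapes:
        padded = list(shape) + [-1] * (max_rank - len(shape))
        features.extend(padded)
    return features

# e.g. a matmul-style API taking two tensor arguments of shapes
# (3, 4) and (4, 5) becomes a single 12-element vector
vec = encode_shapes([(3, 4), (4, 5)])
```

Because every vector for a given API has the same length, any off-the-shelf classifier over fixed-size numeric features can then be trained to separate accepted from rejected inputs.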

The authors first construct labeled datasets for 183 APIs (98 from PyTorch, 85 from TensorFlow). For each API they generate 10,000 input tuples using two strategies. The "Random" strategy samples each argument type uniformly within predefined bounds (tensor dimensions ≤ 6, each dimension length ∈
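The labeling loop implied by this setup can be sketched as follows: sample random inputs within the stated bounds, execute them against the API, and label each one by its runtime outcome (accepted vs. rejected). Here `call_api` is a hypothetical stand-in for invoking a real TensorFlow/PyTorch operator, and the dimension-length bound of 8 is an illustrative assumption since the source text is truncated at that point.

```python
import random

MAX_RANK = 6  # the paper bounds tensor rank at 6


def random_shape(rng, max_rank=MAX_RANK, max_len=8):
    """Sample a random tensor shape; max_len is an assumed bound."""
    rank = rng.randint(1, max_rank)
    return tuple(rng.randint(1, max_len) for _ in range(rank))


def label_inputs(call_api, n_samples=10_000, seed=0):
    """Return (input, is_valid) pairs: valid iff the call raises nothing."""
    rng = random.Random(seed)
    labeled = []
    for _ in range(n_samples):
        shapes = (random_shape(rng), random_shape(rng))
        try:
            call_api(shapes)
            labeled.append((shapes, True))
        except Exception:
            labeled.append((shapes, False))
    return labeled


# Toy stand-in API with a matmul-like precondition on its two arguments
def toy_matmul(shapes):
    a, b = shapes
    if a[-1] != b[0]:
        raise ValueError("inner dimensions must agree")
```

The resulting labeled pairs are exactly the training data the classifiers need: the shapes become feature vectors and the runtime outcome becomes the class label.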

