Automatic Rule Extraction from Long Short Term Memory Networks
Although deep learning models have proven effective at solving problems in natural language processing, the mechanism by which they come to their conclusions is often unclear. As a result, these models are generally treated as black boxes, yielding no insight into the underlying learned patterns. In this paper we consider Long Short Term Memory networks (LSTMs) and demonstrate a new approach for tracking the importance of a given input to the LSTM for a given output. By identifying consistently important patterns of words, we are able to distill state of the art LSTMs on sentiment analysis and question answering into a set of representative phrases. This representation is then quantitatively validated by using the extracted phrases to construct a simple, rule-based classifier which approximates the output of the LSTM.
💡 Research Summary
The paper addresses the longstanding problem of interpretability in deep neural networks for natural language processing, focusing specifically on Long Short‑Term Memory (LSTM) models. While LSTMs have become a core component of state‑of‑the‑art systems for tasks such as sentiment analysis, machine translation, and question answering, their internal decision‑making process remains opaque. The authors propose a novel method to quantify the contribution of each input word to the final prediction of an LSTM, thereby turning a black‑box model into a set of human‑readable rules.
The technical contribution begins with a mathematical decomposition of the LSTM’s soft‑max output. By expanding the final hidden state h_T into a sum of incremental cell updates, the authors define a multiplicative importance score β_{i,j}=exp(W_i·(o_T⊙Δc_j)), where Δc_j = tanh(c_j) – tanh(c_{j‑1}) captures the direct effect of word j on the cell state, W_i is the class‑specific weight vector, and o_T is the final output gate. This score reflects how much word j pushes the probability of class i upward. However, this formulation ignores the effect of forget gates that can diminish a word’s influence in later time steps, leading to many false positives.
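To make the decomposition concrete, the β computation can be sketched in a few lines of NumPy, assuming the LSTM's cell states, final output gate, and softmax weights have already been collected from a forward pass (all variable names here are illustrative, not the paper's):

```python
import numpy as np

def beta_scores(W_i, o_T, cell_states):
    """Multiplicative importance scores beta_{i,j} for one class i.

    W_i        : (d,) class-specific row of the softmax weight matrix
    o_T        : (d,) final output-gate activations
    cell_states: (T+1, d) cell states c_0..c_T (c_0 assumed zero)
    """
    tanh_c = np.tanh(cell_states)
    # Delta c_j = tanh(c_j) - tanh(c_{j-1}): the direct update word j made
    deltas = tanh_c[1:] - tanh_c[:-1]        # shape (T, d)
    # beta_{i,j} = exp(W_i . (o_T * Delta c_j)), elementwise gate product
    return np.exp(deltas @ (W_i * o_T))      # shape (T,)
```

Because the Δc_j terms telescope, the product of all β_{i,j} recovers exp(W_i·(o_T⊙tanh(c_T))), the unnormalized class score, which is a useful sanity check on any implementation.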
To remedy this, the authors introduce an additive decomposition that incorporates the product of forget gates between the word’s position and the final time step. The refined score γ_{i,j} = exp(W_i·(o_T ⊙ (tanh((∏_{k=j+1}^{T} f_k) ⊙ c_j) − tanh((∏_{k=j}^{T} f_k) ⊙ c_{j−1})))) captures both the immediate update and the downstream attenuation caused by the forget mechanism. Empirically, γ yields cleaner importance maps than β.
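The γ variant differs from β only in that each cell state is first attenuated by the running product of later forget gates. A minimal sketch, with the same illustrative tensor conventions as above:

```python
import numpy as np

def gamma_scores(W_i, o_T, cell_states, forget_gates):
    """Additive importance scores gamma_{i,j} with forget-gate attenuation.

    forget_gates: (T, d) gate activations f_1..f_T in (0, 1)
    Other arguments as in the beta sketch; names are illustrative.
    """
    T, d = forget_gates.shape
    # suffix[j] = f_{j+1} * ... * f_T (elementwise), with suffix[T] = 1
    suffix = np.ones((T + 1, d))
    for k in range(T - 1, -1, -1):
        suffix[k] = suffix[k + 1] * forget_gates[k]
    scores = np.empty(T)
    for j in range(1, T + 1):
        # forget gates *after* word j applied to c_j,
        # forget gates from word j onward applied to c_{j-1}
        a = np.tanh(suffix[j] * cell_states[j])
        b = np.tanh(suffix[j - 1] * cell_states[j - 1])
        scores[j - 1] = np.exp(W_i @ (o_T * (a - b)))
    return scores
```

The same telescoping sanity check applies: the product over j of γ_{i,j} again equals exp(W_i·(o_T⊙tanh(c_T))) when c_0 = 0, since suffix[T] is the identity.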
With these word‑level scores in hand, the paper describes a phrase‑extraction pipeline. For any candidate phrase w_1,…,w_k, the average contribution to each class is computed, yielding scores S_1 and S_2. The overall score S = max(S_1, S_2) and the associated class C = argmax_i S_i are used to rank phrases. Because enumerating all possible n‑grams is infeasible, the authors first collect candidate phrases consisting of consecutive words whose importance score exceeds a threshold (c = 1). They then re‑score this reduced set using the S metric, effectively performing a brute‑force search over a manageable subset.
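One plausible reading of this pipeline, sketched below under the assumption that S_k is the average log-importance of the phrase's words for class k (the exact aggregation is the paper's; the function and its signature here are hypothetical):

```python
import numpy as np

def extract_candidates(scores_1, scores_2, tokens, c=1.0):
    """Collect runs of consecutive words whose importance exceeds c for
    either class, score each run by S = max(S_1, S_2) with S_k the mean
    log-score for class k, and return (S, class, phrase) tuples ranked
    by S in descending order.
    """
    log1, log2 = np.log(scores_1), np.log(scores_2)
    hot = np.maximum(scores_1, scores_2) > c   # words worth keeping
    ranked, start = [], None
    for i, h in enumerate(list(hot) + [False]):  # sentinel flushes last run
        if h and start is None:
            start = i
        elif not h and start is not None:
            s1, s2 = log1[start:i].mean(), log2[start:i].mean()
            phrase = " ".join(tokens[start:i])
            ranked.append((max(s1, s2), 1 if s1 >= s2 else 2, phrase))
            start = None
    return sorted(ranked, reverse=True)
```

On a toy input such as per-word γ scores [2.0, 3.0, 0.5, 0.1] versus [0.5, 0.4, 0.3, 0.2], this keeps the single run "very good" and assigns it class 1, matching the intuition that only consistently high-scoring runs survive.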
The extracted phrases are then fed into a simple rule‑based classifier. Given a document, the classifier scans the phrase list in descending order of S; the first matching phrase determines the predicted class via its associated C. This classifier is fully interpretable, requires only string matching, and retains a substantial portion of the original LSTM’s accuracy.
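The first-match rule described above reduces to a few lines of plain Python; a minimal sketch, using naive substring matching where a real system would match on tokens (the fallback class is an assumption for illustration):

```python
def rule_classify(document, rules, default=1):
    """First-match rule classifier: scan phrases in descending score
    order; the first phrase found in the document decides the class.

    rules: iterable of (score, cls, phrase) tuples, e.g. from the
    phrase-extraction step.
    """
    for _score, cls, phrase in sorted(rules, reverse=True):
        if phrase in document:  # naive substring match for the sketch
            return cls
    return default  # fall back when no rule fires
```

Because classification is a single ordered scan over a phrase list, the resulting model is trivially auditable: every prediction can be traced to exactly one matching phrase.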
Experiments are conducted on three benchmarks. For sentiment analysis, the authors use the large‑scale Yelp polarity dataset (560 k training examples) and the Stanford Sentiment Treebank (sentence‑level). Their LSTM baselines achieve 95.3 % (Yelp) and 87.3 % (SST) accuracy, comparable to state‑of‑the‑art. Using the cell‑decomposition pattern matching, they obtain 86.5 % and 76.2 % respectively—significantly higher than a gradient‑based importance method (65.0 % and 68.0 %). The gap widens on longer documents (Yelp), where the cell‑difference patterns better suppress spurious signals.
The second domain is WikiMovies, a question‑answering dataset where each question must be answered by locating an entity within a set of Wikipedia articles. The authors adapt their method by conditioning the document LSTM on the question embedding (adding the question vector to the gate equations). Each candidate entity is treated as a separate binary classification problem, and phrase extraction is restricted to n‑grams that end at the entity token. The resulting rule‑based system, using cell‑decomposition patterns, achieves 74.3 % hit@1, outperforming a gradient‑based approach (57.4 %) and approaching the performance of a more complex key‑value memory network (68.3 %/76.2 %) while remaining far simpler.
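One plausible form of "adding the question vector to the gate equations" is to give every gate an extra linear term in a fixed question embedding q; the sketch below assumes that form, and all parameter names and shapes are illustrative, not the paper's:

```python
import numpy as np

def gates_with_question(x_t, h_prev, q, params):
    """LSTM gate pre-activations augmented with a question embedding q.

    Each gate sees W x_t + U h_prev + Q q + b, where the Q terms are the
    hypothetical question-conditioning weights.
    params: dict mapping e.g. "W_i", "U_i", "Q_i", "b_i" to arrays.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def pre(name):
        return (params[f"W_{name}"] @ x_t + params[f"U_{name}"] @ h_prev
                + params[f"Q_{name}"] @ q + params[f"b_{name}"])

    i = sigmoid(pre("i"))   # input gate
    f = sigmoid(pre("f"))   # forget gate
    o = sigmoid(pre("o"))   # output gate
    g = np.tanh(pre("g"))   # candidate cell update
    return i, f, o, g
```

Because q is constant across time steps, the β/γ decompositions above carry over unchanged: the question simply shifts every gate pre-activation, and word-level importance is still read off the cell-state differences.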
Overall, the paper makes three key contributions: (1) a mathematically grounded decomposition of LSTM outputs into word‑level importance scores that respect both immediate updates and forget‑gate dynamics; (2) an efficient algorithm for extracting high‑impact phrases based on these scores; and (3) a demonstration that a lightweight, fully interpretable rule‑based classifier built from the extracted phrases can closely approximate the predictions of a deep LSTM across diverse NLP tasks. By bridging the gap between high performance and model transparency, this work opens avenues for deploying neural models in settings where interpretability, debugging, and computational efficiency are paramount.